nucleotideFrequency {Biostrings} | R Documentation |
Given a DNA or RNA sequence (or a set of DNA or RNA sequences),
the oligonucleotideFrequency
function computes the frequency
of all possible oligonucleotides of a given length (called the "width"
in this particular context).
The dinucleotideFrequency
and trinucleotideFrequency
functions are convenient wrappers for calling oligonucleotideFrequency
with width=2
and width=3
, respectively.
The nucleotideFrequencyAt
function computes the frequency
of the short sequences formed by extracting the nucleotides found
at some fixed positions from each sequence of a set of DNA or RNA
sequences.
In this man page we call "DNA input" (or "RNA input") an XString, XStringSet, XStringViews or MaskedXString object of base type DNA (or RNA).
oligonucleotideFrequency(x, width, as.prob=FALSE, as.array=FALSE, fast.moving.side="right", with.labels=TRUE, ...) ## S4 method for signature 'XStringSet' oligonucleotideFrequency(x, width, as.prob=FALSE, as.array=FALSE, fast.moving.side="right", with.labels=TRUE, simplify.as="matrix") dinucleotideFrequency(x, as.prob=FALSE, as.matrix=FALSE, fast.moving.side="right", with.labels=TRUE, ...) trinucleotideFrequency(x, as.prob=FALSE, as.array=FALSE, fast.moving.side="right", with.labels=TRUE, ...) nucleotideFrequencyAt(x, at, as.prob=FALSE, as.array=TRUE, fast.moving.side="right", with.labels=TRUE, ...) ## Some related functions: oligonucleotideTransitions(x, left=1, right=1, as.prob=FALSE) mkAllStrings(alphabet, width, fast.moving.side="right")
x |
Any DNA or RNA input for the An XStringSet or XStringViews object of base type DNA or RNA
for |
width |
The number of nucleotides per oligonucleotide for
The number of letters per string for |
at |
An integer vector containing the positions to look at in each element
of |
as.prob |
If |
as.array,as.matrix |
Controls the "shape" of the returned object.
If |
fast.moving.side |
Which side of the strings should move fastest?
Note that, when |
with.labels |
If |
... |
Further arguments to be passed to or from other methods. |
simplify.as |
Together with the |
left, right |
The number of nucleotides per oligonucleotide for the rows
and columns respectively in the transition matrix created
by |
alphabet |
The alphabet to use to make the strings. |
If x
is an XString or MaskedXString object,
the *Frequency
functions return a numeric vector of length
4^width
. If as.array
(or as.matrix
) is TRUE
,
then this vector is formatted as an array (or matrix).
If x
is an XStringSet or XStringViews object,
the returned object has the shape specified by the simplify.as
argument.
H. Pages and P. Aboyoun
alphabetFrequency
,
alphabet
,
hasLetterAt
,
XString-class,
XStringSet-class,
XStringViews-class,
MaskedXString-class,
GENETIC_CODE
,
AMINO_ACID_CODE
,
reverseComplement
,
rev
## --------------------------------------------------------------------- ## A. BASIC *Frequency() EXAMPLES ## --------------------------------------------------------------------- data(yeastSEQCHR1) yeast1 <- DNAString(yeastSEQCHR1) dinucleotideFrequency(yeast1) trinucleotideFrequency(yeast1) oligonucleotideFrequency(yeast1, 4) ## Get the less and most represented 6-mers: f6 <- oligonucleotideFrequency(yeast1, 6) f6[f6 == min(f6)] f6[f6 == max(f6)] ## Get the result as an array: tri <- trinucleotideFrequency(yeast1, as.array=TRUE) tri["A", "A", "C"] # == trinucleotideFrequency(yeast1)["AAC"] tri["T", , ] # frequencies of trinucleotides starting with a "T" ## With input made of multiple sequences: library(drosophila2probe) probes <- DNAStringSet(drosophila2probe) dfmat <- dinucleotideFrequency(probes) # a big matrix dinucleotideFrequency(probes, simplify.as="collapsed") dinucleotideFrequency(probes, simplify.as="collapsed", as.matrix=TRUE) ## --------------------------------------------------------------------- ## B. OBSERVED DINUCLEOTIDE FREQUENCY VERSUS EXPECTED DINUCLEOTIDE ## FREQUENCY ## --------------------------------------------------------------------- ## The expected frequency of dinucleotide "ab" based on the frequencies ## of its individual letters "a" and "b" is: ## exp_Fab = Fa * Fb / N if the 2 letters are different (e.g. CG) ## exp_Faa = Fa * (Fa-1) / N if the 2 letters are the same (e.g. TT) ## where Fa and Fb are the frequencies of "a" and "b" (respectively) and ## N the length of the sequence. ## Here is a simple function that implements the above formula for a ## DNAString object 'x'. The expected frequencies are returned in a 4x4 ## matrix where the rownames and colnames correspond to the 1st and 2nd ## base in the dinucleotide: expectedDinucleotideFrequency <- function(x) { # Individual base frequencies. bf <- alphabetFrequency(x, baseOnly=TRUE)[DNA_BASES] (as.matrix(bf) %*% t(bf) - diag(bf)) / length(x) } ## On Celegans chrI: library(BSgenome.Celegans.UCSC.ce2) chrI <- Celegans$chrI obs_df <- dinucleotideFrequency(chrI, as.matrix=TRUE) obs_df # CG has the lowest frequency exp_df <- expectedDinucleotideFrequency(chrI) ## A sanity check: stopifnot(as.integer(sum(exp_df)) == sum(obs_df)) ## Ratio of observed frequency to expected frequency: obs_df / exp_df # TA has the lowest ratio, not CG! ## --------------------------------------------------------------------- ## C. nucleotideFrequencyAt() ## --------------------------------------------------------------------- nucleotideFrequencyAt(probes, 13) nucleotideFrequencyAt(probes, c(13, 20)) nucleotideFrequencyAt(probes, c(13, 20), as.array=FALSE) ## nucleotideFrequencyAt() can be used to answer questions like: "how ## many probes in the drosophila2 chip have T, G, T, A at position ## 2, 4, 13 and 20, respectively?" nucleotideFrequencyAt(probes, c(2, 4, 13, 20))["T", "G", "T", "A"] ## or "what's the probability to have an A at position 25 if there is ## one at position 13?" nf <- nucleotideFrequencyAt(probes, c(13, 25)) sum(nf["A", "A"]) / sum(nf["A", ]) ## Probabilities to have other bases at position 25 if there is an A ## at position 13: sum(nf["A", "C"]) / sum(nf["A", ]) # C sum(nf["A", "G"]) / sum(nf["A", ]) # G sum(nf["A", "T"]) / sum(nf["A", ]) # T ## See ?hasLetterAt for another way to get those results. ## --------------------------------------------------------------------- ## D. oligonucleotideTransitions() ## --------------------------------------------------------------------- ## Get nucleotide transition matrices for yeast1 oligonucleotideTransitions(yeast1) oligonucleotideTransitions(yeast1, 2, as.prob=TRUE) ## --------------------------------------------------------------------- ## E. ADVANCED *Frequency() EXAMPLES ## --------------------------------------------------------------------- ## Note that when dropping the dimensions of the 'tri' array, elements ## in the resulting vector are ordered as if they were obtained with ## 'fast.moving.side="left"': triL <- trinucleotideFrequency(yeast1, fast.moving.side="left") all(as.vector(tri) == triL) # TRUE ## Convert the trinucleotide frequency into the amino acid frequency ## based on translation: tri1 <- trinucleotideFrequency(yeast1) names(tri1) <- GENETIC_CODE[names(tri1)] sapply(split(tri1, names(tri1)), sum) # 12512 occurrences of the stop codon ## When the returned vector is very long (e.g. width >= 10), using ## 'with.labels=FALSE' can improve performance significantly. ## Here for example, the observed speed up is between 25x and 500x: f12 <- oligonucleotideFrequency(yeast1, 12, with.labels=FALSE) # very fast! ## Spome related functions: dict1 <- mkAllStrings(LETTERS[1:3], 4) dict2 <- mkAllStrings(LETTERS[1:3], 4, fast.moving.side="left") stopifnot(identical(reverse(dict1), dict2))