The purpose of the
bioseq package is to provide a collection of classes and functions for biological sequence manipulation in R. This vignette will introduce you to the basics of the package so you can get an overview of its functionnalities and start to use it rapidly.
It is assumed that you already installed the package either from CRAN using the function
install.packages("bioseq) or from GitHub using
remotes::install_github("fkeck/bioseq"). Now, let’s get started by loading the package.
One of the core functionnality of
bioseq is to provide vector classes to store DNA, RNA and amino acid sequences. We create our first DNA sequence vector using the function
is_dna() is useful to test if an object is a DNA vector. The print method nicely indicates that
x is a DNA vector of 3 sequences. Note that contrary to the standard print methods for R vectors, each element is printed on its own line, next to its (optional) name. Apart from this, a DNA vector behave very like a
character vector. For example we can select and reorder elements by name, logical or index:
However, the key difference between a DNA vector and a character vector is that DNA uses a restricted alphabet. For DNA this alphabet is A, C, G, T, W, S, M, K, R, Y, B, D, H, V, N and -, which correpond to the IUPAC symbols for DNA nucleotides. What happens if you include a forbidden character in a sequence?
Here we included two forbidden characters (? and F). Both are automatically converted to a N (which stands for any nucleotide). Forbidden characters often mean that something went wrong somewhere, so the function warns the user. Additionally we included a lowercase symbol (c) which is automatically and silently converted to uppercase. This mechanism guarantees that DNA objects contain only uppercase characters.
RNA and amino acid sequences can be constructed just like DNA using
aa() functions. It is possible to check the allowed alphabet for each type of sequences by typing
?alphabets in the console.
In living organisms, DNA is typically transcribed to RNA which is translated to a proteic sequence. Similarly, conversion among sequence classes can be achieved using the
During transcription thymine is simply replaced by uracil. The translation decodes the RNA sequence into amino acids using the standard genetic code. Non standard genetic codes are also available for translation (see the help
?seq_translate). The reverse transcription can be achieved using the function
seq_rev_transcribe(). The reverse translation is biologically not possible but is implemented in the function
seq_rev_translate(). Because of the degeneracy of the genetic code, the reverse translation typically produces many ambiguous bases.
Finally, it is often useful to compute the complement and the reverse complement of DNA and RNA sequences. this can be achieved using the functions
bioseq package comes with numerous functions to perform string operations at the sequence level. We will not review the complete list of functions provided by the package, but we will see below how use some of them.
We will take a simple example with 3 sequences. The first two sequences have four A repeated. We will focus on this particular pattern.
Let’s start by selecting only the sequences that match the pattern. This can be easily achieved by combining
seq_detect_pattern with the
When using a simple character vector as pattern, the pattern is evaluated as a regular expression. This means that you can perform very complex queries using the regular expression syntax. Regular expressions are beyond the scope of this vignette but if you are interested to learn more the String chapter from the book R for Data Science by Grolemund and Wickham is a good place to get started.
As an example to illustrate regex support, the same pattern (AAAA) could be also formulated:
Alternatively, a biological sequence (i.e a DNA, RNA or AA vector) can be used as pattern. This is less flexible than regular expression but can present several advantages. First it is safer and clearer because it forces the user to be more specific. Second, it allows to deal with ambiguous characters.
However it is important to remember that a pattern which contains ambiguous characters is less specific and can capture several strings. How many and which ones? This can be answered using the function
If the AAAA pattern is an incorrect insertion, we may want to remove it from the sequences. This can be done with the function
We can also replace a specific pattern with another sequence.
So far we performed operations using pattern recognition (functions with prefix
seq_ and suffix
_pattern). Several operations (remove, replace, extract and crop) can also be applied to a specific region delimited by to positions (in and out). This is typically more useful with aligned sequences.
Instead of removing a pattern it is possible to remove specific region by providing to positions.
For example if we want to replace the last 3 nucleotides with CCC:
It is important to know that patterns, positions and replacements are recycled along the sequences (usually the
x argument). This means that if a pattern (vector or list), a position or a replacement is of length > 1, it will be replicated until it is the same length as x. This is powerful but it must be used with caution. The exemple below show an exemple with a vector of position (in) and a vector of replacement.