This vignette demonstrates the use of the package vivaldi to analyze viral single nucleotide variants (SNVs) from Illumina sequencing data. The vivaldi package provides tools for visualizing and summarizing allele frequency information, and genetic diversity, from heterogeneous viral samples.
The vivaldi package is available on CRAN and is
loaded as follows. We also use dplyr
for data wrangling and
kableExtra
for rendering tables in vignette
library(vivaldi)
library(kableExtra)
The data used in this vignette are simulated mixed influenza samples with simulated variants at random frequencies. VCF files were generated using the MAD2 genome alignment and variation calling pipeline. VCF files were generated using the variant callers timo and iVar; however, vivaldi functions are designed to work with all variant callers.
The simulated data comprise 12 samples that were sequenced in technical replicate resulting in a total of 24 VCF files. The VCF files were annotated using SNPeff.
The data used in this vignette is included with the package. To use your own data, users should set the path to those files.
vardir = system.file("extdata", "vcfs", package="vivaldi")
Metadata used for the calculations includes a .csv
file
containing the length of each of the viral segments. For non-segmented
viruses, simply report the whole length of the genome. Users should also
set a variable with the total length of the viral genome (i.e. the sum
of the segment lengths).
seg_sizes = system.file("extdata", "SegmentSize.csv", package="vivaldi")
sizes = read.csv(file=seg_sizes,header=T,sep=",",na.strings = c(''))
#select only the relevant segment sizes for H1N1
sizes = dplyr::filter(sizes, STRAIN == "H1N1")
genome_size = 13133
The user should also supply a .csv
file with sample
information containing 3 columns:
rep_info = system.file("extdata", "reps.csv", package="vivaldi")
replicates = read.csv(file = rep_info, header = T, sep = ",", na.strings = c(""))
kable(head(replicates))
filename | replicate | sample |
---|---|---|
H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep1 | a_1_fb |
H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_ivar_custom.ann | rep1 | a_1_iv |
H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_varscan_custom.ann | rep1 | a_1_vs |
H1N1-random3_m1_AF_random-a2_frac_0.01_BWA_freebayes_custom.ann | rep1 | a_2_fb |
H1N1-random3_m1_AF_random-a2_frac_0.01_BWA_ivar_custom.ann | rep1 | a_2_iv |
H1N1-random3_m1_AF_random-a2_frac_0.01_BWA_varscan_custom.ann | rep1 | a_2_vs |
dim(replicates)
#> [1] 24 3
The first step is performed using arrange_data()
to load
the VCF files into R, extract the important information. and arrange
data as a tidy dataframe. The dataframe contains the sample name, pulled
from the VCF file name, and information about the reference and
alternative allele. We recommend that the VCF files have been annotated
using SNPeff. However, if the VCF has not been annotated using SNPeff,
the user should specify annotated = "no"
.
Once the data has been arranged we can see that there are a total of 2,493 variants across the 24 samples. Each sample has between 95 - 110 variants.
VCF_DF = arrange_data(vardir, ref = system.file("extdata", "H1N1.fa", package="vivaldi"), annotated = 'yes')
kable(head(VCF_DF))
sample | CHROM | POS | REF | ALT | ANN | gt_DP | REF_COUNT | ALT_COUNT | REF_FREQ | ALT_FREQ | ALT_TYPE | major | minor | majorcount | minorcount | majorfreq | minorfreq |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | H1N1_HA | 1007 | G | A | A|missense_variant|MODERATE|CDS_H1N1_HA_1_1701|H1N1_HA|transcript|H1N1_HA.1|protein_coding|1/1|c.1007G>A|p.Arg336Lys|1007/1701|1007/1701|336/566|| | 834 | 796 | 38 | 0.9544365 | 0.0455635 | minor | G | A | 796 | 38 | 0.9544365 | 0.0455635 |
H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | H1N1_HA | 1145 | T | A | A|missense_variant|MODERATE|CDS_H1N1_HA_1_1701|H1N1_HA|transcript|H1N1_HA.1|protein_coding|1/1|c.1145T>A|p.Leu382Gln|1145/1701|1145/1701|382/566|| | 770 | 740 | 30 | 0.9610390 | 0.0389610 | minor | T | A | 740 | 30 | 0.9610390 | 0.0389610 |
H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | H1N1_HA | 1293 | T | C | C|synonymous_variant|LOW|CDS_H1N1_HA_1_1701|H1N1_HA|transcript|H1N1_HA.1|protein_coding|1/1|c.1293T>C|p.Gly431Gly|1293/1701|1293/1701|431/566|| | 571 | 554 | 17 | 0.9702277 | 0.0297723 | minor | T | C | 554 | 17 | 0.9702277 | 0.0297723 |
H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | H1N1_HA | 1319 | C | T | T|missense_variant|MODERATE|CDS_H1N1_HA_1_1701|H1N1_HA|transcript|H1N1_HA.1|protein_coding|1/1|c.1319C>T|p.Ala440Val|1319/1701|1319/1701|440/566|| | 587 | 580 | 7 | 0.9880750 | 0.0119250 | minor | C | T | 580 | 7 | 0.9880750 | 0.0119250 |
H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | H1N1_HA | 1386 | A | G | G|synonymous_variant|LOW|CDS_H1N1_HA_1_1701|H1N1_HA|transcript|H1N1_HA.1|protein_coding|1/1|c.1386A>G|p.Leu462Leu|1386/1701|1386/1701|462/566|| | 552 | 545 | 7 | 0.9873188 | 0.0126812 | minor | A | G | 545 | 7 | 0.9873188 | 0.0126812 |
H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | H1N1_HA | 1505 | A | G | G|missense_variant|MODERATE|CDS_H1N1_HA_1_1701|H1N1_HA|transcript|H1N1_HA.1|protein_coding|1/1|c.1505A>G|p.Asp502Gly|1505/1701|1505/1701|502/566|| | 386 | 380 | 5 | 0.9844560 | 0.0129534 | minor | A | G | 380 | 5 | 0.9844560 | 0.0129534 |
VCF_DF %>%
dplyr::group_by(sample) %>%
dplyr::summarise(n = dplyr::n()) %>%
kable()
sample | n |
---|---|
H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | 105 |
H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_ivar_custom.ann | 107 |
H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_varscan_custom.ann | 104 |
H1N1-random3_m1_AF_random-a2_frac_0.01_BWA_freebayes_custom.ann | 107 |
H1N1-random3_m1_AF_random-a2_frac_0.01_BWA_ivar_custom.ann | 109 |
H1N1-random3_m1_AF_random-a2_frac_0.01_BWA_varscan_custom.ann | 105 |
H1N1-random3_m1_AF_random-a3_frac_0.01_BWA_freebayes_custom.ann | 108 |
H1N1-random3_m1_AF_random-a3_frac_0.01_BWA_ivar_custom.ann | 110 |
H1N1-random3_m1_AF_random-a3_frac_0.01_BWA_varscan_custom.ann | 108 |
H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | 103 |
H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_ivar_custom.ann | 104 |
H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_varscan_custom.ann | 103 |
H1N1-random3_m2_AF_random-a2_frac_0.01_BWA_freebayes_custom.ann | 102 |
H1N1-random3_m2_AF_random-a2_frac_0.01_BWA_ivar_custom.ann | 103 |
H1N1-random3_m2_AF_random-a2_frac_0.01_BWA_varscan_custom.ann | 102 |
H1N1-random3_m2_AF_random-a3_frac_0.01_BWA_freebayes_custom.ann | 105 |
H1N1-random3_m2_AF_random-a3_frac_0.01_BWA_ivar_custom.ann | 105 |
H1N1-random3_m2_AF_random-a3_frac_0.01_BWA_varscan_custom.ann | 101 |
H1N1-random4_m1_AF_random-a4_frac_0.01_BWA_freebayes_custom.ann | 95 |
H1N1-random4_m1_AF_random-a4_frac_0.01_BWA_ivar_custom.ann | 101 |
H1N1-random4_m1_AF_random-a4_frac_0.01_BWA_varscan_custom.ann | 100 |
H1N1-random4_m2_AF_random-a4_frac_0.01_BWA_freebayes_custom.ann | 97 |
H1N1-random4_m2_AF_random-a4_frac_0.01_BWA_ivar_custom.ann | 106 |
H1N1-random4_m2_AF_random-a4_frac_0.01_BWA_varscan_custom.ann | 103 |
dim(VCF_DF)
#> [1] 2493 18
The VCF_DF dataframe contains all information from the VCF files. The 18 columns are:
If working with a segmented genome, users should provide a vector containing the names of the segments in the desired order for plotting.
SEGMENTS = c("H1N1_PB2","H1N1_PB1","H1N1_PA","H1N1_HA","H1N1_NP","H1N1_NA","H1N1_MP","H1N1_NS")
VCF_DF$CHROM = factor(VCF_DF$CHROM, levels = SEGMENTS)
Performing technical replicates of library preparation and DNA sequencing improves the accuracy of variant calling as sequencing or RT-PCR errors are unlikely to occur at the same site in independent experiments.
For this function, the user must provide the variant dataframe
(i.e. the output of the arrange_data() function
) and the
replicates dataframe. Additional information that is required is: * the
exact name used for replicates 1 and 2 from the replicates dataframe
(e.g. “Rep1”, “r1”, “rep1”, etc), which must be provided in quotes * a
vector containing the names of the columns that should be used to merge
the two dataframes. This vector must contain columns that have identical
values for the two replicates, such as the segment and nucleotide
position of the variant. It should not contain replicate-specific
information such as the allele frequency of the variant.
The merge_replicates()
functions generates a dataframe
with all variants that are found in both sequencing replicates. It
excludes variants that are only found in one replicate.
In the merged dataframe information from replicate 1 is denoted with
a .x
suffix. Information from replicate 2 is denoted with a
.y
suffix.
In addition, average frequencies are computed from the two replicates.
The dataframe now contains the 12 unique samples with between 82 - 99 variants. These variants were detected in both sequencing replicates whereas variants in only a single replicate were removed during the merge.
cols = c("sample","CHROM","POS","REF","ALT","ANN","ALT_TYPE","major","minor")
DF_reps = merge_replicates(VCF_DF,replicates,"rep1","rep2",cols)
kable(head(DF_reps))
sample | CHROM | POS | REF | ALT | ANN | ALT_TYPE | major | minor | filename.x | replicate.x | gt_DP.x | REF_COUNT.x | ALT_COUNT.x | REF_FREQ.x | ALT_FREQ.x | majorcount.x | minorcount.x | majorfreq.x | minorfreq.x | filename.y | replicate.y | gt_DP.y | REF_COUNT.y | ALT_COUNT.y | REF_FREQ.y | ALT_FREQ.y | majorcount.y | minorcount.y | majorfreq.y | minorfreq.y | minorfreq | majorfreq | weighted_minorfreq | weighted_majorfreq |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
a_1_fb | H1N1_HA | 1007 | G | A | A|missense_variant|MODERATE|CDS_H1N1_HA_1_1701|H1N1_HA|transcript|H1N1_HA.1|protein_coding|1/1|c.1007G>A|p.Arg336Lys|1007/1701|1007/1701|336/566|| | minor | G | A | H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep1 | 834 | 796 | 38 | 0.9544365 | 0.0455635 | 796 | 38 | 0.9544365 | 0.0455635 | H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep2 | 816 | 780 | 35 | 0.9558824 | 0.0428922 | 780 | 35 | 0.9558824 | 0.0428922 | 0.0442279 | 0.9551594 | 0.0442424 | 0.9551515 |
a_1_fb | H1N1_HA | 1145 | T | A | A|missense_variant|MODERATE|CDS_H1N1_HA_1_1701|H1N1_HA|transcript|H1N1_HA.1|protein_coding|1/1|c.1145T>A|p.Leu382Gln|1145/1701|1145/1701|382/566|| | minor | T | A | H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep1 | 770 | 740 | 30 | 0.9610390 | 0.0389610 | 740 | 30 | 0.9610390 | 0.0389610 | H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep2 | 810 | 781 | 29 | 0.9641975 | 0.0358025 | 781 | 29 | 0.9641975 | 0.0358025 | 0.0373818 | 0.9626182 | 0.0373418 | 0.9626582 |
a_1_fb | H1N1_HA | 1386 | A | G | G|synonymous_variant|LOW|CDS_H1N1_HA_1_1701|H1N1_HA|transcript|H1N1_HA.1|protein_coding|1/1|c.1386A>G|p.Leu462Leu|1386/1701|1386/1701|462/566|| | minor | A | G | H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep1 | 552 | 545 | 7 | 0.9873188 | 0.0126812 | 545 | 7 | 0.9873188 | 0.0126812 | H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep2 | 529 | 521 | 8 | 0.9848771 | 0.0151229 | 521 | 8 | 0.9848771 | 0.0151229 | 0.0139020 | 0.9860980 | 0.0138760 | 0.9861240 |
a_1_fb | H1N1_HA | 410 | C | T | T|missense_variant|MODERATE|CDS_H1N1_HA_1_1701|H1N1_HA|transcript|H1N1_HA.1|protein_coding|1/1|c.410C>T|p.Thr137Ile|410/1701|410/1701|137/566|| | minor | C | T | H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep1 | 734 | 720 | 14 | 0.9809264 | 0.0190736 | 720 | 14 | 0.9809264 | 0.0190736 | H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep2 | 773 | 756 | 17 | 0.9780078 | 0.0219922 | 756 | 17 | 0.9780078 | 0.0219922 | 0.0205329 | 0.9794671 | 0.0205707 | 0.9794293 |
a_1_fb | H1N1_HA | 435 | G | A | A|synonymous_variant|LOW|CDS_H1N1_HA_1_1701|H1N1_HA|transcript|H1N1_HA.1|protein_coding|1/1|c.435G>A|p.Ser145Ser|435/1701|435/1701|145/566|| | minor | G | A | H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep1 | 749 | 726 | 23 | 0.9692924 | 0.0307076 | 726 | 23 | 0.9692924 | 0.0307076 | H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep2 | 790 | 766 | 22 | 0.9696203 | 0.0278481 | 766 | 22 | 0.9696203 | 0.0278481 | 0.0292779 | 0.9694563 | 0.0292398 | 0.9694607 |
a_1_fb | H1N1_HA | 532 | C | T | T|missense_variant|MODERATE|CDS_H1N1_HA_1_1701|H1N1_HA|transcript|H1N1_HA.1|protein_coding|1/1|c.532C>T|p.Leu178Phe|532/1701|532/1701|178/566|| | minor | C | T | H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep1 | 756 | 743 | 13 | 0.9828042 | 0.0171958 | 743 | 13 | 0.9828042 | 0.0171958 | H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep2 | 784 | 763 | 21 | 0.9732143 | 0.0267857 | 763 | 21 | 0.9732143 | 0.0267857 | 0.0219907 | 0.9780093 | 0.0220779 | 0.9779221 |
DF_reps %>%
dplyr::group_by(sample) %>%
dplyr::summarise(n = dplyr::n()) %>%
kable()
sample | n |
---|---|
a_1_fb | 92 |
a_1_iv | 95 |
a_1_vs | 94 |
a_2_fb | 93 |
a_2_iv | 97 |
a_2_vs | 93 |
a_3_fb | 96 |
a_3_iv | 99 |
a_3_vs | 93 |
b_1_fb | 82 |
b_1_iv | 92 |
b_1_vs | 91 |
dim(DF_reps)
#> [1] 1117 35
Visual analysis of the variation in allele frequency estimations from the two sequencing experiments is informative about the technical variation in allele frequency estimation.
Here we generate a plot comparing the allele frequency of variants between replicates 1 and 2 without removing variants that were only found in one replicate. These are clearly observed as variants that have zero values along the x or y axes.
df = merge(replicates,VCF_DF, by.x = c("filename"), by.y = c("sample"))
df_rep1 = dplyr::filter(df, replicate == "rep1")
df_rep2 = dplyr::filter(df, replicate == "rep2")
df_merged_keep = merge(df_rep1, df_rep2, by = cols, all = TRUE)
df_merged_keep = df_merged_keep[!duplicated(df_merged_keep), ]
df_merged_keep$minorfreq.x[is.na(df_merged_keep$minorfreq.x)] = 0
df_merged_keep$minorfreq.y[is.na(df_merged_keep$minorfreq.y)] = 0
ggplot2::ggplot(df_merged_keep, ggplot2::aes(x = minorfreq.x, y = minorfreq.y)) +
ggplot2::geom_point()
Here, we generate the same plot from the merged dataframe with removing variants that were only found in one replicate. This serves as a visual check that these variants are removed from downstream analyses.
ggplot2::ggplot(DF_reps, ggplot2::aes(x = minorfreq.x, y = minorfreq.y)) +
ggplot2::geom_point()
#> Warning: Removed 371 rows containing missing values (`geom_point()`).
merge_replicates()
calculates two average allele
frequency values: 1) the average allele frequency and 2) the weighted
average allele frequency, for the major and minor allele.
The average and the weighted average should be similar unless one of the replicates has significantly higher coverage per sample than the other. In this case we can see that these two average estimated are very similar. However, in the case of variable sequence depth between the replicates a weighted average may be more appropriate.
ggplot2::ggplot(DF_reps, ggplot2::aes(x = minorfreq, y = weighted_minorfreq)) + ggplot2::geom_point()
#> Warning: Removed 371 rows containing missing values (`geom_point()`).
This function filters the variants by frequency or coverage of the ALT allele. Based on a benchmarking study performed by our lab, when using replicate sequencing data we recommend removing variants with less than 1% allele frequency and not imposing a sequence coverage cutoff.
Imposing a 1% allele frequency threshold results in 735 SNVs being retained.
For sequencing data without replicates, a much more stringent filtering step is recommended to ensure confidence in the variants called. We suggest a 3% allele frequency and 200x coverage for data without replicates. If data includes replicates, the function checks to make sure each replicate passes the cutoffs.
# Default coverage (200) and frequency (0.03) cutoffs
#DF_filt = filter_variants(DF_reps)
# To run with custom values, specify these in the function
DF_filt = filter_variants(DF_reps, coverage_cutoff = 0, frequency_cutoff = 0.01 )
#> Total number of SNP filtered out: 382
kable(head(DF_filt))
sample | CHROM | POS | REF | ALT | ANN | ALT_TYPE | major | minor | filename.x | replicate.x | gt_DP.x | REF_COUNT.x | ALT_COUNT.x | REF_FREQ.x | ALT_FREQ.x | majorcount.x | minorcount.x | majorfreq.x | minorfreq.x | filename.y | replicate.y | gt_DP.y | REF_COUNT.y | ALT_COUNT.y | REF_FREQ.y | ALT_FREQ.y | majorcount.y | minorcount.y | majorfreq.y | minorfreq.y | minorfreq | majorfreq | weighted_minorfreq | weighted_majorfreq |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
a_1_fb | H1N1_HA | 1007 | G | A | A|missense_variant|MODERATE|CDS_H1N1_HA_1_1701|H1N1_HA|transcript|H1N1_HA.1|protein_coding|1/1|c.1007G>A|p.Arg336Lys|1007/1701|1007/1701|336/566|| | minor | G | A | H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep1 | 834 | 796 | 38 | 0.9544365 | 0.0455635 | 796 | 38 | 0.9544365 | 0.0455635 | H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep2 | 816 | 780 | 35 | 0.9558824 | 0.0428922 | 780 | 35 | 0.9558824 | 0.0428922 | 0.0442279 | 0.9551594 | 0.0442424 | 0.9551515 |
a_1_fb | H1N1_HA | 1145 | T | A | A|missense_variant|MODERATE|CDS_H1N1_HA_1_1701|H1N1_HA|transcript|H1N1_HA.1|protein_coding|1/1|c.1145T>A|p.Leu382Gln|1145/1701|1145/1701|382/566|| | minor | T | A | H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep1 | 770 | 740 | 30 | 0.9610390 | 0.0389610 | 740 | 30 | 0.9610390 | 0.0389610 | H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep2 | 810 | 781 | 29 | 0.9641975 | 0.0358025 | 781 | 29 | 0.9641975 | 0.0358025 | 0.0373818 | 0.9626182 | 0.0373418 | 0.9626582 |
a_1_fb | H1N1_HA | 1386 | A | G | G|synonymous_variant|LOW|CDS_H1N1_HA_1_1701|H1N1_HA|transcript|H1N1_HA.1|protein_coding|1/1|c.1386A>G|p.Leu462Leu|1386/1701|1386/1701|462/566|| | minor | A | G | H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep1 | 552 | 545 | 7 | 0.9873188 | 0.0126812 | 545 | 7 | 0.9873188 | 0.0126812 | H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep2 | 529 | 521 | 8 | 0.9848771 | 0.0151229 | 521 | 8 | 0.9848771 | 0.0151229 | 0.0139020 | 0.9860980 | 0.0138760 | 0.9861240 |
a_1_fb | H1N1_HA | 410 | C | T | T|missense_variant|MODERATE|CDS_H1N1_HA_1_1701|H1N1_HA|transcript|H1N1_HA.1|protein_coding|1/1|c.410C>T|p.Thr137Ile|410/1701|410/1701|137/566|| | minor | C | T | H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep1 | 734 | 720 | 14 | 0.9809264 | 0.0190736 | 720 | 14 | 0.9809264 | 0.0190736 | H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep2 | 773 | 756 | 17 | 0.9780078 | 0.0219922 | 756 | 17 | 0.9780078 | 0.0219922 | 0.0205329 | 0.9794671 | 0.0205707 | 0.9794293 |
a_1_fb | H1N1_HA | 435 | G | A | A|synonymous_variant|LOW|CDS_H1N1_HA_1_1701|H1N1_HA|transcript|H1N1_HA.1|protein_coding|1/1|c.435G>A|p.Ser145Ser|435/1701|435/1701|145/566|| | minor | G | A | H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep1 | 749 | 726 | 23 | 0.9692924 | 0.0307076 | 726 | 23 | 0.9692924 | 0.0307076 | H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep2 | 790 | 766 | 22 | 0.9696203 | 0.0278481 | 766 | 22 | 0.9696203 | 0.0278481 | 0.0292779 | 0.9694563 | 0.0292398 | 0.9694607 |
a_1_fb | H1N1_HA | 532 | C | T | T|missense_variant|MODERATE|CDS_H1N1_HA_1_1701|H1N1_HA|transcript|H1N1_HA.1|protein_coding|1/1|c.532C>T|p.Leu178Phe|532/1701|532/1701|178/566|| | minor | C | T | H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep1 | 756 | 743 | 13 | 0.9828042 | 0.0171958 | 743 | 13 | 0.9828042 | 0.0171958 | H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep2 | 784 | 763 | 21 | 0.9732143 | 0.0267857 | 763 | 21 | 0.9732143 | 0.0267857 | 0.0219907 | 0.9780093 | 0.0220779 | 0.9779221 |
dim(DF_filt)
#> [1] 735 35
Until this point, SNPeff annotations are contained within a single
column. The function prepare_annotation()
separates this
information into individual columns with one attribute each.
DF_filt = prepare_annotations(DF_filt)
#> >2 annotations: character(0)
kable(head(DF_filt))
sample | CHROM | POS | REF | ALT | allele | annotation | putative_impact | gene_name | gene_id | feature_type | feature_id | transcript_biotype | rank_total | HGVS.c | HGVS.p | cDNA_position | CDS_position | protein_position | distance_to_feature | errors | ALT_TYPE | major | minor | filename.x | replicate.x | gt_DP.x | REF_COUNT.x | ALT_COUNT.x | REF_FREQ.x | ALT_FREQ.x | majorcount.x | minorcount.x | majorfreq.x | minorfreq.x | filename.y | replicate.y | gt_DP.y | REF_COUNT.y | ALT_COUNT.y | REF_FREQ.y | ALT_FREQ.y | majorcount.y | minorcount.y | majorfreq.y | minorfreq.y | minorfreq | majorfreq | weighted_minorfreq | weighted_majorfreq |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
a_1_fb | H1N1_HA | 1007 | G | A | A | missense_variant | MODERATE | CDS_H1N1_HA_1_1701 | H1N1_HA | transcript | H1N1_HA.1 | protein_coding | 1/1 | c.1007G>A | p.Arg336Lys | 1007/1701 | 1007/1701 | 336/566 | minor | G | A | H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep1 | 834 | 796 | 38 | 0.9544365 | 0.0455635 | 796 | 38 | 0.9544365 | 0.0455635 | H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep2 | 816 | 780 | 35 | 0.9558824 | 0.0428922 | 780 | 35 | 0.9558824 | 0.0428922 | 0.0442279 | 0.9551594 | 0.0442424 | 0.9551515 | ||
a_1_fb | H1N1_HA | 1145 | T | A | A | missense_variant | MODERATE | CDS_H1N1_HA_1_1701 | H1N1_HA | transcript | H1N1_HA.1 | protein_coding | 1/1 | c.1145T>A | p.Leu382Gln | 1145/1701 | 1145/1701 | 382/566 | minor | T | A | H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep1 | 770 | 740 | 30 | 0.9610390 | 0.0389610 | 740 | 30 | 0.9610390 | 0.0389610 | H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep2 | 810 | 781 | 29 | 0.9641975 | 0.0358025 | 781 | 29 | 0.9641975 | 0.0358025 | 0.0373818 | 0.9626182 | 0.0373418 | 0.9626582 | ||
a_1_fb | H1N1_HA | 1386 | A | G | G | synonymous_variant | LOW | CDS_H1N1_HA_1_1701 | H1N1_HA | transcript | H1N1_HA.1 | protein_coding | 1/1 | c.1386A>G | p.Leu462Leu | 1386/1701 | 1386/1701 | 462/566 | minor | A | G | H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep1 | 552 | 545 | 7 | 0.9873188 | 0.0126812 | 545 | 7 | 0.9873188 | 0.0126812 | H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep2 | 529 | 521 | 8 | 0.9848771 | 0.0151229 | 521 | 8 | 0.9848771 | 0.0151229 | 0.0139020 | 0.9860980 | 0.0138760 | 0.9861240 | ||
a_1_fb | H1N1_HA | 410 | C | T | T | missense_variant | MODERATE | CDS_H1N1_HA_1_1701 | H1N1_HA | transcript | H1N1_HA.1 | protein_coding | 1/1 | c.410C>T | p.Thr137Ile | 410/1701 | 410/1701 | 137/566 | minor | C | T | H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep1 | 734 | 720 | 14 | 0.9809264 | 0.0190736 | 720 | 14 | 0.9809264 | 0.0190736 | H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep2 | 773 | 756 | 17 | 0.9780078 | 0.0219922 | 756 | 17 | 0.9780078 | 0.0219922 | 0.0205329 | 0.9794671 | 0.0205707 | 0.9794293 | ||
a_1_fb | H1N1_HA | 435 | G | A | A | synonymous_variant | LOW | CDS_H1N1_HA_1_1701 | H1N1_HA | transcript | H1N1_HA.1 | protein_coding | 1/1 | c.435G>A | p.Ser145Ser | 435/1701 | 435/1701 | 145/566 | minor | G | A | H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep1 | 749 | 726 | 23 | 0.9692924 | 0.0307076 | 726 | 23 | 0.9692924 | 0.0307076 | H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep2 | 790 | 766 | 22 | 0.9696203 | 0.0278481 | 766 | 22 | 0.9696203 | 0.0278481 | 0.0292779 | 0.9694563 | 0.0292398 | 0.9694607 | ||
a_1_fb | H1N1_HA | 532 | C | T | T | missense_variant | MODERATE | CDS_H1N1_HA_1_1701 | H1N1_HA | transcript | H1N1_HA.1 | protein_coding | 1/1 | c.532C>T | p.Leu178Phe | 532/1701 | 532/1701 | 178/566 | minor | C | T | H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep1 | 756 | 743 | 13 | 0.9828042 | 0.0171958 | 743 | 13 | 0.9828042 | 0.0171958 | H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep2 | 784 | 763 | 21 | 0.9732143 | 0.0267857 | 763 | 21 | 0.9732143 | 0.0267857 | 0.0219907 | 0.9780093 | 0.0220779 | 0.9779221 |
dim(DF_filt)
#> [1] 829 50
Since the NS and M segments in influenza have splice variants, the variants on these two segments will be double counted. To prevent inaccurate counting of these variants, we remove duplicates before continuing with downstream analysis.
DF_filt_ns = dplyr::filter(DF_filt, feature_id != "H1N1_NS.1" & feature_id != "H1N1_NS.2" &
feature_id != "H1N1_M1.1" & feature_id != "H1N1_M1.2")
DF_filt_s = dplyr::filter(DF_filt, feature_id == "H1N1_NS.1" | feature_id == "H1N1_NS.2" |
feature_id =="H1N1_M1.1" | feature_id =="H1N1_M1.2")
DF_filt_s_unique = DF_filt_s %>% dplyr::group_by(sample,CHROM,POS,REF,ALT) %>%
dplyr::mutate(count = 1, totalsamp = sum(count)) %>%
dplyr::filter(totalsamp == 1) %>%
dplyr::ungroup()
# if variants are duplicated, only take those from NS.1 or M.1
DF_filt_s_double = DF_filt_s %>% dplyr::group_by(sample,CHROM,POS,REF,ALT) %>%
dplyr::mutate(count = 1, totalsamp = sum(count)) %>%
dplyr::filter(totalsamp > 1) %>%
dplyr::filter(feature_id == "H1N1_NS.1" | feature_id =="H1N1_M1.1") %>%
dplyr::ungroup() %>%
dplyr::select(!(count:totalsamp))
DF_filt_s_all = rbind(DF_filt_s_unique,DF_filt_s_double)
DF_filt_s_all = DF_filt_s_all[!duplicated(DF_filt_s_all), ] %>% droplevels()
DF_filt_SNVs = rbind(DF_filt_s_all,DF_filt_ns)
We add information about the sizes of the segments which is necessary for downstream calculations such as transition:transversion (tstv) and dNdS. The variant dataframe and metadata dataframe are required, as well as the names of the columns to define the merge.
DF_filt_SNVs = add_metadata(DF_filt_SNVs, sizes, c('CHROM'), c('segment'))
kable(head(DF_filt_SNVs))
CHROM | sample | POS | REF | ALT | allele | annotation | putative_impact | gene_name | gene_id | feature_type | feature_id | transcript_biotype | rank_total | HGVS.c | HGVS.p | cDNA_position | CDS_position | protein_position | distance_to_feature | errors | ALT_TYPE | major | minor | filename.x | replicate.x | gt_DP.x | REF_COUNT.x | ALT_COUNT.x | REF_FREQ.x | ALT_FREQ.x | majorcount.x | minorcount.x | majorfreq.x | minorfreq.x | filename.y | replicate.y | gt_DP.y | REF_COUNT.y | ALT_COUNT.y | REF_FREQ.y | ALT_FREQ.y | majorcount.y | minorcount.y | majorfreq.y | minorfreq.y | minorfreq | majorfreq | weighted_minorfreq | weighted_majorfreq | SegmentSize | STRAIN |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
H1N1_HA | a_1_fb | 1007 | G | A | A | missense_variant | MODERATE | CDS_H1N1_HA_1_1701 | H1N1_HA | transcript | H1N1_HA.1 | protein_coding | 1/1 | c.1007G>A | p.Arg336Lys | 1007/1701 | 1007/1701 | 336/566 | minor | G | A | H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep1 | 834 | 796 | 38 | 0.9544365 | 0.0455635 | 796 | 38 | 0.9544365 | 0.0455635 | H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep2 | 816 | 780 | 35 | 0.9558824 | 0.0428922 | 780 | 35 | 0.9558824 | 0.0428922 | 0.0442279 | 0.9551594 | 0.0442424 | 0.9551515 | 1701 | H1N1 | ||
H1N1_HA | a_1_fb | 1145 | T | A | A | missense_variant | MODERATE | CDS_H1N1_HA_1_1701 | H1N1_HA | transcript | H1N1_HA.1 | protein_coding | 1/1 | c.1145T>A | p.Leu382Gln | 1145/1701 | 1145/1701 | 382/566 | minor | T | A | H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep1 | 770 | 740 | 30 | 0.9610390 | 0.0389610 | 740 | 30 | 0.9610390 | 0.0389610 | H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep2 | 810 | 781 | 29 | 0.9641975 | 0.0358025 | 781 | 29 | 0.9641975 | 0.0358025 | 0.0373818 | 0.9626182 | 0.0373418 | 0.9626582 | 1701 | H1N1 | ||
H1N1_HA | a_1_fb | 1386 | A | G | G | synonymous_variant | LOW | CDS_H1N1_HA_1_1701 | H1N1_HA | transcript | H1N1_HA.1 | protein_coding | 1/1 | c.1386A>G | p.Leu462Leu | 1386/1701 | 1386/1701 | 462/566 | minor | A | G | H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep1 | 552 | 545 | 7 | 0.9873188 | 0.0126812 | 545 | 7 | 0.9873188 | 0.0126812 | H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep2 | 529 | 521 | 8 | 0.9848771 | 0.0151229 | 521 | 8 | 0.9848771 | 0.0151229 | 0.0139020 | 0.9860980 | 0.0138760 | 0.9861240 | 1701 | H1N1 | ||
H1N1_HA | a_1_fb | 410 | C | T | T | missense_variant | MODERATE | CDS_H1N1_HA_1_1701 | H1N1_HA | transcript | H1N1_HA.1 | protein_coding | 1/1 | c.410C>T | p.Thr137Ile | 410/1701 | 410/1701 | 137/566 | minor | C | T | H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep1 | 734 | 720 | 14 | 0.9809264 | 0.0190736 | 720 | 14 | 0.9809264 | 0.0190736 | H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep2 | 773 | 756 | 17 | 0.9780078 | 0.0219922 | 756 | 17 | 0.9780078 | 0.0219922 | 0.0205329 | 0.9794671 | 0.0205707 | 0.9794293 | 1701 | H1N1 | ||
H1N1_HA | a_1_fb | 435 | G | A | A | synonymous_variant | LOW | CDS_H1N1_HA_1_1701 | H1N1_HA | transcript | H1N1_HA.1 | protein_coding | 1/1 | c.435G>A | p.Ser145Ser | 435/1701 | 435/1701 | 145/566 | minor | G | A | H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep1 | 749 | 726 | 23 | 0.9692924 | 0.0307076 | 726 | 23 | 0.9692924 | 0.0307076 | H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep2 | 790 | 766 | 22 | 0.9696203 | 0.0278481 | 766 | 22 | 0.9696203 | 0.0278481 | 0.0292779 | 0.9694563 | 0.0292398 | 0.9694607 | 1701 | H1N1 | ||
H1N1_HA | a_1_fb | 532 | C | T | T | missense_variant | MODERATE | CDS_H1N1_HA_1_1701 | H1N1_HA | transcript | H1N1_HA.1 | protein_coding | 1/1 | c.532C>T | p.Leu178Phe | 532/1701 | 532/1701 | 178/566 | minor | C | T | H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep1 | 756 | 743 | 13 | 0.9828042 | 0.0171958 | 743 | 13 | 0.9828042 | 0.0171958 | H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann | rep2 | 784 | 763 | 21 | 0.9732143 | 0.0267857 | 763 | 21 | 0.9732143 | 0.0267857 | 0.0219907 | 0.9780093 | 0.0220779 | 0.9779221 | 1701 | H1N1 |
dim(DF_filt_SNVs)
#> [1] 735 52
The function af_distribution()
takes the variant
dataframe and generates a plot showing the distribution of the the minor
variants (defined as those variants at frequency less than 50% in a
sample).
af_distribution(DF_filt_SNVs)
The function tally_it()
allows the user to count the
number of variants over a given set of variables. These variables should
be provided as a vector and then passed into the function in addition to
the name, in quotes, for the new column containing the number of
variants. The user should construct a vector containing the name of the
sample column and the name of the segment column and pass that list into
the function.
For example, to get the sum of variants on every segment:
group_list_seg = c('sample','CHROM', "SegmentSize")
seg_count = tally_it(DF_filt_SNVs, group_list_seg, "snv_count")
kable(seg_count)
sample | CHROM | SegmentSize | snv_count |
---|---|---|---|
a_1_fb | H1N1_MP | 982 | 4 |
a_1_fb | H1N1_NS | 838 | 8 |
a_1_fb | H1N1_PB2 | 2280 | 18 |
a_1_fb | H1N1_PB1 | 2274 | 15 |
a_1_fb | H1N1_PA | 2151 | 12 |
a_1_fb | H1N1_HA | 1701 | 11 |
a_1_fb | H1N1_NP | 1497 | 17 |
a_1_fb | H1N1_NA | 1410 | 7 |
a_1_iv | H1N1_MP | 982 | 4 |
a_1_iv | H1N1_NS | 838 | 8 |
a_1_iv | H1N1_PB2 | 2280 | 19 |
a_1_iv | H1N1_PB1 | 2274 | 17 |
a_1_iv | H1N1_PA | 2151 | 11 |
a_1_iv | H1N1_HA | 1701 | 12 |
a_1_iv | H1N1_NP | 1497 | 15 |
a_1_iv | H1N1_NA | 1410 | 8 |
a_2_fb | H1N1_MP | 982 | 7 |
a_2_fb | H1N1_NS | 838 | 7 |
a_2_fb | H1N1_PB2 | 2280 | 19 |
a_2_fb | H1N1_PB1 | 2274 | 17 |
a_2_fb | H1N1_PA | 2151 | 11 |
a_2_fb | H1N1_HA | 1701 | 11 |
a_2_fb | H1N1_NP | 1497 | 16 |
a_2_fb | H1N1_NA | 1410 | 5 |
a_2_iv | H1N1_MP | 982 | 6 |
a_2_iv | H1N1_NS | 838 | 6 |
a_2_iv | H1N1_PB2 | 2280 | 18 |
a_2_iv | H1N1_PB1 | 2274 | 19 |
a_2_iv | H1N1_PA | 2151 | 11 |
a_2_iv | H1N1_HA | 1701 | 12 |
a_2_iv | H1N1_NP | 1497 | 16 |
a_2_iv | H1N1_NA | 1410 | 5 |
a_3_fb | H1N1_MP | 982 | 8 |
a_3_fb | H1N1_NS | 838 | 5 |
a_3_fb | H1N1_PB2 | 2280 | 19 |
a_3_fb | H1N1_PB1 | 2274 | 14 |
a_3_fb | H1N1_PA | 2151 | 14 |
a_3_fb | H1N1_HA | 1701 | 13 |
a_3_fb | H1N1_NP | 1497 | 18 |
a_3_fb | H1N1_NA | 1410 | 5 |
a_3_iv | H1N1_MP | 982 | 7 |
a_3_iv | H1N1_NS | 838 | 4 |
a_3_iv | H1N1_PB2 | 2280 | 18 |
a_3_iv | H1N1_PB1 | 2274 | 15 |
a_3_iv | H1N1_PA | 2151 | 12 |
a_3_iv | H1N1_HA | 1701 | 13 |
a_3_iv | H1N1_NP | 1497 | 17 |
a_3_iv | H1N1_NA | 1410 | 8 |
b_1_fb | H1N1_MP | 982 | 6 |
b_1_fb | H1N1_NS | 838 | 3 |
b_1_fb | H1N1_PB2 | 2280 | 12 |
b_1_fb | H1N1_PB1 | 2274 | 19 |
b_1_fb | H1N1_PA | 2151 | 15 |
b_1_fb | H1N1_HA | 1701 | 18 |
b_1_fb | H1N1_NP | 1497 | 5 |
b_1_fb | H1N1_NA | 1410 | 4 |
b_1_iv | H1N1_MP | 982 | 8 |
b_1_iv | H1N1_NS | 838 | 3 |
b_1_iv | H1N1_PB2 | 2280 | 11 |
b_1_iv | H1N1_PB1 | 2274 | 18 |
b_1_iv | H1N1_PA | 2151 | 22 |
b_1_iv | H1N1_HA | 1701 | 18 |
b_1_iv | H1N1_NP | 1497 | 5 |
b_1_iv | H1N1_NA | 1410 | 6 |
To count across genomes:
group_list_gen = c('sample')
gen_count = tally_it(DF_filt_SNVs, group_list_gen, "snv_count")
kable(gen_count)
sample | snv_count |
---|---|
a_1_fb | 92 |
a_1_iv | 94 |
a_2_fb | 93 |
a_2_iv | 93 |
a_3_fb | 96 |
a_3_iv | 94 |
b_1_fb | 82 |
b_1_iv | 91 |
This snv_location()
function takes the variant dataframe
and generates a large, interactive plot of all of the variants. The plot
is faceted by each segment for each individual sample. The user can
scroll over each point in the plot to get information about that
variant, including the nucleotide change and the position on the
segment.
snv_location(DF_filt_SNVs)
The snv_genome()
function takes the variant dataframe
and generates a plot of the number of variants per genome and colors
them by their SNPeff annotation.
snv_genome(DF_filt_SNVs)
The snv_segment()
function generates the same plot
faceted by segment.
snv_segment(DF_filt_SNVs)