This vignette demonstrates the use of the package vivaldi to analyze viral single nucleotide variants (SNVs) from Illumina sequencing data. The vivaldi package provides tools for visualizing and summarizing allele frequency information, and genetic diversity, from heterogeneous viral samples.

Load the Vivaldi package

The vivaldi package is available on CRAN and is loaded as follows. We also use dplyr for data wrangling and kableExtra for rendering tables in vignette

library(vivaldi)
library(kableExtra)

Vignette data set

The data used in this vignette are simulated mixed influenza samples with simulated variants at random frequencies. VCF files were generated using the MAD2 genome alignment and variation calling pipeline. VCF files were generated using the variant callers timo and iVar; however, vivaldi functions are designed to work with all variant callers.

The simulated data comprise 12 samples that were sequenced in technical replicate resulting in a total of 24 VCF files. The VCF files were annotated using SNPeff.

Step 1: Set path for variant data and metadata

The data used in this vignette is included with the package. To use your own data, users should set the path to those files.

vardir = system.file("extdata", "vcfs", package="vivaldi") 

Metadata used for the calculations includes a .csv file containing the length of each of the viral segments. For non-segmented viruses, simply report the whole length of the genome. Users should also set a variable with the total length of the viral genome (i.e. the sum of the segment lengths).

seg_sizes = system.file("extdata", "SegmentSize.csv", package="vivaldi")
sizes = read.csv(file=seg_sizes,header=T,sep=",",na.strings = c(''))

#select only the relevant segment sizes for H1N1
sizes = dplyr::filter(sizes, STRAIN ==  "H1N1")

genome_size = 13133

The user should also supply a .csv file with sample information containing 3 columns:

rep_info = system.file("extdata", "reps.csv", package="vivaldi")
replicates = read.csv(file = rep_info, header = T, sep = ",", na.strings = c(""))
kable(head(replicates))
filename replicate sample
H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep1 a_1_fb
H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_ivar_custom.ann rep1 a_1_iv
H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_varscan_custom.ann rep1 a_1_vs
H1N1-random3_m1_AF_random-a2_frac_0.01_BWA_freebayes_custom.ann rep1 a_2_fb
H1N1-random3_m1_AF_random-a2_frac_0.01_BWA_ivar_custom.ann rep1 a_2_iv
H1N1-random3_m1_AF_random-a2_frac_0.01_BWA_varscan_custom.ann rep1 a_2_vs
dim(replicates)
#> [1] 24  3

Step 2: Loading data and arranging

Load VCF files into a dataframe

The first step is performed using arrange_data() to load the VCF files into R, extract the important information. and arrange data as a tidy dataframe. The dataframe contains the sample name, pulled from the VCF file name, and information about the reference and alternative allele. We recommend that the VCF files have been annotated using SNPeff. However, if the VCF has not been annotated using SNPeff, the user should specify annotated = "no".

Once the data has been arranged we can see that there are a total of 2,493 variants across the 24 samples. Each sample has between 95 - 110 variants.

VCF_DF = arrange_data(vardir, ref = system.file("extdata", "H1N1.fa", package="vivaldi"), annotated = 'yes')
kable(head(VCF_DF))
sample CHROM POS REF ALT ANN gt_DP REF_COUNT ALT_COUNT REF_FREQ ALT_FREQ ALT_TYPE major minor majorcount minorcount majorfreq minorfreq
H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann H1N1_HA 1007 G A A|missense_variant|MODERATE|CDS_H1N1_HA_1_1701|H1N1_HA|transcript|H1N1_HA.1|protein_coding|1/1|c.1007G>A|p.Arg336Lys|1007/1701|1007/1701|336/566|| 834 796 38 0.9544365 0.0455635 minor G A 796 38 0.9544365 0.0455635
H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann H1N1_HA 1145 T A A|missense_variant|MODERATE|CDS_H1N1_HA_1_1701|H1N1_HA|transcript|H1N1_HA.1|protein_coding|1/1|c.1145T>A|p.Leu382Gln|1145/1701|1145/1701|382/566|| 770 740 30 0.9610390 0.0389610 minor T A 740 30 0.9610390 0.0389610
H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann H1N1_HA 1293 T C C|synonymous_variant|LOW|CDS_H1N1_HA_1_1701|H1N1_HA|transcript|H1N1_HA.1|protein_coding|1/1|c.1293T>C|p.Gly431Gly|1293/1701|1293/1701|431/566|| 571 554 17 0.9702277 0.0297723 minor T C 554 17 0.9702277 0.0297723
H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann H1N1_HA 1319 C T T|missense_variant|MODERATE|CDS_H1N1_HA_1_1701|H1N1_HA|transcript|H1N1_HA.1|protein_coding|1/1|c.1319C>T|p.Ala440Val|1319/1701|1319/1701|440/566|| 587 580 7 0.9880750 0.0119250 minor C T 580 7 0.9880750 0.0119250
H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann H1N1_HA 1386 A G G|synonymous_variant|LOW|CDS_H1N1_HA_1_1701|H1N1_HA|transcript|H1N1_HA.1|protein_coding|1/1|c.1386A>G|p.Leu462Leu|1386/1701|1386/1701|462/566|| 552 545 7 0.9873188 0.0126812 minor A G 545 7 0.9873188 0.0126812
H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann H1N1_HA 1505 A G G|missense_variant|MODERATE|CDS_H1N1_HA_1_1701|H1N1_HA|transcript|H1N1_HA.1|protein_coding|1/1|c.1505A>G|p.Asp502Gly|1505/1701|1505/1701|502/566|| 386 380 5 0.9844560 0.0129534 minor A G 380 5 0.9844560 0.0129534
VCF_DF %>%
  dplyr::group_by(sample) %>%
  dplyr::summarise(n = dplyr::n()) %>%
  kable()
sample n
H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann 105
H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_ivar_custom.ann 107
H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_varscan_custom.ann 104
H1N1-random3_m1_AF_random-a2_frac_0.01_BWA_freebayes_custom.ann 107
H1N1-random3_m1_AF_random-a2_frac_0.01_BWA_ivar_custom.ann 109
H1N1-random3_m1_AF_random-a2_frac_0.01_BWA_varscan_custom.ann 105
H1N1-random3_m1_AF_random-a3_frac_0.01_BWA_freebayes_custom.ann 108
H1N1-random3_m1_AF_random-a3_frac_0.01_BWA_ivar_custom.ann 110
H1N1-random3_m1_AF_random-a3_frac_0.01_BWA_varscan_custom.ann 108
H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann 103
H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_ivar_custom.ann 104
H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_varscan_custom.ann 103
H1N1-random3_m2_AF_random-a2_frac_0.01_BWA_freebayes_custom.ann 102
H1N1-random3_m2_AF_random-a2_frac_0.01_BWA_ivar_custom.ann 103
H1N1-random3_m2_AF_random-a2_frac_0.01_BWA_varscan_custom.ann 102
H1N1-random3_m2_AF_random-a3_frac_0.01_BWA_freebayes_custom.ann 105
H1N1-random3_m2_AF_random-a3_frac_0.01_BWA_ivar_custom.ann 105
H1N1-random3_m2_AF_random-a3_frac_0.01_BWA_varscan_custom.ann 101
H1N1-random4_m1_AF_random-a4_frac_0.01_BWA_freebayes_custom.ann 95
H1N1-random4_m1_AF_random-a4_frac_0.01_BWA_ivar_custom.ann 101
H1N1-random4_m1_AF_random-a4_frac_0.01_BWA_varscan_custom.ann 100
H1N1-random4_m2_AF_random-a4_frac_0.01_BWA_freebayes_custom.ann 97
H1N1-random4_m2_AF_random-a4_frac_0.01_BWA_ivar_custom.ann 106
H1N1-random4_m2_AF_random-a4_frac_0.01_BWA_varscan_custom.ann 103
dim(VCF_DF)
#> [1] 2493   18

The VCF_DF dataframe contains all information from the VCF files. The 18 columns are:

  • sample: sample name (pulled from VCF file name)
  • CHROM: segment the variant was found on
  • POS: nucleotide position of the variant
  • REF: nucleotide found at this position in the reference genome used for alignment (“ref =”)
  • ALT: nucleotide found at this position in the sample
  • ANN: annotation information from SNPeff, not yet formatted
  • gt_DP: total number of reads (i.e. depth) that cover this position
  • REF_COUNT: number of reads that include the REF nucleotide
  • ALT_COUNT: number of reads that include the ALT nucleotide
  • REF_FREQ: frequency of the reference nucleotide, calculated using REF_COUNT / gt_DP
  • ALT_FREQ: frequency of the alternate nucleotide, calculated using ALT_COUNT / gt_DP
  • ALT_TYPE: categorization of the alternate nucleotide. If found at frequencies > 50%, it is labeled as major; if found at frequencies < 50%, it is labeled as minor.
  • major: major frequency nucleotide (>=0.5)
  • minor: minor frequency nucleotide (<0.5)
  • majorcount: number of reads containing the major nucleotide
  • minorcount: number of reads containing the minor nucleotide
  • majorfreq: frequency of the major nucleotide
  • minorfreq: frequency of the minor nucleotide

If working with a segmented genome, users should provide a vector containing the names of the segments in the desired order for plotting.

SEGMENTS = c("H1N1_PB2","H1N1_PB1","H1N1_PA","H1N1_HA","H1N1_NP","H1N1_NA","H1N1_MP","H1N1_NS")
VCF_DF$CHROM = factor(VCF_DF$CHROM, levels = SEGMENTS)

Merging Replicate Sequence Data

Performing technical replicates of library preparation and DNA sequencing improves the accuracy of variant calling as sequencing or RT-PCR errors are unlikely to occur at the same site in independent experiments.

For this function, the user must provide the variant dataframe (i.e. the output of the arrange_data() function) and the replicates dataframe. Additional information that is required is: * the exact name used for replicates 1 and 2 from the replicates dataframe (e.g. “Rep1”, “r1”, “rep1”, etc), which must be provided in quotes * a vector containing the names of the columns that should be used to merge the two dataframes. This vector must contain columns that have identical values for the two replicates, such as the segment and nucleotide position of the variant. It should not contain replicate-specific information such as the allele frequency of the variant.

The merge_replicates() functions generates a dataframe with all variants that are found in both sequencing replicates. It excludes variants that are only found in one replicate.

In the merged dataframe information from replicate 1 is denoted with a .x suffix. Information from replicate 2 is denoted with a .y suffix.

In addition, average frequencies are computed from the two replicates.

The dataframe now contains the 12 unique samples with between 82 - 99 variants. These variants were detected in both sequencing replicates whereas variants in only a single replicate were removed during the merge.

cols = c("sample","CHROM","POS","REF","ALT","ANN","ALT_TYPE","major","minor")

DF_reps = merge_replicates(VCF_DF,replicates,"rep1","rep2",cols)

kable(head(DF_reps))
sample CHROM POS REF ALT ANN ALT_TYPE major minor filename.x replicate.x gt_DP.x REF_COUNT.x ALT_COUNT.x REF_FREQ.x ALT_FREQ.x majorcount.x minorcount.x majorfreq.x minorfreq.x filename.y replicate.y gt_DP.y REF_COUNT.y ALT_COUNT.y REF_FREQ.y ALT_FREQ.y majorcount.y minorcount.y majorfreq.y minorfreq.y minorfreq majorfreq weighted_minorfreq weighted_majorfreq
a_1_fb H1N1_HA 1007 G A A|missense_variant|MODERATE|CDS_H1N1_HA_1_1701|H1N1_HA|transcript|H1N1_HA.1|protein_coding|1/1|c.1007G>A|p.Arg336Lys|1007/1701|1007/1701|336/566|| minor G A H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep1 834 796 38 0.9544365 0.0455635 796 38 0.9544365 0.0455635 H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep2 816 780 35 0.9558824 0.0428922 780 35 0.9558824 0.0428922 0.0442279 0.9551594 0.0442424 0.9551515
a_1_fb H1N1_HA 1145 T A A|missense_variant|MODERATE|CDS_H1N1_HA_1_1701|H1N1_HA|transcript|H1N1_HA.1|protein_coding|1/1|c.1145T>A|p.Leu382Gln|1145/1701|1145/1701|382/566|| minor T A H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep1 770 740 30 0.9610390 0.0389610 740 30 0.9610390 0.0389610 H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep2 810 781 29 0.9641975 0.0358025 781 29 0.9641975 0.0358025 0.0373818 0.9626182 0.0373418 0.9626582
a_1_fb H1N1_HA 1386 A G G|synonymous_variant|LOW|CDS_H1N1_HA_1_1701|H1N1_HA|transcript|H1N1_HA.1|protein_coding|1/1|c.1386A>G|p.Leu462Leu|1386/1701|1386/1701|462/566|| minor A G H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep1 552 545 7 0.9873188 0.0126812 545 7 0.9873188 0.0126812 H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep2 529 521 8 0.9848771 0.0151229 521 8 0.9848771 0.0151229 0.0139020 0.9860980 0.0138760 0.9861240
a_1_fb H1N1_HA 410 C T T|missense_variant|MODERATE|CDS_H1N1_HA_1_1701|H1N1_HA|transcript|H1N1_HA.1|protein_coding|1/1|c.410C>T|p.Thr137Ile|410/1701|410/1701|137/566|| minor C T H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep1 734 720 14 0.9809264 0.0190736 720 14 0.9809264 0.0190736 H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep2 773 756 17 0.9780078 0.0219922 756 17 0.9780078 0.0219922 0.0205329 0.9794671 0.0205707 0.9794293
a_1_fb H1N1_HA 435 G A A|synonymous_variant|LOW|CDS_H1N1_HA_1_1701|H1N1_HA|transcript|H1N1_HA.1|protein_coding|1/1|c.435G>A|p.Ser145Ser|435/1701|435/1701|145/566|| minor G A H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep1 749 726 23 0.9692924 0.0307076 726 23 0.9692924 0.0307076 H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep2 790 766 22 0.9696203 0.0278481 766 22 0.9696203 0.0278481 0.0292779 0.9694563 0.0292398 0.9694607
a_1_fb H1N1_HA 532 C T T|missense_variant|MODERATE|CDS_H1N1_HA_1_1701|H1N1_HA|transcript|H1N1_HA.1|protein_coding|1/1|c.532C>T|p.Leu178Phe|532/1701|532/1701|178/566|| minor C T H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep1 756 743 13 0.9828042 0.0171958 743 13 0.9828042 0.0171958 H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep2 784 763 21 0.9732143 0.0267857 763 21 0.9732143 0.0267857 0.0219907 0.9780093 0.0220779 0.9779221

DF_reps %>%
  dplyr::group_by(sample) %>%
  dplyr::summarise(n = dplyr::n()) %>%
  kable()
sample n
a_1_fb 92
a_1_iv 95
a_1_vs 94
a_2_fb 93
a_2_iv 97
a_2_vs 93
a_3_fb 96
a_3_iv 99
a_3_vs 93
b_1_fb 82
b_1_iv 92
b_1_vs 91

dim(DF_reps)
#> [1] 1117   35

Compare the allele frequencies between replicate 1 and 2

Visual analysis of the variation in allele frequency estimations from the two sequencing experiments is informative about the technical variation in allele frequency estimation.

Here we generate a plot comparing the allele frequency of variants between replicates 1 and 2 without removing variants that were only found in one replicate. These are clearly observed as variants that have zero values along the x or y axes.

df = merge(replicates,VCF_DF, by.x = c("filename"), by.y = c("sample"))

df_rep1 = dplyr::filter(df, replicate == "rep1")
df_rep2 = dplyr::filter(df, replicate == "rep2")

df_merged_keep = merge(df_rep1, df_rep2, by = cols, all = TRUE)
df_merged_keep = df_merged_keep[!duplicated(df_merged_keep), ]

df_merged_keep$minorfreq.x[is.na(df_merged_keep$minorfreq.x)] = 0
df_merged_keep$minorfreq.y[is.na(df_merged_keep$minorfreq.y)] = 0

ggplot2::ggplot(df_merged_keep, ggplot2::aes(x = minorfreq.x, y = minorfreq.y)) + 
  ggplot2::geom_point()

Here, we generate the same plot from the merged dataframe with removing variants that were only found in one replicate. This serves as a visual check that these variants are removed from downstream analyses.

ggplot2::ggplot(DF_reps, ggplot2::aes(x = minorfreq.x, y = minorfreq.y)) + 
  ggplot2::geom_point()
#> Warning: Removed 371 rows containing missing values (`geom_point()`).

Compare the similarity between the average allele frequency and the weighted average allele frequency

merge_replicates() calculates two average allele frequency values: 1) the average allele frequency and 2) the weighted average allele frequency, for the major and minor allele.

  • The average is calculated by simply taking the mean of the allele frequency estimates from both replicates.
  • The weighted average is calculated by adding the number of reads containing either the major or minor nucleotide from both replicates and dividing by the total number of reads.

The average and the weighted average should be similar unless one of the replicates has significantly higher coverage per sample than the other. In this case we can see that these two average estimated are very similar. However, in the case of variable sequence depth between the replicates a weighted average may be more appropriate.

ggplot2::ggplot(DF_reps, ggplot2::aes(x = minorfreq, y = weighted_minorfreq)) + ggplot2::geom_point()
#> Warning: Removed 371 rows containing missing values (`geom_point()`).

Filter out variants based on coverage and/or frequency cutoffs

This function filters the variants by frequency or coverage of the ALT allele. Based on a benchmarking study performed by our lab, when using replicate sequencing data we recommend removing variants with less than 1% allele frequency and not imposing a sequence coverage cutoff.

Imposing a 1% allele frequency threshold results in 735 SNVs being retained.

For sequencing data without replicates, a much more stringent filtering step is recommended to ensure confidence in the variants called. We suggest a 3% allele frequency and 200x coverage for data without replicates. If data includes replicates, the function checks to make sure each replicate passes the cutoffs.

# Default coverage (200) and frequency (0.03) cutoffs 
#DF_filt = filter_variants(DF_reps)

# To run with custom values, specify these in the function
DF_filt = filter_variants(DF_reps, coverage_cutoff = 0, frequency_cutoff = 0.01 )
#> Total number of SNP filtered out: 382

kable(head(DF_filt))
sample CHROM POS REF ALT ANN ALT_TYPE major minor filename.x replicate.x gt_DP.x REF_COUNT.x ALT_COUNT.x REF_FREQ.x ALT_FREQ.x majorcount.x minorcount.x majorfreq.x minorfreq.x filename.y replicate.y gt_DP.y REF_COUNT.y ALT_COUNT.y REF_FREQ.y ALT_FREQ.y majorcount.y minorcount.y majorfreq.y minorfreq.y minorfreq majorfreq weighted_minorfreq weighted_majorfreq
a_1_fb H1N1_HA 1007 G A A|missense_variant|MODERATE|CDS_H1N1_HA_1_1701|H1N1_HA|transcript|H1N1_HA.1|protein_coding|1/1|c.1007G>A|p.Arg336Lys|1007/1701|1007/1701|336/566|| minor G A H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep1 834 796 38 0.9544365 0.0455635 796 38 0.9544365 0.0455635 H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep2 816 780 35 0.9558824 0.0428922 780 35 0.9558824 0.0428922 0.0442279 0.9551594 0.0442424 0.9551515
a_1_fb H1N1_HA 1145 T A A|missense_variant|MODERATE|CDS_H1N1_HA_1_1701|H1N1_HA|transcript|H1N1_HA.1|protein_coding|1/1|c.1145T>A|p.Leu382Gln|1145/1701|1145/1701|382/566|| minor T A H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep1 770 740 30 0.9610390 0.0389610 740 30 0.9610390 0.0389610 H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep2 810 781 29 0.9641975 0.0358025 781 29 0.9641975 0.0358025 0.0373818 0.9626182 0.0373418 0.9626582
a_1_fb H1N1_HA 1386 A G G|synonymous_variant|LOW|CDS_H1N1_HA_1_1701|H1N1_HA|transcript|H1N1_HA.1|protein_coding|1/1|c.1386A>G|p.Leu462Leu|1386/1701|1386/1701|462/566|| minor A G H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep1 552 545 7 0.9873188 0.0126812 545 7 0.9873188 0.0126812 H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep2 529 521 8 0.9848771 0.0151229 521 8 0.9848771 0.0151229 0.0139020 0.9860980 0.0138760 0.9861240
a_1_fb H1N1_HA 410 C T T|missense_variant|MODERATE|CDS_H1N1_HA_1_1701|H1N1_HA|transcript|H1N1_HA.1|protein_coding|1/1|c.410C>T|p.Thr137Ile|410/1701|410/1701|137/566|| minor C T H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep1 734 720 14 0.9809264 0.0190736 720 14 0.9809264 0.0190736 H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep2 773 756 17 0.9780078 0.0219922 756 17 0.9780078 0.0219922 0.0205329 0.9794671 0.0205707 0.9794293
a_1_fb H1N1_HA 435 G A A|synonymous_variant|LOW|CDS_H1N1_HA_1_1701|H1N1_HA|transcript|H1N1_HA.1|protein_coding|1/1|c.435G>A|p.Ser145Ser|435/1701|435/1701|145/566|| minor G A H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep1 749 726 23 0.9692924 0.0307076 726 23 0.9692924 0.0307076 H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep2 790 766 22 0.9696203 0.0278481 766 22 0.9696203 0.0278481 0.0292779 0.9694563 0.0292398 0.9694607
a_1_fb H1N1_HA 532 C T T|missense_variant|MODERATE|CDS_H1N1_HA_1_1701|H1N1_HA|transcript|H1N1_HA.1|protein_coding|1/1|c.532C>T|p.Leu178Phe|532/1701|532/1701|178/566|| minor C T H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep1 756 743 13 0.9828042 0.0171958 743 13 0.9828042 0.0171958 H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep2 784 763 21 0.9732143 0.0267857 763 21 0.9732143 0.0267857 0.0219907 0.9780093 0.0220779 0.9779221

dim(DF_filt)
#> [1] 735  35

Format SNPeff information

Until this point, SNPeff annotations are contained within a single column. The function prepare_annotation() separates this information into individual columns with one attribute each.

DF_filt = prepare_annotations(DF_filt)
#> >2 annotations: character(0)
kable(head(DF_filt))
sample CHROM POS REF ALT allele annotation putative_impact gene_name gene_id feature_type feature_id transcript_biotype rank_total HGVS.c HGVS.p cDNA_position CDS_position protein_position distance_to_feature errors ALT_TYPE major minor filename.x replicate.x gt_DP.x REF_COUNT.x ALT_COUNT.x REF_FREQ.x ALT_FREQ.x majorcount.x minorcount.x majorfreq.x minorfreq.x filename.y replicate.y gt_DP.y REF_COUNT.y ALT_COUNT.y REF_FREQ.y ALT_FREQ.y majorcount.y minorcount.y majorfreq.y minorfreq.y minorfreq majorfreq weighted_minorfreq weighted_majorfreq
a_1_fb H1N1_HA 1007 G A A missense_variant MODERATE CDS_H1N1_HA_1_1701 H1N1_HA transcript H1N1_HA.1 protein_coding 1/1 c.1007G>A p.Arg336Lys 1007/1701 1007/1701 336/566 minor G A H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep1 834 796 38 0.9544365 0.0455635 796 38 0.9544365 0.0455635 H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep2 816 780 35 0.9558824 0.0428922 780 35 0.9558824 0.0428922 0.0442279 0.9551594 0.0442424 0.9551515
a_1_fb H1N1_HA 1145 T A A missense_variant MODERATE CDS_H1N1_HA_1_1701 H1N1_HA transcript H1N1_HA.1 protein_coding 1/1 c.1145T>A p.Leu382Gln 1145/1701 1145/1701 382/566 minor T A H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep1 770 740 30 0.9610390 0.0389610 740 30 0.9610390 0.0389610 H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep2 810 781 29 0.9641975 0.0358025 781 29 0.9641975 0.0358025 0.0373818 0.9626182 0.0373418 0.9626582
a_1_fb H1N1_HA 1386 A G G synonymous_variant LOW CDS_H1N1_HA_1_1701 H1N1_HA transcript H1N1_HA.1 protein_coding 1/1 c.1386A>G p.Leu462Leu 1386/1701 1386/1701 462/566 minor A G H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep1 552 545 7 0.9873188 0.0126812 545 7 0.9873188 0.0126812 H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep2 529 521 8 0.9848771 0.0151229 521 8 0.9848771 0.0151229 0.0139020 0.9860980 0.0138760 0.9861240
a_1_fb H1N1_HA 410 C T T missense_variant MODERATE CDS_H1N1_HA_1_1701 H1N1_HA transcript H1N1_HA.1 protein_coding 1/1 c.410C>T p.Thr137Ile 410/1701 410/1701 137/566 minor C T H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep1 734 720 14 0.9809264 0.0190736 720 14 0.9809264 0.0190736 H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep2 773 756 17 0.9780078 0.0219922 756 17 0.9780078 0.0219922 0.0205329 0.9794671 0.0205707 0.9794293
a_1_fb H1N1_HA 435 G A A synonymous_variant LOW CDS_H1N1_HA_1_1701 H1N1_HA transcript H1N1_HA.1 protein_coding 1/1 c.435G>A p.Ser145Ser 435/1701 435/1701 145/566 minor G A H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep1 749 726 23 0.9692924 0.0307076 726 23 0.9692924 0.0307076 H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep2 790 766 22 0.9696203 0.0278481 766 22 0.9696203 0.0278481 0.0292779 0.9694563 0.0292398 0.9694607
a_1_fb H1N1_HA 532 C T T missense_variant MODERATE CDS_H1N1_HA_1_1701 H1N1_HA transcript H1N1_HA.1 protein_coding 1/1 c.532C>T p.Leu178Phe 532/1701 532/1701 178/566 minor C T H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep1 756 743 13 0.9828042 0.0171958 743 13 0.9828042 0.0171958 H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep2 784 763 21 0.9732143 0.0267857 763 21 0.9732143 0.0267857 0.0219907 0.9780093 0.0220779 0.9779221
dim(DF_filt)
#> [1] 829  50

Remove duplicate variants in NS and MP

Since the NS and M segments in influenza have splice variants, the variants on these two segments will be double counted. To prevent inaccurate counting of these variants, we remove duplicates before continuing with downstream analysis.

DF_filt_ns = dplyr::filter(DF_filt, feature_id != "H1N1_NS.1" & feature_id != "H1N1_NS.2" & 
                      feature_id != "H1N1_M1.1" & feature_id != "H1N1_M1.2")

DF_filt_s = dplyr::filter(DF_filt, feature_id == "H1N1_NS.1" | feature_id == "H1N1_NS.2" | 
                          feature_id =="H1N1_M1.1" | feature_id =="H1N1_M1.2")

DF_filt_s_unique = DF_filt_s %>% dplyr::group_by(sample,CHROM,POS,REF,ALT) %>% 
  dplyr::mutate(count = 1, totalsamp = sum(count)) %>%
  dplyr::filter(totalsamp == 1) %>%
  dplyr::ungroup()

# if variants are duplicated, only take those from NS.1 or M.1
DF_filt_s_double = DF_filt_s %>% dplyr::group_by(sample,CHROM,POS,REF,ALT) %>% 
  dplyr::mutate(count = 1, totalsamp = sum(count)) %>%
  dplyr::filter(totalsamp > 1) %>%
  dplyr::filter(feature_id == "H1N1_NS.1" | feature_id =="H1N1_M1.1") %>%
  dplyr::ungroup() %>%
  dplyr::select(!(count:totalsamp))
  
DF_filt_s_all = rbind(DF_filt_s_unique,DF_filt_s_double)
DF_filt_s_all = DF_filt_s_all[!duplicated(DF_filt_s_all), ] %>% droplevels()

DF_filt_SNVs = rbind(DF_filt_s_all,DF_filt_ns)

Add metadata

We add information about the sizes of the segments which is necessary for downstream calculations such as transition:transversion (tstv) and dNdS. The variant dataframe and metadata dataframe are required, as well as the names of the columns to define the merge.

DF_filt_SNVs = add_metadata(DF_filt_SNVs, sizes, c('CHROM'), c('segment'))

kable(head(DF_filt_SNVs))
CHROM sample POS REF ALT allele annotation putative_impact gene_name gene_id feature_type feature_id transcript_biotype rank_total HGVS.c HGVS.p cDNA_position CDS_position protein_position distance_to_feature errors ALT_TYPE major minor filename.x replicate.x gt_DP.x REF_COUNT.x ALT_COUNT.x REF_FREQ.x ALT_FREQ.x majorcount.x minorcount.x majorfreq.x minorfreq.x filename.y replicate.y gt_DP.y REF_COUNT.y ALT_COUNT.y REF_FREQ.y ALT_FREQ.y majorcount.y minorcount.y majorfreq.y minorfreq.y minorfreq majorfreq weighted_minorfreq weighted_majorfreq SegmentSize STRAIN
H1N1_HA a_1_fb 1007 G A A missense_variant MODERATE CDS_H1N1_HA_1_1701 H1N1_HA transcript H1N1_HA.1 protein_coding 1/1 c.1007G>A p.Arg336Lys 1007/1701 1007/1701 336/566 minor G A H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep1 834 796 38 0.9544365 0.0455635 796 38 0.9544365 0.0455635 H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep2 816 780 35 0.9558824 0.0428922 780 35 0.9558824 0.0428922 0.0442279 0.9551594 0.0442424 0.9551515 1701 H1N1
H1N1_HA a_1_fb 1145 T A A missense_variant MODERATE CDS_H1N1_HA_1_1701 H1N1_HA transcript H1N1_HA.1 protein_coding 1/1 c.1145T>A p.Leu382Gln 1145/1701 1145/1701 382/566 minor T A H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep1 770 740 30 0.9610390 0.0389610 740 30 0.9610390 0.0389610 H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep2 810 781 29 0.9641975 0.0358025 781 29 0.9641975 0.0358025 0.0373818 0.9626182 0.0373418 0.9626582 1701 H1N1
H1N1_HA a_1_fb 1386 A G G synonymous_variant LOW CDS_H1N1_HA_1_1701 H1N1_HA transcript H1N1_HA.1 protein_coding 1/1 c.1386A>G p.Leu462Leu 1386/1701 1386/1701 462/566 minor A G H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep1 552 545 7 0.9873188 0.0126812 545 7 0.9873188 0.0126812 H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep2 529 521 8 0.9848771 0.0151229 521 8 0.9848771 0.0151229 0.0139020 0.9860980 0.0138760 0.9861240 1701 H1N1
H1N1_HA a_1_fb 410 C T T missense_variant MODERATE CDS_H1N1_HA_1_1701 H1N1_HA transcript H1N1_HA.1 protein_coding 1/1 c.410C>T p.Thr137Ile 410/1701 410/1701 137/566 minor C T H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep1 734 720 14 0.9809264 0.0190736 720 14 0.9809264 0.0190736 H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep2 773 756 17 0.9780078 0.0219922 756 17 0.9780078 0.0219922 0.0205329 0.9794671 0.0205707 0.9794293 1701 H1N1
H1N1_HA a_1_fb 435 G A A synonymous_variant LOW CDS_H1N1_HA_1_1701 H1N1_HA transcript H1N1_HA.1 protein_coding 1/1 c.435G>A p.Ser145Ser 435/1701 435/1701 145/566 minor G A H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep1 749 726 23 0.9692924 0.0307076 726 23 0.9692924 0.0307076 H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep2 790 766 22 0.9696203 0.0278481 766 22 0.9696203 0.0278481 0.0292779 0.9694563 0.0292398 0.9694607 1701 H1N1
H1N1_HA a_1_fb 532 C T T missense_variant MODERATE CDS_H1N1_HA_1_1701 H1N1_HA transcript H1N1_HA.1 protein_coding 1/1 c.532C>T p.Leu178Phe 532/1701 532/1701 178/566 minor C T H1N1-random3_m1_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep1 756 743 13 0.9828042 0.0171958 743 13 0.9828042 0.0171958 H1N1-random3_m2_AF_random-a1_frac_0.01_BWA_freebayes_custom.ann rep2 784 763 21 0.9732143 0.0267857 763 21 0.9732143 0.0267857 0.0219907 0.9780093 0.0220779 0.9779221 1701 H1N1
dim(DF_filt_SNVs)
#> [1] 735  52

Step 3: Calculations and Visualization

Plot distribution of all minor variant frequencies

The function af_distribution() takes the variant dataframe and generates a plot showing the distribution of the the minor variants (defined as those variants at frequency less than 50% in a sample).

af_distribution(DF_filt_SNVs)

Count number of SNVs

The function tally_it() allows the user to count the number of variants over a given set of variables. These variables should be provided as a vector and then passed into the function in addition to the name, in quotes, for the new column containing the number of variants. The user should construct a vector containing the name of the sample column and the name of the segment column and pass that list into the function.

For example, to get the sum of variants on every segment:

group_list_seg = c('sample','CHROM', "SegmentSize")
seg_count = tally_it(DF_filt_SNVs, group_list_seg, "snv_count")

kable(seg_count)
sample CHROM SegmentSize snv_count
a_1_fb H1N1_MP 982 4
a_1_fb H1N1_NS 838 8
a_1_fb H1N1_PB2 2280 18
a_1_fb H1N1_PB1 2274 15
a_1_fb H1N1_PA 2151 12
a_1_fb H1N1_HA 1701 11
a_1_fb H1N1_NP 1497 17
a_1_fb H1N1_NA 1410 7
a_1_iv H1N1_MP 982 4
a_1_iv H1N1_NS 838 8
a_1_iv H1N1_PB2 2280 19
a_1_iv H1N1_PB1 2274 17
a_1_iv H1N1_PA 2151 11
a_1_iv H1N1_HA 1701 12
a_1_iv H1N1_NP 1497 15
a_1_iv H1N1_NA 1410 8
a_2_fb H1N1_MP 982 7
a_2_fb H1N1_NS 838 7
a_2_fb H1N1_PB2 2280 19
a_2_fb H1N1_PB1 2274 17
a_2_fb H1N1_PA 2151 11
a_2_fb H1N1_HA 1701 11
a_2_fb H1N1_NP 1497 16
a_2_fb H1N1_NA 1410 5
a_2_iv H1N1_MP 982 6
a_2_iv H1N1_NS 838 6
a_2_iv H1N1_PB2 2280 18
a_2_iv H1N1_PB1 2274 19
a_2_iv H1N1_PA 2151 11
a_2_iv H1N1_HA 1701 12
a_2_iv H1N1_NP 1497 16
a_2_iv H1N1_NA 1410 5
a_3_fb H1N1_MP 982 8
a_3_fb H1N1_NS 838 5
a_3_fb H1N1_PB2 2280 19
a_3_fb H1N1_PB1 2274 14
a_3_fb H1N1_PA 2151 14
a_3_fb H1N1_HA 1701 13
a_3_fb H1N1_NP 1497 18
a_3_fb H1N1_NA 1410 5
a_3_iv H1N1_MP 982 7
a_3_iv H1N1_NS 838 4
a_3_iv H1N1_PB2 2280 18
a_3_iv H1N1_PB1 2274 15
a_3_iv H1N1_PA 2151 12
a_3_iv H1N1_HA 1701 13
a_3_iv H1N1_NP 1497 17
a_3_iv H1N1_NA 1410 8
b_1_fb H1N1_MP 982 6
b_1_fb H1N1_NS 838 3
b_1_fb H1N1_PB2 2280 12
b_1_fb H1N1_PB1 2274 19
b_1_fb H1N1_PA 2151 15
b_1_fb H1N1_HA 1701 18
b_1_fb H1N1_NP 1497 5
b_1_fb H1N1_NA 1410 4
b_1_iv H1N1_MP 982 8
b_1_iv H1N1_NS 838 3
b_1_iv H1N1_PB2 2280 11
b_1_iv H1N1_PB1 2274 18
b_1_iv H1N1_PA 2151 22
b_1_iv H1N1_HA 1701 18
b_1_iv H1N1_NP 1497 5
b_1_iv H1N1_NA 1410 6

To count across genomes:

group_list_gen = c('sample')
gen_count = tally_it(DF_filt_SNVs, group_list_gen, "snv_count")

kable(gen_count)
sample snv_count
a_1_fb 92
a_1_iv 94
a_2_fb 93
a_2_iv 93
a_3_fb 96
a_3_iv 94
b_1_fb 82
b_1_iv 91

Plot location of SNVs across segments

This snv_location() function takes the variant dataframe and generates a large, interactive plot of all of the variants. The plot is faceted by each segment for each individual sample. The user can scroll over each point in the plot to get information about that variant, including the nucleotide change and the position on the segment.

snv_location(DF_filt_SNVs)

Plot number of SNVs per sample and per segment

The snv_genome() function takes the variant dataframe and generates a plot of the number of variants per genome and colors them by their SNPeff annotation.

snv_genome(DF_filt_SNVs)

The snv_segment() function generates the same plot faceted by segment.

snv_segment(DF_filt_SNVs)