VCF parsing#
Introduction#
fastDFE provides parser utilities that enable convenient parsing of frequency spectra from VCF files. By default, Parser looks at the AA tag in the VCF file’s info field to retrieve the correct polarization. Sites for which this tag is not well-defined are by default included (see skip_not_polarized). Note that non-polarized frequency spectra provide little information on the distribution of beneficial mutations, however.
We might also want to stratify the SFS by some site properties, such as site-degeneracy. This is done by passing stratifications to the parser. In this example, we will stratify the SFS by 0-fold and 4-fold degenerate sites using a VCF file for Betula spp.
library(fastdfe)
# load the fastdfe package
fd <- load_fastdfe()
# url to the GitHub repository of fastDFE
url <- "https://github.com/Sendrowski/fastDFE/blob/dev/"
# instantiate parser
p <- fd$Parser(
n = 8,
vcf = paste0(url, "resources/genome/betula/biallelic.polarized.subset.50000.vcf.gz?raw=true"),
fasta = paste0(url, "resources/genome/betula/genome.subset.1000.fasta.gz?raw=true"),
gff = paste0(url, "resources/genome/betula/genome.gff.gz?raw=true"),
annotations = c(
fd$DegeneracyAnnotation()
),
stratifications = c(fd$DegeneracyStratification())
)
# parse SFS
spectra <- fd$Parser$parse(p)
fastDFE relies here on VCF info tags to determine the degeneracy of a site but this behavior can be customized (cf. DegeneracyStratification).
Stratifications#
We can use also several stratifications in tandem by specifying a list of stratifications. In this example, we will stratify the SFS by synonymy as well as base transitions type. The resulting spectra can be fed directly into fastDFE’s inference routines. See parser module for a complete list of available stratifications.
# instantiate parser
p <- fd$Parser(
n = 8,
vcf = paste0(url, "resources/genome/betula/biallelic.polarized.subset.50000.vcf.gz?raw=true"),
fasta = paste0(url, "resources/genome/betula/genome.subset.1000.fasta.gz?raw=true"),
gff = paste0(url, "resources/genome/betula/genome.gff.gz?raw=true"),
annotations = c(
fd$DegeneracyAnnotation()
),
stratifications = c(
fd$DegeneracyStratification(),
fd$AncestralBaseStratification()
)
)
# parse SFS
spectra <- fd$Parser$parse(p)
Note that fastDFE requires the ancestral state of sites to be determined. The Parser achieves this by examining the AA field, although this behavior can be customized.
Annotations#
fastDFE provides a number of annotations accessible directly during the parsing process. To annotate a VCF file directly, consider using the Annotator class.
Degeneracy Annotation#
DegeneracyAnnotation annotates the SFS by the degeneracy of the site. This annotation requires information from a FASTA and GFF file and is useful for stratifying the SFS by 0-fold and 4-fold degenerate sites which is what we often want to do when inferring the DFE (see DegeneracyStratification).
# example for degeneracy annotation
ann <- fd$Annotator(
vcf = paste0(url, "resources/genome/betula/biallelic.subset.10000.vcf.gz?raw=true"),
fasta = paste0(url, "resources/genome/betula/genome.subset.1000.fasta.gz?raw=true"),
gff = paste0(url, "resources/genome/betula/genome.gff.gz?raw=true"),
annotations = c(fd$DegeneracyAnnotation()),
output = "genome.deg.vcf.gz"
)
fd$Annotator$annotate(ann)
Ancestral Allele Annotation#
Currently, two ancestral allele annotations are available: MaximumParsimonyAncestralAnnotation and MaximumLikelihoodAncestralAnnotation. The former is straightforward but susceptible to errors, and only appropriate if no outgroup information is available. Alternatively, if outgroups are missing, DFE inference can also be performed on folded spectra, but please note that this will yield less precise estimates. Ideally, we would like to use MaximumLikelihoodAncestralAnnotation, which is more sophisticated and requires one or several outgroup to be specified. Its underlying model is very similar to EST-SFS.
# example for ancestral allele annotation with outgroups
ann <- fd$Annotator(
vcf = paste0(url, "resources/genome/betula/all.with_outgroups.subset.10000.vcf.gz?raw=true"),
annotations = c(fd$MaximumLikelihoodAncestralAnnotation(
outgroups = list("ERR2103730"),
n_ingroups = 10
)),
output = "genome.aa.vcf.gz"
)
fd$Annotator$annotate(ann)
Filtrations#
fastDFE also offers a number of filtrations which can be accessed immediately while parsing. Alternatively, to filter a VCF file directly, use the Filterer class. Some useful filtrations include DeviantOutgroupFiltration, CodingSequenceFiltration, and BiasedGCConversionFiltration. For a complete list of available filtrations, refer to the API reference.
# example for filtration
f <- fd$Filterer(
vcf = paste0(url, "resources/genome/betula/biallelic.subset.10000.vcf.gz?raw=true"),
filtrations = c(fd$BiasedGCConversionFiltration()),
output = "genome.gc.vcf.gz"
)
fd$Filterer$filter(f)