VCF parsing#

Introduction#

fastDFE provides parser utilities that enable convenient parsing of frequency spectra from VCF files. By default, Parser looks at the AA tag in the VCF file’s info field to retrieve the correct polarization. Sites for which this tag is not well-defined are by default included (see skip_not_polarized). Note that non-polarized frequency spectra provide little information on the distribution of beneficial mutations, however.

We might also want to stratify the SFS by some site properties, such as site-degeneracy. This is done by passing stratifications to the parser. In this example, we will stratify the SFS by 0-fold and 4-fold degenerate sites using a VCF file for Betula spp.

import fastdfe as fd

# url to the GitHub repository of fastDFE
url = "https://github.com/Sendrowski/fastDFE/blob/dev/"
# instantiate parser
p = fd.Parser(
    n=8,
    vcf=url + "resources/genome/betula/biallelic.polarized.subset.50000.vcf.gz?raw=true",
    fasta=url + "resources/genome/betula/genome.subset.1000.fasta.gz?raw=true",
    gff=url + "resources/genome/betula/genome.gff.gz?raw=true",
    annotations=[
        fd.DegeneracyAnnotation()
    ],
    stratifications=[fd.DegeneracyStratification()]
)

# parse SFS
spectra: fd.Spectra = p.parse()
INFO:Parser: Using stratification: [neutral, selected].
INFO:Parser: Loading VCF file
INFO:FileHandler: Using cached file at /var/folders/w6/18ktl5312413jw46btlxrh59fzxvt4/T/70975cbc50c2.biallelic.polarized.subset.50000.vcf.gz
INFO:Parser: Loading GFF file
INFO:FileHandler: Using cached file at /var/folders/w6/18ktl5312413jw46btlxrh59fzxvt4/T/2d03158a125f.genome.gff.gz
INFO:Parser: Loading FASTA file
INFO:FileHandler: Using cached file at /var/folders/w6/18ktl5312413jw46btlxrh59fzxvt4/T/eb1f2b2b0185.genome.subset.1000.fasta.gz
INFO:FileHandler: Using cached file at /var/folders/w6/18ktl5312413jw46btlxrh59fzxvt4/T/70975cbc50c2.biallelic.polarized.subset.50000.vcf.gz
Parser>Counting sites: 50000it [00:01, 28368.14it/s]
Parser>Processing sites: 100%|██████████| 50000/50000 [01:38<00:00, 509.88it/s]
INFO:PolyAllelicFiltration: Filtered out 0 sites.
INFO:DegeneracyStratification: Number of sites with valid type: 13771
INFO:DegeneracyAnnotation: Annotated 19002 sites.
INFO:Parser: Skipped 1669 sites without ancestral allele information.
INFO:Parser: Included 13771 out of 50000 sites in total from the VCF file.
# visualize SFS
spectra.plot();
../../_images/669faa5d81ad12d024db9401ef7f2218014fe6fe056cf07a3f74a92944da0d00.png

fastDFE relies here on VCF info tags to determine the degeneracy of a site but this behavior can be customized (cf. DegeneracyStratification).

Stratifications#

We can use also several stratifications in tandem by specifying a list of stratifications. In this example, we will stratify the SFS by synonymy as well as base transitions type. The resulting spectra can be fed directly into fastDFE’s inference routines. See parser module for a complete list of available stratifications.

# instantiate parser
p = fd.Parser(
    n=10,
    vcf=url + "resources/genome/betula/biallelic.polarized.subset.50000.vcf.gz?raw=true",
    fasta=url + "resources/genome/betula/genome.subset.1000.fasta.gz?raw=true",
    gff=url + "resources/genome/betula/genome.gff.gz?raw=true",
    annotations=[
        fd.DegeneracyAnnotation()
    ],
    stratifications=[
        fd.DegeneracyStratification(),
        fd.AncestralBaseStratification()
    ]
)

# parse SFS
spectra: fd.Spectra = p.parse()
INFO:Parser: Using stratification: [neutral, selected].[A, C, G, T].
INFO:Parser: Loading VCF file
INFO:FileHandler: Using cached file at /var/folders/w6/18ktl5312413jw46btlxrh59fzxvt4/T/70975cbc50c2.biallelic.polarized.subset.50000.vcf.gz
INFO:Parser: Loading GFF file
INFO:FileHandler: Using cached file at /var/folders/w6/18ktl5312413jw46btlxrh59fzxvt4/T/2d03158a125f.genome.gff.gz
INFO:Parser: Loading FASTA file
INFO:FileHandler: Using cached file at /var/folders/w6/18ktl5312413jw46btlxrh59fzxvt4/T/eb1f2b2b0185.genome.subset.1000.fasta.gz
INFO:FileHandler: Using cached file at /var/folders/w6/18ktl5312413jw46btlxrh59fzxvt4/T/70975cbc50c2.biallelic.polarized.subset.50000.vcf.gz
Parser>Counting sites: 50000it [00:01, 31807.00it/s]
Parser>Processing sites: 100%|██████████| 50000/50000 [01:35<00:00, 525.23it/s]
INFO:PolyAllelicFiltration: Filtered out 0 sites.
INFO:DegeneracyStratification: Number of sites with valid type: 13771
INFO:AncestralBaseStratification: Number of sites with valid type: 13771
INFO:DegeneracyAnnotation: Annotated 19002 sites.
INFO:Parser: Skipped 1666 sites without ancestral allele information.
INFO:Parser: Included 13771 out of 50000 sites in total from the VCF file.
# visualize SFS
spectra.plot();
../../_images/8e39748b94d2355b1f448a22f892ec4e26cc98e91cfaef5fbff67c1922e8e215.png

Note that fastDFE requires the ancestral state of sites to be determined. The Parser achieves this by examining the AA field, although this behavior can be customized.

Annotations#

fastDFE provides a number of annotations accessible directly during the parsing process. To annotate a VCF file directly, consider using the Annotator class.

Degeneracy Annotation#

DegeneracyAnnotation annotates the SFS by the degeneracy of the site. This annotation requires information from a FASTA and GFF file and is useful for stratifying the SFS by 0-fold and 4-fold degenerate sites which is what we often want to do when inferring the DFE (see DegeneracyStratification).

# example for degeneracy annotation
ann = fd.Annotator(
    vcf=url + "resources/genome/betula/biallelic.subset.10000.vcf.gz?raw=true",
    fasta=url + "resources/genome/betula/genome.subset.1000.fasta.gz?raw=true",
    gff=url + "resources/genome/betula/genome.gff.gz?raw=true",
    annotations=[fd.DegeneracyAnnotation()],
    output="genome.deg.vcf.gz"
)

ann.annotate()
INFO:Annotator: Start annotating
INFO:Annotator: Loading GFF file
INFO:FileHandler: Using cached file at /var/folders/w6/18ktl5312413jw46btlxrh59fzxvt4/T/2d03158a125f.genome.gff.gz
INFO:Annotator: Loading FASTA file
INFO:FileHandler: Using cached file at /var/folders/w6/18ktl5312413jw46btlxrh59fzxvt4/T/eb1f2b2b0185.genome.subset.1000.fasta.gz
INFO:Annotator: Loading VCF file
INFO:FileHandler: Downloading file from https://github.com/Sendrowski/fastDFE/blob/dev/resources/genome/betula/biallelic.subset.10000.vcf.gz?raw=true
Annotator>Downloading file: 100%|██████████| 14.5M/14.5M [00:07<00:00, 1.93MB/s]
INFO:FileHandler: Cached file at /var/folders/w6/18ktl5312413jw46btlxrh59fzxvt4/T/7a41c74709e8.biallelic.subset.10000.vcf.gz
INFO:FileHandler: Using cached file at /var/folders/w6/18ktl5312413jw46btlxrh59fzxvt4/T/7a41c74709e8.biallelic.subset.10000.vcf.gz
Annotator>Counting sites: 10000it [00:00, 34184.99it/s]
Annotator>Processing sites: 100%|██████████| 10000/10000 [00:03<00:00, 2631.79it/s]
INFO:DegeneracyAnnotation: Annotated 3566 sites.

Ancestral Allele Annotation#

Currently, two ancestral allele annotations are available: MaximumParsimonyAncestralAnnotation and MaximumLikelihoodAncestralAnnotation. The former is straightforward but susceptible to errors, and only appropriate if no outgroup information is available. Alternatively, if outgroups are missing, DFE inference can also be performed on folded spectra, but please note that this will yield less precise estimates. Ideally, we would like to use MaximumLikelihoodAncestralAnnotation, which is more sophisticated and requires one or several outgroup to be specified. Its underlying model is very similar to EST-SFS.

# example for ancestral allele annotation with outgroups
ann = fd.Annotator(
    vcf=url + "resources/genome/betula/all.with_outgroups.subset.10000.vcf.gz?raw=true",
    annotations=[fd.MaximumLikelihoodAncestralAnnotation(
        outgroups=["ERR2103730"],
        n_ingroups=10
    )],
    output="genome.aa.vcf.gz"
)

ann.annotate()
INFO:Annotator: Start annotating
INFO:Annotator: Loading VCF file
INFO:FileHandler: Downloading file from https://github.com/Sendrowski/fastDFE/blob/dev/resources/genome/betula/all.with_outgroups.subset.10000.vcf.gz?raw=true
Annotator>Downloading file: 100%|██████████| 4.60M/4.60M [00:00<00:00, 17.8MB/s]
INFO:FileHandler: Cached file at /var/folders/w6/18ktl5312413jw46btlxrh59fzxvt4/T/a3b0e449a317.all.with_outgroups.subset.10000.vcf.gz
INFO:Annotator: Loading VCF file
INFO:FileHandler: Using cached file at /var/folders/w6/18ktl5312413jw46btlxrh59fzxvt4/T/a3b0e449a317.all.with_outgroups.subset.10000.vcf.gz
INFO:MaximumLikelihoodAncestralAnnotation: Subsampling 10 ingroup haplotypes probabilistically from 378 individuals in total.
INFO:MaximumLikelihoodAncestralAnnotation: Using 1 outgroup samples (ERR2103730).
INFO:FileHandler: Using cached file at /var/folders/w6/18ktl5312413jw46btlxrh59fzxvt4/T/a3b0e449a317.all.with_outgroups.subset.10000.vcf.gz
Annotator>Counting sites: 10000it [00:00, 88694.19it/s]
MaximumLikelihoodAncestralAnnotation>Parsing sites: 100%|██████████| 10000/10000 [00:28<00:00, 351.43it/s]
INFO:MaximumLikelihoodAncestralAnnotation: Included 6688 sites for the inference.
MaximumLikelihoodAncestralAnnotation>Optimizing rates: 100%|██████████| 10/10 [00:00<00:00, 27.84it/s]
Annotator>Processing sites: 100%|██████████| 10000/10000 [00:11<00:00, 887.02it/s]
INFO:MaximumLikelihoodAncestralAnnotation: Annotated 10000 sites.
INFO:MaximumLikelihoodAncestralAnnotation: There were 195 mismatches between the most likely ancestral allele and the ad-hoc ancestral allele annotation.

Filtrations#

fastDFE also offers a number of filtrations which can be accessed immediately while parsing. Alternatively, to filter a VCF file directly, use the Filterer class. Some useful filtrations include DeviantOutgroupFiltration, CodingSequenceFiltration, and BiasedGCConversionFiltration. For a complete list of available filtrations, refer to the API reference.

# example for filtration
f = fd.Filterer(
    vcf=url + "resources/genome/betula/biallelic.subset.10000.vcf.gz?raw=true",
    filtrations=[fd.BiasedGCConversionFiltration()],
    output="genome.gc.vcf.gz"
)

f.filter()
INFO:Filterer: Start filtering
INFO:Filterer: Loading VCF file
INFO:FileHandler: Using cached file at /var/folders/w6/18ktl5312413jw46btlxrh59fzxvt4/T/7a41c74709e8.biallelic.subset.10000.vcf.gz
INFO:FileHandler: Using cached file at /var/folders/w6/18ktl5312413jw46btlxrh59fzxvt4/T/7a41c74709e8.biallelic.subset.10000.vcf.gz
Filterer>Counting sites: 10000it [00:00, 33500.75it/s]
Filterer>Processing sites: 100%|██████████| 10000/10000 [00:01<00:00, 6810.30it/s]
INFO:BiasedGCConversionFiltration: Filtered out 7903 sites.
INFO:Filterer: Filtered out 7903 of 10000 sites in total.

Note that all components can easily be customized by extending the corresponding base class.