VCF annotation

Contents

VCF annotation#

VCF annotations and an annotator to apply them.

class Annotation[source]#

Bases: ABC

Base class for annotations.

__init__()[source]#

Create a new annotation instance.

n_annotated: int#

The number of annotated sites.

abstract annotate_site(variant: cyvcf2.Variant | DummyVariant)[source]#

Annotate a single site.

Parameters:

variant (Union[cyvcf2.Variant, DummyVariant]) – The variant to annotate.

Returns:

The annotated variant.

static count_target_sites(file: str, remove_overlaps: bool = False, contigs: List[str] = None)[source]#

Count the number of target sites in a GFF file.

Parameters:
  • file (str) – The path to The GFF file path, possibly gzipped or a URL

  • remove_overlaps (bool) – Whether to remove overlapping target sites.

  • contigs (List[str]) – The contigs to count the target sites for.

Return type:

Dict[str, int]

Returns:

The number of target sites per chromosome/contig.

class DegeneracyAnnotation[source]#

Bases: Annotation

Degeneracy annotation. We annotate the degeneracy by looking at each codon for coding variants. This also annotates mono-allelic sites.

This annotation adds the info fields Degeneracy and Degeneracy_Info, which hold the degeneracy of a site (0, 2, 4) and extra information about the degeneracy, respectively. To be used with DegeneracyStratification.

For this annotation to work, we require a FASTA and GFF file (passed to Parser or Annotator).

Example usage:

import fastdfe as fd

ann = fd.Annotator(
    vcf="http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/"
        "1000_genomes_project/release/20181203_biallelic_SNV/"
        "ALL.chr21.shapeit2_integrated_v1a.GRCh38.20181129.phased.vcf.gz",
    fasta="http://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/"
          "dna/Homo_sapiens.GRCh38.dna.chromosome.21.fa.gz",
    gff="http://ftp.ensembl.org/pub/release-109/gff3/homo_sapiens/"
        "Homo_sapiens.GRCh38.109.chromosome.21.gff3.gz",
    output='sapiens.chr21.degeneracy.vcf.gz',
    annotations=[fd.DegeneracyAnnotation()],
    aliases=dict(chr21=['21'])
)

ann.annotate()
__init__()[source]#

Create a new annotation instance.

mismatches: List['cyvcf2.Variant']#

The variants that could not be annotated correctly.

n_skipped: int#

The variant that were skipped because they were not in coding regions.

errors: List['cyvcf2.Variant']#

The variants for which the codon could not be determined.

annotate_site(v: cyvcf2.Variant | DummyVariant)[source]#

Annotate a single site.

Parameters:

v (Union[cyvcf2.Variant, DummyVariant]) – The variant to annotate.

static count_target_sites(file: str, remove_overlaps: bool = False, contigs: List[str] = None)#

Count the number of target sites in a GFF file.

Parameters:
  • file (str) – The path to The GFF file path, possibly gzipped or a URL

  • remove_overlaps (bool) – Whether to remove overlapping target sites.

  • contigs (List[str]) – The contigs to count the target sites for.

Return type:

Dict[str, int]

Returns:

The number of target sites per chromosome/contig.

n_annotated: int#

The number of annotated sites.

class SynonymyAnnotation[source]#

Bases: DegeneracyAnnotation

Synonymy annotation. This class annotates a variant with the synonymous/non-synonymous status.

This annotation adds the info fields Synonymous and Synonymous_Info, which hold the synonymy (Synonymous [0] or non-synonymous [1]) and the codon information, respectively. To be used with SynonymyStratification.

For this annotation to work, we require a FASTA and GFF file (passed to Parser or Annotator).

This class was tested against VEP and SnpEff and provides the same annotations in almost all cases.

Warning

Not recommended for use with Parser as we also need to annotate mono-allelic sites. Consider using DegeneracyAnnotation and DegeneracyStratification instead.

__init__()[source]#

Create a new annotation instance.

vep_mismatches: List['cyvcf2.Variant']#

The number of sites that did not match with VEP.

snpeff_mismatches: List['cyvcf2.Variant']#

The number of sites that did not math with the annotation provided by SnpEff

n_vep_comparisons: int#

The number of sites that were concordant with VEP.

n_snpeff_comparisons: int#

The number of sites that were concordant with SnpEff.

static mutate(codon: str, alt: str, pos: int)[source]#

Mutate the codon at the given position to the given alternative allele.

Parameters:
  • codon (str) – The codon to mutate.

  • alt (str) – The alternative allele.

  • pos (int) – The position to mutate.

Return type:

str

Returns:

Mutated codon.

static is_synonymous(codon1: str, codon2: str)[source]#

Check if two codons are synonymous.

Parameters:
  • codon1 (str) – The first codon.

  • codon2 (str) – The second codon.

Return type:

bool

Returns:

True if the codons are synonymous, False otherwise.

annotate_site(v: cyvcf2.Variant | DummyVariant)[source]#

Annotate a single site.

Parameters:

v (Union[cyvcf2.Variant, DummyVariant]) – The variant to annotate.

Returns:

The annotated variant.

static count_target_sites(file: str, remove_overlaps: bool = False, contigs: List[str] = None)#

Count the number of target sites in a GFF file.

Parameters:
  • file (str) – The path to The GFF file path, possibly gzipped or a URL

  • remove_overlaps (bool) – Whether to remove overlapping target sites.

  • contigs (List[str]) – The contigs to count the target sites for.

Return type:

Dict[str, int]

Returns:

The number of target sites per chromosome/contig.

mismatches: List['cyvcf2.Variant']#

The variants that could not be annotated correctly.

n_skipped: int#

The variant that were skipped because they were not in coding regions.

errors: List['cyvcf2.Variant']#

The variants for which the codon could not be determined.

n_annotated: int#

The number of annotated sites.

class AncestralAlleleAnnotation[source]#

Bases: Annotation, ABC

Base class for ancestral allele annotation.

__init__()#

Create a new annotation instance.

abstract annotate_site(variant: cyvcf2.Variant | DummyVariant)#

Annotate a single site.

Parameters:

variant (Union[cyvcf2.Variant, DummyVariant]) – The variant to annotate.

Returns:

The annotated variant.

static count_target_sites(file: str, remove_overlaps: bool = False, contigs: List[str] = None)#

Count the number of target sites in a GFF file.

Parameters:
  • file (str) – The path to The GFF file path, possibly gzipped or a URL

  • remove_overlaps (bool) – Whether to remove overlapping target sites.

  • contigs (List[str]) – The contigs to count the target sites for.

Return type:

Dict[str, int]

Returns:

The number of target sites per chromosome/contig.

n_annotated: int#

The number of annotated sites.

class MaximumParsimonyAncestralAnnotation(samples: List[str] = None)[source]#

Bases: AncestralAlleleAnnotation

Annotation of ancestral alleles using maximum parsimony. To be used with AncestralBaseStratification and Annotator or Parser.

Note that maximum parsimony is not a reliable way to determine ancestral alleles, so it is recommended to use this annotation together with the ancestral misidentification parameter eps or to fold spectra altogether. Alternatively, you can use DeviantOutgroupFiltration to filter out sites where the major allele among outgroups does not coincide with the major allele among ingroups. This annotation has the advantage of requiring no outgroup data. If outgroup data is available consider using :class`MLEAncestralAlleleAnnotation` instead.

__init__(samples: List[str] = None)[source]#

Create a new ancestral allele annotation instance.

Parameters:

samples (List[str]) – The samples to consider when determining the ancestral allele. If None, all samples are considered.

samples: Optional[List[str]]#

The samples to consider when determining the ancestral allele.

annotate_site(variant: cyvcf2.Variant | DummyVariant)[source]#

Annotate a single site.

Parameters:

variant (Union[cyvcf2.Variant, DummyVariant]) – The variant to annotate.

Returns:

The annotated variant.

static count_target_sites(file: str, remove_overlaps: bool = False, contigs: List[str] = None)#

Count the number of target sites in a GFF file.

Parameters:
  • file (str) – The path to The GFF file path, possibly gzipped or a URL

  • remove_overlaps (bool) – Whether to remove overlapping target sites.

  • contigs (List[str]) – The contigs to count the target sites for.

Return type:

Dict[str, int]

Returns:

The number of target sites per chromosome/contig.

n_annotated: int#

The number of annotated sites.

class SubstitutionModel(bounds: Dict[str, Tuple[float, float]] = {}, pool_branch_rates: bool = False, fixed_params: Dict[str, float] = {})[source]#

Bases: ABC

Base class for substitution models.

__init__(bounds: Dict[str, Tuple[float, float]] = {}, pool_branch_rates: bool = False, fixed_params: Dict[str, float] = {})[source]#

Create a new substitution model instance.

Parameters:
  • bounds (Dict[str, Tuple[float, float]]) – The bounds for the parameters.

  • pool_branch_rates (bool) – Whether to pool the branch rates. By default, each branch has its own rate which is optimized using MLE. If True, the branch rates are pooled and a single rate is optimized. This is useful if the number of sites used is small.

  • fixed_params (Dict[str, float]) – The fixed parameters. Parameters that are not fixed are optimized using MLE.

pool_branch_rates: bool#

Whether to pool the branch rates.

fixed_params: Dict[str, float]#

The fixed parameters.

bounds: Dict[str, Tuple[float, float]]#

Parameter bounds.

cache(params: Dict[str, float], n_branches: int)[source]#

Cache the probabilities for the given parameters.

Parameters:
  • params (Dict[str, float]) – The parameters.

  • n_branches (int) – The number of branches.

static get_x0(bounds: Dict[str, Tuple[float, float]], rng: Generator)[source]#

Get the initial values for the parameters.

Parameters:
  • bounds (Dict[str, Tuple[float, float]]) – The bounds for the parameters.

  • rng (Generator) – The random number generator.

Return type:

Dict[str, float]

Returns:

The initial values.

get_bounds(n_outgroups: int)[source]#

Get the bounds for the parameters.

Parameters:

n_outgroups (int) – The number of outgroups.

Return type:

Dict[str, Tuple[float, float]]

Returns:

The bounds.

static validate_bounds(bounds: Dict[str, Tuple[float, float]])[source]#

Make sure the lower bounds are positive and the upper bounds are larger than the lower bounds.

Parameters:

bounds (Dict[str, Tuple[float, float]]) – The bounds to validate

Raises:

ValueError – If the bounds are invalid

class JCSubstitutionModel(bounds: Dict[str, Tuple[float, float]] = {'K': (1e-05, 10)}, pool_branch_rates: bool = False, fixed_params: Dict[str, float] = {})[source]#

Bases: SubstitutionModel

Jukes-Cantor substitution model.

__init__(bounds: Dict[str, Tuple[float, float]] = {'K': (1e-05, 10)}, pool_branch_rates: bool = False, fixed_params: Dict[str, float] = {})[source]#

Create a new substitution model instance.

Parameters:
  • bounds (Dict[str, Tuple[float, float]]) – The bounds for the parameters. K is the branch rate.

  • pool_branch_rates (bool) – Whether to pool the branch rates. By default, each branch has its own rate which is optimized using MLE. If True, the branch rates are pooled and a single rate is optimized. This is useful if the number of sites used is small. If False, each branch has its own rate denoted by “K{i}”, where i is the branch index. If True, the branch rate is denoted by “K”.

  • fixed_params (Dict[str, float]) – The fixed parameters. Parameters that are not fixed are optimized using MLE.

get_bound(param: str)[source]#

Get the bounds for a parameter.

Parameters:

param (str) – The parameter.

Return type:

Tuple[float, float]

Returns:

The lower and upper bounds.

get_bounds(n_outgroups: int)[source]#

Get the bounds for the parameters.

Parameters:

n_outgroups (int) – The number of outgroups.

Return type:

Dict[str, Tuple[float, float]]

Returns:

The lower and upper bounds.

cache(params: Dict[str, float], n_branches: int)#

Cache the probabilities for the given parameters.

Parameters:
  • params (Dict[str, float]) – The parameters.

  • n_branches (int) – The number of branches.

static get_x0(bounds: Dict[str, Tuple[float, float]], rng: Generator)#

Get the initial values for the parameters.

Parameters:
  • bounds (Dict[str, Tuple[float, float]]) – The bounds for the parameters.

  • rng (Generator) – The random number generator.

Return type:

Dict[str, float]

Returns:

The initial values.

static validate_bounds(bounds: Dict[str, Tuple[float, float]])#

Make sure the lower bounds are positive and the upper bounds are larger than the lower bounds.

Parameters:

bounds (Dict[str, Tuple[float, float]]) – The bounds to validate

Raises:

ValueError – If the bounds are invalid

pool_branch_rates: bool#

Whether to pool the branch rates.

fixed_params: Dict[str, float]#

The fixed parameters.

bounds: Dict[str, Tuple[float, float]]#

Parameter bounds.

class K2SubstitutionModel(bounds: Dict[str, Tuple[float, float]] = {'K': (1e-05, 10), 'k': (0.1, 10)}, pool_branch_rates: bool = False, fixed_params: Dict[str, float] = {}, fix_transition_transversion_ratio: bool = False)[source]#

Bases: JCSubstitutionModel

Kimura 2-parameter substitution model.

__init__(bounds: Dict[str, Tuple[float, float]] = {'K': (1e-05, 10), 'k': (0.1, 10)}, pool_branch_rates: bool = False, fixed_params: Dict[str, float] = {}, fix_transition_transversion_ratio: bool = False)[source]#

Create a new substitution model instance.

Parameters:
  • bounds (Dict[str, Tuple[float, float]]) – The bounds for the parameters. K{i} are the branch rates. k is the transition/transversion ratio.

  • pool_branch_rates (bool) – Whether to pool the branch rates. By default, each branch has its own rate which is optimized using MLE. If True, the branch rates are pooled and a single rate is optimized. This is useful if the number of sites used is small.

  • fixed_params (Dict[str, float]) – The fixed parameters. Parameters that are not fixed are optimized using MLE.

  • fix_transition_transversion_ratio (bool) – Whether to fix the transition/transversion ratio to the ratio observed in the data.

fix_transition_transversion_ratio: bool#

Whether to fix the transition/transversion ratio to the ratio observed in the data.

get_bounds(n_outgroups: int)[source]#

Get the bounds for the parameters.

Parameters:

n_outgroups (int) – The number of outgroups.

Return type:

Dict[str, Tuple[float, float]]

Returns:

The lower and upper bounds.

cache(params: Dict[str, float], n_branches: int)#

Cache the probabilities for the given parameters.

Parameters:
  • params (Dict[str, float]) – The parameters.

  • n_branches (int) – The number of branches.

get_bound(param: str)#

Get the bounds for a parameter.

Parameters:

param (str) – The parameter.

Return type:

Tuple[float, float]

Returns:

The lower and upper bounds.

static get_x0(bounds: Dict[str, Tuple[float, float]], rng: Generator)#

Get the initial values for the parameters.

Parameters:
  • bounds (Dict[str, Tuple[float, float]]) – The bounds for the parameters.

  • rng (Generator) – The random number generator.

Return type:

Dict[str, float]

Returns:

The initial values.

static validate_bounds(bounds: Dict[str, Tuple[float, float]])#

Make sure the lower bounds are positive and the upper bounds are larger than the lower bounds.

Parameters:

bounds (Dict[str, Tuple[float, float]]) – The bounds to validate

Raises:

ValueError – If the bounds are invalid

pool_branch_rates: bool#

Whether to pool the branch rates.

fixed_params: Dict[str, float]#

The fixed parameters.

bounds: Dict[str, Tuple[float, float]]#

Parameter bounds.

class SiteConfig(n_major: int, major_base: int, minor_base: int, outgroup_bases: ndarray, multiplicity: float = 1.0, sites: ndarray = None, p_minor: float = nan, p_major: float = nan)[source]#

Bases: object

Ancestral allele site configuration for a single subsample.

__init__(n_major: int, major_base: int, minor_base: int, outgroup_bases: ndarray, multiplicity: float = 1.0, sites: ndarray = None, p_minor: float = nan, p_major: float = nan)[source]#

Create a new site configuration instance.

n_major: int#

The number of major alleles.

major_base: int#

The major allele base index.

minor_base: int#

The minor base index.

outgroup_bases: ndarray#

The outgroup base indices.

multiplicity: float#

The multiplicity of the site.

sites: ndarray#

The site indices.

p_minor: float#

The probability of the minor allele.

p_major: float#

The probability of the major allele.

class SiteInfo(n_major: Dict[int, float], major_base: str, minor_base: str, outgroup_bases: List[str], p_minor: float = nan, p_major: float = nan, p_major_ancestral: float = nan, major_ancestral: str = '.', p_bases_first_node: Dict[str, float] = None, p_first_node_ancestral: float = nan, first_node_ancestral: str = '.', rate_params: Dict[str, float] = None)[source]#

Bases: object

Ancestral allele information on a single site.

__init__(n_major: Dict[int, float], major_base: str, minor_base: str, outgroup_bases: List[str], p_minor: float = nan, p_major: float = nan, p_major_ancestral: float = nan, major_ancestral: str = '.', p_bases_first_node: Dict[str, float] = None, p_first_node_ancestral: float = nan, first_node_ancestral: str = '.', rate_params: Dict[str, float] = None)[source]#
n_major: Dict[int, float]#

Dictionary mapping number of major alleles to its probability of observation.

major_base: str#

The major allele base.

minor_base: str#

The minor base index.

outgroup_bases: List[str]#

The outgroup base indices.

p_minor: float#

The probability of the minor allele being the ancestral allele (without prior).

p_major: float#

The probability of the major allele being the ancestral allele (without prior).

p_major_ancestral: float#

The probability of the major allele being the ancestral allele rather than the minor allele (possibly with prior if specified).

major_ancestral: str#

The predicted ancestral base based on comparing major and minor allele.

p_bases_first_node: Dict[str, float]#

The probability of each base being the ancestral base for the first internal node.

p_first_node_ancestral: float#

The probability that the mostly likely base for the first internal node is the ancestral base.

first_node_ancestral: str#

The ancestral base index for the first internal node.

rate_params: Dict[str, float]#

The branch rates.

plot_tree(ax: plt.Axes = None, show: bool = True)[source]#

Plot the tree for a site. Only Python visualization is supported.

Parameters:
  • self – The site information.

  • ax (plt.Axes) – Axes to plot on.

  • show (bool) – Whether to show the plot.

class BaseType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#

Bases: Enum

The base type, either major or minor.

MINOR: int = 0#
MAJOR: int = 1#
class PolarizationPrior(allow_divergence: bool = False)[source]#

Bases: ABC

Base class for priors used with :class:MaximumLikelihoodAncestralAnnotation. These priors incorporate information about the general probability of the major allele being ancestral across all sites with the same minor allele count. Prior thus take ingroup allele frequencies into account, when making predictions about the ancestral state of a site. This is useful because it enhances ancestral allele probability estimates, especially when outgroup information is unavailable for a particular site. Knowing the likelihood of the major allele being ancestral in general allows for more informed estimations.

__init__(allow_divergence: bool = False)[source]#

Create a new instance.

Parameters:

allow_divergence (bool) –

Whether to allow divergence. If True, the probability of the minor allele being ancestral, which is not contained in the ingroup subsample but rather in all specified ingroups or among the outgroup, is taken to be the same as if it was present in the ingroup subsample with frequency 1. This is a hack, but allows us to consider alleles that are not present in the ingroup subsample.

Warning

Setting this to True greatly increases the probability of high-frequency derived alleles which introduces a strong bias in the distribution of frequency counts, e.g., the SFS. Only use this if you’re interested in handling divergence counts, i.e., sites where the ingroup is mono-allelic.

allow_divergence: bool#

Whether to allow divergence.

probabilities: ndarray | None#

The polarization probabilities.

plot(file: str = None, show: bool = True, title: str = 'polarization probabilities', scale: Literal['lin', 'log'] = 'lin', ax: plt.Axes = None, ylabel: str = 'p')[source]#

Visualize the polarization probabilities using a scatter plot.

Parameters:
  • scale (Literal['lin', 'log']) – y-scale of the plot.

  • title (str) – Plot title.

  • file (str) – File to save plot to.

  • show (bool) – Whether to show plot.

  • ax (plt.Axes) – Axes to plot on. Only for Python visualization backend.

  • ylabel (str) – y-axis label.

Return type:

plt.Axes

Returns:

Axes object

class KingmanPolarizationPrior(allow_divergence: bool = False)[source]#

Bases: PolarizationPrior

Prior based on the standard Kingman coalescent. To be used with MaximumLikelihoodAncestralAnnotation.

__init__(allow_divergence: bool = False)#

Create a new instance.

Parameters:

allow_divergence (bool) –

Whether to allow divergence. If True, the probability of the minor allele being ancestral, which is not contained in the ingroup subsample but rather in all specified ingroups or among the outgroup, is taken to be the same as if it was present in the ingroup subsample with frequency 1. This is a hack, but allows us to consider alleles that are not present in the ingroup subsample.

Warning

Setting this to True greatly increases the probability of high-frequency derived alleles which introduces a strong bias in the distribution of frequency counts, e.g., the SFS. Only use this if you’re interested in handling divergence counts, i.e., sites where the ingroup is mono-allelic.

plot(file: str = None, show: bool = True, title: str = 'polarization probabilities', scale: Literal['lin', 'log'] = 'lin', ax: plt.Axes = None, ylabel: str = 'p')#

Visualize the polarization probabilities using a scatter plot.

Parameters:
  • scale (Literal['lin', 'log']) – y-scale of the plot.

  • title (str) – Plot title.

  • file (str) – File to save plot to.

  • show (bool) – Whether to show plot.

  • ax (plt.Axes) – Axes to plot on. Only for Python visualization backend.

  • ylabel (str) – y-axis label.

Return type:

plt.Axes

Returns:

Axes object

allow_divergence: bool#

Whether to allow divergence.

probabilities: ndarray | None#

The polarization probabilities.

class AdaptivePolarizationPrior(n_runs: int = 1, parallelize: bool = True, allow_divergence: bool = False, seed: int | None = 0)[source]#

Bases: PolarizationPrior

Adaptive prior. To be used with MaximumLikelihoodAncestralAnnotation. This is the same prior as used in the EST-SFS paper. This prior is adaptive in the sense that the most likely polarization probabilities given the site configurations are found. This is the most accurate prior, but requires a lot of sites in order to work properly. You can check that the polarization probabilities are smooth enough across frequency counts by calling plot(). If they are not smooth enough, you can increase the number of sites, decrease the number of ingroups, or use KingmanPolarizationPrior instead.

Note

In practice, this prior provides very similar results to KingmanPolarizationPrior in most cases.

__init__(n_runs: int = 1, parallelize: bool = True, allow_divergence: bool = False, seed: int | None = 0)[source]#

Create a new adaptive prior instance.

Parameters:
  • n_runs (int) – The number of runs to perform when determining the polarization parameters. One run should be sufficient as only one parameter is optimized.

  • parallelize (bool) – Whether to parallelize the optimization.

  • allow_divergence (bool) – Whether to allow divergence. See PolarizationPrior for details.

  • seed (int | None) – The seed for the random number generator.

n_runs: int#

The number of runs to use for the adaptive prior.

parallelize: bool#

Whether to parallelize the optimization.

seed: int | None#

The seed for the random number generator.

rng: Generator#

The random number generator.

plot(file: str = None, show: bool = True, title: str = 'polarization probabilities', scale: Literal['lin', 'log'] = 'lin', ax: plt.Axes = None, ylabel: str = 'p')#

Visualize the polarization probabilities using a scatter plot.

Parameters:
  • scale (Literal['lin', 'log']) – y-scale of the plot.

  • title (str) – Plot title.

  • file (str) – File to save plot to.

  • show (bool) – Whether to show plot.

  • ax (plt.Axes) – Axes to plot on. Only for Python visualization backend.

  • ylabel (str) – y-axis label.

Return type:

plt.Axes

Returns:

Axes object

allow_divergence: bool#

Whether to allow divergence.

probabilities: ndarray | None#

The polarization probabilities.

class MaximumLikelihoodAncestralAnnotation(outgroups: List[str], n_ingroups: int = 11, ingroups: List[str] | None = None, exclude: List[str] | None = None, n_runs: int = 10, model: SubstitutionModel = None, parallelize: bool = True, prior: PolarizationPrior | None = '', max_sites: int = 10000, seed: int | None = 0, confidence_threshold: float = 0, n_target_sites: int | None = None, n_samples_target_sites: int | None = 100000, adjust_target_sites: bool = True, subsample_mode: Literal['random', 'probabilistic'] = 'probabilistic')[source]#

Bases: _OutgroupAncestralAlleleAnnotation

Annotation of ancestral alleles following the probabilistic model of EST-SFS (https://doi.org/10.1534/genetics.118.301120). Note that only bi-allelic SNPs are supported by this model. By default, the info field AA (see Annotator.info_ancestral) is added to the VCF file, which holds the ancestral allele. To be used with Annotator or Parser. The info field AA_prob holds the probability that the predicted ancestral allele (AA tag) is correct, as opposed to the other allele. This probability can be also be used by Parser to polarize the SFS according to the ancestral allele probability. In addition to the ancestral allele, the info field AA_info is added, which contains additional information about the ancestral allele (see SiteInfo for an overview of the available information). This class can also be used independently, see the from_dataframe(), from_data() and from_est_sfs() methods.

Initially, the branch rates are determined using MLE. Similar to Parser, we can also specify the number of mutational target sites (see the n_target_sites argument) in case our VCF file does not contain the full set of monomorphic sites. This is necessary to obtain realistic branch rate estimates. You can also choose a prior for the polarization probabilities (see PolarizationPrior). Eventually, for every site, the probability that the major allele is ancestral is calculated.

When annotating the variants of a VCF file, we check the most likely ancestral allele against a naive ad-hoc ancestral allele annotation, and record the sites for which we have disagreement. You might want to sanity-check the mismatches to make sure the model has been properly specified (see mismatches).

Note

  • The polarization prior corresponds to the Kingman coalescent probability by default. Using an adaptive prior, as in the EST-SFS paper, is also possible, but this is only recommended if the number of sites used for the inference is large (see prior).

  • The model can only handle sites that have at most 2 alleles across the in- and outgroups, so sites with more than 2 alleles are ignored. Only variants that are at most bi-allelic in the provided in- and outgroups are annotated.

  • The model determines the probability of the major allele being ancestral opposed to the minor allele. This can be problematic if the actual ancestral allele is not contained in the ingroup (possibly due to subsampling). To avoid this issue, we also keep track of potential minor alleles at frequency 0. If we were to ignore this, it would be impossible to infer divergence, i.e. fixed derived allele that are no longer observed in the ingroups (see PolarizationPrior.allow_divergence). That said, divergence counts are not informative on DFE inference with fastDFE and allow_divergence should not be set to True if interested in the SFS.

  • The model assumes a single coalescent topology for all sites, in which all outgroups coalesce first with the ingroup and not with each other. It is important to specify the outgroups in order of increasing divergence and not to select outgroups that are not much more closely related to each other than to the ingroup (as this would give rise to a different coalescent topology than the one assumed). You can call get_outgroup_divergence() after the inference to check the estimated branch rates for each outgroup. The assumption of a single fixed topology should be good enough provided that in- and outgroups are sufficiently diverged.

Example usage:

import fastdfe as fd

ann = fd.Annotator(
    vcf="https://github.com/Sendrowski/fastDFE/"
        "blob/dev/resources/genome/betula/all."
        "with_outgroups.subset.10000.vcf.gz?raw=true",
    annotations=[fd.MaximumLikelihoodAncestralAnnotation(
        outgroups=["ERR2103730"],
        n_ingroups=15
    )],
    output="genome.polarized.vcf.gz"
)

ann.annotate()
__init__(outgroups: List[str], n_ingroups: int = 11, ingroups: List[str] | None = None, exclude: List[str] | None = None, n_runs: int = 10, model: SubstitutionModel = None, parallelize: bool = True, prior: PolarizationPrior | None = '', max_sites: int = 10000, seed: int | None = 0, confidence_threshold: float = 0, n_target_sites: int | None = None, n_samples_target_sites: int | None = 100000, adjust_target_sites: bool = True, subsample_mode: Literal['random', 'probabilistic'] = 'probabilistic')[source]#

Create a new ancestral allele annotation instance.

Parameters:
  • outgroups (List[str]) – The outgroup samples to consider when determining the ancestral allele in the order of increasing divergence. A list of sample names as they appear in the VCF file. The order of the outgroups is important as it determines the order of the branches in the tree, whose rates are optimized, and whose topology is predetermined. The first outgroup is the closest outgroup to the ingroups, and the last outgroup is the most distant outgroup. More outgroups lead to a more accurate inference of the ancestral allele, but also increase the computational cost. Using more than 1 outgroup is recommended, but more than 3 is likely not necessary. Sites where these outgroups are not present are not included when optimizing the rate parameters. Due to assumptions on the tree topology connecting the in- and outgroups, it is important that the outgroups are not much more closely related to each other than to the ingroups. Ideally, the optimized branch rates are show markedly different values, and in any case, they should be monotonically increasing with the outgroups (see get_outgroup_divergence()).

  • n_ingroups (int) – The minimum number of ingroups that must be present at a site for it to be considered for ancestral allele inference. The ingroup subsampling is necessary since our model requires an equal number of ingroups for all sites. Note that a larger number of ingroups does not necessarily improve the accuracy of the ancestral allele inference (see prior). A larger number of ingroups can lead to a large variance in the polarization probabilities, across different frequency counts. n_ingroups should thus only be large if the number of sites used for the inference is also large. A sensible value for a reasonably large number of sites (a few thousand) is 10 or perhaps 20 for a larger numbers of sites. Very small values can lead to the ingroup subsamples not being representative of the actual allele frequencies at a site, especially when not using probabilistic subsampling (see subsample_mode). This value also influences the number of frequency bins used for the polarization probabilities, and should thus not be too small. Note that if ingroups is an even number, the major allele is chosen arbitrarily if the number of major alleles is equal to the number of minor alleles. To avoid this, you can use an odd number of ingroups.

  • ingroups (Optional[List[str]]) – The ingroup samples to consider when determining the ancestral allele. If None, all (non-outgroup) samples are considered. A list of sample names as they appear in the VCF file. Has to be at least as large as n_ingroups.

  • exclude (Optional[List[str]]) – Samples to exclude from the ingroup. A list of sample names as they appear in the VCF file.

  • n_runs (int) – The number of optimization runs to perform when determining the branch rates. You can check that the likelihoods of the different runs are similar by calling plot_likelihoods().

  • model (SubstitutionModel) – The substitution model to use. By default, K2SubstitutionModel is used.

  • parallelize (bool) – Whether to parallelize the computation across multiple cores.

  • prior (PolarizationPrior | None) – The prior to use for the polarization probabilities. See KingmanPolarizationPrior and AdaptivePolarizationPrior for more information. By default, KingmanPolarizationPrior is used. Use None for no prior.

  • max_sites (int) – The maximum number of sites to consider. This is useful if the number of sites is very large. Choosing a reasonably large subset of sites (on the order of a few thousand bi-allelic sites) can speed up the computation considerably as parsing can be slow. This subset is then used to calibrate the rate parameters, and possibly the polarization priors.

  • seed (int | None) – The seed for the random number generator. If None, a random seed is chosen. By default, the seed is set to 0.

  • confidence_threshold (float) –

    The confidence threshold for the ancestral allele annotation. Only if the probability of the major allele being ancestral as opposed to the minor allele is not within ((1 - confidence_threshold) / 2, 1 - (1 - confidence_threshold) / 2), the ancestral allele is annotated. This is useful to avoid annotating sites where the ancestral allele state is not clear. Use values close to 0 to annotate as many sites as possible, and values close to 1 to annotate only sites where the ancestral allele state is very clear.

    Warning

    This threshold introduces a bias by excluding more sites with high-frequency derived alleles and should thus be kept at 0 if the distribution of frequency counts is important, e.g., if the SFS is to be determined.

  • n_target_sites (int | None) – The total number of target sites if this class is used in conjunction with Parser or Annotator. This is useful if the provided set of sites only consists of bi-allelic sites. Specify here the total number of sites underlying the given dataset, i.e., both mono- and bi-allelic sites. Ignoring mono-allelic sites will lead to overestimation of the rate parameters. For this to work, a FASTA file must be provided from which the mono-allelic sites can be sampled. Sampling takes place between the variants of the last and first site on every contig considered in the VCF file. Use None to disable this feature. Note that the number of target sites is automatically carried over if not specified and this class is used together with Parser. In order to use this feature, you also need to specify a FASTA file to Parser or Annotator. Also note that by default we extrapolate the number of mono-allelic sites to be sampled from the FASTA file based on the ratio of sites with called outgroup bases parsed from the VCF file (adjust_target_sites).

  • n_samples_target_sites (int | None) – The number of sites to sample from the FASTA file when determining the number of target sites (n_target_sites). From this the total number of target sites is extrapolated.

  • adjust_target_sites (bool) – Whether to adjust the number of target sites based on the parsed VCF sites relative to the total number of sites in the VCF. Defaults to True.

  • subsample_mode (Literal['random', 'probabilistic']) – The subsampling mode. For random, we draw once without replacement from the set of all available ingroup genotypes per site. For probabilistic, we integrate over the hypergeometric distribution when parsing and computing the ancestral probabilities. Probabilistic subsampling requires a bit more time, but produces much more stable results, while requiring far fewer sites, so it is highly recommended.

parallelize: bool#

Whether to parallelize the computation.

max_sites: int#

Maximum number of sites to consider

confidence_threshold: float#

The confidence threshold for the ancestral allele annotation.

prior: PolarizationPrior | None#

The prior to use for the polarization probabilities.

n_runs: int#

Number of random ML starts when determining the rate parameters

model: SubstitutionModel#

The substitution model.

configs: pd.DataFrame | None#

The data frame holding all site configurations.

p_bins: Dict[str, np.ndarray | None]#

The probability of all sites per frequency bin.

n_sites: int | None#

The total number of valid sites parsed (including sites not considered for ancestral allele inference).

param_names: List[str]#

The parameter names in the order they are passed to the optimizer.

likelihoods: np.ndarray | None#

The log likelihoods for the different runs when optimizing the rate parameters.

likelihood: float | None#

The best log likelihood when optimizing the rate parameters.

result: Optional['scipy.optimize.OptimizeResult']#

Optimization result of the best run.

params_mle: Dict[str, float] | None#

The MLE parameters.

params_mle_runs: pd.DataFrame | None#

The MLE parameters for all runs.

n_target_sites: int | None#

The total number of target sites.

n_samples_target_sites: int#

The number of sites to sample from the FASTA file when determining the number of target sites.

adjust_target_sites: bool#

Whether to adjust the number of target sites based on the number of sites parsed from the VCF file.

classmethod from_est_sfs(file: str, prior: PolarizationPrior | None = '', n_runs: int = 10, model: SubstitutionModel = None, parallelize: bool = True, seed: int = 0, chunk_size: int = 100000)[source]#

Create instance from EST-SFS input file.

Parameters:
  • file (str) – File containing EST-SFS-formatted input data.

  • prior (PolarizationPrior | None) – The prior to use for the polarization probabilities (see __init__()).

  • n_runs (int) – Number of runs for rate estimation (see __init__()).

  • model (SubstitutionModel) – The substitution model (see __init__()).

  • parallelize (bool) – Whether to parallelize the runs (see __init__()).

  • seed (int) – The seed to use for the random number generator.

  • chunk_size (int) – The chunk size for reading the file.

Return type:

MaximumLikelihoodAncestralAnnotation

Returns:

The instance.

to_est_sfs(file: str)[source]#

Write the object’s state to an EST-SFS formatted file.

Parameters:

file (str) – The output file name.

to_file(file: str)[source]#

Save object to file.

Note

References to the handler and the reader are discarded.

Parameters:

file (str) – File path.

classmethod from_file(file: str)[source]#

Load object from file.

Note

The handler and the reader are not restored, so serialization is mostly useful for analyzing the results in detail, not for further processing.

Parameters:

file (str) – File path.

Return type:

MaximumLikelihoodAncestralAnnotation

Returns:

The object.

to_json()[source]#

Serialize object.

Note

References to the handler and the reader are discarded.

Return type:

str

Returns:

JSON string

classmethod from_json(json: str)[source]#

Load object from file.

Note

The handler and the reader are not restored, so serialization is mostly useful for analyzing the results in detail, not for further processing.

Parameters:

json (str) – JSON string.

Return type:

MaximumLikelihoodAncestralAnnotation

Returns:

The object.

classmethod from_data(n_major: Iterable[int], major_base: Iterable[str | int], minor_base: Iterable[str | int], outgroup_bases: Iterable[Iterable[str | int]], n_ingroups: int, n_runs: int = 10, model: SubstitutionModel = None, parallelize: bool = True, prior: PolarizationPrior | None = '', seed: int = 0, pass_indices: bool = False, confidence_threshold: float = 0)[source]#

Create an instance by passing the data directly.

Parameters:
  • n_major (Iterable[int]) – The number of major alleles per site. Note that this number has to be lower than n_ingroups, as we consider the number of major alleles of subsamples of size n_ingroups.

  • major_base (Iterable[str | int]) – The major allele per site. A string representation of the base or the base index according to ['A', 'C', 'G', 'T'] if pass_indices is True. Use None if the base is not defined when pass_indices is False and -1 when pass_indices is True.

  • minor_base (Iterable[str | int]) – The minor allele per site. A string representation of the base or the base index according to ['A', 'C', 'G', 'T'] if pass_indices is True. Use None if the base is not defined when pass_indices is False and -1 when pass_indices is True.

  • outgroup_bases (Iterable[Iterable[str | int]]) – The outgroup alleles per site. A string representation of the base or the base index if pass_indices is True. This should be a list of lists, where the outer list corresponds to the sites and the inner list to the outgroups per site. All sites are required to have the same number of outgroups. Use None if the base is not defined when pass_indices is False and -1 when pass_indices is True.

  • n_ingroups (int) – The number of ingroup samples (see __init__()).

  • n_runs (int) – The number of runs for rate estimation (see __init__()).

  • model (SubstitutionModel) – The substitution model (see __init__()).

  • parallelize (bool) – Whether to parallelize the runs.

  • prior (PolarizationPrior | None) – The prior to use for the polarization probabilities (see __init__()).

  • seed (int) – The seed for the random number generator.

  • pass_indices (bool) – Whether to pass the base indices instead of the bases.

  • confidence_threshold (float) – The confidence threshold for the ancestral allele annotation (see __init__()).

Return type:

MaximumLikelihoodAncestralAnnotation

Returns:

The instance.

classmethod from_dataframe(data: DataFrame, n_ingroups: int, n_runs: int = 10, model: SubstitutionModel = None, parallelize: bool = True, prior: PolarizationPrior | None = '', seed: int = 0, grouped: bool = False, confidence_threshold: float = 0)[source]#

Create an instance from a dataframe.

Parameters:
  • data (DataFrame) – Dataframe with the columns: major_base, minor_base, outgroup_bases, n_major of type int, int, list and int, respectively. The outgroup bases should have the same length for every site.

  • n_ingroups (int) – The number of ingroups (see __init__()).

  • n_runs (int) – Number of runs for rate estimation (see __init__()).

  • model (SubstitutionModel) – The substitution model (see __init__()).

  • parallelize (bool) – Whether to parallelize computations.

  • prior (PolarizationPrior | None) – The prior to use for the polarization probabilities (see __init__()).

  • seed (int) – The seed for the random number generator. If None, a random seed is chosen.

  • grouped (bool) – Whether the dataframe is already grouped by all columns (used for internal purposes).

  • confidence_threshold (float) – The confidence threshold for the ancestral allele annotation (see __init__()).

Return type:

MaximumLikelihoodAncestralAnnotation

Returns:

The instance.

infer()[source]#

Infer the ancestral allele probabilities for the data provided. This method is only supposed to be called manually if the data is provided directly, e.g. using from_data(), from_dataframe() or from_est_sfs(). If the data is provided using a VCF file, this method is called automatically.

set_mle_params(params: Dict[str, float])[source]#

Set the MLE parameters and update the cache and site configurations. Use this method if you want to use different parameters for the annotation.

Parameters:

params (Dict[str, float]) – The new parameters.

is_monotonic()[source]#

Whether the outgroups are monotonically increasing in divergence.

Return type:

bool

Returns:

Whether the outgroups are monotonically increasing in divergence.

property p_polarization: ndarray | None#

Get the polarization probabilities or None if prior is no.

static get_p_tree(base: int, n_outgroups: int, internal_nodes: List[int] | ndarray, outgroup_bases: List[int] | ndarray, params: Dict[str, float], model: SubstitutionModel)[source]#

Get the probability of a tree.

Parameters:
  • base (int) – An observed ingroup base index.

  • n_outgroups (int) – The number of outgroups.

  • internal_nodes (Union[List[int], ndarray]) – The internal nodes of the tree. We have n_outgroups - 1 internal nodes.

  • outgroup_bases (Union[List[int], ndarray]) – The observed base indices for the outgroups.

  • params (Dict[str, float]) – The parameters of the model.

  • model (SubstitutionModel) – The model to use. Either ‘K2’ or ‘JC’.

Return type:

float

classmethod get_p_config(config: ~fastdfe.annotation.SiteConfig, base_type: ~fastdfe.annotation.BaseType, params: ~typing.Dict[str, float], model: ~fastdfe.annotation.SubstitutionModel = <fastdfe.annotation.K2SubstitutionModel object>, internal: ~numpy.ndarray | None = None)[source]#

Get the probability for a site configuration.

Parameters:
  • config (SiteConfig) – The site configuration.

  • base_type (BaseType) – The base type.

  • params (Dict[str, float]) – The parameters for the substitution model.

  • model (SubstitutionModel) – The substitution model to use.

  • internal (ndarray | None) – Base indices of internal nodes of the tree if fixed. If None, the internal nodes are considered as free parameters. -1 also indicates a free parameter. The number of internal nodes is the number of outgroups minus one.

Return type:

float

Returns:

The probability for a site.

classmethod get_p_configs(configs: DataFrame, model: SubstitutionModel, base_type: BaseType, params: Dict[str, float])[source]#

Get the probabilities for each site configuration.

Parameters:
  • configs (DataFrame) – The site configurations.

  • model (SubstitutionModel) – The substitution model.

  • base_type (BaseType) – The base type.

  • params (Dict[str, float]) – A dictionary of the rate parameters.

Return type:

ndarray

Returns:

The probability for each site.

evaluate_likelihood(params: Dict[str, float])[source]#

Evaluate the likelihood function for the rate parameters.

Parameters:

params (Dict[str, float]) – A dictionary of parameters.

Return type:

float

Returns:

The log likelihood.

get_inferred_site_info()[source]#

Get the site information for the sites included in the parsing process. The sites are in the same order as parsed. You can use get_site_info() to get the site information for a specific site.

Return type:

Generator[SiteInfo, None, None]

Returns:

A generator yielding a dictionary with the site information (see get_site_info()).

Raises:

RuntimeError – If the subsample mode is probabilistic.

get_site_info(n_major: int, major_base: int | str, minor_base: int | str, outgroup_bases: List[int | str] | ndarray, pass_indices: bool = False)[source]#

Get information on the specified sites using the inferred parameters.

Parameters:
  • n_major (int) – The number of copies of the major allele.

  • major_base (int | str) – The major bases indices or strings.

  • minor_base (int | str) – The minor bases indices or strings.

  • outgroup_bases (Union[List[int | str], ndarray]) – The outgroup base indices or strings.

  • pass_indices (bool) – Whether to pass the indices as strings or convert them to integers.

Return type:

SiteInfo

Returns:

The site information.

annotate_site(variant: cyvcf2.Variant | DummyVariant)[source]#

Annotate a single site.

Parameters:

variant (Union[cyvcf2.Variant, DummyVariant]) – The variant to annotate.

Returns:

The annotated variant.

plot_likelihoods(file: str = None, show: bool = True, title: str = 'rate likelihoods', scale: Literal['lin', 'log'] = 'lin', ax: plt.Axes = None, ylabel: str = 'lnl')[source]#

Visualize the likelihoods of the rate optimization runs using a scatter plot.

Parameters:
  • scale (Literal['lin', 'log']) – y-scale of the plot.

  • title (str) – Plot title.

  • file (str) – File to save plot to.

  • show (bool) – Whether to show plot.

  • ax (plt.Axes) – Axes to plot on. Only for Python visualization backend.

  • ylabel (str) – Label for y-axis.

Return type:

plt.Axes

Returns:

Axes object

get_folded_spectra(groups: List[Literal['major_base', 'minor_base', 'outgroup_bases']] = ['major_base'])[source]#

Get the folded spectra for the parsed sites (used to estimate the parameters).

Parameters:

groups (List[Literal['major_base', 'minor_base', 'outgroup_bases']]) – The groups to group the spectra by.

Return type:

Spectra

Returns:

Spectra object

get_outgroup_divergence()[source]#

Get the inferred branch rates between the ingroup and outgroups by combining the inferred branch rates.

Return type:

ndarray

Returns:

One rate for each outgroup.

get_observed_transition_transversion_ratio()[source]#

Get the observed transition/transversion ratio. Note that this may differ from the estimated ratio for K2SubstitutionModel.

Return type:

float

Returns:

The observed transition/transversion ratio.

static count_target_sites(file: str, remove_overlaps: bool = False, contigs: List[str] = None)#

Count the number of target sites in a GFF file.

Parameters:
  • file (str) – The path to The GFF file path, possibly gzipped or a URL

  • remove_overlaps (bool) – Whether to remove overlapping target sites.

  • contigs (List[str]) – The contigs to count the target sites for.

Return type:

Dict[str, int]

Returns:

The number of target sites per chromosome/contig.

classmethod get_base_index(base_string: str | ndarray)#

Get base index/indices from base string(s).

Parameters:

base_string (str | ndarray) – The base string(s).

Return type:

int | ndarray

Returns:

Base index/indices.

static get_base_string(indices: int | ndarray)#

Get base string(s) from base index/indices.

Parameters:

indices (int | ndarray) – The base index/indices.

Return type:

str | ndarray

Returns:

Base string(s).

ingroups: List[str] | None#

The ingroup samples to consider when determining the ancestral allele.

exclude: List[str]#

The samples excluded from the ingroup.

outgroups: List[str]#

The outgroup samples to consider when determining the ancestral allele.

n_ingroups: int#

The number of ingroups.

n_outgroups: int#

The number of outgroups.

seed: int | None#

The seed for the random number generator.

subsample_mode: Literal['random', 'probabilistic']#

The subsampling mode.

rng: np.random.Generator#

The random number generator.

n_annotated: int#

The number of annotated sites.

class AdHocAncestralAnnotation(outgroups: List[str], n_ingroups: int, ingroups: List[str] | None = None, exclude: List[str] = [], seed: int | None = 0, subsample_mode: Literal['random', 'probabilistic'] = 'random')[source]#

Bases: _OutgroupAncestralAlleleAnnotation

Ad-hoc ancestral allele annotation using simple rules. Used for testing and sanity checking.

annotate_site(variant: cyvcf2.Variant | DummyVariant)[source]#

Annotate a single site. Mono-allelic sites are assigned the major allele as ancestral. Sites with more than two alleles are ignored.

Parameters:

variant (Union[cyvcf2.Variant, DummyVariant]) – The variant to annotate.

Returns:

The annotated variant.

__init__(outgroups: List[str], n_ingroups: int, ingroups: List[str] | None = None, exclude: List[str] = [], seed: int | None = 0, subsample_mode: Literal['random', 'probabilistic'] = 'random')#

Create a new ancestral allele annotation instance.

Parameters:
  • outgroups (List[str]) – The outgroup samples to consider when determining the ancestral allele. A list of sample names as they appear in the VCF file.

  • n_ingroups (int) – The minimum number of ingroups that must be present at a site for it to be considered for ancestral allele inference.

  • ingroups (Optional[List[str]]) – The ingroup samples to consider when determining the ancestral allele. A list of sample names as they appear in the VCF file. If None, all samples except the outgroups are considered.

  • exclude (List[str]) – Samples to exclude from the ingroup. A list of sample names as they appear in the VCF file.

  • seed (int | None) – The seed for the random number generator.

  • subsample_mode (Literal['random', 'probabilistic']) – The subsampling mode. Either ‘random’ or ‘probabilistic’.

static count_target_sites(file: str, remove_overlaps: bool = False, contigs: List[str] = None)#

Count the number of target sites in a GFF file.

Parameters:
  • file (str) – The path to The GFF file path, possibly gzipped or a URL

  • remove_overlaps (bool) – Whether to remove overlapping target sites.

  • contigs (List[str]) – The contigs to count the target sites for.

Return type:

Dict[str, int]

Returns:

The number of target sites per chromosome/contig.

classmethod get_base_index(base_string: str | ndarray)#

Get base index/indices from base string(s).

Parameters:

base_string (str | ndarray) – The base string(s).

Return type:

int | ndarray

Returns:

Base index/indices.

static get_base_string(indices: int | ndarray)#

Get base string(s) from base index/indices.

Parameters:

indices (int | ndarray) – The base index/indices.

Return type:

str | ndarray

Returns:

Base string(s).

ingroups: Optional[List[str]]#

The ingroup samples to consider when determining the ancestral allele.

exclude: List[str]#

The samples excluded from the ingroup.

outgroups: List[str]#

The outgroup samples to consider when determining the ancestral allele.

n_ingroups: int#

The number of ingroups.

n_outgroups: int#

The number of outgroups.

seed: int | None#

The seed for the random number generator.

subsample_mode: Literal['random', 'probabilistic']#

The subsampling mode.

rng: Generator#

The random number generator.

n_annotated: int#

The number of annotated sites.

class Annotator(vcf: str, output: str, annotations: List[Annotation], gff: str | None = None, fasta: str | None = None, info_ancestral: str = 'AA', max_sites: int = inf, seed: int | None = 0, cache: bool = True, aliases: Dict[str, List[str]] = {})[source]#

Bases: MultiHandler

Annotate a VCF file with the given annotations.

Example usage:

import fastdfe as fd

ann = fd.Annotator(
    vcf="http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/"
        "1000_genomes_project/release/20181203_biallelic_SNV/"
        "ALL.chr21.shapeit2_integrated_v1a.GRCh38.20181129.phased.vcf.gz",
    fasta="http://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/"
          "dna/Homo_sapiens.GRCh38.dna.chromosome.21.fa.gz",
    gff="http://ftp.ensembl.org/pub/release-109/gff3/homo_sapiens/"
        "Homo_sapiens.GRCh38.109.chromosome.21.gff3.gz",
    output='sapiens.chr21.degeneracy.vcf.gz',
    annotations=[fd.DegeneracyAnnotation()],
    aliases=dict(chr21=['21'])
)

ann.annotate()
__init__(vcf: str, output: str, annotations: List[Annotation], gff: str | None = None, fasta: str | None = None, info_ancestral: str = 'AA', max_sites: int = inf, seed: int | None = 0, cache: bool = True, aliases: Dict[str, List[str]] = {})[source]#

Create a new annotator instance.

Parameters:
  • vcf (str) – The path to the VCF file, can be gzipped, urls are also supported

  • output (str) – The path to the output file

  • annotations (List[Annotation]) – The annotations to apply.

  • gff (str | None) – The path to the GFF file, can be gzipped, urls are also supported. Required for annotations that require a GFF file.

  • fasta (str | None) – The path to the FASTA file, can be gzipped, urls are also supported. Required for annotations that require a FASTA file.

  • info_ancestral (str) – The tag in the INFO field that contains the ancestral allele

  • max_sites (int) – Maximum number of sites to consider

  • seed (int | None) – Seed for the random number generator. Use None for no seed.

  • cache (bool) – Whether to cache files downloaded from urls

  • aliases (Dict[str, List[str]]) – Dictionary of aliases for the contigs in the VCF file, e.g. {'chr1': ['1']}. This is used to match the contig names in the VCF file with the contig names in the FASTA file and GFF file.

output: str#

The path to the output file.

annotations: List[Annotation]#

The annotations to apply.

annotate()[source]#

Annotate the VCF file.

count_sites()#

Count the number of sites in the VCF.

Return type:

int

Returns:

Number of sites

classmethod download_file(url: str, cache: bool = True, desc: str = 'Downloading file')#

Download a file from a URL.

Parameters:
  • cache (bool) – Whether to cache the file.

  • url (str) – The URL to download the file from.

  • desc (str) – Description for the progress bar

Return type:

str

Returns:

The path to the downloaded file.

download_if_url(path: str)#

Download the VCF file if it is a URL.

Parameters:

path (str) – The path to the VCF file.

Return type:

str

Returns:

The path to the downloaded file or the original path.

get_aliases(contig: str)#

Get all aliases for the given contig alias including the primary alias.

Parameters:

contig (str) – The contig.

Return type:

List[str]

Returns:

The aliases.

get_contig(aliases, rewind: bool = True, notify: bool = True)#

Get the contig from the FASTA file.

Note that pyfaidx would be more efficient here, but there were problems when running it in parallel.

Parameters:
  • aliases – The contig aliases.

  • rewind (bool) – Whether to allow for rewinding the iterator if the contig is not found.

  • notify (bool) – Whether to notify the user when rewinding the iterator.

Return type:

SeqRecord

Returns:

The contig.

get_contig_names()#

Get the names of the contigs in the FASTA file.

Return type:

List[str]

Returns:

The contig names.

static get_filename(url: str)#

Return the file extension of a URL.

Parameters:

url (str) – The URL to get the file extension from.

Returns:

The file extension.

get_pbar(desc: str = 'Processing sites', total: int | None = 0)#

Return a progress bar for the number of sites.

Parameters:
  • desc (str) – Description for the progress bar

  • total (int | None) – Total number of items

Return type:

tqdm

Returns:

tqdm

static hash(s: str)#

Return a truncated SHA1 hash of a string.

Parameters:

s (str) – The string to hash.

Return type:

str

Returns:

The SHA1 hash.

static is_url(path: str)#

Check if the given path is a URL.

Parameters:

path (str) – The path to check.

Return type:

bool

Returns:

True if the path is a URL, False otherwise.

load_fasta(file: str)#

Load a FASTA file into a dictionary.

Parameters:

file (str) – The path to The FASTA file path, possibly gzipped or a URL

Return type:

FastaIterator

Returns:

Iterator over the sequences.

load_vcf()#

Load a VCF file into a dictionary.

Return type:

cyvcf2.VCF

Returns:

The VCF reader.

property n_sites: int#

Get the number of sites in the VCF.

Returns:

Number of sites

static remove_overlaps(df: DataFrame)#

Remove overlapping coding sequences.

Parameters:

df (DataFrame) – The coding sequences.

Return type:

DataFrame

Returns:

The coding sequences without overlaps.

static unzip_if_zipped(file: str)#

If the given file is gzipped, unzip it and return the path to the unzipped file. If the file is not gzipped, return the path to the original file.

Parameters:

file (str) – The path to the file.

Returns:

The path to the unzipped file, or the original file if it was not gzipped.

vcf#

The path to the VCF file or an iterable of variants

info_ancestral: str#

The tag in the INFO field that contains the ancestral allele

max_sites: int#

Maximum number of sites to consider

seed: Optional[int]#

Seed for the random number generator

rng#

Random generator instance

fasta: str#

The path to the FASTA file.

gff#

The GFF file path

cache: bool#

Whether to cache files that are downloaded from URLs

aliases#

The contig mappings