VCF annotation#
VCF annotations and an annotator to apply them.
- class Annotation[source]#
Bases:
ABCBase class for annotations.
-
n_annotated:
int# The number of annotated sites.
- abstract annotate_site(variant: cyvcf2.Variant | DummyVariant)[source]#
Annotate a single site.
- Parameters:
variant (
Union[cyvcf2.Variant,DummyVariant]) – The variant to annotate.- Returns:
The annotated variant.
- static count_target_sites(file: str, remove_overlaps: bool = False, contigs: List[str] = None)[source]#
Count the number of target sites in a GFF file.
- Parameters:
file (
str) – The path to The GFF file path, possibly gzipped or a URLremove_overlaps (
bool) – Whether to remove overlapping target sites.contigs (
List[str]) – The contigs to count the target sites for.
- Return type:
Dict[str,int]- Returns:
The number of target sites per chromosome/contig.
-
n_annotated:
- class DegeneracyAnnotation[source]#
Bases:
AnnotationDegeneracy annotation. We annotate the degeneracy by looking at each codon for coding variants. This also annotates mono-allelic sites.
This annotation adds the info fields
DegeneracyandDegeneracy_Info, which hold the degeneracy of a site (0, 2, 4) and extra information about the degeneracy, respectively. To be used withDegeneracyStratification.For this annotation to work, we require a FASTA and GFF file (passed to
ParserorAnnotator).Example usage:
import fastdfe as fd ann = fd.Annotator( vcf="http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/" "1000_genomes_project/release/20181203_biallelic_SNV/" "ALL.chr21.shapeit2_integrated_v1a.GRCh38.20181129.phased.vcf.gz", fasta="http://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/" "dna/Homo_sapiens.GRCh38.dna.chromosome.21.fa.gz", gff="http://ftp.ensembl.org/pub/release-109/gff3/homo_sapiens/" "Homo_sapiens.GRCh38.109.chromosome.21.gff3.gz", output='sapiens.chr21.degeneracy.vcf.gz', annotations=[fd.DegeneracyAnnotation()], aliases=dict(chr21=['21']) ) ann.annotate()
- mismatches: List['cyvcf2.Variant']#
The variants that could not be annotated correctly.
- n_skipped: int#
The variant that were skipped because they were not in coding regions.
- errors: List['cyvcf2.Variant']#
The variants for which the codon could not be determined.
- annotate_site(v: cyvcf2.Variant | DummyVariant)[source]#
Annotate a single site.
- Parameters:
v (
Union[cyvcf2.Variant,DummyVariant]) – The variant to annotate.
- static count_target_sites(file: str, remove_overlaps: bool = False, contigs: List[str] = None)#
Count the number of target sites in a GFF file.
- Parameters:
file (
str) – The path to The GFF file path, possibly gzipped or a URLremove_overlaps (
bool) – Whether to remove overlapping target sites.contigs (
List[str]) – The contigs to count the target sites for.
- Return type:
Dict[str,int]- Returns:
The number of target sites per chromosome/contig.
- n_annotated: int#
The number of annotated sites.
- class SynonymyAnnotation[source]#
Bases:
DegeneracyAnnotationSynonymy annotation. This class annotates a variant with the synonymous/non-synonymous status.
This annotation adds the info fields
SynonymousandSynonymous_Info, which hold the synonymy (Synonymous [0] or non-synonymous [1]) and the codon information, respectively. To be used withSynonymyStratification.For this annotation to work, we require a FASTA and GFF file (passed to
ParserorAnnotator).This class was tested against VEP and SnpEff and provides the same annotations in almost all cases.
Warning
Not recommended for use with
Parseras we also need to annotate mono-allelic sites. Consider usingDegeneracyAnnotationandDegeneracyStratificationinstead.- vep_mismatches: List['cyvcf2.Variant']#
The number of sites that did not match with VEP.
- snpeff_mismatches: List['cyvcf2.Variant']#
The number of sites that did not math with the annotation provided by SnpEff
- n_vep_comparisons: int#
The number of sites that were concordant with VEP.
- n_snpeff_comparisons: int#
The number of sites that were concordant with SnpEff.
- static mutate(codon: str, alt: str, pos: int)[source]#
Mutate the codon at the given position to the given alternative allele.
- Parameters:
codon (
str) – The codon to mutate.alt (
str) – The alternative allele.pos (
int) – The position to mutate.
- Return type:
str- Returns:
Mutated codon.
- static is_synonymous(codon1: str, codon2: str)[source]#
Check if two codons are synonymous.
- Parameters:
codon1 (
str) – The first codon.codon2 (
str) – The second codon.
- Return type:
bool- Returns:
True if the codons are synonymous, False otherwise.
- annotate_site(v: cyvcf2.Variant | DummyVariant)[source]#
Annotate a single site.
- Parameters:
v (
Union[cyvcf2.Variant,DummyVariant]) – The variant to annotate.- Returns:
The annotated variant.
- static count_target_sites(file: str, remove_overlaps: bool = False, contigs: List[str] = None)#
Count the number of target sites in a GFF file.
- Parameters:
file (
str) – The path to The GFF file path, possibly gzipped or a URLremove_overlaps (
bool) – Whether to remove overlapping target sites.contigs (
List[str]) – The contigs to count the target sites for.
- Return type:
Dict[str,int]- Returns:
The number of target sites per chromosome/contig.
- mismatches: List['cyvcf2.Variant']#
The variants that could not be annotated correctly.
- n_skipped: int#
The variant that were skipped because they were not in coding regions.
- errors: List['cyvcf2.Variant']#
The variants for which the codon could not be determined.
- n_annotated: int#
The number of annotated sites.
- class AncestralAlleleAnnotation[source]#
Bases:
Annotation,ABCBase class for ancestral allele annotation.
- __init__()#
Create a new annotation instance.
- abstract annotate_site(variant: cyvcf2.Variant | DummyVariant)#
Annotate a single site.
- Parameters:
variant (
Union[cyvcf2.Variant,DummyVariant]) – The variant to annotate.- Returns:
The annotated variant.
- static count_target_sites(file: str, remove_overlaps: bool = False, contigs: List[str] = None)#
Count the number of target sites in a GFF file.
- Parameters:
file (
str) – The path to The GFF file path, possibly gzipped or a URLremove_overlaps (
bool) – Whether to remove overlapping target sites.contigs (
List[str]) – The contigs to count the target sites for.
- Return type:
Dict[str,int]- Returns:
The number of target sites per chromosome/contig.
-
n_annotated:
int# The number of annotated sites.
- class MaximumParsimonyAncestralAnnotation(samples: List[str] = None)[source]#
Bases:
AncestralAlleleAnnotationAnnotation of ancestral alleles using maximum parsimony. To be used with
AncestralBaseStratificationandAnnotatororParser.Note that maximum parsimony is not a reliable way to determine ancestral alleles, so it is recommended to use this annotation together with the ancestral misidentification parameter
epsor to fold spectra altogether. Alternatively, you can useDeviantOutgroupFiltrationto filter out sites where the major allele among outgroups does not coincide with the major allele among ingroups. This annotation has the advantage of requiring no outgroup data. If outgroup data is available consider using :class`MLEAncestralAlleleAnnotation` instead.- __init__(samples: List[str] = None)[source]#
Create a new ancestral allele annotation instance.
- Parameters:
samples (
List[str]) – The samples to consider when determining the ancestral allele. IfNone, all samples are considered.
-
samples:
Optional[List[str]]# The samples to consider when determining the ancestral allele.
- annotate_site(variant: cyvcf2.Variant | DummyVariant)[source]#
Annotate a single site.
- Parameters:
variant (
Union[cyvcf2.Variant,DummyVariant]) – The variant to annotate.- Returns:
The annotated variant.
- static count_target_sites(file: str, remove_overlaps: bool = False, contigs: List[str] = None)#
Count the number of target sites in a GFF file.
- Parameters:
file (
str) – The path to The GFF file path, possibly gzipped or a URLremove_overlaps (
bool) – Whether to remove overlapping target sites.contigs (
List[str]) – The contigs to count the target sites for.
- Return type:
Dict[str,int]- Returns:
The number of target sites per chromosome/contig.
-
n_annotated:
int# The number of annotated sites.
- class SubstitutionModel(bounds: Dict[str, Tuple[float, float]] = {}, pool_branch_rates: bool = False, fixed_params: Dict[str, float] = {})[source]#
Bases:
ABCBase class for substitution models.
- __init__(bounds: Dict[str, Tuple[float, float]] = {}, pool_branch_rates: bool = False, fixed_params: Dict[str, float] = {})[source]#
Create a new substitution model instance.
- Parameters:
bounds (
Dict[str,Tuple[float,float]]) – The bounds for the parameters.pool_branch_rates (
bool) – Whether to pool the branch rates. By default, each branch has its own rate which is optimized using MLE. IfTrue, the branch rates are pooled and a single rate is optimized. This is useful if the number of sites used is small.fixed_params (
Dict[str,float]) – The fixed parameters. Parameters that are not fixed are optimized using MLE.
-
pool_branch_rates:
bool# Whether to pool the branch rates.
-
fixed_params:
Dict[str,float]# The fixed parameters.
-
bounds:
Dict[str,Tuple[float,float]]# Parameter bounds.
- cache(params: Dict[str, float], n_branches: int)[source]#
Cache the probabilities for the given parameters.
- Parameters:
params (
Dict[str,float]) – The parameters.n_branches (
int) – The number of branches.
- static get_x0(bounds: Dict[str, Tuple[float, float]], rng: Generator)[source]#
Get the initial values for the parameters.
- Parameters:
bounds (
Dict[str,Tuple[float,float]]) – The bounds for the parameters.rng (
Generator) – The random number generator.
- Return type:
Dict[str,float]- Returns:
The initial values.
- class JCSubstitutionModel(bounds: Dict[str, Tuple[float, float]] = {'K': (1e-05, 10)}, pool_branch_rates: bool = False, fixed_params: Dict[str, float] = {})[source]#
Bases:
SubstitutionModelJukes-Cantor substitution model.
- __init__(bounds: Dict[str, Tuple[float, float]] = {'K': (1e-05, 10)}, pool_branch_rates: bool = False, fixed_params: Dict[str, float] = {})[source]#
Create a new substitution model instance.
- Parameters:
bounds (
Dict[str,Tuple[float,float]]) – The bounds for the parameters. K is the branch rate.pool_branch_rates (
bool) – Whether to pool the branch rates. By default, each branch has its own rate which is optimized using MLE. IfTrue, the branch rates are pooled and a single rate is optimized. This is useful if the number of sites used is small. IfFalse, each branch has its own rate denoted by “K{i}”, where i is the branch index. IfTrue, the branch rate is denoted by “K”.fixed_params (
Dict[str,float]) – The fixed parameters. Parameters that are not fixed are optimized using MLE.
- get_bound(param: str)[source]#
Get the bounds for a parameter.
- Parameters:
param (
str) – The parameter.- Return type:
Tuple[float,float]- Returns:
The lower and upper bounds.
- get_bounds(n_outgroups: int)[source]#
Get the bounds for the parameters.
- Parameters:
n_outgroups (
int) – The number of outgroups.- Return type:
Dict[str,Tuple[float,float]]- Returns:
The lower and upper bounds.
- cache(params: Dict[str, float], n_branches: int)#
Cache the probabilities for the given parameters.
- Parameters:
params (
Dict[str,float]) – The parameters.n_branches (
int) – The number of branches.
- static get_x0(bounds: Dict[str, Tuple[float, float]], rng: Generator)#
Get the initial values for the parameters.
- Parameters:
bounds (
Dict[str,Tuple[float,float]]) – The bounds for the parameters.rng (
Generator) – The random number generator.
- Return type:
Dict[str,float]- Returns:
The initial values.
- static validate_bounds(bounds: Dict[str, Tuple[float, float]])#
Make sure the lower bounds are positive and the upper bounds are larger than the lower bounds.
- Parameters:
bounds (
Dict[str,Tuple[float,float]]) – The bounds to validate- Raises:
ValueError – If the bounds are invalid
-
pool_branch_rates:
bool# Whether to pool the branch rates.
-
fixed_params:
Dict[str,float]# The fixed parameters.
-
bounds:
Dict[str,Tuple[float,float]]# Parameter bounds.
- class K2SubstitutionModel(bounds: Dict[str, Tuple[float, float]] = {'K': (1e-05, 10), 'k': (0.1, 10)}, pool_branch_rates: bool = False, fixed_params: Dict[str, float] = {}, fix_transition_transversion_ratio: bool = False)[source]#
Bases:
JCSubstitutionModelKimura 2-parameter substitution model.
- __init__(bounds: Dict[str, Tuple[float, float]] = {'K': (1e-05, 10), 'k': (0.1, 10)}, pool_branch_rates: bool = False, fixed_params: Dict[str, float] = {}, fix_transition_transversion_ratio: bool = False)[source]#
Create a new substitution model instance.
- Parameters:
bounds (
Dict[str,Tuple[float,float]]) – The bounds for the parameters.K{i}are the branch rates.kis the transition/transversion ratio.pool_branch_rates (
bool) – Whether to pool the branch rates. By default, each branch has its own rate which is optimized using MLE. IfTrue, the branch rates are pooled and a single rate is optimized. This is useful if the number of sites used is small.fixed_params (
Dict[str,float]) – The fixed parameters. Parameters that are not fixed are optimized using MLE.fix_transition_transversion_ratio (
bool) – Whether to fix the transition/transversion ratio to the ratio observed in the data.
-
fix_transition_transversion_ratio:
bool# Whether to fix the transition/transversion ratio to the ratio observed in the data.
- get_bounds(n_outgroups: int)[source]#
Get the bounds for the parameters.
- Parameters:
n_outgroups (
int) – The number of outgroups.- Return type:
Dict[str,Tuple[float,float]]- Returns:
The lower and upper bounds.
- cache(params: Dict[str, float], n_branches: int)#
Cache the probabilities for the given parameters.
- Parameters:
params (
Dict[str,float]) – The parameters.n_branches (
int) – The number of branches.
- get_bound(param: str)#
Get the bounds for a parameter.
- Parameters:
param (
str) – The parameter.- Return type:
Tuple[float,float]- Returns:
The lower and upper bounds.
- static get_x0(bounds: Dict[str, Tuple[float, float]], rng: Generator)#
Get the initial values for the parameters.
- Parameters:
bounds (
Dict[str,Tuple[float,float]]) – The bounds for the parameters.rng (
Generator) – The random number generator.
- Return type:
Dict[str,float]- Returns:
The initial values.
- static validate_bounds(bounds: Dict[str, Tuple[float, float]])#
Make sure the lower bounds are positive and the upper bounds are larger than the lower bounds.
- Parameters:
bounds (
Dict[str,Tuple[float,float]]) – The bounds to validate- Raises:
ValueError – If the bounds are invalid
-
pool_branch_rates:
bool# Whether to pool the branch rates.
-
fixed_params:
Dict[str,float]# The fixed parameters.
-
bounds:
Dict[str,Tuple[float,float]]# Parameter bounds.
- class SiteConfig(n_major: int, major_base: int, minor_base: int, outgroup_bases: ndarray, multiplicity: float = 1.0, sites: ndarray = None, p_minor: float = nan, p_major: float = nan)[source]#
Bases:
objectAncestral allele site configuration for a single subsample.
- __init__(n_major: int, major_base: int, minor_base: int, outgroup_bases: ndarray, multiplicity: float = 1.0, sites: ndarray = None, p_minor: float = nan, p_major: float = nan)[source]#
Create a new site configuration instance.
-
n_major:
int# The number of major alleles.
-
major_base:
int# The major allele base index.
-
minor_base:
int# The minor base index.
-
outgroup_bases:
ndarray# The outgroup base indices.
-
multiplicity:
float# The multiplicity of the site.
-
sites:
ndarray# The site indices.
-
p_minor:
float# The probability of the minor allele.
-
p_major:
float# The probability of the major allele.
- class SiteInfo(n_major: Dict[int, float], major_base: str, minor_base: str, outgroup_bases: List[str], p_minor: float = nan, p_major: float = nan, p_major_ancestral: float = nan, major_ancestral: str = '.', p_bases_first_node: Dict[str, float] = None, p_first_node_ancestral: float = nan, first_node_ancestral: str = '.', rate_params: Dict[str, float] = None)[source]#
Bases:
objectAncestral allele information on a single site.
- __init__(n_major: Dict[int, float], major_base: str, minor_base: str, outgroup_bases: List[str], p_minor: float = nan, p_major: float = nan, p_major_ancestral: float = nan, major_ancestral: str = '.', p_bases_first_node: Dict[str, float] = None, p_first_node_ancestral: float = nan, first_node_ancestral: str = '.', rate_params: Dict[str, float] = None)[source]#
-
n_major:
Dict[int,float]# Dictionary mapping number of major alleles to its probability of observation.
-
major_base:
str# The major allele base.
-
minor_base:
str# The minor base index.
-
outgroup_bases:
List[str]# The outgroup base indices.
-
p_minor:
float# The probability of the minor allele being the ancestral allele (without prior).
-
p_major:
float# The probability of the major allele being the ancestral allele (without prior).
-
p_major_ancestral:
float# The probability of the major allele being the ancestral allele rather than the minor allele (possibly with prior if specified).
-
major_ancestral:
str# The predicted ancestral base based on comparing major and minor allele.
-
p_bases_first_node:
Dict[str,float]# The probability of each base being the ancestral base for the first internal node.
-
p_first_node_ancestral:
float# The probability that the mostly likely base for the first internal node is the ancestral base.
-
first_node_ancestral:
str# The ancestral base index for the first internal node.
-
rate_params:
Dict[str,float]# The branch rates.
- class BaseType(value, names=None, *, module=None, qualname=None, type=None, start=1, boundary=None)[source]#
Bases:
EnumThe base type, either major or minor.
-
MINOR:
int= 0#
-
MAJOR:
int= 1#
-
MINOR:
- class PolarizationPrior(allow_divergence: bool = False)[source]#
Bases:
ABCBase class for priors used with :class:MaximumLikelihoodAncestralAnnotation. These priors incorporate information about the general probability of the major allele being ancestral across all sites with the same minor allele count. Prior thus take ingroup allele frequencies into account, when making predictions about the ancestral state of a site. This is useful because it enhances ancestral allele probability estimates, especially when outgroup information is unavailable for a particular site. Knowing the likelihood of the major allele being ancestral in general allows for more informed estimations.
- __init__(allow_divergence: bool = False)[source]#
Create a new instance.
- Parameters:
allow_divergence (
bool) –Whether to allow divergence. If
True, the probability of the minor allele being ancestral, which is not contained in the ingroup subsample but rather in all specified ingroups or among the outgroup, is taken to be the same as if it was present in the ingroup subsample with frequency 1. This is a hack, but allows us to consider alleles that are not present in the ingroup subsample.Warning
Setting this to
Truegreatly increases the probability of high-frequency derived alleles which introduces a strong bias in the distribution of frequency counts, e.g., the SFS. Only use this if you’re interested in handling divergence counts, i.e., sites where the ingroup is mono-allelic.
-
allow_divergence:
bool# Whether to allow divergence.
-
probabilities:
ndarray|None# The polarization probabilities.
- plot(file: str = None, show: bool = True, title: str = 'polarization probabilities', scale: Literal['lin', 'log'] = 'lin', ax: plt.Axes = None, ylabel: str = 'p')[source]#
Visualize the polarization probabilities using a scatter plot.
- Parameters:
scale (
Literal['lin','log']) – y-scale of the plot.title (
str) – Plot title.file (
str) – File to save plot to.show (
bool) – Whether to show plot.ax (plt.Axes) – Axes to plot on. Only for Python visualization backend.
ylabel (
str) – y-axis label.
- Return type:
plt.Axes
- Returns:
Axes object
- class KingmanPolarizationPrior(allow_divergence: bool = False)[source]#
Bases:
PolarizationPriorPrior based on the standard Kingman coalescent. To be used with
MaximumLikelihoodAncestralAnnotation.- __init__(allow_divergence: bool = False)#
Create a new instance.
- Parameters:
allow_divergence (
bool) –Whether to allow divergence. If
True, the probability of the minor allele being ancestral, which is not contained in the ingroup subsample but rather in all specified ingroups or among the outgroup, is taken to be the same as if it was present in the ingroup subsample with frequency 1. This is a hack, but allows us to consider alleles that are not present in the ingroup subsample.Warning
Setting this to
Truegreatly increases the probability of high-frequency derived alleles which introduces a strong bias in the distribution of frequency counts, e.g., the SFS. Only use this if you’re interested in handling divergence counts, i.e., sites where the ingroup is mono-allelic.
- plot(file: str = None, show: bool = True, title: str = 'polarization probabilities', scale: Literal['lin', 'log'] = 'lin', ax: plt.Axes = None, ylabel: str = 'p')#
Visualize the polarization probabilities using a scatter plot.
- Parameters:
scale (
Literal['lin','log']) – y-scale of the plot.title (
str) – Plot title.file (
str) – File to save plot to.show (
bool) – Whether to show plot.ax (plt.Axes) – Axes to plot on. Only for Python visualization backend.
ylabel (
str) – y-axis label.
- Return type:
plt.Axes
- Returns:
Axes object
-
allow_divergence:
bool# Whether to allow divergence.
-
probabilities:
ndarray|None# The polarization probabilities.
- class AdaptivePolarizationPrior(n_runs: int = 1, parallelize: bool = True, allow_divergence: bool = False, seed: int | None = 0)[source]#
Bases:
PolarizationPriorAdaptive prior. To be used with
MaximumLikelihoodAncestralAnnotation. This is the same prior as used in the EST-SFS paper. This prior is adaptive in the sense that the most likely polarization probabilities given the site configurations are found. This is the most accurate prior, but requires a lot of sites in order to work properly. You can check that the polarization probabilities are smooth enough across frequency counts by callingplot(). If they are not smooth enough, you can increase the number of sites, decrease the number of ingroups, or useKingmanPolarizationPriorinstead.Note
In practice, this prior provides very similar results to
KingmanPolarizationPriorin most cases.- __init__(n_runs: int = 1, parallelize: bool = True, allow_divergence: bool = False, seed: int | None = 0)[source]#
Create a new adaptive prior instance.
- Parameters:
n_runs (
int) – The number of runs to perform when determining the polarization parameters. One run should be sufficient as only one parameter is optimized.parallelize (
bool) – Whether to parallelize the optimization.allow_divergence (
bool) – Whether to allow divergence. SeePolarizationPriorfor details.seed (
int|None) – The seed for the random number generator.
-
n_runs:
int# The number of runs to use for the adaptive prior.
-
parallelize:
bool# Whether to parallelize the optimization.
-
seed:
int|None# The seed for the random number generator.
-
rng:
Generator# The random number generator.
- plot(file: str = None, show: bool = True, title: str = 'polarization probabilities', scale: Literal['lin', 'log'] = 'lin', ax: plt.Axes = None, ylabel: str = 'p')#
Visualize the polarization probabilities using a scatter plot.
- Parameters:
scale (
Literal['lin','log']) – y-scale of the plot.title (
str) – Plot title.file (
str) – File to save plot to.show (
bool) – Whether to show plot.ax (plt.Axes) – Axes to plot on. Only for Python visualization backend.
ylabel (
str) – y-axis label.
- Return type:
plt.Axes
- Returns:
Axes object
-
allow_divergence:
bool# Whether to allow divergence.
-
probabilities:
ndarray|None# The polarization probabilities.
- class MaximumLikelihoodAncestralAnnotation(outgroups: List[str], n_ingroups: int = 11, ingroups: List[str] | None = None, exclude: List[str] | None = None, n_runs: int = 10, model: SubstitutionModel = None, parallelize: bool = True, prior: PolarizationPrior | None = '', max_sites: int = 10000, seed: int | None = 0, confidence_threshold: float = 0, n_target_sites: int | None = None, n_samples_target_sites: int | None = 100000, adjust_target_sites: bool = True, subsample_mode: Literal['random', 'probabilistic'] = 'probabilistic')[source]#
Bases:
_OutgroupAncestralAlleleAnnotationAnnotation of ancestral alleles following the probabilistic model of EST-SFS (https://doi.org/10.1534/genetics.118.301120). Note that only bi-allelic SNPs are supported by this model. By default, the info field
AA(seeAnnotator.info_ancestral) is added to the VCF file, which holds the ancestral allele. To be used withAnnotatororParser. The info fieldAA_probholds the probability that the predicted ancestral allele (AAtag) is correct, as opposed to the other allele. This probability can be also be used byParserto polarize the SFS according to the ancestral allele probability. In addition to the ancestral allele, the info fieldAA_infois added, which contains additional information about the ancestral allele (seeSiteInfofor an overview of the available information). This class can also be used independently, see thefrom_dataframe(),from_data()andfrom_est_sfs()methods.Initially, the branch rates are determined using MLE. Similar to
Parser, we can also specify the number of mutational target sites (see then_target_sitesargument) in case our VCF file does not contain the full set of monomorphic sites. This is necessary to obtain realistic branch rate estimates. You can also choose a prior for the polarization probabilities (seePolarizationPrior). Eventually, for every site, the probability that the major allele is ancestral is calculated.When annotating the variants of a VCF file, we check the most likely ancestral allele against a naive ad-hoc ancestral allele annotation, and record the sites for which we have disagreement. You might want to sanity-check the mismatches to make sure the model has been properly specified (see
mismatches).Note
The polarization prior corresponds to the Kingman coalescent probability by default. Using an adaptive prior, as in the EST-SFS paper, is also possible, but this is only recommended if the number of sites used for the inference is large (see
prior).The model can only handle sites that have at most 2 alleles across the in- and outgroups, so sites with more than 2 alleles are ignored. Only variants that are at most bi-allelic in the provided in- and outgroups are annotated.
The model determines the probability of the major allele being ancestral opposed to the minor allele. This can be problematic if the actual ancestral allele is not contained in the ingroup (possibly due to subsampling). To avoid this issue, we also keep track of potential minor alleles at frequency 0. If we were to ignore this, it would be impossible to infer divergence, i.e. fixed derived allele that are no longer observed in the ingroups (see
PolarizationPrior.allow_divergence). That said, divergence counts are not informative on DFE inference with fastDFE and allow_divergence should not be set toTrueif interested in the SFS.The model assumes a single coalescent topology for all sites, in which all outgroups coalesce first with the ingroup and not with each other. It is important to specify the outgroups in order of increasing divergence and not to select outgroups that are not much more closely related to each other than to the ingroup (as this would give rise to a different coalescent topology than the one assumed). You can call
get_outgroup_divergence()after the inference to check the estimated branch rates for each outgroup. The assumption of a single fixed topology should be good enough provided that in- and outgroups are sufficiently diverged.
Example usage:
import fastdfe as fd ann = fd.Annotator( vcf="https://github.com/Sendrowski/fastDFE/" "blob/dev/resources/genome/betula/all." "with_outgroups.subset.10000.vcf.gz?raw=true", annotations=[fd.MaximumLikelihoodAncestralAnnotation( outgroups=["ERR2103730"], n_ingroups=15 )], output="genome.polarized.vcf.gz" ) ann.annotate()
- __init__(outgroups: List[str], n_ingroups: int = 11, ingroups: List[str] | None = None, exclude: List[str] | None = None, n_runs: int = 10, model: SubstitutionModel = None, parallelize: bool = True, prior: PolarizationPrior | None = '', max_sites: int = 10000, seed: int | None = 0, confidence_threshold: float = 0, n_target_sites: int | None = None, n_samples_target_sites: int | None = 100000, adjust_target_sites: bool = True, subsample_mode: Literal['random', 'probabilistic'] = 'probabilistic')[source]#
Create a new ancestral allele annotation instance.
- Parameters:
outgroups (
List[str]) – The outgroup samples to consider when determining the ancestral allele in the order of increasing divergence. A list of sample names as they appear in the VCF file. The order of the outgroups is important as it determines the order of the branches in the tree, whose rates are optimized, and whose topology is predetermined. The first outgroup is the closest outgroup to the ingroups, and the last outgroup is the most distant outgroup. More outgroups lead to a more accurate inference of the ancestral allele, but also increase the computational cost. Using more than 1 outgroup is recommended, but more than 3 is likely not necessary. Sites where these outgroups are not present are not included when optimizing the rate parameters. Due to assumptions on the tree topology connecting the in- and outgroups, it is important that the outgroups are not much more closely related to each other than to the ingroups. Ideally, the optimized branch rates are show markedly different values, and in any case, they should be monotonically increasing with the outgroups (seeget_outgroup_divergence()).n_ingroups (
int) – The minimum number of ingroups that must be present at a site for it to be considered for ancestral allele inference. The ingroup subsampling is necessary since our model requires an equal number of ingroups for all sites. Note that a larger number of ingroups does not necessarily improve the accuracy of the ancestral allele inference (seeprior). A larger number of ingroups can lead to a large variance in the polarization probabilities, across different frequency counts.n_ingroupsshould thus only be large if the number of sites used for the inference is also large. A sensible value for a reasonably large number of sites (a few thousand) is 10 or perhaps 20 for a larger numbers of sites. Very small values can lead to the ingroup subsamples not being representative of the actual allele frequencies at a site, especially when not using probabilistic subsampling (seesubsample_mode). This value also influences the number of frequency bins used for the polarization probabilities, and should thus not be too small. Note that ifingroupsis an even number, the major allele is chosen arbitrarily if the number of major alleles is equal to the number of minor alleles. To avoid this, you can use an odd number of ingroups.ingroups (
Optional[List[str]]) – The ingroup samples to consider when determining the ancestral allele. IfNone, all (non-outgroup) samples are considered. A list of sample names as they appear in the VCF file. Has to be at least as large asn_ingroups.exclude (
Optional[List[str]]) – Samples to exclude from the ingroup. A list of sample names as they appear in the VCF file.n_runs (
int) – The number of optimization runs to perform when determining the branch rates. You can check that the likelihoods of the different runs are similar by callingplot_likelihoods().model (
SubstitutionModel) – The substitution model to use. By default,K2SubstitutionModelis used.parallelize (
bool) – Whether to parallelize the computation across multiple cores.prior (
PolarizationPrior|None) – The prior to use for the polarization probabilities. SeeKingmanPolarizationPriorandAdaptivePolarizationPriorfor more information. By default,KingmanPolarizationPrioris used. UseNonefor no prior.max_sites (
int) – The maximum number of sites to consider. This is useful if the number of sites is very large. Choosing a reasonably large subset of sites (on the order of a few thousand bi-allelic sites) can speed up the computation considerably as parsing can be slow. This subset is then used to calibrate the rate parameters, and possibly the polarization priors.seed (
int|None) – The seed for the random number generator. IfNone, a random seed is chosen. By default, the seed is set to 0.confidence_threshold (
float) –The confidence threshold for the ancestral allele annotation. Only if the probability of the major allele being ancestral as opposed to the minor allele is not within
((1 - confidence_threshold) / 2, 1 - (1 - confidence_threshold) / 2), the ancestral allele is annotated. This is useful to avoid annotating sites where the ancestral allele state is not clear. Use values close to0to annotate as many sites as possible, and values close to1to annotate only sites where the ancestral allele state is very clear.Warning
This threshold introduces a bias by excluding more sites with high-frequency derived alleles and should thus be kept at
0if the distribution of frequency counts is important, e.g., if the SFS is to be determined.n_target_sites (
int|None) – The total number of target sites if this class is used in conjunction withParserorAnnotator. This is useful if the provided set of sites only consists of bi-allelic sites. Specify here the total number of sites underlying the given dataset, i.e., both mono- and bi-allelic sites. Ignoring mono-allelic sites will lead to overestimation of the rate parameters. For this to work, a FASTA file must be provided from which the mono-allelic sites can be sampled. Sampling takes place between the variants of the last and first site on every contig considered in the VCF file. UseNoneto disable this feature. Note that the number of target sites is automatically carried over if not specified and this class is used together withParser. In order to use this feature, you also need to specify a FASTA file toParserorAnnotator. Also note that by default we extrapolate the number of mono-allelic sites to be sampled from the FASTA file based on the ratio of sites with called outgroup bases parsed from the VCF file (adjust_target_sites).n_samples_target_sites (
int|None) – The number of sites to sample from the FASTA file when determining the number of target sites (n_target_sites). From this the total number of target sites is extrapolated.adjust_target_sites (
bool) – Whether to adjust the number of target sites based on the parsed VCF sites relative to the total number of sites in the VCF. Defaults toTrue.subsample_mode (
Literal['random','probabilistic']) – The subsampling mode. Forrandom, we draw once without replacement from the set of all available ingroup genotypes per site. Forprobabilistic, we integrate over the hypergeometric distribution when parsing and computing the ancestral probabilities. Probabilistic subsampling requires a bit more time, but produces much more stable results, while requiring far fewer sites, so it is highly recommended.
- parallelize: bool#
Whether to parallelize the computation.
- max_sites: int#
Maximum number of sites to consider
- confidence_threshold: float#
The confidence threshold for the ancestral allele annotation.
- prior: PolarizationPrior | None#
The prior to use for the polarization probabilities.
- n_runs: int#
Number of random ML starts when determining the rate parameters
- model: SubstitutionModel#
The substitution model.
- configs: pd.DataFrame | None#
The data frame holding all site configurations.
- p_bins: Dict[str, np.ndarray | None]#
The probability of all sites per frequency bin.
- n_sites: int | None#
The total number of valid sites parsed (including sites not considered for ancestral allele inference).
- param_names: List[str]#
The parameter names in the order they are passed to the optimizer.
- likelihoods: np.ndarray | None#
The log likelihoods for the different runs when optimizing the rate parameters.
- likelihood: float | None#
The best log likelihood when optimizing the rate parameters.
- result: Optional['scipy.optimize.OptimizeResult']#
Optimization result of the best run.
- params_mle: Dict[str, float] | None#
The MLE parameters.
- params_mle_runs: pd.DataFrame | None#
The MLE parameters for all runs.
- n_target_sites: int | None#
The total number of target sites.
- n_samples_target_sites: int#
The number of sites to sample from the FASTA file when determining the number of target sites.
- adjust_target_sites: bool#
Whether to adjust the number of target sites based on the number of sites parsed from the VCF file.
- classmethod from_est_sfs(file: str, prior: PolarizationPrior | None = '', n_runs: int = 10, model: SubstitutionModel = None, parallelize: bool = True, seed: int = 0, chunk_size: int = 100000)[source]#
Create instance from EST-SFS input file.
- Parameters:
file (
str) – File containing EST-SFS-formatted input data.prior (
PolarizationPrior|None) – The prior to use for the polarization probabilities (see__init__()).n_runs (
int) – Number of runs for rate estimation (see__init__()).model (
SubstitutionModel) – The substitution model (see__init__()).parallelize (
bool) – Whether to parallelize the runs (see__init__()).seed (
int) – The seed to use for the random number generator.chunk_size (
int) – The chunk size for reading the file.
- Return type:
- Returns:
The instance.
- to_est_sfs(file: str)[source]#
Write the object’s state to an EST-SFS formatted file.
- Parameters:
file (
str) – The output file name.
- to_file(file: str)[source]#
Save object to file.
Note
References to the handler and the reader are discarded.
- Parameters:
file (
str) – File path.
- classmethod from_file(file: str)[source]#
Load object from file.
Note
The handler and the reader are not restored, so serialization is mostly useful for analyzing the results in detail, not for further processing.
- Parameters:
file (
str) – File path.- Return type:
- Returns:
The object.
- to_json()[source]#
Serialize object.
Note
References to the handler and the reader are discarded.
- Return type:
str- Returns:
JSON string
- classmethod from_json(json: str)[source]#
Load object from file.
Note
The handler and the reader are not restored, so serialization is mostly useful for analyzing the results in detail, not for further processing.
- Parameters:
json (
str) – JSON string.- Return type:
- Returns:
The object.
- classmethod from_data(n_major: Iterable[int], major_base: Iterable[str | int], minor_base: Iterable[str | int], outgroup_bases: Iterable[Iterable[str | int]], n_ingroups: int, n_runs: int = 10, model: SubstitutionModel = None, parallelize: bool = True, prior: PolarizationPrior | None = '', seed: int = 0, pass_indices: bool = False, confidence_threshold: float = 0)[source]#
Create an instance by passing the data directly.
- Parameters:
n_major (
Iterable[int]) – The number of major alleles per site. Note that this number has to be lower thann_ingroups, as we consider the number of major alleles of subsamples of sizen_ingroups.major_base (
Iterable[str|int]) – The major allele per site. A string representation of the base or the base index according to['A', 'C', 'G', 'T']ifpass_indicesisTrue. UseNoneif the base is not defined whenpass_indicesisFalseand-1whenpass_indicesisTrue.minor_base (
Iterable[str|int]) – The minor allele per site. A string representation of the base or the base index according to['A', 'C', 'G', 'T']ifpass_indicesisTrue. UseNoneif the base is not defined whenpass_indicesisFalseand-1whenpass_indicesisTrue.outgroup_bases (
Iterable[Iterable[str|int]]) – The outgroup alleles per site. A string representation of the base or the base index ifpass_indicesisTrue. This should be a list of lists, where the outer list corresponds to the sites and the inner list to the outgroups per site. All sites are required to have the same number of outgroups. UseNoneif the base is not defined whenpass_indicesisFalseand-1whenpass_indicesisTrue.n_ingroups (
int) – The number of ingroup samples (see__init__()).n_runs (
int) – The number of runs for rate estimation (see__init__()).model (
SubstitutionModel) – The substitution model (see__init__()).parallelize (
bool) – Whether to parallelize the runs.prior (
PolarizationPrior|None) – The prior to use for the polarization probabilities (see__init__()).seed (
int) – The seed for the random number generator.pass_indices (
bool) – Whether to pass the base indices instead of the bases.confidence_threshold (
float) – The confidence threshold for the ancestral allele annotation (see__init__()).
- Return type:
- Returns:
The instance.
- classmethod from_dataframe(data: DataFrame, n_ingroups: int, n_runs: int = 10, model: SubstitutionModel = None, parallelize: bool = True, prior: PolarizationPrior | None = '', seed: int = 0, grouped: bool = False, confidence_threshold: float = 0)[source]#
Create an instance from a dataframe.
- Parameters:
data (
DataFrame) – Dataframe with the columns:major_base,minor_base,outgroup_bases,n_majorof typeint,int,listandint, respectively. The outgroup bases should have the same length for every site.n_ingroups (
int) – The number of ingroups (see__init__()).n_runs (
int) – Number of runs for rate estimation (see__init__()).model (
SubstitutionModel) – The substitution model (see__init__()).parallelize (
bool) – Whether to parallelize computations.prior (
PolarizationPrior|None) – The prior to use for the polarization probabilities (see__init__()).seed (
int) – The seed for the random number generator. IfNone, a random seed is chosen.grouped (
bool) – Whether the dataframe is already grouped by all columns (used for internal purposes).confidence_threshold (
float) – The confidence threshold for the ancestral allele annotation (see__init__()).
- Return type:
- Returns:
The instance.
- infer()[source]#
Infer the ancestral allele probabilities for the data provided. This method is only supposed to be called manually if the data is provided directly, e.g. using
from_data(),from_dataframe()orfrom_est_sfs(). If the data is provided using a VCF file, this method is called automatically.
- set_mle_params(params: Dict[str, float])[source]#
Set the MLE parameters and update the cache and site configurations. Use this method if you want to use different parameters for the annotation.
- Parameters:
params (
Dict[str,float]) – The new parameters.
- is_monotonic()[source]#
Whether the outgroups are monotonically increasing in divergence.
- Return type:
bool- Returns:
Whether the outgroups are monotonically increasing in divergence.
- property p_polarization: ndarray | None#
Get the polarization probabilities or
Noneifpriorisno.
- static get_p_tree(base: int, n_outgroups: int, internal_nodes: List[int] | ndarray, outgroup_bases: List[int] | ndarray, params: Dict[str, float], model: SubstitutionModel)[source]#
Get the probability of a tree.
- Parameters:
base (
int) – An observed ingroup base index.n_outgroups (
int) – The number of outgroups.internal_nodes (
Union[List[int],ndarray]) – The internal nodes of the tree. We haven_outgroups - 1internal nodes.outgroup_bases (
Union[List[int],ndarray]) – The observed base indices for the outgroups.params (
Dict[str,float]) – The parameters of the model.model (
SubstitutionModel) – The model to use. Either ‘K2’ or ‘JC’.
- Return type:
float
- classmethod get_p_config(config: ~fastdfe.annotation.SiteConfig, base_type: ~fastdfe.annotation.BaseType, params: ~typing.Dict[str, float], model: ~fastdfe.annotation.SubstitutionModel = <fastdfe.annotation.K2SubstitutionModel object>, internal: ~numpy.ndarray | None = None)[source]#
Get the probability for a site configuration.
- Parameters:
config (
SiteConfig) – The site configuration.base_type (
BaseType) – The base type.params (
Dict[str,float]) – The parameters for the substitution model.model (
SubstitutionModel) – The substitution model to use.internal (
ndarray|None) – Base indices of internal nodes of the tree if fixed. IfNone, the internal nodes are considered as free parameters. -1 also indicates a free parameter. The number of internal nodes is the number of outgroups minus one.
- Return type:
float- Returns:
The probability for a site.
- classmethod get_p_configs(configs: DataFrame, model: SubstitutionModel, base_type: BaseType, params: Dict[str, float])[source]#
Get the probabilities for each site configuration.
- Parameters:
configs (
DataFrame) – The site configurations.model (
SubstitutionModel) – The substitution model.base_type (
BaseType) – The base type.params (
Dict[str,float]) – A dictionary of the rate parameters.
- Return type:
ndarray- Returns:
The probability for each site.
- evaluate_likelihood(params: Dict[str, float])[source]#
Evaluate the likelihood function for the rate parameters.
- Parameters:
params (
Dict[str,float]) – A dictionary of parameters.- Return type:
float- Returns:
The log likelihood.
- get_inferred_site_info()[source]#
Get the site information for the sites included in the parsing process. The sites are in the same order as parsed. You can use
get_site_info()to get the site information for a specific site.- Return type:
Generator[SiteInfo,None,None]- Returns:
A generator yielding a dictionary with the site information (see
get_site_info()).- Raises:
RuntimeError – If the subsample mode is
probabilistic.
- get_site_info(n_major: int, major_base: int | str, minor_base: int | str, outgroup_bases: List[int | str] | ndarray, pass_indices: bool = False)[source]#
Get information on the specified sites using the inferred parameters.
- Parameters:
n_major (
int) – The number of copies of the major allele.major_base (
int|str) – The major bases indices or strings.minor_base (
int|str) – The minor bases indices or strings.outgroup_bases (
Union[List[int|str],ndarray]) – The outgroup base indices or strings.pass_indices (
bool) – Whether to pass the indices as strings or convert them to integers.
- Return type:
- Returns:
The site information.
- annotate_site(variant: cyvcf2.Variant | DummyVariant)[source]#
Annotate a single site.
- Parameters:
variant (
Union[cyvcf2.Variant,DummyVariant]) – The variant to annotate.- Returns:
The annotated variant.
- plot_likelihoods(file: str = None, show: bool = True, title: str = 'rate likelihoods', scale: Literal['lin', 'log'] = 'lin', ax: plt.Axes = None, ylabel: str = 'lnl')[source]#
Visualize the likelihoods of the rate optimization runs using a scatter plot.
- Parameters:
scale (
Literal['lin','log']) – y-scale of the plot.title (
str) – Plot title.file (
str) – File to save plot to.show (
bool) – Whether to show plot.ax (plt.Axes) – Axes to plot on. Only for Python visualization backend.
ylabel (
str) – Label for y-axis.
- Return type:
plt.Axes
- Returns:
Axes object
- get_folded_spectra(groups: List[Literal['major_base', 'minor_base', 'outgroup_bases']] = ['major_base'])[source]#
Get the folded spectra for the parsed sites (used to estimate the parameters).
- Parameters:
groups (
List[Literal['major_base','minor_base','outgroup_bases']]) – The groups to group the spectra by.- Return type:
- Returns:
Spectra object
- get_outgroup_divergence()[source]#
Get the inferred branch rates between the ingroup and outgroups by combining the inferred branch rates.
- Return type:
ndarray- Returns:
One rate for each outgroup.
- get_observed_transition_transversion_ratio()[source]#
Get the observed transition/transversion ratio. Note that this may differ from the estimated ratio for
K2SubstitutionModel.- Return type:
float- Returns:
The observed transition/transversion ratio.
- static count_target_sites(file: str, remove_overlaps: bool = False, contigs: List[str] = None)#
Count the number of target sites in a GFF file.
- Parameters:
file (
str) – The path to The GFF file path, possibly gzipped or a URLremove_overlaps (
bool) – Whether to remove overlapping target sites.contigs (
List[str]) – The contigs to count the target sites for.
- Return type:
Dict[str,int]- Returns:
The number of target sites per chromosome/contig.
- classmethod get_base_index(base_string: str | ndarray)#
Get base index/indices from base string(s).
- Parameters:
base_string (
str|ndarray) – The base string(s).- Return type:
int|ndarray- Returns:
Base index/indices.
- static get_base_string(indices: int | ndarray)#
Get base string(s) from base index/indices.
- Parameters:
indices (
int|ndarray) – The base index/indices.- Return type:
str|ndarray- Returns:
Base string(s).
- ingroups: List[str] | None#
The ingroup samples to consider when determining the ancestral allele.
- exclude: List[str]#
The samples excluded from the ingroup.
- outgroups: List[str]#
The outgroup samples to consider when determining the ancestral allele.
- n_ingroups: int#
The number of ingroups.
- n_outgroups: int#
The number of outgroups.
- seed: int | None#
The seed for the random number generator.
- subsample_mode: Literal['random', 'probabilistic']#
The subsampling mode.
- rng: np.random.Generator#
The random number generator.
- n_annotated: int#
The number of annotated sites.
- class AdHocAncestralAnnotation(outgroups: List[str], n_ingroups: int, ingroups: List[str] | None = None, exclude: List[str] = [], seed: int | None = 0, subsample_mode: Literal['random', 'probabilistic'] = 'random')[source]#
Bases:
_OutgroupAncestralAlleleAnnotationAd-hoc ancestral allele annotation using simple rules. Used for testing and sanity checking.
- annotate_site(variant: cyvcf2.Variant | DummyVariant)[source]#
Annotate a single site. Mono-allelic sites are assigned the major allele as ancestral. Sites with more than two alleles are ignored.
- Parameters:
variant (
Union[cyvcf2.Variant,DummyVariant]) – The variant to annotate.- Returns:
The annotated variant.
- __init__(outgroups: List[str], n_ingroups: int, ingroups: List[str] | None = None, exclude: List[str] = [], seed: int | None = 0, subsample_mode: Literal['random', 'probabilistic'] = 'random')#
Create a new ancestral allele annotation instance.
- Parameters:
outgroups (
List[str]) – The outgroup samples to consider when determining the ancestral allele. A list of sample names as they appear in the VCF file.n_ingroups (
int) – The minimum number of ingroups that must be present at a site for it to be considered for ancestral allele inference.ingroups (
Optional[List[str]]) – The ingroup samples to consider when determining the ancestral allele. A list of sample names as they appear in the VCF file. IfNone, all samples except the outgroups are considered.exclude (
List[str]) – Samples to exclude from the ingroup. A list of sample names as they appear in the VCF file.seed (
int|None) – The seed for the random number generator.subsample_mode (
Literal['random','probabilistic']) – The subsampling mode. Either ‘random’ or ‘probabilistic’.
- static count_target_sites(file: str, remove_overlaps: bool = False, contigs: List[str] = None)#
Count the number of target sites in a GFF file.
- Parameters:
file (
str) – The path to The GFF file path, possibly gzipped or a URLremove_overlaps (
bool) – Whether to remove overlapping target sites.contigs (
List[str]) – The contigs to count the target sites for.
- Return type:
Dict[str,int]- Returns:
The number of target sites per chromosome/contig.
- classmethod get_base_index(base_string: str | ndarray)#
Get base index/indices from base string(s).
- Parameters:
base_string (
str|ndarray) – The base string(s).- Return type:
int|ndarray- Returns:
Base index/indices.
- static get_base_string(indices: int | ndarray)#
Get base string(s) from base index/indices.
- Parameters:
indices (
int|ndarray) – The base index/indices.- Return type:
str|ndarray- Returns:
Base string(s).
-
ingroups:
Optional[List[str]]# The ingroup samples to consider when determining the ancestral allele.
-
exclude:
List[str]# The samples excluded from the ingroup.
-
outgroups:
List[str]# The outgroup samples to consider when determining the ancestral allele.
-
n_ingroups:
int# The number of ingroups.
-
n_outgroups:
int# The number of outgroups.
-
seed:
int|None# The seed for the random number generator.
-
subsample_mode:
Literal['random','probabilistic']# The subsampling mode.
-
rng:
Generator# The random number generator.
-
n_annotated:
int# The number of annotated sites.
- class Annotator(vcf: str, output: str, annotations: List[Annotation], gff: str | None = None, fasta: str | None = None, info_ancestral: str = 'AA', max_sites: int = inf, seed: int | None = 0, cache: bool = True, aliases: Dict[str, List[str]] = {})[source]#
Bases:
MultiHandlerAnnotate a VCF file with the given annotations.
Example usage:
import fastdfe as fd ann = fd.Annotator( vcf="http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/" "1000_genomes_project/release/20181203_biallelic_SNV/" "ALL.chr21.shapeit2_integrated_v1a.GRCh38.20181129.phased.vcf.gz", fasta="http://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/" "dna/Homo_sapiens.GRCh38.dna.chromosome.21.fa.gz", gff="http://ftp.ensembl.org/pub/release-109/gff3/homo_sapiens/" "Homo_sapiens.GRCh38.109.chromosome.21.gff3.gz", output='sapiens.chr21.degeneracy.vcf.gz', annotations=[fd.DegeneracyAnnotation()], aliases=dict(chr21=['21']) ) ann.annotate()
- __init__(vcf: str, output: str, annotations: List[Annotation], gff: str | None = None, fasta: str | None = None, info_ancestral: str = 'AA', max_sites: int = inf, seed: int | None = 0, cache: bool = True, aliases: Dict[str, List[str]] = {})[source]#
Create a new annotator instance.
- Parameters:
vcf (
str) – The path to the VCF file, can be gzipped, urls are also supportedoutput (
str) – The path to the output fileannotations (
List[Annotation]) – The annotations to apply.gff (
str|None) – The path to the GFF file, can be gzipped, urls are also supported. Required for annotations that require a GFF file.fasta (
str|None) – The path to the FASTA file, can be gzipped, urls are also supported. Required for annotations that require a FASTA file.info_ancestral (
str) – The tag in the INFO field that contains the ancestral allelemax_sites (
int) – Maximum number of sites to considerseed (
int|None) – Seed for the random number generator. UseNonefor no seed.cache (
bool) – Whether to cache files downloaded from urlsaliases (
Dict[str,List[str]]) – Dictionary of aliases for the contigs in the VCF file, e.g.{'chr1': ['1']}. This is used to match the contig names in the VCF file with the contig names in the FASTA file and GFF file.
- output: str#
The path to the output file.
- annotations: List[Annotation]#
The annotations to apply.
- count_sites()#
Count the number of sites in the VCF.
- Return type:
int- Returns:
Number of sites
- classmethod download_file(url: str, cache: bool = True, desc: str = 'Downloading file')#
Download a file from a URL.
- Parameters:
cache (
bool) – Whether to cache the file.url (
str) – The URL to download the file from.desc (
str) – Description for the progress bar
- Return type:
str- Returns:
The path to the downloaded file.
- download_if_url(path: str)#
Download the VCF file if it is a URL.
- Parameters:
path (
str) – The path to the VCF file.- Return type:
str- Returns:
The path to the downloaded file or the original path.
- get_aliases(contig: str)#
Get all aliases for the given contig alias including the primary alias.
- Parameters:
contig (
str) – The contig.- Return type:
List[str]- Returns:
The aliases.
- get_contig(aliases, rewind: bool = True, notify: bool = True)#
Get the contig from the FASTA file.
Note that
pyfaidxwould be more efficient here, but there were problems when running it in parallel.- Parameters:
aliases – The contig aliases.
rewind (
bool) – Whether to allow for rewinding the iterator if the contig is not found.notify (
bool) – Whether to notify the user when rewinding the iterator.
- Return type:
SeqRecord- Returns:
The contig.
- get_contig_names()#
Get the names of the contigs in the FASTA file.
- Return type:
List[str]- Returns:
The contig names.
- static get_filename(url: str)#
Return the file extension of a URL.
- Parameters:
url (
str) – The URL to get the file extension from.- Returns:
The file extension.
- get_pbar(desc: str = 'Processing sites', total: int | None = 0)#
Return a progress bar for the number of sites.
- Parameters:
desc (
str) – Description for the progress bartotal (
int|None) – Total number of items
- Return type:
tqdm- Returns:
tqdm
- static hash(s: str)#
Return a truncated SHA1 hash of a string.
- Parameters:
s (
str) – The string to hash.- Return type:
str- Returns:
The SHA1 hash.
- static is_url(path: str)#
Check if the given path is a URL.
- Parameters:
path (
str) – The path to check.- Return type:
bool- Returns:
Trueif the path is a URL,Falseotherwise.
- load_fasta(file: str)#
Load a FASTA file into a dictionary.
- Parameters:
file (
str) – The path to The FASTA file path, possibly gzipped or a URL- Return type:
FastaIterator- Returns:
Iterator over the sequences.
- load_vcf()#
Load a VCF file into a dictionary.
- Return type:
cyvcf2.VCF
- Returns:
The VCF reader.
- property n_sites: int#
Get the number of sites in the VCF.
- Returns:
Number of sites
- static remove_overlaps(df: DataFrame)#
Remove overlapping coding sequences.
- Parameters:
df (
DataFrame) – The coding sequences.- Return type:
DataFrame- Returns:
The coding sequences without overlaps.
- static unzip_if_zipped(file: str)#
If the given file is gzipped, unzip it and return the path to the unzipped file. If the file is not gzipped, return the path to the original file.
- Parameters:
file (
str) – The path to the file.- Returns:
The path to the unzipped file, or the original file if it was not gzipped.
- vcf#
The path to the VCF file or an iterable of variants
- info_ancestral: str#
The tag in the INFO field that contains the ancestral allele
- max_sites: int#
Maximum number of sites to consider
- seed: Optional[int]#
Seed for the random number generator
- rng#
Random generator instance
- fasta: str#
The path to the FASTA file.
- gff#
The GFF file path
- cache: bool#
Whether to cache files that are downloaded from URLs
- aliases#
The contig mappings