VCF parsing#
A VCF parser that can be used to extract the site frequency spectrum (SFS) from a VCF file.
Stratifying the SFS is supported by providing a list of Stratification instances.
- class Stratification[source]#
Bases:
ABC
Abstract class for stratifying the SFS by determining a site’s type based on its properties.
-
n_valid:
int# The number of sites that didn’t have a type.
- abstract get_type(variant: cyvcf2.Variant | DummyVariant)[source]#
Get the type of the given variant. Only the types given by get_types() are valid, or None if no type could be determined.
- Parameters:
variant (Union[cyvcf2.Variant, DummyVariant]) – The VCF site
- Return type:
Optional[str]
- Returns:
Type of the variant
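The contract described above (get_types() lists the valid labels, get_type() returns one of them or None) can be sketched without fastdfe installed. Everything in this block is a hypothetical stand-in: ToyVariant only mimics a few cyvcf2.Variant fields, and ToyStratification is not the library's Stratification class.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyVariant:
    """Hypothetical stand-in for cyvcf2.Variant with only the fields used here."""
    CHROM: str
    POS: int
    REF: str
    ALT: List[str]


class ToyStratification(ABC):
    """Illustrative interface only; not fastdfe's Stratification class."""

    @abstractmethod
    def get_types(self) -> List[str]:
        """Return every type label this stratification can emit."""

    @abstractmethod
    def get_type(self, variant: ToyVariant) -> Optional[str]:
        """Return one of get_types(), or None if no type applies."""


class RefBaseStratification(ToyStratification):
    """Toy example: stratify sites by their reference base."""

    def get_types(self) -> List[str]:
        return ["A", "C", "G", "T"]

    def get_type(self, variant: ToyVariant) -> Optional[str]:
        # Anything other than a single canonical base gets no type.
        return variant.REF if variant.REF in self.get_types() else None


strat = RefBaseStratification()
print(strat.get_type(ToyVariant("1", 100, "A", ["G"])))   # A
print(strat.get_type(ToyVariant("1", 101, "AT", ["A"])))  # None
```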
- class SNPStratification[source]#
Bases:
Stratification, ABC
Abstract class for stratifications that can only handle SNPs. We need to issue a warning in this case.
- __init__()#
Create instance.
- abstract get_type(variant: cyvcf2.Variant | DummyVariant)#
Get the type of the given variant. Only the types given by get_types() are valid, or None if no type could be determined.
- Parameters:
variant (Union[cyvcf2.Variant, DummyVariant]) – The VCF site
- Return type:
Optional[str]
- Returns:
Type of the variant
- abstract get_types()#
Get all possible types.
- Return type:
List[str]
- Returns:
List of types
-
n_valid:
int# The number of sites that didn’t have a type.
- class BaseContextStratification(fasta: str, n_flanking: int = 1, aliases: Dict[str, List[str]] = {}, cache: bool = True)[source]#
Bases:
Stratification, FASTAHandler
Stratify the SFS by the base context of the mutation. The number of flanking bases can be configured. Note that we attempt to take the ancestral allele as the middle base. If skip_non_polarized is set to False, we use the reference allele as the middle base.
- __init__(fasta: str, n_flanking: int = 1, aliases: Dict[str, List[str]] = {}, cache: bool = True)[source]#
Create instance. A FASTA file is required so that the base context can be inferred.
- Parameters:
fasta (str) – The FASTA file path, possibly gzipped or a URL
n_flanking (int) – The number of flanking bases
aliases (Dict[str, List[str]]) – Dictionary of aliases for the contigs in the VCF file, e.g. {'chr1': ['1']}. This is used to match the contig names in the VCF file with the contig names in the FASTA file and GFF file.
cache (bool) – Whether to cache files that are downloaded from URLs
-
n_flanking:
int# The number of flanking bases
-
contig:
Optional[SeqRecord]# The current contig
- get_type(variant: cyvcf2.Variant | DummyVariant)[source]#
Get the base context for a given mutation
- Parameters:
variant (Union[cyvcf2.Variant, DummyVariant]) – The VCF site
- Return type:
str
- Returns:
Base context of the mutation
- get_types()[source]#
Create all possible base contexts.
- Return type:
List[str]
- Returns:
List of contexts
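A base context with n_flanking bases on either side can be illustrated as follows. This is only a sketch under stated assumptions: it reads the middle base from the sequence itself (whereas the class above prefers the ancestral allele), uses 0-based positions, and the label format (a plain string of 2 * n_flanking + 1 bases) and the function names are hypothetical.

```python
from itertools import product
from typing import List

BASES = "ACGT"


def base_context(sequence: str, pos: int, n_flanking: int = 1) -> str:
    """Return the bases from pos - n_flanking to pos + n_flanking (0-based pos)."""
    start, end = pos - n_flanking, pos + n_flanking + 1
    if start < 0 or end > len(sequence):
        raise ValueError("not enough flanking sequence around this position")
    return sequence[start:end].upper()


def all_contexts(n_flanking: int = 1) -> List[str]:
    """Enumerate every possible context of length 2 * n_flanking + 1."""
    return ["".join(p) for p in product(BASES, repeat=2 * n_flanking + 1)]


print(base_context("ttACGTa", pos=3, n_flanking=1))  # ACG
print(len(all_contexts(1)))                          # 64
```

The number of types grows as 4 ** (2 * n_flanking + 1), so larger flanks quickly spread the sites over many spectra.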
- classmethod download_file(url: str, cache: bool = True, desc: str = 'Downloading file')#
Download a file from a URL.
- Parameters:
cache (bool) – Whether to cache the file.
url (str) – The URL to download the file from.
desc (str) – Description for the progress bar
- Return type:
str
- Returns:
The path to the downloaded file.
- download_if_url(path: str)#
Download the VCF file if it is a URL.
- Parameters:
path (str) – The path to the VCF file.
- Return type:
str
- Returns:
The path to the downloaded file or the original path.
- get_aliases(contig: str)#
Get all aliases for the given contig, including the primary alias.
- Parameters:
contig (str) – The contig.
- Return type:
List[str]
- Returns:
The aliases.
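Alias resolution of this kind can be pictured with a small standalone function. It is a sketch, not the library's implementation: the aliases dictionary is passed explicitly here, and the behaviour for unknown contigs (returning only the contig itself) is an assumption.

```python
from typing import Dict, List


def get_aliases(contig: str, aliases: Dict[str, List[str]]) -> List[str]:
    """Return the primary name and all aliases of the group containing contig."""
    for primary, others in aliases.items():
        if contig == primary or contig in others:
            return [primary] + list(others)
    # Unknown contigs only match themselves.
    return [contig]


mapping = {"chr1": ["1"]}
print(get_aliases("1", mapping))     # ['chr1', '1']
print(get_aliases("chrX", mapping))  # ['chrX']
```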
- get_contig(aliases, rewind: bool = True, notify: bool = True)#
Get the contig from the FASTA file.
Note that pyfaidx would be more efficient here, but there were problems when running it in parallel.
- Parameters:
aliases – The contig aliases.
rewind (bool) – Whether to allow for rewinding the iterator if the contig is not found.
notify (bool) – Whether to notify the user when rewinding the iterator.
- Return type:
SeqRecord
- Returns:
The contig.
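The rewind behaviour can be sketched with a forward-only iterator that is reopened at most once per lookup. The ContigScanner class and the (name, sequence) tuples are hypothetical stand-ins for the FASTA iterator and SeqRecord objects; the notify option is omitted.

```python
from typing import Callable, Iterator, List, Tuple

Record = Tuple[str, str]  # (contig name, sequence); stand-in for a SeqRecord


class ContigScanner:
    """Scan records forward and reopen the source at most once per lookup."""

    def __init__(self, open_records: Callable[[], Iterator[Record]]):
        self._open = open_records
        self._it = open_records()

    def get_contig(self, aliases: List[str], rewind: bool = True) -> Record:
        for record in self._it:
            if record[0] in aliases:
                return record
        if rewind:
            # The contig may lie before the current position: start over once.
            self._it = self._open()
            return self.get_contig(aliases, rewind=False)
        raise LookupError(f"none of {aliases} found")


records = [("1", "ACGT"), ("2", "GGCC")]
scanner = ContigScanner(lambda: iter(records))
print(scanner.get_contig(["chr2", "2"])[1])  # GGCC
print(scanner.get_contig(["chr1", "1"])[1])  # ACGT, found after rewinding
```

Sequential scanning is cheap when contigs are requested in file order, which matches the parser's assumption of position-sorted input.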
- get_contig_names()#
Get the names of the contigs in the FASTA file.
- Return type:
List[str]
- Returns:
The contig names.
- static get_filename(url: str)#
Return the file extension of a URL.
- Parameters:
url (str) – The URL to get the file extension from.
- Returns:
The file extension.
- static hash(s: str)#
Return a truncated SHA1 hash of a string.
- Parameters:
s (str) – The string to hash.
- Return type:
str
- Returns:
The SHA1 hash.
- static is_url(path: str)#
Check if the given path is a URL.
- Parameters:
path (str) – The path to check.
- Return type:
bool
- Returns:
True if the path is a URL, False otherwise.
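The two small helpers above (a truncated SHA1 hash and a URL check) could look roughly like this with the standard library. The truncation length of 12 characters and the accepted URL schemes are assumptions made for illustration; the source does not state them.

```python
import hashlib
from urllib.parse import urlparse


def truncated_sha1(s: str, length: int = 12) -> str:
    """Return the first `length` hex characters of the SHA1 digest of s."""
    return hashlib.sha1(s.encode("utf-8")).hexdigest()[:length]


def is_url(path: str) -> bool:
    """Treat paths with an http, https or ftp scheme and a host as URLs."""
    parsed = urlparse(path)
    return parsed.scheme in {"http", "https", "ftp"} and bool(parsed.netloc)


print(truncated_sha1("abc"))                        # a9993e364706
print(is_url("http://ftp.ensembl.org/pub/x.fa.gz"))  # True
print(is_url("data/x.fa.gz"))                        # False
```

A short, stable hash of the URL is a convenient cache key for downloaded files.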
- load_fasta(file: str)#
Load a FASTA file and return an iterator over its sequences.
- Parameters:
file (str) – The path to the FASTA file, possibly gzipped or a URL
- Return type:
FastaIterator
- Returns:
Iterator over the sequences.
- static unzip_if_zipped(file: str)#
If the given file is gzipped, unzip it and return the path to the unzipped file. If the file is not gzipped, return the path to the original file.
- Parameters:
file (str) – The path to the file.
- Returns:
The path to the unzipped file, or the original file if it was not gzipped.
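One way to get this behaviour with the standard library is sketched below. Detecting gzip by its magic bytes and writing the result to a temporary file are assumptions of this sketch; the library may detect and store files differently.

```python
import gzip
import shutil
import tempfile
from pathlib import Path

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def unzip_if_zipped(file: str) -> str:
    """Return a path to uncompressed content; plain files are returned as-is."""
    with open(file, "rb") as fh:
        if fh.read(2) != GZIP_MAGIC:
            return file
    # Keep the inner extension (e.g. ".fa" for "genome.fa.gz").
    inner_name = file[:-3] if file.endswith(".gz") else file
    with tempfile.NamedTemporaryFile(delete=False, suffix=Path(inner_name).suffix) as out:
        with gzip.open(file, "rb") as src:
            shutil.copyfileobj(src, out)
    return out.name
```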
-
n_valid:
int# The number of sites that didn’t have a type.
-
fasta:
str# The path to the FASTA file.
-
cache:
bool# Whether to cache files that are downloaded from URLs
- aliases#
The contig mappings
- class BaseTransitionStratification[source]#
Bases:
SNPStratification
Stratify the SFS by the base transition of the mutation, e.g., A>T.
Warning
This stratification only works for SNPs. You thus need to update the number of mono-allelic sites manually.
- get_type(variant: cyvcf2.Variant | DummyVariant)[source]#
Get the base transition for the given variant.
- Parameters:
variant (Union[cyvcf2.Variant, DummyVariant]) – The VCF site
- Return type:
str
- Returns:
Base transition
- Raises:
NoTypeException – if no type could be determined
- get_types()[source]#
Get all possible base transitions.
- Return type:
List[str]
- Returns:
List of base transitions
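Assuming labels of the form ancestral>derived (as in the A>T example above), the set of possible base transitions is every ordered pair of distinct bases. The function names here are hypothetical and the sketch raises ValueError where the class raises NoTypeException.

```python
from itertools import permutations
from typing import List


def base_transition_types() -> List[str]:
    """All ordered pairs of distinct bases, written as 'ancestral>derived'."""
    return [f"{a}>{b}" for a, b in permutations("ACGT", 2)]


def base_transition(ancestral: str, derived: str) -> str:
    """Return the label for a SNP, or raise if it is not a valid base change."""
    label = f"{ancestral}>{derived}"
    if label not in base_transition_types():
        raise ValueError(f"not a SNP between canonical bases: {label}")
    return label


print(len(base_transition_types()))  # 12
print(base_transition("A", "T"))     # A>T
```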
- __init__()#
Create instance.
-
n_valid:
int# The number of sites that didn’t have a type.
- class TransitionTransversionStratification[source]#
Bases:
BaseTransitionStratification
Stratify the SFS by whether we have a transition or transversion.
Warning
This stratification only works for SNPs. You thus need to update the number of mono-allelic sites manually.
- get_type(variant: cyvcf2.Variant | DummyVariant)[source]#
Get the mutation type (transition or transversion) for a given mutation.
- Parameters:
variant (Union[cyvcf2.Variant, DummyVariant]) – The VCF site
- Return type:
str
- Returns:
Mutation type
- get_types()[source]#
All possible mutation types (transition and transversion).
- Return type:
List[str]
- Returns:
List of mutation types
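The classification itself follows the standard definition: a change within purines (A, G) or within pyrimidines (C, T) is a transition, anything else a transversion. The function name and the exact label strings below are illustrative assumptions.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def mutation_type(ancestral: str, derived: str) -> str:
    """Classify a SNP as 'transition' or 'transversion'."""
    bases = {ancestral, derived}
    if ancestral == derived or not bases <= PURINES | PYRIMIDINES:
        raise ValueError(f"not a SNP between canonical bases: {ancestral}>{derived}")
    if bases <= PURINES or bases <= PYRIMIDINES:
        return "transition"
    return "transversion"


print(mutation_type("A", "G"))  # transition
print(mutation_type("A", "T"))  # transversion
```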
- __init__()#
Create instance.
-
n_valid:
int# The number of sites that didn’t have a type.
- class AncestralBaseStratification[source]#
Bases:
Stratification
Stratify the SFS by the ancestral base of the mutation. If skip_non_polarized is set to False, we use the reference allele as ancestral base. By default, we use the AA tag to determine the ancestral allele.
Any subclass of AncestralAnnotation can be used to annotate the ancestral allele.
- get_type(variant: cyvcf2.Variant | DummyVariant)[source]#
Get the type which is the reference allele.
- Parameters:
variant (Union[cyvcf2.Variant, DummyVariant]) – The VCF site
- Return type:
str
- Returns:
reference allele
- __init__()#
Create instance.
-
n_valid:
int# The number of sites that didn’t have a type.
- class DegeneracyStratification(custom_callback: Callable[[cyvcf2.Variant], str] = None)[source]#
Bases:
Stratification
Stratify the SFS by degeneracy. We only consider sites which are 4-fold degenerate (neutral) or 0-fold degenerate (selected), which facilitates counting.
DegeneracyAnnotation can be used to annotate the degeneracy of a site.
- __init__(custom_callback: Callable[[cyvcf2.Variant], str] = None)[source]#
Initialize the stratification.
- Parameters:
custom_callback (Callable[[cyvcf2.Variant], str]) – Custom callback to determine the type of mutation
- get_degeneracy#
Custom callback to determine the degeneracy of mutation
- get_type(variant: cyvcf2.Variant | DummyVariant)[source]#
Get the degeneracy.
- Parameters:
variant (Union[cyvcf2.Variant, DummyVariant]) – The VCF site
- Return type:
Literal['neutral', 'selected']
- Returns:
Type of the mutation
- Raises:
NoTypeException – If the mutation is not synonymous or non-synonymous
- get_types()[source]#
Get all possible degeneracy types (neutral and selected).
- Return type:
List[str]
- Returns:
List of types
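The mapping described for this class (4-fold degenerate sites are neutral, 0-fold degenerate sites are selected, everything else has no type) is small enough to sketch directly. NoTypeError is a local stand-in for the library's NoTypeException, and passing the degeneracy as a plain integer is an assumption; in practice it would be read from the variant's annotation.

```python
from typing import Optional


class NoTypeError(Exception):
    """Local stand-in for the library's NoTypeException."""


def degeneracy_type(degeneracy: Optional[int]) -> str:
    """Map a site's degeneracy to 'neutral' (4-fold) or 'selected' (0-fold)."""
    if degeneracy == 4:
        return "neutral"
    if degeneracy == 0:
        return "selected"
    # 2-fold, 3-fold or unannotated sites are not counted.
    raise NoTypeError(f"degeneracy {degeneracy!r} is neither 0 nor 4")


print(degeneracy_type(4))  # neutral
print(degeneracy_type(0))  # selected
```

A custom callback with the signature shown in the constructor could implement a different rule while still returning one of the two labels.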
-
n_valid:
int# The number of sites that didn’t have a type.
- class SynonymyStratification[source]#
Bases:
SNPStratification
Stratify the SFS by synonymy (neutral or selected).
SynonymyAnnotation can be used to annotate the synonymy of a site.
Warning
This stratification only works for SNPs. You thus need to update the number of mono-allelic sites manually.
- get_types()[source]#
Get all possible synonymy types (neutral and selected).
- Return type:
List[str]
- Returns:
List of types
- get_type(variant: cyvcf2.Variant | DummyVariant)[source]#
Get the synonymy using the custom synonymy annotation.
- Parameters:
variant (Union[cyvcf2.Variant, DummyVariant]) – The VCF site
- Return type:
Literal['neutral', 'selected']
- Returns:
Type of the mutation, either neutral or selected
- __init__()#
Create instance.
-
n_valid:
int# The number of sites that didn’t have a type.
- class VEPStratification[source]#
Bases:
SynonymyStratification
Stratify the SFS by synonymy (neutral or selected) based on annotation provided by VEP.
Warning
This stratification only works for SNPs. You thus need to update the number of mono-allelic sites manually.
- info_tag = 'CSQ'#
The tag used by VEP to annotate the synonymy
- get_types()[source]#
Get all possible synonymy types (neutral and selected).
- Return type:
List[str]
- Returns:
List of types
- get_type(variant: cyvcf2.Variant | DummyVariant)[source]#
Get the synonymy of a site.
- Parameters:
variant (Union[cyvcf2.Variant, DummyVariant]) – The VCF site
- Return type:
Literal['neutral', 'selected']
- Returns:
Type of the mutation, either neutral or selected
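How an annotation string from the CSQ (VEP) or ANN (SnpEff) tag might be turned into a synonymy label is sketched below. It assumes the annotation contains Sequence Ontology terms such as synonymous_variant and missense_variant, separated by `,`, `|` or `&`; which terms the library actually checks, and which takes precedence when both occur, is not stated in the source.

```python
import re
from typing import Optional


def synonymy_from_annotation(annotation: str) -> Optional[str]:
    """Return 'neutral', 'selected' or None from a CSQ/ANN-style string."""
    # Split on the transcript (,), field (|) and multi-term (&) separators.
    terms = set(re.split(r"[,|&]", annotation))
    if "synonymous_variant" in terms:
        return "neutral"
    if "missense_variant" in terms:
        return "selected"
    return None


print(synonymy_from_annotation("A|synonymous_variant|LOW|GENE1"))  # neutral
print(synonymy_from_annotation("A|missense_variant&splice_region_variant|MODERATE"))  # selected
print(synonymy_from_annotation("A|intron_variant|MODIFIER"))  # None
```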
- __init__()#
Create instance.
-
n_valid:
int# The number of sites that didn’t have a type.
- class SnpEffStratification[source]#
Bases:
VEPStratification
Stratify the SFS by synonymy (neutral or selected) based on annotation provided by SnpEff.
Warning
This stratification only works for SNPs. You thus need to update the number of mono-allelic sites manually.
- info_tag = 'ANN'#
The tag used by SnpEff to annotate the synonymy
- __init__()#
Create instance.
- get_type(variant: cyvcf2.Variant | DummyVariant)#
Get the synonymy of a site.
- Parameters:
variant (Union[cyvcf2.Variant, DummyVariant]) – The VCF site
- Return type:
Literal['neutral', 'selected']
- Returns:
Type of the mutation, either neutral or selected
- get_types()#
Get all possible synonymy types (neutral and selected).
- Return type:
List[str]
- Returns:
List of types
-
n_valid:
int# The number of sites that didn’t have a type.
- class GenomePositionDependentStratification[source]#
Bases:
Stratification, ABC
- __init__()#
Create instance.
- abstract get_type(variant: cyvcf2.Variant | DummyVariant)#
Get the type of the given variant. Only the types given by get_types() are valid, or None if no type could be determined.
- Parameters:
variant (Union[cyvcf2.Variant, DummyVariant]) – The VCF site
- Return type:
Optional[str]
- Returns:
Type of the variant
- abstract get_types()#
Get all possible types.
- Return type:
List[str]
- Returns:
List of types
-
n_valid:
int# The number of sites that didn’t have a type.
- class ContigStratification(contigs: List[str] = None)[source]#
Bases:
GenomePositionDependentStratification
Stratify the SFS by contig.
- __init__(contigs: List[str] = None)[source]#
Initialize the stratification.
- Parameters:
contigs (List[str]) – List of contigs to stratify by. Defaults to all contigs in the VCF file.
-
contigs:
List[str]# List of contigs
- get_type(variant: cyvcf2.Variant | DummyVariant)[source]#
Get the contig.
- Parameters:
variant (Union[cyvcf2.Variant, DummyVariant]) – The VCF site
- Return type:
str
- Returns:
The contig name
-
n_valid:
int# The number of sites that didn’t have a type.
- class ChunkedStratification(n_chunks: int)[source]#
Bases:
GenomePositionDependentStratification
Stratify the SFS by creating n contiguous chunks of roughly equal size.
Note
Since the total number of sites is not known in advance, we cannot create contiguous chunks of exactly equal size.
- __init__(n_chunks: int)[source]#
Initialize the stratification.
- Parameters:
n_chunks (int) – Number of chunks
-
n_chunks:
int# Number of chunks
-
chunk_sizes:
Optional[List[int]]# List of chunk sizes
-
counter:
int# Number of sites seen so far
- get_types()[source]#
Get all possible chunk types.
- Return type:
List[str]
- Returns:
List of chunk types
- get_type(variant: cyvcf2.Variant | DummyVariant)[source]#
Get the type.
- Parameters:
variant (Union[cyvcf2.Variant, DummyVariant]) – The VCF site
- Return type:
str
- Returns:
The type
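The chunk_sizes and counter attributes suggest a scheme in which sites are assigned to chunks by a running counter. The sketch below shows one such scheme, assuming an estimate of the total number of sites is available beforehand (for example from a preliminary count); how the library actually derives chunk sizes is not documented here, and the function names are hypothetical.

```python
from typing import List


def chunk_sizes(n_sites: int, n_chunks: int) -> List[int]:
    """Split n_sites into n_chunks contiguous chunks differing by at most one site."""
    base, extra = divmod(n_sites, n_chunks)
    return [base + (1 if i < extra else 0) for i in range(n_chunks)]


def chunk_index(counter: int, sizes: List[int]) -> int:
    """Return the chunk that the counter-th site (0-based) falls into."""
    upper = 0
    for i, size in enumerate(sizes):
        upper += size
        if counter < upper:
            return i
    # Sites beyond the estimated total go into the last chunk.
    return len(sizes) - 1


sizes = chunk_sizes(10, 3)
print(sizes)                                       # [4, 3, 3]
print([chunk_index(c, sizes) for c in range(10)])  # [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
```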
-
n_valid:
int# The number of sites that didn’t have a type.
- class RandomStratification(n_bins: int, seed: int | None = 0)[source]#
Bases:
Stratification
Stratify the SFS randomly into a fixed number of bins. Can be used to analyze the expected sampling variance between different stratifications.
- __init__(n_bins: int, seed: int | None = 0)[source]#
Initialize random stratification.
- Parameters:
n_bins (int) – Number of bins to randomly assign sites to.
-
num_bins:
int# Number of bins
-
seed:
Optional[int]# Random seed for reproducibility
- rng#
Random generator instance
- get_type(variant: cyvcf2.Variant | DummyVariant)[source]#
Assign the variant to a random bin.
- Parameters:
variant (Union[cyvcf2.Variant, DummyVariant]) – The VCF site
- Return type:
str
- Returns:
Randomly chosen bin label
- get_types()[source]#
Get all possible bin labels.
- Return type:
List[str]
- Returns:
List of bin labels
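Seeded random binning can be illustrated with the standard library. The RandomBins class, the label format (bin0, bin1, ...) and the use of random.Random rather than the library's own generator are all assumptions of this sketch.

```python
import random
from typing import List, Optional


class RandomBins:
    """Assign each site to one of n_bins labels using a seeded generator."""

    def __init__(self, n_bins: int, seed: Optional[int] = 0):
        self.n_bins = n_bins
        self.rng = random.Random(seed)

    def get_types(self) -> List[str]:
        return [f"bin{i}" for i in range(self.n_bins)]

    def get_type(self) -> str:
        # The variant itself is irrelevant; only the generator state matters.
        return f"bin{self.rng.randrange(self.n_bins)}"


a, b = RandomBins(4, seed=1), RandomBins(4, seed=1)
print([a.get_type() for _ in range(5)] == [b.get_type() for _ in range(5)])  # True
```

With a fixed seed the assignment is reproducible across runs, so differences between the resulting spectra reflect sampling variance only.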
-
n_valid:
int# The number of sites that didn’t have a type.
- class TargetSiteCounter(n_target_sites: int, n_samples: int = 100000)[source]#
Bases:
object
Class for counting the number of target sites when parsing a VCF file that does not contain monomorphic sites. This class is used in conjunction with Parser and samples sites from the given FASTA file that are found in between variants on the same contig that were parsed in the VCF. Ideally, we obtain the SFS by parsing VCF files that contain both mono- and polymorphic sites. This is because we need to know the number of mutational opportunities for synonymous and non-synonymous sites, which carries plenty of information on the strength of selection. It is recommended to use a SNPFiltration when using this class to avoid biasing the result by monomorphic sites present in the VCF file.
Warning
This class is not compatible with stratifications based on info tags that are pre-defined in the VCF file, as opposed to those added dynamically using the annotations argument of the parser. Mono-allelic sites also need to be stratified but, in this case, are not present in the VCF file, so they have no info tags when sampled from the FASTA file and are thus ignored by the stratifications. With the annotations argument of the parser, however, the info tags the stratifications are based on are added on the fly, also for monomorphic sites sampled from the FASTA file.
- __init__(n_target_sites: int, n_samples: int = 100000)[source]#
Initialize counter.
- Parameters:
n_target_sites (int) – The total number of sites (mono- and polymorphic) that would be present in the VCF file if it contained monomorphic sites. This number should be considerably larger than the number of polymorphic sites in the VCF file. This value is not extremely important for the DFE inference, the ratio of synonymous to non-synonymous sites being more informative, but the order of magnitude should be correct in any case.
n_samples (int) – The number of sites to sample from the FASTA file. Many sampled sites will not be valid as they are non-coding. To obtain good estimates, a few thousand sites should be sampled per type of site (depending on the stratifications used).
-
n_target_sites:
int|None# The total number of sites considered when parsing the VCF
-
n_samples:
int# Number of samples
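One plausible way to turn sampled sites into monomorphic counts is proportional rescaling: the sampled sites give the relative share of each type, and the shares are scaled so that polymorphic plus monomorphic sites add up to n_target_sites. This is an illustrative assumption, not a description of the library's internals, and the function name is hypothetical.

```python
from typing import Dict


def rescale_monomorphic(
    sampled: Dict[str, int], n_polymorphic: int, n_target_sites: int
) -> Dict[str, float]:
    """Distribute the monomorphic sites over types in proportion to the sample."""
    n_sampled = sum(sampled.values())
    if n_sampled == 0:
        raise ValueError("no valid sites were sampled")
    n_monomorphic = n_target_sites - n_polymorphic
    return {t: n_monomorphic * count / n_sampled for t, count in sampled.items()}


counts = rescale_monomorphic(
    {"neutral": 250, "selected": 750}, n_polymorphic=1000, n_target_sites=101000
)
print(counts)  # {'neutral': 25000.0, 'selected': 75000.0}
```

The example shows why the ratio between types matters more than the exact value of n_target_sites: the total only scales both counts by the same factor.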
- class Parser(vcf: str, n: int, gff: str | None = None, fasta: str | None = None, info_ancestral: str = 'AA', info_ancestral_prob: str = 'AA_prob', skip_non_polarized: bool = True, stratifications: List[Stratification] = [], annotations: List[Annotation] = [], filtrations: List[Filtration] = None, include_samples: List[str] = None, exclude_samples: List[str] = None, max_sites: int = inf, seed: int | None = 0, cache: bool = True, aliases: Dict[str, List[str]] = {}, target_site_counter: TargetSiteCounter = None, subsample_mode: Literal['random', 'probabilistic'] = 'probabilistic', polarize_probabilistically: bool = False)[source]#
Bases:
MultiHandler
Parse site-frequency spectra from VCF files.
By default, the parser looks at the AA tag in the VCF file’s info field to retrieve the correct polarization. Polymorphic sites for which this tag is not well-defined are by default ignored (see skip_non_polarized).
This class also offers on-the-fly annotation of the VCF sites such as site degeneracy and ancestral allele state. This is done by providing a list of annotations to the parser which are applied in the order they are provided.
The parser also allows filtering sites based on site properties, which is done by passing a list of filtrations. By default, we filter out poly-allelic sites as sites are assumed to be at most bi-allelic.
In addition, the parser allows stratifying the SFS by providing a list of stratifications. This is useful to obtain the SFS for different types of sites for which we can jointly infer the DFEs using JointInference.
To correctly determine the number of target sites when parsing a VCF file that does not contain monomorphic sites, we can use a TargetSiteCounter. This class is used in conjunction with the parser and samples sites from the given FASTA file that are found in between variants on the same contig that were parsed in the VCF.
Note that we assume the sites in the VCF file to be sorted by position in ascending order (per contig).
Example usage:
import fastdfe as fd

# Parse selected and neutral SFS from human chromosome 21.
p = fd.Parser(
    vcf="https://ngs.sanger.ac.uk/production/hgdp/hgdp_wgs.20190516/"
        "hgdp_wgs.20190516.full.chr21.vcf.gz",
    fasta="http://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/"
          "dna/Homo_sapiens.GRCh38.dna.chromosome.21.fa.gz",
    gff="http://ftp.ensembl.org/pub/release-109/gff3/homo_sapiens/"
        "Homo_sapiens.GRCh38.109.chromosome.21.gff3.gz",
    aliases=dict(chr21=['21']),  # mapping for contig names
    n=10,  # SFS sample size
    # we use a target site counter to infer the number of target sites.
    target_site_counter=fd.TargetSiteCounter(
        n_samples=1000000,
        # determine number of target sites by looking at total length of coding sequences
        n_target_sites=fd.Annotation.count_target_sites(
            "http://ftp.ensembl.org/pub/release-109/gff3/homo_sapiens/"
            "Homo_sapiens.GRCh38.109.chromosome.21.gff3.gz"
        )['21']
    ),
    # add degeneracy annotation for sites
    annotations=[
        fd.DegeneracyAnnotation()
    ],
    filtrations=[
        # exclude non-SNPs as we infer monomorphic sites with target site counter
        fd.SNPFiltration(),
        # filter out sites not in coding sequences
        fd.CodingSequenceFiltration()
    ],
    # stratify by 4-fold/0-fold degeneracy
    stratifications=[fd.DegeneracyStratification()],
    info_ancestral='AA_ensembl'
)

sfs = p.parse()

sfs.plot()
- __init__(vcf: str, n: int, gff: str | None = None, fasta: str | None = None, info_ancestral: str = 'AA', info_ancestral_prob: str = 'AA_prob', skip_non_polarized: bool = True, stratifications: List[Stratification] = [], annotations: List[Annotation] = [], filtrations: List[Filtration] = None, include_samples: List[str] = None, exclude_samples: List[str] = None, max_sites: int = inf, seed: int | None = 0, cache: bool = True, aliases: Dict[str, List[str]] = {}, target_site_counter: TargetSiteCounter = None, subsample_mode: Literal['random', 'probabilistic'] = 'probabilistic', polarize_probabilistically: bool = False)[source]#
Initialize the parser.
- Parameters:
vcf (str) – The path to the VCF file, can be gzipped or a URL.
gff (str | None) – The path to the GFF file, possibly gzipped or a URL. This file is optional and depends on the stratifications, annotations and filtrations that are used.
fasta (str | None) – The path to the FASTA file, possibly gzipped or a URL. This file is optional and depends on the annotations and filtrations that are used.
n (int) – The size of the resulting SFS. We down-sample to this number by drawing without replacement from the set of all available genotypes per site. Sites with fewer than n genotypes are skipped.
info_ancestral (str) – The tag in the INFO field that contains ancestral allele information. Consider using an ancestral allele annotation if this information is not available yet.
skip_non_polarized (bool) – Whether to skip polymorphic sites that are not polarized, i.e., without a valid info tag providing the ancestral allele. If False, we use the reference allele as ancestral allele (only recommended if working with folded spectra).
stratifications (List[Stratification]) – List of stratifications to use.
annotations (List[Annotation]) – List of annotations to use.
filtrations (List[Filtration]) – List of filtrations to use. By default, we use PolyAllelicFiltration.
include_samples (List[str]) – List of sample names to consider when determining the SFS. If None, all samples are used. Note that this restriction does not apply to the annotations and filtrations.
exclude_samples (List[str]) – List of sample names to exclude when determining the SFS. If None, no samples are excluded. Note that this restriction does not apply to the annotations and filtrations.
max_sites (int) – Maximum number of sites to parse from the VCF file.
seed (int | None) – Seed for the random number generator. Use None for no seed.
cache (bool) – Whether to cache files downloaded from URLs.
aliases (Dict[str, List[str]]) – Dictionary of aliases for the contigs in the VCF file, e.g. {'chr1': ['1']}. This is used to match the contig names in the VCF file with the contig names in the FASTA file and GFF file.
target_site_counter (TargetSiteCounter) – The target site counter. If None, we do not sample target sites.
subsample_mode (Literal['random', 'probabilistic']) – The subsampling mode. For random, we draw once without replacement from the set of all available genotypes per site. For probabilistic, we add up the hypergeometric distribution for all sites. This will produce a smoother SFS, especially when a small number of sites is considered.
polarize_probabilistically (bool) – Whether to probabilistically polarize sites. In addition to the AA tag (see info_ancestral), we use the AA_prob tag (see info_ancestral_prob) to polarize sites probabilistically. For example, if the ancestral allele is A with a probability of 0.8 and the derived allele is G, we assign 0.8 probability mass to the ancestral allele and 0.2 to the derived allele. This should enhance accuracy, especially for small datasets. Whenever the ancestral probability tag is not present, we assume a probability of 1 for the ancestral allele.
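The probabilistic subsampling and probabilistic polarization described for subsample_mode and polarize_probabilistically can be illustrated for a single site. This is a sketch under stated assumptions, not the parser's code: a site with n_total genotypes and n_derived derived alleles contributes the hypergeometric probabilities of seeing j derived alleles in a sample of size n, and with ancestral probability p the contribution is mixed with that of the flipped polarization. The function names are hypothetical.

```python
from math import comb
from typing import List


def hypergeom_sfs(n_total: int, n_derived: int, n: int) -> List[float]:
    """P(j derived alleles among n genotypes drawn without replacement), j = 0..n."""
    denom = comb(n_total, n)
    return [
        comb(n_derived, j) * comb(n_total - n_derived, n - j) / denom
        for j in range(n + 1)
    ]


def site_contribution(n_total: int, n_derived: int, n: int, p_ancestral: float = 1.0) -> List[float]:
    """Mix the called polarization (weight p) with the flipped one (weight 1 - p)."""
    called = hypergeom_sfs(n_total, n_derived, n)
    flipped = hypergeom_sfs(n_total, n_total - n_derived, n)
    return [p_ancestral * a + (1 - p_ancestral) * b for a, b in zip(called, flipped)]


print(hypergeom_sfs(4, 2, 2))  # one sixth, two thirds, one sixth
print([round(x, 3) for x in site_contribution(4, 1, 2, p_ancestral=0.8)])  # [0.4, 0.5, 0.1]
```

Summing such vectors over all sites yields fractional counts per frequency class, which is why this mode gives a smoother spectrum than a single random draw per site.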
-
target_site_counter:
TargetSiteCounter|None# The target site counter
-
n:
int# The number of individuals in the sample
-
include_samples:
Optional[List[str]]# The list of samples to include
-
exclude_samples:
Optional[List[str]]# The list of samples to exclude
-
skip_non_polarized:
bool# Whether to skip sites that are not polarized, i.e., without a valid info tag providing the ancestral allele
-
stratifications:
List[Stratification]# List of stratifications to use
-
annotations:
List[Annotation]# List of annotations to use
-
filtrations:
List[Filtration]# List of filtrations to use
-
n_skipped:
int# The number of sites that were skipped for various reasons
-
n_no_ancestral:
int# The number of sites that were skipped because they had no valid ancestral allele
-
sfs:
Dict[str,ndarray]# Dictionary of SFS indexed by joint type
-
subsample_mode:
Literal['random','probabilistic']# The subsampling mode
-
info_ancestral_prob:
str# The tag in the INFO field that contains the ancestral allele probability
-
polarize_probabilistically:
bool# Whether to probabilistically polarize sites
- parse()[source]#
Parse the VCF file.
- Return type:
Spectra
- Returns:
The spectra for the different stratifications
- count_sites()#
Count the number of sites in the VCF.
- Return type:
int
- Returns:
Number of sites
- classmethod download_file(url: str, cache: bool = True, desc: str = 'Downloading file')#
Download a file from a URL.
- Parameters:
cache (bool) – Whether to cache the file.
url (str) – The URL to download the file from.
desc (str) – Description for the progress bar
- Return type:
str
- Returns:
The path to the downloaded file.
- download_if_url(path: str)#
Download the VCF file if it is a URL.
- Parameters:
path (str) – The path to the VCF file.
- Return type:
str
- Returns:
The path to the downloaded file or the original path.
- get_aliases(contig: str)#
Get all aliases for the given contig, including the primary alias.
- Parameters:
contig (str) – The contig.
- Return type:
List[str]
- Returns:
The aliases.
- get_contig(aliases, rewind: bool = True, notify: bool = True)#
Get the contig from the FASTA file.
Note that pyfaidx would be more efficient here, but there were problems when running it in parallel.
- Parameters:
aliases – The contig aliases.
rewind (bool) – Whether to allow for rewinding the iterator if the contig is not found.
notify (bool) – Whether to notify the user when rewinding the iterator.
- Return type:
SeqRecord
- Returns:
The contig.
- get_contig_names()#
Get the names of the contigs in the FASTA file.
- Return type:
List[str]
- Returns:
The contig names.
- static get_filename(url: str)#
Return the file extension of a URL.
- Parameters:
url (str) – The URL to get the file extension from.
- Returns:
The file extension.
- get_pbar(desc: str = 'Processing sites', total: int | None = 0)#
Return a progress bar for the number of sites.
- Parameters:
desc (str) – Description for the progress bar
total (int | None) – Total number of items
- Return type:
tqdm
- Returns:
tqdm
- static hash(s: str)#
Return a truncated SHA1 hash of a string.
- Parameters:
s (str) – The string to hash.
- Return type:
str
- Returns:
The SHA1 hash.
- static is_url(path: str)#
Check if the given path is a URL.
- Parameters:
path (str) – The path to check.
- Return type:
bool
- Returns:
True if the path is a URL, False otherwise.
- load_fasta(file: str)#
Load a FASTA file and return an iterator over its sequences.
- Parameters:
file (str) – The path to the FASTA file, possibly gzipped or a URL
- Return type:
FastaIterator
- Returns:
Iterator over the sequences.
- load_vcf()#
Load the VCF file and return a reader for it.
- Return type:
cyvcf2.VCF
- Returns:
The VCF reader.
- property n_sites: int#
Get the number of sites in the VCF.
- Returns:
Number of sites
- static remove_overlaps(df: DataFrame)#
Remove overlapping coding sequences.
- Parameters:
df (DataFrame) – The coding sequences.
- Return type:
DataFrame
- Returns:
The coding sequences without overlaps.
- static unzip_if_zipped(file: str)#
If the given file is gzipped, unzip it and return the path to the unzipped file. If the file is not gzipped, return the path to the original file.
- Parameters:
file (str) – The path to the file.
- Returns:
The path to the unzipped file, or the original file if it was not gzipped.
- vcf#
The path to the VCF file or an iterable of variants
-
info_ancestral:
str# The tag in the INFO field that contains the ancestral allele
-
max_sites:
int# Maximum number of sites to consider
-
seed:
Optional[int]# Seed for the random number generator
- rng#
Random generator instance
-
fasta:
str# The path to the FASTA file.
- gff#
The GFF file path
-
cache:
bool# Whether to cache files that are downloaded from URLs
- aliases#
The contig mappings