VCF parsing


A VCF parser for extracting the site frequency spectrum (SFS) from a VCF file. The SFS can be stratified by passing a list of Stratification instances.

class Stratification[source]#

Bases: ABC

Abstract class for stratifying the SFS by determining a site’s type based on its properties.

__init__()[source]#

Create instance.

parser: Optional[Parser]#

Parser instance

n_valid: int#

The number of sites for which a valid type was determined.

abstract get_type(variant: cyvcf2.Variant | DummyVariant)[source]#

Get the type of the given variant. The returned value is either one of the types given by get_types() or None if no type could be determined.

Parameters:

variant (Union[cyvcf2.Variant, DummyVariant]) – The vcf site

Return type:

Optional[str]

Returns:

Type of the variant

abstract get_types()[source]#

Get all possible types.

Return type:

List[str]

Returns:

List of types
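The interface above can be illustrated with a minimal sketch. To stay self-contained it does not import fastdfe: `StratificationSketch` is a stand-in that mirrors only the two abstract methods documented here, `SiteStub` is a hypothetical substitute for cyvcf2.Variant that models just `REF`, and `GCStratification` is an invented example type. With fastdfe installed, one would presumably subclass Stratification directly.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SiteStub:
    """Hypothetical stand-in for cyvcf2.Variant; only REF is modelled."""
    REF: str


class StratificationSketch(ABC):
    """Stand-in mirroring the two abstract methods documented above."""

    @abstractmethod
    def get_type(self, variant: SiteStub) -> Optional[str]:
        ...

    @abstractmethod
    def get_types(self) -> List[str]:
        ...


class GCStratification(StratificationSketch):
    """Illustrative stratification: strong (G/C) vs. weak (A/T) reference base."""

    def get_types(self) -> List[str]:
        return ["strong", "weak"]

    def get_type(self, variant: SiteStub) -> Optional[str]:
        if variant.REF in ("G", "C"):
            return "strong"
        if variant.REF in ("A", "T"):
            return "weak"
        return None  # no type could be determined, e.g. for 'N'


strat = GCStratification()
types = [strat.get_type(SiteStub(base)) for base in "GATN"]
```

Every value returned by `get_type` is either listed by `get_types` or `None`, which is the contract described for the abstract class.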

class SNPStratification[source]#

Bases: Stratification, ABC

Abstract class for stratifications that can only handle SNPs. We need to issue a warning in this case.

__init__()#

Create instance.

abstract get_type(variant: cyvcf2.Variant | DummyVariant)#

Get the type of the given variant. The returned value is either one of the types given by get_types() or None if no type could be determined.

Parameters:

variant (Union[cyvcf2.Variant, DummyVariant]) – The vcf site

Return type:

Optional[str]

Returns:

Type of the variant

abstract get_types()#

Get all possible types.

Return type:

List[str]

Returns:

List of types

parser: Optional[Parser]#

Parser instance

n_valid: int#

The number of sites for which a valid type was determined.

class BaseContextStratification(fasta: str, n_flanking: int = 1, aliases: Dict[str, List[str]] = {}, cache: bool = True)[source]#

Bases: Stratification, FASTAHandler

Stratify the SFS by the base context of the mutation. The number of flanking bases can be configured. We attempt to use the ancestral allele as the middle base. If skip_non_polarized is set to False, we use the reference allele as the middle base for sites without a valid ancestral allele.

__init__(fasta: str, n_flanking: int = 1, aliases: Dict[str, List[str]] = {}, cache: bool = True)[source]#

Create instance. A FASTA file is required so that the base context can be inferred.

Parameters:
  • fasta (str) – The fasta file path, possibly gzipped or a URL

  • n_flanking (int) – The number of flanking bases

  • aliases (Dict[str, List[str]]) – Dictionary of aliases for the contigs in the VCF file, e.g. {'chr1': ['1']}. This is used to match the contig names in the VCF file with the contig names in the FASTA file and GFF file.

  • cache (bool) – Whether to cache files that are downloaded from URLs

n_flanking: int#

The number of flanking bases

contig: Optional[SeqRecord]#

The current contig

get_type(variant: cyvcf2.Variant | DummyVariant)[source]#

Get the base context for a given mutation

Parameters:

variant (Union[cyvcf2.Variant, DummyVariant]) – The vcf site

Return type:

str

Returns:

Base context of the mutation

get_types()[source]#

Create all possible base contexts.

Return type:

List[str]

Returns:

List of contexts
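As a rough illustration of what "all possible base contexts" means, the sketch below enumerates every sequence of length 2 * n_flanking + 1 and extracts the context around one position. The plain-string context format, the 0-based `pos` argument, and the helper names `all_contexts` and `context_at` are assumptions made for this example, not the library's own representation.

```python
from itertools import product
from typing import List

BASES = "ACGT"


def all_contexts(n_flanking: int = 1) -> List[str]:
    """Enumerate every sequence of length 2 * n_flanking + 1 over A, C, G, T."""
    return ["".join(p) for p in product(BASES, repeat=2 * n_flanking + 1)]


def context_at(sequence: str, pos: int, middle: str, n_flanking: int = 1) -> str:
    """
    Build the context around 0-based position `pos`, substituting `middle`
    (e.g. the ancestral allele) for the base found in the sequence.
    """
    if pos - n_flanking < 0 or pos + n_flanking >= len(sequence):
        raise ValueError("not enough flanking bases around this position")
    left = sequence[pos - n_flanking:pos]
    right = sequence[pos + 1:pos + 1 + n_flanking]
    return left + middle + right


contexts = all_contexts(n_flanking=1)
example = context_at("GATTACA", pos=3, middle="C")
```

With one flanking base on each side there are 4^3 = 64 contexts, and the count grows by a factor of 16 for each additional flanking base.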

classmethod download_file(url: str, cache: bool = True, desc: str = 'Downloading file')#

Download a file from a URL.

Parameters:
  • cache (bool) – Whether to cache the file.

  • url (str) – The URL to download the file from.

  • desc (str) – Description for the progress bar

Return type:

str

Returns:

The path to the downloaded file.

download_if_url(path: str)#

Download the VCF file if it is a URL.

Parameters:

path (str) – The path to the VCF file.

Return type:

str

Returns:

The path to the downloaded file or the original path.

get_aliases(contig: str)#

Get all aliases for the given contig alias including the primary alias.

Parameters:

contig (str) – The contig.

Return type:

List[str]

Returns:

The aliases.
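The alias lookup can be pictured with the following sketch, which uses the `{'chr1': ['1']}` dictionary format from the aliases parameter. The standalone function and its behaviour for unknown contigs (returning just the name itself) are assumptions for illustration.

```python
from typing import Dict, List


def get_aliases(contig: str, aliases: Dict[str, List[str]]) -> List[str]:
    """Return the primary name followed by its aliases, or just `contig` if unknown."""
    for primary, others in aliases.items():
        if contig == primary or contig in others:
            return [primary] + list(others)
    return [contig]


aliases = {"chr1": ["1"], "chr21": ["21"]}
from_alias = get_aliases("21", aliases)
from_primary = get_aliases("chr1", aliases)
unknown = get_aliases("X", aliases)
```

Looking up either the primary name or one of its aliases yields the same list, which is what allows contig names in the VCF file to be matched against those in the FASTA and GFF files.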

get_contig(aliases, rewind: bool = True, notify: bool = True)#

Get the contig from the FASTA file.

Note that pyfaidx would be more efficient here, but there were problems when running it in parallel.

Parameters:
  • aliases – The contig aliases.

  • rewind (bool) – Whether to allow for rewinding the iterator if the contig is not found.

  • notify (bool) – Whether to notify the user when rewinding the iterator.

Return type:

SeqRecord

Returns:

The contig.

get_contig_names()#

Get the names of the contigs in the FASTA file.

Return type:

List[str]

Returns:

The contig names.

static get_filename(url: str)#

Return the file extension of a URL.

Parameters:

url (str) – The URL to get the file extension from.

Returns:

The file extension.

static hash(s: str)#

Return a truncated SHA1 hash of a string.

Parameters:

s (str) – The string to hash.

Return type:

str

Returns:

The SHA1 hash.

static is_url(path: str)#

Check if the given path is a URL.

Parameters:

path (str) – The path to check.

Return type:

bool

Returns:

True if the path is a URL, False otherwise.
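A URL check and a hash-based cache name along these lines can be sketched with the standard library alone. The accepted schemes, the truncation length of 12 characters, and the helper name `cache_name` are assumptions; the documentation above only states that the hash is a truncated SHA1.

```python
import hashlib
from urllib.parse import urlparse


def is_url(path: str) -> bool:
    """Treat a path as a URL if it has a known remote scheme and a host."""
    parsed = urlparse(path)
    return parsed.scheme in ("http", "https", "ftp") and bool(parsed.netloc)


def cache_name(url: str, length: int = 12) -> str:
    """Derive a short, deterministic name from a URL via a truncated SHA1 digest."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()[:length]


remote = "http://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/dna/file.fa.gz"
local = "data/file.fa.gz"
name = cache_name(remote)
```

Restricting the check to a fixed set of schemes avoids misclassifying Windows paths such as `C:\data\file.fa`, whose drive letter would otherwise parse as a scheme.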

load_fasta(file: str)#

Load a FASTA file and return an iterator over its sequences.

Parameters:

file (str) – The path to the FASTA file, possibly gzipped or a URL

Return type:

FastaIterator

Returns:

Iterator over the sequences.

static unzip_if_zipped(file: str)#

If the given file is gzipped, unzip it and return the path to the unzipped file. If the file is not gzipped, return the path to the original file.

Parameters:

file (str) – The path to the file.

Returns:

The path to the unzipped file, or the original file if it was not gzipped.

parser: Optional[Parser]#

Parser instance

n_valid: int#

The number of sites for which a valid type was determined.

fasta: str#

The path to the FASTA file.

cache: bool#

Whether to cache files that are downloaded from URLs

aliases#

The contig mappings

class BaseTransitionStratification[source]#

Bases: SNPStratification

Stratify the SFS by the base transition of the mutation, e.g., A>T.

Warning

This stratification only works for SNPs. You thus need to update the number of mono-allelic sites manually.

get_type(variant: cyvcf2.Variant | DummyVariant)[source]#

Get the base transition for the given variant.

Parameters:

variant (Union[cyvcf2.Variant, DummyVariant]) – The vcf site

Return type:

str

Returns:

Base transition

Raises:

NoTypeException – if no type could be determined

get_types()[source]#

Get all possible base transitions.

Return type:

List[str]

Returns:

List of base transitions

__init__()#

Create instance.

parser: Optional[Parser]#

Parser instance

n_valid: int#

The number of sites for which a valid type was determined.

class TransitionTransversionStratification[source]#

Bases: BaseTransitionStratification

Stratify the SFS by whether we have a transition or transversion.

Warning

This stratification only works for SNPs. You thus need to update the number of mono-allelic sites manually.

get_type(variant: cyvcf2.Variant | DummyVariant)[source]#

Get the mutation type (transition or transversion) for a given mutation.

Parameters:

variant (Union[cyvcf2.Variant, DummyVariant]) – The vcf site

Return type:

str

Returns:

Mutation type

get_types()[source]#

All possible mutation types (transition and transversion).

Return type:

List[str]

Returns:

List of mutation types
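The distinction between the two SNP stratifications can be made concrete with a small sketch. Transitions are substitutions within purines (A, G) or within pyrimidines (C, T); all other substitutions are transversions. The function names and the `'transition'`/`'transversion'` labels are chosen for this example and are not confirmed to match the strings the library returns.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def base_transition(from_base: str, to_base: str) -> str:
    """Label a substitution in the 'A>T' style used above."""
    return f"{from_base}>{to_base}"


def transition_or_transversion(from_base: str, to_base: str) -> str:
    """Transitions stay within purines or within pyrimidines; all else are transversions."""
    pair = {from_base, to_base}
    if from_base == to_base or pair - (PURINES | PYRIMIDINES):
        raise ValueError("expected two distinct bases out of A, C, G, T")
    same_class = pair <= PURINES or pair <= PYRIMIDINES
    return "transition" if same_class else "transversion"


labels = {
    base_transition(a, b): transition_or_transversion(a, b)
    for a in "ACGT" for b in "ACGT" if a != b
}
```

Of the 12 possible base transitions, 4 are transitions and 8 are transversions, so the coarser stratification collapses 12 types into 2.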

__init__()#

Create instance.

parser: Optional[Parser]#

Parser instance

n_valid: int#

The number of sites for which a valid type was determined.

class AncestralBaseStratification[source]#

Bases: Stratification

Stratify the SFS by the ancestral base of the site. By default, the ancestral allele is taken from the AA tag. If skip_non_polarized is set to False, we use the reference allele as the ancestral base.

Any subclass of AncestralAnnotation can be used to annotate the ancestral allele.

get_type(variant: cyvcf2.Variant | DummyVariant)[source]#

Get the type which is the reference allele.

Parameters:

variant (Union[cyvcf2.Variant, DummyVariant]) – The vcf site

Return type:

str

Returns:

reference allele

get_types()[source]#

The possible base types.

Return type:

List[str]

Returns:

List of bases

__init__()#

Create instance.

parser: Optional[Parser]#

Parser instance

n_valid: int#

The number of sites for which a valid type was determined.

class DegeneracyStratification(custom_callback: Callable[[cyvcf2.Variant], str] = None)[source]#

Bases: Stratification

Stratify SFS by degeneracy. We only consider sites which are 4-fold degenerate (neutral) or 0-fold degenerate (selected), which facilitates counting.

DegeneracyAnnotation can be used to annotate the degeneracy of a site.

__init__(custom_callback: Callable[[cyvcf2.Variant], str] = None)[source]#

Initialize the stratification.

Parameters:

custom_callback (Callable[[cyvcf2.Variant], str]) – Custom callback to determine the type of mutation

get_degeneracy#

Custom callback to determine the degeneracy of a mutation

get_type(variant: cyvcf2.Variant | DummyVariant)[source]#

Get the degeneracy.

Parameters:

variant (Union[cyvcf2.Variant, DummyVariant]) – The vcf site

Return type:

Literal['neutral', 'selected']

Returns:

Type of the mutation

Raises:

NoTypeException – If the site is neither 4-fold nor 0-fold degenerate

get_types()[source]#

Get all possible degeneracy types (neutral and selected).

Return type:

List[str]

Returns:

List of degeneracy types

parser: Optional[Parser]#

Parser instance

n_valid: int#

The number of sites for which a valid type was determined.
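A custom callback for this stratification might look like the sketch below. It is written against stand-ins so that it runs without fastdfe or cyvcf2: `SiteStub` is a hypothetical substitute for cyvcf2.Variant that models only `INFO`, `NoTypeError` stands in for the NoTypeException mentioned above, and the `'Degeneracy'` INFO key is an assumed tag name.

```python
from dataclasses import dataclass, field
from typing import Dict


class NoTypeError(Exception):
    """Hypothetical stand-in for the NoTypeException mentioned above."""


@dataclass
class SiteStub:
    """Hypothetical stand-in for cyvcf2.Variant; only INFO is modelled."""
    INFO: Dict[str, object] = field(default_factory=dict)


def degeneracy_callback(variant: SiteStub) -> str:
    """Map an assumed 'Degeneracy' INFO value to 'neutral' (4) or 'selected' (0)."""
    degeneracy = variant.INFO.get("Degeneracy")
    if degeneracy == 4:
        return "neutral"
    if degeneracy == 0:
        return "selected"
    raise NoTypeError(f"unsupported degeneracy: {degeneracy!r}")


four_fold = degeneracy_callback(SiteStub({"Degeneracy": 4}))
zero_fold = degeneracy_callback(SiteStub({"Degeneracy": 0}))
```

A function with this shape matches the documented `Callable[[cyvcf2.Variant], str]` signature, so it could presumably be passed as `custom_callback`; sites with 2-fold degeneracy or without the tag are rejected rather than assigned a type.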

class SynonymyStratification[source]#

Bases: SNPStratification

Stratify SFS by synonymy (neutral or selected).

SynonymyAnnotation can be used to annotate the synonymy of a site.

Warning

This stratification only works for SNPs. You thus need to update the number of mono-allelic sites manually.

get_types()[source]#

Get all possible synonymy types (neutral and selected).

Return type:

List[str]

Returns:

List of synonymy types

get_type(variant: cyvcf2.Variant | DummyVariant)[source]#

Get the synonymy using the custom synonymy annotation.

Parameters:

variant (Union[cyvcf2.Variant, DummyVariant]) – The vcf site

Return type:

Literal['neutral', 'selected']

Returns:

Type of the mutation, either neutral or selected

__init__()#

Create instance.

parser: Optional[Parser]#

Parser instance

n_valid: int#

The number of sites for which a valid type was determined.

class VEPStratification[source]#

Bases: SynonymyStratification

Stratify SFS by synonymy (neutral or selected) based on annotation provided by VEP.

Warning

This stratification only works for SNPs. You thus need to update the number of mono-allelic sites manually.

info_tag = 'CSQ'#

The tag used by VEP to annotate the synonymy

get_types()[source]#

Get all possible synonymy types (neutral and selected).

Return type:

List[str]

Returns:

List of synonymy types

get_type(variant: cyvcf2.Variant | DummyVariant)[source]#

Get the synonymy of a site.

Parameters:

variant (Union[cyvcf2.Variant, DummyVariant]) – The vcf site

Return type:

Literal['neutral', 'selected']

Returns:

Type of the mutation, either neutral or selected

__init__()#

Create instance.

parser: Optional[Parser]#

Parser instance

n_valid: int#

The number of sites for which a valid type was determined.

class SnpEffStratification[source]#

Bases: VEPStratification

Stratify SFS by synonymy (neutral or selected) based on annotation provided by SnpEff.

Warning

This stratification only works for SNPs. You thus need to update the number of mono-allelic sites manually.

info_tag = 'ANN'#

The tag used by SnpEff to annotate the synonymy

__init__()#

Create instance.

get_type(variant: cyvcf2.Variant | DummyVariant)#

Get the synonymy of a site.

Parameters:

variant (Union[cyvcf2.Variant, DummyVariant]) – The vcf site

Return type:

Literal['neutral', 'selected']

Returns:

Type of the mutation, either neutral or selected

get_types()#

Get all possible synonymy types (neutral and selected).

Return type:

List[str]

Returns:

List of synonymy types

parser: Optional[Parser]#

Parser instance

n_valid: int#

The number of sites for which a valid type was determined.

class GenomePositionDependentStratification[source]#

Bases: Stratification, ABC

__init__()#

Create instance.

abstract get_type(variant: cyvcf2.Variant | DummyVariant)#

Get the type of the given variant. The returned value is either one of the types given by get_types() or None if no type could be determined.

Parameters:

variant (Union[cyvcf2.Variant, DummyVariant]) – The vcf site

Return type:

Optional[str]

Returns:

Type of the variant

abstract get_types()#

Get all possible types.

Return type:

List[str]

Returns:

List of types

parser: Optional[Parser]#

Parser instance

n_valid: int#

The number of sites for which a valid type was determined.

class ContigStratification(contigs: List[str] = None)[source]#

Bases: GenomePositionDependentStratification

Stratify SFS by contig.

__init__(contigs: List[str] = None)[source]#

Initialize the stratification.

Parameters:

contigs (List[str]) – List of contigs to stratify by. Defaults to all contigs in the VCF file.

contigs: List[str]#

List of contigs

get_type(variant: cyvcf2.Variant | DummyVariant)[source]#

Get the contig.

Parameters:

variant (Union[cyvcf2.Variant, DummyVariant]) – The vcf site

Return type:

str

Returns:

The contig name

get_types()[source]#

Get all possible contig types.

Return type:

List[str]

Returns:

List of contig names

parser: Optional[Parser]#

Parser instance

n_valid: int#

The number of sites for which a valid type was determined.

class ChunkedStratification(n_chunks: int)[source]#

Bases: GenomePositionDependentStratification

Stratify SFS by creating n contiguous chunks of roughly equal size.

Note

Since the total number of sites is not known in advance, we cannot create contiguous chunks of exactly equal size.

__init__(n_chunks: int)[source]#

Initialize the stratification.

Parameters:

n_chunks (int) – Number of chunks

n_chunks: int#

Number of chunks

chunk_sizes: Optional[List[int]]#

List of chunk sizes

counter: int#

Number of sites seen so far

get_types()[source]#

Get all possible chunk types.

Return type:

List[str]

Returns:

List of chunk types

get_type(variant: cyvcf2.Variant | DummyVariant)[source]#

Get the type.

Parameters:

variant (Union[cyvcf2.Variant, DummyVariant]) – The vcf site

Return type:

str

Returns:

The type

parser: Optional[Parser]#

Parser instance

n_valid: int#

The number of sites for which a valid type was determined.
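One way to picture chunking into roughly equal parts is sketched below. It assumes an estimate of the total number of sites is available up front (for instance a site count taken before filtering), which is why the resulting chunks are only approximately equal in practice. The helper names and the `chunk0`, `chunk1`, ... labels are invented for this example.

```python
from typing import List


def chunk_sizes(n_sites: int, n_chunks: int) -> List[int]:
    """Split n_sites into n_chunks contiguous chunks whose sizes differ by at most one."""
    base, remainder = divmod(n_sites, n_chunks)
    return [base + (1 if i < remainder else 0) for i in range(n_chunks)]


def chunk_of(index: int, sizes: List[int]) -> str:
    """Return a label for the chunk containing the 0-based site index."""
    upper = 0
    for i, size in enumerate(sizes):
        upper += size
        if index < upper:
            return f"chunk{i}"
    raise IndexError("site index exceeds the total number of sites")


sizes = chunk_sizes(n_sites=10, n_chunks=3)
labels = [chunk_of(i, sizes) for i in range(10)]
```

A running counter of sites seen so far, like the `counter` attribute above, is all that is needed to map each incoming site to its chunk.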

class RandomStratification(n_bins: int, seed: int | None = 0)[source]#

Bases: Stratification

Stratify the SFS randomly into a fixed number of bins. Can be used to analyze expected sampling variance between different stratifications.

__init__(n_bins: int, seed: int | None = 0)[source]#

Initialize random stratification.

Parameters:

  • n_bins (int) – Number of bins to randomly assign sites to.

  • seed (int | None) – Random seed for reproducibility.

num_bins: int#

Number of bins

seed: Optional[int]#

Random seed for reproducibility

rng#

Random generator instance

get_type(variant: cyvcf2.Variant | DummyVariant)[source]#

Assign the variant to a random bin.

Parameters:

variant (Union[cyvcf2.Variant, DummyVariant]) – The VCF site

Return type:

str

Returns:

Randomly chosen bin label

get_types()[source]#

Get all possible bin labels.

Return type:

List[str]

Returns:

List of bin labels

parser: Optional[Parser]#

Parser instance

n_valid: int#

The number of sites for which a valid type was determined.
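The effect of the seed can be illustrated with a small stand-alone sketch. `RandomBins` is a hypothetical class built on Python's `random.Random`; the generator type and the `bin0`, `bin1`, ... labels are assumptions and may differ from what RandomStratification uses.

```python
import random
from typing import List, Optional


class RandomBins:
    """Illustrative seeded bin assignment; not the library's implementation."""

    def __init__(self, n_bins: int, seed: Optional[int] = 0):
        self.n_bins = n_bins
        self.rng = random.Random(seed)

    def get_types(self) -> List[str]:
        return [f"bin{i}" for i in range(self.n_bins)]

    def get_type(self) -> str:
        # the variant itself is irrelevant, so no argument is needed here
        return f"bin{self.rng.randrange(self.n_bins)}"


a = RandomBins(n_bins=3, seed=42)
b = RandomBins(n_bins=3, seed=42)
draws_a = [a.get_type() for _ in range(20)]
draws_b = [b.get_type() for _ in range(20)]
```

Two instances created with the same seed assign sites to the same bins in the same order, which makes the resulting spectra reproducible across runs.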

class TargetSiteCounter(n_target_sites: int, n_samples: int = 100000)[source]#

Bases: object

Class for counting the number of target sites when parsing a VCF file that does not contain monomorphic sites. It is used in conjunction with Parser and samples sites from the given FASTA file that lie between variants parsed from the VCF on the same contig. Ideally, the SFS is obtained from VCF files that contain both mono- and polymorphic sites, because the number of mutational opportunities at synonymous and non-synonymous sites carries a lot of information on the strength of selection. When using this class, it is recommended to add a SNPFiltration so that monomorphic sites present in the VCF file do not bias the result.

Warning

This class is not compatible with stratifications based on info tags that are pre-defined in the VCF file, as opposed to those added dynamically via the annotations argument of the parser. Monomorphic sites must be stratified too, but when they are sampled from the FASTA file they are not present in the VCF file and so carry no info tags; stratifications relying on pre-defined tags therefore ignore them. Annotations passed via the annotations argument are instead applied on the fly, including to monomorphic sites sampled from the FASTA file.

__init__(n_target_sites: int, n_samples: int = 100000)[source]#

Initialize counter.

Parameters:
  • n_target_sites (int) – The total number of sites (mono- and polymorphic) that would be present in the VCF file if it contained monomorphic sites. This number should be considerably larger than the number of polymorphic sites in the VCF file. The exact value is not critical for DFE inference, since the ratio of synonymous to non-synonymous sites is more informative, but the order of magnitude should be correct.

  • n_samples (int) – The number of sites to sample from the fasta file. Many sampled sites will not be valid as they are non-coding. To obtain good estimates, a few thousand sites should be sampled per type of site (depending on the stratifications used).

n_target_sites: int | None#

The total number of sites considered when parsing the VCF

n_samples: int#

Number of samples

count()[source]#

Count the number of target sites.

Returns:

The number of target sites
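How sampled sites might translate into per-type monomorphic counts is sketched below. This is one plausible proportional scheme, stated as an assumption and not taken from the TargetSiteCounter source: sites not observed as polymorphic are split across types according to how often each type occurred among the sampled sites. All names and numbers are illustrative.

```python
from typing import Dict


def monomorphic_per_type(
        n_target_sites: int,
        n_polymorphic: Dict[str, int],
        n_sampled: Dict[str, int],
) -> Dict[str, float]:
    """
    Distribute the sites not observed as polymorphic across types in proportion
    to how often each type occurred among the sampled sites.
    """
    n_mono_total = n_target_sites - sum(n_polymorphic.values())
    total_sampled = sum(n_sampled.values())
    if n_mono_total < 0 or total_sampled == 0:
        raise ValueError("need n_target_sites >= polymorphic sites and at least one sample")
    return {t: n_mono_total * c / total_sampled for t, c in n_sampled.items()}


mono = monomorphic_per_type(
    n_target_sites=100_000,
    n_polymorphic={"neutral": 600, "selected": 400},
    n_sampled={"neutral": 2_500, "selected": 7_500},
)
```

Under this scheme only the ratio of sampled types determines how monomorphic sites are divided, which is consistent with the remark that the ratio between site types matters more than the exact value of n_target_sites.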

class Parser(vcf: str, n: int, gff: str | None = None, fasta: str | None = None, info_ancestral: str = 'AA', info_ancestral_prob: str = 'AA_prob', skip_non_polarized: bool = True, stratifications: List[Stratification] = [], annotations: List[Annotation] = [], filtrations: List[Filtration] = None, include_samples: List[str] = None, exclude_samples: List[str] = None, max_sites: int = inf, seed: int | None = 0, cache: bool = True, aliases: Dict[str, List[str]] = {}, target_site_counter: TargetSiteCounter = None, subsample_mode: Literal['random', 'probabilistic'] = 'probabilistic', polarize_probabilistically: bool = False)[source]#

Bases: MultiHandler

Parse site-frequency spectra from VCF files.

By default, the parser looks at the AA tag in the VCF file’s info field to retrieve the correct polarization. Polymorphic sites for which this tag is not well-defined are by default ignored (see skip_non_polarized).

This class also offers on-the-fly annotation of the VCF sites such as site degeneracy and ancestral allele state. This is done by providing a list of annotations to the parser which are applied in the order they are provided.

The parser also allows filtering sites based on site properties, which is done by passing a list of filtrations. By default, we filter out poly-allelic sites as sites are assumed to be at most bi-allelic.

In addition, the parser allows stratifying the SFS by providing a list of stratifications. This is useful to obtain the SFS for different types of sites for which we can jointly infer the DFEs using JointInference.

To correctly determine the number of target sites when parsing a VCF file that does not contain monomorphic sites, we can use a TargetSiteCounter. This class is used in conjunction with the parser and samples sites from the given FASTA file that are found in between variants on the same contig that were parsed in the VCF.

Note that we assume the sites in the VCF file to be sorted by position in ascending order (per contig).

Example usage:

import fastdfe as fd

# Parse selected and neutral SFS from human chromosome 21.
p = fd.Parser(
    vcf="https://ngs.sanger.ac.uk/production/hgdp/hgdp_wgs.20190516/"
        "hgdp_wgs.20190516.full.chr21.vcf.gz",
    fasta="http://ftp.ensembl.org/pub/release-109/fasta/homo_sapiens/"
          "dna/Homo_sapiens.GRCh38.dna.chromosome.21.fa.gz",
    gff="http://ftp.ensembl.org/pub/release-109/gff3/homo_sapiens/"
        "Homo_sapiens.GRCh38.109.chromosome.21.gff3.gz",
    aliases=dict(chr21=['21']),  # mapping for contig names
    n=10,  # SFS sample size
    # we use a target site counter to infer the number of target sites.
    target_site_counter=fd.TargetSiteCounter(
        n_samples=1000000,
        # determine number of target sites by looking at total length of coding sequences
        n_target_sites=fd.Annotation.count_target_sites(
            "http://ftp.ensembl.org/pub/release-109/gff3/homo_sapiens/"
            "Homo_sapiens.GRCh38.109.chromosome.21.gff3.gz"
        )['21']
    ),
    # add degeneracy annotation for sites
    annotations=[
        fd.DegeneracyAnnotation()
    ],
    filtrations=[
        # exclude non-SNPs as we infer monomorphic sites with target site counter
        fd.SNPFiltration(),
        # filter out sites not in coding sequences
        fd.CodingSequenceFiltration()
    ],
    # stratify by 4-fold/0-fold degeneracy
    stratifications=[fd.DegeneracyStratification()],
    info_ancestral='AA_ensembl'
)

sfs = p.parse()

sfs.plot()
__init__(vcf: str, n: int, gff: str | None = None, fasta: str | None = None, info_ancestral: str = 'AA', info_ancestral_prob: str = 'AA_prob', skip_non_polarized: bool = True, stratifications: List[Stratification] = [], annotations: List[Annotation] = [], filtrations: List[Filtration] = None, include_samples: List[str] = None, exclude_samples: List[str] = None, max_sites: int = inf, seed: int | None = 0, cache: bool = True, aliases: Dict[str, List[str]] = {}, target_site_counter: TargetSiteCounter = None, subsample_mode: Literal['random', 'probabilistic'] = 'probabilistic', polarize_probabilistically: bool = False)[source]#

Initialize the parser.

Parameters:
  • vcf (str) – The path to the VCF file, can be gzipped or a URL.

  • gff (str | None) – The path to the GFF file, possibly gzipped or a URL. This file is optional and depends on the stratifications, annotations and filtrations that are used.

  • fasta (str | None) – The path to the FASTA file, possibly gzipped or a URL. This file is optional and depends on the annotations and filtrations that are used.

  • n (int) – The size of the resulting SFS. We down-sample to this number by drawing without replacement from the set of all available genotypes per site. Sites with fewer than n genotypes are skipped.

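  • info_ancestral (str) – The tag in the INFO field that contains ancestral allele information. Consider using an ancestral allele annotation if this information is not available yet.

  • info_ancestral_prob (str) – The tag in the INFO field that contains the ancestral allele probability (see polarize_probabilistically).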

  • skip_non_polarized (bool) – Whether to skip polymorphic sites that are not polarized, i.e., without a valid info tag providing the ancestral allele. If False, we use the reference allele as ancestral allele (only recommended if working with folded spectra).

  • stratifications (List[Stratification]) – List of stratifications to use.

  • annotations (List[Annotation]) – List of annotations to use.

  • filtrations (List[Filtration]) – List of filtrations to use. By default, we use PolyAllelicFiltration.

  • include_samples (List[str]) – List of sample names to consider when determining the SFS. If None, all samples are used. Note that this restriction does not apply to the annotations and filtrations.

  • exclude_samples (List[str]) – List of sample names to exclude when determining the SFS. If None, no samples are excluded. Note that this restriction does not apply to the annotations and filtrations.

  • max_sites (int) – Maximum number of sites to parse from the VCF file.

  • seed (int | None) – Seed for the random number generator. Use None for no seed.

  • cache (bool) – Whether to cache files downloaded from URLs.

  • aliases (Dict[str, List[str]]) – Dictionary of aliases for the contigs in the VCF file, e.g. {'chr1': ['1']}. This is used to match the contig names in the VCF file with the contig names in the FASTA file and GFF file.

  • target_site_counter (TargetSiteCounter) – The target site counter. If None, we do not sample target sites.

  • subsample_mode (Literal['random', 'probabilistic']) – The subsampling mode. For random, we draw once without replacement from the set of all available genotypes per site. For probabilistic, we add the hypergeometric probabilities of all possible subsample outcomes at each site to the SFS. This produces a smoother SFS, especially when a small number of sites is considered.

  • polarize_probabilistically (bool) – Whether to probabilistically polarize sites. In addition to the AA tag (see info_ancestral), we use the AA_prob tag (see info_ancestral_prob) to polarize sites probabilistically. For example, if the ancestral allele is A with a probability of 0.8 and the derived allele is G, we assign 0.8 probability mass to the ancestral allele and 0.2 to the derived allele. This should enhance accuracy, especially for small datasets. Whenever the ancestral probability tag is not present, we assume a probability of 1 for the ancestral allele.
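Probabilistic polarization as described for polarize_probabilistically can be sketched as follows. This reflects one reading of the description: with probability p the annotated ancestral allele is correct, otherwise ancestral and derived alleles are swapped, so the remaining mass goes to the complementary frequency class. The function name and the list-based SFS are assumptions for illustration.

```python
from typing import List


def add_polarized(sfs: List[float], n_derived: int, p_ancestral: float = 1.0) -> None:
    """
    Add one site to an unfolded SFS of sample size n = len(sfs) - 1.

    With probability p_ancestral the annotated ancestral allele is correct and the
    site has n_derived derived copies; otherwise the roles are swapped and it has
    n - n_derived derived copies.
    """
    n = len(sfs) - 1
    sfs[n_derived] += p_ancestral
    sfs[n - n_derived] += 1.0 - p_ancestral


sfs = [0.0] * 11  # sample size n = 10
add_polarized(sfs, n_derived=3, p_ancestral=0.8)
add_polarized(sfs, n_derived=1)  # no probability tag: assume probability 1
```

Each site still contributes a total mass of one, so the overall number of sites in the SFS is unchanged; only its distribution over frequency classes is smoothed.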

target_site_counter: TargetSiteCounter | None#

The target site counter

n: int#

The number of individuals in the sample

include_samples: Optional[List[str]]#

The list of samples to include

exclude_samples: Optional[List[str]]#

The list of samples to exclude

skip_non_polarized: bool#

Whether to skip sites that are not polarized, i.e., without a valid info tag providing the ancestral allele

stratifications: List[Stratification]#

List of stratifications to use

annotations: List[Annotation]#

List of annotations to use

filtrations: List[Filtration]#

List of filtrations to use

n_skipped: int#

The number of sites that were skipped for various reasons

n_no_ancestral: int#

The number of sites that were skipped because they had no valid ancestral allele

sfs: Dict[str, ndarray]#

Dictionary of SFS indexed by joint type

subsample_mode: Literal['random', 'probabilistic']#

The subsampling mode
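The probabilistic mode can be illustrated with the hypergeometric distribution: when n genotypes are drawn without replacement from N available genotypes of which k carry the derived allele, the probability of seeing j derived alleles is C(k, j) C(N - k, n - j) / C(N, n). The sketch below adds this probability mass to the SFS for a single site; the function names and the list-based SFS are assumptions, not the parser's internals.

```python
from math import comb
from typing import List


def hypergeometric_counts(n_total: int, n_derived: int, n: int) -> List[float]:
    """
    Probability of observing j = 0..n derived alleles when drawing n genotypes
    without replacement from n_total genotypes, n_derived of which are derived.
    """
    if n > n_total:
        raise ValueError("sites with fewer than n genotypes are skipped")
    denom = comb(n_total, n)
    return [comb(n_derived, j) * comb(n_total - n_derived, n - j) / denom
            for j in range(n + 1)]


def add_site(sfs: List[float], n_total: int, n_derived: int) -> None:
    """Add a site's hypergeometric probability mass to an SFS of sample size len(sfs) - 1."""
    for j, p in enumerate(hypergeometric_counts(n_total, n_derived, len(sfs) - 1)):
        sfs[j] += p


sfs = [0.0] * 3  # sample size n = 2
add_site(sfs, n_total=4, n_derived=2)
```

For this site the mass is split as 1/6, 4/6 and 1/6 across the three frequency classes, whereas the random mode would place the whole site into a single class chosen by one draw.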

info_ancestral_prob: str#

The tag in the INFO field that contains the ancestral allele probability

polarize_probabilistically: bool#

Whether to probabilistically polarize sites

parse()[source]#

Parse the VCF file.

Return type:

Spectra

Returns:

The spectra for the different stratifications

count_sites()#

Count the number of sites in the VCF.

Return type:

int

Returns:

Number of sites

classmethod download_file(url: str, cache: bool = True, desc: str = 'Downloading file')#

Download a file from a URL.

Parameters:
  • cache (bool) – Whether to cache the file.

  • url (str) – The URL to download the file from.

  • desc (str) – Description for the progress bar

Return type:

str

Returns:

The path to the downloaded file.

download_if_url(path: str)#

Download the VCF file if it is a URL.

Parameters:

path (str) – The path to the VCF file.

Return type:

str

Returns:

The path to the downloaded file or the original path.

get_aliases(contig: str)#

Get all aliases for the given contig alias including the primary alias.

Parameters:

contig (str) – The contig.

Return type:

List[str]

Returns:

The aliases.

get_contig(aliases, rewind: bool = True, notify: bool = True)#

Get the contig from the FASTA file.

Note that pyfaidx would be more efficient here, but there were problems when running it in parallel.

Parameters:
  • aliases – The contig aliases.

  • rewind (bool) – Whether to allow for rewinding the iterator if the contig is not found.

  • notify (bool) – Whether to notify the user when rewinding the iterator.

Return type:

SeqRecord

Returns:

The contig.

get_contig_names()#

Get the names of the contigs in the FASTA file.

Return type:

List[str]

Returns:

The contig names.

static get_filename(url: str)#

Return the file extension of a URL.

Parameters:

url (str) – The URL to get the file extension from.

Returns:

The file extension.

get_pbar(desc: str = 'Processing sites', total: int | None = 0)#

Return a progress bar for the number of sites.

Parameters:
  • desc (str) – Description for the progress bar

  • total (int | None) – Total number of items

Return type:

tqdm

Returns:

tqdm

static hash(s: str)#

Return a truncated SHA1 hash of a string.

Parameters:

s (str) – The string to hash.

Return type:

str

Returns:

The SHA1 hash.

static is_url(path: str)#

Check if the given path is a URL.

Parameters:

path (str) – The path to check.

Return type:

bool

Returns:

True if the path is a URL, False otherwise.

load_fasta(file: str)#

Load a FASTA file and return an iterator over its sequences.

Parameters:

file (str) – The path to the FASTA file, possibly gzipped or a URL

Return type:

FastaIterator

Returns:

Iterator over the sequences.

load_vcf()#

Load the VCF file and return a reader for it.

Return type:

cyvcf2.VCF

Returns:

The VCF reader.

property n_sites: int#

Get the number of sites in the VCF.

Returns:

Number of sites

static remove_overlaps(df: DataFrame)#

Remove overlapping coding sequences.

Parameters:

df (DataFrame) – The coding sequences.

Return type:

DataFrame

Returns:

The coding sequences without overlaps.

static unzip_if_zipped(file: str)#

If the given file is gzipped, unzip it and return the path to the unzipped file. If the file is not gzipped, return the path to the original file.

Parameters:

file (str) – The path to the file.

Returns:

The path to the unzipped file, or the original file if it was not gzipped.

vcf#

The path to the VCF file or an iterable of variants

info_ancestral: str#

The tag in the INFO field that contains the ancestral allele

max_sites: int#

Maximum number of sites to consider

seed: Optional[int]#

Seed for the random number generator

rng#

Random generator instance

fasta: str#

The path to the FASTA file.

gff#

The GFF file path

cache: bool#

Whether to cache files that are downloaded from URLs

aliases#

The contig mappings