SFS simulation

SFS simulation#

fastDFE provides a Simulation module to generate an expected SFS under a specific DFE, and potentially demographic nuisance parameters. In fact, when inferring the DFE with fastDFE, the expected SFS is repeatedly simulated for different parameters to find the maximum likelihood estimates. Below, we illustrate how to use the simulation module to generate SFS data under a given DFE. To do this, we need to specify a neutral SFS (sfs_neut), which is informative on the population sample size, population mutation rate, the number of sites, and possibly demography. get_neutral_sfs() can be used to obtain a neutral SFS under a constant panmictic population. We also specify the DFE parameters and parametrization. By default, we assume not ancestral misidentification (eps=0) and mutation are assumed to be semi-dominant (h=0.5).

import fastdfe as fd

# create simulation object by specifying neutral SFS and DFE
sim = fd.Simulation(
    sfs_neut=fd.Simulation.get_neutral_sfs(n=20, n_sites=1e8, theta=1e-4),
    params=dict(S_d=-300, b=0.3, p_b=0, S_b=1),
    model=fd.GammaExpParametrization()
)

Let’s plot the DFE we are going to use for the simulation which is purely deleterious and follows a gamma distribution.

sim.dfe.plot();
../../_images/18cb6615080d120ba6b1e9de4d88bd417063baccda1d333a2877d8536bc67521.png

Let’s now run the simulation and plot the expected SFS for both neutral (specified) and selected (simulated) sites.

sim.run()
sim.get_spectra().plot();
INFO:Simulation: Simulating SFS under selection using parameters: S_d=-300, b=0.3, p_b=0, S_b=1, eps=0, h=0.5
INFO:Discretization: Precomputing semidominant DFE-SFS transformation using midpoint integration.
Discretization>Precomputing: 100%|██████████| 19/19 [00:00<00:00, 63.02it/s]
../../_images/f91549688ec2a839a8dfb111d47c6863cdec6680d574001fe9d7ed09b562da25.png

We can now use the simulated SFS to infer the DFE parameters and assess how closely the inference recovers the true DFE used to generate the data.

inf = fd.BaseInference(
    sfs_neut=sim.sfs_neut,
    sfs_sel=sim.sfs_sel,
    fixed_params=dict(all=dict(eps=0, h=0.5, p_b=0, S_b=1))
)

inf.run()

fd.DFE.plot_many([sim.dfe, inf.get_dfe()], labels=['True DFE', 'Inferred DFE']);
INFO:Discretization: Precomputing semidominant DFE-SFS transformation using midpoint integration.
Discretization>Precomputing: 100%|██████████| 19/19 [00:00<00:00, 66.18it/s]
INFO:Optimization: Optimizing 2 parameters: [all.b, all.S_d].
BaseInference>Performing inference: 100%|██████████| 10/10 [00:02<00:00,  3.47it/s]
INFO:BaseInference: Successfully finished optimization after 15 iterations and 93 function evaluations, obtaining a log-likelihood of -69.67
INFO:BaseInference: Inferred parameters: {all.S_d: -300, all.b: 0.3, all.p_b: 0, all.S_b: 1, all.eps: 0, all.h: 0.5}.
INFO:BaseInference: Standard deviations across runs: {all.S_d: 0.0008215, all.b: 2.279e-07, all.p_b: 0, all.S_b: 0, all.eps: 0, all.h: 0, likelihood: 3.484e-10}.
BaseInference>Bootstrapping (2 runs each): 100%|██████████| 100/100 [00:52<00:00,  1.90it/s]
INFO:BaseInference: Standard deviations across bootstraps: {all.S_d: 51.03, all.b: 0.01354, all.p_b: 0, all.S_b: 0, all.eps: 0, all.h: 0, likelihood: 3.079, i_best_run: 0.5024, likelihoods_std: 4.029}.
INFO:BaseInference: Average standard deviation of likelihoods across runs within bootstrap samples: 0.4029.
../../_images/dbdd178934a08842ec6cf535b736453a1abf68e74a50d39a8cd106420ea358e0.png

The DFE is recovered quite well which is expected given the simplicity of the DFE parametrization and lack of demographic complications.

We may want to see how the expected SFS changes as we vary the degree of dominance. Here, we run a second simulation with the same parameters except that we set h=0.3 so that mutations are partially recessive.

# run simulation with h=0.3
sim2 = fd.Simulation(
    sfs_neut=fd.Simulation.get_neutral_sfs(n=20, n_sites=1e8, theta=1e-4),
    params=dict(S_d=-300, b=0.3, p_b=0, S_b=1, h=0.3),
    model=fd.GammaExpParametrization()
)

sim2.run()

fd.Spectra.from_spectra(dict(
    neutral=sim.sfs_neut,
    semidominant=sim.sfs_sel,
    recessive=sim2.sfs_sel
)).plot();
INFO:Simulation: Simulating SFS under selection using parameters: S_d=-300, b=0.3, p_b=0, S_b=1, h=0.3, eps=0
INFO:Discretization: Precomputing DFE-SFS transformation for fixed h=0.3.
Discretization>Precomputing: 100%|██████████| 3819/3819 [00:47<00:00, 80.13it/s]
../../_images/bf2ead2f316e1edcf7fd7978cc94d5ec7638d5945a6d01f3bd2c9b6ee9c9f591.png

As expected, partially recessive mutations lead to an excess of rare variants because they are masked in heterozygotes and experience weaker purifying selection. At higher frequencies, the difference from the semidominant case diminishes as derived-allele homozygotes become more common.

We can now infer the DFE under the (incorrect) assumption that mutations are semidominant (h=0.5) and see how this affects the inference result.

inf = fd.BaseInference(
    sfs_neut=sim2.sfs_neut,
    sfs_sel=sim2.sfs_sel
)

inf.run()

fd.DFE.plot_many([sim2.dfe, inf.get_dfe()], labels=['True DFE (h=0.3)', 'Inferred DFE (h=0.5)']);
INFO:Discretization: Precomputing semidominant DFE-SFS transformation using midpoint integration.
Discretization>Precomputing: 100%|██████████| 19/19 [00:00<00:00, 41.28it/s]
INFO:Optimization: Optimizing 2 parameters: [all.b, all.S_d].
BaseInference>Performing inference: 100%|██████████| 10/10 [00:04<00:00,  2.19it/s]
INFO:BaseInference: Successfully finished optimization after 14 iterations and 63 function evaluations, obtaining a log-likelihood of -71.14
INFO:BaseInference: Inferred parameters: {all.S_d: -154.4, all.b: 0.3252, all.p_b: 0, all.S_b: 1, all.eps: 0, all.h: 0.5}.
INFO:BaseInference: Standard deviations across runs: {all.S_d: 0.0004465, all.b: 3.327e-07, all.p_b: 0, all.S_b: 0, all.eps: 0, all.h: 0, likelihood: 6.952e-10}.
BaseInference>Bootstrapping (2 runs each): 100%|██████████| 100/100 [01:25<00:00,  1.18it/s]
INFO:BaseInference: Standard deviations across bootstraps: {all.S_d: 20.61, all.b: 0.01421, all.p_b: 0, all.S_b: 0, all.eps: 0, all.h: 0, likelihood: 3.41, i_best_run: 0.4976, likelihoods_std: 7.301}.
INFO:BaseInference: Average standard deviation of likelihoods across runs within bootstrap samples: 0.7301.
../../_images/cae9f357c92900d23153e059be7a9409693a0695e8f8ccf4e579588cae9e2d4e.png

Assuming semidominance when mutations are actually partially recessive leads to an inferred DFE shifted toward weaker selection coefficients. This happens because rare variants generated by recessive mutations are misattributed to weaker selection when the dominance reduction in purifying selection is not accounted for.

We can also try to infer the DFE while letting the dominance coefficient h vary during the optimization. To optimize h, it needs to be discretized; in this example we use the default grid of 21 points between 0 and 1, with intermediate values obtained by linear interpolation. All required values are precomputed before the optimization, but allowing h to vary increases the computational cost because the expected SFS must be simulated for each h in the grid. To speed up the precomputation step, we instruct BaseInference to use a coarser DFE discretization grid than the default, which should be sufficient for most inference scenarios.

inf = fd.BaseInference(
    sfs_neut=sim.sfs_neut,
    sfs_sel=sim.sfs_sel,
    intervals_h=(0, 1, 21),
    intervals_del=(-1.0e+8, -1.0e-5, 100),
    intervals_ben=(1.0e-5, 1.0e4, 100),
    fixed_params=dict(all=dict(eps=0, p_b=0, S_b=1))
)

inf.run()

fd.DFE.plot_many([sim2.dfe, inf.get_dfe()], labels=['True DFE (h=0.3)', 'Inferred DFE']);
INFO:Discretization: Precomputing DFE-SFS transformation across dominance coefficients (grid size: 21).
Discretization>Precomputing: 100%|██████████| 8379/8379 [01:05<00:00, 127.93it/s]
INFO:Optimization: Optimizing 3 parameters: [all.b, all.h, all.S_d].
BaseInference>Performing inference: 100%|██████████| 10/10 [00:00<00:00, 26.84it/s]
INFO:BaseInference: Successfully finished optimization after 29 iterations and 172 function evaluations, obtaining a log-likelihood of -69.67
INFO:BaseInference: Inferred parameters: {all.S_d: -298.9, all.b: 0.3003, all.p_b: 0, all.S_b: 1, all.eps: 0, all.h: 0.5021}.
INFO:BaseInference: Standard deviations across runs: {all.S_d: 0.04796, all.b: 4.789e-06, all.p_b: 0, all.S_b: 0, all.eps: 0, all.h: 6.518e-05, likelihood: 5.245e-08}.
BaseInference>Bootstrapping (2 runs each): 100%|██████████| 100/100 [00:05<00:00, 18.82it/s]
INFO:BaseInference: Standard deviations across bootstraps: {all.S_d: 1.316e+04, all.b: 0.03581, all.p_b: 0, all.S_b: 0, all.eps: 0, all.h: 0.257, likelihood: 3.019, i_best_run: 0.5024, likelihoods_std: 0.1654}.
INFO:BaseInference: Average standard deviation of likelihoods across runs within bootstrap samples: 0.02186.
../../_images/a3acfda1cafade6cb1f0250f02af69b49f8cce8b4351eaa24abbf240d96e8547.png

We see that allowing h to vary during inference roughly recovers the true DFE used for the simulation, but with substantially greater uncertainty. Finally, we can examine the bootstrap results to assess how well h was estimated.

inf.bootstraps.h.hist();
../../_images/4196266dda83490126685fdde658bfb889559bc3f38a8e2b6eeaa595ee974785.png

We see that h was estimated with considerable uncertainty, but the bootstrap distribution still puts substantial mass near the true value h = 0.3. How well dominance can be inferred from SFS data alone remains an open question, especially when the DFE is more complex or additional nuisance parameters such as demography are included.

inf.bootstraps.assign(S_d=inf.bootstraps.S_d.abs()).plot.hexbin('S_d', 'h', xscale='log', gridsize=15, figsize=(4, 3));
../../_images/60f7f285fd73c14cc336b21c15cdb129d46aa7bdbd967b04df7bf1598f82b00b.png

Indeed, we see that h covaries with S_d, with larger h associated with less strongly deleterious mutations.