Usage
Installation
To use Monod, install it from pip:
pip install monod
To use it in your code, import the package components:
import monod
from monod import *
from monod.preprocess import *
from monod.extract_data import *
from monod.cme_toolbox import CMEModel
from monod import inference, mminference
from monod.analysis import *
Quantification
To generate an intronic index for kb
, obtain a reference genome and run kb ref
:
kb ref -i ./refdata-gex-mm10-2020-A/kallisto/index.idx
-g ./refdata-gex-mm10-2020-A/t2g_mm10.txt
-f1 ./refdata-gex-mm10-2020-A/kallisto/cdna.fa
-f2 ./refdata-gex-mm10-2020-A/kallisto/intron.fa
-c1 ./refdata-gex-mm10-2020-A/kallisto/cdna_t2c.txt
-c2 ./refdata-gex-mm10-2020-A/kallisto/intron_t2c.txt
--workflow lamanno
./refdata-gex-mm10-2020-A/fasta/genome.fa
./refdata-gex-mm10-2020-A/genes/genes.gtf
To then generate the count matrices in loom
format, use kb count
:
kb count --verbose
-i ../ref/refdata-gex-mm10-2020-A/kallisto/index.idx
-g ../ref/refdata-gex-mm10-2020-A/t2g_mm10.txt
-x 10xv3
-o OUTDIR/
-t 30 -m 30G
-c1 ../ref/refdata-gex-mm10-2020-A/kallisto/cdna_t2c.txt
-c2 ../ref/refdata-gex-mm10-2020-A/kallisto/intron_t2c.txt
--workflow lamanno --filter bustools --overwrite --loom
DATASET_FASTQ_LOCATIONS
This generates a loom
file in OUTDIR/counts_filtered/
.
Pre-processing
To define a “batch,” or a set of inference runs, run preprocess.construct_batch()
:
dir_string,dataset_strings = construct_batch(raw_filepaths, transcriptome_filepath, dataset_names, \
attribute_names, batch_location, meta, batch_id, n_genes)
To import the files, specify the raw data (loom
, mtx
, or adata
) filepaths in raw_filepaths
, a gene length annotation filepath transcriptome_filepath
, and various batch and dataset metadata. To specify the number of genes to analyze, set n_genes
.
This will create a batch directory, dataset-specific subdirectories, the file gene_set.csv
with the list of genes that meet filtering thresholds for all datasets, and the file genes.csv
with the list of genes selected for further analysis. This gene list can also be defined or updated manually, but inference is typically unsuitable for genes that do not meet the filtering thresholds.
Model, data, and parameter definition
To define a CME model of transcription and sequencing, initialize an instance of cme_toolbox.CMEModel
:
CMEModel(biological_model,sequencing_model)
where biological_model = {'Bursty','Constitutive','Extrinsic','CIR'}
represents the transcriptional process and sequencing_model = {'None','Bernoulli','Poisson'}
represents the dynamics of the sampling process.
To define the search parameters, initialize an instance of inference.InferenceParameters
:
inference_parameters = inference.InferenceParameters(phys_lb,phys_ub,samp_lb,samp_ub,gridsize,\
dataset_string,fitmodel,use_lengths)
where phys_lb
and phys_ub
are bounds on the transcriptional process model parameters, samp_lb
and samp_ub
are bounds on the sampling process model parameters, gridsize
defines the grid for the sampling parameter scan, and use_lengths
determines whether the unspliced mRNA capture rate depends on the gene length (to model priming at ubiquitous internal polyA sites).
Alternatively one can define the search parameters and cluster the data. This will run the meK-Means clustering algorithm (see the paper here). We initialize an instance of mminference.InferenceParameters
:
inference_parameters = mminference.InferenceParameters(phys_lb,phys_ub,samp_lb,samp_ub,gridsize,\
k,epochs,\
dataset_string,fitmodel,use_lengths)
where k
is the user-defined number of clusters to learn and epochs
is the numbers of rounds to learn the data clusters. All other parameters remain the same as inference.InferenceParameters
.
To create a SearchData
object to input into the inference process, run extract_data.extract_data()
:
search_data = extract_data(loom_filepath, transcriptome_filepath, dataset_name,
dataset_string, dir_string)
Running the inference pipeline
To run the pipeline, simply call the following parallelized code:
result_string = inference_parameters.fit_all_grid_points(n_cores,search_data)
This will iterate over all grid points using n_cores
processors.
Post-processing and QC
To load the search results, import the file string:
sr = load_search_results(result_string)
To identify the technical noise parameter optimum, call a method of a SearchResults object sr:
sr.find_sampling_optimum()
Optionally, test its stability under subsampling and chi-squared testing:
fig1,ax1 = plt.subplots(1,1)
sr.plot_landscape(ax1)
_=sr.chisquare_testing(sd)
sr.resample_opt_viz()
sr.resample_opt_mc_viz()
sr.chisq_best_param_correction(sd,viz=True)
Optionally, examine whether the distribution fits match the raw data:
sr.plot_gene_distributions(sd,marg='joint')
sr.plot_gene_distributions(sd,marg='nascent')
sr.plot_gene_distributions(sd,marg='mature')
To chracterize the uncertainty, variation, and bias in biological parameters, compute the standard errors of their maximum likelihood estimates, then plot their distributions and dependence on length (which should be minimal):
sr.compute_sigma(sd,num_cores)
sr.plot_param_L_dep(plot_errorbars=True,plot_fit=True)
sr.plot_param_marg()
As the standard error computation is typically computationally intensive, it is useful to store an updated copy on disk after evaluating it:
sr.update_on_disk()
Noise decomposition
Two complementary methods are available for investigating the contributions of different noise sources. The first is non-parametric; calling a method of a SearchData
object sd
returns the fractions of variation retained and discarded after normalization and log-transformation:
f = sd.get_noise_decomp()
These fractions are not guaranteed to be positive, because the transformations may increase the relative spread of the data. On the other hand, if a fit has been completed, a method of a SearchResults
object sr
reports the fractions of intrinsic, extrinsic (bursty), and technical variation for each gene and molecular species:
f = sr.get_noise_decomp()
Differential parameter value identification
Given a set of matched datasets, run with the same model over the same set of genes, two approaches are available for identifying putative patterns of differential expression and regulation. A moment-based, biology-agnostic one uses a simple t-test to identify differences in the means of spliced counts in SearchData
objects sd1
and sd2
:
gf = compute_diffexp(sd1,sd2)
where gf
is boolean vector that reports True
if the gene is identified as DE. However, this approach cannot identify differences if biological parameters change in a correlated way and the mean stays the same. We introduce a more mechanistic approach for the identification of differential expression suggested by parameter variation, based on two SearchResults
objects sr1
and sr2
:
gf = compute_diffreg(sr1,sr2)
where gf
is a two-dimensional boolean array that reports True
if a particular parameter is identified as DE. After using these arrays to find a subpopulation of interest – e.g., genes that do not exhibit variation in the spliced mean, but do exhibit modulation in the burst size – it is possible to plug the gene filter genes_to_plot
back in to inspect the raw data and fits:
gf = compare_gene_distributions(sr_arr,sd_arr,genes_to_plot=genes_to_plot)