Usage

Installation

To use Monod, install it from pip:

pip install monod

To use it in your code, import the package components:

import monod
from monod import *
from monod.preprocess import *
from monod.extract_data import *
from monod.cme_toolbox import CMEModel
from monod import inference, mminference
from monod.analysis import *

Quantification

To generate an intronic index for kb, obtain a reference genome and run kb ref:

kb ref -i ./refdata-gex-mm10-2020-A/kallisto/index.idx
-g ./refdata-gex-mm10-2020-A/t2g_mm10.txt
-f1 ./refdata-gex-mm10-2020-A/kallisto/cdna.fa
-f2 ./refdata-gex-mm10-2020-A/kallisto/intron.fa
-c1 ./refdata-gex-mm10-2020-A/kallisto/cdna_t2c.txt
-c2 ./refdata-gex-mm10-2020-A/kallisto/intron_t2c.txt
--workflow lamanno
./refdata-gex-mm10-2020-A/fasta/genome.fa
./refdata-gex-mm10-2020-A/genes/genes.gtf

To then generate the count matrices in loom format, use kb count:

kb count --verbose
-i ../ref/refdata-gex-mm10-2020-A/kallisto/index.idx
-g ../ref/refdata-gex-mm10-2020-A/t2g_mm10.txt
-x 10xv3
-o OUTDIR/
-t 30 -m 30G
-c1 ../ref/refdata-gex-mm10-2020-A/kallisto/cdna_t2c.txt
-c2 ../ref/refdata-gex-mm10-2020-A/kallisto/intron_t2c.txt
--workflow lamanno --filter bustools --overwrite --loom
DATASET_FASTQ_LOCATIONS

This generates a loom file in OUTDIR/counts_filtered/.

Pre-processing

To define a “batch,” or a set of inference runs, run preprocess.construct_batch():

dir_string,dataset_strings = construct_batch(raw_filepaths, transcriptome_filepath, dataset_names, \
                                             attribute_names, batch_location, meta, batch_id, n_genes)

To import the files, specify the raw data (loom, mtx, or adata) filepaths in raw_filepaths, a gene length annotation filepath transcriptome_filepath, and various batch and dataset metadata. To specify the number of genes to analyze, set n_genes.

This will create a batch directory, dataset-specific subdirectories, the file gene_set.csv with the list of genes that meet filtering thresholds for all datasets, and the file genes.csv with the list of genes selected for further analysis. This gene list can also be defined or updated manually, but inference is typically unsuitable for genes that do not meet the filtering thresholds.

Model, data, and parameter definition

To define a CME model of transcription and sequencing, initialize an instance of cme_toolbox.CMEModel:

CMEModel(biological_model,sequencing_model)

where biological_model = {'Bursty','Constitutive','Extrinsic','CIR'} represents the transcriptional process and sequencing_model = {'None','Bernoulli','Poisson'} represents the dynamics of the sampling process.

To define the search parameters, initialize an instance of inference.InferenceParameters:

inference_parameters = inference.InferenceParameters(phys_lb,phys_ub,samp_lb,samp_ub,gridsize,\
                    dataset_string,fitmodel,use_lengths)

where phys_lb and phys_ub are bounds on the transcriptional process model parameters, samp_lb and samp_ub are bounds on the sampling process model parameters, gridsize defines the grid for the sampling parameter scan, and use_lengths determines whether the unspliced mRNA capture rate depends on the gene length (to model priming at ubiquitous internal polyA sites).

Alternatively one can define the search parameters and cluster the data. This will run the meK-Means clustering algorithm (see the paper here). We initialize an instance of mminference.InferenceParameters:

inference_parameters = mminference.InferenceParameters(phys_lb,phys_ub,samp_lb,samp_ub,gridsize,\
                    k,epochs,\
                    dataset_string,fitmodel,use_lengths)

where k is the user-defined number of clusters to learn and epochs is the numbers of rounds to learn the data clusters. All other parameters remain the same as inference.InferenceParameters.

To create a SearchData object to input into the inference process, run extract_data.extract_data():

search_data = extract_data(loom_filepath, transcriptome_filepath, dataset_name,
                               dataset_string, dir_string)

Running the inference pipeline

To run the pipeline, simply call the following parallelized code:

result_string = inference_parameters.fit_all_grid_points(n_cores,search_data)

This will iterate over all grid points using n_cores processors.

Post-processing and QC

To load the search results, import the file string:

sr = load_search_results(result_string)

To identify the technical noise parameter optimum, call a method of a SearchResults object sr:

sr.find_sampling_optimum()

Optionally, test its stability under subsampling and chi-squared testing:

fig1,ax1 = plt.subplots(1,1)
sr.plot_landscape(ax1)
_=sr.chisquare_testing(sd)
sr.resample_opt_viz()
sr.resample_opt_mc_viz()
sr.chisq_best_param_correction(sd,viz=True)

Optionally, examine whether the distribution fits match the raw data:

sr.plot_gene_distributions(sd,marg='joint')
sr.plot_gene_distributions(sd,marg='nascent')
sr.plot_gene_distributions(sd,marg='mature')

To chracterize the uncertainty, variation, and bias in biological parameters, compute the standard errors of their maximum likelihood estimates, then plot their distributions and dependence on length (which should be minimal):

sr.compute_sigma(sd,num_cores)
sr.plot_param_L_dep(plot_errorbars=True,plot_fit=True)
sr.plot_param_marg()

As the standard error computation is typically computationally intensive, it is useful to store an updated copy on disk after evaluating it:

sr.update_on_disk()

Noise decomposition

Two complementary methods are available for investigating the contributions of different noise sources. The first is non-parametric; calling a method of a SearchData object sd returns the fractions of variation retained and discarded after normalization and log-transformation:

f = sd.get_noise_decomp()

These fractions are not guaranteed to be positive, because the transformations may increase the relative spread of the data. On the other hand, if a fit has been completed, a method of a SearchResults object sr reports the fractions of intrinsic, extrinsic (bursty), and technical variation for each gene and molecular species:

f = sr.get_noise_decomp()

Differential parameter value identification

Given a set of matched datasets, run with the same model over the same set of genes, two approaches are available for identifying putative patterns of differential expression and regulation. A moment-based, biology-agnostic one uses a simple t-test to identify differences in the means of spliced counts in SearchData objects sd1 and sd2:

gf = compute_diffexp(sd1,sd2)

where gf is boolean vector that reports True if the gene is identified as DE. However, this approach cannot identify differences if biological parameters change in a correlated way and the mean stays the same. We introduce a more mechanistic approach for the identification of differential expression suggested by parameter variation, based on two SearchResults objects sr1 and sr2:

gf = compute_diffreg(sr1,sr2)

where gf is a two-dimensional boolean array that reports True if a particular parameter is identified as DE. After using these arrays to find a subpopulation of interest – e.g., genes that do not exhibit variation in the spliced mean, but do exhibit modulation in the burst size – it is possible to plug the gene filter genes_to_plot back in to inspect the raw data and fits:

gf = compare_gene_distributions(sr_arr,sd_arr,genes_to_plot=genes_to_plot)