Usage
=====

.. _installation:

Installation
------------

To use *Monod*, install it from `pip`:

.. code-block:: console

 pip install monod
 
To use it in your code, import the package components:

.. code-block:: python

 import monod
 from monod import *
 from monod.preprocess import *
 from monod.extract_data import *
 from monod.cme_toolbox import CMEModel
 from monod import inference, mminference
 from monod.analysis import *

Quantification 
----------------
To generate an intronic index for ``kb``, obtain a reference genome and run ``kb ref``:

.. code-block:: console

 kb ref -i ./refdata-gex-mm10-2020-A/kallisto/index.idx 
 -g ./refdata-gex-mm10-2020-A/t2g_mm10.txt 
 -f1 ./refdata-gex-mm10-2020-A/kallisto/cdna.fa 
 -f2 ./refdata-gex-mm10-2020-A/kallisto/intron.fa 
 -c1 ./refdata-gex-mm10-2020-A/kallisto/cdna_t2c.txt 
 -c2 ./refdata-gex-mm10-2020-A/kallisto/intron_t2c.txt 
 --workflow lamanno 
 ./refdata-gex-mm10-2020-A/fasta/genome.fa 
 ./refdata-gex-mm10-2020-A/genes/genes.gtf
 
To then generate the count matrices in ``loom`` format, use ``kb count``:

.. code-block:: console

   kb count --verbose 
   -i ../ref/refdata-gex-mm10-2020-A/kallisto/index.idx 
   -g ../ref/refdata-gex-mm10-2020-A/t2g_mm10.txt 
   -x 10xv3 
   -o OUTDIR/ 
   -t 30 -m 30G 
   -c1 ../ref/refdata-gex-mm10-2020-A/kallisto/cdna_t2c.txt 
   -c2 ../ref/refdata-gex-mm10-2020-A/kallisto/intron_t2c.txt 
   --workflow lamanno --filter bustools --overwrite --loom 
   DATASET_FASTQ_LOCATIONS

This generates a ``loom`` file in ``OUTDIR/counts_filtered/``.


Pre-processing 
----------------

To define a "batch," or a set of inference runs, run :py:func:`preprocess.construct_batch`:

.. code-block:: python

 dir_string,dataset_strings = construct_batch(raw_filepaths, transcriptome_filepath, dataset_names, \
                                              attribute_names, batch_location, meta, batch_id, n_genes)

To import the files, specify the raw data (``loom``, ``mtx``, or ``adata``) filepaths in ``raw_filepaths``, a gene length annotation filepath ``transcriptome_filepath``, and various batch and dataset metadata. To specify the number of genes to analyze, set ``n_genes``. 

This will create a batch directory, dataset-specific subdirectories, the file ``gene_set.csv`` with the list of genes that meet filtering thresholds for all datasets, and the file ``genes.csv`` with the list of genes selected for further analysis. This gene list can also be defined or updated manually, but inference is typically unsuitable for genes that do not meet the filtering thresholds.

Model, data, and parameter definition 
----------------

To define a CME model of transcription and sequencing, initialize an instance of :py:class:`cme_toolbox.CMEModel`:

.. code-block:: python

 CMEModel(biological_model,sequencing_model)

where ``biological_model = {'Bursty','Constitutive','Extrinsic','CIR'}`` represents the transcriptional process and ``sequencing_model = {'None','Bernoulli','Poisson'}`` represents the dynamics of the sampling process.

To define the search parameters, initialize an instance of :py:class:`inference.InferenceParameters`:

.. code-block:: python

 inference_parameters = inference.InferenceParameters(phys_lb,phys_ub,samp_lb,samp_ub,gridsize,\
                     dataset_string,fitmodel,use_lengths)

where ``phys_lb`` and ``phys_ub`` are bounds on the transcriptional process model parameters, ``samp_lb`` and ``samp_ub`` are bounds on the sampling process model parameters, ``gridsize`` defines the grid for the sampling parameter scan, and ``use_lengths`` determines whether the unspliced mRNA capture rate depends on the gene length (to model priming at ubiquitous internal polyA sites).

Alternatively one can define the search parameters and cluster the data. This will run the `meK-Means <https://github.com/pachterlab/CGP_2023/>`_ clustering algorithm (see the paper `here <https://www.biorxiv.org/content/10.1101/2023.09.17.558131v2>`_). We initialize an instance of :py:class:`mminference.InferenceParameters`:

.. code-block:: python

 inference_parameters = mminference.InferenceParameters(phys_lb,phys_ub,samp_lb,samp_ub,gridsize,\
                     k,epochs,\
                     dataset_string,fitmodel,use_lengths)

where ``k`` is the user-defined number of clusters to learn and ``epochs`` is the numbers of rounds to learn the data clusters. All other parameters remain the same as :py:class:`inference.InferenceParameters`. 

To create a ``SearchData`` object to input into the inference process, run :py:func:`extract_data.extract_data`:

.. code-block:: python

 search_data = extract_data(loom_filepath, transcriptome_filepath, dataset_name,
                                dataset_string, dir_string)

Running the inference pipeline 
----------------

To run the pipeline, simply call the following parallelized code:

.. code-block:: python

  result_string = inference_parameters.fit_all_grid_points(n_cores,search_data)

This will iterate over all grid points using ``n_cores`` processors.

Post-processing and QC
----------------
To load the search results, import the file string:

.. code-block:: python

 sr = load_search_results(result_string)

To identify the technical noise parameter optimum, call a method of a SearchResults object `sr`:

.. code-block:: python

 sr.find_sampling_optimum()

Optionally, test its stability under subsampling and chi-squared testing:

.. code-block:: python

 fig1,ax1 = plt.subplots(1,1)
 sr.plot_landscape(ax1)
 _=sr.chisquare_testing(sd)
 sr.resample_opt_viz()
 sr.resample_opt_mc_viz()
 sr.chisq_best_param_correction(sd,viz=True)

Optionally, examine whether the distribution fits match the raw data:

.. code-block:: python

 sr.plot_gene_distributions(sd,marg='joint')
 sr.plot_gene_distributions(sd,marg='nascent')
 sr.plot_gene_distributions(sd,marg='mature')

To chracterize the uncertainty, variation, and bias in biological parameters, compute the standard errors of their maximum likelihood estimates, then plot their distributions and dependence on length (which should be minimal):

.. code-block:: python

 sr.compute_sigma(sd,num_cores)
 sr.plot_param_L_dep(plot_errorbars=True,plot_fit=True)
 sr.plot_param_marg()

As the standard error computation is typically computationally intensive, it is useful to store an updated copy on disk after evaluating it:

.. code-block:: python

 sr.update_on_disk()

Noise decomposition
----------------
Two complementary methods are available for investigating the contributions of different noise sources. The first is non-parametric; calling a method of a ``SearchData`` object ``sd`` returns the fractions of variation retained and discarded after normalization and log-transformation:

.. code-block:: python

 f = sd.get_noise_decomp()
 
These fractions are not guaranteed to be positive, because the transformations may *increase* the relative spread of the data. On the other hand, if a fit has been completed, a method of a ``SearchResults`` object ``sr`` reports the fractions of intrinsic, extrinsic (bursty), and technical variation for each gene and molecular species:

.. code-block:: python

 f = sr.get_noise_decomp()
 
Differential parameter value identification
----------------
Given a set of matched datasets, run with the same model over the same set of genes, two approaches are available for identifying putative patterns of differential expression and regulation. A moment-based, biology-agnostic one uses a simple *t*-test to identify differences in the means of spliced counts in ``SearchData`` objects ``sd1`` and ``sd2``:

.. code-block:: python

 gf = compute_diffexp(sd1,sd2)
 
where ``gf`` is boolean vector that reports ``True`` if the gene is identified as DE. However, this approach cannot identify differences if biological parameters change in a correlated way and the mean stays the same. We introduce a more mechanistic approach for the identification of differential expression suggested by parameter variation, based on two ``SearchResults`` objects ``sr1`` and ``sr2``:

.. code-block:: python

 gf = compute_diffreg(sr1,sr2)
 
where ``gf`` is a two-dimensional boolean array that reports ``True`` if a particular *parameter* is identified as DE. After using these arrays to find a subpopulation of interest -- e.g., genes that do not exhibit variation in the spliced mean, but do exhibit modulation in the burst size -- it is possible to plug the gene filter ``genes_to_plot`` back in to inspect the raw data and fits:

.. code-block:: python

 gf = compare_gene_distributions(sr_arr,sd_arr,genes_to_plot=genes_to_plot)