.. _cgi:
CRISPRi-DR
==========
CRISPRi-DR is designed to analyze CRISPRi libraries from chemical-genetic interaction (CGI) experiments and identify significant CGIs, i.e., genes that affect sensitivity to the drug when depleted.
`Choudhery S, DeJesus MA, Srinivasan A, Rock J, Schnappinger D, Ioerger
TR. A dose-response model for statistical analysis of chemical genetic
interactions in CRISPRi screens. PLoS Comput Biol. 2024 May
20;20(5):e1011408. doi: 10.1371/journal.pcbi.1011408. PMID: 38768228;
PMCID: PMC11104602. `_
Workflow
--------
Starting with fastq files, barcode counts are extracted. The user creates a metadata file for the counts. Fractional abundances are then calculated and used to run the CRISPRi-DR model. The output of this model is a file that lists genes with their statistical parameters and significance. Genes with significant interactions are those with a qval of concentration dependence < 0.05 and an ebFDR < 0.05 on the slope coefficient. In addition, genes can be ranked by depletion by sorting on the coefficient of concentration dependence in ascending order.

.. image:: _images/CGI_workflow.png
   :width: 1000
   :alt: Alternative text

Command-line Steps
------------------
**Preprocessing: Fastq to Count Files**
This is the slowest step, taking a few minutes per fastq file. The number of reads processed is printed to the console to indicate progress.
::

    > python3 ../src/transit.py cgi extract_counts [Optional Arguments]

    Optional Arguments:
        -delete_temp_fastQ := if fastq files are provided as gzipped files, this flag indicates whether user would like to delete the temp uncompressed files

* **** : raw reads in ``*.fastq`` or ``*.fastq.gz`` (gzipped) format (gzipped files are uncompressed automatically)
* **-delete_temp_fastq** (optional flag) : if gzipped files were provided, add this flag to automatically delete the temporary uncompressed files afterward, which helps save disk space
* **** : A tab-separated text file that contains metadata for each sgRNA, where the columns include:

    1. sgRNA ids (user defined, must be unique per sgRNA). This must be the first column for both command-line and GUI usage.
    2. Orfs targeted by the sgRNA
    3. Barcodes (nucleotide sequences) of the sgRNAs
    4. sgRNA efficacies (measurements of effect on growth rate; in the publication of this method, sgRNA efficacy is the estimated log2 fold change in CFU at 25 generations (induced vs. uninduced), calculated through a passaging experiment)

* **** : Column name of the sgRNA info file that contains the barcodes of the sgRNAs (the first column is the sgRNA ids)

    * The barcode sequence in this column is reverse complemented before the reads are scanned.
    * If the column header has spaces, put it in quotes, like "barcode column"


.. note::

    * When using the GUI, the columns of the sgRNA info file must be in the order listed above. On the command line, column order does not matter (apart from the sgRNA ids in the first column), because the barcode column is specified by name as an argument.
    * Other columns are allowed in the sgRNA info file (for GUI usage, these extra columns should come after the first four listed above)

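The reverse-complement step described above can be sketched in a few lines of Python. This is only an illustration, not the code used by ``extract_counts``: the helper names ``reverse_complement`` and ``count_barcodes``, the exact-substring matching, and the toy barcodes and reads are all assumptions made for the example.

```python
# Hypothetical helpers for illustration; not part of TRANSIT.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def count_barcodes(reads, barcodes_by_id):
    """Count reads that contain the reverse complement of each barcode.

    Assumes exact substring matches and at most one barcode per read.
    """
    targets = {reverse_complement(bc): sid for sid, bc in barcodes_by_id.items()}
    counts = {sid: 0 for sid in barcodes_by_id}
    for read in reads:
        for target, sid in targets.items():
            if target in read:
                counts[sid] += 1
                break  # stop at the first barcode found in this read
    return counts


# Toy data: barcodes as they might appear in the sgRNA info file.
barcodes = {"sgRNA_1": "AAACCG", "sgRNA_2": "GGTTAC"}
reads = ["TTCGGTTTAA", "CCGTAACCGG", "CGGTTTCC", "ACGTACGT"]
counts = count_barcodes(reads, barcodes)
print(counts)
```

Reads that match no barcode (the last read here) are simply not counted.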
**Step 1: Combine Individual Counts Files into a Combined Counts File**
This is a fairly fast process. Combining 12 two-column files (sgRNA id and counts) into one 13-column file (sgRNA id in the first column, followed by the counts from each input file) takes at most a minute.
::

    > python3 ../src/transit.py cgi combine_counts ...

* counts files : each has sgRNA ids in its first column, and can have any number of columns.
* comma-separated headers: the column names of the combined counts file

.. note::

    The comma-separated headers must be in the same order as the columns in the counts file(s) provided.

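As a rough picture of what combining does, the sketch below merges small in-memory tables on the sgRNA id. It is not the ``combine_counts`` implementation: the function name, the assumption that counts files are tab-separated without a header row, the ``"sgRNA"`` label for the first column, and the choice to fill missing sgRNAs with 0 are all made up for this example.

```python
import csv
import io


def combine_counts(count_tables, headers):
    """Merge per-sample count tables into one table keyed by sgRNA id.

    Assumes tab-separated input with the sgRNA id in the first column and
    no header row; sgRNAs missing from a table are filled with "0".
    """
    tables, widths, ids, seen = [], [], [], set()
    for handle in count_tables:
        table, width = {}, 0
        for row in csv.reader(handle, delimiter="\t"):
            if not row:
                continue  # skip blank lines
            table[row[0]] = row[1:]
            width = max(width, len(row) - 1)
            if row[0] not in seen:  # keep first-seen order of sgRNA ids
                seen.add(row[0])
                ids.append(row[0])
        tables.append(table)
        widths.append(width)
    if sum(widths) != len(headers):
        raise ValueError("need exactly one header per count column")
    combined = [["sgRNA"] + list(headers)]
    for sgrna_id in ids:
        row = [sgrna_id]
        for table, width in zip(tables, widths):
            row.extend(table.get(sgrna_id, ["0"] * width))
        combined.append(row)
    return combined


# Toy inputs standing in for two counts files.
rep1 = io.StringIO("sg1\t10\nsg2\t5\n")
rep2 = io.StringIO("sg1\t7\nsg3\t2\n")
combined = combine_counts([rep1, rep2], ["DMSO_rep1", "DMSO_rep2"])
for row in combined:
    print("\t".join(row))
```

The header check mirrors the note above: one header per count column, supplied in the same order as the input columns.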
**Step 2: Extract Fractional Abundances**
This is a relatively quick process, taking less than a minute. This step turns the barcode counts into relative normalized abundances. Counts are normalized within samples and then expressed relative to the abundances in the uninduced ATC file, so the results are essentially fractions. The first few lines of the output file contain information about the counts files processed.
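The normalization described here can be illustrated with a small worked example. This is a sketch under stated assumptions, not the formula used by ``extract_abund``: the function names and the pseudocount of 1 (added only to avoid division by zero) are hypothetical.

```python
def normalize(counts, pseudocount=1):
    """Convert raw counts to within-sample abundances that sum to 1."""
    adjusted = {sgrna: n + pseudocount for sgrna, n in counts.items()}
    total = sum(adjusted.values())
    return {sgrna: n / total for sgrna, n in adjusted.items()}


def fractional_abundance(induced_counts, uninduced_counts, pseudocount=1):
    """Abundance of each sgRNA in an induced sample relative to uninduced."""
    induced = normalize(induced_counts, pseudocount)
    uninduced = normalize(uninduced_counts, pseudocount)
    return {sgrna: induced[sgrna] / uninduced[sgrna] for sgrna in induced}


# Toy counts for two sgRNAs in one induced sample and the uninduced sample.
induced = {"sg1": 49, "sg2": 149}
uninduced = {"sg1": 99, "sg2": 99}
fractions = fractional_abundance(induced, uninduced)
print(fractions)
```

In this toy case ``sg1`` ends up at half of its uninduced abundance and ``sg2`` at one and a half times.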
::

    > python3 ../src/transit.py cgi extract_abund [Optional Arguments]

    Optional Arguments:
        -no_uninduced := flag to calculate fractional abundances without uninduced abundances. If you do not have uninduced counts, you can set this flag and they will be approximated

* samples metadata file (USER created):

    * The columns expected in this file: column_name, drug, conc_xMIC, days_predepletion

        * column_name: the corresponding header name(s) in the combined counts file
        * conc_xMIC: the concentration of the drug the sample is treated with

        .. warning::

            conc_xMIC must be a numerical value, e.g., 0.5, and not a categorical value such as "low" or "high"

        * Equal numbers of replicates for all concentrations are not necessary
        * See [Li, S et al. 2022, PMID: 35637331] for an explanation of days_predepletion

    * Example metadata: ``transit/src/pytransit/data/CGI/counts_metadata.txt``

* control condition: The condition to be considered the control for this set of experiments, as specified in the "drug" column of the metadata file; typically an ATC-induced (+ ATC) condition with 0 drug concentration.
* sgRNA info file: A file that contains metadata for each sgRNA in the combined counts file, where the columns are as specified above.
* uninduced ATC file: A two-column file of sgRNAs and their counts in the uninduced (no ATC) condition with 0 drug concentration. **If you do not have a file with uninduced counts, you can set the '-no_uninduced' flag**. If the **-no_uninduced** flag is set, then uninduced abundances are approximated from the standard coefficient of variation (SCV) across the induced counts.
* drug : Name of the drug, as listed in the "drug" column of the metadata file, to be fit in the model
* days: The predepletion day the samples were taken from, as listed in the "days_predepletion" column of the metadata file, to be used in the analysis

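Because the metadata file is created by hand, it can help to check it before running this step. The sketch below is one possible check, not something TRANSIT provides: the function name is hypothetical, and it assumes the file is tab-separated with the header row on the first line.

```python
import csv
import io

REQUIRED_COLUMNS = ("column_name", "drug", "conc_xMIC", "days_predepletion")


def check_metadata(handle):
    """Return (line number, column_name, value) for each non-numeric conc_xMIC."""
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError("missing columns: " + ", ".join(missing))
    problems = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        try:
            float(row["conc_xMIC"])
        except (TypeError, ValueError):
            problems.append((line_no, row["column_name"], row["conc_xMIC"]))
    return problems


# Toy metadata with one categorical concentration that should be flagged.
example = io.StringIO(
    "column_name\tdrug\tconc_xMIC\tdays_predepletion\n"
    "DMSO_rep1\tDMSO\t0\t5\n"
    "RIF_low_rep1\tRIF\tlow\t5\n"
)
problems = check_metadata(example)
print(problems)
```

An empty list means every ``conc_xMIC`` value could be parsed as a number.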
**Step 3: Run the CRISPRi-DR model**
This is a relatively quick process, taking at most 3 minutes for a dataset of ~90,000 sgRNAs. This step fits the CRISPRi-DR model (statistical analysis of concentration dependence) to each gene in the file and writes the results to a tab-separated file.
::

    > python3 ../src/transit.py cgi run_model [Optional Arguments]

    Optional Arguments:
        -use_negatives := flag to use negative controls to calculate significance of coefficients of concentration dependence


.. warning::

    The ``-use_negatives`` flag allows the user to use sgRNAs whose IDs contain "Negative" to calculate Z-scores of the coefficients of concentration dependence in the final filtering step, as opposed to using the distribution of coefficients for all genes. The significant genes assessed with this flag are those with qval concentration dependence < 0.05 and \|Z scores of concentration dependence\| > 2. It does NOT include the empirical Bayes filter.

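The filtering rule in the warning above can be written out as a short sketch. It is an illustration only and may differ from what ``run_model`` does internally: the function name is hypothetical, and standardizing against the mean and sample standard deviation of the negative-control coefficients is an assumption.

```python
import statistics


def call_interactions(genes, negative_coefficients, q_cutoff=0.05, z_cutoff=2.0):
    """Return {gene: call}, where call is 1 (enriched), -1 (depleted), or 0.

    genes maps a gene name to (coefficient, qval). Z-scores are computed
    against the negative-control coefficients (assumed standardization).
    """
    center = statistics.mean(negative_coefficients)
    spread = statistics.stdev(negative_coefficients)  # sample standard deviation
    calls = {}
    for gene, (coefficient, qval) in genes.items():
        z = (coefficient - center) / spread
        if qval < q_cutoff and abs(z) > z_cutoff:
            calls[gene] = 1 if z > 0 else -1
        else:
            calls[gene] = 0
    return calls


# Toy coefficients; negatives stand in for the "Negative" control sgRNAs.
negatives = [-0.1, 0.0, 0.1]
genes = {
    "geneA": (-0.5, 0.01),  # strong negative slope, significant qval
    "geneB": (0.05, 0.20),  # small slope, not significant
    "geneC": (0.30, 0.01),  # positive slope, significant qval
    "geneD": (-0.4, 0.30),  # large |Z| but qval too high
}
calls = call_interactions(genes, negatives)
print(calls)
```

Both conditions must hold for a nonzero call, which is why ``geneD`` is not called despite its large Z-score.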
The output file has the following columns:

+--------------------------------------+------------------------------------------------------------------------------------------------------+
| Column Header                        | Column Definition                                                                                    |
+======================================+======================================================================================================+
| Significant Interactions             | 0 = no interaction; 1 = enriched, -1 = depleted (adjusted P-value (Q-value) < 0.05 and ebFDR < 0.05) |
+--------------------------------------+------------------------------------------------------------------------------------------------------+
| Orf                                  | Orf name of the gene                                                                                 |
+--------------------------------------+------------------------------------------------------------------------------------------------------+
| Gene                                 | Gene name                                                                                            |
+--------------------------------------+------------------------------------------------------------------------------------------------------+
| Nobs                                 | Number of sgRNAs targeting the gene                                                                  |
+--------------------------------------+------------------------------------------------------------------------------------------------------+
| intercept                            | Intercept of the CRISPRi-DR model fit to the gene                                                    |
+--------------------------------------+------------------------------------------------------------------------------------------------------+
| coefficient sgrna efficiency         | Measure of the effect of sgRNA efficiency on changes in abundance with increasing concentration      |
+--------------------------------------+------------------------------------------------------------------------------------------------------+
| coefficient concentration dependence | Measure of the effect of increasing concentration on changes in abundance                            |
+--------------------------------------+------------------------------------------------------------------------------------------------------+
| pval intercept                       | P-value of the intercept                                                                             |
+--------------------------------------+------------------------------------------------------------------------------------------------------+
| pval sgrna efficiency                | P-value of the coefficient of sgRNA efficiency                                                       |
+--------------------------------------+------------------------------------------------------------------------------------------------------+
| pval concentration dependence        | P-value of the coefficient of concentration dependence, based on a Wald test                         |
+--------------------------------------+------------------------------------------------------------------------------------------------------+
| qval concentration dependence        | Adjusted P-value (Q-value) of the coefficient of concentration dependence                            |
+--------------------------------------+------------------------------------------------------------------------------------------------------+
| Z scores of concentration dependence | Z-scores of the coefficient of concentration dependence                                              |
+--------------------------------------+------------------------------------------------------------------------------------------------------+
| locfdr                               | Estimated local false discovery rate for each gene, using empirical Bayes                            |
+--------------------------------------+------------------------------------------------------------------------------------------------------+
| ebFDR                                | Calculated global false discovery rate for each gene, using empirical Bayes                          |
+--------------------------------------+------------------------------------------------------------------------------------------------------+


.. note::

    The coefficient concentration dependence column is the column of interest. The Z-scores and adjusted P-values are calculated from this column and used to determine significant interactions. Sorting the output file on this coefficient (the slope of concentration dependence) ranks the genes by amount of depletion.

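The filtering and ranking described in this section can be done with a short script once the output file is loaded. The sketch below works on a small in-memory table rather than a real output file. It assumes the file is tab-separated with the column headers from the table above on its first line, and the function names and values are invented for the example.

```python
import csv
import io

COEF = "coefficient concentration dependence"
QVAL = "qval concentration dependence"
EBFDR = "ebFDR"


def significant_genes(rows, alpha=0.05):
    """Keep rows with both qval and ebFDR below alpha."""
    return [r for r in rows if float(r[QVAL]) < alpha and float(r[EBFDR]) < alpha]


def rank_by_depletion(rows):
    """Sort rows by the concentration-dependence coefficient, ascending."""
    return sorted(rows, key=lambda r: float(r[COEF]))


# Toy table containing only the columns used here.
example = "\n".join([
    "\t".join(["Orf", COEF, QVAL, EBFDR]),
    "\t".join(["orf_A", "-0.80", "0.001", "0.010"]),
    "\t".join(["orf_B", "0.10", "0.600", "0.700"]),
    "\t".join(["orf_C", "0.45", "0.020", "0.030"]),
    "\t".join(["orf_D", "-0.30", "0.040", "0.200"]),
])
rows = list(csv.DictReader(io.StringIO(example), delimiter="\t"))
hits = significant_genes(rows)
ranked = rank_by_depletion(rows)
print([r["Orf"] for r in hits])
print([r["Orf"] for r in ranked])
```

The most depleted genes appear first in the ranked list because their coefficients are the most negative.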
**Visualize Concentration-Dependence of sgRNAs for Specific Genes**
This process is fairly quick, taking less than a minute to run. This figure visualizes the amount of depletion in a gene at the sgRNA level. If the control concentration provided is 0, the lowest value on the x-axis of the plot refers to this concentration (because the log of the concentration is taken, a concentration of 0 is treated as two-fold lower than the lowest concentration). The slope of relative abundance (fraction of abundance of counts in ATC-induced vs. ATC-uninduced) versus log2(concentration) for each sgRNA is calculated and plotted, colored by sgRNA strength on a blue-orange gradient (as seen here):

.. image:: _images/RVBD3645_lmplot.png
   :width: 400
   :alt: Alternative text

::

    > python3 ../src/transit.py cgi visualize

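The handling of a zero control concentration and the per-sgRNA slope can be illustrated numerically. This is a minimal sketch with hypothetical function names and toy values. It uses an ordinary least-squares slope, which is an assumption about how the plotted line is fit.

```python
import math


def log2_concentrations(concentrations):
    """Take log2 of each concentration, treating 0 as half the lowest nonzero value."""
    lowest = min(c for c in concentrations if c > 0)
    return [math.log2(c if c > 0 else lowest / 2) for c in concentrations]


def slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Toy data for one sgRNA: concentrations in xMIC and relative abundances.
concentrations = [0, 0.25, 0.5, 1.0]
abundances = [0.8, 0.6, 0.4, 0.2]
log_conc = log2_concentrations(concentrations)
sgrna_slope = slope(log_conc, abundances)
print(log_conc, sgrna_slope)
```

Here the zero concentration is placed at log2(0.125), one unit below the lowest treated concentration, and the negative slope reflects depletion with increasing drug.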