Tnseq_stats

You can generate the same table of statistics shown on the Quality Control panel in the GUI from the command line using the ‘tnseq_stats’ command. Here is an example:

> python3 src/transit.py tnseq_stats -help

usage: python3 src/transit.py tnseq_stats <file.wig>+ [-o <output_file>]
       python3 src/transit.py tnseq_stats -c <combined_wig> [-o <output_file>]

> python3 src/transit.py tnseq_stats -c src/pytransit/data/cholesterol_glycerol.transit/comwig.tsv

Dataset                    Density Mean_ct NZmean NZmedian Max_ct   Total_cts Skewness Kurtosis Pickands_Tail_Index
cholesterol_H37Rv_rep1.wig 0.439   139.6   317.6  147      125355.5 10414005   54.8     4237.7  0.973
cholesterol_H37Rv_rep2.wig 0.439   171.4   390.5  148      704662.8 12786637  105.8    14216.2  1.529
cholesterol_H37Rv_rep3.wig 0.359   173.8   484.2  171      292294.8 12968502   42.2     2328.0  1.584
glycerol_H37Rv_rep1.wig    0.419   123.3   294.5  160        8813.3  9195672    4.0      33.0   0.184
glycerol_H37Rv_rep2.wig    0.516   123.8   240.1  127        8542.5  9235984    4.0      33.5   0.152

The output file is a tab-separated text file (which can be opened as a spreadsheet) with the following columns:

Column Header	Column Definition
Dataset	name of sample (.wig file)
Density	saturation (fraction of TA sites with non-zero insertion counts, e.g. 0.439 means 43.9%)
Mean_ct	the mean count over all TA sites
NZmean	the mean count over non-zero TA sites
NZmedian	the median count over non-zero TA sites
Max_ct	highest count over all TA sites (to check for outliers)
Total_cts	total insertion counts summed over all TA sites
Skewness	3rd-order moment of read count distribution
Kurtosis	4th-order moment of read count distribution
Pickands_Tail_Index	another measure of skewness of read count distributions
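
As a rough illustration of the column definitions above, the following sketch computes most of these statistics for a single sample from a plain list of insertion counts (one per TA site). It is not Transit's code: the function name summarize_counts is hypothetical, and the exact skewness and kurtosis conventions used here (population standardized moments, with 3 subtracted for kurtosis) are assumptions that may differ from what Transit reports.

```python
from statistics import median


def summarize_counts(counts):
    """Summarize insertion counts (one value per TA site) for one sample."""
    n = len(counts)
    nonzero = [c for c in counts if c > 0]
    total = sum(counts)
    mean_ct = total / n

    # Population central moments of the count distribution.
    m2 = sum((c - mean_ct) ** 2 for c in counts) / n
    m3 = sum((c - mean_ct) ** 3 for c in counts) / n
    m4 = sum((c - mean_ct) ** 4 for c in counts) / n

    # Assumption: standardized moments; kurtosis is reported as excess kurtosis.
    skewness = m3 / m2 ** 1.5 if m2 > 0 else 0.0
    kurtosis = m4 / m2 ** 2 - 3.0 if m2 > 0 else 0.0

    return {
        "Density": len(nonzero) / n,  # fraction of sites with a non-zero count
        "Mean_ct": mean_ct,
        "NZmean": sum(nonzero) / len(nonzero) if nonzero else 0.0,
        "NZmedian": median(nonzero) if nonzero else 0.0,
        "Max_ct": max(counts),
        "Total_cts": total,
        "Skewness": skewness,
        "Kurtosis": kurtosis,
    }


if __name__ == "__main__":
    # Tiny made-up example: four TA sites, two of them empty.
    for name, value in summarize_counts([0, 0, 2, 4]).items():
        print(name, value)
```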

Signs of potential problems with a dataset:

low density (<30%)

low NZmean (<10) - note: this can be affected by normalization of input combined_wig file (might need to re-run ‘export combined_wig’ with ‘--n nonorm’ to see means of raw data)

Max_ct: in most datasets, this is in the range of thousands to tens of thousands; if it is over a million, there may be an outlier (very high counts at one or a few sites, possibly due to positive selection) that could be distorting the rest of the counts

skewness: it is difficult to give a hard cutoff, but skewness > 50 could be a sign that a sample is noisy

Pickands Tail Index: it is difficult to give a hard cutoff, but PTI>0.5 could be a sign that a sample is noisy (skewed), and PTI>1 is bad.
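
The rules of thumb above can be applied automatically to a tnseq_stats output file. The sketch below is only an illustration: the function name flag_samples is hypothetical, the thresholds are the soft cutoffs listed above rather than hard limits, and it assumes the file is tab-separated with the header names shown in the example output (lines starting with '#' are skipped as a precaution). Note that Density is compared against 0.30 because the output reports it as a fraction.

```python
import csv
import io


def flag_samples(tsv_text):
    """Map each dataset name to a list of warnings based on the rules of thumb."""
    lines = [ln for ln in tsv_text.splitlines()
             if ln.strip() and not ln.startswith("#")]
    reader = csv.DictReader(io.StringIO("\n".join(lines)), delimiter="\t")
    report = {}
    for row in reader:
        warnings = []
        if float(row["Density"]) < 0.30:
            warnings.append("low density (<30%)")
        if float(row["NZmean"]) < 10:
            warnings.append("low NZmean (<10)")
        if float(row["Max_ct"]) > 1e6:
            warnings.append("Max_ct over a million (possible outlier)")
        if float(row["Skewness"]) > 50:
            warnings.append("skewness > 50 (possibly noisy)")
        pti = float(row["Pickands_Tail_Index"])
        if pti > 1:
            warnings.append("PTI > 1 (bad)")
        elif pti > 0.5:
            warnings.append("PTI > 0.5 (possibly noisy)")
        report[row["Dataset"]] = warnings
    return report


if __name__ == "__main__":
    # Two rows copied from the example output above, joined with tabs.
    header = ["Dataset", "Density", "Mean_ct", "NZmean", "NZmedian", "Max_ct",
              "Total_cts", "Skewness", "Kurtosis", "Pickands_Tail_Index"]
    rows = [
        ["cholesterol_H37Rv_rep2.wig", "0.439", "171.4", "390.5", "148",
         "704662.8", "12786637", "105.8", "14216.2", "1.529"],
        ["glycerol_H37Rv_rep1.wig", "0.419", "123.3", "294.5", "160",
         "8813.3", "9195672", "4.0", "33.0", "0.184"],
    ]
    text = "\n".join("\t".join(r) for r in [header] + rows)
    for dataset, warnings in flag_samples(text).items():
        print(dataset, warnings or "no warnings")
```

To check a real file, read it into a string first (for example with open(path).read()) and pass that to flag_samples.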

Pickands Tail Index (PTI) is defined in: James Pickands III. (1975) “Statistical Inference Using Extreme Order Statistics.” Ann. Statist. 3(1):119-131. It is calculated using a formula based on the order statistics of the distribution of counts (highest counts in sorted order), and increases for distributions with heavier tails (outliers). In Transit, the PTI is calculated over the highest counts, with ranks 10-100.
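
To make the calculation more concrete, here is a minimal sketch of the Pickands estimator, which for rank k compares the k-th, 2k-th, and 4k-th highest counts: log2((X(k) - X(2k)) / (X(2k) - X(4k))). The function name pickands_tail_index is hypothetical, and combining the per-rank estimates for k = 10 to 100 by taking their median is an assumption; Transit's actual implementation may index or aggregate the ranks differently.

```python
import math
from statistics import median


def pickands_tail_index(counts, k_min=10, k_max=100):
    """Pickands tail-index estimate from the highest counts of one sample."""
    ordered = sorted(counts, reverse=True)  # ordered[0] is the highest count
    estimates = []
    for k in range(k_min, k_max + 1):
        if 4 * k > len(ordered):
            break  # not enough sites to reach the 4k-th highest count
        upper = ordered[k - 1] - ordered[2 * k - 1]
        lower = ordered[2 * k - 1] - ordered[4 * k - 1]
        if upper <= 0 or lower <= 0:
            continue  # ties among the top counts make this rank uninformative
        estimates.append(math.log(upper / lower) / math.log(2))
    if not estimates:
        return float("nan")
    # Assumption: summarize the per-rank estimates by their median.
    return median(estimates)


if __name__ == "__main__":
    # Synthetic data: the k-th highest value decays as a power law in k,
    # with a slow decay (lighter tail) and a fast decay (heavier tail).
    light = [1000.0 * k ** -0.2 for k in range(1, 1001)]
    heavy = [1000.0 * k ** -1.5 for k in range(1, 1001)]
    print(pickands_tail_index(light), pickands_tail_index(heavy))
```

In the synthetic example, the series whose top values drop off more steeply (a few values far above the rest) yields the larger index, which matches the interpretation that PTI increases with heavier tails.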

The analysis methods in Transit will work with noisy samples, but the results could be affected (e.g. reduced sensitivity for detecting conditionally essential genes). Two options to consider are: 1) dropping the noisiest samples, or 2) applying a non-linear normalization like the Beta-Geometric Correction (BGC) (‘--n betageom’).

See also Quality Control for a discussion of how to interpret these metrics, and what to do if you have noisy samples.