This example shows how to test RNA-Seq data for differentially expressed genes using a negative binomial model.

A typical differential expression analysis of RNA-Seq data consists of normalizing the raw counts and performing statistical tests to reject or accept the null hypothesis that two groups of samples show no significant difference in gene expression. This example shows how to inspect the basic statistics of raw count data, how to determine size factors for count normalization and how to infer the most differentially expressed genes using a negative binomial model.

The dataset for this example comprises of RNA-Seq data obtained in the experiment described by Brooks et al. [1]. The authors investigated the effect of siRNA knock-down of pasilla, a gene known to play an important role in the regulation of splicing in *Drosophila melanogaster*. The dataset consists of 2 biological replicates of the control (untreated) samples and 2 biological replicates of the knock-down (treated) samples.

The starting point for this analysis of RNA-Seq data is a count matrix, where the rows correspond to genomic features of interest, the columns correspond to the given samples and the values represent the number of reads mapped to each feature in a given sample.

The included file `pasilla_count_noMM.mat`

contains two tables with the count matrices at the gene level and at the exon level for each of the considered samples. You can obtain similar matrices using the function `featurecount`

.

```
load pasilla_count_noMM.mat
```

```
% preview the table of read counts for genes
geneCountTable(1:10,:)
```

ans = 10x6 table ID Reference untreated3 untreated4 treated2 treated3 _____________ _________ __________ __________ ________ ________ 'FBgn0000003' '3R' 0 1 1 2 'FBgn0000008' '2R' 142 117 138 132 'FBgn0000014' '3R' 20 12 10 19 'FBgn0000015' '3R' 2 4 0 1 'FBgn0000017' '3L' 6591 5127 4809 6027 'FBgn0000018' '2L' 469 530 492 574 'FBgn0000024' '3R' 5 6 10 8 'FBgn0000028' 'X' 0 0 2 1 'FBgn0000032' '3R' 1160 1143 1138 1415 'FBgn0000036' '3R' 0 0 0 1

Note that when counting is performed without summarization, the individual features (exons in this case) are reported with their metafeature assignment (genes in this case) followed by the start and stop positions.

```
% preview the table of read counts for exons
exonCountTable(1:10,:)
```

ans = 10x6 table ID Reference untreated3 untreated4 treated2 treated3 _______________________________ _________ __________ __________ ________ ________ 'FBgn0000003_2648220_2648518' '3R' 0 0 0 1 'FBgn0000008_18024938_18025756' '2R' 0 1 0 2 'FBgn0000008_18050410_18051199' '2R' 13 9 14 12 'FBgn0000008_18052282_18052494' '2R' 4 2 5 0 'FBgn0000008_18056749_18058222' '2R' 32 27 26 23 'FBgn0000008_18058283_18059490' '2R' 14 18 29 22 'FBgn0000008_18059587_18059757' '2R' 1 4 3 0 'FBgn0000008_18059821_18059938' '2R' 0 0 2 0 'FBgn0000015_12758093_12760298' '3R' 1 2 0 0 'FBgn0000017_16615461_16618374' '3L' 1807 1572 1557 1702

You can annotate and group the samples by creating a logical vector as follows:

samples = geneCountTable(:,3:end).Properties.VariableNames; untreated = strncmp(samples,'untreated',length('untreated')) treated = strncmp(samples,'treated',length('treated'))

untreated = 1x4 logical array 1 1 0 0 treated = 1x4 logical array 0 0 1 1

The included file also contains a table `geneSummaryTable`

with the summary of assigned and unassigned SAM entries. You can plot the basic distribution of the counting results by considering the number of reads that are assigned to the given genomic features (exons or genes for this example), as well as the number of reads that are unassigned (i.e. not overlapping any feature) or ambiguous (i.e. overlapping multiple features).

st = geneSummaryTable({'Assigned','Unassigned_ambiguous','Unassigned_noFeature'},:) bar(table2array(st)','stacked'); legend(st.Properties.RowNames','Interpreter','none','Location','southeast'); xlabel('Sample') ylabel('Number of reads')

st = 3x4 table untreated3 untreated4 treated2 treated3 __________ __________ __________ __________ Assigned 1.5457e+07 1.6302e+07 1.5146e+07 1.8856e+07 Unassigned_ambiguous 1.5708e+05 1.6882e+05 1.6194e+05 1.9977e+05 Unassigned_noFeature 7.5455e+05 5.8309e+05 5.8756e+05 6.8356e+05

Note that a small fraction of the alignment records in the SAM files is not reported in the summary table. You can notice this in the difference between the total number of records in a SAM file and the total number of records processed during the counting procedure for that same SAM file. These unreported records correspond to the records mapped to reference sequences that are not annotated in the GTF file and therefore are not processed in the counting procedure. If the gene models account for all the reference sequences used during the read mapping step, then all records are reported in one of the categories of the summary table.

```
geneSummaryTable{'TotalEntries',:} - sum(geneSummaryTable{2:end,:})
```

ans = 89516 95885 98207 104629

When read counting is performed without summarization using the function `featurecount`

, the default IDs are composed by the attribute or metafeature (by default, gene_id) followed by the start and the stop positions of the feature (by default, exon). You can use the exon start positions to plot the read coverage across any chromosome in consideration, for example chromosome arm 2L.

% consider chromosome arm 2L chr2L = strcmp(exonCountTable.Reference,'2L'); exonCount = exonCountTable{:,3:end}; % retrieve exon start positions exonStart = regexp(exonCountTable{chr2L,1},'_(\d+)_','tokens'); exonStart = [exonStart{:}]; exonStart = cellfun(@str2num, [exonStart{:}]'); % sort exon by start positions [~,idx] = sort(exonStart); % plot read coverage along the genomic coordinates figure; plot(exonStart(idx),exonCount(idx,treated),'.-r',... exonStart(idx),exonCount(idx,untreated),'.-b'); xlabel('Genomic position'); ylabel('Read count (exon level)'); title('Read count on Chromosome arm 2L'); % plot read coverage for each group separately figure; subplot(2,1,1); plot(exonStart(idx),exonCount(idx,untreated),'.-r'); ylabel('Read count (exon level)'); title('Chromosome arm 2L (treated samples)'); subplot(2,1,2); plot(exonStart(idx),exonCount(idx,treated),'.-b'); ylabel('Read count (exon level)'); xlabel('Genomic position'); title('Chromosome arm 2L (untreated samples)');

Alternatively, you can plot the read coverage considering the starting position of each gene in a given chromosome. The file `pasilla_geneLength.mat`

contains a table with the start and stop position of each gene in the corresponding gene annotation file.

% load gene start and stop position information load pasilla_geneLength geneLength(1:10,:)

ans = 10x5 table ID Name Reference Start Stop _____________ _________ _________ _____ _____ 'FBgn0037213' 'CG12581' 3R 380 10200 'FBgn0000500' 'Dsk' 3R 15388 16170 'FBgn0053294' 'CR33294' 3R 17136 21871 'FBgn0037215' 'CG12582' 3R 23029 30295 'FBgn0037217' 'CG14636' 3R 30207 41033 'FBgn0037218' 'aux' 3R 37505 53244 'FBgn0051516' 'CG31516' 3R 44179 45852 'FBgn0261436' 'DhpD' 3R 53106 54971 'FBgn0037220' 'CG14641' 3R 56475 58077 'FBgn0015331' 'abs' 3R 58765 60763

% consider chromosome 3 ('Reference' is a categorical variable) chr3 = (geneLength.Reference == '3L') | (geneLength.Reference == '3R'); sum(chr3) % consider the counts for genes in chromosome 3 counts = geneCountTable{:,3:end}; [~,j,k] = intersect(geneCountTable{:,'ID'},geneLength{chr3,'ID'}); gstart = geneLength{k,'Start'}; gcounts = counts(j,:); % sort according to ascending start position [~,idx] = sort(gstart); % plot read coverage by genomic position figure; plot(gstart(idx), gcounts(idx,treated),'.-r',... gstart(idx), gcounts(idx,untreated),'.-b'); xlabel('Genomic position') ylabel('Read count'); title('Read count on Chromosome 3');

ans = 6360

The read count in RNA-Seq data has been found to be linearly related to the abundance of transcripts [2]. However, the read count for a given gene depends not only on the expression level of the gene, but also on the total number of reads sequenced and the length of the gene transcript. Therefore, in order to infer the expression level of a gene from the read count, we need to account for the sequencing depth and the gene transcript length. One common technique to normalize the read count is to use the RPKM (Read Per Kilobase Mapped) values, where the read count is normalized by the total number of reads yielded (in millions) and the length of each transcript (in kilobases). This normalization technique, however, is not always effective since few, very highly expressed genes can dominate the total lane count and skew the expression analysis.

A better normalization technique consists of computing the effective library size by considering a size factor for each sample. By dividing each sample's counts by the corresponding size factors, we bring all the count values to a common scale, making them comparable. Intuitively, if sample A is sequenced N times deeper than sample B, the read counts of non-differentially expressed genes are expected to be on average N times higher in sample A than in sample B, even if there is no difference in expression.

To estimate the size factors, take the median of the ratios of observed counts to those of a pseudo-reference sample, whose counts can be obtained by considering the geometric mean of each gene across all samples [3]. Then, to transform the observed counts to a common scale, divide the observed counts in each sample by the corresponding size factor.

```
% estimate pseudo-reference with geometric mean row by row
pseudoRefSample = geomean(counts,2);
nz = pseudoRefSample > 0;
ratios = bsxfun(@rdivide,counts(nz,:),pseudoRefSample(nz));
sizeFactors = median(ratios,1)
```

sizeFactors = 0.9374 0.9725 0.9388 1.1789

```
% transform to common scale
normCounts = bsxfun(@rdivide,counts,sizeFactors);
normCounts(1:10,:)
```

ans = 1.0e+03 * 0 0.0010 0.0011 0.0017 0.1515 0.1203 0.1470 0.1120 0.0213 0.0123 0.0107 0.0161 0.0021 0.0041 0 0.0008 7.0315 5.2721 5.1225 5.1124 0.5003 0.5450 0.5241 0.4869 0.0053 0.0062 0.0107 0.0068 0 0 0.0021 0.0008 1.2375 1.1753 1.2122 1.2003 0 0 0 0.0008

You can appreciate the effect of this normalization by using the function `boxplot`

to represent statistical measures such as median, quartiles, minimum and maximum.

figure; subplot(2,1,1) maboxplot(log2(counts),'title','Raw Read Count','orientation','horizontal') ylabel('sample') xlabel('log2(counts)') subplot(2,1,2) maboxplot(log2(normCounts),'title','Normalized Read Count','orientation','horizontal') ylabel('sample') xlabel('log2(counts)')

In order to better characterize the data, we consider the mean and the dispersion of the normalized counts. The variance of read counts is given by the sum of two terms: the variation across samples (raw variance) and the uncertainty of measuring the expression by counting reads (shot noise or Poisson). The raw variance term dominates for highly expressed genes, whereas the shot noise dominates for lowly expressed genes. You can plot the empirical dispersion values against the mean of the normalized counts in a log scale as shown below.

% consider the mean meanTreated = mean(normCounts(:,treated),2); meanUntreated = mean(normCounts(:,untreated),2); % consider the dispersion dispTreated = std(normCounts(:,treated),0,2) ./ meanTreated; dispUntreated = std(normCounts(:,untreated),0,2) ./ meanUntreated; % plot on a log-log scale figure; loglog(meanTreated,dispTreated,'r.'); hold on; loglog(meanUntreated,dispUntreated,'b.'); xlabel('log2(Mean)'); ylabel('log2(Dispersion)'); legend('Treated','Untreated','Location','southwest');

Given the small number of replicates, it is not surprising to expect that the dispersion values scatter with some variance around the true value. Some of this variance reflects sampling variance and some reflects the true variability among the gene expressions of the samples.

You can look at the difference of the gene expression among two conditions, by calculating the fold change (FC) for each gene, i.e. the ratio between the counts in the treated group over the counts in the untreated group. Generally these ratios are considered in the log2 scale, so that any change is symmetric with respect to zero (e.g. a ratio of 1/2 or 2/1 corresponds to -1 or +1 in the log scale).

% compute the mean and the log2FC meanBase = (meanTreated + meanUntreated) / 2; foldChange = meanTreated ./ meanUntreated; log2FC = log2(foldChange); % plot mean vs. fold change (MA plot) mairplot(meanTreated, meanUntreated,'Type','MA','Plotonly',true); set(get(gca,'Xlabel'),'String','mean of normalized counts') set(get(gca,'Ylabel'),'String','log2(fold change)')

Warning: Zero values are ignored

It is possible to annotate the values in the plot with the corresponding gene names, interactively select genes, and export gene lists to the workspace by calling the `mairplot`

function as illustrated below:

mairplot(meanTreated,meanUntreated,'Labels',geneCountTable.ID,'Type','MA');

Warning: Zero values are ignored

It is convenient to store the information about the mean value and fold change for each gene in a table. You can then access information about a given gene or a group of genes satisfying specific criteria by indexing the table by gene names.

```
% create table with statistics about each gene
geneTable = table(meanBase,meanTreated,meanUntreated,foldChange,log2FC);
geneTable.Properties.RowNames = geneCountTable.ID;
```

```
% summary
summary(geneTable)
```

Variables: meanBase: 11609x1 double Values: Min 0.21206 Median 201.24 Max 2.6789e+05 meanTreated: 11609x1 double Values: Min 0 Median 201.54 Max 2.5676e+05 meanUntreated: 11609x1 double Values: Min 0 Median 199.44 Max 2.7903e+05 foldChange: 11609x1 double Values: Min 0 Median 0.99903 Max Inf log2FC: 11609x1 double Values: Min -Inf Median -0.001406 Max Inf

```
% preview
geneTable(1:10,:)
```

ans = 10x5 table meanBase meanTreated meanUntreated foldChange log2FC ________ ___________ _____________ __________ ___________ FBgn0000003 0.9475 1.3808 0.51415 2.6857 1.4253 FBgn0000008 132.69 129.48 135.9 0.95277 -0.069799 FBgn0000014 15.111 13.384 16.838 0.79488 -0.33119 FBgn0000015 1.7738 0.42413 3.1234 0.13579 -2.8806 FBgn0000017 5634.6 5117.4 6151.8 0.83186 -0.26559 FBgn0000018 514.08 505.48 522.67 0.96711 -0.048243 FBgn0000024 7.2354 8.7189 5.752 1.5158 0.60009 FBgn0000028 0.74465 1.4893 0 Inf Inf FBgn0000032 1206.3 1206.2 1206.4 0.99983 -0.00025093 FBgn0000036 0.21206 0.42413 0 Inf Inf

% access information about a specific gene myGene = 'FBgn0261570'; geneTable(myGene,:) geneTable(myGene,{'meanBase','log2FC'}) % access information about a given gene list myGeneSet = {'FBgn0261570','FBgn0261573','FBgn0261575','FBgn0261560'}; geneTable(myGeneSet,:)

ans = 1x5 table meanBase meanTreated meanUntreated foldChange log2FC ________ ___________ _____________ __________ _______ FBgn0261570 4435.5 4939.1 3931.8 1.2562 0.32907 ans = 1x2 table meanBase log2FC ________ _______ FBgn0261570 4435.5 0.32907 ans = 4x5 table meanBase meanTreated meanUntreated foldChange log2FC ________ ___________ _____________ __________ _______ FBgn0261570 4435.5 4939.1 3931.8 1.2562 0.32907 FBgn0261573 2936.9 2954.8 2919.1 1.0122 0.01753 FBgn0261575 4.3776 5.6318 3.1234 1.8031 0.85047 FBgn0261560 2041.1 1494.3 2588 0.57738 -0.7924

Determining whether the gene expressions in two conditions are statistically different consists of rejecting the null hypothesis that the two data samples come from distributions with equal means. This analysis assumes the read counts are modeled according to a negative bionomial distribution (as proposed in [3]). The function `nbintest`

performs this type of hypothesis testing with three possible options to specify the type of linkage between the variance and the mean.

By specifying the link between variance and mean as an identity, we assume the variance is equal to the mean, and the counts are modeled by the Poisson distribution [4].

tIdentity = nbintest(counts(:,treated),counts(:,untreated),'VarianceLink','Identity'); h = plotVarianceLink(tIdentity); % set custom title h(1).Title.String = 'Variance Link on Treated Samples'; h(2).Title.String = 'Variance Link on Untreated Samples';

Alternatively, by specifying the variance as the sum of the shot noise term (i.e. mean) and a constant multiplied by the squared mean, the counts are modeled according to a distribution described in [5]. The constant term is estimated using all the rows in the data.

tConstant = nbintest(counts(:,treated),counts(:,untreated),'VarianceLink','Constant'); h = plotVarianceLink(tConstant); % set custom title h(1).Title.String = 'Variance Link on Treated Samples'; h(2).Title.String = 'Variance Link on Untreated Samples';

Finally, by considering the variance as the sum of the shot noise term (i.e. mean) and a locally regressed non-parametric smooth function of the mean, the counts are modeled according to the distribution proposed in [3].

tLocal = nbintest(counts(:,treated),counts(:,untreated),'VarianceLink','LocalRegression'); h = plotVarianceLink(tLocal); % set custom title h(1).Title.String = 'Variance Link on Treated Samples'; h(2).Title.String = 'Variance Link on Untreated Samples';

In order to evaluate which fit is the best for the data in consideration, you can compare the fitting curves in a single plot, as shown below.

h = plotVarianceLink(tLocal,'compare',true); % set custom title h(1).Title.String = 'Variance Link on Treated Samples'; h(2).Title.String = 'Variance Link on Untreated Samples';

The output of `nbintest`

includes a vector of P-values. A P-value indicates the probability that a change in expression as strong as the one observed (or even stronger) would occur under the null hypothesis, i.e. the conditions have no effect on gene expression. In the histogram of the P-values we observe an enrichment of low values (due to differentially expressed genes), whereas other values are uniformly spread (due to non-differentially expressed genes). The enrichment of values equal to 1 are due to genes with very low counts.

figure; histogram(tLocal.pValue,100) xlabel('P-value') ylabel('Frequency') title('P-value enrichment')

Filter out those genes with relatively low count to observe a more uniform spread of non-significant P-values across the range (0,1]. Note that this does not affect the distribution of significant P-values.

lowCountThreshold = 10; lowCountGenes = all(counts < lowCountThreshold, 2); histogram(tLocal.pValue(~lowCountGenes),100) xlabel('P-value') ylabel('Frequency') title('P-value enrichment without low count genes')

Thresholding P-values to determine what fold changes are more significant than others is not appropriate for this type of data analysis, due to the multiple testing problem. While performing a large number of simultaneous tests, the probability of getting a significant result simply due to chance increases with the number of tests. In order to account for multiple testing, perform a correction (or adjustment) of the P-values so that the probability of observing at least one significant result due to chance remains below the desired significance level.

The Benjamini-Hochberg (BH) adjustment [6] is a statistical method that provides an adjusted P-value answering the following question: what would be the fraction of false positives if all the genes with adjusted P-values below a given threshold were considered significant? Set a threshold of 0.1 for the adjusted P-values, equivalent to consider a 10% false positives as acceptable, and identify the genes that are significantly expressed by considering all the genes with adjusted P-values below this threshold.

% compute the adjusted P-values (BH correction) padj = mafdr(tLocal.pValue,'BHFDR',true); % add to the existing table geneTable.pvalue = tLocal.pValue; geneTable.padj = padj; % create a table with significant genes sig = geneTable.padj < 0.1; geneTableSig = geneTable(sig,:); geneTableSig = sortrows(geneTableSig,'padj'); numberSigGenes = size(geneTableSig,1)

numberSigGenes = 1904

You can now identify the most up-regulated or down-regulated genes by considering an absolute fold change above a chosen cutoff. For example, a cutoff of 1 in log2 scale yields the list of genes that are up-regulated with a 2 fold change.

% find up-regulated genes up = geneTableSig.log2FC > 1; upGenes = sortrows(geneTableSig(up,:),'log2FC','descend'); numberSigGenesUp = sum(up) % display the top 10 up-regulated genes top10GenesUp = upGenes(1:10,:) % find down-regulated genes down = geneTableSig.log2FC < -1; downGenes = sortrows(geneTableSig(down,:),'log2FC','ascend'); numberSigGenesDown = sum(down) % find top 10 down-regulated genes top10GenesDown = downGenes(1:10,:)

numberSigGenesUp = 129 top10GenesUp = 10x7 table meanBase meanTreated meanUntreated foldChange log2FC pvalue padj ________ ___________ _____________ __________ ______ __________ __________ FBgn0030173 3.3979 6.7957 0 Inf Inf 0.0063115 0.047764 FBgn0036822 3.1364 6.2729 0 Inf Inf 0.012203 0.079274 FBgn0052548 8.158 15.269 1.0476 14.575 3.8654 0.00016945 0.0022662 FBgn0050495 6.8315 12.635 1.0283 12.287 3.6191 0.0018949 0.017972 FBgn0063667 20.573 38.042 3.1042 12.255 3.6153 8.5037e-08 2.3845e-06 FBgn0033764 91.969 167.61 16.324 10.268 3.3601 1.8345e-25 2.9174e-23 FBgn0037290 85.845 155.46 16.228 9.5801 3.26 3.5583e-23 4.6941e-21 FBgn0033733 7.4634 13.384 1.5424 8.6773 3.1172 0.0027276 0.024283 FBgn0037191 7.1766 12.753 1.6003 7.9694 2.9945 0.0047803 0.038193 FBgn0033943 6.95 12.319 1.581 7.7921 2.962 0.0053635 0.041986 numberSigGenesDown = 181 top10GenesDown = 10x7 table meanBase meanTreated meanUntreated foldChange log2FC pvalue padj ________ ___________ _____________ __________ _______ __________ __________ FBgn0053498 15.469 0 30.938 0 -Inf 9.8404e-11 4.345e-09 FBgn0259236 6.8092 0 13.618 0 -Inf 1.5526e-05 0.00027393 FBgn0052500 4.3703 0 8.7405 0 -Inf 0.00066783 0.0075343 FBgn0039331 3.6954 0 7.3908 0 -Inf 0.0019558 0.018474 FBgn0040697 3.419 0 6.8381 0 -Inf 0.0027378 0.024336 FBgn0034972 2.9145 0 5.8291 0 -Inf 0.0068564 0.05073 FBgn0040967 2.6382 0 5.2764 0 -Inf 0.0096039 0.065972 FBgn0031923 2.3715 0 4.7429 0 -Inf 0.016164 0.098762 FBgn0085359 62.473 2.9786 121.97 0.024421 -5.3557 5.5813e-33 1.5068e-30 FBgn0004854 7.4674 0.53259 14.402 0.03698 -4.7571 8.1587e-05 0.0012034

A good visualization of the gene expressions and their significance is given by plotting the fold change versus the mean in log scale and coloring the data points according to the adjusted P-values.

figure scatter(log2(geneTable.meanBase),geneTable.log2FC,3,geneTable.padj,'o') colormap(flipud(cool(256))) colorbar; ylabel('log2(Fold Change)') xlabel('log2(Mean of normalized counts)') title('Fold change by FDR')

You can see here that for weakly expressed genes (i.e. those with low means), the FDR is generally high because low read counts are dominated by Poisson noise and consequently any biological variability is drowned in the uncertainties from the read counting.

[1] Brooks et al. Conservation of an RNA regulatory map between Drosophila and mammals. Genome Research 2011. 21:193-202.

[2] Mortazavi et al. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods 2008. 5:621-628.

[3] Anders et al. Differential expression analysis for sequence count data. Genome Biology 2010. 11:R106.

[4] Marioni et al. RNA-Seq: An assessment of technical reproducibility and comparison with gene expression arrays. Genome Research 2008. 18:1509-1517.

[5] Robinson et al. Moderated statistical test for assessing differences in tag abundance. Bioinformatics 2007. 23(21):2881-2887.

[6] Benjamini et al. Controlling the false discovery rate: a practical and powerful approach to multiple testing. 1995. Journal of the Royal Statistical Society, Series B57 (1):289-300.