CuffDiffOptions

Option set for cuffdiff

Description

A CuffDiffOptions object sets options for the cuffdiff function, which identifies significant changes in transcript expression [1].

Creation

Syntax

cuffdiffOpt = CuffDiffOptions

cuffdiffOpt = CuffDiffOptions(Name,Value)

cuffdiffOpt = CuffDiffOptions(S)

Description

cuffdiffOpt = CuffDiffOptions creates a CuffDiffOptions object with the default property values.

CuffDiffOptions requires the Cufflinks Support Package for the Bioinformatics Toolbox™. If the support package is not installed, then the function provides a download link. For details, see Bioinformatics Toolbox Software Support Packages.

example

cuffdiffOpt = CuffDiffOptions(Name,Value) sets the object properties using one or more name-value pair arguments. Enclose each property name in quotes. For example, cuffdiffOpt = CuffDiffOptions('DoIsoformSwitch',true) specifies to perform isoform switching tests.

cuffdiffOpt = CuffDiffOptions(S) specifies optional parameters using a string or character vector S.

Input Arguments

expand all

`S` — `cuffdiff` options
string | character vector

cuffdiff options, specified as a string or character vector. S must be in the original cuffdiff option syntax (prefixed by one or two dashes).

Example: '--seed 5'

Properties

expand all

`ConditionLabels` — Sample labels
string | string vector | character vector | cell array of character vector

Sample labels, specified as a string, string vector, character vector, or cell array of character vectors. The number of labels must equal the number of samples or the value must be empty [].

Example: ["Control","Mutant1","Mutant2"]

Data Types: string | char | cell

`ContrastFile` — Contrast file name
string | character vector

Contrast file name, specified as a string or character vector. The file must be a two-column tab-delimited text file, where each line indicates two conditions to compare using cuffdiff. The condition labels in the file must match either the labels specified for ConditionLabels or the sample names. The file must have a single header line as the first line, followed by one line for each contrast. An example of the contrast file format follows.

condition_A	condition_B
Control	Mutant1
Control	Mutant2

If you do not provide this file, cuffdiff compares every pair of input conditions, which can impact performance.

Example: "contrast.txt"

Data Types: char | string

`DispersionMethod` — Method to model variance in fragment counts
`"pooled"` (default) | `"per-condition"` | `"blind"` | `"poisson"`

Method to model the variance in fragment counts across replicates, specified as one of the following options:

"pooled" — The function uses each replicated condition to build a model and averages these models into a global model for all conditions in the experiment.
"per-condition" — The function produces a model for each condition. You can use this option only if all conditions have replicates.
"blind" — The function treats all samples as replicates of a single global distribution and produces one model.
"poisson" — Variance in fragment counts is a poisson model, where the fragment count is predicted to be the mean across replicates. This method is not recommended.

Select a method depending on whether you expect the variability in each group of samples to be similar.

When comparing two groups where the first group has low cross-replicate variability and the second group has high variability, choose the per-condition method.
If the conditions have similar levels of variability, choose the pooled method.
If you have only a single replicate in each condition, choose the blind method.

Example: "blind"

Data Types: char | string

`DoIsoformSwitch` — Flag to perform isoform switching tests
`true` (default) | `false`

Flag to perform isoform switching tests, specified as true or false. These tests estimate how much differential splicing exists in isoforms from a single primary transcript. By default, the value is true and the test results are saved in the output file splicing.diff.

Example: false

Data Types: logical

`EffectiveLengthCorrection` — Flag to normalize fragment counts
`true` (default) | `false`

Flag to normalize fragment counts to fragments per kilobase per million mapped reads (FPKM), specified as true or false.

Example: false

Data Types: logical

`ExtraCommand` — Additional commands
`""` (default) | string | character vector

The commands must be in the native syntax (prefixed by one or two dashes). Use this option to apply undocumented flags and flags without corresponding MATLAB^® properties.

When the software converts the original flags to MATLAB properties, it stores any unrecognized flags in this property.

Example: '--library-type fr-secondstrand'

Data Types: char | string

`FalseDiscoveryRate` — False discovery rate
`0.05` (default) | scalar between `0` and `1`

False discovery rate used during statistical tests, specified as a scalar between 0 and 1.

Example: 0.01

Data Types: double

`FragmentBiasCorrection` — Name of FASTA file with reference transcripts to detect bias
string | character vector

Name of the FASTA file with reference transcripts to detect bias in fragment counts, specified as a string or character vector. Library preparation can introduce sequence-specific bias into RNA-Seq experiments. Providing reference transcripts improves the accuracy of the transcript abundance estimates.

Example: "bias.fasta"

Data Types: char | string

`FragmentLengthMean` — Expected mean fragment length in base pairs
`200` (default) | positive integer

Expected mean fragment length, specified as a positive integer. The default value is 200 base pairs. The function can learn the fragment length mean for each SAM file. Using this option is not recommended for paired-end reads.

Example: 100

Data Types: double

`FragmentLengthSD` — Expected standard deviation for fragment length distribution
`80` (default) | positive scalar

Expected standard deviation for the fragment length distribution, specified as a positive scalar. The default value is 80 base pairs. The function can learn the fragment length standard deviation for each SAM file. Using this option is not recommended for paired-end reads.

Example: 70

Data Types: double

`GenerateAnalysisDiff` — Flag to create differential analysis files
`true` (default) | `false`

Flag to create differential analysis files (*.diff), specified as true or false.

Example: false

Data Types: logical

`IncludeAll` — Flag to use all object properties
`false` (default) | `true`

Flag to include all the object properties with the corresponding default values when converting to the original options syntax, specified as true or false. You can convert the properties to the original syntax prefixed by one or two dashes (such as '-d 100 -e 80') by using getCommand. The default value false means that when you call getCommand(optionsObject), it converts only the specified properties. If the value is true, getCommand converts all available properties, with default values for unspecified properties, to the original syntax.

Note

If you set IncludeAll to true, the software converts all available properties, using default values for unspecified properties. The only exception is when the default value of a property is NaN, Inf, [], '', or "". In this case, the software does not translate the corresponding property.

Example: true

Data Types: logical

`IsoformShiftReplicates` — Minimum number of replicates to test genes for differential regulation
`3` (default) | positive integer

Minimum number of replicates to test genes for differential regulation, specified as a positive integer. The function skips the tests when the number of replicates is smaller than the specified value.

Example: 2

Data Types: double

`LengthCorrection` — Flag to correct by transcript length
`true` (default) | `false`

Flag to correct by the transcript length, specified as true or false. Set this value to false only when the fragment count is independent of the feature size, such as for small RNA libraries with no fragmentation and for 3' end sequencing, where all fragments have the same length.

Example: false

Data Types: logical

`LibraryNormalizationMethod` — Method to normalize library size
`"geometric"` (default) | `"classic-fpkm"` | `"quartile"`

Method to normalize the library size, specified as one of the following options:

"geometric" — The function scales the FPKM values by the median geometric mean of fragment counts across all libraries as described in [2].
"classic-fpkm" — The function applies no scaling to the FPKM values or fragment counts.
"quartile" — The function scales the FPKM values by the ratio of upper quartiles between fragment counts and the average value across all libraries.

Example: "classic-fpkm"

Data Types: char | string

`MaskFile` — Name of GTF or GFF file containing transcripts to ignore
string | character vector

Name of the GTF or GFF file containing transcripts to ignore during analysis, specified as a string or character vector. Some examples of transcripts to ignore include annotated rRNA transcripts, mitochondrial transcripts, and other abundant transcripts. Ignoring these transcripts improves the robustness of the abundance estimates.

Example: "excludes.gtf"

Data Types: char | string

`MaxBundleFrags` — Maximum number of fragments to include for each locus before skipping
`500000` (default) | positive integer

Maximum number of fragments to include for each locus before skipping new fragments, specified as a positive integer. Skipped fragments are marked with the status HIDATA in the file skipped.gtf.

Example: 400000

Data Types: double

`MaxFragAlignments` — Maximum number of aligned reads to include for each fragment
`Inf` (default) | positive integer

Maximum number of aligned reads to include for each fragment before skipping new reads, specified as a positive integer. Inf, the default value, sets no limit on the maximum number of aligned reads.

Example: 1000

Data Types: double

`MaxMLEIterations` — Maximum number of iterations for maximum likelihood estimation
`5000` (default) | positive integer

Maximum number of iterations for the maximum likelihood estimation of abundances, specified as a positive integer.

Example: 4000

Data Types: double

`MinAlignmentCount` — Minimum number of alignments required in locus for significance testing
`10` (default) | positive integer

Minimum number of alignments required in a locus to perform the significance testing for differences between samples, specified as a positive integer.

Example: 8

Data Types: double

`MinIsoformFraction` — Minimum abundance of isoform to include in differential expression tests
`1e-5` (default) | scalar between `0` and `1`

Minimum abundance of an isoform to include in differential expression tests, specified as a scalar between 0 and 1. For alternative isoforms quantified at below the specified value, the function rounds down the abundance to zero. The specified value is a fraction of the major isoform. The function performs this filtering after MLE estimation but before MAP estimation to improve the robustness of confidence interval generation and differential expression analysis. Using a parameter value other than the default is not recommended.

Example: 1e-5

Data Types: double

`MultiReadCorrection` — Flag to improve abundance estimation using rescue method
`false` (default) | `true`

Flag to improve abundance estimation for reads mapped to multiple genomic positions using the rescue method, specified as true or false. If the value is false, the function divides multimapped reads uniformly to all mapped positions. If the value is true, the function uses additional information, including gene abundance estimation, inferred fragment length, and fragment bias, to improve transcript abundance estimation.

The rescue method is described in [3].

Example: true

Data Types: logical

`NormalizeCompatibleHits` — Flag to use only fragments compatible with reference transcript to calculate FPKM values
`true` (default) | `false`

Flag to use only fragments compatible with a reference transcript to calculate FPKM values, specified as true or false.

Example: false

Data Types: logical

`NormalizeTotalHits` — Flag to include all fragments to calculate FPKM values
`false` (default) | `true`

Flag to include all fragments to calculate FPKM values, specified as true or false. If the value is true, the function includes all fragments, including fragments without a compatible reference.

Example: true

Data Types: logical

`NumFragAssignmentDraws` — Number of fragment assignments to perform on each transcript
`50` (default) | positive integer

Number of fragment assignments to perform on each transcript, specified as a positive integer. For each fragment drawn from a transcript, the function performs the specified number of assignments probabilistically to determine the transcript assignment uncertainty and to estimate the variance-covariance matrix for the assigned fragment counts.

Example: 40

Data Types: double

`NumFragDraws` — Number of draws from negative binomial random number generator
`100` (default) | positive integer

Number of draws from the negative binomial random number generator for each transcript, specified as a positive integer. Each draw is a number of fragments that the function probabilistically assigns to transcripts in the transcriptome to determine the assignment uncertainty and to estimate the variance-covariance matrix for assigned fragment counts.

Example: 90

Data Types: double

`NumThreads` — Number of parallel threads to use
`1` (default) | positive integer

Number of parallel threads to use, specified as a positive integer. Threads are run on separate processors or cores. Increasing the number of threads generally improves the runtime significantly, but increases the memory footprint.

Example: 4

Data Types: double

`OutputDirectory` — Directory to store analysis results
current directory (`"./"`) (default) | string | character vector

Directory to store analysis results, specified as a string or character vector.

Example: "./AnalysisResults/"

Data Types: char | string

`Seed` — Seed for random number generator
`0` (default) | nonnegative integer

Seed for the random number generator, specified as a nonnegative integer. Setting a seed value ensures the reproducibility of the analysis results.

Example: 10

Data Types: double

`TimeSeries` — Flag to treat input samples as time series
`false` (default) | `true`

Flag to treat input samples as a time series rather than as independent experimental conditions, specified as true or false. If you set the value to true, you must provide samples in order of increasing time: the first SAM file must be for the first time point, the second SAM file for the second time point, and so on.

Example: true

Data Types: logical

`Version` — Supported version
string

This property is read-only.

Supported version of the original cufflinks software, returned as a string.

Example: "2.2.1"

Data Types: string

Object Functions

`getCommand`	Translate object properties to original options syntax
`getOptionsTable`	Return table with all properties and equivalent options in original syntax

Examples

collapse all

Create CuffDiffOptions Object

Create a CuffDiffOptions object with the default values.

opt = CuffDiffOptions;

Create an object using name-value pairs.

opt2 = CuffDiffOptions('FalseDiscoveryRate',0.01,'NumThreads',4)

Create an object by using the original syntax.

opt3 = CuffDiffOptions('--FDR 0.01 --num-threads 4')

Assemble Transcriptome and Perform Differential Expression Testing

Create a CufflinksOptions object to define cufflinks options, such as the number of parallel threads and the output directory to store the results.

cflOpt = CufflinksOptions;
cflOpt.NumThreads = 8;
cflOpt.OutputDirectory = "./cufflinksOut";

The SAM files provided for this example contain aligned reads for Mycoplasma pneumoniae from two samples with three replicates each. The reads are simulated 100bp-reads for two genes (gyrA and gyrB) located next to each other on the genome. All the reads are sorted by reference position, as required by cufflinks.

sams = ["Myco_1_1.sam","Myco_1_2.sam","Myco_1_3.sam",...
        "Myco_2_1.sam", "Myco_2_2.sam", "Myco_2_3.sam"];

Assemble the transcriptome from the aligned reads.

[gtfs,isofpkm,genes,skipped] = cufflinks(sams,cflOpt);

gtfs is a list of GTF files that contain assembled isoforms.

Compare the assembled isoforms using cuffcompare.

stats = cuffcompare(gtfs);

Merge the assembled transcripts using cuffmerge.

mergedGTF = cuffmerge(gtfs,'OutputDirectory','./cuffMergeOutput');

mergedGTF reports only one transcript. This is because the two genes of interest are located next to each other, and cuffmerge cannot distinguish two distinct genes. To guide cuffmerge, use a reference GTF (gyrAB.gtf) containing information about these two genes. If the file is not located in the same directory that you run cuffmerge from, you must also specify the file path.

gyrAB = which('gyrAB.gtf');
mergedGTF2 = cuffmerge(gtfs,'OutputDirectory','./cuffMergeOutput2',...
			'ReferenceGTF',gyrAB);

Calculate abundances (expression levels) from aligned reads for each sample.

abundances1 = cuffquant(mergedGTF2,["Myco_1_1.sam","Myco_1_2.sam","Myco_1_3.sam"],...
                        'OutputDirectory','./cuffquantOutput1');
abundances2 = cuffquant(mergedGTF2,["Myco_2_1.sam", "Myco_2_2.sam", "Myco_2_3.sam"],...
                        'OutputDirectory','./cuffquantOutput2');

Assess the significance of changes in expression for genes and transcripts between conditions by performing the differential testing using cuffdiff. The cuffdiff function operates in two distinct steps: the function first estimates abundances from aligned reads, and then performs the statistical analysis. In some cases (for example, distributing computing load across multiple workers), performing the two steps separately is desirable. After performing the first step with cuffquant, you can then use the binary CXB output file as an input to cuffdiff to perform statistical analysis. Because cuffdiff returns several files, specify the output directory is recommended.

isoformDiff = cuffdiff(mergedGTF2,[abundances1,abundances2],...
                      'OutputDirectory','./cuffdiffOutput');

Display a table containing the differential expression test results for the two genes gyrB and gyrA.

readtable(isoformDiff,'FileType','text')

ans =

  2×14 table

        test_id            gene_id        gene              locus             sample_1    sample_2    status     value_1       value_2      log2_fold_change_    test_stat    p_value    q_value    significant
    ________________    _____________    ______    _______________________    ________    ________    ______    __________    __________    _________________    _________    _______    _______    ___________

    'TCONS_00000001'    'XLOC_000001'    'gyrB'    'NC_000912.1:2868-7340'      'q1'        'q2'       'OK'     1.0913e+05    4.2228e+05          1.9522           7.8886      5e-05      5e-05        'yes'   
    'TCONS_00000002'    'XLOC_000001'    'gyrA'    'NC_000912.1:2868-7340'      'q1'        'q2'       'OK'     3.5158e+05    1.1546e+05         -1.6064          -7.3811      5e-05      5e-05        'yes'

You can use cuffnorm to generate normalized expression tables for further analyses. cuffnorm results are useful when you have many samples and you want to cluster them or plot expression levels for genes that are important in your study. Note that you cannot perform differential expression analysis using cuffnorm.

Specify a cell array, where each element is a string vector containing file names for a single sample with replicates.

alignmentFiles = {["Myco_1_1.sam","Myco_1_2.sam","Myco_1_3.sam"],...
                  ["Myco_2_1.sam", "Myco_2_2.sam", "Myco_2_3.sam"]}
isoformNorm = cuffnorm(mergedGTF2, alignmentFiles,...
                      'OutputDirectory', './cuffnormOutput');

Display a table containing the normalized expression levels for each transcript.

readtable(isoformNorm,'FileType','text')

ans =

  2×7 table

      tracking_id          q1_0          q1_2          q1_1          q2_1          q2_0          q2_2   
    ________________    __________    __________    __________    __________    __________    __________

    'TCONS_00000001'    1.0913e+05         78628    1.2132e+05    4.3639e+05    4.2228e+05    4.2814e+05
    'TCONS_00000002'    3.5158e+05    3.7458e+05    3.4238e+05    1.0483e+05    1.1546e+05    1.1105e+05

Column names starting with q have the format: conditionX_N, indicating that the column contains values for replicate N of conditionX.

References

[1] Trapnell, Cole, Brian A Williams, Geo Pertea, Ali Mortazavi, Gordon Kwan, Marijke J van Baren, Steven L Salzberg, Barbara J Wold, and Lior Pachter. “Transcript Assembly and Quantification by RNA-Seq Reveals Unannotated Transcripts and Isoform Switching during Cell Differentiation.” Nature Biotechnology 28, no. 5 (May 2010): 511–15.

[2] Anders, Simon, and Wolfgang Huber. “Differential Expression Analysis for Sequence Count Data.” Genome Biology 11, no. 10 (October 2010): R106. https://doi.org/10.1186/gb-2010-11-10-r106.

[3] Mortazavi, Ali, Brian A Williams, Kenneth McCue, Lorian Schaeffer, and Barbara Wold. “Mapping and Quantifying Mammalian Transcriptomes by RNA-Seq.” Nature Methods 5, no. 7 (July 2008): 621–28. https://doi.org/10.1038/nmeth.1226.

Version History

Introduced in R2019a

CuffDiffOptions

Description

Creation

Syntax

Description

Input Arguments

S — cuffdiff options string | character vector

Properties

ConditionLabels — Sample labels string | string vector | character vector | cell array of character vector

ContrastFile — Contrast file name string | character vector

DispersionMethod — Method to model variance in fragment counts "pooled" (default) | "per-condition" | "blind" | "poisson"

DoIsoformSwitch — Flag to perform isoform switching tests true (default) | false

EffectiveLengthCorrection — Flag to normalize fragment counts true (default) | false

ExtraCommand — Additional commands "" (default) | string | character vector

FalseDiscoveryRate — False discovery rate 0.05 (default) | scalar between 0 and 1

FragmentBiasCorrection — Name of FASTA file with reference transcripts to detect bias string | character vector

FragmentLengthMean — Expected mean fragment length in base pairs 200 (default) | positive integer

FragmentLengthSD — Expected standard deviation for fragment length distribution 80 (default) | positive scalar

GenerateAnalysisDiff — Flag to create differential analysis files true (default) | false

IncludeAll — Flag to use all object properties false (default) | true

IsoformShiftReplicates — Minimum number of replicates to test genes for differential regulation 3 (default) | positive integer

LengthCorrection — Flag to correct by transcript length true (default) | false

LibraryNormalizationMethod — Method to normalize library size "geometric" (default) | "classic-fpkm" | "quartile"

MaskFile — Name of GTF or GFF file containing transcripts to ignore string | character vector

MaxBundleFrags — Maximum number of fragments to include for each locus before skipping 500000 (default) | positive integer

MaxFragAlignments — Maximum number of aligned reads to include for each fragment Inf (default) | positive integer

MaxMLEIterations — Maximum number of iterations for maximum likelihood estimation 5000 (default) | positive integer

MinAlignmentCount — Minimum number of alignments required in locus for significance testing 10 (default) | positive integer

MinIsoformFraction — Minimum abundance of isoform to include in differential expression tests 1e-5 (default) | scalar between 0 and 1

MultiReadCorrection — Flag to improve abundance estimation using rescue method false (default) | true

NormalizeCompatibleHits — Flag to use only fragments compatible with reference transcript to calculate FPKM values true (default) | false

NormalizeTotalHits — Flag to include all fragments to calculate FPKM values false (default) | true

NumFragAssignmentDraws — Number of fragment assignments to perform on each transcript 50 (default) | positive integer

NumFragDraws — Number of draws from negative binomial random number generator 100 (default) | positive integer

NumThreads — Number of parallel threads to use 1 (default) | positive integer

OutputDirectory — Directory to store analysis results current directory ("./") (default) | string | character vector

Seed — Seed for random number generator 0 (default) | nonnegative integer

TimeSeries — Flag to treat input samples as time series false (default) | true

Version — Supported version string

Object Functions

Examples

Create CuffDiffOptions Object

Assemble Transcriptome and Perform Differential Expression Testing

References

Version History

See Also

Topics

External Websites

`S` — `cuffdiff` options
string | character vector

`ConditionLabels` — Sample labels
string | string vector | character vector | cell array of character vector

`ContrastFile` — Contrast file name
string | character vector

`DispersionMethod` — Method to model variance in fragment counts
`"pooled"` (default) | `"per-condition"` | `"blind"` | `"poisson"`

`DoIsoformSwitch` — Flag to perform isoform switching tests
`true` (default) | `false`

`EffectiveLengthCorrection` — Flag to normalize fragment counts
`true` (default) | `false`

`ExtraCommand` — Additional commands
`""` (default) | string | character vector

`FalseDiscoveryRate` — False discovery rate
`0.05` (default) | scalar between `0` and `1`

`FragmentBiasCorrection` — Name of FASTA file with reference transcripts to detect bias
string | character vector

`FragmentLengthMean` — Expected mean fragment length in base pairs
`200` (default) | positive integer

`FragmentLengthSD` — Expected standard deviation for fragment length distribution
`80` (default) | positive scalar

`GenerateAnalysisDiff` — Flag to create differential analysis files
`true` (default) | `false`

`IncludeAll` — Flag to use all object properties
`false` (default) | `true`

`IsoformShiftReplicates` — Minimum number of replicates to test genes for differential regulation
`3` (default) | positive integer

`LengthCorrection` — Flag to correct by transcript length
`true` (default) | `false`

`LibraryNormalizationMethod` — Method to normalize library size
`"geometric"` (default) | `"classic-fpkm"` | `"quartile"`

`MaskFile` — Name of GTF or GFF file containing transcripts to ignore
string | character vector

`MaxBundleFrags` — Maximum number of fragments to include for each locus before skipping
`500000` (default) | positive integer

`MaxFragAlignments` — Maximum number of aligned reads to include for each fragment
`Inf` (default) | positive integer

`MaxMLEIterations` — Maximum number of iterations for maximum likelihood estimation
`5000` (default) | positive integer

`MinAlignmentCount` — Minimum number of alignments required in locus for significance testing
`10` (default) | positive integer

`MinIsoformFraction` — Minimum abundance of isoform to include in differential expression tests
`1e-5` (default) | scalar between `0` and `1`

`MultiReadCorrection` — Flag to improve abundance estimation using rescue method
`false` (default) | `true`

`NormalizeCompatibleHits` — Flag to use only fragments compatible with reference transcript to calculate FPKM values
`true` (default) | `false`

`NormalizeTotalHits` — Flag to include all fragments to calculate FPKM values
`false` (default) | `true`

`NumFragAssignmentDraws` — Number of fragment assignments to perform on each transcript
`50` (default) | positive integer

`NumFragDraws` — Number of draws from negative binomial random number generator
`100` (default) | positive integer

`NumThreads` — Number of parallel threads to use
`1` (default) | positive integer

`OutputDirectory` — Directory to store analysis results
current directory (`"./"`) (default) | string | character vector

`Seed` — Seed for random number generator
`0` (default) | nonnegative integer

`TimeSeries` — Flag to treat input samples as time series
`false` (default) | `true`

`Version` — Supported version
string