featurecount

Compute the number of reads mapped to genomic features

Syntax

T = featurecount(GTFfile,Inputfile)

[T,S] =
featurecount(___)

[___] = featurecount(___,Name,Value)

Description

T = featurecount(GTFfile,Inputfile) counts the number of reads in the BAM-formatted or SAM-formatted file Inputfile that map onto genomic features as specified in the GTF-formatted file GTFfile. GTFfile specifies the annotation file. Inputfile specifies the names of the BAM or SAM files to consider. The output T is a table where rows correspond to features and columns correspond to the input files. The elements of the table consist of the number of reads mapping to each feature for a given input file.

example

[T,S] = featurecount(___) returns a table S with a summary of assigned and unassigned alignment entries. If multiple input files are provided, each file is associated with a column.

example

[___] = featurecount(___,Name,Value) uses additional options specified by one or more Name,Value pair arguments.

example

Examples

collapse all

Count the number of reads mapped to genomic features

Open Live Script

Count reads from a sample SAM file that map to the features included in a GTF file. By default, featurecount maps the reads to exons, and summarizes the total number of reads at the gene level.

[t,s] = featurecount('Dmel_BDGP5_nohc.gtf','rnaseq_sample1.sam');

Processing GTF file Dmel_BDGP5_nohc.gtf ...
Processing SAM file rnaseq_sample1.sam ...
Processing reference chr2L ...
Processing reference chr2R ...
Processing reference chr3L ...
Processing reference chr3R ...
Processing reference chr4 ...
Processing reference chrX ...
Done.

Display the first 10 rows of count data.

t(1:10,:)

ans=10×3 table
          ID           Reference    rnaseq_sample1
    _______________    _________    ______________

    {'FBgn0002121'}    {'chr2L'}           9      
    {'FBgn0067779'}    {'chr2L'}           2      
    {'FBgn0005278'}    {'chr2L'}           4      
    {'FBgn0031220'}    {'chr2L'}           4      
    {'FBgn0025683'}    {'chr2L'}          13      
    {'FBgn0053635'}    {'chr2L'}           2      
    {'FBgn0016977'}    {'chr2L'}          22      
    {'FBgn0086902'}    {'chr2L'}          27      
    {'FBgn0031245'}    {'chr2L'}           2      
    {'FBgn0024352'}    {'chr2L'}           2

The ID column contains the names of features (genes in this example). The Reference column lists the names of reference sequences for the features. The third column contains the total number of reads mapped to each feature for a given SAM file, that is, rnaseq_sample1.sam. By default, the table shows only those features (rows) and SAM files (columns) with non-zero read counts. Set 'ShowZeroCounts' to true to include those rows and columns with all zero counts in the output table.

s contains the summary statistics of assigned and unassigned reads from each SAM file. For instance, the TotalEntries row indicates the total number of alignment records from the given SAM file, and the Assigned row includes the number of reads that are assigned to features in the GTF file. For details about each row, refer to the Output Arguments section of the reference page.

s=9×1 table
                                    rnaseq_sample1
                                    ______________

    TotalEntries                        33354     
    Assigned                            16399     
    Unassigned_ambiguous                  167     
    Unassigned_filtered                     0     
    Unassigned_lowMappingQuality            0     
    Unassigned_multiMapped                  0     
    Unassigned_noFeature                16788     
    Unassigned_supplementary                0     
    Unassigned_unmapped                     0

Count reads without any summarization and disable displaying the progress messages.

[t2,s2] = featurecount('Dmel_BDGP5_nohc.gtf','rnaseq_sample1.sam', ...
                        'Summarization',false,'Verbose',false);

Notice the ID column of the output table now reports the feature attribute followed by the start and stop positions of each feature, separated by underscores.

t2(1:10,:)

ans=10×3 table
                 ID                  Reference    rnaseq_sample1
    _____________________________    _________    ______________

    {'FBgn0002121_12286_12928'  }    {'chr2L'}          3       
    {'FBgn0002121_13683_14874'  }    {'chr2L'}          1       
    {'FBgn0002121_14933_15711'  }    {'chr2L'}          3       
    {'FBgn0067779_67044_67507'  }    {'chr2L'}          2       
    {'FBgn0005278_108588_108809'}    {'chr2L'}          1       
    {'FBgn0005278_110755_110877'}    {'chr2L'}          1       
    {'FBgn0005278_112690_113369'}    {'chr2L'}          1       
    {'FBgn0031220_117079_117759'}    {'chr2L'}          2       
    {'FBgn0031220_118361_118874'}    {'chr2L'}          1       
    {'FBgn0031220_118931_119076'}    {'chr2L'}          1

You can choose how to assign a read to a particular feature when the read overlaps with multiple features by setting the 'OverlapMethod' option. For instance, if you want to count a read only if it fully overlaps a feature, use the 'full' option.

[tFull, sFull] = featurecount('Dmel_BDGP5_nohc.gtf','rnaseq_sample1.sam', ...
                              'OverlapMethod','full','Verbose',false);

If you have paired-end data, you can count reads as fragments.

[tFrag,sFrag] = featurecount('Dmel_BDGP5_nohc.gtf','rnaseq_sample1.sam', ...
                             'CountFragments',true,'Verbose',false);

You can also count fragments from multiple SAM files.

[t2,s2] = featurecount('Dmel_BDGP5_nohc.gtf',...
          {'rnaseq_sample1.sam','rnaseq_sample2.sam'},'CountFragments',true, ...
          'Verbose',false);

Use the following options to count paired-end reads where at least one of the read mates are above a certain mapping quality threshold.

[t3,s3] = featurecount('Dmel_BDGP5_nohc.gtf',...
          'rnaseq_sample1.sam','CountFragments',true,'MinMappingQuality',20, ...
          'Verbose',false);

If the reads come from any strand-specific assay, you can specify such strand specificity during counting. For instance, if the protocol is stranded, the strand of the feature is compared with the strand of the read. Then only those reads that have the same strand as the overlapped feature are counted.

[t4,s4] = featurecount('Dmel_BDGP5_nohc.gtf',...
          'rnaseq_sample1.sam','StrandSpecificity','stranded','Verbose',false);

Input Arguments

collapse all

`GTFfile` — GTF-formatted file name
character vector | string

GTF-formatted file name, specified as a character vector or string.

Example: 'Dmel_BDGP5_nohc.gtf'

`Inputfile` — BAM-formatted or SAM-formatted file name
character vector | string | string vector | cell array of character vectors

BAM-formatted or SAM-formatted file name, specified as a character vector, string, string vector, or cell array of character vectors.

Example: 'rnaseq_sample1.sam'

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Before R2021a, use commas to separate each name and value, and enclose Name in quotes.

Example: 'CountFragments',true specifies to count reads as pairs of mates.

`Feature` — Feature type
`'exon'` (default) | character vector | string

Feature type, specified as a character vector or string. This is used to decide what feature to consider from the GTF file. Default is 'exon'.

`Metafeature` — Attribute type
`'gene_id'` (default) | character vector | string

Attribute type, specified as a character vector or string. This is used to decide what attribute to consider from the GTF file for grouping features into metafeatures and summarizing the read count.

`Summarization` — Boolean variable indicating whether to summarize at the metafeature level
`true` (default) | `false`

Boolean variable indicating whether to summarize at the metafeature level, specified as true or false.

Default is true, meaning the function groups features into metafeatures and reports the read counts for metafeatures.

`Alias` — Name of file containing aliases of reference names
character vector | string

Name of file containing aliases of reference names, specified as a character vector or string. The file must be a tab-delimited file where the first column corresponds to the reference names used in the GTF file, and the second column corresponds to the reference names used in the input file(s). The names are case-sensitive. It is necessary to include only the reference names that are different in the GTF file and the input file. The file must contain only one alias term for any reference listed in the input file. By default, the reference names in the GTF file and those in the input files are assumed to be the same.

`CountFragments` — Boolean variable indicating whether to count reads as pairs of mates
`false` (default) | `true`

Boolean variable indicating whether to count reads as fragments, specified as true or false. Paired-end reads must have the same ID for the field QNAME in the input file, and the mutual order of mates is inferred by the appropriate bit in the FLAG field within the input file. Reads that have no valid mate either because the mate is unmapped or filtered out by input criteria are still counted if they satisfy the overlapping criteria.

Default is false, that is, the reads are counted as single-end reads, and their pairing information is ignored.

`StrandSpecificity` — Strand specificity of sequencing protocol
`'unstranded'` (default) | `'stranded'` | `'reverse'`

Strand specificity of the sequencing protocol, specified as 'unstranded' (default), 'stranded', or 'reverse'.

If 'unstranded', the strand of the reads (or fragments) is ignored.
If 'stranded', the strand of the reads (or fragments) is considered, and only those having the same strand as the feature they overlap are counted.
If 'reverse', the opposite direction of the strand of the reads (or fragments) is considered, and only those having the opposite strand as the feature they overlap are counted.

When counting fragments (paired-end reads), the strand of the first mate is considered as the strand of the whole fragment. The mutual order of mates (first or second) is inferred from the appropriate bit in the FLAG field of the input file.

`MinOverlap` — Minimum number of overlapped bases required
`1` (default) | positive integer

Minimum number of overlapped bases required to assign a read to a feature, specified as a positive integer. When counting fragments, the sum of the overlaps from each end is used as the minimum number of overlapped bases.

`MinMappingQuality` — Minimum mapping quality for a given read
`0` (default) | non-negative integer

Minimum mapping quality for a given read to be considered for counting, specified as a non-negative integer. This corresponds to the MAPQ field in the input file. If counting fragments, at least one of the read mates must satisfy this criterion in order to be considered for counting.

`CountMultiOverlap` — Boolean variable indicating whether to count reads overlapping multiple features
`false` (default) | `true`

Boolean variable indicating whether to count reads overlapping multiple features, specified as true or false (default).

If true, a read (or fragment) overlapping multiple features is counted multiple times. During summarization at the metafeature level, a read (or fragment) is counted only once if it overlaps with multiple features belonging to the same metafeature as long as it does not overlap with other metafeaures.

`CountMultiMapped` — Counting option for reads having multiple mapping locations in the input file
`'primary'` (default) | `'none'` | `'all'`

Counting option for reads having multiple mapping locations in the input file, specified as 'primary' (default), 'none', or 'all'.

If 'primary', only the primary alignment of a multi-mapped read is considered. The appropriate bit in the input file is used to identify primary alignments.
If 'none', all alignments of a multi-mapped read are ignored. The NH tag is used to identify multi-mapped reads.
If 'all', all alignments of a multi-mapped read are considered and counted multiple times.

`BothEndsMapped` — Boolean variable indicating whether a fragment must have both mates mapped
`false` (default) | `true`

Boolean variable indicating whether a fragment must have both mates mapped, specified as true or false. Mate mapping information is retrieved from the FLAG field in the input file. Default is false.

`ProperlyPaired` — Boolean variable indicating whether a fragment must be properly paired
`false` (default) | `true`

Boolean variable indicating whether a fragment must be properly paired, specified as true or false. Mate pairing information is retrieved from the FLAG field in the input file. Default is false.

`ShowZeroCounts` — Boolean variable indicating whether to report features or metafeatures with zero count
`false` (default) | `true`

Boolean variable indicating whether to report features or metafeatures with zero count for every input file in the output table, specified as true or false.

Default is false, that is, only rows with non-zero counts and columns with non-zero counts are included in the output table.

`OverlapMethod` — Method to use when assigning a given read to metafeature
`'partial'` (default) | `'full'` | `'max'` | `'hits'`

Method to use when assigning a given read to metafeature, specified as 'partial', 'full', 'max', or 'hits'. If 'Summarization' is set to false, then the reads are assigned to features, instead of metafeatures, based on the specified method.

In the following table, R refers to a read or fragment, and M refers to a metafeature.

Method	Description
`'partial'`	R is assigned to M if R overlaps (even partially) only with M. Otherwise R is considered ambiguous.
`'full'`	R is assigned to M if R is completely mapped only within M, that is, fully overlapping only M. Otherwise R is considered ambiguous.
`'max'`	R is assigned to M if R satisfies the overlapping criteria only with M, or if R satisfies the overlapping criteria with several metafeatures but overlaps fully only with M.
`'hits'`	R is assigned to M if R overlaps even partially only M, or if M is the only metafeature with the highest number of features hit by R; otherwise R is considered ambiguous.

The following schematic diagram and table illustrate the outcome of these methods in conjunction with the 'CountMultiOverlap' name-value pair argument. In the figure, the read refers to a short-read sequence from an input file, and feature A and feature B refers to features listed in a GTF file.

Each method column lists the feature that the read is assigned to based on the corresponding method. The 'CountMultiOverlap' column indicates whether this name-value pair is set to true or false and if it has any effect in the outcome of each method.

	`'CountMultiOverlap'`	`'partial'`	`'full'`	`'max'`	`'hits'`
Case 1	No effect since the read maps only to one feature (feature A).	feature A	feature A	feature A	feature A
Case 2	No effect since the read maps only to one feature (feature A).	feature A	no feature	feature A	feature A
Case 3	No effect since the read maps only to one feature (feature A).	feature A	no feature	feature A	feature A
Case 4	No effect since the read maps only to one feature (feature A).	feature A	feature A	feature A	feature A
Case 5	`false`	ambiguous	feature A	feature A	ambiguous
Case 5	`true`	feature A, feature B	feature A	feature A	feature A, feature B
Case 6	`false`	ambiguous	ambiguous	ambiguous	ambiguous
Case 6	`true`	feature A, feature B	feature A, feature B	feature A, feature B	feature A, feature B
Case 7	`false`	Ambiguous	feature A	feature A	feature A
Case 7	`true`	feature A, feature B	feature A	feature A	feature A

no feature means that the read is not assigned to any feature. If you have specified the second output table S, its Unassigned_noFeature row is incremented by one for such occurrence. ambiguous means that the read is not assigned to any feature since it satisfies the overlapping criteria for multiple features, and the Unassigned_ambiguous row is incremented by one for such occurrence.

`UseParallel` — Boolean variable indicating whether to compute in parallel
`false` (default) | `true`

Boolean variable indicating whether to compute in parallel, specified as true or false.

In order to execute the computation in parallel, you must have Parallel Computing Toolbox™. If a MATLAB^® parallel pool does not exist, one is automatically created when the auto-creation option is enabled in your parallel preferences. Otherwise, computation runs in the serial mode.

Default is false, that is, serial mode.

`Verbose` — Boolean variable indicating whether to display the progress of computation
`true` (default) | `false`

Boolean variable indicating whether to display the progress of computation, specified as true or false.

Output Arguments

collapse all

`T` — Results containing sequence reads mapped to genomic features
table

Results containing sequence reads mapped to genomic features, returned as a table. The rows correspond to features, and columns correspond to the input files. The elements of the table consist of the number of reads mapped to each feature for a given input file. The table also reports the ID of each feature and the reference sequence for the feature.

When 'Summarization' is set to false, the ID column of the table reports the metafeature attribute followed by the start and stop positions of each feature, separated by underscores.

`S` — Summary of assigned and unassigned alignment entries
table

Summary of assigned and unassigned alignment entries, returned as a table. Each column of the table corresponds to each input file provided. The table has the following rows:

Row Names	Description
`TotalEntries`	Number of records (or alignments) in the input file
`Assigned`	Number of reads or fragments that were assigned to features
`Unassigned_ambiguous`	Number of unassigned reads or fragments overlapping multiple features or metafeatures
`Unassigned_filtered`	Number of alignment records filtered by input criteria
`Unassigned_lowMappingQuality`	Number of alignment records filtered out due to low mapping quality
`Unassigned_multiMapped`	Number of alignment records not assigned because of corresponding reads mapped to multiple locations
`Unassigned_noFeature`	Number of reads or fragments not assigned to any features
`Unassigned_supplementary`	Number of alignment records not assigned because they are flagged as supplementary records for chimeric alignments
`Unassigned_unmapped`	Number of alignment records not assigned because corresponding reads are unmapped

Extended Capabilities

Automatic Parallel Support
Accelerate code by automatically running computation in parallel using Parallel Computing Toolbox™.

To run in parallel, set 'UseParallel' to true.

For more information, see the 'UseParallel' name-value pair argument.

Version History

Introduced in R2016a

featurecount

Syntax

Description

Examples

Count the number of reads mapped to genomic features

Input Arguments

`GTFfile` — GTF-formatted file name
character vector | string

`Inputfile` — BAM-formatted or SAM-formatted file name
character vector | string | string vector | cell array of character vectors

Name-Value Arguments

`Feature` — Feature type
`'exon'` (default) | character vector | string

`Metafeature` — Attribute type
`'gene_id'` (default) | character vector | string

`Summarization` — Boolean variable indicating whether to summarize at the metafeature level
`true` (default) | `false`

`Alias` — Name of file containing aliases of reference names
character vector | string

`CountFragments` — Boolean variable indicating whether to count reads as pairs of mates
`false` (default) | `true`

`StrandSpecificity` — Strand specificity of sequencing protocol
`'unstranded'` (default) | `'stranded'` | `'reverse'`

`MinOverlap` — Minimum number of overlapped bases required
`1` (default) | positive integer

`MinMappingQuality` — Minimum mapping quality for a given read
`0` (default) | non-negative integer

`CountMultiOverlap` — Boolean variable indicating whether to count reads overlapping multiple features
`false` (default) | `true`

`CountMultiMapped` — Counting option for reads having multiple mapping locations in the input file
`'primary'` (default) | `'none'` | `'all'`

`BothEndsMapped` — Boolean variable indicating whether a fragment must have both mates mapped
`false` (default) | `true`

`ProperlyPaired` — Boolean variable indicating whether a fragment must be properly paired
`false` (default) | `true`

`ShowZeroCounts` — Boolean variable indicating whether to report features or metafeatures with zero count
`false` (default) | `true`

`OverlapMethod` — Method to use when assigning a given read to metafeature
`'partial'` (default) | `'full'` | `'max'` | `'hits'`

`UseParallel` — Boolean variable indicating whether to compute in parallel
`false` (default) | `true`

`Verbose` — Boolean variable indicating whether to display the progress of computation
`true` (default) | `false`

Output Arguments

`T` — Results containing sequence reads mapped to genomic features
table

`S` — Summary of assigned and unassigned alignment entries
table

Extended Capabilities

Automatic Parallel Support
Accelerate code by automatically running computation in parallel using Parallel Computing Toolbox™.

Version History

See Also

Topics

featurecount

Syntax

Description

Examples

Count the number of reads mapped to genomic features

Input Arguments

GTFfile — GTF-formatted file name character vector | string

Inputfile — BAM-formatted or SAM-formatted file name character vector | string | string vector | cell array of character vectors

Name-Value Arguments

Feature — Feature type 'exon' (default) | character vector | string

Metafeature — Attribute type 'gene_id' (default) | character vector | string

Summarization — Boolean variable indicating whether to summarize at the metafeature level true (default) | false

Alias — Name of file containing aliases of reference names character vector | string

CountFragments — Boolean variable indicating whether to count reads as pairs of mates false (default) | true

StrandSpecificity — Strand specificity of sequencing protocol 'unstranded' (default) | 'stranded' | 'reverse'

MinOverlap — Minimum number of overlapped bases required 1 (default) | positive integer

MinMappingQuality — Minimum mapping quality for a given read 0 (default) | non-negative integer

CountMultiOverlap — Boolean variable indicating whether to count reads overlapping multiple features false (default) | true

CountMultiMapped — Counting option for reads having multiple mapping locations in the input file 'primary' (default) | 'none' | 'all'

BothEndsMapped — Boolean variable indicating whether a fragment must have both mates mapped false (default) | true

ProperlyPaired — Boolean variable indicating whether a fragment must be properly paired false (default) | true

ShowZeroCounts — Boolean variable indicating whether to report features or metafeatures with zero count false (default) | true

OverlapMethod — Method to use when assigning a given read to metafeature 'partial' (default) | 'full' | 'max' | 'hits'

UseParallel — Boolean variable indicating whether to compute in parallel false (default) | true

Verbose — Boolean variable indicating whether to display the progress of computation true (default) | false

Output Arguments

T — Results containing sequence reads mapped to genomic features table

S — Summary of assigned and unassigned alignment entries table

Extended Capabilities

Automatic Parallel Support Accelerate code by automatically running computation in parallel using Parallel Computing Toolbox™.

Version History

See Also

Topics

`GTFfile` — GTF-formatted file name
character vector | string

`Inputfile` — BAM-formatted or SAM-formatted file name
character vector | string | string vector | cell array of character vectors

`Feature` — Feature type
`'exon'` (default) | character vector | string

`Metafeature` — Attribute type
`'gene_id'` (default) | character vector | string

`Summarization` — Boolean variable indicating whether to summarize at the metafeature level
`true` (default) | `false`

`Alias` — Name of file containing aliases of reference names
character vector | string

`CountFragments` — Boolean variable indicating whether to count reads as pairs of mates
`false` (default) | `true`

`StrandSpecificity` — Strand specificity of sequencing protocol
`'unstranded'` (default) | `'stranded'` | `'reverse'`

`MinOverlap` — Minimum number of overlapped bases required
`1` (default) | positive integer

`MinMappingQuality` — Minimum mapping quality for a given read
`0` (default) | non-negative integer

`CountMultiOverlap` — Boolean variable indicating whether to count reads overlapping multiple features
`false` (default) | `true`

`CountMultiMapped` — Counting option for reads having multiple mapping locations in the input file
`'primary'` (default) | `'none'` | `'all'`

`BothEndsMapped` — Boolean variable indicating whether a fragment must have both mates mapped
`false` (default) | `true`

`ProperlyPaired` — Boolean variable indicating whether a fragment must be properly paired
`false` (default) | `true`

`ShowZeroCounts` — Boolean variable indicating whether to report features or metafeatures with zero count
`false` (default) | `true`

`OverlapMethod` — Method to use when assigning a given read to metafeature
`'partial'` (default) | `'full'` | `'max'` | `'hits'`

`UseParallel` — Boolean variable indicating whether to compute in parallel
`false` (default) | `true`

`Verbose` — Boolean variable indicating whether to display the progress of computation
`true` (default) | `false`

`T` — Results containing sequence reads mapped to genomic features
table

`S` — Summary of assigned and unassigned alignment entries
table

Automatic Parallel Support
Accelerate code by automatically running computation in parallel using Parallel Computing Toolbox™.