Main Content

bioinfo.pipeline.block.SeqFilter

Bioinformatics pipeline block to filter sequences

Since R2023a

  • seqfilter block icon

Description

A SeqFilter block enables you to filter sequences based on a specified criterion.

Creation

Description

example

b = bioinfo.pipeline.block.SeqFilter creates a SeqFilter block.

b = bioinfo.pipeline.block.SeqFilter(options) also specifies additional options.

b = bioinfo.pipeline.block.SeqFilter(Name=Value) specifies additional options as the property names and values of a SeqFilterOptions object. This object is set as the value of the Options property of the block.

Note

The block always overwrites existing output files, unlike the seqfilter function.

Input Arguments

expand all

SeqFilter options, specified as a SeqFilterOptions object.

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Note

The following list of arguments is a partial list. For the complete list, refer to the properties of SeqFilterOptions object.

Criterion to filter sequences, specified as one of the following options. Specify only one filtering criterion per function call.

  • 'MaxNumberLowQualityBases'– applies a maximum threshold on the number of low-quality bases allowed.

  • 'MaxPercentLowQualityBases'– applies a maximum threshold on the percentage of low-quality bases allowed.

  • 'MeanQuality'– applies a minimum threshold on the average base quality across each sequence.

  • 'MinLength'– applies a minimum threshold on the sequence length.

Use this name-value pair argument together with 'Threshold' to specify the appropriate threshold value. Depending on the filtering criterion, the corresponding value for 'Threshold' can be a scalar or two-element vector. See the 'Threshold' option for the default values. If you do not specify 'Threshold', then the function uses the default threshold value of the specified method. For each filtering criterion, the function uses the base quality encoding format specified by the 'Encoding' name-value pair argument.

Threshold value for the filtering criterion, specified as a scalar or vector. Use this name-value pair to define the threshold value for the filtering criterion specified by 'Method'.

Depending on the filtering criterion, the corresponding value for 'Threshold' can be a scalar or two-element vector. If you do not specify 'Threshold', then the function uses the default threshold value of the corresponding method. For each filtering criterion, the function uses the encoding format of the base quality specified by the 'Encoding' name-value pair argument.

'Method''Threshold'Default 'Threshold' value
'MaxNumberLowQualityBases'Two-element vector [V1 V2]. V1 is a nonnegative integer that specifies the maximum number of low-quality bases allowed. V2 specifies the minimum base quality. Any base with quality less than V2 is considered a low-quality base. Any sequence containing a number of low-quality bases greater than V1 is filtered out and not saved in the output file.[0 10]
'MaxPercentLowQualityBases'Two-element vector [V1 V2]. V1 is a scalar between 0 and 100 that specifies the maximum percentage of low-quality bases allowed. V2 specifies the minimum base quality. Any base with quality less than V2 is considered a low-quality base. Any sequence containing a percentage of low-quality bases greater than V1 is filtered out and not saved in the output file.[0 10]
'MeanQuality'Positive scalar that specifies the minimum threshold on the average base quality across each sequence. Any sequence with average base quality less than this value is filtered out.0
'MinLength'Nonnegative integer that specifies the minimum threshold on the sequence length allowed. Any sequence with length less than this value is filtered out. 1

Properties

expand all

Function to handle errors from the run method of the block, specified as a function handle. The handle specifies the function to call if the run method encounters an error within a pipeline. For the pipeline to continue after a block fails, ErrorHandler must return a structure that is compatible with the output ports of the block. The error handling function is called with the following two inputs:

  • Structure with these fields:

    FieldDescription
    identifierIdentifier of the error that occurred
    messageText of the error message
    indexLinear index indicating which block process failed in the parallel run. By default, the index is 1 because there is only one run per block. For details on how block inputs can be split across different dimensions for multiple run calls, see Bioinformatics Pipeline SplitDimension.

  • Input structure passed to the run method when it fails

Data Types: function_handle

This property is read-only.

Input ports of the block, specified as a structure. The field names of the structure are the names of the block input ports, and the field values are bioinfo.pipeline.Input objects. These objects describe the input port behaviors. The input port names are the expected field names of the input structure that you pass to the block run method.

The SeqFilter block Inputs structure has the following field:

  • FASTQFiles — Names of FASTQ-formatted files with sequence and quality information. This input is a required input that must be satisfied. The default value is a bioinfo.pipeline.datatypes.Unset object, which means that the input value is not set yet.

Data Types: struct

This property is read-only.

Output ports of the block, specified as a structure. The field names of the structure are the names of the block output ports, and the field values are bioinfo.pipeline.Output objects. These objects describe the output port behaviors. The field names of the output structure returned by the block run method are the same as the output port names.

The SeqFilter block Outputs structure has the following fields:

  • FilteredFASTQFiles — Output file names. By default, the name of each output file consists of the input file name followed by the output suffix ('_filtered').

    Tip

    To see the actual location of these files, first get the results of the block. Then use the unwrap method as shown in this example.

  • NumFilteredIn — Number of sequences selected from each input file, returned as a scalar or an n-by-1 vector where n is the number of input files. If there are multiple input files, the order in NumFilteredIn corresponds to the order of the input files.

  • NumFilteredOut — Number of sequences excluded from each input file, returned as a scalar or an n-by-1 vector where n is the number of input files. If there are multiple input files, the order in NumFilteredOut corresponds to the order of the input files.

Data Types: struct

SeqFilter options, specified as a SeqFilterOptions object. The default value is a default SeqFilterOptions object.

Object Functions

compilePerform block-specific additional checks and validations
copyCopy array of handle objects
emptyInputsCreate input structure for use with run method
evalEvaluate block object
runRun block object

Examples

collapse all

Use a SeqFilter block to filter out sequences with low-quality bases, where a base is considered low-quality if its quality score is less than 15 (default).

import bioinfo.pipeline.block.*
import bioinfo.pipeline.Pipeline

FC = FileChooser(which("SRR005164_1_50.fastq"));
SF = SeqFilter;

P = Pipeline;
addBlock(P,[FC,SF]);
connect(P,FC,SF,["Files","FASTQFiles"]);

run(P);
R = results(P,SF)
R = 

  struct with fields:

    FilteredFASTQFiles: [1×1 bioinfo.pipeline.datatypes.File]
         NumFilteredIn: 3
        NumFilteredOut: 47

Call unwrap on FilteredFASTQFiles to see the location of the output file.

unwrap(R.FilteredFASTQFiles)
ans = 

    "C:\PipelineResults\SeqFilter_1\1\SRR005164_1_50_filtered.fastq"

Import the Pipeline and block objects needed for the example.

import bioinfo.pipeline.Pipeline
import bioinfo.pipeline.block.*

Create a pipeline.

qcpipeline = Pipeline;

Select an input FASTQ file using a FileChooser block.

fastqfile = FileChooser(which("SRR005164_1_50.fastq"));

Create a SeqFilter block.

sequencefilter = SeqFilter;

Define the filtering threshold value. Specifically, filter out sequences with a total of more than 10 low-quality bases, where a base is considered a low-quality base if its quality score is less than 20.

sequencefilter.Options.Threshold = [10 20];

Add the blocks to the pipeline.

addBlock(qcpipeline,[fastqfile,sequencefilter]);

Connect the output of the first block to the input of the second block. To do so, you need to first check the input and output port names of the corresponding blocks.

View the Outputs (port of the first block) and Inputs (port of the second block).

fastqfile.Outputs
ans = struct with fields:
    Files: [1×1 bioinfo.pipeline.Output]

sequencefilter.Inputs
ans = struct with fields:
    FASTQFiles: [1×1 bioinfo.pipeline.Input]

Connect the Files output port of the fastqfile block to the FASTQFiles port of sequencefilter block.

connect(qcpipeline,fastqfile,sequencefilter,["Files","FASTQFiles"]);

Next, create a UserFunction block that calls the seqqcplot function to plot the quality data of the filtered sequence data. In this case, inputFile is the required argument for the seqqcplot function. The required argument name can be anything as long as it is a valid variable name.

qcplot = UserFunction("seqqcplot",RequiredArguments="inputFile",OutputArguments="figureHandle");

Alternatively, you can also use dot notation to set up your UserFunction block.

qcplot = UserFunction;
qcplot.RequiredArguments = "inputFile";
qcplot.Function = "seqqcplot";
qcplot.OutputArguments = "figureHandle";

Add the block.

addBlock(qcpipeline,qcplot);

Check the port names of sequencefilter block and qcplot block.

sequencefilter.Outputs
ans = struct with fields:
    FilteredFASTQFiles: [1×1 bioinfo.pipeline.Output]
         NumFilteredIn: [1×1 bioinfo.pipeline.Output]
        NumFilteredOut: [1×1 bioinfo.pipeline.Output]

qcplot.Inputs
ans = struct with fields:
    inputFile: [1×1 bioinfo.pipeline.Input]

Connect the FilteredFASTQFiles port of the sequencefilter block to the inputFile port of the qcplot block.

connect(qcpipeline,sequencefilter,qcplot,["FilteredFASTQFiles","inputFile"]);

Run the pipeline to plot the sequence quality data.

run(qcpipeline);

seqqcplot_figure.png

Version History

Introduced in R2023a