# Problem 939. DNA Pattern Match: Performance Metric - Speed

The Challenge is to Rapidly find matches of DNA sequences, Length=6, in a 1,800,000 long DNA file.

At IMACST the paper An Intelligent and Efficient Matching Algorithm to Finding a DNA Pattern claimed an astounding time improvement from 9.94 seconds to 7.84 seconds, 21% time reduction, to match six segments of length 6 in a 1.8M long DNA file. Basic probability asserts 1.8M/4^6 * 6 = 2637 matches. The paper's test case produced 2346 matches. The method employed used text processing in C++. The paper's L=25 and L=50 cases will be later challenges.

Matlab can achieve matching a six pattern set of L=6 in <15 msec (i5/16GB). This is merely a 99.8% time reduction.

Challenge Description: DNA is made of letters ACGT, wiki DNA, which for the purposes of this Matlab Cody Challenge are given values 0 thru 3. (ACGT= 0123)

Input: [DNA, DNA_ID, Patterns]

• DNA is a 1.8M long uint8 row vector of values 0 thru 3.
• DNA_ID identifies the DNA segment being processed. Multiple calls using the same DNA_ID will be performed. The first call of a DNA_ID is not timed.
• Patterns is an Nx6 uint8 array where each row corresponds to a search pattern.

Output: Locations

Locations of all start indices that match any of the patterns

Scoring: Average Time (msec) for a block of L=6 patterns

Example:

• DNA= [0 1 2 3 3 2 1 0 1 2 3 3]
• Pattern = [2 3 3 2 1 0]
• Locations = [3]

Hints:

• Vectorization (Base 4 to Base 10 of 6 character words)
• Bitshift/Reshape to create all words
• Logical Indexing

Coming soon: Genome DNA sequencing of PhagePhix174 and Haempphilus Influenza

### Solution Stats

42.31% Correct | 57.69% Incorrect
Last Solution submitted on Nov 30, 2023

Show comments

### Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!