counting pattern frequency in a string

Question

Mark on 22 Dec 2017


https://kr.mathworks.com/matlabcentral/answers/374238-counting-pattern-frequency-in-a-string


Looking for ideas on the fastest way to count the number of occurrences of letter patterns in a string. For example, for the test string 'ZXCVBNMZXCVBAS' with a pattern length of 4, it would generate the following table: ZXCV=2, XCVB=2, CVBN=1, etc. The brute-force way to do it is to start with an empty list of 4-letter strings and pull 4-letter blocks from the test string, moving over one letter at a time. For each block, check whether it is already on the list: if so, increment its count; if not, add it to the end of the list.

The problem with this is that if you have very long test strings (say 10^8 letters and longer) and move to longer pattern 'word' lengths, say 8 or 10 letters, the number of unique patterns is huge, so the whole thing slows down as each 8-letter block is compared against a huge list.
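For reference, the brute-force approach described above can be sketched roughly as follows. This is only an illustration of the idea, not the script that was actually run; the variable names pats and cnts are made up for the example.

```matlab
% Brute-force sketch: keep a growing list of blocks and their counts.
str  = 'ZXCVBNMZXCVBAS';   % test string from the question
num  = 4;                  % pattern length
pats = {};                 % hypothetical name: patterns seen so far
cnts = [];                 % hypothetical name: count for each pattern
for k = 1:numel(str)-num+1
    blk = str(k:k+num-1);            % current num-letter block
    idx = find(strcmp(pats, blk));   % linear search of the list (the slow part)
    if isempty(idx)
        pats{end+1} = blk;           % new pattern: append with a count of 1
        cnts(end+1) = 1;
    else
        cnts(idx) = cnts(idx) + 1;   % known pattern: increment its count
    end
end
```

The strcmp call scans the whole list for every block, which is why the run time grows with the number of unique patterns.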

Another idea would be to assign each possible pattern an index: AAAA=1, AAAB=2, so that you can just calculate the index using base 26 conversion for any given string, and increment the value at that index of a master vector (e.g. AABA=27, so increment vector(27) by one). But again, with longer pattern lengths, there is not enough memory to store a vector with that many indices, covering all the words (26^8 or 26^10, etc.).
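A minimal sketch of that indexing idea, assuming the string contains only the uppercase letters A-Z. The sample string and the names dig, wts, idx and tally are invented for illustration.

```matlab
% Base-26 indexing sketch: map each block to an index, then tally.
str  = 'AAAAAABA';                     % hypothetical sample string
num  = 4;                              % pattern length
dig  = double(str) - double('A');      % letters -> digits 0..25
wts  = 26.^(num-1:-1:0);               % place values, most significant first
nwin = numel(str) - num + 1;           % number of overlapping blocks
idx  = zeros(nwin,1);
for k = 1:nwin
    idx(k) = dig(k:k+num-1) * wts.' + 1;   % AAAA -> 1, AAAB -> 2, AABA -> 27
end
tally = accumarray(idx, 1, [26^num, 1]);   % dense master vector, 26^num entries
```

The last line is the step that stops being feasible for num = 8 or 10, because it allocates one entry for every possible word rather than only for the words that occur.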

So, is there some way to do this efficiently when the strings and pattern lengths get big? Something with sparsity, or smart sorting, or indexing?

Thank you

Comments: 2

Jos (10584) on 22 Dec 2017

Basic questions first: do you really need all patterns? If so, why?

Mark on 22 Dec 2017

Edited: Mark on 22 Dec 2017

I don't need all patterns, mostly because the number of 10-letter sequences in a test string of 10^6 letters is just under 10^6, while the number of possible 10-letter patterns is 26^10. But there is no a priori list of possible patterns for a given test string, though one could quickly be generated from the string of course. I'm not sure if that answers your question? To be clear, I'm only interested in the patterns that appear in the test string, not the ones that don't. So in that sense, I don't need all patterns.


Answer 1

Stephen23 on 22 Dec 2017


https://kr.mathworks.com/matlabcentral/answers/374238-counting-pattern-frequency-in-a-string#answer_297455

Edited: Stephen23 on 22 Dec 2017


Hummm, interesting problem. I would approach this slightly differently, taking advantage of two features to try and achieve a reasonably efficient solution:

1. The requested substring length num is small and fixed.
2. MATLAB's built-in functions are highly optimized and quite efficient.

Given those points, I would simply loop over the string num times: the 1st iteration starts from position 1, the 2nd from position 2, etc. Within each iteration you can reshape the largest possible subset of the main string into a matrix with num columns and easily find the unique rows. This intermediate step slightly reduces memory requirements and builds up a "semi-unique" set of all substrings of the requested length. Also collect their indices and use histc to count the substring occurrences for that iteration. Then, after the loop, use unique again to get the final list of unique substrings, together with accumarray to add the histc output counts together.

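% Fake Data:
%str = char(randi([65,90],1,1e7));
str = 'ZXCVBNMZXCVBAS';
num = 4;           % substring length
tot = numel(str);
% Loop over <num> starting indices:
tmp = cell(1,num); % unique substrings found for each starting offset
idh = cell(1,num); % their counts for each starting offset
for k = 1:num
  vec = k-1:num:tot; % vec(end) is the last index that completes a whole block
  % Reshape into rows of <num> characters and find the unique rows:
  [tmp{k},~,idt] = unique(reshape(str(k:vec(end)),num,[]).','rows');
  idh{k} = histc(idt,unique(idt)); % save the count
end
% Count unique rows:
[out,~,ido] = unique(vertcat(tmp{:}),'rows');
cnt = accumarray(ido,vertcat(idh{:})); % sum the per-offset counts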

For that small sample string it returns:

>> out
out =
BNMZ
CVBA
CVBN
MZXC
NMZX
VBAS
VBNM
XCVB
ZXCV
>> cnt
cnt =
   1
   1
   1
   1
   1
   1
   1
   2
   2

which seems to be correct. For the longer string (1e7 elements) it took my laptop

Elapsed time is 9.031 seconds.

to run, and I would imagine it scales roughly linearly with num. How does that compare to the methods you are using now?
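One possible way to cross-check counts like those above on a small string is to build every block in a single pass and compare one of the counts against strfind, which also reports overlapping matches. This is only a sketch: it holds all blocks in memory at once, so it is meant for short test strings, and the names pos, blk, chk, idc and n are made up for the example.

```matlab
% Single-pass cross-check for short strings (not memory-friendly).
str  = 'ZXCVBNMZXCVBAS';
num  = 4;
nwin = numel(str) - num + 1;              % number of overlapping blocks
% Index matrix: row k holds the positions k, k+1, ..., k+num-1.
pos = bsxfun(@plus, (1:nwin).', 0:num-1);
blk = str(pos);                           % nwin-by-num char matrix of all blocks
[chk,~,idc] = unique(blk, 'rows');        % unique blocks, sorted
n = accumarray(idc, 1);                   % occurrences of each unique block
% Spot check the last block against strfind:
assert(n(end) == numel(strfind(str, chk(end,:))))
```

For the sample string, chk and n should agree row for row with out and cnt as listed above.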

Comments: 1

Mark on 22 Dec 2017

This is excellent, supremely fast. Running brute force for num=3 on something around a 10^7 length test string was taking upwards of 10 mins when I stopped the script, whereas num=5 or more takes 12 seconds on your version. Perfect! Thank you

