Regular expressions on uint8 or single byte characters
I have a 200 MB text file encoded in UTF-8. My maximum array size is around 350 MB, so I can safely read it in using fread(fid,'*uint8'), where fid comes from fopen. To use regular expressions, I need to convert this into a char array, which increases the array size by at least a factor of two (depending on encoding, but for my application I can ignore all fancy characters) and thus leads to an "out of memory" error.
I wrote some code that breaks up the original array, so that the matching of the regular expressions works on smaller chunks, but I am still wondering: Can I somehow run regular expressions on the uint8 array? Or is there a char-like variable type that only uses 1 byte per character?
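The factor of two mentioned in the question can be checked on a small array. This is only a minimal sketch; the variable names are made up for illustration, and the sample string stands in for the real file contents.

```matlab
% Compare the memory footprint of the same text stored as uint8 and as char.
bytes = uint8('plain ASCII text');   % 16 bytes, as fread(fid,'*uint8') would return them
chars = char(bytes);                 % same 16 characters as a char array

infoBytes = whos('bytes');           % struct with a .bytes field
infoChars = whos('chars');

ratio = infoChars.bytes / infoBytes.bytes;   % char elements take 2 bytes each
fprintf('uint8: %d bytes, char: %d bytes, ratio: %g\n', ...
        infoBytes.bytes, infoChars.bytes, ratio);
```

For a 200 MB file this means roughly 400 MB for the char copy alone, which is above the 350 MB limit quoted above.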
Comments: 5
I don't think that it is possible (this is not an answer, though). I would probably have gone for the block-based solution that you implemented, with a little extra logic to ensure that splits are made at places that cannot belong to the pattern (e.g. fseek to the base position, read 50 MB, scan back from the end of the block to the first character that cannot be part of the pattern, truncate the block at this place, and use the resulting negative offset to compute the next base position).
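A rough sketch of such a block-based approach is shown here. It makes several assumptions that are not in the thread: the pattern never spans a newline, so the last newline in each block is a safe split point; the file name, pattern and block size are hypothetical; and instead of seeking backwards with fseek, the unfinished tail of each block is carried over to the next one, which has the same effect.

```matlab
% Hypothetical demo file so that the sketch is self-contained.
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, 'id=12 foo\nid=345 bar\nid=6 baz\n');
fclose(fid);

pattern   = 'id=(\d+)';   % assumed pattern; it cannot contain a newline
blockSize = 16;           % bytes per read; tiny for the demo, e.g. 50e6 in practice
tokens    = {};           % collected regexp tokens
carry     = uint8([]);    % unfinished last line of the previous block

fid = fopen(fname, 'r');
while true
    raw = fread(fid, blockSize, '*uint8')';    % row vector of bytes
    if isempty(raw)
        break                                  % end of file reached
    end
    block = [carry, raw];
    cut = find(block == 10, 1, 'last');        % last newline (byte value 10) in the block
    if isempty(cut)
        carry = block;                         % no complete line yet, read more
        continue
    end
    carry = block(cut+1:end);                  % tail goes into the next block
    txt = char(block(1:cut));                  % only this chunk becomes char (ASCII assumed)
    tokens = [tokens, regexp(txt, pattern, 'tokens')]; %#ok<AGROW>
end
fclose(fid);

if ~isempty(carry)                             % file did not end with a newline
    tokens = [tokens, regexp(char(carry), pattern, 'tokens')];
end

ids = cellfun(@(t) str2double(t{1}), tokens);  % numeric values of the captured digits
delete(fname);
```

Only one block at a time exists as a char array, so the peak memory is set by the block size rather than by the file size. Converting with char treats every byte as one character, which is fine if non-ASCII characters can be ignored as stated in the question; otherwise native2unicode(bytes, 'UTF-8') would be the safer conversion.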
Walter Roberson
25 Aug 2013
How fancy are your regular expressions?
Martin Hoecker
26 Aug 2013
dpb
26 Aug 2013
Instead of 'uint8', try 'uchar'. I'm not sure it'll help, but it is at least a character class, not an integer.
Actually, it is simpler to ask what you are trying to match rather than what the pattern is (copy/paste a chunk of the file content or a string, and explain what you want to extract). With a little luck, we can do this using STRFIND (which works on uint8 arrays) or some numeric test on the uint8 values.
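As an illustration of the "numeric test" idea, the sketch below locates a fixed byte sequence and a single delimiter directly in a uint8 array, without ever creating a char copy. The sample data and variable names are made up. The comment above suggests that strfind(data, needle) would do the first part in one call; as far as I know, numeric input to strfind is not documented, so the sketch uses plain comparisons instead.

```matlab
% Stand-in for the bytes returned by fread(fid,'*uint8')'.
data   = uint8('alpha,beta;alpha,gamma');
needle = uint8('alpha');              % fixed byte sequence to search for

% Sliding comparison: position p is a hit if data(p:p+n-1) equals needle.
n    = numel(needle);
hits = true(1, numel(data) - n + 1);
for k = 1:n
    hits = hits & (data(k : end - n + k) == needle(k));
end
starts = find(hits);                  % start indices of every occurrence

% Single-byte tests need only one comparison.
commas = find(data == uint8(','));

disp(starts);
disp(commas);
```

The logical mask uses one byte per element, so it costs about as much memory as the data itself; for a file close to the memory limit the same loop can be applied block by block.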
Answers (0)