Regular Expression to detect spaces in a string

Deepak on 14 Oct 2013
Commented: Cedric on 14 Oct 2013
Hello all, I have a string, for example
string='<abcd/abcd/yxz/xyz/MOTOR50DEV/sdsds/Limit not yet decided abcd/abcd/yxz/xyz/MOTOR50DEV/sdsds/Limit not yet decided>'
I want to use regexp to get all the white spaces that occur between " and </a>. I have been trying to work out how to do this with regexp but have not yet found an elegant solution. For example, regexp(string,'(?<="\S*)\s') returns only 2 spaces and not all of them. Could someone help me out?
Thanks a lot
Comments: 2
Cedric on 14 Oct 2013
Edited: Cedric on 14 Oct 2013
What do you mean by "spaces"? Is it just white spaces or all characters? If you really meant white spaces, is it their position that you want? If you want characters, what is the purpose? REGEXP can parse the whole tag and extract whatever part you want.
Jan on 14 Oct 2013
There are two " characters in this string. Which one do you mean? Please post the wanted result by editing the question (not as comment or pseudo-answer).


Accepted Answer

Deepak on 14 Oct 2013
Hi Cedric, thanks for the really detailed answer. It really helped. I actually wanted to get the position of white spaces, so the second part of the answer addresses my query. I was hoping to get the white spaces with one regexp without using any other commands like isspace, but I guess that would be complicated. I am not really familiar with tokens. So once again, thanks for your detailed answer.
Comments: 1
Cedric on 14 Oct 2013
Edited: Cedric on 14 Oct 2013
Hi Deepak, The issue with counting spaces using regexp is that it's not possible to do it using a simple query. The call to regexp (possibly regexprep) that we would have to use would be much more complicated than doing the whole operation using one call to regexp with a simple pattern and a few additional operations.


More Answers (1)

Cedric on 14 Oct 2013
Edited: Cedric on 14 Oct 2013
Here is an example assuming that you want characters between " and </a> and not only white spaces:
>> s = regexp( html, '(?<=")[^"]+(?=</a>)', 'match' )
s =
'>Mathworks' '>Google'
Look-arounds are treacherous in this type of situation, where the expression in the look-behind can appear multiple times before the expression in the look-ahead is found. The following example illustrates it:
>> s = regexp( html, '(?<=").+?(?=</a>)', 'match' )
s =
'http://www.mathworks.com">Mathworks' 'http://www.google.com">Google'
where we see that the "smallest possible match" fails despite the lazy .+?. Let me know if you want to understand why, or see the example/discussion between Per and me here.
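As a brief sketch of why this happens: the engine tries start positions from left to right and keeps the first one from which a match exists, and a lazy quantifier only minimizes the length from that start. The html string below is an assumption made for illustration (two anchors, consistent with the outputs shown above), not necessarily the exact input used in this answer.

```matlab
% Hypothetical input, assumed for illustration only.
html = '<a href="http://www.mathworks.com">Mathworks</a> <a href="http://www.google.com">Google</a>' ;

% Lazy .+? : the first valid start is right after the opening quote of href.
% From there the engine extends as little as possible until </a> follows,
% so the match runs across the closing quote.
[mLazy, startLazy] = regexp( html, '(?<=").+?(?=</a>)', 'match', 'start' ) ;

% Negated class [^"]+ : the match cannot cross a ", so the start after the
% opening quote fails and the engine moves on to the closing quote.
[mClass, startClass] = regexp( html, '(?<=")[^"]+(?=</a>)', 'match', 'start' ) ;

disp( mLazy{1} )    % http://www.mathworks.com">Mathworks
disp( mClass{1} )   % >Mathworks
```

Comparing startLazy and startClass shows that the two patterns differ in where the match begins, not in how far it extends.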
Note that using tokens is generally more efficient than using look-arounds:
>> s = regexp( html, '"([^"]+)</a>', 'tokens' ) ;
>> celldisp(s)
s{1}{1} =
>Mathworks
s{2}{1} =
>Google
Back to the initial question, the pattern could be more specific if you wanted to extract the content of the anchor or the value of the href parameter, e.g.
>> s = regexp( html, '[^>]+(?=</a>)', 'match' )
s =
'Mathworks' 'Google'
Or
>> s = regexp( html, 'href.+?"([^"]*)', 'tokens' ) ;
>> celldisp(s)
s{1}{1} =
http://www.mathworks.com
s{2}{1} =
http://www.google.com
Or
>> s = regexp( html, 'href.+?"(?<href>[^"]*).*?>(?<content>.*?)</a>', 'names' )
s =
1x2 struct array with fields:
href
content
>> s(1)
ans =
href: 'http://www.mathworks.com'
content: 'Mathworks'
>> s(2)
ans =
href: 'http://www.google.com'
content: 'Google'
All these approaches can be fine-tuned/complex-ified for managing a broader set of cases, e.g. when there is a tag in the content of the anchor tag.
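One possible way to handle a tag inside the anchor content, sketched under assumptions: reuse the named-token pattern above, then strip any inner tags from the captured content with regexprep. The html string and the variable name plainContent are hypothetical, and the tag-stripping pattern is deliberately naive (it assumes no > inside attribute values).

```matlab
% Hypothetical input with a nested <b> tag, assumed for illustration only.
html = '<a href="http://www.mathworks.com"><b>Mathworks</b></a>' ;

% Same named-token pattern as above: the lazy content group stops at the
% first </a>, so the inner tags end up inside the captured content.
s = regexp( html, 'href.+?"(?<href>[^"]*).*?>(?<content>.*?)</a>', 'names' ) ;

% Remove anything that looks like a tag from the captured content.
plainContent = regexprep( s.content, '<[^>]*>', '' ) ;

disp( s.href )         % http://www.mathworks.com
disp( s.content )      % <b>Mathworks</b>
disp( plainContent )   % Mathworks
```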
EDIT: if you really want to get the position of white spaces, your expression does work but not as you thought. It actually matches
'"abcd/abcd/yxz/xyz/MOTOR50DEV/sdsds/Limit '
and
'">abcd/abcd/yxz/xyz/MOTOR50DEV/sdsds/Limit '
which both start with a " followed by non-whitespace characters and end with the white space after the t of Limit. One thing that you could do, if you wanted to keep the pattern simple, is to get the starting and ending positions of the relevant sub-string:
>> [mStart, mStop] = regexp( html, '(?<=")[^"]+(?=</a>)', 'start', 'end' )
mStart =
76
mStop =
132
and use them to mask a logical index of the white-space positions:
>> isSpace = html == ' ' ;
>> isSpace(1:mStart-1) = false ;
>> isSpace(mStop+1:end) = false ;
>> find( isSpace )
ans =
117 121 125
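A variant of the same idea, as a minimal sketch: extract the sub-string once, search it for white space with a second, trivial regexp call, and shift the result by the start offset. The html string here is a shortened, hypothetical stand-in for the original one, so the positions differ from those above.

```matlab
% Hypothetical, shortened input, assumed for illustration only.
html = '<a href="a/b/Limit not yet decided">a/b/Limit not yet decided</a>' ;

% With 'once', the match is a char vector and the start is a scalar.
[m, mStart] = regexp( html, '(?<=")[^"]+(?=</a>)', 'match', 'start', 'once' ) ;

% Positions of white space inside the match, shifted back to html indices.
spacePos = mStart - 1 + regexp( m, '\s' ) ;

disp( spacePos )   % 46 50 54
```

This keeps both patterns simple, at the cost of a second regexp call; the masking approach above gives the same positions.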
