Regular Expression to detect spaces in a string

Question

Deepak on 14 Oct 2013

https://kr.mathworks.com/matlabcentral/answers/90169-regular-expression-to-detect-spaces-in-a-string

Commented: Cedric on 14 Oct 2013

Hello all, I have a string, for example:

string='<a href="abcd/abcd/yxz/xyz/MOTOR50DEV/sdsds/Limit not yet decided">abcd/abcd/yxz/xyz/MOTOR50DEV/sdsds/Limit not yet decided</a>'

I want to use regexp to get all the white spaces that occur between " and </a>. I have been trying to figure out how to do this with regexp but have not yet found an elegant solution. For example, regexp(string,'(?<="\S*)\s') returns only 2 spaces and not all of them. Could someone help me out?

Thanks a lot

2 Comments

Cedric on 14 Oct 2013

Edited: Cedric on 14 Oct 2013

What do you mean by "spaces"? Is it just white spaces or all characters? If you really meant white spaces, is it their position that you want? If you want characters, what is the purpose? REGEXP can parse the whole tag and extract whatever part you want.

Jan on 14 Oct 2013

There are two " characters in this string. Which one do you mean? Please post the wanted result by editing the question (not as comment or pseudo-answer).

Answer 1

Deepak on 14 Oct 2013

https://kr.mathworks.com/matlabcentral/answers/90169-regular-expression-to-detect-spaces-in-a-string#answer_99684

Hi Cedric, thanks for the really detailed answer. It really helped. I actually wanted to get the position of white spaces, so the second part of the answer really addresses my query. I was hoping to get the white spaces with one regexp without using any other commands like isspace, but I guess that would be complicated. I am not really familiar with tokens. So once again, thanks for your detailed answer.

1 Comment

Cedric on 14 Oct 2013

Edited: Cedric on 14 Oct 2013

Hi Deepak, the issue with counting spaces using regexp alone is that it is not possible with a simple pattern. The call to regexp (or possibly regexprep) that we would need would be much more complicated than doing the whole operation with one call to regexp using a simple pattern, plus a few additional operations.

Answer 2

Cedric on 14 Oct 2013

https://kr.mathworks.com/matlabcentral/answers/90169-regular-expression-to-detect-spaces-in-a-string#answer_99630

Edited: Cedric on 14 Oct 2013

Here is an example assuming that you want characters between " and </a> and not only white spaces:

 >> s = regexp( html, '(?<=")[^"]+(?=</a>)', 'match' )
 s = 
    '>Mathworks'    '>Google'
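
The definition of html is not shown in this answer. A minimal sketch of the call above, assuming html holds two anchor tags like the ones implied by the outputs (the exact string is an assumption, not the original input):

```matlab
% Assumed input: the original definition of html is not shown in the thread.
html = ['<a href="http://www.mathworks.com">Mathworks</a> ', ...
        '<a href="http://www.google.com">Google</a>'];

% Match characters that come right after a double quote, contain no
% further double quote, and are immediately followed by </a>.
s = regexp(html, '(?<=")[^"]+(?=</a>)', 'match');

disp(s)   % the two matches are '>Mathworks' and '>Google'
```

The leading > is part of each match because the closing quote of the href value is the last " before </a>, and the > sits between that quote and the content.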

Look-arounds are treacherous in this type of situation, where the expression in the look-behind can appear multiple times before the expression in the look-ahead is found. The following example illustrates it:

 >> s = regexp( html, '(?<=").+?(?=</a>)', 'match' )
 s = 
    'http://www.mathworks.com">Mathworks'    'http://www.google.com">Google'

where we see that the "smallest possible match" fails despite the lazy .+?. Let me know if you want to understand why, or see the example/discussion between Per and me here.
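
One way to see why: a regex engine reports the leftmost position at which the pattern can match, and laziness only limits how far the match extends to the right from that position. The sketch below compares the start positions of the lazy pattern and the quote-excluding pattern. The html string is an assumed input, since its definition is not shown in the thread.

```matlab
% Assumed input (not shown in the thread): two anchor tags.
html = ['<a href="http://www.mathworks.com">Mathworks</a> ', ...
        '<a href="http://www.google.com">Google</a>'];

% Lazy quantifier: the match still starts right after the FIRST quote,
% because that is the leftmost position where the look-behind holds.
[lazyMatch, lazyStart] = regexp(html, '(?<=").+?(?=</a>)', 'match', 'start');

% Excluding quotes from the match rules out every start position that has
% another quote between it and </a>, so the match starts after the LAST one.
[exclMatch, exclStart] = regexp(html, '(?<=")[^"]+(?=</a>)', 'match', 'start');

fprintf('lazy:     starts at %d, matches %s\n', lazyStart(1), lazyMatch{1});
fprintf('excluded: starts at %d, matches %s\n', exclStart(1), exclMatch{1});
```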

Note that using tokens is generally more efficient than using look-arounds:

 >> s = regexp( html, '"([^"]+)</a>', 'tokens' ) ;
 >> celldisp(s)
 s{1}{1} =
         >Mathworks
 s{2}{1} =
         >Google
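
With the 'tokens' option, regexp returns one cell per match, each holding a cell of tokens. A small sketch of flattening that nesting when the pattern has a single token, again assuming the same html input as the outputs suggest:

```matlab
% Assumed input (not shown in the thread): two anchor tags.
html = ['<a href="http://www.mathworks.com">Mathworks</a> ', ...
        '<a href="http://www.google.com">Google</a>'];

% One token per match: s is a 1x2 cell, each element a 1x1 cell.
s = regexp(html, '"([^"]+)</a>', 'tokens');

% Concatenate the inner cells into a flat 1x2 cell of character vectors.
flat = [s{:}];

disp(flat)   % holds '>Mathworks' and '>Google'
```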

Back to the initial question: the pattern could be more specific if you wanted to extract the content of the anchor or the value of the href attribute, e.g.

 >> s = regexp( html, '[^>]+(?=</a>)', 'match' )
 s = 
    'Mathworks'    'Google'

Or

 >> s = regexp( html, 'href.+?"([^"]*)', 'tokens' ) ;
 >> celldisp(s)
   s{1}{1} =
           http://www.mathworks.com
   s{2}{1} =
           http://www.google.com

Or

 >> s = regexp( html, 'href.+?"(?<href>[^"]*).*?>(?<content>.*?)</a>', 'names' )
 s = 
    1x2 struct array with fields:
      href
      content
 >> s(1)
 ans = 
       href: 'http://www.mathworks.com'
    content: 'Mathworks'
 >> s(2)
 ans = 
       href: 'http://www.google.com'
    content: 'Google'

All these approaches can be fine-tuned/complex-ified for managing a broader set of cases, e.g. when there is a tag in the content of the anchor tag.
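
As an illustration of the nested-tag case, here is a minimal sketch that reuses the named-token pattern and then strips inner tags from the captured content with regexprep. The input html2 and the variable plainContent are hypothetical, and the tag-stripping pattern is only one possible choice that assumes no > appears inside attribute values.

```matlab
% Hypothetical input: an anchor whose content contains a nested <b> tag.
html2 = '<a href="http://www.mathworks.com"><b>Math</b>works</a>';

% Same named-token pattern as above; content is captured lazily up to </a>.
s = regexp(html2, 'href.+?"(?<href>[^"]*).*?>(?<content>.*?)</a>', 'names');

% Remove anything that looks like a tag from the captured content.
plainContent = regexprep(s.content, '<[^>]+>', '');

fprintf('href:    %s\ncontent: %s\nplain:   %s\n', s.href, s.content, plainContent);
```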

EDIT: if you really want to get the position of white spaces, your expression does work, but not as you thought. Including its look-behind, it actually covers

'"abcd/abcd/yxz/xyz/MOTOR50DEV/sdsds/Limit '

and

'">abcd/abcd/yxz/xyz/MOTOR50DEV/sdsds/Limit '

which both start with a " followed by non-whitespace characters up to the t of Limit, so only the white space right after each Limit is returned. One thing that you could do, if you wanted to keep the pattern simple, is to get the start and end positions of the relevant substring:

 >> [mStart, mStop] = regexp( html, '(?<=")[^"]+(?=</a>)', 'start', 'end' )
 mStart =
    76
 mStop =
   132

and use them to mask a logical index of position of white spaces:

 >> isSpace = html == ' ' ;
 >> isSpace(1:mStart-1)  = false ;
 >> isSpace(mStop+1:end) = false ;
 >> find( isSpace )
 ans =
   117   121   125
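
The same steps as a self-contained sketch. The string str below is an assumed reconstruction of the string in the question (the markup around the two paths is a guess), so its positions differ from the ones printed above. The variable names target, posOriginal and posMasked are illustrative.

```matlab
% Assumed reconstruction of the question's string; the markup is a guess.
target = 'abcd/abcd/yxz/xyz/MOTOR50DEV/sdsds/Limit not yet decided';
str    = ['<a href="', target, '">', target, '</a>'];

% Pattern from the question: only a white space directly preceded by a
% quote plus a run of non-white-space characters qualifies.
posOriginal = regexp(str, '(?<="\S*)\s');        % [50 108]

% Simple pattern to locate the substring between the last " and </a>.
[mStart, mStop] = regexp(str, '(?<=")[^"]+(?=</a>)', 'start', 'end');

% Logical index of white spaces, masked outside the located substring.
isSpace = isspace(str);
isSpace(1:mStart-1)  = false;
isSpace(mStop+1:end) = false;
posMasked = find(isSpace);                       % [108 112 116]
```

Using isspace instead of a comparison with ' ' also flags tabs and other white-space characters, which may or may not be wanted.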
