Correction of misspelled words in data source

Question

Sandeep Kapour, 14 Apr 2021

https://kr.mathworks.com/matlabcentral/answers/802411-correction-of-misspelled-words-in-data-source

Commented: Walter Roberson, 15 Apr 2021

Hello,

I want to extract some data and am using the "extractAfter" function, for example extractAfter(data, 'Signal1'), which works very well. However, my measurement data has a problem:

Signal1: 5, Signal2: 6

Signal1: 6, Signal2: 5

Sinal1: 8, Signal2: 5

Signal1: 10, Sigal2: 3

The problem is that Sinal1 and Sigal2 are not spelled correctly. Is it possible to change Sinal1 to Signal1 and Sigal2 to Signal2 automatically? My data set is very large. I am using MATLAB R2019b.

Answer 1

Cris LaPierre, 14 Apr 2021

If you are using extractAfter, your data must be text. If so, have you tried using the replace function?

data = ["Signal1: 5, Signal2: 6";"Signal1: 6, Signal2: 5";"Sinal1: 8, Signal2: 5";"Signal1: 10, Sigal2: 3"]
data = 4×1 string array
    "Signal1: 5, Signal2: 6"
    "Signal1: 6, Signal2: 5"
    "Sinal1: 8, Signal2: 5"
    "Signal1: 10, Sigal2: 3"
replace(data,["Sinal1","Sigal2"],["Signal1","Signal2"])
ans = 4×1 string array
    "Signal1: 5, Signal2: 6"
    "Signal1: 6, Signal2: 5"
    "Signal1: 8, Signal2: 5"
    "Signal1: 10, Signal2: 3"
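
Once the labels are repaired, the values can be pulled out with the same family of text functions the question already uses. The following is only a minimal sketch under the assumption that every row has exactly the "Signal1: x, Signal2: y" layout shown above; the variable names fixed, s1 and s2 are illustrative and not from the original post.

```matlab
% Sample data copied from the question (two labels are misspelled)
data = ["Signal1: 5, Signal2: 6"; "Signal1: 6, Signal2: 5"; ...
        "Sinal1: 8, Signal2: 5";  "Signal1: 10, Sigal2: 3"];

% Repair the two known misspellings first
fixed = replace(data, ["Sinal1", "Sigal2"], ["Signal1", "Signal2"]);

% Assumption: each row holds exactly one "Signal1: <value>," and one "Signal2: <value>"
s1 = str2double(extractBetween(fixed, "Signal1: ", ","));  % text between the label and the comma
s2 = str2double(extractAfter(fixed, "Signal2: "));         % text after the second label
```

This only works when the list of misspellings is known in advance, since replace matches literal text.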

Cris LaPierre, 14 Apr 2021

Edited: Cris LaPierre, 14 Apr 2021

I'd suggest replaceBetween with pattern matching, but unfortunately pattern was introduced in R2020b. Since you are on R2019b, try using regexprep instead.

data = ["Signal1: 5, Signal2: 6";"Signal1: 6, Signal2: 5";"Sinal1: 8, Signal2: 5";"Signal1: 10, Sigal2: 3"];
newStr = regexprep(data,'\<S\w*(?=1:|2:)',"Signal")
newStr = 4×1 string array
    "Signal1: 5, Signal2: 6"
    "Signal1: 6, Signal2: 5"
    "Signal1: 8, Signal2: 5"
    "Signal1: 10, Signal2: 3"
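
With a very large data set it may help to see which label spellings actually occur before replacing anything. The sketch below is one possible way to do that, assuming every label is a run of letters followed by digits and a colon; the pattern and the variable name labels are illustrative assumptions, not part of the original answer.

```matlab
data = ["Signal1: 5, Signal2: 6"; "Signal1: 6, Signal2: 5"; ...
        "Sinal1: 8, Signal2: 5";  "Signal1: 10, Sigal2: 3"];

% Collect the letters that precede "<digits>:" in every row (one cell per row)
labels = regexp(data, '\<[A-Za-z]+(?=\d+:)', 'match');

% Flatten the per-row results and keep each distinct spelling once
labels = unique(string([labels{:}]));
```

For the sample data this lists "Sigal", "Signal" and "Sinal", which shows at a glance what the replacement has to cover.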

Walter Roberson, 15 Apr 2021

Variants:

data = ["Signal1: 5, Signal2: 6";"Signal1: 6, Signal2: 5";"Sinal1: 8, Signal2: 5";"Signal1: 10, Sigal2: 3"];
newStr = regexprep(data,'\<S\w+(\d+):','Signal$1:')
newStr = 4×1 string array
    "Signal1: 5, Signal2: 6"
    "Signal1: 6, Signal2: 5"
    "Signal1: 8, Signal2: 5"
    "Signal1: 10, Signal2: 3"
newStr = regexprep(data,'\<S\w+(?=\d+:)','Signal')
newStr = 4×1 string array
    "Signal1: 5, Signal2: 6"
    "Signal1: 6, Signal2: 5"
    "Signal1: 8, Signal2: 5"
    "Signal1: 10, Signal2: 3"
newStr = regexprep(data, '\<S\D+', 'Signal')
newStr = 4×1 string array
    "Signal1: 5, Signal2: 6"
    "Signal1: 6, Signal2: 5"
    "Signal1: 8, Signal2: 5"
    "Signal1: 10, Signal2: 3"

The first variant explicitly captures the sequence of digits and re-inserts it after 'Signal' through the $1 token. The characters up to that point must be "word-building" characters, which are letters, digits, and underscore. For example, 'S1gn_l2' would match, but 'S1gn-l1' would not, because '-' is not word-building.

The second variant stops the match where it finds digits followed by a colon, and replaces everything up to there. It differs from Cris's suggestion in that it handles any sequence of digits, not just '1' or '2'. Again, the characters matched must be word-building.

The third variant matches any run of non-digits after the S, stopping at the first digit. For example, 'S1gn_l2' would not match at all, because \D+ needs at least one non-digit directly after the S, whereas 'S!gn-l2' would be happily matched. But 'Signal: 5', with the digit missing before the colon, would be replaced with 'Signal5'. And if the input were one continuous character vector instead of a cell array of character vectors or a string array, \D+ would happily cross line boundaries to find the digits it expects. For example, 'Signal?: Nan\nSignal1: 5' would become 'Signal1: 5', swallowing the whole first line, because as far as \D+ is concerned a newline is a valid non-digit character. Still, as you can see, the code is shorter, and sometimes the variations to be matched are well-defined enough that you can get away with it.
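
The edge cases described above can be tried directly. This is a small sketch with made-up test inputs; the variable names are illustrative, and the line-splitting step at the end is one possible workaround assumed here, not something from the original comment.

```matlab
% Variant 1: digits and underscores inside the label are fine, '-' is not
v1_ok = regexprep('S1gn_l2: 4', '\<S\w+(\d+):', 'Signal$1:');   % 'Signal2: 4'
v1_no = regexprep('S1gn-l1: 4', '\<S\w+(\d+):', 'Signal$1:');   % left unchanged

% Variant 3: a label with no digit before the colon gets glued to the value
v3_nodigit = regexprep('Signal: 5', '\<S\D+', 'Signal');        % 'Signal5'

% Variant 3 on one char vector holding two lines (sprintf expands \n)
txt = sprintf('Signal?: NaN\nSignal1: 5');
v3_multi = regexprep(txt, '\<S\D+', 'Signal');                  % 'Signal1: 5'

% Possible workaround: split into rows first, then apply variant 2 per row
rows = splitlines(string(txt));
v2_rows = regexprep(rows, '\<S\w+(?=\d+:)', 'Signal');          % both rows preserved
```

Splitting first keeps each record separate, so a malformed row is left untouched instead of being merged into the next one.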

Answer 2

Walter Roberson, 14 Apr 2021

If you are dealing with a text file, I would suggest rewriting in terms of regexp() with named tokens:

S = sprintf('Signal1: 5, Signal2: 6.2\nSignal1: 6e-3, Signal2: 5\nSinal1: 8, Signal2: 5\nSignal1: 10, Sigal2: 3')
S = 
    'Signal1: 5, Signal2: 6.2
     Signal1: 6e-3, Signal2: 5
     Sinal1: 8, Signal2: 5
     Signal1: 10, Sigal2: 3'
parts = regexp(S, 'Sig?n?al1: (?<s1>[\d.eE+-]+), Sig?n?al2: (?<s2>[\d.eE+-]+)', 'names')
parts = 1×4 struct array with fields:
    s1
    s2
s1 = str2double({parts.s1})
s1 = 1×4
    5.0000    0.0060    8.0000   10.0000
s2 = str2double({parts.s2})
s2 = 1×4
    6.2000    5.0000    5.0000    3.0000

Your example only shows integer values. If that is all that is permitted, then change the

[\d.eE+-]+

to

\d+

The version I coded permits positive and negative values and decimals and exponentiation using either 'e' or 'E' ... but does not permit complex numbers.
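
For the integer-only case, the substitution described above might look like the sketch below. It reuses the question's sample values as integers; collecting the two fields into one matrix, and the variable name values, are illustrative choices rather than part of the original answer.

```matlab
% Integer-only sample text, one record per line (sprintf expands \n)
S = sprintf('Signal1: 5, Signal2: 6\nSignal1: 6, Signal2: 5\nSinal1: 8, Signal2: 5\nSignal1: 10, Sigal2: 3');

% Same named-token expression, with \d+ in place of [\d.eE+-]+
parts = regexp(S, 'Sig?n?al1: (?<s1>\d+), Sig?n?al2: (?<s2>\d+)', 'names');

% One row per record: column 1 is Signal1, column 2 is Signal2
values = [str2double({parts.s1}); str2double({parts.s2})].';
```

Note that \d+ would silently skip any record containing a sign, decimal point, or exponent, so it only fits data that is known to be non-negative integers.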
