Correction of misspelled words in data source

Sandeep Kapour, 14 Apr 2021
Commented: Walter Roberson on 15 Apr 2021
Hello,
I want to extract some data and am using the extractAfter function, for example extractAfter(data, 'Signal1'), which works very well. However, my measurement data has some problems:
Signal1: 5, Signal2: 6
Signal1: 6, Signal2: 5
Sinal1: 8, Signal2: 5
Signal1: 10, Sigal2: 3
The problem is that Sinal1 and Sigal2 are misspelled. Is it possible to change Sinal1 to Signal1 and Sigal2 to Signal2 automatically? My data set is very large. I am using MATLAB R2019b.

Answers (2)

Cris LaPierre, 14 Apr 2021
If you are using extractAfter, your data must be text. If so, have you tried using the replace function?
data = ["Signal1: 5, Signal2: 6";"Signal1: 6, Signal2: 5";"Sinal1: 8, Signal2: 5";"Signal1: 10, Sigal2: 3"]
data = 4×1 string array
    "Signal1: 5, Signal2: 6"
    "Signal1: 6, Signal2: 5"
    "Sinal1: 8, Signal2: 5"
    "Signal1: 10, Sigal2: 3"
replace(data,["Sinal1","Sigal2"],["Signal1","Signal2"])
ans = 4×1 string array
    "Signal1: 5, Signal2: 6"
    "Signal1: 6, Signal2: 5"
    "Signal1: 8, Signal2: 5"
    "Signal1: 10, Signal2: 3"
  3 comments
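To connect the replace answer back to the original extractAfter use case, here is a minimal sketch that first repairs the two known misspellings and then pulls out the numeric values. The variable names fixed, s1 and s2 are illustrative only, and the sketch assumes every row has exactly the "Signal1: <value>, Signal2: <value>" layout shown in the question.

```matlab
% Sample rows copied from the question.
data = ["Signal1: 5, Signal2: 6"; ...
        "Signal1: 6, Signal2: 5"; ...
        "Sinal1: 8, Signal2: 5"; ...
        "Signal1: 10, Sigal2: 3"];

% Repair the two known misspellings in one call.
fixed = replace(data, ["Sinal1", "Sigal2"], ["Signal1", "Signal2"]);

% Extract the text after each label and convert it to numbers.
% Assumption: each row contains each label exactly once.
s1 = str2double(extractBetween(fixed, "Signal1: ", ","));
s2 = str2double(extractAfter(fixed, "Signal2: "));

disp([s1 s2])
```

This only works when the full set of misspellings is known in advance; for open-ended typos, a pattern-based approach such as regexprep is probably the better fit.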
Cris LaPierre, 14 Apr 2021 (edited 14 Apr 2021)
I'd suggest replaceBetween with pattern matching, but unfortunately pattern was introduced in R2020b. Since you are on R2019b, try using regexprep instead.
data = ["Signal1: 5, Signal2: 6";"Signal1: 6, Signal2: 5";"Sinal1: 8, Signal2: 5";"Signal1: 10, Sigal2: 3"];
newStr = regexprep(data,'\<S\w*(?=1:|2:)',"Signal")
newStr = 4×1 string array
    "Signal1: 5, Signal2: 6"
    "Signal1: 6, Signal2: 5"
    "Signal1: 8, Signal2: 5"
    "Signal1: 10, Signal2: 3"
Walter Roberson, 15 Apr 2021
Variants:
data = ["Signal1: 5, Signal2: 6";"Signal1: 6, Signal2: 5";"Sinal1: 8, Signal2: 5";"Signal1: 10, Sigal2: 3"];
newStr = regexprep(data,'\<S\w+(\d+):','Signal$1:')
newStr = 4×1 string array
    "Signal1: 5, Signal2: 6"
    "Signal1: 6, Signal2: 5"
    "Signal1: 8, Signal2: 5"
    "Signal1: 10, Signal2: 3"
newStr = regexprep(data,'\<S\w+(?=\d+:)','Signal')
newStr = 4×1 string array
    "Signal1: 5, Signal2: 6"
    "Signal1: 6, Signal2: 5"
    "Signal1: 8, Signal2: 5"
    "Signal1: 10, Signal2: 3"
newStr = regexprep(data, '\<S\D+', 'Signal')
newStr = 4×1 string array
    "Signal1: 5, Signal2: 6"
    "Signal1: 6, Signal2: 5"
    "Signal1: 8, Signal2: 5"
    "Signal1: 10, Signal2: 3"
The first variant explicitly captures the digits before the colon and puts them back after 'Signal'. The characters up to that point must be "word-building" characters, which are letters, digits, and underscore. For example, 'S1gn_l2' would match, but 'S1gn-l1' would not, because '-' is not word-building. Because \w also matches digits and \w+ is greedy, only the last digit before the colon ends up in the captured group, which is fine for single-digit names such as these.
The second variant stops the search when it finds digits followed by a colon, and replaces everything up to there. It differs from Cris's suggestion in that it accepts any digit before the colon, not just '1' or '2'. Again, the characters matched must be word-building.
The third variant matches any run of non-digits after the S, stopping at the first digit. For example, 'S1gn_l2' would not be matched at all, because \D+ needs at least one non-digit immediately after the S, whereas 'S!gn-l2' would be happily matched. But 'Signal: 5', with the digit missing before the colon, would be replaced with 'Signal5'. And if the input were one continuous character vector instead of a cell array of character vectors or a string array, then \D+ would happily cross line boundaries to find the digits it expects. For example, 'Signal?: Nan\nSignal1: 5' would be replaced by 'Signal1: 5', swallowing the first line, because as far as \D+ is concerned, newline is a valid non-digit character. As you can see, though, the code is shorter, and sometimes the variations to be matched are well-defined and you can get away with it.
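The edge cases described above can be checked with a short sketch. The two-digit name 'Sinal12' is a hypothetical input that does not appear in the original data, and the lazy quantifier \w+? is one possible adjustment, not something proposed in the thread. The sketch contrasts the greedy and lazy forms of the capture-group pattern, then shows \D+ running across a newline in a single character vector.

```matlab
% Hypothetical input with a two-digit signal name (not in the original data).
wide = "Sinal12: 8, Signal2: 5";

% Greedy \w+ also consumes digits, so only the last digit before the
% colon is left for the capture group.
greedy = regexprep(wide, '\<S\w+(\d+):', 'Signal$1:');

% Lazy \w+? takes as few characters as possible, leaving the whole
% digit run for the capture group.
lazy = regexprep(wide, '\<S\w+?(\d+):', 'Signal$1:');

% \D+ treats newline as just another non-digit, so on one continuous
% character vector the match runs from the first S up to the first digit.
txt = sprintf('Signal?: NaN\nSignal1: 5');
crossed = regexprep(txt, '\<S\D+', 'Signal');

disp(greedy)
disp(lazy)
disp(crossed)
```

With these inputs, the greedy form should give "Signal2: 8, Signal2: 5", the lazy form "Signal12: 8, Signal2: 5", and the \D+ example 'Signal1: 5' with the first line swallowed. If signal numbers can exceed 9, the lazy form (or the \D+ variant applied row by row) looks like the safer choice.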



Walter Roberson, 14 Apr 2021
If you are dealing with a text file, I would suggest rewriting in terms of regexp() with named tokens:
S = sprintf('Signal1: 5, Signal2: 6.2\nSignal1: 6e-3, Signal2: 5\nSinal1: 8, Signal2: 5\nSignal1: 10, Sigal2: 3')
S =
    'Signal1: 5, Signal2: 6.2
     Signal1: 6e-3, Signal2: 5
     Sinal1: 8, Signal2: 5
     Signal1: 10, Sigal2: 3'
parts = regexp(S, 'Sig?n?al1: (?<s1>[\d.eE+-]+), Sig?n?al2: (?<s2>[\d.eE+-]+)', 'names')
parts = 1×4 struct array with fields:
    s1
    s2
s1 = str2double({parts.s1})
s1 = 1×4
5.0000 0.0060 8.0000 10.0000
s2 = str2double({parts.s2})
s2 = 1×4
6.2000 5.0000 5.0000 3.0000
Your example only shows integer values. If that is all that is permitted, then change the
[\d.eE+-]+
to
\d+
The version I coded permits positive and negative values and decimals and exponentiation using either 'e' or 'E' ... but does not permit complex numbers.
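As a sketch of the integer-only variant described above, the following uses \d+ in both named tokens and collects the results in a table. The table variable names and the line-count check are additions for illustration, not part of the original answer; the check assumes one record per line.

```matlab
% Integer-only sample text, one record per line.
S = sprintf(['Signal1: 5, Signal2: 6\nSignal1: 6, Signal2: 5\n' ...
             'Sinal1: 8, Signal2: 5\nSignal1: 10, Sigal2: 3']);

% Sig?n?al tolerates a missing 'g' and/or 'n'; \d+ accepts unsigned integers only.
parts = regexp(S, 'Sig?n?al1: (?<s1>\d+), Sig?n?al2: (?<s2>\d+)', 'names');

% Collect both signals as columns of a table (column names are illustrative).
T = table(str2double({parts.s1}).', str2double({parts.s2}).', ...
          'VariableNames', {'Signal1', 'Signal2'});

% Sanity check: flag records the pattern failed to match.
nLines = numel(splitlines(S));
if numel(parts) ~= nLines
    warning('%d of %d lines did not match the pattern.', ...
            nLines - numel(parts), nLines);
end

disp(T)
```

Comparing the number of matches with the number of lines is a cheap way to notice misspellings that the tolerant pattern does not cover, since regexp skips non-matching lines without reporting them.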
