Using regexp to create dataset

Question

Sebastiano delre on 21 May 2016


https://kr.mathworks.com/matlabcentral/answers/285239-using-regexp-to-create-dataset

Commented: Sebastiano delre on 23 May 2016

Accepted Answer: Stephen23

I have imported a large database using textscan(). Now I have data with 12 variables. Each observation looks like this:

5,573346285,746540138,NA,1341119065,NA,7,0,2,1341111281,"-1,-1,-1,0,-1",-0.8

These are cell data and I would like to convert them to the dataset type. My problem is that the 11th variable is a string that may contain several numbers separated by commas, so I cannot use something like regexp(datacell{1,1}{6,1}, ',\s*', 'split'), because it would also split the 11th variable into many separate parts. Can you please suggest code that can do this? Thank you.
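To make the problem concrete, here is a minimal sketch using the sample line above (stored in a hypothetical variable named line_txt). Splitting on every comma returns 16 pieces instead of the expected 12, because the quoted field is broken up as well:

```matlab
% Sample observation from the question, stored in a hypothetical variable.
line_txt = '5,573346285,746540138,NA,1341119065,NA,7,0,2,1341111281,"-1,-1,-1,0,-1",-0.8';
% Naive approach: split on every comma, ignoring the double quotes.
parts = regexp(line_txt, ',\s*', 'split');
n = numel(parts);    % 16 pieces, although the line has only 12 fields
disp(parts(11:15))   % the quoted field has been broken into five pieces
```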

Comments: 2

Stephen23 on 21 May 2016

Edited: Stephen23 on 21 May 2016

@Sebastiano delre: is the number of values within the double quotes always the same? In your example there are five numbers: are there always five?

Sebastiano delre on 21 May 2016

No, actually that can vary. This is exactly what creates my problem...


Answer 1

Stephen23 on 21 May 2016

https://kr.mathworks.com/matlabcentral/answers/285239-using-regexp-to-create-dataset#answer_222954

Edited: Stephen23 on 21 May 2016

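% Ten numeric fields, one double-quoted field (%q), then one final numeric field.
fmt = [repmat('%f',1,10),'%q','%f'];
% Treat NA as missing (NaN) and merge adjacent numeric columns into one matrix.
opt = {'CollectOutput',true, 'Delimiter',',', 'TreatAsEmpty','NA'};
fid = fopen('test.txt','rt');
C = textscan(fid,fmt,opt{:});
fclose(fid);
% Convert each quoted string, e.g. '-1,0,-1', into a numeric column vector.
D = cellfun(@(s)sscanf(s,'%f,'),C{2},'UniformOutput',false);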

This returns all of the numeric values in C{1} and C{3}:

>> C{1}
ans =
    5    573346285    746540138    NaN    1341119065    NaN    7    0    2    1341111281
    6    573346286    746540139    NaN    1341119066    NaN    8    1    3    1341111282
    7    573346287    746540140    NaN    1341119067    NaN    9    0    4    1341111283

and those quoted strings are in C{2}:

>> C{2}
ans = 
    '-1,-1,-1,0,-1'
    '-1,0,-1'
    '-1'

The quoted strings are simply converted to numeric using sscanf (no regexp is required):

>> D{:}
ans =
    -1
    -1
    -1
     0
    -1
ans =
    -1
     0
    -1
ans =
    -1

The sample file that I used is attached as test.txt (I had to create my own, as you did not provide a sample file to work with).
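Since the original goal was a dataset-like container, one possible next step is to combine the pieces into a table. This is not part of the code above and assumes a MATLAB release that provides table. The values and the variable names Numeric, Quoted and Last below are hypothetical stand-ins for C{1}, D and C{3}:

```matlab
% Stand-in values shaped like C{1}, D and C{3} (three rows).
num  = [5 573346285; 6 573346286; 7 573346287];   % first two numeric columns only, for brevity
D    = {[-1;-1;-1;0;-1]; [-1;0;-1]; -1};          % converted quoted strings
last = [-0.8; -0.5; -1];                          % hypothetical values for the final column
% A cell column can hold a different number of values in each row.
T = table(num, D, last, 'VariableNames', {'Numeric','Quoted','Last'});
v = T.Quoted{2};   % numeric vector stored in the second row
```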

Comments: 6 (the 4 older comments are not shown)

Stephen23 on 23 May 2016

Edited: Stephen23 on 23 May 2016

@Sebastiano delre: the error that you are getting, "Error using textscan. Invalid file identifier. Use fopen to generate a valid file identifier.", has nothing to do with my algorithm at all.

That error occurs when MATLAB cannot open the file that you requested, most likely because you are passing a wrong filepath to fopen.

This commonly occurs when beginners:

try to access a file in some folder that is not the current directory, but pass only the filename without the filepath.
define a filepath to a file that does not exist.
spell the filename incorrectly.

The solution is (almost always) to pass the correct filepath. You should make this change and tell me what the error message msg is:

[fid,msg] = fopen('test.txt','rt'); % for your filename and filepath!
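A minimal sketch of that check is shown below. It deliberately uses a file name that does not exist (generated with tempname, purely for illustration), so that fopen fails and returns a message:

```matlab
% Hypothetical path to a file that does not exist.
fname = [tempname, '.txt'];
[fid, msg] = fopen(fname, 'rt');
if fid == -1
    % fopen returns -1 on failure, and msg explains why.
    fprintf('Could not open "%s": %s\n', fname, msg);
else
    fclose(fid);
end
```

For a real file, building the full path with fullfile (folder name plus file name) avoids depending on the current directory.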

You will also find hundreds of threads on this forum that explain this exact error message, if you want to read more information about it.

However, you have also changed the file format from what you described in your question, which causes my code to fail: your question did not mention that the file has a header! You can fix this by adding 'HeaderLines',1 to the textscan options.
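As a rough sketch of the 'HeaderLines' fix, the example below writes a small temporary file with one header line and then reads it back. The column names and values are hypothetical, and only four columns are used to keep it short:

```matlab
% Create a small sample file with a header line (hypothetical contents).
fname = [tempname, '.csv'];
fid = fopen(fname, 'wt');
fprintf(fid, 'id,score,votes,sentiment\n');
fprintf(fid, '1,NA,"-1,0,-1",-0.8\n');
fprintf(fid, '2,5,"-1",0.5\n');
fclose(fid);

% Same options as before, plus 'HeaderLines',1 to skip the first line.
fmt = '%f%f%q%f';
opt = {'CollectOutput',true, 'Delimiter',',', 'TreatAsEmpty','NA', 'HeaderLines',1};
[fid, msg] = fopen(fname, 'rt');
assert(fid ~= -1, msg);
C = textscan(fid, fmt, opt{:});
fclose(fid);
delete(fname);   % remove the temporary file again
```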

Or you could try this code, which generates a structure using those header names, which lets you access the data using the fieldnames:

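fmt = [repmat('%f',1,10),'%q','%f'];
opt = {'CollectOutput',true, 'Delimiter',',', 'TreatAsEmpty','NA'};
[fid,msg] = fopen('test.txt','rt');
assert(fid>=3,msg) % stop with fopen's own message if the file could not be opened
H = regexp(fgetl(fid),'[^,"]+','match'); % header names from the first line
C = textscan(fid,fmt,opt{:}); % the remaining lines hold the data
fclose(fid);
M = strrep(H,'.',''); % remove periods so that the names are valid fieldnames
C{1} = num2cell(C{1});
C{3} = num2cell(C{3});
C{2} = cellfun(@(s)sscanf(s,'%f,'),C{2},'UniformOutput',false); % quoted strings -> numeric vectors
M(2,:) = num2cell(horzcat(C{:}),1); % one column of values underneath each name
S = struct(M{:}); % structure array with one element per data row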

and access the data like this:

>> S(9).Sentiment
ans =
   -1.3333
>> S(2).Sentiment
ans =
   -0.5000
>> S(2).startingtime
ans =
   1.3411e+09
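The struct(M{:}) step may look opaque, so here is a small sketch of the same idea with hand-typed data. The field names and values are hypothetical and only two fields are used:

```matlab
names  = {'id', 'Sentiment'};                      % hypothetical field names
values = {{1; 2; 3}, {[-1; 0]; -1; [0; 0; 1]}};    % one 3x1 cell of values per field
M = [names; values];   % 2x2 cell: names in row 1, values in row 2
% M{:} expands column by column into name, values, name, values.
S = struct(M{:});      % 3x1 structure array, one element per row of values
ids = [S.id];          % collect a scalar field from all elements
```

Because each value is given as a cell array, struct creates one element per cell rather than a single scalar structure.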

PS: Sorry about the date mixup! PPS: This new code was tested on your sample file (attached as test.txt).

Sebastiano delre on 23 May 2016

Yes, I see. Now it works, thanks.


Answer 2

Walter Roberson on 21 May 2016

https://kr.mathworks.com/matlabcentral/answers/285239-using-regexp-to-create-dataset#answer_222951

If you are using one of the more recent versions of textscan then you can use the %q format to read the double-quoted string as a single item.
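A minimal sketch of %q, assuming a textscan version that supports it. The input here is a short made-up line passed as a character vector instead of a file:

```matlab
txt = '5,"-1,-1,0",-0.8';   % hypothetical three-field line
% %q reads the double-quoted text as one field and strips the quotes.
C = textscan(txt, '%f%q%f', 'Delimiter', ',');
q = C{2}{1};                % the quoted field, commas included
```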


Answer 3

Azzi Abdelmalek on 21 May 2016

https://kr.mathworks.com/matlabcentral/answers/285239-using-regexp-to-create-dataset#answer_222957

Edited: Azzi Abdelmalek on 21 May 2016

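a='5,573346285,746540138, NA ,1341119065,NA,7,0,2,1341111281,"-1,-1,-1,0,-1",-0.8'
b=regexp(a,'\<".+\>"\,','match'); % the quoted field plus its trailing comma, returned in a cell
c=strrep(a,b,''); % remove the quoted field from the line
data1=regexp(c,'[\s\,]+','split'); % split the remaining fields on commas and whitespace
data2=regexp(b{1}(2:end-2),'[\s\,]+','split'); % split the values that were inside the quotes
data=[data1{:} data2{:}] % all fields as one cell array of strings, quoted values last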
