Sorting data by the first two letters in a text string

Question

0 개 추천

Hello,

I am trying to sort a rather large dataset (~1 million rows) that is imported as a table. It has three columns that are arranged like the following:

CC2320932, BC1840824, 0.234

CC4892342, BC2131223, 0.456

CC4892394, AC1293021, 0.586

AA5894383, BA2390232, 0.867

This is genome data, so the first two columns will always start with two letters followed by numbers, and the last column will be a value. In the first row, it's CC vs. BC and their percent overlap. I want to filter the data by the first two letters. So, I want to create a new table pulling rows depending on different genome sets.

So this would mean a table for all CC vs. BC, AA vs. BA, etc. I'm having trouble though because the way the data is stored means that it's isn't simply finding a particular string, as the numbers after the two letters change for virtually every row.

Thank you!

댓글 수: 1
이전 댓글 -1개 표시 이전 댓글 -1개 숨기기

Stephen23 2018년 7월 31일

" I want to filter the data by the first two letters."

Table were exactly designed to make this kind of operation easy: group by some variable, calculate statistics or functions for groups of related data, etc. The table class supports useful functions like rowfun, varfun, splitapply, etc.

https://www.mathworks.com/help/matlab/matlab_prog/calculations-on-tables.html

"So, I want to create a new table pulling rows depending on different genome sets."

Which is why you should not split your nice table into lots of smaller tables: that would actually make processing the data harder!

댓글을 달려면 로그인하십시오.

이 질문에 답변하려면 로그인하십시오.

Follow Question

Answer 1

Guillaume 2018년 7월 30일

MATLAB Online에서 열기

1 개 추천

My advice would be not to split the table into multiple tables. If you want to perform some calculation per genome set it can be easily done on all the sets at once using rowfun or varfun with the 'GroupingVariables' option.

As for creating that grouping variable, it's going to depend how the underlying data is stored in the table. If the underlying data of the genome column is a Nx9 char array, then:

yourtable.set =yourtable.genomecolumn(:, [1 2]);

If it's stored as Nx1 cell array of 1x9 char vector, then:

temparray = vertcat(yourtable.genomecolumn{:});
yourtable.set = temparray(:, [1 2]);

The table can then easily be sorted according to the set:

sortrows(yourtable, 'set')

The set column can also be used for processing with splitapply

group = findgroups(yourtable.set);
splitapply(somefunction, yourtable, group)

or as said with rowfun or varfun:

rowfun(somefunc, yourtable, 'GroupingVariables', 'set')

댓글 수: 2
없음 표시 없음 숨기기

Abraham Jarque 2018년 7월 31일

I've tried using those, however I'm not sure how to sort the data. The data is as follows:

From the rows listed, I would have a C3 vs. C3 group, an A vs. C3 group, and an A vs. C1 group. Because the genome name changes so much from row to row, I'm not sure how to get matlab to recognize them as one group.

Guillaume 2018년 8월 1일

MATLAB Online에서 열기

Well, what you now show us doesn't match the format of the data in your original question.

First, we need to establish a few things.

Is your data really in a table? What is

class(yourtable)

Secondly, how is the genome sequence actually stored in a table. With the format in your question, it would make sense to have it stored as a Nx9 char array. With this new format, it's probably a Nx1 cell array of char arrays. So what is

class(yourtable.genomecolumn)

With that new format, a regular expression is indeed the best way to extract that initial one or two letter:

genomeset = regexp(yourtable.genomecolumn, '^[^_](?=_)', 'match', 'once')  %assuming genomecolumn is a cell array

The regex above will extract all the characters from the start of the string up to the _, so will work for your 1st column.

For the inconsistent format in the second column (XX_YYYYY vs xxYYYY), this may work

genomeset = upper(regexp(yourtable.column2, '^([^_](?=_)|[a-z0-9]+)', 'match', 'once'))

which matches the same as above OR a sequence of lowercase characters and/or numbers only.

댓글을 달려면 로그인하십시오.

Answer 2

Adam Danz 2018년 7월 30일

편집: Adam Danz 2018년 7월 30일

MATLAB Online에서 열기

0 개 추천

This is where learning regular expressions really helps.

Here's how to find rows of column 1 that start with CC and rows of column 2 that start with BC. The first two lines below will create logical vectors that select the appropriate rows of the table. The last line creates a new table by finding rows that are satisfied by both logical vectors. 'tab' is your table.

CC1 = ~cellfun(@isempty, regexp(tab{:,1}, 'CC*+')); %look in col 1 for CC...
BC2 = ~cellfun(@isempty, regexp(tab{:,2}, 'BC*+')); %look in col 2 for BC...
CC_BC = tab(CC1 & BC2, :)

As bonus, if you'd like to see a list of all combinations,

allCombos = cellfun(@(x)x(1:2), tab{:,[1,2]}, 'UniformOutput', false);
allCombos =
  {'CC'}    {'BC'}
  {'CC'}    {'BC'}
  {'CC'}    {'AC'}
  {'AA'}    {'BA'} ...

댓글 수: 2
없음 표시 없음 숨기기

Abraham Jarque 2018년 7월 31일

MATLAB Online에서 열기

Your solution seems to be the closest to what I am trying to do, however I am getting this following error when I am trying to process it:

Error using regexp
The 'STRING' input must be either a char row vector, a cell array of char row vectors,
or a string array.
Error in EmilyData (line 4)
C3 = ~cellfun(@isempty, regexp(tab{:,1}, 'C.*')); %look in col 1 for CC...

Also, looking another dataset I have to process, i have the following genome groups: A, B, C1, C2, C3, and D. However, some of the data is stored as lowercase and some as uppercase (c3 and C3, etc.) Is there a way to compensate for this? I'm sorry if the solution is simple, I am not familiar with regular expressions.

Adam Danz 2018년 7월 31일

편집: Adam Danz 2018년 7월 31일

Are you sure 'tab' is a table? Here's an example using your data from your question.

For your 2nd question, if case doesn't matter, use regexpi() instead of regexp().

댓글을 달려면 로그인하십시오.

Sorting data by the first two letters in a text string

댓글 수: 1
이전 댓글 -1개 표시 이전 댓글 -1개 숨기기

답변 (2개)

댓글 수: 2
없음 표시 없음 숨기기

댓글 수: 2
없음 표시 없음 숨기기

카테고리

태그

Community Treasure Hunt

Sorting data by the first two letters in a text string

댓글 수: 1 이전 댓글 -1개 표시 이전 댓글 -1개 숨기기

답변 (2개)

댓글 수: 2 없음 표시 없음 숨기기

댓글 수: 2 없음 표시 없음 숨기기

카테고리

태그

참고 항목

Community Treasure Hunt

댓글 수: 1
이전 댓글 -1개 표시 이전 댓글 -1개 숨기기

댓글 수: 2
없음 표시 없음 숨기기

댓글 수: 2
없음 표시 없음 숨기기