Is there an interactive way to import one column of a table, with multiple fields, from a text file?

I have a simple problem: reading three files of seismic data from IRIS.edu. Each file starts with a header in which every line begins with a # sign, followed by a two-column, comma-separated table that runs down to a blank line. The files can have 100 million lines each. I want text for working with groups, then binary for batch work.
The table looks like this
Time, Sample
2012-04-08T00:00:00.000501Z, -70052
2012-04-08T00:00:00.025501Z, -70093
2012-04-08T00:00:00.050501Z, -70077
2012-04-08T00:00:00.075501Z, -69983
2012-04-08T00:00:00.100501Z, -70044
I can select the table, but I don't want to write programs that then have to parse the data field. Is the Text Import Wizard smart enough to go deeper? Is there any way for me to add to the Text Import for everyone?
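For readers who do end up scripting the import, here is a minimal sketch of one possible approach, not the Text Import Wizard's own method. It assumes the layout described in the question ('#' header lines, one column-name line, then rows) and that nothing follows the table. The file name and the two '#' lines are hypothetical stand-ins.

```matlab
% Hypothetical file name; a small stand-in file is written so the sketch runs on its own.
filename = 'iris_sample.csv';
fid = fopen(filename, 'w');
fprintf(fid, '# hypothetical header line\n');
fprintf(fid, '# another hypothetical header line\n');
fprintf(fid, 'Time, Sample\n');
fprintf(fid, '2012-04-08T00:00:00.000501Z, -70052\n');
fprintf(fid, '2012-04-08T00:00:00.025501Z, -70093\n');
fprintf(fid, '2012-04-08T00:00:00.050501Z, -70077\n');
fclose(fid);

fid = fopen(filename, 'r');
% Skip the '#' header; the first line that does not start with '#'
% is taken to be the column-name line ("Time, Sample").
tline = fgetl(fid);
while ischar(tline) && startsWith(strtrim(tline), '#')
    tline = fgetl(fid);
end
% Read the remaining rows: text up to the comma, then a number.
C = textscan(fid, '%s %f', 'Delimiter', ',');
fclose(fid);

% Convert the time text; HH is the 24-hour field and 'Z' is matched literally.
timestamps = datetime(C{1}, 'InputFormat', "yyyy-MM-dd'T'HH:mm:ss.SSSSSS'Z'", 'TimeZone', 'UTC');
samples = C{2};
delete(filename);   % remove the stand-in file
```

With a real file, only the part from the second fopen onward would be needed.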

Accepted Answer

Richard K Collins on 6 Mar 2021
Walter Roberson,
Got an email just now asking if there was a good answer to this question. Your one reply is good and helpful, but it has no "accept this answer" button next to it.
I am not going to mark my own stuff as the answer. I can only add a note here to thank you for that code and instructions.
Maybe you know the MATLAB software development team well enough to have them listen to you, and can mention that their email system and the "accept" button writers are not in sync. There are no clear visual cues on this page to distinguish the answer, comment and reply items. I am down here in the "Answer this question" box because there is no other way to add a comment about the whole conversation. Perhaps a generic email format where I could add a subject line to each thing would help. Or a way to add tags and emphasis.
Perhaps they need a generic "like" button for any content on a page. Even inside paragraphs. There is no way to downvote. But popularity and grading by people is not very useful for one-on-one discussions and explorations. And you really do not know what someone searching on the web (entry from search engines) or someone searching or wandering around looking for help inside MathWorks.com is looking for. Maybe a dialog of some sort.
I tried to generate a billion random integers in an array and had to power off my computer because it locked up. It had no progress bar, and even ctrl-alt-del would not get it to stop. Whoever wrote this was so wrapped up with using the matrix and array form of things that they diminished tools for other things. I remember using LISP and got really good with it, but that never caught on. Powerful functions for things one person or group is working on might not be useful to others. There are a lot of websites that use tags. But the problem is the rigid and linear form of these blogs and dialogs. This discussion here could well be related to many other things on this site, but there is no way to connect and show relations to things here on MathWorks, or to anything on the rest of the Internet.
Even though MATLAB is more than 30 years old, I think it is not quite ready yet. Partial answers on the web are not a good way to learn any complex software. It seems all our energies go into just getting tools to do simple things, and the original goals are never achieved.
As a human, I find it hard to talk to strangers, especially ones wearing masks, or hidden. From this page, only a lot of scrolling or leaving this page "might" give me clues. I am reminded of all the 3D video clips I stored that relate to ants moving bits of things around, or termites using spit to glue chunks of nearby things into massive organically fabricated structures. Random swarm algorithms are effective sometimes, but there are many problems they cannot solve.
I see no tools to add anything to MathWorks, and with no clear community or body of knowledge where the options and pathways can be seen and navigated at a glance, I don't think it has much for me. I will keep it on my computer because my friend might want to talk about his work and part of that could be in MATLAB.
Ya-Yuan Cheng works at MathWorks. I called to ask a question. He and I talked for an hour and a half. Or rather, for that long I was trying to describe things that might help MathWorks help more people in the world. All the students in all the schools in the world (about 2 billion people). I would love to redesign the website and the associated processes, and the software and all the applications and data interfaces. But it is so hard to talk seriously about anything in bits and pieces. Either you do the whole thing, or waste time in partial efforts that mostly come to nothing.
Well, this site with its 2.18 million pages doesn't have a way to focus itself on each individual's projects and goals and purposes. I think it could, which is why I said I wanted to recast the whole into a different form, and a different style of interacting with communities, enabled and supported by well-crafted, adaptable, intelligent algorithms. (I actually registered IntelligentAlgorithms.org) But I can't build it by myself. I know it is the right idea and how to do it.
Thanks for helping me with a simple problem. These larger issues I will have to do myself. Or at least try.
Richard Collins, Director, The Internet Foundation

More Answers (2)

Walter Roberson on 3 Mar 2021
I think some of your premises are wrong.
MathWorks has no authority to curate or correct outside sites, and it would be a disaster for the Internet if a corporation were granted the ability to force sites worldwide to conform to its idea of how the products should be discussed. Furthermore, it is a continued and expensive legal battle even just to attempt to get rid of the sites offering cracked versions of MathWorks products, which is clearly illegal in many countries (but not all, and not all countries are willing to act against such things).
Is it MathWorks' place to try to organize all public knowledge about MathWorks products? I think not.
And the policy questions are a nightmare. What should be policies and procedures in situations where outside sites might have some useful technical information, but also contain racism, sexism, anti-LGBT material, support for fascism, or other hurtful material? People who post such things do interpret nominally neutral links to the technical material as being support for their political positions and hate, and will attack (and mobilize attacks) companies that decline to link when such content is noticed. There are entire countries which are going after search engines to force the search engines to promote the varieties of hate speech that the countries favor.
Your post has the inherent premise that all sites on the internet are run by people of good will towards all, containing neutral information (some better organized or more careful than others), and that organizing the knowledge effectively is just a matter of getting around to it. That premise is false in multiple ways.
  3 Comments
Richard K Collins on 3 Mar 2021
Walter, any organization can find where they are mentioned on the Internet and ask, politely and collaboratively, to work together to improve things. I removed all those broader issues. Thanks for your reply. If you have any suggestions for how I can add "work with me to find better ways to parse this column" to the "Import text" part of MATLAB, I would appreciate any pointers or suggestions.
Walter Roberson on 3 Mar 2021
%You WOULD have this line or equivalent in the real program
filename = 'testdata.txt';
%this section is just to establish the data as text, instead of having a file to read
%You would not have this section in the actual code: I put it here to create the data
%file to have something to read from
lines = {
'Time, Sample'
'2012-04-08T00:00:00.000501Z, -70052'
'2012-04-08T00:00:00.025501Z, -70093'
'2012-04-08T00:00:00.050501Z, -70077'
'2012-04-08T00:00:00.075501Z, -69983'
'2012-04-08T00:00:00.100501Z, -70044'}
lines = 6x1 cell array
    {'Time, Sample'                       }
    {'2012-04-08T00:00:00.000501Z, -70052'}
    {'2012-04-08T00:00:00.025501Z, -70093'}
    {'2012-04-08T00:00:00.050501Z, -70077'}
    {'2012-04-08T00:00:00.075501Z, -69983'}
    {'2012-04-08T00:00:00.100501Z, -70044'}
S = strjoin(lines, '\n');
fid = fopen(filename, 'w');
fwrite(fid, S);
fclose(fid);
%end of section
%beginning of code you would have in the real program
T = importdata(filename, ',', 1);
timetext = T.textdata(2:end,1)
timetext = 5x1 cell array
    {'2012-04-08T00:00:00.000501Z'}
    {'2012-04-08T00:00:00.025501Z'}
    {'2012-04-08T00:00:00.050501Z'}
    {'2012-04-08T00:00:00.075501Z'}
    {'2012-04-08T00:00:00.100501Z'}
%HH is the 24-hour field; hh is the 12-hour clock and would display midnight as 12
timestamps = datetime(timetext, 'InputFormat', "yyyy-MM-dd'T'HH:mm:ss.SSSSSS'Z'", 'Format', 'yyyy-MM-dd HH:mm:ss.SSSSSSZ', 'TimeZone', 'UTC')
timestamps = 5×1 datetime array
   2012-04-08 00:00:00.000501+0000
   2012-04-08 00:00:00.025501+0000
   2012-04-08 00:00:00.050501+0000
   2012-04-08 00:00:00.075501+0000
   2012-04-08 00:00:00.100501+0000
numeric_value = T.data
numeric_value = 5×1
      -70052
      -70093
      -70077
      -69983
      -70044
This is a more awkward process than would normally be the case for importing timestamps; the difficulty arises from the use of the non-standard timestamp format.
I have a concern about the fractions of a second. I notice that they all end in 01Z, and I worry that instead of indicating 1E-6 second they might be indicating a timezone offset of 1, though it is not obvious whether it would be +1 or -1. If it is timezone information and is not constant, then more work would have to be done to incorporate it into the timestamps.
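One quick sanity check for this concern, sketched on the five sample rows only: in ISO 8601 a trailing Z denotes UTC, so the sketch assumes the six digits before it are microseconds and looks at the spacing between consecutive rows. A constant spacing is consistent with a regularly sampled channel, but it does not prove how the data provider intended the field.

```matlab
% The five timestamps from the question, as text.
timetext = {
    '2012-04-08T00:00:00.000501Z'
    '2012-04-08T00:00:00.025501Z'
    '2012-04-08T00:00:00.050501Z'
    '2012-04-08T00:00:00.075501Z'
    '2012-04-08T00:00:00.100501Z'};

% Assumption: the six fractional digits are microseconds and 'Z' is a literal UTC marker.
t = datetime(timetext, 'InputFormat', "yyyy-MM-dd'T'HH:mm:ss.SSSSSS'Z'", 'TimeZone', 'UTC');

% Spacing between consecutive samples, in seconds.
dt = seconds(diff(t));

% True when every gap matches the first one to within a microsecond.
isRegular = all(abs(dt - dt(1)) < 1e-6);
```

Here every gap comes out at about 0.025 s, which corresponds to 40 samples per second for these rows.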



Richard K Collins on 4 Mar 2021
Thank you for your patient and complete reply. I learned many things. I have never used MATLAB. I just got it yesterday.
  1. Can I use full filenames in Windows? Can I use double quotes? (The file manager gives them by default when you copy a full file name from it.) FileName = "B:\$Africa\Nigeria\CTAO.IU.00.BH1.2021.001.00.00.00.019-2021.001.23.59.59.994.diff.scale-AUTO.csv"
  2. I had not noticed they all end in 1. It was pure chance, and the first reading just happened to start then. So the information is just YYYY MM DD HH MM SS.SSSSSS
  3. The file open, write and close calls look like the first ever versions of Basic and Fortran. I just hate memorizing yet another set of programmer choices for names and syntax for the thousandth time.
  4. That "lines" example is interesting, using real line feeds as field separators. Better than quotes and commas. Saves some typing.
  5. strjoin is all lowercase. Is it case sensitive? Glad they use "\n", a familiar thing.
  6. That timestamps line says they had to face the many date formats. And maybe they put in some string parsing generics. But horrors to try to remember that mess.
  7. I wonder how efficient this is. I have a "small" project to read 1000 stations, three channels plus time, one year of 100 sps, that came up this morning (I often work 18-22 hours a day). That is 3.1536 Gsamples/year per channel, or 12.6144 G numbers per year per instrument. Not huge, but I try to avoid routines I cannot audit or optimize.
I have several things to do today (several dozen) but I will try this on a few datasets. The file names are easy to add so I can maybe make a function for the read and analysis, and then run a few examples.
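The sizes in item 7 can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope sketch: the 365-day year, the 8-byte double and the 4-byte int32 storage sizes are assumptions added here, and whether the samples fit in int32 depends on the instrument.

```matlab
sps = 100;                                   % samples per second, from the post
secondsPerYear = 60 * 60 * 24 * 365;         % assumes a 365-day year
samplesPerYear = sps * secondsPerYear;       % 3,153,600,000 = 3.1536 G per channel

columnsPerInstrument = 4;                    % three channels plus time
valuesPerInstrument = samplesPerYear * columnsPerInstrument;   % 12.6144 G

% Rough in-memory size of one channel for one year, in gigabytes (1e9 bytes).
gbAsDouble = samplesPerYear * 8 / 1e9;       % about 25.2 GB at 8 bytes per value
gbAsInt32  = samplesPerYear * 4 / 1e9;       % about 12.6 GB at 4 bytes per value

fprintf('%.4f G samples per channel, %.4f G values per instrument\n', ...
    samplesPerYear / 1e9, valuesPerInstrument / 1e9);
```

At these sizes a single channel-year does not fit comfortably in typical desktop memory as doubles, which is one reason to process such files in pieces.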
In JavaScript I LOVE the speed and efficiency of the object storage and names. I work interactively using Google Chrome and can browse the data as I step through the programs and functions. I can make content scripts that run in the background of the pages I visit on the web and have full access to the DOM. There are limits on cross domain, but I can write to disk easily, and use localhost for reading and processing files. But one thing I use all the time when reviewing a new dataset is to count.
var CountHours = {}
// assuming Hours is an array of values: "of" walks the values, "in" would walk the indexes
for (var h of Hours)
{
if (!CountHours[h]) CountHours[h] = 0
CountHours[h]++
}
It creates a new counter for each unique value in Hours and then counts. It is extremely fast; I routinely count billions of things. In time series, I will count the occurrences of pairs (Markov probabilities, transition matrices, conditional probabilities), triples and further. For these seismic files I have to find filters for splitting many signals apart (multiple sources of noise and signals), so I often use first differences (the velocity goes to an acceleration), then jerk and higher time derivatives. Since those derivatives and differences depend on pairs and triples and longer sequence counts, I just count the sequences and leave the derivative calculations for later. Likewise the mean and standard deviation can be calculated from the counts of values at the end. You don't have to do billions of floating point additions to build the sums and sums of squares for regular statistics and regressions. Just count raw sequences, then do the stats at the end.
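The same count-first idea can be sketched in MATLAB. This is a minimal illustration rather than a tuned implementation: it uses the five sample values from the thread plus one repeated value added here so that a count exceeds one, tallies each distinct value with unique and accumarray, and then derives the mean and sample standard deviation from the counts alone.

```matlab
% Sample values from the thread; the last one is a repeat added for illustration.
x = [-70052; -70093; -70077; -69983; -70044; -70052];

% Tally how often each distinct value occurs.
[vals, ~, idx] = unique(x);        % vals: sorted distinct values, idx: position of each x in vals
counts = accumarray(idx, 1);       % counts(k) is the number of occurrences of vals(k)

% Statistics computed from the counts only.
n = sum(counts);
mu = sum(vals .* counts) / n;                              % mean
sigma = sqrt(sum(counts .* (vals - mu).^2) / (n - 1));     % sample standard deviation
```

The work after counting scales with the number of distinct values, not the number of samples, which is the point made above.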
Richard

Release

R2020b
