genfromtxt can skip header and footer lines and specify which columns to use.
But how can I control how many lines to read?
Sometimes a txt file might contain several blocks with different shapes.
For example,
import numpy as np
from io import StringIO

a = StringIO('''
1,2,3
1,2,3
2,3
2,3
''')
np.genfromtxt(a, delimiter=',', skip_header=1)
This will raise an error,
ValueError: Some errors were detected !
Line #4 (got 2 columns instead of 3)
Line #5 (got 2 columns instead of 3)
Of course, I can do it like this:
a = StringIO('''
1,2,3
1,2,3
2,3
2,3
''')
np.genfromtxt(a, delimiter=',', skip_header=1, skip_footer=2)
It's ugly, as I have to count the number of rows below the block.
Instead, I wish there were something like
genfromtxt(a, delimiter=',', skip_header=1, nrows=2)
which would be clearer.
Does anyone have a good idea about this? Or is there another function I could use?
Update (Oct 2015)
This question has been solved in newer versions of NumPy.
genfromtxt now has a keyword named max_rows, which allows one to control the number of lines to read, cf. here.
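For example, a minimal sketch of reading just the first block with max_rows, reusing the data from above (this assumes NumPy 1.10 or later):
import numpy as np
from io import StringIO

a = StringIO('''
1,2,3
1,2,3
2,3
2,3
''')
# Stop after two data rows instead of counting footer lines
block = np.genfromtxt(a, delimiter=',', skip_header=1, max_rows=2)
print(block)
# [[1. 2. 3.]
#  [1. 2. 3.]]
Since the stream is consumed only up to the end of the block, a second genfromtxt call on the same handle should pick up the next block, though I would verify that on your NumPy version.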
You can use invalid_raise=False to skip reading the lines that are missing some data.
E.g.
b = np.genfromtxt(a, delimiter=',', invalid_raise=False)
This will give you a warning, but will not raise an exception.
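For example, applied to the block data from the question (a minimal sketch; the two-column rows are reported in a ConversionWarning and dropped):
import numpy as np
from io import StringIO

a = StringIO('''
1,2,3
1,2,3
2,3
2,3
''')
b = np.genfromtxt(a, delimiter=',', skip_header=1, invalid_raise=False)
print(b)
# [[1. 2. 3.]
#  [1. 2. 3.]]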
Related
I need to be able to parse two different types of CSVs with read_csv: the first has ;-separated values and the second has ,-separated values, and I need to handle both at the same time.
That is, the CSV can have this format:
some;csv;values;here
or this:
some,csv,values,here
or even mixed:
some;csv,values;here
I tried many things, like the following regex, but nothing worked:
data = pd.read_csv(csv_file, sep=r'[,;]', engine='python')
Am I doing something wrong with the regex?
Instead of reading from a file, I ran your code sample reading from a string:
import io
import pandas as pd

txt = '''C1;C2,C3;C4
some;csv,values;here
some1;csv1,values1;here1'''
data = pd.read_csv(io.StringIO(txt), sep='[,;]', engine='python')
and got a proper result:
C1 C2 C3 C4
0 some csv values here
1 some1 csv1 values1 here1
Note that the sep parameter can even be an ordinary (not raw) string, because it does not contain any backslashes.
So your idea to specify multiple separators as a regex pattern is OK.
The reason your code failed is probably an "inconsistent" division of lines into fields. Maybe you should ensure that each line contains the same number of commas and semicolons (at least not too many).
Look thoroughly at your stack trace. It should include some information about which line of the source file caused the problem.
Then look at the indicated line and correct it.
Edit
To see what happens in a "failure case", I changed the source string to:
txt = '''C1;C2,C3;C4
some;csv,values;here
some1;csv1,values1;here1
some2;csv2,values2;here2,xxxx'''
i.e. I added one line with 5 fields (one too many).
Then execution of the above code results in an error message:
ParserError: Expected 4 fields in line 4, saw 5. ...
Note the words in line 4, precisely indicating the offending input line (line numbers start from 1).
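If the file is too large to inspect by hand, here is a minimal sketch for locating such lines up front (data.csv is a placeholder for your actual file, and 4 is the expected field count):
import re

EXPECTED = 4  # expected number of fields per row

with open('data.csv') as f:  # 'data.csv' is a placeholder
    for lineno, line in enumerate(f, start=1):
        nfields = len(re.split('[,;]', line.rstrip('\n')))
        if nfields != EXPECTED:
            print(f'line {lineno}: {nfields} fields -> {line.rstrip()}')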
I want to import a txt file as below:
0 #switchfoot http://twitpic.com/2y1zl - Awww that's a bummer. You shoulda got David Carr of Third Day to do it. ;D
0 is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah!
0 #Kenichan I dived many times for the ball. Managed to save 50% The rest go out of bounds
4 my whole body feels itchy and like its on fire
4 #nationwideclass no it's not behaving at all. i'm mad. why am i here? because I can't see you all over there.
0 #Kwesidei not the whole crew
The desired return is a numpy array with two columns: sentiment, which is '0' or '4', and tw, a string. But it keeps giving me an error. Could anyone help?
Train_tw = np.genfromtxt("classified_tweets0.txt", dtype=(int, str), names=['sentiment', 'tw'])
The error with your expression is
ValueError: mismatch in size of old and new data-descriptor
If I use dtype=None, I get
ValueError: Some errors were detected !
Line #2 (got 22 columns instead of 20)
Line #3 (got 19 columns instead of 20)
Line #4 (got 11 columns instead of 20)
Line #5 (got 22 columns instead of 20)
Line #6 (got 6 columns instead of 20)
Working from the default whitespace delimiter, genfromtxt breaks each line into 20, 22, etc. fields. The spaces within the tweet text are delimiters just like the first one.
One option is to edit the file and replace the first space with some unique delimiter. Another option is to use the field-width version of the delimiter. After a bit of experimentation, this load looks reasonable (this is Py3, so I'm using a Unicode string dtype):
In [32]: np.genfromtxt("stack42754603.txt",dtype='int,U100',delimiter=[2,100],names=['sentiment','tw'])
Out[32]:
array([ (0, "#switchfoot http://twitpic.com/2y1zl - Awww that's a bummer. You shoulda got David Carr of Third D"),
(0, "is upset that he can't update his Facebook by texting it... and might cry as a result School today "),
(0, '#Kenichan I dived many times for the ball. Managed to save 50% The rest go out of bounds\n'),
(4, 'my whole body feels itchy and like its on fire\n'),
(4, "#nationwideclass no it's not behaving at all. i'm mad. why am i here? because I can't see you all o"),
(0, '#Kwesidei not the whole crew')],
dtype=[('sentiment', '<i4'), ('tw', '<U100')])
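The first option, rewriting the file so that only the first space acts as a delimiter, could be sketched like this (tweets_tab.txt is a hypothetical intermediate file; this assumes the tweet text itself contains no tabs):
import numpy as np

# Split each line on the FIRST space only and rejoin with a tab,
# so that genfromtxt sees exactly two fields per row
with open('classified_tweets0.txt') as src, open('tweets_tab.txt', 'w') as dst:
    for line in src:
        label, _, text = line.rstrip('\n').partition(' ')
        dst.write(f'{label}\t{text}\n')

data = np.genfromtxt('tweets_tab.txt', dtype=None, delimiter='\t',
                     names=['sentiment', 'tw'], encoding='utf-8')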
I have a poorly formatted delimited file in which there are errors with the delimiter, so different rows sometimes appear to have an inconsistent number of columns.
When I run
pd.read_csv('patentHeader.txt', sep="|", header=0)
the process dies with this error:
CParserError: Error tokenizing data. C error: Expected 9 fields in line 1034558, saw 15
Is there a way to have pandas skip these lines and continue? Or, put differently, is there some way to make read_csv more flexible about how many columns it encounters?
Try this.
pd.read_csv('patentHeader.txt', sep="|", header=0, error_bad_lines=False)
error_bad_lines: if False, then any lines causing an error will be skipped as bad lines, and they will be reported once the reading process is done.
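Note that in recent pandas releases (1.3 and later) error_bad_lines is deprecated in favour of the on_bad_lines parameter; on a newer install the equivalent call would be a sketch like:
import pandas as pd

# 'skip' silently drops malformed lines; 'warn' drops them and emits a warning
df = pd.read_csv('patentHeader.txt', sep='|', header=0, on_bad_lines='skip')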
How do we read a specific file line by line while skipping some columns in it?
For example, I have a text file which has data, sorted out in 5 columns, but I need to read only two columns out of it, they can be first two or any other random combination (I mean, need a solution which would work with any combination of columns like first and third only).
The code would be something like this:
open(1, file=data_file)
read(1, *) ! to skip the first line, with metadata
lmax = 0
do while (.true.)
  ! read columns 1 and 3 here; either write
  ! that to an array or just loop through each row
end do
99 continue
close(1)
Any explanation or example would help a lot.
High Performance Mark's answer gives the essentials of simple selective column reading: one still reads the column but transfers it to a then-ignored variable.
To extend that answer, then, consider that we want to read the second and fourth columns of a five-column line:
read(*,*) junk, x, junk, y
The first value is transferred into junk, then the second into x, then the third (replacing the one just acquired a moment ago) into junk and finally the fourth into y. The fifth is ignored because we've run out of input items and the transfer statement terminates (and the next read in a loop will go to the next record).
Of course, this is fine when we know it's those columns we want. Let's generalize to when we don't know in advance:
integer col1, col2   ! The columns we require, defined somehow (assume col1<col2)
<type>, dimension(nrows) :: x, y, junk(3)   ! For the number of rows
integer i

do i = 1, nrows
  read(*,*) junk(:col1-1), x(i), junk(:col2-col1-1), y(i)
end do
Here, we transfer a number of values (which may be zero) up to just before the first column of interest, then the value of interest. After that, more to-be-ignored values (possibly zero), then the final value of interest. The rest of the row is skipped.
This is still very basic and sidesteps many potential complications in the requirements. To some extent, it's such a basic approach that one may as well just consider:
do i = 1, nrows
  read(*,*) allofthem(:5)
  x(i) = allofthem(col1)
  y(i) = allofthem(col2)
end do
(where that variable is a row-by-row temporary) but variety and options are good.
This is very easy. You simply read 5 variables from each line and ignore the ones you have no further use for. Something like
do i = 1, 100
  read(*,*) a(i), b, c(i), d, e
end do
This will overwrite the values in b, d, and e at every iteration.
Incidentally, your line
99 continue
is redundant; it's not used as the closing line for the do loop and you're not branching to it from anywhere else. If you are branching to it from unseen code you could just attach the label 99 to the next line and delete the continue statement. Generally, continue is redundant in modern Fortran; specifically it seems redundant in your code.
I want to read in data from multiple files that I want to use for plotting (matplotlib).
I found a function loadtxt() that I could use for this purpose. However, I only want to read in one column from each file.
How would I do this?
The following command works for me if I read in at least 2 columns, for example:
numpy.loadtxt('myfile.dat', usecols=(2,3))
But
numpy.loadtxt('myfile.dat', usecols=(3))
would throw an error.
You need a comma after the 3 in order to tell Python that (3,) is a tuple. Python interprets (3) to be the same value as the int 3, and loadtxt wants a sequence-type argument for usecols.
numpy.loadtxt('myfile.dat', usecols=(3,))
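A list works just as well and sidesteps the trailing-comma subtlety; newer NumPy versions (1.11 and later, if I recall the release correctly) also accept a plain integer:
import numpy as np

col = np.loadtxt('myfile.dat', usecols=(3,))  # one-element tuple
col = np.loadtxt('myfile.dat', usecols=[3])   # list form, no comma needed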