Import Txt file with two mixed columns - numpy

I want to import a txt file as below:
0 #switchfoot http://twitpic.com/2y1zl - Awww that's a bummer. You shoulda got David Carr of Third Day to do it. ;D
0 is upset that he can't update his Facebook by texting it... and might cry as a result School today also. Blah!
0 #Kenichan I dived many times for the ball. Managed to save 50% The rest go out of bounds
4 my whole body feels itchy and like its on fire
4 #nationwideclass no it's not behaving at all. i'm mad. why am i here? because I can't see you all over there.
0 #Kwesidei not the whole crew
The desired result is a numpy.array with two columns: sentiment ('0' or '4') and tw (the tweet string). But it keeps giving me an error. Could anyone help?
Train_tw=np.genfromtxt("classified_tweets0.txt",dtype=(int,str),names=['sentiment','tw'])

The error with your expression is
ValueError: mismatch in size of old and new data-descriptor
If I use dtype=None, I get
ValueError: Some errors were detected !
Line #2 (got 22 columns instead of 20)
Line #3 (got 19 columns instead of 20)
Line #4 (got 11 columns instead of 20)
Line #5 (got 22 columns instead of 20)
Line #6 (got 6 columns instead of 20)
Working from the default whitespace delimiter, genfromtxt breaks each line into 20, 22, etc. fields. The spaces within the text are delimiters just like the first one.
One option is to edit the file and replace the first space with some unique delimiter. Another option is to use the fixed-width form of delimiter, which is a list of field widths. After a bit of experimentation, this load looks reasonable (this is Py3, so I'm using a Unicode string dtype). Note that with a width of 100 the longer tweets are truncated, and the shorter ones keep their trailing newline.
In [32]: np.genfromtxt("stack42754603.txt",dtype='int,U100',delimiter=[2,100],names=['sentiment','tw'])
Out[32]:
array([ (0, "#switchfoot http://twitpic.com/2y1zl - Awww that's a bummer. You shoulda got David Carr of Third D"),
(0, "is upset that he can't update his Facebook by texting it... and might cry as a result School today "),
(0, '#Kenichan I dived many times for the ball. Managed to save 50% The rest go out of bounds\n'),
(4, 'my whole body feels itchy and like its on fire\n'),
(4, "#nationwideclass no it's not behaving at all. i'm mad. why am i here? because I can't see you all o"),
(0, '#Kwesidei not the whole crew')],
dtype=[('sentiment', '<i4'), ('tw', '<U100')])
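The first option (treating only the first space as the delimiter) can also be done in plain Python without editing the file. This is only a minimal sketch: parse_labelled_lines is a hypothetical helper name, and the sample lines are taken from the question.

```python
import io


def parse_labelled_lines(lines):
    """Split each line at its first space into (int label, text)."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # partition splits at the FIRST space only, so spaces inside
        # the tweet text are left alone
        label, _, text = line.partition(" ")
        rows.append((int(label), text))
    return rows


# Stand-in for open("classified_tweets0.txt")
sample = io.StringIO(
    "0 #Kwesidei not the whole crew\n"
    "4 my whole body feels itchy and like its on fire\n"
)
rows = parse_labelled_lines(sample)
print(rows)
```

The resulting list of tuples should be usable with np.array(rows, dtype=[('sentiment', 'i4'), ('tw', 'U140')]) to get a structured array; the width 140 is my assumption and would need to cover the longest tweet.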

I would like to MOVE just one line in Vim

yy and p should copy and paste 1 line of text.
But I have to go back and delete the original line.
:2,5m10
should move lines 2 through 5 to just below line 10. However, I need to enable :set number
to see which lines I am moving.
I would like to move just 1 line of text, sort of like yy+p and not use :2,3m10
to move just one line.
Is there something like mm+p, so that it copies the current line into the buffer,
deletes the line, and then you paste it with p where you want?
:3m . moves line 3 to just below your current line.
The above command does what I want. Can I set a key mapping so
that "mm" replaces ":3m."? I find it easier to type. TIA
What you're describing is the default behaviour when using dd: it deletes a
line into the buffer, and p will paste it.
So dd and p work.
If you're new to vim, then it might seem a little strange that 'yanking' (with
y) and 'deleting' (with d) both copy to the buffer, given the 'cut', 'copy'
and 'paste' behaviours of most other editors.
You can read more about it with :help change.txt and in particular :help registers.
Also, since you say you need to enable :set number, I wonder if you've come
across :set relativenumber? This is very useful. In the example below, the
numbers would look this way if your cursor was on the line with
'demonstrate':
3 This is just
2 a small
1 example to
0 demonstrate
1 how relative
2 numbers can
3 be useful
Thus if you wanted to move the line 'a small' below the line with 'numbers
can', you could use the relative line numbers to know that 2k would put the
cursor on the line you want, where you'd hit dd. Then you'd have this
situation (the deleted line is now in the buffer):
1 This is just
0 example to
1 demonstrate
2 how relative
3 numbers can
4 be useful
Then you can do 3j to move to the 'numbers can' line, and hit p. So
relative numbers are a nice way to move quickly to lines you can see. Also,
just for completeness, you can use relative numbers in a similar way on the
command line: with the cursor on 'demonstrate', :-2m+2 moves 'a small' below
'numbers can' (although I know this isn't what you're after). You can
even set both relativenumber and number at the same time, in which case
it's like in the example above, only you have the absolute line number
displayed on the current line instead of a zero.

pandas read_csv with multiple separators does not work

I need to be able to parse 2 different types of CSVs with read_csv, the first has ;-separated values and the second has ,-separated values. I need to do this at the same time.
That is, the CSV can have this format:
some;csv;values;here
or this:
some,csv,values,here
or even mixed:
some;csv,values;here
I tried many things like the following regex but nothing worked:
data = pd.read_csv(csv_file, sep=r'[,;]', engine='python')
Am I doing something wrong with the regex?
Instead of reading from a file, I ran your code sample
reading from a string:
import io
import pandas as pd
txt = '''C1;C2,C3;C4
some;csv,values;here
some1;csv1,values1;here1'''
data = pd.read_csv(io.StringIO(txt), sep='[,;]', engine='python')
and got a proper result:
C1 C2 C3 C4
0 some csv values here
1 some1 csv1 values1 here1
Note that the sep parameter can even be an ordinary (not raw) string,
because it does not contain any backslashes.
So your idea to specify multiple separators as a regex pattern is OK.
The reason that your code failed is probably an inconsistent division of
lines into fields. Make sure that each line contains the same total number
of commas and semicolons (or at least not too many).
Look thoroughly at your stack trace. It should include some information
about which line of the source file caused the problem.
Then look at the indicated line and correct it.
Edit
To see what happens in a "failure case", I changed the source string to:
txt = '''C1;C2,C3;C4
some;csv,values;here
some1;csv1,values1;here1
some2;csv2,values2;here2,xxxx'''
i.e. I added one line with 5 fields (one too many).
Then execution of the above code results in an error message:
ParserError: Expected 4 fields in line 4, saw 5. ...
Note the words "in line 4", which precisely indicate the offending input line
(line numbers start from 1).
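If the stack trace is not enough, a rough way to locate suspicious lines before calling read_csv is to count the fields on each line yourself. This is just a sketch using the standard re module: find_inconsistent_lines is a hypothetical helper, it ignores quoting (which pandas does handle), and it also reports lines with too few fields, even though only lines with too many fields produced the error above.

```python
import re


def find_inconsistent_lines(text, pattern=r"[,;]"):
    """Return (line_number, field_count) for every line whose field
    count differs from that of the header line (line 1)."""
    lines = text.splitlines()
    expected = len(re.split(pattern, lines[0]))
    bad = []
    for number, line in enumerate(lines[1:], start=2):
        count = len(re.split(pattern, line))
        if count != expected:
            bad.append((number, count))
    return bad


txt = '''C1;C2,C3;C4
some;csv,values;here
some1;csv1,values1;here1
some2;csv2,values2;here2,xxxx'''
print(find_inconsistent_lines(txt))  # [(4, 5)]
```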

BigQuery Line numbering

Maybe I'm missing something, but does anyone know how the debugging line numbering works in BigQuery?
I get an error like this:
Failed to save view. 2.196 - 2.226: Ambiguous column name eventcode.
and I'm not sure what the address 2.196-2.226 means. Specifically, what does the decimal signify? The second line of my query is very short (just SELECT *), so I don't think the numbers after the decimal indicate a character range?
Anyway -- I can't find any info for it in the docs, so any help here would be great
The line count is most likely shifted by 1, so the error is actually on line #1 (not #2).
196 - 226 points to the character range on that line containing the piece of code that introduces the ambiguity: just the second piece (the one that reuses an already existing alias or column name).
Somehow, for some of the first lines, line numbering is incremented by 1. It could be a bug on the BQ UI side.
Btw, I noticed the same strange behaviour in CodeMirror, which the BQ UI uses, so it could be a CodeMirror bug too.
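To make the line.column reading concrete, here is a small sketch that pulls the two positions out of the error message. It assumes the format really is line.column on both sides of the dash, as described above; parse_error_span is a hypothetical helper, and it does not try to correct the possible off-by-one in the line number.

```python
import re


def parse_error_span(message):
    """Extract ((line, col), (line, col)) from text like '2.196 - 2.226'.

    Returns None if no such span is found.
    """
    match = re.search(r"(\d+)\.(\d+)\s*-\s*(\d+)\.(\d+)", message)
    if match is None:
        return None
    line1, col1, line2, col2 = map(int, match.groups())
    return (line1, col1), (line2, col2)


msg = "Failed to save view. 2.196 - 2.226: Ambiguous column name eventcode."
start, end = parse_error_span(msg)
print(start, end)  # (2, 196) (2, 226)
```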

How to control genfromtxt to read rows specified?

genfromtxt can skip header and footer lines and specify which columns to use.
But how can I control how many lines to read?
Sometimes a txt file might contain several blocks with different shapes.
For example,
from io import StringIO
from numpy import genfromtxt
a=StringIO('''
1,2,3
1,2,3
2,3
2,3
''')
genfromtxt(a,delimiter=',',skip_header=1)
This will raise an error,
ValueError: Some errors were detected !
Line #4 (got 2 columns instead of 3)
Line #5 (got 2 columns instead of 3)
Of course, I can do it like this:
a=StringIO('''
1,2,3
1,2,3
2,3
2,3
''')
genfromtxt(a,delimiter=',',skip_header=1,skip_footer=2)
It's ugly, as I have to calculate the number of rows below the block.
I wish there were something like
genfromtxt(a,delimiter=',',skip_header=1,nrows=2)
which would be clearer.
Does anyone have a good idea about that? Or is there another function I could use?
Update 2015 Oct
This question has been solved in a newer version of NumPy.
genfromtxt now has a new keyword named max_rows, which lets you control the number of lines to read.
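For the two-block example from the question, max_rows combined with skip_header might be used as below. This is a sketch assuming a NumPy version that has max_rows; a fresh StringIO is created for each call so that neither read depends on where the previous one stopped.

```python
from io import StringIO

import numpy as np

text = "\n1,2,3\n1,2,3\n2,3\n2,3\n"

# First block: skip the leading blank line, then read two rows.
first = np.genfromtxt(StringIO(text), delimiter=",", skip_header=1, max_rows=2)

# Second block: skip the blank line plus the two rows of the first block.
second = np.genfromtxt(StringIO(text), delimiter=",", skip_header=3, max_rows=2)

print(first.shape, second.shape)
```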
You can use invalid_raise=False to skip the lines that are missing some data.
E.g.
b = np.genfromtxt(a, delimiter=',', invalid_raise=False)
This will give you a warning, but will not raise an exception.
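Applied to the data from the question, that would look roughly like this. It is only a sketch: the warning is silenced here with the standard warnings module so the output stays clean, which you may not want in real code, since the warning tells you which lines were dropped.

```python
import warnings
from io import StringIO

import numpy as np

text = "\n1,2,3\n1,2,3\n2,3\n2,3\n"

with warnings.catch_warnings():
    # Silence the warning about the skipped two-column lines.
    warnings.simplefilter("ignore")
    b = np.genfromtxt(StringIO(text), delimiter=",", invalid_raise=False)

print(b.shape)  # only the rows matching the first row's column count are kept
```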

Read input from line in J

I am very new to J (learning it for fun) and I am trying to read data from keyboard. I have tried to make a tiny script which reads in a string and reverses it:
|.(1!:1 3)
When I run it, I get a rank error. (I'm using 1!:1 3 instead of defining a verb because codegolf...)
Is there a command that can check the rank of 1!:1 3?
That's a common mistake with foreigns.
The definition for foreign 1!:1 doesn't help, because it really reads:
1!:1 y Read. y is a file name or a file number (produced by 1!:21); the
result is a string of the file contents, e.g. 1!:1 <'abc.q'. The following
values for y are also permitted:
1 read from the keyboard (does not work within a script)
3 read from standard input (stdin)
And so replacing y with 3 should work, right? Well, not quite, because what you're really giving as an argument when you write:
1!:1 3
is an array made of 1 3. Sort of like giving it:
1!:(1 3)
when you want:
1!:1 (3)
For code golf purposes, use a right bracket:
1!:1]3