I want to ignore the $ sign while reading a CSV file. I have tried multiple encoding options (latin-1, utf-8, utf-16, utf-32, ascii, utf-8-sig, unicode_escape, rot_13)
and also encoding_errors='replace', but nothing seems to work.
Below is a dummy data set showing how the '$' is read: the text between two '$' signs is displayed in a bold-italic font.
This is what the original data set looks like:
Code:
import pandas as pd
df = pd.read_csv("C:\\Users\\nitin2.bhatt\\Downloads\\CCL\\dummy.csv")
df.head()
Please help. I have referred to multiple blogs but couldn't find a solution to this.
I need to be able to parse two different types of CSV with read_csv: the first has ;-separated values and the second has ,-separated values. The same call has to handle both.
That is, the CSV can have this format:
some;csv;values;here
or this:
some,csv,values,here
or even mixed:
some;csv,values;here
I tried many things like the following regex but nothing worked:
data = pd.read_csv(csv_file, sep=r'[,;]', engine='python')
Am I doing something wrong with the regex?
Instead of reading from a file, I ran your code sample
reading from a string:
import io
import pandas as pd

txt = '''C1;C2,C3;C4
some;csv,values;here
some1;csv1,values1;here1'''
data = pd.read_csv(io.StringIO(txt), sep='[,;]', engine='python')
and got a proper result:
      C1    C2       C3     C4
0   some   csv   values   here
1  some1  csv1  values1  here1
Note that the sep parameter can even be an ordinary (not raw) string,
because it does not contain any backslashes.
So your idea of specifying multiple separators as a regex pattern is OK.
The reason your code failed is probably an "inconsistent" division of
lines into fields. Make sure that each line contains the
same number of commas and semicolons (or at least not too many).
Look carefully at your stack trace. It should include some information
about which line of the source file caused the problem.
Then look at the indicated line and correct it.
Edit
To see what happens in a "failure case", I changed the source string to:
txt = '''C1;C2,C3;C4
some;csv,values;here
some1;csv1,values1;here1
some2;csv2,values2;here2,xxxx'''
i.e. I added one line with 5 fields (one too many).
Executing the above code then results in an error message:
ParserError: Expected 4 fields in line 4, saw 5. ...
Note the words "in line 4", which precisely indicate the offending input line
(line numbers start from 1).
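If the stack trace is not at hand, a rough pre-check can point at the same lines before read_csv is called. The sketch below is only an illustration: the helper name find_inconsistent_lines is hypothetical, it uses plain re.split rather than the pandas parser, and it assumes that no field contains a quoted comma or semicolon.

```python
import re


def find_inconsistent_lines(text, pattern=r"[,;]"):
    """Return (line_number, field_count) for every line whose number of
    fields differs from the first line. Line numbers start from 1.
    Assumption: separators never appear inside quoted fields."""
    lines = text.splitlines()
    if not lines:
        return []
    expected = len(re.split(pattern, lines[0]))
    bad = []
    for number, line in enumerate(lines, start=1):
        count = len(re.split(pattern, line))
        if count != expected:
            bad.append((number, count))
    return bad


txt = '''C1;C2,C3;C4
some;csv,values;here
some1;csv1,values1;here1
some2;csv2,values2;here2,xxxx'''

print(find_inconsistent_lines(txt))  # [(4, 5)]
```

For the failing sample this reports line 4 with 5 fields, which matches the line named in the ParserError.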
My input txt file looks like this:
2019-06-23 17:53 Page 1
1.9752838,1.9752001,1.9752001,1.9749992,1.9752017,1.9752017,1.9752017,1.9752017,1.9752017,1.9752017,1.9752017,1.9752017
1.9752017,1.9752017,1.9752017,1.9752017,1.9752989,1.9752471,1.9751560,1.9751560,1.9753192,1.9752765,1.9752767,1.9644918
1.9754473,1.9751872,1.9751872,1.9745865,1.9753944,1.9753007,1.9750204,1.9750204,1.9754550,1.9754478,1.9754481,1.9518886
1.9753613,1.9751965,1.9751965,1.9747815,1.9754874,1.9753416,1.9747925,1.9747925,1.9755443,1.9756079,1.9756084,1.9574568
1.9752838,1.9752001,1.9752001,1.9754132,1.9752989,1.9752471,1.9751559,1.9751560,1.9750417,1.9752768,1.9752767,1.9816657
1.9754473,1.9751873,1.9751873,1.9758274,1.9753945,1.9753007,1.9750204,1.9750204,1.9749107,1.9754483,1.9754481,1.9861361
1.9753612,1.9751966,1.9751966,1.9756361,1.9754875,1.9753416,1.9747925,1.9747926,1.9746301,1.9756088,1.9756084,1.9894820
When I do
df = pd.read_csv('/Users/jan/data/ofile.csv', sep = ",")
The output dataframe has all the values organized in the way one might expect, except all the values are in one single column, confirmed by the command below.
len(df.columns)
1
How can I make it so that the dataframe sees the columns? I tried playing around with the 'header' tag and reading it as a txt file instead of csv, but nothing solved the problem. I know this has been asked repeatedly before but nothing seemed to solve my problem.
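Try skiprows to skip the non-data lines at the top of the file (adjust the count to your file), and header=None, since the data has no header row: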
pd.read_csv('/Users/jan/data/ofile.csv', skiprows=3, header=None, sep=',')
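To try this without the original file, here is a small sketch that reads from a string instead. The sample is an assumption: it is shortened to three values per row and has exactly one non-data line at the top, so skiprows=1 is used here. The right count for the real file depends on how many lines precede the data.

```python
import io

import pandas as pd

# Shortened stand-in for the real file: one page-header line, then data rows.
sample = """2019-06-23 17:53 Page 1
1.9752838,1.9752001,1.9752001
1.9752017,1.9752017,1.9752017
"""

# Skip the page-header line and tell pandas there is no header row.
df = pd.read_csv(io.StringIO(sample), skiprows=1, header=None, sep=",")
print(len(df.columns))  # 3
```

Without skiprows, the first line is taken as the header, and it contains no commas, so pandas sees a single column name.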
I'm trying to do some text processing on entries in a TSV file, so I loaded it as a dataframe, and I want to add a quotation mark at the beginning of a certain entry. The code I'm using is as follows:
episode_info.loc[i, 'word'] = "\"" + episode_info.loc[i, "word"]
But when I look at the output I get """help" instead of just "help, while the previous entry is just help, so I don't know why this isn't working.
Okay, I printed the entries in question to the terminal and they came out correctly. I guess the quotation marks were being formatted weirdly when I viewed the file in Sublime, which is what I was using. Apologies for the unnecessary question.
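One possible explanation, offered as an assumption because the code that wrote the file is not shown: standard CSV/TSV quoting wraps a field that contains a quotation mark in quotes and doubles the embedded one. The sketch below uses Python's csv module rather than the original dataframe to show that "help is stored as """help" and reads back unchanged.

```python
import csv
import io

# Write two one-field rows to an in-memory TSV "file".
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["help"])
writer.writerow(['"help'])  # field starting with a quotation mark

written = buffer.getvalue()
print(written)

# Reading the text back recovers the original values.
rows = list(csv.reader(io.StringIO(written), delimiter="\t"))
print(rows)
```

If that is what happened, the file is correct as written and any CSV-aware reader will return "help.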
Pentaho -
Design : Text file output
Requirement :
- Read values from DB and create a csv file.
- I want to remove the CR & LF from the last line of the generated file.
This empty last line causes problems when the file is parsed, so I want to get rid of it.
Sample example here:
Test.ktr :
https://ufile.io/ug06w
This produces output.csv, in which the last line ends with CRLF (the file shows 3 lines, with a blank line at the end).
input.csv
https://ufile.io/lj0tj
(Simulates the values coming from the database; contains 2 lines.)
Put some logic between the Table input and the CSV output, for example a Filter step, which can remove empty lines.
I cannot tell you more, unless you tell me more about your specific case.
I solved this using the Shell Script component. After generating the file, I added a post-processing step to remove the empty line at the end of the file.
There could be other solutions but this fulfilled my requirement.
Thank you.
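The shell script itself is not shown above, so the following is not that script. It is a minimal Python sketch of the same post-processing idea, assuming the file is small enough to read into memory. The function name strip_trailing_newline and the temporary file are hypothetical and used only for illustration.

```python
import tempfile
from pathlib import Path


def strip_trailing_newline(path):
    """Remove a single trailing CRLF, LF, or CR from the file at `path`.
    Returns True if the file was changed, False otherwise."""
    file_path = Path(path)
    data = file_path.read_bytes()
    if data.endswith(b"\r\n"):
        data = data[:-2]
    elif data.endswith(b"\n") or data.endswith(b"\r"):
        data = data[:-1]
    else:
        return False
    file_path.write_bytes(data)
    return True


# Demonstration on a temporary file standing in for output.csv.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo = Path(tmp_dir) / "output.csv"
    demo.write_bytes(b"a,b\r\nc,d\r\n")
    changed = strip_trailing_newline(demo)
    result = demo.read_bytes()

print(changed, result)  # True b'a,b\r\nc,d'
```

Only the final line terminator is removed; line endings inside the file are left untouched.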