Is it possible to read a CSV with `\r\n` line terminators in pandas?

I'm using pandas==1.1.5 to read a CSV file. I'm running the following code:
import pandas as pd
import csv
csv_kwargs = dict(
    delimiter="\t",
    lineterminator="\r\n",
    quoting=csv.QUOTE_MINIMAL,
    escapechar="!",
)
pd.read_csv("...", **csv_kwargs)
It raises the following error: ValueError: Only length-1 line terminators supported.
Pandas documentation confirms that line terminators should be length-1 (I suppose single character).
Is there any way to read this CSV with Pandas or should I read it some other way?
Note that the docs mention the length-1 restriction for the C parser; maybe I can plug in some other parser?
EDIT: Not specifying the line terminator raises a parse error in the middle of the file, specifically "ParserError: Error tokenizing data". It expects a certain number of fields but gets too many.
EDIT2: I'm confident the kwargs above were used to create the CSV file I'm trying to read.

The problem might be the escapechar, since ! is a common text character.
Python's csv module defines a very strict use of escapechar:
A one-character string used by the writer to escape the delimiter if quoting is set to QUOTE_NONE and the quotechar if doublequote is False.
but it's possible that pandas interprets it differently:
One-character string used to escape other characters.
It's possible that you have a row that contains something like:
...\t"some important text!"\t...
which would escape the quote character and continue parsing text into that column.
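To illustrate the effect, here is a minimal sketch that uses the standard-library csv reader rather than pandas (whose parser may differ in detail). The sample row is made up for the demonstration; it only assumes a quoted field that ends with "!" just before the closing quote.

```python
import csv

# Hypothetical row: the middle field ends with "!" right before the closing quote.
row = 'a\t"some important text!"\tb'

# Without an escapechar, the row splits into three fields as expected.
plain = next(csv.reader([row], delimiter="\t"))

# With escapechar="!", the reader treats !" as an escaped quote, so the quoted
# field does not close there and the following tab and text are absorbed into it.
escaped = next(csv.reader([row], delimiter="\t", escapechar="!"))

print(len(plain), plain)
print(len(escaped), escaped)
```

If the real file shows the same pattern, comparing field counts with and without the escapechar on the failing line is a quick way to check this hypothesis.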

Related

How to read $ character while reading a csv using pandas dataframe

I want to ignore the $ sign while reading a CSV file. I have tried multiple encoding options, such as latin-1, utf-8, utf-16, utf-32, ascii, utf-8-sig, unicode_escape and rot_13,
as well as encoding_errors='replace', but nothing seems to work.
Below is a dummy data set in which the '$' is read incorrectly: the text between two '$' signs is converted to a bold-italic font.
This is what the original data set looks like:
Code:
df = pd.read_csv("C:\\Users\\nitin2.bhatt\\Downloads\\CCL\\dummy.csv")
df.head()
Please help; I have referred to multiple blogs but couldn't find a solution to this.

Encoding Error of Reading .dta Files with Chinese Characters

I am trying to read .dta files with pandas:
import pandas as pd
my_data = pd.read_stata('filename', encoding='utf-8')
the error message is:
ValueError: Unknown encoding. Only latin-1 and ascii supported.
Other encodings, such as gb18030 or gb2312 for handling Chinese characters, didn't work either. If I remove the encoding parameter, the DataFrame is full of garbage values.
Simply read the original data with the default encoding, then convert it to the expected encoding. Suppose the column with garbled text is column1:
import pandas as pd
dta = pd.read_stata('filename.dta')
print(dta['column1'][0].encode('latin-1').decode('gb18030'))
The printed result shows normal Chinese characters; gb2312 also works.
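The reason this round trip works can be sketched without pandas: latin-1 maps every byte to exactly one code point, so encoding the garbled string back to latin-1 recovers the original bytes, which can then be decoded with the intended codec. The sample string below is made up, and the sketch assumes the file's bytes really are GB18030.

```python
original = "繁體中文"

# Simulate the situation described above: GB18030 bytes decoded as latin-1.
garbled = original.encode("gb18030").decode("latin-1")

# latin-1 is a lossless byte-to-code-point mapping, so encoding back
# yields the original bytes, which are then decoded with the right codec.
repaired = garbled.encode("latin-1").decode("gb18030")

print(repaired == original)
```

For a whole column, something like dta['column1'].str.encode('latin-1').str.decode('gb18030') should apply the same conversion, though that is not tested here.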
Looking at the source code of pandas (version 0.22.0), the supported encodings for read_stata are ('ascii', 'us-ascii', 'latin-1', 'latin_1', 'iso-8859-1', 'iso8859-1', '8859', 'cp819', 'latin', 'latin1', 'L1'). So you can only choose from this list.

Catch bad lines in csv file using pandas read_csv

I am using pandas read_csv to read a 140k lines csv file. The format of the file is as follows:
"HEAD1", "HEAD2", "HEAD3"
"line1-1", "line1-2", "line1-3"
"line2-1", "line2-2", "line2-3"
There are some invalid lines as follows:
"line"3-1", "line3-2",, "li"ne3-4"
How can I catch and print out the invalid lines? Is it possible to do so using the read_csv function, or do I need to use csv.reader and check each line using a regular expression? If so, can somebody help me build a regular expression? I came up with the following, but it does not work:
^".+\",?"?
Thank you.
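One possible approach, sketched here under assumptions rather than as a definitive answer: pre-scan the file line by line with a regular expression before handing it to read_csv. The sketch assumes every valid field is wrapped in double quotes, contains no embedded quotes, and that fields are separated by a comma plus optional whitespace, as in the sample above. The names FIELD, LINE_RE and find_bad_lines are hypothetical.

```python
import re

# A valid field: a double-quoted run with no embedded quotes (assumption).
FIELD = r'"[^"]*"'
# A valid line: one or more such fields separated by a comma and optional spaces.
LINE_RE = re.compile(rf'^{FIELD}(?:,\s*{FIELD})*$')

def find_bad_lines(lines, expected_fields):
    """Return (line_number, text) for lines that do not match the format."""
    bad = []
    for lineno, line in enumerate(lines, start=1):
        text = line.rstrip("\r\n")
        well_formed = LINE_RE.match(text) is not None
        if not well_formed or len(re.findall(FIELD, text)) != expected_fields:
            bad.append((lineno, text))
    return bad

sample = [
    '"HEAD1", "HEAD2", "HEAD3"\n',
    '"line1-1", "line1-2", "line1-3"\n',
    '"line2-1", "line2-2", "line2-3"\n',
    '"line"3-1", "line3-2",, "li"ne3-4"\n',
]
bad = find_bad_lines(sample, expected_fields=3)
for lineno, text in bad:
    print(lineno, text)
```

For a real file, the same function can be called on an open file object, since iterating over it yields lines.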

psycopg2: export csv to database, dealing with e+ expression

I have a CSV file containing numbers like "1.456e+07", and I am using the copy_expert function to load the file into a database, but I get this error:
psycopg2.DataError: invalid input syntax for integer: "1.5637e+07"
I notice that I can insert "100" as an integer, but "1.5637e+07" with quotes doesn't work.
I am using the pandas DataFrame's to_csv to generate the CSV files. I'm not sure how to get rid of the quotes only for numbers like "1.5637e+07" (I also have a string column), or whether there is another solution.
I found the solution.
Normally, pandas doesn't put quotes around numbers. However, I had set the float_format parameter, which caused this. I reset
quoting=csv.QUOTE_MINIMAL
in the function call and the quotes go away.
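The post does not say which quoting mode was in effect before the change. One plausible explanation, offered here as an assumption, is that a formatted float reaches the CSV writer as a string, so a mode such as csv.QUOTE_NONNUMERIC quotes it. The sketch below shows that behaviour with the standard-library writer only; write_row is a hypothetical helper.

```python
import csv
import io

def write_row(row, quoting):
    """Write one row with the given quoting mode and return the raw text."""
    buf = io.StringIO()
    csv.writer(buf, quoting=quoting).writerow(row)
    return buf.getvalue()

# A real float is left unquoted by QUOTE_NONNUMERIC.
as_float = write_row([1.5637e7, "abc"], csv.QUOTE_NONNUMERIC)

# The same number pre-formatted as text is treated like any other string.
as_text = write_row(["1.5637e+07", "abc"], csv.QUOTE_NONNUMERIC)

# QUOTE_MINIMAL only quotes fields that contain special characters.
minimal = write_row(["1.5637e+07", "abc"], csv.QUOTE_MINIMAL)

print(repr(as_float))
print(repr(as_text))
print(repr(minimal))
```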

making a list of traditional Chinese characters from a string

I am currently trying to estimate the number of times each character is used in a large sample of traditional Chinese characters. I am interested in characters not words. The file also includes punctuation and western characters.
I am reading in an example file of traditional Chinese characters. The file contains a large sample of traditional Chinese characters. Here is a small subset:
首映鼓掌10分鐘 評語指不及《花樣年華》
該片在柏林首映,完場後獲全場鼓掌10分鐘。王家衛特別為該片剪輯「柏林版本
增減20處 趙本山香港戲分被刪
在柏林影展放映的《一代宗師》版本
教李小龍武功 葉問決戰散打王
另一增加的戲分是開場時葉問(梁朝偉飾)
My strategy is to read each line, split each line into a list, and go through and check each character to see if it already exists in a list or a dictionary of characters. If the character does not yet exist in my list or dictionary I will add it to that list, if it does exist in my list or dictionary, I will increase the counter for that specific character. I will probably use two lists, a list of characters, and a parallel list containing the counts. This will be more processing, but should also be much easier to code.
I have not gotten anywhere near this point yet.
I am able to read in the example file successfully. Then I am able to make a list for each line of my file. I am able to print out those individual lines into my output file and sort of reconstitute the original file, and the traditional Chinese comes out intact.
However, I run into trouble when I try to make a list of each character on a particular line.
I've read through the following article. I understood many of the comments, but unfortunately, was unable to understand enough of it to solve my problem.
How to do a Python split() on languages (like Chinese) that don't use whitespace as word separator?
My code looks like the following
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import codecs
wordfile = open('Chinese_example.txt', 'r')
output = open('Chinese_output_python.txt', 'w')
LINES = wordfile.readlines()
Through various tests I am sure the following line is not splitting the string LINES[0] into its component Chinese characters.
A_LINE = list(LINES[0])
output.write(A_LINE[0])
I mean you want to use this, from answerer 'flow' at How to do a Python split() on languages (like Chinese) that don't use whitespace as word separator? :
from re import compile as _Re
_unicode_chr_splitter = _Re( '(?s)((?:[\ud800-\udbff][\udc00-\udfff])|.)' ).split
def split_unicode_chrs( text ):
    return [ chr for chr in _unicode_chr_splitter( text ) if chr ]
to successfully split a line of traditional Chinese characters. I just had to know the proper syntax for handling encoded characters; pretty basic.
my_new_list = list(unicode(LINES[0].decode('utf8')))
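The code in this thread is Python 2. As a hedged sketch of the counting strategy described in the question, the following uses Python 3, where str is already Unicode, so list(line) splits a line into characters and collections.Counter replaces the two parallel lists. The function name count_chars and the choice to skip whitespace are assumptions made for illustration.

```python
from collections import Counter

def count_chars(lines):
    """Count every non-whitespace character across the given lines."""
    counts = Counter()
    for line in lines:
        counts.update(ch for ch in line if not ch.isspace())
    return counts

# Two lines taken from the sample text in the question.
sample = ["首映鼓掌10分鐘", "完場後獲全場鼓掌10分鐘"]
counts = count_chars(sample)

print(list("首映"))
print(counts.most_common(5))
```

For the real file, passing open('Chinese_example.txt', encoding='utf-8') as lines should work, assuming the file is UTF-8 encoded.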