Encoding Error of Reading .dta Files with Chinese Characters - pandas

I am trying to read .dta files with pandas:
import pandas as pd
my_data = pd.read_stata('filename', encoding='utf-8')
The error message is:
ValueError: Unknown encoding. Only latin-1 and ascii supported.
Other encodings, such as gb18030 or gb2312 for handling Chinese characters, didn't work either. If I remove the encoding parameter, the DataFrame is full of garbled values.

Read the data with the default encoding, then re-encode each string to recover its original bytes and decode those bytes with the expected encoding. Suppose the column with garbled text is column1:
import pandas as pd
dta = pd.read_stata('filename.dta')
print(dta['column1'][0].encode('latin-1').decode('gb18030'))
This prints the Chinese characters correctly; gb2312 also works.
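To repair a whole column rather than a single cell, the same encode/decode round trip can be wrapped in a small helper. This is a minimal sketch: the name fix_mojibake is hypothetical, and it assumes the strings were decoded as latin-1 while the file actually holds GB18030 bytes.

```python
def fix_mojibake(text, wrong="latin-1", right="gb18030"):
    """Undo a decode that used the wrong codec (hypothetical helper).

    Re-encoding with the wrongly applied codec recovers the original
    bytes, which are then decoded with the intended codec.
    Non-string values (for example NaN) are returned unchanged.
    """
    if not isinstance(text, str):
        return text
    return text.encode(wrong).decode(right)


# Simulate a GB18030 string that was read as latin-1.
garbled = "中文".encode("gb18030").decode("latin-1")
restored = fix_mojibake(garbled)
print(restored)
```

With the DataFrame from the answer, this could be applied as dta['column1'] = dta['column1'].map(fix_mojibake).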

Looking at the source code of pandas (version 0.22.0), the supported encodings for read_stata are ('ascii', 'us-ascii', 'latin-1', 'latin_1', 'iso-8859-1', 'iso8859-1', '8859', 'cp819', 'latin', 'latin1', 'L1'). So you can only choose from this list.

Related

How to read $ character while reading a csv using pandas dataframe

I want to ignore the $ sign while reading the CSV file. I have tried multiple encoding options, such as latin-1, utf-8, utf-16, utf-32, ascii, utf-8-sig, unicode_escape, and rot_13,
as well as encoding_errors='replace', but nothing seems to work.
Below is a dummy data set showing how the '$' is read: the text between two '$' signs is converted to a bold-italic font.
This is what the original data set looks like:
Code:
df = pd.read_csv("C:\\Users\\nitin2.bhatt\\Downloads\\CCL\\dummy.csv")
df.head()
Please help; I have consulted multiple blogs but couldn't find a solution.
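The thread gives no answer, so the following is only a guess at the cause. Text between '$' signs turning bold-italic looks like a notebook rendering it as MathJax, which would be a display issue and not an encoding one. Assuming that, one possible workaround is to backslash-escape the dollar signs before display. The helper name escape_dollars is hypothetical.

```python
def escape_dollars(value):
    """Backslash-escape dollar signs in strings; leave other values as-is."""
    if isinstance(value, str):
        return value.replace("$", "\\$")
    return value


sample = "price $5 to $10"
escaped = escape_dollars(sample)
print(escaped)
```

If the assumption holds, pandas also has a display option, pd.set_option('display.html.use_mathjax', False), that may be the simpler fix.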

Is it possible to read a csv with `\r\n` line terminators in pandas?

I'm using pandas==1.1.5 to read a CSV file. I'm running the following code:
import pandas as pd
import csv
csv_kwargs = dict(
delimiter="\t",
lineterminator="\r\n",
quoting=csv.QUOTE_MINIMAL,
escapechar="!",
)
pd.read_csv("...", **csv_kwargs)
It raises the following error: ValueError: Only length-1 line terminators supported.
Pandas documentation confirms that line terminators should be length-1 (I suppose single character).
Is there any way to read this CSV with Pandas or should I read it some other way?
Note that the docs suggest length-1 for C parsers; maybe I can plug in some other parser?
EDIT: Not specifying the line terminator raises a parse error in the middle of the file. Specifically ParserError: Error tokenizing data., it expects the correct number of fields but gets too many.
EDIT2: I'm confident the kwargs above were used to create the CSV file I'm trying to read.
The problem might be in the escapechar, since ! is a common text character.
Python's csv module defines a very strict use of escapechar:
A one-character string used by the writer to escape the delimiter if quoting is set to QUOTE_NONE and the quotechar if doublequote is False.
but it's possible that pandas interprets it differently:
One-character string used to escape other characters.
It's possible that you have a row that contains something like:
...\t"some important text!"\t...
which would escape the quote character and continue parsing text into that column.
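One rough way to test this hypothesis is to scan the raw text for lines where the escape character sits directly before a quote character. The sketch below uses only the standard library. The helper name suspicious_lines and the sample data are made up for illustration, and the line numbers are approximate if quoted fields contain embedded newlines. It also shows that Python's csv.reader splits on "\r\n" without being told the line terminator.

```python
import csv
import io


def suspicious_lines(text, escapechar="!", quotechar='"'):
    """Return 1-based numbers of lines where escapechar directly precedes quotechar."""
    needle = escapechar + quotechar
    return [n for n, line in enumerate(text.splitlines(), start=1) if needle in line]


# Made-up tab-separated sample with "\r\n" line endings.
raw = 'id\tnote\r\n1\t"fine"\r\n2\t"some important text!"\r\n'

flagged = suspicious_lines(raw)
print(flagged)

# csv.reader recognises "\r" and "\n" itself; newline="" keeps them untranslated.
# No escapechar is passed here, so "!" is treated as ordinary text.
rows = list(csv.reader(io.StringIO(raw, newline=""), delimiter="\t"))
print(rows)
```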

how to import csv with emojis in it

I'm a bit new to Python and I'm having trouble importing a CSV dataset that has emojis in it.
I've been using:
pd.read_csv(DATA, encoding = 'ISO-8859-1')
and the emojis are turning into gibberish, for example: 🙌 ---> ð\n\n
I've tried changing the encoding to 'UTF-8', but that prevents the import from working at all, with the following error: 'utf-8 codec can't decode bytes in position...'
I'm not really sure why the utf-8 encoding import doesn't work either. When I open the file with notepad++ and check the encoding, it shows utf-8.
My eventual goal is to convert the emojis to meaningful text with demojize from emoji. However I'm pretty sure I need the emojis to be in utf-8 encoding in order for that to work.
Any ideas?
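The thread gives no answer. A possible first step, assuming the file is mostly UTF-8 with a few stray bytes, is to find out exactly which bytes break the decoder. The sketch below uses a hypothetical helper, find_invalid_utf8, and made-up sample bytes. It also reproduces the 'ð' gibberish from the question, which is what the first UTF-8 byte of 🙌 looks like when read as ISO-8859-1.

```python
def find_invalid_utf8(data, limit=10):
    """Return up to `limit` (offset, offending_bytes) pairs that break UTF-8 decoding."""
    problems = []
    pos = 0
    while pos < len(data) and len(problems) < limit:
        try:
            data[pos:].decode("utf-8")
            break  # the rest of the data is valid UTF-8
        except UnicodeDecodeError as exc:
            start = pos + exc.start
            end = pos + exc.end
            problems.append((start, data[start:end]))
            pos = end  # resume scanning after the bad bytes
    return problems


# Made-up sample: valid UTF-8 (including an emoji) plus one stray latin-1 byte.
data = "ok 🙌".encode("utf-8") + b"\xe9" + b" end"
problems = find_invalid_utf8(data)
print(problems)

# The emoji's UTF-8 bytes read as ISO-8859-1 start with 'ð'.
mojibake = "🙌".encode("utf-8").decode("iso-8859-1")
print(mojibake[0])
```

For a real file, the bytes could come from open(DATA, 'rb').read(). If only a few bytes are bad, read_csv with encoding='utf-8' and encoding_errors='replace' (available in newer pandas versions) may be enough to keep the emojis intact.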

File csv has "ñ" in headers, I can't read it with pandas

I'm trying to read a CSV with pandas; it has a header "año".
This is the Unicode error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd1 in position 1: invalid continuation byte
How can I read this csv file? I have a lot of files with this problem.
It is not in UTF-8 format. You need to pass encoding='ISO-8859-1' to pandas.
You should post the pandas code where it specifies UTF-8.
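Since many files are affected, one option is to try UTF-8 first and fall back to ISO-8859-1 only when decoding fails. The sketch below is an illustration only: the helper decode_with_fallback and the sample header bytes are made up. Byte 0xd1 is 'Ñ' in ISO-8859-1, which matches the error in the question.

```python
def decode_with_fallback(data, encodings=("utf-8", "iso-8859-1")):
    """Decode bytes with the first codec that succeeds; return (text, codec)."""
    for name in encodings:
        try:
            return data.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# Made-up header line: 0xd1 at position 1 is not valid UTF-8 here.
header = b"A\xd1O,valor\n"
text, used = decode_with_fallback(header)
print(text, used)
```

The codec name it returns could then be passed as the encoding argument of pd.read_csv. ISO-8859-1 never raises a decode error, so it should come last in the list.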

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcc in position 3: invalid continuation byte

I'm trying to load a csv file using pd.read_csv but I get the following unicode error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcc in position 3: invalid continuation byte
Unfortunately, CSV files have no built-in method of signalling character encoding.
read_csv defaults to guessing that the bytes in the CSV file represent text encoded in the UTF-8 encoding. This results in UnicodeDecodeError if the file is using some other encoding that results in bytes that don't happen to be a valid UTF-8 sequence. (If they by luck did also happen to be valid UTF-8, you wouldn't get the error, but you'd still get wrong input for non-ASCII characters, which would be worse really.)
It's up to you to specify what encoding is in play, which requires some knowledge (or guessing) of where it came from. For example if it came from MS Excel on a western install of Windows, it would probably be Windows code page 1252 and you could read it with:
pd.read_csv('../filename.csv', encoding='cp1252')
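Both cases described above can be reproduced with plain byte strings. This is a small illustration with made-up sample text: cp1252 bytes that are not valid UTF-8 raise an error, while bytes that happen to be valid in both encodings decode silently to different text.

```python
# Case 1: cp1252 bytes that are not valid UTF-8 raise an error.
excel_bytes = "café".encode("cp1252")  # b'caf\xe9'
try:
    excel_bytes.decode("utf-8")
    failed = False
except UnicodeDecodeError:
    failed = True
print(failed)

# Case 2: bytes valid under both codecs decode without error, but differently.
ambiguous = "Ã©".encode("cp1252")  # b'\xc3\xa9'
as_utf8 = ambiguous.decode("utf-8")
as_cp1252 = ambiguous.decode("cp1252")
print(as_utf8, as_cp1252)
```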
I got the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 51: invalid continuation byte
This happened because I had changed the file and its encoding. You could also try converting the file to UTF-8, either with some code or with the nqq editor on Ubuntu, which provides an option to change the encoding. If the problem remains, try undoing all the changes made to the file, or change the directory.
Hope this helps
Copy the code, open a new .py file, paste the code in, and save.
I had this same issue recently. This is what I did:
import pandas as pd
data = pd.read_csv(filename, encoding= 'unicode_escape')
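A caveat, shown here with made-up byte strings: Python's unicode_escape codec reads raw bytes as Latin-1 and also interprets backslash sequences. It makes the error go away, but it can garble UTF-8 text and rewrite any literal backslashes in the data.

```python
# UTF-8 bytes are read as Latin-1, so multi-byte characters become mojibake.
utf8_bytes = "café".encode("utf-8")  # b'caf\xc3\xa9'
decoded = utf8_bytes.decode("unicode_escape")
print(decoded)

# A literal backslash followed by "n" in the data becomes a real newline.
path_bytes = b"C:\\new"  # the six bytes C : \ n e w
path = path_bytes.decode("unicode_escape")
print(repr(path))
```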