Python can't decode \x byte - pandas

I have a csv file with about 9 million rows. While processing it in Python, I got an error:
UnicodeEncodeError: 'charmap' codec can't encode character '\xe9' in position 63: character maps to <undefined>
Turns out the string is Beyonc\xe9, so I guess it's something like é.
I tried just printing '\xe' in Python and it failed:
>>> print('\xe')
File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-2: truncated \xXX escape
So I can't even replace or strip the backslash with s.replace('\\x', '') or s.strip('\\x').
Is there a quick way to fix this over the whole file? I tried to set the encoding while reading the file:
pandas.read_csv(inputFile, encoding='utf-8')
but it didn't help. Same problem.
Python version:
python --version
Python 3.5.2
although I installed 3.6.5
Windows 10
Update:
Following @Matti's answer I changed the encoding in pandas.read_csv() to latin1, and now the string became Beyonc\xc3\xa9. \xc3\xa9 is the UTF-8 byte sequence for é.
This is the line that's failing:
print(str(title) + ' , ' + str(artist))
title = 'Crazy In Love'
artist = 'Beyonc\xc3\xa9'
The api is from lyricsgenius.

The '\xe9' in the error message isn't an actual backslash followed by letters, it's just a representation of a single byte in the file. Your file is probably encoded as Latin-1, not UTF-8 as you specify. Specify 'latin1' as the encoding instead.
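A minimal sketch of that suggestion (the file name is a placeholder):
import pandas as pd
# Latin-1 maps every byte 0x00-0xFF to a code point, so this read never
# raises a decode error (though it will mojibake a file that is really
# UTF-8, as the update above shows).
df = pd.read_csv('songs.csv', encoding='latin1')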

Related

Import sql file invalid byte sequence for encoding "UTF8": 0x80

I'm trying to import an SQL file into my Rails app with a PostgreSQL database, but when I run ActiveRecord::Base.connection.execute(IO.read("tmp/FILE.SQL"))
I get this error: PG::CharacterNotInRepertoire: ERROR: invalid byte sequence for encoding "UTF8": 0x80
I never found an answer here for the 0x80 error code.
When I check with the file command I get: Non-ISO extended-ASCII text, with very long lines (334), with CRLF line terminators
I can't change the SQL file because it comes from a client, so parsing the file without importing it could be another solution if the problem is in the file itself.
Any chance that your data has the Euro symbol in it? Character 0x80 is € in the Win-1252 character set. If that's what's going on, try this method of converting to UTF-8:
ActiveRecord::Base.connection.execute(File.read('tmp/FILE.SQL', encoding: 'cp1252').encode('utf-8'))

pandas read .txt doc with specific separator

I'm trying to use pandas to read a .txt file that has a ^ separator.
I keep running into the error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf8 in position 7: invalid start byte
I tried pd.read_csv(txt.file, sep='^', header=None)
txt.file has no headers.
Am I missing an argument?
UPDATE:
13065^000000000^aaaaa^test , conditions^123455^^01.01.01:Date^^^^^^ 77502^000000123^aaaaa^test, conditions^123456^^^^^^^^
It seems there is an uneven number of the separator ^ on each row.
How could I fix this?
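A sketch of one possible fix: the 0xf8 byte is not valid UTF-8, so the file is probably in a single-byte encoding such as Latin-1 (an assumption), and the ragged rows can be padded by naming the maximum number of columns up front. The file name is a placeholder:
import pandas as pd
enc = 'latin1'  # assumption: 0xf8 is a valid Latin-1/cp1252 byte, not UTF-8
# Find the widest row first, so shorter rows are padded with NaN
# instead of raising a tokenizing error.
with open('data.txt', encoding=enc) as f:
    width = max(line.count('^') for line in f) + 1
df = pd.read_csv('data.txt', sep='^', header=None, encoding=enc, names=range(width))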

getting Unicode decode error while using gTTS in python

While using the gTTS Google translator module in Python 2.x, I am getting this error:
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/gtts/tts.py", line 94, in __init__
    if self._len(text) <= self.MAX_CHARS:
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/gtts/tts.py", line 154, in _len
    return len(unicode(text))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 0: ordinal not in range(128)
Even though I have included # -*- coding: utf-8 -*- in my Python script, I get this error whenever I use non-ASCII characters. Is there some other way to do this, e.g. writing the sentence in English and having it translated into another language? I tried that too, but it is not working: I get speech in English with only the accent changed.
I have searched everywhere but can't find an answer. Please help!
I tried writing the string in unicode format, as:
u"Qu'est-ce que tu fais? Gardez-le de côté."
The characters are then handled as unicode rather than as raw ASCII bytes, which resolves the error. So the text you want converted into speech can contain utf-8 (non-ASCII) characters and is transformed without trouble.
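For example, a minimal Python 2 sketch (the language code and output file name are assumptions):
# -*- coding: utf-8 -*-
from gtts import gTTS
# The u-prefix makes the text a unicode object, so gTTS's internal
# len(unicode(text)) check no longer has to decode raw ASCII bytes.
text = u"Qu'est-ce que tu fais? Gardez-le de côté."
tts = gTTS(text=text, lang='fr')  # 'fr' is an assumption
tts.save('out.mp3')  # hypothetical output file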
You also need to decode the incoming argument in gtts-cli.py (with import codecs added at the top).
Change the following line from this:
if args.text == "-":
text = sys.stdin.read()
else:
text = args.text
to this:
if args.text == "-":
text = sys.stdin.read()
else:
text = codecs.decode(args.text, "utf-8")
works for me, give it a try.

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcc in position 3: invalid continuation byte

I'm trying to load a csv file using pd.read_csv but I get the following unicode error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcc in position 3: invalid continuation byte
Unfortunately, CSV files have no built-in method of signalling character encoding.
read_csv defaults to guessing that the bytes in the CSV file represent text encoded in the UTF-8 encoding. This results in UnicodeDecodeError if the file is using some other encoding that results in bytes that don't happen to be a valid UTF-8 sequence. (If they by luck did also happen to be valid UTF-8, you wouldn't get the error, but you'd still get wrong input for non-ASCII characters, which would be worse really.)
It's up to you to specify what encoding is in play, which requires some knowledge (or guessing) of where it came from. For example if it came from MS Excel on a western install of Windows, it would probably be Windows code page 1252 and you could read it with:
pd.read_csv('../filename.csv', encoding='cp1252')
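If you don't know where the file came from, one option is to let a detection library guess. A sketch, assuming the third-party chardet package is installed (the file name is reused from the example above):
import chardet
import pandas as pd
# Let chardet guess the encoding from a sample of the raw bytes.
with open('../filename.csv', 'rb') as f:
    guess = chardet.detect(f.read(100000))
print(guess)  # e.g. {'encoding': 'Windows-1252', 'confidence': 0.73, ...}
df = pd.read_csv('../filename.csv', encoding=guess['encoding'])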
I got the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 51: invalid continuation byte
This was because I had made changes to the file and its encoding. You could also try to change the encoding of the file to utf-8 with some code, or with the nqq editor in Ubuntu, which provides an option to change the encoding. If the problem remains, try to undo all the changes made to the file, or change the directory.
Hope this helps
I had this same issue recently. This is what I did:
import pandas as pd
data = pd.read_csv(filename, encoding='unicode_escape')
Copy the code, open a new .py file, enter the code, and save.

How to import from a mixed-encoding file to a PostgreSQL table

I have a 30 GB text file. The encoding of the file is UTF8, but it also contains some Windows-1252 characters. So when I try to import it, I get the following error:
ERROR: invalid byte sequence for encoding "UTF8": 0x9b
How can I fix this?
The file is already in UTF8 format; when I run the 'file' command on it, it says the encoding is UTF8. But it also contains some non-UTF8 byte sequences. For example, when I run the \copy command, after a while it gives the above-mentioned error for this row:
0B012234 Basic study of <img src="/fulltext-image.asp?format=htmlnonpaginated&src=323K744431152658_html\233_2 basic study of img src fulltext image asp format htmlnonpaginated src 323k744431152658_html 233_2 1975 Semigroup Forum semigroup forum 04861B53 19555
The issue is caused by the backslash (\).
Use CSV format, which does not treat the backslash as a special character, e.g.:
\copy t from myfile.txt with csv quote E'\x1' delimiter E'\x2'
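Separately, if the file really does mix UTF-8 with stray Windows-1252 bytes as the question describes, a hedged pre-processing sketch in Python (file names are placeholders) is to re-encode it to clean UTF-8 line by line before importing:
# Decode each line as UTF-8, falling back to cp1252 for lines with
# stray bytes such as 0x9b (cp1252's single right-angle quote).
with open('myfile.txt', 'rb') as src, open('myfile.utf8.txt', 'w', encoding='utf-8') as dst:
    for raw in src:
        try:
            dst.write(raw.decode('utf-8'))
        except UnicodeDecodeError:
            dst.write(raw.decode('cp1252', errors='replace'))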