Import sql file: invalid byte sequence for encoding "UTF8": 0x80

I'm trying to import a SQL file into my Rails app's PostgreSQL database, but when I run ActiveRecord::Base.connection.execute(IO.read("tmp/FILE.SQL"))
I get this error:
PG::CharacterNotInRepertoire: ERROR: invalid byte sequence for encoding "UTF8": 0x80
I never found an answer here for the 0x80 byte.
When I check the file with the file command, I get: Non-ISO extended-ASCII text, with very long lines (334), with CRLF line terminators
I can't change the SQL file because it comes from a client, so if the problem is in the file itself, parsing it instead of importing it directly could be another solution.

Any chance that your data has the euro symbol in it? Character 0x80 is € in the Windows-1252 character set. If that's what's going on, then try this method of converting to UTF-8:
ActiveRecord::Base.connection.execute(File.read('tmp/FILE.SQL', encoding: 'cp1252').encode('utf-8'))
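If you want to double-check that premise first, here is a quick sketch (in Python, purely to inspect the byte; the decoded value is the same regardless of language):
# 0x80 is the euro sign in Windows-1252 but can never begin a valid UTF-8 sequence
print(b'\x80'.decode('cp1252'))  # => €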

Related

Weird characters showing up when importing from csv files in sql

I'm trying to import data from a csv file into SQL but I keep getting the error
\copy owner (owner_id, owner_name, owner_surname) FROM 'C:\Users\Documents\owners.csv' DELIMITER ',';
ERROR: invalid input syntax for type integer: "0"
CONTEXT: COPY owner, line 1, column owner_id: "0"
Here's what owners.csv looks like (screenshot omitted).
I understand that the error has to do with the encoding and that I should change the encoding to UTF-8 BOM, which I have done, but the error still persists.
The "0" that COPY complains about is really hexadecimal 0xEFBBBF30: a UTF-8 byte order mark (BOM) followed by the digit 0.
Remove that BOM from the file, and you will get better results.
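If you'd rather strip the BOM programmatically than re-save the file from an editor, here is a minimal sketch in Python (owners.csv is the file from the question; the output name is made up):
# 'utf-8-sig' consumes a leading BOM if one is present
with open('owners.csv', encoding='utf-8-sig', newline='') as src:
    data = src.read()
# write the data back out as plain UTF-8, BOM-free
with open('owners_nobom.csv', 'w', encoding='utf-8', newline='') as dst:
    dst.write(data)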

Python can't decode \x byte

I have a csv file with about 9 million rows. While processing it in Python, I got an error:
UnicodeEncodeError: 'charmap' codec can't encode character '\xe9' in position 63: character maps to <undefined>
Turns out the string is Beyonc\xe9. So I guess \xe9 is something like é.
I tried just printing '\xe' in Python and it failed:
>>> print('\xe')
File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-2: truncated \xXX escape
So I can't even replace or strip the backslash with s.replace('\\x', '') or s.strip('\\x').
Is there a quick way to fix this over the whole file? I tried to set the encoding while reading the file:
pandas.read_csv(inputFile, encoding='utf-8')
but it didn't help. Same problem.
Python version:
python --version
Python 3.5.2
although I installed 3.6.5. I'm on Windows 10.
Update:
Following Matti's answer I changed the encoding in pandas.read_csv() to latin1, and now the string became Beyonc\xc3\xa9. And \xc3\xa9 is the UTF-8 byte sequence for é (which suggests the data is actually UTF-8 that got decoded byte by byte).
This is the line that's failing:
print(str(title) + ' , ' + str(artist))
title = 'Crazy In Love'
artist = 'Beyonc\xc3\xa9'
The data comes from the lyricsgenius API.
The '\xe9' in the error message isn't an actual backslash followed by letters, it's just a representation of a single byte in the file. Your file is probably encoded as Latin-1, not UTF-8 as you specify. Specify 'latin1' as the encoding instead.
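To make that concrete, a short sketch using the string from the question:
s = 'Beyonc\xe9'
print(len(s))    # 7 -- '\xe9' is a single character (é), not four characters
print(ascii(s))  # 'Beyonc\xe9'
# every byte value 0x00-0xFF is valid in Latin-1, so this decode cannot fail:
print(b'\xe9'.decode('latin1'))  # é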

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcc in position 3: invalid continuation byte

I'm trying to load a csv file using pd.read_csv but I get the following unicode error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcc in position 3: invalid continuation byte
Unfortunately, CSV files have no built-in method of signalling character encoding.
read_csv defaults to guessing that the bytes in the CSV file represent text encoded in the UTF-8 encoding. This results in UnicodeDecodeError if the file is using some other encoding that results in bytes that don't happen to be a valid UTF-8 sequence. (If they by luck did also happen to be valid UTF-8, you wouldn't get the error, but you'd still get wrong input for non-ASCII characters, which would be worse really.)
It's up to you to specify what encoding is in play, which requires some knowledge (or guessing) of where it came from. For example if it came from MS Excel on a western install of Windows, it would probably be Windows code page 1252 and you could read it with:
pd.read_csv('../filename.csv', encoding='cp1252')
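For what it's worth, the failure mode is easy to reproduce with two bytes (a sketch; the byte values are illustrative):
# 0xCC announces a two-byte UTF-8 sequence; the next byte must be in 0x80-0xBF
b'\xcc\x61'.decode('utf-8')
# UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcc in position 0: invalid continuation byte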
I got the following error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 51: invalid continuation byte
This was because I had made changes to the file, which changed its encoding. You could also try changing the encoding of the file to UTF-8, either with some code or with the Notepadqq editor on Ubuntu, which provides an option to change the encoding. If the problem remains, try undoing all the changes made to the file.
Hope this helps
I had this same issue recently. This is what I did:
import pandas as pd
# 'unicode_escape' decodes the raw bytes as Latin-1 while interpreting Python-style backslash escapes
data = pd.read_csv(filename, encoding='unicode_escape')

How to import from a mixed-encoding file to a PostgreSQL table

I have a 30 GB text file. The encoding of the file is UTF-8, but it also contains some Windows-1252 characters. So when I try to import it, I get the following error:
ERROR: invalid byte sequence for encoding "UTF8": 0x9b
How can I fix this?
The file is already mostly UTF-8: when I run the file command on it, it says the encoding is UTF-8. But it also contains some non-UTF-8 byte sequences. For example, when I run the \copy command, after a while it gives the above-mentioned error for this row:
0B012234 Basic study of <img src="/fulltext-image.asp?format=htmlnonpaginated&src=323K744431152658_html\233_2 basic study of img src fulltext image asp format htmlnonpaginated src 323k744431152658_html 233_2 1975 Semigroup Forum semigroup forum 04861B53 19555
The issue is caused by the backslash (\). In COPY's default text format a backslash introduces an escape sequence, so the \233 in that row is read as an octal escape and turned into the single byte 0x9B, which is not valid UTF-8.
Use CSV format, which does not treat backslash as a special character, e.g.:
\copy t from myfile.txt with csv quote E'\x1' delimiter E'\x2'
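You can sanity-check that explanation with a one-liner (Python):
print(hex(0o233))  # 0x9b -- exactly the byte PostgreSQL complained about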

COPY with exclude clause "invalid byte sequence for encoding "UTF8": 0xdf 0x4f"

I use COPY to import a file into my database:
COPY mytable(c1, c2) FROM '/tmp/myfile.csv' WITH DELIMITER ';' CSV HEADER
And as usual, there are some "bad" characters in the file which generate an SQL encoding error:
ERROR: invalid byte sequence for encoding "UTF8": 0xdf 0x4f
So I open the file and delete the offending line, but is there a way to have COPY exclude this kind of line by default?
Thanks for the help
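One possible pre-filtering approach outside the database (a sketch in Python; only /tmp/myfile.csv comes from the question, the other file names are made up):
# keep only lines that decode cleanly as UTF-8; collect the rest for inspection
with open('/tmp/myfile.csv', 'rb') as src, \
     open('/tmp/myfile.clean.csv', 'wb') as good, \
     open('/tmp/myfile.rejects.csv', 'wb') as bad:
    for line in src:
        try:
            line.decode('utf-8')
            good.write(line)
        except UnicodeDecodeError:
            bad.write(line)
Then point COPY at the clean file and review the rejects by hand.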