Fix Unicode Decode Error Without Specifying Encoding='UTF-8' - python-unicode

I am getting the following error:
'ascii' codec can't decode byte 0xf4 in position 560: ordinal not in range(128)
I find this very weird given that my .csv file doesn't have special characters. Perhaps it has special characters that specify header rows and what not, idk.
But the main problem is that I don't actually have access to the source code that reads in the file, so I cannot simply add the keyword argument encoding='UTF-8'. I need to figure out which encoding is compatible with codecs.ascii_decode(...). I DO have access to the .csv file that I'm trying to read, and I can adjust the encoding to that, but not the source file that reads it.
I have already tried exporting my .csv file into Western (ASCII) and Unicode (UTF-8) formats, but neither of those worked.

Fixed. Had nothing to do with unicode shenanigans, my script was writing a parquet file when my Cloud Formation Template was expecting a csv file. Thanks for the help.

Related

UnicodeDecodeError: ‘utf-8’ codec can’t decode byte 0xf1 in position 990: invalid continuation byte

I am comparing to csv files to each other to produce the final file with fathered differences information its giving me error message. I have resaved all files to csv decoded with utf-8 and tried running - it does not work. Can someone help me.
The problem is that your file is not in UTF-8 format. Many tools will refuse to handle data that is claimed to be UTF-8, but isn’t. I’d check first if that file is actually UTF-8 or is stored in some different encoding.

Combine SQL files with command `copy` in a batch file introduce an incorrect syntaxe because it does add an invisible character `U+FEFF`

In a pre-build event, a batch file is executed to combine multiple SQL files into a single one.
It is done using this command :
COPY %#ProjectDir%\Migrations\*.sql %#ProjectDir%ContinuousDeployment\AllFilesMergedTogether.sql
Everything appear to work fine but somehow the result give an incorrect syntaxe error.
After two hours of investigation, it turn out the issue is caused by an invisible character that remain invisible even with notepad++.
Using an online website, the character has been spotted and is U+FEFF has shown in following image.
Here are the two input scripts.
PRINT 'Script1'
PRINT 'Script2'
Here is the output given by the copy command.
PRINT 'Script1'
PRINT 'Script2'
Additional info :
Batch file is encoded with UTF-8
Input files are encoded with UTF-8-BOM
Output file is encoded with UTF-8-BOM.
I'm not sure it is possible to change the encoding output of command copy.
I've tried and failed.
What should be done to eradicate this extremely frustrating parasitic character?
It has turned out that changing encoding of input files to ANSI does fix the issue.
No more pesky character(s).
Also, doing so does change the encoding of the result file to UTF-8 instead of UTF-8-BOM which is great I believe.
Encoding can be changed using Notepad++ as show in following picture.

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcc in position 3: invalid continuation byte

I'm trying to load a csv file using pd.read_csv but I get the following unicode error:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xcc in position 3: invalid continuation byte
Unfortunately, CSV files have no built-in method of signalling character encoding.
read_csv defaults to guessing that the bytes in the CSV file represent text encoded in the UTF-8 encoding. This results in UnicodeDecodeError if the file is using some other encoding that results in bytes that don't happen to be a valid UTF-8 sequence. (If they by luck did also happen to be valid UTF-8, you wouldn't get the error, but you'd still get wrong input for non-ASCII characters, which would be worse really.)
It's up to you to specify what encoding is in play, which requires some knowledge (or guessing) of where it came from. For example if it came from MS Excel on a western install of Windows, it would probably be Windows code page 1252 and you could read it with:
pd.read_csv('../filename.csv', encoding='cp1252')
I got the following error
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position
51: invalid continuation byte
This was because I made changes to the file and its encoding. You could also try to change the encoding of file to utf-8 using some code or nqq editor in ubuntu as it provides directory option to change encoding. If problem remains then try to undo all the changes made to the file or change the directory.
Hope this helps
Copy the code, open a new .py file and enter code and save.
I had this same issue recently. This was what I did
import pandas as pd
data = pd.read_csv(filename, encoding= 'unicode_escape')

How do I parse a .eml file using C/C++?

How do I parse .eml files using C/C++ (or preferably Objective-C)?
I looked around but couldn't find any details about the structure of the file.
EML files are plain ASCII (7-bit) files. The file format is specified in RFC 2822. The mail header will be separated from the body by an empty line.
If you will be dealing with emails that contain attachments, or characters whose value is greater than 127, you will need a base64 decoder. maybe this link will help.

Cannot upload CSV that starts with an integer

I'm stuck with what seems like a weird BigQuery bug : I cannot upload a CSV file that starts (first line, first column) by an integer.
Here's my schema : COL1:INTEGER,COL2:INTEGER,COL3:STRING
Here's my csv file content :
100,4,XXX
100,4,XXX
If I put the STRING column as first column, the upload is OK.
If I add a header and tell BigQuery to skip it during the import, the upload is ok too.
But with the CSV and schema above, BigQuery always complains : Line:1 / Field:1, Value cannot be converted to expected type.
Anyone knows what the problem is ?
Thank you in advance,
David
I could not reproduce this problem--I copied and pasted the content into a file and uploaded it with no problems.
Perhaps the uploaded file format is corrupted somehow? If there are extra bytes at the beginning of the file, those would be ignored in a header row but might result in this error is the first value of the first field is expected to be an integer. I'd recommend examining the actual binary data in the file to make sure there's nothing funny going on.
Also, are you doing this import via web UI, command-line tool, or API? Have you tried one of the other methods?