How can I resolve a Unicode error from read_csv? - pandas

This is my first time working on a python project outside of school, so bear with me.
When I run the code below, I get the error
"(unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated\uXXXXXXXX escape"
and the IDLE editor highlights the '(' before the argument of pd.read_csv.
I googled the error but got a lot of stuff that went way over my head.
The CSV file in question is an Excel file I saved as CSV. Should I save it some other way?
import pandas as pd
field = pd.read_csv("C:\Users\Glen\Documents\Feild.csv")
I just want to convert my Excel data into a DataFrame, and I don't understand why it was so easy in class but so difficult on my home PC.

The problem is with the path, not the CSV file. In a normal Python string literal, the backslash in \Users starts an escape sequence (\U expects eight hex digits), which is why the error appears before read_csv is even called. There are two ways to write the path while reading a csv file,
1- Use double backslashes,
pd.read_csv("C:\\Users\\Glen\\Documents\\Feild.csv")
2- Use single forward slashes,
pd.read_csv("C:/Users/Glen/Documents/Feild.csv")
If the file still fails to read, try specifying the encoding explicitly,
pd.read_csv("C:\\Users\\Glen\\Documents\\Feild.csv", encoding='utf-8')
OR
pd.read_csv("C:/Users/Glen/Documents/Feild.csv", encoding='utf-8')

Related

Reading .Rda file in Pandas throws an LibrdataError error

I have data stored in .Rda format that I want to load into a pandas DataFrame. I am using the pyreadr library to do it. However, it throws an error -
LibrdataError: Unable to convert string to the requested encoding (invalid byte sequence)
Your data is in an encoding different from UTF-8, and pyreadr supports only UTF-8 data. Unfortunately nothing can be done to fix it at the moment; the limitation is explained in the README and is also tracked in this issue.
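For context, the call that typically triggers this error is just the basic pyreadr read. A minimal sketch with a hypothetical file name:
import pyreadr

# read_r returns a dict-like object mapping R object names to pandas DataFrames.
# It raises LibrdataError if the file's strings are not valid UTF-8.
result = pyreadr.read_r("mydata.Rda")   # hypothetical path
df = next(iter(result.values()))        # first (often only) object as a DataFrame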

Fix Unicode Decode Error Without Specifying Encoding='UTF-8'

I am getting the following error:
'ascii' codec can't decode byte 0xf4 in position 560: ordinal not in range(128)
I find this very weird given that my .csv file doesn't have special characters. Perhaps it has special characters that mark header rows or something similar; I don't know.
But the main problem is that I don't actually have access to the source code that reads in the file, so I cannot simply add the keyword argument encoding='UTF-8'. I need to figure out which encoding is compatible with codecs.ascii_decode(...). I DO have access to the .csv file that I'm trying to read, and I can adjust the encoding to that, but not the source file that reads it.
I have already tried exporting my .csv file into Western (ASCII) and Unicode (UTF-8) formats, but neither of those worked.
Fixed. It had nothing to do with Unicode at all: my script was writing a Parquet file while my CloudFormation template was expecting a CSV file. Thanks for the help.
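(For reference only, since the real culprit here was the file format: if a file genuinely has to pass through code that uses the default ASCII codec, a rough sketch like the following can locate and strip the offending bytes. The path is hypothetical and the file is assumed to be UTF-8 on disk.)
path = "data.csv"                       # hypothetical file
raw = open(path, "rb").read()

# Report every byte the ASCII codec would reject (anything above 0x7F).
for i, b in enumerate(raw):
    if b > 0x7F:
        print("non-ASCII byte 0x%02x at offset %d" % (b, i))

# Write an ASCII-only copy, silently dropping characters that cannot be represented.
clean = raw.decode("utf-8", errors="ignore").encode("ascii", errors="ignore")
open("data_ascii.csv", "wb").write(clean)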

Combining SQL files with the `copy` command in a batch file introduces a syntax error because it adds an invisible character `U+FEFF`

In a pre-build event, a batch file is executed to combine multiple SQL files into a single one.
It is done using this command:
COPY %#ProjectDir%\Migrations\*.sql %#ProjectDir%ContinuousDeployment\AllFilesMergedTogether.sql
Everything appears to work fine, but somehow the result gives a syntax error.
After two hours of investigation, it turned out the issue is caused by an invisible character that stays invisible even in Notepad++.
Using an online tool, the character was identified as U+FEFF (a byte-order mark).
Here are the two input scripts.
PRINT 'Script1'
PRINT 'Script2'
Here is the output given by the copy command (the U+FEFF before the second PRINT is invisible here).
PRINT 'Script1'
PRINT 'Script2'
Additional info:
Batch file is encoded with UTF-8
Input files are encoded with UTF-8-BOM
Output file is encoded with UTF-8-BOM.
I'm not sure it is possible to change the output encoding of the copy command; I've tried and failed.
What should be done to eradicate this extremely frustrating parasitic character?
It turned out that changing the encoding of the input files to ANSI fixes the issue.
No more pesky character(s).
Also, doing so changes the encoding of the result file to UTF-8 instead of UTF-8-BOM, which I believe is a good thing.
The encoding can be changed in Notepad++ via the Encoding menu (Convert to ANSI).
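Another approach (a sketch, not part of the original answer) is to keep the inputs as UTF-8-BOM and strip the marks while concatenating in Python, since the utf-8-sig codec silently removes a leading U+FEFF from each file:
import glob

# Hypothetical paths mirroring the batch script's layout.
inputs = sorted(glob.glob(r"Migrations\*.sql"))
out_path = r"ContinuousDeployment\AllFilesMergedTogether.sql"

with open(out_path, "w", encoding="utf-8") as out:
    for path in inputs:
        # utf-8-sig strips a leading BOM if present; plain UTF-8 files are read unchanged.
        with open(path, "r", encoding="utf-8-sig") as f:
            out.write(f.read())
        out.write("\n")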

IPython kernel dies unexpectedly reading large file

I'm reading in a ~3 GB CSV using pandas in an IPython notebook. While reading the file, the notebook unexpectedly gives me an error message saying the kernel appears to have died and will restart.
As per several "big data" workflows in python/pandas, I'm reading the files in as follows:
import pandas as pd

# chunksize is assumed to be defined earlier as the number of rows per chunk
tp = pd.read_csv(file_name_cleaned, chunksize=chunksize, iterator=True, low_memory=False)
df = pd.concat(tp, ignore_index=True)
My workflow has involved some preprocessing to remove all but alphanumeric characters and a few pieces of punctuation as follows:
import re

with open(file_name, 'r') as file1:
    with open(file_name_cleaned, 'w') as file2:
        for line in file1:
            # keep only lines with the expected number of delimiters,
            # stripped of everything except alphanumerics, '|', '.', and '_'
            if len(line.split(sep_string)) == num_columns:
                line = re.sub(r'[^A-Za-z0-9|._]+', '', line)
                file2.write(line + '\n')
The strange thing is that if I remove the line containing re.sub(), I get a different error - "Expected 209 fields in line 22236, saw 329" - even though I've explicitly checked for the exact number of delimiters. Visual inspection of the line and the surrounding lines doesn't really show me much either.
This process has worked fine for several other files, including ones that are larger so I don't think the size of the file is the issue although I suppose it's possible that that's an oversimplification.
I included the preprocessing because I know from experience that the data sometimes contains strange special characters. I've also gone back and forth between using encoding='utf-8' and encoding='utf-8-sig' in the read_csv() and open() statements, to no real avail.
I have several questions - does including the encoding keyword argument cause python to ignore characters outside of those character sets or does it maybe invoke some kind of conversion for those characters? I'm not very familiar with these types of issues. Is it possible that some kind of unexpected character could have slipped through my preprocessing and caused this? Is there another type of issue that I haven't found that could cause this? (I have done research but nothing has been quite right.)
Any help would be much appreciated.
Also, I'm using Anaconda 2.4, with Python 3.5.1, IPython 4.0.0, and pandas 0.17.0.
I'm not sure that this totally answers my questions, but I did solve the issue: although it is slower, using engine='python' in pd.read_csv() did the trick.
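For completeness, a sketch of what that change looks like (file_name_cleaned and chunksize are assumed to be defined as in the question; low_memory only applies to the C parser, so it is dropped here):
import pandas as pd

# Slower but more tolerant Python parsing engine.
tp = pd.read_csv(file_name_cleaned, chunksize=chunksize, iterator=True, engine='python')
df = pd.concat(tp, ignore_index=True)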

Cannot upload CSV that starts with an integer

I'm stuck with what seems like a weird BigQuery bug: I cannot upload a CSV file that starts (first line, first column) with an integer.
Here's my schema: COL1:INTEGER,COL2:INTEGER,COL3:STRING
Here's my csv file content :
100,4,XXX
100,4,XXX
If I put the STRING column as first column, the upload is OK.
If I add a header and tell BigQuery to skip it during the import, the upload is ok too.
But with the CSV and schema above, BigQuery always complains: Line:1 / Field:1, Value cannot be converted to expected type.
Does anyone know what the problem is?
Thank you in advance,
David
I could not reproduce this problem--I copied and pasted the content into a file and uploaded it with no problems.
Perhaps the uploaded file format is corrupted somehow? If there are extra bytes at the beginning of the file, those would be ignored in a header row but might result in this error if the first value of the first field is expected to be an integer. I'd recommend examining the actual binary data in the file to make sure there's nothing funny going on.
Also, are you doing this import via web UI, command-line tool, or API? Have you tried one of the other methods?
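If it helps, a quick local check for stray leading bytes such as a UTF-8 byte-order mark (EF BB BF) takes only a few lines of Python; the file name below is hypothetical:
# Print the first few raw bytes of the CSV file.
with open("data.csv", "rb") as f:
    head = f.read(8)
print(head)

if head.startswith(b"\xef\xbb\xbf"):
    print("File starts with a UTF-8 BOM; the first field will not parse as an integer.")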