IPython kernel dies unexpectedly reading large file - pandas

I'm reading a ~3 GB CSV with pandas in an IPython notebook. While reading the file, the notebook unexpectedly gives me an error message saying the kernel appears to have died and will restart.
As per several "big data" workflows in Python/pandas, I'm reading the file in as follows:
import pandas as pd
tp = pd.read_csv(file_name_cleaned, chunksize=100000, iterator=True, low_memory=False)  # chunk size is illustrative
df = pd.concat(tp, ignore_index=True)
My workflow has involved some preprocessing to remove all but alphanumeric characters and a few pieces of punctuation as follows:
import re

with open(file_name, 'r') as file1:
    with open(file_name_cleaned, 'w') as file2:
        for line in file1:
            # keep only rows with the expected number of columns
            if len(line.split(sep_string)) == num_columns:
                # strip everything except alphanumerics, '|', '.' and '_'
                line = re.sub(r'[^A-Za-z0-9|._]+', '', line)
                file2.write(line + '\n')
The strange thing is that if I remove the line containing re.sub(), I get a different error - "Expected 209 fields in line 22236, saw 329" - even though I've explicitly checked for the exact number of delimiters. Visual inspection of the line and the surrounding lines doesn't really show me much either.
This process has worked fine for several other files, including ones that are larger, so I don't think the file size is the issue, although I suppose it's possible that that's an oversimplification.
I included the preprocessing because I know from experience that the data sometimes contains strange special characters. I've also gone back and forth between encoding='utf-8' and encoding='utf-8-sig' in the read_csv() and open() calls, to no real avail.
I have several questions: does including the encoding keyword argument cause Python to ignore characters outside that character set, or does it invoke some kind of conversion for those characters? I'm not very familiar with these types of issues. Is it possible that some kind of unexpected character slipped through my preprocessing and caused this? Is there another type of issue that I haven't found that could cause this? (I have done research, but nothing has been quite right.)
Any help would be much appreciated.
Also, I'm using Anaconda 2.4, with Python 3.5.1, IPython 4.0.0, and pandas 0.17.0.

I'm not sure that this totally answers my questions, but I did solve the issue: although it is slower, using engine='python' in pd.read_csv() did the trick.
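For reference, a minimal sketch of the call that ended up working (the chunk size is just an illustrative value; low_memory only applies to the C parser, so it is omitted here):

import pandas as pd

tp = pd.read_csv(file_name_cleaned, chunksize=100000, iterator=True, engine='python')
df = pd.concat(tp, ignore_index=True)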

Related

How can I resolve a Unicode error from read_csv?

This is my first time working on a python project outside of school, so bear with me.
When I run the code below, I get the error
"(unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated\uXXXXXXXX escape"
and the IDLE editor highlights the '(' before the argument of pd.read_csv.
I googled the error but got a lot of stuff that went way over my head.
The CSV file in question is an Excel file I saved as CSV. Should I save it some other way?
import pandas as pd
field = pd.read_csv("C:\Users\Glen\Documents\Feild.csv")
I just want to convert my Excel data into a data frame, and I don't understand why it was so easy in class and is now so difficult on my home PC.
The problem is with the path. There are two ways to write the path when reading a CSV file:
1- Use double backslashes,
pd.read_csv("C:\\Users\\Glen\\Documents\\Feild.csv")
2- Use forward slashes,
pd.read_csv("C:/Users/Glen/Documents/Feild.csv")
If these do not work, try adding an explicit encoding:
pd.read_csv("C:\\Users\\Glen\\Documents\\Feild.csv", encoding='utf-8')
OR
pd.read_csv("C:/Users/Glen/Documents/Feild.csv", encoding='utf-8')

'UnicodeEncodeError' when using 'set_dataframe'

When using set_dataframe to update my Google Sheets via pygsheets and pandas, I get this error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf1' in position 13: ordinal not in range(128)
This is due to non-ASCII (accented) characters in some of the text, e.g.: "señor"
This happens on executing:
wks.set_dataframe(df, start='A10')
Pandas to_csv accepts an encoding parameter, e.g. encoding="utf-8"; may I suggest set_dataframe do the same?
wks.set_dataframe(df, start='A10', encoding="utf-8")
I see there's a ticket opened 10 days ago here, but is there a workaround?
Solution:
I ran into the same issue, and I think that, more than a bug in the pygsheets module, it is a limitation, as you clearly point out.
What I did to solve this issue was:
def encodeDataFrame(df, encoding='UTF-8'):
    # Python 2 helper: encode unicode columns to str so that
    # set_dataframe does not attempt an implicit ascii encode.
    if df is not None and not df.empty:
        for column, series in df.items():
            if type(series.at[0]) == unicode:
                try:
                    encodedSeries = series.str.encode(encoding)
                    df[column] = encodedSeries
                except Exception as e:
                    print 'Could not encode column %s' % column
                    raise
And you can call the function this way:
encodeDataFrame(df)
wks.set_dataframe(df, start='A10')
This might no longer be a good solution because of a change that was made in pygsheets to avoid this issue. See EDIT section below
Explanation:
You solve the issue by encoding the unicode values yourself, before sending them to the set_dataframe function.
This problem comes up whenever you try to use the Worksheet.set_dataframe function with a dataframe that contains unicode characters that cannot be encoded in ASCII (such as accented letters and many others).
The exception is thrown because the set_dataframe function attempts to cast the unicode values into str values (using the default encoding). In Python 2 the default encoding is ascii, and when a character outside the ASCII range is found, the exception is thrown.
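Here is a minimal Python 2 sketch of that failure (the value is just an example): an implicit str() conversion of a non-ASCII unicode value uses the default ascii codec, which is essentially what set_dataframe runs into.

value = u'se\xf1or'  # u'señor'
try:
    str(value)  # implicit ascii encode
except UnicodeEncodeError as e:
    print e  # 'ascii' codec can't encode character u'\xf1' ...

print value.encode('utf-8')  # explicit encoding avoids the implicit conversion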
Some people have suggested reloading the sys module to circumvent this problem, but there are good explanations of why you should not do that.
The other solution I would think of is to use the pygsheets module under Python 3, where this should no longer be a problem because the default encoding in Python 3 is UTF-8 (see the docs).
Bonus:
Ask yourself:
1) Is Unicode an encoding?
2) What is an encoding?
If you hesitated on either of those questions, you should read this article, which gave me the knowledge needed to think of this solution. The time invested was completely worth it.
For more information, you can try this article which links to the previous one at the end.
EDIT:
A change was made 2 days ago (07/26/19) to pygsheets that is supposed to fix this. It looks like the intention is to avoid encoding into the str type, but I suspect this change might instead try decoding str values into the unicode type using the default ascii encoding, which could also lead to trouble. When/if this change is released, it will probably be better not to encode anything and to pass values as unicode to the set_dataframe function.
EDIT 2:
This change has now been released in version 2.0.2. If my predictions were true, using the encodeDataFrame function I suggested will result in a UnicodeDecodeError, because the Worksheet.set_dataframe function will attempt to decode the str values with the default ascii encoding. So the best way to use the function is not to have any str values in your dataframe; if you have them, decode them to unicode before calling the set_dataframe function. You could use a mirror version of the function I suggested. It would look something like this:
def decodeDataFrame(df, encoding='UTF-8'):
    # Mirror of encodeDataFrame: decode str columns to unicode so that
    # set_dataframe does not attempt an implicit ascii decode.
    if df is not None and not df.empty:
        for column, series in df.items():
            if type(series.at[0]) == str:
                try:
                    decodedSeries = series.str.decode(encoding)
                    df[column] = decodedSeries
                except Exception as e:
                    print 'Could not decode column %s' % column
                    raise
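As with the encoding helper, you would call it right before writing the sheet:

decodeDataFrame(df)
wks.set_dataframe(df, start='A10')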

pandas to_latex() escaping "\\newline" for use in LaTeX

I am struggling with inexplicable behavior differences between Windows and Mac handling of \newline in a jinja template string.
I have a string, mystring = '123 \\newline 456', that is inside a dataframe. When I use pandas to_latex() on a Mac, this gets converted to 123 \textbackslashnewline. But on Windows, it gets converted to 123 \textbackslash newline
The difference between the two complicates the work-around, which is to replace \textbackslash with an empty string. On the Windows version I need to include the space and on the Mac version I need to not include the space.
Is there a better way to incorporate \newline into the table? I tried r'\newline' and I tried just '\newline', and neither works. The reason the latter doesn't work is that the IDE I am using (PyCharm) interprets the single backslash as a line break rather than as part of the string. The double backslash that I use is simply meant to get around that, but then to_latex() interprets the double backslash as \textbackslash.
I discovered that the issue was really with pandas to_latex() command and have edited the original question.
For the Windows version, pandas was replacing \\newline with '\textbackslash newline'. But on the Mac, it was replacing it with '\textbackslashnewline' without the space. I was then using:
.replace('textbackslash', '')
which worked on the Mac but failed on Windows because of the extra space. What I cannot figure out though is why there is a difference between the Mac and Windows to_latex() functions.
Note that I am aware of the escape=False option in to_latex(); however, that messes up instances where I do want parts of the string escaped (e.g. \$).
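One platform-agnostic workaround, sketched below on made-up data and assuming \newline is the only escaped command you need to restore, is to post-process the to_latex() output with a regex that tolerates the optional space:

import re
import pandas as pd

df = pd.DataFrame({'col': ['123 \\newline 456']})  # illustrative data
latex = df.to_latex()
# Turn "\textbackslash newline" or "\textbackslashnewline" (the spacing
# differs between platforms) back into a literal \newline command.
latex = re.sub(r'\\textbackslash\s*newline', r'\\newline', latex)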

Enable showing differences in line separators in a diff with IntelliJ IDEA 13

I'm using IDEA 13.0.1. A unit test is failing because of a line separator mismatch. But when I try to compare the two results, IDEA says "Contents have differences only in line separators".
And I can't find a setting that tells it to show me these differences. Is there one?
I ran into the same issue. I couldn't find a way to show the difference by clicking Show Differences, but if you put a breakpoint and look at the strings, they have the line separators written out. For me, one string had \n and the other had \r\n. (\r is also a possible line separator.)
I ran into the same problem recently. I found a workaround for it; it is not pretty, but it did the job:
yourString.replaceAll("\n", "").replaceAll("\r", "");
The key is in what @user1633977 said.
To solve this problem, always use System.lineSeparator() to create your separators. That way they will always be the same and will be OS-independent.
(Remember that different operating systems can use different symbols as line separators.)

Script consecutive Replace-All operations in Notepad++

Is there a way to script consecutive Replace-All operations in Notepad++?
For example, I want to first replace all “ characters with &ldquo;, then replace all ” characters with &rdquo;, and then replace all "string1" with "string2", etc...
Nevermind,
I finally figured it out, and it seems like a great solution.
I used PythonScript for Notepad++ (which is what I started with, but it kept giving me errors until I finally fixed a few things).
So, here is the code for those who may be interested:
# This Python file uses the following encoding: utf-8
import os, sys
editor.replace(r"“", r"&ldquo;")
editor.replace(r"”", r"&rdquo;")
editor.replace(r"’", r"&rsquo;")
The r before the quotation marks allows the special characters to be used as they are, and this is what was so difficult for me to get working.
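The same approach extends to the plain string replacements mentioned in the question; for example (string1 and string2 are placeholders):

editor.replace(r"string1", r"string2")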