'UnicodeEncodeError' when using 'set_dataframe' - pandas

When using set_dataframe to update my Google Sheets via pygsheets and pandas, I get this error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf1' in position 13: ordinal not in range(128)
This is due to accented (non-ASCII) characters in some of the text, e.g. "señor".
This happens on executing:
wks.set_dataframe(df, start='A10')
Pandas' to_csv accepts an encoding parameter, e.g. encoding="utf-8"; may I suggest that set_dataframe do the same?
wks.set_dataframe(df, start='A10', encoding="utf-8")
I see there's a ticket opened 10 days ago here, but is there a workaround in the meantime?

Solution:
I ran into the same issue, and I think that, more than a bug in the pygsheets module, it is a limitation, as you clearly point out.
What I did to solve this issue was:
def encodeDataFrame(df, encoding='UTF-8'):
    # Python 2: encode every unicode column of the DataFrame in place.
    if df is not None and not df.empty:
        for column, series in df.items():
            if type(series.at[0]) == unicode:
                try:
                    encodedSeries = series.str.encode(encoding)
                    df[column] = encodedSeries
                except Exception:
                    print 'Could not encode column %s' % column
                    raise
And you can call the function this way:
encodeDataFrame(df)
wks.set_dataframe(df, start='A10')
This might no longer be a good solution because of a change that was made to pygsheets to avoid this issue; see the EDIT section below.
Explanation:
You solve the issue by encoding the unicode values yourself, before sending them to the set_dataframe function.
This problem comes up whenever you try to use the Worksheet.set_dataframe function with a dataframe that contains unicode characters that cannot be encoded in ascii (such as accented letters and many others).
The exception is thrown because the set_dataframe function attempts to cast the unicode values into str values (using the default encoding). In Python 2, the default encoding is ascii, and when a character outside the ascii range is found, the exception is thrown.
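As a minimal illustration (Python 2 only; the value below is just an example), the same error can be reproduced outside pygsheets by letting Python implicitly encode a non-ascii unicode value with the default ascii codec:
# Python 2
value = u'se\xf1or'        # u'señor'
value.encode('utf-8')      # explicit UTF-8 encode works: 'se\xc3\xb1or'
str(value)                 # implicit ascii encode raises UnicodeEncodeError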
Some people have suggested reloading the sys module to circumvent this problem, but here is an explanation of why you should not do that.
The other solution I can think of is to use the pygsheets module in Python 3, where this should no longer be a problem, because the default encoding in Python 3 is UTF-8 (see the docs).
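As a quick check of that claim, in Python 3 str.encode() defaults to UTF-8, so the same text encodes without error:
# Python 3
'señor'.encode()           # b'se\xc3\xb1or', no UnicodeEncodeError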
Bonus:
Ask yourself:
1) Is Unicode an encoding?
2) What is an encoding?
If you hesitated with any of those questions, you should read this article, which gave me the knowledge needed to think of this solution. The time invested was completely worth it.
For more information, you can try this article which links to the previous one at the end.
EDIT:
A change was made 2 days ago (07/26/19) to pygsheets that is supposed to fix this. It looks like the intention is to avoid encoding into the str type, but I would think that this change might try decoding strings into the unicode type from the default ascii encoding, which could also lead to trouble. When/If this change is released it is probably going to be better not to encode anything and pass values as unicode to the set_dataframe function.
EDIT 2:
This change has now been released in version 2.0.2. If my prediction is correct, using the encodeDataFrame function I suggested will result in a UnicodeDecodeError, because the Worksheet.set_dataframe function will attempt to decode the str values with the default ascii encoding. So the best way to use the function will be not to have any str values in your dataframe. If you have them, decode them to unicode before calling the set_dataframe function. You could have a mirror version of the function I suggested; it would look something like this:
def decodeDataFrame(df, encoding='UTF-8'):
    # Python 2: decode every str column of the DataFrame to unicode in place.
    if df is not None and not df.empty:
        for column, series in df.items():
            if type(series.at[0]) == str:
                try:
                    decodedSeries = series.str.decode(encoding)
                    df[column] = decodedSeries
                except Exception:
                    print 'Could not decode column %s' % column
                    raise
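It would be called the same way as the earlier helper, mirroring the example from before:
decodeDataFrame(df)
wks.set_dataframe(df, start='A10')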

Related

How to import a UTF8 table in another encoding (win1251, SQL_ASCII) with COPY()?

Background: Hello, I have seen many questions about encoding in Postgres, but none of them covered this.
I have a UTF8 table, and I'm using the COPY function to export that table to CSV, and I need to run COPY with different encodings, such as WIN1251 and SQL_ASCII.
Problem: When the table contains characters that are not supported in WIN1251/SQL_ASCII, I get the classic error
character with byte sequence 0xe7 0xb0 0xab in encoding "UTF8" has no equivalent in encoding "WIN1251"
I tried using "set client_encoding / convert / convert_to", with no success.
Main question: Is there any way to do this without an error, using SQL?
There is simply no way to convert 簫 into WIN1251, so you can forget about that.
If you set the client encoding to SQL_ASCII, you will be able to load the data into an SQL_ASCII database, but that is of little use, since the database does not recognize it as a character, only as three meaningless bytes above 127.

read_csv pandas, encoding issue

I have a csv-file with a list of keywords that I want to use for some filtering of texts.
I saved the csv-file, and tried to open it in my notebook using pd.from_csv('file.csv', encoding = 'UTF-8')
This didn't work, even though I specified that encoding.
After some searching through different encodings, I decided to go for
keywords = pd.read_csv('file.csv', encoding = 'latin1')
which gets me the actual keywords, but when inspecting the words, I see that the spaces come through as follows:
['falsification\xa0',
'détournement\xa0de\xa0subsides\xa0',
'parachutes\xa0dorés\xa0',...]
About the csv-file: it has two columns of keywords, one column in Dutch, the other one in French. The issue with the spaces persists even when I use other encodings like
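One common cleanup for this particular symptom (only a sketch, not from the original post, and assuming the stray \xa0 non-breaking spaces are the only problem) is to replace them after loading:
import pandas as pd

keywords = pd.read_csv('file.csv', encoding='latin1')
# Swap non-breaking spaces for regular spaces and trim the ends of each string column.
for col in keywords.columns:
    if keywords[col].dtype == object:
        keywords[col] = keywords[col].str.replace('\xa0', ' ', regex=False).str.strip()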

Encoding issue in Postgres ERROR "UTF8": is it best to set encoding to UTF8 or to make the data WIN1252 compatible?

I created a table by importing a CSV file from an Excel spreadsheet. When I try to run the select statement below, I get the following error.
test=# SELECT * FROM dt_master;
ERROR: character with byte sequence 0xc2 0x9d in encoding "UTF8" has no equivalent in encoding "WIN1252"
I have read the solution posted in this Stack Overflow post and was able to overcome the issue by setting the encoding to UTF8, so up to that point I am still able to keep working with the data. My question, however, is whether setting the encoding to UTF8 actually solves the problem or is just a workaround that will create other problems down the road, in which case I would be better off removing the conflicting characters and making the data WIN1252 compliant.
Thank you
You have a weird character in your database (Unicode code point 9D, a control character) that probably got there by mistake.
You have to set the client encoding to the encoding that your application expects; no other value will produce correct results, even if you get rid of the error. The error has a reason.
You have two choices:
Fix the data in the database. The character is very likely not what was intended.
Change the application to use LATIN1 or (better) UTF-8 internally and set the client encoding appropriately.
Using UTF-8 everywhere would have the advantage that you are safe from this kind of problem.
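As an illustration of the second choice, here is a minimal sketch (not from the original answer; the connection string is a placeholder) of setting the client encoding from a Python application with psycopg2:
import psycopg2

conn = psycopg2.connect("dbname=test user=postgres")
conn.set_client_encoding('UTF8')   # same effect as SET client_encoding TO 'UTF8'
cur = conn.cursor()
cur.execute("SELECT * FROM dt_master")
rows = cur.fetchall()
cur.close()
conn.close()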

How to remove illegal characters so a dataframe can write to Excel

I am trying to write a dataframe to an Excel spreadsheet using ExcelWriter, but it keeps returning an error:
openpyxl.utils.exceptions.IllegalCharacterError
I'm guessing there's some character in the dataframe that ExcelWriter doesn't like. It seems odd, because the dataframe is formed from three Excel spreadsheets, so I can't see how there could be a character that Excel doesn't like!
Is there any way to iterate through a dataframe and replace characters that ExcelWriter doesn't like? I don't even mind if it simply deletes them.
What's the best way of removing or replacing illegal characters in a dataframe?
Based on Haipeng Su's answer, I added a function that does this:
dataframe = dataframe.applymap(
    lambda x: x.encode('unicode_escape').decode('utf-8') if isinstance(x, str) else x)
Basically, it escapes the unicode characters if they exist. It worked and I can now write to Excel spreadsheets again!
The same problem happened to me. I solved it as follows:
install python package xlsxwriter:
pip install xlsxwriter
replace the default engine 'openpyxl' with 'xlsxwriter':
dataframe.to_excel("file.xlsx", engine='xlsxwriter')
Trying a different Excel writer engine solved my problem:
writer = pd.ExcelWriter('file.xlsx', engine='xlsxwriter')
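For completeness, a small sketch of the full round trip with the xlsxwriter engine (the DataFrame and file name here are placeholders):
import pandas as pd

df = pd.DataFrame({'text': ['señor', 'plain ascii']})
with pd.ExcelWriter('file.xlsx', engine='xlsxwriter') as writer:
    df.to_excel(writer, index=False)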
If you don't want to install another Excel writer engine (e.g. xlsxwriter), you may try to remove these illegal characters by looking for the pattern that causes the IllegalCharacterError to be raised.
Open cell.py, which is found at /path/to/your/python/site-packages/openpyxl/cell/, and look for the check_string function; you'll see that it uses a regular expression pattern, ILLEGAL_CHARACTERS_RE, to find those illegal characters. Locating its definition, you'll see this line:
ILLEGAL_CHARACTERS_RE = re.compile(r'[\000-\010]|[\013-\014]|[\016-\037]')
This line is what you need to remove those characters. Copy this line to your program and execute the below code before your dataframe is written to Excel:
dataframe = dataframe.applymap(lambda x: ILLEGAL_CHARACTERS_RE.sub(r'', x) if isinstance(x, str) else x)
The above line will remove those characters in every cell.
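Putting those pieces together, a self-contained sketch could look like this (the DataFrame is a placeholder; the regular expression is the one copied from openpyxl above):
import re
import pandas as pd

# Pattern copied from openpyxl's cell.py, as described above.
ILLEGAL_CHARACTERS_RE = re.compile(r'[\000-\010]|[\013-\014]|[\016-\037]')

dataframe = pd.DataFrame({'text': ['fine', 'bad\x01value']})
dataframe = dataframe.applymap(
    lambda x: ILLEGAL_CHARACTERS_RE.sub(r'', x) if isinstance(x, str) else x)
dataframe.to_excel('file.xlsx', index=False)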
But the origin of these characters may be a problem. As you say, the dataframe comes from three Excel spreadsheets. If the source spreadsheets contain those characters, you will still face this problem. So if you can control the generation process of the source spreadsheets, try to remove these characters there to begin with.
I was also struggling with some weird characters in a data frame when writing it to HTML or CSV. For example, for characters with accents, I couldn't write to an HTML file, so I needed to convert the characters into ones without the accent.
My method may not be the best, but it helped me convert unicode strings into something ASCII-compatible.
# install unidecode first: pip install unidecode
from unidecode import unidecode

def FormatString(s):
    # Python 2: keep ASCII-safe unicode strings, transliterate everything else.
    if isinstance(s, unicode):
        try:
            s.encode('ascii')
            return s
        except UnicodeEncodeError:
            return unidecode(s)
    else:
        return s

df2 = df1.applymap(FormatString)
In your situation, if you just want to get rid of the illegal characters, change return unidecode(s) to return 'StringYouWantToReplace'.
Hope this can give you some ideas to deal with your problem.
You can use the built-in strip() method for Python strings.
for each cell:
text = str(illegal_text).strip()
for the entire data frame:
dataframe = dataframe.applymap(lambda t: str(t).strip())
If you're still struggling to clean up the characters, this worked well for me:
import xlwings as xw
import pandas as pd
df = pd.read_pickle('C:\\Users\\User1\\picked_DataFrame_notWriting.df')
topath = 'C:\\Users\\User1\\tryAgain.xlsx'
wb = xw.Book(topath)
ws = wb.sheets['Data']
ws.range('A1').options(index=False).value = df
wb.save()
wb.close()

IPython kernel dies unexpectedly reading large file

I'm reading in a ~3Gb csv using pandas in an ipython notebook. While reading the file, the notebook unexpectedly gives me an error message saying the kernel appears to have died and will restart.
As per several "big data" workflows in python/pandas, I'm reading the files in as follows:
import pandas as pd
# chunksize needs a value here; 100000 is an arbitrary choice
tp = pd.read_csv(file_name_cleaned, chunksize=100000, iterator=True, low_memory=False)
df = pd.concat(tp, ignore_index=True)
My workflow has involved some preprocessing to remove all but alphanumeric characters and a few pieces of punctuation as follows:
import re

with open(file_name, 'r') as file1:
    with open(file_name_cleaned, 'w') as file2:
        for line in file1:
            # keep only lines with the expected number of delimiters,
            # stripped down to alphanumerics, the delimiter and a little punctuation
            if len(line.split(sep_string)) == num_columns:
                line = re.sub(r'[^A-Za-z0-9|._]+', '', line)
                file2.write(line + '\n')
The strange thing is that if I remove the line containing re.sub(), I get a different error - "Expected 209 fields in line 22236, saw 329" - even though I've explicitly checked for the exact number of delimiters. Visual inspection of the line and the surrounding lines doesn't really show me much either.
This process has worked fine for several other files, including ones that are larger, so I don't think the size of the file is the issue, although I suppose it's possible that that's an oversimplification.
I included the preprocessing because I know from experience that the data sometimes contains strange special characters; I've also gone back and forth between using encoding='utf-8' and encoding='utf-8-sig' in the read_csv() and open() statements, to no real avail.
I have several questions: does including the encoding keyword argument cause Python to ignore characters outside that character set, or does it invoke some kind of conversion for those characters? I'm not very familiar with these types of issues. Is it possible that some kind of unexpected character slipped through my preprocessing and caused this? Is there another type of issue that I haven't found that could cause this? (I have done research, but nothing has been quite right.)
Any help would be much appreciated.
Also, I'm using Anaconda 2.4, with Python 3.5.1, IPython 4.0.0, and pandas 0.17.0.
I'm not sure that this totally answers my questions, but I did solve the issue: while it is slower, using engine='python' in pd.read_csv() did the trick.
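A sketch of what that change might look like in the chunked read from above (the file path and chunk size are placeholders; low_memory is dropped because it only applies to the default C engine):
import pandas as pd

file_name_cleaned = 'cleaned.csv'   # placeholder path
tp = pd.read_csv(file_name_cleaned, chunksize=100000, iterator=True, engine='python')
df = pd.concat(tp, ignore_index=True)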