How to avoid unicode errors from DataFrame to CSV? - pandas

I cannot get rid of Unicode errors; how do I deal with them?
I'm using a DataFrame (the to_csv method), but the problem is that the resulting CSV shows the following:
Gòtic
Montjuïc
How can I avoid this with DataFrames? Python 2.7 + pandas.
I'm using:
# encoding=utf8
I've tried:
.encode('utf-8')
u''.join(variable)

Try this: change the encoding to latin-1.
df.to_csv('your_csv_name.csv', encoding='latin-1')
output:
Gòtic
Montjuïc
This works fine for me in Python 3.7.
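If the file is going to be opened in Excel, another option worth trying is writing UTF-8 with a BOM. A minimal sketch, assuming a hypothetical DataFrame df and the placeholder file name places.csv:
import pandas as pd
# Keep the data as UTF-8 but prepend a BOM so Excel detects the encoding
# when it opens the CSV. 'places.csv' is a placeholder path.
df = pd.DataFrame({'name': [u'Gòtic', u'Montjuïc']})
df.to_csv('places.csv', encoding='utf-8-sig', index=False)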

Related

Is it possible to read a csv with `\r\n` line terminators in pandas?

I'm using pandas==1.1.5 to read a CSV file. I'm running the following code:
import pandas as pd
import csv
csv_kwargs = dict(
    delimiter="\t",
    lineterminator="\r\n",
    quoting=csv.QUOTE_MINIMAL,
    escapechar="!",
)
pd.read_csv("...", **csv_kwargs)
It raises the following error: ValueError: Only length-1 line terminators supported.
The pandas documentation confirms that line terminators should be length-1 (a single character, I suppose).
Is there any way to read this CSV with pandas, or should I read it some other way?
Note that the docs say the length-1 limit applies to the C parser; maybe I can plug in some other parser?
EDIT: Not specifying the line terminator raises a parse error in the middle of the file, specifically ParserError: Error tokenizing data. (it expects the correct number of fields but gets too many).
EDIT2: I'm confident the kwargs above were used to create the CSV file I'm trying to read.
The problem might be in the escapechar, since ! is a common text character.
Python's csv module defines a very strict use of escapechar:
A one-character string used by the writer to escape the delimiter if quoting is set to QUOTE_NONE and the quotechar if doublequote is False.
but it's possible that pandas interprets it differently:
One-character string used to escape other characters.
It's possible that you have a row that contains something like:
...\t"some important text!"\t...
which would escape the quote character and continue parsing text into that column.
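If you cannot pre-process the file, one workaround worth trying is to drop the explicit lineterminator and let the more forgiving Python engine handle \r\n via universal newlines, keeping the rest of the writer's dialect. A minimal sketch, with data.tsv standing in for the real path:
import csv
import pandas as pd
# lineterminator is only supported by the C parser, so it is omitted here;
# universal-newline handling takes care of "\r\n" on its own.
df = pd.read_csv(
    "data.tsv",
    sep="\t",
    quoting=csv.QUOTE_MINIMAL,
    escapechar="!",
    engine="python",
)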

AWS SageMaker Batch Transform CSV Error: Bare " in non quoted field

AWS SageMaker Batch Transform errors with the following:
bare " in non quoted field found near: "['R627' 'Q2739' 'D509' 'S37009A' 'E860' 'D72829' 'R9431' 'J90' 'R7989'
In a SageMaker Studio notebook, I use Pandas to output data to csv:
data.to_csv(my_file, index=False, header=False)
My Pandas dataframe has columns with string values like the following:
['ABC123', 'DEF456']
Pandas is adding line breaks within these fields, e.g. the following is one row (spanning two lines) that contains a line break. Note that the double-quoted field now spans two lines; sometimes it spans three or more.
False,ABC123,7,1,3412,['I509'],,"['R627' 'Q2739' 'D509' 'S37009A' 'E860' 'D72829' 'R9431' 'J90' 'R7989'
'R5383' 'J9621']",['R51' 'R05' 'R0981'],['X58XXXA'],M,,A,48
The CSV is valid and I can successfully read it back into a Pandas dataframe.
Why would Batch Transform fail to read this CSV format?
I've converted the arrays to space-separated strings, e.g.
From:
['ABC123', 'DEF456']
To:
ABC123 DEF456
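A minimal sketch of that conversion, assuming codes is the list/array-valued column whose multi-line repr produces the quoted fields (the column name is hypothetical):
import numpy as np
import pandas as pd
def flatten(value):
    # Join list-like values into one space-separated string so to_csv never
    # has to quote a multi-line repr.
    if isinstance(value, (list, tuple, np.ndarray)):
        return ' '.join(str(v) for v in value)
    return value
data['codes'] = data['codes'].apply(flatten)
data.to_csv(my_file, index=False, header=False)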

Encoding Error of Reading .dta Files with Chinese Characters

I am trying to read .dta files with pandas:
import pandas as pd
my_data = pd.read_stata('filename', encoding='utf-8')
the error message is:
ValueError: Unknown encoding. Only latin-1 and ascii supported.
Other encodings didn't work either, such as gb18030 or gb2312 for dealing with Chinese characters. If I remove the encoding parameter, the DataFrame is full of garbled values.
Simply read the original data with the default encoding, then convert it to the expected encoding. Suppose the column with garbled text is column1:
import pandas as pd
dta = pd.read_stata('filename.dta')
print(dta['column1'][0].encode('latin-1').decode('gb18030'))
The printed result will show normal Chinese characters; gb2312 also works.
Looking at the source code of pandas (version 0.22.0), the supported encodings for read_stata are ('ascii', 'us-ascii', 'latin-1', 'latin_1', 'iso-8859-1', 'iso8859-1', '8859', 'cp819', 'latin', 'latin1', 'L1'). So you can only choose from this list.
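A minimal sketch that applies the same re-decoding to the whole column (filename.dta and column1 are the placeholders used above):
import pandas as pd
# Read with pandas' default handling, then undo the latin-1 decoding and
# re-decode the underlying bytes as gb18030.
dta = pd.read_stata('filename.dta')
dta['column1'] = dta['column1'].apply(
    lambda s: s.encode('latin-1').decode('gb18030') if isinstance(s, str) else s
)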

Catch bad lines in csv file using pandas read_csv

I am using pandas read_csv to read a 140k-line CSV file. The format of the file is as follows:
"HEAD1", "HEAD2", "HEAD3"
"line1-1", "line1-2", "line1-3"
"line2-1", "line2-2", "line2-3"
There are some invalid lines as follows:
"line"3-1", "line3-2",, "li"ne3-4"
How can I catch and print out the invalid lines? Is it possible to do so using the read_csv function, or do I need to use csv.reader and check each line with a regular expression? If so, can somebody help me build one? I came up with the following, but it does not work:
^".+\",?"?
Thank you.
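If you are on pandas >= 1.4, one way to catch the offenders without a regular expression is to pass a callable to on_bad_lines (Python engine only). A minimal sketch, with data.csv standing in for the real file:
import pandas as pd
bad_rows = []
def log_bad_line(fields):
    # Called for each malformed row with its parsed fields; returning None
    # drops the row from the resulting DataFrame.
    bad_rows.append(fields)
    return None
df = pd.read_csv("data.csv", engine="python", skipinitialspace=True,
                 on_bad_lines=log_bad_line)
print(bad_rows)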

How to remove illegal characters so a dataframe can write to Excel

I am trying to write a dataframe to an Excel spreadsheet using ExcelWriter, but it keeps returning an error:
openpyxl.utils.exceptions.IllegalCharacterError
I'm guessing there's some character in the dataframe that ExcelWriter doesn't like. It seems odd, because the dataframe is formed from three Excel spreadsheets, so I can't see how there could be a character that Excel doesn't like!
Is there any way to iterate through a dataframe and replace characters that ExcelWriter doesn't like? I don't even mind if it simply deletes them.
What's the best way of removing or replacing illegal characters in a dataframe?
Based on Haipeng Su's answer, I added a function that does this:
dataframe = dataframe.applymap(
    lambda x: x.encode('unicode_escape').decode('utf-8') if isinstance(x, str) else x
)
Basically, it escapes the unicode characters if they exist. It worked and I can now write to Excel spreadsheets again!
The same problem happened to me. I solved it as follows:
Install the Python package xlsxwriter:
pip install xlsxwriter
Replace the default engine 'openpyxl' with 'xlsxwriter':
dataframe.to_excel("file.xlsx", engine='xlsxwriter')
Trying a different Excel writer engine solved my problem:
writer = pd.ExcelWriter('file.xlsx', engine='xlsxwriter')
If you don't want to install another Excel writer engine (e.g. xlsxwriter), you may try to remove these illegal characters by looking for the pattern which causes the IllegalCharacterError error to be raised.
Open cell.py, found at /path/to/your/python/site-packages/openpyxl/cell/, and look for the check_string function; you'll see it uses a predefined regular expression pattern, ILLEGAL_CHARACTERS_RE, to find those illegal characters. Locating its definition, you'll see this line:
ILLEGAL_CHARACTERS_RE = re.compile(r'[\000-\010]|[\013-\014]|[\016-\037]')
This line is what you need to remove those characters. Copy it into your program and execute the code below before your dataframe is written to Excel:
dataframe = dataframe.applymap(lambda x: ILLEGAL_CHARACTERS_RE.sub(r'', x) if isinstance(x, str) else x)
The above line will remove those characters in every cell.
But the origin of these characters may be a problem. As you say, the dataframe comes from three Excel spreadsheets. If the source spreadsheets contain those characters, you will still face this problem. So if you can control how the source spreadsheets are generated, try to remove these characters there to begin with.
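Depending on your openpyxl version, the pattern may also be importable instead of copying the regex; the import path below is an assumption to check against your installation:
import pandas as pd
# Assumption: openpyxl exposes the pattern at this path; if not, define the
# regex shown above yourself.
from openpyxl.cell.cell import ILLEGAL_CHARACTERS_RE
dataframe = dataframe.applymap(
    lambda x: ILLEGAL_CHARACTERS_RE.sub('', x) if isinstance(x, str) else x
)
dataframe.to_excel('cleaned.xlsx')  # 'cleaned.xlsx' is a placeholder name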
I was also struggling with some weird characters in a DataFrame when writing it to HTML or CSV. For example, with accented characters I couldn't write to an HTML file, so I needed to convert them into their unaccented equivalents.
My method may not be the best, but it helps me convert Unicode strings into ASCII-compatible ones.
# install unidecode first: pip install unidecode
from unidecode import unidecode

def FormatString(s):
    # Python 2: only transliterate unicode strings that are not already ASCII
    if isinstance(s, unicode):
        try:
            s.encode('ascii')
            return s
        except UnicodeEncodeError:
            return unidecode(s)
    else:
        return s

df2 = df1.applymap(FormatString)
In your situation, if you just want to get rid of the illegal characters, change return unidecode(s) to return 'StringYouWantToReplace'.
Hope this gives you some ideas for dealing with your problem.
You can use the built-in strip() method for Python strings.
For a single cell:
text = str(illegal_text).strip()
For the entire DataFrame:
dataframe = dataframe.applymap(lambda t: str(t).strip())
If you're still struggling to clean up the characters, this worked well for me:
import xlwings as xw
import pandas as pd
df = pd.read_pickle('C:\\Users\\User1\\picked_DataFrame_notWriting.df')
topath = 'C:\\Users\\User1\\tryAgain.xlsx'
wb = xw.Book(topath)
ws = wb.sheets['Data']
ws.range('A1').options(index=False).value = df
wb.save()
wb.close()