Pandas dataframe to_csv escape double quotes

I am trying to export a pandas dataframe to CSV. Some of the data contains double quotes and I can't get them escaped properly.
import pandas as pd
from io import StringIO
inp = [{'c1':10, 'c2':'some text'}, {'c1':11,'c2':'some "text"'}]
df = pd.DataFrame(inp)
output = StringIO()
df.to_csv(output, sep='\t', escapechar='\b', header=False, index=False)
As a result I get double quotes escaped with another double quote:
'10\tsome text\n11\t"some ""text"""\n'
but I need it to be:
'10\tsome text\n11\t"some \x08"text\x08""\n'
I tried different combinations of the doublequote, quotechar and quoting arguments of the to_csv() function, but no luck.
The closest I got is:
df.to_csv(output, sep='\t', escapechar='\b', header=False, index=False, doublequote=False)
which results in properly escaped double quotes, but the whole cell is not wrapped in double quotes and thus cannot be parsed correctly in further steps:
'10\tsome text\n11\tsome \x08"text\x08"\n'
Is there a way to make pandas escape double quotes with the needed escape character?
PS. Currently my only workaround is to replace "" with \x08" manually in the string buffer.
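For reference, a minimal sketch of that manual workaround, post-processing the doublequote-escaped output from the first call above:
csv_text = output.getvalue()
# turn pandas' doubled quotes ("") into the escaped form (\b")
csv_text = csv_text.replace('""', '\b"')
# csv_text is now '10\tsome text\n11\t"some \x08"text\x08""\n'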

I managed to fix this issue with the following settings:
df.to_csv(output, sep="\t", escapechar="\\", header=False, index=False, doublequote=False)
escapechar="\\" will put a single back-slash with the double quotes in your value as you have already specified doublequote=False, which will make sure double-quotes are note escaped with another double-quote.

Related

Is it possible to read a csv with `\r\n` line terminators in pandas?

I'm using pandas==1.1.5 to read a CSV file. I'm running the following code:
import pandas as pd
import csv
csv_kwargs = dict(
    delimiter="\t",
    lineterminator="\r\n",
    quoting=csv.QUOTE_MINIMAL,
    escapechar="!",
)
pd.read_csv("...", **csv_kwargs)
It raises the following error: ValueError: Only length-1 line terminators supported.
Pandas documentation confirms that line terminators should be length-1 (I suppose that means a single character).
Is there any way to read this CSV with Pandas or should I read it some other way?
Note that the docs suggest length-1 for the C parser; maybe I can plug in some other parser?
EDIT: Not specifying the line terminator raises a parse error in the middle of the file. Specifically ParserError: Error tokenizing data.; it expects the correct number of fields but gets too many.
EDIT2: I'm confident the kwargs above were used to create the csv file I'm trying to read.
The problem might be in the escapechar, since ! is a common text character.
Python's csv module defines a very strict use of escapechar:
A one-character string used by the writer to escape the delimiter if quoting is set to QUOTE_NONE and the quotechar if doublequote is False.
but it's possible that pandas interprets it differently:
One-character string used to escape other characters.
It's possible that you have a row that contains something like:
...\t"some important text!"\t...
which would escape the quote character and continue parsing text into that column.
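One way to test this theory is to bypass pandas and read the file with Python's csv module, which recognises \r\n natively and lets you toggle escapechar; if the parsed rows change with and without escapechar="!", the escape character is indeed the culprit. A sketch, assuming the file fits in memory (the path is hypothetical, and the header assumption is mine):
import csv
import pandas as pd

with open("data.csv", newline="") as f:  # hypothetical path
    # csv.reader recognises both \r and \n line endings on input,
    # so no lineterminator argument is needed here
    reader = csv.reader(f, delimiter="\t", escapechar="!")
    rows = list(reader)

df = pd.DataFrame(rows[1:], columns=rows[0])  # assumes the first row is a header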

replace() method not working on single quotes

I am trying to do:
df_flat = df_flat.replace("'", '"', regex=True)
To change single quotes to double quotes in the whole pandas df.
I have the whole dataframe with single quotes because I applied df.json_normalize, which changed all the double quotes from the source to single quotes, so I want to recover the double quotes.
I tried this different options:
df_flat = df_flat.apply(lambda s:s.replace("'",'"', regex=True))
df_flat=df_flat.replace({'\'': '"'}, regex=True)
None of them is working. Any idea what's happening?
I have pandas==1.3.2.
And the content of the columns looks like:
{'A':'1', 'NB':'29382', 'SS': '686'}
Edit:
I need this because I then save the pandas df to a parquet file and copy it to AWS Redshift. When I try to use json_extract_path it doesn't work, as it's not valid JSON due to the single quotes. I could do a replace in Redshift for each field, but I'd rather store the data in the correct format.
You may need to treat it as a string:
df_flat = df_flat.astype(str).replace("'",'"', regex=True)
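For illustration, a minimal sketch assuming the column holds dicts like the sample above (astype(str) renders them with single quotes, which the regex replace then rewrites):
import pandas as pd

df_flat = pd.DataFrame({'c': [{'A': '1', 'NB': '29382', 'SS': '686'}]})
df_flat = df_flat.astype(str).replace("'", '"', regex=True)
print(df_flat.iloc[0, 0])  # {"A": "1", "NB": "29382", "SS": "686"}
If the end goal is valid JSON for Redshift, applying json.dumps to each cell may be a more robust route, since a blanket quote swap breaks on values that themselves contain quotes or apostrophes.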

Replace function not working as expected with Dask

I'm reading a dask dataframe:
ddf = dd.read_csv({...}, dtype='object')
Next, I'm trying to replace commas with dots so the values can be injected into a SQL DB as floats.
ddf = ddf.replace(",", ".")
However, when I call ddf.to_sql({...}) my code returns ValueError: Unable to parse string "2,0" at position 8, which suggests that the replace function is not working as expected. Why is this the case? Is there another way to replace commas with dots in Dask?
You need to use regex here (right now you're only replacing cells whose entire value is exactly the single-character string ","):
ddf = ddf.replace("[,]", ".", regex=True)

Python Pandas to csv without double quotes

My code below converts a Spark dataframe to pandas to write it as a CSV file on my local machine.
myschema.toPandas().to_csv("final_op.txt", header=False, sep='|', index=False, mode='a', doublequote=False, escapechar='"', quoting=None)
Output of above command:
"COLUMN DEFINITION|id"|int
"COLUMN DEFINITION|name"|string
Note that in my 'myschema' dataframe there are no double quotes, yet double quotes appear when writing to CSV. The desired output is without double quotes, as below:
COLUMN DEFINITION|id|int
COLUMN DEFINITION|name|string
I thought setting doublequote=False, escapechar='"', quoting=None would solve it, but no luck.
Pass quoting=csv.QUOTE_NONE to the to_csv call (this requires import csv):
myschema.toPandas().to_csv("final_op.txt", header=False, sep='|', index=False, mode='a', doublequote=False, escapechar='"', quoting=csv.QUOTE_NONE)
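A minimal sketch of the effect on a toy pandas frame (hypothetical data, no Spark involved):
import csv
import pandas as pd
from io import StringIO

df = pd.DataFrame([["id", "int"], ["name", "string"]])
out = StringIO()
df.to_csv(out, header=False, sep='|', index=False, quoting=csv.QUOTE_NONE)
print(out.getvalue())
# id|int
# name|string
Keep in mind that with QUOTE_NONE, any cell that itself contains the separator must be handled by escapechar, otherwise to_csv raises an error.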

How to remove illegal characters so a dataframe can write to Excel

I am trying to write a dataframe to an Excel spreadsheet using ExcelWriter, but it keeps returning an error:
openpyxl.utils.exceptions.IllegalCharacterError
I'm guessing there's some character in the dataframe that ExcelWriter doesn't like. It seems odd, because the dataframe is formed from three Excel spreadsheets, so I can't see how there could be a character that Excel doesn't like!
Is there any way to iterate through a dataframe and replace characters that ExcelWriter doesn't like? I don't even mind if it simply deletes them.
What's the best way of removing or replacing illegal characters in a dataframe?
Based on Haipeng Su's answer, I added a function that does this:
dataframe = dataframe.applymap(lambda x: x.encode('unicode_escape').decode('utf-8') if isinstance(x, str) else x)
Basically, it escapes the unicode characters if they exist. It worked and I can now write to Excel spreadsheets again!
The same problem happened to me. I solved it as follows:
Install the python package xlsxwriter:
pip install xlsxwriter
Replace the default engine 'openpyxl' with 'xlsxwriter':
dataframe.to_excel("file.xlsx", engine='xlsxwriter')
Trying a different Excel writer engine solved my problem:
writer = pd.ExcelWriter('file.xlsx', engine='xlsxwriter')
If you don't want to install another Excel writer engine (e.g. xlsxwriter), you may try to remove these illegal characters by looking for the pattern which causes the IllegalCharacterError to be raised.
Open cell.py, found at /path/to/your/python/site-packages/openpyxl/cell/, and look for the check_string function; you'll see it uses a defined regular expression pattern, ILLEGAL_CHARACTERS_RE, to find those illegal characters. Tracing its definition, you'll see this line:
ILLEGAL_CHARACTERS_RE = re.compile(r'[\000-\010]|[\013-\014]|[\016-\037]')
This line is what you need to remove those characters. Copy it into your program (along with import re) and execute the code below before your dataframe is written to Excel:
dataframe = dataframe.applymap(lambda x: ILLEGAL_CHARACTERS_RE.sub(r'', x) if isinstance(x, str) else x)
The above line will remove those characters in every cell.
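Alternatively, you can likely import the compiled pattern instead of copying its definition (assuming a reasonably recent openpyxl, where the module path matches the cell.py location above):
from openpyxl.cell.cell import ILLEGAL_CHARACTERS_RE

dataframe = dataframe.applymap(lambda x: ILLEGAL_CHARACTERS_RE.sub('', x) if isinstance(x, str) else x)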
But the origin of these characters may be a problem. As you say, the dataframe comes from three Excel spreadsheets. If the source spreadsheets contain those characters, you will still face this problem. So if you can control the generation process of the source spreadsheets, try to remove these characters there to begin with.
I was also struggling with some weird characters in a dataframe when writing it to html or csv. For example, for characters with accents I couldn't write to an html file, so I needed to convert them into characters without accents.
My method may not be the best, but it helped me convert unicode strings into something ascii compatible.
# install unidecode first: pip install unidecode
from unidecode import unidecode

def FormatString(s):
    # only transform strings; pass other values through unchanged
    if isinstance(s, str):
        try:
            # already ascii-safe? return as-is
            s.encode('ascii')
            return s
        except UnicodeEncodeError:
            # transliterate accented/unicode characters to ascii
            return unidecode(s)
    else:
        return s

df2 = df1.applymap(FormatString)
In your situation, if you just want to get rid of the illegal characters, change return unidecode(s) to return whatever replacement string you want.
Hope this gives you some ideas to deal with your problem.
You can use the built-in strip() method for Python strings.
For each cell:
text = str(illegal_text).strip()
For the entire dataframe:
dataframe = dataframe.applymap(lambda t: str(t).strip())
Note that strip() only removes leading and trailing whitespace, so it won't catch illegal characters embedded in the middle of a value.
If you're still struggling to clean up the characters, this worked well for me:
import xlwings as xw
import pandas as pd

df = pd.read_pickle('C:\\Users\\User1\\picked_DataFrame_notWriting.df')
topath = 'C:\\Users\\User1\\tryAgain.xlsx'
# write through a live Excel instance via xlwings instead of an Excel writer engine
wb = xw.Book(topath)
ws = wb.sheets['Data']
ws.range('A1').options(index=False).value = df
wb.save()
wb.close()