My code below converts a Spark dataframe to pandas to write it as a CSV file locally.
myschema.toPandas().to_csv("final_op.txt",header=False,sep='|',index=False,mode='a',doublequote=False,escapechar='"',quoting=None)
Output of above command:
"COLUMN DEFINITION|id"|int
"COLUMN DEFINITION|name"|string
Note that there are no double quotes in my 'myschema' dataframe, yet double quotes appear when writing to CSV. The desired output is without double quotes, as below:
COLUMN DEFINITION|id|int
COLUMN DEFINITION|name|string
I thought setting doublequote=False, escapechar='"', quoting=None would solve it. But no luck.
Pass quoting=csv.QUOTE_NONE to the to_csv call (note this requires import csv):
myschema.toPandas().to_csv("final_op.txt",header=False,sep='|',index=False,mode='a',doublequote=False,escapechar='"',quoting=csv.QUOTE_NONE)
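For reference, a minimal self-contained sketch (with a stand-in dataframe in place of the real toPandas() result):

import csv
import pandas as pd

# Stand-in for the real myschema.toPandas() output.
pdf = pd.DataFrame([["COLUMN DEFINITION|id", "int"],
                    ["COLUMN DEFINITION|name", "string"]])

pdf.to_csv("final_op.txt", header=False, sep="|", index=False, mode="a",
           doublequote=False, escapechar='"', quoting=csv.QUOTE_NONE)

One caveat: with QUOTE_NONE the csv writer no longer wraps fields in quotes, but any separator embedded in a value is prefixed with the escapechar instead, so a cell like COLUMN DEFINITION|id is written as COLUMN DEFINITION"|id.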
I'm using pandas==1.1.5 to read a CSV file. I'm running the following code:
import pandas as pd
import csv
csv_kwargs = dict(
    delimiter="\t",
    lineterminator="\r\n",
    quoting=csv.QUOTE_MINIMAL,
    escapechar="!",
)
pd.read_csv("...", **csv_kwargs)
It raises the following error: ValueError: Only length-1 line terminators supported.
The pandas documentation confirms that line terminators should be length-1 (i.e., a single character).
Is there any way to read this CSV with Pandas or should I read it some other way?
Note that the docs suggest length-1 applies to the C parser, so maybe I can plug in some other parser?
EDIT: Not specifying the line terminator raises a parse error in the middle of the file: ParserError: Error tokenizing data. It expects the correct number of fields but gets too many.
EDIT2: I'm confident the kwargs above were used to create the CSV file I'm trying to read.
The problem might be the escapechar, since ! is a common text character.
Python's csv module defines a very strict use of escapechar:
A one-character string used by the writer to escape the delimiter if quoting is set to QUOTE_NONE and the quotechar if doublequote is False.
but it's possible that pandas interprets it differently:
One-character string used to escape other characters.
It's possible that you have a row that contains something like:
...\t"some important text!"\t...
which would escape the quote character and continue parsing text into that column.
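To illustrate with a small sketch (the sample text is made up; this uses Python's csv module directly): a QUOTE_MINIMAL writer with the default doublequote=True never uses the escapechar on output, so a bare ! survives into the file, and reading it back with escapechar="!" swallows the delimiter that follows it:

import csv
import io

# Write one row with the kwargs from the question; the trailing "!" is
# written as-is because the writer has no reason to escape anything here.
buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t", lineterminator="\r\n",
                    quoting=csv.QUOTE_MINIMAL, escapechar="!")
writer.writerow(["some important text!", "next"])
print(repr(buf.getvalue()))  # 'some important text!\tnext\r\n'

# Read it back with the same escapechar: the "!" is consumed and the tab
# after it loses its delimiter meaning, merging the two fields into one.
reader = csv.reader(io.StringIO(buf.getvalue(), newline=""),
                    delimiter="\t", escapechar="!")
print(next(reader))  # ['some important text\tnext'] -- one field instead of two

That mismatch in field counts is exactly the kind of thing that produces the ParserError you saw.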
I am trying to export a pandas dataframe to CSV. Some of the data contains double quotes and I can't get them escaped properly.
import pandas as pd
from io import StringIO
inp = [{'c1':10, 'c2':'some text'}, {'c1':11,'c2':'some "text"'}]
df = pd.DataFrame(inp)
output = StringIO()
df.to_csv(output, sep='\t', escapechar='\b', header=False, index=False)
As a result I get double quotes escaped with another double quote:
'10\tsome text\n11\t"some ""text"""\n'
but I need it to be:
'10\tsome text\n11\t"some \x08"text\x08""\n'
I tried different combinations of the doublequote, quotechar and quoting arguments to to_csv(), but no luck. The closest I got is:
df.to_csv(output, sep='\t', escapechar='\b', header=False, index=False, doublequote=False)
which results in properly escaped double quotes, but the whole cell is not wrapped in double quotes and thus cannot be parsed correctly in further steps:
'10\tsome text\n11\tsome \x08"text\x08"\n'
Is there a way to make pandas escape double quotes with needed escape character?
PS. Currently my only workaround is to replace "" with \x08" manually in the string buffer.
I managed to fix this issue with the following settings:
df.to_csv(output, sep="\t", escapechar="\\", header=False, index=False, doublequote=False)
escapechar="\\" puts a single backslash before the double quotes in your value; since you have also specified doublequote=False, double quotes are not escaped with another double quote.
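For completeness, a quick check of that call (output shown with a '\n' line terminator; the default can vary by platform and pandas version):

import pandas as pd
from io import StringIO

inp = [{'c1': 10, 'c2': 'some text'}, {'c1': 11, 'c2': 'some "text"'}]
df = pd.DataFrame(inp)

output = StringIO()
df.to_csv(output, sep="\t", escapechar="\\", header=False, index=False,
          doublequote=False)
print(repr(output.getvalue()))
# '10\tsome text\n11\t"some \\"text\\""\n'
# the cell is still wrapped in double quotes, and the inner quotes are
# backslash-escaped instead of doubled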
I am trying to do:
df_flat = df_flat.replace("'", '"', regex=True)
To change single quotes to double quotes in the whole pandas df.
The whole dataframe has single quotes because I applied df.json_normalize, which changed all the double quotes from the source into single quotes, so I want to recover the double quotes.
I tried this different options:
df_flat = df_flat.apply(lambda s:s.replace("'",'"', regex=True))
df_flat=df_flat.replace({'\'': '"'}, regex=True)
None of them is working. Any idea what's happening?
I have pandas==1.3.2.
And the content of the columns are like:
{'A':'1', 'NB':'29382', 'SS': '686'}
Edit:
I need it because I then save that pandas df in a parquet file and copy it to AWS Redshift. When I try to do json_extract_path it doesn't work, as it's not valid JSON due to the single quotes. I can do a replace in Redshift for each field, but I'd prefer to store it in the correct format.
You may need to treat it as a string:
df_flat = df_flat.astype(str).replace("'",'"', regex=True)
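A quick sketch of what that does (the column name here is hypothetical):

import pandas as pd

# Hypothetical frame: json_normalize left dict-valued cells whose repr
# uses single quotes.
df_flat = pd.DataFrame({"payload": [{'A': '1', 'NB': '29382', 'SS': '686'}]})

# astype(str) turns every cell into a plain string first, so the regex
# substring replacement has something to operate on.
df_flat = df_flat.astype(str).replace("'", '"', regex=True)
print(df_flat.loc[0, "payload"])  # {"A": "1", "NB": "29382", "SS": "686"}

Bear in mind that a blanket replace will also mangle any legitimate apostrophes inside the values.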
AWS SageMaker Batch Transform errors with the following:
bare " in non quoted field found near: "['R627' 'Q2739' 'D509' 'S37009A' 'E860' 'D72829' 'R9431' 'J90' 'R7989'
In a SageMaker Studio notebook, I use Pandas to output data to csv:
data.to_csv(my_file, index=False, header=False)
My Pandas dataframe has columns with string values like the following:
['ABC123', 'DEF456']
Pandas is adding line breaks within these fields, e.g. the following is one row (spanning two lines) that contains a line break. Note that the double-quoted field now spans two lines; sometimes it spans 3 or more lines.
False,ABC123,7,1,3412,['I509'],,"['R627' 'Q2739' 'D509' 'S37009A' 'E860' 'D72829' 'R9431' 'J90' 'R7989'
'R5383' 'J9621']",['R51' 'R05' 'R0981'],['X58XXXA'],M,,A,48
The CSV is valid and I can successfully read it back into a Pandas dataframe.
Why would Batch Transform fail to read this CSV format?
I've converted the arrays to strings (space separated), e.g.
From:
['ABC123', 'DEF456']
To:
ABC123 DEF456
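Something along these lines (a sketch; the column name is hypothetical):

import pandas as pd

# Hypothetical frame with an array-valued column like the one in the question.
data = pd.DataFrame({"codes": [["ABC123", "DEF456"], ["R627", "Q2739", "D509"]]})

# Join each list into one space-separated string, so no cell contains the
# quotes or embedded newlines that numpy's array repr introduces.
data["codes"] = data["codes"].apply(lambda xs: " ".join(map(str, xs)))

data.to_csv("my_file.csv", index=False, header=False)
# each row is now plain text, e.g. ABC123 DEF456, with no quoting needed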
I have a CSV file containing numbers like "1.456e+07", and I am using psycopg2's copy_expert function to load the file into the database, but I am getting this error:
psycopg2.DataError: invalid input syntax for integer: "1.5637e+07"
I notice that I can insert "100" as an integer, but when I do "1.5637e+07" with quotes, it doesn't work. I am using pandas dataframe's to_csv to generate the CSV files. I'm not sure how to get rid of the quotes only for numbers like "1.5637e+07" (I also have a string column), or whether there is another solution.
I found the solution.
Normally, pandas doesn't put quotes around numbers; however, I had set the float_format parameter, which causes this. I explicitly set
quoting=csv.QUOTE_MINIMAL
in the function call and the quotes go away.
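In other words (a minimal sketch; it relies on pandas falling back to csv.QUOTE_NONNUMERIC when float_format is given without an explicit quoting, which is the behavior described above):

import csv
from io import StringIO
import pandas as pd

df = pd.DataFrame({"n": [15637000.0], "s": ["abc"]})

# float_format alone: the formatted floats come back out quoted.
buf = StringIO()
df.to_csv(buf, float_format="%.4e", index=False, header=False)
print(buf.getvalue())  # "1.5637e+07","abc"

# Explicit quoting=csv.QUOTE_MINIMAL restores the unquoted output.
buf = StringIO()
df.to_csv(buf, float_format="%.4e", index=False, header=False,
          quoting=csv.QUOTE_MINIMAL)
print(buf.getvalue())  # 1.5637e+07,abc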