I am trying to do:
df_flat = df_flat.replace("'", '"', regex=True)
to change single quotes to double quotes across the whole pandas DataFrame.
The whole DataFrame has single quotes because I applied json_normalize, which changed all the double quotes from the source into single quotes, and I want to recover them.
I tried these different options:
df_flat = df_flat.apply(lambda s:s.replace("'",'"', regex=True))
df_flat=df_flat.replace({'\'': '"'}, regex=True)
None of them is working. Any idea what's happening?
I have pandas==1.3.2.
And the content of the columns looks like:
{'A':'1', 'NB':'29382', 'SS': '686'}
Edit:
I need it because I then save that pandas df to a parquet file and copy it to AWS Redshift. When I try to use json_extract_path it doesn't work, as the content isn't valid JSON due to the single quotes. I could do a replace in Redshift for each field, but I'd prefer to store it in the correct format.
You may need to treat it as a string:
df_flat = df_flat.astype(str).replace("'",'"', regex=True)
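For what it's worth, here is a minimal sketch of why the plain replace appears to do nothing (assuming the flattened column still holds Python dicts, as the sample content above suggests): regex replacement only touches string cells, so dict cells pass through unchanged until they are cast with astype(str).
import pandas as pd

# Hypothetical column of dicts standing in for the real df_flat
df_flat = pd.DataFrame({"payload": [{'A': '1', 'NB': '29382', 'SS': '686'}]})

# No effect: the cell is a dict, not a string, so the regex never matches
print(df_flat.replace("'", '"', regex=True).iloc[0, 0])
# {'A': '1', 'NB': '29382', 'SS': '686'}

# Works once the cells are cast to their string representation
print(df_flat.astype(str).replace("'", '"', regex=True).iloc[0, 0])
# {"A": "1", "NB": "29382", "SS": "686"}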
I am trying to export a pandas dataframe to CSV. Some of the data contains double quotes and I can't get them escaped properly.
import pandas as pd
from io import StringIO
inp = [{'c1':10, 'c2':'some text'}, {'c1':11,'c2':'some "text"'}]
df = pd.DataFrame(inp)
output = StringIO()
df.to_csv(output, sep='\t', escapechar='\b', header=False, index=False)
As a result I get double quotes escaped with another double quote:
'10\tsome text\n11\t"some ""text"""\n'
but I need it to be:
'10\tsome text\n11\t"some \x08"text\x08""\n'
I tried different combinations of the doublequote, quotechar and quoting arguments of the to_csv() function, but no luck.
Closest I got is:
df.to_csv(output, sep='\t', escapechar='\b', header=False, index=False, doublequote=False)
which results in properly escaped double quotes, but the whole cell is not wrapped in double quotes and thus cannot be parsed correctly in further steps:
'10\tsome text\n11\tsome \x08"text\x08"\n'
Is there a way to make pandas escape double quotes with needed escape character?
PS. Currently my only workaround is to replace "" with \x08" manually in the string buffer.
I managed to fix this issue with the following settings.
df.to_csv(output, sep="\t", escapechar="\\", header=False, index=False, doublequote=False)
escapechar="\\" will put a single backslash before the double quotes in your value. Since you have also specified doublequote=False, the double quotes will not be escaped with another double quote.
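A minimal sketch of this, reusing the sample frame from the question; the buffer's repr should show backslash-escaped quotes:
import pandas as pd
from io import StringIO

inp = [{'c1': 10, 'c2': 'some text'}, {'c1': 11, 'c2': 'some "text"'}]
df = pd.DataFrame(inp)

output = StringIO()
# doublequote=False stops the ""-style doubling; escapechar takes over
df.to_csv(output, sep='\t', escapechar='\\', header=False, index=False,
          doublequote=False)
print(repr(output.getvalue()))
# '10\tsome text\n11\t"some \\"text\\""\n'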
I am trying to load some csv data to a Snowflake table. However, I am facing some issues with double double quotes in some rows of the file.
This is the file format I was using inside COPY INTO command:
file_format=(TYPE=CSV,
FIELD_DELIMITER = '|',
FIELD_OPTIONALLY_ENCLOSED_BY='"',
SKIP_HEADER =1);
As you can see in the example below, I have double double quotes around ID, which is a data quality problem, but I have to deal with it because I cannot change it at its source.
Column1|Column2|Column3|Column4|Column5|Column6|Column7|""ID""|Column9
I tried to replace the double double quotes ("") with a single double quote ("), as the example below depicts:
However, Snowflake is still returning the same error:
Found character 'I' instead of field delimiter '|' File 'XXXX', line 709, character 75 Row 708, column "Column8"["$8":8] If you would like to continue loading when an error is encountered, use other values such as 'SKIP_FILE' or 'CONTINUE' for the ON_ERROR option. For more information on loading options, please run 'info loading_data' in a SQL client.
Do you know how I can deal with this, so the file content can be properly loaded into the Snowflake table?
I have comma-separated CSV data, like below, which has to be imported into a Snowflake table using the COPY command.
"1","2","3","2"In stick"
Since I am already passing the parameter OPTIONALLY_ENCLOSED_BY = '"' to the COPY command, I couldn't escape the " (double quote) within the data ("2"In stick").
The imported data that I want to see in the table is like below:
1,2,3,2"In stick
Can someone please help here? Thanks!
If you are on Windows, I have a funny solution for that. Open the CSV file in MS Excel. Excel consumes the correct double quotes to show the data in cell format and leaves the extra ones in the middle of a cell (if each cell is separated properly by commas). Then choose 'replace' and replace the double quotes with something else (like two single quotes, or nothing at all to remove them). Then save it again as a CSV. I assume other spreadsheet programs can do the same.
If you have an unescaped quote inside a field which is surrounded by quotes, that isn't really valid CSV. For example, here is an excerpt from the RFC 4180 spec:
If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with another double quote.
For example:
"aaa","b""bb","ccc"
I think that whatever is generating the CSV file is doing it incorrectly and needs to be fixed before you will be able to load it into Snowflake. I don't think any file_format option will be able to solve this for you since it's not valid CSV.
The CSV row should either look like this:
"1","2","3","2""In stick"
or this:
"1","2","3","2\"In stick"
I had this same problem, and while writing up the question, I found an answer:
Import RFC4180 files (CSV spec) into snowflake? (Unable to create file format that matches CSV RFC spec)
Essentially, set:
Name                          Value
Column Separator              Comma
Row Separator                 New Line
Header lines to skip          {you have to decide what to put here}
Field optionally enclosed by  Double Quote
Escape Character              None
Escape Unenclosed Field       None
Here is my ALTER statement:
ALTER FILE FORMAT "DB_NAME"."SCHEMA_NAME"."CSV_SPEC3" SET
  COMPRESSION = 'NONE'
  FIELD_DELIMITER = ','
  RECORD_DELIMITER = '\n'
  SKIP_HEADER = 1
  FIELD_OPTIONALLY_ENCLOSED_BY = '\042'
  TRIM_SPACE = FALSE
  ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE
  ESCAPE = 'NONE'
  ESCAPE_UNENCLOSED_FIELD = 'NONE'
  DATE_FORMAT = 'AUTO'
  TIMESTAMP_FORMAT = 'AUTO'
  NULL_IF = ('\\N');
As I mentioned in that answer, I don't know why the above works, but it is working for me. Go figure.
My code below converts a Spark dataframe to pandas to write it as a CSV file on my local machine.
myschema.toPandas().to_csv("final_op.txt", header=False, sep='|', index=False, mode='a', doublequote=False, escapechar='"', quoting=None)
Output of the above command:
"COLUMN DEFINITION|id"|int
"COLUMN DEFINITION|name"|string
Note that in my 'myschema' dataframe there are no double quotes. While writing to CSV, double quotes appear. The desired output is without double quotes, as below:
COLUMN DEFINITION|id|int
COLUMN DEFINITION|name|string
I thought setting doublequote=False, escapechar='"', quoting=None would solve it. But no luck.
Pass quoting=csv.QUOTE_NONE to the to_csv command (this needs import csv at the top):
import csv
myschema.toPandas().to_csv("final_op.txt", header=False, sep='|', index=False, mode='a', doublequote=False, escapechar='"', quoting=csv.QUOTE_NONE)
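A minimal sketch of the effect (with a toy frame standing in for myschema.toPandas()): QUOTE_NONE stops pandas from wrapping fields in quotes at all; note that any separator occurring inside a value is then prefixed with the escapechar instead.
import csv
import pandas as pd

# Hypothetical data mimicking the question's cells, which contain '|'
df = pd.DataFrame([["COLUMN DEFINITION|id", "int"],
                   ["COLUMN DEFINITION|name", "string"]])
print(df.to_csv(header=False, sep='|', index=False,
                escapechar='"', quoting=csv.QUOTE_NONE))
# COLUMN DEFINITION"|id|int
# COLUMN DEFINITION"|name|string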
I have a CSV file containing numbers like "1.456e+07", and I am using the function copy_expert to export the file to the database, but I am getting this error:
psycopg2.DataError: invalid input syntax for integer: "1.5637e+07"
I notice that I can insert "100" as an integer, but when I try "1.5637e+07" with quotes, it doesn't work.
I am using pandas DataFrame's to_csv to generate the CSV files. I'm not sure how to get rid of the quotes only for numbers like "1.5637e+07" (I have a string column), or whether there is another solution.
I found the solution.
Normally, pandas doesn't put quotes around numbers. However, I had set the float_format parameter, which causes this. I set
quoting=csv.QUOTE_MINIMAL
in the function call and the quotes went away.
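A small sketch of that quirk (assuming the previous quoting was something like csv.QUOTE_NONNUMERIC; the original value isn't stated above): once float_format is set, floats are rendered as strings before the csv layer sees them, so non-minimal quoting wraps them in quotes.
import csv
import pandas as pd

df = pd.DataFrame({"n": [15637000.0], "s": ["text"]})

# float_format turns the float into a string, which QUOTE_NONNUMERIC quotes
print(df.to_csv(index=False, float_format='%.4e',
                quoting=csv.QUOTE_NONNUMERIC))
# "n","s"
# "1.5637e+07","text"

# QUOTE_MINIMAL leaves it bare
print(df.to_csv(index=False, float_format='%.4e',
                quoting=csv.QUOTE_MINIMAL))
# n,s
# 1.5637e+07,text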