Remove double double quotes before COPY INTO Snowflake - SQL

I am trying to load some csv data to a Snowflake table. However, I am facing some issues with double double quotes in some rows of the file.
This is the file format I was using inside COPY INTO command:
file_format = (TYPE = CSV,
               FIELD_DELIMITER = '|',
               FIELD_OPTIONALLY_ENCLOSED_BY = '"',
               SKIP_HEADER = 1);
As you can see in the example below, I have double double quotes around ID, which is a data quality problem, but one I have to deal with because I cannot change it at the source:
Column1|Column2|Column3|Column4|Column5|Column6|Column7|""ID""|Column9
I tried to replace the double double quotes ("") with a single double quote (").
However, Snowflake is still returning the same error:
Found character 'I' instead of field delimiter '|' File 'XXXX', line 709, character 75 Row 708, column "Column8"["$8":8] If you would like to continue loading when an error is encountered, use other values such as 'SKIP_FILE' or 'CONTINUE' for the ON_ERROR option. For more information on loading options, please run 'info loading_data' in a SQL client.
Do you know how I can deal with this, so that the file content loads properly into the Snowflake table?
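One possible workaround is to skip the enclosure handling entirely and strip the stray quotes during the load with a COPY transformation. A minimal sketch, assuming a stage named @my_stage and a target table my_table (both placeholder names), and assuming the other fields are never legitimately quoted:
COPY INTO my_table
FROM (
    SELECT t.$1, t.$2, t.$3, t.$4, t.$5, t.$6, t.$7,
           REPLACE(t.$8, '"', '') AS col8,  -- remove the stray double quotes around ID
           t.$9
    FROM @my_stage t
)
FILE_FORMAT = (TYPE = CSV,
               FIELD_DELIMITER = '|',
               SKIP_HEADER = 1);
Because this file format no longer declares FIELD_OPTIONALLY_ENCLOSED_BY, the quotes arrive as literal characters, which the REPLACE then removes.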

Related

How to escape double quotes within data when it is already enclosed by double quotes

I have CSV data separated by comma like below, which has to be imported into a Snowflake table using the COPY command.
"1","2","3","2"In stick"
Since I am already passing the parameter FIELD_OPTIONALLY_ENCLOSED_BY = '"' to the COPY command, I couldn't escape the " (double quote) within the data ("2"In stick").
The imported data that I want to see in the table is like below
1,2,3,2"In stick
Can someone please help here? Thanks!
If you are on Windows, I have a funny solution for that. Open the CSV file in MS Excel. Excel consumes the correctly paired double quotes when rendering the cells and leaves the stray ones in the middle of a cell (provided each cell is properly separated by commas). Then choose 'Replace' and replace the double quotes with something else (like two single quotes, or nothing to remove them). Then save it again as a CSV. I assume other spreadsheet programs would do the same.
If you have an un-escaped quote inside a field which is surrounded by quotes, that isn't really valid CSV. For example, here is an excerpt from the RFC 4180 spec:
If double-quotes are used to enclose fields, then a double-quote appearing inside a field must be escaped by preceding it with another double quote.
For example:
"aaa","b""bb","ccc"
I think that whatever is generating the CSV file is doing it incorrectly and needs to be fixed before you will be able to load it into Snowflake. I don't think any file_format option will be able to solve this for you since it's not valid CSV.
The CSV row should either look like this:
"1","2","3","2""In stick"
or this:
"1","2","3","2\"In stick"
I had this same problem, and while writing up the question, I found an answer:
Import RFC4180 files (CSV spec) into snowflake? (Unable to create file format that matches CSV RFC spec)
Essentially, set:
Name                           Value
Column Separator               Comma
Row Separator                  New Line
Header lines to skip           {you have to decide what to put here}
Field optionally enclosed by   Double Quote
Escape Character               None
Escape Unenclosed Field        None
Here is my ALTER statement:
ALTER FILE FORMAT "DB_NAME"."SCHEMA_NAME"."CSV_SPEC3" SET
    COMPRESSION = 'NONE'
    FIELD_DELIMITER = ','
    RECORD_DELIMITER = '\n'
    SKIP_HEADER = 1
    FIELD_OPTIONALLY_ENCLOSED_BY = '\042'
    TRIM_SPACE = FALSE
    ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE
    ESCAPE = 'NONE'
    ESCAPE_UNENCLOSED_FIELD = 'NONE'
    DATE_FORMAT = 'AUTO'
    TIMESTAMP_FORMAT = 'AUTO'
    NULL_IF = ('\\N');
As I mention in the answer, I don't know why the above works, but it is working for me. Go figure.

How to remove new line characters from data rows in Presto/AWS Athena?

I'm querying some tables on Athena (which is Presto-based) and then downloading the generated CSV file to use locally. Opening the file, I realised the data contains newline characters that don't appear in the AWS interface, only in the CSV, and I need to get rid of them. I tried using the function replace(string, search, replace) → varchar to escape the newline char, replacing \n with \\n, without success:
SELECT
p.recvepoch, replace(p.description, '\n', '\\n') AS description
FROM
product p
LIMIT 1000
How can I achieve that?
The problem was that the underlying table data doesn't actually contain the literal characters \n anywhere; instead, it contains the actual newline character, which is represented by chr(10). I was able to achieve the expected behaviour by passing it as a parameter to the replace function:
SELECT
p.recvepoch, replace(p.description, chr(10), '\n') AS description
FROM
product p
LIMIT 1000

Line contains invalid enclosed character data or delimiter at position

I was trying to load the data from a csv file into Oracle SQL Developer, and when inserting the data I encountered an error which says:
Line contains invalid enclosed character data or delimiter at position
I am not sure how to tackle this problem!
For Example:
INSERT INTO PROJECT_LIST (Project_Number, Name, Manager, Projects_M,
Project_Type, In_progress, at_deck, Start_Date, release_date, For_work, nbr,
List, Expenses) VALUES ('5770','"Program Cardinal
(Agile)','','','','','',to_date('', 'YYYY-MM-DD'),'','','','','');
The Error shown were:
--Insert failed for row 4
--Line contains invalid enclosed character data or delimiter at position 79.
--Row 4
I've had success converting the csv file to Excel via "Save As", changing the format to .xlsx. I then load the .xlsx version in SQL Developer. I think the conversion forces out some of the bad formatting. It worked, at least, on my last 2 files.
I fixed it by using the concatenate function in my CSV file first and then uploading it, which worked.
My guess is that it doesn't like to_date('', 'YYYY-MM-DD'). It's missing a date to format. Is that an actual input of your data?
But it could also possibly be the double quote in "Program Cardinal (Agile). Though I don't see why that would get picked up as an invalid character.
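If the embedded double quote is the culprit, the RFC 4180 fix is the same one described in the Snowflake answers above: enclose the field in quotes and double the embedded quote. A hypothetical corrected CSV row for this record (the column count here is illustrative; match it to your actual file) might look like:
5770,"""Program Cardinal (Agile)",,,,,,,,,,,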

Why do a double quote and <N> cause errors when uploading to BigQuery?

Errors were reported when my program tried to upload a .csv file to BigQuery via a load job:
Job failed while writing to Bigquery. invalid: Too many errors encountered. Limit is: 0. at
Error: [REASON] invalid [MESSAGE] Data between close double quote (") and field separator: field starts with: <N> [LOCATION] File: 0 / Line:21470 / Field:2
Error: [REASON] invalid [MESSAGE] Too many errors encountered. Limit is: 0. [LOCATION]
I traced back to my file and did find the specified line like:
3D0F92F8-C892-4E6B-9930-6FA254809E58~"N" STYLE TOWING~1~0~5.7.1512.441~10.20.10.25:62342~MSSqlServer: N_STYLE on localhost~3~2015-12-17 01:56:41.720~1~<?xml version="1
The delimiter was set to ~, so why is the double quote, or maybe <N>, a problem?
The specification for csv says that if there is a quote in the field, then the entire field should be quoted. As in a,b,"c,d", which would have only three fields, since the third comma is quoted. The csv parser gets confused when there is data after a closing quote but before the next delimiter, as in a,b,"c,d"e.
You can fix this by specifying a custom quote character. Since it sounds like you don't need a quote char at all, you could just set it to something that you'll never see, like \0 or |. You're already setting configuration.load.delimiter; just set configuration.load.quote as well.
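For instance, with the bq command-line tool the same setting is the --quote flag; here it is set to the empty string to disable quote handling entirely (dataset, table, and file names are placeholders):
bq load --source_format=CSV --field_delimiter='~' --quote='' mydataset.mytable gs://mybucket/myfile.csv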

Inserting quotes in BigQuery

I can easily upload a file delimited by ^
It looks something like...
CN^others^2012-05-03 00:02:25^^^^^Mozilla/5.0^generic web browser^^^^^^^^
CN^others^2012-05-03 00:02:26^^^^^Mozilla/5.0^generic web browser^^^^^^^^
But if I have a double quote somewhere, it fails with an error message...
Line:1 / Field:, Data between close double quote (") and field separator: field starts with:
Too many errors encountered. Limit is: 0.
CN^others^2012-05-03 00:02:25^^^^^"Mozilla/5.0^generic web browser^^^^^^^^
I do regularly get the files with "Mozilla as the browser name, so how do I insert data with double quotes?
Quotes can be escaped with another quote. For example, the field: This field has "internal quotes". would become This field has ""internal quotes"".
sed 's/\"/\"\"/g' should do the trick.
Note that in order to import data that contains quoted newlines, you need to set the allow_quoted_newlines flag to true on the import configuration. This means the import cannot be processed in parallel, and so may be slower than importing data without that flag set.
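For example, with the bq command-line tool that flag looks like this (a sketch; dataset, table, and file names are placeholders):
bq load --source_format=CSV --field_delimiter='^' --allow_quoted_newlines mydataset.mytable gs://mybucket/myfile.csv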