Why do double quotes and <N> cause errors when uploading to BigQuery? - google-bigquery

Errors were reported when my program tried to upload a .csv file to BigQuery via a load job:
Job failed while writing to Bigquery. invalid: Too many errors encountered. Limit is: 0. at
Error: [REASON] invalid [MESSAGE] Data between close double quote (") and field separator: field starts with: <N> [LOCATION] File: 0 / Line:21470 / Field:2
Error: [REASON] invalid [MESSAGE] Too many errors encountered. Limit is: 0. [LOCATION]
I traced back through my file and found the line in question, which looks like:
3D0F92F8-C892-4E6B-9930-6FA254809E58~"N" STYLE TOWING~1~0~5.7.1512.441~10.20.10.25:62342~MSSqlServer: N_STYLE on localhost~3~2015-12-17 01:56:41.720~1~<?xml version="1
The delimiter was set to ~, so why is the double quote (or maybe <N>) a problem?

The CSV specification says that if there is a quote in a field, then the entire field should be quoted, as in a,b,"c,d", which has only three fields, since the third comma is quoted. The CSV parser gets confused when there is data after a closing quote but before the next delimiter, as in a,b,"c,d"e.
You can fix this by specifying a custom quote character. Since it sounds like you don't need a quote character at all, you could set it to something you'll never see, like \0 or |. You're already setting configuration.load.delimiter; just set configuration.load.quote as well.
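For reference, a minimal sketch of that configuration using the Python client library (google-cloud-bigquery); the dataset, table, and file names are placeholders, and field_delimiter/quote_character correspond to configuration.load.delimiter and configuration.load.quote:

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder names; quote_character is set to a character that never
# appears in the data, as suggested above.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    field_delimiter="~",   # configuration.load.delimiter
    quote_character="|",   # configuration.load.quote
)

with open("data.csv", "rb") as f:
    job = client.load_table_from_file(f, "my_dataset.my_table", job_config=job_config)
job.result()  # wait for the load to complete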

Related

Remove double double quotes before copy into snowflake

I am trying to load some CSV data into a Snowflake table. However, I am facing some issues with double double quotes in some rows of the file.
This is the file format I was using inside the COPY INTO command:
file_format=(TYPE=CSV,
FIELD_DELIMITER = '|',
FIELD_OPTIONALLY_ENCLOSED_BY='"',
SKIP_HEADER =1);
As you can see in the example below, I have double double quotes around ID, which is a data quality problem, but I have to deal with it because I cannot change it at its source:
Column1|Column2|Column3|Column4|Column5|Column6|Column7|""ID""|Column9
I tried to replace the double double quotes ("") with a single double quote ("), as the example below depicts:
However, Snowflake is still returning the same error:
Found character 'I' instead of field delimiter '|' File 'XXXX', line 709, character 75 Row 708, column "Column8"["$8":8] If you would like to continue loading when an error is encountered, use other values such as 'SKIP_FILE' or 'CONTINUE' for the ON_ERROR option. For more information on loading options, please run 'info loading_data' in a SQL client.
Do you know how I can deal with this, so that the file content is properly loaded into the Snowflake table?
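For what it's worth, a minimal sketch of the pre-processing described above (collapsing the doubled double quotes before staging the file); the file names are placeholders:

# Replace doubled double quotes ("") with a single double quote (")
# before staging the file for COPY INTO.
with open("source.csv", encoding="utf-8") as src, open("cleaned.csv", "w", encoding="utf-8") as dst:
    for line in src:
        dst.write(line.replace('""', '"'))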

Line contains invalid enclosed character data or delimiter at position

I was trying to load data from a CSV file into Oracle SQL Developer, and when inserting the data I encountered an error which says:
Line contains invalid enclosed character data or delimiter at position
I am not sure how to tackle this problem!
For example:
INSERT INTO PROJECT_LIST (Project_Number, Name, Manager, Projects_M,
Project_Type, In_progress, at_deck, Start_Date, release_date, For_work, nbr,
List, Expenses) VALUES ('5770','"Program Cardinal
(Agile)','','','','','',to_date('', 'YYYY-MM-DD'),'','','','','');
The errors shown were:
--Insert failed for row 4
--Line contains invalid enclosed character data or delimiter at position 79.
--Row 4
I've had success when I've converted the CSV file to Excel via "Save As", changing the format to .xlsx. I then load the .xlsx version in SQL Developer. I think the conversion forces some of the bad formatting out. It worked on at least my last two files.
I fixed it by using the CONCATENATE function in my CSV file first and then uploading it to SQL, which worked.
My guess is that it doesn't like to_date('', 'YYYY-MM-DD'). It's missing a date to format. Is that an actual input of your data?
But it could also be the double quote in "Program Cardinal (Agile), though I don't see why that would get picked up as an invalid character.
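If the stray quote is the culprit, one quick way to find the offending rows is to scan the CSV for lines with an unbalanced double quote. A rough sketch, not part of the original answers, with a placeholder file name:

# Flag lines with an odd number of double quotes, i.e. an unclosed quote
# that an enclosure-aware loader is likely to choke on.
with open("project_list.csv", encoding="utf-8") as f:
    for lineno, line in enumerate(f, start=1):
        if line.count('"') % 2 == 1:
            print(f"line {lineno}: unbalanced double quote -> {line.rstrip()}")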

Embedded Newline Character Issue in Redshift Copy Command

We have fifteen embedded newline characters in a field of a source S3 file. The field size in the target table in Redshift is VARCHAR(5096). The field length in the source file is 5089 bytes. We are escaping each of the fifteen newline characters with a backslash (\) as required by the ESCAPE option of the COPY command. Our expectation with the ESCAPE option is that the backslash (\) we inserted before each newline character will be ignored before loading the target in Redshift. However, when we use the COPY command with the ESCAPE option, we get:
err_code:1204 - String length exceeds DDL length."
Is there a way to keep the added backslash (\) characters from counting toward the target column length when loading into Redshift?
Note: when we truncated the above source field in the file to 4000 bytes and inserted the backslash (\) before the newline characters, the COPY command with the ESCAPE option successfully loaded the field into Redshift. Also, as expected, the backslash (\) characters were not loaded into Redshift.
You could extend your VARCHAR length to allow for more characters.
Or, you could use the TRUNCATECOLUMNS option to load as much as possible without generating an error.
Our understanding of the above issue was incorrect. The backslashes (\) that we had inserted were not causing the error "err_code:1204 - String length exceeds DDL length." The ESCAPE option of the COPY command was in fact not counting the inserted backslash characters toward the target limit, and it was also removing them from the loaded value properly.
The actual issue we were facing was that some of the characters we were trying to load were multibyte UTF-8 characters. Since we were incorrectly assuming them to be 1 byte long, the size of the target field proved to be insufficient. We increased the length of the target field from VARCHAR(5096) to VARCHAR(7096), after which all data was loaded successfully.
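A quick illustration of the character-count vs. byte-count distinction (hypothetical data; Redshift VARCHAR(n) is measured in bytes, not characters):

s = "café" * 1000                  # 4,000 characters, but "é" takes 2 bytes in UTF-8
print(len(s))                      # 4000 characters
print(len(s.encode("utf-8")))      # 5000 bytes -- this is what counts against VARCHAR(n)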

Upload error in S3 for CSV, despite having "ESCAPE ACCEPTINVCHARS"

This seems to be a common problem with a common solution: Uploading a CSV via S3 and getting the Missing newline: Unexpected character error? Just add ESCAPE ACCEPTINVCHARS to your COPY statement!
So I did that and still get the error.
My CSV looks like this:
email, step1_timestamp, step2_timestamp, step3_timestamp, step4_timestamp, url, type
fake#email.gov, 2015-01-28 12:1I:05, 2015-01-28 12:1I:05, NULL, NULL, notasite.gov, M Final
wrong#email.net, 2015-01-28 12:7I:19, NULL, NULL, NULL, notasite.gov/landing, M
I successfully upload to S3 and run the following COPY:
COPY <my_table> FROM 's3://<my_bucket>/<my_folder>/uploadaws.csv'
CREDENTIALS 'aws_access_key_id=<my_id>;aws_secret_access_key=<'
REGION 'us-west-1'
DELIMITER ','
null as '\00'
IGNOREHEADER 1
ESCAPE ACCEPTINVCHARS;
My error code:
Missing newline: Unexpected character 0x6e found at location 4194303
The first characters of the error:
:05,,,,,M Final
xxxx#yyyyy.com,2015-01-28 12:1I:05,,,,,M Final
xxx.xxx#yyyy.com,2015-01-28 12:1I:05,,,,,M Final
xxxx
Your file probably just needs a newline at the end of the very last row.
ACCEPTINVCHARS won't help, as it is for files that contain invalid UTF-8 code points or control characters.
ESCAPE is for loading embedded quotes in files with quoted data. Your file would have to be specially prepared for that.
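A small sketch of the suggested fix, assuming the file can be patched locally before re-uploading to S3 (the file name mirrors the COPY statement above):

# Append a trailing newline if the last row is missing one.
with open("uploadaws.csv", "rb+") as f:
    f.seek(0, 2)                   # jump to the end of the file
    if f.tell() > 0:
        f.seek(-1, 2)
        if f.read(1) != b"\n":
            f.write(b"\n")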

Inserting quotes in BigQuery

I can easily upload a file delimited by ^
It looks something like...
CN^others^2012-05-03 00:02:25^^^^^Mozilla/5.0^generic web browser^^^^^^^^
CN^others^2012-05-03 00:02:26^^^^^Mozilla/5.0^generic web browser^^^^^^^^
But if I have a double quote somewhere, it fails with an error message...
Line:1 / Field:, Data between close double quote (") and field separator: field starts with:
Too many errors encountered. Limit is: 0.
CN^others^2012-05-03 00:02:25^^^^^"Mozilla/5.0^generic web browser^^^^^^^^
I regularly get files with "Mozilla as the browser name; how do I insert data with double quotes?
Quotes can be escaped with another quote. For example, the field: This field has "internal quotes". would become This field has ""internal quotes"".
sed 's/\"/\"\"/g' should do the trick.
Note that in order to import data that contains quoted newlines, you need to set the allow_quoted_newlines flag to true on the import configuration. This means the import cannot be processed in parallel, and so may be slower than importing data without that flag set.
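For reference, a minimal sketch of such a load configuration with the Python client library (table and file names are placeholders; the CSV is assumed to already have its internal quotes doubled, e.g. via the sed command above):

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    field_delimiter="^",          # the ^ delimiter used above
    allow_quoted_newlines=True,   # required only if quoted fields contain newlines
)
with open("data.csv", "rb") as f:
    client.load_table_from_file(f, "my_dataset.my_table", job_config=job_config).result()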