'Missing close double quote (") character' is complained when there're line feeds in csv file when loading data to BigQuery - google-bigquery

The culprit line is as follows. It should be composed of 14 columns, with one of the column, starting with 'Hi I'm Niger...', covering multiple line with line feeds.
17935,9a7105ee-30c8-4a6d-9374-10875b7d6288.jpg,"""top""=>""0"", ""left""=>""0"", ""width""=>""180"", ""height""=>""180""",,"",2015-07-26 19:33:57.292058,2015-07-26 20:25:30.068887,fe43876f-1b2c-464a-aa20-bf335ed3ff62,c68c8c70-bc2b-11e4-90a1-22000b21105f,{},2e790350-15fb-0133-2cb8-22000ba51078,"Hi I'm Nigerian so wish to study in sweden.
so I'm Undergraduate student I want study Engineering.
Thanks.","",{}
When loading this csv data into BigQuery via command bq load --replace --source_format=CSV -F"," ..., Error complains. Could anyone give me an solution to this BigQuery Load Data command?
- File: 0 / Line:17192 / Field:12: Missing close double quote (")
character: field starts with: <Hi I'm N>
- File: 0 / Line:17193: Too few columns: expected 14 column(s) but
got 1 column(s). For additional help: http://goo.gl/RWuPQ
- File: 0 / Line:17194: Too few columns: expected 14 column(s) but
got 3 column(s). For additional help: http://goo.gl/RWuPQ

If you are loading CSV with embedded newlines, you need to specify allowQuotedNewlines.
https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.load.allowQuotedNewlines
The BigQuery default is to assume that CSV data does not contain newlines. This allows for a much higher parsing throughput when dealing with large data files since the input files can be split at arbitrary newlines. If your data contains newlines within strings, each file needs to be parsed linearly by a single machine.

Make sure you include this line before loading data to BigQuery: 'job_config.allow_quoted_newlines = True'
job_config = bigquery.LoadJobConfig()
job_config.allow_quoted_newlines = True

If you trying to load a CSV file to a table from the BigQuery google console make sure you select the Advanced option -> Quoted new lines.

Related

ERROR: extra data after last expected column on PostgreSQL while the number of columns is the same

I am new to PostgreSQL and I need to import a set of csv files, but some of them weren't imported successfully. I got the same error with these files: ERROR: extra data after last expected column. I have investigated this error report and learned that these errors occur might because the number of columns of the table is not equal to that in the file. But I don't think I am in this situation.
For example, I create this table:
CREATE TABLE cast_info (
id integer NOT NULL PRIMARY KEY,
person_id integer NOT NULL,
movie_id integer NOT NULL,
person_role_id integer,
note character varying,
nr_order integer,
role_id integer NOT NULL
);
And then I want to copy the csv file:
COPY cast_info FROM '/private/tmp/cast_info.csv' WITH CSV HEADER;
Then I got the error:
**ERROR: extra data after last expected column
CONTEXT: COPY cast_info, line 8801: "612,207,2222077,1,"(segments \"Homies\" - \"Tilt A Whirl\" - \"We don't die\" - \"Halls of Illusions..."**
The complete row in this csv file is as follows:
612,207,2222077,1,"(segments \"Homies\" - \"Tilt A Whirl\" - \"We don't die\" - \"Halls of Illusions\" - \"Chicken Huntin\" - \"Another love song\" - \"How many times?\" - \"Bowling balls\" - \"The people\" - \"Piggy pie\" - \"Hokus pokus\" - \"Let\"s go all the way\" - \"Real underground baby\")/Full Clip (segments \"Duk da fuk down\" - \"Real underground baby\")/Guy Gorfey (segment \"Raw deal\")/Sugar Bear (segment \"Real underground baby\")",2,1
You can see that there's exactly 7 columns as the table has.
The strange thing is, I found that the error lines of all these files contain the characters backslash and quotation mark (\"). Also, these rows are not the only row that contains \" in the files. I wonder why this error doesn't appear in other rows. Because of that, I am not sure if this is the problem.
After modifying these rows (e.g. replace the \" or delete the content while remaining the commas), there are new errors: ERROR: invalid input syntax for line 2 of every file. And the errors occur because the data in the last column of these rows have been added three semicolons(;;;) for no reason. But when I open these csv files, I can't see the three semicolons in those rows.
For example, after deleting the content in the fifth column of this row:
612,207,2222077,1,,2,1
I got the error:
**ERROR: invalid input syntax for type integer: "1;;;"
CONTEXT: COPY cast_info, line 2, column role_id: "1;;;"**
While the line 2 doesn't contain three semicolons, as follows:
2,2,2163857,1,,25,1
In principle, I hope the problem can be solved without any modification to the data itself. Thank you for your patience and help!
The CSV format protects quotation marks by doubling them, not by backslashing them. You could use the text format instead, except that that doesn't support HEADER, and also it would then not remove the outer quote marks. You could instead tweak the files on the fly with a program:
COPY cast_info FROM PROGRAM 'sed s/\\\\/\"/g /private/tmp/cast_info.csv' WITH CSV;
This works with the one example you gave, but might not work for all cases.
ERROR: invalid input syntax for line 2 of every file. And the errors
occur because the data in the last column of these rows have been
added three semicolons(;;;) for no reason. But when I open these csv
files, I can't see the three semicolons in those rows
How are you editing and viewing these files? Sounds like you are using something that isn't very good at preserving formatting, like Excel.
Try actually naming the columns you want processed in the copy statement:
copy cast_info (id, person_id, movie_id, person_role_id, note, nr_order, role_id) from ...
According to a friend's suggestion, I need to specify the backslashes as escape characters:
copy <table_name> from '<csv_file_path>' csv escape '\';
and then the problem is solved.

How to remove new line characters from data rows in Presto/AWS Athena?

I'm querying some tables on Athena (Presto SAS) and then downloading the generated CSV file to use locally. Opening the file, I realised the data contains new line characters that doesn't appear on AWS interface, only in the CSV and need to get rid of them. Tried using the function replace(string, search, replace) → varchar to skip the newline char replacing \n for \\n without success:
SELECT
p.recvepoch, replace(p.description, '\n', '\\n') AS description
FROM
product p
LIMIT 1000
How can I achieve that?
The problem was that the underlying table data doesn't actually contains \n anywhere, instead, the actual newline character, which is represented by char(10). I was able to achieve the expected behaviour using the replace function passing it as parameter:
SELECT
p.recvepoch, replace(p.description, chr(10), '\n') AS description
FROM
product p
LIMIT 1000

Custom delimiter while exporting Google Cloud SQL to CSV

I've been successfully exporting GCloud SQL to CSV with its default delimiter ",". I want to import this CSV to Google Big Query and I've succeed to do this.
However, I'm experiencing a little problem. There's "," in some of my cell/field. It causes Big Query import process not working properly. For Example:
"Budi", "19", "Want to be hero, and knight"
My questions are:
Is it possible to export Google Cloud SQL with custom delimiter e.g. "|"?
If not, how to make above sample data to be imported in Google Big Query and become 3 field/cell?
Cheers.
Is it possible to export Google Cloud SQL with custom delimiter e.g. "|"?
Yes it's, See the documentation page of BigQuery how to set load options provided in this link
You will need to add --field_delimiter = '|' to your command
From the documentation:
(Optional) The separator for fields in a CSV file. The separator can be any ISO-8859-1 single-byte character. To use a character in the range 128-255, you must encode the character as UTF8. BigQuery converts the string to ISO-8859-1 encoding, and uses the first byte of the encoded string to split the data in its raw, binary state. BigQuery also supports the escape sequence "\t" to specify a tab separator. The default value is a comma (,).
As far as I know there's no way of setting a custom delimiter when exporting from CloudSQL to CSV. I attempted to introduce my own delimiter by formulating my select query like so:
select column_1||'|'||column_2 from foo
But this only results in CloudSQL escaping the whole result in the resulting CSV with double quotes. This also aligns with the documentation which states:
Exporting in CSV format is equivalent to running the following SQL statement:
SELECT <query> INTO OUTFILE ... CHARACTER SET 'utf8mb4'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
ESCAPED BY '\\' LINES TERMINATED BY '\n'
https://cloud.google.com/sql/docs/mysql/import-export/exporting

Loading huge csv file using COPY

I am loading CSV file using COPY.
COPY cts FROM 'C:\...\cts.csv' using DELIMITERS',';
However, error comes out
ERROR: invalid input syntax for type double precision: ""
CONTEXT: COPY testdata, line 7, column latitude: ""
How to fix it please?
Looks like your CSV isn't quite formatted correctly. "" isn't a number, and numbers don't need to be be quoted in CSV.
I find it's usually easier in PostgreSQL to create a staging import table with all text columns, and import CSVs to there first. Then do a cleanup query to put the CSV data into the real table.

How to make Postgres Copy ignore first line of large txt file

I have a fairly large .txt file ~9gb and I will like to load this txt file into postgres. The first row is the header, followed by all the data. If I postgres COPY the data directly, the header will cause an error that data type do not match with my postgres table, so I will need to remove it somehow.
Sample data:
ProjectId,MailId,MailCodeId,prospectid,listid,datemailed,amount,donated,zip,zip4,VectorMajor,VectorMinor,packageid,phase,databaseid,amount2
15,53568419,89734,219906,15,2011-05-11 00:00:00,0,0,90720,2915,NonProfit,POLICY,230,3,1,0
16,84141863,87936,164657,243,2011-03-10 00:00:00,0,0,48362,2523,NonProfit,POLICY,1507,5,1,0
16,81442028,86632,15181625,243,2011-01-19 00:00:00,0,0,11501,2115,NonProfit,POLICY,1508,2,1,0
While the COPY function for postgres has the "header" setting that can ignore the first row, it only works for csv files:
copy training from 'C:/testCSV.csv' DELIMITER ',' csv header;
when I try to run the code above on my txt file, it gets an error:
copy training from 'C:/testTXTFile.txt' DELIMITER ',' csv header
ERROR: unquoted newline found in data
HINT: Use quoted CSV field to represent newline.
I have tried adding "quote" and "escape" attributes but the command just won't seem to work for txt file:
copy training from 'C:/testTXTFile.txt' DELIMITER ',' csv header quote as E'"' escape as E'\\N';
ERROR: COPY escape must be a single one-byte character
Alternatively, I thought about running java or create a seperate stagging table to remove the first row...but these solutions are expansive and time consuming. I will need to load 9gb of data just to remove the first row of headers... are there other solutions out there to remove the first row of a txt file easily so that I can load the data into my postgres database?
Use HEADER option with CSV option:
\copy <table_name> from '/source_file.csv' delimiter ',' CSV HEADER ;
HEADER
Specifies that the file contains a header line with the names of each column in the file. On output, the first line contains the column names from the table, and on input, the first line is ignored. This option is allowed only when using CSV format.
I've looked up docs at https://www.postgresql.org/docs/10/sql-copy.html
written about HEADER is not only true for CSV, but TSV also!
My solution was this in psql
\COPY mytable FROM 'mydata.tsv' DELIMITER E'\t' CSV HEADER;
(in addition mydata.tsv contaned header row which I excluded from copying to database table)