Custom delimiter while exporting Google Cloud SQL to CSV - google-bigquery

I've been successfully exporting Google Cloud SQL to CSV with its default delimiter ",". I want to import this CSV into Google BigQuery, and I've succeeded in doing so.
However, I'm running into a small problem. Some of my cells/fields contain a ",", which causes the BigQuery import process to break. For example:
"Budi", "19", "Want to be hero, and knight"
My questions are:
Is it possible to export Google Cloud SQL with a custom delimiter, e.g. "|"?
If not, how can the sample data above be imported into Google BigQuery as 3 separate fields/cells?
Cheers.

Is it possible to export Google Cloud SQL with a custom delimiter, e.g. "|"?
Yes, it is. See the BigQuery documentation page on how to set load options, provided in this link.
You will need to add --field_delimiter='|' to your command.
From the documentation:
(Optional) The separator for fields in a CSV file. The separator can be any ISO-8859-1 single-byte character. To use a character in the range 128-255, you must encode the character as UTF8. BigQuery converts the string to ISO-8859-1 encoding, and uses the first byte of the encoded string to split the data in its raw, binary state. BigQuery also supports the escape sequence "\t" to specify a tab separator. The default value is a comma (,).
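For example, the load command could look roughly like this (a minimal sketch; the dataset, table, bucket, and schema names are assumptions, not taken from the question):
# hedged sketch: load a pipe-delimited export into BigQuery (names are placeholders)
bq load --source_format=CSV --field_delimiter='|' \
  mydataset.people gs://mybucket/export.csv \
  name:STRING,age:STRING,description:STRING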

As far as I know there's no way of setting a custom delimiter when exporting from CloudSQL to CSV. I attempted to introduce my own delimiter by formulating my select query like so:
select column_1||'|'||column_2 from foo
But this only results in Cloud SQL wrapping the whole concatenated result in double quotes in the exported CSV. This also aligns with the documentation, which states:
Exporting in CSV format is equivalent to running the following SQL statement:
SELECT <query> INTO OUTFILE ... CHARACTER SET 'utf8mb4'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
ESCAPED BY '\\' LINES TERMINATED BY '\n'
https://cloud.google.com/sql/docs/mysql/import-export/exporting
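For reference, the export attempt described above would be run with something along these lines (a hedged sketch; the instance, database, and bucket names are assumptions):
# hedged sketch: export a custom query from Cloud SQL to GCS (instance/database/bucket are placeholders)
gcloud sql export csv my-instance gs://mybucket/foo.csv \
  --database=mydb \
  --query="select column_1||'|'||column_2 from foo"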

Related

" replaced by ""

The Redshift UNLOAD command is replacing " with "". Example:
UNLOAD($$ select '"Jane"' as name $$)
TO 's3://s3-bucket/test_'
IAM_ROLE 'arn:aws:iam::xxxxxx:role/xxxxxx'
HEADER
CSV
DELIMITER ','
ALLOWOVERWRITE
The output looks like: ""Jane""
If I run the same command with select 'Jane' as name, the output has no quotes at all, just Jane. But I need the output to be "Jane".
You are asking for the unloaded file to be in CSV format, and the CSV format says that if you want a double quote in your data you need to escape it with another double quote. See https://datatracker.ietf.org/doc/html/rfc4180
So Redshift is doing exactly what you requested. If you just want a comma-delimited file, then you don't want to use "CSV", as that option adds all the characters necessary to make the file fully compliant with the CSV specification.
This choice will come down to which tool or tools are reading the file and whether they expect an RFC-compliant CSV or just a simple file where fields are separated by commas.
This is a gripe of mine: tools that say they read CSV but don't follow the spec. If you say CSV, then follow the format. Or call what you read something different, like CDV (comma-delimited values).
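To illustrate, dropping the CSV option gives a plain comma-delimited unload that leaves the embedded quotes as-is (a hedged sketch based on the command in the question; the bucket and role ARN are the question's placeholders):
-- hedged sketch: plain delimited unload, no CSV quoting/escaping applied
UNLOAD($$ select '"Jane"' as name $$)
TO 's3://s3-bucket/test_'
IAM_ROLE 'arn:aws:iam::xxxxxx:role/xxxxxx'
DELIMITER ','
ALLOWOVERWRITE;
The unloaded line should then contain "Jane" with single, unescaped double quotes.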

How to escape double quotes within data when it is already enclosed by double quotes

I have comma-separated CSV data like the example below, which has to be imported into a Snowflake table using the COPY command.
"1","2","3","2"In stick"
Since I am already passing the parameter OPTIONALLY_ENCLOSED_BY = '"' to the COPY command, I couldn't escape the " (double quote) within the data ("2"In stick").
The imported data that I want to see in the table is like below
1,2,3,2"In stick
Can someone please help here? Thanks!
If you are on Windows, I have a funny solution for that. Open the CSV file in MS Excel. Excel consumes the correct enclosing double quotes to show the data in cells and leaves the extra ones in the middle of a cell (as long as each cell is properly separated by commas). Then choose 'replace' and replace the remaining double quotes with something else (like two single quotes, or nothing to remove them entirely). Then save it again as a CSV. I assume other spreadsheet programs would work the same way.
If you have an unescaped quote inside a field which is surrounded by quotes, that isn't really valid CSV. For example, here is an excerpt from the RFC 4180 spec:
If double-quotes are used to enclose fields, then a double-quote
appearing inside a field must be escaped by preceding it with another double quote.
For example:
"aaa","b""bb","ccc"
I think that whatever is generating the CSV file is doing it incorrectly and needs to be fixed before you will be able to load it into Snowflake. I don't think any file_format option will be able to solve this for you since it's not valid CSV.
The CSV row should either look like this:
"1","2","3","2""In stick"
or this:
"1","2","3","2\"In stick"
I had this same problem, and while writing up the question, I found an answer:
Import RFC4180 files (CSV spec) into snowflake? (Unable to create file format that matches CSV RFC spec)
Essentially, set:
Column Separator: Comma
Row Separator: New Line
Header lines to skip: {you have to decide what to put here}
Field optionally enclosed by: Double Quote
Escape Character: None
Escape Unenclosed Field: None
Here is my ALTER statement:
ALTER FILE FORMAT "DB_NAME"."SCHEMA_NAME"."CSV_SPEC3" SET
  COMPRESSION = 'NONE'
  FIELD_DELIMITER = ','
  RECORD_DELIMITER = '\n'
  SKIP_HEADER = 1
  FIELD_OPTIONALLY_ENCLOSED_BY = '\042'
  TRIM_SPACE = FALSE
  ERROR_ON_COLUMN_COUNT_MISMATCH = TRUE
  ESCAPE = 'NONE'
  ESCAPE_UNENCLOSED_FIELD = 'NONE'
  DATE_FORMAT = 'AUTO'
  TIMESTAMP_FORMAT = 'AUTO'
  NULL_IF = ('\\N');
As I mention in the answer, I don't know why the above works, but it is working for me. Go figure.
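For completeness, the file format above would then be referenced in the COPY roughly like this (a hedged sketch; the stage and table names are assumptions):
-- hedged sketch: use the named file format defined above
COPY INTO my_table
FROM @my_stage/data.csv
FILE_FORMAT = (FORMAT_NAME = 'DB_NAME.SCHEMA_NAME.CSV_SPEC3');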

skipLeadingRows=1 in external table definition

In the below example, how can I set the skip leading row option?
bq --location=US query --external_table_definition=sales::Region:STRING,Quarter:STRING,Total_sales:INTEGER#CSV=gs://mybucket/sales.csv 'SELECT Region,Total_sales FROM sales;'
Regards,
Sreekanth
Flag options can be found under the installation home folder; the flag you are looking for is --skip_leading_rows, listed below:
/google-cloud-sdk/platform/bq/bq.py:
--[no]allow_jagged_rows: Whether to allow missing trailing optional columns in
CSV import data.
--[no]allow_quoted_newlines: Whether to allow quoted newlines in CSV import
data.
-E,--encoding: The character encoding used by the input file. Options include:
ISO-8859-1 (also known as Latin-1)
UTF-8
-F,--field_delimiter: The character that indicates the boundary between
columns in the input file. "\t" and "tab" are accepted names for tab.
--[no]ignore_unknown_values: Whether to allow and ignore extra, unrecognized
values in CSV or JSON import data.
--max_bad_records: Maximum number of bad records allowed before the entire job
fails.
(default: '0')
(an integer)
--quote: Quote character to use to enclose records. Default is ". To indicate
no quote character at all, use an empty string.
--[no]replace: If true erase existing contents before loading new data.
(default: 'false')
--schema: Either a filename or a comma-separated list of fields in the form
name[:type].
--skip_leading_rows: The number of rows at the beginning of the source file to
skip.
(an integer)
--source_format: Format of source data. Options include:
CSV
NEWLINE_DELIMITED_JSON
DATASTORE_BACKUP
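As for applying it to the external table definition in the question, one hedged possibility (not verified here) is to generate a table definition file and set the option there instead of using the inline schema (the file names below are assumptions):
# hedged sketch: build a definition file, add skipLeadingRows to it, then query
bq mkdef --source_format=CSV \
  gs://mybucket/sales.csv \
  Region:STRING,Quarter:STRING,Total_sales:INTEGER > sales_def.json
# edit sales_def.json so that its csvOptions contain "skipLeadingRows": 1, then:
bq --location=US query \
  --external_table_definition=sales::sales_def.json \
  'SELECT Region,Total_sales FROM sales;'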

Valid CSV file import fails with Data between close double quote (") and field separator: field starts with

I am trying to import a CSV file into BQ from GS.
The cmd I use is:
$ bq load --field_delimiter=^ --quote='"' --allow_quoted_newlines
--allow_jagged_rows --ignore_unknown_values wr_dev.drupal_user_profile gs://fls_csv_files/user_profileA.csv
uid:string,first_name:string,last_name:string,category_id:string,logo_type:string,country_id:string,phone:string,phone_2:string,address:string,address_2:string,city:string,state:string,zip:string,company_name:string,created:string,updated:string,subscription:string
The reported error is:
File: 0 / Line:1409 / Field:14, Data between close double quote (")
and field separator: field starts with: <Moreno L>
The sample data is:
$ sed -n '1409,1409p' user_profileA.csv
$ 1893^"Moreno"^"Jackson"^17^0^1^"517-977-1133"^"517-303-3717"^""^""^""^""^""^"Moreno L Jackson \"THE MOTIVATOR!\" "^0^1282240785^1
which was generated from MySQL with:
SELECT * INTO OUTFILE '/opt/mysql_exports/user_profileA.csv'
FIELDS TERMINATED BY '^'
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
FROM p;
Why do I get the error message in BQ? How do I properly export CSV files from MySQL that have newlines (CR and LF mixed, as the fields contain user input from Windows and Mac)?
Couple of job IDs:
Job ID: aerobic-forge-504:bqjob_r75d28c332a179207_0000014710c6969d_1
Job ID: aerobic-forge-504:bqjob_r732cb544f96e3d8d_0000014710f8ffe1_1
Update
Apparently there is more to this. I used INTO OUTFILE on 5.5.34-MariaDB-wsrep-log, and either it is a bug or something else is wrong, but I get invalid CSV exports. I had to use another tool (SQLyog) to export proper CSV.
It has problems with double quotes; for example, field 14 here has an error:
3819^Ron ^Wolbert^6^0^1^6123103169^^^^^^^""Lil"" Ron's^0^1282689026^1
UPDATE 2019:
Try this as an alternative:
Load the MySQL backup files into a Cloud SQL instance.
Read the data in BigQuery straight out of MySQL (see the sketch after the link below).
Longer how-to:
https://medium.com/google-cloud/loading-mysql-backup-files-into-bigquery-straight-from-cloud-sql-d40a98281229
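Reading the data straight out of Cloud SQL is done with a federated query; the BigQuery side looks roughly like this (a hedged sketch; the connection name and project are assumptions):
-- hedged sketch of a federated query; the connection resource name is a placeholder
SELECT *
FROM EXTERNAL_QUERY(
  'my-project.us.my-cloudsql-connection',
  'SELECT * FROM p;');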
The proper way to encode a double quote in CSV is to put another double quote in front of it.
So instead of:
"Moreno L Jackson \"THE MOTIVATOR!\"...
Have:
"Moreno L Jackson ""THE MOTIVATOR!""...

Postgres 9.3 end-of-copy marker corrupt - Any way to change this setting?

I am trying to stream data through an AWK program to a Postgres COPY command. This usually works great. However, recently I have been getting long text strings in my data containing '\.' values.
The Postgres documentation mentions that this combination of characters represents the end-of-data marker, http://www.postgresql.org/docs/9.2/static/sql-copy.html, and I am getting the associated errors when trying to insert with COPY.
My question is, is there a way to turn this off? Perhaps change the end-of-data marker to a different combination of characters? Or do I have to alter/remove these strings before trying to insert using the COPY command?
You can try filtering your data through sed 's:\\:\\\\:g', which would change every \ in your data to \\, the correct escape sequence for a single backslash in COPY data.
But I think the backslash is not the only problematic character: newlines should also be encoded as \n, carriage returns as \r, and tabs as \t (tab is the default field delimiter in COPY).
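Put together, the stream might look something like this (a minimal sketch; the awk script, input file, database, and table names are assumptions, and the sed step only handles backslashes, not the newline/tab cases mentioned above):
# hedged sketch of the filtered pipeline; all names are placeholders
awk -f transform.awk raw_input.txt \
  | sed 's:\\:\\\\:g' \
  | psql -d mydb -c "COPY my_table FROM STDIN"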