Greenplum gpload issue while loading a text file using a YAML file - sql

I am trying to load a text file that is delimited with a pipe (|) into a Greenplum table, but some special characters in a column, like 'ÉCLAIR', cause the load to fail. Is there any option in Greenplum gpload that will load the data into the table without this issue?
I am using a YAML file like this:
GPLOAD:
  INPUT:
    - SOURCE:
        FILE: [ /testfile.dat ]
    - FORMAT: TEXT
    - DELIMITER: '|'
    - ENCODING: 'LATIN1'
    - NULL_AS: ''
    - ERROR_LIMIT: 10000
    - ERROR_TABLE:
Is there any other option in gpload that we can use to load the file?
I am creating the file to load from Teradata, and because the Teradata columns contain special characters, it is causing issues in Greenplum as well.

You can try adding: - ESCAPE: 'OFF' in the input section.
You may need to change the ENCODING to something that recognizes those special characters. LATIN9 maybe?
Jim McCann
Pivotal
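If the root cause is simply that the file's bytes do not match the ENCODING declared in the control file, another option (outside of gpload itself) is to re-encode the export before loading. A minimal sketch, assuming the source is UTF-8 and the target is LATIN1; the path and encodings are placeholders:

# Rough pre-processing sketch (not part of gpload): re-encode the Teradata
# export so its bytes match the ENCODING declared in the YAML file.
# The path and encodings below are assumptions; adjust to your environment.
src_encoding = "utf-8"    # what the Teradata export actually contains (assumed)
dst_encoding = "latin-1"  # what ENCODING: 'LATIN1' in the YAML expects

with open("/testfile.dat", encoding=src_encoding) as src, \
     open("/testfile.latin1.dat", "w", encoding=dst_encoding, errors="replace") as dst:
    for line in src:
        dst.write(line)   # characters that do not exist in LATIN1 become '?'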

Related

" replaced by ""

The Redshift UNLOAD command is replacing " with "".
Example:
UNLOAD($$ select '"Jane"' as name $$)
TO 's3://s3-bucket/test_'
iam_role 'arn:aws:iam::xxxxxx:role/xxxxxx'
HEADER
CSV
DELIMITER ','
ALLOWOVERWRITE
The output looks like: ""Jane""
If I run the same command with select 'Jane' as name, the output has no quotes at all, just Jane. But I need the output to be "Jane".
You are asking for the unloaded file to be in CSV format, and the CSV format says that if you want a double quote in your data you need to escape it with another double quote. See https://datatracker.ietf.org/doc/html/rfc4180
So Redshift is doing exactly as you requested. Now, if you just want a comma delimited file, then you don't want to use "CSV", as this option adds all the characters necessary to make the file fully compliant with the CSV specification.
This choice will come down to what tool or tools are reading the file and if they expect an rfc compliant CSV or just a simple file where fields are separated by commas.
This is a gripe of mine - tools that say they read CSV but don't follow the spec. If you say CSV then follow the format. Or call what you read something different, like CDV - comma delimited values.
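To see that the doubling is reversible, here is a quick sketch with Python's csv module (nothing Redshift-specific about it):

import csv, io

# Write a value that contains literal double quotes...
buf = io.StringIO()
csv.writer(buf).writerow(['"Jane"'])
print(buf.getvalue().strip())   # """Jane"""  <- embedded quotes doubled, per RFC 4180

# ...and read it back with an RFC-compliant reader: the doubling disappears.
row = next(csv.reader(io.StringIO(buf.getvalue())))
print(row)                      # ['"Jane"']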

Custom delimiter while exporting Google Cloud SQL to CSV

I've been successfully exporting Google Cloud SQL to CSV with its default delimiter ",". I want to import this CSV into Google BigQuery, and I've succeeded in doing this.
However, I'm experiencing a little problem. There's a "," in some of my cells/fields, which causes the BigQuery import process to not work properly. For example:
"Budi", "19", "Want to be hero, and knight"
My questions are:
Is it possible to export Google Cloud SQL with custom delimiter e.g. "|"?
If not, how can I make the above sample data import into Google BigQuery as 3 fields/cells?
Cheers.
Is it possible to export Google Cloud SQL with custom delimiter e.g. "|"?
Yes it is. See the BigQuery documentation on how to set load options, provided in this link.
You will need to add --field_delimiter='|' to your command.
From the documentation:
(Optional) The separator for fields in a CSV file. The separator can be any ISO-8859-1 single-byte character. To use a character in the range 128-255, you must encode the character as UTF8. BigQuery converts the string to ISO-8859-1 encoding, and uses the first byte of the encoded string to split the data in its raw, binary state. BigQuery also supports the escape sequence "\t" to specify a tab separator. The default value is a comma (,).
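If you load with the BigQuery Python client rather than the bq command, the equivalent setting is LoadJobConfig.field_delimiter. A minimal sketch, with the bucket, dataset and table names as placeholders:

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    field_delimiter="|",   # same effect as --field_delimiter='|'
    skip_leading_rows=1,   # assuming the export has a header row
)

load_job = client.load_table_from_uri(
    "gs://my-bucket/export.csv",       # placeholder URI
    "my_project.my_dataset.my_table",  # placeholder table
    job_config=job_config,
)
load_job.result()  # block until the load job finishes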
As far as I know there's no way of setting a custom delimiter when exporting from CloudSQL to CSV. I attempted to introduce my own delimiter by formulating my select query like so:
select column_1||'|'||column_2 from foo
But this only results in Cloud SQL wrapping the whole result in double quotes in the resulting CSV. This also aligns with the documentation, which states:
Exporting in CSV format is equivalent to running the following SQL statement:
SELECT <query> INTO OUTFILE ... CHARACTER SET 'utf8mb4'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
ESCAPED BY '\\' LINES TERMINATED BY '\n'
https://cloud.google.com/sql/docs/mysql/import-export/exporting

'Missing close double quote (") character' error when there are line feeds in a CSV file when loading data to BigQuery

The culprit line is as follows. It should be composed of 14 columns, with one of the columns, starting with 'Hi I'm Niger...', covering multiple lines with line feeds.
17935,9a7105ee-30c8-4a6d-9374-10875b7d6288.jpg,"""top""=>""0"", ""left""=>""0"", ""width""=>""180"", ""height""=>""180""",,"",2015-07-26 19:33:57.292058,2015-07-26 20:25:30.068887,fe43876f-1b2c-464a-aa20-bf335ed3ff62,c68c8c70-bc2b-11e4-90a1-22000b21105f,{},2e790350-15fb-0133-2cb8-22000ba51078,"Hi I'm Nigerian so wish to study in sweden.
so I'm Undergraduate student I want study Engineering.
Thanks.","",{}
When loading this CSV data into BigQuery via the command bq load --replace --source_format=CSV -F"," ..., the errors below are reported. Could anyone give me a solution for this BigQuery load data command?
- File: 0 / Line:17192 / Field:12: Missing close double quote (")
character: field starts with: <Hi I'm N>
- File: 0 / Line:17193: Too few columns: expected 14 column(s) but
got 1 column(s). For additional help: http://goo.gl/RWuPQ
- File: 0 / Line:17194: Too few columns: expected 14 column(s) but
got 3 column(s). For additional help: http://goo.gl/RWuPQ
If you are loading CSV with embedded newlines, you need to specify allowQuotedNewlines.
https://cloud.google.com/bigquery/docs/reference/v2/jobs#configuration.load.allowQuotedNewlines
The BigQuery default is to assume that CSV data does not contain newlines. This allows for a much higher parsing throughput when dealing with large data files since the input files can be split at arbitrary newlines. If your data contains newlines within strings, each file needs to be parsed linearly by a single machine.
Make sure you include this line before loading data to BigQuery: 'job_config.allow_quoted_newlines = True'
job_config = bigquery.LoadJobConfig()
job_config.allow_quoted_newlines = True
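For context, here is roughly where that setting sits in a complete load call (the URI and table name are placeholders):

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.CSV
job_config.allow_quoted_newlines = True  # quoted fields may contain line feeds

load_job = client.load_table_from_uri(
    "gs://my-bucket/data.csv",  # placeholder
    "my_dataset.my_table",      # placeholder
    job_config=job_config,
)
load_job.result()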
If you are trying to load a CSV file into a table from the BigQuery Google console, make sure you select the Advanced options -> Quoted newlines.

Valid CSV file import fails with Data between close double quote (") and field separator: field starts with

I am trying to import a CSV file into BQ from GS.
The cmd I use is:
$ bq load --field_delimiter=^ --quote='"' --allow_quoted_newlines
--allow_jagged_rows --ignore_unknown_values wr_dev.drupal_user_profile gs://fls_csv_files/user_profileA.csv
uid:string,first_name:string,last_name:string,category_id:string,logo_type:string,country_id:string,phone:string,phone_2:string,address:string,address_2:string,city:string,state:string,zip:string,company_name:string,created:string,updated:string,subscription:string
the reported error is
File: 0 / Line:1409 / Field:14, Data between close double quote (")
and field separator: field starts with: <Moreno L>
sample data is:
$ sed -n '1409,1409p' user_profileA.csv
1893^"Moreno"^"Jackson"^17^0^1^"517-977-1133"^"517-303-3717"^""^""^""^""^""^"Moreno L Jackson \"THE MOTIVATOR!\" "^0^1282240785^1
which was generated from MySQL with:
SELECT * INTO OUTFILE '/opt/mysql_exports/user_profileA.csv'
FIELDS TERMINATED BY '^'
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
FROM p;
Why do I get this error message in BQ? How do I properly export CSV files from MySQL that have newlines (CR and LF mixed, as it was user input from Windows or Mac)?
Couple of job IDs:
Job ID: aerobic-forge-504:bqjob_r75d28c332a179207_0000014710c6969d_1
Job ID: aerobic-forge-504:bqjob_r732cb544f96e3d8d_0000014710f8ffe1_1
Update
Apparently there is more to this. I used 5.5.34-MariaDB-wsrep-log INTO OUTFILE, and either it is a bug or something is wrong, but I get invalid CSV exports. I had to use another tool to export proper CSV (tool: SQLYog).
It has problems with double quotes; for example, field 14 here has an error:
3819^Ron ^Wolbert^6^0^1^6123103169^^^^^^^""Lil"" Ron's^0^1282689026^1
UPDATE 2019:
Try this as an alternative:
Load the MySQL backup files into a Cloud SQL instance.
Read the data in BigQuery straight out of MySQL.
Longer how-to:
https://medium.com/google-cloud/loading-mysql-backup-files-into-bigquery-straight-from-cloud-sql-d40a98281229
The proper way to encode a double quote in CSV is to put another double quote in front of it.
So instead of:
"Moreno L Jackson \"THE MOTIVATOR!\"...
Have:
"Moreno L Jackson ""THE MOTIVATOR!""...

PIG - Read each line into its field

Is there a way I can read each line from a logfile into its own field? I thought that with ('\n') as a delimiter I should be able to achieve that.
File - test
Audit file /u01/app/oracle/admin/st01/adump/st011_ora_27063_1.aud
Node name: test0041
CLIENT USER:[6] 'oracle'
So I would like to read this into three fields as
filename - Audit file /u01/app/oracle/admin/st01/adump/st011_ora_27063_1.aud
nodename - Node name: test0041
username - CLIENT USER:[6] 'oracle'
I tried this but it didn't help.
A = LOAD 'test' using PigStorage ('\n') AS (filename, nodename, username);
You can't use '\n' as a delimiter with PigStorage. According to the Pig10 docs:
Record Deliminters – For load statements Pig interprets the line feed ( '\n' ), carriage return ( '\r' or CTRL-M) and combined CR + LF ( '\r\n' ) characters as record delimiters (do not use these characters as field delimiters). For store statements Pig uses the line feed ('\n') character as the record delimiter.
If you want to parse the log file you'll have to write a custom loader.
If your file is that small, why don't you do some pre-processing of the file, like converting \n to \t, before the LOAD?
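For example, a small script along these lines could do that flattening (file names are made up):

# Sketch of the suggested pre-processing: collapse every 3-line record of the
# log into a single tab-separated line, so PigStorage('\t') can load it as
# three fields (filename, nodename, username).
with open("test") as src, open("test_flat", "w") as dst:
    record = []
    for line in src:
        record.append(line.rstrip("\n"))
        if len(record) == 3:
            dst.write("\t".join(record) + "\n")
            record = []

After that, the LOAD from the question should work with PigStorage('\t') instead of PigStorage('\n').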