Is there a way I can read each line of a logfile into its own field? I thought that with '\n' as the delimiter I should be able to achieve that.
File - test
Audit file /u01/app/oracle/admin/st01/adump/st011_ora_27063_1.aud
Node name: test0041
CLIENT USER:[6] 'oracle'
So I would like to read this into three fields as
filename - Audit file /u01/app/oracle/admin/st01/adump/st011_ora_27063_1.aud
nodename - Node name: test0041
username - CLIENT USER:[6] 'oracle'
I tried this, but it didn't help.
A = LOAD 'test' using PigStorage ('\n') AS (filename, nodename, username);
You can't use '\n' as a delimiter with PigStorage. According to the Pig 0.10 docs:
Record Delimiters – For load statements Pig interprets the line feed ( '\n' ), carriage return ( '\r' or CTRL-M) and combined CR + LF ( '\r\n' ) characters as record delimiters (do not use these characters as field delimiters). For store statements Pig uses the line feed ('\n') character as the record delimiter.
If you want to parse the log file, you'll have to write a custom loader.
If your file really is that small, why don't you pre-process it before the LOAD, for example by converting '\n' to '\t'?
The Redshift UNLOAD command is replacing " with "".
Example:
UNLOAD($$ select '"Jane"' as name $$)
TO 's3://s3-bucket/test_'
IAM_ROLE 'arn:aws:iam::xxxxxx:role/xxxxxx'
HEADER
CSV
DELIMITER ','
ALLOWOVERWRITE
The output looks like: ""Jane""
If I run the same command with select 'Jane' as name, the output has no quotes at all, just Jane. But I need the output to be "Jane".
You are asking for the unloaded file to be in CSV format, and CSV format says that if you want a double quote in your data you need to escape it with another double quote. See https://datatracker.ietf.org/doc/html/rfc4180
So Redshift is doing exactly what you requested. Now, if you just want a comma-delimited file, then you don't want to use "CSV", as this adds all the characters necessary to make the file fully compliant with the CSV specification.
This choice will come down to what tool or tools are reading the file and whether they expect an RFC-compliant CSV or just a simple file where fields are separated by commas.
This is a gripe of mine - tools that say they read CSV but don't follow the spec. If you say CSV, then follow the format. Or call what you read something different, like CDV - comma delimited values.
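If what you actually need is the literal value "Jane" (quotes included) in the unloaded file, a minimal sketch, reusing the placeholders from the question and untested against a real cluster, is to keep DELIMITER but drop the CSV option so Redshift does not apply the quote-doubling:
UNLOAD($$ select '"Jane"' as name $$)
TO 's3://s3-bucket/test_'
IAM_ROLE 'arn:aws:iam::xxxxxx:role/xxxxxx'
HEADER
DELIMITER ','
ALLOWOVERWRITE
Without CSV you also lose the quoting that protects embedded delimiters and newlines, so this is only safe if the data cannot contain the delimiter itself (or if you add the ESCAPE option).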
I'm querying some tables on Athena (Presto SQL) and then downloading the generated CSV file to use locally. Opening the file, I realised the data contains newline characters that don't appear in the AWS interface, only in the CSV, and I need to get rid of them. I tried using the function replace(string, search, replace) → varchar to escape the newline character, replacing \n with \\n, without success:
SELECT
p.recvepoch, replace(p.description, '\n', '\\n') AS description
FROM
product p
LIMIT 1000
How can I achieve that?
The problem was that the underlying table data doesn't actually contain a literal \n anywhere; instead, it contains the actual newline character, which is represented by chr(10). I was able to achieve the expected behaviour by using the replace function and passing it as a parameter:
SELECT
p.recvepoch, replace(p.description, chr(10), '\n') AS description
FROM
product p
LIMIT 1000
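As a side note, if the data can also contain carriage returns (Windows line endings), a similar sketch using Presto's regexp_replace strips both characters in one pass; the column and table names are the same ones as above, and removing the line breaks (rather than escaping them) is assumed to be acceptable:
SELECT
    p.recvepoch, regexp_replace(p.description, '[\r\n]', ' ') AS description
FROM
    product p
LIMIT 1000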
I've been successfully exporting Google Cloud SQL to CSV with its default delimiter ",". I want to import this CSV into Google BigQuery, and I've succeeded in doing this.
However, I'm experiencing a little problem. There's a "," in some of my cells/fields. It causes the BigQuery import process to not work properly. For example:
"Budi", "19", "Want to be hero, and knight"
My questions are:
Is it possible to export Google Cloud SQL with custom delimiter e.g. "|"?
If not, how can the sample data above be imported into Google BigQuery as 3 fields/cells?
Cheers.
Is it possible to export Google Cloud SQL with custom delimiter e.g. "|"?
Yes it is. See the BigQuery documentation page on how to set load options, provided in this link.
You will need to add --field_delimiter='|' to your command.
From the documentation:
(Optional) The separator for fields in a CSV file. The separator can be any ISO-8859-1 single-byte character. To use a character in the range 128-255, you must encode the character as UTF8. BigQuery converts the string to ISO-8859-1 encoding, and uses the first byte of the encoded string to split the data in its raw, binary state. BigQuery also supports the escape sequence "\t" to specify a tab separator. The default value is a comma (,).
As far as I know there's no way of setting a custom delimiter when exporting from CloudSQL to CSV. I attempted to introduce my own delimiter by formulating my select query like so:
select column_1||'|'||column_2 from foo
But this only results in Cloud SQL wrapping the whole result in double quotes in the resulting CSV. This also aligns with the documentation, which states:
Exporting in CSV format is equivalent to running the following SQL statement:
SELECT <query> INTO OUTFILE ... CHARACTER SET 'utf8mb4'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'
ESCAPED BY '\\' LINES TERMINATED BY '\n'
https://cloud.google.com/sql/docs/mysql/import-export/exporting
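For completeness, the MySQL-flavoured form of that single-column trick would use CONCAT_WS (in MySQL's default SQL mode, || is logical OR rather than string concatenation), but it runs into exactly the same quoting behaviour described above, so treat it as a sketch of the attempt rather than a working workaround:
select CONCAT_WS('|', column_1, column_2) from foo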
I am trying to load a text file that is delimited with a pipe (|) into a Greenplum table. But some special characters in a column, like 'ÉCLAIR', are causing the load to fail. Is there any option in Greenplum gpload that will load the data into the table without issues?
I am using a YAML file like this:
GPLOAD:
  INPUT:
    - SOURCE:
        FILE: [ /testfile.dat ]
    - FORMAT: TEXT
    - DELIMITER: '|'
    - ENCODING: 'LATIN1'
    - NULL_AS: ''
    - ERROR_LIMIT: 10000
    - ERROR_TABLE:
Is there any other option in gpload that we can use to load the file?
I am creating the file to load from Teradata, and because the Teradata columns contain special characters, it is causing issues in Greenplum as well.
You can try adding: - ESCAPE: 'OFF' in the input section.
You may need to change the ENCODING to something that recognizes those special characters. LATIN9 maybe?
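For reference, here is roughly how those two suggestions look if you spell out the external table yourself (gpload builds an external table like this under the hood). The column list, host, and port below are hypothetical placeholders, so treat it as a sketch rather than a drop-in script:
CREATE EXTERNAL TABLE ext_testfile (
    -- hypothetical columns; match them to your target table
    col1 text,
    col2 text
)
LOCATION ('gpfdist://etl_host:8081/testfile.dat')
FORMAT 'TEXT' (DELIMITER '|' NULL '' ESCAPE 'OFF')
ENCODING 'LATIN1'
LOG ERRORS SEGMENT REJECT LIMIT 10000 ROWS;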
Jim McCann
Pivotal
I am trying to import a CSV file into BQ from GS.
The cmd I use is:
$ bq load --field_delimiter=^ --quote='"' --allow_quoted_newlines \
    --allow_jagged_rows --ignore_unknown_values wr_dev.drupal_user_profile gs://fls_csv_files/user_profileA.csv \
    uid:string,first_name:string,last_name:string,category_id:string,logo_type:string,country_id:string,phone:string,phone_2:string,address:string,address_2:string,city:string,state:string,zip:string,company_name:string,created:string,updated:string,subscription:string
the reported error is
File: 0 / Line:1409 / Field:14, Data between close double quote (")
and field separator: field starts with: <Moreno L>
sample data is:
$ sed -n '1409,1409p' user_profileA.csv
1893^"Moreno"^"Jackson"^17^0^1^"517-977-1133"^"517-303-3717"^""^""^""^""^""^"Moreno L Jackson \"THE MOTIVATOR!\" "^0^1282240785^1
which was generated from MySQL with:
SELECT * INTO OUTFILE '/opt/mysql_exports/user_profileA.csv'
FIELDS TERMINATED BY '^'
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
FROM p;
Why do I get the error message in BQ? How do I properly export CSV files from MySQL that have newlines (CR and LF mixed, since it was user input from Windows or Mac)?
A couple of job IDs:
Job ID: aerobic-forge-504:bqjob_r75d28c332a179207_0000014710c6969d_1
Job ID: aerobic-forge-504:bqjob_r732cb544f96e3d8d_0000014710f8ffe1_1
Update
Apparently there's more to this. I used 5.5.34-MariaDB-wsrep-log INTO OUTFILE, and either it's a bug or something is wrong, but I get invalid CSV exports. I had to use another tool to export a proper CSV. (tool: SQLyog)
It has problems with double quotes; for example, field 14 here has an error:
3819^Ron ^Wolbert^6^0^1^6123103169^^^^^^^""Lil"" Ron's^0^1282689026^1
UPDATE 2019:
Try this as an alternative:
Load the MySQL backup files into a Cloud SQL instance.
Read the data in BigQuery straight out of MySQL.
Longer how-to:
https://medium.com/google-cloud/loading-mysql-backup-files-into-bigquery-straight-from-cloud-sql-d40a98281229
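Reading the data in BigQuery straight out of MySQL can be done with Cloud SQL federated queries (EXTERNAL_QUERY); a minimal sketch, assuming a connection resource already exists (the connection ID and the inner query below are placeholders):
SELECT *
FROM EXTERNAL_QUERY(
    'projects/your-project/locations/us/connections/your-cloudsql-connection',
    'SELECT uid, company_name FROM p;'
);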
The proper way to encode a double quote in CSV is to put another double quote in front of it.
So instead of:
"Moreno L Jackson \"THE MOTIVATOR!\"...
Have:
"Moreno L Jackson ""THE MOTIVATOR!""...