Does Hive accept CRLF line terminators?

Does Hive accept lines terminated by '\r\n'? I have to generate a text file on Windows and want to use CRLF for line termination. If so, can you let me know whether the clause below is correct?
LINES TERMINATED BY '\r\n'

In the Hive documentation, the line separator is a single char, so it should not accept two characters:
row_format : DELIMITED
[FIELDS TERMINATED BY char [ESCAPED BY char]]
[COLLECTION ITEMS TERMINATED BY char]
[MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]
[NULL DEFINED AS char] (Note: Only available starting with Hive 0.13)

Currently only '\n' is supported.
Check out these JIRA tickets:
Allow other characters for LINES TERMINATED BY
Row Delimiter other than '\n' throws error in Hive.
Demo
hive> create table t (i int) row format delimited lines terminated by '\r\n' location '/tmp';
FAILED: SemanticException 1:64 LINES TERMINATED BY only supports newline '\n' right now. Error encountered near token ''\r\n''
hive>
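A hedged workaround sketch, not from the answers above: keep the default LINES TERMINATED BY '\n' and clean up any stray '\r' at query time. Whether a trailing '\r' actually shows up depends on the input format (Hadoop's line reader often strips CRLF already); the table and column names below are hypothetical.
-- Hypothetical table loaded from a Windows-generated (CRLF) file.
CREATE TABLE win_file (id INT, note STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n';
-- Defensive cleanup: strip a trailing carriage return from the last column, if present.
SELECT id, regexp_replace(note, '\r$', '') AS note_clean FROM win_file;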

Related

"unquoted newline found in data" and "date/time field value out of range" errors while copying into pgAdmin

CREATE TABLE procedurehistory (
petid varchar,
proceduredate date,
proceduretype varchar,
proceduresubcode varchar
);
COPY procedurehistory FROM 'C:\Users\LENOVO\Desktop\Databases\ProceduresHistory.csv' DELIMITER ',' CSV HEADER;
The CSV file in question is https://sds-platform-private.s3-us-east-2.amazonaws.com/uploads/P9-ProceduresHistory.csv
Error faced
ERROR: unquoted newline found in data
HINT: Use quoted CSV field to represent a newline.
CONTEXT: COPY procedurehistory, line 103
SQL state: 22P04
Then I tried adding a page break in the Excel file where the data ended.
Now I'm facing a new error:
ERROR: date/time field value out of range: "13-01-16"
HINT: Perhaps you need a different "datestyle" setting.
CONTEXT: COPY procedurehistory, line 13, column proceduredate: "13-01-16"
SQL state: 22008
If you open the CSV file in vim, you'll see that some lines don't end with a carriage return (^M). Since the header line of your particular file does end with ^M, COPY expects every record to end the same way. One solution is to add a ^M (typed as Ctrl-V then Ctrl-M) to each line that doesn't have one; the simpler one is to strip all carriage returns from the CSV:
-bash-4.2$ psql
psql (9.6.17)
Type "help" for help.
postgres=# COPY procedurehistory FROM '/tmp/foo.csv' DELIMITER ',' CSV HEADER;
ERROR: unquoted newline found in data
HINT: Use quoted CSV field to represent newline.
CONTEXT: COPY procedurehistory, line 103: ""
postgres=# \q
-bash-4.2$ sed 's/\r//g' /tmp/P9-ProceduresHistory.csv > /tmp/foo.csv
-bash-4.2$ psql
psql (9.6.17)
Type "help" for help.
postgres=# COPY procedurehistory FROM '/tmp/foo.csv' DELIMITER ',' CSV HEADER;
COPY 2284
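For the second error in the question (date/time field value out of range), a hedged sketch, assuming the dates in the file are day-month-year (which "13-01-16" suggests): tell PostgreSQL to read ambiguous dates as DMY before running the COPY.
-- Assumption: dates like 13-01-16 are day-month-year.
SET datestyle = 'ISO, DMY';
COPY procedurehistory FROM '/tmp/foo.csv' DELIMITER ',' CSV HEADER;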

Embedded Newline Character Issue in Redshift Copy Command

We have fifteen embedded newline characters in a field of a source S3 file. The field size in the target table in Redshift is VARCHAR(5096); the field length in the source file is 5089 bytes. We are escaping each of the fifteen newline characters with a backslash \ as required by the ESCAPE option of the COPY command. Our expectation with the ESCAPE option is that the backslash \ we inserted before each newline character will be ignored before loading the target in Redshift. However, when we use the COPY command with the ESCAPE option we get:
err_code:1204 - String length exceeds DDL length.
Is there a way to keep the added backslash \ characters from being counted toward the target column length in Redshift?
Note: when we truncated the above source field to 4000 bytes and inserted the backslash \ before the newline characters, the COPY command with the ESCAPE option loaded the field into Redshift successfully. Also, the backslash \ characters were not loaded into Redshift, as expected.
You could extend your VARCHAR length to allow for more characters.
Or, you could use the TRUNCATECOLUMNS option to load as much as possible without generating an error.
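A minimal sketch of such a COPY, assuming a pipe-delimited file; the table name, S3 path, and IAM role below are placeholders:
-- ESCAPE honors the backslash-escaped newlines; TRUNCATECOLUMNS silently trims overlong values.
COPY target_table
FROM 's3://my-bucket/path/source_file.txt'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-load-role'
DELIMITER '|'
ESCAPE
TRUNCATECOLUMNS;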
Our understanding of the above issue was incorrect. The backslashes \ that we had inserted were not causing the "err_code:1204 - String length exceeds DDL length." error. The ESCAPE option of the COPY command was in fact not counting the inserted backslashes toward the target limit and was removing them from the loaded value properly.
The actual issue we were facing was that some of the characters we were trying to load were multibyte UTF-8 characters. Since we were incorrectly assuming them to be 1 byte each, the target field turned out to be too small. We increased the length of the target field from VARCHAR(5096) to VARCHAR(7096), after which all the data loaded successfully.

How to import data from CSV where some fields contain "," using LOAD DATA INFILE

I have some columns I want to insert into an existing table. Some columns have contents like "how, using,list, file", and each column is separated by ",". So how do I use LOAD DATA INFILE to import them?
You didn't indicate which DB you're using, so I'll answer for MySQL.
Please see this.
If the lines in such a file are terminated by carriage return/newline pairs, the statement shown here illustrates the field- and line-handling options you would use to load the file:
LOAD DATA INFILE 'data.txt' INTO TABLE tbl_name
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
IGNORE 1 LINES;
And further down the page:
Occurrences of the ENCLOSED BY character within a field value are escaped by prefixing them with the ESCAPED BY character. Also note that if you specify an empty ESCAPED BY value, it is possible to inadvertently generate output that cannot be read properly by LOAD DATA INFILE. For example, the preceding output just shown would appear as follows if the escape character is empty. Observe that the second field in the fourth line contains a comma following the quote, which (erroneously) appears to terminate the field:
1,"a string",100.20
2,"a string containing a , comma",102.20
3,"a string containing a " quote",102.20
4,"a string containing a ", quote and comma",102.20
So, if you have unescaped " characters inside columns, the data cannot be imported in the general case; you'll either have to export the data with a correct ENCLOSED BY/ESCAPED BY combination, or preprocess the file to escape the " characters first.
If you just have , inside your columns, then it's easy: use ENCLOSED BY and TERMINATED BY, as in the sketch below.
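A minimal sketch for that case; the file path, table name, and line terminator are assumptions:
-- Fields are separated by ',' and wrapped in double quotes, so embedded commas survive.
LOAD DATA INFILE '/path/to/your_file.csv' INTO TABLE your_table
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n';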

Valid CSV file import fails with Data between close double quote (") and field separator: field starts with

I am trying to import a CSV file into BQ from GS.
The cmd I use is:
$ bq load --field_delimiter=^ --quote='"' --allow_quoted_newlines \
    --allow_jagged_rows --ignore_unknown_values wr_dev.drupal_user_profile gs://fls_csv_files/user_profileA.csv \
    uid:string,first_name:string,last_name:string,category_id:string,logo_type:string,country_id:string,phone:string,phone_2:string,address:string,address_2:string,city:string,state:string,zip:string,company_name:string,created:string,updated:string,subscription:string
the reported error is
File: 0 / Line:1409 / Field:14, Data between close double quote (")
and field separator: field starts with: <Moreno L>
sample data is:
$ sed -n '1409,1409p' user_profileA.csv
1893^"Moreno"^"Jackson"^17^0^1^"517-977-1133"^"517-303-3717"^""^""^""^""^""^"Moreno L Jackson \"THE MOTIVATOR!\" "^0^1282240785^1
which was generated from MySQL with:
SELECT * INTO OUTFILE '/opt/mysql_exports/user_profileA.csv'
FIELDS TERMINATED BY '^'
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
FROM p;
Why do I get this error in BQ? How do I properly export CSV files from MySQL when the data contains newlines (CR and LF mixed, since it was user input from Windows and Mac)?
Couple of job IDs:
Job ID: aerobic-forge-504:bqjob_r75d28c332a179207_0000014710c6969d_1
Job ID: aerobic-forge-504:bqjob_r732cb544f96e3d8d_0000014710f8ffe1_1
Update
Apparently there's more to this. I used INTO OUTFILE on 5.5.34-MariaDB-wsrep-log, and either it's a bug or something else is wrong, but I get invalid CSV exports. I had to use another tool (SQLyog) to export proper CSV.
It has problems with double quotes; for example, field 14 here has the error:
3819^Ron ^Wolbert^6^0^1^6123103169^^^^^^^""Lil"" Ron's^0^1282689026^1
UPDATE 2019:
Try this as an alternative:
Load the MySQL backup files into a Cloud SQL instance.
Read the data in BigQuery straight out of MySQL.
Longer how-to:
https://medium.com/google-cloud/loading-mysql-backup-files-into-bigquery-straight-from-cloud-sql-d40a98281229
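Once the Cloud SQL connection exists, querying MySQL from BigQuery looks roughly like this (a sketch; the connection ID, table, and columns are placeholders, and the linked post covers the actual setup):
SELECT *
FROM EXTERNAL_QUERY(
  'my-project.us.my-cloudsql-connection',
  'SELECT uid, first_name, last_name FROM user_profile;'
);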
The proper way to encode a double quote in CSV is to put another double quote in front of it.
So instead of:
"Moreno L Jackson \"THE MOTIVATOR!\"...
Have:
"Moreno L Jackson ""THE MOTIVATOR!""...

PIG - Read each line into its field

Is there a way I can read each line from a logfile into its own field? I thought that with '\n' as a delimiter I should be able to achieve that.
File - test
Audit file /u01/app/oracle/admin/st01/adump/st011_ora_27063_1.aud
Node name: test0041
CLIENT USER:[6] 'oracle'
So I would like to read this into three fields as
filename - Audit file /u01/app/oracle/admin/st01/adump/st011_ora_27063_1.aud
nodename - Node name: test0041
username - CLIENT USER:[6] 'oracle'
I tried this, but it didn't help.
A = LOAD 'test' using PigStorage ('\n') AS (filename, nodename, username);
You can't use '\n' as a delimiter with PigStorage. According to the Pig 0.10 docs:
Record Delimiters – For load statements Pig interprets the line feed ( '\n' ), carriage return ( '\r' or CTRL-M) and combined CR + LF ( '\r\n' ) characters as record delimiters (do not use these characters as field delimiters). For store statements Pig uses the line feed ('\n') character as the record delimiter.
If you want to parse the log file you'll have to write a custom loader.
If your file is that small, why don't you pre-process it before the LOAD, for example by converting each '\n' to '\t'?