Newline characters in an external table in Hive

I have a lot (think terabytes) of data, partitioned per day (a few gigabytes per file) in TSV format; some fields, unfortunately, contain \n characters.
I am trying to create an external table in a way that the newline characters in the data do not break the rows. I tried:
CREATE EXTERNAL TABLE test
( `column` int, `column1` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/home/';
I also tried
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ( 'field.delim' = '\t', 'line.delim'='\n')
and
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = "\t",
"quoteChar" = "'",
"escapeChar" = "\\"
)
In both cases, no luck: newlines in fields still cause the rows to terminate early.
How can I get my TSV into Hive with newlines in columns?
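For what it's worth, as far as I can tell the SerDe is not the limiting factor here: with STORED AS TEXTFILE, TextInputFormat splits records on raw \n bytes before any SerDe ever sees the data, so an unescaped newline inside a field will always break the row no matter which SerDe you configure. The usual workaround is to pre-process the files so embedded newlines become a placeholder sequence (for example the two characters backslash-n), load them with a plain DELIMITED table, and translate the placeholder back at query time. A minimal sketch, assuming pre-escaped data; the table and view names are hypothetical:
-- Assumes the files were rewritten upstream so that every embedded
-- newline is the two-character sequence backslash-n and each record
-- is physically one line.
CREATE EXTERNAL TABLE test_raw
( `column` int, `column1` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE
LOCATION '/home/';
-- Translate the placeholder back into a real newline at query time.
CREATE VIEW test_unescaped AS
SELECT `column`,
       regexp_replace(`column1`, '\\\\n', '\n') AS `column1`
FROM test_raw;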

Related

AWS Athena custom data format?

I'd like to query my app logs on S3 with AWS Athena but I'm having trouble creating the table/specifying the data format.
This is how the log lines look:
2020-12-09T18:08:48.789Z {"reqid":"Root=1-5fd112b0-676bbf5a4d54d57d56930b17","cache":"xxxx","cacheKey":"yyyy","level":"debug","message":"cached value found"}
which is a timestamp followed by a space and then the JSON I want to query.
Is there a way to query logs like this? I see that the CSV, TSV, JSON, Apache Web Logs, and Text File with Custom Delimiters data formats are supported, but because of the leading timestamp I can't simply use JSON.
Define a table with a single column:
CREATE EXTERNAL TABLE your_table(
line STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
LOCATION 's3://mybucket/path/mylogs/';
You can extract the timestamp and the JSON using regexp, then parse the JSON separately:
select ts,
json_extract(json_col, '$.reqid') AS reqid
...
from
(
select regexp_extract(line, '(.*?) +',1) as ts,
regexp_extract(line, '(.*?) +(.*)',2) as json_col
from your_table
)s
Alternatively, you can define a RegexSerDe table with two columns; the SerDe will split out the two columns for you, and all you need to do is parse json_col:
CREATE EXTERNAL TABLE your_table (
ts STRING,
json_col STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "^(.*?) +(.*)$"
)
LOCATION 's3://mybucket/path/mylogs/';
SELECT ts, json_extract(json_col, '$.reqid') AS reqid ...
FROM your_table
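One follow-up on the query side: in Athena (Presto), json_extract returns a value of JSON type, while json_extract_scalar returns a plain VARCHAR, which is usually what you want for fields like these. A hedged example against the same table, using keys from the sample log line:
-- json_extract_scalar yields VARCHAR instead of a JSON-typed value.
SELECT ts,
       json_extract_scalar(json_col, '$.reqid')   AS reqid,
       json_extract_scalar(json_col, '$.level')   AS level,
       json_extract_scalar(json_col, '$.message') AS message
FROM your_table;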

How to break a string into columns when the delimiter is a comma but the comma also appears inside field values

I am doing a data load where each line has the character " at the beginning and end of each field, and a comma as the delimiter, as below:
"sU92", "eRouter1.0"
"sU92" "," eRouter1.0 "
"sU9.2", "eRouter1.0"
Note that in the second line there are two consecutive double quotes ("") and that in the third row there is a comma between the digits 9 and 2 (9,2).
Whenever I try to create the table with the delimiter set to comma and with quoteChar = '\"', the records break.
Create the table without un-quoting enabled, using LazySimpleSerDe (the default):
create table mytable(
col1 string,
col2 string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' ;
Then un-quote the strings and remove extra spaces in the SELECT, using for example regexp_replace:
trim(regexp_replace(str, '\\"',''))
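Putting the pieces together, a minimal sketch of the full query under this approach (mytable, col1, and col2 are from the DDL above):
-- Strip the surrounding double quotes and extra spaces from each column.
SELECT trim(regexp_replace(col1, '\\"', '')) AS col1,
       trim(regexp_replace(col2, '\\"', '')) AS col2
FROM mytable;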

Create a Hive table from datasource with caret delimited, quoted columns and nulls encoded as '\N'

I have a large set of gzip files that need to be loaded into Hive. The columns are strings, encapsulated in double quotes, and delimited by carets (^). There are some null values in the dataset that are encoded as \N, e.g.
"Doug Cutting"^"Hadoop"^"United States"
"Hadley Wickham"^"R"^"New Zealand"
"Alex Woolford"^\N^"United Kingdom"
The dataset, to my eyes, looks like a CSV (or "^SV"), and so I created a table using the OpenCSVSerde:
CREATE TABLE `technologists`(
`name` string,
`famous_for` string,
`country_of_birth` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'quoteChar'='\"',
'separatorChar'='^')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'/some/hdfs/location'
This worked well except for the null values, which show up as an 'N', e.g.
hive> select * from technologists;
OK
Doug Cutting Hadoop United States
Hadley Wickham R New Zealand
Alex Woolford N United Kingdom
Do you know if there's a simple way to create this table without writing a custom SerDe or editing the files? Can the RegexSerDe replace a \N with a real null?
It looks like this SerDe uses a backslash as the default escape character, so \N is stripped down to N. Add 'escapeChar' to your SerDe properties and set it to something other than backslash. I'd try setting it to the same value as quoteChar (a double quote) if possible; if the SerDe complains that this isn't allowed, then use some non-printable character instead.
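A hedged sketch of that change, i.e. the same DDL with escapeChar added (untested; if the SerDe rejects a duplicate of quoteChar, substitute a non-printable character):
CREATE TABLE `technologists`(
`name` string,
`famous_for` string,
`country_of_birth` string)
ROW FORMAT SERDE
'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
'quoteChar'='\"',
'separatorChar'='^',
'escapeChar'='\"')  -- not backslash, so \N passes through unmodified
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'/some/hdfs/location';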

How to import csv file into hive table when it has delimiter value, say comma, as a field value?

I want to import a csv file into a Hive table. The csv file has commas (,) within some field values. How can we escape them?
You can use the CSV SerDe, depending on which of the two cases below matches your data.
If the fields that contain commas are quoted:
sam,1,"sam is adventurous, brave"
bob,2,"bob is affectionate, affable"
CREATE EXTERNAL TABLE csv_table(name String, userid BIGINT,comment STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties (
"separatorChar" = ",",
"quoteChar" = "\""
)
STORED AS TEXTFILE
LOCATION 'location_of_csv_file';
If the commas inside fields are escaped, as below:
sam,1,sam is adventurous\, brave
bob,2,bob is affectionate\, affable
CREATE EXTERNAL TABLE csv_table(name String, userid BIGINT, comment STRING)
ROW FORMAT serde 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties (
"separatorChar" = ",",
"escapeChar" = "\\"
)
STORED AS TEXTFILE
LOCATION '/user/cloudera/input/csv';
In both cases, the output will be as below:
hive> select * from csv_table;
OK
sam 1 sam is adventurous, brave
bob 2 bob is affectionate, affable
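One caveat with both variants: OpenCSVSerde exposes every column as STRING regardless of the declared type, so userid above behaves as a string at read time. A minimal sketch of a typed view on top (the view name is hypothetical):
CREATE VIEW csv_table_typed AS
SELECT name,
       CAST(userid AS BIGINT) AS userid,  -- cast back from the SerDe's string
       comment
FROM csv_table;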

How to use XMLType in a SQL*Loader control file?

I have a csv file which contains a column of XML data. A sample record of the csv looks like:
1,,,<capacidade><numfields>3</numfields><template>1</template><F1><name lang="pt">Apple</name></F1></capacidade>
I'd like to use SQL*Loader to import all the data into Oracle; I defined the ctl file as follows:
LOAD DATA
CHARACTERSET UTF8
INFILE '/home/db2inst1/result.csv'
CONTINUEIF NEXT(1:1) = '#'
INTO TABLE "TEST"."T_DATA_TEMP"
FIELDS TERMINATED BY ','
( "ID"
, "SOURCE"
, "CATEGORY"
, "RAWDATA"
)
When running this, the error log shows that the RAWDATA column is treated as the CHARACTER data type. How can I define RAWDATA to be an XMLType in this case, so that it is correctly inserted into Oracle?
Try something like this:
Add a comma at the end of the XML to mark a delimiter (you can't use whitespace, as the XML may contain spaces in between).
Then create your ctl file like below:
LOAD DATA
APPEND
INTO TABLE "TEST"."T_DATA_TEMP" fields OPTIONALLY ENCLOSED BY '"'
TRAILING NULLCOLS
( ID terminated by ",",
SOURCE terminated by ",",
CATEGORY terminated by ",",
RAWDATA char(4000) terminated by ","
)
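If the target RAWDATA column is declared as XMLType and the implicit conversion still fails, SQL*Loader also lets you apply a SQL expression to a field at insert time. A hedged variant of the same control file (the XMLTYPE() constructor is standard Oracle SQL, but I have not verified this exact file):
LOAD DATA
APPEND
INTO TABLE "TEST"."T_DATA_TEMP" fields OPTIONALLY ENCLOSED BY '"'
TRAILING NULLCOLS
( ID terminated by ",",
SOURCE terminated by ",",
CATEGORY terminated by ",",
-- apply a SQL expression so the string is converted explicitly
RAWDATA char(100000) terminated by "," "XMLTYPE(:RAWDATA)"
)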