How to specify a STRING as a delimiter in HIVE table creation - hive

My data looks like:
a||b||c
To fetch the data my create table statement is:
CREATE TABLE mytable
( col1 STRING,
col2 STRING,
col3 STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "||";
But Hive takes a single '|' as the delimiter, not "||".
Can anyone help me on this?

You may use RegexSerDe when dealing with multi-character delimiter strings:
create table mytable (
col1 string,
col2 string,
col3 string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "^([^\\|]+)\\|\\|([^\\|]+)\\|\\|([^\\|]+)$",
"output.format.string" = "%1$s %2$s %3$s")
STORED AS TEXTFILE
LOCATION '/path/to/data';
Note: refine the regex to suit your needs.
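To see what that pattern captures before creating the table, here is a minimal sketch in Python rather than Hive. It assumes the regex behaves the same in Python's `re` module as in the Java regex engine that RegexSerDe uses, which should hold for a pattern this simple. The helper name `split_row` is made up for illustration.

```python
import re

# Same pattern as "input.regex" above, written as a Python raw string.
# In the Hive string literal, "\\|" is what yields the regex token \| (a literal pipe).
PATTERN = re.compile(r"^([^\|]+)\|\|([^\|]+)\|\|([^\|]+)$")


def split_row(line):
    """Return the three captured columns, or None when the line does not match."""
    match = PATTERN.match(line)
    return list(match.groups()) if match else None


print(split_row("a||b||c"))  # ['a', 'b', 'c']
print(split_row("a||||c"))   # None: [^\|]+ needs at least one character per field
```

The second call shows one thing worth refining: empty fields do not match, and as far as I know RegexSerDe returns NULL for every column of a non-matching line. Changing `+` to `*` in each group would accept empty fields.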

Related

Hive - Load delimited data with special character cause off position

Let's say I want to create a simple table with 4 columns in Hive and load some pipe-delimited data.
CREATE table TEST_1 (
COL1 string,
COL2 string,
COL3 string,
COL4 string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
;
Raw Data:
123|456|Dasani Bottled \| Water|789
I expect the COL3 value to be "Dasani Bottled \| Water". The field contains the special sequence "\|", which includes a pipe character. Because the table uses "|" as the delimiter, Hive splits on that pipe as well, so the columns are off position starting at COL3.
Is there any way to resolve the issue so Hive can load data correctly?
Thanks for any help.
You can add the ESCAPED BY clause to your CREATE TABLE statement to allow character escaping:
CREATE table TEST_1 (
COL1 string,
COL2 string,
COL3 string,
COL4 string
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|' ESCAPED BY '\'
;
From the Hive documentation
Enable escaping for the delimiter characters by using the 'ESCAPED BY'
clause (such as ESCAPED BY '\') Escaping is needed if you want to
work with data that can contain these delimiter characters.
A custom NULL format can also be specified using the 'NULL DEFINED AS'
clause (default is '\N').
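The effect of escaping can be sketched outside Hive. The Python below is only an illustration of the idea, not Hive's implementation: `split_escaped` is a hypothetical helper, and it keeps the backslash in the field, whereas Hive may handle the escape character differently in the stored value.

```python
def split_escaped(line, delimiter="|", escape="\\"):
    """Split on delimiter, treating escape + next character as literal field text."""
    fields, current = [], []
    i = 0
    while i < len(line):
        ch = line[i]
        if ch == escape and i + 1 < len(line):
            # Keep the two-character escape sequence and skip past the escaped character.
            current.append(line[i:i + 2])
            i += 2
            continue
        if ch == delimiter:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    fields.append("".join(current))
    return fields


row = r"123|456|Dasani Bottled \| Water|789"
print(len(row.split("|")))  # 5 fields: the naive split breaks COL3 in two
print(split_escaped(row))   # 4 fields: the escaped pipe stays inside COL3
```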

Insert data into hive table without delimiters

I want 10 words in one column and another 10 words in another column. How do I insert data into a Hive table with no specified delimiters using UDFs?
CREATE TABLE employees_stg (emplid STRING, name STRING, age STRING, salary STRING, dept STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(.{4})(.{35})(.{3})(.{11})(.{4})", --Length of each column specified between braces "({})"
"output.format.string" = "%1$s %2$s %3$s %4$s %5$s" --Output in string format
)
LOCATION '/path/to/input/employees_stg';
LOAD DATA INPATH '/path/to/sample_file.txt' INTO TABLE employees_stg;
SELECT * FROM employees_stg;
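A quick way to check the column widths before loading is to run the same fixed-width regex over a sample line. This is a sketch in Python, assuming the pattern behaves the same there as in Hive. The sample record is invented for illustration, and the `strip()` is only for display; RegexSerDe itself would keep the padding.

```python
import re

# Same widths as the "input.regex" above: 4 + 35 + 3 + 11 + 4 characters.
FIXED_WIDTH = re.compile(r"(.{4})(.{35})(.{3})(.{11})(.{4})")

# Hypothetical record, padded with spaces to the column widths.
line = "E001" + "Jane Doe".ljust(35) + "042" + "50000.00".ljust(11) + "D010"

match = FIXED_WIDTH.match(line)
emplid, name, age, salary, dept = (field.strip() for field in match.groups())
print(emplid, name, age, salary, dept)
```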

How do you add Data to an Existing Hive Metastore?

I have multiple subdirectories in S3 that contain .orc files. I'm trying to create a Hive metastore so I can query the data with Presto / Hive, etc. The data is poorly structured (no consistent delimiter, ugly characters, etc.). Here's a scrubbed sample:
1488736466 199.199.199.199 0_b.www.sphericalcow.com.f9b1.qk-g6m6z24tdr.v4.url.name.com TXT IN: NXDOMAIN/0/143
1488736466 6.6.5.4 0.3399.186472.4306.6668.638.cb5a.names-things.update.url.name.com TXT IN: NOERROR/3/306 0\009253\009http://az.blargi.ng/%D3%AB%EF%BF%BD%EF%BF%BD/\009 0\009253\009http://casinoroyal.online/\009 0\009253\009http://d2njbfxlilvpsq.cloudfront.net/b_zq_ym_bangvideo/bangvideo0826.apk\009
I was able to create a table pointing to one of the subdirectories using a serde regex and the fields are parsing properly, but as far as I can tell I can only load one subfolder at a time.
How does one add more data to an existing hive metastore?
Here's an example of my hive metastore create statement with the regex serde bit:
DROP TABLE IF EXISTS test;
CREATE EXTERNAL TABLE test (field1 string, field2 string, field3 string, field4 string)
COMMENT 'fill all the tables with the datas.'
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([0-9]{10}) ([0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}) (\\S*) (.*)",
"output.format.string" = "%1$s %2$s %3$s %4$s"
)
STORED AS ORC
LOCATION 's3://path/to/one/of/10/folders/'
tblproperties ("orc.compress" = "SNAPPY", "skip.header.line.count"="2");
select * from test limit 10;
I realize there is probably a very simple solution. I tried INSERT INTO in place of CREATE EXTERNAL TABLE, but it understandably complains about the input. I also looked in both the Hive and SerDe documentation for help but was unable to find a reference to adding to an existing store.
Possible solution using partitions.
CREATE EXTERNAL TABLE test (field1 string, field2 string, field3 string, field4 string)
partitioned by (mypartcol string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([0-9]{10}) ([0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}\\.[0-9]{1,3}) (\\S*) (.*)"
)
LOCATION 's3://whatever/as/long/as/it/is/empty'
tblproperties ("skip.header.line.count"="2");
alter table test add partition (mypartcol='folder 1') location 's3://path/to/1st/of/10/folders/';
alter table test add partition (mypartcol='folder 2') location 's3://path/to/2nd/of/10/folders/';
.
.
.
alter table test add partition (mypartcol='folder 10') location 's3://path/to/10th/of/10/folders/';
For @TheProletariat (the OP)
It seems there is no need for RegexSerDe since the columns are delimited by space (' ').
Note the use of tblproperties ("serialization.last.column.takes.rest"="true")
create external table test
(
field1 bigint
,field2 string
,field3 string
,field4 string
)
row format delimited
fields terminated by ' '
tblproperties ("serialization.last.column.takes.rest"="true")
;
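What "last column takes rest" means can be sketched in Python with a bounded split. This only illustrates the idea and is not how Hive implements it. The function name `split_last_takes_rest` is made up, and the sample line is the first one from the question.

```python
def split_last_takes_rest(line, n_columns, delimiter=" "):
    """Split into at most n_columns fields; the last field keeps any remaining delimiters."""
    return line.split(delimiter, n_columns - 1)


line = ("1488736466 199.199.199.199 "
        "0_b.www.sphericalcow.com.f9b1.qk-g6m6z24tdr.v4.url.name.com "
        "TXT IN: NXDOMAIN/0/143")

fields = split_last_takes_rest(line, 4)
print(len(fields))  # 4
print(fields[3])    # TXT IN: NXDOMAIN/0/143
```

Without the bound, the spaces inside the fourth field would produce extra fields and everything after "TXT" would be dropped from field4.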

Writing columns having NULL as some string using OpenCSVSerde - HIVE

I'm using 'org.apache.hadoop.hive.serde2.OpenCSVSerde' to write hive table data.
CREATE TABLE testtable ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ",",
"quoteChar" = "'"
)
STORED AS TEXTFILE LOCATION '<location>' AS
select * from foo;
So if the 'foo' table has empty strings in it, e.g. '1','2','', the empty strings are written as is to the text file. The data in the text file reads '1','2',''
But if 'foo' contains null values, e.g. '1','2',null, the null value is not written to the text file.
The data in the text file reads '1','2',
How do I make sure that the nulls are properly written to the text file using the CSV SerDe, either as empty strings or as some other string, say "nullstring"?
I also tried this:
CREATE TABLE testtable ROW FORMAT SERDE
....
....
STORED AS TEXTFILE LOCATION '<location>'
TBLPROPERTIES ('serialization.null.format'='')
AS select * from foo;
This should probably replace the empty strings with null, but it doesn't even do that.
Please guide me on how to write nulls to CSV files.
Will I have to check for null values in the select query itself (with NVL or something similar) and replace them with something?
OpenCSVSerde ignores the 'serialization.null.format' property. You can handle null values using the steps below:
1. CREATE TABLE testtable
(
name string,
title string,
birth_year string
)ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
"separatorChar" = ","
,"quoteChar" = "'"
)
STORED AS TEXTFILE;
2. load data into testtable
3. CREATE TABLE testtable1
(
name string,
title string,
birth_year string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
TBLPROPERTIES('serialization.null.format'='');
4. INSERT OVERWRITE TABLE testtable1 SELECT * FROM testtable;
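The other route raised in the question, substituting a placeholder for nulls before the row is written (what NVL would do in the SELECT), can be sketched outside Hive. The Python below uses the standard `csv` module, not OpenCSVSerde, so it only illustrates the substitution idea. `write_row` and the "nullstring" token are taken from the question or invented for illustration.

```python
import csv
import io


def write_row(values, null_token="nullstring"):
    """Write one CSV line, replacing None with null_token before quoting."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=",", quotechar="'",
                        quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow([null_token if value is None else value for value in values])
    return buffer.getvalue()


print(write_row(["1", "2", None]), end="")  # '1','2','nullstring'
print(write_row(["1", "2", ""]), end="")    # '1','2',''
```

With the substitution done up front, a null and an empty string stay distinguishable in the output file.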

Hive Table Definition - Multiple space delimiter

I am defining a hive table where the data has 1 to n spaces between each field.
How do I define the delimiter value in such a case?
I defined the table originally as:
CREATE EXTERNAL TABLE rtt (
field1 STRING,
field2 STRING,
field3 STRING,
field4 STRING,
field5 STRING,
field6 INT,
field7 FLOAT)
COMMENT 'New data set'
PARTITIONED BY (year INT, month INT, day INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '/test-dir/raw/2014/08/07/';
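Try the RegexSerDe, e.g., as described in "Create HIVE Table with multi character delimiter".
I think the regex you want to use as a delimiter is "\s+". Note that RegexSerDe matches the whole line, so "\s+" goes between the capture groups, with one group per column.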
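As a rough check of that idea, here is a sketch in Python that builds a whole-line pattern for seven columns with `\s+` between the groups. The sample line is invented, and the assumption is that the pattern behaves the same in the Java regex engine Hive uses. In the Hive string literal the backslashes would need doubling, i.e. "(\\S+)\\s+(\\S+)..." and so on.

```python
import re

N_COLUMNS = 7  # field1 .. field7 in the table definition above

# One capture group per column, separated by one or more whitespace characters.
LINE_REGEX = re.compile(r"\s+".join([r"(\S+)"] * N_COLUMNS))

# Hypothetical input line with 1 to n spaces between fields.
line = "a  b c    d e   42  3.5"

match = LINE_REGEX.fullmatch(line)
print(match.groups())
```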