I want to create the external hive table from nested json data but fields should be flatten out from nested json.
For Example:-
{
"key1":"value1",
"key2":{
"nestedKey1":1,
"nestedKey2":2
}
}
Hive table should have format or fields flatten out like
key1: String, key2.nestedKey1:Int,key2.nestedKey1:Int
Thanks In Advance
Use JsonSerDe and create table with below syntax:
hive> create table sample(key1 string,key2 struct<nestedKey1:int,nestedKey2:int>)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';
hive> select key1,key2.nestedkey1,key2.nestedkey2 from sample;
+---------+-------------+-------------+--+
| key1 | nestedkey1 | nestedkey2 |
+---------+-------------+-------------+--+
| value1 | 1 | 2 |
+---------+-------------+-------------+--+
hive> select * from sample;
+--------------+----------------------------------+--+
| sample.key1 | sample.key2 |
+--------------+----------------------------------+--+
| value1 | {"nestedkey1":1,"nestedkey2":2} |
+--------------+----------------------------------+--+
(or)
If you want to create table with flatten out json fields then use RegexSerDe and matching regex to extract nestedkey from the data.
Refer this link for more details regards to regex serde.
UPDATE:
Inputdata:
{"key1":"value1","key2":{"nestedKey1":1,"nestedKey2":2}}
HiveTable:
hive> CREATE TABLE dd (key1 string, nestedKey1 string, nestedKey2 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES
('input.regex'=".*:\"(.*?)\",\"key2\":\\{\"nestedKey1\":(\\d),\"nestedKey2\":(\\d).*$");
Select data from the table:
hive> select * from dd;
+---------+-------------+-------------+--+
| key1 | nestedkey1 | nestedkey2 |
+---------+-------------+-------------+--+
| value1 | 1 | 2 |
+---------+-------------+-------------+--+
Related
Suppose that I have a table named agents_timesheet that having a structure like this:
ID | name | health_check_record | date | clock_in | clock_out
---------------------------------------------------------------------------------------------------------
1 | AAA | {"mental":{"stress":"no", "depression":"no"}, | 6-Dec-2021 | 08:25:07 |
| | "physical":{"other_symptoms":"headache", "flu":"no"}} | | |
---------------------------------------------------------------------------------------------------------
2 | BBB | {"mental":{"stress":"no", "depression":"no"}, | 6-Dec-2021 | 08:26:12 |
| | "physical":{"other_symptoms":"no", "flu":"yes"}} | | |
---------------------------------------------------------------------------------------------------------
3 | CCC | {"mental":{"stress":"no", "depression":"severe"}, | 6-Dec-2021 | 08:27:12 |
| | "physical":{"other_symptoms":"cancer", "flu":"yes"}} | | |
Now I need to get all agents having flu at the day. As for getting the flu from a single JSON in Oracle SQL, I can already get it by this SQL statement:
SELECT * FROM JSON_TABLE(
'{"mental":{"stress":"no", "depression":"no"}, "physical":{"fever":"no", "flu":"yes"}}', '$'
COLUMNS (fever VARCHAR(2) PATH '$.physical.flu')
);
As for getting the values from the column health_check_record, I can get it by utilizing the SELECT statement.
But How to get the values of flu in the JSON in the health_check_record of that table?
Additional question
Based on the table, how can I retrieve full list of other_symptoms, then it will get me this kind of output:
ID | name | other_symptoms
-------------------------------
1 | AAA | headache
2 | BBB | no
3 | CCC | cancer
You can use JSON_EXISTS() function.
SELECT *
FROM agents_timesheet
WHERE JSON_EXISTS(health_check_record, '$.physical.flu == "yes"');
There is also "plain old way" without JSON parsing only treting column like a standard VARCHAR one. This way will not work in 100% of cases, but if you have the data in the same way like you described it might be sufficient.
SELECT *
FROM agents_timesheet
WHERE health_check_record LIKE '%"flu":"yes"%';
How to get the values of flu in the JSON in the health_check_record of that table?
From Oracle 12, to get the values you can use JSON_TABLE with a correlated CROSS JOIN to the table:
SELECT a.id,
a.name,
j.*,
a."DATE",
a.clock_in,
a.clock_out
FROM agents_timesheet a
CROSS JOIN JSON_TABLE(
a.health_check_record,
'$'
COLUMNS (
mental_stress VARCHAR2(3) PATH '$.mental.stress',
mental_depression VARCHAR2(3) PATH '$.mental.depression',
physical_fever VARCHAR2(3) PATH '$.physical.fever',
physical_flu VARCHAR2(3) PATH '$.physical.flu'
)
) j
WHERE physical_flu = 'yes';
db<>fiddle here
You can use "dot notation" to access data from a JSON column. Like this:
select "DATE", id, name
from agents_timesheet t
where t.health_check_record.physical.flu = 'yes'
;
DATE ID NAME
----------- --- ----
06-DEC-2021 2 BBB
Note that this approach requires that you use an alias for the table name (so you can use it in accessing the JSON data).
For testing I used the data posted by MT0 on dbfiddle. I am not a big fan of double-quoted column names; use something else for "DATE", such as dt or date_.
I'm trying to add some data in one directory, and after to add these data as partition to a table.
create table test (key int, value int) partitioned by (dt int) stored as parquet location '/user/me/test';
insert overwrite directory '/user/me/test/dt=1' stored as parquet select 123, 456, 1;
alter table test add partition (dt=1);
select * from test;
This code sample is simple... but don't work. With the select statement, the output is NULL, NULL, 1. But I need 123, 456, 1.
When I read the data with Impala, I received 123, 456, 1... what is expected.
Why ? What is wrong ?
If I removed the two "stored as parquet", it's all ok... but I want my data in parquet !
PS : I want this construct for a switch of partition, so that when the data are calculated, they don't go to the user...
Identifying the issue
hive
create table test (key int, value int)
partitioned by (dt int)
stored as parquet location '/user/me/test'
;
insert overwrite directory '/user/me/test/dt=1'
stored as parquet
select 123, 456
;
alter table test add partition (dt=1)
;
select * from test
;
+----------+------------+---------+
| test.key | test.value | test.dt |
+----------+------------+---------+
| NULL | NULL | 1 |
+----------+------------+---------+
bash
parquet-tools cat hdfs://{fs.defaultFS}/user/me/test/dt=1/000000_0
_col0 = 123
_col1 = 456
Verifying the issue
hive
alter table test change column `key` `_col0` int cascade;
alter table test change column `value` `_col1` int cascade;
select * from test
;
+------------+------------+---------+
| test._col0 | test._col1 | test.dt |
+------------+------------+---------+
| 123 | 456 | 1 |
+------------+------------+---------+
Suggestd Solution
create additional table test_admin and do the insert through it
create table test_admin (key int, value int)
partitioned by (dt int)
stored as parquet location '/user/me/test'
;
create external table test (key int, value int)
partitioned by (dt int)
stored as parquet
location '/user/me/test'
;
insert into test_admin partition (dt=1) select 123, 456
;
select * from test_admin
;
+----------+------------+---------+
| test.key | test.value | test.dt |
+----------+------------+---------+
| 123 | 456 | 1 |
+----------+------------+---------+
select * from test
;
(empty result set)
alter table test add partition (dt=1)
;
select * from test
;
+----------+------------+---------+
| test.key | test.value | test.dt |
+----------+------------+---------+
| 123 | 456 | 1 |
+----------+------------+---------+
We have the tables in hive like below and we are generating the flat files from hive data while we are generating we found that there was junk characteres with in the data like below we have many characters in many columns can any one help us to remove those junk characters from hive table or from unix file ?
ÿ,ä,í,ã
Here problem the same data need to send the downstream when they are loading in to there DB it shows as double dollar but we design code double dollar as column delimiter.
Basic concept
hive> select regexp_replace('Hÿelloä íworlãd','[^a-zA-Z ]','');
OK
Hello world
Demo
Removing undesired character from the whole table and exporting it to a file.
create table t (i int,s1 string,s2 string);
insert into t values (1,'Hÿelloä','íworlãd'),(2,'ãGããood','Byÿe');
select * from t;
+---+---------+---------+
| i | s1 | s2 |
+---+---------+---------+
| 1 | Hÿelloä | íworlãd |
| 2 | ãGããood | Byÿe |
+---+---------+---------+
create external table t_ext (rec string)
row format delimited
fields terminated by '0'
location '/user/hive/warehouse/t'
;
insert overwrite local directory '/tmp/t_ext'
select regexp_replace(regexp_replace(rec,'[^a-zA-Z0-9 \\01]',''),'\\x01','<--->')
from t_ext
;
! ls /tmp/t_ext
;
000000_0
! cat /tmp/t_ext/000000_0
;
1<--->Hello<--->world
2<--->Good<--->Bye
This works as long as your tables contain only "primitive" types (no structs, arrays, maps etc.).
I really pushed the envelope here.
Demo
create table t (i int, dt date, str string, ts timestamp, bl boolean);
insert into t select 1,current_date,'Hello world',current_timestamp,true;
select * from t;
+-----+------------+-------------+-------------------------+------+
| t.i | t.dt | t.str | t.ts | t.bl |
+-----+------------+-------------+-------------------------+------+
| 1 | 2017-03-14 | Hello world | 2017-03-14 14:37:28.889 | true |
+-----+------------+-------------+-------------------------+------+
select regexp_replace
(
printf(concat('%s',repeat('$$%s',field(unhex(1),*,unhex(1))-2)),*)
,'(\\$\\$)|[^a-zA-Z0-9 -]'
,'$1'
)
from t
;
1$$2017-03-14$$Hello world$$2017-03-14 143728.889$$true
Although Impala is much faster than Hive, we used Hive because it supports complex (nested) data types such as arrays and maps.
I notice that Impala, as of CDH5.5, now supports complex data types. Since it's also possible to run Hive UDF's in Impala, we can probably do everything we want in Impala, but much, much faster. That's great news!
As I scan through the documentation, I see that Impala expects data to be stored in Parquet format. My data, in its raw form, happens to be a two-column CSV where the first column is an ID, and the second column is a pipe-delimited array of strings, e.g.:
123,ASDFG|SDFGH|DFGHJ|FGHJK
234,QWERT|WERTY|ERTYU
A Hive table was created:
CREATE TABLE `id_member_of`(
`id` INT,
`member_of` ARRAY<STRING>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|'
LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
The raw data was loaded into the Hive table:
LOAD DATA LOCAL INPATH 'raw_data.csv' INTO TABLE id_member_of;
A Parquet version of the table was created:
CREATE TABLE `id_member_of_parquet` (
`id` STRING,
`member_of` ARRAY<STRING>)
STORED AS PARQUET;
The data from the CSV-backed table was inserted into the Parquet table:
INSERT INTO id_member_of_parquet SELECT id, member_of FROM id_member_of;
And the Parquet table is now queryable in Hive:
hive> select * from id_member_of_parquet;
123 ["ASDFG","SDFGH","DFGHJ","FGHJK"]
234 ["QWERT","WERTY","ERTYU"]
Strangely, when I query the same Parquet-backed table in Impala, it doesn't return the array column:
[hadoop01:21000] > invalidate metadata;
[hadoop01:21000] > select * from id_member_of_parquet;
+-----+
| id |
+-----+
| 123 |
| 234 |
+-----+
Question: What happened to the array column? Can you see what I'm doing wrong?
It turned out to be really simple: we can access the array by adding it to the FROM with a dot, e.g.
Query: select * from id_member_of_parquet, id_member_of_parquet.member_of
+-----+-------+
| id | item |
+-----+-------+
| 123 | ASDFG |
| 123 | SDFGH |
| 123 | DFGHJ |
| 123 | FGHJK |
| 234 | QWERT |
| 234 | WERTY |
| 234 | ERTYU |
+-----+-------+
I am trying to convert blank values in the source file to NULL in the hive table by setting the property 'serialization.null.format' = '' . The query I have written in hive is:
create table test(a int, b string) stored as parquet TBLPROPERTIES('serialization.null.format'='');
And then insert values into this through impala something like this:
insert overwrite table test values (1, ''), (2, 'b');
The result of this shows something like this:
| a | b |
| 1 | |
| 2 | b |
Can someone help me out here as to why is the blank not getting converted to NULL ?
The problem is the Parquet SerDe. See the issue at https://issues.apache.org/jira/browse/HIVE-12362.
The description is as follows:
create table src (a string);
insert into table src values (NULL), (''), ('');
0: jdbc:hive2://localhost:10000/default> select * from src;
+-----------+--+
| src.a |
+-----------+--+
| NULL |
| |
| |
+-----------+--+
create table dest (a string) row format serde 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' stored as INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';
alter table dest set SERDEPROPERTIES ('serialization.null.format' = '');
alter table dest set TBLPROPERTIES ('serialization.null.format' = '');
insert overwrite table dest select * from src;
0: jdbc:hive2://localhost:10000/default> select * from test11;
+-----------+--+
| test11.a |
+-----------+--+
| NULL |
| |
| |
+-----------+--+
You could try inserting into the table using a statement like this:
CASE
when TRIM(a) = ''
THEN NULL
ELSE a
END,
This will do the trick : nullif(trim(b),'')
Will give the b or NULL value when blank. So when selecting the statement you can do
select a,nullif(trim(b),'') from test;
FYR: nullif( value 1, value 2 ) Returns NULL if value 1 = value 2; otherwise returns value 1 (as of Hive 2.3.0).
Shorthand for: CASE WHEN value 1 = value 2 then NULL else value 1
https://www.docs4dev.com/docs/en/apache-hive/3.1.1/reference/LanguageManual_UDF.html
Cheers!!