Although Impala is much faster than Hive, we used Hive because it supports complex (nested) data types such as arrays and maps.
I notice that Impala, as of CDH5.5, now supports complex data types. Since it's also possible to run Hive UDF's in Impala, we can probably do everything we want in Impala, but much, much faster. That's great news!
As I scan through the documentation, I see that Impala expects data to be stored in Parquet format. My data, in its raw form, happens to be a two-column CSV where the first column is an ID, and the second column is a pipe-delimited array of strings, e.g.:
123,ASDFG|SDFGH|DFGHJ|FGHJK
234,QWERT|WERTY|ERTYU
A Hive table was created:
CREATE TABLE `id_member_of`(
`id` INT,
`member_of` ARRAY<STRING>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|'
LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
The raw data was loaded into the Hive table:
LOAD DATA LOCAL INPATH 'raw_data.csv' INTO TABLE id_member_of;
A Parquet version of the table was created:
CREATE TABLE `id_member_of_parquet` (
`id` STRING,
`member_of` ARRAY<STRING>)
STORED AS PARQUET;
The data from the CSV-backed table was inserted into the Parquet table:
INSERT INTO id_member_of_parquet SELECT id, member_of FROM id_member_of;
And the Parquet table is now queryable in Hive:
hive> select * from id_member_of_parquet;
123 ["ASDFG","SDFGH","DFGHJ","FGHJK"]
234 ["QWERT","WERTY","ERTYU"]
Strangely, when I query the same Parquet-backed table in Impala, it doesn't return the array column:
[hadoop01:21000] > invalidate metadata;
[hadoop01:21000] > select * from id_member_of_parquet;
+-----+
| id |
+-----+
| 123 |
| 234 |
+-----+
Question: What happened to the array column? Can you see what I'm doing wrong?
It turned out to be really simple: we can access the array by adding it to the FROM with a dot, e.g.
Query: select * from id_member_of_parquet, id_member_of_parquet.member_of
+-----+-------+
| id | item |
+-----+-------+
| 123 | ASDFG |
| 123 | SDFGH |
| 123 | DFGHJ |
| 123 | FGHJK |
| 234 | QWERT |
| 234 | WERTY |
| 234 | ERTYU |
+-----+-------+
Related
Suppose that I have a table named agents_timesheet that having a structure like this:
ID | name | health_check_record | date | clock_in | clock_out
---------------------------------------------------------------------------------------------------------
1 | AAA | {"mental":{"stress":"no", "depression":"no"}, | 6-Dec-2021 | 08:25:07 |
| | "physical":{"other_symptoms":"headache", "flu":"no"}} | | |
---------------------------------------------------------------------------------------------------------
2 | BBB | {"mental":{"stress":"no", "depression":"no"}, | 6-Dec-2021 | 08:26:12 |
| | "physical":{"other_symptoms":"no", "flu":"yes"}} | | |
---------------------------------------------------------------------------------------------------------
3 | CCC | {"mental":{"stress":"no", "depression":"severe"}, | 6-Dec-2021 | 08:27:12 |
| | "physical":{"other_symptoms":"cancer", "flu":"yes"}} | | |
Now I need to get all agents having flu at the day. As for getting the flu from a single JSON in Oracle SQL, I can already get it by this SQL statement:
SELECT * FROM JSON_TABLE(
'{"mental":{"stress":"no", "depression":"no"}, "physical":{"fever":"no", "flu":"yes"}}', '$'
COLUMNS (fever VARCHAR(2) PATH '$.physical.flu')
);
As for getting the values from the column health_check_record, I can get it by utilizing the SELECT statement.
But How to get the values of flu in the JSON in the health_check_record of that table?
Additional question
Based on the table, how can I retrieve full list of other_symptoms, then it will get me this kind of output:
ID | name | other_symptoms
-------------------------------
1 | AAA | headache
2 | BBB | no
3 | CCC | cancer
You can use JSON_EXISTS() function.
SELECT *
FROM agents_timesheet
WHERE JSON_EXISTS(health_check_record, '$.physical.flu == "yes"');
There is also "plain old way" without JSON parsing only treting column like a standard VARCHAR one. This way will not work in 100% of cases, but if you have the data in the same way like you described it might be sufficient.
SELECT *
FROM agents_timesheet
WHERE health_check_record LIKE '%"flu":"yes"%';
How to get the values of flu in the JSON in the health_check_record of that table?
From Oracle 12, to get the values you can use JSON_TABLE with a correlated CROSS JOIN to the table:
SELECT a.id,
a.name,
j.*,
a."DATE",
a.clock_in,
a.clock_out
FROM agents_timesheet a
CROSS JOIN JSON_TABLE(
a.health_check_record,
'$'
COLUMNS (
mental_stress VARCHAR2(3) PATH '$.mental.stress',
mental_depression VARCHAR2(3) PATH '$.mental.depression',
physical_fever VARCHAR2(3) PATH '$.physical.fever',
physical_flu VARCHAR2(3) PATH '$.physical.flu'
)
) j
WHERE physical_flu = 'yes';
db<>fiddle here
You can use "dot notation" to access data from a JSON column. Like this:
select "DATE", id, name
from agents_timesheet t
where t.health_check_record.physical.flu = 'yes'
;
DATE ID NAME
----------- --- ----
06-DEC-2021 2 BBB
Note that this approach requires that you use an alias for the table name (so you can use it in accessing the JSON data).
For testing I used the data posted by MT0 on dbfiddle. I am not a big fan of double-quoted column names; use something else for "DATE", such as dt or date_.
I am trying to print the column names from a table called 'meta' and I need also its data types.
I tried this query
SELECT meta FROM INFORMATION_SCHEMA.TABLES;
but it throws an error saying no information schema available. Could you please help me, I am a beginner in SQL.
Edit:
select tables.name from tables join schemas on
tables.schema_id=schemas.id where schemas.name=’sprl_db’ ;
This query gives me all the tables in database 'sprl_db'
You can use the monetdb catalog:
select c.name, c.type, c.type_digits, c.type_scale
from sys.columns c
inner join sys.tables t on t.id = c.table_id and t.name = 'meta';
as you are using monetDB you can get that by using sys.columns
sys.columns
it will return all information related to table columns
you can also check Schema, table and columns documentation for monetDB
in sql server we get that like this exec sp_columns TableName
If I understand correctly you need to see the columns and the types of a table you (or some other user) defined called meta?
There are at least two ways to do this:
First (as #GMB mentioned in their answer) you can query the SQL catalog: https://www.monetdb.org/Documentation/SQLcatalog/TablesColumns
SELECT * FROM sys.tables WHERE NAME='meta';
+------+------+-----------+-------+------+--------+---------------+--------+-----------+
| id | name | schema_id | query | type | system | commit_action | access | temporary |
+======+======+===========+=======+======+========+===============+========+===========+
| 9098 | meta | 2000 | null | 0 | false | 0 | 0 | 0 |
+------+------+-----------+-------+------+--------+---------------+--------+-----------+
1 tuple
So this gets all the relevant information about the table meta. We are mostly interested in the value of the column id because this uniquely identifies the table.
(Please note that this id will probably be different in your system)
After we have this information we can query the columns table with this table id:
SELECT * FROM sys.columns WHERE table_id=9098;
+------+------+------+-------------+------------+----------+---------+-------+--------+---------+
| id | name | type | type_digits | type_scale | table_id | default | null | number | storage |
+======+======+======+=============+============+==========+=========+=======+========+=========+
| 9096 | i | int | 32 | 0 | 9098 | null | true | 0 | null |
| 9097 | j | clob | 0 | 0 | 9098 | null | true | 1 | null |
+------+------+------+-------------+------------+----------+---------+-------+--------+---------+
2 tuples
Since you are only interested in the names and types of the columns, you can modify this query as follows:
SELECT name, type FROM sys.columns WHERE table_id=9098;
+------+------+
| name | type |
+======+======+
| i | int |
| j | clob |
+------+------+
2 tuples
You can combine the two queries above with a join:
SELECT col.name, col.type FROM sys.tables as tab JOIN sys.columns as col ON tab.id=col.table_id WHERE tab.name='meta';
+------+------+
| name | type |
+======+======+
| i | int |
| j | clob |
+------+------+
2 tuples
The second, and preferred way to get this information if you are using the mclient utility of MonetDB, is by using the describe meta-command of mclient. When used without arguments it presents a list of tables that have been defined in the current database and when it is given the name of the table it prints its SQL definition:
sql>\d
TABLE sys.data
TABLE sys.meta
sql>\d sys.meta
CREATE TABLE "sys"."meta" (
"i" INTEGER,
"j" CHARACTER LARGE OBJECT
);
You can use the \? meta-command to see a list of all meta-commands in mclient:
sql>\?
\? - show this message
\<file - read input from file
\>file - save response in file, or stdout if no file is given
\|cmd - pipe result to process, or stop when no command is given
\history - show the readline history
\help - synopsis of the SQL syntax
\D table - dumps the table, or the complete database if none given.
\d[Stvsfn]+ [obj] - list database objects, or describe if obj given
\A - enable auto commit
\a - disable auto commit
\e - echo the query in sql formatting mode
\t - set the timer {none,clock,performance} (none is default)
\f - format using renderer {csv,tab,raw,sql,xml,trash,rowcount,expanded,sam}
\w# - set maximal page width (-1=unlimited, 0=terminal width, >0=limit to num)
\r# - set maximum rows per page (-1=raw)
\L file - save client-server interaction
\X - trace mclient code
\q - terminate session and quit mclient
For MySQL:
SELECT column_name,
data_type
FROM information_schema.columns
WHERE table_schema = ’ yourdatabasename ’
AND table_name = ’ yourtablename ’;
Output:
+-------------+-----------+
| COLUMN_NAME | DATA_TYPE |
+-------------+-----------+
| Id | int |
| Address | varchar |
| Money | decimal |
+-------------+-----------+
I have data in CSV looks like
David,"""SMARTPHONE,6""|""COMPUTER,3""|""LAPTOP,1"""
I try to load this to my hive table
create table user_device(name string, devices array<struct<devicename: string, number : int>>)
FIELDS TERMINATED BY ','
collection items terminated by '|'
STORED AS TEXTFILE
LOCATION 'maprfs:///user/david/';
I expected to see
[{"devicename":"SMARTPHONE","number":6},{"devicename":"COMPUTER","number":3},{"devicename":"LAPTOP","number":1}]
But when I try to query the table, I see the array of struct is
[{"devicename":"\"\"\"SMARTPHONE","number":null}]
Rest of the array and struct are gone.
Does anyone know how I can achieve this?
Thanks
David
He is a code I used. In the approach I used python for cleaning before proceeding to HQL queries. So after doing some wrangling steps, I have a file like this below (saved without indices and headers) in my local file system since its a small file:
import pandas as pd
import numpy as np
Name devicename number
0 David SMARTPHONE 6
1 COMPUTER 3
2 LAPTOP 1
Then a temp table tempt is created and populated with data from LFS or HDFS:
create table tempt
(
name string,
devicename string,
number int
)
row format delimited
FIELDS TERMINATED BY ',';
load data local inpath '/path_to_file' overwrite into table tempt;
select * from tempt;
+--------------------+--------------------------+----------------------+--+
| tempt.name | tempt.devicename | tempt.number |
+--------------------+--------------------------+----------------------+--+
| David | SMARTPHONE | 6 |
| | COMPUTER | 3 |
| | LAPTOP | 1 |
+--------------------+--------------------------+----------------------+--+
And now
Insert overwrite table user_device
select name,
array(named_struct("devicename",devicename,"number",number)) from tempt;
select * from user_device;
and the output now is as you expected.
+-----------------+-------------------------------------------+--+
|user_device.name | user_device.devices |
+-----------------+-------------------------------------------+--+
| David | [{"devicename":"SMARTPHONE","number":6}] |
| | [{"devicename":"COMPUTER","number":3}] |
| | [{"devicename":"LAPTOP","number":1}] |
+-----------------+-------------------------------------------+--+
Cheers!
I want to create the external hive table from nested json data but fields should be flatten out from nested json.
For Example:-
{
"key1":"value1",
"key2":{
"nestedKey1":1,
"nestedKey2":2
}
}
Hive table should have format or fields flatten out like
key1: String, key2.nestedKey1:Int,key2.nestedKey1:Int
Thanks In Advance
Use JsonSerDe and create table with below syntax:
hive> create table sample(key1 string,key2 struct<nestedKey1:int,nestedKey2:int>)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';
hive> select key1,key2.nestedkey1,key2.nestedkey2 from sample;
+---------+-------------+-------------+--+
| key1 | nestedkey1 | nestedkey2 |
+---------+-------------+-------------+--+
| value1 | 1 | 2 |
+---------+-------------+-------------+--+
hive> select * from sample;
+--------------+----------------------------------+--+
| sample.key1 | sample.key2 |
+--------------+----------------------------------+--+
| value1 | {"nestedkey1":1,"nestedkey2":2} |
+--------------+----------------------------------+--+
(or)
If you want to create table with flatten out json fields then use RegexSerDe and matching regex to extract nestedkey from the data.
Refer this link for more details regards to regex serde.
UPDATE:
Inputdata:
{"key1":"value1","key2":{"nestedKey1":1,"nestedKey2":2}}
HiveTable:
hive> CREATE TABLE dd (key1 string, nestedKey1 string, nestedKey2 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES
('input.regex'=".*:\"(.*?)\",\"key2\":\\{\"nestedKey1\":(\\d),\"nestedKey2\":(\\d).*$");
Select data from the table:
hive> select * from dd;
+---------+-------------+-------------+--+
| key1 | nestedkey1 | nestedkey2 |
+---------+-------------+-------------+--+
| value1 | 1 | 2 |
+---------+-------------+-------------+--+
We have the tables in hive like below and we are generating the flat files from hive data while we are generating we found that there was junk characteres with in the data like below we have many characters in many columns can any one help us to remove those junk characters from hive table or from unix file ?
ÿ,ä,í,ã
Here problem the same data need to send the downstream when they are loading in to there DB it shows as double dollar but we design code double dollar as column delimiter.
Basic concept
hive> select regexp_replace('Hÿelloä íworlãd','[^a-zA-Z ]','');
OK
Hello world
Demo
Removing undesired character from the whole table and exporting it to a file.
create table t (i int,s1 string,s2 string);
insert into t values (1,'Hÿelloä','íworlãd'),(2,'ãGããood','Byÿe');
select * from t;
+---+---------+---------+
| i | s1 | s2 |
+---+---------+---------+
| 1 | Hÿelloä | íworlãd |
| 2 | ãGããood | Byÿe |
+---+---------+---------+
create external table t_ext (rec string)
row format delimited
fields terminated by '0'
location '/user/hive/warehouse/t'
;
insert overwrite local directory '/tmp/t_ext'
select regexp_replace(regexp_replace(rec,'[^a-zA-Z0-9 \\01]',''),'\\x01','<--->')
from t_ext
;
! ls /tmp/t_ext
;
000000_0
! cat /tmp/t_ext/000000_0
;
1<--->Hello<--->world
2<--->Good<--->Bye
This works as long as your tables contain only "primitive" types (no structs, arrays, maps etc.).
I really pushed the envelope here.
Demo
create table t (i int, dt date, str string, ts timestamp, bl boolean);
insert into t select 1,current_date,'Hello world',current_timestamp,true;
select * from t;
+-----+------------+-------------+-------------------------+------+
| t.i | t.dt | t.str | t.ts | t.bl |
+-----+------------+-------------+-------------------------+------+
| 1 | 2017-03-14 | Hello world | 2017-03-14 14:37:28.889 | true |
+-----+------------+-------------+-------------------------+------+
select regexp_replace
(
printf(concat('%s',repeat('$$%s',field(unhex(1),*,unhex(1))-2)),*)
,'(\\$\\$)|[^a-zA-Z0-9 -]'
,'$1'
)
from t
;
1$$2017-03-14$$Hello world$$2017-03-14 143728.889$$true