I am trying to convert blank values in the source file to NULL in the Hive table by setting the property 'serialization.null.format' = ''. The query I have written in Hive is:
create table test(a int, b string) stored as parquet TBLPROPERTIES('serialization.null.format'='');
And then I insert values into it through Impala, something like this:
insert overwrite table test values (1, ''), (2, 'b');
The result looks like this:
| a | b |
| 1 | |
| 2 | b |
Can someone help me out here as to why the blank is not getting converted to NULL?
The problem is the Parquet SerDe. See the issue at https://issues.apache.org/jira/browse/HIVE-12362.
The description is as follows:
create table src (a string);
insert into table src values (NULL), (''), ('');
0: jdbc:hive2://localhost:10000/default> select * from src;
+-----------+--+
| src.a |
+-----------+--+
| NULL |
| |
| |
+-----------+--+
create table dest (a string) row format serde 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' stored as INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';
alter table dest set SERDEPROPERTIES ('serialization.null.format' = '');
alter table dest set TBLPROPERTIES ('serialization.null.format' = '');
insert overwrite table dest select * from src;
0: jdbc:hive2://localhost:10000/default> select * from dest;
+-----------+--+
| dest.a    |
+-----------+--+
| NULL |
| |
| |
+-----------+--+
You could try inserting into the table with a CASE expression that maps blanks to NULL (source_table here is a hypothetical staging table with the same columns):
insert overwrite table test
select a,
       CASE
         WHEN TRIM(b) = '' THEN NULL
         ELSE b
       END
from source_table;
This will do the trick: nullif(trim(b), '')
It returns b, or NULL when b is blank. So when selecting you can do:
select a, nullif(trim(b), '') from test;
FYR: nullif(value1, value2) returns NULL if value1 = value2; otherwise returns value1 (as of Hive 2.3.0).
Shorthand for: CASE WHEN value1 = value2 THEN NULL ELSE value1 END
https://www.docs4dev.com/docs/en/apache-hive/3.1.1/reference/LanguageManual_UDF.html
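For example, a quick sanity check of the semantics in Hive (expected results noted in comments):
select nullif(trim('  '), '');   -- blanks collapse to '', so this returns NULL
select nullif(trim(' b '), '');  -- returns 'b'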
Cheers!!
I have a table T with two columns.
Column A is a varchar column and Column B is a XML column.
Somewhere inside Column B there is always the following parent tag: <Documents> ... </Documents>. Inside there are many <Document>...</Document> children.
I would like to get a result set with two columns:
Column 1 should contain the same values of Column A;
Column 2 should contain the content of one <Document>...</Document> only.
E.g. Starting table T:
Column A | Column B
---------+-----------------------------------------------------------------------------
abc      | <Documents><Document>Doc 1</Document><Document>Doc 2</Document></Documents>
Expected result:
Column 1 | Column 2
---------+----------------------------
abc      | <Document>Doc 1</Document>
abc      | <Document>Doc 2</Document>
I can get Column 2 like this (adapted from the docs):
SELECT T2.C.query('.')
FROM T
CROSS APPLY T.[Column B].nodes('*/Documents/*') as T2 (C)
but the following, where I also select Column A, does not work:
SELECT T.[Column A], T2.C.query('.')
FROM T
CROSS APPLY T.[Column B].nodes('*/Documents/*') as T2 (C)
How to get the expected result then?
Here is how to do it.
SQL
-- DDL and sample data population, start
DECLARE #tbl TABLE (ID CHAR(3), xmldata XML);
INSERT INTO #tbl (ID, xmldata)
VALUES
('abc', '<Documents><Document>Doc 1</Document><Document>Doc 2</Document></Documents>')
, ('xyz', '<Documents><Document>Doc 10</Document><Document>Doc 20</Document></Documents>');
-- DDL and sample data population, end
SELECT ID
, c.query('.') AS [Column 2]
FROM #tbl AS tbl
CROSS APPLY tbl.xmldata.nodes('//Documents/Document') AS t(c);
Output
+-----+-----------------------------+
| ID | Column 2 |
+-----+-----------------------------+
| abc | <Document>Doc 1</Document> |
| abc | <Document>Doc 2</Document> |
| xyz | <Document>Doc 10</Document> |
| xyz | <Document>Doc 20</Document> |
+-----+-----------------------------+
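If you want just the text of each element rather than the XML fragment, a small variation on the same query (same sample table, using value() instead of query()) would be:
SELECT ID
     , c.value('.', 'VARCHAR(20)') AS [Document text]
FROM #tbl AS tbl
CROSS APPLY tbl.xmldata.nodes('//Documents/Document') AS t(c);
-- Returns Doc 1, Doc 2, Doc 10, Doc 20 alongside their IDs.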
I want to create an external Hive table from nested JSON data, but the fields should be flattened out from the nested JSON.
For Example:-
{
"key1":"value1",
"key2":{
"nestedKey1":1,
"nestedKey2":2
}
}
The Hive table should have the fields flattened out like:
key1: String, key2.nestedKey1: Int, key2.nestedKey2: Int
Thanks In Advance
Use JsonSerDe and create the table with the syntax below:
hive> create table sample(key1 string,key2 struct<nestedKey1:int,nestedKey2:int>)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';
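Then load the JSON (one object per line) into the table; the file path here is hypothetical:
hive> load data local inpath '/tmp/sample.json' into table sample;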
hive> select key1,key2.nestedkey1,key2.nestedkey2 from sample;
+---------+-------------+-------------+--+
| key1 | nestedkey1 | nestedkey2 |
+---------+-------------+-------------+--+
| value1 | 1 | 2 |
+---------+-------------+-------------+--+
hive> select * from sample;
+--------------+----------------------------------+--+
| sample.key1 | sample.key2 |
+--------------+----------------------------------+--+
| value1 | {"nestedkey1":1,"nestedkey2":2} |
+--------------+----------------------------------+--+
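If you want the flattened names available directly, one option (a sketch over the table above) is a view that projects the struct fields:
hive> create view sample_flat as
      select key1,
             key2.nestedkey1 as nestedkey1,
             key2.nestedkey2 as nestedkey2
      from sample;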
(or)
If you want to create the table with flattened-out JSON fields, then use RegexSerDe with a matching regex to extract the nested keys from the data.
Refer to this link for more details regarding RegexSerDe.
UPDATE:
Input data:
{"key1":"value1","key2":{"nestedKey1":1,"nestedKey2":2}}
Hive table:
hive> CREATE TABLE dd (key1 string, nestedKey1 string, nestedKey2 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES
('input.regex'=".*:\"(.*?)\",\"key2\":\\{\"nestedKey1\":(\\d),\"nestedKey2\":(\\d).*$");
Select data from the table:
hive> select * from dd;
+---------+-------------+-------------+--+
| key1 | nestedkey1 | nestedkey2 |
+---------+-------------+-------------+--+
| value1 | 1 | 2 |
+---------+-------------+-------------+--+
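Note that the (\\d) groups in the regex capture only a single digit; for multi-digit values, an untested variation of the same pattern would use (\\d+):
hive> ALTER TABLE dd SET SERDEPROPERTIES
('input.regex'=".*:\"(.*?)\",\"key2\":\\{\"nestedKey1\":(\\d+),\"nestedKey2\":(\\d+).*$");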
I'm trying to add some data to a directory, and afterwards add that data as a partition to a table.
create table test (key int, value int) partitioned by (dt int) stored as parquet location '/user/me/test';
insert overwrite directory '/user/me/test/dt=1' stored as parquet select 123, 456, 1;
alter table test add partition (dt=1);
select * from test;
This code sample is simple... but doesn't work. With the select statement, the output is NULL, NULL, 1. But I need 123, 456, 1.
When I read the data with Impala, I get 123, 456, 1... which is what I expect.
Why? What is wrong?
If I remove the two "stored as parquet" clauses, it all works... but I want my data in Parquet!
PS: I want this construct for a partition switch, so that the data is not visible to users while it is being calculated...
Identifying the issue
hive
create table test (key int, value int)
partitioned by (dt int)
stored as parquet location '/user/me/test'
;
insert overwrite directory '/user/me/test/dt=1'
stored as parquet
select 123, 456
;
alter table test add partition (dt=1)
;
select * from test
;
+----------+------------+---------+
| test.key | test.value | test.dt |
+----------+------------+---------+
| NULL | NULL | 1 |
+----------+------------+---------+
bash
parquet-tools cat hdfs://{fs.defaultFS}/user/me/test/dt=1/000000_0
_col0 = 123
_col1 = 456
The files written by insert overwrite directory carry generated column names (_col0, _col1), and Hive resolves Parquet columns by name, so key and value find no match and come back as NULL.
Verifying the issue
hive
alter table test change column `key` `_col0` int cascade;
alter table test change column `value` `_col1` int cascade;
select * from test
;
+------------+------------+---------+
| test._col0 | test._col1 | test.dt |
+------------+------------+---------+
| 123 | 456 | 1 |
+------------+------------+---------+
Suggested Solution
Create an additional managed table test_admin and do the insert through it:
create table test_admin (key int, value int)
partitioned by (dt int)
stored as parquet location '/user/me/test'
;
create external table test (key int, value int)
partitioned by (dt int)
stored as parquet
location '/user/me/test'
;
insert into test_admin partition (dt=1) select 123, 456
;
select * from test_admin
;
+----------------+------------------+---------------+
| test_admin.key | test_admin.value | test_admin.dt |
+----------------+------------------+---------------+
| 123            | 456              | 1             |
+----------------+------------------+---------------+
select * from test
;
(empty result set)
alter table test add partition (dt=1)
;
select * from test
;
+----------+------------+---------+
| test.key | test.value | test.dt |
+----------+------------+---------+
| 123 | 456 | 1 |
+----------+------------+---------+
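As an alternative, some Hive versions let the Parquet SerDe resolve columns by position instead of by name, which would sidestep the _col0/_col1 mismatch without the helper table. Treat the property below as something to verify against your Hive version:
alter table test set TBLPROPERTIES ('parquet.column.index.access'='true');
select * from test;
-- if supported, columns are matched by position, so key and value resolve to _col0 and _col1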
Please tell me how to display them like this:
| NumberOfQuotesGenerated  | 11    |
| TotalAmountOfQuotes      | 78100 |
| NumberOfInvoiceGenerated | 9     |
| TotalAmountOfInvoice     | 8222  |
Thank you in advance
You can try UNPIVOT syntax like below:
-- create table t(NumberOfQuotesGenerated int,TotalAmountOfQuotes int,NumberOfInvoiceGenerated int, TotalAmountOfInvoice int)
-- insert into t values (11, 78100, 9, 8222)
select * from
(select * from t) s
unpivot
(
    data for colname in
    ([NumberOfQuotesGenerated], [TotalAmountOfQuotes], [NumberOfInvoiceGenerated], [TotalAmountOfInvoice])
) up
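If you prefer to avoid UNPIVOT, an equivalent using CROSS APPLY with a values constructor (same table t as in the comments above) would be:
select v.colname, v.data
from t
cross apply (values
    ('NumberOfQuotesGenerated',  NumberOfQuotesGenerated),
    ('TotalAmountOfQuotes',      TotalAmountOfQuotes),
    ('NumberOfInvoiceGenerated', NumberOfInvoiceGenerated),
    ('TotalAmountOfInvoice',     TotalAmountOfInvoice)
) v(colname, data);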
I am trying to load integer data into a table on a Netezza server. However, I do not know how to load data with a default value in the case that a value is missing or null.
Right now, the table consists of two columns with both having its own default value.
Attribute | Type | Modifier | Default Value
----------+---------+----------+----------------
number1 | integer | | 0
number2 | integer | | 100
I am currently running the following nzload command: nzload -cf test.log
The test.log file looks like this:
DATAFILE /usr/me/test.dat
{
Database test
TableName numberTest
Delimiter '|'
}
The test.dat file looks like this:
1|2
3|4
5|6
7|
|8
The issue I am faced with is that while the command runs fine, the integer values default to NULL as opposed to what the default value was set to. I have tried using insert within nzsql, and that creates the correct default values, but I was wondering if there is a way to do this with nzload.
Any help would be much appreciated.
The default value constraint will be enforced when performing inserts where the column with the default value is not referenced in the column list for the insert.
For example:
TESTDB.ADMIN(ADMIN)=> create table default_test (col1 varchar(10),
TESTDB.ADMIN(ADMIN)(> col2 varchar(10) default 'myDefault', col3 varchar(10));
CREATE TABLE
TESTDB.ADMIN(ADMIN)=> insert into default_test (col1, col3) values ('A','C');
INSERT 0 1
TESTDB.ADMIN(ADMIN)=> select * from default_test;
COL1 | COL2 | COL3
------+-----------+------
A | myDefault | C
(1 row)
However, when you are performing an nzload, Netezza is actually performing an insert into the target table with a select from an external table defined on your load datafile. In doing so it is including each column in the column list, and therefore the default value will not be triggered, even if the value in the external table's data file is NULL or an empty string.
[nz#netezza test]$ cat test.txt
A,B,C
D,,F
G,NULL,I
TESTDB.ADMIN(ADMIN)=> create external table default_test_ext
TESTDB.ADMIN(ADMIN)-> sameas default_test using (
TESTDB.ADMIN(ADMIN)(> dataobject '/export/home/nz/test/test.txt' delimiter ','
TESTDB.ADMIN(ADMIN)(> );
CREATE EXTERNAL TABLE
TESTDB.ADMIN(ADMIN)=> select * from default_test_ext;
COL1 | COL2 | COL3
------+------+------
A | B | C
D | | F
G | | I
(3 rows)
TESTDB.ADMIN(ADMIN)=> select * from default_test_ext where
TESTDB.ADMIN(ADMIN)-> (col2 is null or col2 = '');
COL1 | COL2 | COL3
------+------+------
D | | F
G | | I
(2 rows)
Since NULL and empty strings are valid values, and nzload references that column in its insert, the default value cannot/should not be used. It's working as I would expect it to; however, it would definitely be useful if you could tell nzload to transform NULLs or empty strings to the default value for a column. Unfortunately, that functionality doesn't currently exist (at least not to my knowledge).
While this is hyper-kludgey, I have gotten around this for data loads by doing the external table manually, and loading in two steps.
TESTDB.ADMIN(ADMIN)=> truncate table default_test;
TRUNCATE TABLE
TESTDB.ADMIN(ADMIN)=> insert into default_test (col1, col3)
TESTDB.ADMIN(ADMIN)-> select col1, col3 from default_test_ext
TESTDB.ADMIN(ADMIN)-> where (col2 is null or col2 = '');
INSERT 0 2
TESTDB.ADMIN(ADMIN)=> select * from default_test;
COL1 | COL2 | COL3
------+-----------+------
D | myDefault | F
G | myDefault | I
(2 rows)
TESTDB.ADMIN(ADMIN)=> insert into default_test
TESTDB.ADMIN(ADMIN)-> select * from default_test_ext
TESTDB.ADMIN(ADMIN)-> where (col2 is not null and col2 <> '');
INSERT 0 1
TESTDB.ADMIN(ADMIN)=> select * from default_test;
COL1 | COL2 | COL3
------+-----------+------
A | B | C
D | myDefault | F
G | myDefault | I
(3 rows)
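A single-pass alternative to the two inserts, if you don't mind hard-coding the default rather than letting the table supply it (a sketch over the same external table):
TESTDB.ADMIN(ADMIN)=> insert into default_test
TESTDB.ADMIN(ADMIN)-> select col1, nvl(nullif(col2, ''), 'myDefault'), col3
TESTDB.ADMIN(ADMIN)-> from default_test_ext;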
Netezza does not enforce the default value constraint; it merely exists for notation (see the IBM documentation). In order to fix your table you must run update statements.
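For example, using the defaults from the numberTest table in the question:
update numberTest set number1 = 0   where number1 is null;
update numberTest set number2 = 100 where number2 is null;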