Hive: insert overwrite directory stored as parquet produces NULL values

I'm trying to write some data into a directory, and then add that directory as a partition to a table.
create table test (key int, value int) partitioned by (dt int) stored as parquet location '/user/me/test';
insert overwrite directory '/user/me/test/dt=1' stored as parquet select 123, 456, 1;
alter table test add partition (dt=1);
select * from test;
This code sample is simple... but it doesn't work. The select statement returns NULL, NULL, 1, but I need 123, 456, 1.
When I read the data with Impala, I get 123, 456, 1... which is what I expect.
Why? What is wrong?
If I remove the two "stored as parquet" clauses, everything works... but I want my data in Parquet!
PS: I want this construct for a partition switch, so that the data is not visible to users while it is being calculated...

Identifying the issue
Hive:
create table test (key int, value int)
partitioned by (dt int)
stored as parquet location '/user/me/test'
;
insert overwrite directory '/user/me/test/dt=1'
stored as parquet
select 123, 456
;
alter table test add partition (dt=1)
;
select * from test
;
+----------+------------+---------+
| test.key | test.value | test.dt |
+----------+------------+---------+
| NULL     | NULL       | 1       |
+----------+------------+---------+
bash:
parquet-tools cat hdfs://{fs.defaultFS}/user/me/test/dt=1/000000_0
_col0 = 123
_col1 = 456
The Parquet files written by insert overwrite directory carry the default column names _col0 and _col1. Hive's Parquet reader resolves columns by name, so key and value cannot be matched and come back as NULL; Impala resolves Parquet columns by position, which is why it reads the values fine.
Verifying the issue
Hive:
alter table test change column `key` `_col0` int cascade;
alter table test change column `value` `_col1` int cascade;
select * from test
;
+------------+------------+---------+
| test._col0 | test._col1 | test.dt |
+------------+------------+---------+
| 123        | 456        | 1       |
+------------+------------+---------+
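To undo the verification step, the columns can simply be renamed back the same way:
alter table test change column `_col0` `key` int cascade;
alter table test change column `_col1` `value` int cascade;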
Suggested solution
Create an additional table test_admin and do the insert through it:
create table test_admin (key int, value int)
partitioned by (dt int)
stored as parquet location '/user/me/test'
;
create external table test (key int, value int)
partitioned by (dt int)
stored as parquet
location '/user/me/test'
;
insert into test_admin partition (dt=1) select 123, 456
;
select * from test_admin
;
+----------------+------------------+----------------+
| test_admin.key | test_admin.value | test_admin.dt  |
+----------------+------------------+----------------+
| 123            | 456              | 1              |
+----------------+------------------+----------------+
select * from test
;
(empty result set)
alter table test add partition (dt=1)
;
select * from test
;
+----------+------------+---------+
| test.key | test.value | test.dt |
+----------+------------+---------+
| 123      | 456        | 1       |
+----------+------------+---------+
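For the partition-switch use case mentioned in the question (data should not be visible to users until it has been fully calculated), the same pair of tables extends naturally: write each new partition through test_admin, and only register it on the user-facing test table once it is complete. A minimal sketch, assuming the tables above (the values 789 and 1011 are just example data):
-- The files land under /user/me/test/dt=2, but the external table test
-- does not know about the partition yet, so users still see nothing.
insert overwrite table test_admin partition (dt=2) select 789, 1011
;
-- Once the data is ready, publish it with a metadata-only operation.
alter table test add partition (dt=2)
;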

Related

How to use selected value as table name for another select in PostgreSQL

I have a table with two columns, source_name and target_name:
CREATE TABLE all_table(
source_name text,
target_name text
);
The source name is the name of some external data; the target name is an auto-generated table name in my DB. There is a relationship between them: each source name has exactly one target name.
I have additional tables in the DB:
CREATE TABLE output_table_1(
first_name text,
second_name text,
birthday timestamp
);
CREATE TABLE output_table_2(
login text,
money int
);
In table "all_table" I have some rows:
| source_name   | target_name    |
|---------------|----------------|
| personal data | output_table_1 |
| login data    | output_table_2 |
I want to select information from the correct table by source name. So I tried:
WITH selected_table AS (
SELECT target_name FROM all_table WHERE source_name='personal data'
)
SELECT * FROM selected_table;
And also
SELECT first_name FROM
(SELECT target_name FROM all_table WHERE source_name='personal data') AS out_table;
But Postgres only prints me the correct target name:
| target_name    |
|----------------|
| output_table_1 |
I want my query to return something like this:
| first_name | second_name | birthday |
|------------|-------------|----------|
| FName1     | SName1      | Date1    |
| FName2     | SName2      | Date2    |
| FName3     | SName3      | Date3    |
| FName4     | SName4      | Date4    |
| ...        | ...         | ...      |
I've also tried this query
DO
$$
BEGIN
EXECUTE format('SELECT * FROM %s LIMIT 10', (SELECT target_name FROM all_table WHERE source_name='personal data'));
END;
$$ LANGUAGE plpgsql;
Query executed but nothing happened.
Searching on Google didn't turn up anything useful, but maybe I'm just bad at this.
¯\_(ツ)_/¯
If you want to obtain data from a DO block, you need to define a cursor for the query.
do
$$
declare
_query text ;
_cursor CONSTANT refcursor :='_cursor';
begin
_query:='Select * from '||(Select tab_name from ... where ..);
open _cursor for execute _query;
end;
$$;
fetch all from _cursor;
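Applied to the question's all_table lookup, that might look like the sketch below (the cursor name is arbitrary, and everything has to run inside one transaction so the cursor is still open when you fetch):
begin;

do
$$
declare
    _query  text;
    _cursor constant refcursor := '_cursor';
begin
    -- Build the query from the table name stored in all_table;
    -- format('%I', ...) quotes the identifier safely.
    _query := format('select * from %I',
                     (select target_name
                      from all_table
                      where source_name = 'personal data'));
    open _cursor for execute _query;
end;
$$;

-- Returns the rows of output_table_1.
fetch all from _cursor;

commit;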

AUTOINCREMENT primary key for snowflake bulk loading

I would like to upload data into a Snowflake table. The Snowflake table has a primary key field with AUTOINCREMENT.
When I tried to upload data into Snowflake without the primary key field, I received the following error message:
The COPY failed with error: Number of columns in file (2) does not
match that of the corresponding table (3), use file format option
error_on_column_count_mismatch=false to ignore this error
Does anyone know if I can bulk load data into a table that has an AUTOINCREMENT primary key?
You can query the staged file using a file format to load your data. I have created a sample table like the one below; the first column is set to autoincrement:
-- Create the target table
create or replace table Employee (
empidnumber number autoincrement start 1 increment 1,
name varchar,
salary varchar
);
I staged a sample file into the Snowflake internal stage, queried the staged file, and then executed the following COPY command:
copy into Employee (name, salary) from (select $1, $2 from @test/test.csv.gz);
And it loaded the table with incremented values.
The docs have the following example which suggests this can be done:
https://docs.snowflake.net/manuals/user-guide/data-load-transform.html#include-autoincrement-identity-columns-in-loaded-data
-- Omit the sequence column in the COPY statement
copy into mytable (col2, col3)
from (
select $1, $2
from @~/myfile.csv.gz t
)
;
Could you please try this syntax and see if it works for you?
Create the target table
create or replace table mytable (
col1 number autoincrement start 1 increment 1,
col2 varchar,
col3 varchar
);
Stage a data file in the internal user stage
put file:///tmp/myfile.csv @~;
Query the staged data file
select $1, $2 from @~/myfile.csv.gz t;
+-----+-----+
| $1  | $2  |
|-----+-----|
| abc | def |
| ghi | jkl |
| mno | pqr |
| stu | vwx |
+-----+-----+
Omit the sequence column in the COPY statement
copy into mytable (col2, col3)
from (
select $1, $2
from @~/myfile.csv.gz t
)
;
select * from mytable;
+------+------+------+
| COL1 | COL2 | COL3 |
|------+------+------|
| 1    | abc  | def  |
| 2    | ghi  | jkl  |
| 3    | mno  | pqr  |
| 4    | stu  | vwx  |
+------+------+------+
Adding a PRIMARY KEY in Snowflake is different from standard SQL.
Syntax for adding a primary key with auto increment:
CREATE OR REPLACE TABLE EMPLOYEES (
NAME VARCHAR(100),
SALARY VARCHAR(100),
EMPLOYEE_ID INT AUTOINCREMENT START 1 INCREMENT 1
);
START 1 starts the ID at 1 (we can start at any number we want).
INCREMENT 1 adds 1 to the previous value for each new row (we can use any step we want).
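Loading that table then follows the same pattern as the example above: list only the non-autoincrement columns in the COPY statement and let EMPLOYEE_ID fill itself in. A minimal sketch, assuming a two-column CSV staged in the user stage as employees.csv.gz (the file name is hypothetical):
-- EMPLOYEE_ID is omitted, so Snowflake generates it from the
-- AUTOINCREMENT sequence defined on the column.
copy into EMPLOYEES (NAME, SALARY)
from (
select $1, $2
from @~/employees.csv.gz t
)
;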

create hive table from nested json data with flatten out fields

I want to create an external Hive table from nested JSON data, but the fields should be flattened out from the nested JSON.
For example:
{
  "key1": "value1",
  "key2": {
    "nestedKey1": 1,
    "nestedKey2": 2
  }
}
The Hive table should have the fields flattened out, like:
key1: String, key2.nestedKey1: Int, key2.nestedKey2: Int
Thanks In Advance
Use JsonSerDe and create the table with the syntax below:
hive> create table sample(key1 string,key2 struct<nestedKey1:int,nestedKey2:int>)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe';
hive> select key1,key2.nestedkey1,key2.nestedkey2 from sample;
+---------+-------------+-------------+--+
| key1    | nestedkey1  | nestedkey2  |
+---------+-------------+-------------+--+
| value1  | 1           | 2           |
+---------+-------------+-------------+--+
hive> select * from sample;
+--------------+----------------------------------+--+
| sample.key1  | sample.key2                      |
+--------------+----------------------------------+--+
| value1       | {"nestedkey1":1,"nestedkey2":2}  |
+--------------+----------------------------------+--+
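If you want the flattened names available without writing key2.nestedkey1 in every query, one option (a sketch on top of the sample table above; the view name sample_flat is arbitrary) is a view that exposes the struct fields as top-level columns:
hive> create view sample_flat as
      select key1,
             key2.nestedkey1 as nestedkey1,
             key2.nestedkey2 as nestedkey2
      from sample;
hive> select * from sample_flat;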
(or)
If you want to create the table with flattened JSON fields, then use RegexSerDe with a matching regex to extract the nested keys from the data.
Refer to the RegexSerDe documentation for more details.
UPDATE:
Input data:
{"key1":"value1","key2":{"nestedKey1":1,"nestedKey2":2}}
Hive table:
hive> CREATE TABLE dd (key1 string, nestedKey1 string, nestedKey2 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES
('input.regex'=".*:\"(.*?)\",\"key2\":\\{\"nestedKey1\":(\\d),\"nestedKey2\":(\\d).*$");
Select data from the table:
hive> select * from dd;
+---------+-------------+-------------+--+
| key1    | nestedkey1  | nestedkey2  |
+---------+-------------+-------------+--+
| value1  | 1           | 2           |
+---------+-------------+-------------+--+

Parquet-backed Hive table: array column not queryable in Impala

Although Impala is much faster than Hive, we used Hive because it supports complex (nested) data types such as arrays and maps.
I notice that Impala, as of CDH5.5, now supports complex data types. Since it's also possible to run Hive UDF's in Impala, we can probably do everything we want in Impala, but much, much faster. That's great news!
As I scan through the documentation, I see that Impala expects data to be stored in Parquet format. My data, in its raw form, happens to be a two-column CSV where the first column is an ID, and the second column is a pipe-delimited array of strings, e.g.:
123,ASDFG|SDFGH|DFGHJ|FGHJK
234,QWERT|WERTY|ERTYU
A Hive table was created:
CREATE TABLE `id_member_of`(
`id` INT,
`member_of` ARRAY<STRING>)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
COLLECTION ITEMS TERMINATED BY '|'
LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
The raw data was loaded into the Hive table:
LOAD DATA LOCAL INPATH 'raw_data.csv' INTO TABLE id_member_of;
A Parquet version of the table was created:
CREATE TABLE `id_member_of_parquet` (
`id` STRING,
`member_of` ARRAY<STRING>)
STORED AS PARQUET;
The data from the CSV-backed table was inserted into the Parquet table:
INSERT INTO id_member_of_parquet SELECT id, member_of FROM id_member_of;
And the Parquet table is now queryable in Hive:
hive> select * from id_member_of_parquet;
123 ["ASDFG","SDFGH","DFGHJ","FGHJK"]
234 ["QWERT","WERTY","ERTYU"]
Strangely, when I query the same Parquet-backed table in Impala, it doesn't return the array column:
[hadoop01:21000] > invalidate metadata;
[hadoop01:21000] > select * from id_member_of_parquet;
+-----+
| id  |
+-----+
| 123 |
| 234 |
+-----+
Question: What happened to the array column? Can you see what I'm doing wrong?
It turned out to be really simple: we can access the array by adding it to the FROM clause with dot notation, e.g.:
Query: select * from id_member_of_parquet, id_member_of_parquet.member_of
+-----+-------+
| id  | item  |
+-----+-------+
| 123 | ASDFG |
| 123 | SDFGH |
| 123 | DFGHJ |
| 123 | FGHJK |
| 234 | QWERT |
| 234 | WERTY |
| 234 | ERTYU |
+-----+-------+
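Equivalently, with explicit table and array aliases (a sketch; Impala exposes each array element through the item pseudocolumn):
-- Each element of member_of becomes one row, paired with its parent id.
select m.id, a.item
from id_member_of_parquet m, m.member_of a;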

Convert Blank to NULL in Hive

I am trying to convert blank values in the source file to NULL in the Hive table by setting the property 'serialization.null.format' = ''. The query I have written in Hive is:
create table test(a int, b string) stored as parquet TBLPROPERTIES('serialization.null.format'='');
And then I insert values into it through Impala, something like this:
insert overwrite table test values (1, ''), (2, 'b');
The result looks like this:
| a | b |
| 1 |   |
| 2 | b |
Can someone help me out here as to why the blank is not getting converted to NULL?
The problem is the Parquet SerDe. See the issue at https://issues.apache.org/jira/browse/HIVE-12362.
The description is as follows:
create table src (a string);
insert into table src values (NULL), (''), ('');
0: jdbc:hive2://localhost:10000/default> select * from src;
+-----------+--+
| src.a     |
+-----------+--+
| NULL      |
|           |
|           |
+-----------+--+
create table dest (a string) row format serde 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' stored as INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';
alter table dest set SERDEPROPERTIES ('serialization.null.format' = '');
alter table dest set TBLPROPERTIES ('serialization.null.format' = '');
insert overwrite table dest select * from src;
0: jdbc:hive2://localhost:10000/default> select * from dest;
+-----------+--+
| dest.a    |
+-----------+--+
| NULL      |
|           |
|           |
+-----------+--+
You could try inserting into the table with a CASE expression that maps blanks to NULL, like this:
INSERT OVERWRITE TABLE dest
SELECT CASE WHEN TRIM(a) = '' THEN NULL ELSE a END
FROM src;
This will do the trick: nullif(trim(b), '')
It gives b, or NULL when b is blank. So when selecting you can do:
select a, nullif(trim(b), '') from test;
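Since the blank-vs-NULL distinction is fixed at write time for Parquet, the same expression can also be applied when inserting, so the file stores real NULLs. A sketch, where staging_test is a hypothetical source table with the same columns as test:
-- nullif turns blanks into NULL before they are written to the Parquet file.
insert overwrite table test
select a, nullif(trim(b), '')
from staging_test;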
FYR: nullif(value1, value2) returns NULL if value1 = value2; otherwise it returns value1 (as of Hive 2.3.0).
Shorthand for: CASE WHEN value1 = value2 THEN NULL ELSE value1 END
https://www.docs4dev.com/docs/en/apache-hive/3.1.1/reference/LanguageManual_UDF.html
Cheers!!