Loading data into hive table with dynamic partitioning - hive

File empdetails.log has the below data:
100 AAA 12000 HYD
101 BBB 13000 PUNE
102 CCC 14000 HYD
103 DDD 10000 BLORE
104 EEE 12000 PUNE
I want to load this data into an 'Emp' table with dynamic partitioning such that select * from Emp; gives me the following output (partitioned by location).
100 AAA 12000 HYD
102 CCC 14000 HYD
101 BBB 13000 PUNE
104 EEE 12000 PUNE
103 DDD 10000 BLORE
Could anyone provide the load command to be executed in Hive?
TABLE CREATED-
create table Emp (cid int, cname string, csal int)
partitioned by (cloc string)
row format delimited
fields terminated by '\t'
stored as textfile;

For dynamic partitioning, you have to use an INSERT ... SELECT query (Hive insert).
Inserting data into a Hive table with dynamic partitioning is a two-step process:
1) Create a staging table in a staging database in Hive and load data into that table from an
external source, such as an RDBMS, a document database, or local files,
using Hive LOAD.
2) Insert the data into the actual table in the ODS (operational data store / final database) using a Hive INSERT.
Also, set the following properties in Hive:
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
The following example works on the Cloudera VM.
-- Extract orders data from MySQL (Retail_DB.orders)
select * from orders into outfile '/tmp/orders_data.psv' fields terminated by '|' lines terminated by '\n';
-- Create Hive table with DP - order_month is DP.
CREATE TABLE orders (order_id int, order_date string, order_customer_id int, order_status string) PARTITIONED BY (order_month string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE;
--Create staging table in Hive.
CREATE TABLE orders_stage (order_id int,order_date string, order_customer_id int, order_status string ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE;
--Load data into staging table (Hive)
load data local inpath '/tmp/orders_data.psv' overwrite into table orders_stage;
--Insert into Orders, which is final table (Hive).
Insert overwrite table retail_ods.orders partition (order_month)
select order_id, order_date, order_customer_id,order_status,
substr(order_date, 1, 7) order_month from retail_stage.orders_stage;
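Applied to the Emp table from the question, a minimal sketch of the same two-step load (the staging table name emp_stage and the local file path are assumptions) would be:
-- Staging table matching the raw empdetails.log layout (location is a regular column here).
create table emp_stage (cid int, cname string, csal int, cloc string)
row format delimited fields terminated by '\t' stored as textfile;
-- Load the raw file into the staging table (path is an assumption).
load data local inpath '/tmp/empdetails.log' overwrite into table emp_stage;
-- Dynamic-partition insert into Emp, partitioned by cloc.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table Emp partition (cloc)
select cid, cname, csal, cloc from emp_stage;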
You can find more details at https://cwiki.apache.org/confluence/display/Hive/DynamicPartitions

Related

Why is there a data mismatch of 1 record between Hive and bigSQL?

I have created a Hive table and integrated it with bigSQL. In Hive my count is correct, but in bigSQL the record count is higher by 1. Below are the table properties that I have used to create the Hive table.
create table test(name string,age int,sal float,city string,country string,emp_id int,increment int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION '/user/test'
tblproperties ("skip.header.line.count"="1");
The textfile that I am loading has column names in the very first row. So I have to use the
tblproperties ("skip.header.line.count"="1");
When I do a count query in Hive, I get the below output:
Total MapReduce CPU Time Spent: 7 seconds 440 msec
OK
48203
However, when I synced the table in bigSQL, I am getting the below count:
+-------+
| 1 |
+-------+
| 48204 |
Any idea where I am making the mistake?
Thanks
I found a workaround for this problem.
1) We need to create a temp Hive table with tblproperties ("skip.header.line.count"="1");.
2) Load the file into this temp table.
3) Create another table without tblproperties ("skip.header.line.count"="1");.
4) insert into tbl select * from temp_tbl; (see the sketch below).
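A minimal sketch of that workaround in HiveQL (the table names tmp_test and test_final and the file path are assumptions):
-- Temp table that skips the header row of the raw file.
create table tmp_test (name string, age int, sal float, city string, country string, emp_id int, increment int)
row format delimited fields terminated by '|'
stored as textfile
tblproperties ("skip.header.line.count"="1");
load data inpath '/user/test/data.txt' into table tmp_test;
-- Final table without the skip-header property; Hive and bigSQL should now agree on the count.
create table test_final (name string, age int, sal float, city string, country string, emp_id int, increment int)
row format delimited fields terminated by '|'
stored as textfile;
insert into table test_final select * from tmp_test;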

How to add columns to an existing Hive partitioned table?

alter table abc add columns (stats1 map<string,string>, stats2 map<string,string>)
I have altered my table with the above query, but afterwards, while checking the data, I got NULLs for both extra columns. I'm not getting the data.
CASCADE is the solution.
Query:
ALTER TABLE dbname.table_name ADD COLUMNS (column1 string, column2 string) CASCADE;
This changes the columns of a table's metadata and cascades the same change to all the partition metadata.
RESTRICT is the default, limiting column change only to table metadata.
As others have noted, CASCADE will change the metadata for all partitions. Without CASCADE, if you want to change old partitions to include the new columns, you'll need to DROP the old partitions first and then fill them; INSERT OVERWRITE without the DROP won't work, because the metadata won't be updated to the new schema.
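For illustration, a minimal sketch of that drop-and-refill route (the partition column part_col, its value, and the abc_source table are placeholders, not from the question):
-- Drop the stale partition so its old metadata goes away.
ALTER TABLE abc DROP IF EXISTS PARTITION (part_col='2020-01');
-- Refill it; the recreated partition inherits the table's new schema, including stats1 and stats2.
INSERT OVERWRITE TABLE abc PARTITION (part_col='2020-01')
SELECT col1, col2, stats1, stats2 FROM abc_source WHERE part_col='2020-01';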
Let's say you have already run alter table abc add columns (stats1 map<string,string>, stats2 map<string,string>) without CASCADE by accident and then you INSERT OVERWRITE an old partition without DROPPING first. The data will be stored in the underlying files, but if you query that table from hive for that partition, it won't show because the metadata wasn't updated. This can be fixed without having to rerun the insert overwrite using the following:
1. Run SHOW CREATE TABLE dbname.tblname and copy all the column definitions that existed before adding the new columns.
2. Run ALTER TABLE dbname.tblname REPLACE COLUMNS ({paste in col defs besides columns to add here}) CASCADE.
3. Run ALTER TABLE dbname.tblname ADD COLUMNS (newcol1 int COMMENT "new col") CASCADE.
4. Be happy that the metadata has been changed for all partitions =)
As an example of steps 2-3:
DROP TABLE IF EXISTS junk.testcascade ;
CREATE TABLE junk.testcascade (
startcol INT
)
partitioned by (d int)
stored as parquet
;
INSERT INTO TABLE junk.testcascade PARTITION(d=1)
VALUES
(1),
(2)
;
INSERT INTO TABLE junk.testcascade PARTITION(d=2)
VALUES
(1),
(2)
;
SELECT * FROM junk.testcascade ;
+-----------------------+----------------+--+
| testcascade.startcol | testcascade.d |
+-----------------------+----------------+--+
| 1 | 1 |
| 2 | 1 |
| 1 | 2 |
| 2 | 2 |
+-----------------------+----------------+--+
--no cascade! oops
ALTER TABLE junk.testcascade ADD COLUMNS( testcol1 int, testcol2 int) ;
INSERT OVERWRITE TABLE junk.testcascade PARTITION(d=3)
VALUES
(1,1,1),
(2,1,1)
;
INSERT OVERWRITE TABLE junk.testcascade PARTITION(d=2)
VALUES
(1,1,1),
(2,1,1)
;
--okay! because we created this table after altering the metadata
select * FROM junk.testcascade where d=3;
+-----------------------+-----------------------+-----------------------+----------------+--+
| testcascade.startcol | testcascade.testcol1 | testcascade.testcol2 | testcascade.d |
+-----------------------+-----------------------+-----------------------+----------------+--+
| 1 | 1 | 1 | 3 |
| 2 | 1 | 1 | 3 |
+-----------------------+-----------------------+-----------------------+----------------+--+
--not okay even though we inserted =( because the metadata isn't changed
select * FROM junk.testcascade where d=2;
+-----------------------+-----------------------+-----------------------+----------------+--+
| testcascade.startcol | testcascade.testcol1 | testcascade.testcol2 | testcascade.d |
+-----------------------+-----------------------+-----------------------+----------------+--+
| 1 | NULL | NULL | 2 |
| 2 | NULL | NULL | 2 |
+-----------------------+-----------------------+-----------------------+----------------+--+
--cut back to original columns
ALTER TABLE junk.testcascade REPLACE COLUMNS( startcol int) CASCADE;
--add
ALTER table junk.testcascade ADD COLUMNS( testcol1 int, testcol2 int) CASCADE;
--it works!
select * FROM junk.testcascade where d=2;
+-----------------------+-----------------------+-----------------------+----------------+--+
| testcascade.startcol | testcascade.testcol1 | testcascade.testcol2 | testcascade.d |
+-----------------------+-----------------------+-----------------------+----------------+--+
| 1 | 1 | 1 | 2 |
| 2 | 1 | 1 | 2 |
+-----------------------+-----------------------+-----------------------+----------------+--+
To add columns to a partitioned table you need to recreate the partitions.
Suppose the table is external and the data files already contain the new columns; do the following:
1. Alter table add columns...
2. Recreate the partitions. For each partition, drop it and then create it again. The newly created partition schema will inherit the table schema.
Alternatively, you can drop the table, then create the table and create all partitions, or restore them by simply running the MSCK REPAIR TABLE abc command. The equivalent command in Amazon Elastic MapReduce (EMR)'s version of Hive is ALTER TABLE table_name RECOVER PARTITIONS.
See manual here: RECOVER PARTITIONS
Also, in Hive 1.1.0 and later you can use the CASCADE option of ALTER TABLE ADD|REPLACE COLUMNS. See the manual here: ADD COLUMN
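A minimal sketch of recreating one partition on an external table (the partition column dt and the HDFS paths are assumptions):
ALTER TABLE abc ADD COLUMNS (stats1 map<string,string>, stats2 map<string,string>);
-- Drop and re-add the partition; the data files stay in place because the table is external.
ALTER TABLE abc DROP IF EXISTS PARTITION (dt='2020-01-01');
ALTER TABLE abc ADD PARTITION (dt='2020-01-01') LOCATION '/data/abc/dt=2020-01-01';
-- Or, if the directories follow the standard partition layout, rediscover all partitions at once:
MSCK REPAIR TABLE abc;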
These suggestions work for external tables.
This solution only works if your data is partitioned and you know the location of the latest partition. In this case, instead of doing a recover partition or a repair, which is a costly operation, you can do something like:
Read the partitioned table and get the schema details.
Read the table you want to update.
Find which columns are different and run an ALTER TABLE for each.
Posting Scala code for reference:
import org.apache.spark.sql.SparkSession

def updateMetastoreColumns(spark: SparkSession, partitionedTablePath: String, toUpdateTableName: String): Unit = {
  // fetch all column names along with their corresponding datatypes from the latest partition
  val partitionedTable = spark.read.orc(partitionedTablePath)
  val partitionedTableColumns = partitionedTable.columns zip partitionedTable.schema.map(_.dataType.catalogString)
  // fetch all column names along with their corresponding datatypes from the current table
  val toUpdateTable = spark.read.table(toUpdateTableName)
  val toUpdateTableColumns = toUpdateTable.columns zip toUpdateTable.schema.map(_.dataType.catalogString)
  // check if new columns are present in the newer partition
  val diffColumns = partitionedTableColumns.diff(toUpdateTableColumns)
  // update the metastore with the new column info
  diffColumns.foreach { column: (String, String) =>
    spark.sql(s"ALTER TABLE ${toUpdateTableName} ADD COLUMNS (${column._1} ${column._2})")
  }
}
This will help you dynamically find the latest columns added to the newer partition and update your metastore on the fly.

How to update a newly added column with the same value for all rows in SQL Server?

I have inserted 10,000 rows into a table which contains 3 columns.
Now I need to add a new column to the table, and I also need to store a value in that column which is to be the same for all 10,000 rows.
For example:
My table is like below:
No Name ID
1 raj 1000
2 ravi 1001
3 git 1002
..
.
.
10000 dat 10,000
Now I need to add the new column "date".
Then the data is like this:
No Name ID Date
1 raj 1001
2 ravi 1002
3 git 1003
I am using the below query to add the new column:
ALTER TABLE table_name
ADD Date date
But I need to know how to store the same data in all rows of the table, like below:
No Name ID Date
1 raj 1001 10.12.2020
2 ravi 1002 10.12.2020
3 git 1003 10.12.2020
.
.
.
.
10,000 dat 10000 10.12.2020
How can I achieve the above requirement?
I know only a little bit about SQL Server.
Can anyone please help me solve this?
ALTER TABLE table_name
ADD CONSTRAINT DF_date DEFAULT N'10.10.2020' FOR [date];
or
ALTER TABLE [dbo].table_name
ADD CONSTRAINT DF_table_name_column_name DEFAULT ('10.10.2020') FOR column_name
I am giving one example:
CREATE TABLE #TEST(PART VARCHAR(10),LASTTIME DATETIME)
GO
ALTER TABLE [DBO].#TEST
ADD CONSTRAINT DF_#TEST_LASTTIME DEFAULT ('10.10.2020') FOR LASTTIME
INSERT INTO #TEST (PART )
VALUES('A')
INSERT INTO #TEST (PART )
VALUES('B')
INSERT INTO #TEST (PART )
VALUES('AA')
INSERT INTO #TEST (PART )
VALUES('BA')
GO
First, only add the new column to your table:
ALTER TABLE Protocols
ADD Date Date
After that, for your past data, you can use an update query:
update Protocols set Date = '10.12.2020';
Lastly:
ALTER TABLE Protocols
ADD CONSTRAINT DF_Protocols_Date DEFAULT '10.12.2020' FOR [Date];
This will update all your past data with '10.12.2020', and in the future the value for Date will also default to '10.12.2020'.
Thanks.
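As an aside, a minimal single-statement sketch, assuming the same Protocols table as above: in SQL Server, adding the column together with a default constraint and WITH VALUES backfills all existing rows in one step.
ALTER TABLE Protocols
ADD [Date] date CONSTRAINT DF_Protocols_Date DEFAULT '2020-12-10' WITH VALUES;
-- '2020-12-10' is the unambiguous ISO form of 10.12.2020; WITH VALUES writes the default into the existing rows as well.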

create table in hive with additional columns

I am new to Hive. I want to create a table in Hive with the same columns as an existing table plus some additional columns. I know we can use something like this:
CREATE TABLE new_table_name
AS
SELECT *
FROM old_table_name
This will create the table with the same columns as old_table_name.
But how do I specify additional columns in new_table_name?
Here is how you can achieve it:
Old table:
hive> describe departments;
OK
department_id int from deserializer
department_name string from deserializer
Create table:
create table ctas as
select department_id, department_name,
cast(null as int) as col_null
from departments;
Displaying Structure of new table:
hive> describe ctas;
OK
department_id int
department_name string
col_null int
Time taken: 0.106 seconds, Fetched: 3 row(s)
Results from new table:
hive> select * from ctas;
OK
2 Fitness NULL
3 Footwear NULL
4 Apparel NULL
5 Golf NULL
6 Outdoors NULL
7 Fan Shop NULL
8 TESTING NULL
8000 TESTING NULL
9000 testing export NULL
A simple way is to issue an ALTER TABLE command to add more (additional) columns after the above CREATE statement.
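For instance, a quick sketch against the ctas table created above (the new column names are placeholders):
ALTER TABLE ctas ADD COLUMNS (new_col1 string, new_col2 int);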
First create a new table like the first one,
and after that alter this new table and add the columns that you want:
CREATE TABLE new_table LIKE old_table;
ALTER TABLE new_table ADD COLUMNS (newCol1 int, newCol2 int);
If you wish to avoid copying the data, make your new table external (see the sketch below).
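A hedged sketch of that idea, assuming you know the HDFS directory where the old table's data lives (CREATE TABLE ... LIKE accepts a LOCATION clause; the table name new_table_ext and the path are placeholders):
-- External table with old_table's schema, pointing at the existing data directory (no data copy).
CREATE EXTERNAL TABLE new_table_ext LIKE old_table LOCATION '/user/hive/warehouse/old_table';
ALTER TABLE new_table_ext ADD COLUMNS (newCol1 int, newCol2 int);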
I hope that it helps you :)

Inserting system timestamp into a timestamp field in hive table

I'm using Hive version 0.8.0. I want to insert the system timestamp into a timestamp field while loading data into a Hive table.
In Detail:
I have a file with 2 fields like below:
id name
1 John
2 Merry
3 Sam
Now I want to load this file into a Hive table along with the extra column "created_date". So I have created the Hive table with the extra field like below:
CREATE table mytable(id int,name string, created_date timestamp) row format delimited fields terminated by ',' stored as textfile;
To load the data file I used the below query:
LOAD DATA INPATH '/user/user/data/' INTO TABLE mytable;
If I run the above query, the "created_date" field will be NULL. But I want that field to be populated with the system timestamp instead of NULL while loading the data into the Hive table. Is this possible in Hive? How can I do it?
You can do this in two steps. First load data from the file into a temporary table without the timestamp. Then insert from the temp table into the actual table, and generate the timestamp with the unix_timestamp() UDF:
create table temptable(id int, name string)
row format delimited fields terminated by ','
stored as textfile;
create table mytable(id int, name string, created_date timestamp)
row format delimited fields terminated by ','
stored as textfile;
load data inpath '/user/user/data/' into table temptable;
insert into table mytable
select id, name, unix_timestamp()
from temptable;