Split Hive table into subtables by field value - hive

I have a Hive table foo. There are several fields in this table; one of them is some_id. The number of unique values in this field is in the range 5,000-10,000. For each value (in the example, 10385) I need to perform CTAS queries like
CREATE TABLE bar_10385 AS
SELECT * FROM foo WHERE some_id=10385 AND other_id=10385;
What is the best way to perform this bunch of queries?

You can store all these tables as partitions of a single partitioned table. This approach will allow you to load all the data in a single query, and query performance will not be compromised.
Create table T (
... --columns here
)
partitioned by (id int); --new calculated partition key
Load data using one query; it will read the source table only once:
insert overwrite table T partition(id)
select ..., --columns
case when some_id=10385 AND other_id=10385 then 10385
when some_id=10386 AND other_id=10386 then 10386
...
--and so on
else 0 --default partition for records not attributed
end as id --partition column
from foo
where some_id in (10385,10386) AND other_id in (10385,10386) --filter
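Depending on your Hive version and configuration, this kind of fully dynamic partition insert may first require dynamic partitioning to be enabled (the nonstrict mode setting also appears in the last answer on this page):
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;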
Then you can use this table in queries, specifying the partition:
select * from T where id = 10385; --you can create a view named bar_10385 that acts the same as your table. Partition pruning works fast
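For example, a minimal sketch of such a view over the partitioned table T:
CREATE VIEW bar_10385 AS
SELECT * FROM T WHERE id = 10385;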

Related

UPDATE two columns with new values in a large table

We have a table like:
mytable (pid, string_value, int_value)
This table has more than 20M rows in total. We now have a feature that needs to mark all the rows in this table as invalid, so we need to update the columns to string_value = NULL and int_value = 0, which indicates an invalid row (we still want to keep the pid, as it is important to us).
So what is the best way?
I use the following SQL:
UPDATE Mytable
SET string_value = NULL,
int_value = 0;
but this query takes more than 4 minutes in my test environment. Is there any better way to improve it?
Updating all the rows can be quite expensive. Often, it is faster to empty the table and reload it.
In generic SQL this looks like:
create table mytable_temp as
select pid
from mytable;
truncate table mytable; -- back it up first!
insert into mytable (pid, string_value, int_value)
select pid, null, 0
from mytable_temp;
The creation of the temporary table may use different syntax, depending on your database.
Updates can take time to complete. Another way of achieving this is to follow these steps (a sketch follows the list):
Add new columns with the values you need set as the default value.
Drop the original columns.
Rename the new columns to the names of the original columns.
You can then drop the default values on the new columns.
This needs to be tested, as different DBMSs allow different levels of table alteration (i.e. not all DBMSs allow dropping a default or dropping a column).
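A minimal sketch of that column-swap approach in generic SQL; the varchar(255) type is a placeholder and the exact ADD/DROP/RENAME COLUMN syntax varies by DBMS:
-- add replacement columns; whether existing rows are backfilled with the default
-- automatically depends on the DBMS (some need NOT NULL or an explicit backfill)
alter table mytable add column string_value_new varchar(255) default null;
alter table mytable add column int_value_new int default 0 not null;
-- drop the original columns
alter table mytable drop column string_value;
alter table mytable drop column int_value;
-- rename the new columns to the original names
alter table mytable rename column string_value_new to string_value;
alter table mytable rename column int_value_new to int_value;
-- optionally drop the defaults afterwards, where the DBMS supports it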

Merge update records in a final table

I have a user table in Hive of the form:
User:
Id String,
Name String,
Col1 String,
UpdateTimestamp Timestamp
I'm inserting data in this table from a file which has the following format:
I/U,Timestamp when record was written to file, Id, Name, Col1, UpdateTimestamp
e.g. for inserting a user with Id 1:
I,2019-08-21 14:18:41.002947,1,Bob,stuff,123456
and updating col1 for the same user with Id 1:
U,2019-08-21 14:18:45.000000,1,,updatedstuff,123457
The columns which are not updated are returned as null.
Now, simple insertion is easy in Hive using LOAD DATA INPATH into a staging table and then ignoring the first two fields from the staging table.
However, how would I go about the update statements, so that my final row in Hive looks like this:
1,Bob,updatedstuff,123457
I was thinking to insert all rows in a staging table and then perform some sort of merge query. Any ideas?
Typically with a merge statement your "file" would still be unique on ID and the merge statement would determine whether it needs to insert this as a new record, or update values from that record.
However, if the file is non-negotiable and will always have the I/U format, you could break the process up into two steps, the insert, then the updates, as you suggested.
In order to perform updates in Hive, you will need the users table to be stored as ORC and have ACID enabled on your cluster. For my example, I would create the users table with a cluster key, and the transactional table property:
create table test.orc_acid_example_users
(
id int
,name string
,col1 string
,updatetimestamp timestamp
)
clustered by (id) into 5 buckets
stored as ORC
tblproperties('transactional'='true');
After your insert statements, your Bob record would say "stuff" in col1.
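For the insert step itself, a hedged sketch, assuming the staging table is named test.orc_acid_example_staging and carries the file's leading type and file-timestamp fields alongside the user columns:
insert into table test.orc_acid_example_users
select s.id, s.name, s.col1, s.updatetimestamp
from test.orc_acid_example_staging s
where s.type = 'I';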
As far as the updates go - you could tackle these with an update or merge statement. I think the key here is the null values: it's important to keep the original name, or col1, or whatever, if the staging table from the file has a null value. Here's a merge example which coalesces the staging table's fields. Basically, if there is a value in the staging table, take that, or else fall back to the original value.
merge into test.orc_acid_example_users as t
using test.orc_acid_example_staging as s
on t.id = s.id
and s.type = 'U'
when matched
then update set name = coalesce(s.name,t.name), col1 = coalesce(s.col1, t.col1)
Now Bob will show "updatedstuff".
Quick disclaimer - if you have more than one update for Bob in the staging table, things will get messy. You will need to have a pre-processing step to get the latest non-null values of all the updates prior to doing the update/merge. Hive isn't really a complete transactional DB - it would be preferred for the source to send full user records any time there's an update, instead of just the changed fields only.
You can reconstruct each record in the table using last_value() with the ignore-nulls option:
select h.id,
coalesce(h.name, last_value(h.name, true) over (partition by h.id order by h.timestamp)) as name,
coalesce(h.col1, last_value(h.col1, true) over (partition by h.id order by h.timestamp)) as col1,
update_timestamp
from history h;
You can use row_number() and a subquery if you want the most recent record.
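If you only want the most recent record per id, a hedged sketch combining the query above with row_number() in a subquery (same hypothetical history table and columns):
select id, name, col1, update_timestamp
from (
  select h.id,
         coalesce(h.name, last_value(h.name, true) over (partition by h.id order by h.timestamp)) as name,
         coalesce(h.col1, last_value(h.col1, true) over (partition by h.id order by h.timestamp)) as col1,
         h.timestamp as update_timestamp,
         row_number() over (partition by h.id order by h.timestamp desc) as rn
  from history h
) t
where rn = 1;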

Hive table creation with a default value

I have a table in RDBMS like so:
create table test (sno number, entry_date date default sysdate).
Now I want to create a table in Hive with a similar structure, adding a default value to a column.
Hive currently doesn't support adding a default value to a column while creating a table.
As a workaround, load the data into a temporary table and use an insert overwrite table statement to add the current date and time into the main table.
Create a temporary table:
create table test (sno int);
Load data into the table:
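A minimal sketch of that load (the HDFS path here is purely hypothetical):
load data inpath '/tmp/test_data' into table test;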
Create final table:
create table final_table (sno int, createDate string);
Finally load the data from temp test table to the final table:
insert overwrite table final_table select sno, FROM_UNIXTIME( UNIX_TIMESTAMP(), 'dd/MM/yyyy' ) from test;
Hive doesn't support DEFAULT fields
That doesn't mean you can't do it, though. It is just a two-step process: create one "staging" table, then insert into a second table, selecting that "default" value.
Adding a default value to a column while creating table in hive
Since you mention,
I have a table in RDBMS
you could also use your existing table and use Sqoop to import the data into Hive.

Table partitioning with procedure input parameter

I'm trying to partition my table on an ID which I get from a procedure parameter.
For example, my table DDL:
CREATE TABLE bigtable
( ID number )
As the input procedure parameter I get e.g. the number 130, so I'm trying to create a partition:
Alter table bigtable
add partition part_random_number values(random number);
Of course, by a random number I mean e.g. 120, 56, etc. :)
But I got an error that the object is not partitioned. So I tried to first define the partition clause in the create table statement:
CREATE TABLE bigtable
( ID number )
PARTITION BY list (ID)
But it doesn't work. It only works when I define some partition, e.g.:
CREATE TABLE bigtable
( ID number )
PARTITION BY list (ID)
( partition type values (130)
);
But I would like to avoid that... Is there any other solution?
As a result, I would like to have the table partitioned by the procedure input parameters.
A partitioned table has to have at least one partition. Just create it with a dummy partition and add the ones you actually need using your procedure.
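A minimal sketch of that approach for Oracle list partitioning; the dummy partition value, the partition naming scheme, and the procedure name are all illustrative:
CREATE TABLE bigtable
( ID number )
PARTITION BY LIST (ID)
( PARTITION p_dummy VALUES (0) ); -- dummy partition so the table is partitioned from the start

CREATE OR REPLACE PROCEDURE add_id_partition(p_id IN NUMBER) AS
BEGIN
  -- DDL inside PL/SQL has to go through dynamic SQL
  EXECUTE IMMEDIATE
    'ALTER TABLE bigtable ADD PARTITION part_' || p_id ||
    ' VALUES (' || p_id || ')';
END;
/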

How Insert a columns of unpartitoned table into a partitioned table in Hive?

There is a table 'A' which is partitioned. Another table 'B' is not partitioned. How do I insert the values of B into A? Will an error be thrown?
Yes, you can insert from a non-partitioned table to a partitioned table. You will either have to define the partition you want to insert into or have Hive do it dynamically.
For example, to dynamically insert into partitions, you could run something similar to:
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE A PARTITION (partition) SELECT col1, col2, ..., colN, partition FROM B WHERE .... ;
More information regarding Hive partitions with dynamic inserts can be found here: https://cwiki.apache.org/confluence/display/Hive/DynamicPartitions. Take note, the last column in your SELECT is what is used for the partition insert. Another thing to note is that the number of columns needs to match between the two tables, otherwise you will have to fill in NULLs.
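For comparison, a hedged sketch of the static alternative, where you name the target partition explicitly (the partition column name and value are placeholders):
INSERT INTO TABLE A PARTITION (partition_col = 'some_value')
SELECT col1, col2, ..., colN
FROM B
WHERE ... ;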