Apache Hive query on a bucketed column

I have created a bucketed table on the timeslot column, which has values from 0 to 23; the datatype of the timeslot column is int.
I created 24 buckets and loaded 10,000,000 rows (6 GB of data) into the bucketed table.
At the same time I created a normal, non-bucketed table from the same dataset.
Later I queried both the bucketed table and the non-bucketed table as below:
select * from bucketed_table where timeslot = 15;
select * from non_bucketed_table where timeslot = 15;
Both queries take almost the same time.
I was assuming the bucketed table would perform far better than the non-bucketed table.
Can anyone let me know if I am doing something wrong, or is my assumption completely wrong?

As per my understanding, a bucketed table only performs better when it is joined with another bucketed table. If you just filter on the bucketed column there is no performance gain, because in that case both the bucketed and the non-bucketed table scan the whole table (all data files), which is why the same number of mappers is launched in both cases.
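As a rough sketch of where bucketing does help: a join between two tables bucketed (and ideally sorted) on the join key, with the bucket map join optimization enabled. The table and column names below are made up for illustration:
-- hypothetical tables, both bucketed on customer_id into the same number of buckets
SET hive.optimize.bucketmapjoin = true;
SET hive.optimize.bucketmapjoin.sortedmerge = true; -- only if both tables are also sorted by the key
SELECT o.order_id, c.customer_name
FROM orders_bucketed o
JOIN customers_bucketed c ON o.customer_id = c.customer_id;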

Related

How to choose partition keys for Apache Iceberg tables

I have a number of Hive warehouses. The data resides in Parquet files in Amazon S3. Some of the tables contain terabytes of data. Currently, in Hive, most tables are partitioned by a combination of month and year, both of which are stored mainly as strings. Other fields are bigint, int, float, double, string, or Unix timestamps. Our goal is to migrate them to Apache Iceberg tables. The challenge is how to choose the partition keys.
I have already calculated the cardinality of each field in each table by:
SELECT COUNT(DISTINCT my_column) AS my_column_count
FROM my_table;
I have also calculated the percentage of null values for each field:
SELECT 100.0 * COUNT(*) / number_of_all_records
FROM my_db.my_table
WHERE my_column IS NULL;
In short I already know three things for each field:
Data type
Cardinality
Percentage of null values
Knowing these three pieces of information, my question is: how should I choose the best column or combination of columns as partition keys for my future Iceberg tables? Are there any rules of thumb?
How many partitions are considered optimal when choosing partition keys?
What data type is best for partition keys?
Is bucketing in Iceberg tables the same as in Hive, and how can it be leveraged alongside the partition keys?
Is it better to have many small partitions or a few big partitions?
What other factors or aspects of partition keys need to be considered?
One crucial part is missing from your description: the queries. Understanding, as best you can, which queries will run on this data is super important.
For example, consider a simple table with: Date, Id, Name, Age as columns.
If the queries are date-based, meaning the data will be queried in the context of dates, e.g.
select * from table where date > 'some-date'
then it's a good idea to partition by date.
However, if the queries are age-related, e.g.
select * from table where age between 20 and 30
then you should consider partitioning by age or age groups.
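As a rough sketch of what a date-driven layout could look like for Iceberg in Spark SQL (the catalog, table, and column names are illustrative, and the bucket transform is optional):
-- daily partitions on the event timestamp, with id hashed into 16 buckets within each day
CREATE TABLE my_catalog.db.events (
  event_ts TIMESTAMP,
  id BIGINT,
  name STRING,
  age INT
)
USING iceberg
PARTITIONED BY (days(event_ts), bucket(16, id));
Iceberg's bucket(...) transform plays a role roughly similar to Hive bucketing, except that it is expressed as part of the partition spec rather than as a separate CLUSTERED BY clause.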

Partition Tables - Empty or Doesn't Exist

Recently, I have been working on converting date-suffixed tables into partitioned tables using ingestion time. However, with partitioned tables, how do we know whether a certain date simply contains no data or the table was not created successfully?
Here are more details.
Previously, daily tables were created, and it was OK for some tables to be empty because no results met the criteria. For example,
daily_table_20200601 (100 rows)
daily_table_20200602 (0 rows)
daily_table_20200603 (10 rows)
In this case, I can see that table daily_table_20200602 exists, so I know my scheduled job ran successfully.
When switching to partitioned tables using ingestion time, I am writing into the table daily_table every day, for example,
daily_table$20200601 (100 rows)
daily_table$20200602 (0 rows)
daily_table$20200603 (10 rows)
But how do we know whether the table daily_table$20200602 was created successfully or is just empty?
Also, there is something interesting. I am using the API to check whether a partition exists; see the following code:
dataset_ref = client.dataset('dataset_name')
table_ref = dataset_ref.table("daily_table$20210101")
client.get_table(table_ref)
The result shows the table exists. So are we able to check whether the table for a certain date exists or not?
There's no separate "date table" for every partition, because partitioning doesn't create a separate table per partition; it's similar to relational database partitioning.
The ingestion-time partitioning method adds pseudo columns: _PARTITIONTIME and _PARTITIONDATE for daily partitioning, and _PARTITIONTIME for hourly partitioning. They contain the timestamp of the beginning of the insertion day or hour, and the table is partitioned accordingly.
for this code:
dataset_ref = client.dataset('dataset_name')
table_ref = dataset_ref.table("daily_table$20210101")
client.get_table(table_ref)
This will succeed as long as the partitioned table exists.
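Given the pseudo columns above, you can at least check how many rows landed in a given day's partition. A minimal sketch, assuming standard SQL and placeholder project/dataset names:
-- count the rows that were ingested on a given day
SELECT COUNT(*) AS row_count
FROM `my_project.dataset_name.daily_table`
WHERE _PARTITIONDATE = DATE '2020-06-02';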

Hive tablesampling and bucketing

I'm new to Hive and facing a problem. I'm learning bucketing right now, and my task is to create a Hive table that consists of 2 buckets, then put at least 5 records into that table. Well, that part is clear, I think:
CREATE TABLE <tablename>(id INT,field2 STRING,field3 TINYINT) CLUSTERED BY(id) INTO 2 BUCKETS;
For populating the table I simply used an INSERT INTO ... VALUES (...) statement. What I don't really know is the following - I have to run this query:
SELECT * FROM <tablename> TABLESAMPLE(BUCKET 1 OUT OF 2 ON id)
When I run it, it returns 0 rows and I don't know why. I tried to look it up on the internet but didn't find an exact answer. If I replace id with another field in the table, it returns the rows in that bucket. So can someone explain it, please?
Here are some tips for creating and inserting into bucketed tables.
Bucketing is an approach for improving Hive query performance.
Bucketing stores data in separate files, not separate subdirectories like partitioning.
It divides the data in an effectively random way, not in a predictable way like partitioning.
When records are inserted into a bucketed table, Hive computes hash codes of the values in the specified bucketing column and uses these hash codes to divide the records into buckets.
For this reason, bucketing is sometimes called hash partitioning.
The goal of bucketing is to distribute records evenly across a predefined number of buckets.
Bucketing can improve the performance of joins if all the joined tables are bucketed on the join key column.
For more on bucketing, see the page of the Hive Language Manual describing bucketed tables, at BucketedTables
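As a rough illustration of the hash-then-modulo assignment described above (using Hive's built-in hash() and pmod() functions; the exact bucketing function can vary with the Hive version and the column type):
-- which of 2 buckets would a row with id = 15 roughly land in?
-- pmod() keeps the result non-negative even for negative hash codes
SELECT pmod(hash(15), 2) AS bucket_index;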
As an example of bucketing:
Let us see how we can create Bucketed Tables in Hive.
Bucketed tables are essentially hash partitioning as found in conventional databases.
We need to specify the CLUSTERED BY clause as well as INTO ... BUCKETS to create a bucketed table.
CREATE TABLE orders_buck (
order_id INT,
order_date STRING,
order_customer_id INT,
order_status STRING
) CLUSTERED BY (order_id) INTO 8 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
DESCRIBE FORMATTED orders_buck;
Let us see how we can add data to bucketed tables.
Typically we use the INSERT command to get data into bucketed tables, as the source data might not match the criteria of our bucketed table.
If the data is in files, we first need to load it into a staging table and then insert it into the bucketed table.
We already have data in the orders table; let us use it to insert data into our bucketed table orders_buck.
hive.enforce.bucketing should be set to true.
Here is the example of inserting data into bucketed table from regular managed or external table.
SET hive.enforce.bucketing;
SET hive.enforce.bucketing=true;
INSERT INTO orders_buck
SELECT * FROM orders;
-- check out into the directory of the bucketed table if the
-- number of files is equal to number of buckets
dfs -ls /user/hive/warehouse/training_retail.db/orders_buck;
SELECT * FROM orders_buck TABLESAMPLE(BUCKET 1 OUT OF 2 ON order_id);
-- In my case this query works perfectly well

Difficulties fetching data from a table

We have a table with 627 columns and approximately 850,000 records.
We are trying to retrieve only two columns and dump that data into a new table, but the query is taking forever and we are unable to get the result into the new table.
create table test_sample
as
select roll_no, date_of_birth from sample_1;
We have a unique index on the roll_no column (varchar), and the data type of date_of_birth is date.
Your query has no WHERE clause, so it scans the full table. It reads all the columns of every row into memory to extract the columns it needs to satisfy your query. This will take a long time because your table has 627 columns, and I'll bet some of them are pretty wide.
Additionally, a table with that many columns may give you problems with migrated rows or chaining. The impact of that will depend on the relative position of roll_no and date_of_birth in the table's projection.
In short, a table with 627 columns shows poor (non-existent) data modelling. Which doesn't help you now, it's just a lesson to be learned.
If this is a one-off exercise you'll just need to let the query run. (Although you should check whether it is running at all: can you see active progress in V$SESSION_LONGOPS?)
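As a rough sketch, you could run something like this from another session while the CTAS is executing, to see whether it is making progress:
-- long-running operations still in progress, with an approximate percentage done
SELECT sid, opname, target, sofar, totalwork,
       ROUND(100 * sofar / totalwork, 1) AS pct_done,
       time_remaining
FROM v$session_longops
WHERE totalwork > 0
  AND sofar < totalwork;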

Update a column value for 500 million rows in Interval Partitioned table

We have a table with 10 billion rows. This table is interval partitioned on a date column. In one subpartition we need to update the date for the 500 million rows that match the criteria to a new value. This will definitely affect the creation of new partitions, because the table is partitioned on that same date. Could anyone give me pointers to the best approach to follow?
Thanks in advance!
If you are going to update the partitioning key and the source rows are in a single (sub)partition, then a reasonable approach would be to:
Create a temporary table for the updated rows. If possible, perform the update on the fly:
CREATE TABLE updated_rows
AS
SELECT add_months(partition_key, 1), other_columns...
FROM original_table PARTITION (xxx)
WHERE ...;
Drop original (sub)partition
ALTER TABLE original_table DROP PARTITION xxx;
Reinsert the updated rows back
INSERT /*+append*/ INTO original_table
SELECT * FROM updated_rows;
In case you have issues with CTAS or INSERT INTO SELECT for 500M rows, consider partitioning the temporary table and moving the data in batches.
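For instance, the staging step could be split up roughly like this, moving one month per statement (a sketch only; the column names and date filter are hypothetical, and the PARTITION placeholder follows the example above):
-- create the empty staging table first (structure only)
CREATE TABLE updated_rows AS
SELECT add_months(partition_key, 1) AS partition_key, other_col_1, other_col_2
FROM original_table PARTITION (xxx)
WHERE 1 = 0;
-- then move one month at a time, committing after each batch
INSERT /*+ append */ INTO updated_rows
SELECT add_months(partition_key, 1), other_col_1, other_col_2
FROM original_table PARTITION (xxx)
WHERE partition_key >= DATE '2020-01-01'
  AND partition_key < DATE '2020-02-01';
COMMIT;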
Hmmm... If you have enough space, I would create a "copy" of the source table with the correctly updated rows, then check the results, drop the source table, and finally rename the "copy" to the source name. Yes, this has a long execution time, but it can be a painless way; a parallel hint is needed, of course.
You may consider adding a new flag column, 'updated', that is NULL by default (or 0; I prefer NULL). Using the date criteria you need to update, you can update the data group by group in the same way described by Kombajn; once a group of data is updated, you set the 'updated' flag to 1 for that group.
For example, let's start by making groups of data, and let's say the grouping criterion is the year, so we treat the data year by year.
1. Create a temporary table for the first year:
CREATE TABLE updated_rows
AS
SELECT columns...
FROM original_table PARTITION (2001)
WHERE YEAR = 2001
...;
2. Drop the original (sub)partition:
ALTER TABLE original_table DROP PARTITION 2001;
3. Reinsert the updated rows back:
INSERT /*+append*/ INTO original_table(columns....,updated)
SELECT columns...,1 FROM updated_rows;
Hopefully this helps you treat the data step by step, to avoid waiting for all of the table's data to be updated at once. You may consider a cursor that loops over the years.