I understand that creating ORC tables will improve query speed dramatically. However, can we improve it further by partitioning and bucketing an ORC table? If so, how do I partition and bucket an existing ORC table?
You can bucket and partition an ORC table.
Partitions map directly to directories in HDFS. You can use ALTER TABLE to add partitions; you'd have to do partition recovery afterwards, though.
Everything is well explained here: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-AlterPartition.
Personally, I'd create a new table with dynamic partitioning and copy the data into it.
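A minimal sketch of that approach (table and column names here are made up for illustration; adjust to your own schema):

-- Hypothetical new table: partitioned by year/month, bucketed by id, stored as ORC.
CREATE TABLE sales_orc_new (
  id INT,
  amount DOUBLE,
  sale_date STRING
)
PARTITIONED BY (yr INT, mon INT)
CLUSTERED BY (id) INTO 8 BUCKETS
STORED AS ORC;

-- Enable dynamic partitioning for the copy.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.enforce.bucketing=true;

-- Partition columns must be the last columns in the SELECT.
INSERT OVERWRITE TABLE sales_orc_new PARTITION (yr, mon)
SELECT id, amount, sale_date, year(sale_date) AS yr, month(sale_date) AS mon
FROM sales_orc_old;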
Partitioning and bucketing are features offered to help improve query performance. In Hive, as explained by Karol, partitioning is mapped to an HDFS directory structure, and how you partition is driven entirely by your query needs and patterns. For example:
The customer_purchases table stores all the transactions over the past 2-3 years (around 1-2 PB of data). An analyst is trying to answer "How many sales happened during the first quarter of 2017, month by month?".
WITHOUT PARTITION
customer_purchases table schema
transaction_id | cust_id | price_per_unit | units_purchased | invoiceDate
Sample dataset
1,CustomerId-32,3.24,91,2017-10-19
2,CustomerId-16,3.24,88,2017-10-14
3,CustomerId-3,1.96,99,2017-10-14
4,CustomerId-95,1.96,38,2017-10-17
5,CustomerId-51,1.32,39,2017-10-18
6,CustomerId-29,1.32,14,2017-10-14
7,CustomerId-15,3.88,66,2017-10-19
8,CustomerId-74,1.32,44,2017-10-17
9,CustomerId-43,3.88,22,2017-10-18
Stored as csvs in hdfs://your-nn/your-path/data*.csv
SELECT month(invoiceDate), count(*) FROM customer_purchases WHERE
YEAR(invoiceDate) = '2017' AND MONTH(invoiceDate) BETWEEN 1 AND 3
GROUP BY MONTH(invoiceDate)
The above statement does a full table scan to perform the filter (WHERE) and aggregation (GROUP BY). This is inefficient, as we only need a fraction of the dataset.
WITH PARTITION
Since the query filters on a date range, a time-based partitioning scheme makes sense. To avoid the full-table scan, we could partition by year and month. Following are the changes:
customer_purchases table schema (partition columns 'yr' and 'mon')
transaction_id | cust_id | price_per_unit | units_purchased | invoiceDate | yr | mon
The same data is stored in HDFS as hdfs://your-nn/your-path/yr=<yyyy>/mon=<mm>/data*.csv, where <yyyy> is the four-digit year and <mm> is any value between 1 and 12 (Jan through Dec).
With the new HDFS structure and Hive table schema, the query would be:
SELECT mon, count(*) FROM customer_purchases WHERE yr='2017' AND mon
BETWEEN 1 AND 3 GROUP BY mon
The explain plan for the above query now shows that only files under the yr=2017 directory and its mon=1, mon=2 and mon=3 subdirectories are scanned. This is a much smaller dataset, so the results are returned faster.
With the ORC file format, nothing changes except that the files in the HDFS location are .orc instead of .csv.
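For illustration, the partitioned ORC version of the table could be declared roughly like this (the name customer_purchases_part is my own sketch, not from the original example):

CREATE TABLE customer_purchases_part (
  transaction_id INT,
  cust_id STRING,
  price_per_unit DOUBLE,
  units_purchased INT,
  invoiceDate STRING
)
PARTITIONED BY (yr INT, mon INT)
STORED AS ORC;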
BUCKETING additionally groups the transactions into a fixed number of files, based on a hash of the bucketing column.
Does that answer your question?
DYNAMIC PARTITIONING helps by performing the partitioning automatically, based on the input transaction data, when inserting into a table.
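A sketch of how a dynamic-partition copy from the original unpartitioned customer_purchases table into the partitioned table above might look:

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- yr and mon are derived from invoiceDate and listed last, so Hive
-- routes each row into the right partition automatically.
INSERT OVERWRITE TABLE customer_purchases_part PARTITION (yr, mon)
SELECT transaction_id, cust_id, price_per_unit, units_purchased, invoiceDate,
       year(invoiceDate) AS yr, month(invoiceDate) AS mon
FROM customer_purchases;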
I have a number of Hive warehouses. The data resides in Parquet files in Amazon S3. Some of the tables contain terabytes of data. Currently, in Hive, most tables are partitioned by a combination of month and year, both of which are saved mainly as strings. Other fields are bigint, int, float, double, string, or Unix timestamps. Our goal is to migrate them to Apache Iceberg tables. The challenge is how to choose the partition keys.
I have already calculated the cardinality of each field in each table by:
SELECT COUNT(DISTINCT my_column) AS my_column_count
FROM my_table;
I have also calculated the percentage of null values for each field:
SELECT 100.0 * COUNT(*) / number_of_all_records
FROM my_db.my_table
WHERE my_column IS NULL;
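For reference, both numbers can also be computed in a single scan per column; a sketch using the same placeholder names:

-- Cardinality and null percentage of my_column in one query.
SELECT COUNT(DISTINCT my_column) AS my_column_count,
       100.0 * SUM(CASE WHEN my_column IS NULL THEN 1 ELSE 0 END) / COUNT(*) AS my_column_null_pct
FROM my_db.my_table;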
In short I already know three things for each field:
Data type
Cardinality
Percentage of null values
Knowing these three pieces of information, my question is: how should I choose the best column or combination of columns as partition keys for my future Iceberg tables? Are there any rules of thumb?
How many partitions are considered optimal when choosing partition keys? What data types are best for partition keys? What other factors need to be considered? Is bucketing the same in Iceberg tables as it is in Hive, and how can it be leveraged by the partition keys? Is it better to have many small partitions or a few big partitions? Are there any other aspects of partition keys that need to be considered?
One crucial part is missing from your description: the queries. Understanding, to the best of your ability, which queries will run on this data is super important.
For example, consider a simple table with Date, Id, Name, and Age as columns.
If the queries are date-based, meaning they query the data in the context of dates, for example:
select * from table where date > 'some-date'
then it's a good idea to partition by date.
However, if the queries are age-related, for example:
select * from table where age between 20 and 30
then you should consider partitioning by age or age groups.
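For example, with Iceberg you can partition on a transform of an existing column instead of maintaining separate year/month string columns; a sketch in Spark SQL (catalog, table, and column names are illustrative):

-- Hidden partitioning on a monthly transform of the timestamp column.
CREATE TABLE my_catalog.db.events (
  id BIGINT,
  name STRING,
  age INT,
  ts TIMESTAMP
)
USING iceberg
PARTITIONED BY (months(ts));  -- other transforms: years(ts), days(ts), bucket(16, id), truncate(10, name)

-- Queries that filter on ts prune partitions automatically.
SELECT * FROM my_catalog.db.events WHERE ts >= TIMESTAMP '2024-01-01';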
I'm new to Hive and facing a problem. I'm learning bucketing right now, and my task is to create a Hive table that consists of 2 buckets, then put at least 5 records into that table. Well, that part is clear, I think:
CREATE TABLE <tablename>(id INT,field2 STRING,field3 TINYINT) CLUSTERED BY(id) INTO 2 BUCKETS;
For populating the table I simply used an INSERT INTO ... VALUES (...) statement. What I don't really know is the following: I have to run this query:
SELECT * FROM <tablename> TABLESAMPLE(BUCKET 1 OUT OF 2 ON id)
When I run it, it returns 0 rows and I don't know why. I tried to look it up on the internet but didn't find an exact answer. If I replace id with another field in the table, it returns the rows in that bucket. So can someone explain it, please?
Here are some tips for creating and inserting into bucketed tables.
Bucketing is an approach for improving Hive query performance.
Bucketing stores data in separate files, not separate subdirectories like partitioning.
It divides the data in an effectively random way, not in a predictable way like partitioning.
When records are inserted into a bucketed table, Hive computes hash codes of the values in the specified bucketing column and uses these hash codes to divide the records into buckets.
For this reason, bucketing is sometimes called hash partitioning.
The goal of bucketing is to distribute records evenly across a predefined number of buckets.
Bucketing can improve the performance of joins if all the joined tables are bucketed on the join key column.
For more on bucketing, see the page of the Hive Language Manual describing bucketed tables, at BucketedTables
As an example of bucketing:
Let us see how we can create Bucketed Tables in Hive.
Bucketed tables are nothing but hash partitioning in conventional databases.
We need to specify the CLUSTERED BY clause as well as INTO n BUCKETS to create a bucketed table.
CREATE TABLE orders_buck (
order_id INT,
order_date STRING,
order_customer_id INT,
order_status STRING
) CLUSTERED BY (order_id) INTO 8 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
DESCRIBE FORMATTED orders_buck;
Let us see how we can add data to bucketed tables.
Typically we use the INSERT command to get data into bucketed tables, as the source data might not match the criteria of our bucketed table.
If the data is in files, first we need to load it into a staging table and then insert it into the bucketed table.
We already have data in the orders table; let us use it to insert data into our bucketed table orders_buck.
hive.enforce.bucketing should be set to true.
Here is an example of inserting data into a bucketed table from a regular managed or external table.
SET hive.enforce.bucketing;
SET hive.enforce.bucketing=true;
INSERT INTO orders_buck
SELECT * FROM orders;
-- Check the directory of the bucketed table to see whether the
-- number of files equals the number of buckets.
dfs -ls /user/hive/warehouse/training_retail.db/orders_buck;
SELECT * FROM orders_buck TABLESAMPLE(BUCKET 1 OUT OF 2 ON order_id);
-- In my case this query works perfectly well
I have a BigQuery table with daily partitions.
Now the problem is that in one of the partitions, i.e. the last partition of the month (for example, 2019-12-31), I have some data that should belong to the next partition, i.e. 2020-01-01.
I want to know if it is possible to take that data out of my 2019-12-31 partition and put it in the next partition, 2020-01-01, using BigQuery SQL, or do I have to create a Beam job for it?
Yes, using DML. An UPDATE statement can move rows from one partition to another.
Updating data in a partitioned table using DML is the same as updating data from a non-partitioned table.
For example, the following UPDATE statement moves rows from one partition to another. Rows in the May 1, 2017 partition (“2017-05-01”) of mycolumntable where field1 is equal to 21 are moved to the June 1, 2017 partition (“2017-06-01”).
UPDATE
project_id.dataset.mycolumntable
SET
ts = "2017-06-01"
WHERE
DATE(ts) = "2017-05-01"
AND field1 = 21
I have an external Hive table which is partitioned on load_date (DD-MM-YYYY). However, the very first period, let's say 01-01-2000, has all the data from 1980 till 2000. How can I further create partitions on year for the previous data while keeping the existing data (data for load dates greater than 01-01-2000) still available?
First load the data of the '01-01-2000' partition into a staging table, then create a dynamically partitioned table and insert that data into it partitioned by year. This might solve your problem.
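A rough sketch of that idea, with assumed table and column names (the real schema isn't shown in the question):

-- Assumptions: the source table is events, partitioned by load_date, and
-- event_date carries the real date of each record in yyyy-MM-dd format.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

CREATE TABLE events_by_year (
  id INT,
  value STRING
)
PARTITIONED BY (yr INT)
STORED AS ORC;

-- Re-partition only the overloaded '01-01-2000' partition by year.
INSERT OVERWRITE TABLE events_by_year PARTITION (yr)
SELECT id, value, year(event_date) AS yr
FROM events
WHERE load_date = '01-01-2000';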
My question is about table partitioning in SQL Server 2008.
I have a program that loads data into a table every 10 mins or so. Approx 40 million rows per day.
The data is bcp'ed into the table and needs to load very quickly.
I would like to partition this table based on the date the data is inserted into the table. Each partition would contain the data loaded in one particular day.
The table should hold the last 50 days of data, so every night I need to drop any partitions older than 50 days.
I would like to have a process that aggregates data loaded into the current partition every hour into some aggregation tables. The summary will only ever run on the latest partition (since all other partitions will already be summarised) so it is important it is partitioned on insert_date.
Generally when querying the data, the insert date is specified (or multiple insert dates). The detailed data is queried by drilling down from the summarised data and as this is summarised based on insert date, the insert date is always specified when querying the detailed data in the partitioned table.
Can I create a column "Insert_date" in the table with a default value of GETDATE() and then partition on it somehow?
OR
I can create a column "insert_date" in the table and populate it with a hard-coded value of today's date.
What would the partition function look like?
Would separate tables and a partitioned view be better suited?
I have tried both, and even though I think partitioned tables are cooler, after trying to teach others how to maintain the code it just wasn't justified. In that scenario we used a hard-coded date field that was set in the insert statement.
Now I use separate tables (31 days / 31 tables) plus an aggregation table, and there is an ugly UNION ALL query that joins together the monthly data.
Advantage: super simple SQL and simple C# code for bcp, and nobody has complained about the complexity.
But if you have the infrastructure and a gaggle of .NET/SQL gurus, I would choose the partitioning strategy.
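To the original question of what the partition function would look like: here is a minimal T-SQL sketch of a date-based sliding window, with illustrative names and everything mapped to the PRIMARY filegroup (a production setup would usually spread partitions across filegroups):

-- One boundary per day; RANGE RIGHT means each boundary date starts its own partition.
CREATE PARTITION FUNCTION pf_insert_date (date)
AS RANGE RIGHT FOR VALUES ('2024-01-01', '2024-01-02', '2024-01-03');

CREATE PARTITION SCHEME ps_insert_date
AS PARTITION pf_insert_date ALL TO ([PRIMARY]);

-- Table partitioned on a defaulted insert_date column.
CREATE TABLE dbo.staging_data (
    id          BIGINT IDENTITY(1,1) NOT NULL,
    payload     VARCHAR(200) NOT NULL,
    insert_date DATE NOT NULL CONSTRAINT df_insert_date DEFAULT (CAST(GETDATE() AS DATE))
) ON ps_insert_date (insert_date);

-- Nightly maintenance: add tomorrow's boundary and remove the oldest one.
-- (To drop old data, SWITCH the oldest partition out to an archive table first.)
ALTER PARTITION SCHEME ps_insert_date NEXT USED [PRIMARY];
ALTER PARTITION FUNCTION pf_insert_date() SPLIT RANGE ('2024-01-04');
ALTER PARTITION FUNCTION pf_insert_date() MERGE RANGE ('2024-01-01');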