In Apache Hive, how does the directory structure look after a huge dataset is partitioned and then bucketed?
For example: I have a customer dataset for a country, and the data is partitioned by state and then bucketed by city. How do we know how many files will be present in a city bucket?
A partition is a directory, and each partition corresponds to a specific value of the partitioned column.
Within a table or a partition/directory, buckets are organized as files. The number of buckets is fixed when the table is created with CLUSTERED BY (col) INTO K BUCKETS, and there will be ONE file for each bucket. Hive assigns records to buckets based on the hash of the bucketing column, taken modulo the number of buckets K.
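As a minimal sketch of the layout for the example above (the table name, columns, and state values are illustrative only), a customer table partitioned by state and bucketed by city into 4 buckets could be created like this:
CREATE TABLE customers (
    customer_id INT,
    name STRING,
    city STRING
)
PARTITIONED BY (state STRING)
CLUSTERED BY (city) INTO 4 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
-- Expected warehouse layout: one directory per partition, one file per bucket
-- .../customers/state=KA/000000_0
-- .../customers/state=KA/000001_0
-- .../customers/state=KA/000002_0
-- .../customers/state=KA/000003_0
-- .../customers/state=TN/000000_0
-- ...
So every state directory holds exactly 4 bucket files, and all rows for a given city land in the single file whose index is hash(city) mod 4.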
The maximum number of buckets is 256. For more details, refer to the link below:
What is the difference between partitioning and bucketing a table in Hive?
I'm new to Hive and facing a problem. I'm learning bucketing right now, and my task is to create a Hive table that consists of 2 buckets, then put at least 5 records into that table. Well, that part is clear, I think:
CREATE TABLE <tablename>(id INT,field2 STRING,field3 TINYINT) CLUSTERED BY(id) INTO 2 BUCKETS;
For populating the table I simply used an INSERT INTO ... VALUES (...) statement. What I don't really know is the following - I have to run this query:
SELECT * FROM <tablename> TABLESAMPLE(BUCKET 1 OUT OF 2 ON id)
When I run it, it returns 0 rows, and I don't know why. I tried to look it up on the internet but didn't find an exact answer. If I replace id with another field in the table, it returns the rows in that bucket. So can someone explain it, please?
Here are some tips for creating and inserting into bucketed tables.
Bucketing is an approach for improving Hive query performance.
Bucketing stores data in separate files, not separate subdirectories like partitioning.
It divides the data in an effectively random way, not in a predictable way like partitioning.
When records are inserted into a bucketed table, Hive computes hash codes of the values in the specified bucketing column and uses these hash codes to divide the records into buckets.
For this reason, bucketing is sometimes called hash partitioning.
The goal of bucketing is to distribute records evenly across a predefined number of buckets.
Bucketing can improve the performance of joins if all the joined tables are bucketed on the join key column (a join sketch follows the example below).
For more on bucketing, see the Hive Language Manual page describing bucketed tables, at BucketedTables.
As an example of bucketing:
Let us see how we can create Bucketed Tables in Hive.
Bucketed tables are essentially hash partitioning as found in conventional databases.
We need to specify the CLUSTERED BY clause as well as INTO n BUCKETS to create a bucketed table.
CREATE TABLE orders_buck (
order_id INT,
order_date STRING,
order_customer_id INT,
order_status STRING
) CLUSTERED BY (order_id) INTO 8 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
DESCRIBE FORMATTED orders_buck;
Let us see how we can add data to bucketed tables.
Typically we use the INSERT command to get data into bucketed tables, as source data might not match the criteria of our bucketed table.
If the data is in files, we first need to load it into a staging table and then insert it into the bucketed table.
We already have data in the orders table; let us use it to insert data into our bucketed table orders_buck.
hive.enforce.bucketing should be set to true.
Here is an example of inserting data into a bucketed table from a regular managed or external table.
SET hive.enforce.bucketing;
SET hive.enforce.bucketing=true;
INSERT INTO orders_buck
SELECT * FROM orders;
-- check in the directory of the bucketed table whether the
-- number of files is equal to the number of buckets
dfs -ls /user/hive/warehouse/training_retail.db/orders_buck;
SELECT * FROM orders_buck TABLESAMPLE(BUCKET 1 OUT OF 2 ON order_id);
-- In my case this query works perfectly well
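As a rough sketch of the join case mentioned in the tips above, assuming a second table order_items_buck that is bucketed on order_id as well (that table name is an assumption, not part of the example), a bucket map join can be enabled like this:
SET hive.optimize.bucketmapjoin=true;
SELECT /*+ MAPJOIN(oi) */ o.order_id, o.order_status
FROM orders_buck o
JOIN order_items_buck oi ON o.order_id = oi.order_id;
This only helps if both tables are bucketed on the join key and the bucket counts are compatible (one a multiple of the other).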
I have a table created in Hive, stored in an S3 location.
It has about 10 columns and is partitioned on 3 columns: month, year and city, in that order.
I am running a Spark job that creates a DataFrame (2 billion rows) and writes it into this table.
val partitions:Seq[Column] = Seq(col("month"),col("year"),col("city"))
df.repartition(partitions: _*).selectExpr(cs.map(_.name): _*).write.mode("overwrite").insertInto(s"$tableName")
selectExpr(cs.map(_.name): _*) reorders the columns in the DataFrame to align with the ordering in the table.
When I run the above command to insert into the table, I see that many staging files and multiple small files are created under each city.
s3://s3Root/tableName/month/year/city/file1.csv
file2.csv
...
file200.csv
I am hoping to get a single file under each city per year per month, i.e. to coalesce per partition.
Expected:
s3://s3Root/tableName/month/year/city/file.csv
Any help is appreciated.
To achieve one file per partition you should use
.partitionBy("month", "year", "city")
val partitions: Seq[Column] = Seq(col("month"), col("year"), col("city"))
df.repartition(partitions: _*).selectExpr(cs.map(_.name): _*).write.partitionBy("month", "year", "city").mode("overwrite").insertInto(s"$tableName")
I think you could avoid the repartition beforehand; if you do only the partitionBy, files will be written one per partition.
I have a question about the number of partitions of a Spark DataFrame.
Say I have a Hive table (employee) which has the columns (name, age, id, location):
CREATE TABLE employee (name String, age String, id Int) PARTITIONED BY (location String);
If the employee table has 10 different locations, the data will be partitioned into 10 partitions in HDFS.
If I create a Spark DataFrame (df) by reading the whole data of the Hive table (employee),
how many partitions will be created by Spark for the DataFrame (df)?
df.rdd.partitions.size = ??
Partitions are created depending on the block size of HDFS.
Imagine you have read the 10 partitions as a single RDD. If the block size is 128 MB, then:
no. of partitions = (size of the 10 partitions in MB) / 128 MB
because that is how the data is laid out in blocks on HDFS.
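For example (sizes assumed purely for illustration): if the 10 location partitions together hold about 2,560 MB of data and the HDFS block size is 128 MB, then df.rdd.partitions.size would come out to roughly 2560 / 128 = 20.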
Please refer to the following link:
http://www.bigsynapse.com/spark-input-output
I have a Hive table that is partitioned by day (e.g. 20151001, 20151002, ...).
Is there a Hive query to list these partitions in a way that can be used in a nested subquery?
That is, can I do something along the lines of:
SELECT * FROM (SHOW PARTITIONS test) a where ...
The query
SELECT ptn FROM test
returns as many rows as there are rows in the test table. I want it to return only as many rows as there are partitions (without using the DISTINCT function).
A potential solution is to find the partitions from the HDFS location for the table of interest, using either a shell script or Python.
The data that corresponds to the Hive table is stored in HDFS, e.g.
/hive/database/table/partition/datafiles
in your case,
/hive/database/table/20151001/datafiles
If the table is bucketed, there are as many individual files as the number of buckets.
Once you have the above path, create a shell script to loop through the folders (in this case 20151001, etc.), capture each one in a shell variable, and pass it as a parameter to the Hive query.
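As a rough sketch (the path follows the example above; adapt it to your table's actual location), you can list the partition directories from the Hive CLI itself and feed each name to your query from the script:
dfs -ls /hive/database/table/;
-- each entry, e.g. /hive/database/table/20151001, is one partition;
-- capture those names in your script and substitute each one into the Hive query
-- as a parameter (for example via hive -e or --hivevar substitution)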
I have created a Hive table with gender as the bucketing column.
create table userinfoBucketed(userid INT,age INT,gender STRING,occupation STRING,zipcode STRING) CLUSTERED BY(gender) INTO 2 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE;
I am loading the following data from a text file into the table
(user id | age | gender | occupation | zip code):
1|24|M|technician|85711
2|53|F|other|94043
3|23|M|writer|32067
4|24|M|technician|43537
5|33|F|other|15213
6|42|M|executive|98101
7|57|M|administrator|91344
8|36|M|administrator|05201
9|29|M|student|01002
10|53|M|lawyer|90703
I have set the hive.enforce.bucketing property to true:
set hive.enforce.bucketing=true;
1. When I inserted data into the table using the LOAD command, buckets were not created; all the data was stored in one bucket:
load data local inpath '/home/mainnode/u.user' into table userinfobucketed;
Question 1: Why is the data not split into 2 buckets?
2. When I inserted data into the table from another table, the data was stored in 2 buckets. Here is the command I executed:
insert into table userinfobucketed select * from userinfo where gender='M';
Now bucket 1 (000000_0) has the data below:
1|24|M|technician|85711
4|24|M|technician|43537
6|42|M|executive|98101
7|57|M|administrator|91344
Bucket 2 (000001_0) has the data below:
3|23|M|writer|32067
Question 2: I do not understand why the data got stored into 2 buckets even though all of these records have the same gender.
Then I inserted data into the table again using the command below.
insert into table userinfobucketed select * from userinfo where gender='F';
Now 2 extra bucket files (000000_0_copy_1, 000001_0_copy_1) were created, and the data was stored in those instead of being inserted into the existing buckets. That makes 4 bucket files in total, even though the table was created with 2 buckets.
Question 3: Why were the extra buckets created instead of the data being copied into the existing buckets?
Please clarify.
Thanks,
Sean
Q1: Why doesn't this work to insert into a bucketed table?
load data local inpath '/home/mainnode/u.user' into table userinfobucketed;
A1: Take a look at this tutorial for inserting into bucketed tables. Hive does not support loading into bucketed tables directly from a flat file using LOAD DATA INPATH, so you have to LOAD the data into a regular table first and then INSERT OVERWRITE it into your bucketed table.
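A minimal sketch of that two-step approach, reusing the bucketed table from the question (the staging table name userinfostage is an assumption):
CREATE TABLE userinfostage (userid INT, age INT, gender STRING, occupation STRING, zipcode STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE;
LOAD DATA LOCAL INPATH '/home/mainnode/u.user' INTO TABLE userinfostage;
SET hive.enforce.bucketing=true;
INSERT OVERWRITE TABLE userinfobucketed SELECT * FROM userinfostage;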
Q2: Why was the inserted data split into 2 buckets even though all records had the same value for the bucket column?
A2: Hmm. This is abnormal behavior. You should never see records with the same bucket column value getting hashed into different buckets. I suspect you did not drop the table and recreate it after trying the LOAD DATA INPATH method above in Q1. If that were the case, new buckets would be created on the insert, disregarding what's in the existing buckets, which leads us to the next question...
Q3: Why were extra buckets created instead of inserting into existing buckets?
A3: Hive does not append new data to files on insert. Even though you told Hive that your table is bucketed, it only hashes the data you are currently inserting; it does not take into account the data already in the table.
To maintain the number of buckets set in the table definition, you will have to hash all the data together every time you do an insert, and use INSERT OVERWRITE instead of INSERT INTO to overwrite the table.
Generally this is much easier to do if your table is partitioned, so you're not copying and re-hashing your whole table every time you have to do an insert. And speaking of partitioning: since it has such low cardinality, gender is much better suited as a partition column than a bucket column. This article does a pretty good job of explaining this concept.
Bucketing is driven by the hash of the column. Apparently M and F are ending up in the same bucket here. You might consider making gender part of the partitioning key to ensure the two values end up in different physical files.
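A rough sketch of that alternative (the new table name is hypothetical, and userid is picked arbitrarily as the bucketing column):
CREATE TABLE userinfopartitioned (userid INT, age INT, occupation STRING, zipcode STRING)
PARTITIONED BY (gender STRING)
CLUSTERED BY (userid) INTO 2 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;
SET hive.enforce.bucketing=true;
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
-- the dynamic partition column (gender) must come last in the SELECT
INSERT OVERWRITE TABLE userinfopartitioned PARTITION (gender)
SELECT userid, age, occupation, zipcode, gender FROM userinfo;
With this layout each gender gets its own directory, so the hash behavior of the bucketing column no longer determines which physical file M and F end up in.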