How to populate bucketed tables in Hive

I have created a Hive table with gender as the bucketing column:
create table userinfoBucketed(userid INT,age INT,gender STRING,occupation STRING,zipcode STRING) CLUSTERED BY(gender) INTO 2 BUCKETS ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS TEXTFILE;
I am loading the following data from a text file into the table
(user id | age | gender | occupation | zip code):
1|24|M|technician|85711
2|53|F|other|94043
3|23|M|writer|32067
4|24|M|technician|43537
5|33|F|other|15213
6|42|M|executive|98101
7|57|M|administrator|91344
8|36|M|administrator|05201
9|29|M|student|01002
10|53|M|lawyer|90703
I have set the hive.enforce.bucketing property to true;
set hive.enforce.bucketing=true;
1. When I insert data into the table using the LOAD command, buckets are not created; all the data is stored in one bucket.
load data local inpath '/home/mainnode/u.user' into table userinfobucketed;
Question 1: Why is the data not split into 2 buckets?
2. When I insert data into the table from another table, the data is stored in 2 buckets. Here is the command I executed:
insert into table userinfobucketed select * from userinfo where gender='M';
Now bucket 1 (000000_0) has the following data:
1|24|M|technician|85711
4|24|M|technician|43537
6|42|M|executive|98101
7|57|M|administrator|91344
Bucket 2 (000001_0) has the following data:
3|23|M|writer|32067
Question 2: I do not understand why the data got stored in 2 buckets even though all the records have the same gender.
Then I inserted data into the table again using the command below.
insert into table userinfobucketed select * from userinfo where gender='F';
Now 2 extra buckets (000000_0_copy_1, 000001_0_copy_1) have been created and the data is stored in those instead of being inserted into the existing buckets. That brings the total number of buckets to 4, even though the table was created with 2 buckets.
Question 3: Why were the extra buckets created instead of the data being copied into the existing buckets?
please clarify
Thanks
Sean

Q1: Why doesn't this work to insert into a bucketed table?
load data local inpath '/home/mainnode/u.user' into table userinfobucketed;
A1: Take a look at this tutorial for inserting into bucketed tables. Hive does not support loading to bucketed tables directly from a flat file using LOAD DATA INPATH, so you have to LOAD the data into a regular table first then INSERT OVERWRITE into your bucketed table.
Q2: Why was the inserted data split into 2 buckets even though all records had the same value for the bucket column?
A2: Hmm. This is abnormal behavior. You should never see records with the same bucket column value getting hashed into different buckets. I suspect you did not drop the table and recreate it after trying the LOAD DATA INPATH method above in Q1. If that were the case, new buckets would be created on the insert, disregarding what's in the existing buckets, which leads us to the next question...
Q3: Why were extra buckets created instead of inserting into existing buckets?
A3: Hive does not append new data to files on insert. Even though you told Hive that your table is bucketed, it only hashes the data you are currently inserting; it does not take into account the data already in the table.
To maintain the number of buckets set in the table definition, you will have to hash all the data together every time you do an insert, and use INSERT OVERWRITE instead of INSERT INTO to overwrite the table.
Generally this is much easier to do if your table is partitioned, so you're not copying and re-hashing your whole table every time you have to do an insert. And speaking of partitioning, since it is such low cardinality, gender is much better suited as a partition value than a bucket value. This article does a pretty good job at explaining this concept.
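To illustrate the point in A2 that equal values of the bucket column always land in the same bucket, here is a minimal Python sketch. It assumes a Java-style, 31-based string hash and the rule `(hash & Integer.MAX_VALUE) % num_buckets`, which is how older Hive versions are commonly described; newer versions may hash differently, so treat the concrete bucket numbers as illustrative only. The function names are made up for this example.

```python
def legacy_string_hash(text: str) -> int:
    """Java-style 31-based hash over the UTF-8 bytes, kept to 32 bits.

    Assumption: this mirrors the legacy hashing of STRING columns;
    it is not taken from the question or the answer above.
    """
    h = 0
    for byte in text.encode("utf-8"):
        signed = byte - 256 if byte > 127 else byte  # bytes are signed in Java
        h = (31 * h + signed) & 0xFFFFFFFF
    return h


def bucket_for(value: str, num_buckets: int) -> int:
    """Bucket index under the assumed rule (hash & MAX_INT) % num_buckets."""
    return (legacy_string_hash(value) & 0x7FFFFFFF) % num_buckets


# Gender values from the sample data in the question.
genders = ["M", "F", "M", "M", "F", "M", "M", "M", "M", "M"]

# Collect which gender values fall into which bucket.
buckets = {}
for g in genders:
    buckets.setdefault(bucket_for(g, 2), set()).add(g)

for index in sorted(buckets):
    print(f"bucket {index}: {sorted(buckets[index])}")
```

Whatever hash is actually in use, the function is deterministic, so every row with gender 'M' maps to one bucket index. Seeing 'M' rows in two files therefore points to something other than the hash, such as files left over from an earlier load.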

Bucketing is driven by the hash of the column. Apparently M and F are resulting in the same hash. You might consider making the gender part of the partitioning key - to ensure they end up in different physical files.

Related

Copy parquet file content into an SQL temp table and include partition key as column

I have multiple parquet files in S3 that are partitioned by date, like so:
s3://mybucket/myfolder/date=2022-01-01/file.parquet
s3://mybucket/myfolder/date=2022-01-02/file.parquet
and so on.
All of the files follow the same schema, except for a few, which is why I am using FILLRECORD (to fill in NULL values when a column is not present). Now I want to load the content of all these files into an SQL temp table in Redshift, like so:
DROP TABLE IF EXISTS table;
CREATE TEMP TABLE table
(
var1 bigint,
var2 bigint,
date timestamp
);
COPY table
FROM 's3://mybucket/myfolder/'
access_key_id 'id'
secret_access_key 'key'
PARQUET FILLRECORD;
The problem is that the date column is not a column in the parquet files, which is why the date column in the resulting table is NULL. I am trying to find a way to insert the date from the partition key into the temp table.
Is there any way to do this?
I believe there are only 2 approaches to this:
1. Perform N COPY commands, one per S3 partition value, and populate the date column with the partition key value as a literal. A simple script can issue the SQL to Redshift. The drawback is that you are issuing many COPY commands, and if each partition in S3 has only 1 parquet file (or a few files) this will not take advantage of Redshift's parallelism.
2. Define the region of S3 with the partitioned parquet files as a Redshift partitioned external table and then INSERT INTO (SELECT * from );. The external table knows about the partition key and can insert this information into the local table. The downside is that you need to define the external schema and table, and if this is a one-time process, you will want to tear these down afterwards.
There are some other ways to attack this, but they are either not worth the effort or will be very slow.
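The "simple script" from the first approach could look roughly like the Python sketch below. It only builds the SQL text; sending it to Redshift is left out. Several things here are assumptions rather than part of the answer: the staging-table route for attaching the date literal (COPY into a staging table that has only var1 and var2, then INSERT with the literal), the table names `my_stage_table` and `my_temp_table`, and the credential placeholders.

```python
from datetime import date, timedelta


def per_partition_load(target, staging, base_uri, auth_clause, first_day, num_days):
    """Build the SQL for loading one S3 date partition at a time.

    Assumes the staging table has only var1 and var2, and the target
    table has var1, var2 and a timestamp column for the date.
    """
    statements = []
    for offset in range(num_days):
        day = (first_day + timedelta(days=offset)).isoformat()
        statements += [
            # Load only the files under this partition prefix.
            f"COPY {staging} FROM '{base_uri}date={day}/' {auth_clause} PARQUET FILLRECORD;",
            # Attach the partition value as a literal while moving the rows.
            f"INSERT INTO {target} SELECT var1, var2, CAST('{day}' AS timestamp) FROM {staging};",
            # Empty the staging table before the next partition.
            f"TRUNCATE {staging};",
        ]
    return statements


statements = per_partition_load(
    target="my_temp_table",      # hypothetical name
    staging="my_stage_table",    # hypothetical name
    base_uri="s3://mybucket/myfolder/",
    auth_clause="access_key_id '<id>' secret_access_key '<key>'",
    first_day=date(2022, 1, 1),
    num_days=2,
)
for statement in statements:
    print(statement)
```

As noted above, this issues one COPY per partition, so it trades Redshift's parallel load for simplicity.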

Hive tablesampling and bucketing

I'm new to Hive and facing some problem. I'm learning bucketing right now and my task is to create a Hive table that consists of 2 buckets, then put at least 5 records into that table. Well, that part is clear I think:
CREATE TABLE <tablename>(id INT,field2 STRING,field3 TINYINT) CLUSTERED BY(id) INTO 2 BUCKETS;
For populating the table I simply used insert into values(...) statement. What I don't really know is the following - I have to run this query:
SELECT * FROM <tablename> TABLESAMPLE(BUCKET 1 OUT OF 2 ON id)
When I run it, it returns 0 rows and I don't know why. I tried to look it up on the internet but didn't find an exact answer. If I replace id with another field in the table, it returns the rows in that bucket. Can someone explain this, please?
Here are some tips for creating and inserting into bucketed tables.
Bucketing is an approach for improving Hive query performance.
Bucketing stores data in separate files, not separate subdirectories like partitioning.
It divides the data in an effectively random way, not in a predictable way like partitioning.
When records are inserted into a bucketed table, Hive computes hash codes of the values in the specified bucketing column and uses these hash codes to divide the records into buckets.
For this reason, bucketing is sometimes called hash partitioning.
The goal of bucketing is to distribute records evenly across a predefined number of buckets.
Bucketing can improve the performance of joins if all the joined tables are bucketed on the join key column.
For more on bucketing, see the page of the Hive Language Manual describing bucketed tables, at BucketedTables
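The hashing step described above can be sketched in a few lines of Python. This assumes the legacy behaviour in which an INT hashes to its own value, the bucket is `(hash & Integer.MAX_VALUE) % num_buckets`, and `TABLESAMPLE(BUCKET x OUT OF y ON col)` keeps the rows whose bucket index equals x - 1. Newer Hive versions may use a different hash, so this is an illustration, not a description of every installation. The sample ids are made up.

```python
def bucket_index(int_value: int, num_buckets: int) -> int:
    """Bucket index for an INT column under the assumed legacy rule."""
    return (int_value & 0x7FFFFFFF) % num_buckets


def tablesample(rows, x, y, key):
    """Rows that TABLESAMPLE(BUCKET x OUT OF y ON key) would keep
    under the assumptions stated above."""
    return [row for row in rows if bucket_index(row[key], y) == x - 1]


# Hypothetical table content: five rows whose ids all happen to be odd.
rows = [{"id": i, "field2": f"name{i}"} for i in (1, 3, 5, 7, 9)]

first_bucket = tablesample(rows, 1, 2, "id")
second_bucket = tablesample(rows, 2, 2, "id")
print(len(first_bucket), len(second_bucket))
```

Under these assumptions, a table whose ids all fall into the same residue class returns no rows for the other bucket, which is one possible reason for an empty result when sampling on id. Whether that applies to a given table depends on its actual ids and Hive version.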
As an example of bucketing:
Let us see how we can create Bucketed Tables in Hive.
Bucketed tables are essentially what conventional databases call hash partitioning.
We need to specify the CLUSTERED BY Clause as well as INTO BUCKETS to create Bucketed table.
CREATE TABLE orders_buck (
order_id INT,
order_date STRING,
order_customer_id INT,
order_status STRING
) CLUSTERED BY (order_id) INTO 8 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
DESCRIBE FORMATTED orders_buck;
Let us see how we can add data to bucketed tables.
Typically we use the INSERT command to get data into bucketed tables, as the source data might not match the criteria of our bucketed table.
If the data is in files, we first need to load the data into a staging table and then insert it into the bucketed table.
We already have data in the orders table, so let us use it to insert data into our bucketed table orders_buck.
hive.enforce.bucketing should be set to true.
Here is the example of inserting data into bucketed table from regular managed or external table.
SET hive.enforce.bucketing;
SET hive.enforce.bucketing=true;
INSERT INTO orders_buck
SELECT * FROM orders;
-- check in the directory of the bucketed table whether the
-- number of files is equal to the number of buckets
dfs -ls /user/hive/warehouse/training_retail.db/orders_buck;
SELECT * FROM orders_buck TABLESAMPLE(BUCKET 1 OUT OF 2 ON order_id);
-- In my case this query works perfectly well

Update column in Amazon Redshift with join for big tables

I have a table with 500M rows and 30 columns (with a bigint ID column); let's call it big_one.
I also have another table, extra_one, with the same number of rows and the same ID column, but with two new columns of extra data that I'd like to include in the first table.
I added two extra columns to the first table and want to update the data based on a join.
Query is quite easy:
update big_one set
col1=extra_one.col1,
col2=extra_one.col2
from extra_one
where big_one.id=extra_one.id;
But during execution, disk space usage increased dramatically, up to 100%. Before the start I had 23.41% of free space on 4 nodes (160GB each, 640GB total). The big_one table initially used about 18% of space. This 23.41% indicates that I had about 490GB of free disk space to perform updates smoothly. But Redshift thinks differently.
The two new columns are MD5 hashes (so they are 32 characters long); ideally they should take up to 16GB of space.
Recap:
I have a wide table big_one.
Have another table extra_one (with 3 columns total), with same IDs and number of records.
I added two new columns to big_one.
I want to enrich big_one with data from extra_one. (into that 2 new columns)
Q1: Any advice on how to perform such big updates?
Q2: If I create a VIEW that joins the two tables and then use it, will that save me from such space drain situations? How does Redshift work with VIEWs (not materialized) in such cases?
Do not use UPDATE on a large number of rows.
When a row is modified in Amazon Redshift, the existing row is marked as Deleted and a new row is appended to the table. This will effectively double the size of the table and wastes a lot of disk space until the table is Vacuumed. It is also very slow!
Instead:
Create a query that JOINs the two tables
Use the query to populate a new table (see below)
Delete the old table and rename the new table so that it replaces the original table (or, truncate the original table and copy the data back into it)
You can use CREATE TABLE LIKE to create a new, empty table based on an existing table.
From CREATE TABLE - Amazon Redshift:
LIKE parent_table [ { INCLUDING | EXCLUDING } DEFAULTS ]
A clause that specifies an existing table from which the new table automatically copies column names, data types, and NOT NULL constraints. The new table and the parent table are decoupled, and any changes made to the parent table aren't applied to the new table. Default expressions for the copied column definitions are copied only if INCLUDING DEFAULTS is specified. The default behavior is to exclude default expressions, so that all columns of the new table have null defaults.
Tables created with the LIKE option don't inherit primary and foreign key constraints. Distribution style, sort keys, BACKUP, and NULL properties are inherited by LIKE tables, but you can't explicitly set them in the CREATE TABLE ... LIKE statement.
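The rebuild-and-swap steps listed above can be tried out on a small scale. The sketch below uses Python's built-in sqlite3 module as a stand-in, since it runs anywhere; Redshift's syntax and behaviour differ in details (for example distribution and sort keys), so this only shows the shape of the approach. The `payload` column is a placeholder for the 30 existing columns, and `big_one_new` is a made-up name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Tiny stand-ins for the two tables from the question.
cur.execute("CREATE TABLE big_one (id INTEGER, payload TEXT)")
cur.execute("CREATE TABLE extra_one (id INTEGER, col1 TEXT, col2 TEXT)")
cur.executemany("INSERT INTO big_one VALUES (?, ?)", [(1, "a"), (2, "b")])
cur.executemany(
    "INSERT INTO extra_one VALUES (?, ?, ?)",
    [(1, "hash1a", "hash1b"), (2, "hash2a", "hash2b")],
)

# Steps 1 and 2: populate a new table from the JOIN instead of running UPDATE.
cur.execute(
    """
    CREATE TABLE big_one_new AS
    SELECT b.id, b.payload, e.col1, e.col2
    FROM big_one AS b
    JOIN extra_one AS e ON b.id = e.id
    """
)

# Step 3: drop the old table and rename the new one into its place.
cur.execute("DROP TABLE big_one")
cur.execute("ALTER TABLE big_one_new RENAME TO big_one")
conn.commit()

rows = cur.execute(
    "SELECT id, payload, col1, col2 FROM big_one ORDER BY id"
).fetchall()
print(rows)
```

Every row is written exactly once into the new table, which is what avoids the delete-and-append overhead of a large UPDATE.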

Hive - Huge 10TB table repartitioning (Adding new partition columns)

Techies,
Background -
We have an existing 10TB Hive table which has been range partitioned on column A. The business case has changed and now requires adding partition column B in addition to column A.
Problem statement -
Since the data on HDFS is huge and needs to be restructured to include the new partition column B, we are having difficulty copying the table to a backup and re-ingesting it into the main table using a simple Impala INSERT OVERWRITE.
We want to explore whether there is an efficient way to add partition columns to such a huge table.
Alright!
If I understand your situation correctly, you have a table backed by 10 TB of data in HDFS with partition on Column A and you want to add the partition also on Column B.
So, if Column B is going to be the sub-partition, the HDFS directory would look like user/hive/warehouse/database/table/colA/colB, or /colB/colA otherwise (assuming it is a managed table).
Restructuring the HDFS directory manually won't be a great idea because it will become a nightmare to scan the data on all files and organize it accordingly in its corresponding folder.
Below is one way of doing it,
1. Create a new table with new structure - i.e., with partitions on Col A and Col B.
CREATE TABLE NEWTABLE ( COLUMNS ... ) PARTITIONED BY ( COL_A INT, COL_B INT )
2.a. Insert data from the old table to the new table (created in Step #1) like below,
INSERT INTO NEWTABLE
SELECT * FROM OLDTABLE;
However, this step will consume a lot of resources during execution if not handled properly: HDFS space for storing the results as data for NEWTABLE, and of course time.
OR
2.b. If you think HDFS will not have enough space to hold all the data, or resources are tight, I would suggest doing this INSERT in batches, removing the old data after each INSERT operation.
-- note: DELETE requires a transactional (ACID) table in Hive
INSERT INTO NEWTABLE
SELECT * FROM OLDTABLE
WHERE COL_A='abc';
DELETE FROM OLDTABLE
WHERE COL_A='abc';
INSERT INTO NEWTABLE
SELECT * FROM OLDTABLE
WHERE COL_A='def';
DELETE FROM OLDTABLE
WHERE COL_A='def';
.
.
.
so on.
This way, you free HDFS of the data that has already been handled and keep the space usage balanced.
If you follow step 2.b. then you can write a Script to automate this process by passing the partition names (derived from SHOW PARTITIONS) dynamically for each run. But, try the first two attempts manually before going with automation to make sure things go as expected.
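A rough Python sketch of such a script is shown below. It only generates the statements from text in the format that SHOW PARTITIONS prints (one `col=value` spec per line); running them is left out. Note one substitution: the cleanup uses ALTER TABLE ... DROP PARTITION instead of DELETE, because DELETE in Hive needs a transactional table. That swap, the quoting of values as strings, and the function names are assumptions of this sketch, not part of the steps above.

```python
def parse_partition_values(show_partitions_output: str, column: str):
    """Extract the distinct values of one partition column from
    SHOW PARTITIONS style output such as 'col_a=abc'."""
    values = []
    for line in show_partitions_output.splitlines():
        # Nested partition levels are joined by '/', e.g. 'col_a=abc/col_b=1'.
        for part in line.strip().split("/"):
            name, _, value = part.partition("=")
            if name.lower() == column.lower() and value and value not in values:
                values.append(value)
    return values


def batch_statements(values, old_table="OLDTABLE", new_table="NEWTABLE", column="COL_A"):
    """One INSERT plus one cleanup statement per partition value."""
    statements = []
    for value in values:
        statements.append(
            f"INSERT INTO {new_table} SELECT * FROM {old_table} WHERE {column}='{value}';"
        )
        # Assumption: dropping the migrated partition replaces the DELETE step.
        statements.append(
            f"ALTER TABLE {old_table} DROP IF EXISTS PARTITION ({column}='{value}');"
        )
    return statements


sample_output = "col_a=abc\ncol_a=def\n"  # hypothetical SHOW PARTITIONS output
values = parse_partition_values(sample_output, "COL_A")
for statement in batch_statements(values):
    print(statement)
```

In a real run you would want to verify the row counts in NEWTABLE for each batch before issuing the cleanup statement for it.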
Let me know if it helps!

Multiple Parquet files while writing to Hive Table(Incremental)

I have a partitioned Hive table:
CREATE EXTERNAL TABLE IF NOT EXISTS CUSTOMER_PART (
NAME string ,
AGE int ,
YEAR INT)
PARTITIONED BY (CUSTOMER_ID decimal(15,0))
STORED AS PARQUET LOCATION 'HDFS LOCATION'
The first LOAD is done from ORACLE to HIVE via PYSPARK using
INSERT OVERWRITE TABLE CUSTOMER_PART PARTITION (CUSTOMER_ID) SELECT NAME, AGE, YEAR, CUSTOMER_ID FROM CUSTOMER;
This works fine and creates partitions dynamically during the run. However, the daily incremental load creates an individual file for each single record under the partition.
INSERT INTO TABLE CUSTOMER_PART PARTITION (CUSTOMER_ID = 3) SELECT NAME, AGE, YEAR FROM CUSTOMER WHERE CUSTOMER_ID = 3; --Assume this gives me the latest record in the database
Is it possible to have the value appended to the existing parquet file under the partition until it reaches its block size, without smaller files being created for each insert?
Rewriting the whole partition is one option, but I would prefer not to do this:
INSERT OVERWRITE TABLE CUSTOMER_PART PARTITION (CUSTOMER_ID = 3) SELECT NAME, AGE, YEAR FROM CUSTOMER WHERE CUSTOMER_ID = 3;
The following properties are set for the Hive
set hive.execution.engine=tez; -- TEZ execution engine
set hive.merge.tezfiles=true; -- Notifying that merge step is required
set hive.merge.smallfiles.avgsize=128000000; --128MB
set hive.merge.size.per.task=128000000; -- 128MB
Which still doesn't help with daily inserts. Any alternate approach that can be followed would be really helpful.
As far as I know, we can't keep a single file for the daily partition data, since each day's load is written as separate part files.
Since you mention that you are importing the data from an Oracle DB, you could import the entire data set from Oracle each time and overwrite it in HDFS. That way you can maintain a single part file.
Also, HDFS is not recommended for small amounts of data.
I could think of the following approaches for this case:
Approach1:
Recreate the Hive table after loading the incremental data into the CUSTOMER_PART table:
Create a temp_CUSTOMER_PART table with an entire snapshot of the CUSTOMER_PART table data.
Overwrite the final table CUSTOMER_PART by selecting from the temp_CUSTOMER_PART table.
In this case you end up with a final table without small files in it.
NOTE: make sure no new data is inserted into the CUSTOMER_PART table after the temp table has been created.
Approach2:
Use the input_file_name() function:
Check how many distinct filenames there are in each partition, then select only the partitions that have more than, e.g., 10 files.
Create a temporary table with these partitions and overwrite only the selected partitions of the final table.
NOTE: make sure no new data is inserted into the CUSTOMER_PART table after the temp table has been created, because we are going to overwrite the final table.
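The file-counting step of Approach 2 can be sketched in plain Python. Instead of calling input_file_name() in Spark, this works on a list of file paths you have already collected (for instance the distinct values that function returns). The paths, the default threshold, and the function name below are made up for illustration.

```python
import re
from collections import defaultdict


def partitions_to_compact(file_paths, partition_column="customer_id", min_files=10):
    """Return partition values whose number of distinct files exceeds min_files."""
    pattern = re.compile(rf"/{re.escape(partition_column)}=([^/]+)/", re.IGNORECASE)
    files_per_partition = defaultdict(set)
    for path in file_paths:
        match = pattern.search(path)
        if match:
            # A set keeps only distinct file names per partition.
            files_per_partition[match.group(1)].add(path)
    return sorted(
        value
        for value, files in files_per_partition.items()
        if len(files) > min_files
    )


# Hypothetical listing: partition 3 has many small files, partition 4 has one.
paths = [f"/warehouse/customer_part/customer_id=3/part-{i:05d}.parquet" for i in range(12)]
paths.append("/warehouse/customer_part/customer_id=4/part-00000.parquet")

print(partitions_to_compact(paths))
```

Only the partitions returned here would then be copied to the temporary table and overwritten, which keeps the rewrite much smaller than Approach 1.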
Approach3:
Hive (not Spark) allows overwriting a table while selecting from the same table, i.e.
insert overwrite table default.t1 partition(partition_column)
select * from default.t1; -- overwrite and select from the same t1 table
If you go this way, a Hive job needs to be triggered once your Spark job finishes.
Hive will acquire a lock while running the overwrite/select on the same table, so any job that is writing to the table will wait.
In addition, the ORC format offers CONCATENATE, which merges small ORC files into a new, larger file.
alter table <db_name>.<orc_table_name> [partition (partition_column="val")] concatenate;