Understanding Hive table creation notation - hive

I have come across Hive tables that I need to convert to their Redshift/MySQL equivalents.
I am having trouble understanding Hive query structure and would appreciate some help:
CREATE TABLE IF NOT EXISTS table_1 (
id BIGINT,
price DOUBLE,
asset string
)
PARTITIONED BY (
pt STRING
);
ALTER TABLE table_1 DROP IF EXISTS PARTITION (pt== '${yyyymmdd}');
INSERT OVERWRITE TABLE table_1 PARTITION (pt= '${yyyymmdd}')
select aa.id,aa.price,aa.symbol from
...
...
from
table_2 table
I am having trouble understanding the PARTITIONED BY clause. This, if I am understanding correctly, is different from MySQL table partitions, and is a Hive specific dynamic partition.
The partition does not define a column or a key, and partitions by the current date.
Does this mean that table_1 is partitioned by the date? Each day has a separate partition?
Then later on in the code there are notations similar to
inner join table_new table on table.pt = '${yyyymmdd}' and ...
In this context, does it mean only rows inserted on yyyymmdd are selected for the join?
Thank you.

A partition in Hive is, by default, a folder in HDFS named key=value, plus metadata in the Hive metastore. You can alter a partition's location and create a partition on top of any folder.
PARTITIONED BY (pt STRING) defines a partition column pt of type STRING, not DATE. The pt column is not present in the table data files; it is defined only in PARTITIONED BY, and the partition values are stored in the metadata. If you load a partition dynamically, the partition folder is created with the name pt=value.
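To make the folder naming concrete, here is a minimal Python sketch of how the default partition directory name is derived. The warehouse path and the helper name partition_dir are assumptions for illustration only; the real table location depends on your cluster configuration.

```python
from posixpath import join


def partition_dir(table_dir, column, value):
    """Return the default directory of one partition: <table_dir>/<column>=<value>."""
    return join(table_dir, f"{column}={value}")


# Hypothetical table location; check DESCRIBE FORMATTED for the real one.
path = partition_dir("/user/hive/warehouse/table_1", "pt", "20200604")
print(path)
```

The pt value lives only in the directory name (and in the metastore), which is why it never appears inside the data files.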
This statement creates partitions dynamically:
INSERT OVERWRITE TABLE table_1 PARTITION (pt)
select id, price, symbol,
coln as pt --partition column should be the last one
from ...
And this statement loads a single STATIC partition:
INSERT OVERWRITE TABLE table_1 PARTITION (pt= '${yyyymmdd}')
select aa.id,aa.price,aa.symbol
from
No partition column is selected; the partition value is specified in the PARTITION clause:
PARTITION (pt= '${yyyymmdd}')
'${yyyymmdd}' here is a parameter with name yyyymmdd which is passed to the script using --hivevar like this:
hive --hivevar yyyymmdd=20200604 -f myscript.sql
You can pass ANY string as the partition value in this case, though the parameter name yyyymmdd suggests its format.
BTW, the date format in Hive is 'yyyy-MM-dd'. Strings in 'yyyy-MM-dd' format can be implicitly converted to DATE.
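The variable substitution is purely textual, which a short Python sketch can imitate. This is not how Hive implements it; it only uses string.Template, which happens to share the ${name} syntax, to show what the statement looks like after the value passed with --hivevar is filled in. The elided FROM clause is left as in the question.

```python
from string import Template

# The static-partition statement from the question, with the query body still elided.
script = Template(
    "INSERT OVERWRITE TABLE table_1 PARTITION (pt = '${yyyymmdd}')\n"
    "SELECT aa.id, aa.price, aa.symbol FROM ..."
)

# Equivalent in spirit to: hive --hivevar yyyymmdd=20200604 -f myscript.sql
rendered = script.substitute(yyyymmdd="20200604")
print(rendered)
```

Because the value is just a string, nothing stops you from passing something that is not a date at all.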

I will try to explain in one shot what partitioning in Hive is. First of all:
WHEN TO USE TABLE PARTITIONING
Table partitioning is good when:
Reading the entire dataset takes too long
Queries almost always filter on the partition columns
There are a reasonable number of different values for partition columns
Data generation or ETL process splits data by file or directory names
Partition column values are not in the data itself
Don't partition on columns with many unique values
Example: Partitioning customers by first name
CREATING PARTITIONED TABLES
To create a partitioned table, use the PARTITIONED BY clause in the CREATE TABLE statement.
The names and types of the partition columns must be specified
in the PARTITIONED BY clause, and only in the PARTITIONED BY clause.
They must not also appear in the list of all the other columns.
CREATE TABLE customers_by_country
(cust_id STRING, name STRING)
PARTITIONED BY (country STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
The example CREATE TABLE statement shown above creates the table customers_by_country,
which is partitioned by the STRING column named country.
Notice that the country column appears only in the PARTITIONED BY clause,
and not in the column list above it.
This example specifies only one partition column, but you can specify more than one by using
a comma-separated column list in the PARTITIONED BY clause.
Aside from these specific differences, this CREATE TABLE statement is the same
as the statement used to create an equivalent non-partitioned table.
Table partitioning is implemented in a way that is mostly transparent
to a user issuing queries with Hive.
A partition column is what’s known as a virtual column, because its values are not stored within the data files.
Following is the result of the DESCRIBE command on customers_by_country;
it displays the partition column country just as if it were a normal column within the table.
You can refer to partition columns in any of the usual clauses of a SELECT statement.
name     type    comment
cust_id  string
name     string
country  string
You can load data into partitioned tables dynamically or statically.
LOADING DATA WITH DYNAMIC PARTITION
One way to load data into a partitioned table is to use dynamic partitioning,
which automatically defines partitions when you load the data, using the values in the partition column.
(The other way is to manually define the partitions with Static Partitioning)
To use dynamic partitioning, you must load data using an INSERT statement.
In the INSERT statement, you must use the PARTITION clause to list the partition columns.
The data you are inserting must include values for the partition columns.
The partition columns must be the rightmost columns in the data you are inserting,
and they must be in the same order as they appear in the PARTITION clause.
INSERT OVERWRITE TABLE customers_by_country
PARTITION(country)
SELECT cust_id, name, country FROM customers;
The example shown above uses an INSERT … SELECT statement
to load data into the customers_by_country table with dynamic partitioning.
Notice that the partition column, country, is included
in the PARTITION clause and is specified last in the SELECT list.
When Hive executes this statement, it automatically creates partitions
for the country column and loads the data into these partitions based on the values in the country column.
The resulting data files in the partition subdirectories do not include values for the country column.
Since the country is known based on which subdirectory a data file is in,
it would be redundant to include country values in the data files as well.
Look at the contents of the customers_by_country directory.
It should now have one subdirectory for each value in the country column.
Look at the file in one of those directories.
Notice that the file contains the rows for customers from that country,
and no others; notice also that the country value is not included.
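The routing that dynamic partitioning performs can be sketched in a few lines of Python. This is only an illustration of the idea, not Hive's implementation; the sample rows and the function name route_rows are made up for the example.

```python
from collections import defaultdict


def route_rows(rows, n_partition_cols=1):
    """Group rows by their trailing partition values and drop those
    values from the rows that get stored."""
    partitions = defaultdict(list)
    for row in rows:
        data, key = row[:-n_partition_cols], row[-n_partition_cols:]
        partitions[key].append(data)
    return dict(partitions)


# Hypothetical result of: SELECT cust_id, name, country FROM customers
customers = [("a1", "Ann", "us"), ("b2", "Bilal", "pk"), ("c3", "Cara", "us")]
by_country = route_rows(customers)

for (country,), stored_rows in sorted(by_country.items()):
    print(f"country={country}/", stored_rows)
```

Each key becomes one partition subdirectory, and the stored rows contain only cust_id and name, matching what you see when you inspect the files.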
Note: Hive includes a safety feature that prevents users
from accidentally creating or overwriting a large number of partitions.
(See “Risks of Using Partitioning” for more about this.)
By default, Hive sets the property hive.exec.dynamic.partition.mode to strict.
In strict mode at least one partition column must be given a static value, so a fully dynamic insert like the one above is rejected, though you can still use static partitions.
You can disable this safety feature in Hive by setting
the property hive.exec.dynamic.partition.mode to nonstrict:
SET hive.exec.dynamic.partition.mode=nonstrict;
Then you can use the INSERT statement to load the data dynamically.
Hive properties set in Beeline are for the current session only,
so the next time you start a Hive session this property will be set back to strict.
But you or your system administrator can configure properties permanently, if necessary.
When you run SELECT queries on the partitioned table, and the table is big enough, you may notice a significant difference in the time they take to run.
Notice that you will not query the table any differently than you would query the customers table.
LOADING DATA WITH STATIC PARTITIONING
One way to load data into a partitioned table is to use static partitioning,
in which you manually define the different partitions.
With static partitioning, you create a partition manually, using an ALTER TABLE … ADD PARTITION statement,
and then load the data into the partition.
For example, this ALTER TABLE statement creates the partition for Pakistan (pk):
ALTER TABLE customers_by_country
ADD PARTITION (country='pk');
Notice how the partition column name, which is country, and the specific value that defines this partition,
which is pk, are both specified in the ADD PARTITION clause.
This creates a partition directory named country=pk inside the customers_by_country table directory.
After the partition for Pakistan is created, you can add data into the partition using an INSERT … SELECT statement:
INSERT OVERWRITE TABLE customers_by_country
PARTITION(country='pk')
SELECT cust_id, name FROM customers WHERE country='pk';
Notice how in the PARTITION clause, the partition column name, which is country,
and the specific value, which is pk, are both specified, just like in the ADD PARTITION command used to create the partition.
Also notice that in the SELECT statement, the partition column is not included in the SELECT list.
Finally, notice that the WHERE clause in the SELECT statement selects only customers from Pakistan.
With static partitioning, you need to repeat these two steps for each partition:
first create the partition, then add data.
You can actually use any method to load the data; you need not use an INSERT statement.
You could instead use hdfs dfs commands or a LOAD DATA INPATH command.
But however you load the data, it’s your responsibility to ensure that data is stored in the correct partition subdirectories.
For example, data for customers in Pakistan must be stored in the Pakistan partition subdirectory,
and data for customers in other countries must be stored in those countries’ partition subdirectories.
Static partitioning is most useful when the data being loaded
into the table is already divided into files based on the partition column,
or when the data grows in a manner that coincides with the partition column:
For example, suppose your company opens a new store in a different country,
like New Zealand ('nz'), and you're given a file of data for new customers, all from that country.
You could easily add a new partition and load that file into it.
RISKS OF USING PARTITIONING
A major risk when using partitioning is creating partitions that lead you into the small files problem.
When this happens, partitioning a table will actually worsen query performance
(the opposite of the goal when using partitioning) because it causes too many small files to be created.
This is more likely when using dynamic partitioning, but it could still
happen with static partitioning—for example if you added a new partition to a sales table
on a daily basis containing the sales from the previous day,
and each day’s data is not particularly big.
When choosing your partitions, you want to strike a happy balance between too many partitions
(causing the small files problem) and too few partitions (providing little performance benefit).
The partition column or columns should have a reasonable number of values
for the partitions—but what you should consider reasonable is difficult to quantify.
Using dynamic partitioning is particularly dangerous because if you're not careful,
it's easy to partition on a column with too many distinct values.
Imagine a use case where you are often looking for data that falls within
a time frame that you would specify in your query.
You might think that it's a good idea to partition on a column that pertains to time.
But a TIMESTAMP column could have the time to the nanosecond, so every row could have a unique value;
that would be a terrible choice for a partition column! Even to the minute or hour could create
far too many partitions, depending on the nature of your data;
partitioning by larger time units like day, month, or even year might be a better choice.
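A quick way to get a feel for this is to count how many partitions the same data would produce at different time granularities. The Python sketch below uses made-up events (one every 90 seconds over two days); the numbers are illustrative only.

```python
from datetime import datetime, timedelta

# Hypothetical data: one event every 90 seconds, covering exactly two days.
start = datetime(2020, 6, 4)
events = [start + timedelta(seconds=90 * i) for i in range(1920)]


def partition_count(timestamps, fmt):
    """Number of distinct partition values if timestamps are truncated with fmt."""
    return len({ts.strftime(fmt) for ts in timestamps})


for label, fmt in [
    ("second", "%Y-%m-%d %H:%M:%S"),
    ("hour", "%Y-%m-%d %H"),
    ("day", "%Y-%m-%d"),
]:
    print(label, partition_count(events, fmt))
```

At second granularity every row gets its own partition, while at day granularity the same rows fall into just two.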
As another example, consider an employees table.
This has five columns: empl_id, first_name, last_name, salary, and office_id.
Before reading on, think for a moment: which of these might be reasonable for partitioning?
The column empl_id is a unique identifier.
If that were your partition column, you would have a separate partition for each employee,
and each would have exactly one row.
In addition, it's not likely you'll be doing a lot of queries looking for a particular value,
or even a particular range of values. This is a poor choice.
The column first_name will not have one partition per employee, but there will likely be many partitions that have only one row.
This is also true for last_name.
Also, like empl_id, it's not likely you'll need to filter queries based on these columns. These are also poor choices.
The column salary also will have many divisions
(and even more so if your salaries go to the cent rather than to the dollar as our sample table does).
While it may be that you'll sometimes want to query on salary ranges,
it's not likely you'll want to use individual salaries.
So salary is a poor choice.
A more limited salary_grades specification, like the ones in the salary_grades table,
might be reasonable if your use case involves looking at the data by salary grade frequently.
The office_id column identifies the office where an employee works.
This will have a much smaller number of unique values, even if you have a large company with offices in many cities.
It's imaginable that your use case might be to frequently filter
your employee data based on office location, too. So this would be a good choice.
You also can use multiple columns and create nested partitions.
For example, a dataset of customers might include country and state_or_province columns.
You can partition by country and then partition those further by state_or_province, so customers from Ontario,
Canada would be in the country=ca/state_or_province=on/ partition directory.
This can be extremely helpful for large amounts of data that you want to access either by country or by state or province.
However, using multiple columns increases the danger of creating too many partitions, so you must take extra care when doing so.
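As a rough illustration of how nesting multiplies the number of directories, this Python sketch lists the distinct partition paths a small, made-up customer set would need with one partition column and with two. The rows and the helper name nested_partition_dirs are assumptions for the example.

```python
def nested_partition_dirs(rows, partition_cols):
    """Return the distinct nested partition directories needed for the rows."""
    return sorted({
        "/".join(f"{col}={row[col]}" for col in partition_cols)
        for row in rows
    })


# Hypothetical customers
customers = [
    {"name": "Ann", "country": "ca", "state_or_province": "on"},
    {"name": "Bo", "country": "ca", "state_or_province": "qc"},
    {"name": "Cy", "country": "us", "state_or_province": "ny"},
    {"name": "Di", "country": "ca", "state_or_province": "on"},
]

print(nested_partition_dirs(customers, ["country"]))
print(nested_partition_dirs(customers, ["country", "state_or_province"]))
```

Running the same count against a sample of your real data before loading is a cheap way to see whether a second partition column is affordable.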
The risk of creating too many partitions is why Hive includes the property
hive.exec.dynamic.partition.mode, set to strict by default, which must be reset to nonstrict before you can create partitions fully dynamically.
Rather than automatically and mechanically resetting that property when you're about to load data dynamically,
take it as an opportunity to think about the partitioning columns
and maybe check the number of unique values you would get when you load the data.
And that's all.

Related

How to choose partition keys for apache iceberg tables

I have a number of hive warehouses. The data resides in parquet files in Amazon S3. Some of the tables contain TB of data. Currently in hive most tables are partitioned by a combination of month and year, both of which are saved mainly as string. Other fields are either bigint, int, float, double, string and unix timestamps. Our goal is to migrate them to apache iceberg tables. The challenge is how to choose the partition keys.
I have already calculated the cardinality of each field in each table by:
Select COUNT(DISTINCT my_column) As my_column_count
From my_table;
I have also calculated the percentage of null values for each field:
SELECT 100.0 * count(*)/number_of_all_records
FROM my_db.my_table
Where my_column IS NULL;
In short I already know three things for each field:
Data type
Cardinality
Percentage of null values
By knowing these three pieces of information, my question is how should I choose the best column or combination of columns as partition keys for my future Iceberg tables? Are there any rules of thumb?
How many partitions are considered optimal when choosing partition keys? What data types are best for partition keys? Is bucketing the same in Iceberg tables as it is in Hive, and how can it be leveraged together with the partition keys? Is it better to have many small partitions or a few big ones? Are there any other aspects of partition keys that need to be considered?
One crucial part is missing from your description - the queries. You need to understand what are the queries that will run on this data. Understanding the queries that will run on the data (to the best you can) is super important.
For example, consider a simple table with: Date, Id, Name, Age as columns.
If the queries are date-based, meaning they query the data in the context of dates,
select * from table where date > 'some-date'
then it's a good idea to partition by date.
However, if the queries are age related
select * from table where age between 20 and 30
then you should consider partitioning by age or age groups.

Create partitioned table with date suffix in bigquery using SQL or web UI

I want to create such table:
CREATE TABLE sometable
(SELECT columns, columns, date_col)
PARTITIONED BY date_col
And I want it to be date partitioned with the date in table suffix: sometable$date_partition
I read the docs, but can't complete this neither with web UI nor with SQL.
The web UI shows such error "Missing argument for parameter DATE."
My table name is "daily_export_${DATE}"
My partitioning column isn't blank, it's date_col.
Can I have a simple example, please?
PARTITION BY goes earlier, before the AS SELECT part.
The query needs to parse the table suffix into a DATE type.
For example:
CREATE OR REPLACE TABLE temp.so
PARTITION BY date_from_table_name
AS
SELECT PARSE_DATE('%Y%m%d', _table_suffix) date_from_table_name, event_timestamp, event_name, items
FROM `bingo-blast-174dd.analytics_151321511.events_*`
WHERE _table_suffix BETWEEN '20200530' AND '20200531'
LIMIT 10
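The PARSE_DATE('%Y%m%d', _table_suffix) step simply turns a suffix such as 20200530 into a real date. A minimal Python equivalent, shown only to illustrate the conversion (the function name suffix_to_date is made up), looks like this:

```python
from datetime import date, datetime


def suffix_to_date(suffix):
    """Parse a yyyymmdd table suffix into a date, like PARSE_DATE('%Y%m%d', ...)."""
    return datetime.strptime(suffix, "%Y%m%d").date()


print(suffix_to_date("20200530"))
```

A suffix that does not match the format raises an error instead of silently producing a wrong partition value.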
As the documentation explains, BigQuery implements two different concepts: sharded tables and partitioned tables.
The first one (sharded tables) is a way of dividing a whole table into many tables with a date suffix. You can query those tables individually or using wildcards. For example, instead of creating a single table named events, you can create many tables named events_20200101, events_20200102, [...]
When you do that, you are able to query any of those tables individually or you can query all of them by running some query like select * from events_*
The second concept (partitioned tables) is an approach to fragment your table into smaller pieces in order to improve performance and reduce costs when querying data. Partitioned tables can be based on some column of your table or even on the ingestion time. When your table is partitioned by ingestion time, you can access a pseudo column named _PARTITIONTIME.
When comparing both approaches, the documentation says:
Date/timestamp partitioned tables perform better than tables sharded
by date. When you create date-named tables, BigQuery must maintain a
copy of the schema and metadata for each date-named table. Also, when
date-named tables are used, BigQuery might be required to verify
permissions for each queried table. This practice also adds to query
overhead and impacts query performance. The recommended best practice
is to use date/timestamp partitioned tables instead of date-sharded
tables.
In your case, you basically need to create a partitioned table without a date in its name.

Hive partition column

We have an Avro partitioned table in Hive. When we query the table, the partition column is displayed at the end. Is there any way to display the partition column first?
Eg: select * from tablea
Output:
Col1 col2 partition_column
Expected output:
Partition_column col1 col2
The partition column is not stored in the files, so Avro or not, it does not matter in this context. The partition column corresponds to a partition sub-folder within the table folder, and its values are stored in the metadata.
Historically, the partition column is the last one. Dynamic partitioning using INSERT OVERWRITE TABLE ... PARTITION (partition_column) SELECT * FROM ... is a rather common scenario, and Hive will know the partition is the last column.
The dynamic partition columns must be specified last among the columns
in the SELECT statement and in the same order in which they appear in
the PARTITION() clause.
You can change the order of columns displayed when running SELECT * only by creating a view in which you list all columns in the required order, or by selecting the columns explicitly in your SELECT.
Also, according to Codd's relational theory, column and row order is immaterial: you should always specify the desired column order explicitly in the SELECT and the row order using ORDER BY, instead of relying on the column and row order in the table or view. But in Hive the partitioning column is the last one in the table.
Consider also this: you may not even know what you are selecting from, a table or a view, and you may not be notified when an upstream system decides to change it. A view or table can change the order of its columns. Treat a view the same as a table when doing selects; it is just an abstraction level. Use an explicit column list so that your program works reliably and does not depend on the column order in the underlying table or view, which is immaterial.

Partitioned by in Apache HIVE, more questions

There are some good questions/answers here
Hive clustered by on more than one column
hive subquery optimization using cluster by
difference between Cluster By and CLUSTERED BY in hive?
What is the difference between partitioning and bucketing a table in Hive ?
but I have a few more, unfortunately there is no good explanation here on page 24:
https://docs.hortonworks.com/HDPDocuments/HDP3/HDP-3.0.1/using-hiveql/hive_using_hiveql.pdf
My questions:
In below example from the above:
CREATE TABLE pageviews (userid VARCHAR(64), link STRING, from STRING)
PARTITIONED BY (datestamp STRING) CLUSTERED BY (userid) INTO 256 BUCKETS;
INSERT INTO TABLE pageviews PARTITION (datestamp = '2014-09-23') VALUES
('jsmith', 'mail.com', 'sports.com'), ('jdoe', 'mail.com', null);
INSERT INTO TABLE pageviews PARTITION (datestamp) VALUES ('tjohnson',
'sports.com', 'finance.com', '2014-09-23'), ('tlee', 'finance.com', null,
'2014-09-21');
why does "datestamp STRING" do not exist in the the schema of the pageviews?
Why is it defined as string? should not be TIMESTAMP?
Why does the second insert miss it and only has it as type but it has as values (i.e. '2014-09-23' and '2014-09-21?
why does "datestamp STRING" do not exist in the the schema of the pageviews?
Although datestamp looks and behaves like a standard column defined in the schema, it's actually just a reference to a particular partition of the underlying data for the table. When you see '2014-09-23' in the datestamp column, it's not actually showing you a value contained in a particular record in one of the data files; instead it's telling you that the data in the rest of the row comes from an HDFS directory called 'datestamp=2014-09-23' that contains a partition or "chunk" of the data. This is where a lot of the optimization comes in, since filtering a query to a particular partition allows Hive to simply go to the data in that particular directory and ignore the data contained in the other partitions.
Why is it defined as string? should be TIMESTAMP?
Since a partition is simply referring to a directory name, it only makes sense that the type is a string representation of a specific date format instead of a timestamp or date. Conceptually, a date field would not make sense since although '2014-09-23' and '9/23/2014' are two equal datestamps, these would be considered different directories if they were directory names. In other words, if a directory is named '2014-09-23', you cannot refer to it by any other name, making it more like a string and less like a date, which has many alternate forms that are all equivalent. Furthermore, Hive has historically treated dates as strings, which makes a string a better fit than, say, an int. For example, if you pass a timestamp to Hive's to_date() function, it returns the date as a string (in versions before Hive 2.1; later versions return a DATE).
Also, since you mentioned timestamp, using a full timestamp that has fractions of a second in it is a bad idea for partitions, even if you use a string representation of it. You would end up with a massive amount of partitions and probably one or at most only a few records in each partition. I would imagine you would quickly lose any of the performance benefits of partitioning.
Why does the second insert miss it and only has it as type but it has as values (i.e. '2014-09-23' and '2014-09-21?
This is simply a different syntax that produces the same result. When you include partitions, Hive will assume the values at the end of the values array refer to the partitions. So if you have a table with 3 columns in your schema and 1 partition, when you perform an insert into table command and specify partition (datestamp), you can just pass in 4 values and Hive will know that the first 3 values are to be inserted into the 3 columns in your schema, and the fourth value refers to which datestamp partition you want to add this record's data to.

How does Impala support partitioning?

How does Impala support the concept of partitioning and, if it supports it, what are the differences between Hive Partitioning and Impala Partitioning?
By default, all the data files for a table are located in a single directory.
Partitioning is a technique for physically dividing the data during loading, based on values from one or more columns, to speed up queries that test those columns.
For example, with a school_records table partitioned on a year column, there is a separate data directory for each different year value, and all the data for that year is stored in a data file in that directory. A query that includes a WHERE condition such as YEAR=1966, YEAR IN (1989,1999), or YEAR BETWEEN 1984 AND 1989 can examine only the data files from the appropriate directory or directories, greatly reducing the amount of data to read and test.
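The pruning described above can be imitated with a small Python sketch: given the partition directory names, only those whose year satisfies the WHERE condition are kept for scanning. This is a simplified model, not Impala's planner, and the directory list is made up.

```python
def prune(partition_dirs, predicate):
    """Keep only the year=<value> directories whose value satisfies the predicate."""
    kept = []
    for d in partition_dirs:
        column, _, value = d.partition("=")
        if column == "year" and predicate(int(value)):
            kept.append(d)
    return kept


# Hypothetical partition directories of school_records
dirs = [f"year={y}" for y in (1966, 1984, 1987, 1989, 1999)]

between = prune(dirs, lambda y: 1984 <= y <= 1989)   # YEAR BETWEEN 1984 AND 1989
in_list = prune(dirs, lambda y: y in (1989, 1999))   # YEAR IN (1989, 1999)
print(between)
print(in_list)
```

Directories that are filtered out are never opened, which is where the savings come from.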
Static and Dynamic Partitioning
Specifying all the partition columns in a SQL statement is called "static partitioning", because the statement affects a single predictable partition. For example, you use static partitioning with an ALTER TABLE statement that affects only one partition, or with an INSERT statement that inserts all values into the same partition:
insert into t1 partition(x=10, y='a') select c1 from some_other_table;
When you specify some partition key columns in an INSERT statement, but leave out the values, Impala determines which partition to insert into. This technique is called "dynamic partitioning":
insert into t1 partition(x, y='b') select c1, c2 from some_other_table;
Create new partition if necessary based on variable year, month, and day; insert a single value.
insert into weather partition (year, month, day) select 'cloudy',2014,4,21;
Create new partition if necessary for specified year and month but variable day; insert a single value.
insert into weather partition (year=2014, month=04, day) select 'sunny',22;
The more key columns you specify in the PARTITION clause, the fewer columns you need in the SELECT list. The trailing columns in the SELECT list are substituted in order for the partition key columns with no specified value.
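The substitution rule in the last sentence can be sketched in Python. The function below is a hypothetical helper, not part of Impala: it takes the PARTITION clause as (column, value) pairs, with None marking a column left without a value, and fills those from the trailing SELECT values in order.

```python
def resolve_partition(spec, select_row):
    """Split one SELECT row into data values and a complete partition key.

    spec: list of (column, value_or_None) pairs in PARTITION-clause order;
    None means the value comes from the trailing SELECT columns.
    """
    n_dynamic = sum(1 for _, value in spec if value is None)
    split = len(select_row) - n_dynamic
    data, trailing = select_row[:split], list(select_row[split:])
    partition = {}
    for column, value in spec:
        partition[column] = trailing.pop(0) if value is None else value
    return data, partition


# partition (year=2014, month=04, day) select 'sunny', 22
print(resolve_partition([("year", 2014), ("month", 4), ("day", None)], ("sunny", 22)))

# partition (year, month, day) select 'cloudy', 2014, 4, 21
print(resolve_partition([("year", None), ("month", None), ("day", None)],
                        ("cloudy", 2014, 4, 21)))
```

In both cases the only value that ends up in the data file is the weather description; everything else selects the partition.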
You may refer to this link for further reading.
Hope that helps!