Is Hive partitioning hierarchical in nature?

Say we have a table partitioned as:
CREATE EXTERNAL TABLE MyTable (
col1 string,
col2 string,
col3 string
)
PARTITIONED BY(year INT, month INT, day INT, hour INT, combination_id BIGINT);
Now obviously year is going to store the year value (e.g. 2016), month will store the month value (e.g. 7), day will store the day (e.g. 18), and hour will store the hour in 24-hour format (e.g. 13). And combination_id is going to be a combination of the zero-padded values of all of these (single-digit values are padded with 0 on the left). So in this example the combination_id is 2016071813.
So we fire a query (let's call it Query A):
select * from mytable where combination_id = 2016071813
Now Hive doesn't know that combination_id is actually a combination of year, month, day, and hour. So will this query fail to take proper advantage of partitioning?
In other words, if I have another query, call it Query B, will it be more optimal than Query A, or is there no difference?
select * from mytable where year=2016 and month=7 and day=18 and hour=13
If the Hive partitioning scheme is really hierarchical in nature, then my thinking is that Query B should be better from a performance point of view. In fact, I want to decide whether to get rid of combination_id from the partitioning scheme altogether if it is not contributing to better performance at all.
The only real advantage of using combination_id is being able to use the BETWEEN operator in a select:
select * from mytable where combination_id between 2016071813 and 2016071823
But if this is not going to take advantage of the partitioning scheme, it is going to hamper performance.

Yes. Hive partitioning is hierarchical.
You can simply check this by printing the table's partitions with the query below.
show partitions MyTable;
Output:
year=2016/month=5/day=5/hour=5/combination_id=2016050505
year=2016/month=5/day=5/hour=6/combination_id=2016050506
year=2016/month=5/day=5/hour=7/combination_id=2016050507
In your scenario, you don't need combination_id as a partition column if you are not using it for querying.
You can partition either by the year, month, day, and hour columns, or by combination_id alone; a DDL sketch of the first option follows.
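For example, dropping combination_id gives a DDL like this (a sketch of the first option, restating the question's table):
CREATE EXTERNAL TABLE MyTable (
col1 string,
col2 string,
col3 string
)
PARTITIONED BY(year INT, month INT, day INT, hour INT);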
Partitioning by multiple columns helps performance in grouping operations.
Say you want to find the maximum of col1 for the month of March in the years 2016 and 2015.
Hive can fetch the records by going straight to the specific year partitions (year=2016, year=2015) and the month partition (month=3), as in the sketch below.
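For example, a query like the following (a sketch against the four-column scheme above) lets Hive prune straight to those partitions:
select max(col1) from mytable where year in (2015, 2016) and month = 3;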

Related

Specify the partition # based on date range for that pkey value

We have a DW query that needs to extract data from a very large table, around 10 TB, which is partitioned by a datetime column (let's say time) so that data can be purged based on this column every day. So my understanding is that each partition holds a day's worth of data. From the Storage tab (SSMS GUI) I see the number of partitions is 1995.
There is no clustered index on this table, as it's mostly intended for write operations. Just a design choice by the vendor.
A query that targets a single partition by a hard-coded partition number looks like this:
SELECT
a.*
FROM dbo.VLTB AS a
CROSS APPLY
(
VALUES($PARTITION.a_func(a.time))
) AS c (pid)
WHERE c.pid = 1896;
Currently the query submitted is:
SELECT * from dbo.VLTB
WHERE time >= convert(datetime,'20210601',112)
AND time < convert(datetime,'20210602',112)
So replacing the inequality predicates with an equality check on that day's specific partition might help. Users can control the dates sent via the app, but how will they manage if we want them to use a partition # as in the first query?
Question
How do I find a way, in the query above, to have the script determine the partition number for a given day, rather than hard-coding it (for 06/01 I had to supply partition # 1896)? Is there a better way to find the correct partition # to put in the WHERE clause, so that not all partitions are scanned?
Thank you
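One possibility (a sketch, assuming the partition function is named a_func as in the query above) is to let SQL Server compute the partition number from the date itself with $PARTITION, so the app can keep sending dates:
-- @d is the day the user asked for; $PARTITION.a_func(@d) returns its partition number
DECLARE @d datetime = CONVERT(datetime, '20210601', 112);
SELECT a.*
FROM dbo.VLTB AS a
CROSS APPLY
(
VALUES($PARTITION.a_func(a.time))
) AS c (pid)
WHERE c.pid = $PARTITION.a_func(@d);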

BigQuery - Create view with Partition but base table doesn't have

This may sound crazy, but I want to implement something like having a view with a partition.
Background:
I had a table with a date partition on a column, and the table is really huge in size. We are running data ingestion into this table at a 2-minute interval. All the data loads are append-only. Every load inserts 10k+ rows. After some time, we encountered the partition limitation issue.
message: "Quota exceeded: Your table exceeded quota for Number of partition modifications to a column partitioned table. For more information, see https://cloud.google.com/bigquery/troubleshooting-errors"
Root cause (from the GCP support team):
The root cause under the hood was that your partitioned tables have a pretty granular partition, for instance by minute, hour, or date; when the loaded data covers a wide range of partition periods, the number of partitions modified will be high, above 4000. Per internal documentation, users who run into this issue are advised to consider a less granular partition, for instance changing a date/hour/minute-based partitioned table to a week-based partitioned table. Alternatively, split the load into multiple loads and hence limit the data range to cover fewer partitions. This is the best recommendation I could have now.
So I'm planning to keep this table un-partitioned and create a view (we need the view to eliminate duplicates), and the view should have a partition. Is this possible? Or is there any other alternative solution?
You can't partition a view; it's not physically materialized. Partitioning by day can be limiting with the 4000-partition quota. Would year work? Then you can use an integer-range partition:
create or replace table BI.test
PARTITION BY RANGE_BUCKET(Year, GENERATE_ARRAY(2000, 3000, 1)) as
select 2000 as Year, 1 as value
union all
select 2001 as Year, 1 as value
union all
select 2002 as Year, 1 as value
Alternatively, I've used month (YYYYMM) or week (YYYYWW) as the integer field to partition by, which gets you around 40 years:
RANGE_BUCKET(monthasintegerfield, GENERATE_ARRAY(201612, 205712, 1))
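A month-based variant in the same style might look like this (a sketch; month_id is a hypothetical YYYYMM integer column):
create or replace table BI.test_monthly
PARTITION BY RANGE_BUCKET(month_id, GENERATE_ARRAY(201612, 205712, 1)) as
-- FORMAT_DATE('%Y%m', ...) produces the YYYYMM integer for each row's date
select CAST(FORMAT_DATE('%Y%m', DATE '2021-06-15') AS INT64) as month_id, 1 as value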

Performing Date Math on Hive Partition Columns

My data is partitioned by day in the standard Hive format:
/year=2020/month=10/day=01
/year=2020/month=10/day=02
/year=2020/month=10/day=03
/year=2020/month=10/day=04
...
I want to query all data from the last 60 days using Amazon Athena (i.e. Presto). I want this query to use the partition columns (year, month, day) so that only the necessary partition files are scanned. Assuming I can't change the file partition format, what is the best approach to this problem?
You don't have to use year, month, day as the partition keys for the table. You can have a single partition key called date and add the partitions like this:
ALTER TABLE the_table ADD
PARTITION (`date` = '2020-10-01') LOCATION 's3://the-bucket/data/year=2020/month=10/day=01'
PARTITION (`date` = '2020-10-02') LOCATION 's3://the-bucket/data/year=2020/month=10/day=02'
...
With this setup you can even set the type of the partition key to date:
PARTITIONED BY (`date` date)
Now you have a table with a date column typed as a DATE, and you can use any of the date and time functions to do calculations on it.
What you won't be able to do with this setup is use MSCK REPAIR TABLE to load partitions, but you really shouldn't do that anyway – it's extremely slow and inefficient and really something you only do when you have a couple of partitions to load into a new table.
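With that single date partition key in place, the original goal of reading only the last 60 days becomes a simple range filter (a sketch using Presto's date functions; date is a reserved word in Athena DML, hence the double quotes):
SELECT *
FROM the_table
WHERE "date" >= date_add('day', -60, current_date);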
An alternative to the approach proposed by Theo is to use the following syntax, e.g.:
select ... from my_table where year||month||day between '20200630' and '20201010'
This works when the year, month, and day columns are strings with zero-padded values, as in the partition layout above. It's particularly useful for querying across months.

BigQuery table partitioning by month

I can't find any documentation relating to this. Is time_partitioning_type=DAY the only way to partition a table in BigQuery? Can this parameter take any other values besides a date?
Note that even if you partition on day granularity, you can still write your queries to operate at the level of months using an appropriate filter on _PARTITIONTIME. For example,
#standardSQL
SELECT * FROM MyDatePartitionedTable
WHERE DATE_TRUNC(EXTRACT(DATE FROM _PARTITIONTIME), MONTH) = '2017-01-01';
This selects all rows from January 2017.
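The same month can also be expressed as a half-open range on _PARTITIONTIME, which is generally friendlier to partition pruning (a sketch):
#standardSQL
SELECT * FROM MyDatePartitionedTable
WHERE _PARTITIONTIME >= TIMESTAMP('2017-01-01')
AND _PARTITIONTIME < TIMESTAMP('2017-02-01');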
Unfortunately not. BigQuery currently only supports date-partitioned tables.
https://cloud.google.com/bigquery/docs/partitioned-tables
BigQuery offers date-partitioned tables, which means that the table is divided into a separate partition for each date
It seems like this would work:
#standardSQL
CREATE OR REPLACE TABLE `My_Partition_Table`
PARTITION BY event_month
OPTIONS (
description="this is a table partitioned by month"
) AS
SELECT
DATE_TRUNC(DATE(some_event_timestamp), month) as event_month,
*
FROM `TableThatNeedsPartitioning`
For those who run into the error "Too many partitions produced by query, allowed 4000, query produces at least X partitions", due to BigQuery's 4000-partition limit (as of 2023-02), you can do the following:
CREATE OR REPLACE TABLE `My_Partition_Table`
PARTITION BY DATE_TRUNC(date_column, MONTH)
OPTIONS (
description="This is a table partitioned by month"
) AS
-- Your query
Basically, take david-salmela's answer above, but move the DATE_TRUNC part to the PARTITION BY section.
It seems to work exactly like PARTITION BY date_column in terms of querying the table (e.g. WHERE date_column = "2023-02-20"), but my understanding is that, cost-wise, you always retrieve data for a whole month.

SQL Server BETWEEN

I have a table which has Year, Month, and a few numeric columns:
Year Month Total
2011 10 100
2011 11 150
2011 12 100
2012 01 50
2012 02 200
Now, I want to SELECT rows between Nov 2011 and Feb 2012. Note that I want the query to use a range, just as if I had a date column in the table.
Coming up with a way to use BETWEEN with the table as it is will work, but it will perform worse in every case:
At best, it will consume more CPU doing some kind of calculation on the rows instead of working with them as dates.
At worst, it will force a scan of every row in the table. If your columns have indexes, then with the right query a seek is possible, and this can make a HUGE performance difference, because forcing the constraints into a BETWEEN clause prevents the index from being used.
I suggest the following instead if you have an index on your date columns and care at all about performance:
DECLARE
@FromDate date = '20111101',
@ToDate date = '20120201';
SELECT *
FROM dbo.YourTable T
WHERE
(
T.[Year] > Year(@FromDate)
OR (
T.[Year] = Year(@FromDate)
AND T.[Month] >= Month(@FromDate)
)
) AND (
T.[Year] < Year(@ToDate)
OR (
T.[Year] = Year(@ToDate)
AND T.[Month] <= Month(@ToDate)
)
);
However, it is understandable that you don't want to use such a construction, as it is very awkward. So here is a compromise query that at least uses numeric computation and will use less CPU than date-to-string conversion (though not enough less to make up for the forced scan, which is the real performance problem):
SELECT *
FROM dbo.YourTable T
WHERE
T.[Year] * 100 + T.[Month] BETWEEN 201111 AND 201202;
If you have an index on Year, you can get a big boost by submitting the query as follows, which has the opportunity to seek:
SELECT *
FROM dbo.YourTable T
WHERE
T.[Year] * 100 + T.[Month] BETWEEN 201111 AND 201202
AND T.[Year] BETWEEN 2011 AND 2012; -- allows use of an index on [Year]
While this breaks your requirement of using a single BETWEEN expression, it is not too much more painful and will perform very well with the Year index.
You can also change your table. Frankly, using separate numbers for your date parts instead of a single column with a date data type is not good. The reason it isn't good is the exact issue you are facing right now: it is very hard to query.
In some data warehousing scenarios where saving bytes matters a lot, I could envision situations where you might store the date as a number (such as 201111) but that is not recommended. The best solution is to change your table to use dates instead of splitting out the numeric value of the month and the year. Simply store the first day of the month, recognizing that it stands in for the entire month.
If changing the way you use these columns is not an option but you can still change your table, then you can add a persisted computed column:
ALTER TABLE dbo.YourTable
ADD ActualDate AS (DateAdd(year, [Year] - 1900, DateAdd(month, [Month], '18991201')))
PERSISTED;
With this you can just do:
SELECT *
FROM dbo.YourTable
WHERE
ActualDate BETWEEN '20111101' AND '20120201';
The PERSISTED keyword means that while you will still get a scan, it won't have to do any calculation on each row, since the expression is computed on each INSERT or UPDATE and stored in the row. But you can get a seek if you add an index on this column, which will make it perform very well (though all in all, this is still not as ideal as switching to an actual date column, because it takes more space and affects INSERTs and UPDATEs):
CREATE NONCLUSTERED INDEX IX_YourTable_ActualDate ON dbo.YourTable (ActualDate);
Summary: if you truly can't change the table in any way, then you are going to have to make a compromise in some way. It will not be possible to get the simple syntax you want that will also perform well, when your dates are stored split into separate columns.
(Year > @FromYear OR Year = @FromYear AND Month >= @FromMonth)
AND (Year < @ToYear OR Year = @ToYear AND Month <= @ToMonth)
Your example table seems to indicate that there's only one record per year and month (if it's really a summary-by-month table). If that's so, you're likely to accrue very little data in the table even over several decades of activity. The concatenated expression solution will work and performance (in this case) won't be an issue:
SELECT * FROM Table WHERE ((Year * 100) + Month) BETWEEN 201111 AND 201202
If that's not the case and you really have a large number of records in the table (more than a few thousand), you have a couple of choices:
Change your table to store year and month in the format YYYYMM (either as an integer value or as text). This column can replace your current Year and Month columns or be in addition to them (although this breaks normal form). Index this column and query against it.
Create a separate table with one record per year and month, plus the indexable YYYYMM column described above. In your query, JOIN this table back to the source table and run your range predicate against the indexed column in the smaller table, as sketched below.
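A sketch of the second option (table and column names here are illustrative):
-- One row per year/month; small enough to index and join cheaply
CREATE TABLE dbo.YearMonthLookup (
[Year] int NOT NULL,
[Month] int NOT NULL,
YearMonth int NOT NULL, -- YYYYMM, e.g. 201111
PRIMARY KEY ([Year], [Month])
);
SELECT t.*
FROM dbo.YourTable t
JOIN dbo.YearMonthLookup ym
ON ym.[Year] = t.[Year] AND ym.[Month] = t.[Month]
WHERE ym.YearMonth BETWEEN 201111 AND 201202;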