Oracle - How to best implement partitioning with a conditional clause - sql

Our system has many tables that require partitioning to support data maintenance. Let's talk about one table to simplify the question. If the data in a table hits 100GB, the OLTP system starts to slow down. We recommend that customers move the data from the OLTP system to the OLAP system. We use partitioning by year or month (based on data insertion rates) to facilitate this move.
Here is a sample of a table definition:
create table myPartionedTable
(
object_id number ,
object_type varchar2(18),
RETIREDTIMESTAMP timestamp
)
partition by range (RETIREDTIMESTAMP)
(
partition WM_2010 values less than(TO_DATE('01/01/2011','MM/DD/YYYY')),
partition WM_2011 values less than(TO_DATE('01/01/2012','MM/DD/YYYY')),
partition WM_2012 values less than(TO_DATE('01/01/2013','MM/DD/YYYY')),
partition WM_2013 values less than(TO_DATE('01/01/2014','MM/DD/YYYY')),
partition WM_2014 values less than(TO_DATE('01/01/2015','MM/DD/YYYY')),
partition WM_ACTIVE values less than(MAXVALUE)
)
tablespace MYDATE;
The important point is that the data needs to be retained in the WM_ACTIVE partition till the data is deemed RETIRED. Once retired, the data moves to the appropriate partition and is then eligible for PARTITION_MOVE out of OLTP and into OLAP.
Is this a good approach? Is there a better approach for managing this list of requirements?

Interval partitioning may help. Oracle can automatically create partitions as needed.
There may not be a need to worry about active vs. inactive since it's easy to use the smallest range for all partitions.
create table myPartionedTable
(
object_id number ,
object_type varchar2(18),
RETIREDTIMESTAMP timestamp
)
partition by range (RETIREDTIMESTAMP) INTERVAL(NUMTOYMINTERVAL(1, 'MONTH'))
(
--You still need to specify the lowest possible partition.
partition WM_2010 values less than(date '2011-01-01')
);
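To make the active-versus-retired flow from the question concrete, here is a minimal sketch against the original range-partitioned table (the one with the WM_ACTIVE MAXVALUE partition). It assumes active rows carry a NULL RETIREDTIMESTAMP, which Oracle range partitioning places in the MAXVALUE partition. The object_id value and the staging table name WM_2014_STAGE are hypothetical.

```sql
-- Allow Oracle to relocate a row when its partition key changes.
-- Without this, the UPDATE below would fail with ORA-14402.
ALTER TABLE myPartionedTable ENABLE ROW MOVEMENT;

-- Retiring a row: the key goes from NULL (WM_ACTIVE) to a real timestamp,
-- so the row migrates to the partition covering that timestamp (WM_2014).
UPDATE myPartionedTable
   SET RETIREDTIMESTAMP = TIMESTAMP '2014-06-30 12:00:00'
 WHERE object_id = 42;  -- hypothetical id

-- Hypothetical empty staging table with the same column layout.
CREATE TABLE WM_2014_STAGE AS
SELECT * FROM myPartionedTable WHERE 1 = 0;

-- Swap the retired partition's data with the staging table.
-- This is a dictionary operation; the rows are not copied.
ALTER TABLE myPartionedTable
  EXCHANGE PARTITION WM_2014 WITH TABLE WM_2014_STAGE;
```

The staging table can then be shipped to the OLAP side. As far as I know, this pattern does not carry over unchanged to the interval-partitioned version: an interval-partitioned table cannot have a MAXVALUE partition and rejects NULL partition keys, so active rows would need a different home there.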

Related

BigQuery - Create view with Partition but base table doesn't have

This may sound crazy, but I want to implement something like having a view with a partition.
Background:
I have a very large table that is date-partitioned on a column. We run data ingestion into this table every 2 minutes. All the data loads are append-only. Every load inserts 10k+ rows. After some time, we hit the partition modification quota:
message: "Quota exceeded: Your table exceeded quota for Number of partition modifications to a column partitioned table. For more information, see https://cloud.google.com/bigquery/troubleshooting-errors"
Root cause:(from GCP support team)
The root cause under the hood was that due to your partitioned tables
have pretty granular partition for instance by minutes, hours or date,
when the loaded data cover a wide range of partition period, the
number of partition get modified will be high and above 4000. As per
internal documentation, it was suggested the user who ran into this
issue to consider making a less granular partition for instance change
a date/hour/minute based partitioned table to a week based partitioned
table. Alternatively split the load to multiple and hence limit the
data range to cover less number of partitions that would be affected.
This is the best recommendation I could have now.
So I'm planning to keep this table unpartitioned and create a view on top of it (we need a view to eliminate duplicates), and the view should be partitioned. Is this possible? Or is there an alternative solution?
You can't partition a view; it is not physically materialized. Partitioning by day can run into the 4000-partition limit. Would partitioning by year work? If so, you can use integer-range partitioning:
create or replace table BI.test
PARTITION BY RANGE_BUCKET(Year, GENERATE_ARRAY(2000, 3000, 1)) as
select 2000 as Year, 1 as value
union all
select 2001 as Year, 1 as value
union all
select 2002 as Year, 1 as value
Alternatively, I've partitioned on an integer month (YYYYMM) or week (YYYYWW) column, which gets you around 40 years:
RANGE_BUCKET(monthasintegerfield, GENERATE_ARRAY(201612, 205712, 1))
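As a possible alternative to integer YYYYMM buckets, BigQuery's time-unit column partitioning also accepts MONTH granularity, and the deduplication can live in an ordinary (unpartitioned) view on top. This is only a sketch: the dataset mydataset, the table events, and the columns id, event_ts and payload are hypothetical, and it assumes "duplicate" means rows sharing the same id.

```sql
-- Monthly partitions: a load spanning many days touches far fewer partitions
-- than with daily partitioning.
CREATE TABLE mydataset.events (
  id       INT64,
  event_ts TIMESTAMP,
  payload  STRING
)
PARTITION BY TIMESTAMP_TRUNC(event_ts, MONTH);

-- Deduplicating view: keep only the latest row per id.
CREATE OR REPLACE VIEW mydataset.events_dedup AS
SELECT * EXCEPT (rn)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY event_ts DESC) AS rn
  FROM mydataset.events
)
WHERE rn = 1;
```

The view itself is not partitioned; it simply reads from the partitioned base table.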

Mariadb Scans all partitions in timestamp column

I have a table Partitioned by:
HASH(timestamp DIV 43200 )
When I perform this query
SELECT max(id)
FROM messages
WHERE timestamp BETWEEN 1581708508 AND 1581708807
it scans all partitions, even though 1581708508, 1581708807 and every value between them fall in the same partition. How can I make it scan only that partition?
You have discovered one of the reasons why PARTITION BY HASH is useless.
In your situation, the Optimizer sees the range (BETWEEN) and says "punt, I'll just scan all the partitions".
That is, "partition pruning" does not work when the WHERE clause involves a range and you are using PARTITION BY HASH. PARTITION BY RANGE, on the other hand, may be able to prune. But... What's the advantage? It does not make the query any faster.
I have found only four uses for partitioning: http://mysql.rjweb.org/doc.php/partitionmaint . It sounds like your application does not fit any of those cases.
That particular query would best be done without partitioning. Instead have a non-partitioned table with this 'composite' index:
INDEX(timestamp, id)
It must scan all the rows in the timestamp range to discover the MAX(id), but with this index, it is:
Scanning only the 2-column index.
Not touching any rows outside the timestamp range.
Hence it will be as fast as possible. Even if PARTITION BY HASH were smart enough to do the desired pruning, it would not run any faster.
In particular, when you ask for a range on the partition key, such as with WHERE timestamp BETWEEN 1581708508 AND 1581708807, the execution looks in all partitions for the desired rows. This is one of the major failings of HASH. Even if it could realize that only one partition is needed, it would be no faster than simply using the index I suggest.
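The non-partitioned layout suggested above could look like the following sketch. The column types and the index name idx_ts_id are assumptions, since the original table definition is not shown.

```sql
-- Non-partitioned table with the composite index (timestamp, id).
CREATE TABLE messages (
  id          INT UNSIGNED NOT NULL AUTO_INCREMENT,
  `timestamp` INT UNSIGNED NOT NULL,
  PRIMARY KEY (id),
  INDEX idx_ts_id (`timestamp`, id)
);

-- The query can be answered from the index alone: it reads only the index
-- entries whose timestamp lies inside the requested range.
EXPLAIN
SELECT MAX(id)
FROM messages
WHERE `timestamp` BETWEEN 1581708508 AND 1581708807;
```

If the plan is as expected, EXPLAIN should report the idx_ts_id key with "Using index" in the Extra column.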
You can determine the individual partition with modular arithmetic:
MOD(<expression passed to HASH>, <number of partitions>)
Assuming you have 2 partitions:
CREATE TABLE messages(ID int, timestamp int)
PARTITION BY HASH( timestamp DIV 43200 )
PARTITIONS 2;
look up partition names by
SELECT CONCAT( 'p',MOD(timestamp DIV 43200,2)) AS partition_name, timestamp
FROM messages;
and determine the partition name for the timestamp value 1581708508 (assume p1). Then use
SELECT MAX(id)
FROM messages PARTITION(p1)
WHERE timestamp BETWEEN 1581708508 AND 1581708807
to read only partition p1. Keep the WHERE condition: p1 also holds rows from other 12-hour buckets, so dropping it would return the maximum id of the whole partition rather than of the requested range.
By the way, all partitions can be listed with
SELECT *
FROM INFORMATION_SCHEMA.PARTITIONS
WHERE table_name='messages'
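The arithmetic behind the partition lookup can be checked directly. This sketch assumes the 2-partition table defined above and MariaDB's default partition names p0 and p1.

```sql
-- Both endpoints of the range fall into the same 12-hour bucket,
-- and that bucket maps to partition number 1 when there are 2 partitions.
SELECT 1581708508 DIV 43200         AS lo_bucket,      -- 36613
       1581708807 DIV 43200         AS hi_bucket,      -- 36613
       MOD(1581708508 DIV 43200, 2) AS partition_no;   -- 1, i.e. p1

-- Explicit partition selection limits the scan to p1; the range predicate
-- still filters out rows from other buckets stored in the same partition.
SELECT MAX(id)
FROM messages PARTITION (p1)
WHERE timestamp BETWEEN 1581708508 AND 1581708807;
```

The partition number has to be recomputed whenever the range changes, and a range that crosses a bucket boundary needs more than one partition in the PARTITION (...) list.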

Creating a daily Oracle partition

I am trying to create a new Oracle partition for a table every day.
ALTER TABLE TAB_123 ADD PARTITION PART_9999 VALUES LESS THAN ('0001') TABLESPACE TS_1
Here I am getting an error because the boundary value 0001 is lower than the upper bound of the existing last partition.
You can have Oracle create partitions automatically by adding the INTERVAL clause to PARTITION BY RANGE (interval partitioning).
Sample DDL, assuming that the partition key is column my_date_column :
create table TAB_123
( ... )
partition by range(my_date_column) interval(/*numtoyminterval*/ NUMTODSINTERVAL(1,'day'))
( partition p_first values less than (to_date('2010-01-01', 'yyyy-mm-dd')) tablespace ts_1)
;
With this setup in place, Oracle will create a partition on the fly, if needed, when you insert data into the table. An interval-partitioned table must define at least one initial range partition, as shown above (p_first); rows below that boundary are stored there.
This naming convention (last digit of year plus day number) won't support holding more than ten years worth of data. Maybe you think that doesn't matter but I know databases which are well into their second decade. Be optimistic!
Also, that key is pretty much useless for querying. Most queries against partitioned tables want to get the benefit of partition elimination. But that only works if the query filters on the same value as the partition key. Developers really won't want to be casting a date to YDDD format every time they write a select on the table.
So. Use an actual date for defining the partition key and hence range. Also for naming the partition if it matters that much.
ALTER TABLE TAB_123
ADD PARTITION P20200101 VALUES LESS THAN (date '2020-01-02') TABLESPACE TS_1
/
Note that the range is defined by less than the next day. Otherwise the date of the partition name won't align with the date of the records in the actual partition.
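With a real date as the partition key, checking the partitions and querying them stays straightforward. The sketch below assumes the table TAB_123 is partitioned on a DATE column named my_date_column, as in the earlier sample DDL.

```sql
-- List the partitions that exist, with their upper bounds.
SELECT partition_name, high_value
  FROM user_tab_partitions
 WHERE table_name = 'TAB_123'
 ORDER BY partition_position;

-- A predicate on the raw date column lets the optimizer prune
-- to the single daily partition.
SELECT COUNT(*)
  FROM TAB_123
 WHERE my_date_column >= DATE '2020-01-01'
   AND my_date_column <  DATE '2020-01-02';

-- Address a partition by a value it contains, without knowing its name.
SELECT COUNT(*)
  FROM TAB_123 PARTITION FOR (DATE '2020-01-01');
```

PARTITION FOR is handy with interval partitioning, where Oracle generates the partition names itself.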

BigQuery - What is the difference between Wildcard and partitioned tables [duplicate]

I am trying to understand whether there is a difference in BigQuery (in cost or in query capabilities, for example) between:
Create one table per day (like my_table_2018_02_06)
Create a time partitioned table (my-table with time partition by day).
Thanks !
Short explanation: querying multiple tables with Wildcard Tables was the proposed alternative back when BigQuery had no partitioning mechanism. The natural evolution was to add Partitioned Tables, and there is currently an alpha release of column-based time partitioning, i.e. letting the user define which column (of DATE or TIMESTAMP type) is used for the partitioning.
Since BigQuery engineers are currently adding new features to table partitioning rather than to the legacy Wildcard Tables approach, I'd suggest that you work with Partitioned Tables.
Long explanation: you are comparing two approaches that in fact are used with the same purpose, but which have different implications:
Wildcard Tables: some time ago, when table partitioning was not yet supported by BigQuery, Wildcard Tables were the way to query multiple tables using concise SQL queries. A Wildcard Table represents the union of all the tables that match the wildcard expression specified in the SQL statement. However, Wildcard Tables have some limitations, such as:
Do not support views.
Do not support cached results (queries containing wildcard tables are billed every time they are run, even if the "cached results" option is checked).
Only work with native BigQuery storage (cannot work with external tables [Bigtable, Storage or Drive]).
Only available in standard SQL.
Partitioned Tables: these are unique tables that are divided into segments, split by date. There is a lot of documentation regarding how to work with Partitioned Tables, and regarding the pricing, each partition in a Partitioned Table is considered an independent entity, so if a partition was not updated for the last 90 days, this data will be considered long-term and therefore will be billed with the appropriate discount (as would happen with a normal table). Finally, Partitioned Tables are here to stay, so there are more incoming features to them, such as column-based partitioning, which is currently in alpha, and you can follow its status in this Public Issue Tracker post. On the other hand, there are also some current limitations to be considered:
Maximum of 2500 partitions per Partitioned Table.
Maximum of 2000 partition updates per table per day.
Maximum of 50 partition updates every 10 seconds.
So in general, it would be advisable to work with Partitioned Tables over multiple tables using Wildcard Tables. However, you should always consider your use case and see which one of the possibilities meets your requirements better.
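For comparison, reading one day of data under each approach might look like the sketch below. The dataset name mydataset and the columns event_date and value are hypothetical; the table names follow the question.

```sql
-- Option 1: one table per day (my_table_2018_02_06, ...), read through a wildcard.
SELECT *
FROM `mydataset.my_table_*`
WHERE _TABLE_SUFFIX = '2018_02_06';

-- Option 2: a single table partitioned by day on a DATE column.
CREATE TABLE mydataset.my_table (
  event_date DATE,
  value      INT64
)
PARTITION BY event_date;

-- The filter on the partitioning column limits the scan to one partition.
SELECT *
FROM mydataset.my_table
WHERE event_date = DATE '2018-02-06';
```

In both cases the filter is what keeps the scanned bytes down; without it, the query reads every daily table or every partition.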
One thing to add to your decision criteria here is caching and usage of legacy vs standard SQL.
Since the syntax in standard SQL for selecting multiple tables uses a wild card there is no way for the query result to be cached.
Interestingly, the query result would have been cached if legacy SQL was used. Just converting the query to standard SQL would disable caching.
This may be important to consider, at least in some cases more than others.
Thank you,
Hazem
Not exactly a time partition, but one can get the best of both worlds: wildcard "partitions" and real partitions to slice the data even further. Below is an example where we first use the date suffix to select only the table holding data from a particular date, then use actual partitioning within the table to limit the amount of data scanned even further.
Create the first partitioned table with a date suffix
CREATE TABLE `test_2021-01-05` (x INT64, y INT64)
PARTITION BY RANGE_BUCKET(y, GENERATE_ARRAY(0, 500, 1));
insert `test_2021-01-05` (x,y) values (5,1);
insert `test_2021-01-05` (x,y) values (5,2);
insert `test_2021-01-05` (x,y) values (5,3);
Create the second partitioned table with a date suffix
CREATE TABLE `test_2021-01-04` (x INT64, y INT64)
PARTITION BY RANGE_BUCKET(y, GENERATE_ARRAY(0, 500, 1));
insert `test_2021-01-04` (x,y) values (4,1);
insert `test_2021-01-04` (x,y) values (4,2);
Select all the data from both tables using wildcard notation, 80B of data is the whole test set
select * from `test_*`
-- 80B, all the data
Just select data from one table, which is like partitioning on date
select * from `test_*`
where _TABLE_SUFFIX = "2021-01-05"
-- 48B
Select data both from one table(where I am interested in one date) and only from one partition
select * from `test_*`
where _TABLE_SUFFIX = "2021-01-05"
and y = 1
-- 16B, that was the goal
Select data just from one partition from all the tables
select * from `test_*`
where y = 1
-- 32B, only one partition from both tables
The ultimate goal was to limit the data scanned when reading, thus reducing the cost and increasing performance.
