Partition bigquery table with more than 4000 days of data? - google-bigquery

I have about 11 years of data in a bunch of Avro files. I wanted to partition by the date of each row, but from the documentation it appears I can't because there are too many distinct dates?
Does clustering help here? The natural cluster key for my data would still have some values with data spanning more than 4,000 days.

Two solutions I see:
1)
Combine table sharding (per year) with time partitioning based on your column. I never tested that myself, but it should work, as every shard is seen as a separate table in BQ.
With that you can easily address the shard plus the partition with one wildcard/variable, as sketched below.
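A minimal sketch of such a query, assuming hypothetical yearly shards named events_2009 through events_2019, each time-partitioned on an event_date column:

SELECT *
FROM `my_project.my_dataset.events_*`
WHERE _TABLE_SUFFIX = '2015'       -- pick the yearly shard
  AND event_date = '2015-06-01';   -- prune to one partition inside it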
2)
A good workaround is to create an extra column derived from the date field you want to partition on.
For every entry older than 9 years (eg: DATE_DIFF(current_date(), DATE('2009-01-01'), YEAR)), round the date down to the 1st of its month.
With that you gain room for roughly another 29 years of data.
Be aware that you cannot filter on that column with a date filter, eg in Data Studio. But for queries it works.
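A minimal BigQuery sketch of that derived column, assuming a hypothetical event_date column and source table (the 9-year cutoff follows the example above):

SELECT
  *,
  CASE
    WHEN DATE_DIFF(CURRENT_DATE(), event_date, YEAR) > 9
      THEN DATE_TRUNC(event_date, MONTH)   -- old rows collapse to the 1st of their month
    ELSE event_date                        -- recent rows keep daily granularity
  END AS partition_date
FROM `my_project.my_dataset.events_source`;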
Best, Thomas

Currently, as per the docs, clustering is supported for partitioned tables only. In the future it might support non-partitioned tables.
You can put all old data for a given year into a single partition.
You need to add an extra column to your table to partition on.
Say, all data for year 2011 will go to partition 20110101.
For newer data (2019) you can have a separate partition for each date.
This is not a clean solution to the problem, but with it you can optimize further by using clustering to keep table scans minimal.
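A hedged BigQuery sketch of that mapping, with hypothetical names; rows before 2019 collapse to January 1st of their year (eg 20110101), newer rows keep their exact date:

SELECT
  *,
  IF(event_date < DATE '2019-01-01',
     DATE_TRUNC(event_date, YEAR),   -- eg 2011-04-17 -> 2011-01-01
     event_date) AS partition_date
FROM `my_project.my_dataset.events_source`;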

4,000 daily partitions is just over 10 years of data. If you require a 'table' with more than 10 years of data, one workaround is to use a view:
- Split your table into decades, ensuring all tables are partitioned on the same field and have the same schema
- Union the tables together in a BigQuery view
This results in a view over 4,000+ partitions that business users can query without worrying about which version of a table they need to use or unioning the tables themselves.
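A hedged sketch of such a view, assuming two hypothetical decade tables with identical schemas, both partitioned on the same date column:

CREATE VIEW `my_project.my_dataset.events_all` AS
SELECT * FROM `my_project.my_dataset.events_2000s`
UNION ALL
SELECT * FROM `my_project.my_dataset.events_2010s`;

A filter on the partition column in queries against the view should still prune partitions in each underlying table.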

It might make sense to partition by week/month/year instead of day - depending on how much data you have per day.
In that case, see:
Partition by week/year/month to get over the partition limit?
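As a hedged aside, current BigQuery supports coarser time-unit column partitioning directly; a minimal sketch with hypothetical names:

CREATE TABLE `my_project.my_dataset.events_monthly`
PARTITION BY DATE_TRUNC(event_date, MONTH) AS
SELECT * FROM `my_project.my_dataset.events_source`;

At monthly granularity, 11 years of data is only around 132 partitions, comfortably under the limit.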

Related

Is there a performant way to search by a non-partitioned column in crateDB?

My team and I have been using Crate for one of our projects over the past few years. We have a table with hundreds of millions of records, and performance is key.
As we've developed more and more features on this project, we've run into an interesting problem. We have a column on this table labeled 'persist_date', which is when the record actually got persisted into the table. These dates may not always align, and we could have a start_date of 2021-06-21 with a persist_date of 2021-10-14.
All of our queries up to this point have easily been able to filter on the start_date partition. Now we are encountering a problem that requires us to query against a non-partitioned column (persist_date).
As I understand it, crateDB is really performant, but only when you query against one specific partition at a time. My question now is how would I go about creating a partition for this other date column without duplicating my data? Is there anything other than a partition that might help, like the way the table is clustered?
You could use both columns as partition values.
e.g.
CREATE TABLE two_parted (a TEXT, b TEXT, val DOUBLE) PARTITIONED BY (a,b);
If either a or b is used in a selection, this limits queries to the shards that hold that value. However, this could lead to more shards, so you might want to partition not on a daily, but on a weekly or monthly basis.
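A hedged CrateDB sketch of that, using generated columns to partition both dates at monthly granularity (table and column names are hypothetical):

CREATE TABLE events (
  start_date TIMESTAMP,
  persist_date TIMESTAMP,
  -- generated partition keys at monthly granularity to keep the shard count down
  start_month TIMESTAMP GENERATED ALWAYS AS date_trunc('month', start_date),
  persist_month TIMESTAMP GENERATED ALWAYS AS date_trunc('month', persist_date),
  payload TEXT
) PARTITIONED BY (start_month, persist_month);

A WHERE clause on start_month or persist_month then restricts the query to the matching partitions.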

What's the cost of scan including column that has been added recently?

Assuming I have an old table with a lot of data. Two columns are there: user_id, which has existed from the very beginning, and data, which was added very recently, say a week ago. My goal is to join this table on user_id but retrieve only the newly created column data. Could it be the case that, because the data column didn't exist until recently, there is no need to scan the whole user_id range and therefore the query would be cheaper? How is the price calculated for such an operation?
According to the documentation there are 2 pricing models for queries:
On-demand pricing
Flat-rate pricing
Assuming you use on-demand pricing, you will only be billed for the number of bytes processed; you can check how data size is calculated here. In that sense the answer would be: yes, scanning only part of user_id would be cheaper. But reading through the documentation you'll find this sentence:
When you run a query, you're charged according to the data processed in the columns you select, even if you set an explicit LIMIT on the results.
So probably the best solution would be to create another table containing only the data that has to be processed and run the query against that.
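For illustration, a hedged sketch of such a join with hypothetical names; on-demand billing counts the stored bytes of every column the query references (here user_id and data across all rows of the big table), regardless of how many rows come back:

SELECT b.user_id, b.data
FROM `my_project.my_dataset.big_table` AS b
JOIN `my_project.my_dataset.recent_users` AS r
  ON b.user_id = r.user_id;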

I need to partition a huge table and truncate the partition on daily basis

I have a table tblCallDataStore which receives a flow of 2 million records daily. Each day, I need to delete any record older than 48 hours. If I create a delete job, it runs for 13 hours or sometimes more. What is a feasible way to partition the table and truncate the partition?
How may I do that?
I have a Receivedate column and I want to partition on it.
Unfortunately, you will need to bite the one-time bullet and add partitioning to your table (hint: base it on the DATEPART(DAY, ...) of the timestamp you are keying off of, so you can quickly compute it as your @PartitionID variable later). After that, you can use the SWITCH PARTITION statement to move all rows of the specified partition to another table. You can then quickly TRUNCATE the receiving table, since you don't care about the data anymore.
There are a couple of tricky setup steps in this process: Managing the partition IDs themselves, and ensuring the receiving table has an identical structure (and various other requirements, such as not being able to refer to your partitioned table with foreign keys (umm... ouch)). Once you have the partition ID you want, the code is simple:
ALTER TABLE [x] SWITCH PARTITION @PartitionID TO [x_copy]
I have been meaning to write a stored procedure to automatically maintain the receiving tables, since at work we have just been manually updating both tables in tandem whenever we make a change (e.g. adding a new column or index). To manually do it, basically just copy the table definition and append something to all the entity names (e.g. just put the number 2 on the end).
P.S. Maybe you want to use hourly partitioning instead, then you have much more control over the "48 hours" requirement, and also have the flexibility to handle time zones (ugh). You'd just have to be more clever since DATEPART(HOUR, ...) obviously won't work directly. Also note that your partition IDs must be defined as a range, so eventually you will end up cycling through them, so keep that in mind.
P.P.S. As noted in the comments by David Browne, partitioning was an enterprise-only feature until SQL 2016 SP1.
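A hedged T-SQL sketch of the pieces described above, with hypothetical object names; tblCallDataStore itself must be created on the partition scheme, and tblCallDataStore_Switch must have an identical structure and live on the same filegroup as the partition being switched out:

-- one-time setup: daily boundaries on Receivedate (extend/maintain these over time)
CREATE PARTITION FUNCTION pf_CallData (datetime)
    AS RANGE RIGHT FOR VALUES ('2023-01-01', '2023-01-02', '2023-01-03');
CREATE PARTITION SCHEME ps_CallData
    AS PARTITION pf_CallData ALL TO ([PRIMARY]);

-- daily purge: pick a partition that is entirely older than 48 hours,
-- switch it out, then truncate the receiving table
DECLARE @PartitionID int = $PARTITION.pf_CallData(DATEADD(DAY, -3, GETDATE()));
ALTER TABLE tblCallDataStore
    SWITCH PARTITION @PartitionID TO tblCallDataStore_Switch;
TRUNCATE TABLE tblCallDataStore_Switch;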

Creative use of date partitions

I have some data that I would like to partition by date, and also partition by an internally-defined client id.
Currently, we store this data using the table-per-date model. It works well, but querying individual client ids is slow and expensive.
We have considered creating a table per client id, and using date partitioning within those tables. The only issue here is that it would force us to incur thousands of load jobs per day, and also to have the data partitioned by client id in advance.
Here is a potential solution I came up with:
- Stick with the table-per-date approach (eg log_20170110)
- Create a dummy date column which we use as the partition date, and set that date's year to the client id (eg for client id 1235, set _PARTITIONTIME to 1235-01-01)
This would allow us to load data per-day, as we do now, would give us partitioning by date, and would leverage the date partitioning functionality to partition by client id. Can you see anything wrong with this approach? Will BigQuery allow us to store data for the year 200, or the year 5000?
PS: We could also use a scheme that pushes the dates to post-zero-unixtime, eg add 2000 to the year, or push the last two digits to the month and day, eg 1235 => 2012-03-05.
Will BigQuery allow us to store data for the year 200, or the year 5000?
Yes, any date between 0001-01-01 and 9999-12-31.
So formally speaking this is an option (and btw it depends on how many clients you plan to have / already have).
See more about same idea at https://stackoverflow.com/a/41091896/5221944
Meantime, I would expect BigQuery to have soon ability to partition by arbitrary field. Maybe at NEXT 2017 - just guessing :o)
The suggested idea is likely to create some performance issues for queries as the number of partitions increases. Generally speaking, date partitioning works well with up to a few thousand partitions.
client_ids are generally unrelated to each other and are ideal for hashing. While we work towards supporting richer partitioning flavors, one option is to hash your client_ids into N buckets (~100?) and have N partitioned tables. That way you can query across your N tables for a given date. Using, for example, 100 tables would drop the cost down to 1% of what it would be using 1 table with all the client_ids. It should also scan a small number of partitions, improving performance accordingly. Unfortunately, this approach doesn't address the concern of putting the client ids in the right table (that has to be managed by you).
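For illustration, a hedged BigQuery sketch of the bucketing step; the hash function, bucket count, and staging table name are my own assumptions, not part of the answer:

SELECT
  *,
  -- stable hash of client_id into 100 buckets; each row is then routed to the
  -- date-partitioned table for its bucket, eg log_bucket_17
  MOD(ABS(FARM_FINGERPRINT(CAST(client_id AS STRING))), 100) AS bucket
FROM `my_project.my_dataset.daily_load`;

To query one client for one date, hash the client_id the same way, pick the single bucket table, and filter on its date partition.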

ALTER PARTITION FUNCTION to include 1.5TB worth of data for a quick switch

I inherited an unmaintained database in which the partition function was set on a date field and expired on the first of the year. The data is largely historic and I can control the jobs that import new data into this table.
My question relates to setting up or altering partitioning to include so much data, roughly 1.5 TB counting indexes. This is on a live system and I don't know what kind of impact it will have with so many users connecting to it at once. I will test this on a non-prod system, but then I can't get real usage load on it. My alternative solution was to kill all the users hitting the DB, quickly rename the table, and swap in a table that does have a proper partitioning scheme.
I wanted to:
- Keep the same partition function but extend it: keep all 2011 data up to a certain date (let's say Nov 22nd 2011) in one partition; all data coming in after that needs to go into its own new partitions
- Do a quick switch of the specific partition which has the full year's worth of data
Anyone know if altering a partition on a live system to include a new partition for a full year's worth of data, roughly 5-6 billion records and 1.5 TB, is plausible? Any pitfalls? I will share my test results once I complete them, but I'd welcome any input. Thanks!
Partition switches are a metadata-only operation, and the size of the partition switched in or out does not matter; it can be 1 KB or 1 TB and it takes exactly the same amount of time (ie. very fast).
However, what you're describing is not a partition switch operation but a partition split: you want to split the last partition of the table into two partitions, one containing all the existing data and a new, empty one. Splitting a partition has to physically split the data, and unfortunately this is an offline, size-of-data operation.
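For reference, a hedged T-SQL sketch of such a split with hypothetical names. A split is metadata-only when the new boundary falls into an empty partition; here the last partition already holds the 2011 data, so the split physically moves rows, which is the offline, size-of-data operation described above:

-- tell the scheme which filegroup the new partition should use, then split
ALTER PARTITION SCHEME ps_History NEXT USED [PRIMARY];
ALTER PARTITION FUNCTION pf_History() SPLIT RANGE ('2011-11-22');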