BigQuery - Create view with Partition but base table doesn't have

BigQuery - Create view with Partition but base table doesn't have - google-bigquery

This may sound crazy, but I want to implement something like having a view with a partition.
Background:
I had a table with a date partition on a column which is really huge in size. We are running data ingestion to this table at every 2mins interval. All the data loads are append-only. Ever load will insert 10k+ rows. After some time, we encountered the partition limitation issue.
message: "Quota exceeded: Your table exceeded quota for Number of partition modifications to a column partitioned table. For more information, see https://cloud.google.com/bigquery/troubleshooting-errors"
Root cause:(from GCP support team)
The root cause under the hood was that due to your partitioned tables
have pretty granular partition for instance by minutes, hours or date,
when the loaded data cover a wide range of partition period, the
number of partition get modified will be high and above 4000. As per
internal documentation, it was suggested the user who ran into this
issue to consider making a less granular partition for instance change
a date/hour/minute based partitioned table to a week based partitioned
table. Alternatively split the load to multiple and hence limit the
data range to cover less number of partitions that would be affected.
This is the best recommendation I could have now.
So I'm planning to keep this table as un-partitioned and create a view(we need a view for eliminating the duplicates) and it should have parition. Is this possible? or any other alternate solution for this?

You can't partition a view, it's not physically materialized. Partitioning on day can be limiting with the 4000 limit, would year work? then you can use an integer partition:
create or replace table BI.test
PARTITION BY RANGE_BUCKET(Year, GENERATE_ARRAY(2000, 3000, 1)) as
select 2000 as Year, 1 as value
union all
select 2001 as Year, 1 as value
union all
select 2002 as Year, 1 as value
Alternatively, I've used month (YYYYMM) or week (YYYYWW) to integer partition by which gets you around 40 years:
RANGE_BUCKET(monthasintegerfield, GENERATE_ARRAY(201612, 205712, 1))

Related

Specify the partition # based on date range for that pkey value

We have a DW query that needs to extract data from a very large table around 10 TB which is partitioned by datetime column lets say time to purge data based on this column everyday. So my understanding is each partition has worth a day of data. From storage (SSMS GUI) tab I see # of partitions is 1995.
There is no clustered index on this table as its mostly intended for write operations. Just a design by vendor.
SELECT
a.*
FROM dbo.VLTB AS a
CROSS APPLY
(
VALUES($PARTITION.a_func(a.time))
) AS c (pid)
WHERE c.pid = 1896;
Currently query submitted is as
SELECT * from dbo.VLTB
WHERE time >= convert(datetime,'20210601',112)
AND time < convert(datetime,'20210602',112)
So replacing inequality predicates with equality to look in that days specific partition might help. Users via app can control dates when sending but how will they manage if we want them to use partition # as per first query
Question
How do I find a way in above query to find partition number for that day rather than me inserting like for 06/01 I had to give 1896 part#. Is there a better way to have script find the part# to avoid all partitions being scanned and can insert correct part# in where clause query?
Thank you

Creating a daily Oracle partition

Creating oracle partition for a table for the every day.
ALTER TABLE TAB_123 ADD PARTITION PART_9999 VALUES LESS THAN ('0001') TABLESPACE TS_1
Here I am getting error because value is decreased as 0001 as lower boundary.

You can have Oracle automatically create partitions by using the PARTITION BY RANGE option.
Sample DDL, assuming that the partition key is column my_date_column :
create table TAB_123
( ... )
partition by range(my_date_column) interval(/*numtoyminterval*/ NUMTODSINTERVAL(1,'day'))
( partition p_first values less than (to_date('2010-01-01', 'yyyy-mm-dd')) tablespace ts_1)
;
With this set up in place, Oracle will, if needed, create a partition on the fly when you insert data into the table. It is also usually a good idea to create a default partition, as shown above.

This naming convention (last digit of year plus day number) won't support holding more than ten years worth of data. Maybe you think that doesn't matter but I know databases which are well into their second decade. Be optimistic!
Also, that key is pretty much useless for querying. Most queries against partitioned tables want to get the benefit of partition elimination. But that only' works if the query uses the same value as the partition key. Developers really won't want to be casting a date to YDDD format every time they write a select on the table.
So. Use an actual date for defining the partition key and hence range. Also for naming the partition if it matters that much.
ALTER TABLE TAB_123
ADD PARTITION P20200101 VALUES LESS THAN (date '2020-01-02') TABLESPACE TS_1
/
Note that the range is defined by less than the next day. Otherwise the date of the partition name won't align with the date of the records in the actual partition.

Does deleting and creating a table in bigquery renews the quota limit per day?

I am creating a data pipeline which writes the data into bigquery table every minute and eventually exceeds the quota limit. Does deleting the table after a few hours and then creating it again will renew the quota limit of that table?
I'm using Python API of bigquery to achieve this task.
Need to update the same table in bigquery without exceeding the quota limit.

As per BQ documents, it imposes an upper-bound limit of 1,000 updates per table per day.
I think you have to "engineer" ways to get around your frequency of updates to a table. There are some very obvious ways around this (which are also pretty standard industry practices) and then there are some tricks. Here is what I can think from top of my head:
You can choose to update your target table (overwrite) less frequently.
You can a compose a new table name to be valid only for updates coming in for a certain time interval during the day (example: between 2-3 AM, let your pipeline write query results to table mydataset.my_table_[date]_02_03). Then, at the time of querying, you can just use wildcard statements like:
select count(*) as cnt from `mydataset.my_table_[date]_*`
Which is equivalent to:
select count(*) as cnt from (
select * from (
select * from `mydataset.my_table_[date]_00_01`
)
union all
select * from (
select * from `mydataset.my_table_[date]_01_02`
)
union all
....
)
In this, however, make sure you are always "appending" (not overwriting) data to the table corresponding to the hour of the day. Also, not to forget, you can always take decent advantage of BQ's date partitioned tables to achieve similar results.
Hope this helps.

Wilcard on day table vs time partition

I try to understand if there is a difference in big query (in the cost or possibility of requesting for example) between :
Create one table per day (like my_table_2018_02_06)
Create a time partitioned table (my-table with time partition by day).
Thanks !

Short explanation: querying multiple tables using Wildcard Tables was the proposed alternative for when BigQuery did not have a partition mechanism available. The natural evolution was to include the feature of Partitioned Table, and currently there is an alpha release consisting in column-based time partitioning, i.e. letting the user define which column (having a DATE or TIMESTAMP data type) will be used for the partitioning.
So currently BigQuery engineers are working in adding more new features to table partitioning, instead of the legacy Wildcard Tables methodology, then I'd suggest that you work with them.
Long explanation: you are comparing two approaches that in fact are used with the same purpose, but which have different implications:
Wildcard Tables: some time ago, when table partitioning was not a feature supported by Big Query, Wildcard Tables was the way to query multiple tables using concise SQL queries. A Wildcard Table represents the union of all the tables that match the wildcard expression specified in the SQL statement. However, Wildcard Tables have some limitations, such as:
Do not support views.
Do not support cached results (queries containing wildcard tables are billed every time they are run, even if the "cached results" option is checked).
Only work with native BigQuery storage (cannot work with external tables [Bigtable, Storage or Drive]).
Only available in standard SQL.
Partitioned Tables: these are unique tables that are divided into segments, split by date. There is a lot of documentation regarding how to work with Partitioned Tables, and regarding the pricing, each partition in a Partitioned Table is considered an independent entity, so if a partition was not updated for the last 90 days, this data will be considered long-term and therefore will be billed with the appropriate discount (as would happen with a normal table). Finally, Partitioned Tables are here to stay, so there are more incoming features to them, such as column-based partitioning, which is currently in alpha, and you can follow its status in this Public Issue Tracker post. On the other hand, there are also some current limitations to be considered:
Maximum of 2500 partitions per Partitioned Table.
Maximum of 2000 partition updates per table per day.
Maximum of 50 partition updates every 10 seconds.
So in general, it would be advisable to work with Partitioned Tables over multiple tables using Wildcard Tables. However, you should always consider your use case and see which one of the possibilities meets your requirements better.

One thing to add to your decision criteria here is caching and usage of legacy vs standard SQL.
Since the syntax in standard SQL for selecting multiple tables uses a wild card there is no way for the query result to be cached.
Interestingly, the query result would have been cached if legacy SQL was used. Just converting the query to standard SQL would disable caching.
This may be important to consider, at least in some cases more than others.
Thank you,
Hazem

Not exactly a time partition, but one can benefit from both worlds - wildcard "partitions" and real partitions to slice the data even further. Below is an example where we first use the data suffix to select only table holding data from that particular date, then we use actual partitioning within the table to limit the amount of data scanned even further.
Create first partitioned table with data suffix
CREATE TABLE `test_2021-01-05` (x INT64, y INT64)
PARTITION BY RANGE_BUCKET(y, GENERATE_ARRAY(0, 500, 1));
insert `test_2021-01-05` (x,y) values (5,1);
insert `test_2021-01-05` (x,y) values (5,2);
insert `test_2021-01-05` (x,y) values (5,3);
Create second partitioned table with data suffix
CREATE TABLE `test_2021-01-04` (x INT64, y INT64)
PARTITION BY RANGE_BUCKET(y, GENERATE_ARRAY(0, 500, 1));
insert `test_2021-01-04` (x,y) values (4,1);
insert `test_2021-01-04` (x,y) values (4,2);
Select all the data from both tables using wildcard notation, 80B of data is the whole test set
select * from `test_*`
-- 80B, all the data
Just select data from one table, which is like partitioning on date
select * from `test_*`
where _TABLE_SUFFIX = "2021-01-05"
-- 48B
Select data both from one table(where I am interested in one date) and only from one partition
select * from `test_*`
where _TABLE_SUFFIX = "2021-01-05"
and y = 1
-- 16B, that was the goal
Select data just from one partition from all the tables
select * from `test_*`
where y = 1
-- 32B, only one partition from both tables
The ultimate goal was to limit the data scanned when reading, thus reducing the cost and increasing performance.

Averaging large amounts of data in SQL Server

It is desired to perform averaging calculations on a large set of data. Data is captured from the devices rather often and we want to get the last day's average, the last week's average, the last month's average and the last year's average.
Unfortunately, taking the average of the last year's data takes several minutes to complete. I only have a basic knowledge of SQL and am hoping that there is some good information here to speed things up.
The table has a timestamp, an ID that identifies which device the data belongs to and a floating point data value.
The query I've been using follows this general example:
select avg(value)
from table
where id in(1,2,3,4) timestamp > last_year
Edit: I should clarify also that they are requesting that these averages be calculated on a rolling basis. As in "year to date" averages. I do realize that just simply due to the sheer volume of results, we may have to compromise.

For this kind of problems you can always try the following solutions:
1) optimize the query: look at the query plan, create some indexes, defrag the existing ones, run the query when the server is free, etc
2) create a cache table.
To populate the cache table chose one of the following strategies:
1) use triggers on the tables that affects the result and on insert,update,delete refresh the cache table. The trigger should run very, very, very fast. Other condition is to not block any records ( otherwise you will end up in a deadlock if the server is busy )
2) populate the cache table with a job once per day/hour/etc
3) one solution that I like is to populate the cache by a SP when the result is needed ( ex: when the report is requested by the user ) and to use some logic to serialize the process ( only one user at one time can generate the cache ) plus some optimization to not recompute the same rows next time ( ex: if no row was added for the yesterday, and in cache I have the result for yesterday, I don't recalculate that value - calculate only the new values from the last run )

You might want to consider making the Clustered index on the timestamp. Typically clustered index is wasted on the id. One caution on this, the sort order of the output of other sql statements may change if there was no explicit sort.

You can make a caching table, for statistics cache, it should have something similar to this structure:
year | reads_sum | total_reads | avg
=====|============|=============|=====
2009 | 6817896234 | 564345 |
at the end of the year, you fill avg (average) field with the, now quick to calculate, value.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas