Materialized View for Latest Rows by ID in a BigQuery Table?

I have a BigQuery table with ~5k unique IDs. Every day new rows are inserted for IDs that may or may not already exist.
We use this query to find the most recent rows:
SELECT t.* EXCEPT (seqnum)
FROM (SELECT t.*,
             ROW_NUMBER() OVER (PARTITION BY id
                                ORDER BY date_of_data DESC
                               ) AS seqnum
      FROM `[project]`.[dataset].[table] t
     ) t
WHERE seqnum = 1
Although we only want the most recent row for each ID, this query has to scan the entire table, so it gets slower and more expensive every day as the table grows. Right now, for an 8GB table, the query above produces a 22MB result table. We would much rather query the 22MB table if it could stay up-to-date.
Is it possible to create a materialized view that gets the latest rows for each ID?
Is there a better solution than growing tables to infinity?
Other requirements:
Keep historical data (somewhere)
Can't use updates - we would do more than 1,500 per day - https://cloud.google.com/bigquery/quotas

One of the solutions would be to partition your main table (with all rows) by the date_of_data column with daily granularity.
Then create a separate table that keeps only the most recent row for each ID. Populate it once with a single full scan of the main table, and from then on update it every day by querying only the last day of the main table. Thanks to the partitioning, querying the last day scans only that day's partition rather than the whole table. A sketch of what this could look like is below.
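Not as a definitive recipe, but a rough sketch of the idea in BigQuery SQL, using hypothetical table names `project.dataset.main_table` and `project.dataset.latest_by_id`, and assuming date_of_data is a DATE column:
-- Main table partitioned by day on date_of_data (hypothetical names).
CREATE TABLE `project.dataset.main_table`
PARTITION BY date_of_data
AS SELECT * FROM `project.dataset.unpartitioned_table`;

-- One-time backfill of the latest-row-per-id table (single full scan).
CREATE OR REPLACE TABLE `project.dataset.latest_by_id` AS
SELECT * EXCEPT (seqnum)
FROM (SELECT t.*,
             ROW_NUMBER() OVER (PARTITION BY id ORDER BY date_of_data DESC) AS seqnum
      FROM `project.dataset.main_table` t)
WHERE seqnum = 1;

-- Daily refresh (e.g. one scheduled query): only yesterday's partition is read,
-- and two DML statements per day stay far below the quota.
DELETE FROM `project.dataset.latest_by_id`
WHERE id IN (SELECT DISTINCT id
             FROM `project.dataset.main_table`
             WHERE date_of_data = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY));

INSERT INTO `project.dataset.latest_by_id`
SELECT * EXCEPT (seqnum)
FROM (SELECT t.*,
             ROW_NUMBER() OVER (PARTITION BY id ORDER BY date_of_data DESC) AS seqnum
      FROM `project.dataset.main_table` t
      WHERE date_of_data = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
WHERE seqnum = 1;
The historical data stays in the partitioned main table, so nothing is lost; only the small latest_by_id table is queried day to day.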

Related

Automatically add date for each day in SQL

I'm working on BigQuery and have created a view using multiple tables. Each day the data needs to be synced with multiple platforms. I need to insert a date or some other field via SQL through which I can identify which rows were added to the view each day, or which rows got updated, so that I can take only that data forward each day instead of syncing everything every day. The best way I can think of is to somehow add the current date wherever an update to a row happens, but that date needs to stay constant until a further update happens for that record.
Ex:
Sample data
Say we get the view T1 on 1st September and T2 on 2nd. I need to spot only ID:2 for 1st September and ID:3,4,5 on 2nd September. Note: no such date column exists. I need help in creating such a column, or any other approach to verify which rows are getting updated/added daily.
You can create a BigQuery scheduled query with a daily frequency (every 24 hours) using the below INSERT statement:
INSERT INTO dataset.T1
SELECT *
FROM dataset.T2
WHERE date > (SELECT MAX(date) FROM dataset.T1);
Your table that the data is getting streamed into (in your case: the sample data table) needs to be configured as a partitioned table. For that, use "Partition by ingestion time" so that you don't need to handle the date yourself.
Configuration in BQ
After you have recreated that table, append your existing data to it with the help of the write options in BQ (append) and run the job.
Then you create a view based on that table with:
SELECT * EXCEPT (rank)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY invoice_id ORDER BY _PARTITIONTIME DESC) AS rank
  FROM `your_dataset.your_sample_data_table`
)
WHERE rank = 1
From then on, always query the view.

How to take some data from one partition in the BigQuery table and insert to the next partition?

I have a Big Query table with daily partitions
Now the problem is that in one of the partitions, i.e. the last partition of the month (for example: 2019-12-31), I have some data that should belong to the next partition, i.e. 2020-01-01.
I want to know if it is possible to take that data out of my partition 2019-12-31 and put it in the next partition 2020-01-01 using BigQuery SQL, or do I have to create a Beam job for it?
Yes, using DML. An UPDATE statement can move rows from one partition to another.
Updating data in a partitioned table using DML is the same as updating data in a non-partitioned table.
For example, the following UPDATE statement moves rows from one partition to another. Rows in the May 1, 2017 partition (“2017-05-01”) of mytable where field1 is equal to 21 are moved to the June 1, 2017 partition (“2017-06-01”).
UPDATE
  project_id.dataset.mycolumntable
SET
  ts = "2017-06-01"
WHERE
  DATE(ts) = "2017-05-01"
  AND field1 = 21

Delete based on column values when it is part of a composite key

I have a table which has an id and a date. (id, date) make up the composite key for the table.
What I am trying to do is delete all entries older than a specific date.
delete from my_table where date < '2018-12-12'
The query plan shows that it will do a sequential scan on the date column.
I somehow want to make use of the existing index, since the number of distinct ids is very small compared to the total number of rows in the table.
How do I do it? I have tried searching for it but to no avail.
In case your use case involves archiving data on a monthly basis (or over some other time period), you can consider changing your database table to use partitions; a sketch follows the list below.
Let's say you collect data on a monthly basis and want to keep data for the last 5 months. It would be really efficient to partition the table by month of the year.
This will:
optimise your READ queries (table scans will reduce to partition scans)
optimise your DELETE requests (just delete the complete partition)
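The question does not name the database, but as a rough illustration, with PostgreSQL 11+ declarative partitioning syntax assumed and a hypothetical column list, monthly range partitioning could look like this:
-- Hypothetical monthly range partitioning (PostgreSQL 11+ syntax assumed).
CREATE TABLE my_table_partitioned (
    id      bigint NOT NULL,
    date    date   NOT NULL,
    payload text,
    PRIMARY KEY (id, date)
) PARTITION BY RANGE (date);

-- One partition per month that should be kept.
CREATE TABLE my_table_2018_11 PARTITION OF my_table_partitioned
    FOR VALUES FROM ('2018-11-01') TO ('2018-12-01');
CREATE TABLE my_table_2018_12 PARTITION OF my_table_partitioned
    FOR VALUES FROM ('2018-12-01') TO ('2019-01-01');

-- Removing old data then becomes a cheap metadata operation
-- instead of a row-by-row DELETE with a sequential scan.
DROP TABLE my_table_2018_11;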
You need an index on date for this query:
create index idx_mytable_date on mytable(date);
Alternatively, you can drop your existing index and add a new one with (date, id). date needs to be the first key for this query.
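For instance, the composite index suggested above might be created like this (hypothetical names):
-- date as the leading column so the range predicate date < '2018-12-12' can use it;
-- id is included so the index also covers the composite key.
create index idx_mytable_date_id on mytable(date, id);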

How to query data from latest partition in interval partitioned table

I have an interval-partitioned table: PARTITION_TEST.
I need to pick data from the last partition. Is it possible without referring to the dba_tab_partition table?
This table will have only 5 partitions at a time. Is there any way to select data from the 5th partition position?
Something like,
SELECT * FROM PARTITION_TEST partition_position(5)?

Sql Server 2008 partition table based on insert date

My question is about table partitioning in SQL Server 2008.
I have a program that loads data into a table every 10 mins or so. Approx 40 million rows per day.
The data is bcp'ed into the table and needs to load very quickly.
I would like to partition this table based on the date the data is inserted into the table. Each partition would contain the data loaded in one particular day.
The table should hold the last 50 days of data, so every night I need to drop any partitions older than 50 days.
I would like to have a process that aggregates data loaded into the current partition every hour into some aggregation tables. The summary will only ever run on the latest partition (since all other partitions will already be summarised) so it is important it is partitioned on insert_date.
Generally when querying the data, the insert date is specified (or multiple insert dates). The detailed data is queried by drilling down from the summarised data and as this is summarised based on insert date, the insert date is always specified when querying the detailed data in the partitioned table.
Can I create a default column in the table "Insert_date" that gets a value of Getdate() and then partition on this somehow?
OR
I can create a column in the table "insert_date" and put a hard coded value of today's date.
What would the partition function look like?
Would separate tables and a partitioned view be better suited?
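For reference, a minimal sketch of what a date-based partition function and scheme might look like in SQL Server 2008; the names and boundary values are illustrative only, and in practice you would script one boundary per day and add/drop them in the nightly maintenance job:
-- Illustrative only: daily boundaries on a datetime partitioning column.
CREATE PARTITION FUNCTION pf_insert_date (datetime)
AS RANGE RIGHT FOR VALUES ('2010-01-01', '2010-01-02', '2010-01-03');

CREATE PARTITION SCHEME ps_insert_date
AS PARTITION pf_insert_date ALL TO ([PRIMARY]);

-- The table would then be created ON ps_insert_date(insert_date), with
-- insert_date defaulting to GETDATE() so new loads land in today's partition.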
I have tried both, and even though I think partitioned tables are cooler, after trying to teach others how to maintain the code afterwards it just wasn't justified. In that scenario we used a hard-coded date field that was set in the insert statement.
Now I use different tables (31 days / 31 tables) plus an aggregation table, and there is an ugly UNION ALL query that joins together the monthly data (sketched below).
Advantages: super simple SQL, simple C# code for bcp, and nobody has complained about complexity.
But if you have the infrastructure and a gaggle of .net / sql gurus I would choose the partitioning strategy.
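A rough idea of the separate-tables-plus-view approach described above, with hypothetical table names:
-- One physical table per day, stitched together by a UNION ALL view for querying.
CREATE VIEW dbo.load_data AS
SELECT * FROM dbo.load_data_day01
UNION ALL
SELECT * FROM dbo.load_data_day02
UNION ALL
SELECT * FROM dbo.load_data_day03;
-- ...one branch per daily table; bcp always targets the current day's table,
-- and removing old data means truncating or dropping that day's table.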