I have the same problem again as this question:
How to choose the latest partition in BigQuery?
What's the problem?
How do you build an incremental table with DBT on BigQuery without scanning the entire table every time?
The suggested incremental table format doesn't work (it scans the whole table) and DECLARE isn't supported (I think?).
Details
The suggested incremental format for DBT involves something like this:
{% if is_incremental() %}
WHERE _partitiontime > (select max(_partitiontime) from `dataset.table`)
{% endif %}
The first run builds the incremental table and adds a lot of rows.
The second run adds a tiny number of rows but still scans the entire table.
BigQuery will scan the entire table for every incremental run, meaning you're paying the full cost every day.
The recommended solution by BigQuery is to use DECLARE:
DECLARE max_date TIMESTAMP;
SET max_date = (select max(_partitiontime) from `dataset.table`);
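If you could run this as a BigQuery script, the variable would be a constant by the time the main query executes, so the partition filter could be pruned. A minimal sketch of how the variable would then be used (the table name is a placeholder):
-- max_date is resolved before the main query runs, so the filter below can prune partitions
SELECT *
FROM `dataset.source_table`
WHERE _partitiontime > max_date;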
This post suggests that isn't possible.
Is there a workaround people are using here? Is there some way to escape out to DECLARE from DBT, or another solution I haven't seen?
Other context
I've previously posted a version of this question involving Data Studio:
Pruning BigQuery partitions with Data studio
Couldn't figure out how to do it there either.
It turns out there is a DBT shortcut, _dbt_max_partition, which goes through the steps of declaring a variable for you, so the partitions are correctly pruned.
{% if is_incremental() %}
AND _partitiontime >= _dbt_max_partition
{% endif %}
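For reference, a fuller sketch of what such a model could look like end to end. The source name, column, and partition config below are assumptions, and whether _dbt_max_partition is available can depend on your dbt-bigquery version and incremental strategy, so check the docs for your setup:
{{ config(
    materialized='incremental',
    partition_by={'field': 'event_time', 'data_type': 'timestamp'}
) }}

select *
from {{ source('my_source', 'events') }}

{% if is_incremental() %}
  -- _dbt_max_partition is declared by dbt as a BigQuery scripting variable,
  -- so this predicate is a constant at run time and partitions are pruned
  where event_time >= _dbt_max_partition
{% endif %}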
I found an excellent dbt helper post here.
Related
I am trying to optimise some DBT models by using incremental runs on partitioned data and have run into a problem: the suggested approach that I've found doesn't seem to work. By "not working" I mean that it doesn't decrease the processing load as I'd expect.
Below is the processing load of a simple select of the partitioned table:
unfiltered query processing load
Now I try to select only new rows added since the last incremental run:
filtered query
You can see that the load is exactly the same.
However, the select inside the WHERE is rather lightweight:
selecting only the max date
And when I fill in the result of that select manually, the processing load is suddenly minimal, which is what I'd expect:
expected processing load
Finally, both tables (the one I am querying data from, and the one I am querying max(event_time) from) are configured in exactly the same way: both are partitioned by DAY on the field event_time:
config on tables
What am I doing wrong? How could I improve my code to actually make it work? I'd expect the processing load to be similar to the one using an explicit condition.
P.S. apologies for posting links instead of images. My reputation is too low, as this is my first question here.
Since the query is dynamic, i.e. the WHERE condition is not a constant, BigQuery cannot accurately estimate the amount of data processed before execution.
This is due to the fact that max(event_time) is not constant and might change, which affects the size of the data to be fetched by the outer query.
For estimation purposes, try one of these 2 approaches:
Replace the inner query with a constant value and check the estimated bytes to be processed (see the sketch after this list).
Run the query once and check the processed data under Query results -> Job Information -> Bytes processed and Bytes billed.
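For example, here is a sketch (with made-up table and column names) of the two variants you can compare in the BigQuery editor:
-- Dynamic predicate: the estimator cannot resolve the subquery, so the full table size is shown
SELECT *
FROM `project.dataset.events`
WHERE event_time > (SELECT MAX(event_time) FROM `project.dataset.events_incremental`);

-- Constant predicate: only the matching partitions are counted in the estimate
SELECT *
FROM `project.dataset.events`
WHERE event_time > TIMESTAMP('2021-05-01');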
I am using DBT to incrementally load data from one schema in Redshift to another to create reports. In DBT there is a straightforward way to incrementally load data with an upsert. But instead of doing the traditional upsert, I want to sum the incoming rows with the existing rows in the destination table (on the unique id, across the rest of the columns) if they already exist, and otherwise insert them.
Say for example I have a table.
T1(userid, total_deposit, total_withdrawal)
I have created a table that calculates the total deposit and total withdrawal for each user. When I do an incremental run I might get a new deposit or withdrawal for an existing user; in that case, I'll have to add the value to the existing row instead of replacing it with an upsert. And if the user is new, I just need to do a simple insert.
Any suggestion on how to approach this?
dbt is quite opinionated that invocations of dbt should be idempotent. This means that you can run the same command over and over again, and the result will be the same.
The operation you're describing is not idempotent, so you're going to have a hard time getting it to work with dbt out of the box.
As an alternative, I would break this into two steps:
Build an incremental model, where you are appending the new activity
Create a downstream model that references the incremental model and performs the aggregations you need to calculate the balance for each customer. You could very carefully craft this as an incremental model with your user_id as the unique_key (since you have all of the raw transactions in #1), but I'd start without that and make sure that's absolutely necessary for performance reasons, since it will add a fair bit of complexity. A sketch of both steps follows this list.
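A minimal sketch of the two models (all names here are hypothetical, and the incremental filter assumes an event_time column in the source):
-- models/user_transactions.sql: step 1, append-only incremental model
{{ config(materialized='incremental') }}

select userid, deposit, withdrawal, event_time
from {{ source('app', 'transactions') }}
{% if is_incremental() %}
  where event_time > (select max(event_time) from {{ this }})
{% endif %}

-- models/user_balances.sql: step 2, downstream aggregation per user
select
    userid,
    sum(deposit) as total_deposit,
    sum(withdrawal) as total_withdrawal
from {{ ref('user_transactions') }}
group by 1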
For more info on complex incremental materializations, I suggest this discourse post written by Tristan Handy, Founder & CEO at dbt Labs
Is there any way to know the refresh status of materialized views? I want to figure out how to track whether the materialized view refresh was successful.
Views, by definition, are not refreshed as such: they always contain the latest data available in the source. E.g. if you were to query a staging model that is materialised as a view and looks like the following:
-- This is your staging model, materialised as a view
{{ config(materialized='view') }}
select * from {{ source('your_crm', 'orders') }}
You would get the freshest data from the source, even if it is an order (for the sake of the example) that was created 5 minutes ago, as long as that order already appears in your source table.
So, long story short, you can always confirm this by querying any of your materialized views, and checking what data is available in their source.
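For example, a quick sanity check (a sketch with hypothetical project, dataset, and column names) is to compare the latest timestamp visible through the view with the one in the underlying source table; both queries should return the same value, since the view is just a saved query over the source:
-- Latest record as seen through the dbt view
select max(created_at) from `your_project.analytics.stg_orders`;

-- Latest record in the underlying source table
select max(created_at) from `your_project.your_crm.orders`;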
Assume I have an old table with a lot of data. It has two columns: user_id, which has existed from the very beginning, and data, which was added very recently, say a week ago. My goal is to join this table on user_id but retrieve only the newly created data column. Could it be that, because the data column didn't exist until recently, there is no point in scanning the whole user_id range and the query would therefore be cheaper? How is the price calculated for such an operation?
According to the documentation there are 2 pricing models for queries:
On-demand pricing
Flat-rate pricing
Seeing that you use on-demand pricing, you will only be billed for the number of bytes processed; you can check how data size is calculated here. In that sense the answer would be: yes, partially scanning user_id would be cheaper. But reading through the documentation you'll find this sentence:
When you run a query, you're charged according to the data processed in the columns you select, even if you set an explicit LIMIT on the results.
So probably the best solution would be to create another table containing only the data that has to be processed, and run the query against that.
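For instance, a sketch (with assumed table and column names) of materialising just the columns you need into a smaller table and querying that instead:
-- One-off scan of only the two columns needed downstream
CREATE TABLE `project.dataset.user_data_slim` AS
SELECT user_id, data
FROM `project.dataset.big_table`
WHERE data IS NOT NULL;

-- Subsequent queries are billed only for the slim table's columns
SELECT user_id, data
FROM `project.dataset.user_data_slim`;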
As part of our BigQuery solution we have a cron job which checks the latest table created in a dataset and will create more if this table is out of date. This check is done with the following query:
SELECT table_id FROM [dataset.__TABLES_SUMMARY__] WHERE table_id LIKE 'table_root%' ORDER BY creation_time DESC LIMIT 1
Our integration tests have recently been throwing errors because this query is hitting BigQuery's internal cache, even though running the query against the underlying table would provide a different result. This caching also occurs if I run the query in the web interface from the Google Cloud console.
If I specify that the query should not use the cache via the
queryRequest.setUseQueryCache(false)
flag in the code, then the tests pass correctly.
My understanding was that Bigquery automatic caching would not occur if running the query against the underlying table would provide a different result. Am I incorrect in this assumption in which case when does it occur or is this a bug?
Well, the answer to your question is that you are approaching this conceptually wrong: you always need to set the no-cache parameter if you want non-cached data. Even in the web UI there is an option you need to set; the default is to use the cached version.
But fundamentally, you should change the process and use a more recent feature:
Automatic table creation using template tables
A common usage pattern for streaming data into BigQuery is to split a logical table into many smaller tables, either for creating smaller sets of data (e.g., by date or by user ID) or for scalability (e.g., streaming more than the current limit of 100,000 rows per second). To split a table into many smaller tables without adding complex client-side code, use the BigQuery template tables feature to let BigQuery create the tables for you.
To use a template table via the BigQuery API, add a templateSuffix parameter to your insertAll request.
By using a template table, you avoid the overhead of creating each table individually and specifying the schema for each table. You need only create a single template, and supply different suffixes so that BigQuery can create the new tables for you. BigQuery places the tables in the same project and dataset. Templates also make it easier to update the schema because you need only update the template table.
Tables created via template tables are usually available within a few seconds.
This way you don't need a cron job, as BigQuery will automatically create the missing tables.
Read more here: https://cloud.google.com/bigquery/streaming-data-into-bigquery#template-tables