dbt Merge Partition Pruning - google-bigquery

This is my first time working with dbt! I have successfully implemented an incremental model using #db-bigquery with the following config:
{{ config(
    materialized='incremental',
    alias='sale_transactions',
    schema='marts',
    unique_key='unique_key',
    partition_by={
      "field": "sale_date",
      "data_type": "timestamp",
      "granularity": "day"
    },
    require_partition_filter = false,
    database="iprocure-edw"
) }}
The model runs well, though each time it runs it scans the whole destination table.
The source data has both new and changed rows; the changes span no more than 3 months on the partitioned column.
The merge statement only joins on the unique key (DBT_INTERNAL_SOURCE.unique_key = DBT_INTERNAL_DEST.unique_key) and has no date filters.
I would like to prune partitions and scan only the relevant ones, to avoid a full scan, which is very costly.
How can one add date filters to the merge query generated by dbt so that the amount of data scanned by BigQuery is reduced?
I have found this, though I'm not aware of how/where I can use the solution.
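One commonly suggested direction is dbt's incremental_predicates config, which injects extra predicates into the generated merge so that BigQuery can prune the destination table's partitions. This is a sketch only: it assumes a dbt-core / dbt-bigquery version where incremental_predicates is available, and the 90-day window is an assumed match for the "not more than 3 months" of changes:
-- sketch: incremental_predicates is assumed to be supported by the installed dbt version;
-- the 90-day window is an example, tune it to how far back your changes actually reach
{{ config(
    materialized='incremental',
    alias='sale_transactions',
    schema='marts',
    unique_key='unique_key',
    partition_by={
      "field": "sale_date",
      "data_type": "timestamp",
      "granularity": "day"
    },
    incremental_predicates=[
      "DBT_INTERNAL_DEST.sale_date >= timestamp_sub(current_timestamp(), interval 90 day)"
    ],
    require_partition_filter = false,
    database="iprocure-edw"
) }}
Each predicate is added to the merge's join condition, so the destination side only reads partitions in that window instead of the whole table. Pairing this with an is_incremental() filter on sale_date in the model's SELECT keeps the source-side scan small as well.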

Related

How to update insert new record with updated value from staging table in Azure Data Explorer

I have a requirement where data is ingested from the Azure IoT Hub. Sample incoming data:
{
"message":{
"deviceId": "abc-123",
"timestamp": "2022-05-08T00:00:00+00:00",
"kWh": 234.2
}
}
I have the same column mapping in the Azure Data Explorer table. kWh always comes as a cumulative (incremental) value, not a delta between two timestamps. Now I need another table which holds the difference between the last inserted kWh value and the current kWh.
It would be a great help if anyone has a suggestion or solution here.
I'm able to calculate the difference on the fly using prev(), but I need to update the table while inserting the data into it.
As far as I know, there is no way to perform data manipulation on the fly while ingesting Azure IoT data into Azure Data Explorer through JSON mapping. However, I found a couple of approaches you can take to get the calculations you need. Both approaches involve creating a secondary table to store the calculated data.
Approach 1
This is the closest approach I found to on-the-fly data manipulation. For this to work you would need to create a function that calculates the difference of the kWh field for the latest entry. Once you have the function created, you can bind it to the secondary (target) table using an update policy and make it trigger for every new entry on your source table.
Refer to the following resource, Ingest JSON records, which explains with an example how to create a function and bind it to the target table.
Note that you would have to create your own custom function that calculates the difference in kWh.
Approach 2
If you do not need real-time data manipulation and your business can tolerate a 1-minute delay, you can create a query similar to the one below, which calculates the temperature difference from the source table (jsondata in my scenario) and writes it to the target table (jsondiffdata):
.set-or-append jsondiffdata <| jsondata | serialize
| extend temperature = temperature - prev(temperature,1), humidity, timesent
Refer to the following resource for more information on how to Ingest from query. You can use Microsoft Power Automate to schedule this query to trigger every minute.
Please be cautious if you decide to go with the second approach, as it uses a serialization process which might prevent query parallelism in many scenarios. Please review this resource on window functions and identify a query approach that is better optimized for your business needs.

Adding a new column into Athena (Presto) table calculated by taking the difference between two rows

Over the past few weeks, I've written a pipeline that picks up all the clickstream data that is being broadcast from a website. The pipeline makes use of AWS in the following way: S3 > EC2 (for transforms) > Athena (scanning a clean, partitioned S3). New data comes into the pipeline every 24 hours and this works great - my clickstream data is easily queryable. However, I now need to add some additional columns, e.g. time spent on each page. This can be achieved by sorting by user ID and timestamp and then taking the difference between the timestamp column of row_n1 and row_n2. So my questions are:
1) How can I do this via an SQL query? I'm struggling to get it to work, but my thinking is that once I do I can trigger this query every 24 hours to run on the new clickstream data that's coming into Athena.
2) Is this a reasonable way to add additional columns or new aggregate tables? For example, build a query that runs every 24 hours on new data and appends to a new table.
Ideally, I don't want to touch any of the source code that's been written to do the "core" ETL pipeline.
For reference, my table looks similar to the following (with the new column timeSpentOnPage):
| userID | eventNum | Category | Time | ...... | timeSpentOnPage |
| '103-1023' | '3' | 'View' | '12-10-2019...' | ...... | 3s |
Thanks for any direction/advice that can be provided.
I'm not entirely sure what you are asking, and some example data and expected output would be helpful. For example, I don't quite understand what you mean by row_n and row_m.
I'm going to guess that you mean something like calculating the difference between the timestamps of consecutive rows. That can be achieved by a query like
SELECT
  userID,
  timestamp - LAG(timestamp, 1) OVER (PARTITION BY userID ORDER BY timestamp) AS timeSpentOnPage
FROM events
The LAG window function returns the value from a previous row (1 in this case means the immediately preceding row) within the window defined by the PARTITION BY and ORDER BY clauses (in this case all rows with the same userID, sorted by timestamp). It's kind of like GROUP BY, but evaluated for each row, if that makes sense.
It wouldn't quite give you the time spent on each page: some page views would look like they were very long when in fact there was just no activity between them (say someone browsed some, went to lunch, and browsed some more – the last page view before lunch would look like it spanned the whole lunch).
There is no way to do the equivalent of UPDATE in Athena. The closest thing is doing a "CTAS" (Create Table AS) to create a new table (which with some automation can be turned into creating new partitions for existing tables).
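For illustration, a minimal CTAS sketch (the table name events_with_duration and the S3 location are hypothetical, and the duration is stored as seconds since Athena tables cannot hold INTERVAL values):
CREATE TABLE events_with_duration
WITH (
  format = 'PARQUET',
  external_location = 's3://my-bucket/events_with_duration/'
) AS
SELECT
  userID,
  eventNum,
  Category,
  timestamp,
  -- seconds elapsed since the same user's previous event
  date_diff('second',
            LAG(timestamp, 1) OVER (PARTITION BY userID ORDER BY timestamp),
            timestamp) AS timeSpentOnPage
FROM events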
If you provide some more information about your data I can revise this answer with other suggestions.

Load data into BigQuery and partition data based on time as well as split by another variable

(I'm very new to databases so please let me know how I can make this question better.)
I am trying to load multiple files of columnar data into BigQuery from an AWS S3 bucket
It is web analytics data of over 150 different websites
There are multiple files, each containing 15 minutes of web analytics data
Each file contains data for all 150 websites for a 15 minute slot, however there is a column called site_code which tells us which site the row pertains to
Here is a snapshot of relevant columns:
timestamp_info_nginx_ms site_code action
1.539168e+12 site_1 event1
1.539168e+12 site_2 event2
1.539168e+12 site_3 event1
1.539168e+12 site_1 event1
1.539168e+12 site_2 event2
The size of data is 200+GB per week and I want to be able to load 12 weeks' data.
My goal is to minimise monthly query costs.
Some context: My main use case is that I will analyse data for one website (or a group of websites) at a time. Of the 150 sites, I will focus mainly on 10-15 websites. Let's call them primary websites. I expect to analyse the primary websites on a regular basis (daily) and the rest of the websites occasionally (1-3 times a month) or rarely (1-3 times in 2 months).
I have understood that I need to partition my data tables by day, which looks relatively simple to do via the BigQuery GUI.
However, my question is: is it possible to load this data into separate tables for my primary websites (one table for each primary website) and a separate one for the rest of them?
Having looked at BigQuery's recently released clustering functionality, I found it is exactly what I was looking for. The following lines of code solve my question for the sample dataset.
For this use case I am assuming the data is stored in GCS as compressed newline-delimited JSON (ndjson) files.
At the time of posting this answer, it is not possible to create clustered tables via the web UI, hence I am looking at the command-line option by installing the Google Cloud SDK.
While it is possible to create a partitioned table at the time of loading the data (i.e. table creation and loading can be done in one step), it is not possible (as of now) to simultaneously create a clustered table. Hence this is a two-step process: Step 1 is to create an empty table and Step 2 is to load data into it.
Given my sample dataset my schema will look like this:
[
{"type": "TIMESTAMP", "name": "timestamp_info_nginx_ms", "mode": "NULLABLE"},
{"type": "STRING", "name": "site_code", "mode": "NULLABLE"},
{"type": "STRING", "name": "action", "mode": "NULLABLE"}
]
Store the above JSON in the current working directory as myschema.json.
Note that my partitioning field is going to be the timestamp and my clustering fields are going to be site_code and action. The order in which clustering is done matters; remember the clustering order when you run queries on this table.
Create a dataset in BigQuery called my_dataset (dataset names may only contain letters, numbers, and underscores).
Now call gcloud sdk's bq command in your terminal to create a table.
bq mk -t --schema ./myschema.json --time_partitioning_type=DAY --time_partitioning_field timestamp_info_nginx_ms --require_partition_filter=TRUE --clustering_fields='site_code,action' my_dataset.my_clustered_table
This should create a new table called my_clustered_table within the existing dataset my_dataset.
Now load the data into the table using gcloud sdk's bq command in your terminal.
bq load --source_format=NEWLINE_DELIMITED_JSON --max_bad_records=1000 my_dataset.my_clustered_table gs://my-bucket/my-json-files/*
This should work.
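To verify the cost benefit, here is a hedged query sketch against the table created above (the dates and the site_code value are made up). Because require_partition_filter is TRUE, every query must filter on the partition column, and filtering on the leading clustering column site_code lets BigQuery additionally prune blocks within each partition:
SELECT
  action,
  COUNT(*) AS events
FROM my_dataset.my_clustered_table
WHERE timestamp_info_nginx_ms >= TIMESTAMP('2018-10-10')
  AND timestamp_info_nginx_ms < TIMESTAMP('2018-10-11')
  AND site_code = 'site_1'
GROUP BY action
The bytes billed should correspond to roughly one day of one site's data rather than the full 12 weeks.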

Bigquery and Tableau

I connected Tableau to BigQuery and was working on the dashboards. The issue here is that BigQuery charges for the data a query scans every time.
My table is 200 GB of data. When someone opens the dashboard in Tableau, the query runs over the whole table; applying any filter on the dashboard runs over the whole table again.
On 200 GB of data, if someone applies 5 filters across different analyses, BigQuery bills roughly 200 GB * 5 ≈ 1 TB. For one day of testing the analysis we were charged for a 30 TB analysis, but the table behind it is only 200 GB. Is there any way I can stop Tableau from running over the full data in BigQuery every time there is any change?
The extract in Tableau is indeed one valid strategy, but only when you are using a custom query. If you directly access the table it won't work, as that will download 200 GB to your machine.
Other options to limit the amount of data are:
Not selecting any columns that you don't need. Do this by hiding unused fields in Tableau; it will then not include those fields in the query it sends to BigQuery. Otherwise it's a SELECT * and you pay for the full 200 GB even if you don't use those fields.
Another option that we use a lot is partitioning our tables, for instance a partition per day of data if you have a date field. Using the TABLE_DATE_RANGE and TABLE_QUERY functions you can then smartly limit the number of partitions, and hence rows, that Tableau will query. I usually hide the complexity of these table wildcard functions away in a view and then use the view in Tableau; a sketch of such a view follows below. Another option is to use a parameter in Tableau to control the TABLE_DATE_RANGE.
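A hedged sketch of such a view, assuming legacy SQL and daily date-sharded tables named events_YYYYMMDD (the dataset and table prefix are made-up names):
#legacySQL
SELECT *
FROM TABLE_DATE_RANGE(
  [mydataset.events_],
  DATE_ADD(CURRENT_TIMESTAMP(), -7, 'DAY'),
  CURRENT_TIMESTAMP()
)
Queries that Tableau sends against this view only touch the last 7 days of shards instead of the full history.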
1) Right now I am learning BQ + Tableau too, and I found that using "Extract" is a must for BQ in Tableau. With this option you can also save time building the dashboard. So my current pipeline is: Build query > Add it to Tableau > Make dashboard > Upload dashboard to Tableau Online > Schedule update for Extract.
2) You can send a Custom Quota Request to Google and set up limits per project/per user.
3) If each of your queries touches 200 GB each time, consider optimizing these queries (don't use SELECT *, use only the dates you need, etc.).
The best approach I found was to partition the table in BQ on a date (day) field which has no time component. BQ allows you to partition a table by a day-level field. The important thing here is that even though the field is a day/date with no time component, it should be a TIMESTAMP datatype in the BQ table, i.e. you will end up with a column in BQ with data looking like this:
2018-01-01 00:00:00.000 UTC
The reason the field needs to be a TIMESTAMP datatype (even though there is no time in the data) is that when you create a viz in Tableau it will generate SQL to run against BQ, and for the partitioned field to be utilised by the Tableau-generated SQL it needs to be a TIMESTAMP datatype.
In Tableau, you should always filter on your partitioned field and BQ will only scan the rows within the ranges of the filter.
I tried partitioning on a DATE datatype and looked up the logs in GCP and saw that the entire table was being scanned. Changing to TIMESTAMP fixed this.
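A minimal sketch of that setup (the table and column names are hypothetical): the partitioning column is a TIMESTAMP truncated to midnight UTC, as described above.
CREATE TABLE mydataset.events_partitioned
PARTITION BY DATE(event_day) AS
SELECT
  e.*,
  -- truncated to midnight UTC so stored values look like 2018-01-01 00:00:00 UTC
  TIMESTAMP_TRUNC(e.event_timestamp, DAY) AS event_day
FROM mydataset.events_raw AS e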
The thing about Tableau and BigQuery is that Tableau calculates the filter values using your query (live connection). What I have seen in my project's logs is that it creates filters from your own query:
select 'Custom SQL Query'.filtered_column from ( your_actual_datasource_query ) as 'Custom SQL Query' group by 'Custom SQL Query'.filtered_column
Instead, try to create the Tableau data source with incremental extracts, and also try to have your query date-partitioned (BigQuery only supports date partitioning) so that you can limit the data usage.

Bigquery return nested results without flattening it without using a table

It is possible to return nested results (RECORD type) if the noflatten_results flag is specified, but is it possible to just view them on screen without writing them to a table first?
For example, here is a simple user table (my actual table is much larger: 400+ columns with multiple levels of nesting):
ID,
name: {first, last}
I want to view the record of a particular user and display it in my application, so my query is:
SELECT * FROM dataset.user WHERE id=423421 limit 1
Is it possible to return the result directly?
You should write your output to a "temp" table with the noflatten_results option (with a respective expiration set, so the table is purged after it is used) and serve your client out of this temp table. All "on the fly".
Keep in mind that no matter how small the "temp" table is, if you query it (in the second step above) you will be billed for at least 10 MB, so you are better off using the Tabledata.list API in this step (https://cloud.google.com/bigquery/docs/reference/v2/tabledata/list), which is free!
So if you try to get repeated records it will fail in the interface/BQ console with the error:
Error: Cannot output multiple independently repeated fields at the same time.
and the way to get past this error is to FLATTEN your output.
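A hedged legacy SQL sketch of what that flattening looks like, using the simplified example schema above and assuming name is one of the repeated records in the real table:
#legacySQL
SELECT id, name.first, name.last
FROM FLATTEN([dataset.user], name)
WHERE id = 423421
LIMIT 1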