While doing an incremental load using dbt, I want to aggregate if the row exists, else insert - sql

I am using dbt to incrementally load data from one schema in Redshift to another to create reports. In dbt there is a straightforward way to incrementally load data with an upsert. But instead of doing the traditional upsert, I want to take the sum (grouped by the unique id, across the rest of the columns in the table) of the incoming rows and the existing rows in the destination table if they already exist; otherwise, just insert them.
Say, for example, I have a table:
T1(userid, total_deposit, total_withdrawal)
I have created a table that calculates the total deposit and total withdrawal for each user. When I do an incremental query I might get a new deposit or withdrawal for an existing user; in that case, I'll have to add the value to the existing row instead of replacing it with an upsert. And if the user is new, I just need to do a simple insert.
Any suggestion on how to approach this?

dbt is quite opinionated that invocations of dbt should be idempotent. This means that you can run the same command over and over again, and the result will be the same.
The operation you're describing is not idempotent, so you're going to have a hard time getting it to work with dbt out of the box.
As an alternative, I would break this into two steps:
Build an incremental model, where you are appending the new activity
Create a downstream model that references the incremental model and performs the aggregations you need to calculate the balance for each customer (a sketch of both models is below). You could very carefully craft this as an incremental model with your user_id as the unique_key (since you have all of the raw transactions in #1), but I'd start without that and make sure that's absolutely necessary for performance reasons, since it will add a fair bit of complexity.
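For illustration, here is a minimal sketch of the two models, assuming a raw transactions source with per-event amounts and a timestamp (the model, source, and column names here are hypothetical):

-- models/user_transactions.sql (step 1: incremental, append-only)
{{ config(materialized='incremental') }}

select
    userid,
    deposit_amount,
    withdrawal_amount,
    transaction_ts
from {{ source('app', 'transactions') }}
{% if is_incremental() %}
  -- only pull rows newer than what this model already contains
  where transaction_ts > (select max(transaction_ts) from {{ this }})
{% endif %}

-- models/user_balances.sql (step 2: downstream aggregation, rebuilt each run)
select
    userid,
    sum(deposit_amount)    as total_deposit,
    sum(withdrawal_amount) as total_withdrawal
from {{ ref('user_transactions') }}
group by userid

Because the second model re-aggregates from the full transaction history, re-running it always produces the same result, which is exactly the idempotence dbt expects.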
For more info on complex incremental materializations, I suggest this discourse post written by Tristan Handy, Founder & CEO at dbt Labs

Related

Advice on changing the partition field for dynamic BigQuery tables

I am dealing with the following issue: I have a number of tables imported into BigQuery from an external source via AirByte, with _airbyte_emitted_at as the default partition field.
As this default choice of partition field is not very useful, the need to change the partition field naturally presents itself. I am aware of the method available for changing the partitioning of existing tables, by means of a CREATE TABLE ... AS SELECT * statement; however, the new tables thus created (essentially copies of the original ones, with modified partition fields) will be mere static snapshots and will no longer update dynamically, as the originals do each time new data is recorded in the external source.
Given such a context, what would the experienced members of this forum suggest as a solution to the problem?
Being that I am a relative beginner in such matters, I apologise in advance for any potential lack of clarity. I look forward to improving the clarity, should there be any suggestions to do so from interested readers & users of this forum.
I can think of 2 approaches to overcome this.
Approach 1 :
You can use scheduled queries to copy the newly inserted rows into your 2nd table. Write the query in such a way that it always selects only the latest rows from your main table, and then use an INSERT INTO statement to append those rows to your 2nd table.
Since scheduled queries run at specific times, the only drawback is that the 2nd table will not get updated immediately whenever there is a new row in the main table; it will get the latest data whenever the scheduled query runs.
If you do not need the 2nd table to always have the very latest data, this approach is the easier one to achieve.
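For illustration, assuming the main table carries a timestamp such as _airbyte_emitted_at, the scheduled query might look roughly like this (project, dataset, table, and column names are placeholders):

INSERT INTO `my_project.my_dataset.reporting_table` (id, payload, _airbyte_emitted_at)
SELECT id, payload, _airbyte_emitted_at
FROM `my_project.my_dataset.main_table`
-- only copy rows emitted after the newest row already present in the 2nd table
WHERE _airbyte_emitted_at > (
  SELECT COALESCE(MAX(_airbyte_emitted_at), TIMESTAMP('1970-01-01'))
  FROM `my_project.my_dataset.reporting_table`
);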
Approach 2 :
You can trigger Cloud Run actions on BigQuery events such as insert, delete, update, etc. Whenever a new row gets inserted into your main table, you can use a Cloud Run action to insert that new data into your 2nd table.
You can follow this article, where a detailed solution is given.
If you wish to have the latest data always in your 2nd table then this would be a good way to do so.

Using Apache beam Python SDK to update BigQuery Tables

Consider the following scenario: In Google BigQuery there are two tables "User" and "Order"
User table:
user_id INTEGER
num_orders INTEGER
last_updated_time TIMESTAMP
Order table:
user_id INTEGER
order_id INTEGER
created_time TIMESTAMP
The field "User.num_orders" is the total number of orders a user have so far and there is a daily Dataflow pipeline routine (written as a Google Cloud Dataflow Template and executed through REST API) which aggregates new orders in the previous day and increments the counter "User.num_orders".
My questions are the following:
There seems to be no update option in the Python SDK. In the package beam.io.BigQueryDisposition there are only 5 options:
BigQueryDisposition options:
CREATE_NEVER
CREATE_IF_NEEDED
WRITE_TRUNCATE
WRITE_APPEND
WRITE_EMPTY
Right now I can think of a workaround solution by first summing new orders and appending these to an intermediate table "Intermediate"
Intermediate table:
user_id INTEGER
intermediate_num_orders INTEGER
created_time TIMESTAMP
At the end of the pipeline execution (after writing data to the "Intermediate" table), we then issue a BigQuery update query to increment the counter "User.num_orders". The workaround is less obvious but doable, and there is one pitfall in this method:
Because it is a two-step operation (1. write to the "Intermediate" table, 2. update the "User" table), we lose the atomicity property of the entire process, which means step 2 might fail and special care must be taken to avoid side effects (e.g. if step 2 fails and we re-run the pipeline, some orders might be accumulated multiple times; this can be avoided by using timestamps, but we as developers have to take care of it ourselves).
Regarding question 1: is this a recommended way to use BigQuery? From a data warehousing management point of view (academically...), would it be better to avoid table updates completely and use "append only" operations? For example, we can aggregate daily orders, append them to the "Intermediate" table, and compute the total on the fly (this costs more money compared to a table lookup if we store the overall value, and the records we currently have are at the scale of millions). This sounds like the event sourcing pattern to me: record every change as an event and compute the current state by applying the events on the fly.
p.s. the example above is an over-simplified version of our current application and is for illustration purposes only. The real-life situation is way more complicated.
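For reference, the second step of the workaround described above (the query that folds "Intermediate" into "User") might look roughly like the following sketch; the dataset name is a placeholder and the @run_date parameter is only an assumption about how the daily window would be passed in:

UPDATE my_dataset.User AS u
SET num_orders = u.num_orders + i.new_orders,
    last_updated_time = CURRENT_TIMESTAMP()
FROM (
  SELECT user_id, SUM(intermediate_num_orders) AS new_orders
  FROM my_dataset.Intermediate
  WHERE DATE(created_time) = @run_date   -- restrict to one day to guard against double-counting on re-runs
  GROUP BY user_id
) AS i
WHERE u.user_id = i.user_id;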

BigQuery update multi tables

I'm holding huge transaction data in daily tables, split according to the business date.
trascation_20140101
trascation_20140102
trascation_20140103..
The process flow is like this:
1. I'm loading the batch of new files that arrive into a temp table.
2. I group by the transcation_date field in order to work out which date each row belongs to; for each date I query the temp table on that date and insert the rows into the proper trasaction_YYYYMMDD table.
3. I'm doing part 2 in parallel in order to save time, because the temp table might contain data belonging to 20 different days.
My challenge is what to do if one of these processes fails and the others do not.
I can't run it all again, since that would cause duplications in the tables that have already been successfully updated.
I solve this issue by managing the updates myself, but it seems too complex.
Is this the best practice for dealing with multiple tables?
I would be glad to learn some best practices and understand how others deal with loading data into multiple tables according to the business date and not just the insert date (that part is easy).
You could add an extra step in the middle, where instead of moving directly from today's temp table into the permanent business-date tables, you extract into temporary daily tables and then copy the data over to the permanent tables.
Query from today's temp table, sharded by day into tmp_transaction_YYMMDD. Use WRITE_EMPTY or WRITE_TRUNCATE write disposition so that this step is idempotent.
Verify that all expected tmp_transaction_YYMMDD tables exist. If not, debug failures and go back to step 1.
Run parallel copy jobs from each tmp_transaction_YYMMDD table to append to the corresponding permanent transaction_YYMMDD table.
Verify copy jobs succeeded. If not, retry the individual failures from step 3.
Delete the tmp_transaction_YYMMDD tables.
The advantage of this is that you can catch query errors before affecting any of the end destination tables, then copy over all the added data at once. You may still have the same issue if the copy jobs fail, but they should be easier to debug and retry individually.
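A rough SQL-level sketch of steps 1 and 3 for a single day (here CREATE OR REPLACE TABLE stands in for a WRITE_TRUNCATE query job and INSERT INTO ... SELECT stands in for an appending copy job; table and column names are placeholders):

-- Step 1: shard today's temp table into a per-day staging table (safe to re-run)
CREATE OR REPLACE TABLE tmp_transaction_20140101 AS
SELECT *
FROM temp_table
WHERE transaction_date = '2014-01-01';

-- Step 3: append the staging table to the permanent table
INSERT INTO transaction_20140101
SELECT * FROM tmp_transaction_20140101;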
Our incentive for incremental load is cost, and therefore we are interested in "touching each record only once".
We use table decorators to identify the increment. We manage the increment timestamps independently and add them to the query at run time. It requires some logic to maintain, but nothing too complicated.
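In legacy BigQuery SQL, a table-decorator query might look roughly like this sketch (the relative millisecond offset is only an example; in practice the window is computed from the timestamps we track):

-- rows added in roughly the last hour (legacy SQL relative range decorator)
SELECT *
FROM [my_project:my_dataset.transactions@-3600000-]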

Join or storing directly

I have a table A which contains entries I am regularly processing and storing the result in table B. Now I want to determine for each entry in A its latest processing date in B.
My current implementation is joining both tables and retrieving the latest date. However an alternative, maybe less flexible, approach would be to simply store the date in table A directly.
I can think of pros and cons for both cases (performance, scalability, ...), but I haven't had such a case yet and would like to see whether someone here on Stack Overflow has had a similar situation and has a recommendation for either one for a specific reason.
Below a quick schema design.
Table A
id, some-data, [possibly-here-last-process-date]
Table B
fk-for-A, data, date
Thanks
Based on your description, it sounds like Table B is your historical (or archive) table and it's populated by batch.
I would leave Table A alone and just introduce an index on id and date. If the historical table is big, introduce an auto-increment PK for table B and have a separate table that maps the B-Pkid to A-pkid.
I'm not a fan of UPDATE on a warehouse table, that's why I didn't recommend a CURRENT_IND, but that's an alternative.
This is a fairly typical question; there are lots of reasonable answers, but there is only one correct approach (in my opinion).
You're basically asking "should I denormalize my schema?". I believe that you should denormalize your schema only if you really, really have to. The way you know you have to is because you can prove that - under current or anticipated circumstances - you have a performance problem with real-life queries.
On modern hardware, with a well-tuned database, finding the latest record in table B by doing a join is almost certainly not going to have a noticeable performance impact unless you have HUGE amounts of data.
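For a sense of scale, the join the question describes is typically no more than the following (table and column names follow the schema sketch above):

SELECT a.id, MAX(b.date) AS last_process_date
FROM table_a AS a
LEFT JOIN table_b AS b
  ON b.fk_for_a = a.id
GROUP BY a.id;

With an index on table_b(fk_for_a, date), this usually resolves to a cheap index lookup per row of A.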
So, my recommendation: create a test system, populate the two tables with twice as much data as the system will ever need, and run the queries you have on the production environment. Check the query plans, and see if you can optimize the queries and/or indexing. If you really can't make it work, de-normalize the table.
Whilst this may seem like a lot of work, denormalization is a big deal - in my experience, on a moderately complex system, denormalized data schemas are at the heart of a lot of stupid bugs. It makes introducing new developers harder, it means additional complexity at the application level, and the extra code means more maintenance. In your case, if the code which updates table A fails, you will be producing bogus results without ever knowing about it; an undetected bug could affect lots of data.
We had a similar situation in our project tracking system, where the latest state of the project is stored in the projects table (cols: project_id, description, etc.) and the history of the project is stored in the project_history table (cols: project_id, update_id, description, etc.). Whenever there is a new update to the project, we need to find out the latest update number and add 1 to it to get the sequence number for the next update. We could have done this by grouping the project_history table on the project_id column and getting MAX(update_id), but the cost would be high considering the number of project updates (a couple of hundred thousand) and the frequency of updates. So, we decided to store the value in the projects table itself, in a max_update_id column, and keep updating it whenever there is a new update to a given project. HTH.
If I understand correctly, you have a table whose each row is a parameter and another table that logs each parameter value historically in a time series. If that is correct, I currently have the same situation in one of the products I am building. My parameter table hosts a listing of measures (29K recs) and the historical parameter value table has the value for that parameter every 1 hr - so that table currently has 4M rows. At any given point in time there will be a lot more requests FOR THE LATEST VALUE than for the history so I DO HAVE THE LATEST VALUE STORED IN THE PARAMETER TABLE in addition to it being in the last record in the parameter value table. While this may look like duplication of data, from the performance standpoint it makes perfect sense because
To get a listing of all parameters and their CURRENT VALUE, I do not have to make a join and more importantly
I do not have to get the latest value for each parameter from such a huge table
So yes, I would in your case most definitely store the latest value in the parent table and update it every time new data comes in. It will be a little slower for writing new data but a hell of a lot faster for reads.
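A minimal sketch of that write path (table and column names are hypothetical; the point is that the history insert and the parent-table update happen in the same transaction):

BEGIN;

-- append the new reading to the history table
INSERT INTO parameter_value_history (parameter_id, value, recorded_at)
VALUES (42, 17.5, CURRENT_TIMESTAMP);

-- keep the denormalized "latest value" on the parent row in sync
UPDATE parameter
SET latest_value = 17.5,
    latest_value_at = CURRENT_TIMESTAMP
WHERE parameter_id = 42;

COMMIT;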

How should I keep accurate records summarising multiple tables?

I have a normalized database and need to produce web-based reports frequently that involve joins across multiple tables. These queries are taking too long, so I'd like to keep the results precomputed so that I can load pages quickly. There are frequent updates to the tables I am summarising, and I need the summary to reflect all updates so far.
All tables have auto-increment primary integer keys, and I almost always add new rows and can arrange to clear the computed results if they change.
I approached a similar problem where I needed a summary of a single table by arranging to iterate over each row in the table, keeping track of the iterator state and the highest primary key (i.e. "high-water mark") seen. That's fine for a single table, but for multiple tables I'd end up keeping one high-water value per table, and that feels complicated. Alternatively I could denormalise down to one table (with fairly extensive application changes), which feels like a step backwards and would probably change my database size from about 5GB to about 20GB.
(I'm using sqlite3 at the moment, but MySQL is also an option).
I see two approaches:
You move the data into a separate, denormalized database, with some precalculation, to optimize it for quick access and reporting (sounds like a small data warehouse). This implies you have to think of some jobs (scripts, a separate application, etc.) that copy and transform the data from the source to the destination. Depending on the way you want the copying to be done (full/incremental), the frequency of copying and the complexity of the data model (both source and destination), it might take a while to implement and then to optimize the process. It has the advantage that it leaves your source database untouched.
You keep the current database, but you denormalize it. As you said, this might imply changes in the logic of the application (but you might find a way to minimize the impact on the logic that uses the database; you know the situation better than me :) ).
Can the reports be refreshed incrementally, or is it a full recalculation to rework the report? If it has to be a full recalculation then you basically just want to cache the result set until the next refresh is required. You can create some tables to contain the report output (and metadata table to define what report output versions are available), but most of the time this is overkill and you are better off just saving the query results off to a file or other cache store.
If it is an incremental refresh then you need the PK ranges to work with anyhow, so you would want something like your high water mark data (except you may want to store min/max pairs).
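A minimal sketch of the high-water-mark variant in SQLite (table and column names are hypothetical; the upsert syntax requires SQLite 3.24+ and a unique key on daily_totals.day):

-- bookkeeping: report_state(last_id INTEGER), one row holding the highest source PK already summarised
-- summary:     daily_totals(day TEXT PRIMARY KEY, total_value REAL)

INSERT INTO daily_totals (day, total_value)
SELECT date(created_at), SUM(order_value)
FROM orders
WHERE id > (SELECT last_id FROM report_state)   -- only rows above the high-water mark
GROUP BY date(created_at)
ON CONFLICT(day) DO UPDATE SET total_value = total_value + excluded.total_value;

-- advance the high-water mark
UPDATE report_state SET last_id = (SELECT MAX(id) FROM orders);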
You can create triggers.
As soon as one of the calculated values changes, you can do one of the following:
Update the calculated field (Preferred)
Recalculate your summary table
Store a flag that a recalculation is necessary. The next time you need the calculated values check this flag first and do the recalculation if necessary
Example:
CREATE TRIGGER update_summary_table
AFTER UPDATE OF order_value ON orders
BEGIN
  UPDATE summary
  SET total_order_value = total_order_value
                          - old.order_value
                          + new.order_value;
  -- OR: Do a complete recalculation
  -- OR: Store a flag
END;
More Information on SQLite triggers: http://www.sqlite.org/lang_createtrigger.html
In the end I arranged for a single program instance to make all database updates, and maintain the summaries in its heap, i.e. not in the database at all. This works very nicely in this case but would be inappropriate if I had multiple programs doing database updates.
You haven't said anything about your indexing strategy. I would look at that first - making sure that your indexes are covering.
Then I think the trigger option discussed is also a very good strategy.
Another possibility is the regular population of a data warehouse with a model suitable for high performance reporting (for instance, the Kimball model).