How can a dbt model update data in another table (in another schema)? - dbt

The task is this: I have several models, and after each processing of a model I would like to increase a processing count by 1 in a separate table (which always contains 1 row). Please give me a hint on how best to do this.
I think this is a job for post_hook but I'd like a concrete example.
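For what it's worth, a minimal sketch of what such a post_hook could look like, assuming a pre-existing single-row counter table (the audit schema, table, and column names here are made up, and the counter table would need to be created beforehand, e.g. with a seed or an on-run-start hook):

-- models/my_model.sql  (hypothetical model)
{{ config(
    materialized='table',
    post_hook="update audit.processing_counter set run_count = run_count + 1"
) }}

-- the real model body goes here; a trivial placeholder:
select 1 as id

Keep in mind, as the answer to the next question notes for a similar case, that such a counter is not idempotent: every dbt run of the model increments it, including re-runs over the same data.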

Related

While doing an incremental load using dbt, I want to aggregate if the row exists, else insert

I am using dbt to incrementally load data from one schema in Redshift to another to create reports. In dbt there is a straightforward way to incrementally load data with an upsert. But instead of doing the traditional upsert, I want to sum (by the unique id, over the rest of the columns in the table) the incoming rows with the old rows in the destination table if they already exist, and otherwise just insert them.
Say for example I have a table.
T1(userid, total_deposit, total_withdrawal)
I have created a table that calculates the total deposit and total withdrawal for each user. When I run an incremental query I might get a new deposit or withdrawal for an existing user; in that case, I'll have to add the value to the existing row instead of replacing it with an upsert. And if the user is new, I just need to do a simple insert.
Any suggestion on how to approach this?
dbt is quite opinionated that invocations of dbt should be idempotent. This means that you can run the same command over and over again, and the result will be the same.
The operation you're describing is not idempotent, so you're going to have a hard time getting it to work with dbt out of the box.
As an alternative, I would break this into two steps:
Build an incremental model where you append the new activity
Create a downstream model that references the incremental model and performs the aggregations you need to calculate the balance for each customer. You could very carefully craft this as an incremental model with your user_id as the unique_key (since you have all of the raw transactions in #1), but I'd start without that and make sure it's absolutely necessary for performance reasons, since it will add a fair bit of complexity.
For more info on complex incremental materializations, I suggest this discourse post written by Tristan Handy, Founder & CEO at dbt Labs
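To make those two steps concrete, here is a rough sketch (all model, source, and column names below are invented for illustration):

-- models/transactions_append.sql: incremental model that only appends new activity
{{ config(materialized='incremental') }}

select userid, deposit, withdrawal, transaction_at
from {{ source('app', 'transactions') }}
{% if is_incremental() %}
  -- only pick up rows newer than what is already in the table
  where transaction_at > (select max(transaction_at) from {{ this }})
{% endif %}

-- models/user_balances.sql: downstream aggregation, rebuilt on every run
select
    userid,
    sum(deposit)    as total_deposit,
    sum(withdrawal) as total_withdrawal
from {{ ref('transactions_append') }}
group by userid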

how to retrieve all the data of a table and duplicate it in another table with ETL Pentaho

I am using Pentaho Data Integration and I would like to generate or duplicate the data from one table input to another table output in PDI.
OK, so you want to load table B with exactly the same data as in table A, twice or n times. The easiest way would be to create two transformations and a job calling the transformations.
The first transformation generates n rows (n being the number of times you want the data repeated) with a simple Data generator step; after the data generator, you put a Copy rows to result step.
The second transformation simply queries table A with a Table input step and inserts that data into table B with a Table output step.
Then you create a job with two Transformation job entries: the first transformation, then the second. In the second transformation's properties, you check the option Execute for every input row?, so the transformation is executed once for every row generated by the first transformation.
In the PDI installation directory there is a samples directory; inside it, jobs/shell for every row/ has an example of how to execute a job entry n times.
If you only want to duplicate the rows, I wouldn't bother with the first transformation; I would simply create the second transformation and call it twice in the job. It's not as elegant, but it's quicker if n is small, isn't going to change over time, and you don't need to reuse it in another process.
You've asked this question at least twice, without giving much detail. Do you want something like this:
Table A:
COLUMN1|COLUMN2
A|B
C|D
Table B:
COLUMN1|COLUMN2
A|B
A|B
C|D
C|D
In that case, the easiest way would be to create a transformation Table Input A -> Table Output B, and call that transformation twice in a job.
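If it helps to see it in plain SQL, each run of that transformation is roughly equivalent to the statement below (table and column names taken from the example above):

-- One execution of Table Input A -> Table Output B appends a full copy of A into B.
insert into table_b (column1, column2)
select column1, column2
from table_a;
-- Calling the transformation (or running this statement) twice produces the duplicated rows shown above.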

Schedule same query for 7 Projects

I've created a Query that selects a subset of data from one Projects view and overwrites a table to store this subset of data.
I now need to run that same query on 7 different but equally configured projects, all of which need to append their subset of data into a single aggregate dataset.
A big problem I see here is switching from overwrite to append, which adds complexity in figuring out the last time an append worked correctly and avoiding duplicate data.
I'm thinking to create a parameterized query that will take the project ID as an input and am looking to schedule it 7 times with the given project IDs.
However, maybe there's a better solution? My approach doesn't seem right. Maybe a sort of subselect / repeat for every parameter value could be used?
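One possible alternative, assuming all seven projects expose an identically named view (the project, dataset, view, and column names below are placeholders), is a single scheduled query that unions the seven sources and keeps overwriting the aggregate table, so no append bookkeeping is needed:

-- Scheduled query with the destination table set to WRITE_TRUNCATE:
-- every run rebuilds the aggregate from scratch, which keeps it idempotent.
select 'project-1' as source_project, col_a, col_b
from `project-1.analytics.subset_view`
union all
select 'project-2' as source_project, col_a, col_b
from `project-2.analytics.subset_view`
union all
select 'project-3' as source_project, col_a, col_b
from `project-3.analytics.subset_view`
-- ...and so on for the remaining projects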

SSIS performance in this case scenario

Can this kind of logic be implemented in SSIS, and is it possible to do it in near-real time?
Users are submitting tables with hundreds of thousands of records and, with the current implementation, waiting up to 1 hour for the results when the starting table has about 500,000 rows (after STEP 1 and STEP 2 we have millions of records). In the future the amount of data and the user base may grow drastically.
STEP 1
We have a table (A) of around 500,000 rows with the following main columns: ID, AMOUNT
We also have a table (B) with the prop.steps and the following main columns: ID_A, ID_B, COEF
TABLE A:
id,amount
a,1000
b,2000
TABLE B:
id_a,id_b,coef
a,a1,2
a1,a2,2
b,b1,5
We create new records from all 500,000 records in table A by multiplying the AMOUNT by the COEF:
OUTPUT TABLE:
id,amount
a,1000
a1,2000
a2,4000
b,2000
b1,10000
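For comparison, Step 1 can be sketched in T-SQL as a recursive CTE that walks the chain in table B (table and column names follow the example; this is only an illustration, not the SSIS implementation):

-- Start from the base amounts in table A and follow the id_a -> id_b links
-- in table B, multiplying by COEF at each hop; the original rows are kept too.
with propagated as (
    select id, amount
    from TableA
    union all
    select b.id_b as id, p.amount * b.coef as amount
    from propagated p
    join TableB b
      on b.id_a = p.id
)
select id, amount
from propagated
option (maxrecursion 0);  -- in case the chains are longer than the default limit of 100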
STEP 2
Following custom logic, we assign the amount of each record calculated above to other items (WC) using the coefficients below:
TABLE A
ID,AMOUNT
a,1000
a1,2000
a2,4000
b,2000
b1,10000
TABLE B
ID,WC,COEF
a,wc1,0.1
a,wc2,1
a,wc3,0.1
a1,wc4,1
a2,wc5,1
b,wc1,1
b1,wc1,1
b1,wc2,1
OUTPUT TABLE:
ID,WC,AMOUNT
a,wc1,100
a,wc2,1000
a,wc3,100
a1,wc4,2000
a2,wc5,4000
b,wc1,2000
b1,wc1,10000
b1,wc2,10000
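Step 2, by contrast, is a plain join and multiplication; a sketch with names taken from the example (Step1Output standing in for the result of the previous step):

-- Allocate each amount produced by Step 1 to its WC items using the
-- per-(ID, WC) coefficients in the second table B.
select s1.id,
       b.wc,
       s1.amount * b.coef as amount
from Step1Output s1
join TableB_WC b
  on b.id = s1.id;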
The other steps are just joins and arithmetic operations on the tables, and the overall number of records can't be reduced (the tables also have other fields with metadata).
In my personal experience, that kind of logic can be completely implemented in SSIS.
I would do it in a Script Task or Component, for two reasons:
First, if I understood correctly, you need an asynchronous task that outputs more data than its input. Scripts can handle multiple and different outputs.
Second, in the script you can implement all of those calculations, which would otherwise take a lot of components and relationships between them. Most importantly, the algorithmic complexity stays tied to your own algorithm design, which can be a huge boost to performance and scalability if you get the complexity right, and, if I understand it right again, those two aspects are fundamental here.
There are, though, some professionals that have a bad opinion of "complex" scripts and...
The downside of this approach is that you need some ability with .NET and programming, most of your package logic will be focused there, and script debugging can be more complex than with other components. But once you get to use the .NET features of SSIS, there is no turning back.
Usually, getting near real time in SSIS is tricky for big data sets, and sometimes you need to integrate other tools (e.g. StreamInsight) to achieve it.

Too many MasterDataSets?

I'm writing a program in Visual Studio 2010 which is using an Access Database. Right now it has 6 Master Data Sets.
Each Dataset has a single tabular connection. Would it be better if I instead used ONE MasterDataSet instead of the six, or should I continue to use each of the Master Data Sets?
Below is a copy of my Solution Explorer to indicate what I mean:
EDIT: Even better: if it would be better to merge them down into one, how would I go about starting this?
This depends.
If each dataset is used in a different form, it's better to keep them the way you have them.
If you put them all in a single dataset and initialize that dataset on a form that only uses one of the six tables, you will waste CPU loading the unwanted tables and memory holding them.
If, for example, you use two tables on one screen, it's better to combine them in a single dataset; the memory consumption won't differ from distributing them across two datasets with a single table each.
Also, if you have a relation between some tables, such as Employee and Department, and you want that data on a single form, it's better to bring the two tables into one dataset for display purposes, so the relation is ready and you don't have to build it in your code.