Checking of replicated data Pentaho - pentaho

I have about 100 tables to which we replicate data, e.g. from the Oracle database.
I would like to quickly check that the data replicated to the tables in db2 is the same as in the source system.
Does anyone have a way to do this? I can create 100 transformations, but that's monotonous and time consuming. I would prefer to process this in a loop.
I thought I would keep the queries in a table and reach into it for records.
I read the data with a Table input step (sql_db2, sql_source, table_name) and write it to Copy rows to result. Next I read a single record at a time and pass it into a loop.
But here I hit a problem: I don't know how to compare the data dynamically, because each table has different columns.
I don't know if this is even possible?

You can inject metadata (in this case the column and table names) into many steps in Pentaho. You create one transformation that collects the metadata and injects it into a second, template transformation. The template contains only the steps and some basic settings; most of the column information that the steps work on comes from the transformation that injects the metadata.
Check Pentaho official documentation about Metadata Injection (MDI) and the sample with a basic example of metadata injection available in your PDI installation.
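
Outside PDI, the control-table idea from the question (one row per table holding table_name, sql_source and sql_db2) can be sketched in a few lines. This is only an illustration, not a PDI transformation: two in-memory SQLite databases stand in for Oracle and DB2, and the helper names table_fingerprint and compare_all are made up for the example. The point is that a row count plus a hash over the rows does not depend on which columns a table has.

```python
import hashlib
import sqlite3


def table_fingerprint(conn, query):
    """Return (row_count, digest) for any query, whatever its columns are."""
    digest = hashlib.sha256()
    count = 0
    for row in conn.execute(query):
        # repr() of the row tuple is a column-agnostic serialization
        digest.update(repr(row).encode("utf-8"))
        count += 1
    return count, digest.hexdigest()


def compare_all(control_rows, source_conn, target_conn):
    """control_rows: iterable of (table_name, sql_source, sql_target)."""
    results = {}
    for table_name, sql_source, sql_target in control_rows:
        results[table_name] = (
            table_fingerprint(source_conn, sql_source)
            == table_fingerprint(target_conn, sql_target)
        )
    return results


# Assumption: SQLite stands in for the real source and target systems.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")
for conn in (source, target):
    conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
    conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, customer_id INTEGER)")
    conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])
source.executemany("INSERT INTO orders VALUES (?, ?, ?)", [(10, 9.5, 1), (11, 20.0, 2)])
target.executemany("INSERT INTO orders VALUES (?, ?, ?)", [(10, 9.5, 1)])  # one row missing

# The "control table": both queries must return rows in the same order.
control = [
    ("customers", "SELECT * FROM customers ORDER BY id", "SELECT * FROM customers ORDER BY id"),
    ("orders", "SELECT * FROM orders ORDER BY id", "SELECT * FROM orders ORDER BY id"),
]
report = compare_all(control, source, target)
print(report)
```

With two different database products, the drivers may return the same value as different types (for example decimals versus floats), so the queries would likely need to cast columns to a common representation before hashing.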

Related

ADF - How should I copy table data from source Azure SQL Database to 6 other Azure SQL Databases?

We curate data in the "Dev" Azure SQL Database and currently use RedGate's Data Compare tool to push it up to 6 higher Azure SQL Databases. I am trying to migrate that manual process to ADFv2 and would like to avoid copy/pasting the 10+ copy data activities for each database (x6), to keep it more maintainable for future changes. The static tables have some customization in the copy data activity, but the basic idea follows this post to perform an upsert.
How can the implementation described above be done in Azure Data Factory?
I was imagining something like the following:
Using one parameterized linked service that has the server name & database name configurable to generate a dynamic connection to Azure SQL Database.
Creating a pipeline for each table's copy data activity.
Creating a master pipeline to then nest each table's pipeline in.
Using variables to loop over the different connections and passing those to the sub-pipelines' parameters.
Not sure if that is the most efficient plan or even works yet. Other ideas/suggestions?
We cannot tell you whether that's the most efficient plan, but I think it is. Just make it work.
As you said in the comment:
we can use Dynamic Pipelines - Copy multiple tables in Bulk with 'Lookup' & 'ForEach'. we can perform dynamic copies of your data table lists in bulk within a single pipeline. Lookup returns either the lists of data or first row of data. ForEach - @activity('Azure SQL Table lists').output.value ; @concat(item().TABLE_SCHEMA,'.',item().TABLE_NAME,'.csv') This is efficient and cost optimized since we are using less number of activities and datasets.
Usually we would choose the same solution as you: dynamic parameters/pipelines, with Lookup + ForEach activities to achieve the scenario. In short, give the pipeline solid logic and keep it simple and efficient.
Added the same info mentioned in the Comment as Answer.
Yup, we can use Dynamic Pipelines - Copy multiple tables in Bulk with 'Lookup' & 'ForEach'.
We can perform dynamic copies of your data table lists in bulk within a single pipeline. Lookup returns either the list of rows or the first row of data.
ForEach - @activity('Azure SQL Table lists').output.value ;
@concat(item().TABLE_SCHEMA,'.',item().TABLE_NAME,'.csv')
This is efficient and cost-optimized since we are using fewer activities and datasets.
Attached pic as ref-
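
To make the Lookup + ForEach flow concrete, here is a small plain-Python emulation of what the two expressions evaluate to. It is not ADF pipeline JSON; lookup_tables, sink_file_name and for_each are hypothetical names, and the two rows are invented sample data standing in for the Lookup activity's output.value.

```python
def lookup_tables():
    """Stand-in for the Lookup activity's output.value (invented sample rows)."""
    return [
        {"TABLE_SCHEMA": "dbo", "TABLE_NAME": "Customer"},
        {"TABLE_SCHEMA": "sales", "TABLE_NAME": "Orders"},
    ]


def sink_file_name(item):
    # Mirrors concat(item().TABLE_SCHEMA,'.',item().TABLE_NAME,'.csv')
    return f"{item['TABLE_SCHEMA']}.{item['TABLE_NAME']}.csv"


def for_each(items, copy_one):
    """Stand-in for ForEach: run one parameterized copy per item."""
    return [copy_one(item) for item in items]


file_names = for_each(lookup_tables(), sink_file_name)
print(file_names)  # ['dbo.Customer.csv', 'sales.Orders.csv']
```

One parameterized copy step driven by a table list replaces one hand-built copy activity per table, which is where the saving in activities and datasets comes from.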

Do while loop with GPDB using talend

I have a very large data set in GPDB from which I need to extract close to 3.5 million records. I use this for a flatfile which is then used to load to different tables. I use Talend, and do a select * from table using the tgreenpluminput component and feed that to a tfileoutputdelimited. However due to the very large volume of the file, I run out of memory while executing it on the Talend server.
I lack superuser permissions and am unable to do a \copy to output the data to a CSV file. I think something like a do-while or a tLoop with a more limited number of rows might work for me. But my table doesn't have any row_id or uid to distinguish the rows.
Please help me with suggestions how to solve this. Appreciate any ideas. Thanks!
If your requirement is to load data from one table into different tables, then you do not need to load it into a file first and then from the file into the tables.
There is a component named tGreenplumRow which allows you to write direct sql queries (DDL and DML queries) in it.
Below is a sample job,
Notice that there are three insert statements inside this component. They are separated by semicolons and executed one after another.
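
The idea of running several semicolon-separated INSERT ... SELECT statements inside the database, so that no rows pass through the client's memory, can be sketched as follows. This is an assumption-laden illustration: SQLite stands in for Greenplum, Python's executescript stands in for tGreenplumRow, and every table and column name is invented.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for Greenplum.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE big_source (id INTEGER, region TEXT, amount REAL);
    CREATE TABLE target_east (id INTEGER, amount REAL);
    CREATE TABLE target_west (id INTEGER, amount REAL);
    CREATE TABLE target_totals (region TEXT, total REAL);
    INSERT INTO big_source VALUES (1, 'east', 10.0), (2, 'west', 5.0), (3, 'east', 2.5);
""")

# Three statements separated by semicolons, executed one after another.
# The rows move from table to table inside the database engine.
conn.executescript("""
    INSERT INTO target_east SELECT id, amount FROM big_source WHERE region = 'east';
    INSERT INTO target_west SELECT id, amount FROM big_source WHERE region = 'west';
    INSERT INTO target_totals SELECT region, SUM(amount) FROM big_source GROUP BY region;
""")

east_count = conn.execute("SELECT COUNT(*) FROM target_east").fetchone()[0]
west_count = conn.execute("SELECT COUNT(*) FROM target_west").fetchone()[0]
totals = dict(conn.execute("SELECT region, total FROM target_totals"))
print(east_count, west_count, totals)
```

Because the client only sends SQL text, its memory use does not grow with the number of rows, which is the problem the flat-file approach ran into.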

PDI or mysqldump to extract data without blocking the database nor getting inconsistent data?

I have an ETL process that will run periodically. I was using kettle (PDI) to extract the data from the source database and copy it to a stage database. For this I use several transformations with table input and table output steps. However, I think I could get inconsistent data if the source database is modified during the process, since this way I don't get a snapshot of the data. Furthermore, I don't know if the source database would be blocked. This would be a problem if the extraction takes some minutes (and it will take them). The advantage of PDI is that I can select only the necessary columns and use timestamps to get only the new data.
On the other hand, I think mysqldump with --single-transaction lets me get the data consistently without blocking the source database (all tables are InnoDB). The disadvantage is that I would get unnecessary data.
Can I use PDI, or do I need mysqldump?
PS: I need to read specific tables from specific databases, so I think xtrabackup is not a good option.
However, I think I could get inconsistent data if the source database is modified during the process, since this way I don't get a snapshot of the data
I think the "Table Input" step doesn't take into account any modifications that happen while you are reading. Try a simple experiment:
Take a .ktr file with a single Table Input and Table Output. Start loading the data into the target table. While the load is running, insert a few records into the source database. You will find that those records are not read into the target table. (Note: I tried this with a PostgreSQL database, and the number of rows read was 1,000,000.)
Now for your question: I suggest using PDI, since it gives you more control over the data in terms of versioning, sequences, SCDs and all the DW/BI-related activities. PDI makes it easier to load to the stage environment than simply dumping entire tables.
Hope it helps :)
Interesting point. If you do all the table inputs in one transformation, then at least they all start at the same time, but while the result is likely to be consistent, it's not guaranteed.
There is no reason you can't use PDI to orchestrate the process AND use mysqldump. In fact, for bulk insert or extract it's nearly always better to use the vendor-provided tools.
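
As a rough sketch of the combined approach, an orchestrating script (or a PDI job calling a shell step) could assemble a mysqldump call that is limited to specific tables and uses --single-transaction. The function name build_mysqldump_command and the database, table and user names are hypothetical; the block only builds the argument list and does not run anything.

```python
def build_mysqldump_command(database, tables, user, host="localhost"):
    """Build the argument list for a consistent dump of selected InnoDB tables.

    Assumption: the password comes from an option file or the environment,
    not from the command line.
    """
    return [
        "mysqldump",
        "--single-transaction",  # consistent snapshot without locking InnoDB tables
        f"--host={host}",
        f"--user={user}",
        database,
        *tables,                 # only the tables that are actually needed
    ]


command = build_mysqldump_command("shop", ["orders", "customers"], user="etl")
print(" ".join(command))
```

The list can be passed to subprocess.run with stdout redirected to a file. Restricting the dump to named tables removes part of the "unnecessary data" concern, although column selection still has to happen after the load into the stage database.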

How to create a Temporary Table using (Select * into ##temp from table) syntax(For MS SQL) using Pentaho data integration

When I use the above syntax in the "Execute row script" step, it reports success but the temporary table is not created. Please help me with this.
Yes, the behavior you're seeing is exactly what I would expect. It works fine from the TSQL prompt, throws no error in the transform, but the table is not there after transform completes.
The problem here is the execution model of PDI transforms. When a transform is run, each step gets its own thread of execution. At startup, any step that needs a DB connection is given its own unique connection. After processing finishes, all steps disconnect from the DB. This includes the connection that defined the temp table. Once that happens (the defining connection goes out of scope), the temp table vanishes.
Note, that this means in a transform (as opposed to a Job), you cannot assume a specific order of completion of anything (without Blocking Steps).
We still don't have many specifics about what you're trying to do with this temp table and how you're using its data, but I suspect you want its contents to persist outside your transform. In that case, you have some options, but a global temp table like this simply won't work.
Options that come to mind:
Convert the temp table to a permanent table. This is the simplest solution; you're basically making a staging table, loading it with a Table Output step (or whatever), and then reading it with Table Input steps in other transforms.
Write the table contents to a temp file with something like a Text File Output or Serialize to File step, then read it back in from the other transforms.
Store rows in memory. This involves wrapping your transforms in a Job, and using the Copy Rows to Results and Get Rows from Results steps.
Each of these approaches has its own pros and cons. For example, storing rows in memory will be faster than writing to disk or network, but memory may be limited.
Depending on what you're doing, another step you might need is the ETL Metadata Injection step. In many cases it lets you dynamically pass metadata from one transform to another. See the docs for descriptions of how each of these works.
If you'd like further assistance here, or I've made a wrong assumption, please edit your question and add as much detail as you can.
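
The connection-lifetime effect described above can be reproduced in miniature. This is a sketch under a loud assumption: SQLite stands in for SQL Server, and SQLite temp tables are private to their connection, whereas a ## global temp table is visible to other sessions while it exists. Only the lifetime behaviour is being illustrated: the temp table goes away when the connection that defined it closes, while a permanent staging table survives. The names step_a, step_b and table_exists are invented for the example.

```python
import os
import sqlite3
import tempfile

db_path = os.path.join(tempfile.mkdtemp(), "stage.db")


def table_exists(conn, name):
    # sqlite_master lists permanent tables; sqlite_temp_master lists the
    # temp tables that belong to this connection only.
    query = (
        "SELECT name FROM sqlite_master WHERE type='table' AND name=? "
        "UNION ALL "
        "SELECT name FROM sqlite_temp_master WHERE type='table' AND name=?"
    )
    return conn.execute(query, (name, name)).fetchone() is not None


# "Step A" gets its own connection, like a PDI step that runs the script.
step_a = sqlite3.connect(db_path)
step_a.execute("CREATE TABLE source_rows (id INTEGER)")
step_a.execute("INSERT INTO source_rows VALUES (1), (2)")
step_a.execute("CREATE TEMP TABLE temp_rows AS SELECT * FROM source_rows")
step_a.execute("CREATE TABLE staging_rows AS SELECT * FROM source_rows")
step_a.commit()
temp_seen_by_creator = table_exists(step_a, "temp_rows")
step_a.close()  # the defining connection goes away, and the temp table with it

# "Step B" connects afterwards, like a later transform.
step_b = sqlite3.connect(db_path)
temp_seen_later = table_exists(step_b, "temp_rows")
staging_seen_later = table_exists(step_b, "staging_rows")
step_b.close()

print(temp_seen_by_creator, temp_seen_later, staging_seen_later)
```

This corresponds to the first option in the list: a permanent staging table is the only one of the two that a later transform can still read.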

SSIS storing logging variables in a derived column

I am developing SSIS packages that consist of 2 main steps:
Step 1: Grab all sorts of data from existing legacy systems and dump them into a series of staging tables in my database.
Step 2: Move the data from my staging tables into a more relational set of tables that I'm using specifically for my project.
In step 1 I'm just doing a bulk SELECT and a bulk INSERT; however, in step 2 I'm doing row-by-row inserts into my tables using OLEDB Command tasks so that I can log very specific row-level activity of everything that's happening. Here is my general layout for step 2 processes.
[Screenshot: http://dl.dropbox.com/u/2468578/screenshots/step_1.png]
You'll notice 3 OLEDB tasks: 1 for the actual INSERT, and 2 for success/fail INSERTs into our logging table.
The main thing I'm logging is source table/id and destination table/id for each row that passes through this flow. I'm storing this stuff in variables and adding them to the data flow using a Derived Column so that I can easily map them to the query parameters of the stored procedures.
[Screenshot: http://dl.dropbox.com/u/2468578/screenshots/step_3.png]
I've decided to store these logging values in variables instead of hard-coding the values in the SqlCommand field on the task, because I'm pretty sure you CAN'T put variable expressions in that field (i.e. exec storedproc @[User::VariableName],... ,... ,...). So, this is the best solution I've found.
[Screenshot: http://dl.dropbox.com/u/2468578/screenshots/step_2.png]
Is this the best solution? Probably not.
Is it good performance wise to add 4 logging columns to a data flow that consists of 500,000 records? Probably not.
Can you think of a better way?
I really don't think calling an OLEDBCommand 500,000 times is going to be performant.
If you are already going to staging tables - load it all to a staging table and take it from there in T-SQL or even another dataflow (or to a raw file and then something else depending on your complete operation). A Bulk insert is going to be hugely more efficient.
To add to Cade's answer: if you truly need the logging info on a row-by-row basis, your best bet is to use the OLE DB Destination and one or both of the following transformations to add columns to the data flow:
Derived Column Transformation
Audit Transformation
This should be your best bet and shouldn't add much overhead.
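
To illustrate the set-based alternative in general terms, the sketch below loads a destination table from staging in one statement and writes the row-level log in a second one, with the source table name added as a constant column in the SELECT (the SQL analogue of a Derived Column). It is not SSIS: SQLite stands in for SQL Server, and all table and column names (staging_customer, customer, load_log) are invented.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for SQL Server.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staging_customer (legacy_id INTEGER, name TEXT);
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name TEXT,
        source_table TEXT,   -- audit columns carried with each row
        source_id INTEGER
    );
    CREATE TABLE load_log (source_table TEXT, source_id INTEGER,
                           dest_table TEXT, dest_id INTEGER);
    INSERT INTO staging_customer VALUES (101, 'Ann'), (102, 'Bob'), (103, 'Cy');
""")

with conn:  # one transaction for the load and its log
    # One bulk statement instead of one command call per row.
    conn.execute("""
        INSERT INTO customer (name, source_table, source_id)
        SELECT name, 'staging_customer', legacy_id
        FROM staging_customer
        ORDER BY legacy_id
    """)
    # Row-level logging is still available, but written set-based.
    conn.execute("""
        INSERT INTO load_log (source_table, source_id, dest_table, dest_id)
        SELECT source_table, source_id, 'customer', customer_id
        FROM customer
        WHERE source_table = 'staging_customer'
    """)

log_rows = conn.execute(
    "SELECT source_id, dest_id FROM load_log ORDER BY source_id"
).fetchall()
print(log_rows)
```

Each source id still maps to exactly one destination id in the log, so the row-level traceability is kept while the per-row command calls disappear.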