joining two table inputs in a single transaction - pentaho

Hi, I am trying to create a Kettle transformation that reads data from two tables and then joins them. This seems like a very simple, basic transformation, but I run into a problem when I try to do it in a single transaction, i.e. with "Make the transformation database transactional" enabled in the transformation settings. The following exception is reported:
com.mysql.jdbc.RowDataDynamic@7c02dce0 is still active. No statements may be issued when any streaming result sets are open and in use on a given connection. Ensure that you have called .close() on any active streaming result sets before attempting more queries.

If both tables come from the same database, use a single Table input step and do the join in its SQL query. You will then not need the Merge join step, only one result set is open on the connection, and the transaction will be atomic.
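As a rough illustration of pushing the join into the query, here is a minimal Python sketch using an in-memory SQLite database. The tables `customers` and `orders` and their columns are hypothetical, since the question does not name its tables; the same SELECT would go into the SQL box of one Table input step.

```python
import sqlite3

# Hypothetical schema: the original question does not name its tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.5);
""")

# One statement, one result set: the database performs the join,
# so no second query has to be issued while the first is still streaming.
JOIN_SQL = """
    SELECT c.name, o.id AS order_id, o.total
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    ORDER BY o.id
"""

joined_rows = conn.execute(JOIN_SQL).fetchall()
for row in joined_rows:
    print(row)
```

With the join done by the database, the transformation only has to read one stream of rows.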

Related

Checking of replicated data Pentaho

I have about 100 tables to which we replicate data, e.g. from an Oracle database.
I would like to quickly check that the data replicated to the tables in DB2 is the same as in the source system.
Does anyone have a way to do this? I could create 100 transformations, but that's monotonous and time-consuming. I would prefer to process this in a loop.
My idea was to keep the queries in a table and read them from there.
I read the data with a Table input step (sql_db2, sql_source, table_name) and write it to Copy rows to result. Then I read a single record at a time and pass it into a loop.
But here I hit a problem: I don't know how to compare the data dynamically, because each table has different columns.
Is this even possible?
You can inject metadata (in this case, the column and table names) into many steps in Pentaho. You create one transformation that collects the metadata and injects it into a second, template transformation. The template contains only the steps and some basic information; most of the information about the columns handled by each step is supplied by the transformation that injects the metadata.
Check the official Pentaho documentation on Metadata Injection (MDI) and the basic metadata injection sample that ships with your PDI installation.
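The underlying idea, one generic comparison driven by a list of table names, can be sketched outside PDI. The Python below is not Metadata Injection itself, only an illustration of the same pattern. It assumes two in-memory SQLite databases standing in for the source and the replica, and the table names, columns and helper functions are all hypothetical. Fetching every row is only reasonable for small tables; for large ones you would compare counts or checksums computed in the database.

```python
import sqlite3

def table_fingerprint(conn, table):
    """Return (row_count, sorted rows) for a table, whatever its columns are."""
    # Table names come from a trusted metadata list; never interpolate untrusted input.
    rows = conn.execute(f'SELECT * FROM "{table}"').fetchall()
    return len(rows), sorted(rows, key=repr)

def compare_tables(source, replica, tables):
    """Map each table name to True when source and replica hold the same rows."""
    return {
        t: table_fingerprint(source, t) == table_fingerprint(replica, t)
        for t in tables
    }

# Hypothetical stand-ins for the source system and the replicated database.
source = sqlite3.connect(":memory:")
replica = sqlite3.connect(":memory:")
for db in (source, replica):
    db.executescript("""
        CREATE TABLE customers (id INTEGER, name TEXT);
        CREATE TABLE products (sku TEXT, price REAL, stock INTEGER);
        INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
        INSERT INTO products VALUES ('A-1', 9.5, 3), ('B-2', 20.0, 0);
    """)

# Simulate a replication drift in one table only.
replica.execute("UPDATE products SET stock = 1 WHERE sku = 'B-2'")

report = compare_tables(source, replica, ["customers", "products"])
print(report)
```

The list of table names plays the role of the metadata: adding a table to the check means adding a row to that list, not building another transformation.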

PDI or mysqldump to extract data without blocking the database nor getting inconsistent data?

I have an ETL process that will run periodically. I was using Kettle (PDI) to extract the data from the source database and copy it to a staging database. For this I use several transformations with Table input and Table output steps. However, I think I could get inconsistent data if the source database is modified during the process, since this way I don't get a snapshot of the data. Furthermore, I don't know whether the source database would be blocked. This would be a problem if the extraction takes some minutes (and it will). The advantage of PDI is that I can select only the necessary columns and use timestamps to get only the new data.
On the other hand, I think mysqldump with --single-transaction lets me get the data consistently without blocking the source database (all tables are InnoDB). The disadvantage is that I would get unnecessary data.
Can I use PDI, or do I need mysqldump?
PS: I need to read specific tables from specific databases, so I don't think xtrabackup is a good option.
"However, I think I could get inconsistent data if the source database is modified during the process, since this way I don't get a snapshot of the data."
I think the Table input step doesn't pick up modifications made while it is reading. Try a simple experiment:
Take a .ktr file with a single Table input and a Table output, and start loading the data into the target table. While the load is in progress, insert a few records into the source database. You will find that those records are not read into the target table. (Note: I tried this with a PostgreSQL database, and the number of rows read was 1,000,000.)
As for your question, I suggest using PDI, since it gives you more control over the data in terms of versioning, sequences, SCDs and all the other DW/BI-related activities. PDI makes it easier to load into the staging environment than simply dumping entire tables.
Hope it helps :)
Interesting point. If you do all the Table inputs in one transformation, then at least they all start at the same time, but while the result is likely to be consistent, that is not guaranteed.
There is no reason you can't use PDI to orchestrate the process and use mysqldump as well. In fact, for bulk inserts or extracts it is nearly always better to use the vendor-provided tools.

merge sql statement from 2 databases

I'm trying to merge 2 tables from 2 databases on 2 different servers.
For now, I have created a linked server on one of the servers and I use a query like this:
MERGE INTO tablename1 as T1
using linkedservername.dbname.tablename2 as T2 ON
WHEN MATCHED THEN
UPDATE SET ...
WHEN NOT MATCHED THEN
INSERT ...
I would like to know if there is a way to do this without creating a linked server.
There are three general ways to do this in SSIS, and there is a lot more information online.
Either way, you first need to create a connection manager in SSIS pointing directly at the server that is currently your linked server. Start with that.
Then create a data flow task in which you select from dbname.tablename2 in a data flow source.
From there you can do it a few ways:
A. Staging Table
Dump that result into a staging table, then run your merge statement locally in a subsequent SQL Task. This is usually the quickest (and simplest) way, unless you aren't allowed to create tables or data in the target.
B. Lookup
Use a lookup in your data flow to identify whether the record exists, followed by an OLE DB destination (for inserts) or an OLE DB command (for updates).
This is generally slow because both the lookup and the update are inefficient.
C. Row-level merge
Feed the result into an OLE DB command and put your merge directly in there.
This is probably the slowest.
If you want more info, get your connection manager sorted and post back.
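Option A (staging table, then a local merge) can be sketched in plain Python with SQLite standing in for both servers. This is only an illustration under stated assumptions: the columns `id` and `name` are hypothetical because the original MERGE elides them, and the merge is written as an UPDATE followed by an INSERT, which mirrors WHEN MATCHED / WHEN NOT MATCHED, instead of a T-SQL MERGE statement.

```python
import sqlite3

# Two separate connections stand in for the two servers; no linked server is involved.
local = sqlite3.connect(":memory:")
remote = sqlite3.connect(":memory:")
remote.executescript("""
    CREATE TABLE tablename2 (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO tablename2 VALUES (1, 'alpha-new'), (3, 'gamma');
""")
local.executescript("""
    CREATE TABLE tablename1 (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE staging (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO tablename1 VALUES (1, 'alpha'), (2, 'beta');
""")

# Step 1 (the data flow): copy the remote rows into the local staging table.
remote_rows = remote.execute("SELECT id, name FROM tablename2").fetchall()
with local:
    local.execute("DELETE FROM staging")
    local.executemany("INSERT INTO staging (id, name) VALUES (?, ?)", remote_rows)

# Step 2 (the SQL task): merge staging into the target in one local transaction.
with local:
    # WHEN MATCHED: update rows that already exist in the target.
    local.execute("""
        UPDATE tablename1
        SET name = (SELECT s.name FROM staging AS s WHERE s.id = tablename1.id)
        WHERE id IN (SELECT id FROM staging)
    """)
    # WHEN NOT MATCHED: insert rows that exist only in staging.
    local.execute("""
        INSERT INTO tablename1 (id, name)
        SELECT s.id, s.name FROM staging AS s
        WHERE s.id NOT IN (SELECT id FROM tablename1)
    """)

merged = local.execute("SELECT id, name FROM tablename1 ORDER BY id").fetchall()
print(merged)
```

Because the merge runs entirely on the local side, the only cross-server work is the bulk copy into staging.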

How to resume data migration from the point where error happened in ssis?

I am migrating data from an Oracle database to a SQL Server 2008 R2 database using SSIS. My problem is that the package fails at a certain point, say after some 40,000 rows out of 100,000. What can I do so that the next time I run the package, after correcting the errors, it restarts from the 40,001st row, i.e. the row where the error occurred?
I have tried using checkpoints in SSIS, but the problem is that they only work between different control flow tasks. I want something that works on the rows being transferred.
There's no native magic I'm aware of that is going to "know" that it failed on row 40,000 and when it restarts, it should start streaming row 40,001. You are correct that checkpoints are not the answer and have plenty of their own issues (can't serialize Object types, loops restart, etc).
How you can address the issue is through good design. If your package is created with the expectation that it's going to fail, then you should be able to handle these scenarios.
There are two approaches I'm familiar with. The first approach is to add a Lookup Transformation in the Data Flow between your source and your destination. The goal of this is to identify what records exist in the target system. If no match is found, then only those rows will be sent on to destination. This is a very common pattern and will allow you to also detect changes between source and destination (if that is a need). The downside is that you will always be transferring the full data set out of the source system and then filtering rows in the data flow. If it failed on row 99,999 out of 1,000,000 you will still need to stream all 1,000,000 rows back to SSIS for it to find the 1 that hasn't been sent.
The other approach is to use a dynamic filter in the WHERE clause of your source query. If you can make assumptions, such as that the rows are inserted in order, then you can structure your SSIS package to begin with an Execute SQL Task that runs a query like this against the destination database:
SELECT COALESCE(MAX(SomeId), 0) + 1 AS startingPoint FROM dbo.MyTable
Assign the result to an SSIS variable (@[User::StartingId]). You then use an expression to build the source select statement, something like:
"SELECT * FROM dbo.MyTable T WHERE T.SomeId >= " + (DT_WSTR, 10) @[User::StartingId]
The comparison is >= because startingPoint is already one past the highest loaded ID. Now when the data flow begins, it will start where it last loaded data. The challenge with this approach is finding those scenarios where you know data hasn't been inserted out of order.
Let me know if you have questions, need things better explained, pictures, etc. Also, above code is freehanded so there could be syntax errors but the logic should be correct.
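The second approach can be tried end to end with a small Python sketch, using SQLite in place of Oracle and SQL Server. It assumes IDs are inserted in increasing order, as the answer requires; the `Payload` column and the `load_remaining` helper are hypothetical. The starting point is already MAX + 1, so the source filter uses `>=`, and the value is passed as a bound parameter instead of being concatenated into the SQL string.

```python
import sqlite3

def load_remaining(source, destination):
    """Copy rows the destination does not have yet; return (starting_id, rows_copied)."""
    # Same query as the Execute SQL Task: one past the highest ID already loaded.
    starting_id = destination.execute(
        "SELECT COALESCE(MAX(SomeId), 0) + 1 FROM MyTable"
    ).fetchone()[0]
    rows = source.execute(
        "SELECT SomeId, Payload FROM MyTable WHERE SomeId >= ? ORDER BY SomeId",
        (starting_id,),
    ).fetchall()
    with destination:  # commit the batch, or roll it back on error
        destination.executemany(
            "INSERT INTO MyTable (SomeId, Payload) VALUES (?, ?)", rows
        )
    return starting_id, len(rows)

source = sqlite3.connect(":memory:")
destination = sqlite3.connect(":memory:")
for db in (source, destination):
    db.execute("CREATE TABLE MyTable (SomeId INTEGER PRIMARY KEY, Payload TEXT)")

# The source holds 10 rows; a previous, failed run loaded only the first 4.
source.executemany(
    "INSERT INTO MyTable VALUES (?, ?)", [(i, f"row {i}") for i in range(1, 11)]
)
destination.executemany(
    "INSERT INTO MyTable VALUES (?, ?)", [(i, f"row {i}") for i in range(1, 5)]
)

first_run = load_remaining(source, destination)   # resumes at ID 5
second_run = load_remaining(source, destination)  # nothing left to copy
print(first_run, second_run)
```

Running the load again after it has caught up copies nothing, which is what makes the package safe to restart.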

How to do Data Flow Task from/to the same table?

I am using SQL Server 2005 SSIS, and we use the Data Flow Task to move data from one table to another. This works well. Now we have another requirement: to update data in the same table using this approach.
Is it possible to use the same approach as follows?
We get a dataset from Table A based on a complex query.
We write the updates back to Table A.
A plain UPDATE query is not an option because it takes a while to process and we can't see the data movement as we can with the Data Flow Task.
Any guidance would be appreciated.
Thanks
Either:
Write it to a temporary table and do the update with a single SQL task after you have processed everything.
Break it down into smaller chunks based on SSIS variables and OFFSET, and use a FOR/FOREACH loop.
Read the data with a data source in a data flow task, and use an OLE DB command in the data flow to update the data in the same table. If there is no locking when you read and only row-level locking when you update, that should work.
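The chunking option can be illustrated with a short Python sketch against SQLite. Note the swap: it walks the primary key in fixed ranges instead of using OFFSET, which is a different way of getting the same batches. The `amount` column, the doubling update and the `update_in_chunks` helper are hypothetical placeholders for the real complex query. Each chunk is committed separately and reports its progress, which gives the visibility a single large UPDATE lacks.

```python
import sqlite3

def update_in_chunks(conn, chunk_size):
    """Apply the update one key range at a time; return the number of chunks run."""
    max_id = conn.execute("SELECT COALESCE(MAX(id), 0) FROM TableA").fetchone()[0]
    chunks = 0
    for low in range(1, max_id + 1, chunk_size):
        high = low + chunk_size - 1
        with conn:  # one short transaction per chunk
            cur = conn.execute(
                "UPDATE TableA SET amount = amount * 2 WHERE id BETWEEN ? AND ?",
                (low, high),
            )
        chunks += 1
        print(f"ids {low}-{high}: {cur.rowcount} rows updated")
    return chunks

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TableA (id INTEGER PRIMARY KEY, amount INTEGER)")
conn.executemany("INSERT INTO TableA VALUES (?, ?)", [(i, i) for i in range(1, 11)])
conn.commit()

chunks_done = update_in_chunks(conn, chunk_size=4)
total = conn.execute("SELECT SUM(amount) FROM TableA").fetchone()[0]
print(chunks_done, total)
```

In SSIS the loop bounds would live in package variables and the loop itself in a For Loop container.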