How to resume data migration from the point where an error happened in SSIS?

I am migrating data from an Oracle database to a SQL Server 2008 R2 database using SSIS. My problem is that at a certain point the package fails, say at some 40,000 rows out of 100,000. After correcting the errors, how can I make the package restart from the 40,001st row, i.e., the row where the error occurred, the next time I run it?
I have tried using checkpoints in SSIS, but they only work between control flow tasks. I want something that works on the rows being transferred.

There's no native magic I'm aware of that is going to "know" that it failed on row 40,000 and, on restart, should start streaming at row 40,001. You are correct that checkpoints are not the answer, and they have plenty of their own issues (can't serialize Object types, loops restart, etc.).
How you can address the issue is through good design. If your package is created with the expectation that it's going to fail, then you should be able to handle these scenarios.
There are two approaches I'm familiar with. The first is to add a Lookup Transformation in the Data Flow between your source and your destination. The goal is to identify which records already exist in the target system; only the rows with no match are sent on to the destination. This is a very common pattern and will also let you detect changes between source and destination (if that is a need). The downside is that you will always transfer the full data set out of the source system and then filter rows in the data flow. If it failed on row 99,999 out of 1,000,000, you will still need to stream all 1,000,000 rows back to SSIS for it to find the ones that haven't been sent.
The other approach is to use a dynamic filter in the WHERE clause of your source. If you can make assumptions such as rows being inserted in order, then you can structure your SSIS package with an Execute SQL Task that runs a query like SELECT COALESCE(MAX(SomeId), 0) + 1 AS startingPoint FROM dbo.MyTable against the destination database and assigns the result to an SSIS variable (@[User::StartingId]). You then use an expression on the source's select statement, something like "SELECT * FROM dbo.MyTable T WHERE T.SomeId > " + (DT_WSTR, 10) @[User::StartingId]. Now when the data flow begins, it will start where it last loaded data. The challenge with this approach is ruling out scenarios where data is inserted out of order.
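Written out, the two pieces look roughly like this sketch; dbo.MyTable and SomeId are placeholders for your actual table and ever-increasing key (note that MAX(...) + 1 pairs with >=, while a plain MAX(...) pairs with >, so you don't skip a row):

```sql
-- Execute SQL Task, run against the DESTINATION:
-- map the single-row result to the SSIS variable @[User::StartingId]
SELECT COALESCE(MAX(SomeId), 0) AS StartingId
FROM dbo.MyTable;

-- OLE DB Source query, built by an expression on a string variable:
--   "SELECT * FROM dbo.MyTable T WHERE T.SomeId > "
--       + (DT_WSTR, 10) @[User::StartingId]
-- At runtime, if 40,000 rows were already loaded, that evaluates to:
SELECT *
FROM dbo.MyTable T
WHERE T.SomeId > 40000;
```

On restart, the first query re-reads the high-water mark from the destination, so the source query naturally resumes where the last run stopped.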
Let me know if you have questions, need things better explained, pictures, etc. Also, above code is freehanded so there could be syntax errors but the logic should be correct.

Related

How to get Merge Join to work for two different data sources in SSIS

I have been working on creating/loading data into a database for a school project and have been having some issues with Merge Join. I've researched many issues that others have had with Merge Join and can typically solve my own problems, but this one is a bit tricky. I've created an SSIS package that should pull a column from a table in Access (this column contains duplicate names, which I sort later in the data flow) as well as another column from a table in my SQL Server database. For both of these OLE DB Sources I tried the simple method of selecting the table through the data access mode, but I thought perhaps this was contributing to the many warning messages, because it would always pull everything from the table as opposed to the one column from each that I wanted. I am now using the SQL Command option with an extremely simple query (see below).
SELECT DISTINCT Name
FROM NameTable
For both OLE DB sources the query is the same except for the columns selected. Following the source, I have a data conversion on each (because I found that Merge Join is a pansy when the data types don't match): I convert the Access one from DT_WSTR to DT_STR, while the SQL Server source is converted from DT_I4 to DT_STR. I then follow both with a sort, passing through the copies of Name and Tid and checking the "remove rows with duplicate sort values" option. Following that step, I begin the Merge Join with the Access source being my left input and the SQL Server source (by source I am just referring to that side of the data flow; you'll see in the image below) being the right input. Below I will also show how I am configuring the Merge Join, in case I'm doing it wrong. Lastly, I have my OLE DB Destination set up to drop this data into a table with the following columns: a primary key column (it auto-increments as new data is inserted), the Name column, and the Tid column.
When I run the package it says that it succeeds with no errors. I check my database and nothing has been written; I also note that in SSIS it says 0 rows written. I'm not sure what is going on, as I enabled the data viewers between the sorts and the Merge Join and can see the data coming out of both pipelines. Another important thing to note is that when I enable the data viewer after the Merge Join, it never shows up when I run the package; only the two after the sorts appear. At first I thought maybe the data wasn't coming out of the Merge Join, so I experimented with placing Derived Column transformations after the Merge Join and, sure enough, the data does flow through. Even with those extra things between the Merge Join and the Destination, the data viewers never pop up. I mention this because I suspect it is part of the problem. Below are also the messages that SSIS spits out after I run the package.
SSIS messages:
SSIS package "C:\Users\Liono\Documents\Visual Studio 2015\Projects\DataTest6\Package.dtsx" starting.
Information: 0x4004300A at Data Flow Task, SSIS.Pipeline: Validation phase is beginning.
Information: 0x4004300A at Data Flow Task, SSIS.Pipeline: Validation phase is beginning.
Information: 0x40043006 at Data Flow Task, SSIS.Pipeline: Prepare for Execute phase is beginning.
Information: 0x40043007 at Data Flow Task, SSIS.Pipeline: Pre-Execute phase is beginning.
Information: 0x4004300C at Data Flow Task, SSIS.Pipeline: Execute phase is beginning.
Information: 0x40043008 at Data Flow Task, SSIS.Pipeline: Post Execute phase is beginning.
Information: 0x4004300B at Data Flow Task, SSIS.Pipeline: "OLE DB Destination" wrote 0 rows.
Information: 0x40043009 at Data Flow Task, SSIS.Pipeline: Cleanup phase is beginning.
SSIS package "C:\Users\Liono\Documents\Visual Studio 2015\Projects\DataTest6\Package.dtsx" finished: Success.
The program '[9588] DtsDebugHost.exe: DTS' has exited with code 0 (0x0).
Lastly, I did ask a somewhat similar question and solved it on my own by using one source with the right SQL query, but the same thing doesn’t apply here because I’m pulling from two different sources and I am having issues with the Merge Join this time around. The code I used last time:
SELECT a.T1id,
       b.T2id,
       c.Nameid
FROM Table1 AS a
JOIN Table2 AS b
    ON a.T1id = b.T2id,
    Name AS c
ORDER BY a.[T1id] ASC
I post this because maybe someone knows of a way to write some SQL that will allow me to forgo using Merge Join, where I can somehow grab both sets of data, join them, and then dump them into my table in SQL Server.
As always, I greatly appreciate your help, and if there are any questions or clarifications that need to be made, please ask and I will do my best to help you help me.
Thanks!

Newly inserted or updated row count in pentaho data integration

I am new to Pentaho Data Integration; I need to integrate one database into another location as an ETL job. I want to count the number of inserts/updates during the ETL job and insert that count into another table. Can anyone help me with this?
I don't think PDI has built-in functionality, to date, for returning the number of affected rows from an Insert/Update step.
Nevertheless, most database vendors are able to provide you with the ability to get the number of affected rows from a given operation.
In PostgreSQL, for instance, it would look like this:
/* Count affected rows from INSERT */
WITH inserted_rows AS (
    INSERT INTO ...
    VALUES
    ...
    RETURNING 1
)
SELECT count(*) FROM inserted_rows;

/* Count affected rows from UPDATE */
WITH updated_rows AS (
    UPDATE ...
    SET ...
    WHERE ...
    RETURNING 1
)
SELECT count(*) FROM updated_rows;
However, you're aiming to do that from within a PDI job, so I suggest that you try to get to a point where you control the SQL script.
Suggestion: Save the source data in a file on the target DB server, then use it, perhaps with a bulk loading functionality, to insert/update, then save the number of affected rows into a PDI variable. Note that you may need to use the SQL script step in the Job's scope.
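Under that suggestion, the SQL script run on the target server might look like this PostgreSQL sketch; the file path and the tmp_load, target_table, and etl_log names are all hypothetical:

```sql
-- Load the file produced by the PDI transformation, then insert it
-- into the target table and log the affected-row count in one script.
CREATE TEMP TABLE tmp_load (id integer, name text);

-- Bulk-load the staged file (must be readable by the DB server)
COPY tmp_load FROM '/tmp/source_extract.csv' WITH (FORMAT csv, HEADER true);

-- Insert and count in a single statement via a data-modifying CTE
WITH inserted_rows AS (
    INSERT INTO target_table (id, name)
    SELECT id, name FROM tmp_load
    RETURNING 1
)
INSERT INTO etl_log (run_at, rows_inserted)
SELECT now(), count(*) FROM inserted_rows;
```

Because the count is captured in the same statement as the insert, no separate round trip (or PDI variable juggling) is needed to record it.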
EDIT: The implementation is a matter of design choice, so the suggested solution is one of many. At a very high level, you could do something like the following.
Transformation I - extract data from source
Get the data from the source, be it a database or anything else
Prepare it for output in a way that it fits the target DB's structure
Save a CSV file using the text file output step on the file system
Parent Job
If the PDI server is the same as the target DB server:
Use the Execute SQL Script step to:
Read data from the file and perform the INSERT/UPDATE
Write the number of affected rows into a table (ideally, this table could also contain the time-stamp of the operation so you could keep track of things)
If the PDI server is NOT the same as the target DB server:
Upload the source data file to the server, e.g. with the FTP/SFTP file upload steps
Use the Execute SQL Script step to:
Read data from the file and perform the INSERT/UPDATE
Write the number of affected rows into a table
EDIT 2: another suggested solution
As suggested by @user3123116, you can use the Compare Fields step (if it's not part of your environment, check the marketplace for it).
The only shortcoming I see is that you have to query the target database before inserting/updating, which is, of course, less performant.
Eventually it could look like so (note that this is just the comparison and counting part):
Also note that you can split the input of the source data stream (COPY, not DISTRIBUTE) and do your insert/update, but this stream must wait for the field-comparison stream to finish its query on the target database; otherwise you might end up with the wrong statistics.
The "Compare Fields" step takes 2 streams as input for comparison, and its output is 4 distinct streams for "Identical", "Changed", "Added", and "Removed" records. You can count those 4, and then process the "Changed", "Added", and "Removed" records with an Insert/Update.
You can do it from the Logging option inside the Transformation settings. Please follow the steps below:
Click on Edit menu --> Settings
Switch to Logging Tab
Select Step from the left menu
Provide the Log Connection & Log table name (say, StepLog)
Select the required fields for logging (LINES_OUTPUT for the inserted count & LINES_UPDATED for the updated count)
Click on the SQL button and create the table by clicking the Execute button
Now all the steps will be logged into the log table (StepLog); you can use it for further actions.
Enjoy

exporting a table containing a large number of records to to another database on a linked server

I have a task in a project that requires the results of a process, which could be anywhere from 1,000 up to 10,000,000 records (the approximate upper limit), to be inserted into a table with the same structure in another database across a linked server. The requirement is to be able to transfer in chunks to avoid any timeouts.
In doing some testing I set up a linked server and, using the following code, transferred approx 18,000 records:
DECLARE @BatchSize INT = 1000;

WHILE 1 = 1
BEGIN
    INSERT INTO [LINKEDSERVERNAME].[DBNAME2].[dbo].[TABLENAME2] WITH (TABLOCK)
    (
        id
        ,title
        ,Initials
        ,[Last Name]
        ,[Address1]
    )
    SELECT TOP(@BatchSize)
        s.id
        ,s.title
        ,s.Initials
        ,s.[Last Name]
        ,s.[Address1]
    FROM [DBNAME1].[dbo].[TABLENAME1] s
    WHERE NOT EXISTS (
        SELECT 1
        FROM [LINKEDSERVERNAME].[DBNAME2].[dbo].[TABLENAME2]
        WHERE id = s.id
    );

    IF @@ROWCOUNT < @BatchSize BREAK;
END
This works fine; however, it took 5 minutes to transfer the data.
I would like to implement this using SSIS and am looking for any advice in how to do this and speed up the process.
Open Visual Studio/Business Intelligence Designer Studio (BIDS)/SQL Server Data Tools-BI edition(SSDT)
Under the Templates tab, select Business Intelligence, Integration Services Project. Give it a valid name and click OK.
In Package.dtsx which will open by default, in the Connection Managers section, right click - "New OLE DB Connection". In the Configure OLE DB Connection Manager section, Click "New..." and then select your server and database for your source data. Click OK, OK.
Repeat the above process but use this for your destination server (linked server).
Rename the above connection managers from server\instance.databasename to something better. If databasename does not change across the environments then just use the database name. Otherwise, go with the common name of it. i.e. if it's SLSDEVDB -> SLSTESTDB -> SLSPRODDB as you migrate through your environments, make it SLSDB. Otherwise, you end up with people talking about the connection manager whose name is "sales dev database" but it's actually pointing at production.
Add a Data Flow to your package. Call it something useful besides Data Flow Task. DFT Load Table2 would be my preference but your mileage may vary.
Double click the data flow task. Here you will add an OLE DB Source, a Lookup Transformation and an OLE DB Destination. Probably. As always, it will depend.
OLE DB Source - use the first connection manager we defined and a query
SELECT
s.id
,s.title
,s.Initials
,s.[Last Name]
,s.[Address1]
FROM [dbo].[TABLENAME1] s
Only pull in the columns you need. Your query currently filters out any duplicates that already exist in the destination. Doing that can be challenging. Instead, we'll bring the entirety of TABLENAME1 into the pipeline and filter out what we don't need. For very large volumes in your source table, this may be an untenable approach and we'd need to do something different.
From the Source we need to use a Lookup Transformation. This will allow us to detect the duplicates. Use the second connection manager we defined, the one that points to the destination. Change the no-match behavior from "Fail Component" to "Redirect rows to no match output" (name approximate).
Use your query to pull back the key value(s)
SELECT T2.id
FROM [dbo].[TABLENAME2] AS T2;
Map T2.id to the id column.
When the package starts, it will issue the above query against the target table and cache all the values of T2.id into memory. Since this is only a single column, that shouldn't be too expensive but again, for very large tables, this approach may not work.
There are 3 outputs now available from the Lookup: Match, NoMatch and Error. Match will be anything that exists in both the source and the destination. You don't care about those, as you are only interested in what exists in the source and not the destination. (When you might care is if you have to determine whether values changed between the source and the destination.) NoMatch are the rows that exist in the Source but not the Destination. That's the stream you want. For completeness, Error would capture things that went very wrong, but I've not experienced it "in the wild" with a Lookup.
Connect the NoMatch stream to the OLE DB Destination. Select your table name there and ensure the words Fast Load are in the destination. Click on the Columns tab and make sure every column is mapped.
Whether you need to fiddle with the knobs on the OLE DB Destination is highly variable. I would test it, especially with your larger sets of data and see whether the timeout conditions are a factor.
Design considerations for larger sets
It depends.
Really, it does. But, I would look at identifying where the pain point lies.
If my source table is very large and pulling all that data into the pipeline just to filter it back out is the problem, then I'd look at something like a Data Flow that first brings all the rows from my Lookup query (the T2 query) over to the source database, writes them into a staging table, and makes that one column the clustered key. Then modify your source query to reference the staging table.
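That staging variant might be sketched like so; Stage_Table2Keys is a hypothetical name, standing in for whatever you call the key-staging table:

```sql
-- On the SOURCE server: a staging table for the destination's keys,
-- populated by a data flow that runs the T2 query across the pipeline.
CREATE TABLE dbo.Stage_Table2Keys
(
    id INT NOT NULL PRIMARY KEY CLUSTERED
);

-- The source query then filters locally instead of in the SSIS pipeline:
SELECT
    s.id
    ,s.title
    ,s.Initials
    ,s.[Last Name]
    ,s.[Address1]
FROM [dbo].[TABLENAME1] s
WHERE NOT EXISTS
(
    SELECT 1
    FROM dbo.Stage_Table2Keys k
    WHERE k.id = s.id
);
```

The clustered key on id keeps the NOT EXISTS probe cheap, and only the unsent rows ever enter the data flow.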
Depending on how active the destination table is (whether any other process could load it), I might keep that lookup in the data flow to ensure I don't load duplicates. If this process is the only one that loads it, then drop the Lookup.
If the lookup is at fault - it can't pull in all the IDs - then either go with the first alternative listed above or look at changing your caching mode from Full to Partial. Do realize that this will issue a query to the target system for potentially every row that comes out of the source database.
If the destination is giving issues - I'd determine what the issue is. If it's network latency for the loading of data, drop the value of MaximumCommitInsertSize from 2147483647 to something reasonable, like your batch size from above (although 1k might be a bit low). If you're still encountering blocking, then perhaps staging the data to a different table on the remote server and then doing an insert locally might be an approach.
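If staging on the remote server turns out to be the answer, the two halves might look like this sketch; Stage_TABLENAME2 is a hypothetical table, and the second statement would run on the remote server itself (e.g. via an Execute SQL Task or an Agent job there):

```sql
-- Half 1, run locally: push rows into a staging table across the link
-- (or point the data flow's OLE DB Destination at the staging table).
INSERT INTO [LINKEDSERVERNAME].[DBNAME2].[dbo].[Stage_TABLENAME2]
    (id, title, Initials, [Last Name], [Address1])
SELECT s.id, s.title, s.Initials, s.[Last Name], s.[Address1]
FROM [DBNAME1].[dbo].[TABLENAME1] s;

-- Half 2, run ON the remote server: the final insert is now local,
-- so no linked-server round trip holds locks on TABLENAME2.
INSERT INTO [dbo].[TABLENAME2] (id, title, Initials, [Last Name], [Address1])
SELECT st.id, st.title, st.Initials, st.[Last Name], st.[Address1]
FROM [dbo].[Stage_TABLENAME2] st
WHERE NOT EXISTS
(
    SELECT 1 FROM [dbo].[TABLENAME2] t WHERE t.id = st.id
);
```

Splitting the work this way isolates the slow cross-server hop from the blocking-sensitive insert into the real table.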

PDI or mysqldump to extract data without blocking the database nor getting inconsistent data?

I have an ETL process that will run periodically. I was using Kettle (PDI) to extract the data from the source database and copy it to a stage database. For this I use several transformations with Table Input and Table Output steps. However, I think I could get inconsistent data if the source database is modified during the process, since this way I don't get a snapshot of the data. Furthermore, I don't know whether the source database would be blocked. This would be a problem if the extraction takes some minutes (and it will). The advantage of PDI is that I can select only the necessary columns and use timestamps to get only the new data.
On the other hand, I think mysqldump with --single-transaction allows me to get the data in a consistent way without blocking the source database (all tables are InnoDB). The disadvantage is that I would get unnecessary data.
Can I use PDI, or do I need mysqldump?
PS: I need to read specific tables from specific databases, so I think xtrabackup isn't a good option.
However, I think I could get inconsistent data if the source database is modified during the process, since this way I don't get a snapshot of the data
I think the "Table Input" step doesn't take into account any modifications that happen while you are reading. Try a simple experiment:
Take a .ktr file with a single Table Input and Table Output. Try loading the data into the target table, and while in the middle of the data load, insert a few records into the source database. You will find that those records are not read into the target table. (Note: I tried with a PostgreSQL db and the number of rows read was 1,000,000.)
Now for your question, I suggest using PDI, since it gives you more control over the data in terms of versioning, sequences, SCDs and all the DWBI-related activities. PDI makes it easier to load to the stage environment rather than simply dumping the entire tables.
Hope it helps :)
Interesting point. If you do all the table inputs in one transformation then at least they all start at the same time, but whilst likely to be consistent, it's not guaranteed.
There is no reason you can't use PDI to orchestrate the process AND use mysqldump. In fact, for bulk insert or extract it's nearly always better to use the vendor-provided tools.

How to create a temporary table using (SELECT * INTO ##temp FROM table) syntax (for MS SQL) in Pentaho Data Integration

When I am using the above syntax in the "Execute row script" step, it shows success but the temporary table is not getting created. Please help me out with this.
Yes, the behavior you're seeing is exactly what I would expect. It works fine from the TSQL prompt, throws no error in the transform, but the table is not there after the transform completes.
The problem here is the execution model of PDI transforms. When a transform is run, each step gets its own thread of execution. At startup, any step that needs a DB connection is given its own unique connection. After processing finishes, all steps disconnect from the DB. This includes the connection that defined the temp table. Once that happens (the defining connection goes out of scope), the temp table vanishes.
Note, that this means in a transform (as opposed to a Job), you cannot assume a specific order of completion of anything (without Blocking Steps).
We still don't have many specifics about what you're trying to do with this temp table and how you're using its data, but I suspect you want its contents to persist outside your transform. In that case, you have some options, but a global temp table like this simply won't work.
Options that come to mind:
Convert the temp table to a permanent table. This is the simplest solution; you're basically making a staging table, loading it with a Table Output step (or whatever), and then reading it with Table Input steps in other transforms.
Write the table contents to a temp file with something like a Text File Output or Serialize to File step, then read it back in from the other transforms.
Store rows in memory. This involves wrapping your transforms in a Job, and using the Copy Rows to Results and Get Rows from Results steps.
Each of these approaches has its own pros and cons. For example, storing rows in memory will be faster than writing to disk or network, but memory may be limited.
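For the first option, a minimal sketch of the permanent staging table, with hypothetical names (dbo.stg_temp standing in for ##temp, SourceTable for whatever feeds it):

```sql
-- One-time DDL: a permanent staging table in place of ##temp.
-- It survives PDI's per-step connections; clear it at the start of each run.
IF OBJECT_ID('dbo.stg_temp', 'U') IS NULL
    CREATE TABLE dbo.stg_temp
    (
        id   INT          NOT NULL,
        name VARCHAR(100) NULL
    );

TRUNCATE TABLE dbo.stg_temp;

-- Equivalent of SELECT * INTO ##temp FROM SourceTable:
INSERT INTO dbo.stg_temp (id, name)
SELECT id, name
FROM SourceTable;
```

Because the table is permanent, any later transform's Table Input step can read it regardless of which connection originally loaded it.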
Another step it sounds like you might need depending on what you're doing is the ETL Metadata Injection step. This step allows you in many cases to dynamically move the metadata from one transform to another. See the docs for descriptions of how each of these work.
If you'd like further assistance here, or I've made a wrong assumption, please edit your question and add as much detail as you can.