I'm working on a package to import data from a raw text file to a table in SQL Server. My package contains:
1) An Execute Process Task that runs a batch file to compile .txt files
2) An Execute SQL Task that truncates the table I want to import into
3) A Data Flow Task that takes the data from the raw text file and puts it in the table in SQL Server
I was able to run each step individually, and each worked as expected. However, when I run the whole package from inside SSIS, it gives me the "success" message but nothing actually happens. Even worse, the components of the Data Flow Task are now missing.
Has anyone experienced this and found a workaround?
Sorry for the lack of specifics! I actually figured it out. Let me clarify my second paragraph:
The batch portion and the Execute SQL Task work perfectly when I disable the Data Flow Task! However, upon enabling the Data Flow Task, the package would "run" but not complete the Data Flow Task and would delete the Data Flow Task's components completely. Within the data flow task I had:
1) Flat File Source
2) Conditional Split that discarded rows where the value in the first column was "".
3) OLE DB destination table
What I found is that changing the Conditional Split condition from specifically matching "" to testing the length of the value fixed the problem: the package ran, and the components in the Data Flow Task were no longer deleted.
TL;DR: For whatever reason, the solution I built didn't like the conditional split criteria being based on the "" character. When I removed that, the solution worked perfectly.
I am using a SELECT statement in an Excel Source to import only specific columns from an Excel file.
Is it possible to select the data so that, for example, I select the column named Column_1, but if that column does not exist in the Excel file, the query falls back to the column named Column_2? Currently, if Column_1 is missing, the Data Flow Task fails.
Use a Script Task and write .NET code to read the Excel file and check whether Column_1 is available in the file. If the column is not present, use Column_2 as input. In the data flow, a Script Component can act as a source.
SSIS is metadata based and does not support dynamic metadata. However, you can use a Script Component, as @nitin-raj suggested, to handle all known source columns. There is a good post below on how it can be done.
Dynamic File Connections
If you have many such files that can have varying columns, then it is better to create a custom component. However, you cannot have dynamic metadata even with a custom component; the set of columns must be known to SSIS up front.
If the list of columns keeps changing and you cannot know in advance which columns to expect, then you are better off handling the entire thing in C#/VB.NET using a Script Task in the control flow.
As a best practice, because SSIS metadata is static, any data quality and formatting issues in source files should be corrected before the SSIS Data Flow Task runs.
I have seen this situation before and there is a very simple fix. At the beginning of your SSIS package, use a File System Task to create a copy of the source Excel file, and then run a C# script or a PowerShell script to fix the column headers: if Column 1 does not exist, it is added at the appropriate spot in the Excel file, and if the column name is wrong, it is corrected.
As a result, you will not need to refresh your SSIS metadata every time the package fails. This is a standard data standardization practice.
The easiest way is to add two data flow tasks, one data flow for each Excel source select statement and use precedence constraints to execute the second data flow when the first one fails.
The disadvantage of this approach is that if the first data flow task fails for another reason, it will also try to execute the second one. You will need some advanced error handling to check if the error is thrown due to missing columns or not.
But if I had a similar situation, I would use a Script Task to check whether the column exists and build the SQL command dynamically. Note that this SQL command must always return the same metadata (you must use aliases).
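The "check the column, then build the command with an alias" idea can be sketched like this. It is only an illustration of the logic in Python, not SSIS code; inside a real Script Task the same check would be written in C# or VB.NET. The function name `build_excel_select`, the sheet name `Sheet1$`, and the alias `SourceValue` are made up for the example.

```python
def build_excel_select(available_columns, candidates, alias, sheet="Sheet1$"):
    """Return a SELECT that reads the first candidate column present in the sheet.

    The chosen column is always exposed under the same alias, so the
    downstream metadata is identical whichever column is picked.
    """
    # Compare case-insensitively; map the normalized name back to the real header.
    present = {name.strip().lower(): name.strip() for name in available_columns}
    for candidate in candidates:
        actual = present.get(candidate.lower())
        if actual is not None:
            return f"SELECT [{actual}] AS [{alias}] FROM [{sheet}]"
    # None of the candidates exist: fail loudly instead of guessing.
    raise LookupError(f"None of {list(candidates)} found in the sheet")
```

Called with the headers actually found in the file and the candidates `["Column_1", "Column_2"]`, it returns the same shape of query either way, which is what keeps the data flow metadata stable.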
Helpful links
Overview of SSIS Precedence Constraints
Working with Precedence Constraints in SQL Server Integration Services
Precedence Constraints
I have been working on creating and loading data into a database for a school project and have been having some issues with Merge Join. I’ve researched many issues that others have had with Merge Join and can typically solve my own problems, but this one is a bit tricky. I’ve created an SSIS package that should pull a column from a table in Access (this column contains duplicate names, which I remove with a Sort later in the data flow) and another column from a table in my SQL Server database. For both of these OLE DB Sources I first tried simply selecting the table through the data access mode, but I thought this might be contributing to the many warning messages, because it always pulls everything from the table rather than the one column I want from each. I am now using the SQL Command option with an extremely simple query (see below).
SELECT DISTINCT Name
FROM NameTable
For both OLE DB Sources the query is the same except for the column and table selected. Following each source, I have a Data Conversion (because I found that Merge Join is a pansy when the data types don’t match): I convert the Access column from DT_WSTR to DT_STR and the SQL Server column from DT_I4 to DT_STR. I then follow both with a Sort, passing through the copy of Name and Tid and checking the “Remove rows with duplicate sort values” option. After that, I use a Merge Join with the Access side of the data flow as the left input and the SQL Server side as the right input (you’ll see this in the image below). Below I will also show how I am configuring the Merge Join, in case I’m doing it wrong. Lastly, I have my OLE DB Destination set up to drop this data into a table with the following columns: a PrimaryKey column (it auto-increments as new data is inserted), the Name column, and the Tid column.
When I run the package it says that it succeeds with no errors. I check my database and nothing has been written; I also note that SSIS reports 0 rows written. I’m not sure what is going on, because I enabled the data viewers between the Sorts and the Merge Join and can see the data coming out of both pipelines. Another important thing to note is that when I enable a data viewer after the Merge Join, it never shows up when I run the package; only the two after the Sorts appear. At first I thought maybe the data wasn’t coming out of the Merge Join, so I experimented with placing Derived Column transformations after the Merge Join and sure enough, the data does flow through. Even with those extra components between the Merge Join and the destination, the data viewers never pop up. I mention this because I suspect that this is part of the problem. Below are also the messages that SSIS outputs after I run the package.
SSIS messages:
SSIS package "C:\Users\Liono\Documents\Visual Studio 2015\Projects\DataTest6\Package.dtsx" starting.
Information: 0x4004300A at Data Flow Task, SSIS.Pipeline: Validation phase is beginning.
Information: 0x4004300A at Data Flow Task, SSIS.Pipeline: Validation phase is beginning.
Information: 0x40043006 at Data Flow Task, SSIS.Pipeline: Prepare for Execute phase is beginning.
Information: 0x40043007 at Data Flow Task, SSIS.Pipeline: Pre-Execute phase is beginning.
Information: 0x4004300C at Data Flow Task, SSIS.Pipeline: Execute phase is beginning.
Information: 0x40043008 at Data Flow Task, SSIS.Pipeline: Post Execute phase is beginning.
Information: 0x4004300B at Data Flow Task, SSIS.Pipeline: "OLE DB Destination" wrote 0 rows.
Information: 0x40043009 at Data Flow Task, SSIS.Pipeline: Cleanup phase is beginning.
SSIS package "C:\Users\Liono\Documents\Visual Studio 2015\Projects\DataTest6\Package.dtsx" finished: Success.
The program '[9588] DtsDebugHost.exe: DTS' has exited with code 0 (0x0).
Lastly, I did ask a somewhat similar question and solved it on my own by using one source with the right SQL query, but the same thing doesn’t apply here because I’m pulling from two different sources and I am having issues with the Merge Join this time around. The code I used last time:
SELECT a.T1id,
       b.T2id,
       c.Nameid
FROM Table1 AS a
JOIN Table2 AS b
  ON a.T1id = b.T2id
CROSS JOIN Name AS c
ORDER BY a.[T1id] ASC
I post this because someone might know of a way to write some SQL that will allow me to forgo Merge Join again, so that I can somehow grab both sets of data, join them, and then dump them into my table in SQL Server.
As always, I greatly appreciate your help. If there are any questions or clarifications that need to be made, please ask and I will do my best to help you help me.
Thanks!
I have files abc.xlsx, 1234.xlsx, and xyz.xlsx in a folder. My requirement is to develop a transformation where the Microsoft Excel Input step in PDI (Pentaho Data Integration) picks only the file named by the output of a SQL query. If the query output is abc.xlsx, Microsoft Excel Input should pick abc.xlsx for further processing. How do I achieve this? Would really appreciate your help. Thanks.
The steps within a Kettle transformation run in parallel rather than one after another, so you will probably need a job for this.
Files to create:
1) Create a transformation that performs the SQL query you're looking for and populates a variable based on the result
2) Create a transformation that pulls data from the Excel file, using the variable populated as the filename
3) Create a job that executes the first transformation, then steps into the second transformation
Jobs run sequentially, so it will execute the first transformation, perform the query, get the result, and set a variable. Variables need to be set and retrieved in different transformations because of their asynchronous nature. This is the reason for the second transformation; the job won't step into the second transformation until the first one is done running (therefore, not until the variable is populated).
This all assumes that you only want to run the transformation once and expect a single result from the query. If you want to loop over a set of results, then the setup is a little different.
The Excel input step has an "Accept filenames from previous step" option. You can have a Table input step build the full path of the file you want to read (or build it later from the base directory and the short filename), pass the filename to the Excel input step, tick that box, and specify the step and the field to use for the filename.
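Both answers come down to the same order of operations: resolve the filename from the query first, and only then hand it to the Excel reader. A rough sketch of that first step, in Python rather than PDI, is shown here. The function name `resolve_input_file` is hypothetical, and the query and base directory are whatever your setup uses.

```python
import sqlite3
from pathlib import Path


def resolve_input_file(conn, query, base_dir):
    """Run the query and turn its single result into the full path to read."""
    row = conn.execute(query).fetchone()
    if row is None:
        # No result means there is nothing to process; stop before the read step.
        raise LookupError("query returned no filename")
    # First column of the first row is taken as the short filename.
    return Path(base_dir) / row[0]
```

The second step (reading the spreadsheet) would only be called with the path this returns, which mirrors how the job waits for the first transformation to finish before starting the second.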
I am migrating data from an Oracle database to a SQL Server 2008 R2 database using SSIS. My problem is that at a certain point the package fails, say after some 40,000 rows out of 100,000. What can I do so that the next time I run the package, after correcting the errors, it restarts from the 40,001st row, i.e., the row where the error occurred?
I have tried using checkpoint in SSIS, but the problem is that they work only between different control flow tasks. I want something that can work on the rows that are being transferred.
There's no native magic I'm aware of that is going to "know" that it failed on row 40,000 and when it restarts, it should start streaming row 40,001. You are correct that checkpoints are not the answer and have plenty of their own issues (can't serialize Object types, loops restart, etc).
How you can address the issue is through good design. If your package is created with the expectation that it's going to fail, then you should be able to handle these scenarios.
There are two approaches I'm familiar with. The first approach is to add a Lookup Transformation in the Data Flow between your source and your destination. The goal of this is to identify which records already exist in the target system. Only the rows for which no match is found are sent on to the destination. This is a very common pattern and will also allow you to detect changes between source and destination (if that is a need). The downside is that you will always be transferring the full data set out of the source system and then filtering rows in the data flow. If it failed on row 999,999 out of 1,000,000, you will still need to stream all 1,000,000 rows back to SSIS for it to find the 1 that hasn't been sent.
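Stripped of the SSIS specifics, the Lookup pattern is an anti-join on the key. Here's a small Python sketch of the idea; the function name `rows_missing_from_target` and the `id` key are made up for illustration.

```python
def rows_missing_from_target(source_rows, target_keys, key="id"):
    """Yield only the source rows whose key is not already in the target."""
    # Load the existing keys once, like a fully cached lookup.
    existing = set(target_keys)
    # Every source row is still read; only the unmatched ones are passed on.
    for row in source_rows:
        if row[key] not in existing:
            yield row
```

Note that the loop visits every source row even when almost all of them are already loaded, which is exactly the cost described above.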
The other approach is to use a dynamic filter in the WHERE clause of your source query. If you can make assumptions such as the rows being inserted in order, then you can structure your SSIS package to start with an Execute SQL Task that runs a query like SELECT COALESCE(MAX(SomeId), 0) + 1 AS startingPoint FROM dbo.MyTable against the destination database and assigns the result to an SSIS variable (@[User::StartingId]). You then use an expression for the source's SELECT statement, something like "SELECT * FROM dbo.MyTable T WHERE T.SomeId >= " + (DT_WSTR, 10) @[User::StartingId]. Now when the data flow begins, it will start where it last loaded data. The challenge with this approach is finding those scenarios where you know data hasn't been inserted out of order.
Let me know if you have questions, need things better explained, pictures, etc. Also, above code is freehanded so there could be syntax errors but the logic should be correct.
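For anyone who wants to try the dynamic-filter idea outside SSIS, here is a minimal sketch using Python's built-in sqlite3 module. It assumes both databases have a `MyTable` with an ever-increasing `SomeId`; the `Payload` column and the function name `load_incrementally` are invented for the example. This variant keeps the highest loaded id itself and filters with a strict `>`.

```python
import sqlite3


def load_incrementally(source, target):
    """Copy rows from source to target, resuming after the highest id already loaded."""
    # Highest key already present in the destination; 0 when the table is empty.
    (last_id,) = target.execute(
        "SELECT COALESCE(MAX(SomeId), 0) FROM MyTable"
    ).fetchone()
    # Pull only the rows the destination has not seen yet.
    rows = source.execute(
        "SELECT SomeId, Payload FROM MyTable WHERE SomeId > ? ORDER BY SomeId",
        (last_id,),
    ).fetchall()
    target.executemany("INSERT INTO MyTable (SomeId, Payload) VALUES (?, ?)", rows)
    target.commit()
    return len(rows)
```

Running it a second time after a successful load copies nothing, which is the restart behaviour the question asks for, as long as ids really do arrive in order.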
Team,
My objective is to load data from Excel into SQL tables using SSIS. However, the Excel files are quite dynamic, i.e. their column count could vary OR the order of existing columns may change. But the destination table will be the same...
So I was contemplating a few options, like:
1) Using SQL Command in "Excel Source" - But unfortunately I have to keep the "first row as header" setting as false (to resolve the issue of the Excel Connection Manager sensing the data type as numeric based on the first few records). So querying based on the header doesn't work here.
2) The other option in my mind is a Script Task with C# code that reads the Excel file based on the columns I know. In this case the order and the insertion/deletion of columns won't matter.
Is a Script Task the only option available to me? Is there any other simple way to achieve the same in SSIS? Also, if possible, please suggest a reference for this.
Thanks,
Justin Samuel.
If you need to automate the process, then I'd definitely go with a script component / OleDbDataAdapter combo (you can't use a streamreader because Excel is a proprietary format). If not, go with the import wizard.
If you try to use a connection manager based solution, it's going to fail when the file layout changes. With the script component / OleDbDataAdapter combo, you can add logic in to interpret the fields and standardize the record layout before loading. You can also create an error buffer and gracefully push error values to it with Try / Catch.
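To make the "interpret the fields and standardize the record layout" part concrete, here is a rough Python sketch of the logic. A real Script Component would do this in C# or VB.NET against its output buffers. The target layout `EXPECTED_COLUMNS`, the numeric `Amount` column, and the function name `standardize_rows` are all assumptions made for the example.

```python
EXPECTED_COLUMNS = ("CustomerId", "Name", "Amount")  # hypothetical target layout


def standardize_rows(header, rows):
    """Map rows with arbitrary column order onto EXPECTED_COLUMNS.

    Returns (good_rows, error_rows). Columns missing from the file become
    None, extra columns are ignored, and rows that fail conversion are
    collected in error_rows instead of stopping the load.
    """
    # Where each header sits in this particular file, matched case-insensitively.
    positions = {name.strip().lower(): i for i, name in enumerate(header)}
    good, errors = [], []
    for row in rows:
        try:
            record = {}
            for column in EXPECTED_COLUMNS:
                index = positions.get(column.lower())
                found = index is not None and index < len(row)
                record[column] = row[index] if found else None
            if record["Amount"] is not None:
                record["Amount"] = float(record["Amount"])  # may raise ValueError
            good.append(record)
        except (TypeError, ValueError) as exc:
            # The "error buffer": keep the bad row and the reason.
            errors.append({"row": row, "error": str(exc)})
    return good, errors
```

Because the mapping is driven by header names, reordering columns or adding new ones in the file does not change what reaches the destination.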
Here are some links on how to use the Script Component as a source in the Data Flow Task:
http://microsoft-ssis.blogspot.com/2011/02/script-component-as-source-2.html
http://beyondrelational.com/modules/2/blogs/106/posts/11126/ssis-script-component-split-single-row-to-multiple-rows.aspx
This could be done easily using the "Import and Export Data" tool available with SQL Server.
Step 1: Specify your Excel file as the source and your SQL Server DB as the destination.
Step 2: Provide the necessary mappings.
Step 3: In the final screen, you can choose "Save as SSIS Package" and save it to the File System. A corresponding .dtsx SSIS package will be created for you.
After the SQL Server Import and Export Wizard has created the package and copied the data, you can use the SSIS Designer to open and change the saved package by adding tasks, transformations, and event-driven logic.
(Since it works based on Header, order should not matter. And if a particular column is missing, it should automatically take NULL for that)
Reference: http://msdn.microsoft.com/en-us/library/ms140052.aspx