SSIS - Only Load Certain Records - Skip the remaining - sql

I have a flat file that I'm loading into SQL Server, and that file contains 2 different record types with 2 different layouts depending on the RecordType.
So I may have
000010203201501011 (RecordType 1)
00002XXYYABCDEFGH2 (RecordType 2)
So I want to immediately check for records of RecordType 1 and then send those records through [Derived Column] & [Data Conversion] & [Loading to SQL],
and I want to ignore all records of RecordType 2.
I tried a Conditional Split, but it seems like the RecordType 2 records are still trying to go through the [Derived Column] & [Data Conversion] steps.
It gives me a data conversion error on the RecordType 2 records.
I have the Conditional Split set up as RecordType == 1 to go through the process I have set up.
I guess the Conditional Split isn't meant to be used this way?
Where in my process can I tell it to check for RecordType 1 and only send records past that point that are RecordType 1?

It makes perfect sense that you are getting data type errors for Record Type 2 rows, since you have probably defined the columns and their data types based on Record Type 1 records. I see three options to achieve what you want to do:
1. Have a Script Task in the control flow copy only Record Type 1 records to a fresh file, and point the data flow you already have at that file (pro: you do not need to touch the data flow; con: the file is read twice), OR
2. In the existing data flow: instead of getting all the columns from the data source, read every line coming from the file as one big fat column, then use a Derived Column to extract RecordType, then a Conditional Split, then a Derived Column to re-create all the columns you had defined in the data source, OR
3. Ideal if you have another package processing Record Type 2 rows: dump the file into a table in the staging area, then replace the data source in your data flow with an OLE DB Source (or whatever you use) and obtain and filter the records with something like: SELECT substring(rowdata,1,5) AS RecordType, substring(rowdata,6,...) AS Column2, .... FROM STG.FileData WHERE substring(rowdata,1,5) = '00001' (see the sketch below). If you use this approach, it would be better to have a dedicated column for RecordType.
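If you go with option 3, a minimal sketch of the staging query might look like the one below. The STG.FileData table, the rowdata column, and the substring offsets are assumptions based on the sample rows above (the field meanings are guesses), so adjust the positions and names to your real layout:

    SELECT SUBSTRING(rowdata, 1, 5)  AS RecordType,   -- '00001' or '00002'
           SUBSTRING(rowdata, 6, 4)  AS Column2,      -- hypothetical field positions
           SUBSTRING(rowdata, 10, 8) AS Column3,
           SUBSTRING(rowdata, 18, 1) AS Column4
    FROM   STG.FileData
    WHERE  SUBSTRING(rowdata, 1, 5) = '00001';        -- keep only RecordType 1 rows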

How to force error or stoppage in SSIS data flow during the debugging process?

The problem is: I do a left join between two tables and then I need to load the whole result set into another table, but only if every row from the first table has a match in the second one; in short, only if there are no NULLs in one specific column.
If there is at least one NULL, I want to fail my data flow so it does not load any data into the final table, and then send an email with the error by executing a SQL task.
After many tries I can only produce errors when there are NULLs, but these errors are not fatal. How can I raise a fatal error without using something clumsy like a data conversion that cannot succeed? I also tried setting a breakpoint after a variable changes, but was defeated by SSIS.
If I understand correctly, the Data Flow loads data to Table1. The Execute SQL Task uses Table1 to populate Table2.
The business rule is that the Execute SQL Task should only fire if a column from the previous data load had no NULLs.
The lazy way to handle this is to put the logic in the query itself. Something like the following (and yes, there are ways to optimize this):
INSERT INTO dbo.Table2 SELECT * FROM dbo.Table1 WHERE NOT EXISTS (SELECT * FROM dbo.Table1 WHERE MyColumn IS NULL)
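One such variant, sketched with the same hypothetical dbo.Table1 / dbo.Table2 names, checks for NULLs once up front and only then performs the load:

    IF NOT EXISTS (SELECT 1 FROM dbo.Table1 WHERE MyColumn IS NULL)
    BEGIN
        -- safe to load: no NULLs were found in MyColumn
        INSERT INTO dbo.Table2
        SELECT * FROM dbo.Table1;
    END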
To make this happen only in SSIS,
Add a Variable to the Package called NullRowCount and initialize it to zero.
In the Data flow, add a Multicast between the Join and the Destination. Route one path to the destination
In the Data flow, connect a Conditional Split to a new path from the Multicast. Configure the Conditional Split to have an Output Name of "No Data" and use an expression like ISNULL([MyColumn]). That's a boolean, yes/no.
In the Data flow, add a Row Count transformation and attach it to the Conditional Split's "No Data" output (the Default output will contain the rows that have values in MyColumn). Point the Row Count transformation at the User::NullRowCount variable.
Finally, double-click the precedence constraint you have between the Data Flow and the Execute SQL Task. Leave it as an On Success constraint, but change the evaluation option from Constraint to Expression and Constraint. Here, we'll use an expression of @[User::NullRowCount] == 0
In plain English, we are going to have the data flow count how many rows in our set have a NULL in MyColumn. The precedence constraint will allow or disallow the Execute SQL Task to run, and the criteria we specify are that the data flow had to run successfully and that the count of rows with a NULL in them is zero.
If you wanted to take an action when the count is non-zero (send an email or other alert), you would add another task and configure its constraint with Expression and Constraint as well, but use an expression of @[User::NullRowCount] > 0
Based on the comment
maybe I can stop it (force an error) inside the data flow, before the data is loaded into the destination? because this SQL task sends an email, I want the whole ETL process to be done in one data flow
No, not really. Assuming you swapped out the Row Count above for a Script Component that explicitly raises an error, or a Derived Column that forces a divide by zero - either of those would interrupt the data flow, but you don't know whether it was the first row of the data flow that caused the exception or the one billionth. In the latter case, the data has already flowed into the destination (unless you have a commit size of 0, which can lead to other issues) and you have partially loaded data.
Ultimately, you need to preprocess your data to identify whether there is data that does not conform to expectations. I would make the above changes - count whether you have any bad data - but instead of landing the data in a table, land it in a Raw File. A raw file is a compact binary copy of the data, so yes, you'll pay a disk IO penalty, but it will save you reprocessing the data if it's valid.
Then you add a new Data Flow Task that only runs when you have a zero null count, using the precedence constraint approach described above. This new data flow is just a Raw File Source to the original destination. Now you get a clean separation: data lands in your table only if it is pristine, and you don't have to worry about partial loads.

PDI / Kettle - Passing data from previous hop to database query

I'm new to PDI and Kettle, and what I thought was a simple experiment to teach myself some basics has turned into a lot of frustration.
I want to check a database to see if a particular record exists (i.e. vendor). I would like to get the name of the vendor from reading a flat file (.CSV).
My first hurdle is selecting only the vendor name from the 8 fields in the CSV.
The second hurdle is how to use that vendor name as a variable in a database query.
My third issue is what type of step to use for the database lookup.
I tried a dynamic SQL query, but I couldn't determine how to build the query using a variable, then how to pass the desired value to the variable.
The database table (VendorRatings) has 30 fields, one of which is vendor. The CSV also has 8 fields, one of which is also vendor.
My best effort was to use a dynamic query using:
SELECT * FROM VENDORRATINGS WHERE VENDOR = ?
How do I programmatically assign the desired value to "?" in the query? Specifically, how do I link the output of a specific field from Text File Input to the "vendor = ?" SQL query?
The best practice is a Stream lookup: for each record in the main flow (VendorRating), look up the vendor details (the lookup fields) in the reference file (the CSV), based on its identifier (possibly its number, its name, or first name + last name).
First "hurdle": once the path of the CSV file is defined, press the Get fields button.
It will take the first line as a header to get the field names and scan the first 100 (customizable) records to determine the field types.
If the names are not on the first line, uncheck Header row present, press the Get fields button, and then change the names in the panel.
If there is more than one header row, or other complexities, use the Text file input step instead.
The same is valid for the lookup step: use the Get lookup fields button and delete the fields you do not need.
Given that:
there is at most one VendorRating per vendor, and
you have to do something if there is no match,
I suggest the following flow:
Read the CSV and, for each row, look up in the table (i.e. the lookup table is the SQL table rather than the CSV file), and put a default value on the lookup when there is no match. I suggest something really visible, like "--- NO MATCH ---".
Then, in case of no match, a filter redirects the flow to the alternative action (here: insert into the SQL table), and the two flows are merged back into the downstream flow.
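For completeness, if you would rather keep the parameterized query from the question instead of a lookup step, a Table input step can bind the ? placeholder from the incoming stream (the option names below are from memory and may differ slightly by version). Use a Select values step first so that only the vendor field flows in, set "Insert data from step" to that step, enable "Execute for each row?", and leave the SQL as:

    SELECT *
    FROM   VendorRatings
    WHERE  Vendor = ?   -- the ? is filled, per incoming row, by the single incoming field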

How to remove duplicate rows from the output table using Pentaho DI?

I am creating a transformation that takes input from a CSV file and outputs it to a table. It runs correctly, but the problem is that if I run the transformation more than once, the output table contains the duplicate rows again and again.
Now I want to remove all duplicate rows from the output table.
And if I run the transformation repeatedly, it should not affect the output table unless there are new rows.
How can I solve this?
Two solutions come to my mind:
Use an Insert / Update step instead of a Table output step to store data in the output table. It will try to find a row in the output table that matches the incoming stream row according to the key fields you define (all fields / columns in your case). It works like this:
If the row can't be found, it inserts the row. If it can be found and the fields to update are the same, nothing is done. If they are not all the same, the row in the table is updated.
Use the following parameters:
The keys to look up the values: tableField1 = streamField1; tableField2 = streamField2; tableField3 = streamField3; and so on..
Update fields: tableField1, streamField1, N; tableField2, streamField2, N; tableField3, streamField3, N; and so on..
After the duplicate values have been stored in the output table, you can remove the duplicates using this concept:
Use an Execute SQL step where you define SQL that removes the duplicate entries and keeps only unique rows. For inspiration on writing such SQL, see: How can I remove duplicate rows?
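As a rough sketch of such a cleanup statement - assuming a SQL Server-style database and a hypothetical output table out_table in which col1, col2, col3 together define a duplicate (adjust the names and syntax for your actual database):

    WITH numbered AS (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY col1, col2, col3 ORDER BY col1) AS rn
        FROM   out_table
    )
    DELETE FROM numbered
    WHERE  rn > 1;   -- removes every copy after the first within each duplicate group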
Another way is to use the Merge rows (diff) step, followed by a Synchronize after merge step.
As long as the number of rows in your CSV that differ from your target table is below 20-25% of the total, this is usually the most performance-friendly option.
Merge rows (diff) takes two input streams that must be sorted on their key fields (by a compatible collation), and generates the union of the two inputs with each row marked as "new", "changed", "deleted", or "identical". This means you'll have to put Sort rows steps on the CSV input, and possibly on the input from the target table if you can't use an ORDER BY clause. Mark the CSV input as the "Compare" row origin and the target table as the "Reference".
The Synchronize after merge step then applies the changes marked in the rows to the target table. Note that Synchronize after merge is the only step in PDI (I believe) that requires input be entered in the Advanced tab. There you set the flag field and the values that identify the row operation. After applying the changes the target table will contain the exact same data as the input CSV.
Note also that you can use a Switch/Case or Filter Rows step to do things like remove deletes or updates if you want. I often flow off the "identical" rows and write the rest to a text file so I can examine only the changes.
I looked for visual answers, but the answers were all text, so I'm adding this visual answer for any Kettle newbie like me.
Case
user-updateslog.csv (has duplicate values) ---> users_table, storing only the latest user details.
Solution
Step 1: Connect the CSV input to an Insert / Update step, as in the transformation below.
Step 2: In the Insert / Update step, add the condition that compares the keys to find the candidate row, and set the fields you want to update to "Y".

Stored procedure SQL (Excel data to T-SQL)

I need to set up a new company for automated data import. The utility has provided the data in a spreadsheet. (Image 1)
Based on this data, I need to create a stored procedure that will identify the correct meter, if it exists, and perform either an insert or update to the monthly data table. For automated utility data import, I want to make sure I restrict everything to a particular utility company.
The steps are the following (I am having a hard time converting this to SQL):
1 - I just want a script that identifies the correct meter to see if it exists, basically checking the Meter# column in the Excel file against the MeterNumber column in the Meters table.
2 - The next step is to perform either an insert or an update to the MonthlyData table. This is a screenshot of all its columns.
3 - Then I just want to make sure that I am restricting everything to the particular company, which in this case is Site1, since 2 different companies might have the same Meter#. The UtilityCompany table contains 3 columns: ID, Name, UtilityType.
I honestly do not know where to get started; would anybody help me with the script? Thank you.
You will want to:
perform a Bulk Insert operation to take your data from the excel file into a staging table.
write a query to select ALL rows for the corresponding utility company (notice I didn't say iterate over each row...). This select could be an update in which you set an additional column to mark each row as an INSERT or an UPDATE.
Then the last step (2 parts): retrieve all of the rows that were marked as INSERT and insert those into your table; then grab all rows that were marked as UPDATE and update their corresponding values based on your matching criteria (a rough sketch follows).
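Here is a rough T-SQL sketch of that insert/update half, collapsing the marker-column idea into a direct update-then-insert pair. The staging table name (dbo.UtilityStaging), the MonthlyData column names (BillingMonth, Usage, Cost), the MeterID key, the UtilityCompanyID column on Meters, and the @UtilityCompanyID parameter are all hypothetical placeholders, since the real MonthlyData layout isn't shown here:

    -- Update readings that already exist for that meter and month
    UPDATE md
    SET    md.Usage = s.Usage,
           md.Cost  = s.Cost
    FROM   dbo.MonthlyData md
    JOIN   dbo.Meters m         ON m.MeterID = md.MeterID
    JOIN   dbo.UtilityStaging s ON s.MeterNumber = m.MeterNumber
                               AND s.BillingMonth = md.BillingMonth
    WHERE  m.UtilityCompanyID = @UtilityCompanyID;

    -- Insert readings for meters that have no row for that month yet
    INSERT INTO dbo.MonthlyData (MeterID, BillingMonth, Usage, Cost)
    SELECT m.MeterID, s.BillingMonth, s.Usage, s.Cost
    FROM   dbo.UtilityStaging s
    JOIN   dbo.Meters m ON m.MeterNumber = s.MeterNumber
    WHERE  m.UtilityCompanyID = @UtilityCompanyID
      AND NOT EXISTS (SELECT 1
                      FROM dbo.MonthlyData md
                      WHERE md.MeterID = m.MeterID
                        AND md.BillingMonth = s.BillingMonth);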

SSIS external metadata column needs to be removed

I am creating a select statement on the fly because the column names and table name can change, but they all need to go into the same data destination. There are other commonalities that make this viable; if I need to later, I will go into them. So, what it comes down to is this: I am creating the select statement with 16 columns - there will always be sixteen columns, no more, no less - but the column names can change and the table name can change. When I execute the package, the select statement gets built just fine, but when the Data Flow tries to execute, I get the following error:
The "external metadata column "ColumnName" (79)" needs to be removed from the external metadata column collection.
The actual SQL Statement being generated is:
select 0 as ColumnName, Column88 as CN1, 0 as CN2, 0 as CN3, 0 as CN4,
0 as CN5, 0 as CN6, 0 as CN7, 0 as CN8, 0 as CN9, 0 as CN10,
0 as CN11, 0 as CN12, 0 as CN13, 0 as CN14, 0 as CN15 from Table3
The column 'Column88' is generated dynamically and so is the table name. If source columns exist for the other 'as CNx' columns, they will appear the same way (Column88 as CN1, Column89 as CN2, Column90 as CN3, etc.) and the table name will always be in the form Tablex, where x is an integer.
Could anyone please help me out with what is wrong and how to fix it?
You're in kind of deep here. You should just take it as read that you can't change the apparent column names or types. The names and types of the input columns become the names and types of the metadata flowing down from the source. If you change those, then everything that depends on them must fail.
The solution is to arrange for these to be stable, perhaps by using column aliases and casts. For one table:
SELECT COLNV, COLINT FROM TABLE1
for another
SELECT CAST(COLV AS NVARCHAR(50)) AS COLNV, CAST(COLSMALL AS INTEGER) AS COLINT FROM TABLE2
Give that a try and see if it works out for you. You just really can't change the metadata without fixing up the entire remainder of the package.
I had the same issue when I had to remove a column from my stored procedure (which spits its output into a temp table) in SQL and add two columns. To resolve the issue, I had to go through each part of my SSIS package from the beginning (the source - in my case, a pull from a temporary table) all the way through to the destination (in my case a flat file connection to a CSV file). I had to re-do all the mappings along the way, and I watched for errors that came up in the data flow tasks in the SSIS GUI.
This error did come up for me in the form of a red X with a circle around it, I hovered over and it mentioned the metadata thing...I double clicked on it and it warned me that one of my columns didn't exist anymore and wanted to know if I wanted to delete it. I did delete it, but I can tell you that this error has more to do with SSIS telling you that your mappings are off and you need to go through each part of your SSIS package to make sure everything is all mapped out correctly.
How about using a view in front of the table and calling the view as the SSIS source? That way, you can map the columns as necessary and use ISNULL or COALESCE functions to keep consistent column patterns.
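A minimal sketch of that idea, reusing the hypothetical Table3 / Column88 names from the generated statement above; the view pins the output to the fixed column names so the SSIS metadata never changes (only a few columns are shown here - the real view would continue through CN15):

    CREATE VIEW dbo.vwStableSource
    AS
    SELECT 0                   AS ColumnName,
           ISNULL(Column88, 0) AS CN1,   -- real source column aliased to a fixed name
           0                   AS CN2,   -- placeholder where no source column exists
           0                   AS CN3
    FROM   dbo.Table3;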