How to process a CSV file with a changing number of columns in SSIS? - sql-server-2005

I have a complete CSV file that, for instance, has 6 columns:
Rec ID | Charge Category | Charge Type Category | Product Name | Account No | Cost
I've been running SSIS with no problems under this condition.
However, we found that one of the CSV files uses fewer columns, for instance:
Rec ID | Charge Type Category | Product Name | Cost
How do we handle this, given that the "Flat File Connection Manager" type of connection is not working?
Is there any other approach we need to explore?
Thanks

If you only have those two cases, you can create three file connections. Use the first one to read the whole line and process just the header: count the number of columns in it and create two parallel flows, one for the "short" format and one for the "long" format, where each of those uses a different connection definition.
A bit more complicated option would be to do all this in a Script Transformation in the data flow. Let your flat file connection read the file line by line and pass each line to the script, which parses the line and assigns the existing column values to a subset of all the possible outputs defined. Outputs should be defined in such a way that all the possible columns are configured (similar to this: SSIS split string).
The main parts of the solution are:
Flat file connection which reads the file line by line (set the column delimiter to a combination that does not occur in the data, for example #$#), and
Script Component in transformation mode with a single input column (FullLine) and all the possible output columns defined (see the sketch below).
In the script, you can use concepts beautifully described here: http://dwbi1.wordpress.com/2011/02/27/ssis-importing-files-read-the-first-n-rows/
After writing all this, I've just realised that a Script Component source would probably be the easiest way to do it, with the same idea of defining all the possible columns as outputs.
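For illustration, here is a minimal sketch of that Script Component in transformation mode. It assumes a pipe-delimited layout as shown in the question, an input column named FullLine, and string output columns named RecID, ChargeCategory, ChargeTypeCategory, ProductName, AccountNo and Cost; all of those names, and the delimiter, are assumptions to adapt to your own package.

// Inside the SSIS Script Component (transformation), ProcessInputRow is called
// once per line read from the flat file connection described above.
public override void Input0_ProcessInputRow(Input0Buffer Row)
{
    // Hypothetical delimiter; a real CSV would more likely use ','.
    string[] parts = Row.FullLine.Split('|');

    if (parts.Length == 6)            // "long" format
    {
        Row.RecID              = parts[0].Trim();
        Row.ChargeCategory     = parts[1].Trim();
        Row.ChargeTypeCategory = parts[2].Trim();
        Row.ProductName        = parts[3].Trim();
        Row.AccountNo          = parts[4].Trim();
        Row.Cost               = parts[5].Trim();
    }
    else if (parts.Length == 4)       // "short" format
    {
        Row.RecID              = parts[0].Trim();
        Row.ChargeTypeCategory = parts[1].Trim();
        Row.ProductName        = parts[2].Trim();
        Row.Cost               = parts[3].Trim();
        Row.ChargeCategory_IsNull = true;   // columns missing from the short file
        Row.AccountNo_IsNull      = true;
    }
    // Any other shape could be redirected to an error output instead.
}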

Related

PDI/Kettle - Passing data from previous hop to database query

I'm new to PDI and Kettle, and what I thought was a simple experiment to teach myself some basics has turned into a lot of frustration.
I want to check a database to see if a particular record exists (i.e. vendor). I would like to get the name of the vendor from reading a flat file (.CSV).
My first hurdle is selecting only the vendor name from the 8 fields in the CSV.
The second hurdle is how to use that vendor name as a variable in a database query.
My third issue is what type of step to use for the database lookup.
I tried a dynamic SQL query, but I couldn't determine how to build the query using a variable, nor how to pass the desired value to the variable.
The database table (VendorRatings) has 30 fields, one of which is vendor. The CSV also has 8 fields, one of which is also vendor.
My best effort was to use a dynamic query using:
SELECT * FROM VENDORRATINGS WHERE VENDOR = ?
How do I programmatically assign the desired value to "?" in the query? Specifically, how do I link the output of a specific field from Text File Input to the "vendor = ?" SQL query?
The best practice is a Stream lookup. For each record in the main flow (VendorRating), look up the vendor details (the lookup fields) in the reference file (the CSV), based on its identifier (possibly its number, name or firstname+lastname).
First "hurdle": once the path of the CSV file is defined, press the Get Fields button.
It will take the first line as a header to learn the field names and explore the first 100 (customizable) records to determine the field types.
If the names are not on the first line, uncheck Header row present, press the Get Fields button, and then change the names in the panel.
If there is more than one header row or other complexities, use the Text file input step.
The same is valid for the lookup step: use the Get lookup fields button and delete the fields you do not need.
Given that:
There is at most one VendorRating per vendor, and
you have to do something if there is no match,
I suggest the following flow:
Read the CSV and, for each row, look up in the table (i.e. the lookup table is the SQL table rather than the CSV file), and put a default value when there is no match. I suggest something really visible like "--- NO MATCH ---".
Then, in case of no match, a filter redirects the flow to the alternative action (here: insert into the SQL table). The two flows are then merged back into the downstream flow.

Pentaho PDI how to validate source Excel metadata for the order and number of columns?

In my case, I need to process input data in Excel (xls and xlsx) format. I need to do a file-level validation of the Excel file for the order and number of columns before processing the row-level data. If this file-level validation fails, the file should be excluded and the people concerned informed by mail.
Please guide me, with some sample or example, on how to validate the Excel files' metadata. I thought of placing a variable in kettle.properties with semicolon-separated header fields and comparing it with the source Excel file, but I haven't found a way to extract only the header row from the file as I want.
Please guide me.
Are the column names on row 1 of your file (or any other row reasonably close to row 1), and do you know at most how many fields each file can have? If so, maybe you can get away with the following.
Step 1: You need to understand how many fields there may be, what they may be called, what data types they have, etc.
Step 2: Read the first N rows of the file(s), ensuring the header row will be read. Filter out everything that is not the header (how to do that depends on the specific structure). Because you don't know what the field names are, just name them field0, ..., field999 or whatever.
Step 3: Work some magic on the headers: filtering based on the position of certain fields, mapping field names to data types, etc.
Step 4: Metadata injection. Using the information you already have from before, you create a template transformation that is generic in the sense that the field names are not set up in the Excel input step. Metadata injection allows you to set up that step at run time, based on the logic you just applied to the headers.
This page has a couple example videos: http://wiki.pentaho.com/display/EAI/ETL+Metadata+Injection
I had to build something like that (only it was CSV files and not XLS) a while back, and metadata injection allowed me to load every single file in one go with 100% mapping accuracy. Of course, the magic happens beforehand, when you parse the header row.
Thanks nsousa for your answer.
I got to the required solution with the help of my colleague. Here is what I did:
(1) Read only the 1st row of the source Excel file as normal data (no header, limit 1), where the fields are named F1, F2, etc.
(2) Concatenate the fields (the data) to get a pattern.
(3) Match this pattern against the actual metadata pattern; if they match, the Excel file passes.
Good trick. Thanks.

Import Unformatted txt file into SQL

I am having an issue importing data into SQL from a text file. Not because I don't know how...but because the formatting is pretty much terrible for this purpose. Below is an altered sample of the types of text files I need to work with:
1 VA - P
2 VB to 1X P
3 VC to 1Y P
4 N - P
5 G to 1G,Frame P
6 Fout to 1G,Frame P
7 Open Breaker P
8 1B to 1X P
9 1C to 1Y P
Test Status: Pass
Hi-Pot # 1500V: Pass
Customer Order:904177-F
Number: G4901626-200
Serial Number: J245F6-2D03856
Catalog #: CBDC37-X5LE30-H40-L630C-4GJ-G31
Operator: TGY
Date: Aug 01, 2013
Start Time: 04:09:26
Finish Time: 04:09:33
The first 9 lines are all specific test results (tab separated), with header information below. My issue is that I need to figure out:
How can I take the data above and turn it into something broken down into a standard column format to import into SQL?
How can I then automate this such that I can loop through an entire folder structure?
-What you see above is one of hundreds of files divided into several sub-directories.
Also note that the # of test lines above the header information vary from file to file. The header information remains in much the same format though. This is all legacy data that cannot be regenerated, but needs to be imported into our SQL databases.
I am thinking of using an SSIS project with a custom script to import the data: splitting the top section from the bottom by looking for the first empty row, then pivoting the header data into column format, merging, and moving on. But I don't write much VB and I'm not sure how to approach that.
I am working in a SQL Server 2008R2 environment with access to BIDS.
Thoughts?
I would start by importing the data as all character into a table with a single field, one record per line. Then, from that table, you can parse each record into the fields and field types appropriate for each line. Hopefully there is a way to figure out what kind of data each line is, whether each file is consistent in order, or whether the header record indicates information about subsequent lines. From there, the data can be moved to a final table (parsing may take more than one pass) with the data stored in a format that is usable for whatever you need.
I would first concentrate on getting the data into the database in the least complicated (and least error prone) way possible. Create a table with three columns: filename, line_number and line_data. Plop all of your files into that table and then you can start to think about how to interpret the data. I would probably be looking to use PIVOT, but if different files can have different numbers of fields it may introduce complications.
I would take a different approach and use an SSDT/SSIS package to import the data.
Add a Script Task to read in the text file and convert it to XML. It's not hard; there are many examples on the web (a rough sketch follows after these steps). In your script, store the XML you build in a variable.
Add a data flow
Add an XML Source. In the XML source you can select the XML variable you created and process either group of data present in your file. Here is some information on using the XML Source.
Add a destination task to import it to a destination of your choice.
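For reference, here is a rough, hedged sketch of what such a Script Task might look like. It assumes the file layout shown in the question (tab-separated test lines, a blank separator line, then "Name: Value" header lines) and two SSIS variables, User::FilePath (read-only) and User::ResultXml (read/write string); those variable names and the XML element names are assumptions, not part of the original answer.

// Minimal sketch only: reads one result file and stores it as XML in an SSIS
// string variable. Requires a reference to System.Xml.Linq in the script project.
using System;
using System.IO;
using System.Linq;
using System.Xml.Linq;

public void Main()   // inside the ScriptMain class generated by the Script Task
{
    string path = Dts.Variables["User::FilePath"].Value.ToString();
    string[] lines = File.ReadAllLines(path);

    // Assumption: everything before the first blank line is a test result,
    // everything after it is "Name: Value" header information.
    int firstBlank = Array.FindIndex(lines, l => l.Trim().Length == 0);
    if (firstBlank < 0) firstBlank = lines.Length;   // no header section found

    var doc = new XElement("TestFile",
        new XElement("Tests",
            lines.Take(firstBlank)
                 .Where(l => l.Trim().Length > 0)
                 .Select(l => new XElement("Test",
                     l.Split('\t').Select((v, i) => new XAttribute("Field" + i, v.Trim()))))),
        new XElement("Headers",
            lines.Skip(firstBlank + 1)
                 .Where(l => l.Contains(":"))
                 .Select(l => new XElement("Header",
                     new XAttribute("Name",  l.Substring(0, l.IndexOf(':')).Trim()),
                     new XAttribute("Value", l.Substring(l.IndexOf(':') + 1).Trim())))));

    Dts.Variables["User::ResultXml"].Value = doc.ToString();
    Dts.TaskResult = (int)ScriptResults.Success;
}

The XML Source in the data flow can then use the "XML data from variable" access mode and point at the ResultXml variable.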
This solution assumes your input lines are terminated {CR}{LF}, the normal Windows way.
Tell MSSQL's Import/Export Wizard to import a Flat File; the Format is "Delimited"; the "Text Qualifier" is the {CR}; the "Header Row Delimiter" is the {LF}; and the OutputColumnWidth (in "Advanced") is a bit more than the longest possible line length.
It's simple and it works.
I just used this to import 23 million lines of mixed up data, and it took less than ten minutes. Now to edit it...

Output Multiple flat files to multiple SQL tables

I have multiple flat files. I need to output each flat file to a different table using SSIS. I created a Foreach File enumerator to bring in every source file, but it uploads all of them to the same table, which then throws an error because they have different fields.
How may I configure a package to output to different tables?
You cannot, at least within a single data flow, have different source metadata. DTS supported this but SSIS does not. The number and type of columns in an SSIS data flow must be fixed.
You can have multiple data flows within your ForEach loop and then enable/disable them based on the file name or some other criteria to support loading different sources and destinations.
Some might suggest you read each row in as a single line and then use a Conditional Split based on file type and a Derived Column to split it out into specific columns. That works, but it is a maintenance nightmare I would not wish on my most hated enemy.

Get list of columns of source flat file in SSIS

We get weekly data files (flat files) from our vendor to import into SQL, and at times the column names change or new columns are added.
What we have currently is an SSIS package that imports the columns that have been defined. Since we've assigned the mapping, SSIS only throws an error when a column is absent. However, when a new column is added (apart from the existing ones), it doesn't get imported at all, as it is not mapped. This is a concern for us.
What we'd like is to get the list of all the columns present in the flat file so that we can check whether any new columns are present before we import the file.
I am relatively new to SSIS, so a detailed help would be much appreciated.
Thanks!
Exactly how to code this will depend on the rules for the flat file layout, but I would approach this by writing a script task that reads the flat file using the file system object and a StreamReader object, and looks at the columns, which are hopefully named in the first line of the file.
However, about all you can do if the columns have changed is send an alert. I know of no way to dynamically change your data transformation task to accommodate new columns. It will have to be edited to handle them. And frankly, if all you're going to do is send an alert, you might as well just use the error handler to do it, and save yourself the trouble of pre-reading the column list.
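If you do want to pre-read the column list as described in the first paragraph, a rough Script Task sketch could look like the following. The variable names (User::FilePath, User::ExpectedColumns) and the comma delimiter are assumptions to adapt to your file.

// Sketch only: read the first line of the flat file and warn about any
// columns the package was not built for. Add User::FilePath and
// User::ExpectedColumns to the task's ReadOnlyVariables.
using System;
using System.IO;
using System.Linq;

public void Main()
{
    string path     = Dts.Variables["User::FilePath"].Value.ToString();
    string expected = Dts.Variables["User::ExpectedColumns"].Value.ToString();

    string headerLine;
    using (var reader = new StreamReader(path))
    {
        headerLine = reader.ReadLine() ?? string.Empty;
    }

    var actualCols   = headerLine.Split(',').Select(c => c.Trim());
    var expectedCols = expected.Split(',').Select(c => c.Trim());

    var newCols = actualCols.Except(expectedCols, StringComparer.OrdinalIgnoreCase).ToArray();

    if (newCols.Length > 0)
    {
        // Raise a warning (or an error, if you prefer the package to fail)
        // so someone knows the vendor has added columns.
        Dts.Events.FireWarning(0, "Header check",
            "New columns found in " + path + ": " + string.Join(", ", newCols),
            string.Empty, 0);
    }

    Dts.TaskResult = (int)ScriptResults.Success;
}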
I agree with the answer provided by @TabAlleman. SSIS can't natively handle dynamic columns (and neither can your SQL destination).
May I propose an alternative? You can detect a change in headers without using a C# Script Task. One way to do this would be to create a flat file connection that reads the entire row as a single column. Use a Conditional Split to discard anything other than the header row. Save that row to a Recordset object. Any change? Send email.
The "Get Header Row" data flow would look like this (add a Row Number transformation if needed).
The Control Flow level would look like this. Use a Foreach ADO enumerator over the recordset object to assign the header row value to an SSIS variable, CurrentHeader.
Above, the precedence constraints (the fx icons) of
@[User::ExpectedHeader] == @[User::CurrentHeader]
@[User::ExpectedHeader] != @[User::CurrentHeader]
determine whether you load data or send email.
Hope this helps!
I have worked for banking clients, and for banks, randomly adding columns to a DB is not possible due to federal requirements and rules. That said, I get that yours is not a federally regulated business, so here are some steps.
This is not a code issue but more one of soft skills and working with other teams (yours and your vendor's).
Steps you can take are:
(1) Agree on a solid column structure that you always require, because for newer columns the older data rows will carry NULL.
(2) If a new column is going to be sent by the vendor, you or your team needs to make the DDL/DML changes to the table where the data will be inserted, of course with the correct data type.
(3) Document this change in the data dictionary, as over time you or another team member will do analysis on this data and will want to know the use of each attribute or column.
(4) Long-term, you do not want to keep changing the table structure monthly because one of your many vendors decided to change the way they send you data. Some clients push back very aggressively, others not so much.
If a third-party tool is an option for you, check out CozyRoc's Data Flow Task Plus. It handles variable columns in sources.
SSIS cannot make the columns dynamic.
One thing I always do is use a Script Task to read the first and last lines of a file (see the sketch below).
If the first line is not the expected list of CSV columns, I mark the file as errored and continue or fail as required.
Headers are obviously important, but so are footers: files can, through any unknown issue, be partially built. Requesting that the header also be placed at the end of the file gives you a double check.
I also do not know if SSIS can do this dynamically, but it never ceases to amaze me how people add or reorder columns and assume things will still work.
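A hedged sketch of that header/footer check follows, assuming the file path, the expected header and footer strings, and a Boolean result flag live in SSIS variables; User::FilePath, User::ExpectedHeader, User::ExpectedFooter and User::FileIsValid are illustrative names, not part of the original answer.

// Sketch only: compare the first and last non-blank lines of the file to the
// expected header and footer. Put the first three variables in ReadOnlyVariables
// and User::FileIsValid (Boolean) in ReadWriteVariables.
using System;
using System.IO;
using System.Linq;

public void Main()
{
    string path = Dts.Variables["User::FilePath"].Value.ToString();

    var lines = File.ReadAllLines(path).Where(l => l.Trim().Length > 0).ToArray();
    string firstLine = lines.Length > 0 ? lines[0] : string.Empty;
    string lastLine  = lines.Length > 0 ? lines[lines.Length - 1] : string.Empty;

    string expectedHeader = Dts.Variables["User::ExpectedHeader"].Value.ToString();
    string expectedFooter = Dts.Variables["User::ExpectedFooter"].Value.ToString();

    bool ok = firstLine.Trim().Equals(expectedHeader, StringComparison.OrdinalIgnoreCase)
           && lastLine.Trim().Equals(expectedFooter, StringComparison.OrdinalIgnoreCase);

    // Downstream precedence constraints can branch on this flag to skip or fail the file.
    Dts.Variables["User::FileIsValid"].Value = ok;
    Dts.TaskResult = (int)ScriptResults.Success;
}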
1. SSIS does not provide dynamic source and destination mapping, but some third-party components, such as Data Flow Task Plus, support this feature.
2. We can achieve the check using an SSIS Script Task.
3. If the header is correct, proceed with the migration; otherwise fail the package before the DFT executes.
4. Read the header line using the Script Task and store it in an array or list object.
5. Then compare those array values to user-defined variables declared earlier, whose default values are the expected column names.
6. If the values match exactly, then proceed; otherwise fail the package (see the small sketch below).
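A small sketch of the fail/proceed decision in steps 3 and 6, assuming the comparison result has already been stored in a Boolean variable (User::HeaderIsValid is an illustrative name):

// Sketch only: fail the package before the DFT runs when the header check failed.
public void Main()
{
    bool headerIsValid = (bool)Dts.Variables["User::HeaderIsValid"].Value;

    if (!headerIsValid)
    {
        Dts.Events.FireError(0, "Header validation",
            "Header does not match the expected column list.", string.Empty, 0);
        Dts.TaskResult = (int)ScriptResults.Failure;
    }
    else
    {
        Dts.TaskResult = (int)ScriptResults.Success;
    }
}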