We get weekly data files (flat files) from our vendor to import into SQL, and at times the column names change or new columns are added.
What we have currently is an SSIS package to import columns that have been defined. Since we've assigned the mapping, SSIS only throws up an error when a column is absent. However when a new column is added (apart from the existing ones), it doesn't get imported at all, as it is not named. This is a concern for us.
What we'd like is to get the list of all the columns present in the flat file so that we can check whether any new columns are present before we import the file.
I am relatively new to SSIS, so a detailed help would be much appreciated.
Thanks!
Exactly how to code this will depend on the rules for the flat file layout, but I would approach this by writing a script task that reads the flat file using the file system object and a StreamReader object, and looks at the columns, which are hopefully named in the first line of the file.
However, about all you can do if the columns have changed is send an alert. I know of no way to dynamically change your data transformation task to accomodate new columns. It will have to be edited to handle them. And frankly, if all you're going to do is send an alert, you might as well just use the error handler to do it, and save yourself the trouble of pre-reading the column list.
I agree with the answer provided by #TabAlleman. SSIS can't natively handle dynamic columns (and niether can your SQL destination).
May I propose an alternative? You can detect a change in headers without using a C# Script Tasks. One way to do this would be to create a flafile connection that reads the entire row as a single column. Use a Conditional Split to discard anything other than the header row. Save that row to a RecordSet object. Any change? Send Email.
The "Get Header Row" DataFlow would look like this. Row Number if needed.
The Control Flow level would look like this. Use a ForEach ADO RecordSet object to assign the header row value to an SSIS variable CurrentHeader..
Above, the precedent constraints (fx icons ) of
[#ExpectedHeader] == [#CurrentHeader]
[#ExpectedHeader] != [#CurrentHeader]
determine whether you load data or send email.
Hope this helps!
i have worked for banking clients. And for banks to randomly add columns to a db is not possible due to fed requirements and rules. That said I get your not fed regulated bizz. So here are some steps
This is not a code issue but more of soft skills and working with other teams(yours and your vendors).
Steps you can take are:
(1) reach a solid columns structure that you always require. Because for newer columns older data rows will carry NULL.
(2) if a new column is going to be sent by the vendor. You or your team needs to make the DDL/DML changes to the table were data will be inserted. Ofcouse of correct data type.
(3) document this change in data dictanary as over time you or another member will do analysis on this data and would like to know what is the use of each attribute or column.
(4) long-term you do not wish to keep changing table structure monthly because one of your many vendors decided to change the style the send you data. Some clients push back very aggresively other not so much.
If a third-party tool is an option for you, check out CozyRoc's Data Flow Task Plus. It handles variable columns in sources.
SSIS cannot make the columns dynamic,
one thing, i always do, is use a script task to read the first and last lines of a file.
if it is not an expected list of csv columns i mark file as errored and continue/fail as required.
Headers are obviously important, but so are footers. Files can through any unknown issue be partially built. Requesting the header be placed at the rear of the file it is a double check.
I also do not know if SSIS can do this dynamically, but it never ceases to amaze me how people add/change order of columns and assume things will still work.
1-SSIS Does not provide dynamic source and destination mapping.But some third party component such as Data flow task plus , supporting this feature
2-We can achieve this using ssis script task.
3-If the Header is correct process further for migration else fail the package before DFT execute.
4-Read the line from the header using script task and store in array or list object
5-Then compare those array values to user defined variables declare earlier contained default value as column name.
6-If values are matching exactly then progress further else fail it.
Related
I am working on a transformation step for Pentaho Kettle. It selects several input columns and based on that adds two new columns during transformation. I am unable to understand (based on code from other plugins), how I can add the two new columns so that 1) steps downstream are aware of these columns and 2) i can push the transformed data into these columns.
Thanks in advance.
You might need to override meta.getStepFields() to add new ValueMetaInterface objects to the RowMetaInterface passed in. This is the standard way to add columns at runtime; however, the row's metadata (i.e. list of ValueMetaInterface objects) must be the same from row to row or else the next step in your transformation will complain.
Often when doing data-driven custom plugins, you consume as many rows as you need (using getRow()) in order to figure out what the outgoing row format/metadata will be, then you can construct a RowMetaInterface (usually using meta.getStepFields()) that will be passed into the putRow() call. If you intend to pass through the incoming fields, do something like:
RowMetaInterface outputRowMeta = getInputRowMeta().clone();
If you're creating new rows use this:
RowMetaInterface outputRowMeta = new RowMeta();
Either way when you call meta.getStepFields(outputRowMeta, ...) it should populate outputRowMeta with the appropriate fields, by adding/changing/removing ValueMetaInterface objects from outputRowMeta.
I've got a blog post using Groovy to add/replace fields in the incoming rows here:
http://funpdi.blogspot.com/2014/10/flatten-json-to-key-value-pairs-in-pdi.html
Not sure if that is similar to your use case or not. If you have more questions, feel free to find me on IRC at ##pentaho (my nick is usually mburgess_pdi)
IF i have understood your question correctly, i think you are trying to create an output file with dynamic column. So you can do this by checking on the "fast dumping" option in Text File Output Step. While doing so , donot define any column names in the "Fields" tab
Check my image below:
Hope it helps :)
In my case, I need to process input data in Excel (xls and xlsx) format. I need to do a file level validation of the Excel file for the order and number of columns, before processing the row level data. If this file level validation is failed, then exclude this file and inform the concerned through mail.
Please guide me, with some sample or example, how to validate the excel files for metadata? I thought of placing a variable in kettle.properties with semicolon separated header fields and compare this with the source excel file. But not getting a way to extract only the header row from file as I want.
Please guide me.
Are column names on Row 1 of your file (or any other row reasonably close to row 1) and you know how many fields are in each, at most? If so, maybe you can get away with that.
Step 1: You need to understand how many rows may there be, what they may be called, what data types, etc.
Step 2: Read the first N rows of the file(s) ensuring the header row will be read; Filter everything that is not the header (how to? depends on the specific structure). Because you don't know what are the field names, just name them field0, ... field999 or whatever.
Step3: Work some magic on the headers; filtering based on position of certain fields, mapping field names to data types, etc.
Step4: Metadata injection. Using the information you already have from before, you create a template transformation that is generic in the sense that field names are not set up in the excel input step. The metadata injection allows you to set up that step in run time, depending on the entire logic you just applied on the headers.
This page has a couple example videos: http://wiki.pentaho.com/display/EAI/ETL+Metadata+Injection
I had to build something like that (only it was CSV files and not XLS) a while back and metadata injection allowed me to load every single file in one go with 100% mapping accuracy. Of course, the magic happens before, when you parse the header row.
Thanks nsousa for your answer.
I got to the required solution with the help of my colleague. Here what I did
(1) Read only the 1st row of the source Excel file as normal data (no header, limit 1) where the field names will be called as F1, F2 etc
(2) concat the fields (data) to get a pattern
(3) Match this pattern with acual metadata pattern, if they are matching, then excel file is passed
Good trick. Thanks.
I am stuck with a CSV file with over 100,000 rows that contains product images from a provider. Here are the details of the issue, I would really appreciate some tips to help resolve this. Thanks.
The File has 1 Row per product and the following 4 columns.
ID,URL,HEIGHT,WIDTH
example: 1,http://i.img.com,100,200
Problem starts when a product has multiple images.
Instead of having 1 row per image the file has more columns in same row.
example:
1,http://i.img.com,100,200,//i.img.com,20,100,//i.img.com,30,50
Note that only first image has "http://" remaining images start with "//"
There is no telling how many images per product hence no way to tell how many total columns per row or max columns.
How can I import this using SSIS or sql import wizard.
Also I need to do this on regular intervals.
Thank you for your help.
I don't think that you can use any standard SSIS task or wizard to do this. You're going to have to write some custom code which parses each line. You can do this in SSIS using VB code or you can import the file into a staging table that's just a single column to hold each row and do the parsing in SQL. SSIS will probably be faster for this kind of operation.
Another possibility is to preprocess the file using regex or a search-and-replace command. Try to get double-quotes around the image list then you should be able to import the whole file fine, with the quoted part going into a single column. Catching the start of the string should be easy enough given the "http:\" for which you can search. Determining where the end quote goes might be more of a problem.
A third potential solution would be to get the source to fix the data. Even if you can't get the images in separate rows (or another file with separate rows, which would be ideal), maybe you can get the double-quotes added from the source as part of the export. This would likely be less error-prone than using the search-and-replace method.
Good luck!
Goal: I have a Db source. Depending on a variable, i need to store it into a fixed width file OR a delimited file.
How do I do this in a data flow? I tried creating a conditional split, with two conditions. One condition going to a fixed width destination, and one to a delimited condition. Problem is that conditional split executed BOTH conditions even if no data comes in one condition. Becuase filename is same, so it errors out.
I would keep your solution with the folowing tweeks.
Write out to two Filename-fixed.txt and filename-delim.txt. Before those steps add row count tasks.
Then in your control flow you have two Success paths. Edit the success paths to look for both success and expression. Add an expression that checks the count from the new row count tasks in your dataflow. If you have file system tasks as your end point have them rename your fixed or delim file to the correct file name.
Note: I didn't try this and the pics all have red x's because I find it helpful to have the picture to figure out the logic not because I actually coded the solution.
Use two different data flows and do the diversion from with in the control flow. If you want to do it within the data-flow itself I guess you will have to use different filenames.
problem
how to best parse/access/extract "excel file" data stored as binary data in an SQL 2005 field?
(so all the data can ultimately be stored in other fields of other tables.)
background
basically, our customer is requiring a large volume of verbose data from their users. unfortunately, our customer cannot require any kind of db export from their user. so our customer must supply some sort of UI for their user to enter the data. the UI our customer decided would be acceptable to all of their users was excel as it has a reasonably robust UI. so given all that, and our customer needs this data parsed and stored in their db automatically.
we've tried to convince our customer that the users will do this exactly once and then insist on db export! but the customer can not require db export of their users.
our customer is requiring us to parse an excel file
the customer's users are using excel as the "best" user interface to enter all the required data
the users are given blank excel templates that they must fill out
these templates have a fixed number of uniquely named tabs
these templates have a number of fixed areas (cells) that must be completed
these templates also have areas where the user will insert up to thousands of identically formatted rows
when complete, the excel file is submitted from the user by standard html file upload
our customer stores this file raw into their SQL database
given
a standard excel (".xls") file (native format, not comma or tab separated)
file is stored raw in a varbinary(max) SQL 2005 field
excel file data may not necessarily be "uniform" between rows -- i.e., we can't just assume one column is all the same data type (e.g., there may be row headers, column headers, empty cells, different "formats", ...)
requirements
code completely within SQL 2005 (stored procedures, SSIS?)
be able to access values on any worksheet (tab)
be able to access values in any cell (no formula data or dereferencing needed)
cell values must not be assumed to be "uniform" between rows -- i.e., we can't just assume one column is all the same data type (e.g., there may be row headers, column headers, empty cells, formulas, different "formats", ...)
preferences
no filesystem access (no writing temporary .xls files)
retrieve values in defined format (e.g., actual date value instead of a raw number like 39876)
My thought is that anything can be done, but there is a price to pay. In this particular case, the price seems to bee too high.
I don't have a tested solution for you, but I can share how I would give my first try on a problem like that.
My first approach would be to install excel on the SqlServer machine and code some assemblies to consume the file on your rows using excel API and then load them on Sql server as assembly procedures.
As I said, This is just a idea, I don't have details, but I'm sure others here can complement or criticize my idea.
But my real advice is to rethink the whole project. It makes no sense to read tabular data on binary files stored on a cell of a row of a table on database.
This looks like an "I wouldn't start from here" kind of a question.
The "install Excel on the server and start coding" answer looks like the only route, but it simply has to be worth exploring alternatives first: it's going to be painful, expensive and time-consuming.
I strongly feel that we're looking at a "requirement" that is the answer to the wrong problem.
What business problem is creating this need? What's driving that? Try the Five Whys as a possible way to explore the history.
It sounds like you're trying to store an entire database table inside a spreadsheet and then inside a single table's field. Wouldn't it be simpler to store the data in a database table to begin with and then export it as an XLS when required?
Without opening up an instance Excel and having Excel resolve worksheet references I'm not sure it's doable at all.
Could you write the varbinary to a Raw File Destination? And then use an Excel Source as your input to whatever step is next in your precedence constraints.
I haven't tried it, but that's what I would try.
Well, the whole setup seems a bit twisted :-) as others have already pointed out.
If you really cannot change the requirements and the whole setup: why don't you explore components such as Aspose.Cells or Syncfusion XlsIO, native .NET components, that allow you to read and interpret native Excel (XLS) files. I'm pretty such with either of the two, you should be able to read your binary Excel into a MemoryStream and then feed that into one of those Excel-reading components, and off you go.
So with a bit of .NET development and SQL CLR, I guess this should be doable - not sure if it's the best way to do it, but it should work.