Output Multiple flat files to multiple SQL tables - sql

I have multiple flat files, and I need to output each flat file to a different table using SSIS. I created a Foreach File Enumerator to pick up every source file, but it loads all of them into the same table, which then throws an error because the files have different fields.
How can I configure the package to output each file to a different table?

You cannot, at least within a single data flow, have different source metadata. DTS supported this but SSIS does not. The number and type of columns in an SSIS data flow must be fixed at design time.
You can have multiple data flows within your ForEach loop and then enable/disable them based on the file name or some other criteria to support loading different sources and destinations.
Some might suggest you read every file in as a single wide column, use a Conditional Split based on file type, and then use a Derived Column to parse it out into specific columns. That works, but it is a maintenance nightmare I would not wish on my most hated enemy.
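For anyone who wants to see the routing idea outside the SSIS designer, here is a minimal Python sketch of "one loader per file pattern, each with its own fixed column set and destination table". The patterns, folder, table names, and columns are hypothetical, purely for illustration; in SSIS each branch would be its own data flow.

```python
# Hypothetical routing sketch (not SSIS): each file-name pattern maps to its own
# destination table and fixed column list, mirroring one data flow per file type.
import csv
import glob
import os

ROUTES = {
    "customers": {"table": "dbo.Customers", "columns": ["CustomerId", "Name", "Email"]},
    "orders":    {"table": "dbo.Orders",    "columns": ["OrderId", "CustomerId", "Total"]},
}

def route_for(path):
    """Pick the route whose key appears in the file name (the enable/disable criterion)."""
    name = os.path.basename(path).lower()
    for key, route in ROUTES.items():
        if key in name:
            return route
    return None

for path in glob.glob("incoming/*.csv"):
    route = route_for(path)
    if route is None:
        print(f"Skipping {path}: no matching route")
        continue
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    # A real loader would bulk-insert into route["table"]; here we only report the decision.
    print(f"{path} -> {route['table']} ({len(rows)} rows, expecting columns {route['columns']})")
```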

Related

Azure Data Factory - How to create multiple datasets and apply different treatments on files in the same blob container?

Starting out with Azure Data Factory here.
I have a scenario where I gather CSV files (different sources and formats/templates) and store them in a single Azure blob container. I would like to extract the data into a SQL DB. Before pushing the data to SQL I need to apply different treatments to the files, based on their format. The format is indicated in each file name (for example: Myfile-formatA-20201201).
I am unclear on my pipeline / dataset setup. I assume I need to create a new (input) dataset for each CSV format, but I cannot find a way to create differentiated datasets that rely on the different naming patterns. If I create a single input dataset instead, I can build a pipeline with differentiated Copy activities that use that one dataset and apply different filtering rules based on my file naming pattern. That seems to work fine for files sharing the same encoding, column delimiters, etc., but, as expected, it fails for files that do not.
I could not find any official information on how to apply filters when creating multiple datasets from files contained in the same container. Is it possible at all? Or is it a prerequisite to store files with different formats in different containers or directories?
I created a test that copies CSVs of different formats in one pipeline and then selects different Copy activities according to the file name. I think this is the answer you want.
1. In my container, I created csv files in two formats (formatA and formatB).
2. Create a dataset pointing to the input container. Edit: do not specify a file in the File path.
3. Use the Get Metadata1 activity to get the Child items; the output is the list of files in the container.
4. Then, in the ForEach1 activity, we can traverse this array. Add the dynamic content @activity('Get Metadata1').output.childItems to the Items tab.
5. Inside the ForEach1 activity, we can use a Switch1 activity and add the dynamic content @split(item().name,'-')[1] to the Expression. It will get the format name, e.g. Myfile-formatA-20201201 -> formatA.
6. In the default case, we copy the csv files of formatA. Edit: in order to select only files with "formatA" in their name, use the Wildcard file path option in the Copy activity:
Key in @item().name so we can target a single csv file.
7. Add a formatB case, using the same source dataset. Edit: as in the previous step, use the Wildcard file path option.
That's all. We can set a different sink on each of these Copy activities.
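If it helps to see the control flow as code, here is a rough Python analogue of the Get Metadata -> ForEach -> Switch pattern above. It is only a sketch: the local folder stands in for the blob container, and the per-format delimiter/encoding options are assumptions, since the question mentions files differing in encoding and delimiters.

```python
# Rough, hypothetical analogue of the ADF pipeline: list the files, extract the
# format token from the name, then branch on it like the Switch activity.
import csv
import glob
import os

FORMAT_OPTIONS = {
    "formatA": {"delimiter": ",", "encoding": "utf-8"},    # assumed settings
    "formatB": {"delimiter": ";", "encoding": "latin-1"},  # assumed settings
}

for path in glob.glob("input-container/*.csv"):            # Get Metadata: child items
    name = os.path.basename(path)
    parts = name.split("-")
    fmt = parts[1] if len(parts) > 1 else ""                # @split(item().name,'-')[1]
    options = FORMAT_OPTIONS.get(fmt)                       # Switch on the format token
    if options is None:
        print(f"Skipping {name}: unknown format '{fmt}'")
        continue
    with open(path, newline="", encoding=options["encoding"]) as f:
        rows = list(csv.reader(f, delimiter=options["delimiter"]))
    print(f"{name}: parsed {len(rows)} rows as {fmt}")
```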

Is there any way to exclude columns from a source file/table in Pentaho using "like" or any other function?

I have a CSV file with more than 700 columns, and I only want 175 of them to be inserted into an RDBMS table or a flat file using Pentaho (PDI). The source CSV file has variable columns, i.e. columns can keep being added or removed, but they contain some specific keywords that remain constant throughout. I have the list of keywords present in the column names that have to be excluded, e.g. starts_with("avgbal_"), starts_with("emi_"), starts_with("delinq_prin_"), starts_with("total_utilization_"), starts_with("min_overdue_"), starts_with("payment_received_").
Any column containing the above keywords has to be excluded and should not pass into my RDBMS table or flat file. Is there any way to remove these columns by writing some SQL query in PDI? Selecting the specific 175 columns is not possible as they are variable in nature.
I think your case is a fit for metadata injection; you can refer to the example shared below:
https://help.pentaho.com/Documentation/7.1/0L0/0Y0/0K0/ETL_Metadata_Injection
Two things you need to be careful about:
Maintain the list of columns you need to push through.
Since the column names change, you may also run into issues with the valid columns you want to import or work with. To handle this, make sure you generate the metadata file every time, so you are sure about the column names you want to push out from the flat file.
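Metadata injection is the PDI-native answer; just to make the exclusion rule concrete outside PDI, here is a small pandas sketch that drops every column whose name starts with one of the prefixes from the question (the file names are hypothetical).

```python
# Sketch only: drop columns by name prefix, the same rule the metadata injection
# step would apply. "source.csv" and "filtered.csv" are placeholder file names.
import pandas as pd

EXCLUDE_PREFIXES = (
    "avgbal_", "emi_", "delinq_prin_",
    "total_utilization_", "min_overdue_", "payment_received_",
)

df = pd.read_csv("source.csv")

# Keep only the columns whose names do NOT start with an excluded prefix.
keep = [col for col in df.columns if not col.startswith(EXCLUDE_PREFIXES)]
df[keep].to_csv("filtered.csv", index=False)

print(f"Kept {len(keep)} of {len(df.columns)} columns")
```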

Reading metadata CSV from a datalake, too big for a lookup activity

I need to create a pipeline to read CSVs from a folder and load rows from row 8 onward into an Azure SQL table; the first 5 rows will go into a different table ([tblMetadata]).
So far I have done it using a Lookup activity, which works fine, but one of the files is bigger than 6 MB and it fails.
I checked all the options in the Lookup and read everything about the Copy activity (which I am using to load the main data, skipping 7 rows). The pipeline is created using the GUI.
The output from the Lookup is used as parameters for a stored procedure that inserts into tblMetadata.
Can someone advise me how to deal with this? At the moment I am in training and no one can help me on site.
You could probably do this with a single Data Flow activity that has a couple of transformations.
You would use a Source transformation that reads from a folder using folder paths and wildcards, then add a conditional split transformation to send different rows to different sinks.
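As a plain-Python sketch of that split (not an actual Data Flow), assuming the layout from the question: rows 1-5 are metadata, rows 6-7 are skipped, and row 8 onward is the main data. The file name is hypothetical, and the two lists stand in for the two sinks.

```python
# Hypothetical sketch of the conditional split by row number.
import csv

metadata_rows = []   # would be sunk into [tblMetadata]
data_rows = []       # would be sunk into the main Azure SQL table

with open("incoming/somefile.csv", newline="") as f:
    for line_number, row in enumerate(csv.reader(f), start=1):
        if line_number <= 5:
            metadata_rows.append(row)
        elif line_number >= 8:
            data_rows.append(row)
        # lines 6 and 7 fall through and are skipped, matching the "skip 7 rows" setting

print(f"{len(metadata_rows)} metadata rows, {len(data_rows)} data rows")
```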
I worked around it in a different way: I modified the CSVs being imported so that the whole metadata sits in the first row (this was part of a different project of mine), then used the First row only option in the Lookup.

Check for duplicate rows while transferring data from text file to excel

Platform : SSIS
I am new to SSIS and am trying to check for duplicate rows while transferring data from a text file to an Excel file. I have heard that a Cache Transform can be used, but I am not really sure about it. Any suggestions?
One simple way to handle this is to use an Aggregate transform between the source and destination. In it, group by all the columns from the source to eliminate duplicates. I have used this technique, and it works well.
This could be slow if the source is large.
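The same "group by every column" idea can be sketched outside SSIS; the sketch below uses pandas with made-up file names, and writing the .xlsx requires openpyxl to be installed.

```python
# Sketch only: dropping duplicates across all columns is equivalent to grouping
# by every column, which is what the Aggregate transform does.
import pandas as pd

df = pd.read_csv("input.txt", sep="\t")       # assumed tab-delimited text source

deduped = df.drop_duplicates()                # duplicate rows collapse to one

deduped.to_excel("output.xlsx", index=False)  # Excel destination (needs openpyxl)
print(f"Removed {len(df) - len(deduped)} duplicate rows")
```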

Get list of columns of source flat file in SSIS

We get weekly data files (flat files) from our vendor to import into SQL, and at times the column names change or new columns are added.
What we currently have is an SSIS package that imports the columns that have been defined. Since we've assigned the mapping, SSIS only throws an error when a column is absent. However, when a new column is added (apart from the existing ones), it doesn't get imported at all, as it is not in the mapping. This is a concern for us.
What we'd like is to get the list of all the columns present in the flat file so that we can check whether any new columns are present before we import the file.
I am relatively new to SSIS, so a detailed help would be much appreciated.
Thanks!
Exactly how to code this will depend on the rules for the flat file layout, but I would approach this by writing a script task that reads the flat file using the file system object and a StreamReader object, and looks at the columns, which are hopefully named in the first line of the file.
However, about all you can do if the columns have changed is send an alert. I know of no way to dynamically change your data transformation task to accommodate new columns; it will have to be edited to handle them. And frankly, if all you're going to do is send an alert, you might as well just use the error handler to do it and save yourself the trouble of pre-reading the column list.
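The script-task idea translates to only a few lines; the sketch below is Python rather than an SSIS Script Task, and the file name, delimiter, and expected column list are assumptions.

```python
# Hypothetical header check: read the first line, compare to the expected columns,
# and alert on any difference before the data flow runs.
EXPECTED_COLUMNS = ["CustomerId", "Name", "Email", "Balance"]

with open("weekly_extract.csv", encoding="utf-8") as f:
    header = f.readline().rstrip("\r\n")

actual_columns = header.split(",")

new_columns = [c for c in actual_columns if c not in EXPECTED_COLUMNS]
missing_columns = [c for c in EXPECTED_COLUMNS if c not in actual_columns]

if new_columns or missing_columns:
    # In the package this is where you would send the alert / fail the task.
    print(f"Column change detected. New: {new_columns}, missing: {missing_columns}")
else:
    print("Header matches the expected column list.")
```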
I agree with the answer provided by @TabAlleman. SSIS can't natively handle dynamic columns (and neither can your SQL destination).
May I propose an alternative? You can detect a change in headers without using a C# Script Task. One way to do this is to create a flat file connection that reads the entire row as a single column. Use a Conditional Split to discard anything other than the header row, and save that row to a Recordset object. Any change? Send an email.
The "Get Header Row" DataFlow would look like this. Row Number if needed.
The Control Flow level would look like this. Use a ForEach ADO RecordSet object to assign the header row value to an SSIS variable CurrentHeader..
Above, the precedence constraints (the fx icons) of
@[User::ExpectedHeader] == @[User::CurrentHeader]
@[User::ExpectedHeader] != @[User::CurrentHeader]
determine whether you load data or send email.
Hope this helps!
I have worked for banking clients, and for banks, randomly adding columns to a database is not possible due to federal requirements and rules. That said, I get that you are not a federally regulated business, so here are some steps.
This is not a code issue but more one of soft skills and working with other teams (yours and your vendor's).
Steps you can take are:
(1) Agree on a solid column structure that you always require, because for newer columns the older data rows will carry NULL.
(2) If a new column is going to be sent by the vendor, you or your team needs to make the DDL/DML changes to the table where the data will be inserted, of course with the correct data type.
(3) Document this change in the data dictionary, as over time you or another team member will do analysis on this data and will want to know the purpose of each attribute or column.
(4) Long term, you do not want to keep changing the table structure every month because one of your many vendors decided to change the way they send you data. Some clients push back very aggressively, others not so much.
If a third-party tool is an option for you, check out CozyRoc's Data Flow Task Plus. It handles variable columns in sources.
SSIS cannot make the columns dynamic.
One thing I always do is use a Script Task to read the first and last lines of a file.
If the first line is not the expected list of CSV columns, I mark the file as errored and continue or fail as required.
Headers are obviously important, but so are footers: files can, through any number of issues, be only partially built, so asking for the header to be repeated at the end of the file gives you a double check.
I also do not know whether SSIS can do this dynamically, but it never ceases to amaze me how people add or reorder columns and assume things will still work.
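Building on the header check sketched earlier, the first-and-last-line idea adds a trailer comparison; again this is only a Python sketch with a hypothetical file name and header string, not SSIS code.

```python
# Hypothetical header + trailer check: a missing or wrong trailer suggests the
# file was only partially written.
EXPECTED_HEADER = "CustomerId,Name,Email,Balance"

with open("weekly_extract.csv", encoding="utf-8") as f:
    lines = f.read().splitlines()

first_line = lines[0] if lines else ""
last_line = lines[-1] if lines else ""

if first_line != EXPECTED_HEADER:
    print("Unexpected header: mark the file as errored.")
elif last_line != EXPECTED_HEADER:
    print("Missing or wrong trailer row: the file may be incomplete.")
else:
    print("Header and trailer both match; safe to load.")
```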
1 - SSIS does not provide dynamic source and destination mapping, but some third-party components, such as Data Flow Task Plus, support this feature.
2 - We can achieve the check using an SSIS Script Task.
3 - If the header is correct, continue with the migration; otherwise fail the package before the DFT executes.
4 - Read the header line in the Script Task and store it in an array or list object.
5 - Then compare those array values to user-defined variables, declared earlier, whose default values are the expected column names.
6 - If the values match exactly, carry on; otherwise fail the package.