Azure Data Factory - How to create multiple datasets and apply different treatments to files in the same blob container?

Just starting out with Azure Data Factory here.
I have a scenario where I gather csv files (different sources and formats/templates) that I store in a single Azure blob container. I would like to extract the data to an SQL DB. I need to apply different treatments to the files before pushing the data to SQL, based on the format. The format is indicated in each file name (for example: Myfile-formatA-20201201).
I am unclear on my pipeline / dataset setup. I assume I need to create a new (input) dataset for each CSV format, but I cannot find a way to create differentiated datasets that rely on the different naming patterns. If I create a single input dataset instead, I can build a pipeline with differentiated copy activities that all use that single dataset and apply different filtering rules based on my file naming pattern. This seems to work fine for files that share the same encoding, column delimiters, etc., but, as expected, it fails for the files that do not.
I could not find any official information on how to apply filters when creating multiple datasets from files contained in the same container. Is it possible at all? Or is it a prerequisite to store files with different formats in different containers or directories?

I created a test that copies csv files of different formats in one pipeline, then selects different copy activities according to the file name. I think this is the answer you want.
1. In my container, I created csv files in two formats.
2. Create a dataset pointing to the input container.
Edit: do not specify a file in the File path.
3. Use a Get Metadata1 activity to get the Child items.
The output childItems is an array of objects, one per file, each containing the file's name and type.
4. Then, in the ForEach1 activity, we can iterate over this array. Add the dynamic content @activity('Get Metadata1').output.childItems to the Items tab.
5. Inside the ForEach1 activity, we can use a Switch1 activity and add the dynamic content @split(item().name,'-')[1] to the Expression. It extracts the format name from the file name, e.g. Myfile-formatA-20201201 -> formatA.
6. In the Default case, we can copy the csv files of formatA.
Edit: in order to select only the files with "formatA" in their name, use the Wildcard file path option in the copy activity.
Key in @item().name, so we can specify one csv file.
7. Add a formatB case:
Then use the same source dataset.
Edit: as in the previous step, use the Wildcard file path option.
That's all. We can set a different sink on each of these Copy activities.
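For readers who prefer to see the routing logic spelled out, here is a minimal Python sketch of the same "split the file name and branch on the format token" idea, using the azure-storage-blob SDK; the connection string and the container name "input" are placeholders, not values from the original pipeline.

```python
# Illustrative sketch only: the same "route by format token" logic the
# Switch activity applies, written with the azure-storage-blob SDK.
# The connection string and container name "input" are placeholders.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection-string>")
container = service.get_container_client("input")

for blob in container.list_blobs():
    # e.g. "Myfile-formatA-20201201.csv" -> "formatA"
    fmt = blob.name.split("-")[1]
    if fmt == "formatB":
        print(f"{blob.name}: apply the formatB copy settings")
    else:
        # mirrors the Switch's Default case
        print(f"{blob.name}: apply the formatA copy settings")
```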

Related

Dynamic filename in Mapping Data Flow sink without the column in file

The way I understand it, if you want dynamic filenames when writing to blob storage from a mapping data flow, the solution is to set "As data in column" in the file name options on the sink. This then uses the contents of a column as the filename for each row. To set the filename in the row, you can have a derived column that contains the expression.
With auto mapping enabled on the sink this then results in having a column in the file containing the filename.
With auto mapping turned off, I could map all columns except for this one, but as I also want schema drift enabled on the source and to keep any extra columns in the destination, I can't have a fixed set of output columns.
How can I dynamically set the filename that gets generated without including it as a column in the file?
Or, if we assume every row will have the same filename, is there another way to dynamically set a filename? I've struggled to find any documentation on the file name options, but 'Pattern' looks like it just adds a number and 'Output to single file' looks like a fixed value.
When you choose the 'Output to single file' option, you can create a parameter in the Data Flow and use it as the file name, then pass the value from the pipeline to the Data Flow.
My test:
1. Add a parameter in the Data Flow.
2. Use that parameter as the file name.
3. Pass the value to the parameter from the pipeline.
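The pattern here is simply "compute the name in the caller and pass it in as a parameter". A minimal Python sketch of that pattern, outside ADF and with made-up column and file names, just for illustration:

```python
# Minimal sketch of the "pass the file name in as a parameter" pattern.
# Names and paths here are made up for illustration only.
import csv
from datetime import datetime, timezone

def write_single_file(rows, file_name):
    """Write all rows to one file whose name is supplied by the caller,
    so the name never needs to live in a column of the data itself."""
    with open(file_name, "w", newline="") as f:
        csv.writer(f).writerows(rows)

# The caller computes the name dynamically, the way a pipeline expression would.
file_name = f"export-{datetime.now(timezone.utc):%Y%m%d}.csv"
write_single_file([["id", "value"], [1, "a"], [2, "b"]], file_name)
```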

Excel to CSV Plugin for Kettle

I am trying to develop a reusable component in Pentaho which will take an Excel file and convert it to a CSV with an encoding option.
In short, I need to develop a transformation that has an Excel input and a CSV output.
I don't know the columns in advance. The columns have to be dynamically injected into the Excel input.
That's a perfect candidate for Pentaho Metadata Injection.
You should have a template transformation which contains the basic workflow (read from the Excel file, write to the text file), but without specifying the input and/or output formats. Then you should store your metadata (the list of columns and their properties) somewhere. In Pentaho's example an Excel spreadsheet is used, but you're not limited to that. I've used a couple of database tables to store the metadata, for example: one for the input format and another one for the output format.
You also need a transformation that uses the Metadata Injection step to "inject" the metadata into the template transformation. What it basically does is create a new transformation at runtime, using the template and the fields you set to be populated, and then run it.
Pentaho's example is pretty clear if you follow it step by step, and from there you can create a more elaborate solution.
You'll need at least two steps in a transformation:
Input step: Microsoft Excel input
Output step: Text file output
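For comparison only (this is not a PDI step), the same "Excel in, CSV out, columns discovered at runtime" conversion can be sketched in Python with pandas; the paths and encoding below are placeholders.

```python
# Comparison sketch, not a Kettle/PDI transformation: convert an Excel sheet
# to CSV without knowing the columns in advance. Paths and encoding are
# placeholders; reading .xlsx files requires the openpyxl package.
import pandas as pd

def excel_to_csv(xlsx_path, csv_path, encoding="utf-8"):
    df = pd.read_excel(xlsx_path)  # columns are discovered at runtime
    df.to_csv(csv_path, index=False, encoding=encoding)

excel_to_csv("input.xlsx", "output.csv", encoding="latin-1")
```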
So, here is another solution. In your Excel input step, in the Fields section, define the maximum number of fields that can appear in any Excel file. Then route the input Excel to the text file output based on the number of fields that are actually present. You need to use a Switch/Case step here.

Reading metadata CSV from a datalake, too big for a lookup activity

I need to create a pipeline to read CSVs from a folder and load the data from row 8 onward into an Azure SQL table; the first 5 rows will go into a different table ([tblMetadata]).
So far I have done it using a Lookup activity, which works fine, but one of the files is bigger than 6 MB and it fails.
I checked all the options in Lookup and read everything about the Copy activity (which I am using to load the main data, skipping 7 rows). The pipeline is created using the GUI.
The output from the Lookup is used as parameters for a Stored Procedure that inserts into tblMetadata.
Can someone advise me how to deal with this? At the moment I am in training and no one can help me on site.
You could probably do this with a single Data Flow activity that has a couple of transformations.
You would use a Source transformation that reads from a folder using folder paths and wildcards, then add a conditional split transformation to send different rows to different sinks.
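To make the split concrete, here is a rough pandas sketch of the same idea outside ADF; the file path is a placeholder, and the row numbers mirror the question (first 5 rows as metadata, data from row 8 onward).

```python
# Rough sketch of the same split outside ADF; the path is a placeholder.
import pandas as pd

csv_path = "input.csv"

# First 5 rows -> metadata (read as raw values, without a header row).
metadata = pd.read_csv(csv_path, header=None, nrows=5)

# Skip the first 7 rows so the load starts at row 8, as in the Copy activity.
data = pd.read_csv(csv_path, skiprows=7)

print(metadata)
print(data.head())
```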
I worked around it in a different way: I modified the CSVs being imported so that the whole metadata is in the first row (this was part of a different project of mine), and then used the 'First row only' option in the Lookup.

Can you set Fixed File Input column definitions dynamically in Pentaho data-integration (PDI)?

I have a metadata file which contains the column name, starting position, and length. I would like to read these values and define my columns within a FIXED FILE INPUT step.
Is there a way to do this in PDI? My file contains over 200 columns at fixed widths, and manually entering the information would be very time consuming, especially if the definition changes over time.
Use the Metadata Injection step to inject the metadata into the prescribed steps; refer to Matt Casters' post on figuring out delimited files and to the Metadata Injection description.
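Outside PDI, the same idea of driving a fixed-width layout from a metadata file can be sketched with pandas; the metadata column names (name, start, length) mirror the question, start is assumed to be 1-based, and the file paths are placeholders.

```python
# Sketch of driving a fixed-width layout from a metadata file; not a PDI step.
# The metadata CSV is assumed to have columns: name, start, length,
# with start given as a 1-based position. Paths are placeholders.
import pandas as pd

meta = pd.read_csv("layout.csv")

# Build (start, end) offsets for pandas (0-based, end-exclusive).
colspecs = [(row.start - 1, row.start - 1 + row.length) for row in meta.itertuples()]
names = meta["name"].tolist()

data = pd.read_fwf("fixed_width_data.txt", colspecs=colspecs, names=names)
print(data.head())
```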

Output Multiple flat files to multiple SQL tables

I have multiple flat files. I need to output each flat file to a different table using SSIS. I created a Foreach File Enumerator to bring in every source file, but it's uploading all of them to the same table, which then throws an error because they have different fields.
How may I configure a package to output to different tables?
You cannot, at least within a single data flow, have different source metadata. DTS supported this but SSIS does not. The number and type of columns in an SSIS data flow must be fixed.
You can have multiple data flows within your ForEach loop and then enable/disable them based on the file name or some other criteria to support loading different sources and destinations.
Some might suggest you read each file in as a single column, then use a Conditional Split based on file type and a Derived Column to split it out into specific columns. That works, but it is a maintenance nightmare I would not wish on my most hated enemy.
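Purely as an illustration of the "different file, different destination" routing (a scripted alternative, not something a single SSIS data flow supports), a sketch might look like this; the file-to-table mapping, folder, and connection string are all made up.

```python
# Illustration only: route each flat file to its own table based on its name.
# The mapping, folder, and connection string are made-up placeholders.
import glob
import os

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mssql+pyodbc://<dsn>")

# File-name prefix -> destination table (hypothetical mapping).
routes = {"customers": "Customers", "orders": "Orders"}

for path in glob.glob("incoming/*.csv"):
    prefix = os.path.basename(path).split("_")[0]
    table = routes.get(prefix)
    if table is None:
        continue  # unknown layout: skip rather than force one schema
    df = pd.read_csv(path)  # each file keeps its own column set
    df.to_sql(table, engine, if_exists="append", index=False)
```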