I need to create an Azure Data Factory pipeline that first formats the source file and then calls another pipeline. The pipeline would be triggered every time a new file is uploaded to the source blob storage. I want to reuse this pipeline for different source file formats.
For this I intend to use a Switch activity and, based on the source file name, call the corresponding Copy activity to create a formatted sink file. The issue is that the source files have standard prefixes followed by a timestamp, which means the file name is different every time, something like:
File 1:
ABCDEF_1233
ABCDEF_2244
File 2:
UVWXYZ_1222
UVWXYZ_2345
Can anyone help me understand how to do this?
I was thinking of using a Switch activity and, in the expression, using @startsWith(triggerBody().fileName, ); then in the CASE statements I would provide the file name prefixes like ABCDEF, UVWXYZ, etc. and call a Copy activity for each of the CASE statements.
But I am not sure how to specify the second argument in the startsWith() function.
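(For reference, the second argument of startsWith() is just the prefix literal, e.g. @startsWith(triggerBody().fileName, 'ABCDEF'); note that it returns a boolean, while a Switch expression has to evaluate to a string that is matched against the case values, which is why the answer below derives a prefix string first.)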
Suppose you have the file name in a variable called filename. Write an expression like the one below to find out which file you are going to load.
Have a Set Variable activity and assign the file prefix to another variable called prefix:
@if(greaterOrEquals(indexOf(variables('filename'),'ABCDEF'),0),'ABCDEF',if(greaterOrEquals(indexOf(variables('filename'),'UVWXYZ'),0),'UVWXYZ',''))
After this Set Variable activity, prefix will hold either ABCDEF or UVWXYZ.
Then you can use a Switch activity on the prefix variable and list the cases as:
ABCDEF
UVWXYZ
For each case, you can have a Copy activity that performs the related transformations.
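As a minimal sketch of how the pieces connect (variable and case names as above; how the file name reaches the pipeline is an assumption):

filename: populated from the storage event trigger, e.g. @triggerBody().fileName mapped to a pipeline parameter
prefix (Set Variable): the @if(...) expression above
Switch expression: @variables('prefix')
Case ABCDEF: Copy activity for ABCDEF files
Case UVWXYZ: Copy activity for UVWXYZ files
Default: optional handling for unrecognized prefixes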
I am trying to create a Lambda function that is triggered once a folder is uploaded to an S3 bucket. But the Lambda performs an operation that saves files back to the same folder; how can I do so without ending up with a self-invoking function?
I want to upload the following folder structure to the bucket:
Project_0001/input/inputs.csv
The outputs will create and be saved on:
Project_0001/output/outputs.csv
But my project number will change, so I can't simply assign a static prefix. Is there a way of dynamically changing the prefix, something like:
Project_*/input/
From Shubham's comment I drafted my solution using the prefix and suffix.
In my case, I set the prefix to 'Project_' and, for the suffix, chose one specific file for the trigger, so my suffix is '/input/myFile.csv'.
So every time I upload the structure Project_*/input/ (all my files, including myFile.csv), it triggers the function, and I then save my output in the same project folder under the output folder, thus not triggering the function again.
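For reference, a sketch of the corresponding event notification filter on the bucket (the event type is an assumption; the prefix and suffix are the values above):

Event type: s3:ObjectCreated:*
Filter prefix: Project_
Filter suffix: /input/myFile.csv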
I get the project name with the following code:
key = event['Records'][0]['s3']['object']['key']
project_id = key.split("/")[0]
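For completeness, a minimal sketch of the whole handler under the assumptions above (the output content and key are illustrative):

import boto3

s3 = boto3.client("s3")

def lambda_handler(event, context):
    # The object key looks like "Project_0001/input/myFile.csv"
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]
    project_id = key.split("/")[0]  # e.g. "Project_0001"

    # ... process the input file here ...
    result_body = b"col1,col2\n1,2\n"  # placeholder output

    # The output key does not end with "/input/myFile.csv", so it does not
    # match the trigger's suffix filter and the function is not invoked again.
    output_key = project_id + "/output/outputs.csv"
    s3.put_object(Bucket=bucket, Key=output_key, Body=result_body)
    return {"output_key": output_key}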
I want to get the column names for a parquet file. I have a Get Metadata activity in my pipeline, and it is using a parquet dataset with only the root folder provided. Because only the folder is provided, ADF is not letting me get the file structure that contains the column names. The file name is not provided because that can change. Can anyone provide some advice on how to approach this?
You will need 2 Get Metadata activities and a ForEach activity to get the file structure if your file name is not the same every time.
Source dataset:
Parameterize the file name as the name changes frequently.
Get Metadata1:
In the first Get Metadata activity, get the file name dynamically.
You can also add an expression in the filename if your file names follow a specific pattern, or use an asterisk (*) as a wildcard if there is no specific pattern or more than one file in the folder needs to be processed.
Set the Field list to Child items when you want to get the files from the folder.
Output of Get Metadata1: the list of file names in the folder.
ForEach activity:
Using the ForEach activity, you can iterate over the item names listed in the Get Metadata activity's output array.
Get Metadata2:
Add a Get Metadata activity inside the ForEach activity to get the file structure, i.e. the column list, of the current file. The loop runs once per item in the folder (1 or more).
To summarize: parameterize the file name in the dataset, use the first Get Metadata activity to get the list of files within the folder, and then use the second Get Metadata activity to get the list of columns for each of those files.
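A minimal sketch of the wiring, assuming the dataset parameter is called fileName and the activities are named as above:

Get Metadata1: dataset pointing at the folder, Field list = Child items
ForEach Items: @activity('Get Metadata1').output.childItems
Get Metadata2 (inside the ForEach): same dataset with fileName = @item().name, Field list = Structure
Column list for the current file: @activity('Get Metadata2').output.structure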
I'm checking daily whether certain files exist in a folder on-prem. The files have a specific format, but the first few letters indicate a specific job, for example xyz-yyyyMMdd.csv or abc-yyMMdd.csv, etc.
I would like to use a Switch activity to see whether the file for each job has arrived or an alert should be raised. How can I let the Switch activity dynamically read the 'xyz' portion, knowing that the other part of the file name is dynamic?
Thank you
If the number of your few letters is three, as you said, you can try this expression:
@substring(item().name,0,3)
If not, you can try this:
@split(item().name,'-')[0]
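For example (a sketch, assuming the files are listed by a Get Metadata child items array that feeds a ForEach activity), the Switch expression inside the loop could be @split(item().name,'-')[0], with cases xyz and abc routing to the handling for each job and the Default case left for anything unexpected.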
Can I use a parameter from Carte's URL in a Job? Something like this:
http://localhost:8080/kettle/startJob/?name=myjob&xml=Y&testvar=filename.txt
I want to do this because I have a job that transforms an input file, but I want to change that filename dynamically, and creating a new XML file for each file makes little sense.
I've tried many things and I couldn't find a solution :-(
All named parameters must be declared in the Parameters tab of the job and of the KTR it executes in order to receive the values passed from the URL parameters.
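A sketch of how this fits together, assuming the parameter is named testvar as in the URL above: declare testvar (optionally with a default value) in the Parameters tab of the job and of the transformation it runs, then reference it wherever the file name is needed, e.g.

Filename: ${testvar}

and start the job through Carte with the value appended to the URL:

http://localhost:8080/kettle/startJob/?name=myjob&xml=Y&testvar=filename.txt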
How do you generate a path based on dynamic data inside the model the file is being saved to? An example would be a User model with a fileAttachment field. If one instance of the User model has a registrationNumber of 123, I want to store their file at /123/fileName.pdf; if another user has a registrationNumber of 456, I want to store their file at /456/fileName.pdf. The path field accepts a string, and unfortunately at the time it is set there is no access to the model fields. In addition, the file is named and uploaded to AWS by the time the pre('save', ...) hook is executed, which prevents renaming it there.