My requirement is to copy a specific file, selected by wildcard, from a container/folder in the data lake to an Azure database using a Copy activity, and then copy the file into a different folder with a timestamp appended to the file name.
I used Get Metadata and Filter activities to get the name of the specific file to be loaded from the data lake/blob folder, but the Copy activity to the database and the file movement with the timestamp are both failing.
Please find attached the steps that were followed.
Can you please help?
Thanks
Found a solution for this. After the Filter activity I added a ForEach activity and, inside it, a Set Variable activity. Using this variable, I was able to archive the source file with a timestamp.
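As a minimal sketch, assuming the ForEach iterates over the Filter output, the source files are .csv, and the variable is named ArchiveFileName (names are illustrative), the Set Variable expression can build the archived file name like this:

    @concat(replace(item().name, '.csv', ''), '_', formatDateTime(utcnow(), 'yyyyMMddHHmmss'), '.csv')

The variable is then used as the sink file name of the Copy activity that writes the file to the archive folder.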
Related
Does anyone have an example of appending data to a file stored in Azure Data Lake from a source using Data Factory and the REST API?
Can I use a Copy activity with a REST dataset on the sink side?
Below is my pipeline; it consists of a Copy activity inside a ForEach loop. My requirement is: if the file already exists on the sink, then append data to the same file. (The Copy activity here overwrites the existing file with just the new data.)
Sink:
Currently, appending data to a file in Azure Data Lake is not supported in Azure Data Factory.
As a workaround:
Load the multiple files into the data lake using a ForEach activity.
Merge the individual files into a single (final) file using a Copy Data activity, as shown in the sketch below.
Delete the individual files after merging.
Please refer to this SO thread for a similar process.
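For the merge step, a minimal sketch of the Copy activity sink settings, assuming an ADLS Gen2 sink (the sink type depends on the connector actually used):

    "sink": {
        "type": "AzureBlobFSSink",
        "copyBehavior": "MergeFiles"
    }

With copyBehavior set to MergeFiles, the Copy activity merges all files from the source folder into a single file in the sink folder.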
I have a Data Factory V2 pipeline consisting of 'Get Metadata' and 'ForEach' activities that reads a list of files on a file share (on-premises) and logs it to a database table. Currently, I'm only able to read the file name, but I would also like to retrieve the date modified and/or date created property of each file. Any help, please?
Thank you
According to the MS documentation, both the File System and SFTP connectors support the lastModified property, but we can only get the lastModified of one file or folder at a time.
I'm using the File System connector for this test. The process is basically the same as in the previous post; we need to add a Get Metadata activity inside the ForEach activity.
These are my local files.
First, I created a table for logging.
create table Copy_Logs (
    Copy_File_Name varchar(max),
    Last_modified datetime
)
In ADF, I'm using the Child Items field in the Get Metadata1 activity to get the file list of the folder.
Then I add the dynamic content @activity('Get Metadata1').output.childItems to the Items setting of the ForEach1 activity.
Inside the ForEach1 activity, I use the Last modified field in the Get Metadata2 activity.
In the dataset of the Get Metadata2 activity, I key in @item().name as follows.
I use the CopyFiles_To_Azure activity to copy the local files to Azure Data Lake Storage Gen2.
I key in @item().name in the source dataset of the CopyFiles_To_Azure activity.
In the Create_Logs activity, I'm using the following SQL to get the info we need.
select '@{item().name}' as Copy_File_Name, '@{activity('Get Metadata2').output.lastModified}' as Last_modified
In the end, I sink to the SQL table we created previously. The result is as follows.
One way I can think of is to add a new Get Metadata activity inside the ForEach loop, use a parameterized dataset, and pass the file name as the parameter. The animation below should help; I tested the same.
HTH.
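As a rough sketch of that parameterized-dataset wiring (the parameter name FileName is illustrative; the activity names follow the walkthrough above):

    Dataset parameter:             FileName (String)
    Dataset file name:             @dataset().FileName
    Get Metadata2 dataset value:   FileName = @item().name
    Get Metadata2 field list:      Last modified

The modified date is then available inside the loop as @activity('Get Metadata2').output.lastModified, as used in the logging query above.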
I'm reading Azure SQL table data (which actually consists of directory paths) from Azure Data Factory. Using these paths, how can I dynamically get the files from the data lake?
Can anyone tell me what I should give in the dataset?
Screenshot
You could use a Lookup activity to read the data from Azure SQL, and then follow it with a ForEach activity. Then pass @item().<your path column> to your dataset parameter k1.
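As a minimal sketch, assuming the Lookup activity is named Lookup1 and the SQL table stores the path in a column called DirPath (both names are illustrative; only the dataset parameter k1 comes from the answer above):

    ForEach Items:            @activity('Lookup1').output.value
    Dataset parameter k1:     @item().DirPath
    Dataset folder path:      @dataset().k1

The dataset's folder path then resolves to whatever path the current row of the Lookup output contains.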
I have several csv files on GCS which share the same schema but with different timestamps for example:
data_20180103.csv
data_20180104.csv
data_20180105.csv
I want to run them through Dataprep and create BigQuery tables with corresponding names. This job should be run every day with a scheduler.
Right now what I think could work is as follows:
The csv files should have a timestamp column which is the same for every row in the same file
Create 3 folders on GCS: raw, queue and wrangled
Put the raw csv files into the raw folder. A Cloud Function is then run to move one file from the raw folder into the queue folder if the queue folder is empty, and to do nothing otherwise.
Dataprep scans the queue folder on a schedule. If a csv file is found (e.g. data_20180103.csv), the corresponding job is run and the output file is put into the wrangled folder (e.g. data.csv).
Another Cloud Function is run whenever a new file is added to the wrangled folder. This one creates a new BigQuery table named according to the timestamp column in the csv file (e.g. 20180103). It also deletes all files in the queue and wrangled folders and moves one file from the raw folder to the queue folder, if there is any.
Repeat until all tables are created.
This seems overly complicated to me, and I'm not sure how to handle cases where the Cloud Functions fail to do their job.
Any other suggestion for my use-case is appreciated.
I'm trying to use Oracle external tables to load flat files into a database, but I'm having a bit of an issue with the LOCATION clause. The files we receive are appended with several pieces of information, including the date, so I was hoping to use wildcards in the LOCATION clause, but it doesn't look like I'm able to.
I think I'm right in assuming I'm unable to use wildcards; does anyone have a suggestion on how I can accomplish this without writing large amounts of code per external table?
Current thoughts:
The only way I can think of doing it at the moment is to have a shell watcher script and a parameter table. The user can specify the input directory, file mask, external table, etc. When a file is found in the directory, the shell script generates a list of files matching the file mask. For each file found, it issues an ALTER TABLE command to change the location of the given external table to that file and launches the rest of the PL/SQL associated with that file. This can be repeated for each file matching the file mask. I guess the benefit of this is that I could also add the date to the end of the log and bad files after each run.
I'll post the solution I went with in the end, which appears to be the only way.
I have a file watcher that looks for files in a given input directory with a certain file mask. The lookup table also includes the name of the external table. I then simply issue an ALTER TABLE on the external table with the list of new file names, as sketched below.
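A minimal sketch of the statement the watcher issues, with illustrative table and file names (the real values come from the parameter/lookup table):

    alter table my_ext_table
        location ('feed_20180103_1234.dat', 'feed_20180104_1234.dat');

If a file sits in a directory object other than the table's default, each entry can be prefixed with that directory, e.g. my_dir:'feed_20180103_1234.dat'.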
For me this wasn't much of an issue, as I'm already using shell for most of the file watching and file manipulation. Hopefully this saves someone from searching for ages for a solution.