Get File Structure from Get Metadata in ADF

I want to get the column names for a parquet file. I have a Get Metadata activity in my pipeline, and it uses a parquet dataset with only the root folder provided. Because only the folder is provided, ADF does not let me get the file structure that contains the column names. The file name is not provided because it can change. Can anyone advise on how to approach this?

You will need 2 Get Metadata activities and a ForEach activity to get the file structure if your file name is not the same every time.
Source dataset:
Parameterize the file name as the name changes frequently.
Get Metadata1:
In the first Get Metadata activity, get the file name dynamically.
You can also match a specific pattern by adding an expression in the file name, or use an asterisk (*) if there is no specific pattern or if more than one file in the folder needs to be processed.
Set the field list to Child items to get the files in the folder.
Output of Get Metadata1: Get the file name from the folder.
ForEach activity:
The ForEach activity iterates over the childItems array in the Get Metadata1 output, so each iteration exposes the name of the current file.
Get Metadata2:
Add a Get Metadata activity inside the ForEach activity to get the file structure (the column list) of the current file. It runs once per item in the folder (one or more).

You can parameterize the file name in the dataset: use one Get Metadata activity to get the list of files in the folder, then a second Get Metadata activity to get the list of columns for each of those files.
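The pattern described in these answers can be sketched as pipeline JSON. The Python below only builds and prints that JSON; it is a rough sketch, not a tested pipeline. The dataset names ParquetFolder and ParquetFile, the fileName parameter and the activity name ForEach1 are hypothetical; only Get Metadata1 and Get Metadata2 come from the answer above.

```python
import json

# Get Metadata1: list the folder contents. "ParquetFolder" is a hypothetical
# dataset that points at the root folder only.
get_metadata1 = {
    "name": "Get Metadata1",
    "type": "GetMetadata",
    "typeProperties": {
        "dataset": {"referenceName": "ParquetFolder", "type": "DatasetReference"},
        "fieldList": ["childItems"],
    },
}

# Get Metadata2: runs inside the loop against a hypothetical dataset
# "ParquetFile" whose file name is exposed as a parameter called fileName.
get_metadata2 = {
    "name": "Get Metadata2",
    "type": "GetMetadata",
    "typeProperties": {
        "dataset": {
            "referenceName": "ParquetFile",
            "type": "DatasetReference",
            "parameters": {
                "fileName": {"value": "@item().name", "type": "Expression"}
            },
        },
        "fieldList": ["structure"],
    },
}

# ForEach: iterate over the childItems array returned by Get Metadata1.
for_each = {
    "name": "ForEach1",
    "type": "ForEach",
    "dependsOn": [
        {"activity": "Get Metadata1", "dependencyConditions": ["Succeeded"]}
    ],
    "typeProperties": {
        "items": {
            "value": "@activity('Get Metadata1').output.childItems",
            "type": "Expression",
        },
        "activities": [get_metadata2],
    },
}

print(json.dumps([get_metadata1, for_each], indent=2))
```

Inside the loop, the column list of the current file should then be available as activity('Get Metadata2').output.structure, an array of name/type pairs.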

Related

How to iterate through node while there is a relationship

I have nodes that are structured like folder, subfolder and files. Any folder can have a relationship with a subfolder, which can have a relationship with another subfolder, which can have a relationship with files. I'd like to iterate through every folder to find every subfolder and files inside a given folder.
In one query, I'd like to be able to get every file that is inside a folder or in its subfolders. I can't find any way to do it with Cypher. I saw FOREACH and UNWIND but I don't think they help me.

Assuming you have labelled the nodes accordingly as Folder and File, the following query will fetch all the files belonging to the starting folder, directly or through a chain of one or more sub-folders:
MATCH(ParentFolder:Folder)-[*]->(childFile:File)
WHERE ParentFolder.name='Folder1'
RETURN childFile
If you haven't used Labels (highly recommend using them), you can look for all the paths starting with the specified folder and find all the last nodes in that path.
MATCH(ParentFolder)-[*]->(childFile)
WHERE ParentFolder.name='Folder1' AND NOT (childFile)-->()
RETURN childFile
The second query will fetch all the terminal nodes, even if they are folders. You would have to use labels or add filters in the where clause to ensure only files are fetched for childFile.
Both versions of the query rely on variable-length paths. The wildcard (*) matches paths of any length starting from ParentFolder.
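To make the variable-length match concrete, here is a small Python sketch of what the first query computes: every File node reachable from the starting folder through one or more hops. The folder tree and file names are made up for illustration, and this is plain Python rather than anything Neo4j runs.

```python
# Hypothetical folder tree: each key has a relationship to the listed nodes.
CHILDREN = {
    "Folder1": ["Sub1", "a.txt"],
    "Sub1": ["Sub2", "b.txt"],
    "Sub2": ["c.txt"],
    "Folder2": ["d.txt"],
}
# Nodes that would carry the File label in the graph.
FILES = {"a.txt", "b.txt", "c.txt", "d.txt"}


def files_under(folder):
    """Collect every File reachable from `folder` through one or more hops."""
    found = []
    seen = set()
    stack = list(CHILDREN.get(folder, []))
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        if node in FILES:
            found.append(node)
        # Keep following relationships, like the [*] in the MATCH pattern.
        stack.extend(CHILDREN.get(node, []))
    return sorted(found)


print(files_under("Folder1"))  # ['a.txt', 'b.txt', 'c.txt']
```

Files under Folder2 are not returned, because no chain of relationships leads to them from Folder1.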

Azure Data Factory - Switch Activity - File name startsWith

I need to create an Azure Data Factory pipeline which has to first format the source file and then call another pipeline. The pipeline would be triggered every time a new file is uploaded to the source blob storage. I want to re-use this pipeline for different source file formats.
For this I intend to use a Switch activity and based on the source file name, call corresponding Copy activity to create a formatted sink file. The issue is that the source files have standard prefixes but then have a timestamp, which means that file name would be different every time, something like:
File 1:
ABCDEF_1233
ABCDEF_2244
File 2:
UVWXYZ_1222
UVWXYZ_2345
Can anyone help me understand how to do this?
I was thinking of using a Switch activity with @startsWith(triggerBody().fileName, ) in the expression. In the CASE statements, I would like to provide the file name prefixes, like ABCDEF, UVWXYZ etc., and then call a copy activity for each of the CASE statements.
But I am not sure how to specify the second argument in the startsWith() function.

Suppose you have the file name in a variable called filename. Write an expression like the one below to find out which file is being loaded.
Add a Set Variable activity and assign the file prefix to another variable called prefix:
@if(equals(indexOf(variables('filename'),'ABCDEF'),0),'ABCDEF',if(equals(indexOf(variables('filename'),'UVWXYZ'),0),'UVWXYZ',''))
After this Set Variable activity runs, prefix will hold either ABCDEF or UVWXYZ (or an empty string if neither matches).
Then you can use a Switch activity based on the prefix variable and define the cases as
ABCDEF
UVWXYZ
For each case, you can have a copy activity that does the related transformations.
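The prefix lookup that feeds the Switch activity can be mirrored in a few lines of Python. This is only a sketch of the logic, not ADF code; the function name pick_prefix is made up, and the prefixes are the two from the question.

```python
# Prefixes taken from the question; extend the tuple for more file types.
KNOWN_PREFIXES = ("ABCDEF", "UVWXYZ")


def pick_prefix(file_name):
    """Return the known prefix that file_name starts with, or '' if none."""
    for prefix in KNOWN_PREFIXES:
        if file_name.startswith(prefix):
            return prefix
    # No match: a Switch activity would fall through to its default case.
    return ""


for name in ("ABCDEF_1233", "UVWXYZ_1222", "OTHER_0001"):
    print(name, "->", repr(pick_prefix(name)))
```

Keeping the prefixes in one place means a new file type only needs a new entry and a new case in the Switch activity.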

Azure Data Factory check file name dynamically

I'm checking daily whether certain files exist in a folder on-prem. The files have a specific format, but the first few letters indicate a specific job. For example, xyz-yyyyMMdd.csv, or abc-yyMMdd.csv etc.
I would like to use a Switch activity to see whether the file for each job has arrived or whether an alert should be raised. How can I dynamically let the Switch activity read the 'xyz' portion, knowing that the other part of the file name is dynamic?
Thank you

If the prefix is always three letters, as in your example, you can try this expression:
@substring(item().name,0,3)
If not, you can split on the hyphen instead:
@split(item().name,'-')[0]
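As a quick check of how the two expressions differ, here is the same logic in Python. The function names and the sample file names are made up for illustration; in ADF, the third argument of substring is a length, which matches the slice below because it starts at index 0.

```python
def prefix_by_length(name, length=3):
    """Equivalent of substring(name, 0, length): the first `length` characters."""
    return name[:length]


def prefix_by_delimiter(name, delimiter="-"):
    """Equivalent of split(name, delimiter)[0]: everything before the first delimiter."""
    return name.split(delimiter)[0]


for name in ("xyz-20240131.csv", "abc-240131.csv", "longjob-20240131.csv"):
    print(name, prefix_by_length(name), prefix_by_delimiter(name))
```

The two agree for three-letter job codes, but only the split version returns the full code for a longer prefix such as longjob.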

Is there a way to list the directories in a using PySpark in a notebook?

I'm trying to see every file in a certain directory, but since each file in the directory is very large, I can't use sc.wholeTextFiles or sc.textFile. I just want to get the file names, and then pull a file if needed in a different cell. I can access the files just fine using Cyberduck, and it shows the names there.
Ex: I have the link for one set of data at "name:///mainfolder/date/sectionsofdate/indiviual_files.gz", and it works, but I want to see the names of the files in "/mainfolder/date" and in "/mainfolder/date/sectionsofdate" without having to load them all via sc.textFile or sc.wholeTextFiles. Both functions work, so I know my keys are correct, but loading the files takes too long.

Since the list of files can be retrieved by a single node, you can just list the files in the directory. Look at this response.
wholeTextFiles returns tuples of (path, content), but I don't know whether the content is loaded lazily, so that you could take only the first element of each tuple.
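One common way to list a directory from the driver without reading any file contents is to go through the Hadoop FileSystem API that Spark already has loaded. The sketch below assumes the storage scheme in the path is configured for Hadoop in the session, and it relies on the private sc._jvm and sc._jsc attributes, which are widely used but not part of the public PySpark API. The function name list_file_names is made up.

```python
def list_file_names(sc, directory):
    """Return the names of the entries directly under `directory`.

    `sc` is an existing SparkContext. Only metadata is fetched from the
    file system; no file contents are read.
    """
    jvm = sc._jvm  # private py4j gateway to the JVM (assumption: unchanged in your version)
    hadoop_conf = sc._jsc.hadoopConfiguration()
    path = jvm.org.apache.hadoop.fs.Path(directory)
    file_system = path.getFileSystem(hadoop_conf)
    return [status.getPath().getName() for status in file_system.listStatus(path)]


# Usage in a notebook cell, with the path from the question:
# names = list_file_names(sc, "name:///mainfolder/date")
```

listStatus is not recursive, so the function would need to be called again for each subdirectory such as sectionsofdate.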

Download Multiple Attachments from Salesforce using Jitterbit

I am able to create a query for attachments and download 1 individual file like this:
SOQL:
SELECT Body, Id FROM Attachment WHERE Id = '00P4M00000q8ChI'
Code on Body:
<trans>$content = root$transaction.response$body$queryResponse$result$records.Attachment$Body$;
$decoded_content=Base64Decode($content);
WriteFile("<TAG>Targets/Files/FMLA _Extract</TAG>",$decoded_content);
</trans>
But when the multiple attachments are pulled, it creates 1 large file. This large file sometimes shows the first page, but most of the time Adobe is not able to read it. Instead, I would like to have multiple files listed on my target directory.
Thank you in advance for your help!
Target file:
FMLA_Extract

What does your file target look like? (Targets/File/FMLA_Extract). I'm guessing it's configured to append to existing files and you're not changing the file name, so they all get glommed on top of each other.
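If that guess is right, the fix is to give every attachment its own target file name. The Python sketch below shows the idea outside Jitterbit: decode each record's Body and write it to a file named after the record Id. The function name and the record layout (dicts with Id and Body keys) are assumptions for illustration, not Jitterbit's API.

```python
import base64
from pathlib import Path


def write_attachments(records, target_dir):
    """Write each record's decoded Body to its own file, named after the record Id."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for record in records:
        # A unique name per attachment prevents the files from being
        # appended on top of each other.
        path = target / f"FMLA_Extract_{record['Id']}"
        path.write_bytes(base64.b64decode(record["Body"]))
        written.append(path)
    return written
```

In the Jitterbit script, the equivalent change would be to make the target file name vary per record, for example by including the attachment Id in it.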
