ADF Better way to count the number of files matching a FileMask, in a known folder - azure-data-factory-2

I have a Known\Folder Path.
That folder contains several hundred small txt files.
Generally the filenames are of the form Prefix_<Code1>_<SubCode2>_<State>.txt
I want to know how many files there are for a specific value of Code1.
I was hoping to use the GetMetadata activity, with Path Known\Folder\Prefix_Value_*.txt, but that just returns empty set :(
Currently I've got it working with GetMetadata on Known\Folder, with childItems captured, and then a foreach over all the files, with If on #startsWith(file.name, 'Prefix_Value').
But that results in hundreds of iterations of the loop, in sequence, and each activity takes ~1 second so it ends up taking minutes to do this check.
Is there a better way to do this? Either to direclty locate all files matching my mask, or a better way to count the matching elements of a hundreds-of-items array?

Lots of little activities might be expensive if you do it often.
If you only want the count, you can do this in the following hideous way (promise it isn't written in Brainf&ck) ... it relies on the fact that you can use XPATH to scan XML in ADF. You only need a set-variable activity after your metadata lookup.
Set a variable equal to this - it will contain the number of files with 'Code1' in the name.
#{xpath(xml(json(concat('{"files":{',replace(replace(replace(replace(replace(replace(string(activity('Get Metadata1').output.childitems),'[',''),']',''),'{',''),'}',''),',"type":',':'),'"name":',''),'}}'))),'count(/files/*[contains(local-name(),''Code1'')])')}
The inner part:
replace(replace(replace(replace(replace(replace(string(activity('Get Metadata1').output.childitems),'[',''),']',''),'{',''),'}',''),',"type":',':'),'"name":','')
takes the metadata activity's output and strips the []{} parts and the type and name elements, then
json(concat('{"files":{',<the foregoing>,'}}')
wraps that up in to a JSON object, with files as the outer key and the filenames as inner keys (with text = "file" but that's going to be irrelevant).
Then you can take that JSON, turn it into XML and query the XML.
xpath(xml(<the above JSON>), 'count(/files/*[contains(local-name(),''Code1'')])')
The XPATH query counts all the elements under /files (which are now our filenames) whose names contain the text 'Code1'.

There is no way to get the file count directly matching Wildcard in Get Metadata activity by now. You can vote Get Metadata for Multiple Files Matching Wildcard to progress this feature.
If you only want to copy those files, you can use Wildcard file path.
If those files stored in Azure Blob Storage or somewhere that can be got file count with prefix by API, you can use Azure Function activity.

Related

Save large BigQuery results to another project's BigQuery

I need to run a join query on BigQuery of one project, that may return large amount of data (that may not fit in VM's memeory), and then save the results in the BigQuery of another project.
Is there an easy way to do this without loading the data in VM, as data size can vary and VM may not have enough memory to load it?
One method is to bypass the VM for the operation and utilize Google Cloud Storage instead.
The process will look like following
Create a GS bucket that both projects has access to
Source project - Export the table to the GS bucket (this is possible from the web interface, pretty sure the CLI tools can do it to)
Destination project - Create a new table from the files in the GS bucket
to save result of query to a table in any project - you do not need to save it first to VM you should just set properly destination property and of course you need to have write permissions to dataset that contain that table!
Destination property can vary depend on client tool you use
for example, if you are using REST API's jobs.insert you should set below property
configuration.query.destinationTable nested object [Optional]
Describes the table where the query results should be stored. If not
present, a new table will be created to store the results. This
property must be set for large results that exceed the maximum
response size.
configuration.query.destinationTable.datasetId string [Required]
The
ID of the dataset containing this table.
configuration.query.destinationTable.projectId string [Required]
The
ID of the project containing this table.
configuration.query.destinationTable.tableId string [Required]
The ID
of the table. The ID must contain only letters (a-z, A-Z), numbers
(0-9), or underscores (_). The maximum length is 1,024 characters.

How to preserve Google Cloud Storage rows order in compressed files

We've created a query in BigQuery that returns SKUs and correlations between them. Something like:
sku_0,sku_1,0.023
sku_0,sku_2,0.482
sku_0,sku_3,0.328
sku_1,sku_0,0.023
sku_1,sku_2,0.848
sku_1,sku_3,0.736
The result has millions of rows and we export it to Google Cloud Storage which results in several compressed files.
These files are downloaded and we have a Python application that loops through them to make some calculations using the correlations.
We tried then to make use of the fact that our first columns of SKUs is already ordered and not have to apply this ordering inside of our application.
But then we just found that the files we get from GCS changes the order in which the skus appear.
It looks like the files are created by several processes reading the results and saving it in different files, which breaks the ordering we wanted to maintain.
As an example, if we have 2 files created, the first file would look something like that:
sku_0,sku_1,0.023
sku_0,sku_3,0.328
sku_1,sku_2,0.0848
And the second file:
sku_0,sku_2,0.482
sku_1,sku_0,0.328
sku_1,sku_3,0.736
This is an example of what it looks like two processes reading the results and each one saving its current row on a specific file which changes the order of the column.
So we looked for some flag that we could use to force the preservation of the ordering but couldn't find any so far.
Is there some way we could use to force the order in these GCS files to be preserved? Or is there some workaround?
Thanks in advance,
As far I know there is no flag to maintain order.
As a workaround you can rethink your data output to use of NESTED type, and make sure that what you want to group together are converted in NESTED rows, and you can export to JSON.
is there some workaround?
As an option - you can move your processing logic from Python to BigQuery, thus eliminating moving data out of BigQuery to GCS.

Notating large batch of files

I have about 30,000 different files all with different file formats names. I want to put together a list of "unique" files given that the dates/etc. are replaced by generic characters/symbols.
For example:
20160105asdf_123456_CODE.txt
Would be notated into:
YYYYMMDD*_######_XXXX.txt
Any ideas on how to do this efficiently on a large scale? I thought about parsing it out per delimiter ("_"), but I'm sure there's something a lot easier out there.

ETL file loading: files created today, or files not already loaded?

I need to automate a process to load new data files into a database. My question is about the best way to determine which files are "new" in an automated fashion.
Files are retrieved from a directory that is synced nightly, so the list of files keeps growing. I don't have the option to wipe out files that I have already retrieved.
New records are stored in a raw data table that has a field indicating the filename where each record originated, so I could compare all filenames currently in the directory with filenames already in the raw data table, and process only those filenames that aren't in common.
Or I could use timestamps that are in the filenames, and process only those files that were created since the last time the import process was run.
I am leaning toward using the first approach since it seems less prone to error, but I haven't had much luck finding whether this is actually true. What are the pitfalls of determining new files in this manner, by comparing all filenames with the filenames already in the database?
File name comparison:
If you have millions of files then comparison might not what you are
looking for.
You must be sure that the files in the said folder never gets
deleted.
Get filenames by date:
Since these filenames are retrieved once a day can guarantee the
accuracy. (Even they created in millisecond difference)
Will be efficient if many files are there.
Pentaho gives the modified date not the created date.
To do either of the above, you can use the following Pentaho step.
Configuration Get File Names step:
File/Directory: Give the folder path contains the files.
Wildcard (RegExp): .*\.* to get all or .*\.pdf to get specific
format.

how to look for the content of text file in pentaho?

I have a ETL which give text file output and I have to check the those text content has the word error or bad using pentaho.
Is there any simple way to find it?
If you are trying to process a number of files, you can use a Get Filenames step to get all the filenames. Then, if your text files are small, you can use a Get File Content step to get the whole file as one row, then use a Java Filter or other matching step (RegEx, e.g.) to search for the words.
If your text files are too big but line-based or otherwise in a fixed format (which it likely is if you used a text file output step), you can use a Text File Input step to get the lines, then a matcher step (see above) to find the words in the line. Then you can use a Filter Rows step to choose just those rows that contain the words, then Select Values to choose just the filename, then a Sort Rows on the filename, then a Unique Rows step. The result should be a list of filenames whose contents contain the search words.
This may seem like a lot of steps, but Pentaho Data Integration or PDI (aka Kettle) is designed to be a flow of steps with distinct (and very reusable) functionality. A smaller but less "PDI" method is to write a User Defined Java Class (or other scripting) step to do all the work. This solution has a smaller number of steps but is not very configurable or reusable.
If you're writing these files out yourself, then dont you already know the content? So scan the fields at the point at which you already have them in memory.
If you're trying to see if Pentaho has written an error to the file, then you should use error handling on the output step.
Finally PDI is not a text searching tool. If you really need to do this, then probably best bet is good old grep..