Notating large batch of files - sql

I have about 30,000 different files, all with different file name formats. I want to put together a list of "unique" file name patterns, where the dates etc. are replaced by generic characters/symbols.
For example:
20160105asdf_123456_CODE.txt
Would be notated into:
YYYYMMDD*_######_XXXX.txt
Any ideas on how to do this efficiently on a large scale? I thought about parsing it out per delimiter ("_"), but I'm sure there's something a lot easier out there.
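One possible approach, sketched here in Python rather than SQL, is to tokenize each name with a single regular expression and replace every token with a placeholder. The rules below are assumptions inferred from the one example in the question: an 8-digit run that looks like a date becomes YYYYMMDD, other digit runs become one # per digit, uppercase runs become one X per letter, lowercase runs collapse to *, and the extension is kept as-is. The function name `notate` is hypothetical.

```python
import re
from collections import Counter

# One pass over the name; the first alternative that matches wins.
TOKEN = re.compile(
    r"(?P<date>(?<!\d)(?:19|20)\d{2}(?:0[1-9]|1[0-2])(?:0[1-9]|[12]\d|3[01])(?!\d))"
    r"|(?P<digits>\d+)"
    r"|(?P<upper>[A-Z]+)"
    r"|(?P<lower>[a-z]+)"
)

def _placeholder(match):
    kind = match.lastgroup
    if kind == "date":
        return "YYYYMMDD"
    if kind == "digits":
        return "#" * len(match.group())
    if kind == "upper":
        return "X" * len(match.group())
    return "*"  # lowercase run of any length

def notate(filename):
    """Return the generic pattern for one file name (extension untouched)."""
    stem, dot, ext = filename.rpartition(".")
    if not dot:  # no extension at all
        stem, ext = filename, ""
    return TOKEN.sub(_placeholder, stem) + dot + ext

def unique_patterns(filenames):
    """Count how many file names fall under each pattern."""
    return Counter(notate(name) for name in filenames)
```

Calling `unique_patterns` on the full list of 30,000 names gives each distinct pattern together with how many files share it. Separators such as "_" pass through unchanged because the expression never matches them.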

Related

ADF Better way to count the number of files matching a FileMask, in a known folder

I have a Known\Folder Path.
That folder contains several hundred small txt files.
Generally the filenames are of the form Prefix_<Code1>_<SubCode2>_<State>.txt
I want to know how many files there are for a specific value of Code1.
I was hoping to use the GetMetadata activity, with Path Known\Folder\Prefix_Value_*.txt, but that just returns empty set :(
Currently I've got it working with GetMetadata on Known\Folder, with childItems captured, and then a foreach over all the files, with If on @startsWith(file.name, 'Prefix_Value').
But that results in hundreds of iterations of the loop, in sequence, and each activity takes ~1 second so it ends up taking minutes to do this check.
Is there a better way to do this? Either to directly locate all files matching my mask, or a better way to count the matching elements of a hundreds-of-items array?
Lots of little activities might be expensive if you do it often.
If you only want the count, you can do this in the following hideous way (promise it isn't written in Brainf&ck) ... it relies on the fact that you can use XPATH to scan XML in ADF. You only need a set-variable activity after your metadata lookup.
Set a variable equal to this - it will contain the number of files with 'Code1' in the name.
@{xpath(xml(json(concat('{"files":{',replace(replace(replace(replace(replace(replace(string(activity('Get Metadata1').output.childitems),'[',''),']',''),'{',''),'}',''),',"type":',':'),'"name":',''),'}}'))),'count(/files/*[contains(local-name(),''Code1'')])')}
The inner part:
replace(replace(replace(replace(replace(replace(string(activity('Get Metadata1').output.childitems),'[',''),']',''),'{',''),'}',''),',"type":',':'),'"name":','')
takes the metadata activity's output and strips the []{} parts and the type and name elements, then
json(concat('{"files":{',<the foregoing>,'}}'))
wraps that up into a JSON object, with files as the outer key and the filenames as inner keys (with text = "file" but that's going to be irrelevant).
Then you can take that JSON, turn it into XML and query the XML.
xpath(xml(<the above JSON>), 'count(/files/*[contains(local-name(),''Code1'')])')
The XPATH query counts all the elements under /files (which are now our filenames) whose names contain the text 'Code1'.
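To make the string manipulation easier to follow, here is a rough Python imitation of the same steps. It is only an illustration, not ADF code. It assumes that string() on childItems produces compact JSON with "name" before "type", and it replaces the XML/XPath step with a plain count over the keys. The function name `count_matching` and the sample file names are hypothetical.

```python
import json

def count_matching(child_items, needle):
    # Assumption: mimics ADF's string(...) as compact JSON, "name" then "type".
    text = json.dumps(child_items, separators=(",", ":"))

    # The same six replace() calls as in the expression, innermost first.
    for old, new in [
        ("[", ""), ("]", ""), ("{", ""), ("}", ""),
        (',"type":', ":"),   # "name":"a.txt","type":"File" -> "name":"a.txt":"File"
        ('"name":', ""),     # -> "a.txt":"File"
    ]:
        text = text.replace(old, new)

    # Wrap as {"files":{"a.txt":"File",...}} so every file name is a key.
    files = json.loads('{"files":{' + text + "}}")["files"]

    # Stand-in for count(/files/*[contains(local-name(), needle)]).
    return sum(1 for name in files if needle in name)

sample = [
    {"name": "Prefix_A_1_X.txt", "type": "File"},
    {"name": "Prefix_B_2_Y.txt", "type": "File"},
    {"name": "Prefix_A_3_Z.txt", "type": "File"},
]
matches = count_matching(sample, "Prefix_A")
```

Like the original expression, this breaks if a file name itself contains any of the characters []{}.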
There is currently no way to get the count of files matching a wildcard directly in the Get Metadata activity. You can vote for the "Get Metadata for Multiple Files Matching Wildcard" feature request to help move it forward.
If you only want to copy those files, you can use a wildcard file path.
If the files are stored in Azure Blob Storage, or anywhere else whose API can return a file count by prefix, you can use an Azure Function activity.

count number of lines a file has in ColdFusion

I'm keeping some basic info written in a file, but 99% of the time, I just need to count the number of lines there are as efficiently as it is reasonably possible.
Is there a way to get the row count directly, or do I need to loop through the file?
Read the file and treat it as a list delimited by CR/LF characters. The listLen() will be the number of lines in the file. Depending on whether you want to count empty lines, you might need to use the includeEmptyValues option.
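The difference between counting and skipping empty lines can be shown with a small sketch. It is written in Python rather than CFML, purely to illustrate the idea; the function name `count_lines` is hypothetical.

```python
def count_lines(text, include_empty=True):
    """Count lines in text, optionally ignoring completely empty ones."""
    lines = text.splitlines()  # handles CR, LF and CRLF line endings
    if include_empty:
        return len(lines)
    return sum(1 for line in lines if line)

content = "first\r\nsecond\r\n\r\nfourth\r\n"
total = count_lines(content)
non_empty = count_lines(content, include_empty=False)
```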

how to look for the content of text file in pentaho?

I have an ETL which produces text file output, and I have to check whether the content of those text files contains the word "error" or "bad" using Pentaho.
Is there any simple way to find it?
If you are trying to process a number of files, you can use a Get Filenames step to get all the filenames. Then, if your text files are small, you can use a Get File Content step to get the whole file as one row, then use a Java Filter or other matching step (RegEx, e.g.) to search for the words.
If your text files are too big but line-based or otherwise in a fixed format (which they likely are if you used a Text File Output step), you can use a Text File Input step to get the lines, then a matcher step (see above) to find the words in the line. Then you can use a Filter Rows step to choose just those rows that contain the words, then Select Values to choose just the filename, then a Sort Rows on the filename, then a Unique Rows step. The result should be a list of filenames whose contents contain the search words.
This may seem like a lot of steps, but Pentaho Data Integration or PDI (aka Kettle) is designed to be a flow of steps with distinct (and very reusable) functionality. A smaller but less "PDI" method is to write a User Defined Java Class (or other scripting) step to do all the work. This solution has a smaller number of steps but is not very configurable or reusable.
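For comparison, the line-based flow described above (read lines, match the words, keep the filename, sort, de-duplicate) can be sketched outside PDI in a few lines of Python. This is only an illustration of the logic, not a PDI step. The function name `files_with_words` is hypothetical, and matching whole words case-insensitively is an assumption.

```python
import re

def files_with_words(paths, words=("error", "bad")):
    """Return the sorted, de-duplicated paths whose content has any of the words."""
    # Assumption: whole-word, case-insensitive match (so "badge" does not count).
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b",
        re.IGNORECASE,
    )
    hits = set()
    for path in paths:
        with open(path, encoding="utf-8", errors="replace") as handle:
            for line in handle:          # stream line by line, so large files are fine
                if pattern.search(line):
                    hits.add(str(path))
                    break                # one hit is enough for this file
    return sorted(hits)
```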
If you're writing these files out yourself, then don't you already know the content? So scan the fields at the point at which you already have them in memory.
If you're trying to see if Pentaho has written an error to the file, then you should use error handling on the output step.
Finally, PDI is not a text searching tool. If you really need to do this, then your best bet is probably good old grep.

customizing output from database and formatting it

Say you have an average looking database. And you want to generate a variety of text files (each with their own specific formatting - so the files may have rudimentary tables and spacing). So you'd be taking the data from the Database, transforming it in a specified format (while doing some basic logic) and saving it as a text file (you can store it in XML as an intermediary step).
So if you had to create 10 of these unique files what would be the ideal approach to creating these files? I suppose you can create classes for each type of transformation but then you'd need to create quite a few classes, and what if you needed to create another 10 more of these files (a year down the road)?
What do you think is a good approach to this problem? How can I keep the output files customizable without creating a mess of code and maintenance effort?
Here is what I would do if I were to come up with a general approach to this vague question. I would write three pieces of code, independent of each other:
a) A query processor which can run a query on a given database and output results in a well-known xml format.
b) An XSL stylesheet which can interpret the well-known xml format in (a) and transform it to the desired format.
c) An XML-to-Text transformer which can read the files in (a) and (b) and put out the result.
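As a rough illustration of piece (a) only, the sketch below runs a query and serializes the rows into a simple XML document. It uses Python with an in-memory SQLite database purely so the example is self-contained. The element names (resultset, row, field) and the function name `query_to_xml` are assumptions made up for this example; the answer does not specify the "well-known" format. The XSL stylesheet and the transformer from (b) and (c) are not shown.

```python
import sqlite3
import xml.etree.ElementTree as ET

def query_to_xml(conn, sql, params=()):
    """Run a query and return the result set as an XML string."""
    cursor = conn.execute(sql, params)
    columns = [description[0] for description in cursor.description]

    root = ET.Element("resultset")            # hypothetical root element
    for row in cursor:
        row_el = ET.SubElement(root, "row")
        for column, value in zip(columns, row):
            field = ET.SubElement(row_el, "field", name=column)
            field.text = "" if value is None else str(value)
    return ET.tostring(root, encoding="unicode")

# Demo data, invented for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO customer VALUES (?, ?)", [(1, "Ann"), (2, "Bob")])
xml_text = query_to_xml(conn, "SELECT id, name FROM customer ORDER BY id")
```

Because every report goes through the same intermediate format, adding another output file later means writing a new query and a new stylesheet rather than a new class.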

efficient diff between large file and other small files

I wish to get some expert advice on this problem.
I have two text files, one very large (~GB) and the other small (~MB). These files essentially have information per line. I can say that the bigger file has a subset of information about the smaller file. Each line in the files is organized as tuples separated by spaces, and the diff is found by looking at one or more columns in those two files. Both of these files are sorted on one such column (document id).
I implemented it by keeping index on document id and line number and doing a random access to that line in larger file to start the diff. But this method is slow. I want to know any good mechanism for this scenario.
Thanks in advance.
If the files are known to be sorted in the same order by the same key, and the lines that share a common key are expected to match exactly, then comm is probably what you want - it has flags to allow you to show only the lines that are common between two files, or the lines that are in one file but not the other.
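The reason this works in a single sequential pass, with no index and no random access, is that both inputs are already sorted. The Python sketch below shows that merge-style pass. It is an illustration of the idea, not a reimplementation of comm: it compares whole lines using Python's string ordering, so it assumes both files were sorted with that same ordering. The function name `merge_compare` is hypothetical.

```python
def merge_compare(lines_a, lines_b):
    """Walk two sorted line sequences once and label every line.

    Yields ("only_a", line), ("only_b", line) or ("both", line).
    """
    iter_a, iter_b = iter(lines_a), iter(lines_b)
    a, b = next(iter_a, None), next(iter_b, None)
    while a is not None and b is not None:
        if a == b:
            yield ("both", a)
            a, b = next(iter_a, None), next(iter_b, None)
        elif a < b:
            yield ("only_a", a)
            a = next(iter_a, None)
        else:
            yield ("only_b", b)
            b = next(iter_b, None)
    # Whatever remains exists on one side only.
    while a is not None:
        yield ("only_a", a)
        a = next(iter_a, None)
    while b is not None:
        yield ("only_b", b)
        b = next(iter_b, None)

result = list(merge_compare(["a", "c", "d"], ["b", "c"]))
```

Open file objects can be passed in directly, since they iterate line by line, so neither file has to fit in memory.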