How to search the content of a text file in Pentaho?

I have an ETL that produces a text file output, and I have to check whether that text content contains the word "error" or "bad", using Pentaho.
Is there any simple way to do this?

If you are trying to process a number of files, you can use a Get Filenames step to get all the filenames. Then, if your text files are small, you can use a Get File Content step to get the whole file as one row, then use a Java Filter or another matching step (a RegEx step, for example) to search for the words.
If your text files are too big but line-based or otherwise in a fixed format (which they likely are if you used a Text File Output step), you can use a Text File Input step to get the lines, then a matcher step (see above) to find the words in each line. Then you can use a Filter Rows step to keep just those rows that contain the words, then Select Values to keep just the filename, then a Sort Rows on the filename, then a Unique Rows step. The result should be a list of filenames whose contents contain the search words.
This may seem like a lot of steps, but Pentaho Data Integration or PDI (aka Kettle) is designed to be a flow of steps with distinct (and very reusable) functionality. A smaller but less "PDI" method is to write a User Defined Java Class (or other scripting) step to do all the work. This solution has a smaller number of steps but is not very configurable or reusable.
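For comparison, here is a minimal Python sketch of the same filter-and-deduplicate logic outside PDI (the folder path and search words are placeholders); it is only meant to illustrate what the step chain above accomplishes, not to replace it:

```python
import re
from pathlib import Path

# Placeholder input folder and search terms -- adjust to your environment.
INPUT_DIR = Path("/data/etl_output")
PATTERN = re.compile(r"\b(error|bad)\b", re.IGNORECASE)

def files_containing_errors(input_dir: Path) -> list[str]:
    """Return the unique filenames whose contents contain 'error' or 'bad'."""
    matches = set()
    for path in sorted(input_dir.glob("*.txt")):       # Get Filenames
        with path.open(encoding="utf-8", errors="replace") as fh:
            for line in fh:                            # Text File Input (one row per line)
                if PATTERN.search(line):               # RegEx match + Filter Rows
                    matches.add(path.name)             # Select Values
                    break
    return sorted(matches)                             # Sort Rows + Unique Rows

if __name__ == "__main__":
    for name in files_containing_errors(INPUT_DIR):
        print(name)
```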

If you're writing these files out yourself, then don't you already know the content? So scan the fields at the point at which you already have them in memory.
If you're trying to see if Pentaho has written an error to the file, then you should use error handling on the output step.
Finally, PDI is not a text-searching tool. If you really need to do this, then your best bet is probably good old grep.

Related

ADF: Better way to count the number of files matching a FileMask in a known folder

I have a Known\Folder Path.
That folder contains several hundred small txt files.
Generally the filenames are of the form Prefix_<Code1>_<SubCode2>_<State>.txt
I want to know how many files there are for a specific value of Code1.
I was hoping to use the GetMetadata activity with the path Known\Folder\Prefix_Value_*.txt, but that just returns an empty set :(
Currently I've got it working with GetMetadata on Known\Folder, with childItems captured, and then a foreach over all the files, with If on #startsWith(file.name, 'Prefix_Value').
But that results in hundreds of iterations of the loop, in sequence, and each activity takes ~1 second so it ends up taking minutes to do this check.
Is there a better way to do this? Either to directly locate all files matching my mask, or a better way to count the matching elements of a hundreds-of-items array?
Lots of little activities might be expensive if you do it often.
If you only want the count, you can do this in the following hideous way (promise it isn't written in Brainf&ck) ... it relies on the fact that you can use XPATH to scan XML in ADF. You only need a set-variable activity after your metadata lookup.
Set a variable equal to this - it will contain the number of files with 'Code1' in the name.
#{xpath(xml(json(concat('{"files":{',replace(replace(replace(replace(replace(replace(string(activity('Get Metadata1').output.childitems),'[',''),']',''),'{',''),'}',''),',"type":',':'),'"name":',''),'}}'))),'count(/files/*[contains(local-name(),''Code1'')])')}
The inner part:
replace(replace(replace(replace(replace(replace(string(activity('Get Metadata1').output.childitems),'[',''),']',''),'{',''),'}',''),',"type":',':'),'"name":','')
takes the metadata activity's output and strips the []{} parts and the type and name elements, then
json(concat('{"files":{',<the foregoing>,'}}'))
wraps that up into a JSON object, with files as the outer key and the filenames as inner keys (with text = "file", but that's going to be irrelevant).
Then you can take that JSON, turn it into XML and query the XML.
xpath(xml(<the above JSON>), 'count(/files/*[contains(local-name(),''Code1'')])')
The XPATH query counts all the elements under /files (which are now our filenames) whose names contain the text 'Code1'.
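To see the intent of that expression without the XML detour, here is a hedged Python sketch of the same count applied to a Get Metadata childItems payload (the sample data below is made up):

```python
# Hypothetical childItems payload, shaped like the output of a Get Metadata activity.
child_items = [
    {"name": "Prefix_Code1_SubA_Active.txt", "type": "File"},
    {"name": "Prefix_Code1_SubB_Closed.txt", "type": "File"},
    {"name": "Prefix_Code2_SubA_Active.txt", "type": "File"},
]

# Count the filenames containing 'Code1' -- the same thing the XPATH count() does.
code1_count = sum(1 for item in child_items if "Code1" in item["name"])
print(code1_count)  # -> 2
```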
There is currently no way to get a file count matching a wildcard directly in the Get Metadata activity. You can vote for Get Metadata for Multiple Files Matching Wildcard to progress this feature.
If you only want to copy those files, you can use Wildcard file path.
If those files are stored in Azure Blob Storage, or somewhere else whose API can return a file count by prefix, you can use an Azure Function activity.
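For the Azure Function route, a rough sketch of the counting logic using the azure-storage-blob SDK could look like this (the connection string, container name and prefix are placeholders, and the function-binding boilerplate is omitted):

```python
from azure.storage.blob import ContainerClient

# Placeholder values -- substitute your own connection string, container and prefix.
CONNECTION_STRING = "<storage-account-connection-string>"
CONTAINER_NAME = "known-container"
PREFIX = "Known/Folder/Prefix_Value_"

def count_blobs_with_prefix() -> int:
    """Count the blobs whose names start with the given prefix."""
    container = ContainerClient.from_connection_string(CONNECTION_STRING, CONTAINER_NAME)
    return sum(1 for _ in container.list_blobs(name_starts_with=PREFIX))

if __name__ == "__main__":
    print(count_blobs_with_prefix())
```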

Return a certain row from a Scala table with SQL

I'm very new to Databricks and Spark, so I hope my question is clear. If not, please let me know.
I have a folder in Azure with more than 2 million XML files. The goal is to convert all of these files into one CSV file. I have code that can convert XML to CSV and then add it to a CSV file in Azure. I've tested it with 50,000 files and it worked.
However, when I want to convert all the XML files (2+ million), I get an error that the driver limit is exceeded. I don't want to increase this limit, since that is not very efficient, so I came up with the idea of converting one XML file at a time and then appending it to the CSV file. So instead of converting all the XML files in one job, I want to convert one XML file per job.
A colleague was able to develop code in Scala that creates a table with all 2+ million file paths. I can access this table using SQL:
(The full paths are not shown due to security reasons).
What I actually need is code in Python that can loop through this table and retrieve one path (as a string) at a time. The reason I need this in Python is that I have the code to convert to CSV in Python, and the conversion only needs the path as a string to run. If I'm able to put this in a loop, then on each iteration a new path is retrieved from the table as a string, converted to CSV and then appended to one CSV file.
So my question is: how can I loop through this table, returning the path (the value in the table) as a string on each iteration? The iteration should be able to go through the whole list (2+ million paths).
I hope my question is clear and someone can help.
Best regards,
Ganesh
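As a rough illustration of what such a loop might look like (the table name paths_table and column name path are hypothetical stand-ins for the Scala-created table, and convert_to_csv stands in for the existing Python conversion code), one option is to stream the rows to the driver with toLocalIterator so that only a small batch of paths is in driver memory at any time:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

def convert_to_csv(path: str) -> None:
    """Placeholder for the existing Python XML-to-CSV conversion code."""
    ...

# 'paths_table' and 'path' are hypothetical names for the Scala-created table and column.
paths_df = spark.sql("SELECT path FROM paths_table")

# toLocalIterator() streams rows back to the driver one partition at a time,
# so the full 2+ million paths never need to fit in driver memory at once.
for row in paths_df.toLocalIterator():
    convert_to_csv(row["path"])
```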

customizing output from database and formatting it

Say you have an average-looking database, and you want to generate a variety of text files, each with its own specific formatting, so the files may contain rudimentary tables and spacing. You'd be taking the data from the database, transforming it into a specified format (while doing some basic logic) and saving it as a text file (you can store it in XML as an intermediate step).
So if you had to create 10 of these unique files, what would be the ideal approach? I suppose you could create a class for each type of transformation, but then you'd need quite a few classes, and what if you needed to create another 10 of these files a year down the road?
What do you think is a good approach to this problem, one that keeps the output files customizable without creating a mess of code and maintenance effort?
Here is what I would do if I were to come up with a general approach to this vague question. I would write three pieces of code, independent of each other:
a) A query processor which can run a query on a given database and output results in a well-known xml format.
b) An XSL stylesheet which can interpret the well-known xml format in (a) and transform it to the desired format.
c) An XML-to-text transformer which can take the output of (a) and the stylesheet from (b) and produce the final text file.
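A minimal sketch of step (c) in Python using lxml, assuming the stylesheet in (b) emits plain text via <xsl:output method="text"/> (the file names are placeholders):

```python
from lxml import etree

# Placeholder file names for the query output (a), the stylesheet (b) and the result.
XML_INPUT = "query_results.xml"
XSL_STYLESHEET = "report_format.xsl"
TEXT_OUTPUT = "report.txt"

def transform_to_text(xml_path: str, xsl_path: str, out_path: str) -> None:
    """Apply the XSL stylesheet to the well-known XML and write out the text result."""
    doc = etree.parse(xml_path)
    transform = etree.XSLT(etree.parse(xsl_path))
    result = transform(doc)
    with open(out_path, "w", encoding="utf-8") as fh:
        fh.write(str(result))

if __name__ == "__main__":
    transform_to_text(XML_INPUT, XSL_STYLESHEET, TEXT_OUTPUT)
```

Swapping in a different stylesheet is then all it takes to produce the eleventh (or twentieth) file format, which is the maintainability payoff of keeping (a), (b) and (c) independent.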

How to find line number or page number using Lucene

Can anyone help me?
For my project I use Lucene for indexing files. It only gives me the file name and location, with no mention of the line number or page number.
Is it possible with Lucene to find the line number or page number? Please help me figure out how to do it.
This ended up being too long for a comment so I just made it an answer.
Are you thinking of grep (*nix tool) output where you grep a set of documents and get a result set that contains matches with a line number and text? EG:
46: I saw the brown fox jumping over the lazy dog
If so, Lucene doesn't work like that. On the OS, grep (to simplify) opens each document serially and runs your specified pattern against each line of the contents inside each document. Hence it can produce output like the example above, because it is working on the file as it exists on the machine. Lucene behaves differently.
When you index a file with Lucene, Lucene creates an inverted index, combining the contents of each document into a highly efficient structure that lets you quickly look up and find documents containing specific pieces of information. In turn, when you run a query against the Lucene inverted index, it returns its internal representation of all the documents that matched your query, along with a relevancy score to give some indication of how useful each document might be to you, given the query. It does this by operating against its own internal inverted index structure, not by iterating over all the files in place like grep. Lucene possesses no knowledge of line or page numbers, so no, it's not possible to replicate grep with Lucene right out of the box.
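To make the contrast concrete, here is a small Python sketch of the grep-style behaviour described above: it reads each file serially and reports the line numbers where a pattern matches, which is exactly the per-line information Lucene's inverted index does not keep (the file list and pattern are placeholders):

```python
import re
from pathlib import Path

def grep_with_line_numbers(paths, pattern: str):
    """Yield (path, line_number, line) for every line matching the pattern."""
    regex = re.compile(pattern)
    for path in paths:
        with Path(path).open(encoding="utf-8", errors="replace") as fh:
            for lineno, line in enumerate(fh, start=1):
                if regex.search(line):
                    yield path, lineno, line.rstrip("\n")

if __name__ == "__main__":
    # Placeholder inputs.
    for path, lineno, text in grep_with_line_numbers(["notes.txt"], r"brown fox"):
        print(f"{path}:{lineno}: {text}")
```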

SSIS: importing files some with column names, some without

Presumably due to inconsistent configuration of logging devices, I need to load via SSIS a collection of CSV files that will sometimes have a first row with column names and will sometimes not. The file format is otherwise identical.
There seems to be a chance that the logging configuration can be standardized, so I don't want to waste programming time on a script task that opens each file, determines whether it has a header row, and then processes it differently depending on the answer.
Rather, I would like to specify something like Destination.MaxNumberOfErrors, that would allow up to one error row per file (so if the only problem in the file was the header, it would not fail). The Flat File Source error is fatal though, so I don't see a way of getting it to keep going.
The meaning of the failure code is defined by the component, but the error is fatal and the pipeline stopped executing. There may be error messages posted before this with more information about the failure.
My best choice seems to be to simply ignore the first data row for now and wait to see if a more uniform configuration can be achieved. Of course, the dataset is invalid while this strategy is in place. I should add that the data is very big, so the ETL routines need to be as efficient as possible. In my opinion this contraindicates any file parsing or conditional splitting if there is any alternative.
The question is: is there a way to configure the Flat File Source to continue past this fatal error?
Yes there is!
In the "Error Output" page in the editor, change the Error response for each row to "Redirect row". Then you can trap the problem rows (the headers, in your case) by taking them as a single column through the error output of your source.
If you can assume that the header names would never appear as values in your data, then define your flat file connection manager as having no header row. The first step inside your data flow would check the values of columns 1-N against the header row values, and only let a row flow through if the values don't match.
Is there something more complex to the problem than that?
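Outside SSIS, the check described in the second suggestion amounts to something like this Python sketch: treat every file as headerless and drop the first row only if its values match the known header names (the column names and file name below are hypothetical):

```python
import csv

# Hypothetical header names, assumed never to appear as real data values.
EXPECTED_HEADER = ["device_id", "timestamp", "reading"]

def read_data_rows(path: str):
    """Yield data rows, skipping a leading header row if one is present."""
    with open(path, newline="", encoding="utf-8") as fh:
        for i, row in enumerate(csv.reader(fh)):
            if i == 0 and row == EXPECTED_HEADER:
                continue  # the first row matched the known header names, so skip it
            yield row

if __name__ == "__main__":
    for row in read_data_rows("logger_output.csv"):
        print(row)
```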