errors in transformation Kettle - pentaho

I want to get errors generated by system in Pentaho Kettle and expose it as results in transformation or job, for example i want to get errors of the HL7 input from log and expose it as results in the next step.

I want to get errors generated by system
You mean like Apache or MySQL errors? If that's the case, you may just point a Pentaho transformation to those files. They usually have a default place like /var/logs/apache2 and that would be pretty easy to read.
The part that's not that easy is if you want to parse those errors into something easier to analyse. For that I would use "load file in memory" and some "regex evaluation" steps to get the data you want out of the raw text.
But, there are better solutions for reading your logs and analyzing errors.
See LogStash for more info or similar products.

You could you save those results in a temporary csv file that the next step(s) can consume.
If you go with this solution I would recommend:
Adding a unique jobID or identifier in the file name to ensure that your next step is reading the right file.
Adding a step at the end that removes old temp files

Related

Adding fields to Cloudwatch without using JSON

So I have typical run of the mill logs from Nginx and tomcat servers which are just single line text files with typical log format. I have changed the tomcat access logs to output pipe delimited fields so I can easily process them using some unix scripts. I'd like to get rid of my unix scripts and move to using cloudwatch to process my logs in a similar manner, however I found out that cloudwatch really doesn't understand anything beyond timestamp, message, and logstream by default.
It will add fields using JSON, but JSON is verbose when it comes to log files. I'd like to just let it process a CSV file which seems like an obvious alternative to JSON. I'm willing to change my log format to meet a requirement like that, but I can't find any information about how I could do that.
Is my only option to translate my logs into JSON in order to add fields to cloudwatch? I am aware of the parse command, but I find that cumbersome to reconstitute my fields every time I want to build a query. Especially since these will mostly be access logs which will have numerous fields. I have aws cloudwatch log agent setup on my systems and I'm currently sending these logs to cloudwatch.
The closest thing there is to handling space delimited log files is to use Metric Filters. Or at least that's how the authors of CloudWatch designed it.
https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/FilterAndPatternSyntax.html
The best examples of this is here:
https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CountOccurrencesExample.html
https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/ExtractBytesExample.html
Not sure if this is going to work for what I'm trying to do with logs, but it's a start. And it's the closest thing to a proper answer. If you want it done right, you gotta do it yo'self.

Large file to LookUp other large file when files are dependent- Mule ESB

Could you please suggest. I have two files each have 80 to 90k product and these two files are interlinked with each other(one file have information on other) and i need to generate one single file by looking up the other files. These files probably comes in the sameTime with different name.
Both the files are csv and i need to generate the new csv.
Is that the only way I should keep any one of these files in memory and keep looking by iterating.
I planned to use Batch inside dataMapper. Is there any way we can keep the first file in Datamapper userDefined table or something like that.And the getting the new file to make a look up on it.( I'm not provided with external DB)
If any one of the file have some 5000 or 10k lines it the sense, i can keep that in memory and make the 80k file to look on it. I'm not comfortable to keep 80 or 90k file in memory.
Have reference this link: Mule ESB - design a multi file processing flow when files are dependent on each other.
Could you please suggest me the best solution.
Also any idea How long to process the file it does take, Thanks in advance.
Mule studio:5.3.1 and Runtime: 3.7.2
I would think of the problem as two distinct events from Mule's perspective, and plan to keep state from the first one in a "database" of some kind. This doesn't have to be an Oracle cluster or anything, you can run H2 in process or Redis on the same server as Mule for example.
I think you're on the right track with the Batch idea. When the first file is received, I'd create a record for each in a batch job. Then when the second file is received, I'd run a second batch job that looks up the relevant information from the database, and generates the CSV file you need. It could also remove the records that have been matched from the database in a subsequent batch step.
For the transformations, I'd recommend trying DataWeave instead of DataMapper. It's a better way to write transformation logic, and Mulesoft has deprecated DataMapper, to be removed as of Mule 4.0.

Need suggestions on what tool to use for manipulating a file

We have a need to create a daily process that will manipulate a file that is now being manually generating before FTPing it to a vendor. The issues with the current file are as follows:
1) It is currently comma delimited and it needs to be pipe delimited.
2) The vendor only want specific columns to be sent. They have a limit of 26 columns.
We need to develop an automated process that can be scheduled to run once a day and pick up a file with a specific extension, do the file manipulation and FTP the file.
Ideally, we would like to have some error handling in the process. We would want an email to get sent out if there was no file to process or if there was an error during the manipulation or FTP process.
My first thought was to use SQL Server Import/Export. I've done this before but that was only for packages that could be run manually. This process needs to be fully automated (after the existing file is manually generated.) I don't see a way to pick up any file with a specific extension. It looks like I have to select a specific file.
Is there a way to use Import/Export or some similar tool?
Or, do I need to write a program to do this sort of task? It seems to me like it would be more work to write a program. So, I am trying to avoid that.
Thank you for your help!
You should write a program. Seriously.

Is it possible to force an error in an Integration Services data flow to demonstrate its rollback?

I have been tasked with demoing how Integration Services handles an error during a data flow to show that no data makes it into the destination. This is an existing package and I want to limit the code changes to the package as much as possible (since this is most likely a one time deal).
The scenario that is trying to be understood is a "systemic" failure - the source file disappears midstream, or the file server loses power, etc.
I know I can make this happen by having the Error Output of the source set to Failure and introducing bad data but I would like to do something lighter than that.
I suppose I could add a Script Transform task and look for a certain value and throw an error but I was hoping someone has come up with something easier / more elegant.
Thanks,
Matt
mess up the file that you are trying to import by pasting some bad data or saving it in another format like UTF-8 or something like that
We always have a task at the end that closes the dataflow in our meta data tables. To test errors, I simply remove the ? that is the variable for the stored proc it runs. Easy to do and easy to fix back the way it was and it doesn't mess up anything datawise as our error trapping then closes the the data flow with an error. You could do something similar by adding a task to call a stored proc with an input variable but assign no parameters to it so it will fail. Then once the test is done, simply disable that task.
Data will make it to the destination if it is not running as a transaction. If you want to prevent populating partial data you have to use transactions. Then there is an option to set the end result of a control flow item as "failed" irrespective of the actual result but this is not available in data flow items. You will have to either produce an actual error in the data or code in a situation that will create an error. There is no other way...
Could we try with transaction level property of the package?
On failure of the data flow it will revert all the data from the target.
On successful dataflow only it will commit the data to target otherwise it will roll back the data from target.

How can I speed up batch processing job in Coldfusion?

Every once in awhile I am fed a large data file that my client uploads and that needs to be processed through CMFL. The problem is that if I put the processing on a CF page, then it runs into a timeout issue after 120 seconds. I was able to move the processing code to a CFC where it seems to not have the timeout issue. However, sometime during the processing, it causes ColdFusion to crash and has to restarted. There are a number of database queries (5 or more, mixture of updates and selects) required for each line (8,000+) of the file I go through as well as other logic provided by me in the form of CFML.
My question is what would be the best way to go through this file. One caveat, I am not able to move the file to the database server and process it entirely with the DB. However, would it be more efficient to pass each line to a stored procedure that took care of everything? It would still be a lot of calls to the database, but nothing compared to what I have now. Also, what would be the best way to provide feedback to the user about how much of the file has been processed?
Edit:
I'm running CF 6.1
I just did a similar thing and use CF often for data parsing.
1) Maintain a file upload table (Parent table). For every file you upload you should be able to keep a list of each file and what status it is in (uploaded, processed, unprocessed)
2) Temp table to store all the rows of the data file. (child table) Import the entire data file into a temporary table. Attempting to do it all in memory will inevitably lead to some errors. Each row in this table will link to a file upload table entry above.
3) Maintain a processing status - For each row of the datafile you bring in, set a "process/unprocessed" tag. This way if it breaks, you can start from where you left off. As you run through each line, set it to be "processed".
4) Transaction - use cftransaction if possible to commit all of it at once, or at least one line at a time (with your 5 queries). That way if something goes boom, you don't have one row of data that is half computed/processed/updated/tested.
5) Once you're done processing, set the file name entry in the table in step 1 to be "processed"
By using the approach above, if something fails, you can set it to start where it left off, or at least have a clearer path of where to start investigating, or worst case clean up in your data. You will have a clear way of displaying to the user the status of the current upload processing, where it's at, and where it left off if there was an error.
If you have any questions, let me know.
Other thoughts:
You can increase timeouts, give the VM more memory, put it in 64 bit but all of those will only increase the capacity of your system so much. It's a good idea to do these per call and do it in conjunction with the above.
Java has some neat file processing libraries that are available as CFCS. if you run into a lot of issues with speed, you can use one of those to read it into a variable and then into the database
If you are playing with XML, do not use coldfusion's xml parsing. It works well for smaller files and has fits when things get bigger. There are several cfc's written out there (check riaforge, etc) that wrap some excellent java libraries for parsing xml data. You can then create a cfquery manually if need be with this data.
It's hard to tell without more info, but from what you have said I shoot out three ideas.
The first thing, is with so many database operations, it's possible that you are generating too much debugging. Make sure that under Debug Output settings in the administrator that the following settings are turned off.
Enable Robust Exception Information
Enable AJAX Debug Log Window
Request Debugging Output
The second thing I would do is look at those DB queries and make sure they are optimized. Make sure selects are happening with indicies, etc.
The third thing I would suspect is that the file hanging out in memory is probably suboptimal.
I would try looping through the file using file looping:
<cfloop file="#VARIABLES.filePath#" index="VARIABLES.line">
<!--- Code to go here --->
</cfloop>
Have you tried an event gateway? I believe those threads are not subject to the same timeout settings as page request threads.
SQL Server Integration Services (SSIS) is the recommended tool for complex ETL (Extract, Transform, and Load) work, which is what this sounds like. (It can be configured to access files on other servers.) The question might be, can you work up an interface between Cold Fusion and SSIS?
If you can upgrade to cf8 and take advantage of cfloop file="" which would give you greater speed and the file would not be put in memory (which is probably the cause of the crashing).
Depending on the situation you are encountering you could also use cfthread to speed up processing.
Currently, an event gateway is the only way to get around the timeout limits of an HTTP request cycle. CF does not have a way to process CF pages offline, that is, there is no command-line invocation (one of my biggest gripes about CF - very little offling processing).
Your best bet is to use an Event Gateway or rewrite your parsing logic in straight Java.
I had to do the same thing, Ben Nadel has written a bunch of great articles uses java file io, to allow you to more speedily read files, write files etc...
Really helped improve the performance of our csv importing application.