Large file lookup against another large file when the files are dependent - Mule ESB

Could you please suggest an approach? I have two files, each with 80k to 90k products, and the two files are interlinked (one file holds information about the other). I need to generate one single file by looking up the other file. The files usually arrive at the same time, with different names.
Both files are CSV, and I need to generate a new CSV.
Is the only way to keep one of these files in memory and look records up by iterating over it?
I planned to use a Batch job with DataMapper. Is there any way to keep the first file in a DataMapper user-defined table (or something similar) and then have the new file look up against it? (I'm not provided with an external DB.)
If one of the files had only 5,000 or 10,000 lines, I could keep that one in memory and have the 80k file look up against it, but I'm not comfortable keeping an 80k-90k line file in memory.
I have referred to this link: Mule ESB - design a multi file processing flow when files are dependent on each other.
Could you please suggest the best solution?
Also, any idea how long processing files of this size would take? Thanks in advance.
Mule Studio: 5.3.1 and Runtime: 3.7.2

I would think of the problem as two distinct events from Mule's perspective and plan to keep state from the first one in a "database" of some kind. This doesn't have to be an Oracle cluster or anything; you can run H2 in-process, or Redis on the same server as Mule, for example.
I think you're on the right track with the Batch idea. When the first file is received, I'd create a record for each row in a batch job. Then, when the second file is received, I'd run a second batch job that looks up the relevant information from the database and generates the CSV file you need. A subsequent batch step could also remove the records that have been matched from the database.
For the transformations, I'd recommend trying DataWeave instead of DataMapper. It's a better way to write transformation logic, and MuleSoft has deprecated DataMapper, to be removed as of Mule 4.0.
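Outside of the Mule configuration itself, the lookup boils down to something like the sketch below: load the first file into a small file-backed H2 database, then stream the second file and enrich it line by line. This is only a rough Java/JDBC illustration of the idea under made-up assumptions (file names fileA.csv and fileB.csv, a comma-separated layout with the product id in the first column, and a single "info" column to carry over); the same two phases map onto the two batch jobs described above.

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

public class TwoFileJoin {
    public static void main(String[] args) throws Exception {
        // File-backed H2 database, so neither CSV has to sit fully in memory.
        try (Connection con = DriverManager.getConnection("jdbc:h2:./lookup-db");
             Statement ddl = con.createStatement()) {

            ddl.execute("CREATE TABLE IF NOT EXISTS product_info (product_id VARCHAR PRIMARY KEY, info VARCHAR)");

            // Phase 1: load the first file (assumed layout: productId,info).
            try (BufferedReader in = Files.newBufferedReader(Paths.get("fileA.csv"));
                 PreparedStatement ins = con.prepareStatement("INSERT INTO product_info VALUES (?, ?)")) {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] cols = line.split(",", 2);
                    ins.setString(1, cols[0]);
                    ins.setString(2, cols.length > 1 ? cols[1] : "");
                    ins.executeUpdate();
                }
            }

            // Phase 2: stream the second file, look each product up, write the merged CSV.
            try (BufferedReader in = Files.newBufferedReader(Paths.get("fileB.csv"));
                 PreparedStatement sel = con.prepareStatement("SELECT info FROM product_info WHERE product_id = ?");
                 BufferedWriter out = Files.newBufferedWriter(Paths.get("merged.csv"))) {
                String line;
                while ((line = in.readLine()) != null) {
                    sel.setString(1, line.split(",", 2)[0]);
                    try (ResultSet rs = sel.executeQuery()) {
                        out.write(line + "," + (rs.next() ? rs.getString(1) : ""));
                        out.newLine();
                    }
                }
            }
        }
    }
}

If you would rather avoid a database entirely, the same structure works with a HashMap keyed on the product id, at the cost of holding the smaller file in memory.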

Related

Mosaic Decisions Azure BLOB writer node creating multiple files

I'm using the Mosaic Decisions data flow feature to read a file from Azure Blob, do a few transformations, and write the data back to Azure. It worked fine, except that at the output file path I specified, it created a folder in which I can see many files with strange "part-000" etc. in their names. What I need is a single file in that output location, not many. Is there a way around this?
Mosaic Decisions uses Apache Spark as its backend execution engine. In Spark, the DataFrame being read is split into multiple partitions, and these partitions are written to the output location in parallel. That's why it creates multiple files at the target location named "part-0000", "part-0001", etc. ("part" here represents a partition).
The workaround is to check "combine-output-files-into-one" in the writer node. This will combine all of the part files into one big file. But use this with caution, and only if you really need a single file, as it comes with a performance tradeoff.
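For context, the Spark behaviour described above looks roughly like this in Spark's Java API. This is only an illustration of the partitions-versus-single-file mechanism (paths and options are placeholders), not Mosaic's actual implementation:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SinglePartFile {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("single-part-file")
                .master("local[*]")
                .getOrCreate();

        // Reading normally produces a multi-partition DataFrame,
        // so writing it yields one part-xxxx file per partition.
        Dataset<Row> df = spark.read().option("header", "true").csv("input/");

        // coalesce(1) funnels everything through a single partition,
        // which is why a "combine into one file" option costs parallelism.
        df.coalesce(1)
          .write()
          .option("header", "true")
          .csv("output/");

        spark.stop();
    }
}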

Mule SFTP component

Hi, I have the below queries about the SFTP component; if you can help me out, that would be a great help:
1) Can we get the size of the file picked up by the SFTP component? I need to restrict the transfer based on the file size.
2) Can I get the number of files and the file names picked up by the SFTP component?
3) Is this understanding correct: the SFTP component picks up all the files from the server, keeps them in memory, and processes them one by one until it finishes all files?
4) If the server has 5 files, can the SFTP component process all 5 files in parallel rather than one by one?
1) Mule does not populate the file-size field for SFTP as it does with FILE. There are Jira tickets open on this matter, but MuleSoft has called it an enhancement and not given it a priority. I disagree. Perhaps ping MuleSoft; if enough users do, maybe they will raise the priority and address it. They use the size internally, they simply do not expose it outside as is done with the FILE connector.
2) No, not really. It gives the files back one at a time, not as a list.
3 & 4) It only loads the entire file into memory if you tell it not to stream, or if you do something else, like an object-to-string transformer, which forces a memory load. The files (or file streams) are passed back one by one, but unless you restrict threading and make your flow synchronous, it will go asynchronous and multi-threaded and process multiple files in parallel. Flows default to asynchronous; subflows are synchronous.
You can use the SFTP endpoint to retrieve files and then use a Java or script call to get the files' attributes and filter so that you only process the files you are actually interested in, such as ones larger than your minimum size. This would seem more in line with what you are looking for in point 1. There are other options, but this is more straightforward than the others I can quickly think of.
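To make the "Java or script call" suggestion for point 1 concrete, here is a hedged sketch using JSch, the SSH library underneath Mule 3's SFTP transport. The host, credentials, directory, and the 10 MB limit are all placeholder assumptions:

import com.jcraft.jsch.ChannelSftp;
import com.jcraft.jsch.JSch;
import com.jcraft.jsch.Session;
import java.util.Vector;

public class SftpSizeFilter {
    public static void main(String[] args) throws Exception {
        JSch jsch = new JSch();
        Session session = jsch.getSession("user", "sftp.example.com", 22);
        session.setPassword("secret");
        session.setConfig("StrictHostKeyChecking", "no");
        session.connect();

        ChannelSftp sftp = (ChannelSftp) session.openChannel("sftp");
        sftp.connect();

        long maxBytes = 10 * 1024 * 1024; // only transfer files under 10 MB

        // ls() exposes the size attribute that the Mule SFTP endpoint does not.
        @SuppressWarnings("unchecked")
        Vector<ChannelSftp.LsEntry> entries = sftp.ls("/inbound");
        for (ChannelSftp.LsEntry entry : entries) {
            if (entry.getAttrs().isDir()) continue;
            long size = entry.getAttrs().getSize();
            System.out.println(entry.getFilename() + " -> " + size + " bytes");
            if (size <= maxBytes) {
                // hand this filename on to the rest of the flow
            }
        }

        sftp.disconnect();
        session.disconnect();
    }
}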
I found one way to get the file size: if we provide transformer-refs="Object_to_Byte_Array" on the endpoint and then do #[payload.size()], we get the size of the file in bytes. Will this cause any issue?

How to separate the latest file from multiple files in Mule

I have 5,000 files in a folder, and new files keep getting loaded into the same folder on a daily basis. I need to pick up only the latest file among all of them each day.
Is it possible to achieve this scenario in Mule out of the box?
I tried keeping a File component inside a Poll component (to make use of watermarking), but it is not working.
Is there any way we can achieve this? If not, please suggest the best approach (any relevant links welcome).
Mule Studio: 5.3, Runtime: 3.7.2.
Thanks in advance
Short answer: there isn't really any extremely quick out-of-the-box solution, but there are other ways. I'm not saying this is the right or only way of solving it, but I've implemented a similar scenario before in this way:
A normal File inbound endpoint with a database table as a file log. Each time a new file is processed, a component checks whether its name already appears in the table. Using a choice router or filter, I only continue if it isn't in there already, and after processing I add the filename to the table.
This is quite a "heavy" solution, though. A simpler approach would be to use an idempotent filter with an object store, for example a Redis server: https://github.com/mulesoft/redis-connector/blob/master/src/test/resources/redis-objectstore-tests-config.xml
It is actually very simple if your incoming filename contains a timestamp: you can configure the File inbound connector by setting file:filename-regex-filter pattern="myfilename_#[function:timestamp].csv". I hope this helps.
Maybe you can use a Quartz scheduler (specify the time in a cron expression), followed by a Groovy script from which you can start the File connector. Keep the File connector in another flow.
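If you go the scheduler-plus-script route, the selection logic itself is small. Here is a minimal Java sketch of "pick the newest file in a folder" (the folder path is a placeholder) that a Groovy script or Java component could call:

import java.io.File;
import java.util.Arrays;
import java.util.Comparator;

public class LatestFilePicker {
    // Returns the most recently modified file in the folder, or null if it is empty.
    public static File latest(String folder) {
        File[] files = new File(folder).listFiles(File::isFile);
        if (files == null || files.length == 0) {
            return null;
        }
        return Arrays.stream(files)
                .max(Comparator.comparingLong(File::lastModified))
                .orElse(null);
    }

    public static void main(String[] args) {
        File newest = latest("/data/incoming");
        if (newest != null) {
            System.out.println("Latest file: " + newest.getName());
        }
    }
}

Note that this is the "latest by modification time" interpretation; if "latest" is encoded in the filename, as in the timestamp-filter suggestion above, compare the parsed names instead.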

Enrich CSV with metadata from database

I've been looking around for a lightweight, scalable solution to enrich a CSV file with additional metadata from a database. Each line in the CSV represents a data item, and the columns hold the metadata belonging to that item.
Basically, I have a CSV extract and I need to add additional metadata from a database. The metadata can be accessed via ODBC or a REST API call.
I have a number of options in my head, but I'm looking for other ideas. My options are as follows:
Import the CSV into a database table, apply the additional metadata with SQL UPDATE statements (finding the necessary metadata with SELECT statements), and then export the data back into CSV format. For this solution I was thinking of using an ETL tool, which may be a bit heavyweight for this problem.
I also thought about a Node.js-based solution where I read the CSV in, call a web service to get the metadata, and write the data back into the CSV file. However, the CSV can be quite large, with potentially tens of thousands of rows, so this could be heavy on memory or, in the case of line-by-line processing, not very performant.
If you have a better solution in mind, please post. Many thanks.
I think you've come up with a couple of pretty good ideas here already.
Running with your first suggestion of using an ETL tool to enrich your CSV files, you should check out https://github.com/streamsets/datacollector
It's a continuous-ingestion approach, so you could even monitor a directory of CSV files and load them as you get them. While there's no specific functionality yet for doing lookups in a database, it's certainly possible in a number of ways (including writing your own custom logic in Java, or a script in Python or JavaScript).
*Full disclosure: I work on this project.
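To keep the line-by-line option from becoming a memory problem, the trick is to stream the file and reuse a single prepared lookup per row. Here is a rough Java/JDBC sketch of that shape; the connection URL, table, and column names are invented for illustration:

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class CsvEnricher {
    public static void main(String[] args) throws Exception {
        // One connection and one prepared statement, reused for every row.
        try (Connection con = DriverManager.getConnection(
                     "jdbc:postgresql://localhost:5432/metadata", "user", "secret"); // placeholder URL
             PreparedStatement lookup = con.prepareStatement(
                     "SELECT category, owner FROM item_metadata WHERE item_id = ?");
             BufferedReader in = Files.newBufferedReader(Paths.get("items.csv"));
             BufferedWriter out = Files.newBufferedWriter(Paths.get("items_enriched.csv"))) {

            String line;
            while ((line = in.readLine()) != null) {   // only one row in memory at a time
                String itemId = line.split(",", 2)[0]; // assumes the id is the first column
                lookup.setString(1, itemId);
                try (ResultSet rs = lookup.executeQuery()) {
                    String category = "", owner = "";
                    if (rs.next()) {
                        category = rs.getString("category");
                        owner = rs.getString("owner");
                    }
                    out.write(line + "," + category + "," + owner);
                    out.newLine();
                }
            }
        }
    }
}

If database round-trips turn out to be the bottleneck, the same structure works with batched IN (...) lookups every few hundred rows instead of one query per line.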

How can I speed up batch processing job in Coldfusion?

Every once in a while I am fed a large data file that my client uploads and that needs to be processed through CFML. The problem is that if I put the processing on a CF page, it runs into a timeout issue after 120 seconds. I was able to move the processing code to a CFC, where it doesn't seem to have the timeout issue. However, sometime during the processing it causes ColdFusion to crash, and it has to be restarted. There are a number of database queries (5 or more, a mixture of updates and selects) required for each of the 8,000+ lines of the file, as well as other logic provided by me in the form of CFML.
My question is: what would be the best way to go through this file? One caveat: I am not able to move the file to the database server and process it entirely with the DB. However, would it be more efficient to pass each line to a stored procedure that took care of everything? It would still be a lot of calls to the database, but far fewer than what I have now. Also, what would be the best way to provide feedback to the user about how much of the file has been processed?
Edit:
I'm running CF 6.1
I just did a similar thing and use CF often for data parsing.
1) Maintain a file upload table (parent table). For every file you upload, keep a record of the file and what status it is in (uploaded, processed, unprocessed).
2) Use a temp table to store all the rows of the data file (child table). Import the entire data file into this temporary table; attempting to do it all in memory will inevitably lead to errors. Each row in this table links back to a file upload table entry above.
3) Maintain a processing status. For each row of the data file you bring in, set a "processed/unprocessed" flag. That way, if it breaks, you can start from where you left off. As you run through each line, mark it as "processed".
4) Transactions: use cftransaction if possible to commit all of it at once, or at least one line at a time (wrapping your 5 queries). That way, if something goes boom, you don't have a row of data that is half computed/processed/updated.
5) Once you're done processing, set the file's entry in the table from step 1 to "processed".
By using the approach above, if something fails you can set it to start where it left off, or at least have a clearer path of where to start investigating, or at worst clean up your data. You will have a clear way of displaying to the user the status of the current upload: where it's at, and where it left off if there was an error. (A rough JDBC sketch of this pattern follows below.)
If you have any questions, let me know.
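Since a couple of the answers here suggest pushing the heavy lifting into straight Java, here is a minimal JDBC sketch of the staging-table/status-flag pattern from steps 2 to 5. The connection URL, table, and column names are made up; in CFML the same loop would sit inside cfquery/cftransaction tags:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

public class StagedFileProcessor {
    public static void main(String[] args) throws Exception {
        // Placeholder connection URL; the upload_rows table mirrors the child/status layout above.
        try (Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost/staging", "user", "secret")) {

            // Grab the ids of rows still waiting to be processed.
            List<Long> pending = new ArrayList<>();
            try (Statement st = con.createStatement();
                 ResultSet rs = st.executeQuery(
                         "SELECT row_id FROM upload_rows WHERE status = 'unprocessed'")) {
                while (rs.next()) {
                    pending.add(rs.getLong(1));
                }
            }

            con.setAutoCommit(false); // commit one row's worth of work at a time

            try (PreparedStatement mark = con.prepareStatement(
                    "UPDATE upload_rows SET status = 'processed' WHERE row_id = ?")) {
                for (long rowId : pending) {
                    // ... the 5+ per-line selects/updates and business logic go here ...
                    mark.setLong(1, rowId);
                    mark.executeUpdate();
                    con.commit(); // a crash mid-run leaves only unprocessed rows behind
                }
            } catch (Exception e) {
                con.rollback();
                throw e;
            }
        }
    }
}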
Other thoughts:
You can increase timeouts, give the JVM more memory, or move to 64-bit, but all of those will only increase the capacity of your system so much. It's a good idea to set these per call and do it in conjunction with the above.
Java has some neat file-processing libraries that are available as CFCs. If you run into a lot of issues with speed, you can use one of those to read the file into a variable and then into the database.
If you are working with XML, do not use ColdFusion's XML parsing. It works well for smaller files but has fits when things get bigger. There are several CFCs out there (check RIAForge, etc.) that wrap some excellent Java libraries for parsing XML data. You can then build a cfquery manually with this data if need be.
It's hard to tell without more info, but from what you have said, I'll throw out three ideas.
The first thing is that with so many database operations, it's possible you are generating too much debugging output. Make sure that under the Debug Output settings in the Administrator the following settings are turned off:
Enable Robust Exception Information
Enable AJAX Debug Log Window
Request Debugging Output
The second thing I would do is look at those DB queries and make sure they are optimized. Make sure selects are using indexes, etc.
The third thing I would suspect is that keeping the file in memory is probably suboptimal.
I would try looping through the file using file looping:
<cfloop file="#VARIABLES.filePath#" index="VARIABLES.line">
<!--- Code to go here --->
</cfloop>
Have you tried an event gateway? I believe those threads are not subject to the same timeout settings as page request threads.
SQL Server Integration Services (SSIS) is the recommended tool for complex ETL (Extract, Transform, and Load) work, which is what this sounds like. (It can be configured to access files on other servers.) The question might be: can you work up an interface between ColdFusion and SSIS?
If you can, upgrade to CF8 and take advantage of cfloop file="", which would give you greater speed; the file would not be put in memory (which is probably the cause of the crashing).
Depending on the situation you are encountering you could also use cfthread to speed up processing.
Currently, an event gateway is the only way to get around the timeout limits of an HTTP request cycle. CF does not have a way to process CF pages offline; that is, there is no command-line invocation (one of my biggest gripes about CF: very little offline processing).
Your best bet is to use an Event Gateway or rewrite your parsing logic in straight Java.
I had to do the same thing. Ben Nadel has written a bunch of great articles on using Java file I/O to let you read and write files more quickly, etc.
It really helped improve the performance of our CSV importing application.
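The core of that Java file I/O trick is just buffered, line-at-a-time reading, which CFML can reach through createObject("java", ...) or a wrapper CFC. A minimal standalone Java sketch of the idea (file names are placeholders):

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

public class BufferedCsvCopy {
    public static void main(String[] args) throws IOException {
        // Stream the file a line at a time instead of pulling it all into one variable.
        try (BufferedReader in = new BufferedReader(new FileReader("big-import.csv"));
             BufferedWriter out = new BufferedWriter(new FileWriter("processed.csv"))) {
            String line;
            while ((line = in.readLine()) != null) {
                // ... per-line parsing / validation / DB work goes here ...
                out.write(line);
                out.newLine();
            }
        }
    }
}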