We are trying to extract approximately 40 GB of data from a database and want to generate multiple CSV files. We used the Mule DB connector in streaming mode, which returns a 'ResultSetIterator'.
Q1) How can we convert this ResultSetIterator to an ArrayList, or to any other readable format that we can use further to generate the files?
Q2) We tried using the For Each component to split this data into chunks; it works for a limited set of data, but for huge data it throws a SerializationException.
In the snippet below we are chunking the data with For Each and handing the chunks to a batch job that writes multiple files:
<batch:job name="testBatchWithDBOutside">
<batch:input>
<logger message="#[payload]" level="INFO" doc:name="Logger"/>
</batch:input>
<batch:process-records>
<batch:step name="Batch_Step">
<batch:commit size="10" doc:name="Batch Commit">
<object-to-string-transformer doc:name="Object to String"/>
<logger message="#[payload]" level="INFO" doc:name="Logger"/>
<file:outbound-endpoint path="C:\output" outputPattern="#[message.id].txt" responseTimeout="10000" doc:name="File"/>
</batch:commit>
</batch:step>
</batch:process-records>
</batch:job>
<flow name="testBatchWithDBOutsideFlow" processingStrategy="synchronous">
<file:inbound-endpoint path="C:\input" responseTimeout="10000" doc:name="File"/>
<db:select config-ref="MySQL_Configuration" streaming="true" fetchSize="10" doc:name="Database">
<db:parameterized-query><![CDATA[select * from classicmodels]]></db:parameterized-query>
</db:select>
<foreach batchSize="5" doc:name="For Each">
<batch:execute name="testBatchWithDBOutside" doc:name="testBatchWithDBOutside"/>
</foreach>
</flow>
Q1. You don't want to convert the Iterator to a List, as this will defeat the purpose of streaming from the DB connector and load all records into memory. Mule handles Iterators and Lists in the same way anyway.
Q2. The batch module implies a for-each operation. The output of batch:input needs to be a List or an Iterator. You should be able to simplify this to:
<batch:job name="testBatch">
<batch:input>
<db:select config-ref="MySQL_Configuration" streaming="true" fetchSize="10" doc:name="Database">
<db:parameterized-query><![CDATA[select * from classicmodels]]></db:parameterized-query>
</db:select>
</batch:input>
<batch:process-records>
<batch:step name="Batch_Step">
<object-to-string-transformer doc:name="Object to String"/>
<file:outbound-endpoint path="C:\output" outputPattern="#[message.id].txt" responseTimeout="10000" doc:name="File"/>
</batch:step>
</batch:process-records>
</batch:job>
You will also need to replace the object-to-string-transformer with a component that converts a database record (the payload at this point will be a map where the key is the column name and the value is the column value) into a CSV line.
You can find a decent example in the Mule blog here: https://blogs.mulesoft.com/dev/anypoint-platform-dev/batch-module-reloaded/
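If you stay with the batch approach, a minimal (hypothetical) sketch of such a component could be a DataWeave transform placed before the file endpoint, wrapping the record map in a list so the CSV writer emits one line per record; treat the writer properties as an illustration rather than the definitive setup:
<dw:transform-message doc:name="Record to CSV line">
<dw:set-payload><![CDATA[%dw 1.0
%input payload application/java
%output application/csv header=false
---
[payload]]]></dw:set-payload>
</dw:transform-message>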
Another option would be to remove the batch processor and use DataWeave to generate the CSV output and stream it to the file. This might be helpful: https://docs.mulesoft.com/mule-user-guide/v/3.7/dataweave-streaming
DataWeave will call next() on the ResultSetIterator as it processes each record, and that Iterator will handle selecting chunks of records from the underlying database, so there is no queueing in between steps and no loading of the full dataset into memory.
<flow name="batchtestFlow">
<http:listener config-ref="HTTP_Listener_Configuration" path="/batch" allowedMethods="GET" doc:name="HTTP"/>
<db:select config-ref="Generic_Database_Configuration" streaming="true" doc:name="Database">
<db:parameterized-query><![CDATA[select * from Employees]]></db:parameterized-query>
</db:select>
<dw:transform-message doc:name="Transform Message">
<dw:set-payload><![CDATA[%dw 1.0
%input payload application/java
%output application/csv streaming=true, header=true, quoteValues=true
---
payload map ((e, i) -> {
surname: e.SURNAME,
firstname: e.FIRST_NAME
})]]></dw:set-payload>
</dw:transform-message>
<file:outbound-endpoint path="C:/tmp" outputPattern="testbatchfile.csv" connector-ref="File" responseTimeout="10000" doc:name="File"/>
</flow>
You want to use an OutputHandler. Make sure you have streaming turned on, then use a script component, for instance Groovy, and handle each row one at a time like so:
// script.groovy
import org.mule.api.transport.OutputHandler

// Return an OutputHandler so each row is written to the output stream as it is
// read from the ResultSetIterator, instead of building the result in memory.
return { evt, out ->
    payload.each { row ->
        out << row.SOMECOLUMN....
    }
} as OutputHandler
And the component in your XML:
<scripting:transformer returnClass="TODO" doc:name="ScriptComponent">
<scripting:script engine="Groovy" file="script.groovy" />
</scripting:transformer>
That is if you want to return some output. In your case, since you want to write to files, you wouldn't use the out variable but would instead write to your files from the script.
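For example, a rough Groovy sketch along those lines; the file path and column names are placeholders, not something from your flow:
// writeRows.groovy - iterate the streamed rows and write them straight to a CSV file
def target = new File('C:/output/export.csv')
target.withWriterAppend { w ->
    payload.each { row ->
        // adjust the column names to match your table
        w.writeLine("${row.ID},${row.NAME}")
    }
}
return 'done'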
I found a simple and quick way, shown below.
Here the DB connector is in streaming mode, and For Each splits the records into chunks of the given batch size:
<flow name="testFlow" processingStrategy="synchronous">
<composite-source doc:name="Composite Source">
<quartz:inbound-endpoint jobName="test" cronExpression="0 48 13 1/1 * ? *" repeatInterval="0" connector-ref="Quartz" responseTimeout="10000" doc:name="Quartz">
<quartz:event-generator-job/>
</quartz:inbound-endpoint>
<http:listener config-ref="HTTP_Listener_Configuration" path="/hit" doc:name="HTTP"/>
</composite-source>
<db:select config-ref="MySQL_Configuration" streaming="true" fetchSize="10000" doc:name="Database">
<db:parameterized-query><![CDATA[SELECT * FROM tblName]]></db:parameterized-query>
</db:select>
<foreach batchSize="10000" doc:name="For Each">
<dw:transform-message doc:name="Transform Message">
<dw:set-payload><![CDATA[%dw 1.0
%output application/csv
---
payload map {
field1:$.InterfaceId,
field2:$.Component
}]]></dw:set-payload>
</dw:transform-message>
<file:outbound-endpoint path="F:\output" outputPattern="#[message.id].csv" responseTimeout="10000" doc:name="File"/>
</foreach>
<set-payload value="*** Success ***" doc:name="Set Payload"/>
</flow>
I am reading multiple files from different folders and merging them into one, but I am not able to merge them into a single file.
I am using a composite source to which I added two file connectors, and then I log the payload. I get the payloads one by one. How can I get one payload that is the combination of the two different payloads, or of multiple file inputs?
<flow name="file2Flow">
<composite-source doc:name="Copy_of_Composite Source">
<file:inbound-endpoint path="src/main/resources/input1" responseTimeout="10000" doc:name="File"/>
<file:inbound-endpoint path="src/main/resources/input2" responseTimeout="10000" doc:name="File"/>
</composite-source>
<file:file-to-string-transformer doc:name="File to String"/>
<logger message="#[payload]" level="INFO" doc:name="Logger"/>
</flow>
I also tried the following, but I am not getting any output:
<flow name="file2file2Flow">
<http:listener config-ref="HTTP_Listener_Configuration" path="/files" doc:name="HTTP"/>
<scatter-gather doc:name="Scatter-Gather">
<file:outbound-endpoint path="src/main/resources/input1" responseTimeout="10000" doc:name="File"/>
<file:outbound-endpoint path="src/main/resources/input1" responseTimeout="10000" doc:name="File"/>
</scatter-gather>
<dw:transform-message doc:name="Transform Message">
<dw:set-payload><![CDATA[%dw 1.0
%output application/json
---
{
post1: payload[0],
post2: payload[1]
}]]>
</dw:set-payload>
</dw:transform-message>
<logger message="#[payload]" level="INFO" doc:name="Logger"/>
</flow>
file:inbound-endpoint will poll one directory, so if you need different directories that won't work.
composite-source allows it, but the files won't be available in the same payload.
file:outbound-endpoint is for writing files only.
In Mule 3, you can achieve this through a combination of a poll to trigger the flow, a scatter-gather to route to multiple processors, and the Mule Requester module to read files mid-flow.
Mule Requester Module: https://www.mulesoft.com/exchange/68ef9520-24e9-4cf2-b2f5-620025690913/requester-module/
Rough example:
<flow name="dw-testFlow">
<poll doc:name="Poll" frequency="10000">
<logger level="INFO" doc:name="Logger" />
</poll>
<scatter-gather doc:name="Scatter-Gather">
<mulerequester:request config-ref="muleRequesterConfig" resource="myFileEndpoint" doc:name="Mule Requester" />
<mulerequester:request config-ref="muleRequesterConfig" resource="myFileEndpoint" doc:name="Mule Requester" />
</scatter-gather>
</flow>
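For completeness, a hypothetical sketch of the global elements the example refers to; the config and endpoint names (muleRequesterConfig, myFileEndpoint) are assumptions carried over from the snippet, and in practice each request would point at its own endpoint, one per input folder:
<mulerequester:config name="muleRequesterConfig" doc:name="Mule Requester"/>
<file:endpoint name="myFileEndpoint" path="src/main/resources/input1" doc:name="File"/>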
We would like to use a DB query inside a Mule cache scope.
We want to store the output of the DB query in the cache to save a trip to the database.
If the DB query doesn't return any output, i.e. the payload is empty, we don't want to save it in the cache.
How do we invalidate the cache entries created for empty payloads?
Thank you.
The answer to this is in the Mule forum: https://forums.mulesoft.com/questions/84675/mule-cache-scope-how-to-invalidate-mule-cache-for.html
<ee:cache cachingStrategy-ref="Caching_Strategy" filterExpression="#[payload != 'testData']" doc:name="Cache">
<db:select config-ref="DBConfig" fetchSize="100" doc:name="Database">
<db:dynamic-query><![CDATA[select * from STUDENT where student_id = 'TEST']]></db:dynamic-query>
</db:select>
<choice doc:name="Choice">
<when expression="#[message.payload.size() == 0]">
<logger message="Payload is empty" level="INFO"
doc:name="Logger"/>
<dw:transform-message doc:name="Transform Message">
<dw:set-payload><![CDATA[%dw 1.0
%output application/json
---
payload]]></dw:set-payload>
</dw:transform-message>
</when>
<otherwise>
<object-to-string-transformer doc:name="Object to String"/>
</otherwise>
</choice>
<logger message="After Choice" level="INFO" doc:name="Logger"/>
</ee:cache>
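The snippet references a Caching_Strategy that is not shown; a hypothetical definition could be an object-store caching strategy such as the following, where the store name, TTL and size are made-up values:
<ee:object-store-caching-strategy name="Caching_Strategy" doc:name="Caching Strategy">
<in-memory-store name="studentCacheStore" maxEntries="500" entryTTL="60000" expirationInterval="30000"/>
</ee:object-store-caching-strategy>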
I have an input file with 500k records. I need to process these records in batches, apply a transformation, and write them to an output file. I'm experimenting with the flow below. The batch block size is set to 1000. The output file contains only 1000 records; the remaining 499k records are lost.
As per my understanding, batch starts a new instance for each block, so in this case every 1000 records will be processed by a new thread. Are these threads overwriting each other? How do I collect all the transformed records into the output file?
<flow name="poll-inbound-file">
<file:inbound-endpoint path="${file.inbound.location}"
pollingFrequency="${file.polling.frequency}" responseTimeout="10000"
doc:name="File" metadata:id="abce53af-7d82-411a-a75a-5cd8ae8e55ae"
fileAge="${file.fileage}" moveToDirectory="${file.outbound.location}"/>
<custom-interceptor
class="com.example.TimerInterceptor" doc:name="Timer" />
<dw:transform-message doc:name="Transform Message"
metadata:id="dcf84872-5aca-404f-9169-d448c9e4cd76">
<dw:input-payload mimeType="application/csv" />
<dw:set-payload><![CDATA[%dw 1.0
%output application/java
---
payload as :iterator]]></dw:set-payload>
</dw:transform-message>
<batch:job name="process-batchBatch" block-size="${batch.blocksize}">
<batch:process-records>
<batch:step name="Batch_Step1">
<logger level="TRACE" doc:name="Logger" message="#[payload]" />
</batch:step>
<batch:step name="Batch_Step2">
<logger level="TRACE" doc:name="Logger" message="#[payload]" />
</batch:step>
<batch:step name="Batch_Step3">
<batch:commit doc:name="Batch Commit" size="1000">
<expression-component doc:name="Expression"><![CDATA[StringBuilder sb=new StringBuilder();
for(String s: payload)
{
sb.append(s);
sb.append(System.lineSeparator());
}
payload= sb.toString();]]></expression-component>
<file:outbound-endpoint path="${file.outbound.location}"
responseTimeout="10000" doc:name="File" />
</batch:commit>
</batch:step>
</batch:process-records>
<batch:on-complete>
<logger
message="******************************************** Batch Report **************************************"
level="INFO" doc:name="Logger" />
</batch:on-complete>
</batch:job>
</flow>
Writing to a file from multiple threads at the same time is generally not safe. Instead, write your results to a queue such as ActiveMQ (or similar) and have another flow that reads from the queue and writes to the file. You can decide whether you want to start processing from the queue before or after you have finished processing the file.
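A rough sketch of that idea, assuming an ActiveMQ broker on localhost and a queue named transformed.records (both assumptions): inside the batch commit, replace the file:outbound-endpoint with a jms:outbound-endpoint pointing at the queue, and add a separate flow with a single consumer that appends each message to the output file:
<jms:activemq-connector name="Active_MQ" brokerURL="tcp://localhost:61616" numberOfConsumers="1" doc:name="Active MQ"/>
<file:connector name="appendingFileConnector" outputAppend="true" doc:name="File"/>
<!-- in place of the file:outbound-endpoint inside batch:commit -->
<jms:outbound-endpoint queue="transformed.records" connector-ref="Active_MQ" doc:name="JMS"/>
<flow name="write-output-file" processingStrategy="synchronous">
<jms:inbound-endpoint queue="transformed.records" connector-ref="Active_MQ" doc:name="JMS"/>
<file:outbound-endpoint path="${file.outbound.location}" outputPattern="output.csv" connector-ref="appendingFileConnector" responseTimeout="10000" doc:name="File"/>
</flow>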
How do I convert type=java.lang.String to type=java.lang.Iterable, given that the batch step (Process Records) expects a java.lang.Iterable? Note: the input is an XML file and the Mule flow is a batch job.
When the XML has only one 'Report_Entry' record, the error below is received. For multiple 'Report_Entry' entries the flow works fine.
<object-to-string-transformer doc:name="Object to String"/>
<logger message="#[payload]" level="INFO" doc:name="Logger"/>
<set-payload
value="#[xpath3('/*:Report_Data/*:Report_Entry', payload, 'NODESET')]" doc:name="Set Payload"/>
<logger message="XML Record - #[payload]" level="INFO" doc:name="Logger"/>
</batch:input>
<batch:process-records>
<batch:step name="Batch_Step1">
<json:object-to-json-transformer doc:name="Object to JSON"/>
<logger message="XML Record - #[payload]" level="INFO" doc:name="Logger"/>
<amqp:outbound-endpoint exchangeName="${amqp.exchangeName}" queueName="${amqp.queueName}" responseTimeout="10000" encoding="UTF-8" mimeType="application/xml" connector-ref="AMQP_Connector" doc:name="AMQP"/>
</batch:step>
</batch:process-records>
In the logger it prints 'org.mule.api.processor.LoggerMessageProcessor: XML Record - net.sf.saxon.dom.DOMNodeList#57d263b4' after the set-payload. Our requirement is to convert each XML record to JSON and write it to AMQP.
That's because of the splitter. If you just want a collection/iterable before the batch job, use set-payload:
<set-payload
value="#[xpath3('/*:Report_Data/*:Report_Entry', payload, 'NODESET')]" />
<batch:execute name="test" />
This should work regardless of the number of nodes.
I am converting from CSV to JSON using the Mule DataMapper. I want to check whether a required field is empty. If it is empty, I want to log that field and discard the row from further processing.
I know that in the script option we have if(input.data.length > 0).
But how do I discard the whole row if this check fails?
You can do this within the Mule DataMapper simply by encapsulating the whole conversion within the if statement's opening and closing braces, something like this:
if ( input.Quantity > 0 ) {
output.id = input.id;
output.Customer = input.Customer;
output.Quantity = input.Quantity;
output.Price = input.Price;
}
However, a different and perhaps better approach would be to let the DataMapper transform every row into JSON and then split and filter as separate steps in the flow:
<flow name="filterindatamapperFlow2" doc:name="filterindatamapperFlow2">
<file:inbound-endpoint path="/tmp/inbox" doc:name="Inbound file"/>
<data-mapper:transform config-ref="CSV_To_UnfilteredJSON" doc:name="CSV To Unfiltered JSON"/>
<request-reply>
<vm:outbound-endpoint path="splittandprocess" exchange-pattern="one-way"/>
<vm:inbound-endpoint path="result"/>
</request-reply>
<json:object-to-json-transformer doc:name="Object to JSON"/>
<file:outbound-endpoint path="/tmp/outbox" doc:name="Outbound file"/>
</flow>
<flow name="splittandprocess">
<vm:inbound-endpoint path="splittandprocess" exchange-pattern="one-way"/>
<json:json-to-object-transformer returnClass="java.util.List" doc:name="JSON to Object"/>
<splitter expression="#[payload]" doc:name="Splitter"/>
<json:json-to-object-transformer returnClass="java.util.Map" doc:name="JSON to Object"/>
<message-filter doc:name="Filter Out Orders With No Quantity" onUnaccepted="handleFilteredMessages">
<expression-filter expression="#[payload['Quantity'] > 0]" />
</message-filter>
<collection-aggregator failOnTimeout="false" timeout="1000"/>
<vm:outbound-endpoint path="result" exchange-pattern="one-way"/>
</flow>
<flow name="handleFilteredMessages">
<logger message="Payload filtered #[payload]" level="ERROR" doc:name="Logger"/>
</flow>