My current scenario:
I have 10000 records as input to batch.
As per my understanding, batch is meant for record-by-record processing. Hence, I am transforming each record using a DataWeave component inside a batch step (note: I have not used any batch commit) and writing each record to a file. The reason for record-by-record processing is that if a particular record contains invalid data, only that record fails; the rest are processed fine.
But in many blogs I see a batch commit (with streaming) used with a DataWeave component. As per my understanding, all the records will then be given to DataWeave in one shot, and if one record has invalid data, all 10000 records will fail (at DataWeave). Then the point of record-by-record processing is lost.
Is the above assumption correct, or am I thinking the wrong way?
That is the reason I am not using batch commit.
Now, as I said, I am sending each record to a file. Actually, I have a requirement to send each record to 5 different CSV files. So, currently I am using a Scatter-Gather component inside my batch step to send it to five different routes.
As you can see in the image, the input phase gives a collection of 10000 records. Each record will be sent to the 5 routes using Scatter-Gather.
Is the approach I am using fine, or is there a better design that can be followed?
Also, I have created a 2nd batch step to capture ONLY FAILED RECORDS. But with the current design, I am not able to capture the failed records.
SHORT ANSWERS
Is the above assumption correct or am I thinking the wrong way?
In short, yes, you are thinking the wrong way. Read my long explanation with an example to understand why; I hope you will appreciate it.
Also, I have created a 2nd batch step to capture ONLY FAILED RECORDS. But with the current design, I am not able to capture failed records.
You probably forgot to set max-failed-records="-1" (unlimited) on the batch job. The default is 0: on the first failed record the batch will stop and not execute subsequent steps.
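For reference, a minimal sketch of that setting on the job definition (the job name here is illustrative):

```xml
<!-- -1 means: keep executing subsequent steps no matter how many records fail -->
<batch:job name="batch_myJob" max-failed-records="-1">
    <!-- input, process-records and on-complete phases go here -->
</batch:job>
```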
Is the approach I am using fine, or is there a better design that can be followed?
I think it makes sense if performance is essential for you and you can't cope with the overhead created by doing this operation in sequence.
If instead you can slow down a bit, it could make sense to do this operation in 5 different batch steps. You will lose parallelism, but you gain better control over failing records, especially if using batch commit.
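A sketch of that alternative (Mule 3 syntax; the step names, paths, and file names are my own placeholders): each of the 5 steps commits the accumulated records to one CSV target, so a record that fails in one step is simply skipped by the later ones under the default NO_FAILURES accept policy.

```xml
<!-- one batch step per target CSV; paths and names are illustrative -->
<batch:step name="Write_CSV1">
    <batch:commit size="100" doc:name="Batch Commit">
        <file:outbound-endpoint path="/out/csv1" outputPattern="records1.csv" doc:name="File"/>
    </batch:commit>
</batch:step>
<batch:step name="Write_CSV2">
    <batch:commit size="100" doc:name="Batch Commit">
        <file:outbound-endpoint path="/out/csv2" outputPattern="records2.csv" doc:name="File"/>
    </batch:commit>
</batch:step>
<!-- ...and likewise Write_CSV3 through Write_CSV5 -->
```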
MULE BATCH JOB IN PRACTICE
I think the best way to explain how it works is through an example.
Take into consideration the following case:
You have a batch processing configured with max-failed-records = "-1" (no limit).
<batch:job name="batch_testBatch" max-failed-records="-1">
In this process we input a collection composed of 6 strings.
<batch:input>
<set-payload value="#[['record1','record2','record3','record4','record5','record6']]" doc:name="Set Payload"/>
</batch:input>
The processing is composed of 3 steps:
The first step just logs each record, while the second step logs and then throws an exception on record3 to simulate a failure.
<batch:step name="Batch_Step">
<logger message="-- processing #[payload] in step 1 --" level="INFO" doc:name="Logger"/>
</batch:step>
<batch:step name="Batch_Step2">
<logger message="-- processing #[payload] in step 2 --" level="INFO" doc:name="Logger"/>
<scripting:transformer doc:name="Groovy">
<scripting:script engine="Groovy"><![CDATA[
if(payload=="record3"){
throw new java.lang.Exception();
}
payload;
]]>
</scripting:script>
</scripting:transformer>
</batch:step>
The third step will instead contain just the commit, with a commit size of 2.
<batch:step name="Batch_Step3">
<batch:commit size="2" doc:name="Batch Commit">
<logger message="-- committing #[payload] --" level="INFO" doc:name="Logger"/>
</batch:commit>
</batch:step>
Now you can follow me through the execution of this batch processing:
On start, all 6 records will be processed by the first step, and the logging in the console will look like this:
-- processing record1 in step 1 --
-- processing record2 in step 1 --
-- processing record3 in step 1 --
-- processing record4 in step 1 --
-- processing record5 in step 1 --
-- processing record6 in step 1 --
Step Batch_Step finished processing all records for instance d8660590-ca74-11e5-ab57-6cd020524153 of job batch_testBatch
Now things get more interesting in step 2: record3 will fail because we explicitly throw an exception, but despite this the step will continue processing the other records. Here is how the log would look:
-- processing record1 in step 2 --
-- processing record2 in step 2 --
-- processing record3 in step 2 --
com.mulesoft.module.batch.DefaultBatchStep: Found exception processing record on step ...
Stacktrace
....
-- processing record4 in step 2 --
-- processing record5 in step 2 --
-- processing record6 in step 2 --
Step Batch_Step2 finished processing all records for instance d8660590-ca74-11e5-ab57-6cd020524153 of job batch_testBatch
At this point, despite the failed record in this step, batch processing will continue because the parameter max-failed-records is set to -1 (unlimited) and not to the default value of 0.
All the successful records will then be passed to step 3. This is because, by default, the accept-policy parameter of a step is set to NO_FAILURES. (Other possible values are ALL and ONLY_FAILURES.)
Now step 3, which contains the commit with a size of 2, will commit the records two by two:
-- committing [record1, record2] --
-- committing [record4, record5] --
Step: Step Batch_Step3 finished processing all records for instance d8660590-ca74-11e5-ab57-6cd020524153 of job batch_testBatch
-- committing [record6] --
As you can see, this confirms that record3, which failed, was not passed to the next step and therefore was not committed.
Starting from this example, I think you can imagine and test more complex scenarios. For example, after the commit you could have another step that processes only the failed records, to make an administrator aware of the failures by email.
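For instance, a failure-notification step might look like this (the SMTP endpoint details are placeholders, not part of the example above):

```xml
<!-- accept-policy ONLY_FAILURES: this step sees just the records that failed earlier -->
<batch:step name="Notify_Failures" accept-policy="ONLY_FAILURES">
    <logger message="-- record #[payload] failed --" level="WARN" doc:name="Logger"/>
    <smtp:outbound-endpoint host="smtp.example.com" to="admin@example.com"
                            subject="Batch record failure" doc:name="SMTP"/>
</batch:step>
```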
You can also use external storage to keep more advanced info about your records, as you can read in my answer to this other question.
Hope this helps.
Related
I have a flow which submits around 10-20 Salesforce bulk query job details to Anypoint MQ to be processed asynchronously.
I am using a normal queue, not a FIFO queue, and I want to process one message at a time.
My subscriber configuration is given below. I am setting this whopping ack timeout of 15 minutes because that is the longest it has taken for a job to change status from jobUpload to JobCompleted.
MuleRuntime: 4.4
MQ Connector Version: 3.2.0
<anypoint-mq:subscriber doc:name="Subscribering Bulk Query Job Details"
config-ref="Anypoint_MQ_Config"
destination="${anyPointMq.name}"
acknowledgementTimeout="15"
acknowledgementTimeoutUnit="MINUTES">
<anypoint-mq:subscriber-type >
<anypoint-mq:prefetch maxLocalMessages="1" />
</anypoint-mq:subscriber-type>
</anypoint-mq:subscriber>
Anypoint MQ Connector Configuration
<anypoint-mq:config name="Anypoint_MQ_Config" doc:name="Anypoint MQ Config" doc:id="ce3aaed9-dcba-41bc-8c68-037c5b1420e2">
<anypoint-mq:connection clientId="${secure::anyPointMq.clientId}" clientSecret="${secure::anyPointMq.clientSecret}" url="${anyPointMq.url}">
<reconnection>
<reconnect frequency="3000" count="3" />
</reconnection>
<anypoint-mq:tcp-client-socket-properties connectionTimeout="30000" />
</anypoint-mq:connection>
</anypoint-mq:config>
Subscriber flow
<flow name="sfdc-bulk-query-job-subscription" doc:id="7e1e23d0-d7f1-45ed-a609-0fb35dd23e6a" maxConcurrency="1">
<anypoint-mq:subscriber doc:name="Subscribering Bulk Query Job Details" doc:id="98b8b25e-3141-4bd7-a9ab-86548902196a" config-ref="Anypoint_MQ_Config" destination="${anyPointMq.sfPartnerEds.name}" acknowledgementTimeout="${anyPointMq.ackTimeout}" acknowledgementTimeoutUnit="MINUTES">
<anypoint-mq:subscriber-type >
<anypoint-mq:prefetch maxLocalMessages="${anyPointMq.prefecth.maxLocalMsg}" />
</anypoint-mq:subscriber-type>
</anypoint-mq:subscriber>
<json-logger:logger doc:name="INFO - Bulk Job Details have been fetched" doc:id="b25c3850-8185-42be-a293-659ebff546d7" config-ref="JSON_Logger_Config" message='#["Bulk Job Details have been fetched for " ++ payload.object default ""]'>
<json-logger:content ><![CDATA[#[output application/json ---
payload]]]></json-logger:content>
</json-logger:logger>
<set-variable value="#[p('serviceName.sfdcToEds')]" doc:name="ServiceName" doc:id="f1ece944-0ed8-4c0e-94f2-3152956a2736" variableName="ServiceName"/>
<set-variable value="#[payload.object]" doc:name="sfObject" doc:id="2857c8d9-fe8d-46fa-8774-0eed91e3a3a6" variableName="sfObject" />
<set-variable value="#[message.attributes.properties.key]" doc:name="key" doc:id="57028932-04ab-44c0-bd15-befc850946ec" variableName="key" />
<flow-ref doc:name="bulk-job-status-check" doc:id="c6b9cd40-4674-47b8-afaa-0f789ccff657" name="bulk-job-status-check" />
<json-logger:logger doc:name="INFO - subscribed bulk job id has been processed successfully" doc:id="7e469f92-2aff-4bf4-84d0-76577d44479a" config-ref="JSON_Logger_Config" message='#["subscribed bulk job id has been processed successfully for salesforce " ++ vars.sfObject default "" ++ " object"]' tracePoint="END"/>
</flow>
After the bulk query job subscriber, I check the status of the job 5 times, with an interval of 1 minute, inside an Until Successful scope. It generally exhausts all 5 attempts, the message is subscribed again, and the same process repeats until the job completes. I have seen the Until Successful scope get exhausted more than once for a single job.
Once the job's status changes to jobComplete, I fetch the result and send it to an AWS S3 bucket via a MuleSoft system API. Here also I use retry logic, because due to the large volume of data I always get this message on the first call:
HTTP POST on resource 'https://****//dlb.lb.anypointdns.net:443/api/sys/aws/s3/databricks/object' failed: Remotely closed.
But on the second retry it gets a successful response from the S3 bucket system API.
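The retry wrapper around that call looks roughly like this (the flow-ref name and the values are illustrative):

```xml
<!-- retry the S3 system API call up to 2 more times, 30 seconds apart -->
<until-successful maxRetries="2" millisBetweenRetries="30000">
    <flow-ref name="send-result-to-s3-sys-api"/>
</until-successful>
```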
Now the main problem:
Though I am using a normal queue, I have noticed messages remain in flight for an infinite amount of time and never get picked up by the Mule flow/subscriber. The screenshot below shows an example: there were 7 messages in flight that were not picked up even after many days.
I have kept maxConcurrency and the prefetch maxLocalMessages at 1, yet more than 1 message is being taken out of the queue. Please help me understand this.
I am new to MuleSoft and have a question regarding batch processing: if a batch process crashes in between, and some records were already processed, what happens when the batch processing starts again? Duplicate data?
It will try to continue processing the pending data in the batch queues, unless it was corrupted by the crash.
The answer depends upon a few things. First, unless you configure the batch job scope to allow record-level failures, the entire job will stop when a record fails. Second, if you do configure it to allow failures, then the batch will continue to process all records. In that case, each batch step can be configured to accept only successful records (the default), only failed records, or all records.
So the answer to your question depends upon configuration.
As far as duplicate data goes, that part is entirely up to you.
If you have the job stop on failure, then when you restart it, the set of records you provide at that time will be the ones processed. If you submit records that have been processed once before, they will be processed again. You can filter either upon re-entry to the batch job, or as the records are successfully processed.
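One way to sketch that filtering on re-entry (Mule 4; this assumes each record carries a unique id field): an Idempotent Message Validator fails repeated records with a DUPLICATE_MESSAGE error, and the default NO_FAILURES accept policy then keeps them out of the later steps.

```xml
<!-- records whose id was already seen fail here and are skipped by subsequent steps -->
<batch:step name="Dedupe_Step">
    <idempotent-message-validator idExpression="#[payload.id]"/>
</batch:step>
```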
Before answering your question I would like to know a couple of things.
a. What do you mean by "batch crashes"? Are you saying that during the batch processing there is a JVM hit and the batch is started again? Or that there is a failure while processing some records of the batch?
a.1 --> If there is a JVM hit, then the entire program halts. You need to restart the program, which results in processing the same set of records again.
a.2 --> To handle failed records inside the batch, you can do the three steps below.
First, set the batch job to continue irrespective of any error:
<batch:job jobName="Batch1" maxFailedRecords="-1">
In the batch job, create 3 batch steps:
a. processRecord: processes every record in the batch queue.
b. ProcessSuccessful: if there was no exception in a., the record goes to batch step b.
c. processFailure: if there was an exception in a., the record goes to batch step c.
The entire sample code is shown below:
<batch:job jobName="batchJob" maxFailedRecords="-1">
    <batch:process-records>
        <batch:step name="processRecord" acceptPolicy="ALL">
            <!-- processes every record that enters the job -->
            <logger level="INFO" message="process any record that comes in the step"/>
        </batch:step>
        <batch:step name="ProcessSuccessful" acceptPolicy="NO_FAILURES">
            <!-- sees only records that have not failed so far -->
            <logger level="INFO" message="process only SUCCESSFUL records"/>
        </batch:step>
        <batch:step name="processFailure" acceptPolicy="ONLY_FAILURES">
            <!-- sees only records that failed in an earlier step -->
            <logger level="INFO" message="process only FAILED records"/>
        </batch:step>
    </batch:process-records>
    <batch:on-complete>
        <!-- payload here is a BatchJobResult with success/failure counts -->
        <logger level="INFO" message="#['on-complete: successful=' ++ (payload.successfulRecords as String) ++ ', failed=' ++ (payload.failedRecords as String)]"/>
    </batch:on-complete>
</batch:job>
N.B.: the code above is as per Mule 4.
I'm using a Mule batch flow to process files. As per the requirement, I should stop the batch step from processing further after 10 failures.
So I've configured max-failed-records="10", but I still see around 99 failures in the logger I keep in the On Complete phase. The file the app receives has around 8657 rows, so the loaded records will be 8657 records.
Logger in complete phase:
<logger message="#['Failed Records'+payload.failedRecords]" level="INFO" doc:name="Logger"/>
Below image is my flow:
It's the default behavior of Mule. As per the Batch documentation, Mule loads 1600 records at once (16 threads x 100 records per block). Though max failures is set to 10, it will process all the loaded records; it just won't load the next record blocks once the max failure limit is reached.
Hope this helps.
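If you need the cap to take effect over fewer in-flight records, one option on Mule 3.8+ (job name illustrative) is to shrink the block size, so fewer records are already dispatched when the limit is hit:

```xml
<!-- smaller blocks: fewer records in flight when the 10-failure cap is reached -->
<batch:job name="fileBatch" max-failed-records="10" block-size="10">
    <!-- phases omitted -->
</batch:job>
```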
I am using parallel processing.
CALL FUNCTION 'ZABC' STARTING NEW TASK taskname
DESTINATION IN GROUP srv_grp PERFORMING come_back ON END OF TASK
EXPORTING
...
EXCEPTIONS
...
.
I am calling this FM inside a loop. Sometimes my records are skipped and I do not get the desired output: sometimes 2000 records are processed and sometimes 1000; the numbers vary. What can be the problem? Can you provide some cases where records can be skipped in parallel processing?
See the example code SAP provides: https://help.sap.com/viewer/753088fc00704d0a80e7fbd6803c8adb/7.4.16/en-US/4892082ffeb35ed2e10000000a42189d.html
Look at how they handle the resource exceptions. A resource exception means that you tried to start a new aRFC but there were no more processes available. Your program has to handle these cases; without handling them, the entries will be skipped. The normal handling is to just wait a little for some of the active processes to finish, as in the example program:
WAIT UNTIL rcv_jobs >= snd_jobs
UP TO 5 SECONDS.
gv_semaphore = 0.
" your internal table size.
DESCRIBE TABLE lt_itab LINES lv_lines.
LOOP AT lt_itab INTO ls_itab.
" parallel proc. -->>
CALL FUNCTION 'ZABC' STARTING NEW TASK taskname
DESTINATION IN GROUP srv_grp PERFORMING come_back ON END OF TASK
EXPORTING
...
EXCEPTIONS
...
.
" <<--
ENDLOOP.
" wait until all parallel processes have terminated. **
WAIT UNTIL gv_semaphore = lv_lines.
NOTE: You should check the total parallel process count; there should be some upper limit on the number of open tasks.
Thanks.
I have a loop that runs three flow references in order. At least that is the plan. Run in the debugger, processing takes place in the following unexpected order:
the first flow-ref (A)
the second flow-ref (B)
the first component of flow A
the third flow-ref (C)
the first component of flow B
the second component of the flow A
the first component of flow C
the second component of flow B
the third component of flow A
...now things blow up (in the 1st component of flow C), since the payload is not what that flow expects
I changed the processing strategy from the implicit default to 'synchronous' with no noticeable change.
What is going on?
<flow name="Loop_until_successfull" doc:name="Loop_until_successfull" processingStrategy="synchronous">
<flow-ref name="A" doc:name="Go to A"></flow-ref>
<flow-ref name="B" doc:name="Go to B"></flow-ref>
<flow-ref name="C" doc:name="Go to C"></flow-ref>
</flow>
Changing the "Loop_until_successful" flow to synchronous only assures that calls to "Loop_until_successful" itself are processed synchronously, not any other flow called by it. You need to change each of the flows called by "Loop_until_successful" to synchronous as well, to ensure you get the response back from each call-out before you make the call to the next flow. If you do this, then Loop_until_successful (I'll call it L.U.S. from now on) calls A, waits for a response, then calls B, waits for a response, then calls C. The way it is configured now, L.U.S. calls A and then moves immediately on to B using the payload it already has, rather than waiting for the response from A.
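A sketch of that change (flow contents elided):

```xml
<!-- each called flow is synchronous, so the caller waits for its response -->
<flow name="A" processingStrategy="synchronous">
    <!-- components of A -->
</flow>
<flow name="B" processingStrategy="synchronous">
    <!-- components of B -->
</flow>
<flow name="C" processingStrategy="synchronous">
    <!-- components of C -->
</flow>
```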