Mule: Batch Processing vs For-each

I have a scenario with a list of IDs; for each ID I need to fetch data from multiple APIs and aggregate it (this is the loading phase), and then write it to a DB. I know that we can use batch processing for writing to the DB, but what about the loading phase?

You should be able to use a foreach scope for this.
Your list of IDs will be in your payload before it reaches the foreach. You can use HTTP components set to request-response; this way all the data you need will be fetched before you reach your DB component for saving the data.

Fetching data from multiple APIs takes time and should be kept inside a batch step. For each record, after fetching the data, move it to a VM queue. In the On Complete phase, use a Mule Requester to fetch the details from the VM queue and insert them into the DB. Inserting into the DB is a single step and does not require batch processing.

You can use Scatter-Gather for each ID to fetch data from multiple APIs. Scatter-Gather sends a request message to multiple targets concurrently. Based on the responses, you can implement an aggregation strategy.
The same can be done using Mule batch as well.
Reference: https://docs.mulesoft.com/mule-user-guide/v/3.9/scatter-gather

Related

How to batch streaming inserts to BigQuery from a Beam job

I'm writing to BigQuery in a Beam job from an unbounded source, using STREAMING_INSERTS as the method. I was looking at how to throttle the rows to BigQuery based on the recommendations in
https://cloud.google.com/bigquery/quotas#streaming_inserts
The BigQueryIO.Write API doesn't provide a way to set the micro-batch size.
I was looking at using triggers, but I'm not sure whether BigQuery groups everything in a pane into a single request. I've set up the trigger as below:
    Window.<Long>into(new GlobalWindows())
        .triggering(
            Repeatedly.forever(
                AfterFirst.of(
                    AfterPane.elementCountAtLeast(5),
                    AfterProcessingTime.pastFirstElementInPane()
                        .plusDelayOf(Duration.standardMinutes(2)))))
        .discardingFiredPanes()
Q1. Does Beam support micro-batches, or does it create one request for each element in the PCollection?
Q2. Does the above trigger make sense? Even if I set the window/trigger, it could still be sending one request for every element.
I don't know what you mean by micro-batch. The way I see it, BigQuery supports loading data either in batches or via streaming.
Basically, batch loads are subject to quotas and streaming loads are a bit more expensive.
Once you set the insertion method for your BigQueryIO, the documentation states:
Note: If you use batch loads in a streaming pipeline, you must use withTriggeringFrequency to specify a triggering frequency.
I've never tried it, but withTriggeringFrequency seems to be what you need here.
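As a rough sketch of what that could look like, assuming rows is a PCollection<TableRow> from your unbounded source and tableSchema is its TableSchema (the table name, shard count and two-minute frequency below are placeholder values, not anything prescribed by Beam):

    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.joda.time.Duration;

    // Write via periodic batch load jobs instead of streaming inserts.
    rows.apply(
        BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.my_table")                   // placeholder table
            .withSchema(tableSchema)                                // placeholder schema
            .withMethod(BigQueryIO.Write.Method.FILE_LOADS)         // batch loads, not STREAMING_INSERTS
            .withTriggeringFrequency(Duration.standardMinutes(2))   // issue load jobs roughly every 2 minutes
            .withNumFileShards(1)                                   // required when a triggering frequency is set
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED));

Each triggering then produces a load job for the data accumulated since the previous one, rather than one request per element.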

Flink batching Sink

I'm trying to use Flink in both a streaming and a batch way to add a lot of data into Accumulo (a few million records a minute). I want to batch up records before sending them to Accumulo.
I ingest data either from a directory or via Kafka, convert the data using a flatMap and then pass it to a RichSinkFunction, which adds the data to a collection.
With the streaming data, batching seems OK, in that I can add the records to a fixed-size collection which gets sent to Accumulo once the batch threshold is reached. But for the batch data, which is finite, I'm struggling to find a good approach to batching, as it would require a flush timeout in case there is no further data within a specified time.
There doesn't seem to be an Accumulo connector, unlike for Elasticsearch or other alternative sinks.
I thought about using a ProcessFunction with a trigger for batch size and time interval, but this requires a keyed window. I didn't want to go down the keyed route as the data looks to be very skewed, in that some keys would have a tonne of records and some would have very few. If I don't use a windowed approach, then I understand that the operator won't be parallel. I was hoping to batch lazily, so each sink only cares about counts or an interval of time.
Has anybody got any pointers on how best to address this?
You can access timers in a sink by implementing ProcessingTimeCallback. For an example, look at the BucketingSink -- its open and onProcessingTime methods should get you started.
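A minimal sketch of that pattern, modelled on BucketingSink (the class name, batch size and flush interval are placeholders, the flush() body is left for your Accumulo BatchWriter logic, and the imports match the Flink releases that still ship BucketingSink):

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
    import org.apache.flink.streaming.api.operators.StreamingRuntimeContext;
    import org.apache.flink.streaming.runtime.tasks.ProcessingTimeCallback;
    import org.apache.flink.streaming.runtime.tasks.ProcessingTimeService;

    import java.util.ArrayList;
    import java.util.List;

    public class BatchingAccumuloSink<T> extends RichSinkFunction<T>
            implements ProcessingTimeCallback {

        private static final int MAX_BATCH_SIZE = 10_000;      // placeholder threshold
        private static final long FLUSH_INTERVAL_MS = 60_000L; // placeholder flush timeout

        private transient List<T> buffer;
        private transient ProcessingTimeService timerService;

        @Override
        public void open(Configuration parameters) {
            buffer = new ArrayList<>();
            // Same trick BucketingSink uses to get a timer service inside a sink function.
            timerService = ((StreamingRuntimeContext) getRuntimeContext()).getProcessingTimeService();
            timerService.registerTimer(timerService.getCurrentProcessingTime() + FLUSH_INTERVAL_MS, this);
        }

        @Override
        public void invoke(T value, Context context) {
            buffer.add(value);
            if (buffer.size() >= MAX_BATCH_SIZE) {
                flush(); // size-based flush
            }
        }

        @Override
        public void onProcessingTime(long timestamp) {
            flush(); // time-based flush, so a finite input never leaves records stranded
            timerService.registerTimer(timerService.getCurrentProcessingTime() + FLUSH_INTERVAL_MS, this);
        }

        private void flush() {
            if (!buffer.isEmpty()) {
                // Placeholder: hand the buffered records to an Accumulo BatchWriter here.
                buffer.clear();
            }
        }

        @Override
        public void close() {
            flush(); // push out whatever is left when the job finishes
        }
    }

Each parallel sink instance keeps its own buffer and timer, so there is no need for a keyed window.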

Get/Set BigQuery Job ID while doing BigQueryIO.write()

Is it possible to set the BigQuery job ID, or to get it while the batch pipeline is running?
I know it's possible using the BigQuery API, but is it possible if I'm using BigQueryIO from Apache Beam?
I need to send an acknowledgement after writing to BigQuery that the load is complete.
Currently this is not possible. It is complicated by the fact that a single BigQueryIO.write() may use many BigQuery jobs under the hood (i.e. BigQueryIO.write() is a general-purpose API for writing data to BigQuery, rather than an API for working with a single specific BigQuery load job), e.g.:
If the amount of data to be loaded is larger than the BigQuery limits for a single load job, BigQueryIO.write() will shard it into multiple load jobs.
If you are using one of the destination-dependent write methods (e.g. DynamicDestinations) and are loading into multiple tables at the same time, there will be at least one load job per table.
If you are writing an unbounded PCollection using the BATCH_LOADS method, it will periodically issue load jobs for newly arrived data, subject to the notes above.
If you are using the STREAMING_INSERTS method (which is allowed even when writing a bounded PCollection), there will be no load jobs at all.
You will need to use one of the typical workarounds for "doing something after something else is done": e.g. wait until the entire pipeline is done using pipeline.run().waitUntilFinish() in your main program, and then perform your second action.
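For example, a minimal sketch of that workaround (sendAcknowledgement() is a hypothetical stand-in for however you notify the downstream system):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.PipelineResult;

    public class RunAndAcknowledge {
        public static void main(String[] args) {
            Pipeline pipeline = Pipeline.create();
            // ... build the pipeline, including the BigQueryIO.write() step ...

            PipelineResult result = pipeline.run();
            result.waitUntilFinish(); // blocks until all underlying BigQuery jobs have completed

            if (result.getState() == PipelineResult.State.DONE) {
                sendAcknowledgement(); // hypothetical helper
            }
        }

        private static void sendAcknowledgement() {
            System.out.println("BigQuery load complete");
        }
    }

This ties the acknowledgement to the whole pipeline finishing rather than to any individual load job, which is the granularity BigQueryIO.write() actually exposes.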

Using NServiceBus how can I process multiple messages as a batch?

We currently have a process that involves sending a third party an XML file containing changes that have occurred within our system.
We are moving to use NServiceBus and the changes are modelled as individual commands sent to an endpoint.
We do not want to send these changes as individual files; instead we want to batch receive a number of commands, concatenating this information into a single file.
How might one go about batching multiple commands into a single export file?
Have you looked at Sagas? Sagas allow you to model long-running business processes. So if a file has multiple parts, the Saga could begin when it gets the first part, and complete when it has all the parts it needs.
http://cdn.nservicebus.com/sagas.aspx
You can send the commands in using Bus.Send(IMessage[] messages). Note that it takes an array, and the messages will be packed into one queue message over the wire. On the receiving side, the handler will be invoked once per message. In your handler you should be able to just keep appending to your file. In the handler you could also place logic to determine when to "roll" the file if necessary.

How to export a huge data from a database to a file using mule?

I have an HTTP endpoint; on each request to this endpoint, I need to create a huge database dump (more than a million records) and generate a single XML file. I am planning to write a component that will query with pagination and then write the results to a file. I am new to Mule. Can I stream the data from the component into a file connector? If yes, how do I do it?
An HTTP endpoint will stream by default unless it's an anonymous POST. You can do both operations if you use an All processor. For the XML you can use DataMapper (with streaming enabled), and for the JDBC part you can just send the payload to the JDBC outbound endpoint; it will do batched inserting if the payload is of type List.
Generally, for huge sets of data and bulk operations there is the Batch module in Mule.
It processes messages in batches and is extremely useful for handling large data sets or bulk DB operations. You can consider using a batch job.
Ref: https://developer.mulesoft.com/docs/display/current/Batch+Processing