How to export huge data from a database to a file using Mule?

I have an HTTP endpoint; on a request to this endpoint, I need to create a dump of a huge database (more than a million records) and generate a single XML file. I am planning to write a component which will query with pagination and then write the results into a file. I am new to Mule. Can I stream the data from the component into a file connector? If yes, how do I do it?

An HTTP endpoint will stream by default unless it's an anonymous POST. You can do both operations if you use an All processor. For the XML you can use DataMapper (with streaming enabled), and for the database you can just send the payload to the JDBC outbound endpoint; it will do batched inserting if the payload is of type List.

Generally, for huge sets of data and bulk operations, there is the Batch module in Mule.
It processes messages in batches and is extremely useful for handling large sets of data or bulk DB operations. You can consider using a batch job.
Ref: https://developer.mulesoft.com/docs/display/current/Batch+Processing
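If you do go with the custom Java component described in the question, a minimal sketch of paginated JDBC reads streamed into a single XML file could look like the following. The JDBC URL, table, columns, page size and LIMIT/OFFSET pagination syntax below are assumptions to adapt to your database:

```java
import java.io.FileOutputStream;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

public class PagedXmlExport {

    public static void export(String jdbcUrl, String user, String password, String filePath) throws Exception {
        final int pageSize = 5000; // tune to your memory budget
        long offset = 0;
        try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password);
             FileOutputStream out = new FileOutputStream(filePath)) {
            XMLStreamWriter xml = XMLOutputFactory.newInstance().createXMLStreamWriter(out, "UTF-8");
            xml.writeStartDocument("UTF-8", "1.0");
            xml.writeStartElement("records");
            boolean more = true;
            while (more) {
                // LIMIT/OFFSET pagination; the exact syntax depends on your database.
                try (PreparedStatement ps = conn.prepareStatement(
                        "SELECT id, name FROM big_table ORDER BY id LIMIT ? OFFSET ?")) {
                    ps.setInt(1, pageSize);
                    ps.setLong(2, offset);
                    try (ResultSet rs = ps.executeQuery()) {
                        more = false;
                        while (rs.next()) {
                            more = true;
                            xml.writeStartElement("record");
                            xml.writeStartElement("id");
                            xml.writeCharacters(rs.getString("id"));
                            xml.writeEndElement();
                            xml.writeStartElement("name");
                            xml.writeCharacters(rs.getString("name"));
                            xml.writeEndElement();
                            xml.writeEndElement(); // record
                        }
                    }
                }
                offset += pageSize;
            }
            xml.writeEndElement(); // records
            xml.writeEndDocument();
            xml.close();
        }
    }
}
```

Because the XMLStreamWriter writes each page to the file as it is read, only one page of records is in memory at a time regardless of the total row count.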

Related

Sending avro schema along with payload

I want to implement an Avro serializer/deserializer for a Kafka producer/consumer. There can be multiple scenarios:
1. Writer schema and reader schema are the same and will never change. In this scenario there is no need to send the Avro schema along with the payload; at the consumer we can use the reader schema itself to deserialize the payload. A sample implementation is provided in this post.
2. Using the schema resolution feature when the schema will evolve over time. Avro can still deserialize with different reader and writer schemas using schema resolution rules, so we need to send the Avro schema along with the payload.
My question: how do I send the schema as well while producing, so that the deserializer reads the whole byte array and separates out the actual payload and the schema? I am using an Avro-generated class. Note, I don't want to use a schema registry.
You need a reader and a writer schema in any Avro use case, even if they are the same. SpecificDatumWriter (for the serializer) and SpecificDatumReader (for the deserializer) both take a schema.
You could use Kafka record headers to encode the AVSC string and send it along with the payload, but keep in mind that Kafka records/batches have an upper bound on allowed size. Using a schema registry (it doesn't have to be Confluent's) reduces the overhead from a whole schema string to a simple integer ID.
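For illustration, a minimal sketch of the header-based approach, assuming an Avro-generated class named User and an arbitrary header name avro.schema (both are placeholders for whatever your project uses):

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

import org.apache.avro.Schema;
import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.specific.SpecificDatumReader;
import org.apache.avro.specific.SpecificDatumWriter;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.header.Header;

public class AvroWithSchemaHeader {

    // Arbitrary header name; use whatever both producer and consumer agree on.
    private static final String SCHEMA_HEADER = "avro.schema";

    // Producer side: serialize the record and attach its writer schema as a header.
    public static ProducerRecord<String, byte[]> toRecord(String topic, String key, User user) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new SpecificDatumWriter<>(User.class).write(user, encoder);
        encoder.flush();

        ProducerRecord<String, byte[]> record = new ProducerRecord<>(topic, key, out.toByteArray());
        record.headers().add(SCHEMA_HEADER, user.getSchema().toString().getBytes(StandardCharsets.UTF_8));
        return record;
    }

    // Consumer side: parse the writer schema from the header and resolve it against the local reader schema.
    public static User fromRecord(ConsumerRecord<String, byte[]> record) throws Exception {
        Header header = record.headers().lastHeader(SCHEMA_HEADER);
        Schema writerSchema = new Schema.Parser().parse(new String(header.value(), StandardCharsets.UTF_8));
        SpecificDatumReader<User> reader = new SpecificDatumReader<>(writerSchema, User.getClassSchema());
        BinaryDecoder decoder = DecoderFactory.get().binaryDecoder(record.value(), null);
        return reader.read(null, decoder);
    }
}
```

Since the consumer has both the writer schema (from the header) and its own reader schema, Avro's normal schema resolution rules apply during the read, and the payload bytes stay separate from the schema.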

How does file reading (streaming) really work in Mule?

I am trying to understand how streaming works in Mule 4.4.
I am reading a large file and am using 'Repeatable file store stream' as the streaming strategy, with 'In memory size' = 128 KB.
The file is 24 MB, and for the sake of argument let's say 1000 records are equivalent to 128 KB, so about 1000 records will be stored in memory and the rest will be written to the file store by Mule.
Here's the flow:
At stage #1 we are reading a file.
At stage #2 we are logging the payload - so I am assuming initially 128 KB worth of data is logged, and internally Mule will move the rest of the data from the file store into memory so it can be written to the log as well.
Question: does the heap memory increase from 128 KB to 24 MB? I am assuming no, but I need confirmation.
At stage #3 we are using a transform script to create a JSON payload.
So what happens here:
Is the JSON payload now all in memory (say 24 MB)? What has happened to the stream?
I am really struggling to understand how the stream is beneficial if during transformation the data is stored in memory.
Thanks
It really depends on how each component works, but usually logging means loading the full payload into memory. Having said that, logging 'big' payloads is considered a bad practice and you should avoid doing it in the first place. Even a few KBs per log entry is really not a good idea; logs are not intended to be used that way. Using logs, like any computational operation, has a cost in processing and resource usage. I have seen people cause out-of-memory errors or performance issues several times because of excessive logging.
The case of the Transform component is different. In some cases it is able to benefit from streaming, depending on the format used and the script. Sequential access to the records is required for streaming to work. If you try an indexed access to the 24 MB payload (for example payload[14203]) it will probably load the entire payload into memory. Also, referencing the payload more than once in a step may fail: streamed records are consumed after being read, so it is not possible to use them twice.
Streaming for DataWeave needs to be enabled (it is not the default) by using the reader property streaming=true.
You can find more details in the documentation for DataWeave Streaming and Mule Streaming.

Can we create queue data consumers?

Consider that I have RabbitMQ or Amazon SQS from which I have to consume data and validate it against the data in a DB.
Is it possible to write a consumer using Karate which simply consumes data from a queue, stores it, and validates it against the data in the DB?
Yes, using Java interop you can do anything. Refer to this example: https://github.com/intuit/karate/tree/master/karate-netty#consumer-provider-example
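As a rough illustration of the Java-interop approach, a small helper class along these lines (RabbitMQ client shown; the host and queue name are assumptions) can be instantiated from a feature file via Java.type(...) and its result matched against your DB query:

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

import com.rabbitmq.client.Channel;
import com.rabbitmq.client.Connection;
import com.rabbitmq.client.ConnectionFactory;
import com.rabbitmq.client.DeliverCallback;

public class QueueConsumer {

    private final BlockingQueue<String> messages = new LinkedBlockingQueue<>();
    private final Connection connection;
    private final Channel channel;

    // Starts consuming from the given queue and buffers message bodies in memory.
    public QueueConsumer(String host, String queueName) throws Exception {
        ConnectionFactory factory = new ConnectionFactory();
        factory.setHost(host);
        connection = factory.newConnection();
        channel = connection.createChannel();
        DeliverCallback onMessage = (consumerTag, delivery) ->
                messages.add(new String(delivery.getBody(), StandardCharsets.UTF_8));
        channel.basicConsume(queueName, true, onMessage, consumerTag -> { });
    }

    // Karate can call this to block until the next message arrives (returns null on timeout).
    public String waitForMessage(int timeoutSeconds) throws Exception {
        return messages.poll(timeoutSeconds, TimeUnit.SECONDS);
    }

    public void close() throws Exception {
        channel.close();
        connection.close();
    }
}
```

In the feature file you would then obtain the class with Java.type, construct it with new, call waitForMessage(10) in a def step, and match the parsed message against the row returned by your DB helper.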

Mule: Batch Processing vs For-each

I have a scenario where I have a list of IDs; for each ID I fetch data from multiple APIs and aggregate it (this is the loading phase), and then write it to a DB. I know that we can use batch processing for writing to the DB, but what about the loading phase?
You should be able to use a foreach scope for this.
Your list of IDs will be in your payload before it reaches the foreach. You can use HTTP components set to request-response; this way all the data you need will be fetched before you reach your DB component for saving the data.
Fetching data from multiple APIs is something that takes time and should be kept inside a batch step. For each record, after fetching the data, move it to a VM queue. In the On Complete phase, use a Mule Requester to fetch the details from the VM queue and insert them into the DB. Inserting into the DB is a single step and does not require batch processing.
You can use Scatter-Gather for each ID to fetch data from multiple APIs. Scatter-Gather sends a request message to multiple targets concurrently. Based on the responses you can implement an aggregation strategy.
Something similar can be done using a Mule batch job as well.
Reference: https://docs.mulesoft.com/mule-user-guide/v/3.9/scatter-gather

JMeter - Launch Several SQL Requests

Context:
I want to load test the database, but the use case is a bit larger than a single request. For instance, with the GUI we can launch a payment. This means that our software will analyse each operation running on the markets, close the operation, calculate an amount, and send and log the payment for accountancy. Each operation has a backup in the database (for recovery, security and accountancy).
Objective:
It's a long use case, but each operation is very short. I created a JDBC driver that logs each SQL operation, so I have a list of around 2000 operations that I want to replay to measure execution time.
JMeter:
I use JMeter to test a single request. I can set up 2 or 3 requests, but I want to test a sequence of requests as explained above.
You can put your SQL statements in a CSV file.
Then use a CSV Data Set Config that references this file and exposes each line as the variable sqlQuery.
Then in the JDBC Request use ${sqlQuery}.
If you want the response time for the whole list of SQLs, use a Transaction Controller as the parent of all the JDBC Requests.
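If you want to sanity-check the same replay outside of JMeter, the equivalent logic in plain JDBC is simply: read one SQL statement per line from the CSV, execute them in sequence, and time the whole run. The file name, JDBC URL and credentials below are assumptions:

```java
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.util.List;

public class SqlReplay {

    public static void main(String[] args) throws Exception {
        // One SQL statement per line, the same file you would feed to the CSV Data Set Config.
        List<String> queries = Files.readAllLines(Paths.get("queries.csv"));
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/testdb", "user", "password")) {
            long start = System.currentTimeMillis();
            for (String sql : queries) {
                if (sql.trim().isEmpty()) {
                    continue; // skip blank lines
                }
                try (Statement st = conn.createStatement()) {
                    st.execute(sql); // works for both queries and updates
                }
            }
            long elapsed = System.currentTimeMillis() - start;
            System.out.println("Replayed " + queries.size() + " lines in " + elapsed + " ms");
        }
    }
}
```

This measures the total time for the whole sequence, which is essentially what the Transaction Controller reports when it wraps all the JDBC Requests.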