When writing to multiple files, does Pentaho's "Text file output" step guarantee that records are written in the order in which they are sent to it? Or, when dealing with multiple files and multiple streams, does it have some internal logic that might change the order in which records are written?
In my experience, it writes lines in the same order it receives them.
Of course, if several steps write to the same file, Pentaho does not guarantee which of them runs first, because all steps are processed in parallel.
If that's the case and you need to write one stream before the other, use a blocking step, but keep in mind that your job will run more slowly.
I have a huge tab-delimited file that can contain anywhere from a few hundred thousand to 5 million records. I need to read the file from an FTP location, transform it, and finally write it to an FTP location.
I was going to use the FTP connector to get a repeatable stream and feed it into a Mule batch job. Inside the batch job, the idea was to use a batch step to transform the records and then, in a batch aggregator, write the file to the destination over FTP in append mode, 100 records at a time.
Q1. Is this a good approach or is there some better approach?
Q2. How does the Mule batch load and dispatch phase work (https://docs.mulesoft.com/mule-runtime/4.3/batch-processing-concept#load-and-dispatch)? Does it wait for the entire stream of millions of records to be read into memory before dispatching a Mule batch instance?
Q3. While doing the FTP write in the batch aggregator, there is a chance that parallel threads will start appending content to the FTP file at the same time, thereby corrupting the records. Is that avoidable? I read about file locks (https://docs.mulesoft.com/ftp-connector/1.5/ftp-write#locks). My assumption is that it will simply raise a file lock exception rather than wait to write to FTP in append mode.
Q1. Is this a good approach or is there some better approach?
See the answer to Q3: this approach might not work for you. You could instead use a foreach and process the file sequentially, though that will increase the processing time significantly.
Q2. How does mule batch load and dispatch phase work
(https://docs.mulesoft.com/mule-runtime/4.3/batch-processing-concept#load-and-dispatch
) Is it waiting for entire stream of millions of records to be read in
memory before dispatching a mule batch instance ?
Batch doesn't load large numbers of records into memory; it uses file-based queues. And yes, it loads all records into the queue before starting to process them.
Q3. While doing FTP write in batch aggregate there is a chance that
parallel threads will start appending content to FTP at same time
thereby corrupting the records. Is that avoidable. I read about File
locks (https://docs.mulesoft.com/ftp-connector/1.5/ftp-write#locks) .
My assumption is it will simply raise File lock exception and not
necessarily wait to write FTP in append mode
The file write operation will throw a FILE:FILE_LOCK error if the file is already locked. Note that Mule 4 doesn't manage errors through exceptions, it uses Mule errors.
If you are using DataWeave flatfile to parse the input file, note that it will load the file in memory and use significantly more memory than the file itself to process it, so you probably are going to get an out of memory error anyway.
I have to implement a Kafka consumer that reads data from a topic and writes it to a file based on the account ID present in the payload (there will be close to a million account IDs). Assume there will be around 3K events per second. Is it OK to open and close a file for each message read?
Or should I consider a different approach?
I am assuming following:
Each account id will be unique and will have its own unique file.
It is okay to have a little lag in the data in the file, i.e. the data in the file will be near real time.
The data read per event is not huge.
Solution:
Kafka Consumer reads the data and writes to a database, preferably a NoSQL db.
A separate single thread periodically reads the database for newly inserted records and groups them by accountId.
It then iterates over the accountIds and, for each one, opens the file, writes all of its data at once, closes the file, and moves on to the next accountId.
Advantages:
Your consumer will not be blocked due to File Handling, as the two operations are decoupled.
Even if File Handling fails then the data is always present in DB to reprocess.
If your account IDs repeat, it is better to use windowing. You can aggregate all events from, say, a 1-minute window, then group the events by key and process each accountId in one pass.
This way, you will not have to open a file multiple times.
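A minimal sketch of that grouping step, assuming each event has already been reduced to an (account_id, line) pair and that one file per account lives in a single directory. The function names and the ".log" suffix are hypothetical, and the windowing itself (collecting one minute of events) is left to the caller.

```python
from collections import defaultdict
from pathlib import Path
import tempfile


def group_by_account(events):
    """Group (account_id, line) pairs by account_id, keeping arrival order."""
    grouped = defaultdict(list)
    for account_id, line in events:
        grouped[account_id].append(line)
    return grouped


def flush_window(events, out_dir):
    """Append one window of events to per-account files.

    Each account's file is opened exactly once per window.
    Returns the number of files that were opened.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    grouped = group_by_account(events)
    for account_id, lines in grouped.items():
        # Hypothetical naming scheme: <account_id>.log
        with open(out_dir / f"{account_id}.log", "a", encoding="utf-8") as fh:
            fh.write("".join(line + "\n" for line in lines))
    return len(grouped)
```

With three events for two accounts, only two files are opened, however many events each account has in the window.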
It is not OK to open a file for every single message. You should buffer a fixed number of messages, then write to the file when you reach that limit.
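As a rough illustration of that buffering idea (not the connector's actual implementation), the sketch below collects messages and hands them to a writer callback once a limit is reached. The class name, flush_size parameter, and write_batch callback are all hypothetical.

```python
class BufferedSink:
    """Collect messages and write them out in fixed-size batches."""

    def __init__(self, flush_size, write_batch):
        if flush_size < 1:
            raise ValueError("flush_size must be at least 1")
        self.flush_size = flush_size
        self.write_batch = write_batch  # called with a list of messages
        self.buffer = []

    def add(self, message):
        """Buffer one message; flush automatically when the limit is reached."""
        self.buffer.append(message)
        if len(self.buffer) >= self.flush_size:
            self.flush()

    def flush(self):
        """Write whatever is buffered (for example on shutdown or on a timer)."""
        if self.buffer:
            self.write_batch(self.buffer)
            self.buffer = []
```

In a real consumer, flush() would also need to be called on shutdown, and offsets should only be committed after a successful flush, so that buffered messages are not lost.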
You can use the HDFS Kafka Connector provided by Confluent to manage this.
If you configure it with the FieldPartitioner and have it write to a local filesystem (for example, store.url=file:///tmp), it will create one directory per unique value of the accountId field in your topic. The flush.size configuration then determines how many messages end up in a single file.
Hadoop does not need to be installed, because the HDFS libraries are included in the Kafka Connect classpath and they support local filesystems.
You would start it like this after creating two property files:
bin/connect-standalone worker.properties hdfs-local-connect.properties
CSV files get uploaded to an FTP server (for which I don't have SSH access) on a daily basis, and I need to generate weekly data that merges those files with transformations. That data would go into a history table in BQ and a CSV file in GCS.
My approach goes as follows:
Create a Linux VM and set a cron job that syncs the files from the FTP server with a GCS bucket (I'm using GCSFS)
Use an external table in BQ for each category of CSV files
Create views with complex queries that transform the data
Use another cron job to create a table with the historic data and also the CSV file on a weekly basis.
My idea is to remove as many intermediate processes as I can and to make the implementation as easy as possible, including Dataflow for ETL, but I have some questions first:
What's the problem with my approach in terms of efficiency and money?
Is there anything DataFlow can provide that my approach can't?
Any ideas about other approaches?
BTW, I ran into one problem that might be fixable by parsing the CSV files myself rather than using external tables: invalid characters, such as the null char. If I parse the files myself I can get rid of them, whereas an external table fails with a parsing error.
Your ETL would probably be simplified by a Google Dataflow batch pipeline job. Upload your files to the GCS bucket. Use a pipeline transformation to strip null values and invalid characters (or whatever you need). On the transformed dataset, run your complex queries, such as grouping by key and aggregating (sum or combine); if you need side inputs, Dataflow can also merge other datasets into the current one. Finally, the transformed output can be written to BQ, or you can write your own custom implementation for writing the results.
So Dataflow gives your solution a lot of flexibility: you can branch the pipeline and work differently on each branch with the same dataset. Regarding cost, running your batch job with three workers, which is the default, should not be very costly. And if you just want to concentrate on your business logic and not worry about the rest, Google Dataflow is pretty interesting, and it is very powerful if used wisely.
Dataflow helps you keep everything in one place and manage it effectively. Go through its pricing and determine whether it is the best fit for you (your problem is completely solvable with Google Dataflow). Your approach is not bad, but it needs extra maintenance because of all those pieces.
Hope this helps.
Here are a few thoughts.
If you are working with a very low volume of data then your approach may work just fine. If you are working with more data and need several VMs, dataflow can automatically scale up and down the number of workers your pipeline uses to help it run more efficiently and save costs.
Also, is your Linux VM always running? Or does it only spin up when you run your cron job? A batch Dataflow job only runs when it is needed, which also helps to save on costs.
In Dataflow you could use TextIO to read each line of the file in, and add your custom parsing logic.
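The per-line parsing logic could be as small as the following sketch. It is plain Python rather than Dataflow API code, the function names are hypothetical, and it assumes the only invalid character to remove is the null char mentioned in the question. It also assumes no quoted field spans several lines, since the cleaning is done line by line.

```python
import csv


def clean_line(line):
    """Remove NUL characters from one line of text."""
    return line.replace("\x00", "")


def parse_csv_lines(lines):
    """Clean each line, then parse the result as CSV rows."""
    cleaned = (clean_line(line) for line in lines)
    return list(csv.reader(cleaned))
```

A function like clean_line could then be applied to each line the pipeline reads, before the line is split into fields.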
You mention that you have a cron job which puts the files into GCS. Dataflow can read from GCS, so it would probably be simplest to keep that process around and have your dataflow job read from GCS. Otherwise you would need to write a custom source to read from your FTP server.
Here are some useful links:
https://cloud.google.com/dataflow/service/dataflow-service-desc#autoscaling
In Mule, I have quite many records to process, where processing includes some calculations, going back and forth to database etc.. We can process collections of records with these options
Batch processing
ForEach
Splitter-Aggregator
So what are the main differences between them? When should we prefer one to others?
The Mule batch processing option does not seem to allow defining variables with batch-job scope, for example. Or what if I want to benefit from multithreading to speed up the overall task? Or which is better if I want to modify the payload during processing?
When you write "quite many", I assume it's too much for main memory. This rules out splitter/aggregator, because it has to collect all records to return them as a list.
I assume you have your records in a stream or iterator, otherwise you probably have a memory problem...
So when to use for-each and when to use batch?
For Each
The most simple solution, but it has some drawbacks:
It is single threaded (so may be too slow for your use case)
It is "fire and forget": You can't collect anything within the loop, e.g. a record count
There is no support for handling "broken" records
Within the loop, you can have several steps (message processors) to process your records (e.g. for the mentioned database lookup).
May be a drawback, may be an advantage: The loop is synchronous. (If you want to process asynchronous, wrap it in an async-scope.)
Batch
A little more stuff to do / to understand, but more features:
When called from a flow, always asynchronous (this may be a drawback).
Can be standalone (e.g. with a poll inside for starting)
When the data generated in the loading phase is too big, it is automatically offloaded to disk.
Multithreading for free (number of threads configurable)
Handling for "broken records": Batch steps may be executed for good/broken records only.
You get statistics at the end (number of records, number of successful records etc.)
So it looks like you better use batch.
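To make the contrast concrete, here is a language-neutral illustration in Python. It is not Mule code and does not model Mule's queues or disk offloading. It only shows the three batch features listed above that a plain loop lacks: several worker threads, separate handling of failed records, and counts at the end. The function name and the max_workers default are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def process_batch(records, step, max_workers=4):
    """Run `step` on every record in parallel and split the outcomes.

    Returns (successes, failures): successes holds the results in input
    order, failures holds (record, exception) pairs for "broken" records.
    """
    def run(record):
        try:
            return ("ok", step(record))
        except Exception as exc:  # a broken record must not stop the batch
            return ("failed", (record, exc))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        outcomes = list(pool.map(run, records))

    successes = [value for status, value in outcomes if status == "ok"]
    failures = [value for status, value in outcomes if status == "failed"]
    return successes, failures
```

The lengths of the two returned lists give the kind of end-of-job statistics mentioned above, and the failures list can be passed to a separate step that handles only the broken records.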
With Splitter and Aggregator, you are responsible for writing the splitting logic and then joining the results back together at the end of processing. It is useful when you want to process records asynchronously on different servers. Parallel processing is possible, but it is less reliable than the other options.
Foreach is more reliable, but it processes records iteratively on a single thread (synchronously), so parallel processing is not possible. By default, each record creates a single message.
Batch processing is designed to process millions of records quickly and reliably. By default, 16 threads process your records.
Please go through the links below for more details.
https://docs.mulesoft.com/mule-user-guide/v/3.8/splitter-flow-control-reference
https://docs.mulesoft.com/mule-user-guide/v/3.8/foreach
I have been using the approach of passing records in an array to a stored procedure.
You can call the stored procedure inside a for loop and set the batch size of the loop so as to avoid round trips. I have used this approach and the performance is good. You may have to create another table to log results and put that logic in the stored procedure as well.
Below is the link which has all the details
https://dzone.com/articles/passing-java-arrays-in-oracle-stored-procedure-fro
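The batching part of this approach can be sketched independently of Mule and the database. The helper below is hypothetical and only shows how a list of records is cut into arrays of at most batch_size elements, each of which would then be passed to the stored procedure in one call.

```python
def chunked(records, batch_size):
    """Yield consecutive slices of `records`, each at most batch_size long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]
```

With 5 records and a batch size of 2, this yields three arrays, so the procedure is called three times instead of five.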
When I load more than one CSV file, how does BigQuery handle errors?
bq load --max_bad_records=30 dbname.finalsep20xyz gs://sep20new/abc.csv.gz,gs://sep20new/xyzcsv.gz
A few files in the batch job may fail to load because the number of expected columns does not match. I still want to load the rest of the files, though. If the file abc.csv fails, will the xyz.csv file still be loaded?
Or will the entire job fail and no record will be inserted?
I tried with dummy records but could not conclusively find how the errors in multiple files are handled.
Loads are atomic -- either all files commit or no files do. You can break the loads up into multiple jobs if you want them to complete independently. An alternative would be to set max_bad_records to something much higher.
We would still prefer that you launch fewer jobs with more files, since we have more flexibility in how we handle the imports. That said, recent changes to load quotas mean that you can submit more simultaneous load jobs, and still higher quotas are planned soon.
Also please note that all BigQuery actions that modify BQ state (load, copy, query with a destination table) are atomic; the only job type that isn't atomic is extract, since there is a chance that it might fail after having written out some of the exported data.
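Breaking one load into independent jobs, as suggested above, amounts to issuing one bq load command per file instead of a single command with a comma-separated list. The sketch below only builds the argument lists. The function name is hypothetical, and it assumes the destination table already exists, so no schema argument is passed.

```python
def build_load_commands(table, uris, max_bad_records=0):
    """Return one `bq load` argument list per source URI.

    Each command is a separate load job, so a failure in one file
    does not prevent the other files from being committed.
    """
    return [
        ["bq", "load", f"--max_bad_records={max_bad_records}", table, uri]
        for uri in uris
    ]
```

Each argument list can then be run with subprocess.run(cmd) and its return code checked individually.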