What are the guarantees for Apache Flume HDFS sink file writes? - apache

Could somebody shed some light on what happens if the Flume agent gets killed in the middle of the HDFS file write (say using Avro format)? Will the file get corrupted and all events there lost?
I understand that there are transactions between different elements of the Flume data chain (source->channel->sink). But I believe that the HDFS files may stay open between consecutive channel->sink transactions (as .tmp). So if one transaction of say 100 events is successful (and the events are stored in a file, transaction committed) and the next one fails in the middle of the HDFS write could it be that the original 100 events from the first transaction are not readable (because the file corruption for instance?). How come Flume can assure that the original 100 events from the first transaction are not affected by this type of failure? Or maybe there is no guarantee there?

If the Flume agent is killed in the middle of HDFS file write , the file won't get corrupted and there will be no data loss.
IF flume is writing to a file say FlumeData123456789.tmp when the flume agent is killed, then all the records written into that file up till that point will remain intact and the file will be saved as FlumeData123456789.

Related

Fetching S3 object Nifi

I want to fetch one particular file from S3, only once. So I used listS3 and FetchS3object processors. But whenever I start the listS3 processor and stop it after 1 second, by that time it generates thousands of flowfiles and when these flowfiles are passed to FetchS3object processor, the same file is fetched thousands of time.
I even changed the scheduling time in listS3 processor, but still, the same effect exists.
Could someone tell me where I was going wrong with the configurations?

Mule 4 Batch Process on large input file

I have a huge file that can have anywhere from few hundred thousand to 5 million records. Its tab-delimited file. I need to read the file from ftp location , transform it and finally write it in a FTP location.
I was going to use FTP connector get the repeatable stream and put it into mule batch. Inside mule batch process idea was to use a batch step to transform the records and finally in batch aggregate FTP write the file to destination in append mode 100 records at a time.
Q1. Is this a good approach or is there some better approach?
Q2. How does mule batch load and dispatch phase work (https://docs.mulesoft.com/mule-runtime/4.3/batch-processing-concept#load-and-dispatch ) Is it waiting for entire stream of millions of records to be read in memory before dispatching a mule batch instance ?
Q3. While doing FTP write in batch aggregate there is a chance that parallel threads will start appending content to FTP at same time thereby corrupting the records. Is that avoidable. I read about File locks (https://docs.mulesoft.com/ftp-connector/1.5/ftp-write#locks) . My assumption is it will simply raise File lock exception and not necessarily wait to write FTP in append mode.
Q1. Is this a good approach or is there some better approach?
See answer Q3, this might not work for you. You could instead use a foreach and process the file sequentially though that will increase the time for processing significantly.
Q2. How does mule batch load and dispatch phase work
(https://docs.mulesoft.com/mule-runtime/4.3/batch-processing-concept#load-and-dispatch
) Is it waiting for entire stream of millions of records to be read in
memory before dispatching a mule batch instance ?
Batch doesn't load big numbers of records in memory, it uses file based queues. And yes, it loads all records in the queue before starting to process them.
Q3. While doing FTP write in batch aggregate there is a chance that
parallel threads will start appending content to FTP at same time
thereby corrupting the records. Is that avoidable. I read about File
locks (https://docs.mulesoft.com/ftp-connector/1.5/ftp-write#locks) .
My assumption is it will simply raise File lock exception and not
necessarily wait to write FTP in append mode
The file write operation will throw a FILE:FILE_LOCK error if the file is already locked. Note that Mule 4 doesn't manage errors through exceptions, it uses Mule errors.
If you are using DataWeave flatfile to parse the input file, note that it will load the file in memory and use significantly more memory than the file itself to process it, so you probably are going to get an out of memory error anyway.

apache nifi S3 PutObject stuck

Sorry if this is a dumb question, very new to nifi.
Have set up a process group to dump sql queries to CSV and then upload them to S3. Worked fine with small queries, but appears to be stuck with larger files.
The input queue to the PutS3Object processor has a limit of 1GB, but the file it is trying to put is almost 2 GB. I have set the multi-part parameters in the S3 processor to be 100M but it is still stuck.
So my theory is the S3PutObject needs a complete file before it starts uploading. Is this correct? Is there no way to get it uploading in a "streaming" manner? Or do I just have to up the input queue size?
Or am I on the wrong track and there is something else holding this all up.
The screenshot suggests that the large file is in PutS3Object's input queue, and PutS3Object is actively working on it (from the 1 thread indicator in the top-right of the processor box).
As it turns out, there were no errors, just a delay from processing a large file.

Kafka Consumer which writes into multiple file

I have to implement a kafka consumer which reads data from a topic and writes it a file based on the account id(will be close to million) present in the payload. Assuming there will be around 3K events per second. Is it ok to open and close file for each message read?
or should I consider a different approach?
I am assuming following:
Each account id will be unique and will have its own unique file.
It is okay to have a little lag in the data in the file, i.e. the data in the file will be near real time.
The data read per event is not huge.
Solution:
Kafka Consumer reads the data and writes to a database, preferably a NoSQL db.
A separate Single thread periodically reads the database for new records inserted, groups them by accountId.
Then iterates over the accoundId and for each accountId opens the File, writes the data at once, closes the File and moves to the next accountId.
Advantages:
Your consumer will not be blocked due to File Handling, as the two operations are decoupled.
Even if File Handling fails then the data is always present in DB to reprocess.
If your account id repeats, then it is better to windowing. You can aggregate all events of say 1 min by windowing, then you can group events by key and process all accountId at once.
This way, you will not have to open a file multiple times.
It is not okay to open a file for every single message, you should buffer a fixed amount of message, then write to a file when you each that limit.
You can use the HDFS Kafka Connector provided by Confluent to manage this.
If configured with the FieldPartitioner writing out to a local filesystem given store.url=file:///tmp, for example, that will create one directory per unique accountId field in your topic. Then the flush.size configuration determines how many messages will end up in a single file
Hadoop does not need to be installed as the HDFS libraries are included in the Kafka Connect classpath and they support local filesystems
You would start it like this after creating two property files
bin/connect-standalone worker.properties hdfs-local-connect.properties

import text file to SQL Server 2008 Database real time

I have a machine which creates a new log file at the beginning of the day(12am) and updates the log file whenever there is any changes until the end of the day.
How do I import the data in real time (30 sec, 1min or whenever there is any changes) to my SQL server database?
Will SQL Server 2008 be able to access the active log file? If not will it be easier if i let my machine create a new log file whenever there is any updates? But if it is so, how do i import so many log files with different names in real time. ( I must be able to scale the solution up to multiple machines)
Thx a lot
You can log each new line with a reversed time stamp.
Since you need to log only when the file changes you can implement an in memory queue
which reads from the file and stores the data.
Then implement a producer consumer model wherein one thread reads and loads data from the queue and the consumer logs to the database.
A windows service then can keep reading from the queue and log to the SQL Server.
(Since it's a producer consumer there will not be any busy waiting in case the queue is empty)
Somehow you will also have to notify the producer thread whenever every log is made. This can be done through Sockets/or some other means in case you have access to the code which is doing the logging.
If you have no control over the application producing the file then you have little option but to poll the file. Write an application that regularly polls the file and writes the deltas to the database. The application will need to record a high water mark that it has last read to.
Another wrinkle is that if the application does not close the file between writes then the last accessed time stamp might not be updated, so checking the age of the file may not be reliable. In this case you need to implement something like this process:
Open the log file
Seek to your last recorded EOF position
Try reading
If successful, process the new data until you get to the new EOF.
Update your persistent EOF position
Close the file
You will need to make sure that the number of bytes read aligns with your file seek position. If the log file is unicode then it may not have a 1:1 mapping between bytes and characters. You may need to read chunks of the file in binary mode and do the translation to characters from the buffer.
Once you have the log file entries parsed then you can just insert the data, or use SQLBulkCopy for larger data volumes.
If you can relax your latency constraints and the log file is small enough then you could possibly just implement a process that copies the log file to a staging area and reloads the whole thing periodically.
How about an SSIS package being called by an SQL Server Scheduled Job?