Kafka consumer which writes into multiple files - file-io

I have to implement a Kafka consumer which reads data from a topic and writes it to a file based on the account ID present in the payload (there will be close to a million account IDs). Assuming there will be around 3K events per second, is it OK to open and close a file for each message read, or should I consider a different approach?

I am assuming the following:
Each account id will be unique and will have its own unique file.
It is okay to have a little lag in the data in the file, i.e. the data in the file will be near real time.
The data read per event is not huge.
Solution:
The Kafka consumer reads the data and writes it to a database, preferably a NoSQL DB.
A separate single thread periodically reads the database for newly inserted records and groups them by accountId.
It then iterates over the accountIds and, for each accountId, opens the file, writes all of its pending data at once, closes the file and moves on to the next accountId.
Advantages:
Your consumer will not be blocked by file handling, as the two operations are decoupled.
Even if file handling fails, the data is always present in the DB to reprocess.
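A minimal sketch of that periodic writer thread, assuming Java 17, a hypothetical fetchNewRecords()/markProcessed() pair around whatever database you choose, and a made-up /data/accounts directory layout; it only illustrates the group-then-append pattern, not a production implementation.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.stream.Collectors;

public class PeriodicAccountFileWriter {

    // Hypothetical shape of a row stored by the Kafka consumer.
    record Event(String accountId, String payload) {}

    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    public void start() {
        // Flush every 30 seconds; tune the interval to the lag you can tolerate.
        scheduler.scheduleWithFixedDelay(this::flushOnce, 30, 30, TimeUnit.SECONDS);
    }

    private void flushOnce() {
        List<Event> newEvents = fetchNewRecords();                       // hypothetical DB read
        Map<String, List<Event>> byAccount = newEvents.stream()
                .collect(Collectors.groupingBy(Event::accountId));

        byAccount.forEach((accountId, events) -> {
            Path file = Paths.get("/data/accounts", accountId + ".log"); // assumed layout
            String block = events.stream()
                    .map(Event::payload)
                    .collect(Collectors.joining("\n", "", "\n"));
            try {
                // One open/append/close per account per cycle instead of per message.
                Files.writeString(file, block, StandardCharsets.UTF_8,
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
                markProcessed(events);                                   // hypothetical: flag rows as written
            } catch (IOException e) {
                // Leave the rows unmarked so the next cycle retries them.
            }
        });
    }

    private List<Event> fetchNewRecords() { return List.of(); }          // placeholder
    private void markProcessed(List<Event> events) {}                    // placeholder
}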

If your account IDs repeat, then it is better to use windowing. You can aggregate all events of, say, 1 minute in a window, then group the events by key and process each accountId's batch at once.
This way, you will not have to open a file multiple times.
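As one possible realization of that suggestion (the answer does not prescribe a specific framework), here is a Kafka Streams sketch assuming the account ID is already the record key, a hypothetical account-events topic, and an appendToFile helper:

import java.time.Duration;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Suppressed;
import org.apache.kafka.streams.kstream.TimeWindows;

public class WindowedAccountBatcher {

    static StreamsBuilder buildTopology() {
        StreamsBuilder builder = new StreamsBuilder();

        builder.<String, String>stream("account-events")                 // hypothetical topic, key = accountId
                .groupByKey()
                // 1-minute tumbling windows with no grace period, so a window closes as soon as it ends.
                .windowedBy(TimeWindows.of(Duration.ofMinutes(1)).grace(Duration.ZERO))
                .aggregate(
                        () -> "",
                        (accountId, value, aggregate) -> aggregate + value + "\n",
                        Materialized.with(Serdes.String(), Serdes.String()))
                // Emit each window only once, when it closes, so each accountId gets one write per window.
                .suppress(Suppressed.untilWindowCloses(Suppressed.BufferConfig.unbounded()))
                .toStream()
                .foreach((windowedAccountId, batch) ->
                        appendToFile(windowedAccountId.key(), batch));   // hypothetical file-append helper

        return builder;
    }

    static void appendToFile(String accountId, String batch) {
        // open the account's file once, append the whole window's batch, close it
    }
}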

It is not okay to open a file for every single message; you should buffer a fixed number of messages and then write them to a file when you reach that limit.
You can use the HDFS Kafka Connector provided by Confluent to manage this.
If it is configured with the FieldPartitioner and writes out to a local filesystem (given store.url=file:///tmp, for example), it will create one directory per unique accountId field value in your topic. The flush.size configuration then determines how many messages end up in a single file.
Hadoop does not need to be installed, as the HDFS libraries are included in the Kafka Connect classpath and they support local filesystems.
You would start it like this after creating the two property files:
bin/connect-standalone worker.properties hdfs-local-connect.properties
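For illustration only, an hdfs-local-connect.properties along those lines might look like the following; the topic name and paths are assumptions, and the exact keys should be checked against the Confluent HDFS connector documentation for the version you run:

name=hdfs-local-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
# assumed topic name
topics=account-events
# point the connector at the local filesystem instead of a real HDFS cluster
hdfs.url=file:///tmp
store.url=file:///tmp
# how many messages end up in a single output file
flush.size=1000
# one output directory per accountId field value
partitioner.class=io.confluent.connect.hdfs.partitioner.FieldPartitioner
partition.field.name=accountId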

Related

Flink batching Sink

I'm trying to use Flink in both a streaming and a batch way, to add a lot of data into Accumulo (a few million records a minute). I want to batch up records before sending them to Accumulo.
I ingest data either from a directory or via Kafka, convert the data using a flatMap and then pass it to a RichSinkFunction, which adds the data to a collection.
With the streaming data, batching seems OK, in that I can add the records to a collection of fixed size which gets sent to Accumulo once the batch threshold is reached. But for the batch data, which is finite, I'm struggling to find a good approach to batching, as it would require a flush timeout in case there is no further data within a specified time.
There doesn't seem to be an Accumulo connector, unlike for Elasticsearch or other alternative sinks.
I thought about using a ProcessFunction with a trigger for batch size and time interval, but this requires a keyed window. I didn't want to go down the keyed route as the data looks to be very skewed, in that some keys would have a tonne of records and some would have very few. If I don't use a windowed approach, then I understand that the operator won't be parallel. I was hoping to lazily batch, so each sink only cares about numbers or an interval of time.
Has anybody got any pointers on how best to address this?
You can access timers in a sink by implementing ProcessingTimeCallback. For an example, look at the BucketingSink -- its open and onProcessingTime methods should get you started.
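A rough sketch of that pattern (package names and the invoke signature vary between Flink 1.x versions, and writeBatchToAccumulo plus the size/interval values are placeholders):

import java.util.ArrayList;
import java.util.List;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import org.apache.flink.streaming.api.operators.StreamingRuntimeContext;
import org.apache.flink.streaming.runtime.tasks.ProcessingTimeCallback;
import org.apache.flink.streaming.runtime.tasks.ProcessingTimeService;

public class BatchingAccumuloSink<T> extends RichSinkFunction<T> implements ProcessingTimeCallback {

    private static final int MAX_BATCH_SIZE = 10_000;      // placeholder threshold
    private static final long FLUSH_INTERVAL_MS = 60_000;  // placeholder flush interval

    private transient List<T> batch;
    private transient ProcessingTimeService timerService;

    @Override
    public void open(Configuration parameters) {
        batch = new ArrayList<>();
        // The processing-time service is reachable through the streaming runtime context;
        // this is the same mechanism BucketingSink uses to register its timers.
        timerService = ((StreamingRuntimeContext) getRuntimeContext()).getProcessingTimeService();
        timerService.registerTimer(timerService.getCurrentProcessingTime() + FLUSH_INTERVAL_MS, this);
    }

    @Override
    public void invoke(T value) {
        batch.add(value);
        if (batch.size() >= MAX_BATCH_SIZE) {
            flush();
        }
    }

    @Override
    public void onProcessingTime(long timestamp) {
        // The timer fires even when no new records arrive, so a final partial batch is not stuck forever.
        flush();
        timerService.registerTimer(timerService.getCurrentProcessingTime() + FLUSH_INTERVAL_MS, this);
    }

    @Override
    public void close() {
        flush();
    }

    private void flush() {
        if (!batch.isEmpty()) {
            writeBatchToAccumulo(batch);   // hypothetical: build mutations and hand them to a BatchWriter
            batch = new ArrayList<>();
        }
    }

    private void writeBatchToAccumulo(List<T> records) {
        // Accumulo client code would go here.
    }
}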

What are the guarantees for Apache Flume HDFS sink file writes?

Could somebody shed some light on what happens if the Flume agent gets killed in the middle of an HDFS file write (say, using Avro format)? Will the file get corrupted and all events in it lost?
I understand that there are transactions between the different elements of the Flume data chain (source->channel->sink). But I believe that the HDFS files may stay open between consecutive channel->sink transactions (as .tmp). So if one transaction of, say, 100 events is successful (and the events are stored in a file, transaction committed) and the next one fails in the middle of the HDFS write, could it be that the original 100 events from the first transaction are no longer readable (because of file corruption, for instance)? How can Flume assure that the original 100 events from the first transaction are not affected by this type of failure? Or maybe there is no guarantee there?
If the Flume agent is killed in the middle of an HDFS file write, the file won't get corrupted and there will be no data loss.
If Flume is writing to a file, say FlumeData123456789.tmp, when the agent is killed, then all the records written into that file up to that point will remain intact and the file will be saved as FlumeData123456789.

Node-local MapReduce job

I am currently attempting to write a map-reduce job where the input data is not in HDFS and cannot be loaded into HDFS, basically because the programs using the data cannot use data from HDFS and there is too much of it to copy into HDFS, at least 1 TB per node.
So I have 4 directories on each of the 4 nodes in my cluster. Ideally I would like my mappers to just receive the paths to these 4 local directories and read them, using something like file:///var/mydata/..., so that one mapper can work with each directory, i.e. 16 mappers in total.
However, to be able to do this I need to ensure that I get exactly 4 mappers per node, and exactly the 4 mappers which have been assigned the paths local to that machine. These paths are static and so can be hard-coded into my FileInputFormat and RecordReader, but how do I guarantee that given splits end up on a given node with a known hostname? If the data were in HDFS I could use a variant of FileInputFormat, set isSplitable to false, and Hadoop would take care of it, but as all the data is local this causes issues.
Basically all I want is to be able to crawl local directory structures on every node in my cluster exactly once, process a collection of SSTables in these directories and emit rows (on the mapper), and reduce the results (in the reduce step) into HDFS for further bulk processing.
I noticed that InputSplits provide a getLocations function, but I believe this does not guarantee locality of execution, only optimizes for it, and clearly if I try to use file:///some_path in each mapper I need to ensure exact locality, otherwise I may end up reading some directories repeatedly and others not at all.
Any help would be greatly appreciated.
I see there are three ways you can do it.
1.) Simply load the data into HDFS, which you do not want to do. But it is worth trying as it will be useful for future processing.
2.) You can make use of NLineInputFormat. Create four different files containing the URLs of the input files, one for each of your nodes.
file://192.168.2.3/usr/rags/data/DFile1.xyz
.......
You load these files into HDFS and write your program to access the data using these URLs. If you use NLineInputFormat with one line per split, you will get 16 mappers, each map processing one exclusive file (a minimal driver sketch follows the list). The only issue here is that there is a high possibility the data on one node ends up being processed on another node; however, there will not be any duplicate processing.
3.) You can further optimize the above method by loading the four URL files separately. While loading any of these files you can remove the other three nodes, to ensure that the file goes exactly to the node where its data files are locally present, and choose a replication factor of 1 so that the blocks are not replicated. This process will greatly increase the probability that the launched maps process local files.
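A rough driver/mapper sketch for option 2 (class names and argument handling are made up; each mapper receives one URL line from the list files and opens that location directly):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class LocalDirJob {

    // Each input record is one URL line, such as file://192.168.2.3/usr/rags/data/DFile1.xyz
    public static class LocalDirMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text urlLine, Context context)
                throws IOException, InterruptedException {
            String url = urlLine.toString().trim();
            FileSystem fs = FileSystem.get(URI.create(url), context.getConfiguration());
            try (BufferedReader reader =
                         new BufferedReader(new InputStreamReader(fs.open(new Path(url))))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    // parse the local data here and emit rows
                    context.write(new Text(url), new Text(line));
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "local-dir-job");
        job.setJarByClass(LocalDirJob.class);
        job.setMapperClass(LocalDirMapper.class);

        // One URL line per split -> one mapper per listed file/directory.
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 1);
        NLineInputFormat.addInputPath(job, new Path(args[0])); // the URL-list files in HDFS

        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}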
Cheers
Rags

Import a text file to a SQL Server 2008 database in real time

I have a machine which creates a new log file at the beginning of the day (12 AM) and updates the log file whenever there are any changes, until the end of the day.
How do I import the data in real time (every 30 seconds, every minute, or whenever there is a change) to my SQL Server database?
Will SQL Server 2008 be able to access the active log file? If not, would it be easier if I let my machine create a new log file whenever there is an update? But if so, how do I import so many log files with different names in real time? (I must be able to scale the solution up to multiple machines.)
Thanks a lot
You can log each new line with a reversed time stamp.
Since you need to log only when the file changes, you can implement an in-memory queue which reads from the file and stores the data.
Then implement a producer-consumer model, wherein one thread reads the file and loads the data into the queue, and the consumer takes it off the queue and logs it to the database.
A Windows service can then keep reading from the queue and logging to SQL Server.
(Since it's producer-consumer, there will not be any busy waiting when the queue is empty.)
Somehow you will also have to notify the producer thread whenever a log entry is written. This can be done through sockets or some other means if you have access to the code which is doing the logging.
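A bare-bones Java sketch of that producer-consumer shape (the table name, JDBC details, and the notification hook are assumptions; the point is simply that the consumer blocks on the queue instead of busy-waiting):

import java.nio.file.Files;
import java.nio.file.Path;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.stream.Stream;

public class LogToDbPipeline {

    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

    // Producer: called whenever the notification mechanism reports new lines in the log file.
    public void onNewLines(Path logFile, long linesAlreadyRead) throws Exception {
        try (Stream<String> lines = Files.lines(logFile)) {
            lines.skip(linesAlreadyRead).forEach(queue::add);
        }
    }

    // Consumer: runs in its own thread (e.g. hosted by the Windows service) and blocks when idle.
    public void consume(String jdbcUrl) throws Exception {
        try (Connection conn = DriverManager.getConnection(jdbcUrl);
             PreparedStatement insert =
                     conn.prepareStatement("INSERT INTO LogEntries(line) VALUES (?)")) { // assumed table
            while (!Thread.currentThread().isInterrupted()) {
                String line = queue.take();   // blocks until a line is available: no busy waiting
                insert.setString(1, line);
                insert.executeUpdate();
            }
        }
    }
}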
If you have no control over the application producing the file, then you have little option but to poll the file. Write an application that regularly polls the file and writes the deltas to the database. The application will need to record a high-water mark of where it last read up to.
Another wrinkle is that if the application does not close the file between writes, then the last-accessed time stamp might not be updated, so checking the age of the file may not be reliable. In this case you need to implement something like this process:
Open the log file
Seek to your last recorded EOF position
Try reading
If successful, process the new data until you get to the new EOF.
Update your persistent EOF position
Close the file
You will need to make sure that the number of bytes read aligns with your file seek position. If the log file is Unicode then it may not have a 1:1 mapping between bytes and characters; you may need to read chunks of the file in binary mode and do the translation to characters from the buffer.
Once you have the log file entries parsed, you can just insert the data, or use SqlBulkCopy for larger data volumes.
If you can relax your latency constraints and the log file is small enough, then you could possibly just implement a process that copies the log file to a staging area and reloads the whole thing periodically.
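A Java sketch of that polling loop, assuming the persisted offset lives behind hypothetical loadLastOffset/saveLastOffset helpers and the file uses a single-byte encoding (a Unicode file would need the byte-to-character handling described above):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class LogFilePoller {

    public void pollOnce(String logPath) throws IOException {
        long lastEof = loadLastOffset();                      // hypothetical persisted high-water mark

        try (RandomAccessFile file = new RandomAccessFile(logPath, "r")) {
            long length = file.length();
            if (length <= lastEof) {
                return;                                       // nothing new (or the file was rotated)
            }
            file.seek(lastEof);                               // jump to where we stopped last time

            byte[] delta = new byte[(int) (length - lastEof)];
            file.readFully(delta);

            // Hand the new bytes to the parser / database writer.
            processNewData(new String(delta, StandardCharsets.US_ASCII));

            saveLastOffset(length);                           // record the new EOF position
        }
    }

    private long loadLastOffset() { return 0L; }              // placeholder
    private void saveLastOffset(long offset) {}               // placeholder
    private void processNewData(String text) {}               // placeholder: parse lines, insert into DB
}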
How about an SSIS package being called by an SQL Server Scheduled Job?

Service Broker Design

I'm looking to introduce SQL Server Service Broker.
I have a remote orders database and a local processing database, and all activity on the processing database has to happen in sequence; this seems a perfect job for Service Broker!
I've set up the infrastructure, I can send and receive messages, and now I'm looking at the design of the processing. As I said, all processes for one order need to be completed in sequence, so I'll put them in one conversation.
One of these processes is a request for external flat-file data; we then wait (it could be several days) and then import and process this file when it returns. How can I process half the tasks, then wait for the flat file to return before processing the other half?
I've had some ideas, but I'm sure I'm missing a trick somewhere:
1) Write all queue items to a status table and use status values - this seems to remove some of the flexibility of SSSB and add another layer of tasks.
2) Keep the transaction open until we get the data back - not ideal.
3) Have the flat-file import task continually poll for the file to appear - this seems inefficient.
What is the most efficient way of managing this workflow?
thanks in advance
In my opinion it is like a chain of responsibility. As far as I can understand, we have the following workflow:
1.) Process the message.
2.) Wait for the external file; this can be a busy wait, or, if the external data source provides a notification, we can do it in a non-polling manner.
3.) Once the data is received, process it.
So my suggestion would be to use three different queues, one for each part; when one part is done, it forwards or puts a new message on the next queue in the chain.
I am assuming one order's processing will not disrupt another order's processing.
I am thinking MSMQ with a Windows Workflow Foundation sequential workflow might also be a candidate for this task.