I am new to NiFi, and advice is welcome.
We get data sent in from external sources in many small records. I am thinking of pulling those records into NiFi via RabbitMQ. I'd like to "spool" or "batch" those records up into larger groupings (perhaps based on some index in the records), and when a group of records reaches a certain size threshold, write it out to S3.
How to best accomplish this in NiFi? Any other suggestions?
Thanks, Gary
RabbitMQ is based on AMQP. NiFi provides a processor for AMQP called ConsumeAMQP. You will find additional details in the processor's documentation, which covers RabbitMQ specifically. Configure the processor according to the documentation and you are good to go.
For the second part you need the PutS3Object processor, and there you will be able to define the thresholds.
This should be achievable... I don't know that much about RabbitMQ, but assuming that it supports a JMS interface, then you could probably use NiFi's ConsumeJMS processor, followed by MergeContent to merge until your threshold is reached, and then PutS3Object to write to S3.
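In case it helps, here is a rough sketch of the flow; treat the exact property names as approximate and check them against your NiFi version's documentation:

ConsumeAMQP (pointing at your RabbitMQ queue) -> MergeContent -> PutS3Object

On MergeContent, the properties of interest are roughly: Merge Strategy (Bin-Packing Algorithm), Correlation Attribute Name (set it to the record index/attribute you want to group by), Minimum Group Size / Maximum Group Size for the size threshold, and Max Bin Age as a safety valve so a half-filled bin still gets flushed eventually.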
Please be aware that I am a relative newbie to ActiveMQ.
I am currently working with a small cluster of ActiveMQ (version 5.15.x) nodes (< 5). I recently experimented with setting up the configuration to use "Shared File System Master Slave" with kahadb, in order to provide high availability to the cluster.
Having done so and seeing how it operates, I'm now considering whether this configuration provides the level of throughput required for both consumers/producers, as only one broker's ports are available at one time.
My question is basically two-part. First, does it make sense to configure the cluster as highly available AND load balanced (through a Network of Brokers)? Second, is the above even technically viable, or do I need to review my design considerations to favor one aspect over the other?
I had some discussions with the ActiveMQ maintainers in IRC on this topic a few months back.
It seems that they would recommend using ActiveMQ Artemis instead of ActiveMQ 5.
Artemis has an HA solution:
https://activemq.apache.org/artemis/docs/latest/clusters.html
https://activemq.apache.org/artemis/docs/latest/ha.html
The idea is to use Data Replication to allow for failover, etc:
When using replication, the live and the backup servers do not share the same data directories, all data synchronization is done over the network. Therefore all (persistent) data received by the live server will be duplicated to the backup.
And, I think you want to have at least 3 nodes (or some odd number) to avoid issues with split brain during network partitions.
It seems like Artemis can mostly be used as a drop-in replacement for ActiveMQ; it can still speak the OpenWire protocol, etc.
However, I haven't actually tried this yet, so YMMV.
I use an AWS Kinesis stream with several shards. The partition key I set when I put records into the stream is not constant, so that records are spread across all the shards.
To be sure that every shard is actually used, how can I monitor the activity of the shards?
I saw that with the enhanced level of CloudWatch monitoring, Kinesis metrics can be broken down by shard. I have not enabled it, and as my need is only occasional, I don't want to pay for it.
You can enable shard-level metrics when you need them, then disable them when you don't. Although you said you did not want this solution, it is by far the best way.
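If it helps, here is a rough sketch of how you could toggle this on and off with the AWS SDK for Java (the stream name and the metric list are placeholders):

```java
import com.amazonaws.services.kinesis.AmazonKinesis;
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder;
import com.amazonaws.services.kinesis.model.DisableEnhancedMonitoringRequest;
import com.amazonaws.services.kinesis.model.EnableEnhancedMonitoringRequest;

public class ShardMetricsToggle {
    public static void main(String[] args) {
        AmazonKinesis kinesis = AmazonKinesisClientBuilder.defaultClient();

        // Turn shard-level metrics on just before the window you want to observe...
        kinesis.enableEnhancedMonitoring(new EnableEnhancedMonitoringRequest()
                .withStreamName("my-stream")                          // placeholder stream name
                .withShardLevelMetrics("IncomingRecords", "IncomingBytes"));

        // ...inspect CloudWatch for a while, then turn them off again to stop paying.
        kinesis.disableEnhancedMonitoring(new DisableEnhancedMonitoringRequest()
                .withStreamName("my-stream")
                .withShardLevelMetrics("IncomingRecords", "IncomingBytes"));
    }
}
```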
On the consumer side, you can use custom logging. For each record batch processed in your IRecordProcessor implementation, you can count the incoming records per shard. Sample code here. You can even push the counts to a third-party metrics platform (such as Prometheus).
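A minimal sketch of that counting idea, assuming the KCL 1.x interfaces (the class name and counters are mine; KCL 2.x uses different packages and signatures):

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

import com.amazonaws.services.kinesis.clientlibrary.interfaces.IRecordProcessor;
import com.amazonaws.services.kinesis.clientlibrary.interfaces.IRecordProcessorCheckpointer;
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownReason;
import com.amazonaws.services.kinesis.model.Record;

public class ShardCountingProcessor implements IRecordProcessor {

    // One counter per shard, shared across processor instances in this worker.
    private static final Map<String, AtomicLong> COUNTS = new ConcurrentHashMap<>();

    private String shardId;

    @Override
    public void initialize(String shardId) {
        this.shardId = shardId;
        COUNTS.putIfAbsent(shardId, new AtomicLong());
    }

    @Override
    public void processRecords(List<Record> records, IRecordProcessorCheckpointer checkpointer) {
        long total = COUNTS.get(shardId).addAndGet(records.size());
        // Replace with your logger or a metrics client (e.g. a Prometheus counter).
        System.out.println("shard " + shardId + " -> " + total + " records so far");
        // ... process and checkpoint as you normally would
    }

    @Override
    public void shutdown(IRecordProcessorCheckpointer checkpointer, ShutdownReason reason) {
        System.out.println("shutting down " + shardId + " (" + reason + ")");
    }
}
```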
You can customize the producer and log the PutRecord responses: each Put call returns which shard the data was placed in. See the AWS documentation for details.
Generally, if you have a problem with non-uniform data distribution across your shards, the best approach is to use a random partition key when sending data from your Kinesis producer applications (see the sketch below).
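A hedged producer-side sketch with the plain AWS SDK for Java, combining a random partition key with logging of the returned shard (stream name and payload are placeholders):

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.UUID;

import com.amazonaws.services.kinesis.AmazonKinesis;
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder;
import com.amazonaws.services.kinesis.model.PutRecordRequest;
import com.amazonaws.services.kinesis.model.PutRecordResult;

public class ShardAwareProducer {
    public static void main(String[] args) {
        AmazonKinesis kinesis = AmazonKinesisClientBuilder.defaultClient();

        PutRecordRequest request = new PutRecordRequest()
                .withStreamName("my-stream")                       // placeholder stream name
                .withPartitionKey(UUID.randomUUID().toString())    // random key spreads load over shards
                .withData(ByteBuffer.wrap("payload".getBytes(StandardCharsets.UTF_8)));

        PutRecordResult result = kinesis.putRecord(request);

        // The response tells you which shard received the record, so you can log the distribution.
        System.out.println("record placed in " + result.getShardId()
                + ", sequence number " + result.getSequenceNumber());
    }
}
```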
I am trying to solve a problem with real-time analytics. I would like to compute values in real-time. I receive streaming data and process it with Kafka and Storm and finally write it to Redis. Now I would like to push/pull all the data stored in Redis again into Storm to do further computation with it. The problem is, this must be repeated every minute. So every minute all the values from Redis have to be pulled/pushed and computed. I do not know if this is the right way to solve my problem, but I need a kind of cache. Do you have any recommendations?
Thank you in advance.
Regards
You can use Druid instead. It can ingest values from Kafka (with Storm used to insert them), it is column-based storage, and it is specifically designed for real-time analytics. Redis is quick, but you can't meet all your analytical requirements with Redis; even simple GROUP BY or ORDER BY queries require you to write your own implementation logic, whereas Druid is designed to serve exactly this purpose.
http://druid.io/
Hope this helps.
I'm developing a prototype Lambda system and my data is streaming in via Flume to HDFS. I also need to get the data into Storm. Flume is a push system and Storm is more pull so I don't believe it's wise to try to connect a spout to Flume, but rather I think there should be a message queue between the two. Again this is a prototype, so I'm looking for best practices, not perfection. I'm thinking of putting an AMQP compliant queue as a Flume sink and then pulling the messages from a spout.
Is this a good approach? If so, I want to use a message queue that has relatively robust support in both the Flume world (as a sink) and the Storm world (as a spout). If I go AMQP then I assume that gives me the option to use whatever AMQP-compliant queue I want to use, correct? Thanks.
If you're going to use AMQP, I'd recommend sticking to the finalized 1.0 version of the AMQP spec. Otherwise, you're going to feel some pain when you try to upgrade to it from previous versions.
Your approach makes a lot of sense, but for us the AMQP compliance issue looked a little less important. I will try to explain why.
We are using Kafka to get data into Storm, mainly for performance and usability reasons. AMQP-compliant queues do not seem to be designed for holding information for a considerable time, while with Kafka this is just a matter of configuration. That lets us keep messages for a long time and "play them back" easily (since the position we consume from is always controlled by the consumer, we can consume the same messages again and again without setting up an entire system for that purpose). Also, Kafka's performance is unlike anything else I have seen.
Storm has a very useful KafkaSpout (see the wiring sketch after this list); the main things to pay attention to are:
Error reporting - there is some room for improvement there; the messages are not as clear as one would hope.
It depends on ZooKeeper (which is already there if you have Storm), and a path has to be created for it manually.
Depending on the Storm version, pay attention to the Kafka version in use. It is documented, but it can easily be missed and cause unclear problems.
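For reference, here is a minimal wiring sketch against the old storm-kafka module; the ZooKeeper hosts, topic, consumer id and zkRoot path are placeholders, and class/package names changed in the newer storm-kafka-client:

```java
import backtype.storm.spout.SchemeAsMultiScheme;
import backtype.storm.topology.TopologyBuilder;
import storm.kafka.BrokerHosts;
import storm.kafka.KafkaSpout;
import storm.kafka.SpoutConfig;
import storm.kafka.StringScheme;
import storm.kafka.ZkHosts;

public class KafkaSpoutWiring {
    public static void main(String[] args) {
        // The spout reads broker metadata from the same ZooKeeper that Storm already uses.
        BrokerHosts hosts = new ZkHosts("zk1:2181,zk2:2181");

        // zkRoot ("/kafka-spout") is where the spout keeps its offsets --
        // this is the path mentioned above that needs to exist and be writable.
        SpoutConfig spoutConfig = new SpoutConfig(hosts, "events", "/kafka-spout", "my-consumer-id");
        spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());

        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("kafka-spout", new KafkaSpout(spoutConfig), 1);
        // ... add bolts and submit the topology as usual
    }
}
```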
You can have the data streamed to a broker topic first; then Flume and a Storm spout can both consume from that topic. Flume has a JMS source that makes it easy to consume from the message broker, and there is a Storm JMS spout to get the messages into Storm.
I already have a working Camel configuration that watches a database table (through Spring and Hibernate); when something shows up in the DB, Camel consumes it and sends a message to a JMS broker.
This works flawlessly, and it is built with the DSL in MyOwnMessageRouteBuilder.configure().
Now I'd like to add monitoring that does something if no new data shows up in the DB within a given time (like 3h). Is that possible in Camel at all? I can see callbacks like onCompletion or onException, but nothing like onIdle()...
Best regards
You may look at BAM
http://camel.apache.org/bam
However, some monitoring tooling may already be able to do this, and thus you may be able to find a generic solution.
I think your best bet is to use a timer/quartz route to check the database periodically and compare the timestamp of the most recent data to the current time... if it's more than 3h old, react accordingly...
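A minimal sketch of that idea with the Java DSL and the SQL component; the table/column names, the data source bean and the alert endpoint are my assumptions, not taken from your setup:

```java
import java.sql.Timestamp;
import java.util.List;
import java.util.Map;

import org.apache.camel.Exchange;
import org.apache.camel.Processor;
import org.apache.camel.builder.RouteBuilder;

public class IdleWatchRouteBuilder extends RouteBuilder {

    private static final long MAX_IDLE_MS = 3 * 60 * 60 * 1000L; // 3 hours

    @Override
    public void configure() throws Exception {
        // Poll every 10 minutes; "my_table", "created_at" and "myDataSource" are placeholders.
        from("timer:idleWatch?period=600000")
            .to("sql:select max(created_at) as last_seen from my_table?dataSource=#myDataSource")
            .process(new Processor() {
                @Override
                public void process(Exchange exchange) throws Exception {
                    // The SQL component returns a List of row maps for a SELECT.
                    List<?> rows = exchange.getIn().getBody(List.class);
                    Timestamp lastSeen = rows.isEmpty()
                            ? null
                            : (Timestamp) ((Map<?, ?>) rows.get(0)).get("last_seen"); // key case may vary per DB
                    boolean idle = lastSeen == null
                            || System.currentTimeMillis() - lastSeen.getTime() > MAX_IDLE_MS;
                    exchange.getIn().setHeader("idle", idle);
                }
            })
            .filter(header("idle").isEqualTo(true))
            .to("log:idleAlert?level=WARN"); // or send a JMS/mail notification instead
    }
}
```

The route only does something when the newest row is older than the threshold; otherwise each poll is a no-op.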