Dynamically consume and sink Kafka topics with Flink - amazon-s3

I haven't been able to find much information about this online. I'm wondering if it's possible to build a Flink app that can dynamically consume all topics matching a regex pattern and sink those topics to S3. Also, each dynamically synced topic would carry Avro messages, and the Flink app would use Confluent's Schema Registry.

You're in luck! Flink 1.4 was released just a few days ago, and it is the first version that supports consuming Kafka topics via a regex pattern. According to the Javadocs, here is how you can use it:
FlinkKafkaConsumer011
public FlinkKafkaConsumer011(Pattern subscriptionPattern, DeserializationSchema<T> valueDeserializer, Properties props)
Creates a new Kafka streaming source consumer for Kafka 0.11.x. Use
this constructor to subscribe to multiple topics based on a regular
expression pattern. If partition discovery is enabled (by setting a
non-negative value for
FlinkKafkaConsumerBase.KEY_PARTITION_DISCOVERY_INTERVAL_MILLIS in the
properties), topics with names matching the pattern will also be
subscribed to as they are created on the fly.
Parameters:
subscriptionPattern - The regular expression for a pattern of topic names to subscribe to.
valueDeserializer - The de-/serializer used to convert between Kafka's byte messages and Flink's objects.
props - The properties used to configure the Kafka consumer client, and the ZooKeeper client.
Just note that a running Flink streaming application refreshes topic metadata at the interval specified by the consumer config:
FlinkKafkaConsumerBase.KEY_PARTITION_DISCOVERY_INTERVAL_MILLIS
This means each consumer re-syncs its metadata, including the topic list, at that interval. Partition (and topic) discovery is disabled by default, so you must set this configuration on the Flink consumer to your desired interval; once it is set, you should expect a newly created topic to start being consumed within at most one discovery interval.
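A minimal sketch of how this could look, assuming Flink 1.4 with the flink-connector-kafka-0.11 dependency; the bootstrap servers, group id, topic pattern, and discovery interval are placeholders, and import paths can differ slightly across Flink versions. For Avro with Confluent's Schema Registry you would supply your own DeserializationSchema instead of SimpleStringSchema:

import java.util.Properties;
import java.util.regex.Pattern;

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase;

public class RegexKafkaSource {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties props = new Properties();
        props.setProperty("bootstrap.servers", "kafka:9092");   // placeholder
        props.setProperty("group.id", "flink-regex-consumer");  // placeholder
        // Enable topic/partition discovery so topics created later are picked up.
        props.setProperty(
                FlinkKafkaConsumerBase.KEY_PARTITION_DISCOVERY_INTERVAL_MILLIS, "60000");

        // Subscribe to every topic whose name matches the pattern.
        FlinkKafkaConsumer011<String> consumer = new FlinkKafkaConsumer011<>(
                Pattern.compile("my-topic-prefix\\..*"),   // placeholder pattern
                new SimpleStringSchema(),
                props);

        DataStream<String> stream = env.addSource(consumer);
        stream.print();

        env.execute("regex-kafka-consumer");
    }
}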

Subscribing to Kafka topics with a regex pattern was added in Flink 1.4. See the documentation here.
S3 is one of the file systems supported by Flink. For reliable, exactly-once delivery of a stream into a file system, use the flink-connector-filesystem connector.
You can configure Flink to use Avro, but I'm not sure what the status is of interop with Confluent's schema registry.
For searching on these and other topics, I recommend the search on the Flink doc page. For example: https://ci.apache.org/projects/flink/flink-docs-release-1.4/search-results.html?q=schema+registry
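To illustrate the file-system side, here is a rough sketch using the BucketingSink from flink-connector-filesystem, assuming Flink 1.4 and that an S3 file system is already configured for Flink as described in the docs; the bucket path, bucket format, and roll size are placeholders:

import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.connectors.fs.bucketing.BucketingSink;
import org.apache.flink.streaming.connectors.fs.bucketing.DateTimeBucketer;

public class S3SinkSketch {
    // Attaches a bucketing file sink to an existing stream of strings.
    public static void attachS3Sink(DataStream<String> stream) {
        BucketingSink<String> sink = new BucketingSink<>("s3://my-bucket/flink-output"); // placeholder
        sink.setBucketer(new DateTimeBucketer<String>("yyyy-MM-dd--HH")); // one bucket per hour
        sink.setBatchSize(1024 * 1024 * 128);                             // roll files at ~128 MB
        stream.addSink(sink);
    }
}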

Related

Kafka Streams Fault Tolerance with Offset Management in Parallel

Description:
I have a Kafka Streams application that consumes from one topic.
The events arrive at high volume.
The Kafka Streams topology consumes the events in a terminal operation, batches them into groups of, say, 1000 events, and writes each batch to AWS S3.
I have threads writing to S3 in parallel after consuming events from the Kafka topic.
Not using kafka-connector-s3 due to some business application logic and processing.
Problem:
I want the application to be fault-tolerant; I don't want to lose messages.
--> CRASH SCENARIO
Suppose the application has 10 threads, all running and trying to put events into S3, when a crash happens. Kafka Streams sets enable.auto.commit = false and does not let me commit offsets manually, yet all of the threads have already consumed messages from the topic.
In this case Kafka Streams may have already committed the offsets after reading even though the events were never written to S3.
I need a mechanism that tells me the last offset up to which events were successfully written to the S3 file.
In a crash scenario, how should I deal with this, and how do I manage Kafka offsets in Kafka Streams when I am using, say, 10 threads? What if some threads fail to write to S3 and others succeed? How do I track which offsets were successfully processed to S3, and in what order?
Let me know if I am not clear to describe my problem statement.
Thanks!
I can assure you that enable.auto.commit is set to false in Kafka Streams. The Javadocs at https://kafka.apache.org/26/javadoc/org/apache/kafka/streams/StreamsConfig.html state
"enable.auto.commit" (false) - Streams client will always disable/turn off auto committing
You are right that Kafka Streams will automatically commit in more or less regular intervals. However, Kafka Streams waits until records are processed before committing the corresponding offsets. That means you would at least get at-least-once guarantees and not lose messages.
As far as I understand your application, your terminal processor does not block until the records are sent to S3. That means, Kafka Streams cannot know when the sending is completed. Kafka Streams just sees that the terminal processor completed its processing and then -- if the commit interval elapsed -- it commits the offsets.
You say
Not using kafka-connector-s3 due to some business application logic and processing.
Could you put the business logic in the Kafka Streams application, write the results to a Kafka topic with the to() operator, and then use kafka-connector-s3 to send the messages from that topic to S3?
I am not a connect expert, but I guess that would make sure that messages are not lost and would make your implementation simpler.
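A minimal sketch of that suggestion; the topic names, application id, broker address, and the mapValues transformation are placeholders for your actual business logic:

import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class BusinessLogicToTopic {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "s3-preprocessor");   // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");     // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> source = builder.stream("input-topic");      // placeholder

        source
            .mapValues(value -> applyBusinessLogic(value)) // your processing lives here, not the S3 write
            .to("processed-for-s3");                       // the S3 sink connector reads from this topic

        new KafkaStreams(builder.build(), props).start();
    }

    private static String applyBusinessLogic(String value) {
        return value; // placeholder for the real transformation
    }
}

With this split, offset management stays inside Kafka Streams and the S3 delivery guarantees become the sink connector's responsibility.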
Using Kafka Streams, you could aggregate, say, 5000 messages from the source topic into one big message and send that to another topic, e.g. middle_topic. You then need another processor that sources from middle_topic and sinks to S3 using the S3 connector.

Is there any way to read messages from Kafka topic without consumer?

Just for testing purposes, I want to automate a scenario where I need to check the content of Kafka messages, so I wanted to know whether it is possible to read messages directly from a TOPIC, without consumers, using the Kafka Java libraries?
I'm new to Kafka so any suggestion will be good for me.
Thanks in advance!
You could SSH to the broker in question and dump the log segments in a deserialized fashion, but it would take less time to simply use a consumer in any language, not necessarily Java.
"For testing purposes" Kafka Java API provides MockProducer and MockConsumer, which are backed by Lists, not a full broker

API or other queryable source for getting total NiFi queued data

Is there an API endpoint, or any other queryable source, where I can get the total queued data?
Setting up a little dataflow in NiFi to monitor NiFi itself sounds sketchy, but if it's common practice, so be it. Anyway, I cannot find the API endpoint that returns that total.
Note: I have a single NiFi instance; I don't have, nor will I implement, S2S reporting since I am on a single-instance, single-node NiFi setup.
The Site-to-Site Reporting tasks were developed because they work for clustered, standalone, and multiple instances thereof. You'd just need to put an Input Port on your canvas and have the reporting task send to that.
An alternative as of NiFi 1.10.0 (via NIFI-6780) is to get the nifi-sql-reporting-nar and use QueryNiFiReportingTask, which lets you run a SQL query to get the metrics you want. It uses a RecordSinkService controller service to determine how to send the results; there are various implementations such as Site-to-Site, Kafka, Database, etc. The NAR is not included in the standard NiFi distribution due to size constraints, but you can get the latest version (1.11.4) here, or change the URL to match your NiFi version.
#jonayreyes You can find information about how to get queue data from the NiFi API here:
NiFi Rest API - FlowFile Count Monitoring
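If it helps, here is a hedged Java sketch that simply calls what I believe is the flow status endpoint (/nifi-api/flow/status) and prints the JSON response; the host, port, and lack of authentication are assumptions, and the exact field names for the queued totals may vary by NiFi version, so check your version's REST API docs:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class NiFiQueuedTotals {
    public static void main(String[] args) throws Exception {
        // Placeholder host/port; add authentication if your instance is secured.
        URL url = new URL("http://localhost:8080/nifi-api/flow/status");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("GET");
        conn.setRequestProperty("Accept", "application/json");

        try (BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            StringBuilder body = new StringBuilder();
            String line;
            while ((line = in.readLine()) != null) {
                body.append(line);
            }
            // The controllerStatus section of this JSON should carry the aggregate
            // queued flowfile count and byte totals for the whole instance.
            System.out.println(body);
        }
    }
}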

Using RabbitMQ with Stormcrawler

I want to use RabbitMQ with StormCrawler. I already saw that there is a repository for using RabbitMQ with Storm:
https://github.com/ppat/storm-rabbitmq
How would you use this for the StormCrawler? I would like to use the Producer as well as the consumer.
For the consumer there seems to be some documentation. What about the Producer? Can you just put the config entries in the storm crawler config or would I need to change the source code of the RabbitMQProducer?
You'd want the bolt which sends URLs to RabbitMQ to extend AbstractStatusUpdaterBolt as the super class does a lot of useful things under the bonnet, which means that you would not use the Producer out of the box but will need to write some custom code.
Unless you are certain that there will be no duplicate URLs, you'll need to deduplicate the URLs before sending them to the queues anyway, which could be done e.g. with Redis within your custom status updater.

how to make sub-topics in Kafka

I am trying to represent topics and sub-topics in Kafka.
Example: topic 'Sports', sub-topics 'Football', 'Handball'.
As far as I know, Kafka doesn't support this, so what I am using now are topics like 'Sports_Football', 'Sports_Handball', ...
This is not really practical, because whenever we want the topic 'Sports' with all of its sub-topics we need to query all of those topics.
We are also using Redis and Apache Storm. Is there a better way of doing this?
You are correct. There is no such thing as a "subtopic" in Kafka, however, consuming all topics that begin with the word 'Sports' is trivial. Assuming you're using Java, once you have initialized a consumer use the method consumer.subscribe(Pattern.compile("^Sports_.+")) to subscribe to your "subtopics." Calling consumer.poll(timeout) will now read from all topics beginning with 'Sports_'.
The only downside to doing it this way is that the consumer will have to resubscribe when new 'Sports_' topics are added.
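A minimal, self-contained version of that suggestion; the broker address, group id, and string deserializers are assumptions:

import java.time.Duration;
import java.util.Properties;
import java.util.regex.Pattern;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class SportsSubtopicConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "sports-consumer");         // placeholder
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Subscribe to every existing and future topic starting with "Sports_".
            consumer.subscribe(Pattern.compile("^Sports_.+"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                records.forEach(r -> System.out.println(r.topic() + ": " + r.value()));
            }
        }
    }
}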
Apache Kafka doesn't support it, you are right. But Kafka does support message partitioning, and it guarantees that all messages with the same key go to the same partition.
You can consume all partitions, or only a single one, so you can basically set a different key for each sport in order to separate the messages.
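For instance, a producer could use the sport name as the message key so that each sport's events stay together in one partition; the topic name, broker address, and payloads below are made up:

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class KeyedSportsProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Same key => same partition, so each sport stays together.
            producer.send(new ProducerRecord<>("Sports", "Football", "goal scored"));
            producer.send(new ProducerRecord<>("Sports", "Handball", "match started"));
        }
    }
}

Note that the key only guarantees that all messages for one sport land in the same partition; two different sports may still hash to the same partition unless you also use a custom partitioner.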
There is also the option of using Redis Streams: with the kafka-redis-connector you can push data into Redis Streams, but weigh the benefits and drawbacks of Redis Streams first.
Another interesting solution is Kafka Streams, which lets you create something like sub-topics:
Broker (Sports) ==> Sports_Stream (Football, Handball) ==> a consumer can either receive the whole topic from the broker or receive a sub-topic from the stream.
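A rough sketch of that Kafka Streams idea, splitting one 'Sports' topic into per-sport topics that act as sub-topics; the topic names, application id, broker address, and the assumption that records are keyed by sport are all placeholders:

import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class SportsSplitter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "sports-splitter");    // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> sports = builder.stream("Sports");

        // Route each record to a per-sport "sub-topic" based on its key.
        sports.filter((sport, event) -> "Football".equals(sport)).to("Sports_Football");
        sports.filter((sport, event) -> "Handball".equals(sport)).to("Sports_Handball");

        new KafkaStreams(builder.build(), props).start();
    }
}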