Kafka To S3 Connector - amazon-s3

Let's assume we are using the Kafka S3 Sink Connector in standalone mode.
As stated on the Confluent page, it has an exactly-once delivery guarantee.
I don't understand how this works.
For example: at some point in time the connector wrote messages to S3, but crashed before it managed to commit the offsets back to Kafka.
The next time it starts up, will it process the previous messages again?
Or does it use transactions internally?
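For illustration, here is a rough sketch of my mental model (not the connector's actual code; the bucket name, key layout, and boto3 usage are my own assumptions): if the object key is derived deterministically from the topic, partition, and starting offset, then re-processing the same batch after a crash would just overwrite the same S3 object, so no duplicates would appear.

    # Sketch of idempotent S3 writes keyed by the batch's starting offset.
    # Not the connector's real code; bucket name and key layout are assumptions.
    import json
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "my-bucket"  # placeholder bucket name

    def flush_batch(topic, partition, records):
        """records: list of (offset, value) tuples, ordered by offset."""
        start_offset = records[0][0]
        # Deterministic key: the same batch always maps to the same key, so a
        # retry after a crash overwrites the earlier object instead of
        # producing a duplicate.
        key = f"topics/{topic}/partition={partition}/{topic}+{partition}+{start_offset:010d}.json"
        body = "\n".join(json.dumps(value) for _, value in records)
        s3.put_object(Bucket=BUCKET, Key=key, Body=body.encode("utf-8"))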

Related

Send data from kafka to s3 using python

For my current project I am working with Kafka (Python) and wanted to know if there is any method by which I can send the streaming Kafka data to an AWS S3 bucket (without using Confluent). I am getting my source data from the Reddit API.
I also wanted to know whether Kafka + S3 is a good combination for storing data that will later be processed with PySpark, or whether I should skip the S3 step and read the data directly from Kafka.
The Kafka S3 Connector doesn't require "using Confluent". It's completely free, open source, and works with any Apache Kafka cluster.
Otherwise, sure, Spark or a plain Kafka Python consumer can write events to S3, but you haven't clearly explained what happens once the data is in S3, so maybe start by processing the data directly from Kafka.
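If you do want the plain-consumer route, a minimal sketch could look like this (assuming kafka-python and boto3; the topic and bucket names are placeholders, and error handling is left out):

    # Sketch: consume from Kafka and periodically upload batches to S3.
    # Requires `pip install kafka-python boto3` and valid AWS credentials.
    import json
    import time
    import boto3
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "reddit-events",                      # placeholder topic name
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
        enable_auto_commit=False,
    )
    s3 = boto3.client("s3")

    batch = []
    for message in consumer:
        batch.append(message.value)
        if len(batch) >= 1000:                # flush every 1000 records
            key = f"reddit/{int(time.time())}.json"
            s3.put_object(
                Bucket="my-data-lake",        # placeholder bucket name
                Key=key,
                Body="\n".join(json.dumps(r) for r in batch).encode("utf-8"),
            )
            consumer.commit()                 # commit only after the upload succeeds
            batch = []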

Get event/message on Kafka when new file on S3

I'm quite new to AWS and also new to Kafka (using the Confluent platform and .NET).
We will receive large files (~1-40+ MB) in our S3 bucket, and the consuming side should process these files. We will have all our messaging over Kafka.
I've read that you should not send large files over Kafka, but maybe I'm misinformed here?
If we instead want to just get an event that a new file has arrived in our S3 bucket (and of course some kind of reference to it), how would we go about that?
You can receive notifications about events that happen in your S3 bucket, such as when a new object is created or deleted.
From the S3 documentation (as of writing this), the following destinations are supported:
Simple Notification Service (SNS)
Simple Queue Service (SQS)
AWS Lambda function
For instance, you can choose SQS as your S3 notification destination and use the Kafka SQS Source Connector to stream the events to Kafka.
Then you can write Kafka consumer applications that react to these events.
And yes, it is not recommended to send large files over Kafka. Just send pointers to them and let the consumer application fetch the content using those pointers. If your consumer needs to fetch S3 objects, configure it to use the S3 SDK.
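As a rough illustration of the "send pointers" idea (shown in Python for brevity, even though you mention .NET): a consumer reads the notification events from Kafka and fetches each referenced object with the S3 SDK. The topic name below is a placeholder, and the exact message shape depends on how the source connector wraps the SQS body.

    # Sketch: react to S3 "object created" events streamed into Kafka
    # (e.g. via the SQS source connector) and download each referenced object.
    # Assumes kafka-python and boto3.
    import json
    import boto3
    from kafka import KafkaConsumer

    s3 = boto3.client("s3")
    consumer = KafkaConsumer(
        "s3-events",                          # placeholder topic fed by the source connector
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )

    for message in consumer:
        # S3 notification events carry a "Records" array with bucket and key info.
        for record in message.value.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            s3.download_file(bucket, key, f"/tmp/{key.replace('/', '_')}")
            # ...hand the local file off to your processing logic here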
Useful resources:
Enabling event notifications in S3
S3 Notification Event Structure (JSON) with examples
Kafka SQS Source Connector

Kafka HBase Sink Connector unable to deliver its messages to HBase

I have a particular Kafka HBase Sink Connector problem for which I would appreciate any advice or suggestions.
It is a 3-node Kafka cluster: 2 nodes for connect-distributed and 1 node for Schema Registry + Kafka Streams. The Kafka version is 0.10.1 and it is part of the Hortonworks 2.6.3 platform. There are also SSL and Kerberos authentication settings. On top of it I have a custom Kafka application that receives messages, processes them via Kafka Streams, and delivers them to HBase.
The process model is:
1) Input topic;
2) Processing (in Kafka Streams);
3) Output topic;
4) HBase sink connector;
5) HBase.
The messages delivered to 1) are successfully transferred and processed up to and including step 3). Then, although the sink connector appears to work fine, no message is delivered to HBase.
That being said, I tested our custom application model with unit tests that create an embedded Kafka cluster with its own basic settings, and the tests were successful. This quite likely indicates that the connectivity problem comes from some cluster setting(s).
For your information, I observed 3 specific things:
The standard console consumer is able to successfully consume the messages from the sink topic;
There is no consumer id established for the sink connection;
The connection process starts successfully but stops for reasons that are not logged and never reaches the WorkerSinkTask Java class, where the actual writing to HBase happens.
An additional important point is the whole SSL encryption and Kerberos authentication setup, which might be misconfigured.
If anyone has faced such a case, I would greatly appreciate any comments that could help.
Dimitar

How to properly restart a kafka s3 sink connect?

I started a Kafka S3 sink connector (the bundled connector from the Confluent package) on 1 May. It worked fine until 8 May. Checking the status, it showed that an AWS exception had crashed the connector. This should not be a big problem, so I wanted to restore it.
I tried the following steps:
I POSTed /connectors/s3sink/restart. Then I saw the connector in RUNNING state, but the task still FAILED.
Then I PUT /connectors/s3sink/task/0/restart. OK, now the task is in RUNNING state.
But when I tailed the log, I found it had started to rewrite old data, such as the data from 3 May. And it messed up the old data!
So, does the Connect restart REST API reset the offsets? I thought it would keep the offsets and just resume from where it failed.
And how do I restart a failed connector task correctly? By deleting the pods (we use Kubernetes), or via REST /task/0/restart? When should I use /connectors/s3sink/restart?
/connectors/:name/restart is a rolling restart operation on the worker leader that needs to propagate to all worker tasks asynchronously. So you need to ensure network connectivity between the leader worker and all the others.
/connectors/:name/tasks/:num/restart sends the request straight to the worker running that task, restarting its thread.
A restart should not reset the offsets, since they are stored in the consumer offsets topic for that Connect cluster. If anything, the tasks were not able to commit offsets back to the __consumer_offsets topic, but you should see logs for that.
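To make the restart flow concrete, here is a small sketch against the Connect REST API (the worker URL and connector name are placeholders; it assumes the standard status and restart endpoints):

    # Sketch: check a connector's status and restart any failed tasks
    # via the Kafka Connect REST API, using the `requests` library.
    import requests

    CONNECT_URL = "http://localhost:8083"     # placeholder worker URL
    CONNECTOR = "s3sink"

    status = requests.get(f"{CONNECT_URL}/connectors/{CONNECTOR}/status").json()
    print("connector state:", status["connector"]["state"])

    for task in status["tasks"]:
        if task["state"] == "FAILED":
            # Restarting a task does not touch the committed offsets;
            # it just spins the task thread up again on its worker.
            requests.post(
                f"{CONNECT_URL}/connectors/{CONNECTOR}/tasks/{task['id']}/restart"
            ).raise_for_status()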

S3 connectors to connect with Kafka for streaming data from on-premise to cloud

I want to stream data from on-premise to the cloud (S3) using Kafka, for which I would need to install Kafka on the source machine and also in the cloud. But I don't want to install it in the cloud. I need some S3 connector through which I can connect to Kafka and stream data from on-premise to the cloud.
If your data is in Avro or JSON format (or can be converted to those formats), you can use the S3 connector for Kafka Connect. See Confluent's docs on that.
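For a rough idea of what that looks like, here is a sketch that submits an S3 sink configuration to a Kafka Connect worker through its REST API (the worker URL, topic, and bucket are placeholders, and the exact property names can vary between connector versions, so check the Confluent docs):

    # Sketch: register the Confluent S3 sink connector on a Connect worker.
    # Worker URL, topic, and bucket name are placeholders.
    import requests

    connector = {
        "name": "s3-sink",
        "config": {
            "connector.class": "io.confluent.connect.s3.S3SinkConnector",
            "topics": "my-onprem-topic",
            "s3.bucket.name": "my-bucket",
            "s3.region": "us-east-1",
            "storage.class": "io.confluent.connect.s3.storage.S3Storage",
            "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
            "flush.size": "1000",
            "tasks.max": "1",
        },
    }

    resp = requests.post("http://localhost:8083/connectors", json=connector)
    resp.raise_for_status()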
Should you want to move actual (bigger) files via Kafka, be aware that Kafka is designed for small messages and not for file transfers.
There is also a kafka-connect-s3 project from Spreadfast, consisting of both a sink and a source connector, which can handle text format. Unfortunately it is not really updated, but it works nevertheless.