I'm quite new to AWS and also new to Kafka (using the Confluent platform and .NET).
We will receive large files (~1-40+ MB) in our S3 bucket, and the consuming side should process these files. All of our messaging will go over Kafka.
I've read that you should not send large files over Kafka, but maybe I'm misinformed here?
If we instead want to just get an event that a new file has arrived in our S3 bucket (along with some kind of reference to it), how would we go about it?
You can receive notifications about events that happen in your S3 bucket, such as when an object is created or deleted.
From the S3 documentation (as of writing this), the following destinations are supported:
Simple Notification Service (SNS)
Simple Queue Service (SQS)
AWS Lambda function
For instance, you can choose SQS as your S3 notification destination and use the Kafka SQS Source Connector to stream the events into Kafka.
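To make that concrete, here is a minimal sketch (in Python) of registering such a connector through the Kafka Connect REST API. The connector class and property names below are assumptions based on Confluent's SQS source connector docs, so verify them against the current documentation; the queue URL, topic, and connector name are placeholders.

    import json
    import requests  # assumes the Connect worker's REST API is reachable on localhost:8083

    connector = {
        "name": "s3-events-sqs-source",  # hypothetical connector name
        "config": {
            # Class/property names per Confluent's SQS source docs -- verify for your version
            "connector.class": "io.confluent.connect.sqs.source.SqsSourceConnector",
            "tasks.max": "1",
            "sqs.url": "https://sqs.eu-west-1.amazonaws.com/123456789012/s3-events",  # placeholder
            "kafka.topic": "s3-events",  # topic the S3 notifications will land on
        },
    }

    resp = requests.post("http://localhost:8083/connectors",
                         headers={"Content-Type": "application/json"},
                         data=json.dumps(connector))
    resp.raise_for_status()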
Then you can write Kafka consumer applications that react to these events.
And yes, it is not recommended to send large files over Kafka. Just send pointers to them and let the consumer application fetch the data using those pointers. If your consumer needs to fetch S3 objects, configure it to use the S3 SDK.
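For illustration, a minimal sketch of that pointer pattern in Python (the question mentions .NET, but the same idea applies with Confluent's .NET client and the AWS SDK for .NET): the consumer reads an S3 notification event from Kafka, extracts the bucket/key pointer, and fetches the object with boto3. The topic name is a placeholder, and process_file is a hypothetical stand-in for your processing logic.

    import json
    from urllib.parse import unquote_plus

    import boto3
    from confluent_kafka import Consumer  # pip install confluent-kafka boto3

    s3 = boto3.client("s3")
    consumer = Consumer({
        "bootstrap.servers": "localhost:9092",  # placeholder
        "group.id": "file-processor",
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(["s3-events"])  # hypothetical topic fed by the SQS source connector

    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        event = json.loads(msg.value())
        # S3 notification events carry one or more Records with bucket/key pointers
        for record in event.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = unquote_plus(record["s3"]["object"]["key"])  # keys arrive URL-encoded
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
            process_file(body)  # hypothetical: your processing logic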
Useful resources:
Enabling event notifications in S3
S3 Notification Event Structure (JSON) with examples
Kafka SQS Source Connector
I'm pretty new to S3. I'm trying to create a bucket and receive notifications on ObjectCreated events using code only (not the AWS Management Console).
I'm writing in .NET, so I'm using the AWSSDK.Core NuGet package.
So far I've managed to create a bucket using the SDK.
It seems like a trivial task, though I couldn't find any references around the web for accomplishing it.
Also, the object storage is S3-compatible, not AWS S3.
I tried configuring an SNS topic, but it seems that in order to enable notifications, the API requires SQS as the queueing service, not RabbitMQ.
I did see another approach: configuring a Lambda function that forwards messages to RabbitMQ, but I couldn't find references or documentation for that either.
Any help is appreciated :)
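For reference, this is roughly what I'm after, sketched in Python with boto3 (I know my question is about .NET; the AWS SDK for .NET exposes an equivalent PutBucketNotification operation on the S3 client). The endpoint, credentials, and topic ARN are placeholders, and whether an S3-compatible store honors this call depends on the implementation:

    import boto3

    # Point the client at the S3-compatible endpoint (placeholder URL and credentials)
    s3 = boto3.client(
        "s3",
        endpoint_url="https://objectstore.example.com",  # hypothetical endpoint
        aws_access_key_id="ACCESS_KEY",
        aws_secret_access_key="SECRET_KEY",
    )

    s3.create_bucket(Bucket="my-bucket")

    # Ask the bucket to publish ObjectCreated events to an SNS topic.
    # Which destinations (SNS/SQS/Lambda) are supported, if any, depends on
    # the particular S3-compatible implementation.
    s3.put_bucket_notification_configuration(
        Bucket="my-bucket",
        NotificationConfiguration={
            "TopicConfigurations": [{
                "TopicArn": "arn:aws:sns:eu-west-1:123456789012:uploads",  # placeholder
                "Events": ["s3:ObjectCreated:*"],
            }]
        },
    )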
For my current project, I am working with Kafka (Python) and wanted to know if there is any way to send streaming Kafka data to an AWS S3 bucket (without using Confluent). I am getting my source data from the Reddit API.
I also wanted to know whether Kafka + S3 is a good combination for storing data that will later be processed using PySpark, or whether I should skip the S3 step and read the data directly from Kafka.
The Kafka Connect S3 connector doesn't require "using Confluent". It's completely free, open source, and works with any Apache Kafka cluster.
Otherwise, sure, Spark or a plain Kafka Python consumer can write events to S3, but you haven't clearly explained what happens once the data is in S3, so maybe start by processing the data directly from Kafka.
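As a rough illustration of the plain-consumer option, a sketch with kafka-python and boto3 that batches messages and writes each batch to S3 as one object; the topic, bucket, and batch size are placeholders:

    import boto3
    from kafka import KafkaConsumer  # pip install kafka-python boto3

    s3 = boto3.client("s3")
    consumer = KafkaConsumer(
        "reddit-posts",                      # hypothetical topic fed by your Reddit producer
        bootstrap_servers="localhost:9092",  # placeholder
        value_deserializer=lambda v: v.decode("utf-8"),
    )

    BATCH_SIZE = 1000
    batch, part = [], 0

    for message in consumer:
        batch.append(message.value)
        if len(batch) >= BATCH_SIZE:
            # One newline-delimited object per batch
            key = f"reddit/part-{part:05d}.jsonl"
            s3.put_object(Bucket="my-data-lake", Key=key,  # placeholder bucket
                          Body="\n".join(batch).encode("utf-8"))
            batch, part = [], part + 1

Bear in mind this sketch gives you no delivery guarantees across crashes; the connector handles partitioning, retries, and offset management for you.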
Once or twice a day, some files are uploaded to an S3 bucket. On every S3 upload, I want the in-memory data of each server to be refreshed with the uploaded data.
Note that there are multiple servers running, and I want to store the same data on all of them. The servers also scale with traffic (new servers come up and older ones go down, so the set of instances will not always be the same).
In short, I want to keep up-to-date data in each server's cache.
I want to build an architecture that supports auto-scaling of the servers. I came across the AWS fan-out pattern, using SNS and multiple SQS queues from which different servers can poll.
How can we handle the auto-scaling of the queues with respect to the servers?
Or is there any other way to handle the scenario?
PS: I'm totally new to the AWS environment.
Any reference will be a great help.
To me, there are a few things you need to make this work. These are opinions and, as with most architectural designs, there is certainly more than one way to handle this.
I start with the assumption that you've got an application running on EC2 in some form (Elastic Beanstalk, Fargate, raw EC2s with auto scaling, etc.) and that you've solved installing and configuring the application when a scale-up event occurs.
Conceptually, the flow is: S3 bucket -> SNS topic -> one SQS queue per instance -> your application.
The setup involves having the S3 bucket publish (most likely) s3:ObjectCreated events to the SNS topic. These events will be published whenever an object in the bucket is created or updated.
Next:
During startup, your application will pull the current data from S3.
As part of application startup, create a queue named after the instance ID of the EC2 (see here for some examples). The queue would need to subscribe to the SNS topic. If the queue already exists, that's not an error. (A Python sketch of these steps follows below.)
Your application would have a background thread or process that polls the SQS queue for messages.
If you get a message on the queue, that tells the application to refresh the cache from S3.
When an instance is being shut down, at least Elastic Beanstalk and the load balancers emit an event indicating that the instance will be terminated. Remove the SQS queue tied to the instance at that time.
The only issue might be that a hard crash of an environment would leave orphaned queues behind. It may be advisable to clean these up manually or with a periodic task.
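A hedged sketch of the queue-per-instance steps in Python with boto3 (the boto3 calls exist as shown, but the topic ARN, queue naming, and refresh logic are placeholders; note that for SNS to deliver to SQS you also need a queue policy allowing the topic to send, which is elided here):

    import boto3
    import requests  # used only to read the instance ID from instance metadata

    sqs = boto3.client("sqs")
    sns = boto3.client("sns")

    # Name the queue after this EC2 instance (IMDSv1 shown for brevity)
    instance_id = requests.get(
        "http://169.254.169.254/latest/meta-data/instance-id", timeout=2).text
    queue_url = sqs.create_queue(QueueName=f"cache-refresh-{instance_id}")["QueueUrl"]
    queue_arn = sqs.get_queue_attributes(
        QueueUrl=queue_url, AttributeNames=["QueueArn"])["Attributes"]["QueueArn"]

    # Subscribe the queue to the SNS topic that S3 publishes to (placeholder ARN)
    sns.subscribe(
        TopicArn="arn:aws:sns:eu-west-1:123456789012:s3-uploads",
        Protocol="sqs",
        Endpoint=queue_arn,
    )

    # Background polling loop: any message means "refresh the cache from S3"
    def poll_forever():
        while True:
            resp = sqs.receive_message(QueueUrl=queue_url, WaitTimeSeconds=20)
            for msg in resp.get("Messages", []):
                refresh_cache_from_s3()  # hypothetical: your reload logic
                sqs.delete_message(QueueUrl=queue_url,
                                   ReceiptHandle=msg["ReceiptHandle"])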
Let's assume we are using the Kafka S3 Sink Connector in standalone mode.
As stated on the Confluent page, it has an exactly-once delivery guarantee.
I don't understand how that works...
If, for example, the connector wrote messages to S3 at some point but crashed before managing to commit the offsets back to Kafka, won't it process the previous messages again the next time it starts up?
Or does it use transactions internally?
I want to stream data from on-premises to the cloud (S3) using Kafka, for which I would need to install Kafka on the source machine and also in the cloud. But I don't want to install it in the cloud. I need some S3 connector through which I can connect to Kafka and stream data from on-premises to the cloud.
If your data is in Avro or JSON format (or can be converted to those formats), you can use the S3 sink connector for Kafka Connect. See Confluent's docs on that.
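For example, a minimal sketch (Python) of registering the S3 sink through the Kafka Connect REST API; the property names follow Confluent's S3 sink documentation but should be verified against your connector version, and the topic, bucket, and region are placeholders:

    import json
    import requests  # Connect worker REST API assumed on localhost:8083

    connector = {
        "name": "onprem-to-s3-sink",  # hypothetical name
        "config": {
            "connector.class": "io.confluent.connect.s3.S3SinkConnector",
            "tasks.max": "1",
            "topics": "source-data",        # placeholder topic
            "s3.bucket.name": "my-bucket",  # placeholder bucket
            "s3.region": "eu-west-1",
            "storage.class": "io.confluent.connect.s3.storage.S3Storage",
            "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
            "flush.size": "1000",  # records per S3 object
        },
    }

    resp = requests.post("http://localhost:8083/connectors",
                         headers={"Content-Type": "application/json"},
                         data=json.dumps(connector))
    resp.raise_for_status()

Kafka Connect itself runs on-premises next to your source machine here; only the data goes to S3, so nothing Kafka-related has to be installed in the cloud.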
Should you want to move actual (bigger) files via Kafka, be aware that Kafka is designed for small messages and not for file transfers.
There is also a kafka-connect-s3 project from Spredfast, consisting of both a sink and a source connector, which can handle text formats. Unfortunately it is not really maintained anymore, but it works nevertheless.