Celery with Amazon SQS and S3 events

I would like to use Celery to consume S3 events as delivered by Amazon on SQS. However, the S3 message format does not match what Celery expects.
How can I consume these messages with minimal hackiness? Should I write a custom serializer? Should I give up and make a custom bridge using boto or boto3?
As a side note, I also want to connect Celery to a different broker (RabbitMQ) for the rest of the application's messaging, if that matters.

You're going to need to create a service that listens for the S3 notifications and then runs the appropriate Celery task.
You have a variety of options - the S3 notifications go out via SQS, SNS or AWS Lambda.
In fact, the simplest option may be not to use Celery at all and simply write some code to run in AWS Lambda. I haven't used this service (Lambda is relatively new), but it looks like it would mean you don't have to run, for example, a monitoring service or Celery workers.

Configure the AWS S3 event to invoke an AWS Lambda function. The function should be written to transform the S3 event message into the Celery message format, then publish the Celery message to SQS. Celery would pick up the message from SQS.
S3 Event -> Lambda -> SQS -> Celery
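A minimal sketch of such a Lambda handler, assuming the Celery workers consume from an SQS queue and that a task named tasks.process_s3_object exists (the queue name and the task name are hypothetical):

# Hypothetical Lambda handler; queue name, region and task name are placeholders.
import os
from urllib.parse import unquote_plus

from celery import Celery

# Point Celery at the SQS queue the workers consume from.
# With no credentials in the URL, the Lambda IAM role is used.
app = Celery(broker="sqs://")
app.conf.broker_transport_options = {"region": os.environ.get("AWS_REGION", "us-west-2")}
app.conf.task_default_queue = "celery-s3-events"

def handler(event, context):
    # One S3 notification can contain several records.
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])  # keys arrive URL-encoded
        # Publishes a Celery-protocol message; a worker with a task
        # registered under this name will pick it up.
        app.send_task("tasks.process_s3_object", args=[bucket, key])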

For my specific use case, it turned out to be easiest to create a bridge worker that polls SQS and gives tasks to Celery with the default broker.
This was not hard to do (although boto and SQS could use more documentation), and since Celery is not well suited to connecting to two different brokers at once, it feels like the best way to do it.
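A rough sketch of that bridge worker, using boto3 (the original used boto), a placeholder queue name, and a hypothetical process_s3_object task registered with the RabbitMQ-backed Celery app:

import json
from urllib.parse import unquote_plus

import boto3
from myapp.tasks import process_s3_object  # hypothetical task module

sqs = boto3.resource("sqs", region_name="us-west-2")
queue = sqs.get_queue_by_name(QueueName="s3-events")  # placeholder queue name

while True:
    # Long poll so we don't hammer the SQS API.
    for message in queue.receive_messages(WaitTimeSeconds=20, MaxNumberOfMessages=10):
        notification = json.loads(message.body)
        for record in notification.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = unquote_plus(record["s3"]["object"]["key"])
            # Hand the work to Celery over the default (RabbitMQ) broker.
            process_s3_object.delay(bucket, key)
        # Only delete once the tasks have been queued.
        message.delete()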

The notification message that Amazon S3 sends to publish an event is in JSON format, so you can configure Celery to accept and serialize JSON.
Below is an extract from my config file (using Django).
# AWS Credentials
AWS_ACCESS_KEY_ID = os.environ.get('AWS_ACCESS_KEY_ID')
AWS_SECRET_ACCESS_KEY = os.environ.get('AWS_SECRET_ACCESS_KEY')

# Celery
# Credentials containing special characters must be URL-encoded.
BROKER_URL = "sqs://%s:%s@" % (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
CELERY_ACCEPT_CONTENT = ['application/json']
CELERY_RESULT_SERIALIZER = 'json'
CELERY_TASK_SERIALIZER = 'json'
CELERY_DEFAULT_QUEUE = '<queue_name>'
CELERY_RESULT_BACKEND = None  # Disabling the results backend
BROKER_TRANSPORT_OPTIONS = {
    'region': 'us-west-2',
    'polling_interval': 20,
}

Related

Get event/message on Kafka when new file on S3

I'm quite new to AWS and also new to Kafka (using the Confluent platform and .NET).
We will receive large files (~1-40+ MB) in our S3 bucket, and the consuming side should process these files. We will have all our messaging over Kafka.
I've read that you should not send large files over Kafka, but maybe I'm misinformed here?
If we instead want to just get an event that a new file has arrived in our S3 bucket (and of course some kind of reference to it), how would we go about that?
You can receive notifications about events that happen in your S3 bucket, such as when a new object is created or deleted.
From the S3 documentation (as of writing this), the following destinations are supported:
Simple Notification Service (SNS)
Simple Queue Service (SQS)
AWS Lambda function
For instance, you can choose SQS as your S3 notification destination and use Kafka SQS Source Connector to stream the events to Kafka.
Then you can write Kafka consumer applications that react to these events.
And yes, it is not recommended to send large files over Kafka. Just send pointers to them and let the consumer application fetch the content using those pointers; if your consumer needs to fetch S3 objects, have it use the S3 SDKs (a sketch follows below).
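As a rough sketch of such a consumer in Python (the .NET client is analogous), assuming the topic carries the raw S3 event JSON and using placeholder broker and topic names:

import json
from urllib.parse import unquote_plus

import boto3
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",   # placeholder
    "group.id": "s3-file-processor",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["s3-events"])            # placeholder topic
s3 = boto3.client("s3")

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])
        # The Kafka message only carries a pointer; fetch the payload from S3.
        obj = s3.get_object(Bucket=bucket, Key=key)
        data = obj["Body"].read()
        print(f"fetched s3://{bucket}/{key}: {len(data)} bytes")  # hand off to real processing here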
Useful resources:
Enabling event notifications in S3
S3 Notification Event Structure (JSON) with examples
Kafka SQS Source Connector

Configure Filebeat to not delete AWS SQS message if the message does not match the file_selectors

Scenario:
AWS S3 is configured to send event notification to SQS queue
Filebeat is using aws-s3-plugin to pull logs from S3 through the SQS queue
filebeat.inputs:
- type: aws-s3
  queue_url: https://sqs.us-east-2.amazonaws.com/aws-id/queue-name
  gzip: true
  file_selectors:
  - regex: '.*-mytype.log.*'
Filebeat deletes all the messages in the queue, whereas it should only process and delete the messages matching the file_selectors regex.
Question: How do I configure Filebeat to not delete an SQS message if the message does not match the file_selectors regex?
P.S.
The use case that I am trying to achieve is
AWS S3 notification -> SQS queue(all type of files)
Multiple filebeat instances configured with the same queue
Filebeat-1: configured file_selectors for file_type_1. filebeat output: Logstash
Filebeat-2: configured file_selectors for file_type_2. filebeat output: Elasticsearch
@sniper I do not think the configuration would work the way you want with a single SQS queue. Each Filebeat input will try to read all messages from the queue and use file_selectors to decide which files to download, but it still reads (and acknowledges) every other message.
A pattern that you can use is fanout using SNS and SQS: S3 publishes events to SNS when a file is uploaded, and multiple SQS queues, one per Filebeat input, are subscribed to the SNS topic (a boto3 sketch follows the link below).
https://docs.aws.amazon.com/sns/latest/dg/sns-sqs-as-subscriber.html
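A rough boto3 sketch of wiring up that fanout (topic and queue names are placeholders; the SQS access policy and the S3-to-SNS notification configuration are omitted):

import boto3

sns = boto3.client("sns", region_name="us-east-2")
sqs = boto3.client("sqs", region_name="us-east-2")

topic_arn = sns.create_topic(Name="s3-upload-events")["TopicArn"]

for name in ["filebeat-type1-queue", "filebeat-type2-queue"]:
    queue_url = sqs.create_queue(QueueName=name)["QueueUrl"]
    queue_arn = sqs.get_queue_attributes(
        QueueUrl=queue_url, AttributeNames=["QueueArn"]
    )["Attributes"]["QueueArn"]
    # Raw message delivery keeps the original S3 notification JSON as the SQS body.
    sns.subscribe(
        TopicArn=topic_arn,
        Protocol="sqs",
        Endpoint=queue_arn,
        Attributes={"RawMessageDelivery": "true"},
    )
    # Each queue also needs an access policy allowing this SNS topic to SendMessage
    # (omitted here), and S3 must be configured to publish its events to the topic.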

How to handle S3 events inside a Kubernetes Cluster?

I have a Kubernetes cluster that runs on AWS EKS,
now I want to handle S3 object creation events in a pod,
like I would with AWS Lambda.
How can I handle S3 events from inside a Kubernetes cluster?
S3 events can only be sent to SQS, SNS or Lambda functions, so any S3 -> K8s integration will have to go through one of those destinations. I can think of a couple of options for triggering K8s Jobs from S3 events (I'm assuming the K8s cluster is EKS):
Option 1:
S3 events are sent to a SQS queue
A K8s Job (or Deployment) polls the SQS queue; its ServiceAccount is mapped to an IAM role which can read from SQS (see the sketch after this list)
Option 2:
S3 events trigger a Lambda function
This Lambda function uses container image which has kubectl/helm installed.
Lambda function's IAM role will be mapped to a K8s Role
Lambda function runs some code/script that, when invoked, creates the following (using kubectl/helm):
a. A ConfigMap with the event information
b. A Job that takes the ConfigMap name/label as a parameter or environment variable
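A minimal sketch of the Option 1 poller (the queue URL is a placeholder); it would run in a pod whose ServiceAccount is mapped, via IRSA, to an IAM role that can receive and delete messages on the queue:

import json

import boto3

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/s3-events"  # placeholder

sqs = boto3.client("sqs")  # credentials come from IRSA inside the pod

while True:
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL, WaitTimeSeconds=20, MaxNumberOfMessages=10
    )
    for msg in resp.get("Messages", []):
        body = json.loads(msg["Body"])
        for record in body.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            print(f"new object s3://{bucket}/{key}")  # replace with real processing
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])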

AWS Batch Logs to splunk

I am using the AWS Batch service for my jobs. I want to send the logs generated by AWS Batch directly to Splunk instead of sending them to CloudWatch. How can I configure the log driver in AWS Batch to achieve this?
Splunk provides three methods to forward logs from a host to the Splunk server:
Splunk Forwarder (agent)
HTTP Event Collector (HEC)
Splunk logging driver for Docker
Of these, the Splunk HTTP Event Collector (HEC) is the easiest and most efficient way to send data to Splunk Enterprise and Splunk Cloud in your scenario. You can send logs through HTTP requests using HEC, and this can be defined in your AWS Batch job definition; Splunk's HEC tutorial walks through the setup.
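As a rough illustration of the HEC approach (host, token, and sourcetype below are placeholders), a job can post its own events with plain HTTP requests:

import requests

HEC_URL = "https://splunk.example.com:8088/services/collector/event"  # placeholder host
HEC_TOKEN = "00000000-0000-0000-0000-000000000000"                    # placeholder token

def send_to_splunk(message):
    # HEC expects an Authorization header of the form "Splunk <token>"
    # and a JSON body with an "event" field.
    requests.post(
        HEC_URL,
        headers={"Authorization": "Splunk " + HEC_TOKEN},
        json={"event": message, "sourcetype": "aws:batch"},
        timeout=5,
    )

send_to_splunk({"job": "example-batch-job", "status": "started"})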
Other than that, you can use the Splunk Docker logging driver, since an AWS Batch job is spawned in an ECS container. For this method, you should define a custom AMI (for the compute environment) that configures the Docker daemon to send all container logs to a particular Splunk server.
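If the ECS agent on the compute environment allows the splunk driver, the log driver can also be set per job definition. A hedged boto3 sketch with placeholder image, URL, token, and index (in practice the token would go in secretOptions rather than inline):

import boto3

batch = boto3.client("batch")

batch.register_job_definition(
    jobDefinitionName="my-job-splunk-logging",
    type="container",
    containerProperties={
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-job:latest",  # placeholder
        "vcpus": 1,
        "memory": 2048,
        "command": ["python", "run_job.py"],
        "logConfiguration": {
            "logDriver": "splunk",
            "options": {
                "splunk-url": "https://splunk.example.com:8088",                 # placeholder
                "splunk-token": "00000000-0000-0000-0000-000000000000",          # placeholder
                "splunk-index": "aws_batch",                                     # placeholder
            },
        },
    },
)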
Alternatively, AWS Batch logs can be sent to CloudWatch and onboarded into Splunk using the Splunk Add-on for AWS or one of the AWS Lambda functions for the HTTP Event Collector.

Can a celery worker/server accept tasks from a non celery producer?

I want to use a comet server written using java nio for sending out live updates. When receiving information I want it to scan the data, and send tasks to worker threads via rabbitmq. Ideally I would like a celery server to sit on the other end of rabbit, managing a pool of worker threads that will handle these tasks.
However, from my understanding, Celery works by sitting on both ends of RabbitMQ; it essentially takes over the roles of producer and consumer by being embedded in both the consumer's and producer's code. Is there a way to set up Celery as I described above? Thanks
Yes, of course!
You can add Custom Message Consumers to a Celery app.
Please refer to Extensions and Bootsteps in the Celery documentation.
Here is a part of example code in the link above:
from celery import Celery
from celery import bootsteps
from kombu import Consumer, Exchange, Queue

my_queue = Queue('custom', Exchange('custom'), 'routing_key')

app = Celery(broker='amqp://')

class MyConsumerStep(bootsteps.ConsumerStep):

    def get_consumers(self, channel):
        return [Consumer(channel,
                         queues=[my_queue],
                         callbacks=[self.handle_message],
                         accept=['json'])]

    def handle_message(self, body, message):
        print('Received message: {0!r}'.format(body))
        message.ack()

app.steps['consumer'].add(MyConsumerStep)
Test it:
python -m celery -A main worker
See also: Using Celery with existing RabbitMQ messages
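For the producing side (the Java NIO server in the question), any AMQP client that publishes JSON to that queue will work. As a rough Python illustration with kombu, using the same hypothetical queue as the example above:

from kombu import Connection, Exchange, Queue

my_queue = Queue('custom', Exchange('custom'), 'routing_key')

# Publish a plain JSON message that the ConsumerStep above will receive.
with Connection('amqp://') as conn:
    producer = conn.Producer(serializer='json')
    producer.publish(
        {'event': 'live-update', 'payload': 123},
        exchange=my_queue.exchange,
        routing_key='routing_key',
        declare=[my_queue],  # make sure the queue exists
    )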
It is not necessary to use Celery to publish messages. You can publish messages to RabbitMQ (or another broker) from your own app and use Celery to consume the tasks.
Celery uses a simple message protocol, so you can implement the client side of it in your application.
If you don't want to implement the client side of the protocol, you can implement a simple HTTP server that accepts requests and makes the appropriate task calls.