How to handle S3 events inside a Kubernetes Cluster? - amazon-s3

I have a Kubernetes cluster that runs on AWS EKS,
now I want to handle S3 object creation events in a pod,
like I would with AWS Lambda.
How can I handle S3 events from inside a Kubernetes cluster?

S3 event notifications can only be sent to SQS, SNS, or Lambda functions, so any S3 -> K8s integration will have to go through one of those destinations. I can think of a couple of options for triggering K8s Jobs from S3 events (I'm assuming the K8s cluster is EKS):
Option 1:
S3 events are sent to an SQS queue
A K8s Job polls the SQS queue -- the Job has a ServiceAccount mapped to an IAM role that can read from SQS (a minimal polling sketch follows the options below)
Option 2:
S3 events trigger a Lambda function
The Lambda function uses a container image that has kubectl/helm installed.
The Lambda function's IAM role is mapped to a K8s Role.
When invoked, the Lambda function runs some code/script that creates the following (using kubectl/helm):
a. A ConfigMap with the event information
b. A Job that takes the ConfigMap name/label as a parameter or environment variable
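For Option 1, a minimal sketch of the polling loop (assuming boto3 is available in the Job's image; the queue URL, region, and handler below are placeholders, and the pod's SQS permissions come from the IRSA-mapped ServiceAccount):
import json
import urllib.parse
import boto3

# Placeholder queue that receives the S3 event notifications; the pod's
# ServiceAccount is assumed to be mapped to an IAM role that can read it.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/s3-events"
sqs = boto3.client("sqs", region_name="us-east-1")

def handle_object(bucket, key):
    # Placeholder for whatever work the Job should do with the new object.
    print(f"New object: s3://{bucket}/{key}")

while True:
    resp = sqs.receive_message(
        QueueUrl=QUEUE_URL,
        MaxNumberOfMessages=10,
        WaitTimeSeconds=20,  # long polling
    )
    for msg in resp.get("Messages", []):
        body = json.loads(msg["Body"])
        # An S3 notification can carry several records; object keys arrive URL-encoded.
        for record in body.get("Records", []):
            handle_object(
                record["s3"]["bucket"]["name"],
                urllib.parse.unquote_plus(record["s3"]["object"]["key"]),
            )
        # Delete only after the records have been processed successfully.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])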

Related

In Amazon EKS - how to view logs which are prior to EKS Fargate node creation and logs while pods are starting up

I'm using Amazon EKS Fargate. I can see container logs using a Fluent Bit sidecar etc., no problem at all. But those logs ONLY show what is happening inside the container AFTER it has started up.
I have fully enabled AWS EKS cluster logging.
Now I would like to see logs in CloudWatch equivalent to the
kubectl describe pod
command.
I have searched the ENTIRE CloudWatch cluster-name log group and am not able to find logs like
"pulling image into container"
"efs not mounted"
etc.
I want to see logs in CloudWatch prior to the actual container creation stage.
Is it possible at all using EKS Fargate?
Thanks a bunch
You can use Container Insights, which collects metrics as performance log events using the embedded metric format. The logs are stored in CloudWatch Logs, and CloudWatch automatically generates several metrics from them which you can view in the CloudWatch console.
In Amazon EKS and Kubernetes, Container Insights uses a containerized version of the CloudWatch agent to discover all of the running containers in a cluster. It then collects performance data at every layer of the performance stack.
https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/Container-Insights-view-metrics.html
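Once Container Insights is enabled, the performance log events can also be queried directly from CloudWatch Logs; a rough boto3 sketch, where the cluster name, region, and filter pattern are assumptions for illustration:
import time
import boto3

logs = boto3.client("logs", region_name="us-east-1")
# Container Insights writes performance log events to this log group
# ("my-cluster" is a placeholder for the actual cluster name).
LOG_GROUP = "/aws/containerinsights/my-cluster/performance"

now = int(time.time())
resp = logs.filter_log_events(
    logGroupName=LOG_GROUP,
    startTime=(now - 3600) * 1000,  # last hour, in milliseconds
    endTime=now * 1000,
    filterPattern='{ $.Type = "Pod" }',  # keep only pod-level records
)
for event in resp["events"]:
    print(event["message"])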

AWS Use assume-role to create EKS cluster

I read that we should use assume-role to create an AWS EKS cluster. From the documentation, I only find the use of the EKS service role, but I don't see how to create a cluster with an assumed role. Am I missing anything?

EMR serverless cannot connect to s3 in another region

I have an EMR Serverless app that cannot connect to an S3 bucket in another region. Is there a workaround for that? Maybe a parameter to set in the job parameters or Spark parameters when submitting a new job.
The error is this:
ExitCode: 1. Last few exceptions: Caused by: java.net.SocketTimeoutException: connect timed out Caused by: com.amazon.ws.emr.hadoop.fs.shaded.org.apache.http.conn.ConnectTimeoutException
In order to connect to an S3 bucket in another region or access external services, the EMR Serverless application needs to be created with a VPC.
This is mentioned on the considerations page:
Without VPC connectivity, a job can access some AWS service endpoints in the same AWS Region. These services include Amazon S3, AWS Glue, Amazon DynamoDB, Amazon CloudWatch, AWS KMS, and AWS Secrets Manager.
Here's an example AWS CLI command to create an application in a VPC - you need to provide a list of Subnet IDs and Security Group IDs. More details can be found in configuring VPC access.
aws emr-serverless create-application \
    --type SPARK \
    --name etl-jobs \
    --release-label "emr-6.6.0" \
    --network-configuration '{
        "subnetIds": ["subnet-01234567890abcdef","subnet-01234567890abcded"],
        "securityGroupIds": ["sg-01234566889aabbcc"]
    }'

AWS Batch Logs to splunk

I am using the AWS Batch service for my job. I want to send the logs generated from AWS Batch directly to Splunk instead of sending them to CloudWatch. How can I configure the log driver in AWS Batch to achieve this?
Splunk provides three methods to forward logs from a host server to the Splunk server:
Splunk Forwarder (agent)
Http Event Collector (HEC)
Splunk logging driver for Docker
Splunk HTTP Event Collector (HEC) is the easiest and most efficient way to send data to Splunk Enterprise and Splunk Cloud in your scenario. You can send logs through HTTP requests using HEC, and this can be defined in your AWS Batch job definition. Tutorial.
Other than that, you can use the Splunk Docker logging driver, since an AWS Batch job is spawned in an ECS container. For this method, you should define a custom AMI (for the compute environment) that configures the Docker daemon to send all the container logs to a particular Splunk server.
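As a rough illustration of the log-driver route, a Batch job definition can declare the Splunk driver directly; in this boto3 sketch the image, HEC endpoint, and token are placeholders (a real token would normally be injected via secretOptions rather than hard-coded):
import boto3

batch = boto3.client("batch", region_name="us-east-1")
batch.register_job_definition(
    jobDefinitionName="my-job-splunk-logs",
    type="container",
    containerProperties={
        "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-job:latest",  # placeholder image
        "vcpus": 1,
        "memory": 2048,
        "command": ["python", "run.py"],
        "logConfiguration": {
            "logDriver": "splunk",
            "options": {
                "splunk-url": "https://splunk.example.com:8088",  # placeholder HEC endpoint
                "splunk-token": "00000000-0000-0000-0000-000000000000",  # placeholder token
                "splunk-source": "aws-batch",
            },
        },
    },
)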
Alternatively, AWS Batch logs can be sent to CloudWatch and then onboarded into Splunk using the Splunk Add-on for AWS or one of the AWS Lambda functions (HTTP Event Collector).

Celery with Amazon SQS and S3 events

I would like to use Celery to consume S3 events as delivered by Amazon on SQS. However, the S3 message format does not match what Celery expects.
How can I consume these messages with minimal hackiness? Should I write a custom serializer? Should I give up and make a custom bridge using boto or boto3?
As a side note, I also want to connect Celery to a different broker (RabbitMQ) for the rest of the application messaging, if that matters.
You're going to need to create a service that listens to the S3 notifications and then runs the appropriate celery task.
You have a variety of options - the S3 notifications go out via SQS, SNS or AWS Lambda.
In fact, the simplest option may be not to use Celery at all and simply write some code to run in AWS Lambda. I haven't used this service (Lambda is relatively new), but it looks like it would mean you don't have to run, e.g., a monitoring service or Celery workers.
Configure the AWS S3 event to invoke an AWS Lambda function. The function should be written to transform the S3 event message into the Celery message format, then publish the Celery message to SQS. Celery would pick up the message from SQS.
S3 Event -> Lambda -> SQS -> Celery
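A sketch of such a Lambda handler, assuming a task named tasks.process_s3_object and an SQS queue that serves as the Celery broker (both names are illustrative); letting Celery's own send_task publish the message guarantees the format the workers expect:
from celery import Celery

# Celery pointed at the SQS broker; the queue name, region, and task name
# are assumptions for this sketch.
app = Celery(broker="sqs://")
app.conf.task_default_queue = "celery-s3-events"
app.conf.broker_transport_options = {"region": "us-west-2"}

def lambda_handler(event, context):
    # One S3 notification can contain several records.
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # send_task builds a properly formatted Celery message and publishes
        # it to SQS, so workers need no custom deserialization.
        app.send_task("tasks.process_s3_object", args=[bucket, key])
    return {"dispatched": len(event.get("Records", []))}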
For my specific use case, it turned out to be easiest to create a bridge worker that polls SQS and gives tasks to Celery with the default broker.
Not hard to do (although boto and SQS could use more documentation), and Celery is not well suited for connecting to two different brokers at once, so it feels like the best way to do it.
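Such a bridge worker is essentially a small SQS polling loop that hands each S3 record to a Celery task on the default broker; a sketch, where the queue URL, region, and the process_s3_object task are illustrative:
import json
import boto3
from myapp.tasks import process_s3_object  # illustrative Celery task on the default broker

QUEUE_URL = "https://sqs.us-west-2.amazonaws.com/123456789012/s3-notifications"  # placeholder
sqs = boto3.client("sqs", region_name="us-west-2")

while True:
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=20)
    for msg in resp.get("Messages", []):
        for record in json.loads(msg["Body"]).get("Records", []):
            # .delay() publishes to the default (RabbitMQ) broker.
            process_s3_object.delay(
                record["s3"]["bucket"]["name"],
                record["s3"]["object"]["key"],
            )
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])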
The notification message that Amazon S3 sends to publish an event is in JSON format.
So you can configure Celery to use the JSON serializer.
Below is an extract from my config file (using Django).
# AWS Credentials
AWS_ACCESS_KEY_ID = os.environ.get('AWS_ACCESS_KEY_ID')
AWS_SECRET_ACCESS_KEY = os.environ.get('AWS_SECRET_ACCESS_KEY')
# Celery
BROKER_URL = "sqs://%s:%s@" % (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
CELERY_ACCEPT_CONTENT = ['application/json']
CELERY_RESULT_SERIALIZER = 'json'
CELERY_TASK_SERIALIZER = 'json'
CELERY_DEFAULT_QUEUE = '<queue_name>'
CELERY_RESULT_BACKEND = None # Disabling the results backend
BROKER_TRANSPORT_OPTIONS = {
    'region': 'us-west-2',
    'polling_interval': 20,
}