How to tag S3 bucket objects using Kafka connect s3 sink connector - amazon-s3

Is there any way we can tag the objects written in S3 buckets through the Kafka Connect S3 sink connector.
I am reading messages from Kafka and writing the avro files in S3 bucket using S3 sink connector. When the files are written in S3 bucket I need to tag the files.

there is an API inside source code on the GitHub called addTags(), but it's now private and is not exposed to the connector client except this small config feature called S3_OBJECT_TAGGING_CONFIG which allows you to add start/end offsets as well as record count to s3 object.
configDef.define(
S3_OBJECT_TAGGING_CONFIG,
Type.BOOLEAN,
S3_OBJECT_TAGGING_DEFAULT,
Importance.LOW,
"Tag S3 objects with start and end offsets, as well as record count.",
group,
++orderInGroup,
Width.LONG,
"S3 Object Tagging"
);
If you want to add other/custom tags then answer is NO you cannot do it right now.
Useful feature would be to take the tags from the predefined part of an input document in Kafka but this is not available right now.

Related

Apache Flink: Reading parquet files from S3 in Data Stream APIs

We have several external jobs producing small (500MiB) parquet objects on S3 partitioned by time. The goal is to create an application that would read those files, join them on a specific key and dump the result into a Kinesis stream or another S3 bucket.
Can it be achieved by just the means of Flink? Can it monitor and load new S3 objects being created and load them into the application?
The newer FileSource class (available in recent Flink versions) supports monitoring a directory for new/modified files. See FileSource.forBulkFileFormat() in particular, for reading Parquet files.
You use the FileSourceBuilder returned by the above method call, and then .monitorContinuously(Duration.ofHours(1)); (or whatever interval makes sense).

Can I map multiple buckets with multiple topics in a single Kafka-Connector S3 sink connector?

Google didn't help me, so I want to ask you. I have a lot of kafka topics, and I want to store the messages of a particular topic in a particular S3 bucket.
Do I need to create an S3 sink connector for each bucket or can I configure all the stuff in a single S3 json file (map a bucket with a topic)?
Thanks for your help! :)
The S3 bucket name is specified per-connector, so you will need to create one connector per bucket.
Note that you can specify multiple topics per connector (topics / topics.regex), so if you wanted multiple topics going to one bucket (but different paths), you could do this with a single connector. But for multiple buckets, you'll need multiple connectors.

aws s3 sync cli ignoring multipart upload config when syncing between buckets

I'm trying to sync a large number of files from one bucket to another, some of the files are up to 2GB in size after using the aws cli's s3 sync command like so
aws s3 sync s3://bucket/folder/folder s3://destination-bucket/folder/folder
and verifying the files that had been transferred it became clear that the large files had lost the metadata that was present on the original file in the original bucket.
This is a "known" issue with larger files where s3 switches to multipart upload to handled the transfer.
This multipart handeling can be configured via the .aws/config file which has been done like so
[default]
s3 =
multipart_threshold = 4500MB
However when again testing the transfer the metadata on the larger files is still not present, it is present on any of the smaller files so it's clear that I'm heating the multipart upload issue.
Given this is an s3 to s3 transfer is the local s3 configuration taken into consideration at all?
As an alternative to this is there a way to just sync the metadata now that all the files have been transferred?
Have also tried doing aws s3 cp with no luck either.
You could use Cross/Same-Region Replication to copy the objects to another Amazon S3 bucket.
However, only newly added objects will copy between the buckets. You can, however, trigger the copy by copying the objects onto themselves. I'd recommend you test this on a separate bucket first, to make sure you don't accidentally lose any of the metadata.
The method suggested seems rather complex: Trigger cross-region replication of pre-existing objects using Amazon S3 inventory, Amazon EMR, and Amazon Athena | AWS Big Data Blog
The final option would be to write your own code to copy the objects, and copy the metadata at the same time.
Or, you could write a script that compares the two buckets to see which objects did not get their correct metadata, and have it just update the metadata on the target object. This actually involves copying the object to itself, while specifying the metadata. This is probably easier than copying ALL objects yourself, since it only needs to 'fix' the ones that didn't get their metadata.
Finally managed to implement a solution for this and took the oportunity to play around with the Serverless framework and Step Functions.
The general flow I went with was:
Step Function triggered using a Cloudwatch Event Rule targetting S3 Events of the type 'CompleteMultipartUpload', as the metadata is only ever missing on S3 objects that had to be transfered using a multipart process
The initial Task on the Step Function checks if all the required MetaData is present on the object that raised the event.
If it is present then the Step Function is finished
If it is not present then the second lambda task is fired which copies all metadata from the source object to the destination object.
This could be achieved without Step Functions however was a good simple exercise to give them a go. The first 'Check Meta' task is actually redundant as the metadata is never present if multipart transfer is used, I was originally also triggering off of PutObject and CopyObject as well which is why I had the Check Meta task.

How can I search the changes made on a `s3` bucket between two timestamp?

I am using s3 bucket to store my data. And I keep pushing data to this bucket every single day. I wonder whether there is feature I can compare the files different in my bucket between two date. I not, is there a way for me to build one via aws cli or sdk?
The reason I want to check this is that I have a s3 bucket and my clients keep pushing data to this bucket. I want to have a look how much data they pushed since the last time I load them. Is there a pattern in aws support this query? Or do I have to create any rules in s3 bucket to analyse it?
Listing from Amazon S3
You can activate Amazon S3 Inventory, which can provide a daily file listing the contents of an Amazon S3 bucket. You could then compare differences between two inventory files.
List it yourself and store it
Alternatively, you could list the contents of a bucket and look for objects dated since the last listing. However, if objects are deleted, you will only know this if you keep a list of objects that were previously in the bucket. It's probably easier to use S3 inventory.
Process it in real-time
Instead of thinking about files in batches, you could configure Amazon S3 Events to trigger something whenever a new file is uploaded to the Amazon S3 bucket. The event can:
Trigger a notification via Amazon Simple Notification Service (SNS), such as an email
Invoke an AWS Lambda function to run some code you provide. For example, the code could process the file and send it somewhere.

S3 bucket does not append new data objects

I'm trying to send all my AWS IoT incoming sensor value messages to the same s3 bucket, but despite turning on versioning in my bucket, the file keeps getting overwritten and showing only the last input sensor value rather then all of them. I'm using "Store messages in an Amazon S3 bucket" direct from the AWS IoT console. Any easy way to solve this problem?
So after further research and speaking with Amazon Dev support you actually cant append records tot he same file in S3 from the IoT console directly. I mentioned this was a feature most IoT developers would want as a default, and he said it would likely be possible soon but not way to do it now. Anyway the simplest workaound I tested is to set up a Kinesis stream with a firehose to a S3 bucket. This will be constrained by an adjustable data size and stream duration but it works well otherwise. It also allows you to insert a Lambda functino for data transform if needed.