Is it possible to copy data from Amazon S3 objects in Requester Pays Buckets using Azure Data Factory V2? - amazon-s3

Azure Data Factory has a connector for Amazon S3 where I can set an AccessKey and SecretKey. After I create the Linked Service and try to use the dataset in a Copy activity, I get a 403 error when I try to get data from a Requester Pays bucket. The Amazon S3 documentation at https://docs.aws.amazon.com/AmazonS3/latest/dev/ObjectsinRequesterPaysBuckets.html
says that you just need to add an additional header to access this data. Is there any way to add this header to the Amazon S3 dataset, or is there another workaround?
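For reference, this is roughly what that extra header amounts to in boto3 terms: the RequestPayer='requester' argument adds the x-amz-request-payer header the documentation describes. A minimal sketch only; the bucket and key names are placeholders.

# Sketch (not Azure Data Factory): reading an object from a Requester Pays
# bucket with boto3. RequestPayer='requester' adds the x-amz-request-payer
# header described in the AWS docs. Bucket and key names are placeholders.
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

response = s3.get_object(
    Bucket="some-requester-pays-bucket",
    Key="path/to/object.csv",
    RequestPayer="requester",  # without this, S3 returns 403 Access Denied
)
data = response["Body"].read()
print(len(data), "bytes downloaded")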

Related

Importing data from S3 to BigQuery via Bigquery Omni (location restrictions)

I am trying to import some data from an S3 bucket into BigQuery, and I ended up seeing the BigQuery Omni option.
However, when I try to connect to the S3 bucket, I am given a set of regions to choose from. In my case, aws-us-east-1 and aws-ap-northeast-2, as in the attached screenshot.
My data in the S3 bucket is in the region eu-west-2.
I am wondering why BigQuery only lets us pick specific regions for S3.
What should I do so that I can query data from an S3 bucket in the region where the data was uploaded?
The S3 service is unusual in that bucket names are globally unique and the bucket's ARN does not contain region information. Buckets are located in specific regions, but they can be accessed from any region.
My best guess here is that the connection location is the S3 API endpoint that BigQuery will connect to when it attempts to get the data. If you don't see eu-west-2 as an option, try using us-east-1. From that endpoint it is always possible to find out the bucket's location and then create an S3 client for the appropriate region.
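To illustrate that last point in boto3 terms (a sketch only; the bucket name is a placeholder):

# Sketch: resolve a bucket's region from any endpoint, then build a client
# pinned to that region. The bucket name is a placeholder.
import boto3

# get_bucket_location works regardless of which region the client uses
resp = boto3.client("s3", region_name="us-east-1").get_bucket_location(
    Bucket="my-bucket-in-eu-west-2"
)
# S3 returns None for us-east-1, otherwise the region name (e.g. "eu-west-2")
region = resp["LocationConstraint"] or "us-east-1"

s3 = boto3.client("s3", region_name=region)
print(f"Bucket lives in {region}")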

AWS S3 folder-wise metrics

We are using Grafana's CloudWatch data source for AWS metrics. We would like to differentiate folders in an S3 bucket by their sizes and show them as graphs. We know that CloudWatch gives bucket-level metrics, not object-level ones. Please let us know if there is any possible solution for monitoring the sizes of the folders in the bucket.
Any suggestion on this is appreciated.
Thanks in advance.
Amazon CloudWatch provides daily storage metrics for Amazon S3 buckets but, as you mention, these metrics are for the whole bucket, rather than folder-level.
Amazon S3 Inventory can provide a daily CSV file listing all objects. You could load this information into a database or use Amazon Athena to query the contents.
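For example, assuming you have defined an Athena table over the inventory files (hypothetically named s3_inventory, with key and size columns), a query along these lines sums storage per top-level prefix; the database name and results location are also placeholders.

# Sketch: sum object sizes per top-level prefix from an S3 Inventory table
# in Athena. The database/table names and output location are assumptions
# and depend on how you set up the inventory table.
import boto3

athena = boto3.client("athena", region_name="us-east-1")

query = """
SELECT split_part(key, '/', 1)        AS folder,
       sum(size) / 1024.0 / 1024.0    AS size_mb
FROM   s3_inventory
GROUP  BY 1
ORDER  BY size_mb DESC
"""

athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={"Database": "my_inventory_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)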
If you require storage metrics at a higher resolution than daily, then you would need to track this information yourself. This could be done with:
An Amazon S3 Event that triggers an AWS Lambda function whenever an object is created or deleted
An AWS Lambda function that receives this information and updates a database (a rough sketch follows this list)
Your application could then retrieve the storage metrics from the database
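As a rough sketch of that Lambda function: it assumes a subscription to s3:ObjectCreated:* events and a hypothetical DynamoDB table named s3-folder-sizes keyed by folder. Delete events do not carry the object size, so handling them would need an extra lookup and is omitted here.

# Sketch of an AWS Lambda handler that keeps per-prefix size counters in
# DynamoDB. Table and attribute names are placeholders; the function is
# assumed to be subscribed to s3:ObjectCreated:* events.
import boto3

table = boto3.resource("dynamodb").Table("s3-folder-sizes")

def handler(event, context):
    for record in event.get("Records", []):
        key = record["s3"]["object"]["key"]
        size = record["s3"]["object"]["size"]
        folder = key.split("/", 1)[0]  # treat the first path segment as the "folder"
        table.update_item(
            Key={"folder": folder},
            UpdateExpression="ADD total_bytes :d",
            ExpressionAttributeValues={":d": size},
        )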
Thanks for the reply, John.
However, I found a solution for it using an s3_exporter, which gives metrics for the sizes of the folders and sub-folders inside an S3 bucket.

What architecture is best for creating a serverless aws service?

I need to implement an AWS service used to store back-up data from devices.
Devices are identified via IDs. The service consists of 3 endpoints:
Save device backup.
Get device backup.
Get latest device backup time.
Backup: binary data, from 10 KB up to 1 MB
Load examples
100k saved backups per day. 2k restored backups per day.
Take point 1 and multiply by 100.
I came up with 2 architectures.
Which architecture is better to choose, or should I build a new one?
Can I combine the API Gateway APIs into one, or do I need a separate API for each request?
Can I merge the Lambda functions into one, or do I need a separate function for each action?
A device backup would consist of two elements:
The backup data: Best stored in Amazon S3
Metadata about the backup (user, timestamp, pointer to backup data): Best stored in some type of database, such as DynamoDB
The processes would then be:
Saving backup: Send backup data via API Gateway to Lambda. The Lambda function would save the data in Amazon S3 and add an entry to the DynamoDB database, returning a reference to the backup entry in the database.
Retrieving backup: Send request via API Gateway to Lambda. The Lambda function uses the metadata in DynamoDB to determine which backup to serve, then creates an Amazon S3 pre-signed URL and returns it to the device. The device then retrieves the backup directly from the S3 bucket (see the sketch after this list).
Listing backups: Send request via API Gateway to Lambda. The Lambda function uses the metadata in DynamoDB to retrieve a list of backups (or just the latest backup), then returns the values.
It would be cleaner to use a separate Lambda function for each type of request (save, retrieve, list). These would be triggered via different paths within API Gateway.
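To illustrate the "retrieving backup" path, here is a minimal sketch of such a Lambda function. The device-backups table, its device_id/timestamp key schema, the bucket name, and the s3_key attribute are all assumptions for illustration, not a prescribed design.

# Sketch of the "retrieve backup" Lambda: look up the latest backup for a
# device in DynamoDB, then return a pre-signed S3 URL. Table, bucket and
# attribute names are placeholders.
import boto3
from boto3.dynamodb.conditions import Key

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("device-backups")

def handler(event, context):
    device_id = event["pathParameters"]["deviceId"]

    # Latest backup first; assumes the table is keyed by device_id with a
    # timestamp sort key.
    result = table.query(
        KeyConditionExpression=Key("device_id").eq(device_id),
        ScanIndexForward=False,
        Limit=1,
    )
    if not result["Items"]:
        return {"statusCode": 404, "body": "no backup found"}

    item = result["Items"][0]
    url = s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": "device-backup-bucket", "Key": item["s3_key"]},
        ExpiresIn=300,  # URL valid for 5 minutes
    )
    return {"statusCode": 200, "body": url}

Returning a pre-signed URL keeps the binary payload (10 KB to 1 MB per backup) out of API Gateway and Lambda; the device downloads directly from S3.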

How to tag S3 bucket objects using Kafka connect s3 sink connector

Is there any way we can tag the objects written to S3 buckets through the Kafka Connect S3 sink connector?
I am reading messages from Kafka and writing Avro files to an S3 bucket using the S3 sink connector. When the files are written to the S3 bucket, I need to tag them.
There is an API in the source code on GitHub called addTags(), but it is private and not exposed to the connector client, except through a small config feature called S3_OBJECT_TAGGING_CONFIG, which allows you to add start/end offsets as well as the record count as tags on the S3 object:
configDef.define(
    S3_OBJECT_TAGGING_CONFIG,
    Type.BOOLEAN,
    S3_OBJECT_TAGGING_DEFAULT,
    Importance.LOW,
    "Tag S3 objects with start and end offsets, as well as record count.",
    group,
    ++orderInGroup,
    Width.LONG,
    "S3 Object Tagging"
);
If you want to add other/custom tags, then the answer is no, you cannot do it right now.
A useful feature would be to take the tags from a predefined part of the input document in Kafka, but this is not available right now.
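For completeness, a sketch of how the built-in offset/record-count tagging could be switched on. I am assuming the property behind S3_OBJECT_TAGGING_CONFIG is named s3.object.tagging (verify against your connector version); the connector name, topic, bucket, and Connect endpoint below are placeholders.

# Sketch: enable the built-in offset/record-count tagging on the S3 sink
# connector via the Kafka Connect REST API. "s3.object.tagging" is my
# reading of S3_OBJECT_TAGGING_CONFIG; all other values are placeholders.
import json
import urllib.request

connector = {
    "name": "s3-sink-with-tags",
    "config": {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "topics": "my-topic",
        "s3.bucket.name": "my-bucket",
        "s3.region": "us-east-1",
        "storage.class": "io.confluent.connect.s3.storage.S3Storage",
        "format.class": "io.confluent.connect.s3.format.avro.AvroFormat",
        "flush.size": "1000",
        "s3.object.tagging": "true",  # adds start/end offsets and record count as tags
    },
}

req = urllib.request.Request(
    "http://localhost:8083/connectors",
    data=json.dumps(connector).encode(),
    headers={"Content-Type": "application/json"},
)
urllib.request.urlopen(req)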

How can I search the changes made to a `s3` bucket between two timestamps?

I am using an S3 bucket to store my data, and I keep pushing data to this bucket every single day. I wonder whether there is a feature that lets me compare the differences in my bucket between two dates. If not, is there a way for me to build one via the AWS CLI or SDK?
The reason I want to check this is that I have an S3 bucket and my clients keep pushing data to it. I want to see how much data they have pushed since the last time I loaded it. Is there a pattern in AWS that supports this query, or do I have to create rules on the S3 bucket to analyse it?
Listing from Amazon S3
You can activate Amazon S3 Inventory, which can provide a daily file listing the contents of an Amazon S3 bucket. You could then compare differences between two inventory files.
List it yourself and store it
Alternatively, you could list the contents of a bucket and look for objects dated since the last listing. However, if objects are deleted, you will only know this if you keep a list of objects that were previously in the bucket. It's probably easier to use S3 Inventory.
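A minimal sketch of that "list it yourself" approach with boto3 (the bucket name and cutoff date are placeholders):

# Sketch: list objects modified since a given timestamp. Bucket name and
# cutoff are placeholders. Deleted objects will not show up here, which is
# why a previous listing (or S3 Inventory) is needed to spot deletions.
from datetime import datetime, timezone
import boto3

cutoff = datetime(2024, 1, 1, tzinfo=timezone.utc)  # "last time I loaded"
s3 = boto3.client("s3")

paginator = s3.get_paginator("list_objects_v2")
new_or_changed = []
for page in paginator.paginate(Bucket="my-client-bucket"):
    for obj in page.get("Contents", []):
        if obj["LastModified"] >= cutoff:
            new_or_changed.append((obj["Key"], obj["Size"]))

total_bytes = sum(size for _, size in new_or_changed)
print(f"{len(new_or_changed)} objects, {total_bytes} bytes since {cutoff}")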
Process it in real-time
Instead of thinking about files in batches, you could configure Amazon S3 Events to trigger something whenever a new file is uploaded to the Amazon S3 bucket (see the configuration sketch after this list). The event can:
Trigger a notification via Amazon Simple Notification Service (SNS), such as an email
Invoke an AWS Lambda function to run some code you provide. For example, the code could process the file and send it somewhere.
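For instance, wiring the bucket to a Lambda function might look like the sketch below; the bucket name and function ARN are placeholders, and the resource-based permission that allows S3 to invoke the Lambda function is set up separately and omitted here.

# Sketch: configure an S3 event notification that invokes a Lambda function
# for every newly created object. Bucket name and function ARN are
# placeholders; the Lambda's invoke permission for S3 is configured separately.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_notification_configuration(
    Bucket="my-client-bucket",
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [
            {
                "LambdaFunctionArn": "arn:aws:lambda:us-east-1:123456789012:function:process-upload",
                "Events": ["s3:ObjectCreated:*"],
            }
        ]
    },
)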