Importing data from S3 to BigQuery via BigQuery Omni (location restrictions)

I am trying to import some data from an S3 bucket into BigQuery, and I ended up seeing the BigQuery Omni option.
However, when I try to connect to the S3 bucket, I am given a fixed set of regions to choose from; in my case, aws-us-east-1 and aws-ap-northeast-2, as in the attached screenshot.
My data in the S3 bucket is in the region eu-west-2.
I am wondering why BigQuery only lets us pick from specific S3 regions.
What should I be doing so that I can query data from an S3 bucket in the region where the data was uploaded?

The S3 service is unusual in that bucket names are globally unique and a bucket's ARN contains no region information. Buckets do live in a specific region, but they can be accessed from any region.
My best guess here is that the connection location is the S3 API endpoint that BigQuery will connect to when it attempts to get the data. If you don't see eu-west-2 as an option, try using us-east-1. From any endpoint it is always possible to find out the bucket's actual location and then construct the appropriate S3 client (see the sketch below).
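As a rough illustration of that last point, here is a minimal boto3 sketch (the bucket name is a placeholder, and boto3 plus AWS credentials are assumed): ask one endpoint where the bucket actually lives, then build a client for that region.

```python
import boto3

# Start from any regional endpoint, e.g. us-east-1.
s3 = boto3.client("s3", region_name="us-east-1")

# HeadBucket reports the bucket's real region in the x-amz-bucket-region
# response header, no matter which endpoint you asked.
resp = s3.head_bucket(Bucket="my-example-bucket")  # placeholder bucket name
bucket_region = resp["ResponseMetadata"]["HTTPHeaders"]["x-amz-bucket-region"]

# Build a client pinned to the bucket's own region for the actual data access.
s3_regional = boto3.client("s3", region_name=bucket_region)
```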

Related

Copy Files from S3 Signed URL to GCS Signed URL

I am developing a service in which two different cloud storage providers are involved. I am trying to copy data from an S3 bucket to GCS.
To access the data I have been given signed URLs, and to upload the data to GCS I also have signed URLs available which allow me to write content to a specified storage path.
Is there a possibility to move this data "in the cloud"? Downloading from S3 and re-uploading the content to GCS would create bandwidth problems.
I must also mention that this is an on-demand job and it only moves a small number of files; I cannot do a full bucket transfer.
Kind regards
You can use Skyplane to move data across cloud object stores. To move a single file from S3 to Google Cloud, you can use the command:
skyplane cp s3://<BUCKET>/<FILE> gcs://<BUCKET>/<FILE>
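If Skyplane is not an option and you really only have the two signed URLs, a lightweight alternative (not part of the answer above, just a hedged sketch) is to move the bytes from one URL to the other with plain HTTP. The data still passes through whichever machine runs this, so for the bandwidth concern you would run it on a VM close to the data; the URL variables below are placeholders.

```python
import requests

s3_signed_get_url = "https://<s3-signed-get-url>"    # placeholder
gcs_signed_put_url = "https://<gcs-signed-put-url>"  # placeholder

# Fetch the object via the S3 signed URL (held in memory, so best for small files).
src = requests.get(s3_signed_get_url)
src.raise_for_status()

# Re-upload the same bytes to GCS with a plain PUT against the signed URL
# (assumes the URL was signed for PUT with no extra required headers).
dst = requests.put(gcs_signed_put_url, data=src.content)
dst.raise_for_status()
```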

Google Play bucket not shown in Cloud Storage

I am trying to load some Google Play reports into my BigQuery project, but I am having issues finding the bucket in Cloud Storage.
I have copied the Cloud Storage URL from the Google Play Console (gs://pubsite_prod_rev_... format).
When I open Cloud Storage, this bucket is not in the list of available buckets.
But if I enter this URL in a Data Transfer from bucket to dataset, it works (although not all reports are loaded into my dataset :( ).
If I enter this URL in a Data Transfer from bucket to bucket, it won't work, because the transfer lacks some permissions on the source bucket. But I cannot change the permissions on this Google Play bucket because I can't see it in my bucket list.
So my question is: what could be the reason this bucket is not displayed in my storage, and how can I get access to it?
Thanks!
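One thing worth noting (an assumption based on how Play reporting buckets generally behave, not a confirmed answer): the pubsite_prod_rev_... bucket is owned by a Google-managed project, so it will never appear in your own project's bucket list, but it can still be addressed directly by name if your credentials were granted report access in the Play Console. A minimal sketch with the Cloud Storage client library, using a placeholder bucket name:

```python
from google.cloud import storage

# The bucket lives in a Google-owned project, so it is addressed by name
# rather than found via your own project's bucket list.
client = storage.Client()
for blob in client.list_blobs("pubsite_prod_rev_<your-id>"):  # placeholder name
    print(blob.name, blob.size)
```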

AWS S3 folder-wise metrics

We are using Grafana's CloudWatch data source for AWS metrics. We would like to differentiate the folders in an S3 bucket by their sizes and show them as graphs. We know that CloudWatch doesn't give object-level metrics, only bucket-level ones. Please let us know if there is any possible solution for monitoring the sizes of the folders in the bucket.
Any suggestions are appreciated.
Thanks in advance.
Amazon CloudWatch provides daily storage metrics for Amazon S3 buckets but, as you mention, these metrics are for the whole bucket, rather than folder-level.
Amazon S3 Inventory can provide a daily CSV file listing all objects. You could load this information into a database or use Amazon Athena to query the contents.
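As a rough sketch of that roll-it-up-yourself route (assuming the inventory was configured with CSV output and the Size field; check the fileSchema in the accompanying manifest.json and adjust the column order, which is assumed here to be bucket, key, size):

```python
import csv
import gzip
from collections import defaultdict

folder_bytes = defaultdict(int)

# Inventory data files are gzip-compressed CSVs; the file name is a placeholder.
with gzip.open("inventory-data-file.csv.gz", mode="rt", newline="") as f:
    for bucket, key, size, *rest in csv.reader(f):
        top_level = key.split("/", 1)[0] if "/" in key else "(root)"
        folder_bytes[top_level] += int(size)

for folder, total in sorted(folder_bytes.items()):
    print(f"{folder}: {total / 1024**3:.2f} GiB")
```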
If you require storage metrics at a higher resolution than daily, then you would need to track this information yourself. This could be done with:
An Amazon S3 Event that triggers an AWS Lambda function whenever an object is created or deleted
An AWS Lambda function that receives this information and updates a database
Your application could then retrieve the storage metrics from the database
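A minimal sketch of the Lambda piece described above (the table and attribute names are illustrative, not from the answer): keep a running per-prefix byte count in a DynamoDB table with partition key prefix. Note that ObjectRemoved events carry no object size, so handling deletes would need the size recorded at creation time.

```python
import urllib.parse
import boto3

table = boto3.resource("dynamodb").Table("s3-folder-sizes")  # hypothetical table

def handler(event, context):
    for record in event.get("Records", []):
        # Object keys in S3 event notifications are URL-encoded.
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        prefix = key.split("/", 1)[0] if "/" in key else "(root)"

        if record["eventName"].startswith("ObjectCreated"):
            size = record["s3"]["object"].get("size", 0)
            table.update_item(
                Key={"prefix": prefix},
                UpdateExpression="ADD total_bytes :s",
                ExpressionAttributeValues={":s": size},
            )
        # ObjectRemoved events would be handled here by subtracting the size
        # stored when the object was created.
```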
Thanks for the reply, John.
However, I found a solution for it using an s3_exporter: it gives metrics for the sizes of the folders and sub-folders inside an S3 bucket.

How can I search the changes made in an `s3` bucket between two timestamps?

I am using an S3 bucket to store my data, and I keep pushing data to this bucket every single day. I wonder whether there is a feature I can use to compare the differences in the files in my bucket between two dates. If not, is there a way for me to build one via the AWS CLI or SDK?
The reason I want to check this is that I have an S3 bucket and my clients keep pushing data to it. I want to see how much data they have pushed since the last time I loaded it. Is there a pattern in AWS that supports this query, or do I have to create some rules on the S3 bucket to analyse it?
Listing from Amazon S3
You can activate Amazon S3 Inventory, which can provide a daily file listing the contents of an Amazon S3 bucket. You could then compare differences between two inventory files.
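A hedged sketch of that comparison, assuming you have already pulled the object keys out of two inventory reports into plain-text files (one key per line; the file names are placeholders):

```python
def load_keys(path):
    with open(path) as f:
        return {line.strip() for line in f if line.strip()}

old_keys = load_keys("inventory-day-1.keys")  # placeholder file names
new_keys = load_keys("inventory-day-2.keys")

print("added:", sorted(new_keys - old_keys))
print("removed:", sorted(old_keys - new_keys))
```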
List it yourself and store it
Alternatively, you could list the contents of a bucket and look for objects dated since the last listing. However, if objects are deleted, you will only know this if you keep a list of objects that were previously in the bucket. It's probably easier to use S3 inventory.
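A minimal listing sketch for that approach (assumes boto3 and credentials; the bucket name and cutoff time are placeholders), totalling what has been pushed since the last load:

```python
from datetime import datetime, timezone
import boto3

cutoff = datetime(2024, 1, 1, tzinfo=timezone.utc)  # placeholder: time of last load
paginator = boto3.client("s3").get_paginator("list_objects_v2")

new_bytes = 0
for page in paginator.paginate(Bucket="my-example-bucket"):  # placeholder bucket
    for obj in page.get("Contents", []):
        if obj["LastModified"] > cutoff:
            new_bytes += obj["Size"]

print(f"pushed since last load: {new_bytes} bytes")
```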
Process it in real-time
Instead of thinking about files in batches, you could configure Amazon S3 Events to trigger something whenever a new file is uploaded to the Amazon S3 bucket. The event can:
Trigger a notification via Amazon Simple Notification Service (SNS), such as an email
Invoke an AWS Lambda function to run some code you provide. For example, the code could process the file and send it somewhere.

Is it possible to copy data from Amazon S3 objects in Requester Pays buckets using Azure Data Factory V2?

In Azure Data Factory there is a connection type for Amazon S3 where I can set an AccessKey and SecretKey. After I create the Linked Service and try to use the dataset in a copy activity, I get a 403 error when I try to get data from a Requester Pays bucket. The Amazon S3 documentation at https://docs.aws.amazon.com/AmazonS3/latest/dev/ObjectsinRequesterPaysBuckets.html
says that you just need to send an additional header to access this data. Is there any way to put this header on the Amazon dataset, or are there any other workarounds?
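For comparison only (this is the AWS SDK, not Azure Data Factory): with boto3 the requester-pays header from the linked documentation maps to the RequestPayer parameter, as in this sketch with placeholder bucket and key names.

```python
import boto3

s3 = boto3.client("s3")
obj = s3.get_object(
    Bucket="some-requester-pays-bucket",  # placeholder
    Key="path/to/object",                 # placeholder
    RequestPayer="requester",             # sends the x-amz-request-payer header
)
body = obj["Body"].read()
```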