What is the best approach to sync data from an AWS S3 bucket to Azure Data Lake Gen 2?

Currently, I download csv files from AWS S3 to my local computer using:
aws s3 sync s3://<cloud_source> c:/<local_destination> --profile aws_profile. Now, I would like to use the same process to sync the files from AWS to Azure Data Lake Storage Gen2 (one-way sync) on a daily basis. [Note: I only have read/download permissions for the S3 data source.]
I thought about 5 potential paths to solving this problem:
Use AWS CLI commands within Azure. I'm not entirely sure how to do that without running an Azure VM. Also, how would I have my AWS profile credentials persist?
Use Python's subprocess library to run AWS CLI commands. I run into similar issues as with option 1, namely a) maintaining a persistent install of the AWS CLI, b) passing AWS profile credentials, and c) running without an Azure VM.
Use Python's Boto3 library to access AWS services. In the past, it appears that Boto3 didn't support the AWS sync command, so developers like @raydel-miranda developed their own. [see Sync two buckets through boto3]. However, it now appears that there is a DataSync class for Boto3. [see DataSync | Boto3 Docs 1.17.27 documentation]. Would I still need to run this in an Azure VM, or could I use Azure Data Factory?
Use Azure Data Factory to copy data from the AWS S3 bucket. [see Copy data from Amazon Simple Storage Service by using Azure Data Factory] My concern would be that I want to sync rather than copy. I believe Azure Data Factory has functionality to check if a file already exists, but what if the file has been deleted from the AWS S3 data source?
Use an Azure Data Science Virtual Machine to: a) install the AWS CLI, b) create my AWS profile to store the access credentials, and c) run the aws s3 sync... command (a rough sketch of this idea appears below).
Any tips, suggestions, or ideas on automating this process are greatly appreciated.
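To make option 5 (or option 2 wrapped in subprocess) concrete, here is a rough sketch of the staged approach I have in mind; the staging path, storage account, container, and SAS token are placeholders, and it assumes AzCopy can push the staged copy on to the ADLS Gen2 account's blob endpoint:

# one-way sync from S3 to a local/VM staging folder; --delete mirrors S3-side deletions
aws s3 sync s3://<cloud_source> /mnt/staging --profile aws_profile --delete
# push the staged copy to the ADLS Gen2 storage account; --delete-destination removes blobs
# that no longer exist in the staging folder
azcopy sync /mnt/staging "https://<storage-account>.blob.core.windows.net/<container>?<sas-token>" --recursive --delete-destination=true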

Adding one more to the list :)
6. Please also look into the AzCopy option (a minimal example follows these points): https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-s3?toc=/azure/storage/blobs/toc.json
I am not aware of any tool which helps in syncing the data; more or less all of them will do a copy, so I think you will have to implement that yourself. A couple of quick thoughts:
#3) You can run this from a batch service, and you can initiate that from Azure Data Factory. Also, since we are talking about Python, you can run it from Azure Databricks.
#4) ADF does not have any sync logic for files that have been deleted. We can implement that using the Get Metadata activity: https://learn.microsoft.com/en-us/azure/data-factory/control-flow-get-metadata-activity
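To illustrate option 6, the linked AzCopy doc covers copying straight from S3 into Azure Storage using AWS credentials taken from the environment; roughly along these lines (bucket, storage account, container, and SAS token are placeholders):

export AWS_ACCESS_KEY_ID=<access-key>
export AWS_SECRET_ACCESS_KEY=<secret-key>
# copy the whole bucket into the target container; AzCopy copies rather than syncs,
# so deletions on the S3 side still need the kind of handling described in #4
azcopy copy "https://s3.amazonaws.com/<bucket>" "https://<storage-account>.blob.core.windows.net/<container>?<sas-token>" --recursive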

AzReplicate is another option, especially for very large containers: https://learn.microsoft.com/en-us/samples/azure/azreplicate/azreplicate/

Related

Is it possible to sync an Azure repo with MWAA (Amazon Managed Workflows for Apache Airflow)?

I have set up a private MWAA instance in AWS. It is configured with an S3 bucket that stores the DAGs.
I've created a private repository in Azure DevOps and have set up a role that can access this bucket.
With Azure Pipelines, is it possible to sync the entire repository so that it controls the DAGs created/modified in that S3 bucket?
I've seen it's possible to create artefacts and push them to the S3 bucket, but what if a DAG is deleted? The DAG will still persist in the S3 bucket and will still be available in MWAA.
Any guidance will be appreciated.
If you just want to sync the entire repository to the S3 bucket, you can use the Amazon S3 Upload task in your Azure pipeline.
I'm not sure if that will fully address your problem, though.
If there is any misunderstanding, please feel free to add comments related to your issue.
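If deleted DAGs also need to disappear from the bucket, one alternative to the upload task is a plain script step that calls the AWS CLI directly; a minimal sketch, assuming the agent has the AWS CLI available and AWS credentials configured (bucket name and prefix are placeholders):

# run from the repository checkout (e.g. $(Build.SourcesDirectory) in an Azure Pipelines script step);
# --delete removes objects under the prefix whose source files no longer exist in the repo
aws s3 sync ./dags s3://<mwaa-dags-bucket>/dags --delete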

Providing credentials to the AWS CLI in ECS/Fargate

I would like to create an ECS task with Fargate, and have that upload a file to S3 using the AWS CLI (among other things). I know that it's possible to create task roles, which can provide the task with permissions on AWS services/resources. Similarly, in OpsWorks, the AWS SDK is able to query instance metadata to obtain temporary credentials for its instance profile. I also found these docs suggesting that something similar is possible with the AWS CLI on EC2 instances.
Is there an equivalent for Fargate? That is, can the AWS CLI, running in a Fargate container, query the metadata service for temporary credentials? If not, what's a good way to authenticate so that I can upload a file to S3? Should I just create a user for this task and pass in AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY as environment variables?
(I know it's possible to have an ECS task backed by EC2, but this task is short-lived and run maybe monthly; it seemed a good fit for Fargate.)
"I know that it's possible to create task roles, which can provide the
task with permissions on AWS services/resources."
"Is there an equivalent for Fargate"
You already know the answer. The ECS task role isn't specific to EC2 deployments; it works with Fargate deployments as well.
You can get the task metadata, including IAM access keys, through the ECS metadata service. But you don't need to worry about that, because the AWS CLI, and any AWS SDK, will automatically pull that information when it is running inside an ECS task.
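As a minimal illustration, assuming the task definition has a task role with s3:PutObject on the bucket (bucket and file names are placeholders):

# the CLI resolves temporary credentials from the task role automatically; no keys in the environment
aws s3 cp ./output.csv s3://<bucket>/uploads/output.csv
# to inspect the temporary credentials the CLI/SDK is using, query the container credentials endpoint
curl "http://169.254.170.2${AWS_CONTAINER_CREDENTIALS_RELATIVE_URI}"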

backup distributed cache data to cloud storage

I want to back up the Redis data to a Google Cloud Storage bucket as a flat file. Is there any existing utility to do that?
Although I do not fully agree with the idea of backing up cache data to the cloud, I was wondering if there is an existing utility rather than reinventing the wheel.
If you are using Cloud Memorystore for Redis, you can refer to the export documentation. Notice that you can simply use the following gcloud command:
gcloud redis instances export gs://[BUCKET_NAME]/[FILE_NAME].rdb [INSTANCE_ID] --region=[REGION] --project=[PROJECT_ID]
or use the Export operation from the Cloud Console.
If you manage your own instance (e.g. you have the Redis instance hosted on a Compute Engine instance), you could use the SAVE or BGSAVE (preferred) command to take a snapshot of the instance and then upload the .rdb file to Google Cloud Storage using any of the available methods. I think the most convenient one would be gsutil (notice that it will require an installation procedure), used in a similar fashion to:
gsutil cp path/to/your-file.rdb gs://[DESTINATION_BUCKET_NAME]/
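Putting the two steps together for a self-managed instance, a minimal sketch (the dump path depends on your Redis dir and dbfilename settings, and the bucket name is a placeholder):

# trigger a background snapshot; poll LASTSAVE (or check the log) until it completes
redis-cli BGSAVE
# then copy the resulting RDB file to the bucket
gsutil cp /var/lib/redis/dump.rdb gs://[DESTINATION_BUCKET_NAME]/redis-backups/dump-$(date +%F).rdb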

Syncing buckets between two S3 Storage Providers

I am currently using RIAK CS as an S3 provider but I want to change to Scality S3. Therefore, I need to migrate the existing data from RIAK to Scality. Is there a quick and easy way of syncing buckets between the two different storage providers? I have two Docker containers running, one for each of the two providers.
One way of doing it would be to simply download the contents of the buckets to a local folder and then upload to Scality using s3cmd or a similar tool. However, I was hoping there was a direct route between the buckets.
Any ideas?
There would not be a "direct route between the buckets".
While the Amazon S3 CopyObject command can copy objects between different Amazon S3 buckets (even if they are in different regions), it will not work with a non-Amazon endpoint.
Your only hope is if Riak/Scality have some built-in connectivity with each other.
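For the two-hop route described in the question, a minimal sketch using the AWS CLI's --endpoint-url option rather than s3cmd (hosts, ports, profiles, and bucket names are placeholders):

# pull everything from the RIAK CS endpoint to a local staging folder
aws s3 sync s3://<bucket> ./staging --endpoint-url http://<riak-cs-host>:<port> --profile riak
# push the staged copy to the Scality endpoint
aws s3 sync ./staging s3://<bucket> --endpoint-url http://<scality-host>:<port> --profile scality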

Migrate s3 data to google cloud storage

I have a python web application deployed on Google App Engine.
I need to grab a log file stored on Amazon S3 and load it into Google Cloud Storage. Once it is in Google Cloud Storage I may need to perform some transformations and eventually import the data into BigQuery for analysis.
I tried using gsutil as some sort of proof of concept, since boto is under the hood of gsutil and I'd like to use boto in my project. This did not work.
I'd like to know if anyone has managed to transfer files directly between the two clouds. If possible, I'd like to see a simple example. In the end, this task has to be accomplished through code executing on GAE.
Per this thread, you can stream data from S3 to Google Cloud Storage using gsutil but every byte still has to take two hops: S3 to your local computer and then your computer to GCS. Since you're using App Engine, however, you should be able to pull from S3 and deposit into GCS. It's the same progression as above except App Engine is the intermediary, i.e. every byte travels from S3 to your app and then to GCS. You could use boto for the pull side and the Google Cloud Storage API for the push side.
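For the gsutil route specifically, once the AWS keys are in the boto configuration, something like the following should work from whichever machine acts as the intermediary (bucket names are placeholders; every byte still passes through that machine):

# requires aws_access_key_id / aws_secret_access_key under [Credentials] in ~/.boto
gsutil -m rsync -r s3://<source-bucket> gs://<destination-bucket>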
Google allows you to import entire buckets from S3 to the storage service:
https://cloud.google.com/storage/transfer/getting-started
You can set file filters on the source bucket to only import the files you want, or a "directory" (i.e. anything with a certain prefix).
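If you have a recent gcloud with the transfer commands installed, the same kind of transfer job can also be created from the command line; a rough sketch (bucket names are placeholders, and the AWS credentials file format follows the Storage Transfer Service docs):

# creates a Storage Transfer Service job that pulls directly from S3 into the GCS bucket
gcloud transfer jobs create s3://<source-bucket> gs://<destination-bucket> --source-creds-file=aws-creds.json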
I'm not aware of any cloud provider that provides an API for transferring data to a competing cloud provider. Cloud providers have no incentive to help you move your data to the competition. You will almost certainly have to read the data to an intermediate machine then write it to Google.
GCP supports not only transfers from S3, but also from any storage that exposes an S3-compatible API:
https://cloud.google.com/storage-transfer/docs/create-transfers
https://cloud.google.com/storage-transfer/docs/s3-compatible