Is it possible to sync an Azure repo with MWAA (Amazon Managed Workflows for Apache Airflow)?

I have set up a private MWAA instance in AWS, along with the S3 bucket it uses to store DAGs.
I've created a private repository in Azure DevOps and have set up a role that can access this bucket.
With Azure Pipelines, is it possible to sync the entire repository so that it controls the DAGs created/modified in that S3 bucket?
I've seen that it's possible to create artefacts and push them to the S3 bucket, but what if a DAG is deleted? The DAG will still persist in the S3 bucket and will still be available in MWAA.
Any guidance will be appreciated.

If you just want to sync the entire repository to the S3 bucket, you can use the Amazon S3 Upload task in your Azure pipeline.
I'm not sure if that will fully address your problem, though.
If there is any misunderstanding, please feel free to add comments related to your issue.
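To handle the "deleted DAG" concern, the pipeline needs to mirror rather than merely upload. One option is to run the AWS CLI's sync with its delete flag from a pipeline step; another is a small script step using the role credentials you configured. The sketch below is only one possible approach, assuming boto3 is available in the pipeline agent; the bucket name, DAGs prefix, and local folder are placeholders, not values from your environment.

```python
# Minimal sketch: mirror the repo's dags/ folder to the MWAA DAGs prefix,
# then delete any S3 object that no longer exists in the repository checkout.
import os
import boto3

BUCKET = "my-mwaa-bucket"   # assumption: your MWAA environment's bucket
PREFIX = "dags/"            # assumption: the DAGs prefix configured in MWAA
LOCAL_DAGS = "dags"         # assumption: the dags folder in the repo checkout

s3 = boto3.client("s3")

# Upload every DAG file from the checkout and remember its key.
local_keys = set()
for root, _, files in os.walk(LOCAL_DAGS):
    for name in files:
        path = os.path.join(root, name)
        key = PREFIX + os.path.relpath(path, LOCAL_DAGS).replace(os.sep, "/")
        local_keys.add(key)
        s3.upload_file(path, BUCKET, key)

# Remove remote DAGs that were deleted from the repository.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        if obj["Key"] not in local_keys:
            s3.delete_object(Bucket=BUCKET, Key=obj["Key"])
```

Run from the pipeline after checkout, this keeps the bucket as an exact mirror of the repository, so deleted DAGs disappear from MWAA as well.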

Related

Providing credentials to the AWS CLI in ECS/Fargate

I would like to create an ECS task with Fargate, and have that upload a file to S3 using the AWS CLI (among other things). I know that it's possible to create task roles, which can provide the task with permissions on AWS services/resources. Similarly, in OpsWorks, the AWS SDK is able to query instance metadata to obtain temporary credentials for its instance profile. I also found these docs suggesting that something similar is possible with the AWS CLI on EC2 instances.
Is there an equivalent for Fargate, i.e., can the AWS CLI, running in a Fargate container, query the metadata service for temporary credentials? If not, what's a good way to authenticate so that I can upload a file to S3? Should I just create a user for this task and pass in AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY as environment variables?
(I know it's possible to have an ECS task backed by EC2, but this task is short-lived and run maybe monthly; it seemed a good fit for Fargate.)
"I know that it's possible to create task roles, which can provide the
task with permissions on AWS services/resources."
"Is there an equivalent for Fargate"
You already know the answer. The ECS task role isn't specific to EC2 deployments; it works with Fargate deployments as well.
You can get the task metadata, including IAM access keys, through the ECS metadata service. But you don't need to worry about that, because the AWS CLI, and any AWS SDK, will automatically pull that information when it is running inside an ECS task.
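As a quick illustration of that point, here is a hedged sketch assuming a task role is attached to the Fargate task. The bucket and key names are placeholders; the second half only demonstrates the container credentials endpoint that the CLI/SDKs use under the hood, which you normally never need to call yourself.

```python
# Inside a Fargate task with a task role attached, boto3 resolves temporary
# credentials automatically; no access keys need to be passed as env vars.
import json
import os
import urllib.request

import boto3

s3 = boto3.client("s3")  # credentials come from the task role via the credential chain
s3.upload_file("/tmp/report.csv", "my-bucket", "reports/report.csv")  # placeholder bucket/key

# For illustration only: the same temporary credentials can be read from the
# ECS container credentials endpoint the SDKs query behind the scenes.
relative_uri = os.environ.get("AWS_CONTAINER_CREDENTIALS_RELATIVE_URI")
if relative_uri:
    with urllib.request.urlopen(f"http://169.254.170.2{relative_uri}") as resp:
        creds = json.loads(resp.read())
        print(creds["AccessKeyId"], creds["Expiration"])
```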

What is the best approach to sync data from an AWS S3 bucket to Azure Data Lake Gen 2?

Currently, I download csv files from AWS S3 to my local computer using:
aws s3 sync s3://<cloud_source> c:/<local_destination> --profile aws_profile. Now, I would like to use the same process to sync the files from AWS to Azure Data Lake Storage Gen2 (one-way sync) on a daily basis. [Note: I only have read/download permissions for the S3 data source.]
I thought about 5 potential paths to solving this problem:
Use AWS CLI commands within Azure. I'm not entirely sure how to do that without running an Azure VM. Also, I would need my AWS profile credentials to persist somehow.
Use Python's subprocess library to run AWS CLI commands. I run into similar issues as option 1, namely a) maintaining a persistent install of AWS CLI, b) passing AWS profile credentials, and c) running without an Azure VM.
Use Python's Boto3 library to access AWS services. In the past, it appears that Boto3 didn't support the AWS sync command. So, developers like #raydel-miranda developed their own. [see Sync two buckets through boto3]. However, it now appears that there is a DataSync class for Boto3. [see DataSync | Boto3 Docs 1.17.27 documentation]. Would I still need to run this in an Azure VM or could I use Azure Data Factory?
Use Azure Data Factory to copy data from AWS S3 bucket. [see Copy data from Amazon Simple Storage Service by using Azure Data Factory] My concern would be that I would want to sync rather than copy. I believe Azure Data Factory has functionality to check if a file already exists, but what if the file has been deleted from AWS S3 data source?
Use an Azure Data Science Virtual Machine to: a) install the AWS CLI, b) create my AWS profile to store the access credentials, and c) run the aws s3 sync... command.
Any tips, suggestions, or ideas on automating this process are greatly appreciated.
Adding one more to the list :)
6. Please also look into the AzCopy option: https://learn.microsoft.com/en-us/azure/storage/common/storage-use-azcopy-s3?toc=/azure/storage/blobs/toc.json
I am not aware of any tool that helps in syncing the data; more or less all of them will do a copy, so I think you will have to implement the sync yourself. A couple of quick thoughts:
#3) You can run this from a batch service. You can initiate that from Azure Data Factory. Also, since we are talking about Python, you can run it from Azure Databricks as well; a sketch of this approach follows below.
#4) ADF does not have any sync logic for files that have been deleted. We can implement that using the Get Metadata activity: https://learn.microsoft.com/en-us/azure/data-factory/control-flow-get-metadata-activity
AzReplicate is another option, especially for very large containers: https://learn.microsoft.com/en-us/samples/azure/azreplicate/azreplicate/
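For option #3, a one-way sync including deletes can be hand-rolled with boto3 plus the azure-storage-file-datalake package and run on a schedule (e.g. from Databricks or a batch service). This is only a sketch under assumptions: the account URL, container, SAS token, and bucket name are placeholders, and it assumes object keys map directly onto existing paths in the ADLS container.

```python
# One-way S3 -> ADLS Gen2 sync sketch: copy what exists in S3, then delete
# anything in ADLS that no longer exists in the source bucket.
import boto3
from azure.storage.filedatalake import DataLakeServiceClient

S3_BUCKET = "cloud_source"                                    # placeholder
ADLS_ACCOUNT_URL = "https://myaccount.dfs.core.windows.net"   # placeholder
ADLS_CONTAINER = "landing"                                    # placeholder
ADLS_CREDENTIAL = "<sas-token-or-key>"                        # placeholder credential

s3 = boto3.client("s3")
adls = DataLakeServiceClient(account_url=ADLS_ACCOUNT_URL, credential=ADLS_CREDENTIAL)
fs = adls.get_file_system_client(ADLS_CONTAINER)

# 1) Copy everything currently in the S3 bucket.
s3_keys = set()
for page in s3.get_paginator("list_objects_v2").paginate(Bucket=S3_BUCKET):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        s3_keys.add(key)
        body = s3.get_object(Bucket=S3_BUCKET, Key=key)["Body"].read()
        fs.get_file_client(key).upload_data(body, overwrite=True)

# 2) The "sync" part ADF lacks: delete ADLS files that were removed from S3.
for path in fs.get_paths(recursive=True):
    if not path.is_directory and path.name not in s3_keys:
        fs.delete_file(path.name)
```

For large files you would want multipart/streaming transfers rather than reading whole objects into memory, but the structure of list, copy, then reconcile deletes stays the same.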

How to stop AWS Elastic Beanstalk from creating an S3 bucket or writing to it?

It created an S3 bucket. If I delete it, it just creates a new one. How can I configure it not to create a bucket, or remove its write permissions to that bucket?
You cannot prevent AWS Elastic Beanstalk from creating an S3 bucket, because it stores your application and settings as a bundle in that bucket and executes deployments from it. That bucket is required for as long as you run/deploy your application using AWS EB. Please be wary of deleting these buckets, as this may cause your deployments/applications to crash. You may, however, remove older objects that are no longer in use.
Take a look at this link for detailed information on how EB uses S3 buckets for deployments: https://docs.aws.amazon.com/elasticbeanstalk/latest/dg/AWSHowTo.S3.html
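If the goal is just to keep the bucket from growing, one hedged option is an S3 lifecycle rule that expires old objects instead of deleting the bucket. The bucket name and retention period below are placeholders; adjust them to what counts as "no longer in use" for your deployments.

```python
# Sketch: expire old Elastic Beanstalk deployment bundles with a lifecycle rule
# instead of deleting the bucket that EB depends on.
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="elasticbeanstalk-us-east-1-123456789012",  # placeholder EB bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-old-deployment-bundles",
                "Filter": {"Prefix": ""},       # apply to the whole bucket
                "Status": "Enabled",
                "Expiration": {"Days": 90},     # delete objects older than 90 days
            }
        ]
    },
)
```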

How to use CodePipeline to deploy to another region

I want to do something pretty simple: a pipeline which delivers the content of an AWS CodeCommit repo to an S3 bucket in another region.
From what I see, I have to create the pipeline in the CodeCommit region, otherwise I can't access it.
From what I have read, CodePipeline supports cross-region actions. However, I get an error in the deploy stage:
Replication of artifact 'SourceOutput' failed: Failed replicating artifact from ao-content-deploy-codepipelineartifactstorebucket-xxx in eu-west-1 to ao-content-deploy-codepipelineartifactstorebucket-xxx in us-east-2: The destination artifact bucket is in a different region. Please use a artifact bucket in the same region.
I'm not sure how to proceed. Can anybody help, or just confirm that it's possible?
Thanks for your help.
Best
Are you using the S3 Deploy action, by chance?
What you want is a pipeline configured with an artifact bucket for each region in which you want to run cross-region actions, and you need to specify the region on those actions. This is described here: https://docs.aws.amazon.com/codepipeline/latest/userguide/actions-create-cross-region.html
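A hedged sketch of what that looks like via boto3 follows; the important parts are the per-region artifactStores map and the region field on the cross-region S3 deploy action. The role ARN, repo, branch, and bucket names are placeholders, not values from the question.

```python
# Sketch of a pipeline with one artifact bucket per region and a
# cross-region S3 deploy action.
import boto3

codepipeline = boto3.client("codepipeline", region_name="eu-west-1")

codepipeline.create_pipeline(
    pipeline={
        "name": "deploy-content-cross-region",
        "roleArn": "arn:aws:iam::123456789012:role/CodePipelineServiceRole",  # placeholder
        # One artifact bucket per region involved in the pipeline.
        "artifactStores": {
            "eu-west-1": {"type": "S3", "location": "artifact-bucket-eu-west-1"},
            "us-east-2": {"type": "S3", "location": "artifact-bucket-us-east-2"},
        },
        "stages": [
            {
                "name": "Source",
                "actions": [{
                    "name": "CodeCommitSource",
                    "actionTypeId": {"category": "Source", "owner": "AWS",
                                     "provider": "CodeCommit", "version": "1"},
                    "configuration": {"RepositoryName": "my-repo", "BranchName": "main"},
                    "outputArtifacts": [{"name": "SourceOutput"}],
                }],
            },
            {
                "name": "Deploy",
                "actions": [{
                    "name": "S3Deploy",
                    "region": "us-east-2",  # cross-region action
                    "actionTypeId": {"category": "Deploy", "owner": "AWS",
                                     "provider": "S3", "version": "1"},
                    "configuration": {"BucketName": "my-target-bucket", "Extract": "true"},
                    "inputArtifacts": [{"name": "SourceOutput"}],
                }],
            },
        ],
    }
)
```

With the us-east-2 artifact store in place, CodePipeline replicates SourceOutput into that region itself, which is exactly what the error message was complaining about.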

Block file system on S3

I am a little puzzled; I hope someone can help me out.
We create some ORC files that we would like to query while they are stored on S3.
We noticed that the S3 native filesystem (s3n) does not really work out for this purpose. I am not really sure what the problem is, but my guess is that the reader is not able to jump to specific bytes inside the file, so it has to load the whole file before it can query it.
So we tried storing the files on S3 with the URI scheme s3://, which is a block file system just like HDFS backed by S3, and it worked great.
But I am a little worried after reading up on this source about Amazon EMR, which says:
Amazon S3 block file system (URI path: s3bfs://)
The Amazon S3 block file system is a legacy file storage system. We strongly discourage the use of this system.
Important
We recommend that you do not use this file system because it can trigger a race condition that might cause your cluster to fail. However, it might be required by legacy applications.
EMRFS (URI path: s3://)
EMRFS is an implementation of HDFS used for reading and writing regular files from Amazon EMR directly to Amazon S3.
I am not using EMR; I create my files by launching an EC2 cluster and then use S3 as cold storage. But I am kind of puzzled right now and not sure which filesystem I am using when I store my files on S3 with the URI scheme s3://. Do I use EMRFS, or do I use the deprecated s3bfs filesystem?
Amazon S3 is an object storage system. It is not recommended to "mount" S3 as a filesystem. Amazon Elastic Block Store (EBS) is a block storage system that appears as volumes on Amazon EC2 instances.
When used from Amazon Elastic MapReduce (EMR), Hadoop has extensions that make it easy to work with Amazon S3. However, if you are not using EMR, there is no need to use EMRFS (which is available only on EMR), nor should you use S3 as a block storage system.
The easiest way to use S3 from EC2 is via the AWS Command-Line Interface (CLI). You can copy files to/from S3 by using the aws s3 cp command. There's also a sync command to make it easy to synchronize data to/from S3.
You can also programmatically connect to Amazon S3 via an SDK, so that your app can directly transfer files to/from S3.
As to which to choose... typically, applications like to work with files on a local filesystem, so copy your files from S3 to a local device. However, if your app can communicate directly with S3, there will be fewer "moving parts".
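As a small illustration of the SDK route, here is a hedged sketch assuming an EC2 instance with an instance profile that grants S3 access; bucket and key names are placeholders.

```python
# Treat S3 as object storage from EC2: copy files up for cold storage and
# pull them back down when the application needs them locally.
import boto3

s3 = boto3.client("s3")  # credentials come from the instance profile

# Equivalent to `aws s3 cp data/file.orc s3://my-bucket/cold-storage/file.orc`.
s3.upload_file("data/file.orc", "my-bucket", "cold-storage/file.orc")

# Retrieve the object to a local path for querying.
s3.download_file("my-bucket", "cold-storage/file.orc", "/tmp/file.orc")
```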