Terraform resource for AWS S3 Batch Operation - amazon-s3

I couldn't find a Terraform resource for AWS S3 Batch Operations. I was able to create an AWS S3 inventory file through Terraform, but couldn't create an S3 batch operation.
Has anyone created an S3 batch operation through Terraform?

No, there is no Terraform resource for an S3 batch operation. In general, most Terraform providers only have resources for things that are actually resources (they hang around), not things that could be considered "tasks". For the same reason, there's no CloudFormation resource for S3 batch operations either.
Your best bet is to use a module that allows you to run shell commands and use the AWS CLI for it. I like to use this module for these kinds of tasks. You would use it in combination with the AWS CLI command for S3 batch jobs.
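As a rough sketch of the CLI side of that approach: the function below wraps `aws s3control create-job` for a job driven by an S3 Inventory manifest. The function names, the `DRY_RUN` switch, the `processed=true` tag, the `batch-reports` prefix and the priority value are all made up for illustration; only the CLI subcommand and its flags come from the AWS CLI. By default it only prints the command so you can inspect it first.

```shell
# Print the command when DRY_RUN=1 (the default); run it for real when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Create an S3 Batch Operations job that tags every object listed in an
# S3 Inventory manifest. All argument values are supplied by the caller.
create_batch_job() {
  local account_id="$1" role_arn="$2" manifest_arn="$3" manifest_etag="$4" report_bucket_arn="$5"

  # Hypothetical operation: add the tag processed=true to each object.
  local operation='{"S3PutObjectTagging":{"TagSet":[{"Key":"processed","Value":"true"}]}}'
  # The manifest is the manifest.json written by S3 Inventory, identified by ARN and ETag.
  local manifest="{\"Spec\":{\"Format\":\"S3InventoryReport_CSV_20161130\"},\"Location\":{\"ObjectArn\":\"${manifest_arn}\",\"ETag\":\"${manifest_etag}\"}}"
  # Completion report settings; the prefix is an assumption.
  local report="{\"Bucket\":\"${report_bucket_arn}\",\"Prefix\":\"batch-reports\",\"Format\":\"Report_CSV_20180820\",\"Enabled\":true,\"ReportScope\":\"AllTasks\"}"

  run aws s3control create-job \
    --account-id "$account_id" \
    --role-arn "$role_arn" \
    --priority 10 \
    --no-confirmation-required \
    --operation "$operation" \
    --manifest "$manifest" \
    --report "$report"
}
```

With `DRY_RUN=0` this could be called from whatever mechanism you use to run shell commands from Terraform (a `local-exec` provisioner, for instance), passing in the account ID, the IAM role ARN, the manifest ARN and ETag, and the report bucket ARN.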

Related

Is it possible to sync an Azure repo with MWAA (Amazon Managed Workflows for Apache Airflow)?

I have set up a private MWAA instance in AWS. It has set up a bucket that stores DAGs in S3.
I've created a private repository in Azure DevOps and have set up a role that can access this bucket.
With Azure Pipelines, is it possible to sync the entire repository so that it controls which DAGs are created or modified in that S3 bucket?
I've seen it's possible to create artefacts and push them to the S3 bucket, but what if a DAG is deleted? The DAG will still persist in the S3 bucket and will still be available in MWAA.
Any guidance will be appreciated.

If you just want to sync the entire repository to an S3 bucket, you can use the Amazon S3 Upload task in your Azure pipeline.
I'm not sure if that will fully address your problem, though.
If there is any misunderstanding, please feel free to add comments related to your issue.
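For the deleted-DAG case specifically, one option that might help is `aws s3 sync` with `--delete`, which removes objects from the destination that no longer exist in the source. This is only a sketch, not something taken from the answer above: the function names, the `DRY_RUN` switch, the default `dags` prefix and the `.git/*` exclusion are assumptions, and it presumes the pipeline step has the AWS CLI and credentials available.

```shell
# Print the command when DRY_RUN=1 (the default); run it for real when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Mirror a local DAG folder to the MWAA bucket. --delete removes objects under
# the prefix that are no longer present locally, so deleted DAGs disappear too.
sync_dags() {
  local src_dir="$1" bucket="$2" prefix="${3:-dags}"
  run aws s3 sync "$src_dir" "s3://${bucket}/${prefix}" --delete --exclude ".git/*"
}
```

For example, `sync_dags ./dags my-mwaa-bucket` prints the command it would run; with `DRY_RUN=0` it performs the sync. Since `--delete` is destructive, it is worth pointing it at a dedicated prefix rather than the bucket root.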

Azure Devops - pipeline to delete single s3 file

I would like a pipeline setup that I can run manually. The idea here is that it deletes a single file held within an AWS S3 account. I know technically there are many ways to do this, but what is best practice?
Thank you!

You can add an AWS CLI task to the pipeline to delete a single file from an S3 bucket.
Follow these steps:
1. Create an AWS service connection before adding the AWS CLI task to the pipeline.
2. Add the AWS CLI task to the pipeline and configure the required parameters. To understand what the parameters mean, refer to the document "Command structure in the AWS CLI". The command structure is:
aws <command> <subcommand> [options and parameters]
In this example, you can use the command below to delete a single S3 file:
aws s3 rm s3://BUCKET_NAME/uploads/file_name.jpg
Here "s3://BUCKET_NAME/uploads/file_name.jpg" is the path of the file in S3.
3. Run the pipeline; the single S3 file should then be deleted.

Providing credentials to the AWS CLI in ECS/Fargate

I would like to create an ECS task with Fargate, and have that upload a file to S3 using the AWS CLI (among other things). I know that it's possible to create task roles, which can provide the task with permissions on AWS services/resources. Similarly, in OpsWorks, the AWS SDK is able to query instance metadata to obtain temporary credentials for its instance profile. I also found these docs suggesting that something similar is possible with the AWS CLI on EC2 instances.
Is there an equivalent for Fargate—i.e., can the AWS CLI, running in a Fargate container, query the metadata service for temporary credentials? If not, what's a good way to authenticate so that I can upload a file to S3? Should I just create a user for this task and pass in AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY as environment variables?
(I know it's possible to have an ECS task backed by EC2, but this task is short-lived and run maybe monthly; it seemed a good fit for Fargate.)

"I know that it's possible to create task roles, which can provide the task with permissions on AWS services/resources."
"Is there an equivalent for Fargate"
You already know the answer. The ECS task role isn't specific to EC2 deployments; it works with Fargate deployments as well.
Inside the task, temporary credentials for the task role are served by an ECS credentials endpoint whose path is given in the AWS_CONTAINER_CREDENTIALS_RELATIVE_URI environment variable. But you don't need to worry about that, because the AWS CLI, and any AWS SDK, will automatically pull that information when it is running inside an ECS task.
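If you want to see what the CLI does behind the scenes, here is a small sketch. ECS sets `AWS_CONTAINER_CREDENTIALS_RELATIVE_URI` in containers whose task has a task role, and the credentials are served from `169.254.170.2` at that path. The function name is made up, and the file and bucket names in the comments are placeholders.

```shell
# Build the credentials endpoint URL from the variable ECS injects into the container.
ecs_credentials_url() {
  if [ -z "${AWS_CONTAINER_CREDENTIALS_RELATIVE_URI:-}" ]; then
    echo "AWS_CONTAINER_CREDENTIALS_RELATIVE_URI is not set (no task role, or not running in ECS)" >&2
    return 1
  fi
  printf 'http://169.254.170.2%s\n' "$AWS_CONTAINER_CREDENTIALS_RELATIVE_URI"
}

# Inside the task, this would print the temporary credentials as JSON:
#   curl -s "$(ecs_credentials_url)"
#
# The CLI resolves the same credentials on its own, so an upload needs no keys:
#   aws s3 cp ./report.csv s3://my-bucket/report.csv
```

In practice you should not need the `curl` call at all; it is only useful for checking that the task role is actually attached when an upload fails with a credentials error.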

block file system on S3

I am a little puzzled and hope someone can help me out.
We create some ORC files that we would like to query while they are stored on S3.
We noticed that the S3 native filesystem (s3n) does not really work for this purpose. I am not really sure what the problem is, but my guess is that the reader is not able to jump to specific bytes inside the file, so it has to load the whole file before it can query it.
So we tried storing the files on S3 (URI s3://), which is a block file system just like HDFS backed by S3, and it worked great.
But I am a little worried after reading this source about Amazon EMR, which says:
Amazon S3 block file system (URI path: s3bfs://)
The Amazon S3 block file system is a legacy file storage system. We strongly discourage the use of this system.
Important
We recommend that you do not use this file system because it can trigger a race condition that might cause your cluster to fail. However, it might be required by legacy applications.
EMRFS (URI path: s3://)
EMRFS is an implementation of HDFS used for reading and writing regular files from Amazon EMR directly to Amazon S3.
I am not using EMR; I create my files by launching an EC2 cluster and then use S3 as cold storage. But I am puzzled right now and not sure which filesystem I am using when I store my files on S3 with the URI scheme s3://. Am I using EMRFS or the deprecated s3bfs filesystem?

Amazon S3 is an object storage system. It is not recommended to "mount" S3 as a filesystem. Amazon Elastic Block Store (EBS) is a block storage system that appears as volumes on Amazon EC2 instances.
When used from Amazon Elastic MapReduce (EMR), Hadoop has extensions that make it easy to work with Amazon S3. However, if you are not using EMR, there is no need to use EMRFS (which is available only on EMR), nor should you use S3 as a block storage system.
The easiest way to use S3 from EC2 is via the AWS Command-Line Interface (CLI). You can copy files to/from S3 by using the aws s3 cp command. There's also a sync command to make it easy to synchronize data to/from S3.
You can also programmatically connect to Amazon S3 via an SDK, so that your app can directly transfer files to/from S3.
As to which to choose... typically, applications like to work with files on a local filesystem, so copy your files from S3 to a local device. However, if your app can directly communicate with S3, there will be fewer "moving parts".
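To make the copy-to-local approach concrete, here is a minimal sketch around `aws s3 cp` and `aws s3 sync`. The function names, the `DRY_RUN` switch and any bucket or path you pass in are illustrative assumptions; by default the commands are only printed.

```shell
# Print the command when DRY_RUN=1 (the default); run it for real when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Download one object from S3 to a local path (e.g. an ORC file to query locally).
fetch_object() {
  local bucket="$1" key="$2" local_path="$3"
  run aws s3 cp "s3://${bucket}/${key}" "$local_path"
}

# Upload a local directory to a prefix, copying only new or changed files.
push_dir() {
  local local_dir="$1" bucket="$2" prefix="$3"
  run aws s3 sync "$local_dir" "s3://${bucket}/${prefix}"
}
```

For example, `fetch_object my-bucket data/part-0000.orc /mnt/data/part-0000.orc` would pull a single file onto an instance volume, and `push_dir /mnt/output my-bucket results` would push results back once processing is done.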

How do I copy files from S3 to Amazon EMR HDFS?

I'm running Hive on EMR and need to copy some files to all EMR instances.
As I understand it, one way is to copy the files to the local file system on each node; the other is to copy the files to HDFS. However, I haven't found a simple way to copy straight from S3 to HDFS.
What is the best way to go about this?

The best way to do this is to use Hadoop's distcp command. Example (on one of the cluster nodes):
% ${HADOOP_HOME}/bin/hadoop distcp s3n://mybucket/myfile /root/myfile
This would copy a file called myfile from an S3 bucket named mybucket to /root/myfile in HDFS. Note that this example assumes you are using the S3 file system in "native" mode; this means that Hadoop sees each object in S3 as a file. If you use S3 in block mode instead, you would replace s3n with s3 in the example above. For more info about the differences between native S3 and block mode, as well as an elaboration on the example above, see http://wiki.apache.org/hadoop/AmazonS3.
I found that distcp is a very powerful tool. In addition to copying large numbers of files in and out of S3, it can also perform fast cluster-to-cluster copies of large data sets. Instead of pushing all the data through a single node, distcp uses multiple nodes in parallel to perform the transfer. This makes distcp considerably faster for large transfers than copying everything to the local file system as an intermediary.
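A cluster-to-cluster copy could look roughly like the sketch below. The function names, the `DRY_RUN` switch, the default of 20 maps and any NameNode addresses you pass in are assumptions for illustration; `-m` is distcp's option for capping the number of simultaneous map tasks.

```shell
# Print the command when DRY_RUN=1 (the default); run it for real when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Copy a path between two HDFS clusters, using at most max_maps parallel map tasks.
copy_between_clusters() {
  local src="$1" dest="$2" max_maps="${3:-20}"
  run hadoop distcp -m "$max_maps" "$src" "$dest"
}
```

For example, `copy_between_clusters hdfs://nn1:8020/data hdfs://nn2:8020/data 10` prints the distcp invocation it would run; with `DRY_RUN=0` on a cluster node it launches the copy as a MapReduce job.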

Amazon now provides its own wrapper over distcp, namely S3DistCp:
"S3DistCp is an extension of DistCp that is optimized to work with Amazon Web Services (AWS), particularly Amazon Simple Storage Service (Amazon S3). You use S3DistCp by adding it as a step in a job flow. Using S3DistCp, you can efficiently copy large amounts of data from Amazon S3 into HDFS where it can be processed by subsequent steps in your Amazon Elastic MapReduce (Amazon EMR) job flow. You can also use S3DistCp to copy data between Amazon S3 buckets or from HDFS to Amazon S3."
Example: Copy log files from Amazon S3 to HDFS
The following example illustrates how to copy log files stored in an Amazon S3 bucket into HDFS. In this example the --srcPattern option is used to limit the data copied to the daemon logs.
elastic-mapreduce --jobflow j-3GY8JC4179IOJ --jar \
s3://us-east-1.elasticmapreduce/libs/s3distcp/1.latest/s3distcp.jar \
--args '--src,s3://myawsbucket/logs/j-3GY8JC4179IOJ/node/,\
--dest,hdfs:///output,\
--srcPattern,.*daemons.*-hadoop-.*'
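The `elastic-mapreduce` client used above is the older command-line tool. On more recent EMR releases, S3DistCp is, as far as I know, available on the cluster as the `s3-dist-cp` command, so the same copy could be sketched as below. The function names and the `DRY_RUN` switch are made up for illustration, and you should check that `s3-dist-cp` exists on your release before relying on this.

```shell
# Print the command when DRY_RUN=1 (the default); run it for real when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Copy objects whose keys match a regular expression from S3 into HDFS.
copy_logs_to_hdfs() {
  local src="$1" dest="$2" pattern="$3"
  run s3-dist-cp --src "$src" --dest "$dest" --srcPattern "$pattern"
}
```

For example, `copy_logs_to_hdfs s3://myawsbucket/logs/j-3GY8JC4179IOJ/node/ hdfs:///output '.*daemons.*-hadoop-.*'` mirrors the source, destination and pattern from the example above.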

Note that according to Amazon's "Amazon Elastic MapReduce - File System Configuration" page (http://docs.amazonwebservices.com/ElasticMapReduce/latest/DeveloperGuide/FileSystemConfig.html), the S3 Block FileSystem is deprecated, its URI prefix is now s3bfs://, and Amazon specifically discourages using it since "it can trigger a race condition that might cause your job flow to fail".
According to the same page, HDFS is now a 'first-class' file system under S3, although it is ephemeral (it goes away when the Hadoop job ends).