AWS EMR --steps - amazon-emr

I am running the following .sh to run a command on AWS using EMR:
aws emr create-cluster --name "Big Matrix Re Run 5" --ami-version 3.1.0 --auto-terminate --log-uri FILE LOCATION --enable-debugging --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=c3.xlarge InstanceGroupType=CORE,InstanceCount=3,InstanceType=c3.xlarge --steps NAME AND LOCATION OF FILE
I've deleted the pertinent file name and locations as those aren't my issue, but I am having an issue with the --steps portion of the script.
How do I specify the steps that I want to run in the cluster? The documentation doesn't give any examples.
Here is the error:
Error parsing parameter '--steps': should be: Key value pairs, where values are separated by commas, and multiple pairs are separated by spaces.
--steps Name=string1,Jar=string1,ActionOnFailure=string1,MainClass=string1,Type=string1,Properties=string1,Args=string1,string2 Name=string1,Jar=string1,ActionOnFailure=string1,MainClass=string1,Type=string1,Properties=string1,Args=string1,string2

The documentation page for the AWS Command-Line Interface create-cluster command shows examples for using the --steps parameter.
Steps can be supplied inline on the command line or read from a JSON file, and a step's arguments can refer to scripts stored in Amazon S3.
From a local JSON file (the file:// prefix refers to the local file system, not HDFS):
aws emr create-cluster --steps file://./multiplefiles.json --ami-version 3.3.1 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge --auto-terminate
Inline, with the script and data stored in Amazon S3:
aws emr create-cluster --steps Type=HIVE,Name='Hive program',ActionOnFailure=CONTINUE,Args=[-f,s3://elasticmapreduce/samples/hive-ads/libs/model-build.q,-d,INPUT=s3://elasticmapreduce/samples/hive-ads/tables,-d,OUTPUT=s3://mybucket/hive-ads/output/2014-04-18/11-07-32,-d,LIBS=s3://elasticmapreduce/samples/hive-ads/libs] --applications Name=Hive --ami-version 3.1.0 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge
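For the file:// form, the JSON file holds a list of step objects whose keys match the ones named in the error message (Type, Name, ActionOnFailure, Jar, Args). The following is a minimal sketch of such a file; the helper name write_steps_file, the step name, the JAR location and the arguments are hypothetical placeholders, not values from the question.

```shell
#!/bin/sh
# write_steps_file PATH
# Hypothetical helper: writes a one-step definition file for --steps file://...
# The JAR location and arguments are placeholders.
write_steps_file() {
    cat > "$1" <<'EOF'
[
  {
    "Type": "CUSTOM_JAR",
    "Name": "Example step",
    "ActionOnFailure": "TERMINATE_CLUSTER",
    "Jar": "s3://mybucket/jars/example.jar",
    "Args": ["s3://mybucket/input", "s3://mybucket/output"]
  }
]
EOF
}

# Usage (not run here):
#   write_steps_file ./steps.json
#   aws emr create-cluster ... --steps file://./steps.json
#
# The same step in inline shorthand form would look like:
#   --steps Type=CUSTOM_JAR,Name=ExampleStep,ActionOnFailure=TERMINATE_CLUSTER,Jar=s3://mybucket/jars/example.jar,Args=[s3://mybucket/input,s3://mybucket/output]
```

Several steps can be listed by adding more objects to the JSON array, or by separating shorthand steps with spaces.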


AWS cli throws error when copying large files

I'm trying to copy objects from an s3 bucket to another using aws cli tool.
It works OK for small objects, but on buckets with large files, as soon as the copy starts, I get one of the following errors:
copy failed: s3://bucket/file.ogv to s3://bucket-tmp/file.ogv ('Connection aborted.', OSError(0, 'Error'))
copy failed: s3://bucket/file.ogv to s3://bucket-tmp/file.ogv An error occurred (NoSuchKey) when calling the UploadPartCopy operation: Unknown
If I include --no-guess-mime-type, I get:
fatal error: ('Connection aborted.', OSError(0, 'Error'))
I tried --debug. I didn't understand much of the debug output, but I could see OSError(0, 'Error') again in the log.
Has anyone seen anything like this? In another answer, people mentioned another tool, s3cmd, but I couldn't get it to work.
I'm trying to access Ceph on a corporate server with path-style URLs and an HTTPS endpoint.
My command:
aws --endpoint-url https://myendpoint.url s3 cp s3://mybucket s3://mybucket-tmp --recursive
Also, when I tried to configure s3cmd, I got an ugly Python debug output with OSError: [Errno 0] Error in the middle.
I discovered that if I use s3api command instead of s3 command it works. Format of working command:
aws --endpoint-url <my-endpoint-url> s3api copy-object --copy-source my-source-bucket/whatever/path/file.txt --key whatever/path/file.txt --bucket my-destination-bucket
It only copies one file at a time. You can get a list of the objects in the bucket with the s3 ls command or the s3api list-objects command.
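The two commands can be combined in a loop to copy a whole bucket. The following is a minimal sketch, assuming object keys contain no newlines; the function name copy_bucket and its arguments are hypothetical, and it has not been tested against Ceph. Keys with special characters may need URL-encoding in --copy-source.

```shell
#!/bin/sh
# copy_bucket ENDPOINT SRC_BUCKET DST_BUCKET
# Hypothetical helper: copies every object with one copy-object call each.
copy_bucket() {
    endpoint=$1
    src=$2
    dst=$3
    # 'Contents[].[Key]' with text output prints one key per line.
    aws --endpoint-url "$endpoint" s3api list-objects \
        --bucket "$src" --query 'Contents[].[Key]' --output text |
    while IFS= read -r key; do
        # An empty listing may print the literal text "None"; skip it.
        if [ -z "$key" ] || [ "$key" = "None" ]; then
            continue
        fi
        aws --endpoint-url "$endpoint" s3api copy-object \
            --copy-source "$src/$key" --key "$key" --bucket "$dst" </dev/null
    done
}

# Usage (endpoint and bucket names as in the question):
#   copy_bucket https://myendpoint.url mybucket mybucket-tmp
```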

Move files in S3 bucket to folder based on file name pattern

I have an S3 bucket with a few thousand files where the file names always match the pattern {hostname}.{contenttype}.{yyyyMMddHH}.zip. I want to create a script that will run once a day to move these files into folders based on the year and month in the file name.
If I try the following aws-cli command
aws s3 mv s3://mybucket/*.202001* s3://mybucket/202001/
I get the following error:
fatal error: An error occurred (404) when calling the HeadObject operation: Key "*.202001*" does not exist
Is there an aws-cli command that I could run on a schedule to achieve this?
I think the way forward is the --exclude and --include filter parameters supported by the S3 CLI commands.
So, for your case,
aws s3 mv s3://mybucket/ s3://mybucket/202001/ --recursive --exclude "*" --include "*.202001*"
should probably do the trick.
For scheduling the CLI command to run daily, see the question "On AWS, run an AWS CLI command daily".
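For a daily job that should not hard-code the month, the folder can be derived from each file name instead. The following is a minimal sketch, assuming keys contain no newlines and that only top-level objects should be moved; the function names month_of and move_by_month are hypothetical. Listing with --delimiter / keeps objects that already sit in a folder out of the loop.

```shell
#!/bin/sh
# month_of NAME
# Prints yyyyMM for a name like {hostname}.{contenttype}.{yyyyMMddHH}.zip,
# or nothing when the name does not match the pattern.
month_of() {
    printf '%s\n' "$1" |
        sed -n 's/^.*\.\([0-9]\{6\}\)[0-9]\{4\}\.zip$/\1/p'
}

# move_by_month BUCKET
# Hypothetical helper: moves each top-level object into its yyyyMM folder.
move_by_month() {
    bucket=$1
    aws s3api list-objects --bucket "$bucket" --delimiter / \
        --query 'Contents[].[Key]' --output text |
    while IFS= read -r key; do
        month=$(month_of "$key")
        # Skip anything that does not follow the naming pattern.
        [ -n "$month" ] || continue
        aws s3 mv "s3://$bucket/$key" "s3://$bucket/$month/$key" </dev/null
    done
}

# Usage:
#   move_by_month mybucket
```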

aws-cli fails to work with one particular S3 bucket on one particular machine

I'm trying to empty an AWS S3 bucket and then copy new objects into it:
aws s3 rm s3://BUCKET_NAME --region us-east-2 --recursive
aws s3 cp ./ s3://BUCKET_NAME/ --region us-east-2 --recursive
The first command fails with the following error:
An error occurred (InvalidRequest) when calling the ListObjects
operation: You are attempting to operate on a bucket in a region that
requires Signature Version 4. You can fix this issue by explicitly
providing the correct region location using the --region argument, the
AWS_DEFAULT_REGION environment variable, or the region variable in the
AWS CLI configuration file. You can get the bucket's location by
running "aws s3api get-bucket-location --bucket BUCKET". Completed 1
part(s) with ... file(s) remaining
Well, the error prompt is self-explanatory but the problem is that I've already applied the solution (I've added the --region argument) and I'm completely sure that it is the correct region (I got the region the same way the error message is suggesting).
To make things even more interesting, the error happens in a GitLab CI environment (let's just say some server). Just before this error occurs, the exact same command is run against other buckets and works. It's worth mentioning that those other buckets are in different regions.
To top it all off, I can execute the command on my personal computer with the same credentials as on the CI server. To summarize:
server$ aws s3 rm s3://OTHER_BUCKET --region us-west-2 --recursive <== works
server$ aws s3 rm s3://BUCKET_NAME --region us-east-2 --recursive <== fails
my_pc$ aws s3 rm s3://BUCKET_NAME --region us-east-2 --recursive <== works
Does anyone have any pointers as to what the problem might be?
For anyone else who might be facing the same problem: make sure your AWS CLI is up to date!
server$ aws --version
aws-cli/1.10.52 Python/2.7.14 Linux/4.13.9-coreos botocore/1.4.42
my_pc$ aws --version
aws-cli/1.14.58 Python/3.6.5 Linux/4.13.0-38-generic botocore/1.9.11
Once I updated the server's aws cli tool, everything worked. Now my server is:
server$ aws --version
aws-cli/1.14.49 Python/2.7.14 Linux/4.13.5-coreos-r2 botocore/1.9.2
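A CI job can guard against this by checking the installed version before running any S3 commands. The following is a minimal sketch; the function names are hypothetical, and the threshold 1.14.49 is only the version that worked on the server above, not a documented minimum.

```shell
#!/bin/sh
# parse_awscli_version LINE
# "aws-cli/1.10.52 Python/2.7.14 ..." -> "1.10.52"
parse_awscli_version() {
    printf '%s\n' "$1" | sed -n 's|^aws-cli/\([0-9][0-9.]*\).*|\1|p'
}

# version_at_least HAVE WANT
# Succeeds when HAVE >= WANT, comparing dotted numeric fields left to right.
version_at_least() {
    awk -v have="$1" -v want="$2" 'BEGIN {
        n = split(have, h, ".")
        m = split(want, w, ".")
        for (i = 1; i <= (n > m ? n : m); i++) {
            if ((h[i] + 0) > (w[i] + 0)) exit 0
            if ((h[i] + 0) < (w[i] + 0)) exit 1
        }
        exit 0
    }'
}

# Usage (older CLI releases print the version on stderr, hence 2>&1):
#   installed=$(parse_awscli_version "$(aws --version 2>&1)")
#   version_at_least "$installed" 1.14.49 || echo "aws-cli $installed is too old" >&2
```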

How to start pig with -t ColumnMapKeyPrune on aws emr

In my Pig script I want the file name attached to each record for further processing, so I used the -tagFile option. After adding -tagFile, the column names became misaligned, so, following a blog post, I used the command below to get only the required columns:
pig -x mapreduce -t ColumnMapKeyPrune
Now I want to run the script on AWS EMR, but I am not sure how to enable the -t ColumnMapKeyPrune option for Pig on EMR.
I am using the AWS CLI to create the cluster and submit jobs.
Any pointers on how to enable -t ColumnMapKeyPrune for Pig on EMR?
I found the solution. The -t flag turns the named optimizer rule off, and the same can be done from inside the script by adding the line below:
set pig.optimizer.rules.disabled 'ColumnMapKeyPrune';

Cannot use apache flink in amazon emr

I cannot start a YARN session of Apache Flink on Amazon EMR. These are the commands I run, followed by the error message I get:
$ tar xvzf flink-0.9.0-bin-hadoop26.tgz
$ cd flink-0.9.0
$ ./bin/ -n 4 -jm 1024 -tm 4096
Diagnostics: File file:/home/hadoop/.flink/application_1439466798234_0008/flink-conf.yaml does not exist File file:/home/hadoop/.flink/application_1439466798234_0008/flink-conf.yaml does not exist
I am using Flink version 0.9 and Amazon's Hadoop version 4.0.0. Any ideas or hints?
The full log can be found here:
From the log:
The file system scheme is 'file'. This indicates that the specified Hadoop configuration path is wrong and the sytem is using the default Hadoop configuration values.The Flink YARN client needs to store its files in a distributed file system
Flink failed to read the Hadoop configuration files. They are either picked up from the environment variables, e.g. HADOOP_HOME, or you can set the configuration dir in the flink-conf.yaml before you execute your YARN command.
Flink needs to read the Hadoop configuration to know how to upload the Flink jar to the cluster file system such that the newly created YARN cluster can access it. If Flink fails to resolve the Hadoop configuration, it uses the local file system for uploading the jar. That means that the jar will be put on the machine you launch your cluster from. Thus, it won't be accessible from the Flink YARN cluster.
Please see the Flink configuration page for more information.
Edit: On Amazon EMR, export HADOOP_CONF_DIR=/etc/hadoop/conf lets Flink discover the Hadoop configuration directory.
If I were you, I would try this:
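Before starting the YARN session, it can help to confirm that the directory really contains a Hadoop configuration. The following is a minimal sketch; the function name use_hadoop_conf is hypothetical, and checking for core-site.xml is an assumption about what a valid configuration directory contains.

```shell
#!/bin/sh
# use_hadoop_conf [DIR]
# Hypothetical helper: exports HADOOP_CONF_DIR when DIR looks like a Hadoop
# configuration directory. DIR defaults to /etc/hadoop/conf.
use_hadoop_conf() {
    dir=${1:-/etc/hadoop/conf}
    if [ -f "$dir/core-site.xml" ]; then
        HADOOP_CONF_DIR=$dir
        export HADOOP_CONF_DIR
    else
        echo "No core-site.xml in $dir; Flink would fall back to file://" >&2
        return 1
    fi
}

# Usage: call it in the same shell, then start the Flink YARN session
# exactly as in the question.
#   use_hadoop_conf /etc/hadoop/conf
```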
./bin/ -n 1 -jm 768 -tm 768