AWS CLI for S3 Select - SQL

I have the following code, which runs a SQL query against a keyfile located in an S3 bucket. This works perfectly. My question is: I do not wish to have the output written to an output file. Could I instead see the output on the screen (my preference #1)? If not, is there a way to append to the output file rather than overwrite it (my preference #2)? I am using the AWS CLI binaries to run this query. If there is another way, I am happy to try it (as long as it stays within bash).
aws s3api select-object-content \
--bucket "project2" \
--key keyfile1 \
--expression "SELECT * FROM s3object s where Lower(s._1) = 'email#search.com'" \
--expression-type 'SQL' \
--input-serialization '{"CSV": {"FieldDelimiter": ":"}, "CompressionType": "GZIP"}' \
--output-serialization '{"CSV": {"FieldDelimiter": ":"}}' "OutputFile"

Of course, you can use the AWS CLI to do this, since stdout is just a special file in Linux.
aws s3api select-object-content \
--bucket "project2" \
--key keyfile1 \
--expression "SELECT * FROM s3object s where Lower(s._1) = 'email#search.com'" \
--expression-type 'SQL' \
--input-serialization '{"CSV": {"FieldDelimiter": ":"}, "CompressionType": "GZIP"}' \
--output-serialization '{"CSV": {"FieldDelimiter": ":"}}' /dev/stdout
Note the /dev/stdout at the end.

The AWS CLI does not offer such options.
However, you are welcome to call S3 Select via an AWS SDK of your choice instead.
For example, the boto3 Python SDK has a select_object_content() function that returns the data as a stream. You can then read, manipulate, print, or save it however you wish.
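A minimal sketch with boto3, reusing the bucket, key, and serialization settings from the question (the file name is only illustrative), might look like this:
import boto3

s3 = boto3.client("s3")
resp = s3.select_object_content(
    Bucket="project2",
    Key="keyfile1",
    Expression="SELECT * FROM s3object s WHERE Lower(s._1) = 'email#search.com'",
    ExpressionType="SQL",
    InputSerialization={"CSV": {"FieldDelimiter": ":"}, "CompressionType": "GZIP"},
    OutputSerialization={"CSV": {"FieldDelimiter": ":"}},
)
# The Payload is an event stream; Records events carry the query results.
with open("OutputFile", "a") as out:  # "a" appends instead of overwriting
    for event in resp["Payload"]:
        if "Records" in event:
            chunk = event["Records"]["Payload"].decode("utf-8")
            print(chunk, end="")  # preference #1: show on screen
            out.write(chunk)      # preference #2: append to a file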

I think it opens /dev/stdout twice, causing chaos.

Related

Using the `s3fs` Python library with Task IAM role credentials on AWS Batch

I'm trying to get an ML job to run on AWS Batch. The job runs in a docker container, using credentials generated for a Task IAM Role.
I use DVC to manage the large data files needed for the task, which are hosted in an S3 repository. However, when the task tries to pull the data files, it gets an access denied message.
I can verify that the role has permissions to the bucket, because I can access the exact same files if I run an aws s3 cp command (as shown in the example below). But, I need to do it through DVC so that it downloads the right version of each file and puts it in the expected place.
I've been able to trace the problem down to s3fs, which DVC uses to integrate with S3. As I demonstrate in the example below, it gets an access denied message even when I use s3fs by itself and pass in the credentials explicitly. It seems to fail on this line, where it tries to list the object's path after failing to find the object via a head_object call.
I suspect there may be a bug in s3fs, or in the particular combination of boto, http, and s3 libraries. Can anyone help me figure out how to fix this?
Here is a minimal reproducible example:
Shell script for the job:
#!/bin/bash
AWS_CREDENTIALS=$(curl http://169.254.170.2$AWS_CONTAINER_CREDENTIALS_RELATIVE_URI)
export AWS_DEFAULT_REGION=us-east-1
export AWS_ACCESS_KEY_ID=$(echo "$AWS_CREDENTIALS" | jq .AccessKeyId -r)
export AWS_SECRET_ACCESS_KEY=$(echo "$AWS_CREDENTIALS" | jq .SecretAccessKey -r)
export AWS_SESSION_TOKEN=$(echo "$AWS_CREDENTIALS" | jq .Token -r)
echo "AWS_ACCESS_KEY_ID=<$AWS_ACCESS_KEY_ID>"
echo "AWS_SECRET_ACCESS_KEY=<$(cat <(echo "$AWS_SECRET_ACCESS_KEY" | head -c 6) <(echo -n "...") <(echo "$AWS_SECRET_ACCESS_KEY" | tail -c 6))>"
echo "AWS_SESSION_TOKEN=<$(cat <(echo "$AWS_SESSION_TOKEN" | head -c 6) <(echo -n "...") <(echo "$AWS_SESSION_TOKEN" | tail -c 6))>"
dvc doctor
# Succeeds!
aws s3 ls s3://company-dvc/repo/
# Succeeds!
aws s3 cp s3://company-dvc/repo/00/0e4343c163bd70df0a6f9d81e1b4d2 mycopy.txt
# Fails!
python3 download_via_s3fs.py
download_via_s3fs.py:
import os
import s3fs
# Just to make sure we're reading the credentials correctly.
print(os.environ["AWS_ACCESS_KEY_ID"])
print(os.environ["AWS_SECRET_ACCESS_KEY"])
print(os.environ["AWS_SESSION_TOKEN"])
print("running with credentials")
fs = s3fs.S3FileSystem(
    key=os.environ["AWS_ACCESS_KEY_ID"],
    secret=os.environ["AWS_SECRET_ACCESS_KEY"],
    token=os.environ["AWS_SESSION_TOKEN"],
    client_kwargs={"region_name": "us-east-1"}
)
# Fails with "access denied" on ListObjectsV2
print(fs.exists("company-dvc/repo/00/0e4343c163bd70df0a6f9d81e1b4d2"))
Terraform for IAM role:
data "aws_iam_policy_document" "standard-batch-job-role" {
# S3 read access to related buckets
statement {
actions = [
"s3:Get*",
"s3:List*",
]
resources = [
data.aws_s3_bucket.company-dvc.arn,
"${data.aws_s3_bucket.company-dvc.arn}/*",
]
effect = "Allow"
}
}
Environment
OS: Ubuntu 20.04
Python: 3.10
s3fs: 2023.1.0
boto3: 1.24.59
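Not a fix, but a quick way to narrow this down is to issue the same ListObjectsV2 call directly with boto3, using the identical explicit credentials; if that also gets access denied, the problem lies with the credentials or policy rather than with s3fs. A rough diagnostic sketch, reusing the bucket and key from the example above:
import os
import boto3

# Same explicit credentials the s3fs example passes in.
s3 = boto3.client(
    "s3",
    region_name="us-east-1",
    aws_access_key_id=os.environ["AWS_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["AWS_SECRET_ACCESS_KEY"],
    aws_session_token=os.environ["AWS_SESSION_TOKEN"],
)

# s3fs falls back to listing the key's prefix when head_object fails,
# so reproduce that call directly.
resp = s3.list_objects_v2(
    Bucket="company-dvc",
    Prefix="repo/00/0e4343c163bd70df0a6f9d81e1b4d2",
)
print(resp.get("KeyCount", 0))
If the direct call succeeds while s3fs still fails, that points back at s3fs or its configuration.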

Using aws emr add-steps for spark-submit

We have a complicated spark-submit command that we would like to submit to AWS EMR using the aws emr add-steps CLI command. We are having trouble figuring out the correct syntax to use. For example, consider the example command from Apache's Running Spark on YARN page:
$ ./bin/spark-submit --class my.main.Class \
--master yarn \
--deploy-mode cluster \
--jars my-other-jar.jar,my-other-other-jar.jar \
my-main-jar.jar \
app_arg1 app_arg2
Following the guidance from this EMR command-runner page, we created something like this:
$ aws emr add-steps \
--cluster-id j-123456789 \
--steps Type=CUSTOM_JAR,Name='Test_Job',Jar='command-runner.jar',ActionOnFailure=CONTINUE,Args=[\
./bin/spark-submit,\
--class,my.main.Class,\
--master,yarn,\
--deploy-mode cluster,\
--jars,my-other-jar.jar,my-other-other-jar.jar,my-main-jar.jar,app_arg1,app_arg2]
However, due to the apparent splitting at commas, the command appears to associate only "my-other-jar" with --jars, while "my-other-other-jar" is not included. I'm hoping somebody can tell us the proper syntax to use. For example, should we repeat --jars for each extra jar, like:
--jars,my-other-jar.jar,--jars,my-other-other-jar.jar
or maybe there is some special list syntax, e.g.,
--jars,[my-other-jar.jar,my-other-other-jar.jar]
or something else. Can anybody tell us, or point us to, the correct syntax to use for spark-submit arguments that might take a list, i.e., not just --jars, but also --conf, --files, ...?
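Not an answer on the CLI quoting itself, but one way to sidestep the comma parsing entirely is to add the step through an SDK, where the arguments form a native list; a sketch with boto3 (cluster ID and jar names taken from the question, and plain spark-submit as command-runner.jar normally expects):
import boto3

emr = boto3.client("emr")

# Each argument is its own list element, so the comma inside the --jars
# value is passed through literally instead of being split.
emr.add_job_flow_steps(
    JobFlowId="j-123456789",
    Steps=[{
        "Name": "Test_Job",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--class", "my.main.Class",
                "--master", "yarn",
                "--deploy-mode", "cluster",
                "--jars", "my-other-jar.jar,my-other-other-jar.jar",
                "my-main-jar.jar",
                "app_arg1", "app_arg2",
            ],
        },
    }],
)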

Any way to use presigned URL uploads and enforce tagging?

Is there any way to issue a presigned URL to a client to upload a file to S3, and ensure that the uploaded file has certain tags? Using the Python SDK here as an example, this generates a URL as desired:
s3.generate_presigned_url(
    'put_object',
    ExpiresIn=3600,
    Params=dict(Bucket='foo',
                Key='bar',
                ContentType='text/plain',
                Tagging='foo=bar'))
This is satisfactory when uploading while explicitly providing tags:
$ curl 'https://foo.s3.amazonaws.com/bar?AWSAccessKeyId=...&Signature=...&content-type=text%2Fplain&x-amz-tagging=foo%3Dbar&Expires=1538404508' \
-X PUT \
-H 'Content-Type: text/plain' \
-H 'x-amz-tagging: foo=bar' \
--data-binary foobar
However, S3 also accepts the request when omitting -H 'x-amz-tagging: foo=bar', which uploads the object without tags. Since I don't have control over the client, that's… bad.
I've tried creating an empty object first and tagging it, then issuing the presigned URL to it, but PUTting the object replaces it entirely, including removing any tags.
I've tried issuing a presigned POST URL, but that doesn't seem to support the tagging parameter at all:
s3.generate_presigned_post('foo', 'bar', {'tagging': '<Tagging><TagSet><Tag><Key>Foo</Key><Value>Bar</Value></Tag></TagSet></Tagging>'})
$ curl https://foo.s3.amazonaws.com/ \
-F key=bar \
-F 'tagging=<Tagging><TagSet><Tag><Key>Foo</Key><Value>Bar</Value></Tag></TagSet></Tagging>' \
-F AWSAccessKeyId=... \
-F policy=... \
-F signature=... \
-F file=@/tmp/foo
<Error><Code>AccessDenied</Code><Message>Invalid according to Policy:
Extra input fields: tagging</Message>...
I simply want to let a client upload a file directly to S3, and ensure that it's tagged a certain way in the process. Any way to do that?
Try the following code:
import copy

fields = {
    "x-amz-meta-u1": "value1",
    "x-amz-meta-u2": "value2"
}
conditions = [
    {"x-amz-meta-u1": "value1"},
    {"x-amz-meta-u2": "value2"}
]
presigned_url = s3_client.generate_presigned_post(
    bucket_name, "YOUR_OBJECT_KEY",
    Fields=copy.deepcopy(fields),
    Conditions=copy.deepcopy(conditions)
)
Python code:
fields = {
    'tagging': '<Tagging><TagSet><Tag><Key>Foo</Key><Value>Bar</Value></Tag></TagSet></Tagging>',
}
conditions = [
    {'tagging': '<Tagging><TagSet><Tag><Key>Foo</Key><Value>Bar</Value></Tag></TagSet></Tagging>'}
]
presigned_url = s3_client.generate_presigned_post(
    Bucket="foo",
    Key="file/key.json",
    Fields=copy.deepcopy(fields),
    Conditions=copy.deepcopy(conditions)
)
CURL command:
$ curl -v --form-string "tagging=<Tagging><TagSet><Tag><Key>Foo</Key><Value>Bar</Value></Tag></TagSet></Tagging>" \
-F key=file/key.json \
-F x-amz-algorithm=... \
-F x-amz-credential=... \
-F x-amz-date=... \
-F x-amz-security-token=... \
-F policy=... \
-F x-amz-signature=... \
-F file=@key.json \
https://foo.s3.amazonaws.com/
Explanation
It is imperative that --form-string is used in the CURL command, otherwise CURL will interpret the =< as reading in a file!
Also ensure that key.json is in your current working directory for CURL to upload the file to S3 using the pre-signed-url.

Resuming interrupted s3 download with awscli

I was downloading a file using awscli:
$ aws s3 cp s3://mybucket/myfile myfile
But the download was interrupted (computer went to sleep). How can I continue the download? S3 supports the Range header, but awscli s3 cp doesn't let me specify it.
The file is not publicly accessible so I can't use curl to specify the header manually.
There is a "hidden" command in the awscli tool which allows lower level access to S3: s3api.† It is less user friendly (no s3:// URLs and no progress bar) but it does support the range specifier on get-object:
--range (string) Downloads the specified range bytes of an object. For
more information about the HTTP range header, go to
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35.
Here's how to continue the download:
$ size=$(stat -f%z myfile) # assumes OS X. Change for your OS
$ aws s3api get-object \
--bucket mybucket \
--key myfile \
--range "bytes=$size-" \
/dev/fd/3 3>>myfile
You can use pv for a rudimentary progress bar:
$ aws s3api get-object \
--bucket mybucket \
--key myfile \
--range "bytes=$size-" \
/dev/fd/3 3>&1 >&2 | pv >> myfile
(The reason for this unnamed-pipe rigmarole is that s3api writes a debug message to stdout at the end of the operation, which would pollute your file. This solution rebinds stdout to stderr and keeps the pipe free for the actual file contents. The version without pv could technically write to stderr instead (/dev/fd/2 and 2>), but if an error occurs, s3api writes to stderr, and that output would then get appended to your file. It is therefore safer to use a dedicated descriptor there as well.)
† In git speak, s3 is porcelain, and s3api is plumbing.
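If you would rather avoid the file-descriptor juggling, roughly the same resume logic can be expressed with boto3's get_object and its Range parameter; a minimal sketch, assuming the same bucket, key, and local file names as above:
import os
import boto3

bucket, key, local_file = "mybucket", "myfile", "myfile"
size = os.path.getsize(local_file)  # bytes already downloaded

s3 = boto3.client("s3")
resp = s3.get_object(Bucket=bucket, Key=key, Range=f"bytes={size}-")

# Append only the remaining bytes to the partial local file.
with open(local_file, "ab") as f:
    for chunk in resp["Body"].iter_chunks(chunk_size=1024 * 1024):
        f.write(chunk)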
Use s3cmd; it has a --continue option built in. Example:
# Start a download
> s3cmd get s3://yourbucket/yourfile ./
download: 's3://yourbucket/yourfile' -> './yourfile' [1 of 1]
123456789 of 987654321 12.5% in 235s 0.5 MB/s
[ctrl-c] interrupt
# Pick up where you left off
> s3cmd --continue get s3://yourbucket/yourfile ./
Note that s3cmd is not multithreaded, whereas awscli is, so awscli is faster. A currently maintained fork of s3cmd, called s4cmd, appears to provide multithreaded capability while maintaining the usability features of s3cmd:
https://github.com/bloomreach/s4cmd

Filter S3 list-objects results to find a key matching a pattern

I would like to use the AWS CLI to query the contents of a bucket and see if a particular file exists, but the bucket contains thousands of files. How can I filter the results to only show key names that match a pattern? For example:
aws s3api list-objects --bucket myBucketName --query "Contents[?Key==*mySearchPattern*]"
The --query argument uses JMESPath expressions. JMESPath has a built-in function, contains, that allows you to search for a string pattern.
This should give the desired results:
aws s3api list-objects --bucket myBucketName --query "Contents[?contains(Key, `mySearchPattern`)]"
(On Linux I needed to use single quotes ' rather than backticks ` around mySearchPattern.)
If you want to search for keys starting with certain characters, you can also use the --prefix argument:
aws s3api list-objects --bucket myBucketName --prefix "myPrefixToSearchFor"
I tried this on Ubuntu 14 with awscli 1.2:
--query "Contents[?contains(Key,'stati')].Key"
--query "Contents[?contains(Key,\'stati\')].Key"
--query "Contents[?contains(Key,`stati`)].Key"
and got errors like:
Illegal token value '?contains(Key,'stati')].Key'
After upgrading the AWS CLI to 1.16, this worked:
--query "Contents[?contains(Key,'stati')].Key"