Using the `s3fs` python library with Task IAM role credentials on AWS Batch - amazon-s3

I'm trying to get an ML job to run on AWS Batch. The job runs in a docker container, using credentials generated for a Task IAM Role.
I use DVC to manage the large data files needed for the task, which are hosted in an S3 repository. However, when the task tries to pull the data files, it gets an access denied message.
I can verify that the role has permissions to the bucket, because I can access the exact same files if I run an aws s3 cp command (as shown in the example below). But, I need to do it through DVC so that it downloads the right version of each file and puts it in the expected place.
I've been able to trace down the problem to s3fs, which is used by DVC to integrate with S3. As I demonstrate in the example below, it gets an access denied message even when I use s3fs by itself, passing in the credentials explicitly. It seems to fail on this line, where it tries to list the contents of the file after failing to find the object via a head_object call.
I suspect there may be a bug in s3fs, or in the particular combination of boto, http, and s3 libraries. Can anyone help me figure out how to fix this?
Here is a minimal reproducible example:
Shell script for the job:
#!/bin/bash
AWS_CREDENTIALS=$(curl http://169.254.170.2$AWS_CONTAINER_CREDENTIALS_RELATIVE_URI)
export AWS_DEFAULT_REGION=us-east-1
export AWS_ACCESS_KEY_ID=$(echo "$AWS_CREDENTIALS" | jq .AccessKeyId -r)
export AWS_SECRET_ACCESS_KEY=$(echo "$AWS_CREDENTIALS" | jq .SecretAccessKey -r)
export AWS_SESSION_TOKEN=$(echo "$AWS_CREDENTIALS" | jq .Token -r)
echo "AWS_ACCESS_KEY_ID=<$AWS_ACCESS_KEY_ID>"
echo "AWS_SECRET_ACCESS_KEY=<$(cat <(echo "$AWS_SECRET_ACCESS_KEY" | head -c 6) <(echo -n "...") <(echo "$AWS_SECRET_ACCESS_KEY" | tail -c 6))>"
echo "AWS_SESSION_TOKEN=<$(cat <(echo "$AWS_SESSION_TOKEN" | head -c 6) <(echo -n "...") <(echo "$AWS_SESSION_TOKEN" | tail -c 6))>"
dvc doctor
# Succeeds!
aws s3 ls s3://company-dvc/repo/
# Succeeds!
aws s3 cp s3://company-dvc/repo/00/0e4343c163bd70df0a6f9d81e1b4d2 mycopy.txt
# Fails!
python3 download_via_s3fs.py
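The credential-fetching part of the shell script above can also be written in Python. This is a minimal sketch, assuming the container credentials endpoint returns a JSON document with the `AccessKeyId`, `SecretAccessKey`, and `Token` fields that the `jq` calls read. The function names `fetch_container_credentials`, `credentials_to_env`, and `mask` are hypothetical helpers for illustration, not part of any library.

```python
import json
import os
import urllib.request

# Link-local endpoint used by the shell script above.
CREDENTIALS_HOST = "http://169.254.170.2"


def fetch_container_credentials() -> str:
    """Return the raw JSON credentials document for the task role."""
    relative_uri = os.environ["AWS_CONTAINER_CREDENTIALS_RELATIVE_URI"]
    with urllib.request.urlopen(CREDENTIALS_HOST + relative_uri, timeout=5) as resp:
        return resp.read().decode("utf-8")


def credentials_to_env(payload: str) -> dict:
    """Map the JSON fields to the environment variable names exported by the script."""
    doc = json.loads(payload)
    return {
        "AWS_ACCESS_KEY_ID": doc["AccessKeyId"],
        "AWS_SECRET_ACCESS_KEY": doc["SecretAccessKey"],
        "AWS_SESSION_TOKEN": doc["Token"],
    }


def mask(value: str, keep: int = 6) -> str:
    """Show only the first and last `keep` characters of a secret."""
    if len(value) <= 2 * keep:
        return "..."
    return f"{value[:keep]}...{value[-keep:]}"
```

Inside the job, one could call `os.environ.update(credentials_to_env(fetch_container_credentials()))` before creating the filesystem object, and print `mask(value)` for each entry to confirm the values were read without exposing them in full.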
download_via_s3fs.py:
import os
import s3fs
# Just to make sure we're reading the credentials correctly.
print(os.environ["AWS_ACCESS_KEY_ID"])
print(os.environ["AWS_SECRET_ACCESS_KEY"])
print(os.environ["AWS_SESSION_TOKEN"])
print("running with credentials")
fs = s3fs.S3FileSystem(
    key=os.environ["AWS_ACCESS_KEY_ID"],
    secret=os.environ["AWS_SECRET_ACCESS_KEY"],
    token=os.environ["AWS_SESSION_TOKEN"],
    client_kwargs={"region_name": "us-east-1"},
)
# Fails with "access denied" on ListObjectsV2
print(fs.exists("company-dvc/repo/00/0e4343c163bd70df0a6f9d81e1b4d2"))
Terraform for IAM role:
data "aws_iam_policy_document" "standard-batch-job-role" {
  # S3 read access to related buckets
  statement {
    actions = [
      "s3:Get*",
      "s3:List*",
    ]
    resources = [
      data.aws_s3_bucket.company-dvc.arn,
      "${data.aws_s3_bucket.company-dvc.arn}/*",
    ]
    effect = "Allow"
  }
}
Environment
OS: Ubuntu 20.04
Python: 3.10
s3fs: 2023.1.0
boto3: 1.24.59

Related

How to see output from executors on Amazon EMR?

I am running the following code on AWS EMR:
from pyspark.sql import SparkSession
spark = SparkSession\
    .builder\
    .appName("PythonPi")\
    .getOrCreate()
sc = spark.sparkContext
def f(_):
    print("executor running")  # <= I can not find this output
    return 1
from operator import add
output = sc.parallelize(range(1, 3), 2).map(f).reduce(add)
print(output)  # <= I found this output
spark.stop()
I am recording logs to s3 (Log URI is s3://brand17-logs/).
I can see output from master node here:
s3://brand17-logs/j-20H1NGEP519IG/containers/application_1618292556240_0001/container_1618292556240_0001_01_000001/stdout.gz
Where can I see output from executor node ?
I see this output when running locally.
You are almost there; you are already browsing the right log files.
The general convention for the stored logs is something like this: inside the containers path, each application ID directory holds several container directories. The first one (something like container_1618292556240_0001_01_000001, ending in 000001) belongs to the driver node, and the rest belong to the executors.
I have no official documentation that states this, but I have seen it in all my clusters.
So if you browse to the other container directories, you will be able to see the executor log files.
Having said that, it is very painful to browse through so many executors and search for the log.
How I personally view the logs from an EMR cluster:
Log in to an EC2 instance that has enough access to download the files from the S3 location where the EMR logs are saved.
Navigate to a working directory on the instance.
mkdir -p /tmp/debug-log/ && cd /tmp/debug-log/
Download all the files from S3 in a recursive manner.
aws s3 cp --recursive s3://your-bucket-name/cluster-id/ .
In your case, it would be
`aws s3 cp --recursive s3://brand17-logs/j-20H1NGEP519IG/ .`
Uncompress the log files:
find . -type f -name '*.gz' -exec gunzip {} \;
Now that all the compressed files are uncompressed, we can do a recursive grep like below:
grep -inR "message-that-i-am-looking-for"
The grep flags mean the following:
i -> case insensitive
n -> display the line number where the message is found (the file name is shown as well, because several files are searched)
R -> search recursively
Open the file that grep points to in vi and read the surrounding log lines for more context.
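The download-then-grep steps above can also be done without uncompressing the files in place. This is a minimal sketch, assuming the logs were already copied to a local directory; `grep_gz_logs` is a hypothetical helper name, and the pattern is a Python regular expression rather than grep syntax.

```python
import gzip
import os
import re


def grep_gz_logs(root: str, pattern: str):
    """Yield (path, line_number, line) for each case-insensitive match under root."""
    regex = re.compile(pattern, re.IGNORECASE)
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in sorted(filenames):
            path = os.path.join(dirpath, filename)
            # Read .gz files through gzip; treat everything else as plain text.
            opener = gzip.open if filename.endswith(".gz") else open
            with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
                for line_number, line in enumerate(handle, start=1):
                    if regex.search(line):
                        yield path, line_number, line.rstrip("\n")
```

For example, `for hit in grep_gz_logs("/tmp/debug-log", "executor running"): print(*hit, sep=":")` prints one line per match in a form similar to the grep output.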
More readings can be found here:
View Log Files
access spark log

Locally test AWS Lambda container with .NET 5 web api and Lambda RIE

I'm following the instructions to locally test lambda container https://docs.aws.amazon.com/lambda/latest/dg/images-test.html
but I am unable to do so.
I've created a sample project to reproduce it https://gitlab.com/sunnyatticsoftware/sandbox/lambda-dotnet5-webapi (see the README for step by step on its generation)
Basically I am using an Amazon dotnet template that generates an AWS Lambda function as a .NET 5 web api using containers.
It's all good with the project. The Dockerfile is described as
FROM public.ecr.aws/lambda/dotnet:5.0
WORKDIR /var/task
COPY "bin/Release/net5.0/publish" .
Now I want to test it locally using the Amazon Lambda Runtime Interface Emulator (RIE) and these are the steps I follow:
Build project with dotnet build -c Release
Publish artifacts with dotnet publish -c Release
Build docker image with docker build -t lambda-dotnet .
Download the RIE with
mkdir -p ~/.aws-lambda-rie && curl -Lo ~/.aws-lambda-rie/aws-lambda-rie https://github.com/aws/aws-lambda-runtime-interface-emulator/releases/latest/download/aws-lambda-rie && chmod +x ~/.aws-lambda-rie/aws-lambda-rie
I can see the emulator downloaded properly
ls -la ~/.aws-lambda-rie/aws-lambda-rie
-rw-r--r-- 1 diego.martin 1049089 8155136 Feb 22 14:32 /c/Users/diego.martin/.aws-lambda-rie/aws-lambda-rie
Run the emulator passing the lambda image
docker run -d -v ~/.aws-lambda-rie:/aws-lambda -p 9000:8080 --entrypoint /aws-lambda/aws-lambda-rie lambda-dotnet:latest
Here is when I get the error
12997dddc6e50aca3020527be30a1479eee9ceef412ab5009b99e9eb8cf1fa67
docker: Error response from daemon: OCI runtime create failed: container_linux.go:370: starting container process caused: exec: "C:/Users/diego.martin/AppData/Local/Programs/Git/aws-lambda/aws-lambda-rie": stat C:/Users/diego.martin/AppData/Local/Programs/Git/aws-lambda/aws-lambda-rie: no such file or directory: unknown.
What am I missing? I am not specifying any entrypoint because I don't have any.
PS: The last step would be to send some lambda event to my container's function with
curl -XPOST "http://localhost:9000/2015-03-31/functions/function/invocations" -d '{}'
The Lambda Docker images for dotnet already include the RIE, so the following is enough (see the repo for further details):
To build image
docker build -t lambda-dotnet:latest .
To run it
docker run -p 9000:8080 lambda-dotnet "LambdaDotNet5::LambdaDotNet5.LambdaEntryPoint::FunctionHandlerAsync"
And then, to test it, I can use curl from a different terminal
curl -vX POST http://localhost:9000/2015-03-31/functions/function/invocations -d @test_request.json --header "Content-Type: application/json"
where the test_request.json file contains the JSON for the event I want to send to the Lambda.
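The same test invocation can be sent from Python instead of curl. This is a minimal sketch using only the standard library, assuming the container is running with port 9000 mapped as in the `docker run` command above; `build_invocation` and `invoke` are hypothetical helper names.

```python
import json
import urllib.request

# Path taken from the curl commands above.
INVOKE_PATH = "/2015-03-31/functions/function/invocations"


def build_invocation(event: dict, host: str = "localhost", port: int = 9000) -> urllib.request.Request:
    """Build the POST request that delivers `event` to the local container."""
    return urllib.request.Request(
        url=f"http://{host}:{port}{INVOKE_PATH}",
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def invoke(event: dict) -> str:
    """Send the event and return the response body as text."""
    with urllib.request.urlopen(build_invocation(event), timeout=30) as response:
        return response.read().decode("utf-8")
```

Calling `invoke(json.load(open("test_request.json")))` would send the same event file used in the curl example.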

AWS s3 copy hangs after large transfer when run in Ansible from Terraform

I'm provisioning a large (r5.2xl) ec2 instance with Terraform, and configuring using Ansible. The application's install kit, and data - 640GB, are in an S3 bucket in my region that I have full access to. My first play uses the shell module to invoke the aws s3 cp --recursive cli to move the data to a non-root, 2TB EBS volume, mounted at /opt/app.
I can tell that the task transfers all the data, but it never ends. My only symptom is a process on the target machine running AnsiballZ_command.py. It doesn't look like it's doing much, and killing it doesn't affect the playbook, or the Terraform process. I've validated that I can run the transfer directly from cli on the provisioned machine, and when I move the playbooks there, they run successfully too.
The only option at this point is to kill the TF, hosing my state file and forcing a manual tear down.
What's the proper way to do this?
My TF (v0.11.11) code, minus the connection info, run from a null_resource:
provisioner "remote-exec" {
inline = ["sudo apt install --yes cowsay"]
}
provisioner "local-exec" {
command = <<EOT
sleep 30;
>spookykat.ini;
echo "[spookykat]" | tee -a spookykat.ini;
echo "${element(aws_instance.this.*.private_ip, count.index)} ansible_user=${local.ec2-user} ansible_private_key_file=${var.private_key_path}" | tee -a spookykat.ini;
export ANSIBLE_HOST_KEY_CHECKING=False;
export ANSIBLE_NOCOLOR=true;
export ANSIBLE_LOG_PATH="./ansible_log.log";
export ANSIBLE_DISPLAY_ARGS_TO_STDOUT=true;
ansible-playbook -u ${local.ec2-user} --private-key ${var.private_key_path} -i spookykat.ini ${path.root}/playbooks/spookykat_configure.yml --extra-vars @${local.extra-vars-json}
EOT
}
And, my Ansible task:
- name: Sync Spatialkat S3 to local
  shell: >
    aws s3 cp --recursive s3://{{ src_bucket }}/{{ root_path }}/{{ release }}
    {{ dest_path }}/{{ root_path }}/{{ release }}

Resuming interrupted s3 download with awscli

I was downloading a file using awscli:
$ aws s3 cp s3://mybucket/myfile myfile
But the download was interrupted (computer went to sleep). How can I continue the download? S3 supports the Range header, but awscli s3 cp doesn't let me specify it.
The file is not publicly accessible so I can't use curl to specify the header manually.
There is a "hidden" command in the awscli tool which allows lower level access to S3: s3api.† It is less user friendly (no s3:// URLs and no progress bar) but it does support the range specifier on get-object:
--range (string) Downloads the specified range bytes of an object. For
more information about the HTTP range header, go to
http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35.
Here's how to continue the download:
$ size=$(stat -f%z myfile) # assumes OS X. Change for your OS
$ aws s3api get-object \
--bucket mybucket \
--key myfile \
--range "bytes=$size-" \
/dev/fd/3 3>>myfile
You can use pv for a rudimentary progress bar:
$ aws s3api get-object \
--bucket mybucket \
--key myfile \
--range "bytes=$size-" \
/dev/fd/3 3>&1 >&2 | pv >> myfile
(The reason for this unnamed pipe rigmarole is that s3api writes a debug message to stdout at the end of the operation, polluting your file. This solution rebinds stdout to stderr and frees up the pipe for regular file contents through an alias. The version without pv could technically write to stderr (/dev/fd/2 and 2>), but if an error occurs s3api writes to stderr, which would then get appended to your file. Thus, it is safer to use a dedicated pipe there, as well.)
† In git speak, s3 is porcelain, and s3api is plumbing.
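The same Range-based resume can be sketched in Python. This assumes a boto3 S3 client (or any object with a compatible `get_object` method) is passed in; `resume_download` is a hypothetical helper name, and the bucket and key are the placeholders from the commands above.

```python
import os


def resume_download(client, bucket: str, key: str, dest: str, chunk_size: int = 1024 * 1024) -> int:
    """Append the missing tail of s3://bucket/key to dest; return the bytes written."""
    offset = os.path.getsize(dest) if os.path.exists(dest) else 0
    # Open-ended range: from the first missing byte to the end of the object.
    # If dest is already complete, S3 is expected to reject the range as invalid.
    response = client.get_object(Bucket=bucket, Key=key, Range=f"bytes={offset}-")
    body = response["Body"]
    written = 0
    # Append mode, so the bytes already on disk are kept.
    with open(dest, "ab") as out:
        while True:
            chunk = body.read(chunk_size)
            if not chunk:
                break
            out.write(chunk)
            written += len(chunk)
    return written
```

With boto3 installed and credentials configured, usage would look like `resume_download(boto3.client("s3"), "mybucket", "myfile", "myfile")`. Because the body is written straight to the file handle, the file descriptor workaround needed with the CLI does not apply here.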
Use s3cmd; it has a --continue option built in. Example:
# Start a download
> s3cmd get s3://yourbucket/yourfile ./
download: 's3://yourbucket/yourfile' -> './yourfile' [1 of 1]
123456789 of 987654321 12.5% in 235s 0.5 MB/s
[ctrl-c] interrupt
# Pick up where you left off
> s3cmd --continue get s3://yourbucket/yourfile ./
Note that s3cmd is not multithreaded whereas awscli is, i.e. awscli is faster. A currently maintained fork of s3cmd, called s4cmd, appears to provide multithreaded capabilities while keeping the usability features of s3cmd:
https://github.com/bloomreach/s4cmd

Transporting with data from S3 amazon to local server

I am trying to import data from S3 using the script described below (which I sort of inherited). It's a bit long... The problem is that I keep receiving the following output:
The config profile (importer) could not be found
I am not a bash person, so be gentle, please. It seemed that some credentials were missing, or that something else was wrong with the configuration of "importer" on the local machine.
In the S3 configuration (the console) there is a user with the same name which, according to its permissions, can access the bucket and download data.
I have tried changing the access keys in the Amazon console for the user and creating a file named "credentials" in home/.aws (there was no .aws folder in the home dir by default, so I created it) containing the new keys. I also tried upgrading the AWS CLI with pip. Nothing helped.
Then I modified the "credentials" file, using [importer] as the profile name, so it looked like:
[importer]
aws_access_key = xxxxxxxxxxxxxxxxx
aws_secret+key = xxxxxxxxxxxxxxxxxxx
It appears that I have got past the "misconfiguration":
A client error (InvalidAccessKeyId) occurred when calling the ListObjects operation: The AWS Access Key Id you provided does not exist in our records.
Completed 1 part(s) with ... file(s) remaining
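One way to sanity-check a profile like the one shown above is to parse the file and compare its key names with the ones the AWS CLI reads, `aws_access_key_id` and `aws_secret_access_key`. This is a minimal sketch using the standard library; `check_profile` is a hypothetical helper name, and treating any other key except `aws_session_token` as unexpected is a simplifying assumption.

```python
import configparser

# Key names the AWS CLI reads from a profile in the credentials file.
EXPECTED_KEYS = ("aws_access_key_id", "aws_secret_access_key")


def check_profile(credentials_text: str, profile: str) -> list:
    """Return the problems found for `profile`; an empty list means the key names look right."""
    parser = configparser.ConfigParser()
    parser.read_string(credentials_text)
    if not parser.has_section(profile):
        return [f"profile [{profile}] not found"]
    problems = []
    for key in EXPECTED_KEYS:
        if not parser.has_option(profile, key):
            problems.append(f"missing key: {key}")
    for key in parser.options(profile):
        # Assumption: only the two keys above plus an optional session token are expected.
        if key not in EXPECTED_KEYS and key != "aws_session_token":
            problems.append(f"unexpected key: {key}")
    return problems
```

Passing the contents of the credentials file and the profile name `importer` returns a list of messages that can be printed before running the sync.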
And here's the part where I am stuck... I placed the keys I obtained from Amazon into that config file and double-checked them. Any suggestions? I can't create any more keys because of the AWS per-user quota. Below is part of the script:
#!/bin/sh
echo "\n$0 started at: `date`"
incomming='/Database/incomming'
IFS='
';
mkdir -p ${incomming}
echo "syncing files from arrivals bucket to ${incomming} incomming folder"
echo aws --profile importer \
s3 --region eu-west-1 sync s3://path-to-s3-folder ${incomming}
aws --profile importer \
s3 --region eu-west-1 sync s3://path-to-s3-folder ${incomming}
count=0
echo ""
echo "Searching for zip files in ${incomming} folder"
for f in `find ${incomming} -name '*.zip'`;
do
echo "\n${count}: ${f} --------------------"
count=$((count+1))
name=`basename "$f" | cut -d'.' -f1`
dir=`dirname "$f"`
if [ -d "${dir}/${name}" ]; then
echo "\tWarning: directory "${dir}/${name}" already exist for file: ${f} ... - skipping - not imported"
continue
fi