How to see output from executors on Amazon EMR?

I am running the following code on AWS EMR:
from pyspark.sql import SparkSession
spark = SparkSession\
    .builder\
    .appName("PythonPi")\
    .getOrCreate()
sc = spark.sparkContext

def f(_):
    print("executor running")  # <= I cannot find this output
    return 1

from operator import add
output = sc.parallelize(range(1, 3), 2).map(f).reduce(add)
print(output)  # <= I found this output
spark.stop()
I am recording logs to S3 (the Log URI is s3://brand17-logs/).
I can see the output from the master node here:
s3://brand17-logs/j-20H1NGEP519IG/containers/application_1618292556240_0001/container_1618292556240_0001_01_000001/stdout.gz
Where can I see the output from the executor nodes?
I see this output when running locally.

You were almost there while browsing the log files.
The stored logs generally follow this convention: inside the containers path of an application (e.g. application_1618292556240_0001) there are multiple container directories; the first one (the one ending in 000001, such as container_1618292556240_0001_01_000001) holds the driver's logs, and the rest hold the executors' logs.
I haven't found official documentation stating this, but I have seen it in all my clusters.
So if you browse to the other container directories, you will be able to see the executor log files.
Having said that, it is quite painful to browse through so many executors and search the logs.
How I personally read the logs from an EMR cluster (a Python alternative is sketched after these steps):
Log in to an EC2 instance that has enough access to download files from the S3 bucket where the EMR logs are stored.
Create a scratch directory on the instance and move into it:
mkdir -p /tmp/debug-log/ && cd /tmp/debug-log/
Download all the files from S3 recursively:
aws s3 cp --recursive s3://your-bucket-name/cluster-id/ .
In your case, it would be
`aws s3 cp --recursive s3://brand17-logs/j-20H1NGEP519IG/ .`
Uncompress the log files:
find . -type f -exec gunzip {} \;
Now that all the compressed files are uncompressed, we can grep recursively:
grep -inR "message-that-i-am-looking-for"
The grep flags mean the following:
i -> case-insensitive match
n -> print the line number where the message appears
R -> search recursively
Open the file pointed to by the grep output (for example with vi) to see the surrounding, more relevant log in context.
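If you prefer not to download everything by hand, here is a minimal Python sketch of the same idea, assuming boto3 credentials are available; the bucket, cluster id, and search string are taken from the question. It scans every stdout.gz/stderr.gz under the containers prefix and greps it in memory:

import gzip
import boto3

# Bucket, prefix and search string are taken from the question; adjust as needed.
BUCKET = "brand17-logs"
PREFIX = "j-20H1NGEP519IG/containers/"
NEEDLE = "executor running"

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        key = obj["Key"]
        if not key.endswith(("stdout.gz", "stderr.gz")):
            continue
        # Download and decompress each container log in memory.
        body = s3.get_object(Bucket=BUCKET, Key=key)["Body"].read()
        text = gzip.decompress(body).decode("utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), 1):
            if NEEDLE.lower() in line.lower():
                print(f"{key}:{lineno}: {line}")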
More reading can be found here:
View Log Files
access spark log

Related

Using the `s3fs` python library with Task IAM role credentials on AWS Batch

I'm trying to get an ML job to run on AWS Batch. The job runs in a docker container, using credentials generated for a Task IAM Role.
I use DVC to manage the large data files needed for the task, which are hosted in an S3 repository. However, when the task tries to pull the data files, it gets an access denied message.
I can verify that the role has permissions to the bucket, because I can access the exact same files if I run an aws s3 cp command (as shown in the example below). But, I need to do it through DVC so that it downloads the right version of each file and puts it in the expected place.
I've been able to trace the problem down to s3fs, which is used by DVC to integrate with S3. As I demonstrate in the example below, it gets an access denied message even when I use s3fs by itself, passing in the credentials explicitly. It seems to fail on this line, where it tries to list the contents under the key after failing to find the object via a head_object call.
I suspect there may be a bug in s3fs, or in the particular combination of boto, http, and s3 libraries. Can anyone help me figure out how to fix this?
Here is a minimal reproducible example:
Shell script for the job:
#!/bin/bash
AWS_CREDENTIALS=$(curl http://169.254.170.2$AWS_CONTAINER_CREDENTIALS_RELATIVE_URI)
export AWS_DEFAULT_REGION=us-east-1
export AWS_ACCESS_KEY_ID=$(echo "$AWS_CREDENTIALS" | jq .AccessKeyId -r)
export AWS_SECRET_ACCESS_KEY=$(echo "$AWS_CREDENTIALS" | jq .SecretAccessKey -r)
export AWS_SESSION_TOKEN=$(echo "$AWS_CREDENTIALS" | jq .Token -r)
echo "AWS_ACCESS_KEY_ID=<$AWS_ACCESS_KEY_ID>"
echo "AWS_SECRET_ACCESS_KEY=<$(cat <(echo "$AWS_SECRET_ACCESS_KEY" | head -c 6) <(echo -n "...") <(echo "$AWS_SECRET_ACCESS_KEY" | tail -c 6))>"
echo "AWS_SESSION_TOKEN=<$(cat <(echo "$AWS_SESSION_TOKEN" | head -c 6) <(echo -n "...") <(echo "$AWS_SESSION_TOKEN" | tail -c 6))>"
dvc doctor
# Succeeds!
aws s3 ls s3://company-dvc/repo/
# Succeeds!
aws s3 cp s3://company-dvc/repo/00/0e4343c163bd70df0a6f9d81e1b4d2 mycopy.txt
# Fails!
python3 download_via_s3fs.py
download_via_s3fs.py:
import os
import s3fs
# Just to make sure we're reading the credentials correctly.
print(os.environ["AWS_ACCESS_KEY_ID"])
print(os.environ["AWS_SECRET_ACCESS_KEY"])
print(os.environ["AWS_SESSION_TOKEN"])
print("running with credentials")
fs = s3fs.S3FileSystem(
    key=os.environ["AWS_ACCESS_KEY_ID"],
    secret=os.environ["AWS_SECRET_ACCESS_KEY"],
    token=os.environ["AWS_SESSION_TOKEN"],
    client_kwargs={"region_name": "us-east-1"},
)
# Fails with "access denied" on ListObjectsV2
print(fs.exists("company-dvc/repo/00/0e4343c163bd70df0a6f9d81e1b4d2"))
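For reference, fs.exists boils down to roughly the following two S3 calls (a rough sketch of the behaviour described in the linked s3fs line, not its actual code). Running them directly with boto3, which picks up the same AWS_* environment variables, shows whether the HEAD or the fallback listing is the call that gets denied:

import boto3
from botocore.exceptions import ClientError

# Bucket and key taken from the failing fs.exists call above.
bucket = "company-dvc"
key = "repo/00/0e4343c163bd70df0a6f9d81e1b4d2"

s3 = boto3.client("s3", region_name="us-east-1")

try:
    # First attempt: HEAD the exact key (covered by s3:GetObject / s3:Get*).
    s3.head_object(Bucket=bucket, Key=key)
    print("head_object succeeded")
except ClientError as e:
    print("head_object failed:", e.response["Error"]["Code"])
    # Fallback: list the key as a prefix (covered by s3:ListBucket / s3:List*).
    resp = s3.list_objects_v2(Bucket=bucket, Prefix=key, Delimiter="/", MaxKeys=1)
    print("list_objects_v2 KeyCount:", resp.get("KeyCount", 0))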
Terraform for IAM role:
data "aws_iam_policy_document" "standard-batch-job-role" {
# S3 read access to related buckets
statement {
actions = [
"s3:Get*",
"s3:List*",
]
resources = [
data.aws_s3_bucket.company-dvc.arn,
"${data.aws_s3_bucket.company-dvc.arn}/*",
]
effect = "Allow"
}
}
Environment
OS: Ubuntu 20.04
Python: 3.10
s3fs: 2023.1.0
boto3: 1.24.59

google gsutil limit ls to a given depth level

I need to figure out what subfolders are present on a bucket in order to decide what path to sync.
ls -r gs://<my_bucket>/**
returns all files and folders, and I have a tree depth of more than 10 there!
How can I get the list of folders and subfolders only, down to a final depth of, let's say, 3, as with the find -maxdepth argument?
Thanks in advance
As per the documentation, there is no gsutil command to list folders/subfolders down to a maximum depth.
To get the top-level objects of a bucket using node.js, it can be done through the API as mentioned here:
`const [files, nextQuery, apiResponse] = await storage.bucket(bucketName).getFiles({autoPaginate: false, delimiter: "/", prefix: ""});`
For more information you can refer to the link and the GitHub case where a similar issue has been discussed.
This command is mentioned as a workaround:
gsutil ls -l gs://bucket_name/folder_name | xargs -I{} gsutil du -sh {}
There's no --max-depth support in gsutil du. You could write a script that lists the first-level folder names and then iterates over those folders, running gsutil ls -l $folder/* on each; a sketch of such a script follows.
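A minimal sketch of such a script in Python, using the google-cloud-storage client instead of gsutil; the bucket name and the depth limit of 3 are placeholders. The prefixes attribute is only populated once the iterator has been consumed, so this still pages through the objects under each prefix and is slower than a true max-depth listing:

from google.cloud import storage

client = storage.Client()

def list_prefixes(bucket_name, prefix="", depth=3):
    # Stop recursing once the requested depth is reached.
    if depth == 0:
        return
    blobs = client.list_blobs(bucket_name, prefix=prefix, delimiter="/")
    # Consuming the iterator populates blobs.prefixes with the "subfolders".
    for _ in blobs:
        pass
    for sub in sorted(blobs.prefixes):
        print(sub)
        list_prefixes(bucket_name, sub, depth - 1)

list_prefixes("my_bucket", depth=3)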

AWS EMR --steps

I am running the following .sh to run a command on AWS using EMR:
aws emr create-cluster --name "Big Matrix Re Run 5" --ami-version 3.1.0 --auto-terminate --log-uri FILE LOCATION --enable-debugging --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=c3.xlarge InstanceGroupType=CORE,InstanceCount=3,InstanceType=c3.xlarge --steps NAME AND LOCATION OF FILE
I've deleted the pertinent file name and locations as those aren't my issue, but I am having an issue with the --steps portion of the script.
How do I specify the steps that I want to run in the cluster? The documentation doesn't give any examples.
Here is the error:
Error parsing parameter '--steps': should be: Key value pairs, where values are separated by commas, and multiple pairs are separated by spaces.
--steps Name=string1,Jar=string1,ActionOnFailure=string1,MainClass=string1,Type=string1,Properties=string1,Args=string1,string2 Name=string1,Jar=string1,ActionOnFailure=string1,MainClass=string1,Type=string1,Properties=string1,Args=string1,string2
Thanks!
The documentation page for the AWS Command-Line Interface create-cluster command shows examples for using the --steps parameter.
Steps can be supplied inline on the command line or loaded from a local JSON file, and the scripts or jars they reference can live in HDFS or Amazon S3.
From a local JSON file:
aws emr create-cluster --steps file://./multiplefiles.json --ami-version 3.3.1 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge --auto-terminate
Inline, referencing scripts stored in Amazon S3:
aws emr create-cluster --steps Type=HIVE,Name='Hive program',ActionOnFailure=CONTINUE,ActionOnFailure=TERMINATE_CLUSTER,Args=[-f,s3://elasticmapreduce/samples/hive-ads/libs/model-build.q,-d,INPUT=s3://elasticmapreduce/samples/hive-ads/tables,-d,OUTPUT=s3://mybucket/hive-ads/output/2014-04-18/11-07-32,-d,LIBS=s3://elasticmapreduce/samples/hive-ads/libs] --applications Name=Hive --ami-version 3.1.0 --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m3.xlarge InstanceGroupType=CORE,InstanceCount=2,InstanceType=m3.xlarge
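If you end up scripting this from Python rather than the CLI, the same step fields (Name, ActionOnFailure, Jar, Args) can be passed to boto3. A rough sketch with a placeholder custom JAR step; the cluster id, jar location, class name, and arguments are all hypothetical and not part of the original question:

import boto3

emr = boto3.client("emr", region_name="us-east-1")  # region is an assumption

# Add a step to an already-running cluster; all values below are placeholders.
response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",
    Steps=[
        {
            "Name": "My step",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "s3://mybucket/my-jar.jar",
                "MainClass": "com.example.Main",
                "Args": ["arg1", "arg2"],
            },
        }
    ],
)
print(response["StepIds"])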

working with ebextensions in aws

I am working with AWS Elastic Beanstalk, and since I can't modify the httpd conf file to set AllowOverride All, I was advised to work with ebextensions:
http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/customize-containers.html
Hence, I have created an .ebextensions folder, and within it a setup.config file with the following command:
container_commands:
  01_setup_apache:
    command: "cp .ebextensions/enable_mod_rewrite.conf /etc/httpd/conf.d/enable_mod_rewrite.conf"
I am not even sure if this is the proper command to enable mod_rewrite, but I get the following error when trying to deploy:
[Instance: i-80bbbd77] Command failed on instance. Return code: 1 Output: cp: cannot stat '.ebextensions/enable_mod_rewrite.conf': No such file or directory. container_command 01_setup_apache in .ebextensions/setup.config failed. For more detail, check /var/log/eb-activity.log using console or EB CLI.
You can't copy from ".ebextensions/enable_mod_rewrite.conf" because that relative path will not be valid from the init script. Using an absolute path may work, but I'd suggest you fetch the file from S3 instead:
container_commands:
  01_setup_apache:
    command: "aws s3 cp s3://[my-ebextensions-bucket]/enable_mod_rewrite.conf /etc/httpd/conf.d/enable_mod_rewrite.conf"
But if you need complex changes to your instance, it may be a better option to run a docker container instead: http://docs.aws.amazon.com/elasticbeanstalk/latest/dg/create_deploy_docker.html

PgSQL - Export select query data direct to amazon s3 with headers

I have a requirement where I need to export report data directly to CSV, since fetching the query response, building the CSV, and then uploading the final CSV to Amazon takes time. Is there a way I can create the CSV directly with Redshift PostgreSQL?
PgSQL - export select query data directly to Amazon S3 servers, with headers.
Here is my version of PgSQL: PostgreSQL 8.0.2 on Amazon Redshift.
Thanks
You can use the UNLOAD statement to save results to an S3 bucket. Keep in mind that this will create multiple files (at least one per compute node).
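A minimal sketch of what the UNLOAD might look like, run here from Python with psycopg2; the query, S3 path, and credentials are placeholders, and the tab delimiter matches the headers added later in the script below:

import psycopg2

# Connection details are placeholders for your Redshift cluster.
conn = psycopg2.connect(
    host="your-redshift-host",
    port=5439,
    dbname="your_database",
    user="your_user",
    password="your_password",
)
conn.autocommit = True

# Raw string so that Redshift, not Python, interprets the '\t' delimiter.
unload_sql = r"""
UNLOAD ('SELECT col1, col2, col3 FROM my_report_table')
TO 's3://path_to_files_on_s3/bucket/files_prefix'
CREDENTIALS 'aws_access_key_id=YOUR_KEY;aws_secret_access_key=YOUR_SECRET'
DELIMITER AS '\t'
ALLOWOVERWRITE;
"""

with conn.cursor() as cur:
    cur.execute(unload_sql)
conn.close()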
You will have to download all the files, combine them locally, sort them (if needed), then add column headers and upload the result back to S3.
Using an EC2 instance for this shouldn't take a lot of time; the connection between EC2 and S3 is quite good.
In my experience, the quickest method is to use shell commands:
# run query on the redshift
export PGPASSWORD='__your__redshift__pass__'
psql \
-h __your__redshift__host__ \
-p __your__redshift__port__ \
-U __your__redshift__user__ \
__your__redshift__database__name__ \
-c "UNLOAD __rest__of__query__"
# download all the results
s3cmd get s3://path_to_files_on_s3/bucket/files_prefix*
# merge all the files into one
cat files_prefix* > files_prefix_merged
# sort merged file by a given column (if needed)
sort -n -k2 files_prefix_merged > files_prefix_sorted
# add column names to destination file
echo -e "column 1 name\tcolumn 2 name\tcolumn 3 name" > files_prefix_finished
# add merged and sorted file into destination file
cat files_prefix_sorted >> files_prefix_finished
# upload destination file to s3
s3cmd put files_prefix_finished s3://path_to_files_on_s3/bucket/...
# cleanup
s3cmd del s3://path_to_files_on_s3/bucket/files_prefix*
rm files_prefix* files_prefix_merged files_prefix_sorted files_prefix_finished