Sequential Scripts conditioned by S3 file existence - amazon-s3

I have three Python scripts. They are supposed to be executed sequentially, but in different environments.
script1: Generate training and test dataset using an AWS EMR cluster and save it on S3.
script2: Train a Machine Learning model using the training data and save the trained model on S3. (Executed on an AWS GPU instance)
script3: Run evaluation based on the test data and trained model and save the result on S3. (Executed on an AWS GPU instance)
I would like to run all these scripts automatically, without executing them one by one. My questions are:
Are there good practices for handling S3 file existence conditions? (fault tolerance etc.)
How can I trigger launching GPU instances and EMR clusters?
Are there good ways or tools to handle this kind of process?

Take a look at https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html
You can configure a notification that fires when an event occurs on a bucket, for example when an object is created.
You can attach this notification directly to an AWS Lambda function which, if it is given the right role, can create an EMR cluster and any other resource accessible through the AWS SDK.
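As a rough sketch of that idea, the Lambda handler below reads the bucket and key from an S3 notification event and decides which step to start next. The key prefixes ("datasets/", "models/"), the step names, and the `launch` callback are all hypothetical; in a real function `launch` would wrap whichever AWS SDK call starts the GPU instance or EMR cluster.

```python
from urllib.parse import unquote_plus

# Hypothetical key prefixes: adjust to wherever each script writes its output.
NEXT_STEP_BY_PREFIX = {
    "datasets/": "train",     # script1 finished -> start script2
    "models/": "evaluate",    # script2 finished -> start script3
}


def next_steps(event):
    """Return (step, bucket, key) tuples for each S3 record in a notification."""
    steps = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        for prefix, step in NEXT_STEP_BY_PREFIX.items():
            if key.startswith(prefix):
                steps.append((step, bucket, key))
    return steps


def handler(event, context, launch=print):
    """Lambda entry point; `launch` is a stand-in for the real SDK call."""
    for step, bucket, key in next_steps(event):
        launch(step, bucket, key)
```

Keeping the event parsing separate from the launch call makes the routing logic easy to test locally without any AWS credentials.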

How to overcome Spark "No Space left on the device" error in AWS Glue Job

I used an AWS Glue job with PySpark to read more than 10 TB of data from Parquet files on S3, but the job was failing during the execution of the Spark SQL query with the error
java.io.IOException: No space left on the device
On analysis, I found that the AWS Glue G.1X worker has 4 vCPU, 16 GB of memory and 64 GB of disk, so we tried to increase the number of workers.
Even after increasing the number of Glue workers (G.1X) to 50, the Glue job keeps failing with the same error.
Is there a way to point the Spark local temp directory at S3 instead of the local filesystem? Or can we mount an EBS volume on the Glue workers?
I tried configuring the property in the SparkSession builder, but Spark is still using the local tmp directory:
SparkSession.builder.appName("app").config("spark.local.dir", "s3a://s3bucket/temp").getOrCreate()
As #Prajappati stated, there are several solutions.
These solutions are described in detail in the AWS blog post that presents the S3 shuffle feature. I am going to omit the shuffle configuration tweaking since it is not very reliable. So, basically, you can either:
Scale up vertically, increasing the size of the machine (i.e. going from G.1X to G.2X), which increases the cost.
Disaggregate compute and storage, which in this case means using S3 as the storage service for spills and shuffles.
At the time of writing, to configure this disaggregation, the job must be configured with the following settings:
Glue 2.0 Engine
Glue job parameters:
Parameter | Value | Explanation
--write-shuffle-files-to-s3 | true | Main parameter (required)
--write-shuffle-spills-to-s3 | true | Optional
--conf | spark.shuffle.glue.s3ShuffleBucket=S3://<your-bucket-name>/<your-path> | Optional. If not set, the path --TempDir/shuffle-data will be used instead
Remember to assign the proper IAM permissions to the job so it can access the bucket and write under the S3 path provided or configured by default.
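For anyone setting these parameters from code rather than the console, here is a minimal sketch that builds them as a plain dictionary. The helper name `s3_shuffle_arguments` is made up for illustration; the assumption is that the resulting dict is passed as the job arguments through the AWS SDK (for example as `Arguments` to boto3's Glue `start_job_run`), which is not shown here.

```python
def s3_shuffle_arguments(shuffle_path=None):
    """Build the Glue job parameters listed above as a plain dict.

    shuffle_path is optional; when omitted, Glue falls back to the
    --TempDir/shuffle-data location described above.
    """
    args = {
        "--write-shuffle-files-to-s3": "true",   # main switch (required)
        "--write-shuffle-spills-to-s3": "true",  # optional
    }
    if shuffle_path is not None:
        args["--conf"] = f"spark.shuffle.glue.s3ShuffleBucket={shuffle_path}"
    return args
```

Job parameter values are strings, which is why `"true"` is quoted rather than a Python boolean.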
According to the error message, it appears as if the Glue job is running out of disk space when writing a DynamicFrame.
As you may know, Spark will perform a shuffle on certain operations, writing the results to disk. When the shuffle is too large, the job will fail with this error.
There are two options to consider.
Upgrade your worker type to G.2X and/or increase the number of workers.
Implement AWS Glue Spark Shuffle manager with S3 [1]. To implement this option, you will need to downgrade to Glue version 2.0.
The Glue Spark shuffle manager will write the shuffle-files and shuffle-spills data to S3, lowering the probability of your job running out of memory and failing.
Please could you add the following additional job parameters. You can do this via the following steps:
1) Open the "Jobs" tab in the Glue console.
2) Select the job you want to apply this to, then click "Actions" then click "Edit Job".
3) Scroll down and open the drop down named "Security configuration, script libraries, and job parameters (optional)".
4) Under job parameters, enter the following key value pairs:
Key: --write-shuffle-files-to-s3 Value: true
Key: --write-shuffle-spills-to-s3 Value: true
Key: --conf Value: spark.shuffle.glue.s3ShuffleBucket=S3://
Remember to replace the triangle brackets <> with the name of the S3 bucket where you would like to store the shuffle data.
5) Click "Save" then run the job.
FWIW, I discovered that the first thing you need to check is that Spark UI is not enabled on the job: https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-ui-jobs.html
AWS documentation mentions that logs generated for Spark UI are flushed to S3 path every 30 seconds, but it doesn't look like they are rotated on the worker. So sooner or later, depending on the workload and worker type, all of them run out of disk space and the run fails with Command failed with exit code 10.
The documentation states that spark.local.dir is used to specify a local directory only.
This error can be addressed by modifying the logging properties or, depending on the cluster manager used, the cluster manager properties, such as for YARN in this answer.

Using pyspark on AWS EMR

I am new to both PySpark and AWS EMR. I have been given a small project where I need to scrub large amounts of data files every hour and build aggregated data sets based on them. These data files are stored on S3 and I can utilize some of the basic functions in Spark (like filter and map) to derive the aggregated data. To save on egress costs, and after performing a cost-benefit analysis, I decided to create an EMR cluster and make PySpark calls. The concept is working fine using Lambda functions triggered by files created in the S3 bucket. I am writing the output files back to S3.
But I am not able to comprehend the need for the 3 node EMR cluster I created and its use for me. How can I use the Hadoop file system to my advantage here and all the storage that is made available on the nodes?
How do I view (if possible) the utilization of the slave/core nodes in the cluster? How do I know whether they are used, how often, etc.? I am executing the PySpark code on the master node.
Are there alternatives to EMR that I can use with PySpark?
Is there any good documentation available to get a better understanding?
Thanks
Spark is a framework for distributed computing. It can process larger than memory datasets and split the workload in chunks onto multiple workers in parallel. By default EMR creates 1 master node and 2 worker nodes. The disk space on the spark nodes is typically not used directly. Spark can use the space to cache temp results.
To use a Hadoop filesystem, you need to start an HDFS service in AWS.
However, S3 is also distributed storage and is supported by the Hadoop libraries. Spark on EMR ships with the Hadoop drivers and supports S3 out of the box. Using Spark with S3 is a perfectly valid storage solution and will be good enough for a lot of basic data processing tasks.
There is a Spark manager UI in AWS EMR. You can see each running Spark application session and its current job. By clicking on the job you can see how many executors are used. Whether those executors run on all nodes depends on your Spark memory and CPU configuration. Tuning those is a really big topic. There are good hints here on SO.
There is also a hardware monitoring tab, showing cpu and memory usage for each node.
The spark code is always executed on the master node. But it just creates a DAG plan on that node and shifts the actual work to the worker nodes according to the plan. Hence the guides speak of submitting the spark application rather than executing.
Yes. You can start your own Spark cluster on normal EC2 instances. There is even a standalone mode, which lets you start Spark on a single machine. That installs quite a large footprint, and you still need to tune the memory, CPU and executor settings, so it is considerably more complex than implementing some multiprocessing in Python or using Dask. However, there are valid reasons to do so: it lets you use all cores on one machine, and it gives you a well-known, well-documented API, the same one that can be used to process petabytes of data. The linked article above explains the motivation.
Another possibility is to use AWS Glue. It is serverless Spark. The service will submit your jobs to on-demand Spark nodes on AWS over which you have no control, similar to how Lambda functions run on random AWS EC2 instances. However, Glue has some limitations. With PySpark on Glue, you cannot install Python libraries with C extensions, e.g. numpy, pandas and most ML libraries. Glue also forces you to create a schema mapping of your data in the Athena catalog, whereas standalone Spark can just process the data on the fly.
Databricks also offers a separate serverless spark solution outside of AWS. It is more sophisticated in my opinion. It also allows custom c-extensions.
A big part of the official documentation focuses on the different data processing APIs and not on the internals of Apache Spark. There are some good notes on Spark internals on GitHub. I assume every good book will cover some of the inner workings of Spark. AWS EMR is just an automated Spark cluster with the YARN orchestrator. (Unfortunately, I have never read a good book on Spark; I picked up information here and there, so I cannot recommend one.)

How to copy Big Data from GCS to S3?

How to copy a few terabytes of data from GCS to S3?
There's a nice "Transfer" feature in GCS that allows importing data from S3 to GCS. But how to do the export, the other way (besides moving data generation jobs to AWS)?
Q: Why not gsutil?
Yes, gsutil supports s3://, but the transfer is limited by that machine's network throughput. How can I do it more easily in parallel?
I tried Dataflow (aka Apache Beam now), which would work fine because it's easy to parallelize across a hundred or so nodes, but I don't see a simple 'just copy it from here to there' function.
UPDATE: Also, Beam seems to be computing a list of source files on the local machine in a single thread, before starting the pipeline. In my case that takes around 40 minutes. Would be nice to distribute it on the cloud.
UPDATE 2: So far I'm inclined to use two scripts of my own that would:
Script A: List all objects to transfer, and enqueue a transfer task for each one into a PubSub queue.
Script B: Execute these transfer tasks. Runs in the cloud (e.g. Kubernetes), with many instances in parallel.
The drawback is that this means writing code that may contain bugs etc., rather than using a built-in solution like GCS "Transfer".
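To make the Script A / Script B split concrete, here is a small in-process sketch of the same pattern. It uses Python's `queue.Queue` and threads as a local stand-in for the PubSub queue and the cloud workers, and `copy_object` is a hypothetical callable that would perform the actual GCS-to-S3 copy of one object.

```python
import queue
import threading


def enqueue_transfers(object_names, tasks):
    """Script A: put one transfer task per object on the queue."""
    for name in object_names:
        tasks.put(name)


def run_worker(tasks, copy_object, done):
    """Script B: pull tasks until the queue is empty and copy each object."""
    while True:
        try:
            name = tasks.get_nowait()
        except queue.Empty:
            return
        copy_object(name)   # hypothetical: copies one object from GCS to S3
        done.append(name)   # record the completed task


def transfer_all(object_names, copy_object, workers=4):
    """Run both roles locally: fill the queue, then drain it in parallel."""
    tasks = queue.Queue()
    done = []
    enqueue_transfers(object_names, tasks)
    threads = [
        threading.Thread(target=run_worker, args=(tasks, copy_object, done))
        for _ in range(workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done
```

In the distributed version each worker would be a separate process or pod and the queue would need acknowledgements and retries, which this sketch leaves out.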
You could use gsutil running on Compute Engine (or EC2) instances (which may have higher network bandwidth available than your local machine).
Using gsutil -m cp will parallelize copying across objects, but individual objects will still be copied sequentially.

Apache Airflow or Apache Beam for data processing and job scheduling

I'm trying to give useful information but I am far from being a data engineer.
I am currently using the Python library pandas to execute a long series of transformations on my data, which has a lot of inputs (currently CSV and Excel files). The outputs are several Excel files. I would like to be able to execute scheduled, monitored batch jobs with parallel computation (I mean not as sequential as what I'm doing with pandas), once a month.
I don't really know Beam or Airflow. I quickly read through the docs and it seems that both can achieve that. Which one should I use?
The other answers are quite technical and hard to understand. I was in your position before so I'll explain in simple terms.
Airflow can do anything. It has BashOperator and PythonOperator which means it can run any bash script or any Python script.
It is a way to organize (set up complicated data pipeline DAGs), schedule, monitor and trigger re-runs of data pipelines, in an easy-to-view and easy-to-use UI.
Also, it is easy to set up and everything is in familiar Python code.
Doing pipelines in an organized manner (i.e using Airflow) means you don't waste time debugging a mess of data processing (cron) scripts all over the place.
Nowadays (roughly year 2020 onwards), we call it an orchestration tool.
Apache Beam is a wrapper for the many data processing frameworks (Spark, Flink etc.) out there.
The intent is so you just learn Beam and can run on multiple backends (Beam runners).
If you are familiar with Keras and TensorFlow/Theano/Torch, the relationship between Keras and its backends is similar to the relationship between Beam and its data processing backends.
Google Cloud Platform's Cloud Dataflow is one backend for running Beam on.
They call it the Dataflow runner.
GCP's offering, Cloud Composer, is a managed Airflow implementation as a service, running in a Kubernetes cluster in Google Kubernetes Engine (GKE).
So your options are:
manual Airflow implementation, doing data processing on the instance itself (if your data is small, or your instance is powerful enough, you can process data on the machine running Airflow; this is why many are confused about whether Airflow can process data or not)
manual Airflow implementation calling Beam jobs
Cloud Composer (managed Airflow as a service) calling jobs in Cloud Dataflow
Cloud Composer running data processing containers in Composer's Kubernetes cluster environment itself, using Airflow's KubernetesPodOperator (KPO)
Cloud Composer running data processing containers in Composer's Kubernetes cluster environment with Airflow's KPO, but this time in a better isolated fashion by creating a new node-pool and specifying that the KPO pods are to be run in the new node-pool
My personal experience:
Airflow is lightweight and not difficult to learn (easy to implement). You should use it for your data pipelines whenever possible.
Also, since many companies are looking for experience using Airflow, if you're looking to be a data engineer you should probably learn it.
Also, managed Airflow (I've only used GCP's Composer so far) is much more convenient than running Airflow yourself, and managing the airflow webserver and scheduler processes.
Apache Airflow and Apache Beam look quite similar on the surface. Both of them allow you to organise a set of steps that process your data and both ensure the steps run in the right order and have their dependencies satisfied. Both allow you to visualise the steps and dependencies as a directed acyclic graph (DAG) in a GUI.
But when you dig a bit deeper there are big differences in what they do and the programming models they support.
Airflow is a task management system. The nodes of the DAG are tasks and Airflow makes sure to run them in the proper order, making sure one task only starts once its dependency tasks have finished. Dependent tasks don't run at the same time but only one after another. Independent tasks can run concurrently.
Beam is a dataflow engine. The nodes of the DAG form a (possibly branching) pipeline. All the nodes in the DAG are active at the same time, and they pass data elements from one to the next, each doing some processing on it.
The two have some overlapping use cases but there are a lot of things only one of the two can do well.
Airflow manages tasks, which depend on one another. While this dependency can consist of one task passing data to the next one, that is not a requirement. In fact Airflow doesn't even care what the tasks do, it just needs to start them and see if they finished or failed. If tasks need to pass data to one another you need to co-ordinate that yourself, telling each task where to read and write its data, e.g. a local file path or a web service somewhere. Tasks can consist of Python code but they can also be any external program or a web service call.
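The task-ordering idea can be illustrated with a toy runner in plain Python. This is not Airflow's API, only a sketch of the scheduling rule described above, using the standard library's `graphlib`. Unlike Airflow, it runs everything sequentially in one process, and the task names in the usage example are made up.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_tasks(dependencies, actions):
    """Run each task only after all of its upstream tasks have finished.

    dependencies maps a task name to the set of tasks it depends on;
    actions maps a task name to a zero-argument callable.
    """
    order = []
    for name in TopologicalSorter(dependencies).static_order():
        actions[name]()   # the runner does not care what the task does
        order.append(name)
    return order


# Usage with three hypothetical tasks that form a chain.
deps = {"generate": set(), "train": {"generate"}, "evaluate": {"train"}}
log = []
actions = {name: (lambda name=name: log.append(name)) for name in deps}
order = run_tasks(deps, actions)
```

Note that the tasks here share nothing except their ordering; any data hand-off (a file path, an S3 key) would have to be arranged by the tasks themselves, as described above.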
In Beam, your step definitions are tightly integrated with the engine. You define the steps in a supported programming language and they run inside a Beam process. Handling the computation in an external process would be difficult if possible at all*, and is certainly not the way Beam is supposed to be used. Your steps only need to worry about the computation they're performing, not about storing or transferring the data. Transferring the data between different steps is handled entirely by the framework.
In Airflow, if your tasks process data, a single task invocation typically does some transformation on the entire dataset. In Beam, the data processing is part of the core interfaces so it can't really do anything else. An invocation of a Beam step typically handles a single or a few data elements and not the full dataset. Because of this Beam also supports unbounded length datasets, which is not something Airflow can natively cope with.
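As a loose analogy for the element-at-a-time model (this uses plain Python generators, not the Beam SDK), each step below handles one element and hands it straight to the next step. That is why the same pipeline also works on a source with no end. The step names are invented for the example.

```python
from itertools import count, islice


def parse(lines):
    """First step: turn each incoming string into an integer."""
    for line in lines:   # each step sees one element at a time
        yield int(line)


def square(numbers):
    """Second step: square each number as it arrives."""
    for n in numbers:
        yield n * n


def pipeline(source):
    """Chain the steps; elements flow through as they are produced."""
    return square(parse(source))


# Works on a bounded source...
bounded = list(pipeline(["1", "2", "3"]))

# ...and on an unbounded one, as long as we only pull what we need.
unbounded = pipeline(str(i) for i in count(1))
first_five = list(islice(unbounded, 5))
```

A real Beam runner additionally distributes the elements across workers and handles the transfer between steps, which generators in a single process do not.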
Another difference is that Airflow is a framework by itself, but Beam is actually an abstraction layer. Beam pipelines can run on Apache Spark, Apache Flink, Google Cloud Dataflow and others. All of these support a more or less similar programming model. Google has also cloudified Airflow into a service as Google Cloud Composer, by the way.
*Apache Spark's support for Python is actually implemented by running a full Python interpreter in a subprocess, but this is implemented at the framework level.
Apache Airflow is not a data processing engine.
Airflow is a platform to programmatically author, schedule, and monitor workflows.
Cloud Dataflow is a fully-managed service on Google Cloud that can be used for data processing. You can write your Dataflow code and then use Airflow to schedule and monitor the Dataflow job. Airflow also allows you to retry your job if it fails (the number of retries is configurable). You can also configure Airflow to send alerts on Slack or by email if your Dataflow pipeline fails.
I am doing the same as you with Airflow, and I've got very good results. I am not very sure about the following, but: Beam is machine learning focused and Airflow is for anything you want.
Finally, you can create a hive with Kubernetes + Airflow.

AWS Lambda: mount S3 files

Because of lots of dependencies, my ML code is not working inside Lambda.
1. Is there any way to load a shared object inside Lambda?
2. Can we mount an external file system inside Lambda?
We have used AWS Athena and Redshift Spectrum for our ML / Neural Networks training.
You can use AWS Athena and query selective records for your ML training. This will work for any large set.
If you are looking for performance in your queries, I would recommend an external table with Redshift Spectrum.
Both of the above technologies will mount S3 files and let you access them quickly and selectively with SQL queries.
Hope it helps.