AWS Lambda: mount S3 files

My ML code does not run inside Lambda because it has too many dependencies.
1. Is there any way to load a shared object inside Lambda?
2. Can we mount an external file system inside Lambda?

We have used AWS Athena and Redshift Spectrum for our ML / neural network training.
You can use AWS Athena to query only the records you need for your ML training. This works for data sets of any size.
If query performance matters, I would recommend an external table with Redshift Spectrum.
Both technologies read the S3 files in place and let you access the data quickly and selectively with SQL queries.
Hope it helps.
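As a rough illustration of the Athena route, here is a minimal sketch that starts a query and waits for it to finish. It assumes a boto3 Athena client is passed in (for example `boto3.client("athena")`); the database name, output location and SQL text are placeholders, not values from the question.

```python
import time

# States after which Athena will not change the query status any more.
TERMINAL_STATES = ("SUCCEEDED", "FAILED", "CANCELLED")


def run_athena_query(client, sql, database, output_location, poll_seconds=1.0):
    """Start an Athena query and poll until it reaches a terminal state.

    `client` is expected to behave like boto3.client("athena").
    Returns (query_execution_id, final_state).
    """
    started = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        # Athena writes the result set to this S3 location (placeholder value).
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]
    while True:
        status = client.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in TERMINAL_STATES:
            return query_id, state
        time.sleep(poll_seconds)
```

Because the client is a parameter, the function can be exercised with a stub before pointing it at a real account. Selecting only the needed columns and rows in `sql` is what keeps the training input small.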

Related

Using pyspark on AWS EMR

I am new to both PySpark and AWS EMR. I have been given a small project where I need to scrub large amounts of data files every hour and build aggregated data sets based on them. These data files are stored on S3 and I can utilize some of the basic functions in Spark (like filter and map) to derive the aggregated data. To save on egress costs, and after performing a cost-benefit analysis, I decided to create an EMR cluster and make pyspark calls. The concept is working fine using Lambda functions triggered when a file is created in the S3 bucket. I am writing the output files back to S3.
But I am not able to comprehend the need for the 3 node EMR cluster I created and its use for me. How can I use the Hadoop file system to my advantage here and all the storage that is made available on the nodes?
How do I view (if possible) the utilization of the slave/core nodes in the cluster? How do I know they are used, how often, etc etc? I am executing the pyspark code on the master node.
Are there alternatives to EMR that I can use with pyspark?
Is there any good documentation available to get a better understanding?
Thanks
Spark is a framework for distributed computing. It can process larger-than-memory datasets and splits the workload into chunks that run on multiple workers in parallel. By default EMR creates 1 master node and 2 worker nodes. The disk space on the Spark nodes is typically not used directly. Spark can use the space to cache temporary results.
To use a Hadoop filesystem, you need to start an HDFS service in AWS.
However, S3 is also distributed storage and is supported by the Hadoop libraries. Spark on EMR ships with the Hadoop drivers and supports S3 out of the box. Using Spark with S3 is a perfectly valid storage solution and will be good enough for a lot of basic data processing tasks.
There is a Spark manager UI in AWS EMR. You can see each running Spark application session and its current job. By clicking on the job you can see how many executors are used. Whether those executors run on all nodes depends on your Spark memory and CPU configuration. Tuning those is a really big topic. There are good hints here on SO.
There is also a hardware monitoring tab, showing cpu and memory usage for each node.
The Spark code is always executed on the master node, but there it only creates a DAG plan and then shifts the actual work to the worker nodes according to that plan. Hence the guides speak of submitting a Spark application rather than executing it.
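To make the plan-versus-work split concrete, here is a minimal sketch of an hourly aggregation job. The `key,value` line format, the function names and the paths are assumptions made for illustration. The transformations only extend the plan on the driver; nothing is read until the final action, at which point the functions are shipped to the executors.

```python
def parse_line(line):
    """Split a 'key,value' line into (key, float(value)). Assumed input format."""
    key, value = line.split(",", 1)
    return key.strip(), float(value)


def run(input_path, output_path):
    """Sum values per key and write the totals back out."""
    # Imported here so parse_line stays usable on machines without pyspark.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hourly-aggregate").getOrCreate()
    lines = spark.sparkContext.textFile(input_path)   # lazy: only recorded in the plan
    totals = (
        lines.map(parse_line)                         # lazy
        .filter(lambda kv: kv[1] >= 0)                # lazy
        .reduceByKey(lambda a, b: a + b)              # lazy, implies a shuffle
    )
    # saveAsTextFile is an action: the plan now runs on the worker nodes.
    totals.map(lambda kv: f"{kv[0]},{kv[1]}").saveAsTextFile(output_path)
    spark.stop()
```

On EMR this would typically be submitted with `spark-submit` and called with S3 paths such as `run("s3://my-bucket/in/", "s3://my-bucket/out/")`, where the bucket name is a placeholder.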
Yes. You can start your own Spark cluster on normal EC2 instances. There is even a standalone mode, which lets you start Spark on a single machine. That installs quite a large footprint, and you still need to tune the memory, CPU and executor settings, so it is considerably more complex than implementing some multiprocessing in Python or using Dask. However, there are valid reasons to do so: it lets you use all cores on one machine, and it lets you use a well-known, well-documented API, the same one that can be used to process petabytes of data. The linked article above explains the motivation.
Another possibility is to use AWS Glue, which is serverless Spark. The service submits your jobs to on-demand Spark nodes on AWS over which you have no control, similar to how Lambda functions run on arbitrary AWS EC2 instances. However, Glue has some limitations. At the time of writing, with PySpark on Glue you cannot install Python libraries with C extensions, e.g. numpy, pandas and most ML libraries. Glue also forces you to create a schema mapping of your data in the Athena catalog, whereas standalone Spark can just process the data on the fly.
Databricks also offers a separate serverless spark solution outside of AWS. It is more sophisticated in my opinion. It also allows custom c-extensions.
A big part of the official documentation focuses on the different data processing APIs and not on the internals of Apache Spark. There are some good notes on Spark internals on GitHub. I assume every good book will cover some of the inner workings of Spark. AWS EMR is just an automated Spark cluster with the YARN orchestrator. (Unfortunately, I have never read a good book on Spark; I picked up information here and there, so I cannot recommend one.)

Sourcing data from Athena or Redshift into SageMaker or AWS Forecast instead of a flat file

I am trying to source data from Athena or Redshift into SageMaker or AWS Forecast directly, without using flat files. In SageMaker I use Python code in a Jupyter notebook. Is there any way to do so without even connecting to S3?
So far I have been using flat files, which is not what I want.
If you're only using a SageMaker notebook instance, your data doesn't have to be in S3. You can use the boto3 SDK or a SQL connection (depending on the backend) to download data, store it locally, and work on it in your notebook.
If you're using the SageMaker SDK to train, then yes, data must be in S3. You can either do this manually if you're experimenting, or use services like AWS Glue or AWS Batch to automate your data pipeline.
Indeed, Athena data is probably already in S3, although it may be in a format that your SageMaker training code doesn't support. Creating a new table with the right SerDe (say, CSV) may be enough. If not, you can certainly get the job done with AWS Glue or Amazon EMR.
When it comes to Redshift, dumping CSV data to S3 is as easy as:
unload ('select * from mytable')
to 's3://mybucket/mytable'
iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole'
delimiter ',';
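If the statement is generated from a notebook, a small helper can assemble it. The sketch below is illustrative only: the function name and the idea of a helper are assumptions, and it produces the same clauses as the statement above. The one detail it adds is that single quotes inside the query are doubled, because UNLOAD takes the query as a quoted string.

```python
def build_unload(query, s3_prefix, iam_role_arn, delimiter=","):
    """Return a Redshift UNLOAD statement for the given SELECT query."""
    if not s3_prefix.startswith("s3://"):
        raise ValueError("s3_prefix must start with s3://")
    # The query is embedded in a quoted string, so inner quotes are doubled.
    escaped = query.replace("'", "''")
    return (
        f"unload ('{escaped}')\n"
        f"to '{s3_prefix}'\n"
        f"iam_role '{iam_role_arn}'\n"
        f"delimiter '{delimiter}';"
    )
```

The resulting string can be run through whatever SQL connection the notebook already uses for Redshift.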
Hope this helps.
If you are using SageMaker, you have to use S3 to read data. SageMaker does not read data from Redshift, but it can read data from Athena using PyAthena.
If your data source is in Redshift, you need to load your data to S3 first to be able to use it in SageMaker. If you are using Athena, your data is already in S3.
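For the PyAthena route, a minimal sketch of pulling query results into a notebook could look like this. PyAthena follows the Python DB-API 2.0, so the helper below only relies on the standard cursor interface; the function name and the staging directory and region in the comment are placeholders. Note that Athena still writes its results to an S3 staging location behind the scenes.

```python
def fetch_rows(connection, query):
    """Run `query` on a DB-API 2.0 connection and return a list of dict rows."""
    cursor = connection.cursor()
    try:
        cursor.execute(query)
        # cursor.description holds one tuple per column; the name comes first.
        columns = [col[0] for col in cursor.description]
        return [dict(zip(columns, row)) for row in cursor.fetchall()]
    finally:
        cursor.close()


# With PyAthena the connection would be created roughly like this
# (placeholder values, requires the pyathena package and AWS credentials):
#
#   from pyathena import connect
#   conn = connect(s3_staging_dir="s3://my-bucket/athena-results/",
#                  region_name="us-east-1")
#   rows = fetch_rows(conn, "SELECT * FROM my_table LIMIT 10")
```

The returned list of dicts can be handed to `pandas.DataFrame(rows)` if a data frame is more convenient for training code.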
Amazon Machine Learning used to support reading data from Redshift or RDS but unfortunately it's not available any more.
SageMaker Data Wrangler now allows you to read data directly from Amazon Redshift, but I'm not sure whether that works across AWS accounts (e.g. if you had one account for dev and another for prod).

Sequential Scripts conditioned by S3 file existence

I have three Python scripts. These are supposed to be executed sequentially, but in different environments.
script1: Generate training and test dataset using an AWS EMR cluster and save it on S3.
script2: Train a Machine Learning model using the training data and save the trained model on S3. (Executed on an AWS GPU instance)
script3: Run evaluation based on the test data and trained model and save the result on S3. (Executed on an AWS GPU instance)
I would like to run all these scripts automatically, without executing them one by one. My questions are:
Are there good practices for handling S3 file existence conditions? (fault tolerance etc.)
How can I trigger launching GPU instances and EMR clusters?
Are there good ways or tools to handle this kind of process?
Take a look at https://docs.aws.amazon.com/AmazonS3/latest/dev/NotificationHowTo.html
You can configure a notification that fires when an event occurs on a bucket, for example when an object is created.
You can attach this notification directly to an AWS Lambda function which, if it is given the right role, can create an EMR cluster and any other resource accessible through the AWS SDK.
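A minimal sketch of such a Lambda handler follows. It only decides which pipeline step a newly created object should trigger; the prefix-to-step mapping and the step names are hypothetical, and the actual launch of an EMR cluster or GPU instance is left as a comment.

```python
from urllib.parse import unquote_plus

# Hypothetical mapping from S3 key prefix to the pipeline step to launch next.
NEXT_STEP = {
    "datasets/": "train",
    "models/": "evaluate",
}


def objects_from_event(event):
    """Yield (bucket, key) for every record in an S3 notification event."""
    for record in event.get("Records", []):
        s3 = record["s3"]
        # Object keys arrive URL-encoded in S3 event notifications.
        yield s3["bucket"]["name"], unquote_plus(s3["object"]["key"])


def handler(event, context):
    """Lambda entry point: return the steps that the new objects should start."""
    steps = []
    for bucket, key in objects_from_event(event):
        for prefix, step in NEXT_STEP.items():
            if key.startswith(prefix):
                steps.append({"step": step, "bucket": bucket, "key": key})
    # A real function would launch resources here, e.g. through the boto3
    # EMR or EC2 clients, using the role attached to the Lambda function.
    return steps
```

Driving each step from the object that the previous step wrote means a step never starts before its input exists, which covers the file existence condition without polling.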

How map-reduce works on HDFS vs S3?

I have been trying to understand how differently a map-reduce job is executed on HDFS vs S3. Can someone please address my questions:
Typically HDFS clusters are not only storage oriented, but also contain horsepower to execute MR jobs; that is why the jobs are mapped on several data nodes and reduced on a few. To be exact, the mapping (filter etc.) is done on data locally, whereas the reducing (aggregation) is done on a common node.
Does this approach work as it is on S3? As far as I understand, S3 is just a data store. Does Hadoop have to copy the whole data set from S3 and then run map (filter) and reduce (aggregation) locally, or does it follow exactly the same approach as HDFS? If the former is true, running jobs on S3 could be slower than running jobs on HDFS (due to the copying overhead).
Please share your thoughts.
S3 is slower than HDFS, but it provides other features such as bucket versioning, elasticity and other data recovery schemes (Netflix runs a Hadoop cluster on S3).
Theoretically, before the split computation, the sizes of the input files need to be determined, so Hadoop itself has a filesystem implementation on top of S3, which allows higher layers to be agnostic of the source of the data. MapReduce calls the generic file listing API against each input directory to get the size of all files in the directory.
Amazon's EMR has a special version of the S3 file system that can stream data directly to S3 instead of buffering it in intermediate local files, which can make it faster on EMR.
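The split computation mentioned above can be sketched as follows. This is a deliberately simplified model, not Hadoop's actual implementation: it takes a listing of file sizes (which is all the file listing API has to supply) and cuts each file into fixed-size ranges, one per map task. The function name and the sizes used are illustrative.

```python
from typing import Dict, List, Tuple


def compute_splits(file_sizes: Dict[str, int], split_size: int) -> List[Tuple[str, int, int]]:
    """Cut each file into (path, offset, length) ranges of at most split_size bytes."""
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    splits = []
    for path, size in file_sizes.items():
        offset = 0
        while offset < size:
            length = min(split_size, size - offset)
            splits.append((path, offset, length))
            offset += length
    return splits
```

Only the sizes are needed up front; each map task then reads its own byte range, whether the bytes come from a local HDFS block or are streamed from S3.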
If you have a Hadoop cluster in EC2 and you run a MapReduce job over S3 data, then yes, the data will be streamed into the cluster in order to run the job. As you say, S3 is just a data store, so you cannot bring the computation to the data. These non-local reads could cause a bottleneck when processing large jobs, depending on the size of the data and the size of the cluster.

Writing single Hadoop map reduce output into multiple S3 objects

I am implementing a Hadoop Map reduce job that needs to create output in multiple S3 objects.
Hadoop itself creates only a single output file (an S3 object) but I need to partition the output into multiple files.
How do I achieve this?
I did this by just writing the output directly from my reducer method to S3, using an S3 toolkit. Since I was running on EC2, this was quick and free.
In general, you want Hadoop to handle your input and output as much as possible, for cleaner mappers and reducers; and, of course, you want to write to S3 at the very end of your pipeline, to let Hadoop's code-moving do its job over HDFS.
In any case, I recommend doing all of your data partitioning, and writing entire output sets to S3 in a final reduce task, one set per S3 file. This puts as little writer logic in your code as possible. This paid off for me because I ended up with a minimal Hadoop S3 toolkit which I used for several task flows.
I needed to write to S3 in my reducer code because the S3/S3n filesystems weren't mature; they might work better now.
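The "one set per S3 file" idea can be sketched in a few lines. This is an illustration, not the toolkit described above: the function names, the `.txt` suffix and the partition key are assumptions, and the upload itself is passed in as a callable so the partitioning logic stays separate from the S3 client.

```python
from collections import defaultdict


def partition_records(records, key_fn):
    """Group output records by the partition name returned by key_fn."""
    groups = defaultdict(list)
    for record in records:
        groups[key_fn(record)].append(record)
    return dict(groups)


def write_partitions(groups, prefix, put):
    """Write each group as one object and return the object keys written.

    `put(object_key, body)` does the upload. With boto3 it could be, for example,
    lambda k, b: s3.put_object(Bucket="my-bucket", Key=k, Body=b.encode("utf-8"))
    where the bucket name is a placeholder.
    """
    written = []
    for name in sorted(groups):
        object_key = f"{prefix}/{name}.txt"
        body = "\n".join(groups[name]) + "\n"
        put(object_key, body)
        written.append(object_key)
    return written
```

Keeping the writer behind a single callable is what keeps the writer logic in the reducer small, in line with the recommendation above.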
Do you also know about MultipleOutputFormat?
It is not related to S3, but in general it lets you write output to multiple files according to logic you implement.