How to run MapReduce jobs on EMR Serverless? - amazon-emr

Based on the documentation, Amazon EMR Serverless seems to accept only Spark and Hive as job drivers. Is there any support for custom Hadoop JARs for MapReduce jobs on Serverless, similar to EMR?

That's correct. EMR Serverless currently only supports Spark and Hive job drivers, so there is no MapReduce support.
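For reference, here is a minimal boto3 sketch of what EMR Serverless does accept: the jobDriver union only has sparkSubmit and hive members, which is why a custom Hadoop JAR cannot be submitted. The application ID, role ARN, and S3 path below are placeholders.

```python
import boto3

# Placeholder application ID, execution role, and entry point.
APPLICATION_ID = "00example123"
EXECUTION_ROLE_ARN = "arn:aws:iam::123456789012:role/EMRServerlessJobRole"

client = boto3.client("emr-serverless")

# jobDriver accepts only "sparkSubmit" or "hive"; there is no driver
# type for a plain Hadoop MapReduce JAR.
response = client.start_job_run(
    applicationId=APPLICATION_ID,
    executionRoleArn=EXECUTION_ROLE_ARN,
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://my-bucket/jobs/my_spark_job.py",
            "sparkSubmitParameters": "--conf spark.executor.memory=4g",
        }
    },
)
print(response["jobRunId"])
```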

Related

How to write EMR PySpark output to local HDFS and later copy it to S3?

I am using a transient cluster for Spark on EMR and am currently working on Spark 3.0.1. Which command should be used so that data is written into HDFS on EMR and later copied over to an S3 bucket?
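A common pattern is to write to cluster-local HDFS from the Spark job and then copy the results to S3 with s3-dist-cp (which ships on EMR) as a separate step. A minimal sketch, where the bucket and paths are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("write-hdfs-then-s3").getOrCreate()

df = spark.range(1000)  # stand-in for the real dataset

# Write to the cluster's local HDFS first.
df.write.mode("overwrite").parquet("hdfs:///tmp/output/")

spark.stop()

# Then copy HDFS -> S3, e.g. as a separate EMR step:
#   s3-dist-cp --src hdfs:///tmp/output/ --dest s3://my-bucket/output/
```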

Can you connect Amazon ElastiCache Redis to Amazon EMR PySpark?

I have been trying several solutions with custom JARs from Redis Labs and also with --packages in spark-submit on EMR, and still no success. Is there any simple way in EMR to connect to ElastiCache?
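One route that is often suggested is the spark-redis connector from Redis Labs. A minimal PySpark sketch, where the ElastiCache endpoint, connector version, and table name are assumptions (the ElastiCache security group must also allow traffic from the EMR cluster):

```python
from pyspark.sql import SparkSession

# Assumed endpoint and connector version; launch with the connector on the
# classpath, e.g.:
#   spark-submit --packages com.redislabs:spark-redis_2.12:3.1.0 this_script.py
spark = (
    SparkSession.builder.appName("emr-to-elasticache")
    .config("spark.redis.host", "my-cluster.abc123.0001.use1.cache.amazonaws.com")
    .config("spark.redis.port", "6379")
    .getOrCreate()
)

df = spark.createDataFrame([("alice", 30), ("bob", 25)], ["name", "age"])

# spark-redis exposes a DataFrame source; rows are stored as Redis hashes.
(df.write.format("org.apache.spark.sql.redis")
   .option("table", "people")
   .option("key.column", "name")
   .mode("overwrite")
   .save())
```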

Orchestration of jobs using AWS Step Functions with EMR Serverless

Recently Amazon launched EMR Serverless, and I want to repurpose my existing data pipeline orchestration that uses AWS Step Functions: there are steps that create an EMR cluster, run some Lambda functions, submit Spark jobs (mostly Scala jobs using spark-submit), and finally terminate the cluster. All these steps are of sync type (arn:aws:states:::elasticmapreduce:addStep.sync).
There are documentation and GitHub samples that describe submitting jobs from orchestration frameworks such as Airflow, but there is nothing that describes how to use AWS Step Functions with EMR Serverless. Any help in this regard is appreciated.
Primarily I am interested in repurposing the task step of type arn:aws:states:::elasticmapreduce:addStep.sync, which takes parameters such as ClusterId, but in the case of EMR Serverless there is no such ID.
In summary, is there an equivalent of "Call Amazon EMR with Step Functions" for EMR Serverless?
Currently there is no direct integration of EMR Serverless with Step Functions. A possible workaround is to add a Lambda function on top and use the SDK to create EMR Serverless applications and submit jobs. However, you would need an additional Lambda implementing a poller that tracks job success (in the case of interdependent jobs), since the EMR job is highly likely to outrun Lambda's 15-minute runtime limit.
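A sketch of that workaround, where the application ID, role ARN, and entry point arrive as placeholder event fields: one Lambda submits the job run via the SDK, and a second Lambda, invoked in a loop by the state machine, polls the run state instead of blocking.

```python
import boto3

emr = boto3.client("emr-serverless")

def submit_handler(event, context):
    """Lambda that starts an EMR Serverless job run via the SDK."""
    response = emr.start_job_run(
        applicationId=event["applicationId"],       # placeholder inputs
        executionRoleArn=event["executionRoleArn"],
        jobDriver={"sparkSubmit": {"entryPoint": event["entryPoint"]}},
    )
    return {"jobRunId": response["jobRunId"]}

def poll_handler(event, context):
    """Lambda that checks whether the job run has finished, so the state
    machine can loop instead of blocking past Lambda's 15-minute limit."""
    run = emr.get_job_run(
        applicationId=event["applicationId"],
        jobRunId=event["jobRunId"],
    )["jobRun"]
    # Terminal states include SUCCESS, FAILED, and CANCELLED.
    return {"state": run["state"]}
```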

Scheduling across different AWS components - Glue and EMR

I was wondering how I would tackle the following on AWS, or whether it is simply not possible:
Transient EMR Cluster for some bulk Spark processing
When that cluster terminates, then and only then use a Glue Job to do some limited processing
I am not convinced AWS Glue Triggers will help across environments.
Or would one say: just keep everything in the EMR cluster, since this is not a good use case for splitting? Glue can write to SAP HANA with the appropriate connector, and loading Redshift via a Glue job with Redshift Spectrum is a common use case.
You can use the "Run a job" (.sync) service integration in AWS Step Functions; Step Functions supports both the EMR and Glue integrations.
Please refer to the link for details.
Having spoken to Amazon on this aspect, they indicated that Airflow via MWAA is now the preferred option.
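As a sketch of the Step Functions route, the state machine definition can chain the two .sync integrations so the Glue job only starts once the EMR step has completed; the cluster ID, step, and job names below are placeholders.

```python
import json

# Skeleton ASL definition chaining the EMR and Glue "Run a Job" (.sync)
# service integrations; all names and paths are placeholders.
definition = {
    "StartAt": "RunEmrStep",
    "States": {
        "RunEmrStep": {
            "Type": "Task",
            "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
            "Parameters": {
                "ClusterId.$": "$.clusterId",
                "Step": {
                    "Name": "BulkSparkProcessing",
                    "ActionOnFailure": "TERMINATE_CLUSTER",
                    "HadoopJarStep": {
                        "Jar": "command-runner.jar",
                        "Args": ["spark-submit", "s3://my-bucket/jobs/bulk.py"],
                    },
                },
            },
            "Next": "RunGlueJob",
        },
        "RunGlueJob": {
            "Type": "Task",
            "Resource": "arn:aws:states:::glue:startJobRun.sync",
            "Parameters": {"JobName": "limited-post-processing"},
            "End": True,
        },
    },
}
print(json.dumps(definition, indent=2))
```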

Flink streaming application writing to S3 in Parquet format

I am developing a Flink streaming application which consumes messages from Kafka/Kinesis and, after processing them, has to write the output to S3 in Parquet format every, let's say, 5 minutes.
Kindly suggest an approach to achieve this, as I am facing lots of issues.
Currently I am using Flink 1.4.2, as I am planning to deploy it on an AWS EMR 5.15 cluster.
Approach already tried: I have used the flink-s3-fs-hadoop module, the Parquet API, and Flink's BucketingSink.
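BucketingSink was the Flink 1.4-era answer and has since been removed; on recent Flink versions the filesystem connector with a rolling policy covers this case. A minimal PyFlink Table API sketch, assuming a recent Flink with flink-s3-fs-hadoop and the Parquet format on the classpath (topic, schema, broker, and paths are made up):

```python
from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
# Parquet is a bulk format: in-progress files are finalized on checkpoints,
# so the checkpoint/rollover interval controls how often parts appear.
t_env.get_config().get_configuration().set_string(
    "execution.checkpointing.interval", "5 min"
)

t_env.execute_sql("""
    CREATE TABLE events (
        user_id STRING,
        amount  DOUBLE
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'events',
        'properties.bootstrap.servers' = 'broker:9092',
        'format' = 'json',
        'scan.startup.mode' = 'latest-offset'
    )
""")

t_env.execute_sql("""
    CREATE TABLE s3_sink (
        user_id STRING,
        amount  DOUBLE
    ) WITH (
        'connector' = 'filesystem',
        'path' = 's3://my-bucket/flink-output/',
        'format' = 'parquet',
        'sink.rolling-policy.rollover-interval' = '5 min'
    )
""")

t_env.execute_sql("INSERT INTO s3_sink SELECT * FROM events")
```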