Scheduling over different AWS Components - Glue and EMR - amazon-emr

How would I tackle the following on AWS, or is it not possible?
1. Run a transient EMR cluster for some bulk Spark processing.
2. When that cluster terminates, and only then, run a Glue job to do some limited processing.
I am not convinced that AWS Glue triggers can coordinate across services.
Or is this simply not a good use case, and should I just keep everything in the EMR cluster? Glue can write to SAP HANA with an appropriate connector, and loading Redshift via a Glue job with Redshift Spectrum is a common use case.

You can use the "Run a Job" (.sync) service integration pattern in AWS Step Functions. Step Functions supports both EMR and Glue integrations.
Please refer to the link for details.
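As a rough illustration of that suggestion, the sketch below builds an Amazon States Language definition in Python that creates a transient cluster, runs one Spark step, terminates the cluster, and only then starts a Glue job. The state names, cluster settings, roles, S3 path, and Glue job name are all placeholders invented for this example. It has no error handling, so a failed step would leave the cluster running unless a Catch is added.

```python
import json


def build_emr_then_glue_definition(glue_job_name: str) -> dict:
    """Return a state machine definition: EMR work first, Glue job last."""
    return {
        "Comment": "Transient EMR cluster followed by a Glue job (sketch)",
        "StartAt": "CreateCluster",
        "States": {
            "CreateCluster": {
                "Type": "Task",
                "Resource": "arn:aws:states:::elasticmapreduce:createCluster.sync",
                "Parameters": {
                    # All values below are placeholders, not recommendations.
                    "Name": "transient-spark-cluster",
                    "ReleaseLabel": "emr-6.15.0",
                    "Applications": [{"Name": "Spark"}],
                    "ServiceRole": "EMR_DefaultRole",
                    "JobFlowRole": "EMR_EC2_DefaultRole",
                    "Instances": {
                        "KeepJobFlowAliveWhenNoSteps": True,
                        "InstanceGroups": [{
                            "InstanceRole": "MASTER",
                            "InstanceType": "m5.xlarge",
                            "InstanceCount": 1,
                        }],
                    },
                },
                "ResultPath": "$.cluster",
                "Next": "RunSparkStep",
            },
            "RunSparkStep": {
                "Type": "Task",
                "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
                "Parameters": {
                    "ClusterId.$": "$.cluster.ClusterId",
                    "Step": {
                        "Name": "bulk-spark-processing",
                        "ActionOnFailure": "CONTINUE",
                        "HadoopJarStep": {
                            "Jar": "command-runner.jar",
                            # Hypothetical job location.
                            "Args": ["spark-submit", "s3://example-bucket/job.py"],
                        },
                    },
                },
                "ResultPath": "$.step",
                "Next": "TerminateCluster",
            },
            "TerminateCluster": {
                "Type": "Task",
                "Resource": "arn:aws:states:::elasticmapreduce:terminateCluster.sync",
                "Parameters": {"ClusterId.$": "$.cluster.ClusterId"},
                "ResultPath": None,  # discard the result, keep the input
                "Next": "RunGlueJob",
            },
            "RunGlueJob": {
                "Type": "Task",
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {"JobName": glue_job_name},
                "End": True,
            },
        },
    }


def state_order(definition: dict) -> list:
    """Follow the Next pointers from StartAt and return the state names."""
    order, name = [], definition["StartAt"]
    while name is not None:
        order.append(name)
        name = definition["States"][name].get("Next")
    return order


if __name__ == "__main__":
    print(json.dumps(build_emr_then_glue_definition("post-emr-job"), indent=2))
```

Because every task uses the .sync form, each state waits for the underlying job to finish, so the Glue job cannot start before the cluster has terminated.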

Having spoken to Amazon on this aspect, they indicate that Airflow via MWAA is the preferred option now.

Related

Orchestration of jobs on EMR Serverless using AWS Step Functions

Amazon recently launched EMR Serverless, and I want to repurpose my existing data pipeline orchestration, which uses AWS Step Functions. It has steps that create an EMR cluster, run some Lambda functions, submit Spark jobs (mostly Scala jobs using spark-submit), and finally terminate the cluster. All these steps are of the sync type (arn:aws:states:::elasticmapreduce:addStep.sync).
There is documentation, along with GitHub samples, describing how to submit jobs from an orchestration framework such as Airflow, but nothing that describes how to use AWS Step Functions with EMR Serverless. Any help in this regard is appreciated.
I am primarily interested in repurposing Task states of type arn:aws:states:::elasticmapreduce:addStep.sync, which take parameters such as ClusterId; with EMR Serverless there is no such ID.
In summary, is there an equivalent of "Call Amazon EMR with Step Functions" for EMR Serverless?
At the time of writing, there is no direct integration of EMR Serverless with Step Functions. A possible workaround is to put a Lambda function in front and use the SDK to create EMR Serverless applications and submit jobs. You would also need an additional Lambda function that polls for job completion (for interdependent jobs), because the EMR job is very likely to outlast Lambda's 15-minute runtime limit.
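A minimal sketch of that two-function idea follows, assuming the boto3 "emr-serverless" client and its start_job_run and get_job_run calls. The function names, the returned dictionary shape, and the idea of letting a Step Functions Wait/Choice loop call the checker repeatedly are assumptions made for illustration. The client is passed in as an argument so the logic can be exercised without AWS access.

```python
# States after which an EMR Serverless job run will not change again.
TERMINAL_STATES = {"SUCCESS", "FAILED", "CANCELLED"}


def start_job(client, application_id, execution_role_arn,
              entry_point, args, main_class=None):
    """Submit a Spark job and return its job run id (first Lambda)."""
    spark_submit = {
        "entryPoint": entry_point,
        "entryPointArguments": list(args),
    }
    if main_class:
        # Scala/Java jobs usually need the main class passed to spark-submit.
        spark_submit["sparkSubmitParameters"] = f"--class {main_class}"
    response = client.start_job_run(
        applicationId=application_id,
        executionRoleArn=execution_role_arn,
        jobDriver={"sparkSubmit": spark_submit},
    )
    return response["jobRunId"]


def check_job(client, application_id, job_run_id):
    """Look up the job state once and report it (second, polling Lambda).

    This does a single check and returns immediately, so it stays far
    below the Lambda time limit; the caller decides when to ask again.
    """
    job_run = client.get_job_run(
        applicationId=application_id, jobRunId=job_run_id
    )["jobRun"]
    state = job_run["state"]
    return {
        "state": state,
        "done": state in TERMINAL_STATES,
        "succeeded": state == "SUCCESS",
    }
```

In a real Lambda handler the client would typically be created once with boto3.client("emr-serverless") and the identifiers read from the incoming event.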

How to run MapReduce jobs on EMR Serverless?

Based on the documentation, Amazon EMR Serverless seems to accept only Spark and Hive as job drivers. Is there any support for running a custom Hadoop JAR for MapReduce jobs on EMR Serverless, similar to EMR?
That's correct. EMR Serverless currently supports only Spark and Hive jobs, so no MapReduce.

How to automate ETL job deployment and run?

We have ETL jobs, i.e. a Java JAR (which performs the ETL operations) run via a shell script. The shell script is passed parameters specific to the job being run. These shell scripts are run via crontab as well as manually, depending on requirements. Sometimes we also need to run SQL commands or scripts on a PostgreSQL RDS database before the shell script runs.
We have everything on AWS, i.e. an EC2 Talend server, PostgreSQL RDS, Redshift, Ansible, etc.
How can we automate this process? How should we deploy the jobs and handle passing custom parameters? Pointers are welcome.
I would go with AWS Data Pipeline and add steps to perform any pre- or post-operations for your ETL job, such as running shell scripts or HQL.
AWS Glue runs on the Spark engine and has other features as well, such as AWS Glue development endpoints, crawlers, the Data Catalog, and job schedulers. I think AWS Glue would be ideal if you are starting afresh or plan to move your ETL to AWS Glue. Please refer here for a price comparison.
AWS Data Pipeline: for details on AWS Data Pipeline
AWS Glue FAQ: for details on supported languages for AWS Glue
Please note, according to the AWS Glue FAQ:
Q: What programming language can I use to write my ETL code for AWS Glue?
You can use either Scala or Python.
Edit: As Jon scott commented, Apache Airflow is another option for job scheduling, but I have not used it.
You can use AWS Glue to perform serverless ETL. Glue also has triggers, which let you automate your jobs.
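To make the trigger idea concrete, here is a small sketch that assembles the request for a scheduled Glue trigger which also passes custom parameters to the job. It assumes the boto3 Glue client's create_trigger call; the helper name, trigger name, job name, schedule, and argument keys are hypothetical.

```python
def scheduled_trigger_request(trigger_name, job_name, cron_expression,
                              job_arguments):
    """Build keyword arguments for the Glue create_trigger call."""
    return {
        "Name": trigger_name,
        "Type": "SCHEDULED",
        # Glue schedules use the six-field AWS cron syntax.
        "Schedule": f"cron({cron_expression})",
        "Actions": [{
            "JobName": job_name,
            # Custom job parameters; keys conventionally start with "--".
            "Arguments": dict(job_arguments),
        }],
        "StartOnCreation": True,
    }


if __name__ == "__main__":
    request = scheduled_trigger_request(
        "nightly-etl-trigger",          # hypothetical trigger name
        "example-etl-job",              # hypothetical Glue job name
        "0 2 * * ? *",                  # every day at 02:00 UTC
        {"--target_table": "sales", "--run_mode": "full"},
    )
    print(request)
    # With AWS credentials configured, this could then be submitted with:
    #   import boto3
    #   boto3.client("glue").create_trigger(**request)
```

This replaces the crontab entry and the shell-script parameters with a trigger schedule and job arguments; it does not cover the pre-run SQL step.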

How to integrate Apache Nifi with Amazon Athena?

My requirement:
1. Users will run SQL queries through Apache NiFi against Amazon S3.
Is it possible to integrate NiFi with Amazon Athena?
You should be able to integrate Apache NiFi and Amazon Athena easily. NiFi's ability to plug in JDBC drivers and reuse that connection context in many places helps greatly here. For the JDBC drivers for Athena, see https://docs.aws.amazon.com/athena/latest/ug/connect-with-jdbc.html and for NiFi's DBCP facilities, see https://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-dbcp-service-nar/1.5.0/org.apache.nifi.dbcp.DBCPConnectionPool/index.html
You should be able to do this by combining an ExecuteStreamCommand processor with the AWS CLI. The CLI can issue Athena queries.
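As a sketch of the CLI route, the helper below assembles the aws athena start-query-execution command that such a processor could run. The function name, database, query, and S3 output location are placeholders, and how the arguments are entered into the NiFi processor configuration is left out.

```python
import shlex


def athena_cli_command(query, database, output_location):
    """Return the argv list for starting an Athena query via the AWS CLI."""
    return [
        "aws", "athena", "start-query-execution",
        "--query-string", query,
        "--query-execution-context", f"Database={database}",
        "--result-configuration", f"OutputLocation={output_location}",
    ]


if __name__ == "__main__":
    command = athena_cli_command(
        "SELECT * FROM example_table LIMIT 10",   # placeholder query
        "example_db",                             # placeholder database
        "s3://example-bucket/athena-results/",    # placeholder location
    )
    # Print a shell-safe version of the command for inspection.
    print(shlex.join(command))
```

The CLI call only starts the query and returns a query execution ID; the results land in the S3 output location and have to be fetched in a later step.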

Which Amazon Web Services *native* offering is closest to Apache Kudu?

I am looking for a native offering, such as any of the RDS solutions, ElastiCache, or Amazon Redshift, not something that I would have to host myself.
From the Apache Kudu site (https://kudu.apache.org/):
Kudu provides a combination of fast inserts/updates and efficient columnar
scans to enable multiple real-time analytic workloads across a single storage
layer. As a new complement to HDFS and Apache HBase, Kudu gives architects the
flexibility to address a wider variety of use cases without exotic workarounds.
As I understand it, Kudu is a columnar distributed storage engine for tabular data that allows for fast scans and ad-hoc analytical queries but ALSO allows for random updates and inserts. Every table has a primary key that you can use to find and update single records...
Second answer after question was revised.
The answer is Amazon EMR running Apache Kudu.
Amazon EMR is Amazon's service for Hadoop. Apache Kudu is a package that you install on Hadoop along with many others to process "Big Data".
If you are looking for a managed service for only Apache Kudu, then there is nothing. Apache Kudu is an open source tool that sits on top of Hadoop and is a companion to Apache Impala. On AWS both require Amazon EMR running Hadoop version 2.x or greater.