How to automate ETL job deployment and runs?

We have ETL jobs: a Java JAR that performs the ETL operations is run via a shell script. The shell script takes parameters that depend on the job being run. These shell scripts are run via crontab as well as manually, depending on the requirements. Sometimes we also need to run SQL commands or scripts on a PostgreSQL RDS database before the shell script runs.
Everything is on AWS: an EC2 Talend server, PostgreSQL RDS, Redshift, Ansible, etc.
How can we automate this process? How should we deploy the jobs and handle passing custom parameters? Pointers are welcome.

I would prefer to go with AWS Data Pipeline and add steps to perform any pre/post operations on your ETL job, such as running shell scripts or HQL.
AWS Glue runs on the Spark engine and has other features as well, such as the AWS Glue development endpoint, crawlers, the Data Catalog, and job schedulers. I think AWS Glue would be ideal if you are starting afresh or plan to move your ETL to AWS Glue. Please refer here for a price comparison.
AWS Data Pipeline: for details on AWS Data Pipeline
AWS Glue FAQ: for details on supported languages for AWS Glue
Please note that, according to the AWS Glue FAQ:
Q: What programming language can I use to write my ETL code for AWS Glue?
You can use either Scala or Python.
Edit: As Jon Scott commented, Apache Airflow is another option for job scheduling, but I have not used it.
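Whichever scheduler ends up triggering the job, the "optional SQL pre-step, then the shell script with its parameters" ordering from the question could be sketched roughly like this. This is only a minimal sketch: the function names, the script path, and the idea of passing the parameters as plain positional arguments are my assumptions, and it assumes psql is installed on the host and the script is executable.

```python
import subprocess
from typing import Callable, List, Optional, Sequence


def build_commands(script: str,
                   args: Sequence[str],
                   pre_sql_file: Optional[str] = None,
                   db_url: Optional[str] = None) -> List[List[str]]:
    """Return the commands to run, in order: optional SQL pre-step, then the job."""
    commands: List[List[str]] = []
    if pre_sql_file is not None:
        if db_url is None:
            raise ValueError("db_url is required when pre_sql_file is given")
        # ON_ERROR_STOP=1 makes psql exit non-zero on the first SQL error,
        # so a failed pre-step prevents the ETL script from starting.
        commands.append(["psql", "-d", db_url, "-v", "ON_ERROR_STOP=1",
                         "-f", pre_sql_file])
    # Assumption: the existing shell script takes its parameters positionally.
    commands.append([script, *args])
    return commands


def run_job(commands: Sequence[Sequence[str]],
            runner: Callable[..., object] = subprocess.run) -> None:
    """Run each command in order; check=True aborts on the first failure."""
    for cmd in commands:
        runner(list(cmd), check=True)
```

For example, run_job(build_commands("/opt/etl/run_job.sh", ["daily"], pre_sql_file="prep.sql", db_url="postgresql://host/db")) would run the SQL file first and the script only if that succeeded (the path and connection string here are made up). The same two-step ordering is what you would model as pre/post steps in a managed scheduler.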

You can use AWS Glue for serverless ETL. Glue also has triggers, which let you automate your jobs.
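As a rough illustration of a scheduled trigger that also passes custom parameters, here is a minimal sketch that builds the request for the Glue CreateTrigger API. The helper name and the trigger, job, and argument names are made up for the example; the request keys follow the boto3 create_trigger call as far as I know, and Glue job arguments are conventionally prefixed with "--".

```python
from typing import Any, Dict


def scheduled_trigger_request(trigger_name: str,
                              job_name: str,
                              cron: str,
                              job_args: Dict[str, str]) -> Dict[str, Any]:
    """Build keyword arguments for the Glue CreateTrigger API (scheduled type)."""
    return {
        "Name": trigger_name,
        "Type": "SCHEDULED",
        # Glue expects the schedule wrapped as cron(...), e.g. cron(0 2 * * ? *).
        "Schedule": f"cron({cron})",
        "Actions": [{
            "JobName": job_name,
            # Custom job parameters; keys get the conventional "--" prefix.
            "Arguments": {f"--{name}": value for name, value in job_args.items()},
        }],
        "StartOnCreation": True,
    }
```

With boto3 installed and credentials configured, the request could then be sent with boto3.client("glue").create_trigger(**scheduled_trigger_request("nightly", "my-etl-job", "0 2 * * ? *", {"run_date": "2024-01-01"})).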

Related

Do I need an S3 bucket for Python ETL scripts that run as AWS Batch jobs for a Splunk Forwarder?

I am trying to deploy (in CDK) scheduled Python ETL scripts as Batch jobs (Fargate?) to parse data from AWS and other tools we use. A Splunk Forwarder consumes this data and sends it to our Splunk index. Am I going to need an S3 bucket for the log output from my ETL scripts? How can I deploy the Splunk Forwarder alongside these scripts?
There are about 5-6 scripts that I would like to deploy via CDK.
AWS Batch jobs can send STDERR and STDOUT to CloudWatch Logs. Depending on how logging is configured in your Python scripts, that may be the easy answer. If logging is configured to write to a file, then yes, I would recommend uploading the file to S3 after the ETL is finished.
Output from the scripts (the ETL results) will need to land someplace, and S3 is a great choice for that. Your Splunk Forwarder can be set up to monitor the bucket for new data and ingest it. If the scripts send data directly to the forwarder, you should not need an S3 bucket, but I personally would recommend that you decouple the ETL data from the ingestion of the results into Splunk.
Splunk Forwarders (stable servers) would be deployed separately from the AWS Batch resources.
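For the logging part, a minimal sketch of pointing a script's logger at STDOUT, so the container's output stream carries the log lines, might look like this. The logger name "etl" and the format string are arbitrary choices of mine.

```python
import logging
import sys


def configure_logging(level: int = logging.INFO) -> logging.Logger:
    """Send log records to STDOUT instead of a file."""
    logger = logging.getLogger("etl")
    logger.setLevel(level)
    # Drop any handlers from a previous call so lines are not duplicated.
    logger.handlers.clear()
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s"))
    logger.addHandler(handler)
    # Do not also pass records up to the root logger.
    logger.propagate = False
    return logger
```

Calling configure_logging().info("rows parsed: %d", 42) then writes a single line to STDOUT, which is the stream Batch can forward to CloudWatch Logs.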

Orchestration of jobs using AWS Step Functions with EMR Serverless

Recently Amazon launched EMR Serverless, and I want to repurpose my existing data pipeline orchestration, which uses AWS Step Functions. There are steps that create an EMR cluster, run some Lambda functions, submit Spark jobs (mostly Scala jobs using spark-submit), and finally terminate the cluster. All these steps are of the sync type (arn:aws:states:::elasticmapreduce:addStep.sync).
There is documentation and there are GitHub samples that describe submitting jobs from an orchestration framework such as Airflow, but nothing that describes how to use AWS Step Functions with EMR Serverless. Any help in this regard is appreciated.
Primarily I am interested in repurposing the Step Functions task of type arn:aws:states:::elasticmapreduce:addStep.sync, which takes parameters such as ClusterId; in the case of EMR Serverless there is no such ID.
In summary, is there an equivalent of "Call Amazon EMR with Step Functions" for EMR Serverless?
At the time of writing there is no direct integration of EMR Serverless with Step Functions. A possible solution is to add a Lambda on top and use the SDK to create EMR Serverless applications and submit jobs. However, you would need an additional Lambda that implements a poller to track the success of the jobs (in case of interdependent jobs), as it is highly likely that the EMR job will outrun Lambda's 15-minute runtime limit.
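A rough sketch of the two pieces described above, one function to submit the job and one to poll it, could look like the following. The function names are mine, and the client is passed in so the logic stays easy to test; it is assumed to be a boto3 "emr-serverless" client, whose start_job_run and get_job_run calls are used here.

```python
from typing import Any, Dict, Sequence

# Job run states after which polling can stop.
TERMINAL_STATES = {"SUCCESS", "FAILED", "CANCELLED"}


def submit_job(client: Any, application_id: str, role_arn: str,
               entry_point: str, args: Sequence[str]) -> str:
    """Submit a Spark job to an existing EMR Serverless application."""
    response = client.start_job_run(
        applicationId=application_id,
        executionRoleArn=role_arn,
        jobDriver={"sparkSubmit": {
            "entryPoint": entry_point,
            "entryPointArguments": list(args),
        }},
    )
    return response["jobRunId"]


def check_job(client: Any, application_id: str, job_run_id: str) -> Dict[str, Any]:
    """One polling step: report the current state and whether it is final."""
    job_run = client.get_job_run(applicationId=application_id,
                                 jobRunId=job_run_id)["jobRun"]
    state = job_run["state"]
    return {"state": state, "done": state in TERMINAL_STATES}
```

Each function would be the body of its own Lambda handler. The state machine would call the submit Lambda once, then loop over a Wait state, the poller Lambda, and a Choice state on "done", so no single invocation has to outlive the job.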

CI/CD pipeline on AWS cloud for pyspark EMR application

I need to create a CI/CD pipeline in the AWS cloud for a PySpark application. In the end, this PySpark application is to be invoked through an Airflow DAG.
I am no expert on this either, but you can follow this guide:
https://aws.amazon.com/blogs/big-data/implement-continuous-integration-and-delivery-of-apache-spark-applications-using-aws/
The idea is to automate job testing in Spark local mode, then run a live job with infrastructure created on the fly, and finally deploy the job to production if all the previous steps succeed. I would keep my production jobs automated in Airflow and run this CI/CD pipeline on development branches (without the deploy-to-production step, of course) as well as on pull requests against the main branch. That way your production jobs keep working and only incorporate new functionality or changes after they have been fully tested on development branches.

Scheduling over different AWS Components - Glue and EMR

I was wondering how I would tackle the following on AWS, or whether it is even possible:
Transient EMR Cluster for some bulk Spark processing
When that cluster terminates, then and only then use a Glue Job to do some limited processing
I am not convinced AWS Glue triggers will help across environments.
Or would one say this is not a good use case and that I should just keep everything in the EMR cluster? Glue can write to SAP HANA with an appropriate connector, and loading Redshift via a Glue job with Redshift Spectrum is a common use case.
You can use the "Run a Job" (.sync) service integration pattern in AWS Step Functions. Step Functions supports both EMR and Glue integrations.
Please refer to the link for details.
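To make the "Glue only after EMR has finished" ordering concrete, here is a minimal sketch that builds such a state machine definition as a Python dict. The state names, step arguments, and Glue job name are hypothetical, and it assumes an earlier state has already created the cluster and put its ID at $.ClusterId. The .sync resource ARNs make each state wait for completion before moving on.

```python
import json
from typing import Any, Dict, List


def emr_then_glue_definition(spark_args: List[str], glue_job_name: str) -> Dict[str, Any]:
    """Run a Spark step, terminate the cluster, and only then start the Glue job."""
    return {
        "StartAt": "RunSparkStep",
        "States": {
            "RunSparkStep": {
                "Type": "Task",
                "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
                "Parameters": {
                    "ClusterId.$": "$.ClusterId",
                    "Step": {
                        "Name": "bulk-spark-processing",
                        "ActionOnFailure": "CONTINUE",
                        "HadoopJarStep": {
                            "Jar": "command-runner.jar",
                            "Args": ["spark-submit", *spark_args],
                        },
                    },
                },
                # Keep the original input (with ClusterId) for the next state.
                "ResultPath": "$.SparkStep",
                "Next": "TerminateCluster",
            },
            "TerminateCluster": {
                "Type": "Task",
                "Resource": "arn:aws:states:::elasticmapreduce:terminateCluster.sync",
                "Parameters": {"ClusterId.$": "$.ClusterId"},
                # null ResultPath discards the result and passes the input on.
                "ResultPath": None,
                "Next": "RunGlueJob",
            },
            "RunGlueJob": {
                "Type": "Task",
                "Resource": "arn:aws:states:::glue:startJobRun.sync",
                "Parameters": {"JobName": glue_job_name},
                "End": True,
            },
        },
    }
```

json.dumps(emr_then_glue_definition(["s3://my-bucket/job.py"], "post-processing")) gives the JSON you would paste into (or deploy as) the state machine definition. If the Spark step fails, the execution fails and the Glue job is never started; adding a Catch that still terminates the cluster would be a sensible extension.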
Having spoken to Amazon on this aspect, they indicate that Airflow via MWAA is the preferred option now.

Airflow best practice : s3_to_sftp_operator instead of running aws cli?

What would be the best solution for transferring files between S3 and an EC2 instance using Airflow?
After some research I found there is an s3_to_sftp_operator, but I know it's good practice to execute tasks on the external systems instead of on the Airflow instance...
I'm thinking about using a BashOperator that runs an AWS CLI command on the remote EC2 instance, since that respects the principle above.
Do you have any production best practices to share for this case?
The s3_to_sftp_operator is going to be the better choice unless the files are large. Only if the files are large would I consider a bash operator with an ssh onto a remote machine. As for what large means, I would just test with the s3_to_sftp_operator and if the performance of everything else on airflow isn't meaningfully impacted then stay with it. I'm regularly downloading and opening ~1 GiB files with PythonOperators in airflow on 2 vCPU airflow nodes with 8 GiB RAM. It doesn't make sense to do anything more complex on files that small.
The best solution would be not to transfer the files, and most likely to get rid of the EC2 while you are at it.
If you have a task that needs to run on some data in S3, then just run that task directly in airflow.
If you can't run that task in airflow because it needs vast power or some weird code that airflow won't run, then have the EC2 instance read S3 directly.
If you're using airflow to orchestrate the task because the task is watching the local filesystem on the EC2, then just trigger the task and have the task read S3.
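As a small illustration of "have the task read S3" instead of copying files around first, a sketch along these lines streams an object line by line. The function name is made up, and the client is passed in; it is assumed to be a boto3 S3 client, whose get_object response body supports iter_lines().

```python
from typing import Any, Iterator


def read_s3_lines(s3_client: Any, bucket: str, key: str,
                  encoding: str = "utf-8") -> Iterator[str]:
    """Stream an S3 object line by line without writing it to local disk."""
    body = s3_client.get_object(Bucket=bucket, Key=key)["Body"]
    for raw_line in body.iter_lines():
        yield raw_line.decode(encoding)
```

On the EC2 instance (or inside an Airflow task) this would be used as for line in read_s3_lines(boto3.client("s3"), "my-bucket", "path/to/file.csv"): ..., with the bucket and key being placeholders and the instance role granting read access to the bucket.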