CI/CD pipeline on AWS for a PySpark EMR application

I need to create a CI/CD pipeline on AWS for a PySpark application; the PySpark job is ultimately invoked through an Airflow DAG.

I am no expert on this either, but you can follow this guide:
https://aws.amazon.com/blogs/big-data/implement-continuous-integration-and-delivery-of-apache-spark-applications-using-aws/
The idea is to automate job testing in Spark local mode, then run a live job against infrastructure created on the fly, and finally deploy the job to production if all the previous steps succeed. I would keep my production jobs automated in Airflow and run this CI/CD pipeline on development branches (without the production-deployment step, of course) as well as on PRs against the main branch. That way your production jobs will always keep functioning correctly and will only incorporate new functionality/changes after they are fully tested on development branches.
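For the local-mode stage, a minimal sketch of what the automated test could look like is below: a pytest module that runs the transformation logic on a local SparkSession, so CI can fail fast before any EMR infrastructure is created. The add_revenue function and its column names are hypothetical stand-ins for your own job logic.

import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    # Local mode: no cluster needed, runs entirely inside the CI container
    return (SparkSession.builder
            .master("local[2]")
            .appName("ci-unit-tests")
            .getOrCreate())

def add_revenue(df):
    # Hypothetical transformation under test; replace with your job's logic
    return df.withColumn("revenue", df.price * df.quantity)

def test_add_revenue(spark):
    df = spark.createDataFrame([(10.0, 3)], ["price", "quantity"])
    result = add_revenue(df).collect()[0]
    assert result["revenue"] == 30.0

Running plain pytest on this in the pipeline's first stage keeps broken logic from ever reaching the on-the-fly EMR cluster, where the more expensive live-job stage runs.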

Related

How do we restore Kubeflow from backups if the installation is destroyed, or get Kubeflow back as it was if the EKS cluster is destroyed?

How am I going to take a backup of my Kubeflow pipelines and restore them if the installation fails or the EKS cluster is destroyed? I have tried to find the image of the vanilla database I am using and work out how to back it up and restore it, but I haven't had any luck so far.
I have Kubeflow running on an AWS EKS cluster with 15-16 pipelines, and I used the vanilla installation for the database. So now I need your help to know how to back up the pipelines and restore them if anything happens to Kubeflow or EKS.
If you inspect the Kubeflow manifest file, you'll see the list of dependencies it has. The largest one is the database. For those on AWS, you can use RDS as a target for running the database rather than a self-hosted one in Kubernetes.
You can see the instructions for that here.
I used to be the Product Manager for Open Source MLOps at AWS, and my team wrote that post.
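If you stay on the in-cluster database for now, a hedged sketch of a periodic backup could look like the following; the namespace, deployment and database names are the usual vanilla-install defaults, but treat them as assumptions and verify them against your own manifests. With RDS you would instead point mysqldump at the RDS endpoint or simply rely on automated snapshots.

import subprocess
from datetime import datetime

NAMESPACE = "kubeflow"       # assumed namespace of the in-cluster MySQL
DEPLOYMENT = "deploy/mysql"  # assumed name of the MySQL deployment
DATABASE = "mlpipeline"      # assumed database holding pipeline/run metadata

dump_file = f"kfp-backup-{datetime.utcnow():%Y%m%d%H%M%S}.sql"
with open(dump_file, "wb") as out:
    subprocess.run(
        ["kubectl", "exec", "-n", NAMESPACE, DEPLOYMENT, "--",
         "mysqldump", "-uroot", DATABASE],   # add a password flag if your MySQL requires one
        stdout=out, check=True,
    )
print("wrote", dump_file)

Restoring on a fresh cluster (or into RDS) is then a matter of piping the dump back into mysql before redeploying the pipelines, and copying the dump file to S3 keeps it safe if the whole EKS cluster goes away.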

How to build development and production environments in Apache NiFi

I have two Apache NiFi servers, development and production, hosted on AWS; currently the migration between development and production is done manually. I would like to know whether it is possible to automate this process and ensure that people do not develop in production.
I thought about putting the entire NiFi configuration in GitHub and having it deploy the new NiFi to the production server, but I don't know whether that would be the right approach.
One option is to use NiFi Registry: store the flows in the registry and share the registry between the development and production environments. You can then promote the latest version of a flow from dev to prod.
As you say, another option is to use Git to share the flow.xml.gz between environments along with a deploy script. The flow.xml.gz stores the data flow configuration/canvas. You can use parameterized flows (https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#Parameters) to point NiFi at different external dev/prod services (e.g. the dev NiFi processor uses a dev database URL, while prod points to the prod database URL).
One more option is to export all or part of the NiFi flow as a template, and upload the template to your production NiFi, however registry is probably a better way of handling this. More info on templates here: https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#templates.
I believe the original design plan behind NiFi was not necessarily to have different environments, and to allow live changes in production. I guess you would build your initial data flow using some test data in production and then once it's ready start the live data flow. But I think it's reasonable to want to have separate environments.
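If you go the shared-registry route, a small script can at least report which flow versions exist and are ready to promote, which is handy as a CI check. This is a rough sketch against the NiFi Registry REST API; the host, endpoint paths and response fields are assumptions to verify against your registry's /nifi-registry-api documentation.

import requests

# Hypothetical registry host shared by the dev and prod NiFi instances
REGISTRY = "http://nifi-registry.example.com:18080/nifi-registry-api"

for bucket in requests.get(f"{REGISTRY}/buckets").json():
    flows = requests.get(f"{REGISTRY}/buckets/{bucket['identifier']}/flows").json()
    for flow in flows:
        latest = requests.get(
            f"{REGISTRY}/buckets/{bucket['identifier']}"
            f"/flows/{flow['identifier']}/versions/latest"
        ).json()
        # Print bucket, flow and the latest version available for promotion
        print(bucket["name"], flow["name"],
              "latest version:", latest["snapshotMetadata"]["version"])

The actual promotion (changing the version a process group tracks) is still done from the NiFi UI or the NiFi Toolkit CLI, but a report like this makes it easy to see what dev has published that prod has not yet picked up.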

How to automate ETL job deployment and run?

We have ETL jobs, i.e. a Java JAR (which performs the ETL operations) that is run via a shell script. The shell script is passed parameters according to the job being run. These shell scripts are run via crontab as well as manually, depending on the requirements. Sometimes there is also a need to run some SQL commands/scripts on the PostgreSQL RDS database before the shell script runs.
We have everything on AWS: an EC2 Talend server, PostgreSQL RDS, Redshift, Ansible, etc.
How can we automate this process? How do we deploy and handle passing custom parameters, etc.? Pointers are welcome.
I would prefer to go with AWS Data Pipeline, adding steps to perform any pre/post operations on your ETL job, such as running shell scripts or any HQL, etc.
AWS Glue runs on the Spark engine and has other features as well, such as the AWS Glue development endpoint, crawlers, the Data Catalog, and job schedulers. I think AWS Glue would be ideal if you are starting afresh or plan to move your ETL to AWS Glue. Please refer here for a price comparison.
AWS Data Pipeline: for details on AWS Data Pipeline.
AWS Glue FAQ: for details on supported languages for AWS Glue.
Please note, according to the AWS Glue FAQ:
Q: What programming language can I use to write my ETL code for AWS Glue?
You can use either Scala or Python.
Edit: As Jon Scott commented, Apache Airflow is another option for job scheduling, but I have not used it.
You can use AWS Glue to perform serverless ETL. Glue also has triggers, which let you automate its jobs.
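If you do move to Glue, starting a job with custom parameters (the equivalent of the arguments you currently pass to your shell scripts) is a single boto3 call that can live in a small script, a cron entry, or an Airflow task. A minimal sketch, with placeholder job name and argument keys:

import boto3

glue = boto3.client("glue", region_name="us-east-1")

response = glue.start_job_run(
    JobName="nightly-etl",              # placeholder: your Glue job name
    Arguments={
        # Custom parameters are passed as --key / value pairs and read
        # inside the job via getResolvedOptions
        "--TARGET_DATE": "2020-01-01",
        "--ENV": "prod",
    },
)
print("Started run:", response["JobRunId"])

Any pre-step such as running SQL against the RDS instance can be wrapped the same way before the start_job_run call, which keeps the whole sequence in one scheduled script or DAG.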

Scheduler not queuing jobs

I'm trying to test out Airflow on Kubernetes. The Scheduler, Worker, Queue, and Webserver are all on different deployments and I am using a Celery Executor to run my tasks.
Everything is working fine except for the fact that the Scheduler is not able to queue up jobs. Airflow is able to run my tasks fine when I manually execute it from the Web UI or CLI but I am trying to test the scheduler to make it work.
My configuration is almost the same as it is on a single server:
sql_alchemy_conn = postgresql+psycopg2://username:password@localhost/db
broker_url = amqp://user:password@$RABBITMQ_SERVICE_HOST:5672/vhost
celery_result_backend = amqp://user:password@$RABBITMQ_SERVICE_HOST:5672/vhost
I believe that with these configurations, I should be able to make it run but for some reason, only the workers are able to see the DAGs and their state, but not the scheduler, even though the scheduler is able to log their heartbeats just fine. Is there anything else I should debug or look at?
First, you are using Postgres as the database for Airflow, aren't you? Did you deploy a pod and a service for Postgres? If so, verify that in your config file you have:
sql_alchemy_conn = postgresql+psycopg2://username:password@serviceNamePostgres/db
You can use this GitHub repo. I used it 3 weeks ago for a first test and it worked pretty well.
The entrypoint is useful to verify that RabbitMQ and Postgres are configured correctly.
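Before digging further into the scheduler, it can also help to confirm from inside the scheduler pod that both the metadata database and the Celery broker resolve and accept connections through their Kubernetes service names. A quick sketch with placeholder credentials and service names:

import psycopg2
import pika

# Metadata DB: the hostname must be the Postgres service name, not localhost
conn = psycopg2.connect("postgresql://username:password@postgres-service/db")
print("metadata DB reachable:", conn.closed == 0)
conn.close()

# Celery broker: same idea for the RabbitMQ service
params = pika.URLParameters("amqp://user:password@rabbitmq-service:5672/vhost")
channel = pika.BlockingConnection(params).channel()
print("broker reachable:", channel.is_open)

If either call fails only from the scheduler pod, the problem is service discovery or network policy rather than the Airflow configuration itself.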

Automatic cluster setup and app deployment on GCE Kubernetes

We are looking for a solid, declarative (YAML-based) procedure to automate the setup of our Kubernetes cluster and application deployments on Google Container Engine.
As a last resort in a serious failure, we want to be able to:
Create a new GCE cluster
Execute all our deployments to their latest versions
Execute all the steps in the correct order
What solutions are people currently using? Doing this manually takes us about an hour and is error-prone; if automated, it could realistically take 15-20 minutes.
You should take a look at Google Cloud Deployment Manager. It "automates the creation and management of your Google Cloud Platform resources for you," meaning that it can create a Google Container Engine cluster as well as your deployments.
Looking through the GKE deployment manager example should help get you started.
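As a starting point, Deployment Manager templates can be written in Python as well as YAML. The sketch below declares a Container Engine cluster; the container.v1.cluster type and the property names follow the public deploymentmanager-samples, but treat them as assumptions and check the current samples before relying on them. Your application deployments would then be applied on top of the new cluster with kubectl or your CI system once it reports ready.

# gke_cluster.py -- a minimal Deployment Manager Python template (sketch)
def generate_config(context):
    cluster_name = context.env["deployment"] + "-cluster"
    return {
        "resources": [{
            "name": cluster_name,
            "type": "container.v1.cluster",   # assumed type, as used in deploymentmanager-samples
            "properties": {
                "zone": context.properties["zone"],
                "cluster": {
                    "initialNodeCount": context.properties.get("nodeCount", 3),
                    "nodeConfig": {"machineType": "n1-standard-1"},
                },
            },
        }],
    }

Because the template is declarative, re-running the same deployment after a failure recreates the cluster identically, which covers the "create a new GCE cluster" step; the order of the application deployments is then handled by whatever applies the manifests afterwards.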