Amazon EMR: how to find out when a job is finished?

I'm using the Amazon Elastic MapReduce Ruby client (http://aws.amazon.com/developertools/2264) to run my Hive job. Is there a way to know when the job is done? Right now all I can think of is to keep running emrclient with "--list --active", but I'm hoping there is a better way to do this.
Thank you

You can also see this in the EMR section of the AWS console.
If your concern is to terminate the cluster once your job is done, then do not use the --stay-alive option when launching it. Alternatively, you can have a script that polls the current status of the running cluster and terminates it once it reaches the 'waiting' state.
I do not think there is another way.
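If you can use the AWS SDK rather than the Ruby client, a minimal polling sketch with boto3 (a different tool than the one in the question; the cluster ID, region, and poll interval below are placeholders) could look like this:

import time
import boto3

emr = boto3.client("emr", region_name="us-east-1")   # region is a placeholder
cluster_id = "j-XXXXXXXXXXXXX"                        # placeholder cluster ID

while True:
    state = emr.describe_cluster(ClusterId=cluster_id)["Cluster"]["Status"]["State"]
    print(state)
    if state in ("WAITING", "TERMINATED", "TERMINATED_WITH_ERRORS"):
        break   # the job flow is idle or gone, so the job is no longer running
    time.sleep(60)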

Related

systemd vs gitlab cicd

This may be a crazy question -
I want to host an algo-trading system that starts at 9:00 AM and runs until 3:00 PM. I'm considering hosting it either as a service using systemd or using GitLab CI/CD to trigger it (I can watch activity there at any moment).
What is the best choice? Is CI/CD reliable for running the whole day?
I know your bounty says you're looking for a canonical answer, but I don't think one really exists for this question, since there is no single right answer for your use case.
You can absolutely create a CI/CD job and set its timeout to 6 hours, but I don't think that's really what you want to do here. It sounds like you essentially just want a background job that kicks off every day and processes your trades. You may also want notification if something in the job fails, or you may want the job to restart automatically.
Systemd would be the simplest way to do this, and KISS is always a good principle to follow when designing your solution. Using GitLab would require you to host the GitLab service itself, along with a runner that would execute the jobs each day, whereas Systemd would only require you to register the service.
If you scale up to the point where you're trying to run many such jobs at once, you'll still likely be better off with a workflow manager such as Apache Airflow (or AWS step functions, etc).
So overall, I wouldn't recommend a CI/CD solution to run what is effectively a job server. Start with Systemd while you're small, then migrate to a true workflow solution when you need to scale.
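For illustration, a minimal sketch of the systemd side, assuming a trading script at /opt/trader/run.py that exits on its own at 3:00 PM (every name and path here is a placeholder): a service unit plus a timer that starts it each weekday morning.

# /etc/systemd/system/algo-trader.service
[Unit]
Description=Algo trading session (placeholder)

[Service]
Type=simple
ExecStart=/usr/bin/python3 /opt/trader/run.py

# /etc/systemd/system/algo-trader.timer
[Unit]
Description=Start the trading session every weekday at 09:00

[Timer]
OnCalendar=Mon..Fri 09:00
Persistent=true

[Install]
WantedBy=timers.target

Enable it with systemctl enable --now algo-trader.timer; an OnFailure= unit can be layered on later if you want to be notified when a run fails.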

Airflow best practice : s3_to_sftp_operator instead of running aws cli?

What would be the best solution to transfer files between S3 and an EC2 instance using Airflow?
After some research I found the s3_to_sftp_operator, but I know it's good practice to execute tasks on the external systems rather than on the Airflow instance...
I'm thinking about running a BashOperator that executes the AWS CLI on the remote EC2 instance, since that respects the principle above.
Do you have any production best practices to share for this case?
The s3_to_sftp_operator is going to be the better choice unless the files are large. Only if the files are large would I consider a BashOperator with an SSH onto a remote machine. As for what "large" means, I would just test with the s3_to_sftp_operator, and if the performance of everything else on Airflow isn't meaningfully impacted, stay with it. I'm regularly downloading and opening ~1 GiB files with PythonOperators in Airflow on 2 vCPU Airflow nodes with 8 GiB RAM. It doesn't make sense to do anything more complex for files that small.
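For reference, a minimal sketch of wiring the operator into a DAG (bucket, key, path, and connection IDs are placeholders; the import path and argument names follow the Airflow 1.10 contrib version of s3_to_sftp_operator, so check the documentation for your own version):

from datetime import datetime

from airflow import DAG
from airflow.contrib.operators.s3_to_sftp_operator import S3ToSFTPOperator

with DAG("s3_to_ec2_transfer",            # placeholder DAG id
         start_date=datetime(2020, 1, 1),
         schedule_interval="@daily",
         catchup=False) as dag:

    transfer = S3ToSFTPOperator(
        task_id="copy_report",
        s3_bucket="my-bucket",             # placeholder bucket
        s3_key="exports/report.csv",       # placeholder key
        sftp_path="/data/report.csv",      # destination path on the EC2 host
        sftp_conn_id="ssh_ec2",            # SSH connection pointing at the EC2 instance
        s3_conn_id="aws_default",          # AWS credentials connection
    )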
The best solution would be not to transfer the files, and most likely to get rid of the EC2 while you are at it.
If you have a task that needs to run on some data in S3, then just run that task directly in airflow.
If you can't run that task in airflow because it needs vast power or some weird code that airflow won't run, then have the EC2 instance read S3 directly.
If you're using airflow to orchestrate the task because the task is watching the local filesystem on the EC2, then just trigger the task and have the task read S3.

Is it safe to run SCRIPT FLUSH on Redis cluster?

Recently, I started to have some trouble with one of my Redis clusters: used_memory and used_memory_rss are increasing constantly.
According to some Googling, I found following discussion:
https://github.com/antirez/redis/issues/4570
Now I am wondering whether it is safe to run the SCRIPT FLUSH command on my production Redis cluster.
Yes, you can run the SCRIPT FLUSH command safely on a production cluster. The only potential side effect is blocking the server while it executes. Note, however, that you'll want to call it on each of your nodes.
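A minimal sketch of doing that with redis-py, assuming you can reach every node directly (the host list, port, and auth below are placeholders):

import redis

# Placeholder list of cluster nodes; in practice you could build this
# from the output of CLUSTER NODES on any member.
nodes = [("10.0.0.1", 6379), ("10.0.0.2", 6379), ("10.0.0.3", 6379)]

for host, port in nodes:
    r = redis.Redis(host=host, port=port, password=None)   # add auth if your cluster needs it
    r.script_flush()   # clears the Lua script cache on this node
    print(host, port, "flushed")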

Scheduler not queuing jobs

I'm trying to test out Airflow on Kubernetes. The Scheduler, Worker, Queue, and Webserver are all on separate deployments, and I am using the Celery Executor to run my tasks.
Everything is working fine except that the Scheduler is not queuing up jobs. Airflow runs my tasks fine when I manually trigger them from the web UI or CLI, but I'm trying to get the scheduler itself to do it.
My configuration is almost the same as it is on a single server:
sql_alchemy_conn = postgresql+psycopg2://username:password@localhost/db
broker_url = amqp://user:password@$RABBITMQ_SERVICE_HOST:5672/vhost
celery_result_backend = amqp://user:password@$RABBITMQ_SERVICE_HOST:5672/vhost
I believe that with these configurations I should be able to make it run, but for some reason only the workers are able to see the DAGs and their state, not the scheduler, even though the scheduler logs its heartbeats just fine. Is there anything else I should debug or look at?
First, you use Postgres as the database for Airflow, don't you? Do you deploy a pod and a service for Postgres? If so, verify that in your config file you have:
sql_alchemy_conn = postgresql+psycopg2://username:password@serviceNamePostgres/db
You can use this GitHub repo. I used it 3 weeks ago for a first test and it worked pretty well.
The entrypoint is useful to verify that RabbitMQ and Postgres are well configured.
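For reference, a sketch of how those three settings might look when Postgres and RabbitMQ sit behind Kubernetes services (the service names postgres and rabbitmq are placeholders for whatever your services are actually called):

sql_alchemy_conn = postgresql+psycopg2://username:password@postgres:5432/db
broker_url = amqp://user:password@rabbitmq:5672/vhost
celery_result_backend = amqp://user:password@rabbitmq:5672/vhost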

Amazon EMR managing my spark cluster

I have a Spark setup on Amazon EC2 machines with 2 worker machines running. It reads data from Cassandra, does some processing, and writes to SQL Server. I have heard about Amazon EMR and read about it. I want a managed system where worker machines are automatically added to my cluster if my job is taking more time, and shut down when my job completes.
Can I achieve this with Amazon EMR?
The requirements are:
My worker machines are automatically added to my cluster if my job is taking more time.
Shut down when my job completes.
No. 2 is definitely possible if your job is launched from EMR steps. There is an option that auto-terminates the cluster after the last step is completed. Alternatively, this could also be done programmatically with the SDK.
No. 1 is a little more difficult, but EMR has three classes of nodes: master, core, and task. Task nodes can be added after cluster creation. The trigger for that would probably have to be done programmatically, or by using another Amazon service, like Lambda.
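For what it's worth, a minimal boto3 sketch of both pieces (instance types, counts, names, and the S3 script path are placeholders): the auto-terminate behaviour comes from launching with KeepJobFlowAliveWhenNoSteps=False, and task nodes can be bolted on later with add_instance_groups, for example from a Lambda that watches job progress.

import boto3

emr = boto3.client("emr", region_name="us-east-1")    # region is a placeholder

# No. 2: launch a cluster that terminates itself after its steps finish.
cluster = emr.run_job_flow(
    Name="spark-batch",                                # placeholder name
    ReleaseLabel="emr-6.10.0",                         # placeholder release
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,          # auto-terminate after the last step
    },
    Steps=[{
        "Name": "spark-job",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/job.py"],   # placeholder script
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)

# No. 1: add task nodes later, e.g. triggered by a Lambda watching the job.
emr.add_instance_groups(
    JobFlowId=cluster["JobFlowId"],
    InstanceGroups=[{
        "Name": "extra-task-nodes",
        "InstanceRole": "TASK",
        "InstanceType": "m5.xlarge",
        "InstanceCount": 2,
        "Market": "ON_DEMAND",
    }],
)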