Ideal way of triggering an Airflow DAG with config

Current Structure:
I am currently deploying Airflow on our servers. There is one server dedicated to Airflow, plus a few other worker servers, each of which hosts the applications needed to perform the Airflow tasks.
Usage:
For each DAG, I am using SSHOperators to run SSH commands on the worker servers to complete the tasks.
Config:
Each task needs to access a config file that contains the file paths and key/value settings for its operation. The config file is likely to be slightly different for every DAG run.
I do understand that there are many ways to trigger a DAG, including:
- passing a config object at run time, either via the CLI or the REST API (see the example after this list)
- having a config.json stored on the worker servers, and having each task load it when the task starts
- saving the config information in the Airflow admin page, and accessing the config elements using XCom
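For reference, this is roughly how a run can be triggered with a config payload via the stable REST API (assuming Airflow 2.x; the host, credentials, DAG id and config keys below are placeholders):

curl -X POST "http://<airflow-host>:8080/api/v1/dags/<dag_id>/dagRuns" \
  --user "<user>:<password>" \
  -H "Content-Type: application/json" \
  -d '{"conf": {"foo": {"input_path": "/data/in", "output_path": "/data/out"}}}'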
Concerns:
I am currently passing the config as a JSON string (2-3 KB) via the REST API and embedding the config values in the SSH bash commands, e.g.
/task/foo --do-something --config "{{ dag_run.conf['foo'] }}".
I am worried that this may one day overload the Airflow database, or that someone might mistakenly send a huge config (>10 MB).
Questions:
I am wondering what the ideal way to trigger an Airflow DAG with config is. How is the dag_run config stored? Is there any garbage-collection feature that periodically cleans out the cached configs?

Related

Access a specific task inside a Fargate service

I'm new to AWS, so forgive me if the question is trivial.
I have a cluster running a single Fargate service with two tasks, which hosts my internal API service. I can access the API via the main endpoint and everything works.
https://<serviceid>.execute-api.us-east-1.amazonaws.com/lookupx will return the lookupx result from one of two tasks as determined by the load balancer.
I would like to get the result from each task. I know the ENI for each task and I know the private IPs.
What do I need to do in order to address a specific task in a call?
Why do I care? The service reads 40+ files from S3 into memory at startup and provides an endpoint to look up a value and return the corresponding data. I'd like to add an endpoint to reload a single file on demand, but I need to make sure both tasks get updated. Not my design, and I don't have the time or budget to rebuild it. I'm just looking for a better solution than restarting the tasks and reloading all 40+ files just to update one. That wasn't bad with weekly updates; it kinda sucks with daily updates.
Please note that the private IPs can change after a task is restarted.
You can run an extra scheduled/on-demand task, with the same or a different task definition, that finds the service via the AWS API, gets its current tasks and their IPs, and then calls your API on all of them.
The script can be bash or any other supported language: https://aws.amazon.com/developer/tools/
With bash, you can list all of the service's tasks:
aws ecs list-tasks --cluster <clusterName> --service-name <serviceName>
and then get their private IPs:
aws ecs describe-tasks --cluster <clusterName> --tasks <taskARN1 taskARN2> --query 'tasks[].attachments[].details[?name==`privateIPv4Address`].value[]'
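Putting the two commands together, here is a rough bash sketch of what that helper task could run (the cluster/service names, port 8080 and the /reload endpoint are placeholders for your actual API):

CLUSTER=<clusterName>
SERVICE=<serviceName>

# Collect the ARNs of all tasks currently backing the service
TASK_ARNS=$(aws ecs list-tasks --cluster "$CLUSTER" --service-name "$SERVICE" \
  --query 'taskArns[]' --output text)

# Resolve each task's private IPv4 address
IPS=$(aws ecs describe-tasks --cluster "$CLUSTER" --tasks $TASK_ARNS \
  --query 'tasks[].attachments[].details[?name==`privateIPv4Address`].value[]' \
  --output text)

# Call the (hypothetical) reload endpoint on every task so each in-memory copy gets updated
for ip in $IPS; do
  curl -s "http://${ip}:8080/reload?file=lookupx" || echo "reload failed for ${ip}"
done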

How to insert parameters from external config file into Spinnaker pipeline?

On the Spinnaker UI, I can see that in the pipeline's Configuration stage there is a section called “Parameters” where I can specify parameters to be used in the subsequent stages.
However, instead of manually configuring parameters one by one in the Spinnaker UI, is it possible to have a stage in the Spinnaker pipeline read these parameters from an external file or from a file in a GitHub repository?
As @Mickey Donald mentioned, all Spinnaker pipelines are just JSON files.
You can use consul-template to generate or set the values for those parameters by retrieving them from a Consul instance.
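For example, a small sketch with consul-template, assuming a hypothetical Consul KV key myapp/imageTag and the usual parameterConfig block in the pipeline JSON:

cat > pipeline.json.tpl <<'EOF'
{
  "parameterConfig": [
    { "name": "imageTag", "default": "{{ key "myapp/imageTag" }}", "required": true }
  ]
}
EOF

consul-template -template "pipeline.json.tpl:pipeline.json" -once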
Another approach is to generate a JSON file with Terraform, then reference and import that file into your pipeline using Jsonnet to generate a new pipeline with the values already populated.
Whichever method you decide to use, you'll end up needing either a Spinnaker pipeline with a Save Artifact stage to load the new pipeline into Spinnaker, or the spin CLI to load it via GitHub Actions, Jenkins, etc.
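For the spin CLI route, the save step in a CI job can be as simple as the following (assuming the rendered pipeline.json already contains the application and pipeline names):

spin pipeline save --file pipeline.json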

MLflow artifacts on S3 but not in UI

I'm running MLflow on my local machine and logging everything through a remote tracking server, with my artifacts going to an S3 bucket. I've confirmed that they are present in S3 after a run, but when I look at the UI the artifacts section is completely blank. There's no error, just empty space.
Any idea why this is? I've included a picture from the UI.
You should see a 500 response for the artifacts request to the MLflow tracking server, e.g. in the browser console after opening the page of the model of interest. The UI service wouldn't know the artifact location (since you set that to be an S3 bucket) and tries to load the defaults.
You need to specify the --artifacts-destination s3://yourPathToArtifacts argument to your mlflow server command. Also, when running the server in your environment, don't forget to supply some common AWS credentials provider(s) (such as the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables), as well as the MLFLOW_S3_ENDPOINT_URL environment variable to point to your S3 endpoint.
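A minimal sketch of such a server invocation (the bucket path, endpoint URL and SQLite backend store below are placeholders; adjust them to your setup):

export AWS_ACCESS_KEY_ID=<your-key-id>
export AWS_SECRET_ACCESS_KEY=<your-secret-key>
export MLFLOW_S3_ENDPOINT_URL=https://s3.us-east-1.amazonaws.com

mlflow server \
  --host 0.0.0.0 --port 5000 \
  --backend-store-uri sqlite:///mlflow.db \
  --artifacts-destination s3://yourPathToArtifacts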
I had the same issue with MLflow running on an EC2 instance. I logged into the server and noticed that it was overloaded and had no disk space left. I deleted a few temp files and the MLflow UI started displaying the files again. It seems like MLflow stores tons of tmp files, but that is a separate issue.

What are the best practices for a Tekton implementation with multiple repositories and multiple deployments?

We have multiple repositories that have multiple deployments in K8S.
Today, we have Tekton with the following setup:
- We have 3 different projects that should be built and deployed the same way (they just have different repos and different names).
- We defined 3 Tasks: Build Image, Deploy to S3, and Deploy to K8S cluster.
- We defined 1 Pipeline that accepts parameters from the PipelineRun.
Our problem is that we want to receive webhooks externally from GitHub and run the appropriate Pipeline automatically, without having to start it with params by hand.
In addition, we want the PipelineRun to have default parameters, so users can invoke deployments automatically.
So, does our configuration and setup seem OK? Should we do something differently?
Our problem is that we want to receive webhooks externally from GitHub and run the appropriate Pipeline automatically, without having to start it with params by hand. In addition, we want the PipelineRun to have default parameters, so users can invoke deployments automatically.
This sounds OK. The GitHub webhook initiates PipelineRuns of your Pipeline through a Trigger, but your Pipeline can also be started by users directly in the cluster, or by using the Tekton Dashboard.
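For the manual path, a user could start the same Pipeline with the tkn CLI and pass the per-project values as params (the pipeline and parameter names here are made up for illustration):

tkn pipeline start build-and-deploy \
  --param repo-url=https://github.com/my-org/service-a \
  --param image-name=service-a \
  --showlog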

Apache Hama on Amazon Elastic MapReduce

I am trying to run Apache Hama on Amazon Elastic MapReduce using the https://github.com/awslabs/emr-bootstrap-actions/tree/master/hama script. However, when trying it out with one master node and two slave nodes, peer.getNumPeers() in the BSP code reports only 1 peer. I suspect that Hama is running in local mode.
Moreover, looking at the configuration described at https://hama.apache.org/getting_started_with_hama.html, my understanding is that the list of all the servers should go in the hama-site.xml file under the property hama.zookeeper.quorum, and also in the groomservers file. However, I wonder whether these are being configured properly by the install script. I would really appreciate it if anyone could point out whether it's a limitation of the script or whether I am doing something wrong.
@Madhura
Hama doesn't always need the groomservers file to run in fully distributed mode.
The groomservers file is only needed when you start the cluster using start-bspd.sh alone. The Hama emr-bootstrap-action instead starts a groom server on each slave node using the hama-daemon.sh script. The code executed in the install script is as follows:
$ ${HAMA_HOME}/bin/hama-daemon.sh --config ${HAMA_HOME}/conf start groom
I think you need to check the EMR logs to see whether they contain any errors.
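One rough way to do that check is to grep the Hama daemon logs on the master and slave nodes, assuming the default Hadoop-style log directory under ${HAMA_HOME}/logs (the exact file names may differ, and EMR also copies logs to the cluster's S3 log bucket if one is configured):

grep -iE "error|exception" ${HAMA_HOME}/logs/*groom*.log
grep -iE "error|exception" ${HAMA_HOME}/logs/*bspmaster*.log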