I am running a Snakemake pipeline on an SGE cluster, and I want Snakemake to submit one job every 10 s. So I tried setting --max-jobs-per-second 0.1, but Snakemake still submits all the jobs at the same time. Does anyone know a solution for this?
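Roughly, the command looks like this (the qsub invocation and job count are simplified placeholders, not my exact command line):

    snakemake --cluster "qsub" --jobs 100 --max-jobs-per-second 0.1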
When running Snakemake on a cluster, jobs get scheduled fine via Slurm. Sometimes one job fails, which eventually stops the Snakemake instance/run after the still-running jobs have completed. To speed this up I have stopped Snakemake (Ctrl+C) and restarted it. What I did not think of is that in this case some jobs from the previous run might still be running on the cluster. Hence it could happen that the same job is started again if no output has been written yet. In that case it could finally lead to the situation where 2 jobs write to the same output file. Or is that prevented by some other bookkeeping in Snakemake that tracks successful completion?
I hope you can follow this explanation. I'm happy for every comment!
In this case it could finally lead to the situation where 2 jobs write to the same output file.
Snakemake should be aware that the previous execution didn't exit cleanly (because of Ctrl+C) and that the jobs that were running at that moment are incomplete or absent. However, Snakemake cannot know that those pending jobs are still running as independent processes.
So yes, I think it can happen that jobs step on each other's toes in what you are doing.
In my opinion, before re-running Snakemake it would be safer to kill the pending jobs and start fresh. (Those that completed before Snakemake was killed are fine, of course.)
Note that there is an option in snakemake that may help you:
--keep-going, -k    Go on with independent jobs if a job fails. (default: False)
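A rough sketch of that cleanup-and-restart sequence, assuming Slurm (the job IDs and cluster options below are placeholders to adapt):

    squeue --user="$USER"            # list jobs left over from the previous run
    scancel <jobid> [<jobid> ...]    # cancel the leftover Snakemake cluster jobs
    # restart, letting Snakemake redo jobs whose output is incomplete
    snakemake --cluster "sbatch" --jobs 100 --keep-going --rerun-incomplete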
I launch a Dataproc cluster and serve Hive on it. Remotely, from any machine, I use PyHive or PyODBC to connect to Hive and do things. It's not just one query; it can be a long session with intermittent queries. (The query itself has issues; I will ask about that separately.)
Even during a single, active query, the operation does not show up as a "Job" (I guess it's YARN) on the dashboard. In contrast, when I "submit" tasks via PySpark, they show up as "Jobs".
Besides the lack of task visibility, I also suspect that, without a Job, the cluster may not reliably detect that a Python client is "connected" to it, so the cluster's auto-delete might kick in prematurely.
Is there a way to "register" a Job to accompany my Python session, and to cancel/delete the job at a time of my choosing? For my case, it would be a "dummy", "nominal" job that does nothing.
Or maybe there's a more proper way to let YARN detect my Python client's connection and create a job for it?
Thanks.
This is not supported right now; you need to submit jobs via the Dataproc Jobs API to make them visible on the Jobs UI page and to have them taken into account by the cluster TTL feature.
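For instance, running a Hive query through the Jobs API instead of a direct PyHive/ODBC connection could look like the sketch below (the query is just a placeholder):

    gcloud dataproc jobs submit hive --cluster="${CLUSTER_NAME}" \
        --execute="SELECT 1"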
If you cannot use the Dataproc Jobs API to execute your actual jobs, then you can submit a dummy Pig job that sleeps for the desired time (5 hours in the example below) to prevent cluster deletion by the max idle time feature:
gcloud dataproc jobs submit pig --cluster="${CLUSTER_NAME}" \
--execute="sh sleep $((5 * 60 * 60))"
I'm running a training job on Google AI Platform, just training a simple tf.Estimator. Is there a way to prevent the whole job from completing while an evaluation task is still running?
I remember someone using Kubeflow on GCP who needed to use the '--stream-logs' flag when submitting an AI Platform training job with the gcloud command (1). Otherwise, the job would get stopped before completion.
According to the documentation, 'with the --stream-logs flag, the job will continue to run after this command exits and must be cancelled with gcloud ai-platform jobs cancel JOB_ID'.
It is worth giving it a try and checking whether, in your case, this flag can also keep the job running instead of terminating it prematurely.
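A sketch of such a submission (job name, package path, region and bucket are placeholders for your own setup):

    gcloud ai-platform jobs submit training my_training_job \
        --stream-logs \
        --module-name=trainer.task \
        --package-path=trainer/ \
        --region=us-central1 \
        --staging-bucket=gs://my-bucket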
If the issue keeps happening with the flag activated, you might want to inspect the logs of the job to better understand the root cause of this behaviour.
My workflow often includes PBS job submissions to a shared cluster that need to wait in the scheduling queue, take over 24 hrs to run, or both. I'd like to run Snakemake in the "background" and get my prompt back while these jobs are running. I know this can be done using tmux, screen, or &, but is there a better way to do this?
I guess submitting a bash wrapper script with the snakemake commands inside is an option but I think I'm lacking some understanding of the workflow.
tmux is the recommended way to execute a Snakemake workflow. It will give you all you need, regardless of whether you are on a cluster or on a compute server.
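A minimal sketch of that pattern (session name and Snakemake options are placeholders):

    tmux new -s smk                          # start a named session
    snakemake --cluster "qsub" --jobs 100    # launch the workflow inside it
    # detach with Ctrl+b d; the workflow keeps running on the login node
    tmux attach -t smk                       # reattach later to check progress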
Is there a plugin, or can I somehow configure Jenkins, so that a job (which is triggered by 3 other jobs) queues until a specified time and only then executes the whole queue?
Our case is this:
We have tests run for 3 branches:
1. each of the 3 build jobs for those branches triggers the same smoke-test-job, which runs immediately
2. each of the 3 build jobs for those branches triggers the same complete-test-job
points 1. and 2. work perfectly fine.
The complete-test-job should queue the tests all day long and only execute them in the evening or at night (starting from a defined time like 6 pm), so that the tests run at night and the job is silent during the day.
Triggering the complete-test-job at a specified time with the newest version is not an option. We absolutely need the trigger from the upstream build job (because of the Promotion plugin, and we do not want to re-run versions that have already been run).
That seems like a rather strange request. Why queue a build if you don't want it now? And if you want a build later, then you shouldn't be triggering it now.
You can use the Jenkins Exclusion plugin. Have your test jobs use a certain resource, and make another job whose task is to "hold" the resource during the day. While the resource is in use, the test jobs won't run.
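As a rough sketch, the "holder" job's build step (placed between the plugin's critical-block steps, as I understand the Exclusion plugin) could simply sleep until the evening; this assumes a shell build step with GNU date, and the 18:00 cutoff is just an example:

    # keep the shared resource occupied until 18:00, then let the job finish
    now=$(date +%s)
    target=$(date -d "18:00" +%s)
    if [ "$now" -lt "$target" ]; then
        sleep $(( target - now ))
    fi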
The problem with this approach: you are going to tie up your executors with queued, non-executing jobs, and there won't be free executors left for other jobs.
Haven't tried it myself, but this sounds like a solution to your problem.