logrotate using lastaction with long running jobs

I'm using logrotate with lastaction in order to do some post-processing on the last rotated file.
The job is very long, and I don't know if this is the reason, but since I added this script the logs are no longer rotated and I have to kill the cron.daily process.
Is it a good idea to run it in the background? And if yes, how do I do that? What are good practices with long-running jobs?
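A minimal sketch of what that could look like, assuming a hypothetical log at /var/log/myapp/app.log and a hypothetical post-processing script /usr/local/bin/process-rotated.sh; detaching the script with nohup ... & lets logrotate (and cron.daily) return immediately instead of blocking on the long job:

/var/log/myapp/app.log {
    daily
    rotate 7
    compress
    lastaction
        # Detach the long-running post-processing so logrotate does not wait for it.
        # The script path and output file are placeholders.
        nohup /usr/local/bin/process-rotated.sh "$1" >> /var/log/myapp/process-rotated.out 2>&1 &
    endscript
}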

Related

Celery - automatic retrying of long running tasks running on crashed worker

I'm using Celery with Django over Redis.
Some of my tasks are quite long, taking about 1 hour. I'm aware that this is suboptimal, and preferably I should use shorter tasks, but this is what I got...
Sometimes the task or the worker crashes. This can happen for various unimportant reasons: the worker crashed, a network problem, a preempted spot instance, an OOM kill, or any other unexpected cause that I can't "catch" and handle.
I want to make sure the task will be retried as fast as possible.
I can use acks_late, but the problem is that this task has a very long timeout (about 90 minutes), which means that if the task starts and the worker crashes after 2 minutes, I will wait another 88 minutes until the task gets back to the queue and starts executing again on another worker.
I'm wondering if there is another solution that will notice the worker has "disappeared" and put the task back in the queue sooner?
You could give task_reject_on_worker_lost a try... It is a tricky one, but have a look...
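A minimal sketch of the relevant settings, assuming a Celery app named app with a Redis broker (the names below are placeholders); acks_late together with reject_on_worker_lost means the message is only acknowledged after the task finishes and is re-queued if the worker process running it is lost:

from celery import Celery

# Placeholder project/broker names.
app = Celery("myproject", broker="redis://localhost:6379/0")

# Acknowledge only after the task finishes ...
app.conf.task_acks_late = True
# ... and re-queue the task if the worker process is lost mid-execution.
app.conf.task_reject_on_worker_lost = True

@app.task(bind=True)
def long_running_task(self, payload):
    # roughly an hour of work here
    return payload

Note that this mainly helps when the worker's parent process survives to notice the lost child (e.g. an OOM kill of the child); if the whole machine disappears, re-delivery with a Redis broker still depends on the visibility timeout.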

Can a snakemake job mistakenly run twice at the same time?

When running Snakemake on a cluster, jobs get scheduled fine via Slurm. Sometimes one job fails, which causes the Snakemake run to stop after the still-running jobs have completed. To speed things up I have stopped Snakemake (Ctrl+C) and restarted it. What I did not think of is that in this case some jobs from the previous run might still be running on the cluster. Hence it could happen that the same job is started again if no output has been written by then, which could finally lead to the situation where two jobs write to the same output file. Or is that prevented by some other bookkeeping Snakemake does to track successful completion?
I hope you can follow this explanation. I'm happy about every comment!
It could finally lead to the situation where two jobs write to the same output file.
Snakemake should be aware that the previous execution didn't exit cleanly (because of Ctrl+C) and that the jobs running at that moment left incomplete or absent output. However, Snakemake cannot know that those pending jobs are still running as independent processes.
So yes, I think it can happen that jobs step on each other's feet in what you are doing.
In my opinion, before re-running Snakemake it would be safer to kill the pending jobs and start fresh. (Those that completed before Snakemake was killed are of course fine.)
Note that there is an option in snakemake that may help you:
--keep-going, -k    Go on with independent jobs if a job fails. (default: False)
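If you do kill the pending jobs and restart, a rough sequence could look like this (the scancel selection and the cluster/submission options are placeholders; --rerun-incomplete asks Snakemake to redo outputs it marked as incomplete):

# Cancel the cluster jobs left over from the interrupted run
# (here simply all jobs of the current user; be more selective if needed).
scancel -u $USER

# Restart the workflow; -k/--keep-going continues with independent jobs if one fails,
# --rerun-incomplete re-runs jobs whose output files are flagged as incomplete.
snakemake --cluster "sbatch ..." --jobs 100 --keep-going --rerun-incomplete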

Is it possible to request more time to a running job in SLURM?

I know it's possible on a queued job to change directives via scontrol, for example
scontrol update jobid=111111 TimeLimit=08:00:00
This only works in some cases, depending on the administrative configuration of the slurm instance (I'm not an admin). Thus this post does not answer my question.
What I'm looking for is a way to ask SLURM to add more time to a job that is already running, if resources are available. Sort of like a nested job request.
Particularly a running job that was initiated with srun on-the-fly.
In https://slurm.schedmd.com/scontrol.html, it is clearly written under TimeLimit:
Only the Slurm administrator or root can increase job's TimeLimit.
So I fear what you want is not possible.
And it makes sense: the scheduler looks at job time limits to decide which jobs to launch, and short jobs can benefit from backfilling to start before longer jobs, so it would be a real mess if users were allowed to change a job's length while it is running. Indeed, how would you define "when resources are available"? A node can sit IDLE for some time because Slurm knows it will need it soon for a large job.
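For completeness, if an administrator is willing to do it for you, the limit of a running job can be extended, and scontrol also accepts relative changes, e.g.:

# Run by a Slurm administrator/root: add two hours to the current limit.
scontrol update jobid=111111 TimeLimit=+02:00:00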

Snakemake + tmux

My workflow often includes PBS job submissions to a shared cluster that need to wait in the scheduling queue, take over 24 hours to run, or both. I'd like to run Snakemake in the 'background' and get my prompt back while these jobs are running. I know this can be done using tmux, screen, or &, but is there a better way to do it?
I guess submitting a bash wrapper script with the snakemake command inside is an option, but I think I'm lacking some understanding of the workflow.
tmux is the recommended way to execute a Snakemake workflow. It gives you everything you need, regardless of whether you are on a cluster or on a compute server.
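A rough sketch of that workflow (the session name and submission options are placeholders):

# Start a named tmux session on the head node
tmux new -s snakemake

# Inside the session, launch the workflow with cluster submission
snakemake --cluster "qsub ..." --jobs 50

# Detach with Ctrl-b d, log out if you like, and later re-attach to check progress
tmux attach -t snakemake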

hadoop: hanging for more than an hour executing a dump of a co-grouping in the Pig grunt shell

It's been more than an hour and the job is still running; I guess it is dead already. What I was doing is very simple:
I have two very small text files that I already imported into HDFS, and I would like to practice some Pig Latin operations. Here is what I did:
1. I created two relations, one for each
2. I created a co-grouping
3. I tried to get a dump
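In Pig Latin those steps could look roughly like this (the paths, delimiter, and schemas below are placeholders):

-- Load the two small files from HDFS
A = LOAD '/user/me/a.txt' USING PigStorage(',') AS (id:int, name:chararray);
B = LOAD '/user/me/b.txt' USING PigStorage(',') AS (id:int, value:chararray);

-- Co-group them on the shared key
C = COGROUP A BY id, B BY id;

-- Trigger execution and print the result
DUMP C;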
The dump has been running for more than an hour now. I checked a few times in the GUI and found that the same job ended and started again: it completed 50%, then started again and is now hanging.
Btw: why the heck is "Dr. Who" showing in this screenshot (top right corner)?
In this case you may want to kill the job; the command is:
yarn application -kill application_xxxxxx
and refresh the queues after the job is killed:
yarn rmadmin -refreshQueues
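If you don't have the application id at hand (the one above is a placeholder), you can list the running applications first:

yarn application -list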