I am trying to build a scheduler, and one of the use cases is to check for job dependencies and delay the execution of a dependent job by a delta, say 20 minutes.
The following is my example:
from apscheduler.schedulers.blocking import BlockingScheduler
import datetime
import logging

sched = BlockingScheduler()
log_file_path = r"path\to\log\file"

@sched.scheduled_job('cron', day_of_week='mon-fri', hour=19, minute=53)
def scheduled_job():
    sched.add_job(run_job, id='demo_reschedule')
    logging.info("Scheduled job at {}".format(datetime.datetime.now()))

def run_job():
    now = datetime.datetime.now()
    now_plus_20 = now + datetime.timedelta(minutes=20)
    sched.reschedule_job('demo_reschedule', trigger='date', run_date=now_plus_20)
    logging.info("Rescheduled job demo_reschedule to new time {}".format(now_plus_20))

if __name__ == "__main__":
    logging.basicConfig(filename=log_file_path,
                        filemode='a',
                        format='[%(asctime)s] %(levelname)-8s %(name)-12s %(message)s',
                        datefmt='%H:%M:%S',
                        level=logging.INFO)
    logging.info("Starting scheduler")
    sched.start()
The run_job method gets added successfully, but when it executes, I receive the following error:
[19:51:14] INFO root Starting scheduler
[19:51:14] INFO apscheduler.scheduler Added job "scheduled_job" to job store "default"
[19:51:14] INFO apscheduler.scheduler Scheduler started
[19:53:00] INFO apscheduler.executors.default Running job "scheduled_job (trigger: cron[day_of_week='mon-fri', hour='19', minute='53'], next run at: 2019-06-07 19:53:00 IST)" (scheduled at 2019-06-07 19:53:00+05:30)
[19:53:00] INFO apscheduler.scheduler Added job "run_job" to job store "default"
[19:53:00] INFO root Scheduled job at 2019-06-07 19:53:00.024887
[19:53:00] INFO apscheduler.executors.default Job "scheduled_job (trigger: cron[day_of_week='mon-fri', hour='19', minute='53'], next run at: 2019-06-10 19:53:00 IST)" executed successfully
[19:53:00] INFO apscheduler.executors.default Running job "run_job (trigger: date[2019-06-07 19:53:00 IST], next run at: 2019-06-07 19:53:00 IST)" (scheduled at 2019-06-07 19:53:00.023890+05:30)
[19:53:00] INFO apscheduler.scheduler Removed job demo_reschedule
[19:53:00] ERROR apscheduler.executors.default Job "run_job (trigger: date[2019-06-07 19:53:00 IST], next run at: 2019-06-07 19:53:00 IST)" raised an exception
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\lib\site-packages\apscheduler\executors\base.py", line 125, in run_job
retval = job.func(*job.args, **job.kwargs)
File "path/to/demo_reschedule.py", line 33, in run_job
sched.reschedule_job('demo_reschedule',jobstore=None,trigger='date',run_date=now_plus_20)
File "C:\ProgramData\Anaconda3\lib\site-packages\apscheduler\schedulers\base.py", line 511, in reschedule_job
return self.modify_job(job_id, jobstore, trigger=trigger, next_run_time=next_run_time)
File "C:\ProgramData\Anaconda3\lib\site-packages\apscheduler\schedulers\base.py", line 483, in modify_job
job, jobstore = self._lookup_job(job_id, jobstore)
File "C:\ProgramData\Anaconda3\lib\site-packages\apscheduler\schedulers\base.py", line 816, in _lookup_job
raise JobLookupError(job_id)
apscheduler.jobstores.base.JobLookupError: 'No job by the id of demo_reschedule was found'
From my observations, after executing the job it is immediately removed from the job store, and maybe that is why it cannot find that job id, but I am not sure.
Kindly advise on how to mitigate this issue.
Appreciate the help :)
When the scheduler finds that a schedule has run its course, it deletes the job. That is what happens here: the job has been submitted to the executor, but it no longer exists in the job store because its trigger has run out of fire times. Try adding a new job instead.
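A minimal sketch of that approach, reusing the sched, datetime and logging objects from the question's script: instead of calling reschedule_job inside run_job, add a brand-new date-triggered job (replace_existing=True just guards against the id still being present).
def run_job():
    now = datetime.datetime.now()
    now_plus_20 = now + datetime.timedelta(minutes=20)
    # the currently executing job has already been removed from the job store
    # (its date trigger has no further fire times), so schedule a fresh one-off
    # job under the same id rather than rescheduling the old one
    sched.add_job(run_job, trigger='date', run_date=now_plus_20,
                  id='demo_reschedule', replace_existing=True)
    logging.info("Scheduled demo_reschedule to run again at {}".format(now_plus_20))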
I am running snakemake using this command:
snakemake --profile slurm -j 1 --cores 1 --cluster-cancel "scancel"
which writes this to standard out:
Submitted job 224 with external jobid 'Submitted batch job 54174212'.
but after I cancel the run with ctrl + c, I get the following error:
scancel: error: Invalid job id Submitted batch job 54174212
What I would guess is that the jobid is 'Submitted batch job 54174212'
and snakemake tries to run scancel 'Submitted batch job 54174212' instead of the expected scancel 54174212. If this is the case, how do I change the jobid to something that works with scancel?
Your suspicion is probably correct: snakemake probably tries to cancel the wrong job (id 'Submitted batch job 54174212').
Check your slurm profile for snakemake which you invoke (standard location ~/.config/snakemake/slurm/config.yaml):
Does it contain the --parsable flag for sbatch?
Forgetting to include that flag is a mistake I have made before; adding it solved the problem for me.
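For reference, a minimal sketch of what the relevant lines in ~/.config/snakemake/slurm/config.yaml could look like (any sbatch options beyond --parsable are whatever your profile already passes through):
# --parsable makes sbatch print only the numeric job id, so snakemake stores
# "54174212" instead of "Submitted batch job 54174212" and scancel works
cluster: "sbatch --parsable"
cluster-cancel: "scancel"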
The overall problem I'm trying to solve is a way to count the number of reads present in each file at every step of a QC pipeline I'm building. I have a shell script I've used in the past which takes in a directory and outputs the number of reads per file. Since I'm looking to use a directory as input, I tried following the format laid out by Rasmus in this post:
https://bitbucket.org/snakemake/snakemake/issues/961/rule-with-folder-as-input-and-output
Here is some example input created earlier in the pipeline:
$ ls -1 cut_reads/
97_R1_cut.fastq.gz
97_R2_cut.fastq.gz
98_R1_cut.fastq.gz
98_R2_cut.fastq.gz
99_R1_cut.fastq.gz
99_R2_cut.fastq.gz
And a simplified Snakefile to first aggregate all reads by creating symlinks in a new directory, and then use that directory as input for the read counting shell script:
import os
configfile: "config.yaml"

rule all:
    input:
        "read_counts/read_counts.txt"

rule agg_count:
    input:
        cut_reads = expand("cut_reads/{sample}_{rdir}_cut.fastq.gz", rdir=["R1", "R2"], sample=config["forward_reads"])
    output:
        cut_dir = directory("read_counts/cut_reads")
    run:
        os.makedir(output.cut_dir)
        for read in input.cut_reads:
            abspath = os.path.abspath(read)
            shell("ln -s {abspath} {output.cut_dir}")

rule count_reads:
    input:
        cut_reads = "read_counts/cut_reads"
    output:
        "read_counts/read_counts.txt"
    shell:
        '''
        readcounts.sh {input.cut_reads} >> {output}
        '''
Everything's fine in the dry-run, but when I try to actually execute it, I get a fairly cryptic error message:
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 agg_count
1 all
1 count_reads
3
[Tue Jun 18 11:31:22 2019]
rule agg_count:
input: cut_reads/99_R1_cut.fastq.gz, cut_reads/98_R1_cut.fastq.gz, cut_reads/97_R1_cut.fastq.gz, cut_reads/99_R2_cut.fastq.gz, cut_reads/98_R2_cut.fastq.gz, cut_reads/97_R2_cut.fastq.gz
output: read_counts/cut_reads
jobid: 2
Job counts:
count jobs
1 agg_count
1
[Tue Jun 18 11:31:22 2019]
Error in rule agg_count:
jobid: 0
output: read_counts/cut_reads
Exiting because a job execution failed. Look above for error message
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /home/douglas/snakemake/scrap_directory/.snakemake/log/2019-06-18T113122.202962.snakemake.log
read_counts/ was created, but there's no cut_reads/ directory inside. No other error messages are present in the complete log. Anyone know what's going wrong or how to receive a more descriptive error message?
I'm also (obviously) fairly new to snakemake, so there might be a better way to go about this whole process. Any help is much appreciated!
... And it was a typo. Typical. os.makedir(output.cut_dir) should be os.makedirs(output.cut_dir). I'm still really curious why snakemake isn't displaying the AttributeError python throws when you try to run this:
AttributeError: module 'os' has no attribute 'makedir'
Is there somewhere this is stored or can be accessed to prevent future headaches?
Are you sure the error message is due to the typo in os.makedir? In this test script os.makedir does throw AttributeError ...:
rule all:
    input:
        'tmp.done',

rule one:
    output:
        x= 'tmp.done',
        xdir= directory('tmp'),
    run:
        os.makedir(output.xdir)
When executed:
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1
Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 all
1 one
2
[Wed Jun 19 09:05:57 2019]
rule one:
output: tmp.done, tmp
jobid: 1
Job counts:
count jobs
1 one
1
[Wed Jun 19 09:05:57 2019]
Error in rule one:
jobid: 0
output: tmp.done, tmp
RuleException:
AttributeError in line 10 of /home/dario/Tritume/Snakefile:
module 'os' has no attribute 'makedir'
File "/home/dario/Tritume/Snakefile", line 10, in __rule_one
File "/home/dario/miniconda3/envs/tritume/lib/python3.6/concurrent/futures/thread.py", line 56, in run
Exiting because a job execution failed. Look above for error message
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /home/dario/Tritume/.snakemake/log/2019-06-19T090557.113876.snakemake.log
Use an f-string to resolve local variables like {abspath}:
for read in input.cut_reads:
    abspath = os.path.abspath(read)
    shell(f"ln -s {abspath} {output.cut_dir}")
Wrap the wildcards that snakemake resolves automatically into double braces inside of f-strings.
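For instance, in a hypothetical rule that has a {sample} wildcard (the input name and target path below are made up), the doubled braces survive the f-string and are then expanded by snakemake when it formats the shell command:
# {abspath} is resolved by the f-string; {{wildcards.sample}} comes out as
# {wildcards.sample} and is filled in by snakemake's shell() at run time
abspath = os.path.abspath(input.cut_read)
shell(f"ln -s {abspath} read_counts/cut_reads/{{wildcards.sample}}_cut.fastq.gz")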
One of my import jobs is failing continuously with the error "Errors encountered during job execution. System error"
Job id is bqjob_r2257fe91a590a308_0000014e3e922faf_1
I can get the following information from the system:
root@6bd6c7262e96:/opt/batch/jobs# bq show -j bqjob_r2257fe91a590a308_0000014e3e922faf_1
Job infra-bedrock-861:bqjob_r2257fe91a590a308_0000014e3e922faf_1
Job Type State Start Time Duration Bytes Processed
---------- --------- ----------------- ---------- -----------------
load FAILURE 29 Jun 09:07:42 0:00:02
Errors encountered during job execution. System error. The error has been logged and we will investigate.
Any idea what can cause this?
I launched two m1.medium nodes on Amazon EC2 to execute my Pig script, but it looks like it failed at the first line (even before MapReduce started): raw = LOAD 's3n://uw-cse-344-oregon.aws.amazon.com/btc-2010-chunk-000' USING TextLoader as (line:chararray);
The error message I got:
2015-02-04 02:15:39,804 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2015-02-04 02:15:39,821 [JobControl] INFO org.apache.hadoop.mapred.JobClient - Default number of map tasks: null
2015-02-04 02:15:39,822 [JobControl] INFO org.apache.hadoop.mapred.JobClient - Setting default number of map tasks based on cluster size to : 20
... (omitted)
2015-02-04 02:18:40,955 [main] WARN org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Ooops! Some job has failed! Specify -stop_on_failure if you want Pig to stop immediately on failure.
2015-02-04 02:18:40,956 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - job job_201502040202_0002 has failed! Stop running all dependent jobs
2015-02-04 02:18:40,956 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2015-02-04 02:18:40,997 [main] ERROR org.apache.pig.tools.pigstats.SimplePigStats - ERROR 2997: Unable to recreate exception from backed error: Error: Java heap space
2015-02-04 02:18:40,997 [main] ERROR org.apache.pig.tools.pigstats.PigStatsUtil - 1 map reduce job(s) failed!
2015-02-04 02:18:40,997 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
HadoopVersion    PigVersion       UserId    StartedAt              FinishedAt             Features
1.0.3            0.11.1.1-amzn    hadoop    2015-02-04 02:15:32    2015-02-04 02:18:40    GROUP_BY
Failed!
Failed Jobs:
JobId Alias Feature Message Outputs
job_201502050202_0002 ngroup,raw,triples,tt GROUP_BY,COMBINER Message: Job failed! Error - # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201502050202_0002_m_000022
Input(s):
Failed to read data from "s3n://uw-cse-344-oregon.aws.amazon.com/btc-2010-chunk-000"
Output(s):
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
I think the code should be fine, since I have successfully loaded other data with the same syntax before, and the link s3n://uw-cse-344-oregon.aws.amazon.com/btc-2010-chunk-000 looks valid. I suspect it might be related to some of my EC2 settings, but I am not sure how to investigate further or narrow down the problem. Does anyone have a clue?
"Java heap space" error message gives some clues. Your files seem to be quite large (~2GB). Make sure that you have enough memory for each task runner to read the data.
The problem was solved by changing my nodes from m1.medium to m3.large; thanks to Nat for the good hint, as he pointed out the error message about Java heap space. I'll update with more details later.
I have what I hope is a pretty simple question, but I'm not super familiar with Sun Grid Engine, so I've been having trouble finding the answer. I am currently submitting jobs to a grid using a bash submission script that generates a command and then executes it. I have read online that if a Sun Grid Engine job exits with a code of 99, it gets resubmitted to the grid. I have written my bash script to do this:
[code to generate command, stores in $command]
$command
STATUS=$?
if [[ $STATUS -ne 0 ]]; then
    exit 99
fi
exit 0
When I submit this job to the grid with a command that I know has a non-zero exit status, the job does indeed appear to be resubmitted; however, the scheduler never sends it to another host. Instead, it just remains stuck in the queue with the status "Rq":
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
2150015 0.55500 GridJob.sh my_user Rq 04/08/2013 17:49:00 1
I have a feeling that this is something simple in the config options for the queue, but I haven't been able to find anything by googling. I've tried submitting this job with the qsub -r y option, but that doesn't seem to change anything.
Thanks!
Rescheduled jobs will only get run in queues that have their rerun attribute (FALSE by default) set to TRUE, so check your queue configuration (qconf -mq myqueue). Without this, your job remains in the rescheduled-pending state indefinitely because it has nowhere to go.
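A quick way to check and change that (a sketch, assuming your queue is named all.q):
# show the current value of the rerun attribute
qconf -sq all.q | grep rerun
# open the queue configuration in an editor and change "rerun FALSE" to "rerun TRUE"
qconf -mq all.q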
IIRC, submitting jobs with qsub -r yes only qualifies them for automatic rescheduling in the event of an exec node crash, and that exiting with status 99 should trigger a reschedule regardless.