Snakemake: exporting d3dag - MissingInputException - snakemake

I want to export the DAG of my workflow in D3.js compatible JSON format:
snakemake.snakemake(snakefile=smfile,
                    dryrun=True,
                    forceall=True,
                    printdag=False,
                    printd3dag=True,
                    keepgoing=True,
                    cluster_config=cluster_config,
                    configfile=configfile,
                    targets=targetfiles)
Unfortunately, it complains about missing input files.
It is right that the files are missing, but I had hoped it would run anyway, especially after setting the keepgoing option to True.
Is there a smart way to export the DAG without the input files?
Thanks,
Jan

--keep-going allows execution of independent jobs when a snakemake job fails; that is, snakemake has to have successfully started running jobs first. In your case it never gets to that stage. I would imagine missing input files prevent creation of the DAG in the first place.
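One workaround sketch (not part of the answer above, and only safe together with dryrun=True): stub the missing inputs with empty placeholder files so that the DAG can be built and the D3 JSON printed. The missing_files list below is hypothetical; the other variables are the ones from the question. Remember to remove the placeholders afterwards so a real run does not mistake them for data.
import os
import snakemake

# Hypothetical list of the inputs the workflow complains about.
missing_files = ["data/sample1.fastq", "data/sample2.fastq"]

for path in missing_files:
    os.makedirs(os.path.dirname(path), exist_ok=True)
    open(path, "a").close()  # empty placeholder; the dry run never reads its contents

snakemake.snakemake(snakefile=smfile,
                    dryrun=True,          # nothing is executed
                    forceall=True,
                    printd3dag=True,
                    cluster_config=cluster_config,
                    configfile=configfile,
                    targets=targetfiles)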


Accessing the --default-remote-prefix within the Snakefile

When I run snakemake on the google life sciences executor, I run something like:
snakemake --google-lifesciences --default-remote-prefix my_bucket_name --preemption-default 10 --use-conda
Now, my_bucket_name is going to get added to all of the input and output paths.
But, for reasons, I need to recreate the full path within the Snakefile code, and therefore I want to be able to access whatever is passed to --default-remote-prefix within the code.
Is there a way to do this?
I want to be able to access whatever is passed to --default-remote-prefix within the code
You can use the workflow object like:
print(workflow.default_remote_prefix) # Will print my_bucket_name in your example

rule all:
    input: ...
I'm not 100% sure whether the workflow object is supposed to be used by the user or whether it's private to snakemake; if it is private, it could be changed in the future without warning. But I think it's ok: I use workflow.basedir all the time to get the directory where the Snakefile sits.
Alternatively, you could parse the sys.argv list, but I think that is more hacky.
Another option:
bucket_name=foo
snakemake --default-remote-prefix $bucket_name --config bucket_name=$bucket_name ...
then use config["bucket_name"] within the code to get the value foo. But I still prefer the workflow solution.

Snakemake: use checksums instead of timestamps?

My project is likely to have instances where input datasets are overwritten but the contents are not changed. Is there a way in Snakemake to check for build changes using checksums instead of timestamps?
For example, SCons checks for build changes in both code and data using md5 hashes (hashes are computed only where timestamps have changed). But I'd much prefer to use Snakemake because of its other killer features.
The desired behavior is similar to the between workflow caching functionality described in the docs. In the docs it says:
There is no need to use this feature to avoid redundant computations within a workflow. Snakemake does this already out of the box.
But all references to this issue point to Snakemake only using timestamps within a normal workflow.
Using the ancient marker or using touch to adjust timestamps won't work for me as that will require too much manual intervention.
I eventually found an old SO post indicating that I could do this by writing my own script to compare checksums and then feeding that into Snakemake, but I'm not sure if this is still the only option.
I'm not aware of a built-in solution in snakemake, but here's how I would go about it.
Say your input data is data.txt. This is the file (or files) that gets overwritten, possibly without changing. Instead of using this file directly in the snakemake rules, you use a cached copy that is overwritten only if the md5 differs between the original and the cache. The check can be done before rule all using standard python code.
Here's an example:
import hashlib
import os
import shutil

def get_md5(path):
    with open(path, 'rb') as fh:
        return hashlib.md5(fh.read()).hexdigest()

input_md5 = get_md5('data.txt')
cache_md5 = get_md5('cache/data.txt') if os.path.exists('cache/data.txt') else None
if input_md5 != cache_md5:
    # This will trigger the pipeline because cache/data.txt is now newer than the output
    os.makedirs('cache', exist_ok=True)
    shutil.copy('data.txt', 'cache/data.txt')

rule all:
    input:
        'stuff.txt'

rule one:
    input:
        'cache/data.txt',
    output:
        'stuff.txt',
EDIT: This version (reusing get_md5 and the imports from above) caches the md5 of the input files so it doesn't need to be recomputed every time. It also saves the timestamp of the input data to a file, so that the md5 of the input is recomputed only if that timestamp is newer than the cached one:
os.makedirs('cache', exist_ok=True)
for datafile in ['data.txt']:  # the input files that may be silently overwritten
    cached = os.path.join('cache', datafile)
    ts_file = datafile + '.timestamp.txt'
    md5_file = cached + '.md5'
    current_input_timestamp = os.path.getmtime(datafile)
    cache_input_timestamp = float(open(ts_file).read()) if os.path.exists(ts_file) else 0.0
    if current_input_timestamp > cache_input_timestamp:
        input_md5 = get_md5(datafile)
        cache_md5 = open(md5_file).read() if os.path.exists(md5_file) else ''
        if input_md5 != cache_md5:
            shutil.copy(datafile, cached)
        open(md5_file, 'w').write(input_md5)
        open(ts_file, 'w').write(str(current_input_timestamp))
# If any input datafile is newer than its cached copy, the pipeline will be triggered
However, this adds complexity to the pipeline so I would check whether it is worth it.

How to properly use batch flag in snakemake to subset DAG

I have a workflow written in snakemake that balloons at one point to a large number of output files. This happens because there are many combinations of my wildcards yielding on the order of 60,000 output files.
With this number of input files, DAG generation is very slow for subsequent rules, to the point of being unusable. In the past, I've solved this issue by iterating with bash loops over subsetted configuration files in which all but one value of a given wildcard is commented out. For example, in my case I had one wildcard (primer) with 12 possible values. By running snakemake iteratively for each value of "primer", the workflow was divided into digestible chunks (~5000 input files each). With this strategy DAG generation is quick and the workflow proceeds well.
Now I'm interested in using the new --batch flag available in snakemake 5.7.4 since Johannes suggested to me on twitter that it should basically do the same thing I did with bash loops and subsetted config files.
However, I'm running into an error that's confusing me. I don't know if this is an issue I should be posting on the github issue tracker or I'm just missing something basic.
The rule my workflow is failing on is as follows:
rule get_fastq_for_subset:
    input:
        fasta="compute-workflow-intermediate/06-subsetted/{sample}.SSU.{direction}.{group}_pyNAST_{primer}.fasta",
        fastq="compute-workflow-intermediate/04-sorted/{sample}.{direction}.SSU.{group}.fastq"
    output:
        fastq=temp("compute-workflow-intermediate/07-subsetted-fastq/{sample}.SSU.{direction}.{group}_pyNAST_{primer}.full.fastq"),
        fastq_revcomp="compute-workflow-intermediate/07-subsetted-fastq/{sample}.SSU.{direction}.{group}_pyNAST_{primer}.full.revcomped.fastq"
    conda:
        "envs/bbmap.yaml"
    shell:
        "filterbyname.sh names={input.fasta} include=t in={input.fastq} out={output.fastq} ; "
        "revcompfastq_according_to_pyNAST.py --inpynast {input.fasta} --infastq {output.fastq} --outfastq {output.fastq_revcomp}"
I set up a target in my all rule to generate the output specified in the rule, then tried running snakemake with the batch flag as follows:
snakemake --configfile config/config.yaml --batch get_fastq_for_subset=1/20 --snakefile Snakefile-compute.smk --cores 40 --use-conda
It then fails with this message:
WorkflowError:
Batching rule get_fastq_for_subset has less input files than batches. Please choose a smaller number of batches.
Now the thing I'm confused about is that the input for this rule should actually be on the order of 60,000 files, yet snakemake appears to be seeing fewer than 20 input files. Perhaps it's only counting 2, i.e. the number of input files specified in the rule definition? But that doesn't really make sense to me...
The source code for snakemake is here, showing how snakemake counts things, but it's a bit beyond me:
https://snakemake.readthedocs.io/en/stable/_modules/snakemake/dag.html
If anyone has any idea what's going on, I'd be very grateful for your help!
Best,
Jesse

How to run rule even when some of its inputs are missing?

In the first step of my process, I am extracting some hourly data from a database. For various reasons, data is sometimes missing for some hours, resulting in missing files. As long as the amount of missing data is not too large, I still want to run some of the rules that depend on that data. When running those rules I will check how much data is missing and then decide whether to raise an error or not.
An example below. The Snakefile:
rule parse_data:
    input:
        "data/1.csv", "data/2.csv", "data/3.csv", "data/4.csv"
    output:
        "result.csv"
    shell:
        "touch {output}"

rule get_data:
    output:
        "data/{id}.csv"
    shell:
        "Rscript get_data.R {output}"
And my get_data.R script:
output <- commandArgs(trailingOnly = TRUE)[1]
if (output == "data/1.csv")
    stop("Some error")
writeLines("foo", output)
How do I force the rule parse_data to run even when some of its inputs are missing? I do not want to force running any other rules when input is missing.
One possible solution would be to generate, for example, an empty file in get_data.R when the query fails. However, in practice I am also using --restart-times 5 when running snakemake, as the query can also fail because of database timeouts. If an empty file were created on failure, this mechanism of retrying the queries would no longer work.
You need data-dependent conditional execution.
Use a checkpoint on get_data. Then you replace parse_data's input with a function that aggregates whatever files do exist; a rough sketch follows below.
(note that I am a Snakemake newbie and am just learning this myself, I hope this is helpful)
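A minimal sketch of that checkpoint pattern, adapted to the example above. Turning get_data into a single checkpoint over the whole data/ directory, hard-coding the four ids, and tolerating individual query failures with || true are assumptions made for illustration, not the only way to structure this.
rule all:
    input:
        "result.csv"

checkpoint get_data:
    output:
        directory("data")
    # Try every query; "|| true" tolerates individual failures so the
    # directory still ends up containing whichever files did succeed.
    shell:
        "mkdir -p {output} && "
        "for id in 1 2 3 4; do Rscript get_data.R {output}/$id.csv || true; done"

def existing_csvs(wildcards):
    # Re-evaluated only after the checkpoint has run: collect the files
    # that were actually produced.
    outdir = checkpoints.get_data.get(**wildcards).output[0]
    ids = glob_wildcards(outdir + "/{id}.csv").id
    return expand(outdir + "/{id}.csv", id=ids)

rule parse_data:
    input:
        existing_csvs
    output:
        "result.csv"
    # This is where the parsing step would count missing hours and decide
    # whether to raise an error.
    shell:
        "touch {output}"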

How to overwrite a parameter from the configfile that is not at the first level in a snakemake call?

I can't figure out the syntax. For example:
snakemake --configfile myconfig.yml --config myparam="new value"
This will overwrite the value of config["myparam"] from the yaml file upon workflow execution.
But what if I want to overwrite config["myparam"]["otherparam"]?
Thanks!
This is currently not possible. A general remark: Note that --config should be used as little as possible, because it defeats the goal of reproducibility and data provenance (you would have to remember the command line with which you invoked snakemake).