Snakemake: use checksums instead of timestamps?

My project is likely to have instances where input datasets are overwritten but the contents are not changed. Is there a way in Snakemake to check for build changes using checksums instead of timestamps?
For example, SCons checks for build changes in both code and data using md5 hashes (hashes are computed only where timestamps have changed). But I'd much prefer to use Snakemake because of its other killer features.
The desired behavior is similar to the between-workflow caching functionality described in the docs, where it says:
There is no need to use this feature to avoid redundant computations within a workflow. Snakemake does this already out of the box.
But all references to this issue point to Snakemake only using timestamps within a normal workflow.
Using the ancient marker or using touch to adjust timestamps won't work for me as that will require too much manual intervention.
I eventually found an old SO post indicating that I could do this by writing my own script to compare checksums and then feeding that into Snakemake, but I'm not sure if this is still the only option.

I'm not aware of a built-in solution in Snakemake, but here's how I would go about it.
Say your input data is data.txt: the file(s) that get overwritten, possibly without changing. Instead of using this file directly in the Snakemake rules, you use a cached copy that is overwritten only if the md5 differs between the original and the cache. The check can be done before rule all using standard Python code.
Here's a pseudo-code example:
import hashlib
import os
import shutil

def get_md5(path):
    with open(path, 'rb') as fh:
        return hashlib.md5(fh.read()).hexdigest()

os.makedirs('cache', exist_ok=True)
input_md5 = get_md5('data.txt')
cache_md5 = get_md5('cache/data.txt') if os.path.exists('cache/data.txt') else None

if input_md5 != cache_md5:
    # This will trigger the pipeline because cache/data.txt becomes newer than the output
    shutil.copy('data.txt', 'cache/data.txt')

rule all:
    input:
        'stuff.txt'

rule one:
    input:
        'cache/data.txt',
    output:
        'stuff.txt',
EDIT: The version below saves the md5 of each cached input file so that it doesn't have to be recomputed every time. It also writes the input data's timestamp to a file, so that the md5 of the input is recomputed only if its timestamp is newer than the stored one:
for datafile in ['data.txt']:  # loop over all input data files
    current_input_timestamp = os.path.getmtime(datafile)
    timestamp_file = datafile + '.timestamp.txt'
    cache_input_timestamp = float(open(timestamp_file).read()) if os.path.exists(timestamp_file) else 0
    if current_input_timestamp > cache_input_timestamp:
        input_md5 = get_md5(datafile)
        md5_file = 'cache/' + datafile + '.md5'
        cache_md5 = open(md5_file).read() if os.path.exists(md5_file) else None
        if input_md5 != cache_md5:
            shutil.copy(datafile, 'cache/' + datafile)
            with open(md5_file, 'w') as fh:
                fh.write(input_md5)
        with open(timestamp_file, 'w') as fh:
            fh.write(str(current_input_timestamp))
# If any input data file is newer than its cached copy, the pipeline will be triggered
However, this adds complexity to the pipeline so I would check whether it is worth it.

Related

SCIP - run (nearly) the same LP on different instances

I have an LP, formulated in the modelling language Zimpl, that I want to run on many instances, which are in different files.
Additionally, I want to change one parameter in this LP.
For a single call, my file test.zpl looks like this:
param FILE := "file1.dat"
param BOUND := 42
[test_body: Rest of LP]
Now I want to change those two parameters. SCIP has the -c option to execute commands, but I cannot find a command that achieves this. All the parameter changes I found affect the algorithm, not the data.
The change command for modifying the problem does not seem to allow adding new parameters/variables.
In the end, I expect the solution to look something like
scip -c "[set my parameters]; read test_body.zpl; optimize; quit"
How do I set these problem parameters?
I am not aware of any commands that support the modification of model parameters as you wish. However, if you don't hardcode the value of param BOUND in the .zpl file (instead, move the value to the .dat file and use a proper read command in the model), then you could proceed as follows:
Make one copy of your data file for each distinct value of param BOUND
Call scip.exe separately with each data file (you could also use a simple batch script; a sketch is shown below)
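For illustration, here is a minimal Python sketch of that batch idea. It assumes the model reads the bound from a data file named bound.dat (that file name and the list of values are placeholders), and that scip (or scip.exe) is on the PATH:

import subprocess

# Hypothetical values of param BOUND to try
bounds = [42, 50, 100]

for bound in bounds:
    # Rewrite the data file that test_body.zpl is assumed to read BOUND from
    with open('bound.dat', 'w') as fh:
        fh.write('%d\n' % bound)
    # Solve the same model against the freshly written data file
    subprocess.run(['scip', '-c', 'read test_body.zpl', '-c', 'optimize', '-c', 'quit'],
                   check=True)

Each run then optimizes the same model with a different bound, matching the two steps above.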

How to properly use batch flag in snakemake to subset DAG

I have a workflow written in snakemake that balloons at one point to a large number of output files. This happens because there are many combinations of my wildcards yielding on the order of 60,000 output files.
With this number of input files, DAG generation for subsequent rules is so slow as to be unusable. In the past, I've worked around this by iterating with bash loops over subsetted configuration files in which all but one value of a given wildcard is commented out. For example, I had one wildcard (primer) with 12 possible values; by running snakemake iteratively for each value of "primer", the workflow was divided into digestible chunks (~5,000 input files each). With this strategy, DAG generation is quick and the workflow proceeds well.
Now I'm interested in using the new --batch flag available in snakemake 5.7.4, since Johannes suggested to me on Twitter that it should do essentially the same thing I was doing with bash loops and subsetted config files.
However, I'm running into an error that's confusing me. I don't know if this is an issue I should be posting on the github issue tracker or I'm just missing something basic.
The rule my workflow is failing on is as follows:
rule get_fastq_for_subset:
    input:
        fasta="compute-workflow-intermediate/06-subsetted/{sample}.SSU.{direction}.{group}_pyNAST_{primer}.fasta",
        fastq="compute-workflow-intermediate/04-sorted/{sample}.{direction}.SSU.{group}.fastq"
    output:
        fastq=temp("compute-workflow-intermediate/07-subsetted-fastq/{sample}.SSU.{direction}.{group}_pyNAST_{primer}.full.fastq"),
        fastq_revcomp="compute-workflow-intermediate/07-subsetted-fastq/{sample}.SSU.{direction}.{group}_pyNAST_{primer}.full.revcomped.fastq"
    conda:
        "envs/bbmap.yaml"
    shell:
        "filterbyname.sh names={input.fasta} include=t in={input.fastq} out={output.fastq} ; "
        "revcompfastq_according_to_pyNAST.py --inpynast {input.fasta} --infastq {output.fastq} --outfastq {output.fastq_revcomp}"
I set up a target in my all rule to generate the output specified in the rule, then tried running snakemake with the batch flag as follows:
snakemake --configfile config/config.yaml --batch get_fastq_for_subset=1/20 --snakefile Snakefile-compute.smk --cores 40 --use-conda
It then fails with this message:
WorkflowError:
Batching rule get_fastq_for_subset has less input files than batches. Please choose a smaller number of batches.
Now the thing I'm confused about is that the input for this rule should actually be on the order of 60,000 files, yet snakemake appears to be seeing fewer than 20 input files. Perhaps it's only counting 2, i.e. the number of input files specified in the rule definition? But that doesn't really make sense to me...
The snakemake source code showing how inputs are counted is here, but it's a bit beyond me:
https://snakemake.readthedocs.io/en/stable/_modules/snakemake/dag.html
If anyone has any idea what's going on, I'd be very grateful for your help!
Best,
Jesse

How to run rule even when some of its inputs are missing?

In the first step of my process, I am extracting some hourly data from a database. For various reasons data is sometimes missing for some hours, resulting in missing files. As long as the amount of missing data is not too large, I still want to run some of the rules that depend on it. When running those rules I will check how much data is missing and then decide whether or not to raise an error.
An example below. The Snakefile:
rule parse_data:
    input:
        "data/1.csv", "data/2.csv", "data/3.csv", "data/4.csv"
    output:
        "result.csv"
    shell:
        "touch {output}"

rule get_data:
    output:
        "data/{id}.csv"
    shell:
        "Rscript get_data.R {output}"
And my get_data.R script:
output <- commandArgs(trailingOnly = TRUE)[1]
if (output == "data/1.csv")
  stop("Some error")
writeLines("foo", output)
How do I force the rule parse_data to run even when some of its inputs are missing? I do not want to force running any other rules when input is missing.
One possible solution would be to have get_data.R generate an empty file when the query fails. However, in practice I am also using --restart-times 5 when running snakemake, because the query can also fail due to database timeouts. If an empty file were created, this mechanism of retrying the queries would no longer work.
You need data-dependent conditional execution.
Use a checkpoint for get_data, then replace parse_data's input with a function that aggregates whatever files do exist.
(Note that I am a Snakemake newbie and am just learning this myself; I hope this is helpful.)
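A minimal sketch of that pattern, assuming get_data.R is adapted to write data/<id>.csv for every hour it can retrieve into the directory it is given and to simply skip the hours it cannot:

import glob
import os

# Checkpoint: fetch whatever hourly files are available into data/
checkpoint get_data:
    output:
        directory("data")
    shell:
        "mkdir -p {output} && Rscript get_data.R {output}"

def present_csvs(wildcards):
    # Evaluated only after the checkpoint has finished, so the glob sees
    # exactly the files that could be retrieved
    outdir = checkpoints.get_data.get().output[0]
    return sorted(glob.glob(os.path.join(outdir, "*.csv")))

rule parse_data:
    input:
        present_csvs
    output:
        "result.csv"
    shell:
        "touch {output}"

Whether too much data is missing can then be checked inside parse_data (for example by counting its input files) before deciding to raise an error.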

Snakemake: exporting d3dag - MissingInputException

I want to export the DAG of my workflow in D3.js compatible JSON format:
snakemake.snakemake(snakefile=smfile,
                    dryrun=True,
                    forceall=True,
                    printdag=False,
                    printd3dag=True,
                    keepgoing=True,
                    cluster_config=cluster_config,
                    configfile=configfile,
                    targets=targetfiles)
Unfortunately, it complains about missing input files.
It is right that the files are missing, but I had hoped it would run anyway, especially after setting the keepgoing option to True.
Is there a smart way to export the DAG without the input files?
Thanks,
Jan
--keep-going allows execution of independent jobs when a snakemake job fails; that is, snakemake has to successfully start running jobs first. In your case it never gets to that stage: I would imagine the missing input files prevent the DAG from being built at all.

Bazel Checkers Support

What options does Bazel provide for creating new targets, or extending existing ones, that call C/C++ code checkers such as
qac
cppcheck
iwyu
?
Do I need to use a genrule or is there some other target rule for that?
Is https://bazel.build/versions/master/docs/be/extra-actions.html my only viable choice here?
In security-critical software industries, such as aviation and automotive, it's very common to use the results of these calls to collect so-called "metric reports".
In these cases, calls to such linters must produce outputs that are further processed by the build actions of these metric-report collectors. For that, I cannot find a useful way of reusing Bazel's "extra actions". Any ideas?
I've written something which uses extra actions to generate a compile_commands.json file used by clang-tidy and other tools, and I'd like to do the same kind of thing for iwyu when I get around to it. I haven't used those other tools, but I assume they fit the same pattern too.
The basic idea is to run an extra action which generates some output for each file (aka C/C++ compilation command), and then find all the output files afterwards (outside of Bazel) and aggregate them. A reasonably complete example is here for reference. Basically, the action listener (written in Python) decodes the extra action proto and extracts the source files, compiler options, etc:
import sys

import extra_actions_base_pb2  # generated from extra_actions_base.proto

# Read the serialized ExtraActionInfo proto that Bazel hands to the listener
action = extra_actions_base_pb2.ExtraActionInfo()
with open(sys.argv[1], 'rb') as f:
    action.MergeFromString(f.read())

# Pull the C++ compile details out of the extension field
cpp_compile_info = action.Extensions[extra_actions_base_pb2.CppCompileInfo.cpp_compile_info]
compiler = cpp_compile_info.tool
options = ' '.join(cpp_compile_info.compiler_option)
source = cpp_compile_info.source_file
output = cpp_compile_info.output_file
print('%s %s -c %s -o %s' % (compiler, options, source, output))
If you give the extra action an output template, then it can write that output to a file. If you give the output files distinctive names, you can find them all in the output tree and merge them together however you want.
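As a rough illustration of that wiring (all target names and paths here are hypothetical, not from the original example), the BUILD side could look roughly like this, attaching the Python listener to every CppCompile action and giving it a per-action output file via out_templates:

# BUILD file sketch; //tools/actions:generate_compile_command is a hypothetical
# py_binary wrapping the listener script shown above
extra_action(
    name = "dump_compile_command",
    tools = ["//tools/actions:generate_compile_command"],
    out_templates = ["$(ACTION_ID).compile_command.json"],
    cmd = "$(location //tools/actions:generate_compile_command) " +
          "$(EXTRA_ACTION_FILE) $(output $(ACTION_ID).compile_command.json)",
)

action_listener(
    name = "compile_command_listener",
    mnemonics = ["CppCompile"],
    extra_actions = [":dump_compile_command"],
)

Building with --experimental_action_listener=//tools/actions:compile_command_listener then leaves one JSON file per compiled source under bazel-out/.../extra_actions/, where a separate script can collect and merge them.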
A more sophisticated option is to use bazel query --output=proto and write code to calculate the extra action output filenames of the targets you're interested in from there. That requires writing more code, but you don't have problems with old output files in the output tree that are accidentally included when aggregating.
FWIW, Aspects are another possibility. However, I think extra actions work acceptably for this.