Can I use snakemake with humans-in-the-loop? - snakemake

I am very curious about snakemake but I'm not sure it fits my use case, because I have humans in the loop.
My process is something like this:
Start with a baseline binary classification model, then:
1. Generate 100 examples near the margin (predicted probability near 0.5).
2. Have humans label those 100 examples.
3. Add the 100 examples to the data set and retrain.
4. Go to step 1.
Thus, it's a form of active learning with humans-in-the-loop.
Is snakemake a good fit for this? Or is the human-in-the-loop confounding the principle of reproducibility? If I should use snakemake, are there any relevant pointers for something similar?

You can achieve this by treating each loop iteration as a distinct set of Snakemake outputs:
rule generate_example:
    output: "examples/{iter}.tsv"
    input: "model/{iter}.tsv"
    wildcard_constraints: iter = r"\d+"

rule build_baseline_model:
    output: "model/0.tsv"

rule build_subsequent_model:
    output: "model/{iter}.tsv"
    input: lambda wc: expand("examples-labelled/{iter}.tsv", iter=range(0, int(wc.iter)))
    wildcard_constraints: iter = r"[1-9]\d*"  # not 0
So, yes, I think Snakemake is a good fit for your process, because it can represent it with reproducibility at each loop iteration.
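To drive the loop, you then request the model for whichever iteration you want next. Since nothing produces examples-labelled/{iter}.tsv, Snakemake stops with a missing-input error until the humans have labelled examples/{iter}.tsv and saved the result under that name. A minimal sketch of a driver target (the iteration count of 5 is just an illustrative assumption):

rule all:
    # Hypothetical driver: ask for the model after five labelling rounds.
    # The labelled files deliberately have no producing rule: humans create
    # examples-labelled/{iter}.tsv from examples/{iter}.tsv between runs.
    input: "model/5.tsv"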


How to efficiently combine N files, two at a time

Based on the discussion in another question.
Some tools will only accept two input files at a time, but the final, merged output requires combining N files. Examples include paste and some BED or VCF tools. Assume a list of samples is present and that the binary operation is associative, i.e. (a+b)+c == a+(b+c). The required merged output must be generated by repeatedly combining input and intermediate files. How can you efficiently merge the files?
The two solutions I will present are to sequentially combine input files and to recursively build intermediate files as a binary tree. For each, consider pasting together a few hundred samples with the following start of a snakefile:
ids = list('abcdefghijklmnopqrstuvwxyz')
samples = expand('{id1}{id2}', id1=ids, id2=ids)  # 676 samples: aa, ab, ac, ..., zz; need not be numbers

rule all:
    input: 'merged.txt'

rule generate_data:
    output: 'sample_{sample}.txt'
    shell:
        'echo {wildcards.sample} > {output}'
Sequential Solution
The sequential solution is fairly easy to remember and understand. You combine files 1 and 2 into a temporary file, then combine the temporary file with file 3, and so on until file N. You can do this with a run directive and shell commands, but I will present it using just a shell directive:
rule merge:
    input:
        first_files=expand('sample_{sample}.txt', sample=samples[:2]),
        rest_files=expand('sample_{sample}.txt', sample=samples[2:])
    output: 'merged.txt'
    shell:
        'paste {input.first_files} > {output} \n'
        'for file in {input.rest_files} ; do '
        'paste {output} $file > {output}_tmp \n'
        'mv {output}_tmp {output} \n'
        'done '
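For reference, the run-directive variant mentioned above could look roughly like this (a sketch; it writes to a hypothetical merged_run.txt so it does not collide with the rule above, and uses snakemake's shell() helper inside run):

rule merge_run_variant:
    input:
        first_files=expand('sample_{sample}.txt', sample=samples[:2]),
        rest_files=expand('sample_{sample}.txt', sample=samples[2:])
    output: 'merged_run.txt'
    run:
        # combine the first two files, then fold in the remaining files one at a time
        shell('paste {input.first_files} > {output}')
        for f in input.rest_files:
            shell('paste {output} {f} > {output}_tmp && mv {output}_tmp {output}')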
Recursive Solution
The general idea behind the recursive solution is to combine files 1 and 2, 3 and 4, 5 and 6, ... in the first step, then combine those intermediate files until one merged file is left. The difficulty is that snakemake evaluates the DAG from the top down, and the number of files may not be evenly divisible by 2.
rule merge:
    """Request final output from merged files 0 to N-1."""
    input:
        f'temp_0_{len(samples)-1}'
    output: 'merged.txt'
    shell:
        'cp {input} {output}'

def merge_intermediate_input(wildcards):
    """From start and end indices, request input files. Raises ValueError when indices are equal."""
    start, end = int(wildcards.start), int(wildcards.end)
    if start == end:  # perform link instead
        raise ValueError
    if start + 1 == end:  # base case
        return expand('sample_{sample}.txt',
                      sample=(samples[start], samples[end]))
    # default
    return [f'temp_{start}_{(start+end)//2}', f'temp_{(start+end)//2+1}_{end}']

rule merge_intermediate:
    """Solve subproblem, producing start to end."""
    input: merge_intermediate_input
    output: temp('temp_{start}_{end}')
    shell:
        'paste {input} > {output}'

def merge_base_input(wildcards):
    """Get input sample from index in list."""
    index = int(wildcards.start)
    return f'sample_{samples[index]}.txt'

rule merge_base:
    """Create temporary symbolic link for input file with start==end."""
    input: merge_base_input
    output: temp('temp_{start}_{start}')
    shell:
        'ln -sr {input} {output}'
merge_intermediate solves the subproblem of producing the merged file covering indices start to end from the two merged files for each half of that range. When start == end, the "merged" file is created as a symbolic link to the corresponding input. When start + 1 == end, the base case merges the input files at those two indices. The recursive solution is clearly more code and more complex, but it can be more efficient for long-running or expensive merge operations.
Runtime Complexity, Performance
Let each of the N files have k lines, and let a merge over n total input lines run in O(f(n)) time. In the sequential solution, the temporary file is rewritten N-1 times and its length grows as 2k, 3k, ..., Nk, for a total of k*N*(N+1)/2 - k lines processed, i.e. roughly O(f(k N^2)) work.
For the recursive solution, in the first layer each pair of files is joined; each operation costs O(f(2k)) and there are N/2 such operations. Next, each pair of the resulting files is merged at a cost of O(f(4k)) with N/4 operations. Overall, on the order of ln(N) layers of merges are required to produce the final output, again with N-1 merge operations in total. The complexity of the entire operation is O(f(k N ln(N))).
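For intuition, here is a small back-of-the-envelope Python sketch (not part of the workflow, and assuming the merge cost is linear in its input size) that counts the total number of lines processed by each strategy:

def sequential_work(n_files, k):
    """Total lines written across all sequential merges: 2k + 3k + ... + Nk."""
    return k * (n_files * (n_files + 1) // 2 - 1)

def recursive_work(n_files, k):
    """Total lines written across all layers of pairwise merges (N assumed a power of two)."""
    total, size, remaining = 0, k, n_files
    while remaining > 1:
        remaining //= 2   # number of merges in this layer
        size *= 2         # each merged file doubles in length
        total += remaining * size
    return total

n, k = 512, 100
print(sequential_work(n, k))  # ~ k * N^2 / 2      -> 13,132,700
print(recursive_work(n, k))   # ~ k * N * log2(N)  ->    460,800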
In terms of overhead, the recursive solution launches N-1 snakemake jobs with any associated calls to the scheduler, activating environments, etc. The sequential version launches a single job and runs everything in a single shell process.
The recursive solution can run with more parallelism; each 'level' of the recursive solution is independent, allowing up to N/2 jobs to run at once. The sequential solution requires the result of each previous step. There is an additional challenge with resource estimation for the recursive solution: the first merges cost O(f(2k)) while the last costs O(f(k N)). The resources could be estimated dynamically or, if the merge step doesn't increase the resulting file size (e.g. intersecting regions), kept similar across levels.
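As one hedged sketch of dynamic estimation, the merge_intermediate rule could scale its memory request with the number of samples covered by the subproblem (the base of 50 MB and 2 MB per sample are made-up numbers):

rule merge_intermediate:
    input: merge_intermediate_input
    output: temp('temp_{start}_{end}')
    resources:
        # rough guess: memory grows with the number of samples in the range
        mem_mb=lambda wildcards: 50 + 2 * (int(wildcards.end) - int(wildcards.start) + 1)
    shell:
        'paste {input} > {output}'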
Conclusion
While the recursive solution offers better asymptotic runtime complexity, it introduces more snakemake jobs, temporary files, and more complex logic. The sequential solution is straightforward and contained in a single job, though it can be roughly N/ln(N) times slower. Quick merge operations can be performed successfully with the sequential solution, and the runtime won't be much worse until N is quite large. However, if merging takes tens of minutes or longer, scales with the input file sizes, and produces outputs longer than the inputs (e.g. cat, paste, and similar), the recursive solution may offer better performance and a significantly shorter wall-clock time.

Way to force snakemake to re-evaluate dag before checkpoint with --list-input-changes?

I wonder if anyone might have some ideas for a small problem I am having with checkpoints. I am trying to produce a workflow that is robust to changes in the sample list, so that the necessary rules re-run if a sample is removed. However, if I have a successful workflow run but then remove some samples, re-running with -R $(snakemake --list-input-changes) does not detect the input changes in rules completed before the most recent DAG re-evaluation after a checkpoint. Does anyone know how to force snakemake to also check for input changes in rules that run before a checkpoint output is produced? A small example of a use case might be:
import pandas as pd

# sample_list is assumed to be defined elsewhere (e.g. loaded from a config)

# index the reference to get a list of all chromosomes
checkpoint index_ref:
    input:
        "reference.fa"
    output:
        "reference.fa.fai"
    shell:
        "samtools faidx {input}"

# make a function to get the chromosome list from the checkpoint output
def get_contigs(wildcards):
    with checkpoints.index_ref.get().output[0].open() as index:
        return pd.read_table(index, header=None, usecols=[0]).squeeze("columns")

# calculate depth per chromosome for a set of samples
rule samtools_depth:
    input:
        expand("{sample}.bam", sample=sample_list)
    output:
        "dataset.chr{chrom}.depth"
    shell:
        "samtools depth -r {wildcards.chrom} {input} > {output}"

# combine depths into single output file
rule combine_depths:
    input:
        lambda wildcards: expand("dataset.chr{chrom}.depth", chrom=get_contigs(wildcards))
    output:
        "dataset.genome.depth"
    shell:
        "cat {input} > {output}"

rule all:
    input:
        "dataset.genome.depth"
In this case, if the workflow is run successfully and then a sample is removed from sample_list, running snakemake -R $(snakemake --list-input-changes) will report that the workflow is already complete. If dataset.genome.depth is removed, the same command suggests that rules samtools_depth and combine_depths both need to be re-run because the former's input files changed, which is what I would prefer to happen even when downstream outputs have already been created.

Recursive input calling in Snakemake rule

I am writing a rule to process some data.
The files in the directory will be something like:
myfirst.trim_1P, myfirst.trim_2P, mysecond.trim_1P, mysecond.trim_2P, ...
rule trim_data:
    input: "{dataset}/{sample}.trim_{r}P"
    output: "{dataset}/{sample}.{r}.fastq"
    params:
        length=14
    shell:
        """
        reformat.sh forcetrimleft="{params.length}" in="{input}" out="{output}"
        """
I have this error:
WorkflowError:
RecursionError: maximum recursion depth exceeded
If building the DAG exceeds the recursion limit
myDir/myfirst.1.trimed.1.trimed.2.trimed.2.trimed.2....
Why does it run in a recursive way if the output is different from the input, and how can I fix it?
This is a wild guess... Maybe the wildcards capture more than they should, since they are interpreted as regular expressions. If {dataset}, {sample} and {r} take a defined list of values, try constraining their scope with:
wildcard_constraints:
    dataset='|'.join([re.escape(x) for x in DATASET]),
    sample='|'.join([re.escape(x) for x in SAMPLE]),
    r='|'.join([re.escape(x) for x in R]),
where DATASET, SAMPLE and R are lists of values (e.g. R = ['1', '2']).
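Put together, a minimal self-contained sketch might look like this (the lists below are hypothetical example values, not taken from the question):

import re

DATASET = ['myDir']                  # hypothetical values
SAMPLE = ['myfirst', 'mysecond']
R = ['1', '2']

# global constraints: each wildcard may only match one of the listed values
wildcard_constraints:
    dataset='|'.join([re.escape(x) for x in DATASET]),
    sample='|'.join([re.escape(x) for x in SAMPLE]),
    r='|'.join([re.escape(x) for x in R])

This stops the wildcards from greedily matching strings that already contain earlier outputs, which is what produces the ever-growing file names shown in the recursion error.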
I had a similar error (below). It turned out to be caused by the fact that the output file name for this rule was identical to that of a different rule. Changing the name of the output file fixed the error.
Code:
rule merge_sample_chr:
    input:
        bcftools=config["bcftools"],
        chr_list=expand(config["vcf_dir_ase"]+"/Eagle/all_{{sample}}_chr{this_chr}_sorted.vcf.gz", this_chr=CHRS)
    params:
        chr_list=expand("I="+config["vcf_dir_ase"]+"/Eagle/all_{{sample}}_chr{this_chr}_sorted.vcf.gz", this_chr=CHRS)
    output:
        vcf=config["vcf_dir_ase"]+"/Eagle/all_{sample}.vcf.gz"
    shell:
        """
        {input.bcftools} concat {input.chr_list} -Oz -o {output}
        """
Error:
WorkflowError:
RecursionError: maximum recursion depth exceeded in __instancecheck__
If building the DAG exceeds the recursion limit, this is likely due to a cyclic dependency. E.g. you might have a sequence of rules that can generate their own input. Try to make the output files more specific. A common pattern is
Problematic file pattern: /path/samplid_chr1_chr1_chr1_chr1_chr1_chr1_chr1_chr1_chr1_chr1_chr1_chr1_chr1_chr1_chr1_chr1_.vcf.gz

Machine Learning Algorithm for multiple output features

I am looking for a machine learning algorithm where I have multiple variables as output. It is something like a vector [A, ..., X], each element of which can have a value of 0 or 1. I have data to train the model with the required input features.
Which algorithm should I use for such a case? With my limited knowledge I know that multi-label classification can solve the problem where one output variable can take multiple values, like color. But this case has multiple output variables, each taking 0 or 1. Please let me know.
It is difficult to say which algorithm is best without more information.
A perceptron, or more generally a neural network with an output layer of multiple binary (threshold-function) neurons, could be a good candidate.
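As one concrete (and purely illustrative) instantiation of that idea, scikit-learn's MLPClassifier supports multi-label targets directly, i.e. a 0/1 matrix with one column per output variable; the data below is random and only demonstrates the shapes:

import numpy as np
from sklearn.neural_network import MLPClassifier

X = np.random.rand(100, 20)                 # hypothetical input features
Y = np.random.randint(0, 2, size=(100, 8))  # 8 binary output variables per sample

clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500)
clf.fit(X, Y)              # multi-label fit: each column of Y is a 0/1 output
print(clf.predict(X[:3]))  # each prediction is a vector of 0/1 values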

Using tensorflow for sequence tagging : Synced sequence input and output

I would like to use TensorFlow for sequence tagging, namely part-of-speech (POS) tagging. I tried to use the same model outlined here: http://tensorflow.org/tutorials/seq2seq/index.md (which outlines a model to translate English to French).
Since in tagging the input sequence and output sequence have exactly the same length, I configured the buckets so that input and output sequences have the same length, and tried to learn a POS tagger using this model on CoNLL 2000.
However, it seems that the decoder sometimes outputs a tagged sequence shorter than the input sequence (the EOS tag seems to appear prematurely).
For example:
He reckons the current account deficit will narrow to only # 1.8 billion in September .
The above sentence is tokenized into 18 tokens, which get padded to 20 (due to bucketing).
When asked to decode the above, the decoder spits out the following:
PRP VBD DT JJ JJ NN MD VB TO VB DT NN IN NN . _EOS . _EOS CD CD
So here it ends the sequence (EOS) after 15 tokens, not 18.
How can I force the model to learn that the decoded sequence should be the same length as the encoded one in my scenario?
If your input and output sequences are the same length, you probably want something simpler than a seq2seq model (since handling different sequence lengths is one of its strengths).
Have you tried just training (word -> tag)?
Note that for something like POS tagging, where there is a clear signal from the tokens on either side, you'll definitely get a benefit from a bidirectional net.
If you want to go all crazy, there would be some fun character-level variants too, where you only emit the tag at the token boundary (the rationale being that POS tagging benefits from character-level features, e.g. things like out-of-vocab names). So many variants to try! :D
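For what it's worth, with a modern Keras API (which did not exist when this answer was written), the word -> tag idea with a bidirectional net could be sketched roughly like this; vocab_size, n_tags and the random data are made-up placeholders:

import numpy as np
import tensorflow as tf

vocab_size, n_tags, seq_len = 5000, 45, 20                 # hypothetical sizes
X = np.random.randint(1, vocab_size, size=(64, seq_len))   # token ids
Y = np.random.randint(0, n_tags, size=(64, seq_len))       # one tag per token

model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 64, mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True)),
    tf.keras.layers.Dense(n_tags, activation="softmax"),   # tag distribution per token
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.fit(X, Y, epochs=1, verbose=0)

Since the input and output have the same length by construction, there is no decoder that can stop early.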
There are various ways of specifying an end-of-sequence parameter. The translate demo uses an <EOS> flag to determine the end of the sequence. However, you can also specify the end of the sequence by counting the number of expected words in the output. In lines 225-227 of translate.py:
# If there is an EOS symbol in outputs, cut them at that point.
if data_utils.EOS_ID in outputs:
    outputs = outputs[:outputs.index(data_utils.EOS_ID)]
You can see that the outputs are cut off whenever <EOS> is encountered. You can easily tweak this to constrain the number of output words. You might also consider getting rid of the <EOS> flag altogether during training, given your application.
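One hedged sketch of such a tweak: instead of (or in addition to) cutting at <EOS>, force the decoded tag sequence to have exactly the length of the source sentence. Here source_tokens and outputs are assumed to be the token-id lists available in translate.py's decoding loop, and padding with data_utils.PAD_ID is just one arbitrary choice:

# Hypothetical post-processing of the decoder output:
outputs = outputs[:len(source_tokens)]                                 # truncate if too long
outputs += [data_utils.PAD_ID] * (len(source_tokens) - len(outputs))   # pad if too short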
I ran into the same problem. In the end I found that the ptb_word_lm.py example in TensorFlow's examples is exactly what we need for tokenization, NER, and POS tagging.
If you look into the details of the language model example, you can see that it treats the input sequence as X and X shifted right by one position as Y. That is exactly what fixed-length sequence labeling needs.
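A tiny sketch of the difference (hypothetical toy data, illustrative tags): the language model pairs each position with the next item, while a tagger pairs each position with its own label; both keep len(X) == len(Y), which is the property that matters here.

# Language-model style (what ptb_word_lm.py does):
sequence = list("snakemake")
X_lm = sequence[:-1]   # input at time t
Y_lm = sequence[1:]    # target at time t is the next item

# Sequence-labeling style (what POS/NER tagging needs):
tokens = ["He", "reckons", "the", "current", "account"]
tags   = ["PRP", "VBZ",    "DT",  "JJ",      "NN"]   # illustrative labels
X_tag, Y_tag = tokens, tags   # aligned, same length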