Recursive input calling in Snakemake rule - snakemake

I am writing a rule to process some data.
The files in the directory will be something like:
myfirst.trim_1P, myfirst.trim_2P, mysecond.trim_1P, mysecond.trim_2P, ...
rule trim_data:
    input: "{dataset}/{sample}.trim_{r}P"
    output: "{dataset}/{sample}.{r}.fastq"
    params:
        length=14
    shell:
        """
        reformat.sh forcetrimleft="{params.length}" in="{input}" out="{output}"
        """
I have this error:
WorkflowError:
RecursionError: maximum recursion depth exceeded
If building the DAG exceeds the recursion limit
myDir/myfirst.1.trimed.1.trimed.2.trimed.2.trimed.2....
Why does it run recursively if the output is different from the input, and how can I fix it?

This is a wild guess... Maybe the wildcards capture more than they should, since they are interpreted as regular expressions. If {dataset}, {sample} and {r} take a defined list of values, try constraining their scope with:
wildcard_constraints:
    dataset='|'.join([re.escape(x) for x in DATASET]),
    sample='|'.join([re.escape(x) for x in SAMPLE]),
    r='|'.join([re.escape(x) for x in R]),
Where DATASET, SAMPLE and R are lists of values (e.g. R= ['1', '2'])
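For concreteness, here is a minimal sketch of what that could look like at the top of the Snakefile; the list values below are made up from the file names in the question, so adapt them to your data:

import re

# hypothetical value lists inferred from the example file names
DATASET = ["myDir"]
SAMPLE = ["myfirst", "mysecond"]
R = ["1", "2"]

# restrict each wildcard to the exact values above
wildcard_constraints:
    dataset='|'.join(re.escape(x) for x in DATASET),
    sample='|'.join(re.escape(x) for x in SAMPLE),
    r='|'.join(re.escape(x) for x in R),

With the wildcards pinned to these exact values, {sample} can no longer swallow parts of previously generated file names, which is what makes a rule appear to consume its own output.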

I had a similar error (below). It turned out to be caused by the fact that the output file name of this rule was identical to that of a different rule. Changing the name of the output file fixed the error.
Code:
rule merge_sample_chr:
    input:
        bcftools=config["bcftools"],
        chr_list=expand(config["vcf_dir_ase"]+"/Eagle/all_{{sample}}_chr{this_chr}_sorted.vcf.gz", this_chr=CHRS)
    params:
        chr_list=expand("I="+config["vcf_dir_ase"]+"/Eagle/all_{{sample}}_chr{this_chr}_sorted.vcf.gz", this_chr=CHRS)
    output:
        vcf=config["vcf_dir_ase"]+"/Eagle/all_{sample}.vcf.gz"
    shell:
        """
        {input.bcftools} concat {input.chr_list} -Oz -o {output}
        """
Error:
WorkflowError:
RecursionError: maximum recursion depth exceeded in __instancecheck__
If building the DAG exceeds the recursion limit, this is likely due to a cyclic dependency. E.g. you might have a sequence of rules that can generate their own input. Try to make the output files more specific. A common pattern is
Problematic file pattern: /path/samplid_chr1_chr1_chr1_chr1_chr1_chr1_chr1_chr1_chr1_chr1_chr1_chr1_chr1_chr1_chr1_chr1_.vcf.gz
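For example, a rename along these lines would do it; the _merged suffix is an arbitrary illustration, what matters is that the merged-output pattern no longer matches the chromosome-level file names:

rule merge_sample_chr:
    input:
        bcftools=config["bcftools"],
        chr_list=expand(config["vcf_dir_ase"]+"/Eagle/all_{{sample}}_chr{this_chr}_sorted.vcf.gz", this_chr=CHRS)
    params:
        chr_list=expand("I="+config["vcf_dir_ase"]+"/Eagle/all_{{sample}}_chr{this_chr}_sorted.vcf.gz", this_chr=CHRS)
    output:
        # the extra "_merged" keeps this pattern from matching the
        # chromosome-level vcf names, so the DAG can no longer recurse
        vcf=config["vcf_dir_ase"]+"/Eagle/all_{sample}_merged.vcf.gz"
    shell:
        """
        {input.bcftools} concat {input.chr_list} -Oz -o {output}
        """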


Can I use snakemake with humans-in-the-loop?

I am very curious about snakemake but I'm not sure it fits my use case, because I have humans in the loop.
My process is something like this:
Start with a baseline binary classification model
Generate 100 examples near the margin (predicted probability near 0.5)
Have humans label those 100 examples.
Add the 100 examples to the data set and retrain.
Go to step 1.
Thus, it's a form of active learning with humans in the loop.
Is snakemake a good fit for this? Or is the human-in-the-loop confounding the principle of reproducibility? If I should use snakemake, are there any relevant pointers for something similar?
You can achieve this by treating each loop iteration as a distinct Snakemake output:
rule generate_example:
    output: "examples/{iter}.tsv"
    input: "model/{iter}.tsv"
    wildcard_constraints: iter = r"\d+"

rule build_baseline_model:
    output: "model/0.tsv"

rule build_subsequent_model:
    output: "model/{iter}.tsv"
    input: lambda wc: expand("examples-labelled/{iter}.tsv", iter=range(0, int(wc.iter)))
    wildcard_constraints: iter = r"[1-9]\d*"  # not 0
So, yes, I think Snakemake is a good fit for your process, because it can represent it reproducibly at each loop iteration.
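One detail the snippet leaves implicit is how the human-labelled files enter the DAG. A minimal sketch of one way to do it, assuming the labelled files are dropped by hand into a hypothetical examples-labelled-by-humans/ directory (none of these names come from the rules above):

# Hypothetical hand-off rule: humans take examples/{iter}.tsv, label it,
# and save the result into examples-labelled-by-humans/; this rule only
# copies it to the path build_subsequent_model expects.
rule collect_labels:
    input: "examples-labelled-by-humans/{iter}.tsv"
    output: "examples-labelled/{iter}.tsv"
    wildcard_constraints: iter = r"\d+"
    shell: "cp {input} {output}"

Requesting model/3.tsv then pulls in examples-labelled/0.tsv through examples-labelled/2.tsv, so each round of the loop is driven simply by asking Snakemake for the next model file.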

How to efficiently combine N files, two at a time

Based on the discussion in another question.
Some tools will only accept two input files at a time, but the final, merged output requires merging N output files. Examples include paste and some bed or vcf tools. Assume a list of samples is present and the binary operation is associative, (a+b)+c == a+(b+c). The required, merged output must be generated by repeatedly combining input and intermediate files. How can you efficiently merge the files?
The two solutions I will present are to sequentially combine input files and to recursively build intermediate files as a binary tree. For each, consider pasting together a few hundred samples with the following start of a snakefile:
ids = list('abcdefghijklmnopqrstuvwxyz')
samples = expand('{id1}{id2}', id1=ids, id2=ids)  # 676 samples, need not be numbers
# aa, ab, ac, ..., zz

rule all:
    input: 'merged.txt'

rule generate_data:
    output: 'sample_{sample}.txt'
    shell:
        'echo {wildcards.sample} > {output}'
Sequential Solution
The sequential solution is fairly easy to remember and understand: combine files 1 and 2 into a temporary file, then combine the temporary file with file 3, and so on until file N. You could do this with a run directive and shell commands, but I will present it as just a shell directive:
rule merge:
    input:
        first_files=expand('sample_{sample}.txt', sample=samples[:2]),
        rest_files=expand('sample_{sample}.txt', sample=samples[2:])
    output: 'merged.txt'
    shell:
        'paste {input.first_files} > {output} \n'
        'for file in {input.rest_files} ; do '
        'paste {output} $file > {output}_tmp \n'
        'mv {output}_tmp {output} \n'
        'done '
Recursive Solution
The general idea behind the recursive solution is to combine files 1 and 2, 3 and 4, 5 and 6, ... in the first step, then keep combining the resulting intermediate files until only one merged file is left. The difficulty is that Snakemake evaluates the DAG from the top down and the number of files may not be evenly divisible by 2.
rule merge:
    """Request final output from merged files 0 to N-1."""
    input:
        f'temp_0_{len(samples)-1}'
    output: 'merged.txt'
    shell:
        'cp {input} {output}'

def merge_intermediate_input(wildcards):
    """From start and end indices, request input files. Raises ValueError when indices are equal."""
    start, end = int(wildcards.start), int(wildcards.end)
    if start == end:  # perform link instead
        raise ValueError
    if start + 1 == end:  # base case
        return expand('sample_{sample}.txt',
                      sample=(samples[start], samples[end]))
    # default
    return [f'temp_{start}_{(start+end)//2}', f'temp_{(start+end)//2+1}_{end}']

rule merge_intermediate:
    """Solve subproblem, producing start to end."""
    input: merge_intermediate_input
    output: temp('temp_{start}_{end}')
    shell:
        'paste {input} > {output}'

def merge_base_input(wildcards):
    """Get input sample from index in list."""
    index = int(wildcards.start)
    return f'sample_{samples[index]}.txt'

rule merge_base:
    """Create temporary symbolic link for input file with start==end."""
    input: merge_base_input
    output: temp('temp_{start}_{start}')
    shell:
        'ln -sr {input} {output}'
merge_intermediate solves the subproblem of producing the merged file for indices start to end from the two merged files obtained by splitting the range halfway. When start == end, the merged file is created as a symbolic link. When start + 1 == end, the base case merges the input files at those two indices. The recursive solution is clearly more code and more complex, but it can be more efficient for long-running or complex merge operations.
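To make the recursion concrete, here is a small standalone Python sketch (not part of the Snakefile) that mirrors the splitting logic of merge_intermediate_input and prints which temp files would be requested for a hypothetical run with five samples:

# Standalone illustration, assuming five samples indexed 0..4.
def show_splits(start, end, depth=0):
    indent = '  ' * depth
    if start == end:
        print(f"{indent}temp_{start}_{start}  (symlink to sample {start})")
        return
    if start + 1 == end:
        print(f"{indent}temp_{start}_{end}  (paste samples {start} and {end})")
        return
    mid = (start + end) // 2
    print(f"{indent}temp_{start}_{end}")
    show_splits(start, mid, depth + 1)
    show_splits(mid + 1, end, depth + 1)

show_splits(0, 4)
# temp_0_4
#   temp_0_2
#     temp_0_1  (paste samples 0 and 1)
#     temp_2_2  (symlink to sample 2)
#   temp_3_4  (paste samples 3 and 4)

The uneven split (three samples on the left branch, two on the right) shows how the halfway split copes with a count that is not a power of two.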
Runtime Complexity, Performance
Let each of the N files have k lines and let the merge operation have runtime complexity O(f(n)) for input of total length n. In the sequential solution, the temporary file is rebuilt N-1 times and its length grows as 2k, 3k, ..., for a total of k*N*(N+1)/2 - k lines processed, i.e. ~O(f(k N^2)).
For the recursive solution, in the first layer each pair of files is joined. Each operation costs O(f(2k)) and there are N/2 such operations. Next, each pair of the resulting files is merged, at a cost of O(f(4k)) with N/4 operations. Overall, ln(N) layers of merges are required to produce the final output, again with N-1 merge operations in total. The complexity of the entire operation is O(f(k N ln(N))).
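As a rough sanity check on those totals, a few lines of plain Python compare the two growth rates for the 676 samples above. This is an illustration only: it pairs adjacent files rather than splitting at the midpoint as the Snakefile does, and it counts lines written rather than wall-clock time.

def sequential_cost(N, k):
    # the temporary file is rewritten with 2k, 3k, ..., Nk lines
    return sum(k * m for m in range(2, N + 1))

def recursive_cost(N, k):
    # each layer pastes pairs of files; one paste costs the size of the file it writes
    total, sizes = 0, [k] * N
    while len(sizes) > 1:
        nxt = []
        for i in range(0, len(sizes), 2):
            if i + 1 < len(sizes):
                merged = sizes[i] + sizes[i + 1]
                total += merged              # lines written by this paste
                nxt.append(merged)
            else:
                nxt.append(sizes[i])         # odd file out, carried to the next layer
        sizes = nxt
    return total

print(sequential_cost(676, 1))   # 228825 line-writes, growing like N^2
print(recursive_cost(676, 1))    # 6548 line-writes, growing like N*log2(N)
# roughly a 35x difference for this example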
In terms of overhead, the recursive solution launches N-1 Snakemake jobs, with the associated calls to the scheduler, environment activation, etc. The sequential version launches a single job and runs everything in one shell process.
The recursive solution can run with more parallelism: each 'level' of the recursion is independent, allowing up to N/2 jobs to run at once, whereas the sequential solution has to wait for the result of each previous step. There is an additional challenge with resource estimation for the recursive solution: the first merges cost O(f(2k)) while the last costs O(f(k N)). The resources could be estimated dynamically, as sketched below, or, if the merge step doesn't increase the resulting file size (e.g. intersecting regions), kept the same for every job.
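For the dynamic estimation, one option in recent Snakemake versions is to derive resources from the combined input size; a sketch, where the 2x factor and the 100 MB floor are arbitrary placeholders rather than measured values:

rule merge_intermediate:
    """As above, but with memory scaled to the combined input size."""
    input: merge_intermediate_input
    output: temp('temp_{start}_{end}')
    resources:
        # input.size_mb is the combined size of the input files;
        # multiplier and floor are placeholders to adapt to your tool
        mem_mb=lambda wildcards, input: int(max(2 * input.size_mb, 100))
    shell:
        'paste {input} > {output}'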
Conclusion
While the recursive solution offers better asymptotic runtime complexity, it introduces more Snakemake jobs, temporary files, and more complex logic. The sequential solution is straightforward and contained in a single job, though it could be N/ln(N) times slower; quick merge operations can still be performed with it and the runtime won't be much worse until N is quite large. However, if merging takes tens of minutes or longer, depends on the input file sizes, and produces outputs longer than the inputs (e.g. cat, paste, and similar), the recursive solution may offer better performance and a significantly shorter wall-clock time.

Way to force snakemake to re-evaluate dag before checkpoint with --list-input-changes?

I wonder if anyone has some ideas for a small problem I am having with checkpoints. I am trying to produce a workflow that is robust to changes in the sample list, so that the necessary rules re-run if a sample is removed. However, if I have a successful workflow run and then remove some samples, re-running with -R $(snakemake --list-input-changes) does not detect the input changes in rules completed before the most recent DAG re-evaluation after a checkpoint. Does anyone know how to force Snakemake to also check for input changes in rules that happen before a checkpoint output is produced? A small example of a use case might be:
import pandas as pd

# index the reference to get a list of all chromosomes
checkpoint index_ref:
    input:
        "reference.fa"
    output:
        "reference.fa.fai"
    shell:
        "samtools faidx {input}"

# make a function to get the chromosome list from the checkpoint output
def get_contigs(wildcards):
    with checkpoints.index_ref.get().output[0].open() as index:
        return pd.read_table(index, header=None, usecols=[0]).squeeze("columns")

# calculate depth per chromosome for a set of samples
rule samtools_depth:
    input:
        expand("{sample}.bam", sample=sample_list)
    output:
        "dataset.chr{chrom}.depth"
    shell:
        "samtools depth -r {wildcards.chrom} {input} > {output}"

# combine depths into a single output file
rule combine_depths:
    input:
        lambda wildcards: expand("dataset.chr{chrom}.depth", chrom=get_contigs(wildcards))
    output:
        "dataset.genome.depth"
    shell:
        "cat {input} > {output}"

rule all:
    input:
        "dataset.genome.depth"
In this case, if the workflow is run successfully and a sample is then removed from sample_list, running snakemake -R $(snakemake --list-input-changes) will report that the workflow is already complete. If dataset.genome.depth is removed, the same command will instead suggest that rules samtools_depth and combine_depths both need to be re-run due to the former's input file changes, which is what I would prefer to happen even if the downstream outputs already exist.

Tensorflow bert tokenize unknown words

I am currently doing the following TF tutorial: https://www.tensorflow.org/tutorials/text/solve_glue_tasks_using_bert_on_tpu
Testing the outputs of the tokenize function on different sentences, I wonder what happens when tokenizing unknown words.
Loading the model:
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # registers the ops needed by the preprocessing model

bert_model_name = 'bert_en_uncased_L-12_H-768_A-12'
tfhub_handle_encoder = 'https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/3'
tfhub_handle_preprocess = 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3'
bert_preprocess = hub.load(tfhub_handle_preprocess)
Tokenizing sentence/word:
tok = bert_preprocess.tokenize(tf.constant(['Tensorsss bla']))
print(tok)
# Output:
<tf.RaggedTensor [[[23435, 4757, 2015], [1038, 2721]]]>
Shouldn't every word be tokenized to a single token? Those are obviously made-up words, but I am wondering what happens when you encode them to fixed-length vectors.
Also, how does the tokenizer turn the made-up words into 3 different tokens? Does it split the unknown words into different known parts?
The default cache location for the tensorflow/bert_en_uncased_preprocess/3 model is /tmp/tfhub_modules/602d30248ff7929470db09f7385fc895e9ceb4c0 (more on caching). In the assets directory you'll find vocab.txt, which is the vocabulary that was used. You can use this file to look up which token the token-id i corresponds to by looking at line i+1 of the file, i.e.
sed '23436q;d' /tmp/tfhub_modules/602d30248ff7929470db09f7385fc895e9ceb4c0/assets/vocab.txt
> tensor
Doing that for all token-ids returns
[tensor, ##ss, ##s], [b, ##la]
As you can see, this confirms your theory that words are split into different known parts. More details on the exact algorithm can be found in Subword tokenizers.
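If you prefer to do the lookup from Python instead of sed, a small sketch; the module directory below is the cache path quoted above and may differ on your machine:

# Map the token ids from the RaggedTensor back to vocabulary entries.
vocab_path = "/tmp/tfhub_modules/602d30248ff7929470db09f7385fc895e9ceb4c0/assets/vocab.txt"

with open(vocab_path, encoding="utf-8") as f:
    vocab = [line.rstrip("\n") for line in f]   # token-id i is at list index i

token_ids = [[23435, 4757, 2015], [1038, 2721]]
print([[vocab[i] for i in ids] for ids in token_ids])
# [['tensor', '##ss', '##s'], ['b', '##la']]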

Snakemake 'MissingOutputException'

Input: A Snakefile that uses the SSP software to calculate various quality metrics for sequencing data. The input to SSP is a BAM file.
sample1.sorted.bam
Output: Various files, but the only one I care about is a file named {prefix}.stats.txt.
sample1.stats.txt
Snakefile: ($SCIF_DATA = /scif/data)
configfile: "config.yaml"
workdir: "/scif/data"

# define samples
SAMPLES, = glob_wildcards("raw_data/{sample}.fastq.gz")

rule all:
    input:
        expand("processed_data/qc/{sample}/{sample}.stats.txt", sample=SAMPLES),

rule quality_metrics:
    input:
        "processed_data/{sample}.sorted.bam"
    params:
        prefix="{sample}",
        gt="raw_data/hg38.chrom.sizes"
    output:
        "processed_data/qc/{sample}/{sample}.stats.txt"
    shell:
        "scif run ssp '-i $SCIF_DATA/{input} -o {params.prefix} --gt {params.gt} -p 50 --odir $SCIF_DATA/{params.prefix}'"
When I run ssp '-i sample1.sorted.bam -o sample1 --gt {params.gt} -p 50 --odir sample1' on the terminal, I get the correct output:
{path}/sample1/sample1.stats.txt
However when I run my snakemake workflow, I am getting the following error:
Waiting at most 5 seconds for missing files.
MissingOutputException in line 58 of /scif/data/Snakefile:
Missing files after 5 seconds:
processed_data/qc/THP-1_PU1-cMyc_PU1_sc_S40_R1_001/THP-1_PU1-cMyc_PU1_sc_S40_R1_001.stats.txt
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Will exit after finishing currently running jobs.
Shutting down, this might take some time.
Increasing the latency wait time does not help.
Any ideas?
I think you are missing part of the output path in the prefix
params:
    prefix="processed_data/qc/{sample}"