Does Snakemake have states during a workflow - snakemake

Does Snakemake support states in its pipelines, meaning the current run can be changed according to, e.g., the last 10 runs?
For example: data is being processed, and if the current value is greater than X and at least 5 of the last 10 values were also greater than X, then I want the workflow to branch differently; otherwise it should continue normally.

You could potentially use a slightly hacky workaround with checkpoints and multiple snakefiles to achieve conditional execution of rules.
For example, a first snakefile that includes a checkpoint, i.e. a rule that waits for the execution of previous rules and only gets evaluated then. Here you could check your conditions against the current pipeline and previous results. For the example code I'm just using a random number to determine what the checkpoint does.
rule all:
    input:
        "random_number.txt",
        "next_step.txt"

rule random_number:
    output: "random_number.txt"
    run:
        import numpy as np
        r = np.random.choice([0, 1])
        with open(output[0], 'w') as fh:
            fh.write(f"{r}")

checkpoint next_rule:
    input: "random_number.txt"  # make the checkpoint wait for the previous rule
    output: "next_step.txt"
    run:
        # read random number
        with open("random_number.txt", 'r') as rn:
            num = int(rn.read())
        print(num)
        if num == 0:
            with open(output[0], 'w') as fh:
                fh.write("case_a")
        elif num == 1:
            with open(output[0], 'w') as fh:
                fh.write("case_b")
        else:
            exit(1)
Then you could have a second snakefile with a conditional rule all, i.e. a list of output files that depends on the result of the first pipeline.
with open("next_step.txt", 'r') as fh:
case = fh.read()
outputs = []
if case == "case_a":
outputs = ["output_a_0.txt", "output_a_1.txt"]
if case == "case_b":
outputs = ["output_b_0.txt", "output_b_1.txt"]
rule all:
input:
outputs
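Coming back to the condition from the original question, the random-number placeholder in the checkpoint could be replaced by a check against previous results. Below is a minimal sketch of the checkpoint's run block, assuming (hypothetically) that the current value is written to current_value.txt and that the values of previous runs are appended one per line to history.txt; X is whatever threshold applies in your case:

checkpoint next_rule:
    input: "current_value.txt"  # hypothetical file holding the current value
    output: "next_step.txt"
    run:
        X = 42  # threshold from the question; adjust as needed
        with open(input[0]) as fh:
            current = float(fh.read())
        # hypothetical log of previous runs, one value per line; look at the last 10
        with open("history.txt") as fh:
            history = [float(line) for line in fh][-10:]
        # branch if the current value and at least 5 of the last 10 values exceed X
        branch = current > X and sum(v > X for v in history) >= 5
        with open(output[0], 'w') as fh:
            fh.write("case_a" if branch else "case_b")

The two snakefiles can then be run one after the other, e.g. snakemake -s Snakefile1 followed by snakemake -s Snakefile2, so that the second rule all is only evaluated once next_step.txt exists.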

Related

Snakemake: a rule with batched inputs and corresponding outputs

I have the following basic structure of the workflow:
files are downloaded from a remote server,
converted locally and then
analyzed.
One of the analyses is time-consuming, but it scales well if run on multiple input files at a time. The output of this rule is independent of which files are analyzed together as a batch, as long as they all share the same set of settings. Upstream and downstream rules operate on individual files, so from the perspective of the workflow this rule is an outlier. Which files are to be run together can be told in advance, although ideally, if some of the inputs failed to be produced along the way, the rule should be run on the reduced set of files.
The following example illustrates the problem:
samples = [ 'a', 'b', 'c', 'd', 'e', 'f' ]

groups = {
    'A': samples[0:3],
    'B': samples[3:6]
}

rule all:
    input:
        expand("done/{sample}.txt", sample = samples)

rule create:
    output:
        "created/{sample}.txt"
    shell:
        "echo {wildcards.sample} > {output}"

rule analyze:
    input:
        "created/{sample}.txt"
    output:
        "analyzed/{sample}.txt"
    params:
        outdir = "analyzed/"
    shell:
        """
        sleep 1 # or longer
        parallel md5sum {{}} \> {params.outdir}/{{/}} ::: {input}
        """

rule finalize:
    input:
        "analyzed/{sample}.txt"
    output:
        "done/{sample}.txt"
    shell:
        "touch {output}"
The rule analyze is the one that produces multiple output files from multiple inputs according to the assignment in groups. The rules create and finalize operate on individual files upstream and downstream, respectively.
Is there a way to implement such logic? I'd like to avoid splitting the workflow to accommodate this irregularity.
Note: this question is not related to the similar-sounding question here.
If I understand correctly, rule analyze takes as input the files created/a.txt, created/b.txt, created/c.txt for group A and gives as output analyzed/a.txt, analyzed/b.txt, analyzed/c.txt. The same goes for group B, so rule analyze runs twice while everything else runs 6 times.
If so, I would make rule analyze output a dummy file signaling that the files in group A (or B, etc.) have been analyzed. Downstream rules take this dummy file as input and will find the corresponding analyzed/{sample}.txt available.
Here's your example:
samples = [ 'a', 'b', 'c', 'd', 'e', 'f' ]

groups = {
    'A': samples[0:3],
    'B': samples[3:6]
}

# Map samples to groups by inverting dict groups
inv_groups = {}
for x in samples:
    for k in groups:
        if x in groups[k]:
            inv_groups[x] = k

rule all:
    input:
        expand("done/{sample}.txt", sample = samples)

rule create:
    output:
        "created/{sample}.txt"
    shell:
        "echo {wildcards.sample} > {output}"

rule analyze:
    input:
        # Collect input for this group (A, B, etc)
        grp = lambda wc: ["created/%s.txt" % x for x in groups[wc.group]]
    output:
        done = touch('created/{group}.done'),
    shell:
        """
        # Code that actually does the job...
        mkdir -p analyzed  # make sure the output directory exists
        for x in {input.grp}
        do
            sn=`basename $x .txt`
            touch analyzed/$sn.txt
        done
        """

rule finalize:
    input:
        # Get the dummy file for this {sample}.
        # If the dummy exists, the corresponding analyzed/{sample}.txt also exists.
        done = lambda wc: 'created/%s.done' % inv_groups[wc.sample],
    output:
        fout = "done/{sample}.txt"
    params:
        fin = "analyzed/{sample}.txt",
    shell:
        "cp {params.fin} {output.fout}"

Spacy - erroneous config.file

While training NER with custom labels, I created a .json file in exactly the same way as stated in the example, but with my own data.
Then I tried to convert it (both train/dev) to the binary format needed for training using the command:
python -m spacy convert train.json ./ -t spacy
which did result in creating 2 files.
The error I got while launching the training process:
[E923] It looks like there is no proper sample data to initialize the Model of component 'ner'. To check your input data paths and annotation, run: python -m spacy debug data config.cfg
The debug command output is the same.
The problem is that there are overlapping entities; for each word there should be only one tag.
A solution can be (code from spacy_convert_script):
import warnings

import srsly
import spacy
from spacy.tokens import DocBin

for f in ["train.json", "dev.json"]:
    nlp = spacy.blank("en")
    db = DocBin()
    for text, annot in srsly.read_json(f):
        doc = nlp.make_doc(text)
        ents = []
        try:
            for start, end, label in annot["entities"]:
                span = doc.char_span(start, end, label=label)
                if span is None:
                    msg = f"Skipping entity [{start}, {end}, {label}] in the following text because the character span '{doc.text[start:end]}' does not align with token boundaries:\n\n{repr(text)}\n"
                    warnings.warn(msg)
                else:
                    ents.append(span)
            doc.ents = ents
            db.add(doc)
        except:
            print(doc.text, ents)  # see which texts cause the problem
            continue
    db.to_disk(f.split('.')[0] + '.spacy')
That would just result in skipping the texts that cause problems. To instead choose one of the overlapping entities:
try:
    x = 0
    for start, end, label in annot["entities"]:
        span = doc.char_span(start, end, label=label)
        if span is None:
            msg = f"Skipping entity [{start}, {end}, {label}] in the following text because the character span '{doc.text[start:end]}' does not align with token boundaries:\n\n{repr(text)}\n"
            warnings.warn(msg)
        else:
            # keep the span only if it starts at or after the end of the last kept span
            if start >= x:
                x = end
                ents.append(span)
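Either way, it can be worth sanity-checking the converted corpus before training. A small sketch, assuming the output files are named train.spacy/dev.spacy as produced above:

import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
db = DocBin().from_disk("train.spacy")       # load the converted corpus
docs = list(db.get_docs(nlp.vocab))
print(len(docs), "docs,", sum(len(d.ents) for d in docs), "entities")
# if the entity count is 0, the ner component has nothing to initialize from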

Tensorflow graph execution ignores equality condition in earger execution mode

I stumbled on some weird TensorFlow behaviour. After adding tf.print everywhere, I narrowed it down to the cause shown in the following code, but I don't know why it happens unless there is either a threading race condition or graph construction omitted the code segment. I don't see why either of those should happen.
# Ragged tensor may have empty rows. So, for tensor arithmetic operations,
# we need to create zero-padded tensors to replace them.
# This implementation only keeps the first entry of each row.
# So, the output tensor is a normal tensor.
def pad_empty_ragged_tensor(ragtensor):
    tf.print("Ragged tensor padding empty tensor...", output_stream=sys.stdout)
    batch_size = ragtensor.shape[0]
    n_rows = ragtensor.row_lengths()
    tf.print("row_lengths(): ", n_rows, output_stream=sys.stdout)
    new_tensor = []
    for i in range(batch_size):
        tf.print("n_rows[i]: ", n_rows[i], output_stream=sys.stdout)
        if tf.equal(n_rows[i], 0):  # Tried n_rows[i] == 0 too
            tf.print("Create zero padded tensor...", output_stream=sys.stdout)
            num_zeros = ragtensor.shape[-1]
            tensor = tf.tile([[0]], [1, num_zeros])
            tensor = tf.cast(tensor, dtype=ragtensor.dtype)
        else:
            tf.print("Take first entry from the row", output_stream=sys.stdout)
            tensor = ragtensor[i, 0:1]
        new_tensor.append(tensor)
    tensor = tf.stack(new_tensor, axis=0)  # [batch, 1, [y, x, h, w]]
    tensor.set_shape([batch_size, 1, ragtensor.shape[-1]])
    tf.print("The padded tensor shape: ", tensor.shape, output_stream=sys.stdout)
    return tensor
Here is a segment of the print trace:
row_lengths(): [1 1 0 ... 1 1 1]
n_rows[i]: 1
Take first entry from the row
n_rows[i]: 1
Take first entry from the row
n_rows[i]: 0
Take first entry from the row
n_rows[i]: 1
Take first entry from the row
As shown, the if tf.equal(n_rows[i], 0): block (I tried n_rows[i] == 0 too) is never entered; execution falls into the else branch every time, even when the equality condition is met. Could anyone hint at what went wrong?
BTW, debugging the TensorFlow runtime is difficult too. Breakpoints in VSCode don't hit once graph execution runs, and tfdbg doesn't work with eager execution either. A suggestion on this would be very beneficial to me too.
My dev env:
OS: Ubuntu18.04
Python: 3.6
Tensorflow-gpu: 1.14
GPU: RTX2070
Cuda: 10.1
cudnn: 7.6
IDE: VS code
Tensorflow mode: Eager execution
Thanks in advance

Python Pandas: Read in External Dataset Into Dataframe Only If Conditions Are Met Using Function Call

Let's say I have an Excel file called "test.xlsx" on my local machine. I can read this data set in using traditional code.
df_test = pd.read_excel('test.xlsx')
However, I want to conditionally read that data set in if a condition is met ... if another condition is met I want to read in a different dataset.
Below is the code I tried using a function:
def conditional_run(x):
    if x == 'fleet':
        eval('''df_test = pd.read_excel('test.xlsx')''')
    elif x != 'fleet':
        eval('''df_test2 = pd.read_excel('test_2.xlsx')''')

conditional_run('fleet')
Below is the error I get:
File "<string>", line 1
df_test = pd.read_excel('0Day Work Items Raw Data.xlsx')
^
SyntaxError: invalid syntax
There probably isn't a reason to use eval in this case; eval only accepts expressions, so an assignment statement like df_test = pd.read_excel(...) raises the SyntaxError shown above. It should be sufficient to conditionally pick the file name and read it once. For example:
def conditional_run(x):
    if x == 'fleet':
        file = "test.xlsx"
    elif x != 'fleet':
        file = "test_2.xlsx"
    df_test = pd.read_excel(file)
    return df_test
conditional_run('fleet')
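If more cases are expected later, a small mapping from condition to file name keeps the branching out of the function body. A sketch along the same lines (the file names are just placeholders):

import pandas as pd

# map each condition to the file it should load; 'test_2.xlsx' is the fallback
FILES = {'fleet': 'test.xlsx'}

def conditional_run(x):
    return pd.read_excel(FILES.get(x, 'test_2.xlsx'))

df_test = conditional_run('fleet')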

Python - Change variable value on exit for next session

I want to change the value of a variable on exit so that on the next run, it remains what was last set. This is a short version of my current code:
def example():
    x = 1
    while True:
        x = x + 1
        print x
On 'KeyboardInterrupt', I want the last value set in the while loop to be a global variable. On running the code next time, that value should be the 'x' in line 2. Is it possible?
This is a bit hacky, but hopefully it gives you an idea that you can better implement in your current situation (pickle/cPickle is what you should use if you want to persist more robust data structures - this is just a simple case):
import sys

def example():
    x = 1
    # Wrap in a try/except block to catch the interrupt
    try:
        while True:
            x = x + 1
            print x
    except KeyboardInterrupt:
        # On interrupt, write to a simple file and exit
        with open('myvar', 'w') as f:
            f.write(str(x))
        sys.exit(0)

# Not sure of your implementation (probably not this :) ), but
# prompt to run the function
resp = raw_input('Run example (y/n)? ')
if resp.lower() == 'y':
    example()
else:
    # If the function isn't to be run, read the variable
    # Note that this will fail if you haven't already written
    # it, so you will have to make adjustments if necessary
    with open('myvar', 'r') as f:
        myvar = f.read()
    print int(myvar)
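For anything more complex than a single number, the pickle approach mentioned above could look like this (a minimal sketch; the state file name is arbitrary):

import os
import pickle

STATE_FILE = 'state.pkl'

def load_state(default):
    # restore the previous value if a state file exists, otherwise start from the default
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE, 'rb') as f:
            return pickle.load(f)
    return default

def save_state(value):
    with open(STATE_FILE, 'wb') as f:
        pickle.dump(value, f)

x = load_state(1)
try:
    while True:
        x = x + 1
        print(x)
except KeyboardInterrupt:
    save_state(x)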
You could save any variables that you want to persist to a text file, then read them back into the script the next time it runs.
Here is a link for reading and writing text files:
http://docs.python.org/2/tutorial/inputoutput.html#reading-and-writing-files
Hope it helps!