chain/dependency of some rules by wildcards - snakemake

I have a particular use case for which I have not found the solution in the Snakemake documentation.
Let's say in a given pipeline I have a portion with 3 rules a, b and c which will run for N samples.
Those rules handle large amounts of data, and because of local storage limits I do not want them to execute at the same time. For instance, rule a produces a large amount of data, then rule c compresses and exports the results.
So what I am looking for is a way to chain those 3 rules for 1 sample/wildcard, and only then execute those 3 rules for the next sample. All of this to make sure the local space is available.
Thanks

I agree that this is a problem that Snakemake still has no solution for. However, there is a possible workaround.
rule all:
    input: expand("a{sample}", sample=[1, 2, 3])

rule a:
    input: "b{sample}"
    output: "a{sample}"

rule b:
    input: "c{sample}"
    output: "b{sample}"

rule c:
    input:
        # wildcard values are strings, so convert before subtracting
        lambda wildcards: f"a{int(wildcards.sample) - 1}"
    output: "c{sample}"
That means that the rule c for sample 2 wouldn't start before the output for rule a for sample 1 is ready. You need to add a pseudo output a0 though or make the lambda more complicated.

So building on Dmitry Kuzminov's answer, the following can work (both with numbers as samples and strings).
The execution order will be a3 > b3 > a1 > b1 > a2 > b2.
I used a different sample order to show it can be made different from the sample list.
samples = [1, 2, 3]
sample_order = ["3", "1", "2"]  # as strings, because wildcard values are strings

def get_previous(wildcards):
    position = sample_order.index(wildcards.sample)
    if position > 0:  # any sample except the first in the order (3 in this case)
        previous_sample = sample_order[position - 1]
        return f'b_out_{previous_sample}'
    else:  # first sample in the order, i.e. 3
        return []  # nothing to wait for; alternatively return a dummy file that is always present, e.g. the Snakefile

rule all:
    input:
        expand("b_out_{S}", S=samples)

rule a:
    input:
        "a_in_{sample}",
        get_previous
    output:
        "a_out_{sample}"

rule b:
    input:
        "a_out_{sample}"
    output:
        "b_out_{sample}"
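
The ordering logic in get_previous can be checked outside Snakemake. Below is a minimal sketch in plain Python, assuming the same b_out_ naming as above. The helper name previous_target is hypothetical and not part of Snakemake.

```python
def previous_target(sample, order, prefix="b_out_"):
    """Return the file that must exist before `sample` may start, or None."""
    order = [str(s) for s in order]      # wildcard values arrive as strings
    position = order.index(str(sample))
    if position == 0:
        return None                      # first sample in the order has no predecessor
    return f"{prefix}{order[position - 1]}"

sample_order = [3, 1, 2]
chain = {str(s): previous_target(s, sample_order) for s in sample_order}
print(chain)
```

Each sample maps to the final output of the sample before it, which is what forces the a > b chain of one sample to finish before the next one starts.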

Related

Can PICT handle independent parameters

Can PICT (Pairwise Independent Combinatorial Testing) handle/model independent parameters?
For example, in the following input a and b are independent, so they should not be combined.
Input in PICT:
a: 1, 2, 3, 4
b: 5, 6, 7, 8
//some line which models the independence: a independent of b
Output, that I would expect:
a b
1 5
2 6
3 7
4 8
This example, with only 2 parameters, of course normally would not make much sense, but it's illustrative.
The same could be applied to 3 parameters (a,b,c), where a is independent of b, but not c.
The main goal of declaring parameters as independent would be to reduce the number of tests.
I read the paper/user guide to PICT, but I didn't find any useful information.
I will answer my own question:
The solution is to define submodels and set the default order from 2 (= pairwise) to 1 (= no combination).
For example parameter a = {a_1, a_2, a_3} should be independent of
b = {b_1, b_2, b_3} and
c = {c_1, ..., c_4}.
Therefore I would expect 12 tests: the 3 x 4 combinations of b and c, with the values of a distributed over them.
Resulting in the following input file:
a: 1,2,3
b: 1,2,3
c: 1,2,3,4
{b,c}#2
{b,c}#2 defines a submodel, consisting of b and c, which uses pairwise combination.
Then run PICT with the option /o:1.
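
To see where the 12 comes from, the counting can be sketched in Python. This only illustrates the arithmetic (every b/c pair once, every value of a at least once). It is not how PICT generates its output, and the assignment of a to rows here is arbitrary.

```python
from itertools import cycle, product

a = [1, 2, 3]
b = [1, 2, 3]
c = [1, 2, 3, 4]

# Order 2 on the submodel {b, c}: every (b, c) pair must appear.
bc_pairs = list(product(b, c))

# Order 1 for a: each value only has to appear once, so it can ride along
# on existing rows instead of multiplying them.
tests = [(a_val, b_val, c_val) for a_val, (b_val, c_val) in zip(cycle(a), bc_pairs)]

print(len(tests))
```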
In PICT you can also define constraints to keep parameters from being combined with the rest. You can add an additional value N/A to a parameter and write something like this:
If [A] = '1' Then [B] = 'N/A';
This is one possible option to handle this case.

Snakemake multiple input files with expand but no repetitions

I'm new to snakemake and I don't know how to figure out this problem.
I've got my rule which has two inputs:
rule test
    input_file1=f1
    input_file2=f2
f1 is in [A{1}$, A{2}£, B{1}€, B{2}¥]
f2 is in [C{1}, C{2}]
The numbers are wildcards that come from an expand call. I need to find a way to pass to the file f1 and f2 a pair of files that match exactly with the number. For example:
f1 = A1
f2 = C1
or
f1 = B1
f2 = C1
I have to avoid combinations such as:
f1 = A1
f2 = C2
I would create a function that does this kind of matching between the files, but it would have to manage input_file1 and input_file2 at the same time. I thought of writing a function that creates a dictionary with the allowed combinations, but how would I "iterate" over it during the expand?
Thanks
Assuming rule test produces an output file named {f1}.{f2}.txt, you need some mechanism that correctly pairs f1 and f2 and creates a list of {f1}.{f2}.txt files.
How you create this list is up to you. expand is just a convenience function for that, and in this case you may want to avoid it.
Here's a super simple example:
import re

fin1 = ['A1$', 'A2£', 'B1€', 'B2¥']
fin2 = ['C1', 'C2']

outfiles = []
for x in fin1:
    for y in fin2:
        ## Here you pair f1 and f2. This is a very trivial way of doing it:
        if y[1] in x:
            outfiles.append('%s.%s.txt' % (x, y))

wildcard_constraints:
    f1 = '|'.join([re.escape(x) for x in fin1]),
    f2 = '|'.join([re.escape(x) for x in fin2]),

rule all:
    input:
        outfiles,

rule test:
    input:
        input_f1 = '{f1}.txt',
        input_f2 = '{f2}.txt',
    output:
        '{f1}.{f2}.txt',
    shell:
        r"""
        cat {input} > {output}
        """
This pipeline will execute the following commands
cat A2£.txt C2.txt > A2£.C2.txt
cat A1$.txt C1.txt > A1$.C1.txt
cat B1€.txt C1.txt > B1€.C1.txt
cat B2¥.txt C2.txt > B2¥.C2.txt
If you touch the starting input files with touch 'A1$.txt' 'A2£.txt' 'B1€.txt' 'B2¥.txt' 'C1.txt' 'C2.txt' you should be able to run this example.
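
The dictionary idea from the question can also work. Below is a minimal sketch in plain Python, assuming the pairing key is the first run of digits in each name. The helper number_of is hypothetical.

```python
import re
from collections import defaultdict

fin1 = ['A1$', 'A2£', 'B1€', 'B2¥']
fin2 = ['C1', 'C2']

def number_of(name):
    """Extract the first run of digits in a file stem (assumed pairing key)."""
    return re.search(r'\d+', name).group()

# Map each number to the f2 names carrying it, e.g. {'1': ['C1'], '2': ['C2']}
f2_by_number = defaultdict(list)
for y in fin2:
    f2_by_number[number_of(y)].append(y)

# Keep only the (f1, f2) pairs whose numbers agree
outfiles = [f'{x}.{y}.txt' for x in fin1 for y in f2_by_number[number_of(x)]]
print(outfiles)
```

The resulting outfiles list can be given to rule all in the same way as the list built with the nested loops.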

in snakemake, for two inputs, expand pairwise combination of a vector

I am new to Snakemake and have a problem with Snakemake's expand function.
First, I need to build a group of combinations and use them as a base, then expand another vector on top of them using the pairwise combinations of its elements.
Let's say the set for the pairwise combination is
setC=["A","B","C","D"]
I get the partial group as follows:
part_group1 = expand("TEMPDIR/{setA}_{setB}_", setA = config["setA"], setB = config["setB"])
Then (if that is OK) I used this partial group to expand another set with its pairwise combinations. But I am not sure how to expand pairwise combinations of setC as seen below. It is obviously not correct; just written to clarify the question. Also, how to input the name of the expanded estimator from shell?
rule get_performance:
    input:
        xdata1 = TEMPDIR + part_group1 +"{setC}.rda"
        xdata2 = TEMPDIR + part_group1 +"{setC}.rda"
        estimator1= {estimator}
    output:
        results = TEMPDIR + "result_" + part_group1 +{estimator}_{setC}_{setC}.txt"
    params:
        Rfile = FunctionDIR + "function.{estimator}.R"
    shell:
        "Rscript {params.Rfile} {input.xdata1} {input.xdata12} {input.estimator1} "
        "{output.results}"
The expand function will return a list of the product of the variables used. For example, if
setA=["A","B"]
setB=["C","D"]
then
expand("TEMPDIR/{setA}_{setB}_", setA = config["setA"], setB = config["setB"])
will give you:
["TEMPDIR/A_C_","TEMPDIR/A_D_","TEMPDIR/B_C_","TEMPDIR/B_D_"]
Your question is not very clear on what you want to achieve but I'll have a guess.
If you want to make pairwise combinations of setC:
import itertools
combiC = list(itertools.combinations(setC, 2))
combiList = list()
for c in combiC:
    combiList.append(c[0] + "_" + c[1])
Then you (probably) want the files:
rule all:
    input: expand(TEMPDIR + "/result_{A}_{B}_estim{estimator}_combi{C}.txt", A=setA, B=setB, estimator=estimators, C=combiList)
I'm adding some words like "estim" and "combi" so that the wildcards do not get confused here. I do not know what the list or set "estimators" is supposed to be, but I suppose you have declared it above.
Then your rule get_performance:
rule get_performance:
    input:
        xdata1 = TEMPDIR + "/{A}_{B}_{firstC}.rda",
        xdata2 = TEMPDIR + "/{A}_{B}_{secondC}.rda"
    output:
        results = TEMPDIR + "/result_{A}_{B}_estim{estimator}_combi{firstC}_{secondC}.txt"
    params:
        Rfile = FunctionDIR + "/function.{estimator}.R"
    shell:
        "Rscript {params.Rfile} {input.xdata1} {input.xdata2} {wildcards.estimator} {output.results}"
Again, this is a guess since you haven't defined all the necessary items.
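
For reference, this is what the pairwise step produces for the setC given in the question, as a quick check in plain Python:

```python
import itertools

setC = ["A", "B", "C", "D"]

# Unordered pairs without repetition: 4 choose 2 = 6
combiList = ["_".join(pair) for pair in itertools.combinations(setC, 2)]
print(combiList)
```

If both orders of each pair are needed (A_B and B_A), itertools.permutations(setC, 2) could be used instead, which gives 12 pairs for this set.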

Two variables with inconsistent names as input for a Snakemake rule

How can I pair up input data for rules in snakemake if the naming isn't consistent and they are all in the same folder?
For example if I want to use each pair of samples as input for each rule:
PT1 T5
S6 T7
S1 T20
In this example I would want to have 3 pairs PT1 & T5, S6 & T7, S1 & T20 so to start with, I would want to create 3 folders:
PT1vsT5
S6vsT7
S1vsT20
And then perform an analysis with manta and output the results into these 3 folders accordingly.
In the following pipeline I want the GERMLINE sample to be the first element in each line (PT1, S6, S1) and TUMOR the second one (T5, T7, T20).
rule all:
    input:
        expand("/{samples_g}vs{samples_t}", samples_g = GERMLINE, samples_t = TUMOR),
        expand("/{samples_g}vs{samples_t}/runWorkflow.py", samples_g = GERMLINE, samples_t = TUMOR),

# Create folders
rule folders:
    output: "./{samples_g}vs{samples_t}"
    shell: "mkdir {output}"

# Manta configuration
rule manta_config:
    input:
        g = BAMPATH + "/{samples_g}.bam",
        t = BAMPATH + "/{samples_t}.bam"
    output:
        wf = "{samples_g}vs{samples_t}/runWorkflow.py"
    params:
        ref = IND,
        out_dir = "{samples_g}vs{samples_t}/runWorkflow.py"
    shell:
        "python configManta.py --normalBam {input.g} --tumorBam {input.t} --referenceFasta {params.ref} --runDir {params.out_dir} "
Could I do it by using as an input a .txt containing the pairs and then use a loop? If so how should I do it? Otherwise how could it be done?
You can generate the list of input or output files "manually" using any appropriate python code. For instance, you could proceed as follows to generate the first of your input lists:
In [1]: GERMLINE = ("PT1", "S6", "S1")
In [2]: TUMOR = ("T5", "T7", "T20")
In [3]: ["/{}vs{}".format(sample_g, sample_t) for (sample_g, sample_t) in zip(GERMLINE, TUMOR)]
Out[3]: ['/PT1vsT5', '/S6vsT7', '/S1vsT20']
So this would be applied as follows:
rule all:
    input:
        ["/{}vs{}".format(sample_g, sample_t) for (sample_g, sample_t) in zip(GERMLINE, TUMOR)],
        ["/{}vs{}/runWorkflow.py".format(sample_g, sample_t) for (sample_g, sample_t) in zip(GERMLINE, TUMOR)],
(Note that I put sample_g and sample_t in singular form, as it sounded more logical in this context, where those variables represent individual sample names and not lists of several samples.)
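
As for the question about a .txt file containing the pairs: a minimal sketch, assuming one whitespace-separated germline/tumor pair per line. The file name pairs.txt is hypothetical, and io.StringIO stands in for it here so the example runs on its own. The targets are written as relative paths, matching the output of rule manta_config.

```python
import io

# Stand-in for open("pairs.txt"); the file name is an assumption
pairs_file = io.StringIO("PT1 T5\nS6 T7\nS1 T20\n")

# One (germline, tumor) tuple per non-empty line
pairs = [tuple(line.split()) for line in pairs_file if line.strip()]

GERMLINE = [g for g, _ in pairs]
TUMOR = [t for _, t in pairs]
targets = ["{}vs{}/runWorkflow.py".format(g, t) for g, t in pairs]
print(targets)
```

Placed at the top of the Snakefile, this would let the pairs live in one file while the rules stay unchanged.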

Organizing data (pandas dataframe)

I have data in the following form:
product/productId B000EVS4TY
1 product/title Arrowhead Mills Cookie Mix, Chocolate Chip, 1...
2 product/price unknown
3 review/userId A2SRVDDDOQ8QJL
4 review/profileName MJ23447
5 review/helpfulness 2/4
6 review/score 4.0
7 review/time 1206576000
8 review/summary Delicious cookie mix
9 review/text I thought it was funny that I bought this pro...
10 product/productId B0000DF3IX
11 product/title Paprika Hungarian Sweet
12 product/price unknown
13 review/userId A244MHL2UN2EYL
14 review/profileName P. J. Whiting "book cook"
15 review/helpfulness 0/0
16 review/score 5.0
17 review/time 1127088000
I want to convert it to a dataframe such that the entries in the 1st column
product/productId
product/title
product/price
review/userId
review/profileName
review/helpfulness
review/score
review/time
review/summary
review/text
are the column headers with the values arranged corresponding to each header in the table.
I still had a tiny doubt about your file, but since both my suggestions are quite similar, I will try to address both the scenarios you might have.
In case your file doesn't actually have the line numbers inside of it, this should do it:
import pandas as pd

filepath = "./untitled.txt"  # you need to change this to your file path
column_separator = r"\s{3,}"  # we'll use a regex, I explain some caveats of this below...
# engine='python' suppresses a warning by pandas
# header=None makes pandas treat all lines as 'data'
df = pd.read_csv(filepath, sep=column_separator, engine="python", header=None)
df = df.set_index(0)  # this takes column '0' and uses it as the dataframe index
df = df.T  # this makes the data look like you were asking (goes from multiple rows + 1 column to multiple columns + 1 row)
df = df.reset_index(drop=True)  # this is just so the first row starts at index '0' instead of '1'
# you could just do the last 3 lines with:
# df = df.set_index(0).T.reset_index(drop=True)
If you do have line numbers, then we just need a few small adjustments:
filepath = "./untitled1.txt"
column_separator = r"\s{3,}"
df = pd.read_csv(filepath, sep=column_separator, engine="python", header=None, index_col=0)
df = df.set_index(1).T.reset_index(drop=True)  # I did all the 3 steps in 1 line, for brevity
In this last case, I would advise changing the file so that every line has a line number (in the example you provided, the numbering starts at the second line). This might be an option controlling how headers are handled when exporting the data from whatever tool you are using.
Regarding the regex, the caveat is that "\s{3,}" treats any run of 3 or more consecutive whitespace characters as the column separator. The problem is that we then depend on the data to find the columns. For instance, if 3 consecutive spaces happen to appear inside any of the values, pandas will raise an exception, since that line will have one more column than the others. One solution could be to increase the count to some other 'appropriate' number, but then we still depend on the data (for instance, with more than 3, in your example, "review/text" would have enough spaces for the two columns to be identified).
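
The caveat can be reproduced with the re module alone. In this small sketch the second line is a made-up example with three spaces inside the value:

```python
import re

separator = r"\s{3,}"

ok_line = "review/score          4.0"
bad_line = "review/summary       Delicious   cookie mix"  # 3 spaces inside the value

print(re.split(separator, ok_line))    # two fields
print(re.split(separator, bad_line))   # three fields: the value was split too

# When parsing lines by hand, splitting only once keeps the value intact
print(re.split(separator, bad_line, maxsplit=1))
```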
edit after realising what you meant by "stacked"
Whichever "line-number scenario" you have, you need to make sure every record ("register") has the same number of columns, and then reshape the continuous dataframe with something similar to this:
number_of_columns = 10  # you'll need to make sure all "registers" do have the same number of columns, otherwise this will break
new_shape = (-1, number_of_columns)  # this tuple means "whatever number of lines", by 10 columns
final_df = pd.DataFrame(data=df.values.reshape(new_shape),
                        columns=df.columns.tolist()[:number_of_columns])  # the first 10 names are the headers of one record
Again, make sure that all records have the same number of columns (for instance, a file with just the data you provided, assuming 10 columns, wouldn't work). Also, this solution assumes every record uses the same column names in the same order.
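
If the records do not all have the same number of fields, one alternative is to group the lines into records before building the dataframe. Below is a minimal sketch in plain Python, assuming every record starts with a product/productId line and that key and value are separated by whitespace. The sample text is abbreviated from the question.

```python
raw = """\
product/productId     B000EVS4TY
product/price         unknown
review/score          4.0
review/summary        Delicious cookie mix
product/productId     B0000DF3IX
product/price         unknown
review/score          5.0
"""

records = []
for line in raw.splitlines():
    if not line.strip():
        continue
    key, value = line.split(maxsplit=1)   # key has no spaces; the value may
    if key == "product/productId":        # assumed marker for a new record
        records.append({})
    records[-1][key] = value.strip()

print(len(records), records[1])
```

The resulting list of dicts can be passed to pd.DataFrame(records), which uses the keys as column headers and fills fields missing from a record with NaN.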