Snakemake: How do I use a function that takes in a wildcard and returns a value? - snakemake

I have cram(bam) files that I want to split by read group. This requires reading the header and extracting the read group ids.
I have this function which does that in my Snakemake file:
def identify_read_groups(cram_file):
    import subprocess
    command = 'samtools view -H ' + cram_file + ' | grep ^@RG | cut -f2 | cut -f2 -d":" '
    read_groups = subprocess.check_output(command, shell=True)
    read_groups = read_groups.split('\n')[:-1]
    return(read_groups)
I have this rule all:
rule all:
    input:
        expand('cram/RG_bams/{sample}.RG{read_groups}.bam', read_groups=identify_read_groups('cram/{sample}.bam.cram'))
And this rule to actually do the split:
rule split_cram_by_rg:
    input:
        cram_file='cram/{sample}.bam.cram',
        read_groups=identify_read_groups('cram/{sample}.bam.cram')
    output:
        'cram/RG_bams/{sample}.RG{read_groups}.bam'
    run:
        import subprocess
        read_groups = open(input.readGroupIDs).readlines()
        read_groups = [str(rg.replace('\n','')) for rg in read_groups]
        for rg in read_groups:
            command = 'samtools view -b -r ' + str(rg) + ' ' + str(input.cram_file) + ' > ' + str(output)
            subprocess.check_output(command, shell=True)
I get this error when doing a dry run:
[E::hts_open_format] fail to open file 'cram/{sample}.bam.cram'
samtools view: failed to open "cram/{sample}.bam.cram" for reading: No such file or directory
TypeError in line 19 of /gpfs/gsfs5/users/mcgaugheyd/projects/nei/mcgaughey/EGA_EGAD00001002656/Snakefile:
a bytes-like object is required, not 'str'
File "/gpfs/gsfs5/users/mcgaugheyd/projects/nei/mcgaughey/EGA_EGAD00001002656/Snakefile", line 37, in <module>
File "/gpfs/gsfs5/users/mcgaugheyd/projects/nei/mcgaughey/EGA_EGAD00001002656/Snakefile", line 19, in identify_read_groups
{sample} isn't being passed to the function.
How do I solve this problem? I'm open to other approaches if I'm not doing this in a 'snakemake-ic' way.
==============
EDIT 1
Ok, the first set of examples I gave had many many issues.
Here's a better (?) set of code, which I hope demonstrates my issue.
import sys
from os.path import join
shell.prefix("set -eo pipefail; ")
def identify_read_groups(wildcards):
    import subprocess
    cram_file = 'cram/' + wildcards + '.bam.cram'
    command = 'samtools view -H ' + cram_file + ' | grep ^@RG | cut -f2 | cut -f2 -d":" '
    read_groups = subprocess.check_output(command, shell=True)
    read_groups = read_groups.decode().split('\n')[:-1]
    return(read_groups)
SAMPLES, = glob_wildcards(join('cram/', '{sample}.bam.cram'))
RG_dict = {}
for i in SAMPLES:
    RG_dict[i] = identify_read_groups(i)
rule all:
    input:
        expand('{sample}.boo.txt', sample=list(RG_dict.keys()))
rule split_cram_by_rg:
    input:
        file='cram/{sample}.bam.cram',
        RG = lambda wildcards: RG_dict[wildcards.sample]
    output:
        expand('cram/RG_bams/{{sample}}.RG{input_RG}.bam') # I have a problem HERE. How can I get my read groups values applied here? I need to go from one cram to multiple bam files split by RG (see -r in samtools view below). It can't pull the RG from the input.
    shell:
        'samtools view -b -r {input.RG} {input.file} > {output}'
rule merge_RG_bams_into_one_bam:
    input:
        rules.split_cram_by_rg.output
    output:
        '{sample}.boo.txt'
    message:
        'echo {input}'
    shell:
        'samtools merge {input} > {output}' #not working
==============
EDIT 2
Getting MUCH closer, but currently struggling with expand properly building the lane bam files and keeping the wildcards
I'm using this loop to create the intermediate file names:
for sample in SAMPLES:
    for rg_id in list(return_ID(sample)):
        out_rg_bam.append("temp/lane_bam/{}.ID{}.bam".format(sample, rg_id))
return_ID is a function which takes the sample wildcard and returns a list of the read groups the sample contains
If I use out_rg_bam as an input for a merge rule, then ALL of the files get combined into a merged bam, instead of being split by sample.
If I use expand('temp/realigned/{{sample}}.ID{rg_id}.realigned.bam', sample=SAMPLES, rg_id = return_ID(sample)) then rg_id gets applied to each sample. So if I have two samples (a,b) , with read groups (0,1) and (0,1,2), I end up with a0, a1, a0, a1, a2 and b0, b1, b0, b1, b2.

I'm going to give a more general answer to help others that might find this thread. Snakemake only applies wildcards to strings in the 'input' and 'output' sections when the strings are directly listed, e.g.:
    input:
        '{sample}.bam'
If you are trying to use functions like you were here:
    input:
        read_groups=identify_read_groups('cram/{sample}.bam.cram')
The wildcard replacement will not be done. You can use a lambda function and do the replacement yourself:
    input:
        read_groups=lambda wildcards: identify_read_groups('cram/{sample}.bam.cram'.format(sample=wildcards.sample))
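To make the mechanics concrete, here is a minimal sketch of what an input function sees when Snakemake calls it. The SimpleNamespace object is only a stand-in for the wildcards object, and cram_path is a hypothetical helper name that does not appear in the original code.

```python
from types import SimpleNamespace

def cram_path(wildcards):
    # An input function receives the job's wildcards and has to
    # fill in the placeholders itself.
    return 'cram/{sample}.bam.cram'.format(sample=wildcards.sample)

# Stand-in for the wildcards of one job (assumption: a sample called "a").
fake_wildcards = SimpleNamespace(sample='a')
resolved = cram_path(fake_wildcards)
print(resolved)
```

In a Snakefile the same function would simply be named in the input section without calling it, so that Snakemake can call it once the wildcard values are known.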

Try this:
I use id = 0, 1, 2, 3, ... to name the output bam files, depending on how many read groups a bam file has.
## this is a regular function which takes the sample name and gets its read groups;
## it is used to construct the targets of your rule all
## (you actually just need the number of @RG lines, so this could be simplified)
from os.path import join
def get_read_groups(sample):
    import subprocess
    cram_file = 'cram/' + sample + '.bam.cram'
    command = 'samtools view -H ' + cram_file + ' | grep ^@RG | cut -f2 | cut -f2 -d":" '
    read_groups = subprocess.check_output(command, shell=True)
    read_groups = read_groups.decode().split('\n')[:-1]
    return(read_groups)
SAMPLES, = glob_wildcards(join('cram/', '{sample}.bam.cram'))
RG_dict = {}
for sample in SAMPLES:
    RG_dict[sample] = get_read_groups(sample)
outbam = []
for sample in SAMPLES:
    read_groups = RG_dict[sample]
    for i in range(len(read_groups)):
        ## the path has to match the output pattern of split_cram_by_rg
        outbam.append("cram/RG_bams/{}.RG{}.bam".format(sample, i))
rule all:
    input:
        outbam
## this function only uses wildcards as argument;
## it returns the read group whose index is given by the {id} wildcard
def identify_read_groups(wildcards):
    import subprocess
    cram_file = 'cram/' + wildcards.sample + '.bam.cram'
    command = 'samtools view -H ' + cram_file + ' | grep ^@RG | cut -f2 | cut -f2 -d":" '
    read_groups = subprocess.check_output(command, shell=True)
    read_groups = read_groups.decode().split('\n')[:-1]
    ## wildcard values are strings, so convert the index to int
    return(read_groups[int(wildcards.id)])
rule split_cram_by_rg:
    input:
        cram_file='cram/{sample}.bam.cram'
    params:
        ## a read group ID is a value, not a file, so it is passed as a param
        read_group=identify_read_groups
    output:
        'cram/RG_bams/{sample}.RG{id}.bam'
    run:
        import subprocess
        command = 'samtools view -b -r ' + str(params.read_group) + ' ' + str(input.cram_file) + ' > ' + str(output)
        subprocess.check_output(command, shell=True)
When you use Snakemake, think bottom-up: first define what you want to generate in the rule all, then write the rules that create those final files.
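The same idea can be sketched in plain Python, without the shell pipeline. This is only an illustration under stated assumptions: the header text is made up, and read_group_ids and rg_bams are hypothetical helper names. It shows how pairing each sample with its own read groups avoids the cross-product problem described in EDIT 2 of the question.

```python
def read_group_ids(header_text):
    """Return the ID field of every @RG line in a SAM/CRAM header."""
    ids = []
    for line in header_text.splitlines():
        if not line.startswith('@RG'):
            continue
        for field in line.split('\t')[1:]:
            if field.startswith('ID:'):
                ids.append(field[3:])
    return ids

# Made-up headers for two samples (in the real workflow this text would
# come from `samtools view -H`).
HEADERS = {
    'a': '@HD\tVN:1.6\n@RG\tID:0\tSM:a\n@RG\tID:1\tSM:a\n',
    'b': '@HD\tVN:1.6\n@RG\tID:0\tSM:b\n@RG\tID:1\tSM:b\n@RG\tID:2\tSM:b\n',
}
RG_dict = {sample: read_group_ids(text) for sample, text in HEADERS.items()}

def rg_bams(sample):
    # Only the read groups that belong to this sample, so "a" never gets RG 2.
    return ['cram/RG_bams/{}.RG{}.bam'.format(sample, rg) for rg in RG_dict[sample]]

# Targets for rule all: every sample paired with its own read groups only.
all_targets = [path for sample in RG_dict for path in rg_bams(sample)]
print(all_targets)
```

In a Snakefile, something like lambda wildcards: rg_bams(wildcards.sample) could then serve as the input of a per-sample merge rule, so that each merged file only collects the bam files of its own sample.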

Your all rule cannot have wildcards. It's a no-wildcard zone.
EDIT 1
I typed this in Notepad++ and have not run it; it's just meant to provide a framework. I think this is more what you are after.
Use a function inside of an expand to generate a list of file names which will then be used to drive the Snakemake pipeline's all rule. The basePrefix and baseSuffix variables are just to give you an idea as to string passing; arguments are permitted here. The function returns a plain list of strings, which expand accepts as-is, so no unpack() is needed.
import subprocess
def getSampleFileList(basePrefix, baseSuffix):
    myFileList = []
    # the wildcard glob call
    ListOfSamples, = glob_wildcards('cram/{sample}.bam.cram')
    for sample in ListOfSamples:
        # same call as in the question to list the read group IDs in the header
        command = 'samtools view -H cram/' + sample + '.bam.cram | grep ^@RG | cut -f2 | cut -f2 -d":"'
        read_groups = subprocess.check_output(command, shell=True).decode().split('\n')[:-1]
        for rg in read_groups:
            myFileList.append(basePrefix + sample + ".RG" + rg + baseSuffix)
    return myFileList
basePrefix = "cram/RG_bams/"
baseSuffix = ".bam"
rule all:
    input:
        expand("{fileName}", fileName=getSampleFileList(basePrefix, baseSuffix))
rule processing_rg_files:
    input:
        'cram/RG_bams/{sample}.RG{read_groups}.bam'
    output:
        'cram/RG_TXTs/{sample}.RG{read_groups}.txt'
    run:
        "Let's pretend this is useful code"
END OF EDIT
If it wasn't in the all rule, you'd use inline functions
So I'm not sure what you're trying to accomplish. As per my guesses, read below for some notes about your code.
rule all:
    input:
        expand('cram/RG_bams/{sample}.RG{read_groups}.bam', read_groups=identify_read_groups('cram/{sample}.bam.cram'))
The dry run is failing when it calls the function "identify_read_groups" inside the rule all call. It's being passed into your function call as a string, not a wildcard.
Technically, if the samtools call wasn't failing, and the function call "identify_read_groups(cram_file)" returned a list of 5 strings, it would expand to something like this:
rule all:
input:
'cram/RG_bams/{sample}.RG<output1FromFunctionCall>.bam',
'cram/RG_bams/{sample}.RG<output2FromFunctionCall>.bam',
'cram/RG_bams/{sample}.RG<output3FromFunctionCall>.bam',
'cram/RG_bams/{sample}.RG<output4FromFunctionCall>.bam',
'cram/RG_bams/{sample}.RG<output5FromFunctionCall>.bam'
But at this stage in Snakemake's pre-processing, the term "{sample}" is just a literal string, not a wildcard. To keep a wildcard inside an expand call, you need to escape it with double braces: {{sample}}.
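The brace escaping can be checked with plain str.format, which follows the same double-brace convention. This is only an illustration, not Snakemake's expand itself, and the read group values are assumed.

```python
template = 'cram/RG_bams/{{sample}}.RG{read_groups}.bam'
read_groups = ['1', '2']  # assumed return value of the read-group function

# {read_groups} is filled in, {{sample}} collapses to a literal {sample}
# that Snakemake can later treat as a wildcard.
paths = [template.format(read_groups=rg) for rg in read_groups]
print(paths)
```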
See how I address every Snakemake variable I declare for my rule all input call and don't use wildcards:
expand("{outputDIR}/{pathGVCFT}/tables/{samples}.{vcfProgram}.{form[1][varType]}{form[1][annotated]}.txt", outputDIR=config["outputDIR"], pathGVCFT=config["vcfGenUtil_varScanDIR"], samples=config["sample"], vcfProgram=config["vcfProgram"], form=read_table(StringIO(config["sampleFORM"]), " ").iterrows())
In this case read_table returns a 2-dimensional table whose rows are passed to form. Snakemake works well with plain Python like this. I needed this for pairing different annotations with different variant types.
Your rule all needs to be a string, or list of strings, as input. You cannot have wildcards in your "all" rule. These rule all input strings are what Snakemake uses to generate matches for OTHER wildcards. Build the entire filename in the function call and return it if you need to.
I think you should just turn it into something like this:
rule all:
    input:
        expand("{fileName}", fileName=myFunctionCall(BecauseINeededToPass, ACoupleArgs))
Also consider updating this to be more generic:
rule split_cram_by_rg:
    input:
        cram_file='cram/{sample}.bam.cram',
        read_groups=identify_read_groups('cram/{sample}.bam.cram')
It can have two or more wildcards (why we love Snakemake). You can access the wildcards later in the python "run" directive via the wildcards object, since it looks like you'll want to in your for each loop.
I think input and output wildcards have to match, so maybe try it this way as well.
rule split_cram_by_rg:
    input:
        'cram/{sample}.bam.cram'
    output:
        expand('cram/RG_bams/{{sample}}.RG{read_groups}.bam', read_groups=myFunctionCall(BecauseINeededToPass, ACoupleArgs))
    ...
    params:
        rg=myFunctionCall(BecauseINeededToPass, ACoupleArgs)
    run:
        command = 'Just an example ' + str(params.rg)
Again, not super sure what you're trying to do, I'm not sure I like the idea of the function call twice, but hey, it would run ;P Also notice the use of a wildcard "sample" in the input directive within a string {} and in the output directive within an expand {{}}.
An example of accessing wildcards in your run directive
Example of function calls in places you wouldn't think. I grabbed VCF fields but it could have been anything. I use an external configfile here.

Related

Snakemake pipeline not attempting to produce output?

I have a relatively simple snakemake pipeline but when run I get all missing files for rule all:
refseq = 'refseq.fasta'
reads = ['_R1_001', '_R2_001']
def getsamples():
    import glob
    test = (glob.glob("*.fastq"))
    print(test)
    samples = []
    for i in test:
        samples.append(i.rsplit('_', 2)[0])
    return(samples)
def getbarcodes():
    with open('unique.barcodes.txt') as file:
        lines = [line.rstrip() for line in file]
    return(lines)
rule all:
    input:
        expand("grepped/{barcodes}{sample}_R1_001.plate.fastq", barcodes=getbarcodes(), sample=getsamples()),
        expand("grepped/{barcodes}{sample}_R2_001.plate.fastq", barcodes=getbarcodes(), sample=getsamples())
    wildcard_constraints:
        barcodes="[a-z-A-Z]+$"
rule fastq_grep:
    input:
        R1 = "{sample}_R1_001.fastq",
        R2 = "{sample}_R2_001.fastq"
    output:
        out1 = "grepped/{barcodes}{sample}_R1_001.plate.fastq",
        out2 = "grepped/{barcodes}{sample}_R2_001.plate.fastq"
    wildcard_constraints:
        barcodes="[a-z-A-Z]+$"
    shell:
        "fastq-grep -i '{wildcards.barcodes}' {input.R1} > {output.out1} && fastq-grep -i '{wildcards.barcodes}' {input.R2} > {output.out2}"
The output files that are listed by the terminal seem correct, so it seems it is seeing what I want to produce but the shell is not making anything at all.
I want to produce a list of files that have grepped the list of barcodes I have in a file. But I get "Missing input files for rule all:"
There are two issues:
You have an impossible wildcard_constraints defined for {barcodes}
Your two wildcards {barcodes} and {sample} are competing with each other.
Remove the wildcard_constraints from your two rules and add the following lines to the top of your Snakefile:
wildcard_constraints:
    barcodes="[A-Z]+",
    sample="Well.*",
The constraint for {barcodes} now only matches capital letters. Before it also included end-of-line matching (trailing $) which was impossible to match for this wildcard as you had additional text in the filepath following.
The constraint for {sample} ensures that the path of the filename starting with "Well..." is interpreted as the start of the {sample} wildcard. Else you'd get something unwanted like barcode=ACGGTW instead of barcode=ACGGT.
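The effect of the two constraints can be reproduced with Python's re module. This is a rough sketch under stated assumptions: the file name is invented, and the way the pattern is assembled here only approximates how Snakemake turns an output path into a regex.

```python
import re

path = 'grepped/ACGGTWell1_R1_001.plate.fastq'  # assumed example file name

def match(barcodes_re, sample_re):
    # One capture group per wildcard, in the order {barcodes}{sample}.
    pattern = r'grepped/({})({})_R1_001\.plate\.fastq'.format(barcodes_re, sample_re)
    m = re.fullmatch(pattern, path)
    return m.groups() if m else None

# Original constraint: the trailing $ can never be satisfied mid-path.
original = match(r'[a-z-A-Z]+$', r'.+')
# Only {barcodes} constrained: the greedy match swallows the "W" of "Well1".
barcodes_only = match(r'[A-Z]+', r'.+')
# Both constrained: the split lands where intended.
both = match(r'[A-Z]+', r'Well.*')
print(original, barcodes_only, both)
```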
A note of advice:
I usually find it easier to separate wildcards into directory structures rather than having multiple wildcards in the same filename. In your case that would mean having a structure like
grepped/{barcodes}/{sample}_R1_001.plate.fastq.
Full suggested Snakefile (formatted using snakefmt)
wildcard_constraints:
    barcodes="[A-Z]+",
    sample="Well.*",
refseq = "refseq.fasta"
reads = ["_R1_001", "_R2_001"]
def getsamples():
    import glob
    test = glob.glob("*.fastq")
    print(test)
    samples = []
    for i in test:
        samples.append(i.rsplit("_", 2)[0])
    return samples
def getbarcodes():
    with open("unique.barcodes.txt") as file:
        lines = [line.rstrip() for line in file]
    return lines
rule all:
    input:
        expand(
            "grepped/{barcodes}{sample}_R1_001.plate.fastq",
            barcodes=getbarcodes(),
            sample=getsamples(),
        ),
        expand(
            "grepped/{barcodes}{sample}_R2_001.plate.fastq",
            barcodes=getbarcodes(),
            sample=getsamples(),
        ),
rule fastq_grep:
    input:
        R1="{sample}_R1_001.fastq",
        R2="{sample}_R2_001.fastq",
    output:
        out1="grepped/{barcodes}{sample}_R1_001.plate.fastq",
        out2="grepped/{barcodes}{sample}_R2_001.plate.fastq",
    shell:
        "fastq-grep -i '{wildcards.barcodes}' {input.R1} > {output.out1} && fastq-grep -i '{wildcards.barcodes}' {input.R2} > {output.out2}"
In addition to @euronion's answer (+1), I prefer to constrain wildcards to match only and exactly the list of values you expect. This means disabling the regex matching altogether. In your case, I would do something like:
import re
wildcard_constraints:
    barcodes='|'.join([re.escape(x) for x in getbarcodes()]),
    sample='|'.join([re.escape(x) for x in getsamples()]),
Now {barcodes} is allowed to match only the values in getbarcodes(), whatever they are, and the same for {sample}. In my opinion this is better than anticipating what combination of regex a wildcard can take.
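A small sketch of what such a joined constraint accepts and rejects, using Python's re module directly. The barcode and sample values are assumptions for illustration, and the surrounding pattern only approximates what Snakemake builds from the output path.

```python
import re

barcodes = ['ACGGT', 'TTGCA']      # assumed contents of unique.barcodes.txt
samples = ['Well1', 'Well2.rep']   # assumed sample names

# re.escape makes characters such as "." match literally.
barcodes_re = '|'.join(re.escape(x) for x in barcodes)
samples_re = '|'.join(re.escape(x) for x in samples)
pattern = r'grepped/({})({})_R1_001\.plate\.fastq'.format(barcodes_re, samples_re)

hit = re.fullmatch(pattern, 'grepped/ACGGTWell1_R1_001.plate.fastq')
unknown_sample = re.fullmatch(pattern, 'grepped/ACGGTWell3_R1_001.plate.fastq')
dot_is_literal = re.fullmatch(pattern, 'grepped/TTGCAWell2xrep_R1_001.plate.fastq')
print(hit.groups(), unknown_sample, dot_is_literal)
```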

How do Snakemake checkpoints work when i do not wanna make a folder?

I have a snakemake file where one rule produces a file from which I would like to extract the header and use it as wildcards in my rule all.
The Snakemake guide provides an example where it creates new folders named like the wildcards, but if I can avoid that it would be nice since in some cases it would need to create 100-200 folders then. Any suggestions on how to make it work?
link to snakemake guide:
https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html
import pandas as pd
rule all:
    input:
        final_report = expand('report_{fruit}.txt', fruit= ???)
rule create_file:
    input:
    output:
        fruit = 'fruit_file.csv'
    run:
        ....
rule next:
    input:
        fruit = 'fruit_file.csv'
    output:
        report = 'report_{phenotype}.txt'
    run:
        fruit_file = pd.read_csv({input.fruit}, header = 0, sep = '\t')
        fruits= fruit_file.columns.tolist()[2:]
        for i in fruits:
            cmd = 'touch report_' + i + '.txt'
            shell(cmd)
This is a simplified workflow, since I am actually using a long script to produce both the fruit_file.csv and the report files.
The fruit_file.csv is tab-separated and could look like this:
FID    IID      Apple  Banana  Plum
Mouse  Mickey   0      0       1
Mouse  Minnie   1      0       1
Duck   Donnald  0      1       0
I think you are misreading the snakemake checkpoint example. You only need to create one folder in your case. They have a wildcard (sample) in the folder name, but that part of the output name is known ahead of time.
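checkpoint fruit_reports:
    input:
        fruit = 'fruit_file.csv'
    output:
        report_dir = directory('reports')
    run:
        # inside run, input is a plain Python object, so no curly braces
        fruit_file = pd.read_csv(input.fruit, header = 0, sep = '\t')
        fruits= fruit_file.columns.tolist()[2:]
        # make sure the output directory exists (harmless if it already does)
        shell('mkdir -p {output.report_dir}')
        for i in fruits:
            cmd = f'touch {output.report_dir}/report_{i}.txt'
            shell(cmd)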
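What the checkpoint discovers at run time can be sketched in plain Python. This is only an illustration, using the standard csv module instead of pandas and the example table from the question as assumed file content.

```python
import csv
import io

# Assumed content of fruit_file.csv, taken from the example table in the question.
fruit_file = io.StringIO(
    'FID\tIID\tApple\tBanana\tPlum\n'
    'Mouse\tMickey\t0\t0\t1\n'
    'Mouse\tMinnie\t1\t0\t1\n'
    'Duck\tDonnald\t0\t1\t0\n'
)

# Only the header row is needed: the first two columns are IDs,
# everything after that is a fruit name.
header = next(csv.reader(fruit_file, delimiter='\t'))
fruits = header[2:]
reports = ['reports/report_{}.txt'.format(fruit) for fruit in fruits]
print(reports)
```

These names are unknown before fruit_file.csv exists, which is why they cannot be listed in the rule all directly.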
Since you do not know all names (fruits) ahead of time, you cannot include them in the all rule. You need to reference an intermediate rule to bring everything together. Maybe use a final report file:
rule all:
    input: 'report.txt'
Then after the checkpoint:
import os
def aggregate_fruit(wildcards):
    checkpoint_output = checkpoints.fruit_reports.get(**wildcards).output[0]
    return expand("reports/report_{i}.txt",
                  i=glob_wildcards(os.path.join(checkpoint_output, "report_{i}.txt")).i)
rule report:
    input:
        aggregate_fruit
    output:
        "report.txt"
    shell:
        "ls -1 {input} > {output}"

Snakemake: how to specify absolute paths to shell commands

I am writing a snakemake rule that uses multiple commands as shown below:
rule RULE1:
    input: 'path/to/input.file'
    output: 'path/to/output.file'
    shell: 'path/to/command1 {input} | /path/to/command2 | /path/to/command3 {output}'
If the /path/to/command1 is really long the rule becomes a bit unwieldy. Is there a way to specify it somewhere else as cmd1='/path/to/command1' and use {cmd1} within the rule? I know, I can use something like params: cmd1='/path/to/command1' and use it as follows:
rule RULE1:
    input: 'path/to/input.file'
    output: 'path/to/output.file'
    params:
        cmd1='/path/to/command1',
        cmd2='/path/to/command2',
        cmd3='/path/to/command3'
    shell: '{params.cmd1} {input} | {params.cmd2} | {params.cmd3} {output}'
But that workaround requires me to specify it for every rule separately and cannot use relative paths.
What is the standard way to do such a thing?
The shell directive takes a string as argument which you can construct however you prefer e.g.
cmd1= 'foo'
cmd2= 'bar'
rule one:
    ...
    shell:
        cmd1 + ' {input}' + ' | ' + cmd2 + ' > {output}'
To show some power of the snake, you could do something like
path2 = "/the/long/and/winding/path/"
rule RULE1:
    input: path2 + 'input.file'
    output: path2 + 'output.file'
    shell: f'{path2}command1 {{input}} | {path2}command2 | {path2}command3 {{output}}'
A couple of notes:
Double curly braces are needed because both Snakemake and Python (the f-string) want to parse them
Variables such as path2 above are often stored in a config file, accessed via the Snakemake directive configfile:
If all your files are on the same path, you might be able to use workdir: "/the/long/and/winding/path/", or set the path from the command line (better, as your Snakefile will be less prone to errors if you change directories)
This can obviously be combined with dariober's (better) answer, creating cmd1 = path2 + 'command1' to avoid repeating the long path in all commands ...
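The double-brace behaviour is easy to verify in plain Python. In this small sketch the f-string fills in path2 immediately and leaves single-brace placeholders behind for Snakemake to fill in later.

```python
path2 = '/the/long/and/winding/path/'

# {path2} is substituted by Python right away; {{input}} and {{output}}
# collapse to {input} and {output}, which Snakemake resolves per job.
cmd = f'{path2}command1 {{input}} | {path2}command2 | {path2}command3 {{output}}'
print(cmd)
```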

Snakemake : "wildcards in input files cannot be determined from output files"

I use Snakemake to execute some rules, and I've a problem with one:
rule filt_SJ_out:
    input:
        "pass1/{sample}SJ.out.tab"
    output:
        "pass1/SJ.db"
    shell:'''
        gawk '$6==1 || ($6==0 && $7>2)' {input} >> {output};
        '''
Here, I just want to merge some files into one general file, but from searching on Google I've seen that wildcards used in the input must also be used in the output.
But I can't find a way to work around this problem.
Thanks in advance.
If you know the values of sample prior to running the script, you could do the following:
SAMPLES = [... define the possible values of `sample` ...]
rule filt_SJ_out:
    input:
        expand("pass1/{sample}SJ.out.tab", sample=SAMPLES)
    output:
        "pass1/SJ.db"
    shell:
        """
        gawk '$6==1 || ($6==0 && $7>2)' {input} >> {output};
        """
In the input step, this will generate a list of files, each of the form pass1/<XYZ>SJ.out.tab.
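As a rough illustration of that list, the same result can be produced with a plain list comprehension. The sample names here are assumptions.

```python
SAMPLES = ['A', 'B', 'C']  # assumed sample names

# One input file per sample; the rule's output has no wildcard left,
# so a single job reads all of these files.
inputs = ['pass1/{sample}SJ.out.tab'.format(sample=s) for s in SAMPLES]
print(inputs)
```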

Parallel execution of a snakemake rule with same input and a range of values for a single parameter

I am transitioning a bash script to snakemake and I would like to parallelize a step I was previously handling with a for loop. The issue I am running into is that instead of running parallel processes, snakemake ends up trying to run one process with all parameters and fails.
My original bash script runs a program multiple times for a range of values of the parameter K.
for num in {1..3}
do
    structure.py -K $num --input=fileprefix --output=fileprefix
done
There are multiple input files that start with fileprefix. And there are two main outputs per run, e.g. for K=1 they are fileprefix.1.meanP, fileprefix.1.meanQ. My config and snakemake files are as follows.
Config:
cat config.yaml
infile: fileprefix
K:
  - 1
  - 2
  - 3
Snakemake:
configfile: 'config.yaml'
rule all:
    input:
        expand("output/{sample}.{K}.{ext}",
               sample = config['infile'],
               K = config['K'],
               ext = ['meanQ', 'meanP'])
rule structure:
    output:
        "output/{sample}.{K}.meanQ",
        "output/{sample}.{K}.meanP"
    params:
        prefix = config['infile'],
        K = config['K']
    threads: 3
    shell:
        """
        structure.py -K {params.K} \
            --input=output/{params.prefix} \
            --output=output/{params.prefix}
        """
This was executed with snakemake --cores 3. The problem persists when I only use one thread.
I expected the outputs described above for each value of K, but the run fails with this error:
RuleException:
CalledProcessError in line 84 of Snakefile:
Command ' set -euo pipefail; structure.py -K 1 2 3 --input=output/fileprefix \
--output=output/fileprefix ' returned non-zero exit status 2.
File "Snakefile", line 84, in __rule_Structure
File "snake/lib/python3.6/concurrent/futures/thread.py", line 56, in run
When I set K to a single value such as K = ['1'], everything works. So the problem seems to be that {params.K} is being expanded to all values of K when the shell command is executed. I started teaching myself snakemake today, and it works really well, but I'm hitting a brick wall with this.
You need to retrieve the argument for -K from the wildcards, not from the config file. The config file will simply return your list of possible values, it is a plain python dictionary.
configfile: 'config.yaml'
rule all:
    input:
        expand("output/{sample}.{K}.{ext}",
               sample = config['infile'],
               K = config['K'],
               ext = ['meanQ', 'meanP'])
rule structure:
    output:
        "output/{sample}.{K}.meanQ",
        "output/{sample}.{K}.meanP"
    params:
        prefix = config['infile'],
        K = config['K']
    threads: 3
    shell:
        "structure.py -K {wildcards.K} "
        "--input=output/{params.prefix} "
        "--output=output/{params.prefix}"
Note that there are more things to improve here. For example, the rule structure does not define any input file, although it uses one.
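The difference between the two lookups can be sketched in plain Python. This is a simplified model, not Snakemake's internals: the target list mimics what expand asks for in the rule all, and from_wildcards is a hypothetical helper that pulls K out of one job's output file name.

```python
from itertools import product

config = {'infile': 'fileprefix', 'K': [1, 2, 3]}

# What rule all requests: one file per (K, ext) combination.
targets = ['output/{}.{}.{}'.format(config['infile'], k, ext)
           for k, ext in product(config['K'], ['meanQ', 'meanP'])]

# {params.K} holds the whole list, so every job gets all values at once.
from_params = 'structure.py -K ' + ' '.join(str(k) for k in config['K'])

# {wildcards.K} holds only the value parsed from that job's output file name.
def from_wildcards(output_file):
    k = output_file.split('.')[-2]
    return 'structure.py -K ' + k

print(from_params)
print(from_wildcards(targets[2]))
```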
There is an option now for parameter space exploration
https://snakemake.readthedocs.io/en/stable/snakefiles/rules.html#parameter-space-exploration