I am writing a snakemake rule that uses multiple commands as shown below:
rule RULE1:
    input: 'path/to/input.file'
    output: 'path/to/output.file'
    shell: '/path/to/command1 {input} | /path/to/command2 | /path/to/command3 {output}'
If /path/to/command1 is really long, the rule becomes a bit unwieldy. Is there a way to specify it somewhere else, as cmd1='/path/to/command1', and use {cmd1} within the rule? I know I can use something like params: cmd1='/path/to/command1' and use it as follows:
rule RULE1:
    input: 'path/to/input.file'
    output: 'path/to/output.file'
    params:
        cmd1='/path/to/command1',
        cmd2='/path/to/command2',
        cmd3='/path/to/command3'
    shell: '{params.cmd1} {input} | {params.cmd2} | {params.cmd3} {output}'
But that workaround requires me to specify the commands for every rule separately, and I cannot use relative paths.
What is the standard way to do such a thing?
The shell directive takes a string as argument which you can construct however you prefer e.g.
cmd1 = 'foo'
cmd2 = 'bar'

rule one:
    ...
    shell:
        cmd1 + ' {input}' + ' | ' + cmd2 + ' > {output}'
To show some power of the snake, you could do something like
path2 = "/the/long/and/winding/path/"

rule RULE1:
    input: path2 + 'input.file'
    output: path2 + 'output.file'
    shell: f'{path2}command1 {{input}} | {path2}command2 | {path2}command3 {{output}}'
A couple of notes:
Double curly braces are needed since both snakemake and python (the f-string) want to parse them.
Variables like path2 above are often stored in a config file accessed through the configfile: directive; see the sketch after these notes.
If all your files are on the same path, you might be able to use workdir: "/the/long/and/winding/path/" - or set the path from the command line (better, as your Snakefile will be less prone to errors if you change directories).
This can obviously be combined with dariober's (better) answer, creating cmd1 = path2 + 'command1' to avoid repeating the long path in all commands ...
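A minimal sketch of that config-file approach (the file name config.yaml and the key tools_dir are illustrative, not from the original post):

# config.yaml (illustrative):
# tools_dir: /the/long/and/winding/path

configfile: "config.yaml"
tools = config["tools_dir"]  # read once, reused in every rule

rule RULE1:
    input: 'path/to/input.file'
    output: 'path/to/output.file'
    shell: f'{tools}/command1 {{input}} | {tools}/command2 | {tools}/command3 {{output}}'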
Related
I want to rename and move my fastq.gz files from these:
NAME-BOB_S1_L001_R1_001.fastq.gz
NAME-BOB_S1_L001_R2_001.fastq.gz
NAME-JOHN_S2_L001_R1_001.fastq.gz
NAME-JOHN_S2_L001_R2_001.fastq.gz
to these:
NAME_BOB/reads/NAME_BOB.R1.fastq.gz
NAME_BOB/reads/NAME_BOB.R2.fastq.gz
NAME_JOHN/reads/NAME_JOHN.R1.fastq.gz
NAME_JOHN/reads/NAME_JOHN.R2.fastq.gz
This is my code. The problem I have is the second variable S which I do not know how to specify in the code as I do not need it in my output filename.
workdir: "/path/to/workdir/"

DIR=["BOB","JOHN"]
S=["S1","S2"]

rule all:
    input:
        expand("NAME_{dir}/reads/NAME_{dir}.R1.fastq.gz", dir=DIR),
        expand("NAME_{dir}/reads/NAME_{dir}.R2.fastq.gz", dir=DIR)

rule rename:
    input:
        fastq1=("fastq/NAME-{dir}_{s}_L001_R1_001.fastq.gz", zip, dir=DIR, s=S),
        fastq2=("fastq/NAME-{dir}_{s}_L001_R2_001.fastq.gz", zip, dir=DIR, s=S)
    output:
        fastq1="NAME_{dir}/reads/NAME_{dir}.R1.fastq.gz",
        fastq2="NAME_{dir}/reads/NAME_{dir}.R2.fastq.gz"
    shell:
        """
        mv {input.fastq1} {output.fastq1}
        mv {input.fastq2} {output.fastq2}
        """
There are several problems in your code. First of all, the {dir} in your output and the {dir} in your input are two different things: the {dir} in the output is a wildcard, while the {dir} in the input is a parameter for the expand function (moreover, you even forgot to call that function, which is the second problem).
The third problem is that chaining several commands in one shell section, e.g. mv {input.fastq1} {output.fastq1}; mv {input.fastq2} {output.fastq2}, works but is not an idiomatic solution. Much better is to create a rule that produces a single file, letting Snakemake do the rest of the work.
Finally, the S value fully depends on the DIR value, so it becomes a function of {dir}, and that can be solved with a lambda in the input:
workdir: "/path/to/workdir/"

DIR=["BOB","JOHN"]
dir2s = {"BOB": "S1", "JOHN": "S2"}

rule all:
    input:
        expand("NAME_{dir}/reads/NAME_{dir}.{r}.fastq.gz", dir=DIR, r=["R1", "R2"])

rule rename:
    input:
        lambda wildcards:
            "fastq/NAME-{{dir}}_{s}_L001_{{r}}_001.fastq.gz".format(s=dir2s[wildcards.dir])
    output:
        "NAME_{dir}/reads/NAME_{dir}.{r}.fastq.gz",
    shell:
        """
        mv {input} {output}
        """
I am trying to process MinION cDNA amplicons using Porechop with Minimap2 and I am getting this error.
MissingInputException in line 16 of /home/sean/Desktop/reo/antisera project/20200813/MinIONAmplicon.smk:
Missing input files for rule minimap2:
8413_19_strict/BC01.fastq.g
I understand what the error is telling me; I just don't understand why it's not trying to run the rule before it. Porechop is being used to check for all the possible barcodes and will output more than one fastq file if it finds more than one barcode in the directory. However, since I know what barcode I am looking for, I made a barcodes section in the config.yaml file so I can map them together.
I think the error is happening because my target output for Porechop doesn't match the input for minimap2, but I do not know how to correct this problem, as there can be multiple outputs from porechop.
I thought I was building a path for the input file of the minimap2 rule, and that when snakemake discovered the porechop output was not there it would make it, but that is not what is happening.
Here is my pipeline so far,
configfile: "config.yaml"

rule all:
    input:
        expand("{sample}.bam", sample = config["samples"])

rule porechop_strict:
    input:
        lambda wildcards: config["samples"][wildcards.sample]
    output:
        directory("{sample}_strict/")
    shell:
        "porechop -i {input} -b {output} --barcode_threshold 85 --threads 8 --require_two_barcodes"

rule minimap2:
    input:
        lambda wildcards: "{sample}_strict/" + config["barcodes"][wildcards.sample]
    output:
        "{sample}.bam"
    shell:
        "minimap2 -ax map-ont -t8 ../concensus.fasta {input} | samtools sort -o {output}"
and the yaml file
samples: {
    '8413_19': relabeled_reads/8413_19.raw.fastq.gz,
    '8417_19': relabeled_reads/8417_19.raw.fastq.gz,
    '8445_19': relabeled_reads/8445_19.raw.fastq.gz,
    '8466_19_104': relabeled_reads/8466_19_104.raw.fastq.gz,
    '8466_19_105': relabeled_reads/8466_19_105.raw.fastq.gz,
    '8467_20': relabeled_reads/8467_20.raw.fastq.gz,
}

barcodes: {
    '8413_19': BC01.fastq.gz,
    '8417_19': BC02.fastq.gz,
    '8445_19': BC03.fastq.gz,
    '8466_19_104': BC04.fastq.gz,
    '8466_19_105': BC05.fastq.gz,
    '8467_20': BC06.fastq.gz,
}
First of all, you can always debug problems like this by specifying the flag --printshellcmds. That prints all the shell commands Snakemake runs under the hood; you may try to run them manually and locate the problem.
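For example, combined with a dry run so that nothing is actually executed (-n and -p are the short forms of --dry-run and --printshellcmds):

snakemake --dry-run --printshellcmds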
As for why your rule doesn't produce any output, my guess is that samtools requires an explicit filename or - to use stdin:
Samtools is designed to work on a stream. It regards an input file '-'
as the standard input (stdin) and an output file '-' as the standard
output (stdout). Several commands can thus be combined with Unix
pipes. Samtools always output warning and error messages to the
standard error output (stderr).
So try that:
shell:
    "minimap2 -ax map-ont -t8 ../concensus.fasta {input} | samtools sort -o {output} -"
So I am not 100% sure why this way works; I imagine it has to do with the way snakemake looks at the targets. However, here is the solution I found for it.
rule minimap2:
    input:
        "{sample}_strict"
    params:
        suffix=lambda wildcards: config["barcodes"][wildcards.sample]
    output:
        "{sample}.bam"
    shell:
        "minimap2 -ax map-ont -t8 ../consensus.fasta "
        "{input}/{params.suffix} | samtools sort -o {output}"
By using the params feature in snakemake I was able to match up the correct barcode to the sample name. I am not sure why I couldn't just do that as the input itself, but when I changed the input back to match the output of the previous rule it worked. (Presumably this is because the porechop rule declares the directory {sample}_strict as its output, so Snakemake only knows how to produce the directory itself, not the individual fastq files inside it.)
I have some ONT sequencing runs that have been basecalled on the MINIT. As such, when I demultiplex with guppy_barcoder, I get a directory of fastq files for each barcode. I want to use snakemake as a workflow manager to take these fastq files through our analyses, but this involves swapping the {barcode} for {sample} at some point.
import glob

BARCODE=['barcode01', 'barcode02', 'barcode03', 'barcode04']
SAMPLE=['sample01', 'sample02', 'sample03', 'sample04']

rule all:
    input:
        directory(expand("Sequencing_reads/demultiplexed/{barcode}", barcode=BARCODE)), # guppy_barcoder
        expand("Sequencing_reads/gathered/{sample}_ONT.fastq", sample=SAMPLE), # getting all of the fastq files with the same barcode assigned to the correct sample

rule demultiplex:
    input:
        glob.glob("Sequencing_reads/fastq_pass/*fastq")
    output:
        directory(expand("Sequencing_reads/demultiplexed/{barcode}", barcode=BARCODE))
    shell:
        "guppy_barcoder --input_path Sequencing_reads/fastq_pass --save_path Sequencing_reads/demultiplexed -r "

rule gather:
    input:
        rules.demultiplex.output
    output:
        "Sequencing_reads/gathered/{sample}_ONT.fastq"
    shell:
        "cat Sequencing_reads/demultiplexed/{wildcards.barcode}/*fastq > {output.fastq} "
This does give me an error:
RuleException in line 32 of /home/eriny/sandbox/ONT_unicycler_pipeline/ONT_pipeline.smk:
'Wildcards' object has no attribute 'barcode'
But I actually think I'm missing something conceptually. I would like rule gather to be something like:
cat Sequencing_reads/demultiplexed/barcode01/*fastq > Sequencing_reads/gathered/sample01_ONT.fastq
I have tried setting up some dictionaries so that sample and barcode are given the same key, but my syntax must be broken.
I'm hoping to find a 1:1 way to map one variable name onto another.
I think a sample-to-barcode dictionary is a possibility, combined with a lambda as input function to get the barcode assigned to a sample. For example:
import glob

BARCODE=['barcode01', 'barcode02', 'barcode03', 'barcode04']
SAMPLE=['sample01', 'sample02', 'sample03', 'sample04']
sam2bar= dict(zip(SAMPLE, BARCODE))

rule all:
    input:
        expand("Sequencing_reads/gathered/{sample}_ONT.fastq", sample=SAMPLE), # getting all of the fastq files with the same barcode assigned to the correct sample

rule demultiplex:
    input:
        glob.glob("Sequencing_reads/fastq_pass/*fastq"),
    output:
        done= touch('demux.done'), # This signals that guppy has completed
    shell:
        "guppy_barcoder --input_path Sequencing_reads/fastq_pass --save_path Sequencing_reads/demultiplexed -r "

rule gather:
    input:
        done= 'demux.done',
        fastq= lambda wc: glob.glob("Sequencing_reads/demultiplexed/%s/*fastq" % sam2bar[wc.sample])
    output:
        fastq= "Sequencing_reads/gathered/{sample}_ONT.fastq"
    shell:
        "cat {input.fastq} > {output.fastq} "
How can I make sure in rule all that the output folder was created properly? Should I add each expected result file?
This somehow relates to snakemake define folder as output, but in my case the specified 'output' is a combination of a path to a dir and a prefix for all result files (there will be multiple).
The following command creates the folder path Analysis/MosDepth and adds to that path the files:
gt0.mosdepth.global.dist.txt
gt0.mosdepth.region.dist.txt
gt0.per-base.bed.gz
gt0.per-base.bed.gz.csi
gt0.regions.bed.gz
gt0.regions.bed.gz.csi
rule MosDepth:
    input:
        bam = "Analysis/Minimap2/"+UnpackedRawFastq+".bam",
        bed = "ReferenceData/"+UnpackedGenomeGFF+"_exons.bed"
    output:
        pfx = "Analysis/MosDepth/gt0"
    threads: config["threads"]
    shell:
        "mosdepth -t {threads} -b {input.bed} {output.pfx} {input.bam}"
I currently have only one of the files in rule all. Is this enough, or is there a better way to ensure that mosdepth has run well and is not redone in a later re-run?
rule all:
    input:
        "Analysis/MosDepth/gt0.regions.bed.gz"
I would recommend something like this:
mos_out = ['gt0.mosdepth.global.dist.txt', 'gt0.mosdepth.region.dist.txt',
           'gt0.per-base.bed.gz', 'gt0.per-base.bed.gz.csi',
           'gt0.regions.bed.gz', 'gt0.regions.bed.gz.csi']

rule MosDepth:
    input:
        bam = "Analysis/Minimap2/"+UnpackedRawFastq+".bam",
        bed = "ReferenceData/"+UnpackedGenomeGFF+"_exons.bed"
    output:
        expand("Analysis/MosDepth/{mos_out}", mos_out=mos_out)
    params:
        pfx = "Analysis/MosDepth/gt0"
    threads: config["threads"]
    shell:
        "mosdepth -t {threads} -b {input.bed} {params.pfx} {input.bam}"
If one of the output files is not created by the rule, snakemake will remove all the output files for you, and throw an error.
I have cram(bam) files that I want to split by read group. This requires reading the header and extracting the read group ids.
I have this function which does that in my Snakemake file:
def identify_read_groups(cram_file):
    import subprocess
    command = 'samtools view -H ' + cram_file + ' | grep ^@RG | cut -f2 | cut -f2 -d":" '
    read_groups = subprocess.check_output(command, shell=True)
    read_groups = read_groups.split('\n')[:-1]
    return(read_groups)
I have this rule all:
rule all:
    input:
        expand('cram/RG_bams/{sample}.RG{read_groups}.bam', read_groups=identify_read_groups('cram/{sample}.bam.cram'))
And this rule to actually do the split:
rule split_cram_by_rg:
    input:
        cram_file='cram/{sample}.bam.cram',
        read_groups=identify_read_groups('cram/{sample}.bam.cram')
    output:
        'cram/RG_bams/{sample}.RG{read_groups}.bam'
    run:
        import subprocess
        read_groups = open(input.readGroupIDs).readlines()
        read_groups = [str(rg.replace('\n','')) for rg in read_groups]
        for rg in read_groups:
            command = 'samtools view -b -r ' + str(rg) + ' ' + str(input.cram_file) + ' > ' + str(output)
            subprocess.check_output(command, shell=True)
I get this error when doing a dry run:
[E::hts_open_format] fail to open file 'cram/{sample}.bam.cram'
samtools view: failed to open "cram/{sample}.bam.cram" for reading: No such file or directory
TypeError in line 19 of /gpfs/gsfs5/users/mcgaugheyd/projects/nei/mcgaughey/EGA_EGAD00001002656/Snakefile:
a bytes-like object is required, not 'str'
File "/gpfs/gsfs5/users/mcgaugheyd/projects/nei/mcgaughey/EGA_EGAD00001002656/Snakefile", line 37, in <module>
File "/gpfs/gsfs5/users/mcgaugheyd/projects/nei/mcgaughey/EGA_EGAD00001002656/Snakefile", line 19, in identify_read_groups
{sample} isn't being passed to the function.
How do I solve this problem? I'm open to other approaches if I'm not doing this in a 'snakemake-ic' way.
==============
EDIT 1
Ok, the first set of examples I gave had many many issues.
Here's a better (?) set of code, which I hope demonstrates my issue.
import sys
from os.path import join

shell.prefix("set -eo pipefail; ")

def identify_read_groups(wildcards):
    import subprocess
    cram_file = 'cram/' + wildcards + '.bam.cram'
    command = 'samtools view -H ' + cram_file + ' | grep ^@RG | cut -f2 | cut -f2 -d":" '
    read_groups = subprocess.check_output(command, shell=True)
    read_groups = read_groups.decode().split('\n')[:-1]
    return(read_groups)

SAMPLES, = glob_wildcards(join('cram/', '{sample}.bam.cram'))

RG_dict = {}
for i in SAMPLES:
    RG_dict[i] = identify_read_groups(i)

rule all:
    input:
        expand('{sample}.boo.txt', sample=list(RG_dict.keys()))

rule split_cram_by_rg:
    input:
        file='cram/{sample}.bam.cram',
        RG = lambda wildcards: RG_dict[wildcards.sample]
    output:
        expand('cram/RG_bams/{{sample}}.RG{input_RG}.bam') # I have a problem HERE. How can I get my read group values applied here? I need to go from one cram to multiple bam files split by RG (see -r in samtools view below). It can't pull the RG from the input.
    shell:
        'samtools view -b -r {input.RG} {input.file} > {output}'

rule merge_RG_bams_into_one_bam:
    input:
        rules.split_cram_by_rg.output
    output:
        '{sample}.boo.txt'
    message:
        'echo {input}'
    shell:
        'samtools merge {input} > {output}' # not working
"""
==============
EDIT 2
Getting MUCH closer, but currently struggling with expand properly building the lane bam files and keeping the wildcards
I'm using this loop to create the intermediate file names:
out_rg_bam = []  # collect the per-read-group bam names
for sample in SAMPLES:
    for rg_id in list(return_ID(sample)):
        out_rg_bam.append("temp/lane_bam/{}.ID{}.bam".format(sample, rg_id))
return_ID is a function which takes the sample wildcard and returns a list of the read groups the sample contains
If I use out_rg_bam as an input for a merge rule, then ALL of the files get combined into a merged bam, instead of being split by sample.
If I use expand('temp/realigned/{{sample}}.ID{rg_id}.realigned.bam', sample=SAMPLES, rg_id = return_ID(sample)) then rg_id gets applied to each sample. So if I have two samples (a,b) , with read groups (0,1) and (0,1,2), I end up with a0, a1, a0, a1, a2 and b0, b1, b0, b1, b2.
I'm going to give a more general answer to help others that might find this thread. Snakemake only applies wildcards to strings in the 'input' and 'output' sections when the strings are directly listed, e.g.:
input:
    '{sample}.bam'
If you are trying to use functions like you were here:
input:
    read_groups=identify_read_groups('cram/{sample}.bam.cram')
The wildcard replacement will not be done. You can use a lambda function and do the replacement yourself:
input:
    read_groups=lambda wildcards: identify_read_groups('cram/{sample}.bam.cram'.format(sample=wildcards.sample))
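Equivalently, a named input function keeps the rule readable. A minimal sketch, reusing identify_read_groups as defined in the question:

def read_groups_for_sample(wildcards):
    # Snakemake calls this with the job's wildcards, so the concrete
    # sample name is available before the rule is executed
    return identify_read_groups('cram/{sample}.bam.cram'.format(sample=wildcards.sample))

rule split_cram_by_rg:
    input:
        cram_file='cram/{sample}.bam.cram',
        read_groups=read_groups_for_sample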
Try this. I use id = 0, 1, 2, 3 to name the output bam files, depending on how many read groups a bam file has.
## this is a regular function which takes the cram file and gets the read groups to
## construct your rule all
## you actually just need the number of @RG lines, so the below can be simplified
def get_read_groups(sample):
    import subprocess
    cram_file = 'cram/' + sample + '.bam.cram'
    command = 'samtools view -H ' + cram_file + ' | grep ^@RG | cut -f2 | cut -f2 -d":" '
    read_groups = subprocess.check_output(command, shell=True)
    read_groups = read_groups.decode().split('\n')[:-1]
    return(read_groups)
from os.path import join

SAMPLES, = glob_wildcards(join('cram/', '{sample}.bam.cram'))

RG_dict = {}
for sample in SAMPLES:
    RG_dict[sample] = get_read_groups(sample)

outbam = []
for sample in SAMPLES:
    read_groups = RG_dict[sample]
    for i in range(len(read_groups)):
        # the index i is the id used in the output file name below
        outbam.append("cram/RG_bams/{}.RG{}.bam".format(sample, i))

rule all:
    input:
        outbam
## this is the input function; it only takes wildcards as argument
def identify_read_groups(wildcards):
    import subprocess
    cram_file = 'cram/' + wildcards.sample + '.bam.cram'
    command = 'samtools view -H ' + cram_file + ' | grep ^@RG | cut -f2 | cut -f2 -d":" '
    read_groups = subprocess.check_output(command, shell=True)
    read_groups = read_groups.decode().split('\n')[:-1]
    return(read_groups[int(wildcards.id)])  # wildcards are strings, so convert the index
rule split_cram_by_rg:
    input:
        cram_file='cram/{sample}.bam.cram',
        read_groups=identify_read_groups
    output:
        'cram/RG_bams/{sample}.RG{id}.bam'
    run:
        import subprocess
        rg = str(input.read_groups)  # a single read-group ID for this job
        command = 'samtools view -b -r ' + rg + ' ' + str(input.cram_file) + ' > ' + str(output)
        subprocess.check_output(command, shell=True)
When you use snakemake, think bottom-up: first define what you want to generate in the rule all, and then construct the rules that create it.
Your all rule cannot have wildcards. It's a no-wildcard zone.
EDIT 1
I typed this pseudo-code in Notepad++; it's not meant to compile, just to provide a framework. I think this is more what you are after.
Use a function inside an expand to generate a list of file names which will then be used to drive the Snakemake pipeline's all rule. The baseSuffix and basePrefix variables are just to give you an idea of string passing; arguments are permitted here. When passing back the list of strings, you will have to unpack them to ensure Snakemake reads the result properly.
def getSampleFileList(basePrefix, baseSuffix):
    myFileList = []
    ListOfSamples = []  # the wildcard glob call
    for sample in ListOfSamples:
        command = "samtools -h " + sample  # same call used to generate the list of headers
        for rg in command:  # pseudo-code: stands in for looping over the read groups the command returns
            myFileList.append(basePrefix + sample + ".RG" + rg + baseSuffix)
    return myFileList
basePrefix = "cram/RG_bams/"
baseSuffix = ".bam"

rule all:
    input:
        unpack(expand("{fileName}", fileName=getSampleFileList(basePrefix, baseSuffix)))
rule processing_rg_files:
    input:
        'cram/RG_bams/{sample}.RG{read_groups}.bam'
    output:
        'cram/RG_TXTs/{sample}.RG{read_groups}.txt'
    run:
        "Let's pretend this is useful code"
END OF EDIT
If it wasn't in the all rule, you'd use inline functions
So I'm not sure what you're trying to accomplish. Based on my guesses, read below for some notes about your code.
rule all:
    input:
        expand('cram/RG_bams/{sample}.RG{read_groups}.bam', read_groups=identify_read_groups('cram/{sample}.bam.cram'))
The dry run is failing when it calls the function identify_read_groups inside the rule all call. The argument 'cram/{sample}.bam.cram' is being passed into your function as a literal string, not as a resolved wildcard.
Technically, if the samtools call weren't failing and the function call identify_read_groups(cram_file) returned a list of 5 strings, it would expand to something like this:
rule all:
    input:
        'cram/RG_bams/{sample}.RG<output1FromFunctionCall>.bam',
        'cram/RG_bams/{sample}.RG<output2FromFunctionCall>.bam',
        'cram/RG_bams/{sample}.RG<output3FromFunctionCall>.bam',
        'cram/RG_bams/{sample}.RG<output4FromFunctionCall>.bam',
        'cram/RG_bams/{sample}.RG<output5FromFunctionCall>.bam'
But the term "{sample}", at this stage of Snakemake's pre-processing, is considered a plain string, as you would need to denote wildcards in an expand function with double braces {{}}.
See how I explicitly assign every Snakemake variable I declare in my rule all input call and don't use wildcards:
expand("{outputDIR}/{pathGVCFT}/tables/{samples}.{vcfProgram}.{form[1][varType]}{form[1][annotated]}.txt", outputDIR=config["outputDIR"], pathGVCFT=config["vcfGenUtil_varScanDIR"], samples=config["sample"], vcfProgram=config["vcfProgram"], form=read_table(StringIO(config["sampleFORM"]), " ").iterrows())
In this case read_table returns a 2-dimensional array to form. Snakemake is well supported by Python. I needed this for pairing different annotations to different variant types.
Your rule all needs a string, or list of strings, as input. You cannot have wildcards in your all rule. These rule all input strings are what Snakemake uses to generate matches for OTHER wildcards. Build the entire filename in the function call and return it if you need to.
I think you should just turn it into something like this:
rule all:
    input:
        expand("{fileName}", fileName=myFunctionCall(BecauseINeededToPass, ACoupleArgs))
Also consider updating this to be more generic:
rule split_cram_by_rg:
    input:
        cram_file='cram/{sample}.bam.cram',
        read_groups=identify_read_groups('cram/{sample}.bam.cram')
It can have two or more wildcards (why we love Snakemake). You can access the wildcards later in the Python run directive via the wildcards object, since it looks like you'll want to do that in your for-each loop.
I think input and output wildcards have to match, so maybe try it this way as well:
rule split_cram_by_rg:
    input:
        'cram/{sample}.bam.cram'
    output:
        expand('cram/RG_bams/{{sample}}.RG{read_groups}.bam', read_groups=myFunctionCall(BecauseINeededToPass, ACoupleArgs))
    ...
    params:
        rg=myFunctionCall(BecauseINeededToPass, ACoupleArgs)
    run:
        command = 'Just an example ' + str(params.rg)
Again, I'm not super sure what you're trying to do, and I'm not sure I like the idea of the function call twice, but hey, it would run ;P Also notice the use of the wildcard "sample" in the input directive within a string {} and in the output directive within an expand {{}}.
An example of accessing wildcards in your run directive
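For instance, a minimal sketch (the rule, file names, and the samtools command are illustrative, not from the original post):

rule count_reads:
    input:
        'cram/RG_bams/{sample}.RG{read_groups}.bam'
    output:
        'cram/RG_TXTs/{sample}.RG{read_groups}.txt'
    run:
        # each wildcard is an attribute of the wildcards object
        print("sample:", wildcards.sample, "read group:", wildcards.read_groups)
        # shell() inside run formats {input}/{output} just like a shell directive
        shell('samtools view -c {input} > {output}')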
Example of function calls in places you wouldn't think possible. I grabbed VCF fields, but it could have been anything. I used an external configfile here.
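A sketch of that pattern (the config key vcfFields, the rule, and the echo placeholder are hypothetical, assumed only for illustration):

configfile: "config.yaml"  # hypothetical contents: vcfFields: [CHROM, POS, REF, ALT]

def field_string(wildcards):
    # an ordinary function call, evaluated when the job is built
    return ','.join(config["vcfFields"])

rule grab_vcf_fields:
    input:
        '{sample}.vcf'
    output:
        '{sample}.fields.txt'
    params:
        fields=field_string
    shell:
        'echo "{params.fields}" > {output}'  # placeholder for the real extraction command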