I would like to run multiple rules one after another using Snakemake. However, when I run this script, the bam_list rule runs before the samtools_markdup rule and fails with an error that it cannot find its input files, which obviously have not been generated yet.
How can I solve this problem?
rule all:
    input:
        expand("dup/{sample}.dup.bam", sample=SAMPLES),
        "dup/bam_list"

rule samtools_markdup:
    input:
        sortbam = "rg/{sample}.rg.bam"
    output:
        dupbam = "dup/{sample}.dup.bam"
    threads: 5
    shell:
        """
        samtools markdup -@ {threads} {input.sortbam} {output.dupbam}
        """

rule bam_list:
    output:
        outlist = "dup/bam_list"
    shell:
        """
        ls dup/*.bam > {output.outlist}
        """
Snakemake is following directions: you asked for dup/bam_list, and it can be produced without any inputs, so nothing forces it to wait for samtools_markdup. I think what you mean to have is:
rule all:
    input:
        "dup/bam_list"

rule samtools_markdup:
    input:
        sortbam = "rg/{sample}.rg.bam"
    output:
        dupbam = "dup/{sample}.dup.bam"
    threads: 5
    shell:
        """
        samtools markdup -@ {threads} {input.sortbam} {output.dupbam}
        """

rule bam_list:
    input:
        expand("dup/{sample}.dup.bam", sample=SAMPLES)
    output:
        outlist = "dup/bam_list"
    shell:
        """
        ls dup/*.bam > {output.outlist}
        """
Now bam_list will wait until all the samtools_markdup jobs are completed. As an aside, I expect the contents of dup/bam_list to be identical to expand("dup/{sample}.dup.bam", sample=SAMPLES), so if you use the file later in the workflow you can probably just use the expand output.
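To illustrate the aside, here is a minimal plain-Python sketch of the list that expand would return for this pattern. The sample names in SAMPLES are hypothetical, and expand_dup_bams is an illustrative helper, not Snakemake's own function.

```python
# Hypothetical sample names; in the real workflow SAMPLES is defined in the Snakefile.
SAMPLES = ["s1", "s2", "s3"]

def expand_dup_bams(samples):
    """Mimic expand("dup/{sample}.dup.bam", sample=samples) in plain Python."""
    return ["dup/{sample}.dup.bam".format(sample=s) for s in samples]

bam_paths = expand_dup_bams(SAMPLES)

# One path per line, the same shape as the output of `ls dup/*.bam`.
bam_list_text = "\n".join(bam_paths) + "\n"
print(bam_list_text, end="")
```

The two only match as long as dup/ contains no other .bam files, which is why using the expand output directly is the safer choice.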
Related
I currently have a snakemake workflow that requires the use of lambda wildcards, set up as follows:
Snakefile:
configfile: "config.yaml"
workdir: config["work"]

rule all:
    input:
        expand("logs/bwa/{ref}.log", ref=config["refs"])

rule bwa_index:
    input:
        lambda wildcards: 'data/'+config["refs"][wildcards.ref]+".fna.gz"
    output:
        "logs/bwa/{ref}.log"
    log:
        "logs/bwa/{ref}.log"
    shell:
        "bwa index {input} > {log} 2>&1"
Config file:
work: /datasets/work/AF_CROWN_RUST_WORK/2020-02-28_GWAS
refs:
    12NC29: GCA_002873275.1_ASM287327v1_genomic
    12SD80: GCA_002873125.1_ASM287312v1_genomic
This works, but I've had to use a hack to get the output of bwa_index to play with the input of all. My hack is to generate a log file as part of bwa_index, set the log to the output of bwa_index, and then set the input of all to these log files. As I said, it works, but I don't like it.
The problem is that the true outputs of bwa_index are of the format, for example, GCA_002873275.1_ASM287327v1_genomic.fna.sa. So, to specify these output files, I would need to use a lambda function for the output, something like:
rule bwa_index:
    input:
        lambda wildcards: 'data/'+config["refs"][wildcards.ref]+".fna.gz"
    output:
        lambda wildcards: 'data/'+config["refs"][wildcards.ref]+".fna.sa"
    log:
        "logs/bwa/{ref}.log"
    shell:
        "bwa index {input} > {log} 2>&1"
and then use a lambda function with expand for the input of rule all. However, snakemake will not accept functions as output, so I'm at a complete loss how to do this (other than my hack). Does anyone have suggestions of a sensible solution? TIA!
Input sections accept plain Python functions (just like the lambda function), so I suggest using one to build the targets of rule all.
configfile: "config.yaml"
workdir: config["work"]

def getTargetFiles():
    targets = list()
    for r in config["refs"]:
        targets.append("data/"+config["refs"][r]+".fna.sa")
    return targets

rule all:
    input:
        getTargetFiles()

rule bwa_index:
    input:
        "data/{ref}.fna.gz"
    output:
        "data/{ref}.fna.sa"
    log:
        "logs/bwa/{ref}.log"
    shell:
        "bwa index {input} > {log} 2>&1"
Be careful here: the wildcard {ref} is now the value and not the key of your dictionary, so your log files will end up being named "logs/bwa/GCA_002873275.1_ASM287327v1_genomic.log", etc.
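A minimal plain-Python sketch of that naming consequence, using the values from the config file in the question. The helper names get_target_files and log_path_for are illustrative only; Snakemake does the wildcard matching itself.

```python
# Config values copied from the question's config.yaml.
config = {
    "refs": {
        "12NC29": "GCA_002873275.1_ASM287327v1_genomic",
        "12SD80": "GCA_002873125.1_ASM287312v1_genomic",
    }
}

def get_target_files(cfg):
    """Build one .fna.sa target per reference, using the dictionary values."""
    return ["data/" + name + ".fna.sa" for name in cfg["refs"].values()]

def log_path_for(target):
    """Derive the log path for a target.

    The {ref} wildcard is whatever sits between "data/" and ".fna.sa",
    so it is the long assembly name, not the short key such as 12NC29.
    """
    ref = target[len("data/"):-len(".fna.sa")]
    return "logs/bwa/" + ref + ".log"

targets = get_target_files(config)
logs = [log_path_for(t) for t in targets]
for target, log in zip(targets, logs):
    print(target, "->", log)
```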
Generating output based on changed input files in Snakemake is easy:
rule all:
    input: [f'out_{i}.txt' for i in range(10)]

rule make_input:
    output: 'in_{i}.txt'
    shell: 'touch {output}'

rule make_output_parallel:
    input: 'in_{i}.txt'
    output: 'out_{i}.txt'
    shell: 'touch {output}'
In this case, make_output_parallel will only run for the instances whose in_{i}.txt has changed.
But suppose the 'out_{i}.txt' cannot be generated in parallel and I want to generate them in a single step, like,
rule make_output_one_step:
    input: [f'in_{i}.txt' for i in range(10)]
    output: [f'out_{i}.txt' for i in range(10)]
    shell: 'touch {output}'
If only one of the in_{i}.txt files has changed, I don't need to regenerate all 10 output files.
How can I adjust make_output_one_step.output to generate only the needed files?
If you want some parts of the pipeline not to run in parallel for whatever reason (RAM, internet usage, IO, API limit, etc.), you can make use of resources.
rule all:
    input: [f'out_{i}.txt' for i in range(10)]

rule make_input:
    output: 'in_{i}.txt'
    shell: 'touch {output}'

rule make_output:
    input: 'in_{i}.txt'
    output: 'out_{i}.txt'
    resources: max_parallel=1
    shell: 'touch {output}'
You can then call your pipeline with snakemake --resources max_parallel=1 --cores 10. In this case all the jobs of rule make_input can run in parallel, but only one instance of make_output will run at a time.
Following my previous question: Snakemake: HISAT2 alignment of many RNAseq reads against many genomes UPDATED.
I wanted to run the hisat2 alignment using touch in snakemake.
I have several genome files with suffix .1.ht2l to .8.ht2l
bob.1.ht2l
...
bob.8.ht2l
steve.1.ht2l
...
steve.8.ht2l
and several RNAseq samples
flower_kevin_1.fastq.gz
flower_kevin_2.fastq.gz
flower_daniel_1.fastq.gz
flower_daniel_2.fastq.gz
I need to align all rnaseq reads against each genome.
workdir: "/path/to/dir/"

(HISAT2_INDEX_PREFIX,) = glob_wildcards('/path/to/dir/{prefix}.fasta')
(SAMPLES,) = glob_wildcards("/path/to/dir/{sample}_1.fastq.gz")

rule all:
    input:
        expand("{prefix}.{sample}.bam", zip, prefix=HISAT2_INDEX_PREFIX, sample=SAMPLES)

rule hisat2_build:
    input:
        database="/path/to/dir/{prefix}.fasta"
    output:
        done = touch("{prefix}")
    threads: 2
    shell:
        ("/Tools/hisat2-2.1.0/hisat2-build -p {threads} {input.database} {wildcards.prefix}")

rule hisat2:
    input:
        hisat2_prefix_done = "{prefix}",
        fastq1="/path/to/dir/{sample}_1.fastq.gz",
        fastq2="/path/to/dir/{sample}_2.fastq.gz"
    output:
        bam = "{prefix}.{sample}.bam",
        txt = "{prefix}.{sample}.txt",
    log: "{prefix}.{sample}.snakemake_log.txt"
    threads: 50
    shell:
        "/Tools/hisat2-2.1.0/hisat2 -p {threads} -x {wildcards.prefix}"
        " -1 {input.fastq1} -2 {input.fastq2} --summary-file {output.txt} |"
        "/Tools/samtools-1.9/samtools sort -@ {threads} -o {output.bam}"
The output gives me bob and steve aligned against ONLY ONE RNA-seq sample (i.e. flower_kevin). I don't know how to solve this. Any suggestions would be helpful.
I solved the problem by removing zip from rule all. Criticism of the code's syntax is still welcome.
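The difference that removing zip makes can be sketched in plain Python. With zip, expand pairs the i-th prefix with the i-th sample; without it, expand builds every combination. The names come from the file listing in the question, but the order in which glob_wildcards would return them is an assumption.

```python
from itertools import product

# Names taken from the question's file listing; their order is assumed.
prefixes = ["bob", "steve"]
samples = ["flower_kevin", "flower_daniel"]
pattern = "{prefix}.{sample}.bam"

# With zip, the i-th prefix is paired only with the i-th sample.
zipped = [pattern.format(prefix=p, sample=s) for p, s in zip(prefixes, samples)]

# Without zip, every prefix is combined with every sample (Cartesian product).
combined = [pattern.format(prefix=p, sample=s) for p, s in product(prefixes, samples)]

print(zipped)
print(combined)
```

Only the second list requests an alignment of every sample against every genome, which is what the question asks for.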
I am trying to use a file that will be written during the run as an input to another rule, but it always gives me the error FileNotFoundError: [Errno 2] No such file or directory:
Is there a way to fix it, or another implementation with the same logic?
def vc_list(wildcards):
    my_list = []
    with open(wildcards.mydir+"/file_B.txt", 'r') as data_in:
        for line in data_in:
            my_list.append(line.strip())
    return(my_list)

# rule A will process file_A.txt and give me file_B.txt
rule A:
    input: "{mydir}/file_A.txt"
    output: "{mydir}/file_B.txt"
    shell: "seq 1 5 > {output}"  # assume that `seq 1 5` is the output from processing the file

rule B:
    input: "{value}"
    output: "{value}.vc"
    shell: "pythoncode.py {input} {output}"

# rule C will process file_B.txt to give me a list of values that will be used to expand the input, then will use rule B to produce it
rule C:
    input:
        processed_file = rules.A.output,  # "{mydir}/file_B.txt",
        my_list = lambda wildcards: expand("{mydir}/{value}.vc", mydir=wildcards.mydir, value=vc_list(wildcards))
    output: "{mydir}/done.txt"
    shell: "touch {output}"

# I always have the error that "{mydir}/file_B.txt" does not exist
The error now:
test_loop.snakefile:
FileNotFoundError: [Errno 2] No such file or directory: 'read_file/file_B.txt'
Wildcards:
mydir=read_file
Thanks,
The answer to my question is to use a checkpoint, since dynamic will be deprecated.
Here is how the logic should be changed:
rule:
    input: 'done.txt'

checkpoint A:
    output: 'B.txt'
    shell: 'seq 1 2 > {output}'

rule N:
    input: "genome.fa"
    output: '{num}.bam'
    shell: "touch {output}"

rule B:
    input: '{num}.bam'
    output: '{num}.vc'
    shell: "touch {output}"

def aggregate_input(wildcards):
    with open(checkpoints.A.get(**wildcards).output[0], 'r') as f:
        return [num.rstrip() + '.vc' for num in f]

rule C:
    input: aggregate_input
    output: touch('done.txt')
Credit goes to Eric Lim
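The file-reading part of aggregate_input can be tried outside Snakemake. This is a minimal sketch only: vc_targets_from_list is a hypothetical stand-in name, and the temporary B.txt imitates what `seq 1 2` would write. Inside the workflow, the path comes from checkpoints.A.get(**wildcards).output[0], which is what delays the read until checkpoint A has finished.

```python
import os
import tempfile

def vc_targets_from_list(list_path):
    """Read one value per line and return the matching '.vc' file names."""
    with open(list_path, "r") as handle:
        return [line.rstrip() + ".vc" for line in handle]

# Stand-in for the checkpoint output B.txt, written the way `seq 1 2` would write it.
with tempfile.TemporaryDirectory() as tmp:
    b_txt = os.path.join(tmp, "B.txt")
    with open(b_txt, "w") as handle:
        handle.write("1\n2\n")
    targets = vc_targets_from_list(b_txt)

print(targets)
```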
Your script fails even before the workflow starts, during the pipeline construction phase.
So there is nothing surprising about rules A and B: Snakemake reads their input and output sections and finds no problem with them. Then it reads rule C, whose input section calls the vc_list() function, which in turn tries to read the file 'read_file/file_B.txt' before the workflow has started. Of course it doesn't find the file, and it raises the error.
As for what to do, you need to clarify the task first. Most probably you are trying to use dynamic information in the input section of a rule. In this case you need to use dynamic files or checkpoints.
So far I have used snakemake to generate individual plots. This has worked great! Now, though, I want to create a rule that creates a combined plot across the topics, without explicitly putting the topic names in the rule. See the combined_plot rule below.
topics=["soccer", "football"]
params=[1, 2, 3, 4]

rule all:
    input:
        expand("plot_p={param}_{topic}.png", topic=topics, param=params),
        expand("combined_p={param}_plot.png", param=params),

rule plot:
    input:
        "data_p={param}_{topic}.csv"
    output:
        "plot_p={param}_{topic}.png"
    shell:
        "plot.py --input={input} --output={output}"

rule combined_plot:
    input:
        # all data_p={param}_{topic}.csv files
    output:
        "combined_p={param}_plot.png"
    shell:
        "plot2.py " + # one "--input=" and one "--output" for each csv file
Is there a simple way to do this with snakemake?
If I understand correctly, the code below should be more straightforward as it replaces the lambda and the glob with the expand function. It will execute the two commands:
plot2.py --input=data_p=1_soccer.csv --input=data_p=1_football.csv --output combined_p=1_plot.png
plot2.py --input=data_p=2_soccer.csv --input=data_p=2_football.csv --output combined_p=2_plot.png
topics=["soccer", "football"]
params=[1, 2]

rule all:
    input:
        expand("combined_p={param}_plot.png", param=params),

rule combined_plot:
    input:
        csv= expand("data_p={{param}}_{topic}.csv", topic= topics)
    output:
        "combined_p={param}_plot.png",
    run:
        inputs= ['--input=' + x for x in input.csv]
        shell("plot2.py {inputs} --output {output}")
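How the run block assembles the command can be checked in plain Python. This is a sketch only: combined_plot_command is a hypothetical helper, and the explicit " ".join stands in for the way shell() renders a list as space-separated items.

```python
topics = ["soccer", "football"]

def combined_plot_command(param):
    """Build the plot2.py command line for one value of the param wildcard."""
    # Equivalent of expand("data_p={{param}}_{topic}.csv", topic=topics)
    # once the {param} wildcard has been filled in.
    csv_files = ["data_p={}_{}.csv".format(param, topic) for topic in topics]
    inputs = ["--input=" + path for path in csv_files]
    output = "combined_p={}_plot.png".format(param)
    return "plot2.py {} --output {}".format(" ".join(inputs), output)

for p in [1, 2]:
    print(combined_plot_command(p))
```

The doubled braces in {{param}} are what keep param as a wildcard while expand fills in only topic.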
I got a working version by using a lambda function that takes the wildcards (named wcs here) as the input, and by using run instead of shell. In the run section I could first define a variable before executing the command with shell(...).
Instead of finding the files with glob, I could also have used the topics directly in the lambda function.
If anyone with more experience sees this, please tell me if this is the "right" way to do it.
from glob import glob

topics=["soccer", "football"]
params=[1, 2]

rule all:
    input:
        expand("plot_p={param}_{topic}.png", topic=topics, param=params),
        expand("combined_p={param}_plot.png", param=params),

rule plot:
    input:
        "data_p={param}_{topic}.csv"
    output:
        "plot_p={param}_{topic}.png"
    shell:
        "echo plot.py {input} {output}"

rule combined_plot:
    input:
        lambda wcs: glob("data_p={param}_*.csv".format(**wcs))
    output:
        "combined_p={param}_plot.png"
    run:
        inputs=" ".join(["--input " + inp for inp in input])
        shell("echo plot2.py {inputs}")