I'm trying to build a Snakemake pipeline, but I'm confused why filename wildcards work for input and output, but not for shell. For example, the following works fine:
samplelist = ["aa_S1", "bb_S2"]
rule all:
    input: expand("{sample}.out", sample=samplelist)
rule align:
    input: "{sample}.txt"
    output: "{sample}.out"
    shell: "touch {output}"
But let's say that the command I use for shell actually derives from a string I give it, so I can't name the output file directly in the shell command. Then how can I use my filename wildcard (here {sample}) in the shell command?
For example, the following doesn't work:
samplelist = ["aa_S1", "bb_S2"]
rule all:
    input: expand("{sample}.out", sample=samplelist)
rule align:
    input: "{sample}.txt"
    output: "{sample}.out"
    shell: "touch {sample}.out"
It gives me the following error:
RuleException in line 6 of Snakefile:
NameError: The name 'sample' is unknown in this context. Please make sure
that you defined that variable. Also note that braces not used for variable
access have to be escaped by repeating them, i.e. {{print $1}}
How can I work around this?
(Or if you really want to see some real life code, here is what I'm working with):
samplelist = ["aa_S1", "bb_S2"]
rule all:
    input: expand("{sample}_aligned.sam", sample=samplelist)
rule align:
    input: "{sample}_R1_001.trimmed.fastq.gz",
           "{sample}_R2_001.trimmed.fastq.gz"
    output: "{sample}_aligned.sam"
    threads: 4
    shell: "STAR --outFileNamePrefix {sample}_aligned --readFilesIn {input[0]} {input[1]} --readFilesCommand zcat --runMode alignReads --runThreadN {threads} --genomeDir /path/to/StarIndex"
But the error message is basically the same. For shell, I can use {input}, {output}, and {threads}, but not {sample}.
I did look at Snakemake: How do I use a function that takes in a wildcard and returns a value?, but that seems to be focused on generating input file names. My issue deals with interpolation of the filename wildcard into the shell command.
Wildcards are available in the shell command via {wildcards.XXXX}, where XXXX is the wildcard name.
rule align:
    input: "{sample}.txt"
    output: "{sample}.out"
    shell: "touch {wildcards.sample}.out"
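One way to picture why {sample} fails while {wildcards.sample} works: the shell string is filled in from a fixed set of named objects (input, output, threads, wildcards and so on), and wildcard values are only reachable as attributes of wildcards. The sketch below imitates this with Python's str.format and a stand-in namespace. It is a simplified model under that assumption, not Snakemake's actual implementation, and the helper name format_shell is hypothetical.

```python
from types import SimpleNamespace


def format_shell(cmd, **job):
    """Fill a command template from the job's named objects (simplified model)."""
    return cmd.format(**job)


# Stand-ins for what a job for sample "aa_S1" would expose to the shell string.
job = {
    "input": "aa_S1.txt",
    "output": "aa_S1.out",
    "wildcards": SimpleNamespace(sample="aa_S1"),
}

# Attribute access on the wildcards object works.
ok = format_shell("touch {wildcards.sample}.out", **job)

# A bare {sample} is not one of the named objects, so formatting fails.
try:
    format_shell("touch {sample}.out", **job)
    bare_failed = False
except KeyError:
    bare_failed = True
```

In this model the bare name fails for the same reason as in the error message above: nothing called sample exists in the formatting context.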
Related
I want to rename and move my fastq.gz files from these:
NAME-BOB_S1_L001_R1_001.fastq.gz
NAME-BOB_S1_L001_R2_001.fastq.gz
NAME-JOHN_S2_L001_R1_001.fastq.gz
NAME-JOHN_S2_L001_R2_001.fastq.gz
to these:
NAME_BOB/reads/NAME_BOB.R1.fastq.gz
NAME_BOB/reads/NAME_BOB.R2.fastq.gz
NAME_JOHN/reads/NAME_JOHN.R1.fastq.gz
NAME_JOHN/reads/NAME_JOHN.R2.fastq.gz
This is my code. The problem I have is the second variable S which I do not know how to specify in the code as I do not need it in my output filename.
workdir: "/path/to/workdir/"
DIR=["BOB","JOHN"]
S=["S1","S2"]
rule all:
    input:
        expand("NAME_{dir}/reads/NAME_{dir}.R1.fastq.gz", dir=DIR),
        expand("NAME_{dir}/reads/NAME_{dir}.R2.fastq.gz", dir=DIR)
rule rename:
    input:
        fastq1=("fastq/NAME-{dir}_{s}_L001_R1_001.fastq.gz", zip, dir=DIR, s=S),
        fastq2=("fastq/NAME-{dir}_{s}_L001_R2_001.fastq.gz", zip, dir=DIR, s=S)
    output:
        fastq1="NAME_{dir}/reads/NAME_{dir}.R1.fastq.gz",
        fastq2="NAME_{dir}/reads/NAME_{dir}.R2.fastq.gz"
    shell:
        """
        mv {input.fastq1} {output.fastq1}
        mv {input.fastq2} {output.fastq2}
        """
There are several problems in your code. First, the {dir} in your output and the {dir} in your input are two different things: the {dir} in the output is a wildcard, while the {dir} in the input is a keyword argument for the expand function. Second, you forgot to actually call that function in the input.
Third, moving both files in one shell section (whether on separate lines or joined as mv {input.fastq1} {output.fastq1}; mv {input.fastq2} {output.fastq2}) is not an idiomatic solution. It is much better to write a rule that produces a single file and let Snakemake do the rest of the work.
Finally, the S value depends entirely on the DIR value, so it is a function of {dir}, and that can be handled with a lambda in input:
workdir: "/path/to/workdir/"
DIR=["BOB","JOHN"]
dir2s = {"BOB": "S1", "JOHN": "S2"}
rule all:
    input:
        expand("NAME_{dir}/reads/NAME_{dir}.{r}.fastq.gz", dir=DIR, r=["R1", "R2"])
rule rename:
    input:
        lambda wildcards:
            "fastq/NAME-{{dir}}_{s}_L001_{{r}}_001.fastq.gz".format(s=dir2s[wildcards.dir])
    output:
        "NAME_{dir}/reads/NAME_{dir}.{r}.fastq.gz",
    shell:
        """
        mv {input} {output}
        """
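The lambda above can also be written as a named function that fills in every field explicitly, which makes the lookup easy to check outside Snakemake. This is only a sketch: rename_input is a hypothetical name, and SimpleNamespace merely stands in for the wildcards object that Snakemake would pass to an input function.

```python
from types import SimpleNamespace

# Mapping from the {dir} wildcard to its S-number (values taken from the question).
dir2s = {"BOB": "S1", "JOHN": "S2"}


def rename_input(wildcards):
    """Build the source FASTQ path for one (dir, r) combination."""
    return "fastq/NAME-{d}_{s}_L001_{r}_001.fastq.gz".format(
        d=wildcards.dir, s=dir2s[wildcards.dir], r=wildcards.r
    )


# Stand-in for the wildcards of the target NAME_JOHN/reads/NAME_JOHN.R2.fastq.gz.
wc = SimpleNamespace(dir="JOHN", r="R2")
path = rename_input(wc)
```

Assuming the function is defined above the rule, it could be referenced as input: rename_input in place of the lambda.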
I am trying to process MinION cDNA amplicons using Porechop with Minimap2 and I am getting this error.
MissingInputException in line 16 of /home/sean/Desktop/reo/antisera project/20200813/MinIONAmplicon.smk:
Missing input files for rule minimap2:
8413_19_strict/BC01.fastq.g
I understand what the error is telling me; I just don't understand why Snakemake is not trying to run the rule before it. Porechop checks for all possible barcodes and will output more than one fastq file if it finds more than one barcode in the directory. However, since I know which barcode I am looking for, I made a barcodes section in the config.yaml file so I can map them together.
I think the error is happening because my target output for Porechop doesn't match the input for minimap2, but I do not know how to correct this, as there can be multiple outputs from Porechop.
I thought I was building a path for the input file of the minimap2 rule, and that when Snakemake discovered the Porechop output was not there it would create it, but that is not what is happening.
Here is my pipeline so far,
configfile: "config.yaml"
rule all:
    input:
        expand("{sample}.bam", sample = config["samples"])
rule porechop_strict:
    input:
        lambda wildcards: config["samples"][wildcards.sample]
    output:
        directory("{sample}_strict/")
    shell:
        "porechop -i {input} -b {output} --barcode_threshold 85 --threads 8 --require_two_barcodes"
rule minimap2:
    input:
        lambda wildcards: "{sample}_strict/" + config["barcodes"][wildcards.sample]
    output:
        "{sample}.bam"
    shell:
        "minimap2 -ax map-ont -t8 ../concensus.fasta {input} | samtools sort -o {output}"
and the yaml file
samples: {
'8413_19': relabeled_reads/8413_19.raw.fastq.gz,
'8417_19': relabeled_reads/8417_19.raw.fastq.gz,
'8445_19': relabeled_reads/8445_19.raw.fastq.gz,
'8466_19_104': relabeled_reads/8466_19_104.raw.fastq.gz,
'8466_19_105': relabeled_reads/8466_19_105.raw.fastq.gz,
'8467_20': relabeled_reads/8467_20.raw.fastq.gz,
}
barcodes: {
'8413_19': BC01.fastq.gz,
'8417_19': BC02.fastq.gz,
'8445_19': BC03.fastq.gz,
'8466_19_104': BC04.fastq.gz,
'8466_19_105': BC05.fastq.gz,
'8467_20': BC06.fastq.gz,
}
First of all, you can always debug problems like this by specifying the --printshellcmds flag. It prints all the shell commands that Snakemake runs under the hood; you can then run them manually to locate the problem.
As for why your rule doesn't produce any output, my guess is that samtools requires an explicit filename, or - to read from stdin:
Samtools is designed to work on a stream. It regards an input file '-'
as the standard input (stdin) and an output file '-' as the standard
output (stdout). Several commands can thus be combined with Unix
pipes. Samtools always output warning and error messages to the
standard error output (stderr).
So try that:
    shell:
        "minimap2 -ax map-ont -t8 ../concensus.fasta {input} | samtools sort -o {output} -"
I am not 100% sure why this works; I imagine it has to do with the way Snakemake resolves targets. Here is the solution I found:
rule minimap2:
    input:
        "{sample}_strict"
    params:
        suffix=lambda wildcards: config["barcodes"][wildcards.sample]
    output:
        "{sample}.bam"
    shell:
        # Adjacent string literals are joined; note the trailing space before {input}.
        "minimap2 -ax map-ont -t8 ../consensus.fasta "
        "{input}/{params.suffix} | samtools sort -o {output}"
By using the params feature in Snakemake, I was able to match the correct barcode to the sample name. I am not sure why I could not do that in the input itself, but once I changed the input back to match the output of the previous rule, it worked.
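A likely explanation, offered as a simplified model rather than Snakemake's actual matching code: a requested input file is only built if it matches, as a whole path, an output pattern declared by some rule. The Porechop rule declares only the directory, so a file inside that directory matches nothing and is reported as missing. The sketch below assumes each wildcard appears once per pattern and matches one or more characters; pattern_to_regex and producers are hypothetical helper names.

```python
import re


def pattern_to_regex(pattern):
    """Turn an output pattern such as '{sample}.bam' into a whole-string regex."""
    parts = re.split(r"\{(\w+)\}", pattern)
    regex = ""
    for i, part in enumerate(parts):
        # Odd positions are the wildcard names captured by the split.
        regex += f"(?P<{part}>.+)" if i % 2 else re.escape(part)
    return re.compile(regex + "$")


def producers(target, outputs):
    """Return the declared output patterns that could produce `target`."""
    return [p for p in outputs if pattern_to_regex(p).match(target)]


# Outputs declared by the two rules in the question (trailing slash dropped).
declared = ["{sample}_strict", "{sample}.bam"]

file_inside_dir = producers("8413_19_strict/BC01.fastq.gz", declared)
directory_itself = producers("8413_19_strict", declared)
```

Under this model, asking for the directory finds the Porechop rule, while asking for a file inside it finds no rule at all, which is consistent with the params-based workaround.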
I have some ONT sequencing runs that have been basecalled on the MINIT. As such, when I demultiplex with guppy_barcoder, I get a directory of fastq files for each barcode. I want to use snakemake as a workflow manager to take these fastq files through our analyses, but this involves swapping the {barcode} for {sample} at some point.
BARCODE=['barcode01', 'barcode02', 'barcode03', 'barcode04']
SAMPLE=['sample01', 'sample02', 'sample03', 'sample04']
rule all:
    input:
        directory(expand("Sequencing_reads/demultiplexed/{barcode}", barcode=BARCODE)), #guppy_barcoder
        expand("Sequencing_reads/gathered/{sample}_ONT.fastq", sample=SAMPLE), #getting all of the fastq files with the same barcode assigned to the correct sample
rule demultiplex:
    input:
        glob.glob("Sequencing_reads/fastq_pass/*fastq")
    output:
        directory(expand("Sequencing_reads/demultiplexed/{barcode}", barcode=BARCODE))
    shell:
        "guppy_barcoder --input_path Sequencing_reads/fastq_pass --save_path Sequencing_reads/demultiplexed -r "
rule gather:
    input:
        rules.demultiplex.output
    output:
        "Sequencing_reads/gathered/{sample}_ONT.fastq"
    shell:
        "cat Sequencing_reads/demultiplexed/{wildcards.barcode}/*fastq > {output.fastq} "
This does give me an error:
RuleException in line 32 of /home/eriny/sandbox/ONT_unicycler_pipeline/ONT_pipeline.smk:
'Wildcards' object has no attribute 'barcode'
But I actually think I'm missing something conceptually. I would like rule gather to be something like:
cat Sequencing_reads/demultiplexed/barcode01/*fastq > Sequencing_reads/gathered/sample01_ONT.fastq
I have tried setting up some dictionaries so that sample and barcode are given the same key, but my syntax must be broken.
I'm hoping to find a 1:1 way to map one variable name onto another.
I think one possibility is a sample-to-barcode dictionary, combined with a lambda as an input function that looks up the barcode assigned to a sample. For example:
import glob  # needed for the glob.glob calls below
BARCODE=['barcode01', 'barcode02', 'barcode03', 'barcode04']
SAMPLE=['sample01', 'sample02', 'sample03', 'sample04']
sam2bar= dict(zip(SAMPLE, BARCODE))
rule all:
    input:
        expand("Sequencing_reads/gathered/{sample}_ONT.fastq", sample=SAMPLE), #getting all of the fastq files with the same barcode assigned to the correct sample
rule demultiplex:
    input:
        glob.glob("Sequencing_reads/fastq_pass/*fastq"),
    output:
        done= touch('demux.done'), # This signals that guppy has completed
    shell:
        "guppy_barcoder --input_path Sequencing_reads/fastq_pass --save_path Sequencing_reads/demultiplexed -r "
rule gather:
    input:
        done= 'demux.done',
        fastq= lambda wc: glob.glob("Sequencing_reads/demultiplexed/%s/*fastq" % sam2bar[wc.sample])
    output:
        fastq= "Sequencing_reads/gathered/{sample}_ONT.fastq"
    shell:
        "cat {input.fastq} > {output.fastq} "
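The 1:1 mapping itself is plain Python and can be tried outside Snakemake. The sketch below pairs the two lists by position and builds the command the question asks for; gather_command is a hypothetical helper used only for illustration, and the length check is an added safeguard that is not part of the answer's code.

```python
BARCODE = ["barcode01", "barcode02", "barcode03", "barcode04"]
SAMPLE = ["sample01", "sample02", "sample03", "sample04"]

# zip() pairs items by position and silently truncates to the shorter list,
# so confirm both lists have the same length before building the mapping.
assert len(SAMPLE) == len(BARCODE)
sam2bar = dict(zip(SAMPLE, BARCODE))


def gather_command(sample):
    """Build the cat command for one sample (paths as in the question)."""
    barcode = sam2bar[sample]
    return (
        f"cat Sequencing_reads/demultiplexed/{barcode}/*fastq "
        f"> Sequencing_reads/gathered/{sample}_ONT.fastq"
    )


cmd = gather_command("sample01")
```

For sample01 this yields the same command that the question gives as the desired result.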
I want to download FASTQ files from the SRA database by SRR ID using Snakemake. I read the SRR IDs from a file with Python code.
I want to pass the IDs one by one as input. My code is below.
The command I want to run is:
fastq-dump SRR390728
#SAMPLES = ['SRR390728','SRR400816']
SAMPLES = [line.strip() for line in open("./srrList", 'r')]
rule all:
    input:
        expand("fastq/{sample}.fastq.log",sample=SAMPLES)
rule download_fastq:
    input:
        "{sample}"
    output:
        "fastq/{sample}.fastq.log"
    shell:
        "fastq-dump {input} > {output}"
Skip input and just use the wildcard in the shell command. An input needs to be a file path that either already exists or is created as part of the pipeline; neither is true in your case.
rule download_fastq:
    output:
        "fastq/{sample}.fastq.log"
    shell:
        "fastq-dump {wildcards.sample} > {output}"
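Since the sample names now drive everything, it may help to read the ID list a little more defensively than the one-line comprehension in the question, which would turn a blank line into an empty sample name. The following is a sketch under that assumption: read_ids is a hypothetical helper, and the temporary file only stands in for ./srrList.

```python
import tempfile
from pathlib import Path


def read_ids(path):
    """Return non-empty, stripped lines from a text file of accession IDs."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


# Demonstration with a temporary file standing in for ./srrList.
with tempfile.TemporaryDirectory() as tmp:
    srr_list = Path(tmp) / "srrList"
    srr_list.write_text("SRR390728\nSRR400816\n\n")
    SAMPLES = read_ids(srr_list)

# The same targets that expand("fastq/{sample}.fastq.log", sample=SAMPLES) requests.
targets = [f"fastq/{sample}.fastq.log" for sample in SAMPLES]
```

Using a with block also closes the file handle, which the bare open(...) call in the question leaves open.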