Is it possible to access a snakemake wildcards outside of a rule ?
The wildcard is a variable that has a meaning inside of a rule only. This allows the rule to be generic, and the actual value of the wildcard is being assigned when the rule is executed.
In your code you are trying to use wildcard outside of any rule. You don't need any wildcard there:
if samples_dict.loc[wildcards.sample]['Sequencing_type'] == "SE":
rule test_SE :
input : '{sample}.txt'
output : '{sample}_done.txt'
conda :
'test.yaml'
shell :
"somecommand -i {input} -o {output} ;"
elif samples_dict.loc[wildcards.sample]['Sequencing_type'] == "PE":
rule test_PE :
input :
input1="input1",
input2="input2"
output :
output1="output1",
output2="output2"
conda :
'test.yaml'
shell :
"somecommand -i1 {input.input1} -i2 {input.input2} -o1 {output.output1} -o2 {output.output2} ;"
else:
rule test_fail :
output :
'{sample}.txt'
conda :
'test.yaml'
shell :
'echo "Use PE or SE in Sequencing_type column" > {output};'
This however opens another question: why do you even need to have these conditional rules? These rules produce different outputs, and should have no conflicts. Try to remove all if/elif/else, as this is not an idiomatic way to use Snakemake.
Related
I want to rename and move my fastq.gz files from these:
NAME-BOB_S1_L001_R1_001.fastq.gz
NAME-BOB_S1_L001_R2_001.fastq.gz
NAME-JOHN_S2_L001_R1_001.fastq.gz
NAME-JOHN_S2_L001_R2_001.fastq.gz
to these:
NAME_BOB/reads/NAME_BOB.R1.fastq.gz
NAME_BOB/reads/NAME_BOB.R2.fastq.gz
NAME_JOHN/reads/NAME_JOHN.R1.fastq.gz
NAME_JOHN/reads/NAME_JOHN.R2.fastq.gz
This is my code. The problem I have is the second variable S which I do not know how to specify in the code as I do not need it in my output filename.
workdir: "/path/to/workdir/"
DIR=["BOB","JOHN"]
S=["S1","S2"]
rule all:
input:
expand("NAME_{dir}/reads/NAME_{dir}.R1.fastq.gz", dir=DIR),
expand("NAME_{dir}/reads/NAME_{dir}.R2.fastq.gz", dir=DIR)
rule rename:
input:
fastq1=("fastq/NAME-{dir}_{s}_L001_R1_001.fastq.gz", zip, dir=DIR, s=S),
fastq2=("fastq/NAME-{dir}_{s}_L001_R2_001.fastq.gz", zip, dir=DIR, s=S)
output:
fastq1="NAME_{dir}/reads/NAME_{dir}.R1.fastq.gz",
fastq2="NAME_{dir}/reads/NAME_{dir}.R2.fastq.gz"
shell:
"""
mv {input.fastq1} {output.fastq1}
mv {input.fastq2} {output.fastq2}
"""
There are several problems in your code. First of all, the {dir} in your output and {dir} in your input are two different variables. Actually the {dir} in the output is a wildcard, while the {dir} in the input is a parameter for the expand function (moreover, you even forgot to call this function, and that is the second problem).
The third problem is that the shell section shall contain only a single command. You may try mv {input.fastq1} {output.fastq1}; mv {input.fastq2} {output.fastq2}, but this is not an idiomatic solution. Much better would be to create a rule that produces a single file, letting Snakemake to do the rest of the work.
Finally the S value fully depend on the DIR value, so it becomes a function of {dir}, and that can be solved with a lambda in input:
workdir: "/path/to/workdir/"
DIR=["BOB","JOHN"]
dir2s = {"BOB": "S1", "JOHN": "S2"}
rule all:
input:
expand("NAME_{dir}/reads/NAME_{dir}.{r}.fastq.gz", dir=DIR, r=["R1", "R2"])
rule rename:
input:
lambda wildcards:
"fastq/NAME-{{dir}}_{s}_L001_{{r}}_001.fastq.gz".format(s=dir2s[wildcards.dir])
output:
"NAME_{dir}/reads/NAME_{dir}.{r}.fastq.gz",
shell:
"""
mv {input} {output}
"""
I'm kinda new at snakemake and I'm trying to understand how it works.
I tried to pull a simple snakefile
from snakemake.utils import min_version
min_version("5.3.0")
max_reads: 250000
sra_id: ["SRR1187735"]
rule all:
input:
"DATA/{sra_id}.fastq.gz"
rule prefetch:
output:
"DATA/{sra_id}.fastq.gz"
params:
max_reads = "max_reads"
version: "1.0"
shell:
"conda activate sra-tools-2.10.1 "
"&& "
"fastq-dump {wildcards.sra_id} -X {params.max_reads} --readids \
--dumpbase --skip-technical --gzip -Z > {output} "
"&& "
"conda deactivate "
But I'm getting this error :
WildcardError in line 5 of /save_home/skastalli/test_rule/Snakefile:
Wildcards in input files cannot be determined from output files:
'sra_id'
Can someone help me please ?
Most of your rules should be generalizable and include wildcards, so prefetch is correct. At some point you have to ask for a specific file for snakemake to try and generate. This can occur in several spots of the workflow, but at least the rule all needs to ask for specific files. Based on your preamble, I think the input of your rule all should be:
rule all:
input:
expand("DATA/{sra_id}.fastq.gz", sra_id=sra_id)
Also, because snakemake runs in debug mode for shell, you don't need the && in the shell directive and can just use newlines instead. Personal preference though.
I am trying to process MinION cDNA amplicons using Porechop with Minimap2 and I am getting this error.
MissingInputException in line 16 of /home/sean/Desktop/reo/antisera project/20200813/MinIONAmplicon.smk:
Missing input files for rule minimap2:
8413_19_strict/BC01.fastq.g
I understand what the error telling me, I just understand why its being its not trying to make the rule before it. Porechop is being used to check for all the possible barcodes and will output more than one fastq file if it finds more than barcode in the directory. However since I know what barcode I am looking for I made a barcodes section in the config.yaml file so I can map them together.
I think the error is happening because my target output for Porechop doesn't match the input for minimap2 but I do not know how to correct this problem as there can be multiple outputs from porechop.
I thought I was building a path for the input file for the minimap2 rule and when snakemake discovered that the porechop output was not there it would make it, but that is not what is happening.
Here is my pipeline so far,
configfile: "config.yaml"
rule all:
input:
expand("{sample}.bam", sample = config["samples"])
rule porechop_strict:
input:
lambda wildcards: config["samples"][wildcards.sample]
output:
directory("{sample}_strict/")
shell:
"porechop -i {input} -b {output} --barcode_threshold 85 --threads 8 --require_two_barcodes"
rule minimap2:
input:
lambda wildcards: "{sample}_strict/" + config["barcodes"][wildcards.sample]
output:
"{sample}.bam"
shell:
"minimap2 -ax map-ont -t8 ../concensus.fasta {input} | samtools sort -o {output}"
and the yaml file
samples: {
'8413_19': relabeled_reads/8413_19.raw.fastq.gz,
'8417_19': relabeled_reads/8417_19.raw.fastq.gz,
'8445_19': relabeled_reads/8445_19.raw.fastq.gz,
'8466_19_104': relabeled_reads/8466_19_104.raw.fastq.gz,
'8466_19_105': relabeled_reads/8466_19_105.raw.fastq.gz,
'8467_20': relabeled_reads/8467_20.raw.fastq.gz,
}
barcodes: {
'8413_19': BC01.fastq.gz,
'8417_19': BC02.fastq.gz,
'8445_19': BC03.fastq.gz,
'8466_19_104': BC04.fastq.gz,
'8466_19_105': BC05.fastq.gz,
'8467_20': BC06.fastq.gz,
}
First of all, you can always debug the problems like that specifying the flag --printshellcmds. That would print all shell commands that Snakemake runs under the hood; you may try to run them manually and locate the problem.
As for why your rule doesn't produce any output, my guess is that samtools requires explicit filenames or - to use stdin:
Samtools is designed to work on a stream. It regards an input file '-'
as the standard input (stdin) and an output file '-' as the standard
output (stdout). Several commands can thus be combined with Unix
pipes. Samtools always output warning and error messages to the
standard error output (stderr).
So try that:
shell:
"minimap2 -ax map-ont -t8 ../concensus.fasta {input} | samtools sort -o {output} -"
So I am not 100% sure why this way works, I imagine it has to do with the way snakemake looks at the targets however here is the solution I found for it.
rule minimap2:
input:
"{sample}_strict"
params:
suffix=lambda wildcards: config["barcodes"][wildcards.sample]
output:
"{sample}.bam"
shell:
"minimap2 -ax map-ont -t8 ../consensus.fasta\
{input}/{params.suffix} | samtools sort -o {output}"
by using the params feature in snakemake I was able to match up the correct barcode to the sample name. I am not sure why I could just do that as the input itself, but when I returned the input to the match the output of the previous rule it works.
Adding .g.vcf instead of .vcf after the variable in expand rule is somehow adding the .g to a wildcard in another module
I have tried the following in the all rule :
{stuff}.g.vcf
{stuff}"+"g.vcf"
{stuff}_var"+".g.vcf"
{stuff}.t.vcf
all fail but {stuff}.gvcf or {stuff}.vcf work
Error:
InputFunctionException in line 21 of snake_modules/mark_duplicates.snakefile:
KeyError: 'Mother.g'
Wildcards:
lane=Mother.g
Code:
LANES = config["list2"].split()
rule all:
input:
expand(projectDir+"results/alignments/variants/{stuff}.g.vcf", stuff=LANES)
rule mark_duplicates:
""" this will mark duplicates for bam files from the same sample and library """
input:
get_lanes
output:
projectDir+"results/alignments/markdups/{lane}.markdup.bam"
log:
projectDir+"logs/"+stamp+"_{lane}_markdup.log"
shell:
" input=$(echo '{input}' |sed -e s'/ / I=/g') && java -jar /home/apps/pipelines/picard-tools/CURRENT MarkDuplicates I=$input O={projectDir}results/alignments/markdups/{wildcards.lane}.markdup.bam M={projectDir}results/alignments/markdups/{wildcards.lane}.markdup_metrics.txt &> {log}"
I want my final output to have the {stuff}.g.vcf notation. Please note this output is created in another snake module but the error appears in the mark duplicates which is before the other module.
I have tried multiple changes but it is the .g.vcf in the all rule that causes the issue.
My guess is that {lane} is interpreted as a regular expression and it's capturing more than it should. Try adding before rule all:
wildcard_constraints:
stuff= '|'.join([re.escape(x) for x in LANES]),
lane= '|'.join([re.escape(x) for x in LANES])
(See also this thread https://groups.google.com/forum/#!topic/snakemake/wVlJW9X-9EU)
I use Snakemake to execute some rules, and I've a problem with one:
rule filt_SJ_out:
input:
"pass1/{sample}SJ.out.tab"
output:
"pass1/SJ.db"
shell:'''
gawk '$6==1 || ($6==0 && $7>2)' {input} >> {output};
'''
Here, I just want to merge some files into a general file, but by searching on google I've see that wildcards use in inputs must be also use in output.
But I can't find a solution to work around this problem ..
Thank's by advance
If you know the values of sample prior to running the script, you could do the following:
SAMPLES = [... define the possible values of `sample` ...]
rule filt_SJ_out:
input:
expand("pass1/{sample}SJ.out.tab", sample=SAMPLES)
output:
"pass1/SJ.db"
shell:
"""
gawk '$6==1 || ($6==0 && $7>2)' {input} >> {output};
"""
In the input step, this will generate a list of files, each of the form pass1/<XYZ>SJ.out.tab.