Snakemake wildcard unexpectedly changed

I'm new to Snakemake and I've come across a bug that has caused me a lot of trouble.
I have wildcards like this:
rank = ['Kingdom', 'Phylum', 'Class', 'Order', 'Family', 'Genus', 'Species']
ordi = ['DCA', 'CCA', 'RDA', 'NMDS', 'MDS', 'NMDS', 'PCoA']
The previous version didn't have this wildcard problem and ran successfully. Its rule all looked like this:
rule all:
    input:
        expand('common_taxonomic/abundance_table_{Rank}.biom', Rank = rank),
        directory('Gene/gene_Venn'),
        directory('Gene/gene_samples_heatmap'),
        directory('taxa_ternaryplot'),
        directory(expand('beta/PCA/{Rank}', Rank = rank))
    benchmark:
        "Check_utility.tsv"
But when I swap the wildcard positions, like this:
        directory(expand('beta/{Rank}/PCA/', Rank = rank)),
        directory(expand('beta/{Rank}/{Ordi}', Rank = rank, Ordi = ordi))
I get this error:
Building DAG of jobs...
MissingInputException in line 59 of /sysdata/Meta/pipeline/Snakefile:
Missing input files for rule biom_convert:
common_taxonomic/Table_taxa_NR_Kingdom/CCA.txt
As you can see, the {rank} wildcard gets extended with /PCA or /{Ordi}. I am quite confused about this; am I writing the code incorrectly?
My biom_convert rule is:
rule biom_convert:
    input: 'common_taxonomic/Table_taxa_NR_{rank}.txt'
    output: 'common_taxonomic/abundance_table_{rank}.biom'
    shell: 'biom convert -i {input} -o {output} --table-type="OTU table" --to-json'

I am not sure we have all the information required to solve this, but we can still try :).
Wildcard names are completely arbitrary, and you can name them however you like. If you name a wildcard rank, it is in no way related to the wildcards of other rules. In fact, the value taken by the wildcard rank in one rule can end up as the value of ordi in another!
So what we have to do is make sure that the rule's output correctly distinguishes between rank and ordi:
rule biom_convert:
    input:
        'common_taxonomic/Table_taxa_NR_{rank}.txt'
    output:
        'common_taxonomic/abundance_table_{rank}/{ordi}.biom'
    shell:
        'biom convert -i {input} -o {output} --table-type="OTU table" --to-json'
If you want, you can restrict the values of rank and ordi through global wildcard constraints:
wildcard_constraints:
    rank='|'.join(rank),
    ordi='|'.join(ordi)
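For completeness, a minimal sketch of a matching rule all (my illustration, reusing the rank and ordi lists from the question; adjust the target list to the files your pipeline actually needs, and note that ordi contains 'NMDS' twice, which would produce duplicate targets):
rule all:
    input:
        # one .biom file per (rank, ordination) pair, produced by the
        # rewritten biom_convert rule above
        expand('common_taxonomic/abundance_table_{rank}/{ordi}.biom',
               rank=rank, ordi=ordi)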
Now I am not sure if this will solve the complete problem you have, but it should definitely push you in the right direction.

Related

Snakemake pipeline not attempting to produce output?

I have a relatively simple Snakemake pipeline, but when I run it I get "Missing input files" for rule all:
refseq = 'refseq.fasta'
reads = ['_R1_001', '_R2_001']

def getsamples():
    import glob
    test = (glob.glob("*.fastq"))
    print(test)
    samples = []
    for i in test:
        samples.append(i.rsplit('_', 2)[0])
    return(samples)

def getbarcodes():
    with open('unique.barcodes.txt') as file:
        lines = [line.rstrip() for line in file]
    return(lines)

rule all:
    input:
        expand("grepped/{barcodes}{sample}_R1_001.plate.fastq", barcodes=getbarcodes(), sample=getsamples()),
        expand("grepped/{barcodes}{sample}_R2_001.plate.fastq", barcodes=getbarcodes(), sample=getsamples())
    wildcard_constraints:
        barcodes="[a-z-A-Z]+$"

rule fastq_grep:
    input:
        R1 = "{sample}_R1_001.fastq",
        R2 = "{sample}_R2_001.fastq"
    output:
        out1 = "grepped/{barcodes}{sample}_R1_001.plate.fastq",
        out2 = "grepped/{barcodes}{sample}_R2_001.plate.fastq"
    wildcard_constraints:
        barcodes="[a-z-A-Z]+$"
    shell:
        "fastq-grep -i '{wildcards.barcodes}' {input.R1} > {output.out1} && fastq-grep -i '{wildcards.barcodes}' {input.R2} > {output.out2}"
The output files listed in the terminal seem correct, so Snakemake appears to see what I want to produce, but the shell is not making anything at all.
I want to produce files in which the reads have been grepped for each of the barcodes listed in my barcode file, but I get "Missing input files for rule all:".
There are two issues:
1. You have an impossible wildcard_constraints defined for {barcodes}.
2. Your two wildcards {barcodes} and {sample} are competing with each other.
Remove the wildcard_constraints from your two rules and add the following lines to the top of your Snakefile:
wildcard_constraints:
    barcodes="[A-Z]+",
    sample="Well.*",
The constraint for {barcodes} now only matches capital letters. Before, it also required an end-of-line match (the trailing $), which was impossible for this wildcard because additional text follows it in the file path.
The constraint for {sample} ensures that the part of the filename starting with "Well..." is interpreted as the start of the {sample} wildcard. Otherwise you'd get something unwanted like barcodes=ACGGTW instead of barcodes=ACGGT.
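To see why both constraints are needed, here is a small regex experiment you can run in plain Python (illustrative only: the barcode ACGGT and sample Well01 are made-up values, and Snakemake compiles wildcards into named groups in essentially this way):
import re

# With only the barcode constrained to [A-Z]+, greedy matching steals
# the capital 'W' of 'Well01' for the barcode:
only_barcode = re.compile(r"(?P<barcodes>[A-Z]+)(?P<sample>.+)_R1_001\.plate\.fastq")
m = only_barcode.fullmatch("ACGGTWell01_R1_001.plate.fastq")
print(m.group("barcodes"), m.group("sample"))  # ACGGTW ell01

# Constraining the sample to start with 'Well' fixes the split:
both = re.compile(r"(?P<barcodes>[A-Z]+)(?P<sample>Well.*)_R1_001\.plate\.fastq")
m = both.fullmatch("ACGGTWell01_R1_001.plate.fastq")
print(m.group("barcodes"), m.group("sample"))  # ACGGT Well01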
A note of advice:
I usually find it easier to separate wildcards into directory structures rather than having multiple wildcards in the same filename. In your case that would mean a structure like
grepped/{barcodes}/{sample}_R1_001.plate.fastq, as in the sketch below.
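A minimal sketch of that layout, assuming the same rule and wildcard names (untested, just to show the idea):
rule fastq_grep:
    input:
        R1="{sample}_R1_001.fastq",
        R2="{sample}_R2_001.fastq",
    output:
        out1="grepped/{barcodes}/{sample}_R1_001.plate.fastq",
        out2="grepped/{barcodes}/{sample}_R2_001.plate.fastq",
    shell:
        "fastq-grep -i '{wildcards.barcodes}' {input.R1} > {output.out1} && fastq-grep -i '{wildcards.barcodes}' {input.R2} > {output.out2}"
The literal '/' between the two wildcards gives the matcher an unambiguous boundary for file names like these, so the wildcards no longer compete for characters.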
Full suggested Snakefile (formatted using snakefmt)
wildcard_constraints:
    barcodes="[A-Z]+",
    sample="Well.*",

refseq = "refseq.fasta"
reads = ["_R1_001", "_R2_001"]

def getsamples():
    import glob
    test = glob.glob("*.fastq")
    print(test)
    samples = []
    for i in test:
        samples.append(i.rsplit("_", 2)[0])
    return samples

def getbarcodes():
    with open("unique.barcodes.txt") as file:
        lines = [line.rstrip() for line in file]
    return lines

rule all:
    input:
        expand(
            "grepped/{barcodes}{sample}_R1_001.plate.fastq",
            barcodes=getbarcodes(),
            sample=getsamples(),
        ),
        expand(
            "grepped/{barcodes}{sample}_R2_001.plate.fastq",
            barcodes=getbarcodes(),
            sample=getsamples(),
        ),

rule fastq_grep:
    input:
        R1="{sample}_R1_001.fastq",
        R2="{sample}_R2_001.fastq",
    output:
        out1="grepped/{barcodes}{sample}_R1_001.plate.fastq",
        out2="grepped/{barcodes}{sample}_R2_001.plate.fastq",
    shell:
        "fastq-grep -i '{wildcards.barcodes}' {input.R1} > {output.out1} && fastq-grep -i '{wildcards.barcodes}' {input.R2} > {output.out2}"
In addition to @euronion's answer (+1), I prefer to constrain wildcards to match only and exactly the list of values you expect. This means disabling regex matching altogether. In your case, I would do something like:
import re

wildcard_constraints:
    barcodes='|'.join([re.escape(x) for x in getbarcodes()]),
    sample='|'.join([re.escape(x) for x in getsamples()]),
Now {barcodes} is allowed to match only the values in getbarcodes(), whatever they are, and the same goes for {sample}. In my opinion this is better than trying to anticipate every regex pattern a wildcard might take.
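As a purely illustrative example (with made-up barcode values), the joined pattern looks like this:
import re

barcodes = ["ACGGT", "TTAGC"]  # hypothetical values
print('|'.join(re.escape(x) for x in barcodes))  # ACGGT|TTAGC

# re.escape matters when a value contains regex metacharacters:
print('|'.join(re.escape(x) for x in ["A.1", "B+2"]))  # A\.1|B\+2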

How to stop Snakemake from adding non-file endings to wildcards when using the expand function? (.g.vcf fails, .vcf works)

Using .g.vcf instead of .vcf after the variable in an expand rule is somehow adding the .g to a wildcard in another module.
I have tried the following in the all rule:
{stuff}.g.vcf
{stuff}"+"g.vcf"
{stuff}_var"+".g.vcf"
{stuff}.t.vcf
all fail but {stuff}.gvcf or {stuff}.vcf work
Error:
InputFunctionException in line 21 of snake_modules/mark_duplicates.snakefile:
KeyError: 'Mother.g'
Wildcards:
lane=Mother.g
Code:
LANES = config["list2"].split()
rule all:
input:
expand(projectDir+"results/alignments/variants/{stuff}.g.vcf", stuff=LANES)
rule mark_duplicates:
""" this will mark duplicates for bam files from the same sample and library """
input:
get_lanes
output:
projectDir+"results/alignments/markdups/{lane}.markdup.bam"
log:
projectDir+"logs/"+stamp+"_{lane}_markdup.log"
shell:
" input=$(echo '{input}' |sed -e s'/ / I=/g') && java -jar /home/apps/pipelines/picard-tools/CURRENT MarkDuplicates I=$input O={projectDir}results/alignments/markdups/{wildcards.lane}.markdup.bam M={projectDir}results/alignments/markdups/{wildcards.lane}.markdup_metrics.txt &> {log}"
I want my final output to have the {stuff}.g.vcf notation. Please note that this output is created in another snake module, but the error appears in mark_duplicates, which runs before that module.
I have tried multiple changes, but it is the .g.vcf in the all rule that causes the issue.
My guess is that {lane} is interpreted as a regular expression and is capturing more than it should. Try adding this before rule all:
import re

wildcard_constraints:
    stuff='|'.join([re.escape(x) for x in LANES]),
    lane='|'.join([re.escape(x) for x in LANES])
(See also this thread https://groups.google.com/forum/#!topic/snakemake/wVlJW9X-9EU)
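To see what is going wrong, here is a small illustrative experiment (Snakemake's default wildcard regex is .+, and 'Mother' stands in for one of your lane names; the exact pattern that over-matches lives in your other module, so its shape here is an assumption):
import re

# Somewhere a pattern ending in '.vcf' is matched against the target
# 'Mother.g.vcf'; with the default '.+' regex the wildcard happily
# absorbs the extra '.g':
pattern = re.compile(r"(?P<lane>.+)\.vcf")
print(pattern.fullmatch("Mother.g.vcf").group("lane"))  # Mother.g

# Constraining the wildcard to the exact LANES values makes this match
# fail, so the input function is never called with the key 'Mother.g'.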

Snakemake: Target rules may not contain wildcards

I am trying to supply a bunch of files as input to Snakemake, and wildcards are not working for some reason:
rule cluster:
    input:
        script = '/Users/nikitavlasenko/python_scripts/python/dbscan.py',
        path = '/Users/nikitavlasenko/python_scripts/data_files/umap/{sample}.csv'
    output:
        path = '/Users/nikitavlasenko/python_scripts/output/{sample}'
    shell:
        "python {input.script} -data {input.path} -eps '0.3' -min_samples '10' -path {output.path}"
I want Snakemake to read the files in from the umap directory, get their names, and then pass them to the Python script so that each result gets a unique name. How can I achieve this without the error I am getting right now:
Building DAG of jobs...
WorkflowError:
Target rules may not contain wildcards. Please specify concrete files or
a rule without wildcards.
Update
I found that a rule all is most probably required at the top:
https://bioinformatics.stackexchange.com/questions/2761/how-to-resolve-in-snakemake-error-target-rules-may-not-contain-wildcards
So I added it like this:
samples='SCID_WT_CCA'

rule all:
    input:
        expand('/Users/nikitavlasenko/python_scripts/data_files/umap/{sample}_umap.csv', sample=samples.split(' '))
However, I am getting the following weird message:
Building DAG of jobs...
Nothing to be done.
So, it is not running.
Update
I thought that it could be related to the fact that I had just one sample name at the top, so I changed it to:
samples='SCID_WT_CCA WT SCID plus_1 minus_1'
And added the respective files, of course, but that did not fix the error.
Actually, if I run snakemake cluster I get the same error as at the very top, but if I just run snakemake, I get the "Nothing to be done" message. I tried substituting relative paths for the absolute ones, but it did not help:
samples='SCID_WT_CCA WT SCID plus_1 minus_1'

rule all:
    input:
        expand('data_files/umap/{sample}_umap.csv', sample=samples.split(' '))

rule cluster:
    input:
        script = 'python/dbscan.py',
        path = 'data_files/umap/{sample}_umap.csv'
    output:
        path = 'output/{sample}'
    shell:
        "python {input.script} -data {input.path} -eps '0.3' -min_samples '10' -path {output.path}"
The "all" rule should have as input the list of files you want the other rule(s) to generate as output. Here, you seem to be using the list of your starting files instead.
Try the following:
samples = 'SCID_WT_CCA WT SCID plus_1 minus_1'

rule all:
    input:
        expand('output/{sample}', sample=samples.split(' '))

rule cluster:
    input:
        script = 'python/dbscan.py',
        path = 'data_files/umap/{sample}_umap.csv'
    output:
        path = 'output/{sample}'
    shell:
        "python {input.script} -data {input.path} -eps '0.3' -min_samples '10' -path {output.path}"
Following bli's answer, I was able to solve the issue, but one additional modification was needed. I had passed output/{sample} to the Python script, which generated two files from that path. It seems this should not be done, because I got another error where Snakemake said it could not see output/file_name. It can only see the outputs if I set all the paths explicitly, without the Python script modifying them on the fly, so I did that. Here is the final Snakefile that worked well:
samples='SCID_WT_CCA WT SCID plus_1 minus_1'

rule all:
    input:
        expand('output/{sample}_umap.png', sample=samples.split(' ')),
        expand('output/{sample}_clusters.csv', sample=samples.split(' '))

rule cluster:
    input:
        script = 'python/dbscan.py',
        path = 'data_files/umap/{sample}_umap.csv'
    output:
        path_to_umap = 'output/{sample}_umap.png',
        path_to_clusters = 'output/{sample}_clusters.csv'
    shell:
        "python {input.script} -data {input.path} -eps '0.3' -min_samples '10' -path_to_umap {output.path_to_umap} -path_to_clusters {output.path_to_clusters}"

Snakemake: suggestions on workflow design to overcome ambiguous rules

I have two rules that can produce the same output, depending on the wildcard values, and this causes an ambiguous-rule exception.
I read the documentation at http://snakemake.readthedocs.io/en/latest/snakefiles/rules.html?highlight=ruleorder#handling-ambiguous-rules about handling ambiguous-rule exceptions. It seems that a rule order could be the solution. However, the input of my rule preprocess_zheng17 depends on the output of the simulate_data rule, so if I use ruleorder: simulate_data > preprocess_zheng17, the preprocess_zheng17 rule is never run.
What I would like is to first run simulate_data and then run preprocess_zheng17 for each wildcard pair. I am wondering what would be good workflow design practice to cope with this problem. The rules are provided below.
rule preprocess_zheng17:
    input:
        loom_file = SIMULATED_DATA_OUTPUT+'/{sample}_sim_loc{loc}.loom'
    params:
        transpose = False
    output:
        SIMULATED_DATA_OUTPUT+'/{sample}_sim_loc{loc}_zheng17.loom'
    script:
        "scripts/preprocess_zheng17.py"

rule simulate_data:
    input:
        sample_loom = HDF5_OUTPUT+'/{sample}.loom'
    params:
        group_prob = config['splat_simulate']['group_prob'],
        dropout_present = config['splat_simulate']['dropout_present']
    output:
        SIMULATED_DATA_OUTPUT+'/{sample}_sim_loc{loc}.loom'
    script:
        "scripts/data_simulation.R"
Thank you in advance.
Your problem does not come from the design but from the fact that the outputs of your two rules, given the wildcards used, cannot be distinguished.
Both
SIMULATED_DATA_OUTPUT+'/{sample}_sim_loc{loc}_zheng17.loom' and
SIMULATED_DATA_OUTPUT+'/{sample}_sim_loc{loc}.loom'
begin and end with the same pattern, and Snakemake cannot determine whether _zheng17 is part of the wildcard {loc} or not.
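You can reproduce the ambiguity with a quick regex experiment (illustrative only: 'A' is a stand-in sample name, and Snakemake's default wildcard regex is .+):
import re

# The simulate_data output pattern also matches the preprocessed file,
# with the '_zheng17' suffix swallowed by the {loc} wildcard:
sim = re.compile(r"(?P<sample>.+)_sim_loc(?P<loc>.+)\.loom")
m = sim.fullmatch("A_sim_loc1_zheng17.loom")
print(m.group("sample"), m.group("loc"))  # A 1_zheng17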
You can either use what bli described in his comment or change the output of either rule a little. For example:
rule preprocess_zheng17:
    input:
        loom_file = SIMULATED_DATA_OUTPUT+'/{sample}_sim_loc{loc}.loom'
    params:
        transpose = False
    output:
        SIMULATED_DATA_OUTPUT+'/{sample}_sim_zheng17_loc{loc}.loom'
    script:
        "scripts/preprocess_zheng17.py"

rule simulate_data:
    input:
        sample_loom = HDF5_OUTPUT+'/{sample}.loom'
    params:
        group_prob = config['splat_simulate']['group_prob'],
        dropout_present = config['splat_simulate']['dropout_present']
    output:
        SIMULATED_DATA_OUTPUT+'/{sample}_sim_loc{loc}.loom'
    script:
        "scripts/data_simulation.R"

Snakemake: rule generates strange results

I created this rule:
rule picard_addRG2:
    input:
        "mapped_reads/merged_samples/{sample}.dedup.bam"
    output:
        "mapped_reads/merged_samples/{sample}_rg.dedup.bam"
    params:
        sample_id = config['samples'],
        library = "library00"
    shell:
        """picard AddOrReplaceReadGroups I={input} O={output} RGID={params.sample_id} RGLB={params.library} RGPL=illumina RGPU=unit1 RGSM=20 RGPU=MP"""
I added this target to my Snakefile:
expand("mapped_reads/merged_samples/{sample}_rg.dedup.bam", sample=config['samples'])
Then I got this strange result from another rule:
snakemake --configfile exome.yaml -np
InputFunctionException in line 17 of /illumina/runs/FASTQ/test_play/rules/samfiles.rules:
KeyError: '445_rg'
Wildcards:
sample=445_rg
What did I do wrong?
If I change the rule in this way, it works perfectly:
rule picard_addRG2:
    input:
        "mapped_reads/merged_samples/{sample}.dedup.bam"
    output:
        "mapped_reads/merged_samples/{sample}.dedup_rg.bam"
    params:
        sample_id = config['samples'],
        library = "library00"
    shell:
        """picard AddOrReplaceReadGroups I={input} O={output} RGID={params.sample_id} RGLB={params.library} RGPL=illumina RGPU=unit1 RGSM=20 RGPU=MP"""
Since it works perfectly with the second way of writing the output, I suggest using that one. Here is what is happening:
Since in your rule picard_addRG2 the input is:
"mapped_reads/merged_samples/{sample}.dedup.bam"
you must have another rule that creates this file as its output.
And in your rule picard_addRG2 the output is:
"mapped_reads/merged_samples/{sample}_rg.dedup.bam"
So when you ask in your expand for:
"mapped_reads/merged_samples/{sample}_rg.dedup.bam"
Snakemake does not know whether it has to use your rule picard_addRG2 with sample as the wildcard, or the other rule with sample_rg as the wildcard, since both patterns begin and end the same way.
To summarize: try not to use two output patterns where one can swallow the other's suffix into a wildcard. Here both of your outputs:
"mapped_reads/merged_samples/{sample}.dedup.bam"
"mapped_reads/merged_samples/{sample}_rg.dedup.bam"
begin and end with exactly the same pattern.
When you use:
"mapped_reads/merged_samples/{sample}.dedup_rg.bam"
as the output, the wildcard can no longer be over-matched!
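To make the ambiguity concrete, here is a small illustrative regex experiment (using the sample id 445 from the error message; Snakemake's default wildcard regex is .+):
import re

# The upstream pattern '{sample}.dedup.bam' also matches the file name
# produced by picard_addRG2, binding sample='445_rg', which is exactly
# the key in the error message:
upstream = re.compile(r"(?P<sample>.+)\.dedup\.bam")
print(upstream.fullmatch("445_rg.dedup.bam").group("sample"))  # 445_rg

# The fixed naming '{sample}.dedup_rg.bam' no longer collides:
fixed = re.compile(r"(?P<sample>.+)\.dedup_rg\.bam")
print(fixed.fullmatch("445.dedup_rg.bam").group("sample"))  # 445
print(upstream.fullmatch("445.dedup_rg.bam"))  # None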