Error caused by missing output files while running Nextflow - nextflow

I have an error when i run nextflow consist of the following sentence
Error executing process > 'BWA_INDEX (Homo_sapiens_assembly38_chr1.fasta)'
Caused by:
Missing output file(s) FASTA.* expected by process 'BWA_INDEX(Homo_sapiens_assembly38_chr1.fasta)'
I use the following script.
#!/usr/bin/env nextflow
params.PublishDir = "/home/nextflow_test/genesFilter"
params.pathFasta = "/home/nf-core/references/Homo_sapiens/GATK/GRCh38/Sequence/WholeGenomeFasta/Homo_sapiens_assembly38_chr1.fasta"
InputFasta = file(params.pathFasta)
process BWA_INDEX {
tag {InputFasta.name}
publishDir (
path: "${params.PublishDir}",
mode: 'copy',
overwrite: 'true',
saveAs: "${params.PublishDir}/${it}"
)
input:
path InputFasta
output:
file("FASTA.*") into bwa_indexes
script:
"""
bwa-mem2 index "${InputFasta}"
"""
}
ch_bwa = bwa_indexes
Nevertheless into the work directory (specified after the error sentence) the process does work correctly and the output files are generated but not on my desire output directory. I tried to replace the "file" by the "path" on the script in the line:
output:
file("FASTA.*")
As well as replace "FASTA.* " for "${params.PublishDir}/FASTA.*"
but the error still appears. I don't know exactly why it happens. ¿Maybe could be due to the use of params to specify the inputs and outputs?
Thanks in advance!

Missing output file(s) FASTA.* expected by process 'BWA_INDEX(Homo_sapiens_assembly38_chr1.fasta)'
Nextflow is expecting files matching the glob pattern FASTA.* in the working directory, but they could not be found when the process exited (successfully). You just need to tell Nextflow what files to expect in your output declaration. The files that bwa-mem2 index Homo_sapiens_assembly38_chr1.fasta should have created might look like:
Homo_sapiens_assembly38_chr1.fasta.0123
Homo_sapiens_assembly38_chr1.fasta.amb
Homo_sapiens_assembly38_chr1.fasta.ann
Homo_sapiens_assembly38_chr1.fasta.bwt.2bit.64
Homo_sapiens_assembly38_chr1.fasta.bwt.8bit.32
Homo_sapiens_assembly38_chr1.fasta.pac
The following output declaration should be sufficient to find these files:
output:
path("${InputFasta}.*") into bwa_indexes
Note that only files that are declared in your output block are published to the publishDir. Also, the 'saveAs' publishDir parameter must be a closure for it to work correctly. You will need to fix this (or just remove the line entirely) to make your example work.

Related

Snakemake: Error when trying to run a rule for multiple directories and files

I create a dictionary in python and save the path to the directories (that I want the software to run on) as the keys and the corresponding values are a list of the expected output for each directory. Right now I have a structure like this:
sampleDict = {'/path_to_directory1': ["sample1","sample2","sample3"],
'/path_to_directory2': ["sample1","sample2"],
'/path_to_directory3': ["sample1","sample2","sample3"]}
# sampleDict looks pretty much like this
# key is a path to the directory that I want the rule to be executed on and the corresponding value sampleDict[key] is an array e.g. ["a","b","c"]
def input():
input=[]
for key in dirSampleDict:
input.extend(expand('{dir}/{sample}*.foo', dir = key, sample=dirSampleDict[key]))
return input
rule all:
input:
input()
# example should run some software on different directories for each set of directories and their expected output samples
rule example:
input:
# the path to each set of samples should be the wildcard
dir = lambda wildcards: expand("{dir}", dir=dirSampleDict.keys())
params:
# some params
output:
expand('{dir}/{sample}*.foo', dir = key, sample=dirSampleDict[key])
log:
log = '{dir}/{sample}.log'
run:
cmd = "software {dir}"
shell(cmd)
Doing this I receive the following error:
No values given for wildcard 'dir
Edit: Maybe it was not so clear what I actually want to do so I filled in some data.
I also tried using the wildcards I set up in rule all as follows:
sampleDict = {'/path_to_directory1': ["sample1","sample2","sample3"],
'/path_to_directory2': ["sample1","sample2"],
'/path_to_directory3': ["sample1","sample2","sample3"]}
# sampleDict looks pretty much like this
# key is a path to the directory that I want the rule to be executed on and the corresponding value sampleDict[key] is an array e.g. ["a","b","c"]
def input():
input=[]
for key in dirSampleDict:
input.extend(expand('{dir}/{sample}*.foo', dir = key, sample=dirSampleDict[key]))
return input
rule all:
input:
input()
# example should run some software on different directories for each set of directories and their expected output samples
rule example:
input:
# the path to each set of samples should be the wildcard
dir = "{{dir}}"
params:
# some params
output:
'{dir}/{sample}*.foo'
log:
log = '{dir}/{sample}.log'
run:
cmd = "software {dir}"
shell(cmd)
Doing this I receive the following error:
Not all output, log and benchmark files of rule example contain the
same wildcards. This is crucial though, in order to avoid that two or
more jobs write to the same file.
I'm pretty sure the second part is more likely what I actually want to do, since expand() as output would only run the rule once but I need to run it for every key value pair in the dictonary.
First of all, what do you expect from the asterisk in the output?
output:
'{dir}/{sample}*.foo'
The output has to be a list of valid filenames that can be formed with substitution of each wildcard with some string.
Next problem is that you are using the "{dir}" in the run: section. There is no variable dir defined in the script used for run. If you want to use the wildcard, you need to address it using wildcards.dir. However the run: can be substituted with a shell: section:
shell:
"software {wildcards.dir}"
Regarding your first script: there is no dir wildcard defined (actually there are no wildcards at all):
output:
expand('{dir}/{sample}*.foo', dir = key, sample=dirSampleDict[key])
Both {dir} and {sample} are the variables in the context of expand function, and they are fully substituted with the named parameters.
Now the second script. What did you mean by this input?
input:
dir = "{{dir}}"
Here the "{{dir}}" is not a wildcard, but a reference to a global variable (you haven't provided the rest of your script, so I cannot judge whether it is defined or not). Moreover, what's the need in the input? You never use the {input} variable at all, and there is no dependencies that are needed to connect the rule example with any other rule to produce the input for rule example.

Accessing file path from a config.yaml in Snakemake

I'm working with Snakemake for NGS analysis. I have a list of input files, stored in a YAML file as follows:
DATASETS:
sample1: /path/to/input/bam
.
.
A very simplified skeleton of my Snakemake file, as described earlier in Snakemake: How to use config file efficiently and https://www.biostars.org/p/406452/, is as follows:
rule all:
input:
expand("report/{sample}.xlsx", sample = config["DATASETS"])
rule call:
input:
lambda wildcards: config["DATASETS"][wildcards.sample]
output:
"tmp/{sample}.vcf"
shell:
"some mutect2 script"
rule summarize:
input:
"tmp/{sample}.vcf"
output:
"report/{sample}.xlsx"
shell:
"processVCF.py"
This complains about missing input files for rule all. I'm really not too sure what I am missing out here: Could someone perhaps point out where I can start looking to try to solve my problem?
This problem persists even when I execute snakemake -n tmp/sample1.vcf, so it seems the problem is related to the inability to pass the input file to the rule call. I have a nagging feeling that I'm really missing something trivial here.

Snakemake: 'Missing input files' due to wrong wildcard expansion

I am new to Snakemake and I want to write a very simple Snakefile with a rule that processes each input file separately to an output file, but somehow my wildcards aren't interpreted correctly.
I have set up a minimal, reproducible example environment in Ubuntu 18.04 with the input files "test/test1.txt", "test/test2.txt", and a Snakefile. (snakemake version 5.5.4)
Snakefile:
ins = glob_wildcards("test/{f}.txt")
rule all:
input: expand("out/{f}.txt", f=ins)
rule test:
input: "test/{f}.txt"
output: "out/{f}.txt"
shell: "touch {output}"
This Snakefile throws the following error while building the DAG of jobs:
Missing input files for rule test:
test/['test1', 'test2'].txt
Any ideas how to fix this error?
I think you need to use ins.f or something similar:
expand("out/{f}.txt", f= ins.f)
The reason is explained in the FAQ
[glob_wildcards returns] a named tuple that contains a list of values
for each wildcard.

how can I pass a string in config file into the output section?

New to snakemake and I've been trying to transform my shell script based pipeline into snakemake based today and run into a lot of syntax issues.. I think most of the trouble I have is around getting all the files in a particular directories and infer output names from input names since that's how I use shell script (for loop), in particular, I tried to use expand function in the output section and it always gave me an error.
After checking some example Snakefile, I realized people never use expand in the output section. So my first question is: is output the only section where expand can't be used and if so, why? What if I want to pass a prefix defined in config.yaml file as part of the output file and that prefix can not be inferred from input file names, how can I achieve that, just like what I did below for the log section where {runid} is my prefix?
Second question about syntax: I tried to pass a user defined id in the configuration file (config.yaml) into the log section and it seems to me that here I have to use expand in the following form, is there a better way of passing strings defined in config.yaml file?
log:
expand("fastq/fastqc/{runid}_fastqc_log.txt",runid=config["run"])
where in the config.yaml
run:
"run123"
Third question: I initially tried the following 2 methods but they gave me errors so does it mean that inside log (probably input and output) section, Python syntax is not followed?
log:
"fastq/fastqc/"+config["run"]+"_fastqc_log.txt"
log:
"fastq/fastqc/{config["run"]}_fastqc_log.txt"
Here is an example of small workflow:
# Sample IDs
SAMPLES = ["sample1", "sample2"]
CONTROL = ["sample1"]
TREATMENT = ["sample2"]
rule all:
input: expand("{treatment}_vs_{control}.bed", treatment=TREATMENT, control=CONTROL)
rule peak_calling:
input: control="{control}.sam", treatment="{treatment}.sam"
output: "{treatment}_vs_{control}.bed"
shell: "touch {output}"
rule mapping:
input: "{samples}.fastq"
output: "{samples}.sam"
shell: "cp {input} {output}"
I used the expand function only in my final target. From there, snakemake can deduce the different values of the wildcards used in the rules "mapping" and "peak_calling".
As for the last part, the right way to put it would be the first one:
log:
"fastq/fastqc/" + config["run"] + "_fastqc_log.txt"
But again, snakemake can deduce it from your target (the rule all, in my example).
rule mapping:
input: "{samples}.fastq"
output: "{samples}.sam"
log: "{samples}.log"
shell: "cp {input} {output}"
Hope this helps!
You can use f-strings:
If this is you folder_with_configs/some_config.yaml:
var: value
Then simply
configfile:
"folder_with_configs/some_config.yaml"
rule one_to_rule_all:
output:
f"results/{config['var']}.file"
shell:
"touch {output[0]}"
Do remember about python rules related to nesting different types of apostrophes.
config in the smake rule is a simple python dictionary.
If you need to use additional variables in a path, e.g. some_param, use more curly brackets.
rule one_to_rule_all:
output:
f"results/{config['var']}.{{some_param}}"
shell:
"touch {output[0]}"
enjoy

Rebol Is it possible to get the name of the script currently executing?

I'm executing multiple libraries from user.r.
I can get the path of the script from system/script/path but I can't see how I can get the name of the script. So am I obliged to hardcode the file name in header property like below (File):
REBOL [
Title: "Lib1"
File: "lib1.r"
]
script-path: ""
]
system/script/header/script-path: rejoin [system/script/path system/script/header/file]
probe system/script/header/script-path
input
system/options/script does only give the full script name and path of the first script passed by dos command line (not if it is executed in console) and not the path of subsequent scripts called by the very first one.
What I want is the full path of the subsequents scripts.
So it seems there's no solution!
Try help system/options and you will find the information you are lookimg for.