Using checkpoints with snakemake gives each instance of a rule all input files

I've recently come across checkpoints in snakemake and realized they will work perfectly for what I am trying to do. I've been able to implement the workflow listed here. I also found this Stack Overflow question, but can't quite make sense of it or how I might make it work for what I am doing.
The rules I am working with are as follows:
def ReturnBarcodeFolderNames():
    path = config['results_folder'] + "Barcode/"
    return_direc = []
    for root, directory, files in os.walk(path):
        for direc in directory:
            return_direc.append(direc)
    return return_direc
rule all:
    input:
        expand(config['results_folder'] + "Barcode/{folder}.merged.fastq", folder=ReturnBarcodeFolderNames())
checkpoint barcode:
    input:
        expand(config['results_folder'] + "Basecall/{fast5_files}", fast5_files=FAST5_FILES)
    output:
        temp(directory(config['results_folder'] + "Barcode/.tempOutput/"))
    shell:
        "guppy_barcoder "
        "--input_path {input} "
        "--save_path {output} "
        "--barcode_kits EXP-PBC096 "
        "--recursive"
def aggregate_barcode_folders(wildcards):
    checkpoint_output = checkpoints.barcode.get(**wildcards).output[0]
    folder_names = []
    for root, directories, files in os.walk(checkpoint_output):
        for direc in directories:
            folder_names.append(direc)
    return expand(config['results_folder'] + "Barcode/.tempOutput/{folder}", folder=folder_names)
rule merge:
    input:
        aggregate_barcode_folders
    output:
        config['results_folder'] + "Barcode/{folder}.merged.fastq"
    shell:
        "echo {input}"
The barcode checkpoint and the aggregate_barcode_folders function work as expected, but when rule merge is reached, every input folder is passed to each instance of the rule. This results in something like the following:
rule merge:
    input: /Results/Barcode/.tempOutput/barcode81,
           /Results/Barcode/.tempOutput/barcode28,
           /Results/Barcode/.tempOutput/barcode17,
           /Results/Barcode/.tempOutput/barcode10,
           /Results/Barcode/.tempOutput/barcode26,
           /Results/Barcode/.tempOutput/barcode21,
           /Results/Barcode/.tempOutput/barcode42,
           /Results/Barcode/.tempOutput/barcode89,
           /Results/Barcode/.tempOutput/barcode45,
           /Results/Barcode/.tempOutput/barcode20,
           /Results/Barcode/.tempOutput/barcode18,
           /Results/Barcode/.tempOutput/barcode27,
           /Results/Barcode/.tempOutput/barcode11,
           ...
    output: /Results/Barcode/barcode75.merged.fastq
    jobid: 82
    wildcards: folder=barcode75
The exact same input is passed to every job of rule merge, which amounts to about 80 instances, yet the wildcards portion of each job differs per folder. How can I pass only the folder matching each job's wildcard to rule merge, instead of the entire list returned by aggregate_barcode_folders? I suspect something is amiss with the input of rule all, but I'm not 100% sure what the problem may be.
As a note, I know snakemake will throw an error stating that it is waiting for output files from rule merge, as I am not doing anything with the output other than printing it to the screen.
EDIT
I've decided to go against checkpoints for now, and instead opt for the following. To make things clearer, the goal of this pipeline is to merge the fastq files from an output folder into one file; each barcode folder holds a variable number of input files (one to about three, but I won't know how many in advance). The structure of the input is as follows:
INPUT
|-- Results
    |-- FolderA
        |-- barcode01
            |-- file1.fastq
        |-- barcode02
            |-- file1.fastq
            |-- file2.fastq
        |-- barcode03
            |-- file1.fastq
    |-- FolderB
        |-- barcode01
            |-- file1.fastq
        |-- barcode02
            |-- file1.fastq
            |-- file2.fastq
        |-- barcode03
            |-- file1.fastq
    |-- FolderC
        |-- barcode01
            |-- file1.fastq
            |-- file2.fastq
        |-- barcode02
            |-- file1.fastq
        |-- barcode03
            |-- file1.fastq
            |-- file2.fastq
OUTPUT
I would like to turn that into output resembling something such as:
|-- Results
    |-- barcode01.merged.fastq
    |-- barcode02.merged.fastq
    |-- barcode03.merged.fastq
Each output file would contain the data from all file#.fastq files in its respective barcode folder, across folders A, B, and C.
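For concreteness, that merge amounts to one cat per barcode, collecting files across the top-level folders. A minimal sketch; the paths, the wildcard name, and the glob-based input function are illustrative, not from the original post:

import glob

rule merge:
    input:
        # collect every fastq for this barcode across FolderA, FolderB, FolderC
        lambda wc: sorted(glob.glob(f"Results/*/{wc.barcode}/*.fastq"))
    output:
        "Results/{barcode}.merged.fastq"
    shell:
        "cat {input} > {output}"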
I've been able to get (I think) further than I was before, but snakemake is throwing an error that says Missing input files for rule basecall: /Users/joshl/PycharmProjects/ARS/Results/DataFiles/fast5/FAL03879_67a0761e_1055/ barcode72.fast5. My relevant code is here:
CODE
configfile: "config.yaml"

FAST5_FILES = glob_wildcards(config['results_folder'] + "DataFiles/fast5/{fast5_files}.fast5").fast5_files

def return_fast5_folder_names():
    path = config['results_folder'] + "Basecall/"
    fast5_folder_names = []
    for item in os.scandir(path):
        if Path(item).is_dir():
            fast5_folder_names.append(item.name)
    return fast5_folder_names

def return_barcode_folder_names():
    path = config['results_folder'] + ".barcodeTempOutput"
    fast5_folder_names = []
    collated_barcode_folder_names = []
    for item in os.scandir(path):
        if Path(item).is_dir():
            full_item_path = os.path.join(path, item.name)
            fast5_folder_names.append(full_item_path)
    index = 0
    for item in fast5_folder_names:
        collated_barcode_folder_names.append([])
        for folder in os.scandir(item):
            if Path(folder).is_dir():
                collated_barcode_folder_names[index].append(folder.name)
        index += 1
    return collated_barcode_folder_names
rule all:
    input:
        # basecall
        expand(config['results_folder'] + "Basecall/{fast5_file}", fast5_file=FAST5_FILES),
        # barcode
        expand(config['results_folder'] + ".barcodeTempOutput/{fast5_folders}", fast5_folders=return_fast5_folder_names()),
        # merge files
        expand(config['results_folder'] + "Barcode/{barcode_numbers}.merged.fastq", barcode_numbers=return_barcode_folder_names())
rule basecall:
    input:
        config['results_folder'] + "DataFiles/fast5/{fast5_file}.fast5"
    output:
        directory(config['results_folder'] + "Basecall/{fast5_file}")
    shell:
        r"""
        guppy_basecaller \
            --input_path {input} \
            --save_path {output} \
            --quiet \
            --config dna_r9.4.1_450bps_fast.cfg \
            --num_callers 2 \
            --cpu_threads_per_caller 6
        """
rule barcode:
    input:
        config['results_folder'] + "Basecall/{fast5_folders}"
    output:
        directory(config['results_folder'] + ".barcodeTempOutput/{fast5_folders}")
    threads: 12
    shell:
        r"""
        for item in {input}; do
            guppy_barcoder \
                --input_path $item \
                --save_path {output} \
                --barcode_kits EXP-PBC096 \
                --recursive
        done
        """
rule merge_files:
    input:
        expand(config['results_folder'] + ".barcodeTempOutput/{fast5_folder}/{barcode_numbers}",
               fast5_folder=glob_wildcards(config['results_folder'] + ".barcodeTempOutput/{fast5_folders}/{barcode_numbers}/{fastq_files}.fastq").fast5_folders,
               barcode_numbers=glob_wildcards(config['results_folder'] + ".barcodeTempOutput/{fast5_folders}/{barcode_numbers}/{fastq_files}.fastq").barcode_numbers)
    output:
        config['results_folder'] + "Barcode/{barcode_numbers}.merged.fastq"
    shell:
        r"""
        echo "Hello world"
        echo {input}
        """
Under rule all, if I comment out the line that corresponds to merging files, there is no error.

I am not fully understanding what you mean, but I think the problem does indeed lie in the input of rule all. I currently do not have access to a computer (I'm on my phone right now), so I cannot make a real example. Probably what you want to do is change ReturnBarcodeFolderNames to use a checkpoint. I guess only after rule barcode has run do you actually know what you want as final output.
def ReturnBarcodeFolderNames(wildcards):
    # the wildcard here makes sure that barcode is executed first
    checkpoint_output = checkpoints.barcode.get().output[0]
    folder_names = []
    for root, directories, files in os.walk(checkpoint_output):
        for direc in directories:
            folder_names.append(direc)
    return expand(config['results_folder'] + "Barcode/{folder}.merged.fastq", folder=folder_names)
rule all:
    input:
        ReturnBarcodeFolderNames

rule merge:
    input:
        config['results_folder'] + "Barcode/.tempOutput/{folder}"
    output:
        config['results_folder'] + "Barcode/{folder}.merged.fastq"
    shell:
        "echo {input}"
Obviously ReturnBarcodeFolderNames does not work in its current form. However, the idea is that you determine what you want as final output in rule all after rule barcode has been executed. Rule merge then does not have to use the checkpoint, as its input and output can be clearly defined.
I hope this helps :), but maybe I have been addressing something other than your problem. It wasn't completely clear to me from the question, unfortunately.
edit
Here is a stripped-down version of the code, but it should be easy to implement the last parts now. It works for the folder structure you gave in the example:
import os
import glob

def get_merged_barcodes(wildcards):
    tmpdir = checkpoints.barcode.get(**wildcards).output[0]  # this forces the checkpoint to be executed before we continue
    barcodes = set()  # a set is like a list, but only stores unique values
    for folder in os.listdir(tmpdir):
        for barcode in os.listdir(tmpdir + "/" + folder):
            barcodes.add(barcode)
    mergedfiles = ["results/" + barcode + ".merged.fastq" for barcode in barcodes]
    return mergedfiles
rule all:
    input:
        get_merged_barcodes

checkpoint barcode:
    input:
        rules.basecall.output
    output:
        directory("results")
    shell:
        """
        stuff
        """

def get_merged_input(wildcards):
    return glob.glob(f"results/**/{wildcards.barcode}/*.fastq")

rule merge_files:
    input:
        get_merged_input
    output:
        "results/{barcode}.merged.fastq"
    shell:
        """
        echo {input}
        """
Basically what you did in the original question was almost working!

Related

Snakemake pipeline not attempting to produce output?

I have a relatively simple snakemake pipeline, but when run it reports all input files missing for rule all:
refseq = 'refseq.fasta'
reads = ['_R1_001', '_R2_001']

def getsamples():
    import glob
    test = (glob.glob("*.fastq"))
    print(test)
    samples = []
    for i in test:
        samples.append(i.rsplit('_', 2)[0])
    return(samples)

def getbarcodes():
    with open('unique.barcodes.txt') as file:
        lines = [line.rstrip() for line in file]
    return(lines)
rule all:
    input:
        expand("grepped/{barcodes}{sample}_R1_001.plate.fastq", barcodes=getbarcodes(), sample=getsamples()),
        expand("grepped/{barcodes}{sample}_R2_001.plate.fastq", barcodes=getbarcodes(), sample=getsamples())
    wildcard_constraints:
        barcodes="[a-z-A-Z]+$"

rule fastq_grep:
    input:
        R1 = "{sample}_R1_001.fastq",
        R2 = "{sample}_R2_001.fastq"
    output:
        out1 = "grepped/{barcodes}{sample}_R1_001.plate.fastq",
        out2 = "grepped/{barcodes}{sample}_R2_001.plate.fastq"
    wildcard_constraints:
        barcodes="[a-z-A-Z]+$"
    shell:
        "fastq-grep -i '{wildcards.barcodes}' {input.R1} > {output.out1} && fastq-grep -i '{wildcards.barcodes}' {input.R2} > {output.out2}"
The output files listed by the terminal seem correct, so Snakemake appears to see what I want produced, but the shell commands never make anything at all.
I want to produce a set of files grepped for the list of barcodes I have in a file, but I get "Missing input files for rule all:".
There are two issues:
1. You have an impossible wildcard_constraints defined for {barcodes}.
2. Your two wildcards {barcodes} and {sample} are competing with each other.
Remove the wildcard_constraints from your two rules and add the following lines to the top of your Snakefile:
wildcard_constraints:
    barcodes="[A-Z]+",
    sample="Well.*",
The constraint for {barcodes} now matches only capital letters. Before, it also included end-of-line matching (the trailing $), which was impossible to satisfy because additional text follows the wildcard in the file path.
The constraint for {sample} ensures that the part of the filename starting with "Well..." is interpreted as the start of the {sample} wildcard. Otherwise you'd get something unwanted like barcodes=ACGGTW instead of barcodes=ACGGT.
A note of advice:
I usually find it easier to separate wildcards into directory structures rather than having multiple wildcards in the same filename. In your case that would mean having a structure like
grepped/{barcodes}/{sample}_R1_001.plate.fastq.
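A minimal sketch of that layout, reusing the grep step from the question (this rewrite is an illustration, not part of the original answer):

rule fastq_grep:
    input:
        R1="{sample}_R1_001.fastq",
        R2="{sample}_R2_001.fastq",
    output:
        # one directory level per wildcard, so there is no ambiguity about
        # where {barcodes} ends and {sample} begins
        out1="grepped/{barcodes}/{sample}_R1_001.plate.fastq",
        out2="grepped/{barcodes}/{sample}_R2_001.plate.fastq",
    shell:
        "fastq-grep -i '{wildcards.barcodes}' {input.R1} > {output.out1} && "
        "fastq-grep -i '{wildcards.barcodes}' {input.R2} > {output.out2}"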
Full suggested Snakefile (formatted using snakefmt)
wildcard_constraints:
    barcodes="[A-Z]+",
    sample="Well.*",

refseq = "refseq.fasta"
reads = ["_R1_001", "_R2_001"]

def getsamples():
    import glob

    test = glob.glob("*.fastq")
    print(test)
    samples = []
    for i in test:
        samples.append(i.rsplit("_", 2)[0])
    return samples

def getbarcodes():
    with open("unique.barcodes.txt") as file:
        lines = [line.rstrip() for line in file]
    return lines

rule all:
    input:
        expand(
            "grepped/{barcodes}{sample}_R1_001.plate.fastq",
            barcodes=getbarcodes(),
            sample=getsamples(),
        ),
        expand(
            "grepped/{barcodes}{sample}_R2_001.plate.fastq",
            barcodes=getbarcodes(),
            sample=getsamples(),
        ),

rule fastq_grep:
    input:
        R1="{sample}_R1_001.fastq",
        R2="{sample}_R2_001.fastq",
    output:
        out1="grepped/{barcodes}{sample}_R1_001.plate.fastq",
        out2="grepped/{barcodes}{sample}_R2_001.plate.fastq",
    shell:
        "fastq-grep -i '{wildcards.barcodes}' {input.R1} > {output.out1} && fastq-grep -i '{wildcards.barcodes}' {input.R2} > {output.out2}"
In addition to @euronion's answer (+1), I prefer to constrain wildcards to match only and exactly the list of values you expect. This means disabling regex matching altogether. In your case, I would do something like:
import re

wildcard_constraints:
    barcodes='|'.join([re.escape(x) for x in getbarcodes()]),
    sample='|'.join([re.escape(x) for x in getsamples()]),
Now {barcodes} is allowed to match only the values in getbarcodes(), whatever they are, and the same goes for {sample}. For example, if getbarcodes() returned ['ACGT', 'TTGA'], the generated constraint would be the regex 'ACGT|TTGA'. In my opinion this is better than anticipating what combinations of regex a wildcard can take.

How to save the multiple output of single process in publishDir in Nextflow

I have the process create_parallel_params, whose output is a parallel_params folder containing JSON files.
#!/usr/bin/env nextflow
nextflow.enable.dsl = 2

params.spectra = "$baseDir/data/spectra/"
params.library = "$baseDir/data/library/"
params.workflow_parameter = "$baseDir/data/workflowParameters.xml"

TOOL_FOLDERS = "$baseDir/bin"

process create_parallel_params {
    publishDir "$baseDir/nf_output", mode: 'copy'

    output:
    path "parallel_params/*.json"

    script:
    """
    mkdir parallel_params | python $TOOL_FOLDERS/parallel_paramgen.py \
        parallel_params \
        10
    """
}
The output of the above process is passed into the process searchlibrarysearch_molecularv2_parallelstep1, which processes each JSON file.
process searchlibrarysearch_molecularv2_parallelstep1 {
    publishDir "$baseDir/nf_output", mode: 'copy'

    input:
    path json_file
    //path params.spectra
    //path params.library

    output:
    path "result_folder" emit:"result_folder/*.tsv"

    script:
    """
    mkdir result_folder convert_binary librarysearch_binary | \
    python $TOOL_FOLDERS/searchlibrarysearch_molecularv2_parallelstep1.py \
        $params.spectra \
        $json_file \
        $params.workflow_parameter \
        $params.library \
        result_folder \
        convert_binary \
        librarysearch_binary \
    """
}

workflow {
    ch_parallel_params = create_parallel_params()
    ch_searchlibrarysearch = searchlibrarysearch_molecularv2_parallelstep1(create_parallel_params.out.flatten())
    ch_searchlibrarysearch.view()
}
I want the output of these files in publishDir (nf_output) in a single folder. How can I do that? Please provide an example.
The emit option can be used to assign a name identifier to an output channel. This is helpful if your output declaration defines more than one output channel, but it isn't usually necessary if you make only a single declaration. Providing a glob pattern as an identifier doesn't make much sense: if you need only the output TSV files (and not the whole folder), you can just use the following, and the output TSV files will be published to the publishDir:
output:
    path "result_folder/*.tsv"
If you want to declare the folder itself, usually you can just update your publishDir to include a subdirectory with a unique name. You could use something like:
publishDir "$baseDir/nf_output/${json_file.baseName}", mode: 'copy'
But this will give you a 'result_folder' in every subdirectory. If that's not desirable, it might be preferable to change your output declaration to:
output:
    path "result_folder/*"

Snakemake: how to specify absolute paths to shell commands

I am writing a snakemake rule that uses multiple commands as shown below:
rule RULE1:
    input: 'path/to/input.file'
    output: 'path/to/output.file'
    shell: 'path/to/command1 {input} | /path/to/command2 | /path/to/command3 {output}'
If /path/to/command1 is really long, the rule becomes a bit unwieldy. Is there a way to specify it somewhere else as cmd1='/path/to/command1' and use {cmd1} within the rule? I know I can use something like params: cmd1='/path/to/command1' and use it as follows:
rule RULE1:
    input: 'path/to/input.file'
    output: 'path/to/output.file'
    params:
        cmd1='/path/to/command1',
        cmd2='/path/to/command2',
        cmd3='/path/to/command3'
    shell: '{params.cmd1} {input} | {params.cmd2} | {params.cmd3} {output}'
But that workaround requires me to specify the commands for every rule separately, and it cannot use relative paths.
What is the standard way to do such a thing?
The shell directive takes a string as argument, which you can construct however you prefer, e.g.:
cmd1 = 'foo'
cmd2 = 'bar'

rule one:
    ...
    shell:
        cmd1 + ' {input}' + ' | ' + cmd2 + ' > {output}'
To show some of the power of the snake, you could do something like:
path2 = "/the/long/and/winding/path/"

rule RULE1:
    input: path2 + 'input.file'
    output: path2 + 'output.file'
    shell: f'{path2}command1 {{input}} | {path2}command2 | {path2}command3 {{output}}'
A couple of notes:
- Double curly braces are needed, since both snakemake and python (the f-string) will want to parse them.
- Variables such as path2 above are often stored in a config file accessed through the configfile: directive; see the sketch after this list.
- If all your files are on the same path, you might be able to use workdir: "/the/long/and/winding/path/", or set the path from the command line (better, as your snakefile will be less prone to errors if you change directories).
- This can obviously be combined with dariober's (better) answer, creating cmd1 = path2 + 'command1' to avoid repeating the long path in all commands ...
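A minimal sketch of the configfile approach mentioned in the notes; the key name tool_dir and the file config.yaml are assumptions for illustration, not from the original answer:

# config.yaml is assumed to contain a line like:
#   tool_dir: /the/long/and/winding/path
configfile: "config.yaml"

rule RULE1:
    input: 'path/to/input.file'
    output: 'path/to/output.file'
    params:
        tool_dir=config["tool_dir"]
    shell: '{params.tool_dir}/command1 {input} | {params.tool_dir}/command2 > {output}'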

Define input files from csv

I would like to define input file names from different variables extracted from a csv. I have built the following simplified example:
I have a file test.csv:
data/samples/A.fastq
data/samples/B.fastq
I give the path to test.csv in a json config file:
{
    "samples": {
        "summaryFile": "somepath/test.csv"
    }
}
Now I want to run bwa on each file within a rule. My feeling is that I have to use lambda wildcards but I am not sure. My Snakefile looks like this:
# only for bcf_tools
import pandas

input_table = config["samples"]["summaryFile"]
samplesData = pandas.read_csv(input_table)

def returnSamples(table):
    # Have tried different things here but nothing worked
    return table

rule all:
    input:
        expand("mapped_reads/{sample}.bam", sample=samplesData)

rule bwa_map:
    input:
        "data/genome.fa",
        lambda wildcards: returnSamples(wildcards.sample)
    output:
        "mapped_reads/{sample}.bam"
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"
I have tried a million things, including using expand (which works, but then the rule is not called on each file).
Any help will be tremendously appreciated.
Snakemake works by defining which output you want (like you do in rule all). You are very close to a working solution; however, there were some small things that went wrong:
1. Reading the pandas dataframe does not do what you expect (try printing samplesData to see what it did/does). Therefore the expand in rule all does not work properly.
2. You do not need to use lambdas for the input; you can reuse the wildcard.
This should work for your example:
import pandas
import re

input_table = config["samples"]["summaryFile"]
samplesData = pandas.read_csv(input_table, header=None).loc[:, 0].tolist()
samples = [re.findall(r"[^/]+\.", sample)[0][:-1] for sample in samplesData]  # overly complicated regex

rule all:
    input:
        expand("mapped_reads/{sample}.bam", sample=samples)

rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        "mapped_reads/{sample}.bam"
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"
However, I think it would be easiest to change the contents of test.csv. Right now we have to do some weird magic to get the sample name from the file path; it would probably be best to just store the sample names there, as sketched below.
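A sketch of that simpler variant, assuming test.csv holds one bare sample name per line (A, B, ...) instead of full paths:

import pandas

input_table = config["samples"]["summaryFile"]
# test.csv would then contain:
# A
# B
samples = pandas.read_csv(input_table, header=None).loc[:, 0].tolist()

rule all:
    input:
        expand("mapped_reads/{sample}.bam", sample=samples)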

snakemake: write files from an array

I have an array xx = [1,2,3] and I want to use Snakemake to create a list of (empty) files 1.txt, 2.txt, 3.txt.
This is the Snakefile I use:
xx = [1, 2, 3]

rule makefiles:
    output: expand("{f}.txt", f=xx)
    run:
        with open(output, 'w') as file:
            file.write('blank')
However, instead of having three new shiny text files in my folder, I see an error message:
expected str, bytes or os.PathLike object, not OutputFiles
Not sure what I am doing wrong.
Iterate output to get filenames and then write to them. See relevant documentation here.
rule makefiles:
    output: expand("{f}.txt", f=xx)
    run:
        for f in output:
            with open(f, 'w') as file:
                file.write('blank')
Rewriting the above rule to parallelize, by defining target files in rule all:
rule all:
    input:
        expand("{f}.txt", f=xx)

rule makefiles:
    output:
        "{f}.txt"
    run:
        with open(output[0], 'w') as file:
            file.write('blank')