snakemake wildcards or expand command - snakemake

I want a rule to perform realignment between normal and tumor samples. The main problem is that I don't know how to handle the pairing. Is a wildcard or the expand function the answer to my problem?
This is my list of samples:
conditions:
    pair1:
        tumor: "432"
        normal: "433"
So the rule needs to be something like this:
rule gatk_RealignerTargetCreator:
    input:
        expand("mapped_reads/merged_samples/{sample}.sorted.dup.reca.bam", sample=config['conditions']['pair1']['tumor']),
        expand("mapped_reads/merged_samples/{sample}.sorted.dup.reca.bam", sample=config['conditions']['pair1']['normal']),
    output:
        "mapped_reads/merged_samples/{pair1}.realign.intervals"
How can I do this operation for all keys in conditions? (I expect to have more than one pair.)
I have tried this code:
input:
    lambda wildcards: config["conditions"][wildcards.condition],
    tumor = expand("mapped_reads/merged_samples/{tumor}.sorted.dup.reca.bam", tumor=config['conditions'][wildcards.condition]['tumor']),
    normal = expand("mapped_reads/merged_samples/{normal}.sorted.dup.reca.bam", normal=config['conditions'][wildcards.condition]['normal']),
output:
    "mapped_reads/merged_samples/{tumor}/{tumor}_{normal}.realign.intervals"
but this fails with:
name 'wildcards' is not defined
What am I doing wrong?

wildcards is not "directly" available in the input of a rule. You need to use an input function of the wildcards instead. I'm not sure I understand exactly what you want to do, but you may try something like this:
def condition2tumorsamples(wildcards):
    return expand(
        "mapped_reads/merged_samples/{sample}.sorted.dup.reca.bam",
        sample=config['conditions'][wildcards.condition]['tumor'])

def condition2normalsamples(wildcards):
    return expand(
        "mapped_reads/merged_samples/{sample}.sorted.dup.reca.bam",
        sample=config['conditions'][wildcards.condition]['normal'])

rule gatk_RealignerTargetCreator:
    input:
        tumor = condition2tumorsamples,
        normal = condition2normalsamples,
    output:
        "mapped_reads/merged_samples/{condition}.realign.intervals"
    # remainder of the rule here...
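To run this for every pair defined under conditions, a target rule can simply request one intervals file per key; the wildcard-based rule above then does the rest. A minimal sketch, assuming each key under conditions (pair1, pair2, ...) names one tumor/normal pair:

rule all:
    input:
        expand(
            "mapped_reads/merged_samples/{condition}.realign.intervals",
            condition=config["conditions"].keys())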

DISCLAIMER: You want to read your pairings from a YAML file; however, I advise against this. I couldn't figure out how to do it elegantly with YAML formatting. I have an ad-hoc way of pairing my SNP and INDEL annotations, but it needs a lot of boilerplate code just to write the pairings out from the YAML. That was acceptable only because the YAML variable is almost never edited, so maintaining a pedantically formatted string no longer matters in that case.
I think the code you tried is just about right.
What I think is missing is the ability to "request" the correct pairings in your "rule all" input. I personally prefer to do this using Pandas. It is listed on the homepage of the Python Software Foundation, so it's a robust choice.
The pandas setup is very easy to maintain: a single tab- or space-separated file, which is easier for the end user than formatting nested YAML files (which is what I think a YAML-based setup would require). This is how I do it in my system, and it scales indefinitely. I'll admit that accessing the pandas object is a bit tricky, but I've provided the code for you. Just know that in the first layer of objects (the [#] in the sample[1][tumor] call), the [0] element is, I think, just metadata on the file being read; I have yet to find a use for it and otherwise just ignore it.
tree structure of workspace
(CentOS5-Compatible) [tboyarski@login3 Test]$ tree
.
|-- [tboyarsk 620 Aug 4 10:57] Snakefile
|-- [tboyarsk 47 Aug 4 10:52] config.yaml
|-- [tboyarsk 512 Aug 4 10:57] output
| |-- [tboyarsk 0 Aug 4 10:54] ABC.bam
| |-- [tboyarsk 0 Aug 4 10:53] TimNorm.bam
| |-- [tboyarsk 0 Aug 4 10:53] TimTum.bam
| `-- [tboyarsk 0 Aug 4 10:57] XYZ.bam
`-- [tboyarsk 36 Aug 4 10:49] sampleFILEpair.txt
sampleFILEpair.txt (Proof the sample names can be unrelated)
tumor normal
TimTum TimNorm
XYZ ABC
config.yaml
pathDIR: output
sampleFILE: sampleFILEpair.txt
Snakefile
from pandas import read_table
from subprocess import call

configfile: "config.yaml"

# build one target per tumor/normal pair listed in sampleFILEpair.txt
rule all:
    input:
        expand("{pathDIR}/{sample[1][tumor]}_{sample[1][normal]}.bam", pathDIR=config["pathDIR"], sample=read_table(config["sampleFILE"], " ").iterrows())

rule gatk_RealignerTargetCreator:
    input:
        "{pathGRTC}/{normal}.bam",
        "{pathGRTC}/{tumor}.bam",
    output:
        "{pathGRTC}/{tumor}_{normal}.bam"
    # wildcard_constraints:
    #     tumor = '[^_|-|\/][0-9a-zA-Z]*',
    #     normal = '[^_|-|\/][0-9a-zA-Z]*'
    run:
        call('touch ' + wildcards.pathGRTC + '/' + wildcards.tumor + '_' + wildcards.normal + '.bam', shell=True)
With the merging of wildcards, I have in the past found it to be a source of cyclical dependencies, so I always include wildcard_constraints when merging (which is essentially what we are doing). They aren't actually necessary here: the "rule all" contains no wildcards and it is calling "gatk", so in this exact example there is no room for ambiguity. But if this rule connects with other rules that use wildcards, it can generate some funky DAGs.
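As a side note (not part of the original answer, just a sketch under the same sampleFILEpair.txt layout): if the sample[1][tumor] indexing feels too opaque, the same rule all can be written with itertuples(), which gives named attribute access per row.

rule all:
    input:
        expand(
            "{pathDIR}/{pair.tumor}_{pair.normal}.bam",
            pathDIR=config["pathDIR"],
            # itertuples() yields one namedtuple per row, with .tumor and .normal fields
            pair=list(read_table(config["sampleFILE"], " ").itertuples()))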

Related

Nextflow input how to declare tuple in tuple

I am working with a nextflow workflow that, at a certain stage, groups a series of files by their sample id using groupTuple(), resulting in a channel that looks like this:
[sample_id, [file_A, file_B, ... , file_N]]
[sample_id, [file_A, file_B, ... , file_N]]
...
[sample_id, [file_A, file_B, ... , file_N]]
Note that this is the same channel structure that you get from .fromFilePairs().
I want to use these channel items in a process in such a way that, for each item, the process reads the sample_id from the first field and all the files from the inner tuple at once.
The nextflow documentation is somewhat cryptic about this, and it is hard to find how to declare this type of input in a process, so I thought I'd create a question on Stack Overflow and then answer it myself for anyone who ever looks for this answer.
How does one declare the inner tuple in the input section of a nextflow process?
In the example given above, my inner tuple contains items of only one type (files). I can therefore pass the whole second term of the tuple (i.e. the inner tuple) as a single input item under the file() qualifier. Like this:
input:
    tuple \
        val(sample_id), \
        file(inner_tuple) \
        from Input_channel
This will ensure that the tuple content is read as files (one by one), the same way as performing .collect() on a channel of files, in the sense that all files will then be available in the nextflow temp directory where the process is executed.
The question is how you come up with sample_id, but if your files just share a base name and differ in their file extensions, you might use something like this:
all_files = Channel.fromPath("/path/to/your/files/*")

all_files.map { it -> [it.simpleName, it] }
    .groupTuple()
    .set { grouped_files }
The path qualifier (previously the file qualifier) can be used to stage a single (file) value or a collection of (file) values into the process execution directory. The note at the bottom of the multiple input files section in the docs also mentions:
The normal file input constructs introduced in the input of files
section are valid for collections of multiple files as well.
This means, you can use a script variable, e.g.:
input:
    tuple val(sample_id), path(my_files)
In which case, the variable will hold the list of files (preserving the original filenames). You could use it directly to refer to all of the files in the list, or, you could access specific (file) elements (if you need them) using square bracket (slice) notation.
This is the syntax you will want most of the time. However, if you need predictable filenames or if you need to deal with files that have identical filenames, you may need a different approach:
Alternatively, you could specify a target filename, e.g.:
input:
    tuple val(sample_id), path('my_file')
In the case where a single file is received by the process, the file would be staged with the target filename. However, when a collection of files is received by the process, the filename will be appended with a numerical suffix representing its ordinal position in the list. For example:
process test {

    tag { sample_id }
    debug true
    stageInMode 'rellink'

    input:
    tuple val(sample_id), path('fastq')

    """
    echo "${sample_id}:"
    ls -g --time-style=+"" fastq*
    """
}

workflow {

    readgroups = Channel.fromFilePairs( '*_{1,2}.fastq' )

    test( readgroups )
}
Results:
$ touch {foo,bar,baz}_{1,2}.fastq
$ nextflow run .
N E X T F L O W ~ version 22.04.4
Launching `./main.nf` [scruffy_caravaggio] DSL2 - revision: 87a80d6d50
executor > local (3)
[65/66f860] process > test (bar) [100%] 3 of 3 ✔
baz:
lrwxrwxrwx 1 users 20 fastq1 -> ../../../baz_1.fastq
lrwxrwxrwx 1 users 20 fastq2 -> ../../../baz_2.fastq
foo:
lrwxrwxrwx 1 users 20 fastq1 -> ../../../foo_1.fastq
lrwxrwxrwx 1 users 20 fastq2 -> ../../../foo_2.fastq
bar:
lrwxrwxrwx 1 users 20 fastq1 -> ../../../bar_1.fastq
lrwxrwxrwx 1 users 20 fastq2 -> ../../../bar_2.fastq
Note that the names of staged files can be controlled using the * and ? wildcards. See the links above for a table that shows how the wildcards are replaced depending on the cardinality of the input collection.

Snakemake is unable to match wildcard although it's defined and even suggested

I am still very confused about the wildcards concept despite reading the full docs and a few examples, so maybe someone can shed light on this weird behaviour. It might be a bug but it's such a basic example that I am pretty sure I am doing or understanding something wrong.
Here is my Snakefile, which should generate a bunch of files defined in a dictionary that stores the location of each file (those can be served by all kinds of data providers like iRODS, XRootD etc., but that's not important now).
import os

some_files = {
    "foo": "some_location/foo",
    "bar": "another_location/bar",
    "baz": "yet_another_loc/baz"
}

rule all:
    input: ["raw/" + os.path.basename(f) for f in some_files.keys()]

rule generate_files:
    output:
        temp("raw/{fname}")
    shell:
        "echo grabbed file from {some_files[wildcards.fname]} > {output}"
As you can see, I need to use a similar "trick" which was proposed in my previous question (Array of values as input in Snakemake workflows) to force the recognition of the files by adding a rule and listing those (in rule all), which works nicely.
The rule generate_files should then generate (retrieve) those by using the corresponding URL and protocol defined in some_files. For the sake of simplicity, it's now just echoing the origin into the output file.
To achieve this, I thought I could simply use wildcards.fname in the shell section, but when I run the workflow, I get:
░ tamasgal@silentbox-(2):PhD/snakemake  master ●●● snakemake took 16s
░ 08:47:35 > snakemake -c1
Building DAG of jobs...
Using shell: /bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job               count    min threads    max threads
--------------  -------  -------------  -------------
all                   1              1              1
generate_files        3              1              1
total                 4              1              1
Select jobs to execute...
[Fri Feb 18 08:47:38 2022]
rule generate_files:
output: raw/bar
jobid: 2
wildcards: fname=bar
resources: tmpdir=/var/folders/84/mcvklq757tq1nfrkbxvvbq8m0000gn/T
RuleException in line 12 of /Users/tamasgal/Dev/PhD/snakemake/Snakefile:
NameError: The name 'wildcards.fname' is unknown in this context. Please make sure that you defined that variable. Also note that braces not used for variable access have to be escaped by repeating them, i.e. {{print $1}}
If I use fname (and not wildcards.fname), Snakemake proposes to use wildcards.fname, which again, does not work. Here is the output when running with fname in output:
[Fri Feb 18 08:47:48 2022]
rule generate_files:
output: raw/bar
jobid: 2
wildcards: fname=bar
resources: tmpdir=/var/folders/84/mcvklq757tq1nfrkbxvvbq8m0000gn/T
RuleException in line 12 of /Users/tamasgal/Dev/PhD/snakemake/Snakefile:
NameError: The name 'fname' is unknown in this context. Did you mean 'wildcards.fname'?
Why is this happening? The output of the workflow clearly shows that wildcards: fname=bar, so it exists and is defined. Is this a bug?
Hm, you may have to try and get at some_files[wildcards.fname] outside of the shell part? It looks to me like it can tell what the wildcard is supposed to be for the output to be raw/bar, but it can't handle using it to access the dict in the shell part. It seems to me like this could be handled with an input function.
Off the top of my head:
rule generate_files:
    input:
        some_file = lambda wildcards: some_files[wildcards.fname]
    output:
        temp("raw/{fname}")
    shell:
        "echo grabbed file from {input.some_file} > {output}"
EDIT: if it fails because the file isn't local so Snakemake can't find it, you may supply the path to it as a parameter instead:
rule generate_files:
    params:
        some_file = lambda wildcards: some_files[wildcards.fname]
    output:
        temp("raw/{fname}")
    shell:
        "echo grabbed file from {params.some_file} > {output}"

Snakemake variable number of files

I'm in a situation, where I would like to scatter my workflow into a variable number of chunks, which I don't know beforehand. Maybe it is easiest to explain the problem by being concrete:
Someone has handed me FASTQ files demultiplexed using bcl2fastq with the no-lane-splitting option. I would like to split these files according to lane, map each lane individually, and then finally gather everything again. However, I don't know the number of lanes beforehand.
Ideally, I would like a solution like this,
rule split_fastq_file: (...) # results in N FASTQ files
rule map_fastq_file: (...) # do this N times
rule merge_bam_files: (...) # merge the N BAM files
but I am not sure this is possible. The expand function requires me to know the number of lanes, and I can't see how it would be possible to use wildcards for this, either.
I should say that I am rather new to Snakemake, and that I may have completely misunderstood how Snakemake works. It has taken me some time to get used to thinking about things "upside-down" by focusing on output files and then working backwards.
One option is to use checkpoint when splitting the fastqs, so that you can dynamically re-evaluate the DAG at a later point to get the resulting lanes.
Here's an MWE step by step:
Setup and make an example fastq file.
# Requires Python 3.6+ for f-strings, Snakemake 5.4+ for checkpoints
import pathlib
import random

random.seed(1)

rule make_fastq:
    output:
        fastq = touch("input/{sample}.fastq")
Create a random number of lanes between 1 and 9 each with random identifier from 1 to 9. Note that we declare this as a checkpoint, rather than a rule, so that we can later access the result. Also, we declare the output here as a directory specific to the sample, so that we can later glob in it to get the lanes that were created.
checkpoint split_fastq:
    input:
        fastq = rules.make_fastq.output.fastq
    output:
        lane_dir = directory("temp/split_fastq/{sample}")
    run:
        pathlib.Path(output.lane_dir).mkdir(exist_ok=True)
        n_lanes = random.randrange(1, 10)
        lane_numbers = random.sample(range(1, 10), k=n_lanes)
        for lane_number in lane_numbers:
            path = pathlib.Path(output.lane_dir) / f"L00{lane_number}.fastq"
            path.touch()
Do some intermediate processing.
rule map_fastq:
    input:
        fastq = "temp/split_fastq/{sample}/L00{lane_number}.fastq"
    output:
        bam = "temp/map_fastq/{sample}/L00{lane_number}.bam"
    run:
        bam = pathlib.Path(output.bam)
        bam.parent.mkdir(exist_ok=True)
        bam.touch()
To merge all the processed files, we use an input function to access the lanes that were created in split_fastq, so that we can do a dynamic expand on these. We do the expand on the last rule in the chain of intermediate processing steps, in this case map_fastq, so that we ask for the correct inputs.
def get_bams(wildcards):
    lane_dir = checkpoints.split_fastq.get(**wildcards).output[0]
    lane_numbers = glob_wildcards(f"{lane_dir}/L00{{lane_number}}.fastq").lane_number
    bams = expand(rules.map_fastq.output.bam, **wildcards, lane_number=lane_numbers)
    return bams
This input function now gives us easy access to the bam files we wish to merge, however many there are, and whatever they may be called.
rule merge_bam:
    input:
        get_bams
    output:
        bam = "temp/merge_bam/{sample}.bam"
    shell:
        "cat {input} > {output.bam}"
This example runs, and with random.seed(1) it happens to create three lanes (L001, L002, and L005).
If you don't want to use checkpoint, I think you could achieve something similar by creating an input function for merge_bam that opens up the original input fastq, scans the read names for lane info, and predicts what the input files ought to be. This seems less robust, however.
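A rough sketch of that alternative, with loud assumptions: uncompressed FASTQ, standard Illumina read names where the lane is the fourth colon-separated field, and the same paths as the MWE above. It avoids the checkpoint, but has to read the whole input file just to predict the inputs:

def get_bams_without_checkpoint(wildcards):
    # scan the read headers (every 4th line) and collect the lane field
    lanes = set()
    with open(f"input/{wildcards.sample}.fastq") as fastq:
        for i, line in enumerate(fastq):
            if i % 4 == 0:
                lanes.add(line.split(":")[3])
    return expand("temp/map_fastq/{sample}/L00{lane_number}.bam",
                  sample=wildcards.sample, lane_number=sorted(lanes))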

Define input files from csv

I would like to define input file names from different variables extracted from a csv. I have built the following simplified example:
I have a file test.csv:
data/samples/A.fastq
data/samples/B.fastq
I give the path to test.csv in a json config file:
{
    "samples": {
        "summaryFile": "somepath/test.csv"
    }
}
Now I want to run bwa on each file within a rule. My feeling is that I have to use lambda wildcards but I am not sure. My Snakefile looks like this:
#only for bcf_tools
import pandas

input_table = config["samples"]["summaryFile"]
samplesData = pandas.read_csv(input_table)

def returnSamples(table):
    # Have tried different things here but nothing worked
    return table

rule all:
    input:
        expand("mapped_reads/{sample}.bam", sample=samplesData)

rule bwa_map:
    input:
        "data/genome.fa",
        lambda wildcards: returnSamples(wildcards.sample)
    output:
        "mapped_reads/{sample}.bam"
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"
I have tried a million things including using expand (which is working but the rule is not called on each file).
Any help will be tremendously appreciated.
Snakemake works by defining which output you want (like you do in rule all). You are very close to a working solution; however, there were some small things that went wrong:
Reading the pandas dataframe does not do what you expect (try printing samplesData to see what it did/does). Therefore the expand in rule all does not work properly.
You do not need to use lambdas for the input; you can reuse the wildcard.
This should work for your example:
import pandas
import re

input_table = config["samples"]["summaryFile"]
samplesData = pandas.read_csv(input_table, header=None).loc[:, 0].tolist()
samples = [re.findall(r"[^/]+\.", sample)[0][:-1] for sample in samplesData]  # overly complicated regex

rule all:
    input:
        expand("mapped_reads/{sample}.bam", sample=samples)

rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        "mapped_reads/{sample}.bam"
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"
However, I think it would be easiest to change the contents of test.csv. Right now we have to do some weird magic to get the sample name from the file path; it would probably be best to just store the sample names there.
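For illustration, if test.csv simply listed the sample names (A and B, one per line) instead of paths, the Snakefile would shrink to something like this sketch:

import pandas

input_table = config["samples"]["summaryFile"]
samples = pandas.read_csv(input_table, header=None)[0].tolist()  # e.g. ["A", "B"]

rule all:
    input:
        expand("mapped_reads/{sample}.bam", sample=samples)

rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        "mapped_reads/{sample}.bam"
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"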

Missing wildcards in S4 snakemake Object in R

I'm running a workflow with a main Snakefile that includes rules from the rules folder and calls R scripts from those included rules.
Here are a few lines and their specific files:
Snakefile:
import pandas as pd

samples = pd.read_table("samples.csv", header=0, sep=',', index_col=0)

rule extract:
    input:
        'summary/umi_expression_matrix.tsv'

include: "rules/extract_expression_single.smk"
rules/extract_expression_single.smk:
rule merge_umi:
    input:
        expand('summary/{sample}_umi_expression_matrix.tsv', sample=samples.index)
    output:
        'summary/umi_expression_matrix.tsv'
    script:
        "../scripts/merge_counts_single.R"
scripts/merge_counts_single.R:
samples = read.csv('samples.csv', header=TRUE, stringsAsFactors=FALSE)$samples
read_list = c()
for (i in 1:length(samples)){
    temp_matrix = read.table(snakemake@input[[i]][1], header=T, stringsAsFactors = F)
    cell_barcodes = colnames(temp_matrix)[-1]
    colnames(temp_matrix) = c("GENE", paste(samples[i], cell_barcodes, sep = "_"))
    read_list = c(read_list, list(temp_matrix))
}

# Little function that allows to merge unequal matrices
merge.all <- function(x, y) {
    merge(x, y, all=TRUE, by="GENE")
}

read_counts <- Reduce(merge.all, read_list)
read_counts[is.na(read_counts)] = 0
rownames(read_counts) = read_counts[,1]
read_counts = read_counts[,-1]
write.table(read_counts, file=snakemake@output[[1]], sep='\t')
The "clean" way to do it would be to call snakemake#wildcard.sample to attribute sample names to the script. But for some reason snakemake#wildcards is an empty vector.
In python:
print(type(snakemake.wildcards))
print(snakemake.wildcards)
print('done')
gives:
<class 'snakemake.io.Wildcards'>
done
which means it's also empty.
So right now I have to rely on going back to the samples.csv file and getting the sample names there. I will also have to double-check matching indexes, maybe using greps; I don't want the samples and the files to get mixed up.
Any idea why this is happening?
Update:
I've tried adding the sample_name as params to see if this would work and it actually does.
rule merge_umi:
    input:
        expand('summary/{sample}_umi_expression_matrix.tsv', sample=samples.index)
    params:
        sample_name = lambda wildcards: samples.index
    output:
        'summary/umi_expression_matrix.tsv'
    script:
        "../scripts/merge_counts_single.R"
I'm gonna use this for now, but my guess is there is still an issue with the scope of wildcards in included rules. Or maybe I'm doing it wrong.
The idea of using wildcards is to call a rule once for each value of the wildcard. If you use the expand function in the input of a rule, then your rule takes all of the wildcard values and creates a list of strings, which means your rule is invoked just once (not once per wildcard value). By default, expand uses the Python itertools function product, which yields all combinations of the provided wildcard values.
Consequently, you cannot use that wildcard inside your rule any longer: when the rule is invoked, it gets all of the wildcard values and converts them into a list that is handed to your R script just once (not once per wildcard value).
In your case, using wildcards is not suitable, since your merge_umi rule runs only once (not once per wildcard value).
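To make the distinction concrete, here is a minimal sketch reusing the names from the question (the per_sample_stats rule and its script are hypothetical, added only for illustration): a rule whose output contains {sample} runs once per sample and sees snakemake@wildcards, whereas the aggregating merge rule runs once overall, so the sample names have to be passed some other way, e.g. via params as in the update above.

rule per_sample_stats:
    # output contains {sample} -> one job per sample; wildcards.sample is defined in each job
    input:
        'summary/{sample}_umi_expression_matrix.tsv'
    output:
        'summary/{sample}_stats.tsv'
    script:
        "../scripts/per_sample_stats.R"

rule merge_umi:
    # output has no wildcard -> a single job; there is no wildcards.sample to pass on,
    # so hand the names over explicitly instead
    input:
        expand('summary/{sample}_umi_expression_matrix.tsv', sample=samples.index)
    params:
        sample_name = list(samples.index)
    output:
        'summary/umi_expression_matrix.tsv'
    script:
        "../scripts/merge_counts_single.R"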