Define input files from csv - snakemake

I would like to define input file names from different varialbles extracted from a csv. I have built the following simplified example:
I have a file test.csv:
data/samples/A.fastq
data/samples/B.fastq
I give the path to test.csv in a json config file:
{
"samples": {
"summaryFile": "somepath/test.csv"
}
}
Now I want to run bwa on each file within a rule. My feeling is that I have to use lambda wildcards but I am not sure. My Snakefile looks like this:
#only for bcf_tools
import pandas
input_table = config["samples"]["summaryFile"]
samplesData = pandas.read_csv(input_table)
def returnSamples(table):
# Have tried different things here but nothing worked
return table
rule all:
input:
expand("mapped_reads/{sample}.bam", sample= samplesData)
rule bwa_map:
input:
"data/genome.fa",
lambda wildcards: returnSamples(wildcards.sample)
output:
"mapped_reads/{sample}.bam"
shell:
"bwa mem {input} | samtools view -Sb - > {output}"
I have tried a million things including using expand (which is working but the rule is not called on each file).
Any help will be tremendously appreciated.

Snakemake works by defining which output you want (like you do in rule all). You are very close to a working solution, however there were some small things that went wrong:
Reading the pandas dataframe does not do what you expect (try printing the samplesData to see what it did/does). Therefore the expand in rule all does not work properly.
You do not need to use lambdas for the input, you can reuse the wildcard.
This should work for your example:
import pandas
import re
input_table = config["samples"]["summaryFile"]
samplesData = pandas.read_csv(input_table, header=None).loc[:, 0].tolist()
samples = [re.findall("[^/]+\.", sample)[0][:-1] for sample in samplesData] # overly complicated regex
rule all:
input:
expand("mapped_reads/{sample}.bam", sample=samples)
rule bwa_map:
input:
"data/genome.fa",
"data/samples/{sample}.fastq"
output:
"mapped_reads/{sample}.bam"
shell:
"bwa mem {input} | samtools view -Sb - > {output}"
However I think it would be easiest to change the description in test.csv. Now we have to do some weird magic to get the sample name from the file, it would probably be best to just store the sample names there.

Related

Snakemake pipeline not attempting to produce output?

I have a relatively simple snakemake pipeline but when run I get all missing files for rule all:
refseq = 'refseq.fasta'
reads = ['_R1_001', '_R2_001']
def getsamples():
import glob
test = (glob.glob("*.fastq"))
print(test)
samples = []
for i in test:
samples.append(i.rsplit('_', 2)[0])
return(samples)
def getbarcodes():
with open('unique.barcodes.txt') as file:
lines = [line.rstrip() for line in file]
return(lines)
rule all:
input:
expand("grepped/{barcodes}{sample}_R1_001.plate.fastq", barcodes=getbarcodes(), sample=getsamples()),
expand("grepped/{barcodes}{sample}_R2_001.plate.fastq", barcodes=getbarcodes(), sample=getsamples())
wildcard_constraints:
barcodes="[a-z-A-Z]+$"
rule fastq_grep:
input:
R1 = "{sample}_R1_001.fastq",
R2 = "{sample}_R2_001.fastq"
output:
out1 = "grepped/{barcodes}{sample}_R1_001.plate.fastq",
out2 = "grepped/{barcodes}{sample}_R2_001.plate.fastq"
wildcard_constraints:
barcodes="[a-z-A-Z]+$"
shell:
"fastq-grep -i '{wildcards.barcodes}' {input.R1} > {output.out1} && fastq-grep -i '{wildcards.barcodes}' {input.R2} > {output.out2}"
The output files that are listed by the terminal seem correct, so it seems it is seeing what I want to produce but the shell is not making anything at all.
I want to produce a list of files that have grepped the list of barcodes I have in a file. But I get "Missing input files for rule all:"
There are two issues:
You have an impossible wildcard_constraints defined for {barcode}
Your two wildcards {barcode} and {sample} are competing with each other.
Remove the wildcard_constraints from your two rules and add the following lines to the top of your Snakefile:
wildcard_constraints:
barcodes="[A-Z]+",
sample="Well.*",
The constraint for {barcodes} now only matches capital letters. Before it also included end-of-line matching (trailing $) which was impossible to match for this wildcard as you had additional text in the filepath following.
The constraint for {sample} ensures that the path of the filename starting with "Well..." is interpreted as the start of the {sample} wildcard. Else you'd get something unwanted like barcode=ACGGTW instead of barcode=ACGGT.
A note of advice:
I usually find it easier to seperate wildcards into directory structures rather than having multiple wildcards in the same filename. In you case that would mean having a structure like
grepped/{barcode}/{sample}_R1_001.plate.fastq.
Full suggested Snakefile (formatted using snakefmt)
wildcard_constraints:
barcodes="[A-Z]+",
sample="Well.*",
refseq = "refseq.fasta"
reads = ["_R1_001", "_R2_001"]
def getsamples():
import glob
test = glob.glob("*.fastq")
print(test)
samples = []
for i in test:
samples.append(i.rsplit("_", 2)[0])
return samples
def getbarcodes():
with open("unique.barcodes.txt") as file:
lines = [line.rstrip() for line in file]
return lines
rule all:
input:
expand(
"grepped/{barcodes}{sample}_R1_001.plate.fastq",
barcodes=getbarcodes(),
sample=getsamples(),
),
expand(
"grepped/{barcodes}{sample}_R2_001.plate.fastq",
barcodes=getbarcodes(),
sample=getsamples(),
),
rule fastq_grep:
input:
R1="{sample}_R1_001.fastq",
R2="{sample}_R2_001.fastq",
output:
out1="grepped/{barcodes}{sample}_R1_001.plate.fastq",
out2="grepped/{barcodes}{sample}_R2_001.plate.fastq",
shell:
"fastq-grep -i '{wildcards.barcodes}' {input.R1} > {output.out1} && fastq-grep -i '{wildcards.barcodes}' {input.R2} > {output.out2}"
In addition to #euronion's answer (+1), I prefer to constrain wildcards to match only and exactly the list of values you expect. This means disabling the regex matching altogether. In your case, I would do something like:
wildcard_constraints:
barcodes='|'.join([re.escape(x) for x in getbarcodes()]),
sample='|'.join([re.escape(x) for x in getsamples()]),
now {barcodes} is allowed to match only the values in getbarcodes(), whatever they are, and the same for {sample}. In my opinion this is better than anticipating what combination of regex a wildcard can take.

Snakemake input two variables and output one variable

I want to rename and move my fastq.gz files from these:
NAME-BOB_S1_L001_R1_001.fastq.gz
NAME-BOB_S1_L001_R2_001.fastq.gz
NAME-JOHN_S2_L001_R1_001.fastq.gz
NAME-JOHN_S2_L001_R2_001.fastq.gz
to these:
NAME_BOB/reads/NAME_BOB.R1.fastq.gz
NAME_BOB/reads/NAME_BOB.R2.fastq.gz
NAME_JOHN/reads/NAME_JOHN.R1.fastq.gz
NAME_JOHN/reads/NAME_JOHN.R2.fastq.gz
This is my code. The problem I have is the second variable S which I do not know how to specify in the code as I do not need it in my output filename.
workdir: "/path/to/workdir/"
DIR=["BOB","JOHN"]
S=["S1","S2"]
rule all:
input:
expand("NAME_{dir}/reads/NAME_{dir}.R1.fastq.gz", dir=DIR),
expand("NAME_{dir}/reads/NAME_{dir}.R2.fastq.gz", dir=DIR)
rule rename:
input:
fastq1=("fastq/NAME-{dir}_{s}_L001_R1_001.fastq.gz", zip, dir=DIR, s=S),
fastq2=("fastq/NAME-{dir}_{s}_L001_R2_001.fastq.gz", zip, dir=DIR, s=S)
output:
fastq1="NAME_{dir}/reads/NAME_{dir}.R1.fastq.gz",
fastq2="NAME_{dir}/reads/NAME_{dir}.R2.fastq.gz"
shell:
"""
mv {input.fastq1} {output.fastq1}
mv {input.fastq2} {output.fastq2}
"""
There are several problems in your code. First of all, the {dir} in your output and {dir} in your input are two different variables. Actually the {dir} in the output is a wildcard, while the {dir} in the input is a parameter for the expand function (moreover, you even forgot to call this function, and that is the second problem).
The third problem is that the shell section shall contain only a single command. You may try mv {input.fastq1} {output.fastq1}; mv {input.fastq2} {output.fastq2}, but this is not an idiomatic solution. Much better would be to create a rule that produces a single file, letting Snakemake to do the rest of the work.
Finally the S value fully depend on the DIR value, so it becomes a function of {dir}, and that can be solved with a lambda in input:
workdir: "/path/to/workdir/"
DIR=["BOB","JOHN"]
dir2s = {"BOB": "S1", "JOHN": "S2"}
rule all:
input:
expand("NAME_{dir}/reads/NAME_{dir}.{r}.fastq.gz", dir=DIR, r=["R1", "R2"])
rule rename:
input:
lambda wildcards:
"fastq/NAME-{{dir}}_{s}_L001_{{r}}_001.fastq.gz".format(s=dir2s[wildcards.dir])
output:
"NAME_{dir}/reads/NAME_{dir}.{r}.fastq.gz",
shell:
"""
mv {input} {output}
"""

conditional execution of snakemake rules based on column in metatable

I'm trying to use a column in a text file to conditionally execute rules in a snakemake workflow.
The text file is as follows:
id end sample_name fq1 fq2
a paired test_paired resources/SRR1945436_1.fastq.gz resources/SRR1945436_2.fastq.gz
b single test_single resources/SRR1945436.fastq.gz NA
For each sample in the text file, if value in end column is paired I would like to use rule cp_fastq_pe and if end is single then I would like to use rule cp_fastq_pe to process the fq1 & fq2 or just fq1 files, respectively.
relevant part of Snakefile is as follows:
import pandas as pd
samples = pd.read_table("config/samples.tsv").set_index("id", drop=False)
all_ids=list(samples["id"])
rule cp_fastq_pe:
"""
copy file to resources
"""
input:
fq1=lambda wildcards: samples.loc[wildcards.id, "fq1"],
fq2=lambda wildcards: samples.loc[wildcards.id, "fq2"]
output:
"resources/fq/{id}_1.fq.gz",
"resources/fq/{id}_2.fq.gz"
shell:
"""
cp {input.fq1} {output[0]}
cp {input.fq2} {output[1]}
"""
rule cp_fastq_se:
"""
copy file to resources
"""
input:
fq1=lambda wildcards: samples.loc[wildcards.id, "fq1"]
output:
"resources/fq/{id}.fq.gz",
shell:
"""
cp {input.fq1} {output}
"""
Is it possible to do this?
I had a similar problem, which I solved here: How to make Snakemake input optional but not empty?
Here is the idea adjusted to your problem. First, you need to specify the ruleorder to resolve the ambiguity (otherwise the single could always be applied whenever the paired is possible):
ruleorder: cp_fastq_pe > cp_fastq_se
Next, in your cp_fastq_pe rule you need to define a function that either returns a valid file (for the paired case) or returns a placeholder for non-existing file:
rule cp_fastq_pe:
input:
fq1=lambda wildcards: samples.loc[wildcards.id, "fq1"],
fq2=lambda wildcards: samples.loc[wildcards.id, "fq2"] if "fq2" in samples else "non-existing-filename"
This rule would be applied to all samples wherever "fq2" field exists and represents a valid file. The other rule would be selected to the rest of the samples.

Snakemake: Exchanging variables

I have some ONT sequencing runs that have been basecalled on the MINIT. As such, when I demultiplex with guppy_barcoder, I get a directory of fastq files for each barcode. I want to use snakemake as a workflow manager to take these fastq files through our analyses, but this involves swapping the {barcode} for {sample} at some point.
BARCODE=['barcode01', 'barcode02', 'barcode03', 'barcode04']
SAMPLE=['sample01', 'sample02', 'sample03', 'sample04']
rule all:
input:
directory(expand("Sequencing_reads/demultiplexed/{barcode}", barcode=BARCODE)), #guppy_barcoder
expand("Sequencing_reads/gathered/{sample}_ONT.fastq", sample=SAMPLE), #getting all of the fastq files with the same barcode assigned to the correct sample
rule demultiplex:
input:
glob.glob("Sequencing_reads/fastq_pass/*fastq")
output:
directory(expand("Sequencing_reads/demultiplexed/{barcode}", barcode=BARCODE))
shell:
"guppy_barcoder --input_path Sequencing_reads/fastq_pass --save_path Sequencing_reads/demultiplexed -r "
rule gather:
input:
rules.demultiplex.output
output:
"Sequencing_reads/gathered/{sample}_ONT.fastq"
shell:
"cat Sequencing_reads/demultiplexed/{wildcards.barcode}/*fastq > {output.fastq} "
This does give me an error:
RuleException in line 32 of /home/eriny/sandbox/ONT_unicycler_pipeline/ONT_pipeline.smk:
'Wildcards' object has no attribute 'barcode'
But I actually think I'm missing something conceptually. I would like rule gather to be something like:
cat Sequencing_reads/demultiplexed/barcode01/*fastq > Sequencing_reads/gathered/sample01_ONT.fastq
I have tried setting up some dictionaries so that sample and barcode are given the same key, but my syntax must be broken.
I'm hoping to find a 1:1 way to map one variable name onto another.
I'm hoping to find a 1:1 way to map one variable name onto another.
I think the sample to dictionary is a possibility combined with a lambda as input function to get the barcode assign to a sample. For example:
BARCODE=['barcode01', 'barcode02', 'barcode03', 'barcode04']
SAMPLE=['sample01', 'sample02', 'sample03', 'sample04']
sam2bar= dict(zip(SAMPLE, BARCODE))
rule all:
input:
expand("Sequencing_reads/gathered/{sample}_ONT.fastq", sample=SAMPLE), #getting all of the fastq files with the same barcode assigned to the correct sample
rule demultiplex:
input:
glob.glob("Sequencing_reads/fastq_pass/*fastq"),
output:
done= touch('demux.done'), # This signals that guppy has completed
shell:
"guppy_barcoder --input_path Sequencing_reads/fastq_pass --save_path Sequencing_reads/demultiplexed -r "
rule gather:
input:
done= 'demux.done',
fastq= lambda wc: glob.glob("Sequencing_reads/demultiplexed/%s/*fastq" % sam2bar[wc.sample])
output:
fastq= "Sequencing_reads/gathered/{sample}_ONT.fastq"
shell:
"cat {input.fastq} > {output.fastq} "

Snakemake shell command should only take one file at a time, but it's trying to do multiple files at once

First off, I'm sorry if I'm not explaining my problem clearly, English is not my native language.
I'm trying to make a snakemake rule that takes a fastq file and filters it with a program called Filtlong. I have multiple fastq files on which I want to run this rule and it should output a filtered file per fastq file but apparently it takes all of the fastq files as input for a single Filtlong command.
The fastq files are in separate directories and the snakemake rule should write the filtered files to separate directories aswell.
This is how my code looks right now:
from os import listdir
configfile: "config.yaml"
DATA = config["DATA"]
SAMPLES = listdir(config["RAW_DATA"])
RAW_DATA = config["RAW_DATA"]
FILT_DIR = config["FILTERED_DIR"]
rule all:
input:
expand("{FILT_DIR}/{sample}/{sample}_filtered.fastq.gz", FILT_DIR=FILT_DIR, sample=SAMPLES)
rule filter_reads:
input:
expand("{RAW_DATA}/{sample}/{sample}.fastq", sample=SAMPLES, RAW_DATA=RAW_DATA)
output:
"{FILT_DIR}/{sample}/{sample}_filtered.fastq.gz"
shell:
"filtlong --keep_percent 90 --target_bases 300000000 {input} | gzip > {output}"
And this is the config file:
DATA:
all_samples
RAW_DATA:
all_samples/raw_samples
FILTERED_DIR:
all_samples/filtered_samples
The separate directories with the fastq files are in RAW_DATA and the directories with the filtered files should be in FILTERED_DIR,
When I try to run this, I get an error that looks something like this:
Error in rule filter_reads:
jobid: 30
output: all_samples/filtered_samples/cell_18-07-19_barcode10/cell_18-07-19_barcode10_filtered.fastq.gz
shell:
filtlong --keep_percent 90 --target_bases 300000000 all_samples/raw_samples/cell3_barcode11/cell3_barcode11.fastq all_samples/raw_samples/barcode01/barcode01.fastq all_samples/raw_samples/barcode03/barcode03.fastq all_samples/raw_samples/barcode04/barcode04.fastq all_samples/raw_samples/barcode05/barcode05.fastq all_samples/raw_samples/barcode06/barcode06.fastq all_samples/raw_samples/barcode07/barcode07.fastq all_samples/raw_samples/barcode08/barcode08.fastq all_samples/raw_samples/barcode09/barcode09.fastq all_samples/raw_samples/cell3_barcode01/cell3_barcode01.fastq all_samples/raw_samples/cell3_barcode02/cell3_barcode02.fastq all_samples/raw_samples/cell3_barcode03/cell3_barcode03.fastq all_samples/raw_samples/cell3_barcode04/cell3_barcode04.fastq all_samples/raw_samples/cell3_barcode05/cell3_barcode05.fastq all_samples/raw_samples/cell3_barcode06/cell3_barcode06.fastq all_samples/raw_samples/cell3_barcode07/cell3_barcode07.fastq all_samples/raw_samples/cell3_barcode08/cell3_barcode08.fastq all_samples/raw_samples/cell3_barcode09/cell3_barcode09.fastq all_samples/raw_samples/cell3_barcode10/cell3_barcode10.fastq all_samples/raw_samples/cell3_barcode12/cell3_barcode12.fastq all_samples/raw_samples/cell_18-07-19_barcode01/cell_18-07-19_barcode01.fastq all_samples/raw_samples/cell_18-07-19_barcode02/cell_18-07-19_barcode02.fastq all_samples/raw_samples/cell_18-07-19_barcode03/cell_18-07-19_barcode03.fastq all_samples/raw_samples/cell_18-07-19_barcode04/cell_18-07-19_barcode04.fastq all_samples/raw_samples/cell_18-07-19_barcode05/cell_18-07-19_barcode05.fastq all_samples/raw_samples/cell_18-07-19_barcode06/cell_18-07-19_barcode06.fastq all_samples/raw_samples/cell_18-07-19_barcode07/cell_18-07-19_barcode07.fastq all_samples/raw_samples/cell_18-07-19_barcode08/cell_18-07-19_barcode08.fastq all_samples/raw_samples/cell_18-07-19_barcode09/cell_18-07-19_barcode09.fastq all_samples/raw_samples/cell_18-07-19_barcode10/cell_18-07-19_barcode10.fastq all_samples/raw_samples/cell_18-07-19_barcode11/cell_18-07-19_barcode11.fastq all_samples/raw_samples/cell_18-07-19_barcode12/cell_18-07-19_barcode12.fastq all_samples/raw_samples/cell_18-07-19_barcode13/cell_18-07-19_barcode13.fastq all_samples/raw_samples/cell_18-07-19_barcode14/cell_18-07-19_barcode14.fastq all_samples/raw_samples/cell_18-07-19_barcode15/cell_18-07-19_barcode15.fastq all_samples/raw_samples/cell_18-07-19_barcode16/cell_18-07-19_barcode16.fastq all_samples/raw_samples/cell_18-07-19_barcode17/cell_18-07-19_barcode17.fastq all_samples/raw_samples/cell_18-07-19_barcode18/cell_18-07-19_barcode18.fastq all_samples/raw_samples/cell_18-07-19_barcode19/cell_18-07-19_barcode19.fastq | gzip > all_samples/filtered_samples/cell_18-07-19_barcode10/cell_18-07-19_barcode10_filtered.fastq.gz
(exited with non-zero exit code)
As far as I can tell, the rule takes all of the fastq files as input for a single Filtlong command, but I don't quite understand why
You shouldn't use the expand function in your input section of the filter_reads rule. What you are doing now is requiring all your samples to be the input of each filtered file: that is what you can observe in your error message.
There is another complication that you introduce out of nothing: you mix both wildcards and variables. In your example the {FILT_DIR} is just a predefined value while the {sample} is a wildcard that Snakemake uses to match the rules. Try the following (pay special attention on single/double brackets and on the formatted string (the one that has the form f"")):
rule filter_reads:
input:
f"{RAW_DATA}/{{sample}}/{{sample}}.fastq"
output:
f"{FILT_DIR}/{{sample}}/{{sample}}_filtered.fastq.gz"
shell:
"filtlong --keep_percent 90 --target_bases 300000000 {input} | gzip > {output}"