Sample Input from file - snakemake

I am trying to create the input for rules from a sample file. The sample file contains a column SampleID, which should be used as the sample wildcard. For each SampleID, I want to extract the paths of the normal and tumor BAMs from the columns Path_Normal and Path_Tumor of the data frame.
This is what I tried:
import pandas as pd
input_table = "sampletable.tsv"
samples = pd.read_table(input_table).set_index("SampleID", drop=False)
rule all:
    input:
        expand("/directory/sm_mutect2_paired/vcf/{sample}.mt2.vcf", sample=samples.index)
rule Mutect2:
    input:
        tumor = samples[samples['SampleID']=="{sample}"]['Path_Tumor'],
        normal = samples[samples['SampleID']=="{sample}"]['Path_Normal']
    output:
        "/directory/sm_mutect2_paired/vcf/{sample}.mt2.vcf"
    conda:
        "envs/gatk_mutect2_paired.yaml"
    shell:
        "gatk --java-options '-Xmx16G -XX:+UseParallelGC -XX:ParallelGCThreads=16' Mutect2 \
        -R /directory/ref/genomics-public-data/references/hg38/v0/Homo_sapiens_assembly38.fasta \
        {input.tumor} \
        {input.normal} \
        -L /directory/GATK_interval_files_Agilent/S07604514_hs_hg38/S07604514_Covered.bed \
        -O {output} \
        --af-of-alleles-not-in-resource 2.5e-06 \
        --germline-resource /directory/GATK_gnomad/af-only-gnomad.hg38.vcf.gz \
        -pon /home/zyto/unger/GATK_PoN/1000g_pon.hg38.vcf.gz"
...
When doing a dry run I do not get an error message, but the execution fails because the input is empty, which becomes apparent when looking at the log:
gatk --java-options '-Xmx16G -XX:+UseParallelGC -XX:ParallelGCThreads=16' Mutect2 -R /directory/GATK_ref/genomics-public-data/references/hg38/v0/Homo_sapiens_assembly38.fasta -L /directory/GATK_interval_files_Agilent/S07604514_hs_hg38/S07604514_Covered.bed -O /directory/WES_Rezidiv_HNSCC_Clonality/sm_mutect2_paired/vcf/HL05_Rez_HL05_NG.mt2.vcf --af-of-alleles-not-in-resource 2.5e-06 --germline-resource /directory/GATK_gnomad/af-only-gnomad.hg38.vcf.gz -pon /directory/GATK_PoN/1000g_pon.hg38.vcf.gz
The two input files should appear between "Mutect2" and "-R".
So it looks like I am doing something wrong when defining the inputs...

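Inside the input: section, "{sample}" is just a literal string in a Python expression, so the comparison matches no row and the selection is empty. You need to defer the determination of the input files of that rule to the so-called DAG phase, when jobs and wildcard values are known. This works via input functions. I strongly recommend working through the official Snakemake tutorial, which covers this topic in depth.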
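As a rough illustration of the input-function idea, the sketch below looks up the BAM paths per sample once the wildcard value is known. It uses only the standard library so it runs on its own; the table contents, the paths, and the function names get_tumor and get_normal are made up for the example, and SimpleNamespace merely stands in for the wildcards object Snakemake would pass.

```python
import csv
import io
from types import SimpleNamespace

# Hypothetical stand-in for sampletable.tsv; the paths are made up.
SAMPLE_TABLE = (
    "SampleID\tPath_Tumor\tPath_Normal\n"
    "S1\t/data/S1_tumor.bam\t/data/S1_normal.bam\n"
    "S2\t/data/S2_tumor.bam\t/data/S2_normal.bam\n"
)

# Index the rows by SampleID so each lookup is a plain dict access.
samples = {
    row["SampleID"]: row
    for row in csv.DictReader(io.StringIO(SAMPLE_TABLE), delimiter="\t")
}

def get_tumor(wildcards):
    """Input function: called per job, once the value of {sample} is known."""
    return samples[wildcards.sample]["Path_Tumor"]

def get_normal(wildcards):
    """Same lookup for the matched normal BAM."""
    return samples[wildcards.sample]["Path_Normal"]

# In the Snakefile, the rule would reference the functions themselves:
#
#   rule Mutect2:
#       input:
#           tumor = get_tumor,
#           normal = get_normal
#       ...

# Simulate the wildcards object for the job with sample == "S1".
wc = SimpleNamespace(sample="S1")
print(get_tumor(wc), get_normal(wc))
```

With the pandas data frame from the question (indexed by SampleID), the body of such a function could presumably be samples.loc[wildcards.sample, "Path_Tumor"] instead of the dict lookup.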

Related

How to reference input in params section of snakemake rule?

I need to process my input file values, turning them into a comma-separated string (instead of white space) in order to pass them to a CLI program. To do this, I want to run the input files through a Python function. How can I reference the input files of a rule in the params section of the same rule?
This is what I've tried, but it doesn't work:
rule a:
    input:
        foo="a.txt",
        bar=expand("{build}.txt", build=config["build"])
    output:
        baz="result.txt"
    params:
        joined_bar=lambda w: ",".join(input.bar)  # this doesn't work
    shell:
        """
        qux --comma-separated-files {params.joined_bar} \
            --foo {input.foo} \
            >{output.baz}
        """
It fails with:
InputFunctionException:
AttributeError: 'builtin_function_or_method' object has no attribute 'bar'
Potentially related but (over-)complicated questions:
How to define parameters for a snakemake rule with expand input
Is Snakemake params function evaluated before input file existence?
Turns out I need to explicitly add input to the lambda w: part:
rule a:
    input:
        foo="a.txt",
        bar=expand("{build}.txt", build=config["build"])
    output:
        baz="result.txt"
    params:
        joined_bar=lambda w, input: ",".join(input.bar)  # ', input' was added
    shell:
        """
        qux --comma-separated-files {params.joined_bar} \
            --foo {input.foo} \
            >{output.baz}
        """
Note that the extra argument must be named exactly input, as in lambda w, input: Snakemake matches these additional arguments by name. In my testing, lambda w, i did not work.
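The original error arises because, inside lambda w:, the name input refers to Python's built-in input function, which has no attribute bar. To show why the parameter name matters, here is a small sketch of name-based argument passing. It is not Snakemake's actual implementation; call_params_function and job_values are hypothetical names used only for illustration.

```python
import inspect

def call_params_function(func, wildcards, **available):
    """Pass only those extra arguments whose names the function declares."""
    names = list(inspect.signature(func).parameters)
    # The first positional parameter always receives the wildcards.
    extras = {n: available[n] for n in names[1:] if n in available}
    return func(wildcards, **extras)

# Hypothetical per-job values, keyed by the names a function may request.
job_values = {"input": {"bar": ["hg19.txt", "hg38.txt"]}, "threads": 1}

# A parameter named "input" is recognised and filled in.
ok = call_params_function(
    lambda w, input: ",".join(input["bar"]), None, **job_values
)
print(ok)  # hg19.txt,hg38.txt

# A parameter named "i" matches nothing, so the call is missing an argument.
try:
    call_params_function(lambda w, i: ",".join(i["bar"]), None, **job_values)
except TypeError as err:
    print("fails:", err)
```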
An alternative is to refer to the rule input in the standard way, via rules.a.input.bar:
rule a:
    input:
        foo="a.txt",
        bar=expand("{build}.txt", build=config["build"])
    output:
        baz="result.txt"
    params:
        joined_bar=lambda w: ",".join(rules.a.input.bar)  # 'rules.a.' was added
    shell:
        """
        qux --comma-separated-files {params.joined_bar} \
            --foo {input.foo} \
            >{output.baz}
        """
Also see http://biolearnr.blogspot.com/2017/11/snakemake-using-inputoutput-values-in.html for a discussion.

Have snakemake skip rules based on input selection

I have a massive Snakefile. The bits that are likely important are these below. I want to make this a bit more flexible with input files.
In the run.ini file, if lanenumlanelaeve = 1, I want Snakemake to start at rule cutadapt (as the samples would have been merged already) with the corresponding input files; otherwise it should follow the normal flow of rules with their corresponding input files. I know an if/else needs to be placed somewhere, but I don't see how or where to add it. Maybe add something to the config file?
import configparser
import glob
# config file
configfile: 'rna.config.yaml'
# check run.ini file for various things
runini = configparser.ConfigParser()
runini.read('../Run/ini')
ss = runini['File']['SS']
rule all:
    input: complete
def fastq(wildcards):
    names = glob.glob(config['fq_glob'] % wildcards.sampleID)
    return sorted(names)
rule merge:
    input: fastq
    output:
        'merged_{sampleID}_merged_R1.fastq.gz',
        'merged_{sampleID}_merged_R2.fastq.gz'
    threads: 8
    params:
        r1 = config['pari_id'][0],
        r2 = config['pari_id'][1]
    run:
        r1 = [x for x in input if params.r1 in x]
        r2 = [x for x in input if params.r2 in x]
        shell('cat %s > {output[0]}' % ' '.join(r1))
        shell('cat %s > {output[1]}' % ' '.join(r2))
rule cutadapt:
    input: rules.merge.output
    output:
        r1 = 'trimmed/{sampleID}_trimmed_R1.fastq.gz',
        r2 = 'trimmed/{sampleID}_trimmed_R2.fastq.gz'
    log: 'multiqc/cutadapt/{sampleID}.cutadapt.log'
    threads: 16
    params: adapter = config['adapter_fa']
    run:
        shell('cutadapt -b {params.adapter} -B {params.adapter} \
            --cores={threads} \
            --minimum-length=20 \
            -q 20 \
            -o {output.r1} \
            -p {output.r2} \
            {input} > {log}')
It's not clear from the snippet you posted if the following would work, since a lot of values and relations have to be guessed. One possibility is to add an explicit python conditional statement along these lines:
if myvar == 1:
    rule x:
        input: some_files,
        output: processed_files,
else:
    rule y:
        input: other_files,
        output: processed_files,
This type of conditional rule definition can be avoided by writing more comprehensive rule definitions, but that would require knowing the full workflow.
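One way such a more comprehensive definition might look is a single cutadapt rule whose input function picks its files depending on a flag. The sketch below is only a guess at the intent: the flag, the premerged/ directory, and the name make_cutadapt_input are hypothetical, while the merged file names follow the outputs of rule merge in the question.

```python
from types import SimpleNamespace

def make_cutadapt_input(already_merged):
    """Build an input function for rule cutadapt based on a flag."""
    def cutadapt_input(wildcards):
        sid = wildcards.sampleID
        if already_merged:
            # Hypothetical location of FASTQ files that were merged earlier.
            return [
                f"premerged/{sid}_R1.fastq.gz",
                f"premerged/{sid}_R2.fastq.gz",
            ]
        # Otherwise request the outputs of rule merge, so that rule is run.
        return [
            f"merged_{sid}_merged_R1.fastq.gz",
            f"merged_{sid}_merged_R2.fastq.gz",
        ]
    return cutadapt_input

# In the Snakefile this could be wired up as (flag value assumed to be
# read from the run.ini file beforehand):
#
#   rule cutadapt:
#       input: make_cutadapt_input(already_merged)
#       ...

wc = SimpleNamespace(sampleID="s1")
print(make_cutadapt_input(True)(wc))
print(make_cutadapt_input(False)(wc))
```

Because the merged file names are only requested in the second branch, rule merge would never be scheduled when the flag is set.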

Snakemake: HISAT2 index build and alignment using touch

Following my previous question: Snakemake: HISAT2 alignment of many RNAseq reads against many genomes UPDATED.
I wanted to run the hisat2 alignment using touch in snakemake.
I have several genome files with suffix .1.ht2l to .8.ht2l
bob.1.ht2l
...
bob.8.ht2l
steve.1.ht2l
...
steve.8.ht2l
and several RNAseq samples
flower_kevin_1.fastq.gz
flower_kevin_2.fastq.gz
flower_daniel_1.fastq.gz
flower_daniel_2.fastq.gz
I need to align all rnaseq reads against each genome.
workdir: "/path/to/dir/"
(HISAT2_INDEX_PREFIX,) = glob_wildcards('/path/to/dir/{prefix}.fasta')
(SAMPLES,) = glob_wildcards("/path/to/dir/{sample}_1.fastq.gz")
rule all:
    input:
        expand("{prefix}.{sample}.bam", zip, prefix=HISAT2_INDEX_PREFIX, sample=SAMPLES)
rule hisat2_build:
    input:
        database="/path/to/dir/{prefix}.fasta"
    output:
        done = touch("{prefix}")
    threads: 2
    shell:
        ("/Tools/hisat2-2.1.0/hisat2-build -p {threads} {input.database} {wildcards.prefix}")
rule hisat2:
    input:
        hisat2_prefix_done = "{prefix}",
        fastq1="/path/to/dir/{sample}_1.fastq.gz",
        fastq2="/path/to/dir/{sample}_2.fastq.gz"
    output:
        bam = "{prefix}.{sample}.bam",
        txt = "{prefix}.{sample}.txt",
    log: "{prefix}.{sample}.snakemake_log.txt"
    threads: 50
    shell:
        "/Tools/hisat2-2.1.0/hisat2 -p {threads} -x {wildcards.prefix}"
        " -1 {input.fastq1} -2 {input.fastq2} --summary-file {output.txt} |"
        "/Tools/samtools-1.9/samtools sort -@ {threads} -o {output.bam}"
The output gives me bob and steve aligned ONLY against ONE RNA-seq sample (i.e. flower_kevin). I don't know how to solve this. Any suggestions would be helpful.
I solved the problem by removing zip from rule all: with zip, expand pairs the two lists element by element instead of producing every prefix/sample combination. Criticism of the code's syntax is still welcome.
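The difference between the two behaviours of expand can be mimicked in plain Python with zip and itertools.product. This is only an analogy, using the genome and sample names from the question:

```python
from itertools import product

prefixes = ["bob", "steve"]
samples = ["flower_kevin", "flower_daniel"]

# Like expand(..., zip, ...): pair the lists element by element.
zipped = [f"{p}.{s}.bam" for p, s in zip(prefixes, samples)]

# Like expand(...) without zip: every combination of the values.
combined = [f"{p}.{s}.bam" for p, s in product(prefixes, samples)]

print(zipped)    # 2 targets, one per (prefix, sample) pair
print(combined)  # 4 targets, each genome against each sample
```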

output of one rule as input of another

I am new to snakemake and I'm trying to write a complex pipeline with many steps and branching points. One of the earlier steps is a STAR alignment.
Here I want to use the genome alignment for some things and the transcriptome alignment for others. The rule produces two output files, and I want to use each of them as input for other rules in Snakemake.
If possible I want to avoid using actual filenames and I want snakemake to deal with it for me.
rule star:
    input:
        reads=samples.iloc[0,1].split(",")
    output:
        tx_align=temp("/".join([output_directory, "star", samplename+"Aligned.toTranscriptome.out.bam"])),
        genome_align="/".join([output_directory, "star", samplename+"Aligned.sortedByCoord.out.bam"])
    params:
        index=config["resources"]["star_index"],
        gtf=config["resources"]["gtf"],
        prefix="/".join([output_directory, "star", samplename])
    log: config["root_dir"]+"/"+str(samples["samplename"])+"star.log"
    threads: 10
    shell:
        """
        STAR --runMode alignReads \
            --runThreadN {threads} \
            --readFilesCommand zcat \
            --readFilesIn {input.reads} \
            --genomeDir {params.index} \
            --outFileNamePrefix {params.prefix} \
            --twopassMode Basic \
            --sjdbGTFfile {params.gtf} \
            --outFilterType BySJout \
            --limitSjdbInsertNsj 1200000 \
            --outSAMstrandField intronMotif \
            --outFilterIntronMotifs None \
            --alignSoftClipAtReferenceEnds Yes \
            --quantMode TranscriptomeSAM GeneCounts \
            --outSAMtype BAM SortedByCoordinate \
            --outSAMattrRGline ID:{samplename}, SM:sm1 \
            --outSAMattributes All \
            --outSAMunmapped Within \
            --outSAMprimaryFlag AllBestScore \
            --chimSegmentMin 15 \
            --chimJunctionOverhangMin 15 \
            --chimOutType Junctions \
            --chimMainSegmentMultNmax 1 \
            --genomeLoad NoSharedMemory
        """
So then I want to use something like
rule rsem:
    input:
        rules.star.output[0]
    output:
        somefile
    run:
        etc
I'm not even sure if this is possible.
EDIT:
Never mind, here is the solution:
rule rule1:
    input: "some_input"
    output:
        out1="output1",
        out2="output2"
    shell:
        "command {input} {output.out1} {output.out2}"
rule rule2:
    input: rules.rule1.output.out1

Conditional execution of multiplexed analysis with snakemake

I'm having some trouble with Snakemake, and so far I haven't found any pertinent information in the documentation (or elsewhere).
I have a big file with different samples (multiplexed analyses), and I would like to stop the execution of the pipeline for some samples depending on the results found after certain rules.
I've already tried changing this value outside of a rule definition (using a checkpoint or a def), making the input of the following rules conditional, and treating the wildcards as a simple list from which to delete one item.
Below is an example of what I want to do (the conditional if is only indicative here):
# Import the config file(s)
configfile: "../PATH/configfile.yaml"
# Wildcards
sample = config["SAMPLE"]
lauch = config["LAUCH"]
# Rules
rule all:
    input:
        expand("PATH_TO_OUTPUT/{lauch}.{sample}.output", lauch=lauch, sample=sample)
rule one:
    input:
        "PATH_TO_INPUT/{lauch}.{sample}.input"
    output:
        temp("PATH_TO_OUTPUT/{lauch}.{sample}.output.tmp")
    shell:
        """
        somescript.sh {input} {output}
        """
rule two:
    input:
        "PATH_TO_OUTPUT/{lauch}.{sample}.output.tmp"
    output:
        "PATH_TO_OUTPUT/{lauch}.{sample}.output"
    shell:
        """
        somecheckpoint.sh {input} # Print a message and write in the log file for now
        if [ file_dont_pass_checkpoint ]; then
            # Delete the sample corresponding to the wildcard {sample}
            # to continue the analysis only with samples that passed the validation
        fi
        somescript2.sh {input} {output}
        """
If someone has an idea I'm interested.
Thank you in advance for your answers.
I think this is an interesting situation if I understand it correctly. If a sample passes some checks, then keep analysing it. Otherwise, stop early.
At the end of the pipeline, every sample must have a PATH_TO_OUTPUT/{lauch}.{sample}.output, since this is what the rule all asks for regardless of the check results.
You could have the rule(s) performing the checks write a file containing a flag that indicates whether the checks passed for that sample (say, PASS or FAIL). Then, according to that flag, the rule(s) doing the analysis either run the full analysis (if PASS) or write an empty file (or whatever) if the flag is FAIL. Here's the gist:
rule all:
    input:
        expand('{sample}.output', sample=samples),
rule checker:
    input:
        '{sample}.input',
    output:
        '{sample}.check',
    shell:
        r"""
        if [ some_check_is_ok ]
        then
            echo "PASS" > {output}
        else
            echo "FAIL" > {output}
        fi
        """
rule do_analysis:
    input:
        chk='{sample}.check',
        smp='{sample}.input',
    output:
        '{sample}.output',
    shell:
        r"""
        if grep -q "PASS" {input.chk}
        then
            do_long_analysis.sh {input.smp} > {output}
        else
            > {output}  # Do nothing: empty file
        fi
        """
If you don't want to see the failed, empty output files at all, you could use the onsuccess directive to get rid of them at the end of the pipeline:
import os  # needed for getsize/remove; place at the top of the Snakefile
onsuccess:
    for x in expand('{sample}.output', sample=samples):
        if os.path.getsize(x) == 0:
            print('Removing failed sample %s' % x)
            os.remove(x)
The canonical solution to problems like this is to use checkpoints. Consider the following example:
import pandas as pd
def get_results(wildcards):
    # Read the QC table written by the checkpoint and keep passing samples
    qc = pd.read_csv(checkpoints.qc.get().output[0].open(), sep="\t")
    return expand(
        "results/processed/{sample}.txt",
        sample=qc[qc["some-qc-criterion"] > config["qc-threshold"]]["sample"]
    )
rule all:
    input:
        get_results
checkpoint qc:
    input:
        expand("results/preprocessed/{sample}.txt", sample=config["samples"])
    output:
        "results/qc.tsv"
    shell:
        "perfom-qc {input} > {output}"
rule process:
    input:
        "results/preprocessed/{sample}.txt"
    output:
        "results/processed/{sample}.txt"
    shell:
        "process {input} > {output}"
The idea is the following: at some point in your pipeline, after some (let's say) preprocessing, you add a checkpoint rule, which aggregates over all samples and generates some kind of QC table. Then, downstream of that, there is a rule that aggregates over samples (e.g. the rule all, or some other aggregation inside of the workflow). Let's say in that aggregation you only want to consider samples that pass the QC. For that, you let the required files ("results/processed/{sample}.txt") be determined via an input function, which reads the QC table generated by the checkpoint rule. Snakemake's checkpoint mechanism ensures that this input function is evaluated after the checkpoint has been executed, so that you can actually read the table results and base your decision about the samples on the qc criteria contained in that table. Any intermediate rules (like here the process rule) will then be automatically applied by Snakemake when re-evaluating the DAG.
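The filtering step performed by the input function can be tried outside Snakemake. The following is a minimal sketch using only the standard library: the QC values and the threshold of 0.5 are invented, passing_samples is a hypothetical helper, and the column names follow the example above.

```python
import csv
import io

def passing_samples(qc_file, threshold):
    """Return the samples whose QC value exceeds the threshold."""
    reader = csv.DictReader(qc_file, delimiter="\t")
    return [
        row["sample"]
        for row in reader
        if float(row["some-qc-criterion"]) > threshold
    ]

# Made-up QC table standing in for results/qc.tsv.
qc_table = io.StringIO(
    "sample\tsome-qc-criterion\n"
    "A\t0.9\n"
    "B\t0.2\n"
    "C\t0.7\n"
)

# Only the passing samples become targets of the aggregating rule.
targets = [
    f"results/processed/{s}.txt" for s in passing_samples(qc_table, 0.5)
]
print(targets)  # ['results/processed/A.txt', 'results/processed/C.txt']
```

Sample B never appears among the targets, so no process job would be created for it.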