Snakemake: HISAT2 index build and alignment using touch

Following my previous question (Snakemake: HISAT2 alignment of many RNAseq reads against many genomes UPDATED), I wanted to run the HISAT2 alignment using touch in Snakemake.
I have several genome index files with suffixes .1.ht2l to .8.ht2l:
bob.1.ht2l
...
bob.8.ht2l
steve.1.ht2l
...
steve.8.ht2l
and several RNAseq samples:
flower_kevin_1.fastq.gz
flower_kevin_2.fastq.gz
flower_daniel_1.fastq.gz
flower_daniel_2.fastq.gz
I need to align all RNAseq reads against each genome.
workdir: "/path/to/dir/"

(HISAT2_INDEX_PREFIX,) = glob_wildcards('/path/to/dir/{prefix}.fasta')
(SAMPLES,) = glob_wildcards("/path/to/dir/{sample}_1.fastq.gz")

rule all:
    input:
        expand("{prefix}.{sample}.bam", zip, prefix=HISAT2_INDEX_PREFIX, sample=SAMPLES)

rule hisat2_build:
    input:
        database="/path/to/dir/{prefix}.fasta"
    output:
        done = touch("{prefix}")
    threads: 2
    shell:
        "/Tools/hisat2-2.1.0/hisat2-build -p {threads} {input.database} {wildcards.prefix}"

rule hisat2:
    input:
        hisat2_prefix_done = "{prefix}",
        fastq1="/path/to/dir/{sample}_1.fastq.gz",
        fastq2="/path/to/dir/{sample}_2.fastq.gz"
    output:
        bam = "{prefix}.{sample}.bam",
        txt = "{prefix}.{sample}.txt",
    log: "{prefix}.{sample}.snakemake_log.txt"
    threads: 50
    shell:
        "/Tools/hisat2-2.1.0/hisat2 -p {threads} -x {wildcards.prefix}"
        " -1 {input.fastq1} -2 {input.fastq2} --summary-file {output.txt} |"
        " /Tools/samtools-1.9/samtools sort -@ {threads} -o {output.bam}"
The output gives me bob and steve aligned against ONLY ONE RNAseq sample (i.e. flower_kevin). I don't know how to solve this. Any suggestions would be helpful.

I solved the problem by removing zip from rule all. Critiques of the code syntax are still welcome.
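For reference, a minimal sketch of the corrected rule all (assuming the rest of the Snakefile above is unchanged): without zip, expand builds the full cross product of index prefixes and samples instead of pairing them element by element.

rule all:
    input:
        # Without zip, every prefix is combined with every sample:
        # bob.flower_kevin.bam, bob.flower_daniel.bam,
        # steve.flower_kevin.bam, steve.flower_daniel.bam
        expand("{prefix}.{sample}.bam", prefix=HISAT2_INDEX_PREFIX, sample=SAMPLES)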

Related

Sample Input from file

I am trying to create the input for rules from a sample file. The sample file contains a column SampleID, which should be used as the sample wildcard. I want to extract the paths of the normal and tumor BAMs from the columns Path_Normal and Path_Tumor per SampleID from the data frame.
For this I tried the following:
import pandas as pd

input_table = "sampletable.tsv"
samples = pd.read_table(input_table).set_index("SampleID", drop=False)

rule all:
    input:
        expand("/directory/sm_mutect2_paired/vcf/{sample}.mt2.vcf", sample=samples.index)

rule Mutect2:
    input:
        tumor = samples[samples['SampleID']=="{sample}"]['Path_Tumor'],
        normal = samples[samples['SampleID']=="{sample}"]['Path_Normal']
    output:
        "/directory/sm_mutect2_paired/vcf/{sample}.mt2.vcf"
    conda:
        "envs/gatk_mutect2_paired.yaml"
    shell:
        "gatk --java-options '-Xmx16G -XX:+UseParallelGC -XX:ParallelGCThreads=16' Mutect2 \
        -R /directory/ref/genomics-public-data/references/hg38/v0/Homo_sapiens_assembly38.fasta \
        {input.tumor} \
        {input.normal} \
        -L /directory/GATK_interval_files_Agilent/S07604514_hs_hg38/S07604514_Covered.bed \
        -O {output} \
        --af-of-alleles-not-in-resource 2.5e-06 \
        --germline-resource /directory/GATK_gnomad/af-only-gnomad.hg38.vcf.gz \
        -pon /home/zyto/unger/GATK_PoN/1000g_pon.hg38.vcf.gz"
...
When doing a dry run I do not get an error message, but the execution fails because the input is empty, which becomes apparent when looking at the log:
gatk --java-options '-Xmx16G -XX:+UseParallelGC -XX:ParallelGCThreads=16' Mutect2 -R /directory/GATK_ref/genomics-public-data/references/hg38/v0/Homo_sapiens_assembly38.fasta -L /directory/GATK_interval_files_Agilent/S07604514_hs_hg38/S07604514_Covered.bed -O /directory/WES_Rezidiv_HNSCC_Clonality/sm_mutect2_paired/vcf/HL05_Rez_HL05_NG.mt2.vcf --af-of-alleles-not-in-resource 2.5e-06 --germline-resource /directory/GATK_gnomad/af-only-gnomad.hg38.vcf.gz -pon /directory/GATK_PoN/1000g_pon.hg38.vcf.gz
The two input files should appear between "Mutect2" and "-R".
So it looks like I am doing something wrong when defining the inputs...
You need to defer the determination of the input files of that rule to the so-called DAG phase, when jobs and wildcard values are known. This works via input functions. I would strongly recommend doing the official Snakemake tutorial, which covers this topic in depth.
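A minimal sketch of what such an input function could look like here, reusing the samples table from the question (the function names get_tumor_bam and get_normal_bam are illustrative, not any fixed API):

import pandas as pd

samples = pd.read_table("sampletable.tsv").set_index("SampleID", drop=False)

# Input functions are evaluated at DAG-construction time, once the {sample}
# wildcard value of each job is known, so the per-sample lookup works.
def get_tumor_bam(wildcards):
    return samples.loc[wildcards.sample, "Path_Tumor"]

def get_normal_bam(wildcards):
    return samples.loc[wildcards.sample, "Path_Normal"]

rule Mutect2:
    input:
        tumor = get_tumor_bam,
        normal = get_normal_bam
    output:
        "/directory/sm_mutect2_paired/vcf/{sample}.mt2.vcf"
    # ... conda and shell sections as in the question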

Running multiple snakemake rules

I would like to run multiple rules one after another using Snakemake. However, when I run this script, the bam_list rule runs before the samtools_markdup rule and gives me an error that it cannot find its input files, which obviously have not been generated yet.
How to solve this problem?
rule all:
    input:
        expand("dup/{sample}.dup.bam", sample=SAMPLES),
        "dup/bam_list"

rule samtools_markdup:
    input:
        sortbam = "rg/{sample}.rg.bam"
    output:
        dupbam = "dup/{sample}.dup.bam"
    threads: 5
    shell:
        """
        samtools markdup -@ {threads} {input.sortbam} {output.dupbam}
        """

rule bam_list:
    output:
        outlist = "dup/bam_list"
    shell:
        """
        ls dup/*.bam > {output.outlist}
        """
Snakemake is following directions: you want dup/bam_list, and it can be produced without any inputs. I think what you meant to have is:
rule all:
    input:
        "dup/bam_list"

rule samtools_markdup:
    input:
        sortbam = "rg/{sample}.rg.bam"
    output:
        dupbam = "dup/{sample}.dup.bam"
    threads: 5
    shell:
        """
        samtools markdup -@ {threads} {input.sortbam} {output.dupbam}
        """

rule bam_list:
    input:
        expand("dup/{sample}.dup.bam", sample=SAMPLES)
    output:
        outlist = "dup/bam_list"
    shell:
        """
        ls dup/*.bam > {output.outlist}
        """
Now bam_list will wait until all the samtools_markdup jobs are completed. As an aside, I expect the contents of dup/bam_list to be identical to expand("dup/{sample}.dup.bam", sample=SAMPLES), so if you use the file later in the workflow you can probably just use the expand output directly, as sketched below.
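A small sketch of that idea, with a hypothetical downstream rule (merge_bams and merged/all.bam are invented names, not from the question):

rule merge_bams:
    input:
        # The same list of files that dup/bam_list would contain
        bams = expand("dup/{sample}.dup.bam", sample=SAMPLES)
    output:
        "merged/all.bam"
    shell:
        "samtools merge {output} {input.bams}"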

snakemake batch creation of output

Generating output based on changed input files in Snakemake is easy:
rule all:
    input: [f'out_{i}.txt' for i in range(10)]

rule make_input:
    output: 'in_{i}.txt'
    shell: 'touch {output}'

rule make_output_parallel:
    input: 'in_{i}.txt'
    output: 'out_{i}.txt'
    shell: 'touch {output}'
In this case, make_output_parallel will only run for the instances where in_{i}.txt has changed.
But suppose the out_{i}.txt files cannot be generated in parallel and I want to generate them in a single step, like:
rule make_output_one_step:
    input: [f'in_{i}.txt' for i in range(10)]
    output: [f'out_{i}.txt' for i in range(10)]
    shell: 'touch {output}'
If only one of the in_{i}.txt files has changed, I don't need to regenerate all 10 outputs.
How can I adjust make_output_one_step.output to generate only the needed files?
If you want some parts of the pipeline not to run in parallel for whatever reason (RAM, internet usage, IO, API limits, etc.), you can make use of resources.
rule all:
    input: [f'out_{i}.txt' for i in range(10)]

rule make_input:
    output: 'in_{i}.txt'
    shell: 'touch {output}'

rule make_output:
    input: 'in_{i}.txt'
    output: 'out_{i}.txt'
    resources: max_parallel=1
    shell: 'touch {output}'
And then you can call your pipeline like snakemake --resources max_parallel=1 --cores 10. In this case all the jobs of rule make_input will run in parallel, but only one instance of make_output will run at a time.

Snakemake read input from file

I am trying to use a file that will be written during the run as an input to another rule, but it always gives me the error FileNotFoundError: [Errno 2] No such file or directory.
Is there a way to fix this, or another implementation that achieves the same logic?
def vc_list(wildcards):
    my_list = []
    with open(wildcards.mydir + "/file_B.txt", 'r') as data_in:
        for line in data_in:
            my_list.append(line.strip())
    return my_list

# rule A will process file_A.txt and give me file_B.txt
rule A:
    input: "{mydir}/file_A.txt"
    output: "{mydir}/file_B.txt"
    shell: "seq 1 5 > {output}"  # assume that `seq 1 5` is the output from processing the file

rule B:
    input: "{value}"
    output: "{value}.vc"
    shell: "pythoncode.py {input} {output}"

# rule C will process file_B.txt to give me the list of values used to expand the input,
# then will use rule B to produce them
rule C:
    input:
        processed_file = rules.A.output,  # "{mydir}/file_B.txt"
        my_list = lambda wildcards: expand("{mydir}/{value}.vc", mydir=wildcards.mydir, value=vc_list(wildcards))
    output: "{mydir}/done.txt"
    shell: "touch {output}"

# I always get the error that "{mydir}/file_B.txt" does not exist
The error now:
test_loop.snakefile:
FileNotFoundError: [Errno 2] No such file or directory: 'read_file/file_B.txt'
Wildcards:
mydir=read_file
Thanks,
The answer to my question is to use a checkpoint, as dynamic will be deprecated.
Here is how the logic should be changed:
rule:
    input: 'done.txt'

checkpoint A:
    output: 'B.txt'
    shell: 'seq 1 2 > {output}'

rule N:
    input: "genome.fa"
    output: '{num}.bam'
    shell: "touch {output}"

rule B:
    input: '{num}.bam'
    output: '{num}.vc'
    shell: "touch {output}"

def aggregate_input(wildcards):
    with open(checkpoints.A.get(**wildcards).output[0], 'r') as f:
        return [num.rstrip() + '.vc' for num in f]

rule C:
    input: aggregate_input
    output: touch('done.txt')
Credit goes to Eric Lim
Your script fails even before the workflow starts, during the pipeline-construction phase.
So there is nothing surprising regarding rules A and B: Snakemake reads their input and output sections and finds no problem with them. Then it starts reading rule C, whose input section calls the vc_list() function, which in turn tries to read the file 'read_file/file_B.txt' before the workflow has even started! Of course it doesn't find the file, and it produces the error.
As for what to do, you need to clarify the task first. Most probably you are trying to use dynamic information in a rule's input. In that case you need to use dynamic files or checkpoints.

Conditional execution of multiplexed analysis with snakemake

I'm having some trouble with Snakemake; so far I haven't found pertinent information in the documentation (or elsewhere).
In fact, I have a big file with different samples (multiplexed analyses) and I would like to stop the execution of the pipeline for some samples according to results found after certain rules.
I've already tried changing this value outside of a rule definition (using a checkpoint or a def), making conditional inputs for the following rules, and treating the wildcards as a simple list in order to delete one item.
Below is an example of what I want to do (the conditional if is only indicative here):
# Import the config file(s)
configfile: "../PATH/configfile.yaml"

# Wildcards
sample = config["SAMPLE"]
lauch = config["LAUCH"]

# Rules
rule all:
    input:
        expand("PATH_TO_OUTPUT/{lauch}.{sample}.output", lauch=lauch, sample=sample)

rule one:
    input:
        "PATH_TO_INPUT/{lauch}.{sample}.input"
    output:
        temp("PATH_TO_OUTPUT/{lauch}.{sample}.output.tmp")
    shell:
        """
        somescript.sh {input} {output}
        """

rule two:
    input:
        "PATH_TO_OUTPUT/{lauch}.{sample}.output.tmp"
    output:
        "PATH_TO_OUTPUT/{lauch}.{sample}.output"
    shell:
        """
        somecheckpoint.sh {input}  # Print a message and write to the log file for now
        if [ file_dont_pass_checkpoint ]; then
            # Delete the sample corresponding to the wildcard {sample}
            # to continue the analysis only with samples that pass the validation
        fi
        somescript2.sh {input} {output}
        """
If someone has an idea, I'm interested.
Thank you in advance for your answers.
I think this is an interesting situation if I understand it correctly. If a sample passes some checks, then keep analysing it. Otherwise, stop early.
At the end of the pipeline, every sample must have a PATH_TO_OUTPUT/{lauch}.{sample}.output, since this is what rule all asks for, regardless of the check results.
You could have the rule(s) performing the checks write a file containing a flag indicating whether the checks passed for that sample (say, PASS or FAIL). Then, according to that flag, the rule(s) doing the analysis either run the full analysis (if PASS) or write an empty file (or whatever) if the flag is FAIL. Here's the gist:
rule all:
    input:
        expand('{sample}.output', sample=samples),

rule checker:
    input:
        '{sample}.input',
    output:
        '{sample}.check',
    shell:
        r"""
        if [ some_check_is_ok ]
        then
            echo "PASS" > {output}
        else
            echo "FAIL" > {output}
        fi
        """

rule do_analysis:
    input:
        chk = '{sample}.check',
        smp = '{sample}.input',
    output:
        '{sample}.output',
    shell:
        r"""
        if grep -q "PASS" {input.chk}
        then
            do_long_analysis.sh {input.smp} > {output}
        else
            > {output}  # Do nothing: empty file
        fi
        """
If you don't want to see the failed, empty output files at all, you could use the onsuccess directive to get rid of them at the end of the pipeline:
onsuccess:
    for x in expand('{sample}.output', sample=samples):
        if os.path.getsize(x) == 0:
            print('Removing failed sample %s' % x)
            os.remove(x)
The canonical solution to problems like this is to use checkpoints. Consider the following example:
import pandas as pd

def get_results(wildcards):
    qc = pd.read_csv(checkpoints.qc.get().output[0].open(), sep="\t")
    return expand(
        "results/processed/{sample}.txt",
        sample=qc[qc["some-qc-criterion"] > config["qc-threshold"]]["sample"]
    )

rule all:
    input:
        get_results

checkpoint qc:
    input:
        expand("results/preprocessed/{sample}.txt", sample=config["samples"])
    output:
        "results/qc.tsv"
    shell:
        "perform-qc {input} > {output}"

rule process:
    input:
        "results/preprocessed/{sample}.txt"
    output:
        "results/processed/{sample}.txt"
    shell:
        "process {input} > {output}"
"process {input} > {output}"
The idea is the following: at some point in your pipeline, after some (let's say) preprocessing, you add a checkpoint rule, which aggregates over all samples and generates some kind of QC table. Then, downstream of that, there is a rule that aggregates over samples (e.g. the rule all, or some other aggregation inside of the workflow). Let's say in that aggregation you only want to consider samples that pass the QC. For that, you let the required files ("results/processed/{sample}.txt") be determined via an input function, which reads the QC table generated by the checkpoint rule. Snakemake's checkpoint mechanism ensures that this input function is evaluated after the checkpoint has been executed, so that you can actually read the table results and base your decision about the samples on the qc criteria contained in that table. Any intermediate rules (like here the process rule) will then be automatically applied by Snakemake when re-evaluating the DAG.