Snakemake: HISAT2 alignment of many RNAseq reads against many genomes UPDATED - snakemake

I have several genome files with suffix .1.ht2l to .8.ht2l
bob.1.ht2l
...
bob.8.ht2l
steve.1.ht2l
...
steve.8.ht2l
and sereval RNAseq samples
flower_kevin_1.fastq.gz
flower_kevin_2.fastq.gz
flower_daniel_1.fastq.gz
flower_daniel_2.fastq.gz
I need to align all rnaseq reads against each genome.
UPDATED as dariober suggested:
workdir: "/path/to/aligned"
(HISAT2_INDEX_PREFIX,)=glob_wildcards("/path/to/index/{prefix}.1.ht2l")
(SAMPLES,)=glob_wildcards("/path/to/{sample}_1.fastq.gz")
print(HISAT2_INDEX_PREFIX)
print (SAMPLES)
rule all:
input:
expand("{prefix}.{sample}.bam", zip, prefix=HISAT2_INDEX_PREFIX, sample=SAMPLES)
rule hisat2:
input:
hisat2_index=expand("%s.{ix}.ht2l" % "/path/to/index/{prefix}", ix=range(1, 9), prefix = HISAT2_INDEX_PREFIX),
fastq1="/path/to/{sample}_1.fastq.gz",
fastq2="/path/to/{sample}_2.fastq.gz"
output:
bam = "{prefix}.{sample}.bam",
txt = "{prefix}.{sample}.txt",
log: "{prefix}.{sample}.snakemake_log.txt"
threads: 5
shell:
"/Tools/hisat2-2.1.0/hisat2 -p {threads} -x {/path/to/index/{wildcards.prefix}"
" -1 {input.fastq1} -2 {input.fastq2} --summary-file {output.txt} |"
"/Tools/samtools-1.9/samtools sort -# {threads} -o {output.bam}"
The problem I get is when running HISAT2 is taking as -x input all bob.1.ht2l:bob.8.ht2l and steve.1.ht2l:steve.8.ht2l at once. While rna-seq should be mapped at each genome separately. Where is the error?
NB: my previous question: Snakemake: HISAT2 alignment of many RNAseq reads against many genomes

I think your confusion comes from the fact that hisat wants a prefix to the index files, not all the list of index files. So instead of -x {input.hisat2_index} (i.e. the list of index files) use something like -x /path/to/{wildcards.prefix}.
In other words, the input hisat2_index=expand(...) should be there only to tell snakemake to start this rule only after these files are ready but you don't use them directly (well, hisat does use them of course but you don't pass them on the command line).

Related

How does snakemake handle possible corruptions due to a rule run in parallel simultaneously appending to a single file?

I would like to learn how snakemake handles following situations, and what is the best practice avoid collisions/corruptions.
rule something:
input:
expand("/path/to/out-{asd}.txt", asd=LIST)
output:
"/path/to/merged.txt"
shell:
"cat {input} >> {output}"
With snakemake -j10 the command will try to append to the same file simultaneously, and I could not figure out if this could lead to possible corruptions or if this is already handled.
Also, how are more complicated cases handled e.g. where it is not only cat but a return value of another process based on input value being appended to the same file? Is the best practice first writing them to individual files then catting them together?
rule get_merged_total_distinct:
input:
expand("{dataset_id}/merge_libraries/{tomerge}_merged_rmd.bam",dataset_id=config["dataset_id"],tomerge=list(TOMERGE.keys())),
output:
"{dataset_id}/merge_libraries/merged_total_distinct.csv"
params:
with_dups="{dataset_id}/merge_libraries/{tomerge}_merged.bam"
shell:
"""
RCT=$(samtools view -#4 -c -F1 -F4 -q 30 {params.with_dups})
RCD=$(samtools view -#4 -c -F1 -F4 -q 30 {input})
printf "{wildcards.tomerge},${{RCT}},${{RCD}}\n" >> {output}
"""
or cases where an external script is being called to print the result to a single output file?
input:
expand("infile/{x}",...) # expanded as above
output:
"results/all.txt"
shell:
"""
bash script.sh {params.x} {input} {params.y} >> {output}
"""
With your example, the shell directive will expand to
cat /path/to/out-SAMPLE1.txt /path/to/out-SAMPLE2.txt [...] >> /path/to/merged.txt
where SAMPLE1, etc, comes from the LIST. In this case, there is no collision, corruption, or race conditions. One thread will run that command as if you typed it on your shell and all inputs will get cated to the output. Since snakemake is pull based, once the output exists that rule will only run again if the inputs change at which points the new inputs will be added to the old due to using >>. As such, I would recommend using > so the old contents are removed; rules should be deterministic where possible.
Now, if you had done something like
rule something:
input:
"/path/to/out-{asd}.txt"
output:
touch("/path/to/merged-{asd}.txt")
params:
output="/path/to/merged.txt"
shell:
"cat {input} >> {params.output}"
# then invoke
snakemake -j10 /path/to/merged-{a..z}.txt
Things are more messy. Snakemake will launch all 10 jobs and output to the single merged.txt. Note that file is now a parameter and we are targeting some dummy files. This will behave as if you had 10 different shells and executed the commands
cat /path/to/out-a.txt >> /path/to/merged.txt
# ...
cat /path/to/out-z.txt >> /path/to/merged.txt
all at once. The output will have a random order and lines may be interleaved or interrupted.
As some guidance
Try to make outputs deterministic. Given the same inputs you should always produce the same outputs. If possible, set random seeds and enforce input ordering. In the second example, you have no idea what the output will be.
Don't use the append operator. This follows from the first point. If the output already exists and needs to be updated, start from scratch.
If you need to append a bunch of outputs, say log files or to create a summary, do so in a separate rule. This again follows from the first point, but it's the only reason I can think of to use append.
Hope that helps. Otherwise you can comment or edit with a more realistic example of what you are worried about.

use directories or all files in directories as input in snakemake

I am new to snakemake. I want to use directories or all files in directories as input in snakemake. For example, two directories with different no. of bam files,
--M1
M1-1.bam
M1-2.bam
--M2
M2-3.bam
M2-5.bam
I just want to merge M1-1.bam, M1-2.bam to M1.bam; M2-3.bam, M2-5.bam to M2.bam; I tried to use wildcards and expand followd by this and this, and the code is as follows,
config.yaml
SAMPLES:
M1:
- 1
- 2
M2:
- 3
- 5
rawdata: path/to/rawdata
outpath: path/to/output
reference: path/to/reference
snakemake file
configfile:"config.yaml"
SAMPLES=config["SAMPLES"]
REFERENCE=config["reference"]
RAWDATA=config["rawdata"]
OUTPATH=config["outpath"]
ALL_INPUT = []
for key, values in SAMPLES.items():
ALL_INPUT.append(f"Map/bwa/merge/{key}.bam")
ALL_INPUT.append(f"Map/bwa/sort/{key}.sort.bam")
ALL_INPUT.append(f"Map/bwa/dup/{key}.sort.rmdup.bam")
ALL_INPUT.append(f"Map/bwa/dup/{key}.sort.rmdup.matrix")
ALL_INPUT.append(f"SNV/Mutect2/result/{key}.vcf.gz")
ALL_INPUT.append(f"Map/bwa/result/{key}")
for value in values:
ALL_INPUT.append(f"Map/bwa/result/{key}/{key}-{value}.bam")
for num in {1,2}:
ALL_INPUT.append(f"QC/fastp/{key}/{key}-{value}.R{num}.fastq.gz")
rule all:
input:
expand("{outpath}/{all_input}",all_input=ALL_INPUT,outpath=OUTPATH)
rule fastp:
input:
r1= RAWDATA + "/{key}-{value}.R1.fastq.gz",
r2= RAWDATA + "/{key}-{value}.R2.fastq.gz"
output:
a1="{outpath}/QC/fastp/{key}/{key}-{value}.R1.fastq.gz",
a2="{outpath}/QC/fastp/{key}/{key}-{value}.R2.fastq.gz"
params:
prefix="{outpath}/QC/fastp/{key}/{key}-{value}"
shell:
"""
fastp -i {input.r1} -I {input.r2} -o {output.a1} -O {output.a2} -j {params.prefix}.json -h {params.prefix}.html
"""
rule bwa:
input:
a1="{outpath}/QC/fastp/{key}/{key}-{value}.R1.fastq.gz",
a2="{outpath}/QC/fastp/{key}/{key}-{value}.R2.fastq.gz"
output:
o1="{outpath}/Map/bwa/result/{key}/{key}-{value}.bam"
params:
mem="4000",
rg="#RG\\tID:{key}\\tPL:ILLUMINA\\tSM:{key}"
shell:
"""
bwa mem -t {threads} -M -R '{params.rg}' {REFERENCE} {input.a1} {input.a2} | samtools view -b -o {output.o1}
"""
## get sample index from raw fastq
key_ids,value_ids = glob_wildcards(RAWDATA + "/{key}-{value}.R1.fastq.gz")
# remove duplicate sample name, and this is useful when there is only one sample input
key_ids = list(set(key_ids))
rule merge:
input:
expand("{outpath}/Map/bwa/result/{key}/{key}-{value}.bam",outpath=OUTPATH, key=key_ids, value=value_ids)
output:
"{outpath}/Map/bwa/merge/{key}.bam"
shell:
"""
samtools merge {output} {input}
"""
The {input} in merge command will be
M1-1.bam M1-2.bam M1-3.bam M1-5.bam M2-1.bam M2-2.bam M2-3.bam M2-5.bam
Actually, for M1 sample, the {input} should be M1-1.bam M1-2.bam; for M2, M2-3.bam M2-5.bam. I also read this, but I have no idea if there are lots of directories with different files each.
Then I tried to use directories as input, for merge rule
rule mergebam:
input:
"{outpath}/Map/bwa/result/{key}"
output:
"{outpath}/Map/bwa/merge/{key}.bam"
log:
"{outpath}/Map/bwa/log/{key}.merge.bam.log"
shell:
"""
samtools merge {output} `ls {input}/*bam` > {log} 2>&1
"""
But this give me MissingInputException error
Missing input files for rule merge:
/{outpath}/Map/bwa/result/M1
Any idea will be appreciated.
I haven't fully parsed your question but I'll give it a shot anyway... In rule merge you have:
expand("{outpath}/Map/bwa/result/{key}/{key}-{value}.bam",outpath=OUTPATH, key=key_ids, value=value_ids)
This means that you collect all combinations of outpath, key and value.
Presumably you want all combinations of value within each outpath and key. So use:
expand("{{outpath}}/Map/bwa/result/{{key}}/{{key}}-{value}.bam", value=value_ids)
if you change your config.yaml to the following, can you make the implementation easier by using expand?
SAMPLES:
M1:
- M1-1
- M2-2
M2:
- M2-3
- M2-5

Snakemake: MissingInputException with inconsistent naming scheme

I am trying to process MinION cDNA amplicons using Porechop with Minimap2 and I am getting this error.
MissingInputException in line 16 of /home/sean/Desktop/reo/antisera project/20200813/MinIONAmplicon.smk:
Missing input files for rule minimap2:
8413_19_strict/BC01.fastq.g
I understand what the error telling me, I just understand why its being its not trying to make the rule before it. Porechop is being used to check for all the possible barcodes and will output more than one fastq file if it finds more than barcode in the directory. However since I know what barcode I am looking for I made a barcodes section in the config.yaml file so I can map them together.
I think the error is happening because my target output for Porechop doesn't match the input for minimap2 but I do not know how to correct this problem as there can be multiple outputs from porechop.
I thought I was building a path for the input file for the minimap2 rule and when snakemake discovered that the porechop output was not there it would make it, but that is not what is happening.
Here is my pipeline so far,
configfile: "config.yaml"
rule all:
input:
expand("{sample}.bam", sample = config["samples"])
rule porechop_strict:
input:
lambda wildcards: config["samples"][wildcards.sample]
output:
directory("{sample}_strict/")
shell:
"porechop -i {input} -b {output} --barcode_threshold 85 --threads 8 --require_two_barcodes"
rule minimap2:
input:
lambda wildcards: "{sample}_strict/" + config["barcodes"][wildcards.sample]
output:
"{sample}.bam"
shell:
"minimap2 -ax map-ont -t8 ../concensus.fasta {input} | samtools sort -o {output}"
and the yaml file
samples: {
'8413_19': relabeled_reads/8413_19.raw.fastq.gz,
'8417_19': relabeled_reads/8417_19.raw.fastq.gz,
'8445_19': relabeled_reads/8445_19.raw.fastq.gz,
'8466_19_104': relabeled_reads/8466_19_104.raw.fastq.gz,
'8466_19_105': relabeled_reads/8466_19_105.raw.fastq.gz,
'8467_20': relabeled_reads/8467_20.raw.fastq.gz,
}
barcodes: {
'8413_19': BC01.fastq.gz,
'8417_19': BC02.fastq.gz,
'8445_19': BC03.fastq.gz,
'8466_19_104': BC04.fastq.gz,
'8466_19_105': BC05.fastq.gz,
'8467_20': BC06.fastq.gz,
}
First of all, you can always debug the problems like that specifying the flag --printshellcmds. That would print all shell commands that Snakemake runs under the hood; you may try to run them manually and locate the problem.
As for why your rule doesn't produce any output, my guess is that samtools requires explicit filenames or - to use stdin:
Samtools is designed to work on a stream. It regards an input file '-'
as the standard input (stdin) and an output file '-' as the standard
output (stdout). Several commands can thus be combined with Unix
pipes. Samtools always output warning and error messages to the
standard error output (stderr).
So try that:
shell:
"minimap2 -ax map-ont -t8 ../concensus.fasta {input} | samtools sort -o {output} -"
So I am not 100% sure why this way works, I imagine it has to do with the way snakemake looks at the targets however here is the solution I found for it.
rule minimap2:
input:
"{sample}_strict"
params:
suffix=lambda wildcards: config["barcodes"][wildcards.sample]
output:
"{sample}.bam"
shell:
"minimap2 -ax map-ont -t8 ../consensus.fasta\
{input}/{params.suffix} | samtools sort -o {output}"
by using the params feature in snakemake I was able to match up the correct barcode to the sample name. I am not sure why I could just do that as the input itself, but when I returned the input to the match the output of the previous rule it works.

Snakemake: HISAT2 alignment of many RNAseq reads against many genomes

I have several genome files, all with suffix .1.ht2l to .8.ht2l
bob.1.ht2l
bob.2.ht2l
bob.3.ht2l
bob.4.ht2l
bob.5.ht2l
bob.6.ht2l
bob.7.ht2l
bob.8.ht2l
steve.1.ht2l ....steve.8.ht2l and so on
and sereval RNAseq samples, like
flower_kevin_1.fastq.gz
flower_kevin_2.fastq.gz
flower_daniel_1.fastq.gz
flower_daniel_2.fastq.gz and so on also with different tissues.
I would like to align all rnaseq reds against the genomes.
UPDATED:
workdir: "/path/to/aligned"
(HISAT2_INDEX_PREFIX,)=glob_wildcards("/path/to/index/{prefix}.1.ht2l")
(SAMPLES,)=glob_wildcards("/path/to/{sample}_1.fastq.gz")
print(HISAT2_INDEX_PREFIX)
print (SAMPLES)
rule all:
input:
expand("{prefix}.{sample}.bam", zip, prefix=HISAT2_INDEX_PREFIX, sample=SAMPLES)
rule hisat2:
input:
hisat2_index=expand("%s.{ix}.ht2l" % "/path/to/index/{prefix}", ix=range(1, 9)),
fastq1="/path/to/{sample}_1.fastq.gz",
fastq2="/path/to/{sample}_2.fastq.gz"
output:
bam = "{prefix}.{sample}.bam",
txt = "{prefix}.{sample}.txt",
log: "{prefix}.{sample}.snakemake_log.txt"
threads: 5
shell:
"/Tools/hisat2-2.1.0/hisat2 -p {threads} -x {HISAT2_INDEX_PREFIX}"
" -1 {input.fastq1} -2 {input.fastq2} --summary-file {output.txt} |"
"/Tools/samtools-1.9/samtools sort -# {threads} -o {output.bam}"
I get missing input files error, I am sure the error is somewhere with suffixes, however,I do not know how to solve and any suggestion is much appreciated.
UPDATE: ALL genome files bob.1.ht2l..bob.8.ht2l, steve.1.ht2l..steve.8.ht2l get invoked all at once, why is that?
Let's compare these 2 lines:
(HISAT2_INDEX_PREFIX,)=glob_wildcards("/path/to/index/{prefix}.1.ht2l")
hisat2_index=expand("%s.{ix}.ht2l" % HISAT2_INDEX_PREFIX, ix=range(1, 9)),
The first line explicitly specifies the /path/to/index/, the second misses this path. Be consistent, that should fix your problem.

Proxy file on snakemake code

I want to do alignment using star and I use proxy file for star the alignment.
Without a proxy file star-align run also without reference. So if I gave as input constrain of the alignment process the presence of database.done the alignment process can start.
How can manage this situation?
rule star_index:
input:
config['references']['transcriptome_fasta']
output:
genome=config['references']['starindex_dir'],
tp=touch("database.done")
shell:
'STAR --limitGenomeGenerateRAM 54760833024 --runMode genomeGenerate --genomeDir {output.genome} --genomeFastaFiles {input}'
rule star_map:
input:
dt="trim/{sample}/",
forward_paired="trim/{sample}/{sample}_forward_paired.fq.gz",
reverse_paired="trim/{sample}/{sample}_reverse_paired.fq.gz",
forward_unpaired="trim/{sample}/{sample}_forward_unpaired.fq.gz",
reverse_unpaired="trim/{sample}/{sample}_reverse_unpaired.fq.gz",
t1p="database.done",
output:
out1="ALIGN/{sample}/Aligned.sortedByCoord.out.bam",
out2="ALIGN/{sample}/",
# out2=touch("Star.align.done")
params:
genomedir = config['references']['basepath'],
sample="mitico",
platform_unit=config['platform'],
cente=config['center']
threads: 12
log: "ALIGN/log/{params.sample}_star.log"
shell:
'mkdir -p ALIGN/;STAR --runMode alignReads --genomeDir {params.genomedir} '
r' --outSAMattrRGline ID:{params.sample} SM:{params.sample} PL:{config[platform]} PU:{params.platform_unit} CN:{params.cente} '
'--readFilesIn {input.forward_paired} {input.reverse_paired} \
--readFilesCommand zcat
--outWigType wiggle \
--outWigStrand Stranded --runThreadN {threads} --outFileNamePrefix {output.out2} 2> {log} '
How can start a module only after all the previous function have finished.
I mean.Here i create the index then I trim ll my data and then I staart the alignment. I want after finishis all this sstep for all the sample start a new function like run fastqc. How can decode this in snakemake?
thanks so much for patience help
Without any mention of the genome as a required input for "star_map", I believe the rule is starting too early.
Try moving the genome reference from being a "Parameter" to being an "Input" requirement for star_map. Snakemake doesn't wait for parameters, only inputs. All reference genomes should be listed as inputs. In fact, all required files should be listed as input requirements. Param's are just for mostly convenience; ad-hoc strings and things on the fly.
I'm not entirely sure as to the connectivity across your files, some of these references are to a YAML file you have not provided, so I cannot guarantee the code will work.
rule star_map:
input:
dt="trim/{sample}/",
forward_paired="trim/{sample}/{sample}_forward_paired.fq.gz",
reverse_paired="trim/{sample}/{sample}_reverse_paired.fq.gz",
forward_unpaired="trim/{sample}/{sample}_forward_unpaired.fq.gz",
reverse_unpaired="trim/{sample}/{sample}_reverse_unpaired.fq.gz",
# Including the gnome as a required input, so Snakemake knows to wait for it too.
genomedir = config['references']['basepath'],
output:
out1="ALIGN/{sample}/Aligned.sortedByCoord.out.bam",
out2="ALIGN/{sample}/",
Snakemake doesn't check what files your shell commands are touching and modifying. Snakemake only knows to coordinate the files described in the "input" and "output" directives.