Snakemake: MissingInputException with inconsistent naming scheme - snakemake

I am trying to process MinION cDNA amplicons using Porechop with Minimap2 and I am getting this error.
MissingInputException in line 16 of /home/sean/Desktop/reo/antisera project/20200813/MinIONAmplicon.smk:
Missing input files for rule minimap2:
8413_19_strict/BC01.fastq.g
I understand what the error telling me, I just understand why its being its not trying to make the rule before it. Porechop is being used to check for all the possible barcodes and will output more than one fastq file if it finds more than barcode in the directory. However since I know what barcode I am looking for I made a barcodes section in the config.yaml file so I can map them together.
I think the error is happening because my target output for Porechop doesn't match the input for minimap2 but I do not know how to correct this problem as there can be multiple outputs from porechop.
I thought I was building a path for the input file for the minimap2 rule and when snakemake discovered that the porechop output was not there it would make it, but that is not what is happening.
Here is my pipeline so far,
configfile: "config.yaml"
rule all:
input:
expand("{sample}.bam", sample = config["samples"])
rule porechop_strict:
input:
lambda wildcards: config["samples"][wildcards.sample]
output:
directory("{sample}_strict/")
shell:
"porechop -i {input} -b {output} --barcode_threshold 85 --threads 8 --require_two_barcodes"
rule minimap2:
input:
lambda wildcards: "{sample}_strict/" + config["barcodes"][wildcards.sample]
output:
"{sample}.bam"
shell:
"minimap2 -ax map-ont -t8 ../concensus.fasta {input} | samtools sort -o {output}"
and the yaml file
samples: {
'8413_19': relabeled_reads/8413_19.raw.fastq.gz,
'8417_19': relabeled_reads/8417_19.raw.fastq.gz,
'8445_19': relabeled_reads/8445_19.raw.fastq.gz,
'8466_19_104': relabeled_reads/8466_19_104.raw.fastq.gz,
'8466_19_105': relabeled_reads/8466_19_105.raw.fastq.gz,
'8467_20': relabeled_reads/8467_20.raw.fastq.gz,
}
barcodes: {
'8413_19': BC01.fastq.gz,
'8417_19': BC02.fastq.gz,
'8445_19': BC03.fastq.gz,
'8466_19_104': BC04.fastq.gz,
'8466_19_105': BC05.fastq.gz,
'8467_20': BC06.fastq.gz,
}

First of all, you can always debug the problems like that specifying the flag --printshellcmds. That would print all shell commands that Snakemake runs under the hood; you may try to run them manually and locate the problem.
As for why your rule doesn't produce any output, my guess is that samtools requires explicit filenames or - to use stdin:
Samtools is designed to work on a stream. It regards an input file '-'
as the standard input (stdin) and an output file '-' as the standard
output (stdout). Several commands can thus be combined with Unix
pipes. Samtools always output warning and error messages to the
standard error output (stderr).
So try that:
shell:
"minimap2 -ax map-ont -t8 ../concensus.fasta {input} | samtools sort -o {output} -"

So I am not 100% sure why this way works, I imagine it has to do with the way snakemake looks at the targets however here is the solution I found for it.
rule minimap2:
input:
"{sample}_strict"
params:
suffix=lambda wildcards: config["barcodes"][wildcards.sample]
output:
"{sample}.bam"
shell:
"minimap2 -ax map-ont -t8 ../consensus.fasta\
{input}/{params.suffix} | samtools sort -o {output}"
by using the params feature in snakemake I was able to match up the correct barcode to the sample name. I am not sure why I could just do that as the input itself, but when I returned the input to the match the output of the previous rule it works.

Related

How does snakemake handle possible corruptions due to a rule run in parallel simultaneously appending to a single file?

I would like to learn how snakemake handles following situations, and what is the best practice avoid collisions/corruptions.
rule something:
input:
expand("/path/to/out-{asd}.txt", asd=LIST)
output:
"/path/to/merged.txt"
shell:
"cat {input} >> {output}"
With snakemake -j10 the command will try to append to the same file simultaneously, and I could not figure out if this could lead to possible corruptions or if this is already handled.
Also, how are more complicated cases handled e.g. where it is not only cat but a return value of another process based on input value being appended to the same file? Is the best practice first writing them to individual files then catting them together?
rule get_merged_total_distinct:
input:
expand("{dataset_id}/merge_libraries/{tomerge}_merged_rmd.bam",dataset_id=config["dataset_id"],tomerge=list(TOMERGE.keys())),
output:
"{dataset_id}/merge_libraries/merged_total_distinct.csv"
params:
with_dups="{dataset_id}/merge_libraries/{tomerge}_merged.bam"
shell:
"""
RCT=$(samtools view -#4 -c -F1 -F4 -q 30 {params.with_dups})
RCD=$(samtools view -#4 -c -F1 -F4 -q 30 {input})
printf "{wildcards.tomerge},${{RCT}},${{RCD}}\n" >> {output}
"""
or cases where an external script is being called to print the result to a single output file?
input:
expand("infile/{x}",...) # expanded as above
output:
"results/all.txt"
shell:
"""
bash script.sh {params.x} {input} {params.y} >> {output}
"""
With your example, the shell directive will expand to
cat /path/to/out-SAMPLE1.txt /path/to/out-SAMPLE2.txt [...] >> /path/to/merged.txt
where SAMPLE1, etc, comes from the LIST. In this case, there is no collision, corruption, or race conditions. One thread will run that command as if you typed it on your shell and all inputs will get cated to the output. Since snakemake is pull based, once the output exists that rule will only run again if the inputs change at which points the new inputs will be added to the old due to using >>. As such, I would recommend using > so the old contents are removed; rules should be deterministic where possible.
Now, if you had done something like
rule something:
input:
"/path/to/out-{asd}.txt"
output:
touch("/path/to/merged-{asd}.txt")
params:
output="/path/to/merged.txt"
shell:
"cat {input} >> {params.output}"
# then invoke
snakemake -j10 /path/to/merged-{a..z}.txt
Things are more messy. Snakemake will launch all 10 jobs and output to the single merged.txt. Note that file is now a parameter and we are targeting some dummy files. This will behave as if you had 10 different shells and executed the commands
cat /path/to/out-a.txt >> /path/to/merged.txt
# ...
cat /path/to/out-z.txt >> /path/to/merged.txt
all at once. The output will have a random order and lines may be interleaved or interrupted.
As some guidance
Try to make outputs deterministic. Given the same inputs you should always produce the same outputs. If possible, set random seeds and enforce input ordering. In the second example, you have no idea what the output will be.
Don't use the append operator. This follows from the first point. If the output already exists and needs to be updated, start from scratch.
If you need to append a bunch of outputs, say log files or to create a summary, do so in a separate rule. This again follows from the first point, but it's the only reason I can think of to use append.
Hope that helps. Otherwise you can comment or edit with a more realistic example of what you are worried about.

Snakemake input two variables and output one variable

I want to rename and move my fastq.gz files from these:
NAME-BOB_S1_L001_R1_001.fastq.gz
NAME-BOB_S1_L001_R2_001.fastq.gz
NAME-JOHN_S2_L001_R1_001.fastq.gz
NAME-JOHN_S2_L001_R2_001.fastq.gz
to these:
NAME_BOB/reads/NAME_BOB.R1.fastq.gz
NAME_BOB/reads/NAME_BOB.R2.fastq.gz
NAME_JOHN/reads/NAME_JOHN.R1.fastq.gz
NAME_JOHN/reads/NAME_JOHN.R2.fastq.gz
This is my code. The problem I have is the second variable S which I do not know how to specify in the code as I do not need it in my output filename.
workdir: "/path/to/workdir/"
DIR=["BOB","JOHN"]
S=["S1","S2"]
rule all:
input:
expand("NAME_{dir}/reads/NAME_{dir}.R1.fastq.gz", dir=DIR),
expand("NAME_{dir}/reads/NAME_{dir}.R2.fastq.gz", dir=DIR)
rule rename:
input:
fastq1=("fastq/NAME-{dir}_{s}_L001_R1_001.fastq.gz", zip, dir=DIR, s=S),
fastq2=("fastq/NAME-{dir}_{s}_L001_R2_001.fastq.gz", zip, dir=DIR, s=S)
output:
fastq1="NAME_{dir}/reads/NAME_{dir}.R1.fastq.gz",
fastq2="NAME_{dir}/reads/NAME_{dir}.R2.fastq.gz"
shell:
"""
mv {input.fastq1} {output.fastq1}
mv {input.fastq2} {output.fastq2}
"""
There are several problems in your code. First of all, the {dir} in your output and {dir} in your input are two different variables. Actually the {dir} in the output is a wildcard, while the {dir} in the input is a parameter for the expand function (moreover, you even forgot to call this function, and that is the second problem).
The third problem is that the shell section shall contain only a single command. You may try mv {input.fastq1} {output.fastq1}; mv {input.fastq2} {output.fastq2}, but this is not an idiomatic solution. Much better would be to create a rule that produces a single file, letting Snakemake to do the rest of the work.
Finally the S value fully depend on the DIR value, so it becomes a function of {dir}, and that can be solved with a lambda in input:
workdir: "/path/to/workdir/"
DIR=["BOB","JOHN"]
dir2s = {"BOB": "S1", "JOHN": "S2"}
rule all:
input:
expand("NAME_{dir}/reads/NAME_{dir}.{r}.fastq.gz", dir=DIR, r=["R1", "R2"])
rule rename:
input:
lambda wildcards:
"fastq/NAME-{{dir}}_{s}_L001_{{r}}_001.fastq.gz".format(s=dir2s[wildcards.dir])
output:
"NAME_{dir}/reads/NAME_{dir}.{r}.fastq.gz",
shell:
"""
mv {input} {output}
"""

Input vs Output: Job completed successfully, but some output files are missing

My pipeline is failing I believe due to a conflict between the expected out put of rule all vs the actual final output. I believe snakemake is waiting for the file kma/{sample} without an extension to appear instead it is getting a directory that has multiext("kma/{sample}", ".res", ".aln", ".fsa", ".gz") and I am having trouble getting them to play well.
configfile: "config.yaml"
rule all:
input:
expand("kma/{sample}", sample = config["samples"])
#multiext("kma/{sample}", ".res", ".aln", ".fsa", ".gz", sample = config["samples"])
rule seqtk_qualtiy_filter:
input:
lambda wildcards: "S5_Raw/" + config["samples"][wildcards.sample]
output:
temp("qtrim/{sample}.qtrim.fq")
shell:
"seqtk trimfq -b 0.01 {input} > {output}"
rule seqtk_clip:
input:
"qtrim/{sample}.qtrim.fq"
output:
temp("clip/{sample}.clip.fq")
shell:
"seqtk trimfq -b20 -L 350 {input} > {output}"
rule bbnorm:
input:
"clip/{sample}.clip.fq"
output:
"S5_processed/{sample}.norm.fq"
shell:
"bbnorm.sh in={input} out={output} target=100"
rule kma_map:
input:
"S5_processed/{sample}.norm.fq"
params:
ref = "ref/consensus.fasta"
output:
directory("kma/{sample}")
#multiext("kma/{sample}", ".res", ".aln", ".fsa", ".gz")
shell:
"kma -i {input} -t_db {params.ref} -o {output}"
The error if you run it the way that kma would like to see the handling done
Waiting at most 5 seconds for missing files.
MissingOutputException in line 33 of /home/sean/Desktop/reo/antisera project/ReovirusS1AmpliconS5.smk:
Job completed successfully, but some output files are missing. Missing files after 5 seconds:
kma/BA8359-19
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
File "/home/sean/.local/lib/python3.6/site-packages/snakemake/executors/__init__.py", line 544, in handle_job_success
File "/home/sean/.local/lib/python3.6/site-packages/snakemake/executors/__init__.py", line 231, in handle_job_success
Shutting down, this might take some time.
I have tried increasing the latency time as well, however since the expected file is never actually created it doesn't matter how long you wait.
the error you receive if you using the multiext function
Error in rule kma_map:
jobid: 11
output: kma/BA8359-19.res, kma/BA8359-19.aln, kma/BA8359-19.fsa, kma/BA8359-19.frag.gz
shell:
kma -i S5_processed/BA8359-19.norm.fq -t_db ref/consensus.fasta -o kma/BA8359-19.res kma/BA8359-19.aln kma/BA8359-19.fsa kma/BA8359-19.frag.gz
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
This error happens, I believe, because snakemake lists all the file types in the -o flag and kma says no.
There may be an inconsistency between the input files in rule all and the output files in rule kma_map. Showing the error you get would help.
Anyway, the command multiext("kma/{sample}", ".res", ".aln", ".fsa", ".gz", sample = config["samples"]) that you seem you have tried should not compile since multiext does not accept named arguments. Maybe what you want is this:
rule all:
input:
expand("kma/{sample}{ext}", sample= config['samples'], ext= [".res", ".aln", ".fsa", ".gz"]),
rule kma_map:
...
output:
multiext("kma/{sample}", ".res", ".aln", ".fsa", ".gz")
shell:
"kmap ... {input} {output}"
EDIT 21SEP2020
I have figured out the correct way to run this application.
first I need the rule all like this
KMA_OUTPUTS = [".res",".aln",".fsa",".frag.gz"]
rule all:
input:
expand("kma/{sample}{ext}", sample = config["samples"], ext = KMA_OUTPUTS)
And then the rule for kma like this
rule kma_map:
input:
"S5_processed/{sample}.norm.fq"
params:
ref = "ref/consensus.fasta",
prefix = "kma/{sample}"
output:
multiext("kma/{sample}", ".res", ".aln", ".fsa", ".frag.gz")
shell:
"kma -i {input} -t_db {params.ref} -o {params.prefix}"
by using the parama.prefix as the output file input I can get the desired output.

Snakemake: Exchanging variables

I have some ONT sequencing runs that have been basecalled on the MINIT. As such, when I demultiplex with guppy_barcoder, I get a directory of fastq files for each barcode. I want to use snakemake as a workflow manager to take these fastq files through our analyses, but this involves swapping the {barcode} for {sample} at some point.
BARCODE=['barcode01', 'barcode02', 'barcode03', 'barcode04']
SAMPLE=['sample01', 'sample02', 'sample03', 'sample04']
rule all:
input:
directory(expand("Sequencing_reads/demultiplexed/{barcode}", barcode=BARCODE)), #guppy_barcoder
expand("Sequencing_reads/gathered/{sample}_ONT.fastq", sample=SAMPLE), #getting all of the fastq files with the same barcode assigned to the correct sample
rule demultiplex:
input:
glob.glob("Sequencing_reads/fastq_pass/*fastq")
output:
directory(expand("Sequencing_reads/demultiplexed/{barcode}", barcode=BARCODE))
shell:
"guppy_barcoder --input_path Sequencing_reads/fastq_pass --save_path Sequencing_reads/demultiplexed -r "
rule gather:
input:
rules.demultiplex.output
output:
"Sequencing_reads/gathered/{sample}_ONT.fastq"
shell:
"cat Sequencing_reads/demultiplexed/{wildcards.barcode}/*fastq > {output.fastq} "
This does give me an error:
RuleException in line 32 of /home/eriny/sandbox/ONT_unicycler_pipeline/ONT_pipeline.smk:
'Wildcards' object has no attribute 'barcode'
But I actually think I'm missing something conceptually. I would like rule gather to be something like:
cat Sequencing_reads/demultiplexed/barcode01/*fastq > Sequencing_reads/gathered/sample01_ONT.fastq
I have tried setting up some dictionaries so that sample and barcode are given the same key, but my syntax must be broken.
I'm hoping to find a 1:1 way to map one variable name onto another.
I'm hoping to find a 1:1 way to map one variable name onto another.
I think the sample to dictionary is a possibility combined with a lambda as input function to get the barcode assign to a sample. For example:
BARCODE=['barcode01', 'barcode02', 'barcode03', 'barcode04']
SAMPLE=['sample01', 'sample02', 'sample03', 'sample04']
sam2bar= dict(zip(SAMPLE, BARCODE))
rule all:
input:
expand("Sequencing_reads/gathered/{sample}_ONT.fastq", sample=SAMPLE), #getting all of the fastq files with the same barcode assigned to the correct sample
rule demultiplex:
input:
glob.glob("Sequencing_reads/fastq_pass/*fastq"),
output:
done= touch('demux.done'), # This signals that guppy has completed
shell:
"guppy_barcoder --input_path Sequencing_reads/fastq_pass --save_path Sequencing_reads/demultiplexed -r "
rule gather:
input:
done= 'demux.done',
fastq= lambda wc: glob.glob("Sequencing_reads/demultiplexed/%s/*fastq" % sam2bar[wc.sample])
output:
fastq= "Sequencing_reads/gathered/{sample}_ONT.fastq"
shell:
"cat {input.fastq} > {output.fastq} "

How to pass variable value as input in snakemake?

I want to download the fastq files from SRA database using SRR ID using Snakemake. I read a file to get SRR ID using python code.
I want to parse the Variable one by one as input. My code is below.
I want to run command
fastq-dump SRR390728
#SAMPLES = ['SRR390728','SRR400816']
SAMPLES = [line.strip() for line in open("./srrList", 'r')]
rule all:
input:
expand("fastq/{sample}.fastq.log",sample=SAMPLES)
rule download_fastq:
input:
"{sample}"
output:
"fastq/{sample}.fastq.log"
shell:
"fastq-dump {input} > {output}"
Skip input and just call the wildcard in shell command. input needs to be a filepath that needs to already exist or be created as part of the pipeline - neither are true in your case.
rule download_fastq:
output:
"fastq/{sample}.fastq.log"
shell:
"fastq-dump {wildcards.sample} > {output}"