Snakemake combine ambiguous rules

I am trying to combine some rules. rule1 automatically creates a {sample}_unmapped.bam file based on information in library_params.txt. I cannot declare that file as an output because the program names and writes it on its own, but I need it in rule2. Is there a way to make rule2 wait for rule1 to finish and then use that file from rule1? The error I am getting now is that the {sample}_unmapped.bam file is missing.
rule rule1:
    input:
        basecalls_dir="/RUN1/Data/Intensities/BaseCalls/",
        barcodes_dir=directory("barcodes"),
        library_params="library_params.txt",
        metrics_file="metrics_output.txt"
    output:
        log="barcodes.log"
    shell:
        """
        java -Djava.io.tmpdir=/path/to/tmp -Xmx2g -jar picard.jar IlluminaBasecallsToSam BASECALLS_DIR={input.basecalls_dir} BARCODES_DIR={input.barcodes_dir} LANE=1 READ_STRUCTURE=151T8B9M8B151T RUN_BARCODE=run1 LIBRARY_PARAMS={input.library_params} MOLECULAR_INDEX_TAG=RX ADAPTERS_TO_CHECK=INDEXED READ_GROUP_ID=BO NUM_PROCESSORS=2 IGNORE_UNEXPECTED_BARCODES=true > {output.log}
        """

rule rule2:
    input:
        log="barcodes.log",
        infile="{sample}_unmapped.bam"
    params:
        ref="ref.fasta"
    output:
        outfile="{sample}.mapped.bam"
    shell:
        """
        java -Djava.io.tmpdir=/path/to/tmp -Xmx2g -jar picard.jar SamToFastq I={input.infile} F=/dev/stdout INTERLEAVE=true | bwa mem -p -t 7 {params.ref} /dev/stdin | java -Djava.io.tmpdir=/path/to/tmp -Xmx4g -jar picard.jar MergeBamAlignment UNMAPPED={input.infile} ALIGNED=/dev/stdin O={output.outfile} R={params.ref} SORT_ORDER=coordinate MAX_GAPS=-1 ORIENTATIONS=FR
        """

In rule2 I would move infile="{sample}_unmapped.bam" from the input directive to the params directive. And of course you would change the shell script from I={input.infile} to I={params.infile}.
rule2 will still wait for rule1 to complete because you give barcodes.log as input to rule2.
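For concreteness, here is a minimal sketch of rule2 with that change applied, reusing the same paths and command as above (untested, just illustrating where the file reference moves):

rule rule2:
    input:
        log="barcodes.log"                    # keeps the dependency on rule1
    params:
        ref="ref.fasta",
        infile="{sample}_unmapped.bam"        # written by rule1 but not declared as its output
    output:
        outfile="{sample}.mapped.bam"
    shell:
        """
        java -Djava.io.tmpdir=/path/to/tmp -Xmx2g -jar picard.jar SamToFastq I={params.infile} F=/dev/stdout INTERLEAVE=true | bwa mem -p -t 7 {params.ref} /dev/stdin | java -Djava.io.tmpdir=/path/to/tmp -Xmx4g -jar picard.jar MergeBamAlignment UNMAPPED={params.infile} ALIGNED=/dev/stdin O={output.outfile} R={params.ref} SORT_ORDER=coordinate MAX_GAPS=-1 ORIENTATIONS=FR
        """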

Snakemake using the first argument of a list as a wildcard

I am trying to run an analysis in Snakemake where, as the proband, I always take the first bam file in each list, i.e. NUM_194 and NUM_123. Is there a way to use the first identifier of the bam file in the list d as a wildcard on the proband line?
d = {"FAM_194": ["path/to/NUM_194/NUM_194.bam", "path/to/NUM_195/NUM_195.bam", "path/to/NUM_196/NUM_196.bam"],
"FAM_123": ["path/to/NUM_123/NUM_123.bam", "path/to/NUM_126/NUM_126.bam", "path/to/NUM_127/NUM_127.bam"]}
FAMILIES = list(d)
rule all:
input:
...
wildcard_constraints:
family = "|".join(FAMILIES)
.....some other rules
rule SelectVariants:
input:
invcf="{fam}/{fam}.vcf"
params:
ref="myref.fasta"
output:
out="{fam}/{fam}.proband.vcf",
out2="{fam}/{fam}.p.avinput"
shell:
"""
proband=NUM_194 <--- the first sample of the list(d), for example NUM_194
gatk --java-options "-Xms2G -Xmx2g -XX:ParallelGCThreads=2" SelectVariants -R {params.ref} -V {input.invcf} -sn "$proband" -O {output.out}
convert2annovar -format vcf4 --includeinfo {output.out} > {output.out2}
"""
Maybe using a function as input (lambda function here) like this?
rule SelectVariants:
    input:
        invcf="{fam}/{fam}.vcf",
        proband=lambda wc: d[wc.fam][0],
    ...
    shell:
        """
        gatk ... -sn {input.proband} ...
        """

Input vs Output: Job completed successfully, but some output files are missing

My pipeline is failing, I believe, due to a conflict between the expected output of rule all and the actual final output. I believe Snakemake is waiting for the file kma/{sample} without an extension to appear; instead it gets files named with the extensions in multiext("kma/{sample}", ".res", ".aln", ".fsa", ".gz"), and I am having trouble getting them to play well together.
configfile: "config.yaml"
rule all:
input:
expand("kma/{sample}", sample = config["samples"])
#multiext("kma/{sample}", ".res", ".aln", ".fsa", ".gz", sample = config["samples"])
rule seqtk_qualtiy_filter:
input:
lambda wildcards: "S5_Raw/" + config["samples"][wildcards.sample]
output:
temp("qtrim/{sample}.qtrim.fq")
shell:
"seqtk trimfq -b 0.01 {input} > {output}"
rule seqtk_clip:
input:
"qtrim/{sample}.qtrim.fq"
output:
temp("clip/{sample}.clip.fq")
shell:
"seqtk trimfq -b20 -L 350 {input} > {output}"
rule bbnorm:
input:
"clip/{sample}.clip.fq"
output:
"S5_processed/{sample}.norm.fq"
shell:
"bbnorm.sh in={input} out={output} target=100"
rule kma_map:
input:
"S5_processed/{sample}.norm.fq"
params:
ref = "ref/consensus.fasta"
output:
directory("kma/{sample}")
#multiext("kma/{sample}", ".res", ".aln", ".fsa", ".gz")
shell:
"kma -i {input} -t_db {params.ref} -o {output}"
This is the error if you run it the way kma would like the output handling to be done:
Waiting at most 5 seconds for missing files.
MissingOutputException in line 33 of /home/sean/Desktop/reo/antisera project/ReovirusS1AmpliconS5.smk:
Job completed successfully, but some output files are missing. Missing files after 5 seconds:
kma/BA8359-19
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
File "/home/sean/.local/lib/python3.6/site-packages/snakemake/executors/__init__.py", line 544, in handle_job_success
File "/home/sean/.local/lib/python3.6/site-packages/snakemake/executors/__init__.py", line 231, in handle_job_success
Shutting down, this might take some time.
I have tried increasing the latency time as well; however, since the expected file is never actually created, it doesn't matter how long you wait.
The error you receive if you use the multiext function:
Error in rule kma_map:
jobid: 11
output: kma/BA8359-19.res, kma/BA8359-19.aln, kma/BA8359-19.fsa, kma/BA8359-19.frag.gz
shell:
kma -i S5_processed/BA8359-19.norm.fq -t_db ref/consensus.fasta -o kma/BA8359-19.res kma/BA8359-19.aln kma/BA8359-19.fsa kma/BA8359-19.frag.gz
(one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
This error happens, I believe, because snakemake lists all the file types in the -o flag and kma says no.
There may be an inconsistency between the input files in rule all and the output files in rule kma_map. Showing the error you get would help.
Anyway, the command multiext("kma/{sample}", ".res", ".aln", ".fsa", ".gz", sample = config["samples"]) that you seem to have tried should not compile, since multiext does not accept named arguments. Maybe what you want is this:
rule all:
    input:
        expand("kma/{sample}{ext}", sample= config['samples'], ext= [".res", ".aln", ".fsa", ".gz"]),

rule kma_map:
    ...
    output:
        multiext("kma/{sample}", ".res", ".aln", ".fsa", ".gz")
    shell:
        "kma ... {input} {output}"
EDIT 21SEP2020
I have figured out the correct way to run this application.
First I need rule all like this:
KMA_OUTPUTS = [".res", ".aln", ".fsa", ".frag.gz"]

rule all:
    input:
        expand("kma/{sample}{ext}", sample = config["samples"], ext = KMA_OUTPUTS)
And then the rule for kma like this:
rule kma_map:
    input:
        "S5_processed/{sample}.norm.fq"
    params:
        ref = "ref/consensus.fasta",
        prefix = "kma/{sample}"
    output:
        multiext("kma/{sample}", ".res", ".aln", ".fsa", ".frag.gz")
    shell:
        "kma -i {input} -t_db {params.ref} -o {params.prefix}"
By using params.prefix as the value for kma's -o flag, I get the desired output files.

Snakemake: MissingInputException with inconsistent naming scheme

I am trying to process MinION cDNA amplicons using Porechop with Minimap2 and I am getting this error.
MissingInputException in line 16 of /home/sean/Desktop/reo/antisera project/20200813/MinIONAmplicon.smk:
Missing input files for rule minimap2:
8413_19_strict/BC01.fastq.g
I understand what the error is telling me; I just don't understand why it's not trying to run the rule before it. Porechop is being used to check for all the possible barcodes and will output more than one fastq file if it finds more than one barcode in the directory. However, since I know which barcode I am looking for, I made a barcodes section in the config.yaml file so I can map them together.
I think the error is happening because my target output for Porechop doesn't match the input for minimap2 but I do not know how to correct this problem as there can be multiple outputs from porechop.
I thought I was building a path for the input file for the minimap2 rule and when snakemake discovered that the porechop output was not there it would make it, but that is not what is happening.
Here is my pipeline so far,
configfile: "config.yaml"
rule all:
input:
expand("{sample}.bam", sample = config["samples"])
rule porechop_strict:
input:
lambda wildcards: config["samples"][wildcards.sample]
output:
directory("{sample}_strict/")
shell:
"porechop -i {input} -b {output} --barcode_threshold 85 --threads 8 --require_two_barcodes"
rule minimap2:
input:
lambda wildcards: "{sample}_strict/" + config["barcodes"][wildcards.sample]
output:
"{sample}.bam"
shell:
"minimap2 -ax map-ont -t8 ../concensus.fasta {input} | samtools sort -o {output}"
and the yaml file
samples: {
    '8413_19': relabeled_reads/8413_19.raw.fastq.gz,
    '8417_19': relabeled_reads/8417_19.raw.fastq.gz,
    '8445_19': relabeled_reads/8445_19.raw.fastq.gz,
    '8466_19_104': relabeled_reads/8466_19_104.raw.fastq.gz,
    '8466_19_105': relabeled_reads/8466_19_105.raw.fastq.gz,
    '8467_20': relabeled_reads/8467_20.raw.fastq.gz,
}
barcodes: {
    '8413_19': BC01.fastq.gz,
    '8417_19': BC02.fastq.gz,
    '8445_19': BC03.fastq.gz,
    '8466_19_104': BC04.fastq.gz,
    '8466_19_105': BC05.fastq.gz,
    '8467_20': BC06.fastq.gz,
}
First of all, you can always debug problems like this by specifying the flag --printshellcmds. That prints all the shell commands that Snakemake runs under the hood; you can then try running them manually to locate the problem.
As for why your rule doesn't produce any output, my guess is that samtools requires explicit filenames or - to use stdin:
Samtools is designed to work on a stream. It regards an input file '-' as the standard input (stdin) and an output file '-' as the standard output (stdout). Several commands can thus be combined with Unix pipes. Samtools always output warning and error messages to the standard error output (stderr).
So try that:
shell:
    "minimap2 -ax map-ont -t8 ../concensus.fasta {input} | samtools sort -o {output} -"
I am not 100% sure why this works; I imagine it has to do with the way Snakemake looks at the targets. In any case, here is the solution I found:
rule minimap2:
    input:
        "{sample}_strict"
    params:
        suffix=lambda wildcards: config["barcodes"][wildcards.sample]
    output:
        "{sample}.bam"
    shell:
        "minimap2 -ax map-ont -t8 ../consensus.fasta {input}/{params.suffix} | samtools sort -o {output}"
By using the params feature in Snakemake I was able to match up the correct barcode to the sample name. I am not sure why I couldn't just use that as the input itself (presumably because the porechop rule declares the directory, not the individual fastq files, as its output, so Snakemake cannot find a rule that produces the per-barcode file directly), but once I changed the input back to match the output of the previous rule, it worked.

Snakemake “Missing files after X seconds” error

I am getting the following error every time I try to run my snakemake script:
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cluster nodes: 99
Job counts:
count jobs
1 all
1 antiSMASH
1 pear
1 prodigal
4
[Wed Dec 11 14:59:43 2019]
rule pear:
input: Unmap_41_1.fastq, Unmap_41_2.fastq
output: merged_reads/Unmap_41.fastq
jobid: 3
wildcards: sample=Unmap_41, extension=fastq
Submitted job 3 with external jobid 'Submitted batch job 4572437'.
Waiting at most 120 seconds for missing files.
MissingOutputException in line 14 of /faststorage/project/ABR/scripts/antismash.smk:
Missing files after 120 seconds:
merged_reads/Unmap_41.fastq
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Job failed, going on with independent jobs.
Exiting because a job execution failed. Look above for error message
It would seem that the first rule is not executing, but I am unsure as to why, since as far as I can see the syntax is correct. Does anyone have any advice?
The snakefile is the following:
#!/miniconda/bin/python

workdir: config["path_to_files"]

wildcard_constraints:
    separator = config["separator"],
    extension = config["file_extension"],
    sample = '|'.join(config["samples"])

rule all:
    input:
        expand("antismash-output/{sample}/{sample}.txt", sample = config["samples"])

# merging the paired end reads (either fasta or fastq) as prodigal only takes single end reads
rule pear:
    input:
        forward = f"{{sample}}{config['separator']}1.{{extension}}",
        reverse = f"{{sample}}{config['separator']}2.{{extension}}"
    output:
        "merged_reads/{sample}.{extension}"
    #conda:
    #    "/home/lamma/env-export/antismash.yaml"
    run:
        shell("set +u")
        shell("source ~/miniconda3/etc/profile.d/conda.sh")
        shell("conda activate antismash")
        shell("pear -f {input.forward} -r {input.reverse} -o {output} -t 21")

# If single end then move them to merged_reads directory
rule move:
    input:
        "{sample}.{extension}"
    output:
        "merged_reads/{sample}.{extension}"
    shell:
        "cp {path}/{sample}.{extension} {path}/merged_reads/"

# Setting the rule order on the rules above, which should be treated equally and only one run.
ruleorder: pear > move

# annotating the metagenome with prodigal. Can be done inside antiSMASH but prefer to do it outside
rule prodigal:
    input:
        f"merged_reads/{{sample}}.{config['file_extension']}"
    output:
        gbk_files = "annotated_reads/{sample}.gbk",
        protein_files = "protein_reads/{sample}.faa"
    #conda:
    #    "/home/lamma/env-export/antismash.yaml"
    run:
        shell("set +u")
        shell("source ~/miniconda3/etc/profile.d/conda.sh")
        shell("conda activate antismash")
        shell("prodigal -i {input} -o {output.gbk_files} -a {output.protein_files} -p meta")

# running antiSMASH on the annotated metagenome
rule antiSMASH:
    input:
        "annotated_reads/{sample}.gbk"
    output:
        touch("antismash-output/{sample}/{sample}.txt")
    #conda:
    #    "/home/lamma/env-export/antismash.yaml"
    run:
        shell("set +u")
        shell("source ~/miniconda3/etc/profile.d/conda.sh")
        shell("conda activate antismash")
        shell("antismash --knownclusterblast --subclusterblast --full-hmmer --smcog --outputfolder antismash-output/{wildcards.sample}/ {input}")
I am running the pipeline on only one file at the moment, but the yaml file looks like this if it is of interest:
file_extension: fastq
path_to_files: /home/lamma/ABR/Each_reads
samples:
- Unmap_41
separator: _
I know the error can occur when you use certain flags in Snakemake, but I don't believe I am using those flags. The command being submitted to run the snakefile is:
snakemake --latency-wait 120 --rerun-incomplete --keep-going --jobs 99 --cluster-status 'python /home/lamma/ABR/scripts/slurm-status.py' --cluster 'sbatch -t {cluster.time} --mem={cluster.mem} --cpus-per-task={cluster.c} --error={cluster.error} --job-name={cluster.name} --output={cluster.output}' --cluster-config antismash-config.json --configfile yaml-config-files/antismash-on-rawMetagenome.yaml --snakefile antismash.smk
I have tried the -F flag to force a rerun but this seems to do nothing, as does increasing the --latency-wait number. Any help would be appreciated :)
I think it is likely to be something involving the way I am calling the conda environment in the run commands, but using the conda: option with a yaml file returns 'version not found' style errors.
From what I read of the pear documentation:
-o Specify the name to be used as base for the output files. PEAR outputs four files: a file containing the assembled reads with a assembled.fastq extension, two files containing the forward, resp. reverse, unassembled reads with extensions unassembled.forward.fastq, resp. unassembled.reverse.fastq, and a file containing the discarded reads with a discarded.fastq extension.
So if the output defined in your rule is just a base, I suggest you put this as a param and the real names of the output as output:
rule pear:
    input:
        forward = f"{{sample}}{config['separator']}1.{{extension}}",
        reverse = f"{{sample}}{config['separator']}2.{{extension}}"
    output:
        "merged_reads/{sample}.{extension}.assembled.fastq",
        "merged_reads/{sample}.{extension}.unassembled.forward.fastq",
        "merged_reads/{sample}.{extension}.unassembled.reverse.fastq",
        "merged_reads/{sample}.{extension}.discarded.fastq"
    params:
        base="merged_reads/{sample}.{extension}"
    #conda:
    #    "/home/lamma/env-export/antismash.yaml"
    run:
        shell("set +u")
        shell("source ~/miniconda3/etc/profile.d/conda.sh")
        shell("conda activate antismash")
        shell("pear -f {input.forward} -r {input.reverse} -o {params.base} -t 21")
I haven't tested pear so not sure what the output file names are exactly.
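As a side note on the conda problem mentioned in the question: the conda: directive is only applied to rules that use shell (or script/wrapper), not run blocks, and it only takes effect when Snakemake is invoked with --use-conda. A sketch of rule pear written that way, assuming the environment yaml itself resolves correctly:

rule pear:
    input:
        forward = f"{{sample}}{config['separator']}1.{{extension}}",
        reverse = f"{{sample}}{config['separator']}2.{{extension}}"
    output:
        "merged_reads/{sample}.{extension}.assembled.fastq"   # plus the other pear outputs, as above
    params:
        base="merged_reads/{sample}.{extension}"
    conda:
        "/home/lamma/env-export/antismash.yaml"   # picked up only when running snakemake with --use-conda
    shell:
        "pear -f {input.forward} -r {input.reverse} -o {params.base} -t 21"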

How to pass variable value as input in snakemake?

I want to download fastq files from the SRA database by SRR ID using Snakemake. I read a file to get the SRR IDs using Python code.
I want to pass the variables one by one as input. My code is below.
I want to run the command:
fastq-dump SRR390728
#SAMPLES = ['SRR390728','SRR400816']
SAMPLES = [line.strip() for line in open("./srrList", 'r')]

rule all:
    input:
        expand("fastq/{sample}.fastq.log", sample=SAMPLES)

rule download_fastq:
    input:
        "{sample}"
    output:
        "fastq/{sample}.fastq.log"
    shell:
        "fastq-dump {input} > {output}"
Skip input and just use the wildcard in the shell command. An input needs to be a file path that either already exists or is created as part of the pipeline; neither is true in your case.
rule download_fastq:
    output:
        "fastq/{sample}.fastq.log"
    shell:
        "fastq-dump {wildcards.sample} > {output}"