How to avoid a Snakemake rule using an incomplete output file from another rule

rule rule1:
    output: tsv = "..."
    input: faa = "..."
    shell:
        """
        awk ... > {output.tsv}
        some commands {input.faa} | awk ... >> {output.tsv}
        """

rule rule2:
    output:
        tsv = "..."
    input:
        tsv = rules.rule1.output.tsv,
    shell:
        """
        awk ... {input.tsv} > {output.tsv}
        """
As illustrated above, rule2 takes its input file from rule1.
According to the official docs, since the output file in rule1 is created successfully by the first awk command, Snakemake assumes everything worked fine, even if my output file is incomplete, because the second command only appends to that file. Snakemake just ran rule2 and took the incomplete file from rule1. In fact, the second awk command in rule1 had not been executed, leaving the output file incomplete.

As far as I can tell, Snakemake detecting the presence of output.tsv and assuming the rule completed successfully (since your awks didn't error) is working as intended.
It's not easy for me to suggest specific edits since the commands are not complete, but how about creating intermediate files in your rule for the two awk commands and then combining them, so that if one or the other doesn't run, the rule fails? Something like:
rule rule1:
    output:
        tsv = "...",
        int1 = temp(".../{sample}_i1.tsv"),
        int2 = temp(".../{sample}_i2.tsv")
    input: faa = "..."
    shell:
        """
        awk ... > {output.int1}
        awk ... {input.faa} > {output.int2}
        [some logic to make sure the processing is complete]
        cat {output.int1} {output.int2} > {output.tsv}
        """
I wrapped both intermediates in temp() so that Snakemake cleans them up once the rules that consume them have finished.
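Another pattern that achieves the same goal (a minimal sketch, not from the original answer) is to build the file under a temporary name and only move it into place after every command has succeeded, so a half-written file never carries the expected output name:

rule rule1:
    output: tsv = "..."
    input: faa = "..."
    shell:
        """
        awk ... > {output.tsv}.tmp
        some commands {input.faa} | awk ... >> {output.tsv}.tmp
        mv {output.tsv}.tmp {output.tsv}
        """

Because recent Snakemake versions run shell blocks with set -euo pipefail, the mv only executes if the earlier commands exit successfully, so rule2 can never pick up a partial file.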

Related

Snakemake, producing list of files that are created within the pipeline

This is my first snakemake workflow, so it might be that I'm overcomplicating things.
My workflow takes as input a 'database query' for downloading some files, which is specified in my 'config.yaml'. This means that I do not know the names of the files that will be downloaded before running the pipeline.
# configfile: "config.yaml"
# DATABASE = config["database"]
# database: '("Apis"[Organism] OR Apis[All Fields]) AND (latest[filter] AND "representative genome"[filter] AND all[filter] NOT anomalous[filter])'
DATABASE = '("Apis"[Organism] OR Apis[All Fields]) AND (latest[filter] AND "representative genome"[filter] AND all[filter] NOT anomalous[filter])'
What I want to do is to:
1. Create a genome list: call a database with my query and extract the links to the files (create_genome_list). (Here, I use entrez.)
2. Download the files using the collected links (download_genome).
3. The files are zipped, so unzip them (unzip_genome).
4. Finally, create a list of all downloaded and unzipped files... and here I struggle (make_summary_table).
I can run my Snakemake pipeline for steps 1-3 when I request one of the expected output files, like this:
snakemake -p database/GCA_000184785.2_Aflo_1.1_genomic/GCA_000184785.2_Aflo_1.1_genomic.fna --use-conda
It gives me links to all expected files (5) in the folder temp/,
and 1 downloaded and unzipped file: database/GCA_000184785.2_Aflo_1.1_genomic/GCA_000184785.2_Aflo_1.1_genomic.fna
My snakemake for steps 1-3 looks like this:
rule create_genome_list:
    output: touch("temp/{genome}")
    conda: "entrez_env.yaml"
    message: "Creating the genomes list..."
    shell:
        r"""
        esearch -db assembly -query '{DATABASE}' \
        | esummary \
        | xtract -pattern DocumentSummary -element FtpPath_GenBank \
        | while read -r line ;
        do
            fname=$(echo $line | grep -o 'GCA_.*' | sed 's/$/_genomic.fna.gz/');
            wildcard=$(echo $fname | sed -e 's!.fna.gz!!');
            echo "$line/$fname" > temp/$wildcard;
            #echo $wildcard >> list_of_genomes.txt
        done
        """
rule download_genome:
    output: touch("database/{genome}/{genome}.fna.gz")
    input: "temp/{genome}"
    shell:
        r"""
        GENOME_LINK=$(cat {input})
        GENOME="${{GENOME_LINK##*/}}"
        wget -P ./database/{wildcards.genome}/ $GENOME_LINK
        """
rule unzip_genome:
    output: touch("database/{genome}/{genome}.fna")
    input: "database/{genome}/{genome}.fna.gz"
    shell: "gunzip {input}"
My problem starts when I want to create the final rule, which will wrap up the results of my pipeline. In my real pipeline, I do some additional analyses with downloaded genomes, and at the end, I want to join all partial results obtained per single genome into one table. Here I post a toy example, which I believe reflects my problem the best.
I guess there is some way to extract the genomes' names so I could call them in the final summarising rule's input.
I approached it in an ugly way by listing the files in temp/ and using them in expand() as follows:
GENOMES = os.listdir("temp/")

rule make_summary_table:
    output: "summary_table.txt"
    input: expand("database/{genome}/{genome}.fna", genome = GENOMES)
    shell:
        """
        echo {input} >> {output}
        echo " " >> {output}
        """
But this works only when temp/ exists before running the pipeline. And it produces summary_table.txt with 5 entries only when I have run steps 1-3 beforehand (otherwise, it produces an empty file).
I am also afraid that in my real pipeline not all genomes will have produced their partial results by the time the final summarising rule is called. But maybe Snakemake handles this somehow (by waiting?) once all the inputs are specified.
-----------------------------EDIT-----------------------------------------
I have tried to implement a checkpoint as a possible solution, as follows:
DATABASE = '("Apis"[Organism] OR Apis[All Fields]) AND (latest[filter] AND "representative genome"[filter] AND all[filter] NOT anomalous[filter])'
rule all:
input: "summary_table.txt"
checkpoint create_genome_list:
output: directory("temp/")
conda: "entrez_env.yaml"
shell:
r"""
esearch -db assembly -query '{DATABASE}' \
| esummary \
| xtract -pattern DocumentSummary -element FtpPath_GenBank \
| while read -r line ;
do
fname=$(echo $line | grep -o 'GCA_.*' | sed 's/$/_genomic.fna.gz/');
wildcard=$(echo $fname | sed -e 's!.fna.gz!!');
echo "$line/$fname" > temp/$wildcard;
#echo $wildcard >> list_of_genomes.txt
done
"""
rule download_genome:
    output: touch("database/{genome}/{genome}.fna.gz")
    input: "temp/{genome}"
    shell:
        r"""
        GENOME_LINK=$(cat {input})
        GENOME="${{GENOME_LINK##*/}}"
        wget -P ./database/{wildcards.genome}/ $GENOME_LINK
        """

rule unzip_genome:
    output: "database/{genome}/{genome}.fna"
    input: "database/{genome}/{genome}.fna.gz"
    shell:
        r"""
        gunzip {input}
        """
def aggregate_input(wildcards):
    checkpoint_output = checkpoints.create_genome_list.get(**wildcards).output[0]
    return expand("database/{genome}/{genome}.fna",
                  i=glob_wildcards(os.path.join(checkpoint_output, "{genome}.fna")).genome)

rule make_summary_table:
    output: "summary_table.txt"
    input: aggregate_input
    shell:
        """
        echo {input} >> {output}
        echo " " >> {output}
        """
But I cannot overcome this error: InputFunctionException in line 73 (rule make_summary_table) of ~/snakemake_test/Snakefile: WildcardError: No values given for wildcard 'genome'. Wildcards:
For your updated code to work you need to apply at least the following two fixes:
1. Redefine the rule as a checkpoint (not necessary, see edit note below).
2. Expand all wildcards in the checkpoint-related function: your expand leaves {genome} un-expanded, because it expands i, which is not a wildcard in the pattern and thus does nothing.
The relevant code lines:
def aggregate_input(wildcards):
    checkpoint_output = checkpoints.create_genome_list.get(**wildcards).output[0]
    return expand(
        "database/{genome}/{genome}.fna",
        genome=glob_wildcards(os.path.join(checkpoint_output, "{genome}.fna")).genome,
    )

rule make_summary_table:
    output:
        "summary_table.txt",
    input:
        aggregate_input,
    shell:
        """
        echo {input} >> {output}
        echo " " >> {output}
        """
Give it a try and let us know if it works!
edit: Sorry, I realised that the relevant rule had already been converted to a checkpoint, so my first point is invalid. I've updated the answer above.
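For reference, glob_wildcards infers wildcard values from the files that already exist on disk, so the pattern must match the stored file names exactly, extension included. A minimal sketch (the file names are hypothetical):

# Given files temp/GCA_A_genomic and temp/GCA_B_genomic, the pattern
# "temp/{genome}" yields genome == ["GCA_A_genomic", "GCA_B_genomic"];
# a pattern like "temp/{genome}.fna" would match nothing here, because
# these file names do not end in ".fna".
genomes = glob_wildcards("temp/{genome}").genome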

Split files in Snakemake

I have a simple question, but I just cannot figure it out myself.
I have a list of inputs (a,b,c). For each input, I need to extract some data (1 to 23):
bcftools view -H a.vcf.gz -r 1 > a_chr1.txt
...
bcftools view -H a.vcf.gz -r 23 > a_chr23.txt
I can do it with a for loop in a Snakemake rule:
IDS = ['a', 'b', 'c']
chrs = range(1, 24)  # chromosomes 1 to 23 (the range end is exclusive)

rule:
    input:
        expand("{id}.vcf.gz", id=IDS)
    output:
        expand("{id}_{chr}.txt", chr=chrs, id=IDS)
    run:
        for i in IDS:
            for c in chrs:
                shell("bcftools view -H {i}.vcf.gz -r {c} > {i}_chr{c}.txt")
, but a for loop does not parallelize the work. I need a proper Snakemake way, something like the attempt below, but it does not work.
IDS = ['a', 'b', 'c']
chrs = range(1, 24)

rule:
    input:
        expand("{id}.vcf.gz", id=IDS)
    output:
        expand("{id}_{chr}.txt", chr=chrs, id=IDS)
    params:
        c=expand("{chr}", chr=chrs)
    shell:
        "bcftools view -H {input} -r {params.c} > {output}"
Could you please help?
You are not taking advantage of the Snakemake wildcards here. If you specify an expand in your inputs and outputs, then Snakemake will run the rule only once: it tells Snakemake that all vcf files are needed to run the rule and that this single run will produce all of the split files. What you need is a rule that can be applied to any vcf file and produces only one split (by chr) output.
IDS = ['a', 'b', 'c']
chrs = range(1, 24)  # chromosomes 1 to 23

rule all:
    input: expand("{id}_{chr}.txt", chr=chrs, id=IDS)

rule splitByChr:
    input:
        "{id}.vcf.gz"
    output:
        "{id}_{chr}.txt"
    shell:
        "bcftools view -H {input} -r {wildcards.chr} > {output}"
The rule all here will trigger splitByChr as many times as necessary.
Also note that {id} and {chr} in the expand function are not wildcards; they are placeholders for the arguments passed to expand.
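With per-chromosome wildcards in place, parallelism comes from the scheduler: each splitByChr job is independent, so requesting several cores runs them concurrently (standard Snakemake invocation):

snakemake -j 8    # run up to 8 independent splitByChr jobs at the same time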

Can I stop a rule in snakefile being parallel executed

I tried to concatenate files created via a Snakemake workflow in the last rule. To separate and identify the contents of each file, I first echo each file name in the shell as a separation tag (see the code below):
rule cat:
    input:
        expand('Analysis/typing/{sample}_type.txt', sample=samples)
    output:
        'Analysis/typing/Sum_type.txt'
    shell:
        'echo {input} >> {output} && cat {input} >> {output}'
I was looking for the result in this format:

file name of sample 1
content of sample 1
file name of sample 2
content of sample 2

Instead I got this format:

file name of sample 1 file name of sample 2 ...
content of sample 1
content of sample 2
...
It seems the shell runs the echo command on all the input files at once and only then runs the cat command on all of them. What can I do to get the format I wanted?
Thanks
This looks more like a shell issue than a Snakemake issue.
If you want the file names and contents to alternate, you can use a loop over the input files, as follows:
# Just an example:
samples = ["A", "B", "C"]

rule all:
    input:
        'Analysis/typing/Sum_type.txt'

rule cat:
    input:
        expand('Analysis/typing/{sample}_type.txt', sample=samples)
    output:
        'Analysis/typing/Sum_type.txt'
    shell:
        """
        for file in {input}
        do
            echo ${{file}} >> {output}
            cat ${{file}} >> {output}
        done
        """
(The double curly braces prevent Snakemake/Python from interpreting the intended shell variable file as something to interpolate when computing the string it passes to the shell.)
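For example, with three samples, Snakemake would render the command roughly like this before handing it to bash (an illustrative expansion; the file names are assumed):

for file in Analysis/typing/A_type.txt Analysis/typing/B_type.txt Analysis/typing/C_type.txt
do
    echo ${file} >> Analysis/typing/Sum_type.txt
    cat ${file} >> Analysis/typing/Sum_type.txt
done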
The output you get is consistent with the way bash works rather than with Snakemake: echo {input} receives all file names as arguments of a single call, so all the names are printed before any content. Anyway, I think the Snakemake way of doing it would be one rule that prepends the file name to each file's content and a second rule that concatenates the results. E.g. (not checked for errors):
rule cat:
    input:
        'Analysis/typing/{sample}_type.txt',
    output:
        temp('Analysis/typing/{sample}_type.txt.out'),
    shell:
        r"""
        echo {input} > {output}
        cat {input} >> {output}
        """

rule cat_all:
    input:
        expand('Analysis/typing/{sample}_type.txt.out', sample=samples)
    output:
        'Analysis/typing/Sum_type.txt'
    shell:
        r"""
        cat {input} > {output}
        """

Looking for a good output format to use a value extracted from a file in new script/process in Nextflow

I can't seem to figure this one out:
I am writing some processes in Nextflow in which I'm extracting a value from a txt file (PROCESS1) and I want to use it in a second process (PROCESS2). The extraction of the value is no problem, but finding a suitable output format is. The problem is that when I save the stdout (OPTION1) to a channel, there seems to be some kind of "\n" (newline) attached, which causes problems in my second script.
Alternatively, because this was not working, I wanted to save the output of PROCESS1 as a file (OPTION2). Again, writing the file is no problem, but I can't find the correct way to read the content of the file in PROCESS2. I suspect it has something to do with "getText()", but I tried several things and they all failed.
Finally, I wanted to try to save the output as a variable (OPTION3), but I don't know how to do this.
PROCESS1
process txid {
    publishDir "$wanteddir", mode:'copy', overwrite: true

    input:
    file(report) from report4txid

    output:
    stdout into txid4assembly            // OPTION 1
    file('txid.txt') into txid4assembly  // OPTION 2
    val(txid) into txid4assembly         // OPTION 3: doesn't work

    shell:
    '''
    column -s, -t < !{report} | awk '$4 == "S"' | head -n 1 | cut -f5             # OPTION 1
    column -s, -t < !{report} | awk '$4 == "S"' | head -n 1 | cut -f5 > txid.txt  # OPTION 2
    column -s, -t < !{report} | awk '$4 == "S"' | head -n 1 | cut -f5 > txid      # OPTION 3
    '''
}
PROCESS2
process accessions {
    publishDir "$wanteddir", mode:'copy', overwrite: true

    input:
    val(txid) from txid4assembly   // OPTION 1 & OPTION 3
    file(txid) from txid4assembly  // OPTION 2

    output:
    file("${txid}accessions.txt") into accessionlist

    script:
    """
    esearch -db assembly -query '${txid}[txid] AND "complete genome"[filter] AND "latest refseq"[filter]' \
    | esummary | xtract -pattern DocumentSummary -element AssemblyAccession > ${txid}accessions.txt
    """
}
RESULTING SCRIPT OF PROCESS2 AFTER OPTION 1 (remark: the extracted value is 573; the layout is unchanged):
esearch -db assembly -query '573
[txid] AND "complete genome"[filter] AND "latest refseq"[filter]' | esummary | xtract -pattern DocumentSummary -element AssemblyAccession > 573
accessions.txt
Thank you for your help!
As you've discovered, your command line writes a trailing newline character. You could try removing it somehow, perhaps by piping to another command, or (better) by refactoring to properly parse your report files. Below is an example using awk to print the fifth column without a trailing newline character. This might work fine for a simple CSV report file, but the CSV parsing capabilities of AWK are limited, so if your reports could contain quoted fields etc., consider using a language that offers CSV parsing in its standard library (e.g. Python and the csv library, or Perl and the Text::CSV module). Nextflow makes it easy to use your favourite scripting language.
process txid {
    publishDir "$wanteddir", mode:'copy', overwrite: true

    input:
    file(report) from report4txid

    output:
    stdout into txid4assembly

    shell:
    '''
    awk -F, '$4 == "S" { printf("%s", $5); exit }' "!{report}"
    '''
}
In the case where your file contains an "S" in the fourth column and the fifth column has some value with string length >= 1, this will give you a value that you can use in your 'accessions' process. But please be aware that this won't handle the case where the fourth column in your file is never equal to "S", nor the case where your fifth column holds an empty value (string length == 0). In these cases stdout will be empty, so you'll get an empty value in your output channel. You may want to add some code to make sure that these edge cases are handled somehow.
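One possible guard (a sketch, not part of the original answer): make the task fail loudly when no matching row is found, so an empty value never propagates downstream:

# found is set only when a row matches; awk still runs the END block
# after exit, so a missing match becomes a non-zero exit status
awk -F, '$4 == "S" { printf("%s", $5); found=1; exit } END { if (!found) exit 1 }' "!{report}"

A non-zero exit status makes Nextflow report the task as failed instead of silently emitting an empty value.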
I eventually fixed it by appending the following code, which keeps only the digits from my output:
... | tr -dc '0-9'
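A channel-level alternative (a sketch, assuming the DSL1 channels used above) is to strip the surrounding whitespace with a map operator instead of inside the shell:

// trim the trailing newline from the captured stdout value
txid4assembly_clean = txid4assembly.map { it.trim() }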

Need help in executing the SQL via shell script and use the result set

I currently have a request to build a shell script to get some data from a table using SQL (Oracle). The query I'm running returns a number of rows. Is there a way to use something like a result set?
Currently, I'm redirecting the output to a file, but I'm not able to reuse the data for further processing.
Edit: Thanks for the reply, Gene. The result file looks like this:
UNIX_PID 37165
----------
PARTNER_ID prad
--------------------------------------------------------------------------------
XML_FILE
--------------------------------------------------------------------------------
/mnt/publish/gbl/backup/pradeep1/27241-20090722/kumarelec2.xml
pradeep1
/mnt/soar_publish/gbl/backup/pradeep1/11089-20090723/dataonly.xml
UNIX_PID 27654
----------
PARTNER_ID swam
--------------------------------------------------------------------------------
XML_FILE
--------------------------------------------------------------------------------
smariswam2
/mnt/publish/gbl/backup/smariswam2/10235-20090929/swam2.xml
There are multiple rows like this. My requirement is to do this using only a shell script.
I need to take each of the PIDs and check whether the process is running, which I can take care of.
My question is: how do I check each PID so I can loop and get the corresponding partner_id and xml_file name? Since it is a file, how can I get the exact corresponding values?
Your question is pretty short on specifics (a sample of the file to which you've redirected your query output would be helpful, as well as some idea of what you actually want to do with the data), but as a general approach, once you have your query results in a file, why not use the power of your scripting language of choice (Ruby and Perl are both good choices) to parse the file and act on each row?
Here is one suggested approach. It wasn't clear from the sample you posted, so I am assuming that this is actually what your sample file looks like:
UNIX_PID 37165 PARTNER_ID prad XML_FILE /mnt/publish/gbl/backup/pradeep1/27241-20090722/kumarelec2.xml pradeep1 /mnt/soar_publish/gbl/backup/pradeep1/11089-20090723/dataonly.xml
UNIX_PID 27654 PARTNER_ID swam XML_FILE smariswam2 /mnt/publish/gbl/backup/smariswam2/10235-20090929/swam2.xml
I am also assuming that:
- There is a line-feed at the end of the last line of your file.
- The columns are separated by a single space.
Here is a suggested bash script (not optimal, I'm sure, but functional):
#! /bin/bash
cat myOutputData.txt |
while read line;
do
myPID=`echo $line | awk '{print $2}'`
isRunning=`ps -p $myPID | grep $myPID`
if [ -n "$isRunning" ]
then
echo "PARTNER_ID `echo $line | awk '{print $4}'`"
echo "XML_FILE `echo $line | awk '{print $6}'`"
fi
done
The script iterates through every line (row) of the input file. It uses awk to extract column 2 (the PID) and then checks (using ps -p) whether that process is running. If it is, it uses awk again to pull out and echo two fields from the line (PARTNER_ID and XML_FILE). You should be able to adapt the script further to suit your needs. Read up on awk if you want to use different column delimiters or do additional text processing.
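For instance, switching awk to a comma delimiter is just a flag (generic awk usage, not specific to this data):

# print the second comma-separated field of each line
awk -F',' '{print $2}' myOutputData.txt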
Things get a little trickier if the output file contains one row for each data element (as you indicated). A good approach here is to use a simple state mechanism within the script and "remember" whether or not the most recently seen PID is running. If it is, any data elements that appear before the next PID should be printed. Here is a commented script that does just that for a file of the format you provided. Note that you must have a line-feed at the end of the last line of input data, or the last line will be dropped.
#!/bin/bash
cat myOutputData.txt |
while read line;
do
    # Extract the first (myKey) and second (myValue) words from the input line
    myKey=`echo $line | awk '{print $1}'`
    myValue=`echo $line | awk '{print $2}'`

    # Take action based on the type of line this is
    case "$myKey" in
        "UNIX_PID")
            # Determine whether the specified PID is running
            isRunning=`ps -p $myValue | grep $myValue`
            ;;
        "PARTNER_ID")
            # Print the specified partner ID if the PID is running
            if [ -n "$isRunning" ]
            then
                echo "PARTNER_ID $myValue"
            fi
            ;;
        *)
            # Check to see if this line represents a file name, and print it
            # if the PID is running
            inputLineLength=${#line}
            if (( $inputLineLength > 0 )) && [ "$line" != "XML_FILE" ] && [ -n "$isRunning" ]
            then
                isHyphens=`expr "$line" : -`
                if [ "$isHyphens" -ne "1" ]
                then
                    echo "XML_FILE $line"
                fi
            fi
            ;;
    esac
done
I think that we are well into custom software development territory now, so I will leave it at that. You should have enough here to customize the script to your liking. Good luck!