Is it possible to have optionally empty Snakemake wildcards? - snakemake

Is it possible to have optionally empty wildcards? It seems like it was possible a few years ago (https://groups.google.com/g/snakemake/c/S7fTL4jAYIM), but the described method didn't work for a user last year and now is not working for me.
My Snakefile looks something like this (abbreviated for clarity):
wildcard_constraints:
    udn_id="ID.+",
    compound="(no_)*compound(_genome|_exome)*"

rule all:
    input: expand("file/path/{id}/{compound}{.*}.html",
                  id=[config["id"]], compound=compound_list, freq=freq_list)

rule create_html:
    output: "file/path/{id}/{compound}{freq,.*}.html"
    input: "/oak/stanford/groups/euan/UDN/output/AnnotSV/AnnotSV_3.0.5/{udn_id}/WGS_blood_"+hg+"/gateway_hpo/{udn_id}.{comp_het}{cohort_freq,.*}.annotated.tsv"
    shell: # Run shell commands

rule append_freq:
    output: "file/path/{id}/{compound}.ha_freq.tsv"
    input: "file/path/{id}/{compound}.tsv"
    script: "file/path/get_ha_freq.py"
I get the error
No values given for wildcard ''.
File file/path, line 6 in <module>
when I run this.
I also tried implementing a wildcard constraint like this:
wildcard_constraints:
    udn_id="ID.+",
    compound="(no_)*compound(_genome|_exome)*"
    freq=".*"

rule all:
    input: expand("file/path/{id}/{compound}{freq}.html",
                  id=[config["id"]], compound=compound_list, freq=freq_list)

rule create_html:
    output: "file/path/{id}/{compound}{freq}.html"
    input: "/oak/stanford/groups/euan/UDN/output/AnnotSV/AnnotSV_3.0.5/{udn_id}/WGS_blood_"+hg+"/gateway_hpo/{udn_id}.{comp_het}{cohort_freq}.annotated.tsv"
    shell: # Run shell commands

rule append_freq:
    output: "file/path/{id}/{compound}.ha_freq.tsv"
    input: "file/path/{id}/{compound}.tsv"
    script: "file/path/get_ha_freq.py"
but I received the error,
No values given for wildcard 'freq'.
File file/path, line 7 in <module>
when I did this.
What error am I making?

It is still possible to have empty wildcards if you update the wildcard_constraints as described in your link. Here's a short example:
#!/usr/bin/env snakemake

wildcard_constraints:
    sample=".*"

samples = ["a", "b", "", "d"]

rule all:
    input:
        "collected/all_samples.txt"

rule process_samples:
    output:
        "processed/{sample}.txt"
    shell:
        """
        echo '{wildcards.sample}' > {output}
        """

rule collect_samples:
    input:
        processed = ["processed/{}.txt".format(sample) for sample in samples]
    output:
        "collected/all_samples.txt"
    shell:
        """
        cat {input.processed} > {output}
        """
After saving this file as example.smk I can run the pipeline:
$ snakemake --cores 1 -s example.smk
Building DAG of jobs...
Using shell: /usr/bin/bash
Provided cores: 1 (use --cores to define parallelism)
Rules claiming more threads will be scaled down.
Job stats:
job count min threads max threads
--------------- ------- ------------- -------------
all 1 1 1
collect_samples 1 1 1
process_samples 4 1 1
total 6 1 1
[Wed Aug 25 20:43:24 2021]
rule process_samples:
output: processed/a.txt
jobid: 2
wildcards: sample=a
resources: tmpdir=/tmp
[Wed Aug 25 20:43:24 2021]
Finished job 2.
1 of 6 steps (17%) done
[Wed Aug 25 20:43:24 2021]
rule process_samples:
output: processed/.txt
jobid: 4
wildcards: sample=
resources: tmpdir=/tmp
... (skipping some output) ...
[Wed Aug 25 20:43:25 2021]
localrule all:
input: collected/all_samples.txt
jobid: 0
resources: tmpdir=/tmp
[Wed Aug 25 20:43:25 2021]
Finished job 0.
6 of 6 steps (100%) done
The empty-wildcard output file processed/.txt is created, and the empty wildcard shows up as a blank line in the collected/all_samples.txt file.
$ ls processed/.txt
processed/.txt
$ cat collected/all_samples.txt
a
b

d
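Applied to your Snakefile, a minimal sketch of the same idea would look roughly like this (I am reusing the rule, wildcard and variable names from your abbreviated example, so treat the details as assumptions):

wildcard_constraints:
    udn_id="ID.+",
    compound="(no_)*compound(_genome|_exome)*",   # note the comma between the entries
    freq=".*"                                     # freq is allowed to be empty

rule all:
    input:
        expand("file/path/{id}/{compound}{freq}.html",
               id=[config["id"]], compound=compound_list, freq=freq_list)

rule create_html:
    output: "file/path/{id}/{compound}{freq}.html"
    # input/shell as in your original rule

As in the toy example above, freq_list needs to contain an empty string "" for the suffix-less .html targets to be requested, and the global .* constraint lets create_html match them.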

Related

How to print matching line, 3 lines after, and matching URL

I'm trying to extract information from SMTP mails in text form, i.e.:
the date (e.g. Wed, 9 Oct 2019 01:55:58 -0700 (PDT))
the sender (e.g. from xxx.yyy.com (zzz:com. [111.222.333.444]))
the URLs present in the mail (e.g. http://some.thing)
Here's an example of an input:
Delivered-To: SOME#ADDRESS.COM
Received: by X.X.X.X with SMTP id SOMEID;
Wed, 9 Oct 2019 01:55:58 -0700 (PDT)
X-Received: by X.X.X.X with SMTP id SOMEID;
Wed, 09 Oct 2019 01:55:58 -0700 (PDT)
Return-Path: <SOME#ADDRESS.COM>
Received: from SOME.URL.COM (SOME.OTHER.URL.COM. [X.X.X.X])
by SOME.THIRD.URL.COM with ESMTP id SOMEID
for <SOME#ADDRESS.COM>;
Wed, 09 Oct 2019 01:55:58 -0700 (PDT)
SOME_HTML
SOME_HTML
href="http://URL1"><img
SOME_HTML
src="http://URL2"
SOME_HTML
The example is deliberately truncated; the real header is longer, but this is enough for the example.
I've tried sed and awk and managed to get somewhere, but not where I want.
SED:
sed -e 's/http/\nhttp/g' -n -e '/Received: from/{h;n;n;n;H;x;s/\n \+/;/;p}' a.txt
The first substitution is meant to put each URL on its own line, but I didn't manage to use the result afterwards.
And in any case, the output is not in the right order.
AWK:
BEGIN{
    RS = "\n";
    FS = "";
}
/Received: from/{
    from = $0;
    getline;
    getline;
    getline;
    date = $0
}
/"\"https?://[^\"]+"/
{
    FS="\"";
    print $0;
}
END{
    print date";"from;
};
This one works except for the URL part: the regexp doesn't work here, although it does as a one-liner.
I also tried to find a more elegant way for the date by using the value of NR+3, but it didn't work.
And I'd like to display the result in CSV format:
date;sender;URL1;URL2;...
I would prefer pure sed or pure awk: I think I could do it by combining grep, tail, sed and awk, but since I want to learn, I'd rather stick to one (or both) of them :)
Well, the following longish sed script (comments inside) does the job:
sed -nE '
/Received: from /{
    # hold my line!
    h
    # ach, spaghetti code, here we go again
    : notdate
    ${
        s/.*/ERROR: INVALID INPUT: DATE NOT FOUND/
        p
        q1
    }
    # the next line after the line ending with ; should be the date
    /;$/!{
        # so search for a line ending with ;
        n
        b notdate
    }
    # the next line is the date
    n
    # remove leading spaces
    s/^[[:space:]]*//
    # grab the Received: from line
    G
    # and save it for later
    h
}
# headers end with an empty line
/^$/{
    # loop over lines
    : read_next_line
    n
    # flag all occurrences of URLs with \x01<URL>\x02
    s/"(http[^"]+)"/\x01\1\x02/g
    # we found at least one URL if there is \x01 in the pattern space
    /\x01/{
        # extract each occurrence to the end of pattern space with a newline
        : again
        s/^([^\x01]*)\x01([^\x02]*)\x02(.*)$/\1\3\n\2/
        t again
        # remove everything in front of the separator - the unparsed part of the line
        s/^[^\n]*\n//
        # add URLs to hold space
        H
    }
    # if this is the last line, we should finally print something, and exit
    ${
        # grab the hold space
        x
        # replace the separators with ;
        s/\n/;/g
        # print and exit successfully
        p
        q 0
    }
    # here we go again!
    b read_next_line
}
'
for the following input:
Delivered-To: SOME#ADDRESS.COM
Received: by X.X.X.X with SMTP id SOMEID;
Wed, 9 Oct 2019 01:55:58 -0700 (PDT)
X-Received: by X.X.X.X with SMTP id SOMEID;
Wed, 09 Oct 2019 01:55:58 -0700 (PDT)
Return-Path: <SOME#ADDRESS.COM>
Received: from SOME.URL.COM (SOME.OTHER.URL.COM. [X.X.X.X])
by SOME.THIRD.URL.COM with ESMTP id SOMEID
for <SOME#ADDRESS.COM>;
Wed, 09 Oct 2019 01:55:58 -0700 (PDT)
SOME_HTML
SOME_HTML
href="http://URL1"><img
SOME_HTML
src="http://URL2"
SOME_HTML
SOMEHTML src="http://URL3" SOMEHTML src="http://URL4"
outputs:
Wed, 09 Oct 2019 01:55:58 -0700 (PDT);Received: from SOME.URL.COM (SOME.OTHER.URL.COM. [X.X.X.X]);http://URL1;http://URL2;http://URL3;http://URL4
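Since you said pure awk would also suit you, here is a rough awk sketch of the same idea. It is my own adaptation rather than a translation of the sed script, and it relies on the same assumptions: the date is the line that follows the first line ending in ; after the Received: from header, and a URL is anything quoted that starts with http.

awk '
/Received: from/ { from = $0; grab = 1; next }
grab && /;$/     { getline date; sub(/^[[:space:]]+/, "", date); grab = 0 }
{
    # collect every quoted http(s) URL on the current line
    line = $0
    while (match(line, /"https?:\/\/[^"]+"/)) {
        urls = urls ";" substr(line, RSTART + 1, RLENGTH - 2)
        line = substr(line, RSTART + RLENGTH)
    }
}
END { print date ";" from urls }
' a.txt

On your sample it should print the same date;sender;URL1;URL2;... line as the sed version.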

Snakemake: how to change literal tab

I have a rule like:
rule merge_fastq_by_lane:
    input:
        r1 = get_fastq_r1,
        r2 = get_fastq_r2
    output:
        r1_o = "{sample}/fastq/lanes/{sample}_{unit}_R1.fastq",
        r2_o = "{sample}/fastq/lanes/{sample}_{unit}_R2.fastq",
        bam = "{sample}/bam/lanes/{sample}_{unit}.bam"
    threads:
        1
    message:
        "Merge fastq from the same sample and lane and align using bwa"
    shell:
        """
        cat {input.r1} > {output.r1_o}
        cat {input.r2} > {output.r2_o}
        {bwa} mem -M -t {threads} -R "@RG\tID:{wildcards.sample}_{wildcards.unit}\tSM:{wildcards.sample}" {bwa_index} {output.r1_o} {output.r2_o} | {samtools} view -bS - | {samtools} sort - > {output.bam}
        """
And I get this error message, due to tab issues in the -R parameter of bwa:
bwa mem -M -t 1 -R "@RG ID:P1_L001 SM:P1" Homo_sapiens.GRCh37.dna.primary_assembly P1/fastq/lanes/P1_L001_R1.fastq P1/fastq/lanes/P1_L001_R2.fastq | samtools view -bS - | samtools sort - > P1/bam/lanes/P1_L001.bam
[E::bwa_set_rg] the read group line contained literal <tab> characters -- replace with escaped tabs: \t
You just have to escape the tab character so that snakemake does not interpret it:
{bwa} mem -M -t {threads} -R "@RG\\tID:{wildcards.sample}_{wildcards.unit}\\tSM:{wildcards.sample}" {bwa_index} {output.r1_o} {output.r2_o} | {samtools} view -bS - | {samtools} sort - > {output.bam}
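The underlying reason is that the shell block is an ordinary (triple-quoted) Python string, so Python turns \t into a real tab character before bwa ever sees the command, whereas \\t survives as the two characters \ and t, which is what bwa expects. A quick illustration in plain Python, outside Snakemake (the read-group values are just made-up samples):

rg_bad  = "@RG\tID:P1_L001\tSM:P1"    # "\t" is interpreted: the string contains real tabs
rg_good = "@RG\\tID:P1_L001\\tSM:P1"  # "\\t" stays as the two characters backslash and t

print(rg_bad)    # @RG     ID:P1_L001      SM:P1   <- literal tabs, what bwa rejects
print(rg_good)   # @RG\tID:P1_L001\tSM:P1          <- escaped tabs, what bwa accepts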

Different result when counting the lines of a file using wc -l and cat -n

I heard that wc -l counts the number of lines in a file. However, when I use it on a file generated by Python, it gives a different result, miscounting by one line.
Here is the MWE.
#!/usr/bin/env python
import random

def getRandomLines(in_str, num):
    res = list()
    lstr = len(in_str)
    for i in range(num):
        res.append(''.join(random.sample(in_str, lstr)))
    return res

def writeRandomLines(rd_lines, fname):
    lines = '\n'.join(rd_lines)
    with open(fname, 'w') as fout:
        fout.write(lines)

if __name__ == '__main__':
    writeRandomLines(getRandomLines("foobarbazqux", 20), "example.txt")
This gives a file, example.txt, that contains 20 lines of random strings. Thus the expected number of lines in example.txt is 20. However, when one applies wc -l to it, it reports 19.
$ wc -l example.txt
19 example.txt
When one uses cat -n to show the content of the file, with line number, one can see
$ cat -n example.txt
1 oaxruzaqobfb
2 ozbarboaufqx
3 fbzarbuoxoaq
4 obqfarbozaxu
5 xoqbrauboazf
6 ufqooxrababz
7 rqoxafuzboab
8 bfuaqoxaorbz
9 baxroazfouqb
10 rqzafoobxaub
11 xqaoabbufzor
12 aobxbaoruzfq
13 buozaqbrafxo
14 aobzoubfarxq
15 aquofrboazbx
16 uaoqrfobbaxz
17 bxqubarfoazo
18 aaxruzofbboq
19 xuaoarzoqfbb
20 bqouzxraobfa
Why does wc -l miscount by one line, and what can I do to fix this problem?
Any clues or hints will be appreciated.
In your python code, you have:
lines = '\n'.join(rd_lines)
So what you are really writing is:
word1\nword2\n...wordX-1\nwordX
Unfortunately, in man wc:
-l, --lines
       print the newline counts
hence your difference.
Apparently wc -l needs to see a \n at the end of the line to count it as one. Your current format has the last line without a trailing \n, therefore not counted by wc -l. Add the newline and it should be fixed.
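For example, printf makes the difference easy to see:

$ printf 'foo\nbar' | wc -l      # last line not terminated by a newline
1
$ printf 'foo\nbar\n' | wc -l    # every line terminated
2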
wc -l only counts the number of newline characters.
Since you are joining lines with a '\n' character, joining 20 lines uses only 19 '\n' characters, hence the result of 19.
If you need the correct count, terminate each line with '\n'.
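If you want the file itself to be well-formed, a minimal fix in the generating script (a sketch based on the writeRandomLines() function from the question) is to append the trailing newline when writing:

def writeRandomLines(rd_lines, fname):
    # join with '\n' and terminate the last line too,
    # so that 20 lines produce 20 newline characters
    with open(fname, 'w') as fout:
        fout.write('\n'.join(rd_lines) + '\n')

With that change, wc -l example.txt should report 20.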

Split a field and then remove duplicates

Sample file:
# cat test1
-rw-r--r-- 1 root root 19460 Feb 10 03:56 catalina.2015-02-10.log
-rw-r--r-- 1 root root 206868 May 4 15:05 catalina.2015-05-04.log
-rw-r--r-- 1 root root 922121 Jun 24 09:26 catalina.out
-rw-r--r-- 1 root root 0 Feb 10 02:27 host-manager.2015-02-10.log
-rw-r--r-- 1 root root 0 May 4 04:17 host-manager.2015-05-04.log
-rw-r--r-- 1 root root 2025 Feb 10 03:56 localhost.2015-02-10.log
-rw-r--r-- 1 root root 8323 May 4 15:05 localhost.2015-05-04.log
-rw-r--r-- 1 root root 873 Feb 10 03:56 localhost_access_log.2015-02-10.txt
-rw-r--r-- 1 root root 458600 May 4 23:59 localhost_access_log.2015-05-04.txt
-rw-r--r-- 1 root root 0 Feb 10 02:27 manager.2015-02-10.log
-rw-r--r-- 1 root root 0 May 4 04:17 manager.2015-05-04.log
Expected Output:
catalina
host-manager
localhost
localhost_access_log
manager
Attempt 1 (works):
# awk '{split($9,a,"."); print a[1]}' test1 | awk '!z[$i]++'
catalina
host-manager
localhost
localhost_access_log
manager
Attempt 2 (works):
# awk '{split($9,a,"."); print a[1]}' test1 | uniq
catalina
host-manager
localhost
localhost_access_log
manager
Attempt 3 (Fails):
# awk '{split($9,a,"."); a[1]++} {for (i in a){print a[i]}}' test1
1
2015-02-10
log
1
2015-05-04
log
1
out
.
.
.
Question:
I want to split the 9th field and then display only the unique entries, and I want to do this in a single awk one-liner. Seeking help on my 3rd attempt.
Another, more idiomatic awk one-liner:
awk '!a[ $0 = substr($NF,1,index($NF,".")-1) ]++' file
or, expressed more explicitly:
awk '{$0=substr($NF,1,index($NF,".")-1)} !a[$0]++' file
We use the well-known !a[$0]++ line de-duplication trick, but first we change $0 to substr($NF,1,index($NF,".")-1): the whole line becomes the substring of the last field $NF up to the first dot (.), with substr() and some help from index().
A benefit of this solution is that you don't need to wait until the whole file has been parsed: the split fields are de-duplicated and printed on the fly.
You have to use the END block to print the results:
awk '{split($NF,a,"."); b[a[1]]} END{for (i in b){print i}}' file
Notes:
I am using $NF to catch the last field. This way, if you happen to have more or fewer fields than 9, it will still work (as long as there are no filenames with spaces, because parsing ls is evil).
We cannot loop directly through the a[] array, because it is the one containing the split data. For this we need another array, for example b[]; that's why we use b[a[1]]. There is no need for b[a[1]]++ unless you want to keep track of how many times each item appears.
The END block is executed after processing the whole file. Without it you would be going through the results once per record (that is, once per line), and duplicates would appear.
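If you would rather stay close to your third attempt and avoid the END block, a small variant of the same trick (my own sketch, combining the two answers above) prints each prefix the first time it is seen and preserves input order:

awk '{split($NF, a, ".")} !seen[a[1]]++ {print a[1]}' test1

Here split() runs for every record, and !seen[a[1]]++ is the usual de-duplication pattern applied to the first split component instead of the whole line.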

Remove commas and format dates

I have a large file with entries such as:
<VAL>17,451.26</VAL>
<VAL>353.93</VAL>
<VAL>395.00</VAL>
<VAL>2,405.00</VAL>
<DATE>31 Jul 2013</DATE>
<DATE>31 Jul 2013</DATE>
<DATE>31 Dec 2014</DATE>
<DATE>21 Jun 2002</DATE>
<DATE>10 Jul 2002</DATE>
<MOD>PL</MOD>
<BATCH>13382</BATCH>
<TYPE>Invoice</TYPE>
<REF1>13541/13382</REF1>
<REF2>671042638320</REF2>
<NOTES>a-07 final elec</NOTES>
<SNAME>EDF ENERGY ( Electricity )</SNAME>
<VAL>55.22</VAL>
</CLT>
<CLT>
<CHD>MAT-01</CHD>
<OPN>U5U1</OPN>
<PERIOD>07 2013</PERIOD>
<DATE>13 Jun 2013</DATE>
<DATE>10 Jul 2002</DATE>
<DATE>10 Jul 2002</DATE>
<DATE>21 Aug 2007</DATE>
<DATE>10 Jul 2002</DATE>
<VAL>-4,122,322.03</VAL>
I need to remove the commas in the VAL fields and change the dates to YYYY-MM-DD (e.g. 2013-07-31) in the DATE fields.
Looking for a quick (efficient) way of doing this.
Thanks
This should get you started:
awk -F"[<>]" 'BEGIN {split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec",month," ");for (i=1;i<=12;i++) mdigit[month[i]]=i} /<VAL>/ {gsub(/\,/,"")} /<DATE>/ {split($3,a," ");$0=sprintf("<DATE>%s-%02d-%02d</DATE>",a[3],mdigit[a[2]],a[1])}1' file
<VAL>17451.26</VAL>
<VAL>353.93</VAL>
<VAL>395.00</VAL>
<VAL>2405.00</VAL>
<DATE>2013-07-31</DATE>
<DATE>2013-07-31</DATE>
<DATE>2014-12-31</DATE>
<DATE>2002-06-21</DATE>
<DATE>2002-07-10</DATE>
<MOD>PL</MOD>
<BATCH>13382</BATCH>
<TYPE>Invoice</TYPE>
<REF1>13541/13382</REF1>
<REF2>671042638320</REF2>
<NOTES>a-07 final elec</NOTES>
<SNAME>EDF ENERGY ( Electricity )</SNAME>
<VAL>55.22</VAL>
</CLT>
<CLT>
<CHD>MAT-01</CHD>
<OPN>U5U1</OPN>
<PERIOD>07 2013</PERIOD>
<DATE>2013-06-13</DATE>
<DATE>2002-07-10</DATE>
<DATE>2002-07-10</DATE>
<DATE>2007-08-21</DATE>
<DATE>2002-07-10</DATE>
<VAL>-4122322.03</VAL>
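For readability, here is (as I read it) the same one-liner spread out with comments; the logic is unchanged:

awk -F"[<>]" '
BEGIN {
    # build a month-name -> month-number lookup table
    split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", month, " ")
    for (i = 1; i <= 12; i++) mdigit[month[i]] = i
}
/<VAL>/  { gsub(/,/, "") }                 # strip thousands separators from VAL lines
/<DATE>/ {                                 # $3 is the text between the tags, e.g. "31 Jul 2013"
    split($3, a, " ")                      # a[1]=day, a[2]=month name, a[3]=year
    $0 = sprintf("<DATE>%s-%02d-%02d</DATE>", a[3], mdigit[a[2]], a[1])
}
1                                          # print every (possibly rewritten) line
' file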
sed '# init month converter in holding buffer
1{h;s/.*/Jan01Feb02Mar03Apr04May05Jun06Jul07Aug08Sep09Oct10Nov11Dec12/;x;}
# change Val
/^<VAL>/ s/,//g
# Change Date
/^<DATE>/ {
   # change month
   G
   s/[[:space:]]\{1,\}\([A-Z][a-z][a-z]\)[[:space:]]\{1,\}\(.*\)\n.*\1\([0-9][0-9]\).*/-\3-\2/
   # reformat order
   s/>\(.*\)-\(.*\)-\(.*\)</>\3-\2-\1</
   }' YourFile
POSIX sed, with no extra subshell for the date conversion.
Reformatting the date takes 2 s/// commands here, but they could be merged into a single s/// at the cost of an even less readable regex (it is already quite something as it is).
You could easily add some safety checks on the source date, e.g. for a bad date format.
Your input seems like XML. I'd use a proper XML handling tool, e.g. XML::XSH2, a wrapper around Perl's XML::LibXML:
open file.xml ;
for //VAL set . xsh:subst(., ',', '', 'g') ;
perl { use Time::Piece } ;
for my $d in //DATE {
    $t = $d/text() ;
    set $d/text() { Time::Piece->strptime($t, '%d %b %Y')->ymd } ;
}
save :b ;
This might work for you (GNU sed & bash):
sed -r '/^<VAL>/s/,//g;/^(<DATE>)(.*)(<\/DATE>)$/s//echo "\1"$(date -d "\2" +%F)"\3"/e' file
This removes all commas on lines starting with <VAL>; for lines that contain date tags, it uses the date utility and the evaluate flag of the substitution command to rearrange the date into YYYY-MM-DD.
An alternative solution, using only sed's own commands (no call to date):
sed -r '/^<VAL>/s/,//g;/^<DATE>/!b;s/$/\nJan01Feb02Mar03Apr04May05Jun06Jul07Aug08Sep09Oct10Nov11Dec12/;s/^(<DATE>)(..) (...) (....)(<\/DATE>\n).*\3(..)/\1\4-\6-\2\5/;P;d' file
Appends a lookup table to the end of the date line and uses a regexp to rearrange the output.