Snakemake: how to change literal tab - snakemake

I have a rule like :
rule merge_fastq_by_lane:
input:
r1 = get_fastq_r1,
r2 = get_fastq_r2
output:
r1_o = "{sample}/fastq/lanes/{sample}_{unit}_R1.fastq",
r2_o = "{sample}/fastq/lanes/{sample}_{unit}_R2.fastq",
bam = "{sample}/bam/lanes/{sample}_{unit}.bam"
threads:
1
message:
"Merge fastq from the same sample and lane and align using bwa"
shell:
"""
cat {input.r1} > {output.r1_o}
cat {input.r2} > {output.r2_o}
{bwa} mem -M -t {threads} -R "@RG\tID:{wildcards.sample}_{wildcards.unit}\tSM:{wildcards.sample}" {bwa_index} {output.r1_o} {output.r2_o} | {samtools} view -bS - | {samtools} sort - > {output.bam}
"""
And I have this error message due to tab issues in the -R parameter from bwa
bwa mem -M -t 1 -R "@RG ID:P1_L001 SM:P1" Homo_sapiens.GRCh37.dna.primary_assembly P1/fastq/lanes/P1_L001_R1.fastq P1/fastq/lanes/P1_L001_R2.fastq | samtools view -bS - | samtools sort - > P1/bam/lanes/P1_L001.bam
[E::bwa_set_rg] the read group line contained literal <tab> characters -- replace with escaped tabs: \t

You just have to escape the backslash so that Snakemake does not turn \t into a literal tab before the command reaches the shell:
{bwa} mem -M -t {threads} -R "@RG\\tID:{wildcards.sample}_{wildcards.unit}\\tSM:{wildcards.sample}" {bwa_index} {output.r1_o} {output.r2_o} | {samtools} view -bS - | {samtools} sort - > {output.bam}
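To see the difference between the two forms, here is a minimal shell sketch. It assumes a hypothetical read group value and does not call bwa; it only checks whether each string contains a real tab character, which is what the error message above complains about.

```shell
# Hypothetical read-group values, for illustration only.
tab=$(printf '\t')

# What bwa receives when "\t" has already been turned into a real tab:
rg_literal=$(printf '@RG\tID:P1_L001\tSM:P1')

# What bwa receives when the backslash survives (two characters: \ and t):
rg_escaped='@RG\tID:P1_L001\tSM:P1'

# Print "yes" if the argument contains a real tab character, "no" otherwise.
has_tab() {
  case $1 in
    *"$tab"*) echo yes ;;
    *)        echo no ;;
  esac
}

literal_has_tab=$(has_tab "$rg_literal")
escaped_has_tab=$(has_tab "$rg_escaped")
echo "literal: $literal_has_tab, escaped: $escaped_has_tab"
```

Only the second form is what bwa asks for: it expands the two-character \t sequence itself.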

Related

Better way to process dmidecode data using awk/sed

I run
dmidecode -t 4 | awk '/Signature/,/L1 Cache Handle/' |
grep -e 'Signature' -e 'L1 Cache Handle' |
awk -v Model="$4" '{
if ($4 == "Model")
print $5 " " $7;
else if ($1 == "L1")
print " " $4}' >> data
The contents of 'data' on my system is :
49, 0
0x002E
Essentially, 'data' corresponds to :
Signature: Family 23, Model 49, Stepping 0
L1 Cache Handle: 0x002E
(Model # and L1 cache handle)
Looking for a better/efficient way to do the above operation. Thanks.
Would you please try the following:
dmidecode -t 4 | sed -nE '/Signature|L1 Cache Handle/{s/.*Model ([0-9]+), Stepping ([0-9]+).*/\1 \2/p;s/.*L1 Cache Handle: ([0-9A-Za-z])/\1/p}' | xargs
The sed command extracts the values for Model, Stepping and
L1 Cache Handle:.
The final xargs merges two lines into one.
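As an alternative, the same extraction can probably be done in a single awk call. This is only a sketch: it runs on the two sample lines quoted in the question rather than on real dmidecode output, and it assumes the fields always appear in that order.

```shell
# Sample lines copied from the question; real dmidecode output has more fields.
sample='Signature: Family 23, Model 49, Stepping 0
L1 Cache Handle: 0x002E'

result=$(printf '%s\n' "$sample" | awk '
  # Remember model and stepping; strip the trailing comma from the model.
  /Signature:/       { model = $5; stepping = $7; sub(/,/, "", model) }
  # Print everything on one line once the cache handle is seen.
  /L1 Cache Handle:/ { print model, stepping, $4 }
')
echo "$result"
```

On a real system the printf would be replaced by dmidecode -t 4, which needs root.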

How do I keep the first n lines of a file/command, but grep the rest?

Easiest to give an example.
bash-$ psql -c 'select relname, reltype from pg_catalog.pg_class limit 5;'
relname | reltype
------------------------+---------
bme_reltag_02 | 0
bme_reltag_type1_type2 | 0
bme_reltag_10 | 0
bme_reltag_11 | 0
bme_reltag_cvalue3 | 0 👈 what I care about
But what I am really interested in is anything with cvalue in it. Rather than modifying each query by hand (yes, I know I could do it), I can egrep what I care about.
psql -c 'select relname, reltype from pg_catalog.pg_class limit 5;' | egrep 'cvalue'
but that strips out the first two lines with the column headers.
bme_reltag_cvalue3 | 0
I know I can also do this:
psql -c 'select relname, reltype from pg_catalog.pg_class limit 5;' | head -2 && psql -c 'select relname, reltype from pg_catalog.pg_class limit 5;' | egrep 'cvalue'
relname | reltype
------------------------+---------
bme_reltag_cvalue3 | 0
but what I really want to do is to keep the head (or tail) of some lines one way and then process the rest another.
My particular use case here is grepping the contents of arbitrary psql selects, but I'm curious as to what bash capabilities are in this domain.
I've done this before by writing to a temp file and then processing the temp file in multiple steps, but that's not what I am looking for.
A while read loop and grep, if that is acceptable.
#!/usr/bin/env bash
while IFS= read -r lines; do
[[ $lines == [12]* ]] && echo "${lines#*:}"
[[ $lines == *cvalue[0-9]* ]] && echo "${lines#*:}"
done < <(psql -c 'select relname, reltype from pg_catalog.pg_class limit 5;' | grep -n .)
Without the grep an alternative is a counter to know the line number, which will be a pure bash solution.
#!/usr/bin/env bash
counter=1
while IFS= read -r lines; do
[[ $counter == [12] ]] && echo "$lines"
[[ $lines == *cvalue[0-9]* ]] && echo "$lines"
((counter++))
done < <(psql -c 'select relname, reltype from pg_catalog.pg_class limit 5;')
If bash4+ is available.
#!/usr/bin/env bash
mapfile -t files < <(psql -c 'select relname, reltype from pg_catalog.pg_class limit 5;')
printf '%s\n' "${files[0]}" "${files[1]}"
unset 'files[0]' 'files[1]'
for file in "${files[@]}"; do
[[ $file == *cvalue[0-9]* ]] && echo "$file"
done
By default the builtin read strips leading and trailing whitespace; we don't want that here, so we set IFS= for the read.
grep -n . prefixes each non-empty line with its line number and a :
[12]* is a glob, not a regex: it matches any line that starts with 1 or 2. On longer output it would also match 10:, 11:, 20: and so on; [12]:* restricts it to lines 1 and 2.
*cvalue[0-9]* matches any line containing cvalue immediately followed by at least one digit.
"${lines#*:}" is a parameter expansion that strips the shortest leading match of *:, i.e. the line number and its colon.
<( ) is called process substitution.
$ psql -c ... | awk 'NR<3 || /cvalue/'
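A small sketch of that awk filter on canned input, so it can be tried without a database. The sample_output function and the keep variable are made up for this example; the header count is passed in with -v instead of being hard-coded.

```shell
# Stand-in for the psql output (hypothetical, shortened).
sample_output() {
  cat <<'EOF'
relname | reltype
--------+--------
bme_reltag_02 | 0
bme_reltag_10 | 0
bme_reltag_cvalue3 | 0
EOF
}

# Print the first "keep" lines unconditionally, then only lines matching cvalue.
filtered=$(sample_output | awk -v keep=2 'NR <= keep || /cvalue/')
printf '%s\n' "$filtered"
```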
This can be done with sed using its range feature to only operate on lines 3 and beyond
sed '3,${/cvalue/!{d;};}'
Proof of Concept
$ cat ./psql
relname | reltype
------------------------+---------
bme_reltag_02 | 0
bme_reltag_type1_type2 | 0
bme_reltag_10 | 0
bme_reltag_11 | 0
bme_reltag_cvalue3 | 0
$ sed '3,${/cvalue/!{d;};}' ./psql
relname | reltype
------------------------+---------
bme_reltag_cvalue3 | 0
Explanation
3,${...;}: Start processing from line 3 until the end of file $
/cvalue/!{d;}: Delete d any line that does not match (!) the regex /cvalue/
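For the more general "pass the first n lines through, then hand the rest to another command" case, one possible approach is a small shell function. This is a sketch: head_then_grep is a made-up name, and it relies on the shell's read consuming only the lines it is asked for, so grep sees the remainder of the same stream.

```shell
# head_then_grep N GREP_ARGS...
# Copy the first N lines of stdin unchanged, then run grep on what is left.
head_then_grep() {
  n=$1
  shift
  while [ "$n" -gt 0 ] && IFS= read -r line; do
    printf '%s\n' "$line"
    n=$((n - 1))
  done
  grep "$@"
}

# Canned input standing in for the psql output.
result=$(printf '%s\n' \
  'relname | reltype' \
  '--------+--------' \
  'bme_reltag_02 | 0' \
  'bme_reltag_cvalue3 | 0' | head_then_grep 2 cvalue)
printf '%s\n' "$result"
```

Any grep options can be passed after the line count, for example head_then_grep 2 -i cvalue.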
You can also use the head and tail commands:
cat file.sql | head -n 15 > head.sql
Replace 15 with the number of lines to keep.
Or replace head with tail to take lines from the bottom of the file instead.

Is there a simple way to save variables from sed/awk? [duplicate]

Is there a way to tell sed to output only captured groups?
For example, given the input:
This is a sample 123 text and some 987 numbers
And pattern:
/([\d]+)/
Could I get only 123 and 987 output in the way formatted by back references?
The key to getting this to work is to tell sed to exclude what you don't want to be output as well as specifying what you do want. This technique depends on knowing how many matches you're looking for. The grep command below works for an unspecified number of matches.
string='This is a sample 123 text and some 987 numbers'
echo "$string" | sed -rn 's/[^[:digit:]]*([[:digit:]]+)[^[:digit:]]+([[:digit:]]+)[^[:digit:]]*/\1 \2/p'
This says:
don't default to printing each line (-n)
exclude zero or more non-digits
include one or more digits
exclude one or more non-digits
include one or more digits
exclude zero or more non-digits
print the substitution (p) (on one line)
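Since the thread title asks about saving the results in variables, here is one way the two captured groups could be read into shell variables. It is a sketch only: first and second are names chosen for this example, and it uses -E, which both GNU and BSD sed accept.

```shell
string='This is a sample 123 text and some 987 numbers'

# sed prints "123 987"; read splits that on whitespace into two variables.
read -r first second <<EOF
$(printf '%s\n' "$string" | sed -En 's/[^0-9]*([0-9]+)[^0-9]+([0-9]+)[^0-9]*/\1 \2/p')
EOF

echo "first=$first second=$second"
```

If the line holds fewer than two numbers, sed prints nothing and both variables end up empty.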
In general, in sed you capture groups using parentheses and output what you capture using a back reference:
echo "foobarbaz" | sed 's/^foo\(.*\)baz$/\1/'
will output "bar". If you use -r (-E for OS X) for extended regex, you don't need to escape the parentheses:
echo "foobarbaz" | sed -r 's/^foo(.*)baz$/\1/'
There can be up to 9 capture groups and their back references. The back references are numbered in the order the groups appear, but they can be used in any order and can be repeated:
echo "foobarbaz" | sed -r 's/^foo(.*)b(.)z$/\2 \1 \2/'
outputs "a bar a".
If you have GNU grep:
echo "$string" | grep -Po '\d+'
It may also work in BSD, including OS X:
echo "$string" | grep -Eo '\d+'
These commands will match any number of digit sequences. The output will be on multiple lines.
or variations such as:
echo "$string" | grep -Po '(?<=\D )(\d+)'
The -P option enables Perl Compatible Regular Expressions. See man 3 pcrepattern or man 3 pcresyntax.
Sed has up to nine remembered patterns but you need to use escaped parentheses to remember portions of the regular expression.
See here for examples and more detail
you can use grep
grep -Eow "[0-9]+" file
run(s) of digits
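Note that -w changes what is matched: it keeps only digit runs that form whole words. A quick sketch of the difference, using a made-up sample string:

```shell
text='Test 123 as0018166df 456'

# Every run of digits, wherever it occurs.
all_runs=$(printf '%s\n' "$text" | grep -Eo '[0-9]+')

# Only runs that are whole words; digits embedded in "as0018166df" are skipped.
word_runs=$(printf '%s\n' "$text" | grep -Eow '[0-9]+')

printf 'all:\n%s\nwords only:\n%s\n' "$all_runs" "$word_runs"
```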
This answer works with any count of digit groups. Example:
$ echo 'Num123that456are7899900contained0018166intext' \
| sed -En 's/[^0-9]*([0-9]{1,})[^0-9]*/\1 /gp'
123 456 7899900 0018166
Expanded answer.
Is there any way to tell sed to output only captured groups?
Yes. replace all text by the capture group:
$ echo 'Number 123 inside text' \
| sed 's/[^0-9]*\([0-9]\{1,\}\)[^0-9]*/\1/'
123
s/[^0-9]* # several non-digits
\([0-9]\{1,\}\) # followed by one or more digits
[^0-9]* # and followed by more non-digits.
/\1/ # gets replaced only by the digits.
Or with extended syntax (fewer backslashes, and it allows the use of +):
$ echo 'Number 123 in text' \
| sed -E 's/[^0-9]*([0-9]+)[^0-9]*/\1/'
123
To avoid printing the original text when there is no number, use:
$ echo 'Number xxx in text' \
| sed -En 's/[^0-9]*([0-9]+)[^0-9]*/\1/p'
(-n) Do not print the input by default.
(/p) print only if a replacement was done.
And to match several numbers (and also print them):
$ echo 'N 123 in 456 text' \
| sed -En 's/[^0-9]*([0-9]+)[^0-9]*/\1 /gp'
123 456
That works for any count of digit runs:
$ str='Test Num(s) 123 456 7899900 contained as0018166df in text'
$ echo "$str" \
| sed -En 's/[^0-9]*([0-9]{1,})[^0-9]*/\1 /gp'
123 456 7899900 0018166
Which is very similar to the grep command:
$ str='Test Num(s) 123 456 7899900 contained as0018166df in text'
$ echo "$str" | grep -Po '\d+'
123
456
7899900
0018166
About \d
and pattern: /([\d]+)/
Sed does not recognize the '\d' (shortcut) syntax. The ASCII range used above, [0-9], is not exactly equivalent. The only alternative is to use a character class: '[[:digit:]]'.
The selected answer uses such "character classes" to build a solution:
$ str='This is a sample 123 text and some 987 numbers'
$ echo "$str" | sed -rn 's/[^[:digit:]]*([[:digit:]]+)[^[:digit:]]+([[:digit:]]+)[^[:digit:]]*/\1 \2/p'
That solution only works for (exactly) two runs of digits.
Of course, as the answer is being executed inside the shell, we can define a couple of variables to make such answer shorter:
$ str='This is a sample 123 text and some 987 numbers'
$ d=[[:digit:]] D=[^[:digit:]]
$ echo "$str" | sed -rn "s/$D*($d+)$D+($d+)$D*/\1 \2/p"
But, as has been already explained, using a s/…/…/gp command is better:
$ str='This is 75577 a sam33ple 123 text and some 987 numbers'
$ d=[[:digit:]] D=[^[:digit:]]
$ echo "$str" | sed -rn "s/$D*($d+)$D*/\1 /gp"
75577 33 123 987
That will cover both repeated runs of digits and writing a short(er) command.
Give up and use Perl
Since sed does not cut it, let's just throw in the towel and use Perl; at least it is in LSB, while GNU grep extensions are not :-)
Print the entire matching part, no matching groups or lookbehind needed:
cat <<EOS | perl -lane 'print m/\d+/g'
a1 b2
a34 b56
EOS
Output:
12
3456
Single match per line, often structured data fields:
cat <<EOS | perl -lape 's/.*?a(\d+).*/$1/g'
a1 b2
a34 b56
EOS
Output:
1
34
With lookbehind:
cat <<EOS | perl -lane 'print m/(?<=a)(\d+)/'
a1 b2
a34 b56
EOS
Multiple fields:
cat <<EOS | perl -lape 's/.*?a(\d+).*?b(\d+).*/$1 $2/g'
a1 c0 b2 c0
a34 c0 b56 c0
EOS
Output:
1 2
34 56
Multiple matches per line, often unstructured data:
cat <<EOS | perl -lape 's/.*?a(\d+)|.*/$1 /g'
a1 b2
a34 b56 a78 b90
EOS
Output:
1
34 78
With lookbehind:
cat <<EOS | perl -lane 'print m/(?<=a)(\d+)/g'
a1 b2
a34 b56 a78 b90
EOS
Output:
1
3478
I believe the pattern given in the question was by way of example only, and the goal was to match any pattern.
If you have a sed with the GNU extension allowing insertion of a newline in the pattern space, one suggestion is:
> set string = "This is a sample 123 text and some 987 numbers"
>
> set pattern = "[0-9][0-9]*"
> echo $string | sed "s/$pattern/\n&\n/g" | sed -n "/$pattern/p"
123
987
> set pattern = "[a-z][a-z]*"
> echo $string | sed "s/$pattern/\n&\n/g" | sed -n "/$pattern/p"
his
is
a
sample
text
and
some
numbers
These examples are with tcsh (yes, I know it's the wrong shell) with CYGWIN. (Edit: For bash, remove set, and the spaces around =.)
Try
sed -n -e "/[0-9]/s/^[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\).*$/\1 \2 \3 \4 \5 \6 \7 \8 \9/p"
I got this under cygwin:
$ (echo "asdf"; \
echo "1234"; \
echo "asdf1234adsf1234asdf"; \
echo "1m2m3m4m5m6m7m8m9m0m1m2m3m4m5m6m7m8m9") | \
sed -n -e "/[0-9]/s/^[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\)[^0-9]*\([0-9]*\).*$/\1 \2 \3 \4 \5 \6 \7 \8 \9/p"
1234
1234 1234
1 2 3 4 5 6 7 8 9
$
You need to match the whole line to print only the group, which you're doing in the second command, but you don't need to group the first wildcard. This will work as well:
echo "/home/me/myfile-99" | sed -r 's/.*myfile-(.*)$/\1/'
It's not what the OP asked for (capturing groups) but you can extract the numbers using:
S='This is a sample 123 text and some 987 numbers'
echo "$S" | sed 's/ /\n/g' | sed -r '/([0-9]+)/ !d'
Gives the following:
123
987
I want to give a simpler example on "output only captured groups with sed"
I have /home/me/myfile-99 and wish to output the serial number of the file: 99
My first try, which didn't work was:
echo "/home/me/myfile-99" | sed -r 's/myfile-(.*)$/\1/'
# output: /home/me/99
To make this work, we need to capture the unwanted portion in capture group as well:
echo "/home/me/myfile-99" | sed -r 's/^(.*)myfile-(.*)$/\2/'
# output: 99
*) Note that sed doesn't have \d
You can use ripgrep, which also seems to be a sed replacement for simple substitutions, like this
rg '(\d+)' -or '$1'
Here ripgrep uses -o (--only-matching) and -r (--replace) to output only the first capture group, $1 (quoted to avoid interpretation as a variable by the shell). It is printed twice because there are two matches.

HDFS calculate size of subfolders

Please advise how I can calculate the size of subfolders in HDFS and sort them by size.
hdfs dfs -ls -h /mds/snapshots/user/data | du -sh * | sort -rh | head -10
Seems it should work - but as I understand hdfs doesn't work with additional commands after |
You can use:
hdfs dfs -du -s /path/* | sort -r -k 1 -g | awk '{ suffix="KMGT"; for(i=0; $1>1024 && i < length(suffix); i++) $1/=1024; print int($1) substr(suffix, i, 1), $3; }'
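The sort-then-format step can be tried without a cluster. The sketch below assumes the three-column form of the du output (size, disk space consumed, path), which is what the $3 above implies; the paths and sizes are invented, and sample_du stands in for hdfs dfs -du -s /path/*.

```shell
# Hypothetical du output: size, disk space consumed, path.
sample_du() {
  cat <<'EOF'
1048576 3145728 /path/a
5368709120 16106127360 /path/b
2048 6144 /path/c
EOF
}

# Sort by raw byte count (largest first), then scale each size for display.
report=$(sample_du | sort -k1,1nr | awk '
  BEGIN { n = split("K M G T", unit, " ") }
  {
    size = $1
    i = 0
    while (size >= 1024 && i < n) { size /= 1024; i++ }
    printf "%d%s %s\n", size, (i ? unit[i] : ""), $3
  }')
printf '%s\n' "$report"
```

Sorting happens on the byte counts before any suffix is added, so the order stays correct across units.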

from string to integer (scripts)

I have this snippet of the code:
set calls = `cut -d" " -f2 ${2} | grep -c "$numbers"`
set messages = `cut -d" " -f2 ${3} | grep -c "$numbers"`
@ popularity = (calls * 3) + messages
and this error:
@: Expression Syntax.
What does it mean? grep -c returns a number, or am I wrong? Thanks in advance.
In $numbers I have a list of numbers; the 2nd and 3rd parameters also contain numbers.
Try
@ popularity = ($calls * 3) + $messages
The $ symbols are still needed to indicate variables.
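For comparison only, here is how the same calculation might look in a POSIX shell (sh/bash) rather than csh. The input lines and the pattern 555 are invented for this sketch. Inside $(( )) variable names may be written without the $, unlike in csh's @.

```shell
# Made-up input: the second space-separated field is the number to count.
calls=$(printf '%s\n' 'a 555' 'b 123' 'c 555' | cut -d' ' -f2 | grep -c '555')
messages=$(printf '%s\n' 'a 555' 'b 999' | cut -d' ' -f2 | grep -c '555')

# grep -c prints a count, which the shell treats as an integer here.
popularity=$(( calls * 3 + messages ))
echo "popularity=$popularity"
```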
See C-shell Cookbook