Extract helix residues from DSSP with awk

I would like to extract helix (H) residues from DSSP files.
1CRN.dssp
31 37 A K H < S+
32 38 A V H < S+
33 39 A F H >< S-
34 40 A G G >< S+
35 41 A K G > S+
1GB5.dssp
113 242 B G H 3>>S+
114 243 B I H <45S+
115 244 B L H X45S+
116 245 B S H 3<5S+
117 246 B K T >X5S+
I want to save the output in the following format.
>1CRN
KVF
>1GB5
GILS
How can I do this with awk? Your suggestions would be appreciated!

Is it the 'H' in the 5th column that indicates "helix (H) residues"? If so:
awk '{
if (FNR == 1) print ">" FILENAME
if ($5 == "H") {
printf "%s", $4
}
}
END { printf "\n" }' file
Output:
>tstDat.txt
KVF
IHTH
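Building on the answer above, here is a sketch that handles multiple files and matches the requested header format (it assumes, as in the answer, that the helix code is whitespace-separated field 5 and the residue letter field 4, and strips a .dssp extension from the header):

```shell
awk '
FNR == 1 {                      # start of each input file
    if (NR > 1) printf "\n"     # finish the previous residue line
    h = FILENAME
    sub(/\.dssp$/, "", h)       # ">1CRN" rather than ">1CRN.dssp"
    print ">" h
}
$5 == "H" { printf "%s", $4 }   # append helix residues with no newline
END { printf "\n" }
' 1CRN.dssp 1GB5.dssp
```

Using printf "%s", $4 instead of printf $4 also avoids problems if a field ever contains a % character.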

Related

Find and replace and move a line that contains a specific string

Assuming I have the following text file:
a b c d 1 2 3
e f g h 1 2 3
i j k l 1 2 3
m n o p 1 2 3
How do I replace '1 2 3' with '4 5 6' in the line that contains the letter (e) and move it after the line that contains the letter (k)?
N.B. The line that contains the letter (e) may appear anywhere in the file; the lines are not assumed to be in any particular order.
My approach is:
Remove the line I want to replace
Find the lines before the line I want to move it after
Find the lines after the line I want to move it after
Append the output to a file
grep -v 'e' $original > $file
grep -B999 'k' $file > $output
grep 'e' $original | sed 's/1 2 3/4 5 6/' >> $output
grep -A999 'k' $file | tail -n+2 >> $output
rm $file
mv $output $original
but there are several issues with this solution:
a lot of grep commands that seem unnecessary
the -A999 and -B999 arguments assume the file contains no more than 999 lines; it would be better to have another way to get the lines before and after the matched line
I am looking for a more efficient way to achieve this
Using sed
$ sed '/e/{s/1 2 3/4 5 6/;h;d};/k/{G}' input_file
a b c d 1 2 3
i j k l 1 2 3
e f g h 4 5 6
m n o p 1 2 3
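If the hold-space mechanics are unclear, here is a commented, multi-line form of the same command (functionally identical):

```shell
# The "e" line is edited and parked in the hold space,
# then appended after the "k" line.
sed '
  /e/ {
    s/1 2 3/4 5 6/
    h
    d
  }
  /k/ {
    G
  }
' input_file
```

h copies the edited line into the hold space and d deletes it from the output; G then appends the hold space (after a newline) to the line containing "k", so the moved line prints right after it.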
Here is a GNU awk solution:
awk '
/\<e\>/{
s=$0
sub("1 2 3", "4 5 6", s)
next
}
/\<k\>/ && s {
printf("%s\n%s\n",$0,s)
next
} 1
' file
Or POSIX awk:
awk '
function has(x) {
for(i=1; i<=NF; i++) if ($i==x) return 1
return 0
}
has("e") {
s=$0
sub("1 2 3", "4 5 6", s)
next
}
has("k") && s {
printf("%s\n%s\n",$0,s)
next
} 1
' file
Either prints:
a b c d 1 2 3
i j k l 1 2 3
e f g h 4 5 6
m n o p 1 2 3
This works regardless of the order of e and k in the file:
awk '
function has(x) {
for(i=1; i<=NF; i++) if ($i==x) return 1
return 0
}
has("e") {
s=$0
sub("1 2 3", "4 5 6", s)
next
}
FNR<NR && has("k") && s {
printf("%s\n%s\n",$0,s)
s=""
next
}
FNR<NR
' file file
This awk should work for you:
awk '
/(^| )e( |$)/ {
sub(/1 2 3/, "4 5 6")
p = $0
next
}
1
/(^| )k( |$)/ {
print p
p = ""
}' file
a b c d 1 2 3
i j k l 1 2 3
e f g h 4 5 6
m n o p 1 2 3
This might work for you (GNU sed):
sed -n '/e/{s/1 2 3/4 5 6/;s#.*#/e/d;/k/s/.*/\&\\n&/#p};' file | sed -f - file
This designs a sed script on the fly: the file is passed twice, and the sed instructions generated by the first pass are applied to the second.
Another solution is to use ed:
cat <<\! | ed file
/e/s/1 2 3/4 5 6/
/e/m/k/
wq
!
Or if you prefer:
<<<$'/e/s/1 2 3/4 5 6/\n.m/k/\nwq' ed -s file

Sed command issue

I have this file inside a MariaDB database that looks like this:
name callerid secret context type host
1000 Omar Al-Ani <1000> op1000DIR MANAGEMENT friend dynamic
1001 Ammar Zigderly <1001> 1001 MANAGEMENT peer dynamic
1002 Lubna COO Office <1002> 1002 ELdefault peer dynamic
I want to convert it, using sed and awk, to the following format:
[1000]
callerid=Omar Al-Ani <1000>
secret=op1000DIR
context=MANAGEMENT
type=friend
host=dynamic
[1001]
callerid=Ammar Zigderly <1001>
secret=1001
context=MANAGEMENT
type=peer
host=dynamic
[1002]
callerid=Lubna COO Office <1002>
secret=1002
context=ELdefault
type=peer
host=dynamic
This is the output of the command head -3 filename | od -c on the input file:
0000000 n a m e \t c a l l e r i d \t s e
0000020 c r e t \t c o n t e x t \t t y p
0000040 e \t h o s t \n 1 0 0 0 \t O m a
0000060 r A l - A n i < 1 0 0 0 >
0000100 \t o p 1 0 0 0 D I R \t M A N A
0000120 G E M E N T \t f r i e n d \t d y
0000140 n a m i c \n 1 0 0 1 \t A m m
0000160 a r Z i g d e r l y < 1 0 0
0000200 1 > \t 1 0 0 1 \t M A N A G E
0000220 M E N T \t p e e r \t d y n a m i
0000240 c \n
0000243
Any idea would be helpful!
I think awk is going to be a bit simpler and easier (?) to modify if requirements change:
awk -F'\t' '
BEGIN { labels[2]="callerid"
labels[3]="secret"
labels[4]="context"
labels[5]="type"
labels[6]="host"
}
FNR>1 { gsub(/ /,"",$1) # remove spaces from 1st column
printf "[%s]\n",$1
for (i=2;i<=6;i++)
printf "\t%s=%s\n", labels[i],$i
print ""
}
' names.dat
This generates:
[1000]
callerid=Omar Al-Ani <1000>
secret=op1000DIR
context=MANAGEMENT
type=friend
host=dynamic
[1001]
callerid=Ammar Zigderly <1001>
secret=1001
context=MANAGEMENT
type=peer
host=dynamic
[1002]
callerid=Lubna COO Office <1002>
secret=1002
context=ELdefault
type=peer
host=dynamic
Assuming tab-separated fields:
$ awk -F'\t' 'NR==1 {split($0,h); next}
{print "[" $1 "]";
for(i=2;i<=NF;i++) print "\t" h[i] ":" $i}' file.tcv
[1000]
callerid:Omar Al-Ani <1000>
secret:op1000DIR
context:MANAGEMENT
type:friend
host:dynamic
[1001]
callerid:Ammar Zigderly <1001>
secret:1001
context:MANAGEMENT
type:peer
host:dynamic
[1002]
callerid:Lubna COO Office <1002>
secret:1002
context:ELdefault
type:peer
host:dynamic
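Combining the header-driven loop with the requested key=value separator gives the following sketch (it assumes tab-separated input with the header row supplying the labels, as the od output shows):

```shell
awk -F'\t' '
NR == 1 { split($0, h); next }    # capture the header labels
{
    print "[" $1 "]"
    for (i = 2; i <= NF; i++)
        print h[i] "=" $i         # key=value, as in the desired output
    print ""                      # blank line between records
}' file
```

Because the labels come from the header row itself, nothing needs editing if columns are added or renamed.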

Using awk to count rows within a range

I have a data set: (file.txt)
X Y
1 a
2 b
3 c
10 d
11 e
12 f
15 g
20 h
25 i
30 j
35 k
40 l
41 m
42 n
43 o
46 p
I want to add two columns, Up10 and Down10:
Up10: the count of rows whose X value lies between (X-10) and (X).
Down10: the count of rows whose X value lies between (X) and (X+10).
For example:
X Y Up10 Down10
35 k 3 5
For Up10 (35-10): X=35, X=30, X=25; total = 3 rows.
For Down10 (35+10): X=35, X=40, X=41, X=42, X=43; total = 5 rows.
I have tried the following, but I can't produce the 3rd and 4th columns:
awk 'BEGIN{ FS=OFS="\t" }
NR==FNR{
a[$1]+=$3
next
}
{ $(NF+10)=a[$3] }
{ $(NF-10)=a[$4] }
1
' file.txt file.txt > file-2.txt
Desired Output:
X Y Up10 Down10
1 a 1 5
2 b 2 5
3 c 3 4
10 d 4 5
11 e 5 4
12 f 5 3
15 g 4 3
20 h 5 3
25 i 3 3
30 j 3 3
35 k 3 5
40 l 3 5
41 m 3 4
42 n 4 3
43 o 5 2
46 p 5 1
This is Pierre François' solution. Thanks again @Pierre François:
awk '
BEGIN{OFS="\t"; print "X\tY\tUp10\tDown10"}
(NR == FNR) && (FNR > 1){a[$1] = $1 + 0}
(NR > FNR) && (FNR > 1){
up = 0; upl = $1 - 10
down = 0; downl = $1 + 10
for (i in a) { i += 0 # tricky: convert i to integer
if ((i >= upl) && (i <= $1)) {up++}
if ((i >= $1) && (i <= downl)) {down++}
}
print $1, $2, up, down;
}
' file.txt file.txt > file-2.txt
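As a sanity check, the script above reproduces the desired table when run against the sample data. Note that the for (i in a) loop scans every stored X value for each row, which is quadratic but fine for small files. A shell sketch, assuming the sample is saved as file.txt:

```shell
# Pass the file twice: the first pass collects the X values,
# the second pass counts neighbours within +/-10 of each X.
cat > file.txt <<'EOF'
X Y
1 a
2 b
3 c
10 d
11 e
12 f
15 g
20 h
25 i
30 j
35 k
40 l
41 m
42 n
43 o
46 p
EOF
awk '
BEGIN{OFS="\t"; print "X\tY\tUp10\tDown10"}
(NR == FNR) && (FNR > 1){a[$1] = $1 + 0}
(NR > FNR) && (FNR > 1){
    up = 0; upl = $1 - 10
    down = 0; downl = $1 + 10
    for (i in a) { i += 0
        if ((i >= upl) && (i <= $1)) {up++}
        if ((i >= $1) && (i <= downl)) {down++}
    }
    print $1, $2, up, down
}
' file.txt file.txt > file-2.txt
```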

Extracting lines with a pivot column

Infile:
S 235 1365 * 0 * * * 15 1 c81 592
H 235 296 99.7 + 0 0 3I296M1066I 14 1 s15018 1
H 235 719 95.4 + 0 0 174D545M820I 15 1 c2664 10
H 235 764 99.1 + 0 0 55I764M546I 15 1 c6519 4
H 235 792 100 + 0 0 180I792M393I 14 1 c407 107
S 236 1365 * 0 * * * 15 1 c474 152
H 236 279 95 + 0 0 765I279M321I 10-1 1 s7689 1
H 236 301 99.7 - 0 0 908I301M156I 15 1 s8443 1
H 236 563 95.2 - 0 0 728I563M74I 17 1 c1725 12
H 236 97 97.9 - 0 0 732I97M536I 17 1 s11472 1
S 237 1365 * 0 * * * 15 1 c474 152
H 237 279 95 + 0 0 765I279M321I 15 1 s7689 1
S 238 1365 * 0 * * * 12 1 c474 152
H 238 279 95 + 0 0 765I279M321I 10-1 1 s7689 1
H 238 301 99.7 - 0 0 908I301M156I 15 1 s8443 1
H 238 563 95.2 - 0 0 728I563M74I 17 1 c1725 12
H 238 97 97.9 - 0 0 732I97M536I 17 1 s11472 1
The outfile I want is shown below.
Example 1, specifying the ninth-column values "10-1", "15", and "17":
S 236 1365 * 0 * * * 15 1 c474 152
H 236 279 95 + 0 0 765I279M321I 10-1 1 s7689 1
H 236 301 99.7 - 0 0 908I301M156I 15 1 s8443 1
H 236 563 95.2 - 0 0 728I563M74I 17 1 c1725 12
H 236 97 97.9 - 0 0 732I97M536I 17 1 s11472 1
Example 2, specifying the ninth-column values "14" and "15":
S 235 1365 * 0 * * * 15 1 c81 592
H 235 296 99.7 + 0 0 3I296M1066I 14 1 s15018 1
H 235 719 95.4 + 0 0 174D545M820I 15 1 c2664 10
H 235 764 99.1 + 0 0 55I764M546I 15 1 c6519 4
H 235 792 100 + 0 0 180I792M393I 14 1 c407 107
Example 3, specifying the ninth-column value "15":
S 237 1365 * 0 * * * 15 1 c474 152
H 237 279 95 + 0 0 765I279M321I 15 1 s7689 1
So I would like to extract sets of lines that share the same value in the second column, keeping only sets whose 9th-column values are among the specified ones; each extracted set must contain all of the specified values.
The set 238 has "12" in the ninth column, which is not among the specified values, so I do not want it to be extracted.
This question is very similar to this question.
Extracting lines using two criteria
There are many possible approaches, but IMHO the most robust and the easiest to expand upon later is to create a hash table of the desired values (goodVals[] below) and then just test whether the current $9 is a value that's not in that table:
BEGIN { split("10-1 15 17",tmp); for (i in tmp) goodVals[tmp[i]] }
$2 != prevPivot { prtCurrSet() }
!($9 in goodVals) { isBadSet=1 }
{ currSet = currSet $0 ORS; prevPivot = $2 }
END { prtCurrSet() }
function prtCurrSet() {
if ( !isBadSet ) {
printf "%s", currSet
}
currSet = ""
isBadSet = 0
}
Given the new requirement from your comment, here's a solution for one possible interpretation of that requirement:
$ cat tst.awk
BEGIN { split("10-1 15 17",tmp); for (i in tmp) goodVals[tmp[i]] }
$2 != prevPivot { prtCurrSet() }
{ seen[$9]; currSet = currSet $0 ORS; prevPivot = $2 }
END { prtCurrSet() }
function prtCurrSet( val,allGoodPresent) {
allGoodPresent = 1
for (val in goodVals) {
if ( !(val in seen) ) {
allGoodPresent = 0
}
}
if ( allGoodPresent ) {
printf "%s", currSet
}
currSet = ""
delete seen
}
$ awk -f tst.awk file
S 236 1365 * 0 * * * 15 1 c474 152
H 236 279 95 + 0 0 765I279M321I 10-1 1 s7689 1
H 236 301 99.7 - 0 0 908I301M156I 15 1 s8443 1
H 236 563 95.2 - 0 0 728I563M74I 17 1 c1725 12
H 236 97 97.9 - 0 0 732I97M536I 17 1 s11472 1
and here's another:
$ cat tst.awk
BEGIN { split("10-1 15 17",tmp); for (i in tmp) goodVals[tmp[i]] }
$2 != prevPivot { prtCurrSet() }
{ seen[$9]; currSet = currSet $0 ORS; prevPivot = $2 }
END { prtCurrSet() }
function prtCurrSet( val,allGoodPresent,someBadPresent) {
allGoodPresent = 1
for (val in goodVals) {
if ( !(val in seen) ) {
allGoodPresent = 0
}
delete seen[val]
}
someBadPresent = length(seen)
if ( allGoodPresent && !someBadPresent ) {
printf "%s", currSet
}
currSet = ""
delete seen
}
$ awk -f tst.awk file
S 236 1365 * 0 * * * 15 1 c474 152
H 236 279 95 + 0 0 765I279M321I 10-1 1 s7689 1
H 236 301 99.7 - 0 0 908I301M156I 15 1 s8443 1
H 236 563 95.2 - 0 0 728I563M74I 17 1 c1725 12
H 236 97 97.9 - 0 0 732I97M536I 17 1 s11472 1
Unfortunately your posted sample input/output isn't adequate to test the differences.
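Note that length(seen) on an array and delete seen on a whole array are extensions (GNU awk supports both). For stricter POSIX portability, here is a sketch of the same set logic using a flag and a per-element delete loop:

```shell
awk '
BEGIN { split("10-1 15 17", tmp); for (i in tmp) goodVals[tmp[i]] }
$2 != prevPivot { prtCurrSet() }
{
    seen[$9]
    if (!($9 in goodVals)) hasBad = 1    # remember any unspecified value
    currSet = currSet $0 ORS
    prevPivot = $2
}
END { prtCurrSet() }
function prtCurrSet(   val, allGood) {
    allGood = 1
    for (val in goodVals)
        if (!(val in seen)) allGood = 0  # a required value is missing
    if (allGood && !hasBad) printf "%s", currSet
    currSet = ""; hasBad = 0
    for (val in seen) delete seen[val]   # portable array reset
}
' file
```

Against the posted infile this prints only the 236 set: 235 and 238 contain unspecified values (14 and 12), and 237 lacks "10-1" and "17".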

Extract columns with awk

I have some text files as follows
293 800 J A 0 0 162
294 801 J R - 0 0 67
295 802 J P - 0 0 56
298 805 J G S S- 0 0 22
313 820 J R T 4 S- 0 0 152
I would like to print column 4 if column 5 is empty.
desired output
>filename
ARP
I used the following code. But this code prints only the filenames.
awk '{
if (FNR == 1 ) print ">" FILENAME
if ($5 == "") {
printf $4
}
}
END { printf "\n"}' *.txt
Here's one way using GNU awk:
awk 'BEGIN { FIELDWIDTHS="5 4 2 3 3 2 7 4 3" } FNR==1 { print ">" FILENAME } $5 == " " { sub(/ $/, "", $4); printf "%s", $4 } END { printf "\n" }' file.txt
Result:
>file.txt
ARP
This is not an elegant solution by any means and it is specific to this file.
You can do something like this
cut -c1-15 yourtext | awk '$5 {print $4}'
where 15 is the number of characters including column 5.
I strongly agree with steve's suggestion to use a better alternative for your files, or at least to put a dummy/error value in the blank columns instead of leaving them empty.
awk '{if(substr($0,15,1)~/ /)printf("%s",$4);}' your_file
tested below:
> cat temp
293 800 J A 0 0 162
294 801 J R - 0 0 67
295 802 J P - 0 0 56
298 805 J G S S- 0 0 22
313 820 J R T 4 S- 0 0 152
> awk '{if(substr($0,15,1)~/ /)printf("%s",$4);}' temp
ARP>
This is a starting point assuming the variations in column numbers stay the same.
awk '$5 !="" && NF<=8 {printf "%s", $4} END {printf "\n"}' data.txt
yields
ARP
you can graft on the parts to display the filename.
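Building on the substr() idea above, here is a sketch that also prints the filename header and the trailing newline the question asks for. The character position 15 is an assumption carried over from the answer above and depends on the file's actual fixed-width layout:

```shell
awk '
FNR == 1 { print ">" FILENAME }
substr($0, 15, 1) == " " { printf "%s", $4 }  # char 15 assumed to be where column 5 sits
END { printf "\n" }
' file.txt
```

Testing on a fixed character position rather than on $5 avoids the original problem: with whitespace splitting, a blank column simply shifts the later fields left, so $5 is never empty.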