calculate the number of atoms in the PDB file - awk

I would like to calculate the number of atoms for each residue in PDB files. A PDB file looks as follows. The third column denotes the atom name and the fourth column the residue name.
ATOM 1 N ASN A 380 -0.011 22.902 -13.714 1.00 65.81 N
ATOM 2 CA ASN A 380 0.401 23.938 -12.714 1.00 65.53 C
ATOM 3 C ASN A 380 1.926 24.019 -12.595 1.00 64.48 C
ATOM 9 N THR A 381 2.553 24.693 -13.562 1.00 61.65 N
ATOM 10 CA THR A 381 4.006 24.848 -13.609 1.00 58.60 C
ATOM 16 N ILE A 382 5.156 22.716 -13.481 1.00 53.48 N
ATOM 17 CA ILE A 382 5.808 21.571 -12.830 1.00 49.47 C
ATOM 18 C ILE A 382 6.645 21.933 -11.584 1.00 45.24 C
ATOM 28 CB GLN A 383 8.735 24.763 -10.759 1.00 30.19 C
ATOM 29 CG GLN A 383 10.140 24.257 -11.037 1.00 29.17 C
ATOM 30 CD ASN A 384 10.397 23.975 -12.514 1.00 29.51 C
ATOM 31 OE1 ASN A 384 10.892 24.838 -13.237 1.00 30.67 O
I would like to get the output as follows:
Total no:of ASN atoms - 5
Total no:of THR atoms - 2
Total no:of ILE atoms - 3
Total no:of GLN atoms - 2

This should do the job:
awk '{print $4}' <file> | sort | uniq -c | \
awk '{print "Total no:of", $2, "atoms -", $1}'
Or pure awk:
awk '{atom[$4]++;}
END{for (i in atom) {print "Total no:of", i, "atoms -", atom[i]} }' <file>
Output for both methods:
Total no:of GLN atoms - 2
Total no:of THR atoms - 2
Total no:of ASN atoms - 5
Total no:of ILE atoms - 3
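Real PDB files usually contain HETATM, TER and other record types as well. Assuming the same whitespace-separated layout as the sample above, a sketch that restricts the count to ATOM records would be:
awk '$1=="ATOM" {atom[$4]++}    # count only ATOM records, keyed by residue name
END{for (i in atom) {print "Total no:of", i, "atoms -", atom[i]} }' <file>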

Related

handling text file in awk and making a new file

I have a text file like this small example:
chr10:102721669-102724893 3217 3218 5
chr10:102721669-102724893 3218 3219 1
chr10:102721669-102724893 3219 3220 5
chr10:102721669-102724893 421 422 1
chr10:102721669-102724893 858 859 2
chr10:102539319-102568941 13921 13922 1
chr10:102587299-102589074 1560 1561 1
chr10:102587299-102589074 1565 1566 1
chr10:102587299-102589074 1595 1596 1
chr10:102587299-102589074 944 945 1
the expected output would look like this:
chr10:102721669-102724893 3217 3218 5 CA
chr10:102721669-102724893 3218 3219 1 CA
chr10:102721669-102724893 3219 3220 5 CA
chr10:102721669-102724893 421 422 1 BA
chr10:102721669-102724893 858 859 2 BA
chr10:102539319-102568941 13921 13922 1 NON
chr10:102587299-102589074 1560 1561 1 CA
chr10:102587299-102589074 1565 1566 1 CA
chr10:102587299-102589074 1595 1596 1 CA
chr10:102587299-102589074 944 945 1 BA
The input has 4 tab-separated columns, and in the output I have one more column with 3 different classes (CA, NON or BA).
1- if the 1st column of the input is not repeated, the line will be classified as NON in the 5th column of the output.
2- if (the number just after ":" (in the 1st column) + the 2nd column) - the number just after "-" (in the 1st column) is smaller than -30 (meaning -31 or smaller), that line will be classified as BA. For example, in the last line:
(102587299 + 944) - 102589074 = -831 , so this line is classified as BA.
3- if (the number just after ":" (in the 1st column) + the 2nd column) - the number just after "-" (in the 1st column) is equal to or bigger than -30 (meaning -30 or -29), that line will be classified as CA. For example, the 1st line:
(102721669 + 3217) - 102724893 = -7
I am trying to do that in awk.
awk -F "\t"":""-" '{if($2+$4-$3 < -30) ; print $7 = BA, if($2+$4-$3 >= -30) ; print $7 = CA}' file.txt > out.txt
but it does not return what I expect. Do you know how to fix it?
Try
$ awk 'BEGIN{FS=OFS="\t"} NR==FNR{a[$1]++; next}
{ split($1, b, /[\t:-]/);
$5 = a[$1]==1 ? "NON" : (b[2]+$2-b[3]) < -30 ? "BA" : "CA" }
1' file.txt file.txt
chr10:102721669-102724893 3217 3218 5 CA
chr10:102721669-102724893 3218 3219 1 CA
chr10:102721669-102724893 3219 3220 5 CA
chr10:102721669-102724893 421 422 1 BA
chr10:102721669-102724893 858 859 2 BA
chr10:102539319-102568941 13921 13922 1 NON
chr10:102587299-102589074 1560 1561 1 BA
chr10:102587299-102589074 1565 1566 1 BA
chr10:102587299-102589074 1595 1596 1 BA
chr10:102587299-102589074 944 945 1 BA
BEGIN{FS=OFS="\t"} sets both the input and output field separators to tab.
NR==FNR{a[$1]++; next} counts how many times the first field is present in the file. The input file is passed twice, so that on the second pass we can make a decision based on the count.
split($1, b, /[\t:-]/) splits the first column further; the results are saved in the b array.
The rest of the code assigns the 5th field depending on the given conditions and prints the modified line.
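If you would rather not read the file twice, a single-pass sketch (my own variant, not from the answer above, and assuming the whole file fits in memory) buffers the lines and classifies them in the END block:
awk 'BEGIN{FS=OFS="\t"}
{ a[$1]++; line[NR] = $0 }                 # count first fields and buffer each line
END{ for (n = 1; n <= NR; n++) {
       $0 = line[n]                        # re-split the buffered line into fields
       split($1, b, /[:-]/)                # b[2] = number after ":", b[3] = number after "-"
       $5 = a[$1]==1 ? "NON" : (b[2]+$2-b[3]) < -30 ? "BA" : "CA"
       print } }' file.txt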
Further reading
Idiomatic awk
split function

Finding the mean values of a field using awk?

This is what I am trying to do:
find the mean values for x, y, z for the HETATM records; the x values are in the 7th field, the y values in the 8th field, and the z values in the 9th field.
I am trying to do this using this file http://pastebin.com/EqA2SUMy
Here is a sample:
HETATM 1756 O HOH A 501 -0.923 10.560 127.393 1.00 16.58 O
HETATM 1757 O HOH A 502 9.272 22.148 134.167 1.00 15.08 O
HETATM 1758 O HOH A 503 0.109 20.243 112.094 1.00 20.74 O
HETATM 1759 O HOH A 504 -3.930 10.522 125.779 1.00 20.79 O
HETATM 1760 O HOH A 505 -0.759 36.323 88.018 1.00 17.42 O
HETATM 1761 O HOH A 506 -4.645 51.936 81.852 1.00 21.43 O
HETATM 1762 O HOH A 507 -3.900 17.103 128.596 1.00 14.08 O
HETATM 1763 O HOH A 508 6.834 21.053 135.062 1.00 16.98 O
Can anyone show me how to write a script for this?
(this part is related to a comment; viewers can ignore it)
ATOM 214 OE2 GLU A 460 -2.959 24.000 103.360 1.00 32.19 O
ATOM 215 N ARG A 461 -5.878 28.748 106.473 1.00 22.68 N
ATOM 216 CA ARG A 461 -6.553 30.043 106.524 1.00 24.34 C
ATOM 217 C ARG A 461 -5.583 31.176 106.219 1.00 22.42 C
ATOM 218 O ARG A 461 -5.918 32.121 105.497 1.00 25.07 O
ATOM 219 CB ARG A 461 -7.222 30.272 107.887 1.00 24.53 C
ATOM 220 CG ARG A 461 -8.425 29.394 108.150 1.00 26.38
$ awk '{for (i=1;i<=3;i++) sum[i]+=$(i+6)}
END{if (NR) for (i=1;i in sum;i++) print sum[i]/NR}' file
0.25725
23.736
116.62
The if (NR) is necessary to avoid a divide by zero error on an empty file.
If @jaypal is correct and you need to select just the input lines containing HETATM, then change it to:
awk '/HETATM/{++nr; for (i=1;i<=3;i++) sum[i]+=$(i+6)}
END{if (nr) for (i=1;i in sum;i++) print sum[i]/nr}' file
It's not rocket science. (Updated to catch only HETATM records, a trivial change; you can use more exacting regexes if you need to. However, it is also necessary to count the records that match and use that count, not NR, since you're ignoring many records in general.)
awk '/HETATM/ { sum7 += $7; sum8 += $8; sum9 += $9; count++ }
END { if (count > 0)
printf("avg(x) = %f, avg(y) = %f, avg(z) = %f\n",
sum7/count, sum8/count, sum9/count)
}'
And yes, you could put it all on one line but it wouldn't be as readable.
I can't answer for why it produced zeros for you; when run on the data from the question, wobbly line starts and all, it produced the output:
avg(x) = 0.257250, avg(y) = 23.736000, avg(z) = 116.620125
If you think there is a possibility of empty input (or, at least, no HETATM records in the input) and an error message is not acceptable, then you can protect the printing action with if (count > 0) or equivalent (added to the script). You can generate your own preferred output if count is zero.
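Note that /HETATM/ matches the string anywhere on a line. If there is any chance of it occurring outside the record-type field, a stricter sketch that compares the first field exactly would be:
awk '$1=="HETATM" { sum7 += $7; sum8 += $8; sum9 += $9; count++ }   # only records whose first field is HETATM
END { if (count > 0)
        printf("avg(x) = %f, avg(y) = %f, avg(z) = %f\n",
               sum7/count, sum8/count, sum9/count)
}'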

Transpose a column to a line after detecting a pattern

I have this text file format:
01 contig00041 1 878 + YP_003990830.1 metalloendopeptidase, glycoprotease family Geobacillus sp. Y4.1MC1 100.00 291 1 291 47 337 0.0 592 #line 1
01 contig00041 1241 3117 - YP_002948419.1 ABC transporter Geobacillus sp. WCH70 84.94 #line 2
37.31 624 #line 3
260 1 #line 4
321 624 #line 5
532 23 #line 6
12 644 #line 7
270 0.0 #line 8
3e-37 1046 #line 9
154 #line 10
I have to detect a line containing 8 columns (line 2), and transpose the second column of the following seven lines (lines 3 - 9) to the end of the 8-column line. And finally, exclude line 10. This pattern repeats along a large text file, but it is not frequent (30 times, in a file of 2000 lines). Is it possible to do this using awk?
The edited text file must look like the following text:
01 contig00041 1 878 + YP_003990830.1 metalloendopeptidase, glycoprotease family Geobacillus sp. Y4.1MC1 100.00 291 1 291 47 337 0.0 592 #line 1
01 contig00041 1241 3117 - YP_002948419.1 ABC transporter Geobacillus sp. WCH70 84.94 624 1 624 23 644 0.0 1046 #line 2
Thank you very much in advance.
awk 'NF == 12 { t = $0                            # start with the 12-field line
                for (i = 1; i <= 7; ++i) {        # read the next seven lines
                    r = getline
                    if (r < 1) break              # stop on EOF or read error
                    t = t "\t" $2 }               # append their 2nd column
                print t; next }
     NF > 12' temp.txt
Output:
01 contig00041 1 878 + YP_003990830.1 metalloendopeptidase, glycoprotease family Geobacillus sp. Y4.1MC1 100.00 291 1 291 47 337 0.0 592
01 contig00041 1241 3117 - YP_002948419.1 ABC transporter Geobacillus sp. WCH70 84.94 624 1 624 23 644 0.0 1046
It automatically prints lines having more than 12 fields.
When it detects a line with exactly 12 fields, it concatenates the second column of the following 7 lines to it and prints the result.
Any other line is ignored.
Edited to add only the second column of the lines with two columns.
I think this does what you want:
awk 'NF >= 8 { a[++i] = $0 } NF == 2 { a[i] = a[i] " " $2 } END { for (j = 1; j <= i; ++j) print a[j] }' file
For lines with at least 8 columns, add a new element to the array a. If the line has 2 columns, append its second field to the current array element. Once the whole file has been processed, go through the array and print all of the lines.
Output:
01 contig00041 1 878 + YP_003990830.1 metalloendopeptidase, glycoprotease family Geobacillus sp. Y4.1MC1 100.00 291 1 291 47 337 0.0 592
01 contig00041 1241 3117 - YP_002948419.1 ABC transporter Geobacillus sp. WCH70 84.94 624 1 624 23 644 0.0 1046
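If buffering the whole file in an array is a concern for larger inputs, a streaming sketch using the same NF-based tests would print each assembled record as soon as the next wide line arrives:
awk 'NF >= 8 { if (buf != "") print buf    # a wide line starts a new record; flush the previous one
               buf = $0; next }
     NF == 2 { buf = buf " " $2 }          # append the 2nd field of 2-column lines
     END { if (buf != "") print buf }' file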

Can't cut column in Linux

I have a file like this
ATOM 3197 HD13 ILE 206 9.900 15.310 13.450 0.0196 1.4870
ATOM 3198 C ILE 206 10.870 16.560 17.500 0.8343 1.9080
ATOM 3199 OXT ILE 206 11.780 15.734 17.425 -0.8190 1.6612
ATOM 3200 O ILE 206 9.929 16.225 18.095 -0.8190 1.6612
I want to cut out the second column, however when I use
cut -f1,3,4,5,6,7,8,9,10 filename
it doesn't work. Am I doing something wrong?
This is because there are multiple spaces in a row, and cut treats every single space as a field delimiter, producing empty fields between consecutive spaces.
You can start from the 5th position:
$ cut -d' ' -f 1,5- file
ATOM HD13 ILE 206 9.900 15.310 13.450 0.0196 1.4870
ATOM C ILE 206 10.870 16.560 17.500 0.8343 1.9080
ATOM OXT ILE 206 11.780 15.734 17.425 -0.8190 1.6612
ATOM O ILE 206 9.929 16.225 18.095 -0.8190 1.6612
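To see why the field numbers shift, here is a quick demonstration (GNU cut, with a hypothetical three-space input) of how consecutive delimiters produce empty fields:
$ printf 'ATOM   3197\n' | cut -d' ' -f1-4 --output-delimiter='|'
ATOM|||3197
Fields 2 and 3 are the empty strings between the spaces, which is why 3197 only shows up as field 4.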
Or squeeze spaces with tr -s like below (multiple spaces will be lost, though):
$ tr -s ' ' < file | cut -d' ' -f1,3,4,5,6,7,8,9,10
ATOM HD13 ILE 206 9.900 15.310 13.450 0.0196 1.4870
ATOM C ILE 206 10.870 16.560 17.500 0.8343 1.9080
ATOM OXT ILE 206 11.780 15.734 17.425 -0.8190 1.6612
ATOM O ILE 206 9.929 16.225 18.095 -0.8190 1.6612
Note that you can indicate from field 3 to the end with 3-:
tr -s ' ' < file | cut -d' ' -f1,3-
In fact I would use awk for this:
awk '{$2=""; print}' file
or just
awk '{$2=""} 1' file
There are multiple consecutive spaces in your file, so you have to account for them when counting fields.
The file new.txt contains:
ATOM 3197 HD13 ILE 206 9.900 15.310 13.450 0.0196 1.4870
ATOM 3198 C ILE 206 10.870 16.560 17.500 0.8343 1.9080
ATOM 3199 OXT ILE 206 11.780 15.734 17.425 -0.8190 1.6612
ATOM 3200 O ILE 206 9.929 16.225 18.095 -0.8190 1.6612
and this is the command to print the second column:
root@52:/home/ubuntu# cut -d' ' -f4 new.txt
3197
3198
3199
3200
where -d stands for delimiter, i.e. a space in this case, denoted by ' '.
However, awk comes in pretty handy in such cases:
awk '{print $2}' new.txt
You can find the position of that column's content in the first row (3197) and then extract the string at the same position in every row with awk:
awk -v field="3197" 'NR==1 {c = index($0,field)} {print substr($0,c,length(field))}' filename
source: https://unix.stackexchange.com/a/491770/20661

using grep in a pdb file

I have a PDB file; in short, it looks a bit like this:
ATOM 1189 CA ILE A 172 4.067 0.764 -48.818 1.00 19.53 C
ATOM 1197 CA ATHR A 173 7.121 3.051 -48.711 0.50 17.77 C
ATOM 1198 CA BTHR A 173 7.198 2.978 -48.704 0.50 16.94 C
ATOM 1208 CA ALA A 174 7.797 2.124 -52.350 1.00 16.85 C
ATOM 1213 CA LEU A 175 4.431 3.707 -53.288 1.00 16.47 C
ATOM 1221 CA VAL A 176 4.498 6.885 -51.185 1.00 13.92 C
ATOM 1228 CA ARG A 177 6.418 10.059 -51.947 1.00 20.28 C
ATOM 1241 CA GLN B 23 -15.516 -2.515 13.305 1.00 32.36 C
ATOM 1250 CA ASP B 24 -12.740 -2.653 10.715 1.00 22.25 C
ATOM 1258 CA PHE B 25 -12.476 -2.459 6.886 1.00 19.17 C
ATOM 1269 CA TYR B 26 -12.886 -6.243 6.470 1.00 14.87 C
ATOM 1281 CA ASP B 27 -16.276 -6.196 8.222 1.00 18.01 C
ATOM 1289 CA PHE B 28 -17.998 -4.432 5.309 1.00 15.39 C
ATOM 1300 CA LYS B 29 -19.636 -5.878 2.191 1.00 14.46 C
ATOM 1309 CA ALA B 30 -19.587 -4.640 -1.378 1.00 15.26 C
ATOM 1314 CA VAL B 31 -21.000 -5.566 -4.753 1.00 16.26 C
What I want to do is get rid of the B's and keep the A's, and then get rid of everything but the 6th column.
grep ^ATOM 2p31protein.pdb | grep ' CA ' | grep ' A ' | cut -c23-27
This is what I have tried; it gets everything with ATOM and CA, which is what I want, and extracts the column I want, but it does not get rid of the B's.
This is more suited to awk:
$ awk '$1=="ATOM"&&$3=="CA"&&$5=="A"{print $6}' file
172
173
173
174
175
176
177
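The field-based tests work on the sample shown, but real PDB files are fixed-width and fields can run together. If your file follows the standard PDB column layout (atom name in columns 13-16, chain ID in column 22, residue number in columns 23-26), a sketch using substr would be:
awk '/^ATOM/ && substr($0,13,4) == " CA " && substr($0,22,1) == "A" {   # assumes the usual " CA " padding for alpha carbons
       print substr($0,23,4) + 0 }' your.pdb                            # +0 strips the leading spaces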