using grep in a pdb file - awk

I have a PDB file; in short, it looks a bit like this:
ATOM 1189 CA ILE A 172 4.067 0.764 -48.818 1.00 19.53 C
ATOM 1197 CA ATHR A 173 7.121 3.051 -48.711 0.50 17.77 C
ATOM 1198 CA BTHR A 173 7.198 2.978 -48.704 0.50 16.94 C
ATOM 1208 CA ALA A 174 7.797 2.124 -52.350 1.00 16.85 C
ATOM 1213 CA LEU A 175 4.431 3.707 -53.288 1.00 16.47 C
ATOM 1221 CA VAL A 176 4.498 6.885 -51.185 1.00 13.92 C
ATOM 1228 CA ARG A 177 6.418 10.059 -51.947 1.00 20.28 C
ATOM 1241 CA GLN B 23 -15.516 -2.515 13.305 1.00 32.36 C
ATOM 1250 CA ASP B 24 -12.740 -2.653 10.715 1.00 22.25 C
ATOM 1258 CA PHE B 25 -12.476 -2.459 6.886 1.00 19.17 C
ATOM 1269 CA TYR B 26 -12.886 -6.243 6.470 1.00 14.87 C
ATOM 1281 CA ASP B 27 -16.276 -6.196 8.222 1.00 18.01 C
ATOM 1289 CA PHE B 28 -17.998 -4.432 5.309 1.00 15.39 C
ATOM 1300 CA LYS B 29 -19.636 -5.878 2.191 1.00 14.46 C
ATOM 1309 CA ALA B 30 -19.587 -4.640 -1.378 1.00 15.26 C
ATOM 1314 CA VAL B 31 -21.000 -5.566 -4.753 1.00 16.26 C
what I want to do is get rid of the B's and keep the A's, and then get rid of everything but the 6th column
grep ^ATOM 2p31protein.pdb | grep ' CA ' | grep ' A ' | cut -c23-27
this is what I have tried; it gets everything with ATOM and CA, which is what I want, and extracts the field I want, but it does not get rid of the B's

This is more suited to awk:
$ awk '$1=="ATOM"&&$3=="CA"&&$5=="A"{print $6}' file
172
173
173
174
175
176
177

with awk you can do it more easily:
awk '$1=="ATOM" && $3=="CA" && $5=="A"{print $6}' your.pdb
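If you also want to drop the B alternate locations (the BTHR copy of residue 173), a small extension of the same filter works. This is a sketch only, and it assumes, as in the sample above, that the altloc letter is fused onto the residue name in field 4; real PDB files are fixed-width and store the altloc in its own column, so a whitespace-split approach like this only fits data shaped like the sample:

```shell
# Sample data as shown in the question (whitespace-separated).
cat > sample.pdb <<'EOF'
ATOM 1189 CA ILE A 172 4.067 0.764 -48.818 1.00 19.53 C
ATOM 1197 CA ATHR A 173 7.121 3.051 -48.711 0.50 17.77 C
ATOM 1198 CA BTHR A 173 7.198 2.978 -48.704 0.50 16.94 C
ATOM 1208 CA ALA A 174 7.797 2.124 -52.350 1.00 16.85 C
EOF

# $4 !~ /^B[A-Z][A-Z][A-Z]$/ rejects a 4-letter residue name starting
# with B, i.e. a B alternate location such as BTHR.
awk '$1=="ATOM" && $3=="CA" && $5=="A" && $4 !~ /^B[A-Z][A-Z][A-Z]$/ {print $6}' sample.pdb
# prints 172, 173, 174 (the duplicate B copy of 173 is gone)
```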

Related

handling text file in awk and making a new file

I have a text file like this small example:
chr10:102721669-102724893 3217 3218 5
chr10:102721669-102724893 3218 3219 1
chr10:102721669-102724893 3219 3220 5
chr10:102721669-102724893 421 422 1
chr10:102721669-102724893 858 859 2
chr10:102539319-102568941 13921 13922 1
chr10:102587299-102589074 1560 1561 1
chr10:102587299-102589074 1565 1566 1
chr10:102587299-102589074 1595 1596 1
chr10:102587299-102589074 944 945 1
the expected output would look like this:
chr10:102721669-102724893 3217 3218 5 CA
chr10:102721669-102724893 3218 3219 1 CA
chr10:102721669-102724893 3219 3220 5 CA
chr10:102721669-102724893 421 422 1 BA
chr10:102721669-102724893 858 859 2 BA
chr10:102539319-102568941 13921 13922 1 NON
chr10:102587299-102589074 1560 1561 1 CA
chr10:102587299-102589074 1565 1566 1 CA
chr10:102587299-102589074 1595 1596 1 CA
chr10:102587299-102589074 944 945 1 BA
the input has 4 tab-separated columns, and in the output I have one more column with 3 different classes (CA, NON, or BA).
1- if the value in the 1st column is not repeated, the line will be classified as NON in the 5th column of the output.
2- if (the number just after ":" in the 1st column + the 2nd column) - (the number just after "-" in the 1st column) is smaller than -30 (meaning -31 or smaller), that line will be classified as BA. For example, in the last line:
(102587299 + 944) - 102589074 = -831, so this line is classified as BA.
3- if (the number just after ":" in the 1st column + the 2nd column) - (the number just after "-" in the 1st column) is equal to or bigger than -30 (meaning -30, -29, and so on), that line will be classified as CA. For example, in the 1st line:
(102721669 + 3217) - 102724893 = -7, so this line is classified as CA.
I am trying to do that in awk.
awk -F "\t"":""-" '{if($2+$4-$3 < -30) ; print $7 = BA, if($2+$4-$3 >= -30) ; print $7 = CA}' file.txt > out.txt
but it does not return what I expect. Do you know how to fix it?
Try
$ awk 'BEGIN{FS=OFS="\t"} NR==FNR{a[$1]++; next}
{ split($1, b, /[\t:-]/);
$5 = a[$1]==1 ? "NON" : (b[2]+$2-b[3]) < -30 ? "BA" : "CA" }
1' file.txt file.txt
chr10:102721669-102724893 3217 3218 5 CA
chr10:102721669-102724893 3218 3219 1 CA
chr10:102721669-102724893 3219 3220 5 CA
chr10:102721669-102724893 421 422 1 BA
chr10:102721669-102724893 858 859 2 BA
chr10:102539319-102568941 13921 13922 1 NON
chr10:102587299-102589074 1560 1561 1 BA
chr10:102587299-102589074 1565 1566 1 BA
chr10:102587299-102589074 1595 1596 1 BA
chr10:102587299-102589074 944 945 1 BA
BEGIN{FS=OFS="\t"} sets both the input and output field separators to tab
NR==FNR{a[$1]++; next} counts how many times the first field is present in the file. The input file is passed twice, so that on the second pass we can make decisions based on the count
split($1, b, /[\t:-]/) splits the first column further; the results are saved in the b array
the rest of the code assigns the 5th field depending on the given conditions and prints the modified line
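The two-pass NR==FNR idiom is easier to see on a toy input (a minimal sketch, independent of the PDB data):

```shell
# Pass 1 (NR==FNR holds only while reading the first file argument):
# count each key. Pass 2: the counts are complete, so every line can be
# annotated with the total for its key.
printf 'x\nx\ny\n' > ids.txt
awk 'NR==FNR{a[$1]++; next} {print $1, a[$1]}' ids.txt ids.txt
# prints:
# x 2
# x 2
# y 1
```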
Further reading
Idiomatic awk
split function

how do I use awk to edit a file

I have a text file like this small example:
chr10:103909786-103910082 147 148 24 BA
chr10:103909786-103910082 149 150 11 BA
chr10:103909786-103910082 150 151 2 BA
chr10:103909786-103910082 152 153 1 BA
chr10:103909786-103910082 274 275 5 CA
chr10:103909786-103910082 288 289 15 CA
chr10:103909786-103910082 294 295 4 CA
chr10:103909786-103910082 295 296 15 CA
chr10:104573088-104576021 2925 2926 134 CA
chr10:104573088-104576021 2926 2927 10 CA
chr10:104573088-104576021 2932 2933 2 CA
chr10:104573088-104576021 58 59 1 BA
chr10:104573088-104576021 689 690 12 BA
chr10:104573088-104576021 819 820 33 BA
in this file there are 5 tab-separated columns. The first column is the ID; for example, in the first row the whole "chr10:103909786-103910082" is the ID.
1- in the 1st step I would like to filter out the rows based on the 4th column.
If the number in the 4th column is less than 10 and the group in the 5th column of the same row is BA, that row will be filtered out. Likewise, if the number in the 4th column is less than 5 and the group in the 5th column is CA, that row will be filtered out.
3- 3rd step:
I want to get one ratio from the 4th column per ID (the repeated values in the 1st column represent the same ID), so in the output every ID appears only once. Each ID has both BA and CA in the 5th column. For each ID I need two values: the CA value is the sum of all 4th-column values belonging to that ID and classified as CA, and the BA value is the corresponding sum for BA. The last step is to get the ratio CA/BA per ID. The expected output for the small example would look like this:
1- after filtration:
chr10:103909786-103910082 147 148 24 BA
chr10:103909786-103910082 149 150 11 BA
chr10:103909786-103910082 274 275 5 CA
chr10:103909786-103910082 288 289 15 CA
chr10:103909786-103910082 295 296 15 CA
chr10:104573088-104576021 2925 2926 134 CA
chr10:104573088-104576021 2926 2927 10 CA
chr10:104573088-104576021 689 690 12 BA
chr10:104573088-104576021 819 820 33 BA
2- after summarizing each group (CA and BA):
chr10:103909786-103910082 147 148 35 BA
chr10:103909786-103910082 274 275 35 CA
chr10:104573088-104576021 2925 2926 144 CA
chr10:104573088-104576021 819 820 45 BA
3- the final output(this ratio is made using the values in 4th column):
chr10:103909786-103910082 1
chr10:104573088-104576021 3.2
in the above lines, 1 = 35/35 and 3.2 = 144/45.
I am trying to do that in awk
awk 'ID==$1 {
if (ID) {
print ID, a["CA"]/a["BA"]; a["CA"]=a["BA"]=0;
}
ID=$1
}
$5=="BA" && $4>=10 || $5=="CA" && $4>=5 { a[$5]+=$4 }
END{ print ID, a["CA"]/a["BA"] }' file.txt
I tried to use this code but did not succeed: it returns one number, the sum of all CA values divided by the sum of all BA values, but I want the ratio per ID. Do you know how to solve the problem and correct the code?
awk '$4 >= 5 && $5 == "CA" { a[$1]+=$4 }
$4 >= 10 && $5 == "BA" { b[$1]+=$4 }
END{ for ( i in a) print i,a[i]/b[i]}' file
output:
chr10:103909786-103910082 1
chr10:104573088-104576021 3.2
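One caveat worth noting: for (i in a) print i, a[i]/b[i] divides by an empty (zero) value when an ID has surviving CA rows but no surviving BA rows. A guarded variant of the same sums, sketched on made-up data (the NA fallback is my own choice, not from the question):

```shell
# Toy input: chr11 has a CA row but no BA row at all.
cat > counts.txt <<'EOF'
chr10:1-100 147 148 30 CA
chr10:1-100 149 150 10 BA
chr11:1-2 5 6 50 CA
EOF

# Same filtering and sums as above, but only divide when a BA total exists.
awk '$4 >= 5 && $5 == "CA" { a[$1]+=$4 }
     $4 >= 10 && $5 == "BA" { b[$1]+=$4 }
     END{ for (i in a) if (b[i] > 0) print i, a[i]/b[i]; else print i, "NA" }' counts.txt
```

Since for (i in a) visits keys in an unspecified order, pipe through sort if you need deterministic output.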

Finding the mean values of a field using awk?

This is what I am trying to do:
find the mean values of x, y, and z for the HETATM records; the x values are in the 7th field, the y values in the 8th field, and the z values in the 9th field.
I am trying to do this using this file http://pastebin.com/EqA2SUMy
Here is the sample
HETATM 1756 O HOH A 501 -0.923 10.560 127.393 1.00 16.58 O
HETATM 1757 O HOH A 502 9.272 22.148 134.167 1.00 15.08 O
HETATM 1758 O HOH A 503 0.109 20.243 112.094 1.00 20.74 O
HETATM 1759 O HOH A 504 -3.930 10.522 125.779 1.00 20.79 O
HETATM 1760 O HOH A 505 -0.759 36.323 88.018 1.00 17.42 O
HETATM 1761 O HOH A 506 -4.645 51.936 81.852 1.00 21.43 O
HETATM 1762 O HOH A 507 -3.900 17.103 128.596 1.00 14.08 O
HETATM 1763 O HOH A 508 6.834 21.053 135.062 1.00 16.98 O
Can anyone show me how to write a script for this?
(this part relates to a comment; readers can ignore it)
ATOM 214 OE2 GLU A 460 -2.959 24.000 103.360 1.00 32.19 O
ATOM 215 N ARG A 461 -5.878 28.748 106.473 1.00 22.68 N
ATOM 216 CA ARG A 461 -6.553 30.043 106.524 1.00 24.34 C
ATOM 217 C ARG A 461 -5.583 31.176 106.219 1.00 22.42 C
ATOM 218 O ARG A 461 -5.918 32.121 105.497 1.00 25.07 O
ATOM 219 CB ARG A 461 -7.222 30.272 107.887 1.00 24.53 C
ATOM 220 CG ARG A 461 -8.425 29.394 108.150 1.00 26.38
$ awk '{for (i=1;i<=3;i++) sum[i]+=$(i+6)}
END{if (NR) for (i=1;i in sum;i++) print sum[i]/NR}' file
0.25725
23.736
116.62
The if (NR) is necessary to avoid a divide by zero error on an empty file.
If @jaypal is correct and you need to select just the input lines containing HETATM, then change it to:
awk '/HETATM/{++nr; for (i=1;i<=3;i++) sum[i]+=$(i+6)}
END{if (nr) for (i=1;i in sum;i++) print sum[i]/nr}' file
It's not rocket science. (Updated to catch only HETATM records — a trivial change; you can use more exacting regexes if you need to. However, it is also necessary to count which records match and use that count, not NR, since you're ignoring many records, in general.)
awk '/HETATM/ { sum7 += $7; sum8 += $8; sum9 += $9; count++ }
END { if (count > 0)
printf("avg(x) = %f, avg(y) = %f, avg(z) = %f\n",
sum7/count, sum8/count, sum9/count)
}'
And yes, you could put it all on one line but it wouldn't be as readable.
I can't answer for why it produced zeros for you; when run on the data from the question, wobbly line starts and all, it produced the output:
avg(x) = 0.257250, avg(y) = 23.736000, avg(z) = 116.620125
If you think there is a possibility of empty input (or, at least, no HETATM records in the input) and an error message is not acceptable, then you can protect the printing action with if (count > 0) or equivalent (added to the script). You can generate your own preferred output if `count` is zero.
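As a side note, /HETATM/ matches the string anywhere on the line, while testing the first field is slightly stricter. A self-contained sketch of that variant, on data abbreviated from the question:

```shell
cat > waters.pdb <<'EOF'
HETATM 1756 O HOH A 501 -0.923 10.560 127.393 1.00 16.58 O
HETATM 1757 O HOH A 502 9.272 22.148 134.167 1.00 15.08 O
ATOM 214 OE2 GLU A 460 -2.959 24.000 103.360 1.00 32.19 O
EOF

# $1=="HETATM" selects only records whose first field is exactly HETATM,
# so no record can slip in via a substring match elsewhere on the line.
awk '$1=="HETATM"{n++; for (i=7;i<=9;i++) sum[i]+=$i}
     END{if (n) printf "avg(x)=%g avg(y)=%g avg(z)=%g\n", sum[7]/n, sum[8]/n, sum[9]/n}' waters.pdb
# prints avg(x)=4.1745 avg(y)=16.354 avg(z)=130.78
```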

Can't cut column in Linux

I have a file like this
ATOM 3197 HD13 ILE 206 9.900 15.310 13.450 0.0196 1.4870
ATOM 3198 C ILE 206 10.870 16.560 17.500 0.8343 1.9080
ATOM 3199 OXT ILE 206 11.780 15.734 17.425 -0.8190 1.6612
ATOM 3200 O ILE 206 9.929 16.225 18.095 -0.8190 1.6612
I want to cut the second column, however when I use
cut -f1,3,4,5,6,7,8,9,10 filename
it doesn't work. Am I doing something wrong?
This is because there are multiple spaces, and cut treats each single space as a delimiter, so consecutive spaces produce empty fields.
You can start from the 5th position:
$ cut -d' ' -f 1,5- file
ATOM HD13 ILE 206 9.900 15.310 13.450 0.0196 1.4870
ATOM C ILE 206 10.870 16.560 17.500 0.8343 1.9080
ATOM OXT ILE 206 11.780 15.734 17.425 -0.8190 1.6612
ATOM O ILE 206 9.929 16.225 18.095 -0.8190 1.6612
Or squeeze spaces with tr -s like below (multiple spaces will be lost, though):
$ tr -s ' ' < file | cut -d' ' -f1,3,4,5,6,7,8,9,10
ATOM HD13 ILE 206 9.900 15.310 13.450 0.0196 1.4870
ATOM C ILE 206 10.870 16.560 17.500 0.8343 1.9080
ATOM OXT ILE 206 11.780 15.734 17.425 -0.8190 1.6612
ATOM O ILE 206 9.929 16.225 18.095 -0.8190 1.6612
Note you can indicate from 3 to the end with 3-:
tr -s ' ' < file | cut -d' ' -f1,3-
In fact I would use awk for this:
awk '{$2=""; print}' file
or just
awk '{$2=""} 1' file
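One caveat with the $2="" approach that is easy to miss: awk rebuilds the record with single-space separators, but the emptied field leaves a doubled space behind. A small sketch, including one way to squeeze it out:

```shell
printf 'ATOM 3197 HD13 ILE 206\n' > f.txt

awk '{$2=""} 1' f.txt                  # -> "ATOM  HD13 ILE 206" (doubled space)

# After $2="" the rebuilt record contains exactly one doubled space,
# so a single sub() is enough to collapse it.
awk '{$2=""; sub(/  /, " ")} 1' f.txt  # -> "ATOM HD13 ILE 206"
```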
There are many spaces in your file, so you have to account for the number of spaces when counting fields.
The new.txt contains
ATOM 3197 HD13 ILE 206 9.900 15.310 13.450 0.0196 1.4870
ATOM 3198 C ILE 206 10.870 16.560 17.500 0.8343 1.9080
ATOM 3199 OXT ILE 206 11.780 15.734 17.425 -0.8190 1.6612
ATOM 3200 O ILE 206 9.929 16.225 18.095 -0.8190 1.6612
and this is the command to print the second column:
$ cut -d' ' -f4 new.txt
3197
3198
3199
3200
where -d stands for delimiter, i.e. a space in this case, written as ' '
However, awk comes pretty handy in such cases
awk '{print $2}' new.txt
You can find the position of that column's content in the first row (3197) and then take the string at the same position in every row with awk:
awk -v field="3197" 'NR==1 {c = index($0,field)} {print substr($0,c,length(field))}' filename
source: https://unix.stackexchange.com/a/491770/20661

calculate the number of atoms in the PDB file

I would like to calculate the number of atoms for each residue in the PDB files. A PDB file looks as follows. The third column denotes the atoms and the fourth column denotes the residues.
ATOM 1 N ASN A 380 -0.011 22.902 -13.714 1.00 65.81 N
ATOM 2 CA ASN A 380 0.401 23.938 -12.714 1.00 65.53 C
ATOM 3 C ASN A 380 1.926 24.019 -12.595 1.00 64.48 C
ATOM 9 N THR A 381 2.553 24.693 -13.562 1.00 61.65 N
ATOM 10 CA THR A 381 4.006 24.848 -13.609 1.00 58.60 C
ATOM 16 N ILE A 382 5.156 22.716 -13.481 1.00 53.48 N
ATOM 17 CA ILE A 382 5.808 21.571 -12.830 1.00 49.47 C
ATOM 18 C ILE A 382 6.645 21.933 -11.584 1.00 45.24 C
ATOM 28 CB GLN A 383 8.735 24.763 -10.759 1.00 30.19 C
ATOM 29 CG GLN A 383 10.140 24.257 -11.037 1.00 29.17 C
ATOM 30 CD ASN A 384 10.397 23.975 -12.514 1.00 29.51 C
ATOM 31 OE1 ASN A 384 10.892 24.838 -13.237 1.00 30.67 O
I would like to get the output as follows
Total no:of ASN atoms - 5
Total no:of THR atoms - 2
Total no:of ILE atoms - 3
Total no:of GLN atoms - 2
This should do the job:
awk '{print $4}' <file> | sort | uniq -c | \
awk '{print "Total no:of", $2, "atoms -", $1}'
Or pure awk:
awk '{atom[$4]++;}
END{for (i in atom) {print "Total no:of", i, "atoms -", atom[i]} }' <file>
Output for both methods:
Total no:of GLN atoms - 2
Total no:of THR atoms - 2
Total no:of ASN atoms - 5
Total no:of ILE atoms - 3
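Note that for (i in atom) visits keys in an unspecified order, which is why the output above differs from the input order. If you want residues reported in first-seen order, one way is to remember each residue the first time it appears (a sketch, same counting logic):

```shell
# Abbreviated sample in the question's format.
cat > atoms.pdb <<'EOF'
ATOM 1 N ASN A 380
ATOM 2 CA ASN A 380
ATOM 9 N THR A 381
EOF

# order[] records each residue the first time it is seen; the END block
# then reports counts in that order.
awk '!($4 in atom){order[++n]=$4} {atom[$4]++}
     END{for (i=1;i<=n;i++) print "Total no:of", order[i], "atoms -", atom[order[i]]}' atoms.pdb
# prints:
# Total no:of ASN atoms - 2
# Total no:of THR atoms - 1
```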