Removing 'TER' keywords from PDB file while merging two chains of PDB

Goal: merge two chains from a PDB file using Biopython. In the following example, I want to merge chains A and B into a single chain C.
ATOM 1133 N VAL A 100 12.484 -30.583 106.831 1.00 30.28 N
ATOM 1134 CA VAL A 100 11.430 -31.194 106.033 1.00 34.41 C
ATOM 1135 C VAL A 100 11.985 -32.402 105.259 1.00 39.25 C
ATOM 1136 O VAL A 100 11.248 -33.126 104.568 1.00 46.37 O
ATOM 1137 CB VAL A 100 10.822 -30.174 105.029 1.00 35.16 C
ATOM 1138 CG1 VAL A 100 10.159 -29.020 105.767 1.00 36.95 C
ATOM 1139 CG2 VAL A 100 11.865 -29.669 104.007 1.00 30.60 C
TER
ATOM 1141 N GLU B 1 12.344 -43.792 102.987 1.00 64.25 N
ATOM 1142 CA GLU B 1 11.253 -42.785 103.240 1.00 66.15 C
ATOM 1143 C GLU B 1 11.742 -41.350 102.948 1.00 65.40 C
ATOM 1144 O GLU B 1 12.011 -40.595 103.895 1.00 65.31 O
ATOM 1145 CB GLU B 1 10.779 -42.877 104.712 1.00 67.04 C
This code relabels both chains as a single chain, but it does not remove the TER record between them.
    merged_chains = ['A', 'B']
    new_rsd_num = 1
    for model in structure:
        for chain in model:
            if chain.id in merged_chains:
                chain.id = 'C'
                for residue in chain:
                    residue.id = (' ', new_rsd_num, ' ')
                    new_rsd_num += 1
This code produces the following output, which still contains a TER record between the two chains.
...
ATOM 1133 N VAL C 100 12.484 -30.583 106.831 1.00 30.28 N
ATOM 1134 CA VAL C 100 11.430 -31.194 106.033 1.00 34.41 C
ATOM 1135 C VAL C 100 11.985 -32.402 105.259 1.00 39.25 C
ATOM 1136 O VAL C 100 11.248 -33.126 104.568 1.00 46.37 O
ATOM 1137 CB VAL C 100 10.822 -30.174 105.029 1.00 35.16 C
ATOM 1138 CG1 VAL C 100 10.159 -29.020 105.767 1.00 36.95 C
ATOM 1139 CG2 VAL C 100 11.865 -29.669 104.007 1.00 30.60 C
TER
ATOM 1141 N GLU C 101 12.344 -43.792 102.987 1.00 64.25 N
ATOM 1142 CA GLU C 101 11.253 -42.785 103.240 1.00 66.15 C
ATOM 1143 C GLU C 101 11.742 -41.350 102.948 1.00 65.40 C
ATOM 1144 O GLU C 101 12.011 -40.595 103.895 1.00 65.31 O
ATOM 1145 CB GLU C 101 10.779 -42.877 104.712 1.00 67.04 C
...
But the output should look like the following, with the TER record removed.
...
ATOM 1133 N VAL C 100 12.484 -30.583 106.831 1.00 30.28 N
ATOM 1134 CA VAL C 100 11.430 -31.194 106.033 1.00 34.41 C
ATOM 1135 C VAL C 100 11.985 -32.402 105.259 1.00 39.25 C
ATOM 1136 O VAL C 100 11.248 -33.126 104.568 1.00 46.37 O
ATOM 1137 CB VAL C 100 10.822 -30.174 105.029 1.00 35.16 C
ATOM 1138 CG1 VAL C 100 10.159 -29.020 105.767 1.00 36.95 C
ATOM 1139 CG2 VAL C 100 11.865 -29.669 104.007 1.00 30.60 C
ATOM 1141 N GLU C 101 12.344 -43.792 102.987 1.00 64.25 N
ATOM 1142 CA GLU C 101 11.253 -42.785 103.240 1.00 66.15 C
ATOM 1143 C GLU C 101 11.742 -41.350 102.948 1.00 65.40 C
ATOM 1144 O GLU C 101 12.011 -40.595 103.895 1.00 65.31 O
ATOM 1145 CB GLU C 101 10.779 -42.877 104.712 1.00 67.04 C
...
Any idea how to remove the TER record using Biopython?

Overwriting the chain id does not move any residues: each residue still belongs to its original chain object, so the structure still contains two chains and PDBIO writes a TER record after each of them.
Instead, you could add the residues from chain B to chain A and then delete chain B.
    # read a PDB file with two chains
    from Bio import PDB

    pdbl = PDB.PDBList()
    # retrieve_pdb_file returns the path of the downloaded file;
    # file_format='pdb' requests the legacy PDB format that PDBParser reads
    pdb_path = pdbl.retrieve_pdb_file('5K04', file_format='pdb')
    parser = PDB.PDBParser()
    structure = parser.get_structure('5K04', pdb_path)

    # get all chains
    chains = list()
    for model in structure:
        for chain in model:
            chains.append(chain)

    # residue number that follows the last residue of the first chain
    len_chain_a = int(chains[0].get_unpacked_list()[-1].id[1]) + 1

    # move all residues from the 2nd chain into the first
    # (iterate over a copy, because residues are detached inside the loop)
    for i, residue in enumerate(list(chains[1])):
        # detach first so the new id cannot clash with one still in chain B
        chains[1].detach_child(residue.id)
        old_id = list(residue.id)
        # increment the id
        old_id[1] = len_chain_a + i
        residue.id = tuple(old_id)
        # add the residue to the first chain
        chains[0].add(residue)

    # now delete all chains but the first
    # (iterate over a copy, because chains are detached inside the loop)
    for model in structure:
        for chain in list(model):
            if chain.id != 'A':
                model.detach_child(chain.id)

    # save the merged chains
    pdb_io = PDB.PDBIO()
    pdb_io.set_structure(structure)
    pdb_io.save('5k04_merged.pdb')
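If Biopython is not a hard requirement, the same result can be approximated by editing the text of the file directly. The following is only a minimal sketch, not part of the answer above: the function name merge_chains is made up for illustration, and it assumes the standard fixed-column PDB layout (chain ID in column 22, residue number in columns 23-26, insertion code in column 27), so it needs the original file rather than the whitespace-collapsed excerpts shown in the question. Atom serial numbers are left untouched.

```python
def merge_chains(lines, merge_ids=("A", "B"), new_id="C"):
    """Relabel the chains in merge_ids as new_id, renumber their residues
    from 1, and drop any TER record that sits between two merged chains."""
    out = []
    new_num = 0
    last_key = None      # (old chain ID, old residue number + insertion code)
    pending_ter = None   # TER seen after a merged chain, not yet emitted
    in_merged = False    # was the last coordinate record in a merged chain?
    for line in lines:
        record = line[:6].strip()
        if record in ("ATOM", "HETATM"):
            merged = line[21] in merge_ids
            if pending_ter is not None:
                # keep the TER only if an unmerged chain follows it
                if not merged:
                    out.append(pending_ter)
                pending_ter = None
            if merged:
                key = (line[21], line[22:27])
                if key != last_key:
                    new_num += 1
                    last_key = key
                # rewrite chain ID, residue number and insertion code in place
                line = f"{line[:21]}{new_id}{new_num:>4d} {line[27:]}"
            in_merged = merged
            out.append(line)
        elif record == "TER" and in_merged:
            pending_ter = line
        else:
            if pending_ter is not None:
                out.append(pending_ter)
                pending_ter = None
            in_merged = False
            out.append(line)
    if pending_ter is not None:
        out.append(pending_ter)
    return out
```

It would be called on the lines of a file, e.g. merge_chains(open('input.pdb').readlines()), and the returned list written back out. A TER that ends the last merged chain is kept as it was in the input, including its old chain ID if it carries one.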

Related

Awk or Bash, Replacing string from specific column and line number to foreward

I want to replace the capital letter B with C in column 5, from line 6 to the end of the file, while keeping the spacing of my original input file exactly as it is.
ATOM 1939 HG2 PRO A 125 35.681 32.906 38.437 1.00 43.59 H
ATOM 1940 HG3 PRO A 125 34.593 33.765 37.652 1.00 41.79 H
ATOM 1941 HD2 PRO A 125 37.364 34.075 37.624 1.00 43.38 H
ATOM 1942 HD3 PRO A 125 36.333 34.312 36.415 1.00 41.29 H
TER
ATOM 1944 N MET B 11 16.583 29.975 -4.306 1.00 51.32 N
ATOM 1945 CA MET B 11 15.542 30.263 -3.327 1.00 39.92 C
ATOM 1946 C MET B 11 16.146 30.366 -1.933 1.00 32.50 C
I have read:
https://unix.stackexchange.com/questions/486840/replace-a-string-with-sed-from-specific-lines
https://unix.stackexchange.com/questions/70878/replacing-string-based-on-line-number
Sed replace pattern with line number
and my attempt is: awk 'NR == 6 && $ == 5, { sub(" B ", " C ") }'
This awk command should do it. Written and tested in GNU awk (the three-argument match() is a gawk extension).
awk '
FNR>=6 && match($0,/^(\S+[[:space:]]+)(\S+[[:space:]]+)(\S+[[:space:]]+)(\S+[[:space:]]+)(\S+)(.*)$/,arr) && arr[5]=="B"{
$0=arr[1] arr[2] arr[3] arr[4] "C" arr[6]
}
1
' Input_file
The match function is used here so that your spacing is kept exactly as it is after the substitution.
You can use any awk and preserve the format by using a sub() on the entire record. This prevents awk from recalculating the fields. For instance, you can do:
awk 'NR > 5 { sub(/MET B/,"MET C") }1' file
Here awk replaces the first occurrence of "MET B" with "MET C" on each record from the 6th onward, leaving the spacing alone.
Output
ATOM 1939 HG2 PRO A 125 35.681 32.906 38.437 1.00 43.59 H
ATOM 1940 HG3 PRO A 125 34.593 33.765 37.652 1.00 41.79 H
ATOM 1941 HD2 PRO A 125 37.364 34.075 37.624 1.00 43.38 H
ATOM 1942 HD3 PRO A 125 36.333 34.312 36.415 1.00 41.29 H
TER
ATOM 1944 N MET C 11 16.583 29.975 -4.306 1.00 51.32 N
ATOM 1945 CA MET C 11 15.542 30.263 -3.327 1.00 39.92 C
ATOM 1946 C MET C 11 16.146 30.366 -1.933 1.00 32.50 C
Use NR >= 6 to update all lines starting with line 6. Your code does it on line 6 only.
Put $5 in the sub() call to make the substitution only in that field.
awk 'NR >= 6 {sub("B", "C", $5)} 1'
I hope you don't care about the number of spaces between the columns. The lines that are updated will have multiple spaces collapsed into one, so
ATOM 1944 N MET B 11 16.583 29.975 -4.306 1.00 51.32 N
becomes
ATOM 1944 N MET C 11 16.583 29.975 -4.306 1.00 51.32 N
If you can make use of gawk, you could use split, whose optional fourth argument is a seps array that stores the separators.
You can split on the field separator FS, and then use the number returned by split to loop through all the fields.
When field 5 is reached and its value is B, change it to C.
awk 'NR > 5 {
    nr = split($0, a, FS, seps)
    for (i = 1; i <= nr; ++i) {
        if (i == 5 && a[i] == "B") a[i] = "C"
        printf "%s%s", a[i], seps[i]
    }
    printf "\n"
    next
}1' file
Output
ATOM 1939 HG2 PRO A 125 35.681 32.906 38.437 1.00 43.59 H
ATOM 1940 HG3 PRO A 125 34.593 33.765 37.652 1.00 41.79 H
ATOM 1941 HD2 PRO A 125 37.364 34.075 37.624 1.00 43.38 H
ATOM 1942 HD3 PRO A 125 36.333 34.312 36.415 1.00 41.29 H
TER
ATOM 1944 N MET C 11 16.583 29.975 -4.306 1.00 51.32 N
ATOM 1945 CA MET C 11 15.542 30.263 -3.327 1.00 39.92 C
ATOM 1946 C MET C 11 16.146 30.366 -1.933 1.00 32.50 C
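For comparison, the same idea (change one field, leave every separator untouched) can be sketched in Python. This is an illustration rather than a translation of the gawk code: instead of storing the separators, it locates the span of the n-th whitespace-delimited field and splices the replacement in. The names replace_field and relabel are hypothetical.

```python
import re

def replace_field(line, index, old, new):
    """Replace whitespace-delimited field `index` (1-based) if it equals `old`,
    leaving all surrounding whitespace exactly as it was."""
    for n, match in enumerate(re.finditer(r"\S+", line), start=1):
        if n == index:
            if match.group() == old:
                return line[:match.start()] + new + line[match.end():]
            break
    return line

def relabel(lines, first_line=6, index=5, old="B", new="C"):
    """Apply replace_field to every line from `first_line` (1-based) onward."""
    return [
        replace_field(line, index, old, new) if number >= first_line else line
        for number, line in enumerate(lines, start=1)
    ]
```

Lines with fewer than five fields, such as TER, pass through unchanged.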
Here's a way to preserve all the spaces and tabs without relying on vendor-specific extensions:
mawk 'BEGIN { _ = length(FS="[ \t]+") } NR<_ || NF<_ ||
$!NF = sprintf("%.*sC%s", (__ = index($!_,
$_))- !!_, $!_, substr($!_, ++__))'
ATOM 1939 HG2 PRO A 125 35.681 32.906 38.437 1.00 43.59 H
ATOM 1940 HG3 PRO A 125 34.593 33.765 37.652 1.00 41.79 H
ATOM 1941 HD2 PRO A 125 37.364 34.075 37.624 1.00 43.38 H
ATOM 1942 HD3 PRO A 125 36.333 34.312 36.415 1.00 41.29 H
TER
ATOM 1944 N MET C 11 16.583 29.975 -4.306 1.00 51.32 N
ATOM 1945 CA MET C 11 15.542 30.263 -3.327 1.00 39.92 C
ATOM 1946 C MET C 11 16.146 30.366 -1.933 1.00 32.50 C

How to delete duplicate character and leave one occurrence only?

I'm a beginner in shell scripting, and I need to automate removing duplicates from a bunch of files.
An example of my files:
1 Br Atom
2 Cl Atom
3 F Atom
4 P Atom
5 Al Atom
6 Rh Atom
7 Cl Atom
8 Cl Atom
9 Mg Atom
10 O Atom
11 N Atom
12 H Atom
13 H Atom
14 H Atom
15 H Atom
16 H Atom
17 Cl Atom
What I'm looking for is to remove all the "Cl" atoms except one, no matter which one.
I tried different commands, like sort with uniq and awk '!a[$2]++' atomos.txt, but I didn't get the desired result.
I need to remove the duplicates of one specific atom without touching the others, and the order of the other atoms must stay the same.
The desired output:
1 Br Atom
3 F Atom
4 P Atom
5 Al Atom
6 Rh Atom
7 Cl Atom
9 Mg Atom
10 O Atom
11 N Atom
12 H Atom
13 H Atom
14 H Atom
15 H Atom
16 H Atom
You're pretty close. Try:
awk '$2 != "Cl" || !dup++'
which will print the first Cl and remove the rest.
Your attempt needs rework to treat Cl and non-Cl lines differently. Let file.txt contain
1 Br Atom
2 Cl Atom
3 F Atom
4 P Atom
5 Al Atom
6 Rh Atom
7 Cl Atom
8 Cl Atom
9 Mg Atom
10 O Atom
11 N Atom
12 H Atom
13 H Atom
14 H Atom
15 H Atom
16 H Atom
17 Cl Atom
then
awk '$2=="Cl"?!a[$2]++:1' file.txt
output
1 Br Atom
2 Cl Atom
3 F Atom
4 P Atom
5 Al Atom
6 Rh Atom
9 Mg Atom
10 O Atom
11 N Atom
12 H Atom
13 H Atom
14 H Atom
15 H Atom
16 H Atom
Explanation: I used the so-called ternary operator, whose syntax is condition?valueiftrue:valueiffalse. If the line is a Cl line, the code you already wrote decides whether to print it; otherwise the value is 1, i.e. the line is printed.
(tested in gawk 4.2.1)
Another possibility with awk can be:
awk '!f; $2=="Cl" {f=1;next} f' file
1 Br Atom
2 Cl Atom
3 F Atom
4 P Atom
5 Al Atom
6 Rh Atom
9 Mg Atom
10 O Atom
11 N Atom
12 H Atom
13 H Atom
14 H Atom
15 H Atom
16 H Atom
or, to get the desired output, where the Cl that is kept is the second occurrence (not the first):
awk '{f=1} $2=="Cl" {f=0;c++} {if (c==2) f=1} f' file
1 Br Atom
3 F Atom
4 P Atom
5 Al Atom
6 Rh Atom
7 Cl Atom
9 Mg Atom
10 O Atom
11 N Atom
12 H Atom
13 H Atom
14 H Atom
15 H Atom
16 H Atom
Other outputs:
The counter c tracks the occurrences of Cl; when its value is 2, f is set back to true so that line is printed.
To keep the third or the fourth Cl instead, change the value that c is compared with: ...{if (c==3) f=1}... or ...{if (c==4) f=1}...
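The counting approach above can also be sketched in Python, which may be easier to extend when there are many files to process. This is a minimal sketch under the assumption that the element symbol is the second whitespace-separated column, as in the sample; the function name keep_one and its parameters are made up for illustration.

```python
def keep_one(lines, symbol="Cl", keep=1, column=2):
    """Drop every line whose `column` (1-based) equals `symbol`,
    except the `keep`-th occurrence. All other lines keep their order."""
    seen = 0
    out = []
    for line in lines:
        fields = line.split()
        if len(fields) >= column and fields[column - 1] == symbol:
            seen += 1
            if seen != keep:
                continue  # a duplicate we do not want
        out.append(line)
    return out
```

With keep=1 the first Cl survives (like the ternary one-liner), and with keep=2 the second one does, which matches the desired output in the question.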

Awk replace column AFTER matched line

I have a PDB file that is returned from a receptor/ligand docking prediction. I don't know why the authors of the program named the chains "A" for both receptor and ligand, but I want to change it. This should be a basic thing that I want to do and I am not sure why I cannot find any example on the internet. What I want to do is simple.
Match a line, for example "HEADER lig"
Then for every line after that replace column $5 with a "B"
Here is example of input file:
ATOM 9197 OG SER A1176 103.395 152.201 139.176 1.00 0.00 RA2 O
ATOM 9198 HG SER A1176 104.092 151.786 138.659 1.00 0.00 RA2 H
ATOM 9199 C SER A1176 101.857 153.749 136.254 1.00 0.00 RA2 C
ATOM 9200 O SER A1176 102.183 152.962 135.366 1.00 0.00 RA2 O
TER
HEADER lig.006.10.pdb
ATOM 1 N GLY A 25 182.812 181.892 153.587 1.00 0.00 LA0 N
ATOM 2 H GLY A 25 182.954 182.546 152.840 1.00 0.00 LA0 H
ATOM 3 CA GLY A 25 183.834 180.858 153.715 1.00 0.00 LA0 C
ATOM 4 C GLY A 25 184.544 180.646 152.391 1.00 0.00 LA0 C
ATOM 5 O GLY A 25 184.450 181.466 151.487 1.00 0.00 LA0 O
ATOM 6 N PRO A 26 185.249 179.494 152.297 1.00 0.00 LA0 N
ATOM 7 CD PRO A 26 185.371 178.458 153.319 1.00 0.00 LA0 C
I tried this below but it only replaces column $5 for the first line after match. Not sure why nothing is posted on this example anywhere.
awk '{ print; } /^HEADER lig/ { getline; $5="B"; print }' model.006.10.pdb
awk '{ if (headerfound==1){ $5="B" }}/^HEADER/{ headerfound=1}{ print }' mode.pdb
Three parts:
headerfound==1 ==> assign "B" to the 5th column
/^HEADER/ ==> does the line start with HEADER?
a simple print to print the (possibly changed) line.
A short explanation
Part 2, the detection of ^HEADER, goes after the check headerfound==1 because when the HEADER is found, the current line is the line containing the text HEADER, and we do not want to assign "B" to the 5th column of that line.
On the following lines, we first check whether any previous line contained a HEADER (headerfound==1), and if so update $5.
IIRC white space matters between fields in these "PDB" files so here's a GNU awk solution that preserves white space:
$ awk 'f{$0=gensub(/((\S+\s+){4})\S+/,"\\1B",1)} /HEADER lig/{f=1} 1' file
ATOM 9197 OG SER A1176 103.395 152.201 139.176 1.00 0.00 RA2 O
ATOM 9198 HG SER A1176 104.092 151.786 138.659 1.00 0.00 RA2 H
ATOM 9199 C SER A1176 101.857 153.749 136.254 1.00 0.00 RA2 C
ATOM 9200 O SER A1176 102.183 152.962 135.366 1.00 0.00 RA2 O
TER
HEADER lig.006.10.pdb
ATOM 1 N GLY B 25 182.812 181.892 153.587 1.00 0.00 LA0 N
ATOM 2 H GLY B 25 182.954 182.546 152.840 1.00 0.00 LA0 H
ATOM 3 CA GLY B 25 183.834 180.858 153.715 1.00 0.00 LA0 C
ATOM 4 C GLY B 25 184.544 180.646 152.391 1.00 0.00 LA0 C
ATOM 5 O GLY B 25 184.450 181.466 151.487 1.00 0.00 LA0 O
ATOM 6 N PRO B 26 185.249 179.494 152.297 1.00 0.00 LA0 N
ATOM 7 CD PRO B 26 185.371 178.458 153.319 1.00 0.00 LA0 C
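The two answers combine into a small state machine: set a flag when the marker line is seen, and on every later line replace the fifth field without touching the whitespace. A Python sketch of that logic follows; the function name relabel_after_header is hypothetical, and unlike the gawk command it only matches the marker at the start of a line.

```python
import re

# first four fields with their trailing whitespace (group 1), then the fifth field
FIFTH_FIELD = re.compile(r"^((?:\S+\s+){4})\S+")

def relabel_after_header(lines, marker="HEADER lig", new_chain="B"):
    """Replace the 5th field with `new_chain` on every line after the
    first line that starts with `marker`, preserving all spacing."""
    out = []
    found = False
    for line in lines:
        if found:
            # a function replacement avoids backslash handling in new_chain
            line = FIFTH_FIELD.sub(lambda m: m.group(1) + new_chain, line, count=1)
        # test for the marker after the substitution, so the marker line
        # itself is never modified
        if line.startswith(marker):
            found = True
        out.append(line)
    return out
```

Lines with fewer than five fields (TER, for instance) do not match the pattern and are written out unchanged.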

count the number of residues in a pdb file

I have a PDB file as shown below. I want to count the number of residues. 4th column is the residue name and 6th column is the residue position.
file1.pdb
ATOM 1 N ASN A 2 18.359 26.869 52.955 1.00 39.17 N
ATOM 2 CA ASN A 2 19.635 26.632 53.671 1.00 38.01 C
ATOM 5 N LEU A 3 20.916 28.708 54.068 1.00 32.39 N
ATOM 6 CA LEU A 3 21.304 29.943 54.753 1.00 28.83 C
ATOM 7 C LEU A 3 20.084 30.834 54.955 1.00 25.23 C
ATOM 13 N LYS A 4 19.824 31.394 56.099 1.00 23.92 N
ATOM 14 CA LYS A 4 18.654 32.292 56.333 1.00 21.94 C
ATOM 15 C LYS A 5 19.164 33.678 56.668 1.00 20.25 C
file2.pdb
ATOM 1 N ASN A 2 18.359 26.869 52.955 1.00 39.17 N
ATOM 2 CA ASN A 2 19.635 26.632 53.671 1.00 38.01 C
ATOM 5 N LEU A 3 20.916 28.708 54.068 1.00 32.39 N
ATOM 6 CA LEU A 3 21.304 29.943 54.753 1.00 28.83 C
ATOM 7 C LEU A 3 20.084 30.834 54.955 1.00 25.23 C
ATOM 13 N LYS A 4 19.824 31.394 56.099 1.00 23.92 N
ATOM 14 CA LYS A 4 18.654 32.292 56.333 1.00 21.94 C
ATOM 15 C LYS A 5 19.164 33.678 56.668 1.00 20.25 C
Desired output
Total no:of ASN - 2
Total no:of LEU - 2
Total no:of LYS - 4
Total no:of residues - 8
$ awk '{ a[$4 $6 FILENAME]++ }
END {
    for (i in a) { b[substr(i,1,3)]++ }
    for (i in b) {
        total += b[i]
        printf "Total no:of %s - %d\n", i, b[i]
    }
    printf "\nTotal no:of residues - %d\n", total
}' file1.pdb file2.pdb
Total no:of LEU - 2
Total no:of ASN - 2
Total no:of LYS - 4
Total no:of residues - 8
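The awk command works in two steps: it first collapses all atoms of one residue into a single key (residue name, position and file name), then counts the distinct keys per residue name. The same two steps in Python, as a hedged sketch: the function name count_residues is made up, the restriction to ATOM records is an added assumption that the awk version does not make, and whitespace splitting would miscount lines where the chain ID and residue number are fused (e.g. A1176).

```python
from collections import Counter

def count_residues(files):
    """files maps a file name to an iterable of PDB lines.
    Returns a Counter of residue name -> number of distinct residues."""
    seen = set()
    for name, lines in files.items():
        for line in lines:
            fields = line.split()
            # assumption: only ATOM records count; column 4 is the residue
            # name and column 6 the residue position
            if len(fields) >= 6 and fields[0] == "ATOM":
                seen.add((name, fields[3], fields[5]))
    return Counter(resname for _, resname, _ in seen)
```

The total is then sum(counts.values()), and the per-residue lines can be printed by looping over counts.items().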

giving space in the 5th column with awk [duplicate]

I have a PDB file as shown below. I would like to give a space in the 5th column.
ATOM 2614 CB ALA B1130 -3.307 -21.676 3.322 1.00 15.69 C
ATOM 2615 N LEU B1131 -4.584 -19.000 2.466 1.00 18.48 N
ATOM 2616 CA LEU B1131 -4.847 -17.586 2.701 1.00 18.86 C
desired output
ATOM 2614 CB ALA B 1130 -3.307 -21.676 3.322 1.00 15.69 C
ATOM 2615 N LEU B 1131 -4.584 -19.000 2.466 1.00 18.48 N
ATOM 2616 CA LEU B 1131 -4.847 -17.586 2.701 1.00 18.86 C
If you require tab separated output, you could use awk like this:
awk -v OFS="\t" '{ sub(/./, "& ", $5) }1' file
Results:
ATOM 2614 CB ALA B 1130 -3.307 -21.676 3.322 1.00 15.69 C
ATOM 2615 N LEU B 1131 -4.584 -19.000 2.466 1.00 18.48 N
ATOM 2616 CA LEU B 1131 -4.847 -17.586 2.701 1.00 18.86 C
awk '$5 = substr($5, 1, 1) " " substr($5, 2)' /tmp/foo.pdb
if you have gawk:
awk '{$5=gensub(/^./,"& ","g",$5)}1' file
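For completeness, a Python sketch of the same split that keeps the original spacing of every other column. The name split_chain_field is hypothetical, and the pattern assumes a single-letter chain ID fused to a non-negative residue number in the fifth whitespace-delimited field. Note that in the fixed-column PDB format the chain ID (column 22) and residue number (columns 23-26) are adjacent by design, so inserting a space shifts the later columns one place to the right and the result may no longer be read correctly by strict PDB parsers.

```python
import re

# four leading fields with their whitespace, then a letter fused to digits
FUSED = re.compile(r"^((?:\S+\s+){4})([A-Za-z])(\d+)")

def split_chain_field(line):
    """Insert one space between chain ID and residue number when they are
    fused in the 5th field (e.g. 'B1130' becomes 'B 1130')."""
    return FUSED.sub(r"\1\2 \3", line, count=1)
```

Lines where the two values are already separated do not match and are returned unchanged.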