Can't cut column in Linux - cut

I have a file like this
ATOM   3197 HD13 ILE 206 9.900 15.310 13.450 0.0196 1.4870
ATOM   3198 C ILE 206 10.870 16.560 17.500 0.8343 1.9080
ATOM   3199 OXT ILE 206 11.780 15.734 17.425 -0.8190 1.6612
ATOM   3200 O ILE 206 9.929 16.225 18.095 -0.8190 1.6612
I want to remove the second column, but when I use
cut -f1,3,4,5,6,7,8,9,10 filename
it doesn't work. Am I doing something wrong?

This is because the file contains runs of spaces: cut treats every single space as a separate delimiter, so consecutive spaces produce empty fields (and without -d, cut splits on tabs, not spaces, which is why your original command printed whole lines).
You can start from the 5th field, since the three spaces after ATOM push the serial number to field 4:
$ cut -d' ' -f 1,5- file
ATOM HD13 ILE 206 9.900 15.310 13.450 0.0196 1.4870
ATOM C ILE 206 10.870 16.560 17.500 0.8343 1.9080
ATOM OXT ILE 206 11.780 15.734 17.425 -0.8190 1.6612
ATOM O ILE 206 9.929 16.225 18.095 -0.8190 1.6612
Or squeeze runs of spaces with tr -s, like below (the original spacing is lost, though):
$ tr -s ' ' < file | cut -d' ' -f1,3,4,5,6,7,8,9,10
ATOM HD13 ILE 206 9.900 15.310 13.450 0.0196 1.4870
ATOM C ILE 206 10.870 16.560 17.500 0.8343 1.9080
ATOM OXT ILE 206 11.780 15.734 17.425 -0.8190 1.6612
ATOM O ILE 206 9.929 16.225 18.095 -0.8190 1.6612
Note you can indicate from 3 to the end with 3-:
tr -s ' ' < file | cut -d' ' -f1,3-
In fact I would use awk for this:
awk '{$2=""; print}' file
or just
awk '{$2=""} 1' file
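Note that blanking $2 leaves a doubled output separator where the field used to be. Here is a minimal sketch that deletes the field outright (assuming an awk such as gawk or mawk, where decrementing NF truncates and rebuilds the record):
# shift every field left over $2, then drop the now-duplicated last field
awk '{for (i=2; i<NF; i++) $i = $(i+1); NF--} 1' file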

There are runs of spaces in your file, so you have to account for them when counting fields: each space is a separator, and the empty fields between consecutive spaces count too.
Say new.txt contains
ATOM   3197 HD13 ILE 206 9.900 15.310 13.450 0.0196 1.4870
ATOM   3198 C ILE 206 10.870 16.560 17.500 0.8343 1.9080
ATOM   3199 OXT ILE 206 11.780 15.734 17.425 -0.8190 1.6612
ATOM   3200 O ILE 206 9.929 16.225 18.095 -0.8190 1.6612
and this is the command to print the second column (the three spaces after ATOM make the serial number field 4):
$ cut -d' ' -f4 new.txt
3197
3198
3199
3200
where -d specifies the delimiter, in this case a space, denoted by ' '.
However, awk comes in pretty handy in such cases, since it splits on any run of whitespace by default:
$ awk '{print $2}' new.txt

You can find the position of that column's content in the first row (3197) and then extract the string at the same position from every row with awk:
awk -v field="3197" 'NR==1 {c = index($0,field)} {print substr($0,c,length(field))}' filename
Source: https://unix.stackexchange.com/a/491770/20661

Related

grep and tail -f for a UTF-16 binary file - trying to use simple awk

How can I achieve the equivalent of:
tail -f file.txt | grep 'regexp'
to output only the lines that match a regular expression such as 'Result', given a file of this type:
$ file file.txt
file.txt:Little-endian UTF-16 Unicode text, with CRLF line terminators
Example of the tail -f stream content below, converted to UTF-8:
Package end.
Total warnings: 40
Total errors: 0
Elapsed time: 24.4267192 secs.
...Package Executed.
Result: Success
Awk?
The problems with piping to grep led me to awk as a one-stop-shop solution for stripping the offending characters and also printing the lines matched by the regex.
awk seems to give the most promising results; however, I am finding that it returns the whole stream rather than the individual matching lines:
tail -f file.txt | awk '{sub("/[^\x20-\x7F]/", "");/Result/;print}'
Package end.
Total warnings: 40
Total errors: 0
Elapsed time: 24.4267192 secs.
...Package Executed.
Result: Success
What I have tried
converting the stream and piping to grep
tail -f file.txt | iconv -t UTF-8 | grep 'regexp'
using luit to change terminal encoding as per this post
luit -encoding UTF-8 -- tail -f file.txt | grep 'regexp'
deleting non-ASCII characters, as described here, then piping to grep
tail -f file.txt | tr -d '[^\x20-\x7F]' | grep 'regexp'
tail -f file.txt | sed 's/[^\x00-\x7F]//' | grep 'regexp'
various combinations of the above using grep flags --line-buffered, -a as well as sed -u
using luit -encoding UTF-8 -- pre-pended to the above
using a file with the same encoding containing the regular expression for grep -f
Why they failed
With most attempts, simply nothing is printed to the screen, because grep searches for 'regexp' when the text is actually something like '\x00r\x00e\x00g\x00e\x00x\x00p' - for example, 'R' will return the line 'Result: Success' but 'Result' won't.
If a full regular expression does get a match, as in the case of grep -f, it returns the whole stream and doesn't seem to return just the matched lines.
Piping through sed, tr or iconv seems to break the pipe to grep, and grep still seems able to match only individual characters.
Edit
I looked at the raw file in its UTF-16 format using xxd, with an aim of using a regex to match the encoding, which gave the following output:
$ tail file.txt | xxd
00000000: 0050 0061 0063 006b 0061 0067 0065 0020 .P.a.c.k.a.g.e.
00000010: 0065 006e 0064 002e 000d 000a 000d 000a .e.n.d..........
00000020: 0054 006f 0074 0061 006c 0020 0077 0061 .T.o.t.a.l. .w.a
00000030: 0072 006e 0069 006e 0067 0073 003a 0020 .r.n.i.n.g.s.:.
00000040: 0034 0030 000d 000a 0054 006f 0074 0061 .4.0.....T.o.t.a
00000050: 006c 0020 0065 0072 0072 006f 0072 0073 .l. .e.r.r.o.r.s
00000060: 003a 0020 0030 000d 000a 0045 006c 0061 .:. .0.....E.l.a
00000070: 0070 0073 0065 0064 0020 0074 0069 006d .p.s.e.d. .t.i.m
00000080: 0065 003a 0020 0032 0034 002e 0034 0032 .e.:. .2.4...4.2
00000090: 0036 0037 0031 0039 0032 0020 0073 0065 .6.7.1.9.2. .s.e
000000a0: 0063 0073 002e 000d 000a 002e 002e 002e .c.s............
000000b0: 0050 0061 0063 006b 0061 0067 0065 0020 .P.a.c.k.a.g.e.
000000c0: 0045 0078 0065 0063 0075 0074 0065 0064 .E.x.e.c.u.t.e.d
000000d0: 002e 000d 000a 000d 000a 0052 0065 0073 ...........R.e.s
000000e0: 0075 006c 0074 003a 0020 0053 0075 0063 .u.l.t.:. .S.u.c
000000f0: 0063 0065 0073 0073 000d 000a 000d 000a .c.e.s.s........
00000100: 00
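For reference, the interleaved NUL bytes can be reproduced for any search string (a quick check, assuming iconv and xxd are available):
$ printf 'Result' | iconv -t UTF-16LE | xxd
00000000: 5200 6500 7300 7500 6c00 7400            R.e.s.u.l.t.
With UTF-16LE the NUL follows each character; in the dump above it precedes each character instead, which is what happens when tail picks up a UTF-16 stream at an odd byte offset (the first answer below notes that tail cuts the file in awkward places).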
The sloppiest solution that should work on Cygwin is fixing your awk statement:
tail -f file.txt | \
LC_CTYPE=C awk '{ gsub("[^[:print:]]", ""); if($0 ~ /Result/) print; }'
This has a few bugs that cancel each other out, like tail cutting a UTF-16LE file in awkward places but awk stripping what we hope is garbage.
A robust solution might be:
tail -c +1 -f file.txt | \
script -qc 'iconv -f UTF-16LE -t UTF-8' /dev/null | grep Result
but it reads the entire file and I don't know how well Cygwin works with using script to convince iconv not to buffer (it would work on GNU/Linux).
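Another option that may avoid script entirely (untested here; assumes GNU coreutils' stdbuf and that iconv buffers its output via stdio, which holds on GNU/Linux):
# force line buffering on iconv's output so matches appear as they arrive
tail -c +1 -f file.txt | stdbuf -oL iconv -f UTF-16LE -t UTF-8 | grep --line-buffered Result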
I realised a simple regex to ignore any characters between letters in the search string might work...
This matches 'Result' whilst allowing any one character between each letter...
$ tail -f file.txt | grep -a 'R.e.s.u.l.t'
Result: Success
$ tail -f file.txt | awk '/R.e.s.u.l.t./'
Result: Success
Or, as per this answer, to avoid typing all the tedious dots:
search="Result"
tail -f file.txt | grep -a -e "$(echo "$search" | sed 's/./&./g')"
You can use ripgrep instead, which handles UTF-16 nicely without your having to convert the input:
tail -f file.txt | rg regexp
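One caveat: rg detects UTF-16 by its byte-order mark, and a tail -f that starts mid-file won't pass one along. In that case you can name the encoding explicitly with rg's -E/--encoding option (an assumption worth testing against your stream):
tail -f file.txt | rg --encoding utf-16le regexp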

Finding the mean values of a field using awk?

This is what I am trying to do:
find the mean values of x, y, z for the HETATM records; the x values are the 7th field, the y values the 8th field, and the z values the 9th field.
I am trying to do this using this file http://pastebin.com/EqA2SUMy
Here is the sample
HETATM 1756 O HOH A 501 -0.923 10.560 127.393 1.00 16.58 O
HETATM 1757 O HOH A 502 9.272 22.148 134.167 1.00 15.08 O
HETATM 1758 O HOH A 503 0.109 20.243 112.094 1.00 20.74 O
HETATM 1759 O HOH A 504 -3.930 10.522 125.779 1.00 20.79 O
HETATM 1760 O HOH A 505 -0.759 36.323 88.018 1.00 17.42 O
HETATM 1761 O HOH A 506 -4.645 51.936 81.852 1.00 21.43 O
HETATM 1762 O HOH A 507 -3.900 17.103 128.596 1.00 14.08 O
HETATM 1763 O HOH A 508 6.834 21.053 135.062 1.00 16.98 O
Can anyone show me how to write a script for this?
(This part relates to a comment; readers can ignore it.)
ATOM 214 OE2 GLU A 460 -2.959 24.000 103.360 1.00 32.19 O
ATOM 215 N ARG A 461 -5.878 28.748 106.473 1.00 22.68 N
ATOM 216 CA ARG A 461 -6.553 30.043 106.524 1.00 24.34 C
ATOM 217 C ARG A 461 -5.583 31.176 106.219 1.00 22.42 C
ATOM 218 O ARG A 461 -5.918 32.121 105.497 1.00 25.07 O
ATOM 219 CB ARG A 461 -7.222 30.272 107.887 1.00 24.53 C
ATOM 220 CG ARG A 461 -8.425 29.394 108.150 1.00 26.38
$ awk '{for (i=1;i<=3;i++) sum[i]+=$(i+6)}
END{if (NR) for (i=1;i in sum;i++) print sum[i]/NR}' file
0.25725
23.736
116.62
The if (NR) is necessary to avoid a divide by zero error on an empty file.
If @jaypal is correct and you need to select just the input lines containing HETATM, then change it to:
awk '/HETATM/{++nr; for (i=1;i<=3;i++) sum[i]+=$(i+6)}
END{if (nr) for (i=1;i in sum;i++) print sum[i]/nr}' file
It's not rocket science. (Updated to catch only HETATM records, a trivial change; you can use more exacting regexes if you need to. However, it is also necessary to count the records that match and divide by that count, not NR, since in general you're ignoring many records.)
awk '/HETATM/ { sum7 += $7; sum8 += $8; sum9 += $9; count++ }
END { if (count > 0)
printf("avg(x) = %f, avg(y) = %f, avg(z) = %f\n",
sum7/count, sum8/count, sum9/count)
}'
And yes, you could put it all on one line but it wouldn't be as readable.
I can't answer for why it produced zeros for you; when run on the data from the question, wobbly line starts and all, it produced the output:
avg(x) = 0.257250, avg(y) = 23.736000, avg(z) = 116.620125
If you think there is a possibility of empty input (or, at least, of no HETATM records in the input) and an error message is not acceptable, then you can protect the printing action with if (count > 0) or equivalent (added to the script above). You can generate your own preferred output if `count` is zero.
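For instance, a minimal sketch (the fallback message and its destination are just an example; printing to /dev/stderr works in gawk, mawk and BSD awk):
awk '/HETATM/ { sum7 += $7; sum8 += $8; sum9 += $9; count++ }
     END { if (count > 0)
               printf("avg(x) = %f, avg(y) = %f, avg(z) = %f\n",
                      sum7/count, sum8/count, sum9/count)
           else
               print "no HETATM records found" > "/dev/stderr"
     }' file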

using grep in a pdb file

I have a PDB file; in short, it looks a bit like this
ATOM 1189 CA ILE A 172 4.067 0.764 -48.818 1.00 19.53 C
ATOM 1197 CA ATHR A 173 7.121 3.051 -48.711 0.50 17.77 C
ATOM 1198 CA BTHR A 173 7.198 2.978 -48.704 0.50 16.94 C
ATOM 1208 CA ALA A 174 7.797 2.124 -52.350 1.00 16.85 C
ATOM 1213 CA LEU A 175 4.431 3.707 -53.288 1.00 16.47 C
ATOM 1221 CA VAL A 176 4.498 6.885 -51.185 1.00 13.92 C
ATOM 1228 CA ARG A 177 6.418 10.059 -51.947 1.00 20.28 C
ATOM 1241 CA GLN B 23 -15.516 -2.515 13.305 1.00 32.36 C
ATOM 1250 CA ASP B 24 -12.740 -2.653 10.715 1.00 22.25 C
ATOM 1258 CA PHE B 25 -12.476 -2.459 6.886 1.00 19.17 C
ATOM 1269 CA TYR B 26 -12.886 -6.243 6.470 1.00 14.87 C
ATOM 1281 CA ASP B 27 -16.276 -6.196 8.222 1.00 18.01 C
ATOM 1289 CA PHE B 28 -17.998 -4.432 5.309 1.00 15.39 C
ATOM 1300 CA LYS B 29 -19.636 -5.878 2.191 1.00 14.46 C
ATOM 1309 CA ALA B 30 -19.587 -4.640 -1.378 1.00 15.26 C
ATOM 1314 CA VAL B 31 -21.000 -5.566 -4.753 1.00 16.26 C
what I want is to get rid of the B's and keep the A's, and then get rid of everything but the 6th column
grep ^ATOM 2p31protein.pdb | grep ' CA ' | grep ' A ' | cut -c23-27
This is what I have tried; it gets everything with ATOM and CA, which is what I want, and extracts the column I want, but it does not get rid of the B's.
This is more suited to awk:
$ awk '$1=="ATOM"&&$3=="CA"&&$5=="A"{print $6}' file
172
173
173
174
175
176
177
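A caveat on the cut -c approach from the question: real PDB files are fixed-width (atom name in columns 13-16, chain ID in column 22, residue sequence number in columns 23-26), so you can also slice columns directly. Here is a sketch under that assumption (it will not work on the whitespace-collapsed sample above):
# keep ATOM records whose atom name is " CA " and chain is A, then print the residue number
awk 'substr($0,1,4)=="ATOM" && substr($0,13,4)==" CA " && substr($0,22,1)=="A" {print substr($0,23,4)+0}' file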

calculate the number of atoms in the PDB file

I would like to calculate the number of atoms for each residue in the PDB files. A PDB file looks as follows. The third column denotes the atoms and the fourth column denotes the residues.
ATOM 1 N ASN A 380 -0.011 22.902 -13.714 1.00 65.81 N
ATOM 2 CA ASN A 380 0.401 23.938 -12.714 1.00 65.53 C
ATOM 3 C ASN A 380 1.926 24.019 -12.595 1.00 64.48 C
ATOM 9 N THR A 381 2.553 24.693 -13.562 1.00 61.65 N
ATOM 10 CA THR A 381 4.006 24.848 -13.609 1.00 58.60 C
ATOM 16 N ILE A 382 5.156 22.716 -13.481 1.00 53.48 N
ATOM 17 CA ILE A 382 5.808 21.571 -12.830 1.00 49.47 C
ATOM 18 C ILE A 382 6.645 21.933 -11.584 1.00 45.24 C
ATOM 28 CB GLN A 383 8.735 24.763 -10.759 1.00 30.19 C
ATOM 29 CG GLN A 383 10.140 24.257 -11.037 1.00 29.17 C
ATOM 30 CD ASN A 384 10.397 23.975 -12.514 1.00 29.51 C
ATOM 31 OE1 ASN A 384 10.892 24.838 -13.237 1.00 30.67 O
I would like to get the output as follows
Total no:of ASN atoms - 5
Total no:of THR atoms - 2
Total no:of ILE atoms - 3
Total no:of GLN atoms - 2
This should do the job:
awk '{print $4}' <file> | sort | uniq -c | \
awk '{print "Total no:of", $2, "atoms -", $1}'
Or pure awk:
awk '{atom[$4]++;}
END{for (i in atom) {print "Total no:of", i, "atoms -", atom[i]} }' <file>
Output for both methods (the ordering may differ: sort | uniq -c prints alphabetically, while awk's for (i in atom) visits keys in an unspecified order):
Total no:of GLN atoms - 2
Total no:of THR atoms - 2
Total no:of ASN atoms - 5
Total no:of ILE atoms - 3
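If you want deterministic output from the pure-awk version, one simple option is to pipe it through sort; since the "Total no:of" prefix is constant, this effectively sorts by residue name:
awk '{atom[$4]++} END{for (i in atom) print "Total no:of", i, "atoms -", atom[i]}' file | sort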

awk print multiple column file into single column

My file looks like this:
315
717
461 737
304
440
148 206 264 322 380 438 496
801
495
355
249 989
768
946
I want to print all those columns as one long single column in a new file.
If I try
awk '{print $1}' file > new_file;
awk '{print $2}' file >> new_file
there are blank lines in between (rows with fewer columns print empty fields). How do I solve this?
Perhaps a bit cryptic:
awk '1' RS='[[:space:]]+' inputfile
It says: print every record, treating any run of whitespace as the record separator. (Note that a regular-expression RS is a gawk extension; POSIX awk uses only the first character of RS.)
You can simply use something like:
awk '{ for (i=1; i<=NF; i++) print $i }' file
For each line, iterate through columns, and print each column in a new line.
You don't need anything as heavyweight as sed (or awk) for this: just translate spaces to newlines.
tr ' ' '\n' < file
tr is purely a filter, so you have to redirect the file into it.
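One caveat: if the input contains runs of spaces, plain tr emits an empty line for each extra space. Squeezing repeats with -s avoids that:
tr -s ' ' '\n' < file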
A perl solution:
perl -pe 's/\s+(?=\S)/\n/g' infile
Output:
315
717
461
737
304
440
148
206
264
322
380
438
496
801
495
355
249
989
768
946