awk extract lines between two patterns with a twist - awk

I have a type of data file that contains only once (!) the following block of text:
Begin final coordinates
new unit-cell volume = 460.57251 a.u.^3 ( 68.24980 Ang^3 )
density = 7.37364 g/cm^3
CELL_PARAMETERS (alat= 7.29434300)
0.995319813 0.000000000 0.000000000
0.000000000 0.995319813 0.000000000
0.000000000 0.000000000 1.197882354
ATOMIC_POSITIONS (crystal)
Pb 0.0000000000 0.0000000000 -0.0166356359
O 0.5000000000 0.5000000000 0.1549702780
Ti 0.5000000000 0.5000000000 0.5327649171
O 0.0000000000 0.5000000000 0.6381882204
O 0.5000000000 0.0000000000 0.6381882204
End final coordinates
I have found how to extract the entire block of lines between the Begin final coordinates and End final coordinates patterns, but I need it to be more refined. I would like to extract first the three lines below the line starting with CELL_PARAMETERS. Then I would like to extract (with another action, not in the same awk command) the 5 lines below ATOMIC_POSITIONS.
I have to make an observation here: I said at the beginning that the block of text appears only once, and this is true for that specific form with Begin final coordinates and End final coordinates. Throughout the data file there are many blocks with this form:
CELL_PARAMETERS (alat= 7.29434300)
0.995319813 0.000000000 0.000000000
0.000000000 0.995319813 0.000000000
0.000000000 0.000000000 1.197882354
ATOMIC_POSITIONS (crystal)
Pb 0.0000000000 0.0000000000 -0.0166356359
O 0.5000000000 0.5000000000 0.1549702780
Ti 0.5000000000 0.5000000000 0.5327649171
O 0.0000000000 0.5000000000 0.6381882204
O 0.5000000000 0.0000000000 0.6381882204
So unfortunately I cannot just use the CELL_PARAMETERS and ATOMIC_POSITIONS lines as patterns. The only ones appearing only once are the Begin final coordinates and End final coordinates so I have to extract text relative to these lines.
I have tried to marry the method to extract lines between two patterns from here with the one for skipping N lines after finding pattern from here. Unfortunately I can't make it work.
So my idea was:
for the first case: I was trying to find the Begin final coordinates pattern and skip 5 lines (including the one with the pattern), then print the 3 lines I am interested in and then skip the rest until the End final coordinates.
for the second case: find Begin final coordinates then skip the lines until ATOMIC_POSITIONS (skipping this one too), print the next 5 lines until the End final coordinates.
Can this be done?
Update:
I have just tried this:
awk '/Begin final coordinates/ {n=NR+9} n < NR < n+3'
but i get syntax error:
awk: cmd. line:1: /Begin final coordinates/ {n=NR+9} n<NR<n+3
awk: cmd. line:1: ^ syntax error
What am i doing wrong here?
Update2:
Hold the presses, I got it!
this solves the first case: awk '/Begin final coordinates/{n=NR+4;m=NR+8} (n<NR) && (NR<m)' file
this solves the second case: awk '/Begin final coordinates/{n=NR+9;m=NR+15} (n<NR) && (NR<m)' file
It's not very nice but it will do the job!

Hold the presses, I got it!
this solves the first case:
awk '/Begin final coordinates/{n=NR+4;m=NR+8} (n<NR) && (NR<m)' file
this solves the second case:
awk '/Begin final coordinates/{n=NR+9;m=NR+15} (n<NR) && (NR<m)' file

With this you only need to read the input once:
awk '/Begin final coordinates/{n1=NR+4;m1=NR+8; n2=NR+9;m2=NR+15}
(n1<NR) && (NR<m1){ print > "CELL_PARAMETERS.txt"; }
(n2<NR) && (NR<m2){ print > "ATOMIC_POSITIONS.txt"; }
' file
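If the number of atoms or the blank lines inside the final block can vary, fixed NR offsets become fragile. Below is a minimal sketch of a header-driven alternative, assuming the section headers start at the beginning of a line as in the sample. The variable names (infinal, section) and the trimmed sample data are illustrative only and not part of the original file:

```shell
# Illustrative sample: part of an intermediate block, then the final block.
sample=$(mktemp)
cat > "$sample" <<'EOF'
CELL_PARAMETERS (alat= 7.29434300)
1.000000000 0.000000000 0.000000000
Begin final coordinates
density = 7.37364 g/cm^3

CELL_PARAMETERS (alat= 7.29434300)
0.995319813 0.000000000 0.000000000
0.000000000 0.995319813 0.000000000
0.000000000 0.000000000 1.197882354

ATOMIC_POSITIONS (crystal)
Pb 0.0000000000 0.0000000000 -0.0166356359
O 0.5000000000 0.5000000000 0.1549702780
End final coordinates
EOF

tagged=$(awk '
/^Begin final coordinates/ { infinal = 1; next }       # enter the final block
/^End final coordinates/   { exit }                    # nothing of interest after it
!infinal                   { next }                    # ignore intermediate blocks
/^CELL_PARAMETERS/         { section = "CELL"; next }  # remember the current section
/^ATOMIC_POSITIONS/        { section = "ATOMS"; next }
section != "" && NF        { print section, $0 }       # tag data lines, skip blank ones
' "$sample")
printf '%s\n' "$tagged"
rm -f "$sample"
```

Each data line is prefixed with its section name, so the two parts can be separated afterwards with grep, or written to two files with print > file inside awk.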

Assuming that Begin final block only occurs after all of the other blocks:
$ awk '/^Begin final/{f=1} c&&c--; f && /^CELL/{c=3}' file
0.995319813 0.000000000 0.000000000
0.000000000 0.995319813 0.000000000
0.000000000 0.000000000 1.197882354
$ awk '/^Begin final/{f=1} c&&c--; f && /^ATOMIC/{c=5}' file
Pb 0.0000000000 0.0000000000 -0.0166356359
O 0.5000000000 0.5000000000 0.1549702780
Ti 0.5000000000 0.5000000000 0.5327649171
O 0.0000000000 0.5000000000 0.6381882204
O 0.5000000000 0.0000000000 0.6381882204
or if it could appear anywhere then change c&&c--; to c{print; if (!--c) exit}.
See https://stackoverflow.com/a/17914105/1745001 for related idioms.
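To see how the flag and counter interact, here is a small self-contained run. The sample file is an assumption: one made-up intermediate block followed by the final block from the question. f is set once the final block starts, and c is the number of lines still to print after a header:

```shell
# Illustrative sample: an intermediate block, then the final block.
sample=$(mktemp)
cat > "$sample" <<'EOF'
CELL_PARAMETERS (alat= 7.29434300)
1.000000000 0.000000000 0.000000000
0.000000000 1.000000000 0.000000000
0.000000000 0.000000000 1.200000000
ATOMIC_POSITIONS (crystal)
Pb 0.0000000000 0.0000000000 0.0000000000
Begin final coordinates
CELL_PARAMETERS (alat= 7.29434300)
0.995319813 0.000000000 0.000000000
0.000000000 0.995319813 0.000000000
0.000000000 0.000000000 1.197882354
ATOMIC_POSITIONS (crystal)
Pb 0.0000000000 0.0000000000 -0.0166356359
O 0.5000000000 0.5000000000 0.1549702780
Ti 0.5000000000 0.5000000000 0.5327649171
O 0.0000000000 0.5000000000 0.6381882204
O 0.5000000000 0.0000000000 0.6381882204
End final coordinates
EOF

# "c&&c--" prints the line while c is non-zero, decrementing it each time.
# The header rule comes last, so the header line itself is never printed.
cell=$(awk '/^Begin final/{f=1} c&&c--; f && /^CELL/{c=3}' "$sample")
atoms=$(awk '/^Begin final/{f=1} c&&c--; f && /^ATOMIC/{c=5}' "$sample")
printf '%s\n\n%s\n' "$cell" "$atoms"
rm -f "$sample"
```

The intermediate block is ignored because f is still 0 when its headers are read.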

Related

Extract two different types of values from a file and print it to an output file

I have a file where the data looks like:
sp_0005_SySynthetic ConstructTumor protein p53 N-terminal transcription-activation domain
A=9 C=2 D=3 E=4 F=2 G=15 I=3 K=3 L=9 M=3 N=5 P=2 Q=11 R=8 S=12 T=6 V=8 W=1 Y=5
Amino acid alphabet = 19
Sequence length = 115
sp_0017_CaCamelidSorghum bicolor multidrug and toxic compound extrusion sbmate
A=10 C=2 D=4 E=4 F=2 G=15 H=1 I=2 K=4 L=7 M=2 N=5 P=3 Q=6 R=4 S=18 T=7 V=10 W=5 Y=10
Amino acid alphabet = 20
Sequence length = 126
sp_0021_LgLlamabotulinum neurotoxin BoNT serotype F
A=14 C=2 D=4 E=5 F=4 G=15 I=2 K=3 L=6 M=2 N=6 P=4 Q=7 R=8 S=13 T=10 V=8 W=3 Y=10
Amino acid alphabet = 19
Sequence length = 131
I want to extract the values of 'Amino acid alphabet' and 'Sequence length' into an output file, and it should look like:
19 115
20 126
19 131
As I am new to bash, all I could try so far is:
grep -i "Amino acid alphabet = $i" test.txt >>out.txt
But, I don't want the word "Amino acid alphabet" in the output. I only want the values of "Amino acid alphabet" and "Sequence length" as two columns.
Can I get any help how to do that? Thanks in advance.
$ awk -v RS= '{print $(NF-4), $NF}' file
19 115
20 126
19 131
Assuming both fields exist for all your records:
awk '/^Amino acid alphabet/{printf "%s%s", $NF, FS} /^Sequence length/{print $NF}' file
19 115
20 126
19 131
You may also want to read the introduction to awk in the awk wiki.
This code: grep -i "Amino acid alphabet = $i" test.txt >>out.txt includes the shell expansion of $i. If you have not given a value to i then the search pattern resolves to Amino acid alphabet = , and thus will find each line that contains that. The $i would change the search pattern if $i had a value.
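A quick way to check this is to count matching lines with and without a value in i. This is only a sketch: the sample file below is a trimmed, made-up version of the data, and an explicitly empty i stands in for a variable that was never assigned:

```shell
# Illustrative sample containing only the two kinds of lines of interest.
sample=$(mktemp)
cat > "$sample" <<'EOF'
Amino acid alphabet = 19
Sequence length = 115
Amino acid alphabet = 20
Sequence length = 126
Amino acid alphabet = 19
Sequence length = 131
EOF

i=""    # empty: expands the same way as a variable that was never assigned
all=$(grep -ci "Amino acid alphabet = $i" "$sample")      # pattern ends in "= "

i=20
only20=$(grep -ci "Amino acid alphabet = $i" "$sample")   # pattern ends in "= 20"

echo "$all $only20"
rm -f "$sample"
```

With an empty i every "Amino acid alphabet" line matches; with i=20 only the one line carrying that value does.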
There are many ways to get what you want with bash. One is to use grep with PCRE (Perl-style) regex enabled:
grep -Po "(?<=Amino acid alphabet = )\d+" test.txt >> out.txt
#yields:
19
20
19
(?<=string) tells grep that for the rest to match, it must have been preceded by string, but string is not a part of the Match. -Po are the options to enable PCRE (Perl Style) and to only print the match, rather than the whole line in which there was a match.
Note that the output redirect >> appends to the file if it already contains lines, whereas > overwrites an existing file (without asking for confirmation!).
sed can do this too.
sed -En '/^Amino acid alphabet =/h; /^Sequence length =/{ H; x; s/[^0-9]+/ /g; s/^ //; p; }' infile > outfile
/^Amino acid alphabet =/h stores the first line in the save buffer.
/^Sequence length =/{ triggers all the steps inside the curlies.
H adds the current line to the save buffer.
x swaps the save buffer back to the workspace.
s/[^0-9]+/ /g; changes every sequence of non-digits to a single space.
This includes the newline.
s/^ //; removes the leading space.
p prints the output line for this data set.

Odd `gawk` filtering of very small floating point number

gawk filters out a very small positive number differently depending on the threshold used, even though every threshold should retain the entry.
Example input file, tmp:
A 3.92e-373
B 5e-300
C 5e-20
D 5e-6
E 5e-3
Output:
% gawk '$2 < 5e-4' tmp
B 5e-300
C 5e-20
D 5e-6
% gawk '$2 < 5e-8' tmp
A 3.92e-373
B 5e-300
C 5e-20
Note that gawk '$2 < 5e-4' should also retain entry A, since 3.92e-373 < 5e-4; the entry is retained with gawk '$2 < 5e-8'.
Clearly this is an issue with the limits of floating point, but I find it odd that the result is not consistent for both thresholds. Shouldn't gawk simply limit 3.92e-373 to 0 and thus print this line under all circumstances?
I wouldn't assume that gawk can figure out what's a number vs a string given your input and hard-coded values. Make sure they're treated as numbers by using strtonum() on them:
$ gawk 'strtonum($2) < strtonum("5e-4")' file
A 3.92e-373
B 5e-300
C 5e-20
D 5e-6
$ gawk 'strtonum($2) < strtonum("5e-8")' file
A 3.92e-373
B 5e-300
C 5e-20
You can see what types gawk thinks it's dealing with by calling typeof() on each:
$ gawk '{print typeof($2), $2, typeof(5e-4), 5e-4, strtonum($2), strtonum("5e-4")}' file | column -t
string 3.92e-373 number 0.0005 0 0.0005
strnum 5e-300 number 0.0005 5e-300 0.0005
strnum 5e-20 number 0.0005 5e-20 0.0005
strnum 5e-6 number 0.0005 5e-06 0.0005
strnum 5e-3 number 0.0005 0.005 0.0005
So it looks like the strtonum("5e-4") is redundant but IMHO it improves clarity so I'd keep it.
Notice that gawk doesn't automatically recognize 3.92e-373 as a number and so the comparison for that input would be string vs number and that's done as a string comparison (see the table at https://www.gnu.org/software/gawk/manual/gawk.html#Typing-and-Comparison).
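strtonum() is a gawk extension. A minimal sketch of the same idea for other awks is to add 0 to the field, which forces a numeric comparison. This assumes the awk in use stores numbers as doubles, so that 3.92e-373 underflows to 0 (or to something similarly tiny):

```shell
# Same input as in the question.
sample=$(mktemp)
printf '%s\n' 'A 3.92e-373' 'B 5e-300' 'C 5e-20' 'D 5e-6' 'E 5e-3' > "$sample"

# "$2 + 0" converts the field to a number before the comparison;
# the labels of the matching rows are concatenated for a compact result.
below_4=$(awk '$2 + 0 < 5e-4 { s = s $1 } END { print s }' "$sample")
below_8=$(awk '$2 + 0 < 5e-8 { s = s $1 } END { print s }' "$sample")

echo "$below_4 $below_8"
rm -f "$sample"
```

Row A is now kept for both thresholds, because the comparison is no longer done as a string comparison.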

incorrect count of unique text in awk

I am getting the wrong counts using the awk below. The unique text in $5 before the - is supposed to be counted.
input
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 1 15
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 2 16
chr1 955543 955763 chr1:955543-955763 AGRN-6|gc=75 3 16
chr1 1267394 1268196 chr1:1267394-1268196 TAS1R3-46|gc=68.2 553 567
chr1 1267394 1268196 chr1:1267394-1268196 TAS1R3-46|gc=68.2 554 569
chr1 9781175 9781316 chr1:9781175-9781316 PIK3CD-276|gc=63.1 46 203
chr1 9781175 9781316 chr1:9781175-9781316 PIK3CD-276|gc=63.1 47 206
chr1 9781175 9781316 chr1:9781175-9781316 PIK3CD-276|gc=63.1 48 206
chr1 9781175 9781316 chr1:9781175-9781316 PIK3CD-276|gc=63.1 49 207
current output
1
desired output (AGRN, TAS1R3 and PIK3CD are unique, so each is counted once)
3
awk
awk -F '[- ]' '!seen[$6]++ {n++} END {print n}' file
Try
awk -F '-| +' '!seen[$6]++ {n++} END {print n}' file
Your problem is that when ' ' (a space) is included as part of a regex to form FS (via -F) it loses its special default-value behavior, and only matches spaces individually as separators.
That is, the default behavior of recognizing runs of whitespace (any mix of spaces and tabs) as a single separator no longer applies.
Thus, [- ] won't do as the field separator, because it recognizes the empty strings between adjacent spaces as empty fields.
You can verify this by printing the field count - based on your intended parsing, you're expecting 9 fields:
$ awk -F '[- ]' '{ print NF }' file
17 # !! 8 extra fields - empty fields
$ awk -F '-| +' '{ print NF }' file
9 # OK, thanks to modified regex
You need alternation -| + to ensure that runs of spaces are treated as a single separator; if tabs should also be matched, use '-|[[:blank:]]+'
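The effect is easy to reproduce on a single line. In the sketch below the columns are separated by two spaces each; that spacing is an assumption, since the exact whitespace of the original file is not shown, so the field count for '[- ]' differs from the 17 reported above:

```shell
# One record with two spaces between columns (assumed spacing).
line='chr1  955543  955763  chr1:955543-955763  AGRN-6|gc=75  1  15'

# Every single space or hyphen is a separator: adjacent spaces yield empty fields.
nf_single=$(printf '%s\n' "$line" | awk -F '[- ]' '{ print NF }')

# A hyphen, or a whole run of spaces, is one separator.
nf_runs=$(printf '%s\n' "$line" | awk -F '-| +' '{ print NF }')
gene=$(printf '%s\n' "$line" | awk -F '-| +' '{ print $6 }')

echo "$nf_single $nf_runs $gene"   # → 15 9 AGRN
```

With the second separator, $6 is the gene name as intended.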
Including "-" in FS might be fine in some cases, but in general if the actual field separator is something else (e.g. whitespace, as seems to be the case here, or perhaps a tab), it would be far better to set FS according to the specification of the file format. In any case, it's easy to extract the subfield of interest. In the following, I'll assume the FS is whitespace.
awk '{split($5, a, "-"); if (!(count[a[1]]++)) n++ }
END {print n}' file
If you want the details:
awk '{split($5, a, "-"); count[a[1]]++}
END { for(i in count) {print i, count[i]}}'
Output of the second incantation:
AGRN 3
PIK3CD 4
TAS1R3 2

Use awk to get min-max column values

Given a file with data such as
2015-12-24 22:02 12 9.87 feet High Tide
2015-12-25 03:33 12 -0.38 feet Low Tide
2015-12-25 06:11 12 Full Moon
2015-12-25 10:16 12 11.01 feet High Tide
2015-12-25 16:09 12 -1.29 feet Low Tide
This awk command will return a min value in col 4:
awk 'min=="" || $4 < min {min=$4} END{ print min}' FS=" " 12--December.txt
How do I get it to exclude any line where $4 contains text? I imagine this needs regex but poring over the regex manuals I am lost as to how to do it.
You can use a regular expression comparison on the fourth field as
$4~/[0-9]+/
Test
$ awk '$4~/[0-9]+/ && $4 < min {min=$4} END{print min}' input
-1.29
Note: this is a minimised version of the code. Dropping the min=="" test works here only because the data contains negative values (an unset min compares as 0); keep that test if all values may be positive.
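For data that may contain only positive values, and to get the maximum as well, a slightly stricter variant can be sketched as follows. The anchored pattern and the variable names (v, n, min, max) are my own choices rather than part of the answer above; the pattern accepts only plain decimal numbers with an optional minus sign:

```shell
# Same rows as in the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
2015-12-24 22:02 12 9.87 feet High Tide
2015-12-25 03:33 12 -0.38 feet Low Tide
2015-12-25 06:11 12 Full Moon
2015-12-25 10:16 12 11.01 feet High Tide
2015-12-25 16:09 12 -1.29 feet Low Tide
EOF

minmax=$(awk '
$4 ~ /^-?[0-9]+(\.[0-9]+)?$/ {         # only rows whose 4th field is a number
    v = $4 + 0                          # force numeric value
    if (n == 0 || v < min) min = v      # first numeric row initialises min
    if (n == 0 || v > max) max = v      # ... and max
    n++
}
END { if (n) print min, max }           # print nothing if no numeric row was seen
' "$sample")

echo "$minmax"   # → -1.29 11.01
rm -f "$sample"
```

The "Full Moon" row is skipped because "Full" does not match the numeric pattern.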

awk print long hex overflow?

Recently I read some hex data 16 digits long, like 0x1000 0000 0000 0000, but something strange is printed when I print 0xffff ffff ffff ffff.
awk '{printf("0x\n", 0x1000000000000000)}' output `0x1000000000000000` ok.
But, instead with
16 f awk '{printf("0x\n", 0xffffffffffffffff)}', output 0
15 f awk '{printf("0x\n", 0xfffffffffffffff)}', output 1000000000000000 (14 0)
15 f awk '{printf("0x\n", 0xfffffffffffffff0)}', output 0
14 f awk '{printf("0x\n", 0xffffffffffffff)}', output 100000000000000 (13 0)
14 f awk '{printf("0x\n", 0xffffffffffffff0)}', output 100000000000000 (14 0)
14 f awk '{printf("0x\n", 0xffffffffffffff00)}', output 0
13 f awk '{printf("0x\n", 0xfffffffffffff)}', output fffffffffffff (13f)
13 f awk '{printf("0x\n", 0xfffffffffffff0)}', output fffffffffffff0 (13f)
13 f awk '{printf("0x\n", 0xfffffffffffff00)}', output fffffffffffff00 (13f)
13 f awk '{printf("0x\n", 0xfffffffffffff000)}', output fffffffffffff000 (13f)
so 13f is ok, how to print 16f?
You are missing the format string in the printf() call. Note that printf works differently than print. I assume you see some random erratic behavior of the awk interpreter.
Another typical awk problem is that it usually represents all numeric values (integers too) as double precision floating point numbers, so you lose precision and get strange artifacts when you get near 64 bits. This depends on the actual awk implementation.
You are probably seeing a mixture of these two problems. I admit the rounding to 0 is bizarre.
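The precision limit can be checked without any hex input. This is a minimal sketch, assuming the awk in use stores numbers as IEEE 754 doubles (the common case, though not guaranteed for every implementation or option set): integers are exact only up to 2^53, so adding 1 to 2^53 changes nothing.

```shell
# %.0f is used instead of %d to avoid implementation-specific integer clamping.
exact=$(awk 'BEGIN { printf "%.0f\n", 2^53 }')
plus1=$(awk 'BEGIN { printf "%.0f\n", 2^53 + 1 }')
same=$(awk 'BEGIN { print (2^53 + 1 == 2^53) }')   # 1 means the two values are equal

echo "$exact $plus1 $same"
```

A 16-digit hex value such as 0xffffffffffffffff needs 64 significant bits, well beyond the 53 that a double can hold exactly.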