I'm trying to take a 3 column input file and separate it based on a condition in column 3. I think it'll be easier to show you than explain:
Input File:
outputfile1.txt
26 NCC 1 # First Start
38 NME 2
44 NSC 1 # Start2
56 NME 2
62 NCC 1 # Start3
...
314 NCC 1 # Start17
326 NME 2
332 NSC 1 # Start18
344 NME 2
349 NME 2 # Final End
(The hashed comments aren't part of the file; I've added them to make things clearer.)
Column 3 is used to determine a new "START" entry
"START/END" values are from Column 1
"TITLE" should be all values from Column 2 between consecutive "START"s
Desired Output
outputfile2.txt
START=26 ; END=43 ; TITLE=NCC_NME
START=44 ; END=61 ; TITLE=NSC_NME
START=62 ; END=79 ; TITLE=NCC_...
...
START=314 ; END=331 ; TITLE=NCC_NME
START=332 ; END=349 ; TITLE=NSC_NME
Here is a crude script that 'almost' does this, but it makes 5 single-column temporary files in the process.
awk '{ print $1 }' outputfile1.txt | sed '$d' > tempfile1.txt
awk '{ print $1-1 }' outputfile1.txt | sed '$d' > tempfile2.txt
sed '$d' outputfile1.txt | awk 'NR{print $3-p}{p=$3}' > tempfile3.txt
awk ' { getline value < "tempfile1.txt" }
{ if (NR==1)
print value ;
else if( $1 != 1 )
print value }' tempfile3.txt > tempfile4.txt
awk ' { getline value < "tempfile2.txt" }
{ if (NR==1)
print value ;
else if ( $1 != 1 )
print value }' tempfile3.txt | sed '1d' > tempfile5.txt
awk 'END{print $1}' outputfile1.txt >> tempfile5.txt
awk ' { getline value < "tempfile5.txt" }
{print "START="$0 " ; END="value}' tempfile4.txt > outputfile2.txt
Contents of temp files
| temp1 temp2 temp3
NR=1 | 26 25 1
NR=2 | 38 37 1
NR=3 | 44 43 -1
NR=4 | 56 55 1
NR=5 | 62 61 -1
... | ... ... ...
NR=33 | 314 313 -1
NR=34 | 326 325 1
NR=35 | 332 331 -1
NR=36 | 344 343 1
----------------------------------
| temp4 temp5
NR=1 | 26 43
NR=2 | 44 61
NR=3 | 62 79
... | ... ...
NR=17 | 314 331
NR=18 | 332 349
Current output
outputfile2.txt
START=26 ; END=43
START=44 ; END=61
START=62 ; END=79
...
START=314 ; END=331
START=332 ; END=349
Try:
awk '
function print_range() {
printf "START=%s ; END=%s ; TITLE=%s\n", start, end-1, title
}
{
end=$1
}
# if column 3 is equal to 1, then there is a new start
$3==1 {
if(title) print_range()
start=$1
title=$2
next
}
# if the label in field 2 is not part of the title then add it
title!~"(^|_)" $2 "(_|$)" {
title=title"_"$2
}
END {
end++
print_range()
}
' file
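With the sample input this reproduces the desired output. A sketch of the invocation, assuming the script is saved as ranges.awk (a hypothetical name) instead of being pasted inline:

$ awk -f ranges.awk outputfile1.txt
START=26 ; END=43 ; TITLE=NCC_NME
START=44 ; END=61 ; TITLE=NSC_NME
...
START=332 ; END=349 ; TITLE=NSC_NME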
You can do everything in one go using:
awk '{
if(NR==1){
# if we are the first record we initialize our variables
PREVIOUS_ONE=$1
TITLE=$2
PREVIOUS_THIRD=$3
} else {
    # as long as the new third column has not dropped we stay in the same block
    if(PREVIOUS_THIRD <= $3) {
      # append the label only if it is not already part of the title
      if(index("_" TITLE "_", "_" $2 "_") == 0) TITLE=TITLE"_"$2
      PREVIOUS_THIRD=$3
    } else {
      # this means the third column was smaller
      # we print out the data and reinitialize our variables
      print "START="PREVIOUS_ONE" ; END="$1-1" ; TITLE="TITLE
      PREVIOUS_ONE=$1
      TITLE=$2
      PREVIOUS_THIRD=$3
    }
  }
  # remember the last first-column value so the END block can close the final range
  LAST_ONE=$1
}
END {
  # the main block only prints when a new start appears, so flush the final range here
  print "START="PREVIOUS_ONE" ; END="LAST_ONE" ; TITLE="TITLE
}' outputfile1.txt
I have a file with 5 fields of content. I am evaluating 4 lines at a time in the file. So, records 1-4 are evaluated as a set. Records 5-8 are another set. Within each set, I want to extract the time from field 5 when field 4 has the max value. If there are duplicate values in field 4, then evaluate the maximum value in field 2 and use the time in field 5 associated with the max value in field 2.
For example, in the first 4 records, there is a duplicate max value in field 4 (value of 53). If that is true, I need to look at field 2 and find the maximum value, then print the time in field 5 associated with that maximum value in field 2.
The Data Set is:
00 31444 8.7 24 00:04:32
00 44574 12.4 25 00:01:41
00 74984 20.8 53 00:02:22
00 84465 23.5 53 00:12:33
01 34748 9.7 38 01:59:28
01 44471 12.4 37 01:55:29
01 74280 20.6 58 01:10:24
01 80673 22.4 53 01:55:49
The desired Output for records 1 through 4 is 00:12:33
The desired output for records 5 through 8 is 01:10:24
Here is my answer:
Evaluate Records 1 through 4
awk 'NR==1,NR==4 {if(max <= $4) {max = $4; time = $5} else if(max == $4) {max = $2; time = $5};next}END {print time}' test.txt
Output is: 00:12:33
Evaluate Records 5 through 8
awk 'NR==5,NR==8 {if(max <= $4) {max = $4; time = $5} else if(max == $4) {max = $2; time = $5};next}END {print time}' test.txt
Output is 01:10:24
Any suggestions on how to evaluate the record ranges more efficiently without having to write an awk statement for each set of records?
Thanks
Based on your sample input, the fact that there are 4 lines per key (first field) seems to be irrelevant; what you really want is to produce output for each key. So consider sorting the input by your desired comparison fields (field 4, then field 2) and printing the first desired output value (field 5) seen for each block per key (field 1):
$ sort -n -k1,1 -k4,4r -k2,2r file | awk '!seen[$1]++{print $5}'
00:12:33
01:10:24
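Here sort orders the lines by key ascending (-k1,1), then field 4 descending (-k4,4r), then field 2 descending (-k2,2r), so the best line for each key floats to the top of its block; !seen[$1]++ is the usual awk idiom that is true only the first time a given $1 value appears, so only that best line's $5 gets printed.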
This awk code
NR % 4 == 1 {max4 = $4; max2 = $2}
$4 > max4 || $4 == max4 && $2 >= max2 {max4 = $4; max2 = $2; val5 = $5}
NR % 4 == 0 {printf "lines %d-%d: %s\n", (NR - 3), NR, val5}
outputs
lines 1-4: 00:12:33
lines 5-8: 01:10:24
Looking at the data, you might want to group sets by $1 instead of hardcoding 4 lines:
awk '
function emit(nr) {printf "lines %d-%d: %s\n", setStart, nr, val5}
$1 != setId {
    if (NR > 1) emit(NR - 1)
    setId = $1
    setStart = NR   # remember where this set begins instead of assuming 4-line sets
    max4 = $4
    max2 = $2
}
$4 > max4 || $4 == max4 && $2 >= max2 {max4 = $4; max2 = $2; val5 = $5}
END {emit(NR)}
' data
An awk-based solution that uses a synthetic ASCII string-comparison key combining $4 and $5, while avoiding any %-modulo operations:
mawk '
BEGIN { CONVFMT = "%020.f" (__=___=____=_____="")
_+=_+=++_ } { ____= __!=(__=__==$((_____=(+$_ "")"(" $NF)^!_) \
? __ : $!!_) || ____<_____ ? _____ : ____
} _==++___ {
printf(" group %-*s [%*.f, %-*.f] :: %s\n", --_*--_, "\"" (__) "\"", _+_,
NR-++_, ++_, NR, substr(____, index(____, "(")+_^(_____=____=___=""))) }'
group "00" [ 1, 4 ] :: 00:12:33
group "01" [ 5, 8 ] :: 01:10:24
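For anyone who finds the condensed form hard to follow, here is a readable sketch of the same synthetic-key idea (my restatement, not the original code): zero-pad $4 so plain string comparison orders it numerically, and append "(" plus $5 so the winning time can be recovered afterwards. Unlike the original, this sketch reintroduces %-modulo grouping, and note that ties on $4 are broken by the larger time string rather than by $2, which happens to agree with the sample data:

awk '
FNR % 4 == 1 { best = "" }                # new 4-line group: reset the synthetic key
{
    key = sprintf("%020.f(%s", $4, $5)    # zero-padded $4, then "(", then the time
    if (key > best) best = key            # ASCII comparison now ranks by $4, then time
}
FNR % 4 == 0 { print substr(best, index(best, "(") + 1) }
' file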
This new question is a follow-up to a recent question: Compare two numerical ranges in two distinct files with awk. The proposed solution worked perfectly but was not practical for downstream analysis (a misconception in my question, not a problem with the solution).
I have a file1 with 3 columns. Columns 2 and 3 define a numerical range. Data are sorted from the smaller to the bigger value in column 2. Numerical ranges never overlap.
file1
S 24 96
S 126 352
S 385 465
S 548 600
S 621 707
S 724 736
I have a second file2 (test) structured similarly.
file2
S 27 93
S 123 348
S 542 584
S 726 740
S 1014 2540
S 12652 12987
Desired output: Print ALL lines from file1 and, next to them, the lines of file2 whose numerical ranges overlap (even partially) the ones of file1. If no range from file2 overlaps a range of file1, print zero next to the range of file1.
S 24 96 S 27 93 * 27-93 overlaps with 24-96
S 126 352 S 123 348 * 123-348 overlaps with 126-352
S 385 465 0 * nothing in file2 overlaps with this range
S 548 600 S 542 584 * 542-584 overlaps with 548-600
S 621 707 0 * nothing in file2 overlaps with this range
S 724 736 S 726 740 * 726-740 overlaps with 724-736
Based on the answer to the previous question from @EdMorton, I modified the print command of the tst.awk script to add these new features. In addition, I also changed the order file1/file2 to file2/file1 so that all the lines from file1 are printed (whether or not there is a match in the second file).
'NR == FNR {
begs2ends[$2] = $3
next
}
{
for (beg in begs2ends) {
end = begs2ends[beg] + 0
beg += 0
if ( ( ($2 >= beg) && ($2 <= end) ) ||
( ($3 >= beg) && ($3 <= end) ) ||
( ($2 <= beg) && ($3 >= end) ) ) {
print $0,"\t",$1,"\t",beg,"\t",end
else
print $0,"\t","0"
next
}
}
}
Note: $1 is identical in file1 and file2. This is why I used print ... $1 to make it appear. I have no idea how to print it from file2 and not file1 (if I understand correctly, this $1 refers to file1).
And I launch the analysis with awk -f tst.awk file2 file1
The script is not accepting the else clause and I don't understand why. I assume it is linked to the looping, but I tried several changes without any success.
Thanks if you can help me with this.
Assumptions:
a range from file1 can only overlap with one range from file2
The current code is almost correct; it just needs some work on the placement of the braces (using consistent indentation helps):
awk '
BEGIN { OFS="\t" } # output field delimiter is "\t"
NR == FNR { begs2ends[$2] = $3; next }
{
# $1=$1 # uncomment to have current line ($0) reformatted with "\t" delimiters during print
for (beg in begs2ends) {
end = begs2ends[beg] + 0
beg += 0
if ( ( ($2 >= beg) && ($2 <= end) ) ||
( ($3 >= beg) && ($3 <= end) ) ||
( ($2 <= beg) && ($3 >= end) ) ) {
print $0,$1,beg,end # spacing within $0 unchanged, 3 new fields prefaced with "\t"
next
}
}
# if we get this far it is because we have exhausted the "for" loop
# (ie, found no overlaps) so print current line + "0"
print $0,"0" # spacing within $0 unchanged, 1 new field prefaced with "\t"
}
' file2 file1
This generates:
S 24 96 S 27 93
S 126 352 S 123 348
S 385 465 0
S 548 600 S 542 584
S 621 707 0
S 724 736 S 726 740
With the $1=$1 line uncommented the output becomes:
S 24 96 S 27 93
S 126 352 S 123 348
S 385 465 0
S 548 600 S 542 584
S 621 707 0
S 724 736 S 726 740
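As an aside, the three ORed comparisons can be collapsed into a single test: two ranges [b1,e1] and [b2,e2] overlap if and only if b1 <= e2 && b2 <= e1. A minimal sketch under the same assumptions (file2 read first, at most one overlap reported per range):

awk '
BEGIN { OFS="\t" }
NR == FNR { begs2ends[$2] = $3; next }
{
    for (beg in begs2ends) {
        end = begs2ends[beg] + 0
        beg += 0
        if ($2 <= end && beg <= $3) {    # single test replaces the three ORed cases
            print $0, $1, beg, end
            next
        }
    }
    print $0, "0"
}' file2 file1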
A slight variation on @markp-fuso's answer.
It works with GNU awk; save it as overlaps.awk:
BEGIN { PROCINFO["sorted_in"] = "@ind_num_asc" }
function in_range(val, min, max) { return min <= val && val <= max }
NR == FNR {
line[FNR] = $0
lo[FNR] = $2
hi[FNR] = $3
next
}
{
overlap = "0"
for (i in line) {
        # the third test catches a file2 range that fully contains the file1 range
        if (in_range(lo[i], $2, $3) || in_range(hi[i], $2, $3) || in_range($2, lo[i], hi[i])) {
overlap = line[i]
delete line[i]
break
}
}
print $0, overlap
}
Then
gawk -f overlaps.awk file2 file1 | column -t
outputs
S 24 96 S 27 93
S 126 352 S 123 348
S 385 465 0
S 548 600 S 542 584
S 621 707 0
S 724 736 S 726 740
$ cat tst.awk
BEGIN { OFS="\t" }
NR == FNR {
ranges[++numRanges] = $0
next
}
{
overlapped = 0
for ( i=1; i<=numRanges; i++ ) {
range = ranges[i]
split(range,vals)
beg = vals[2]+0
end = vals[3]+0
if ( ( ($2 >= beg) && ($2 <= end) ) ||
( ($3 >= beg) && ($3 <= end) ) ||
( ($2 <= beg) && ($3 >= end) ) ) {
overlapped = 1
break
}
}
if ( overlapped ) {
print $0, range, sprintf("* %d-%d overlaps with %d-%d", beg, end, $2, $3)
}
else {
print $0, 0, sprintf("* nothing in %s overlaps with this range", ARGV[1])
}
}
$ awk -f tst.awk file2 file1 | column -s$'\t' -t
S 24 96 S 27 93 * 27-93 overlaps with 24-96
S 126 352 S 123 348 * 123-348 overlaps with 126-352
S 385 465 0 * nothing in file2 overlaps with this range
S 548 600 S 542 584 * 542-584 overlaps with 548-600
S 621 707 0 * nothing in file2 overlaps with this range
S 724 736 S 726 740 * 726-740 overlaps with 724-736
I need to modify the below code to work on more than one column.
Counting the number of unique values based on two columns in bash
awk ' ##Starting awk program from here.
BEGIN{
FS=OFS="\t"
}
!found[$0]++{ ##Checking condition if 1st and 2nd column is NOT present in found array then do following.
val[$1]++ ##Creating val with 1st column index and keep increasing its value here.
}
END{ ##Starting END block of this program from here.
for(i in val){ ##Traversing through array val here.
print i,val[i] ##Printing i and value of val with index i here.
}
}
' Input_file ##Mentioning Input_file name here.
Here is the table; I need to count how many of each value (double) occurs across all the DIS columns:
patient sex DISa DISb DISc DISd DISe DISf DISg DISh DISi
patient1 male 550.1 550.5 594.1 594.3 594.8 591 1019 960.1 550.1
patient2 female 041 208 250.2 276.14 426.32 550.1 550.5 558 041
patient3 female NA NA NA NA NA NA NA 041 NA
The output I need is:
550.1 3
550.5 2
594.1 1
594.3 1
594.8 1
591 1
1019 1
960.1 1
041 3
208 1
250.2 1
276.14 1
426.32 1
558 1
Consider this awk:
awk -v OFS='\t' 'NR > 1 {for (i=3; i<=NF; ++i) if ($i+0 == $i) ++fq[$i]} END {for (i in fq) print i, fq[i]}' file
276.14 1
960.1 1
594.3 1
426.32 1
208 1
041 3
594.8 1
550.1 3
591 1
1019 1
558 1
550.5 2
250.2 1
594.1 1
A more readable form:
awk -v OFS='\t' '
NR > 1 {
for (i=3; i<=NF; ++i)
if ($i+0 == $i)
++fq[$i]
}
END {
for (i in fq)
print i, fq[i]
}' file
$i+0 == $i is a check for making sure column value is numeric.
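You can see which tokens pass that test with a quick one-liner (hypothetical data, shown only to illustrate the comparison): strings like NA and DISa fail it, while numeric-looking fields such as 041 survive because awk compares them numerically.

$ echo '041 NA 550.1 DISa' | awk '{for (i=1; i<=NF; i++) if ($i+0 == $i) print $i}'
041
550.1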
If the ordering must be preserved, then you need an additional array b[] to keep the order in which each number is encountered, e.g.
awk '
BEGIN { OFS = "\t" }
FNR > 1 {
for (i=3;i<=NF;i++)
if ($i~/^[0-9]/) {
if (!($i in a))
b[++n] = $i;
a[$i]++
}
}
END {
for (i=1;i<=n;i++)
print b[i], a[b[i]]
}' file
Example Use/Output
$ awk '
> BEGIN { OFS = "\t" }
> FNR > 1 {
> for (i=3;i<=NF;i++)
> if ($i~/^[0-9]/) {
> if (!($i in a))
> b[++n] = $i;
> a[$i]++
> }
> }
> END {
> for (i=1;i<=n;i++)
> print b[i], a[b[i]]
> }' patients
550.1 3
550.5 2
594.1 1
594.3 1
594.8 1
591 1
1019 1
960.1 1
041 3
208 1
250.2 1
276.14 1
426.32 1
558 1
Let me know if you have further questions.
Taking the complete solutions from the above 2 answers (@anubhava and @David), with all respect, and just adding a little tweak (applying a check for an integer value, as per the OP's shown samples), I am adding 2 solutions here. Written and tested with the shown samples only.
1st solution: If order doesn't matter in the output, try:
awk -v OFS='\t' '
NR > 1 {
for (i=3; i<=NF; ++i)
if (int($i))
++fq[$i]
}
END {
for (i in fq)
print i, fq[i]
}' Input_file
2nd solution: If order matters, then based on @David's answer, try:
awk '
BEGIN { OFS = "\t" }
FNR > 1 {
for (i=3;i<=NF;i++)
if (int($i)) {
if (!($i in a))
b[++n] = $i;
a[$i]++
}
}
END {
for (i=1;i<=n;i++)
print b[i], a[b[i]]
}' Input_file
Using GNU awk for multi-char RS:
$ awk -v RS='[[:space:]]+' '$0+0 == $0' file | sort | uniq -c
3 041
1 1019
1 208
1 250.2
1 276.14
1 426.32
3 550.1
2 550.5
1 558
1 591
1 594.1
1 594.3
1 594.8
1 960.1
If the order of fields really matters just pipe the above to awk '{print $2, $1}'.
I'm trying to rearrange specific strings into their respective columns.
For example:
126N (will be sorted into the "Normal" column)
Value 1 (the integer will be concatenated with 126)
Resulting in:
N=Normal
126 # 1
Here is the input
(N=Normal, W=Weak)
Value 1
126N,
Value 3
18N,
Value 4
559N, 562N, 564N,
Value 6
553W, 565A, 553N,
Value 5
490W,
Value 9
564N,
And the output should be
W=Weak
490 # 5
553 # 6
A=Absolute
565 # 6
N=Normal
126 # 1
18 # 3
559 # 4
562 # 4
564 # 4
553 # 6
564 # 9
Let me know your thoughts on this.
I've tried this script, but I'm still figuring out how to concatenate the value:
cat input.txt | sed '/^\s*$/d' | awk 'BEGIN{RS=","};match($0,/N/){print $3"c"$2}' | sed ':a;N;$!ba;s/\n/;/g' | sed 's/W//g;s/N//g;s/S//g'
And some of the values are missing.
This should give you what you want using GNU awk.
It will work with any number of letters, not just A, N, W.
awk -F, '
!/Value/ {
for (i=1;i<NF;i++) {
hd=substr($i,length($i),1);
arr[hd][++cnt[hd]]=($i+0" # "f)}
}
{split($0,b," ");f=b[2];}
END {
for (i in arr) { print "\n"i"\n---";
for (j in arr[i]) {
print arr[i][j]}}
}' file
A
---
565 # 6
N
---
562 # 4
564 # 4
553 # 6
564 # 9
126 # 1
18 # 3
559 # 4
W
---
553 # 6
490 # 5
Another alternative in awk would be:
awk -F',| ' '
$1 == "Value" {value = $2; next}
{ for (i=1; i<=NF; i++) {
if ($i~"N$")
N[substr($i, 1, length($i) - 1)] = value
if ($i~"W$")
W[substr($i, 1, length($i) - 1)] = value
}
}
END {
print "W=Weak"
for (i in W)
print i, "#", W[i]
print "\nN=Normal"
for (i in N)
print i, "#", N[i]
}
' file
(Note: this relies on knowing the wanted headers are W=Weak and N=Normal. It would take a few additional expressions if the headers are subject to change.)
Output
$ awk -F',| ' '
> $1 == "Value" {value = $2; next}
> { for (i=1; i<=NF; i++) {
> if ($i~"N$")
> N[substr($i, 1, length($i) - 1)] = value
> if ($i~"W$")
> W[substr($i, 1, length($i) - 1)] = value
> }
> }
> END {
> print "W=Weak"
> for (i in W)
> print i, "#", W[i]
> print "\nN=Normal"
> for (i in N)
> print i, "#", N[i]
> }
> ' file
W=Weak
490 # 5
553 # 6

N=Normal
18 # 3
126 # 1
553 # 6
559 # 4
562 # 4
564 # 9
$ cat tst.awk
NR%2 { val = $NF; next }
{
for (i=1; i<=NF; i++) {
num = $i+0
abbr = $i
gsub(/[^[:alpha:]]/,"",abbr)
list[abbr] = list[abbr] num " # " val ORS
}
}
END {
n = split("Weak Absolute Normal",types)
for (i=1; i<=n; i++) {
name = types[i]
abbr = substr(name,1,1)
print abbr "=" name ORS list[abbr]
}
}
$ awk -f tst.awk file
W=Weak
553 # 6
490 # 5
A=Absolute
565 # 6
N=Normal
126 # 1
18 # 3
559 # 4
562 # 4
564 # 4
553 # 6
564 # 9
My objective is to calculate the average of the second column from multiple measurement series (the average of the first row of K blocks, the average of the second row of K blocks, etc.). All data is contained in one file and is separated into blocks by a blank line. The file has the following structure:
#
#
33 -0.23
34.5 -0.32
36 -0.4
.
.
.

#
#
33 -0.25
34.5 -0.31
36 -0.38
.
.
.
$ cat avg.awk
BEGIN { FS=" " }
/^#/ { next }
/^\s*$/ { print col1/nr " " col2/nr; col1=col2=nr=0; next }
{ col1 += $1; col2 += $2; nr++ }
END {print col1/nr " " col2/nr }
with input:
$ cat test.txt
#
#
33 -0.23
34.5 -0.32
36 -0.4

#
#
33 -0.25
34.5 -0.31
36 -0.38
gives this result:
$ awk -f avg.awk test.txt
34.5 -0.316667
34.5 -0.313333
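Note that avg.awk averages within each block. If the goal really is the average of row i across the K blocks, as described at the top, here is a sketch along the same lines (my assumption: every block lists its rows in the same order):

$ cat avg_rows.awk
/^#/ || /^[[:space:]]*$/ { row = 0; next }  # header or blank line: restart row counter
{
    ++row
    sum1[row] += $1; sum2[row] += $2        # accumulate sums per row position
    cnt[row]++
    if (row > maxrow) maxrow = row
}
END {
    for (r = 1; r <= maxrow; r++)
        print sum1[r]/cnt[r], sum2[r]/cnt[r]
}

$ awk -f avg_rows.awk test.txt
33 -0.24
34.5 -0.315
36 -0.39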