sort according to a column value and take first n lines between the same headers in a file - awk

I have an input like this:
header score1 score2
item 1.3 100
item 2.3 170
item 4.0 35
header score1 score2
item 2.9 45
item 1.7 55
header score1 score2
item 0.5 60
header score1 score2
header score1 score2
item 1.4 75
item 2.5 120
item 3.7 200
header score1 score2
I want to consider the lines between two 'header' lines individually: sort those lines according to the value in the second column in descending order and take the first two lines with the highest values, keeping the header line at the top of each block. It is known that the list starts with "header score1 score2".
So the desired output is this:
header score1 score2
item 4.0 35
item 2.3 170
header score1 score2
item 2.9 45
item 1.7 55
header score1 score2
item 0.5 60
header score1 score2
header score1 score2
item 3.7 200
item 2.5 120
header score1 score2
I am a relatively new awk user, so my best methodology for now is to describe the steps in words and then research and apply the code step by step. Building a complete code block is something I cannot do yet.
So first I have to consider separately every interval between the lines starting with "header".
1.
awk '/header/ {p=1;print;next} /^header/ && p {p=0;print} p' input.txt
This outputs the same file, as expected. What I understand from this is that when there is a 'header' line it prints it, and it keeps printing the lines below until the next 'header'.
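Written out with comments, that is:
awk '
/header/ { p=1; print; next }    # a header line: set the flag, print it, skip the remaining rules
/^header/ && p { p=0; print }    # never actually reached, because the rule above already consumes every header line
p                                # any other line while the flag is set: print it
' input.txt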
The sorting and taking of the first 2 lines I am doing with this code:
2.
sort -k2 -nr | head -2 # this should be without the header
I am guessing that I have to insert the second command into the first one somehow, so I would appreciate any help with this.
Thank you

awk '
/header/ {
    # Header line: close the pipe so the previous block is sorted, trimmed
    # to its top 2 lines and flushed before this header is printed.
    close("sort -r|head -2")
    # Copy the header line to standard output.
    print
    next
}
{
    # All other lines: stream them into the pipe for the current block.
    # Note: "sort -r" is a plain reverse sort of the whole line; it gives the
    # desired order for this data, but "sort -k2,2nr" would be more robust
    # for arbitrary numeric values in column 2.
    print | "sort -r|head -2"
}' input_file | column -t
header score1 score2
item 4.0 35
item 2.3 170
header score1 score2
item 2.9 45
item 1.7 55
header score1 score2
item 0.5 60
header score1 score2
header score1 score2
item 3.7 200
item 2.5 120
header score1 score2

Using the DSU (Decorate/Sort/Undecorate) idiom with any awk+sort+cut:
$ cat tst.sh
#!/usr/bin/env bash
awk -v OFS='\t' '
$1 == "header" {
blockNr++
}
{ print blockNr, ($1 == "header" ? 0 : 1), $0 }
' "$@" |
sort -k1,1n -k2,2n -k4,4rn |
cut -f3- |
awk '
$1 == "header" {
cnt = 0
}
cnt++ < 3
'
$ ./tst.sh file
header score1 score2
item 4.0 35
item 2.3 170
header score1 score2
item 2.9 45
item 1.7 55
header score1 score2
item 0.5 60
header score1 score2
header score1 score2
item 3.7 200
item 2.5 120
header score1 score2
See How to sort data based on the value of a column for part (multiple lines) of a file? for more info on how DSU works.
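To see what the decorate stage produces, you can run just the first awk command on the sample input; the intermediate tab-separated stream (block number, header flag, original line) should look like this for the first two blocks:
$ awk -v OFS='\t' '$1 == "header"{blockNr++} {print blockNr, ($1 == "header" ? 0 : 1), $0}' file | head -7
1       0       header score1 score2
1       1       item 1.3 100
1       1       item 2.3 170
1       1       item 4.0 35
2       0       header score1 score2
2       1       item 2.9 45
2       1       item 1.7 55
sort -k1,1n -k2,2n -k4,4rn then keeps each block together (key 1), keeps its header line first (key 2) and orders the remaining lines by score1 descending (key 4, counting whitespace-separated fields of the decorated line), and cut -f3- strips the two decoration fields again.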

Another awk, with some supporting commands and no error handling:
$ awk 'BEGIN {cmd="sort -k2nr | head -2"}
/header/ {close(cmd); print; next}
{print | cmd}' file
header score1 score2
item 4.0 35
item 2.3 170
header score1 score2
item 2.9 45
item 1.7 55
header score1 score2
item 0.5 60
header score1 score2
header score1 score2
item 3.7 200
item 2.5 120
header score1 score2
This uses awk to partition the data into sections at the headers, delegating the sorting and "take 2" functions to other commands.

took me long enough. The concept is to create a synthetic array index key for sorting purposes that:
incorporates the rank ordering from multiple columns,
can deal with floating point, and mix and match it with integers,
maintains numeric rank ordering in ASCII,
is wide enough that it will only start to overflow at very large or small numbers, in the absence of a big-integer library, and
minimizes the number of compare ops needed:
* a single unified sort-key compare
instead of
* having to go into each sub-key field when dealing with tie-breakers
CODE
BEGIN {
    OFS = "\t"
    PROCINFO["sorted_in"] = "@ind_str_desc"
    __ = "[+-:]+"
    gsub("^|$", "[[:blank:]]", __)
    ____ = (((_*=_+=_^=_<_)^_)^--_)
    _ = _<_
}
# Rule(s)
($_)!~__ {
    print
    while (($_) !~ __) {
        split("", ___)
        $_ = ""
        getline
    }
}
{
    do {
        ___[sprintf("x%.12X%.12X",-(____/$(NF-!_)),-(____/$(NF)))] = $_
        getline
    } while (($_) ~ __)
    ______ = (! _) + (! _)
    for (_______ in ___) {
        if (-______ < +______--) {
            print _______, ___[_______]
        }
    }
    split("", ___)
    print
    $_ = ""
}
OUTPUT
header score1 score2
xFFFFFFFFFFC00000FFFFFFFFFFF8AF8B item 4.0 35
xFFFFFFFFFF90B217FFFFFFFFFFFE7E7F item 2.3 170
header score1 score2
xFFFFFFFFFFA7B962FFFFFFFFFFFA4FA5 item 2.9 45
xFFFFFFFFFF69696AFFFFFFFFFFFB5870 item 1.7 55
header score1 score2
xFFFFFFFFFE000000FFFFFFFFFFFBBBBC item 0.5 60
header score1 score2
header score1 score2
xFFFFFFFFFFBACF92FFFFFFFFFFFEB852 item 3.7 200
xFFFFFFFFFF99999AFFFFFFFFFFFDDDDE item 2.5 120
header score1 score2
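For readers who find the code above hard to follow, here is a plainer (untested) sketch of the same synthetic-key idea, assuming GNU awk for PROCINFO["sorted_in"] and non-negative scores: build a fixed-width, zero-padded key from score1 (with the record number as a tie-breaker) so that plain string comparison matches numeric order, then walk each block's keys in descending order and keep two lines per block:
gawk '
function flush(   key, taken) {
    # iterate the block keys in descending string order and keep the top two lines
    PROCINFO["sorted_in"] = "@ind_str_desc"
    for (key in block)
        if (++taken <= 2) print block[key]
    delete block
}
/header/ { flush(); print; next }
{
    # zero-padded score1 plus the record number: string order of the key
    # matches numeric order of $2 (for non-negative values)
    block[sprintf("%015.4f:%09d", $2, NR)] = $0
}
END { flush() }
' input.txt
Unlike the code above, this prints only the data lines, without the synthetic keys.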

Related

AWK split file every 50 occurrences of a string

I have a beginner question about awk.
I am using the line below to split a file into multiple files, using 'MATCH' as my delimiter.
awk 'BEGIN{flag=0} /MATCH/{flag++;next} {print $0 > (flag ".txt")}' file
My file is very long, but it has the form shown below:
MATCH
a
b
c
d
MATCH
a
b
I want to have the above awk line split my file every 50 'MATCH' occurrences. The current command creates a new file for each 'MATCH' occurrence. I am sure there is a simple way to achieve this, but I have not figured it out yet. I have tried using the line below with no luck.
awk 'BEGIN{flag=0} /MATCH/{flag++ == 50;next} {print $0 > (flag ".txt")}' file
I appreciate the help and guidance.
Untested, using any awk:
awk '
/MATCH/ && ( ( (++matchCnt) % 50 ) == 1 ) {
close(out)
out = (++outCnt) ".txt"
}
{ print > out }
' file
Assumptions:
the number of lines in a MATCH block are not known beforehand
the number of lines in a MATCH block could vary
the MATCH lines are to be copied to the output files
Sample input with 9 MATCH blocks:
$ cat file
MATCH
1.1
1.2
MATCH
2.1
2.2
MATCH
3.1
3.2
MATCH
4.1
4.2
MATCH
5.1
5.2
MATCH
6.1
6.2
MATCH
7.1
7.2
MATCH
8.1
8.2
MATCH
9.1
9.2
One awk idea:
awk -v blkcnt=3 ' # for OP case set blkcnt=50
BEGIN { outfile= ++fcnt ".txt" }
/MATCH/ { if (++matchcnt > blkcnt) {
close(outfile)
outfile= ++fcnt ".txt"
matchcnt=1
}
# next # uncomment if the "MATCH" lines are *NOT* to be copied to the output files
}
{ print $0 > outfile }
' file
For blkcnt=3 this generates:
$ head -40 {1..3}.txt
==> 1.txt <==
MATCH
1.1
1.2
MATCH
2.1
2.2
MATCH
3.1
3.2
==> 2.txt <==
MATCH
4.1
4.2
MATCH
5.1
5.2
MATCH
6.1
6.2
==> 3.txt <==
MATCH
7.1
7.2
MATCH
8.1
8.2
MATCH
9.1
9.2
For blkcnt=4 this generates:
$ head -40 {1..3}.txt
==> 1.txt <==
MATCH
1.1
1.2
MATCH
2.1
2.2
MATCH
3.1
3.2
MATCH
4.1
4.2
==> 2.txt <==
MATCH
5.1
5.2
MATCH
6.1
6.2
MATCH
7.1
7.2
MATCH
8.1
8.2
==> 3.txt <==
MATCH
9.1
9.2
If I've understood correctly, the first 50 blocks of a,b,c,d lines should be written to 1.txt, the next 50 to 2.txt and so on.
This can be achieved by building the filename from the integer value of ((flag-1)/50) plus 1, so that blocks 1 to 50 go to 1.txt, blocks 51 to 100 to 2.txt, and so on (assuming you want the file series to begin with 1 and not 0).
The BEGIN block can be removed as variables are set to 0 when first created if no value is given and they are used numerically.
Thus the following should achieve the desired output:
awk '/MATCH/{flag++;next} {print $0 > (int((flag-1)/50)+1 ".txt")}' file
So while this isn't a complete solution, it does showcase how to capture each group of rows delimited by "MATCH": once you have counted off every 50 groups, you can print them out in one shot, bearing in mind that you need to trim off the trailing "MATCH" and save it for the next round (a fuller sketch follows the demo output below).
nice jot 53 | mawk 'NR % 6 != 1 || ($!NF = "MATCH")^_' |
mawk '{ printf(" :: input row(s) = %8u\n ::" \
" output row # = %8u\n " \
"-------------------\n %s%s " \
"----END-NEW-ROW----\n\n", NF^!!NF, NR, $!(NF = NF), ORS)
}' RS='(^)?MATCH\r?\n' ORS='MATCH\n' FS='\n' OFS='\f'
:: input row(s) = 1
:: output row # = 1
-------------------
MATCH
----END-NEW-ROW----
:: input row(s) = 6
:: output row # = 2
-------------------
2
3
4
5
6
MATCH
----END-NEW-ROW----
:: input row(s) = 6
:: output row # = 3
-------------------
8
9
10
11
12
MATCH
----END-NEW-ROW----
:: input row(s) = 6
:: output row # = 4
-------------------
14
15
16
17
18
MATCH
----END-NEW-ROW----
:: input row(s) = 6
:: output row # = 5
-------------------
20
21
22
23
24
MATCH
----END-NEW-ROW----
:: input row(s) = 6
:: output row # = 6
-------------------
26
27
28
29
30
MATCH
----END-NEW-ROW----
:: input row(s) = 6
:: output row # = 7
-------------------
32
33
34
35
36
MATCH
----END-NEW-ROW----
:: input row(s) = 6
:: output row # = 8
-------------------
38
39
40
41
42
MATCH
----END-NEW-ROW----
:: input row(s) = 6
:: output row # = 9
-------------------
44
45
46
47
48
MATCH
----END-NEW-ROW----
:: input row(s) = 5
:: output row # = 10
-------------------
50
51
52
53
MATCH
----END-NEW-ROW----
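Building on the record-splitting demo above, a fuller (untested) sketch of the actual every-50-MATCH split could look like this, assuming GNU awk, where a multi-character RS is treated as a regular expression:
gawk 'BEGIN { RS = "MATCH\r?\n" }
NR > 1 {                                  # record 1 is the (empty) text before the first MATCH
    out = int((NR - 2) / 50) + 1 ".txt"   # 50 blocks per output file: 1.txt, 2.txt, ...
    if (out != prev) { if (prev) close(prev); prev = out }
    printf "MATCH\n%s", $0 > out          # put the MATCH delimiter back in front of its block
}' file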

Pythonic style of writing a "for-loop" with "if" clause

I use Java and I'm new to python.
I have the following code snippet:
count_of_yes = 0
for str_idx in str_indexes: # Ex. ["abc", "bbb","cb","aaa"]
if "a" in str_idx:
count_of_yes += one_dict["data_frame_of_interest"].loc[(str_idx), 'yes_column']
The one_dict looks like:
# categorical, can only be 1 or 0 in either column
one_dict --> data_frame_of_interest --> ______|__no_column__|__yes_column__
                                        "abc" |     1.0     |     0.0
                                        "cb"  |     1.0     |     0.0
                                        "aaab"|     0.0     |     1.0
                                        "bb"  |     0.0     |     1.0
                                        ...
         --> other_dfs_dont_need --> ...
             ...
I'm trying to get count_of_yes; is there a more Pythonic way to refactor the above for-loop and calculate the sum for count_of_yes?
Thanks!

How to loop awk command over row values

I would like to use awk to search for a particular word in the first column of a table and print the value in the 6th column. I understand how to do this searching one word at time using something along the lines of:
awk '$1 == "<insert-word>" { print $6 }' file.txt
But I was wondering if it is possible to loop this over a list of words in a row?
For example If I had a table like file1.txt below:
cat file1.txt
dna1 dna4 dna5
dna3 dna6 dna2
dna7 dna8 dna9
Could I loop over each value in row 1 and search for this word in column 1 of file2.txt below, each time printing the value of column 6? Then do this for row 2, 3 and so on...
cat file2
dna1 0 229 7 0 4 0 0
dna2 0 296 39 2 1 3 100
dna3 0 255 15 0 6 0 0
dna4 0 209 3 0 0 0 0
dna5 0 253 14 2 3 7 100
dna6 0 897 629 7 8 1 100
dna7 0 214 4 0 9 0 0
dna8 0 255 15 0 2 0 0
dna9 0 606 338 8 3 1 100
So an example looping the awk over row 1 of file 1 would return the numbers 4, 0 and 3.
Looping the command over row 2 would return the numbers 6, 8 and 1.
And finally, looping over row 3 would return the numbers 9, 2 and 3.
An example output might be
4 0 3
6 8 1
9 2 3
What I would really like to do is sum the numbers returned for each row. I just wasn't sure if this would be possible...
An example output of this would be
7
15
14
But I am not worried if this step isn't possible using awk as I could just do it separately
Hope this makes sense
Cheers
Ollie
Yes, you can give awk multiple input files. For your example:
awk 'NR==FNR{a[$1]=a[$2]=a[$3]=1;next}a[$1]{print $6}' file1 file2
I didn't test the above one-liner, but it should work. At least you get the idea.
If you don't know how many columns in your file1, as you said, you want to do a loop:
awk 'NR==FNR{for(x=1;x<=NF;x++)a[$x]=1;next}a[$1]{print $6}' file1 file2
update
edit for the new requirement:
awk 'NR==FNR{a[$1]=$6;next}{for(i=1;i<=NF;i++)s+=a[$i];print s;s=0}' f2 f1
The output of the above one-liner (taking f1 and f2 as your example files file1 and file2):
7
15
14
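If you also want the intermediate per-row values from the question (4 0 3, 6 8 1, 9 2 3) rather than only their sums, a small (untested) variation of the same idea should do it:
awk 'NR==FNR{a[$1]=$6;next}{for(i=1;i<=NF;i++)printf "%s%s", a[$i], (i<NF?OFS:ORS)}' file2 file1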

Selecting elements of two column whose difference is less than some given value using awk

While doing post-processing for a numerical analysis, I have the following data selection problem:
time_1 result_1 time_2 result_2
1 10 1.1 10.1
2 20 1.6 15.1
3 30 2.1 20.1
4 40 2.6 25.1
5 50 3.1 30.1
6 60 3.6 35.1
7 70 4.1 40.1
8 80 4.6 45.1
9 90 5.1 50.1
10 100 5.6 55.1
6.1 60.1
6.6 65.1
7.1 70.1
7.6 75.1
8.1 80.1
8.6 85.1
9.1 90.1
9.6 95.1
10.1 100.1
This file has 4 columns: the first column (time_1) contains the calculated instants of program 1, and the second column (result_1) the results calculated for each instant.
The third column (time_2) contains the calculated instants of another program, and the fourth column (result_2) the results calculated for each instant of this program 2.
Now I wish to select only the instants of the third column (time_2) that are very near the instants of the first column (time_1); the admitted difference is less than or equal to 0.1. For example:
for instant 1 of the time_1 column, I wish to select instant 1.1 of the time_2 column, because (1.1 - 1) = 0.1; I do not want to select the other instants of the time_2 column because (1.6 - 1) > 0.1, (2.1 - 1) > 0.1, and so on
for instant 2 of the time_1 column, I wish to select instant 2.1 of the time_2 column, because (2.1 - 2) = 0.1; I do not want to select the other instants of the time_2 column because (2.6 - 2) > 0.1, (3.1 - 2) > 0.1, and so on
At the end, I would like to obtain the following data:
time_1 result_1 time_2 result_2
1 10 1.1 10.1
2 20 2.1 20.1
3 30 3.1 30.1
4 40 4.1 40.1
5 50 5.1 50.1
6 60 6.1 60.1
7 70 7.1 70.1
8 80 8.1 80.1
9 90 9.1 90.1
10 100 10.1 100.1
I wish to use awk but I am not yet familiar with it. I do not know how to fix an element of the first column and then compare it to all elements of the third column in order to select the right value from that third column. If I do it very simply like this, I can print only the first line:
{if (($3>=$1) && (($3-$1) <= 0.1)) {print $2, $4}}
Thank you in advance for your help!
You can try the following perl script:
#! /usr/bin/perl
use strict;
use warnings;
use autodie;
use File::Slurp qw(read_file);
my @lines=read_file("file");
shift @lines; # skip first line
my @a;
for (@lines) {
    my @fld=split;
    if (@fld == 4) {
        push (@a,{id=>$fld[0], val=>$fld[1]});
    }
}
for (@lines) {
    my @fld=split;
    my $id; my $val;
    if (@fld == 4) {
        $id=$fld[2]; $val=$fld[3];
    } elsif (@fld == 2) {
        $id=$fld[0]; $val=$fld[1];
    }
    my $ind=checkId(\@a,$id);
    if ($ind>=0) {
        $a[$ind]->{sel}=[] if (! exists($a[$ind]->{sel}));
        push(@{$a[$ind]->{sel}},{id=>$id,val=>$val});
    }
}
for my $item (@a) {
    if (exists $item->{sel}) {
        my $s= $item->{sel};
        for (@$s) {
            print $item->{id}."\t".$item->{val}."\t";
            print $_->{id}."\t".$_->{val}."\n";
        }
    }
}
sub checkId {
    my ($a,$id) = @_;
    my $dif=0.1+1e-10;
    for (my $i=0; $i<=$#$a; $i++) {
        return $i if (abs($a->[$i]->{id}-$id)<=$dif)
    }
    return -1;
}
One thing to be aware of: due to the vagaries of floating point numbers, comparing a value to 0.1 is unlikely to give you the results you're looking for:
awk 'BEGIN {x=1; y=x+0.1; printf "%.20f", y-x}'
0.10000000000000008882
here, y=x+0.1, but y-x > 0.1
So, we will look at the diff as diff = 10*y - 10*x:
Also, I'm going to process the file twice: once to grab all the time_1/result_1 values, the second time to extract the "matching" time_2/result_2 values.
awk '
NR==1 {print; next}
NR==FNR {if (NF==4) r1[$1]=$2; next}
FNR==1 {next}
{
if (NF == 4) {t2=$3; r2=$4} else {t2=$1; r2=$2}
for (t1 in r1) {
diff = 10*t1 - 10*t2;
if (-1 <= diff && diff <= 1) {
print t1, r1[t1], t2, r2
break
}
}
}
' ~/tmp/timings.txt ~/tmp/timings.txt | column -t
time_1 result_1 time_2 result_2
1 10 1.1 10.1
2 20 2.1 20.1
3 30 3.1 30.1
4 40 4.1 40.1
5 50 5.1 50.1
6 60 6.1 60.1
7 70 7.1 70.1
8 80 8.1 80.1
9 90 9.1 90.1
10 100 10.1 100.1

awk, calculate the average for different interval of time

Can anybody teach me how to calculate the average of the time differences on each line? For example:
412.00 560.00
0 0
361.00 455.00 561.00
0 0
0 0
0 0
237.00 581.00
425.00 464.00
426.00 520.00
0 0
In the normal case, you would take the sum of all of those numbers divided by the total count of numbers:
sum/NR
The challenges here:
the number of columns is dynamic, which means not all of the lines have the same number of columns
to calculate the average, for example, given this line: 361.00 455.00 561.00
the calculation is:
((455 - 361) + (561 - 455)) / 2
So, the output I'm expecting is like this:
total_time divided_by average
148 1 148
0 1 0
200 2 100
0 1 0
0 1 0
0 1 0
344 1 344
: : :
: : :
: : :
I'm trying to use awk, but I'm stuck...
The intermediate values on lines with three or more time values don't affect the result -- only the first value, the last value, and the number of values matter. To see this from your example, note that:
((455-361) + (561 - 455))/2 = (561 - 361) / 2
Thus, you really just need to do something like
cat time_data |
awk '{ printf("%f\t%d\t%f\n", ($NF - $1), (NF - 1), ($NF - $1) / (NF - 1)) }'
For your sample data, this gives the results you specify (although not formatted as nicely as you present it).
This assumes that the time values are sorted on the lines. If not, calculate the maximum and minimum values and replace the $NF and $1 uses, respectively.
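If the values on a line are not guaranteed to be sorted, a small (untested) sketch of that variant, which scans each line for its minimum and maximum first, could be:
awk '{
    min = max = $1
    for (i = 2; i <= NF; i++) {
        if ($i < min) min = $i
        if ($i > max) max = $i
    }
    # every sample line has at least 2 values, so NF - 1 is never zero
    printf("%f\t%d\t%f\n", max - min, NF - 1, (max - min) / (NF - 1))
}' time_data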
A bash script:
#!/bin/bash
(echo "total_time divided_by average"
while read line
do
arr=($line)
count=$((${#arr[@]}-1))
total=$(bc<<<${arr[$count]}-${arr[0]})
echo "$total $count $(bc<<<$total/$count)"
done < f.txt ) | column -t
Output
total_time divided_by average
148.00 1 148
0 1 0
200.00 2 100
0 1 0
0 1 0
0 1 0
344.00 1 344
39.00 1 39
94.00 1 94