awk print row where column 1 matches and column three is highest value

I am looking to print columns 1 & 2 where column 1 matches and column 3 is the highest value. I am currently using awk and sort to get this type of output:
EXCEPTION 91 3
EXCEPTION 15 5
TEST 5 1
TEST 1 8
The desired end output I am looking for:
EXCEPTION 15
TEST 1
Here is an example file and the commands I am running to get the uniq counts. What I would really like is for sort to print the last record in the uniq sort
EXCEPTION 15
so I don't have to do all the crazy uniq --count logic.
i.e. I want to know if column 1 matches >= 3 times, and print the last recorded column-two value for that match.
cat /tmp/testing.txt |grep EXCEPTION
EXCEPTION 15
EXCEPTION 15
EXCEPTION 15
EXCEPTION 91
EXCEPTION 91
EXCEPTION 91
EXCEPTION 91
EXCEPTION 15
EXCEPTION 15
cat /tmp/testing.txt|awk '{print $1 " " $2}'|sed '/^$/d'| awk '$2 >= '1' '|sort |uniq --count|awk '{print $1" "$2" "$3}'|awk '$1 >= '3''|awk '{print $1" "$2" "$3}'|awk '{print $2" "$3" "$1}'
EXCEPTION 15 5
EXCEPTION 91 4
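(As an aside, the literal requirement stated above (column 1 seen at least 3 times, print the last column-2 value recorded for it) fits in a single awk pass. A minimal sketch, assuming the raw file's order is what defines "last":
awk '{count[$1]++; last[$1]=$2} END {for (k in count) if (count[k] >= 3) print k, last[k]}' /tmp/testing.txt
For the EXCEPTION lines shown above, this prints EXCEPTION 15.)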

Just keep track of the maximums for any given 1st field and store its corresponding 2nd field:
awk '{if ($3>max[$1])
        {max[$1]=$3; val[$1]=$2}
     }
     END {for (i in val) print i, val[i]}' file
Test
$ awk '{if ($3>max[$1]) {max[$1]=$3; val[$1]=$2}} END {for (i in val) print i, val[i]}' file
EXCEPTION 15
TEST 1

You said you didn't want horrible uniq logic... but in case you change your mind, this task does fit quite neatly into sort/uniq/cut's purview (though this isn't as efficient as the awk solution).
From your testing file you can get your desired output with
sort -k1,1 -k2,2n < testing |
uniq -c |
sort -k2,2 -k1rn,1 |
cut -c9- |
sort -u -k1,1
In order: sort by the first column alphabetically (the default) and then the second column numerically; this puts identical lines in sequence.
Then count the occurrences of each line; uniq -c prepends 8 characters to each line: the count, padded with whitespace.
Then sort by the string (EXCEPTION, TEST), which is now the second field, and within each string by count, descending.
Remove the first 8 characters from each line (the counts).
Finally "sort" by the string and only output uniques. As the record you're interested in has been sorted to the top, this is the one it outputs. This can be thought of as "uniq by field."
(If you want a way to strip the leading counts that doesn't depend on their fixed width, you can replace the cut command with sed 's/^ *[0-9]\+ *//'.)
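To see that final "uniq by field" step in isolation, here is a small illustration (GNU sort; with -u the comparison uses only the -k1,1 key, so the first line per key wins):
printf 'EXCEPTION 15\nEXCEPTION 91\nTEST 1\n' | sort -u -k1,1
EXCEPTION 15
TEST 1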

Related

How do I print every nth entry of the mth column, starting from a particular line of a file?

Consider the following data in a file file.txt:
$
$
$
FORCE 10 30 40
* 1 5 4
FORCE 11 20 22
* 2 3 0
FORCE 19 25 10
* 16 12 8
.
.
.
I want to print every 2nd element of the third column, starting from line 4, resulting in:
30
20
25
I have tried:
cat file.txt | sed 's/\|/ /' | awk 'NR%2==4 {print $3}'
However, this is not resulting in anything being printed and no errors generated either.
You might use awk, checking that the row number is > 3 and then checking for an even row number with NR%2==0.
Note that you don't have to use cat:
awk 'NR > 3 && NR%2==0 {
print $3
}' file.txt
Output
30
20
25
Using sed
$ sed -En '4~2s/([^ \t]*[ \t]+){2}([^ \t]*).*/\2/p' input_file
30
20
25
I have tried:
cat file.txt | sed 's/\|/ /' | awk 'NR%2==4 {print $3}'
However, this is not resulting in anything being printed and no errors generated either.
You do not need cat, since sed can read a file on its own; in this case it would be sed 's/\|/ /' file.txt.
You should also consider whether you need that sed at all: your sample input does not contain a pipe character, so the substitution would do nothing. You can drop that part if the lines holding the values you want to print never contain that character.
The output is empty because NR%2==4 never holds; the remainder of division by x is always smaller than x (in the particular case of %2, only two values are possible, 0 and 1).
This might work for you (GNU sed):
sed -nE '4~2s/^((\S+)\s*){3}.*/\2/p' file
Turn off implicit printing by setting the -n option and reduce backslashes in regexps by turning on -E.
From the fourth line and then every second line thereafter, capture the third column and print it.
N.B. \2 holds the text captured by the group on the last of its {3} repetitions, which is why it yields the third column.
Alternative:
sed -n '4,${s/^\(\(\S\+\)\s*\)\{3\}.*/\2/p;n}' file

awk not printing expected result from file

In the awk below I expected that adding NR==2 would only check the 2nd line's $1 value (the 1) and ensure that it is a number. If it is, then print Index is a number, else
Index is not a number. It seems close but the result is not as expected... maybe I used the wrong variable? Thank you :).
file.txt
Index Chr Start End Ref Alt Quality Freq Score HGMD
1 1 10 100 A - GOOD .002 2 .
2 1 100 1000 - C STRAND BIAS .036 10 .
3 5 50 500 AA T GOOD 1 5 .
awk
awk -F'\t' 'NR==2 $1 ~ /^[[:digit:]]/ {print "Index is a number"} ELSE {print "Index is not a number"}' file.txt
Index is a number
Index is a number
Index is a number
Index is a number
Index is a number
desired output
Index is a number
awk 'NR==2 {print "Index is "($1~/^[0-9]+$/?"":"not ") "a number";exit}' file
If you just want to check line 2, you should exit after processing it.
If you need a different field separator, add the -F option.
I think you are looking for something like below,
awk 'BEGIN{FS="\t"} NR==2 { if (match($1,/^[[:digit:]]/)) { print "Index is a number" } else { print "Index is not a number" } }' file
Index is a number
You can of course extend this to any number of lines by dropping NR==2, or by using NR>1, which skips only the header; a sketch of that variant follows.
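For example, a sketch of the NR>1 variant, checking every data line while skipping the header:
awk 'BEGIN{FS="\t"} NR>1 { if (match($1,/^[[:digit:]]/)) { print "Index is a number" } else { print "Index is not a number" } }' file
For the sample file this prints the message once per data row, three times in all.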
Use the following approach:
awk 'NR==2{printf("Index is%s a number\n", ($1~/^[0-9]+$/)? "":" not")}' file.txt
The output:
Index is a number
$1~/^[0-9]+$/ - ensures that the first field is a number

awk - skip last line for condition

When I wrote an answer for this question I used the following:
something | sed '$d' | awk '$1>3{print $0}'
e.g.
print only lines where the 1st field is bigger than 3 (awk)
but omit the last line (sed '$d').
This seems for me a bit of duplicate work, surely it is possible to do the above only with awk - without the sed?
I'm an awkdiot - so, can someone suggest a solution?
Here's one way you could do it:
$ printf "%s\n" {1..10} | awk 'NR>1&&p>3{print p}{p=$1}'
4
5
6
7
8
9
Basically, print the first field of the previous line, rather than the current one.
As Wintermute has rightly pointed out in the comments (thanks), in order to print the whole line, you can modify the code to this:
awk 'p { print p; p="" } $1 > 3 { p = $0 }'
This only assigns the contents of the line to p if the first field is greater than 3.
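For example, with the same printf input as before, the held-back final line (10) never gets printed:
$ printf "%s\n" {1..10} | awk 'p { print p; p="" } $1 > 3 { p = $0 }'
4
5
6
7
8
9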

How to track lines in large log file that don't appear in the expected order?

I have a large log file which includes lines in the format
id_number message_type
Here is an example for a log file where all lines appear in the expected order
1 A
2 A
1 B
1 C
2 B
2 C
However, not all lines appear in the expected order in my log file and I'd like to get a list of all id numbers that don't appear in expected order. For the following file
1 A
2 A
1 C
1 B
2 B
2 C
I would like to get an output that indicates id number 1 has lines that don't appear in the expected order. How can I do this using grep, sed, or awk?
This works for me:
awk -v "a=ABC" 'substr(a, b[$1]++ + 1, 1) != $2 {print $1}' logfile
When you run this, the ID number from each out-of-order line will be printed. If there are no out-of-order lines, then nothing is printed.
How it works
-v "a=ABC"
This defines the variable a with the list of characters in their expected order.
substr(a, b[$1]++ + 1, 1) != $2 {print $1}
For each ID number, the array b keeps track of where we are. Initially, b is zero for all IDs. With this initial value, that is b[$1]==0, the expression substr(a, b[$1] + 1, 1) returns A which is our first expected output. The condition substr(a, b[$1] + 1, 1) != $2 thus checks if the expected output, from the substr function, differs from the actual output shown in the second field, $2. If it does differ, then the ID value, $1, is printed.
After the substr expression is computed, the trailing ++ in the expression b[$1]++ increments the value of b[$1] by 1 so that the value of b[$1] is ready for the next time that ID $1 is encountered.
Refinement
The above prints an ID number every time an out-of-order line is encountered. If you just want each bad ID printed once, not multiple times, use:
awk -v "a=ABC" 'substr(a, b[$1]++ + 1, 1) != $2 {bad[$1]++} END{for (n in bad) print n}' logfile
I am only on my iPad with no way to test this, but I can give you an idea how to do it with awk since no-one else is answering...
Something like this:
awk 'BEGIN{for(i=0;i<10000;i++)expected[i]=ord("A")}
     {if(expected[$1]!=ord($2))
          print "Out of order at line ", NR, $0;
      expected[$1]=ord($2)+1
     }' yourFile
You will need to paste in an ord() function, since awk does not provide one built in; one possible version appears below.
Basically, the concept is to initialise an array called expected[] that keeps track of the next message type expected for each id and then, as each line is read, check that it is the next expected value.
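A minimal ord() sketch along the lines of the one in the GNU awk manual, building a character-to-code lookup table on first use (the table and flag names here are just placeholders):
function ord(c,    i) {
    # Populate the lookup table once, mapping each character to its code.
    if (!_ord_ready) {
        for (i = 1; i < 256; i++)
            _ord_table[sprintf("%c", i)] = i
        _ord_ready = 1
    }
    return _ord_table[substr(c, 1, 1)]
}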
A batch-only alternative (the final sort is not mandatory):
sort -k1n YourFile | tee file1 | sort -k2 > file2 && comm -23 file1 file2 | sort

awk script for finding smallest value from column

I am a beginner in AWK, so please help me to learn it. I have a text file named snd and its values are
1 0 141
1 2 223
1 3 250
1 4 280
I want to print the entire row when the third column value is minimum.
This should do it:
awk 'NR == 1 {line = $0; min = $3}
NR > 1 && $3 < min {line = $0; min = $3}
END{print line}' file.txt
EDIT:
What this does is:
Remember the 1st line and its 3rd field.
For the other lines, if the 3rd field is smaller than the min found so far, remember the line and its 3rd field.
At the end of the script, print the line.
Note that the test NR > 1 can be skipped, as for the 1st line, $3 < min will be false (min is uninitialised and compares as zero). If you know that the 3rd column is always negative (never positive), you could even skip the NR == 1 ... test, as min's value at the beginning of the script is zero and the first line's $3 would already compare smaller.
EDIT2:
This is shorter:
awk 'NR == 1 || $3 < min {line = $0; min = $3}END{print line}' file.txt
You don't need awk to do what you want. Use sort
sort -nk 3 file.txt | head -n 1
Results:
1 0 141
I think sort is an excellent answer, unless for some reason what you're looking for is the awk logic to do this in a larger script, or you want to avoid the extra pipes, or the purpose of this question is to learn more about awk.
$ awk 'NR==1{x=$3;line=$0} $3<x{line=$0} END{print line}' snd
Broken out into pieces, this is:
NR==1 {x=$3;line=$0} -- On the first line, set an initial value for comparison and store the line.
$3<x{line=$0} - On each line, compare the third field against our stored value, and if the condition is true, store the line. (We could make this run only on NR>1, but it doesn't matter.)
END{print line} -- At the end of our input, print whatever line we've stored.
You should read man awk to learn about any parts of this that don't make sense.
A short answer for this would be:
sort -k3,3n temp|head -1
since you have asked for awk:
awk '{if(min>$3||NR==1){min=$3;a[$3]=$0}}END{print a[min]}' your_file
But I prefer the shorter one always.
For calculating the smallest value in any column, say the last column:
awk '(FNR==1){a=$NF} {a=$NF < a?$NF:a} END {print a}'
this will only print the smallest value of the column.
If the complete line is needed, it is better to use sort:
sort -n -t [delimiter] -k[column] [file name] | head -n 1
awk -F ";" 'NR==1 || $NF < a {a=$NF; b=$0} END {print b}' filename
this will print the line with the smallest value that is encountered first.
awk 'BEGIN {OFS=FS=","}{if ( a[$1]>$2 || a[$1]=="") {a[$1]=$2;} if (b[$1]<$2) {b[$1]=$2;} } END {for (i in a) {print i,a[i],b[i]}}' input_file
We need the || a[$1]=="" test because the first time a given field-1 value is encountered, a[$1] is null, and the a[$1]>$2 comparison alone would never store it.
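For instance, given a hypothetical comma-separated input
a,5
a,3
b,7
b,2
it prints each key with its minimum and maximum of field 2, in whatever order for (i in a) happens to traverse:
a,3,5
b,2,7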