Find max/min of each column in awk

I have a file with a variable number of columns:
Input:
1 1 2
2 1 5
5 2 3
7 0 -1
4 1 4
I want to print the max and min of each column:
Desired output:
max: 7 2 5
min: 1 0 -1
For a single column, e.g. $1, I know I can find the max and min using something like:
awk '{if(min==""){min=max=$1}; if($1>max) {max=$1}; if($1<min) {min=$1};} END {printf "%.2g %.2g\n", min, max}'
Question
How can I extend this to loop over all columns (not necessarily just the 3 in my example)?
Many thanks!

awk 'NR==1{for(i=1;i<=NF;i++)min[i]=max[i]=$i;}
{for(i=1;i<=NF;i++){if($i<min[i]){min[i]=$i}else if($i>max[i])max[i]=$i;}}
END{printf "max:\t"; for(i in max) printf "%d ",max[i]; printf "\nmin:\t"; for(i in min)printf "%d ",min[i];}' input.txt
input.txt:
1 1 2 2
2 1 5 3
5 2 3 10
7 0 -1 0
4 1 4 5
output:
max: 7 2 5 10
min: 1 0 -1 0
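One caveat: the for (i in max) and for (i in min) loops in the END block do not guarantee that the columns come out in numeric order (array traversal order is unspecified in awk). It happens to work here, but explicit counting loops are safer, for example:
awk 'NR==1{for(i=1;i<=NF;i++)min[i]=max[i]=$i;}
{for(i=1;i<=NF;i++){if($i<min[i]){min[i]=$i}else if($i>max[i])max[i]=$i;}}
END{printf "max:\t"; for(i=1;i<=NF;i++) printf "%d ",max[i]; printf "\nmin:\t"; for(i=1;i<=NF;i++)printf "%d ",min[i]; print ""}' input.txt
(NF is still valid in END; it holds the field count of the last record read, which is fine here since every row has the same number of columns.)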

Like this:
awk 'NR==1{for(i=1;i<=NF;i++){xmin[i]=$i;xmax[i]=$i}}
{for(i=1;i<=NF;i++){if($i<xmin[i])xmin[i]=$i;if($i>xmax[i])xmax[i]=$i}}
END{for(i=1;i<=NF;i++)print xmin[i],xmax[i]}' file

Let's try to make it a bit shorter by using the min=(current<min?current:min) expression. This ternary operator is the same as saying if (current<min) min=current.
Also, printf "%.2g%s", min[i], (i==NF?"\n":" ") in the END{} block prints a space after each value and a newline once it reaches the last field.
awk 'NR==1{for (i=1; i<=NF; i++) {min[i]=max[i]=$i}; next}
{for (i=1; i<=NF; i++) { min[i]=(min[i]>$i?$i:min[i]); max[i]=(max[i]<$i?$i:max[i]) }}
END {printf "min: "; for (i=1;i<=NF;i++) printf "%.2g%s", min[i], (i==NF?"\n":" ");
printf "max: "; for (i=1;i<=NF;i++) printf "%.2g%s", max[i], (i==NF?"\n":" ")}' file
Sample output:
$ awk 'NR==1{for (i=1; i<=NF; i++) {min[i]=max[i]=$i}; next} {for (i=1; i<=NF; i++) { min[i]=(min[i]>$i?$i:min[i]); max[i]=(max[i]<$i?$i:max[i]) }} END {printf "min: "; for (i=1;i<=NF;i++) printf "%.2g%s", min[i], (i==NF?"\n":" "); printf "max: "; for (i=1;i<=NF;i++) printf "%.2g%s", max[i], (i==NF?"\n":" ")}' file
min: 1 0 -1
max: 7 2 5

Related

Counting the number of unique values based on more than two columns in bash

I need to modify the below code to work on more than one column.
Counting the number of unique values based on two columns in bash
awk ' ##Starting awk program from here.
BEGIN{
FS=OFS="\t"
}
!found[$0]++{ ##Checking condition if the whole line ($0) is NOT already present in found array, then do following.
val[$1]++ ##Creating val with 1st column index and keep increasing its value here.
}
END{ ##Starting END block of this program from here.
for(i in val){ ##Traversing through array val here.
print i,val[i] ##Printing i and value of val with index i here.
}
}
' Input_file ##Mentioning Input_file name here.
Table for counting how many times each value appears (across all the DIS columns):
patient sex DISa DISb DISc DISd DISe DISf DISg DISh DISi
patient1 male 550.1 550.5 594.1 594.3 594.8 591 1019 960.1 550.1
patient2 female 041 208 250.2 276.14 426.32 550.1 550.5 558 041
patient3 female NA NA NA NA NA NA NA 041 NA
The output I need is:
550.1 3
550.5 2
594.1 1
594.3 1
594.8 1
591 1
1019 1
960.1 1
550.1 1
041 3
208 1
250.2 1
276.14 1
426.32 1
558 1
Consider this awk:
awk -v OFS='\t' 'NR > 1 {for (i=3; i<=NF; ++i) if ($i+0 == $i) ++fq[$i]} END {for (i in fq) print i, fq[i]}' file
276.14 1
960.1 1
594.3 1
426.32 1
208 1
041 3
594.8 1
550.1 3
591 1
1019 1
558 1
550.5 2
250.2 1
594.1 1
A more readable form:
awk -v OFS='\t' '
NR > 1 {
for (i=3; i<=NF; ++i)
if ($i+0 == $i)
++fq[$i]
}
END {
for (i in fq)
print i, fq[i]
}' file
$i+0 == $i is a check to make sure the column value is numeric.
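To see that check in isolation on a few made-up tokens (not taken from the question's data):
$ echo '041 abc 10.5 NA' | awk '{for (i=1; i<=NF; i++) print $i, ($i+0 == $i ? "numeric" : "not numeric")}'
041 numeric
abc not numeric
10.5 numeric
NA not numeric
Note that 041 still counts as numeric; the string form (with its leading zero) is what gets stored as the array key, which is why 041 appears as-is in the output above.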
If the ordering must be preserved, then you need an additional array b[] to keep the order each number is encountered, e.g.
awk '
BEGIN { OFS = "\t" }
FNR > 1 {
for (i=3;i<=NF;i++)
if ($i~/^[0-9]/) {
if (!($i in a))
b[++n] = $i;
a[$i]++
}
}
END {
for (i=1;i<=n;i++)
print b[i], a[b[i]]
}' file
Example Use/Output
$ awk '
> BEGIN { OFS = "\t" }
> FNR > 1 {
> for (i=3;i<=NF;i++)
> if ($i~/^[0-9]/) {
> if (!($i in a))
> b[++n] = $i;
> a[$i]++
> }
> }
> END {
> for (i=1;i<=n;i++)
> print b[i], a[b[i]]
> }' patients
550.1 3
550.5 2
594.1 1
594.3 1
594.8 1
591 1
1019 1
960.1 1
041 3
208 1
250.2 1
276.14 1
426.32 1
558 1
Let me know if you have further questions.
Taking the complete solutions from the 2 answers above (by anubhava and David), with all respect, and just adding a little tweak (a check for integer values, as per the OP's shown samples), here are 2 solutions. Written and tested with the shown samples only.
1st solution: If order doesn't matter in the output, try:
awk -v OFS='\t' '
NR > 1 {
for (i=3; i<=NF; ++i)
if (int($i))
++fq[$i]
}
END {
for (i in fq)
print i, fq[i]
}' Input_file
2nd solution: If order matters, then based on David's answer, try:
awk '
BEGIN { OFS = "\t" }
FNR > 1 {
for (i=3;i<=NF;i++)
if (int($i)) {
if (!($i in a))
b[++n] = $i;
a[$i]++
}
}
END {
for (i=1;i<=n;i++)
print b[i], a[b[i]]
}' Input_file
Using GNU awk for multi-char RS:
$ awk -v RS='[[:space:]]+' '$0+0 == $0' file | sort | uniq -c
3 041
1 1019
1 208
1 250.2
1 276.14
1 426.32
3 550.1
2 550.5
1 558
1 591
1 594.1
1 594.3
1 594.8
1 960.1
If the order of the two output fields really matters, just pipe the above to awk '{print $2, $1}'.
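For instance, the complete pipeline (same input file and the same GNU awk requirement as above) would be:
$ awk -v RS='[[:space:]]+' '$0+0 == $0' file | sort | uniq -c | awk '{print $2, $1}'
which prints each value followed by its count, in the same sorted order as the listing above.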

Process multiple files with awk

I would like to count the number of points in each interval. I have the positions of the points in the first file and the intervals in the second. First I store the point attributes in two arrays (pos and name), and then I want to loop over them in order to determine whether each point belongs to the given interval ($1 is the name, $2 is the start, and $3 is the end of the interval). I have the following code:
awk 'NR==FNR{name[NR]=$1;pos[NR]=$2;next}; {for (i in name) if (name[i] == $1 && pos[i] >= $2 && pos[i] <= $3) {sum[NR] += 1;}} END {for (i = 1; i <=length(sum); i++) {print sum[i]}} file1 file2 > out'
I have a syntax error: "syntax error near unexpected token `i"
I am a beginner in awk. Any help is highly appreciated. Thanks
awk '
NR==FNR{
name[NR]=$1
pos[NR]=$2
next
}
{
for(i in name){
if(name[i] == $1 && pos[i] >= $2 && pos[i] <= $3){ sum[FNR] += 1; }
}
}
END {
for(i = 1; i <=FNR; i++){
print sum[i];
}
}
' points windows > output
points:
chr1 52
chr1 65
chr2 120
chr2 101
chr2 160
chr3 20
chr4 50
windows:
chr1 0 100
chr1 100 200
chr2 0 100
chr2 100 200
chr3 0 100
chr3 100 200
chr4 0 100
chr5 0 100
chr6 0 100
chr6 100 200
chr7 0 100
chr8 0 100
This gave me the desired output:
2
3
1
1
Thank You
Your ' is in the wrong place and the awk command is not ending properly; could you please try the following. I couldn't test it since no samples were given.
awk 'NR==FNR{name[NR]=$1;pos[NR]=$2;next}; {for (i in name) if (name[i] == $1 && pos[i] >= $2 && pos[i] <= $3) {sum[NR] += 1;}} END {for (i = 1; i <=length(sum); i++) {print sum[i]}}' file1 file2
Non-one-liner form of the above solution:
awk '
NR==FNR{
name[NR]=$1
pos[NR]=$2
next
}
{
for(i in name){
if(name[i] == $1 && pos[i] >= $2 && pos[i] <= $3){ sum[NR] += 1 }
}
}
END{
for(i = 1; i <=length(sum); i++){
print sum[i]
}
}
' file1 file2 > out
As per sir Ed Morton's comment, the following are a few recommendations. Again, these are not tested since no samples were given, but you could try to apply them.
sum[NR] should be sum[FNR] in case you want the index to follow the line number of Input_file2. The difference between NR and FNR is that NR keeps increasing until all Input_file(s) are read, but FNR is RESET to 1 whenever a new Input_file is read.
Then, length(sum) could simply be the value of FNR, because you are basically looking for the total number of times the loop has to run, which you can get from the FNR value.
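Putting both recommendations together, the non-one-liner form becomes something like this (again untested, since no samples were given):
awk '
NR==FNR{
name[NR]=$1
pos[NR]=$2
next
}
{
for(i in name){
if(name[i] == $1 && pos[i] >= $2 && pos[i] <= $3){ sum[FNR] += 1 }
}
}
END{
for(i = 1; i <= FNR; i++){
print sum[i]
}
}
' file1 file2 > out
In the END block, FNR still holds the line number of the last record read from the second file, so it gives the loop bound directly.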

Using awk, how to average numbers in column between two strings in a text file

A text file contains multiple tab-delimited columns between strings; an example is below.
Code 1 (3)
5 10 7 1 1
6 10 9 1 1
7 10 10 1 1
Code 2 (2)
9 11 3 1 3
10 8 5 2 1
Code 3 (1)
12 10 2 1 1
Code 4 (2)
14 8 1 1 3
15 8 7 5 1
I would like to average the numbers in the third column for each code block. The example below is what the output should look like.
8.67
4
2
4
Attempt 1
awk '$3~/^[[:digit:]]/ {i++; sum+=$3; print $3} $3!~/[[:digit:]]/ {print sum/i; sum=0;i=0}' in.txt
Returned fatal: division by zero attempted.
Attempt 2
awk -v OFS='\t' '/^Code/ { if (NR > 1) {i++; sum+=$3;} {print sum/i;}}' in.txt
Returned another division by zero error.
Attempt 3
awk -v OFS='\t' '/^Code/ { if (NR > 1) { print s/i; s=0; i=0; } else { s += $3; i += 1; }}' in.txt
Returned 1 value: 0.
Attempt 4
awk -v OFS='\t' '/^Code/ {
if (NR > 1)
i++
print sum += $3/i
}
END {
i++
print sum += $3/i
}'
Returned:
0
0
0
0.3
I am not sure where that last number is coming from, but this has been the closest solution so far. I am getting a number for each block, but not the average.
Could you please try the following.
awk '
/^Code/{
if(value!=0 && value){
print sum/value
}
sum=value=""
next
}
{
sum+=$3;
value++
}
END{
if(value!=0 && value){
print sum/value
}
}
' Input_file
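For reference, with the question's sample saved as in.txt, the same logic condensed to one line prints one average per Code block:
$ awk '/^Code/{if(value!=0 && value){print sum/value}; sum=value=""; next} {sum+=$3; value++} END{if(value!=0 && value){print sum/value}}' in.txt
8.66667
4
2
4
The first value shows as 8.66667 rather than the rounded 8.67 from the question because print uses awk's default OFMT (%.6g); use printf "%.2f\n", sum/value if two decimal places are wanted.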

Difference between adjacent data rows, with multiple columns

If I have:
1 2 3 4 5 6 . .
3 4 5 4 2 1 . .
5 7 5 7 2 0 . .
.
.
I want to show the difference of adjacent data rows, so that it would show:
2 2 2 0 -3 -5 . .
2 3 0 3 0 -1 . .
.
.
I found the post difference between number in the same column using AWK, and adapting the second answer, I thought that this will do the job:
awk 'NR>1{print $0-p} {p=$0}' file
But that produces the output as a single column. How do I get it to retain the column structure of the data?
$ cat tst.awk
NR>1 {
for (i=1; i<=NF; i++) {
printf "%2d%s", $i - p[i], (i<NF ? OFS : ORS)
}
}
{ split($0,p) }
$ awk -f tst.awk file
2 2 2 0 -3 -5
2 3 0 3 0 -1
Try something like this:
awk '{for (i=1; i <= NF; i++) { c[i] = $i - c[i] }; count = NF }
END { for (i = 1; i <= count; i++) { printf c[i] " "}}' numbers
Written out:
$ cat > subtr.awk
{
for (i=1; i<=NF; i++) b[i]=a[i]
# for (i in a) b[i]=a[i]
n=split($0,a)
}
NR > 1 {
for (i=1; i<=NF; i++) {
#for(i in a) {
printf "%s%s", a[i]-b[i], (i==n?ORS:OFS)
}
delete b
}
Test it:
$ awk -f subtr.awk file
2 2 2 0 -3 -5
2 3 0 3 0 -1

find the Max and Min with AWK in specific range

I have a file with three columns. I want to get the max of $3 and the min of $2, but within a specific range of $1, with awk:
Col1 Col2 Col3
==============
X 1 2
X 3 4
Y 5 6
Y 7 8
E.g. I want to get the minimum value of Col2, and the maximum value of Col3, while Col1=X.
I can handle the max and min values, but I can't figure out how to find them within a specific range.
This is my code:
awk ' min=="" || $2 < min {min=$2; minline=$0} $3 > max {max=$3; maxline=$0};END {print $1,min,max}'
I tried to add {If ($1==X)} but it doesn't work well.
kent$ echo "X 1 2
X 3 4
Y 5 6
Y 7 8
"|awk '$1=="X"{min=$2<min||min==""?$2:min;max=$3>max||max==""?$3:max}END{print min,max}'
1 4
is this what you want?
What about:
awk 'BEGIN { c=1 }
$1 == "X" { if (c==1) { mmin=$2; mmax=$3 ;c++ }
if ($2<mmin) { mmin=$2 }
if ($3>mmax) { mmax=$3 }
}
END { print "X min: " mmin ", max: " mmax }' INPUTFILE
If you want to collect all the minima and maxima:
awk '
$2 < min[$1] {min[$1] = $2}
$3 > max[$1] {max[$1] = $3}
{col1[$1] = 1}
END {for (c in col1) {print c, min[c], max[c]}}
' file
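One caveat with this last version: min[$1] is unset the first time a group is seen, so $2 < min[$1] compares against an empty/zero value and a positive minimum never gets stored; the Col1/====== header lines are also swept into the arrays. A variant that seeds each group on first sight and skips the two header lines could look like this (a sketch, only checked against the small sample above):
awk '
NR > 2 {                      # skip the "Col1 Col2 Col3" and "==============" lines
    if (!($1 in seen)) {      # first record for this group: seed min and max
        seen[$1] = 1
        min[$1] = $2
        max[$1] = $3
    }
    if ($2 < min[$1]) min[$1] = $2
    if ($3 > max[$1]) max[$1] = $3
}
END {
    for (c in seen) print c, min[c], max[c]
}' file
With the sample above this prints X 1 4 and Y 5 8 (group order is not guaranteed by for (c in seen)).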