Use awk to get min/max column values

Given a file with data such as
2015-12-24 22:02 12 9.87 feet High Tide
2015-12-25 03:33 12 -0.38 feet Low Tide
2015-12-25 06:11 12 Full Moon
2015-12-25 10:16 12 11.01 feet High Tide
2015-12-25 16:09 12 -1.29 feet Low Tide
This awk command will return the minimum value in column 4:
awk 'min=="" || $4 < min {min=$4} END{ print min}' FS=" " 12--December.txt
How do I get it to exclude any line where $4 contains text? I imagine this needs regex but poring over the regex manuals I am lost as to how to do it.

You can use a regular expression comparison on the fourth field as
$4~/[0-9]+/
Test
$ awk '$4~/[0-9]+/ && $4 < min {min=$4} END{print min}' input
-1.29
Note: this is a minimised version of the command. The min=="" guard is omitted in the test because an uninitialized min compares numerically as 0, which works here since the data contains negative values.
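A slightly more defensive variant (my sketch, not from the original answer) keeps the min=="" guard so the minimum is found even when all values are positive, and anchors the regex so a field that merely contains a digit doesn't slip through:

```shell
# Recreate the sample tide data from the question
cat > 12--December.txt <<'EOF'
2015-12-24 22:02 12 9.87 feet High Tide
2015-12-25 03:33 12 -0.38 feet Low Tide
2015-12-25 06:11 12 Full Moon
2015-12-25 10:16 12 11.01 feet High Tide
2015-12-25 16:09 12 -1.29 feet Low Tide
EOF

# Only consider fields that look like a (possibly negative) number,
# and seed min from the first such value.
awk '$4 ~ /^-?[0-9]+(\.[0-9]+)?$/ && (min == "" || $4 + 0 < min + 0) { min = $4 }
     END { print min }' 12--December.txt
# prints -1.29
```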

Related

Search and Print by Two Conditions using AWK

I have this file:
- - - Results from analysis of weight - - -
Akaike Information Criterion 307019.66 (assuming 2 parameters).
Bayesian Information Criterion 307036.93
Approximate stratum variance decomposition
Stratum Degrees-Freedom Variance Component Coefficients
id 39892.82 490.360 0.7 0.6 1.0
damid 0.00 0.00000 0.0 0.0 1.0
Residual Variance 1546.46 320.979 0.0 0.0 1.0
Model_Term Gamma Sigma Sigma/SE % C
id NRM_V 17633 0.18969 13.480 4.22 0 P
damid NRM_V 17633 0.07644 13.845 2.90 0 P
ide(damid) IDV_V 17633 0.00000 32.0979 1.00 0 S
Residual SCA_V 12459 1.0000 320.979 27.81 0 P
And I would like to print the value of Sigma for id; note there are two lines containing id in the file, so I used the condition based on NRM_V too.
I tried this code:
tac myfile | awk '(/id/ && /NRM_V/){print $5}'
but the results printed were:
13.480
13.845
and I need just the first one
Could you please try the following. I have added awk's exit statement, which exits as soon as the first match is found; this also saves time, since awk no longer reads the whole Input_file.
awk '(/id/ && /NRM_V/){print $5;exit}' Input_file
OR with column comparisons:
awk '($1=="id" && $2=="NRM_V"){print $5;exit}' Input_file
In case you want to read file from last line towards first line and get its first value then try:
tac Input_file | awk '(/id/ && /NRM_V/){print $5;exit}'
OR with column comparisons:
tac Input_file | awk '($1=="id" && $2=="NRM_V"){print $5;exit}'
The problem is that /id/ also matches damid. You could use the following to print the Sigma value only if the first field is id and the second field is NRM_V:
awk '$1=="id" && $2=="NRM_V"{ print $5 }' myfile
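As a quick sanity check, the field-comparison version can be run against the relevant part of the sample data from the question; it prints the $5 value of the single line whose first two fields are exactly id and NRM_V:

```shell
# Recreate the relevant part of the sample file
cat > myfile <<'EOF'
Model_Term Gamma Sigma Sigma/SE % C
id NRM_V 17633 0.18969 13.480 4.22 0 P
damid NRM_V 17633 0.07644 13.845 2.90 0 P
ide(damid) IDV_V 17633 0.00000 32.0979 1.00 0 S
Residual SCA_V 12459 1.0000 320.979 27.81 0 P
EOF

# Exact field comparisons avoid the /id/-also-matches-damid problem
awk '$1=="id" && $2=="NRM_V"{ print $5; exit }' myfile
# prints 13.480
```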

awk not printing correct value for 1 condition that uses multiple criteria

Thanks to @Jose Ricardo Bustos M., whose help led to the below using file1 and file2:
However, I cannot seem to capture BRCA2 from file1 against the BRCA 1, BRCA2 entry in file2 (line 2, skipping the header). I am not sure if this is because BRCA2 is the second instance after the comma, or because $7 is full gene sequence and full deletion/duplication analysis, that is, full gene sequence is only a partial match of the full line in $7? Thank you :).
file1
BRCA2
BCR
SCN1A
fbn1
file2
Tier explanation . List code gene gene name methodology disease
Tier 1 . . 811 DMD dystrophin deletion analysis and duplication analysis, if performed Publication Date: January 1, 2014 Duchenne/Becker muscular dystrophy
Tier 1 . Jan-16 81 BRCA 1, BRCA2 breast cancer 1 and 2 full gene sequence and full deletion/duplication analysis hereditary breast and ovarian cancer
Tier 1 . Jan-16 70 ABL1 ABL1 gene analysis variants in the kinse domane acquired imatinib tyrosine kinase inhibitor
Tier 1 . . 806 BCR/ABL 1 t(9;22) major breakpoint, qualitative or quantitative chronic myelogenous leukemia CML
Tier 1 . Jan-16 85 FBN1 Fibrillin full gene sequencing heart disease
Tier 1 . Jan-16 95 FBN1 fibrillin del/dup heart disease
awk
awk 'BEGIN{FS=OFS="\t"}     # define FS and OFS
{$0=toupper($0)}            # convert the whole line to uppercase
{$5=toupper($5)}            # convert $5 to uppercase
{$7=toupper($7)}            # convert $7 to uppercase
FNR==NR{                    # process each line of file2 (the first file on the command line)
    if(NR>1 && ($7 ~ /FULL GENE SEQUENC/)) {  # skip the header; match "full gene sequence" or "full gene sequencing" via regexp
        gsub(" ","",$5)     # remove whitespace
        n=split($5,v,"/")
        d[v[1]] = $4        # first element from the split becomes the key
    }
    next
}
{print $1, ($1 in d?d[$1]:279)}' file2 file1  # print the name, then a default of 279 if no match
BRCA2 279
BCR 279
SCN1A 279
FBN1 85
desired output
BRCA2 81 --- match in line 2 of $5 in file 2, BRCA 1, BRCA2 and $7 has full gene sequence
BCR 279
SCN1A 279
FBN1 85
The problem is with the below part in the code,
gsub(" ","",$5)
n=split($5,v,"/")
d[v[1]] = $4
AFAIK, it handles the BCR/ABL 1 case properly, but for BRCA 1, BRCA2 it does NOT produce the results you expect. Removing whitespace from BRCA 1, BRCA2 gives BRCA1,BRCA2, and splitting that by / yields the same string BRCA1,BRCA2 unchanged, because the delimiter is wrong.
So you need to split the string again by , and hash it. Something like,
n=split($5,v,",")
for (i=1; i <= n; i++) {
    d[v[i]] = $4
}
So that d is now keyed by both BRCA1 and BRCA2. Use the above along with your existing code.
Alternatively, remove the code
gsub(" ","",$5)
n=split($5,v,"/")
d[v[1]] = $4
altogether and do,
gsub(" ","",$5)
n=split($5,v,"\\||,")
for (i=1; i <= n; i++) {
    d[v[i]] = $4
}
which means split $5 on either | or , and loop over its contents and hash it to the array d.
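A minimal, self-contained sketch of that split-and-hash step (the input line and field positions here are made up for illustration):

```shell
echo 'Tier 81 BRCA1,BRCA2' | awk '{
    # split the gene list on either "|" or "," ...
    n = split($3, v, "\\||,")
    # ... and hash every piece, so both BRCA1 and BRCA2 become keys
    for (i = 1; i <= n; i++) d[v[i]] = $2
    for (k in d) print k, d[k]
}' | sort
# prints:
# BRCA1 81
# BRCA2 81
```

The trailing sort only makes the output order deterministic, since awk's for (k in d) iterates in an unspecified order.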

Using awk gsub with \1 to replace chars with a section of the original characters

This is what I'm doing (I just want to get rid of the leading numbers in the fourth column)
cat text.txt | awk 'BEGIN {OFS="\t"} {gsub(/[0-9XY][0-9]?([pq])/,"\1",$4); print}'
This is my input
AADDC 4902 3 21q11.3-p11.1 4784 4793
DEEDA 4023 6 9p21.31|22.3-p22.1 2829 2832
ZWTEF 3920 10 8q21-q22 5811 5812
This is my Output
AADDC 4902 3 11.3-p11.1 4784 4793
DEEDA 4023 6 21.31|22.3-p22.1 2829 2832
ZWTEF 3920 10 21-q22 5811 5812
But I want this to be my output
AADDC 4902 3 q11.3-p11.1 4784 4793
DEEDA 4023 6 p21.31|22.3-p22.1 2829 2832
ZWTEF 3920 10 q21-q22 5811 5812
If you use GNU awk, you can use gensub which, unlike gsub, supports backreferences:
awk 'BEGIN {OFS="\t"} {$4=gensub(/[0-9XY][0-9]?([pq])/,"\\1",1,$4); print}' text.txt
Some explanations:
What is the extra "\" by the 1 for?
Because otherwise "\1" would be the character with ASCII code 1.
Why does the 1 need to be placed between the "\\1" and the $4?
To tell gensub to replace only the first occurrence of the pattern.
Is there a reason why you must put $4= as well as pass $4?
Yes: unlike gsub, gensub doesn't modify the field in place but returns the updated string.
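If GNU awk is not available, a portable sketch using match() with RSTART/RLENGTH can achieve the same effect without backreferences; this assumes the number to strip always sits at the start of the field, which holds for the sample data:

```shell
cat > text.txt <<'EOF'
AADDC 4902 3 21q11.3-p11.1 4784 4793
DEEDA 4023 6 9p21.31|22.3-p22.1 2829 2832
ZWTEF 3920 10 8q21-q22 5811 5812
EOF

awk 'BEGIN { OFS="\t" }
{
    # match the leading chromosome number followed by p or q
    if (match($4, /^[0-9XY][0-9]?[pq]/))
        # keep only the trailing p/q character of the match onward
        $4 = substr($4, RSTART + RLENGTH - 1)
    print
}' text.txt
```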

Calculate Average of Column Data with Headers

I have data for example that looks like this:
Flats 2b
01/1991, 3.45
01/1992, 4.56
01/1993, 4.21
01/1994, 5.21
01/1995, 7.09
01/2013, 6.80
Eagle 2
01/1991, 4.22
01/1992, 6.32
01/1993, 5.21
01/1994, 8.09
01/1995, 7.92
01/2013, 6.33
I'm trying to calculate an average of column 2 so that my desired output looks like this preferably:
Flats 2b
Avg = 4.67
Eagle 2
Avg = 5.26
or even simpler that looks like this without the header:
Avg = 4.67
Avg = 5.26
and so on...since the input file is full of many headers with data like that shown above.
I have tried pattern matching and using NR with something like this as an awk one-liner, without success:
awk '/01/1991,/01/1993 {sum+=$2; cnt+=1} {print "Avg =" sum/cnt}' myfile.txt
I get averages, but not my desired averages for JUST the years 1991, 1992 and 1993, separately for each met tower.
Your help is much appreciated!
If you want to consider only the years 1991-1993
#! /usr/bin/awk -f

# new header: print the pending average if one exists, then reset
/[a-zA-Z]/ {
    if (cnt > 0) {
        print header;
        printf("Avg = %.2f\n", sum/cnt);
    }
    header = $0; sum = 0; cnt = 0;
}

# accumulate values for the average
/^01\/199[123]/ { sum += $2; cnt++; }

# print the last average
END {
    if (cnt > 0) {
        print header;
        printf("Avg = %.2f\n", sum/cnt);
    }
}
This awk script looks for a header line, prints the pending average if there is one, and resets all variables for the next calculation. For each matching data row it accumulates the sum needed for the average. After the last line is read, the END block prints the final average.
The script considers only the years 1991 through 1993 inclusive. If you want to include more years, you can either duplicate the calculation line or combine multiple year patterns with the or operator ||
# calculate average
/^01\/199[0-9]/ || /^01\/200[0-9]/ { sum+=$2; cnt++; }
This takes all the 1990s and 2000s into account.
If you don't want to print the headers, remove the two print header lines.
You call this awk script as
awk -f script.awk myfile.txt
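For reference, saving the script as script.awk and running it on the sample data from the question gives the following (years 1991-1993 only, so these numbers differ from any all-years average):

```shell
cat > script.awk <<'EOF'
/[a-zA-Z]/ {
    if (cnt > 0) { print header; printf("Avg = %.2f\n", sum/cnt); }
    header = $0; sum = 0; cnt = 0;
}
/^01\/199[123]/ { sum += $2; cnt++; }
END {
    if (cnt > 0) { print header; printf("Avg = %.2f\n", sum/cnt); }
}
EOF

cat > myfile.txt <<'EOF'
Flats 2b
01/1991, 3.45
01/1992, 4.56
01/1993, 4.21
01/1994, 5.21
01/1995, 7.09
01/2013, 6.80
Eagle 2
01/1991, 4.22
01/1992, 6.32
01/1993, 5.21
01/1994, 8.09
01/1995, 7.92
01/2013, 6.33
EOF

awk -f script.awk myfile.txt
# Flats 2b
# Avg = 4.07
# Eagle 2
# Avg = 5.25
```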

not getting sort output in awk

I am using the book "The AWK Programming Language" by Aho, Kernighan, and Weinberger.
On page 20 they give a program that doesn't work on my system.
emp.data is
Beth 4.00 0
Dan 3.75 0
Kathy 4.00 10
Mark 5.00 20
Mary 5.50 22
Susie 4.25 18
The program they give is
awk '{ printf("%6.2f %s\n" , $2*$3, $0) }' emp.data | sort
and the output they show is sorted numerically by the computed pay. But my output is
0.00 Beth 4.00 0
0.00 Dan 3.75 0
100.00 Mark 5.00 20
121.00 Mary 5.50 22
40.00 Kathy 4.00 10
76.50 Susie 4.25 18
So what's happening?
1. Your awk is broken, or you have control characters in your input.
2. Your printf syntax is wrong (but would still produce correct output).
To get "2" out of the way: printf is a builtin language construct, not a function. When you do this:
printf("%s",foo)
you are not calling a printf function with 2 arguments; you are invoking the printf builtin with 1 argument which you are constructing from "(", "%s", ",", "foo" and ")". The correct syntax is simply:
printf "%s",foo
but you can stick brackets around any of that; it won't add any value, but it won't break anything either. Any of these would "work" in the same way:
printf ("%s"),foo
printf "%s",(foo)
printf ("%s"),(foo)
printf (((((((((("%s",foo))))))))))
More importantly, though, is point 1 above: you're telling awk to produce output formatted as:
"%6.2f ...."
which means the number should be right-aligned in a field 6 characters wide, padded with up to 2 leading spaces, but your output has no leading spaces on the first line. That is impacting your "sort", but there's more going on here too, since given the strings:
 2
10
it doesn't matter whether you do a numeric sort or an alphabetic sort: 2 is numerically less than 10, and a space also sorts before a digit, so the result should be the same either way.
Your posted output, though, implies that your sort is sorting alphabetically in such a way that "100" is less than " 40", which is just not the way sort works. Even if a space were somehow greater than "1" alphabetically in your locale, it wouldn't explain why you get the equivalent of:
2
10
3
in your output, i.e. sometimes it treats space as less than a digit and other times as greater.
Since your awk is clearly producing bad output there is definitely a problem with either your awk or your input file, so I think it's unlikely that there's also a problem with your sort tool.
Try these commands and post your result if you'd like help debugging your problem:
$ awk '{ printf "%6.2f\n" , $2*$3 }' emp.data
0.00
0.00
40.00
100.00
121.00
76.50
$ awk '{ printf "%6.2f\n" , $2*$3 }' emp.data | sort
0.00
0.00
40.00
76.50
100.00
121.00
I had one other thought: if you messed up the copy/paste of your awk output, then maybe it's a locale issue. Try doing this:
export LC_ALL=C
and then running the command again.
Try sort -n at the end to do a numeric sort. The default sort puts 10 before 2.
They have assumed sort sorts numerically; your sort appears to default to alphabetic.
Have a look at your sort command line options to see if you can make it numeric.
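A quick illustration of the difference (with LC_ALL=C to rule out locale effects):

```shell
# Lexicographic (default): compares character by character, "1" < "4"
printf '%s\n' 100.00 40.00 0.00 | LC_ALL=C sort
# 0.00
# 100.00
# 40.00

# Numeric: compares by value
printf '%s\n' 100.00 40.00 0.00 | LC_ALL=C sort -n
# 0.00
# 40.00
# 100.00
```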