How to find the maximum value of a field, ignoring lines with characters, using awk?

Since I am a newbie to awk, please help me with your suggestions. I tried the commands below to filter the maximum value and to ignore the first and last lines of the sample text file separately. They work when I try them separately.
My query:
I need to ignore the last line and the first few lines of the file, and then find the maximum value of field 7 using awk.
I also need to ignore the lines that contain alphabetic characters. Can anyone suggest how to use both commands together to get the required output?
Sample file:
Linux 3.10.0-957.5.1.el7.x86_64 (j051s784) 11/24/2020 _x86_64_ (8 CPU)
12:00:02 AM kbmemfree kbmemused %memused kbbuffers kbcached kbcommit %commit kbactive kbinact kbdirty
12:10:01 AM 4430568 61359128 93.27 1271144 27094976 66771548 33.04 39005492 16343196 1348
12:20:01 AM 4423380 61366316 93.28 1271416 27102292 66769396 33.04 39012312 16344668 1152
12:30:04 AM 4406324 61383372 93.30 1271700 27108332 66821724 33.06 39028320 16343668 2084
12:40:01 AM 4404100 61385596 93.31 1271940 27107724 66799412 33.05 39031244 16344532 1044
06:30:04 PM kbmemfree kbmemused %memused kbbuffers kbcached kbcommit %commit kbactive kbinact kbdirty
07:20:01 PM 3754904 62034792 94.29 1306112 27555948 66658632 32.98 39532204 16476848 2156
Average: 4013043 61776653 93.90 1293268 27368986 66755606 33.03 39329729 16427160 2005
Commands used:
cat testfile | awk '{print $7}' | head -n -1 | tail -n+7
awk 'BEGIN{a= 0}{if ($7>0+a) a=$7} END{print a}' testfile
Expected output:
The maximum value of column 7, excluding the lines that contain alphabetic characters.

1st solution (generic solution): Adding one generic solution here, where we send the field name whose maximum we want to an awk variable; the script automatically finds that field's number from the very first line and works accordingly. This assumes that your first line contains the field name you are looking for.
awk -v var="kbcached" '
FNR==1{
  for(i=1;i<=NF;i++){
    if($i==var){ field=i }
  }
  next
}
/kbmemused/{
  next
}
{
  if($2!~/^[AP]M$/){
    val=$(field-1)
  }
  else{
    val=$field
  }
}
{
  max=(max>val?max:val)
  val=""
}
END{
  print "Maximum value is:" max
}
' Input_file
2nd solution (as per shown samples only): Could you please try the following, based on your shown samples only. I am assuming you want the value of the kbcached column.
awk '
/kbmemfree/{
  next
}
{
  if($2!~/^[AP]M$/){
    val=$6
  }
  else{
    val=$7
  }
}
{
  max=(max>val?max:val)
  val=""
}
END{
  print "Maximum value is:" max
}
' Input_file
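With the sample file above, both solutions print Maximum value is:27555948. The $2!~/^[AP]M$/ test handles the Average: row, where the time-of-day column is absent and every field is shifted left by one, which is why the field number is reduced by one there.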

awk '$7 ~ /^[[:digit:]]+$/ && $1 != "Average:" {
  max[$7]=""
}
END {
  PROCINFO["sorted_in"]="@ind_num_asc"
  for (i in max) {
    maxtot=i
  }
  print maxtot
}' file
One liner:
awk '$7 ~ /^[[:digit:]]+$/ && $1 != "Average:" { max[$7]="" } END { PROCINFO["sorted_in"]="@ind_num_asc";for (i in max) { maxtot=i } print maxtot }' file
Using GNU awk, search for lines where field 7 contains only digits and field one is not "Average:". For each of these, create an array entry with field 7 as the index. At the end, sort the array in ascending numeric index order and loop through it, setting a maxtot variable; the last entry in the max array will be the highest kbcached, so print maxtot.
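For reference, here is a portable sketch of the same idea that tracks the running maximum directly, without GNU awk's PROCINFO sorting (same filters as above; the maximum is seeded from the first qualifying row rather than from zero):
awk '$7 ~ /^[[:digit:]]+$/ && $1 != "Average:" {
  if (!seeded || $7 + 0 > max) {   # seed max from the first qualifying row
    max = $7 + 0
    seeded = 1
  }
}
END { print max }' file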

Related

Extract columns by matching, rename, and assign value using AWK

I have a tab-delimited csv file containing summary statistics for object lengths:
sampled. objs. obj. min. len. obj. mean. len. obj. max. len. obj. std.
50 22 60 95 5
I want to extract the minimum and maximum lengths by matching the column headers obj. min. len. and obj. max. len.. I then want to create a new csv file, comma-delimited, with new column headers, to get the result
object_minimum,object_maximum
22,95
I first print the new headers. Then I tried retrieving the indices of the match and then extracting from the second row using these indices:
#!/bin/awk -f
BEGIN {
  cols="object_minimum:object_maximum"
  FS="\t"
  RS="\n"
  col_count=split(cols, col_arr, ":");
  for(i=1; i<=col_count; i++) printf col_arr[i] ((i==col_count) ? "\n" : ",");
}
{
  for (i=1; i<=NF; i++) {
    if(index($i,"obj. min. len.") !=0) {
      data["object_minimum"]=i;
    }
    if(index($i,"obj. max. len.") !=0) {
      data["object_maximum"]=i;
    }
  }
}
END NR==1 {
  for (j=1; j<=col_count; j++) printf NF==data[j] ((i==col_count) ? "\n" : ",");
}
There could be more columns, and in a different order, so it is necessary to do the matching to find the positions; I may also have to select more columns by changing cols and looking for more matches. I execute it by running
awk -f awk_script.awk original.csv > new.csv
With awk:
awk 'BEGIN {FS="\t"; OFS=","}
NR==1 {for (i=1; i<=NF; i++){f[$i] = i}} # fill array with header
NR> 1 {print $(f["obj. min. len."]), $(f["obj. max. len."])}' file
Output:
22,95
Source: https://unix.stackexchange.com/a/359699/74329
See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
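If more columns must be selected later, the same header-lookup idea can be driven by a small name map; here is a sketch under the question's assumptions (tab-separated input, output names object_minimum and object_maximum):
awk '
BEGIN {
  FS = "\t"; OFS = ","
  # map output header -> input header (names taken from the question)
  want["object_minimum"] = "obj. min. len."
  want["object_maximum"] = "obj. max. len."
  print "object_minimum", "object_maximum"
}
NR == 1 { for (i = 1; i <= NF; i++) f[$i] = i; next }  # remember each header's position
{ print $(f[want["object_minimum"]]), $(f[want["object_maximum"]]) }
' original.csv > new.csv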
Here is one working prototype; add formatting and error checking as needed...
$ awk -F'\t' -v OFS=, '
    NR==1 {for(i=1;i<=NF;i++)
             if($i=="obj. min. len.") min=i;
             else if($i=="obj. max. len.") max=i;
           print "min","max"}
    NR==2 {print $min,$max; exit}' file
min,max
22,95
Could you please try the following, completely based on your shown samples only, written and tested in GNU awk. It creates an awk variable named sep="###"; this can be changed as needed.
awk -v sep="###" '
BEGIN{
  OFS=","
}
FNR==1{
  while(match($0,/ +obj\./)){
    val=substr($0,RSTART,RLENGTH)
    sub(/^ +/,"",val)
    line=(line?line:"")substr($0,1,RSTART-1)sep val
    $0=substr($0,RSTART+RLENGTH)
  }
  if(substr($0,RSTART+RLENGTH)!=""){
    line=line substr($0,RSTART+RLENGTH)
  }
  num=split(line,arr,sep)
  for(i=1;i<=num;i++){
    if(arr[i]=="obj. min. len."){ min=i }
    if(arr[i]=="obj. max. len."){ max=i }
  }
  print "object_minimum,object_maximum"
  next
}
{
  print $min,$max
}
' Input_file
Logical explanation: the script works only on the very first line of Input_file. It uses awk's match function to repeatedly look for the regex  +obj\. in the current line; on each match it appends to a variable line the text before the match, the separator ###, and the matched value itself (this assumes ### is NOT present in your Input_file; change it to something else if it is). Once all occurrences of the regex have been consumed, it splits line into an array on the separator. It then goes through all elements of that array: if an element is obj. min. len. it stores that index in the min variable, and if it is obj. max. len. it stores it in the max variable (these indices are the field numbers for the rest of the lines). After the first line is processed, the script simply prints the corresponding fields with $min,$max.
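As a quick standalone illustration of the match()/RSTART/RLENGTH consume loop used above (the sample line and the regex here are hypothetical):
awk 'BEGIN {
  $0 = "sampled. objs.\tobj. min. len.\tobj. mean. len.\tobj. max. len."
  while (match($0, /obj\. [a-z]+\. len\./)) {  # find the next header of interest
    print substr($0, RSTART, RLENGTH)          # the matched text
    $0 = substr($0, RSTART + RLENGTH)          # consume it and keep scanning
  }
}'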

How to change whole numbers in a file to decimal values

I need to edit the amounts in a file delimited by "|": I need to change the whole numbers into decimals for fields 1 and 4 alone. Could someone help me here?
Input
1|A|b|1|5468|k|l|78789
3434|c|d|3434|045958|l|h|784889
12000|e|f|12000|6767474|klk|kjjhf|890898
200000|g|h|200000|5676474|jfjjf|teyt|67878
Output
1.00|A|b|1.00|5468|k|l|78789
34.34|c|d|34.34|045958|l|h|784889
120.00|e|f|120.00|6767474|klk|kjjhf|890898
2000.00|g|h|2000.00|5676474|jfjjf|teyt|67878
Could you please try the following.
awk -F"|" '
{
if(/0+$/){
for(i=1;i<=NF;i++){
$i=substr($i,1,length($i)-1)"."substr($i,length($i)-1)
}
}
else{
for(i=1;i<=NF;i++){
$i=sprintf("%.02f",$i)
}
}
}
1
' OFS="|" Input_file
Output will be as follows.
1.00|1.00
3434.00|3434.00
1200.00|1200.00
20000.00|20000.00
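Note that the answer above reformats every field. If the intent is to shift a decimal point two places into fields 1 and 4 only (i.e. divide by 100, which matches the expected output for every sample row except the first), a minimal sketch would be:
awk '
BEGIN { FS = OFS = "|" }
{
  $1 = sprintf("%.2f", $1 / 100)   # e.g. 3434 -> 34.34
  $4 = sprintf("%.2f", $4 / 100)   # fields 1 and 4 only, per the question
  print
}' Input_file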

Move values to column based on row value

In the input file, the date block (column 1) changes every 4 lines, as in the examples for days 061218 and 061418, but not in the case of date 061318, which contains 8 lines.
So in the case where the date does not change after 4 lines, as for date 061318, the values from the second block (lines 5-8) need to be appended to the END of lines 1-4 to produce the desired output file.
Input file
061218,2660,2660,2661
061218,0,0,0,0
061218,48,30,569
061218,SD/05,F1/R0,SD/05
061318,2654,2654
061318,0,0
061318,114,60
061318,SD/05,F1/R0
061318,2666
061318,0
061318,1
061318,F1/R0
061418,2648,2648,2649
061418,0,0,0
061418,871,868,876
061418,SD/05,F1/R0,SD/05
Output file
061218,2660,2660,2661
061218,0,0,0,0
061218,48,30,569
061218,SD/05,F1/R0,SD/05
061318,2654,2654,2666
061318,0,0,0
061318,114,60,1
061318,SD/05,F1/R0,F1/R0
061418,2648,2648,2649
061418,0,0,0
061418,871,868,876
061418,SD/05,F1/R0,SD/05
I tried:
awk -F, '{a[$1]=a[$1]?a[$1]","$2:$2;}END{for (i in a)print i, a[i];}' OFS=, file
Thanks in advance
If your Input_file is the same as the shown sample (which you mentioned in the comments it is), then could you please try the following.
awk '
BEGIN{
  FS=OFS=","
}
prev!=$1 && prev{
  for(i=1;i<=count;i++){
    print prev,a[prev,i]
  }
  prev=count=""
}
{
  prev=$1
  sub(/[^,]*,/,"")
  if(count==4){
    count=1
  }
  else{
    count++
  }
  a[prev,count]=a[prev,count]?a[prev,count] OFS $0:$0
}
END{
  if(prev){
    for(i=1;i<=count;i++){
      print prev,a[prev,i]
    }
  }
}' Input_file
Change the above a[prev,count]=... line to a[prev,count]=(a[prev,count]?a[prev,count] OFS:"")$0, in Ed Morton sir's style, to shorten it and make it compatible with other awks too.
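The idea behind the script: strip the date from each row, accumulate the remainder into one of four per-date buckets (count cycles from 1 to 4), and flush the four buckets whenever the date changes and once more at END; a date spanning eight input rows therefore comes out as four merged rows.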

awk group by multiple columns and print max value with non-primary key

I'm new to this site and trying to learn awk. I'm trying to find the maximum value of field 3, grouping by field 1, and print all the fields of the row holding that maximum. Field 2 contains a time, which means that for each item there are 96 values of field 2, field 3 and field 4.
input file: (comma separated)
item1,00:15,10,30
item2,00:45,20,45
item2,12:15,30,45
item1,00:30,20,56
item3,23:00,40,44
item1,12:45,50,55
item3,11:15,30,45
desired output:
item1,12:45,50,55
item2,12:15,30,45
item3,11:15,30,45
What I tried so far:
BEGIN{
  FS=OFS=","
}
{
  if (a[$1]<$3){
    a[$1]=$3
  }
}
END{
  for (i in a ){
    print i,a[i]
  }
}
but this only prints
item1,50
item2,30
item3,30
But I need to print the corresponding field 2 and field 4 along with the max value, as shown in the desired output. Please help.
The problem here is that you are not storing the whole line, so when you go through the final data there is no full data to print.
What you need to do is to use another array, say data[index]=full line:
BEGIN{
  FS=OFS=","
}
{
  if (a[$1]<$3){
    a[$1]=$3
    data[$1]=$0   # store it here!
  }
}
END {
  for (i in a )
    print data[i] # print it here
}
Or as a one-liner:
$ awk 'BEGIN{FS=OFS=","} {if (a[$1]<$3) {a[$1]=$3; data[$1]=$0}} END{for (i in a) print data[i]}' file
item1,12:45,50,55
item2,12:15,30,45
item3,23:00,40,44
With a little help of the sort command:
sort -t, -k1,1 -k3,3nr file | awk -F, '!seen[$1]++'
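Here sort orders the file by item (field 1) and, within each item, by field 3 in descending numeric order; the awk filter !seen[$1]++ is then true only for the first line of each item, i.e. the row holding that item's maximum.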
To do this job robustly you need:
$ cat tst.awk
BEGIN { FS="," }
!($1 in max) {
max[$1] = $3
data[$1] = $0
keys[++numKeys] = $1
}
$3 > max[$1] {
max[$1] = $3
data[$1] = $0
}
END {
for (keyNr=1; keyNr<=numKeys; keyNr++) {
print data[keys[keyNr]]
}
}
$ awk -f tst.awk file
item1,12:45,50,55
item2,12:15,30,45
item3,23:00,40,44
When doing min/max calculations you should always seed your min/max value with the first value read rather than assuming it'll always be less than or greater than some arbitrary value (e.g. zero-or-null if you skip the !($1 in max) block above).
You need the keys array to preserve input order when printing the output. If you use in instead then the output order will be random.
Note that idiomatic awk syntax is simply:
<condition> { <action> }
not C-style:
{ if ( <condition> ) { <action> } }

awk | Add new row or update existing row in a file

I want to update file1 on the basis of file2. If any row is new in file2 then it should be added in file1. If any row from file2 is already in file1, then update that row with the row from file2 if the time is greater in file2.
file1
DL,1111111100,201312051013,val,FIX01,OptIn,N,Ext1,Ext2
DL,1111111101,201312051014,val,FIX01,OptIn,Y,Ext1,Ext2
DL,1111111102,201312051015,val,FIX01,OptIn,Y,Ext1,Ext2
DL,1111111103,201312051016,val,FIX01,OptIn,N,Ext1,Ext2
file2
DL,1111111101,201312041013,val,FIX02,OptIn,N,Ext1,Ext2
DL,1111111102,201312051016,val,FIX02,OptIn,N,Ext1,Ext2
DL,1111111102,201312051017,val,FIX02,OptIn,N,Ext1,Ext2
DL,1111111104,201312051014,val,FIX01,OptIn,Y,Ext1,Ext2
DL,1111111104,201312051016,val,FIX02,OptIn,Y,Ext1,Ext2
newfile1
DL,1111111100,201312051013,val,FIX01,OptIn,N,Ext1,Ext2
DL,1111111101,201312051014,val,FIX01,OptIn,Y,Ext1,Ext2
DL,1111111102,201312051017,val,FIX02,OptIn,N,Ext1,Ext2
DL,1111111103,201312051016,val,FIX01,OptIn,N,Ext1,Ext2
DL,1111111104,201312051016,val,FIX02,OptIn,Y,Ext1,Ext2
Notes:
2nd field should be unique in the output.
Addition of a new value: the latest row for value "1111111104" in file2 is taken, which is newer (201312051016) than the old value (201312051014), on the basis of the date column (3rd field).
Update of an existing value: "1111111102" is updated with the newer value on the basis of the date in the 3rd column.
file1 is very LARGE whereas file2 has 5-10 entries only.
The row with 2nd field "1111111101" doesn't need to be updated because its entry in file1 already has the later date "201312051014" compared to the new date "201312041013" in file2.
I haven't tried much on this because the conditions are really complex for me as a beginner.
BEGIN { FS = OFS = "," }
FNR == NR {
  m = $2;
  a[m] = $0;
  next
}
{
  if ($2 in a) {
    split(a[$2], datetime, ",")
    if ($3 > datetime[3])
      print $0;
    else
      print a[$2] "Old time"
  }
  else print $0 "NOMATCH";
  delete a[$2];
}
Assuming that you can start your awk as follows:
awk -f script.awk input2.csv input1.csv > result.csv
you can use the following script to obtain the desired output:
BEGIN {
  FS = OFS = ","
}
FILENAME == "input2.csv" {
  date[$2] = $3
  data[$2] = $0
  used[$2] = 0
}
FILENAME == "input1.csv" {
  if ($2 in date) {
    used[$2] = 1
    if ($3 < date[$2])
      print data[$2]
    else
      print $0
  } else {
    print $0
  }
}
END {
  for (key in used) {
    if (used[key] == 0)
      print data[key]
  }
}
Notes:
The script takes advantage of the assumption that file2 is smaller than file1, because it uses an array only for the few entries in file2.
The new entries are simply appended to the output. There is no sorting; if that is required, some extra effort will be needed.
EDIT
Heeding @JonathanLeffler's remark about the way I determine which file is being processed, I would like to offer an alternate version that may (or may not :-) ) be a little more straightforward to understand than checking NR==FNR. However, it only works for sufficiently recent versions of awk which are capable of returning the size of an array as length(array):
BEGIN {
  FS = ","
}
{
  # The following effectively creates an array entry for each filename found
  # (for "known" filenames the existing entry is overwritten).
  files[FILENAME] = 1
  # check the number of files we have so far
  if (length(files) == 1) {
    # we are still in the first file
    date[$2] = $3
    data[$2] = $0
    used[$2] = 0
  } else {
    # we are in the second file (or any other following file)
    if ($2 in date) {
      used[$2] = 1
      if ($3 < date[$2])
        print data[$2]
      else
        print $0
    } else {
      print $0
    }
  }
}
END {
  for (key in used) {
    if (used[key] == 0)
      print data[key]
  }
}
Also, if you require your output to be sorted according to the second field, you can replace the call to awk by this:
awk -f script.awk input2.csv input1.csv | sort -t "," -n -k 2 > result.csv
The latter, of course, works for both versions of the script.
Since file1 is very large but file2 is very small (5-10 entries), you need to read all of file2 into memory first, dealing with the duplicate values. As a result, you'll have an array indexed by the record number with the new data; you should also keep the date for each record in a separate array. Then, as you read the main file, you look up the record number and the date in the arrays, and if you need to, substitute the saved new record for the incoming old record.
Your outline script is most of the way there. It is more complex because you didn't save the dates coming in. This more or less works:
awk -F, '
FNR == NR { if (!($2 in date) || date[$2] < $3) { date[$2] = $3; line[$2] = $0; } next; }
{
  if ($2 in date) {
    if (date[$2] > $3)
      print line[$2]
    else
      print
    delete line[$2]
    delete date[$2]
  }
  else
    print
}
END { for (l in line) print line[l]; }' file2 file1
Sample output for given data:
DL,1111111100,201312051013,val,FIX01,OptIn,N,Ext1,Ext2
DL,1111111101,201312051014,val,FIX01,OptIn,Y,Ext1,Ext2
DL,1111111102,201312051017,val,FIX02,OptIn,N,Ext1,Ext2
DL,1111111103,201312051016,val,FIX01,OptIn,N,Ext1,Ext2
DL,1111111104,201312051016,val,FIX02,OptIn,Y,Ext1,Ext2
However, if there were 4 new records, there's no guarantee that they'd be in sorted order, though they would all be at the end of the list. It would be possible to upgrade the script to print the new records at the appropriate place in the list if the input is guaranteed to be in sorted order. You simply have to search through the list of lines to see whether there are any lines that should be printed before the current line, and if so, do so (and delete the record so that they are not printed at the end).
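For example, here is a hedged sketch of that upgrade using GNU awk's ordered array traversal, assuming file1 is sorted ascending on field 2 (the keys are fixed-width digit strings, so string order matches numeric order):
awk -F, '
BEGIN { PROCINFO["sorted_in"] = "@ind_str_asc" }  # GNU awk: for-in visits keys in order
FNR == NR { if (!($2 in date) || date[$2] < $3) { date[$2] = $3; line[$2] = $0 } next }
{
  # flush any saved new records whose key sorts before the current line
  for (k in line) {
    if (k >= $2) break
    print line[k]; delete line[k]; delete date[k]
  }
  if ($2 in date) {
    print ($3 < date[$2] ? line[$2] : $0)
    delete line[$2]; delete date[$2]
  }
  else
    print
}
END { for (k in line) print line[k] }' file2 file1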
Note that uniqueness in the output depends on uniqueness in the input (file1). That is, if field 2 in the input is repeated, this code won't notice. There is also nothing that can be done with the current design even if a duplicate was spotted; the old row has been printed so printing the new row will simply cause the duplicate. If you were worried about this, you could design the awk script to keep the whole of file1 in memory and only print anything when the whole of the input has been processed. Needless to say, this uses a lot more memory than the current design, and will generally be less efficient because of that. Nevertheless, it could be done if needed.