Extract maximum and minimum using awk with groupby - awk

I'm new to this site and trying to learn awk. I'm trying to find the maximum value of field 5, grouping by year and month.
For every month (of a year), I want to print just the line with the maximum probability.
Input file (comma-separated):
year,month,lat,lng,probability
0,0,40,331,1.00000
0,2,38,334,0.01111
0,2,38,334,0.05511
0,4,38,335,0.06667
0,8,38,336,0.16667
1,2,39,334,0.12222
1,2,39,335,0.04444
1,4,39,336,0.02222
1,4,40,333,0.14444
1,4,40,334,0.12222
2,6,40,335,0.06667
2,6,40,336,0.14444
Desired output file:
months,lat,lng
2,38,334
4,38,335
8,38,336
14,40,333
16,40,336
Thank you everyone for the help.

There are inconsistencies in your example. If by 'group' you mean that a group defined by $1,$2 needs to have more than one entry, that explains why 0,40,331 is not included. But then why is 4,38,335 included?
Anyway, you ask for a start and here it is:
$ awk 'BEGIN{FS=OFS=","}
NR==1{print $2,$3,$4; next}
NR==FNR && FNR>1 {
    if ($5>max[$1 OFS $2]) max[$1 OFS $2]=$5
    next
}
max[$1 OFS $2]==$5 { print $1*12+$2,$3,$4 }
' file file
Prints:
month,lat,lng
0,40,331
2,38,334
4,38,335
8,38,336
14,39,334
16,40,333
30,40,336
Notice that the script traverses the file twice (by naming file twice on the command line). The first pass finds the max of each group defined by $1,$2; the second pass prints the matching line.
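As an aside, a single-pass variant is possible if you don't mind the groups coming out in unspecified order (for-in traversal order is implementation-defined in awk). A minimal sketch of that idea, where key, max, and line are just illustrative names:

$ awk 'BEGIN{FS=OFS=","}
FNR>1 {
    key=$1 OFS $2
    if (!(key in max) || $5>max[key]) { max[key]=$5; line[key]=$1*12+$2 OFS $3 OFS $4 }
}
END { print "month,lat,lng"; for (key in line) print line[key] }
' file

The two-pass version above is preferable when the output should follow the input's line order.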
If you only want groups with more than one entry included, count them:
$ awk 'BEGIN{FS=OFS=","}
NR==1{print $2,$3,$4; next}
NR==FNR && FNR>1 {
    cnt[$1 OFS $2]++
    if ($5>max[$1 OFS $2]) max[$1 OFS $2]=$5
    next
}
max[$1 OFS $2]==$5 && cnt[$1 OFS $2]>1 { print $1*12+$2,$3,$4 }
' file file
month,lat,lng
2,38,334
14,39,334
16,40,333
30,40,336
I acknowledge that this is different from your example, but I think your example needs more explanation.

Thank you all, and thanks @dawg for the help.
I want to share my final code as feedback:
#!/bin/bash
awk 'BEGIN{FS=OFS=","}
NR==1{print "months",$3,$4; next}
NR==FNR && FNR>1 {
    if ($5>max[$1,$2])
        max[$1,$2]=$5
    next
}
{if (max[$1,$2] == $5)
    print $1*12+$2,$3,$4;}' example.csv example.csv
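With the sample input above, this prints:

months,lat,lng
0,40,331
2,38,334
4,38,335
8,38,336
14,39,334
16,40,333
30,40,336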

Related

Grouping duplicated fields with awk

I have the following file:
ID|2018-04-29
ID|2018-04-29
ID|2018-04-29
ID1|2018-06-26
ID1|2018-06-26
ID1|2018-08-07
ID1|2018-08-22
Using awk, I want to add a $3 that groups the duplicated IDs based on $1 and $2, so that the output would be:
ID|2018-04-29|group1
ID|2018-04-29|group1
ID|2018-04-29|group1
ID1|2018-06-26|group2
ID1|2018-06-26|group2
ID1|2018-08-07|group3
ID1|2018-08-22|group4
I tried the following code but it does not give me the desired output. Also, I am not sure whether I can apply it to a column with a date in it.
awk -F"|" '{print $0,"group"++seen[$1,$3]}' OFS="|"
Any hints on how to achieve it using awk (one-liner, if possible) would be highly appreciated.
With your shown samples, please try the following awk code.
awk -v OFS="|" '!arr[$0]++{count++} {print $0,"group"count}' Input_file
Explanation: a detailed explanation of the above.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section of this program from here.
    OFS="|" ##Setting OFS to | here.
}
!arr[$0]++{ ##Checking if the current line is NOT already present in the array; if so, do the following.
    count++ ##Increasing count by 1 here.
}
{
    print $0,"group"count ##Printing the current line with "group" and the count value here.
}
' Input_file ##Mentioning Input_file name here.
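With the shown samples, this produces the requested output:

ID|2018-04-29|group1
ID|2018-04-29|group1
ID|2018-04-29|group1
ID1|2018-06-26|group2
ID1|2018-06-26|group2
ID1|2018-08-07|group3
ID1|2018-08-22|group4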
Using awk, I want to add a $3 that groups the duplicated IDs based
on $1 and $2, so that the output would be
Using $1 and $2
If the input file is sorted, then:
$ awk 'BEGIN{FS=OFS="|"}{print $0, "group" (!a[$1,$2]++?++c:c)}' file
ID|2018-04-29|group1
ID|2018-04-29|group1
ID|2018-04-29|group1
ID1|2018-06-26|group2
ID1|2018-06-26|group2
ID1|2018-08-07|group3
ID1|2018-08-22|group4
If the file is not sorted, then:
$ awk 'BEGIN{FS=OFS="|"}{k=$1 SUBSEP $2}!(k in a){a[k]=++c}{print $0, "group" a[k]}' file
ID|2018-04-29|group1
ID|2018-04-29|group1
ID|2018-04-29|group1
ID1|2018-06-26|group2
ID1|2018-06-26|group2
ID1|2018-08-07|group3
ID1|2018-08-22|group4
More readable version:
awk 'BEGIN{
    FS=OFS="|"
}
{
    k=$1 SUBSEP $2
}
!(k in a){
    a[k]=++c
}
{
    print $0, "group" a[k]
}' file
BEGIN {OFS = FS = "|"}
{
    if ($0 != prev) { # new item
        prev = $0
        print $1, $2, "group" ++g
    }
    else {
        print $1, $2, "group" g
    }
}
Note that the list has to be sorted (from your example, I assume it is).
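If it isn't sorted, it can be sorted on the two key fields first; a sketch, assuming the script above is saved as group.awk (a hypothetical name):

$ sort -t'|' -k1,1 -k2,2 file | awk -f group.awk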
This is my first time posting an answer here. I hope the code is readable for you and that it helps.

Extract columns by matching, rename, and assign value using AWK

I have a tab-delimited csv file containing summary statistics for object lengths:
sampled. objs. obj. min. len. obj. mean. len. obj. max. len. obj. std.
50 22 60 95 5
I want to pull the minimum and maximum lengths by matching the column headers obj. min. len. and obj. max. len.. I then want to create a new comma-delimited csv file with new column headers, giving the result:
object_minimum,object_maximum
22,95
I first print the new headers. Then I tried retrieving the indices of the matches and extracting from the second row using those indices:
#!/bin/awk -f
BEGIN {
    cols="object_minimum:object_maximum"
    FS="\t"
    RS="\n"
    col_count=split(cols, col_arr, ":");
    for(i=1; i<=col_count; i++) printf col_arr[i] ((i==col_count) ? "\n" : ",");
}
{
    for (i=1; i<=NF; i++) {
        if(index($i,"obj. min. len.") !=0) {
            data["object_minimum"]=i;
        }
        if(index($i,"obj. max. len.") !=0) {
            data["object_maximum"]=i;
        }
    }
}
END NR==1 {
    for (j=1; j<=col_count; j++) printf NF==data[j] ((i==col_count) ? "\n" : ",");
}
There could be more columns, and in a different order, so it is necessary to match to find each position; I may also have to select more columns by changing cols and looking for more matches. I execute by running:
awk -f awk_script.awk original.csv > new.csv
With awk:
awk 'BEGIN {FS="\t"; OFS=","}
NR==1 {for (i=1; i<=NF; i++){f[$i] = i}} # fill array with header
NR> 1 {print $(f["obj. min. len."]), $(f["obj. max. len."])}' file
Output:
22,95
Source: https://unix.stackexchange.com/a/359699/74329
See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
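If you also want the new object_minimum,object_maximum headers the question asks for, here is a small variant of the answer above (a sketch; it assumes the header names match exactly):

$ awk 'BEGIN {FS="\t"; OFS=","; print "object_minimum,object_maximum"}
NR==1 {for (i=1; i<=NF; i++){f[$i] = i}}
NR>1 {print $(f["obj. min. len."]), $(f["obj. max. len."])}' file
object_minimum,object_maximum
22,95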
Here is one working prototype; add formatting and error checking as needed:
$ awk -F'\t' -v OFS=, '
NR==1 {for(i=1;i<=NF;i++)
if($i=="obj. min. len.") min=i;
else if($i=="obj. max. len.") max=i;
print "min","max"}
NR==2 {print $min,$max; exit}' file
min,max
22,95
Could you please try the following, based completely on your shown samples, written and tested in GNU awk. It creates an awk variable named sep="###"; this can be changed as needed.
awk -v sep="###" '
BEGIN{
    OFS=","
}
FNR==1{
    while(match($0,/ +obj\./)){
        val=substr($0,RSTART,RLENGTH)
        sub(/^ +/,"",val)
        line=(line?line:"")substr($0,1,RSTART-1)sep val
        $0=substr($0,RSTART+RLENGTH)
    }
    if(substr($0,RSTART+RLENGTH)!=""){
        line=line substr($0,RSTART+RLENGTH)
    }
    num=split(line,arr,sep)
    for(i=1;i<=num;i++){
        if(arr[i]=="obj. min. len."){ min=i }
        if(arr[i]=="obj. max. len."){ max=i }
    }
    print "object_minimum,object_maximum"
    next
}
{
    print $min,$max
}
' Input_file
Logical explanation: The script works on the very first line of Input_file. It uses awk's match function to look for the regex / +obj\./ in the current line, building up a variable from the matched text and the text preceding each match, until all occurrences of the regex have been consumed. It then splits this newly created variable into an array on the separator ### (assuming ### is NOT present in your Input_file; otherwise change it to something else). Finally, it walks through the array's elements: if an element is obj. min. len. it sets the min variable to that index (which is the field number for the remaining lines), and if an element is obj. max. len. it sets the max variable. Once the first line is processed, it simply prints the corresponding fields with $min,$max.
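For reference, match() sets RSTART (where the match begins) and RLENGTH (how long it is), which is what drives the loop above; a tiny illustration:

$ echo 'foo obj. bar' | awk '{ if (match($0, / +obj\./)) print RSTART, RLENGTH }'
4 5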

Print the first column of multiple files using awk

I have 20 files and I want to print the first column of each file into a different file. I need 20 output files.
I have tried the following command, but it puts all of the output into a single file.
awk '{print $1}' /home/gee/SNP_data/20* > out_file
How do I write the output to different files? I have 20 input files.
1st solution: Could you please try the following.
awk '
FNR==1{
    if(file){
        close(file)
    }
    file="out_file_"FILENAME".txt"
}
{
    print $1 > (file)
}
' /home/gee/SNP_data/20*
Explanation: An explanation of the above code.
awk ' ##Starting awk program here.
FNR==1{ ##Checking condition: if FNR==1 then do the following.
    if(file){ ##Checking condition: if variable file is NOT NULL then do the following.
        close(file) ##Using close to close the previously opened output file, to avoid a "too many open files" error.
    } ##Closing BLOCK for if condition.
    file="out_file_"FILENAME".txt" ##Setting variable file to the string out_file_, then FILENAME (the current Input_file), with .txt appended.
} ##Closing BLOCK for the FNR==1 condition here.
{
    print $1 > (file) ##Printing the first field to file here.
}
' /home/gee/SNP_data/20* ##Mentioning the Input_file path here to pass in the files.
2nd solution: In case you need output files named like Output_file_1.txt and so on, then try the following. I have created an awk variable named out_file, where you can change the output file name's prefix (as per your need).
awk -v out_file="Output_file_" '
FNR==1{
    if(file){
        close(file)
    }
    ++count
    file=out_file count".txt"
}
{
    print $1 > (file)
}
' /home/gee/SNP_data/20*
Awk has a built-in redirection operator; you can use it like:
awk '{ print $1 > ("out_" FILENAME) }' /home/gee/SNP_data/20*
or, even better:
awk 'FNR==1 { close(f); f=("out_" FILENAME) } { print $1 > f }' /home/gee/SNP_data/20*
The former is just an example usage of the redirection operator; the latter shows how to use it robustly.
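One caveat: FILENAME includes any leading directories, so with inputs like /home/gee/SNP_data/20* the target becomes out_/home/gee/SNP_data/..., which fails unless that directory layout already exists under out_. A sketch that strips the path first (sub() is POSIX awk):

$ awk 'FNR==1 { close(f); fn=FILENAME; sub(/.*\//, "", fn); f="out_" fn } { print $1 > f }' /home/gee/SNP_data/20*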

Move values to column based on row value

In the input file, the date block (column 1) changes every 4 lines, as for dates 061218 and 061418, but not in the case of date 061318, which contains 8 lines.
In the case where the date still has not changed by the 5th line, as with date 061318 in the example, the values of the second part (lines 5-8) need to be appended to the END of lines 1-4 to correctly produce the desired output file.
Input file
061218,2660,2660,2661
061218,0,0,0,0
061218,48,30,569
061218,SD/05,F1/R0,SD/05
061318,2654,2654
061318,0,0
061318,114,60
061318,SD/05,F1/R0
061318,2666
061318,0
061318,1
061318,F1/R0
061418,2648,2648,2649
061418,0,0,0
061418,871,868,876
061418,SD/05,F1/R0,SD/05
Output file
061218,2660,2660,2661
061218,0,0,0,0
061218,48,30,569
061218,SD/05,F1/R0,SD/05
061318,2654,2654,2666
061318,0,0,0
061318,114,60,1
061318,SD/05,F1/R0,F1/R0
061418,2648,2648,2649
061418,0,0,0
061418,871,868,876
061418,SD/05,F1/R0,SD/05
I tried:
awk -F, '{a[$1]=a[$1]?a[$1]","$2:$2;}END{for (i in a)print i, a[i];}' OFS=, file
Thanks in advance.
If your Input_file is the same as the shown sample (which you mentioned in your comments it is), then could you please try the following.
awk '
BEGIN{
    FS=OFS=","
}
prev!=$1 && prev{
    for(i=1;i<=count;i++){
        print prev,a[prev,i]
    }
    prev=count=""
}
{
    prev=$1
    sub(/[^,]*,/,"")
    if(count==4){
        count=1
    }
    else{
        count++
    }
    a[prev,count]=a[prev,count]?a[prev,count] OFS $0:$0
}
END{
    if(prev){
        for(i=1;i<=count;i++){
            print prev,a[prev,i]
        }
    }
}' Input_file
To follow Ed Morton sir's style, change the a[prev,count] line above to a[prev,count]=(a[prev,count]?a[prev,count] OFS:"")$0, which is shorter and compatible with other awks too.
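For reference, the sub(/[^,]*,/,"") call in the script strips the leading date field from each record before it is stored; a tiny illustration:

$ echo '061318,2654,2654' | awk '{sub(/[^,]*,/,""); print}'
2654,2654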

While Read and AWK to Change Field

I have two files - FileA and FileB. FileA has 10 fields with 100 lines. If Field1 and Field2 match, Field3 should be changed. FileB has 3 fields. I am reading in FileB with a while loop to match the two fields and to get the value that should be used for field 3.
while IFS=$'\t' read hostname interface metric; do
awk -v var1=${hostname} -v var2=${interface} -v var3=${metric} '{if ($1 ~ var1 && $2 ~ var2) $3=var3; print $0}' OFS="\t" FileA.txt
done < FileB.txt
At each iteration of the loop, this prints all of FileA.txt with the single matching line changed. I only want it to print the line that was changed.
Please Help!
It's a code smell to be calling awk once for each line of FileB. You should be able to accomplish this task with a single pass through each file.
Try something like this:
awk -F'\t' -v OFS='\t' '
# first, read in data from file B
NR == FNR { values[$1 FS $2] = $3; next }
# then, output modified lines from matching lines in file A
($1 FS $2) in values { $3 = values[$1 FS $2]; print }
' fileB fileA
I'm assuming that you actually want to match with string equality instead of ~ pattern matching.
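If you genuinely need ~ pattern matching rather than equality, you can still make a single pass over fileA by storing fileB's patterns in arrays; a sketch (pat1, pat2, val, and n are illustrative names, and it loops over every fileB entry for each fileA line):

awk -F'\t' -v OFS='\t' '
NR == FNR { pat1[++n] = $1; pat2[n] = $2; val[n] = $3; next }
{
    for (i = 1; i <= n; i++)
        if ($1 ~ pat1[i] && $2 ~ pat2[i]) { $3 = val[i]; print; break }
}
' fileB fileA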
I only want it to print the line that was changed.
Simply put your print $0 statement inside the if clause body:
'{if ($1 ~ var1 && $2 ~ var2) { $3=var3; print $0 }}'
or even shorter:
'$1~var1 && $2~var2{ $3=var3; print $0 }'