Extract columns by matching, rename, and assign value using AWK

I have a tab delimited csv file containing summary statistics for object lengths:
sampled. objs. obj. min. len. obj. mean. len. obj. max. len. obj. std.
50 22 60 95 5
I want to extract the minimum and maximum lengths by matching the column headers obj. min. len. and obj. max. len.. I then want to create a new, comma-delimited csv file with new column headers, to get the result
object_minimum,object_maximum
22,95
I first print the new headers. Then I tried retrieving the indices of the match and then extracting from the second row using these indices:
#!/bin/awk -f
BEGIN {
    cols="object_minimum:object_maximum"
    FS="\t"
    RS="\n"
    col_count=split(cols, col_arr, ":");
    for(i=1; i<=col_count; i++) printf col_arr[i] ((i==col_count) ? "\n" : ",");
}
{
    for (i=1; i<=NF; i++) {
        if(index($i,"obj. min. len.") !=0) {
            data["object_minimum"]=i;
        }
        if(index($i,"obj. max. len.") !=0) {
            data["object_maximum"]=i;
        }
    }
}
END NR==1 {
    for (j=1; j<=col_count; j++) printf NF==data[j] ((i==col_count) ? "\n" : ",");
}
There could be more columns, and in a different order, so it is necessary to do the matching to find the positions; I may also have to select more columns by changing cols and looking for more matches. I execute by running
awk -f awk_script.awk original.csv > new.csv

With awk:
awk 'BEGIN {FS="\t"; OFS=","}
NR==1 {for (i=1; i<=NF; i++){f[$i] = i}} # fill array with header
NR> 1 {print $(f["obj. min. len."]), $(f["obj. max. len."])}' file
Output:
22,95
Source: https://unix.stackexchange.com/a/359699/74329
See: 8 Powerful Awk Built-in Variables – FS, OFS, RS, ORS, NR, NF, FILENAME, FNR
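The header-map idea extends naturally to the renaming the question asks for. Here is a sketch along those lines (the new[] pairs come straight from the question; columns are emitted in the order they appear in the file):
awk 'BEGIN {
    FS="\t"; OFS=","
    # old header -> new header, as requested in the question
    new["obj. min. len."]="object_minimum"
    new["obj. max. len."]="object_maximum"
}
NR==1 {
    for (i=1; i<=NF; i++)
        if ($i in new) { idx[++n]=i; hdr[n]=new[$i] }   # remember position and new name
    for (j=1; j<=n; j++) printf "%s%s", hdr[j], (j==n ? ORS : OFS)
    next
}
{
    for (j=1; j<=n; j++) printf "%s%s", $(idx[j]), (j==n ? ORS : OFS)
}' original.csv > new.csv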

Here is one working prototype; add formatting and error checking as needed (one possible check is sketched after the output):
$ awk -F'\t' -v OFS=, '
NR==1 {for(i=1;i<=NF;i++)
         if($i=="obj. min. len.") min=i;
         else if($i=="obj. max. len.") max=i;
       print "min","max"}
NR==2 {print $min,$max; exit}' file
min,max
22,95
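One way to add the error checking the prototype alludes to, as a sketch (the message and exit code are illustrative; /dev/stderr works in gawk and most modern awks):
awk -F'\t' -v OFS=, '
NR==1 {
    for (i=1; i<=NF; i++)
        if ($i=="obj. min. len.") min=i;
        else if ($i=="obj. max. len.") max=i;
    if (!min || !max) {                               # a required header was not found
        print "required header missing" > "/dev/stderr"
        exit 1
    }
    print "min","max"
}
NR==2 {print $min,$max; exit}' file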

Could you please try the following, based completely on your shown samples and written and tested in GNU awk. It creates an awk variable named sep="###"; this can be changed as needed (e.g. if ### could appear in your data).
awk -v sep="###" '
BEGIN{
  OFS=","
}
FNR==1{
  while(match($0,/ +obj\./)){
    val=substr($0,RSTART,RLENGTH)
    sub(/^ +/,"",val)
    line=(line?line:"") substr($0,1,RSTART-1) sep val
    $0=substr($0,RSTART+RLENGTH)
  }
  if(substr($0,RSTART+RLENGTH)!=""){
    line=line substr($0,RSTART+RLENGTH)
  }
  num=split(line,arr,sep)
  for(i=1;i<=num;i++){
    if(arr[i]=="obj. min. len."){ min=i }
    if(arr[i]=="obj. max. len."){ max=i }
  }
  print "object_minimum,object_maximum"
  next
}
{
  print $min,$max
}
' Input_file
Logical explanation: the script works on the very first line of Input_file, using awk's match function to repeatedly look for the regex  +obj\. in the current line. On each match it appends, to a variable line, the text before the match plus the matched column name, inserting the separator sep in front of the name. Once all occurrences of the regex have been consumed, line (which now holds the first line with ### separators; these are assumed NOT to be present in your Input_file, otherwise change sep to something else) is split into an array. Finally the script goes through all elements of that array: if a column is obj. min. len. it sets min to that index number (which is the field number for the rest of the lines), and if a column is obj. max. len. it sets max. After the first line is processed, it simply prints the corresponding fields with $min,$max.
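Since the whole approach hangs on match() and its side effects, here is a tiny standalone illustration (the string is made up; the printed values hold for it):
awk 'BEGIN {
    s = "foo obj. bar"
    if (match(s, / +obj\./))
        print RSTART, RLENGTH, "[" substr(s, RSTART, RLENGTH) "]"
}'
This prints 4 5 [ obj.]: the match starts at character 4 of s and spans 5 characters, which is exactly what the while loop above uses to peel the header apart.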

Related

How to find the maximum value of a field while ignoring lines with alphabetic characters, using awk?

Since I am a newbie to awk, please help me with your suggestions. I tried the commands below to filter the maximum value and to ignore the first and last lines of the sample text file. They work when I try them separately.
My query:
I need to ignore the last line and the first few lines of the file, and then take the maximum value of field 7 using awk.
I also need to ignore the lines that contain alphabetic characters. Can anyone suggest how to use both commands together to get the required output?
Sample file:
Linux 3.10.0-957.5.1.el7.x86_64 (j051s784) 11/24/2020 _x86_64_ (8 CPU)
12:00:02 AM kbmemfree kbmemused %memused kbbuffers kbcached kbcommit %commit kbactive kbinact kbdirty
12:10:01 AM 4430568 61359128 93.27 1271144 27094976 66771548 33.04 39005492 16343196 1348
12:20:01 AM 4423380 61366316 93.28 1271416 27102292 66769396 33.04 39012312 16344668 1152
12:30:04 AM 4406324 61383372 93.30 1271700 27108332 66821724 33.06 39028320 16343668 2084
12:40:01 AM 4404100 61385596 93.31 1271940 27107724 66799412 33.05 39031244 16344532 1044
06:30:04 PM kbmemfree kbmemused %memused kbbuffers kbcached kbcommit %commit kbactive kbinact kbdirty
07:20:01 PM 3754904 62034792 94.29 1306112 27555948 66658632 32.98 39532204 16476848 2156
Average: 4013043 61776653 93.90 1293268 27368986 66755606 33.03 39329729 16427160 2005
Commands used:
cat testfile | awk '{print $7}' | head -n -1 | tail -n+7
awk 'BEGIN{a= 0}{if ($7>0+a) a=$7} END{print a}' testfile
Expected output:
The maximum value of column 7, excluding every line that contains alphabetic characters.
1st solution (generic): here is a generic solution where you pass the field name whose maximum you want to an awk variable; the script automatically finds its field number from the very first line and works accordingly. This assumes that your first line contains that field name.
awk -v var="kbcached" '
FNR==1{
  for(i=1;i<=NF;i++){
    if($i==var){ field=i }
  }
  next
}
/kbmemused/{
  next
}
{
  if($2!~/^[AP]M$/){
    val=$(field-1)
  }
  else{
    val=$field
  }
}
{
  max=(max>val?max:val)
  val=""
}
END{
  print "Maximum value is:" max
}
' Input_file
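Since the column is located by name, pointing the script at another column is a one-word change. For instance, if the body above were saved in a file (say max_by_name.awk, a hypothetical name):
awk -v var="kbcommit" -f max_by_name.awk Input_file
awk -v var="kbdirty" -f max_by_name.awk Input_file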
2nd solution (as per shown samples only): could you please try the following, based on your shown samples only. I am assuming you want the value of the kbcached column.
awk '
/kbmemfree/{
  next
}
{
  if($2!~/^[AP]M$/){
    val=$6
  }
  else{
    val=$7
  }
}
{
  max=(max>val?max:val)
  val=""
}
END{
  print "Maximum value is:" max
}
' Input_file
awk '$7 ~ /^[[:digit:]]+$/ && $1 != "Average:" {
  max[$7]=""
}
END {
  PROCINFO["sorted_in"]="#ind_num_asc";
  for (i in max) {
    maxtot=i
  }
  print maxtot
}' file
One-liner:
awk '$7 ~ /^[[:digit:]]+$/ && $1 != "Average:" { max[$7]="" } END { PROCINFO["sorted_in"]="#ind_num_asc";for (i in max) { maxtot=i } print maxtot }' file
Using GNU awk, search for lines where field 7 is only numbers and field one is not "Average:". In these instances, create an array entry with field 7 as the index. In the END block, set the array traversal order to ascending numeric index and loop through the array, setting a maxtot variable on each pass. The last entry in the max array is the highest kbcached value, so print maxtot.
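A portable one-pass alternative that avoids the gawk-only PROCINFO["sorted_in"], under the same assumptions (numeric field 7, skip the Average: line), would be a simple running maximum:
awk '$7 ~ /^[[:digit:]]+$/ && $1 != "Average:" && $7+0 > max { max = $7+0 }
END { print max }' file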

Extract maximum and minimum using awk with groupby

I'm new to this site and trying to learn awk. I'm trying to find the maximum value of field 5, grouping by year and also by month:
for every month (of a year), print just the line with the maximum probability.
Input file (comma-separated):
year,month,lat,lng,probability
0,0,40,331,1.00000
0,2,38,334,0.01111
0,2,38,334,0.05511
0,4,38,335,0.06667
0,8,38,336,0.16667
1,2,39,334,0.12222
1,2,39,335,0.04444
1,4,39,336,0.02222
1,4,40,333,0.14444
1,4,40,334,0.12222
2,6,40,335,0.06667
2,6,40,336,0.14444
Desired output:
months,lat,lng
2,38,334
4,38,335
8,38,336
14,40,333
16,40,336
thank you everyone for the help
There are inconsistencies in your example. If by 'group' you mean a group defined by $1,$2 needs to have more than one entry, that explains why 0,40,331 is not included. But then why is 4,38,335 included?
Anyway, you ask for a start and here it is:
$ awk 'BEGIN{FS=OFS=","}
NR==1{print $2,$3,$4; next}
NR==FNR && FNR>1 {
    if ($5>max[$1 OFS $2]) max[$1 OFS $2]=$5
    next
}
max[$1 OFS $2]==$5 { print $1*12+$2,$3,$4}
' file file
Prints:
month,lat,lng
0,40,331
2,38,334
4,38,335
8,38,336
14,39,334
16,40,333
30,40,336
Notice that the script traverses the file twice (by using file twice on the command line). The first time is to find the max of the group defined by $1,$2 and the second time to print that line.
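A single-pass variant is possible if the winning line is remembered alongside the running maximum. A sketch (note two differences from the two-pass script: END-loop output order is not guaranteed, and ties keep the first line seen):
awk 'BEGIN{FS=OFS=","}
NR==1 {print "month",$3,$4; next}
$5 > max[$1 OFS $2] {
    max[$1 OFS $2] = $5                          # new group maximum
    best[$1 OFS $2] = $1*12+$2 OFS $3 OFS $4     # remember the winning row
}
END {for (k in best) print best[k]}' file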
If you only want groups included, count them:
$ awk 'BEGIN{FS=OFS=","}
NR==1{print $2,$3,$4; next}
NR==FNR && FNR>1 {
    cnt[$1 OFS $2]++
    if ($5>max[$1 OFS $2]) max[$1 OFS $2]=$5
    next
}
max[$1 OFS $2]==$5 && cnt[$1 OFS $2]>1 { print $1*12+$2,$3,$4}
' file file
month,lat,lng
2,38,334
14,39,334
16,40,333
30,40,336
I acknowledge that is different than your example, but I think your example needs more explanation.
Thank you all, and thanks @dawg for the help.
I want to give feedback with my final code:
#!/bin/bash
awk 'BEGIN{FS=OFS=","}
NR==1{print "months",$3,$4; next}
NR==FNR && FNR>1 {
    if ($5>max[$1,$2])
        max[$1,$2]=$5
    next
}
{if (max[$1,$2] == $5)
    print $1*12+$2,$3,$4;}' example.csv example.csv

Enumerate lines with same ID in awk

I'm using awk to process the following [sample] of data:
id,desc
168048,Prod_A
217215,Prod_C
217215,Prod_B
168050,Prod_A
168050,Prod_F
168050,Prod_B
What I'm trying to do is to create a column 'item' enumerating the lines within the same 'id':
id,desc,item
168048,Prod_A,#1
217215,Prod_C,#1
217215,Prod_B,#2
168050,Prod_A,#1
168050,Prod_F,#2
168050,Prod_B,#3
Here's what I've tried:
BEGIN {
    FS = ","
    a = 1
}
NR != 1 {
    if (id != $1) {
        id = $1
        printf "%s,%s\n", $0, "#"a
    }
    else {
        printf "%s,%s\n", $0, "#"a++
    }
}
But it messes up the numbering:
168048,Prod_A,#1
217215,Prod_C,#1
217215,Prod_B,#1
168050,Prod_A,#2
168050,Prod_F,#2
168050,Prod_B,#3
Could someone give me some hints?
P.S. The line order doesn't matter
$ awk -F, 'NR>1{print $0,"#"++c[$1]}' OFS=, file
168048,Prod_A,#1
217215,Prod_C,#1
217215,Prod_B,#2
168050,Prod_A,#1
168050,Prod_F,#2
168050,Prod_B,#3
How it works
-F,
This sets the field separator on input to a comma.
NR>1{...}
This limits the commands in braces to lines other than the first, that is, the one with the header.
print $0,"#"++c[$1]
This prints the line followed by # and a count of the number of times that we have seen the first column.
Associative array c keeps a count of the number of times that an id has been seen. For every line, we increment the count for id $1 by 1. ++ increments. Because ++ precedes c[$1], the increment is done before the value is printed (the throwaway example after this list shows the difference).
OFS=,
This sets the field separator on output to a comma.
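The pre- versus post-increment distinction in one throwaway line:
$ awk 'BEGIN {print ++a; print b++; print b}'
1
0
1
++a increments before the value is used, so the count starts at 1 on the first line of each id, which is exactly what we want here.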
Printing a new header as well
$ awk -F, 'NR==1{print $0,"item"} NR>1{print $0,"#"++c[$1]}' OFS=, file
id,desc,item
168048,Prod_A,#1
217215,Prod_C,#1
217215,Prod_B,#2
168050,Prod_A,#1
168050,Prod_F,#2
168050,Prod_B,#3

Comparing the first field of two files and outputting the entire record from both files when the fields match

I have two files, par1.txt and par2.txt. I want to look at the first field (column) of both files, compare them, and if they match, print the rows where they matched.
Example files:
par1.txt
ocean;stuff about an ocean;definitions of oeans
park;stuff about parks;definitions of parks
ham;stuff about ham;definitions of ham
par2.txt
hand,stuff about hands,definitions of hands
bread,stuff about bread,definitions of bread
ocean,different stuff about an ocean,difference definitions of oceans
ham,different stuff about ham,different definitions of ham
As for my output I want something like
ocean:stuff about an ocean:definitions of oeans
ocean:different stuff about an ocean:difference definitions of oceans
ham:different stuff about ham:different definitions of ham
ham:stuff about ham:definitions of ham
The field separators of the two files are different, as shown in the example.
The output FS doesn't have to be ":"; it just can't be a space.
Using awk:
awk -v OFS=":" '
{ $1 = $1 }
NR==FNR { lines[$1] = $0; next }
($1 in lines) { print lines[$1] RS $0 }
' FS=";" par1.txt FS="," par2.txt
Output:
ocean:stuff about an ocean:definitions of oeans
ocean:different stuff about an ocean:difference definitions of oceans
ham:stuff about ham:definitions of ham
ham:different stuff about ham:different definitions of ham
Explanation:
Set the output field separator to :. If you want space-delimited output you don't need to set -v OFS.
$1 = $1 forces awk to rebuild the line so that it takes the value of OFS when the record is reconstructed.
NR==FNR reads the first file into an array.
When we process the second file, we look its first column up in the array. If it is present, we print the stored line from the first file and the line from the second file.
FS=";" par1.txt FS="," par2.txt is a technique whereby you can specify a different field separator for each file (demonstrated in isolation below).
If the first column repeats within both files and you would like to capture everything, then use the following. It is similar logic, but we keep all lines in the array and print them at the END.
awk -v OFS=":" '
{ $1 = $1 }
NR==FNR {
    lines[$1] = (lines[$1] ? lines[$1] RS $0 : $0);
    next
}
($1 in lines) {
    lines[$1] = lines[$1] RS $0;
    seen[$1]++
}
END { for (patt in seen) print lines[patt] }
' FS=";" par1.txt FS="," par2.txt
Edited Answer
Based on your comments, I believe you have more than 2 files, that the files sometimes have commas and sometimes semicolons as separators, and that you want to print any number of lines with matching first fields, as long as more than one line has that first field. If so, I think you want this:
awk -F, '
{
    gsub(/;/,",");$0=$0;              # Replace ";" with "," and reparse line using new field sep
    sep="";                           # Preset record separator to blank
    if(counts[$1]++) sep="\n";        # Add newline if anything already stored in records[$1]
    records[$1] = records[$1] sep $0; # Append this record to other records with same key
}
END { for (x in counts) if (counts[x]>1) print records[x] }' par*.txt
Original Answer
I came up with this:
awk -F';' '
FNR==NR {x[$1]=$0; next}
$1 in x {printf "%s\n%s\n",$0,x[$1]}' par1.txt <(sed 's/,/;/' par2.txt)
Read in par1.txt and store it in array x[], indexed by first field. Replace the first comma in each line of par2.txt with a semicolon so that the first-field separators match. As each line of par2.txt is read, check whether its first field is in the stored array x[] and, if it is, print the current line followed by the stored one.
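An awk-only variation on the original answer that drops the sed preprocessing is to let FS be a character class matching either separator. A sketch, assuming neither ; nor , ever occurs inside a field:
awk -F'[;,]' -v OFS=':' '
{ $1 = $1 }                       # rebuild the record with the output separator
FNR==NR { x[$1] = $0; next }      # store par1.txt lines by first field
$1 in x { print x[$1]; print $0 } # on a match, print both rows
' par1.txt par2.txt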

Use Awk to Print every character as its own column?

I am in need of reorganizing a large CSV file. The first column, which is currently a 6-digit number, needs to be split up, using commas as the field separator.
For example, I need this:
022250,10:50 AM,274,22,50
022255,11:55 AM,275,22,55
turned into this:
0,2,2,2,5,0,10:50 AM,274,22,50
0,2,2,2,5,5,11:55 AM,275,22,55
Let me know what you think!
Thanks!
It's a lot shorter in perl:
perl -F, -ane '$,=","; print split("",$F[0]), @F[1..$#F]' <file>
Since you don't know perl, a quick explanation. -F, indicates the input field separator is the comma (like awk). -a activates auto-split (into the array @F), -n implicitly wraps the code in a while (<>) { ... } loop, which reads input line-by-line. -e indicates the next argument is the script to run. $, is the output field separator (it gets set on every iteration of the loop this way, but oh well). split has its obvious purpose, and you can see how the array is indexed/sliced. print, when given a list of arguments like this, joins them with the output field separator and prints them all.
In awk:
awk -F, '{n=split($1,a,""); for (i=1;i<=n;i++) {printf("%s,",a[i])}; for (i=2;i<NF;i++) {printf("%s,",$i)}; print $NF}' <file>
I think this might work. The split function (at least in the version I am running) splits the value into individual characters if the third parameter is an empty string.
BEGIN{ FS="," }
{
n = split( $1, a, "" );
for ( i = 1; i <= n; i++ )
printf("%s,", a[i] );
sep = "";
for ( i = 2; i <= NF; i++ )
{
printf( "%s%s", sep, $i );
sep = ",";
}
printf("\n");
}
Here's another way in awk:
$ awk -F"," '{gsub(".",",&",$1);sub("^,","",$1)}1' OFS="," file
0,2,2,2,5,0,10:50 AM,274,22,50
0,2,2,2,5,5,11:55 AM,275,22,55
Here's a variation on a theme. One thing to note is it prints the remaining fields without using a loop. Another is that since you're looping over the characters in the first field anyway, why not just do it without using the null-delimiter feature of split() (which may not be present in some versions of AWK):
awk -F, 'BEGIN{OFS=","} {len=length($1); for (i=1;i<len; i++) {printf "%s,", substr($1,i,1)}; printf "%s", substr($1,len,1);$1=""; print $0}' filename
As a script:
BEGIN {FS = OFS = ","}
{
    len = length($1);
    for (i=1; i<len; i++)
        printf "%s,", substr($1, i, 1);
    printf "%s", substr($1, len, 1);
    $1 = "";
    print $0
}
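Usage, assuming the script is saved as split_first.awk (a hypothetical name):
awk -f split_first.awk file.csv > newfile.csv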