How to find the max values from columns - awk

I am trying to find the maximum of each of the three columns in a file called data.dat. Given the input
5414 6267 3157
4521 1235 5418
1366 6472 4598
5153 7814 5648
the desired output is
5414
7814
5648
I'm trying to use awk as
for k in {1..3};awk 'BEGIN {max = 0} {if ('$k'>max) max='$k'} END {print max}' data.dat;done
but I have not been lucky.

Could you please try the following, written and tested with the shown samples in GNU awk.
awk '
{
for(i=1;i<=NF;i++){
arr[i]=(arr[i]>$i?arr[i]:$i)
}
}
END{
for(k=1;k<=NF;k++){
print arr[k]
}
}' Input_file
Explanation: a detailed explanation of the above.
awk ' ##Starting awk program from here.
{
for(i=1;i<=NF;i++){ ##Start a for loop from 1st field to last field of current line.
arr[i]=(arr[i]>$i?arr[i]:$i) ##Creating array arr with index of column number and keeping only the greater value by comparing with its previous value in each iteration.
}
}
END{ ##Starting END block of this awk program from here.
for(k=1;k<=NF;k++){ ##Starting a loop from k=1 till the number of fields here.
print arr[k] ##Printing value of arr with index of k here.
}
}' Input_file ##Mentioning Input_file name here.
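One caveat with starting from an empty array slot: an unset element compares as 0, so a column that contains only negative numbers would not report its real maximum. A minimal sketch that seeds each maximum from the first row instead; the sample numbers and the names max and nf are made up for illustration and are not from the question.

```shell
# Hypothetical input with an all-negative second column.
result=$(printf '%s\n' '3 -7 10' '5 -2 4' '1 -9 8' |
awk '
{
    for (i = 1; i <= NF; i++)
        # Take the first row as the starting maximum, then compare numerically.
        if (NR == 1 || $i + 0 > max[i] + 0)
            max[i] = $i
    if (NF > nf) nf = NF      # remember the widest row for the END loop
}
END {
    for (i = 1; i <= nf; i++)
        print max[i]
}')
printf '%s\n' "$result"
```

With this input the three lines printed should be 5, -2 and 10.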

This is what awk arrays are made for. You can simply loop over each field, using the field number as the array index and comparing the stored value against the current field. If the current field is greater, store it at that index, e.g.
awk '{
for (i=1; i<=NF; i++)
if ($i > a[i])
a[i] = $i
}
END {
for (j = 1; j < i; j++)
print a[j]
}' file
Example Use/Output
For example, with your data in a file named file, you can just open an xterm, select-copy the awk script above, and middle-mouse paste it in the directory containing file to test, e.g.
$ awk '{
> for (i=1; i<=NF; i++)
> if ($i > a[i])
> a[i] = $i
> }
> END {
> for (j = 1; j < i; j++)
> print a[j]
> }' file
5414
7814
5648
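As for why the loop in the question fails: as posted, the for loop has no do, and '$k' closes the single quotes so awk receives a bare number (for k=1 it sees if (1>max) max=1) rather than a field reference. If you do want one awk call per column, a sketch of passing the column number with -v follows; the variable names col and max and the temporary file are assumptions for illustration. The single-pass array solutions above read the file only once and are usually preferable.

```shell
# The question's sample data, written to a temporary file.
data=$(mktemp)
printf '%s\n' '5414 6267 3157' '4521 1235 5418' '1366 6472 4598' '5153 7814 5648' > "$data"

result=$(
    for k in 1 2 3; do
        # -v col="$k" hands the shell value to awk; $col is then field number k.
        awk -v col="$k" '
            NR == 1 || $col + 0 > max + 0 { max = $col }
            END { print max }
        ' "$data"
    done
)
rm -f "$data"
printf '%s\n' "$result"
```

This should print 5414, 7814 and 5648, one per line.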

Related

How to sum up all other columns based on column 1?

I have an example csv file like below (but with way more columns numbering up to Sample 100 and several rows)
Genus,Sample1,Sample2,Sample3
Unclassified,0,1,44
Unclassified,0,0,392
Unclassified,0,0,0
Woeseia,0,0,76
and I would like to have a summed csv file as below where all the identical entries on column 1 are summed up
Genus,Sample1,Sample2,Sample3
Unclassified,0,1,436
Woeseia,0,0,76
I tried the following awk code but it didn't work
awk -F "," 'function SP() {n=split ($0, T); ID=$1}
function PR() {printf "%s", ID; for (i=2; i<=n; i++) printf "\t%s", T[i]; printf "\n"}
NR==1 {SP();next}
$1 != ID {PR(); SP(); next}
{for (i=2; i<=NF; i++) T[i]+=$i}
END {PR()}
' Filename.csv
I am also aware of doing something like below but it is impractical when there are hundreds of columns. Any help here would be appreciated.
awk -F "," ' NR==1 {print; next} NF {a[$1]+=$2; b[$1]+=$3; c[$1]+=$4; d[$1]+=$5; e[$1]+=$6; f[$1]++} END {for(i in a)print i, a[i], b[i], c[i], d[i], e[i], f[i]} ' Filename.csv
With your shown samples, please try the following awk program. You do not need to create that many arrays; one or two are enough here.
awk '
BEGIN { FS=OFS="," }
FNR==1{
print
next
}
{
for(i=2;i<=NF;i++){
arr1[$1]
arr2[$1,i]+=$i
}
}
END{
for(i in arr1){
printf("%s,",i)
for(j=2;j<=NF;j++){
printf("%s%s",arr2[i,j],j==NF?ORS:OFS)
}
}
}
' Input_file
Output will be as follows:
Genus,Sample1,Sample2,Sample3
Unclassified,0,1,436
Woeseia,0,0,76
Explanation: a detailed explanation of the above code.
awk ' ##Starting awk program from here.
BEGIN { FS=OFS="," } ##In BEGIN section setting FS and OFS as comma here.
FNR==1{ ##Checking if this is first line then do following.
print ##Printing current line.
next ##next will skip further statements from here.
}
{
for(i=2;i<=NF;i++){ ##Running for loop from 2nd field till NF here.
arr1[$1] ##Creating arr1 array with index of 1st field.
arr2[$1,i]+=$i ##Creating arr2 with index of 1st field and current field number; the current field value keeps getting added to it.
}
}
END{ ##Starting END block for this program from here.
for(i in arr1){ ##Traversing through all elements of arr1 here one by one.
printf("%s,",i) ##Printing its current index here.
for(j=2;j<=NF;j++){ ##Running for loop from 2nd field till NF here.
printf("%s%s",arr2[i,j],j==NF?ORS:OFS) ##Printing value of arr2 with index of i and j, printing a new line if it is the last field.
}
}
}
' Input_file ##Mentioning Input_file here.
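Note that for (i in arr1) visits the keys in an unspecified order, so the groups may not come out in the order they appear in the file. If input order matters, a sketch that records each key the first time it is seen could look like this; the names seen, order, sum and nf are illustrative and not part of the answer above.

```shell
result=$(printf '%s\n' \
    'Genus,Sample1,Sample2,Sample3' \
    'Unclassified,0,1,44' \
    'Unclassified,0,0,392' \
    'Unclassified,0,0,0' \
    'Woeseia,0,0,76' |
awk '
BEGIN { FS = OFS = "," }
FNR == 1 { print; nf = NF; next }   # header: print it and remember the column count
{
    if (!($1 in seen)) {            # first time this key appears
        seen[$1]
        order[++n] = $1             # record first-seen order
    }
    for (i = 2; i <= NF; i++)
        sum[$1, i] += $i
}
END {
    for (k = 1; k <= n; k++) {
        line = order[k]
        for (i = 2; i <= nf; i++)
            line = line OFS sum[order[k], i]
        print line
    }
}')
printf '%s\n' "$result"
```

Taking the column count from the header row also avoids depending on NF inside END.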
Here's another awk solution:
awk -v FS=',' -v OFS=',' '
NR == 1 {
print
next
}
{
ids[$1]
for (i = 2; i <= NF; i++)
sums[i "," $1] += $i
}
END {
for (id in ids) {
out = id
for (i = 2; i <= NF; i++)
out = out OFS sums[i "," id]
print out
}
}
' Filename.csv
Genus,Sample1,Sample2,Sample3
Unclassified,0,1,436
Woeseia,0,0,76
You can also use a CSV-aware program that provides tools for data analysis.
Here's an example with Miller, which is available as a stand-alone executable:
IFS='' read -r csv_header < Filename.csv
mlr --csv \
stats1 -a sum -g "${csv_header%%,*}" -f "${csv_header#*,}" \
then rename -r '(.*)_sum,\1' \
Filename.csv
Genus,Sample1,Sample2,Sample3
Unclassified,0,1,436
Woeseia,0,0,76

Counting unique occurrences in each column

I have a file with several columns ($2, $3, and so on up to $32), as in
A refdevhet devdevhomo
B refdevhet refdevhet
C refrefhomo refdevhet
D devrefhet refdevhet
I need to count the occurrences of each unique element in each column separately,
so that I have
refdevhet 2 3
refrefhomo 1 0
devrefhet 1 0
devdevhomo 0 1
I tried several variations of
awk 'BEGIN {
FS=OFS="\t"
}
{
for(i=1; i<=32; i++) a[$i]++
}
END {
for (i in a) print i, a[i]
}' file
but instead it's printing the cumulative sum of occurrences of unique elements across the selected fields.
Here is a solution:
BEGIN {
FS=OFS="\t"
}
{
if (NF>mxf) mxf = NF;
for(i=1; i<=NF; i++) {ws[$i]=1; c[$i,i]++}
}
END {
for (w in ws) {
printf "%s", w
for (i=1;i<=mxf;i++) printf "%s%d", OFS, c[w,i];
print ""
}
}
Notice that this solution is general: it takes the first column into consideration as well. To omit the first column, change i=1 to i=2 in both places.
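The script above is shown without an invocation. As a rough sketch, it can be run inline on the question's sample with the i=2 change applied; the tab-separated sample and the trailing sort are assumptions added here, the latter because for (w in ws) returns keys in an unspecified order.

```shell
result=$(printf 'A\trefdevhet\tdevdevhomo\nB\trefdevhet\trefdevhet\nC\trefrefhomo\trefdevhet\nD\tdevrefhet\trefdevhet\n' |
awk '
BEGIN { FS = OFS = "\t" }
{
    if (NF > mxf) mxf = NF                       # widest row seen so far
    for (i = 2; i <= NF; i++) { ws[$i] = 1; c[$i, i]++ }
}
END {
    for (w in ws) {
        printf "%s", w
        for (i = 2; i <= mxf; i++) printf "%s%d", OFS, c[w, i]
        print ""
    }
}' | LC_ALL=C sort)
printf '%s\n' "$result"
```

Each output line is a word followed by its count in column 2 and in column 3, for example refdevhet with 2 and 3.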
In addition to @Andriy's good answer, with GNU awk you can use a 2-dimensional array (an array of arrays):
gawk '
{for (i=2; i<=NF; i++) count[$i][i]++}
END {
for (word in count) {
printf "%s", word
for (i=2; i<=NF; i++) printf "%s%d", OFS, count[word][i]
print ""
}
}
' file | column -t
I'm assuming here that each line has the same number of fields as the last line.

awk to merge lines based on column

I would like to merge lines that share the same first column ($1) into a single row and format the output. The header needs to be repeated as many times as the maximum number of occurrences of any single value in the first field.
For example, Angola appears 3 times, Brazil 5 times, and Zambia once.
The maximum count for field $1 is 5, so the header must be printed 5 times to cover all the fields.
The output should keep the original line order of the input file.
My actual input files vary in width, e.g. 10 fields, 12 fields, etc.
Input.csv
Country,Network,Details,Amount
Angola,voda,xxx,10
Angola,at&t,xxx,20
Angola,mtn,xxx,30
Brazil,voda,yyy,40
Brazil,voda,yyy,50
Brazil,at&t,yyy,60
Brazil,mtn,yyy,70
Brazil,voda,yyy,80
Zambia,tcl,zzz,90
Desired Output.csv
Country,Network,Details,Amount,Country,Network,Details,Amount,Country,Network,Details,Amount,Country,Network,Details,Amount,Country,Network,Details,Amount
Angola,voda,xxx,10,Angola,at&t,xxx,20,Angola,mtn,xxx,30
Brazil,voda,yyy,40,Brazil,voda,yyy,50,Brazil,at&t,yyy,60,Brazil,mtn,yyy,70,Brazil,voda,yyy,80
Zambia,tcl,zzz,90
Presently, I am using the two commands below to get the desired output, changing the count manually each time based on the number of fields in the actual input file.
Step:#1
awk 'BEGIN { while (count++<5) header=header "Country,Network,Details,Amount,"; print header }' > output.csv
Step:#2
awk -F, '
/.+/{
if (!($1 in Val)) { Key[++i] = $1; }
Val[$1] = Val[$1] $0 ",";
}
END{
for (j = 1; j <= i; j++) {
print(Val[Key[j]] );
}
}' input.csv >> output.csv
Looking for your suggestions ...
awk one-liner:
awk 'BEGIN{FS=OFS=","}FNR==1{n=$0;next}{a[$1]=($1 in a ? a[$1] OFS:"")$0; if(!($1 in b)){o[++i]=$1}; b[$1]++; mx=mx>b[$1]?mx:b[$1] }END{for(i=1; i<=mx; i++)printf("%s%s",n,i==mx?RS:OFS); for(i=1; i in o; i++)print a[o[i]]}' infile
Input:
$ cat infile
Country,Network,Details,Amount
Angola,voda,xxx,10
Angola,at&t,xxx,20
Angola,mtn,xxx,30
Brazil,voda,yyy,40
Brazil,voda,yyy,50
Brazil,at&t,yyy,60
Brazil,mtn,yyy,70
Brazil,voda,yyy,80
Zambia,tcl,zzz,90
Output:
$ awk 'BEGIN{FS=OFS=","}FNR==1{n=$0;next}{a[$1]=($1 in a ? a[$1] OFS:"")$0; if(!($1 in b)){o[++i]=$1}; b[$1]++; mx=mx>b[$1]?mx:b[$1] }END{for(i=1; i<=mx; i++)printf("%s%s",n,i==mx?RS:OFS); for(i=1; i in o; i++)print a[o[i]]}' infile
Country,Network,Details,Amount,Country,Network,Details,Amount,Country,Network,Details,Amount,Country,Network,Details,Amount,Country,Network,Details,Amount
Angola,voda,xxx,10,Angola,at&t,xxx,20,Angola,mtn,xxx,30
Brazil,voda,yyy,40,Brazil,voda,yyy,50,Brazil,at&t,yyy,60,Brazil,mtn,yyy,70,Brazil,voda,yyy,80
Zambia,tcl,zzz,90
More readable:
awk 'BEGIN{
FS=OFS=","
}
FNR==1{
n=$0;
next
}
{
a[$1]=($1 in a ? a[$1] OFS:"")$0;
if(!($1 in b)){ o[++i]=$1 };
b[$1]++;
mx=mx>b[$1]?mx:b[$1]
}
END{
for(i=1; i<=mx; i++)
printf("%s%s",n,i==mx?RS:OFS);
for(i=1; i in o; i++)
print a[o[i]]
}' infile
In response to the comment:
Would like to know, where to change the code, to print "Country" in
the Output only first time , if I don't need to print the same country
name 2nd time , third time
$ awk 'BEGIN{FS=OFS=","}FNR==1{n=$0;next}{a[$1]=($1 in a ? a[$1] OFS substr($0,index($0,",")+1) : $0); if(!($1 in b)){o[++i]=$1}; b[$1]++; mx=mx>b[$1]?mx:b[$1] }END{for(i=1; i<=mx; i++)printf("%s%s",i==1?n:substr(n,index(n,",")+1),i==mx?RS:OFS); for(i=1; i in o; i++)print a[o[i]]}' infile
Country,Network,Details,Amount,Network,Details,Amount,Network,Details,Amount,Network,Details,Amount,Network,Details,Amount
Angola,voda,xxx,10,at&t,xxx,20,mtn,xxx,30
Brazil,voda,yyy,40,voda,yyy,50,at&t,yyy,60,mtn,yyy,70,voda,yyy,80
Zambia,tcl,zzz,90
Modified-code:
awk 'BEGIN{
FS=OFS=","
}
FNR==1{
n=$0;
next
}
{
# this line modified:
# find the character position of the first comma and append only the text after it
a[$1]=($1 in a ? a[$1] OFS substr($0,index($0,",")+1) : $0);
if(!($1 in b)){ o[++i]=$1 };
b[$1]++;
mx=mx>b[$1]?mx:b[$1]
}
END{
for(i=1; i<=mx; i++)
# this line modified
printf("%s%s",i==1?n:substr(n,index(n,",")+1),i==mx?RS:OFS);
for(i=1; i in o; i++)
print a[o[i]]
}' infile
Explanation related to modification:
index(in, find)
Search the string in for the first occurrence of the string find, and
return the position in characters where that occurrence begins in the
string in.
substr(string, start [, length ])
Return a length-character-long substring of string, starting at
character number start.
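To make the two functions concrete, here is a small sketch applied to one of the sample lines; the variable names line and pos are only for illustration.

```shell
result=$(awk 'BEGIN {
    line = "Angola,voda,xxx,10"
    pos  = index(line, ",")          # position of the first comma
    print pos
    print substr(line, pos + 1)      # everything after the first comma
    print substr(line, 1, pos - 1)   # the text before the first comma
}')
printf '%s\n' "$result"
```

This should print 7, then voda,xxx,10, then Angola, which is why substr($0,index($0,",")+1) drops the repeated country name in the modified code.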

awk: transpose column header to first field of row

My input looks like this:
A|B|C
1|2|3
4|5|6
Using awk, I am trying to get:
A|1
B|2
C|3
A|4
B|5
C|6
My code:
gawk '
BEGIN{FS=OFS="|"}
NR==1{
for(i=1; i<=NF; i++){
x_i=$i
}
}
NR>1{
for(i=1; i<=NF; i++){
print x_i FS $i
}
}' input
But it keeps only the value from the last iteration of the NR==1 loop, even though I use the same loop in the NR>1 block:
C|1
C|2
C|3
C|4
C|5
C|6
Any trick?
EDIT
Thanks to Jose, I needed to change x_i to x[i].
What if, using the same input, I needed to output:
A;B;C|1|2|3
A;B;C|4|5|6
$ awk 'BEGIN{FS=OFS="|"} NR==1{split($0,h);next} {for (i=1;i<=NF;i++) print h[i], $i}' file
A|1
B|2
C|3
A|4
B|5
C|6
$ awk 'BEGIN{FS=OFS="|"} NR==1{gsub(/\|/,";");h=$0;next} {print h, $0}' file
A;B;C|1|2|3
A;B;C|4|5|6
Read Effective Awk Programming, 4th Edition, by Arnold Robbins.
You can try:
awk 'BEGIN{FS=OFS="|"}
NR==1{for(i=1; i<=NF; ++i) d[i]=$i; next}
{for(i=1; i<=NF; ++i) print d[i], $i}
' input
You get:
A|1
B|2
C|3
A|4
B|5
C|6
Important note
Your logic is correct; you only need x[i] instead of x_i:
gawk '
BEGIN{FS=OFS="|"}
NR==1{
for(i=1; i<=NF; i++){
x[i]=$i
}
}
NR>1{
for(i=1; i<=NF; i++){
print x[i] FS $i
}
}' input
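The reason is that x_i is just a single variable whose name happens to contain an underscore, so every pass of the loop overwrites it, whereas x[i] is a separate array element per field number. A short sketch showing both side by side on the header line:

```shell
result=$(printf 'A|B|C\n' | awk -F'|' '
{
    for (i = 1; i <= NF; i++) {
        x_i  = $i     # one scalar named "x_i", overwritten on every pass
        x[i] = $i     # a separate array element per field number
    }
    print x_i, x[1], x[2], x[3]
}')
printf '%s\n' "$result"
```

This should print C A B C: the scalar holds only the last header, while the array keeps all three.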
Here is another using split and for:
$ awk 'NR==1 { split($0,a,"|") }
NR>1 { n=split($0,b,"|");
for(i=1;i<=n;i++)
print a[i] "|" b[i] }' file
A|1
B|2
C|3
A|4
B|5
C|6

how to collapse fields with zeros from a datafile

I have a file with data as follows
col1,col2,col3,col4,col5,col6,col7,col8,col9,col10
1,2,3,4,5,6,7,8,9,10
1,2,1,2,0,1,0,1,0,1
1,1,1,1,0,2,3,0,0,0
5,1,1,0,0,0,0,0,1,0
I would like to change the delimiter for col6 through col10 to a pipe '|', prefix each value with its column name, and drop the columns whose value is zero.
Desired Output:
1,2,3,4,5,col6:6|col7:7|col8:8|col9:9|col10:10
1,2,1,2,0,col6:1|col8:1|col10:1
1,1,1,1,0,col6:2|col7:3
5,1,1,0,0,col9:1
I tried using the command
awk -F ', *' 'NR==1{for (i=1; i<=NF; i++) hdr[i]=$i; next}
{for (i=1; i<=NF; i++) if ($i>0) printf "%s%s", ((i>5)?hdr[i] ":":"") $i,
((i<NF)? ((i>5)?"|":",") : ORS)}' data.csv
but not getting the result as expected
Output:
1,2,3,4,5,col6:6|col7:7|col8:8|col9:9|col10:10
1,2,1,2,col6:1|col8:1|col10:1
1,1,1,1,col6:2|col7:3|5,1,1,col9:1|
Rows whose last columns are zero end with a '|', and the following row's data starts on the same line.
In the output above, the third line ends its own data with a pipe '|' and then continues with the next row's data, which also ends with a pipe.
Can someone help me fix this, please?
P.S.: For people wondering about the reason behind all this work: I'm trying to load the data from a CSV file into a framework. The source data has 10 columns and the destination dataset would have 6 columns: the first 5 from the source as is, and the rest as a map. I also have to make sure that no map key has a value of zero before starting the data analysis on the set.
This post is to get help for making the data set ready for analysis.
$ awk -F ', *' 'NR==1{for (i=1; i<=NF; i++) hdr[i]=$i":"; next} {for (i=1; i<=5; i++) printf $i","; b=""; for (i=6; i<=NF; i++) if ($i>0) {printf "%s%s", b, hdr[i] $i; b="|";} printf ORS}' data.csv
1,2,3,4,5,col6:6|col7:7|col8:8|col9:9|col10:10
1,2,1,2,0,col6:1|col8:1|col10:1
1,1,1,1,0,col6:2|col7:3
5,1,1,0,0,col9:1
Or, written over multiple lines:
awk -F ', *' '
NR==1{
for (i=1; i<=NF; i++) hdr[i]=$i":"
next
}
{
for (i=1; i<=5; i++) printf $i","
b=""
for (i=6; i<=NF; i++) if ($i>0) {printf "%s%s", b, hdr[i] $i; b="|";}
printf ORS
}
' data.csv
How it works
NR==1{for (i=1; i<=NF; i++) hdr[i]=$i":"; next}
For the first line, NR==1, we save each field and a trailing colon into array hdr. Then, the rest of the commands are skipped and we jump to the next line.
for (i=1; i<=5; i++) printf $i","
If we get here, we are working on the second or later lines. In this case, we print the first five fields, each followed by a comma.
b=""
We initialize the variable b to the empty string.
for (i=6; i<=NF; i++) if ($i>0) {printf "%s%s", b, hdr[i] $i; b="|";}
For fields 6 to the end, if the field is nonzero, we print b followed by the hdr followed by the field value. After we have encountered the first such nonzero field, b is set to |.
printf ORS
After printing the last field, we print an output record separator (default is a newline).
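One thing to keep in mind with printf $i"," is that the field itself becomes the format string, so a field containing a % character would be interpreted as a format specifier. A variant of the same logic with explicit formats is sketched below; the row starting with 10% is a made-up value to show the point, and sep is just a renamed b.

```shell
result=$(printf '%s\n' \
    'col1,col2,col3,col4,col5,col6,col7,col8,col9,col10' \
    '1,2,1,2,0,1,0,1,0,1' \
    '10%,1,1,0,0,0,0,0,1,0' |
awk -F ', *' '
NR == 1 { for (i = 1; i <= NF; i++) hdr[i] = $i ":"; next }
{
    for (i = 1; i <= 5; i++) printf "%s,", $i      # the field is data, not the format
    sep = ""
    for (i = 6; i <= NF; i++)
        if ($i > 0) { printf "%s%s%s", sep, hdr[i], $i; sep = "|" }
    print ""                                       # end the record with ORS
}')
printf '%s\n' "$result"
```

The two data rows should come out as 1,2,1,2,0,col6:1|col8:1|col10:1 and 10%,1,1,0,0,col9:1.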
The above solution is excellent. Helps me with a similar issue. However, I need to cater for an all-zero case in columns 6 to 10. See the last line of your data below.
col1,col2,col3,col4,col5,col6,col7,col8,col9,col10
1,2,3,4,5,6,7,8,9,10
1,2,1,2,0,1,0,1,0,1
1,1,1,1,0,2,3,0,0,0
5,1,1,0,0,0,0,0,1,0
5,1,1,0,0,0,0,0,0,0
This might never happen in your data, however if it does you are left with an inconvenient comma at the end of the line:
1,2,3,4,5,col6:6|col7:7|col8:8|col9:9|col10:10
1,2,1,2,0,col6:1|col8:1|col10:1
1,1,1,1,0,col6:2|col7:3
5,1,1,0,0,col9:1
5,1,1,0,0,
To get around it I made a change. Here it is, somewhat spread out for clarity:
awk -F ', *' '
NR==1{
for (i=1; i<=NF; i++) hdr[i]=$i":"
next
}
{
for (i=1; i<5; i++) printf("%s,", $i);
if(i==5) printf("%s", $i);
b="";
for (i=6; i<=NF; i++) {
if ($i>0) {
if(b=="") b=","; else b="|";
printf("%s%s",b, hdr[i] $i);
}
}
printf(ORS);
}