How to sum up all other columns based on column 1? - awk

I have an example csv file like below (but with way more columns numbering up to Sample 100 and several rows)
Genus,Sample1,Sample2,Sample3
Unclassified,0,1,44
Unclassified,0,0,392
Unclassified,0,0,0
Woeseia,0,0,76
and I would like to have a summed csv file as below where all the identical entries on column 1 are summed up
Genus,Sample1,Sample2,Sample3
Unclassified,0,1,436
Woeseia,0,0,76
I tried the following awk code but it didn't work
awk -F "," 'function SP() {n=split ($0, T); ID=$1}
function PR() {printf "%s", ID; for (i=2; i<=n; i++) printf "\t%s", T[i]; printf "\n"}
NR==1 {SP();next}
$1 != ID {PR(); SP(); next}
{for (i=2; i<=NF; i++) T[i]+=$i}
END {PR()}
' Filename.csv
I am also aware of doing something like below but it is impractical when there are hundreds of columns. Any help here would be appreciated.
awk -F "," ' NR==1 {print; next} NF {a[$1]+=$2; b[$1]+=$3; c[$1]+=$4; d[$1]+=$5; e[$1]+=$6; f[$1]++} END {for(i in a)print i, a[i], b[i], c[i], d[i], e[i], f[i]} ' Filename.csv

With your shown samples, please try the following awk program. You do not need to create that many arrays; one or two are enough here.
awk '
BEGIN { FS=OFS="," }
FNR==1{
print
next
}
{
for(i=2;i<=NF;i++){
arr1[$1]
arr2[$1,i]+=$i
}
}
END{
for(i in arr1){
printf("%s,",i)
for(j=2;j<=NF;j++){
printf("%s%s",arr2[i,j],j==NF?ORS:OFS)
}
}
}
' Input_file
Output will be as follows:
Genus,Sample1,Sample2,Sample3
Unclassified,0,1,436
Woeseia,0,0,76
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program from here.
BEGIN { FS=OFS="," } ##In BEGIN section setting FS and OFS as comma here.
FNR==1{ ##Checking if this is first line then do following.
print ##Printing current line.
next ##next will skip further statements from here.
}
{
for(i=2;i<=NF;i++){ ##Running for loop from 2nd field to till NF here.
arr1[$1] ##Creating arr1 array with index of 1st field.
arr2[$1,i]+=$i ##Adding current field value to arr2, indexed by 1st field and current field number, to keep a running sum.
}
}
END{ ##Starting END block for this program from here.
for(i in arr1){ ##Traversing through arr1 all elements here one by one.
printf("%s,",i) ##Printing its current index here.
for(j=2;j<=NF;j++){ ##Running for loop from 2nd field to till NF here.
printf("%s%s",arr2[i,j],j==NF?ORS:OFS) ##Printing value of arr2 with index of i and j, printing new line if its last field.
}
}
}
' Input_file ##Mentioning Input_file here.
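The arr2[$1,i] subscript above is not a true two-dimensional index: awk joins the parts into one string key using SUBSEP. A minimal sketch of that behaviour, with a made-up key and value used only for illustration:

```shell
out=$(awk 'BEGIN {
    # One element whose key is "Woeseia" SUBSEP "4" (hypothetical sample data).
    arr["Woeseia", 4] = 76
    for (k in arr) {
        # Split the combined key back into its two parts.
        split(k, part, SUBSEP)
        print part[1] " / " part[2] " = " arr[k]
    }
    # Membership test for a multi-part key.
    r = (("Woeseia", 4) in arr) ? "found" : "missing"
    print r
}')
printf '%s\n' "$out"
```

This is why the END block can rebuild every sum from just the genus name and the field number.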

Here's another awk:
awk -v FS=',' -v OFS=',' '
NR == 1 {
print
next
}
{
ids[$1]
for (i = 2; i <= NF; i++)
sums[i "," $1] += $i
}
END {
for (id in ids) {
out = id
for (i = 2; i <= NF; i++)
out = out OFS sums[i "," id]
print out
}
}
' Filename.csv
Genus,Sample1,Sample2,Sample3
Unclassified,0,1,436
Woeseia,0,0,76
You can also use a CSV-aware program that provides tools for data analysis.
Here's an example with Miller, which is available as a stand-alone executable:
IFS='' read -r csv_header < Filename.csv
mlr --csv \
stats1 -a sum -g "${csv_header%%,*}" -f "${csv_header#*,}" \
then rename -r '(.*)_sum,\1' \
Filename.csv
Genus,Sample1,Sample2,Sample3
Unclassified,0,1,436
Woeseia,0,0,76
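Both awk answers above print with a for (x in array) loop, whose traversal order is unspecified, so the rows may not come out in input order. If that matters, one possible sketch is to record each key the first time it is seen. The names order, seen, and maxnf are my own, and the input is a rearranged subset of the sample rows:

```shell
out=$(printf '%s\n' \
    'Genus,Sample1,Sample2,Sample3' \
    'Woeseia,0,0,76' \
    'Unclassified,0,1,44' \
    'Unclassified,0,0,392' | awk '
BEGIN { FS = OFS = "," }
FNR == 1 { print; next }                 # pass the header through
{
    # Remember each key once, in first-seen order.
    if (!($1 in seen)) { seen[$1]; order[++n] = $1 }
    if (NF > maxnf) maxnf = NF
    for (i = 2; i <= NF; i++) sum[$1, i] += $i
}
END {
    for (k = 1; k <= n; k++) {
        line = order[k]
        # "+ 0" prints 0 for a column that never had a value for this key.
        for (i = 2; i <= maxnf; i++) line = line OFS (sum[order[k], i] + 0)
        print line
    }
}')
printf '%s\n' "$out"
```

Tracking maxnf also avoids relying on NF still holding the last record's field count inside END.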

Related

How to sort inside a cell captured by awk

I have a file with rows like following, where 3rd column has multiple numeric values which I need to sort:
file: h1.csv
Class S101-T1;3343-1-25310;3344-1-25446 3345-1-25691 3348-1-27681 3347-1-28453
Class S101-T2;3343-2-25310;3344-2-25446 3345-2-25691
Class S101-T1;3343-3-25310;3345-3-25691 3343-3-25314
Class S101-T2;3343-4-25310;3345-4-25691 3343-4-25314 3344-4-25314
Class S102-T1;3343-5-25310;3344-5-25446 3345-5-25691
So, expected output is:
Class S101-T1;3343-1-25310;3344-1-25446 3345-1-25691 3347-1-28453 3348-1-27681
Class S101-T2;3343-2-25310;3344-2-25446 3345-2-25691
Class S101-T1;3343-3-25310;3343-3-25314 3345-3-25691
Class S101-T2;3343-4-25310;3343-4-25314 3344-4-25314 3345-4-25691
Class S102-T1;3343-5-25310;3344-5-25446 3345-5-25691
My idea was to capture the 3rd column with awk, sort it, and finally print the output, but I have only managed to capture the column. I have not succeeded in sorting it or in printing the desired output.
Here's the code I've got so far...
cat h1.csv | awk -F';' '{ gsub(" ","\n",$3); print $0 }'
I have tried the following (and some others that gave errors):
cat h1.csv | awk -F';' '{ gsub(" ","\n",$3); print $3 | "sort -u" }'
cat h1.csv | awk -F';' '{ gsub(" ","\n",$3); sort -u; print $3 }'
So, is it possible to do this, and if so, how? Any help is appreciated. Thanks.
One option is to split the 3rd column on spaces and then sort the values with asort(), which requires GNU awk.
Then concatenate the first 2 fields with the split and sorted values again.
awk '
BEGIN{FS=OFS=";"}
{
n=split($3, a, " ")
asort(a)
res = $1 OFS $2 OFS
for (i = 1; i <= n; i++) {
res = res (i > 1 ? " " : "") a[i]
}
print res
}' file
Output
Class S101-T1;3343-1-25310;3344-1-25446 3345-1-25691 3347-1-28453 3348-1-27681
Class S101-T2;3343-2-25310;3344-2-25446 3345-2-25691
Class S101-T1;3343-3-25310;3343-3-25314 3345-3-25691
Class S101-T2;3343-4-25310;3343-4-25314 3344-4-25314 3345-4-25691
Class S102-T1;3343-5-25310;3344-5-25446 3345-5-25691
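If GNU awk is not available, the same split, sort, and join steps can be sketched in POSIX awk by replacing asort() with a small hand-written insertion sort. This is only an illustration on the first sample row, and the variable names are my own:

```shell
line='Class S101-T1;3343-1-25310;3344-1-25446 3345-1-25691 3348-1-27681 3347-1-28453'
out=$(printf '%s\n' "$line" | awk '
BEGIN { FS = OFS = ";" }
{
    n = split($3, a, " ")
    # Insertion sort; the appended "" forces a string comparison.
    for (i = 2; i <= n; i++) {
        v = a[i]
        for (j = i - 1; j >= 1 && (a[j] "") > (v ""); j--) a[j + 1] = a[j]
        a[j + 1] = v
    }
    # Join the sorted values back with single spaces.
    s = ""
    for (i = 1; i <= n; i++) s = s (i > 1 ? " " : "") a[i]
    $3 = s
    print
}')
printf '%s\n' "$out"
```

Insertion sort is quadratic, which should be fine for the handful of values per row shown here.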
In GNU awk, with your shown samples, please try following awk code.
awk '
BEGIN{
FS=OFS=";"
PROCINFO["sorted_in"] = "@val_num_asc"
}
{
nf=val=""
delete value
num=split($NF,arr," ")
for(i=1;i<=num;i++){
split(arr[i],arr2,"-")
value[arr2[1]]=arr[i]
}
for(i in value){
nf=(nf?nf " ":"")value[i]
}
$NF=nf
}
1
' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section from here.
FS=OFS=";" ##Setting FS, OFS as ; here.
PROCINFO["sorted_in"] = "@val_num_asc" ##Setting PROCINFO["sorted_in"] so that for-in loops traverse arrays by value in ascending numeric order.
}
{
nf=val="" ##Nullifying variables here.
delete value ##Deleting value array here.
num=split($NF,arr," ") ##Splitting last field into arr with separator as space here.
for(i=1;i<=num;i++){ ##Traversing through all elements of array arr.
split(arr[i],arr2,"-") ##Splitting the current element of arr into arr2 on - to get its first part, e.g. 3344, 3345 etc.
value[arr2[1]]=arr[i] ##Storing the current element of arr in the value array, indexed by its first part.
}
for(i in value){ ##Traversing through array value here.
nf=(nf?nf " ":"")value[i] ##Concatenating all values to nf here.
}
$NF=nf ##Assigning nf to the last field here.
}
1 ##printing edited/non-edited line here.
' Input_file ##Mentioning Input_file name here.
Using GNU awk for sorted_in:
$ cat tst.awk
BEGIN {
FS = OFS = ";"
PROCINFO["sorted_in"] = "@val_str_asc"
}
{
split($3,a," ")
sorted = ""
for (i in a) {
sorted = (sorted=="" ? "" : sorted " ") a[i]
}
$3 = sorted
print
}
$ awk -f tst.awk file
Class S101-T1;3343-1-25310;3344-1-25446 3345-1-25691 3347-1-28453 3348-1-27681
Class S101-T2;3343-2-25310;3344-2-25446 3345-2-25691
Class S101-T1;3343-3-25310;3343-3-25314 3345-3-25691
Class S101-T2;3343-4-25310;3343-4-25314 3344-4-25314 3345-4-25691
Class S102-T1;3343-5-25310;3344-5-25446 3345-5-25691
Note that this assumes alphabetic sort so it'd sort 1000-1-1 before 200-1-1. That works as long as the strings you want sorted are always made up of the same length parts, i.e. 4digits-1digit-5digits.
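To make the caveat about alphabetic sorting concrete, here is a small sketch comparing the two example strings both ways. It runs in any POSIX awk; the "numeric" line relies on awk taking the leading number when a string is used in arithmetic:

```shell
out=$(awk 'BEGIN {
    a = "1000-1-1"; b = "200-1-1"
    # String comparison: "1" sorts before "2", so a comes first.
    print "string: " (a < b ? a : b) " sorts first"
    # Numeric comparison of the leading number: 200 < 1000, so b comes first.
    print "numeric: " ((a + 0) < (b + 0) ? a : b) " sorts first"
}')
printf '%s\n' "$out"
```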

How to merge duplicate lines into same row with primary key and more than one column of information

Here is my data:
NAME1,NAME1_001,NULL,LIC100_1,NULL,LIC300-3,LIC300-6
NAME1,NAME1_003,LIC000_1,NULL,NULL,NULL,NULL
NAME2,NAME2_001,LIC000_1,NULL,LIC400_2,NULL,NULL
NAME3,NAME3_001,NULL,LIC400_2,NULL,NULL,LIC500_1
NAME3,NAME3_005,LIC000_1,NULL,LIC400_2,NULL,NULL
NAME3,NAME3_006,LIC000_1,NULL,LIC400_2,NULL,NULL
NAME4,NAME4_002,NULL,LIC100_1,NULL,LIC300-3,LIC300-6
Expected result:
NAME1|NAME1_001|NULL|LIC100_1|NULL|LIC300-3|LIC300-6|NAME1_003|LIC000_1|NULL|NULL|NULL|NULL
NAME2|NAME2_001|LIC000_1|NULL|LIC400_2|NULL|NULL
NAME3|NAME3_001|NULL|LIC400_2|NULL|NULL|LIC500_1|NAME3_005|LIC000_1|NULL|LIC400_2|NULL|NULL|NAME3_006|LIC000_1|NULL|LIC400_2|NULL|NULL
NAME4|NAME4_002|NULL|LIC100_1|NULL|LIC300-3|LIC300-6
I tried below command, but have no idea how to add the details ($3 to $7)
awk '
BEGIN{FS=","; OFS="|"};
{ arr[$1] = arr[$1] == ""? $2 : arr[$1] "|" $2 }
END {for (i in arr) print i, arr[i] }' file.csv
Any suggestion? thanks!!
Could you please try following. Written and tested with shown samples in GNU awk.
awk '
BEGIN{
FS=","
OFS="|"
}
FNR==NR{
first=$1
$1=""
arr[first]=arr[first] $0
next
}
($1 in arr){
print $1 arr[$1]
delete arr[$1]
}
' Input_file Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section of this program from here.
FS="," ##Setting FS as comma here.
OFS="|" ##Setting OFS as | here.
}
FNR==NR{ ##Checking FNR==NR which will be TRUE when first time Input_file is being read.
first=$1 ##Setting first as 1st field here.
$1="" ##Nullifying first field here; this rebuilds the line with OFS, so it now starts with |.
arr[first]=arr[first] $0 ##Appending the rebuilt line (which already starts with |) to arr with index of first.
next ##next will skip all further statements from here.
}
($1 in arr){ ##Checking condition if 1st field is present in arr then do following.
print $1 arr[$1] ##Printing 1st field with arr value here.
delete arr[$1] ##Deleting arr item here.
}
' Input_file Input_file ##Mentioning Input_file names here.
Another awk:
$ awk '
BEGIN { # set them field separators
FS=","
OFS="|"
}
{
if($1 in a) { # if $1 already has an entry in a hash
t=$1 # store key temporarily
$1=a[$1] # set the a hash entry to $1
a[t]=$0 # and hash the record
} else { # if $1 seen for the first time
$1=$1 # rebuild record to change the separators
a[$1]=$0 # and hash the record
}
}
END { # afterwards
for(i in a) # iterate a
print a[i] # and output
}' file
Assuming your input is grouped by the key field as shown in your example (if it isn't then sort it first) you don't need to store the whole file in memory or read it twice and this will output the lines in the same order they appear in the input:
$ cat tst.awk
BEGIN { FS=","; OFS="|" }
$1 != prev {
if (NR>1) {
print rec
}
prev = rec = $1
}
{
$1 = ""
rec = rec $0
}
END { print rec }
$ awk -f tst.awk file
NAME1|NAME1_001|NULL|LIC100_1|NULL|LIC300-3|LIC300-6|NAME1_003|LIC000_1|NULL|NULL|NULL|NULL
NAME2|NAME2_001|LIC000_1|NULL|LIC400_2|NULL|NULL
NAME3|NAME3_001|NULL|LIC400_2|NULL|NULL|LIC500_1|NAME3_005|LIC000_1|NULL|LIC400_2|NULL|NULL|NAME3_006|LIC000_1|NULL|LIC400_2|NULL|NULL
NAME4|NAME4_002|NULL|LIC100_1|NULL|LIC300-3|LIC300-6
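For input that is not already grouped by the key, the "sort it first" step might look like the sketch below. The three input rows are made up for illustration, and -s (stable sort, so rows with the same key keep their relative order) is an assumption about your sort: it is supported by GNU and BSD sort but is not required by POSIX.

```shell
out=$(printf '%s\n' 'B,B_1,x' 'A,A_1,y' 'B,B_2,z' |
    sort -t, -k1,1 -s |
    awk '
    BEGIN { FS = ","; OFS = "|" }
    # New key: flush the previous merged record and start a new one.
    $1 != prev { if (NR > 1) print rec; prev = rec = $1 }
    # Blanking $1 rebuilds $0 with OFS, leaving a leading "|".
    { $1 = ""; rec = rec $0 }
    END { print rec }')
printf '%s\n' "$out"
```

The output is then ordered by key rather than by first appearance in the original file.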

How to find the max values from columns

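I am trying to find the maximum of each of the three columns in a file called data.dat. For the four input rows below, the expected output is the three values that follow them (5414, 7814, 5648):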
5414 6267 3157
4521 1235 5418
1366 6472 4598
5153 7814 5648
5414
7814
5648
I'm trying to use awk as
for k in {1..3};awk 'BEGIN {max = 0} {if ('$k'>max) max='$k'} END {print max}' data.dat;done
but I have not been lucky.
Could you please try following, written and tested with shown samples in GNU awk.
awk '
{
for(i=1;i<=NF;i++){
arr[i]=(arr[i]>$i?arr[i]:$i)
}
}
END{
for(k=1;k<=NF;k++){
print arr[k]
}
}' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
{
for(i=1;i<=NF;i++){ ##Start a for loop from 1st field to last field of current line.
arr[i]=(arr[i]>$i?arr[i]:$i) ##Creating array arr with index of column number and keeping only the greater value by comparing against its previous value in each iteration.
}
}
END{ ##Starting END block of this awk program from here.
for(k=1;k<=NF;k++){ ##Starting a loop from k=1 to till number of fields here.
print arr[k] ##Printing value of arr with index of k here.
}
}' Input_file ##Mentioning Input_file name here.
This is what awk arrays are made for. You can simply loop over each field, using the field number as the array index and comparing the value against the current field value. If it is greater, update the value at that index with the current value, e.g.
awk '{
for (i=1; i<=NF; i++)
if ($i > a[i])
a[i] = $i
}
END {
for (j = 1; j < i; j++)
print a[j]
}' file
Example Use/Output
For example, with your data in the filename file, you can just open an xterm and select-copy the awk script above and middle-mouse paste in the current directory containing file to test, e.g.
$ awk '{
> for (i=1; i<=NF; i++)
> if ($i > a[i])
> a[i] = $i
> }
> END {
> for (j = 1; j < i; j++)
> print a[j]
> }' file
5414
7814
5648
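One caveat with both answers: they compare each field against an array element that starts out unset, and an unset element compares like 0, so a column holding only negative numbers would not be reported correctly. A minimal sketch that seeds each maximum from the first value seen in that column, using made-up data with a negative column:

```shell
out=$(printf '%s\n' '-5 3' '-2 1' '-9 7' | awk '
{
    for (i = 1; i <= NF; i++)
        # Take the first value seen for a column as its initial maximum.
        if (!(i in max) || $i + 0 > max[i]) max[i] = $i + 0
    if (NF > n) n = NF
}
END { for (i = 1; i <= n; i++) print max[i] }')
printf '%s\n' "$out"
```

For the all-positive sample in the question, the result is the same as with the answers above.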

Extract sequence from list of data into separate line

sample.txt has tab-separated columns, and the third column holds semicolon-separated values that need to be split onto separate lines, with each dash-separated range of numbers expanded into its individual values.
cat sample.txt
2 2627 588;577
2 2629 566
2 2685 568-564
2 2771 573
2 2773 597
2 2779 533
2 2799 558
2 6919 726;740-742;777
2 7295 761;771-772
Please note that some lines may have an inverted range, such as 568-564.
Using my previous script, I managed to split the values, but failed to expand the ranges (the values separated by a dash).
#!/bin/sh
awk -F"\t" '{print $1}' $1 >> $2 &&
awk -F"\t" '{print $2}' $1 >> $2 &&
awk -F"\t" '{print $3}' $1 >> $2 &&
sed -i "s/^M//;s/;\r//g" $2
#!/bin/awk -f
BEGIN { FS=";"; recNr=1}
!NF { ++recNr; lineNr=0; next }
{ ++lineNr }
lineNr == 1 { next }
recNr == 1 { a[lineNr] = $0 }
recNr == 2 { b[lineNr] = $0 }
recNr == 3 {
for (i=1; i<=NF; i++) {
print a[lineNr] "," b[lineNr] "," $i
}
}
Expected
2,2627,588
2,2627,577
2,2629,566
2,2685,564
2,2685,565
2,2685,566
2,2685,567
2,2685,568
2,2771,573
2,2773,597
2,2779,533
2,2799,558
2,6919,726
2,6919,740
2,6919,741
2,6919,742
2,6919,777
2,7295,761
2,7295,771
2,7295,772
Could you please try the following.
awk '
BEGIN{
OFS=","
}
{
num=split($NF,array,";")
for(i=1;i<=num;i++){
if(array[i]~/-/){
split(array[i],array2,"-")
to=array2[1]>array2[2]?array2[1]:array2[2]
from=array2[1]<array2[2]?array2[1]:array2[2]
while(from<=to){
print $1,$2,from++
}
}
else{
print $1,$2,array[i]
}
from=to=""
}
}
' Input_file
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section of code here.
OFS="," ##Setting OFS as comma here.
}
{
num=split($NF,array,";") ##Splitting last field of line into an array named array with delimiter semi-colon here.
for(i=1;i<=num;i++){ ##Starting a for loop from 1 to till value of num which is actually length of array created in previous step.
if(array[i]~/-/){ ##Checking condition if array value with index i is having dash then do following.
split(array[i],array2,"-") ##Split value of array with index i to array2 here with delimiter -(dash) here.
to=array2[1]>array2[2]?array2[1]:array2[2] ##Creating to variable which will compare 2 elements of array2 and have maximum value out of them here.
from=array2[1]<array2[2]?array2[1]:array2[2] ##Creating from variable which will compare 2 elements of array2 and will have minimum out of them.
while(from<=to){ ##Running while loop from variable from to till value of variable to here.
print $1,$2,from++ ##Printing 1st, 2nd fields with value of from variable and increasing from value with 1 each time it comes here.
}
}
else{ ##Mention else part of if condition here.
print $1,$2,array[i] ##Printing only 1st, 2nd fields along with value of array with index i here.
}
from=to="" ##Nullifying variables from and to here.
}
}
' Input_file ##Mentioning Input_file name here.
Adding a link that explains the ? : conditional expression, as suggested in James's comments:
https://www.gnu.org/software/gawk/manual/html_node/Conditional-Exp.html
For shown sample output will be as follows.
2,2627,588
2,2627,577
2,2629,566
2,2685,564
2,2685,565
2,2685,566
2,2685,567
2,2685,568
2,2771,573
2,2773,597
2,2779,533
2,2799,558
2,6919,726
2,6919,740
2,6919,741
2,6919,742
2,6919,777
2,7295,761
2,7295,771
2,7295,772
$ awk '
BEGIN {
FS="( +|;)" # input field separator is space or ;
OFS="," # output fs is comma
}
{
for(i=3;i<=NF;i++) { # from the 3rd field to the end
n=split($i,t,"-") # split on - if any. below loop from smaller to greater
if(n) # in case of empty fields
for(j=(t[1]<t[n]?t[1]:t[n]); j<=(t[1]<t[n]?t[n]:t[1]);j++)
print $1,$2,j # output
}
}' file
Output
2,2627,588
2,2627,577
2,2629,566
2,2685,564 <─┐
2,2685,565 │
2,2685,566 ├─ wrong order, from smaller to greater
2,2685,567 │
2,2685,568 <─┘
2,2771,573
2,2773,597
2,2779,533
2,2799,558
2,6919,726
2,6919,740
2,6919,741
2,6919,742
2,6919,777
2,7295,761
2,7295,771
2,7295,772
Tested on GNU awk, mawk, Busybox awk and awk version 20121220.
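Both answers expand every range from smaller to greater, which matches the expected output in the question. In case the direction written in the input should be kept instead (568 down to 564), a possible variation is sketched below. It assumes tab-separated input as described in the question, and the names part, r, lo, hi, and step are my own:

```shell
out=$(printf '2\t2685\t568-564\n2\t7295\t761;771-772\n' | awk '
BEGIN { FS = "\t"; OFS = "," }
{
    n = split($3, part, ";")
    for (i = 1; i <= n; i++) {
        m = split(part[i], r, "-")
        if (m == 0) continue             # skip empty pieces
        lo = r[1] + 0; hi = r[m] + 0     # a single value gives lo == hi
        step = (lo <= hi) ? 1 : -1       # walk in the direction given
        for (v = lo; v != hi + step; v += step) print $1, $2, v
    }
}')
printf '%s\n' "$out"
```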

Conditional transposition in awk based on column values

I'm trying to make the below transformation using awk.
Input:
status,parent,child,date
first,foo,bar,2019-01-01
NULL,foo,bar,2019-01-02
NULL,foo,bar,2019-01-03
last,foo,bar,2019-01-04
NULL,foo,bar,2019-01-05
blah,foo,bar,2019-01-06
NULL,foo,bar,2019-01-07
first,bif,baz,2019-01-02
NULL,bif,baz,2019-01-03
last,bif,baz,2019-01-04
Expected output:
parent,child,first,last
foo,bar,2019-01-01,2019-01-04
bif,baz,2019-01-02,2019-01-04
I'm pretty stumped by this problem, and haven't got anything to show yet - any pointers would be very helpful.
Could you please try following.
awk '
BEGIN{
FS=OFS=SUBSEP=","
print "parent,child,first,last"
}
$1=="first" || $1=="last"{
a[$1,$2,$3]=$NF
b[$2,$3]
}
END{
for(i in b){
print i,a["first",i],a["last",i]
}
}
' Input_file
Output will be as follows.
parent,child,first,last
bif,baz,2019-01-02,2019-01-04
foo,bar,2019-01-01,2019-01-04
Explanation: Adding detailed explanation for above code.
awk ' ##Starting awk program from here.
BEGIN{ ##Starting BEGIN section from here.
FS=OFS=SUBSEP="," ##Setting FS, OFS and SUBSEP as comma here.
print "parent,child,first,last" ##Printing header values as per OP request here.
} ##Closing BEGIN BLOCK for this program here.
$1=="first" || $1=="last"{ ##Checking condition if $1 is either string first or last then do following.
a[$1,$2,$3]=$NF ##Creating an array named a whose index is $1,$2,$3 and its value is $NF(last column of current line).
b[$2,$3] ##Creating an array named b whose index is $2,$3 from current line.
} ##Closing main BLOCK for main program here.
END{ ##Starting END BLOCK for this awk program.
for(i in b){ ##Starting a for loop to traverse through array here.
print i,a["first",i],a["last",i] ##Printing variable i, then value of array a with index of "first",i and value of array a with index of "last",i.
} ##Closing BLOCK for, for loop here.
} ##Closing BLOCK for END block for this awk program here.
' Input_file ##Mentioning Input_file name here.
$ cat tst.awk
BEGIN { FS=OFS="," }
{ key = $2 OFS $3 }
FNR==1 { print key, "first", "last" }
$1=="first" { first[key] = $4 }
$1=="last" { print key, first[key], $4 }
$ awk -f tst.awk file
parent,child,first,last
foo,bar,2019-01-01,2019-01-04
bif,baz,2019-01-02,2019-01-04
If you can have a first without a last or vice-versa or they can occur out of order then include those cases in the example in your question.
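As a rough idea of what handling those cases could look like, here is a sketch that stores both dates per parent/child pair and prints everything at the end in first-seen order. The input is made up (a last before its first, and a first with no last), and leaving the missing date empty is just one possible choice:

```shell
out=$(printf '%s\n' \
    'status,parent,child,date' \
    'last,foo,bar,2019-01-04' \
    'first,foo,bar,2019-01-01' \
    'first,bif,baz,2019-01-02' | awk '
BEGIN { FS = OFS = "," }
FNR == 1 { print $2, $3, "first", "last"; next }
$1 == "first" || $1 == "last" {
    key = $2 OFS $3
    # Remember each parent/child pair once, in first-seen order.
    if (!(key in seen)) { seen[key]; order[++n] = key }
    date[$1, key] = $4
}
END {
    for (k = 1; k <= n; k++) {
        key = order[k]
        # A missing first or last date prints as an empty field.
        print key, date["first", key], date["last", key]
    }
}')
printf '%s\n' "$out"
```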
Not awk, you already have that, but here's an option in bash alone, just for kicks.
#!/usr/bin/env bash
declare -A first=()
printf 'parent,child,first,last\n'
while IFS=, read pos a b date; do
case "$pos" in
first) first[$a,$b]=$date ;;
last) printf "%s,%s,%s,%s\n" "$a" "$b" "${first[$a,$b]}" "$date" ;;
esac
done < input.csv
Requires bash 4+ for the associative array.