awk to merge lines based on column

I would like to merge lines that share the same first column ($1) into a single row and format the output. The header must be repeated as many times as the largest number of lines that share one $1 value.
For example, Angola appears 3 times, Brazil appears 5 times, and Zambia appears once.
The maximum count for field $1 is 5, so the header needs to be printed 5 times to give every field a proper header.
The output should keep the line order of the original input file.
My actual input files vary in width: 10 fields, 12 fields, and so on.
Input.csv
Country,Network,Details,Amount
Angola,voda,xxx,10
Angola,at&t,xxx,20
Angola,mtn,xxx,30
Brazil,voda,yyy,40
Brazil,voda,yyy,50
Brazil,at&t,yyy,60
Brazil,mtn,yyy,70
Brazil,voda,yyy,80
Zambia,tcl,zzz,90
Desired Output.csv
Country,Network,Details,Amount,Country,Network,Details,Amount,Country,Network,Details,Amount,Country,Network,Details,Amount,Country,Network,Details,Amount
Angola,voda,xxx,10,Angola,at&t,xxx,20,Angola,mtn,xxx,30
Brazil,voda,yyy,40,Brazil,voda,yyy,50,Brazil,at&t,yyy,60,Brazil,mtn,yyy,70,Brazil,voda,yyy,80
Zambia,tcl,zzz,90
Presently, I use the two commands below to get the desired output, and I change the count manually each time based on the number of fields in the actual input file.
Step:#1
awk 'BEGIN { while (count++<5) header=header "Country,Network,Details,Amount,"; print header }' > output.csv
Step:#2
awk -F, '
/.+/{
if (!($1 in Val)) { Key[++i] = $1; }
Val[$1] = Val[$1] $0 ",";
}
END{
for (j = 1; j <= i; j++) {
print(Val[Key[j]] );
}
}' input.csv >> output.csv
Looking for your suggestions ...

awk One-liner :
awk 'BEGIN{FS=OFS=","}FNR==1{n=$0;next}{a[$1]=($1 in a ? a[$1] OFS:"")$0; if(!($1 in b)){o[++i]=$1}; b[$1]++; mx=mx>b[$1]?mx:b[$1] }END{for(i=1; i<=mx; i++)printf("%s%s",n,i==mx?RS:OFS); for(i=1; i in o; i++)print a[o[i]]}' infile
Input:
$ cat infile
Country,Network,Details,Amount
Angola,voda,xxx,10
Angola,at&t,xxx,20
Angola,mtn,xxx,30
Brazil,voda,yyy,40
Brazil,voda,yyy,50
Brazil,at&t,yyy,60
Brazil,mtn,yyy,70
Brazil,voda,yyy,80
Zambia,tcl,zzz,90
Output:
$ awk 'BEGIN{FS=OFS=","}FNR==1{n=$0;next}{a[$1]=($1 in a ? a[$1] OFS:"")$0; if(!($1 in b)){o[++i]=$1}; b[$1]++; mx=mx>b[$1]?mx:b[$1] }END{for(i=1; i<=mx; i++)printf("%s%s",n,i==mx?RS:OFS); for(i=1; i in o; i++)print a[o[i]]}' infile
Country,Network,Details,Amount,Country,Network,Details,Amount,Country,Network,Details,Amount,Country,Network,Details,Amount,Country,Network,Details,Amount
Angola,voda,xxx,10,Angola,at&t,xxx,20,Angola,mtn,xxx,30
Brazil,voda,yyy,40,Brazil,voda,yyy,50,Brazil,at&t,yyy,60,Brazil,mtn,yyy,70,Brazil,voda,yyy,80
Zambia,tcl,zzz,90
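More readable:
awk 'BEGIN{
    FS=OFS=","                            # read and write comma-separated fields
}
FNR==1{
    n=$0;                                 # save the header line
    next
}
{
    a[$1]=($1 in a ? a[$1] OFS:"")$0;     # append this line to the row for its key
    if(!($1 in b)){ o[++i]=$1 };          # remember keys in first-seen order
    b[$1]++;                              # count lines per key
    mx=mx>b[$1]?mx:b[$1]                  # track the largest count
}
END{
    for(i=1; i<=mx; i++)                  # repeat the header mx times
        printf("%s%s",n,i==mx?RS:OFS);
    for(i=1; i in o; i++)                 # print merged rows in input order
        print a[o[i]]
}' infile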
In response to this comment:
I would like to know where to change the code to print "Country" in
the output only the first time, so that the same country name is not
printed a second or third time.
$ awk 'BEGIN{FS=OFS=","}FNR==1{n=$0;next}{a[$1]=($1 in a ? a[$1] OFS substr($0,index($0,",")+1) : $0); if(!($1 in b)){o[++i]=$1}; b[$1]++; mx=mx>b[$1]?mx:b[$1] }END{for(i=1; i<=mx; i++)printf("%s%s",i==1?n:substr(n,index(n,",")+1),i==mx?RS:OFS); for(i=1; i in o; i++)print a[o[i]]}' infile
Country,Network,Details,Amount,Network,Details,Amount,Network,Details,Amount,Network,Details,Amount,Network,Details,Amount
Angola,voda,xxx,10,at&t,xxx,20,mtn,xxx,30
Brazil,voda,yyy,40,voda,yyy,50,at&t,yyy,60,mtn,yyy,70,voda,yyy,80
Zambia,tcl,zzz,90
Modified-code:
awk 'BEGIN{
FS=OFS=","
}
FNR==1{
n=$0;
next
}
{
# this line modified:
# for repeated keys, append only the text after the first comma
a[$1]=($1 in a ? a[$1] OFS substr($0,index($0,",")+1) : $0);
if(!($1 in b)){ o[++i]=$1 };
b[$1]++;
mx=mx>b[$1]?mx:b[$1]
}
END{
for(i=1; i<=mx; i++)
# this line modified
printf("%s%s",i==1?n:substr(n,index(n,",")+1),i==mx?RS:OFS);
for(i=1; i in o; i++)
print a[o[i]]
}' infile
Explanation related to modification:
index(in, find)
Search the string in for the first occurrence of the string find, and
return the position in characters where that occurrence begins in the
string in.
substr(string, start [, length ])
Return a length-character-long substring of string, starting at
character number start.
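A quick way to see how the two functions combine in the modified code is to run them on one line of the sample data. This is only an illustrative sketch; the shell variable names line and result are made up for the demonstration.

```shell
# Illustrative only: the sample line is taken from the question's input.
line='Angola,at&t,xxx,20'
result=$(printf '%s\n' "$line" | awk '{
    pos = index($0, ",")          # position of the first comma (7 here)
    print pos
    print substr($0, pos + 1)     # everything after the first comma
    print substr($0, 1, pos - 1)  # the first field only
}')
printf '%s\n' "$result"
```

The second line printed is the part that the modified code appends for every repeated country.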

Related

How to sum up all other columns based on column 1?

I have an example csv file like below (but with way more columns numbering up to Sample 100 and several rows)
Genus,Sample1,Sample2,Sample3
Unclassified,0,1,44
Unclassified,0,0,392
Unclassified,0,0,0
Woeseia,0,0,76
and I would like to have a summed csv file as below where all the identical entries on column 1 are summed up
Genus,Sample1,Sample2,Sample3
Unclassified,0,1,436
Woeseia,0,0,76
I tried the following awk code but it didn't work
awk -F "," 'function SP() {n=split ($0, T); ID=$1}
function PR() {printf "%s", ID; for (i=2; i<=n; i++) printf "\t%s", T[i]; printf "\n"}
NR==1 {SP();next}
$1 != ID {PR(); SP(); next}
{for (i=2; i<=NF; i++) T[i]+=$i}
END {PR()}
' Filename.csv
I am also aware of doing something like below but it is impractical when there are hundreds of columns. Any help here would be appreciated.
awk -F "," ' NR==1 {print; next} NF {a[$1]+=$2; b[$1]+=$3; c[$1]+=$4; d[$1]+=$5; e[$1]+=$6; f[$1]++} END {for(i in a)print i, a[i], b[i], c[i], d[i], e[i], f[i]} ' Filename.csv
With your shown samples, please try the following awk program. You do not need to create that many arrays; one or two are enough here.
awk '
BEGIN { FS=OFS="," }
FNR==1{
print
next
}
{
for(i=2;i<=NF;i++){
arr1[$1]
arr2[$1,i]+=$i
}
}
END{
for(i in arr1){
printf("%s,",i)
for(j=2;j<=NF;j++){
printf("%s%s",arr2[i,j],j==NF?ORS:OFS)
}
}
}
' Input_file
Output will be as follows:
Genus,Sample1,Sample2,Sample3
Unclassified,0,1,436
Woeseia,0,0,76
Explanation: a detailed, line-by-line explanation of the above code.
awk ' ##Starting awk program from here.
BEGIN { FS=OFS="," } ##In BEGIN section setting FS and OFS as comma here.
FNR==1{ ##Checking if this is first line then do following.
print ##Printing current line.
next ##next will skip further statements from here.
}
{
for(i=2;i<=NF;i++){ ##Looping from the 2nd field to the last field (NF).
arr1[$1] ##Creating an arr1 entry indexed by the 1st field.
arr2[$1,i]+=$i ##Adding the current field value to arr2, indexed by the 1st field and the field number.
}
}
END{ ##Starting END block for this program from here.
for(i in arr1){ ##Traversing through arr1 all elements here one by one.
printf("%s,",i) ##Printing its current index here.
for(j=2;j<=NF;j++){ ##Running for loop from 2nd field to till NF here.
printf("%s%s",arr2[i,j],j==NF?ORS:OFS) ##Printing value of arr2 with index of i and j, printing new line if its last field.
}
}
}
' Input_file ##Mentioning Input_file here.
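One caveat: for (i in arr1) visits the keys in an unspecified order, so the rows may not come out in input order on every awk. The following is a minimal sketch, not part of the answer above, that records keys in first-seen order. The array names seen, order and sum, and the temporary file, are assumptions made for the example.

```shell
# Sample data from the question, written to a temporary file.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
Genus,Sample1,Sample2,Sample3
Unclassified,0,1,44
Unclassified,0,0,392
Unclassified,0,0,0
Woeseia,0,0,76
EOF

result=$(awk '
BEGIN { FS = OFS = "," }
FNR == 1 { print; nf = NF; next }   # keep the header, remember the column count
{
    if (!($1 in seen)) {            # first time this key appears
        seen[$1]
        order[++n] = $1             # record first-seen order
    }
    for (i = 2; i <= NF; i++)
        sum[$1, i] += $i            # per-key, per-column running total
}
END {
    for (k = 1; k <= n; k++) {      # walk the keys in input order
        key = order[k]
        line = key
        for (i = 2; i <= nf; i++)
            line = line OFS sum[key, i]
        print line
    }
}' "$tmp")
printf '%s\n' "$result"
rm -f "$tmp"
```

Saving NF from the header line in nf also avoids relying on NF still being set inside END.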
Here's another awk solution:
awk -v FS=',' -v OFS=',' '
NR == 1 {
print
next
}
{
ids[$1]
for (i = 2; i <= NF; i++)
sums[i "," $1] += $i
}
END {
for (id in ids) {
out = id
for (i = 2; i <= NF; i++)
out = out OFS sums[i "," id]
print out
}
}
' Filename.csv
Genus,Sample1,Sample2,Sample3
Unclassified,0,1,436
Woeseia,0,0,76
You can also use a CSV-aware program that provides tools for data analysis.
Here's an example with Miller, which is available as a stand-alone executable:
IFS='' read -r csv_header < Filename.csv
mlr --csv \
stats1 -a sum -g "${csv_header%%,*}" -f "${csv_header#*,}" \
then rename -r '(.*)_sum,\1' \
Filename.csv
Genus,Sample1,Sample2,Sample3
Unclassified,0,1,436
Woeseia,0,0,76

How to get cardinality of fields with AWK?

I am trying to count the unique occurrences for each field in a txt file.
Sample:
2008,12,13,6,1007,847,1149,1010,DL,1631,N909DA,162,143,122,99,80,ATL,IAH,689,8,32,0,,0,1,0,19,0,79
2008,12,13,6,638,640,808,753,DL,1632,N604DL,90,73,50,15,-2,JAX,ATL,270,14,26,0,,0,0,0,15,0,0
2008,12,13,6,756,800,1032,1026,DL,1633,N642DL,96,86,56,6,-4,MSY,ATL,425,23,17,0,,0,NA,NA,NA,NA,NA
2008,12,13,6,612,615,923,907,DL,1635,N907DA,131,112,103,16,-3,GEG,SLC,546,5,23,0,,0,0,0,16,0,0
2008,12,13,6,749,750,901,859,DL,1636,N646DL,72,69,41,2,-1,SAV,ATL,215,20,11,0,,0,NA,NA,NA,NA,NA
2008,12,13,6,1002,959,1204,1150,DL,1636,N646DL,122,111,71,14,3,ATL,IAD,533,6,45,0,,0,NA,NA,NA,NA,NA
2008,12,13,6,834,835,1021,1023,DL,1637,N908DL,167,168,139,-2,-1,ATL,SAT,874,5,23,0,,0,NA,NA,NA,NA,NA
2008,12,13,6,655,700,856,856,DL,1638,N671DN,121,116,85,0,-5,PBI,ATL,545,24,12,0,,0,NA,NA,NA,NA,NA
2008,12,13,6,1251,1240,1446,1437,DL,1639,N646DL,115,117,89,9,11,IAD,ATL,533,13,13,0,,0,NA,NA,NA,NA,NA
2008,12,13,6,1110,1103,1413,1418,DL,1641,N908DL,123,135,104,-5,7,SAT,ATL,874,8,11,0,,0,NA,NA,NA,NA,NA
Full dataset here: https://github.com/markgrover/cloudcon-hive (Flight delay dataset from 2008.)
For a single column we can do:
for i in $(seq 1 28); do cut -d',' -f$i 2008.csv | head |sort | uniq | wc -l ; done |tr '\n' ':' ; echo
Is there a way to do it in one go for all the columns?
I think the expected output looks like this:
1:1:1:1:10:10:10:10:1:10:9:9:6:9:9:9:2:5:5:5:6:1:1:1:3:2:2:2:
For the entire dataset:
1:12:31:7:1441:1217:1441:1378:20:7539:5374:690:526:664:1154:1135:303:304:1435:191:343:2:5:2:985:600:575:157:
With GNU awk for true multi-dimensional arrays:
$ cat tst.awk
BEGIN { FS=","; OFS=":" }
{
for (i=1; i<=NF; i++) {
vals[i][$i]
}
}
END {
for (i=1; i<=NF; i++) {
printf "%s%s", length(vals[i]), (i<NF?OFS:ORS)
}
}
$ awk -f tst.awk file
1:1:1:1:10:10:10:10:1:9:7:10:10:10:10:9:8:5:8:8:8:1:1:1:3:2:4:2:3
and with any awk:
$ cat tst.awk
BEGIN { FS=","; OFS=":" }
{
for (i=1; i<=NF; i++) {
if ( !seen[i,$i]++ ) {
cnt[i]++
}
}
}
END {
for (i=1; i<=NF; i++) {
printf "%s%s", cnt[i], (i<NF?OFS:ORS)
}
}
$ awk -f tst.awk file
1:1:1:1:10:10:10:10:1:9:7:10:10:10:10:9:8:5:8:8:8:1:1:1:3:2:4:2:3
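To see the !seen[i,$i]++ idiom in isolation, here is a small sketch on three made-up rows (the input values are invented for the demonstration). It also saves the column count in a variable named nf instead of reading NF in END.

```shell
result=$(printf '%s\n' 'a,x,1' 'b,x,1' 'a,y,1' | awk '
BEGIN { FS = ","; OFS = ":" }
{
    for (i = 1; i <= NF; i++)
        if (!seen[i, $i]++)    # true only the first time value $i shows up in column i
            cnt[i]++
    nf = NF                    # remember the column count for END
}
END {
    for (i = 1; i <= nf; i++)
        printf "%s%s", cnt[i], (i < nf ? OFS : ORS)
}')
printf '%s\n' "$result"
```

Column 1 holds a, b, a (2 distinct values), column 2 holds x, x, y (2), and column 3 holds only 1 (1), so the line printed is 2:2:1.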
In GNU awk:
$ awk '
BEGIN { FS=OFS="," } # delimiters to ,
{
for(i=1;i<=NF;i++) # iterate over every field
a[i][$i] # store unique values to 2d hash
}
END { # after all the records
for(i=1;i<=NF;i++) # iterate the unique values for each field
for(j in a[i])
c[i]++ # count them and
for(i=1;i<=NF;i++)
printf "%s%s",c[i], (i==NF?ORS:OFS) # output the values
}' file
1,1,1,1,10,10,10,10,1,9,7,10,10,10,10,9,8,5,8,8,8,1,1,1,3,2,4,2,3
The output is not exactly the same as your expected output; I'm not sure whether the mistake is yours or mine. The last column, though, contains the values 79, 0 and NA, so mine is more accurate on that one.
Another awk:
This prints rolling counts; pipe it to tail -1 to get the last line, which holds the overall counts.
$ awk -F, -v OFS=: '{for(i=1;i<=NF;i++)
printf "%s%s", NR-(a[i,$i]++?++c[i]:c[i]),(i==NF)?ORS:OFS}' file
1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1
1:1:1:1:2:2:2:2:1:2:2:2:2:2:2:2:2:2:2:2:2:1:1:1:2:1:2:1:2
1:1:1:1:3:3:3:3:1:3:3:3:3:3:3:3:3:2:3:3:3:1:1:1:3:2:3:2:3
1:1:1:1:4:4:4:4:1:4:4:4:4:4:4:4:4:3:4:4:4:1:1:1:3:2:4:2:3
1:1:1:1:5:5:5:5:1:5:5:5:5:5:5:5:5:3:5:5:5:1:1:1:3:2:4:2:3
1:1:1:1:6:6:6:6:1:5:5:6:6:6:6:6:5:4:6:6:6:1:1:1:3:2:4:2:3
1:1:1:1:7:7:7:7:1:6:6:7:7:7:7:6:5:5:7:6:6:1:1:1:3:2:4:2:3
1:1:1:1:8:8:8:8:1:7:7:8:8:8:8:7:6:5:8:7:7:1:1:1:3:2:4:2:3
1:1:1:1:9:9:9:9:1:8:7:9:9:9:9:8:7:5:8:8:8:1:1:1:3:2:4:2:3
1:1:1:1:10:10:10:10:1:9:7:10:10:10:10:9:8:5:8:8:8:1:1:1:3:2:4:2:3

awk: print certain columns and a series of consecutive columns

If I want to print, say, the fourth column till the tenth column, I can use:
awk '{for( i=4; i<=10; i++ ){printf( "%s ", $i )}; printf( "\n"); }' file
But what if I want to print columns 1, 3, and then 4-10, and then 12?
awk solution:
awk '{ for(i=1;i<=12;i++) if(i!~/^(2|11)$/) printf "%s ",$i; print "" }' file
if(i!~/^(2|11)$/) - ensures that it's not the 2nd or the 11th field. You can extend the alternation group (2|11|...) to multiple field numbers that should be skipped
Try the following and let me know if this helps.
awk '{printf("%s %s ",$1,$3);for( i=4; i<=10; i++ ){printf( "%s ", $i )}; printf("%s\n",$12);}' Input_file
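If the set of columns changes often, another option is to pass it in as a specification string rather than hard-coding it. This is a rough sketch under that assumption: the variable name cols and its "1,3,4-10,12" syntax are invented here and are not an awk feature.

```shell
result=$(printf '%s\n' 'c1 c2 c3 c4 c5 c6 c7 c8 c9 c10 c11 c12 c13' |
awk -v cols='1,3,4-10,12' '
BEGIN {
    # Expand the spec into an ordered list of field numbers.
    n = split(cols, parts, ",")
    for (p = 1; p <= n; p++) {
        m = split(parts[p], r, "-")          # "4-10" gives two pieces, "3" gives one
        lo = r[1] + 0
        hi = (m == 2 ? r[2] : r[1]) + 0
        for (f = lo; f <= hi; f++)
            want[++k] = f
    }
}
{
    out = ""
    for (j = 1; j <= k; j++)                 # join the wanted fields with OFS
        out = out (j > 1 ? OFS : "") $(want[j])
    print out
}')
printf '%s\n' "$result"
```

Building the line in out and printing it once also avoids the trailing space that the printf "%s " loops leave at the end of each line.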

awk: transpose column header to first field of row

My input looks like this:
A|B|C
1|2|3
4|5|6
Using awk, I am trying to get:
A|1
B|2
C|3
A|4
B|5
C|6
My code:
gawk '
BEGIN{FS=OFS="|"}
NR==1{
for(i=1; i<=NF; i++){
x_i=$i
}
}
NR>1{
for(i=1; i<=NF; i++){
print x_i FS $i
}
}' input
But it keeps only the value from the last iteration of the NR==1 block, even if I use the same loop in the NR>1 block:
C|1
C|2
C|3
C|4
C|5
C|6
Any trick?
EDIT
Thanks to Jose, I needed to replace x_i with x[i].
What if, using the same input, I needed to output:
A;B;C|1|2|3
A;B;C|4|5|6
$ awk 'BEGIN{FS=OFS="|"} NR==1{split($0,h);next} {for (i=1;i<=NF;i++) print h[i], $i}' file
A|1
B|2
C|3
A|4
B|5
C|6
$ awk 'BEGIN{FS=OFS="|"} NR==1{gsub(/\|/,";");h=$0;next} {print h, $0}' file
A;B;C|1|2|3
A;B;C|4|5|6
Read Effective Awk Programming, 4th Edition, by Arnold Robbins.
You can try:
awk 'BEGIN{FS=OFS="|"}
NR==1{for(i=1; i<=NF; ++i) d[i]=$i; next}
{for(i=1; i<=NF; ++i) print d[i], $i}
' input
You get:
A|1
B|2
C|3
A|4
B|5
C|6
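Important note:
Your logic is correct; you only need x[i] instead of x_i. In awk, x_i is a single scalar variable whose name happens to contain an underscore, so every pass of the loop overwrites it.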
gawk '
BEGIN{FS=OFS="|"}
NR==1{
for(i=1; i<=NF; i++){
x[i]=$i
}
}
NR>1{
for(i=1; i<=NF; i++){
print x[i] FS $i
}
}' input
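The difference between x_i and x[i] can be checked directly on the header line. This is just a small demonstration; the printed labels are arbitrary.

```shell
result=$(printf '%s\n' 'A|B|C' | awk -F'|' '{
    for (i = 1; i <= NF; i++) {
        x_i = $i      # one scalar named x_i, overwritten on every pass
        x[i] = $i     # array element keyed by the current value of i
    }
    print "x_i=" x_i
    print "x[1]=" x[1]
    print "x[2]=" x[2]
}')
printf '%s\n' "$result"
```

Only the last header survives in x_i, which matches the C|1, C|2, ... output seen in the question, while the array keeps one entry per column.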
Here is another using split and for:
$ awk 'NR==1 { split($0,a,"|") }
NR>1 { n=split($0,b,"|");
for(i=1;i<=n;i++)
print a[i] "|" b[i] }' file
A|1
B|2
C|3
A|4
B|5
C|6

missing field and extra space after using for loop in awk

I need to use an awk script to extract some information from a file.
I have a title line with 11 fields, which I split into an array called titleList.
Student Number:Name:Lab1:Lab2:Lab3:Lab4:Lab5:Lab6:Exam1:Exam2:Final
After finding a matching line, I need to print each field preceded by its title. For example, if the result is:
92839342:Robert Bloomingdale:9:26:18:22:9:12:25:39:99
I must print it this way:
Student Number:92839342 Name:Robert Bloomingdale Lab1:9 Lab2:26 Lab3:18
Lab4:22 Lab5:9 Lab6:12 Exam1:25 Exam2:39 Final:99
I use a for loop to manage it:
for (i=0 ;i<=NF ;i++)
{
printf "%s %s %s %s",titleList[i],":",$i," "
}
Everything looks good except the result, which has two problems:
first, there are extra spaces in each result, and second, the last field of the matched line is missing.
Student Number : 92839342 Name : Robert Bloomingdale Lab1 : 9 Lab2 : 26
Lab3:18 Lab4 : 22 Lab5 : 9 Lab6 : 12 Exam1 : 25 Exam2 : 39 Final
What should I do?
Is there a problem with the \n at the end of the search result?
You can correct the amount of extra whitespace between fields by correcting the printf statement:
awk -F ":" 'NR == 1 { split($0, array, FS) } NR >= 2 { for (i=1; i<=NF; i++) printf "%s:%s ", array[i], $i; printf "\n" }' file.txt
Contents of file.txt:
Student Number:Name:Lab1:Lab2:Lab3:Lab4:Lab5:Lab6:Exam1:Exam2:Final
92839342:Robert Bloomingdale:9:26:18:22:9:12:25:39:99
Results:
Student Number:92839342 Name:Robert Bloomingdale Lab1:9 Lab2:26 Lab3:18 Lab4:22 Lab5:9 Lab6:12 Exam1:25 Exam2:39 Final:99
EDIT:
Also, you're missing the last value because the file you're working with probably has Windows line endings. To fix this, run dos2unix file.txt before running your awk code. Alternatively, you can set awk's record separator so that it understands those line endings:
awk 'BEGIN { RS="\r\n"; FS=":" } NR == 1 { split($0, array, FS) } NR >= 2 { for (i=1; i<=NF; i++) printf "%s:%s ", array[i], $i; printf "\n" }' file.txt
EDIT:
The above requires GNU awk. split() splits on FS by default, so there is no need to pass it as an argument; it's common to use "next" rather than specifying opposite conditions; and it's common to use print "" instead of printf "\n" so that you use the ORS setting rather than hard-coding its value in output statements. So, the above should be tweaked to:
gawk 'BEGIN { RS="\r\n"; FS=":" } NR == 1 { split($0, array); next } { for (i=1; i<=NF; i++) printf "%s:%s ", array[i], $i; print "" }' file.txt
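If GNU awk or dos2unix is not available, a possible alternative is to strip a trailing carriage return from each record before using it. The following is a sketch under that assumption, run on a shortened, made-up two-column version of the file; the array name title is arbitrary.

```shell
# Two CRLF-terminated lines: a header and one record.
result=$(printf 'Name:Final\r\nRobert Bloomingdale:99\r\n' | awk -F: '
{ sub(/\r$/, "") }                  # drop a trailing carriage return, if present
NR == 1 { split($0, title); next }  # header fields become the labels
{
    line = ""
    for (i = 1; i <= NF; i++)       # join "title:value" pairs with single spaces
        line = line (i > 1 ? " " : "") title[i] ":" $i
    print line
}')
printf '%s\n' "$result"
```

Because sub() modifies $0, awk re-splits the record, so the last field no longer carries the stray \r.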