awk: transpose column header to first field of row

My input looks like this:
A|B|C
1|2|3
4|5|6
Using awk, I am trying to get:
A|1
B|2
C|3
A|4
B|5
C|6
My code:
gawk '
BEGIN{FS=OFS="|"}
NR==1{
for(i=1; i<=NF; i++){
x_i=$i
}
}
NR>1{
for(i=1; i<=NF; i++){
print x_i FS $i
}
}' input
But it keeps only the value from the last iteration of the NR==1 loop, even though I use the same loop in the NR>1 block:
C|1
C|2
C|3
C|4
C|5
C|6
Any trick?
EDIT
Thanks to Jose, I needed to change x_i to x[i].
What if, using the same input, I needed to output:
A;B;C|1|2|3
A;B;C|4|5|6

$ awk 'BEGIN{FS=OFS="|"} NR==1{split($0,h);next} {for (i=1;i<=NF;i++) print h[i], $i}' file
A|1
B|2
C|3
A|4
B|5
C|6
$ awk 'BEGIN{FS=OFS="|"} NR==1{gsub(/\|/,";");h=$0;next} {print h, $0}' file
A;B;C|1|2|3
A;B;C|4|5|6
Read Effective Awk Programming, 4th Edition, by Arnold Robbins.

you can try,
awk 'BEGIN{FS=OFS="|"}
NR==1{for(i=1; i<=NF; ++i) d[i]=$i; next}
{for(i=1; i<=NF; ++i) print d[i], $i}
' input
you get
A|1
B|2
C|3
A|4
B|5
C|6
Important Note
your logic is correct; you only need x[i] instead of x_i
gawk '
BEGIN{FS=OFS="|"}
NR==1{
for(i=1; i<=NF; i++){
x[i]=$i
}
}
NR>1{
for(i=1; i<=NF; i++){
print x[i] FS $i
}
}' input
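To see why the original fails: awk does not substitute the value of i into a variable name, so x_i is a single scalar that every iteration overwrites, while x[i] is an array with one element per index. A minimal demonstration:
$ gawk 'BEGIN{i=1; x_i="A"; i=2; x_i="B"; print x_i}'
B
$ gawk 'BEGIN{x[1]="A"; x[2]="B"; print x[1], x[2]}'
A B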

Here is another using split and for:
$ awk 'NR==1 { split($0,a,"|") }
NR>1 { n=split($0,b,"|");
for(i=1;i<=n;i++)
print a[i] "|" b[i] }' file
A|1
B|2
C|3
A|4
B|5
C|6

Related

An awk script without hard-coded field information

We have the following awk script that extracts fields 6, 7 and 14 from a CSV file:
awk -F, '{for (i=1; i<=NF; i++) if (i in [6, 7, 14]) printf "%s,", $i; print ""}' $input_file
The script works beautifully, except that the information about the fields of interest is hard-coded. We would like to be able to pass this information as a single command line argument (or even a series of command line arguments), to make the script more versatile. We tried a few things, including the following, but we keep getting a syntax error:
awk -F, '{for (i=1; i<=NF; i++) if (i in ['$2']) printf "%s,", $i; print ""}' $input_file
awk -F, '{for (i=1; i<=NF; i++) if (i in [6, 7, 14]) printf "%s,", $i; print ""}' $input_file
is not valid awk syntax, which is one reason why
awk -F, '{for (i=1; i<=NF; i++) if (i in ['$2']) printf "%s,", $i; print ""}' $input_file
or any variation of it would also give you a syntax error.
This is probably what you're trying to do:
awk -F, -v vals="$2" '
BEGIN { split(vals,tmp); for (i in tmp) arr[tmp[i]] }
{ for (i=1; i<=NF; i++) if (i in arr) printf "%s,", $i; print "" }
' "$input_file"
assuming $2 contains a comma-separated string like 6,7,14 and your input file is a CSV with unquoted fields.
That would still print a trailing , on each line, and looping through all fields on every input line just to discard the unwanted ones is inefficient. This solves both of those additional problems:
awk -v vals="$2" '
BEGIN { FS=OFS=","; n=split(vals,arr) }
{ for (i=1; i<=n; i++) printf "%s%s", $(arr[i]), (i<n ? OFS : ORS) }
' "$input_file"
Another option is to not use (g)awk, and use cut:
cut -d "," -f "6,7,14" inputfile
(or: a="6,7,14"; cut -d "," -f "$a" inputfile)
When input contains:
a,b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u,v,w,x,y,z
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,2,25,26
output should look like:
f,g,n
6,7,14

How to get cardinality of fields with AWK?

I am trying to count the unique occurrences for each field in a txt file.
Sample:
2008,12,13,6,1007,847,1149,1010,DL,1631,N909DA,162,143,122,99,80,ATL,IAH,689,8,32,0,,0,1,0,19,0,79
2008,12,13,6,638,640,808,753,DL,1632,N604DL,90,73,50,15,-2,JAX,ATL,270,14,26,0,,0,0,0,15,0,0
2008,12,13,6,756,800,1032,1026,DL,1633,N642DL,96,86,56,6,-4,MSY,ATL,425,23,17,0,,0,NA,NA,NA,NA,NA
2008,12,13,6,612,615,923,907,DL,1635,N907DA,131,112,103,16,-3,GEG,SLC,546,5,23,0,,0,0,0,16,0,0
2008,12,13,6,749,750,901,859,DL,1636,N646DL,72,69,41,2,-1,SAV,ATL,215,20,11,0,,0,NA,NA,NA,NA,NA
2008,12,13,6,1002,959,1204,1150,DL,1636,N646DL,122,111,71,14,3,ATL,IAD,533,6,45,0,,0,NA,NA,NA,NA,NA
2008,12,13,6,834,835,1021,1023,DL,1637,N908DL,167,168,139,-2,-1,ATL,SAT,874,5,23,0,,0,NA,NA,NA,NA,NA
2008,12,13,6,655,700,856,856,DL,1638,N671DN,121,116,85,0,-5,PBI,ATL,545,24,12,0,,0,NA,NA,NA,NA,NA
2008,12,13,6,1251,1240,1446,1437,DL,1639,N646DL,115,117,89,9,11,IAD,ATL,533,13,13,0,,0,NA,NA,NA,NA,NA
2008,12,13,6,1110,1103,1413,1418,DL,1641,N908DL,123,135,104,-5,7,SAT,ATL,874,8,11,0,,0,NA,NA,NA,NA,NA
Full dataset here: https://github.com/markgrover/cloudcon-hive (Flight delay dataset from 2008.)
For a single column we can do:
for i in $(seq 1 28); do cut -d',' -f$i 2008.csv | head |sort | uniq | wc -l ; done |tr '\n' ':' ; echo
Is there a way to do it in one go for all the columns?
I think the expected output looks like this:
1:1:1:1:10:10:10:10:1:10:9:9:6:9:9:9:2:5:5:5:6:1:1:1:3:2:2:2:
For the entire dataset:
1:12:31:7:1441:1217:1441:1378:20:7539:5374:690:526:664:1154:1135:303:304:1435:191:343:2:5:2:985:600:575:157:
With GNU awk for true multi-dimensional arrays:
$ cat tst.awk
BEGIN { FS=","; OFS=":" }
{
for (i=1; i<=NF; i++) {
vals[i][$i]      # per-column subarray; its keys are the distinct values seen
}
}
END {
for (i=1; i<=NF; i++) {
printf "%s%s", length(vals[i]), (i<NF?OFS:ORS)   # number of distinct keys
}
}
$ awk -f tst.awk file
1:1:1:1:10:10:10:10:1:9:7:10:10:10:10:9:8:5:8:8:8:1:1:1:3:2:4:2:3
and with any awk:
$ cat tst.awk
BEGIN { FS=","; OFS=":" }
{
for (i=1; i<=NF; i++) {
if ( !seen[i,$i]++ ) {   # true only the first time this value appears in column i
cnt[i]++
}
}
}
END {
for (i=1; i<=NF; i++) {
printf "%s%s", cnt[i], (i<NF?OFS:ORS)
}
}
$ awk -f tst.awk file
1:1:1:1:10:10:10:10:1:9:7:10:10:10:10:9:8:5:8:8:8:1:1:1:3:2:4:2:3
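The !seen[i,$i]++ test is the standard awk first-occurrence idiom: the post-increment returns the old count, which is 0 (false) the first time a key is seen, so the negated test is true exactly once per key. The same idiom applied to whole lines is the classic duplicate-removal one-liner:
$ awk '!seen[$0]++' file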
In GNU awk:
$ awk '
BEGIN { FS=OFS="," } # delimiters to ,
{
for(i=1;i<=NF;i++) # iterate over every field
a[i][$i] # store unique values to 2d hash
}
END { # after all the records
for(i=1;i<=NF;i++) # iterate the unique values for each field
for(j in a[i])
c[i]++ # count them and
for(i=1;i<=NF;i++)
printf "%s%s",c[i], (i==NF?ORS:OFS) # output the values
}' file
1,1,1,1,10,10,10,10,1,9,7,10,10,10,10,9,8,5,8,8,8,1,1,1,3,2,4,2,3
The output is not exactly the same; not sure if the mistake is yours or mine. Well, the last column has the values 79, 0 and NA, so mine is more accurate on that one.
another awk
this will give you rolling counts; pipe to tail -1 to get the last line, which holds the overall counts
$ awk -F, -v OFS=: '{for(i=1;i<=NF;i++)
printf "%s%s", NR-(a[i,$i]++?++c[i]:c[i]),(i==NF)?ORS:OFS}' file
1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1
1:1:1:1:2:2:2:2:1:2:2:2:2:2:2:2:2:2:2:2:2:1:1:1:2:1:2:1:2
1:1:1:1:3:3:3:3:1:3:3:3:3:3:3:3:3:2:3:3:3:1:1:1:3:2:3:2:3
1:1:1:1:4:4:4:4:1:4:4:4:4:4:4:4:4:3:4:4:4:1:1:1:3:2:4:2:3
1:1:1:1:5:5:5:5:1:5:5:5:5:5:5:5:5:3:5:5:5:1:1:1:3:2:4:2:3
1:1:1:1:6:6:6:6:1:5:5:6:6:6:6:6:5:4:6:6:6:1:1:1:3:2:4:2:3
1:1:1:1:7:7:7:7:1:6:6:7:7:7:7:6:5:5:7:6:6:1:1:1:3:2:4:2:3
1:1:1:1:8:8:8:8:1:7:7:8:8:8:8:7:6:5:8:7:7:1:1:1:3:2:4:2:3
1:1:1:1:9:9:9:9:1:8:7:9:9:9:9:8:7:5:8:8:8:1:1:1:3:2:4:2:3
1:1:1:1:10:10:10:10:1:9:7:10:10:10:10:9:8:5:8:8:8:1:1:1:3:2:4:2:3
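To unpack that dense expression: a[i,$i]++ returns 0 (false) the first time a value appears in column i, so c[i] ends up counting repeats, and the unique count so far is NR minus the repeats. The same logic written out with comments (a sketch, assuming the same input file):
awk -F, -v OFS=: '{
for (i=1; i<=NF; i++) {
if (a[i,$i]++)     # this value was already seen in column i
c[i]++             # so this occurrence is a repeat
printf "%s%s", NR - c[i], (i==NF ? ORS : OFS)   # uniques so far = rows - repeats
}
}' file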

awk to merge lines based on column

I would like to merge lines into a single row based on the first column $1 and format the output. While printing the header, I need to work out the maximum unique count of the first field.
For example, Angola appears with count=3, Brazil with count=5, and Zambia with count=1.
The maximum count for field $1 is 5, so I need to print the header 5 times to have a proper header for all the fields.
While printing the output, I want to keep the original input line order.
My actual input files vary, e.g. 10 fields, 12 fields, etc.
Input.csv
Country,Network,Details,Amount
Angola,voda,xxx,10
Angola,at&t,xxx,20
Angola,mtn,xxx,30
Brazil,voda,yyy,40
Brazil,voda,yyy,50
Brazil,at&t,yyy,60
Brazil,mtn,yyy,70
Brazil,voda,yyy,80
Zambia,tcl,zzz,90
Desired Output.csv
Country,Network,Details,Amount,Country,Network,Details,Amount,Country,Network,Details,Amount,Country,Network,Details,Amount,Country,Network,Details,Amount
Angola,voda,xxx,10,Angola,at&t,xxx,20,Angola,mtn,xxx,30
Brazil,voda,yyy,40,Brazil,voda,yyy,50,Brazil,at&t,yyy,60,Brazil,mtn,yyy,70,Brazil,voda,yyy,80
Zambia,tcl,zzz,90
Presently, I am using the two commands below to get the desired output, changing the count manually each time based on the number of fields in the actual input file.
Step #1:
awk 'BEGIN { while (count++<5) header=header "Country,Network,Details,Amount,"; print header }' > output.csv
Step #2:
awk -F, '
/.+/{
if (!($1 in Val)) { Key[++i] = $1; }
Val[$1] = Val[$1] $0 ",";
}
END{
for (j = 1; j <= i; j++) {
print(Val[Key[j]] );
}
}' input.csv >> output.csv
Looking for your suggestions ...
awk one-liner:
awk 'BEGIN{FS=OFS=","}FNR==1{n=$0;next}{a[$1]=($1 in a ? a[$1] OFS:"")$0; if(!($1 in b)){o[++i]=$1}; b[$1]++; mx=mx>b[$1]?mx:b[$1] }END{for(i=1; i<=mx; i++)printf("%s%s",n,i==mx?RS:OFS); for(i=1; i in o; i++)print a[o[i]]}' infile
Input:
$ cat infile
Country,Network,Details,Amount
Angola,voda,xxx,10
Angola,at&t,xxx,20
Angola,mtn,xxx,30
Brazil,voda,yyy,40
Brazil,voda,yyy,50
Brazil,at&t,yyy,60
Brazil,mtn,yyy,70
Brazil,voda,yyy,80
Zambia,tcl,zzz,90
Output:
$ awk 'BEGIN{FS=OFS=","}FNR==1{n=$0;next}{a[$1]=($1 in a ? a[$1] OFS:"")$0; if(!($1 in b)){o[++i]=$1}; b[$1]++; mx=mx>b[$1]?mx:b[$1] }END{for(i=1; i<=mx; i++)printf("%s%s",n,i==mx?RS:OFS); for(i=1; i in o; i++)print a[o[i]]}' infile
Country,Network,Details,Amount,Country,Network,Details,Amount,Country,Network,Details,Amount,Country,Network,Details,Amount,Country,Network,Details,Amount
Angola,voda,xxx,10,Angola,at&t,xxx,20,Angola,mtn,xxx,30
Brazil,voda,yyy,40,Brazil,voda,yyy,50,Brazil,at&t,yyy,60,Brazil,mtn,yyy,70,Brazil,voda,yyy,80
Zambia,tcl,zzz,90
More readable:
awk 'BEGIN{
FS=OFS=","
}
FNR==1{
n=$0;                                # save the header line
next
}
{
a[$1]=($1 in a ? a[$1] OFS:"")$0;    # append the record to its country group
if(!($1 in b)){ o[++i]=$1 };         # remember first-seen order of countries
b[$1]++;                             # records per country
mx=mx>b[$1]?mx:b[$1]                 # track the largest group
}
END{
for(i=1; i<=mx; i++)                 # print the header mx times on one line
printf("%s%s",n,i==mx?RS:OFS);
for(i=1; i in o; i++)                # print groups in original input order
print a[o[i]]
}' infile
Responding to this comment:
Would like to know where to change the code to print "Country" in the output only the first time, i.e. without repeating the same country name a 2nd or 3rd time
$ awk 'BEGIN{FS=OFS=","}FNR==1{n=$0;next}{a[$1]=($1 in a ? a[$1] OFS substr($0,index($0,",")+1) : $0); if(!($1 in b)){o[++i]=$1}; b[$1]++; mx=mx>b[$1]?mx:b[$1] }END{for(i=1; i<=mx; i++)printf("%s%s",i==1?n:substr(n,index(n,",")+1),i==mx?RS:OFS); for(i=1; i in o; i++)print a[o[i]]}' infile
Country,Network,Details,Amount,Network,Details,Amount,Network,Details,Amount,Network,Details,Amount,Network,Details,Amount
Angola,voda,xxx,10,at&t,xxx,20,mtn,xxx,30
Brazil,voda,yyy,40,voda,yyy,50,at&t,yyy,60,mtn,yyy,70,voda,yyy,80
Zambia,tcl,zzz,90
Modified code:
awk 'BEGIN{
FS=OFS=","
}
FNR==1{
n=$0;
next
}
{
# modified: for repeat records, append everything after the first comma,
# i.e. the record with its first field (the country) dropped
a[$1]=($1 in a ? a[$1] OFS substr($0,index($0,",")+1) : $0);
if(!($1 in b)){ o[++i]=$1 };
b[$1]++;
mx=mx>b[$1]?mx:b[$1]
}
END{
for(i=1; i<=mx; i++)
# modified: print the full header once, then the header minus its first field
printf("%s%s",i==1?n:substr(n,index(n,",")+1),i==mx?RS:OFS);
for(i=1; i in o; i++)
print a[o[i]]
}' infile
Explanation related to modification:
index(in, find)
Search the string in for the first occurrence of the string find, and
return the position in characters where that occurrence begins in the
string in.
substr(string, start [, length ])
Return a length-character-long substring of string, starting at
character number start.
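Combined, substr($0, index($0,",")+1) therefore returns everything after the first comma, i.e. the record with its first field dropped. A quick sketch:
$ echo 'Angola,voda,xxx,10' | awk '{print substr($0, index($0,",")+1)}'
voda,xxx,10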

Calculate average of each column in a file

I have a text file with n comma-separated rows and columns, and I want to find the average of each column, excluding empty fields.
A sample input looks like:
1,2,3
4,,6
,7,
The desired output is:
2.5, 4.5, 4.5
I tried with
awk -F',' '{ for(i=1;i<=NF;i++) sum[i]=sum[i]+$i;if(max < NF)max=NF;};END { for(j=1;j<=max;j++) printf "%d\t",sum[j]/max;}' input
But it treats consecutive delimiters as one and mixes up the columns.
Any help is much appreciated.
You can use this one-liner:
$ awk -F, '{for(i=1; i<=NF; i++) {a[i]+=$i; if($i!="") b[i]++}}; END {for(i=1; i<=NF; i++) printf "%s%s", a[i]/b[i], (i==NF?ORS:OFS)}' foo
2.5 4.5 4.5
Otherwise, you can save this in a file script.awk and run awk -f script.awk your_file:
{
for(i=1; i<=NF; i++) {
a[i]+=$i
if($i!="") b[i]++
}
}
END {
for(i=1; i<=NF; i++)
printf "%s%s", a[i]/b[i], (i==NF?ORS:OFS)
}
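One caveat with both versions: if a column is empty on every line, b[i] stays zero and a[i]/b[i] aborts with a division-by-zero error. A guarded variant (a sketch; printing an empty value for all-empty columns is my assumption about the desired behavior):
{
for(i=1; i<=NF; i++) {
a[i]+=$i
if($i!="") b[i]++
}
}
END {
for(i=1; i<=NF; i++)
# only divide when the column had at least one non-empty field
printf "%s%s", (b[i] ? a[i]/b[i] : ""), (i==NF?ORS:OFS)
}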

Rounding floating-point numbers using AWK

I have a file b.xyz as,
-19.794325 -23.350704 -9.552335
-20.313872 -23.948248 -8.924463
-18.810708 -23.571757 -9.494047
-20.048543 -23.660052 -10.478968
I want to limit each of the entries to three decimal digits.
I tried this one
awk '{ $1=sprintf("%.3f",$1)} {$2=sprintf("%.3f",$2)} {$3=sprintf("%.3f",$3)} {print $1, $2, $3}' b.xyz
it works for three columns, but how do I extend it to apply to n/all columns?
If you will always have three fields, then you can use:
$ awk '{printf "%.3f %.3f %.3f\n", $1, $2, $3}' file
-19.794 -23.351 -9.552
-20.314 -23.948 -8.924
-18.811 -23.572 -9.494
-20.049 -23.660 -10.479
For an arbitrary number of fields, you can do:
$ awk '{for (i=1; i<=NF; i++) printf "%.3f%s", $i, (i==NF?"\n":" ")}' file
-19.794 -23.351 -9.552
-20.314 -23.948 -8.924
-18.811 -23.572 -9.494
-20.049 -23.660 -10.479
It loops through all the fields and prints them; (i==NF?"\n":" ") prints a newline after the last field and a space otherwise.
Or even (thanks Jotne!):
awk '{for (i=1; i<=NF; i++) printf "%.3f%s", $i, (i==NF?RS:FS)}' file
Example
$ cat a
-19.794325 -23.350704 -9.552335 2.13423 23 23223.23 23.23442
-20.313872 -23.948248 -8.924463
-18.810708 -23.571757 -9.494047
-20.048543 -23.660052 -10.478968
$ awk '{for (i=1; i<=NF; i++) printf "%.3f%s", $i, (i==NF?"\n":" ")}' a
-19.794 -23.351 -9.552 2.134 23.000 23223.230 23.234
-20.314 -23.948 -8.924
-18.811 -23.572 -9.494
-20.049 -23.660 -10.479
$ awk '{for (i=1; i<=NF; i++) printf "%.3f%s", $i, (i==NF?RS:FS)}' a
-19.794 -23.351 -9.552 2.134 23.000 23223.230 23.234
-20.314 -23.948 -8.924
-18.811 -23.572 -9.494
-20.049 -23.660 -10.479
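If the number of decimal places should be configurable rather than hard-coded, the format string can be built from a variable, which works in any awk (the variable name p is my choice, not from the answers above):
$ awk -v p=3 '{for (i=1; i<=NF; i++) printf "%." p "f%s", $i, (i==NF?RS:FS)}' file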