How to get cardinality of fields with AWK?
I am trying to count the number of unique values in each field (column) of a comma-separated text file.
Sample:
2008,12,13,6,1007,847,1149,1010,DL,1631,N909DA,162,143,122,99,80,ATL,IAH,689,8,32,0,,0,1,0,19,0,79
2008,12,13,6,638,640,808,753,DL,1632,N604DL,90,73,50,15,-2,JAX,ATL,270,14,26,0,,0,0,0,15,0,0
2008,12,13,6,756,800,1032,1026,DL,1633,N642DL,96,86,56,6,-4,MSY,ATL,425,23,17,0,,0,NA,NA,NA,NA,NA
2008,12,13,6,612,615,923,907,DL,1635,N907DA,131,112,103,16,-3,GEG,SLC,546,5,23,0,,0,0,0,16,0,0
2008,12,13,6,749,750,901,859,DL,1636,N646DL,72,69,41,2,-1,SAV,ATL,215,20,11,0,,0,NA,NA,NA,NA,NA
2008,12,13,6,1002,959,1204,1150,DL,1636,N646DL,122,111,71,14,3,ATL,IAD,533,6,45,0,,0,NA,NA,NA,NA,NA
2008,12,13,6,834,835,1021,1023,DL,1637,N908DL,167,168,139,-2,-1,ATL,SAT,874,5,23,0,,0,NA,NA,NA,NA,NA
2008,12,13,6,655,700,856,856,DL,1638,N671DN,121,116,85,0,-5,PBI,ATL,545,24,12,0,,0,NA,NA,NA,NA,NA
2008,12,13,6,1251,1240,1446,1437,DL,1639,N646DL,115,117,89,9,11,IAD,ATL,533,13,13,0,,0,NA,NA,NA,NA,NA
2008,12,13,6,1110,1103,1413,1418,DL,1641,N908DL,123,135,104,-5,7,SAT,ATL,874,8,11,0,,0,NA,NA,NA,NA,NA
Full dataset here: https://github.com/markgrover/cloudcon-hive (Flight delay dataset from 2008.)
Processing one column at a time, we can do (the head limits the count to the first ten lines):
for i in $(seq 1 28); do cut -d',' -f$i 2008.csv | head | sort | uniq | wc -l; done | tr '\n' ':'; echo
Is there a way to do it in one go for all the columns?
I think the expected output looks like this:
1:1:1:1:10:10:10:10:1:10:9:9:6:9:9:9:2:5:5:5:6:1:1:1:3:2:2:2:
For the entire dataset:
1:12:31:7:1441:1217:1441:1378:20:7539:5374:690:526:664:1154:1135:303:304:1435:191:343:2:5:2:985:600:575:157:
With GNU awk for true multi-dimensional arrays:
$ cat tst.awk
BEGIN { FS=","; OFS=":" }
{
    for (i=1; i<=NF; i++) {
        vals[i][$i]
    }
}
END {
    for (i=1; i<=NF; i++) {
        printf "%s%s", length(vals[i]), (i<NF?OFS:ORS)
    }
}
$ awk -f tst.awk file
1:1:1:1:10:10:10:10:1:9:7:10:10:10:10:9:8:5:8:8:8:1:1:1:3:2:4:2:3
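Note that the END block here reads NF from the last record, which is fine as long as every row has the same number of fields. If rows could be ragged, here is a sketch of a variant that tracks the widest record seen (maxnf is a name introduced for this sketch, not part of the original script):

$ cat tst_ragged.awk
BEGIN { FS=","; OFS=":" }
{
    if (NF > maxnf) maxnf = NF      # remember the widest record seen
    for (i=1; i<=NF; i++) {
        vals[i][$i]                 # touching the element creates it (gawk)
    }
}
END {
    for (i=1; i<=maxnf; i++) {
        printf "%s%s", length(vals[i]), (i<maxnf?OFS:ORS)
    }
}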
and with any awk:
$ cat tst.awk
BEGIN { FS=","; OFS=":" }
{
    for (i=1; i<=NF; i++) {
        if ( !seen[i,$i]++ ) {
            cnt[i]++
        }
    }
}
END {
    for (i=1; i<=NF; i++) {
        printf "%s%s", cnt[i], (i<NF?OFS:ORS)
    }
}
$ awk -f tst.awk file
1:1:1:1:10:10:10:10:1:9:7:10:10:10:10:9:8:5:8:8:8:1:1:1:3:2:4:2:3
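For quick interactive use, the same any-awk logic also fits on one line (a sketch equivalent to the script above):

$ awk -F, -v OFS=: '{for(i=1;i<=NF;i++) if(!seen[i,$i]++) cnt[i]++}
      END{for(i=1;i<=NF;i++) printf "%s%s", cnt[i], (i<NF?OFS:ORS)}' file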
In GNU awk:
$ awk '
BEGIN { FS=OFS="," }                        # set both delimiters to ,
{
    for(i=1;i<=NF;i++)                      # iterate over every field
        a[i][$i]                            # store unique values in a 2d hash
}
END {                                       # after all the records
    for(i=1;i<=NF;i++)                      # for each field,
        for(j in a[i])                      # iterate over its unique values
            c[i]++                          # and count them
    for(i=1;i<=NF;i++)
        printf "%s%s",c[i], (i==NF?ORS:OFS) # output the counts
}' file
1,1,1,1,10,10,10,10,1,9,7,10,10,10,10,9,8,5,8,8,8,1,1,1,3,2,4,2,3
The output is not exactly the same as yours, and I'm not sure whether the mistake is yours or mine. The last column, for example, holds the values 79, 0 and NA, so the count of 3 here looks more accurate.
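If you want colon-separated output to match the expected sample, only the separators need to change; a minimal tweak of the BEGIN block above:

BEGIN { FS=","; OFS=":" }    # read commas, print colons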
Another awk. This one gives you rolling counts; pipe it to tail -1 to get the last line with the overall counts:
$ awk -F, -v OFS=: '{for(i=1;i<=NF;i++)
      printf "%s%s", NR-(a[i,$i]++?++c[i]:c[i]), (i==NF)?ORS:OFS}' file
1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1:1
1:1:1:1:2:2:2:2:1:2:2:2:2:2:2:2:2:2:2:2:2:1:1:1:2:1:2:1:2
1:1:1:1:3:3:3:3:1:3:3:3:3:3:3:3:3:2:3:3:3:1:1:1:3:2:3:2:3
1:1:1:1:4:4:4:4:1:4:4:4:4:4:4:4:4:3:4:4:4:1:1:1:3:2:4:2:3
1:1:1:1:5:5:5:5:1:5:5:5:5:5:5:5:5:3:5:5:5:1:1:1:3:2:4:2:3
1:1:1:1:6:6:6:6:1:5:5:6:6:6:6:6:5:4:6:6:6:1:1:1:3:2:4:2:3
1:1:1:1:7:7:7:7:1:6:6:7:7:7:7:6:5:5:7:6:6:1:1:1:3:2:4:2:3
1:1:1:1:8:8:8:8:1:7:7:8:8:8:8:7:6:5:8:7:7:1:1:1:3:2:4:2:3
1:1:1:1:9:9:9:9:1:8:7:9:9:9:9:8:7:5:8:8:8:1:1:1:3:2:4:2:3
1:1:1:1:10:10:10:10:1:9:7:10:10:10:10:9:8:5:8:8:8:1:1:1:3:2:4:2:3
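The packed expression NR-(a[i,$i]++?++c[i]:c[i]) takes a moment to parse: c[i] counts the duplicate occurrences seen so far in field i, so the number of distinct values is NR minus c[i]. The same logic spelled out, as an equivalent sketch:

$ awk -F, -v OFS=: '{
    for (i=1; i<=NF; i++) {
        if (a[i,$i]++)        # value already seen in field i,
            c[i]++            # so this row adds a duplicate
        printf "%s%s", NR-c[i], (i==NF?ORS:OFS)   # distinct = rows - duplicates
    }
}' file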
Related
Print columns from two files
How to print columns from various files? Following "Awk: extract different columns from many different files", I tried:

paste <(awk '{printf "%.4f %.5f ", $1, $2}' FILE.R ) <(awk '{printf "%.6f %.0f.\n", $3, $4}' FILE_R )

FILE.R == ARGV[1] { one[FNR]=$1 }
FILE.R == ARGV[2] { two[FNR]=$2 }
FILE_R == ARGV[3] { three[FNR]=$3 }
FILE_R == ARGV[4] { four[FNR]=$4 }
END {
    for (i=1; i<=length(one); i++) {
        print one[i], two[i], three[i], four[i]
    }
}

but I don't understand how to use this script.

FILE.R:
56604.6017 2.3893 2.2926 2.2033
56605.1562 2.3138 2.2172 2.2033

FILE_R:
56604.6017 2.29259 0.006699 42.
56605.1562 2.21716 0.007504 40.

Output desired:
56604.6017 2.3893 0.006699 42.
56605.1562 2.3138 0.007504 40.

Thank you
This is one way:

$ awk -v OFS="\t" 'NR==FNR{a[$1]=$2;next}{print $1,a[$1],$3,$4}' file1 file2

Output:

56604.6017 2.3893 0.006699 42.
56605.1562 2.3138 0.007504 40.

Explained:

$ awk -v OFS="\t" '        # set the output field separator to a tab
NR==FNR {                  # process the first file
    a[$1]=$2               # hash the second field, using the first as key
    next
}
{
    print $1,a[$1],$3,$4   # output
}' file1 file2

If the field spacing with tabs is not enough, use printf with modifiers like in your sample.
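For example, a sketch of the same lookup with printf formatting (the widths are guessed from the sample, not taken from the question):

$ awk 'NR==FNR{a[$1]=$2; next}
       {printf "%.4f %.4f %.6f %s\n", $1, a[$1], $3, $4}' FILE.R FILE_R
56604.6017 2.3893 0.006699 42.
56605.1562 2.3138 0.007504 40.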
Split multiple columns with awk
I need to split a file with multiple columns that looks like this:

TCONS_00000001  q1:Ovary1.13|Ovary1.13.1|100|32.599877  q2:Ovary2.16|Ovary2.16.1|100|88.36
TCONS_00000002  q1:Ovary1.19|Ovary1.19.1|100|12.876644  q2:Ovary2.15|Ovary2.15.1|100|365.44
TCONS_00000003  q1:Ovary1.19|Ovary1.19.2|0|0.000000  q2:Ovary2.19|Ovary2.19.1|100|64.567

Output needed:

TCONS_00000001  Ovary1.13.1  32.599877  Ovary2.16.1  88.36
TCONS_00000002  Ovary1.19.1  12.876644  Ovary2.15.1  365.44
TCONS_00000003  Ovary1.19.2  0.000000  Ovary2.19.1  64.567

My attempt:

awk 'BEGIN {OFS=FS="\t"} {split($2,two,"|"); split($3,thr,"|"); print $1,two[2],two[4],thr[2],thr[4]}' in.file

Problem: I have many more columns to split like 2 and 3, and I would like a shorter solution than splitting every column one by one.
While Sundeep's answer is great, if you are planning a repeated action on a set of records, I suggest using a function and running it on each record. I would write an awk script as below:

#!/usr/bin/env awk
function split_args(record) {
    n=split(record,split_array,"[:|]")
    return (split_array[3]"\t"split_array[n])
}
BEGIN { FS=OFS="\t" }
{
    for (i=2;i<=NF;i++) {
        $i=split_args($i)
    }
    print
}

and invoke it as:

awk -f script.awk inputfile

An ugly command-line version of it would be:

awk 'function split_args(record) {
    n=split(record,split_array,"[:|]")
    return (split_array[3]"\t"split_array[n])
}
BEGIN { FS=OFS="\t" }
{
    for (i=2;i<=NF;i++) {
        $i=split_args($i)
    }
    print
}' newfile
$ # borrowing simplicity from @Inian's answer ;)
$ awk 'BEGIN{FS=OFS="\t"} {for(i=2; i<=NF; i++){split($i,a,/[:|]/); $i=a[3] "\t" a[5]}} 1' ip.txt
TCONS_00000001  Ovary1.13.1  32.599877  Ovary2.16.1  88.36
TCONS_00000002  Ovary1.19.1  12.876644  Ovary2.15.1  365.44
TCONS_00000003  Ovary1.19.2  0.000000  Ovary2.19.1  64.567

$ # previous solution, which leaves a trailing tab on each line
$ awk -F'\t' '{printf "%s\t",$1; for(i=2; i<=NF; i++){split($i,a,/[:|]/); printf "%s\t%s\t",a[3],a[5]}; print ""}' ip.txt
TCONS_00000001  Ovary1.13.1  32.599877  Ovary2.16.1  88.36
TCONS_00000002  Ovary1.19.1  12.876644  Ovary2.15.1  365.44
TCONS_00000003  Ovary1.19.2  0.000000  Ovary2.19.1  64.567
Calculate average of each column in a file
I have a text file with n rows (comma-separated) and columns, and I want to find the average of each column, excluding empty fields. A sample input looks like:

1,2,3
4,,6
,7,

The desired output is:

2.5, 4.5, 4.5

I tried:

awk -F',' '{ for(i=1;i<=NF;i++) sum[i]=sum[i]+$i; if(max < NF) max=NF; } END { for(j=1;j<=max;j++) printf "%d\t",sum[j]/max; }' input

but it treats consecutive delimiters as one and mixes up the columns. Any help is much appreciated.
You can use this one-liner:

$ awk -F, '{for(i=1; i<=NF; i++) {a[i]+=$i; if($i!="") b[i]++}}; END {for(i=1; i<=NF; i++) printf "%s%s", a[i]/b[i], (i==NF?ORS:OFS)}' foo
2.5 4.5 4.5

Otherwise, you can save this in a file script.awk and run awk -f script.awk your_file:

{
    for(i=1; i<=NF; i++) {
        a[i]+=$i
        if($i!="") b[i]++
    }
}
END {
    for(i=1; i<=NF; i++)
        printf "%s%s", a[i]/b[i], (i==NF?ORS:OFS)
}
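One caveat with the above: if a column is empty on every row, b[i] stays zero and a[i]/b[i] divides by zero. A guarded sketch of the END block (printing an empty field for all-empty columns is an arbitrary choice):

END {
    for(i=1; i<=NF; i++)
        printf "%s%s", (b[i] ? a[i]/b[i] : ""), (i==NF?ORS:OFS)
}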
AWK command to simulate full outer join and then compare
Hello guys, I need help building an awk command that simulates a full outer join and then compares values. Say:

cat File1
1|A|B
2|C|D
3|E|F

cat File2
1|A|X
2|C|D
3|Z|F

Assumptions:
- the first column in both files is the key field, so there are no duplicates
- both files are expected to have the same structure
- there is no limit on the number of fields

Now, if I run the awk command

awk -F'|' ........... File1 File2 > output

the output format should be

<Key>|<File1.column1>|<File2.column1>|<Matched/Mismatched>|<File1.column2>|<File2.column2>|<Matched/Mismatched>|<File1.column3>|<File2.column3>|<Matched/Mismatched>

cat output
1|A|A|MATCHED|B|X|MISMATCHED
2|C|C|MATCHED|D|D|MATCHED
3|E|Z|MISMATCHED|F|F|MATCHED

Thank You
$ awk -v OFS=\| -F\| 'NR==FNR{for(i=2;i<=NF;i++)a[$1][i]=$i;next}{printf "%s",$1;for(i=2;i<=NF;i++){printf"|%s|%s|%s",a[$1][i],$i,a[$1][i]==$i?"matched":"mismatched"}printf"\n"}' file1 file2
1|A|A|matched|B|X|mismatched
2|C|C|matched|D|D|matched
3|E|Z|mismatched|F|F|matched

The same program, laid out and commented:

BEGIN { OFS="|"; FS="|" }
NR==FNR {                  # for the first file
    for(i=2;i<=NF;i++)     # fill the array with the "non-key" fields,
        a[$1][i]=$i        # using the "key" field as the index
    next
}
{
    printf "%s",$1
    for(i=2;i<=NF;i++) {   # use the key field to match and print
        printf"|%s|%s|%s",a[$1][i],$i,a[$1][i]==$i?"matched":"mismatched"
    }
    printf"\n"             # sugar on the top
}
Perhaps easier with an assist from join:

$ join -t'|' file1 file2 | awk -F'|' -v OFS='|' '{n="MIS"; m="MATCHED"; m1=($2!=$4?n:"")m; m2=($3!=$5?n:"")m; print $1,$2,$4,m1,$3,$5,m2}'
1|A|A|MATCHED|B|X|MISMATCHED
2|C|C|MATCHED|D|D|MATCHED
3|E|Z|MISMATCHED|F|F|MATCHED

For an unspecified number of fields, more awk is needed (c=(NF-1)/2 is the number of fields per file after the shared key):

$ join -t'|' file1 file2 | awk -F'|' '{c=(NF-1)/2; printf "%s", $1; for(i=2;i<=c+1;i++) printf "|%s|%s|%s", $i,$(i+c),($i!=$(i+c)?"MIS":"")"MATCHED"; print ""}'
$ cat tst.awk
BEGIN { FS=OFS="|" }
NR==FNR {
    for (i=2; i<=NF; i++) {
        a[$1,i] = $i
    }
    next
}
{
    printf "%s%s", $1, OFS
    for (i=2; i<=NF; i++) {
        printf "%s%s%s%s%s%s", a[$1,i], OFS, $i, OFS, (a[$1,i]==$i ? "" : "MIS") "MATCHED", (i<NF ? OFS : ORS)
    }
}
$ awk -f tst.awk file1 file2
1|A|A|MATCHED|B|X|MISMATCHED
2|C|C|MATCHED|D|D|MATCHED
3|E|Z|MISMATCHED|F|F|MATCHED
Subtracting every column in a row for similar row numbers in two separate files
If I have two files:

file1
2,3,1,4,5,2,1
1,2,4,6,3,1,3
1,2,1,1,1,1,1

file2
1,1,1,1,1,1,1
1,1,1,1,1,1,1
1,1,1,1,1,1,1

I want to subtract the numbers row by row: all the numbers of row 1 of file1 minus all the numbers of row 1 of file2, and so forth.

Output:
1,2,0,3,4,1,0
0,1,3,5,2,0,2
0,1,0,0,0,0,0
$ paste -d, file1 file2 | awk -F, '{n=NF/2; s=""; for (i=1;i<=n;i++) {printf "%s%s", s, $i-$(i+n); s=","}; print ""}'
1,2,0,3,4,1,0
0,1,3,5,2,0,2
0,1,0,0,0,0,0

How it works:

paste -d, file1 file2 combines the files row by row, so each combined line carries the fields of both input lines.

n=NF/2; s=""; for (i=1;i<=n;i++) {printf "%s%s", s, $i-$(i+n); s=","} subtracts field i+n (from file2) from field i (from file1) and prints the differences, comma-separated.

print "" prints a newline character at the end of each line.
You could use two-dimensional arrays with GNU Awk:

$ cat subtract_fields.awk
BEGIN { FS=OFS="," }
{
    if(FNR==NR) {
        for(i=1; i<=NF; i++)
            a[FNR][i]=$i
    } else {
        for(i=1; i<=NF; i++)
            $i=a[FNR][i]-$i
        delete a[FNR]
        print
    }
}
$ awk -f subtract_fields.awk file1 file2
1,2,0,3,4,1,0
0,1,3,5,2,0,2
0,1,0,0,0,0,0