I have a file with several columns ($2, $3, and so on up to $32), as in
A refdevhet devdevhomo
B refdevhet refdevhet
C refrefhomo refdevhet
D devrefhet refdevhet
I need to count how many occurrences there are of each unique element in each column separately, so that I have
refdevhet 2 3
refrefhomo 1 0
devrefhet 1 0
devdevhomo 0 1
I tried several variations of
awk 'BEGIN {
    FS=OFS="\t"
}
{
    for(i=1; i<=32; i++) a[$i]++
}
END {
    for (i in a) print i, a[i]
}' file
but instead it's printing the cumulative sum of occurrences of unique elements across the selected fields.
Here is a solution:
BEGIN {
    FS=OFS="\t"
}
{
    if (NF>mxf) mxf = NF;
    for(i=1; i<=NF; i++) {ws[$i]=1; c[$i,i]++}
}
END {
    for (w in ws) {
        printf "%s", w
        for (i=1;i<=mxf;i++) printf "%s%d", OFS, c[w,i];
        print ""
    }
}
Note that this solution is general: it takes the first column into account as well. To omit the first column, change i=1 to i=2 in both places.
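For example, with i=2 in both places (so the row labels in column 1 are skipped) and the program saved as colcount.awk (a hypothetical file name), the sample input gives the expected per-column counts. Note that for (w in ws) visits keys in an unspecified order, so your rows may be printed in a different order:
$ awk -f colcount.awk file
refdevhet 2 3
refrefhomo 1 0
devrefhet 1 0
devdevhomo 0 1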
In addition to Andriy's good answer, with GNU awk you can use a true two-dimensional array:
gawk '
{for (i=2; i<=NF; i++) count[$i][i]++}
END {
    for (word in count) {
        printf "%s", word
        for (i=2; i<=NF; i++) printf "%s%d", OFS, count[word][i]
        print ""
    }
}
' file | column -t
I'm assuming here that each line has the same number of fields as the last line.
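If that assumption doesn't hold, a small variant (an untested sketch) can remember the widest line seen, much like the mxf variable in Andriy's answer:
gawk '
{
    if (NF > maxnf) maxnf = NF          # remember the widest line seen so far
    for (i=2; i<=NF; i++) count[$i][i]++
}
END {
    for (word in count) {
        printf "%s", word
        for (i=2; i<=maxnf; i++) printf "%s%d", OFS, count[word][i]
        print ""
    }
}
' file | column -t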
"Using awk to bin values in a list of numbers" provide a solution to average each set of 3 points in a column using awk.
How can it be extended to an indefinite number of columns while maintaining the format? For example:
2457135.564106 13.249116 13.140903 0.003615 0.003440
2457135.564604 13.250833 13.139971 0.003619 0.003438
2457135.565067 13.247932 13.135975 0.003614 0.003432
2457135.565576 13.256441 13.146996 0.003628 0.003449
2457135.566039 13.266003 13.159108 0.003644 0.003469
2457135.566514 13.271724 13.163555 0.003654 0.003476
2457135.567011 13.276248 13.166179 0.003661 0.003480
2457135.567474 13.274198 13.165396 0.003658 0.003479
2457135.567983 13.267855 13.156620 0.003647 0.003465
2457135.568446 13.263761 13.152515 0.003640 0.003458
Averaging values every 5 lines should output something like
2457135.564916 13.253240 13.143976 0.003622 0.003444
2457135.567324 13.270918 13.161303 0.003652 0.003472
where the first result is the average of lines 1-5, and the second is the average of lines 6-10.
The accepted answer to Using awk to bin values in a list of numbers is:
awk '{sum+=$1} NR%3==0 {print sum/3; sum=0}' inFile
The obvious extension to average all the columns is:
awk 'BEGIN { N = 3 }
     { for (i = 1; i <= NF; i++) sum[i] += $i }
     NR % N == 0 { for (i = 1; i <= NF; i++)
                   {
                       printf("%.6f%s", sum[i]/N, (i == NF) ? "\n" : " ")
                       sum[i] = 0
                   }
                 }' inFile
The extra flexibility here is that if you want to group blocks of 5 rows, you simply change one occurrence of 3 into 5. This ignores blocks of up to N-1 rows at the end of the file. If you want to, you can add an END block that prints a suitable average if NR % N != 0.
For the sample input data, the output I got from the script above was:
2457135.564592 13.249294 13.138950 0.003616 0.003437
2457135.566043 13.264723 13.156553 0.003642 0.003465
2457135.567489 13.272767 13.162732 0.003655 0.003475
You can make the code much more complex if you want to analyze what the output formats should be. I've simply used %.6f to ensure 6 decimal places.
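If you do want to handle a trailing partial block as mentioned above, one possible END clause (a sketch, assuming every line has the same number of fields) would be:
END { if (NR % N != 0)
          for (i = 1; i <= NF; i++)
              printf("%.6f%s", sum[i] / (NR % N), (i == NF) ? "\n" : " ")
    }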
If you want N to be a command-line parameter, you can use the -v option to relay the variable setting to awk:
awk -v N="${variable:-3}" \
    '{ for (i = 1; i <= NF; i++) sum[i] += $i }
     NR % N == 0 { for (i = 1; i <= NF; i++)
                   {
                       printf("%.6f%s", sum[i]/N, (i == NF) ? "\n" : " ")
                       sum[i] = 0
                   }
                 }' inFile
When invoked with $variable set to 5, the output generated from the sample data is:
2457135.565078 13.254065 13.144591 0.003624 0.003446
2457135.567486 13.270757 13.160853 0.003652 0.003472
If we have an input:
TargetIDs,CPD,Value,SMILES
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
95,CPD-3333333,-1,c1ccccc1N
Now we would like to separate the duplicates and non-duplicates based on the fourth column (SMILES):
duplicate:
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
non-duplicate:
95,CPD-3333333,-1,c1ccccc1N
The following attempt separates the duplicates without any problem. However, the first occurrence of each duplicate is still included in the non-duplicate file.
BEGIN { FS = ","; f1="a"; f2="b"}
{
    # Keep count of the fields in the fourth column
    count[$4]++;
    # Save the line the first time we encounter a unique field
    if (count[$4] == 1)
        first[$4] = $0;
    # If we encounter the field for the second time, print the
    # previously saved line
    if (count[$4] == 2)
        print first[$4] > f1 ;
    # From the second time onward, always print because the field is
    # duplicated
    if (count[$4] > 1)
        print > f1;
    if (count[$4] == 1) #if (count[$4] - count[$4] == 0) <= change to this doesn't work
        print first[$4] > f2;
}
duplicate output results from the attempt:
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
non-duplicate output results from the attempt:
TargetIDs,CPD,Value,SMILES
95,CPD-3333333,-1,c1ccccc1N
95,CPD-1111111,-2,c1ccccc1
May I know if any guru might have comments/solutions? Thanks.
I would do this:
awk '
    NR==FNR {count[$2] = $1; next}
    FNR==1 {FS=","; next}
    {
        output = (count[$NF] == 1 ? "nondup" : "dup")
        print > output
    }
' <(cut -d, -f4 input | sort | uniq -c) input
The process substitution will pre-process the file and perform a count on the 4th column. Then, you can process the file and decide if that line is "duplicated".
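For the sample input, the pre-processed stream looks something like this (line order and spacing can vary with your locale and uniq implementation), which is why the first rule stores count[$2] = $1:
$ cut -d, -f4 input | sort | uniq -c
      1 SMILES
      3 c1ccccc1
      1 c1ccccc1N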
All in awk: Ed Morton shows a way to collect the data in a single pass. Here's a 2-pass solution that's virtually identical to my example above:
awk -F, '
    NR==FNR {count[$NF]++; next}
    FNR==1 {next}
    {
        output = (count[$NF] == 1 ? "nondup" : "dup")
        print > output
    }
' input input
Yes, the input file is given twice: NR==FNR is only true while awk reads the first copy, so the counting rule runs on the first pass and the printing rule on the second.
$ cat tst.awk
BEGIN{ FS="," }
NR>1 {
    if (cnt[$4]++) {
        dups[$4] = nonDups[$4] dups[$4] $0 ORS
        delete nonDups[$4]
    }
    else {
        nonDups[$4] = $0 ORS
    }
}
END {
    print "Duplicates:"
    for (key in dups) {
        printf "%s", dups[key]
    }
    print "\nNon Duplicates:"
    for (key in nonDups) {
        printf "%s", nonDups[key]
    }
}
$ awk -f tst.awk file
Duplicates:
95,CPD-1111111,-2,c1ccccc1
95,CPD-2222222,-3,c1ccccc1
95,CPD-2222222,-4,c1ccccc1
Non Duplicates:
95,CPD-3333333,-1,c1ccccc1N
This solution only works if the duplicates are grouped together.
awk -F, '
    function fout( f, i) {
        f = (cnt > 1) ? "dups" : "nondups"
        for (i = 1; i <= cnt; ++i)
            print lines[i] > f
    }
    NR > 1 && $4 != lastkey { fout(); cnt = 0 }
    { lastkey = $4; lines[++cnt] = $0 }
    END { fout() }
' file
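If your input isn't already grouped by the fourth column, you could group it first while keeping the header line out of the sort; a rough sketch, with the program above saved in a file named group_dups.awk (a hypothetical name):
{ head -n 1 file; tail -n +2 file | sort -t, -k4,4; } | awk -F, -f group_dups.awk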
A little late, but here's my version in awk:
awk -F, 'NR>1{a[$0":"$4];b[$4]++}
END{d="\n\nnondupe";e="dupe"
for(i in a){split(i,c,":");b[c[2]]==1?d=d"\n"i:e=e"\n"i} print e d}' file
Another one, built along the same lines as glenn jackman's but entirely in awk:
awk -F, 'function r(f) {while((getline <f)>0)a[$4]++;close(f)}
BEGIN{r(ARGV[1])}{output=(a[$4] == 1 ? "nondup" : "dup");print >output} ' file
The following works great on my data for column 12, but I have over 70 columns that are not all the same, and I need to output all of the columns, with the converted values replacing the scientific-notation ones.
awk -F',' '{printf "%.41f\n", $12}' $file
Thanks
This is one line:
2012-07-01T21:59:50,2012-07-01T21:59:00,1817,22901,264,283,549,1,2012-06-24T13:20:00,2.600000000000000e+001,4.152327506554059e+001,-7.893523806678388e+001,5.447572631835938e+002,2.093000000000000e+003,5.295000000000000e+003,1,194733,1.647400093078613e+001,31047680,1152540,29895140,4738,1.586914062500000e+000,-1.150000000000000e+002,0.000000000000000e+000,0.000000000000000e+000,0.000000000000000e+000,0.000000000000000e+000,0.000000000000000e+000,0.000000000000000e+000,0.000000000000000e+000,3.606000000000000e+003,0.000000000000000e+000,0.000000000000000e+000,0.000000000000000e+000,4.557073364257813e+002,0.000000000000000e+000,0.000000000000000e+000,0.000000000000000e+000,11,0.000000000000000e+000,2.000000000000000e+000,0,0,0,0,4.466836981009692e-004,0.000000000000000e+000,0.000000000000000e+000,0.000000000000000e+000,8,0,840,1,600,1,6,1,1,1,5,2,2,2,1,1,1,1,4854347,0,-
UPDATE
This is working for the non-converted output. I am having a bit of trouble inserting an else if statement for some reason; everything seems to give me a syntax error, whether in a file or on the command line.
awk -F',' '{for (i=1;i<=NF;i++) {if (i <= 9||i == 16||i == 17||i == 19||i == 20||i == 21||i == 22|| i == 40|| i == 43||i == 44||i == 45||i == 46||i >= 51) printf ($i",")};}' $file
I would like to insert the following statement into the code above:
else if (i == 10) printf ("%.41f", $i)
SOLVED
Got it worked out. Thanks for all the great ideas. I can't seem to make it work in a file with awk -f, but on the command line it works great. I put this one-liner in my program.
awk -F',' '{for (i=1;i<=NF;i++) {if (i <= 9||i == 16||i == 17||i >= 19&&i <= 22|| i == 40|| i >= 43&&i <= 46||i >= 51&&i <= 70) printf($i","); else if (i == 10||i == 18) printf("%.2f,", $i); else if (i == 11||i == 12) printf("%.41f,", $i); else if (i == 13) printf("%.1f,", $i); else if (i == 14||i == 15||i >= 24&&i <= 46) printf ("%d,", $i); else if (i == 23) printf("%.4f,", $i); else if (i >= 47&&i <= 50) printf("%.6f,", $i); if (i == 71) printf ($i"\n")};}'
RESULT
2012-07-01T21:59:50,2012-07-01T21:59:00,1817,22901,264,283,549,1,2012-06-24T13:20:00,26.00,41.52327506554058800247730687260627746582031,-78.93523806678388154978165403008460998535156,544.8,2093,5295,1,194733,16.47,31047680,1152540,29895140,4738,1.5869,-115,0,0,0,0,0,0,0,3606,0,0,0,455,0,0,0,11,0,2,0,0,0,0,0.000447,0.000000,0.000000,0.000000,8,0,840,1,600,1,6,1,1,1,5,2,2,2,1,1,1,1,4854347,0,-
You can do regex matching in a loop to choose the format for each field since numbers are also strings in AWK:
#!/usr/bin/awk -f
BEGIN {
    d = "[[:digit:]]"
    OFS = FS = ","
}
{
    delim = ""
    for (i = 1; i <= NF; i++) {
        # match a literal exponent such as "e+001"; the sign goes in a bracket
        # expression because a bare "+" would act as a regex quantifier
        if ($i ~ d "e[+-]" d d d "$") {
            printf "%s%.41f", delim, $i
        }
        else {
            printf "%s%s", delim, $i
        }
        delim = OFS
    }
    printf "\n"
}
Edit:
I've changed the version above so you can see how it would be used in a file as an AWK script. Save it (I'll call it "scinote"), make it executable with chmod u+x scinote, and then you can run it like this: ./scinote inputfile
I've also modified the latest version you added to your question to make it a little simpler and so it's ready to go into a script file as above.
#!/usr/bin/awk -f
BEGIN {
    plainlist = "16 17 19 20 21 22 40 43 44 45 46"
    split(plainlist, arr)
    for (i in arr) {
        plainfmt[arr[i]] = "%s"
    }
    OFS = FS = ","
}
{
    delim = ""
    for (i = 1; i <= NF; i++) {
        printf "%s", delim
        if (i <= 9 || i in plainfmt || i >= 51) {
            # fields with no plainfmt entry fall back to a plain %s
            fmt = (i in plainfmt) ? plainfmt[i] : "%s"
            printf fmt, $i
        }
        else if (i == 10) {
            printf "%.41f", $i
        }
        else if (i == 12) {
            printf "%.12f", $i
        }
        delim = OFS
    }
    printf "\n"
}
If you had more fields with other formats (rather than just one per), you could do something similar to the plainfmt array.
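For example, a sketch (reusing some of the formats from the question's final one-liner; the field numbers and formats here are just illustrations) could populate a per-field format array in BEGIN and fall back to %s everywhere else:
#!/usr/bin/awk -f
BEGIN {
    OFS = FS = ","
    split("10 18", a); for (i in a) fmt[a[i]] = "%.2f"
    split("11 12", a); for (i in a) fmt[a[i]] = "%.41f"
    fmt[13] = "%.1f"; fmt[23] = "%.4f"
    split("47 48 49 50", a); for (i in a) fmt[a[i]] = "%.6f"
}
{
    delim = ""
    for (i = 1; i <= NF; i++) {
        f = (i in fmt) ? fmt[i] : "%s"   # default: print the field unchanged
        printf("%s" f, delim, $i)
        delim = OFS
    }
    printf "\n"
}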
You could always loop through all of your data fields and use them in your printf. For a simple file just to test the mechanics you could try this:
awk '{for (i=1; i<=NF; i++) printf("%d = %s\n", i, $i);}' data.txt
Note that -F is not set here, so fields will be split by whitespace.
NF is the predefined variable for the number of fields on a line; fields start at 1 (e.g., $1, $2, and so on up to $NF). $0 is the whole line.
So for your example this may work:
awk -F',' '{for (i=1; i<=NF; i++) printf "%.41f\n", $i}' $file
Update based on the comment below (I'm not on a system where I can test the syntax):
If you have certain fields that need to be treated differently, you may have to resort to a switch statement or an if-statement to treat different fields differently. This would be easier if you stored your script in a file, let's call it so.awk, and invoked it like this:
awk -f so.awk $file
Your script might contain something along these lines:
BEGIN{ FS = "," }
{ for (i=1; i<=NF; i++)
    {
        if (i == 20 || i == 22 || i == 30)
            printf( " .. ", $i)
        else if ( i == 13 || i == 24)
            printf( " ....", $i)
        etc.
    }
}
You can of course also use if (i > 2) ... or other ranges to avoid having to list out every single field.
As an alternative to this series of if-statements, see the switch statement mentioned above.
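With GNU awk, the switch statement looks roughly like this (a sketch only; the formats are placeholders, like the " .. " ones above):
BEGIN{ FS = "," }
{
    for (i = 1; i <= NF; i++) {
        switch (i) {
        case 20: case 22: case 30:
            printf "%.2f,", $i     # placeholder format
            break
        case 13: case 24:
            printf "%.4f,", $i     # placeholder format
            break
        default:
            printf "%s,", $i
        }
    }
    printf "\n"
}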