calculating the mean of columns in text files - awk

I have two folders named f1 and f2. These folders contain 300 text files with 2 columns.The content of files are shown below.I would like to calculate the mean of second column.file names are same in both folders.
file1 in f1 folder
54 6
55 10
57 5
file2 in f1 folder
24 8
28 12
file1 in f2 folder
34 3
22 8
file2 in f2 folder
24 8
28 13
output
folder1 folder2
file1 21/3= 7 11/2=5.5
file2 20/2=10 21/2=10.5
-- -- --
-- -- --
file300 -- --
total mean of folder1 = sum of the means/3oo
total mean of folder2 = sum of the means/3oo

I'd do it with two awk scripts. (Originally, I had a sort phase in the middle, but that isn't actually necessary. However, I think that two scripts is probably easier than trying to combine them into one. If someone else does it 'all in one' and it is comprehensible, then choose their solution instead.)
Sample run and output
This is based on the 4 files shown in the question. The names of the files are listed on the command line, but the order doesn't matter. The code assumes that there is only one slash in the file names, and no spaces and the like in the file names.
$ awk -f summary1.awk f?/* | awk -f summary2.awk
file1 21/3 = 7.000 11/2 = 5.500
file2 20/2 = 10.000 21/2 = 10.500
total mean of f1 = 17/2 = 8.500
total mean of f2 = 16/2 = 8.000
summary1.awk
function print_data(file, sum, count) {
sub("/", " ", file);
print file, sum, count;
}
oldfile != FILENAME { if (count > 0) { print_data(oldfile, sum, count); }
count = 0; sum = 0; oldfile = FILENAME
}
{ count++; sum += $2 }
END { print_data(oldfile, sum, count) }
This processes each file in turn, summing the values in column 2 and counting the number of lines. It prints out the folder name, the file name, the sum and the count.
summary2.awk
{
sum[$2,$1] = $3
cnt[$2,$1] = $4
if (file[$2]++ == 0) file_list[n1++] = $2
if (fold[$1]++ == 0) fold_list[n2++] = $1
}
END { for (i = 0; i < n1; i++)
{
printf("%-20s", file_list[i])
name = file_list[i]
for (j = 0; j < n2; j++)
{
folder = fold_list[j]
s = sum[name,folder]
n = cnt[name,folder]
a = (s + 0.0) / n
printf(" %6d/%-3d = %10.3f", s, n, a)
gsum[folder] += a
}
printf("\n")
}
for (i = 0; i < n2; i++)
{
folder = fold_list[i]
s = gsum[folder]
n = n1;
a = (s + 0.0) / n
printf("total mean of %-6s = %6d/%-3d = %10.3f\n", folder, s, n, a)
}
}
The file associative array tracks references to file names. The file_list array keeps the file names in the order that they're read. Similarly, the fold associative array tracks the folder names, and the fold_list array keeps track of the folder names in the order that they appear. If you do something weird enough with the order that you supply the names to the first command, you may need to insert a sort command between the two awk commands, such as sort -k2,2 -k1,1.
The sum associative array contains the sum for a given file name and folder name. The cnt associative array contains the count for a given file name and folder name.
The END section of the report has two main loops (though the first loop contains a nested loop). The first main loop processes the files in the order presented, generating one line containing one entry for each folder. It also accumulates the averages for the folder name.
The second main loop generates the 'total mean` data for each folder. I'm not sure whether the statistics makes sense (shouldn't the overall mean for folder1 be the sum of the values in folder1 divided by the number of entries, or 41/5 = 8.2 rather than 17/2 or 8.5?), but the calculation does what I think the question asks for (sum of means / number of files, written as 300 in the question).

With some help from grep:
grep '[0-9]' folder[12]/* | awk '
{
split($0,b,":");
f=b[1]; split(f,c,"/"); d=c[1]; f=c[2];
s[f][d]+=$2; n[f][d]++; nn[d]++;}
END{
for (f in s) {
printf("%-10s", f);
for (d in s[f]) {
a=s[f][d] / n[f][d];
printf(" %6.2f ", a);
p[d] += a;
}
printf("\n");
}
for (d in p) {
printf("total mean %-8s = %8.2f\n", d, p[d]/nn[d]);
}
}'

Related

How to find records with minimum value in a specific column in multiple files?

I have 2 column 4000 dat files. from each file, I need to identify first mimum value of column 2 and print corresponding row. Then this should run on multiple files in the folder and append these values to a new file. I have tried below code.
File names include common string:
fig_3-28333.dat
^^^^^ file number
awk'BEGIN{min=0}{if(($2)>min) min=($2)}END {print line}' cat >> new.dat
output file expected to be
file number Column 1 column2
28333 x value first minimum value
28334 x value first minimum value
NOTE: This only works with gawk (which understands the ENDFILE pattern), and not the regular awk
Here is my script, min.awk:
BEGIN {
print "file number Column 1 column2"
}
FNR == 1 {
min = $2;
first = $1
second = $2
}
$2 < min {
min = $2
first = $1
second = $2
}
ENDFILE {
# Extract the file number to a[1]
match(FILENAME, /.*-([0-9]+)\.dat/, a)
print a[1], first, second
}
Notes
The BEGIN pattern prints the heading
At the first line of each file (pattern: FNR == 1), establish the minimum value
For those lines whose second value is less than the minimum (pattern: $2 < min), establish the new minimum value
At the end of each file, print out the minimum value for that file
Invoke the script
gawk -f min.awk *.dat
Update
After reviewing my script, I duplicated code which I can eliminate by combining the two blocks:
BEGIN {
print "file number Column 1 column2"
}
FNR == 1 || $2 < min{
min = $2;
first = $1
second = $2
}
ENDFILE {
# Extract the file number to a[1]
match(FILENAME, /.*-([0-9]+)\.dat/, a)
print a[1], first, second
}

Passing arrays in awk function

I want to write a function which accepts two arguments one is a constant values and another is an array. The function finds the index of the element in the arrays and returns it.I want to call this function with multiple arrays just as below what I have tried.
BEGIN{
a[1]=2;
a[2]=4;
a[3]=3;
b[1]=4;
b[2]=2;
b[3]=6;
c[1]=5;
c[2]=1;
c[3]=6;
arr[1]=a;
arr[2]=b;
arr[3]=c
}
function pos(val,ar[]) {
for (m=1;m<=length(ar);m++) { if (val == ar[m] )
return m;
else continue }
}
{for( k=1;k<=NF;k++) { for(l=1;l<=length(arr);l++) { print "pos=" pos($i,arr[l])} } }
but I am getting errors :
fatal: attempt to use array `a' in a scalar context
Looking at the code can anyone tell me how can I achieve what I am trying to achieve using awk. The challenge I have here is to assign and array as an element to another array as in arr[1]=a and passing the the array as a parameter by referencing it with its index as in pos($i,arr[l] . I dont know how to make these statements syntactically and functionally correct in awk .
the input is :
2 4 6
3 5 6
1 2 5
and in the out put the code should return the position of the value read from the file if it is present in any of the arrays defined
output:
1 1 3
6
2 1
in first line of output indexed of corresponding elements in the array a b and c have been returned respectively . 1 is index of 2 in a , 1 is index of 4 in b and 3 is index of 6 in c and so on for the upcoming lines in the input file.
I truly don't understand what it is you're trying to do (especially why an input of 2 produces the index from a but not the index from b while an input of 4 does the reverse) but to create a multi-dimensional array arr[][] from a[], b[], and c[] with GNU awk (the only awk that supports true multi-dimensional arrays) would be:
for (i in a) arr[1][i] = a[i]
for (i in b) arr[2][i] = b[i]
for (i in c) arr[3][i] = c[i]
not just arr[1] = a, etc. Note that you're storing a copy of the contents of a[] in arr[1][], not a reference to a[], so if a[] changes then arr[1][] won't. What you might want to do instead (again GNU awk only) is store the sub-array names in arr[] and then access them through the builtin variable SYMTAB (see the man page), e.g.:
$ cat tst.awk
BEGIN{
split("2 4 3",a)
split("4 2 6",b)
split("5 1 6",c)
arr[1] = "a"
arr[2] = "b"
arr[3] = "c"
prtArr(arr)
}
function prtArr(arr, i,subArrName) {
for (i=1; i in arr; i++) {
subArrName = arr[i]
printf "arr[%d] -> %s[] =\n", i, subArrName
prtSubArr(SYMTAB[subArrName])
}
}
function prtSubArr(subArr, j) {
for (j=1; j in subArr; j++) {
print "\t" subArr[j]
}
}
.
$ awk -f tst.awk
arr[1] -> a[] =
2
4
3
arr[2] -> b[] =
4
2
6
arr[3] -> c[] =
5
1
6
Now arr[] is no longer a multi-dimensional array, it's just an array of array name strings, and the contents of a[] are only stored in 1 place (in a[]) and just referenced from SYMTAB[] indexed by the contents of arr[N] rather than copied into arr[N][].

AWK: How can I print averages of consecutive numbers in a file, but skip over alphabetical characters/strings?

I've figured out how to get the average of a file that contains numbers in all lines such as:
Numbers.txt
1
2
4
8
Output:
Average: 3.75
This is the code I use for that:
awk '{ sum += $1; tot++ } END { print sum / tot; }' Numbers.txt
However, the problem is that this doesn't take into account possible strings that might be in the file. For example, a file that looks like this:
NumbersAndExtras.txt
1
2
4
8
Hello
4
5
6
Cat
Dog
2
4
3
For such a file I'd want to print the multiple averages of the consecutive numbers, ignoring the strings such that the result looks something like this:
Output:
Average: 3.75
Average: 5
Average: 3
I could devise some complicated code that might accomplish that with variables and 'if' statements and loops and whatnot, but I've been told it's easier than that given some of awk features. I'd like to know how that might look like, along with an explanation of why it works.
BEGIN runs before reading the first line from file. Set sum and count to 0.
awk 'BEGIN{ sum=0; count=0} {if ( /[a-z][A-Z]/ ) { if (count > 0) {avg = sum/count; print avg;} count=0; sum=0} else { count++; sum += $1} } END{if (count > 0) {avg = sum/count; print avg}} ' NumbersAndExtras.txt
When there is an alphabet on the line, calculate and print average so far.
And do the same in the END block that runs after processing the whole file.
Keep it simple:
awk '/^$/{next}
/^[0-9]+/{a+=$1+0;c++;next}
c&&a{print "Average: "a/c;a=c=0}
END{if(c&&a){print "Average: "a/c}}' input_file
Results:
Average: 3.75
Average: 5
Average: 3
Another one:
$ awk '
function avg(s, c) { print "Average: ", s/c }
NF && !/^[[:digit:]]/ { if (count) avg(sum, count); sum = 0; count = 0; next}
NF { sum += $1; count++ }
END {if (count) avg(sum, count)}
' <file
Note: The value of this answer in explaining the solution; other answers offer more concise alternatives.
Try the following:
Note that this is an awk command with a script specified as a multi-line shell string literal - you can paste the whole thing into your terminal to try it; while it is possible to cram this into a single line, it hurts readability and the ability to comment:
awk '
# Define output function that prints an average.
function printAvg() { print "Average: ", sum/count }
# Skip blank lines
NF == 0 { next}
# Is the line non-numeric?
/[[:alpha:]]/ {
# If this line ends a numeric block, print its
# average now and reset the variables to start the next group.
if (count) {
printAvg()
wasNum = sum = count = 0
}
# Skip to next line.
next
}
# Numeric line: set flag, sum, and increment counter.
{ sum += $1; count++ }
# Finally:
END {
# If there is a group whose average has not been printed yet,
# do it now.
if (count) printAvg()
}
' NumbersAndExtras.txt
If we condense whitespace and strip the comments, we still get a reasonably readable solution, as long as we still use multiple lines:
awk '
function printAvg() { print "Average: ", sum/count }
NF == 0 { next}
/[[:alpha:]]/ { if (count) { printAvg(); sum = count = 0 } next }
{ sum += $1; count++ }
END { if (count) printAvg() }
' NumbersAndExtras.txt

Awk average of n data in each column

"Using awk to bin values in a list of numbers" provide a solution to average each set of 3 points in a column using awk.
How is it possible to extend it to an indefinite number of columns mantaining the format? For example:
2457135.564106 13.249116 13.140903 0.003615 0.003440
2457135.564604 13.250833 13.139971 0.003619 0.003438
2457135.565067 13.247932 13.135975 0.003614 0.003432
2457135.565576 13.256441 13.146996 0.003628 0.003449
2457135.566039 13.266003 13.159108 0.003644 0.003469
2457135.566514 13.271724 13.163555 0.003654 0.003476
2457135.567011 13.276248 13.166179 0.003661 0.003480
2457135.567474 13.274198 13.165396 0.003658 0.003479
2457135.567983 13.267855 13.156620 0.003647 0.003465
2457135.568446 13.263761 13.152515 0.003640 0.003458
averaging values every 5 lines, should output something like
2457135.564916 13.253240 13.143976 0.003622 0.003444
2457135.567324 13.270918 13.161303 0.003652 0.003472
where the first result is the average of the first 1-5 lines, and the second result is the average of the 6-10 lines.
The accepted answer to Using awk to bin values in a list of numbers is:
awk '{sum+=$1} NR%3==0 {print sum/3; sum=0}' inFile
The obvious extension to average all the columns is:
awk 'BEGIN { N = 3 }
{ for (i = 1; i <= NF; i++) sum[i] += $i }
NR % N == 0 { for (i = 1; i <= NF; i++)
{
printf("%.6f%s", sum[i]/N, (i == NF) ? "\n" : " ")
sum[i] = 0
}
}' inFile
The extra flexibility here is that if you want to group blocks of 5 rows, you simply change one occurrence of 3 into 5. This ignores blocks of up to N-1 rows at the end of the file. If you want to, you can add an END block that prints a suitable average if NR % N != 0.
For the sample input data, the output I got from the script above was:
2457135.564592 13.249294 13.138950 0.003616 0.003437
2457135.566043 13.264723 13.156553 0.003642 0.003465
2457135.567489 13.272767 13.162732 0.003655 0.003475
You can make the code much more complex if you want to analyze what the output formats should be. I've simply used %.6f to ensure 6 decimal places.
If you want N to be a command-line parameter, you can use the -v option to relay the variable setting to awk:
awk -v N="${variable:-3}" \
'{ for (i = 1; i <= NF; i++) sum[i] += $i }
NR % N == 0 { for (i = 1; i <= NF; i++)
{
printf("%.6f%s", sum[i]/N, (i == NF) ? "\n" : " ")
sum[i] = 0
}
}' inFile
When invoked with $variable set to 5, the output generated from the sample data is:
2457135.565078 13.254065 13.144591 0.003624 0.003446
2457135.567486 13.270757 13.160853 0.003652 0.003472

array over non-existing indices in awk

Sorry for the verbose question, it boils down to a very simple problem.
Assume there are n text files each containing one column of strings (denominating groups) and one of integers (denominating the values of instances within these groups):
# filename xxyz.log
a 5
a 6
b 10
b 15
c 101
c 100
#filename xyzz.log
a 3
a 5
c 116
c 128
Note that while the length of both columns within any given file is always identical it differs between files. Furthermore, not all files contain the same range of groups (the first one contains groups a, b, c, while the second one only contains groups a and c). In awk one could calculate the average of column 2 for each string in column 1 within each file separately and output the results with the following code:
NAMES=$(ls|grep .log|awk -F'.' '{print $1}');
for q in $NAMES;
do
gawk -F' ' -v y=$q 'BEGIN {print "param", y}
{sum1[$1] += $2; N[$1]++}
END {for (key in sum1) {
avg1 = sum1[key] / N[key];
printf "%s %f\n", key, avg1;
} }' $q.log | sort > $q.mean;
done;
Howerver, for the abovementioned reasons, the length of the resulting .mean files differs between files. For each .log file I'd like to output a .mean file listing the entire range of groups (a-d) in the first column and the corresponding mean value or empty spaces in the second column depending on whether this category is present in the .log file. I've tried the following code (given without $NAMES for brevity):
awk 'BEGIN{arr[a]="a"; arr[b]="b"; arr[c]="c"; arr[d]="d"}
{sum[$1] += $2; N[$1]++}
END {for (i in arr) {
if (i in sum) {
avg = sum[i] / N[i];
printf "%s %f\n" i, avg;}
else {
printf "%s %s\n" i, "";}
}}' xxyz.log > xxyz.mean;
but it returns the following error:
awk: (FILENAME=myfile FNR=7) fatal: not enough arguments to satisfy format string
`%s %s
'
^ ran out for this one
Any suggestions would be highly appreciated.
Will you ever have explicit zeroes or negative numbers in the log files? I'm going to assume not.
The first line of your second script doesn't do what you wanted:
awk 'BEGIN{arr[a]="a"; arr[b]="b"; arr[c]="c"; arr[d]="d"}
This assigns "a" to arr[0] (because a is a variable not previously used), then "b" to the same element (because b is a variable not previously used), then "c", then "d". Clearly, not what you had in mind. This (untested) code should do the job you need as long as you know that there are just the four groups. If you don't know the groups a priori, you need a more complex program (it can be done, but it is harder).
awk 'BEGIN { sum["a"] = 0; sum["b"] = 0; sum["c"] = 0; sum["d"] = 0 }
{ sum[$1] += $2; N[$1]++ }
END { for (i in sum) {
if (N[i] == 0) N[i] = 1 # Divide by zero protection
avg = sum[i] / N[i];
printf "%s %f\n" i, avg;
}
}' xxyz.log > xxyz.mean;
This will print a zero average for the missing groups. If you prefer, you can do:
awk 'BEGIN { sum["a"] = 0; sum["b"] = 0; sum["c"] = 0; sum["d"] = 0 }
{ sum[$1] += $2; N[$1]++ }
END { for (i in sum) {
if (N[i] == 0)
printf("%s\n", i;
else {
avg = sum[i] / N[i];
printf "%s %f\n" i, avg;
}
}
}' xxyz.log > xxyz.mean;
For each .log file I'd like to output a .mean file listing the entire
range of groups (a-d) in the first column and the corresponding mean
value or empty spaces in the second column depending on whether this
category is present in the .log file.
Not purely an awk solution, but you can get all the groups with this.
awk '{print $1}' *.log | sort -u > groups
After you calculate the means, you can then join the groups file. Let's say the means for your second input file look like this temporary, intermediate file. (I called it xyzz.tmp.)
a 4
c 122
Join the groups, preserving all the values from the groups file.
$ join -a1 groups xyzz.tmp > xyzz.mean
$ cat xyzz.mean
a 4
b
c 122
Here's my take on the problem. Run like:
./script.sh
Contents of script.sh:
array=($(awk '!a[$1]++ { print $1 }' *.log))
readarray -t sorted < <(for i in "${array[#]}"; do echo "$i"; done | sort)
for i in *.log; do
for j in "${sorted[#]}"; do
awk -v var=$j '
{
sum[$1]+=$2
cnt[$1]++
}
END {
print var, (var in cnt ? sum[var]/cnt[var] : "")
}
' "$i" >> "${i/.log/.main}"
done
done
Results of grep . *.main:
xxyz.main:a 5.5
xxyz.main:b 12.5
xxyz.main:c 100.5
xyzz.main:a 4
xyzz.main:b
xyzz.main:c 122
Here is a pure awk answer:
find . -maxdepth 1 -name "*.log" -print0 |
xargs -0 awk '{SUBSEP=" ";sum[FILENAME,$1]+=$2;cnt[FILENAME,$1]+=1;next}
END{for(i in sum)print i, sum[i], cnt[i], sum[i]/cnt[i]}'
Easy enough to push this into a file --