Passing arrays in awk function - awk

I want to write a function that accepts two arguments: one a constant value and the other an array. The function finds the index of the element in the array and returns it. I want to call this function with multiple arrays, as in what I have tried below:
BEGIN{
a[1]=2;
a[2]=4;
a[3]=3;
b[1]=4;
b[2]=2;
b[3]=6;
c[1]=5;
c[2]=1;
c[3]=6;
arr[1]=a;
arr[2]=b;
arr[3]=c
}
function pos(val,ar[]) {
for (m=1;m<=length(ar);m++) { if (val == ar[m] )
return m;
else continue }
}
{for( k=1;k<=NF;k++) { for(l=1;l<=length(arr);l++) { print "pos=" pos($i,arr[l])} } }
but I am getting this error:
fatal: attempt to use array `a' in a scalar context
Looking at the code, can anyone tell me how I can achieve this in awk? The challenges here are assigning an array as an element of another array, as in arr[1]=a, and passing an array as a parameter by referencing it through its index, as in pos($i,arr[l]). I don't know how to make these statements syntactically and functionally correct in awk.
The input is:
2 4 6
3 5 6
1 2 5
and in the output the code should print the position of each value read from the file, if it is present in any of the defined arrays
output:
1 1 3
6
2 1
In the first line of the output, the indexes of the corresponding elements in arrays a, b, and c have been returned respectively: 1 is the index of 2 in a, 1 is the index of 4 in b, and 3 is the index of 6 in c, and so on for the remaining lines of the input file.

I truly don't understand what it is you're trying to do (especially why an input of 2 produces the index from a but not the index from b while an input of 4 does the reverse) but to create a multi-dimensional array arr[][] from a[], b[], and c[] with GNU awk (the only awk that supports true multi-dimensional arrays) would be:
for (i in a) arr[1][i] = a[i]
for (i in b) arr[2][i] = b[i]
for (i in c) arr[3][i] = c[i]
not just arr[1] = a, etc. Note that you're storing a copy of the contents of a[] in arr[1][], not a reference to a[], so if a[] changes then arr[1][] won't. What you might want to do instead (again GNU awk only) is store the sub-array names in arr[] and then access them through the builtin variable SYMTAB (see the man page), e.g.:
$ cat tst.awk
BEGIN{
split("2 4 3",a)
split("4 2 6",b)
split("5 1 6",c)
arr[1] = "a"
arr[2] = "b"
arr[3] = "c"
prtArr(arr)
}
function prtArr(arr, i,subArrName) {
for (i=1; i in arr; i++) {
subArrName = arr[i]
printf "arr[%d] -> %s[] =\n", i, subArrName
prtSubArr(SYMTAB[subArrName])
}
}
function prtSubArr(subArr, j) {
for (j=1; j in subArr; j++) {
print "\t" subArr[j]
}
}
$ awk -f tst.awk
arr[1] -> a[] =
        2
        4
        3
arr[2] -> b[] =
        4
        2
        6
arr[3] -> c[] =
        5
        1
        6
Now arr[] is no longer a multi-dimensional array, it's just an array of array name strings, and the contents of a[] are only stored in 1 place (in a[]) and just referenced from SYMTAB[] indexed by the contents of arr[N] rather than copied into arr[N][].

Related

awk does not get multiple matches in a line with match

AWK has the match(s, r [, a]) function which, according to the manual, is capable of recording all occurring patterns into array "a":
...If array a is provided, a is cleared and then elements 1 through n are filled with the portions of s that match the corresponding parenthesized subexpression in r. The 0'th element of a contains the portion of s matched by the entire regular expression r. Subscripts a[n, "start"], and a[n, "length"] provide the starting index in the string and length respectively, of EACH matching substring.
I expect that the following line:
echo 123412341234 | awk '{match($0,"1",arr); print arr[0] arr[1] arr[2]}'
prints 111
But in fact "match" ignores all other matches except the first one.
Could someone please tell me what is the proper syntax here to populate "arr" with all occurrences of "1"?
match only finds the first match and stops there. You will have to run match in a loop, or else use this approach, where we split the input on anything that is not 1:
echo '123412341234' | awk -F '[^1]+' '{print $1 $2 $3}'
111
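The match-in-a-loop route mentioned above can be sketched in POSIX awk, using RSTART and RLENGTH to step past each hit:

```shell
echo '123412341234' |
awk '{ s = $0; n = 0
       while (match(s, /1/)) {                 # next match in the remainder
           arr[++n] = substr(s, RSTART, RLENGTH)
           s = substr(s, RSTART + RLENGTH)     # drop through end of the match
       }
       print arr[1] arr[2] arr[3] }'           # prints 111
```

This works in any awk, since it relies only on the two-argument form of match.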
Or using split in gnu-awk:
echo '123412341234' | awk 'split($0, a, /1/, m) {print m[1] m[2] m[3]}'
111
I would harness the GNU AWK patsplit function for that task in the following way. Let file.txt content be
123412341234
then
awk '{patsplit($0,arr,"1");print arr[1] arr[2] arr[3]}' file.txt
gives output
111
Explanation: patsplit is a function that gives an effect similar to using the FPAT variable: it puts all matches of the 3rd argument, found in the string given as the 1st argument, into the array given as the 2nd argument (clearing it first if it is not empty). Observe that the 1st match goes under key 1, the 2nd under key 2, the 3rd under key 3, and so on (there is nothing under 0).
(tested in GNU Awk 5.0.1)
If sub is allowed, then you can do a substitution here. Try the following awk code:
awk '{gsub(/[^1]+/,"")} 1' Input_file
patsplit() is basically the same as wrapping the desired regex pattern with a custom pair of SEPs before splitting, which is what anysplit() emulates here, while being UTF-8 friendly.
echo "123\uC350abc:\uF8FF:|\U1F921#xyz" |
mawk2x '{ print ("\t\f"($0)"\n")>>(STDERR)
anysplit($_, reFLEX_UCode8 "|[[-_!-/3-?]",___=2,__)
OFS="\t"
for(_ in __) { if (!(_%___)) {
printf(" matched_items[ %2d ] = # %-2d = \42%s\42\n",
_,_/___,__[_])
} } } END { printf(ORS) }'
123썐abc::|🤡#xyz
matched_items[ 2 ] = # 1 = "3썐"
matched_items[ 4 ] = # 2 = "::"
matched_items[ 6 ] = # 3 = "🤡#"
In the background, anysplit() is not all that complicated either:
xs3pFS is a 3-byte string of \301\032\365 that I assumed would be extremely rare to show up even in binary data.
gsub(patRE, xs3pFS ((pat=="&")?"\\":"") "&" xs3pFS,_)
gsub(xs3pFS "("xs3pFS")+", "",_)
return split(_, ar8, xs3pFS)
By splitting the input string in this manner, all the desired items end up at even-numbered array indices, while the rest of the string is distributed across the odd-numbered indices.
This is somewhat similar to the 2nd array (the 4th argument) of gawk's split() and patsplit(), which holds the seps; the difference is that here both the matches and the seps, whichever way you want to see them, live in the same array.
When you print out every cell in the array, you'll see :
_SEPS_[ 1 ] = # 1 = "123"
matched_items[ 2 ] = # 1 = "썐"
_SEPS_[ 3 ] = # 2 = "abc"
matched_items[ 4 ] = # 2 = "::"
_SEPS_[ 5 ] = # 3 = "|"
matched_items[ 6 ] = # 3 = "🤡#"
_SEPS_[ 7 ] = # 4 = "xyz"

How to detect the last line in awk before END?

I'm trying to accumulate String values and print them, but if the last lines are Strings and there is no change of type, then the accumulated total is never printed:
input.txt:
String 1
String 2
Number 5
Number 2
String 3
String 3
awk:
awk '
BEGIN { tot=0; ant_t=""; }
{
t = $1; val=$2;
#if string, concatenate its value
if (t == "String") {
tot+=val;
nx=1;
} else {
nx=0;
}
#if type change, add tot to res
if (t != "String" && ant_t == "String") {
res=res tot;
tot=0;
}
ant_t=t;
#if string, go next
if (nx == 1) {
next;
}
res=res"\n"val;
}
END { print res; }' input.txt
Current output:
3
5
2
Expected output:
3
5
2
6
How can I detect whether awk is reading the last line, so that the total is still printed when there is no final change of type?
awk reads line by line, hence it cannot determine whether it is reading the last line. The END block can be used to perform actions once the end of the file has been reached.
To perform what you expect
awk '/String/{sum+=$2} /Number/{if(sum) print sum; sum=0; print $2} END{if(sum) print sum}'
will produce output as
3
5
2
6
What it does:
/String/ selects lines that match String; likewise /Number/ selects lines that match Number
sum+=$2 accumulates the values of the String lines. When a Number line occurs, print the sum and reset it to zero
Like this maybe:
awk -v lines="$(wc -l < /etc/hosts)" 'NR==lines{print "LAST"};1' /etc/hosts
I am pre-calculating the number of lines (using wc) and passing that into awk as a variable called lines, if that is unclear.
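Another way to know the line count without wc is to pass the file twice: the first pass records the total, the second compares FNR against it. A sketch, assuming a regular file that can be read twice (not a pipe):

```shell
printf 'a\nb\nc\n' > input.txt        # sample 3-line file
awk 'NR == FNR { last = NR; next }    # pass 1: remember the line count
     FNR == last { print "LAST" }     # pass 2: flag the final line
     1' input.txt input.txt           # prints a, b, LAST, c
```

This trades a second read of the file for not needing any external command.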
Just change the last line to:
END { print res; print tot;}'
awk '$1~"String"{x+=$2;y=1}$1~"Number"{if (y){print x;x=0;y=0;}print $2}END{if(y) print x}' file
Explanation
y is used as a boolean; at the END I check whether the last pattern was a String and, if so, print the sum.
You can actually use x as the boolean, as nu11p01n73R does, which is smarter.
Test
$ cat file
String 1
String 2
Number 5
Number 2
String 3
String 3
$ awk '$1~"String"{x+=$2;y=1}$1~"Number"{if (y){print x;x=0;y=0;}print $2}END{if(y) print x}' file
3
5
2
6

Declaring all elements of an associative array in a single statement - AWK

I am fairly new to awk and am trying to figure out how to declare all the elements of an associative array in one go. For example, to declare an associative array in Python (which is effectively a dictionary) I would do this:
numbers = {'uno': 1, 'sero': 0}
Now, in awk is it possible to convert the two lines of code below into one?
numbers["uno"] = 1
numbers["sero"] = 0
AWK doesn't have array literals as far as I know, but this script demonstrates something you can do to get close:
BEGIN {
split("uno|1|sero|0",a,"|");
for (i = 1; i < 4; i += 2) {b[a[i]] = a[i+1];}
}
END {
print b["sero"];
print b["uno"];
}
Of course, you can always make a function that could be called like
newarray("uno", 1, "sero", 0);
or like
newarray("uno|1|sero|0");
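The second form might be sketched as follows. newarray is a hypothetical helper, not a built-in; awk functions cannot return arrays, but they can fill in an array passed as a parameter (a, i, and n after the extra spaces are local temporaries, per the usual awk convention):

```shell
awk '
function newarray(arr, s,   a, i, n) {   # fill arr from "key|val|key|val..."
    n = split(s, a, "|")
    for (i = 1; i < n; i += 2)
        arr[a[i]] = a[i + 1]
}
BEGIN {
    newarray(numbers, "uno|1|sero|0")
    print numbers["uno"], numbers["sero"]   # prints: 1 0
}'
```

The caller supplies the target array as the first argument, which is as close to an array literal as portable awk gets.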
No. Best you can do is:
$ awk 'BEGIN {
# populate the "numbers" array:
split("uno:1,sero:0",a,/[:,]/)
for (i=1;i in a;i+=2)
numbers[a[i]] = a[i+1]
# print the "numbers" array:
for (i in numbers)
print i, numbers[i]
}'
uno 1
sero 0

calculating the mean of columns in text files

I have two folders named f1 and f2. These folders contain 300 text files with 2 columns. The content of the files is shown below. I would like to calculate the mean of the second column. The file names are the same in both folders.
file1 in f1 folder
54 6
55 10
57 5
file2 in f1 folder
24 8
28 12
file1 in f2 folder
34 3
22 8
file2 in f2 folder
24 8
28 13
output
folder1 folder2
file1 21/3= 7 11/2=5.5
file2 20/2=10 21/2=10.5
-- -- --
-- -- --
file300 -- --
total mean of folder1 = sum of the means/300
total mean of folder2 = sum of the means/300
I'd do it with two awk scripts. (Originally, I had a sort phase in the middle, but that isn't actually necessary. However, I think that two scripts is probably easier than trying to combine them into one. If someone else does it 'all in one' and it is comprehensible, then choose their solution instead.)
Sample run and output
This is based on the 4 files shown in the question. The names of the files are listed on the command line, but the order doesn't matter. The code assumes that there is only one slash in the file names, and no spaces and the like in the file names.
$ awk -f summary1.awk f?/* | awk -f summary2.awk
file1 21/3 = 7.000 11/2 = 5.500
file2 20/2 = 10.000 21/2 = 10.500
total mean of f1 = 17/2 = 8.500
total mean of f2 = 16/2 = 8.000
summary1.awk
function print_data(file, sum, count) {
sub("/", " ", file);
print file, sum, count;
}
oldfile != FILENAME { if (count > 0) { print_data(oldfile, sum, count); }
count = 0; sum = 0; oldfile = FILENAME
}
{ count++; sum += $2 }
END { print_data(oldfile, sum, count) }
This processes each file in turn, summing the values in column 2 and counting the number of lines. It prints out the folder name, the file name, the sum and the count.
summary2.awk
{
sum[$2,$1] = $3
cnt[$2,$1] = $4
if (file[$2]++ == 0) file_list[n1++] = $2
if (fold[$1]++ == 0) fold_list[n2++] = $1
}
END { for (i = 0; i < n1; i++)
{
printf("%-20s", file_list[i])
name = file_list[i]
for (j = 0; j < n2; j++)
{
folder = fold_list[j]
s = sum[name,folder]
n = cnt[name,folder]
a = (s + 0.0) / n
printf(" %6d/%-3d = %10.3f", s, n, a)
gsum[folder] += a
}
printf("\n")
}
for (i = 0; i < n2; i++)
{
folder = fold_list[i]
s = gsum[folder]
n = n1;
a = (s + 0.0) / n
printf("total mean of %-6s = %6d/%-3d = %10.3f\n", folder, s, n, a)
}
}
The file associative array tracks references to file names. The file_list array keeps the file names in the order that they're read. Similarly, the fold associative array tracks the folder names, and the fold_list array keeps track of the folder names in the order that they appear. If you do something weird enough with the order that you supply the names to the first command, you may need to insert a sort command between the two awk commands, such as sort -k2,2 -k1,1.
The sum associative array contains the sum for a given file name and folder name. The cnt associative array contains the count for a given file name and folder name.
The END section of the report has two main loops (though the first loop contains a nested loop). The first main loop processes the files in the order presented, generating one line containing one entry for each folder. It also accumulates the averages for the folder name.
The second main loop generates the 'total mean' data for each folder. I'm not sure whether the statistics makes sense (shouldn't the overall mean for folder1 be the sum of the values in folder1 divided by the number of entries, or 41/5 = 8.2 rather than 17/2 or 8.5?), but the calculation does what I think the question asks for (sum of means / number of files, written as 300 in the question).
With some help from grep:
grep '[0-9]' folder[12]/* | awk '
{
split($0,b,":");
f=b[1]; split(f,c,"/"); d=c[1]; f=c[2];
s[f][d]+=$2; n[f][d]++; nn[d]++;}
END{
for (f in s) {
printf("%-10s", f);
for (d in s[f]) {
a=s[f][d] / n[f][d];
printf(" %6.2f ", a);
p[d] += a;
}
printf("\n");
}
for (d in p) {
printf("total mean %-8s = %8.2f\n", d, p[d]/nn[d]);
}
}'
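The s[f][d] arrays above need GNU awk's true arrays of arrays. A hedged POSIX sketch of the same accumulation fakes the two dimensions with SUBSEP-joined keys instead (sample data inlined here in place of the grep output; the sort only makes the for-in output order deterministic):

```shell
printf 'f1/file1:54 6\nf1/file1:55 10\nf2/file1:34 3\n' |
awk '{ split($1, b, ":"); split(b[1], c, "/")
       d = c[1]; f = c[2]                      # folder and file name
       sum[f SUBSEP d] += $2; cnt[f SUBSEP d]++ }
     END { for (k in sum) {
               split(k, p, SUBSEP)             # recover the two key parts
               printf "%s %s %.2f\n", p[1], p[2], sum[k]/cnt[k] } }' | sort
```

For the sample above this prints `file1 f1 8.00` and `file1 f2 3.00`, one mean per file/folder pair.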

array over non-existing indices in awk

Sorry for the verbose question, it boils down to a very simple problem.
Assume there are n text files, each containing one column of strings (denoting groups) and one column of integers (the values of instances within these groups):
# filename xxyz.log
a 5
a 6
b 10
b 15
c 101
c 100
#filename xyzz.log
a 3
a 5
c 116
c 128
Note that while the length of both columns within any given file is always identical, it differs between files. Furthermore, not all files contain the same range of groups (the first one contains groups a, b, c, while the second one only contains groups a and c). In awk one could calculate the average of column 2 for each string in column 1 within each file separately and output the results with the following code:
NAMES=$(ls|grep .log|awk -F'.' '{print $1}');
for q in $NAMES;
do
gawk -F' ' -v y=$q 'BEGIN {print "param", y}
{sum1[$1] += $2; N[$1]++}
END {for (key in sum1) {
avg1 = sum1[key] / N[key];
printf "%s %f\n", key, avg1;
} }' $q.log | sort > $q.mean;
done;
However, for the above-mentioned reasons, the length of the resulting .mean files differs between files. For each .log file I'd like to output a .mean file listing the entire range of groups (a-d) in the first column and, in the second column, the corresponding mean value or an empty space, depending on whether the group is present in the .log file. I've tried the following code (shown without the $NAMES loop for brevity):
awk 'BEGIN{arr[a]="a"; arr[b]="b"; arr[c]="c"; arr[d]="d"}
{sum[$1] += $2; N[$1]++}
END {for (i in arr) {
if (i in sum) {
avg = sum[i] / N[i];
printf "%s %f\n" i, avg;}
else {
printf "%s %s\n" i, "";}
}}' xxyz.log > xxyz.mean;
but it returns the following error:
awk: (FILENAME=myfile FNR=7) fatal: not enough arguments to satisfy format string
`%s %s
'
^ ran out for this one
Any suggestions would be highly appreciated.
Will you ever have explicit zeroes or negative numbers in the log files? I'm going to assume not.
The first line of your second script doesn't do what you wanted:
awk 'BEGIN{arr[a]="a"; arr[b]="b"; arr[c]="c"; arr[d]="d"}
This assigns "a" to arr[""] (because a is an uninitialized variable, which evaluates to the empty string when used as a subscript), then "b" to the same element (because b is also uninitialized), then "c", then "d". Clearly, not what you had in mind. This (untested) code should do the job you need as long as you know that there are just the four groups. If you don't know the groups a priori, you need a more complex program (it can be done, but it is harder).
awk 'BEGIN { sum["a"] = 0; sum["b"] = 0; sum["c"] = 0; sum["d"] = 0 }
{ sum[$1] += $2; N[$1]++ }
END { for (i in sum) {
if (N[i] == 0) N[i] = 1 # Divide by zero protection
avg = sum[i] / N[i];
printf "%s %f\n", i, avg;
}
}' xxyz.log > xxyz.mean;
This will print a zero average for the missing groups. If you prefer, you can do:
awk 'BEGIN { sum["a"] = 0; sum["b"] = 0; sum["c"] = 0; sum["d"] = 0 }
{ sum[$1] += $2; N[$1]++ }
END { for (i in sum) {
if (N[i] == 0)
printf("%s\n", i)
else {
avg = sum[i] / N[i];
printf "%s %f\n", i, avg;
}
}
}' xxyz.log > xxyz.mean;
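For the "groups not known a priori" case, one hedged two-pass sketch: the first pass over the concatenation of every .log collects the full group range, the second pass averages a single file (file names xxyz.log and xyzz.log taken from the question; the sort only fixes the for-in output order):

```shell
cat xxyz.log xyzz.log > all.tmp           # pass 1 input: all group names
awk 'NR == FNR { groups[$1]; next }       # collect the full group range
     { sum[$1] += $2; cnt[$1]++ }         # pass 2: sum one .log
     END { for (g in groups)
               if (g in cnt) printf "%s %f\n", g, sum[g]/cnt[g]
               else          print g
     }' all.tmp xyzz.log | sort
```

For xyzz.log this prints a 4.000000 and c 122.000000, plus b alone on its line for the missing group, matching the desired .mean layout.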
For each .log file I'd like to output a .mean file listing the entire
range of groups (a-d) in the first column and the corresponding mean
value or empty spaces in the second column depending on whether this
category is present in the .log file.
Not purely an awk solution, but you can get all the groups with this.
awk '{print $1}' *.log | sort -u > groups
After you calculate the means, you can then join the groups file. Let's say the means for your second input file look like this temporary, intermediate file. (I called it xyzz.tmp.)
a 4
c 122
Join the groups, preserving all the values from the groups file.
$ join -a1 groups xyzz.tmp > xyzz.mean
$ cat xyzz.mean
a 4
b
c 122
Here's my take on the problem. Run like:
./script.sh
Contents of script.sh:
array=($(awk '!a[$1]++ { print $1 }' *.log))
readarray -t sorted < <(for i in "${array[@]}"; do echo "$i"; done | sort)
for i in *.log; do
for j in "${sorted[@]}"; do
awk -v var=$j '
{
sum[$1]+=$2
cnt[$1]++
}
END {
print var, (var in cnt ? sum[var]/cnt[var] : "")
}
' "$i" >> "${i/.log/.main}"
done
done
Results of grep . *.main:
xxyz.main:a 5.5
xxyz.main:b 12.5
xxyz.main:c 100.5
xyzz.main:a 4
xyzz.main:b
xyzz.main:c 122
Here is a pure awk answer:
find . -maxdepth 1 -name "*.log" -print0 |
xargs -0 awk '{SUBSEP=" ";sum[FILENAME,$1]+=$2;cnt[FILENAME,$1]+=1;next}
END{for(i in sum)print i, sum[i], cnt[i], sum[i]/cnt[i]}'
Easy enough to push this into a file --