Using AWK to process CSV file with rows containing an array - awk

I'm trying to process a CSV file with AWK, but many cells in each row already contain commas, so I cannot separate the fields with awk -F,.
CSV FILE
Name,...DATE,COLUMNX,ADDRESSES
host1,...,NOV 24, 2022,['Element1', 'Element2'],"['192.168.x.99', 'fe80:XX','192.168.x.100', fe80:XX]"
host2,...,NOV 24, 2022,['Element3'],"['192.168.x.101', 'fe80:XX']"
The ... represents rows/columns containing [, ,, ', "
What I have tried:
awk -F, '{print $X}'
This gives me the following output:
'Element2']
"['192.168.x.101'
What I want to accomplish:
host1 192.168.x.99
host1 192.168.x.100
host2 192.168.x.101

I'd recommend a proper CSV parser to do the job, then use awk to do the regex, e.g.
$ ruby -r 'csv' -ne 'lines=$_
CSV.parse(lines) do |i|
i.each do |j|
printf("%s ", j)
end
puts ""
end' file |
awk '{gsub(/\[\047|\047\]|\047|\]|,/, "", $0)}
/^host/{for(i=1;i<=NF;i++){if($i~/^[0-9]+\.+/){print $1, $i}}}'
host1 192.168.x.99
host1 192.168.x.100
host2 192.168.x.101

Using awk:
awk -F",\"|\"$" 'NR>1 { \
gsub(/\047|[\[\]]/,""); \
split($2,a,", "); \
split($1,h,","); \
for (n in a) {if (a[n] ~ /^[0-9]/) printf "%s %s\n", h[1], a[n]}}' src.csv
Output:
host1 192.168.x.100
host1 192.168.x.99
host2 192.168.x.101
Details:
-F",\"|\"$" (split on ," or on " at the end of the record; this removes the trailing double quote and splits each record into two fields)
gsub(/\047|[\[\]]/,""); (sanitize by removing single quotes and brackets)
split($2,a,", "); (split the second field into array a on comma-space)
split($1,h,","); (split the first field into array h on comma)
for (n in a) {if (a[n] ~ /^[0-9]/) printf "%s %s\n", h[1], a[n]} (iterate over array a and print only the items that start with a digit; for (n in a) visits the elements in an unspecified order, which is why the output order can differ from the input order)
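If the original order of the addresses matters, or if some list items are separated by a bare comma (as between 'fe80:XX' and '192.168.x.100' in the question's first row), a variation along these lines may help. It is only a sketch: list_ipv4 is a made-up function name, and it assumes, like the command above, that ," occurs only in front of the last column.

```shell
# Sample input taken from the question (the "..." column is kept literally).
sample=$(cat <<'EOF'
Name,...DATE,COLUMNX,ADDRESSES
host1,...,NOV 24, 2022,['Element1', 'Element2'],"['192.168.x.99', 'fe80:XX','192.168.x.100', fe80:XX]"
host2,...,NOV 24, 2022,['Element3'],"['192.168.x.101', 'fe80:XX']"
EOF
)

list_ipv4() {
  awk -F',"|"$' 'NR > 1 {
    gsub(/\047|[\[\]]/, "")        # remove single quotes and square brackets
    split($1, h, ",")              # h[1] holds the host name
    n = split($2, a, /, */)        # split on a comma plus any following spaces
    for (i = 1; i <= n; i++)       # numeric loop keeps the input order
      if (a[i] ~ /^[0-9]/) print h[1], a[i]
  }'
}

printf '%s\n' "$sample" | list_ipv4
```

Looping from 1 to the count returned by split() is what keeps the addresses in input order.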

Modern versions of awk accept a regular expression as the field separator, so a record can be split at more than one character. Thus, each line can be split at both commas and single quote marks to isolate the data you need.
To use ' as a field separator along with , the former has to be escaped, and combining the two can be quite tricky. The simplest way I've found after a few trials is to use the -F switch on the command line with a bracket expression containing the escaped ' and ,. It's messy, as you have to close the first single quote before escaping the required one and then re-open the single-quoted command: -F'[,'\'']' (I generally prefer setting field separators within the awk procedure but this one defeated me).
This edited version works to isolate the field (change $35 to suit by trial and error):
awk -F'[,'\'']' 'NR>1{print $1" "$35}' data.csv
I tested the above on the following test file:
data.csv:
Name,...DATE,COLUMNX,ADDRESSES
host1,['El3', 'El6'],['El7', 'El12'],['El1', 'El2'],['El', 'E12'],NOV 24, 2022,['Element1', 'Element2'],"['192.168.x.99', 'fe80:XX','192.168.x.100', fe80:XX]"
host2,['El3', 'El6'],['El7', 'El12'],['El1', 'El2'],['El', 'E12'],NOV 24, 2022,['Element1', 'Element2'],"['192.168.xxx.yy', 'fe80:XX','192.168.x.100', fe80:XX]"
host3,['El3', 'El6'],['El7', 'El12'],['El1', 'El2'],['El', 'E12'],NOV 24, 2022,['Element1', 'Element2'],"['192.xxx.x.99', 'fe80:XX','192.168.x.100', fe80:XX]"
host4,['El3', 'El6'],['El7', 'El12'],['El1', 'El2'],['El', 'E12'],NOV 24, 2022,['Element1', 'Element2'],"['xxx.168.x.99', 'fe80:XX','192.168.x.100', fe80:XX]"
output:
host1 192.168.x.99
host2 192.168.xxx.yy
host3 192.xxx.x.99
host4 xxx.168.x.99
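For what it is worth, the separator can also be set inside the awk program by writing the single quote as the octal escape \047, which sidesteps the shell quoting. A minimal sketch, where pick_field is a hypothetical helper and the column number still has to be found by trial and error:

```shell
# One header line and one data row from the test file above.
sample=$(cat <<'EOF'
Name,...DATE,COLUMNX,ADDRESSES
host1,['El3', 'El6'],['El7', 'El12'],['El1', 'El2'],['El', 'E12'],NOV 24, 2022,['Element1', 'Element2'],"['192.168.x.99', 'fe80:XX','192.168.x.100', fe80:XX]"
EOF
)

# pick_field COL: print the first field and field COL of every data row.
pick_field() {
  awk -v col="$1" '
    BEGIN { FS = "[,\047]" }      # split on a comma or a single quote (\047)
    NR > 1 { print $1, $col }
  '
}

printf '%s\n' "$sample" | pick_field 35
```

With the sample row this should print host1 192.168.x.99, the same field the -F version selects.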

Using GNU awk for FPAT:
$ cat tst.awk
BEGIN {
FPAT = "([^,]*)|(\"[^\"]+\")|([[][^]]+])"
}
NR > 1 {
n = split($NF,a,/\047/)
for ( i=2; i<=n; i+=4 ) {
print $1, a[i]
}
}
$ awk -f tst.awk file
host1 192.168.x.99
host1 192.168.x.100
host2 192.168.x.101
To see how FPAT is splitting the input into comma-separated fields and then split() is splitting the last field into '-separated subfields, add some prints, e.g.:
$ cat tst.awk
BEGIN {
FPAT = "([^,]*)|(\"[^\"]+\")|([[][^]]+])"
}
NR > 1 {
print "============="
print
for ( i=1; i<=NF; i++ ) {
print i, "<" $i ">"
}
n = split($NF,a,/\047/)
for ( i=1; i<=n; i++ ) {
print "\t" NF "." i, "<" a[i] ">"
}
for ( i=2; i<=n; i+=4 ) {
print $1, a[i]
}
}
$ awk -f tst.awk file
=============
host1,...,NOV 24, 2022,['Element1', 'Element2'],"['192.168.x.99', 'fe80:XX','192.168.x.100', fe80:XX]"
1 <host1>
2 <...>
3 <NOV 24>
4 < 2022>
5 <['Element1', 'Element2']>
6 <"['192.168.x.99', 'fe80:XX','192.168.x.100', fe80:XX]">
6.1 <"[>
6.2 <192.168.x.99>
6.3 <, >
6.4 <fe80:XX>
6.5 <,>
6.6 <192.168.x.100>
6.7 <, fe80:XX]">
host1 192.168.x.99
host1 192.168.x.100
=============
host2,...,NOV 24, 2022,['Element3'],"['192.168.x.101', 'fe80:XX']"
1 <host2>
2 <...>
3 <NOV 24>
4 < 2022>
5 <['Element3']>
6 <"['192.168.x.101', 'fe80:XX']">
6.1 <"[>
6.2 <192.168.x.101>
6.3 <, >
6.4 <fe80:XX>
6.5 <]">
host2 192.168.x.101
See What's the most robust way to efficiently parse CSV using awk? for more information on FPAT and parsing CSVs with awk.

Related

How to replace all escape sequences with non-escaped equivalent with unix utilities (sed/tr/awk)

I'm processing a Wireshark config file (dfilter_buttons) for display filters and would like to print out the filter of a given name. The content of file is like:
Sample input
"TRUE","test","sip contains \x22Hello, world\x5cx22\x22",""
And the resulting output should have the escape sequences replaced, so I can use them later in my script:
Desired output
sip contains "Hello, world\x22"
My first pass is like this:
Current parser
filter_name=test
awk -v filter_name="$filter_name" 'BEGIN {FS="\",\""} ($2 == filter_name) {print $3}' "$config_file"
And my output is this:
Current output
sip contains \x22Hello, world\x5cx22\x22
I know I can handle these exact two escape sequences by piping to sed and matching those exact two sequences, but is there a generic way to substitute all escape sequences? Future filters I build may utilize more escape sequences than just " and , and I would like to handle future scenarios.
Using gnu-awk you can do this using the split, substr and strtonum functions:
awk -F '","' -v filt='test' '$2 == filt {n = split($3, subj, /\\x[0-9a-fA-F]{2}/, seps); for (i=1; i<n; ++i) printf "%s%c", subj[i], strtonum("0" substr(seps[i], 2)); print subj[i]}' file
sip contains "Hello, world\x22"
A more readable form:
awk -F '","' -v filt='test' '
$2 == filt {
n = split($3, subj, /\\x[0-9a-fA-F]{2}/, seps)
for (i=1; i<n; ++i)
printf "%s%c", subj[i], strtonum("0" substr(seps[i], 2))
print subj[i]
}' file
Explanation:
Using -F '","' we split input using delimiter ","
$2 == filt we filter input for $2 == "test" condition
Using /\\x[0-9a-fA-F]{2}/ as regex (that matches \x followed by 2 hex digits) we split $3 and save split tokens into array subj and matched separators into array seps
Using substr we remove the first char, i.e. \, and prepend 0, turning \x22 into 0x22
Using strtonum we convert that hex string to its numeric value
Using %c in printf we print the corresponding ASCII character
The for loop prints each piece of $3 followed by its decoded character, and the final print emits the last piece
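The four-argument split() and strtonum() are both gawk extensions. For an awk that lacks them, the same idea can be sketched with match() and a hand-written hex conversion. This is only an illustration: decode_hex_escapes and hexval are made-up names, and it assumes every escape is \x followed by exactly two hex digits.

```shell
decode_hex_escapes() {
  awk '
    # hexval: convert a string of hex digits to a number
    function hexval(s,    i, n) {
      s = tolower(s)
      n = 0
      for (i = 1; i <= length(s); i++)
        n = n * 16 + index("0123456789abcdef", substr(s, i, 1)) - 1
      return n
    }
    {
      rest = $0
      out = ""
      # repeatedly find the next \xHH, copy the text before it, append the decoded char
      while (match(rest, /\\x[0-9a-fA-F][0-9a-fA-F]/)) {
        out = out substr(rest, 1, RSTART - 1) sprintf("%c", hexval(substr(rest, RSTART + 2, 2)))
        rest = substr(rest, RSTART + RLENGTH)
      }
      print out rest
    }
  '
}

printf '%s\n' 'sip contains \x22Hello, world\x5cx22\x22' | decode_hex_escapes
```

Because decoded text is moved into out and never rescanned, the x22 that follows the decoded backslash stays literal, matching the desired output in the question.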
Using GNU awk for FPAT, gensub(), strtonum(), and the 3rd arg to match():
$ cat tst.awk
BEGIN { FPAT="([^,]*)|(\"[^\"]*\")"; OFS="," }
$2 == ("\"" filter_name "\"") {
gsub(/^"|"$/,"",$3)
while ( match($3,/(\\x[0-9a-fA-F]{2})(.*)/,a) ) {
printf "%s%c", substr($3,1,RSTART-1), strtonum(gensub(/./,0,1,a[1]))
$3 = a[2]
}
print $3
}
$ awk -v filter_name='test' -f tst.awk file
sip contains "Hello, world\x22"
The above assumes your escape sequences are always \x followed by exactly 2 hex digits. It isolates every \xHH string in the input, replaces \ with 0 in that string so that strtonum() can then convert the string to a number, then uses %c in the printf formatting string to convert that number to a character.
Note that GNU awk has a debugger (see https://www.gnu.org/software/gawk/manual/gawk.html#Debugger), so if you're ever not sure what any part of a program does you can run it in the debugger (-D) and trace it. In the following I set a breakpoint at line 1 of the script (b 1), start running (r), then step (s) through the script, printing the value of $3 (p $3) at each line to see how it changes after the gsub():
$ awk -D -v filter_name='test' -f tst.awk file
gawk> b 1
Breakpoint 1 set at file `tst.awk', line 1
gawk> r
Starting program:
Stopping in BEGIN ...
Breakpoint 1, main() at `tst.awk':1
1 BEGIN { FPAT="([^,]*)|(\"[^\"]*\")"; OFS="," }
gawk> p $3
$3 = uninitialized field
gawk> s
Stopping in Rule ...
2 $2 == "\"" filter_name "\"" {
gawk> p $3
$3 = "\"sip contains \\x22Hello, world\\x5cx22\\x22\""
gawk> s
3 gsub(/^"|"$/,"",$3)
gawk> p $3
$3 = "\"sip contains \\x22Hello, world\\x5cx22\\x22\""
gawk> s
4 while ( match($3,/(\\x[0-9a-fA-F]{2})(.*)/,a) ) {
gawk> p $3
$3 = "sip contains \\x22Hello, world\\x5cx22\\x22"

Awk column with pattern array

Is it possible to do this, but with an actual array of strings where it says "array"?
array=(cat
dog
mouse
fish
...)
awk -F "," '{ if ( $5!="array" ) { print $0; } }' file
I would like to use spaces in some of the strings in my array.
I would also like to be able to match partial matches, so "snow" in my array would match "snowman"
It should be case sensitive.
Example csv
s,dog,34
3,cat,4
1,african elephant,gd
A,African Elephant,33
H,snowman,8
8,indian elephant,3k
7,Fish,94
...
Example array
snow
dog
african elephant
Expected output
s,dog,34
H,snowman,8
1,african elephant,gd
Cyrus posted this, which works well, but it doesn't allow spaces in the array strings and won't match partial matches.
echo "${array[@]}" | awk 'FNR==NR{len=split($0,a," "); next} {for(i=1;i<=len;i++) {if(a[i]==$2){next}} print}' FS=',' - file
The brief approach using a single regexp for all array contents:
$ array=('snow' 'dog' 'african elephant')
$ printf '%s\n' "${array[@]}" | awk -F, 'NR==FNR{r=r s $0; s="|"; next} $2~r' - example.csv
s,dog,34
1,african elephant,gd
H,snowman,8
Or if you prefer string comparisons:
$ cat tst.sh
#!/usr/bin/env bash
array=('snow' 'dog' 'african elephant')
printf '%s\n' "${array[@]}" |
awk -F',' '
NR==FNR {
array[$0]
next
}
{
for (val in array) {
if ( index($2,val) ) { # or $2 ~ val for a regexp match
print
next
}
}
}
' - example.csv
$ ./tst.sh
s,dog,34
1,african elephant,gd
H,snowman,8
This skips every line of the csv file whose column 5 equals an element of the array and prints all other lines:
echo "${array[@]}" | awk 'FNR==NR{len=split($0,a," "); next} {for(i=1;i<=len;i++) {if(a[i]==$5){next}} print}' FS=',' - file

AWK: add a sequential number out of 4 digits

How do I get from the following string.ext
>Lipoprotein releasing system transmembrane protein LolC
MKWLWFAYQNVIRNRRRSLMTILIIAVGTAAILLSNGFALYTYDNLREGSALASGHVIIAHVDHFDKEEEIPMEYGLSDYEDIERHIAADDRVRMAIPRLQFSGLISNGDKSVIFMGTGVDPEGEFDIGGVLTNVLTGNTLSTHSAPDAVPEVMLAKDLAKQLHADIGGLLTLLATTADGALNALDVQVRGIFSTGVPEMDKRMLAVALPTAQELIMTDKVGTLSVYLHEIEQTDAMWAVLAEWYPNFATQPWWEQASFYFKVRALYDIIFGVMGVIILLIVFFTITNTLSMTIVERTRETGTLLALGTLPRQIMRNFALEALLIGLAGALLGMLIAGFTSITLFIAEIQMPPPPGSTEGYPLYIYFSPWLYGITSLLVVTLSIAAAFLTSRKAARKPIVEALAHV
>Phosphoserine phosphatase (EC 3.1.3.3)
MFQEHALTLAIFDLDNTLLAGDSDFLWGVFLVERGIVDGDEFERENERFYRAYQEGDLDIFEFLRFAFRPLRDNRLEDLKRWRQDFLREKIEPAILPMACELVEHHRAAGDTLLIITSTNEFVTAPIAEQLGIPNLIATVPEQLHGCYTGEAAGTPAFQAGKVKRLLDWLEETSTELAGSTFYSDSHNDIPLLEWVDHPVATDPDDRLRGYARDRGWPIISLREEIAP
to change the sequential number after string to a 4 digit number (starting with 0001) and separate that number with | from string, so that output is returned like:
>string|0001|Lipoprotein_releasing_system_transmembrane_protein_LolC
MKWLWFAYQNVIRNRRRSLMTILIIAVGTAAILLSNGFALYTYDNLREGSALASGHVIIAHVDHFDKEEEIPMEYGLSDYEDIERHIAADDRVRMAIPRLQFSGLISNGDKSVIFMGTGVDPEGEFDIGGVLTNVLTGNTLSTHSAPDAVPEVMLAKDLAKQLHADIGGLLTLLATTADGALNALDVQVRGIFSTGVPEMDKRMLAVALPTAQELIMTDKVGTLSVYLHEIEQTDAMWAVLAEWYPNFATQPWWEQASFYFKVRALYDIIFGVMGVIILLIVFFTITNTLSMTIVERTRETGTLLALGTLPRQIMRNFALEALLIGLAGALLGMLIAGFTSITLFIAEIQMPPPPGSTEGYPLYIYFSPWLYGITSLLVVTLSIAAAFLTSRKAARKPIVEALAHV
>string|0002|Phosphoserine_phosphatase_(EC_3_1_3_3)
MFQEHALTLAIFDLDNTLLAGDSDFLWGVFLVERGIVDGDEFERENERFYRAYQEGDLDIFEFLRFAFRPLRDNRLEDLKRWRQDFLREKIEPAILPMACELVEHHRAAGDTLLIITSTNEFVTAPIAEQLGIPNLIATVPEQLHGCYTGEAAGTPAFQAGKVKRLLDWLEETSTELAGSTFYSDSHNDIPLLEWVDHPVATDPDDRLRGYARDRGWPIISLREEIAP
The commands I have come up with so far are ($faa refers to the filename string.ext):
faa=$1
var=$(basename "$faa" .ext)
awk '!/^>/ { printf "%s", $0; n = "\n" } /^>/ { print n $0; n = "" } END { printf "%s", n }' $faa >$faa.tmp
sed 's/ /_/g' $faa.tmp >$faa.tmp2
awk -v var="$var" '/>/{sub(">","&"var"|");sub(/\.ext/,x)}1' $faa.tmp2 >$faa.tmp3
awk '/>/{sub(/\|/,++i"|")}1' $faa.tmp3 >$faa.tmp4
tr '\.' '_' <$faa.tmp4 | tr '\:' '_' | sed 's/__/_/g' >$faa.tmp5
Edit: I also want to change following characters to 1 underscore: / . :
I'd use perl here:
perl -pe '
next unless /^>/; # only transform the "header" lines
s/[\h.]/_/g; # change dots and horizontal whitespace
substr($_,1,0) = sprintf("string|%04d|", ++$n) # insert the counter
' file
$ awk '
FNR==1 {base=FILENAME; sub(/\.[^.]+$/,"",base) }
sub(/^>/,"") { gsub(/[\/ .:]+/,"_"); $0=sprintf(">%s|%04d|%s",base,++c,$0) }
1' string.ext
>string|0001|Lipoprotein_releasing_system_transmembrane_protein_LolC
MKWLWFAYQNVIRNRRRSLMTILIIAVGTAAILLSNGFALYTYDNLREGSALASGHVIIAHVDHFDKEEEIPMEYGLSDYEDIERHIAADDRVRMAIPRLQFSGLISNGDKSVIFMGTGVDPEGEFDIGGVLTNVLTGNTLSTHSAPDAVPEVMLAKDLAKQLHADIGGLLTLLATTADGALNALDVQVRGIFSTGVPEMDKRMLAVALPTAQELIMTDKVGTLSVYLHEIEQTDAMWAVLAEWYPNFATQPWWEQASFYFKVRALYDIIFGVMGVIILLIVFFTITNTLSMTIVERTRETGTLLALGTLPRQIMRNFALEALLIGLAGALLGMLIAGFTSITLFIAEIQMPPPPGSTEGYPLYIYFSPWLYGITSLLVVTLSIAAAFLTSRKAARKPIVEALAHV
>string|0002|Phosphoserine_phosphatase_(EC_3_1_3_3)
MFQEHALTLAIFDLDNTLLAGDSDFLWGVFLVERGIVDGDEFERENERFYRAYQEGDLDIFEFLRFAFRPLRDNRLEDLKRWRQDFLREKIEPAILPMACELVEHHRAAGDTLLIITSTNEFVTAPIAEQLGIPNLIATVPEQLHGCYTGEAAGTPAFQAGKVKRLLDWLEETSTELAGSTFYSDSHNDIPLLEWVDHPVATDPDDRLRGYARDRGWPIISLREEIAP
I'm assuming from your posted sample and code that you actually want every contiguous sequence of any combination of spaces, periods, forward slashes and/or colons converted to a single underscore.
In awk.
$ awk '/^>/{n=sprintf("%04d",++i);sub(/^>/,">string|" n "|")}1' file
>string|0001|Lipoprotein releasing system transmembrane protein LolC
MKWLWFAYQNVIRNRRRSLMTILIIAVGTAAILLSNGFALYTYDNLREGSALASGHVIIAHVDHFDKEEEIPMEYGLSDYEDIERHIAADDRVRMAIPRLQFSGLISNGDKSVIFMGTGVDPEGEFDIGGVLTNVLTGNTLSTHSAPDAVPEVMLAKDLAKQLHADIGGLLTLLATTADGALNALDVQVRGIFSTGVPEMDKRMLAVALPTAQELIMTDKVGTLSVYLHEIEQTDAMWAVLAEWYPNFATQPWWEQASFYFKVRALYDIIFGVMGVIILLIVFFTITNTLSMTIVERTRETGTLLALGTLPRQIMRNFALEALLIGLAGALLGMLIAGFTSITLFIAEIQMPPPPGSTEGYPLYIYFSPWLYGITSLLVVTLSIAAAFLTSRKAARKPIVEALAHV
>string|0002|Phosphoserine phosphatase (EC 3.1.3.3)
MFQEHALTLAIFDLDNTLLAGDSDFLWGVFLVERGIVDGDEFERENERFYRAYQEGDLDIFEFLRFAFRPLRDNRLEDLKRWRQDFLREKIEPAILPMACELVEHHRAAGDTLLIITSTNEFVTAPIAEQLGIPNLIATVPEQLHGCYTGEAAGTPAFQAGKVKRLLDWLEETSTELAGSTFYSDSHNDIPLLEWVDHPVATDPDDRLRGYARDRGWPIISLREEIAP
Explained:
$ awk '
/^>/ { # if string starts with >
n=sprintf("%04d",++i) # iterate i from 1 and zeropad
sub(/^>/,">string|" n "|") # replace the > with stuff
}1' file # implicit output
Don't include & in string (see comments).
awk -F'[ .]' 'BEGIN{OFS="_"} /^>/{sub(/^>/,""); $1=$1; $0=sprintf(">string|%04d|%s",++a,$0)} {print}' filename

missing field and extra space after using for loop in awk

I need to use an awk script to extract some information from a file.
I have a title line which has 11 fields and I split it into an array called titleList.
Student Number:Name:Lab1:Lab2:Lab3:Lab4:Lab5:Lab6:Exam1:Exam2:Final
After finding a matching line I need to print its fields, each preceded by its title. For example, if the result is:
92839342:Robert Bloomingdale:9:26:18:22:9:12:25:39:99
I must print it in this way:
Student Number:92839342 Name:Robert Bloomingdale Lab1:9 Lab2:26 Lab3:18
Lab4:22 Lab5:9 Lab6:12 Exam1:25 Exam2:39 Final:99
I use a for loop to manage it:
for (i=0 ;i<=NF ;i++)
{
printf "%s %s %s %s",titleList[i],":",$i," "
}
Everything looks good except the result, which has 2 problems:
first, there is an extra space between each title and value, and second, the last field of the matched line is missing
Student Number : 92839342 Name : Robert Bloomingdale Lab1 : 9 Lab2 : 26
Lab3:18 Lab4 : 22 Lab5 : 9 Lab6 : 12 Exam1 : 25 Exam2 : 39 Final
What should I do?
Is there a problem with the \n at the end of the search result?
You can correct the amount of extra whitespace between fields by correcting the printf statement:
awk -F ":" 'NR == 1 { split($0, array, FS) } NR >= 2 { for (i=1; i<=NF; i++) printf "%s:%s ", array[i], $i; printf "\n" }' file.txt
Contents of file.txt:
Student Number:Name:Lab1:Lab2:Lab3:Lab4:Lab5:Lab6:Exam1:Exam2:Final
92839342:Robert Bloomingdale:9:26:18:22:9:12:25:39:99
Results:
Student Number:92839342 Name:Robert Bloomingdale Lab1:9 Lab2:26 Lab3:18 Lab4:22 Lab5:9 Lab6:12 Exam1:25 Exam2:39 Final:99
EDIT:
Also, you're missing the last value because the file you're working with probably has Windows line endings. To fix this, run dos2unix file.txt before running your awk code. Alternatively, you can set awk's record separator so that it understands those line endings:
awk 'BEGIN { RS="\r\n"; FS=":" } NR == 1 { split($0, array, FS) } NR >= 2 { for (i=1; i<=NF; i++) printf "%s:%s ", array[i], $i; printf "\n" }' file.txt
EDIT:
The above requires GNU awk. split() splits on FS by default, so there is no need to pass it as an argument; it's common to use "next" rather than specifying opposite conditions; and it's common to use print "" instead of printf "\n" so that you use the ORS setting rather than hard-coding its value in output statements. So, the above should be tweaked to:
gawk 'BEGIN { RS="\r\n"; FS=":" } NR == 1 { split($0, array); next } { for (i=1; i<=NF; i++) printf "%s:%s ", array[i], $i; print "" }' file.txt
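Where GNU awk is not available, one possible alternative is to strip the trailing carriage return from each record instead of changing RS. The following is a sketch rather than a drop-in replacement: label_fields is a hypothetical name, and the three-column input is a shortened version of the sample file.

```shell
label_fields() {
  awk -F: '
    { sub(/\r$/, "") }                        # drop a trailing carriage return, if present
    NR == 1 { split($0, title); next }        # remember the titles from the first line
    {
      for (i = 1; i <= NF; i++)               # title:value pairs, space between, newline at end
        printf "%s:%s%s", title[i], $i, (i < NF ? " " : "\n")
    }
  '
}

# Shortened sample with Windows line endings.
printf 'Student Number:Name:Lab1\r\n92839342:Robert Bloomingdale:9\r\n' | label_fields
```

Modifying $0 with sub() makes awk re-split the record, so the last field no longer carries the carriage return. Choosing the separator inside the printf also avoids the trailing space at the end of each output line.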

awk output format for average

I am computing average of many values and printing it using awk using following script.
for j in `ls *.txt`; do
for i in emptyloop dd cp sleep10 gpid forkbomb gzip bzip2; do
echo -n $j $i" "; cat $j | grep $i | awk '{ sum+=$2} END {print sum/NR}'
done;
echo ""
done
The problem is that it prints the value in a form like 1.2345e+05, which I do not want; I want it to print rounded whole numbers, but I am unable to find where to pass the output format.
EDIT: using {print "average,%3d = ",sum/NR}' in place of {print sum/NR}' is not helping, because it prints "average,%3d 1.2345e+05".
You need printf instead of simply print. Print is a much simpler routine than printf is.
for j in *.txt; do
for i in emptyloop dd cp sleep10 gpid forkbomb gzip bzip2; do
awk -v "i=$i" -v "j=$j" '$0 ~ i {sum += $2; n++} END {if (n) printf "%s %s average %6d\n", j, i, sum/n}' "$j"
done
echo
done
You don't need ls - a glob will do.
Useless use of cat.
Quote all variables when they are expanded.
It's not necessary to use echo - AWK can do the job.
It's not necessary to use grep - AWK can do the job.
If you're getting numbers like 1.2345e+05 then %6d might be a better format string than %3d. Use printf in order to use format strings - print doesn't support them.
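The difference between the two statements can be seen in a small experiment. This is a sketch with a made-up value (1234567.8) and a hypothetical function name; print converts non-integer numbers using OFMT, which defaults to "%.6g", while printf uses whatever format you give it.

```shell
show_formats() {
  awk 'BEGIN {
    avg = 1234567.8
    print avg                # converted with OFMT, "%.6g" by default
    printf "%d\n", avg       # integer part only
    printf "%.2f\n", avg     # fixed-point with two decimals
  }'
}

show_formats
```

This should print 1.23457e+06, then 1234567, then 1234567.80.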
The following all-AWK script might do what you're looking for and be quite a bit faster. Without seeing your input data I've made a few assumptions, primarily that the command name being matched is in column 1.
awk '
BEGIN {
cmdstring = "emptyloop dd cp sleep10 gpid forkbomb gzip bzip2";
n = split(cmdstring, cmdarray);
for (i = 1; i <= n; i++) {
cmds[cmdarray[i]]
}
}
$1 in cmds {
sums[$1, FILENAME] += $2;
counts[$1, FILENAME]++
files[FILENAME]
}
END {
for (file in files) {
for (cmd in cmds) {
if ((cmd, file) in counts) # skip commands that never appeared in this file
printf "%s %s %6d\n", file, cmd, sums[cmd, file]/counts[cmd, file]
}
}
}' *.txt