Awk: for-loop with array of numbers

How do I use an array of numbers in a for loop with awk?
I tried:
awk '{ for (i in [10, 12, 18]) print i }' myfile.txt
But I'm getting a syntax error.

The in operator works on arrays. The way to create an array from a list of numbers like 10 12 18 is to split() a string containing those numbers.
To have those numbers stored as values in an array a[] with indices 1 to 3:
awk 'BEGIN{FS=OFS="|"; split("10 12 18",a," ")}
(FNR>2) { for(j in a) { i=a[j]; k=$i OFS $(i+1); c[k]++; d[k] = i } }
END{for (k in c) print d[k],k,c[k] }' myfile.txt
To have those numbers stored as indices of an array b[] with all values 0-or-null (same as an uninitialized scalar variable):
awk 'BEGIN{FS=OFS="|"; split("10 12 18",a," "); for (j in a) b[a[j]]}
(FNR>2) { for(i in b) { k=$i OFS $(i+1); c[k]++; d[k] = i } }
END{for (k in c) print d[k],k,c[k] }' myfile.txt
If you didn't want to create the array once up front for some reason (e.g. the list of numbers you want to split is created dynamically) then you could create it every time you need it, e.g.:
awk 'BEGIN{FS=OFS="|"}
(FNR>2) { split("10 12 18",a," "); for(j in a) { i=a[j]; k=$i OFS $(i+1); c[k]++; d[k] = i } }
END{for (k in c) print d[k],k,c[k] }' myfile.txt
but obviously creating the same array multiple times is less efficient than creating it once.
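Distilled down, without the question-specific field handling, the pattern is just split-then-loop (a minimal sketch; note that `for (j in a)` visits indices in unspecified order, so a counted loop is used here where order matters):

```shell
# split() returns the element count, so we can loop 1..n in order
awk 'BEGIN {
  n = split("10 12 18", a, " ")     # a[1]="10", a[2]="12", a[3]="18"
  for (j = 1; j <= n; j++) print a[j]
}'
```

which prints 10, 12 and 18 on separate lines, in that order.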

I made a rough emulation of a for-loop that takes a list directly, without needing a separate function call beforehand to initialize it.
It tries to be as flexible as possible about what the delimiter(s) might be, so
foreach("[CA=MX=JP=FR=SG=AUS=N.Z.]")
would also work.
Despite being shown with the gawk profile below, and using the PROCINFO array, you don't need gawk for it to work: it's functional on mawk 1.3.4, mawk 1.9.9.6, GNU gawk 5.1.1, and macOS.
I just added a Unicode UTF-8 feature, which works regardless of your locale setting and whether you're using gawk, mawk, or nawk; emojis work fine too.
That said, it cannot properly parse CSV, XML, or JSON inputs (I didn't have the time to make it that fancy).
list 1 :: 10
list 1 :: 12
list 1 :: 18.
list 1 :: 27
list 1 :: 36
list 1 :: pi
list 1 :: hubble
list 1 :: kelvins
list 1 :: euler
list 1 :: higgs
list 1 :: 9.6
list 2 :: CA
list 2 :: MX
list 2 :: JP
list 2 :: FR
list 2 :: SG
list 2 :: AUS
list 2 :: N.Z.
# gawk profile, created Mon May 9 22:06:03 2022
# BEGIN rule(s)
BEGIN {
11 while (i = foreach("[10, 12, 18., 27, 36, pi, hubble, kelvins, euler, higgs, 9.6]")) {
11 print "list 1 :: ", i
}
1 printf ("\n\n\n")
7 while (i = foreach("[CA, MX, JP, FR, SG, AUS, N.Z., ]")) {
7 print "list 2 :: ", i
}
}
# Functions, listed alphabetically
20 function foreach(_, __)
{
20 if (_=="") {
return \
PROCINFO["foreach", "_"] = \
PROCINFO["foreach", "#"] = _
}
20 __ = "\032\016\024"
20 if (_!= PROCINFO["foreach", "_"]) { # 2
2
PROCINFO["foreach","_"]=_
2 gsub("^[ \t]*[[<{(][ \t]*"\
"|[ \t]*[]>})][ \t]*$"\
"|\\300|\\301","",_)
gsub("[^"(\
"\333\222"~"[^\333\222]"\
? "\\200-\\277"\
"\\302-\\364"\
: "" \
)"[:alnum:]"\
\
"\302\200""""-\337\277" \
"\340\240\200-\355\237\277" \
"\356\200\200-\357\277\277" \
"\360\220\200\200-\364\217\277\277"\
\
".\42\47#$&%+-]+",__,_)
gsub("^"(__)"|"\
(__)"$","", _)
2 PROCINFO["foreach","#"]=_
}
20 if ((_=PROCINFO["foreach","#"])=="") { # 2
2 return _
}
18 sub((__) ".*$", "", _)
sub("^[^"(__)"]+("(__)")?","",PROCINFO["foreach","#"])
18 return _
}
list 2 :: CA
list 2 :: MX
list 2 :: JP
list 2 :: FR
list 2 :: SG
list 2 :: 눷
list 2 :: 🤡
list 2 :: N.Z.
while(i = foreach("[CA=MX=JP=FR=SG=\353\210\267"\
"=\360\237\244\241=N.Z.]")) {
print "list 2 :: ", i
}

Related

awk : awk script to group by column with condition

I have a tab-delimited file like the following, and I am trying to write an awk script:
aaa_log-000592 2 p STARTED 7027691 21.7 a1
aaa_log-000592 28 r STARTED 7027815 21.7 a2
aaa_log-000592 33 p STARTED 7032607 21.7 a3
aaa_log-000592 33 r STARTED 7032607 21.7 a4
aaa_log-000592 43 p STARTED 7025709 21.7 a5
aaa_log-000592 43 r STARTED 7025709 21.7 a6
aaa_log-000595 2 r STARTED 7027691 21.7 a7
aaa_log-000598 28 p STARTED 7027815 21.7 a8
aaa_log-000599 13 p STARTED 7033090 21.7 a9
I am trying to count the values in the 3rd column (p or r), grouped by column 1.
The output would be like:
Col1 Count-P Count-R
aaa_log-000592 3 3
aaa_log-000595 0 1
aaa_log-000598 1 0
aaa_log-000599 1 0
I can't find an example that combines an IF condition with group-by in awk.
awk (more specifically, the GNU variant, gawk) has multi-dimensional arrays that can be indexed using input values (including character strings, as in your example). As such, you can count the values the way you want by doing:
{
values[$3] = 1 # this line records the values in column three
counts[$1][$3]++ # and this line counts their frequency
}
The first line isn't strictly required, but it simplifies generating the output.
The only remaining part is to have an END clause that outputs the tabulated results.
END {
# Print column headings
printf "Col1 "
for (v in values) {
printf " Count-%s", v
}
printf "\n"
# Print tabulated results
for (i in counts) {
printf "%-20s", i
for (v in values) {
printf " %d", counts[i][v]
}
printf "\n"
}
}
Generating the values array handles the case when the values of column three may not be known (e.g., like when there's an error in your input).
If you're using a different awk implementation (like the one you might find in macOS, for example), array indexing is different (arrays are single-dimensional, but can be indexed by a comma-separated list of indices). This adds some complexity, but the idea is the same.
{
files[$1] = 1
values[$3] = 1
counts[$1,$3]++
}
END {
# Print column headings
printf "Col1 "
for (v in values) {
printf " Count-%s", v
}
printf "\n"
# Print tabulated results
for (f in files) {
printf "%-20s", f
for (v in values) {
printf " %d", counts[f,v]
}
printf "\n"
}
}
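To see the portable version end to end, it can be run on a couple of lines from the sample (a sketch; the column order is fixed with a split() list, instead of `for (v in values)`, purely so the output is deterministic):

```shell
printf '%s\n' \
  'aaa_log-000595 2 r STARTED 7027691 21.7 a7' \
  'aaa_log-000598 28 p STARTED 7027815 21.7 a8' |
awk '{ files[$1] = 1; counts[$1,$3]++ }
END {
  n = split("p r", vs, " ")        # fixed column order
  for (f in files) {
    printf "%s", f
    for (i = 1; i <= n; i++) printf " %s=%d", vs[i], counts[f,vs[i]]
    print ""
  }
}' | LC_ALL=C sort
```

which prints one line per file with its p and r counts (missing combinations print as 0 because %d coerces an empty array element to zero).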

Concatenating array elements into a one string in for loop using awk

I am working on a variant calling format (vcf) file, and I tried to show you guys what I am trying to do:
Input:
1 877803 838425 GC G
1 878077 966631 C CCACGG
Output:
1 877803 838425 C -
1 878077 966631 - CACGG
In summary, I am trying to delete the first letters of longer strings.
And here is my code:
awk 'BEGIN { OFS="\t" } /#/ {next}
{
m = split($4, a, //)
n = split($5, b, //)
x = "-"
delete y
if (m>n){
for (i = n+1; i <= m; i++) {
y = sprintf("%s", a[i])
}
print $1, $2, $3, y, x
}
else if (n>m){
for (j = m+1; i <= n; i++) {
y = sprintf("%s", b[j]) ## Problem here
}
print $1, $2, $3, x, y
}
}' input.vcf > output.vcf
But,
I am getting the following error at line 15, not even at line 9:
awk: cmd. line:15: (FILENAME=input.vcf FNR=1) fatal: attempt to use array y in a scalar context
I don't know how to concatenate array elements into a one string using awk.
I will be very happy if you guys help me.
Merry X-Mas!
You may try this awk:
awk -v OFS="\t" 'function trim(s) { return (length(s) == 1 ? "-" : substr(s, 2)); } {$4 = trim($4); $5 = trim($5)} 1' file
1 877803 838425 C -
1 878077 966631 - CACGG
More readable form:
awk -v OFS="\t" 'function trim(s) {
return (length(s) == 1 ? "-" : substr(s, 2))
}
{
$4 = trim($4)
$5 = trim($5)
} 1' file
You can use awk's substr function to process the 4th and 5th space delimited fields:
awk '{ substr($4,2)==""?$4="-":$4=substr($4,2);substr($5,2)==""?$5="-":$5=substr($5,2)}1' file
If the string from position 2 onwards in field 4 is equal to "", set field 4 to "-"; otherwise, set field 4 to the substring of the field from position 2 to the end. Do the same with field 5. Lines, modified or not, are printed by the shorthand 1.
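Either answer can be checked directly against the sample rows from the question (a quick sketch):

```shell
# assigning $4/$5 rebuilds $0, so the output fields are rejoined with OFS (tab)
printf '1 877803 838425 GC G\n1 878077 966631 C CCACGG\n' |
awk -v OFS="\t" 'function trim(s) { return (length(s) == 1 ? "-" : substr(s, 2)) }
  { $4 = trim($4); $5 = trim($5) } 1'
```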

AWK Convert Decimal to Binary

I want to use AWK to convert a list of decimal numbers in a file to binary but there seems to be no built-in method. Sample file is as below:
134218506
134218250
134217984
1610612736
16384
33554432
Here is an awk way, functionized for your pleasure:
awk '
function d2b(d,  b) {   # b is a local variable, not a parameter
    while (d) {
        b = d % 2 b     # prepend the current low-order bit
        d = int(d / 2)
    }
    return b
}
{
    print d2b($0)
}' file
Output of the first three records:
1000000000000000001100001010
1000000000000000001000001010
1000000000000000000100000000
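One edge case worth knowing: d2b() returns an empty string for 0, because the while loop never runs. A hedged variant that covers it:

```shell
awk 'function d2b(d,  b) {
  if (d == 0) return "0"          # the loop below is skipped for 0
  while (d) {
    b = d % 2 b                   # prepend the low-order bit
    d = int(d / 2)
  }
  return b
}
BEGIN { print d2b(0); print d2b(5); print d2b(16384) }'
```

which prints 0, 101 and 100000000000000.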
You can try Perl one-liner
$ cat hamdani.txt
134218506
134218250
134217984
134217984
1610612736
16384
33554432
$ perl -nle ' printf("%b\n",$_) ' hamdani.txt
1000000000000000001100001010
1000000000000000001000001010
1000000000000000000100000000
1000000000000000000100000000
1100000000000000000000000000000
100000000000000
10000000000000000000000000
$
You can try with dc :
# -f infile : use infile for data
# the dc commands follow -e
dc -f infile -e '
z # number of values
sa # keep in register a
2
o # set the output radix to 2 : binary
[
Sb # keep all the value of infile in the register b
# ( b is use here as a stack)
z
0 <M # until there is no more value
] sM # define macro M in [ and ]
lMx # execute macro M to populate stack b
[
Lb # get all values one at a time from stack b
p # print this value in binary
la # get the number of value
1
- # decrement it
d # duplicate
sa # keep one in register a
0<N # the other copy is used here as the loop test
]sN # define macro N
lNx' # execute macro N to print each values in binary
Here's an approach that works by first converting the decimal to hex and then converting each hex character to its binary equivalent:
$ cat dec2bin.awk
BEGIN {
h2b["0"] = "0000"; h2b["8"] = "1000"
h2b["1"] = "0001"; h2b["9"] = "1001"
h2b["2"] = "0010"; h2b["a"] = "1010"
h2b["3"] = "0011"; h2b["b"] = "1011"
h2b["4"] = "0100"; h2b["c"] = "1100"
h2b["5"] = "0101"; h2b["d"] = "1101"
h2b["6"] = "0110"; h2b["e"] = "1110"
h2b["7"] = "0111"; h2b["f"] = "1111"
}
{ print dec2bin($0) }
function hex2bin(hex, n,i,bin) {
n = length(hex)
for (i=1; i<=n; i++) {
bin = bin h2b[substr(hex,i,1)]
}
sub(/^0+/,"",bin)
return bin
}
function dec2bin(dec, hex, bin) {
hex = sprintf("%x", dec)
bin = hex2bin(hex)
return bin
}
$ awk -f dec2bin.awk file
1000000000000000001100001010
1000000000000000001000001010
1000000000000000000100000000
1100000000000000000000000000000
100000000000000
10000000000000000000000000
# gawk binary number functions
# RPC 09OCT2022
# convert an 8 bit binary number to an integer
function bin_to_n(i)
{
n = 0;
#printf(">> %s:", i);
for (k = 1; k < 9; k++) {
n = n * 2;
b = substr(i, k, 1);
if (b == "1") {
n = n + 1;
}
}
return (n);
}
# convert a number to a binary number
function dectobin(n)
{
printf("dectobin: n in %d ",n);
binstring = "0b"; # some c compilers allow 0bXXXXXXXX format numbers
bn = 128;
for(k=0;k<8;k++) {
if (n >= bn) {
binstring = binstring "1";
n = n - bn;
} else {
binstring = binstring "0"
}
printf(" bn %d",bn);
bn = bn / 2;
}
return binstring;
}
BEGIN {
FS = " ";
# gawk (I think) has no atoi() function or equivalent, so a table of all
# chars (well, 256 ASCII) can be used with the index function to get
# around this
for (i = 0; i < 255; i++) {
table = sprintf("%s%c", table, i);
}
}
{
# assume on stdin a buffer of 8 bit binary numbers "01000001 01000010" is AB etc
for (i = 1; i <= NF; i++)
printf("bin-num#%d: %x --> %c\n", i, bin_to_n($i), bin_to_n($i));
s = "ABC123string to test";
for (i = 0; i < length(s); i++) {
nn = index(table, substr(s,i+1,1))-1;
printf("substr :%s:%x:", substr(s,i+1,1), nn);
printf(" :%d: %s\n", i, dectobin(nn));
}
}
On top of what others have already mentioned, this function has a rapid shortcut for non-negative integer powers of 2 (since they always have a binary pattern of /^[1][0]*$/).
Version 1: processing in 3-bit chunks instead of bit-by-bit:
{m,g}awk '
BEGIN {
1 CONVFMT="%.250g"
1 _^=OFMT="%.25g"
}
($++NF=________v1($_))^!_
function ________v1(__,___,_,____,_____)
{
6 if (+__==(_+=_^=____="")^(___=log(__)/log(_))) { # 2
2 return \
___<=_^_^_ \
? (_+_*_*_)^___ \
: sprintf("%.f%0*.f",--_,___,--_)
}
4 ___=(!_!_!_!!_) (_^((_____=_*_*_)+_)-_^_^_+(++_))
4 gsub("..", "&0&1", ___)
41 while(__) {
41 ____ = substr(___,
__%_____*_+(__=int(__/_____))^!_,_)____
}
4 return substr(__=____, index(__, _^(! _)))
}'
Version 2: first use sprintf() to convert to octal, before mapping to binary:
function ________v2(__,___,_,____,_____)
{
6 if (+__==(_+=_^=____="")^(___=log(__)/log(_))) { # 2
2 return \
___<=_^_^_ \
? (_+_*_*_)^___ \
: sprintf("%.f%0*.f",--_,___,--_)
}
4 ___=(!_!_!_!!_) (_^((_____=_*_*_)+_)-_^_^_+(++_))
4 gsub("..", "&0&1", ___)
4 _____=___
4 __=sprintf("%o%.*o", int(__/(___=++_^(_*--_+_))),
_*_+!!_, __%___)
4 sub("^[0]+", "", __)
41 for (___=length(__); ___; ___--) {
41 ____ = substr(_____, substr(__,
___,!!_)*_ + !!_,_)____
}
4 return substr(____, index(____,!!_))
}
134218506 1000000000000000001100001010
134218250 1000000000000000001000001010
134217984 1000000000000000000100000000
1610612736 1100000000000000000000000000000
16384 100000000000000
33554432 10000000000000000000000000
Version 3: a reasonably zippy variant (29.5 MB/s throughput on mawk2) that uses a caching array and processes 8 bits per round; outputs are zero-padded to a minimum of 8 binary digits.
{m,g,n}awk '
1 function ________(_______,_, __,____,______)
{
1 split(_=__=____=______="", _______, _)
2 for (_^=_<_; -_<=+_; _--) {
4 for (__^=_<_; -__<=+__; __--) {
8 for (____^=_<_; -____<=+____; ____--) {
16 for (______^=_<_; -______<=+______; ______--) {
16 _______[_+_+_+_+_+_+_+_+__+__+\
__+__+____+____+______]=\
(_)__ (____)______
}
}
}
}
1 return _^(_<_)
}
BEGIN {
1 CONVFMT = "%." ((_+=(_^=_<_)+(_+=_))*_)(!_)"g"
1 OFMT = "%." (_*_) "g"
1 _ = ________(_____)
}
($++NF=___($_))^!_
function ___(__,____,_,______)
{
6 if ((__=int(__))<(______=\
(_*=_+=_+=_^=____="")*_)) {
return _____[int(__/_)]_____[__%_]
}
16 do { ____=_____[int(__/_)%_]_____[__%_]____
} while (______<=(__=int(__/______)))
6 return int(_____[int(__/_)%_]\
_____[ (__) %_])____
}
You should not use awk for this but bc:
$ bc <<EOF
ibase=10
obase=2
$(cat file)
EOF
or
bc <<< $(awk 'BEGIN{ print "ibase=10; obase=2"}1' file)

get the statistics in a text file using awk

I have a text file like this small example:
>chr10:101370300-101370301
A
>chr10:101370288-101370289
A
>chr10:101370289-101370290
G
>chr10:101471626-101471627
g
>chr10:101471865-101471866
g
>chr10:101471605-101471606
a
>chr10:101471606-101471607
g
>chr10:101471681-101471682
as you see, below every line that starts with ">" there is a letter. These letters are A, G, T or C. In the results I would like to get their frequency as a percentage. Here is a small example of the expected output:
A = 28.57
G = 14.29
g = 42.85
a = 14.29
I am trying to do that in awk using:
awk 'if $1 == "G", num=+1 { a[$1]+=num/"G" }
if $1 == "G", num=+1 { a[$1]+=num/"C" }
if $1 == "G", num=+1 { a[$1]+=num/"T" }
if $1 == "G", num=+1 { a[$1]+=num/"A" }
' infile.txt > outfile.txt
but it does not return what I want. Do you know how to fix it?
Awk solution:
awk '/^[a-zA-Z]/{ a[$1]++; cnt++ }
END{ for (i in a) printf "%s = %.2f\n", i, a[i]*100/cnt }' file.txt
/^[a-zA-Z]/ - on encountering records that start with a letter [a-zA-Z]:
a[$1]++ - accumulate occurrences of each item(letter)
cnt++ - count the total number of items(letters)
The output:
A = 28.57
a = 14.29
G = 14.29
g = 42.86
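A self-contained run on a reduced sample (a sketch; `for (i in a)` order is unspecified, so the output is piped through sort here):

```shell
# two A's, one G, one g => 50% / 25% / 25%
printf '>h1\nA\n>h2\nA\n>h3\nG\n>h4\ng\n' |
awk '/^[a-zA-Z]/ { a[$1]++; cnt++ }
END { for (i in a) printf "%s = %.2f\n", i, a[i]*100/cnt }' |
LC_ALL=C sort
```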
Your sample contradicts your comment (on my display, the lines starting with > have no letter on them, so I assume it's a copy/paste formatting error):
awk '{C[$NF]++;S+=0.01} END{ for( c in C ) printf( "%s = %2.2f\n", c, C[c]/S)}' infile.txt > outfile.txt
If the lines are laid out exactly like the sample, add 'NF==1' as the first part of the awk code.

sum up the output of 'pkpgcounter -ccmyk' in groups Cyan, Magenta, Yellow, Black to calculate ink usage

For printaccounting I'm using Tea4Cups. In /etc/cups/tea4cups.conf I have a line:
echo `pkpgcounter -ccmyk $TEADATAFILE` |sed 's/C\ \://g'|sed 's/M\ \://g'|sed 's/Y\ \://g'|sed 's/K\ \://g'|sed 's/\%/;/g'|sed 's/\./,/g' >>/var/log/accounting_ink.csv
pkpgcounter -ccmyk $TEADATAFILE give output like:
C : 4.732829% M : 4.716022% Y : 3.545420% K : 0.000000%
C : 4.753109% M : 4.736302% Y : 3.560630% K : 0.000000%
C : 4.760295% M : 4.743488% Y : 3.566019% K : 0.000000%
The more pages a file has, the more output the command will give.
sed strips the output from the characters that are not numeric and turn it into the following:
3,699918; 3,285596; 2,983343; 4,169371; 1,596966; 1,635378; 1,621895; 1,306214;
Now I need to add up every value for C, for M, for Y and for K to get an idea of the toner/ink usage of the print jobs.
So value 1 and 5, value 2 and 6, etc. But maybe a first step is to determine the total number of values?
You never need sed when you're using awk, so the intermediate output from your sed command isn't useful here; all we need is your original output from pkpgcounter.
You don't show your expected output so it's a guess but is this what you're trying to do?
$ cat file
C : 4.732829% M : 4.716022% Y : 3.545420% K : 0.000000%
C : 4.753109% M : 4.736302% Y : 3.560630% K : 0.000000%
C : 4.760295% M : 4.743488% Y : 3.566019% K : 0.000000%
$ cat tst.awk
{
for (i=1; i<NF; i+=3) {
val[i] = $i
sum[i] += $(i+2)
}
}
END {
for (i=1; i<NF; i+=3) {
printf "%s%s: %s", (i>1?OFS:""), val[i], sum[i]
}
print ""
}
$ awk -f tst.awk file
C: 14.2462 M: 14.1958 Y: 10.6721 K: 0
You could calculate the individual totals using awk
echo `pkpgcounter -ccmyk $TEADATAFILE` |
awk '{c+=$3;m+=$6;y+=$9;k+=$12}{print}END{printf "C : %.5f%% M:%.5f%% Y:%.5f%% K:%.5f%%",c,m,y,k; print ""}'
C : 4.732829% M : 4.716022% Y : 3.545420% K : 0.000000%
C : 4.753109% M : 4.736302% Y : 3.560630% K : 0.000000%
C : 4.760295% M : 4.743488% Y : 3.566019% K : 0.000000%
C : 14.24623% M:14.19581% Y:10.67207% K:0.00000%
In GNU awk (multichar RS), file has the output from the pkpgcounter:
$ awk 'BEGIN{RS="%"RS"?"}{a[$1]+=$NF}END{for(i in a)print i, a[i]}' file
C 14.2462
K 0
Y 10.6721
M 14.1958
You could pipe the output to the awk instead of using the file as source.
Edit: Single line printing version, as requested:
$ awk 'BEGIN{RS="%"RS"?"}{a[$1]+=$NF}END{for(i in a)printf "%s: %s ", i, a[i];print ""}' file
C: 14.2462 K: 0 Y: 10.6721 M: 14.1958
Edit 2: Single line printing with output order picked from the first record:
$ awk '
BEGIN { RS="%"RS"?" } # set RS
NR<=4 { b[NR]=$1 } # store original order to b
{ a[$1]+=$NF } # sum
END { for(i=1;i<=4;i++) # respect the original order
printf "%s: %s ", b[i], a[b[i]] # output
print "" # finishing touch
}' file
C: 14.2462 M: 14.1958 Y: 10.6721 K: 0
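The RS="%"RS"?" trick makes each "C : 4.732829" chunk its own record (the separator is a % optionally followed by a newline), so $1 is the channel letter and $NF is the value; any awk with regex RS support (gawk, mawk, busybox awk) accepts it. A reduced, sorted check with made-up values:

```shell
printf 'C : 1.5%% M : 2.0%% Y : 0.5%% K : 0.0%%\nC : 0.5%% M : 1.0%% Y : 0.5%% K : 0.0%%\n' |
awk 'BEGIN { RS="%"RS"?" } { a[$1] += $NF } END { for (i in a) print i, a[i] }' |
LC_ALL=C sort
```

which sums each channel across both lines (C 2, K 0, M 3, Y 1).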