How to generate a string with a char repeated n times? - awk

I can use the following code to generate a string.
$ awk -e 'BEGIN { for(i=1;i<=10;++i) s = s "x"; print s }'
xxxxxxxxxx
But its run time grows super-linearly with the string length:
$ time awk -e 'BEGIN { for(i=1;i<=10000000;++i) s = s "x" }'
real 0m0.868s
user 0m0.857s
sys 0m0.008s
$ time awk -e 'BEGIN { for(i=1;i<=100000000;++i) s = s "x" }'
real 0m9.886s
user 0m9.801s
sys 0m0.065s
$ time awk -e 'BEGIN { for(i=1;i<=1000000000;++i) s = s "x" }'
real 1m46.074s
user 1m45.171s
sys 0m0.760s
Is there a better way to repeat a char n times and assign the result to a variable?

Use sprintf to create a string of n spaces, then use gsub to replace each space with an x (in sprintf("%*s", n, ""), the * takes the field width from the next argument, so the empty string is padded out to n spaces):
$ time awk 'BEGIN {s = sprintf("%*s", 100000000, ""); gsub(".", "x", s)}'
real 0m1.744s
user 0m1.645s
sys 0m0.098s
This can be wrapped in an awk function:
function mkStr(c, n, s) {
    s = sprintf("%*s", n, "");
    gsub(".", c, s);
    return s;
}
(s is a parameter simply to scope the variable to the function; it needs no argument, and indeed, any argument passed will be ignored.)
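A quick sanity check of the function (my own usage example; the values passed are arbitrary):
$ awk 'function mkStr(c, n, s) { s = sprintf("%*s", n, ""); gsub(".", c, s); return s }
       BEGIN { print mkStr("x", 5); print mkStr("-", 3) }'
xxxxx
---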
Update: there appears to be a significant difference in performance depending on which version of awk you are using. The above test used 20070501, the BSD(?) awk that ships with macOS. gawk-5.1.0 takes significantly longer.
I don't know what accounts for the difference; perhaps there is a solution that is fast in both versions.
Update 2: Ed Morton (in the comments) has verified that gsub is responsible for the slow running time in gawk, could not find a workaround, and has filed a bug report with the maintainers.

function loop(n){
    for (i = 1; i <= n; i++) s = s "x";
    return s;
}
function repl(n){
    s = sprintf("%*s", n, "");
    gsub(/ /, "x", s);
    return s;
}
function recStack(n, h){
    switch (n) {
    case 0:
        return "";
    default:
        if (n % 2 == 1) {
            h = recStack( int((n-1)/2) )
            return h h "x";
        } else {
            h = recStack( int(n/2) )
            return h h;
        }
    }
}
function recStackIf(n, h){
    if (n == 0) return "";
    if (n % 2 == 1) {
        h = recStackIf( int((n-1)/2) );  # create first half
        return h h "x";                  # duplicate + one "x"
    } else {
        h = recStackIf( int(n/2) );      # create first half
        return h h;                      # duplicate
    }
}
function recArray(n, h, n2){
    if (n in a) return a[n];
    switch (n) {
    case 0:
        return a[0] = "";
    default:
        if (n % 2 == 1) {
            n2 = int((n-1)/2);
            h = recArray( n2 );
            return a[n] = h h "x";
        } else {
            n2 = int(n/2);
            h = recArray( n2 );
            return a[n] = h h;
        }
    }
}
function recArrayIf(n, h, n2){
    if (n in a) return a[n];
    if (n == 0) return a[0] = "";
    if (n % 2 == 1) {
        n2 = int((n-1)/2);
        h = recArrayIf( n2 );
        return a[n] = h h "x";
    } else {
        n2 = int(n/2);
        h = recArrayIf( n2 );
        return a[n] = h h;
    }
}
function concat(n){
    exponent = log(n)/log(2)
    m = int(exponent)            # floor
    m += (m < exponent ? 1 : 0)  # ceiling
    s = "x"
    for (i=1; i<=m; i++) {
        s = s s
    }
    s = substr(s,1,n)
    return s
}
BEGIN {
    switch (F) {
    case "recStack":
        xxx = recStack( 100000000 );
        break;
    case "recStackIf":
        xxx = recStackIf( 100000000 );
        break;
    case "recArray":
        xxx = recArray( 100000000 );
        break;
    case "recArrayIf":
        xxx = recArrayIf( 100000000 );
        break;
    case "loop":
        xxx = loop( 100000000 );
        break;
    case "repl":
        xxx = repl( 100000000 );
        break;
    case "concat":
        xxx = concat( 100000000 );
        break;
    }
    print length(xxx);
    ## xloop = loop(100000000 );
    ## if( xxx == xloop ) print "Match";
}
Times are:
# loop : real 0m5,405s, user 0m5,243s, sys 0m0,160s
# repl : real 0m7,670s, user 0m7,506s, sys 0m0,164s
# recArray: real 0m0,302s, user 0m0,141s, sys 0m0,161s
# recArrayIf: real 0m0,309s, user 0m0,168s, sys 0m0,141s
# recStack: real 0m0,316s, user 0m0,124s, sys 0m0,192s
# recStackIf: real 0m0,305s, user 0m0,152s, sys 0m0,152s
# concat: real 0m0,664s, user 0m0,300s, sys 0m0,364s
There's not much difference between the five versions that use binary decomposition: a bunch of heap memory is used in all cases. Having the global array hang around for the rest of the run isn't good, though, so I'd prefer one of the stack versions.
wlaun@terra:/tmp$ gawk -V
GNU Awk 5.0.1, API: 2.0 (GNU MPFR 4.0.2, GNU MP 6.2.0)
wlaun@terra:/tmp$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Note that the above timings were taken with a statement printing the resulting string's length, which adds about 0.2 s to each version. Also, /usr/bin/time isn't reliable here. These are the relative "real" values from time without the print length(xxx):
# loop: 0m5,248s
# repl: 0m7,705s
# recStack: 0m0,103s
# recStackIf: 0m0,097s
# recArray: 0m0,103s
# recArrayIf: 0m0,099s
# concat: 0m0,455s
Added on request of Ed Morton:
Why is any of the recursive functions faster than any of the linear functions that iterate over O(N) elements? (The "O(N)" is "big oh" notation, used to indicate a value that is N, possibly multiplied and/or incremented by some constant. A circle's circumference is O(r), its area is O(r²).)
The answer is simple: By dividing N by 2, we get two strings of length O(N/2). This provides the possibility of re-using the result for the first half (no matter how we obtain it) for the second half! Thus, we'll get the second half of the result for free (except for the string copy operation, which is basically a machine instruction on most popular architectures). There is no reason why this great idea should not be applied for creating the first half as well, which means that we get three quarters of the result for free (except - see above). A little overhead results from the single "x" we have to throw in to cater for odd subdivisions of N.
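To make the saving concrete, here is the call trace for recStackIf(10) (my own worked example):
recStackIf(10): h = recStackIf(5), returns h h     = "xxxxxxxxxx"
recStackIf(5):  h = recStackIf(2), returns h h "x" = "xxxxx"
recStackIf(2):  h = recStackIf(1), returns h h     = "xx"
recStackIf(1):  h = recStackIf(0), returns h h "x" = "x"
recStackIf(0):  returns ""
Five calls and O(log N) concatenations instead of N single-character appends.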
There are many other recursive algorithms along the idea of halving and dealing with both sections individually, the most famous of them are Binary Search and Quicksort.

Here's a fast solution using any POSIX awk (tested on an average 8G RAM laptop running cygwin with GNU awk 5.1.0):
time awk -v n=100000000 'BEGIN {
    exponent = log(n)/log(2)
    m = int(exponent)            # floor
    m += (m < exponent ? 1 : 0)  # ceiling
    s = "x"
    for (i=1; i<=m; i++) {
        s = s s
    }
    s = substr(s,1,n)
}'
real 0m0.665s
user 0m0.328s
sys 0m0.343s
The above just appends a copy of the string to itself until it's at least as big as the target string length, then truncates it to exactly the desired length. The only somewhat tricky part is calculating how many times to append s to itself, and that's just a case of solving 2^m = n for m given that you already know n: here log(100000000)/log(2) ≈ 26.58, so m = 27 and 2^27 = 134217728, which the final substr() trims back down to n. See https://www.calculatorsoup.com/calculators/algebra/exponentsolve.php.
Obviously you could make the loop while ( length(s) < n ) instead of calculating m; that'd make the script briefer but slow it down a bit (it'd still be pretty fast, though):
$ time awk -v n=100000000 'BEGIN{s="x"; while (length(s) < n) s=s s; s=substr(s,1,n)}'
real 0m1.072s
user 0m0.562s
sys 0m0.483s
@JamesBrown had a better idea than calling length() on each iteration, one that also avoids having to calculate m while being about as fast:
$ time awk -v n=100000000 'BEGIN{s="x"; for (i=1; i<n; i*=2) s=s s; s=substr(s,1,n)}'
real 0m0.710s
user 0m0.281s
sys 0m0.390s
Originally I had the following, thinking that doubling strings of "x"s would be a faster approach than doubling individual "x"s on each iteration of the loop, but it was a bit slower:
$ time awk -v n=100000000 '
BEGIN {
    d = 1000000
    m = 7
    s = sprintf("%*s",d,"")
    gsub(/ /,"x",s)
    for (i=1; i<=m; i++) {
        s = s s
    }
    s = substr(s,1,n)
}
'
real 0m1.030s
user 0m0.625s
sys 0m0.375s
The idea in that 2nd script was to generate a string long enough to be meaningful but short enough that gsub() can convert all blanks to "x"s quickly (I found 1000000 chars by trial and error), then just repeat the doubling with fewer iterations: 1000000 * 2^7 = 128000000 >= n, hence m = 7.
I opened a bug report with the gawk providers about gsub() being slow for this case (see https://lists.gnu.org/archive/html/bug-gawk/2021-07/msg00030.html if you're interested), but it looks like the outcome will just be that gsub() is a heavy tool and it's not surprising that 100 million regexp matches take a long time. You can speed it up considerably by setting LC_ALL=C before it runs so gawk doesn't have to worry about multibyte locales.

If you are okay with perl:
# assigning to a variable
$ perl -e '$s = "x" x 10'
# printing the string
$ perl -e 'print "x" x 10'
xxxxxxxxxx
Note that there's no newline after the print in the above example; use perl -le if you want one.
Here's a timing comparison:
$ time perl -e '$s = "x" x 100000000'
real 0m0.071s
# script.awk is the first code from Ed Morton's answer
$ time mawk -v n=100000000 -f script.awk
real 0m0.136s
$ time gawk -v n=100000000 -f script.awk
real 0m0.429s

Is there a better way to repeat a char n times and assign the result to a variable?
I propose the following solution, limited to the case where n = 2^x for some integer x >= 1. For example, to get x repeated 32 (i.e. 2^5) times:
awk 'BEGIN{s="x";for(i=1;i<=5;i+=1)s=s s;print s}' emptyfile.txt
output
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
(tested in gawk 4.2.1)

An extreme test-case benchmark: comparing python3's built-in repeater, running as compiled C code, against user-level scripting of a different algorithm, generating 152.58 bn zeros.
Starting point :: length = 4 < "0000" > | time = 0.000000 secs
For-Loop # 1 | Lgth-Gain Ratio 6.0x | #-0's : 24. | 0.000036 secs (cum.)
For-Loop # 2 | Lgth-Gain Ratio 26.0x | #-0's : 624. | 0.000050 secs (cum.)
For-Loop # 3 | Lgth-Gain Ratio 626.0x | #-0's : 390,624. | 0.000175 secs (cum.)
For-Loop # 4 | Lgth-Gain Ratio 390626.0x | #-0's : 152,587,890,624. | 21.485092 secs (cum.)
( mawk2 ; )
4.38s user 13.69s system 81% cpu 22.226 total
( python3 -c 'print(len("0"*(5**4**2-1)))'; )
4.34s user 17.32s system 71% cpu 30.291 total
152587890624
% ( time ( mawk2 '
function comma(_,__,___,____,_____) {
if(length(sprintf("%+.f",_))<=(\
__*=__+=__=__~__)) {
return \
index(_,".") ?_:_"."
}
____=CONVFMT;CONVFMT="%.1000g";___=_
_____="[0-9]"
gsub("^|$",_____,_____)
sub("^[^.]+","",___)
sub("("(_____)")+([.]|$)",",&",_)
gsub(_____,"&,",_)
sub("([,]*[.].*)?$",___,_)
sub("[,]+$",".",_)
sub("^[ \t]*[+-]?[,]+",\
substr("-",__==__,_~"^[ \t]*[-]"),_)
CONVFMT=____;
return _ }
function printProgress(_____) {
print sprintf(" For-Loop "\
"# %d | Lgth-Gain Ratio %8.1fx | "\
"#-0\47s : %16s | %9s secs (cum.)",
++_-_____+!--_,
(___^-!!___)*(___=length(__)),
comma(___),timer())
return fflush() }
function timer(_,__,___) {
return sprintf("%.*s%.*s%*.*f",
!(_+=_+=(_=(_=__=___="")~_)+_),
srand(),
!(__+=__=--_),___=int((__^=++_)*\
+substr(srand(),++_)),
__%(++_-+-++_),_+=_=(++_+--_)%--_,
!_TIME0_?(_TIME0_=___)*!__\
:(___-_TIME0_)/__)
} BEGIN {
____=__+=__*=___=__^=(__+=++__)^__;
gsub("",__,__)
gsub(".",!__,__)
print ORS,"Starting point :: length = ",
___=length(__=substr(__,__~__,_=4)),
" < \42"(__)"\42 > | time = " ,
timer(), "secs", ORS;
for(_____=_;_____;_____--) {
gsub("",__,__)
printProgress(_____) } }' ));
sleep 2;
( time (python3 -c 'print(len("0"*(5**4**2-1)))' ))|lgp3;
UPDATE 1: a generic version of it.
Depending on which variant of awk is running, it automatically switches to the business-as-usual binary doubler; and for the squarer, the speed gains are so immense that it's actually faster to square and then take a substring than to try to be pin-drop accurate at each level, which inherently slows it down. _mN is the binary doubler, genZeros is the exp-sqr one.
mawk2 '
function _mN(__,____,_,___,_____,______){
return \
(_=_~_)<+__?(_____=_mN((__-(\
______=__%(_+=_)))/_,____,___))___\
(_____)(______?(___)____:""):__<_?"":____
}
function genZeros(_,__,___,____) {
if (+_<=+((__=___=____="")<__)) {
return \
(-_<+_) ? +__:___;
} else if (FLG1*(_~"...")) {
return _mN(_,_<_)
}
____=__+=__*=___=__^=(__+=++__)^__;
gsub("",__,__)-gsub(".",!__,__)
if (_<__){ return \
substr((__)__,_~_,_)
}; ___=(___=__)___;__="";
__+=++__+__;
_-=__*(__=int((_/__)^(--__)^-!!_))*__;
___=(__<length(___))\
? substr(___,__~__,__) \
: sprintf("%"(+"")"*.f",__,!__)
gsub(".",(___)(___)___,___)
return \
!_?___:FLG2?\
sprintf("%"(_<_)"*.f",+_,_)___:_mN(+_,_<_)
} BEGIN { CONVFMT=\
OFMT="%.20g"
FLG1=(+sprintf("%o",-1)<"0x1")
FLG2=!(sprintf("%d",3E10)%2)
} {
print length(genZeros(3^18)),3^18 }'
387420489 387420489
=================================
Doubling each loop is too slow. To get even better performance, you need exponential squaring. How about 88 milliseconds:
echo; ( time (gawk -v n=100000000 -Sbe 'BEGIN {
CONVFMT=OFMT="%.20g"
__="x";____="."
_=10^((___=3)*___)
gsub(____,__,_)
while(___--){
gsub(____,_,_)
}
print length(_) }' ) ) | lgp3
( gawk -v n=100000000 -Sbe ; )
0.07s user 0.01s system 95% cpu 0.088 total
100000000
And that's only for gawk. For mawk-2, it's an unearthly 32 msecs
echo; ( time ( mawk2 -v n=100000000 'BEGIN { CONVFMT=OFMT="%.20g"
__="x";____=".";
_=10^((___=3)*___)
gsub(____,__,_)
while(___--) {
gsub(____,_,_) }
print length(_) }' ) ) | lgp3
( mawk2 -v n=100000000 ; )
0.01s user 0.02s system 84% cpu 0.032 total
100000000
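De-obfuscated, the trick in both scripts above is that gsub(".", s, s) replaces every character of s with the whole pre-call value of s, squaring its length on each pass. A minimal readable sketch of the same idea (my own rewrite, assuming n is 10^(2^k) so the squaring lands exactly on n):
awk 'BEGIN {
    s = "xxxxxxxxxx"          # 10 copies to start
    for (i = 1; i <= 3; i++)  # each pass squares the length:
        gsub(/./, s, s)       # 10 -> 100 -> 10^4 -> 10^8
    print length(s)           # 100000000
}'
This is safe because awk evaluates the replacement string once, before any substitution happens, so the string can be its own replacement (and it contains no "&" or backslashes here).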
404 msecs is all it takes to go from a single "x" to nearly 4.3 billion copies of it. (The program first builds "xxx"; each gsub("", s, s) pass then inserts a copy of the string at every one of its length+1 positions, growing length L to (L+1)^2 - 1: 3 -> 15 -> 255 -> 65535 -> 4294967295.)
( time ( mawk2 '
BEGIN {
___=(_+=_+=++_)^(_*_)
__=length(_=(_="x")(_)_);++___
while(__*__<___) {
gsub("",_,_)
__*=++__
print --__
}
print ORS,length(_),ORS }' ) )
15
255
65535
4294967295
4294967295
0.09s user 0.25s system 84% cpu 0.404 total
UPDATE :
Benchmarking generation of 25.7 billion zeros; at this size level, python3's built-in repeater is being left in the dust.
echo; (fg && fg && fg ) 2>/dev/null;
echo;
( time ( mawk2 '
BEGIN { _=_<_
__= 4
while(__--) {
gsub("",(_)(_)_,_)
}
print _ }'
) \
| pvE9 | LC_ALL=C gwc -lc ) | ggXy3 | ecp
sleep 1.5
echo
( time ( python3 -c 'print("0"*25710762175)'
) \
| pvE9 | LC_ALL=C gwc -lc ) | ggXy3 | ecp ; echo
out9: 23.9GiB 0:00:08 [2.85GiB/s] [2.85GiB/s] [ <=> ]
( mawk2 'BEGIN { _=_<_;__=4; while(__--){ gsub("",(_)(_)_,_) }; print _ }'; ) 0.79s user 4.72s system 63% cpu 8.729 total
pvE 0.1 out9 0.54s user 2.93s system 41% cpu 8.435 total
LC_ALL=C gwc -lc 2.03s user 1.46s system 41% cpu 8.434 total
1 25710762176
out9: 23.9GiB 0:00:11 [2.17GiB/s] [2.17GiB/s] [ <=> ]
( python3 -c 'print("0"*25710762175)'; ) 1.24s user 6.84s system 72% cpu 11.076 total
pvE 0.1 out9 0.56s user 2.80s system 30% cpu 11.075 total
LC_ALL=C gwc -lc 2.03s user 1.46s system 31% cpu 11.074 total
1 25710762176
UPDATE 2: (partially unrelated, but to illustrate my thought process)
Once you realize there's a predictable, deterministic pattern to how the pop-count string appears and repeats, you can create the full pop-count string for all 256 byte values sequentially using only 4 cycles of the do-while loop, without using hardware instructions and without having to go through every byte one by one, in either decimal-integer or bit-string form. The one thing that's crucial, though, is setting a CONVFMT value large enough to avoid scientific-notation truncation, since the essence of the function is performing multiple 16-digit integer adds (getting near 2^53 but never exceeding it):
<<<'' mawk '
function initPopCountStr(_,__,
___,____,_____,______) {
# No.# of set:bits per:byte
#
__=_=(_<_)(______=___=_~_)
______+=++______;
do {_=\
(__=_)(__+(___=(___)___))
} while (--______)
return \
(((_=(_)(____=(__+___)(___+__+___))\
____)(_____=((__+(___+=______=___)))\
(___+=__+______)))____\
(_____)_____)(___)(______+___)
} BEGIN {
CONVFMT="%.20g"
} {
print initPopCountStr() }' \
\
| gsed -zE 's/(.{16})(.{16})/\1:\2\n/g' |nonEmpty|gcat -n|lgp3 4
1 0112122312232334:1223233423343445
2 1223233423343445:2334344534454556
3 1223233423343445:2334344534454556
4 2334344534454556:3445455645565667
5 1223233423343445:2334344534454556
6 2334344534454556:3445455645565667
7 2334344534454556:3445455645565667
8 3445455645565667:4556566756676778
Or, if you prefer no loops at all, everything folded into one fell swoop:
function initPopCountStr(_,__,
___,____,_____,______) {
# No.# of set:bits per:byte
#
return \
\
(((_=(_=(__=(_=(__=(_=(__=(_=(__=_=(_<_)(______=___=_~_))\
(__+(___=(___)___)) ))(__+(___=(___)___))))\
(__+(___=(___)___))))\
(__+(___=(___)___)))(____=(__+___)(___+__+___))____)\
(_____=((__+(___+=______=___)))(___+=__+______)))\
(____)(_____)_____)(___)(______+___)
}
Even the ultra-lazy method takes only 0.791 secs:
out9: 95.4MiB 0:00:00 [ 129MiB/s] [ 129MiB/s] [<=> ]
( mawk2 -F= 'NF=+$_+(OFS=$_=$NF)' ORS= <<< '100000000=x'; )
0.59s user 0.16s system 95% cpu 0.791 total
And the fast method takes 16 milliseconds:
fg;fg; ( time ( <<< '100000000=x' mawk2 -F= '{
_ = sprintf("%*s",__=int((____=+$(_=\
+(___="."))) ^ +(++_/(++_*_))),"")
gsub(___,$NF,_)
do { gsub(___,_,_)
} while((__*=__)<____)
print length(_),substr(_,1,31) }' ) ) | gcat -n | lgp3
fg: no current job
fg: no current job
( mawk2 -F= <<< '100000000=x'; )
0.00s user 0.01s system 87% cpu 0.016 total
1 100000000 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
UPDATE : making 3,486,784,401 copies of "x" in 0.323 secs
fg;fg; ( time ( <<< '43046721=x' mawk2 -F= '
{
________=substr("",+srand()) +substr(srand(),7)
_=sprintf("%*s",__=int((___=+$(_=+\
(_____=".")))^(++_/(++_*_))),"")
gsub(_____,$NF,_)
do { gsub(_____, _, _) } while((__*=__)<___)
print sprintf(" %15.8f GiB | %20.13f",
(__=length(_))/8^10,
((substr("",+srand())+substr(srand(),7))-\
(___=________))/(1+4^-24))
print substr(_,1,33)
CONVFMT=OFMT="%.14g"
____=9
while(--____){
sub(".+","&&&&&&&&&",_)
print sprintf(" %15.8f GiB | %20.13f",
length(_)/8^10,((substr("",+srand()) +\
substr(srand(),7))-___)/(1+4^-24)); fflush() } }' ) )
fg: no current job
fg: no current job
0.04009038 GiB | 0.0077791213989
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
0.36081345 GiB | 0.0445649623871
3.24732103 GiB | 0.3233759403229
29.22588923 GiB | 3.5619049072266
263.03300306 GiB | 58.9802551269529
Grow it at a different rate and we get 103 billion copies of "x" in just shy of 17 seconds:
0.04304672 x 10^9 | 0.0111348628998
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
0.3013270470 x 10^9 ( 301327047)| 0.0563540458679
2.1092893290 x 10^9 ( 2109289329)| 0.2259490489960
14.7650253030 x 10^9 ( 14765025303)| 1.4351949691773
103.3551771210 x 10^9 ( 103355177121)| 16.9313509464274
And it's relatively trivial to adjust for arbitrary sizes:
{m,g}awk '{ # the timer bit is for mawk-2
CONVFMT = "%.25g"
OFMT = "%.13f"
________ = substr("", +srand()) + substr(srand(), 7)
_ = sprintf("%*s", __=int((___= +$(_=\
+(_____="."))^ (++_/(++_*_))),"")
gsub(_____, $NF, _)
do {
print ___, gsub(_____, _, _),
((substr("",+srand()) +
substr(srand(), 7)) -________)/(1-4^-25)
} while ((__*=__)<___ && __<4^8)
print ".", length(_)
print \
sprintf("\n----------------\n%42.f .%20.13f",
(__=length(_))==___ ?__: (__=length(_= sprintf("%.*s",
___-__,_)(_))),((substr("", +srand()\
)+substr(srand(),7))-________) / (1-4^-25))
print substr(_, 1, 19)
}'
( mawk2 -F= <<< '3987654321=x'; )
0.20s user 0.55s system 94% cpu 0.799 total
3987654321 251 0.0000970363617
3987654321 63001 0.3465299606323
. 3969126001
----------------
3987654321 . 0.7496919631958

Related

How do I get gawk to transpose my data into a csv file

I have a bunch of input text files that look like this:
measured 5.7 0.0000 0.0000 0.0125 0.0161 0.0203 0.0230 0.0233 0.0236 0.0241
0.0243 0.0239 0.0235 0.0226 0.0207 0.0184 0.0147 0.0000 0.0000
measured 7.4 0.0000 0.0000 0.0160 0.0207 0.0260 0.0295 0.0298 0.0302 0.0308
0.0311 0.0306 0.0300 0.0289 0.0264 0.0235 0.0187 0.0000 0.0000
Each file has a couple of lines like that.
I want to take all of these files, cut out the 'measured' and the first number (e.g. 5.7 and 7.4), and put them in a CSV file so they will be sorted into columns.
My gawk command is
BEGIN { OFS = "\n" }
/measured/ { c=2; $1=$2=""; $0=$0 }
c && c-- { $1=$1; print }
Which I run as part of a for loop in Windows: for %f in (*) do (gawk -f column.txt %f) >> finaloutput\burnup.csv
And that just produces one long column of numbers. How do I get gawk to transpose the data into separate columns instead of one big long column?
$1=$2="";
this was the problem. you threw away exactly what you wasted (the 5.7) by having "measured" and "5.7" both go blank. The rest of the work became a futile exercise of re-cranking fields without the data you need.
c=2 only set the times you want to loop it it didn't save any of the variables before the =""
Also, any particular reason why you need a loop for those files? just cat (or pv, my preferred method) it over to gawk (i think mawk2-beta might run this at least 60-70% faster than gawk 5.1.0. AWK is the insane tool that has a much easier time than python or R when it comes go shoving it with gigs at a time.
Perhaps also consider using GNU grep (not the gd BSD one like on my mac - the syntax and escaping required just to match ABC \n XYZ on that thing is surreal)?
I think this might be one of those great use cases for the --text flag (so it won't go screaming over the odd irregular bytes here and there), the -h flag (when you're sourcing from multiple files), and the -o flag to capture only the portion you need. But I'm not 100% sure whether a concurrent grep like that might exhibit any inconsistencies or not.
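Something along these lines (a sketch only; the *.log glob is my assumption, and the pattern assumes the data values always carry four decimals while the leading index numbers like 5.7 and 7.4 carry one):
LC_ALL=C grep --text -h -oE '[0-9]+[.][0-9]{4}' *.log
That would pull every data value out of all the input files as one long column, leaving only the transpose step for awk.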
I'm a super novice in python and perl, so I can't help much there.
Try this new solution. Tested on mawk2-beta; it should work elsewhere. It handles the transpose automatically, without multi-dimensional arrays, because the regex inside RS already flattens the whole input into one column.
Setting TOTCOL should only occur once, which is handled by setting FS to newline: whenever we're at the end of an input line, NF is at least 2, if not more.
Furthermore, it doesn't need to assume 9 or 11 columns at all; the count is auto-computed. So unless the input file has more than 1 billion columns, this trick won't be an issue.
I backed NR up by 2 initially so the % wouldn't wrap back to 0 one spot too early.
mawk2 'BEGIN {
    TOTCOL = 1E9; NR -= 2; FS = ORS; OFS = "";
    RS = "([\n]*[ \t]+measured[ \t]+[^ \t]+)?[ \t]+";
}
(NR < TOTCOL && NF > 1) { TOTCOL = 2 * length(outS); }
(NR == TOTCOL)          { OFS = "\t"; }
{ outS[NR % TOTCOL] = outS[NR % TOTCOL] OFS $1; }
END {
    for (trspd in outS) {
        print outS[trspd];
    }
}'
Finally fixed all the issues. This will work across gawk / mawk-1.3 / mawk2. The reason being that mawk-1 must initialize the array first, or else it'll whack out.
gawk/mawk/mawk2 'BEGIN {
    TOTCOL = 1E8; FS = "[\n]+"; NR -= 2; OFS = "";
    RS = "((^|[\n]+)[ \t]+measured[ \t]+[^ \t]+)?[ \t]+";
    outS[""]++;
    delete outS[""];
}
(NR < TOTCOL && NF > 1) { TOTCOL = 2 * length(outS); }
(NR == TOTCOL)          { delete outS[-1]; OFS = "\t"; }
{ outS[NR % TOTCOL] = outS[NR % TOTCOL] OFS $1; }
END {
    trspd = 0;
    nx = length(outS);
    do {
        print outS[trspd];
    } while (++trspd < nx)
}'
Here's one:
$ awk -v RS="" '{ # read empty line separated blocks
for(i=3;i<=NF;i++) # loop from 3rd field to the end
a[i]=a[i] sprintf("%10s",$i+0) # append $i to the i-indexed array element
}
END { # in the end
for(i=3;i in a;i++) # output
print a[i]
}' file
Output:
         0         0
         0         0
    0.0125     0.016
    0.0161    0.0207
    0.0203     0.026
     0.023    0.0295
    0.0233    0.0298
    0.0236    0.0302
    0.0241    0.0308
    0.0243    0.0311
    0.0239    0.0306
    0.0235      0.03
    0.0226    0.0289
    0.0207    0.0264
    0.0184    0.0235
    0.0147    0.0187
         0         0
         0         0
It's dumb (since I'm lazy) in the sense that sprintf uses a static 10-character width for the right-justified output.
Perhaps this will do what you want?
awk 'NR%2{printf "%s ",$0;next;}1' input_text_file.csv | awk '{for (i=1; i<=NF; i++) {a[NR,i] = $i} } NF>p { p = NF } END {for(j=1; j<=p; j++) {str=a[1,j]; for(i=2; i<=NR; i++){str=str" "a[i,j];} print str}}'

Faster algorithm/language for string replacement [closed]

EDIT:
Method 3 below is much faster in testing, reducing the estimated runtime from 2-3 days to under 1 day.
I have a sample file containing one long string (>50M characters) like this:
CACTGCTGTCACCCTCCATGCACCTGCCCACCCTCCAAGGATCNNNNNNNCACTGCTGTCACCCTCCATGCACCTGCCCACCCTCCAAGGATCaagctCCgaTNNNNNNNNNNNNGgtgtgtatatatcatgtgtgGCCCTAGGGCCCTAGGGCCCTAtgtgtgGCCCTAGGGCtgtgtgGCCCTAGGGCGGatgtgtggtgtgtggggttagggttagggttaNNNNNNNNNNNCCCTCCAAGGATCaagctCCgaTNNNNNNNNNNNNGgtgtgtatataGCCCTAGGtcatgtgtgatgtgtggtgtgtggggttagggttagggttaNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGCCCTAGGNNNNNNNGCCCTAGGNNNNNNNNNNNNNNAGGGCCCTAGGGCCCTAtgtgtgGCCCTAGGGCtgtgtgGCCCTAGGGCGGagtatatatcatgtgtgatgtgttggggtNNNNNNGgtgtgtatatatcatagggAGGGCCCTAGGGCCCTAtgtgtgGCCCTAGGGCtgtgtgGCCCTAGGGCGGagtatatatcatgtgtgatgtgtggtgtgggtgtgtggggttagggAGGGCCCTAGGGCCCTAtgtgtgGCCCTAGGGCtgtgtgGCCCTAGGGCGGagtatatatcatgtgtgatgtggtgtgtggggttagggttagggttaNNNNNNNNNNNNtgttgttttattttcttacaggtggtgtgtggggttagggttagggttaNNNNNNNNNNNCCCTCCAAGGATCaagctCCgaTNNNNNNNNNNNNGgtgtgtatatatcatgtAGCCCTAGGGatgtgtggtgtgtggggttagggttagggttaNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNttgtggtgtgtggtgNNNNNAGGGCtggtgtgtggggttagggAtagggAGGGCCCTAGGGCCCTAtgtgtgGCCCTAGGGCtgtgtgGCCCTAGGGCGGagtatatatcatgtgtgatgtgtggtgtgtggggGGGCCCTAGGGCCCTAtgtgtgGCCCTAGGGCtgtgtgGCCCTAGGGCGGagtatatatcatgtgtgatgtgtggtgtgtggggttagggNNNNNNNNNNNNNNNNNNNNNNNNNNNNAGaggcatattgatcCCCTCCAAGGATCaagctCCgaTNNNNNNNggttagggttNNNNNGgtgtCCCTAGGGCCCTAGGGCCCTAtgtgtgGCCCTAGGGCtgtgtgGCCCTAGGGCGGagtatatatcatgtgtgatgtgtggtgtgtggggttagggttagggttaNNNNNNNNNNNNtgttgttttattttcttacaggtggtgtgtggggttagggttagggttaNNNNNNNNNNNCCCTCCAAGGATCaagctCCgaTNNNNNNNNNNNNGgtgtgtatatatcatgtAGCCCTAGGGatgtgtggtgtgtggggttagggttagggttaNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNttgtggtgtgtggtgNNNNNAGGGCCCTAGGGCCCTAtgtgtgGCCCTAGGGCtgtgtgGCCCTAGGGCGGagtatatatcatgtgtgatgtgttggggtNNNNNNGgtgtgtatatatcatagggAGGGCCCTAGGGCCCTAtgtgtgGCCCTAGGGCtgtgtgGCCCTAGGGCGGagtatatatcatgtgtgatgtgtggtgtgggtgtgtggggttagggAGGGCCCTAGGGCCCTAtgtgtgGCCCTAGGGCtgtgtgGCCCTAGGGCGGagtatatatcatgtgtgNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
For every substring of length k = 50 (which means there are length(file)-k+1 substrings): if the share of A/T/a/t characters (upper or lower case) is >40%, replace every character in that substring with N or n (preserving case).
Sample output:
CACTGCTGTCACCCTCCATGCACCTGCCCACCCTCCAAGGATCNNNNNNNCACTGCTGTCACCCTCCATGCACCTGCCCACCCTCCAAGGATCaagctCCgaTNNNNNNNNNNNNGgnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNNNNNNnnnnnnNNNNNNNNNNnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnggttaNNNNNNNNNNNNNNNNNNNNNNNNnnnnnNNnnNNNNNNNNNNNNNNnnnnnnnnnnnNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGCCCTAGGNNNNNNNGCCCTAGGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNnnnnnnNNNNNNNNNNnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNnnnnnnNNNNNNNNNNnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNnnnnnnNNNNNNNNNNnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNNNNNNNNnnnnnNNnnNNNNNNNNNNNNNNnnnnnnnnnnnnnnnnnNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNttgtggtgtgtggtgNNNNNAGGNNnnnnnnnnnnnnnnnnnnNnnnnnNNNNNNNNNNNNNNNNNnnnnnnNNNNNNNNNNnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNnnnnnnNNNNNNNNNNnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnngggttagggNNNNNNNNNNNNNNNNNNNNNNNNNNNNAGaggcatattgatcCCCTCCAAGGATCaagctCCgaTNNNNNNNggttagggttNNNNNGnnnnNNNNNNNNNNNNNNNNNNNNNnnnnnnNNNNNNNNNNnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNNNNNNNNnnnnnNNnnNNNNNNNNNNNNNNnnnnnnnnnnnnnnnnnNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNttgtggtgtgtggtgNNNNNNNNNNNNNNNNNNNNNNnnnnnnNNNNNNNNNNnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNnnnnnnNNNNNNNNNNnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNnnnnnnNNNNNNNNNNnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnngNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
I was using AWK on the command line for ease, but it just runs extremely slowly on the string replacement... and somehow consumes only <5% CPU.
Code: https://repl.it/#hey0wing/DutifulHandsomeMisrac-2
# Method 1
cat chr22.fa | head -n1 > chr22.masked.fa
cat chr22.fa | tail -n+2 | awk -v k=100 -v r=40 '{
printf("chr22.fa: %d\n",length($0))
i = 1;
while (i <= length($0)-k+1) {
x = substr($0, i, k)
if (i == 1) {
rate = gsub(/A/,"A",x) + gsub(/T/,"T",x) + gsub(/a/,"a",x) + gsub(/t/,"t",x)
} else {
prevx = substr($0,i-1,1)
if (prevx == "A" || prevx == "a" || prevx == "T" || prevx == "t")
rate -= 1
nextx = substr(x,k,1)
if (nextx == "A" || nextx == "a" || nextx == "T" || nextx == "t")
rate += 1
}
if (rate>r*k/100) {
h++
highGC[i] = i
}
printf("index-r:%f%% high-AT:%d \r",i/(length($0)-k+1)*100,h)
i += 1
}
printf("index-r:%f%% high-AT:%d\n\n",i/(length($0)-k+1)*100,h)
for (j in highGC) {
y = highGC[j]
SUB++
printf("sub-r:%f%% \r",SUB/h*100)
x = substr($0, y, k)
gsub (/[AGCT]/,"N",x)
gsub (/[agct]/,"n",x)
$0 = substr($0,1,y-1) x substr($0,y+k)
}
printf("sub-r:%f%%\nsubstituted:%d\n\n",SUB/h*100,SUB)
printf("%s",$0) >> "chr22.masked.fa"
}'
# Method 2
cat chr22.fa | head -n1 > chr22.masked2.fa
cat chr22.fa | tail -n+2 | awk -v k="100" -v r=40 '{
printf("chr22.fa: %d\n",length($0))
i = 1;
h = 0;
while (i<=length($0)-k+1) {
x = substr($0, i, k)
rate = gsub(/[ATX]/,"X",x) + gsub(/[atx]/,"x",x)
if (rate>r/k*100) {
h++
gsub (/[GC]/,"N",x)
gsub (/[gc]/,"n",x)
$0 = substr($0,1,i-1) x substr($0,i+k)
}
printf("index-r:%f%% sub-r:%f%% \r",i/(length($0)-k+1)*100,h/544*100)
i += 1
}
gsub (/X/,"N",$0)
gsub (/x/,"n",$0)
printf("index-r:%f%% sub-r:%f%% \n",i/(length($0)-k+1)*100,h/544*100)
printf("%s",$0) >> "chr22.masked2.fa"
}'
# Method 3
cat chr22.fa | head -n1 > chr22.masked3.fa
cat chr22.fa | tail -n+2 | awk -v k="100" -v r=40 '{
printf("chr22.fa: %d\n",length($0))
i = 1;
h = 0;
while (i <= length($0)-k+1) {
x = substr($0, i, k)
rate = gsub(/A/,"A",x) + gsub(/T/,"T",x) + gsub(/a/,"a",x) + gsub(/t/,"t",x)
if (rate>r/k*100) {
h++
gsub(/[ACGT]/,"N",x)
gsub(/[acgt]/,"n",x)
if (i == 1) {
s = x
} else {
s = substr(s,1,length(s)-k+1) x
}
} else {
if (i == 1) {
s = x
} else {
s = s substr(x,k,1)
}
}
printf("index-r:%f%% sub-r:%f%% \r",i/(length($0)-k+1)*100,h/544*100)
i += 1
}
printf("index-r:%f%% sub-r:%f%% \n\n",i/(length($0)-k+1)*100,h/544*100)
printf("%s",s) >> "chr22.masked3.fa"
}'
The estimated runtime is around 2-3 days...
Is there any faster algorithm for this problem? If not, is there any language that can perform string replacement faster?
More info:
The AWK command consumes ~30% CPU under WSL and Git Bash, but only ~5% on Windows cmd with an OpenSSH client, where the progress rate is similar.
Okay, there's an O(n) solution that involves running a sliding window over your data set. The following algorithm should suffice:
set window to ""
while true:
    if window is "":
        read k characters into window, exit while if less available
        set atCount to number of characters in window matching "AaTt"
    if atCount > 40% of k:
        for each char in window:
            if char uppercase:
                output "N"
            else:
                output "n"
        window = ""
    else:
        if first character of window matches "AaTt":
            decrease atCount
        output and remove first character of window
        read next character into end of window, exit while if none available
        if last character of window matches "AaTt":
            increase atCount
output any characters still in window
What this does is to run a sliding window through your data, at each point testing if the proportion of AaTt characters in that window is more than 40%.
If so, it outputs the desired Nn characters and reloads the next k-sized window.
If it's not over 40%, it outputs and removes the first character of the window and adds the next one to the end, adjusting the count of AaTt characters accordingly.
If, at any point, there aren't enough characters left to satisfy a check (k when loading a full window, or 1 when sliding), it exits the loop.
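Here's what that might look like in awk (my own translation of the pseudocode, untested on real chromosome data; it reads the sequence as one line, and the simple per-character appends favor clarity over speed):
awk -v k=50 -v r=40 '{
    n = length($0); out = ""
    i = 1      # window start
    at = -1    # -1 means the window needs (re)loading
    while (i + k - 1 <= n) {
        if (at < 0) {           # (re)count A/T in a fresh window
            at = 0
            for (j = i; j < i + k; j++)
                if (substr($0, j, 1) ~ /[AaTt]/) at++
        }
        if (at * 100 > r * k) { # over threshold: mask the whole window
            seg = substr($0, i, k)
            gsub(/[A-Z]/, "N", seg)
            gsub(/[a-z]/, "n", seg)
            out = out seg
            i += k
            at = -1             # force a reload at the new position
        } else {                # slide by one, emitting the first char
            c = substr($0, i, 1)
            out = out c
            if (c ~ /[AaTt]/) at--
            if (substr($0, i + k, 1) ~ /[AaTt]/) at++
            i++
        }
    }
    print out substr($0, i)     # flush the final < k characters
}'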
Try some perl:
perl -slpe '
my $len = length;
for (my $i = 0; $i < $len; $i += $k) {
my $substring = substr($_, $i, $k);
my $count = $substring =~ tr/aAtT/aAtT/;
if ($count >= $k * $threshold) {
$substring =~ s/[[:lower:]]/n/g;
$substring =~ s/[[:upper:]]/N/g;
substr($_, $i, $k) = $substring;
}
}
' -- -k=50 -threshold=0.4 file

Can I do a time based progress in awk?

I am currently using awk scripting to censor the console output and I print one dot for each censored line.
I want to update this code to make it avoid printing more than one dot per minute (or something similar). Obviously that if I do not get any progress (streamed new lines), no update is supposed to happen.
Current version of the code is at https://gist.github.com/ssbarnea/f7b72491af524fa364d9fc328cd43f2a
Note: I know that I could print a newline with "mod 10" or similar in order to limit the output, but that approach is not good because the lines are not received at a consistent speed: sometimes I get lots of them, sometimes only one or two. That's why I need a timer-based approach that does something like "print a dot if the last one was printed more than x seconds ago".
With GNU awk for time functions, you can print dots no more frequently than once per minute by simply comparing the time in seconds since the epoch when the current input line is being processed against the time when the previous dot was printed:
awk '
function prtDot() {
currTime = systime()
if ( (currTime - prevTime) > 60 ) {
printf "." | "cat>&2"
prevTime = currTime
}
}
{ print $0; prtDot() }
END { print "" | "cat>&2" }
'
e.g. printing a . every 10 seconds within a stream of numbers:
$ cat tst.awk
function prtDot() {
currTime = systime()
if ( (currTime - prevTime) > 10 ) {
printf "." | "cat>&2"
prevTime = currTime
}
}
{ printf "%s",$0%10 | "cat>&2"; prtDot() }
END { print "" | "cat>&2" }
$ i=0; while (( i < 50 )); do echo $((++i)); sleep 1; done | awk -f tst.awk
1.2345678901.23456789012.3456789012.34567890123.4567890
$ i=0; while (( i < 50 )); do echo $((++i)); sleep 3; done | awk -f tst.awk
1.2345.6789.0123.4567.8901.2345.6789.0123.4567.8901.2345.6789.0
The slight difference between the digits actually printed and those expected is due to the time that other parts of the while loop add to the overall interval between echos, plus other small imprecisions affecting when the shell loop prints numbers and consequently when systime() gets called in awk.

How can sed replace but skip the replaced characters in the pipeline?

I'm looking to escape some character ( and ), their respective escape codes are (40) and (41).
echo 'Hello (world)' | sed 's/(/(40)/g;s/)/(41)/g'
This code fails with Hello (40(41)world(41) because it will also process the output from the first replacement. Is there any way I can skip the replacement characters or do conditional branches here. I don't want to use a temporary (as the input sequence could contain anything).
All you need is:
$ echo 'Hello (world)' | sed 's/(/(40\n/g; s/)/(41)/g; s/\n/)/g'
Hello (40)world(41)
The above is safe because \n can't be present in the input since sed reads one line at a time. With some seds you might need to use a backslash followed by a literal newline or $'\n' instead of just \n.
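For instance, with a sed that doesn't understand \n in the replacement text, the same trick would be spelled with a backslash followed by a literal newline (a sketch; the exact escaping varies between sed implementations):
$ echo 'Hello (world)' | sed 's/(/(40\
/g; s/)/(41)/g; s/\n/)/g'
Hello (40)world(41)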
Given the answer you posted, though, this may be what you really want (uses GNU awk for ord(), multi-char RS, and RT):
$ cat tst.awk
@load "ordchr"
BEGIN { RS = "[][(){}]"; ORS="" }
{ print $0 ( RT=="" ? "" : "(" ord(RT) ")" ) }
$ echo 'Hello (world) foo [bar] other {stuff} etc.' | awk -f tst.awk
Hello (40)world(41) foo (91)bar(93) other (123)stuff(125) etc.
If you have an older gawk that doesn't support @load then get a new one, but if that's impossible for some reason then just create an array of the values, e.g.:
$ cat tst.awk
BEGIN {
RS = "[][(){}]"
ORS = ""
for (i=0;i<=255;i++) {
char = sprintf("%c",i)
map[char] = "(" i ")"
}
}
{ print $0 ( RT=="" ? "" : map[RT] ) }
$ echo 'Hello (world) foo [bar] other {stuff} etc.' | awk -f tst.awk
Hello (40)world(41) foo (91)bar(93) other (123)stuff(125) etc.
EDIT: timing data
Given a file that has these 10 lines:
$ head -10 file1m
When (chapman) billies leave [the] street, And drouthy {neibors}, neibors, meet;
As market days are wearing late, And folk begin to [tak] the gate,
While (we) sit bousing {at} the nappy, An' getting [fou] and unco happy,
We think na on the [lang] Scots (miles), The mosses, {waters}, slaps and stiles,
That lie between us and our hame, Where sits our sulky, sullen dame,
Gathering her [brows] like gathering storm, (Nursing) her wrath to keep it warm.
This truth fand honest Tam o' Shanter,
As he frae Ayr ae night did canter:
(Auld Ayr, wham ne'er a town surpasses,
For honest men and bonie lasses).
repeating to a total of 1 million lines, 10.5 million characters, 60.4 million bytes:
$ wc file1m
1000000 10500000 60400000 file1m
the 3rd-run timing stats for the sed script and both awk scripts above are:
$ time sed 's/(/(40\n/g; s/)/(41)/g; s/\n/)/g; s/\[/(91)/g; s/\]/(93)/g; s/{/(123)/g; s/}/(125)/g;' file1m > sed.out
real 0m7.488s
user 0m7.378s
sys 0m0.093s
$ cat function.awk
@load "ordchr"
BEGIN { RS = "[][(){}]"; ORS="" }
{ print $0 ( RT=="" ? "" : "(" ord(RT) ")" ) }
$ time awk -f function.awk file1m > awk_function.out
real 0m7.426s
user 0m7.269s
sys 0m0.155s
$ cat array.awk
BEGIN {
RS = "[][(){}]"
ORS = ""
for (i=0;i<=255;i++) {
char = sprintf("%c",i)
map[char] = "(" i ")"
}
}
{ print $0 ( RT=="" ? "" : map[RT] ) }
$ time awk -f array.awk file1m > awk_array.out
real 0m4.758s
user 0m4.648s
sys 0m0.092s
I verified that all 3 scripts produce the same, successfully modified output:
$ head -10 sed.out
When (40)chapman(41) billies leave (91)the(93) street, And drouthy (123)neibors(125), neibors, meet;
As market days are wearing late, And folk begin to (91)tak(93) the gate,
While (40)we(41) sit bousing (123)at(125) the nappy, An' getting (91)fou(93) and unco happy,
We think na on the (91)lang(93) Scots (40)miles(41), The mosses, (123)waters(125), slaps and stiles,
That lie between us and our hame, Where sits our sulky, sullen dame,
Gathering her (91)brows(93) like gathering storm, (40)Nursing(41) her wrath to keep it warm.
This truth fand honest Tam o' Shanter,
As he frae Ayr ae night did canter:
(40)Auld Ayr, wham ne'er a town surpasses,
For honest men and bonie lasses(41).
$ wc sed.out
1000000 10500000 68800000 sed.out
$ diff sed.out awk_function.out
$ diff sed.out awk_array.out
$
The problem is solved by creating an ord function in awk. It doesn't appear sed has this functionality.
#! /bin/sh
awk '
BEGIN { _ord_init() }
function _ord_init(low, high, i, t) {
low = sprintf("%c", 7) # BEL is ascii 7
if (low == "\a") {
low = 0;
high = 127;
} else if (sprintf("%c", 128 + 7) == "\a") {
low = 128;
high = 255;
} else {
low = 0;
high = 255;
}
for (i = low; i <= high; i++) {
t = sprintf("%c", i);
_ord_[t] = i;
}
}
function ord(str, c) {
c = substr(str, 1, 1)
return _ord_[c]
}
// {
split($0, array, "\\[|\\]|\\(|\\)|\\{|\\}", separators);
len = length(array);
seplen = length(separators);
for (i = 1; i < len; ++i) {
printf "%s(%s)", array[i], ord(separators[i]);
}
printf "%s", array[len];
}
'
You can do this in perl, which supports one-liners and look-behind in regular expressions. Simply require the close-paren not be part of an existing escape:
$ echo 'Hello (world)' | perl -pe 's/\(/(40)/g; s/(?<!\(40)\)/(41)/g'
Hello (40)world(41)
It's tricky in sed but easy in any language with associative arrays.
perl -pe 'BEGIN { %h = ("(" => "(40)", ")" => "(41)" );
$r = join("|", map { quotemeta } keys %h); }
s/($r)/$h{$1}/g'

array over non-existing indices in awk

Sorry for the verbose question, it boils down to a very simple problem.
Assume there are n text files each containing one column of strings (denominating groups) and one of integers (denominating the values of instances within these groups):
# filename xxyz.log
a 5
a 6
b 10
b 15
c 101
c 100
#filename xyzz.log
a 3
a 5
c 116
c 128
Note that while the length of both columns within any given file is always identical it differs between files. Furthermore, not all files contain the same range of groups (the first one contains groups a, b, c, while the second one only contains groups a and c). In awk one could calculate the average of column 2 for each string in column 1 within each file separately and output the results with the following code:
NAMES=$(ls|grep .log|awk -F'.' '{print $1}');
for q in $NAMES;
do
gawk -F' ' -v y=$q 'BEGIN {print "param", y}
{sum1[$1] += $2; N[$1]++}
END {for (key in sum1) {
avg1 = sum1[key] / N[key];
printf "%s %f\n", key, avg1;
} }' $q.log | sort > $q.mean;
done;
Howerver, for the abovementioned reasons, the length of the resulting .mean files differs between files. For each .log file I'd like to output a .mean file listing the entire range of groups (a-d) in the first column and the corresponding mean value or empty spaces in the second column depending on whether this category is present in the .log file. I've tried the following code (given without $NAMES for brevity):
awk 'BEGIN{arr[a]="a"; arr[b]="b"; arr[c]="c"; arr[d]="d"}
{sum[$1] += $2; N[$1]++}
END {for (i in arr) {
if (i in sum) {
avg = sum[i] / N[i];
printf "%s %f\n" i, avg;}
else {
printf "%s %s\n" i, "";}
}}' xxyz.log > xxyz.mean;
but it returns the following error:
awk: (FILENAME=myfile FNR=7) fatal: not enough arguments to satisfy format string
`%s %s
'
^ ran out for this one
Any suggestions would be highly appreciated.
Will you ever have explicit zeroes or negative numbers in the log files? I'm going to assume not.
The first line of your second script doesn't do what you wanted:
awk 'BEGIN{arr[a]="a"; arr[b]="b"; arr[c]="c"; arr[d]="d"}
This assigns "a" to arr[""] (because a is an uninitialized variable, and as a subscript it yields the empty string), then "b" to the same element (because b is equally uninitialized), then "c", then "d". Clearly, not what you had in mind.
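You can see the collapse directly (my own quick demonstration; all four assignments land on the single key ""):
$ awk 'BEGIN { arr[a]="a"; arr[b]="b"; arr[c]="c"; arr[d]="d"
               for (k in arr) print "key=\"" k "\", value=" arr[k] }'
key="", value=d
This (untested) code should do the job you need as long as you know that there are just the four groups. If you don't know the groups a priori, you need a more complex program (it can be done, but it is harder).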
awk 'BEGIN { sum["a"] = 0; sum["b"] = 0; sum["c"] = 0; sum["d"] = 0 }
{ sum[$1] += $2; N[$1]++ }
END { for (i in sum) {
if (N[i] == 0) N[i] = 1 # Divide by zero protection
avg = sum[i] / N[i];
printf "%s %f\n", i, avg;
}
}' xxyz.log > xxyz.mean;
This will print a zero average for the missing groups. If you prefer, you can do:
awk 'BEGIN { sum["a"] = 0; sum["b"] = 0; sum["c"] = 0; sum["d"] = 0 }
{ sum[$1] += $2; N[$1]++ }
END { for (i in sum) {
if (N[i] == 0)
printf("%s\n", i);
else {
avg = sum[i] / N[i];
printf "%s %f\n", i, avg;
}
}
}' xxyz.log > xxyz.mean;
For each .log file I'd like to output a .mean file listing the entire
range of groups (a-d) in the first column and the corresponding mean
value or empty spaces in the second column depending on whether this
category is present in the .log file.
Not purely an awk solution, but you can get all the groups with this.
awk '{print $1}' *.log | sort -u > groups
After you calculate the means, you can then join the groups file. Let's say the means for your second input file look like this temporary, intermediate file. (I called it xyzz.tmp.)
a 4
c 122
Join the groups, preserving all the values from the groups file.
$ join -a1 groups xyzz.tmp > xyzz.mean
$ cat xyzz.mean
a 4
b
c 122
Here's my take on the problem. Run like:
./script.sh
Contents of script.sh:
array=($(awk '!a[$1]++ { print $1 }' *.log))
readarray -t sorted < <(for i in "${array[@]}"; do echo "$i"; done | sort)
for i in *.log; do
for j in "${sorted[@]}"; do
awk -v var=$j '
{
sum[$1]+=$2
cnt[$1]++
}
END {
print var, (var in cnt ? sum[var]/cnt[var] : "")
}
' "$i" >> "${i/.log/.main}"
done
done
Results of grep . *.main:
xxyz.main:a 5.5
xxyz.main:b 12.5
xxyz.main:c 100.5
xyzz.main:a 4
xyzz.main:b
xyzz.main:c 122
Here is a pure awk answer:
find . -maxdepth 1 -name "*.log" -print0 |
xargs -0 awk '{SUBSEP=" ";sum[FILENAME,$1]+=$2;cnt[FILENAME,$1]+=1;next}
END{for(i in sum)print i, sum[i], cnt[i], sum[i]/cnt[i]}'
Easy enough to push this into a file --