Busybox awk: How to treat each character in a string as an integer to perform bitwise operations?

I want to change the SSID (Wi-Fi network name) dynamically in OpenWRT via a script that grabs information from the internet.
Because the information grabbed from the internet may contain multi-byte characters, it can easily be truncated to an invalid UTF-8 byte sequence, so I want to use awk (busybox) to fix it. However, when I try to use the bitwise function and() on a string and an integer, the result is always 0.
awk 'BEGIN{v="a"; print and(v,0xC0)}'
How can I treat a character in a string as an integer in awk, as we can in C/C++? char p[]="abc"; printf ("%d",*(p+1) & 0xC0);

You can write your own ord function like this (heavily borrowed from the GNU Awk User's Guide):
#!/bin/bash
awk '
BEGIN { _ord_init()
printf("ord(a) = %d\n", ord("a"))
}
function _ord_init( low, high, i, t)
{
low = sprintf("%c", 7) # BEL is ascii 7
if (low == "\a") { # regular ascii
low = 0
high = 127
} else if (sprintf("%c", 128 + 7) == "\a") {
# ascii, mark parity
low = 128
high = 255
} else { # ebcdic(!)
low = 0
high = 255
}
for (i = low; i <= high; i++) {
t = sprintf("%c", i)
_ord_[t] = i
}
}
function ord(str,c)
{
# only first character is of interest
c = substr(str, 1, 1)
return _ord_[c]
}'
Output
ord(a) = 97
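Building on the lookup-table idea, here is a minimal sketch of the UTF-8 check the question is after. The reason and(v,0xC0) gives 0 is that awk converts the string "a" to the number 0 before doing the bitwise operation. The sketch assumes awk sees the input as raw bytes (hence LC_ALL=C), fills the table for bytes 1-255 rather than only ASCII, and replaces the bit test (b & 0xC0) == 0x80 with the arithmetic equivalent int(b/64) == 2, since not every awk build provides and(). The name utf8_trim is made up for illustration, and it only handles a sequence cut off at the end of a line, not other kinds of invalid input.

```shell
# utf8_trim (hypothetical helper): drop an incomplete multi-byte UTF-8
# sequence from the end of each input line.
utf8_trim() {
  LC_ALL=C awk '
  BEGIN {
    # byte -> number table; LC_ALL=C makes %c emit single bytes
    for (i = 1; i <= 255; i++) ord[sprintf("%c", i)] = i
  }
  {
    s = $0
    i = length(s)
    cont = 0
    # step back over continuation bytes 10xxxxxx, i.e. (b & 0xC0) == 0x80
    while (i > 0 && int(ord[substr(s, i, 1)] / 64) == 2) { i--; cont++ }
    if (i > 0) {
      b = ord[substr(s, i, 1)]
      need = 0                      # continuation bytes the lead byte expects
      if (b >= 240) need = 3        # 11110xxx
      else if (b >= 224) need = 2   # 1110xxxx
      else if (b >= 192) need = 1   # 110xxxxx
      if (cont < need) s = substr(s, 1, i - 1)
    }
    print s
  }'
}

printf 'caf\303\n' | utf8_trim      # the lone lead byte \303 is dropped
```

A complete sequence such as \303\251 is left alone, because the lead byte finds the one continuation byte it expects.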

I don't know if this is what you mean, since you didn't provide sample input and expected output, but take a look at this with GNU awk and maybe it'll help:
$ gawk -lordchr 'BEGIN{v="a"; print v " -> " ord(v) " -> " chr(ord(v))}'
a -> 97 -> a

Related

How to generate string with a char repeated for n times?

I can use the following code to generate a string.
$ awk -e 'BEGIN { for(i=1;i<=10;++i) s = s "x"; print s }'
xxxxxxxxxx
But its complexity is super-linear wrt the string length.
$ time awk -e 'BEGIN { for(i=1;i<=10000000;++i) s = s "x" }'
real 0m0.868s
user 0m0.857s
sys 0m0.008s
$ time awk -e 'BEGIN { for(i=1;i<=100000000;++i) s = s "x" }'
real 0m9.886s
user 0m9.801s
sys 0m0.065s
$ time awk -e 'BEGIN { for(i=1;i<=1000000000;++i) s = s "x" }'
real 1m46.074s
user 1m45.171s
sys 0m0.760s
Is there a better way to repeat a char n times and assign the result to a variable?
Use sprintf to create a string of spaces, then use gsub to replace each space with an x:
$ time awk 'BEGIN {s = sprintf("%*s", 100000000, ""); gsub(".", "x", s)}'
real 0m1.744s
user 0m1.645s
sys 0m0.098s
This can be wrapped in an awk function:
function mkStr(c, n, s) {
s = sprintf("%*s", n, "");
gsub(".", c, s);
return s;
}
(s is a parameter simply to scope the variable to the function; it needs no argument, and indeed, any argument passed will be ignored.)
Update: there appears to be a significant difference in performance depending on which version of awk you are using. The above test used 20070501, the BSD(?) awk that ships with macOS. gawk-5.1.0 takes significantly longer.
I don't know what accounts for the difference; perhaps there is a solution that is fast in both versions.
Update 2: Ed Morton (in the comments) has verified that gsub is responsible for the slow running time in gawk, could not find a workaround, and has filed a bug report with the maintainers.
function loop(n){
for(i=1;i<=n;i++) s = s "x";
return s;
}
function repl(n){
s = sprintf("%*s", n, "");
gsub(/ /, "x", s);
return s;
}
function recStack(n, h){
switch( n ){
case 0:
return "";
default:
if( n % 2 == 1 ){
h = recStack( int((n-1)/2) )
return h h "x";
} else {
h = recStack( int(n/2) )
return h h;
}
}
}
function recStackIf(n, h){
if( n == 0 ) return "";
if( n % 2 == 1 ){
h = recStackIf( int((n-1)/2) ); # create first half
return h h "x"; # duplicate + one "x"
} else {
h = recStackIf( int(n/2) ); # create first half
return h h; # duplicate
}
}
function recArray(n, h, n2){
if( n in a ) return a[n];
switch( n ){
case 0:
return a[0] = "";
default:
if( n % 2 == 1 ){
n2 = int((n-1)/2);
h = recArray( n2 );
return a[n] = h h "x";
} else {
n2 = int(n/2);
h = recArray( n2 );
return a[n] = h h;
}
}
}
function recArrayIf(n, h, n2){
if( n in a ) return a[n];
if( n == 0 ) return a[0] = "";
if( n % 2 == 1 ){
n2 = int((n-1)/2);
h = recArrayIf( n2 );
return a[n] = h h "x";
} else {
n2 = int(n/2);
h = recArrayIf( n2 );
return a[n] = h h;
}
}
function concat(n){
exponent = log(n)/log(2)
m = int(exponent) # floor
m += (m < exponent ? 1 : 0) # ceiling
s = "x"
for (i=1; i<=m; i++) {
s = s s
}
s = substr(s,1,n)
return s
}
BEGIN {
switch( F ){
case "recStack":
xxx = recStack( 100000000 );
break;
case "recStackIf":
xxx = recStackIf( 100000000 );
break;
case "recArray":
xxx = recArray( 100000000 );
break;
case "recArrayIf":
xxx = recArrayIf( 100000000 );
break;
case "loop":
xxx = loop( 100000000 );
break;
case "repl":
xxx = repl( 100000000 );
break;
case "concat":
xxx = concat( 100000000 );
break;
}
print length(xxx);
## xloop = loop(100000000 );
## if( xxx == xloop ) print "Match";
}
Times are:
# loop : real 0m5,405s, user 0m5,243s, sys 0m0,160s
# repl : real 0m7,670s, user 0m7,506s, sys 0m0,164s
# recArray: real 0m0,302s, user 0m0,141s, sys 0m0,161s
# recArrayIf: real 0m0,309s, user 0m0,168s, sys 0m0,141s
# recStack: real 0m0,316s, user 0m0,124s, sys 0m0,192s
# recStackIf: real 0m0,305s, user 0m0,152s, sys 0m0,152s
# concat: real 0m0,664s, user 0m0,300s, sys 0m0,364s
There's not much difference between the 5 versions of binary decomposition: a bunch of heap memory is used in all cases. Having the global array hang around until the end of all times isn't good and therefore I'd prefer either stack version.
wlaun@terra:/tmp$ gawk -V
GNU Awk 5.0.1, API: 2.0 (GNU MPFR 4.0.2, GNU MP 6.2.0)
wlaun@terra:/tmp$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Note that the above timing has been done with a statement printing the resulting string's length, which adds about 0.2 s to each version. Also, /usr/bin/time isn't reliable. Here are the relative "real" values from time without the print length(xxx):
# loop: 0m5,248s
# repl: 0m7,705s
# recStack: 0m0,103s
# recStackIf: 0m0,097s
# recArray: 0m0,103s
# recArrayIf: 0m0,099s
# concat: 0m0,455s
Added on request of Ed Morton:
Why is any of the recursive functions faster than any of the linear functions that iterate over O(N) elements? (The "O(N)" is the "big oh" symbol and is used to indicate a value that is N, possibly multiplied and/or incremented by some constant. A circle's circumference is O(r), its area is O(r²).)
The answer is simple: By dividing N by 2, we get two strings of length O(N/2). This provides the possibility of re-using the result for the first half (no matter how we obtain it) for the second half! Thus, we'll get the second half of the result for free (except for the string copy operation, which is basically a machine instruction on most popular architectures). There is no reason why this great idea should not be applied for creating the first half as well, which means that we get three quarters of the result for free (except - see above). A little overhead results from the single "x" we have to throw in to cater for odd subdivisions of N.
There are many other recursive algorithms along the idea of halving and dealing with both sections individually, the most famous of them are Binary Search and Quicksort.
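To make the saving concrete, here is a small sketch that builds the string by halving and counts how many recursive calls it took. It uses only POSIX awk (no switch), and the names rep_calls, rep and calls are invented for this illustration.

```shell
# rep_calls (hypothetical helper): build n copies of "x" by halving and
# report the string length plus the number of recursive calls made.
rep_calls() {
  awk -v n="$1" '
  function rep(n,    h) {
    calls++
    if (n == 0) return ""
    h = rep(int(n / 2))                  # build one half ...
    return (n % 2) ? (h h "x") : (h h)   # ... and reuse it for the other
  }
  BEGIN { s = rep(n); print length(s), calls }'
}

rep_calls 1000    # prints: 1000 11
```

The call count grows with log2(n), not with n, which is why the recursive versions need so few concatenations compared with the character-by-character loop.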
Here's a fast solution using any POSIX awk (tested on an average 8G RAM laptop running cygwin with GNU awk 5.1.0):
time awk -v n=100000000 'BEGIN {
exponent = log(n)/log(2)
m = int(exponent) # floor
m += (m < exponent ? 1 : 0) # ceiling
s = "x"
for (i=1; i<=m; i++) {
s = s s
}
s = substr(s,1,n)
}'
real 0m0.665s
user 0m0.328s
sys 0m0.343s
The above just appends a copy of the string to itself until it's at least as big as the target string length and finally truncates it to exactly the desired length. The only somewhat tricky part is calculating how many times to append s to itself and that's just a case of solving 2^m = n for m given you already know n (100000000 in this case), see https://www.calculatorsoup.com/calculators/algebra/exponentsolve.php.
Obviously you could make the loop while ( length(s) < n ) instead of calculating m and that'd make the script briefer but would slow it down a bit (but it'd still be pretty fast):
$ time awk -v n=100000000 'BEGIN{s="x"; while (length(s) < n) s=s s; s=substr(s,1,n)}'
real 0m1.072s
user 0m0.562s
sys 0m0.483s
@JamesBrown had a better idea than calling length() on each iteration that also avoids having to calculate m while being about as fast:
$ time awk -v n=100000000 'BEGIN{s="x"; for (i=1; i<n; i*=2) s=s s; s=substr(s,1,n)}'
real 0m0.710s
user 0m0.281s
sys 0m0.390s
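The same doubling loop can be wrapped so it repeats an arbitrary string rather than a hard-coded "x". This is only a sketch: repeat_str and its two arguments are names I made up, and because the string is passed with -v, awk will interpret backslash escapes in it.

```shell
# repeat_str (hypothetical helper): print STRING repeated COUNT times.
repeat_str() {
  awk -v c="$1" -v n="$2" 'BEGIN {
    s = ""
    if (n > 0) {
      s = c
      for (i = 1; i < n; i *= 2) s = s s     # number of copies doubles each pass
      s = substr(s, 1, n * length(c))        # cut back to exactly n copies
    }
    print s
  }'
}

repeat_str x 10     # prints xxxxxxxxxx
```

The substr() call at the end is what makes counts that are not a power of two come out right.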
Originally I had the following, thinking that doubling strings of "x"s would be a faster approach than doubling individual "x"s on each iteration of the loop, but it was a bit slower:
$ time awk -v n=100000000 '
BEGIN {
d = 1000000
m = 7
s = sprintf("%*s",d,"")
gsub(/ /,"x",s)
for (i=1; i<=m; i++) {
s = s s
}
s = substr(s,1,n)
}
'
real 0m1.030s
user 0m0.625s
sys 0m0.375s
The idea in that 2nd script was to generate a string long enough to be meaningful but short enough that gsub() can convert all blanks to "x"s quickly (which I've found to be 1000000 chars by trial and error), then just repeat the above process with fewer iterations.
I opened a bug report with the gawk providers about gsub() being slow for this case, see https://lists.gnu.org/archive/html/bug-gawk/2021-07/msg00030.html if you're interested. It looks like the outcome will just be that gsub() is a heavy tool, so it's not surprising that it takes a long time for 100 million regexp matches, but you can speed it up considerably by setting LC_ALL=C before it runs so gawk doesn't have to worry about multibyte locales.
If you are okay with perl:
# assigning to a variable
$ perl -e '$s = "x" x 10'
# printing the string
$ perl -e 'print "x" x 10'
xxxxxxxxxx
Note that there's no newline for the print in the above example; use perl -le if you want one.
Here's a timing comparison:
$ time perl -e '$s = "x" x 100000000'
real 0m0.071s
# script.awk is the first code from Ed Morton's answer
$ time mawk -v n=100000000 -f script.awk
real 0m0.136s
$ time gawk -v n=100000000 -f script.awk
real 0m0.429s
Is there a better way to repeat a char n times and assign the result to a variable?
I propose the following solution, limited to the case where n is equal to 2^x and x is an integer greater than or equal to 1. For example, to get x repeated 32 (i.e. 2^5) times:
awk 'BEGIN{s="x";for(i=1;i<=5;i+=1)s=s s;print s}' emptyfile.txt
output
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
(tested in gawk 4.2.1)
extreme test case benchmark - comparing python3's built-in repeater running on compiled C-code, versus user-level scripting of a different algorithm, generating 152.58 bn zeros.
Starting point :: length = 4 < "0000" > | time = 0.000000 secs
For-Loop # 1 | Lgth-Gain Ratio 6.0x | #-0's : 24. | 0.000036 secs (cum.)
For-Loop # 2 | Lgth-Gain Ratio 26.0x | #-0's : 624. | 0.000050 secs (cum.)
For-Loop # 3 | Lgth-Gain Ratio 626.0x | #-0's : 390,624. | 0.000175 secs (cum.)
For-Loop # 4 | Lgth-Gain Ratio 390626.0x | #-0's : 152,587,890,624. | 21.485092 secs (cum.)
( mawk2 ; )
4.38s user 13.69s system 81% cpu 22.226 total
( python3 -c 'print(len("0"*(5**4**2-1)))'; )
4.34s user 17.32s system 71% cpu 30.291 total
152587890624
% ( time ( mawk2 '
function comma(_,__,___,____,_____) {
if(length(sprintf("%+.f",_))<=(\
__*=__+=__=__~__)) {
return \
index(_,".") ?_:_"."
}
____=CONVFMT;CONVFMT="%.1000g";___=_
_____="[0-9]"
gsub("^|$",_____,_____)
sub("^[^.]+","",___)
sub("("(_____)")+([.]|$)",",&",_)
gsub(_____,"&,",_)
sub("([,]*[.].*)?$",___,_)
sub("[,]+$",".",_)
sub("^[ \t]*[+-]?[,]+",\
substr("-",__==__,_~"^[ \t]*[-]"),_)
CONVFMT=____;
return _ }
function printProgress(_____) {
print sprintf(" For-Loop "\
"# %d | Lgth-Gain Ratio %8.1fx | "\
"#-0\47s : %16s | %9s secs (cum.)",
++_-_____+!--_,
(___^-!!___)*(___=length(__)),
comma(___),timer())
return fflush() }
function timer(_,__,___) {
return sprintf("%.*s%.*s%*.*f",
!(_+=_+=(_=(_=__=___="")~_)+_),
srand(),
!(__+=__=--_),___=int((__^=++_)*\
+substr(srand(),++_)),
__%(++_-+-++_),_+=_=(++_+--_)%--_,
!_TIME0_?(_TIME0_=___)*!__\
:(___-_TIME0_)/__)
} BEGIN {
____=__+=__*=___=__^=(__+=++__)^__;
gsub("",__,__)
gsub(".",!__,__)
print ORS,"Starting point :: length = ",
___=length(__=substr(__,__~__,_=4)),
" < \42"(__)"\42 > | time = " ,
timer(), "secs", ORS;
for(_____=_;_____;_____--) {
gsub("",__,__)
printProgress(_____) } }' )) ;`
sleep 2;
( time (python3 -c 'print(len("0"*(5**4**2-1)))' ))|lgp3;
UPDATE 1 : generic version of it
depending on which variant of awk, it automatically switches to BAU binary doubler, and for the squarer, the speed gains are so immense it's actually faster to square, then substring than trying to be pin-drop accurate at each level, which inherently slows it down. _mN is the binary double, genZeros is the exp-sqr one
mawk2 '
function _mN(__,____,_,___,_____,______){
return \
(_=_~_)<+__?(_____=_mN((__-(\
______=__%(_+=_)))/_,____,___))___\
(_____)(______?(___)____:""):__<_?"":____
}
function genZeros(_,__,___,____) {
if (+_<=+((__=___=____="")<__)) {
return \
(-_<+_) ? +__:___;
} else if (FLG1*(_~"...")) {
return _mN(_,_<_)
}
____=__+=__*=___=__^=(__+=++__)^__;
gsub("",__,__)-gsub(".",!__,__)
if (_<__){ return \
substr((__)__,_~_,_)
}; ___=(___=__)___;__="";
__+=++__+__;
_-=__*(__=int((_/__)^(--__)^-!!_))*__;
___=(__<length(___))\
? substr(___,__~__,__) \
: sprintf("%"(+"")"*.f",__,!__)
gsub(".",(___)(___)___,___)
return \
!_?___:FLG2?\
sprintf("%"(_<_)"*.f",+_,_)___:_mN(+_,_<_)
} BEGIN { CONVFMT=\
OFMT="%.20g"
FLG1=(+sprintf("%o",-1)<"0x1")
FLG2=!(sprintf("%d",3E10)%2)
} {
print length(genZeros(3^18)),3^18 }'
387420489 387420489
=================================
Doubling each loop is too slow. To get even better performance, you need exponential squaring. how about 88 milli-seconds :
echo; ( time (gawk -v n=100000000 -Sbe 'BEGIN {
CONVFMT=OFMT="%.20g"
__="x";____="."
_=10^((___=3)*___)
gsub(____,__,_)
while(___--){
gsub(____,_,_)
}
print length(_) }' ) ) | lgp3
( gawk -v n=100000000 -Sbe ; )
0.07s user 0.01s system 95% cpu 0.088 total
100000000
And that's only for gawk. For mawk-2, it's an unearthly 32 msecs
echo; ( time ( mawk2 -v n=100000000 'BEGIN { CONVFMT=OFMT="%.20g"
__="x";____=".";
_=10^((___=3)*___)
gsub(____,__,_)
while(___--) {
gsub(____,_,_) }
print length(_) }' ) ) | lgp3
( mawk2 -v n=100000000 ; )
0.01s user 0.02s system 84% cpu 0.032 total
100000000
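Stripped of the underscore variable names, the squaring trick in the two scripts above seems to come down to one statement: gsub(/./, s, s) replaces every character of s with a copy of the whole of s, so a string of length L becomes one of length L squared. A readable sketch follows. The names square_len, start and rounds are made up. It assumes, as the scripts above do, that awk evaluates the replacement string before modifying the target, and that the repeated character is not & or a backslash, which are special in a gsub replacement.

```shell
# square_len (hypothetical helper): start with START copies of "x", square
# the length ROUNDS times, and print the final length.
square_len() {
  awk -v start="$1" -v rounds="$2" 'BEGIN {
    s = ""
    for (i = 0; i < start; i++) s = s "x"
    for (r = 0; r < rounds; r++)
      gsub(/./, s, s)      # every character becomes a full copy of s
    print length(s)
  }'
}

square_len 3 2    # 3 -> 9 -> 81, prints 81
```

With start=10 and rounds=3 the lengths go 10, 100, 10^4, 10^8, which matches the 100000000 printed above.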
404 msecs is all it takes to go from just 1 single copy of "x", to nearly 4.3 billion copies of it :
( time ( mawk2 '
BEGIN {
___=(_+=_+=++_)^(_*_)
__=length(_=(_="x")(_)_);++___
while(__*__<___) {
gsub("",_,_)
__*=++__
print --__
}
print ORS,length(_),ORS }' ) )
15
255
65535
4294967295
4294967295
0.09s user 0.25s system 84% cpu 0.404 total
UPDATE :
benchmarking generating 25.7 billion zeros — at this size level, python3's built-in repeater is being left in the dust
echo; (fg && fg && fg ) 2>/dev/null;
echo;
( time ( mawk2 '
BEGIN { _=_<_
__= 4
while(__--) {
gsub("",(_)(_)_,_)
}
print _ }'
) \
| pvE9 | LC_ALL=C gwc -lc ) | ggXy3 | ecp
sleep 1.5
echo
( time ( python3 -c 'print("0"*25710762175)'
) \
| pvE9 | LC_ALL=C gwc -lc ) | ggXy3 | ecp ; echo
out9: 23.9GiB 0:00:08 [2.85GiB/s] [2.85GiB/s] [ <=> ]
( mawk2 'BEGIN { _=_<_;__=4; while(__--){ gsub("",(_)(_)_,_) }; print _ }'; ) 0.79s user 4.72s system 63% cpu 8.729 total
pvE 0.1 out9 0.54s user 2.93s system 41% cpu 8.435 total
LC_ALL=C gwc -lc 2.03s user 1.46s system 41% cpu 8.434 total
1 25710762176
out9: 23.9GiB 0:00:11 [2.17GiB/s] [2.17GiB/s] [ <=> ]
( python3 -c 'print("0"*25710762175)'; ) 1.24s user 6.84s system 72% cpu 11.076 total
pvE 0.1 out9 0.56s user 2.80s system 30% cpu 11.075 total
LC_ALL=C gwc -lc 2.03s user 1.46s system 31% cpu 11.074 total
1 25710762176
UPDATE 2 : (partially unrelated, but to illustrate my thought process)
Once you realize there's a predictable and deterministic pattern to how the pop-count string appears and repeats, you can create the full pop-count string for all 256-bytes sequentially using only 4-cycles of the do-while loop,
without using hardware instructions, or having to go through every byte one-by-one, either in decimal integer bit-string form, in order to create a pop-count string. The one thing that's crucial though, is setting a CONVFMT value large enough to avoid scientific notation truncation, since the essence of the function is performing multiple 16-digit integer adds (getting near 2^53 but never exceeding) ::
<<<'' mawk '
function initPopCountStr(_,__,
___,____,_____,______) {
# No.# of set:bits per:byte
#
__=_=(_<_)(______=___=_~_)
______+=++______;
do {_=\
(__=_)(__+(___=(___)___))
} while (--______)
return \
(((_=(_)(____=(__+___)(___+__+___))\
____)(_____=((__+(___+=______=___)))\
(___+=__+______)))____\
(_____)_____)(___)(______+___)
} BEGIN {
CONVFMT="%.20g"
} {
print initPopCountStr() }' \
\
| gsed -zE 's/(.{16})(.{16})/\1:\2\n/g' |nonEmpty|gcat -n|lgp3 4
1 0112122312232334:1223233423343445
2 1223233423343445:2334344534454556
3 1223233423343445:2334344534454556
4 2334344534454556:3445455645565667
5 1223233423343445:2334344534454556
6 2334344534454556:3445455645565667
7 2334344534454556:3445455645565667
8 3445455645565667:4556566756676778
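For readers who want the pattern without the golfed arithmetic, here is a plainer sketch that produces the same 256-digit string. It is not the 4-cycle method above but a simpler variant: the table for values 0..2^(k+1)-1 is the table for 0..2^k-1 followed by a copy with every digit increased by one, repeated for 8 bits. The name popcount_table is invented for illustration.

```shell
# popcount_table (hypothetical helper): print 256 digits, where digit i
# (counting from 0) is the number of set bits in byte value i.
popcount_table() {
  awk 'BEGIN {
    s = "0"                                # popcount of 0
    for (k = 0; k < 8; k++) {              # each pass doubles the table
      t = ""
      n = length(s)
      for (i = 1; i <= n; i++)
        t = t "" (substr(s, i, 1) + 1)     # same values with one more bit set
      s = s t
    }
    print s
  }'
}

popcount_table | cut -c1-32    # prints 01121223122323341223233423343445
```

Since no byte has more than 8 set bits, every entry stays a single digit, so position i+1 in the string is the pop-count of byte value i.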
Or if you prefer no loops at all, and have everything folded in into one fell swoop :
function initPopCountStr(_,__,
___,____,_____,______) {
# No.# of set:bits per:byte
#
return \
\
(((_=(_=(__=(_=(__=(_=(__=(_=(__=_=(_<_)(______=___=_~_))\
(__+(___=(___)___)) ))(__+(___=(___)___))))\
(__+(___=(___)___))))\
(__+(___=(___)___)))(____=(__+___)(___+__+___))____)\
(_____=((__+(___+=______=___)))(___+=__+______)))\
(____)(_____)_____)(___)(______+___)
}
even the ultra lazy method is only 0.791 secs :
out9: 95.4MiB 0:00:00 [ 129MiB/s] [ 129MiB/s] [<=> ]
( mawk2 -F= 'NF=+$_+(OFS=$_=$NF)' ORS= <<< '100000000=x'; )
0.59s user 0.16s system 95% cpu 0.791 total
And the fast method is 16 milli seconds
fg;fg; ( time ( <<< '100000000=x' mawk2 -F= '{
_ = sprintf("%*s",__=int((____=+$(_=\
+(___="."))) ^ +(++_/(++_*_))),"")
gsub(___,$NF,_)
do { gsub(___,_,_)
} while((__*=__)<____)
print length(_),substr(_,1,31) }' ) ) | gcat -n | lgp3
fg: no current job
fg: no current job
( mawk2 -F= <<< '100000000=x'; )
0.00s user 0.01s system 87% cpu 0.016 total
1 100000000 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
UPDATE : making 3,486,784,401 copies of "x" in 0.323 secs
fg;fg; ( time ( <<< '43046721=x' mawk2 -F= '
{
________=substr("",+srand()) +substr(srand(),7)
_=sprintf("%*s",__=int((___=+$(_=+\
(_____=".")))^(++_/(++_*_))),"")
gsub(_____,$NF,_)
do { gsub(_____, _, _) } while((__*=__)<___)
print sprintf(" %15.8f GiB | %20.13f",
(__=length(_))/8^10,
((substr("",+srand())+substr(srand(),7))-\
(___=________))/(1+4^-24))
print substr(_,1,33)
CONVFMT=OFMT="%.14g"
____=9
while(--____){
sub(".+","&&&&&&&&&",_)
print sprintf(" %15.8f GiB | %20.13f",
length(_)/8^10,((substr("",+srand()) +\
substr(srand(),7))-___)/(1+4^-24)); fflush() } }' ) )
fg: no current job
fg: no current job
0.04009038 GiB | 0.0077791213989
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
0.36081345 GiB | 0.0445649623871
3.24732103 GiB | 0.3233759403229
29.22588923 GiB | 3.5619049072266
263.03300306 GiB | 58.9802551269529
Grow it at a different rate and we get :
103 billion copies of "x" just shy of 17 seconds
=
0.04304672 x 10^9 | 0.0111348628998
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
0.3013270470 x 10^9 ( 301327047)| 0.0563540458679
2.1092893290 x 10^9 ( 2109289329)| 0.2259490489960
14.7650253030 x 10^9 ( 14765025303)| 1.4351949691773
103.3551771210 x 10^9 ( 103355177121)| 16.9313509464274
And relatively trivial to adjust it for arbitrary sizes :
{m,g}awk '{ # the timer bit is for mawk-2
CONVFMT = "%.25g"
OFMT = "%.13f"
________ = substr("", +srand()) + substr(srand(), 7)
_ = sprintf("%*s", __=int((___= +$(_=\
+(_____="."))^ (++_/(++_*_))),"")
gsub(_____, $NF, _)
do {
print ___, gsub(_____, _, _),
((substr("",+srand()) +
substr(srand(), 7)) -________)/(1-4^-25)
} while ((__*=__)<___ && __<4^8)
print ".", length(_)
print \
sprintf("\n----------------\n%42.f .%20.13f",
(__=length(_))==___ ?__: (__=length(_= sprintf("%.*s",
___-__,_)(_))),((substr("", +srand()\
)+substr(srand(),7))-________) / (1-4^-25))
print substr(_, 1, 19)
}'
|
( mawk2 -F= <<< '3987654321=x'; )
0.20s user 0.55s system 94% cpu 0.799 total
3987654321 251 0.0000970363617
3987654321 63001 0.3465299606323
. 3969126001
----------------
3987654321 . 0.7496919631958

Faster algorithm/language for string replacement [closed]

EDIT:
Method 3 provided below is much faster in testing; it reduces the estimated runtime from 2-3 days to < 1 day.
I have a sample file with a long string (>50M characters) like this.
CACTGCTGTCACCCTCCATGCACCTGCCCACCCTCCAAGGATCNNNNNNNCACTGCTGTCACCCTCCATGCACCTGCCCACCCTCCAAGGATCaagctCCgaTNNNNNNNNNNNNGgtgtgtatatatcatgtgtgGCCCTAGGGCCCTAGGGCCCTAtgtgtgGCCCTAGGGCtgtgtgGCCCTAGGGCGGatgtgtggtgtgtggggttagggttagggttaNNNNNNNNNNNCCCTCCAAGGATCaagctCCgaTNNNNNNNNNNNNGgtgtgtatataGCCCTAGGtcatgtgtgatgtgtggtgtgtggggttagggttagggttaNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGCCCTAGGNNNNNNNGCCCTAGGNNNNNNNNNNNNNNAGGGCCCTAGGGCCCTAtgtgtgGCCCTAGGGCtgtgtgGCCCTAGGGCGGagtatatatcatgtgtgatgtgttggggtNNNNNNGgtgtgtatatatcatagggAGGGCCCTAGGGCCCTAtgtgtgGCCCTAGGGCtgtgtgGCCCTAGGGCGGagtatatatcatgtgtgatgtgtggtgtgggtgtgtggggttagggAGGGCCCTAGGGCCCTAtgtgtgGCCCTAGGGCtgtgtgGCCCTAGGGCGGagtatatatcatgtgtgatgtggtgtgtggggttagggttagggttaNNNNNNNNNNNNtgttgttttattttcttacaggtggtgtgtggggttagggttagggttaNNNNNNNNNNNCCCTCCAAGGATCaagctCCgaTNNNNNNNNNNNNGgtgtgtatatatcatgtAGCCCTAGGGatgtgtggtgtgtggggttagggttagggttaNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNttgtggtgtgtggtgNNNNNAGGGCtggtgtgtggggttagggAtagggAGGGCCCTAGGGCCCTAtgtgtgGCCCTAGGGCtgtgtgGCCCTAGGGCGGagtatatatcatgtgtgatgtgtggtgtgtggggGGGCCCTAGGGCCCTAtgtgtgGCCCTAGGGCtgtgtgGCCCTAGGGCGGagtatatatcatgtgtgatgtgtggtgtgtggggttagggNNNNNNNNNNNNNNNNNNNNNNNNNNNNAGaggcatattgatcCCCTCCAAGGATCaagctCCgaTNNNNNNNggttagggttNNNNNGgtgtCCCTAGGGCCCTAGGGCCCTAtgtgtgGCCCTAGGGCtgtgtgGCCCTAGGGCGGagtatatatcatgtgtgatgtgtggtgtgtggggttagggttagggttaNNNNNNNNNNNNtgttgttttattttcttacaggtggtgtgtggggttagggttagggttaNNNNNNNNNNNCCCTCCAAGGATCaagctCCgaTNNNNNNNNNNNNGgtgtgtatatatcatgtAGCCCTAGGGatgtgtggtgtgtggggttagggttagggttaNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNttgtggtgtgtggtgNNNNNAGGGCCCTAGGGCCCTAtgtgtgGCCCTAGGGCtgtgtgGCCCTAGGGCGGagtatatatcatgtgtgatgtgttggggtNNNNNNGgtgtgtatatatcatagggAGGGCCCTAGGGCCCTAtgtgtgGCCCTAGGGCtgtgtgGCCCTAGGGCGGagtatatatcatgtgtgatgtgtggtgtgggtgtgtggggttagggAGGGCCCTAGGGCCCTAtgtgtgGCCCTAGGGCtgtgtgGCCCTAGGGCGGagtatatatcatgtgtgNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
For every substring of length k = 50 (which means there are length(file)-k+1 substrings), if the proportion of A, T, a or t characters (upper and lower case) is >40%, replace every character in that substring with N or n (preserving case).
Sample output:
CACTGCTGTCACCCTCCATGCACCTGCCCACCCTCCAAGGATCNNNNNNNCACTGCTGTCACCCTCCATGCACCTGCCCACCCTCCAAGGATCaagctCCgaTNNNNNNNNNNNNGgnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNNNNNNnnnnnnNNNNNNNNNNnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnggttaNNNNNNNNNNNNNNNNNNNNNNNNnnnnnNNnnNNNNNNNNNNNNNNnnnnnnnnnnnNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGCCCTAGGNNNNNNNGCCCTAGGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNnnnnnnNNNNNNNNNNnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNnnnnnnNNNNNNNNNNnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNnnnnnnNNNNNNNNNNnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNNNNNNNNnnnnnNNnnNNNNNNNNNNNNNNnnnnnnnnnnnnnnnnnNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNttgtggtgtgtggtgNNNNNAGGNNnnnnnnnnnnnnnnnnnnNnnnnnNNNNNNNNNNNNNNNNNnnnnnnNNNNNNNNNNnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNnnnnnnNNNNNNNNNNnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnngggttagggNNNNNNNNNNNNNNNNNNNNNNNNNNNNAGaggcatattgatcCCCTCCAAGGATCaagctCCgaTNNNNNNNggttagggttNNNNNGnnnnNNNNNNNNNNNNNNNNNNNNNnnnnnnNNNNNNNNNNnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNNNNNNNNnnnnnNNnnNNNNNNNNNNNNNNnnnnnnnnnnnnnnnnnNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNttgtggtgtgtggtgNNNNNNNNNNNNNNNNNNNNNNnnnnnnNNNNNNNNNNnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNnnnnnnNNNNNNNNNNnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNnnnnnnNNNNNNNNNNnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnngNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
I was using AWK on the command line for ease, but it runs extremely slowly with string replacement, and somehow consumes only <5% CPU.
Code: https://repl.it/@hey0wing/DutifulHandsomeMisrac-2
# Method 1
cat chr22.fa | head -n1 > chr22.masked.fa
cat chr22.fa | tail -n+2 | awk -v k=100 -v r=40 '{
printf("chr22.fa: %d\n",length($0))
i = 1;
while (i <= length($0)-k+1) {
x = substr($0, i, k)
if (i == 1) {
rate = gsub(/A/,"A",x) + gsub(/T/,"T",x) + gsub(/a/,"a",x) + gsub(/t/,"t",x)
} else {
prevx = substr($0,i-1,1)
if (prevx == "A" || prevx == "a" || prevx == "T" || prevx == "t")
rate -= 1
nextx = substr(x,k,1)
if (nextx == "A" || nextx == "a" || nextx == "T" || nextx == "t")
rate += 1
}
if (rate>r*k/100) {
h++
highGC[i] = i
}
printf("index-r:%f%% high-AT:%d \r",i/(length($0)-k+1)*100,h)
i += 1
}
printf("index-r:%f%% high-AT:%d\n\n",i/(length($0)-k+1)*100,h)
for (j in highGC) {
y = highGC[j]
SUB++
printf("sub-r:%f%% \r",SUB/h*100)
x = substr($0, y, k)
gsub (/[AGCT]/,"N",x)
gsub (/[agct]/,"n",x)
$0 = substr($0,1,y-1) x substr($0,y+k)
}
printf("sub-r:%f%%\nsubstituted:%d\n\n",SUB/h*100,SUB)
printf("%s",$0) >> "chr22.masked.fa"
}'
# Method 2
cat chr22.fa | head -n1 > chr22.masked2.fa
cat chr22.fa | tail -n+2 | awk -v k="100" -v r=40 '{
printf("chr22.fa: %d\n",length($0))
i = 1;
h = 0;
while (i<=length($0)-k+1) {
x = substr($0, i, k)
rate = gsub(/[ATX]/,"X",x) + gsub(/[atx]/,"x",x)
if (rate>r/k*100) {
h++
gsub (/[GC]/,"N",x)
gsub (/[gc]/,"n",x)
$0 = substr($0,1,i-1) x substr($0,i+k)
}
printf("index-r:%f%% sub-r:%f%% \r",i/(length($0)-k+1)*100,h/544*100)
i += 1
}
gsub (/X/,"N",$0)
gsub (/x/,"n",$0)
printf("index-r:%f%% sub-r:%f%% \n",i/(length($0)-k+1)*100,h/544*100)
printf("%s",$0) >> "chr22.masked2.fa"
}'
# Method 3
cat chr22.fa | head -n1 > chr22.masked3.fa
cat chr22.fa | tail -n+2 | awk -v k="100" -v r=40 '{
printf("chr22.fa: %d\n",length($0))
i = 1;
h = 0;
while (i <= length($0)-k+1) {
x = substr($0, i, k)
rate = gsub(/A/,"A",x) + gsub(/T/,"T",x) + gsub(/a/,"a",x) + gsub(/t/,"t",x)
if (rate>r/k*100) {
h++
gsub(/[ACGT]/,"N",x)
gsub(/[acgt]/,"n",x)
if (i == 1) {
s = x
} else {
s = substr(s,1,length(s)-k+1) x
}
} else {
if (i == 1) {
s = x
} else {
s = s substr(x,k,1)
}
}
printf("index-r:%f%% sub-r:%f%% \r",i/(length($0)-k+1)*100,h/544*100)
i += 1
}
printf("index-r:%f%% sub-r:%f%% \n\n",i/(length($0)-k+1)*100,h/544*100)
printf("%s",s) >> "chr22.masked3.fa"
}'
The estimated runtime is around 2-3 days ...
Is there a faster algorithm for this problem? If not, is there a language that can perform string replacement faster?
More info:
The AWK command consumes ~30% CPU in WSL and Git Bash, but only ~5% in Windows cmd with an OpenSSH client, where the progress rate is similar.
Okay, there's an O(n) solution that involves a sliding window on to your data set. The following algorithm should suffice:
set window to ""
while true:
    if window is "":
        read k characters into window, exit while if less available
        set atCount to number of characters in window matching "AaTt".
    if atCount > 40% of k:
        for each char in window:
            if char uppercase:
                output "N"
            else:
                output "n"
        window = ""
    else:
        if first character of window matches "AaTt":
            decrease atCount
        remove first character of window
        read next character into end of window, exit while if none available
        if last character of window matches "AaTt":
            increase atCount
What this does is to run a sliding window through your data, at each point testing if the proportion of AaTt characters in that window is more than 40%.
If so, it outputs the desired Nn characters and reloads the next k-sized window.
If it's not over 40%, it removes the first character in the window and adds the next one to the end, adjusting the count of AaTt characters correctly.
If, at any point, there aren't enough characters left to satisfy a check (k when loading a full window, or 1 when sliding), it exits the loop.
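The running-count idea can be sketched in awk as follows. Note that this is a variant, not a transcription of the pseudocode: it follows the rule as stated in the question (every window above the threshold is masked in full, and windows may overlap) by remembering the rightmost position covered by any qualifying window, rather than reloading a fresh window after each hit. The names mask_at_rich, isAT, cnt and maskTo are made up, and the sketch assumes one sequence per input line containing only letters.

```shell
# mask_at_rich K PERCENT (hypothetical helper): read sequence lines on stdin
# and mask every length-K window whose A/T/a/t share is above PERCENT.
mask_at_rich() {
  awk -v k="$1" -v r="$2" '
  function isAT(ch) { return ch == "A" || ch == "a" || ch == "T" || ch == "t" }
  {
    n = length($0)
    cnt = 0        # A/T count of the window starting at position i
    maskTo = 0     # rightmost position covered by a qualifying window
    for (i = 1; i <= k && i <= n; i++) cnt += isAT(substr($0, i, 1))
    for (i = 1; i <= n; i++) {
      if (i > 1 && i + k - 1 <= n)   # slide: add char i+k-1, drop char i-1
        cnt += isAT(substr($0, i + k - 1, 1)) - isAT(substr($0, i - 1, 1))
      if (i + k - 1 <= n && cnt * 100 > r * k) maskTo = i + k - 1
      ch = substr($0, i, 1)
      if (i <= maskTo) ch = (ch == toupper(ch)) ? "N" : "n"
      printf "%s", ch                # emit as we go, no big string rebuilds
    }
    print ""
  }'
}

printf 'GGATGGGG\n' | mask_at_rich 4 40    # prints NNNNNNGG
```

Each character is examined a constant number of times and the output is streamed, so the work grows linearly with the sequence length instead of rebuilding a 50M-character string on every substitution.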
Try some perl:
perl -slpe '
my $len = length;
for (my $i = 0; $i < $len; $i += $k) {
my $substring = substr($_, $i, $k);
my $count = $substring =~ tr/aAtT/aAtT/;
if ($count >= $k * $threshold) {
$substring =~ s/[[:lower:]]/n/g;
$substring =~ s/[[:upper:]]/N/g;
substr($_, $i, $k) = $substring;
}
}
' -- -k=50 -threshold=0.4 file

Match specific pattern and print just the matched string in the previous line

I have updated the question with additional information.
I have a .fastq file formatted in the following way
#M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8 (sequence name)
CATCTACATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC.. (sequence)
+
ACCCGGGGGGGGGDGGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFF.. (sequence quality)
For each sequence the format is the same (repetition of 4 lines)
What I am trying to do is search for a specific regex pattern ([A-Z]{5,}ACA[A-Z]{5,}ACA[A-Z]{5,}) in a window of n=35 characters of the 2nd line, cut it if found, and report it at the end of the previous line.
So far I've written some code that does almost what I want. I thought of using the match function together with substr on my window of interest, but I didn't achieve my goal. Here is the script.awk:
match(substr($0,0,35),/regexp/,a) {
print p,a[0] #print the previous line respect to the matched one
print #print the current line
for(i=0;i<=1;i++) { # print the 2 lines following
getline
print
}
}#store previous line
{ p = $0 }
Starting from a file like this:
#M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8
AACATCTACATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC..
+
GGGGGGGGDGGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF..
I would like to obtain an output like this:
#M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8 TATTCACATATAGACATGAAA #is the string that matched the regexp WITHOUT the initial AA that doesn't match my expression
ATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC #without initial AA
+
GGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF # without "GGGGGGGGDGGGFGGGGGGFGGG" that is the same number of characters removed in the 2nd line
$ cat tst.awk
BEGIN {
tgtStr = "pattern"
tgtLgth = length(tgtStr)
winLgth = 35
numLines = 4
}
{
lineNr = ( (NR-1) % numLines ) + 1
rec[lineNr] = $0
}
lineNr == numLines {
if ( idx = index(substr(rec[2],1,winLgth),tgtStr) ) {
rec[1] = rec[1] " " tgtStr
rec[2] = substr(rec[2],idx+tgtLgth)
rec[4] = substr(rec[4],idx+tgtLgth)
}
for ( lineNr=1; lineNr<=numLines; lineNr++ ) {
print rec[lineNr]
}
}
$ awk -f tst.awk file
#M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8 pattern
ATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC..
+
GGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF..
wrt the code you posted:
substr($0,0,35) - strings, fields, line numbers, and arrays in awk start at 1 not 0 so that should be substr($0,1,35). Awk will compensate for your mistake and treat it as if you had written 1 instead of 0 in this case but get used to starting everything at 1 to avoid mistakes when it matters.
for(i=0;i<=1;i++) - should be for(i=1;i<=2;i++) for the same reason.
getline - not an appropriate use and syntactically fragile, see for(i=0;i<=1;i++)
Update - per your comment below that pattern is actually a regexp rather than a string:
$ cat tst.awk
BEGIN {
tgtRegexp = "[A-Z]{5,}ACA[A-Z]{5,}ACA[A-Z]{5,}"
winLgth = 35
numLines = 4
}
{
lineNr = ( (NR-1) % numLines ) + 1
rec[lineNr] = $0
}
lineNr == numLines {
if ( match(substr(rec[2],1,winLgth),tgtRegexp) ) {
rec[1] = rec[1] " " substr(rec[2],RSTART,RLENGTH)
rec[2] = substr(rec[2],RSTART+RLENGTH)
rec[4] = substr(rec[4],RSTART+RLENGTH)
}
for ( lineNr=1; lineNr<=numLines; lineNr++ ) {
print rec[lineNr]
}
}
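For anyone unsure how the RSTART/RLENGTH arithmetic in that script behaves, here is a minimal sketch on a toy string. The helper name trim_after_match, the sample sequence and the regexp are made up for illustration; only match(), substr(), RSTART and RLENGTH come from awk itself.

```shell
# trim_after_match LINE REGEX WINDOW  (hypothetical helper, for illustration only)
trim_after_match() {
    awk -v s="$1" -v re="$2" -v win="$3" 'BEGIN {
        # match() sets RSTART (1-based start) and RLENGTH (length of the match)
        if (match(substr(s, 1, win), re)) {
            hit  = substr(s, RSTART, RLENGTH)     # the matched text
            rest = substr(s, RSTART + RLENGTH)    # everything after the match
            print hit " " rest
        } else {
            print s                               # no match inside the window
        }
    }'
}

trim_after_match "AACCGGTTACGT" "GG[A-Z]T" 8   # match fits in the first 8 characters
trim_after_match "AACCGGTTACGT" "GG[A-Z]T" 6   # window too short, line is left alone
```

Because the window always starts at character 1, RSTART from the windowed match is also a valid position in the full line, which is why the script can reuse it on rec[2] and rec[4].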
I warn you, I wanted to have some fun and it is twisted.
awk -v pattern=pattern -v window=15 '
BEGIN{RS="#";FS=OFS="\n"}
{pos = match($2, pattern); n_del=pos+length(pattern)}
pos && (n_del<=window){$1 = $1 " " pattern; $2=substr($2, n_del); $4=substr($4, n_del)}
NR!=1{printf "%s%s", RS, $0}
' file
Input :
#M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8
CATCTACpatternATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC..
+
ACCCGGGGGGGGGDGGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF..
#M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8
CATCTACGCpatternATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC..
+
ACCCGGGGDGGGGGGDGGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF..
Output :
#M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8 pattern
ATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC..
+
GGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF..
#M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8
CATCTACGCpatternATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC..
+
ACCCGGGGDGGGGGGDGGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF..
The second block is not updated because the window is 15 and the pattern does not end within it.
I set RS to "#" so that each 4-line block becomes one record, with its lines available as $1 through $4. Because the input file starts with RS and does not end with RS, I preferred not to set ORS and used printf instead of print.
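To see what the RS="#" trick does to the records, here is a small sketch on two toy blocks. The function name show_blocks and the sample data are invented for the example; it assumes "#" only ever appears at the start of a header line.

```shell
# show_blocks (hypothetical name): print header, sequence and quality of each 4-line block
show_blocks() {
    awk 'BEGIN { RS = "#"; FS = "\n" }
         NR > 1 { print "header=" $1 " seq=" $2 " qual=" $4 }'
}

# The input starts with "#", so record 1 is empty and is skipped with NR > 1.
printf '#h1\nAAAA\n+\nIIII\n#h2\nCCCC\n+\nJJJJ\n' | show_blocks
```

Each record still ends with the newline of its fourth line, so there is an empty fifth field; that trailing newline is the reason printing $0 with printf "%s%s" gives back the original layout. The approach breaks as soon as a "#" shows up anywhere else in a block.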

binary numbers in gawk

How can one specify a number as binary in gawk?
According to the manual, gawk interprets all numbers as decimal unless they are preceded by a 0 (octal) or by a 0x (hexadecimal). Unlike in certain other languages, 0b does not do the trick.
For instance, the following lines do not give the desired output (010000 or 10000) because the values are interpreted as octal/decimal or decimal/decimal, respectively:
gawk '{print and(010000,110000)}'
0
gawk '{print and(10000,110000)}'
9488
I suspect that gawk may not support base-2 and that a user-defined function will be required to generate binary representations.
You're right, there's no internal support for binary conversion in gawk. And incredibly, there isn't even any in printf(). So you're stuck with functions.
Remember that awk is weakly typed. Which is why functions have insane behaviours like recognizing that "0x" at the beginning of a string means it's a hexadecimal number. In a language like this, better to control your own types.
Here's a couple of functions I've had sitting around for years...
#!/usr/local/bin/gawk -f
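function bin2dec(n,    result, i) {  # result and i are locals, so callers' variables are safe
result = 0;
if (n~/[^01]/) {
return n;  # not a binary string: hand it back unchanged
}
for (i=length(n); i; i--) {
result += 2^(length(n)-i) * substr(n,i,1);
}
return result;
}
function dec2bin(n,    result) {
if (n == 0) {
return "0";  # without this, zero would come back as an empty string
}
result = "";
while (n) {
if (n%2) {
result = "1" result;
} else {
result = "0" result;
}
n = int(n/2);
}
return result;
}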
{
print dec2bin( and(bin2dec($1),bin2dec($2)) );
}
And the result:
$ echo "1101 1011" | ./doit.awk
1001
$ echo "11110 10011" | ./doit.awk
10010
$
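If gawk's and() isn't available (mawk, for example, doesn't have it), the same AND can be done directly on the binary strings. This is only a sketch under that assumption; bin_and is a made-up name and it expects both arguments to contain nothing but 0s and 1s.

```shell
# bin_and A B  (hypothetical helper): bitwise AND of two binary strings, done textually
bin_and() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        # left-pad the shorter operand with zeros so both have equal length
        while (length(a) < length(b)) a = "0" a
        while (length(b) < length(a)) b = "0" b
        out = ""
        for (i = 1; i <= length(a); i++)
            out = out ((substr(a, i, 1) == "1" && substr(b, i, 1) == "1") ? "1" : "0")
        sub(/^0+/, "", out)             # drop leading zeros ...
        print (out == "" ? "0" : out)   # ... but keep a single 0 for an all-zero result
    }'
}

bin_and 1101 1011
bin_and 11110 10011
```

Working on the digits as text also sidesteps any limit on how large a number the awk in use can represent exactly.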

Using awk printf to urldecode text

I'm using awk to urldecode some text.
If I code the string into the printf statement like printf "%s", "\x3D" it correctly outputs =. The same if I have the whole escaped string as a variable.
However, if I only have the 3D, how can I append the \x so printf will print the = and not \x3D?
I'm using busybox awk 1.4.2 and the ash shell.
I don't know how you do this in awk, but it's trivial in perl:
echo "http://example.com/?q=foo%3Dbar" |
perl -pe 's/\+/ /g; s/%([0-9a-f]{2})/chr(hex($1))/eig'
Since you're using ash and Perl isn't available, I'm assuming that you may not have gawk.
For me, using gawk or busybox awk, your second example works the same as the first (I get "=" from both) unless I use the --posix option (in which case I get "x3D" for both).
If I use --non-decimal-data or --traditional with gawk I get "=".
What version of AWK are you using (awk, nawk, gawk, busybox - and version number)?
Edit:
You can coerce the variable's string value into a numeric one by adding zero:
~/busybox/awk 'BEGIN { string="3D"; pre="0x"; hex=pre string; printf "%c", hex+0}'
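The adding-zero trick depends on the awk in use turning the string "0x3D" into a number, which not every awk does (gawk, by default, gives 0). A sketch that avoids the dependency by converting the hex digits by hand; hex_to_char and hex2dec are names invented here, not built-ins.

```shell
# hex_to_char HEX  (hypothetical helper): print the character whose code is HEX
hex_to_char() {
    awk -v h="$1" '
    function hex2dec(s,    i, d, n) {
        s = toupper(s)
        n = 0
        for (i = 1; i <= length(s); i++) {
            # index() gives 1..16 for a hex digit and 0 for anything else
            d = index("0123456789ABCDEF", substr(s, i, 1))
            if (d == 0) return -1
            n = n * 16 + d - 1
        }
        return n
    }
    BEGIN {
        n = hex2dec(h)
        if (n < 0) exit 1        # refuse input that is not hexadecimal
        printf "%c\n", n
    }'
}

hex_to_char 3D
```

It only uses toupper(), index() and substr(), so it should behave the same in busybox awk, mawk and gawk.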
GNU awk
#!/usr/bin/awk -fn
@include "ord"
BEGIN {
RS = "%.."
}
{
printf "%s", (RT ? $0 chr("0x" substr(RT, 2)) : $0)
}
Or
#!/bin/sh
awk -niord '{printf "%s", RT?$0chr("0x"substr(RT,2)):$0}' RS=%..
Decoding URL encoding (percent encoding)
This relies on gnu awk's extension of the split function, but this works:
gawk '{ numElems = split($0, arr, /%../, seps);
outStr = ""
for (i = 1; i <= numElems - 1; i++) {
outStr = outStr arr[i]
outStr = outStr sprintf("%c", strtonum("0x" substr(seps[i],2)))
}
outStr = outStr arr[i]
print outStr
}'
To start with, I'm aware this is an old question, but none of the answers worked for me (restricted to busybox awk)
Two options. To parse stdin:
awk '{for (y=0;y<127;y++) if (y!=37) gsub(sprintf("%%%02x|%%%02X",y,y), y==38 ? "\\&" : sprintf("%c", y));gsub(/%25/, "%");print}'
To take a command line parameter:
awk 'BEGIN {for (y=0;y<127;y++) if (y!=37) gsub(sprintf("%%%02x|%%%02X",y,y), y==38 ? "\\&" : sprintf("%c", y), ARGV[1]);gsub(/%25/, "%", ARGV[1]);print ARGV[1]}' parameter
Have to do %25 last because otherwise strings like %253D get double-parsed, which shouldn't happen.
The inline check for y==38 is because gsub treats & as a special character unless you backslash it.
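The point about & is easy to check in isolation. A small sketch with two toy filters (the function names are made up) that differ only in the replacement string:

```shell
# In a gsub() replacement, a bare & stands for the matched text.
decode_amp_wrong() { awk '{ gsub(/%26/, "&");   print }'; }   # puts the match straight back
decode_amp_right() { awk '{ gsub(/%26/, "\\&"); print }'; }   # backslashed: a literal &

echo 'a%26b' | decode_amp_wrong
echo 'a%26b' | decode_amp_right
```

The first call leaves the input untouched because each %26 is replaced by itself; the second prints the decoded ampersand.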
This one is the fastest of them all by a large margin and it doesn't need gawk:
#!/usr/bin/mawk -f
function decode_url(url, dec, tmp, pre, mid, rep) {
tmp = url
while (match(tmp, /%[0-9a-fA-F][0-9a-fA-F]/)) {
pre = substr(tmp, 1, RSTART - 1)
mid = substr(tmp, RSTART + 1, RLENGTH - 1)
rep = sprintf("%c", ("0x" mid) + 0)
dec = dec pre rep
tmp = substr(tmp, RSTART + RLENGTH)
}
return dec tmp
}
{
print decode_url($0)
}
Save it as decode_url.awk and use it like you normally would. E.g:
$ ./decode_url.awk <<< 'Hello%2C%20world%20%21'
Hello, world !
But if you want an even faster version:
#!/usr/bin/mawk -f
function gen_url_decode_array( i, n, c) {
delete decodeArray
# cover every byte value from 1 to 255, not just 32..63, so no escape decodes to an empty string
for (i = 1; i < 256; ++i) {
c = sprintf("%c", i)
n = sprintf("%%%02X", i)
decodeArray[n] = c
decodeArray[tolower(n)] = c
}
}
function decode_url(url, dec, tmp, pre, mid, rep) {
tmp = url
while (match(tmp, /%[0-9a-fA-F][0-9a-fA-F]/)) {
pre = substr(tmp, 1, RSTART - 1)
mid = substr(tmp, RSTART, RLENGTH)
# escapes missing from the table (%00, or mixed case like %aB) are kept verbatim
rep = (mid in decodeArray) ? decodeArray[mid] : mid
dec = dec pre rep
tmp = substr(tmp, RSTART + RLENGTH)
}
return dec tmp
}
BEGIN {
gen_url_decode_array()
}
{
print decode_url($0)
}
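Interpreters other than mawk should mostly cope with these too, with one caveat: the first version relies on the string "0x.." being converted to a number when you add 0. The busybox awk example earlier on this page does the same thing, but gawk does not do this by default (it needs --non-decimal-data or strtonum()). The table-driven version does not depend on that conversion.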