Faster algorithm/language for string replacement [closed] - awk

EDIT:
In testing, Method 3 below is much faster; it reduces the estimated runtime from 2-3 days to under 1 day.
I have a sample file containing one long string (more than 50M characters), like this:
CACTGCTGTCACCCTCCATGCACCTGCCCACCCTCCAAGGATCNNNNNNNCACTGCTGTCACCCTCCATGCACCTGCCCACCCTCCAAGGATCaagctCCgaTNNNNNNNNNNNNGgtgtgtatatatcatgtgtgGCCCTAGGGCCCTAGGGCCCTAtgtgtgGCCCTAGGGCtgtgtgGCCCTAGGGCGGatgtgtggtgtgtggggttagggttagggttaNNNNNNNNNNNCCCTCCAAGGATCaagctCCgaTNNNNNNNNNNNNGgtgtgtatataGCCCTAGGtcatgtgtgatgtgtggtgtgtggggttagggttagggttaNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGCCCTAGGNNNNNNNGCCCTAGGNNNNNNNNNNNNNNAGGGCCCTAGGGCCCTAtgtgtgGCCCTAGGGCtgtgtgGCCCTAGGGCGGagtatatatcatgtgtgatgtgttggggtNNNNNNGgtgtgtatatatcatagggAGGGCCCTAGGGCCCTAtgtgtgGCCCTAGGGCtgtgtgGCCCTAGGGCGGagtatatatcatgtgtgatgtgtggtgtgggtgtgtggggttagggAGGGCCCTAGGGCCCTAtgtgtgGCCCTAGGGCtgtgtgGCCCTAGGGCGGagtatatatcatgtgtgatgtggtgtgtggggttagggttagggttaNNNNNNNNNNNNtgttgttttattttcttacaggtggtgtgtggggttagggttagggttaNNNNNNNNNNNCCCTCCAAGGATCaagctCCgaTNNNNNNNNNNNNGgtgtgtatatatcatgtAGCCCTAGGGatgtgtggtgtgtggggttagggttagggttaNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNttgtggtgtgtggtgNNNNNAGGGCtggtgtgtggggttagggAtagggAGGGCCCTAGGGCCCTAtgtgtgGCCCTAGGGCtgtgtgGCCCTAGGGCGGagtatatatcatgtgtgatgtgtggtgtgtggggGGGCCCTAGGGCCCTAtgtgtgGCCCTAGGGCtgtgtgGCCCTAGGGCGGagtatatatcatgtgtgatgtgtggtgtgtggggttagggNNNNNNNNNNNNNNNNNNNNNNNNNNNNAGaggcatattgatcCCCTCCAAGGATCaagctCCgaTNNNNNNNggttagggttNNNNNGgtgtCCCTAGGGCCCTAGGGCCCTAtgtgtgGCCCTAGGGCtgtgtgGCCCTAGGGCGGagtatatatcatgtgtgatgtgtggtgtgtggggttagggttagggttaNNNNNNNNNNNNtgttgttttattttcttacaggtggtgtgtggggttagggttagggttaNNNNNNNNNNNCCCTCCAAGGATCaagctCCgaTNNNNNNNNNNNNGgtgtgtatatatcatgtAGCCCTAGGGatgtgtggtgtgtggggttagggttagggttaNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNttgtggtgtgtggtgNNNNNAGGGCCCTAGGGCCCTAtgtgtgGCCCTAGGGCtgtgtgGCCCTAGGGCGGagtatatatcatgtgtgatgtgttggggtNNNNNNGgtgtgtatatatcatagggAGGGCCCTAGGGCCCTAtgtgtgGCCCTAGGGCtgtgtgGCCCTAGGGCGGagtatatatcatgtgtgatgtgtggtgtgggtgtgtggggttagggAGGGCCCTAGGGCCCTAtgtgtgGCCCTAGGGCtgtgtgGCCCTAGGGCGGagtatatatcatgtgtgNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
For every substring of length k = 50 (which means there are
length(file)-k+1 substrings),
if the proportion of A, T, a, and t characters (upper and lower case) is greater than 40%,
replace every character in that substring with N or n (preserving
case).
Sample output:
CACTGCTGTCACCCTCCATGCACCTGCCCACCCTCCAAGGATCNNNNNNNCACTGCTGTCACCCTCCATGCACCTGCCCACCCTCCAAGGATCaagctCCgaTNNNNNNNNNNNNGgnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNNNNNNnnnnnnNNNNNNNNNNnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnggttaNNNNNNNNNNNNNNNNNNNNNNNNnnnnnNNnnNNNNNNNNNNNNNNnnnnnnnnnnnNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNGCCCTAGGNNNNNNNGCCCTAGGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNnnnnnnNNNNNNNNNNnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNnnnnnnNNNNNNNNNNnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNnnnnnnNNNNNNNNNNnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNNNNNNNNnnnnnNNnnNNNNNNNNNNNNNNnnnnnnnnnnnnnnnnnNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNttgtggtgtgtggtgNNNNNAGGNNnnnnnnnnnnnnnnnnnnNnnnnnNNNNNNNNNNNNNNNNNnnnnnnNNNNNNNNNNnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNnnnnnnNNNNNNNNNNnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnngggttagggNNNNNNNNNNNNNNNNNNNNNNNNNNNNAGaggcatattgatcCCCTCCAAGGATCaagctCCgaTNNNNNNNggttagggttNNNNNGnnnnNNNNNNNNNNNNNNNNNNNNNnnnnnnNNNNNNNNNNnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNNNNNNNNnnnnnNNnnNNNNNNNNNNNNNNnnnnnnnnnnnnnnnnnNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNttgtggtgtgtggtgNNNNNNNNNNNNNNNNNNNNNNnnnnnnNNNNNNNNNNnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNnnnnnnNNNNNNNNNNnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnnNNNNNNNNNNNNNNNNNnnnnnnNNNNNNNNNNnnnnnnNNNNNNNNNNNNnnnnnnnnnnnnnnnngNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
I was using AWK on the command line for convenience, but it runs extremely slowly when doing the string replacement, and for some reason it uses less than 5% CPU.
Code: https://repl.it/#hey0wing/DutifulHandsomeMisrac-2
# Method 1
cat chr22.fa | head -n1 > chr22.masked.fa
cat chr22.fa | tail -n+2 | awk -v k=100 -v r=40 '{
    printf("chr22.fa: %d\n",length($0))
    i = 1;
    while (i <= length($0)-k+1) {
        x = substr($0, i, k)
        if (i == 1) {
            rate = gsub(/A/,"A",x) + gsub(/T/,"T",x) + gsub(/a/,"a",x) + gsub(/t/,"t",x)
        } else {
            prevx = substr($0,i-1,1)
            if (prevx == "A" || prevx == "a" || prevx == "T" || prevx == "t")
                rate -= 1
            nextx = substr(x,k,1)
            if (nextx == "A" || nextx == "a" || nextx == "T" || nextx == "t")
                rate += 1
        }
        if (rate>r*k/100) {
            h++
            highGC[i] = i
        }
        printf("index-r:%f%% high-AT:%d \r",i/(length($0)-k+1)*100,h)
        i += 1
    }
    printf("index-r:%f%% high-AT:%d\n\n",i/(length($0)-k+1)*100,h)
    for (j in highGC) {
        y = highGC[j]
        SUB++
        printf("sub-r:%f%% \r",SUB/h*100)
        x = substr($0, y, k)
        gsub (/[AGCT]/,"N",x)
        gsub (/[agct]/,"n",x)
        $0 = substr($0,1,y-1) x substr($0,y+k)
    }
    printf("sub-r:%f%%\nsubstituted:%d\n\n",SUB/h*100,SUB)
    printf("%s",$0) >> "chr22.masked.fa"
}'
# Method 2
cat chr22.fa | head -n1 > chr22.masked2.fa
cat chr22.fa | tail -n+2 | awk -v k="100" -v r=40 '{
    printf("chr22.fa: %d\n",length($0))
    i = 1;
    h = 0;
    while (i<=length($0)-k+1) {
        x = substr($0, i, k)
        rate = gsub(/[ATX]/,"X",x) + gsub(/[atx]/,"x",x)
        # threshold is r percent of k (r/k*100 only gives the same value when k=100)
        if (rate>r*k/100) {
            h++
            gsub (/[GC]/,"N",x)
            gsub (/[gc]/,"n",x)
            $0 = substr($0,1,i-1) x substr($0,i+k)
        }
        printf("index-r:%f%% sub-r:%f%% \r",i/(length($0)-k+1)*100,h/544*100)
        i += 1
    }
    gsub (/X/,"N",$0)
    gsub (/x/,"n",$0)
    printf("index-r:%f%% sub-r:%f%% \n",i/(length($0)-k+1)*100,h/544*100)
    printf("%s",$0) >> "chr22.masked2.fa"
}'
# Method 3
cat chr22.fa | head -n1 > chr22.masked3.fa
cat chr22.fa | tail -n+2 | awk -v k="100" -v r=40 '{
    printf("chr22.fa: %d\n",length($0))
    i = 1;
    h = 0;
    while (i <= length($0)-k+1) {
        x = substr($0, i, k)
        rate = gsub(/A/,"A",x) + gsub(/T/,"T",x) + gsub(/a/,"a",x) + gsub(/t/,"t",x)
        # threshold is r percent of k (r/k*100 only gives the same value when k=100)
        if (rate>r*k/100) {
            h++
            gsub(/[ACGT]/,"N",x)
            gsub(/[acgt]/,"n",x)
            if (i == 1) {
                s = x
            } else {
                s = substr(s,1,length(s)-k+1) x
            }
        } else {
            if (i == 1) {
                s = x
            } else {
                s = s substr(x,k,1)
            }
        }
        printf("index-r:%f%% sub-r:%f%% \r",i/(length($0)-k+1)*100,h/544*100)
        i += 1
    }
    printf("index-r:%f%% sub-r:%f%% \n\n",i/(length($0)-k+1)*100,h/544*100)
    printf("%s",s) >> "chr22.masked3.fa"
}'
The estimated runtime is around 2-3 days.
Is there a faster algorithm for this problem? If not, is there a language that can perform string replacement faster?
More info:
The AWK command consumes ~30% CPU under WSL and Git Bash, but only ~5% in Windows cmd with an OpenSSH client, although the progress rate is similar in all cases.

Okay, there's an O(n) solution that involves sliding a window over your data set. The following algorithm should suffice:
set window to ""
while true:
    if window is "":
        read k characters into window, exit while if less available
        set atCount to number of characters in window matching "AaTt".
    if atCount > 40% of k:
        for each char in window:
            if char uppercase:
                output "N"
            else:
                output "n"
        window = ""
    else:
        if first character of window matches "AaTt":
            decrease atCount
        remove first character of window
        read next character into end of window, exit while if none available
        if last character of window matches "AaTt":
            increase atCount
What this does is to run a sliding window through your data, at each point testing if the proportion of AaTt characters in that window is more than 40%.
If so, it outputs the desired Nn characters and reloads the next k-sized window.
If it's not over 40%, it removes the first character in the window and adds the next one to the end, adjusting the count of AaTt characters correctly.
If, at any point, there aren't enough characters left to satisfy a check (k when loading a full window, or 1 when sliding), it exits the loop.
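As a rough awk sketch of the same sliding-count idea: the version below applies the question's rule as stated (every length-k window whose A/T/a/t count exceeds r percent of k is masked in full, overlapping windows included), which is slightly different from the pseudocode above, which restarts with a fresh window after each hit. The wrapper name mask_at and the helper names is_at, emit, and mask_end are made up for illustration and do not come from the original posts. It assumes one sequence per input line.

```shell
# Hypothetical wrapper: mask_at K R  (reads sequences on stdin, one per line)
mask_at() {
    awk -v k="$1" -v r="$2" '
    BEGIN { k += 0; r += 0 }          # force numeric comparison
    function is_at(c) { return (c == "A" || c == "T" || c == "a" || c == "t") }
    # Print position p, masked if it lies inside a window already flagged.
    function emit(p,    c) {
        c = substr($0, p, 1)
        if (p <= mask_end) c = (c == toupper(c)) ? "N" : "n"
        printf "%s", c
    }
    {
        n = length($0); cnt = 0; mask_end = 0
        for (e = 1; e <= n; e++) {                 # e = right edge of the window
            if (is_at(substr($0, e, 1))) cnt++     # character entering the window
            if (e > k && is_at(substr($0, e - k, 1))) cnt--   # character leaving
            if (e >= k) {
                if (cnt * 100 > r * k) mask_end = e   # window [e-k+1, e] is over r%
                emit(e - k + 1)       # every window covering this position is known
            }
        }
        # Positions past the last window start (or the whole line if n < k).
        for (p = (n >= k ? n - k + 2 : 1); p <= n; p++) emit(p)
        print ""
    }'
}
```

For example, `tail -n+2 chr22.fa | mask_at 50 40` would apply it to the sequence line. Each character is inspected a constant number of times, so the work grows linearly with the line length, provided the awk in use implements substr() in constant time.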

Try some perl:
perl -slpe '
    my $len = length;
    for (my $i = 0; $i < $len; $i += $k) {
        my $substring = substr($_, $i, $k);
        my $count = $substring =~ tr/aAtT/aAtT/;
        if ($count >= $k * $threshold) {
            $substring =~ s/[[:lower:]]/n/g;
            $substring =~ s/[[:upper:]]/N/g;
            substr($_, $i, $k) = $substring;
        }
    }
' -- -k=50 -threshold=0.4 file

Related

How to generate string with a char repeated for n times?

I can use the following code to generate a string.
$ awk -e 'BEGIN { for(i=1;i<=10;++i) s = s "x"; print s }'
xxxxxxxxxx
But its complexity is super-linear wrt the string length.
$ time awk -e 'BEGIN { for(i=1;i<=10000000;++i) s = s "x" }'
real 0m0.868s
user 0m0.857s
sys 0m0.008s
$ time awk -e 'BEGIN { for(i=1;i<=100000000;++i) s = s "x" }'
real 0m9.886s
user 0m9.801s
sys 0m0.065s
$ time awk -e 'BEGIN { for(i=1;i<=1000000000;++i) s = s "x" }'
real 1m46.074s
user 1m45.171s
sys 0m0.760s
Is there a better way to repeat a char n times and assign the result to a variable?
Use sprintf to create a string of spaces, then use gsub to replace each space with an x:
$ time awk 'BEGIN {s = sprintf("%*s", 100000000, ""); gsub(".", "x", s)}'
real 0m1.744s
user 0m1.645s
sys 0m0.098s
This can be wrapped in an awk function:
function mkStr(c, n, s) {
s = sprintf("%*s", n, "");
gsub(".", c, s);
return s;
}
(s is a parameter simply to scope the variable to the function; it needs no argument, and indeed, any argument passed will be ignored.)
Update: there appears to be a significant difference in performance depending on which version of awk you are using. The above test used 20070501, the BSD(?) awk that ships with macOS. gawk-5.1.0 takes significantly longer.
I don't know what accounts for the difference; perhaps there is a solution that is fast in both versions.
Update 2: Ed Morton (in the comments) has verified that gsub is responsible for the slow running time in gawk, could not find a workaround, and has filed a bug report with the maintainers.
function loop(n){
    for(i=1;i<=n;i++) s = s "x";
    return s;
}
function repl(n){
    s = sprintf("%*s", n, "");
    gsub(/ /, "x", s);
    return s;
}
function recStack(n, h){
    switch( n ){
    case 0:
        return "";
    default:
        if( n % 2 == 1 ){
            h = recStack( int((n-1)/2) )
            return h h "x";
        } else {
            h = recStack( int(n/2) )
            return h h;
        }
    }
}
function recStackIf(n, h){
    if( n == 0 ) return "";
    if( n % 2 == 1 ){
        h = recStackIf( int((n-1)/2) ); # create first half
        return h h "x"; # duplicate + one "x"
    } else {
        h = recStackIf( int(n/2) ); # create first half
        return h h; # duplicate
    }
}
function recArray(n, h, n2){
    if( n in a ) return a[n];
    switch( n ){
    case 0:
        return a[0] = "";
    default:
        if( n % 2 == 1 ){
            n2 = int((n-1)/2);
            h = recArray( n2 );
            return a[n] = h h "x";
        } else {
            n2 = int(n/2);
            h = recArray( n2 );
            return a[n] = h h;
        }
    }
}
function recArrayIf(n, h, n2){
    if( n in a ) return a[n];
    if( n == 0 ) return a[0] = "";
    if( n % 2 == 1 ){
        n2 = int((n-1)/2);
        h = recArrayIf( n2 );
        return a[n] = h h "x";
    } else {
        n2 = int(n/2);
        h = recArrayIf( n2 );
        return a[n] = h h;
    }
}
function concat(n){
    exponent = log(n)/log(2)
    m = int(exponent) # floor
    m += (m < exponent ? 1 : 0) # ceiling
    s = "x"
    for (i=1; i<=m; i++) {
        s = s s
    }
    s = substr(s,1,n)
    return s
}
BEGIN {
    switch( F ){
    case "recStack":
        xxx = recStack( 100000000 );
        break;
    case "recStackIf":
        xxx = recStackIf( 100000000 );
        break;
    case "recArray":
        xxx = recArray( 100000000 );
        break;
    case "recArrayIf":
        xxx = recArrayIf( 100000000 );
        break;
    case "loop":
        xxx = loop( 100000000 );
        break;
    case "repl":
        xxx = repl( 100000000 );
        break;
    case "concat":
        xxx = concat( 100000000 );
        break;
    }
    print length(xxx);
    ## xloop = loop(100000000 );
    ## if( xxx == xloop ) print "Match";
}
Times are:
# loop : real 0m5,405s, user 0m5,243s, sys 0m0,160s
# repl : real 0m7,670s, user 0m7,506s, sys 0m0,164s
# recArray: real 0m0,302s, user 0m0,141s, sys 0m0,161s
# recArrayIf: real 0m0,309s, user 0m0,168s, sys 0m0,141s
# recStack: real 0m0,316s, user 0m0,124s, sys 0m0,192s
# recStackIf: real 0m0,305s, user 0m0,152s, sys 0m0,152s
# concat: real 0m0,664s, user 0m0,300s, sys 0m0,364s
There's not much difference between the 5 versions of binary decomposition: a bunch of heap memory is used in all cases. Having the global array hang around until the end of all times isn't good and therefore I'd prefer either stack version.
wlaun@terra:/tmp$ gawk -V
GNU Awk 5.0.1, API: 2.0 (GNU MPFR 4.0.2, GNU MP 6.2.0)
wlaun@terra:/tmp$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Note that the above timing has been done with a statement printing the resulting string's length, which adds about 0.2 s to each version. Also, /usr/bin/time isn't reliable. Here are the relative "real" values from time without the print length(xxx):
# loop: 0m5,248s
# repl: 0m7,705s
# recStack: 0m0,103s
# recStackIf: 0m0,097s
# recArray: 0m0,103s
# recArrayIf: 0m0,099s
# concat: 0m0,455s
Added on request of Ed Morton:
Why is any of the recursive functions faster than any of the linear functions that iterate over O(N) elements? (The "O(N)" is the "big oh" symbol and is used to indicate a value that is N, possibly multiplied and/or incremented by some constant. A circle's circumference is O(r), its area is O(r²).)
The answer is simple: By dividing N by 2, we get two strings of length O(N/2). This provides the possibility of re-using the result for the first half (no matter how we obtain it) for the second half! Thus, we'll get the second half of the result for free (except for the string copy operation, which is basically a machine instruction on most popular architectures). There is no reason why this great idea should not be applied for creating the first half as well, which means that we get three quarters of the result for free (except - see above). A little overhead results from the single "x" we have to throw in to cater for odd subdivisions of N.
There are many other recursive algorithms along the idea of halving and dealing with both sections individually, the most famous of them are Binary Search and Quicksort.
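The halving idea can also be written without recursion. The following is only a sketch, assuming any POSIX awk: it walks the binary digits of n, doubling a working copy of the string and appending it whenever the current digit is 1. The names repeat_str and rep are hypothetical and are not taken from any of the answers here.

```shell
# Hypothetical wrapper: repeat_str STRING COUNT
repeat_str() {
    awk -v s="$1" -v n="$2" '
    # Build "count" copies of str with about log2(count) concatenations.
    function rep(str, count,    out) {
        out = ""
        while (count > 0) {
            if (count % 2 == 1) out = out str   # this binary digit is set
            count = int(count / 2)
            if (count > 0) str = str str        # double only if still needed
        }
        return out
    }
    BEGIN { print rep(s, n + 0) }'
}
```

Calling `repeat_str x 5` prints `xxxxx`. Unlike the double-then-truncate variants, this never builds a string longer than the result, at the cost of one extra concatenation per set bit.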
Here's a fast solution using any POSIX awk (tested on an average 8G RAM laptop running cygwin with GNU awk 5.1.0):
time awk -v n=100000000 'BEGIN {
exponent = log(n)/log(2)
m = int(exponent) # floor
m += (m < exponent ? 1 : 0) # ceiling
s = "x"
for (i=1; i<=m; i++) {
s = s s
}
s = substr(s,1,n)
}'
real 0m0.665s
user 0m0.328s
sys 0m0.343s
The above just appends a copy of the string to itself until it's at least as big as the target string length and finally truncates it to exactly the desired length. The only somewhat tricky part is calculating how many times to append s to itself and that's just a case of solving 2^m = n for m given you already know n (100000000 in this case), see https://www.calculatorsoup.com/calculators/algebra/exponentsolve.php.
Obviously you could make the loop while ( length(s) < n ) instead of calculating m and that'd make the script briefer but would slow it down a bit (but it'd still be pretty fast):
$ time awk -v n=100000000 'BEGIN{s="x"; while (length(s) < n) s=s s; s=substr(s,1,n)}'
real 0m1.072s
user 0m0.562s
sys 0m0.483s
@JamesBrown had a better idea than calling length() on each iteration that also avoids having to calculate m while being about as fast:
$ time awk -v n=100000000 'BEGIN{s="x"; for (i=1; i<n; i*=2) s=s s; s=substr(s,1,n)}'
real 0m0.710s
user 0m0.281s
sys 0m0.390s
Originally I had the following, thinking that doubling strings of "x"s would be a faster approach than doubling individual "x"s on each iteration of the loop, but it was a bit slower:
$ time awk -v n=100000000 '
BEGIN {
d = 1000000
m = 7
s = sprintf("%*s",d,"")
gsub(/ /,"x",s)
for (i=1; i<=m; i++) {
s = s s
}
s = substr(s,1,n)
}
'
real 0m1.030s
user 0m0.625s
sys 0m0.375s
The idea in that 2nd script was to generate a string long enough to be meaningful but short enough that gsub() can convert all blanks to "x"s quickly (which I've found to be 1000000 chars by trial and error), then just repeat the above process with fewer iterations.
I opened a bug report with the gawk maintainers about gsub() being slow for this case (see https://lists.gnu.org/archive/html/bug-gawk/2021-07/msg00030.html if you're interested). It looks like the outcome will simply be that gsub() is a heavy tool, so it's not surprising that 100 million regexp matches take a long time. You can, however, speed it up considerably by setting LC_ALL=C before it runs so gawk doesn't have to worry about multibyte locales.
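To try the locale setting on the sprintf-plus-gsub approach, something like the sketch below could be used. The function name make_x is hypothetical, and the field width is built by string concatenation rather than with `%*s` purely so the snippet does not depend on `*` support in a given awk. How much the C locale helps will depend on the awk and the machine, so no timing is claimed here.

```shell
# Hypothetical wrapper: make_x COUNT  (prints the length and the first 5 characters)
make_x() {
    LC_ALL=C awk -v n="$1" 'BEGIN {
        s = sprintf("%" (n + 0) "s", "")   # a string of n spaces
        gsub(/ /, "x", s)                  # turn every space into "x"
        print length(s), substr(s, 1, 5)
    }'
}
```

Running it under `time`, once with and once without the `LC_ALL=C` prefix, gives a direct comparison for a particular awk build.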
If you are okay with perl:
# assigning to a variable
$ perl -e '$s = "x" x 10'
# printing the string
$ perl -e 'print "x" x 10'
xxxxxxxxxx
Note that there's no newline for the print in the above example, use perl -le if you want one.
Here's a timing comparison:
$ time perl -e '$s = "x" x 100000000'
real 0m0.071s
# script.awk is the first code from Ed Morton's answer
$ time mawk -v n=100000000 -f script.awk
real 0m0.136s
$ time gawk -v n=100000000 -f script.awk
real 0m0.429s
Is there a better way to repeat a char n times and assign the result to a variable?
I propose the following solution, limited to the case where n equals 2^x and x is an integer greater than or equal to 1. For example, to get x repeated 32 (i.e. 2^5) times:
awk 'BEGIN{s="x";for(i=1;i<=5;i+=1)s=s s;print s}' emptyfile.txt
output
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
(tested in gawk 4.2.1)
extreme test case benchmark - comparing python3's built-in repeater running on compiled C-code, versus user-level scripting of a different algorithm, generating 152.58 bn zeros.
Starting point :: length = 4 < "0000" > | time = 0.000000 secs
For-Loop # 1 | Lgth-Gain Ratio 6.0x | #-0's : 24. | 0.000036 secs (cum.)
For-Loop # 2 | Lgth-Gain Ratio 26.0x | #-0's : 624. | 0.000050 secs (cum.)
For-Loop # 3 | Lgth-Gain Ratio 626.0x | #-0's : 390,624. | 0.000175 secs (cum.)
For-Loop # 4 | Lgth-Gain Ratio 390626.0x | #-0's : 152,587,890,624. | 21.485092 secs (cum.)
( mawk2 ; )
4.38s user 13.69s system 81% cpu 22.226 total
( python3 -c 'print(len("0"*(5**4**2-1)))'; )
4.34s user 17.32s system 71% cpu 30.291 total
152587890624
% ( time ( mawk2 '
function comma(_,__,___,____,_____) {
if(length(sprintf("%+.f",_))<=(\
__*=__+=__=__~__)) {
return \
index(_,".") ?_:_"."
}
____=CONVFMT;CONVFMT="%.1000g";___=_
_____="[0-9]"
gsub("^|$",_____,_____)
sub("^[^.]+","",___)
sub("("(_____)")+([.]|$)",",&",_)
gsub(_____,"&,",_)
sub("([,]*[.].*)?$",___,_)
sub("[,]+$",".",_)
sub("^[ \t]*[+-]?[,]+",\
substr("-",__==__,_~"^[ \t]*[-]"),_)
CONVFMT=____;
return _ }
function printProgress(_____) {
print sprintf(" For-Loop "\
"# %d | Lgth-Gain Ratio %8.1fx | "\
"#-0\47s : %16s | %9s secs (cum.)",
++_-_____+!--_,
(___^-!!___)*(___=length(__)),
comma(___),timer())
return fflush() }
function timer(_,__,___) {
return sprintf("%.*s%.*s%*.*f",
!(_+=_+=(_=(_=__=___="")~_)+_),
srand(),
!(__+=__=--_),___=int((__^=++_)*\
+substr(srand(),++_)),
__%(++_-+-++_),_+=_=(++_+--_)%--_,
!_TIME0_?(_TIME0_=___)*!__\
:(___-_TIME0_)/__)
} BEGIN {
____=__+=__*=___=__^=(__+=++__)^__;
gsub("",__,__)
gsub(".",!__,__)
print ORS,"Starting point :: length = ",
___=length(__=substr(__,__~__,_=4)),
" < \42"(__)"\42 > | time = " ,
timer(), "secs", ORS;
for(_____=_;_____;_____--) {
gsub("",__,__)
printProgress(_____) } }' )) ;
sleep 2;
( time (python3 -c 'print(len("0"*(5**4**2-1)))' ))|lgp3;
UPDATE 1 : generic version of it
depending on which variant of awk, it automatically switches to BAU binary doubler, and for the squarer, the speed gains are so immense it's actually faster to square, then substring than trying to be pin-drop accurate at each level, which inherently slows it down. _mN is the binary double, genZeros is the exp-sqr one
mawk2 '
function _mN(__,____,_,___,_____,______){
return \
(_=_~_)<+__?(_____=_mN((__-(\
______=__%(_+=_)))/_,____,___))___\
(_____)(______?(___)____:""):__<_?"":____
}
function genZeros(_,__,___,____) {
if (+_<=+((__=___=____="")<__)) {
return \
(-_<+_) ? +__:___;
} else if (FLG1*(_~"...")) {
return _mN(_,_<_)
}
____=__+=__*=___=__^=(__+=++__)^__;
gsub("",__,__)-gsub(".",!__,__)
if (_<__){ return \
substr((__)__,_~_,_)
}; ___=(___=__)___;__="";
__+=++__+__;
_-=__*(__=int((_/__)^(--__)^-!!_))*__;
___=(__<length(___))\
? substr(___,__~__,__) \
: sprintf("%"(+"")"*.f",__,!__)
gsub(".",(___)(___)___,___)
return \
!_?___:FLG2?\
sprintf("%"(_<_)"*.f",+_,_)___:_mN(+_,_<_)
} BEGIN { CONVFMT=\
OFMT="%.20g"
FLG1=(+sprintf("%o",-1)<"0x1")
FLG2=!(sprintf("%d",3E10)%2)
} {
print length(genZeros(3^18)),3^18 }'
387420489 387420489
=================================
Doubling each loop is too slow. To get even better performance, you need exponential squaring. how about 88 milli-seconds :
echo; ( time (gawk -v n=100000000 -Sbe 'BEGIN {
CONVFMT=OFMT="%.20g"
__="x";____="."
_=10^((___=3)*___)
gsub(____,__,_)
while(___--){
gsub(____,_,_)
}
print length(_) }' ) ) | lgp3
( gawk -v n=100000000 -Sbe ; )
0.07s user 0.01s system 95% cpu 0.088 total
100000000
And that's only for gawk. For mawk-2, it's an unearthly 32 msecs
echo; ( time ( mawk2 -v n=100000000 'BEGIN { CONVFMT=OFMT="%.20g"
__="x";____=".";
_=10^((___=3)*___)
gsub(____,__,_)
while(___--) {
gsub(____,_,_) }
print length(_) }' ) ) | lgp3
( mawk2 -v n=100000000 ; )
0.01s user 0.02s system 84% cpu 0.032 total
100000000
404 msecs is all it takes to go from just 1 single copy of "x", to nearly 4.3 billion copies of it :
( time ( mawk2 '
BEGIN {
___=(_+=_+=++_)^(_*_)
__=length(_=(_="x")(_)_);++___
while(__*__<___) {
gsub("",_,_)
__*=++__
print --__
}
print ORS,length(_),ORS }' ) )
15
255
65535
4294967295
4294967295
0.09s user 0.25s system 84% cpu 0.404 total
UPDATE :
benchmarking generating 25.7 billion zeros — at this size level, python3's built-in repeater is being left in the dust
echo; (fg && fg && fg ) 2>/dev/null;
echo;
( time ( mawk2 '
BEGIN { _=_<_
__= 4
while(__--) {
gsub("",(_)(_)_,_)
}
print _ }'
) \
| pvE9 | LC_ALL=C gwc -lc ) | ggXy3 | ecp
sleep 1.5
echo
( time ( python3 -c 'print("0"*25710762175)'
) \
| pvE9 | LC_ALL=C gwc -lc ) | ggXy3 | ecp ; echo
out9: 23.9GiB 0:00:08 [2.85GiB/s] [2.85GiB/s] [ <=> ]
( mawk2 'BEGIN { _=_<_;__=4; while(__--){ gsub("",(_)(_)_,_) }; print _ }'; ) 0.79s user 4.72s system 63% cpu 8.729 total
pvE 0.1 out9 0.54s user 2.93s system 41% cpu 8.435 total
LC_ALL=C gwc -lc 2.03s user 1.46s system 41% cpu 8.434 total
1 25710762176
out9: 23.9GiB 0:00:11 [2.17GiB/s] [2.17GiB/s] [ <=> ]
( python3 -c 'print("0"*25710762175)'; ) 1.24s user 6.84s system 72% cpu 11.076 total
pvE 0.1 out9 0.56s user 2.80s system 30% cpu 11.075 total
LC_ALL=C gwc -lc 2.03s user 1.46s system 31% cpu 11.074 total
1 25710762176
UPDATE 2 : (partially unrelated, but to illustrate my thought process)
Once you realize there's a predictable and deterministic pattern to how the pop-count string appears and repeats, you can create the full pop-count string for all 256-bytes sequentially using only 4-cycles of the do-while loop,
without using hardware instructions, or having to go through every byte one-by-one, either in decimal integer bit-string form, in order to create a pop-count string. The one thing that's crucial though, is setting a CONVFMT value large enough to avoid scientific notation truncation, since the essence of the function is performing multiple 16-digit integer adds (getting near 2^53 but never exceeding) ::
<<<'' mawk '
function initPopCountStr(_,__,
___,____,_____,______) {
# No.# of set:bits per:byte
#
__=_=(_<_)(______=___=_~_)
______+=++______;
do {_=\
(__=_)(__+(___=(___)___))
} while (--______)
return \
(((_=(_)(____=(__+___)(___+__+___))\
____)(_____=((__+(___+=______=___)))\
(___+=__+______)))____\
(_____)_____)(___)(______+___)
} BEGIN {
CONVFMT="%.20g"
} {
print initPopCountStr() }' \
\
| gsed -zE 's/(.{16})(.{16})/\1:\2\n/g' |nonEmpty|gcat -n|lgp3 4
1 0112122312232334:1223233423343445
2 1223233423343445:2334344534454556
3 1223233423343445:2334344534454556
4 2334344534454556:3445455645565667
5 1223233423343445:2334344534454556
6 2334344534454556:3445455645565667
7 2334344534454556:3445455645565667
8 3445455645565667:4556566756676778
Or if you prefer no loops at all, and have everything folded in into one fell swoop :
function initPopCountStr(_,__,
___,____,_____,______) {
# No.# of set:bits per:byte
#
return \
\
(((_=(_=(__=(_=(__=(_=(__=(_=(__=_=(_<_)(______=___=_~_))\
(__+(___=(___)___)) ))(__+(___=(___)___))))\
(__+(___=(___)___))))\
(__+(___=(___)___)))(____=(__+___)(___+__+___))____)\
(_____=((__+(___+=______=___)))(___+=__+______)))\
(____)(_____)_____)(___)(______+___)
}
even the ultra lazy method is only 0.791 secs :
out9: 95.4MiB 0:00:00 [ 129MiB/s] [ 129MiB/s] [<=> ]
( mawk2 -F= 'NF=+$_+(OFS=$_=$NF)' ORS= <<< '100000000=x'; )
0.59s user 0.16s system 95% cpu 0.791 total
And the fast method is 16 milli seconds
fg;fg; ( time ( <<< '100000000=x' mawk2 -F= '{
_ = sprintf("%*s",__=int((____=+$(_=\
+(___="."))) ^ +(++_/(++_*_))),"")
gsub(___,$NF,_)
do { gsub(___,_,_)
} while((__*=__)<____)
print length(_),substr(_,1,31) }' ) ) | gcat -n | lgp3
fg: no current job
fg: no current job
( mawk2 -F= <<< '100000000=x'; )
0.00s user 0.01s system 87% cpu 0.016 total
1 100000000 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
UPDATE : making 3,486,784,401 copies of "x" in 0.323 secs
fg;fg; ( time ( <<< '43046721=x' mawk2 -F= '
{
________=substr("",+srand()) +substr(srand(),7)
_=sprintf("%*s",__=int((___=+$(_=+\
(_____=".")))^(++_/(++_*_))),"")
gsub(_____,$NF,_)
do { gsub(_____, _, _) } while((__*=__)<___)
print sprintf(" %15.8f GiB | %20.13f",
(__=length(_))/8^10,
((substr("",+srand())+substr(srand(),7))-\
(___=________))/(1+4^-24))
print substr(_,1,33)
CONVFMT=OFMT="%.14g"
____=9
while(--____){
sub(".+","&&&&&&&&&",_)
print sprintf(" %15.8f GiB | %20.13f",
length(_)/8^10,((substr("",+srand()) +\
substr(srand(),7))-___)/(1+4^-24)); fflush() } }' ) )
fg: no current job
fg: no current job
0.04009038 GiB | 0.0077791213989
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
0.36081345 GiB | 0.0445649623871
3.24732103 GiB | 0.3233759403229
29.22588923 GiB | 3.5619049072266
263.03300306 GiB | 58.9802551269529
Grow it at a different rate and we get :
103 billion copies of "x" just shy of 17 seconds
=
0.04304672 x 10^9 | 0.0111348628998
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
0.3013270470 x 10^9 ( 301327047)| 0.0563540458679
2.1092893290 x 10^9 ( 2109289329)| 0.2259490489960
14.7650253030 x 10^9 ( 14765025303)| 1.4351949691773
103.3551771210 x 10^9 ( 103355177121)| 16.9313509464274
And relatively trivial to adjust it for arbitrary sizes :
{m,g}awk '{ # the timer bit is for mawk-2
CONVFMT = "%.25g"
OFMT = "%.13f"
________ = substr("", +srand()) + substr(srand(), 7)
_ = sprintf("%*s", __=int((___= +$(_=\
+(_____="."))^ (++_/(++_*_))),"")
gsub(_____, $NF, _)
do {
print ___, gsub(_____, _, _),
((substr("",+srand()) +
substr(srand(), 7)) -________)/(1-4^-25)
} while ((__*=__)<___ && __<4^8)
print ".", length(_)
print \
sprintf("\n----------------\n%42.f .%20.13f",
(__=length(_))==___ ?__: (__=length(_= sprintf("%.*s",
___-__,_)(_))),((substr("", +srand()\
)+substr(srand(),7))-________) / (1-4^-25))
print substr(_, 1, 19)
}'
|
( mawk2 -F= <<< '3987654321=x'; )
0.20s user 0.55s system 94% cpu 0.799 total
3987654321 251 0.0000970363617
3987654321 63001 0.3465299606323
. 3969126001
----------------
3987654321 . 0.7496919631958

Match specific pattern and print just the matched string in the previous line

I have updated the question with additional information.
I have a .fastq file formatted in the following way
#M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8 (sequence name)
CATCTACATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC.. (sequence)
+
ACCCGGGGGGGGGDGGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFF.. (sequence quality)
For each sequence the format is the same (repetition of 4 lines)
What I am trying to do is search for a specific regex pattern ([A-Z]{5,}ACA[A-Z]{5,}ACA[A-Z]{5,}) in a window of n=35 characters of the 2nd line, cut it out if found, and report it at the end of the previous line.
So far I've written some code that does almost what I want. I thought of using the match function together with substr on my window of interest, but I didn't achieve my goal. Here is script.awk:
match(substr($0,0,35),/regexp/,a) {
print p,a[0] #print the previous line respect to the matched one
print #print the current line
for(i=0;i<=1;i++) { # print the 2 lines following
getline
print
}
}#store previous line
{ p = $0 }
Starting from a file like this:
#M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8
AACATCTACATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC..
+
GGGGGGGGDGGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF..
I would like to obtain an output like this:
#M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8 TATTCACATATAGACATGAAA #is the string that matched the regexp WITHOUT initial AA that doesn't match my expression
ATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC #without initial AA
+
GGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF # without "GGGGGGGGDGGGFGGGGGGFGGG" that is the same number of characters removed in the 2nd line
$ cat tst.awk
BEGIN {
    tgtStr = "pattern"
    tgtLgth = length(tgtStr)
    winLgth = 35
    numLines = 4
}
{
    lineNr = ( (NR-1) % numLines ) + 1
    rec[lineNr] = $0
}
lineNr == numLines {
    if ( idx = index(substr(rec[2],1,winLgth),tgtStr) ) {
        rec[1] = rec[1] " " tgtStr
        rec[2] = substr(rec[2],idx+tgtLgth)
        rec[4] = substr(rec[4],idx+tgtLgth)
    }
    for ( lineNr=1; lineNr<=numLines; lineNr++ ) {
        print rec[lineNr]
    }
}
$ awk -f tst.awk file
#M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8 pattern
ATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC..
+
GGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF..
wrt the code you posted:
substr($0,0,35) - strings, fields, line numbers, and arrays in awk start at 1 not 0 so that should be substr($0,1,35). Awk will compensate for your mistake and treat it as if you had written 1 instead of 0 in this case but get used to starting everything at 1 to avoid mistakes when it matters.
for(i=0;i<=1;i++) - should be for(i=1;i<=2;i++) for the same reason.
getline - not an appropriate use and syntactically fragile.
Update - per your comment below that pattern is actually a regexp rather than a string:
$ cat tst.awk
BEGIN {
    tgtRegexp = "[A-Z]{5,}ACA[A-Z]{5,}ACA[A-Z]{5,}"
    winLgth = 35
    numLines = 4
}
{
    lineNr = ( (NR-1) % numLines ) + 1
    rec[lineNr] = $0
}
lineNr == numLines {
    if ( match(substr(rec[2],1,winLgth),tgtRegexp) ) {
        rec[1] = rec[1] " " substr(rec[2],RSTART,RLENGTH)
        rec[2] = substr(rec[2],RSTART+RLENGTH)
        rec[4] = substr(rec[4],RSTART+RLENGTH)
    }
    for ( lineNr=1; lineNr<=numLines; lineNr++ ) {
        print rec[lineNr]
    }
}
I warn you, I wanted to have some fun and it is twisted.
awk -v pattern=pattern -v window=15 '
BEGIN{RS="#";FS=OFS="\n"}
{pos = match($2, pattern); n_del=pos+length(pattern)}
pos && (n_del<=window){$1 = $1 " " pattern; $2=substr($2, n_del); $4=substr($4, n_del)}
NR!=1{printf "%s%s", RS, $0}
' file
Input :
#M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8
CATCTACpatternATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC..
+
ACCCGGGGGGGGGDGGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF..
#M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8
CATCTACGCpatternATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC..
+
ACCCGGGGDGGGGGGDGGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF..
Output :
#M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8 pattern
ATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC..
+
GGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF..
#M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8
CATCTACGCpatternATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC..
+
ACCCGGGGDGGGGGGDGGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF..
Second block is not updated because window is 15 and it cannot find the pattern within this window.
I used the variable RS to handle each entire 4-line block as $0, with $1, $2, $3 and $4 as its lines. Because the input file starts with RS and does not end with RS, I preferred not to set ORS and to use printf instead of print.

Awk average of n data in each column

"Using awk to bin values in a list of numbers" provide a solution to average each set of 3 points in a column using awk.
How can it be extended to an arbitrary number of columns while maintaining the format? For example:
2457135.564106 13.249116 13.140903 0.003615 0.003440
2457135.564604 13.250833 13.139971 0.003619 0.003438
2457135.565067 13.247932 13.135975 0.003614 0.003432
2457135.565576 13.256441 13.146996 0.003628 0.003449
2457135.566039 13.266003 13.159108 0.003644 0.003469
2457135.566514 13.271724 13.163555 0.003654 0.003476
2457135.567011 13.276248 13.166179 0.003661 0.003480
2457135.567474 13.274198 13.165396 0.003658 0.003479
2457135.567983 13.267855 13.156620 0.003647 0.003465
2457135.568446 13.263761 13.152515 0.003640 0.003458
averaging values every 5 lines, should output something like
2457135.564916 13.253240 13.143976 0.003622 0.003444
2457135.567324 13.270918 13.161303 0.003652 0.003472
where the first result is the average of lines 1-5, and the second result is the average of lines 6-10.
The accepted answer to Using awk to bin values in a list of numbers is:
awk '{sum+=$1} NR%3==0 {print sum/3; sum=0}' inFile
The obvious extension to average all the columns is:
awk 'BEGIN { N = 3 }
     { for (i = 1; i <= NF; i++) sum[i] += $i }
     NR % N == 0 { for (i = 1; i <= NF; i++)
                   {
                       printf("%.6f%s", sum[i]/N, (i == NF) ? "\n" : " ")
                       sum[i] = 0
                   }
                 }' inFile
The extra flexibility here is that if you want to group blocks of 5 rows, you simply change one occurrence of 3 into 5. This ignores blocks of up to N-1 rows at the end of the file. If you want to, you can add an END block that prints a suitable average if NR % N != 0.
For the sample input data, the output I got from the script above was:
2457135.564592 13.249294 13.138950 0.003616 0.003437
2457135.566043 13.264723 13.156553 0.003642 0.003465
2457135.567489 13.272767 13.162732 0.003655 0.003475
You can make the code much more complex if you want to analyze what the output formats should be. I've simply used %.6f to ensure 6 decimal places.
If you want N to be a command-line parameter, you can use the -v option to relay the variable setting to awk:
awk -v N="${variable:-3}" \
    '{ for (i = 1; i <= NF; i++) sum[i] += $i }
     NR % N == 0 { for (i = 1; i <= NF; i++)
                   {
                       printf("%.6f%s", sum[i]/N, (i == NF) ? "\n" : " ")
                       sum[i] = 0
                   }
                 }' inFile
When invoked with $variable set to 5, the output generated from the sample data is:
2457135.565078 13.254065 13.144591 0.003624 0.003446
2457135.567486 13.270757 13.160853 0.003652 0.003472
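The END block mentioned earlier, for a trailing block of fewer than N rows, could be sketched roughly as follows. This is one possible variant, not the answer's own code: it counts rows in a variable (rows) instead of testing NR % N, saves NF in nf because not every awk preserves NF in END, and divides the last block by the number of rows it actually contains. The names avg_blocks, emit, rows and nf are invented for this sketch.

```shell
# Sketch: average every N rows per column, and also print the average of
# a shorter trailing block. avg_blocks, emit, rows and nf are hypothetical.
avg_blocks() {   # usage: avg_blocks N < file
  awk -v N="$1" '
    { for (i = 1; i <= NF; i++) sum[i] += $i; nf = NF; rows++ }
    rows == N { emit() }
    END { if (rows > 0) emit() }      # leftover rows, if any
    function emit(   i) {
      for (i = 1; i <= nf; i++) {
        printf("%.6f%s", sum[i] / rows, (i == nf) ? "\n" : " ")
        sum[i] = 0
      }
      rows = 0
    }'
}
printf '1 10\n2 20\n3 30\n4 40\n5 50\n' | avg_blocks 2
```

With five input rows and N=2, the third output line is the "average" of the single leftover row. Whether that is a suitable value for your data is a judgment call.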

Busybox awk: How to treat each character in String as integer to perform bitwise operations?

I want to change the SSID (Wi-Fi network name) dynamically in OpenWRT via a script which grabs information from the internet.
Because the information grabbed from the internet may contain multi-byte characters, it can easily be truncated into an invalid UTF-8 byte sequence, so I want to use awk (busybox) to fix it. However, when I try to use the bitwise function and() on a string and an integer, the result is always 0.
awk 'BEGIN{v="a"; print and(v,0xC0)}'
How can I treat a character in a string as an integer in awk, like we can do in C/C++? char p[]="abc"; printf ("%d",*(p+1) & 0xC0);
You can make your own ord function like this, heavily borrowed from the GNU Awk User's Guide:
#!/bin/bash
awk '
BEGIN {
    _ord_init()
    printf("ord(a) = %d\n", ord("a"))
}
function _ord_init(    low, high, i, t)
{
    low = sprintf("%c", 7) # BEL is ascii 7
    if (low == "\a") {     # regular ascii
        low = 0
        high = 127
    } else if (sprintf("%c", 128 + 7) == "\a") {
        # ascii, mark parity
        low = 128
        high = 255
    } else {               # ebcdic(!)
        low = 0
        high = 255
    }
    for (i = low; i <= high; i++) {
        t = sprintf("%c", i)
        _ord_[t] = i
    }
}
function ord(str,    c)
{
    # only first character is of interest
    c = substr(str, 1, 1)
    return _ord_[c]
}'
Output
ord(a) = 97
I don't know if it's what you mean since you didn't provide sample input and expected output, but take a look at this with GNU awk and maybe it'll help:
$ gawk -lordchr 'BEGIN{v="a"; print v " -> " ord(v) " -> " chr(ord(v))}'
a -> 97 -> a
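For the UTF-8 use case in the question, a lookup table limited to 0..127 would not cover the bytes of multi-byte characters. The sketch below is one way the table idea might be extended. It assumes an awk that treats strings as plain bytes under LC_ALL=C (gawk and mawk do; busybox awk is not verified here), and it uses integer division in place of and() so that it does not depend on any bitwise function. The names utf8_roles and byteval are made up for this example.

```shell
# Sketch: label each byte of a line by its top two bits.
# int(code / 64) carries the same information as and(code, 0xC0):
#   3 -> 11xxxxxx (first byte of a multi-byte sequence, loosely "lead")
#   2 -> 10xxxxxx (continuation byte)
#   0 or 1 -> plain ASCII
# Assumption: under LC_ALL=C this awk handles strings byte by byte.
# utf8_roles and byteval are hypothetical names.
utf8_roles() {
  LC_ALL=C awk '
    BEGIN { for (i = 1; i <= 255; i++) byteval[sprintf("%c", i)] = i }
    {
      out = ""
      for (k = 1; k <= length($0); k++) {
        code = byteval[substr($0, k, 1)]
        top2 = int(code / 64)
        if (top2 == 3)      role = "lead"
        else if (top2 == 2) role = "cont"
        else                role = "ascii"
        out = out (k > 1 ? " " : "") role
      }
      print out
    }'
}
printf 'a\303\251\n' | utf8_roles    # "a" followed by the two bytes of U+00E9
```

A string that ends in a "lead" byte, or in fewer "cont" bytes than its lead byte calls for, has been cut in the middle of a character.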

awk - Skip Header, Add Columns

I need to create a script that uses awk to sum the values of several columns that are output by an LSF command. I also need the script to skip the headers in the first line. This is what I have so far; will it work? I'm not sure that it will properly skip the first line and add the others. I would test it, but I do not have access to the LSF machines.
bhosts | awk '
BEGIN { running=suspended=reserved=0; }
NR < 2 { next }
(running = running + $6)
(a = a + $7)
(b = b + $8)
(suspended = a + b)
(reserved = reserved + $9)
END {
...
...
}'
exit
I can't test either. This would be better asked on http://codereview.stackexchange.com, but if you want to do some calculations on every line except the first one:
bhosts | awk '
NR >= 2 {
running += $6
a += $7
b += $8
suspended = a + b
reserved += $9
}
END {
...
}
'
Undeclared variables are automatically treated as zero in numeric context, so it's not strictly necessary to declare them.
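Since neither of us has an LSF machine at hand, the logic can at least be tried against fake input. The sketch below feeds invented rows through the same kind of NR >= 2 block. The header names and column positions are an assumption based on the $6..$9 used in the question, and sum_jobs is a hypothetical function name.

```shell
# Sketch: skip the header line, then sum columns 6-9 with no BEGIN
# initialisation (unset awk variables act as 0 in numeric context).
# The sample rows and the header layout are made up for testing.
sum_jobs() {
  awk '
    NR >= 2 {
      running   += $6
      suspended += $7 + $8
      reserved  += $9
    }
    END { printf "running=%d suspended=%d reserved=%d\n", running, suspended, reserved }'
}
printf '%s\n' \
  'HOST_NAME STATUS JL/U MAX NJOBS RUN SSUSP USUSP RSV' \
  'hostA ok - 8 5 3 1 1 0' \
  'hostB ok - 8 4 2 0 1 1' | sum_jobs
```

On the real system you would replace the printf with bhosts itself and check the column positions against its actual header.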