I can use the following code to generate a string.
$ awk -e 'BEGIN { for(i=1;i<=10;++i) s = s "x"; print s }'
xxxxxxxxxx
But its complexity is super-linear wrt the string length.
$ time awk -e 'BEGIN { for(i=1;i<=10000000;++i) s = s "x" }'
real 0m0.868s
user 0m0.857s
sys 0m0.008s
$ time awk -e 'BEGIN { for(i=1;i<=100000000;++i) s = s "x" }'
real 0m9.886s
user 0m9.801s
sys 0m0.065s
$ time awk -e 'BEGIN { for(i=1;i<=1000000000;++i) s = s "x" }'
real 1m46.074s
user 1m45.171s
sys 0m0.760s
Is there a better way to repeat a char n times and assign the result to a variable?
Use sprintf to create a string of spaces, then use gsub to replace each space with an x:
$ time awk 'BEGIN {s = sprintf("%*s", 100000000, ""); gsub(".", "x", s)}'
real 0m1.744s
user 0m1.645s
sys 0m0.098s
This can be wrapped in an awk function:
function mkStr(c, n, s) {
s = sprintf("%*s", n, "");
gsub(".", c, s);
return s;
}
(s is a parameter simply to scope the variable to the function; it needs no argument, and indeed, any argument passed will be ignored.)
Update: there appears to be a significant difference in performance depending on which version of awk you are using. The above test used 20070501, the BSD(?) awk that ships with macOS. gawk-5.1.0 takes significantly longer.
I don't know what accounts for the difference; perhaps there is a solution that is fast in both versions.
Update 2: Ed Morton (in the comments) has verified that gsub is responsible for the slow running time in gawk, could not find a workaround, and has filed a bug report with the maintainers.
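As a quick check of the mkStr function above, here is a minimal sketch that wraps it in a shell function. The wrapper name mkstr_demo is made up for this example. It assumes c contains no & or backslash, since both are special in a gsub replacement string.

```shell
# mkstr_demo is a hypothetical wrapper used only for this illustration.
mkstr_demo() {
    awk -v c="$1" -v n="$2" '
    function mkStr(c, n,    s) {
        s = sprintf("%*s", n, "")   # a string of n spaces
        gsub(".", c, s)             # replace every character with c
        return s
    }
    BEGIN { print mkStr(c, n) }'
}

mkstr_demo x 10    # prints xxxxxxxxxx
```

Because gsub replaces each space with the whole of c, passing a multi-character c repeats that string n times.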
function loop(n, i, s){ # i and s are locals, so a repeated call starts from an empty string
for(i=1;i<=n;i++) s = s "x";
return s;
}
function repl(n){
s = sprintf("%*s", n, "");
gsub(/ /, "x", s);
return s;
}
function recStack(n, h){
switch( n ){
case 0:
return "";
default:
if( n % 2 == 1 ){
h = recStack( int((n-1)/2) )
return h h "x";
} else {
h = recStack( int(n/2) )
return h h;
}
}
}
function recStackIf(n, h){
if( n == 0 ) return "";
if( n % 2 == 1 ){
h = recStackIf( int((n-1)/2) ); # create first half
return h h "x"; # duplicate + one "x"
} else {
h = recStackIf( int(n/2) ); # create first half
return h h; # duplicate
}
}
function recArray(n, h, n2){
if( n in a ) return a[n];
switch( n ){
case 0:
return a[0] = "";
default:
if( n % 2 == 1 ){
n2 = int((n-1)/2);
h = recArray( n2 );
return a[n] = h h "x";
} else {
n2 = int(n/2);
h = recArray( n2 );
return a[n] = h h;
}
}
}
function recArrayIf(n, h, n2){
if( n in a ) return a[n];
if( n == 0 ) return a[0] = "";
if( n % 2 == 1 ){
n2 = int((n-1)/2);
h = recArrayIf( n2 );
return a[n] = h h "x";
} else {
n2 = int(n/2);
h = recArrayIf( n2 );
return a[n] = h h;
}
}
function concat(n){
exponent = log(n)/log(2)
m = int(exponent) # floor
m += (m < exponent ? 1 : 0) # ceiling
s = "x"
for (i=1; i<=m; i++) {
s = s s
}
s = substr(s,1,n)
return s
}
BEGIN {
switch( F ){
case "recStack":
xxx = recStack( 100000000 );
break;
case "recStackIf":
xxx = recStackIf( 100000000 );
break;
case "recArray":
xxx = recArray( 100000000 );
break;
case "recArrayIf":
xxx = recArrayIf( 100000000 );
break;
case "loop":
xxx = loop( 100000000 );
break;
case "repl":
xxx = repl( 100000000 );
break;
case "concat":
xxx = concat( 100000000 );
break;
}
print length(xxx);
## xloop = loop(100000000 );
## if( xxx == xloop ) print "Match";
}
Times are:
# loop : real 0m5,405s, user 0m5,243s, sys 0m0,160s
# repl : real 0m7,670s, user 0m7,506s, sys 0m0,164s
# recArray: real 0m0,302s, user 0m0,141s, sys 0m0,161s
# recArrayIf: real 0m0,309s, user 0m0,168s, sys 0m0,141s
# recStack: real 0m0,316s, user 0m0,124s, sys 0m0,192s
# recStackIf: real 0m0,305s, user 0m0,152s, sys 0m0,152s
# concat: real 0m0,664s, user 0m0,300s, sys 0m0,364s
There's not much difference among the 5 versions of binary decomposition: all of them use a fair amount of heap memory. Keeping the global array around until the program exits isn't good, so I'd prefer either stack version.
wlaun@terra:/tmp$ gawk -V
GNU Awk 5.0.1, API: 2.0 (GNU MPFR 4.0.2, GNU MP 6.2.0)
wlaun@terra:/tmp$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Note that the above timing has been done with a statement printing the resulting string's length, which adds about 0.2 s to each version. Also, /usr/bin/time isn't reliable. Here are the relative "real" values from time without the print length(xxx):
# loop: 0m5,248s
# repl: 0m7,705s
# recStack: 0m0,103s
# recStackIf: 0m0,097s
# recArray: 0m0,103s
# recArrayIf: 0m0,099s
# concat: 0m0,455s
Added on request of Ed Morton:
Why is any of the recursive functions faster than any of the linear functions that iterate over O(N) elements? ("O(N)" is big-O notation: it denotes a quantity that grows in proportion to N, ignoring constant factors and lower-order terms. A circle's circumference is O(r); its area is O(r²).)
The answer is simple: By dividing N by 2, we get two strings of length O(N/2). This provides the possibility of re-using the result for the first half (no matter how we obtain it) for the second half! Thus, we'll get the second half of the result for free (except for the string copy operation, which is basically a machine instruction on most popular architectures). There is no reason why this great idea should not be applied for creating the first half as well, which means that we get three quarters of the result for free (except - see above). A little overhead results from the single "x" we have to throw in to cater for odd subdivisions of N.
There are many other recursive algorithms built on the idea of halving the input and dealing with the sections individually; the most famous of them are binary search and quicksort.
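To make the halving argument concrete, here is a small sketch that counts how many times the recursive function is entered. The names count_calls and rep and the global counter calls are additions for illustration only; the logic follows recStackIf above.

```shell
# count_calls is a hypothetical helper; "calls" is a counter added for illustration.
count_calls() {
    awk -v n="$1" '
    function rep(n,    h) {
        calls++                               # one call per halving step
        if (n == 0) return ""
        h = rep(int(n / 2))                   # build one half (rounded down) once
        return (n % 2) ? (h h "x") : (h h)    # reuse it for the other half
    }
    BEGIN { s = rep(n); print length(s), calls }'
}

count_calls 10    # prints: 10 5
```

The number of calls is the number of binary digits of n plus one, so n = 100000000 (27 binary digits) needs only 28 calls, compared with 100000000 iterations of the plain loop.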
Here's a fast solution using any POSIX awk (tested on an average 8G RAM laptop running cygwin with GNU awk 5.1.0):
time awk -v n=100000000 'BEGIN {
exponent = log(n)/log(2)
m = int(exponent) # floor
m += (m < exponent ? 1 : 0) # ceiling
s = "x"
for (i=1; i<=m; i++) {
s = s s
}
s = substr(s,1,n)
}'
real 0m0.665s
user 0m0.328s
sys 0m0.343s
The above just appends a copy of the string to itself until it's at least as big as the target string length and finally truncates it to exactly the desired length. The only somewhat tricky part is calculating how many times to append s to itself and that's just a case of solving 2^m = n for m given you already know n (100000000 in this case), see https://www.calculatorsoup.com/calculators/algebra/exponentsolve.php.
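As a sanity check on the value of m, the sketch below computes it the same way as the script above and also by simply counting doublings. The function name doublings and the variables c and len are made up for this example.

```shell
# doublings is a hypothetical helper used only for this illustration.
doublings() {
    awk -v n="$1" 'BEGIN {
        exponent = log(n) / log(2)
        m = int(exponent)                   # floor
        m += (m < exponent ? 1 : 0)         # ceiling
        c = 0
        for (len = 1; len < n; len *= 2)    # count doublings directly
            c++
        print m, c, len
    }'
}

doublings 100000000    # prints: 27 27 134217728
```

So 27 doublings produce 134217728 characters, which substr() then trims to 100000000. When n is an exact power of two, floating-point rounding in log(n)/log(2) may push m one higher than necessary; that costs one extra doubling but the final substr() still yields the right length.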
Obviously you could make the loop while ( length(s) < n ) instead of calculating m and that'd make the script briefer but would slow it down a bit (but it'd still be pretty fast):
$ time awk -v n=100000000 'BEGIN{s="x"; while (length(s) < n) s=s s; s=substr(s,1,n)}'
real 0m1.072s
user 0m0.562s
sys 0m0.483s
@JamesBrown had a better idea than calling length() on each iteration that also avoids having to calculate m while being about as fast:
$ time awk -v n=100000000 'BEGIN{s="x"; for (i=1; i<n; i*=2) s=s s; s=substr(s,1,n)}'
real 0m0.710s
user 0m0.281s
sys 0m0.390s
Originally I had the following, thinking that doubling a long string of "x"s would be a faster approach than starting from a single "x" and doubling it on each iteration of the loop, but it was a bit slower:
$ time awk -v n=100000000 '
BEGIN {
d = 1000000
m = 7
s = sprintf("%*s",d,"")
gsub(/ /,"x",s)
for (i=1; i<=m; i++) {
s = s s
}
s = substr(s,1,n)
}
'
real 0m1.030s
user 0m0.625s
sys 0m0.375s
The idea in that 2nd script was to generate a string long enough to be meaningful but short enough that gsub() can convert all blanks to "x"s quickly (which I've found to be 1000000 chars by trial and error), then just repeat the above process with fewer iterations.
I opened a bug report with the gawk maintainers about gsub() being slow for this case (see https://lists.gnu.org/archive/html/bug-gawk/2021-07/msg00030.html if you're interested), but it looks like the outcome will just be that gsub() is a heavy tool, so it's not surprising that 100 million regexp matches take a long time. You can speed it up considerably by setting LC_ALL=C before it runs so gawk doesn't have to worry about multibyte locales.
If you are okay with perl:
# assigning to a variable
$ perl -e '$s = "x" x 10'
# printing the string
$ perl -e 'print "x" x 10'
xxxxxxxxxx
Note that there's no newline for the print in the above example; use perl -le if you want one.
Here's a timing comparison:
$ time perl -e '$s = "x" x 100000000'
real 0m0.071s
# script.awk is the first code from Ed Morton's answer
$ time mawk -v n=100000000 -f script.awk
real 0m0.136s
$ time gawk -v n=100000000 -f script.awk
real 0m0.429s
Is there a better way to repeat a char n times and assign the result to a variable?
I propose the following solution, limited to the case where n equals 2^x and x is an integer greater than or equal to 1. For example, to get x repeated 32 (i.e. 2^5) times:
awk 'BEGIN{s="x";for(i=1;i<=5;i+=1)s=s s;print s}' emptyfile.txt
output
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
(tested in gawk 4.2.1)
extreme test case benchmark - comparing python3's built-in repeater running on compiled C-code, versus user-level scripting of a different algorithm, generating 152.58 bn zeros.
Starting point :: length = 4 < "0000" > | time = 0.000000 secs
For-Loop # 1 | Lgth-Gain Ratio 6.0x | #-0's : 24. | 0.000036 secs (cum.)
For-Loop # 2 | Lgth-Gain Ratio 26.0x | #-0's : 624. | 0.000050 secs (cum.)
For-Loop # 3 | Lgth-Gain Ratio 626.0x | #-0's : 390,624. | 0.000175 secs (cum.)
For-Loop # 4 | Lgth-Gain Ratio 390626.0x | #-0's : 152,587,890,624. | 21.485092 secs (cum.)
( mawk2 ; )
4.38s user 13.69s system 81% cpu 22.226 total
( python3 -c 'print(len("0"*(5**4**2-1)))'; )
4.34s user 17.32s system 71% cpu 30.291 total
152587890624
% ( time ( mawk2 '
function comma(_,__,___,____,_____) {
if(length(sprintf("%+.f",_))<=(\
__*=__+=__=__~__)) {
return \
index(_,".") ?_:_"."
}
____=CONVFMT;CONVFMT="%.1000g";___=_
_____="[0-9]"
gsub("^|$",_____,_____)
sub("^[^.]+","",___)
sub("("(_____)")+([.]|$)",",&",_)
gsub(_____,"&,",_)
sub("([,]*[.].*)?$",___,_)
sub("[,]+$",".",_)
sub("^[ \t]*[+-]?[,]+",\
substr("-",__==__,_~"^[ \t]*[-]"),_)
CONVFMT=____;
return _ }
function printProgress(_____) {
print sprintf(" For-Loop "\
"# %d | Lgth-Gain Ratio %8.1fx | "\
"#-0\47s : %16s | %9s secs (cum.)",
++_-_____+!--_,
(___^-!!___)*(___=length(__)),
comma(___),timer())
return fflush() }
function timer(_,__,___) {
return sprintf("%.*s%.*s%*.*f",
!(_+=_+=(_=(_=__=___="")~_)+_),
srand(),
!(__+=__=--_),___=int((__^=++_)*\
+substr(srand(),++_)),
__%(++_-+-++_),_+=_=(++_+--_)%--_,
!_TIME0_?(_TIME0_=___)*!__\
:(___-_TIME0_)/__)
} BEGIN {
____=__+=__*=___=__^=(__+=++__)^__;
gsub("",__,__)
gsub(".",!__,__)
print ORS,"Starting point :: length = ",
___=length(__=substr(__,__~__,_=4)),
" < \42"(__)"\42 > | time = " ,
timer(), "secs", ORS;
for(_____=_;_____;_____--) {
gsub("",__,__)
printProgress(_____) } }' )) ;
sleep 2;
( time (python3 -c 'print(len("0"*(5**4**2-1)))' ))|lgp3;
UPDATE 1 : generic version of it
Depending on which variant of awk is running, it automatically switches to the usual (BAU) binary doubler. For the squarer, the speed gains are so immense that it's actually faster to square and then take a substring than to try to be pin-drop accurate at each level, which inherently slows it down. _mN is the binary doubler; genZeros is the exp-sqr one.
mawk2 '
function _mN(__,____,_,___,_____,______){
return \
(_=_~_)<+__?(_____=_mN((__-(\
______=__%(_+=_)))/_,____,___))___\
(_____)(______?(___)____:""):__<_?"":____
}
function genZeros(_,__,___,____) {
if (+_<=+((__=___=____="")<__)) {
return \
(-_<+_) ? +__:___;
} else if (FLG1*(_~"...")) {
return _mN(_,_<_)
}
____=__+=__*=___=__^=(__+=++__)^__;
gsub("",__,__)-gsub(".",!__,__)
if (_<__){ return \
substr((__)__,_~_,_)
}; ___=(___=__)___;__="";
__+=++__+__;
_-=__*(__=int((_/__)^(--__)^-!!_))*__;
___=(__<length(___))\
? substr(___,__~__,__) \
: sprintf("%"(+"")"*.f",__,!__)
gsub(".",(___)(___)___,___)
return \
!_?___:FLG2?\
sprintf("%"(_<_)"*.f",+_,_)___:_mN(+_,_<_)
} BEGIN { CONVFMT=\
OFMT="%.20g"
FLG1=(+sprintf("%o",-1)<"0x1")
FLG2=!(sprintf("%d",3E10)%2)
} {
print length(genZeros(3^18)),3^18 }'
387420489 387420489
=================================
Doubling on each loop iteration is too slow. To get even better performance, you need repeated squaring. How about 88 milliseconds:
echo; ( time (gawk -v n=100000000 -Sbe 'BEGIN {
CONVFMT=OFMT="%.20g"
__="x";____="."
_=10^((___=3)*___)
gsub(____,__,_)
while(___--){
gsub(____,_,_)
}
print length(_) }' ) ) | lgp3
( gawk -v n=100000000 -Sbe ; )
0.07s user 0.01s system 95% cpu 0.088 total
100000000
And that's only for gawk. For mawk-2, it's an unearthly 32 msecs
echo; ( time ( mawk2 -v n=100000000 'BEGIN { CONVFMT=OFMT="%.20g"
__="x";____=".";
_=10^((___=3)*___)
gsub(____,__,_)
while(___--) {
gsub(____,_,_) }
print length(_) }' ) ) | lgp3
( mawk2 -v n=100000000 ; )
0.01s user 0.02s system 84% cpu 0.032 total
100000000
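For readers who find the underscore variable names hard to follow, here is a minimal readable sketch of the same squaring idea. The names square_repeat, rounds and s are made up for this example, and it assumes the repeated character is not & or a backslash.

```shell
# square_repeat is a hypothetical helper used only for this illustration.
square_repeat() {
    awk -v rounds="$1" 'BEGIN {
        s = "xxxxxxxxxx"              # start with 10 copies
        for (i = 1; i <= rounds; i++)
            gsub(/./, s, s)           # each char becomes a full copy of s: length is squared
        print length(s)
    }'
}

square_repeat 2    # prints: 10000
```

Each round squares the length (10, 100, 10000, 100000000), so three rounds reach the 100000000 characters used in the timings above.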
404 msecs is all it takes to go from just 1 single copy of "x", to nearly 4.3 billion copies of it :
( time ( mawk2 '
BEGIN {
___=(_+=_+=++_)^(_*_)
__=length(_=(_="x")(_)_);++___
while(__*__<___) {
gsub("",_,_)
__*=++__
print --__
}
print ORS,length(_),ORS }' ) )
15
255
65535
4294967295
4294967295
0.09s user 0.25s system 84% cpu 0.404 total
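The printed lengths 15, 255, 65535 follow from how an empty regexp behaves in gsub: it matches at each of the length+1 gaps around the characters, so a string of length n grows to n + (n+1)*n = (n+1)^2 - 1. A small sketch of just that step, with the made-up names empty_regex_growth and s, assuming an awk that treats the empty regexp this way (the output above shows mawk2 does):

```shell
# empty_regex_growth is a hypothetical helper used only for this illustration.
empty_regex_growth() {
    awk 'BEGIN {
        s = "xxx"
        for (i = 1; i <= 2; i++) {
            gsub("", s, s)      # insert a copy of s at each of the length(s)+1 gaps
            print length(s)
        }
    }'
}

empty_regex_growth    # prints 15, then 255
```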
UPDATE :
benchmarking generating 25.7 billion zeros — at this size level, python3's built-in repeater is being left in the dust
echo; (fg && fg && fg ) 2>/dev/null;
echo;
( time ( mawk2 '
BEGIN { _=_<_
__= 4
while(__--) {
gsub("",(_)(_)_,_)
}
print _ }'
) \
| pvE9 | LC_ALL=C gwc -lc ) | ggXy3 | ecp
sleep 1.5
echo
( time ( python3 -c 'print("0"*25710762175)'
) \
| pvE9 | LC_ALL=C gwc -lc ) | ggXy3 | ecp ; echo
out9: 23.9GiB 0:00:08 [2.85GiB/s] [2.85GiB/s] [ <=> ]
( mawk2 'BEGIN { _=_<_;__=4; while(__--){ gsub("",(_)(_)_,_) }; print _ }'; ) 0.79s user 4.72s system 63% cpu 8.729 total
pvE 0.1 out9 0.54s user 2.93s system 41% cpu 8.435 total
LC_ALL=C gwc -lc 2.03s user 1.46s system 41% cpu 8.434 total
1 25710762176
out9: 23.9GiB 0:00:11 [2.17GiB/s] [2.17GiB/s] [ <=> ]
( python3 -c 'print("0"*25710762175)'; ) 1.24s user 6.84s system 72% cpu 11.076 total
pvE 0.1 out9 0.56s user 2.80s system 30% cpu 11.075 total
LC_ALL=C gwc -lc 2.03s user 1.46s system 31% cpu 11.074 total
1 25710762176
UPDATE 2 : (partially unrelated, but to illustrate my thought process)
Once you realize there's a predictable and deterministic pattern to how the pop-count string appears and repeats, you can create the full pop-count string for all 256 byte values using only 4 cycles of the do-while loop,
without using hardware instructions and without going through every byte one by one (whether in decimal integer or bit-string form). The one crucial thing is setting a CONVFMT value large enough to avoid truncation to scientific notation, since the essence of the function is performing multiple 16-digit integer additions (getting near 2^53 but never exceeding it) ::
<<<'' mawk '
function initPopCountStr(_,__,
___,____,_____,______) {
# No.# of set:bits per:byte
#
__=_=(_<_)(______=___=_~_)
______+=++______;
do {_=\
(__=_)(__+(___=(___)___))
} while (--______)
return \
(((_=(_)(____=(__+___)(___+__+___))\
____)(_____=((__+(___+=______=___)))\
(___+=__+______)))____\
(_____)_____)(___)(______+___)
} BEGIN {
CONVFMT="%.20g"
} {
print initPopCountStr() }' \
\
| gsed -zE 's/(.{16})(.{16})/\1:\2\n/g' |nonEmpty|gcat -n|lgp3 4
1 0112122312232334:1223233423343445
2 1223233423343445:2334344534454556
3 1223233423343445:2334344534454556
4 2334344534454556:3445455645565667
5 1223233423343445:2334344534454556
6 2334344534454556:3445455645565667
7 2334344534454556:3445455645565667
8 3445455645565667:4556566756676778
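The repeating pattern can be seen more easily in a plain character-by-character version. This is a slower, readable sketch and not the numeric-addition trick used above; the names popcount_str, s, t and bit are made up for this example. Each pass appends a copy of the string with every digit increased by one, because setting one more bit adds one to every existing count.

```shell
# popcount_str is a hypothetical helper used only for this illustration.
popcount_str() {
    awk 'BEGIN {
        s = "0"                                  # pop-count of byte value 0
        for (bit = 1; bit <= 8; bit++) {         # one doubling per bit
            t = ""
            for (j = 1; j <= length(s); j++)
                t = t (substr(s, j, 1) + 1)      # same counts plus the newly set bit
            s = s t
        }
        print s
    }'
}

popcount_str | cut -c1-32    # prints: 01121223122323341223233423343445
```

The result is 256 characters long, and its first 32 characters match row 1 of the output above.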
Or if you prefer no loops at all, with everything folded into one fell swoop :
function initPopCountStr(_,__,
___,____,_____,______) {
# No.# of set:bits per:byte
#
return \
\
(((_=(_=(__=(_=(__=(_=(__=(_=(__=_=(_<_)(______=___=_~_))\
(__+(___=(___)___)) ))(__+(___=(___)___))))\
(__+(___=(___)___))))\
(__+(___=(___)___)))(____=(__+___)(___+__+___))____)\
(_____=((__+(___+=______=___)))(___+=__+______)))\
(____)(_____)_____)(___)(______+___)
}
even the ultra lazy method is only 0.791 secs :
out9: 95.4MiB 0:00:00 [ 129MiB/s] [ 129MiB/s] [<=> ]
( mawk2 -F= 'NF=+$_+(OFS=$_=$NF)' ORS= <<< '100000000=x'; )
0.59s user 0.16s system 95% cpu 0.791 total
And the fast method is 16 milliseconds
fg;fg; ( time ( <<< '100000000=x' mawk2 -F= '{
_ = sprintf("%*s",__=int((____=+$(_=\
+(___="."))) ^ +(++_/(++_*_))),"")
gsub(___,$NF,_)
do { gsub(___,_,_)
} while((__*=__)<____)
print length(_),substr(_,1,31) }' ) ) | gcat -n | lgp3
fg: no current job
fg: no current job
( mawk2 -F= <<< '100000000=x'; )
0.00s user 0.01s system 87% cpu 0.016 total
1 100000000 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
UPDATE : making 3,486,784,401 copies of "x" in 0.323 secs
fg;fg; ( time ( <<< '43046721=x' mawk2 -F= '
{
________=substr("",+srand()) +substr(srand(),7)
_=sprintf("%*s",__=int((___=+$(_=+\
(_____=".")))^(++_/(++_*_))),"")
gsub(_____,$NF,_)
do { gsub(_____, _, _) } while((__*=__)<___)
print sprintf(" %15.8f GiB | %20.13f",
(__=length(_))/8^10,
((substr("",+srand())+substr(srand(),7))-\
(___=________))/(1+4^-24))
print substr(_,1,33)
CONVFMT=OFMT="%.14g"
____=9
while(--____){
sub(".+","&&&&&&&&&",_)
print sprintf(" %15.8f GiB | %20.13f",
length(_)/8^10,((substr("",+srand()) +\
substr(srand(),7))-___)/(1+4^-24)); fflush() } }' ) )
fg: no current job
fg: no current job
0.04009038 GiB | 0.0077791213989
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
0.36081345 GiB | 0.0445649623871
3.24732103 GiB | 0.3233759403229
29.22588923 GiB | 3.5619049072266
263.03300306 GiB | 58.9802551269529
Grow it at a different rate and we get :
103 billion copies of "x" just shy of 17 seconds
=
0.04304672 x 10^9 | 0.0111348628998
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
0.3013270470 x 10^9 ( 301327047)| 0.0563540458679
2.1092893290 x 10^9 ( 2109289329)| 0.2259490489960
14.7650253030 x 10^9 ( 14765025303)| 1.4351949691773
103.3551771210 x 10^9 ( 103355177121)| 16.9313509464274
And relatively trivial to adjust it for arbitrary sizes :
{m,g}awk '{ # the timer bit is for mawk-2
CONVFMT = "%.25g"
OFMT = "%.13f"
________ = substr("", +srand()) + substr(srand(), 7)
_ = sprintf("%*s", __=int((___= +$(_=\
+(_____="."))^ (++_/(++_*_))),"")
gsub(_____, $NF, _)
do {
print ___, gsub(_____, _, _),
((substr("",+srand()) +
substr(srand(), 7)) -________)/(1-4^-25)
} while ((__*=__)<___ && __<4^8)
print ".", length(_)
print \
sprintf("\n----------------\n%42.f .%20.13f",
(__=length(_))==___ ?__: (__=length(_= sprintf("%.*s",
___-__,_)(_))),((substr("", +srand()\
)+substr(srand(),7))-________) / (1-4^-25))
print substr(_, 1, 19)
}'
|
( mawk2 -F= <<< '3987654321=x'; )
0.20s user 0.55s system 94% cpu 0.799 total
3987654321 251 0.0000970363617
3987654321 63001 0.3465299606323
. 3969126001
----------------
3987654321 . 0.7496919631958
I have updated the question with additional information.
I have a .fastq file formatted in the following way
#M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8 (sequence name)
CATCTACATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC.. (sequence)
+
ACCCGGGGGGGGGDGGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFF.. (sequence quality)
For each sequence the format is the same (repetition of 4 lines)
What I am trying to do is search for a specific regex pattern ([A-Z]{5,}ACA[A-Z]{5,}ACA[A-Z]{5,}) in a window of n=35 characters of the 2nd line, cut it if found, and report it at the end of the previous line.
So far I've written some code that does almost what I want. I thought of using the match function together with substr on my window of interest, but I didn't achieve my goal. Here is the script.awk:
match(substr($0,0,35),/regexp/,a) {
print p,a[0] #print the previous line respect to the matched one
print #print the current line
for(i=0;i<=1;i++) { # print the 2 lines following
getline
print
}
}#store previous line
{ p = $0 }
Starting from a file like this:
#M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8
AACATCTACATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC..
+
GGGGGGGGDGGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF..
I would like to obtain an output like this:
#M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8 TATTCACATATAGACATGAAA #is the string that matched the regexp WITHOUT initial AA that doesn't match my expression
ATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC #without initial AA
+
GGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF # without "GGGGGGGGDGGGFGGGGGGFGGG" that is the same number of characters removed in the 2nd line
$ cat tst.awk
BEGIN {
tgtStr = "pattern"
tgtLgth = length(tgtStr)
winLgth = 35
numLines = 4
}
{
lineNr = ( (NR-1) % numLines ) + 1
rec[lineNr] = $0
}
lineNr == numLines {
if ( idx = index(substr(rec[2],1,winLgth),tgtStr) ) {
rec[1] = rec[1] " " tgtStr
rec[2] = substr(rec[2],idx+tgtLgth)
rec[4] = substr(rec[4],idx+tgtLgth)
}
for ( lineNr=1; lineNr<=numLines; lineNr++ ) {
print rec[lineNr]
}
}
$ awk -f tst.awk file
#M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8 pattern
ATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC..
+
GGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF..
wrt the code you posted:
substr($0,0,35) - strings, fields, line numbers, and arrays in awk start at 1 not 0 so that should be substr($0,1,35). Awk will compensate for your mistake and treat it as if you had written 1 instead of 0 in this case but get used to starting everything at 1 to avoid mistakes when it matters.
for(i=0;i<=1;i++) - should be for(i=1;i<=2;i++) for the same reason.
getline - not an appropriate use here, and syntactically fragile.
Update - per your comment below that pattern is actually a regexp rather than a string:
$ cat tst.awk
BEGIN {
tgtRegexp = "[A-Z]{5,}ACA[A-Z]{5,}ACA[A-Z]{5,}"
winLgth = 35
numLines = 4
}
{
lineNr = ( (NR-1) % numLines ) + 1
rec[lineNr] = $0
}
lineNr == numLines {
if ( match(substr(rec[2],1,winLgth),tgtRegexp) ) {
rec[1] = rec[1] " " substr(rec[2],RSTART,RLENGTH)
rec[2] = substr(rec[2],RSTART+RLENGTH)
rec[4] = substr(rec[4],RSTART+RLENGTH)
}
for ( lineNr=1; lineNr<=numLines; lineNr++ ) {
print rec[lineNr]
}
}
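To show how RSTART and RLENGTH drive the cut in the script above, here is a minimal sketch on made-up data. The function name cut_demo, the sequence, the quality string, the window of 8 and the pattern /ACG/ are all hypothetical and chosen only to keep the positions easy to follow.

```shell
# cut_demo is a hypothetical helper; the data below is invented for illustration.
cut_demo() {
    awk 'BEGIN {
        seq  = "GGTTACGTACCA"     # stand-in for line 2 of a record
        qual = "ABCDEFGHIJKL"     # stand-in for line 4, same length
        if (match(substr(seq, 1, 8), /ACG/)) {     # search only the first 8 chars
            print substr(seq, RSTART, RLENGTH)     # the matched text
            print substr(seq, RSTART + RLENGTH)    # what remains after the match
            print substr(qual, RSTART + RLENGTH)   # same cut applied to the qualities
        }
    }'
}

cut_demo    # prints ACG, TACCA, HIJKL on separate lines
```

match() sets RSTART to 5 and RLENGTH to 3 here, so both lines are cut at position 8. Because the window starts at position 1, RSTART is valid for the full line as well. Note that interval expressions such as {5,} need an awk with POSIX regexp support; some older versions require an option to enable them or do not support them at all.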
I warn you, I wanted to have some fun and it is twisted.
awk -v pattern=pattern -v window=15 '
BEGIN{RS="#";FS=OFS="\n"}
{pos = match($2, pattern); n_del=pos+length(pattern)}
pos && (n_del<=window){$1 = $1 " " pattern; $2=substr($2, n_del); $4=substr($4, n_del)}
NR!=1{printf "%s%s", RS, $0}
' file
Input :
#M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8
CATCTACpatternATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC..
+
ACCCGGGGGGGGGDGGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF..
#M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8
CATCTACGCpatternATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC..
+
ACCCGGGGDGGGGGGDGGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF..
Output :
#M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8 pattern
ATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC..
+
GGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF..
#M01790:39:000000000-C3C6P:1:1101:14141:1618 1:N:0:8
CATCTACGCpatternATATTCACATATAGACATGAAACACCTGTGGTTCTTCCTC..
+
ACCCGGGGDGGGGGGDGGGFGGGGGGFGGGGGGGGGGGFGGGGFGFGFFGGGGFGF..
The second block is not updated because the window is 15 and the pattern cannot be found within it.
I used the RS variable to handle each entire 4-line block as $0, with its lines available as $1, $2, $3 and $4. Because the input file starts with RS and does not end with RS, I preferred not to set ORS and to use printf instead of print.
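The effect of setting RS and FS this way can be seen on a tiny made-up input. The function name rs_demo and the two-line records are hypothetical. The first record is empty because the input starts with the separator, which is why the script above skips NR==1.

```shell
# rs_demo is a hypothetical helper; the input is invented for illustration.
rs_demo() {
    printf '#a\nb\n#c\nd\n' |
    awk 'BEGIN { RS = "#"; FS = "\n" }
         NR > 1 { print (NR - 1) ": " $1 "," $2 }'   # each line of a block is one field
}

rs_demo    # prints "1: a,b" then "2: c,d"
```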