Naming of variables in gawk when using gensub - awk

I am doing string substitution in gawk. The code below is a simplified version (the real replacement argument to gensub involves lots of "\\1\\3\\2", which is why I can't use sub/gsub). My question is one of robustness: since I'm modifying the 1st field ($1) with gensub, can I store the output of gensub in the variable $1, or does this have potential to cause problems (in other contexts; it works fine in my code)?
# test data
$ printf "Header_1\tHeader_2\nHiC_scaffold_1_1234\t1234\nHiC_scaffold_2_7890\t7890\n" > input.txt
# code I'm using (works as expected)
$ gawk 'BEGIN {FS = "\t"} FNR == 1 {next} \
> {one = gensub(/HiC_scaffold_([0-9]+)_([0-9]+) ?/, "HIC_SCAFFOLD_\\2_\\1", "g", $1)} \
> {print $2 "\t" one}' \
> input.txt > output.txt1
# code I'm asking about (works as expected with these test data)
$ gawk 'BEGIN {FS = "\t"} FNR == 1 {next} \
> {$1 = gensub(/HiC_scaffold_([0-9]+)_([0-9]+) ?/, "HIC_SCAFFOLD_\\2_\\1", "g", $1)} \
> {print $2 "\t" $1}' \
> input.txt > output.txt2
$ head *txt*
==> input.txt <==
Header_1 Header_2
HiC_scaffold_1_1234 1234
HiC_scaffold_2_7890 7890
==> output.txt1 <==
1234 HIC_SCAFFOLD_1234_1
7890 HIC_SCAFFOLD_7890_2
==> output.txt2 <==
1234 HIC_SCAFFOLD_1234_1
7890 HIC_SCAFFOLD_7890_2

If I understood you correctly, you are asking for a review of the second piece of code.
Can you assign to a field? Yes, so $1 = gensub(...) is OK.
Potential issues? Yes: if $n doesn't exist, assigning to it creates it, and thus modifies $0 as well. You are assigning to $1, and as far as I know, if a record ($0) exists then it must have at least one field ($1), though it might be empty.
Another caveat would be if you were assigning to $0, but that feels a little out of scope. Do not attempt $1 = $1 after your gensub(), though: any field assignment forces $0 to be rebuilt with OFS.
Finally, let's have a look at gensub(). If you provide no target to it, it falls back to using $0. You are not doing that; you pass $1 explicitly.
In the end, I cannot see a trivial situation where this can go wrong. Your code seems fine to me.
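A couple of quick one-liners illustrating those points (made-up data, not from the question):
$ echo 'a b' | awk '{ $5 = "x"; print NF; print }'   # creating $5 also creates empty $3/$4 and rebuilds $0
5
a b   x
$ printf 'a\tb\n' | awk -F'\t' '{ $1 = $1; print }'  # $1 = $1 rebuilds $0: the tab FS becomes the default OFS
a b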

Issue with using awk to extract words after/before a specific word

I have a file which has several sections with a header like this
$ head -n 5 test.txt
[44610] gmx#127.0.0.1
f1(cu_atomdata, NBParamGpu, Nbnxm::gpu_plist, bool), Block Size 64, Grid Size 3599, Device 0, 99 invocations
Section: Command line profiler metrics
Metric Name Metric Unit Minimum Maximum Average
-------------------------------------------------------------------------------------------- ----------- ------------ ------------ ------------
I would like to use the following awk command to get the number after Grid Size and the number before invocations. However, the following command returns nothing.
$ awk '{for (I=1;I<NF;I++) if ($I == "Grid Size") print $(I+1)}' test.txt
$
$ awk '{for (I=1;I<NF;I++) if ($I == "invocations") print $(I-1)}' test.txt
$
Any idea to fix that?
You may use this awk that loops through each field and extracts your numbers based on field values:
awk '{
for (i=3;i<NF;i++)
if ($(i-2) == "Grid" && $(i-1) == "Size")
print "gridSize:", $i+0
else if ($(i+1) == "invocations")
print "invocations:", $i+0
}' file
gridSize: 3599
invocations: 99
Alternatively, you may try this gnu grep with PCRE regex:
grep -oP 'Grid Size\h+\K\d+|\d+(?=\h+invocations)' file
3599
99
\K - match reset
(?=...) - Lookahead assertion
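A tiny illustration of \K on made-up input:
$ echo 'foo123' | grep -oP 'foo\K\d+'
123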
With recent GNU awk versions, try passing an array to match() itself; match() then fills arr[1] with the first capture group:
awk '
match($0,/Grid Size ([0-9]+)/, arr1){
print arr1[1]
match($0,/([0-9]+) invocations/, arr2)
print arr2[1]
}
' Input_file
With your shown samples, could you please try the following (when I tried the above it didn't work with awk version 4.1, so adding this one as an alternative here):
awk '
match($0,/Grid Size [0-9]+/){
num=split(substr($0,RSTART,RLENGTH),arr1," ")
print arr1[num]
match($0,/[0-9]+ invocations/)
split(substr($0,RSTART,RLENGTH),arr2," ")
print arr2[1]
}
' Input_file
To make it even simpler:
{mawk/mawk2/gawk} 'BEGIN {
    # FS swallows: everything before | in-between | after
    FS = "(^.+Grid Size[ ]+|" \
         "[,][^,]+[,][ ]+|" \
         "[ ]+invocations.*$)"
} NF == 4 { print "grid size \043 : " $2 ", invocations \043 : " $3 }'
This regex gobbles everything before, in between, and after the two numbers. Because the regex touches both ends of the line, fields $1 and $4 will also be created, but as empty ones, hence the NF == 4 check.
The octal code \043 is the hash symbol # - just my own personal preference of not having the comment delimiter inside my strings in its original form.
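A minimal illustration of those empty edge fields, on made-up input:
$ echo '--mid--' | awk -F'--' '{ print NF; print "[" $1 "][" $2 "][" $3 "]" }'
3
[][mid][]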
A gawk approach with gensub:
$ gawk '/Grid Size/{
s=gensub(/.*Grid\sSize\s([[:digit:]]+).*,\s([[:digit:]]+) invocations/, "gridSize: \\1\ninvocations: \\2","G"); print s
}' myFile
gridSize: 3599
invocations: 99

Run command inside awk and store result inplace

I have a script that I need to run on every value. It basically returns a number for a given argument, like below
>>./myscript 4832
>>1100
my.csv contains the following:
123,4832
456,4833
789,4834
My command
cat my.csv | awk -F',' '{$3=system("../myscript $2");print $1,$2,$3'}
myscript is unable to understand that I'm passing the second input field $2 as argument. I need the output from the script to be added to the output as my 3rd column.
The expected output is
123,4832,1100
456,4833,17
789,4834,42
where the third field is the output from myscript with the second field as the argument.
If you are attempting to add a third field with the output from myscript $2 where $2 is the value of the second field, try
awk -F , '{ printf ("%s,%s,", $1, $2); system("../myscript " $2) }' my.csv
where we exploit the convenient fact that printf ends without a newline, so the output from myscript completes the line with the calculated value and its newline.
This isn't really a good use of Awk; you might as well do
while IFS=, read -r first second; do
printf "%s,%s," "$first" "$second"
../myscript "$second"
done <my.csv
I'm assuming you require comma-separated output; changing this to space-separated is obviously a trivial modification.
The syntax you want is:
awk 'BEGIN{FS=OFS=","}
{
cmd = "./myscript \047" $2 "\047"
val = ( (cmd | getline line) > 0 ? line : "NaN" )
close(cmd)
print $0, val
}
' file
Tweak the getline part to do different error handling if you like and make sure you read and fully understand http://awk.freeshell.org/AllAboutGetline before using getline.
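For example, one possible error-handling variant of the val = ... line above (writing a warning to stderr is my own choice here, not part of the original):
if ((cmd | getline line) > 0) {
    val = line
} else {
    val = "NaN"
    print "warning: no output from: " cmd > "/dev/stderr"
}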
In gnu-awk we can use Two-Way Communications with Another Process:
awk -F',' '{"../myscript "$2 |& getline v; print $1,$2,v}' my.csv
you get,
123 4832 1100
456 4833 17
789 4834 42
awk -F',' 'BEGIN { OFS=FS }{"../myscript "$2 |& getline v; print $1,$2,v}' my.csv
you get,
123,4832,1100
456,4833,17
789,4834,42
From the GNU awk online documentation:
system: Execute the operating system command command and then return to the awk program. Return command's exit status.
In other words, system() gives you the command's exit status, not its output; to capture the output you need to use getline from a pipe (see the piped getline documentation).
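A quick illustration of the difference:
$ awk 'BEGIN { status = system("echo hi"); print "system() returned:", status }'
hi
system() returned: 0
The command's output goes straight to stdout; only the exit status (0) comes back to awk.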
You need to specify the $2 separately in the string concatenation, that is
awk -F',' '{ system("echo \"echo " $1 "$(../myexecutable " $2 ") " $3 "\" | bash"); }' my.csv

Removing content of a column based on number of occurrences

I have a file (;-separated) with data like this
111111121;000-000.1;000-000.2
111111211;000-000.1;000-000.2
111112111;000-000.1;000-000.2
111121111;000-000.1;000-000.2
111211111;000-000.1;000-000.2
112111111;000-000.1;000-000.2
121111112;000-000.2;020-000.8
121111121;000-000.2;020-000.8
121111211;000-000.2;020-000.8
121113111;000-000.3;000-200.2
211111121;000-000.1;000-000.2
I would like to remove any $3 that has fewer than 3 occurrences, so the outcome would be like
111111121;000-000.1;000-000.2
111111211;000-000.1;000-000.2
111112111;000-000.1;000-000.2
111121111;000-000.1;000-000.2
111211111;000-000.1;000-000.2
112111111;000-000.1;000-000.2
121111112;000-000.2;020-000.8
121111121;000-000.2;020-000.8
121111211;000-000.2;020-000.8
121113111;000-000.3
211111121;000-000.1;000-000.2
That is, only $3 got deleted, as it had only a single occurrence.
Sadly I am not really sure whether (and thus how) this could be done relatively easily (doing the =COUNT.IF matching and manual deletion in Excel feels quite embarrassing).
$ awk -F';' 'NR==FNR{cnt[$3]++;next} cnt[$3]<3{sub(/;[^;]+$/,"")} 1' file file
111111121;000-000.1;000-000.2
111111211;000-000.1;000-000.2
111112111;000-000.1;000-000.2
111121111;000-000.1;000-000.2
111211111;000-000.1;000-000.2
112111111;000-000.1;000-000.2
121111112;000-000.2;020-000.8
121111121;000-000.2;020-000.8
121111211;000-000.2;020-000.8
121113111;000-000.3
211111121;000-000.1;000-000.2
or if you prefer:
$ awk -F';' 'NR==FNR{cnt[$3]++;next} {print (cnt[$3]<3 ? $1 FS $2 : $0)}' file file
this awk one-liner can help; it processes the file twice:
awk -F';' -v OFS=';' 'NR==FNR{a[$3]++;next}a[$3]<3{NF--}7' file file
(OFS must be set to ';' so the record is rebuilt with the right separator when NF is decremented; the 7 is just a truthy pattern meaning "print", like 1.)
Though the awk solutions are the best in terms of performance, your goal could also be achieved with something like this:
while IFS=" " read a b;do
if [[ "$a" -lt "3" ]];then
sed -i "s/$b//" b.txt
fi
done <<<"$(cut -d";" -f3 b.txt |sort |uniq -c)"
The operation is based on the output of cut | sort | uniq -c, which counts the occurrences:
$ cut -d";" -f3 b.txt |sort |uniq -c
7 000-000.2
1 000-200.2
3 020-000.8
The above edits the source file in place, so keep a backup for testing.
You can feed the file twice to awk. On the first run you gather a statistic that you use in the second run:
script.awk
FNR == NR { stats[ $3 ]++
            next
          }
{ if( stats[$3] < 3 ) print $1 FS $2
  else print
}
Run it like this: awk -F\; -f script.awk yourfile yourfile .
The condition FNR == NR is true during processing of the first filename given to awk. The next statement skips the second block.
Thus the second block is only used for processing the second filename given to awk (which is here the same as the first filename).
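A quick demonstration of why FNR == NR distinguishes the passes (made-up file f):
$ printf 'a\nb\n' > f
$ awk '{ print FILENAME, FNR, NR }' f f
f 1 1
f 2 2
f 1 3
f 2 4
FNR restarts at 1 for each file while NR keeps counting, so FNR == NR holds only during the first pass.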

AWK - How can I get the values that are equal among them? if ... $1 == $1?

I am working with a list of DNA sequences. I would like to get all the sequences with the same name ($1). I was thinking of using if ($1 == "$1"), but this does not work.
result_file:
name1 number1 number2 sequenceofname1
name1 number3 number4 sequenceofname1
script:
awk '{if ($1 == "$1") printf("%s_%s_%s \t%s\n", $1,$2,$3,$4);}' <result_file >file.txt
How do I pass $1 to my awk command?
You can use the -v option:
awk -v name="name1" '{
if ($1 == name) printf("%s_%s_%s \t%s\n", $1,$2,$3,$4);
}' result_file > file.txt
or, if this statement in a script
awk -v name="$1" '{
if ($1 == name) printf("%s_%s_%s \t%s\n", $1,$2,$3,$4);
}' result_file > file.txt
-v var=val: Assign the value val to the variable var before execution of the program begins. Such variable values are available to the BEGIN block of an AWK program.
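For instance:
$ awk -v name="name1" 'BEGIN { print "name is:", name }'
name is: name1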
If I've understood correctly, you want to use $1 from your shell script as an argument to the awk command within it.
In which case, you want to not quote the $1 that you want expanding, but do quote the rest of the awk command. One possibility is to double-quote the command:
awk "{if (\$1 == \"$1\") printf(\"%s_%s_%s \\t%s\\n\", \$1,\$2,\$3,\$4);}"
It can get hard to manage all the backslashes, so you may prefer to single-quote most of the command, but double-quote the part to be expanded:
awk '{if ($1 == "'"$1"'") printf("%s_%s_%s \t%s\n", $1,$2,$3,$4);}'
That's slightly tricky to read - the critical bit divides as '...($1 == "' "$1" '")...'. So there's a double-quote that's part of the Awk command, and one that's for the shell, to keep $1 in one piece.
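You can check what the shell actually hands to awk (here set -- simulates the script's $1):
$ set -- name1
$ echo '{if ($1 == "'"$1"'") print}'
{if ($1 == "name1") print}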
Oh, and no need to invoke cat - just provide the file as input:
awk ... <result_file >file.txt

awk optimised for multiple delimiters on a single record

I have a file with more than 2 billion records in it.
It contains records with multiple separators: , OBO- OBI- CSP- and TICK-
My lines in file B2.txt are:
917354000000,SUD=CAT-10&DBSG-1&BS3G-1&TS11-1&TS21-1&TS22-1&RSA-1&CSP-69&NAM-0&SCHAR-4&PWD-0000&OFA-0&OICK-134&HOLD-1&MPTY-1&CLIP-1&CFU-1&CFB-1&CFNRY-1&CFNRC-1&CAW-1&SOCFU-1&SOCFB-1&SOCFRY-1&SOCFRC-1&SODCF-0&SOSDCF-4&SOCB-0&SOCLIP-0&SOCLIR-0&SOCOLP-0;
917354000004,SUD=CAT-10&DBSG-1&OBO-2&OBR-2&BS3G-1&TS11-1&TS21-1&TS22-1&RSA-4&PRBT-1&NAM-0&SCHAR-8&PWD-0000&OFA-6&HOLD-1&CLIP-1&CFU-1&CFB-1&CFNRY-1&CFNRC-1&CAW-1&SOCFU-0&SOCFB-0&SOCFRY-0&SOCFRC-0&SODCF-0&SOSDCF-4&SOCB-0&SOCLIP-0&SOCLIR-0&SOCOLP-0;
My code is taking more than 2 days to run...
Below is the code.
#!/usr/bin/sh
echo "MSISDN,OBO_Value,OBI_Value,TICK_Value,CSP_Value" > tt3.txt
while read i
do
    MSISDN=`echo $i | awk -F"," '{ print $1 }'`
    Complete_Info=`echo $i | awk -F"," '{ print $2 }'`
    OBO_Value=`echo $Complete_Info | awk -F"OBO-" '{ print $2 }' | awk -F"&" '{ print $1 }'`
    OBI_Value=`echo $Complete_Info | awk -F"OBI-" '{ print $2 }' | awk -F"&" '{ print $1 }'`
    CSP_Value=`echo $Complete_Info | awk -F"CSP-" '{ print $2 }' | awk -F"&" '{ print $1 }'`
    TICK_Value=`echo $Complete_Info | awk -F"TICK-" '{ print $2 }' | awk -F"&" '{ print $1 }'`
    echo $MSISDN,$OBO_Value,$OBI_Value,$TICK_Value,$CSP_Value >> tt3.txt
done < B2.txt
Is it possible to optimise this code with awk so the output file contains lines as follows?
917354000000,,,,69
Another awk:
awk '
BEGIN {
FS="[,&;]"
OFS=","
print "MSISDN,OBO_Value,OBI_Value,TICK_Value,CSP_Value"
}
NF {
split(x,A) # a way of clearing array A, gawk can use delete A
for(i=2; i<=NF; i++) {
split ($i,F,/-/)
A[F[1]]=F[2]
}
print $1,A["OBO"],A["OBI"],A["TICK"],A["CSP"]
}
' B2.txt > tt3.txt
-------
After all the test results I decided to test with 1M records, of which 25% were empty lines. I tested on OS X 10.9 with the following awk versions:
BSD awk v. 20070501
mawk v. 1.3.4 20100625
gawk 4.0.2
I left out gawk 3 results, since I know from experience that they are usually disappointing and it usually finishes dead last.
Martin's solution was faster with BSD awk and GNU awk, but the opposite was true with mawk, where my solution was 25% faster. Ed's improvement of not using a split() function raised speed by a further 30% or so, and with mawk it took only 7.225s, half the time needed by Martin's solution.
But the real kicker turned out to be Jidder's first solution with match() and without the use of a function. With mawk it did it in a whopping 1.868s.
So YMMV, but the best solution speed-wise appears to be Jidder's first solution in combination with mawk.
S.
RESULTS:
         Martin      Scrutinizer  Ed Morton   Jidder (non-function version)
BSD awk
real     0m25.008s   1m51.424s    0m38.566s   0m17.945s
user     0m24.545s   1m47.791s    0m37.662s   0m17.485s
sys      0m0.117s    0m0.824s     0m0.120s    0m0.117s
mawk
real     0m14.472s   0m11.618s    0m7.225s    0m1.868s
user     0m13.922s   0m11.091s    0m6.988s    0m1.759s
sys      0m0.117s    0m0.116s     0m0.093s    0m0.084s
gawk
real     0m33.486s   1m16.490s    0m30.642s   0m17.201s
user     0m32.816s   1m14.874s    0m30.041s   0m16.689s
sys      0m0.134s    0m0.219s     0m0.116s    0m0.131s
Another awk way
awk 'BEGIN{OFS=FS=",";print "MSISDN,OBO_Value,OBI_Value,TICK_Value,CSP_Value"}
NF{match($2,/CSP-[0-9]+/);a=substr($2,RSTART+4,RLENGTH-4)
match($2,/TICK-[0-9]+/);b=substr($2,RSTART+5,RLENGTH-5)
match($2,/OBI-[0-9]+/);c=substr($2,RSTART+4,RLENGTH-4)
match($2,/OBO-[0-9]+/);d=substr($2,RSTART+4,RLENGTH-4)
print $1,d,c,b,a}
' file
Produces
MSISDN,OBO_Value,OBI_Value,TICK_Value,CSP_Value
917354000000,,,,69
917354000004,2,,,
I think it's pretty self-explanatory, but if you need anything explained just ask.
EDIT:
Here it is using a function (the offsets use length(g) so that the 4-character "TICK" is handled correctly):
awk 'BEGIN{OFS=FS=",";print "MSISDN,OBO_Value,OBI_Value,TICK_Value,CSP_Value"}
function f(g){match($2,g"-[0-9]*");return substr($2,RSTART+length(g)+1,RLENGTH-length(g)-1)}
NF{a=f("OBO");b=f("OBI");c=f("TICK");d=f("CSP");print $1,a,b,c,d} ' file
A bit neater:
awk 'BEGIN{
OFS=FS=","
print "MSISDN,OBO_Value,OBI_Value,TICK_Value,CSP_Value"
}
function GetVal(Str){
match($2,Str"-[0-9]*")
return substr($2,RSTART+length(Str)+1,RLENGTH-length(Str)-1)
}
NF{
a=GetVal("OBO")
b=GetVal("OBI")
c=GetVal("TICK")
d=GetVal("CSP")
print $1,a,b,c,d} ' file
Decided to check the speed of each of the scripts on here for 10 000 rows
Mine (functions) - real 0m0.773s  user 0m0.755s  sys 0m0.016s
Mine (non-funct) - real 0m0.306s  user 0m0.295s  sys 0m0.009s
Scrutinizer      - real 0m0.400s  user 0m0.392s  sys 0m0.008s
Martin           - real 0m0.298s  user 0m0.291s  sys 0m0.006s
The function one is significantly slower.
Here is an awk script that worked for your input:
BEGIN {
OFS=","
FS=",|&|;|-"
print "MSISDN,OBO_Value,OBI_Value,TICK_Value,CSP_Value"
}
{
obi=""
tick=""
csp=""
obo=""
for (i=4; i<=NF; i+=2) {
if( $i == "OBO" ) {
obo=$(i+1)
} else if ($i == "OBI") {
obi=$(i+1)
} else if ($i == "CSP") {
csp=$(i+1)
} else if ($i == "TICK") {
tick=$(i+1)
}
}
print $1, obo, obi, tick, csp
}
which gave:
MSISDN,OBO_Value,OBI_Value,TICK_Value,CSP_Value
917354000000,,,,69
917354000004,2,,,
I took advantage of the fact that, after splitting, the keys and values in your data alternate every two fields.
For completeness let me mention my benchmark for 10 million rows with all solutions updated:
              mawk         gawk
Ed Morton
real          1m19.259s    3m36.107s
user          1m17.987s    3m35.163s
sys           0m0.706s     0m0.936s
Martin
real          2m13.875s    4m37.112s
user          2m12.680s    4m36.032s
sys           0m0.848s     0m0.954s
Scrutinizer
real          1m48.755s    6m40.202s
user          1m47.513s    6m39.148s
sys           0m0.894s     0m1.041s
Jidder (non-function version)
real          0m16.403s    3m18.342s
user          0m15.626s    3m17.004s
sys           0m0.632s     0m0.968s
Clearly: use Jidder's solution with mawk; this saves a lot of time.
I used the same vars as Scrutinizer so it's easy to see the differences with this similar approach that doesn't need an extra array and doesn't call split() on every field:
$ cat tst.awk
BEGIN{
FS="[-&,]"
OFS=","
print "MSISDN,OBO_Value,OBI_Value,TICK_Value,CSP_Value"
}
{
delete A # or split("",A) in non-gawk
for (i=2; i<NF;i+=2)
A[$i] = $(i+1)
print $1,A["OBO"],A["OBI"],A["TICK"],A["CSP"]
}
$ awk -f tst.awk file
MSISDN,OBO_Value,OBI_Value,TICK_Value,CSP_Value
917354000000,,,,69
917354000004,2,,,
Since @Jidder's solution with mawk seems to be the fastest, I wondered how this would compare:
BEGIN{OFS=FS=","; print "MSISDN,OBO_Value,OBI_Value,TICK_Value,CSP_Value"}
NF {
a=b=c=d=$2
sub(/.*CSP-/,"",a); sub(/[^0-9].*/,"",a)
sub(/.*TICK-/,"",b); sub(/[^0-9].*/,"",b)
sub(/.*OBI-/,"",c); sub(/[^0-9].*/,"",c)
sub(/.*OBO-/,"",d); sub(/[^0-9].*/,"",d)
print $1,d,c,b,a
}
Similar approach, but just using two sub()s instead of match() + substr(). The result is that it's significantly slower than my original attempt and Jidder's:
$ time awk -f ed.awk file10k > ed.out
real 0m0.468s
user 0m0.405s
sys 0m0.030s
$ time awk -f ed2.awk file10k > ed2.out
real 0m1.092s
user 0m1.045s
sys 0m0.046s
$ time awk -f jidder.awk file10k > jidder.out
real 0m0.218s
user 0m0.124s
sys 0m0.061s
I guess mawk must have some serious optimization for match()+substr()!
Ah, I just realized what the difference is - string manipulation is notoriously slow in awk (slower than I/O) and the above 2-sub()s solution modifies each string variable twice.
You can simplify it somewhat like this:
OBO_Value=$(awk -F"OBO-" '/OBO-/ {split($2,a,"&");print a[1];exit}' <<< $Complete_Info)
This does it all in one go, so it should be somewhat faster. Also, the exit makes awk stop as soon as the line with the value is found.
PS: use $( ) instead of old and outdated backticks.
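One reason to prefer $( ): it nests without escaping (a made-up illustration):
dir=`basename \`pwd\``    # backticks must be escaped to nest
dir=$(basename $(pwd))    # $( ) nests cleanly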