awk optimised for multiple delimiters on a single record - awk

I have a file with more than 2 billion records in it.
It contains records with multiple separators such as , OBO- OBI- CSP- and TICK-.
My lines in file B2.txt are:
917354000000,SUD=CAT-10&DBSG-1&BS3G-1&TS11-1&TS21-1&TS22-1&RSA-1&CSP-69&NAM-0&SCHAR-4&PWD-0000&OFA-0&OICK-134&HOLD-1&MPTY-1&CLIP-1&CFU-1&CFB-1&CFNRY-1&CFNRC-1&CAW-1&SOCFU-1&SOCFB-1&SOCFRY-1&SOCFRC-1&SODCF-0&SOSDCF-4&SOCB-0&SOCLIP-0&SOCLIR-0&SOCOLP-0;
917354000004,SUD=CAT-10&DBSG-1&OBO-2&OBR-2&BS3G-1&TS11-1&TS21-1&TS22-1&RSA-4&PRBT-1&NAM-0&SCHAR-8&PWD-0000&OFA-6&HOLD-1&CLIP-1&CFU-1&CFB-1&CFNRY-1&CFNRC-1&CAW-1&SOCFU-0&SOCFB-0&SOCFRY-0&SOCFRC-0&SODCF-0&SOSDCF-4&SOCB-0&SOCLIP-0&SOCLIR-0&SOCOLP-0;
My code takes more than 2 days to run...
Below is the code.
#!/usr/bin/sh
echo "MSISDN,OBO_Value,OBI_Value,TICK_Value,CSP_Value" > tt3.txt
while read i
do
  MSISDN=`echo $i | awk -F"," '{ print $1 }'`
  Complete_Info=`echo $i | awk -F"," '{ print $2 }'`
  OBO_Value=`echo $Complete_Info | awk -F"OBO-" '{ print $2 }' | awk -F"&" '{ print $1 }'`
  OBI_Value=`echo $Complete_Info | awk -F"OBI-" '{ print $2 }' | awk -F"&" '{ print $1 }'`
  CSP_Value=`echo $Complete_Info | awk -F"CSP-" '{ print $2 }' | awk -F"&" '{ print $1 }'`
  TICK_Value=`echo $Complete_Info | awk -F"TICK-" '{ print $2 }' | awk -F"&" '{ print $1 }'`
  echo $MSISDN,$OBO_Value,$OBI_Value,$TICK_Value,$CSP_Value >> tt3.txt
done < B2.txt
Is it possible to optimise this code with awk so that the output file looks like this?
917354000000,,,,69

Another awk:
awk '
BEGIN {
  FS="[,&;]"
  OFS=","
  print "MSISDN,OBO_Value,OBI_Value,TICK_Value,CSP_Value"
}
NF {
  split(x,A)            # a way of clearing array A, gawk can use delete A
  for(i=2; i<=NF; i++) {
    split($i,F,/-/)
    A[F[1]]=F[2]
  }
  print $1,A["OBO"],A["OBI"],A["TICK"],A["CSP"]
}
' B2.txt > tt3.txt
-------
After all the test results I decided to test with 1M records, of which 25% were empty lines. I tested on OSX 10.9 with the following awk versions:
awk (BSD awk) v. 20070501
mawk v. 1.3.4 20100625
gawk 4.0.2
I left out gawk 3 results, since I know from experience that they are usually disappointing and it usually finishes dead last.
Martin's solution was faster with BSD awk and GNU awk, but the opposite was true with mawk, where my solution was 25% faster. Ed's improvement of not using a split() function increased speed by a further 30% or so, and with mawk it took only 7.225s, half of the time that was needed for Martin's solution.
But the real kicker turned out to be Jidder's first solution with the match() and without the use of a function. With mawk it did it in a whopping 1.868s.
So YMMV, but the best solution speed-wise appears to be Jidder's first solution in combination with mawk.
S.
RESULTS:
        Martin        Scrutinizer   Ed Morton     Jidder (non-function version)
BSD awk
real    0m25.008s     1m51.424s     0m38.566s     0m17.945s
user    0m24.545s     1m47.791s     0m37.662s     0m17.485s
sys     0m0.117s      0m0.824s      0m0.120s      0m0.117s
mawk
real    0m14.472s     0m11.618s     0m7.225s      0m1.868s
user    0m13.922s     0m11.091s     0m6.988s      0m1.759s
sys     0m0.117s      0m0.116s      0m0.093s      0m0.084s
gawk
real    0m33.486s     1m16.490s     0m30.642s     0m17.201s
user    0m32.816s     1m14.874s     0m30.041s     0m16.689s
sys     0m0.134s      0m0.219s      0m0.116s      0m0.131s
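For anyone who wants to reproduce the comparison, a minimal timing harness along these lines will do (a sketch only; the script names and the 1M-record input file are placeholders, not files posted in this thread):

# hypothetical file names: one .awk file per solution, B2_1M.txt is the test input
for prog in martin.awk scrutinizer.awk ed.awk jidder.awk; do
    echo "== $prog =="
    time mawk -f "$prog" B2_1M.txt > /dev/null
done

Swap mawk for awk or gawk to repeat the run with the other interpreters.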

Another awk way
awk 'BEGIN{OFS=FS=",";print "MSISDN,OBO_Value,OBI_Value,TICK_Value,CSP_Value"}
NF{match($2,/CSP-[0-9]+/);  a=substr($2,RSTART+4,RLENGTH-4)
   match($2,/TICK-[0-9]+/); b=substr($2,RSTART+5,RLENGTH-5)
   match($2,/OBI-[0-9]+/);  c=substr($2,RSTART+4,RLENGTH-4)
   match($2,/OBO-[0-9]+/);  d=substr($2,RSTART+4,RLENGTH-4)
   print $1,d,c,b,a}
' file
Produces
MSISDN,OBO_Value,OBI_Value,TICK_Value,CSP_Value
917354000000,,,,69
917354000004,2,,,
I think it's pretty self-explanatory, but if you need anything explained just ask.
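If the match()/substr() pairing is unfamiliar, here is a standalone illustration (not part of the solution above) of what RSTART and RLENGTH hold for the CSP case:

echo "CAT-10&CSP-69&NAM-0" | awk '{
    match($0, /CSP-[0-9]+/)                   # RSTART = position of the "C", RLENGTH = length of "CSP-69"
    print RSTART, RLENGTH                     # prints: 8 6
    print substr($0, RSTART+4, RLENGTH-4)     # skip the 4-character "CSP-" prefix, prints: 69
}'

If the pattern does not match at all, RSTART is 0 and RLENGTH is -1, so the substr() call returns an empty string, which is exactly what the empty CSV columns rely on.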
EDIT:
Here it is using a function
awk 'BEGIN{OFS=FS=",";print "MSISDN,OBO_Value,OBI_Value,TICK_Value,CSP_Value"}
function f(g){match($2,g"-[0-9]*");return substr($2,RSTART+length(g)+1,RLENGTH-length(g)-1)}
NF{a=f("OBO");b=f("OBI");c=f("TICK");d=f("CSP");print $1,a,b,c,d} ' file
Bit neater
awk 'BEGIN{
  OFS=FS=","
  print "MSISDN,OBO_Value,OBI_Value,TICK_Value,CSP_Value"
}
function GetVal(Str){
  match($2,Str"-[0-9]*")
  return substr($2,RSTART+length(Str)+1,RLENGTH-length(Str)-1)
}
NF{
  a=GetVal("OBO")
  b=GetVal("OBI")
  c=GetVal("TICK")
  d=GetVal("CSP")
  print $1,a,b,c,d} ' file
I decided to check the speed of each of the scripts here on 10 000 rows:
Mine (functions) - real 0m0.773s  user 0m0.755s  sys 0m0.016s
Mine (non-funct) - real 0m0.306s  user 0m0.295s  sys 0m0.009s
Scrutinizer      - real 0m0.400s  user 0m0.392s  sys 0m0.008s
Martin           - real 0m0.298s  user 0m0.291s  sys 0m0.006s
The function one is significantly slower.
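For reference, a test file of that size can be generated by repeating the two sample records from the question (a sketch; the records are truncated for readability and file10k is an arbitrary name):

awk 'BEGIN {
    r1 = "917354000000,SUD=CAT-10&DBSG-1&CSP-69&NAM-0&SCHAR-4;"   # truncated sample record 1
    r2 = "917354000004,SUD=CAT-10&DBSG-1&OBO-2&OBR-2&NAM-0;"      # truncated sample record 2
    for (i = 0; i < 5000; i++) { print r1; print r2 }             # 10 000 rows in total
}' > file10k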

Here is an awk script that worked for your input:
BEGIN {
    OFS=","
    FS=",|&|;|-"
    print "MSISDN,OBO_Value,OBI_Value,TICK_Value,CSP_Value"
}
{
    obi=""
    tick=""
    csp=""
    obo=""
    for (i=4; i<=NF; i+=2) {
        if ($i == "OBO") {
            obo=$(i+1)
        } else if ($i == "OBI") {
            obi=$(i+1)
        } else if ($i == "CSP") {
            csp=$(i+1)
        } else if ($i == "TICK") {
            tick=$(i+1)
        }
    }
    print $1, obo, obi, tick, csp
}
gave
MSISDN,OBO_Value,OBI_Value,TICK_Value,CSP_Value
917354000000,,,,69
917354000004,2,,,
I took advantage of the fact that, with this field separator, your keys and values alternate every two fields.
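To see that alternation, here is how the combined separator splits a (truncated) copy of the first sample record, as an illustrative sketch only:

echo '917354000000,SUD=CAT-10&DBSG-1&CSP-69' | awk -F',|&|;|-' '{ for (i = 1; i <= NF; i++) print i, $i }'
1 917354000000
2 SUD=CAT
3 10
4 DBSG
5 1
6 CSP
7 69

The tags of interest always land on even field numbers from 4 onwards, with their value in the following field, which is why the loop starts at i=4 and steps by 2.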
For completeness let me mention my benchmark for 10 million rows with all solutions updated:
                               mawk            gawk
Ed Morton               real   1m19.259s       3m36.107s
                        user   1m17.987s       3m35.163s
                        sys    0m0.706s        0m0.936s
Martin                  real   2m13.875s       4m37.112s
                        user   2m12.680s       4m36.032s
                        sys    0m0.848s        0m0.954s
Scrutinizer             real   1m48.755s       6m40.202s
                        user   1m47.513s       6m39.148s
                        sys    0m0.894s        0m1.041s
Jidder (non-function)   real   0m16.403s       3m18.342s
                        user   0m15.626s       3m17.004s
                        sys    0m0.632s        0m0.968s
Clearly: use Jidder's solution with mawk; this saves a lot of time.

I used the same variable names as Scrutinizer so it's easy to see the differences with this similar approach, which doesn't need an extra array and doesn't call split() on every field:
$ cat tst.awk
BEGIN{
    FS="[-&,]"
    OFS=","
    print "MSISDN,OBO_Value,OBI_Value,TICK_Value,CSP_Value"
}
{
    delete A          # or split("",A) in non-gawk
    for (i=2; i<NF; i+=2)
        A[$i] = $(i+1)
    print $1,A["OBO"],A["OBI"],A["TICK"],A["CSP"]
}
$ awk -f tst.awk file
MSISDN,OBO_Value,OBI_Value,TICK_Value,CSP_Value
917354000000,,,,69
917354000004,2,,,
Since @Jidder's solution with mawk seems to be fastest, I wondered how this would compare:
BEGIN{OFS=FS=","; print "MSISDN,OBO_Value,OBI_Value,TICK_Value,CSP_Value"}
NF {
    a=b=c=d=$2
    sub(/.*CSP-/,"",a);  sub(/[^0-9].*/,"",a)
    sub(/.*TICK-/,"",b); sub(/[^0-9].*/,"",b)
    sub(/.*OBI-/,"",c);  sub(/[^0-9].*/,"",c)
    sub(/.*OBO-/,"",d);  sub(/[^0-9].*/,"",d)
    print $1,d,c,b,a
}
Similar approach but using 2 sub()s instead of match() + substr(). The result is that it's significantly slower than my original attempt and Jidder's:
$ time awk -f ed.awk file10k > ed.out
real 0m0.468s
user 0m0.405s
sys 0m0.030s
$ time awk -f ed2.awk file10k > ed2.out
real 0m1.092s
user 0m1.045s
sys 0m0.046s
$ time awk -f jidder.awk file10k > jidder.out
real 0m0.218s
user 0m0.124s
sys 0m0.061s
I guess mawk must have some serious optimization for match()+substr()!
Ah, I just realized what the difference is - string manipulation is notoriously slow in awk (slower than I/O) and the above 2-sub()s solution modifies each string variable twice.

You can simplify it some like this:
OBO_Value=$(awk -F"OBO-" '/OBO-/ {split($2,a,"&");print a[1];exit}' <<< $Complete_Info)
This does it all in one go, so it should be somewhat faster. Also, the exit makes awk stop as soon as the line with the value is found.
PS: use $( ... ) instead of the old and outdated backticks.
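For example, the first two assignments from the loop in the question, rewritten with $( ) (a sketch, same logic as the backtick version):

MSISDN=$(echo "$i" | awk -F',' '{ print $1 }')
Complete_Info=$(echo "$i" | awk -F',' '{ print $2 }')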

Related

Issue with using awk to extract words after/before a specific word

I have a file which has several sections with a header like this
$ head -n 5 test.txt
[44610] gmx#127.0.0.1
f1(cu_atomdata, NBParamGpu, Nbnxm::gpu_plist, bool), Block Size 64, Grid Size 3599, Device 0, 99 invocations
Section: Command line profiler metrics
Metric Name Metric Unit Minimum Maximum Average
-------------------------------------------------------------------------------------------- ----------- ------------ ------------ ------------
I would like to use the following awk commands to get the number after Grid Size and the number before invocations. However, the commands return nothing.
$ awk '{for (I=1;I<NF;I++) if ($I == "Grid Size") print $(I+1)}' test.txt
$
$ awk '{for (I=1;I<NF;I++) if ($I == "invocations") print $(I-1)}' test.txt
$
Any idea how to fix that?
You may use this awk that loops through each field and extracts your numbers based on field values:
awk '{
    for (i=3; i<NF; i++)
        if ($(i-2) == "Grid" && $(i-1) == "Size")
            print "gridSize:", $i+0
        else if ($(i+1) == "invocations")
            print "invocations:", $i+0
}' file
gridSize: 3599
invocations: 99
Alternatively, you may try this GNU grep with a PCRE regex:
grep -oP 'Grid Size\h+\K\d+|\d+(?=\h+invocations)' file
3599
99
\K - match reset
(?=...) - Lookahead assertion
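A quick way to see the effect of \K and the lookahead on a single sample line (a sketch using the values from the question):

$ echo 'bool), Block Size 64, Grid Size 3599, Device 0, 99 invocations' | grep -oP 'Grid Size\h+\K\d+'
3599
$ echo 'bool), Block Size 64, Grid Size 3599, Device 0, 99 invocations' | grep -oP '\d+(?=\h+invocations)'
99

\K discards everything matched so far from the reported match, and the lookahead requires \h+invocations to follow without including it in the match.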
With the latest GNU awk versions, try putting an array inside match() itself:
awk '
match($0,/(Grid) (Size) ([0-9]+)/, arr1){
  print arr1[3]
  match($0,/([0-9]+) invocations/, arr2)
  print arr2[1]
}
' Input_file
With your shown samples, could you please try the following (when I tried the above it didn't work with awk version 4.1, so I am adding this one as an alternative here).
awk '
match($0,/Grid Size [0-9]+/){
  num=split(substr($0,RSTART,RLENGTH),arr1," ")
  print arr1[num]
  match($0,/[0-9]+ invocations/)
  split(substr($0,RSTART,RLENGTH),arr2," ")
  print arr2[1]
}
' Input_file
Make it even simpler:
{mawk/mawk2/gawk} 'BEGIN {
    FS = "(^.+Grid Size[ ]+|" \
         "[,][^,]+[,][ ]+|"   \
         "[ ]+invocations.*$)";   # before | in-between | after
} NF == 4 { print "grid size \043 : " $2 ", invocations \043 : " $3 }'
This regex gobbles everything before, in between, and after it. Because the regex touches the 2 walls at the ends, fields $1 and $4 will also be created, but as empty ones, hence the NF==4 check.
The octal escape \043 is the hash symbol #; it's just my personal preference not to have the comment delimiter inside my strings in its original form.
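A quick check that the escape prints what is claimed:

$ awk 'BEGIN { print "\043 is a hash" }'
# is a hash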
A gawk approach with gensub():
$ gawk '/Grid Size/{
s=gensub(/.*Grid\sSize\s([[:digit:]]+).*,\s([[:digit:]]+) invocations/, "gridSize: \\1\ninvocations: \\2","G"); print s
}' myFile
gridSize: 3599
invocations: 99

AWK FPAT not working as expected for string parsing

I have to parse a very long string (from stdin). It is basically a .sql file, and I have to get data out of it and convert it into CSV. For this, I am using awk. A sample snippet (of two records) is as follows:
b="(abc#xyz.com,www.example.com,'field2,(2)'),(dfr#xyz.com,www.example.com,'field0'),"
echo $b|awk 'BEGIN {FPAT = "([^\\)]+)|('\''[^'\'']+'\'')"}{print $1}'
In my regex, I am saying: split on the ")" bracket, or, if a single quote is found, ignore all text until the closing quote. But my output is as follows:
(abc#xyz.com,www.example.com,'field2,(2
I am expecting this output
(abc#xyz.com,www.example.com,'field2,(2)'
Where is the problem in my code? I have searched a lot and checked the awk manual but without success.
My first answer below was wrong, there is an ERE for what you're trying to do:
$ echo "$b" | awk -v FPAT="[(]([^)]|'[^']*')*)" '{for (i=1; i<=NF; i++) print $i}'
(abc#xyz.com,www.example.com,'field2,(2)')
(dfr#xyz.com,www.example.com,'field0')
Original answer, left as a different approach:
You need a 2-pass approach: first replace all )s within quoted fields with something that can't already exist in the input (e.g. RS), then identify the (...) fields and turn the RSs back into )s before printing them:
$ echo "$b" |
awk -F"'" -v OFS= '
{
for (i=2; i<=NF; i+=2) {
gsub(/)/,RS,$i)
$i = FS $i FS
}
FPAT = "[(][^)]*)"
$0 = $0
for (i=1; i<=NF; i++) {
gsub(RS,")",$i)
print $i
}
FS = FS
}
'
(abc#xyz.com,www.example.com,'field2,(2)')
(dfr#xyz.com,www.example.com,'field0')
The above is gawk-only due to FPAT (or we could have used gawk's patsplit()); with other awks you'd use a while-match()-substr() loop:
$ echo "$b" |
awk -F"'" -v OFS= '
{
    for (i=2; i<=NF; i+=2) {
        gsub(/)/,RS,$i)
        $i = FS $i FS
    }
    while ( match($0,/[(][^)]*)/) ) {
        field = substr($0,RSTART,RLENGTH)
        gsub(RS,")",field)
        print field
        $0 = substr($0,RSTART+RLENGTH)
    }
}
'
(abc#xyz.com,www.example.com,'field2,(2)')
(dfr#xyz.com,www.example.com,'field0')
Written and tested with your shown samples in GNU awk. This could be done simply by setting the field separator; try the following, where b is your shell variable holding the value shown above.
echo "$b" | awk -F'\\),\\(' '{print $1}'
(abc#xyz.com,www.example.com,'field2,(2)'
Explanation: simply setting the awk field separator to \\),\\( for your input and printing its first field.
A similar regex approach to the one Ed suggested, but I usually prefer using RS and RT over FPAT (RT is gawk-specific and holds the input text that matched RS):
b="(abc#xyz.com,www.example.com,'field2,(2)'),(dfr#xyz.com,www.example.com,'field0'),"
awk -v RS="[(]('[^']*'|[^)])*[)]" 'RT {print RT}' <<< "$b"
(abc#xyz.com,www.example.com,'field2,(2)')
(dfr#xyz.com,www.example.com,'field0')
If you want to do it in close to one pass, maybe try this:
{mawk/mawk2/gawk} 'BEGIN {
    OFS = FS = "\047"; ORS = RS = "\n";
    XFS = "\376\004\377";
    XRS = "\051" ORS;
}
! /[\051]/ { print; next; }
{
    for (x = 1; x <= NF; x += 2)
        gsub(/[\051][^\050]*/, XFS, $(x))
}
gsub(XFS, XRS) || 1'
I did it this way with 2 gsub()s just in case it starts sending rows on with unintended consequences. \051 is ")", \050 is the open one.
I further enhanced it by telling it to instantly print and move on if no closing brackets are found at all (so there is nothing to split).
It only loops over the odd-numbered fields once I split by the single quote \047 (because the even-numbered ones are precisely the ones within a pair of single quotes that you want to avoid chopping at).
As for XFS, just pick any combination of your choice using bytes that are almost impossible to encounter. If you want to play it safe, you can test whether XFS already exists in that row and use some alternative combo. It's basically there to insert a delimiter into the middle of the row that wouldn't run afoul of actual input data. It's not foolproof per se, but the likelihood of running into a combination of a UTF-16 byte order mark and ASCII control characters is reasonably low.
(And if you do encounter XFS, it's likely you already have corrupted data to begin with, since a 300-series octal byte must be followed by 200-series ones to be valid UTF-8.)
This way, I don't need FPAT at all.
* Updated with "|| 1" towards the end as a safety catch-all, but it shouldn't really be needed.

Naming of variables in gawk when using gensub

I am doing string substitution in gawk. The code below is a simplified version (the real replacement argument to gensub involves lots of "\\1\\3\\2", which is why I can't use sub/gsub). My question is one of robustness: since I'm modifying the 1st field ($1) with gensub, can I store the output of gensub in the variable $1, or does this have potential to cause problems (in other contexts; it works fine in my code)?
# test data
$ printf "Header_1\tHeader_2\nHiC_scaffold_1_1234\t1234\nHiC_scaffold_2_7890\t7890\n" > input.txt
# code I'm using (works as expected)
$ gawk 'BEGIN {FS = "\t"} FNR == 1 {next} \
> {one = gensub(/HiC_scaffold_([0-9]+)_([0-9]+) ?/, "HIC_SCAFFOLD_\\2_\\1", "g", $1)} \
> {print $2 "\t" one}' \
> input.txt > output.txt1
# code I'm asking about (works as expected with these test data)
$ gawk 'BEGIN {FS = "\t"} FNR == 1 {next} \
> {$1 = gensub(/HiC_scaffold_([0-9]+)_[0-9]+ ?/, "HIC_SCAFFOLD_\\2_\\1", "g", $1)} \
> {print $2 "\t" $1}' \
> input.txt > output.txt2
$ head *txt*
==> input.txt <==
Header_1 Header_2
HiC_scaffold_1_1234 1234
HiC_scaffold_2_7890 7890
==> output.txt1 <==
1234 HIC_SCAFFOLD_1
7890 HIC_SCAFFOLD_2
==> output.txt2 <==
1234 HIC_SCAFFOLD_1
7890 HIC_SCAFFOLD_2
If I understood you correctly, you are asking for a bit of a review of the second piece of code.
Can you assign a field? Yes, so $1 = gensub(something) is ok (ref).
Potential issues? Yes: if $n doesn't exist, then you are creating it, and thus modifying $0 as well. You are doing it on $1, though, and as far as I know, if a record ($0) exists then it has at least one field ($1), which might be empty.
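A quick illustration of that side effect (not from the question): assigning to a field beyond NF extends the record and rebuilds $0 with OFS between the now-existing fields.

$ echo 'a' | awk '{ $3 = "x"; print NF; print $0 }'
3
a  x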
Another caveat would apply if you were assigning to $0, but that feels a little out of scope. Do not attempt $1 = $1 after your gensub().
Finally, let's have a look at gensub(). If you provide no target to it, it falls back to using $0; you are passing $1 explicitly, so that is not an issue.
In the end, I cannot see a trivial situation where this can go wrong. Your code seems fine to me.

Removing content of a column based on number of occurences

I have a file (;-separated) with data like this:
111111121;000-000.1;000-000.2
111111211;000-000.1;000-000.2
111112111;000-000.1;000-000.2
111121111;000-000.1;000-000.2
111211111;000-000.1;000-000.2
112111111;000-000.1;000-000.2
121111112;000-000.2;020-000.8
121111121;000-000.2;020-000.8
121111211;000-000.2;020-000.8
121113111;000-000.3;000-200.2
211111121;000-000.1;000-000.2
I would like to remove any $3 that has fewer than 3 occurrences, so the outcome would be like:
111111121;000-000.1;000-000.2
111111211;000-000.1;000-000.2
111112111;000-000.1;000-000.2
111121111;000-000.1;000-000.2
111211111;000-000.1;000-000.2
112111111;000-000.1;000-000.2
121111112;000-000.2;020-000.8
121111121;000-000.2;020-000.8
121111211;000-000.2;020-000.8
121113111;000-000.3
211111121;000-000.1;000-000.2
That is, only $3 got deleted, as it had only a single occurrence.
Sadly, I am not really sure whether (and thus how) this could be done relatively easily (doing the =COUNT.IF matching and manual deletion in Excel feels quite embarrassing).
$ awk -F';' 'NR==FNR{cnt[$3]++;next} cnt[$3]<3{sub(/;[^;]+$/,"")} 1' file file
111111121;000-000.1;000-000.2
111111211;000-000.1;000-000.2
111112111;000-000.1;000-000.2
111121111;000-000.1;000-000.2
111211111;000-000.1;000-000.2
112111111;000-000.1;000-000.2
121111112;000-000.2;020-000.8
121111121;000-000.2;020-000.8
121111211;000-000.2;020-000.8
121113111;000-000.3
211111121;000-000.1;000-000.2
or if you prefer:
$ awk -F';' 'NR==FNR{cnt[$3]++;next} {print (cnt[$3]<3 ? $1 FS $2 : $0)}' file file
This awk one-liner can help; it processes the file twice (the trailing 7 is just an always-true pattern, like 1, that triggers the default print):
awk -F';' -v OFS=';' 'NR==FNR{a[$3]++;next}a[$3]<3{NF--}7' file file
Though the awk solutions are the best in terms of performance, your goal could also be achieved with something like this:
while IFS=" " read a b;do
if [[ "$a" -lt "3" ]];then
sed -i "s/$b//" b.txt
fi
done <<<"$(cut -d";" -f3 b.txt |sort |uniq -c)"
Operation is based on the output of cut which counts occurrences.
$cut -d";" -f3 b.txt |sort |uniq -c
7 000-000.2
1 000-200.2
3 020-000.8
The above edits the source file in place, so keep a backup for testing.
You can feed the file to awk twice. On the first pass you gather statistics that you use in the second pass:
script.awk
FNR == NR { stats[ $3 ]++
            next
          }
          { if( stats[$3] < 3 ) print $1 FS $2
            else print
          }
Run it like this: awk -F\; -f script.awk yourfile yourfile .
The condition FNR == NR is true during processing of the first filename given to awk. The next statement skips the second block.
Thus the second block is only used for processing the second filename given to awk (which is here the same as the first filename).
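A small way to see the FNR == NR idiom in action (a sketch; any small file passed twice will do):

$ printf 'x\ny\n' > demo.txt
$ awk '{ print FILENAME, NR, FNR, (FNR == NR ? "first pass" : "second pass") }' demo.txt demo.txt
demo.txt 1 1 first pass
demo.txt 2 2 first pass
demo.txt 3 1 second pass
demo.txt 4 2 second pass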

Modifying a number value in text

I have text coming in as:
A1:B2.C3.D4.E5
A2:B7.C10.D0.E9
A0:B1.C9.D4.E8
I wonder how to change it to:
A1:B2.C1.D4.E5
A2:B7.C8.D0.E9
A0:B1.C7.D4.E8
using awk. The first problem is the multiple delimiters. The second is how to get the C value and decrement it by 2.
awk solution:
$ awk -F"." '{$2=substr($2,0,1)""substr($2,2)-2;}1' OFS="." file
A1:B2.C1.D4.E5
A2:B7.C8.D0.E9
A0:B1.C7.D4.E8
I was wondering whether an awk regexp would do the job but, apparently, awk cannot capture patterns. This is why I suggest a Perl solution:
$ cat data.txt
A1:B2.C3.D4.E5
A2:B7.C10.D0.E9
A0:B1.C9.D4.E8
$ perl -pe 's/C([0-9]+)/"C" . ($1-2)/ge;' data.txt
A1:B2.C1.D4.E5
A2:B7.C8.D0.E9
A0:B1.C7.D4.E8
Admittedly, I probably would have done this using the substr() function like Guru has shown:
awk 'BEGIN { FS=OFS="." } { $2 = substr($2,1,1) substr($2,2) - 2 }1' file
I also like Aif's answer using Perl, probably just a little more. Shorter is sweeter, isn't it? However, GNU awk can capture patterns. Here's how:
awk 'BEGIN { FS=OFS="." } match($2, /(C)(.*)/, a) { $2 = a[1] a[2] - 2}1' file
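Applied to the sample data above, this should produce the same result as the substr() version:

$ gawk 'BEGIN { FS=OFS="." } match($2, /(C)(.*)/, a) { $2 = a[1] a[2] - 2 }1' data.txt
A1:B2.C1.D4.E5
A2:B7.C8.D0.E9
A0:B1.C7.D4.E8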