awk: "default" action if no pattern was matched? - awk

I have an awk script which checks for a lot of possible patterns, doing something for each pattern. I want something to be done in case none of the patterns was matched. i.e. something like this:
/pattern 1/ {action 1}
/pattern 2/ {action 2}
...
/pattern n/ {action n}
DEFAULT {default action}
Where of course, the "DEFAULT" line is not awk syntax, and I wish to know if there is such a syntax (like there usually is in switch/case statements in many programming languages).
Of course, I can always add a "next" command after each action, but this is tedious in case I have many actions, and more importantly, it prevents me from matching the line to two or more patterns.

You could invert the match using the negation operator ! so something like:
!/pattern 1|pattern 2|pattern/{default action}
But that's pretty nasty for n>2. Alternatively you could use a flag:
{f=0}
/pattern 1/ {action 1;f=1}
/pattern 2/ {action 2;f=1}
...
/pattern n/ {action n;f=1}
f==0{default action}
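A minimal runnable sketch of the flag approach. The patterns /a/ and /b/, the variable name matched, and the function name classify are placeholders chosen for illustration, not part of the question. It shows that a line can still match several patterns while the default rule fires only when none matched:

```shell
# Hypothetical patterns /a/ and /b/ stand in for "pattern 1" and "pattern 2".
classify() {
  awk '
    { matched = 0 }                           # reset the flag for every input line
    /a/ { print "has a: " $0; matched = 1 }
    /b/ { print "has b: " $0; matched = 1 }
    !matched { print "default: " $0 }         # runs only if no rule above fired
  '
}

printf 'ab\nb\nxyz\n' | classify
# has a: ab
# has b: ab
# has b: b
# default: xyz
```

The reset rule has to come first, because awk variables keep their value from one input line to the next.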

GNU awk has switch statements:
$ cat tst1.awk
{
switch($0)
{
case /a/:
print "found a"
break
case /c/:
print "found c"
break
default:
print "hit the default"
break
}
}
$ cat file
a
b
c
d
$ gawk -f tst1.awk file
found a
hit the default
found c
hit the default
Alternatively with any awk:
$ cat tst2.awk
/a/ {
print "found a"
next
}
/c/ {
print "found c"
next
}
{
print "hit the default"
}
$ awk -f tst2.awk file
found a
hit the default
found c
hit the default
Use the "break" or "next" as/when you want to, just like in other programming languages.
Or, if you like using a flag:
$ cat tst3.awk
{ DEFAULT = 1 }
/a/ {
print "found a"
DEFAULT = 0
}
/c/ {
print "found c"
DEFAULT = 0
}
DEFAULT {
print "hit the default"
}
$ gawk -f tst3.awk file
found a
hit the default
found c
hit the default
It's not exactly the same semantics as a true "default" though, so its usage like that could be misleading. I wouldn't normally advocate using all-upper-case variable names, but lower case "default" would clash with the gawk keyword, so the script wouldn't be portable to gawk in future.

As mentioned above by tue, my understanding of the standard approach in Awk is to put next at each alternative and then have a final action without a pattern.
/pattern1/ { action1; next }
/pattern2/ { action2; next }
{ default-action }
The next statement will guarantee that no more patterns are considered for the line in question. And the default-action will always happen if the previous ones don't happen (thanks to all the next statements).

There is no "maintenance free" solution for a DEFAULT branch in awk.
The first possibility I would suggest is to end each pattern-match branch with a 'next' statement, so it acts like a break statement. Then add a final action at the end that matches everything; that is the DEFAULT branch.
The other possibility would be:
set a flag in each branch that has a pattern match (i.e. your non-default branches)
e.g. start your actions with NONDEFAULT=1; and reset the flag in a leading { NONDEFAULT=0 } rule, since awk variables persist across lines.
Add a last action at the end (the default branch) that uses the condition NONDEFAULT==0 instead of a regular expression match.

A fairly clean, portable workaround is using an if statement:
Instead of:
pattern1 { action1 }
pattern2 { action2 }
...
one could use the following:
{
if ( pattern1 ) { action1 }
else if ( pattern2 ) { action2 }
else { here is your default action }
}
As mentioned above, GNU awk has switch statements, but other awk implementations don't, so using switch would not be portable.
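A small sketch of the if/else chain with concrete placeholder patterns (/a/ and /c/, borrowed from the earlier answer; the function name first_match is made up here). Note that the chain stops at the first branch that matches, so it behaves like the "next" approach rather than like independent rules:

```shell
# Hypothetical patterns; only the first matching branch runs for each line.
first_match() {
  awk '{
    if ($0 ~ /a/)      { print "found a" }
    else if ($0 ~ /c/) { print "found c" }
    else               { print "hit the default" }
  }'
}

printf 'ac\nb\nc\n' | first_match
# found a
# hit the default
# found c
```

The line "ac" matches both patterns but only reports "found a"; if a line must be able to trigger several actions, the flag approach is the better fit.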

Related

Using awk to process html-related Gift-format Moodle questions

This is basically an awk question, but it is about processing data for the Moodle Gift format, thus the tags.
I want to format html code in a question (Moodle "test" activity) but I need to replace < and > with the corresponding entities, as these will be interpreted as "real" html, and not printed.
However, I want to be able to type the question with regular code and post-process the file before importing it as gift into Moodle.
I thought awk would be the perfect tool to do this.
Say I have this (invalid as such) Moodle/gift question:
::q1::[html]This is a question about HTML:
<pre>
<p>some text</p>
</pre>
and some tag:<code><img></code>
{T}
What I want is a script that translates this into a valid gift question:
::q1::[html]This is a question about HTML:
<pre>
&lt;p&gt;some text&lt;/p&gt;
</pre>
and some tag:<code>&lt;img&gt;</code>
{T}
Key point: replace < and > with &lt; and &gt; when:
inside a <pre>-</pre> block (assuming those tags are alone on a line)
between <code> and </code>, with an arbitrary string in between.
For the first part, I'm fine. I have a shell script calling awk (gawk, actually).
awk -f process_src2gift.awk $1.src >$1.gift
with process_src2gift.awk:
BEGIN { print "// THIS IS A GENERATED FILE !" }
{
if( $1=="<pre>" ) # opening a "code" block
{
code=1;
print $0;
}
else
{
if( $1=="</pre>" ) # closing a "code" block
{
code=0;
print $0;
}
else
{ # if "code block", replace < > by html entities
if( code==1 )
{
gsub(">","\\&gt;");
gsub("<","\\&lt;");
}
print $0;
}
}
}
END { print "// END" }
However, I'm stuck with the second requirement..
Questions:
Is it possible to add code to my awk script to process the html code inside the <code> tags? Any ideas? I thought about using sed but I didn't see how to do that.
Maybe awk isn't the right tool for that ? I'm open for any suggestion on other (standard Linux) tool.
Answering own question.
I found a solution by doing a two step awk process:
first step as described in question
second step by defining <code> or </code> as field delimiter, using a regex, and process the string replacement on second argument ($2).
The shell file becomes:
echo "Step 1"
awk -f process_src2gift.awk $1.src >$1.tmp
echo "Step 2"
awk -f process_src2gift_2.awk $1.tmp >$1.gift
rm $1.tmp
And the second awk file (process_src2gift_2.awk) will be:
BEGIN { FS="[<][/]?[c][o][d][e][>]"; }
{
gsub(">","\\&gt;",$2);
gsub("<","\\&lt;",$2);
if( NF >= 3 )
print $1 "<code>" $2 "</code>" $3
else
print $0
}
Of course, there are limitations:
no attributes in the <code> tag
only one pair <code></code> in the line
probably others...
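The "only one pair per line" limitation could be lifted by scanning the line with index() and substr() instead of splitting on FS. The following is only a sketch of that idea, not the author's script: the names escape_code, escape, out, rest, start and stop are invented here, and it still assumes <code> tags without attributes.

```shell
# Sketch: escape < and > inside every <code>...</code> pair on a line.
escape_code() {
  awk '
    function escape(s) {
      gsub(/>/, "\\&gt;", s)                  # \\& yields a literal ampersand
      gsub(/</, "\\&lt;", s)
      return s
    }
    {
      out = ""; rest = $0
      while ((start = index(rest, "<code>")) > 0) {
        out  = out substr(rest, 1, start + 5) # text up to and including <code>
        rest = substr(rest, start + 6)
        stop = index(rest, "</code>")
        if (stop == 0) break                  # unclosed tag: keep the remainder as is
        out  = out escape(substr(rest, 1, stop - 1)) "</code>"
        rest = substr(rest, stop + 7)
      }
      print out rest
    }
  '
}

printf '%s\n' 'a <code><b></code> and <code><i></code> <p>' | escape_code
# a <code>&lt;b&gt;</code> and <code>&lt;i&gt;</code> <p>
```

Text outside the <code> pairs (the trailing <p> above) is left untouched, so this would replace only the second step of the two-step process.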

In AWK, skip the rest of the current action?

Thanks for looking.
I have an AWK script with something like this;
/^test/{
if ($2 == "2") {
# What goes here?
}
# Do some more stuff with lines that match test, but $2 != "2".
}
NR>1 {
print $0
}
I'd like to skip the rest of the action, but process the rest of the patterns/actions on the same line.
I've tried return but this isn't a function.
I've tried next but that skips the rest of the patterns/actions for the current line.
For now I've wrapped the rest of the ^test action in the if statement's else, but I was wondering if there was a better approach.
Not sure this matters but I am using gawk on OSX, installed via brew (for better compatibility with my target OS).
Update (w/solution):
Edits: Expanded code sample based on @karakfa's answer.
BEGIN{
keepLastLine = 1;
}
/^test/ && !keepLastLine{
printLine = 1;
print $0;
next;
}
/^test/ && keepLastLine{
printLine = 0;
next;
}
/^foo/{
# This is where I have the rest of my logic (approx 100 lines),
# including updates to printLine and keepLastLine
}
NR>1 {
if (printLine) {
print $0
}
}
This will work for me; I even like it better than what I was thinking of.
However, I do wonder: what if my keepLastLine condition was only accessible in a for loop?
I gather from what @karakfa has said that there isn't a control structure for exiting only an action and continuing with other patterns, so that would have to be implemented with a flag of some sort (not unlike @RavinderSingh13's answer).
If I understood correctly, please try the following. I use a flag variable named found, which is set when the check inside the test block (whether the 2nd field is 2) is true. Once it is set, the rest of the statements in the test block are not executed. The flag is also reset before each line is processed.
awk '
{
found=""
}
/^test/{
if ($2 == "2") {
# What goes here?
found=1
}
if(!found){
# Do some more stuff with lines that match test, but $2 != "2".
}
}
NR>1 {
print $0
}' Input_file
Testing of code here:
Let's say following is the Input_file:
cat Input_file
file
test 2 file
test
abcd
Running the following code gives the output below. Even for a line that contains the test keyword but does not have $2==2, the statements outside the test block are still executed.
awk '
{
found=""
}
/^test/{
if ($2 == "2") {
print "# What goes here?"
found=1
}
if(!found){
print "Do some more stuff with lines that match test, but $2 != 2"
}
}
NR>1 {
print $0
}' Input_file
# What goes here?
test 2 file
Do some more stuff with lines that match test, but $2 != 2
test
abcd
the magic keyword you're looking for is else
/^test/{ if($2==2) { } # do something
else { } # do something else
}
NR>1 # {print $0} is implied.
for some reason if you don't want to use else just move up condition one up (flatten the hierarchy)
/^test/ && $2==2 { } # do something
/^test/ && $2!=2 { } # do something else
# other action{statement}s

What does it mean when an awk script has code outside curly braces?

Going through an awk tutorial, I came across this line
substr($0,20,5) == "HELLO" {print}
which prints a line if there is a "HELLO" string starting at 20th char.
Now I thought curly braces were necessary at the start of an awk script and an 'if' for this to work, but it works without nevertheless.
Can someone explain how it evaluates?
If you have:
{ action }
...then that action runs on every line. By contrast, if you have:
condition { action }
...then that action runs only against lines for which the condition is true.
Finally, if you have only a condition, then the default action is print:
NR % 2 == 0
...will thus print every other line.
You can similarly have multiple pairs in a single script:
condition1 { action1 }
condition2 { action2 }
{ unconditional_action }
...and can also have BEGIN and END blocks, which run at the start and end of execution.
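The rule forms described above can be put side by side in one small script. This is an illustrative sketch only; the function name demo, the counter count and the sample input are made up:

```shell
demo() {
  awk '
    BEGIN       { print "start" }             # runs once, before any input
    NR % 2 == 0                               # condition only: default action is print
    /x/         { print "has x: " $0 }        # condition with an explicit action
                { count++ }                   # action only: runs on every line
    END         { print count " lines" }      # runs once, after all input
  '
}

printf 'ax\nb\nc\nxd\n' | demo
# start
# has x: ax
# b
# xd
# has x: xd
# 4 lines
```

Rules are tried in the order they appear for each line, which is why "xd" is printed by the condition-only rule before the /x/ rule reports it.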

Getting output for every line of input file. Only need output once

This should be pretty simple but I'm having an issue with the flow of an awk script. I run the following script and it prints the output over and over again (if I had to guess I would say that it's printing once for every line of the input file). As requested, here is some fake input:
[30000] (03/20 00:00:02.950):{0x2D90} Pattern1 5.0.3.57
[30000] (03/20 00:00:03.911):{0x2D90} Pattern2 5.0.3.57
[30000] (03/20 00:00:02.950):{0x2D90} Pattern3 5.0.3.16
[30000] (03/20 00:00:03.911):{0x2D90} Pattern4 5.0.3.16
Here is the script:
/Pattern1/ {
gsub(/\./,"");
agtver=$5;
}
/Pattern2/ {
gsub(/\./,"");
ctrver=$5;
}
{
if (agtver ~ 50357 && ctrver ~ 50357) {
print "Blamo!";
}
else print "No blamo. :("
}
And here is the output that I'm getting:
[chawkins@chawkins-DT Devel]$ ./fakeawk.awk < fake.txt
No blamo. :(
Blamo!
Blamo!
Blamo!
The output that I expect is a single Blamo! if the patterns match and a single No blamo. :( if they don't match.
The problem seems to be that there are three separate { ... } sections, but I need these to be able to process two patterns... unless there is a way to condense this.
If you never see pattern1 and pattern2 after the first time, then agtver and ctrver remain set. You have to zero them out again.
edit added debug output, you should be able to see where the logic is failing.
Tested with your data, thanks for adding that!
/Pattern1/ { gsub(/\./,""); agtver=$5;}
/Pattern2/ { gsub(/\./,""); ctrver=$5;}
{
#dbg print "\n#dbg: $5=" $5 "xx\tagtver=" agtver "xx\tctrver=" ctrver "xxx\t$0=" $0
if (agtver ~ 50357 && ctrver ~ 50357) {
print "Blamo!";
agtver="" ; ctrver=""
}
else print "No blamo. :("
}
./fakeawk.awk < fake.txt
output
No blamo. :(
Blamo!
No blamo. :(
No blamo. :(
I hope this helps.
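If the goal is exactly one line of output per file, one option is to move the check into an END block, which runs once after all input has been read. This is a sketch under that assumption rather than the answer's code; the function name check_versions is invented, and it compares the versions as exact strings instead of using a regex match:

```shell
check_versions() {
  awk '
    /Pattern1/ { agtver = $5; gsub(/\./, "", agtver) }   # strip dots from a copy of field 5
    /Pattern2/ { ctrver = $5; gsub(/\./, "", ctrver) }
    END {
      # Runs once, after the last line, so only one verdict is printed.
      if (agtver == "50357" && ctrver == "50357") { print "Blamo!" }
      else                                        { print "No blamo. :(" }
    }
  '
}

check_versions <<'EOF'
[30000] (03/20 00:00:02.950):{0x2D90} Pattern1 5.0.3.57
[30000] (03/20 00:00:03.911):{0x2D90} Pattern2 5.0.3.57
[30000] (03/20 00:00:02.950):{0x2D90} Pattern3 5.0.3.16
[30000] (03/20 00:00:03.911):{0x2D90} Pattern4 5.0.3.16
EOF
# Blamo!
```

Applying gsub to the variable rather than to $0 leaves the input line itself unmodified.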
TXR:
@(gather :vars (agtver ctrver))
@ (skip :greedy) @/Pattern1/ @{agtver /5\.0\.3\.57/}
@ (skip :greedy) @/Pattern2/ @{ctrver /5\.0\.3\.57/}
@(end)
@(do (put-string "Blamo!\n"))
Output:
$ txr fake.txr fake.log
Blamo!
$ echo "junk" | txr fake.txr -
false
The @(gather) directive is perfect for this. It matches material that can appear in any order, and :vars (agtver ctrver) adds the constraint that bindings must be found for both of these variables, or else a failure occurs.
We can then express the two independent conditions we are looking for as a pair of independent whole-line pattern matches which bind two different variables.
The logic may be read as "please scan the input to gather bindings for variables agtver and ctrver or else fail". And then the rules for gathering the variables are specified, one per line.
We don't really need the side effect of printing Blamo!: the successful or failed termination of the program tells us everything.

Break down JSON string in simple perl or simple unix?

OK, so I have this
{"status":0,"id":"7aceb216d02ecdca7ceffadcadea8950-1","hypotheses":[{"utterance":"hello how are you","confidence":0.96311796}]}
and at the moment i'm using this shell command to decode it to get the string i need,
echo $x | grep -Po '"utterance":.*?[^\\]"' | sed -e s/://g -e s/utterance//g -e 's/"//g'
but this only works when you have a grep compiled with perl and plus the script i use to get that JSON string is written in perl, so is there any way i can do this same decoding in a simple perl script or a simpler unix command, or better yet, c or objective-c?
the script i'm using to get the json is here, http://pastebin.com/jBGzJbMk and if you want a file to use then download http://trevorrudolph.com/a.flac
How about:
perl -MJSON -nE 'say decode_json($_)->{hypotheses}[0]{utterance}'
in script form:
use JSON;
while (<>) {
print decode_json($_)->{hypotheses}[0]{utterance}, "\n"
}
Well, I'm not sure if I can deduce what you are after correctly, but this is a way to decode that JSON string in perl.
Of course, you'll need to know the data structure in order to get the data you need. The line that prints the "utterance" string is commented out in the code below.
use strict;
use warnings;
use Data::Dumper;
use JSON;
my $json = decode_json
q#{"status":0,"id":"7aceb216d02ecdca7ceffadcadea8950-1","hypotheses":[{"utterance":"hello how are you","confidence":0.96311796}]}#;
#print $json->{'hypotheses'}[0]{'utterance'};
print Dumper $json;
Output:
$VAR1 = {
'status' => 0,
'hypotheses' => [
{
'utterance' => 'hello how are you',
'confidence' => '0.96311796'
}
],
'id' => '7aceb216d02ecdca7ceffadcadea8950-1'
};
Quick hack:
while (<>) {
say for /"utterance":"?(.*?)(?<!\\)"/;
}
Or as a one-liner:
perl -lnwe 'print for /"utterance":"(.+?)(?<!\\)"/g' inputfile.txt
The one-liner is troublesome if you happen to be using Windows, since " is interpreted by the shell.
Quick hack#2:
This will hopefully go through any hash structure and find keys.
my $json = decode_json $str;
say find_key($json, 'utterance');
sub find_key {
my ($ref, $find) = @_;
if (ref $ref) {
if (ref $ref eq 'HASH' and defined $ref->{$find}) {
return $ref->{$find};
} else {
for (ref $ref eq 'HASH' ? values %$ref : ref $ref eq 'ARRAY' ? @$ref : ()) {
my $found = find_key($_, $find);
if (defined $found) {
return $found;
}
}
}
}
return;
}
Based on the naming, it's possible to have multiple hypotheses. This prints the utterance of each hypothesis:
echo '{"status":0,"id":"7aceb216d02ecdca7ceffadcadea8950-1","hypotheses":[{"utterance":"hello how are you","confidence":0.96311796}]}' | \
perl -MJSON::XS -n000E'
say $_->{utterance}
for @{ JSON::XS->new->decode($_)->{hypotheses} }'
Or as a script:
use feature qw( say );
use JSON::XS;
my $json = '{"status":0,"id":"7aceb216d02ecdca7ceffadcadea8950-1","hypotheses":[{"utterance":"hello how are you","confidence":0.96311796}]}';
say $_->{utterance}
for @{ JSON::XS->new->decode($json)->{hypotheses} };
If you don't want to use any modules from CPAN and try a regex instead there are multiple variants you can try:
# JSON is on a single line:
$json = '{"other":"stuff","hypo":[{"utterance":"hi, this is \"bob\"","moo":0}]}';
# RegEx with negative look behind:
# Match everything up to a double quote without a Backslash in front of it
print "$1\n" if ($json =~ m/"utterance":"(.*?)(?<!\\)"/)
This regex works if there is only one utterance. It doesn't matter what else is in the string around it, since it only searches for the double quoted string following the utterance key.
For a more robust version you could add whitespace where necessary/possible and make the . in the RegEx match newlines: m/"utterance"\s*:\s*"(.*?)(?<!\\)"/s
If you have multiple entries for the utterance confidence hash/object, changing case and weird formatting of the JSON string try this:
# weird JSON:
$json = <<'EOJSON';
{
"status":0,
"id":"an ID",
"hypotheses":[
{
"UtTeraNcE":"hello my name is \"Bob\".",
"confidence":0.0
},
{
'utterance' : 'how are you?',
"confidence":0.1
},
{
"utterance"
: "
thought
so!
",
"confidence" : 0.9
}
]
}
EOJSON
# RegEx with alternatives:
print "$1\n" while ( $json =~ m/["']utterance["']\s*:\s*["'](([^\\"']|\\.)*)["']/gis);
The main part of this RegEx is "(([^\\"]|\\.)*)". Description in detail as extended regex:
/
["'] # opening quotes
( # start capturing parentheses for $1
( # start of grouping alternatives
[^\\"'] # anything that's not a backslash or a quote
| # or
\\. # a backslash followed by anything
) # end of grouping
* # in any quantity
) # end capturing parentheses
["'] # closing quotes
/xgs
If you have many data sets and speed is a concern you can add the o modifier to the regex and use character classes instead of the i modifier. You can suppress the capturing of the alternatives to $2 with clustering parenthesis (?:pattern). Then you get this final result:
m/["'][uU][tT][tT][eE][rR][aA][nN][cC][eE]["']\s*:\s*["']((?:[^\\"']|\\.)*)["']/gos
Yes, sometimes perl looks like a big explosion in a bracket factory ;-)
Just stumbled upon another nice method of doing this. I finally found how to access the Mac OS X JavaScript engine from the command line; here's the script:
alias jsc='/System/Library/Frameworks/JavaScriptCore.framework/Versions/A/Resources/jsc'
x='{"status":0,"id":"7aceb216d02ecdca7ceffadcadea8950-1","hypotheses":[{"utterance":"hello how are you","confidence":0.96311796}]}'
jsc -e "print(${x}['hypotheses'][0]['utterance'])"
Ugh, yes, I came up with another answer. I'm studying Python, and its literal syntax for dicts and lists happens to read this JSON string as-is, so I just made this one-liner for when your variable is x:
python -c "print(${x}['hypotheses'][0]['utterance'])"
figured it out for unix but would love to see your perl and c, objective-c answers...
echo $X | sed -e 's/.*utterance//' -e 's/confidence.*//' -e s/://g -e 's/"//g' -e 's/,//g'
:D
shorter copy of the same sed:
echo $X | sed -e 's/.*utterance//;s/confidence.*//;s/://g;s/"//g;s/,//g'