I am very intrigued by the ability to add actions to ParseKit grammars. There is surprisingly little documentation on what is available in those actions. Say I have two rules like:
databaseName = Word;
createTableStmt ='CREATE' ('TEMP'| 'TEMPORARY')? 'TABLE' 'IF NOT EXISTS'? databaseName;
This obviously isn't a whole grammar but will serve as an example. When parsing i'd like to "return" a CreateTableStmt object that has certain properties. If I understand the tool correctly i'd add an action to the rule, do stuff then push it on the assembly which will carry it around for the next rule to deal with or use.
So for example it would look like:
createTableStmt ='CREATE' ('TEMP'| 'TEMPORARY')? 'TABLE' 'IF NOT EXISTS'? databaseName;
{
AnotherObj* dbName = Pop(); //gives me the top most object
CreateTableStmt* createTable = [[CreateTableStmt alloc] initWith:dbName];
//set if it was temporary
// set 'IF NOT EXISTS'
PUSH(createTable);//push back on stack for next rule to use
}
Then when everything is parsed I can just get that root object off the stack and it is a fully instantiated custom representation of the grammar. Somewhat like building an AST if i remember correctly. I can then do stuff with that representation much easier than with the passed in string.
My question is how can I see if it matched ('TEMP' | 'TEMPORARY') so I can set the value. Are those tokens on the stack? Is there a better way than to pop back to the 'CREATE' and see if we passed it. Should I be popping back to the bottom of the stack anyway on each match?
Also if my rule was instead
qualifiedTableName = (databaseName '.')? tableName (('INDEXED' 'BY' indexName) | ('NOT' 'INDEXED'))?;
Is it correct to assume that the action would not be called until the rule had been matched? So in this case when the action is called to the stack could look like:
possibly:
|'INDEXED'
|'NOT'
or:
|indexName (A custom object possibly)
|'BY'
|'INDEXED
|tableName (for sure will be here)
and possibly these
|'.' (if this is here I know the database name must be here) if not push last one on?
|databaseName
--------------(perhaps more things from other rules)
Are these correct assessments? Is there any other documentation on actions? I know it is heavily based on Antlr but its the subtle differences that can really get you in trouble.
Creator of ParseKit here. A few items:
ParseKit deprecation:
Just this week, I have forked ParseKit to a cleaner/smaller/faster library called PEGKit. ParseKit should be considered deprecated, and PEGKit should be used for all new development. Please move to PEGKit.
PEGKit is nearly identical to the grammar and code-gen features of ParseKit, and your ParseKit grammars are usable with PEGKit with a few small changes. In fact, all of the examples in your question here are usable with no changes in PEGKit.
See the Deprecation Notice in the ParseKit README.
And this tutorial on PEGKit.
Syntax errors in your grammar:
I spot 3 syntax errors in your grammar samples above (this applies equally to both ParseKit and PEGKit).
This line:
createTableStmt ='CREATE' ('TEMP'| 'TEMPORARY')? 'TABLE' 'IF NOT EXISTS'? databaseName;
Should be:
createTableStmt ='CREATE' ('TEMP'| 'TEMPORARY')? 'TABLE' ('IF' 'NOT' 'EXISTS')? databaseName;
Notice the break up of the invalid 'IF NOT EXISTS' construct into individual literal tokens. This is not only necessary, but also desireable so that variable whitespace between the words is allowed.
The POP() macro should be all upper case.
Your createTableStmt rule is missing a semicolon at the very end (after the action's closing }).
Before Answering:
Make sure you are using v0.3.1 PEGKit or later (HEAD of master). I fixed an important bug while finding the answer to your question, and my solutions below require this fix.
Answer to your first question:
My question is how can I see if it matched ('TEMP' | 'TEMPORARY') so I can set the value?
Good question! You basically have the right idea in your further comments above.
Specficially, I would probably break up the createTableStmt rule into 4 rules like this:
createTableStmt = 'CREATE'! tempOpt 'TABLE'! existsOpt databaseName ';'!;
databaseName = QuotedString;
tempOpt
= ('TEMP'! | 'TEMPORARY'!)
| Empty
;
existsOpt
= ('IF'! 'NOT'! 'EXISTS'!)
| Empty
;
Notice all of the vital ! discard directives for discarding unneeded literal tokens.
Also Notice that I've changed the last two rules to use | Empty rather than ?. This is so I can add Actions to the Empty alternatives (you'll see that in a sec).
Then you can either add Actions to your grammar, or use ObjC parser delegate callbacks if you prefer to work in pure code.
If you use Actions in your grammar, something like the following will work:
createTableStmt = 'CREATE'! tempOpt 'TABLE'! existsOpt databaseName ';'!
{
NSString *dbName = POP();
BOOL ifNotExists = POP_BOOL();
BOOL isTemp = POP_BOOL();
NSLog(#"create table: %#, %d, %d", dbName, ifNotExists, isTemp);
// go to town
// myCreateTable(dbName, ifNotExists, isTemp);
};
databaseName = QuotedString
{
// pop the string value of the `PKToken` on the top of the stack
NSString *dbName = POP_STR();
// trim quotes
dbName = [dbName substringWithRange:NSMakeRange(1, [dbName length]-2)];
// leave it on the stack for later
PUSH(dbName);
};
tempOpt
= ('TEMP'! | 'TEMPORARY'!) { PUSH(#YES); }
| Empty { PUSH(#NO); }
;
existsOpt
= ('IF'! 'NOT'! 'EXISTS'!) { PUSH(#YES); }
| Empty { PUSH(#NO); }
;
I've added this Grammar and a test case to the PEGKit project.
As for your second question, please break it out as a new SO question, and tag it ParseKit and PEGKit and I will get to it ASAP.
Related
I am new to bison, and have the misfortune of needing to write a parser for a language that may have what would otherwise be an operator within a variable name. For example, depending on context, the expression
FOO = BAR-BAZ
could be interpreted as either:
the variable "FOO" being assigned the value of the variable "BAR" minus the value of the variable "BAZ", OR
the variable "FOO" being assigned the value of the variable "BAR-BAZ"
Fortunately the language requires variable declarations ahead of time, so I can determine whether a given string is a valid variable via a function I've implemented:
bool isVariable(char* name);
that will return true if the given string is a valid variable name, and false otherwise.
How do I tell bison to attempt the second scenario above first, and only if (through use of isVariable()) that path fails, go back and try it as the first scenario above? I've read that you can have bison try multiple parsing paths and cull invalid ones when it encounters a YYERROR, so I've tried a set of rules similar to:
variable:
STRING { if(!isVariable($1)) YYERROR; }
;
expression:
expression '-' expression
| variable
;
but when given "BAR-BAZ" the parser tries it as a single variable and just stops completely when it hits the YYERROR instead of exploring the "BAR" - "BAZ" path as I expect. What am I doing wrong?
Edit:
I'm beginning to think that my flex rule for STRING might be the culprit:
((A-Z0-9][-A-Z0-9_///.]+)|([A-Z])) {yylval.sval = strdup(yytext); return STRING;}
In this case, if '-' appears in the middle of alphanumeric characters, the whole lot is treated as 1 STRING, without the possibility for subdivision by the parser (and therefore only one path explored). I suppose I could manually parse the STRING in the parser action, but it seems like there should be a better way. Perhaps flex could give back alternate token streams (one for the "BAR-BAZ" case and another for the "BAR"-"BAZ" case) that are diverted to different parser stacks for exploration? Is something like that possible?
It's not impossible to solve this problem within a bison-generated parser, but it's not easy, and the amount of hackery required might detract from the readability and verifiability of the grammar.
To be clear, GLR parsers are not fallback parsers. The GLR algorithm explores all possible parses in parallel, and rejects invalid ones as it goes. (The bison implementation requires that the parse converge to a single possible parse; the original GLR algorithm produces forest of parse trees.) Also, the GLR algorithm does not contemplate multiple lexical analyses.
If you want to solve this problem in the context of the parser, you'll probably need to introduce special handling for whitespace, or at least for - which are not surrounded by whitespace. Otherwise, you will not be able to distinguish between a - b (presumably always subtraction) and a-b (which might be the variable a-b if that variable were defined). Leaving aside that issue, you would be looking for something like this (but this won't work, as explained below):
expr : term
| expr '-' term
term : factor
| term '*' factor
factor: var
| '(' expr ')'
var : ident { if (!isVariable($1)) { /* reject this production */ } }
ident : WORD
| ident '-' WORD { $$ = concatenate($1, "-", $3); }
This won't work because the action associated with var : ident is not executed until after the parse has been disambiguated. So if the production is rejected, the parse fails, because the parser has already determined that the production is necessary. (Until the parser makes that determination, actions are deferred.)
Bison allows GLR grammars to use semantic predicates, which are executed immediately instead of being deferred. But that doesn't help, because semantic predicates cannot make use of computed semantic values (since the semantic value computations are still deferred when the semantic predicate is evaluated). You might think you could get around this by making the computation of the concatenated identifier (in the second ident production) a semantic predicate, but then you run into another limitation: semantic predicates do not themselves have semantic values.
Probably there is a hack which will get around this problem, but that might leave you with a different problem. Suppose that a, c, a-b and b-c are defined variables. Then, what is the meaning of a-b-c? Is it (a-b) - c or a - (b-c) or an error?
If you expect it to be an error, then there is no problem since the GLR parser will find both possible parses and bison-generated GLR parsers signal a syntax error if the parse is ambiguous. But then the question becomes: is a-b-c only an error if it is ambiguous? Or is it an error because you cannot use a subtraction operator without surround whitespace if its arguments are hyphenated variables? (So that a-b-c can only be resolved to (a - b) - c or to (a-b-c), regardless of whether a-b and b-c exist?) To enforce the latter requirement, you'll need yet more complication.
If, on the other hand, your language is expected to model a "fallback" approach, then the result should be (a-b) - c. But making that selection is not a simple merge procedure between two expr reductions, because of the possibility of a higher precedence * operator: d * a-b-c either resolves to (d * a-b) - c or (d * a) - b-c; in those two cases, the parse trees are radically different.
An alternative solution is to put the disambiguation of hyphenated variables into the scanner, instead of the parser. This leads to a much simpler and somewhat clearer definition, but it leads to a different problem: how do you tell the scanner when you don't want the semantic disambiguation to happen? For example, you don't want the scanner to insist on breaking up a variable name into segments when you the name occurs in a declaration.
Even though the semantic tie-in with the scanner is a bit ugly, I'd go with that approach in this case. A rough outline of a solution is as follows:
First, the grammar. Here I've added a simple declaration syntax, which may or may not have any resemblance to the one in your grammar. See notes below.
expr : term
| expr '-' term
term : factor
| term '*' factor
factor: VARIABLE
| '(' expr ')'
decl : { splitVariables(false); } "set" VARIABLE
{ splitVariables(true); } '=' expr ';'
{ addVariable($2); /* ... */ }
(See below for the semantics of splitVariables.)
Now, the lexer. Again, it's important to know what the intended result for a-b-c is; I'll outline two possible strategies. First, the fallback strategy, which can be implemented in flex:
int candidate_len = 0;
[[:alpha:]][[:alnum:]]*/"-"[[:alpha:]] { yymore();
candidate_len = yyleng;
BEGIN(HYPHENATED);
}
[[:alpha:]][[:alnum:]]* { yylval.id = strdup(yytext);
return WORD;
}
<HYPHENATED>"-"[[:alpha:]][[:alnum:]]*/"-"[[:alpha:]] {
yymore();
if (isVariable(yytext))
candidate_len = yyleng;
}
<HYPHENATED>"-"[[:alpha:]][[:alnum:]]* { if (!isVariable(yytext))
yyless(candidate_len);
yylval.id = strdup(yytext);
BEGIN(INITIAL);
return WORD;
}
That uses yymore and yyless to find the longest prefix sequence of hyphenated words which is a valid variable. (If there is no such prefix, it chooses the first word. An alternative would be to select the entire sequence if there is no such prefix.)
A similar alternative, which only allows the complete hyphenated sequence (in the case where that is a valid variable) or individual words. Again, we use yyless and yymore, but this time we don't bother checking intermediate prefixes and we use a second start condition for the case where we know we're not going to combine words:
int candidate_len = 0;
[[:alpha:]][[:alnum:]]*/"-"[[:alpha:]] { yymore();
candidate_len = yyleng;
BEGIN(HYPHENATED);
}
[[:alpha:]][[:alnum:]]* { yylval.id = strdup(yytext);
return WORD;
}
<HYPHENATED>("-"[[:alpha:]][[:alnum:]]*)*[[:alpha:]][[:alnum:]]* {
if (isVariable(yytext)) {
yylval.id = strdup(yytext);
BEGIN(INITIAL);
return WORD;
} else {
yyless(candidate_len);
yylval.id = strdup(yytext);
BEGIN(NO_COMBINE);
return WORD;
}
}
<NO_COMBINE>[[:alpha:]][[:alnum:]]* { yylval.id = strdup(yytext);
return WORD;
}
<NO_COMBINE>"-" { return '-'; }
<NO_COMBINE>.|\n { yyless(0); /* rescan */
BEGIN(INITIAL);
}
Both of the above solutions use isVariable to decide whether or not a hyphenated sequence is a valid variable. As mentioned earlier, there must be a way to turn off the check, for example in the case of a declaration. To accomplish this, we need to implement splitVariables(bool). The implementation is straightforward; it simply needs to set a flag visible to isVariable. If the flag is set to true, then isVariable always returns true without actually checking for the existence of the variable in the symbol table.
All of that assumes that the symbol table and the splitVariables flag are shared between the parser and the scanner. A naïve solution would make both of these variables globals; a cleaner solution would be to use a pure parser and lexer, and pass the symbol table structure (including the flag) from the main program into the parser, and from there (using %lex-param) into the lexer.
Say I want to parse substrings with ParseKit, like the prefix of a word. So for example I want to parse 'preview' and 'review'. So my grammar might be:
#start = prefix 'view';
prefix = 'pre' | 're';
Now without modifying ParseKit I can match 'pre view' and 're view' but not 'preview' or 'review'. From looking at the documentation I guess I need to customize PKTokeinzer's word state because it is looking for whitespace to terminate a 'Word' token. How do I get around that?
Developer of ParseKit here.
I'm not sure I fully understand the question, but I think it sounds somewhat misguided.
If you are looking for a way to match sub-tokens or characters, Regular Expressions might be a better fit for your needs than ParseKit.
A ParseKit grammar matches against tokens produced by the ParseKit tokenizer (PKTokenizer class). Not individual characters.
It's not that it is impossible for PKTokenizer to produce a pre and and a view token from input of preview. But it would require customization of the code that I would call unwise and unnecessarily complicated. I think that is a bad idea.
If you want to use ParseKit (rather than Regex) anyway, you can simply do the sub-parsing in your assembler callbacks (instead of in the grammar).
So in the Grammar:
#start = either;
either = 'preview' | 'review';
And in ObjC:
- (void)parser:(PKParser *)p didMatchEither:(PKAssembly *)a {
PKToken *tok = [a pop];
NSString *str = tok.stringValue;
if ([str hasPrefix:#"pre"]) {
... // handle 'preview'
} else {
... // handle 'review'
}
}
Also remember that ParseKit Grammars support matching tokens via RegEx. So if you want to match all words that end in view:
#start = anyView;
anyView = /\b\w*?view\b/;
Hope that helps.
From Parsekit: how to match individual quote characters?
If you define a parser:
#start = int;
int = /[+-]?[0-9]+/
Unfortunately it isn't going to be parsing any integers prefixed with a "+", unless you include:
#numberState = "+" // at the top.
In the number parse above, the "Symbol" default parser wasn't even mentioned, yet it is still active and overrides user defined parsers.
Okay so with numbers you can still fix it by adding the directive. What if you're trying to create a parser for "++"? I haven't found any directive that can make the following parser work.
#start = plusplus;
plusplus = "++";
The effects of default parsers on the user parser seems so arbitrary. Why can't I parse "++"?
Is it possible to just turn off default Parsers altogether? They seem to get in the way if I'm not doing something common.
Or maybe I've got it all wrong.
EDIT:
I've found a parser that would parse plus plus:
#start = plusplus;
plusplus = plus plus;
plus = "+";
I am guessing the answer is: the literal symbols defined in your parser cannot overlap between default parsers; It must be contained completely by at least once of them.
Developer of ParseKit here.
I have a few responses.
I think you'll find the ParseKit API highly elegant and sensible, the more you learn. Keep in mind that I'm not tooting my own horn by saying that. Although I built ParseKit, I did not design the ParseKit API. Rather, the design of ParseKit is based almost entirely on the designs found in Steven Metsker's Building Parsers In Java. I highly recommend you checkout the book if you want to deeply understand ParseKit. Plus it's a fantastic book about parsing in general.
You're confusing Tokenizer States with Parsers. They are two distinct things, but the details are more complex than I can answer here. Again, I recommend Metsker's book.
In the course of answering your question, I did find a small bug in ParseKit. Thanks! However, it was not affecting your outcome described above as you were not using the correct grammar to get the outcome it seems you were looking for. You'll need to update your source code from The Google Code Project now, or else my advice below will not work for you.
Now to answer your question.
I think you are looking for a grammar which both recognizes ++ as a single multi-char Symbol token and also recognizes numbers with leading + chars as explicitly-positive numbers rather than a + Symbol token followed by a Number token.
The correct grammar I believe you are looking for is something like this:
#symbols = '++'; // declare ++ as a multi-char symbol
#numberState = '+'; // allow explicitly-positive numbers
#start = (Number|Symbol)*;
Input like this:
++ +1 -2 + 3 ++
Will be tokenized like so:
[++, +1, -2, +, 3, ++]++/+1/-2/+/3/++^
Two reminders:
Again, you will need to update your source code now to see this work correctly. I had to fix a bug in this case.
This stuff is tricky, and I recommend reading Metsker's book to fully understand how ParseKit works.
extdefs:
{$<ttype>$ = NULL_TREE; } extdef
| extdefs {$<ttype>$ = NULL_TREE; } extdef
;
Why is it in the middle?
It could be everywhere. Sometimes it's useful to have something done in between the tokens, especially in this kind of or expressions.
In the standard description of the yacc utility it's said that:
Actions can occur anywhere in a rule
(not just at the end); an action can
access values returned by actions to
its left, and in turn the value it
returns can be accessed by actions to
its right. An action appearing in the
middle of a rule shall be equivalent
to replacing the action with a new
non-terminal symbol and adding an
empty rule with that non-terminal
symbol on the left-hand side. The
semantic action associated with the
new rule shall be equivalent to the
original action. The use of actions
within rules might introduce conflicts
that would not otherwise exist.
Hey guys, given a data set in plain text such as the following:
==Events==
* [[312]] – [[Constantine the Great]] is said to have received his famous [[Battle of Milvian Bridge#Vision of Constantine|Vision of the Cross]].
* [[710]] – [[Saracen]] invasion of [[Sardinia]].
* [[939]] – [[Edmund I of England|Edmund I]] succeeds [[Athelstan of England|Athelstan]] as [[King of England]].
*[[1275]] – Traditional founding of the city of [[Amsterdam]].
*[[1524]] – [[Italian Wars]]: The French troops lay siege to [[Pavia]].
*[[1553]] – Condemned as a [[Heresy|heretic]], [[Michael Servetus]] is [[burned at the stake]] just outside [[Geneva]].
*[[1644]] – [[Second Battle of Newbury]] in the [[English Civil War]].
*[[1682]] – [[Philadelphia]], [[Pennsylvania]] is founded.
I would like to end up with an NSDictionary or other form of collection so that I can have the year (The Number on the left) mapping to the excerpt (The text on the right). So this is what the 'template' is like:
*[[YEAR]] – THE_TEXT
Though I would like the excerpt to be plain text, that is, no wiki markup so no [[ sets. Actually, this could prove difficult with alias links such as [[Edmund I of England|Edmund I]].
I am not all that experienced with regular expressions so I have a few questions. Should I first try to 'beautify' the data? For example, removing the first line which will always be ==Events==, and removing the [[ and ]] occurrences?
Or perhaps a better solution: Should I do this in passes? So for example, the first pass I can separate each line into * [[710]] and [[Saracen]] invasion of [[Sardinia]]. and store them into different NSArrays.
Then go through the first NSArray of years and only get the text within the [[]] (I say text and not number because it can be 530 BC), so * [[710]] becomes 710.
And then for the excerpt NSArray, go through and if an [[some_article|alias]] is found, make it only be [[alias]] somehow, and then remove all of the [[ and ]] sets?
Is this possible? Should I use regular expressions? Are there any ideas you can come up with for regular expressions that might help?
Thanks! I really appreciate it.
EDIT: Sorry for the confusion, but I only want to parse the above data. Assume that that's the only type of markup that I will encounter. I'm not necessarily looking forward to parsing wiki markup in general, unless there is already a pre-existing library which does this. Thanks again!
This code assumes you are using RegexKitLite:
NSString *data = #"* [[312]] – [[Constantine the Great]] is said to have received his famous [[Battle of Milvian Bridge#Vision of Constantine|Vision of the Cross]].\n\
* [[710]] – [[Saracen]] invasion of [[Sardinia]].\n\
* [[939]] – [[Edmund I of England|Edmund I]] succeeds [[Athelstan of England|Athelstan]] as [[King of England]].\n\
*[[1275]] – Traditional founding of the city of [[Amsterdam]].";
NSString *captureRegex = #"(?i)(?:\\* *\\[\\[)([0-9]*)(?:\\]\\] \\– )(.*)";
NSRange captureRange;
NSRange stringRange;
stringRange.location = 0;
stringRange.length = data.length;
do
{
captureRange = [data rangeOfRegex:captureRegex inRange:stringRange];
if ( captureRange.location != NSNotFound )
{
NSString *year = [data stringByMatching:captureRegex options:RKLNoOptions inRange:stringRange capture:1 error:NULL];
NSString *textStuff = [data stringByMatching:captureRegex options:RKLNoOptions inRange:stringRange capture:2 error:NULL];
stringRange.location = captureRange.location + captureRange.length;
stringRange.length = data.length - stringRange.location;
NSLog(#"Year:%#, Stuff:%#", year, textStuff);
}
}
while ( captureRange.location != NSNotFound );
Note that you really need to study up on RegEx's to build these well, but here's what the one I have is saying:
(?i)
Ignore case, I could have left that out since I'm not matching letters.
(?:\* *\[\[)
?: means don't capture this block, I escape * to match it, then there are zero or more spaces (" *") then I escape out two brackets (since brackets are also special characters in a regex).
([0-9]*)
Grab anything that is a number.
(?:\]\] \– )
Here's where we ignore stuff again, basically matching " – ". Note any "\" in the regex, I have to add another one to in the Objective-C string above since "\" is a special character in a string... and yes that means matching a regex escaped single "\" ends up as "\\" in an Obj-C string.
(.*)
Just grab anything else, by default the RegEX engine will stop matching at the end of a line which is why it doesn't just match everything else. You'll have to add code to strip out the [[LINK]] stuff from the text.
The NSRange variables are used to keep matching through the file without re-matching original matches. So to speak.
Don't forget after you add the RegExKitLite class files, you also need to add the special linker flag or you'll get lots of link errors (the RegexKitLite site has installation instructions).
I'm no good with regular expressions, but this sounds like a job for them. I imagine a regex would sort this out for you quite easily.
Have a look at the RegexKitLite library.
If you want to be able to parse Wikitext in general, you have a lot of work to do. Just one complicating factor is templates. How much effort do you want to go to cope with these?
If you're serious about this, you probably should be looking for an existing library which parses Wikitext. A brief look round finds this CPAN library, but I have not used it, so I can't cite it as a personal recommendation.
Alternatively, you might want to take a simpler approach and decide which particular parts of Wikitext you're going to cope with. This might be, for example, links and headings, but not lists. Then you have to focus on each of these and turn the Wikitext into whatever you want that to look like. Yes, regular expressions will help a lot with this bit, so read up on them, and if you have specific problems, come back and ask.
Good luck!