How to use Start States in ML-Lex?

I am creating a tokeniser in ML-Lex, part of the definition of which is:
datatype lexresult = STRING
| STRINGOP
| EOF
val error = fn x => TextIO.output(TextIO.stdOut,x ^ "\n")
val eof = fn () => EOF
%%
%structure myLang
digit=[0-9];
ws=[\ \t\n];
str=\"[.*]+\";
strop=\[[0-9...?\^]\];
%s alpha;
alpha=[a-zA-Z];
%%
<alpha> {alphanum}+ => (ID);
. => (error ("myLang: ignoring bad character " ^ yytext); lex());
I want the type ID to be detected only when it starts with, or is found after, "alpha". I know that writing it as
{alpha}+ {alphanum}* => (ID);
is an option, but I also need to learn how to use start states for some other purposes. Can someone please help me with this?

The information you need is in the documentation that comes with SML, which is available in various places. Many university courses have online notes which contain working examples.
The first thing to note from your example code is that you have overloaded the name alpha, using it to name both a state and a pattern. This is probably not a good idea. The pattern alphanum is not defined, and the result ID is not declared. These are basic errors which you should probably fix before thinking about using states - or posting a question here on SO. Asking for help with code that has such obvious faults in it does not encourage help from the experts. :-)
Having fixed up those errors, we can start using states. Here is my version of your code:
datatype lexresult = ID
| EOF
val error = fn x => TextIO.output(TextIO.stdOut,x ^ "\n")
val eof = fn () => EOF
%%
%structure myLang
digit=[0-9];
ws=[\ \t\n];
str=\"[.*]+\";
strop=\[[0-9...?\^]\];
%s ALPHA_STATE;
alpha=[a-zA-Z];
alphanum=[a-zA-Z0-9];
%%
<INITIAL>{alpha} => (YYBEGIN ALPHA_STATE; continue());
<ALPHA_STATE>{alphanum}+ => (YYBEGIN INITIAL; TextIO.output(TextIO.stdOut,"ID\n"); ID);
. => (error ("myLang: ignoring bad character " ^ yytext); lex());
You can see I've added ID to the lexresult, named the state ALPHA_STATE and added the alphanum pattern. Now let's look at how the state code works:
There are two states in this program, called INITIAL and ALPHA_STATE (all lex programs have a default INITIAL state). Recognition always begins in the INITIAL state. The rule <INITIAL>{alpha} => indicates that if you encounter a letter while in the INITIAL state (i.e. NOT in ALPHA_STATE), then it is a match and the action should be invoked. The action for this rule works as follows:
YYBEGIN ALPHA_STATE; (* Switch from INITIAL state to ALPHA_STATE *)
continue() (* and keep going *)
Now that we are in ALPHA_STATE, the rules defined for this state are enabled, including the rule <ALPHA_STATE>{alphanum}+ =>. The action on this rule switches back to the INITIAL state and records the match.
For a longer example of using states (lex rather than ML-lex) you can see my answer here: Error while parsing comments in lex.
To test this ML-LEX program I referenced this helpful question: building a lexical analyser using ml-lex, and generated the following SML program:
use "states.lex.sml";
open myLang
val lexer =
let
fun input f =
case TextIO.inputLine f of
SOME s => s
| NONE => raise Fail "Implement proper error handling."
in
myLang.makeLexer (fn (n:int) => input TextIO.stdIn)
end
val nextToken = lexer();
and just for completeness, it generated the following output demonstrating the match:
c:\Users\Brian>"%SMLNJ_HOME%\bin\sml" main.sml
Standard ML of New Jersey v110.78 [built: Sun Dec 21 15:52:08 2014]
[opening main.sml]
[opening states.lex.sml]
[autoloading]
[library $SMLNJ-BASIS/basis.cm is stable]
[autoloading done]
structure myLang :
sig
structure UserDeclarations : <sig>
exception LexError
structure Internal : <sig>
val makeLexer : (int -> string) -> unit -> Internal.result
end
val it = () : unit
hello
ID

Related

Antlr: how to switch on token type in Visitor implementation

I'm playing around with Antlr, designing a toy language, which I think is where most people start! I have a question on how best to think about switching on token type.
Consider a 'function call' in the language, where a function can consume a string, number or variable - for example, like the below (project() is the function call):
project("ABC") vs project(123) vs project($SOME_VARIABLE)
I have the alternation operator in my grammar, so the grammar parses the right thing, but in the visitor code it would be nice to tell the difference between the three versions of the above.
@Override
public ASTRoot visitCreateproj(projectmgmtParser.CreateprojContext ctx) {
    String s1 = null, s2 = null;  // presumably fields in the original; declared locally here
    try {
        s1 = ctx.STRING_LITERAL().getText();
    } catch (Exception e) {}
    try {
        s2 = ctx.NUM().getText();
    } catch (Exception e) {}
    System.out.println("Created Project via => " + ctx.getChild(1).toString());
    return null;  // AST construction omitted in the question
}
The code above worked: depending on whether s1 or s2 is null, I can infer how I was called (with a literal or a number; I haven't shown the variable case above). But I'm interested in whether there is a better or more elegant way - for example, switching on token type inside the visitor code to actually process the language.
The grammar I had for the above was
createproj: 'project('WS?(STRING_LITERAL|NUM)')';
and when I use the IntelliJ ANTLR plugin, it seems to know the token type of the argument to the project() function - but I don't seem to be able to get at it from my code.
You could do something like this:
createproj
: 'project' '(' WS? param ')'
;
param
: STRING_LITERAL
| NUM
;
and in your visitor code:
@Override
public ASTRoot visitCreateproj(projectmgmtParser.CreateprojContext ctx) {
    switch (ctx.param().start.getType()) {
        case YourLexerName.STRING_LITERAL:
            ...
        case YourLexerName.NUM:
            ...
        ...
    }
}
So by inlining the token in the grammar I had originally, I've lost the opportunity to inspect it in the visitor code?
Not really: you could also do it like this:
createproj
: 'project' '(' WS? param_token=(STRING_LITERAL | NUM) ')'
;
and could then do this:
@Override
public ASTRoot visitCreateproj(projectmgmtParser.CreateprojContext ctx) {
    switch (ctx.param_token.getType()) {
        case YourLexerName.STRING_LITERAL:
            ...
        case YourLexerName.NUM:
            ...
        ...
    }
}
Just make sure you don't mix lexer rules (tokens) and parser rules in your set param_token=( ... ). When it's a parser rule, ctx.param_token.getType() will fail (it must then be ctx.param_token.start.getType()). That is why I recommended adding an extra parser rule, because this would then still work:
param
: STRING_LITERAL
| NUM
| some_parser_rule
;

Why doesn't Perl 6's colon pair and name interpolation work together?

I was playing around with Interpolating into names. I was mostly interested in this colon syntax feature to turn a variable into a pair where the identifier is the key.
my %Hamadryas = map { slip $_, 0 }, <
    februa
    honorina
    velutina
    >;

{
    my $pair = :%Hamadryas;
    say $pair; # Hamadryas => { ... }
}
put '-' x 50;
But, just for giggles, I wanted to try it with variable name interpolation too. I know this is stupid because if I know the name I don't need the colon syntax to get it. But, I also thought that it should work by accident:
{
    my $name = 'Hamadryas';
    # Since I already have the name, I could just:
    #   my $pair = $name => %::($name)
    # But, couldn't I just line up the syntax?
    my $pair = :%::($name); # does not work
    say $pair;
}
Why doesn't that :%::($name) syntax work? That's more a question of when the parser decides that it's not parsing something it wants to understand. I figured it would see the : and start processing a colon pair, then see the % and know it had a hash, even though there's the :: after the %.
Is there a way to make it work with tricks and grammar mutations?

Yacc/bison: what's wrong with my syntax equations?

I'm writing a "compiler" of sorts: it reads a description of a game (with rooms, characters, things, etc.) Think of it as a visual version of an Adventure-style game, but with much simpler problems.
When I run my "compiler" I'm getting a syntax error on my input, and I can't figure out why. Here's the relevant section of my yacc input:
character
    : char-head general-text character-insides { PopChoices(); }
    ;

character-insides
    : LEFTBRACKET options RIGHTBRACKET
    ;

char-head
    : char-namesWT opt-imgsWT char-desc opt-cond
    ;

char-desc
    : general-text { SetText($1); }
    ;

char-namesWT
    : DOTC ID WORD { AddCharacter($3, $2); expect(EXP_TEXT); }
    ;

opt-cond
    : %empty
    | condition
    ;

condition
    : condition-reason condition-main general-text
      { AddCondition($1, $2, $3); }
    ;

condition-reason
    : DOTU { $$ = 'u'; }
    | DOTV { $$ = 'v'; }
    ;

condition-main
    : money-conditionWT
    | have-conditionWT
    | moves-conditionWT
    | flag-conditionWT
    ;

have-conditionWT
    : PERCENT_SLASH opt-bang ID
      { $$ = MkCondID($1, $2, $3); expect(EXP_TEXT); }
    ;

opt-bang
    : %empty { $$ = TRUE; }
    | BANG { $$ = FALSE; }
    ;

ID  : WORD
    ;
Things in all caps are terminal symbols, things in lower or mixed case are non-terminals. If a non-terminal ends in WT, then it "wants text". That is, it expects that what comes after it may be arbitrary text.
Background: I have written my own token recognizer in C++ because(*) I want the syntax to be able to change the lexer's behavior. Two types of tokens should be matched only when the syntax expects them: FILENAME (with slashes and other non-alphanumeric characters) and TEXT, which means "all the text from here to the end of the line" (but not starting with certain keywords).
The function "expect" tells the lexer when to look for these two symbols. The expectation is reset to EXP_NORMAL after each token is returned.
I have added code to yylex that prints out the tokens as it recognizes them, and it looks to me like the tokenizer is working properly -- returning the tokens I expect.
(*) Also because I want to be able to ask the tokenizer for the column where the error occurred, and get the contents of the line being scanned at the time so I can print out a more useful error message.
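To make the mechanism concrete, here is a minimal sketch of how such an expectation flag might look; the scan_* helpers are hypothetical placeholders standing in for the real tokenizing code, not functions from the actual program:

    // Hypothetical sketch of the expectation mechanism described above.
    enum Expectation { EXP_NORMAL, EXP_TEXT, EXP_FILENAME };
    static Expectation expectation = EXP_NORMAL;

    int scan_rest_of_line();  // hypothetical: consumes to end of line, returns TEXT
    int scan_filename();      // hypothetical: returns FILENAME
    int scan_normal_token();  // hypothetical: ordinary tokenizing

    // Called from parser actions, e.g. expect(EXP_TEXT);
    void expect(Expectation e) { expectation = e; }

    int yylex() {
        Expectation current = expectation;
        expectation = EXP_NORMAL;   // reset after each token, as described
        switch (current) {
            case EXP_TEXT:     return scan_rest_of_line();
            case EXP_FILENAME: return scan_filename();
            default:           return scan_normal_token();
        }
    }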
Here is the relevant part of the input:
.c Wendy wendy
OK, now you caught me, what do you want to do with me?
.u %/lasso You won't catch me like that.
[
Here is the last part of the debugging output from yylex:
token: 262: DOTC/
token: 289: WORD/Wendy
token: 289: WORD/wendy
token: 292: TEXT/OK, now you caught me, what do you want to do with me?
token: 286: DOTU/
token: 274: PERCENT_SLASH/%/
token: 289: WORD/lasso
token: 292: TEXT/You won't catch me like that.
token: 269: LEFTBRACKET/
Here's my error message:
: line 124, columns 3-4: syntax error, unexpected LEFTBRACKET, expecting TEXT
[
To help you understand the equations above, here is the relevant part of the description of the input syntax that I wrote the yacc code from.
// Character:
// .c id charactername,[imagename,[animationname]]
// description-text
// .u condition on the character being usable [optional]
// .v condition on the character being visible [optional]
// [
// (options)
// ]
// Conditions:
// %$[-]n Must [not] have at least n dollars
// %/[-]name Must [not] have named thing
// %t-nnn At/before specified number of moves
// %t+nnn At/after specified number of moves
// %#[-]name named flag must [not] be set
// Condition-char: $, /, t, or #, as described above
//
// Condition:
// % condition-char (identifier/int) ['/' text-if-fail ]
// description-text: Can be either on-line text or multi-line text
// On-line text is the rest of the line
Brackets mark optional non-terminals, but a bracket standing alone (represented by LEFTBRACKET and RIGHTBRACKET in the yacc) is an actual token, e.g.
// [
// (options)
// ]
above.
What am I doing wrong?
To debug parsing problems in your grammar, you need to understand the shift/reduce machine that yacc/bison produces (described in the .output file produced with the -v option), and you need to look at the trail of states that the parser goes through to reach the problem you see.
To enable debugging code in the parser (which can print the states and the shift and reduce actions as they occur), you need to compile with -DYYDEBUG or put #define YYDEBUG 1 in the top of your grammar file. The debugging code is controlled by the global variable yydebug -- set to non-zero to turn on the trace and zero to turn it off. I often use the following in main:
#ifdef YYDEBUG
    extern int yydebug;
    if (char *p = getenv("YYDEBUG"))
        yydebug = atoi(p);
#endif
Then you can include -DYYDEBUG in your compiler flags for debug builds and turn on the debugging code by something like setenv YYDEBUG 1 to set the envvar prior to running your program.
I suppose your syntax error message was generated by bison. What is striking is that it claims to have found a LEFTBRACKET when it expects a [. Naively, you might expect it to be satisfied with the LEFTBRACKET it found, but of course bison knows nothing about LEFTBRACKET except its numeric value, which will be some integer larger than 256.
The only reason bison might expect [ is if your grammar includes the terminal '['. But since your scanner seems to return LEFTBRACKET when it sees a [, the parser will never see '['.
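To illustrate (with a made-up rule name, not one from the question's grammar): if any rule is written with the literal character token,

    some-rule
        : '[' options ']'
        ;

then bison will report that it expects '[', a token the scanner never returns. Rewriting the rule to use the named tokens the scanner actually produces makes the grammar and scanner agree:

    some-rule
        : LEFTBRACKET options RIGHTBRACKET
        ;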

Capturing content which can start with Parser keywords in Xtext

The following is a simplified version of my actual grammar:
grammar org.hello.World
import "http://www.eclipse.org/emf/2002/Ecore" as ecore
generate world "http://www.hello.org/World"
Model:
content=AnyContent greetings+=Greeting*;
AnyContent:
(ID | ANY_OTHER)*
;
Greeting:
'<hello>' name=ID '</hello>';
terminal ID:
('a'..'z'|'A'..'Z')+
;
terminal ANY_OTHER:
.
;
So, using the above grammar, if my input is:
<hi><hello>world</hello>
Then I get a syntax error saying mismatched character 'i' expecting 'e' at column 2.
My requirement is that AnyContent should match "<hi>". Can anyone guide me on how to achieve that?
If you want to do it with Xtext, I advise you to split your problem. Your first problem is syntactic: you need to parse your file. The second problem is semantic: you want to give meaning to your objects and decide which one is the container. Defining the container and the containment for XML can't be done inside your grammar.
Make a custom Ecore model and a simple grammar with start and end tags. You don't really care about the names of your tags.
Example:
Model returns XmlFile: (StartTag|EndTag|Text)+;
Text returns Text: text=STRING;
StartTag returns StartTag: '<' name=ID '>';
EndTag returns EndTag: '</' name=ID '>';
Change the TokenSource. The token source gives the tokens to your parser. You can override the nature of your tokens, and merge or split them.
The idea here is to merge all the tokens that appear between '>' and '</'.
Such a run of tokens represents a Text, so you can create a single token for everything contained between these elements. Example:
class CustomTokenSource extends XtextTokenStream {
    new(TokenSource tokenSource, ITokenDefProvider tokenDefProvider) {
        super(tokenSource, tokenDefProvider)
    }

    override LT(int k) {
        var Token token = super.LT(k)
        if (token != null && token.text != null) token.tokenOverride(k)
        token
    }
}
In this example you need to add your custom code in the method "tokenOverride".
Add your custom token source to your parser:
class XDSLParser extends DSLParser {
    override protected XtextTokenStream createTokenStream(TokenSource tokenSource) {
        return new CustomTokenSource(tokenSource, getTokenDefProvider())
    }
}
Compute the containment: the containment of your elements can be computed after parsing. Then you can get your model and change it as you wish. To do this, you need to override the method "doParse" of your parser "XDSLParser" as follows:
override protected IParseResult doParse(String ruleName, CharStream in, NodeModelBuilder nodeModelBuilder, int initialLookAhead) {
    var IParseResult result = super.doParse(ruleName, in, nodeModelBuilder, initialLookAhead)
    // gives you the model
    result.rootASTElement
    return result
}
Note: the model that you obtain after parsing will be flat. The XmlFile object will contain all the elements in the right order. You need to write an algorithm to build the containment in your AST model.
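A minimal sketch of such an algorithm, using a stack; note that these classes are illustrative stand-ins, not the EMF classes Xtext generates:

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.List;

    // Stand-ins for the flat model elements (illustrative only).
    interface Node {}
    record StartTag(String name) implements Node {}
    record EndTag(String name) implements Node {}
    record Text(String text) implements Node {}

    class Element {
        final String name;
        final List<Object> children = new ArrayList<>(); // nested Elements and text Strings
        Element(String name) { this.name = name; }
    }

    class ContainmentBuilder {
        // Rebuild the tree from the flat sequence the parser produced:
        // push on a start tag, pop on an end tag, attach text to the top.
        static Element build(List<Node> flat) {
            Element root = new Element("root");
            Deque<Element> stack = new ArrayDeque<>();
            stack.push(root);
            for (Node n : flat) {
                if (n instanceof StartTag s) {
                    Element child = new Element(s.name());
                    stack.peek().children.add(child);
                    stack.push(child);           // descend into the new element
                } else if (n instanceof EndTag) {
                    stack.pop();                 // close the current element
                } else if (n instanceof Text t) {
                    stack.peek().children.add(t.text());
                }
            }
            return root;
        }
    }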
This will require a lot of tweaking in the grammar due to the nature of the antlr lexer that is used by Xtext. The lexer will not roll back for the keyword <hello>: as soon as it sees a < followed by an h, it'll try to consume the hello token. Something along these lines could work though:
Model:
content=AnyContent greetings+=Greeting*;
AnyContent:
(ID | ANY_OTHER | '<' (ID | ANY_OTHER | '/' | '>') | '/' | '>' | 'hello')*
;
Greeting:
'<' 'hello' '>' name=ID '<' '/' 'hello' '>';
terminal ID:
('a'..'z'|'A'..'Z')+
;
terminal ANY_OTHER:
.
;
The approach won't scale for real-world grammars, but maybe it helps to get onto some working track.

Signature mismatch with types in modules/functors

Excuse my potential misuse of terminology, I'm still not very comfortable with OCaml.
We have a functor with the following (abridged) signature:
module type FUNCTORA = sig
  type input
  type output
  type key
  type inter
  val my_function : input list -> (key * output) list Deferred.t
end
Next, we implement it as follows. MyAPP has the same types as above.
module MyFunctor (App : MyAPP) : FUNCTORA = struct
  type input = App.input
  type output = App.output
  type key = App.key
  type inter App.value
  let my_function lst = ...
end
When trying to compile the implementation, we get this error:
Error: Signature mismatch:
...
Values do not match:
val my_function :
App.input list ->
(App.key * App.output) list Async_kernel.Deferred.t
is not included in
val my_function :
input list -> (key * output) list Async.Std.Deferred.t
It doesn't consider input to include App.input etc, even though we set them to be the same type. How can we get this to type check?
If I make the following changes:
MyAPP => FUNCTORA (* Since you say they are the same *)
type inter App.value => type inter = App.inter (* Syntax/name error *)
Deferred.t => option (* To limit dependence on other modules *)
Then your code compiles for me.
Possibly the problem is with Deferred.t. There are two distinct-looking types in the error message.
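For reference, here is a minimal sketch of the code after those changes. Deferred.t is replaced by option to avoid the Async dependency, and the body of my_function is a stub since the original was elided:

    (* Sketch of the code after the changes above. *)
    module type FUNCTORA = sig
      type input
      type output
      type key
      type inter
      val my_function : input list -> (key * output) list option
    end

    module MyFunctor (App : FUNCTORA) : FUNCTORA = struct
      type input = App.input
      type output = App.output
      type key = App.key
      type inter = App.inter
      let my_function _lst = None   (* stub body *)
    end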