How does lex match tokens - tokenize

I am learning lex. I made a simple lex file containing one rule:
%%
"Hello"  puts("response\n");
%%
After running lex file.l, I'd like to inspect the generated file lex.yy.c. I presume that the lexer stores the tokens somehow and matches them against the input (probably with a switch statement). Looking at the file, I can see the action code (puts("response\n");), but I cannot find the tokens themselves. I can see many tables (matrices?) in the generated file, but I cannot figure out how they are translated into the tokens.
Any help explaining how tokens are matched by the lexer is much appreciated!

lex builds a state machine (a DFA) that consumes one character at a time until it reaches a state from which no longer token can be matched, and then runs the action corresponding to the longest token it found.
In your example, it will build a very simple DFA with about seven states: an initial state with a transition on 'H' to a second state, which transitions on 'e' to a third state, and so on. If it reaches the sixth state (after matching the 'o'), it will print the message; from any state, any other character leads to the default "echo a single character" action.
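To make the tables you saw concrete, here is a minimal hand-written C sketch of a table-driven matcher for just the pattern "Hello" (my own illustration, not the actual generated code; flex's real lex.yy.c uses compressed tables and handles many rules at once):

#include <stdio.h>

/* Transition function: states 0..4 advance through "Hello";
 * -1 is the dead state, meaning no rule can match a longer token. */
static int next_state(int state, char c) {
    static const char pattern[] = "Hello";
    if (state >= 0 && state < 5 && c == pattern[state])
        return state + 1;
    return -1;
}

int main(void) {
    const char *input = "Hello world";
    int state = 0, pos = 0, last_accept = -1;

    /* Run the DFA, remembering the last position where the accepting
     * state (state 5) was reached: that is the longest match. */
    while (input[pos] != '\0') {
        state = next_state(state, input[pos]);
        if (state < 0)
            break;
        pos++;
        if (state == 5)
            last_accept = pos;
    }

    if (last_accept != -1)
        puts("response");      /* the action attached to "Hello" */
    else
        putchar(input[0]);     /* lex's default rule: echo one character */
    return 0;
}

The generated lex.yy.c encodes the same transition function as arrays (the tables you found), which is why the string "Hello" itself never appears in it: the pattern is baked into the state transitions.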

Related

ANTLR LEXER RULE to have two rules, one that will accept every character including symbols and another that will accept only the characters a to z

Is it possible in an ANTLR LEXER RULE to have two rules, one that will accept every character including all symbols (like (, ), _, etc.) and another that will accept only the characters a to z?
Something like below:
String: ('a'..'z'|'A'..'Z')*;
EVERYTHING:(.)*;
Yes, it is possible.
This is how the ANTLR lexer decides which rule to use:
- whichever rule can match the longest sub-sequence of the input (starting from the current position in the input)
- in case more than one rule can match that sub-sequence (i.e. it's a tie), the first rule (as defined in the grammar file) wins
So in your case, for alpha-only input, both rules will match it, but since String is defined further up in the grammar, it will be used. For non-alpha input, the EVERYTHING rule will be able to match a longer sub-sequence and will therefore be used.
Note however that, as written, your EVERYTHING rule matches even spaces and newlines, so in this specific case the String rule will be used only if the whole input is just alpha characters and nothing else; the whole input will be matched as a single token in either case. So in a real grammar, the EVERYTHING rule will probably be slightly different; one possible variant is sketched below.
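For illustration, a variant along those lines might look like this (a sketch only; the + quantifiers and the skipped WS rule are my assumptions in ANTLR 4 syntax, not part of the question):
String : ('a'..'z'|'A'..'Z')+ ;
EVERYTHING : ~(' '|'\t'|'\r'|'\n')+ ;
WS : (' '|'\t'|'\r'|'\n')+ -> skip ;
With this, an input like abc (. yields a String token for abc and an EVERYTHING token for (., with the space in between discarded rather than folded into EVERYTHING.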

ANTLR 4.5 - Mismatched Input 'x' expecting 'x'

I have started using ANTLR and have noticed that it is pretty fickle with its lexer rules. An extremely frustrating example is the following:
grammar output;
test: FILEPATH NEWLINE TITLE ;
FILEPATH: ('A'..'Z'|'a'..'z'|'0'..'9'|':'|'\\'|'/'|' '|'-'|'_'|'.')+ ;
NEWLINE: '\r'? '\n' ;
TITLE: ('A'..'Z'|'a'..'z'|' ')+ ;
This grammar will not match something like:
c:\test.txt
x
Oddly, if I change TITLE to TITLE: 'x' ; it still fails, this time giving an error message saying "mismatched input 'x' expecting 'x'", which is highly confusing. Even more oddly, if I replace the usage of TITLE in test with FILEPATH, the whole thing works (although FILEPATH will match more than I am looking to match, so in general it isn't a valid solution for me).
I am highly confused as to why ANTLR is giving such extremely strange errors and then suddenly working for no apparent reason when shuffling things around.
This seems to be a common misunderstanding of ANTLR:
Language processing in ANTLR:
Language processing is done in two strictly separated phases:
- lexing, i.e. partitioning the text into tokens
- parsing, i.e. building a parse tree from the tokens
Since lexing must precede parsing, there is a consequence: the lexer is independent of the parser; the parser cannot influence lexing.
Lexing
Lexing in ANTLR works as follows:
- all rules with an uppercase first character are lexer rules
- the lexer starts at the beginning of the input and tries to find the rule that matches the current input best
- a best match is a match of maximum length, i.e. the token that results from appending the next input character to the maximum-length match is not matched by any lexer rule
- tokens are generated from matches:
  - if exactly one rule matches the maximum-length match, the corresponding token is pushed into the token stream
  - if multiple rules match the maximum-length match, the first-defined token in the grammar is pushed into the token stream
Example: What is wrong with your grammar
Your grammar has two rules that are critical:
FILEPATH: ('A'..'Z'|'a'..'z'|'0'..'9'|':'|'\\'|'/'|' '|'-'|'_'|'.')+ ;
TITLE: ('A'..'Z'|'a'..'z'|' ')+ ;
Every match that is matched by TITLE will also be matched by FILEPATH, and FILEPATH is defined before TITLE, so each token that you expect to be a TITLE will instead be a FILEPATH.
A few hints for that:
- keep your lexer rules disjoint (no token should match a superset of another)
- if your tokens intentionally match the same strings, then put them into the right order (in your case this will be sufficient; see the sketch after this list)
- if you need a parser-driven lexer, you have to change to another parser generator: PEG parsers or GLR parsers will do that (but of course this can produce other problems)
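For the second hint, reordering the question's grammar might look like this sketch (the rules themselves are unchanged; only TITLE now precedes FILEPATH so that it wins ties):
grammar output;
test: FILEPATH NEWLINE TITLE ;
TITLE: ('A'..'Z'|'a'..'z'|' ')+ ;
FILEPATH: ('A'..'Z'|'a'..'z'|'0'..'9'|':'|'\\'|'/'|' '|'-'|'_'|'.')+ ;
NEWLINE: '\r'? '\n' ;
Now c:\test.txt still lexes as FILEPATH (the longer match there), while the lone x on the next line is a tie between TITLE and FILEPATH that TITLE wins because it is defined first.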
This was not directly OP's problem, but for those who have the same error message, here is something you could check.
I had the same vague "mismatched input 'x' expecting 'x'" error message when I introduced a new keyword. The reason was that I had placed the new keyword after my VARNAME lexer rule, which tokenized it as a variable name instead of as the new keyword. I fixed it by putting the keywords before the VARNAME rule.
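As a hypothetical illustration of that fix (the keyword and rule names here are invented):
LOOP : 'loop' ;               // keywords defined first...
VARNAME : ('a'..'z')+ ;       // ...so the input loop is a tie in length and the keyword rule wins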

ANTLR: zero to multiple occurrences of the same parser rule

I'm trying to parse javadoc style comments. How can I indicate that the same parser rule could potentially be triggered zero or more times?
doc_comment : '/**' (param_declaration)* '*/' ;
param_declaration : OUTERWS '#param' OUTERWS ID OUTERWS;
ID : ('a'..'z')+ ;
OUTERWS : ('\n' | '\r' | ' ' |'\t')*;
Enclosing the param_declaration rule in ()* doesn't seem to work since it's not a token.
I would expect that:
/**
#param one
#param two
*/
would work. But instead I get: extraneous input '#param' expecting {'/', which doesn't make sense to me if (param_declaration) matches zero or more instances. It seems like adding ()* to param_declaration does nothing. Either way:
/**
#param one
*/
Works fine; with or without ()*.
The answer to your question is: to match rule foo zero or more times, use (foo)* or simply foo*.
If this is not producing a usable result, then the problem lies somewhere in how you have structured your lexer and/or parser, and to solve it you would need to ask a more specific question and include your grammar along with specific inputs and outputs that are not what you hoped, plus a description of the desired output.
Edit: Your error with two parameters occurs because the param_declaration rule begins and ends with a required OUTERWS token. This means two OUTERWS tokens would have to appear in a row for two parameters to be parsed. That is impossible, because any run of whitespace characters in the input matches one long OUTERWS token, and that longer token will always be used instead of two shorter tokens.
Also note that your OUTERWS rule is written in such a way that it can match zero characters. If your input contained a digit, say 0, then the longest token before the 0 would be a zero-length OUTERWS token. Since matching zero characters does not advance the input, an input containing a digit would produce an infinite stream of empty OUTERWS tokens. The related warning you see when generating code for this grammar is not to be ignored.
Edit 2: Your input can match zero parameters if the comment appears in the form /***/. However, if your comment appears in the form /** */, there will be an OUTERWS token between /** and */, which your parser rules do not allow when there is no param_declaration.
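Putting those fixes together, one possible revision (my sketch in ANTLR 4 syntax, not code from the answer) makes the whitespace rule non-empty and skips it, so param_declaration no longer has to mention whitespace at all:
doc_comment : '/**' param_declaration* '*/' ;
param_declaration : '#param' ID ;
ID : ('a'..'z')+ ;
OUTERWS : ('\n' | '\r' | ' ' | '\t')+ -> skip ;
With OUTERWS skipped, both /** */ and multiple #param lines parse, because the parser never sees the whitespace tokens.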

Do independent rules influence one another?

When I was debugging my grammar for C# I noticed something very unusual: some inputs that are not accepted by the full grammar are accepted by the same grammar with some independent rules deleted. I could not find a logical explanation. For example:
CS - this grammar does not accept the input a<a<a>><EOF>
CS' - and this grammar, which is basically the same as CS but with some independent rules deleted (the remaining rules are not reordered), does accept a<a<a>><EOF>
As you can see, both grammars start with the rule start: namespaceOrTypeName EOF; and therefore they should invoke the same set of rules (CS will never invoke the rules that are deleted in CS'). I spent a day debugging this, deleting or adding rules, but couldn't find a flaw in the logic. Any help would be of use, thank you.
EDIT:
After changing the start rule in CS to start: Identifier EOF; the grammar starts rejecting the input method, which is normally accepted when only the Identifier rules are defined. So I guess, since there is a rule attributeTarget: ...| 'method' | ..., that after compiling the grammar some phrases such as 'method' get reserved, but I'm still not sure whether that's the case.
The first grammar includes the overloadableBinaryOperator rule, which implicitly defines the >> token. Since >> is a 2-character token, the lexer will never treat the input >> as two separate 1-character tokens >, >. If you open the grammar in ANTLRWorks 2, you'll see a warning indicator for each implicitly-defined token. You should remove all of these warnings by:
- creating explicit lexer rules for every token you intend to appear in the input (see the sketch after this list), and
- only using the syntax 'new' in a parser rule if a corresponding lexer rule exists for the literal 'new'.
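As a sketch of the first point (all rule names here are invented for illustration), make > an explicit token and compose >> from two of them in the parser; then nested type arguments such as a<a<a>> still lex as ID LT ID LT ID GT GT:
grammar CSSketch;
start : typeName EOF ;
typeName : ID (LT typeName GT)? ;   // a<a<a>> nests cleanly
shiftOp : GT GT ;                   // '>>' built from two GT tokens in the parser
LT : '<' ;
GT : '>' ;
ID : ('a'..'z')+ ;
Because no lexer rule ever matches the two-character string >>, the lexer cannot glue the two closing angle brackets together.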

How to discard the rest of a line after a syntax error

I'm implementing a small shell, and I'm using lex & yacc to parse commands. Lex reads a command from stdin and yacc executes it after yyparse.
The problem is that when there is a syntax error, yacc reports the error and starts parsing from the beginning again. As a result, cmd1 >>> cmd2 leads to running cmd2, because >>> is a syntax error.
My question is: how do I discard the rest of the current command after encountering a syntax error?
If you want to write an interactive language with a prompt that lets users enter expressions, it's a bad idea to simply run yacc on the entire input stream. Yacc might get confused about something on one line and then misinterpret subsequent lines. For instance, the user might have an unbalanced parenthesis on the first line, or a string literal which is not closed, and then yacc will just keep consuming subsequent lines of the input, looking to close the construct.
It's better to gather the line of input from the user and then parse that as one unit. The end of the line is then simply the end of the input as far as yacc is concerned.
If you're using lex, there are ways to redirect lex to read characters from a buffer in memory instead of from a FILE * stream. Look for documentation on the YY_INPUT macro, which you can define in a lex file to specify the code that lex uses for obtaining input characters.
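For example, with flex (a sketch: scan_ptr is an invented variable, while YY_INPUT and YY_NULL are real flex names), the definitions section of the .l file can redirect input to an in-memory line:
%{
static const char *scan_ptr;  /* points at the line to scan; set this before yyparse() */

/* Hand the scanner one character at a time from scan_ptr,
 * reporting end-of-input once the line is exhausted. */
#define YY_INPUT(buf, result, max_size)            \
    do {                                           \
        if (scan_ptr == NULL || *scan_ptr == '\0') \
            (result) = YY_NULL;                    \
        else {                                     \
            (buf)[0] = *scan_ptr++;                \
            (result) = 1;                          \
        }                                          \
    } while (0)
%}
The driver loop then reads one line (with fgets, say), sets scan_ptr to it, and calls yyparse; a syntax error can at worst ruin that one line, never the following ones. Flex also provides yy_scan_string for the same purpose.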
Analogy time: Using a scanner developed with lex/yacc for directly handling interactive user input is a little bit like using scanf for handling user input. Whereas capturing a line into a buffer and then parsing it is more like using sscanf. Quote:
It's perfectly appropriate to parse strings with sscanf (as long as the return value is checked), because it's so easy to regain control, restart the scan, discard the input if it didn't match, etc. [comp.lang.c FAQ, 12.20].