Please have a look at my grammar: https://bitbucket.org/rstoll/tsphp-parser/raw/cdb41531e86ec66416403eb9c29edaf60053e5df/src/main/antlr/TSPHP.g
Somehow ANTLR produces an infinite loop, matching EOF tokens endlessly, for the following input:
class a{public function void a(}
Although only prog expects EOF, classBody somehow accepts it as well. Does anyone have an idea how I can fix that, i.e. what I have to change so that classBody does not accept EOF tokens?
Code from the generated class:
// D:\\TSPHP-parser\\src\\main\\antlr\\TSPHP.g:287:129: ( classBody )*
loop17:
do {
    int alt17=2;
    int LA17_0 = input.LA(1);
    if ( (LA17_0==EOF||LA17_0==Abstract||LA17_0==Const||LA17_0==Final||LA17_0==Function||LA17_0==Private||(LA17_0 >= Protected && LA17_0 <= Public)||LA17_0==Static) ) {
        alt17=1;
    }
    switch (alt17) {
        case 1 :
            // D:\\TSPHP-parser\\src\\main\\antlr\\TSPHP.g:287:129: classBody
            {
                pushFollow(FOLLOW_classBody_in_classDeclaration1603);
                classBody38=classBody();
                state._fsp--;
                if (state.failed) return retval;
                if ( state.backtracking==0 ) stream_classBody.add(classBody38.getTree());
            }
            break;
        default :
            break loop17;
    }
} while (true);
The problem occurs when the token is EOF: the loop is never exited, since EOF is considered a valid token, even though I have not specified it that way.
EDIT I do not get the error if I comment lines 342 and 347 out (the empty alternatives in the rules accessModifierWithoutPrivateOrPublic and accessModifierOrPublic, respectively).
EDIT 2 I was able to solve my problem. I rewrote the methodModifier rule (integrating all the possible modifiers into one rule). This way ANTLR no longer believes that EOF is a valid token after /* empty */ in
accessModifierOrPublic
    : accessModifier
    | /* empty */ -> Public["public"]
    ;
This type of bug can occur in error handling for ANTLR 3. In ANTLR 4, the IntStream.consume() method was updated to require the following exception be thrown to preempt this problem.
Throws:
IllegalStateException - if an attempt is made to consume the end of the stream (i.e. if LA(1)==EOF before calling consume).
For ANTLR 3 grammars, you can at least prevent an infinite loop by using your own TokenStream implementation (probably easiest to extend CommonTokenStream) and throwing this exception if the condition listed above is violated. Note that you might need to allow this condition to be violated once (reasons are complicated), so keep a count and throw the IllegalStateException if the code tries to consume EOF more than 2 or 3 times. Remember this is just an effort to break the infinite loop so you can be a little "fuzzy" on the actual check.
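For illustration, here is a minimal sketch of such a guarded stream for the ANTLR 3 Java runtime (the class name GuardedTokenStream and the threshold of 3 are arbitrary choices for this example, not part of the ANTLR API):
import org.antlr.runtime.CommonTokenStream;
import org.antlr.runtime.Token;
import org.antlr.runtime.TokenSource;

// Token stream that refuses to consume EOF repeatedly, turning a
// potential infinite loop into an IllegalStateException.
public class GuardedTokenStream extends CommonTokenStream {

    private int eofConsumeCount = 0;

    public GuardedTokenStream(TokenSource tokenSource) {
        super(tokenSource);
    }

    @Override
    public void consume() {
        if (LA(1) == Token.EOF) {
            // Tolerate a couple of EOF consumptions (error recovery may
            // legitimately do this once), then bail out.
            if (++eofConsumeCount > 3) {
                throw new IllegalStateException(
                        "attempt to consume EOF repeatedly; breaking infinite loop");
            }
        }
        super.consume();
    }
}
You would then create the parser with new GuardedTokenStream(lexer) instead of new CommonTokenStream(lexer).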
Related
I am writing a simple calculator in which an expression can be reduced to a statement, or list of statements. If a bad expression triggers a syntax error, I try to catch it with a production rule and then ignore it by giving the rule no actions. However, I believe this still reduces to a stmt_list despite it not being a valid statement.
Is there a way to make it simply ignore tokens matched with an error and prevent them from being reduced and used later?
The code below matches error ';' and reduces it to a stmt_list. It will then try to reduce that stmt_list with a valid expression, but since the first production was never called this will trigger a memory exception. My objective is to have Bison literally do nothing if an error is matched, such that a later valid expression can be the first reduction to stmt_list.
stmt_list:
    expr ';' {
        // Allocate memory for statement array
        stmts = (float*)malloc(24 * sizeof(float));
        // Assign pointer for array
        $$ = stmts;
        // At pointer, assign statement value
        *$$ = $1;
        // Increment pointer (for next job)
        $$++;
    }
  | stmt_list expr ';' {
        $$ = $1;
        *$$ = $2;
        $$++;
    }
  | error ';'           { } // Do nothing (ignore bad stmt)
  | stmt_list error ';' { } // Do nothing (ignore bad stmt)
  ;
If you supply no action for a rule, bison/yacc provides the default action $$ = $1.
In fact, you are not providing no action. You are providing an explicit action which does nothing. As it happens, if you use the C template, the parser will still perform the default action. In other templates, an action which does not assign a value to $$ might provoke a warning during parser generation. But it certainly won't modify your data structures so as to nullify the action. It can't know what that means. If you know, you should write it as the action :-) .
It's not 100% clear to me why you are keeping the results of the evaluations in a fixed-size dynamically-allocated array. You make no attempt to detect when the array fills up, so it's entirely possible that you'll end up overflowing the allocation and overwriting random memory. Moreover, using a global like this isn't usually a good idea because it prevents you from building more than one list at the same time. (For example, if you wanted to implement function calls, since a function's arguments are also a list of expressions.)
On the whole, it's better to put the implementation of the expanding expression list in a simple API which is implemented elsewhere. Here, I'm going to assume that you've done that; for specificity, I'll assume the following API (although it's just one example):
/* The list header structure, which contains all the information
* necessary to use the list. The forward declaration makes it
* possible to use pointers to ExprList objects without having to
* expose its implementation details.
*/
typedef struct ExprList ExprList;
/* Creates a new empty expression-list and returns a pointer to its header. */
ExprList* expr_list_create(void);
/* Resizes the expression list to the supplied size. If the list
* currently has fewer elements, new elements with default values are
* added at the end. If it currently has more elements, the excess
* ones are discarded. Calling with size 0 empties the list (but
* doesn't delete it).
*/
int expr_list_resize(ExprList* list, int new_length);
/* Frees all storage associated with the expression list. The
* argument must have been created with expr_list_create, and its
* value must not be used again after this function returns.
*/
void expr_list_free(ExprList* list);
/* Adds one element to the end of the expression-list.
* I kept the float datatype for expression values, although I
 * strongly believe that it's not ideal. But an advantage of using an
* API like this is that it is easier to change.
*/
void expr_list_push(ExprList* list, float value);
/* Returns the number of elements in the expression-list. */
int expr_list_len(ExprList* list);
/* Returns the address of the element in the expression list
* with the given index. If the index is out of range, behaviour
* is undefined; a debugging implementation will report an error.
*/
float* expr_list_at(ExprList* list, int index);
With that API, we can rewrite the productions for valid expressions:
stmt_list: expr ';' { $$ = expr_list_create();
                      expr_list_push($$, $1);
                    }
  | stmt_list expr ';' { $$ = $1;
                         expr_list_push($$, $2);
                       }
Now for the error cases. You have two error rules; one triggers when the error is at the beginning of a list, and the other when the error is encountered after one or more (possibly erroneous) expressions have been handled. Both of these are productions for stmt_list so they must have the same value type as stmt_list does (ExprList*). Thus, they must do whatever you think is appropriate when a syntax error is produced.
The first one, when the error is at the start of the list, only needs to create an empty list. It's hard to see what else it could do.
stmt_list: error ';' { $$ = expr_list_create(); }
It seems to me that there are at least two alternatives for the other error action, when an error is detected after the list has at least one successfully-computed value. One possibility is to ditch the erroneous item, leaving the rest of the list intact. This requires only the default action:
stmt_list: stmt_list error ';'
(Of course, you could add the action { $$ = $1; } explicitly, if you wanted to.)
The other possibility is to empty the entire list, so as to start from scratch with the next element:
stmt_list: stmt_list error ';' { $$ = $1;
                                 expr_list_resize($$, 0);
                               }
There are undoubtedly other possibilities. As I said, bison cannot figure out what it is that you intended (and neither can I, really). You'll have to implement whatever behaviour you want.
I read the Spin guide, yet there is no answer to the following question:
I have a line in my code like the following:
Ch?x
where Ch is a channel and x is a variable of the channel's message type (to receive a MSG).
What happens if Ch is empty? Will it wait for a MSG to arrive or not?
Do I need to check first whether Ch is non-empty?
Basically, all I want is that if Ch is empty, the process waits until a MSG arrives and, once it has arrived, continues...
Bottom line: the semantics of Promela guarantee your desired behaviour, namely, that the receive-operation blocks until a message can be received.
From the receive man page
EXECUTABILITY
The first and the third form of the statement, written with a single
question mark, are executable if the first message in the channel
matches the pattern from the receive statement.
This tells you when a receive-operation is executable.
The semantics of Promela then tells you why executability matters:
As long as there are executable transitions (corresponding to the
basic statements of Promela), the semantics engine will select one of
them at random and execute it.
Granted, the quote doesn't make it very explicit, but it means that a statement that is currently not executable will block the executing process until it becomes executable.
Here is a small program that demonstrates the behaviour of the receive-operation.
chan ch = [1] of {byte};
/* Must be a buffered channel. A non-buffered, i.e., rendezvous channel,
 * won't work, because it won't be possible to execute the atomic block
 * around ch ! 0 atomically since sending over a rendezvous channel blocks
 * as well.
 */

short n = -1;

proctype sender() {
    atomic {
        ch ! 0;
        n = n + 1;
    }
}

proctype receiver() {
    atomic {
        ch ? 0;
        n = -n;
    }
}

init {
    atomic {
        run sender();
        run receiver();
    }
    _nr_pr == 1;
    assert n == 0;
    /* Only true if both processes are executed and if sending happened
     * before receiving.
     */
}
Yes, the current proctype will block until a message arrives on Ch. This behavior is described in the Promela Manual under the receive statement. [Because you are providing a variable x (as in Ch?x) any message in Ch will cause the statement to be executable. That is, the pattern matching aspect of receive does not apply.]
How do I do conditional compilation in yacc, similar to what is done in C using #ifdef?
I want to create a rule based on a condition. Is that possible in yacc? For example, based on a condition, rule A is defined as follows:
ruleA : A | B, /* For condition 1 */
ruleA : C /* If condition 1 is not satisfied */
btyacc has conditional compilation based on defined flags, similar to the C preprocessor. You can say:
%ifdef VERSION_A
ruleA: A | B ;
%endif
%ifdef VERSION_B
ruleA: C ;
%endif
and then use a -DVERSION_A or -DVERSION_B command line argument to get one version or the other. It's pretty primitive (you can only test a single flag per %ifdef, you can't nest %ifdefs, and there's no %else), but it's adequate for simple things.
If you can't preprocess your Yacc grammar with an appropriate preprocessor, then you can use a solution based on actions:
ruleA : A { if (condition1) { process OK; } else YYERROR; }
| B { if (condition1) { process OK; } else YYERROR; }
| C { if (!condition1) { process OK; } else YYERROR; }
;
The YYERROR action triggers Yacc's normal error processing mechanism. This means your grammar must 'work' with both sets of rules in operation as far as Yacc is concerned. If this leads to complexity because of shift/reduce (or even reduce/reduce) conflicts, then preprocessing is the way to go. The resulting grammar will be smaller and more focused.
My suggestion is that you expose the conditional flag as a terminal symbol; we can call it the mode terminal. (So, technically, it's a run-time rather than a compile-time condition.)
You need to return the mode terminal at every point where it would make a difference in the parse. Your grammar can ignore it when it receives extra ones. The lexer can return the mode terminal only in the "condition 1" case, only in the "not condition 1" case, or it can return a different terminal in each case. So say you have two tokens, C1 and C2, one for each mode.
This may not work out so well if you are parsing an existing grammar, but if you are designing the grammar and the parser together all problems are solvable.
Then you end up with:
ruleA : C1 A | C1 B | C2 C ;
Are there better ways to require that Ragel consume all of the input? Here is what I'm using now:
=begin
%%{
  machine my_lexer;
  # ...
  # extract tokens and store into `tokens`
  # ...
}%%
=end

class MyLexer
  %% write data;

  def self.run(string)
    data = string.unpack("c*")
    eof = data.length
    tokens = []
    %% write init;
    %% write exec;
    data.length == p ? tokens : nil
  end
end
Most of the above is boilerplate, except for the data.length == p test. It works -- except that it doesn't verify that the lexer ended in a final state. So, I have test cases that give me tokens back even if the entire input was not successfully parsed.
Is there a better way?
(Testing for the final state directly might work better. I'm looking into how to do that. Ideas?)
You can handle errors using either global or local error actions.
For global error actions you can use this syntax:
$!action
For local error actions, which are local to your machine definition, you can use this syntax:
$^action
If you put a flag on your action, you can check the flag to detect an error.
I'm only starting out with ragel, but it's possible you want to look at EOF actions or Error actions, executed respectively when the input ends or when the next character satisfies no transition from the current state.
I am trying to use ANTLR to analyse a large set of code using the full Java grammar. Since ANTLR needs to open all the source files and scan them, I am wondering if it can also return the number of lines of code.
I checked the API for Lexer and Parser; it seems they do not return LoC. Is it easy to instrument the grammar rules a bit to get LoC? The full Java grammar is complicated, and I don't really want to mess with a large part of it.
If you have an existing ANTLR grammar, and want to count certain things during parsing, you could do something like this:
grammar ExistingGrammar;

// ...

@parser::members {
  public int loc = 0;
}

// ...

someParserRule
  : SomeLexerRule someOtherParserRule {loc++;}
  ;
// ...
So, whenever your parser encounters a someParserRule, you increase loc by one by placing {loc++;} after (or before) the rule.
So, whatever your definition of a line of code is, simply place {loc++;} in the rule to increase the counter. Be careful not to increase it twice:
statement
  : someParserRule {loc++;}
  | // ...
  ;

someParserRule
  : SomeLexerRule someOtherParserRule {loc++;}
  ;
EDIT
I just noticed that in the title of your question you asked if this can be done during lexing. That won't be possible. Let's say a LoC would always end with a ';'. During lexing, you wouldn't be able to make a distinction between a ';' after, say, an assignment (which is a single LoC), and the 2 ';'s inside a for(int i = 0; i < n; i++) { ... } statement (which wouldn't be 2 LoC).
In the C target the data structure ANTLR3_INPUT_STREAM has a getLine() function which returns the current line from the input stream. It seems the Java version of this is CharStream.getLine(). You should be able to call this at any time and get the current line in the input stream.
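As a rough sketch (not from the original question), you could drain the lexer and then ask the stream which line it has reached; JavaLexer below stands in for whatever lexer ANTLR generated from your Java grammar:
import org.antlr.runtime.ANTLRFileStream;
import org.antlr.runtime.CharStream;
import org.antlr.runtime.Token;

public class LineCount {
    public static void main(String[] args) throws Exception {
        CharStream input = new ANTLRFileStream(args[0]);
        JavaLexer lexer = new JavaLexer(input);

        // Drain the lexer so the CharStream has consumed the whole file,
        // then ask the stream which line it is currently on.
        while (lexer.nextToken().getType() != Token.EOF) {
            // tokens are discarded; we only want the stream to advance
        }
        System.out.println("current line after scanning: " + input.getLine());
    }
}
Inside a lexer action, input refers to this same CharStream, so reading input.getLine() there should tell you which line the lexer is currently on.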
Use a visitor to visit the CompilationUnit context, then context.stop.getLine() will give you the last line number of the compilation unit context.
@Override
public Integer visitCompilationUnit(@NotNull JAVAParser.CompilationUnitContext ctx) {
    return ctx.stop.getLine();
}