Checking that Ragel matched the entire input

Are there better ways to require that Ragel consume all of the input? Here is what I'm using now:
=begin
%%{
machine my_lexer;
# ...
# extract tokens and store into `tokens`
# ...
}%%
=end
class MyLexer
  %% write data;

  def self.run(string)
    data = string.unpack("c*")
    eof = data.length
    tokens = []
    %% write init;
    %% write exec;
    data.length == p ? tokens : nil
  end
end
Most of the above is boilerplate, except for the data.length == p test. It works -- except that it doesn't verify that the lexer ended in a final state. So, I have test cases that give me tokens back even if the entire input was not successfully parsed.
Is there a better way?
(Testing for the final state directly might work better. I'm looking into how to do that. Ideas?)

You can handle errors using either global or local error actions.
For global error actions you can use this syntax:
$!action
For local error actions, which are local to your machine definition, you can use this syntax:
$^action
If you set a flag in your action, you can check that flag after the machine has run to detect an error.

I'm only starting out with Ragel, but you may want to look at EOF actions or error actions, which are executed, respectively, when the input ends or when the next character satisfies no transition from the current state.
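To illustrate why both tests are needed, here is a minimal sketch in C++ of a hand-written state machine (not Ragel output) that accepts `[0-9]+ ';'`. The state names, `kFirstFinal`, and `matches_entire_input` are made up for this example; the numbering only imitates Ragel's convention of placing final states at or above a "first final" constant. As far as I know, `write data` exports a similar `<machine>_first_final` constant, so the equivalent test in the question's code would presumably be something like `cs >= my_lexer_first_final`, but check that against your Ragel version.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hand-written stand-in for a generated machine that accepts [0-9]+ ';'.
// Hypothetical state numbering: final states are >= kFirstFinal.
enum State { kError = 0, kStart = 1, kDigits = 2, kDone = 3 };
constexpr int kFirstFinal = kDone;

// Returns the next state for the current state and input byte.
static int step(int cs, unsigned char c) {
    switch (cs) {
        case kStart:
            return std::isdigit(c) ? kDigits : kError;
        case kDigits:
            return std::isdigit(c) ? kDigits : (c == ';' ? kDone : kError);
        default:
            return kError;  // kDone has no outgoing transitions
    }
}

// True only if every byte was consumed AND the machine stopped in a final state.
bool matches_entire_input(const std::string& input) {
    int cs = kStart;
    std::size_t p = 0;
    const std::size_t pe = input.size();
    while (p < pe) {
        cs = step(cs, static_cast<unsigned char>(input[p]));
        if (cs == kError) break;  // stop without consuming the offending byte
        ++p;
    }
    return p == pe && cs >= kFirstFinal;
}
```

With this machine, the input "12" reaches the end of the data (p == pe) but stops in a non-final state, which is exactly the case that the length test alone lets through.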

Related

Turbo C++ : while(fin) vs while(!fin.eof())

I was told that I should be using while(fin) instead of while(!fin.eof()) when reading a file.
What exactly is the difference?
Edit: I do know that while(fin) actually checks the stream object and that when it becomes NULL, the loop breaks and it covers eof and fail flags.
But my course teacher says that fin.eof() is better so I need to understand the fundamental operation that's going on here.
Which one is the right practice?
Note: This is not a duplicate, I need assistance in Turbo C++ and with binary files.
I'm basically trying to read a file using a class object.
First of all, I am assuming fin is your fstream object. In that case your teacher would not have told you to use while(fin.eof()) for reading from a file. She would have told you to use while(!fin.eof()).
Let me explain. eof() is a member of the fstream class which returns a true or false value depending on whether the End Of File (eof) of the file you are reading has been reached. Thus while eof() function returns 0 it means the end of file has not been reached and loop continues to execute, but when eof() returns 1 the end of the file has been reached and the loop exits.
while(fin) works because the stream object itself is tested: it converts to false once an error flag (failbit or badbit) has been set, which happens when an operation such as read, write, or open fails. Thus the loop runs as long as the read operation inside the loop succeeds.
Personally I would not suggest either of them.
I would suggest
// assume a class abc
abc ob;
while (fin.read((char*)&ob, sizeof(ob)))
{}
Or
while (fin.getline(parameters))
{}
This loop reads the file record inside the loop condition and if nothing was read due to the end of file being reached, the loop is exited.
The problem with while(!fin.eof()) is that eof() only returns true after a read has already tried to go past the end of the file. Reading the last byte does not set the flag; the first read that finds no data left sets it. All eof() does is return that flag.
So the loop can appear to work in simple cases, but when you are reading successive records of a class from a file, this method fails.
Consider
class abc
{} a;
fstream fin("file");
while (!fin.eof())
{
    fin.read((char*)&a, sizeof(a));
    a.display(); // display is a member function which displays the info
}
This displays the last record twice. After the last record has been read, the file position is at the end of the file, but the eof flag has not been set yet, so the loop is entered again. This time the read fails and sets the eof flag, but the values already in the variable a, that is, those of the previous record, are displayed again.
One good method is to do something like this:
while ( instream.read(...) && !instream.eof() ) { //Reading a binary file
Statement1;
Statement2;
}
or in case of a text file:
while ( (ch = instream.get()) && !instream.eof() ) { //To read a single character
Statement1;
Statement2;
}
Here, the object is being read within the while loop's condition statement and then the value of eof flag is being tested.
This wouldn't result in undesired outputs.
Here we are checking the status of the actual I/O operation and the eof together. You may also check for the fail flag.
I would like to point out that, according to @RetiredNinja, it is enough to check only the I/O operation.
That is:
while ( instream.read(...) ) { //Reading a binary file
Statement1;
Statement2;
}
A quick and easy workaround that worked for me to avoid any problems when using eof is to check for it after the first reading and not as a condition of the while loop itself. Something like this:
while (true) // no conditions
{
filein >> string; // an example reading, could be any kind of file reading instruction
if (filein.eof()) break; // break the while loop if eof was reached
// the rest of the code
}
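The difference between the two loop styles discussed above can be reproduced with a small sketch. It is written in standard C++ and uses an in-memory std::istringstream in place of a file so that it is self-contained; Turbo C++ predates these headers, so treat it as an illustration of the pattern rather than code for that compiler. The Record struct and the function names are hypothetical.

```cpp
#include <istream>
#include <sstream>
#include <string>
#include <vector>

struct Record {  // hypothetical fixed-size record
    int id;
    char name[12];
};

// Builds an in-memory "file" holding n records with ids 0..n-1.
std::string make_file(int n) {
    std::ostringstream out;
    for (int i = 0; i < n; ++i) {
        Record r{};
        r.id = i;
        out.write(reinterpret_cast<const char*>(&r), sizeof r);
    }
    return out.str();
}

// Problematic: eof() becomes true only after a read has already failed,
// so the body runs one extra time with the previous record still in r.
std::vector<int> ids_with_eof_loop(std::istream& in) {
    std::vector<int> ids;
    Record r{};
    while (!in.eof()) {
        in.read(reinterpret_cast<char*>(&r), sizeof r);
        ids.push_back(r.id);
    }
    return ids;
}

// Preferred: the read itself is the loop condition, so a failed read
// never reaches the loop body.
std::vector<int> ids_with_read_loop(std::istream& in) {
    std::vector<int> ids;
    Record r{};
    while (in.read(reinterpret_cast<char*>(&r), sizeof r)) {
        ids.push_back(r.id);
    }
    return ids;
}
```

For a three-record input, the eof() loop yields the ids 0, 1, 2, 2 (the last record is repeated), while the read loop yields 0, 1, 2.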

Ragel: avoid redundant call of "when" clause function

I'm writing a Ragel machine for a rather simple binary protocol, and what I present here is an even more simplified version, without any error recovery whatsoever, just to demonstrate the problem I'm trying to solve.
So, the message to be parsed here looks like this:
<1 byte: length> <$length bytes: user data> <1 byte: checksum>
Machine looks as follows:
%%{
machine my_machine;
write data;
alphtype unsigned char;
}%%
%%{
action message_reset {
/* TODO */
data_received = 0;
}
action got_len {
len = fc;
}
action got_data_byte {
/* TODO */
}
action message_received {
/* TODO */
}
action is_waiting_for_data {
(data_received++ < len);
}
action is_checksum_correct {
1/*TODO*/
}
len = (any);
fmt_separate_len = (0x80 any);
data = (any);
checksum = (any);
message =
(
    # first byte: length of the data
    (len @got_len)
    # user data
    (data when is_waiting_for_data @got_data_byte)*
    # place higher priority on the previous machine (i.e. data)
    <:
    # last byte: checksum
    (checksum when is_checksum_correct @message_received)
) >to(message_reset)
;
main := (msg_start: message)*;
# Initialize and execute.
write init;
write exec;
}%%
As you see, first we receive 1 byte that represents the length; then we receive data bytes until we have received the needed number of bytes (the check is done by is_waiting_for_data), and when we receive the next (extra) byte, we check whether it is a correct checksum (by is_checksum_correct). If it is, the machine waits for the next message; otherwise, this particular machine stalls (I haven't included any error recovery here on purpose, in order to simplify the diagram).
The diagram of it looks like this:
$ ragel -Vp ./msg.rl | dot -Tpng -o msg.png
[Image: state diagram of the machine, not reproduced here]
As you see, in state 1, while we are receiving user data, the conditions are as follows:
0..255(is_waiting_for_data, !is_checksum_correct),
0..255(is_waiting_for_data, is_checksum_correct)
So on every data byte it redundantly calls is_checksum_correct, although the result doesn't matter at all.
The condition should be as simple as: 0..255(is_waiting_for_data)
How to achieve that?
How is is_checksum_correct supposed to work? The when condition happens before the checksum is read, according to what you posted. My suggestion would be to check the checksum inside message_received and handle any error there. That way, you can get rid of the second when and the problem would no longer exist.
It looks like semantic conditions are a relatively new feature in Ragel, and while they look really useful, maybe they're not quite mature enough yet if you want optimal code.
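The suggestion to verify the checksum inside message_received, rather than in a second condition, can be sketched outside Ragel as a hand-written parser in C++. This is only an illustration under stated assumptions: the MessageParser class and its members are hypothetical, the checksum is assumed to be the sum of the data bytes modulo 256 (the question does not say which algorithm the protocol uses), and on a bad checksum the sketch counts an error and waits for the next message instead of stalling.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hand-written stand-in for the machine: <len> <len data bytes> <checksum>.
// ASSUMPTION: checksum = sum of the data bytes modulo 256.
class MessageParser {
public:
    // Feeds one byte. Returns true when a complete message with a valid
    // checksum has just been received; payload() then holds its data bytes.
    bool feed(std::uint8_t byte) {
        switch (state_) {
            case State::Len:  // corresponds to got_len
                len_ = byte;
                data_.clear();
                sum_ = 0;
                state_ = (len_ == 0) ? State::Checksum : State::Data;
                return false;
            case State::Data:  // corresponds to got_data_byte
                data_.push_back(byte);
                sum_ = static_cast<std::uint8_t>(sum_ + byte);
                if (data_.size() == len_) state_ = State::Checksum;
                return false;
            case State::Checksum:  // corresponds to message_received
                state_ = State::Len;
                if (byte != sum_) {  // the check happens once per message
                    ++errors_;
                    return false;
                }
                return true;
        }
        return false;
    }

    const std::vector<std::uint8_t>& payload() const { return data_; }
    int errors() const { return errors_; }

private:
    enum class State { Len, Data, Checksum };
    State state_ = State::Len;
    std::size_t len_ = 0;
    std::uint8_t sum_ = 0;
    int errors_ = 0;
    std::vector<std::uint8_t> data_;
};
```

The point of the design is that only the length test runs for every data byte; the checksum comparison runs exactly once, when the byte after the data arrives.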

In SPIN/Promela, how to receive a MSG from a channel in the correct way?

I read the Spin guide, but found no answer to the following question:
I have a line in my code as follows:
Ch?x
where Ch is a channel and x is a variable of the channel's message type (to receive the MSG).
What happens if Ch is empty? Will it wait for a MSG to arrive or not?
Do I need to check first whether Ch is not empty?
Basically, all I want is that if Ch is empty, the process waits until a MSG arrives, and then continues.
Bottom line: the semantics of Promela guarantee your desired behaviour, namely, that the receive-operation blocks until a message can be received.
From the receive man page
EXECUTABILITY
The first and the third form of the statement, written with a single
question mark, are executable if the first message in the channel
matches the pattern from the receive statement.
This tells you when a receive-operation is executable.
The semantics of Promela then tells you why executability matters:
As long as there are executable transitions (corresponding to the
basic statements of Promela), the semantics engine will select one of
them at random and execute it.
Granted, the quote doesn't make it very explicit, but it means that a statement that is currently not executable will block the executing process until it becomes executable.
Here is a small program that demonstrates the behaviour of the receive-operation.
chan ch = [1] of {byte};
/* Must be a buffered channel. A non-buffered, i.e., rendezvous channel,
* won't work, because it won't be possible to execute the atomic block
* around ch ! 0 atomically since sending over a rendezvous channel blocks
* as well.
*/
short n = -1;
proctype sender() {
atomic {
ch ! 0;
n = n + 1;
}
}
proctype receiver() {
atomic {
ch ? 0;
n = -n;
}
}
init {
atomic {
run sender();
run receiver();
}
_nr_pr == 1;
assert n == 0;
/* Only true if both processes are executed and if sending happened
* before receiving.
*/
}
Yes, the current proctype will block until a message arrives on Ch. This behavior is described in the Promela Manual under the receive statement. [Because you are providing a variable x (as in Ch?x) any message in Ch will cause the statement to be executable. That is, the pattern matching aspect of receive does not apply.]

ANTLR infinite EOF loop

Please have a look at my grammar: https://bitbucket.org/rstoll/tsphp-parser/raw/cdb41531e86ec66416403eb9c29edaf60053e5df/src/main/antlr/TSPHP.g
Somehow ANTLR produces an infinite loop finding infinite EOF tokens for the following input:
class a{public function void a(}
Although only prog expects EOF, classBody somehow accepts it as well. Does anyone have an idea how I can fix that, i.e. what I have to change so that classBody does not accept EOF tokens?
Code from the generated class:
// D:\\TSPHP-parser\\src\\main\\antlr\\TSPHP.g:287:129: ( classBody )*
loop17:
do {
int alt17=2;
int LA17_0 = input.LA(1);
if ( (LA17_0==EOF||LA17_0==Abstract||LA17_0==Const||LA17_0==Final||LA17_0==Function||LA17_0==Private||(LA17_0 >= Protected && LA17_0 <= Public)||LA17_0==Static) ) {
alt17=1;
}
switch (alt17) {
case 1 :
// D:\\TSPHP-parser\\src\\main\\antlr\\TSPHP.g:287:129: classBody
{
pushFollow(FOLLOW_classBody_in_classDeclaration1603);
classBody38=classBody();
state._fsp--;
if (state.failed) return retval;
if ( state.backtracking==0 ) stream_classBody.add(classBody38.getTree());
}
break;
default :
break loop17;
}
} while (true);
The problem occurs when the token is EOF: the loop never exits, since EOF is treated as a valid token, even though I have not specified it that way.
EDIT: I do not get the error if I comment out lines 342 and 347 (the empty case in the rules accessModifierWithoutPrivateOrPublic and accessModifierOrPublic, respectively).
EDIT 2: I solved my problem. I rewrote the methodModifier rule (integrating all the possible modifiers into one rule). This way ANTLR no longer believes that EOF is a valid token after /* empty */ in
accessModifierOrPublic
: accessModifier
| /* empty */ -> Public["public"]
;
This type of bug can occur in error handling for ANTLR 3. In ANTLR 4, the IntStream.consume() method was updated to require the following exception be thrown to preempt this problem.
Throws:
IllegalStateException - if an attempt is made to consume the end of the stream (i.e. if LA(1)==EOF before calling consume).
For ANTLR 3 grammars, you can at least prevent an infinite loop by using your own TokenStream implementation (probably easiest to extend CommonTokenStream) and throwing this exception if the condition listed above is violated. Note that you might need to allow this condition to be violated once (reasons are complicated), so keep a count and throw the IllegalStateException if the code tries to consume EOF more than 2 or 3 times. Remember this is just an effort to break the infinite loop so you can be a little "fuzzy" on the actual check.
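The counting guard described above can be sketched independently of ANTLR. The C++ class below is not ANTLR's API: GuardedTokenStream, la1, consume, and kEof are hypothetical names that only mimic the shape of a token stream, and the default limit of 2 is an arbitrary choice in the spirit of "2 or 3 times". In a real ANTLR 3 Java project the same logic would go into a subclass of CommonTokenStream.

```cpp
#include <cstddef>
#include <stdexcept>
#include <utility>
#include <vector>

constexpr int kEof = -1;  // hypothetical EOF token type

// Token stream wrapper that tolerates a few attempts to consume EOF and
// then throws, so a parser stuck on EOF cannot loop forever.
class GuardedTokenStream {
public:
    explicit GuardedTokenStream(std::vector<int> tokens, int max_eof_consumes = 2)
        : tokens_(std::move(tokens)), max_eof_consumes_(max_eof_consumes) {}

    // One token of lookahead; past the end it keeps returning EOF.
    int la1() const { return pos_ < tokens_.size() ? tokens_[pos_] : kEof; }

    void consume() {
        if (la1() == kEof) {
            // Allow the condition to be violated a limited number of times.
            if (++eof_consumes_ > max_eof_consumes_) {
                throw std::logic_error("parser keeps consuming EOF");
            }
            return;
        }
        ++pos_;
    }

private:
    std::vector<int> tokens_;
    std::size_t pos_ = 0;
    int max_eof_consumes_;
    int eof_consumes_ = 0;
};
```

With the default limit, two attempts to consume EOF are tolerated and the third throws, which turns the infinite loop into an error that can be reported.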

Can ANTLR return Lines of Code when lexing?

I am trying to use ANTLR to analyse a large set of code using the full Java grammar. Since ANTLR needs to open all the source files and scan them, I am wondering if it can also return lines of code.
I checked the API for Lexer and Parser, and it seems they do not return LoC. Is it easy to instrument the grammar rules a bit to get LoC? The full Java grammar is complicated, and I don't really want to modify a large part of it.
If you have an existing ANTLR grammar, and want to count certain things during parsing, you could do something like this:
grammar ExistingGrammar;
// ...
@parser::members {
public int loc = 0;
}
// ...
someParserRule
: SomeLexerRule someOtherParserRule {loc++;}
;
// ...
So, whenever your parser encounters a someParserRule, you increase loc by one by placing {loc++;} after (or before) the rule.
So, whatever your definition of a line of code is, simply place {loc++;} in the rule to increase the counter. Be careful not to increase it twice:
statement
: someParserRule {loc++;}
| // ...
;
someParserRule
: SomeLexerRule someOtherParserRule {loc++;}
;
EDIT
I just noticed that in the title of your question you asked if this can be done during lexing. That won't be possible. Let's say a LoC would always end with a ';'. During lexing, you wouldn't be able to make a distinction between a ';' after, say, an assignment (which is a single LoC), and the 2 ';'s inside a for(int i = 0; i < n; i++) { ... } statement (which wouldn't be 2 LoC).
In the C target the data structure ANTLR3_INPUT_STREAM has a getLine() function which returns the current line from the input stream. It seems the Java version of this is CharStream.getLine(). You should be able to call this at any time and get the current line in the input stream.
Use a visitor to visit the CompilationUnit context, then context.stop.getLine() will give you the last line number of the compilation unit context.
@Override public Integer visitCompilationUnit(@NotNull JAVAParser.CompilationUnitContext ctx) {
    return ctx.stop.getLine();
}