Terminal Symbol vs Token in Lex or Flex - yacc

I am studying YACC and the concept of a terminal symbol vs a token keeps coming up. Could someone explain to me what the difference is or point me to an article or tutorial that might help?

They are really two names for the same thing, but usually "terminal" is used to describe what the parser is working with, while "token" is used to describe the corresponding sequence of symbols in the source.
In a parser generator like yacc, the grammar of the language is defined in terms of an "alphabet" of "terminals". The word "alphabet" is a little confusing because they are strings, not letters. But from the parser's perspective, every terminal is an indivisible unit indistinguishable from any other use of the same kind of terminal. So the source code:
total = 17 + subtotal;
will be presented to the parser as something like:
ID EQUALS NUMBER PLUS ID SEMICOLON
There is a correspondence between the stream of terminals which the parser sees and substrings of the input language. So we say that the "token" total is an instance of the "terminal" ID. There may be an unlimited number of potential tokens corresponding to a given terminal (or they may be just one, as with the terminal EQUALS) but what the parser actually works with is a smallish finite set of terminals.

Related

Preferentially match shorter token in ANTLR4

I'm currently attempting to write a UCUM parser using ANTLR4. My current approach has involved defining every valid unit and prefix as a token.
Here's a very small subset of the defined tokens. I could make a cut-down version of the grammar as an example, but it seems like it shouldn't be necessary to resolve this problem (or to point out that I'm going about this entirely the wrong way).
MILLI_OR_METRE: 'm' ;
OSMOLE: 'osm' ;
MONTH: 'mo' ;
SECOND: 's' ;
One of the standard testcases is mosm, from which the lexer should generate the token stream MILLI_OR_METRE OSMOLE. Unfortunately, because ANTLR preferentially matches longer tokens, it generates the token stream MONTH SECOND MILLI_OR_METRE, which then causes the parser to raise an error.
Is it possible to make an ANTLR4 lexer try to match using shorter tokens first? Adding lookahead-type rules to MONTH isn't a great solution, as there are all sorts of potential lexing conflicts that I'd need to take account of (for example mol being lexed as MONTH LITRE instead of MOLE and so on).
EDIT:
StefanA below is of course correct; this is a job for a parser capable of backtracking (eg. recursive descent, packrat, PEG and probably various others... Coco/R is one reasonable package to do this). In an attempt to avoid adding a dependency on another parser generator (or moving other bits of the project from ANTLR to this new generator) I've hacked my way around the problem like this:
MONTH: 'mo' { _input.La(1) != 's' && _input.La(1) != 'l' && _input.La(1) != '_' }? ;
// (note: this is a C# project; java would use _input.LA instead)
but this isn't really a very extensible or maintainable solution, and like as not will have introduced other subtle issues I've not come across yet.
Your problem does not require smaller tokens to be preferred (In this case MONTH would never be matched). You need a backtracking behaviour dependent on the text being matched or not. Right?
ANTLR separates tokenization and parsing strictly. Consequently every solution to your problem will seem like a hack.
However other parser generators are specialized on problems like yours. Packrat Parsers (PEG) are backtracking and allow tokenization on the fly. Try out parboiled for this purpose.
Appears that the question is not being framed correctly.
I'm currently attempting to write a UCUM parser using ANTLR4. My current approach has involved defining every valid unit and prefix as a token.
But, according to the UCUM:
The expression syntax of The Unified Code for Units of Measure generates an infinite number of codes with the consequence that it is impossible to compile a table of all valid units.
The most to expect from the lexer is an unambiguous identification of the measurement string without regard to its semantic value. Similarly, a parser alone will be unable to validly select between unit sequences like MONTH LITRE and MOLE - both could reasonably apply to a leak rate - unless the problem space is statically constrained in the parser definition.
A heuristic, structural (explicitly identifying the problem space) or contextual (considering the relative nature of other units in the problem space), is most likely required to select the correct unit interpretation.
The best tool to use is the one that puts you in the best position to implement the heuristics necessary to disambiguate the unit strings. Antlr could do it using parse-tree walkers. Whether that is the appropriate approach requires further analysis.

Which are the variable naming rules in Processing?

(Question by John Williams, from a Coursera forum, which I decided to share with the community, since I haven't been able to find this answered anywhere.)
The following code runs without error:
int _j = 1;
//int 2var = 2;
int var2 = 2;
int Kvar = 3; // first letter can be uppercase
int spec$var = 4;
int com_pound_var = 5; // compounding without camel case
int com$pound$two = 6;
int $var = 199;
println(_j);
println(var2);
println(Kvar);
println(spec$var);
println(com_pound_var);
println(com$pound$two);
println($var); //first character can be special
Since the compiler accepts _j, Kvar, and $var as valid variable names, it is clear that variable names do not need to start with a lowercase letter.
I was unable to locate the variable naming rules anywhere in the reference.
What are the variable naming rules for the Processing language?
Quick answer: can start with any letter, underscore and dollar signs, continue with letters, numbers, underscore and dollar signs. Details below.
I could also not find anything in the reference or the documentation at all. However, inspecting the source code, I found that Processing is not a language of its own, but rather a framework in which you run some commands. The difference is that you're actually writing a different language, and Processing just gives you some basic scaffolding where you build on top of.
For some technical details: Processing compiles a Java Build with some flags, spins up a virtual machine (Java VM, not same thing as a full fledged virtual machine) and connects to it to get input and output streams (this is why you can interact with the mouse or get the console output of your own program in a separate window). (Source.)
This language, which you may have guessed already, is Java.
With that said, the actual docs that answer this question is the Java Language Specification, which, to simplify things, is as close as you can get to an answer. (But if you really want to know, it's a mess.)
Specifically, the section on Identifiers, which I'll sum up below:
Can start with any letter (A-Z, a-z), underscore (_), dollar sign ($), or any unicode "letter" (accented, chinese, etc. Details.)
Can continue with any of the above, and can also continue with digits (0-9). Can also contain other unicode "letters" (Details.)
Can have unlimited length
Cannot be any Java keyword (list here)
Cannot be false, true, null
They can look the same and still be different if their codes are different (some Unicode letters look just like letters but are different ones)
I hope this helps! Investigating was fun.

software\tool for checking syntax format

Im looking for a tool that can validate if a given text\paragraph subject to a specific format .
for example :
I can be able to check if the text is as following :
xxx{
sss:aaa;
}
yyy();
preferably open source tool, with easy rule sets like xml or something .
by text i mean a string that i get from i.e fgets(), or any function that reads from a file .
For something like this I'd suggest a parser (see, for instance, What is Parse/parsing?). You can build one from a definition of the language that you want to parse using a parser generator like Yacc or its free GNU equivalent Bison, or any number of other parser generators, many of which are also freely available.
Most parsers are used to transform a text that complies with a grammar into some other form (e.g. an intermediate language or a machine code) but that isn't neccesary - in your case the parser could simply say (at a minimum) "Yes" if the text conforms to a given grammar.
Parsers for simple grammars can be built by hand but, if you have the tools available, using a parser generator is easier and more robust in my experience.
Further, the text that you've shown is similar to a portion of code written in the C language (something close to a struct declaration followed by a function call), so you would be able to re-use parts of the grammar that you need from an existing Yacc grammar for C like this one.

Bison input analyzer - basic question on optional grammar and input interpretation

I am very new to Flex/Bison, So it is very navie question.
Pardon me if so. May look like homework question - but I need to implement project based on below concept.
My question is related to two parts,
Question 1
In Bison parser, How do I provide rules for optional input.
Like, I need to parse the statment
Example :
-country='USA' -state='INDIANA' -population='100' -ratio='0.5' -comment='Census study for Indiana'
Here the ratio token can be optional. Similarly, If I have many tokens optional, then How do I provide the grammar in the parser for the same?
My code looks like,
%start program
program : TK_COUNTRY TK_IDENTIFIER TK_STATE TK_IDENTIFIER TK_POPULATION TK_IDENTIFIER ...
where all the tokens are defined in the lexer. Since there are many tokens which are optional, If I use "|" then there will be many different ways of input combination possible.
Question 2
There are good chance that the comment might have quotes as part of the input, so I have added a token -tag which user can provide to interpret the same,
Example :
-country='USA' -state='INDIANA' -population='100' -ratio='0.5' -comment='Census study for Indiana$'s population' -tag=$
Now, I need to reinterpret Indiana$'s as Indiana's since -tag=$.
Please provide any input or related material for to understand these topic.
Q1: I am assuming we have 4 possible tokens: NAME , '-', '=' and VALUE
Then the grammar could look like this:
attrs:
attr attrs
| attr
;
attr:
'-' NAME '=' VALUE
;
Note that, unlike you make specific attribute names distinguished tokens, there is no way to say "We must have country, state and population, but ratio is optional."
This would be the task of that part of the program that analyses the data produced by the parser.
Q2: I understand this so, that you think of changing the way lexical analysis works while the parser is running. This is not a good idea, at least not for a beginner. Have you even started to think about lexical analysis, as opposed to parsing?

How does a lexer return a semantic value that the parser uses?

Is it always necessary to do so? What does it look like?
Lexers don't deal with semantics, they only deal with turning a stream of characters into tokens (sequences of characters that have meaning to the compiler). Semantics are determined during syntactic analysis. See this answer to a previous question for further details on the stages of compilation.
In yacc, your lexer gets a global variable named yylval which is a C union. Back in yacc, this becomes the value for $1, $2, etc.
Lexer don't care about semantic the only mission in life for lexers is to convert the source code (stream of characters) into tokens each has this form <Token_type, Information_related_to_token> the information maybe the value of the token (string), the name of the operator (=) ...
Tokens then are sent to a parser that deals with syntactic analysis. as a side job a lexer can create a symbols table.