Parse same pattern differently depending on context with lex/yacc - yacc

My problem is that I have an identical pattern of characters that I wish to parse differently depending on their context. In one part of the file I need to parse a version number of the form "#.#". Obviously, this is identical to a floating-point number, and the lexer always opts to return a floating-point number. I think I can switch the order of the rules to give the version number precedence(?), but that doesn't do me any good when I need to later parse floating-point numbers.
I suppose I could forget about asking the parser to return each piece of the version separately and split the floating point number into pieces, but I had hoped to have it done for me.
There is actually a bit more to the context of the version #. Its complete form is "ABC #.# XYZ" where "ABC" and "XYZ" never change. I have tried a couple of things to leverage the context of the version #, but haven't made it work.
Is there a way to provide some context to the lexer to parse the components of the version? Am I stuck with receiving a floating point number and parsing it myself?

You have a few possibilities.
The simplest one is to do the string to number conversion in the parser instead of the scanner. That requires making a copy of the number as a string, but the overhead should not be too high: malloc of short strings is well-optimized on almost all platforms. And the benefit is that the code is extremely straightforward and robust:
Parser
%union {
    char  *string;
    double number;
    // other types, including the one used for version
}
%token <string> DOTTED
%token <number> NUMBER
%type  <number> number
%type  <version> version
%%
number : NUMBER
       | DOTTED  { $$ = atof($1); free($1); }
       ;
version: DOTTED  { $$ = make_version($1); free($1); }
       ;
Scanner
[[:digit:]]+\.[[:digit:]]+     { yylval.string = strdup(yytext); return DOTTED; }
[[:digit:]]+\.?|\.[[:digit:]]+ { yylval.number = atof(yytext); return NUMBER; }
The above assumes that version numbers are always single-dotted, as in the OP. In applications where version numbers could have multiple dots or non-numeric characters, you would end up with three possible token types: unambiguous numbers, unambiguous version strings, and single-dotted numeric strings which could be either. Aside from adding the VERSION token type and the pattern for unambiguous version strings to the scanner, the only change is to add | VERSION to the version production in the parser.
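For instance, the scanner half of that change might look like the following sketch, which assumes that two or more dots make the string unambiguously a version (VERSION is the new token name):
[[:digit:]]+(\.[[:digit:]]+){2,} { yylval.string = strdup(yytext); return VERSION; }
and the version production becomes:
version: DOTTED  { $$ = make_version($1); free($1); }
       | VERSION { $$ = make_version($1); free($1); }
       ;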
Another possibility, if you can easily figure out in the scanner whether a number or a version is required, is to use a start condition. You can also change the condition from the parser but it's more complicated and you need to understand the parsing algorithm in order to ensure that you are communicating the state changes correctly.
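For example, since the version always appears as "ABC #.# XYZ", the scanner could enter an exclusive start condition when it sees "ABC" and drop back to the normal state after "XYZ". This is only a rough sketch; the ABC, XYZ and VERSION_PART token names are invented and would need matching %token declarations in the parser:
%{
#include <stdlib.h>   /* atoi, atof */
%}
%x VERS
%%
"ABC"                         { BEGIN(VERS); return ABC; }
<VERS>[[:space:]]+            { /* skip blanks inside the version context */ }
<VERS>[[:digit:]]+            { yylval.number = atoi(yytext); return VERSION_PART; }
<VERS>"."                     { return '.'; }
<VERS>"XYZ"                   { BEGIN(INITIAL); return XYZ; }
[[:digit:]]+(\.[[:digit:]]+)? { yylval.number = atof(yytext); return NUMBER; }
(The sketch reuses the number member of the %union for the version components; you could add an int member instead.)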
Finally, you could do both conversions in the scanner, and pick the correct one when you reduce the token in the parser. If the version is a small and simple data structure, this might well turn out to be optimal.
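A compact sketch of that last approach, assuming a union member (here called dotted, an invented name) that carries both interpretations of the token:
/* scanner: fill in both readings of the dotted pattern */
[[:digit:]]+\.[[:digit:]]+ {
    yylval.dotted.number = atof(yytext);
    sscanf(yytext, "%d.%d", &yylval.dotted.major, &yylval.dotted.minor);
    return DOTTED;
}
/* parser: each rule just picks the reading it needs */
%union {
    double number;
    struct { double number; int major, minor; } dotted;
}
%token <dotted> DOTTED
%type  <number> number
%%
number : DOTTED { $$ = $1.number; }
       ;
The version rule would read $1.major and $1.minor in the same way.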

Related

Nearley Tokenizers vs Rules

I'm pretty new to nearley.js, and I would like to know what tokenizers/lexers do compared to rules. According to the website:
By default, nearley splits the input into a stream of characters. This is called scannerless parsing.
A tokenizer splits the input into a stream of larger units called tokens. This happens in a separate stage before parsing. For example, a tokenizer might convert 512 + 10 into ["512", "+", "10"]: notice how it removed the whitespace, and combined multi-digit numbers into a single number.
Wouldn't that be the same as:
Math -> Number _ "+" _ Number
Number -> [0-9]:+
I don't see what the purpose of lexers is. It seems that rules are always usable in this case and that there is no need for lexers.
After fiddling around with them, I found out the use of tokenizers. Say we had the following:
Keyword -> "if"|"else"
Identifier -> [a-zA-Z_]+
This won't work: if we try compiling this, we get an ambiguous grammar, because "if" will be matched as both a Keyword and an Identifier. A tokenizer, however:
{
    "keyword": /if|else/,
    "identifier": /[a-zA-Z_]+/
}
Trying to compile this will not result in an ambiguous grammar, because tokenizers are smart (at least the one shown in this example, which is Moo).

Generating Random String of Numbers and Letters Using Go's "testing/quick" Package

I've been breaking my head over this for a few days now and can't seem to be able to figure it out. Perhaps it's glaringly obvious, but I don't seem to be able to spot it. I've read up on all the basics of unicode, UTF-8, UTF-16, normalisation, etc, but to no avail. Hopefully somebody's able to help me out here...
I'm using Go's Value function from the testing/quick package to generate random values for the fields in my data structs, in order to implement the Generator interface for the structs in question. Specifically, given a Metadata struct, I've defined the implementation as follows:
func (m *Metadata) Generate(r *rand.Rand, size int) (value reflect.Value) {
    value = reflect.ValueOf(m).Elem()
    for i := 0; i < value.NumField(); i++ {
        if t, ok := quick.Value(value.Field(i).Type(), r); ok {
            value.Field(i).Set(t)
        }
    }
    return
}
Now, in doing so, I'll end up with both the receiver and the return value being set with randomly generated values of the appropriate type (strings, ints, etc. in the receiver and reflect.Value in the returned reflect.Value).
Now, the implementation for the Value function states that it will return something of type []rune converted to type string. As far as I know, this should allow me to then use the functions in the runes, unicode and norm packages to define a filter which filters out everything which is not part of 'Latin', 'Letter' or 'Number'. I defined the following filter which uses a transform to filter out letters which are not in those character rangetables (as defined in the unicode package):
func runefilter(in reflect.Value) (out reflect.Value) {
    out = in // Make sure you return something
    if in.Kind() == reflect.String {
        instr := in.String()
        t := transform.Chain(norm.NFD, runes.Remove(runes.NotIn(rangetable.Merge(unicode.Letter, unicode.Latin, unicode.Number))), norm.NFC)
        outstr, _, _ := transform.String(t, instr)
        out = reflect.ValueOf(outstr)
    }
    return
}
Now, I think I've tried just about anything, but I keep ending up with a series of strings which are far from the Latin range, e.g.:
𥗉똿穊
𢷽嚶
秓䝏小𪖹䮋
𪿝ท솲
𡉪䂾
ʋ𥅮ᦸ
堮𡹯憨𥗼𧵕ꥆ
𢝌𐑮𧍛併怃𥊇
鯮
𣏲𝐒
⓿ꐠ槹𬠂黟
𢼭踁퓺𪇖
俇𣄃𔘧
𢝶
𝖸쩈𤫐𢬿詢𬄙
𫱘𨆟𑊙
欓
So, can anybody explain what I'm overlooking here and how I could instead define a transformer which removes/replaces non-letter/number/latin characters so that I can use the Value function as intended (but with a smaller subset of 'random' characters)?
Thanks!
Confusingly, the Generator interface needs the Generate method to be defined on the type itself, not on the pointer to the type. You want your signature to look like
func (m Metadata) Generate(r *rand.Rand, size int) (value reflect.Value)
You can play with this here. Note: the most important thing to do in that playground is to switch the type of the generate function from m Metadata to m *Metadata and see that Hi Mom! never prints.
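Here is a minimal, self-contained sketch along the lines of that playground (the single Name field is invented for illustration). With the value receiver below, quick.Value uses the custom generator and "Hi Mom!" is printed; change the receiver to *Metadata and it silently falls back to reflection over the fields:
package main

import (
    "fmt"
    "math/rand"
    "reflect"
    "testing/quick"
)

type Metadata struct {
    Name string
}

// Value receiver: the Metadata type itself satisfies quick.Generator.
func (m Metadata) Generate(r *rand.Rand, size int) reflect.Value {
    fmt.Println("Hi Mom!") // proves the custom generator is actually called
    return reflect.ValueOf(Metadata{Name: "generated"})
}

func main() {
    r := rand.New(rand.NewSource(0))
    v, ok := quick.Value(reflect.TypeOf(Metadata{}), r)
    fmt.Println(ok, v.Interface())
}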
In addition, I think you would be better served using your own type and writing a generate method for that type using a list of all of the characters you want to use. For example:
type LatinString string
const latin = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"
and then use the generator
func (l LatinString) Generate(rand *rand.Rand, size int) reflect.Value {
    var buffer bytes.Buffer
    for i := 0; i < size; i++ {
        buffer.WriteString(string(latin[rand.Intn(len(latin))]))
    }
    s := LatinString(buffer.String())
    return reflect.ValueOf(s)
}
playground
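With that generator in place, any struct field declared with the derived type is filled in by it automatically whenever quick.Value builds the struct; a sketch:
type Metadata struct {
    Name LatinString // generated by LatinString.Generate rather than the default string generator
}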
Edit: also this library is pretty cool, thanks for showing it to me
The answer to my own question is, it seems, a combination of the answers provided in the comments by @nj_ and @jimb and the answer provided by @benjaminkadish.
In short, the answer boils down to:
"Not such a great idea as you thought it was", or "Bit of an ill-posed question"
"You were using the union of 'Letter', 'Latin' and 'Number' (Letter || Number || Latin), instead of the intersection of 'Latin' with the union of 'Letter' and 'Number' ((Letter || Number) && Latin))
Now for the longer version...
The idea behind me using the testing/quick package is that I wanted random data for (fuzzy) testing of my code. In the past, I've always written the code for doing things like that myself, again and again. This meant a lot of the same code across different projects. Now, I could of course have written my own package for it, but it turns out that, even better than that, there's actually a standard package which does just about exactly what I want.
Now, it turns out the package does exactly what I want very well. The codepoints in the strings which it generates are actually random and not just restricted to what we're accustomed to using in everyday life. Now, this is of course exactly the thing which you want in doing fuzzy testing in order to test the code with values outside the usual assumptions.
In practice, that means I'm running into two problems:
There are some limits on what I would consider reasonable input for a string. Meaning that, in testing the processing of a Name field or a URL field, I can reasonably assume there's not going to be a value like 'James Mc⌢' (let alone 'James Mc🙁') or 'www.🕸site.com', but just 'James McFrown' and 'www.website.com'. Hence, I can't expect a reasonable system to be able to support it. Of course, things shouldn't completely break down, but it also can't be expected to handle the former examples without any problems.
When I filter the generated string on values which one might consider reasonable, the chance of ending up with a valid string is very small. The set of possible characters used by testing/quick is just so large (code points up to 0x10FFFF) and the set of reasonable characters so small that you end up with empty strings most of the time.
So, what do we need to take away from this?
So, whilst I hoped to use the standard testing/quick package to replace my often repeated code to generate random data for fuzzy testing, it does this so well that it provides data outside the range of what I would consider reasonable for the code to be able to handle. It seems that the choice, in the end, is to:
Either be able to actually handle all fuzzy options, meaning that if somebody's name is 'Arnold 💰💰' ('Arnold Moneybags'), it shouldn't go arse over end. Or...
Use custom/derived types with their own Generator. This means you're going to have to use the derived type instead of the basic type throughout the code. (Comparable to defining a string as wchar_t instead of char in C++ and working with those by default.) Or...
Don't use testing/quick for fuzzy testing, because as soon as you run into a generated string value, you can (and should) get a very random string.
As always, further comments are of course welcome, as it's quite possible I overlooked something.

Preferentially match shorter token in ANTLR4

I'm currently attempting to write a UCUM parser using ANTLR4. My current approach has involved defining every valid unit and prefix as a token.
Here's a very small subset of the defined tokens. I could make a cut-down version of the grammar as an example, but it seems like it shouldn't be necessary to resolve this problem (or to point out that I'm going about this entirely the wrong way).
MILLI_OR_METRE: 'm' ;
OSMOLE: 'osm' ;
MONTH: 'mo' ;
SECOND: 's' ;
One of the standard testcases is mosm, from which the lexer should generate the token stream MILLI_OR_METRE OSMOLE. Unfortunately, because ANTLR preferentially matches longer tokens, it generates the token stream MONTH SECOND MILLI_OR_METRE, which then causes the parser to raise an error.
Is it possible to make an ANTLR4 lexer try to match using shorter tokens first? Adding lookahead-type rules to MONTH isn't a great solution, as there are all sorts of potential lexing conflicts that I'd need to take account of (for example mol being lexed as MONTH LITRE instead of MOLE and so on).
EDIT:
StefanA below is of course correct; this is a job for a parser capable of backtracking (e.g. recursive descent, packrat, PEG and probably various others... Coco/R is one reasonable package to do this). In an attempt to avoid adding a dependency on another parser generator (or moving other bits of the project from ANTLR to this new generator) I've hacked my way around the problem like this:
MONTH: 'mo' { _input.La(1) != 's' && _input.La(1) != 'l' && _input.La(1) != '_' }? ;
// (note: this is a C# project; java would use _input.LA instead)
but this isn't really a very extensible or maintainable solution, and like as not will have introduced other subtle issues I've not come across yet.
Your problem does not require shorter tokens to be preferred (in that case MONTH would never be matched, since MILLI_OR_METRE's 'm' would always win). You need backtracking behaviour that depends on whether the rest of the text can be matched or not. Right?
ANTLR separates tokenization and parsing strictly. Consequently every solution to your problem will seem like a hack.
However other parser generators are specialized on problems like yours. Packrat Parsers (PEG) are backtracking and allow tokenization on the fly. Try out parboiled for this purpose.
It appears that the question is not being framed correctly.
I'm currently attempting to write a UCUM parser using ANTLR4. My current approach has involved defining every valid unit and prefix as a token.
But, according to the UCUM:
The expression syntax of The Unified Code for Units of Measure generates an infinite number of codes with the consequence that it is impossible to compile a table of all valid units.
The most to expect from the lexer is an unambiguous identification of the measurement string without regard to its semantic value. Similarly, a parser alone will be unable to validly select between unit sequences like MONTH LITRE and MOLE - both could reasonably apply to a leak rate - unless the problem space is statically constrained in the parser definition.
A heuristic, either structural (explicitly identifying the problem space) or contextual (considering the relative nature of other units in the problem space), is most likely required to select the correct unit interpretation.
The best tool to use is the one that puts you in the best position to implement the heuristics necessary to disambiguate the unit strings. Antlr could do it using parse-tree walkers. Whether that is the appropriate approach requires further analysis.

Which are the variable naming rules in Processing?

(Question by John Williams, from a Coursera forum, which I decided to share with the community, since I haven't been able to find this answered anywhere.)
The following code runs without error:
int _j = 1;
//int 2var = 2;
int var2 = 2;
int Kvar = 3; // first letter can be uppercase
int spec$var = 4;
int com_pound_var = 5; // compounding without camel case
int com$pound$two = 6;
int $var = 199;
println(_j);
println(var2);
println(Kvar);
println(spec$var);
println(com_pound_var);
println(com$pound$two);
println($var); //first character can be special
Since the compiler accepts _j, Kvar, and $var as valid variable names, it is clear that variable names do not need to start with a lowercase letter.
I was unable to locate the variable naming rules anywhere in the reference.
What are the variable naming rules for the Processing language?
Quick answer: identifiers can start with any letter, underscore or dollar sign, and continue with letters, digits, underscores and dollar signs. Details below.
I could also not find anything in the reference or the documentation at all. However, inspecting the source code, I found that Processing is not a language of its own, but rather a framework in which you run some commands. The difference is that you're actually writing a different language, and Processing just gives you some basic scaffolding that you build on top of.
For some technical details: Processing compiles a Java build with some flags, spins up a virtual machine (a Java VM, not the same thing as a full-fledged virtual machine) and connects to it to get input and output streams (this is why you can interact with the mouse or get the console output of your own program in a separate window). (Source.)
This language, which you may have guessed already, is Java.
With that said, the actual document that answers this question is the Java Language Specification, which, to simplify things, is as close as you can get to an answer. (But if you really want to know, it's a mess.)
Specifically, the section on Identifiers, which I'll sum up below:
Can start with any letter (A-Z, a-z), underscore (_), dollar sign ($), or any Unicode "letter" (accented, Chinese, etc. Details.)
Can continue with any of the above, and can also continue with digits (0-9). Can also contain other Unicode "letters" (Details.)
Can have unlimited length
Cannot be any Java keyword (list here)
Cannot be false, true, null
Two identifiers can look the same and still be different if their code points are different (some Unicode letters look just like ASCII letters but are different characters)
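For example, the following declarations also compile, since both names start with characters that count as Unicode letters:
int café = 7; // accented letter
int π = 3;    // Greek letter, also a Unicode "letter"
println(café + π);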
I hope this helps! Investigating was fun.

ParseKit: What built-in Productions should I use in my Grammars?

I just started using ParseKit to explore language creation and perhaps build a small toy DSL. However, the current SVN trunk from Google is throwing a -[PKToken intValue]: unrecognized selector sent to instance ... when parsing this grammar:
#start = identifier ;
identifier = (Letter | '_') | (letterOrDigit | '_') ;
letterOrDigit = Letter | Digit ;
Against this input:
foo
Clearly, I am missing something or have incorrectly configured my project. What can I do to fix this issue?
Developer of ParseKit here.
First, see the ParseKit Tokenization docs.
Basically, ParseKit can work in one of two modes: Let's call them Tokens Mode and Chars Mode. (There are no formal names for these two modes, but perhaps there should be.)
Tokens Mode is more popular by far. Virtually every example you will find of using ParseKit will show how to use Tokens Mode. I believe all of the documentation on http://parsekit.com uses Tokens Mode. ParseKit's grammar feature (which you are using in your example) only works in Tokens Mode.
Chars Mode is a very little-known feature of ParseKit. I've never had anyone ask about it before.
So the differences in the modes are:
In Tokens Mode, the ParseKit Tokenizer emits multi-char tokens (like Words, Symbols, Numbers, QuotedStrings etc) which are then parsed by the ParseKit parsers you create (programmatically or via grammars).
In Chars Mode, the ParseKit Tokenizer always emits single-char tokens which are then parsed by the ParseKit parsers you create programmatically. (grammars don't currently work with this mode as this mode is not popular).
You could use Chars Mode to implement Regular Expressions which parse on a char-by-char basis.
For your example, you should be ignoring Chars Mode and just use Tokens Mode. The following Built-in Productions are for Chars Mode only. Do not use them in your grammars:
(PK)Letter
(PK)Digit
(PK)Char
(PK)SpecificChar
Notice how all of those Productions sound like they match individual chars. That's because they do.
Your example above should probably look like:
#start = identifier;
identifier = Word; // by default Words start with a-zA-Z_ and contain -0-9a-zA-Z_'
Keep in mind the Productions in your grammars (parsers like identifier) will be working on Tokens already emitted from ParseKit's Tokenizer. Not individual chars.
IOW: by the time your grammar goes to work parsing input, the input has already been tokenized into Tokens of type Word, Number, Symbol, QuotedString, etc.
Here are all of the Built-in Productions available for use in your Grammar:
Word
Number
Symbol
QuotedString
Comment
Any
S // Whitespace. only available when #preservesWhitespaceTokens=YES. NO by default.
Also:
DelimitedString('start', 'end', 'allowedCharset')
/xxx/i // RegEx match
There are also operators for composite parsers:
// Sequence
| // Alternation
? // Optional
+ // Multiple
* // Repetition
~ // Negation
& // Intersection
- // Difference