How to tell if an identifier is being assigned or referenced? (FLEX/BISON) - grammar

So, I'm writing a language using flex/bison and I'm having difficulty with implementing identifiers, specifically when it comes to knowing when you're looking at an assignment or a reference,
for example:
1) A = 1+2
2) B + C (where B and C have already been assigned values)
Example one I can work out by returning an ID token from flex to bison, and just following a grammar that recognizes that 1+2 is an integer expression, putting A into the symbol table, and setting its value.
examples two and three are more difficult for me because: after going through my lexer, what's being returned in ex.2 to bison is "ID PLUS ID" -> I have a grammar that recognizes arithmetic expressions for numerical values, like INT PLUS INT (which would produce an INT), or DOUBLE MINUS INT (which would produce a DOUBLE). if I have "ID PLUS ID", how do I know what type the return value is?
Here's the best idea that I've come up with so far: When tokenizing, every time an ID comes up, I search for its value and type in the symbol table and switch out the ID token with its respective information; for example: while tokenizing, I come across B, which has a regex that matches it as being an ID. I look in my symbol table and see that it has a value of 51.2 and is a DOUBLE. So instead of returning ID, with a value of B to bison, I'm returning DOUBLE with a value of 51.2
I have two different solutions that contradict each other. Here's why: if I want to assign a value to an ID, I would say to my compiler A = 5. In this situation, if I'm using my previously described solution, What I'm going to get after everything is tokenized might be, INT ASGN INT, or STRING ASGN INT, etc... So, in this case, I would use the former solution, as opposed to the latter.
My question would be: what kind of logical device do I use to help my compiler know which solution to use?
NOTE: I didn't think it necessary to post source code to describe my conundrum, but I will if anyone could use it effectively as a reference to help me understand their input on this topic.
Thank you.

The usual way is to have a yacc/bison rule like:
expr: ID { $$ = lookupId($1); }
where the the lookupId function looks up a symbol in the symbol table and returns its type and value (or type and storage location if you're writing a compiler rather than a strict interpreter). Then, your other expr rules don't need to care whether their operands come from constants or symbols or other expressions:
expr: expr '+' expr { $$ = DoAddition($1, $3); }
The function DoAddition takes the types and values (or locations) for its two operands and either adds them, producing a result, or produces code to do the addition at run time.

If possible redesign your language so that the situation is unambiguous. This is why even Javascript has var.
Otherwise you're going to need to disambiguate via semantic rules, for example that the first use of an identifier is its declaration. I don't see what the problem is with your case (2): just generate the appropriate code. If B and C haven't been used yet, a value-reading use like this should be illegal, but that involves you in control flow analysis if taken to the Nth degree of accuracy, so you might prefer to assume initial values of zero.
In any case you can see that it's fundamentally a language design problem rather than a coding problem.

Related

Difficulty writing PEG recursive expression grammar with Arpeggio

My input text might have a simple statement like this:
aircraft
In my language I call this a name which represents a set of instances with various properties.
It yields an instance_set of all aircraft instances in this example.
I can apply a filter in parenthesis to any instance_set:
aircraft(Altitude < ceiling)
It yields another, possibly reduced instance_set.
And since it is an instance set, I can filter it yet again:
aircraft(Altitude < ceiling)(Speed > min_speed)
I foolishly thought I could do something like this in my grammar:
instance_set = expr
expr = source / instance_set
source = name filter?
It parses my first two cases correctly, but chokes on the last one:
aircraft(Altitude < ceiling)(Speed > min_speed)
The error reported being just before the second open paren.
Why doesn't Arpeggio see that there is just a filtered instance_set which is itself a filtered instance set?
I humbly submit my appeal to the peg parsing gods and await insight...
Your first two cases both match source. Once source is matched, it's matched; that's the PEG contract. So the parser isn't going to explore the alternative.
But suppose it did. How could that help? The rule says that if an expr is not a source, then it's an instance_set. But an instance_set is just an expr. In other words, an expr is either a source or it's an expr. Clearly the alternative doesn't get us anywhere.
I'm pretty sure Arpeggio has repetitions, which is what you really want here:
source = name filter*

What does the operator := mean? [duplicate]

I've seen := used in several code samples, but never with an accompanying explanation. It's not exactly possible to google its use without knowing the proper name for it.
What does it do?
http://en.wikipedia.org/wiki/Equals_sign#In_computer_programming
In computer programming languages, the equals sign typically denotes either a boolean operator to test equality of values (e.g. as in Pascal or Eiffel), which is consistent with the symbol's usage in mathematics, or an assignment operator (e.g. as in C-like languages). Languages making the former choice often use a colon-equals (:=) or ≔ to denote their assignment operator. Languages making the latter choice often use a double equals sign (==) to denote their boolean equality operator.
Note: I found this by searching for colon equals operator
It's the assignment operator in Pascal and is often used in proofs and pseudo-code. It's the same thing as = in C-dialect languages.
Historically, computer science papers used = for equality comparisons and ← for assignments. Pascal used := to stand in for the hard-to-type left arrow. C went a different direction and instead decided on the = and == operators.
In the statically typed language Go := is initialization and assignment in one step. It is done to allow for interpreted-like creation of variables in a compiled language.
// Creates and assigns
answer := 42
// Creates and assigns
var answer = 42
Another interpretation from outside the world of programming languages comes from Wolfram Mathworld, et al:
If A and B are equal by definition (i.e., A is defined as B), then this is written symbolically as A=B, A:=B, or sometimes A≜B.
■ http://mathworld.wolfram.com/Defined.html
■ https://math.stackexchange.com/questions/182101/appropriate-notation-equiv-versus
Some language uses := to act as the assignment operator.
In a lot of CS books, it's used as the assignment operator, to differentiate from the equality operator =. In a lot of high level languages, though, assignment is = and equality is ==.
This is old (pascal) syntax for the assignment operator. It would be used like so:
a := 45;
It may be in other languages as well, probably in a similar use.
A number of programming languages, most notably Pascal and Ada, use a colon immediately followed by an equals sign (:=) as the assignment operator, to distinguish it from a single equals which is an equality test (C instead used a single equals as assignment, and a double equals as the equality test).
Reference: Colon (punctuation).
In Python:
Named Expressions (NAME := expr) was introduced in Python 3.8. It allows for the assignment of variables within an expression that is currently being evaluated. The colon equals operator := is sometimes called the walrus operator because, well, it looks like a walrus emoticon.
For example:
if any((comment := line).startswith('#') for line in lines):
print(f"First comment: {comment}")
else:
print("There are no comments")
This would be invalid if you swapped the := for =. Note the additional parentheses surrounding the named expression. Another example:
# Compute partial sums in a list comprehension
total = 0
values = [1, 2, 3, 4, 5]
partial_sums = [total := total + v for v in values]
# [1, 3, 6, 10, 15]
print(f"Total: {total}") # Total: 15
Note that the variable total is not local to the comprehension (so too is comment from the first example). The NAME in a named expression cannot be a local variable within an expression, so, for example, [i := 0 for i, j in stuff] would be invalid, because i is local to the list comprehension.
I've taken examples from the PEP 572 document - it's a good read! I for one am looking forward to using Named Expressions, once my company upgrades from Python 3.6. Hope this was helpful!
Sources: Towards Data Science Article and PEP 572.
It's like an arrow without using a less-than symbol <= so like everybody already said "assignment" operator. Bringing clarity to what is being set to where as opposed to the logical operator of equivalence.
In Mathematics it is like equals but A := B means A is defined as B, a triple bar equals can be used to say it's similar and equal by definition but not always the same thing.
Anyway I point to these other references that were probably in the minds of those that invented it, but it's really just that plane equals and less that equals were taken (or potentially easily confused with =<) and something new to define assignment was needed and that made the most sense.
Historical References: I first saw this in SmallTalk the original Object Language, of which SJ of Apple only copied the Windows part of and BG of Microsoft watered down from them further (single threaded). Eventually SJ in NeXT took the second more important lesson from Xerox PARC in, which became Objective C.
Well anyway they just took colon-equals assiment operator from ALGOL 1958 which was later popularized by Pascal
https://en.wikipedia.org/wiki/PARC_(company)
https://en.wikipedia.org/wiki/Assignment_(computer_science)
Assignments typically allow a variable to hold different values at
different times during its life-span and scope. However, some
languages (primarily strictly functional) do not allow that kind of
"destructive" reassignment, as it might imply changes of non-local
state.
The purpose is to enforce referential transparency, i.e. functions
that do not depend on the state of some variable(s), but produce the
same results for a given set of parametric inputs at any point in
time.
https://en.wikipedia.org/wiki/Referential_transparency
For VB.net,
a constructor (for this case, Me = this in Java):
Public ABC(int A, int B, int C){
Me.A = A;
Me.B = B;
Me.C = C;
}
when you create that object:
new ABC(C:=1, A:=2, B:=3)
Then, regardless of the order of the parameters, that ABC object has A=2, B=3, C=1
So, ya, very good practice for others to read your code effectively
Colon-equals was used in Algol and its descendants such as Pascal and Ada because it is as close as ASCII gets to a left-arrow symbol.
The strange convention of using equals for assignment and double-equals for comparison was started with the C language.
In Prolog, there is no distinction between assignment and the equality test.

ANTLR recognize single character

I'm pretty sure this isn't possible, but I want to ask just in case.
I have the common ID token definition:
ID: LETTER (LETTER | DIG)*;
The problem is that in the grammar I need to parse, there are some instructions in which you have a single character as operand, like:
a + 4
but
ab + 4
is not possible.
So I can't write a rule like:
sum: (INT | LETTER) ('+' (INT | LETTER))*
Because the lexer will consider 'a' as an ID, due to the higher priority of ID. (And I can't change that priority because it wouldn't recognize single character IDs then)
So I can only use ID instead of LETTER in that rule. It's ugly because there shouldn't be an ID, just a single letter, and I will have to do a second syntactic analysis to check that.
I know that there's nothing to do about it, since the lexer doesn't understand about context. What I'm thinking that maybe there's already built-in ANTLR4 is some kind of way to check the token's length inside the rule. Something like:
sum: (INT | ID{length=1})...
I would also like to know if there are some kind of "token alias" so I can do:
SINGLE_CHAR is alias of => ID
In order to avoid writing "ID" in the rule, since that can be confusing.
PD: I'm not parsing a simple language like this one, this is just a little example. In reality, an ID could also be a string, there are other tokens which can only be a subset of letters, etc... So I think I will have to do that second analysis anyways after parsing the entry to check that syntactically is legal. I'm just curious if something like this exists.
Checking the size of an identifier is a semantic problem and should hence be handled in the semantic phase, which usually follows the parsing step. Parse your input with the usual ID rule and check in the constructed parse tree the size of the recognized ids (and act accordingly). Don't try to force this kind of decision into your grammar.

Generating Random String of Numbers and Letters Using Go's "testing/quick" Package

I've been breaking my head over this for a few days now and can't seem to be able to figure it out. Perhaps it's glaringly obvious, but I don't seem to be able to spot it. I've read up on all the basics of unicode, UTF-8, UTF-16, normalisation, etc, but to no avail. Hopefully somebody's able to help me out here...
I'm using Go's Value function from the testing/quick package to generate random values for the fields in my data structs, in order to implement the Generator interface for the structs in question. Specifically, given a Metadata struct, I've defined the implementation as follows:
func (m *Metadata) Generate(r *rand.Rand, size int) (value reflect.Value) {
value = reflect.ValueOf(m).Elem()
for i := 0; i < value.NumField(); i++ {
if t, ok := quick.Value(value.Field(i).Type(), r); ok {
value.Field(i).Set(t)
}
}
return
}
Now, in doing so, I'll end up with both the receiver and the return value being set with random generated values of the appropriate type (strings, ints, etc. in the receiver and reflect.Value in the returned reflect.Value).
Now, the implementation for the Value function states that it will return something of type []rune converted to type string. As far as I know, this should allow me to then use the functions in the runes, unicode and norm packages to define a filter which filters out everything which is not part of 'Latin', 'Letter' or 'Number'. I defined the following filter which uses a transform to filter out letters which are not in those character rangetables (as defined in the unicode package):
func runefilter(in reflect.Value) (out reflect.Value) {
out = in // Make sure you return something
if in.Kind() == reflect.String {
instr := in.String()
t := transform.Chain(norm.NFD, runes.Remove(runes.NotIn(rangetable.Merge(unicode.Letter, unicode.Latin, unicode.Number))), norm.NFC)
outstr, _, _ := transform.String(t, instr)
out = reflect.ValueOf(outstr)
}
return
}
Now, I think I've tried just about anything, but I keep ending up with a series of strings which are far from the Latin range, e.g.:
𥗉똿穊
𢷽嚶
秓䝏小𪖹䮋
𪿝ท솲
𡉪䂾
ʋ𥅮ᦸ
堮𡹯憨𥗼𧵕ꥆ
𢝌𐑮𧍛併怃𥊇
鯮
𣏲𝐒
⓿ꐠ槹𬠂黟
𢼭踁퓺𪇖
俇𣄃𔘧
𢝶
𝖸쩈𤫐𢬿詢𬄙
𫱘𨆟𑊙
欓
So, can anybody explain what I'm overlooking here and how I could instead define a transformer which removes/replaces non-letter/number/latin characters so that I can use the Value function as intended (but with a smaller subset of 'random' characters)?
Thanks!
Confusingly the Generate interface needs a function using the type not a the pointer to the type. You want your type signature to look like
func (m Metadata) Generate(r *rand.Rand, size int) (value reflect.Value)
You can play with this here. Note: the most important thing to do in that playground is to switch the type of the generate function from m Metadata to m *Metadata and see that Hi Mom! never prints.
In addition, I think you would be better served using your own type and writing a generate method for that type using a list of all of the characters you want to use. For example:
type LatinString string
const latin = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz01233456789"
and then use the generator
func (l LatinString) Generate(rand *rand.Rand, size int) reflect.Value {
var buffer bytes.Buffer
for i := 0; i < size; i++ {
buffer.WriteString(string(latin[rand.Intn(len(latin))]))
}
s := LatinString(buffer.String())
return reflect.ValueOf(s)
}
playground
Edit: also this library is pretty cool, thanks for showing it to me
The answer to my own question is, it seems, a combination of the answers provided in the comments by #nj_ and #jimb and the answer provided by #benjaminkadish.
In short, the answer boils down to:
"Not such a great idea as you thought it was", or "Bit of an ill-posed question"
"You were using the union of 'Letter', 'Latin' and 'Number' (Letter || Number || Latin), instead of the intersection of 'Latin' with the union of 'Letter' and 'Number' ((Letter || Number) && Latin))
Now for the longer version...
The idea behind me using the testing/quick package is that I wanted random data for (fuzzy) testing of my code. In the past, I've always written the code for doing things like that myself, again and again. This meant a lot of the same code across different projects. Now, I could of course written my own package for it, but it turns out that, even better than that, there's actually a standard package which does just about exactly what I want.
Now, it turns out the package does exactly what I want very well. The codepoints in the strings which it generates are actually random and not just restricted to what we're accustomed to using in everyday life. Now, this is of course exactly the thing which you want in doing fuzzy testing in order to test the code with values outside the usual assumptions.
In practice, that means I'm running into two problems:
There's some limits on what I would consider reasonable input for a string. Meaning that, in testing the processing of a Name field or a URL field, I can reasonably assume there's not going to be a value like 'James Mc⌢' (let alone 'James Mc🙁') or 'www.🕸site.com', but just 'James McFrown' and 'www.website.com'. Hence, I can't expect a reasonable system to be able to support it. Of course, things shouldn't completely break down, but it also can't be expected to handle the former examples without any problems.
When I filter the generated string on values which one might consider reasonable, the chance of ending up with a valid string is very small. The set of possible characters in the set used by the testing/quick is just so large (0x10FFFF) and the set of reasonable characters so small, you end up with empty strings most of the time.
So, what do we need to take away from this?
So, whilst I hoped to use the standard testing/quick package to replace my often repeated code to generate random data for fuzzy testing, it does this so well that it provides data outside the range of what I would consider reasonable for the code to be able to handle. It seems that the choice, in the end, is to:
Either be able to actually handle all fuzzy options, meaning that if somebody's name is 'Arnold 💰💰' ('Arnold Moneybags'), it shouldn't go arse over end. Or...
Use custom/derived types with their own Generator. This means you're going to have to use the derived type instead of the basic type throughout the code. (Comparable to defining a string as wchar_t instead of char in C++ and working with those by default.). Or...
Don't use testing/quick for fuzzy testing, because as soon as you run into a generated string value, you can (and should) get a very random string.
As always, further comments are of course welcome, as it's quite possible I overlooked something.

How would you structure a spreadsheet app in elm?

I've been looking at elm and I really enjoy learning the language. I've been thinking about doing a spreadsheet application, but i can't wrap my head how it would be structured.
Let's say we have three cells; A, B and C.
If I enter 4 in cell A and =A in cell B how would i get cell B to always equal cell A? If i then enter =A+B in cell C, can that be evaluated to 8, and also be updated when A or B changes?
Not sure how to lever Signals for such dynamic behavior..
Regards Oskar
First you need to decide how to represent your spreadsheet grid. If you come from a C background, you may want to use a 2D array, but I've found that a dictionary actually works better in Elm. So you can define type alias Grid a = Dict (Int, Int) a.
As for the a, what each cell holds... this is an opportunity to define a domain-specific language. So something like
type Expr = Lit Float | Ref (Int, Int) | Op2 (Float -> Float -> Float) Expr Expr
This means an expression is either a literal float, a reference to another cell location, or an operator. An operator can be any function on two floats, and two other expressions which get recursively evaluated. Depending on what you're going for, you can instead define specific tags for each operation, like Plus Expr Expr | Times Expr Expr, or you can add extra opN tags for operations of different arity (like negate).
So then you might define type alias Spreadsheet = Grid Expr, and if you want to alias (Int, Int) to something, that might help too. I'm also assuming you only want floats in your spreadsheet.
Now you need functions to convert strings to expressions and back. The traditional names for these functions are parse and eval.
parse : String -> Maybe Expr -- Result can also work
eval : Spreadsheet -> Grid Float
evalOne : Expr -> Spreadsheet -> Maybe Float
Parse will be a little tricky; the String module is your friend. Eval will involve chasing references through the spreadsheet and filling in the results, recursively. At first you'll want to ignore the possibility of catching infinite loops. Also, this is just a sketch, if you find that different type signatures work better, use them.
As for the view, I'd start with read-only, so you can verify hard-coded spreadsheets are evaluated properly. Then you can worry about editing, with the idea being that you just rerun the parser and evaluator and get a new spreadsheet to render. It should work because a spreadsheet has no state other than the contents of each cell. (Minimizing the recomputed work is one of many different ways you can extend this.) If you're using elm-html, table elements ought to be fine.
Hope this sets you off in the right direction. This is an ambitious project and I'd love to see it when you're done (post it to the mailing list). Good luck!