Barewords can be used on the left-hand side of Pair declarations (this is not documented yet; I'm addressing that right now, but I want to get everything right). However, I have not found anywhere a description of what is and is not considered a bareword key.
This seems to work:
say (foo'bar-baz => 3); # OUTPUT: «foo'bar-baz => 3»
This does not:
say (foo-3 => 3); # OUTPUT: «(exit code 1) ===SORRY!=== Error while compiling /tmp/jorTNuKH9V Undeclared routine: foo used at line 1»
So it apparently follows the same syntax as the ordinary identifiers. Is that correct? Am I missing something here?
There are no barewords in Perl 6 in the sense that they exist in Perl 5, and the term isn't used in Perl 6 at all.
There are two cases that we might call a "bare identifier":
1. An identifier immediately followed by zero or more horizontal whitespace characters (\h*), followed by the characters =>. This takes the identifier on the left as a pair key, and the term parsed after the => as a pair value. This is an entirely syntactic decision; the existence of, for example, a sub or type with that identifier will not have any influence.
2. An identifier followed by whitespace (or some other statement separator or terminator). If there is already a type of that name, then it is compiled into a reference to the type object. Otherwise, it will always be taken as a sub call. If no sub declaration of that name exists yet, it will be considered a call to a post-declared sub, and an error produced at CHECK-time if a sub with that name isn't later declared.
These two cases are only related in the sense that they are both cases of terms in the Perl 6 grammar, and that they both parse an identifier, which follows the standard rules linked in the question. Which wins is determined by Longest Token Matching semantics; the restriction that there may only be horizontal whitespace between the identifier and => exists to make sure that the identifier, whitespace, and => will together be counted as the declarative prefix, and so case 1 will always win over case 2.
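The two examples from the question can be reproduced with a rough model of the first case. The following is only a sketch in Python, not how the Perl 6 grammar is implemented: the IDENT pattern is an ASCII-only approximation of the identifier rules (an apostrophe or hyphen is accepted only when a letter or underscore follows it), and the names IDENT, PAIR_KEY and pair_key are made up for illustration.

```python
import re

# Approximate identifier: starts with a letter or underscore, continues with
# word characters; an apostrophe or hyphen is allowed only when it is
# immediately followed by another letter or underscore.
# ASCII-only for brevity (assumption); the real rule is Unicode-aware.
IDENT = r"[A-Za-z_][A-Za-z0-9_]*(?:['\-][A-Za-z_][A-Za-z0-9_]*)*"

# Case 1: identifier, optional horizontal whitespace, then "=>".
PAIR_KEY = re.compile(rf"({IDENT})[ \t]*=>")

def pair_key(source: str):
    """Return the pair key if `source` starts with `identifier =>`, else None."""
    m = PAIR_KEY.match(source)
    return m.group(1) if m else None

print(pair_key("foo'bar-baz => 3"))  # foo'bar-baz
print(pair_key("foo-3 => 3"))        # None
```

In the second call the identifier stops at foo, because the hyphen is followed by a digit; what comes next is -3 rather than =>, so the pair-key form does not apply and foo is left to be treated as a sub call, which matches the error shown in the question.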
I can't find even a couple of words about using digits in the names of variables or methods. Does anyone have any authoritative information about cases such as:
string2map
its4me
etc...
That is, using a digit as a word rather than as a number.
Is it acceptable or not, silly or professional? Please justify your opinion.
I haven't found any information either but below are my own thoughts.
Using a digit in an identifier which happens 2 be pronounced in the same way as a word is just silly word play. It also makes the meaning of the identifier ambiguous - does char2old mean that a character is too old, is it an old version of char2 or is it a conversion? It's fun however to come up with names like a10sorFlow, the2lbox, my4mula but they are best avoided.
When it comes to using numbers 1 to N at the end of equally named identifiers, it is probably better to use an array instead if N > 2. Also, when N = 2 there are often clearer names that can be used, like leftCircle and rightCircle instead of circle1 and circle2, or currentChar and nextChar instead of char1 and char2.
Here is a good general guide for naming variables:
Identifier kind | Word class | Example
Boolean variable or pure function | Last word is an adjective | doorClosed, TablePrepared
Non-boolean variable or pure function | Last word is a noun | closedDoor, PreparedTable
Non-pure function (has side-effects) | First word is a verb | CloseDoor, PrepareTable
I'm pretty sure this isn't possible, but I want to ask just in case.
I have the common ID token definition:
ID: LETTER (LETTER | DIG)*;
The problem is that in the grammar I need to parse, there are some instructions in which you have a single character as operand, like:
a + 4
but
ab + 4
is not possible.
So I can't write a rule like:
sum: (INT | LETTER) ('+' (INT | LETTER))*
Because the lexer will consider 'a' as an ID, due to the higher priority of ID. (And I can't change that priority because it wouldn't recognize single character IDs then)
So I can only use ID instead of LETTER in that rule. It's ugly because there shouldn't be an ID, just a single letter, and I will have to do a second syntactic analysis to check that.
I know that there's nothing to be done about it, since the lexer has no notion of context. What I'm wondering is whether ANTLR4 already has some built-in way to check the token's length inside the rule. Something like:
sum: (INT | ID{length=1})...
I would also like to know if there is some kind of "token alias" so I can do:
SINGLE_CHAR is alias of => ID
In order to avoid writing "ID" in the rule, since that can be confusing.
P.S.: I'm not parsing a simple language like this one; this is just a small example. In reality, an ID could also be a string, there are other tokens which can only be a subset of letters, etc. So I think I will have to do that second analysis anyway after parsing the input, to check that it is legal. I'm just curious whether something like this exists.
Checking the size of an identifier is a semantic problem and should hence be handled in the semantic phase, which usually follows the parsing step. Parse your input with the usual ID rule and check in the constructed parse tree the size of the recognized ids (and act accordingly). Don't try to force this kind of decision into your grammar.
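A minimal sketch of such a post-parse check, written in Python rather than against the ANTLR4 API: the tokenizer below is a stand-in for whatever lexer and parser you generate, and the names tokenize and check_single_letter_operands are hypothetical.

```python
import re

# Stand-in lexer: INT, ID (LETTER (LETTER | DIG)*) and '+', as in the question.
TOKEN = re.compile(r"\s*(?:(?P<INT>\d+)|(?P<ID>[A-Za-z][A-Za-z0-9]*)|(?P<PLUS>\+))")

def tokenize(text):
    """Split the input into (kind, text) pairs using the ordinary ID rule."""
    text = text.rstrip()
    pos, tokens = 0, []
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append((m.lastgroup, m.group(m.lastgroup)))
        pos = m.end()
    return tokens

def check_single_letter_operands(tokens):
    """Semantic pass: return every ID operand that is not exactly one character long."""
    return [text for kind, text in tokens if kind == "ID" and len(text) != 1]

print(check_single_letter_operands(tokenize("a + 4")))   # []
print(check_single_letter_operands(tokenize("ab + 4")))  # ['ab']
```

The grammar stays simple (it only knows about ID), and the length restriction lives in one place where it can produce a precise error message.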
So, I'm writing a language using flex/bison and I'm having difficulty with implementing identifiers, specifically when it comes to knowing when you're looking at an assignment or a reference,
for example:
1) A = 1+2
2) B + C (where B and C have already been assigned values)
Example one I can work out by returning an ID token from flex to bison, and just following a grammar that recognizes that 1+2 is an integer expression, putting A into the symbol table, and setting its value.
Examples two and three are more difficult for me because, after going through my lexer, what's being returned in ex. 2 to bison is "ID PLUS ID". I have a grammar that recognizes arithmetic expressions for numerical values, like INT PLUS INT (which would produce an INT), or DOUBLE MINUS INT (which would produce a DOUBLE). If I have "ID PLUS ID", how do I know what type the return value is?
Here's the best idea that I've come up with so far: When tokenizing, every time an ID comes up, I search for its value and type in the symbol table and switch out the ID token with its respective information; for example: while tokenizing, I come across B, which has a regex that matches it as being an ID. I look in my symbol table and see that it has a value of 51.2 and is a DOUBLE. So instead of returning ID, with a value of B to bison, I'm returning DOUBLE with a value of 51.2
I have two different solutions that contradict each other. Here's why: if I want to assign a value to an ID, I would say to my compiler A = 5. In this situation, if I'm using my previously described solution, what I'm going to get after everything is tokenized might be INT ASGN INT, or STRING ASGN INT, etc. So, in this case, I would use the former solution, as opposed to the latter.
My question would be: what kind of logical device do I use to help my compiler know which solution to use?
NOTE: I didn't think it necessary to post source code to describe my conundrum, but I will if anyone could use it effectively as a reference to help me understand their input on this topic.
Thank you.
The usual way is to have a yacc/bison rule like:
expr: ID { $$ = lookupId($1); }
where the lookupId function looks up a symbol in the symbol table and returns its type and value (or type and storage location if you're writing a compiler rather than a strict interpreter). Then, your other expr rules don't need to care whether their operands come from constants or symbols or other expressions:
expr: expr '+' expr { $$ = DoAddition($1, $3); }
The function DoAddition takes the types and values (or locations) for its two operands and either adds them, producing a result, or produces code to do the addition at run time.
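To make the idea concrete, here is a small Python sketch of what lookupId and DoAddition might do in an interpreter. The Value class, the contents of the symbol table, and the promotion rule (INT plus DOUBLE gives DOUBLE) are assumptions for illustration; in a real bison grammar these would be C functions operating on your semantic value type.

```python
from dataclasses import dataclass

@dataclass
class Value:
    type: str     # "INT" or "DOUBLE"
    value: float  # holds an int or a float at run time

# Hypothetical symbol table, filled in by earlier assignments.
symbols = {"B": Value("DOUBLE", 51.2), "C": Value("INT", 3)}

def lookup_id(name):
    """Return the type and value bound to `name`, as the ID rule would."""
    try:
        return symbols[name]
    except KeyError:
        raise NameError(f"undefined identifier: {name}") from None

def do_addition(left, right):
    """Add two typed values; the result is DOUBLE if either operand is DOUBLE."""
    result_type = "DOUBLE" if "DOUBLE" in (left.type, right.type) else "INT"
    return Value(result_type, left.value + right.value)

# B + C: both operands are resolved by lookup_id before the addition rule runs.
result = do_addition(lookup_id("B"), lookup_id("C"))
print(result.type)  # DOUBLE
```

The addition code never sees an ID token: by the time it runs, each operand already carries its type, whether it came from a literal, a variable, or a nested expression.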
If possible, redesign your language so that the situation is unambiguous. This is why even JavaScript has var.
Otherwise you're going to need to disambiguate via semantic rules, for example that the first use of an identifier is its declaration. I don't see what the problem is with your case (2): just generate the appropriate code. If B and C haven't been used yet, a value-reading use like this should be illegal, but that involves you in control flow analysis if taken to the Nth degree of accuracy, so you might prefer to assume initial values of zero.
In any case you can see that it's fundamentally a language design problem rather than a coding problem.
Related: Why can't variable names start with numbers?
Is there a technical reason why spaces aren't allowed in variable names or is it down to convention?
For example, what's stopping us from doing something like this?:
average score = sum of scores / number of scores
The only issue that comes to mind is keywords, but one could simply restrict the use of them in a variable name, and the lexer would be able to distinguish between part of a variable and a keyword.
There’s no fundamental reason, apart from the decisions of language designers and a history of single-token identifiers. Some languages in fact do allow multi-token identifiers: MultiMedia Fusion’s expression language, some Mac spreadsheet/notebook software whose name escapes me, and I’m sure there are others. There are several considerations that make the problem nontrivial, though.
Presuming the language is free-form, you need a canonical representation, so that an identifier like account name is treated the same regardless of whitespace. A compiler would probably need to use some mangling convention to please a linker. Then you have to consider the effect of that on foreign exports, which is why C++ has the extern "C" linkage specifier to disable mangling.
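As a rough illustration of the canonical-representation point, here is a Python sketch. Both function names and the underscore-based mangling scheme are invented for this example; a real compiler would pick whatever convention its linker and ABI require.

```python
def canonical_identifier(name: str) -> str:
    """Collapse every run of whitespace to one space, so spelling variants compare equal."""
    return " ".join(name.split())

def mangle(name: str) -> str:
    """Hypothetical mangling: join the words with underscores to get a linker-friendly symbol."""
    return "_".join(name.split())

print(canonical_identifier("account   name") == canonical_identifier("account\tname"))  # True
print(mangle("account   name"))  # account_name
```

Note that this particular mangling would collide with a user-written account_name, which is one reason real schemes tend to be less readable.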
Keywords are an issue, as you have seen. Most C-family languages have a lexical class of keywords distinct from identifiers, which are not context-sensitive. You cannot name a variable class in C++. This can be solved by disallowing keywords in multi-token identifiers:
if account age < 13 then child account = true;
Here, if and then cannot be part of an identifier, so there is no ambiguity with account age and child account. Alternatively, you can require punctuation everywhere:
if (account age < 13) {
    child account = true;
}
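The keyword-exclusion approach can be sketched with a toy lexer that glues consecutive non-keyword words into a single identifier token. This is only an illustration in Python under assumed rules: the keyword set, the token kinds, and the lex function are all made up here.

```python
import re

# Assumed keyword set; none of these may appear inside an identifier.
KEYWORDS = {"if", "then", "else", "true", "false"}
WORD = r"[A-Za-z_]\w*"

def lex(source):
    """Tokenize `source`, merging runs of non-keyword words into one ID token."""
    tokens, words = [], []

    def flush():
        # Emit the pending multi-word identifier, if any.
        if words:
            tokens.append(("ID", " ".join(words)))
            words.clear()

    for piece in re.findall(rf"{WORD}|\d+|[^\s\w]", source):
        if re.fullmatch(WORD, piece) and piece not in KEYWORDS:
            words.append(piece)
            continue
        flush()
        if piece in KEYWORDS:
            kind = "KW"
        elif piece.isdigit():
            kind = "INT"
        else:
            kind = "OP"
        tokens.append((kind, piece))
    flush()
    return tokens

for token in lex("if account age < 13 then child account = true;"):
    print(token)
```

Because keywords, numbers and punctuation each end the pending identifier, account age and child account come out as single ID tokens without any lookahead.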
The last option is to make keywords pervasively context-sensitive, leading to such monstrosities as:
IF IF = THEN THEN ELSE = THEN ELSE THEN = ELSE
The biggest issue is that juxtaposition is an extremely powerful syntactic construct, and you don’t want to occupy it lightly. Allowing multi-token identifiers prevents using juxtaposition for another purpose, such as function application or composition. Far better, I think, just to allow most nonwhitespace characters and thereby permit such identifiers as canonical-venomous-frobnicator. Still plenty readable but with fewer opportunities for ambiguity.
I think it is because the designers of these languages have followed this convention.
I searched on Google and found that this is a rule generally followed when naming variables.
Some links are given below:
SPSS notes
The following rules apply to variable names:
Variable names cannot contain spaces.
C Programming/Variables
Variable names by IBM
Java Variable Naming convention
Variable names are case-sensitive. A variable's name can be any legal
identifier — an unlimited-length sequence of Unicode letters and
digits, beginning with a letter, the dollar sign "$", or the
underscore character "_". The convention, however, is to always begin
your variable names with a letter, not "$" or "_". Additionally, the
dollar sign character, by convention, is never used at all. You may
find some situations where auto-generated names will contain the
dollar sign, but your variable names should always avoid using it. A
similar convention exists for the underscore character; while it's
technically legal to begin your variable's name with "_", this
practice is discouraged. White space is not permitted.
Wiki for Naming Convention
In all of the above links you will find that the designers have followed this naming convention for naming the variable.
Also check Is there any language that allows spaces in its variable names
This is imposed by the language design.
The compiler needs to work out the meaning of each word.
The compiler works as a "state machine", and it needs to distinguish keywords.
Maybe placing variable names between "[" and "]" would give us a solution (as in SQL).
But that would make them harder to use when coding...
I recently read part of RFC 6570: https://www.rfc-editor.org/rfc/rfc6570#section-1
And I found the following URI template examples:
GIVEN :
var="value";
x=1024;
path=/foo/bar;
{/var,x}/here /value/1024/here
{#path,x}/here #/foo/bar,1024/here
These seem contradictory.
In the first one, it appears that the / replaces the ,.
In the second one, it appears that the , is kept.
Thus, I'm wondering whether there are inconsistencies in this particular RFC. I'm new to RFCs, so maybe I don't fully understand the culture behind how they are developed.
There's no contradiction in those two examples. They illustrate the point that the rules for expanding an expression whose first character is / are different from the rules for expanding an expression whose first character is #. These alternative expansion rules are pretty much the entire point of having a variety of different magic leading characters -- which are called operators in the RFC.
The expression with the leading / is expanded according to a rule that says "each variable in the expression is replaced by its value, preceded by a / character". (I'm paraphrasing the real rule, which is described in section 3.2.6 of that RFC.) The expression with the leading # is expanded according to a rule that says "each variable in the expression is replaced by its value, with the first variable preceded by a # and subsequent variables preceded by a ,". (Again paraphrased; see section 3.2.4 for the real rule.)
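The two paraphrased rules can be tried out with a small Python sketch. It is deliberately incomplete and is not a conforming RFC 6570 implementation: it handles only the / and # operators with plain string variables, ignores modifiers such as :3 and *, and the OPERATORS table and expand function are names chosen for this example.

```python
import re
from urllib.parse import quote

# Characters the RFC calls "reserved"; the # operator leaves them unencoded.
RESERVED = ":/?#[]@!$&'()*+,;="

# operator -> (prefix before the first value, separator between values, extra safe characters)
OPERATORS = {
    "/": ("/", "/", ""),        # path segments: every value is preceded by "/"
    "#": ("#", ",", RESERVED),  # fragment: "#" once, then "," between values
}

def expand(template, variables):
    """Expand {/a,b} and {#a,b} expressions in `template`; undefined variables are skipped."""
    def replace(match):
        op, names = match.group(1), match.group(2).split(",")
        first, sep, safe = OPERATORS[op]
        values = [quote(str(variables[n]), safe=safe) for n in names if n in variables]
        return first + sep.join(values) if values else ""
    return re.sub(r"\{([/#])([^}]*)\}", replace, template)

variables = {"var": "value", "x": "1024", "path": "/foo/bar"}
print(expand("{/var,x}/here", variables))   # /value/1024/here
print(expand("{#path,x}/here", variables))  # #/foo/bar,1024/here
```

The only things that differ between the two operators are the prefix, the separator, and which characters are percent-encoded, which is exactly the difference the two examples in the question show.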