What's the purpose of #[datatype] constructor in Rebol - rebol

I just came across this syntax in Rebol to construct some values:
>> #[email! "me#host.com"]
== me#host.com
This seems to be equivalent to
>> to email! "me#host.com"
== me#host.com
and this
>> #[string! "hello"]
== "hello"
While these error out:
>> #[integer! 1]
** Syntax Error: Invalid construct -- #[
** Near: (line 1) #[integer! 1]
>> #[decimal! 1]
** Syntax Error: Invalid construct -- #[
** Near: (line 1) #[decimal! 1]
>> #[string! 1]
== [string! 1]
I wonder what's this for? what benefits does it bring?

This syntactic structure is called "construction syntax" (also see CureCode issue #1955). Its raison d'être is to allow literal forms for values which could not otherwise be directly represented.
There are two main classes of situations that require construction syntax.
Values which (except for construction syntax) have no separate lexical form, but are only constructed through evaluation.
Values with a regular literal form, but where a particular value is "special" in as far as the value no longer can be written directly.
(1) Construction Syntax for Direct Value Representation
One prominent example for this class is object!. Rebol objects are typically created with code such as make object! [a: 42]. This is not a direct literal representation of the resulting object value, but rather code which, when evaluated, (in the DO dialect) creates the expected object value. Construction syntax allows for a direct representation of the value: #[object! [a: 42]].
Other typical examples are #[none!] and the somewhat irregular (construction-syntax-wise) #[true] and #[false] (note the missing !). Wrapping one's head around the difference between e.g. #[none!] and none will lead to a deeper understanding of Rebol semantics (and is thus left as an exercise for the reader).
(2) Construction Syntax as "Escape Mechanism"
A typical example for this case is a reversed URL, such as the value you get from reverse http://stackoverflow.com. If the resulting value would be serialised as a plain URL!, the serialised form will no longer be syntactically valid:
>> /moc.wolfrevokcats//:ptth
** Script error: // does not allow unset! for its value2 argument
Construction syntax provides an escape mechanism for this situation:
>> #[url! "/moc.wolfrevokcats//:ptth"] = reverse http://stackoverflow.com/
== true
This particular example also points to a problematic situation:
>> reverse http://stackoverflow.com/
== /moc.wolfrevokcats//:ptth
Arguably, this output (produced by the mold native) is wrong: the result as displayed is not a valid lexical representation of the corresponding value.
There's a few bugs tracking particular instances of this class of problems (such as CureCode issue #2010 for URL!). One proposal on the table towards a more general solution is to have a round-trip check built into some variant of MOLD: whenever LOAD-ing the result of a non-construction-syntax MOLD results in value different from the original value, this MOLD variant would fall back to serialising to construction syntax.

Related

Smalltalk: what's the difference between "&" and "and:"

I just started learning Smalltalk at uni.
I would like to know the difference between the messages & and and: (just like | and or:)
Welcome to Smalltalk! You are asking a good question that points out some subtle differences in expressions that are outside or inside blocks.
The '&' message is a "binary" message so takes only one argument. In this case the argument is expected to be a Boolean or an expression that evaluates to a Boolean. Most importantly for our purposes in this comparison, the expression will always be evaluated.
The 'and:' message is a "keyword" message and also takes only one argument. But in this case the argument is expected to be a Block that if later sent the 'value' message would return a Boolean. So, for our purposes in this comparison, the Block will not always be evaluated. (The difference between the binary and keyword messages does not matter for this question.)
The option to not evaluate the argument allows Short-circuit evaluation in which knowing the left-hand side often allows us to avoid a (possibly expensive) calculation of the right-hand side. So, if the left ("receiver") is false, there is nothing to be gained by evaluating the right-hand side.
So, consider the following:
self cheapFalse & self expensiveTrue. "an expensive way to get false"
self cheapFalse and: [self expensiveTrue]. "a cheap way to get false"
One obvious difference is that and: is a keyword message thus you can write something like
a > b and: [b < c]
while with & you would need parentheses in this expression:
a > b & (b < c)
because all the binary messages (+, -, >, <, &, |) have the same precedence.
More importantly, and: and or: expect blocks while & and | expect values. This means that you can write
this shouldRedraw and: [ obj stateChanged ]
where obj stateChanged will be executed only if this shouldRedraw is true. This is useful if you want to optimize your code and avoid computing something which is not needed at the moment.

Raku: return Type

I want to write a function returning an array whose all subarrays must have a length of two.
For example return will be [[1, 2], [3, 4]].
I define:
(1) subset TArray of Array where { .all ~~ subset :: where [Int, Int] };
and
sub fcn(Int $n) of TArray is export(:fcn) {
[[1, 2], [3, 4]];
}
I find (1) over-complicated. Is there something simpler?
Stepping back first
subset TArray of Array where { .all ~~ subset :: where [Int, Int] };
Is there something simpler?
Before we go there, let's step back. Even ignoring your code's "overly-complicated" nature based on just looking at it, it's also potentially problematic and complicated for various reasons that may not be so obvious. I'll highlight three:
This subset will accept an Array containing Arrays, with each of those arrays containing two Ints. But it doesn't mandate an Array[Array[Int]]. The outer Array's type may be just a generic Array, rather than being an Array[Array] let lone an Array[Array[Int]]. Indeed it will be unless you deliberately introduce strongly typed values. I will cover strong typing in the last section of this answer.
What about an empty Array? Your subset will accept that. Is that your intent? If not, what about requiring at least one pair of Ints?
The outer where clause uses a common Raku idiom of the form .all ~~ ..., with a junction on the left hand side of the ~~ smart match operator. Astonishingly, per an issue I just filed, this may be a problem. What alternatives are there?
Starting simple
Raku does a decent job of keeping simple things simple. If we put aside any artificial desire for strong typing, and focus on simple tools for tightening code up, a simple subset I would have suggested in the past would be:
subset TArray where .all == 2; # BAD despite being idiomatic???
This has all of the problems your original code has, plus in addition it accepts data that has non-integers where integers belong.
But it does have the redeeming qualities that it does a useful check (that the inner arrays each have two elements) and it's significantly simpler than your code.
Now I've reminded myself that I need to view .all on the left hand side of ~~ as possibly a problem, I'll instead write it as:
subset TArray where 2 == .all; # Potentially the new idiomatic.
This version reads more poorly, but, while readability is important, basic correctness is more important.
Still fairly simple, and less problems
Here are two variants I came up with:
subset TArray where all .map: * ~~ (Int,Int);
subset TArray where .elems == .grep: (Int,Int);
These both avoid the junction/smartmatch problem. (The first where expression does have a junction to the left of a smart match, but it's not an example of the problem.)
The second version isn't so obviously correct (think of it as checking that the count of subarrays is the same as the count of subarrays that match (Int,Int)) but it nicely lends itself to fixing the problem of matching if there are zero subarrays, if that were to need fixing:
subset TArray where 0 < .elems == .grep: (Int,Int);
Strong typing solutions
The solutions thus far don't deal with strong typing. Perhaps that's desirable. Perhaps not.
To understand what I mean by this, let's first look at literals:
say WHAT 1; # (Int)
say WHAT [1,2]; # (Array)
say WHAT [[1,2],[3,4]]; # (Array)
These values have types determined by their literal constructors.
The last two are just Arrays, generic over their elements.
(The second is not an Array[Int], which might be expected. Similarly the last one is not an Array[Array[Int]].)
Current built in Raku literal forms for composite types (arrays and hashes) all construct generic Arrays which do not constrain the types of their elements.
See the PR Introduce [1,2,3]:Int syntax #4406 for a proposal/PR regarding element typed composite literals and a related issue I just posted in response to your Q here about an alternative and/or complementary approach to that PR. (There have been discussions over the years about this aspect of the type system but it seems like it's time for Rakoons to look at addressing it.)
What if you wanted to build a strongly typed data structure as the value to return from your routine, and to have the return type check that?
Here's one way one might build such a strongly typed value:
my Array[Array[Int]] $result .= new: Array[Int].new(1,2), Array[Int].new(3,4);
Super verbose! But now you could write the following for your sub's return type check and it'll work:
subset TArray of Array[Array[Int]] where 0 < .elems == .grep: (Int,Int);
sub fcn(Int $n) of TArray is export(:fcn) {
my Array[Array[Int]] $result .= new: Array[Int].new(1,2), Array[Int].new(3,4);
}
Another way to build a strongly typed value is to specify not only the strong typing in a variable's type constraint, but also coercion typing to bridge from a loosely typed value to a strongly typed target.
We keep the exact same subset (that establishes the strongly typed target data structure and adds "refinement typing" checks):
subset TArray of Array[Array[Int]] where 0 < .elems == .grep: (Int,Int);
But instead of using a verbose correct-by-construction initialization value, using full type names and news, we introduce additional coercion typing and then just use ordinary literal syntax:
constant TArrayInitialization = TArray(Array[Array[Int]()]());
sub fcn(Int $n) of TArray is export(:fcn) {
my TArrayInitialization $result = [[1,2],[3,4]];
}
(I could have written the TArrayInitialization declaration as another subset, but it would be a slight overkill to have done so. A constant does the job with less fuss.)
I gather that the aim is to restrict the type of the inner Array to [Int,Int] ... the closest I can get to this is to declare two subsets, one based on the other...
subset IArray where * ~~ [Int, Int];
subset TArray where .all ~~ IArray;
Otherwise, the anonymous subset form you use seems to be the briefest, although as #raiph points out you can drop the 'of Array' piece.
If you wanted to impose this sort of constraint on a function's parameter (rather than its return type) you could do so with something like:
sub fcn(#a where {all .map: * ~~ [Int, Int]}) {...}
As the other answers have mentioned, there currently isn't great syntax for similarly constraining the return type, but there's a proposal to add support for similar syntax for return types. In fact, as mentioned in that issue, someone has volunteered to work on an implementation but hasn't yet made any progress as far as I know. (And I guess I should know, since I was that volunteer… oops)
So, for now, a subset is the best option – but hopefully the future will have even better ways to write that.

Accessing the last element in Perl6

Could someone explain why this accesses the last element in Perl 6
#array[*-1]
and why we need the asterisk *?
Isn't it more logical to do something like this:
#array[-1]
The user documentation explains that *-1 is just a code object, which could also be written as
-> $n { $n - 1 }
When passed to [ ], it will be invoked with the array size as argument to compute the index.
So instead of just being able to start counting backwards from the end of the array, you could use it to eg count forwards from its center via
#array[* div 2] #=> middlemost element
#array[* div 2 + 1] #=> next element after the middlemost one
According to the design documents, the reason for outlawing negative indices (which could have been accepted even with the above generalization in place) is this:
The Perl 6 semantics avoids indexing discontinuities (a source of subtle runtime errors), and provides ordinal access in both directions at both ends of the array.
If you don't like the whatever-star, you can also do:
my $last-elem = #array.tail;
or even
my ($second-last, $last) = #array.tail(2);
Edit: Of course, there's also a head method:
my ($first, $second) = #array.head(2);
The other two answers are excellent. My only reason for answering was to add a little more explanation about the Whatever Star * array indexing syntax.
The equivalent of Perl 6's #array[*-1] syntax in Perl 5 would be $array[ scalar #array - 1]. In Perl 5, in scalar context an array returns the number of items it contains, so scalar #array gives you the length of the array. Subtracting one from this gives you the last index of the array.
Since in Perl 6 indices can be restricted to never be negative, if they are negative then they are definitely out of range. But in Perl 5, a negative index may or may not be "out of range". If it is out of range, then it only gives you an undefined value which isn't easy to distinguish from simply having an undefined value in an element.
For example, the Perl 5 code:
use v5.10;
use strict;
use warnings;
my #array = ('a', undef, 'c');
say $array[-1]; # 'c'
say $array[-2]; # undefined
say $array[-3]; # 'a'
say $array[-4]; # out of range
say "======= FINISHED =======";
results in two nearly identical warnings, but still finishes running:
c
Use of uninitialized value $array[-2] in say at array.pl line 7.
a
Use of uninitialized value in say at array.pl line 9.
======= FINISHED =======
But the Perl 6 code
use v6;
my #array = 'a', Any, 'c';
put #array[*-1]; # evaluated as #array[2] or 'c'
put #array[*-2]; # evaluated as #array[1] or Any (i.e. undefined)
put #array[*-3]; # evaluated as #array[0] or 'a'
put #array[*-4]; # evaluated as #array[-1], which is a syntax error
put "======= FINISHED =======";
will likewise warn about the undefined value being used, but it fails upon the use of an index that comes out less than 0:
c
Use of uninitialized value #array of type Any in string context.
Methods .^name, .perl, .gist, or .say can be used to stringify it to something meaningful.
in block <unit> at array.p6 line 5
a
Effective index out of range. Is: -1, should be in 0..Inf
in block <unit> at array.p6 line 7
Actually thrown at:
in block <unit> at array.p6 line 7
Thus your Perl 6 code can more robust by not allowing negative indices, but you can still index from the end using the Whatever Star syntax.
last word of advice
If you just need the last few elements of an array, I'd recommend using the tail method mentioned in mscha's answer. #array.tail(3) is much more self-explanatory than #array[*-3 .. *-1].

How to tell if an identifier is being assigned or referenced? (FLEX/BISON)

So, I'm writing a language using flex/bison and I'm having difficulty with implementing identifiers, specifically when it comes to knowing when you're looking at an assignment or a reference,
for example:
1) A = 1+2
2) B + C (where B and C have already been assigned values)
Example one I can work out by returning an ID token from flex to bison, and just following a grammar that recognizes that 1+2 is an integer expression, putting A into the symbol table, and setting its value.
examples two and three are more difficult for me because: after going through my lexer, what's being returned in ex.2 to bison is "ID PLUS ID" -> I have a grammar that recognizes arithmetic expressions for numerical values, like INT PLUS INT (which would produce an INT), or DOUBLE MINUS INT (which would produce a DOUBLE). if I have "ID PLUS ID", how do I know what type the return value is?
Here's the best idea that I've come up with so far: When tokenizing, every time an ID comes up, I search for its value and type in the symbol table and switch out the ID token with its respective information; for example: while tokenizing, I come across B, which has a regex that matches it as being an ID. I look in my symbol table and see that it has a value of 51.2 and is a DOUBLE. So instead of returning ID, with a value of B to bison, I'm returning DOUBLE with a value of 51.2
I have two different solutions that contradict each other. Here's why: if I want to assign a value to an ID, I would say to my compiler A = 5. In this situation, if I'm using my previously described solution, What I'm going to get after everything is tokenized might be, INT ASGN INT, or STRING ASGN INT, etc... So, in this case, I would use the former solution, as opposed to the latter.
My question would be: what kind of logical device do I use to help my compiler know which solution to use?
NOTE: I didn't think it necessary to post source code to describe my conundrum, but I will if anyone could use it effectively as a reference to help me understand their input on this topic.
Thank you.
The usual way is to have a yacc/bison rule like:
expr: ID { $$ = lookupId($1); }
where the the lookupId function looks up a symbol in the symbol table and returns its type and value (or type and storage location if you're writing a compiler rather than a strict interpreter). Then, your other expr rules don't need to care whether their operands come from constants or symbols or other expressions:
expr: expr '+' expr { $$ = DoAddition($1, $3); }
The function DoAddition takes the types and values (or locations) for its two operands and either adds them, producing a result, or produces code to do the addition at run time.
If possible redesign your language so that the situation is unambiguous. This is why even Javascript has var.
Otherwise you're going to need to disambiguate via semantic rules, for example that the first use of an identifier is its declaration. I don't see what the problem is with your case (2): just generate the appropriate code. If B and C haven't been used yet, a value-reading use like this should be illegal, but that involves you in control flow analysis if taken to the Nth degree of accuracy, so you might prefer to assume initial values of zero.
In any case you can see that it's fundamentally a language design problem rather than a coding problem.

What is this syntax #[] in REBOL?

In another question, I saw the following syntax:
#[unset!]
What is that? If I say type? #[unset!] in R3, it tells me unset!, but it doesn't solve the mystery of what #[] is.
So curious.
#[] is the serialized form for values. Play with MOLD versus MOLD/ALL in the console to get a feel for it.
the #[] "syntax" is actually not a syntax as such (not legal syntax, if you try it), only special cases of such constructs are "legal", like the #[unset!] syntax, #[true] syntax, #[false] syntax, #[none!] syntax, or #[datatype! unset!] syntax.
What is even more interesting, is, what the #[unset!] value actually is. It happens to be the value every "uninitialized" variable in REBOL has (not in functions, though, function local variables are initialized to #[none!]), as well as the result of such expressions like print 1, do [], (), etc.
Regarding the "function local variables ... initialized to #[none!]" I should add, that only the variables following an "unused refinement" (i.e. the one not used in the actual call) are initialized to #[none!] together with the refinement variable.
To explain the issue further, the syntactic (Data exchange dialect) difference between true and #[true] is, that the former is a word, while the latter is a value of the logic! type. Seen from the semantic (Do dialect) point of view, the difference is smaller, since the (global) word is interpreted as a variable, which happens to refer to the #[true] value.
Looks like it's the value-construction syntax for an unset instance, as opposed to the word unset!:
>> comparison: [unset! #[unset!]]
== [unset! unset!]
>> type? first comparison
== word!
>> type? second comparison
== unset!
>> second comparison
>> first comparison
== unset!
If you're in a programmatic context you could do this with to-unset, but having a literal notation lets you dodge the reduce:
>> comparison: reduce ['unset! to-unset none]
== [unset! unset!]
>> second comparison
>> first comparison
== unset!
Looks like they've reserved the #[...] syntax for more of these kinds of constructors.
Look at the example below:
trace on
>> e: none
Trace: e: (set-word)
Trace: none (word) ;<<<<<
>> e: #[none]
Trace: e: (set-word)
Trace: none (none) ;<<<<<
none is a word in the first one, but in the second one, #[none] is not a word, it is the Rebol value of the datatype none!.
Similarly for other values like true, false. The #[unset!] case is special since every undefined variable actually has this value.