Argument count limit of methods in Smalltalk-80 implementation - oop

The maximum number of arguments of a method is limited to 2^5-1 (i.e. 31) because there are only 5 bits to encode the number of arguments in a compiled method, as illustrated in Figure 27.4 of the blue book. But the double extended send bytecode has 8 bits to encode the number of arguments (see the definition of doubleExtendedSendBytecode here), which means I could send as many as 2^8-1 (i.e. 255) arguments to a message (using perform:, because otherwise the statement will not compile). Is this a contradiction? I think the bytecode uses too many bits to encode the number of arguments.

Yes, this is a contradiction, but it has not mattered enough yet.
Also, the number of arguments in a method is bounded by the maximum number of temporary variables in a method, which in most Smalltalks happens to be 2^8-1.
There is another part to that:
In Squeak, the number of arguments is actually restricted to 15 (2^4-1), and has only a nibble (4 bits) of space in the method header.
As the comment of Squeak's CompiledMethod states:
(index 18) 6 bits: number of temporary variables (#numTemps)
(index 24) 4 bits: number of arguments to the method (#numArgs)
with the #numTemps including the number of arguments, too.
Long story short, yes, the doubleExtendedSendBytecode could encode more arguments than actually expressible in a CompiledMethod.
This is one of the reasons it was replaced in Squeak by the doubleExtendedDoAnything bytecode that can do more than just sends, but limits the number of arguments to 2^5-1 (which is still more than the CompiledMethod can encode, but it is not unlikely that CompiledMethod will change in the foreseeable future to encode more than 15 arguments).
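To make the header layout concrete, here is a minimal sketch in Python (not Smalltalk) of how such bit fields can be unpacked from an integer method header. The bit offsets (18 for #numTemps, 24 for #numArgs) are the ones quoted from the CompiledMethod comment above; treating them as plain shift amounts is an assumption for illustration, not Squeak's actual implementation:

NUM_TEMPS_SHIFT, NUM_TEMPS_BITS = 18, 6
NUM_ARGS_SHIFT, NUM_ARGS_BITS = 24, 4

def num_temps(header):
    return (header >> NUM_TEMPS_SHIFT) & ((1 << NUM_TEMPS_BITS) - 1)

def num_args(header):
    return (header >> NUM_ARGS_SHIFT) & ((1 << NUM_ARGS_BITS) - 1)

# Example: a header encoding numArgs = 3 and numTemps = 5 (arguments included)
header = (3 << NUM_ARGS_SHIFT) | (5 << NUM_TEMPS_SHIFT)
assert num_args(header) == 3 and num_temps(header) == 5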

The actual number of arguments used is mostly small. The number of arguments of CompiledMethods in a Moose 4.6 Image I have here:
|bag|
bag := IdentityBag new.
CompiledMethod allInstances do:[ :element | bag add: element numArgs ].
bag sortedCounts
52006 -> 0
25202 -> 1
6309 -> 2
2133 -> 3
840 -> 4
391 -> 5
191 -> 6
104 -> 7
61 -> 8
12 -> 9
11 -> 10
5 -> 11
4 -> 12
3 -> 13
2 -> 15
1 -> 14

Size of a serialized complete binary tree

I'm trying to work out the size of a serialized binary tree having N nodes (also mentioned on LeetCode). This is how I calculate the size:
If we assume the storage required for each node's value is V bits, then the storage needed to store N nodes will be N·V bits. We also need to store NULL for the leaves; since there are exactly Ceiling(N/2) leaves in a complete tree, and assuming only one bit is enough to represent NULL, an additional 2 × Ceiling(N/2) bits will be required. 2 × Ceiling(N/2) translates to N+1, as in a complete tree N is always an odd number.
So, N·V + (N+1) bits are required in total.
However, I can see that on LeetCode and in some other places (e.g. this), it's calculated as N·V + 2N.
What am I missing?
The two references you provided (LeetCode and blog article) deal with arbitrary binary trees, not necessarily complete. So let me first deal with arbitrary binary trees:
Although a NULL reference could be represented with one bit (e.g. with value 0), you also need to store the fact that a reference is not NULL (value 1). You cannot just omit the bit in that case, as then the next bit (belonging to a node value) could be misinterpreted as indicating a NULL reference. So you should not only count that bit for each NULL reference, but count it for every child reference.
The serialised format would for each node represent:
The node's value (𝑉 bits)
The fact whether or not its left child is a NULL (1 bit)
The fact whether or not its right child is a NULL (1 bit)
Example:
Let 𝑉 be 4
Tree to serialise:
    10
   /  \
  7    13
         \
          14
Serialisation process (level order):
node value | has left | has right | serialised | without spacing
-----------+----------+-----------+------------+----------------
10         | yes      | yes       | 1010 1 1   | 101011
7          | no       | no        | 0111 0 0   | 011100
13         | no       | yes       | 1101 0 1   | 110101
14         | no       | no        | 1110 0 0   | 111000
Complete:
101011011100110101111000
If we were only to store the 0 when there is a NULL, then we would get this:
101001110011010111000
    ^
But now the bit at the indicated position is ambiguous, because that bit could be interpreted as representing a NULL reference, but actually it is the first of 𝑉 bits 0111 representing the value 7.
It is however possible to reduce the serialised string by 2 bits: the very last 2 bits will always be 0 in a tree traversal that is guaranteed to end with a leaf, which is for example the case for level-order and pre-order traversals. So you could just omit those 2 bits.
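To make the scheme concrete, here is a minimal Python sketch of the format described above (𝑉 value bits plus two child-presence bits per node, in level order). The Node class and function names are mine, purely for illustration:

from collections import deque

V = 4  # bits per node value, as in the example above

class Node:
    def __init__(self, value, left=None, right=None):
        self.value, self.left, self.right = value, left, right

def serialise(root):
    bits, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        bits.append(format(node.value, '0{}b'.format(V)))  # V value bits
        bits.append('1' if node.left else '0')             # has-left flag
        bits.append('1' if node.right else '0')            # has-right flag
        queue.extend(child for child in (node.left, node.right) if child)
    return ''.join(bits)

# The example tree: 10 with children 7 and 13; 13 has a right child 14.
tree = Node(10, Node(7), Node(13, right=Node(14)))
assert serialise(tree) == '101011011100110101111000'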
The case for complete binary trees
First of all, a word about the definition of a complete binary tree. You write:
in a complete tree N is always an odd number.
I suppose your definition of a complete tree is then what Wikipedia calls a perfect tree. We can however also look at (nearly) complete binary trees (and then 𝑁 is not necessarily odd).
For complete binary trees the case is simpler, as a level order traversal of a complete binary tree will never include NULLs, i.e. there are no "gaps" in such a traversal.
So you can just serialise the node's values in that order, giving each 𝑉 bits. This is actually the array representation that is used for binary heaps:
The parent / child relationship is defined implicitly by the elements' indices in the array.
If serialisation happens in a string data type that implicitly has a length attribute, then that's it. If there is no such meta data, then you need to prefix the value of 𝑁 in the serialisation, reserving a predefined number of bits for it. Alternatively, if there is a special value of 𝑉 bits that will never occur as actual node value, you could append it as a terminator (much like \0 in C-strings).
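To make the implicit indexing concrete, here is a tiny Python sketch (assuming the usual 0-based binary-heap layout; the sample values are made up):

# Level-order array of a complete binary tree: no NULLs, no gaps.
values = [10, 7, 13, 2, 9]

def left_child(i):  return 2 * i + 1
def right_child(i): return 2 * i + 2
def parent(i):      return (i - 1) // 2

i = 1  # the node holding 7
assert values[parent(i)] == 10
assert values[left_child(i)] == 2 and values[right_child(i)] == 9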

Compact-u16 - what is the purpose of this?

I was doing some research over the weekend on some blockchain dev in the Solana blockchain and came across a construct called Compact-u16. The definition of this in the documentation says the following: "A compact-u16 is a multi-byte encoding of 16 bits. The first byte contains the lower 7 bits of the value in its lower 7 bits. If the value is above 0x7f, the high bit is set and the next 7 bits of the value are placed into the lower 7 bits of a second byte. If the value is above 0x3fff, the high bit is set and the remaining 2 bits of the value are placed into the lower 2 bits of a third byte.".
I have been coding for 30+ years. Maybe I'm just old school on this, but why is there a construct to store 16 bits of data in 3 bytes? This is just vastly inefficient from my standpoint. Is there a reason for this? On further research, I found a doc related to assembly instruction pointers, which referenced 7 instruction pointers that are useful for caching values when context switching in and out of the processor stack. But this construct is used for a web app platform. Like, literally, there is no reason that I have been able to find that justifies using 3 bytes to store 16 bits of data. If the developers wanted to use an elegant bit-mapping solution to compress space, why not just use a semaphore? Why create a brand-new construct that increases the storage requirements for the data by 33%?
What am I missing?
I had some similar confusion when reading the compact-u16 description. Based on the code for parsing them in the solana python module I believe they're doing something conceptually similar to UTF-8, and storing the number in 1-3 bytes depending on its size.
Basically instead of each byte having 8 bits of a number, it has 7 bits of the number and a flag (the most significant bit) that indicates whether the number continues in the next byte. For the largest numbers they need an extra byte, but for numbers less than 128 they need only one byte. Since Solana seems to use these for storing the length of arrays, if it's common that the length of the arrays is less than 128 then they will end up with fewer total bytes to transfer across all transactions.
Some examples I worked out for myself:
hex    | compact-u16
-------+------------------
0x0000 | [0x00]
0x0001 | [0x01]
0x007f | [0x7f]
0x0080 | [0x80 0x01]
0x3fff | [0xff 0x7f]
0x4000 | [0x80 0x80 0x01]
0xc000 | [0x80 0x80 0x03]
0xffff | [0xff 0xff 0x03]
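To make this concrete, here is a small Python sketch of the scheme as I understand it (essentially a little-endian base-128 varint capped at 16 bits); the function names are mine, not part of any Solana API:

def encode_compact_u16(value):
    assert 0 <= value <= 0xffff
    out = bytearray()
    while True:
        byte = value & 0x7f          # take the low 7 bits
        value >>= 7
        if value:
            out.append(byte | 0x80)  # more to come: set the continuation flag
        else:
            out.append(byte)
            return bytes(out)

def decode_compact_u16(data):
    value = 0
    for shift, byte in zip((0, 7, 14), data):
        value |= (byte & 0x7f) << shift
        if not byte & 0x80:          # continuation flag clear: we're done
            break
    return value

# Reproduces the table above:
assert encode_compact_u16(0x0080) == bytes([0x80, 0x01])
assert encode_compact_u16(0xffff) == bytes([0xff, 0xff, 0x03])
assert decode_compact_u16(bytes([0x80, 0x80, 0x03])) == 0xc000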

Are Smalltalk bytecode optimizations worth the effort?

Consider the following method in the Juicer class:
Juicer >> juiceOf: aString
| fruit juice |
fruit := self gather: aString.
juice := self extractJuiceFrom: fruit.
^juice withoutSeeds
It generates the following bytecodes
25 self ; 1
26 pushTemp: 0 ; 2
27 send: gather:
28 popIntoTemp: 1 ; 3
29 self ; 4
30 pushTemp: 1 ; 5
31 send: extractJuiceFrom:
32 popIntoTemp: 2 ; 6 <-
33 pushTemp: 2 ; 7 <-
34 send: withoutSeeds
35 returnTop
Now note that 32 and 33 cancel out:
25 self ; 1
26 pushTemp: 0 ; 2
27 send: gather:
28 popIntoTemp: 1 ; 3 *
29 self ; 4 *
30 pushTemp: 1 ; 5 *
31 send: extractJuiceFrom:
32 storeIntoTemp: 2 ; 6 <-
33 send: withoutSeeds
34 returnTop
Next consider 28, 29 and 30. They insert self below the result of gather. The same stack configuration could have been achieved by pushing self before sending the first message:
25 self ; 1 <-
26 self ; 2
27 pushTemp: 0 ; 3
28 send: gather:
29 popIntoTemp: 1 ; 4 <-
30 pushTemp: 1 ; 5 <-
31 send: extractJuiceFrom:
32 storeIntoTemp: 2 ; 6
33 send: withoutSeeds
34 returnTop
Now cancel out 29 and 30
25 self ; 1
26 self ; 2
27 pushTemp: 0 ; 3
28 send: gather:
29 storeIntoTemp: 1 ; 4 <-
30 send: extractJuiceFrom:
31 storeIntoTemp: 2 ; 5
32 send: withoutSeeds
33 returnTop
Temporaries 1 and 2 are written but not read. So, except when debugging, they could be skipped leading to:
25 self ; 1
26 self ; 2
27 pushTemp: 0 ; 3
28 send: gather:
29 send: extractJuiceFrom:
30 send: withoutSeeds
31 returnTop
This last version, which saves 4 out of 7 stack operations, corresponds to the less expressive and less clear source:
Juicer >> juiceOf: aString
^(self extractJuiceFrom: (self gather: aString)) withoutSeeds
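As an illustration, the pop-then-push cancellation applied in the first step above can be expressed as a tiny peephole pass. The following is a toy sketch in Python (not an actual Pharo/Squeak compiler pass), with bytecodes modeled as (name, argument) pairs:

def cancel_pop_push(bytecodes):
    out = []
    for bc in bytecodes:
        # popIntoTemp: n immediately followed by pushTemp: n
        # is equivalent to a single storeIntoTemp: n
        if out and out[-1][0] == 'popIntoTemp' and bc == ('pushTemp', out[-1][1]):
            out[-1] = ('storeIntoTemp', out[-1][1])
        else:
            out.append(bc)
    return out

juice_of = [('self', None), ('pushTemp', 0), ('send', 'gather:'),
            ('popIntoTemp', 1), ('self', None), ('pushTemp', 1),
            ('send', 'extractJuiceFrom:'), ('popIntoTemp', 2),
            ('pushTemp', 2), ('send', 'withoutSeeds'), ('returnTop', None)]

assert ('storeIntoTemp', 2) in cancel_pop_push(juice_of)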
Note also that there are other possible optimizations that Pharo (I haven't checked Squeak) does not implement (e.g., jump chaining). These optimizations would encourage the Smalltalk programmer to better express their intentions without having to pay the cost of additional computations.
My question is whether these improvements are an illusion or not. Concretely, are bytecode optimizations absent from Pharo/Squeak because they are known to have little relevance, or are they regarded as beneficial but haven't been addressed yet?
EDIT
An interesting advantage of using a register+stack architecture [cf. A Smalltalk Virtual Machine Architectural Model by Allen Wirfs-Brock and Pat Caudill] is that the additional space provided by registers makes the manipulation of bytecodes for the sake of optimization easier. Of course, even though these kinds of optimizations are not as relevant as method inlining or polymorphic inline caches, as pointed out in the answer below, they shouldn't be disregarded, especially when combined with others implemented by the JIT compiler. Another interesting topic to analyze is whether destructive optimization (i.e., one that requires de-optimization to support the debugger) is actually necessary, or whether sufficient performance gains can be attained by non-destructive techniques.
The main annoyance when you start playing with such optimizations is the debugger interface.
Historically, and still currently in Squeak, the debugger simulates the bytecode level and needs to map the bytecodes to the corresponding Smalltalk instructions.
So I think the gain was considered too low to justify the added complexity, or worse, a degradation of the debugging facility.
Pharo wants to change the debugger to operate at a higher level (the Abstract Syntax Tree), but I don't know how they will map that back to the bytecode, which is all the VM knows about.
IMO, this kind of optimization might better be implemented in the JIT compiler which transforms bytecode to machine native code.
EDIT
The greatest gains come from eliminating the sends themselves (by inlining), because they are much more expensive (about 10x) than the stack operations: about 10 times more bytecodes are executed per second than sends when you run 1 tinyBenchmarks (Cog VM).
Interestingly, such optimizations could take place in the Smalltalk image, but only on hot spots detected by the VM, as in the SISTA effort. See for example https://clementbera.wordpress.com/2014/01/22/the-sista-chronicles-iii-an-intermediate-representation-for-optimizations/
So, in the light of SISTA, the answer is rather: interesting, not yet addressed, but actively studied (and work in progress)!
All the machinery for de-optimizing when the method has to be debugged is still one of the difficult points, as I understand it.
I think that a broader question is worth answering: are bytecodes worth the effort? Bytecodes were conceived as a compact and portable representation of code that is close to the target machine. As such, they are easy to interpret, but slow to execute.
Bytecodes do not excel at either of these games, and that usually makes them not the best choice if you want to write either an interpreter or a fast VM. On one hand, AST nodes are far easier to interpret (there are only a few node types vs. lots of different bytecodes). On the other hand, with the advent of JIT compilers, it became clear that running native code instead is not only possible but also much faster.
If you look at the most efficient VM implementations of JavaScript (which can be considered the most modern compilers of today) and also Java (HotSpot, Graal), you'll see they all use a tiered compilation scheme. Methods are initially interpreted from the AST, and only jitted when they become a hot spot.
In the higher tiers of compilation there are no bytecodes. The key component in a compiler is its intermediate representation, and bytecodes do not fulfill the required properties. The most optimizable IRs are much more fine-grained: they are in SSA form and allow specific representation of registers and memory. This allows for much better code analysis and optimization.
Then again, if you are interested in portable code, there isn't anything more portable than the AST. Besides, it's easier and more practical to implement AST-based debuggers and profilers than bytecode-based ones. The only remaining problem is compactness, but in any case you can implement something like ast-codes (coded ASTs, similar to bytecodes but representing the tree).
On the other hand, if you want full speed, then you'll go for a JIT with a good IR and no bytecodes. I think that bytecodes don't fill many gaps in today's VMs, but they still remain mostly for backwards compatibility (there are also examples of hardware architectures that directly execute Java bytecodes).
There are also some cool experiments with the Cog VM related to bytecodes. But from what I understand they transform the bytecode into another IR for optimizing, then convert back to bytecodes. I'm not sure if there's a technical gain in the last conversion besides reusing the original JIT architecture, or if there actually is any optimization at the bytecode level.

How does multiple assignment work?

From Section 4.1 of Programming in Lua.
In a multiple assignment, Lua first evaluates all values and only then
executes the assignments. Therefore, we can use a multiple assignment
to swap two values, as in
x, y = y, x  -- swap `x' for `y'
How does the assignment work actually?
How multiple assignment gets implemented depends on which implementation of Lua you are using. The implementation is free to do things any way it likes as long as it preserves the semantics. That is, no matter how things get implemented, you should get the same result as if you had saved all the values on the RHS before assigning them to the LHS, as the Lua book explains.
If you are still curious about the actual implementation, one thing you can do is look at the bytecode that gets produced for a given program. For example, take the following program
local x,y = 10, 11
x,y = y,x
and passing it to the bytecode compiler (luac -l) for Lua 5.2 gives
main <lop.lua:0,0> (6 instructions at 0x9b36b50)
0+ params, 3 slots, 1 upvalue, 2 locals, 2 constants, 0 functions
1 [1] LOADK 0 -1 ; 10
2 [1] LOADK 1 -2 ; 11
3 [2] MOVE 2 1
4 [2] MOVE 1 0
5 [2] MOVE 0 2
6 [2] RETURN 0 1
The MOVE opcode assigns the value in the right register to the left register (see lopcodes.h in the Lua source for more details). Apparently, what is going on is that registers 0 and 1 are being used for x and y and register 2 is being used as a temporary extra slot. x and y get initialized with constants in the first two opcodes, and in the next three opcodes a swap is performed using the "temporary" slot 2, kind of like you would do by hand:
tmp = y -- MOVE 2 1
y = x -- MOVE 1 0
x = tmp -- MOVE 0 2
Given how Lua used a different approach when doing a swapping assignment and a static initialization, I wouldn't be surprised if you got different results for different kinds of multiple assignments (setting table fields is probably going to look very different, especially since then the order should matter due to metamethods...). We would need to find the part in the source where the bytecode gets emitted to be 100% sure, though. And as I mentioned before, all of this might vary between Lua versions and implementations, especially if you look at LuaJIT vs PUC Lua.
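To make the register-level swap concrete, here is a toy Python sketch (not actual Lua VM code) that replays the three MOVE instructions on a list of registers; register 0 is x, register 1 is y, register 2 is the temporary slot:

registers = [10, 11, None]      # state after LOADK 0 -1 and LOADK 1 -2

def move(dst, src):             # MOVE dst src: copy register src into register dst
    registers[dst] = registers[src]

move(2, 1)   # MOVE 2 1  -> tmp = y
move(1, 0)   # MOVE 1 0  -> y = x
move(0, 2)   # MOVE 0 2  -> x = tmp

assert registers[:2] == [11, 10]   # x and y have been swapped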

How do I process enormous numbers? [duplicate]

Possible Duplicate:
Most efficient implementation of a large number class
Suppose I needed to calculate 2^150000. Obviously that number is going to exceed the size of an int, float, or double. How can I make a data type that allows normal math functions but exceeds the basic number types?
If this is a "depends on which language you use" kind of deal, I will say C#.
See
Most efficient implementation of a large number class
for some leads.
If C# is not cast in stone, and you want something that just works out of the box, then there are several options. The one I know best is Python, but I think that languages like Scheme and Ruby support large numbers, too.
Python: 2**150000. Prints the result after about 1 second.
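For example, a quick check in Python (the digit count matches the figure quoted for Mathematica further down):

n = 2 ** 150000           # Python ints are arbitrary precision out of the box
print(n.bit_length())     # 150001 bits
print(len(str(n)))        # 45155 decimal digits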
If you want free mathematics software, look at Maxima or Sage.
You might also consider using Frink, which is a language with the native capability of dealing with measurement units.
It computes 2^150000 without difficulty, deals with fractions (e.g. 1/3+2/5 --> 11/15), computes 3 meters + 2 inch --> 3.0508 m and is a full programming language.
Frink - Copyright 2000-2008 Alan Eliasen, eliasen#mindspring.com
http://futureboy.us/frinkdocs/
Several languages have built in support for arbitrary large numbers. You could use Mathematica, for example. I tried your example in Mathematica, and the result has 45,155 digits. I tried the same example with bc on a Unix machine. bc supports extended precision, but not that extended; it bombed on the example.
Lisp is your friend. Default biginteger numbers.
I find it very frustrating to use a language without arbitrarily large numbers: it seems nonsensical to be able to use ordinary operators like addition on most numbers, but to have to switch to method calls on a BigInt instance simply because of its size.
A whole bunch of languages have more complete numeric towers, and seamlessly coerce when needed; e.g., Allegro Common Lisp evaluates and prints all 45,155 digits of (expt 2 150000) in 1ms.
cl-user(2): (time (expt 2 150000))
; cpu time (non-gc) 0 msec user, 0 msec system
; cpu time (gc) 0 msec user, 0 msec system
; cpu time (total) 0 msec user, 0 msec system
; real time 1 msec
; space allocation:
; 2 cons cells, 18,784 other bytes, 0 static bytes
There is a product in C called calc which is an arbitrary precision calculator. I used it once when working as a researcher and found it fairly straightforward to use...
http://sourceforge.net/projects/calc/
It can be programmed for difficult or long calculations and can accept arguments from the command line. In interactive mode, it accepts one command at a time, and displays the answer.
Ordinarily the commands are simply expressions such as:
3 * (4 + 1)
and calc will print:
15
Calc does the arithmetic operators +, -, /, * as well as ^ (exponentiation), % (modulus) and // (integer divide).
For example:
3 * 19 ^ 43 - 1
will produce:
29075426613099201338473141505176993450849249622191102976
Calc values can be VERY large. For example:
2 ^ 23209 - 1
will print:
402874115778988778181873329071 ... loads of digits ... 3779264511
Hope this helps...
I don't know C#, but I do know the Ruby programming language has the BigDecimal class, which seems to allow numbers of unlimited size.
Python has a bignum library. If you need to implement a bignum library in another language, you can at least use the Python one as a reference for validating your work. Note that bignums have a few implementation gotchas that aren't immediately obvious if you don't know what you're looking for.