Cog VM and indirect variable access - smalltalk

Does anyone know whether the Cog VM for Pharo and Squeak is able to optimize away simple indirect variable accesses with accessors like this:
SomeClass>>someProperty
^ someProperty
SomeClass>>someSecondProperty
^ someSecondProperty
that just return an instance variable, so that methods like this:
SomeClass>>someMethod
^ self someProperty doWith: self someSecondProperty
will be no slower than methods like this:
SomeClass>>someMethod
^ someProperty doWith: someSecondProperty
I did some benchmarks, and they do seem roughly equivalent in speed, but I'm curious if anyone familiar with Cog knows for certain, because if there is a difference (no matter how slight), then there might be situations, however rare, where one is inappropriate.

There's a small cost right now, but it's so small that you should not bother. If you want performance, you should be looking at changing other parts of your code, not instance variable access.
A quick bench:
bench
^ { [ iv yourself ] bench . [ self iv yourself ] bench }
=> #('52,400,000 per second.' '49,800,000 per second.')
The difference does not look so big.
Once jitted and executed once, the difference is that "self iv" executes an inline cache check, a CPU call and a CPU return in addition to fetching the instance variable value. The call and return instructions are most probably going to be anticipated by the CPU and not really executed. So it comes down to the inline cache check, which is a very cheap operation.
What the inlining compiler in development will add is that the CPU call and return will really be removed by inlining, which covers the cases where the CPU has not anticipated them. In addition, the inline cache check may or may not be removed, depending on circumstances.
There are other details: for example, the getter method needs to be compiled to native code, which takes room in the machine code zone and could increase the number of machine code zone garbage collections. But that is even more marginal than the inline cache check overhead.
So in short, there is a very very very little overhead right now but that overhead will decrease in the future.
Clement

This is a tough question, and I don't know the exact answer. But I can help you learn how to check for yourself with a few clues.
You'll need to load the VMMaker package into an image. In Pharo, there is a procedure to build such an image by downloading everything from the net and GitHub. See https://github.com/pharo-project/pharo-vm
Then the main hint is that methods that just return an instance variable are compiled as if executing primitive 264 + inst var offset... (for example, you'll see this by inspecting Interval>>#first or any other simple inst var getter)
In the classical interpreter VM, this is handled in Interpreter>>internalExecuteNewMethod.
It seems like you pay the cost of a method lookup (some caches make this cheaper), but not of a real method activation.
I suppose that explains why debuggers can't step into such simple methods. This, however, is not real inlining.
In Cog, the same happens in StackInterpreter>>internalQuickPrimitiveResponse whenever the interpreter is used.
As for the JIT, this is handled by Cogit>>compilePrimitive; see also the implementors of genQuickReturnInstVar. This is not proper inlining either, but you can see that very few instructions are generated. Again, I bet you generally don't pay the price of a lookup, thanks to the so-called Polymorphic Inline Cache (PIC).
For real inlining, I didn't find a clue after this quick browsing of source code...
My understanding is that it will happen on the image side through a callback from the Sista VM, but this is work in progress and only my vague recollection. Clement Bera is writing a blog about this (the Sista chronicles at http://clementbera.wordpress.com).
If you're afraid of digging into the VMMaker source code, I invite you to ask on vm-dev.lists.squeakfoundation.org. I'm pretty sure Eliot Miranda or Clement will be happy to give you a far more accurate answer.
EDIT
I forgot to tell you the conclusion of the above peregrinations: I think that there will be a very small difference if you use the inst. var. directly rather than a getter, but it shouldn't really be noticeable, and in any case your programming style should NOT be guided by such negligible optimizations.

Related

Should I avoid unwrap in production application?

It's easy to crash at runtime with unwrap:
fn main() {
c().unwrap();
}
fn c() -> Option<i64> {
None
}
Result:
Compiling playground v0.0.1 (file:///playground)
Running `target/debug/playground`
thread 'main' panicked at 'called `Option::unwrap()` on a `None` value', ../src/libcore/option.rs:325
note: Run with `RUST_BACKTRACE=1` for a backtrace.
error: Process didn't exit successfully: `target/debug/playground` (exit code: 101)
Is unwrap only designed for quick tests and proofs-of-concept?
I can not affirm "My program will not crash here, so I can use unwrap" if I really want to avoid panic! at runtime, and I think avoiding panic! is what we want in a production application.
In other words, can I say my program is reliable if I use unwrap? Or must I avoid unwrap even if the case seems simple?
I read this answer:
It is best used when you are positively sure that you don't have an error.
But I don't think I can be "positively sure".
I don't think this is an opinion question, but a question about Rust core and programming.
While the whole “error handling” topic is very complicated and often opinion based, this question can actually be answered here, because Rust has a rather narrow philosophy. That is:
panic! for programming errors (“bugs”)
proper error propagation and handling with Result<T, E> and Option<T> for expected and recoverable errors
One can think of unwrap() as converting between those two kinds of errors (it is converting a recoverable error into a panic!()). When you write unwrap() in your program, you are saying:
At this point, a None/Err(_) value is a programming error and the program is unable to recover from it.
For example, say you are working with a HashMap and want to insert a value which you may want to mutate later:
age_map.insert("peter", 21);
// ...
if /* some condition */ {
*age_map.get_mut("peter").unwrap() += 1;
}
Here we use the unwrap(), because we can be sure that the key holds a value. It would be a programming error if it didn't and even more important: it's not really recoverable. What would you do when at that point there is no value with the key "peter"? Try inserting it again ... ?
But as you may know, there is a beautiful entry API for the maps in Rust's standard library. With that API you can avoid all those unwrap()s. And this applies to pretty much all situations: you can very often restructure your code to avoid the unwrap()! Only in very few situations is there no way around it. But then it's OK to use it, if you want to signal: at this point, it would be a programming bug.
There has been a recent, fairly popular blog post on the topic of “error handling” whose conclusion is similar to Rust's philosophy. It's rather long but worth reading: “The Error Model”. Here is my try on summarizing the article in relation to this question:
deliberately distinguish between programming bugs and recoverable errors
use a “fail fast” approach for programming bugs
In summary: use unwrap() when you are sure that the recoverable error that you get is in fact unrecoverable at that point. Bonus points for explaining “why?” in a comment above the affected line ;-)
In other words, can I say my program is reliable if I use unwrap? Or must I avoid unwrap even if the case seems simple?
I think using unwrap judiciously is something you have to learn to handle; it can't just be avoided.
My rhetorical question barrage would be:
1. Can I say my program is reliable if I use indexing on vectors, arrays or slices?
2. Can I say my program is reliable if I use integer division?
3. Can I say my program is reliable if I add numbers?
(1) is like unwrap: indexing panics if you violate the contract and try to index out of bounds. This would be a bug in the program, but it doesn't attract as much attention as a call to unwrap.
(2) is like unwrap: integer division panics if the divisor is zero.
(3) is unlike unwrap: addition does not check for overflow in release builds, so it may silently result in wraparound and logical errors.
Of course, there are strategies for handling all of these without leaving panicky cases in the code, but many programs simply use for example bounds checking as it is.
There are two questions folded into one here:
is the use of panic! acceptable in production
is the use of unwrap acceptable in production
panic! is a tool that is used, in Rust, to signal irrecoverable situations/violated assumptions. It can be used either to crash a program that cannot possibly continue in the face of this failure (for example, an OOM situation) or to work around the compiler not knowing that some code cannot be executed (at the moment).
unwrap is a convenience that is best avoided in production. The problem with unwrap is that it does not state which assumption was violated. It is better to use expect("") instead, which is functionally equivalent but will also give a clue as to what went wrong (without opening the source code).
unwrap() is not necessarily dangerous. Just like with unreachable!() there are cases where you can be sure some condition will not be triggered.
Functions returning Option or Result are sometimes just suited to a wider range of conditions, but due to how your program is structured those cases might never occur.
For example: when you create an iterator from a vector you build yourself, you know its exact length and can be sure how many times invoking next() on it returns Some(T) (and you can safely unwrap() it).
unwrap is great for prototyping, but not safe for production. Once you are done with your initial design, you go back and replace each unwrap() with proper handling of a Result<Value, ErrorType>.

When is invokedynamic actually useful (besides lazy constants)?

TL;DR
Please provide a piece of code written in some well known dynamic language (e.g. JavaScript) and how that code would look like in Java bytecode using invokedynamic and explain why the usage of invokedynamic is a step forward here.
Background
I have googled and read quite a lot about the not-that-new-anymore invokedynamic instruction, which everyone on the internet agrees will help speed up dynamic languages on the JVM. Thanks to Stack Overflow I managed to get my own bytecode instructions to run with Sable/Jasmin.
I have understood that invokedynamic is useful for lazy constants and I also think that I understood how the OpenJDK takes advantage of invokedynamic for lambdas.
Oracle has a small example, but as far as I can tell the usage of invokedynamic in this case defeats the purpose, as the "adder" example could be expressed much more simply, faster and with roughly the same effect with the following bytecode:
aload whereeverAIs
checkcast java/lang/Integer
aload whereeverBIs
checkcast java/lang/Integer
invokestatic IntegerOps/adder(Ljava/lang/Integer;Ljava/lang/Integer;)Ljava/lang/Integer;
because for some reason Oracle's bootstrap method knows that both arguments are integers anyway. They even "admit" that:
[..]it assumes that the arguments [..] will be Integer objects. A bootstrap method requires additional code to properly link invokedynamic [..] if the parameters of the bootstrap method (in this example, callerClass, dynMethodName, and dynMethodType) vary.
Well yes, and without that interesting "additional code" there is no point in using invokedynamic here, is there?
So after that and a couple of further Javadoc and blog entries, I think that I have a pretty good grasp on how to use invokedynamic as a poor replacement when invokestatic/invokevirtual or getfield would work just as well.
Now I am curious how to actually apply the invokedynamic instruction to a real-world use case so that it is actually an improvement over what we could do with "traditional" invocations (except lazy constants, I got those...).
Actually, lazy operations are the main advantage of invokedynamic if you take the term “lazy creation” broadly. E.g., the lambda creation feature of Java 8 is a kind of lazy creation that includes the possibility that the actual class containing the code that will be finally invoked by the invokedynamic instruction doesn’t even exist prior to the execution of that instruction.
This can be projected to all kinds of scripting languages delivering code in a different form than Java bytecode (it might even be source code). Here, the code may be compiled right before the first invocation of a method and remain linked afterwards. But it may even become unlinked if the scripting language supports redefinition of methods. This uses the second important feature of invokedynamic: it allows mutable CallSites which may be changed afterwards, while supporting maximal performance when invoked frequently without redefinition.
This possibility to change an invokedynamic target afterwards allows another option, linking to an interpreted execution on the first invocation, counting the number of executions and compiling the code only after exceeding a threshold (and relinking to the compiled code then).
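To make the relinking idea concrete, here is a minimal sketch using a MutableCallSite from java.lang.invoke. It is only an illustration under assumptions: the class name TieredCallSiteSketch, the methods interpreted and compiled, and the threshold of 3 are all hypothetical stand-ins, and the call site is driven through dynamicInvoker() from plain Java rather than through a real invokedynamic instruction.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.invoke.MutableCallSite;

// Hypothetical sketch: a call site that starts out "interpreted" and
// relinks itself to a "compiled" target after a few invocations.
class TieredCallSiteSketch {
    static final int THRESHOLD = 3; // assumed "compile after N calls" threshold
    private static final MethodType TYPE =
            MethodType.methodType(String.class, int.class);

    private final MutableCallSite site = new MutableCallSite(TYPE);
    private final MethodHandle invoker = site.dynamicInvoker();
    private int slowCalls = 0;

    TieredCallSiteSketch() {
        // Initial linkage: every call goes to the slow, counting path.
        site.setTarget(bind("interpreted"));
    }

    // Looks up an instance method of this class and binds it to this object.
    private MethodHandle bind(String name) {
        try {
            return MethodHandles.lookup()
                    .findVirtual(TieredCallSiteSketch.class, name, TYPE)
                    .bindTo(this);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Stand-in for interpreting script code; relinks once the threshold is hit.
    String interpreted(int x) {
        if (++slowCalls >= THRESHOLD) {
            site.setTarget(bind("compiled"));
        }
        return "interpreted:" + (x * 2);
    }

    // Stand-in for the code produced by compiling the script method.
    String compiled(int x) {
        return "compiled:" + (x * 2);
    }

    // Plays the role of the invokedynamic instruction in this sketch.
    String call(int x) {
        try {
            return (String) invoker.invokeExact(x);
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }

    public static void main(String[] args) {
        TieredCallSiteSketch s = new TieredCallSiteSketch();
        for (int i = 1; i <= 4; i++) {
            System.out.println(s.call(i));
        }
    }
}
```

The first three calls take the counting path; the third one swaps the target, so the fourth call lands directly in compiled without the caller changing anything.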
Regarding dynamic method dispatch based on a runtime instance, it’s clear that invokedynamic can’t elide the dispatch algorithm. But if you detect at runtime that a particular call-site will always call the method of the same concrete type, you may relink the CallSite to optimized code which does a short check whether the target is of the expected type and then performs the optimized action, branching to the generic code performing the full dynamic dispatch only if that test fails. The implementation may even de-optimize such a call-site if it detects that the fast path check failed a certain number of times.
This is close to how invokevirtual and invokeinterface are optimized internally in the JVM as for these it’s also the case that most of these instructions are called on the same concrete type. So with invokedynamic you can use the same technique for arbitrary lookup algorithms.
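The guarded fast path described above can be sketched with MethodHandles.guardWithTest. This is a hedged illustration, not how any particular language runtime does it: the class InlineCacheSketch and its methods isString, describeString and slowPath are hypothetical names, String stands in for "the concrete type seen at this call site", and the site is again invoked through dynamicInvoker() instead of a real invokedynamic instruction.

```java
import java.lang.invoke.MethodHandle;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodType;
import java.lang.invoke.MutableCallSite;

// Hypothetical sketch of a monomorphic inline cache built on a MutableCallSite.
class InlineCacheSketch {
    private static final MethodType TYPE =
            MethodType.methodType(String.class, Object.class);

    private final MutableCallSite site = new MutableCallSite(TYPE);
    private final MethodHandle invoker = site.dynamicInvoker();
    int slowPathHits = 0; // how often the generic dispatch had to run

    InlineCacheSketch() {
        site.setTarget(bind("slowPath", TYPE));
    }

    private MethodHandle bind(String name, MethodType type) {
        try {
            return MethodHandles.lookup()
                    .findVirtual(InlineCacheSketch.class, name, type)
                    .bindTo(this);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    // The cheap type check guarding the fast path (String is final,
    // so instanceof is an exact-type test here).
    boolean isString(Object receiver) {
        return receiver instanceof String;
    }

    // Optimized action for the expected type.
    String describeString(Object receiver) {
        return "String of length " + ((String) receiver).length();
    }

    // Generic dispatch; on seeing a String it relinks the site to
    // "if isString then describeString else slowPath".
    String slowPath(Object receiver) {
        slowPathHits++;
        if (receiver instanceof String) {
            MethodHandle test =
                    bind("isString", MethodType.methodType(boolean.class, Object.class));
            site.setTarget(MethodHandles.guardWithTest(
                    test, bind("describeString", TYPE), bind("slowPath", TYPE)));
            return describeString(receiver);
        }
        return "Object " + receiver;
    }

    String call(Object receiver) {
        try {
            return (String) invoker.invokeExact(receiver);
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }

    public static void main(String[] args) {
        InlineCacheSketch cache = new InlineCacheSketch();
        System.out.println(cache.call("abc"));   // generic path, then relinks
        System.out.println(cache.call("hello")); // guarded fast path
        System.out.println(cache.call(42));      // guard fails, falls back
        System.out.println(cache.slowPathHits);
    }
}
```

A de-optimizing variant could count failed guards in slowPath and reset the target to the plain generic handle once a limit is exceeded; that part is left out to keep the sketch short.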
But if you want an entirely different use case, you can use invokedynamic to implement friend semantics which are not supported by the standard access modifier rules. Suppose you have a class A and B which are meant to have such a friend relationship in that A is allowed to invoke private methods of B. Then all these invocations may be encoded as invokedynamic instructions with the desired name and signature and pointing to a public bootstrap method in B which may look like this:
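// Needs java.lang.invoke.* and java.lang.invoke.MethodHandles.Lookup imported.
public static CallSite bootStrap(Lookup l, String name, MethodType type)
        throws NoSuchMethodException, IllegalAccessException {
    // Only A itself can obtain a Lookup with full access to A.
    if (l.lookupClass() != A.class || (l.lookupModes() & 0xf) != 0xf)
        throw new SecurityException("unprivileged caller");
    // Switch to a Lookup with full access to B (this method lives in B).
    l = MethodHandles.lookup();
    return new ConstantCallSite(l.findStatic(B.class, name, type));
}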
It first verifies that the provided Lookup object has full access to A as only A is capable of constructing such an object. So sneaky attempts of wrong callers are sorted out at this place. Then it uses a Lookup object having full access to B to complete the linkage. So, each of these invokedynamic instructions is permanently linked to the matching private method of B after the first invocation, running at the same speed as ordinary invocations afterwards.
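The check performed by that bootstrap method can be tried out without generating bytecode by calling it by hand, which is roughly what the JVM does when it links an invokedynamic instruction. The following is only a sketch under assumptions: FriendA, FriendB, Stranger and the private method secret are hypothetical names standing in for A, B, an unrelated caller and one of B's private methods.

```java
import java.lang.invoke.CallSite;
import java.lang.invoke.ConstantCallSite;
import java.lang.invoke.MethodHandles;
import java.lang.invoke.MethodHandles.Lookup;
import java.lang.invoke.MethodType;

// Hypothetical stand-in for class A: allowed to reach FriendB's private method.
class FriendA {
    static String callSecret() {
        try {
            // Simulates the linkage step of an invokedynamic instruction in FriendA.
            CallSite site = FriendB.bootStrap(
                    MethodHandles.lookup(), "secret", MethodType.methodType(String.class));
            return (String) site.dynamicInvoker().invokeExact();
        } catch (Throwable t) {
            throw new RuntimeException(t);
        }
    }

    public static void main(String[] args) {
        System.out.println(callSecret());
        System.out.println(Stranger.isRejected());
    }
}

// Hypothetical stand-in for class B, owning the private method and the bootstrap.
class FriendB {
    private static String secret() {
        return "private to FriendB";
    }

    public static CallSite bootStrap(Lookup l, String name, MethodType type)
            throws NoSuchMethodException, IllegalAccessException {
        // Accept only a full-power Lookup created inside FriendA.
        if (l.lookupClass() != FriendA.class || (l.lookupModes() & 0xf) != 0xf)
            throw new SecurityException("unprivileged caller");
        // Link using FriendB's own Lookup, which may see its private methods.
        return new ConstantCallSite(
                MethodHandles.lookup().findStatic(FriendB.class, name, type));
    }
}

// Any other class presenting its own Lookup is turned away.
class Stranger {
    static boolean isRejected() {
        try {
            FriendB.bootStrap(
                    MethodHandles.lookup(), "secret", MethodType.methodType(String.class));
            return false;
        } catch (SecurityException e) {
            return true;
        } catch (ReflectiveOperationException e) {
            throw new RuntimeException(e);
        }
    }
}
```

FriendA gets a call site bound to the private method, while Stranger's own Lookup fails the lookupClass() comparison and is rejected with a SecurityException.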

Will the Hotspot VM inline functions as necessary?

I am converting some C++ code to Java and I was wondering what I can do about the inlined functions. Can I assume that functions will be inlined by the VM (as and when necessary) and just not worry about this? How do I profile to observe this behaviour? Suppose there is a main outer function, and I throw a for loop around it and cause a million invocations. Should I expect to see an improvement as the VM inlines more and more?
Yes Java does inline method calls. The inlining is performed by the JIT compiler, so you won't see it by examining the bytecode files.
Whether inlining actually occurs for a given method call will depend on the size of the method body and on whether the call is inlineable. (If a method call still involves dispatching after the JVM has applied its global optimizations designed to remove unnecessary dispatching, then it cannot be inlined.)
The same applies to your example with your outer main function. It depends on how big the method body is. On the other hand, if the method takes a significant time to execute, then the relative importance of the optimization decreases correspondingly.
My advice is to not worry about things like this at this stage. Just write the code clearly and simply, and let the JIT compiler deal with the problem of optimizing. When your application is working, you can profile it and see if there are any "hot spots" in the code that are worthwhile optimizing by hand.
But I should be able to see this in something like Visual VM right? I mean initially no inlining, then more and more stuff is inlined so the average time for the outer method is slightly reduced.
It may be observable and it may not, depending on the amount of time spent in making the calls relative to executing the method bodies. (Profiling often relies on sampling the program counter. The reported times may be inaccurate if the number of samples for a given region of code is too small ... and for other reasons.)
It also depends on the JVM you are using. Not all JVMs will re-optimize code that they have previously optimized.
Finally, there is a way to get the JVM to dump the native code output by the JIT compiler. That will give you a definitive answer as to what has been inlined ... if you are prepared to read the machine instructions.

STM32 programming tips and questions

I could not find any good documentation on the internet about STM32 programming. STM's own documents do not explain anything more than register functions. I would greatly appreciate it if anyone could answer the following questions.
1. I noticed that in all example programs that STM provides, local variables for main() are always defined outside of the main() function (with occasional use of the static keyword). Is there any reason for that? Should I follow a similar practice? Should I avoid using local variables inside main()?
2. I have a global variable which is updated within the clock interrupt handler. I am using the same variable inside another function as a loop condition. Don't I need to access this variable using some form of atomic read operation? How can I know that a clock interrupt does not change its value in the middle of the function execution? Do I need to disable the clock interrupt every time I need to use this variable inside a function? (However, this seems extremely inefficient to me, as I use it as a loop condition. I believe there should be better ways of doing it.)
3. Keil automatically inserts startup code which is written in assembly (i.e. startup_stm32f4xx.s). This startup code has the following import statements:
IMPORT SystemInit
IMPORT __main
In C, it makes sense. However, in C++ both main and system_init have different names (e.g. _int_main__void). How can this startup code still work in C++ even without using extern "C" (I tried and it worked)? How can the C++ linker (armcc --cpp) associate these statements with the correct functions?
You can use local or global variables. Using locals in embedded systems carries a risk of your stack colliding with your data; with globals you don't have that problem. But this is true no matter where you are: embedded microcontroller, desktop, etc.
I would make a copy of the global in the foreground task that uses it.
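unsigned int myglobal; /* updated by the interrupt handler */
void fun ( void )
{
    unsigned int myg;
    myg=myglobal; /* take one snapshot of the shared variable */
    /* ... use only myg from here on ... */
}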
and then only use myg for the rest of the function. Basically you are taking a snapshot and using the snapshot. You would want to do the same thing if you are reading a register: if you want to do multiple things based on a sample of something, take one sample of it and make decisions on that one sample, otherwise the item can change between samples. If you are using one global to communicate back and forth with the interrupt handler, I would use two variables instead: one foreground to interrupt, the other interrupt to foreground.
Yes, there are times when you need to carefully manage a shared resource like that. Normally it has to do with times when you need to do more than one thing. For example, if you had several items that all need to change as a group before the handler can see them change, then you need to disable the interrupt handler until all the items have changed. Here again there is nothing special about embedded microcontrollers; this is all basic stuff you would see on a desktop system with a full-blown operating system.
Keil knows what they are doing: if they support C++, then from a system level they have this worked out. I don't use Keil; I use gcc and llvm for microcontrollers like this one.
Edit:
Here is an example of what I am talking about
https://github.com/dwelch67/stm32vld/tree/master/stm32f4d/blinker05
It is an stm32 example using timer based interrupts, where the interrupt handler modifies a variable shared with the foreground task. The foreground task takes a single snapshot of the shared variable (per loop) and, if need be, uses the snapshot more than once in the loop rather than the shared variable, which can change. This is C, not C++, I understand that, and I am using gcc and llvm, not Keil. (Note that llvm has known problems optimizing tight while loops. It is a very old bug and I don't know why they have no interest in fixing it. llvm works for this example.)
Question 1: Local variables
The sample code provided by ST is not particularly efficient or elegant. It gets the job done, but sometimes there are no good reasons for the things they do.
In general, you always want your variables to have the smallest scope possible. If you only use a variable in one function, define it inside that function. Add the "static" keyword to local variables if and only if you need them to retain their value after the function is done.
In some embedded environments, like the PIC18 architecture with the C18 compiler, local variables are much more expensive (more program space, slower execution time) than global. On the Cortex M3, that is not true, so you should feel free to use local variables. Check the assembly listing and see for yourself.
Question 2: Sharing variables between interrupts and the main loop
People have written entire chapters explaining the answers to this group of questions. Whenever you share a variable between the main loop and an interrupt, you should definitely use the volatile keyword on it. Variables of 32 or fewer bits can be accessed atomically (unless they are misaligned).
If you need to access a larger variable, or two variables at the same time from the main loop, then you will have to disable the clock interrupt while you are accessing the variables. If your interrupt does not require precise timing, this will not be a problem. When you re-enable the interrupt, it will automatically fire if it needs to.
Question 3: main function in C++
I'm not sure. You can use arm-none-eabi-nm (or whatever nm is called in your toolchain) on your object file to see what symbol name the C++ compiler assigns to main(). I would bet that C++ compilers refrain from mangling the main function for this exact reason, but I'm not sure.
STM's sample code is not an exemplar of good coding practice, it is merely intended to exemplify use of their standard peripheral library (assuming those are the examples you are talking about). In some cases it may be that variables are declared external to main() because they are accessed from an interrupt context (shared memory). There is also perhaps a possibility that it was done that way merely to allow the variables to be watched in the debugger from any context; but that is not a reason to copy the technique. My opinion of STM's example code is that it is generally pretty poor even as example code, let alone from a software engineering point of view.
In this case your clock interrupt variable is atomic so long as it is 32 bits or less and you are not using read-modify-write semantics with multiple writers. You can safely have one writer and multiple readers regardless. This is true for this particular platform, but not necessarily universally; the answer may be different for 8 or 16 bit systems, or for multi-core systems, for example. The variable should be declared volatile in any case.
I am using C++ on STM32 with Keil, and there is no problem. I am not sure why you think that the C++ entry points are different, they are not here (Keil ARM-MDK v4.22a). The start-up code calls SystemInit() which initialises the PLL and memory timing for example, then calls __main() which performs global static initialisation then calls C++ constructors for global static objects before calling main(). If in doubt, step through the code in the debugger. It is important to note that __main() is not the main() function you write for your application, it is a wrapper with different behaviour for C and C++, but which ultimately calls your main() function.

Does primitive wrapper instantiation cause memory allocation in JDK 1.6

From PMD:
IntegerInstantiation: In JDK 1.5, calling new Integer() causes memory allocation. Integer.valueOf() is more memory friendly.
ByteInstantiation: In JDK 1.5, calling new Byte() causes memory allocation. Byte.valueOf() is more memory friendly.
ShortInstantiation: In JDK 1.5, calling new Short() causes memory allocation. Short.valueOf() is more memory friendly.
LongInstantiation: In JDK 1.5, calling new Long() causes memory allocation. Long.valueOf() is more memory friendly.
Does the same apply for JDK 1.6? I am just wondering if the compiler or JVM optimizes this to the respective valueOf methods.
In theory the compiler could optimize a small subset of cases where (for example) new Integer(n) was used instead of the recommended Integer.valueOf(n).
Firstly, we should note that the optimization can only be applied if the compiler can guarantee that the wrapper object will never be compared against other objects using == or !=. (If this occurred, then the optimization changes the semantics of wrapper objects, such that "==" and "!=" would behave in a way that is contrary to the JLS.)
In the light of this, it is unlikely that such an optimization would be worth implementing:
The optimization would only help poorly written applications that ignore the javadoc, etc recommendations. For a well written application, testing to see if the optimization could be applied only slows down the optimizer; e.g. the JIT compiler.
Even for a poorly written application, the limitation on where the optimization is allowed means that few of the actual calls to new Integer(n) qualify for optimization. In most situations, it is too expensive to trace all of the places where the wrapper created by a new expression might be used. If you include reflection in the picture, the tracing is virtually impossible for anything but local variables. Since most uses of the primitive wrappers entail putting them into collections, it is easy to see that the optimization would hardly ever be found (by a practical optimizer) to be allowed.
Even in the cases where the optimization was actually applied, it would only help for values of n within a limited range. For example, calling Integer.valueOf(n) for large n will normally create a new object, since only values from -128 to 127 are guaranteed to be cached.
This holds for Java 6 as well. Try the following with Java 6 to prove it:
System.out.println(new Integer(3) == new Integer(3));
System.out.println(Integer.valueOf(3) == Integer.valueOf(3));
System.out.println(new Long(3) == new Long(3));
System.out.println(Long.valueOf(3) == Long.valueOf(3));
System.out.println(new Byte((byte)3) == new Byte((byte)3));
System.out.println(Byte.valueOf((byte)3) == Byte.valueOf((byte)3));
However, with big numbers the optimization is off, as expected.
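The difference between the two ways of obtaining a wrapper can be packaged as a small self-contained check. This is a sketch, with WrapperCacheSketch and its two helper methods being hypothetical names. It relies only on the documented guarantee that Integer.valueOf caches values from -128 to 127; whether larger values are cached depends on the JVM and its settings, so no result is claimed for them. The new Integer(int) constructor is deprecated in current JDKs, hence the suppressed warnings.

```java
// Hypothetical sketch comparing identity of wrappers from valueOf and from new.
@SuppressWarnings({"deprecation", "removal"})
class WrapperCacheSketch {
    // True when two valueOf calls hand back the very same object.
    static boolean sameViaValueOf(int n) {
        return Integer.valueOf(n) == Integer.valueOf(n);
    }

    // Two new expressions always create two distinct objects.
    static boolean sameViaNew(int n) {
        return new Integer(n) == new Integer(n);
    }

    public static void main(String[] args) {
        System.out.println(sameViaValueOf(127));    // true: inside the guaranteed cache range
        System.out.println(sameViaNew(127));        // false: always two fresh objects
        // Outside -128..127 the result depends on the JVM's cache configuration.
        System.out.println(sameViaValueOf(100000));
    }
}
```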
The same applies for Java SE 6. In general it is difficult to optimise away the creation of new objects. It is guaranteed that, say, new Integer(42) != new Integer(42). There is potential in some circumstances to remove the need to get an object altogether, but I believe all that is disabled in HotSpot production builds at the time of writing.
Most of the time it doesn't matter. Unless you have identified you are running a critical piece of code which is called lots of times (e.g. 10K or more) it is likely it won't make much difference.
If in doubt, assume the compiler does no optimisations. It does in fact very little. The JVM however, can do lots of optimisations, but removing the need to create an object is not one of them. The general assumption is that object allocation is fast enough most of the time.
Note: code which is run only a few times (< 10K times by default) will not even be fully compiled to native code, and this is likely to slow down your code more than the object allocation.
On a recent JVM with escape analysis and scalar replacement, new Integer() might actually be faster than Integer.valueOf() when the scope of the variable is restricted to one method or block. See Autoboxing versus manual boxing in Java for some more details.