Coroutine: what is the difference from a C++ inline function?

As I understand it, a coroutine is a sequence of function calls, with the difference that the calls do not allocate a new stack each time (they resume from the point where they stopped); we could say these functions execute on a shared stack.
So the main advantage of a coroutine is execution speed. Isn't that just an inline function from C++ (where, instead of a call, the function body is inserted at compile time)?

A "coroutine", in the commonly used sense, is basically a function that -- once started -- can be envisioned as running alongside the caller. That is, when the coroutine "yields" (a semi-special kind of return), it isn't necessarily done -- and "calling" it again will have the coroutine pick up right where it left off, with all its state intact, rather than starting from the beginning. The calls can thus be seen as kinda passing messages between the two functions.
Few languages fully and natively do this. (Stack-based languages tend to have a hard time with it, absent some functionality like Windows's "fibers".) Ruby does, apparently, and Python has a limited version of it. I believe they call it a "generator", and it's basically used kinda like an iterable collection (whose iterator generates its next "element" on the fly). C# can partially do that as well (they call it an "iterator"), but the compiler actually turns the function into a class that implements a state machine of sorts.
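To make the "generator" idea concrete, here is a minimal sketch in Kotlin (assuming nothing beyond the standard library's sequence builder); the block suspends at each yield and resumes right where it left off when the next element is requested:

fun fibonacci() = sequence {
    var a = 0
    var b = 1
    while (true) {
        yield(a)            // suspend here and hand one value to the consumer
        val next = a + b    // resume here on the next request, state intact
        a = b
        b = next
    }
}

fun main() {
    println(fibonacci().take(8).toList())   // [0, 1, 1, 2, 3, 5, 8, 13]
}

Unlike an inlined C++ function, the state (a and b) survives between values: inlining only removes call overhead at compile time, it does not let a function pause and resume.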

Related

Are Kotlin's 'coroutines' actually coroutines?

The Kotlin documentation's overview of coroutines says that "Kotlin provides coroutine support at the language level, and delegates most of the functionality to libraries." But does Kotlin actually support coroutines, or are they misusing the word?
In practice what the quote from the docs means is:
The language provides the suspend modifier, and the compiler transforms suspend functions into CPS (continuation-passing style).
The kotlinx.coroutines library provides machinery for executing the resulting chains of continuations.
I haven't found a good answer for which bit of the above constitutes 'a coroutine' in Kotlin's terminology. My current best understanding is that Kotlin refers to the instance created by launch as a 'coroutine'. That fits with the slogan we see everywhere that "coroutines are light-weight threads" (for example here).
suspend fun myFunction(): Unit = delay(100)
someScope.launch { myFunction() } // this creates a 'coroutine'.
However it doesn't fit with my (admittedly thin) knowledge/experience of coroutines in other languages. Outside of Kotlin, I would describe a coroutine as a function that is like a subroutine, but can suspend and yield control to another arbitrary coroutine at any point. This definition is partly based on my reading of the Wikipedia article on Coroutines.
This is different from the way Kotlin uses the term in two major respects:
In Kotlin, the coroutine is the rather abstract 'lightweight thread' that executes the function, whereas by my definition, the coroutine would be the function itself.
Kotlin's suspend functions can only yield control back to the caller, whereas a defining characteristic of a coroutine is that it can pass control to another coroutine.
I think that Kotlin's suspend functions are actually generators or semicoroutines, and that the kotlinx.coroutines library is essentially providing a trampoline.
Is my understanding correct, based on the way the term 'coroutine' is used outside of Kotlin? If so, could the existing language support actually be used to implement something more akin to true coroutines, or is it fundamentally limited to semicoroutines and generators?
For me a coroutine is more like a flow of code execution, an execution context, not a function. But I agree, the naming is a little inconsistent and can be confusing, because traditionally we said a function is a subroutine and now we say coroutines are like threads. Even the Wikipedia article you linked does this: first it says a coroutine is a generalization of a subroutine, but then it says coroutines are very similar to threads.
At the end of the day, I think it doesn't really matter that much. If we say we "launch a coroutine", that could mean "launch a function which is a coroutine" or "launch a suspendable execution flow", but these are almost the same. I think this is why people use the term for both meanings.
The main difference between subroutines and coroutines is that for subroutines the execution flow can only exit by returning from the subroutine - then the subroutine call is finished. Coroutines can temporarily suspend their execution, jump to another coroutine and after some time jump back to the place where they suspended. Coroutines in Kotlin can do that, so yes, my opinion is that they are coroutines. Although, Kotlin has to "emulate" this behavior, because most of its runtimes (JVM, JS) do not support coroutines natively.
Kotlin's suspend functions can only yield control back to the caller
What makes you think so? You can launch 5 coroutines and whenever one of them suspends, the execution jumps to the point where another coroutine suspended earlier.
Well... technically speaking, yes, suspending is implemented by returning (if this is what you mean), but that is just an implementation detail. From the end-user perspective, execution is passed from one coroutine to another, running side by side.
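A small sketch of that behaviour (assuming the kotlinx.coroutines library; the printed messages are just for illustration): two coroutines run on a single thread, and whenever one suspends, execution jumps to the point where the other suspended earlier.

import kotlinx.coroutines.*

fun main() = runBlocking {
    launch {
        println("A: start")
        delay(100)              // A suspends; control moves elsewhere
        println("A: resumed")   // printed last
    }
    launch {
        println("B: start")
        delay(50)               // B suspends; control returns to the event loop
        println("B: resumed")   // printed before "A: resumed"
    }
}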

When is invokedynamic actually useful (besides lazy constants)?

TL;DR
Please provide a piece of code written in some well-known dynamic language (e.g. JavaScript), show what that code would look like in Java bytecode using invokedynamic, and explain why the usage of invokedynamic is a step forward here.
Background
I have googled and read quite a lot about the not-that-new-anymore invokedynamic instruction, which everyone on the internet agrees will help speed up dynamic languages on the JVM. Thanks to Stack Overflow, I managed to get my own bytecode instructions to run with Sable/Jasmin.
I have understood that invokedynamic is useful for lazy constants and I also think that I understood how the OpenJDK takes advantage of invokedynamic for lambdas.
Oracle has a small example, but as far as I can tell the usage of invokedynamic in this case defeats the purpose, as the "adder" example could be expressed much more simply, faster, and with roughly the same effect with the following bytecode:
aload whereeverAIs
checkcast java/lang/Integer
aload whereeverBIs
checkcast java/lang/Integer
invokestatic IntegerOps/adder(Ljava/lang/Integer;Ljava/lang/Integer;)Ljava/lang/Integer;
because for some reason Oracle's bootstrap method knows that both arguments are integers anyway. They even "admit" that:
[..]it assumes that the arguments [..] will be Integer objects. A bootstrap method requires additional code to properly link invokedynamic [..] if the parameters of the bootstrap method (in this example, callerClass, dynMethodName, and dynMethodType) vary.
Well yes, and without that interesting "additional code" there is no point in using invokedynamic here, is there?
So after that and a couple of further Javadoc and blog entries, I think that I have a pretty good grasp on how to use invokedynamic as a poor replacement where invokestatic, invokevirtual, or getfield would work just as well.
Now I am curious how to actually apply the invokedynamic instruction to a real-world use case so that it actually is an improvement over what we could do with "traditional" invocations (except lazy constants, I got those...).
Actually, lazy operations are the main advantage of invokedynamic if you take the term “lazy creation” broadly. E.g., the lambda creation feature of Java 8 is a kind of lazy creation that includes the possibility that the actual class containing the code that will be finally invoked by the invokedynamic instruction doesn’t even exist prior to the execution of that instruction.
This can be projected to all kinds of scripting languages delivering code in a form other than Java bytecode (it might even be source code). Here, the code may be compiled right before the first invocation of a method and remains linked afterwards. But it may even become unlinked if the scripting language supports redefinition of methods. This uses the second important feature of invokedynamic: mutable CallSites, which may be relinked afterwards while still offering maximal performance when invoked frequently without redefinition.
This possibility to change an invokedynamic target afterwards allows another option: linking to an interpreted execution on the first invocation, counting the number of executions, and compiling the code only after a threshold is exceeded (and relinking to the compiled code then).
Regarding dynamic method dispatch based on a runtime instance, it's clear that invokedynamic can't elide the dispatch algorithm. But if you detect at runtime that a particular call site will always call the method of the same concrete type, you may relink the CallSite to optimized code that does a short check whether the target is the expected type and then performs the optimized action, branching to the generic code that performs the full dynamic dispatch only if that test fails. The implementation may even de-optimize such a call site if it detects that the fast-path check failed a certain number of times.
This is close to how invokevirtual and invokeinterface are optimized internally in the JVM as for these it’s also the case that most of these instructions are called on the same concrete type. So with invokedynamic you can use the same technique for arbitrary lookup algorithms.
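To make that concrete, here is a hedged sketch in Kotlin against the java.lang.invoke API, driven by hand rather than from a real invokedynamic bootstrap method; the names Linkage, genericDispatch, fastPath and isExpectedType are made up for illustration:

import java.lang.invoke.MethodHandles
import java.lang.invoke.MethodType
import java.lang.invoke.MutableCallSite

object Linkage {
    @JvmStatic fun genericDispatch(o: Any): String = "generic:" + o   // full dynamic lookup would happen here
    @JvmStatic fun fastPath(o: Any): String = "fast:" + o             // optimized for one concrete receiver type
    @JvmStatic fun isExpectedType(o: Any): Boolean = o is String      // the cheap guard
}

fun main() {
    val lookup = MethodHandles.lookup()
    val type = MethodType.methodType(String::class.java, Any::class.java)
    val generic = lookup.findStatic(Linkage::class.java, "genericDispatch", type)
    val site = MutableCallSite(generic)
    val invoker = site.dynamicInvoker()          // what an invokedynamic instruction would invoke

    println(invoker.invokeWithArguments("abc"))  // generic:abc

    // After "profiling", relink the site to: if (isExpectedType(o)) fastPath(o) else genericDispatch(o)
    val guard = lookup.findStatic(Linkage::class.java, "isExpectedType",
        MethodType.methodType(java.lang.Boolean.TYPE, Any::class.java))
    site.setTarget(MethodHandles.guardWithTest(
        guard,
        lookup.findStatic(Linkage::class.java, "fastPath", type),
        generic))                                // fall back to the generic path if the guard fails
    println(invoker.invokeWithArguments("abc"))  // fast:abc
    println(invoker.invokeWithArguments(42))     // generic:42 -- guard fails, falls back
}

A language runtime would install the same guard-plus-fallback shape from its bootstrap method instead of driving it by hand as above.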
But if you want an entirely different use case, you can use invokedynamic to implement friend semantics which are not supported by the standard access modifier rules. Suppose you have classes A and B which are meant to have such a friend relationship, in that A is allowed to invoke private methods of B. Then all these invocations may be encoded as invokedynamic instructions with the desired name and signature, pointing to a public bootstrap method in B which may look like this:
public static CallSite bootStrap(Lookup l, String name, MethodType type)
        throws NoSuchMethodException, IllegalAccessException {
    if(l.lookupClass() != A.class || (l.lookupModes() & 0xf) != 0xf)
        throw new SecurityException("unprivileged caller");
    l = MethodHandles.lookup();
    return new ConstantCallSite(l.findStatic(B.class, name, type));
}
It first verifies that the provided Lookup object has full access to A, as only A is capable of constructing such an object. So sneaky attempts by other callers are rejected at this point. Then it uses a Lookup object having full access to B to complete the linkage. So each of these invokedynamic instructions is permanently linked to the matching private method of B after the first invocation, running at the same speed as ordinary invocations afterwards.

Serializing Running Programs in a Functional Interpreter

I am writing an interpreter, implemented functionally using a variation of the Cont monad. Inspired by Smalltalk's use of images to capture a running program, I am investigating how to serialize the executing hosted program and need help determining how to accomplish this at a high level.
Problem Statement
Using the Cont monad, I can capture the current continuation of a running program in my interpreter. Storing the current continuation allows resuming interpreter execution by calling the continuation. I would like to serialize this continuation so that the state of a running program can be saved to disk or loaded by another interpreter instance. However, my language (I am both targeting and working in JavaScript) does not support serializing functions this way.
I would be interested in an approach that can be used to build up the continuation at a given point of execution given some metadata without running the entire program again until it reaches that point. Preferably, minimal changes to the implementation of the interpreter itself would be made.
Considered Approach
One approach that may work is to push all control flow logic into the program state. For example, I currently express a C style for loop using the host language's recursion for the looping behavior:
var forLoop = function(init, test, update, body) {
    var iter = function() {
        // When test is true, evaluate the loop body and start iter again;
        // otherwise, evaluate an empty statement and finish
        return branch(test,
            next(
                next(body, update),
                iter()),
            empty);
    };
    return next(
        init,
        iter());
};
This is a nice solution, but if I pause the program midway through a for loop, I don't know how I can serialize the continuation that has been built up.
However, I know I can serialize a transformed program that uses jumps, and the for loop can be constructed from jump operations. A first pass of my interpreter would generate blocks of code and save these in the program state. These blocks would capture some action in the hosted language and potentially execute other blocks. The preprocessed program would look something like this:
Label       Actions (block of code; there is no sequencing)
-----------------------------------------------------------
start:      init, GOTO loop
loop:       IF test GOTO loop_body ELSE GOTO end
loop_body:  body, GOTO update
update:     update, GOTO loop
end:        ...
This makes each block of code independent, only relying on values stored in the program state.
To serialize, I would save off the current label name and the state when it was entered. Deserialization would preprocess the input code to build the labels again and then resume at the given label with the given state. But now I have to think in terms of these blocks when implementing my interpreter. Even using composition to hide some of this seems kind of ugly.
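As a rough sketch of that idea (in Kotlin rather than the question's JavaScript; MachineState and the block table are made-up names): every block only touches serializable state and then names the next label, so the pair (label, vars) is a complete snapshot of the running program.

data class MachineState(val vars: MutableMap<String, Int> = mutableMapOf(), var label: String = "start")

// Hypothetical block table for the for-loop example above; rebuilt from source on deserialization.
val blocks: Map<String, (MachineState) -> Unit> = mapOf(
    "start" to { s -> s.vars["i"] = 0; s.label = "loop" },
    "loop" to { s -> s.label = if ((s.vars["i"] ?: 0) < 3) "loop_body" else "end" },
    "loop_body" to { s -> println("body, i = " + s.vars["i"]); s.label = "update" },
    "update" to { s -> s.vars["i"] = (s.vars["i"] ?: 0) + 1; s.label = "loop" },
    "end" to { s -> s.label = "halt" }
)

fun main() {
    val s = MachineState()
    while (s.label != "halt") {
        // To serialize, write out (s.label, s.vars) here; to resume, rebuild
        // the block table from the input program and continue this loop.
        blocks.getValue(s.label)(s)
    }
}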
Question
Are there any good existing approaches for addressing this problem? Am I thinking about serializing a program in entirely the wrong way? Is this even possible for structures like this?
After more research, I have some thoughts on how I would do this. However, I'm not sure that adding serialization is something I want to do at this point, as it would affect the rest of the implementation so much.
I'm not satisfied with this approach and would greatly like to hear any alternatives.
Problem
As I noted, transforming the program into a list of statements makes serialization easier. The entire program can be transformed into something like assembly language, but I wanted to avoid this.
Keeping a concept of expressions, what I didn't originally consider is that function calls can occur inside of deeply nested expressions. Take this program for example:
function id(x) { return x; }
10 + id(id(5)) * id(3);
The serializer should be able to serialize the program at any statement, but the statement could potentially be evaluated inside of an expression.
Host Functions In the State
The reason the program state cannot be easily serialized is that it contains host functions for continuations. These continuations must be transformed into data structures that can be serialized and independently reconstructed into the action the original continuation represented. Defunctionalization is most often used to express a higher-order language in a first-order language, but I believe it would also enable serialization.
Not all continuations can be easily defunctionalized without drastically rewriting the interpreter. As we are only interested in serialization at specific points, serialization at these points requires that the entire continuation stack be defunctionalized. So all statements and expressions must be defunctionalized, but internal logic can remain unchanged in most cases, because we don't want to allow serialization partway through an internal operation.
However, to my knowledge, defunctionalization does not work with the Cont monad because of bind statements. The lack of a good abstraction makes it difficult to work with.
Thoughts on a Solution
The goal is to create an object made up of only simple data structures that can be used to reconstruct the entire program state.
First, to minimize the amount of work required, I would rewrite the statement-level interpreter to use something more like a state machine that can be more easily serialized. Then, I would defunctionalize expressions. Function calls would push the defunctionalized continuation for the remaining expression onto an internal stack for the state machine.
Making Program State a Serializable Object
Looking at how statements work, I'm not convinced that the Cont monad is the best approach for chaining statements together (I do think it works pretty well at the expression level and for internal operations, however). A state machine seems a more natural approach, and it would also be easier to serialize.
A machine that steps between statements would be written. All other structures in the state would also be made serializable. Builtin functions would have to use serializable handles to identify them so that no functions are in the state.
Handling Expressions
Expressions would be rewritten to pass defunctionalized continuations instead of host-function continuations. When a function call is encountered in an expression, it captures the defunctionalized current continuation and pushes it onto the statement machine's internal stack (this would only happen for hosted functions, not builtin ones), creating a restore point where computation can resume.
When the function returns, the defunctionalized continuation is passed the result.
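A rough sketch of what such defunctionalized continuations could look like (in Kotlin rather than the interpreter's JavaScript; Expr, Kont and the constructor names are invented for illustration): each closure that would have captured "the rest of the expression" becomes a plain data constructor, and a single apply function interprets them, so the continuation stack is ordinary serializable data.

sealed class Expr
data class Lit(val n: Int) : Expr()
data class Add(val l: Expr, val r: Expr) : Expr()

// Defunctionalized continuations: serializable data instead of host closures.
sealed class Kont
object Done : Kont()
data class AddRight(val r: Expr, val k: Kont) : Kont()        // still have to evaluate the right operand
data class AddLeft(val leftValue: Int, val k: Kont) : Kont()  // have the left value, finish the addition

fun eval(e: Expr, k: Kont): Int = when (e) {
    is Lit -> apply(k, e.n)
    is Add -> eval(e.l, AddRight(e.r, k))
}

fun apply(k: Kont, v: Int): Int = when (k) {
    is Done -> v
    is AddRight -> eval(k.r, AddLeft(v, k.k))
    is AddLeft -> apply(k.k, k.leftValue + v)
}

fun main() {
    // 10 + (5 + 3); at any step the current Kont could be written to disk and rebuilt later.
    println(eval(Add(Lit(10), Add(Lit(5), Lit(3))), Done))   // 18
}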
Concerns
JavaScript also allows hosted functions to be evaluated inside almost any expression (getters, setters, type conversion, higher-order builtins), and this may complicate things if we allow serialization inside those functions.
Defunctionalization seems to require working directly with continuations and would make the entire interpreter less flexible.

Will the Hotspot VM inline functions as necessary?

I am converting some C++ code to Java and I was wondering what I can do about the inlined functions. Can I assume that functions will be inlined by the VM as and when necessary, and just not worry about this? How do I profile to observe this behaviour? Suppose there is a main outer function, and I throw a for loop around it and cause a million invocations. Should I expect to see an improvement as the VM inlines more and more?
Yes, Java does inline method calls. The inlining is performed by the JIT compiler, so you won't see it by examining the bytecode files.
Whether inlining actually occurs for a given method call will depend on the size of the method body, and on whether the call is inlineable. (If a method call still requires dynamic dispatching after the JVM has applied its global optimizations designed to remove unnecessary dispatching, then it cannot be inlined.)
The same applies to your example with your outer main function. It depends on how big the method body is. On the other hand, if the method takes a significant time to execute, then the relative importance of the optimization decreases correspondingly.
My advice is to not worry about things like this at this stage. Just write the code clearly and simply, and let the JIT compiler deal with the problem of optimizing. When your application is working, you can profile it and see if there are any "hot spots" in the code that are worthwhile optimizing by hand.
But I should be able to see this in something like VisualVM, right? I mean, initially no inlining, then more and more stuff is inlined, so the average time for the outer method is slightly reduced.
It may be observable and it may not, depending on the amount time spent in making the calls relative to executing the method bodies. (Profiling often relies on sampling the program counter. The reported times may be inaccurate if the number of samples for a given region of code is too small ... and for other reasons.)
It also depends on the JVM you are using. Not all JVMs will re-optimize code that they have previously optimized.
Finally, there is a way to get the JVM to dump the native code output by the JIT compiler. That will give you a definitive answer as to what has been inlined ... if you are prepared to read the machine instructions.
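For reference, the usual way to observe this on HotSpot is with diagnostic JVM flags (availability varies by JVM version; MyBenchmark is a hypothetical class name):

java -XX:+PrintCompilation MyBenchmark
java -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining MyBenchmark
java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly MyBenchmark

The first logs which methods get JIT-compiled, the second logs per-call-site inlining decisions and the reason for each, and the third dumps the generated native code (it requires the hsdis disassembler plugin).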

STM32 programming tips and questions

I could not find any good documentation on the internet about STM32 programming. STM's own documents do not explain anything more than register functions. I would greatly appreciate it if anyone could answer the following questions.
I noticed that in all the example programs that STM provides, local variables for main() are always defined outside of the main() function (with occasional use of the static keyword). Is there any reason for that? Should I follow a similar practice? Should I avoid using local variables inside main()?
I have a global variable which is updated within the clock interrupt handler. I am using the same variable inside another function as a loop condition. Don't I need to access this variable using some form of atomic read operation? How can I know that a clock interrupt does not change its value in the middle of the function's execution? Do I need to disable the clock interrupt every time I use this variable inside a function? (This seems extremely inefficient to me, as I use it as a loop condition. I believe there should be a better way of doing it.)
Keil automatically inserts startup code written in assembly (i.e. startup_stm32f4xx.s). This startup code has the following import statements:
IMPORT SystemInit
IMPORT __main
In C, this makes sense. However, in C++ both main and SystemInit would have mangled names (e.g. _int_main__void). How can this startup code still work in C++ even without using extern "C"? (I tried it and it worked.) How can the C++ linker (armcc --cpp) associate these statements with the correct functions?
You can use local or global variables; using locals in embedded systems carries the risk of your stack colliding with your data, and with globals you don't have that problem. But this is true no matter where you are: embedded microcontroller, desktop, etc.
I would make a copy of the global in the foreground task that uses it.
volatile unsigned int myglobal;    /* shared with the clock interrupt handler */

void fun ( void )
{
    unsigned int myg;

    myg = myglobal;                /* take one snapshot */
    /* ... */
}
and then only use myg for the rest of the function. Basically, you are taking a snapshot and using the snapshot. You would want to do the same thing if you are reading a register: if you want to do multiple things based on a sample of something, take one sample of it and make decisions on that one sample; otherwise the item can change between samples. If you are using one global to communicate back and forth with the interrupt handler, I would use two variables: one for foreground-to-interrupt, the other for interrupt-to-foreground. Yes, there are times where you need to carefully manage a shared resource like that; normally it has to do with cases where you need to do more than one thing. For example, if you have several items that all need to change as a group before the handler can see them change, then you need to disable the interrupt handler until all the items have changed. Here again there is nothing special about embedded microcontrollers; this is all basic stuff you would see on a desktop system with a full-blown operating system.
Keil knows what they are doing; if they support C++ then they have this worked out at a system level. I don't use Keil; I use gcc and llvm for microcontrollers like this one.
Edit:
Here is an example of what I am talking about
https://github.com/dwelch67/stm32vld/tree/master/stm32f4d/blinker05
stm32 using timer-based interrupts: the interrupt handler modifies a variable shared with the foreground task. The foreground task takes a single snapshot of the shared variable (per loop) and, if need be, uses the snapshot more than once in the loop rather than the shared variable, which can change. This is C, not C++, I understand that, and I am using gcc and llvm, not Keil. (Note that llvm has known problems optimizing tight while loops, a very old bug; I don't know why they have no interest in fixing it, but llvm works for this example.)
Question 1: Local variables
The sample code provided by ST is not particularly efficient or elegant. It gets the job done, but sometimes there are no good reasons for the things they do.
In general, you always want your variables to have the smallest scope possible. If you only use a variable in one function, define it inside that function. Add the "static" keyword to local variables if and only if you need them to retain their value between calls.
In some embedded environments, like the PIC18 architecture with the C18 compiler, local variables are much more expensive (more program space, slower execution time) than global. On the Cortex M3, that is not true, so you should feel free to use local variables. Check the assembly listing and see for yourself.
Question 2: Sharing variables between interrupts and the main loop
People have written entire chapters explaining the answers to this group of questions. Whenever you share a variable between the main loop and an interrupt, you should definitely use the volatile keyword on it. Variables of 32 or fewer bits can be accessed atomically (unless they are misaligned).
If you need to access a larger variable, or two variables at the same time from the main loop, then you will have to disable the clock interrupt while you are accessing the variables. If your interrupt does not require precise timing, this will not be a problem. When you re-enable the interrupt, it will automatically fire if it needs to.
Question 3: main function in C++
I'm not sure. You can use arm-none-eabi-nm (or whatever nm is called in your toolchain) on your object file to see what symbol name the C++ compiler assigns to main(). I would bet that C++ compilers refrain from mangling the main function for this exact reason, but I'm not sure.
STM's sample code is not an exemplar of good coding practice; it is merely intended to exemplify use of their standard peripheral library (assuming those are the examples you are talking about). In some cases it may be that variables are declared external to main() because they are accessed from an interrupt context (shared memory). There is also perhaps a possibility that it was done that way merely to allow the variables to be watched in the debugger from any context; but that is not a reason to copy the technique. My opinion of STM's example code is that it is generally pretty poor even as example code, let alone from a software engineering point of view.
In this case your clock-interrupt variable is atomic so long as it is 32 bits or less and you are not using read-modify-write semantics with multiple writers. You can safely have one writer and multiple readers regardless. This is true for this particular platform, but not necessarily universally; the answer may be different for 8- or 16-bit systems, or for multi-core systems, for example. The variable should be declared volatile in any case.
I am using C++ on STM32 with Keil, and there is no problem. I am not sure why you think that the C++ entry points are different; they are not here (Keil ARM-MDK v4.22a). The start-up code calls SystemInit(), which initialises the PLL and memory timing, for example, and then calls __main(), which performs global static initialisation and then calls C++ constructors for global static objects before calling main(). If in doubt, step through the code in the debugger. It is important to note that __main() is not the main() function you write for your application; it is a wrapper with different behaviour for C and C++, but which ultimately calls your main() function.