Confusion regarding reentrant functions - interrupt

My understanding of "reentrant function" is that it's a function that can be interrupted (e.g. by an ISR or a recursive call) and later resumed such that the overall output of the function isn't affected in any way by the interruption.
The following is an example of a reentrant function from Wikipedia: https://en.wikipedia.org/wiki/Reentrancy_(computing)
int t;

void swap(int *x, int *y)
{
    int s;

    s = t; // save global variable
    t = *x;
    *x = *y;
    // hardware interrupt might invoke isr() here!
    *y = t;
    t = s; // restore global variable
}

void isr()
{
    int x = 1, y = 2;
    swap(&x, &y);
}
I was thinking, what if we modify the ISR like this:
void isr()
{
    t = 0;
}
Now let's say the main function calls swap, but this interrupt suddenly occurs in the middle of it. The output would surely get distorted, because the swap would not complete properly, which in my mind makes this function non-reentrant.
Is my thinking right or wrong? Is there some mistake in my understanding of reentrancy?

The answer to your question:
that the main function calls the swap function, but then suddenly an interrupt occurs, then the output would surely get distorted as the swap wouldn't be proper, which in my mind makes this function non-reentrant.
is no, it does not, because re-entrancy is (by definition) defined with respect to the function itself. If isr calls swap, the interrupted swap is still safe. However, swap is not thread-safe: code that simply overwrites t, like your modified isr, can still corrupt the result.
The correct way of thinking about it depends on the precise definitions of re-entrancy and thread-safety (see, for example, Threadsafe vs re-entrant).
Wikipedia, the source of the code in question, selected the definition of reentrant function to be "if it can be interrupted in the middle of its execution and then safely called again ("re-entered") before its previous invocations complete execution".

I have never heard the term re-entrancy used in the context of interrupt service routines. It is generally the responsibility of the ISR (and/or the operating system) to maintain consistency - application code should not need to know anything about what an interrupt might do.
That a function is re-entrant usually means that it can be called from multiple threads simultaneously - or by itself recursively (either directly or through a more elaborate call chain) - and still maintain internal consistency.
For functions to be re-entrant they must generally avoid using static variables and of course avoid calls to other functions that are not themselves re-entrant.
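For example, a version of swap that keeps everything in automatic (stack) storage needs no save/restore of a global at all, and is re-entrant under either definition (a minimal sketch):

void swap(int *x, int *y)
{
    int tmp = *x;   // automatic storage: every invocation, including a nested one, gets its own copy
    *x = *y;
    *y = tmp;
}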

How to disable all optimization when using COSMIC compiler?

I am using the COSMIC compiler in the STVD IDE, and even though optimization is turned off with -no (the documentation says "-no: do not use optimizer"), some lines of code get removed; breakpoints cannot be placed on them, nor are they to be found in the disassembly.
I tried setting -oc (leave removed instructions as comments), but the removed lines did not even show up as comments.
bool foo(void)
{
    uint8_t val;

    if (globalvar > 5)
        val = 0;

    for (val = 0; val < 8; val++)
    {
        // some code...
    }

    return true;
}
I do know it seems idiotic to set val to 0 prior to the for loop, but let's just assume it is necessary for some reason. When I select no optimization, I expect the code not to be optimized, but instead the val = 0; gets removed without a trace.
I am not looking for a workaround like declaring val volatile, which does solve the problem. I am rather looking for a way to prevent the optimization, or at least to understand/know what changes are made to my code when compiling.
It is not clear from the manual, but it seems that the -no option prevents assembly level optimisation. It seems possible that the code generator stage that runs before assembly optimisation may perform higher level optimisation such as redundant code removal.
From the manual:
-cp: disable the constant propagation optimization. By default, when a variable is assigned with a constant, any subsequent access to that variable is replaced by the constant itself until the variable is modified or a flow break is encountered (function call, loop, label ...).
It seems that it is this constant propagation feature that you must explicitly disable.
It is perhaps unusual, but it appears that this compiler optimises by default, distinguishes between compiler optimisations and assembler optimisations (the latter performed on the generated assembly), and then makes you switch off each individual optimisation separately.
To avoid this in the code, rather than switching it off globally, you could initialise val to a non-zero value in this case:
int val = -1;
Then the later assignment to zero will require explicit code. Compared with volatile, this perhaps has the advantage that it will not block optimisations when you do enable them.
I believe that this behaviour is allowed by the C language specification.
You are effectively writing the same value to the same variable either once or twice on successive lines of code, without reading it in between. The compiler may keep that variable in a processor register or a memory location as it sees fit, and it knows that the assignment made when the if clause is actioned is immediately overwritten by the identical assignment at the start of the for loop. Under the as-if rule, the language specification therefore allows the compiler to throw the redundant store away.
The way to force the compiler to perform all read and write accesses to the variable is to use the volatile keyword. That is what it is for.
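As a sketch of what that looks like for the function in question (this is exactly the volatile workaround the questioner mentioned; it is shown here only to illustrate the effect):

bool foo(void)
{
    volatile uint8_t val;       /* every read and write of val must now actually be performed */

    if (globalvar > 5)
        val = 0;                /* no longer removable as a redundant store */

    for (val = 0; val < 8; val++)
    {
        // some code...
    }

    return true;
}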

C++/CLI method calls native method to modify int - need pin_ptr?

I have a C++/CLI method, ManagedMethod, with one output argument that will be modified by a native method as such:
// file: test.cpp

#pragma unmanaged

void NativeMethod(int& n)
{
    n = 123;
}

#pragma managed

void ManagedMethod([System::Runtime::InteropServices::Out] int% n)
{
    pin_ptr<int> pinned = &n;
    NativeMethod(*pinned);
}

void main()
{
    int n = 0;
    ManagedMethod(n);
    // n is now modified
}
Once ManagedMethod returns, the value of n has been modified as I would expect. So far, the only way I've been able to get this to compile is to use a pin_ptr inside ManagedMethod. Is pinning in fact the correct/only way to do this, or is there a more elegant way of passing n to NativeMethod?
Yes, this is the correct way to do it, and it is very highly optimized inside the CLR: the variable gets the [pinned] attribute, so the CLR knows that it stores an interior pointer to an object that should not be moved. Unlike GCHandle::Alloc(), pin_ptr<> can do this without creating another handle; it is recorded in the table that the jitter generates when it compiles the method, and the GC uses that table to know where to look for object roots.
This only ever matters when a garbage collection occurs at the exact same time that NativeMethod() is running. That doesn't happen very often in practice; you'd have to use threads in the program. YMMV.
There is another way to do it that doesn't require pinning but requires a wee bit more machine code:
void ManagedMethod(int% n)
{
    int copy = n;
    NativeMethod(copy);
    n = copy;
}
This works because local variables have stack storage and thus won't be moved by the garbage collector. It does not win any elegance points for style, but it is what I normally use myself, since estimating the side-effects of pinning is not that easy. But, really, don't fear pin_ptr<>.

Will code written in this style be optimized out by RVO in C++11?

I grew up in the days when passing around structures was bad mojo because they are often large, so pointers were always the way to go. Now that C++11 has quite good RVO (right value optimization), I'm wondering if code like the following will be efficient.
As you can see, my class has a bunch of vector structures (not pointers to them). The constructor accepts value structures and stores them away.
My -hope- is that the compiler will use move semantics so that there really is no copying of data going on; the constructor will (when possible) just assume ownership of the values passed in.
Does anyone know if this is true, and happens automagically, or do I need a move constructor with the && syntax and so on?
// ParticleVertex
//
// Class that represents the particle vertices
class ParticleVertex : public Vertex
{
public:
    D3DXVECTOR4 _vertexPosition;
    D3DXVECTOR2 _vertexTextureCoordinate;
    D3DXVECTOR3 _vertexDirection;
    D3DXVECTOR3 _vertexColorMultipler;

    ParticleVertex(D3DXVECTOR4 vertexPosition,
                   D3DXVECTOR2 vertexTextureCoordinate,
                   D3DXVECTOR3 vertexDirection,
                   D3DXVECTOR3 vertexColorMultipler)
    {
        _vertexPosition = vertexPosition;
        _vertexTextureCoordinate = vertexTextureCoordinate;
        _vertexDirection = vertexDirection;
        _vertexColorMultipler = vertexColorMultipler;
    }

    virtual const D3DVERTEXELEMENT9 * GetVertexDeclaration() const
    {
        return particleVertexDeclarations;
    }
};
Yes, indeed you should trust the compiler to optimally "move" the structures:
Want Speed? Pass By Value
Guideline: Don’t copy your function arguments. Instead, pass them by value and let the compiler do the copying
In this case, you'd move the arguments into the constructor call:
ParticleVertex myPV(std::move(pos),
std::move(textureCoordinate),
std::move(direction),
std::move(colorMultipler));
In many contexts, the std::move will be implicit, e.g.
D3DXVECTOR4 getFooPosition() {
    D3DXVECTOR4 result;
    // bla
    return result; // NRVO, std::move only required with MSVC
}

ParticleVertex myPV(getFooPosition(), /* ...remaining arguments... */); // implicit rvalue, moved without an explicit std::move
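If you want the constructor itself to take ownership without extra copies, one option (just a sketch, assuming the D3DX vector types are cheap, memberwise-copyable structs) is to keep the by-value parameters and move them into the members, so no separate && overload is needed:

ParticleVertex(D3DXVECTOR4 vertexPosition,
               D3DXVECTOR2 vertexTextureCoordinate,
               D3DXVECTOR3 vertexDirection,
               D3DXVECTOR3 vertexColorMultipler)
    : _vertexPosition(std::move(vertexPosition))                     // std::move is in <utility>
    , _vertexTextureCoordinate(std::move(vertexTextureCoordinate))
    , _vertexDirection(std::move(vertexDirection))
    , _vertexColorMultipler(std::move(vertexColorMultipler))
{
    // For plain structs of floats like the D3DX vectors the "move" is just a copy,
    // but the same pattern pays off for members that really own resources (e.g. std::vector).
}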
RVO means Return Value Optimization, not "right value optimization".
RVO is an optimization performed by the compiler when a function returns by value and it is clear that the code returns a temporary object created in the body, so the copy can be avoided: the function constructs the returned object directly in the caller's storage.
What C++11 introduces is move semantics. Move semantics allow us to "move" the resource from a certain temporary to a target object.
But a move implies that the object the resource comes from is left in an unusable (unspecified) state afterwards. That is not (I think) what you want in your class, because the vertex data is used by the class whether or not the user calls this function.
So, use the usual return by const reference to avoid copies.
On the other hand, DirectX provides handles to the resources (pointers), not the real resources. Pointers are basic types and copying them is cheap, so don't worry about performance. In your case, you are using 2D/3D vectors; copying them is cheap too.
Personally, I think that returning a pointer to an internal resource is always a very bad idea. In this case, the best approach is to return by const reference.
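By way of illustration, a return-by-const-reference accessor for the class in the question could look like this (the accessor name is made up for the example):

// The caller receives a reference to the stored vector; no copy is made.
const D3DXVECTOR4& GetVertexPosition() const { return _vertexPosition; }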

Overcoming the race condition in lock-free reference-counted dereferences

Imagine a structure like this:
struct my_struct {
    uint32_t refs;
    ...
};
for which a pointer is acquired through a lookup table:
struct my_struct** table;

my_struct* my_struct_lookup(const char* name)
{
    my_struct* s = table[hash(name)];
    /* EDIT: Race condition here. */
    atomic_inc(&s->refs);
    return s;
}
A race exists between the dereference and the atomic increment in a multi-threaded model. Given that this is very performance-critical code, I was wondering how this race between the dereference and the atomic increment is typically resolved or worked around?
EDIT: When acquiring a pointer to a my_struct structure via the lookup table, it is necessary to dereference the structure first in order to increment its reference count. This creates a problem in multi-threaded code: other threads could be altering the reference count and potentially deallocating the object itself, while another thread then dereferences a pointer to memory that no longer exists. Combined with preemption and some bad luck, this can be a recipe for disaster.
As someone said above, you can keep a linked list of memory to free at some later time, so your pointers are never invalid. This is a handy method in some cases.
Or you can make a 64-bit struct containing your 32-bit pointer along with 32 bits for a reference count and other flags. You can use 64-bit atomic ops on the struct if you wrap it in a union:
union my_struct_ref {
    struct {
        unsigned int cRef : 16,     /* reference count */
                     fDeleted : 1;  /* etc. */
        struct my_struct *s;        /* the 32-bit pointer itself */
    } Data;
    uint64_t n64;                   /* the whole thing viewed as a single 64-bit value for CAS */
};
You can work with the Data part of the union in a human-readable way, and you can use CAS on the n64 part.
/* Note: table is now an array of union my_struct_ref rather than of my_struct*. */
my_struct* my_struct_lookup(const char* name)
{
    union my_struct_ref Old, New;
    int iHash = hash(name);

    // concurrency loop
    while (1) {
        Old.n64 = table[iHash].n64;
        if (Old.Data.fDeleted)
            return NULL;
        New.n64 = Old.n64;
        New.Data.cRef++;
        if (CAS(&table[iHash].n64, Old.n64, New.n64)) // CAS = atomic compare and swap
            return New.Data.s; // success
        // We get here if some other thread changed the count or deleted our pointer
        // in between our taking a copy of it in Old. Just loop and try again.
    }
}
If you are using 64-bit pointers, you will need to do a 128-bit CAS.
One solution is to use a freelist, rather than malloc() and free(). This has obvious drawbacks.
Another is to implement lock-free garbage collection (also known as Safe Memory Reclamation).
There are MANY patents in this field, but it appears that epoch-based LFGC is unencumbered.
The upshot of using this method is that elements are only deallocated when no threads are pointing at them.
The former solution is very easy to implement. You need a lock-free freelist, of course, or your overall system is no longer lock-free.
The latter is really not complex, but requires learning the algorithm in question, which takes some time and research.
Besides the race you identified, you have a general problem of memory consistency.
Even if you could make the table modifications atomic in a lock-free fashion, the block of memory my_struct* points to could still be "stale" when seen from a different thread compared to the thread that last modified it. This does not apply to my_struct.refs (provided you always access it using atomics), but does apply to all other fields. This is the consequence of write buffers and caches that are "private" to each CPU core.
The only way to guarantee you are seeing the correct memory content is to use a memory barrier. Yet, a typical lock is also a memory barrier, so why not just use the lock in the first place?
Lock-free programming is much trickier than it may initially seem; OTOH, locks can be very fast, especially when contention is rare. Have you actually benchmarked a lock-based implementation and confirmed that locking is indeed your bottleneck?
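To make the comparison concrete, here is a minimal sketch of the lock-based alternative (it reuses the declarations from the question and assumes a single mutex guarding both the table and every refs field):

#include <mutex>

std::mutex table_lock;   // guards table[] and all refs fields

my_struct* my_struct_lookup_locked(const char* name)
{
    std::lock_guard<std::mutex> guard(table_lock);
    my_struct* s = table[hash(name)];
    if (s)
        ++s->refs;       // safe: no other thread can free s while the lock is held
    return s;
}

As long as every code path that drops a reference or deallocates a my_struct takes the same lock, the dereference and the increment can no longer be interleaved with a free.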

How can bytecode be used for optimizing the execution time of dynamic languages?

I am interested in some optimization methods or general bytecode designs which might help speed up execution using a VM, in comparison to interpretation of an AST.
The main win of bytecode over AST interpretation is operation dispatch cost; for highly optimised interpreters this starts to become a real problem. "Dispatch" is the term used to describe the overhead required to start executing an operation (such as arithmetic, property access, etc.).
A fairly normal AST based interpreter would look something like this:
class ASTNode {
public:
    virtual double execute() = 0;
};

class NumberNode : public ASTNode {
public:
    virtual double execute() { return m_value; }
    double m_value;
};

class AddNode : public ASTNode {
public:
    virtual double execute() { return left->execute() + right->execute(); }
    ASTNode *left, *right;
};
So executing the code for something as simple as 1+1 requires 3 virtual calls. Virtual calls are very expensive (in the grand scheme of things) due to the multiple indirections needed to make the call, and the general cost of making a call in the first place.
In a bytecode interpreter you have a different dispatch model -- rather than virtual calls you have an execution loop, akin to:
while (1) {
    switch (op.type) {
    case op_add:
        // Efficient interpreters use "registers" rather than
        // a stack these days, but the example code would be more
        // complicated
        push(pop() + pop());
        continue;
    case op_end:
        return pop();
    }
}
This still has a reasonably expensive dispatch cost vs native code, but it is much faster than virtual dispatch. You can further improve performance by using a GCC extension called "computed goto", which allows you to remove the switch dispatch, reducing the total dispatch cost to basically a single indirect branch.
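As a minimal, self-contained sketch of what computed-goto dispatch can look like (this relies on the GCC/Clang "labels as values" extension, and the opcode set here is purely illustrative):

#include <cstdio>
#include <vector>

enum OpType { op_push, op_add, op_end };
struct Op { OpType type; double value; };

// Each opcode indexes a table of label addresses, so dispatch is a single
// indirect branch instead of a switch.
double run(const Op* op)
{
    static void* dispatch[] = { &&do_push, &&do_add, &&do_end };
    std::vector<double> stack;

#define NEXT() goto *dispatch[op->type]
    NEXT();

do_push:
    stack.push_back(op->value);
    ++op;
    NEXT();
do_add: {
    double b = stack.back(); stack.pop_back();
    double a = stack.back(); stack.pop_back();
    stack.push_back(a + b);
    ++op;
    NEXT();
}
do_end:
#undef NEXT
    return stack.back();
}

int main()
{
    const Op program[] = { { op_push, 1 }, { op_push, 1 }, { op_add, 0 }, { op_end, 0 } };
    std::printf("%g\n", run(program));   // prints 2
}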
In addition to improving dispatch cost, bytecode-based interpreters have a number of additional advantages over AST interpreters, mostly due to the ability of the bytecode to "directly" jump to other locations as a real machine would. For example, imagine a snippet of code like this:
while (1) {
    ...statements...
    if (a)
        break;
    else
        continue;
}
To implement this correctly in an AST interpreter, every time a statement is executed you would need to indicate whether execution is meant to stay in the loop or stop, so the execution loop becomes something like:
while (condition->execute() == true) {
    bool sawBreak = false;
    for (i = 0; i < statements->length(); i++) {
        result = statements[i]->execute();
        if (result.type == BREAK) {
            sawBreak = true;   // propagate the break out of the source-level loop
            break;
        }
        if (result.type == CONTINUE)
            break;             // stop this iteration and re-evaluate the condition
    }
    if (sawBreak)
        break;
}
As you add more forms of flow control this signalling becomes more and more expensive. Once you add exceptions (i.e. flow control that can happen everywhere) you start needing to check for these things in the middle of even basic arithmetic, leading to ever-increasing overhead. If you want to see this in the real world, I encourage you to look at the ECMAScript spec, where the execution model is described in terms of an AST interpreter.
In a bytecode interpreter these problems basically go away, as the bytecode is able to express control flow directly rather than indirectly through signalling; e.g. continue is simply converted into a jump instruction, and you only pay that cost if it is actually hit.
Finally, an AST interpreter is by definition recursive, and so has to be prevented from overflowing the system stack, which puts very heavy restrictions on how deeply you can recurse in your code; something like:
1+(1+(1+(1+(1+(1+(1+(1+1)))))))
has 8 levels of recursion (at least) in the interpreter -- this can be a very significant cost. Older versions of Safari (pre-SquirrelFish) used an AST interpreter, and for this reason JS was allowed only a couple of hundred levels of recursion, vs the 1000s allowed in modern browsers.
Perhaps you could look at the various passes which the LLVM "opt" tool provides. Those are bytecode-to-bytecode optimisations, and the tool itself will provide analysis of the benefits of applying a particular optimisation.
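For example (the exact flag spellings can vary between LLVM versions):

# Run the standard -O2 optimisation pipeline over a bitcode module:
opt -O2 input.bc -o output.bc

# Emit readable textual IR instead of bitcode so the effect of the passes can be inspected:
opt -O2 -S input.ll -o output.ll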