NativeCall Buf lifetime and Garbage Collector - raku

I've got a chunk of memory in a Buf I want to pass in to a C library, but the library will be using the memory beyond the lifetime of a single call.
I understand that can be problematic since the Garbage Collector can move memory around.
For passing in a Str, the NativeCall docs say "If the C function requires the lifetime of a string to exceed the function call, the argument must be manually encoded and passed as CArray[uint8]" and have an example of doing that, essentially:
my $array = CArray[uint8].new($string.encode.list);
My question is: must I do the same thing for a Buf, in case it gets moved by the GC? Or will the GC leave my Buf where it sits? For a short string, that isn't a big deal, but for a large memory buffer, that could potentially be an expensive operation. (See, for example, Archive::Libarchive, to which you can pass a Buf containing a tar file. Is that code problematic?)
multi method open(Buf $data!) {
my $res = archive_read_open_memory $!archive, $data, $data.bytes;
...
Is there (could there be? should there be?) some sort of trait on a Buf that tells the GC not to move it around? I know that could be trouble if I add more data to the Buf, but I promise not to do that. What about for a Blob that is immutable?

You'll get away with this on MoarVM, at least at the moment, provided that you keep a reference to the Blob or Buf alive in Perl 6 for as long as the native code needs it and (in the case of Buf) you don't do a write to it that could cause a resize.
MoarVM allocates the Blob/Buf object inside of the nursery, and will move it during GC runs. However, that object does not hold the data; rather, it holds the size and a pointer to a block of memory holding the values. That block of memory is not allocated using the GC, and so will not move.
+------------------------+
| GC-managed Blob object |
+------------------------+      +------------------------+
| Elements               |----->| Non-GC-managed memory  |
+------------------------+      | (this bit is passed to |
| Size                   |      | native code)           |
+------------------------+      +------------------------+
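The two-level layout described above can be sketched in C. This is only an illustration: `blob_header`, `blob_new`, `blob_move`, and `blob_free` are hypothetical names, not MoarVM's actual structures, and `blob_move` merely imitates what a moving collector does to the small header object.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the GC-managed object: it holds only a
   pointer to the bytes and their count, not the bytes themselves. */
typedef struct {
    unsigned char *elements; /* block obtained from malloc, outside the "GC" */
    size_t size;
} blob_header;

/* Fill a header with a freshly malloc'd copy of `data`. Returns 0 on success. */
int blob_new(blob_header *out, const void *data, size_t size) {
    out->elements = malloc(size ? size : 1);
    if (out->elements == NULL) return -1;
    memcpy(out->elements, data, size);
    out->size = size;
    return 0;
}

/* Imitate a moving collector: copy the header to a new location.
   The data block it points to is not touched. */
void blob_move(blob_header *to, const blob_header *from) {
    *to = *from;
}

/* Release the data block; after this, native code must not use it. */
void blob_free(blob_header *b) {
    free(b->elements);
    b->elements = NULL;
    b->size = 0;
}
```

If native code was handed `a.elements` before `blob_move(&b, &a)`, the pointer it holds is still valid afterwards, because only the header was copied. It stops being valid once the block is freed or reallocated, which is why the reference must stay alive and the Buf must not be resized.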
Whether you should rely on this is a trickier question. Some considerations:
So far as I can tell, things could go rather less well if running on the JVM. I don't know about the JavaScript backend. You could legitimately decide that, due to adoption levels, you're only going to worry about running on MoarVM for now.
Depending on implementation details of MoarVM is OK if you just need the speed in your own code, but if working on a module you expect to be widely adopted, you might want to think if it's worth it. A lot of work is put in by both the Rakudo and MoarVM teams to not regress working code in the module ecosystem, even in cases where it can be well argued that it depended on bugs or undefined behavior. However, that can block improvements. Alternatively, on occasion, the breakage is considered worth it. Either way, it's time consuming, and falls on a team of volunteers. Of course, when module authors are responsive and can apply provided patches, it's somewhat less of a problem.
The problem with "put a trait on it" is that the decision - at least on the JVM - seems to need to be made up front at the time that the memory holding the data is allocated. In which case, a portable solution probably can't allow an existing Buf/Blob to be marked up as such. Perhaps a better way will be for I/O-ish things to be asked to give something CArray-like instead, so that zero-copy can be achieved by having the data in the "right kind of memory" in the first place. That's probably a reasonable feature request.

Related

Read line in Rust

What is the reason for such an implementation of the io::stdin().read_line(&mut input) method? Why not just return a Result with the appropriate error or input? Why pass &mut input? Does this approach have any advantages?
The huge advantage of this is that because you're passing a mutable reference to an existing String, you can reuse a buffer, or pre-allocate it as needed:
use std::io::{self, BufRead}; // BufRead provides read_line on the locked handle
// 2K is a good default buffer size, but definitely do
// analyze the situation and adjust its size accordingly
let mut buffer = String::with_capacity(2048);
// Lock our standard input to eliminate synchronization overhead (unlocks when dropped)
let mut stdin = io::stdin().lock();
// Read our first line.
stdin.read_line(&mut buffer)?;
// This is a stand-in for any function that takes an &str
process_line_1(&buffer);
// Discard the data we've read, but retain the buffer that we have
buffer.clear();
// Reading a second line will reuse the memory allocation:
stdin.read_line(&mut buffer)?;
process_line_2(&buffer)?;
Remember: allocating too much is a lot more efficient than allocating too often. Buffer sharing across different functions may be a little unwieldy due to Rust's borrowing rules (my advice is to have a "cache struct" that keeps empty pre-allocated buffers for a specific function or a collection of APIs), but if you're creating and destroying the buffer within one function, there's minimal work required to get this caching set up and varying potential for performance benefit from caching allocations like this.
What's great about this API is that it not only enables buffer reuse in a clean way, it also encourages it. If you have several .read_line() calls in a row, it immediately feels wrong to create a new buffer for every call. The API design teaches you how to use it efficiently without saying a word. The takeaway is that this tiny trick doesn't just improve performance of I/O code in Rust, it also attempts to guide beginners towards designing their own APIs in this manner, although allocation reuse is sadly often overlooked in third-party APIs. [citation needed]
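The same caller-owns-the-buffer shape exists outside Rust. As a rough analogy only, here is a C sketch built on the standard `fgets`; `read_line_into` is a hypothetical helper written for this illustration, not part of any library.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: read one line into a buffer the caller owns.
   Returns the number of bytes stored (0 at end of input), or -1 on a
   read error or if the buffer is too small to hold anything useful.
   Lines longer than cap - 1 bytes are returned in pieces. */
long read_line_into(FILE *in, char *buf, size_t cap) {
    if (cap < 2) return -1;
    if (fgets(buf, (int)cap, in) == NULL) {
        buf[0] = '\0';
        return ferror(in) ? -1 : 0;
    }
    return (long)strlen(buf);
}
```

A caller declares one `char buf[2048];` and passes it to every call, so no allocation happens per line. Unlike Rust's `String`, the C buffer cannot grow on demand, which is the main place where the analogy breaks down.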
I believe it has to do with considerations about where the line read will live: on the heap or maybe on the stack of the caller or wherever the caller wants it to be.
Note that a function has no way to return a reference to a value that lives on its own stack, as that value wouldn't live long enough. So the only other option would be to allocate it on the heap or copy the whole thing around, neither of which is desirable from the POV of the caller.
(Please take into account that I am a Rust beginner myself, so this answer may be totally wrong. In which case I'm ready to delete it.)

How to implement instance behaviour (for testing) in Cuis/Squeak/Pharo?

I've implemented a few ExternalStructures (as part of an "FFI effort"), and for some of them I want to implement finalization for reclaiming the external memory.
I'm trying to write some tests for that, and thought a good way to know if #finalize is called is to change the behaviour for the particular instance I'm using for testing. I'd rather not pollute the implementation with code for supporting tests if possible.
I believe mocking specific methods and changing specific instance behavior is in general a good tool for testing.
I know it's possible in other dialects, and I've implemented it myself in the past in Squeak using #doesNotUnderstand, but I'd like to know if there's a cleaner way, possibly supported by the VM.
Is there a way to change how a particular instance answers a particular message in Cuis/Squeak/Pharo?
Luciano gave this wonderful example:
EllipseMorph copy compile: 'defaultColor ^Color red'; new :: openInWorld
The mail thread is here:
http://cuis-smalltalk.org/pipermail/cuis-dev_cuis-smalltalk.org/2016-March/000458.html
After dealing with the problem I decided to go for an end to end test, actually verifying the resource (memory in my case) is restored to the system. I had not used instance behavior, though Luciano's and Juan's solution (in a comment) is very interesting. Here's the code I'm using for testing:
testFinalizationReleasesExternalMemory
" WeakArray restartFinalizationProcess "
| handles |
handles := (1 to: 11) collect: [:i |
Smalltalk garbageCollect.
APIStatus create getHandle].
self assert: (handles asSet size) < 11.
In the example, #create uses an FFI call to an external function that allocates memory and returns a pointer (the name create comes from the external API):
create
| answer |
answer := ExternalAPI current createStatus.
self finalizationRegistry add: answer.
^ answer
ExternalAPI here is the FFI interface, #createStatus is the API call that allocates the memory for an APIStatus and returns a pointer to it.
On finalization I call the API which restores the memory:
delete
self finalizationRegistry remove: self ifAbsent: [].
self library deleteStatus: self.
handle := nil.
Where #deleteStatus: is again the API call which frees the memory.
The test assumes that the external library reuses the memory once it's free, especially when the newly allocated block has the same size as the previous one. This is correct in most cases today, but I'd like to see this test fail if it's not, if only to learn something new.
The test allocates 11 external structures, saves their pointers, lets the finalization mechanism free the memory of each one before allocating the next, and then checks whether any of the pointers is repeated. I'm not sure why I decided on 11 pointers as a good number; just 2 should be enough, but memory allocation algorithms are sometimes tricky.
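The assumption the test leans on can be probed directly in C. The sketch below is not the external library from the question; it uses plain `malloc`/`free`, and `count_distinct_addresses` is a hypothetical helper. Nothing in the C standard promises that a just-freed block is handed out again, so the result depends on the allocator.

```c
#include <stdint.h>
#include <stdlib.h>

/* Allocate and immediately free `rounds` blocks of `size` bytes, recording
   each address in `seen` as an integer. `seen` must have room for `rounds`
   entries. Returns how many distinct addresses were observed, or -1 if an
   allocation fails. */
int count_distinct_addresses(size_t size, int rounds, uintptr_t *seen) {
    int distinct = 0;
    for (int i = 0; i < rounds; i++) {
        void *p = malloc(size);
        if (p == NULL) return -1;
        uintptr_t addr = (uintptr_t)p; /* keep the number, not the pointer */
        free(p);
        int repeated = 0;
        for (int j = 0; j < distinct; j++) {
            if (seen[j] == addr) { repeated = 1; break; }
        }
        if (!repeated) seen[distinct++] = addr;
    }
    return distinct;
}
```

With 11 rounds, a result below 11 corresponds to the Smalltalk assertion `handles asSet size < 11`: at least one address came back a second time.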

Does class_getInstanceSize have a known bug about returning incorrect sizes?

Reading through the other questions that are similar to mine, I see that most people want to know why you would need to know the size of an instance, so I'll go ahead and tell you although it's not really central to the problem. I'm working on a project that requires allocating thousands to hundreds of thousands of very small objects, and the default allocation pattern for objects simply doesn't cut it. I've already worked around this issue by creating an object pool class, that allows a tremendous amount of objects to be allocated and initialized all at once; deallocation works flawlessly as well (objects are returned to the pool).
It actually works perfectly and isn't my issue, but I noticed class_getInstanceSize was returning unusually large sizes. For instance, a class that stores one size_t and two (including isA) Class instance variables is reported to be 40-52 bytes in size. I give a range because calling class_getInstanceSize multiple times, even in a row, has no guarantee of returning the same size. In fact, every object but NSObject seemingly reports random sizes that are far from what they should be.
As a test, I tried:
printf("Instance Size: %zu\n", class_getInstanceSize(objc_getClass("MyClassName")));
That line of code always returns a value that corresponds to the size that I've calculated by hand to be correct. For instance, the earlier example comes out to 12 bytes (32-bit) and 24 bytes (64-bit).
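The hand calculation can be written down as a plain C struct. This is a sketch under assumptions: `struct my_instance` and its field names are hypothetical and are not the runtime's real object layout; it only mirrors the three fields mentioned above.

```c
#include <stddef.h>

/* Hypothetical C layout of the instance described above: the isa pointer,
   one more class pointer, and one size_t. */
struct my_instance {
    void  *isa;
    void  *other_class;
    size_t count;
};

/* Size computed by hand: the sum of the field sizes, assuming no padding. */
size_t hand_computed_size(void) {
    return 2 * sizeof(void *) + sizeof(size_t);
}
```

On typical ABIs where pointers and `size_t` have the same size and alignment, `sizeof(struct my_instance)` equals `hand_computed_size()`, giving the 12 bytes (32-bit) and 24 bytes (64-bit) quoted above.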
Thinking that the runtime may be doing something behind the scenes that requires more memory, I watched the actual memory use of each object. For the example given, the only memory read from or written to is in that 12/24 byte block that I've calculated to be the expected size.
class_getInstanceSize acts like this on both the Apple and GNU 2.0 runtimes. So is there a known bug with class_getInstanceSize that causes this behavior, or am I doing something fundamentally wrong? Before you blame my object pool: I've tried this same test in a brand new project, both with the traditional alloc class method and by allocating the object using class_createInstance(self, 0); in a custom class method.
Two things I forgot to mention before: first, I'm almost entirely testing this on my own custom classes, so I know the trickery isn't down to the class actually being a class cluster or any of that nonsense; second, class_getInstanceSize([MyClassName class]) and class_getInstanceSize(self) (run inside a class method) rarely produce the same result, despite both simply referencing isA. Again, this happens in both runtimes.
I think I've solved the problem and it was due to possibly the dumbest reason ever.
I use a profiling/debugging library that is old; in fact, I don't know its actual name (the library is libcsuomm; the header for it has no identifying info). All I know about it is that it was a library available on the computers in the compsci labs (I did a year of Comp-Sci before switching to a Geology major, graduating and never looking back).
Anyway, the point of the library is that it provides a number of profiling and debugging functionalities; the one I use it most for is memory leak detection, since it actually tracks per object unlike my other favorite memory-leak library (now unsupported, MSS) which is based in C and not aware of objects outside of raw allocations.
Because I use it so much when debugging, I always set it up by default without even thinking about it. So even when creating my test projects to try and pinpoint the bug, I set it up without even putting any thought into it. Well, it turns out that the library works by pulling some runtime trickery, so it can properly track objects. Things seem to work correctly now that I've disabled it, so I believe that it was the source of my problems.
Now I feel bad about jumping to conclusions about it being a bug, but at the time I couldn't see anything in my own code that could possibly cause that problem.

What is the performance difference between blocks and callbacks?

One of the things that block objects, introduced in Snow Leopard, are good for is situations that would previously have been handled with callbacks. The syntax is much cleaner for passing context around. However, I haven't seen any information on the performance implications of using blocks in this manner. What, if any, performance pitfalls should I look out for when using blocks, particularly as a replacement for a C-style callback?
The blocks runtime looks pretty tight. Block descriptors and functions are statically allocated, so they could enlarge the working set of your program, but you only "pay" in storage for the variables you reference from the enclosing scope. Non-global block literals and __block variables are constructed on the stack without any branching, so you're unlikely to run into much of a slowdown from that. Calling a block is just result = (*b->__FuncPtr)(b, arg1, arg2); this is comparable to result = (*callback_func_ptr)(callback_ctx, arg1, arg2).
If you think of blocks as "callbacks that write their own context structure and handle the ugly packing, memory management, casting, and dereferencing for you," I think you'll realize that blocks are a small cost at runtime and a huge savings in programming time.
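For comparison, this is roughly what the hand-written callback side looks like in C. The names `scale_ctx`, `scaled_sum`, and `invoke` are made up for this sketch; the call inside `invoke` has the same shape as the `(*callback_func_ptr)(callback_ctx, arg1, arg2)` expression quoted above.

```c
/* Context a C callback needs; the caller packs it by hand. */
struct scale_ctx {
    int factor;
};

/* Callback signature: context first, then the arguments. */
typedef int (*callback_fn)(void *ctx, int arg1, int arg2);

/* Example callback: cast the context back to its real type, then work. */
int scaled_sum(void *ctx, int arg1, int arg2) {
    const struct scale_ctx *c = ctx;
    return c->factor * (arg1 + arg2);
}

/* A "library" routine that knows nothing about the context's type. */
int invoke(callback_fn callback_func_ptr, void *callback_ctx, int arg1, int arg2) {
    return (*callback_func_ptr)(callback_ctx, arg1, arg2);
}
```

Calling `invoke(scaled_sum, &ctx, 2, 5)` with `ctx.factor` set to 3 yields 21. The struct definition, the `void *` cast, and the lifetime of `ctx` are the parts a block would generate and manage for you.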
You might want to check out this blog post and this one. Blocks are implemented as Objective-C objects, except they can be put on the stack, so they don't necessarily have to be malloc'd (if you retain a reference to a block, it will be copied onto the heap, though). They will thus probably perform better than most Objective-C objects, but will have a slight performance hit compared to a simple callback--I'd guess it shouldn't be a problem 95% of the time.

How does it know where my value is in memory?

When I write a program and tell it int c=5, it puts the value 5 into a little bit of its memory, but how does it remember which one? The only way I could think of would be to have another bit of memory to tell it, but then it would have to remember where it kept that as well, so how does it remember where everything is?
Your code gets compiled before execution; at that step, your variable is replaced by the actual address of the space where the value will be stored.
This at least is the general principle. In reality it is far more complicated, but the basic idea is the same.
There are lots of good answers here, but they all seem to miss one important point that I think was the main thrust of the OP's question, so here goes. I'm talking about compiled languages like C++; interpreted ones are much more complex.
When compiling your program, the compiler examines your code to find all the variables. Some variables are going to be global (or static), and some are going to be local. For the static variables, it assigns them fixed memory addresses. These addresses are likely to be sequential, and they start at some specific value. Due to the segmentation of memory on most architectures (and the virtual memory mechanisms), every application can (potentially) use the same memory addresses. Thus, if we assume the memory space programs are allowed to use starts at 0 for our example, every program you compile will put the first global variable at location 0. If that variable was 4 bytes, the next one would be at location 4, etc. These won't conflict with other programs running on your system because they're actually being mapped to an arbitrary sequential section of memory at run time. This is why it can assign a fixed address at compile time without worrying about hitting other programs.
For local variables, instead of being assigned a fixed address, they're assigned a fixed address relative to the stack pointer (which is usually a register). When a function is called that allocates variables on the stack, the stack pointer is simply moved by the required number of bytes, creating a gap in the used bytes on the stack. All the local variables are assigned fixed offsets to the stack pointer that put them into that gap. Every time a local variable is used, the real memory address is calculated by adding the stack pointer and the offset (neglecting caching values in registers). When the function returns, the stack pointer is reset to the way it was before the function was called, thus the entire stack frame including local variables is free to be overwritten by the next function call.
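The difference between the two kinds of storage can be observed from C itself. The sketch below is only a demonstration; `first_global`, `global_address`, and `record_local_addresses` are names invented for it, and the actual numbers printed on any given system will vary.

```c
#include <stdint.h>

int first_global; /* static storage: one fixed location for the whole run */

/* Address of the global, as an integer. */
uintptr_t global_address(void) {
    return (uintptr_t)&first_global;
}

/* Record the address of a local at each recursion depth, from `depth`
   down to 0. `out` must have room for depth + 1 entries. Each live call
   has its own stack frame, so each `local` is a distinct object. */
void record_local_addresses(int depth, uintptr_t *out) {
    volatile int local = depth;
    out[depth] = (uintptr_t)&local;
    if (depth > 0) record_local_addresses(depth - 1, out);
    /* Reading the volatile after the call keeps this frame alive across it. */
    if (local != depth) out[depth] = 0;
}
```

`global_address()` returns the same value every time it is called within one run, while `record_local_addresses(2, addrs)` fills `addrs` with three different addresses, one per stack frame, even though the source code names the same variable each time.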
read Variable (programming) - Memory allocation:
http://en.wikipedia.org/wiki/Variable_(programming)#Memory_allocation
here is the text from the link (if you don't want to actually go there, but you are missing all the links within the text):
The specifics of variable allocation and the representation of their values vary widely, both among programming languages and among implementations of a given language. Many language implementations allocate space for local variables, whose extent lasts for a single function call on the call stack, and whose memory is automatically reclaimed when the function returns. More generally, in name binding, the name of a variable is bound to the address of some particular block (contiguous sequence) of bytes in memory, and operations on the variable manipulate that block. Referencing is more common for variables whose values have large or unknown sizes when the code is compiled. Such variables reference the location of the value instead of storing the value itself, which is allocated from a pool of memory called the heap.
Bound variables have values. A value, however, is an abstraction, an idea; in implementation, a value is represented by some data object, which is stored somewhere in computer memory. The program, or the runtime environment, must set aside memory for each data object and, since memory is finite, ensure that this memory is yielded for reuse when the object is no longer needed to represent some variable's value.
Objects allocated from the heap must be reclaimed, especially when the objects are no longer needed. In a garbage-collected language (such as C#, Java, and Lisp), the runtime environment automatically reclaims objects when extant variables can no longer refer to them. In non-garbage-collected languages, such as C, the program (and the programmer) must explicitly allocate memory, and then later free it, to reclaim its memory. Failure to do so leads to memory leaks, in which the heap is depleted as the program runs, risking eventual failure from exhausting available memory.
When a variable refers to a data structure created dynamically, some of its components may be only indirectly accessed through the variable. In such circumstances, garbage collectors (or analogous program features in languages that lack garbage collectors) must deal with a case where only a portion of the memory reachable from the variable needs to be reclaimed.
There's a multi-step dance that turns c = 5 into machine instructions to update a location in memory.
The compiler generates code in two parts. There's the instruction part (load a register with the address of C; load a register with the literal 5; store). And there's a data allocation part (leave 4 bytes of room at offset 0 for a variable known as "C").
A "linking loader" has to put this stuff into memory in a way that the OS will be able to run it. The loader requests memory and the OS allocates some blocks of virtual memory. The OS also maps the virtual memory to physical memory through an unrelated set of management mechanisms.
The loader puts the data page into one place and instruction part into another place. Notice that the instructions use relative addresses (an offset of 0 into the data page). The loader provides the actual location of the data page so that the instructions can resolve the real address.
When the actual "store" instruction is executed, the OS has to see if the referenced data page is actually in physical memory. It may be in the swap file and have to get loaded into physical memory. The virtual address being used is translated to a physical address of memory locations.
It's built into the program.
Basically, when a program is compiled into machine language, it becomes a series of instructions. Some instructions have memory addresses built into them, and this is the "end of the chain", so to speak. The compiler decides where each variable will be and burns this information into the executable file. (Remember the compiler is a DIFFERENT program to the program you are writing; just concentrate on how your own program works for the moment.)
For example,
ADD [1A56], 15
might add 15 to the value at location 1A56. (This instruction would be encoded using some code that the processor understands, but I won't explain that.)
Now, other instructions let you use a "variable" memory address - a memory address that was itself loaded from some location. This is the basis of pointers in C. You certainly can't have an infinite chain of these, otherwise you would run out of memory.
I hope that clears things up.
I'm going to phrase my response in very basic terminology. Please don't be insulted, I'm just not sure how proficient you already are and want to provide an answer acceptable to someone who could be a total beginner.
You aren't actually that far off in your assumption. The program you run your code through, usually called a compiler (or interpreter, depending on the language), keeps track of all the variables you use. You can think of your variables as a series of bins, and the individual pieces of data are kept inside these bins. The bins have labels on them, and when you build your source code into a program you can run, all of the labels are carried forward. The compiler takes care of this for you, so when you run the program, the proper things are fetched from their respective bin.
The variables you use are just another layer of labels. This makes things easier for you to keep track of. The way the variables are stored internally may have very complex or cryptic labels on them, but all you need to worry about is how you are referring to them in your code. Stay consistent, use good variable names, and keep track of what you're doing with your variables and the compiler/interpreter takes care of handling the low level tasks associated with that. This is a very simple, basic case of variable usage with memory.
You should study pointers.
http://home.netcom.com/~tjensen/ptr/ch1x.htm
Reduced to the bare metal, a variable lookup either reduces to an address that is some statically known offset to a base pointer held in a register (the stack pointer), or it is a constant address (global variable).
In an interpreted language, one register is often reserved to hold a pointer to a data structure (the "environment") that associates variable names with their current values.
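A toy version of such an environment can be written in a few lines of C. This is a deliberately naive sketch (a fixed-size array searched linearly); `struct env`, `env_set`, and `env_get` are hypothetical names, and real interpreters typically use hash tables or precomputed slots instead.

```c
#include <string.h>

#define ENV_CAPACITY 16

/* A toy "environment": names paired with their current values. */
struct binding { const char *name; int value; };
struct env { struct binding slots[ENV_CAPACITY]; int used; };

/* Bind name to value, updating an existing binding if there is one.
   Returns 0 on success, or -1 if the environment is full. */
int env_set(struct env *e, const char *name, int value) {
    for (int i = 0; i < e->used; i++) {
        if (strcmp(e->slots[i].name, name) == 0) {
            e->slots[i].value = value;
            return 0;
        }
    }
    if (e->used == ENV_CAPACITY) return -1;
    e->slots[e->used].name = name;
    e->slots[e->used].value = value;
    e->used++;
    return 0;
}

/* Look a name up at run time. Returns 0 and stores the value in *out,
   or -1 if the name is unbound. */
int env_get(const struct env *e, const char *name, int *out) {
    for (int i = 0; i < e->used; i++) {
        if (strcmp(e->slots[i].name, name) == 0) {
            *out = e->slots[i].value;
            return 0;
        }
    }
    return -1;
}
```

Here the interpreter's equivalent of `c = 5` is `env_set(&e, "c", 5)`: the name survives to run time and is searched for, whereas a compiler would have replaced it with an address or offset before the program ever ran.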
Computers ultimately only understand on and off - which we conveniently abstract to binary. This language is the basest level and is called machine language. I'm not sure if this is folklore - but some programmers used to (or maybe still do) program directly in machine language. Typing or reading in binary would be very cumbersome, which is why hexadecimal is often used to abbreviate the actual binary.
Because most of us are not savants, machine language is abstracted into assembly language. Assembly is a very primitive language that directly controls memory. There are a very limited number of commands (push/pop/add/goto), but these ultimately accomplish everything that is programmed. Different machine architectures have different versions of assembly, but the gist is that there are a few dozen key memory registers (physically in the CPU) - in an x86 architecture they are EAX, EBX, ECX, EDX, ... These contain data or pointers that the CPU uses to figure out what to do next. The CPU can only do 1 thing at a time and it uses these registers to figure out what to do next. Computers seem to be able to do lots of things simultaneously because the CPU can process these instructions very quickly - (millions/billions instructions per second). Of course, multi-core processors complicate things, but let's not go there...
Because most of us are not smart or accurate enough to program in assembly where you can easily crash the system, assembly is further abstracted into a 3rd generation language (3GL) - this is your C/C++/C#/Java etc... When you tell one of these languages to put the integer value 5 in a variable, your instructions are stored in text; the compiler translates your text into machine code (an executable); when the program is executed, the program and its instructions are queued by the CPU, and when it is show time for that specific line of code, it gets read into the CPU registers and processed.
The 'not smart enough' comments about the languages are a bit tongue-in-cheek. Theoretically, the further you get away from zeros and ones to plain human language, the more quickly and efficiently you should be able to produce code.
There is an important flaw here that a few people make, which is assuming that all variables are stored in memory. Well, unless you count the CPU registers as memory, then this won't be completely right. Some compilers will optimize the generated code and if they can keep a variable stored in a register then some compilers will make use of this!
Then, of course, there's the complex matter of heap and stack memory. Local variables can be located in both! The preferred location would be in the stack, which is accessed way more often than the heap. This is the case for almost all local variables. Global variables are often part of the data segment of the final executable and tend to become part of the heap, although you can't release these global memory areas. But the heap is often used for on-the-fly allocations of new memory blocks, by allocating memory for them.
But with Global variables, the code will know exactly where they are and thus write their exact location in the code. (Well, their location from the beginning of the data segment anyways.) Register variables are located in the CPU and the compiler knows exactly which register, which is also just told to the code. Stack variables are located at an offset from the current stack pointer. This stack pointer will increase and decrease all the time, depending on the number of levels of procedures calling other procedures.
Only heap values are complex. When the application needs to store data on the heap, it needs a second variable to store its address, otherwise it could lose track. This second variable is called a pointer and is located as global data or as part of the stack. (Or, on rare occasions, in the CPU registers.)
Oh, it's even a bit more complex than this, but already I can see some eyes rolling due to this information overkill. :-)
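The "second variable" for heap data is easy to show in C. A minimal sketch, with `store_on_heap` as a hypothetical helper name:

```c
#include <stdlib.h>

/* Put an int on the heap and return the pointer that remembers where it is.
   Returns NULL if the allocation fails. */
int *store_on_heap(int value) {
    int *where = malloc(sizeof *where); /* `where` itself is a local variable */
    if (where != NULL) *where = value;
    return where;
}
```

After `int *p = store_on_heap(5);` the value 5 lives at whatever address `malloc` chose, and `p` (a stack variable or a register) is the only record of that address. Overwriting `p` before calling `free(p)` loses track of the block, which is the situation described above.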
Think of memory as a drawer that you divide up according to your needs as they arise.
When you declare a variable of type integer or any other type, the compiler or interpreter (whichever) allocates a memory address in its Data Segment (DS register in assembler) and reserves a certain number of the following addresses depending on your type's length in bits.
As per your question, an integer is 32 bits long, so, from one given address, let's say D003F8AC, the 32 bits following this address will be reserved for your declared integer.
At compile time, wherever you reference your variable, the generated assembler code will replace it with its DS address. So, when you get the value of your variable C, the processor queries the address D003F8AC and retrieves it.
Hope this helps, since you already have many answers. :-)