Memory management in COM

It is very common to allocate a block of memory during the execution of a COM server and then pass that memory block to the client through an output parameter. It is then the client's obligation to free that memory, with functions such as CoTaskMemFree().
The question is: where is this block of memory allocated? Supposing that the COM server and COM client are in different processes, the block SHOULD be allocated in the client's process address space for the client to be able to access it. But is that true? I have heard that COM has a "Task Memory Allocator", but I know little about it.
Just some wild guesses:
First, the COM server allocates the memory with CoTaskMemAlloc() at the request of the COM client.
Then the COM client gets that piece of memory, uses it, and frees it with CoTaskMemFree().
So the "Task Memory Allocator" must keep track of both the client and server processes. Otherwise, it won't know who (the server) did the allocation and who (the client) should be given that memory. Then the allocated memory is somehow injected into the client's process address space.
Could anyone shed some light on this topic?

Well, "task memory allocator" is a COM-owned allocator that exposes those CoTaskMem* functions. Now suppose the client and the server are in different processes and the server uses CoTaskMemAlloc() to allocate an "out" parameter. How does it get to the client?
The COM subsystem does that with marshalling. The server allocates memory and returns control from its COM method implementation. The COM subsystem now has to marshal the call results to the client. It simply takes ownership of that memory and marshals it to the client: a new block is allocated on the client's heap, the data is copied into it, and the block on the server side is freed. The client gets ownership of the new block and must free it later; otherwise the block is leaked.
So the client and server address spaces are always separate, and no direct data access happens. Each side uses its own memory allocator; marshalling steps in between to allocate memory on the client side and free memory on the server side, so that the client gets ownership of a legally allocated block and the server releases ownership of the block it allocated itself.
So to the client it almost looks as if the server allocated the memory and returned it. The one notable exception is that the logical addresses are allowed to differ: say the server allocated memory at address 0x10001000 and returned that address together with the block. The client is not guaranteed to get the block at the same logical address; the address is up to the client-side allocator.
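To make the hand-off above concrete, here is a minimal Python model of it. It does not call COM at all: the `Allocator` class and `marshal_out_param` function are hypothetical stand-ins for each process's task allocator and for the proxy/stub code, and the base addresses are arbitrary. It only shows the ownership flow: the data is copied, the client receives a block at an address chosen by its own allocator, and the server's block is freed.

```python
class Allocator:
    """Hypothetical stand-in for one process's task allocator."""

    def __init__(self, base_address: int) -> None:
        self.next_address = base_address
        self.blocks = {}  # address -> bytearray owned by this "process"

    def alloc(self, data: bytes) -> int:
        address = self.next_address
        self.blocks[address] = bytearray(data)
        self.next_address += 0x1000  # arbitrary spacing between blocks
        return address

    def free(self, address: int) -> None:
        del self.blocks[address]  # KeyError if this allocator never owned it


def marshal_out_param(server: Allocator, client: Allocator, server_address: int) -> int:
    """Model of marshalling an [out] buffer from the server to the client."""
    payload = bytes(server.blocks[server_address])  # serialize the server's block
    client_address = client.alloc(payload)          # allocate on the client side
    server.free(server_address)                     # server side releases its block
    return client_address                           # the client now owns this block


server = Allocator(0x10001000)
client = Allocator(0x20002000)

out_ptr = server.alloc(b"result")  # the server fills the [out] parameter
client_ptr = marshal_out_param(server, client, out_ptr)
print(hex(out_ptr), hex(client_ptr), bytes(client.blocks[client_ptr]))
client.free(client_ptr)            # the client's job, as with CoTaskMemFree()
```

The two printed addresses differ, which is the "logical addresses are allowed to differ" point; the contents are identical.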

If you allocate a block of memory using CoTaskMemAlloc, that memory ends up being allocated using the default per-process heap (source).
The difference between CoTaskMemAlloc and any other allocator, however, is that this allocation is COM-aware, which means that COM is able to marshal this memory across process boundaries (e.g. by copying it) when required.

Where the Task Memory Allocator actually places the memory block is an implementation detail, and I don't think it's relevant for you to know. The important thing is that this memory allocator gives you an interface and mechanism by which you can allocate and deallocate memory across process boundaries.
When you allocate memory with a certain mechanism, whether it be new or CoTaskMemAlloc, you have to use the corresponding deallocation mechanism. So with new you use delete, and with CoTaskMemAlloc you use CoTaskMemFree.
The OLE Allocator will take care of allocating and freeing memory correctly, even if you allocate in process X, pass to process Y and then deallocate.
The point is that you use the same mechanism.
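The pairing rule can be illustrated with a small Python model. Each hypothetical `TrackingAllocator` remembers the blocks it handed out and refuses to free anything else. This is only an analogy: in real C++, freeing `new` memory with CoTaskMemFree (or CoTaskMemAlloc memory with `delete`) is undefined behaviour, typically heap corruption, not a clean error.

```python
class MismatchedFree(Exception):
    """Raised when a block is released through the wrong allocator."""


class TrackingAllocator:
    """Hypothetical allocator that only accepts blocks it allocated itself."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._live = set()
        self._counter = 0

    def alloc(self) -> tuple:
        self._counter += 1
        block = (self.name, self._counter)  # opaque handle tagged with its owner
        self._live.add(block)
        return block

    def free(self, block: tuple) -> None:
        if block not in self._live:
            raise MismatchedFree(f"{self.name} did not allocate {block}")
        self._live.remove(block)


cpp_heap = TrackingAllocator("new/delete")
task_allocator = TrackingAllocator("CoTaskMemAlloc/CoTaskMemFree")

block = task_allocator.alloc()
task_allocator.free(block)  # matching pair: fine

try:
    cpp_heap.free(task_allocator.alloc())  # mismatched pair: rejected
except MismatchedFree as err:
    print("error:", err)
```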

Related

Is there a way to map a host-cached Vulkan buffer to a specific memory location?

Vulkan is able to import host memory using VkImportMemoryHostPointerInfoEXT. I queried the supported memory types for VK_EXTERNAL_MEMORY_HANDLE_TYPE_HOST_ALLOCATION_BIT_EXT but the only kind of memory that was available for it was coherent, which does not work for my use case. The memory needs to use explicit invalidations/flushes for performance reasons. So really, I don't want the API to allocate any host-side memory, I just want to tell it the base address that the buffer should upload from/download to. Otherwise I have to use intermediate copies. Using the address returned by vkMapMemory for the host-side work is not desirable for my use-case.
If the Vulkan implementation does not allow you to import memory allocations as "CACHED", then you can't force it to do so. The API provides the opportunity for the implementation to advertise the ability to import your allocations as "CACHED", but the implementation explicitly refused to do it.
Which probably means that it can't. And you can't make the implementation do something it can't do.
So if you have some API that created and manipulates some memory (which cannot use memory provided by someone else), and the Vulkan implementation won't allow reading from that memory unless it is allowed to remove the cached nature of the allocation, and you need CPU caching of that memory, then you're going to have to fall back on memcpy.
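The decision described above can be sketched as a small Python model. No Vulkan calls are made: `type_flags[i]` stands for `memoryTypes[i].propertyFlags` from vkGetPhysicalDeviceMemoryProperties, `memory_type_bits` stands for the `memoryTypeBits` that vkGetMemoryHostPointerPropertiesEXT reports for the host pointer, and the constant values are those of the Vulkan headers. The helper names are hypothetical.

```python
# Values of VkMemoryPropertyFlagBits from the Vulkan headers.
VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT = 0x1
VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT = 0x2
VK_MEMORY_PROPERTY_HOST_COHERENT_BIT = 0x4
VK_MEMORY_PROPERTY_HOST_CACHED_BIT = 0x8


def pick_cached_import_type(type_flags, memory_type_bits):
    """Index of a HOST_CACHED type the host pointer may be imported as, or None."""
    for index, flags in enumerate(type_flags):
        importable = memory_type_bits & (1 << index)
        if importable and flags & VK_MEMORY_PROPERTY_HOST_CACHED_BIT:
            return index
    return None


def plan_upload(type_flags, memory_type_bits):
    index = pick_cached_import_type(type_flags, memory_type_bits)
    if index is None:
        # The implementation refused a cached import: fall back to memcpy.
        return "memcpy through a mapped staging allocation"
    return f"import the host pointer as memory type {index}"


types = [
    VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT,
    VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT,
    VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_CACHED_BIT,
]
print(plan_upload(types, 0b010))  # only the coherent, uncached type is importable
print(plan_upload(types, 0b100))  # a cached type is importable
```

The point of the model is that the choice is entirely driven by what the implementation reports; nothing the application does can add the cached type to `memory_type_bits`.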
I want to mirror memory between the CPU and GPU so that I can access it from either without an implicit PCI-e bus transfer.
If the GPU is discrete, that's impossible. In a discrete GPU setup, the GPU and the CPU have separate local memory pools, and access to either pool from the other requires some form of PCIe transfer operation. Vulkan lets you pick which one is going to have slower access, but one of them will have slower access to the memory.
If the GPU is integrated, then typically there is only one memory pool and one memory type for it. That type will be both local and coherent (and probably cached too), which represents fast access from both devices.
Whether you use VkImportMemoryHostPointerInfoEXT or vkMapMemory on a heap without DEVICE_LOCAL_BIT, you will typically get a COHERENT memory type.
That is because conventional host heap memory from malloc in C is naturally coherent (CPUs typically have automatic cache-coherency mechanisms). C has no cflush() or cinvalidate().
There is no reason for implicit PCIe transfers when reading or writing such memory from the host side. Of course, a dedicated GPU has to read it somehow, so there will be bus transfers when the device accesses the memory. Alternatively, you can keep an explicit copy in a DEVICE_LOCAL_BIT heap and transfer data between the two explicitly via vkCmdCopy* to keep them in sync.
Actual UMA architectures could have a non-COHERENT memory type, but their memory heap is always advertised as DEVICE_LOCAL_BIT (even if it is main memory).

Are JVM heap/stack different than virtual address space heap/stack?

Memory is divided into "segments" called heap, stack, bss, data, and text. However, the JVM also has these concepts of stack and heap. So how are these two reconciled?
Are they different levels of abstraction, where main memory is one or two levels below the JVM and its "segments" map naturally to the JVM's "segments"? Since the JVM is supposed to be a virtual computer, it seems to me that it emulates what happens underneath, but at a higher level of abstraction.
Sounds to me like you've been reading a textbook or similar. All these terms generally have very precise definitions in books/lectures, but a lot less precise definitions in reality. Therefore what people mean when they say heap is not necessarily exactly the same as what a book etc. says.
Memory is divided into "segments" called heap, stack, bss, data, and text.
This is only true for a typical user-space process. In other words, it will be true for an everyday program written in C or similar; however, it is not true for all programs, and definitely not true for the entire memory space.
When a program is executed, the OS allocates memory for the various segments listed, except the heap. The program can request memory from the OS while it is executing, which allows it to use a different amount of memory depending on its needs. The heap refers to memory requested by the program, usually via a function like malloc. To clarify, the heap typically refers to a managed region of memory, usually managed with malloc/free. It is also possible to request memory directly from the OS in an unmanaged fashion; most people (in my opinion) would not count that as part of the heap.
The stack is a data structure/segment that keeps track of local variables and function calls. It stores important information like where to return after a function call. In C or other "native" languages, the stack is created by the OS and can grow or shrink as needed.
Java allows a program to request memory during execution using new. Memory allocated to a Java program using new is referred to as memory in the Java heap. One could imagine that if you were implementing a JVM, you would use malloc behind the scenes of new. This would result in a Java heap within a regular native heap. In reality, "serious" JVMs do not do this and interact directly with the OS for memory.
In Java, the stack is created by the JVM. One could imagine that it is allocated with malloc, but as with the heap, this is likely not how real-world JVMs do it.
Edit:
A JVM like HotSpot would likely allocate memory directly from the OS. This memory would then go into some kind of pool, from which it is taken as needed. Reasons for needing memory include new, or a stack that needs to grow.
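A toy version of that idea, as a hedged Python sketch: one region is obtained straight from the OS with an anonymous `mmap` (no malloc involved), and pieces of it are handed out with a bump pointer. The `Arena` class is hypothetical and far simpler than any real JVM, which has generations, garbage collection, and separate reserve/commit steps.

```python
import mmap


class Arena:
    """Hypothetical JVM-style memory pool carved out of a single OS mapping."""

    def __init__(self, size: int) -> None:
        # fileno=-1 asks the OS for an anonymous mapping directly, without malloc.
        self.region = mmap.mmap(-1, size)
        self.size = size
        self.top = 0  # bump pointer: bytes below this offset are already handed out

    def take(self, nbytes: int, align: int = 8) -> int:
        """Hand out nbytes from the pool and return their offset in the region."""
        start = (self.top + align - 1) & ~(align - 1)  # round up to the alignment
        if start + nbytes > self.size:
            # A real JVM would collect garbage or map more memory here.
            raise MemoryError("arena exhausted")
        self.top = start + nbytes
        return start


arena = Arena(1 << 16)
obj = arena.take(24)     # e.g. storage for an object created with `new`
frame = arena.take(100)  # e.g. extra room for a growing stack
arena.region[obj:obj + 4] = b"java"
```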

Vulkan on devices that share host memory

For the purpose of this question, we'll say vkMapMemory for all allocations on such a device cannot fail; they are trivially host-visible, and the result is a direct pointer to some other region of host memory (no work needs to be done).
Is there some way to detect this situation?
The purpose in mind is an arena-based allocator that aggressively maps any host-visible memory, and an objective is to avoid redundant allocations on such hardware.
Yes, it can be detected relatively reliably.
If vkGetPhysicalDeviceMemoryProperties reports only one memory heap (which would be labeled VK_MEMORY_HEAP_DEVICE_LOCAL_BIT), then it is certainly the same memory as the host's.
In the words of the specification's authors:
https://www.khronos.org/registry/vulkan/specs/1.0-extensions/html/vkspec.html#memory-device
In a unified memory architecture (UMA) system, there is often only a single memory heap which is considered to be equally “local” to the host and to the device, and such an implementation must advertise the heap as device-local.
In other cases, you know trivially whether the memory is on the host (i.e. a host-memory heap on a dGPU would not have VK_MEMORY_HEAP_DEVICE_LOCAL_BIT set).
That said, implementations for the UMA-based systems described by krOoze have little reason not to expose direct pointers to buffer data.
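The single-heap check above can be written as a few lines of Python operating on a model of the query result: `heap_flags[i]` stands for `memoryHeaps[i].flags` returned by vkGetPhysicalDeviceMemoryProperties, and the constant is the value from the Vulkan headers. The function names are hypothetical. Note that this test can only confirm UMA; a UMA device that reports more than one heap would not be recognized by it.

```python
VK_MEMORY_HEAP_DEVICE_LOCAL_BIT = 0x1  # value from the Vulkan headers


def looks_like_uma(heap_flags):
    """True when there is exactly one heap and it is advertised as device-local."""
    return len(heap_flags) == 1 and bool(heap_flags[0] & VK_MEMORY_HEAP_DEVICE_LOCAL_BIT)


def host_side_heaps(heap_flags):
    """Indices of heaps without DEVICE_LOCAL, i.e. host memory on a discrete GPU."""
    return [i for i, flags in enumerate(heap_flags)
            if not flags & VK_MEMORY_HEAP_DEVICE_LOCAL_BIT]


print(looks_like_uma([VK_MEMORY_HEAP_DEVICE_LOCAL_BIT]))       # integrated GPU model
print(host_side_heaps([VK_MEMORY_HEAP_DEVICE_LOCAL_BIT, 0x0]))  # discrete GPU model
```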
Your question seems to proceed from a false assumption.
Vulkan is not OpenGL. Generally speaking, it does not try to hide things from you. If a memory heap cannot be accessed directly by the CPU, then the Vulkan implementation will not expose a memory type for that heap that is host-visible. Conversely, if a memory heap can be accessed directly by the CPU, then the Vulkan implementation will expose a memory type for that heap that is host-visible.
Therefore, if you can map a device allocation at all in Vulkan, then you should assume that you have a "direct pointer to buffer data".
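Following that reasoning, an arena allocator only needs to look at the memory type flags. A hedged Python sketch on modelled data (constants from the Vulkan headers, function names hypothetical): every HOST_VISIBLE type can be mapped, and a type that is both DEVICE_LOCAL and HOST_VISIBLE lets one allocation serve host and device, which is the case where a redundant staging allocation can be skipped.

```python
VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT = 0x1  # values from the Vulkan headers
VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT = 0x2
VK_MEMORY_PROPERTY_HOST_COHERENT_BIT = 0x4


def mappable_types(type_flags):
    """Memory types whose allocations vkMapMemory can map."""
    return [i for i, f in enumerate(type_flags) if f & VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT]


def single_allocation_types(type_flags):
    """Types that are device-local and host-visible: no separate staging copy needed."""
    both = VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT | VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT
    return [i for i, f in enumerate(type_flags) if (f & both) == both]


types = [
    VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT,
    VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT,
    VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT | VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT
    | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT,
]
print(mappable_types(types), single_allocation_types(types))
```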

Is it possible to access a memory block allocated by cudaHostAlloc with cudaHostAllocPortable flag from a different process?

CUDA documentation says that portable memory blocks can be accessed from all contexts, does this mean we can use such blocks across processes? Specifically, I want to pass this host pointer to a different process that will copy to device.
No, it is only accessible within the same process. You should use the cudaIpc... functions or the OS's IPC.
Portable memory can be used by many host threads, not by other processes. If it is not portable, pinned memory is only treated as pinned by the CUDA context that allocated it (in older CUDA versions, effectively the host thread that allocated it).
You should use IPC to share memory between process.
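As a hedged sketch of the "OS IPC" route, the Python example below has two processes map the same file, so bytes written by a child process appear in the parent's mapping without an explicit copy call. The function name and the choice of a temporary file as the backing object are assumptions for illustration. In a real CUDA program, the receiving process would then copy from its own mapping to the device (for example with cudaMemcpy, possibly after pinning the mapping with cudaHostRegister); those CUDA steps are not shown here.

```python
import mmap
import os
import subprocess
import sys
import tempfile

# Code run in a separate interpreter process: it maps the same file and writes into it.
CHILD = """
import mmap, sys
with open(sys.argv[1], "r+b") as f, mmap.mmap(f.fileno(), 0) as view:
    data = sys.argv[2].encode()
    view[:len(data)] = data
"""


def share_through_file_mapping(message: str, size: int = 4096) -> bytes:
    fd, path = tempfile.mkstemp()
    try:
        os.ftruncate(fd, size)  # give the backing file a size so it can be mapped
        with mmap.mmap(fd, size) as view:
            # A second process writes into its own mapping of the same file...
            subprocess.run([sys.executable, "-c", CHILD, path, message], check=True)
            # ...and this process reads the bytes through its mapping.
            return bytes(view[:len(message.encode())])
    finally:
        os.close(fd)
        os.remove(path)


print(share_through_file_mapping("written by another process"))
```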

How does XCode memory leak detection work?

How does the Xcode Instruments Leaks tool figure out whether an object is a leak or just something that has not been released yet?
I'm pretty new to Objective-C, and the Leaks tool detects a leak in the code I work with, but the code looks sound to me. So I'm wondering how much I can trust this tool.
A "leak" is an object that's still allocated, but to which your application no longer has a reference. Since you no longer have a reference, there's no way you will be able to release the object; thus it's a leak.
As the leaks(1) man page says:
leaks identifies leaked memory -- memory that the application has allocated, but has been lost and cannot be freed. Specifically, leaks examines a specified process's memory for values that may be pointers to malloc-allocated buffers. Any buffer reachable from a pointer in writable memory, a register, or on the stack is assumed to be memory in use. Any buffer reachable from a pointer in a reachable malloc-allocated buffer is also assumed to be in use. The buffers which are not reachable are leaks; the buffers could never be freed because no pointer exists in memory to the buffer, and thus free() could never be called for these buffers.
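The man page's rule is a conservative reachability scan, which can be sketched in Python on a modelled heap. This is not how leaks(1) is implemented internally: `find_leaks` is a hypothetical function, the heap is a dict from block address to the word-sized values stored in the block, and only values equal to a block's start address are treated as pointers (a simplification).

```python
def find_leaks(heap, roots):
    """Conservative reachability scan in the spirit of leaks(1).

    heap maps each allocated block's address to the values stored in it;
    roots are values found in registers, on the stack, or in writable memory.
    Any value that equals a block address is assumed to be a pointer.
    """
    reachable = set()
    pending = [value for value in roots if value in heap]
    while pending:
        block = pending.pop()
        if block in reachable:
            continue
        reachable.add(block)
        # Values stored inside a reachable block may point at further blocks.
        pending.extend(value for value in heap[block] if value in heap)
    return set(heap) - reachable


heap = {
    0x1000: [0x2000, 7],  # reachable from a root, points at 0x2000
    0x2000: [],
    0x3000: [0x4000],     # 0x3000 and 0x4000 point at each other...
    0x4000: [0x3000],     # ...but nothing reachable points at either
}
print(sorted(hex(a) for a in find_leaks(heap, roots=[0x1000, 42])))
```

The two blocks in the cycle are reported even though each is still referenced, because no reachable pointer leads to them; this is also why a retain/release scheme alone cannot reclaim such a cycle.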
You might also want to look into the ObjectAlloc tool in Instruments.