Memory separation in V8 for different contexts - Chromium

I am new to Chromium and V8. I know that the main thread in the renderer process is bound to an Isolate in V8, and an Isolate can possess several Contexts. But I am confused about the memory layout of an Isolate with multiple Contexts.
V8's documentation says that a Context provides an isolated execution environment for JS scripts running in different frames or worlds. So what is the memory boundary between these contexts? Does each context have a separate stack and heap? And if multiple stacks or heaps exist, how does V8 maintain and switch between them?
Any ideas are appreciated. Thank you.

Isolates provide isolated execution environments; between different contexts within the same isolate, a limited amount of interaction is possible.
Every isolate has one heap (and stack). When there are multiple contexts in an isolate, they all share the same heap.
The "memory boundary" between contexts consists of access checks that V8 performs in certain places.

Related

Why does a GraalVM (SubstrateVM) native image use so much less memory at runtime than a corresponding JIT build?

I'm wondering why a GraalVM (SubstrateVM) native image of a Java application consumes much less memory at runtime, while the same application run normally (with the JIT) consumes a lot more memory.
And why can't the normal JIT be made to similarly consume a small amount of memory?
GraalVM native images don't include the JIT compiler or its related infrastructure -- so there's no need to allocate memory for the JIT, or for the internal representation of the program being compiled (for example a control-flow graph), no need to store some of the class metadata, etc.
So it's unlikely that a JIT which actually does useful work can be implemented with the same zero overhead.
It could be possible to create an economical implementation of the virtual machine that would perhaps use less memory than HotSpot, especially if you only measure the default configuration rather than comparing setups where you control the amount of memory the JVM is allowed to use. However, one needs to realize that it would either be an incremental improvement on the existing implementations or a different choice in some trade-off, because the existing JVM implementations are actually really, really good.

Why do queues in a queue family in Vulkan need priority if we can't distinguish between them?

As asked in the title. My main point is "why", as in: what is the benefit of such a logical structure of queues and queue families?
Do chip/card makers actually etch multiple independent queues onto their chips? That are at the same time separately distinguishable?
Does implementing separate processing units/streams provide any benefit to implementations? And by extension, does it retroactively benefit older APIs such as OpenCL?
I've observed an interesting fact: on my "Intel(R) Core(TM) i3-8100B CPU @ 3.60GHz" Mac Mini, there are 2 GPUs listed in "vulkaninfo.app" (from the LunarG SDK). My bad, the app linked against 2 copies of libMoltenVK.dylib (one in "Contents/Frameworks", one in "/usr/local/lib").
"Why" is not a great question for SO format. It leads to speculation.
The queues are distinguishable in Vulkan: each has an index by which it can be identified. Keep in mind they are largely a driver construct; even when the driver exposes multiple queues, a single queue can typically use all of the GPU's computing resources.
Furthermore, the Vulkan specification does not really say what should happen when you supply a specific priority value; it is perfectly valid for the driver/GPU to ignore it.
Chip makers do have compute units that are independent, and they can theoretically execute different code from each other, but that is not usually advantageous: the usual workload of rendering a regular W × H image saturates all the compute units with the same work.
Why: because you can submit different types of work of different importance, and the priority gives the Vulkan implementation a hint about what you would like to be done first.
Everything else in the question is beside the point:
Do chip/card makers actually etch multiple independent queues onto their chips? That are at the same time separately distinguishable?
Not necessarily; those may be logical queues that are time-sliced.
Does implementing separate processing units/streams provide any benefit to implementations? And by extension, does it retroactively benefit older APIs such as OpenCL?
No; Metal, a contemporary API from Apple, doesn't have a queue count or the concept of queue families at all.
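For illustration only: a priority is just a float in [0.0, 1.0] handed to the driver when the device is created. A minimal sketch (the family index passed in and the queue count of 2 are assumptions) could look like this:

```cpp
#include <vulkan/vulkan.h>

// Create a device with two queues from one family, at different priorities.
// The driver is free to treat the priorities as a mere hint or ignore them.
VkDevice CreateDeviceWithTwoQueues(VkPhysicalDevice gpu, uint32_t familyIndex) {
  float priorities[2] = {1.0f, 0.5f};  // "important" vs "background" work

  VkDeviceQueueCreateInfo queueInfo{};
  queueInfo.sType = VK_STRUCTURE_TYPE_DEVICE_QUEUE_CREATE_INFO;
  queueInfo.queueFamilyIndex = familyIndex;
  queueInfo.queueCount = 2;
  queueInfo.pQueuePriorities = priorities;

  VkDeviceCreateInfo deviceInfo{};
  deviceInfo.sType = VK_STRUCTURE_TYPE_DEVICE_CREATE_INFO;
  deviceInfo.queueCreateInfoCount = 1;
  deviceInfo.pQueueCreateInfos = &queueInfo;

  VkDevice device = VK_NULL_HANDLE;
  vkCreateDevice(gpu, &deviceInfo, nullptr, &device);  // error handling omitted
  return device;
}
```

Whether those two queues end up as distinct hardware queues or as time slices of one is entirely up to the implementation.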

Memory/Address Sanitizer vs Valgrind

I want a tool to diagnose use-after-free and uninitialized-memory bugs. I am considering the Sanitizers (Memory and/or Address) and Valgrind, but I have very little idea about their advantages and disadvantages. Can anyone explain the main features, differences and pros/cons of the Sanitizers and Valgrind?
Edit: I found some comparisons, like: Valgrind uses DBI (dynamic binary instrumentation) and the Sanitizers use CTI (compile-time instrumentation). Valgrind makes the program much slower (around 20x), whereas a Sanitizer's slowdown is much smaller (around 2x). If anyone can give me some more important points to consider, it will be a great help.
I think you'll find this wiki useful.
TL;DR: the main advantages of sanitizers are
much smaller CPU overheads (Lsan is practically free, UBsan/Isan is 1.25x, Asan and Msan are 2-4x for computationally intensive tasks and 1.05-1.1x for GUIs, Tsan is 5-15x)
wider class of detected errors (stack and global overflows, use-after-return/scope)
full support of multi-threaded apps (Valgrind support for multi-threading is a joke)
much smaller memory overhead (up to 2x for Asan, up to 3x for Msan, up to 10x for Tsan which is way better than Valgrind)
Disadvantages are
more complicated integration (you need to teach your build system to understand Asan and sometimes work around limitations/bugs in Asan itself; you also need to use a relatively recent compiler)
MemorySanitizer is not reall^W easily usable at the moment as it requires one to rebuild all dependencies under Msan (including all standard libraries e.g. libc++); this means that casual users can only use Valgrind for detecting uninitialized errors
sanitizers typically can not be combined with each other (the only supported combination is Asan+UBsan+Lsan) which means that you'll have to do separate QA runs to catch all types of bugs
One big difference is that the LLVM-included memory and thread sanitizers implicitly map huge swathes of address space (e.g., by calling mmap(X, Y, 0, MAP_NORESERVE|MAP_ANONYMOUS|MAP_FIXED|MAP_PRIVATE, -1, 0) across terabytes of address space in the x86_64 environment). Even though they don't necessarily allocate that memory, the mapping can play havoc with restrictive environments (e.g., ones with reasonable settings for ulimit values).
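For a concrete comparison, here is a tiny, hypothetical use-after-free reproducer. Building it with `-fsanitize=address -g` yields an ASan report at the offending line, while running an uninstrumented build under Valgrind's Memcheck flags the same invalid read:

```cpp
#include <iostream>

int main() {
  int* data = new int[4];
  data[0] = 42;
  delete[] data;                 // memory is freed here...
  std::cout << data[0] << '\n';  // ...heap-use-after-free on this read
  return 0;
}
```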

Should I try to use as many queues as possible?

On my machine I have two queue families, one that supports everything and one that only supports transfer.
The queue family that supports everything has a queueCount of 16.
Now the spec states
Command buffers submitted to different queues may execute in parallel or even out of order with respect to one another
Does that mean I should try to use all available queues for maximal performance?
Yes, if you have workloads that are highly independent, use separate queues.
If the queues need a lot of synchronization between themselves, it may kill any potential benefit you may get.
Basically, within the same queue family, what you are doing is supplying the GPU with alternative work it can choose from to fill stalls, bubbles and idle time. There is also some potential to make better use of the CPU (e.g. single-threaded submission vs. one queue per thread).
Using separate transfer queues (or another specialized family) even seems to be the recommended approach.
That is generally speaking. A more realistic, empirical, sceptical and practical view was already presented in the SW and NB answers. In reality one has to be a bit more cautious, as those queues target the same resources, have the same limits, and share other common restrictions, which limits the potential benefit. Notably, if the driver does the wrong thing with multiple queues, it may be very bad for the caches.
AMD's Leveraging asynchronous queues for concurrent execution (2016) discusses a bit how this maps to their HW/driver. It shows the potential benefits of using separate queue families. It says that although they offer two queues in the compute family, they did not observe benefits in apps at that time. They also explain why they have only one graphics queue.
NVIDIA seems to have a similar notion of "async compute", shown in Moving to Vulkan: Asynchronous compute.
To be safe, it seems we should still stick with only one graphics queue and one async compute queue on current HW. 16 queues seem like a trap and a way to hurt yourself.
With transfer queues it is not as simple as it seems either. You should use the dedicated ones for Host->Device transfers, and the non-dedicated ones for Device->Device transfer ops.
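As a hedged sketch of that advice (the helper name and the fallback policy are my own assumptions, not part of the answers above), picking a graphics-capable family plus a dedicated transfer-only family could look like this:

```cpp
#include <vulkan/vulkan.h>
#include <cstdint>
#include <vector>

struct QueueFamilyPick {
  uint32_t graphicsFamily = UINT32_MAX;
  uint32_t transferFamily = UINT32_MAX;
};

QueueFamilyPick PickFamilies(VkPhysicalDevice gpu) {
  uint32_t count = 0;
  vkGetPhysicalDeviceQueueFamilyProperties(gpu, &count, nullptr);
  std::vector<VkQueueFamilyProperties> families(count);
  vkGetPhysicalDeviceQueueFamilyProperties(gpu, &count, families.data());

  QueueFamilyPick pick;
  for (uint32_t i = 0; i < count; ++i) {
    VkQueueFlags flags = families[i].queueFlags;
    if ((flags & VK_QUEUE_GRAPHICS_BIT) && pick.graphicsFamily == UINT32_MAX)
      pick.graphicsFamily = i;
    // "Dedicated" transfer family: TRANSFER set, GRAPHICS and COMPUTE clear.
    if ((flags & VK_QUEUE_TRANSFER_BIT) &&
        !(flags & (VK_QUEUE_GRAPHICS_BIT | VK_QUEUE_COMPUTE_BIT)))
      pick.transferFamily = i;
  }
  if (pick.transferFamily == UINT32_MAX)        // no dedicated family found:
    pick.transferFamily = pick.graphicsFamily;  // fall back to the graphics one
  return pick;
}
```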
To what end?
Take the typical structure of a deferred renderer. You build your g-buffers, do your lighting passes, do some post-processing and tone mapping, maybe throw in some transparent stuff, and then present the final image. Each process depends on the previous process having completed before it can begin. You can't do your lighting passes until you've finished your g-buffer. And so forth.
How could you parallelize that across multiple queues of execution? You can't parallelize the g-buffer building or the lighting passes, since all of those commands are writing to the same attached images (and you can't do that from multiple queues). And if they're not writing to the same images, then you're going to have to pick a queue in which to combine the resulting images into the final one. Also, I have no idea how depth buffering would work without using the same depth buffer.
And that combination step would require synchronization.
Now, there are many tasks which can be parallelized. Doing frustum culling. Particle system updates. Memory transfers. Things like that; data which is intended for the next frame. But how many queues could you realistically keep busy at once? 3? Maybe 4?
Not to mention, you're going to need to build a rendering system which can scale. Vulkan does not require that implementations provide more than 1 queue. So your code needs to be able to run reasonably on a system that only offers one queue as well as a system that offers 16. And to take advantage of a 16 queue system, you might need to render very differently.
Oh, and be advised that if you ask for a bunch of queues, but don't use them, performance could be impacted. If you ask for 8 queues, the implementation has no choice but to assume that you intend to be able to issue 8 concurrent sets of commands. Which means that the hardware cannot dedicate all of its resources to a single queue. So if you only ever use 3 of them... you may be losing over 50% of your potential performance to resources that the implementation is waiting for you to use.
Granted, the implementation could scale such things dynamically. But unless you profile this particular case, you'll never know. Oh, and if it does scale dynamically... then you won't be gaining a whole lot from using multiple queues like this either.
Lastly, there has been some research into how effective multiple queue submissions can be at keeping the GPU fed, on several platforms (read all of the parts). The general long and short of it seems to be that:
Having multiple queues executing genuine rendering operations isn't helpful.
Having a single rendering queue with one or more compute queues (either as actual compute queues or graphics queues you submit compute work to) is useful at keeping execution units well saturated during rendering operations.
That strongly depends on your actual scenario and setup. It's hard to tell without any details.
If you submit command buffers to multiple queues you also need to do proper synchronization, and if that's not done right you may get actually worse performance than just using one queue.
Note that even if you submit to only one queue, an implementation may execute command buffers in parallel and even out-of-order (aka "in-flight"); see the details in chapter 2.2 of the spec or this AMD presentation.
If you do compute and graphics, using separate queues with simultaneous submissions (and synchronization between them) will improve performance on hardware that supports async compute.
So there is no definitive yes or no on this without knowing about your actual use case.
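A minimal sketch of that graphics + async-compute pattern (all handles are hypothetical, and the wait stage is a guess that depends on where the compute results are consumed):

```cpp
#include <vulkan/vulkan.h>

// The compute submission signals a semaphore; the graphics submission on a
// different queue waits on it before reading the compute results.
void SubmitComputeThenGraphics(VkQueue computeQueue, VkQueue graphicsQueue,
                               VkCommandBuffer computeCmd, VkCommandBuffer graphicsCmd,
                               VkSemaphore computeDone, VkFence frameFence) {
  VkSubmitInfo computeSubmit{};
  computeSubmit.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
  computeSubmit.commandBufferCount = 1;
  computeSubmit.pCommandBuffers = &computeCmd;
  computeSubmit.signalSemaphoreCount = 1;
  computeSubmit.pSignalSemaphores = &computeDone;
  vkQueueSubmit(computeQueue, 1, &computeSubmit, VK_NULL_HANDLE);

  VkPipelineStageFlags waitStage = VK_PIPELINE_STAGE_VERTEX_SHADER_BIT;  // assumption
  VkSubmitInfo graphicsSubmit{};
  graphicsSubmit.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
  graphicsSubmit.waitSemaphoreCount = 1;
  graphicsSubmit.pWaitSemaphores = &computeDone;
  graphicsSubmit.pWaitDstStageMask = &waitStage;
  graphicsSubmit.commandBufferCount = 1;
  graphicsSubmit.pCommandBuffers = &graphicsCmd;
  vkQueueSubmit(graphicsQueue, 1, &graphicsSubmit, frameFence);
}
```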
Since you can submit multiple independent workloads in the same queue, and there doesn't seem to be any implicit ordering guarantee among them, you don't really need more than one queue to saturate the queue family. So I guess the sole purpose of multiple queues is to allow for different priorities among the queues, as specified during device creation.
I know this answer is in direct contradiction to the accepted answer, but that answer fails to address the issue that you don't need more queues to send more parallel work to the device.
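To illustrate that point, several independent command buffers can go into a single submission on one queue; the implementation is then free to overlap or reorder them unless you add explicit synchronization (the handles below are hypothetical):

```cpp
#include <vulkan/vulkan.h>

// One queue, two independent workloads in a single vkQueueSubmit call.
void SubmitIndependentWork(VkQueue queue, VkCommandBuffer workA,
                           VkCommandBuffer workB, VkFence fence) {
  VkCommandBuffer buffers[2] = {workA, workB};

  VkSubmitInfo submit{};
  submit.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
  submit.commandBufferCount = 2;
  submit.pCommandBuffers = buffers;

  vkQueueSubmit(queue, 1, &submit, fence);  // error handling omitted
}
```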

Memory Map for RTOS

I am looking to understand what purpose a memory map serves in an embedded system.
How does the function stack differ here from a normal Unix system?
Any insights that can help me debug a few memory-related crashes on an embedded system would be helpful.
Embedded systems, especially real-time ones, often have a lot of statically-allocated data, and/or data placed at specific locations in memory. The memory map tells you where these things are, which can be helpful when you run into problems and need to examine the state of the system. For example, you might dump all of memory and then analyze it after the fact; in such a case, the memory map will be rather handy for finding the objects you suspect might be related to the problem.
On the code side, your system might log a hardware exception that points to the address of the instruction where the exception was detected. Looking up the memory locations of functions, combined with a disassembly of the function, can help you analyze such problems.
The details really depend on what kind of embedded system you're building. If you provide more details, people may be able to give better responses.
I am not sure that I understand the question. You seem to be suggesting that a "memory map" is something unique to embedded systems or that it is a tangible software component. It is neither; it is merely a description of the layout of an application's memory usage.
All applications will have a memory map regardless of platform, the difference is that typically on an embedded system the application is linked as a single monolithic entity, so that the resultant memory layout refers to the entire system rather than an individual process as it might in an application on a GPOS platform.
It is the linker and the linker script that determines memory mapping, and your linker will be able to output a map report file that describes the layout and allocation applied. This is true of embedded and desktop applications regardless of OS or architecture.
The memory map for a RTOS is not that much different than the memory map for any computer. It defines which hardware resides at which of the processor's addresses. That hardware may be RAM, ROM, Flash, serial ports, parallel ports, timers, interrupt vectors, or any number of other parts addressable by the processor.
The memory map also describes how you intend to budget for limited resources such as RAM, ROM, or Flash in your system design.
For instance, if there are multiple tasks running, RAM might be mapped so that each task has its own specific area of RAM allocated to it.
In turn, each task's part of RAM would be mapped so that there are specific areas for the stack, another for static variables, and perhaps more again for heap(s).
When you have an operating system on the target, it looks after a lot of this dynamically. However, if your application is the only software on the device, you'll have to manage these decisions yourself, usually at compile/link time. Search for "linker scripts" for further clues.
The memory map is a layout of the system's memory. It exists for both embedded systems and normal applications, but its usefulness is better appreciated in embedded systems due to tighter system constraints.
The memory map is managed by means of linker scripts or linker command files. It maps resources such as Flash, internal RAM (L1P, L1D, L2, L3), external RAM (DDR), ROM, peripherals (serial/parallel ports, USB, etc.), specific device registers, and I/O ports to fixed addresses in the memory space of the system.
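As a hedged sketch (the addresses and section names below are hypothetical; real values come from the chip's reference manual and the linker script), this is how such a mapping typically surfaces in embedded C/C++ code:

```cpp
#include <cstdint>

// A memory-mapped peripheral register at a fixed address from the memory map.
constexpr std::uintptr_t kUart0DataAddr = 0x4000C000u;  // hypothetical address
volatile std::uint32_t* const UART0_DATA =
    reinterpret_cast<volatile std::uint32_t*>(kUart0DataAddr);

// Placing an object into a named linker section (GCC/Clang attribute); the
// linker script decides which physical RAM region ".dma_buffers" maps to.
__attribute__((section(".dma_buffers")))
static std::uint8_t dma_buffer[1024];

void UartSend(std::uint8_t byte) {
  *UART0_DATA = byte;  // the write goes to the peripheral, not to ordinary RAM
}
```

Most toolchains can also emit a map report at link time (for example `-Wl,-Map=app.map` with a GNU toolchain), which is the file the other answers refer to.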
In embedded systems, based on the board's memory configuration or constraints and on performance requirements, segments such as the text segment, data segment, or BSS can also be placed in the appropriate memory of choice.
There are occasions where different versions of a development board have different configurations of memory and peripherals. In that case we may need to edit the linker scripts to match the board's memory configuration and peripherals, which is an essential checkpoint during board bring-up.
The memory map can also help define shared memory, which can play a key role in multi-threaded and multi-core applications.
Crashes can be debugged by back-tracing the crash address and mapping it into the system's memory map to get a high-level idea of the library or object that may be causing the problem.