Vulkan API calls to GPU drivers

Background:
I have been considering writing an application which needs very basic but fast graphics (just drawing lines and squares), and I'm probably going to use a library such as GLFW, or Vulkano if I'm going with Rust.
I want to understand a specific and, I guess, quite practical detail of the Vulkan API. I understand that GPUs can be a complicated topic, but I want to emphasize that I don't have any background in low-level graphics or Vulkan, so I understand if my question cannot be answered, or if it doesn't even make sense. I'll try my best to use the correct terminology. I have to admit, I'm not good at skimming through large amounts of source code I don't quite understand while still grasping the overall concept, which is why I hope I can find my answer here. I've tried looking at the source code for Vulkan and the Mesa drivers, but it bore no fruit.
ORIGINAL Question:
I want to understand how an API call is propagated to the GPU driver.
I have searched around, but couldn't find the specifics I am searching for. The closest posts I've found are these two:
https://softwareengineering.stackexchange.com/questions/279069/how-does-a-program-talk-to-a-graphics-card
https://superuser.com/questions/461022/how-does-the-cpu-and-gpu-interact-in-displaying-computer-graphics
They both mention something similar to "In order to make the GPU do something, you have to make a call via a supported API". I know that, but neither of the two digs into the specifics of how that API call is made. Hopefully, the diagram below illustrates my question.
MyVulkanProgram.c with "#include <vulkan/vulkan.h>"
|
| (Makes call via Vulkan API)
v
This is the part I don't understand!
|
v
Driver (Mesa, for example) takes the request sent via the Vulkan API.
|
| (Driver asks GPU to perform task)
v
GPU does task
I don't care what the GPU does or how it does it. Just how it is invoked through the Vulkan API call and how the call propagates through the system. Ideally, what I'm looking for is a code snippet or a link to where in the Vulkan source code the actual request is sent to the driver.
Or have I gotten it all wrong? Is Vulkan more a part of the driver than I realize? Is it maybe the case that the driver includes the same Vulkan header as my "MyVulkanProgram.c", and that the driver is linked together with library files such as libvulkan.so et al.? Is it more like the diagram below?
MyVulkanProgram.c with "#include <vulkan/vulkan.h>"
|
| (Makes call via Vulkan API)
v
Driver (Mesa, for example, which includes the vulkan headers and is linked with the Vulkan shared object-files) takes the request sent via the Vulkan API.
|
| (Driver asks GPU to perform task)
v
GPU does task
Might be a basic question, might not be, but I'm confused nonetheless. Very thankful for any answers!
UPDATED Question:
After having read the answer from krOoze, and given the "Vulkan loader" overview figure in the mentioned document, I can express my question more precisely.
How does an application, making a call via the Vulkan API, reach the ICD via the Vulkan loader?

You are looking for the Vulkan-Loader/LoaderAndLayerInterface.md documentation.
The app interfaces with the Loader (sometimes called Vulkan RT, or Vulkan Runtime). That is the vulkan-1.dll (or the equivalent .so on Linux).
The Loader also has vulkan-1.lib, which is a classic DLL shim. It is where the loading of core-version and WSI commands happens, but you can skip the lib and do it all manually, directly from the DLL, using vkGetInstanceProcAddr.
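For illustration, here is a minimal sketch of skipping the lib and bootstrapping everything from the loader library by hand (assumptions: Linux with libvulkan.so.1 installed; error handling trimmed; VK_NO_PROTOTYPES just tells the header not to declare the functions we resolve ourselves):

#define VK_NO_PROTOTYPES
#include <vulkan/vulkan.h>
#include <dlfcn.h>
#include <stdio.h>

int main(void) {
    /* Open the Loader by hand instead of linking against the shim. */
    void *loader = dlopen("libvulkan.so.1", RTLD_NOW | RTLD_LOCAL);
    if (!loader) { fprintf(stderr, "no Vulkan loader: %s\n", dlerror()); return 1; }

    /* The one symbol resolved by name; every other command comes from it. */
    PFN_vkGetInstanceProcAddr vkGetInstanceProcAddr =
        (PFN_vkGetInstanceProcAddr)dlsym(loader, "vkGetInstanceProcAddr");

    /* Global commands are fetched with a NULL instance. */
    PFN_vkCreateInstance pfnCreateInstance =
        (PFN_vkCreateInstance)vkGetInstanceProcAddr(NULL, "vkCreateInstance");

    VkInstanceCreateInfo ci = { .sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO };
    VkInstance instance;
    if (pfnCreateInstance(&ci, NULL, &instance) != VK_SUCCESS) return 1;

    /* Instance-level commands dispatch through the loader trampoline,
       through any enabled layers, and finally into the ICD. */
    PFN_vkEnumeratePhysicalDevices pfnEnumerate = (PFN_vkEnumeratePhysicalDevices)
        vkGetInstanceProcAddr(instance, "vkEnumeratePhysicalDevices");

    uint32_t count = 0;
    pfnEnumerate(instance, &count, NULL);
    printf("%u physical device(s)\n", count);
    return 0;
}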
Then you have the ICDs (Installable Client Drivers). Those are something like nvoglv64.dll, and you can have several of them on your PC (e.g. Intel iGPU + NV). The name is arbitrary and vendor-specific. The Loader finds them via config files.
Now, when you call a command obtained with vkGetInstanceProcAddr (which is everything, if you use only the *.lib), you land on a loader trampoline, which calls a chain of layers, after which the relevant ICD (or all of them) is called. Then the call stack is unwound, so it goes the other direction until control is returned to the app. The Loader mutexes and merges the input and output to the ICDs.
Commands obtained with vkGetDeviceProcAddr are a little more streamlined: they do not need to be mutexed or merged, and they are meant to be passed to the ICD without much intervention from the Loader.
The code is also in the same repo: trampoline.c and loader.c. It's pretty straightforward; every layer just calls the layer below it. It starts at the trampoline and ends with the terminator, which in turn calls the ICD.
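To make the trampoline idea concrete, here is a conceptual sketch (this is not the loader's actual code; the names are invented for illustration). Every dispatchable handle starts with a pointer to a dispatch table, and the exported entry point simply forwards through it:

#define VK_NO_PROTOTYPES
#include <vulkan/vulkan.h>

/* Every dispatchable handle (VkInstance, VkDevice, VkQueue, ...) begins
   with a pointer to a table with one entry per command. */
typedef struct DispatchTable {
    VkResult (*EnumeratePhysicalDevices)(VkInstance, uint32_t *, VkPhysicalDevice *);
    /* ...more entries, one per command... */
} DispatchTable;

/* The trampoline the app actually calls: read the table out of the handle
   and jump to the top of the layer chain. */
VkResult vkEnumeratePhysicalDevices(VkInstance instance,
                                    uint32_t *count, VkPhysicalDevice *devices) {
    DispatchTable *table = *(DispatchTable **)instance;
    return table->EnumeratePhysicalDevices(instance, count, devices);
}

/* Each enabled layer's entry does its own work and then calls the next
   table down; the terminator at the bottom fans the call out to the ICD(s). */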

Related

SDL_CreateRenderer fails when compiled with software rendering only

I'm trying to compile a project (tyr-quake) with my custom build of SDL2. My SDL2 build, among other things, disables all accelerated video (OpenGL, OpenGL ES, Vulkan, Metal, etc.), X11, and Wayland, but enables KMSDRM.
All is well, and the project I wanted to compile with this build of SDL2 compiled too. Except that when running, SDL_CreateRenderer fails with Couldn't find matching render driver (even if I modify the source to pass it SDL_RENDERER_SOFTWARE and set the SDL_HINT_FRAMEBUFFER_ACCELERATION hint to "0").
I looked around the SDL source code a bit, and the software SW_CreateRenderer is indeed being called, but later on (in SDL_CreateWindowTexture) SDL still wants to create a renderer using a different render driver (it explicitly avoids the software one).
I also tried patching the source code to do the following:
SDL_Surface *surface = SDL_GetWindowSurface(sdl_window);
renderer = SDL_CreateSoftwareRenderer(surface);
But that also failed, as SDL_GetWindowSurface fails with No hardware accelerated renderers available and returns NULL.
My question is: is there a way to have software-only rendering with SDL when using KMSDRM, or am I required to have some hardware-accelerated rendering option enabled and available?
I think I figured this out on my own.
It is not possible out of the box. But if one wants to do it, implementing CreateWindowFramebuffer, UpdateWindowFramebuffer, and DestroyWindowFramebuffer, and setting the appropriate function pointers, should grant you the ability to create a purely software-based renderer (see the sketch below). Sadly, I don't know KMS and DRM well enough to implement this myself.
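For anyone attempting it, a hedged sketch of what the wiring might look like (the callback names and signatures follow SDL2's internal SDL_sysvideo.h; the drm_* helpers are hypothetical stand-ins for DRM dumb-buffer code such as DRM_IOCTL_MODE_CREATE_DUMB):

/* Inside a modified SDL2 KMSDRM video driver (sketch, not working code). */
static int KMSDRM_CreateWindowFramebuffer(_THIS, SDL_Window *window,
                                          Uint32 *format, void **pixels, int *pitch)
{
    /* Allocate a dumb buffer, mmap() it, and hand SDL the raw pixels. */
    *format = SDL_PIXELFORMAT_ARGB8888;
    return drm_create_dumb_buffer(window, pixels, pitch);   /* hypothetical */
}

static int KMSDRM_UpdateWindowFramebuffer(_THIS, SDL_Window *window,
                                          const SDL_Rect *rects, int numrects)
{
    /* Present the buffer, e.g. via drmModePageFlip or drmModeSetCrtc. */
    return drm_present(window);                              /* hypothetical */
}

static void KMSDRM_DestroyWindowFramebuffer(_THIS, SDL_Window *window)
{
    drm_destroy_dumb_buffer(window);                         /* hypothetical */
}

/* ...and in the driver's CreateDevice function: */
device->CreateWindowFramebuffer  = KMSDRM_CreateWindowFramebuffer;
device->UpdateWindowFramebuffer  = KMSDRM_UpdateWindowFramebuffer;
device->DestroyWindowFramebuffer = KMSDRM_DestroyWindowFramebuffer;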

What happens if an MPI process crashes?

I am evaluating different multiprocessing libraries for a fault tolerant application. I basically need any process to be allowed to crash without stopping the whole application.
I can do it using the fork() system call, as sketched below. The limit here is that the process can only be created on the same machine.
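(A minimal sketch of the fork()-based pattern I mean, POSIX only; the parent outlives a crashing child and simply respawns it:)

#include <sys/wait.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>

static void worker(void) { /* real work goes here; it may crash */ exit(0); }

int main(void) {
    for (;;) {
        pid_t pid = fork();
        if (pid == 0) worker();              /* child */
        int status;
        waitpid(pid, &status, 0);            /* parent outlives the child */
        if (WIFSIGNALED(status)) {
            fprintf(stderr, "worker crashed (signal %d), respawning\n",
                    WTERMSIG(status));
            continue;                        /* create a replacement process */
        }
        break;                               /* clean exit: we're done */
    }
    return 0;
}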
Can I do the same with MPI? If a process created with MPI crashes, can the parent process keep running and eventually create a new process?
Is there any alternative (possibly multiplatform and open source) library to get the same result?
As reported here, MPI 4.0 will have support for fault tolerance.
If you want collectives, you're going to have to wait for MPI-3.something (as High Performance Mark and Hristo Iliev suggest).
If you can live with point-to-point, and you are a patient person willing to raise a bunch of bug reports against your MPI implementation, you can try the following:
disable the default MPI error handler
carefully check every single return code from your MPI calls
keep track in your application of which ranks are up and which are down. Oh, and when they go down, they can never come back. But you're unable to use collectives anyway (see my opening statement), so that's not a huge deal, right? (A sketch of this style follows the list.)
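A minimal sketch of that style, using only standard MPI calls (two ranks; the rank-liveness bookkeeping is left as comments):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    /* Step 1: stop MPI from aborting the whole job on the first error. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    int rank, size, rc;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Step 2: check every single return code. */
    int data = rank;
    if (rank == 0 && size > 1) {
        rc = MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        if (rc != MPI_SUCCESS) {
            char msg[MPI_MAX_ERROR_STRING]; int len;
            MPI_Error_string(rc, msg, &len);
            fprintf(stderr, "rank 0: send failed (%s)\n", msg);
            /* Step 3: mark rank 1 as down for good in your own table. */
        }
    } else if (rank == 1) {
        rc = MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        if (rc != MPI_SUCCESS) { /* same bookkeeping on the receive side */ }
    }
    MPI_Finalize();
    return 0;
}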
Here's an old paper (from back when Bill still worked at Argonne; I think it's from 2003):
http://www.mcs.anl.gov/~lusk/papers/fault-tolerance.pdf. It lays out the kinds of fault-tolerant things one can do in MPI. Perhaps such a "constrained MPI" might still work for your needs.
If you're willing to go for something research-quality, there are two implementations of a potential fault-tolerance chapter for a future version of MPI (MPI-4?). The proposal is called User Level Failure Mitigation. There's an experimental version in MPICH 3.2a2 and a branch of Open MPI that also provides the interfaces. Both are far from production quality, but you're welcome to try them out. Just know that since this isn't in the MPI Standard, the function prefixes are not MPI_*. For MPICH, they're MPIX_*; for the Open MPI branch, they're OMPI_* (though I believe they'll be changing theirs to MPIX_* soon as well).
As Rob Latham mentioned, there will be lots of work you'll need to do within your app to handle failures, though you don't necessarily have to check all of your return codes. You can/should use MPI error handlers as a callback function to simplify things, as sketched below. There's information/examples in the spec, available along with the Open MPI branch.
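A sketch of the callback style (standard MPI-2 API; what you do inside the handler, e.g. updating a rank-liveness table, is your application's business):

#include <mpi.h>
#include <stdio.h>

/* Called by MPI instead of aborting when an operation on the
   communicator fails. */
static void on_comm_error(MPI_Comm *comm, int *errcode, ...) {
    char msg[MPI_MAX_ERROR_STRING]; int len;
    MPI_Error_string(*errcode, msg, &len);
    fprintf(stderr, "MPI error intercepted: %s\n", msg);
    /* update your own bookkeeping here instead of checking every rc inline */
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    MPI_Errhandler handler;
    MPI_Comm_create_errhandler(on_comm_error, &handler);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, handler);
    /* ...point-to-point traffic; failures now land in on_comm_error... */
    MPI_Errhandler_free(&handler);
    MPI_Finalize();
    return 0;
}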

Hacking Mono to support async I/O on memory-mapped files

I'm looking for a little advice on "hacking" Mono (and in fact, .NET too).
Context: As part of the Isis2 library (Isis2.codeplex.com) I want to support very fast "zero copy" replication of memory-mapped files on machines that have the right sort of hardware (InfiniBand NICs), and minimal copying for more standard Ethernet with UDP. So the setup is this: we have a set of processes {A,B....} all linked to Isis2, and some member, maybe A, has a big memory-mapped file, call it F, and asks Isis2 to please replicate F onto B, D, G, and X. The library will do this very efficiently and very rapidly, even with heavy use by many concurrent initiators. The idea would be to offer this to HPC and cloud developers who are running big-data applications.
Now, Isis2 is coded in C# on .NET and cross-compiles to Linux via Mono. Both .NET and Mono are managed, so neither wants to let me do zero-copy network I/O -- the normal model would be "copy your data into a managed byte[] object, then use SendTo or SendAsync to send. To receive, same deal: Receive or ReceiveAsync into a byte[] object, then copy to the target location in the file." This will be slower than what the hardware can sustain.
Turns out that on .NET I can hack around the normal memory protections. I built my own mapped-file wrapper (in fact based on one posted years ago by a researcher at Columbia). I pull in kernel32.dll, and then use Win32 methods to map my file, initiate the socket Send and Receive calls, etc. With a bit of hacking I can mimic .NET asynchronous I/O this way, and I end up with something fairly clean and coded entirely in C#, with nothing .NET even recognizes as unsafe code. I get to treat my mapped file as a big unmanaged byte array, avoiding all that unneeded copying. Obviously I'll protect all of this from my Isis2 users; they won't know.
Now we get to the crux of my question: on Linux, I obviously can't load the Win32 kernel DLL since it doesn't exist. So I need to implement some basic functionality using core Linux O/S calls: the mmap() call will map my file. Linux has its own form of asynchronous I/O too: for InfiniBand, I'll use the Verbs library from Mellanox, and for UDP, I'll work with raw IP sends and signals ("interrupts") on completion. Ugly, but I can get this to work, I think. Again, I'll then try to wrap all this to look as much like standard Windows async I/O as possible, for code cleanliness in Isis2 itself, and I'll hide the whole unmanaged, unsafe mess from end users.
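For the mapping step, the native side might look like this minimal C sketch (a helper the managed layer could P/Invoke; error handling trimmed; the function name is made up):

#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

/* Map a whole file read/write and return its base address, or NULL. */
void *map_file(const char *path, size_t *len_out) {
    int fd = open(path, O_RDWR);
    if (fd < 0) return NULL;
    struct stat st;
    fstat(fd, &st);
    void *base = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping keeps the file alive */
    if (base == MAP_FAILED) return NULL;
    *len_out = (size_t)st.st_size;
    return base;                /* one big unmanaged byte array */
}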
Since I'll be sending a gigabyte or so at a time, in chunks, one key goal is that data sent in order would ideally be received in the order I post my async receives. Obviously I do have to worry about unreliable communication (causes stuff to end up dropped, and I'll then have to copy). But if nothing is dropped I want the n'th chunk I send to end up in the n'th receive region...
So here's my question: Has anyone already done this? Does anyone have any tips on how Mono implements the asynchronous I/O calls that .NET uses so heavily? I should presumably do it the same way. And does anyone have any advice on how to do this with minimal pain?
One more question: Win32 is limited to 2 GB of mapped files. Cloud systems would often run Win64. Any suggestions on how to maximize interoperability while allowing full use of Win64 for those who are running that? (A kind of O/S reflection issue...)

Why don't compilers generate microinstructions rather than assembly code?

I would like to know why, in the real world, compilers produce Assembly code, rather than microinstructions.
If you're already bound to one architecture, why not go one step further and free the processor from having to turn assembly code into microinstructions at runtime?
I think perhaps there's an implementation bottleneck somewhere, but I haven't found anything on Google.
EDIT: by microinstructions I mean: if your assembly instruction is ADD(R1,R2), the microinstructions would be: load R1 into the ALU, load R2 into the ALU, execute the operation, load the result back into R1. Another way to see this is to equate one microinstruction with one clock cycle.
I was under the impression that microinstruction was the 'official' name. Apparently there's some mileage variation here.
Compilers don't produce micro-instructions because processors don't execute micro-instructions. They are an implementation detail of the chip, not something exposed outside the chip. There's no way to provide micro-instructions to a chip.
Because an x86 CPU doesn't execute micro-operations, it executes opcodes. You cannot create a binary image that contains micro-operations, since there is no way to encode them in a way that the CPU understands.
What you are suggesting is basically a new RISC-style instruction set for x86 CPUs. The reason that isn't happening is because it would break compatibility with the vast amount of applications and operating systems written for the x86 instruction set.
The answer is quite easy.
(Some) compilers do indeed generate code sequences like load r1, load r2, add r2 to r1. But these are precisely the machine code instructions (the ones you call microcode). These instructions are the one and only interface between the outside world and the innards of a processor.
(Other compilers generate just C and let a C backend like gcc care about the dirty details.)
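To make the point concrete: for a trivial C function, the compiler emits machine instructions like the ones below (illustrative x86-64 output, not from any particular compiler), and how each instruction decodes into micro-operations remains the CPU's private business:

int add(int a, int b) { return a + b; }

/* Typical compiler output -- machine instructions, the real interface:
       mov eax, edi    ; first argument
       add eax, esi    ; plus the second
       ret             ; result returned in eax
   A Skylake and a Zen core will crack these into different micro-ops,
   and neither exposes a way to supply those micro-ops directly. */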

Porting newlib to a custom ARM setup

This is my first post, and it covers something I've been trying to get working, on and off, for about a year now.
Essentially it boils down to the following: I have a copy of newlib which I'm trying to get working on an LPC2388 (an ARM7TDMI from NXP). This is on a Linux box using arm-elf-gcc.
The question I have is this: I've been looking at a lot of the tutorials that talk about porting newlib, and they all talk about the stubs (like exit, open, read/write, sbrk). I have a pretty good idea of how to implement all of these functions, but where should I put them?
I have the newlib distribution from sources.redhat.com/pub/newlib/newlib-1.18.0.tar.gz, and after poking around I found "syscalls.c" (in newlib-1.18.0/newlib/libc/sys/arm), which contains all of the stubs I have to update. But they're all filled in with rather finished-looking code (which does NOT seem to work without the crt0.S, which itself does not work with my chip).
Should I just be wiping out those functions myself and re-writing them? Or should I write them somewhere else? Should I make a whole new folder in newlib/libc/sys with the name of my "architecture" and change the target to match?
I'm also curious whether there's proper etiquette on the distribution of something like this after releasing it as an open-source project. I currently have a script which downloads binutils, arm-elf-gcc, newlib, and gdb, and compiles them. If I am modifying files in the newlib directory, should I distribute a patch which my script auto-applies? Or should I add the modified newlib to the repository?
Thanks for bothering to read! Following this is a more detailed breakdown of what I'm doing.
For those who want/need more info about my setup:
I'm building a ARM videogame console based loosely on the Uzebox project ( http://belogic.com/uzebox/ ).
I've been doing all sorts of things pulling from a lot of different resources as I try and figure it out. You can read about the start of my adventures here (sparkfun forums, no one responds as I figure it out on my own): forum.sparkfun.com/viewtopic.php?f=11&t=22072
I followed all of this by reading through the Stack Overflow questions about porting newlib and saw a few of the different tutorials (like wiki.osdev.org/Porting_Newlib), but they also suffer from telling me to implement stubs without mentioning where, who, what, when, or how!
But where should I put them?
You can put them where you like, so long as they exist in the final link. You might incorporate them in the libc library itself, or you might keep that generic and have the syscalls as a separate target-specific object file or library.
You may need to create your own target-specific crt0.s and assemble and link it for your target. A minimal sketch of the stubs themselves follows.
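As a starting point, a minimal sketch of a target-specific syscalls.c (the leading-underscore names are what newlib's wrappers call; uart_putc and the linker-provided end symbol are assumptions about your particular setup):

#include <sys/stat.h>

extern void uart_putc(char c);  /* your board's UART driver (assumed) */
extern char end;                /* first free address, from the linker script */
static char *heap_ptr = &end;

int _write(int fd, const char *buf, int len) {
    for (int i = 0; i < len; i++) uart_putc(buf[i]);
    return len;
}

void *_sbrk(int incr) {         /* backs malloc(); no overflow check here */
    char *prev = heap_ptr;
    heap_ptr += incr;
    return prev;
}

/* The rest can be neutered until you actually need them. */
int  _close(int fd)                      { return -1; }
int  _fstat(int fd, struct stat *st)     { st->st_mode = S_IFCHR; return 0; }
int  _isatty(int fd)                     { return 1; }
int  _lseek(int fd, int off, int whence) { return 0; }
int  _read(int fd, char *buf, int len)   { return 0; }
void _exit(int code)                     { for (;;) ; }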
A good tutorial by Miro Samek of Quantum Leaps on getting GNU/ARM development up and running is available here. The examples are based on an Atmel AT91 part, so you will need to know a little about your NXP device to adapt the start-up code.
A ready-made newlib porting layer for LPC2xxx was available here, but the links to the files appear to be broken. The same porting layer is used in Martin Thomas' WinARM project. This is a Windows port of GNU ARM GCC, but the examples included in it are target-specific, not host-specific.
You should only need to modify the porting layer on Newlib, and since it is target and application specific, you need not (in fact probably should not) submit your code to the project.
When I was using newlib, that is exactly what I did: blew away crt0.s, syscalls.c, and libcfunc.c. My personal preference was to link in the replacements for crt0.s and syscalls.c (rolling the few functions in libcfunc into the syscalls.c replacement) based on the embedded application.
I never had an interest in pushing any of that work back into the distro, so cannot help you there.
You are on the right path though: crt0.S and syscalls.c are where you want to work to customize for your target. Personally, I was interested in a C library (and printf) and would primarily neuter all of the functions to return 0 or 1 or whatever it took to get them to just work and not get in the way of linking, periodically making the file I/O functions operate on linked-in data in ROM/RAM. Basically, without replacing or modifying any other files in newlib, I had a fair amount of success, so you are on the right path.