Validation warning about SPIR-V Capability

I'm using Vulkan for heavy GPU computations, and in some kernels I apply subgroup arithmetic operations. To use these, I've included the necessary extensions in the kernel:
#extension GL_KHR_shader_subgroup_arithmetic: enable
#extension GL_KHR_shader_subgroup_basic: enable
Everything works the way I expect, but there's a validation error that bothers me. It has this text:
Object 0: handle = 0x78de221930, type = VK_OBJECT_TYPE_DEVICE; | MessageID = 0xa7bb8db6 | vkCreateShaderModule(): The SPIR-V Capability (GroupNonUniformArithmetic) was declared, but none of the requirements were met to use it. The Vulkan spec states: If pCode declares any of the capabilities listed in the SPIR-V Environment appendix, one of the corresponding requirements must be satisfied (https://www.khronos.org/registry/vulkan/specs/1.3-extensions/html/vkspec.html#VUID-VkShaderModuleCreateInfo-pCode-01091)
I don't really understand what EXACTLY the requirements are that I need to satisfy. Following the URL didn't really help; I don't think it specifies exactly what I need to do.
Could you please advise how to tackle this issue? Thanks in advance!
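As far as I can tell, the usual requirements behind that VUID are a device that reports the arithmetic subgroup feature and a shader module compiled for SPIR-V 1.3+ (i.e. a Vulkan 1.1 target). A minimal sketch of checking the feature bit, assuming `physicalDevice` comes from an instance created with apiVersion >= VK_API_VERSION_1_1:

#include <vulkan/vulkan.h>

/* Hypothetical helper: returns nonzero if the device advertises
   subgroup arithmetic support. */
int supports_subgroup_arithmetic(VkPhysicalDevice physicalDevice)
{
    VkPhysicalDeviceSubgroupProperties subgroupProps = {0};
    subgroupProps.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_SUBGROUP_PROPERTIES;

    VkPhysicalDeviceProperties2 props2 = {0};
    props2.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2;
    props2.pNext = &subgroupProps;

    vkGetPhysicalDeviceProperties2(physicalDevice, &props2);

    /* The SPIR-V module must additionally target SPIR-V 1.3+,
       e.g. glslangValidator --target-env vulkan1.1 kernel.comp */
    return (subgroupProps.supportedOperations & VK_SUBGROUP_FEATURE_ARITHMETIC_BIT) != 0;
}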


Vulkan API calls to GPU drivers

Background:
I have been eyeing writing an application which needs very basic but fast graphics (just drawing lines and squares), and I'm probably going to use a library such as GLFW, or Vulkano if I'm going with Rust.
I want to understand a specific, and I guess quite practical, detail of the Vulkan API. I understand that GPUs can be quite a complicated topic, but I want to emphasize that I don't have any background in low-level graphics or Vulkan, so I understand if my question cannot be answered, or if my question does not even make sense. I'll try my best to use the correct terminology. I have to admit, I'm not the best at skimming through large amounts of source code I don't quite understand while still grasping the overall concept, which is why I hope I can find my answer here. I've tried looking at the source code for Vulkan and the Mesa drivers, but it bore no fruit.
ORIGINAL Question:
I want to understand how an API call is propagated to the GPU driver.
I have searched around, but couldn't find the specifics I am searching for. The closest posts I've found are these two:
https://softwareengineering.stackexchange.com/questions/279069/how-does-a-program-talk-to-a-graphics-card
https://superuser.com/questions/461022/how-does-the-cpu-and-gpu-interact-in-displaying-computer-graphics
They both mention something similar to "In order to make the GPU do something, you have to make a call via a supported API". I know that, but neither of the two digs into the specifics of how that API call is made. Hopefully, the diagram below illustrates my question.
MyVulkanProgram.c with "#include <vulkan/vulkan.h>"
|
| (Makes call via Vulkan API)
v
This is the part I don't understand!
|
v
Driver (Mesa, for example) takes the request sent via the Vulkan API.
|
| (Driver asks GPU to perform task)
v
GPU does task
I don't care what or how the GPU does something. Just how it is invoked through the API call via Vulkan and how it propagates through the system. Ideally what I'm looking for is a code-snippet or link to where in the Vulkan source code the actual request is sent to the driver.
Or have I gotten it all wrong? Is Vulkan more part of the driver than I realize? Is it maybe the case that the driver includes the same Vulkan header as my "MyVulkanProgram.c" and the driver is linked together with library files such as libvulkan.so et al? Is it more like the diagram below?
MyVulkanProgram.c with "#include <vulkan/vulkan.h>"
|
| (Makes call via Vulkan API)
v
Driver (Mesa, for example, which includes the vulkan headers and is linked with the Vulkan shared object-files) takes the request sent via the Vulkan API.
|
| (Driver asks GPU to perform task)
v
GPU does task
Might be a basic question, might not be, but I'm confused nonetheless. Very thankful for any answers!
UPDATED Question:
After having read the answer from krOoze, and given the "Vulkan loader" overview figure in the mentioned document, I can express my question more precisely.
How does an application, making a call via the Vulkan API, reach the ICD via the Vulkan loader?
You are looking for the Vulkan-Loader/LoaderAndLayerInterface.md documentation.
The app interfaces with the Loader (sometimes called Vulkan RT, or Vulkan Runtime). That is the vulkan-1.dll (or the equivalent .so on Linux).
The Loader also ships vulkan-1.lib, which is a classic DLL shim. It is where the loading of core version and WSI commands happens, but you can skip the lib and do it all manually directly from the dll using vkGetInstanceProcAddr.
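For illustration, a minimal sketch of that manual route on Linux (library and function names are the standard ones; error handling is trimmed):

#include <vulkan/vulkan.h>
#include <dlfcn.h> /* on Windows: LoadLibraryA/GetProcAddress on vulkan-1.dll */
#include <stdio.h>

int main(void)
{
    /* Open the Loader's shared object directly instead of linking the .lib shim. */
    void *loader = dlopen("libvulkan.so.1", RTLD_NOW);
    if (!loader) { fprintf(stderr, "no Vulkan loader found\n"); return 1; }

    PFN_vkGetInstanceProcAddr gipa =
        (PFN_vkGetInstanceProcAddr)dlsym(loader, "vkGetInstanceProcAddr");

    /* Global commands are fetched with a NULL instance... */
    PFN_vkCreateInstance createInstance =
        (PFN_vkCreateInstance)gipa(NULL, "vkCreateInstance");

    /* ...and instance-level commands from the resulting VkInstance, which is
       exactly where the loader trampoline described below comes into play. */
    (void)createInstance;
    return 0;
}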
Then you have ICDs (Installable Client Drivers). Those are something like nvoglv64.dll, and you can have more than one of them on your PC (e.g. Intel iGPU + NV). The name is arbitrary and vendor-specific. The Loader finds them via config files.
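Those config files are small JSON manifests; a representative one (the path and version here are illustrative) looks roughly like this:

{
    "file_format_version": "1.0.0",
    "ICD": {
        "library_path": "/usr/lib/x86_64-linux-gnu/libvulkan_intel.so",
        "api_version": "1.3.230"
    }
}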
Now when you call a command obtained with vkGetInstanceProcAddr (which is everything, if you use only the *.lib), you get onto a loader trampoline, which calls a chain of layers, after which the relevant ICD (or all of them) is called. Then the call stack is unwound, so it goes the other direction until control is returned to the app. The Loader mutexes and merges the input and output to the ICDs.
Commands obtained with vkGetDeviceProcAddr are a little more streamlined, as they do not need to be mutexed or merged and are meant to be passed to the ICD without much intervention from the Loader.
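As a sketch, skipping that per-call dispatch for a hot entry point might look like this (assuming a valid VkDevice in hand):

#include <vulkan/vulkan.h>

/* Hypothetical helper: cache a device-level entry point so later calls
   go to the ICD without the instance-level trampoline dispatch. */
PFN_vkQueueSubmit load_queue_submit(VkDevice device)
{
    return (PFN_vkQueueSubmit)vkGetDeviceProcAddr(device, "vkQueueSubmit");
}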
The code is also in the same repo: trampoline.c and loader.c. It's pretty straightforward: every layer just calls the layer below it. It starts at the trampoline and ends with the terminator, which in turn calls the ICD.

Verilog: assigning to a module input from within the module itself is okay to do?

I just encountered a case where Verilog module inputs were being assigned to from within the module itself!
I thought for sure this would error out any Verilog simulator, but no, one (at least) lets this pass!
How can this be?!
Isn't this just inviting an "X" tragedy, as soon as something outside the module assigns a different value to the input?
Am I REALLY missing something here?
In case it matters, the module in question came as part of a behavioral simulation library provided to us by our foundry.
The Verilog language does not have any rules about the flow of data based on port direction. The SystemVerilog LRM has a section, 23.3.3.1 Port coercion, that explicitly describes places where inputs can be coerced to outputs and vice versa. However, synthesis tools have coding requirements that prevent multiple drivers on the same signal. So if there are drivers from both inside and outside the instantiated module, you will get synthesis errors.
SystemVerilog has a number of coding styles that can catch multiple drivers on a signal as part of the simulation flow, so you don't have to wait until you get to synthesis, or use a separate linting tool.
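A minimal, hypothetical illustration of both points: simulators accept the internal driver, and the conflict only shows up when something outside drives the same net:

// self_drive drives its own input; plain Verilog allows this.
module self_drive (input wire a, output wire y);
  assign a = 1'b1;  // internal driver on an input port
  assign y = a;
endmodule

module top;
  wire a, y;
  assign a = 1'b0;  // external driver conflicts with the internal one
  self_drive u0 (.a(a), .y(y));
  // In simulation, 'a' resolves to 1'bx (the "X" tragedy);
  // most synthesis tools reject the multiple drivers outright.
endmodule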

Is local_variables_initializer really necessary?

In practice, isn't running global_variables_initializer enough to initialize model variables?
local_variables_initializer seems to be unnecessary, and it is absent even in official and semi-official TensorFlow example code. See for example:
https://github.com/dandelionmane/tf-dev-summit-tensorboard-tutorial
https://www.tensorflow.org/get_started/mnist/pros
In both cases only global_variables_initializer is used.
Am I missing something here? Is there any case where I should call local_variables_initializer explicitly?
local_variables_initializer is useful in particular for streaming metrics (e.g. tf.contrib.metrics.streaming_auc). As said in the doc of contrib.metrics:
Because the streaming metrics use local variables, the Initialization stage is performed by running the op returned by tf.local_variables_initializer().
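A minimal TF1-style sketch of where this bites (the data here is made up; streaming_auc keeps its running totals in local variables, so skipping local_variables_initializer raises an uninitialized-variable error):

import tensorflow as tf

labels = tf.constant([0, 1, 1, 0], dtype=tf.int64)
predictions = tf.constant([0.1, 0.8, 0.6, 0.4])

# streaming_auc stores its running counts in *local* variables
auc, update_op = tf.contrib.metrics.streaming_auc(predictions, labels)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(tf.local_variables_initializer())  # required, or update_op fails
    sess.run(update_op)
    print(sess.run(auc))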

What is this mach_constant_base_node

In a C2 architecture-specific file I see the above variable. Please share:
1. What is it?
2. Does it have any relation to the runtime constant pool?
Thank you.
It is an IR graph node that represents the base address of the compiled method's constants table in a machine-specific manner. This node actually does nothing on x86, since the architecture allows referencing the whole range of 32-bit or 64-bit addresses inline.
Generally, no. Though some constants from the constant pool (particularly, floating point) may appear in that table.
P.S. I guess HotSpot Compiler guys are too busy to browse StackOverflow :) The better place for asking C2 implementation-specific questions is hotspot-compiler-dev list.

Portland Group Fortran pgf90 program fails when compiled with -fast, succeeds with -fast -Mnounroll

This code hummed along merrily for a long time, until we recently discovered an edge case where it fails silently, with no errors returned.
The failure is apparently pretty subtle. We can get the code to run uneventfully in the edge case by:
1) compiling with any set of options that includes -traceback or debug (-g or -gopt);
2) compiling with -fast -Mnounroll;
3) compiling with optimization <2;
4) adding WRITE statements into the code to determine the location of the failure.
In other words, most of the tools useful for debugging the failure actually result in the failure disappearing.
I am probing for any information on failures related to loop unrolling or other optimizations, and their resolution.
Thank you all in advance.
I'm not familiar with pgf (heck, it's been 10 years since I used any Fortran), but here are some general suggestions for tracking down (potential) compiler bugs:
Simplify a reproducible case. I.e. try to reproduce the problem with a similar-looking piece of code that has all the superfluous details removed. This is helpful because a) you'll be less hesitant to release the code publicly, and b) if someone attempts to diagnose the problem, it will be easier for them with less surrounding material.
Talk to the experts: if you have a support contract for pgf, use it! There's a support request form on their site. If not, there's a User Forums section where you might be able to post your information; someone else may have a better workaround, or an employee there may be able to log your problem.
Double-check your code. Is it possible that you're relying on some sort of unspecified behavior? This is the sort of thing that would cause your program to switch behavior when changing optimization levels. I'm not saying compiler bugs are impossible, but it could be a problem lurking in your code too.
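As a made-up illustration of that kind of unspecified behavior, an uninitialized accumulator can happen to work at low optimization and break once the loop is unrolled into registers:

program uninit_demo
  implicit none
  real :: s          ! never initialized: unspecified behavior
  integer :: i
  ! At -O0 the memory slot may happen to start at 0.0; with -fast /
  ! loop unrolling, s can live in a register holding garbage instead.
  do i = 1, 10
    s = s + 1.0
  end do
  print *, s
end program uninit_demo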
Hope that's helpful.