My question is similar to this one but part of the (useful) given answer isn't compatible with compiling GLSL for vulkan based on OpenGL ES ESSL 3.10.
In order to use a separate section of the push constant memory in the vertex shader and the fragment shader, the suggested solution is to use layout(offset = #) before the first member of the push constant structure.
Attempting to do this in GLSL ES 310 code leads to the error "'offset on block member': not supported with this profile: es".
Is there a supported way to declare such an offset that is compatible with es?
The only workaround I've found is to declare a bunch of dummy variables in the fragment shader. When I do so, I get validation layer errors if I don't declare the full range of the fragment shader's push constant buffer in VkPipelineLayoutCreateInfo. After fixing that, I get validation layer warnings about "vkCreatePipelineLayout() call has push constants with overlapping ranges".
Obviously I can ignore warnings, but if there's a tidier solution, that would be much preferable.
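For reference, the pipeline layout I'm aiming for would look roughly like this (just a sketch: the fragment-stage range and the 64-byte sizes are assumptions chosen to match the mat4-at-offset-64 example below):
#include <vulkan/vulkan.h>
// Sketch: one non-overlapping push constant range per stage.
VkPipelineLayout makePipelineLayout(VkDevice device)
{
    VkPushConstantRange ranges[2] = {};
    ranges[0].stageFlags = VK_SHADER_STAGE_FRAGMENT_BIT; // assumed fragment data in bytes 0..63
    ranges[0].offset     = 0;
    ranges[0].size       = 64;
    ranges[1].stageFlags = VK_SHADER_STAGE_VERTEX_BIT;   // the mat4 at offset 64 from the example below
    ranges[1].offset     = 64;
    ranges[1].size       = 64;

    VkPipelineLayoutCreateInfo info = {};
    info.sType                  = VK_STRUCTURE_TYPE_PIPELINE_LAYOUT_CREATE_INFO;
    info.pushConstantRangeCount = 2;
    info.pPushConstantRanges    = ranges;
    // descriptor set layouts omitted

    VkPipelineLayout layout = VK_NULL_HANDLE;
    vkCreatePipelineLayout(device, &info, nullptr, &layout);
    return layout;
}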
Simple example, this compiles successfully with VulkanSDK\1.0.13.0\Bin\glslangValidator.exe:
#version 430
#extension GL_ARB_enhanced_layouts: enable
layout(std140, push_constant) uniform PushConstants
{
layout(offset=64) mat4 matWorldViewProj;
} ubuf;
layout(location = 0) in vec4 i_Position;
void main() {
gl_Position = ubuf.matWorldViewProj * i_Position;
}
Whereas this does not:
#version 310 es
#extension GL_ARB_enhanced_layouts: enable
layout(std140, push_constant) uniform PushConstants
{
layout(offset=64) mat4 matWorldViewProj;
} ubuf;
layout(location = 0) in vec4 i_Position;
void main() {
gl_Position = ubuf.matWorldViewProj * i_Position;
}
Converting all my 310 ES shader code to 430 would solve my problem, but that wouldn't be ideal. GL_ARB_enhanced_layouts doesn't apply to 310 ES code, so my question is not about why it doesn't work, but rather, do I have any options in ES to achieve the same goal?
I would consider this an error in the GLSL compiler.
What's happening is this. There are some things which compiling GLSL for Vulkan adds to the language, as defined by KHR_vulkan_glsl. The push_constant layout, for example, is explicitly added to GLSL syntax.
However, there are certain things which it does not add to the language. Important to your use case is the ability to apply offsets to members of uniform blocks. Oh yes, KHR_vulkan_glsl uses that information when building the shader's block layout. But the grammar that allows you to say layout(offset=#) is defined by GLSL, not by KHR_vulkan_glsl.
And that grammar is not a part of any version of GLSL-ES. Nor is it provided by any ES extension I am aware of. So you can't use it.
I would say that the reference compiler should, when compiling a shader for Vulkan, either fail to compile any GLSL-ES-based version, or silently ignore any version and extension declarations, and just assume desktop GLSL 4.50.
As for what you can do about it... nothing. Short of hacking that solution into the compiler yourself, your primary solution is to write your code against versions of desktop OpenGL. Like 4.50.
If you compile GLSL to SPIR-V for Vulkan, there is a VULKAN define set in your shaders (see GL_KHR_vulkan_glsl), so you could do something like this:
#ifdef VULKAN
layout(push_constant) uniform pushConstants {
layout(offset = 12) vec4 pos;
} pushConstBlock;
#else
// GLES stuff
#endif
So I tried using code from another post around here to see if I could use it. It was meant to drive a servo motor from a potentiometer, but when I attempted to compile it I got the error above, No operator "=" matches these operands, in "Servo_Project.cpp". How do I go about fixing this error?
Just in case, I'll mention that the board I was trying to compile the code for is a NUCLEO-L476RG; the board from the post I mentioned used a Nucleo L496ZG and a Tower Pro Micro Servo 9G.
#include "mbed.h"
#include "Servo.h"
Servo myservo(D6);
AnalogOut MyPot(A0);
int main() {
float PotReading;
PotReading = MyPot.read();
while(1) {
for(int i=0; i<100; i++) {
myservo = (i/100);
wait(0.01);
}
}
}
This line:
myservo = (i/100);
is wrong in a couple of ways. First, i/100 will always be zero - integer division truncates in C++. Second, there's no = operator that allows an integer value to be assigned to a Servo object. You need to invoke some kind of Servo method instead, likely write().
myservo.write(SOMETHING);
The SOMETHING should be the position or speed of the servo you're trying to get working. See the Servo class reference for an explanation. Your code tries to use fractions from 0-1 and that isn't going to work - the Servo wants a position/speed between 0 and 180.
You should look in the Servo.h header to see what member functions and operators are implemented.
Assuming what you are using is this, it does have:
Servo& operator= (float percent);
Although note that the parameter is float and you are passing an int (the parameter is also in the range 0.0 to 1.0 - so not "percent" as its name suggests - so be wary, both the documentation and the naming are poor). You should have:
myservo = i/100.0f;
However, even though i / 100 would produce zero for all i in the loop, that does not explain the error, since an implicit cast should be possible - even if clearly undesirable. You should look in the actual header you are using to see if the operator= is declared - possibly you have the wrong file, or a different version, or just an entirely different implementation that happens to use the same name.
I also notice that if you look in the header, there is no documentation mark-up for this function and the Servo& operator= (Servo& rhs); member is not documented at all - hence the confusing automatically generated "Shorthand for the write and read functions." on the Servo doc page when the function shown is only one of those things. It is possible it has been removed from your version.
Given that the documentation is incomplete and that the operator= looks like an afterthought, the simplest solution is to use the read() / write() members directly in any case. Or implement your own Servo class - it appears to be only a thin wrapper/facade over the PwmOut class. Since that is actually part of mbed rather than user-contributed code of unknown quality, you may be on firmer ground.
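Putting the two answers together, a minimal sketch of the corrected loop (assuming the mbed contributed Servo class whose write() takes a normalised value in the 0.0 to 1.0 range - adjust if your Servo.h differs):
#include "mbed.h"
#include "Servo.h"

Servo myservo(D6);

int main() {
    while (1) {
        for (int i = 0; i < 100; i++) {
            // float division, so the value actually sweeps 0.00 .. 0.99
            myservo.write(i / 100.0f);
            wait(0.01);
        }
    }
}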
I'm trying to do some 3D boolean operations with CGAL. I have successfully converted my polyhedra to Nef polyhedra. However, whenever I try doing a simple union, I get an assertion failure on line 286 of SM_overlayer.h:
CGAL error: assertion violation!
Expression : G.mark(v1,0)==G.mark(v2,0)&& G.mark(v1,1)==G.mark(v2,1)
File : ./CGAL/include/CGAL/Nef_S2/SM_overlayer.h
Line : 287
I tried searching the documentation for "mark". Apparently it is a template argument on Nef_polyhedron_3 that defaults to bool. However, the documentation also says it is not implemented and that you shouldn't mess with it. I am a bit confused why there is even an assertion on some unimplemented feature. I tried simply commenting out the assertion, but it just goes on to fail at a later point.
I am using the following kernel and typedefs as it was the only example I could find that allowed doubles in the construction of the meshes.
typedef CGAL::Exact_predicates_exact_constructions_kernel Kernel;
typedef CGAL::Nef_polyhedron_3<Kernel> Nef_polyhedron;
typedef CGAL::Polyhedron_3<Kernel> Polyhedron;
I used CGAL 4.6.3 installed with the exe installer. I have also tried the 4.7 beta and I get the same exception (at line 300 instead, though).
Related github issue: https://github.com/CGAL/cgal/issues/353
EDIT: The issue turned out to be with the meshes. The meshes I was using had self-intersections, so even though is_valid, is_closed and is_triangular returned true, the mesh was invalid for conversion to a Nef polyhedron. From CGAL 4.7 onwards, the Polygon Mesh Processing package has been introduced, which provides a function that can be used to check for self-intersections.
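A minimal sketch of that check (assuming CGAL 4.7+ and that the function meant is Polygon_mesh_processing::does_self_intersect, which requires a triangulated mesh):
#include <CGAL/Exact_predicates_exact_constructions_kernel.h>
#include <CGAL/Polyhedron_3.h>
#include <CGAL/Polygon_mesh_processing/self_intersections.h>

typedef CGAL::Exact_predicates_exact_constructions_kernel Kernel;
typedef CGAL::Polyhedron_3<Kernel> Polyhedron;

// Returns true if the mesh is free of self-intersections and therefore
// safe to convert to a Nef polyhedron (validity/closedness checked separately).
bool no_self_intersections(const Polyhedron& mesh)
{
    return !CGAL::Polygon_mesh_processing::does_self_intersect(mesh);
}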
I have noticed that PTX code allows for some instructions with complex semantics, such as bit field extract (bfe), find most-significant non-sign bit (bfind), and population count (popc).
Is it more efficient to use them explicitly rather than write code with their intended semantics in C/C++?
For example: "population count", or popc, means counting the one bits. So should I write:
__device__ int popc(int a) {
int d = 0;
while (a != 0) {
if (a & 0x1) d++;
a = a >> 1;
}
return d;
}
for that functionality, or should I, rather, use:
__device__ int popc(int a) {
int d;
asm("popc.u32 %1 %2;":"=r"(d): "r"(a));
return d;
}
? Will the inline PTX be more efficient? Should we write inline PTX to get peak performance?
Also - does the GPU have some extra magic instructions corresponding to these PTX instructions?
The compiler may identify what you're doing and use a fancy instruction to do it, or it may not. The only way to know in the general case is to look at the output of the compilation in PTX assembly, by adding the -ptx flag to nvcc. If the compiler generates it for you, there is no need to hand-code the inline assembly yourself (or use an intrinsic).
Also, whether or not it makes a performance difference in the general case depends on whether or not the code path is used in a significant way, and on other factors such as the current performance limiters of your kernel (e.g. compute-bound or memory-bound).
A few more points in addition to @RobertCrovella's answer:
Even if you do use PTX for something - that should happen rarely. Limit it to small functions of no more than a few PTX lines - which you can then re-use for multiple purposes as you see fit, with most of your coding being in C/C++.
An example of this principle is the set of intrinsics @njuffa mentioned (the linked copy of the file is not an official one, I think). Please read it through to see which intrinsics are available to you. That doesn't mean you should use them all, of course.
For your specific example - you do want the PTX version over the first one; it certainly won't do any harm. But, again, it is also an example of how you do not need to actually write PTX, since popc has a corresponding __popc intrinsic (again, as @njuffa has noted).
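For instance, the intrinsic equivalent of the popc example above is simply (a sketch; __popc takes an unsigned 32-bit value):
__device__ int popcount32(unsigned int a) {
    // __popc maps to the hardware population-count instruction,
    // so no hand-written PTX is needed here
    return __popc(a);
}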
You might also want to have a look at the source code of some CUDA-based libraries to see what kind of PTX snippets they've chosen to use.
A few days ago, I asked the question below. Because I was in need of a quick answer, I added:
The code does not need to use inline assembly. However, I haven't found a way to do this using Objective-C / C++ / C instructions.
Today, I would like to learn something. So I ask the question again, looking for an answer using inline assembly.
I would like to perform ROR and ROL operations on variables in an Objective-C program. However, I can't manage it – I am not an assembly expert.
Here is what I have done so far:
uint8_t v1 = ....;
uint8_t v2 = ....; // v2 is either 1, 2, 3, 4 or 5
asm("ROR v1, v2");
the error I get is:
Unknown use of instruction mnemonic with unknown size suffix
How can I fix this?
A rotate is just two shifts - some bits go left, the others right - and once you see this, rotating is easy without assembly. The pattern is recognised by some compilers and compiled using the rotate instructions. See Wikipedia for the code.
Update: Xcode 4.6.2 (others not tested) on x86-64 compiles the double shift + or to a rotate for 32- and 64-bit operands; for 8- and 16-bit operands the double shift + or is kept. Why? Maybe the compiler understands something about the performance of these instructions, or maybe they just didn't optimise - but in general, if you can avoid assembler, do so; the compiler invariably knows best! To inline the operations, mark the functions static inline, or use macros defined in the same way as the standard MAX macro (a macro has the advantage of adapting to the type of its operands).
Addendum after OP comment
Here is the x86-64 assembler as an example; for full details of how to use the asm construct, start here.
First the non-assembler version:
static inline uint32 rotl32_i64(uint32 value, unsigned shift)
{
// assume shift is in range 0..31 or subtraction would be wrong
// however we know the compiler will spot the pattern and replace
// the expression with a single roll and there will be no subtraction
// so if the compiler changes this may break without:
// shift &= 0x1f;
return (value << shift) | (value >> (32 - shift));
}
void test_rotl32(uint32 value, unsigned shift)
{
uint32 shifted = rotl32_i64(value, shift);
NSLog(#"%8x <<< %u -> %8x", value & 0xFFFFFFFF, shift, shifted & 0xFFFFFFFF);
}
If you look at the assembler output for profiling (so the optimiser kicks in) in Xcode (Product > Generate Output > Assembly File, then select Profiling in the pop-up menu at the bottom of the window) you will see that rotl32_i64 is inlined into test_rotl32 and compiles down to a rotate (roll) instruction.
Now producing the assembler directly yourself is a bit more involved than for the ARM code FrankH showed. This is because, to take a variable shift count, a specific register (cl) must be used, so we need to give the compiler enough information to do that. Here goes:
static inline uint32 rotl32_i64_asm(uint32 value, unsigned shift)
{
// i64 - shift must be in register cl so create a register local assigned to cl
// no need to mask as i64 will do that
register uint8 cl asm ( "cl" ) = shift;
uint32 shifted;
// emit the rotate left long
// %n values are replaced by args:
// 0: "=r" (shifted) - any register (r), result(=), store in var (shifted)
// 1: "0" (value) - *same* register as %0 (0), load from var (value)
// 2: "r" (cl) - any register (r), load from var (cl - which is the cl register so this one is used)
__asm__ ("roll %2,%0" : "=r" (shifted) : "0" (value), "r" (cl));
return shifted;
}
Change test_rotl32 to call rotl32_i64_asm and check the assembly output again - it should be the same, i.e. the compiler did as well as we did.
Further note that if the commented-out masking line in rotl32_i64 is included, it essentially becomes rotl32 - the compiler will do the right thing for any architecture, all for the cost of a single and instruction in the i64 version.
So asm is there if you need it; using it can be somewhat involved, and the compiler will invariably do as well or better by itself...
HTH
The 32bit rotate in ARM would be:
__asm__("MOV %0, %1, ROR %2\n" : "=r"(out) : "r"(in), "M"(N));
where N is required to be a compile-time constant.
But the output of the barrel shifter, whether used on a register or an immediate operand, is always a full-register-width; you can shift a constant 8-bit quantity to any position within a 32bit word, or - as here - shift/rotate the value in a 32bit register any which way.
But you cannot rotate 16bit or 8bit values within a register using a single ARM instruction. None such exists.
That's why the compiler, on ARM targets, when you use the "normal" (portable [Objective-]C/C++) code (in << xx) | (in >> (w - xx)), will generate a single assembler instruction for a 32-bit rotate, but at least two (a normal shift followed by a shifted or) for 8/16-bit ones.
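For the 8-bit case in the question, the portable pattern with explicit masking would look something like this (just a sketch; the compiler lowers it to whatever shift/or or rotate sequence the target supports):
#include <stdint.h>

static inline uint8_t ror8(uint8_t value, unsigned shift)
{
    shift &= 7;                    // keep the shift amount in 0..7
    if (shift == 0) return value;  // a rotate by 0 is a no-op; returning early keeps both shifts in range
    return (uint8_t)((value >> shift) | (value << (8 - shift)));
}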
Out of curiosity, will it be more efficient to write shader variables like this:
lowp vec4 tC = texture2D(uTexture, vTexCoord); // texture color
or
lowp vec4 textureColor = texture2D(uTexture, vTexCoord); // texture color
Note that I named the variable tC because it has fewer characters than textureColor.
I understand that in a programming language like C/Obj-C it doesn't matter, but what about shaders, since you can query attributes / uniforms by name?
It shouldn't make a measurable difference. After linking your program during initialization, query the locations of attributes/uniforms, and keep the result around with the program handle. From then on, neither your app nor the driver will be touching the name strings, just the integer locations.
Even if you re-query locations every time you need to change an attrib binding or uniform value, the difference between a short and a "moderate" name length likely won't matter much compared to the other costs of doing the lookup and the binding/value change.
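The pattern, in rough outline (a sketch against the standard GL ES entry points; the program handle and the "uTexture" / "aPosition" names are placeholders):
#include <GLES2/gl2.h>

static GLint uTextureLoc = -1;
static GLint aPositionLoc = -1;

// Call once, right after glLinkProgram(program) succeeds.
void cacheLocations(GLuint program)
{
    uTextureLoc  = glGetUniformLocation(program, "uTexture");
    aPositionLoc = glGetAttribLocation(program, "aPosition");
}

// Per draw: only the cached integer locations are used;
// the name strings are never touched again.
void bindForDraw(GLuint program)
{
    glUseProgram(program);
    glUniform1i(uTextureLoc, 0);            // sampler on texture unit 0
    glEnableVertexAttribArray(aPositionLoc);
}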