What are the benefits of using immutable samplers in Vulkan?

Am I mistaken that in the case of an immutable sampler it's baked into the descriptor set layout, whereas a non-immutable descriptor holds a pointer to a sampler, so a non-immutable sampler is essentially one extra indirection to read data through? What kind of performance increase are we talking about?

If there is any performance increase, it would likely be quite trivial and heavily hardware and algorithm dependent.
Vulkan is an explicit, low-level API. This allows it to better match the hardware, but it also means that you get to specify more precisely what it is that you want to do. In the vast majority of cases, the sampler you're using with a texture will be fixed for that particular use. As such, the API allows you to explicitly state this. While this can potentially allow for some hardware optimization, the main thing it allows you to do is to stop carrying around VkSampler objects when you don't need them. You specify them in the set layout, and you're done.
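For concreteness, here is a minimal sketch of what "specify them in the set layout" looks like, assuming a VkSampler named sampler has already been created with vkCreateSampler (the handle names here are illustrative, not from the question):

VkDescriptorSetLayoutBinding binding = {};
binding.binding            = 0;
binding.descriptorType     = VK_DESCRIPTOR_TYPE_COMBINED_IMAGE_SAMPLER;
binding.descriptorCount    = 1;
binding.stageFlags         = VK_SHADER_STAGE_FRAGMENT_BIT;
binding.pImmutableSamplers = &sampler;  // the sampler is baked into the layout itself

VkDescriptorSetLayoutCreateInfo layoutInfo = {};
layoutInfo.sType        = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO;
layoutInfo.bindingCount = 1;
layoutInfo.pBindings    = &binding;

VkDescriptorSetLayout setLayout;
vkCreateDescriptorSetLayout(device, &layoutInfo, nullptr, &setLayout);

When you later write the descriptor set, you only supply the image view for that binding; the sampler member of VkDescriptorImageInfo is ignored for bindings that use immutable samplers.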

Related

What is the purpose of Vulkan's new extension VK_KHR_vulkan_memory_model?

Khronos just released their new memory model extension, but there has yet to be any informal discussion, example implementations, etc., so I am confused about the basic details.
https://www.khronos.org/blog/vulkan-has-just-become-the-worlds-first-graphics-api-with-a-formal-memory-model.-so-what-is-a-memory-model-and-why-should-i-care
https://www.khronos.org/registry/vulkan/specs/1.1-extensions/html/vkspec.html#memory-model
What do these new extensions try to solve exactly? Are they trying to solve synchronization problems at the language level (say, to remove onerous mutexes in your C++ code), or is it a new and complex set of features to give you more control over how the GPU deals with synchronization internally?
(Speculative question) Would it be a good idea to learn and incorporate this new model in the general case or would this model only apply to certain multi-threaded patterns and potentially add overhead?
Most developers won't need to know about the memory model in detail, or use the extensions, in the same way that most C++ developers don't need to be intimately familiar with the C++ memory model (and this isn't just because of x86; it's because most programs don't need anything beyond using standard library mutexes appropriately).
But the memory model allows specifying Vulkan's memory coherence and synchronization primitives with a lot less ambiguity -- and in some cases, additional clarity and consistency. For the most part the definitions didn't actually change: code that was data-race-free before continues to be data-race-free. For a few developers doing advanced or fine-grained synchronization, the additional precision and clarity allows them to know exactly how to make their programs data-race-free without using expensive overly-strong synchronization.
Finally, in building the model the group found a few things that were poorly-designed or broken previously (many of them going all the way back into OpenGL). They've been able to now say precisely what those things do, whether or not they're still useful, and build replacements that do what was actually intended.
The extension advertises that these changes are available, but even more than that, once the extension is final instead of provisional, it will mean that the implementation has been validated to actually conform to the memory model.
Among other things, it enables the same kind of fine-grained memory ordering guarantees for atomic operations that are described for C++ here. I would venture to say that many, perhaps even most, PC C++ developers don't really know much about this, because the x86 architecture's strong memory model makes explicit ordering largely superfluous.
However, GPUs are not x86, and compute operations executed massively in parallel on GPU shader cores can, and in some cases must, use explicit ordering guarantees to be valid when working on shared data sets.
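As a rough illustration of the kind of fine-grained ordering being referred to, here is a minimal C++ release/acquire sketch (the flag and payload names are just placeholders); this is roughly the level of precision the memory model now pins down for atomics in shaders:

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<bool> ready{false};
int payload = 0;

void producer() {
    payload = 42;                                  // plain write
    ready.store(true, std::memory_order_release);  // publish: prior writes become visible...
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) { }  // ...to the thread that acquires the flag
    assert(payload == 42);  // guaranteed, without any mutex
}

int main() {
    std::thread a(producer), b(consumer);
    a.join();
    b.join();
}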
This video is a good presentation on atomics and ordering as it applies to C++.

High-throughput way of incrementing numerical values

I have a distributed cache of float (not int). My process will frequently increment these floats and access them occasionally.
If it were local, an atomic float data structure (or a float adder, if there is one) with an increment method would probably be the best way to go. Non-blocking and async would be ideal, since the order of increments does not matter as long as each increment is eventually applied.
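As a side note on the local case only (not the Ignite path): an "atomic float with an increment method" typically boils down to a compare-and-swap loop, since most platforms don't provide a native atomic float add. A minimal C++ sketch, with an illustrative function name:

#include <atomic>

// Add 'delta' to 'target' atomically; retries if another thread races with us.
// (std::atomic<float>::fetch_add is only standard from C++20, hence the CAS loop.)
void atomic_add(std::atomic<float>& target, float delta) {
    float expected = target.load(std::memory_order_relaxed);
    while (!target.compare_exchange_weak(expected, expected + delta,
                                         std::memory_order_relaxed)) {
        // on failure, 'expected' is reloaded with the current value; try again
    }
}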
What's the best way of incrementing numerical value to achieve high throughput?
My current method is:
batching several increment operations for different keys
using the invokeAll method of IgniteCache, passing in a CacheEntryProcessor that contains the increment value for each key
setting the CacheAtomicityMode configuration to ATOMIC
Is this the best way to go?
Is there any configuration that I should pay attention to for performance boost, e.g. use binary format or on-heap memory or avoid unnecessary serialization?
I think you are definitely on the right track.
Binary format will be used by default, so you do not need any special configuration for it. I do not think you should worry about serialization of floats, so I would not configure an on-heap cache, unless you run into performance issues.
I would also suggest taking a look at the internal implementation of IgniteAtomicSequence, as it may give you more useful ideas.
Lastly, I would suggest implementing it as an Ignite service, which will allow you to provide a custom API tailored to this functionality.

When to use VK_IMAGE_LAYOUT_GENERAL

It isn't clear to me when it's a good idea to use VK_IMAGE_LAYOUT_GENERAL as opposed to transitioning to the optimal layout for whatever action I'm about to perform. Currently, my policy is to always transition to the optimal layout.
But VK_IMAGE_LAYOUT_GENERAL exists. Maybe I should be using it when I'm only going to use a given layout for a short period of time.
For example, right now, I'm writing code to generate mipmaps using vkCmdBlitImage. As I loop through the sub-resources performing the vkCmdBlitImage commands, should I transition to VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL as I scale down into a mip, then transition to VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL when I'll be the source for the next mip before finally transitioning to VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL when I'm all done? It seems like a lot of transitioning, and maybe generating the mips in VK_IMAGE_LAYOUT_GENERAL is better.
I appreciate the answer might be to measure, but it's hard to measure on all my target GPUs (especially because I haven't got anything running on Android yet), so if anyone has a decent rule of thumb to apply, it would be much appreciated.
FWIW, I'm writing Vulkan code that will run on desktop GPUs and Android, but I'm mainly concerned about performance on the latter.
You would use it when:
You are lazy
You need to map the memory to the host (unless you can use PREINITIALIZED)
You use the image as multiple incompatible attachments and have no choice
You use the image as a storage image
You would otherwise switch layouts too often (and don't even need barriers) relative to the work done on the images; measurement is needed to confirm GENERAL is better in that case, and even then it is most likely a premature optimization
PS: You could transition all the mip levels to TRANSFER_DST with a single command beforehand, and then transition only the level you need to SRC. With a decent HDD, it may even be best to store the texture with its mipmaps already generated offline, if that's an option (and you might even get better quality from a more sophisticated downscaling algorithm).
PS2: Too bad there's no dedicated mipmap-generation command. The blit most likely does something similar under the hood anyway when scaling to less than half the resolution...
If you read from the mipmap[n] image to create the mipmap[n+1] image, then you should use the transfer layouts if you want your code to run on all Vulkan implementations and get the most performance across them, as those layouts may be used by the GPU to optimize the image for reads or writes.
So if you want to go cross-vendor, only use VK_IMAGE_LAYOUT_GENERAL for setting up the descriptor that uses the final image, not for the image reads and writes.
If you don't want to use that many transitions you may copy from a buffer instead of an image, though you obviously wouldn't get the format conversion, scaling and filtering that vkCmdBlitImage does for you for free.
Also don't forget to check if the target format actually supports the BLIT_SRC or BLIT_DST bits. This is independent of whether you use the transfer or general layout for copies.
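For reference, the per-level pattern discussed above looks roughly like this; a minimal sketch, assuming the image was created with TRANSFER_SRC and TRANSFER_DST usage, mip level 0 is already filled and in TRANSFER_SRC_OPTIMAL, and names such as cmd, image, mipLevels, width, and height are placeholders (dimension clamping and full synchronization scopes omitted):

for (uint32_t i = 1; i < mipLevels; ++i) {
    // Move level i to TRANSFER_DST before blitting into it.
    VkImageMemoryBarrier barrier = {};
    barrier.sType               = VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER;
    barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.image               = image;
    barrier.oldLayout           = VK_IMAGE_LAYOUT_UNDEFINED;
    barrier.newLayout           = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL;
    barrier.srcAccessMask       = 0;
    barrier.dstAccessMask       = VK_ACCESS_TRANSFER_WRITE_BIT;
    barrier.subresourceRange    = { VK_IMAGE_ASPECT_COLOR_BIT, i, 1, 0, 1 };
    vkCmdPipelineBarrier(cmd, VK_PIPELINE_STAGE_TRANSFER_BIT, VK_PIPELINE_STAGE_TRANSFER_BIT,
                         0, 0, nullptr, 0, nullptr, 1, &barrier);

    // Blit level i-1 (already TRANSFER_SRC_OPTIMAL) into level i at half the size.
    VkImageBlit blit = {};
    blit.srcSubresource = { VK_IMAGE_ASPECT_COLOR_BIT, i - 1, 0, 1 };
    blit.srcOffsets[1]  = { int32_t(width >> (i - 1)), int32_t(height >> (i - 1)), 1 };
    blit.dstSubresource = { VK_IMAGE_ASPECT_COLOR_BIT, i, 0, 1 };
    blit.dstOffsets[1]  = { int32_t(width >> i), int32_t(height >> i), 1 };
    vkCmdBlitImage(cmd, image, VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL,
                   image, VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL,
                   1, &blit, VK_FILTER_LINEAR);

    // Level i now becomes the source for level i+1.
    barrier.oldLayout     = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL;
    barrier.newLayout     = VK_IMAGE_LAYOUT_TRANSFER_SRC_OPTIMAL;
    barrier.srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;
    barrier.dstAccessMask = VK_ACCESS_TRANSFER_READ_BIT;
    vkCmdPipelineBarrier(cmd, VK_PIPELINE_STAGE_TRANSFER_BIT, VK_PIPELINE_STAGE_TRANSFER_BIT,
                         0, 0, nullptr, 0, nullptr, 1, &barrier);
}
// Afterwards, transition every level to SHADER_READ_ONLY_OPTIMAL before sampling (not shown).

The PS above maps onto this by hoisting the first barrier out of the loop: transition all levels to TRANSFER_DST once up front, then per level only do the DST-to-SRC transition after its blit.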

VHDL optimization tips [closed]

I am quite new to VHDL, and from using different IP cores (from different providers) I can see that they sometimes differ massively in the space they occupy or the timing constraints they can meet.
I was wondering if there are rules of thumb for optimization in VHDL (like there are in C, for example; unroll your loops, etc.).
Is it related to the synthesis tools I am using (just as different C compilers use different optimization methods, so you need to learn to read the assembly feedback files they produce), or does it depend on my coding skills?
Is it related to the synthesis tools I am using (just as different C compilers use different optimization methods, so you need to learn to read the assembly feedback files they produce), or does it depend on my coding skills?
The answer is "yes." When you are coding in HDL, you are actually describing hardware (go figure). Instead of the code being converted into machine code (as it would be with C), it is synthesized into logic functions (AND, NOT, OR, XOR, etc.) and memory elements (RAM, ROM, FF...).
VHDL can be used in many different ways. You can use VHDL in a purely structural sense where, at the base level, you are calling out primitives of the underlying technology that you are targeting. For example, you literally instantiate every AND, OR, NOT, and flip-flop in your design. While this can give you a lot of control, it is not an efficient use of time in 99% of cases.
You can also implement hardware using behavioral constructs with VHDL. Instead of explicitly calling out each logic element, you describe a function to be implemented. For example, if this, do this, otherwise, do something else. You can describe state machines, math operations, and memories all in a behavioral sense. There are huge advantages to describing hardware in a behavioral sense:
Easier for humans to understand
Easier for humans to maintain
More portable between synthesis tools and target hardware
When using behavioral constructs, knowing your synthesis tool and your target hardware helps you understand how what you write will actually be implemented. For example, if you describe a memory element with an asynchronous reset, the hardware implementation will be different for architectures with a dedicated asynchronous reset input to the memory element than for those without.
Synthesis tools will generally publish in their reference manual or user guide a list of suggested HDL constructs to use in order to obtain some desired implementation result. For basic cases, they will be what you would expect. For more elaborate behavior models (e.g. a dual port RAM) there may be some form that you need to follow for the tool to "recognize" what you are describing.
In summary, for best use of your target device:
Know the device you are targeting. How are the programmable elements laid out? How many inputs and outputs are there from lookup tables? Read the device user manual to find out.
Know your synthesis engine. What types of behavioral constructs will be recognized and how will they be implemented? Read the synthesis tool user guide or reference manual to find out. Additionally, experiment by synthesizing small constructs to see how it gets implemented (via RTL or technology viewer, if available).
Know VHDL. Understand the differences between signals and variables. Be able to recognize statements that will generate many levels of logic in your design.
I was wondering if there are rules of thumb for optimization in VHDL
Now that you know the hardware, synthesis tool, and VHDL... Assuming you want to design for maximum performance, the following concepts should be adhered to:
Pipeline, pipeline, pipeline. The more levels of logic you have between synchronous elements, the more difficulty you are going to have making your timing constraint/goal.
Pipeline some more. Having additional stages of registers can provide additional wiggle-room in the future if you need to add more processing steps to your algorithm without affecting the overall latency/timeline.
Be careful when operating on the boundaries of the normal fabric. For example, if you are interfacing with an IO pin, dedicated multipliers, or other special hardware, you will take more significant timing hits. Additional memory elements should be placed here to avoid critical paths forming.
Review your synthesis and implementation reports frequently. You can learn a lot from reviewing these frequently. For example, if you add a new feature, and your timing takes a hit, you just introduced a critical path. Why? How can you alleviate this issue?
Take care with your "global" structures -- such as resets. Logic that must be widely distributed in your design deserves special care, since it needs to reach across your whole device. You may need special pipeline stages, or timing constraints on this type of logic. If at all possible, avoid "global" structures, unless truly a requirement.
While synthesis tools have design goals to focus on area, speed, or power, the designer's choices and skills are the major contributors to the quality of the output. A designer should have a goal to maximize speed or minimize area, and it will greatly influence his choices. A design optimized for speed can be made smaller by asking the tool to reduce the area, but not nearly as much as the same design conceived for area in the first place.
However, it is more complicated than that. IP cores often target several FPGA technologies as well as ASIC. This can only be achieved by using general VHDL constructs (or rewriting the code for each target, which non-critical IP providers don't do). FPGA and ASIC vendors have primitives that will improve speed/area when used, but if you write code to use a primitive for one technology, it doesn't mean the resulting code will be optimized if you change the technology. Both Xilinx and Altera have DSP blocks to speed up multiplication and whatnot, but they don't work exactly the same, and writing code that uses the full potential of both is very challenging.
Synthesis tools are notorious for doing exactly what you ask them to, even if a more optimized solution is simple, for example:
a <= (x + y) + z; -- two cascaded 2-input adders
b <= x + y + z;   -- a single 3-input adder
These will likely lead to different paths from x, y, z to a and b. In the end, the designer needs to know what he wants, and he has to verify that the synthesis tool understands his intent.

Use of atomic properties in Objective C: Any side effects?

I understand that the meaning of atomic was explained in What's the difference between the atomic and nonatomic attributes?, but what I want to know is:
Q. Are there any side effects, besides performance issues, in using atomic properties everywhere?
It seems the answer is no, as iPhones are quite fast nowadays. So why are so many people still using nonatomic?
Even atomic does not guarantee thread safety, but it's still better than nothing, right?
Even atomic does not guarantee thread safety, but it's still better than nothing, right?
Wrong. Having written some really complex concurrent programs, I recommend exactly the opposite. You should reserve atomic for when it truly makes sense to use -- and you may not fully understand this until you write concurrent programs without any use of atomic. If I am writing a multithreaded program, I don't want programming errors masked (e.g. race conditions). I want concurrency issues loud and obvious. This way, they are easier to identify, reproduce, and correct.
The belief that some thread safety is better than none is flawed. The program is either threadsafe, or it is not. Using atomic can make those aspects of your programs more resistant to issues related to concurrency, but that doesn't buy you much. Sure, there will likely be fewer crashes, but the program is still undisputedly incorrect, and it will still blow up in mysterious ways. My advice: If you aren't going to take the time to learn and write correct concurrent programs, just keep them single threaded (if that sounds kind of harsh: it's not meant to be harsh - it will save you from a lot of headaches). Multithreading and concurrency are huge, complicated subjects - it takes a long time to learn to write truly correct, long-lived programs in many domains.
Of course, atomic can be used to achieve threadsafety in some cases -- but making every access atomic guarantees nothing for thread safety. As well, it's highly unusual (statistically) that atomic properties alone will make a class truly threadsafe, particularly as complexity of the class increases; it is more probable that a class with one ivar is truly safe using atomics only, versus a class with 5 ivars. atomic properties are a feature I use very very rarely (again, some pretty large codebases and concurrent programs). It's practically a corner case if atomics are what makes a class truly thread safe.
Performance and execution complexity are the primary reasons to avoid them. Compared to nonatomic accesses, and given how frequent and simple variable accesses are, the cost of atomic adds up very fast. That is, atomic accesses introduce a lot of execution complexity relative to the task they perform.
Spin locks are one way atomic properties are implemented. So, would you want a synchronization primitive such as a spin lock or mutex implicitly surrounding every get and set, knowing it does not guarantee thread safety? I certainly don't! Making every property access in your implementations atomic can consume a ton of CPU time. You should use it only when you have an explicit reason to do so (also mentioned by dasblinkenlicht+1). Implementation Detail: some accesses do not require spin locks to uphold guarantees of atomic; it depends on several things, such as the architecture and the size of a variable.
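To make the cost concrete, here is a rough C++ analogy (not the actual Objective-C runtime implementation, which varies by platform and property type): the shape of an atomic accessor is a lock/unlock wrapped around every single get and set.

#include <mutex>
#include <string>

class Widget {
    std::mutex lock_;   // stands in for the runtime's per-property lock
    std::string name_;
public:
    // "atomic"-style accessors: every access pays for acquiring and releasing the lock,
    // yet a caller doing read-modify-write across two calls is still not thread-safe.
    std::string name() {
        std::lock_guard<std::mutex> g(lock_);
        return name_;
    }
    void setName(std::string n) {
        std::lock_guard<std::mutex> g(lock_);
        name_ = std::move(n);
    }
    // "nonatomic"-style accessors would be plain loads and stores with no such overhead.
};

Note that even with the locked accessors, something like w.setName(w.name() + "!") is still a race if another thread writes concurrently, which is exactly the "guarantees nothing for thread safety" point above.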
So to answer your question "any side effects?" in a TL;DR format: performance is the primary cost, as you noted; what atomic actually guarantees, and how useful that is at your level of abstraction, is very narrow (and often misunderstood); and it masks real bugs.
You should not pay for what you do not use. Unlike plugged-in computers, where CPU cycles cost you only in terms of time, CPU cycles on a mobile device cost you both in time and in battery use. If your application is single-threaded, there is no reason to use atomic, because the locking and unlocking operations would be a waste of time and battery. The battery is more important than the time: while the latency from the extra operations may be invisible to your end user, the cycles spent will reduce the time the mobile device can work on a single charge, a measure that a lot of users consider very important.