How to use VkPipelineCache?

If I understand it correctly, I'm supposed to create an empty VkPipelineCache object, pass it into vkCreateGraphicsPipelines, and data will be written into it. I can then reuse it with other pipelines I'm creating, or save it to a file and use it on the next run.
I've tried following the LunarG example to extract the info:
uint32_t headerLength = pData[0];
uint32_t cacheHeaderVersion = pData[1];
uint32_t vendorID = pData[2];
uint32_t deviceID = pData[3];
But I always get headerLength is 32 and the rest 0. Looking at the spec (https://vulkan.lunarg.com/doc/view/1.0.26.0/linux/vkspec.chunked/ch09s06.html Table 9.1), the cacheHeaderVersion should always be 1, as the only available cache header version is VK_PIPELINE_CACHE_HEADER_VERSION_ONE = 1.
Also the size of pData is usually only 32 bytes, even when I create 10 pipelines with it. What am I doing wrong?

A Vulkan pipeline cache is an opaque object that's only meaningful to the driver. There are very few operations you're supposed to perform on it:
Creating a pipeline cache, optionally with a block of opaque binary data that was saved from an earlier run
Getting the opaque binary data from an existing pipeline cache, typically to serialize to disk before exiting your application
Destroying a pipeline cache as part of the proper shutdown process.
The idea is that the driver can use the cache to speed up creation of pipelines within your program, and also to speed up pipeline creation on subsequent runs of your application.
You should not be attempting to interpret the cache data returned from vkGetPipelineCacheData at all. The only purpose for that data is to be passed into a later call to vkCreatePipelineCache.
Also the size of pData is usually only 32 bytes, even when I create 10 pipelines with it. What am I doing wrong?
Drivers must implement vkCreatePipelineCache, vkGetPipelineCacheData, etc. But they don't actually have to support caching. So if you're working with a driver that doesn't have anything it can cache, or hasn't done the work to support caching, then you'd naturally get back an empty cache (other than the header).
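For reference, a minimal sketch of the intended round trip in C++ (error handling elided; readCacheFile/writeCacheFile are hypothetical helpers, and device is assumed to be a valid VkDevice):

// Create a cache, seeded with the opaque blob from a previous run if one exists.
std::vector<char> initialData = readCacheFile("pipeline_cache.bin"); // hypothetical helper
VkPipelineCacheCreateInfo cacheInfo = {};
cacheInfo.sType = VK_STRUCTURE_TYPE_PIPELINE_CACHE_CREATE_INFO;
cacheInfo.initialDataSize = initialData.size();
cacheInfo.pInitialData = initialData.empty() ? nullptr : initialData.data();
VkPipelineCache cache;
vkCreatePipelineCache(device, &cacheInfo, nullptr, &cache);

// ... pass `cache` to every vkCreateGraphicsPipelines call ...

// At shutdown: query the size, fetch the blob, and persist it untouched.
size_t dataSize = 0;
vkGetPipelineCacheData(device, cache, &dataSize, nullptr);
std::vector<char> data(dataSize);
vkGetPipelineCacheData(device, cache, &dataSize, data.data());
writeCacheFile("pipeline_cache.bin", data); // hypothetical helper
vkDestroyPipelineCache(device, cache, nullptr);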

Send heavy data through protobuf. Custom field

I'm developing the API for an application using protobuf and gRPC.
I need to send data of arbitrary size. Sometimes it is small, sometimes huge, for example a NumPy array. If the size is small I want to send it through protobuf; if the size is huge I want to dump the data into a file and send the filepath to this file through protobuf.
To do so I've created the following .proto messages:
syntax = "proto3";
import "google/protobuf/any.proto"; // required for google.protobuf.Any

message NumpyTroughProtobuf {
  repeated int32 shape = 1;
  repeated float array = 2;
}
message NumpyTroughfile {
  string filepath = 1;
}
message NumpyTrough {
  google.protobuf.Any data = 1;
}
The logic is simple: if the size is big, data holds a NumpyTroughfile; if it is small, a NumpyTroughProtobuf.
Problem (what I want to avoid):
The mechanism of data transformation is part of my app.
In the current approach I have to check and convert the data before I create the NumpyTrough message. So I have to add logic to my application that takes care of checking and casting the data, and I have to duplicate that logic in every language I use (for example if I send messages from Python to C++).
What I want to do:
The mechanism of data transformation is part of a customized protobuf.
I want to hide the data transformation. I want my app to send a plain NumPy array into the NumpyTrough.data field; all data transformation should be hidden.
So I want the logic of data transformation to be part of a custom Protobuf field, not part of my application.
This means I would like to create a custom field type. I would implement the behavior of this field (marshal/unmarshal) once for each language I use. Then I can send NumPy data directly into this custom field, and the field decides how to proceed: dump the data into a file or use some other method, send it through Protobuf, and restore it on the receiver side.
Something like this: https://github.com/gogo/protobuf/blob/master/custom_types.md but it seems this is not part of the protobuf ecosystem.
Protobuf only defines schema.
You can't add logic to a protobuf definition.
Protobuf's Any represents arbitrary binary data and so -- somewhere -- you'll need to explain to your users what it represents in order that they can ship data in the correct format to your service.
You get to decide how to distribute the processing of the data:
Either partly client-side functionality that performs preprocessing of the data and ships the output (either as structured data using non-Any types or, if still necessary as Any).
Or partly server-side that receives entirely unprocessed client-side data shipped through Any
Or some combination of the two
NOTE You may want to consider shipping the data regardless of size as file references to simplify your implementation. You're correct to bias protobuf to smaller message sizes but, depending on the file size distribution, does it make sense to complicate your implementation with 2 paths?
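To illustrate the client-side option, here is a rough C++ sketch against the generated code for your .proto (the 1 MiB threshold, the numpy_trough.pb.h header name, and the writeToTempFile helper are all made up for the example):

#include "numpy_trough.pb.h" // hypothetical name of the generated header
#include <google/protobuf/any.pb.h>

// Pack either the inline-array message or the file-reference message into
// NumpyTrough.data, depending on the payload size.
NumpyTrough packNumpy(const NumpyTroughProtobuf& inline_msg) {
  NumpyTrough out;
  if (inline_msg.ByteSizeLong() < 1024 * 1024) { // arbitrary 1 MiB threshold
    out.mutable_data()->PackFrom(inline_msg);    // small: ship the data inline
  } else {
    NumpyTroughfile file_msg;
    file_msg.set_filepath(writeToTempFile(inline_msg)); // hypothetical helper
    out.mutable_data()->PackFrom(file_msg);             // huge: ship a file reference
  }
  return out;
}

The receiver then branches on which type the Any actually holds, e.g. trough.data().Is<NumpyTroughProtobuf>() versus trough.data().Is<NumpyTroughfile>(). Note that this is exactly the per-language glue code described above: it cannot live inside the .proto itself.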

What does "VkImageMemoryBarrier::srcAccessMask = 0" mean?

I just read the Images chapter of the Vulkan tutorial, and I didn't understand "VkImageMemoryBarrier::srcAccessMask = 0".
code:
barrier.srcAccessMask = 0;
barrier.dstAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;
and this tutorial say:
Since the transitionImageLayout function executes a command buffer with only a single command, you could use this implicit synchronization and set srcAccessMask to 0 if you ever needed a VK_ACCESS_HOST_WRITE_BIT dependency in a layout transition.
Q1: If the function's command buffer contains multiple commands, can it not use this implicit synchronization?
Q2: According to the manual page, VK_ACCESS_HOST_WRITE_BIT is 0x00004000, but the tutorial uses "0". Why?
Does "0" mean implicit, and "VK_ACCESS_HOST_WRITE_BIT" mean explicit?
Am I understanding correctly?
A 0 access mask means "nothing". As in, there is no memory dependency the barrier introduces.
Implicit synchronization means Vulkan does it for you. As the tutorial says:
One thing to note is that command buffer submission results in implicit VK_ACCESS_HOST_WRITE_BIT synchronization
Specifically, this is the Host Write Ordering Guarantee.
Implicit means you don't have to do anything. Any host write to mapped memory is already automatically visible to any device access of any vkQueueSubmit called after the mapped memory write.
Explicit in this case would mean to submit a barrier with VK_PIPELINE_STAGE_HOST_BIT and VK_ACCESS_HOST_*_BIT.
Note the sync guarantees only work one way. CPU → GPU is automatic/implicit, but GPU → CPU always needs to be explicit (you need a barrier with dst = VK_PIPELINE_STAGE_HOST_BIT to perform the memory domain transfer operation).
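For completeness, an explicit host-side dependency in the GPU → CPU direction would look roughly like this sketch (assuming cmd is a command buffer in which a transfer wrote to host-visible memory):

VkMemoryBarrier barrier = {};
barrier.sType = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
barrier.srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT; // what the GPU wrote
barrier.dstAccessMask = VK_ACCESS_HOST_READ_BIT;      // what the CPU will read
vkCmdPipelineBarrier(cmd,
    VK_PIPELINE_STAGE_TRANSFER_BIT, // the writes happened here
    VK_PIPELINE_STAGE_HOST_BIT,     // the reads happen on the host
    0, 1, &barrier, 0, nullptr, 0, nullptr);
// After the submission completes (e.g. a fence signals), host reads of the
// mapped memory are well-defined.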

Vulkan compute shader caches and barriers

I'm trying to understand how the entire L1/L2 flushing works. Suppose I have a compute shader like this one
layout(std430, set = 0, binding = 2) buffer Particles{
    Particle particles[];
};
layout(std430, set = 0, binding = 4) buffer Constraints{
    Constraint constraints[];
};
void main(){
    const uint gID = gl_GlobalInvocationID.x;
    for (int pass=0;pass<GAUSS_SEIDEL_PASSES;pass++){
        // first query the constraint, which contains particle_id_1 and particle_id_2
        const Constraint c = constraints[gID*GAUSS_SEIDEL_PASSES+pass];
        // read newest positions
        vec3 position1 = particles[c.particle_id_1].position;
        vec3 position2 = particles[c.particle_id_2].position;
        // modify position1 and position2
        position1 += something;
        position2 -= something;
        // update positions
        particles[c.particle_id_1].position = position1;
        particles[c.particle_id_2].position = position2;
        // in the next iteration, different constraints may use the updated positions
    }
}
From what I understand, initially all data resides in L2. When I read particles[c.particle_id_1].position, I copy some of the data from L2 to L1 (or directly to a register).
Then in position1 += something I modify L1 (or the register). Finally, in particles[c.particle_id_1].position = position1, I flush the data from L1 (or a register) back to L2, right?
So if I then have a second compute shader that I want to run after this one, and that second shader will read the positions of particles, I should not need to synchronize Particles. It would be enough to just put an execution barrier, without a memory barrier:
void vkCmdPipelineBarrier(
    VkCommandBuffer commandBuffer,
    VkPipelineStageFlags srcStageMask,                   // here I put VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT
    VkPipelineStageFlags dstStageMask,                   // here I put VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT
    VkDependencyFlags dependencyFlags,                   // here nothing
    uint32_t memoryBarrierCount,                         // here 0
    const VkMemoryBarrier* pMemoryBarriers,              // nullptr
    uint32_t bufferMemoryBarrierCount,                   // 0
    const VkBufferMemoryBarrier* pBufferMemoryBarriers,  // nullptr
    uint32_t imageMemoryBarrierCount,                    // 0
    const VkImageMemoryBarrier* pImageMemoryBarriers);   // nullptr
Vulkan's memory model does not care about "caches" as caches. Its model is built on the notion of availability and visibility. A value produced by GPU command/stage A is "available" to GPU command/stage B if the command/stage A has an execution dependency with command/stage B. A value produced by GPU command/stage A is "visible" to GPU command/stage B if command/stage A has a memory dependency with command/stage B with regard to the particular memory in question and the access modes that A wrote it and B will access it.
If a value is not both available and visible to a command/stage, then attempting to access it yields undefined behavior.
The implementation of availability and visibility will involve clearing caches and the like. But as far as the Vulkan memory model is concerned, this is an implementation detail it doesn't care about. Nor should you: understand the Vulkan memory model and write code that works within it.
Your pipeline barrier creates an execution dependency, but not a memory dependency. Therefore, values written by CS processes before the barrier are available to CS processes afterwards, but not visible to them. You need to have a memory dependency to establish visibility.
However, if you want a GPU-level understanding... it all depends on the GPU. Does the GPU have a cache hierarchy, an L1/L2 split? Maybe some do, maybe some don't.
It's kind of irrelevant anyway, because merely writing a value to an address in memory is not equivalent to a "flush" of the appropriate caches around that memory. Even using the coherent qualifier would only cause a flush for compute shader operations executing within that same dispatch call. It would not be guaranteed to affect later dispatch calls.
Implementation-dependent. For all we know, a device might have no cache at all, or in future it might be some quantum magic bs.
A shader assignment operation does not imply anything about anything. There's no "L1" or "L2" mentioned anywhere in the Vulkan specification; it is a concept that does not exist there.
Completely divorce ourselves from the cache stuff, and all the mental baggage that comes with it.
What is important here is that when you read something, then that thing needs to be "visible to" the reading agent (irrespective of what kind of device you use, and what obscure memory architecture it might have). If it is not "visible to", then you might be reading garbage.
When you write something, this does not happen automatically. The writes are not "visible to" anyone.
First you put your writes into the src* part of a memory dependency (e.g. via a pipeline barrier). That will make your writes "available from".
Then you put your reader into the dst* part; that will take all referenced writes that are "available from" and make them "visible to" the second synchronization scope.
If you really want to shoehorn this into a cache system concept, don't think of it as levels of cache. Think of it as separate caches. That something is already in some cache does not mean it is in the particular cache the consumer needs.
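Concretely, for the two-dispatch case in the question, the execution barrier needs to become a memory dependency, e.g. (sketch; the groupCount values are placeholders):

VkMemoryBarrier barrier = {};
barrier.sType = VK_STRUCTURE_TYPE_MEMORY_BARRIER;
barrier.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT; // the first dispatch's writes
barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;  // the second dispatch's reads
vkCmdDispatch(cmd, groupCount, 1, 1); // first compute pass writes Particles
vkCmdPipelineBarrier(cmd,
    VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
    VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
    0, 1, &barrier, 0, nullptr, 0, nullptr);
vkCmdDispatch(cmd, groupCount, 1, 1); // second pass now sees the updated positions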

Read binary files without having them buffered in the volume block cache

Older, now deprecated, macOS file system APIs provided flags to read a file unbuffered.
I seek a modern way to accomplish the same, so that I can read a file's data into memory without it being cached needlessly somewhere else in memory (such as the volume cache).
Reading with fread and first calling setvbuf (fp, NULL, _IONBF, 0) is not having the desired effect in my tests, for example. I am seeking other low-level functions that let me read into a prepared memory buffer and that let me avoid buffering of the whole data.
Background
I am writing a file search program. It reads large amounts of file content (many GBs) that isn't and won't be used by the user otherwise. It would be a waste to have all this data cached in the volume cache as it'll soon get purged by further reads again, anyway. It'll also likely lead to purging file data that's actually in use by the user or system, causing more cache misses.
Therefore, I should be able to tell the system that I do not need the file data cached. The little caching needed for cluster boundaries is not an issue; it's the many large chunks that I read briefly into memory to search that do not need to be cached.
Two suggestions:
Use the read() system call instead of stdio.
Disable data caching with the F_NOCACHE option for fcntl().
In Swift that would be something like (error checking omitted for brevity):
import Foundation

let path = "/path/to/file"
let fd = open(path, O_RDONLY)
fcntl(fd, F_NOCACHE, 1) // tell the kernel not to keep this file's data in the buffer cache
var buffer = Data(count: 1024 * 1024)
buffer.withUnsafeMutableBytes { ptr in
    // read() goes straight to the file descriptor, with no stdio buffering
    let amount = read(fd, ptr.baseAddress, ptr.count)
}
close(fd)
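The same idea in C/C++, in case you're not in Swift (untested sketch; F_NOCACHE is macOS-specific):

#include <fcntl.h>
#include <unistd.h>
#include <vector>

int fd = open("/path/to/file", O_RDONLY);
fcntl(fd, F_NOCACHE, 1); // bypass the unified buffer cache for this descriptor
std::vector<char> buffer(1024 * 1024);
ssize_t amount = read(fd, buffer.data(), buffer.size());
close(fd);

Note that F_NOCACHE reads can be slower for data that is already cached, and misaligned reads may still be partially buffered; for a search tool scanning gigabytes once, that trade-off is usually what you want.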

What's the PHP APC cache's apc.shm_strings_buffer setting for?

I'm trying to understand the apc.shm_strings_buffer setting in apc.ini. After restarting PHP, the pie chart in the APC admin shows 8MB of cache is already used, even though there are no cached entries (except for apc.php, of course). I've found this relates to the apc.shm_strings_buffer setting.
Can someone help me understand what the setting means? The config file notes that this is the "shared memory size reserved for strings, with M/G suffixe", but I fail to comprehend what that means.
I'm using APC with PHP-FPM.
The easy part to explain is "with M/G suffixe", which means that if you set it to 8M, then 8 megabytes would be allocated, and 1G would allocate 1 gigabyte of memory.
The more difficult bit to explain is that it's a cache for storing strings that are used internally by APC when it's compiling and caching opcode.
The config value was introduced in this change, the bulk of which was to add apc_string.c to the APC project. The main function defined in that C file is apc_new_interned_string, which is then used by apc_string_pmemcpy in apc_compile.c and by the rest of the APC module to store strings.
For example, in apc_compile.c:
/* private members are stored inside property_info as a mangled
* string of the form:
* \0<classname>\0<membername>\0
*/
CHECK((dst->name = apc_string_pmemcpy((char *)src->name, src->name_length+1, pool TSRMLS_CC)));
When APC goes to store a string, apc_new_interned_string checks whether that string is already saved in memory by hashing it; if it is, it returns the previous instance of the stored string.
Only if the string is not already stored in the cache does a new piece of memory get allocated for it.
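Reduced to a sketch (this is the general string-interning idea, not APC's actual code, which lives in shared memory and is written in C):

#include <cstring>
#include <string>
#include <unordered_map>

// Return a canonical copy of s, allocating only the first time we see it.
const char* intern(const std::string& s) {
    static std::unordered_map<std::string, const char*> table; // hash lookup
    auto it = table.find(s);
    if (it != table.end())
        return it->second;               // already interned: reuse the old copy
    char* copy = new char[s.size() + 1]; // stand-in for a shared-memory allocation
    std::memcpy(copy, s.c_str(), s.size() + 1);
    table.emplace(s, copy);
    return copy;
}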
If you're running PHP with PHP-FPM, I'm 90% confident that the cache of stored strings is shared amongst all the workers in a single pool, but am still double-checking that.
The whole size allocated to storing shared strings is allocated when PHP starts up - it's not allocated dynamically. So it's to be expected that APC shows the 8MB used for the string cache, even though hardly any strings have actually been cached yet.
Edit
Although this answers what it does, I have no idea how to see how much of the shared string buffer is being used, so there's no way of knowing what it should be set to.