Queue family ownership transfer granularity for VkBuffer - vulkan

Question: Are there any granularity/alignment requirements for transferring sub-regions of a VkBuffer from one queue family to another?
I would like to:
1. use a single graphics/present queue for sourcing draw calls from a single VkBuffer;
2. and another separate compute queue to fill sub-allocated regions/sections within said VkBuffer.
So, commands submitted to the graphics/present queue (1.) will read/source data written to the VkBuffer by commands submitted to the compute queue (2.). And I do not want to transfer the whole VkBuffer from one queue to the other, but only certain sub-regions (once their computation is finished and the results can be consumed/sourced by the graphics commands).
I've read that whenever you use VK_SHARING_MODE_EXCLUSIVE you must explicitly transfer ownership from one queue family (in my case that of the compute queue) to the other (in my case that of the graphics/present queue) so that commands submitted to the latter queue see the changes made by commands submitted to the former queue.
I know how to do this and how to correctly synchronize both release and acquire actions with semaphores.
However, since I want to use a single VkBuffer with manual sub-allocations (actually VMA virtual allocations) and saw that VkBufferMemoryBarrier provides offset and size members, I was wondering whether there are any granularity requirements on which sections/pages of a VkBuffer can be transferred.
Can I actually transfer single bytes of a VkBuffer from one queue family to another or do I have to obey certain granularity/alignment requirements (other than the alignment requirements for my own data structures within that VkBuffer and the usage of that buffer of course)?

There are no granularity requirements on queue family transfer byte ranges. Indeed, there do not appear to be granularity requirements on memory barrier byte ranges either.

Related

Rabbitmq :: Message is never removed from stream queue

I have created a stream queue in the RabbitMQ instance of my project and configured max-age to 1 minute. I sent a message to the queue and all the consumers consumed it, but the message remains in the queue as "ready" (I waited more than 1 minute). My worry is about messages accumulating on the disk of the RabbitMQ instance.
So, my question is: are all the messages marked as "ready" stored on disk even after all consumers have consumed them? If yes, how can I purge these messages from the disk of the RabbitMQ instance (max-age is not working for this)?
That is the design; see https://www.rabbitmq.com/streams.html#retention
Streams are implemented as an immutable append-only disk log. This means that the log will grow indefinitely until the disk runs out. To avoid this undesirable scenario it is possible to set a retention configuration per stream which will discard the oldest data in the log based on total log data size and/or age.
There are two parameters that control the retention of a stream. These can be combined. These are either set at declaration time using a queue argument or as a policy which can be dynamically updated. ...
max-age:
valid units: Y, M, D, h, m, s
e.g. 7D for a week
max-length-bytes:
the max total size in bytes
NB: retention is evaluated on per segment basis so there is one more parameter that comes into effect and that is the segment size of the stream. The stream will always leave at least one segment in place as long as the segment contains at least one message. When using broker-provided offset-tracking, offsets for each consumer are persisted in the stream itself as non-message data.
But I see what you mean.
I suggest you ask on the rabbitmq-users Google group where the RabbitMQ engineers hang out; they don't monitor SO closely.
Same problem here: the messages are never deleted.
The solution that I found:
It's not possible to avoid storing data on disk or to purge it, but it is possible to prevent excessive disk usage.
Add the argument x-stream-max-segment-size-bytes to the queue, decreasing the default segment size to one that suits your needs. I set it to 1 MB, for example. More details: https://www.rabbitmq.com/streams.html#declaring
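A minimal sketch of such a declaration in Python, assuming a pika-style channel object whose queue_declare accepts queue, durable and arguments. The helper name declare_stream and the default values are illustrative assumptions; the x- argument names are the ones used in the RabbitMQ streams documentation.

```python
def declare_stream(channel, name, max_age="1m", segment_size_bytes=1_000_000):
    """Declare a stream queue with a retention age and a reduced segment size.

    `channel` is assumed to be a pika-style channel (hypothetical here);
    the default values are examples, not recommendations.
    """
    arguments = {
        "x-queue-type": "stream",                               # make the queue a stream
        "x-max-age": max_age,                                   # units: Y, M, D, h, m, s
        "x-stream-max-segment-size-bytes": segment_size_bytes,  # smaller segments roll over sooner
    }
    # Streams are always persisted to disk, so the queue is declared durable.
    channel.queue_declare(queue=name, durable=True, arguments=arguments)
    return arguments
```

With pika this might be called as declare_stream(connection.channel(), "events"), where "events" is a placeholder queue name.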
At least one segment file will always remain, so if you just send 1 message and wait, it will remain on disk forever. However, if you keep publishing, a new segment file gets created at some point and the retention process kicks in. Files that only contain messages older than the retention period will be deleted.
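The behavior described above can be illustrated with a toy model. This is not RabbitMQ code: ToyStream, its fields, and the choice to run retention only when a segment rolls over are simplifying assumptions made for this sketch.

```python
class ToyStream:
    """Toy model of segment-based retention (an illustration, not RabbitMQ)."""

    def __init__(self, max_segment_bytes, max_age_s):
        self.max_segment_bytes = max_segment_bytes
        self.max_age_s = max_age_s
        self.segments = [[]]  # each segment is a list of (timestamp, size)

    def publish(self, now, size):
        current = self.segments[-1]
        used = sum(s for _, s in current)
        if current and used + size > self.max_segment_bytes:
            # The current segment is full: start a new one and, in this model,
            # evaluate retention at that moment.
            self.segments.append([])
            self.apply_retention(now)
        self.segments[-1].append((now, size))

    def apply_retention(self, now):
        # Drop whole segments, oldest first, but only if every message in the
        # segment is older than max_age_s, and never drop the last segment.
        while len(self.segments) > 1 and all(
            now - t > self.max_age_s for t, _ in self.segments[0]
        ):
            self.segments.pop(0)

    def message_count(self):
        return sum(len(segment) for segment in self.segments)
```

In this model a single published message stays in the stream no matter how much time passes, because the only segment is never removed. Once enough new messages arrive to roll over to further segments, the old segment is deleted as a whole.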

Upper limit on number of redis streams consumer groups?

We are looking at using redis streams as a cluster wide messaging bus, where each node in the cluster has a unique id. The idea is that each node, when spawned, creates a consumer group with that unique id to a central redis stream to guarantee each node in the cluster gets a copy of every message. In an orchestrated environment, cluster nodes will be spawned and removed on the fly, each having a unique id. Over time I can see this resulting in there being 100's or even 1000's of old/unused consumer groups all subscribed to the same redis stream.
My question is this - is there an upper limit to the number of consumer groups that redis can handle and does a large number of (unused) consumer groups have any real processing cost? It seems that a consumer group is just a pointer stored in redis that points to the last read entry in the stream, and is only accessed when a consumer of the group does a ranged XREADGROUP. That would lead me to assume (without diving into Redis code) that the number of consumer groups really does not matter, save for the small amount of RAM that the consumer groups pointers would eat up.
Now, I understand we should be smarter and a node should delete its own consumer groups when it is being killed or we should be cleaning this up on a scheduled basis, but if a consumer group is just a record in redis, I am not sure it is worth the effort - at least at the MVP stage of development.
TL;DR:
Is my understanding correct, that there is no practical limit on the number of consumer groups for a given stream and that they have no processing cost unless used?
Your understanding is correct, there's no practical limit to the number of CGs and these do not impact the operational performance.
That said, other than the wasted RAM (which could become significant, depending on the number of consumers in the group and PEL entries), this will add time complexity to invocations of XINFO STREAM ... FULL and XINFO GROUPS as these list the CGs. Once you have a non-trivial number of CGs, every call to these would become slow (and block the server while it is executing).
Therefore, I'd still recommend implementing some type of "garbage collection" for the "stale" CGs, perhaps as soon as the MVP is done. Like any computing resource (e.g. disk space, network, mutexes...) and given there are no free lunches, CGs need to be managed as well.
P.S. IIUC, you're planning to use a single consumer in each group, and have each CG/consumer correspond to a node in your app's cluster. If that is the case, I'm not sure that you need CGs and you can use the simpler XREAD (instead of XREADGROUP) while keeping the last ID locally in the node.
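A rough sketch of that XREAD approach in Python, assuming a redis-py style client whose xread(streams, count, block) returns a list of [stream_name, [(entry_id, fields), ...]] pairs (the default RESP2 reply shape). The function name poll_stream and its defaults are hypothetical.

```python
def poll_stream(client, stream, last_id, count=100, block_ms=5000):
    """Read entries newer than last_id and return (entries, new_last_id).

    The node keeps new_last_id itself (in memory or on local disk), so no
    consumer group has to exist on the server for it.
    """
    reply = client.xread({stream: last_id}, count=count, block=block_ms)
    entries = []
    for _stream_name, items in reply or []:  # empty reply when the block times out
        for entry_id, fields in items:
            entries.append((entry_id, fields))
            last_id = entry_id  # remember the newest ID seen so far
    return entries, last_id
```

A node would call this in a loop, feeding the returned ID back in as last_id. Starting from "$" delivers only entries added after the call, so a node that must not miss anything should start from a concrete ID instead.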
OTOH, assuming I'm missing something and that there's a real need for this use pattern, I'd imagine Redis being able to support it better by offering some form of expiry for idle groups.

Flow control limiting message rate on single queue

I have an exchange with only one queue bound to it. When the message publishing rate goes over some cap, RabbitMQ automatically throttles the incoming message rate.
On further investigation I found that this happens due to the "flow control" throttling mechanism built into RabbitMQ. https://www.rabbitmq.com/blog/2014/04/14/finding-bottlenecks-with-rabbitmq-3-3/
According to this document, my connection and channels are in flow control but the queue is not, which means there is a CPU-bound or disk-bound limit.
My messages are not persistent, so I don't have a disk limitation. On searching, I found posts stating that a queue is limited to a single CPU. https://groups.google.com/forum/#!msg/rabbitmq-users/wzHMV7F0ugU/zhW_9b8ACQAJ
What does this mean? Does a RabbitMQ queue process use only one CPU even if multiple cores are available on the machine? What is the CPU limitation with respect to queue flow control?
A queue is handled by one and only one CPU, which means you have to design your message flow through RabbitMQ with multiple queues in order to remain scalable.
If you use only one queue, you will be limited to a maximum message rate no matter whether you have one core or more.
https://www.rabbitmq.com/queues.html#runtime-characteristics
If you have a specific need to build an architecture with only one logical queue, which is explicitly not recommended, or if you have a queue with really high traffic, you can look into sharded queues (the Sharded Queues plugin on GitHub).
It's a plugin (use it with caution and test everything before going to production, especially failure and replication) that splits a logical queue name into multiple queues.
If you are running a benchmark on RabbitMQ, remember to produce and consume on a number of queues greater than the number of CPU cores present on the server.
Other benchmarking tips: try producing only, consuming only, and both at the same time, with different settings (persistence, message size, lazy queues, ...) and ack settings.

Barriers or semaphores for multiple submissions over the same queue?

In this example we copy a buffer into a vertex buffer and want to start rendering from it as soon as possible, using two submissions and without waiting on a fence:
vkBeginCommandBuffer(transferCommandBuffer)
vkCmdCopyBuffer(transferCommandBuffer, hostVisibleBuffer, vertexBuffer)
vkEndCommandBuffer(transferCommandBuffer)
vkQueueSubmit(queue, transferCommandBuffer)
vkBeginCommandBuffer(renderCommandBuffer)
...
vkCmdBindVertexBuffers(vertexBuffer)
vkCmdDraw()
...
vkEndCommandBuffer(renderCommandBuffer)
vkQueueSubmit(queue, renderCommandBuffer)
From what I understand, transferCommandBuffer might not have finished when renderCommandBuffer is submitted, so renderCommandBuffer may get scheduled and read incomplete data from vertexBuffer.
We could attach a semaphore when submitting transferCommandBuffer, to be signaled after completion, and pass this semaphore to the renderCommandBuffer submission to wait on before execution. The issue here is that it also blocks the commands in the second batch that do not depend on the buffer.
Or we could insert a barrier after the copy command or before the bind vertex buffer command, which seems much better since we can specify that access to the buffer is our main concern and possibly let part of the batch execute in the meantime.
Is there any good reason for using semaphores instead of barriers for similar cases (single queue, multiple submissions)?
Barriers are necessary whenever you change the way in which a resource is used, to inform the driver/hardware about that change. So in your example a barrier is probably needed anyway.
As for semaphores: when you submit a command buffer, you specify both the semaphore handles and the pipeline stages at which the wait on each corresponding semaphore should occur. You do this through the following members of the VkSubmitInfo structure:
pWaitSemaphores is a pointer to an array of semaphores upon which to wait before the command buffers for this batch begin execution. If semaphores to wait on are provided, they define a semaphore wait operation.
pWaitDstStageMask is a pointer to an array of pipeline stages at which each corresponding semaphore wait will occur.
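So when you submit a command buffer with a wait semaphore, the hardware can still execute its commands up to the specified stage; only work at that stage and later has to wait for the semaphore to be signaled.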

GCD custom queue for synchronization

In Mike Ash's GCD article, he mentions: "Custom queues can be used as a synchronization mechanism in place of locks."
Questions:
1) How does dispatch_barrier_async work differently from dispatch_async? Doesn't dispatch_async achieve the same thing as dispatch_barrier_async, synchronization-wise?
2) Is a custom queue the only option? Can't we use the main queue for synchronization purposes?
First, whether a call to submit a task to a queue is _sync or _async does not in any way affect whether the task is synchronized with other threads or tasks. It only affects whether the caller is blocked until the task completes executing or if it can continue on. The _sync stands for "synchronous" and _async stands for "asynchronous", which sound similar to but are different from "synchronized" and "unsynchronized". The former have nothing to do with thread safety, while the latter are crucial.
You can use a serial queue for synchronizing access to shared data structures. A serial queue only executes one task at a time. So, if all tasks which touch a given data structure are submitted to the same serial queue, then they will never be executing simultaneously and their accesses to the data structure will be safe.
The main queue is a serial queue, so it has this same property. However, any long-running task submitted to the main queue will block user interaction. If the tasks don't have to interact with the GUI or have a similar requirement that they run on the main thread, it's better to use a custom serial queue.
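The serial-queue idea is not specific to GCD. As a rough analogy in Python (an assumption of this sketch, not GCD code), a ThreadPoolExecutor with a single worker runs submitted tasks one at a time in submission order, so state touched only by those tasks needs no lock. The class name SerialCounter is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


class SerialCounter:
    """Counter whose state is only ever touched by tasks on one serial executor."""

    def __init__(self):
        self._value = 0
        # One worker thread: tasks run one at a time, in submission order.
        self._queue = ThreadPoolExecutor(max_workers=1)

    def increment_async(self):
        # Like dispatch_async: the caller continues without waiting.
        return self._queue.submit(self._increment)

    def read_sync(self):
        # Like dispatch_sync: the caller blocks until the task has run.
        return self._queue.submit(lambda: self._value).result()

    def _increment(self):
        self._value += 1  # safe: never runs concurrently with another task

    def close(self):
        self._queue.shutdown(wait=True)
```

Any number of threads can call increment_async at once; the increments are still applied one after another. As with a synchronous dispatch onto the current serial queue, calling read_sync from inside a task running on the same executor would deadlock.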
It's also possible to achieve synchronization using a custom concurrent queue if you use the barrier routines. dispatch_barrier_async() is different from dispatch_async() in that the queue temporarily becomes a serial queue, more or less. When the barrier task reaches the head of the queue, it is not started until all previous tasks in that queue have completed. Once they do, the barrier task is executed. Until the barrier task completes, the queue will not start any subsequent tasks that it holds.
Non-barrier tasks submitted to a concurrent queue may run simultaneously with one another, which means they are not synchronized and, if they access shared data structures, they can corrupt that data structure or get incorrect results, etc.
The barrier routines are useful for read-write synchronization. It is usually safe for multiple threads to be reading from a data structure simultaneously, so long as no thread is trying to modify (write to) the data structure at the same time. A task that modifies or writes to the data structure must not run simultaneously with either readers or other writers. This can be achieved by submitting read tasks as non-barrier tasks to a given queue and submitting write tasks as barrier tasks to that same queue.