Can I do transfer operation in transfer queue and graphics queue at the same time? - vulkan

I have made 2 instances of VkQueue: one from graphics family and another one from transfer family. Command pools and command buffers are separated accordingly. Both are doing transfer operations.
Purpose of first one except rendering is to update uniform buffers on
each frame.
Purpose of second one is to update resources: model
vertex/index buffers, texture images etc.
They work in parallel in different threads asynchronously. So it is possible that there will be 2 calls of vkQueueSubmit at the same time.
Is such usage allowed and is it safe?
Note: once I have multithreaded my program sometimes I have VK_DEVICE_LOST on vkQueueSumbit and it is likely that it happens more frequently when resources are loading, that is why I actually came to this question

The Vulkan specification is pretty clear about CPU synchronization of Vulkan functions. vkQueueSubmit says:
Host access to queue must be externally synchronized
Where "queue" is the parameter passed to vkQueueSubmit. It doesn't say every queue; it says "that queue".
And if "external synchronization" is not specifically stated as a requirement of a command, then it isn't a requirement of that command.

Related

Does Vulkan parallel rendering relies on multiple queues?

I'm a newbie of Vulkan, and not very clear on how parallel rendering works, here's some question (the "queue" mentioned below refers specifically to the graphics queue:
Does parallel rendering relies on a device which supports more than one queue?
If question 1 is a yes, what if the physical device only have one queue, but Vulkan abstracted to 4 queues (which is the real case of my macbook's gpu), will the rendering in this case really parallel?
If question 1 is a yes, what if there is only one queue in Vulkan's abstraction, does that mean the device defiantly can render objects in parallel.
P.S. About question 2, when I use Metal api, the number of queues are only one, but when using Vulkan api, the number is 4, I'm not sure it is right to say "the physical device only have one queue".
I have the sneaking suspicion you are abusing the word "parallel". Make sure you know what it means.
Rendering on GPU is by nature embarrassingly parallel. Typically one queue can feed the entire GPU, and typically apps assume that is true.
In all likelihood they made the number of queues equal to the CPU core count. In Vulkan, submissions to a single queue always need to be externally synchronized. Having more queues allows to submit from multiple threads without synchronization.
If there is only one Vulkan queue, you can only submit to one queue. And any submission has to be synchronized with mutex or coming only from one thread in the first place.

Can vkQueuePresentKHR be synced using a pipeline barrier?

The fact vkQueuePresentKHR gets a queue parameter makes me think that it is like a command that is delivered to the queue for execution. If so, it is possible to make it waits (until the writing into the image to be presented is finished) using a pipeline barrier where source stage is VK_PIPELINE_STAGE_COLOR_ATTACHMENT_OUTPUT_BIT and destination is VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT. Or maybe even by an image barrier to ease the sync constraint for the image only.
But the fact that in every tutorial and books the sync is done using semaphore , makes me think that my assumption is wrong. If so, why vkQueuePresentKHR needs a queue parameter ? because the semaphore parameter is seems to be enough: when it is signaled, vkQueuePresentKHR can present the image according to the image index parameter and the swapchain handle parameter.
There are couple of outstanding Issues against the specification. Notably KhronosGroup/Vulkan-Docs#1308 is exactly your question.
Meanwhile everyone usually follows this language:
The processing of the presentation happens in issue order with other queue operations, but semaphores have to be used to ensure that prior rendering and other commands in the specified queue complete before the presentation begins.
Which implies semaphore has to be used. And given we are not 110 % sure, that means semaphore should be used until we know any better.
Another semi-official source is the sync wiki, which uses a semaphore.
Despite what this quote says, I think it is reasonable to believe it is also permissible to use other sync that makes the image already visible before the vkQueuePresent, such as fence wait.
But just pipeline barriers are likely not sufficient. The presentation is outside the queue system:
However, the scope of this set of queue operations does not include the actual processing of the image by the presentation engine.
Additionally there is no VkPipelineStageFlagBit for it, and vkQueuePresentKHR is not included in the submission order, so it cannot be in the synchronization scope of any vkCmdPipelineBarrier.
The confusing part is this unfortunate wording:
Any writes to memory backing the images referenced by the pImageIndices and pSwapchains members of pPresentInfo, that are available before vkQueuePresentKHR is executed, are automatically made visible to the read access performed by the presentation engine.
I believe the trick is the "before vkQueuePresentKHR is executed". As said above, vkQueuePresentKHR is not part of submission order, therefore you do not know if the memory was or wasn't made available via a pipeline barrier before the vkQueuePresentKHR is executed.
Presentation is a queue operation. That's why you submit it to a queue. A queue that will execute the presentation of the image. And specifically to a queue that is able to perform present operations.
As for how to synchronize... the specification is a bit ambiguous on this point.
Semaphores are definitely able to work; there's a specific callout for this:
Semaphores are not necessary for making the results of prior commands visible to the present:
Any writes to memory backing the images referenced by the pImageIndices and pSwapchains members of pPresentInfo, that are available before vkQueuePresentKHR is executed, are
automatically made visible to the read access performed by the presentation engine. This automatic visibility operation for an image happens-after the semaphore signal operation, and happens-before the presentation engine accesses the image.
While provisions are made for semaphores, there is no specific statement of other things. In particular, if you don't wait on a semaphore, it's not clear what "happens-after the semaphore signal operation" means, since no such signal operation happened.
Now, the API for vkQueuePresentKHR makes it clear that you don't need to provide a semaphore to wait on:
waitSemaphoreCount is the number of semaphores to wait for before issuing the present request.
The number may be zero.
One might thing that, as a queue operation, all prior synchronization on that queue would still affect presentation. For example, an external subpass dependency if you wrote to the swapchain image as an attachment. And it probably would... if not for one little problem.
See, synchronization is ultimately based on dependencies between stages. And presentation... doesn't have a stage. So while your source for the external dependency would be well-understood, it's not clear what destination stage would work. Even specifying the all-stages flag wouldn't necessarily work.
Does "not a stage" exist in the set of all stages?
In any case, it's best to just use a semaphore. You'll probably need one anyway, so just use that.

Rendering Terrain Dynamically with Argument Buffers : Understanding why the particle buffer is not overwritten by the GPU inflight

I am looking through an Apple demo project that is associated with the 2017 WWDC video entitled "Introducing Metal 2" where the developers demonstrate the use of argument buffers. The project is linked here on the page titled "Rendering Terrain Dynamically with Argument Buffers" on the Apple developer website. Here, they synchronize resource writes by the CPU to prevent race conditions with a dispatch_semaphore_t, signaling it when the command buffer finishes executing on the GPU and waiting on it if the CPU is writing data several frames ahead of the GPU. This is consistent with what was shown in a previous 2014 WWDC "Working With Metal: Fundamentals".
I noticed that it seems the APPLParticleRenderer is sending data to be written by the GPU in a compute pass before it finishes reading from that same buffer from the fragment shader from a previous render pass. The resource storage mode of the buffer is MTLResourceStorageModePrivate. My question: does Metal automatically synchronize access to private id<MTLBuffer>s accessible only by the GPU? Do render, compute, and blit passes called from new id<MTLCommandEncoder> have access to the buffer only after other passes have written and read from it (exclusive access)? I have seen that there are guaranteed barriers within tile shaders, where tile memory is accessed exclusively by the kernel before subsequent fragment shaders access the memory.
Lastly, in the 2016 WWDC "What's New in Metal, Part 2", the first presenter, Charles Brissart, at 16:44 mentions that fragment and vertex functions reading and writing from the same buffer must be placed into two render command encoders, but for compute kernels one compute command encoder suffices. This is consistent with what is seen within the particle renderer.
See my comment on the original question for a brief version of this answer.
It turns out that Metal tracks dependencies between commands scheduled to the GPU by default for MTLResource types. The hazardTrackingMode property of a MTLResource is defaulted to MTLHazardTrackingModeTracked (MTLHazardTrackingMode.tracked in Swift) according to the Metal documentation. This means Metal tracks dependencies across commands that modify the resource, as is the case with the particle kernel, and delays execution until prior commands accessing the resource are complete.
Therefore, since the _particleDataPool buffer has a storage mode of MTLResourceStorageModePrivate (storageModePrivate in Swift), it can only be written to by the GPU; hence, no CPU/GPU synchronization is necessary with a semaphore for this buffer and thus no multi-buffer system is necessary for the resource.
Only when a resource can be written to by the CPU while the GPU is still reading from it do we want multiple buffers so the CPU is not idle.
Note that the default hazard tracking mode for a MTLHeap is MTLHazardTrackingModeUntracked (MTLHazardTrackingMode.untracked in Swift), in which case you are responsible for synchronizing resource writes by the GPU
EDIT
After reading into resource synchronization in Metal, there are some additional points I would like to make that I think further clarify what's going on. Note that the remaining portion is in Swift. To learn more in detail, I recommend reading the "Synchronization" section in the Metal documentation here.
MTLFence
Firstly, a MTLFence is used to synchronize accesses to untracked resources within the execution of a single command buffer. A fence gives you explicit control over when the GPU accesses resources and is necessary when you are working with an untracked resource. Otherwise, Metal will handle this synchronization for you
It is important to note that the automatic management I mention in the answer only occurs within a single command buffer between encoding passes. But this does not mean we need to synchronize across command buffers scheduled in the same command queue since a command buffer is not immediately scheduled for execution. In fact, according to the documentation on the addScheduledHandler(_:) method of the MTLCommandBuffer protocol found here
The device object schedules the command buffer after it identifies any dependencies with work tasks submitted by other command buffers or other APIs in the system.
at which point it would be safe to access these same buffers. Note that within a single render encoding pass, it is important to mention that if a vertex shader writes into a buffer the fragment shader in the same pass reads from, this is undefined. I mentioned this in the original question, the solution being to use two render pass encoders. I have yet to determine why this is not necessary for a compute encoder, but I imagine it has to do with how kernels are executed in comparison to vertex and fragment shaders
MTLEvent
In some cases, however, command buffers in different queues created by the same MTLDevice need access to the same resource or depend on one another in some way. In this case, synchronization is necessary because the separate queues schedule their own command buffers without knowledge of the other, meaning there is potential for the two command buffers to be executing concurrently.
To fix this problem, you use an MTLEvent instance created by the device using makeEvent() and encode event signals at specific points in each buffer.
MTLSharedEvent
In the event (no pun intended) that you have multiple processors (different CPU cores, CPU and GPU, or multi-GPU), resource synchronization is needed. Here, you create a MTLSharedEvent in place of a MTLEvent that can be used to synchronize across devices and processes. It is essentially the same API as that of the MTLEvent, but involves command queues on different devices.

Vulkan's execution model and sycnhronization [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 3 years ago.
Improve this question
I am trying to clear up my confusion around Vulkan's execution model and I would like to have my understanding verified and get answers to questions that still remain unclear to me.
So my understanding is following:
The host and the device execute completely asynchronously with respect to each other. I have to use VkFence to synchronize between them, i.e. when I want to know that a particular submission has finished executing on the device, I have to wait on the host for the appropriate VkFence to be signaled.
Different command queues execute asynchronously with respect to each other. Vulkan specification does not provide any guarantees about the order in which submissions to these queues start or finish execution. So vkQueueSubmit on queue A executes completely independently from vkQueueSubmit on queue B and I have to use VkSemaphore in order to make sure that for example submission to queue B starts executing after the submission to queue A is finished.
However different commands submitted to the same command queue respect their submission order, which means that commands submitted later won't start execution unless commands submitted earlier have already started their execution, but on the other hand this does not mean that these later commands cannot finish execution before earlier commands.
State setting commands (e.g. vkCmdBindPipeline, vkCmdBindVertexBuffers ...) are not asynchronous and delayed for later (like e.g. vkCmdDraw). Actually they execute right away on the host (not on the device) and modify the state of VkCommandBuffer and this cumulatively modified state is used in recording action commands that come after.
From the perspective of synchronization VkRenderPass can be thought of as just a simpler interface to pipeline barriers. It can be thought of as having one pipeline barrier in the beginning of render pass instance (in place of vkCmdBeginRenderPass), one at the end of render pass instance (in place of vkCmdEndRenderPass) and one pipeline barrier after each subpass (in place of vkCmdNextSubpass).
In my head the mental model of how commands execute on a single command queue is as one huge stream of commands (ordered in the order that they were recorded to command buffer and the order that these command buffers were submitted to the queue) split by pipeline barriers. Each pipeline barrier splits the stream into two sections, commands before the barrier (section A) and commands after the barrier (section B). Commands in section B are allowed to start (or rather continue their execution with pipeline stage Y) only after all commands in section A have finished executing pipeline stage X.
Questions:
The Vulkan specification (section 2.2.1. Queue Operation) states:
Command buffer submissions to a single queue respect submission order
and other implicit ordering guarantees, but otherwise may overlap or
execute out of order. Other types of batches and queue submissions
against a single queue (e.g. sparse memory binding) have no implicit
ordering constraints with any other queue submission or batch.
Lets say that in my program I have only one general queue, that can issue all kinds of commands (graphics, compute, transfer, presentation, ...), so does the above statement mean the following ?
vkQueueSubmit #3 starts execution only after vkQueueSubmit #2 has already started execution, which starts only after vkQueueSubmit #1 has already started, ... but vkQueueBindSparse or vkQueuePresentKHR can start at any time regardless of when they were issued by the host ... In other words I always have to use VkSemaphore to ensure that presentation (vkQueuePresentKHR) starts at the right time (only after all my graphics work was submitted and executed and thus is ready to be presented).
I am a little bit confused with the definition of submission order within command buffers themselves. Specification states (section 6.2. Implicit Synchronization Guarantees):
1)
For commands recorded outside a render pass, this includes all other
commands recorded outside a render pass, including
vkCmdBeginRenderPass and vkCmdEndRenderPass commands; it does not
directly include commands inside a render pass.
2)
For commands recorded inside a render pass, this includes all other
commands recorded inside the same subpass, including the
vkCmdBeginRenderPass and vkCmdEndRenderPass commands that delimit the
same render pass instance; it does not include commands recorded to
other subpasses.
The first bullet point seems to be clear. The submission order is the order in which commands were recorded to command buffers, whilst whatever is inside of a vkCmdBeginRenderPass and vkCmdEndRenderPass block is considered as one command for the purpose of this bullet point. The second bullet point is a bit unclear to me though. How is the submission order defined here ? It is clear that any command within a specific subpass does not start its execution unless a previous command has already started its execution or unless vkCmdBeginRenderPass was executed. But what about different subpasses ? Does this mean that subpass 1 can start its execution before subpass 0 has started its execution ? This does not make sense to me. What would make sense, is if later subpasses are only allowed to start once previous subpasses have finished.
Vulkan specification (section 6.1.2. Pipeline Stages) states:
Execution of operations across pipeline stages must adhere to implicit
ordering guarantees, particularly including pipeline stage order.
Does this mean that for example Vertex shader stage from draw call 2 is not allowed to begin execution unless vertex shader stage from draw call 1 has already started its execution ?
My mental model of Vulkan's command queue execution (number 6 of my understanding) provokes the question, whether a pipeline barrier submitted to the beginning of a command buffer (B) would affect an earlier command buffer (A). I mean would it make the commands in command buffer B wait to start execution until commands in command buffer A are finished ? I read somewhere that synchronization between different command buffers is the job for events, but according to my understanding this should also be possible with barriers.
Also if I used VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT as source stage and VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT as destination stage of a pipeline barrier that should basically disable any overlap between the commands before and after the barrier, right ?
So as I see it, there are several different parallelisms in Vulkan:
Between CPU and GPU, these are synchronized with VkFence
Between different commands queues on the GPU, these are synchronized with VkSemaphore
Between different submissions to the same queue, exception seem to be submissions with vkQueueSubmit. These are also synchronized with VkSemaphore.
Between different draw calls. These are synchronized with pipeline barrier.
This one is the most confusing to me. So if I have a drawcall that in some way uses the results of any previous drawcall or writes to the same render target (framebuffer), then as far as I understand, I need to make sure that the later drawcall sees the memory effects of all previous drawcalls. But what about, when I am rendering a scene with a bunch of game characters, trees and buildings. Lets say that each such object is one drawcall and all these drawcalls write to the same framebuffer. Do I need to issue a memory barrier after every drawcall ? Intuitively this feels redundant and the demos that I checked out did not issue any barriers in this case, but are there any guarantees that drawcalls logically following after will see the memory effects of drawcalls logically before them ? The question is, when do I need to synchronize between different drawcalls ?
Within a single draw call. Synchronization on this level is possible with shader atomic instructions.
However as far as I am not doing anything unusual, like writing to the same memory address from multiple shader instances or reading from the same memory that I have just written to (e.g. implementation of custom blending in fragment shader), I should be fine. In other words if every fragment shader reads and writes only its corresponding pixel or vertex data, I do not need to worry about synchronization within the same drawcall.
The host and the device execute completely asynchronously with respect to each other.
Yes.
Unless explicit synchronization is used (that is VkFence, vk*WaitIdle, VkEvent). Or the one rare implicit synchronization ( host writes are visible to device access from any subsequent vkQueueSubmit).
Do note there also has to be a "memory domain operation". I.e. you must use VK_PIPELINE_STAGE_HOST_BIT when reading output of GPU on the CPU. (VkFence alone, doing the execution and memory dependency, does not suffice).
Different command queues execute asynchronously with respect to each other.
Correct. In other words, commands from any two queues may run serially, next to each other (in parallel), or even be pre-empted and time-shared, or some combination of the above. Anything goes. Unless explicit synchronization (VkSemaphore or VkFence) is used.
However different commands submitted to the same command queue respect their submission order
Yes. But it is only specification formalism that has no real-world effect. It is only specified so we have formal linguistic framework in which to describe other stuff in the specification (e.g. it specifies nomenclature necessary to describe the behavior of pipeline barriers).
State setting commands (e.g. vkCmdBindPipeline, vkCmdBindVertexBuffers ...) are not asynchronous and delayed for later (like e.g. vkCmdDraw).
No, that is not exactly how I would describe it.
They are not "delayed". They are simply executed exactly where they are recorded in the command buffers.
This is perhaps one of the things where we need the "submission order" formalism. All commands later in submission order after state command see the new state. (I.e. only the commands recorded after the state command see the new state).
From the perspective of synchronization VkRenderPass can be thought of as just a simpler interface to pipeline barriers.
I don't think so. It is actually perhaps bit more complex.
What it does is more efficient synchronization, although it perhaps defines functionally the same synchronization as pipeline barriers could. What it does differently is that (among other things) it defines this synchronization as a monolith (i.e. you tell the driver upfront what resources you are gonna use, and you outline all the things you are gonna do to them later).
Render Pass is a harness necessitated by mobile tiling architecture GPU. On desktop it is also useful if they have some architectural inspiration from the mobile GPUs, or simply as an oracle for driver optimization.
so does the above statement mean the following ? vkQueueSubmit #3 starts execution only after vkQueueSubmit #2 has already started execution, which starts only after vkQueueSubmit #1 has already started
Yes, and no. Read above about the formalism of submission order.
Technically, yes, the commands are guaranteed to execute its VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT in order. But that stage does nothing.
AIS, it is only specification formalism used for other things. It does not say anything in of itself.
I am a little bit confused with the definition of submission order within command buffers themselves.
Yes, the language is bit tricky. The part that trips you up is the subpasses. Note that subpasses are by definition also asynchronous. Therefore we cannot use the simple rule in quote "1)".
If I decode it, what the spec quote means is:
a) Any command recorded before the Render Pass Instance (i.e. before vkCmdBeginRenderPass) is earlier in submission order than the vkCmdBeginRenderPass, and earlier than any and all the commands in the subpasses. (And vice versa, anything in the subpasses is later in submission order.)
b) Similarly any command recorded after the Render Pass Instance (i.e. after vkCmdEndRenderPass) is later in submission order than the vkCmdEndRenderPass, and later than any and all the commands in the subpasses.
c) The commands in a single subpass have the submission order same as the order they were recorded in (vkCmd*).
d) Commands in any two subpasses do not have submission order wrt each other.
Remember submission order is only a formalism. What "d)" means in reality is only that you cannot execute vkCmdPipelineBarrier in subpass 1 and expect that barrier to cover anything from subpass 0. (What you must do is use the VkSubpassDependency instead of vkCmdPipelineBarrier to achieve dependency between subpass 0 and 1.)
Execution of operations across pipeline stages must adhere to implicit ordering guarantees, particularly including pipeline stage order.
This is only an introductory statement linking to some of the other stuff in the specification. It does not say anything in of itself.
"implicit ordering guarantees" links to the submission order we covered.
"pipeline stage order" simply links to pipeline stage ordering. This simply specifies "logical order" between pipeline stages (e.g. Vertex Shader is before Fragment Shader). What it means is whenever you use stage flag bit in any srcStage parameter, Vulkan will implicitly assume you also mean any logically earlier stage flag bit. (And similarly for dstStage).
My mental model of Vulkan's command queue execution (number 6 of my understanding) provokes the question, whether a pipeline barrier submitted to the beginning of a command buffer (B) would affect an earlier command buffer (A)
Yes, that is the general idea.
Think of it like this: vkQueueSubmit concatenates the commands from command buffer at the end of the Queue. It is called "queue" for a reason. Therefore a pipeline barrier affects the command buffer that was submitted earlier. (And BTW that's why it is called submission order)
Also if I used VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT as source stage and VK_PIPELINE_STAGE_TOP_OF_PIPE_BIT as destination stage of a pipeline barrier that should basically disable any overlap between the commands before and after the barrier, right ?
Yes, but that is a code rot.
In this case use VK_PIPELINE_STAGE_ALL_COMMANDS_BIT instead. It is much easier to understand for anyone reading such code.
So as I see it, there are several different parallelisms in Vulkan:
Asynchrony.
Parallelism is not guaranteed. I.e. the driver is allowed to serialize the workload, or time-share it.
But e.g. with some common sense you can guess there will be (notable) parallelism between CPU and GPU, if it is a dedicated GPU.
The question is, when do I need to synchronize between different drawcalls ?
Yes, I think no framebuffer sync between draw commands is one of the exceptions\simplifications Vulkan has.
I believe people support it by the specification of Primitive Order and Rasterization Order.
I.e. in a single subpass you should not need a pipeline barrier between two vkCmdDraw* to synchronize the color and depth buffer. (I think) you still need to explicitly synchronize draw in a subpass with other subpasses and with outside of the render pass instance.
However as far as I am not doing anything unusual, like writing to the same memory address from multiple shader instances or reading from the same memory that I have just written to (e.g. implementation of custom blending in fragment shader), I should be fine.
Yes. The pipeline and the fixed and programmable stages should work similarly as in OpenGL. You should for most part be able to use OpenGL's shaders with little to no modification and achieve the same behavior.

The best way to add a new 3D object at runtime

All the time till now I had 3D objects created during the startup. But now I need to add them dynamically. What can be simpler, I thought...
The main issue right now is how to upload the new object's data in the fastest way and find out when the data is uploaded.
Here's my setup:
I'm using the vulkan memory allocator library, so I'm free form memory management burden.
I'm planning to use a separate VkBuffer for every object - this way I don't need to manage offsets, alignments and it would be easier to add/remove objects.
And here are my thoughts/questions:
How to upload the data? I want the buffer to be gpu-visible only, that means I need a staging buffer.
If I use the staging buffer I need to know when the data is ready to use on the gpu. I don't want to flush the pipeline and wait. The only way I see is to use a fence per object and only call the draw command when this fence is ready.
If I use a staging buffer and want to upload multiple objects during a short frame, I need somehow to be sure that the parts of this staging buffer not being overridden by different objects. For this, I need to keep it big, handle alignment for the offsets. But how big?
I'm pretty sure I'm overcomplicating. I believe there should be a much simpler pattern. How would you do this?
I believe there should be a much simpler pattern.
It's Vulkan; it's an explicit, low-level API. "Simple" is not its goal.
Overall, your Vulkan code needs to be written to adapt to the capabilities of the hardware. That's the best way to get performance out of it.
The first decision that needs to be made is whether you need staging at all. Staging (for buffer copies) is only necessary if your device's DEVICE_LOCAL memory is not mappable. And yes, there are (integrated) GPUs that allow you to map DEVICE_LOCAL memory. If that is the case, then you can just write directly to where you need the data to go.
If staging is needed, then you need to decide if the hardware supports an independent transfer-only queue. If so, then you will likely get performance benefits by employing it. Not all hardware supports transfer-only queues, so your application needs to adapt. Also, transfer-only queues can have restrictions on the granularity of memory transfers taking place on those queues, so you need to check to see if your streaming strategy fits within the limits of that particular hardware.
Also, if there is no appropriate transfer queue, you can create the effect of a transfer queue by using a second compute or graphics queue... if the hardware supports multiple queues at all. Being able to submit transfer commands and rendering commands on different queues is a good thing, assuming you are taking advantage of threading (ie: issuing submits of the batches to the different queues on different threads).
If you are able to use a separate queue for transfers (whether a true transfer queue or just a separate compute/graphics queue), then you get to play around with semaphores. The batch that transfers data must signal a semaphore when it completes; this is part of the batch in the vkQueueSubmit call. The batch on the main queue that uses the transferred data for some process needs to wait on that semaphore. So both threads need to be using the same VkSemaphore object. And the wait on the semaphore should just have a global memory barrier, to make the memory visible.
The tricky part is this: you cannot submit the batch that waits on the semaphore until the submit call for the batch that signals it has been submitted. You don't have to wait until completion, but you do have to wait until the vkQueueSubmit call on the transfer queue has returned. So you need a way to transfer the semaphore between different threads, or you could just issue both submit commands on the same thread.
If you aren't using a second queue, then things are slightly simpler.
You still want to build the transfer command buffer itself on a different thread (to take advantage of threading CB construction). But that CB now needs to be communicated to the thread responsible for submitting the rendering stuff. And this channel of communication needs to know that this CB contains transfer commands, which some of the rendering CB processes ought to wait on.
The simplest and most flexible way to do this is to build the transfer CB so that the last command is a vkCmdSetEvent command (and the first command is a vkCmdResetEvent to reset it from previous frames of usage). The submission thread then only needs to create a small CB that only contains a vkCmdWaitEvents command which waits on the transfer event that will be set. That command should issue a full memory barrier, and that CB should execute between the transfer CB and any rendering CBs that read from the transferred data.
The flexibility of this is in the structure of the process. It is structured similarly to how the multi-queue version works. In both cases, a separate thread needs to communicate something to the render submission thread (in one case, a semaphore; in the other, a CB and an event). And the render submission thread needs to do things to wait on that "something", but without disrupting the process of building the rendering commands itself (in one case, you just change the batch to wait on the semaphore; in the other, you insert a CB that waits for the event).
If you want to get a bit smarter about execution dependencies, you can even have the transfer operation forward information about which pipeline stages need to wait on the operation. But that's mostly an optimization.
Here's the thing though: all of the staging cases are not performance-friendly. They're problematic because you can't do anything while the transfer operation is going on. And that is the case because... you're trying to read from the memory in the same frame you're writing to it. That's bad.
You should endeavor instead to delay rendering any objects for which loading is not complete. Or put another way, you want to load the data for new objects before you need them, not on the same frame you need them. This is what streaming systems do: they pre-emptively load data that will be needed soon, but not right now.
But how big?
Only you and your use cases can answer that question. If you are streaming in fixed-sized blocks (which you should do where possible), then it's fairly easy: your staging buffer should be one or maybe two streaming blocks in size. If your rendering system is more flexible, imposing few limitations on the higher-level code, then your staging buffer and your streaming system needs to be more flexible. And there's no right answer for that; it depends entirely on how it gets used.
Welcome to using explicit, low-level APIs.