How to read data from GPU memory, not using memcpy? - gpu

In the Vulkan API, how can we read data from GPU memory, such as data that was calculated by a compute shader?

First wait on the fence associated with the compute submission. Then map the memory you wrote the result into; if the memory is not coherent, you also need to invalidate the mapped range.
Finally, read the data through the pointer you got from the mapping operation.
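A minimal sketch of those steps in C (device, fence, memory, and size are placeholders, and it is assumed the compute result lives in host-visible memory that may not be coherent):

// Sketch only: device, fence, memory, size are assumed to exist already.
vkWaitForFences(device, 1, &fence, VK_TRUE, UINT64_MAX);

void *mapped = NULL;
vkMapMemory(device, memory, 0, size, 0, &mapped);

// Only needed when the memory type lacks VK_MEMORY_PROPERTY_HOST_COHERENT_BIT.
VkMappedMemoryRange range = {
    .sType  = VK_STRUCTURE_TYPE_MAPPED_MEMORY_RANGE,
    .memory = memory,
    .offset = 0,
    .size   = VK_WHOLE_SIZE,
};
vkInvalidateMappedMemoryRanges(device, 1, &range);

// Read the results directly through the mapped pointer.
const float *results = (const float *)mapped;
// ... use results[0], results[1], ... here ...
vkUnmapMemory(device, memory);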

I've just gone through the same issue. I think ratchet freak's comment has got to the point. In my case, I was trying to transfer data from a texture (VkImage) to host memory, using a linear buffer (VkBuffer) as the staging buffer. I originally used
VkMemoryPropertyFlags flag = VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_COHERENT_BIT;
and found the memcpy() from the mapped staging buffer very slow. After I added VK_MEMORY_PROPERTY_HOST_CACHED_BIT, the copy became about 10x faster.
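For reference, here is a hedged C sketch of how one might pick such a memory type for the staging buffer (physicalDevice and memRequirements are placeholders from the usual buffer-creation flow):

// Sketch: find a memory type index that is HOST_VISIBLE and HOST_CACHED.
VkPhysicalDeviceMemoryProperties props;
vkGetPhysicalDeviceMemoryProperties(physicalDevice, &props);

VkMemoryPropertyFlags wanted =
    VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT | VK_MEMORY_PROPERTY_HOST_CACHED_BIT;

uint32_t index = UINT32_MAX;
for (uint32_t i = 0; i < props.memoryTypeCount; ++i) {
    if ((memRequirements.memoryTypeBits & (1u << i)) &&
        (props.memoryTypes[i].propertyFlags & wanted) == wanted) {
        index = i;  // use this as VkMemoryAllocateInfo::memoryTypeIndex
        break;
    }
}
// Cached memory may not be coherent, so invalidate the mapped range before reading it.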

Related

What does "VkImageMemoryBarrier::srcAccessMask = 0" mean?

I just read the Images chapter of the Vulkan tutorial, and I didn't understand "VkImageMemoryBarrier::srcAccessMask = 0".
code:
barrier.srcAccessMask = 0;
barrier.dstAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;
and the tutorial says:
Since the transitionImageLayout function executes a command buffer with only a single command, you could use this implicit synchronization and set srcAccessMask to 0 if you ever needed a VK_ACCESS_HOST_WRITE_BIT dependency in a layout transition.
Q1: If the function executes a command buffer with multiple commands, can this implicit synchronization no longer be used?
Q2: According to the manual page, VK_ACCESS_HOST_WRITE_BIT is 0x00004000, but the tutorial uses "0". Why?
Does "0" mean implicit
and "VK_ACCESS_HOST_WRITE_BIT" mean explicit?
Am I understanding correctly?
0 access mask means "nothing". As in, there is no memory dependency the barrier introduces.
Implicit synchronization means Vulkan does it for you. As the tutorial says:
One thing to note is that command buffer submission results in implicit VK_ACCESS_HOST_WRITE_BIT synchronization
Specifically, this is the Host Write Ordering Guarantee.
Implicit means you don't have to do anything. Any host write to mapped memory is already automatically visible to any device access of any vkQueueSubmit called after the mapped memory write.
Explicit in this case would mean to submit a barrier with VK_PIPELINE_STAGE_HOST_BIT and VK_ACCESS_HOST_*_BIT.
Note the sync guarantees only work one way. CPU → GPU is automatic/implicit, but GPU → CPU always needs to be explicit (you need a barrier with dst = VK_PIPELINE_STAGE_HOST_BIT to perform the memory domain transfer operation).
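For illustration, an explicit device-to-host dependency could look roughly like this (a sketch in C; commandBuffer is assumed to be recording, and the compute-shader source side is taken from the compute example above rather than being the only possibility):

VkMemoryBarrier toHost = {
    .sType         = VK_STRUCTURE_TYPE_MEMORY_BARRIER,
    .srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT,   // writes done by the compute shader
    .dstAccessMask = VK_ACCESS_HOST_READ_BIT,      // make them visible to host reads
};
vkCmdPipelineBarrier(commandBuffer,
                     VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,  // src stage
                     VK_PIPELINE_STAGE_HOST_BIT,            // dst stage
                     0, 1, &toHost, 0, NULL, 0, NULL);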

Processes in Operating Systems

When I read a source about processes and threads in operating systems, I came across this sentence, and it sounded odd to me:
When a program is executed and handled by the processor, it converts into a process. A process needs to use the data and code segment in the memory.
I think the first sentence is naturally true. However, I cannot understand why a process would need only the data and code segments.
#include <stdio.h>
#include <stdlib.h>

int x = 10;   /* initialized global: data segment */
int y;        /* uninitialized global: BSS segment */

int main(void) {
    int *array = malloc(sizeof(int) * 4);   /* dynamic allocation: heap */
    printf("x and y are %d %d\n", x, y);    /* the code itself lives in the text segment */
    free(array);
    return 0;
}
I think that when this code is executed, the generated process uses the BSS, data, heap, and code segments. In my opinion, a process can make use of any segment of memory.
If my thoughts are wrong, can anyone explain why?
A process has to store in memory:
Code.
Heap.
Stack.
Data.
BSS.
Except for really trivial programs, a program will use all of these segments. Take a look at Wikipedia's explanation of what each segment contains.
I think that in that sentence the author didn't want to go into details and refers to the Stack/Heap/Data/BSS collectively as the "data" of your program, not to the actual data segment.
This statement is not correct.
When a program is executed and handled by the processor, it converts into a process. A process needs to use the data and code segment in the memory.
A process has to exist before a program can be executed, and on many non-Unix systems a single process runs multiple programs.
I think that when this code is executed, the generated process uses the BSS, data, heap, and code segments. In my opinion, a process can make use of any segment of memory.
The LINKER defines the program segments. The loader follows the instructions of the linker to create the address space.
"BSS, data, heap, and code" is a bad way to envision the address space.
There is:
Executable data.
Read-only data.
Read/write data, which can be either initialized or uninitialized.
Heap and stack are just read/write data. The operating system cannot even tell what data is stack and what is heap. It's all just memory.

how ext4 works with fallocate

Recently, I have been testing how to use the ext4 filesystem properly. What I expect is that:
when the system crashes, data for which a write has already returned OK must not be lost, but metadata may be.
Here is my usage:
1. Call fallocate to allocate a certain amount of space:
fallocate(fd, 0, 0, 4*1024*1024); //4MB
2. Call fsync(fd) so that both data and metadata are written to disk.
3. Then I call a function that randomly writes 4 KB chunks (random data, not zeros) into the file with the O_DIRECT flag, but without calling fsync. I log every offset for which the write returned OK.
4. Check the logged offsets. At some of them, however, the 4 KB I read back is all zeros. It seems those offsets were never used, as if the file had holes.
My questions are:
1. Why, after calling fallocate and fsync, does the file's metadata still seem to indicate that some blocks are unused, so that reading them returns zeros? That is my understanding of what happened.
2. Is there another API I can call that guarantees the allocated space contains no holes, so that once a write with O_DIRECT returns OK the data will not be lost even if the system crashes?
Thanks.
Only actually writing to the file space can eliminate the hole. Without a write there is no dirty page, and fsync simply does nothing.
I am wondering how you executed step 4. It seems that you did it after a real crash, did you? If you read the data back after the write without a crash, it should not be zero, provided you wrote non-zeros. If you read it after a crash, zeros can happen if a disk write cache was involved. However, this kind of zero is not like a hole; the zeros are actually read from the disk (very probably the disk really contains zeros there).
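To make the first answer concrete, here is a hedged C sketch of the idea: after fallocate(), write into the whole allocated range once (zeros here) and fsync() before the real O_DIRECT writes begin, so the space is backed by written blocks rather than hole-like extents. The file name and sizes are placeholders.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void) {
    const size_t file_size = 4 * 1024 * 1024;  /* 4 MB, as in the question */
    const size_t block = 4096;

    int fd = open("testfile", O_RDWR | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) return 1;

    /* 1. reserve the space */
    if (fallocate(fd, 0, 0, file_size) != 0) return 1;

    /* 2. touch every block once so the space is really written, not a hole;
       O_DIRECT requires an aligned buffer */
    void *buf;
    if (posix_memalign(&buf, block, block) != 0) return 1;
    memset(buf, 0, block);
    for (off_t off = 0; off < (off_t)file_size; off += block)
        if (pwrite(fd, buf, block, off) != (ssize_t)block) return 1;

    /* 3. flush data and metadata before the random 4 KB writes begin */
    if (fsync(fd) != 0) return 1;

    free(buf);
    close(fd);
    return 0;
}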

Difference between memcpy_htod and to_gpu in Pycuda?

I am learning PyCUDA, and while going through the documentation on pycuda.gpuarray, I am puzzled by the difference between pycuda.driver.memcpy_htod (also _dtoh) and pycuda.gpuarray.to_gpu (also get). According to the gpuarray documentation, .get() does the following:
Transfer the contents of self into array or a newly allocated numpy.ndarray. If array is given, it must have the right size (not necessarily shape) and dtype. If it is not given, pagelocked specifies whether the new array is allocated page-locked.
Is this saying that .get() is implemented exactly the same way as pycuda.driver.memcpy_dtoh? Somehow, I think I am misinterpreting it.
pycuda.gpuarray.GPUArray.get() copies the contents of the GPUArray back to the host and returns it as a numpy.ndarray.
pycuda.driver.memcpy_dtoh() and friends copy plain buffers between CPU and GPU memory without any processing of the data in the buffers.

Neon VLD consuming more cycles than what is expected?

I have a simple piece of assembly code that loads 12 NEON quad registers, with pairwise-add instructions interleaved with the load instructions (to exploit the dual-issue capability). I have verified the code here:
http://pulsar.webshaker.net/ccc/sample-d3a7fe78
As one can see there, the code should take around 13 cycles. But when I run the code on the board, the load instructions seem to take more than one cycle per load. I verified this and found that VPADAL takes 1 cycle as stated, but VLD1 takes more than one. Why is that?
I have taken care of the following:
The address is 16 byte aligned.
Have provided the alignment hint in the instruction vld1.64 {d0, d1} [r0,:128]!
Tried the preload instruction pld [r0, #192] at various places, but it seems to add cycles instead of actually reducing the latency.
Can someone tell me what I am doing wrong? Why this latency?
Other Details:
With reference to cortex-a8
arm-2009q1 cross compiler tool chain
coding in assembly
Your code is executing much slower than expected because, as currently written, it's causing a perfect storm of pipeline stalls. On any modern CPU with a pipelined architecture, instructions can execute in one cycle under ideal conditions: the instruction is not waiting for memory and has no register dependencies. The way you've written the code, you're not allowing for the delay in reading from memory, and you're making the next instruction dependent on the result of the read. This causes the worst possible performance. Also, I'm not sure why you're accumulating the pairwise adds into multiple registers. Try something like this:
veor.u16 q12,q12,q12 @ clear the accumulated sum
top_of_loop:
vld1.u16 {q0,q1},[r0,:128]! @ issue the loads first...
vld1.u16 {q2,q3},[r0,:128]!
vpadal.u16 q12,q0 @ ...so the data has arrived by the time the adds consume it
vpadal.u16 q12,q1
vpadal.u16 q12,q2
vpadal.u16 q12,q3
vld1.u16 {q0,q1},[r0,:128]!
vld1.u16 {q2,q3},[r0,:128]!
vpadal.u16 q12,q0
vpadal.u16 q12,q1
vpadal.u16 q12,q2
vpadal.u16 q12,q3
subs r1,r1,#8
bne top_of_loop
Experiment with different numbers of load instructions before executing the adds. The point is that you need to allow time for the read to occur before you can use the target register.
Note: Using Q4-Q7 is risky because they're non-volatile registers. On Android you will get random garbage appearing in these (especially Q4).