What does `VFCT` stand for in the `UEFI_ACPI_VFCT` struct in the amdgpu driver?

The UEFI_ACPI_VFCT struct is defined here and here in the kernel tree, but I can't find any more information about it. What does it stand for?

Related

OpenCL local memory exists on Mali/Adreno GPU

Does OpenCL local memory really exist on Mali/Adreno GPUs, or does it only exist on some special mobile phones?
If it does exist, in which cases should we use local memory, e.g. in GEMM/Conv or other CL kernels?
Interesting question. OpenCL defines a number of conceptual memories, including local memory, constant memory, global memory, and private memory. Physically, how these memories are implemented is hardware dependent; for instance, some devices may emulate local memory using cache or system memory instead of providing dedicated physical memory.
AFAIK, ARM Mali GPUs do not have local memory, whereas Qualcomm Adreno GPUs do.
For instance, the table below (not reproduced here) shows the definition of each OpenCL memory type and its relative latency and physical location on Adreno GPUs, cited from "OpenCL Optimization and Best Practices for Qualcomm Adreno GPUs".
Answer updated:
As commented by SK-logic below, Mali-6xx GPUs do have local memory (shared with the cache).
On recent Mali GPUs the memory is shared rather than truly local, but OpenCL still has the concept of the memories being separate, so there are special commands to make sure no copying takes place. Use of private/local memory is not recommended on Mali.
For more information on best use of memory with Mali OpenCL, please read:
https://developer.arm.com/documentation/101574/0400/Optimizing-OpenCL-for-Mali-GPUs/Optimizing-memory-allocation/About-memory-allocation?lang=en
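If you want to check what a particular device actually reports, here is a minimal sketch (not part of the original answer) that queries CL_DEVICE_LOCAL_MEM_TYPE and CL_DEVICE_LOCAL_MEM_SIZE; it assumes a cl_device_id already obtained via clGetDeviceIDs. CL_LOCAL means dedicated on-chip local memory, CL_GLOBAL means local memory is emulated in global memory (as on Mali):
#include <CL/cl.h>
#include <cstdio>

// Print whether local memory is dedicated or emulated, and how big it is.
void printLocalMemInfo(cl_device_id device) {
    cl_device_local_mem_type type;
    cl_ulong size;
    clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_TYPE, sizeof(type), &type, NULL);
    clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE, sizeof(size), &size, NULL);
    printf("local mem: %s, %llu bytes\n",
           type == CL_LOCAL ? "dedicated" : "emulated in global memory",
           (unsigned long long)size);
}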

Query size of local memory accessible to subgroups in Vulkan

Is there a way to know how much local memory every compute unit has access to? For example in OpenCL I could call
cl_ulong size;
clGetDeviceInfo(deviceID, CL_DEVICE_LOCAL_MEM_SIZE, sizeof(cl_ulong), &size, 0);
Vulkan should have something equivalent to this.
The GLSL compute shader abstraction's equivalent to OpenCL local memory is shared memory: memory accessible to all work items (invocations) in a work group, declared with shared-qualified variables. In OpenGL you would query GL_MAX_COMPUTE_SHARED_MEMORY_SIZE to get the amount of shared memory; in Vulkan, the corresponding limit is maxComputeSharedMemorySize in VkPhysicalDeviceLimits, obtained via vkGetPhysicalDeviceProperties().
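A minimal sketch of that Vulkan query (my own, assuming a VkPhysicalDevice already obtained via vkEnumeratePhysicalDevices):
#include <vulkan/vulkan.h>
#include <cstdio>

void printSharedMemLimit(VkPhysicalDevice physicalDevice) {
    VkPhysicalDeviceProperties props;
    vkGetPhysicalDeviceProperties(physicalDevice, &props);
    // Vulkan analogue of CL_DEVICE_LOCAL_MEM_SIZE: per-work-group shared memory.
    printf("maxComputeSharedMemorySize = %u bytes\n",
           props.limits.maxComputeSharedMemorySize);
}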

Why is my PCL CUDA code running on the CPU instead of the GPU?

I have some code where I use the pcl::gpu namespace:
pcl::gpu::Octree::PointCloud clusterCloud;
clusterCloud.upload(cloud_filtered->points);
pcl::gpu::Octree::Ptr octree_device (new pcl::gpu::Octree);
octree_device->setCloud(clusterCloud);
octree_device->build();
/*tree->setCloud (clusterCloud);*/
// Create the cluster extractor object for the planar model and set all the parameters
std::vector<pcl::PointIndices> cluster_indices;
pcl::gpu::EuclideanClusterExtraction ec;
ec.setClusterTolerance (0.1);
ec.setMinClusterSize (2000);
ec.setMaxClusterSize (250000);
ec.setSearchMethod (octree_device);
ec.setHostCloud (cloud_filtered);
ec.extract (cluster_indices);
I have installed CUDA and included the needed pcl/gpu ".hpp" headers to do this. It compiles (I have a catkin workspace with ROS), but when I run it, it is really slow. I checked with nvidia-smi and my code is only running on the CPU, and I don't know why or how to solve it.
This code is an implementation of the gpu/segmentation example here:
pcl/seg.cpp
(Making this an answer since it's too long for a comment.)
I don't know pcl, but maybe it's because you pass a host-side std::vector rather than data that's on the device side.
... what is "host side" and "device side", you ask? And what's std?
Well, std is just a namespace used by the C++ standard library. std::vector is a (templated) class in the C++ standard library, which dynamically allocates memory for the elements you put in it.
The thing is, the memory std::vector uses is your main system memory (RAM), which doesn't have anything to do with the GPU. But it's likely that your pcl library requires that you pass data that's in GPU memory, which can't be the data in a std::vector. You would need to allocate device-side memory and copy your data there from host memory.
See also:
Why we do not have access to device memory on host side?
and consult the CUDA programming guide regarding how to perform this allocation and copying (at least, how to perform it at the lowest possible level; your "pcl" may have its own facilities for this.)
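To make the answer's point concrete, here is a minimal sketch of mine (not PCL-specific) of the raw CUDA runtime calls for moving a host-side std::vector's contents into device memory; PCL's gpu containers, such as the upload() call in the question, wrap this kind of transfer for you:
#include <cuda_runtime.h>
#include <vector>

int main() {
    std::vector<float> host_points(1024, 1.0f);          // lives in system RAM

    float* device_points = nullptr;
    size_t bytes = host_points.size() * sizeof(float);

    cudaMalloc(&device_points, bytes);                   // allocate GPU (device) memory
    cudaMemcpy(device_points, host_points.data(), bytes,
               cudaMemcpyHostToDevice);                  // copy host -> device

    // ... launch kernels that operate on device_points ...

    cudaFree(device_points);
    return 0;
}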

How do GPUs handle indirect branches?

My understanding of GPUs is that they handle branches by executing all paths while suspending the instances that are not supposed to execute a given path. This works well for if/then/else constructs and loops (an instance that has exited the loop can be suspended until all instances are suspended).
This flat out does not work if the branch is indirect. But modern GPUs (Fermi and beyond for NVidia; not sure when it appeared for AMD, R600?) claim to support indirect branches (function pointers, virtual dispatch, ...).
The question is: what kind of magic is going on in the chip to make this happen?
According to the CUDA programming guide, there are some strong restrictions on virtual functions and dynamic dispatching.
See http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#functions for more information. Another interesting article about how code is mapped to the GPU hardware is http://yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html .
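As an illustration of one such restriction (my own sketch, not taken from the guide): virtual dispatch works in device code when the object is constructed on the device, so that its vtable points at device functions; an object of a class with virtual functions cannot be passed from the host to a __global__ function. Assumes a device of compute capability 2.0 or later:
#include <cuda_runtime.h>
#include <cstdio>

struct Shape {
    __device__ virtual float area() const = 0;
    __device__ virtual ~Shape() {}
};

struct Square : Shape {
    float side;
    __device__ explicit Square(float s) : side(s) {}
    __device__ float area() const override { return side * side; }
};

__global__ void areaKernel(float* out) {
    Shape* shape = new Square(3.0f);   // constructed on the device, not passed from the host
    *out = shape->area();              // dynamic dispatch through the device-side vtable
    delete shape;
}

int main() {
    float* d_out;
    cudaMalloc(&d_out, sizeof(float));
    areaKernel<<<1, 1>>>(d_out);

    float h_out = 0.0f;
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("area = %f\n", h_out);      // expected: 9.0
    cudaFree(d_out);
    return 0;
}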
GPUs are like CPUs in regards to indirect branches. They both have an IP (instruction pointer) that points to physical memory. This IP is incremented for each hardware instruction that gets executed. An indirect branch just sets the IP to the new location. How this is done is a little bit more complicated. I will use PTX for Nvidia and GCN Assembly for AMD.
An AMD GCN GPU can have its IP simply set from a register. Example: "s_setpc_b64 s[8:9]" sets the IP from a pair of scalar registers, so it can be set to any value. In fact, on an AMD GPU it's possible to write to program memory from a kernel and then set the IP to execute it (self-modifying code).
On NVidia's PTX there is no indirect jump. I have been waiting for real hardware indirect branch support since 2009. The most current version of the PTX ISA, 4.3, still does not have indirect branching. In the current PTX ISA manual, http://docs.nvidia.com/cuda/parallel-thread-execution, it still states that "Indirect branch is currently unimplemented".
However, "indirect calls" are supported via jump tables. These are slightly different then indirect branches but do the same thing. I did some testing with jump tables in the past and the performance was not great. I believe the way this works is that the kernel is lunched with a table of already known call locations. Then when it runs across a "call %r10(params)" (something like that) it saves the current IP and then references the jump table by an index and then sets the IP to that address. I'm not 100% sure but its something like that.
Like you said, besides branching, both AMD and NVidia GPUs also allow instructions to be executed but ignored: they are executed but the output is not written. This is another way of handling an if/then/else, as some cores are masked off while others run. It does not really have much to do with branching; it is a trick to avoid time-consuming branches (predication). Some CPUs, like the Intel Itanium, also do this.
You can also try searching under these other names: indirect calls, indirect branches, dynamic branching, virtual functions, function pointers, or jump tables.
Hope this helps. Sorry I went so long.
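To make the indirect-call mechanism concrete, here is a minimal sketch of my own (not from the answer) of a call through a device function pointer in CUDA C++, which the compiler lowers through the "call %rN"-style mechanism described above; it assumes a Fermi-class (compute capability 2.0+) or newer device:
#include <cuda_runtime.h>
#include <cstdio>

__device__ int add1(int x) { return x + 1; }
__device__ int mul2(int x) { return x * 2; }

__global__ void indirectCall(int* out, int selector) {
    // The target is only known at run time, so this becomes an indirect call.
    int (*op)(int) = (selector == 0) ? add1 : mul2;
    out[threadIdx.x] = op(threadIdx.x);
}

int main() {
    int* d_out;
    cudaMalloc(&d_out, 8 * sizeof(int));
    indirectCall<<<1, 8>>>(d_out, 1);                 // select mul2 at run time

    int h_out[8];
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    for (int i = 0; i < 8; ++i) printf("%d ", h_out[i]);
    printf("\n");
    cudaFree(d_out);
    return 0;
}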

NVIDIA CUDA 4.0, page-locking memory with the runtime API

NVIDIA CUDA 4.0 (RC2 is assumed here) offers the nice feature of page-locking a memory range that was previously allocated via the "normal" malloc function. This can be done using the driver API function:
CUresult cuMemHostRegister (void * p, size_t bytesize, unsigned int Flags);
Now, the development of the project has so far been done using the runtime API. Unfortunately, it seems that the runtime API does not offer a function like cuMemHostRegister. I would really like to avoid mixing driver and runtime API calls.
Does anyone know how to page-lock memory that was previously allocated using standard malloc? Standard libc functions should not be used, since the page-locking is carried out to stage the memory for a fast transfer to the GPU, so I really want to stick to the "CUDA" way.
Frank
The 4.0 runtime API offers cudaHostRegister(), which does exactly what you are asking about. Be aware that the memory allocation you lock must be host page aligned, so you probably should use either mmap() or posix_memalign() (or one of its relatives) to allocate the memory. Passing cudaHostRegister() an allocation of arbitrary size from standard malloc() will probably fail with an invalid argument error.
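A minimal sketch of that approach (mine, reflecting the alignment caveat above): allocate page-aligned host memory with posix_memalign() and pin it with cudaHostRegister():
#include <cuda_runtime.h>
#include <unistd.h>
#include <cstdlib>
#include <cstdio>

int main() {
    const size_t bytes = 1 << 20;                       // 1 MiB buffer
    const size_t page = (size_t)sysconf(_SC_PAGESIZE);  // host page size

    void* p = nullptr;
    if (posix_memalign(&p, page, bytes) != 0) return 1; // page-aligned host allocation

    // Page-lock (pin) the existing allocation so the GPU can DMA to/from it.
    cudaError_t err = cudaHostRegister(p, bytes, cudaHostRegisterDefault);
    if (err != cudaSuccess) {
        printf("cudaHostRegister failed: %s\n", cudaGetErrorString(err));
        free(p);
        return 1;
    }

    // ... cudaMemcpy() calls involving p now take the fast pinned-memory path ...

    cudaHostUnregister(p);                              // unpin before freeing
    free(p);
    return 0;
}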