How is the binary from the Pipeline Cache generated? - vulkan

I was trying to get the binary of a shader that runs on my GPU. I managed to get it from the pipeline cache (using VkPipelineCache and vkGetPipelineCacheData) and exported it to a file. Now I want to find more information on how this binary is generated.
My questions are:
1) What kind of binary is it?
2) What is the format of the binary? (size of headers etc…)
3) Does the Vulkan driver generate the binary itself, or does it use Nvidia's compiler/drivers?
4) Does it follow the Nvidia ISA? At some point it should, since at the end of the day it executes on the GPU; the question is whether, at that level (the pipeline cache), a translation to the target device ISA has already been performed.
Let me mention that I am running Vulkan 1.1.97 on a GeForce GT 740M (418.56 drivers).

The pipeline cache's data is entirely implementation-dependent. The driver spits out some binary data which it may be able to read later. That's the beginning and end of what is known about it.
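If it helps for inspecting the blob, here is a minimal sketch of dumping the cache data with the usual two-call idiom (device and pipelineCache are assumed to be valid handles your application created elsewhere). The only portion with a layout defined by the Vulkan spec is the small header at the start of the data: header length, header version, vendorID, deviceID and the pipeline cache UUID; everything after that is opaque, driver-defined data.

// Hedged sketch: dump pipeline cache data to a file and peek at the
// spec-defined header that precedes the opaque payload.
#include <vulkan/vulkan.h>
#include <cstdint>
#include <cstring>
#include <fstream>
#include <vector>

void dumpPipelineCache(VkDevice device, VkPipelineCache pipelineCache)
{
    // Two-call idiom: first query the size, then fetch the data.
    size_t size = 0;
    vkGetPipelineCacheData(device, pipelineCache, &size, nullptr);
    std::vector<uint8_t> blob(size);
    vkGetPipelineCacheData(device, pipelineCache, &size, blob.data());

    // Spec-defined header fields at the start of the blob.
    uint32_t headerSize, headerVersion, vendorID, deviceID;
    std::memcpy(&headerSize,    blob.data() + 0,  4);
    std::memcpy(&headerVersion, blob.data() + 4,  4);
    std::memcpy(&vendorID,      blob.data() + 8,  4);
    std::memcpy(&deviceID,      blob.data() + 12, 4);
    // Bytes 16..16+VK_UUID_SIZE hold the pipelineCacheUUID.
    // Everything after the header is opaque, driver-defined data.

    std::ofstream out("pipeline_cache.bin", std::ios::binary);
    out.write(reinterpret_cast<const char*>(blob.data()), size);
}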

Related

How can I write raw data on an SD card without a filesystem by using a DSP?

I'm new to embedded electronics/programming, but I have a project. For this project, I want to write raw data provided by a sensor in 512-byte buffers to an SD card (SPI mode) with no filesystem.
I'm trying to do it by using the low-level functions of the FatFs library. I've heard that I have to create my own format and rewrite functions to do it, but I'm lost in all those lines of code... I suppose that I first have to initialize the SD card by using a command sequence (CMD0, CMD8, CMD55, ACMD41...).
I'm not sure about the next steps, e.g. whether I have to open a file with the fopen function and then use the fwrite function...
If somebody could explain to me how it works for an SD card without a filesystem and guide me through the steps to follow, I would be very grateful.
Keep in mind that cards with a memory capacity > 2 TB are not supported in SPI mode (see the Physical Layer Simplified Specification).
If you don't want to use a file system, then fopen and fwrite are irrelevant. You simply use the low-level block driver. If you are referring to http://www.elm-chan.org/fsw/ff/00index_e.html, then the API subset you need is (a minimal usage sketch follows the list):
disk_status - Get device status
disk_initialize - Initialize device
disk_read - Read data
disk_write - Write data
disk_ioctl - Control device dependent functions
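As a rough illustration, assuming the stock SPI-mode diskio implementation from the FatFs samples, drive number 0, and 512-byte sectors (the sector-index type is LBA_t in recent FatFs releases, DWORD in older ones), writing raw blocks could look like this sketch:

/* Hedged sketch: write 512-byte sensor buffers to raw sectors through the
 * FatFs low-level disk I/O layer, with no filesystem on the card. */
#include "diskio.h"

#define DRIVE      0   /* physical drive number */
#define DATA_START 0   /* first sector used for our raw data */

int write_sample_block(const BYTE buf[512], LBA_t sector_index)
{
    if (disk_write(DRIVE, buf, DATA_START + sector_index, 1) != RES_OK)
        return -1;
    disk_ioctl(DRIVE, CTRL_SYNC, 0);   /* ensure the data reaches the card */
    return 0;
}

int main(void)
{
    if (disk_initialize(DRIVE) & STA_NOINIT)
        return -1;                     /* card did not initialize */

    BYTE buffer[512] = {0};            /* fill this with sensor data */
    LBA_t next = 0;
    /* You still need your own scheme (e.g. a counter kept in sector 0)
     * to remember how many sectors have been written so far. */
    write_sample_block(buffer, next++);
    return 0;
}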
However, since you then have to manage the blocks somehow so you know which blocks are available and where to write next (like one large file), you will essentially be writing your own filesystem (albeit a very simple and limited one). So that raises the question of why you would not just use an existing filesystem.
There are many reasons for not using FAT in an embedded system, but few of them will be resolved by implementing your own "raw" filesystem without doing a lot of work reinventing the filesystem! Consider something more robust and designed for resource-constrained embedded systems with a potentially unreliable power source, such as littlefs.

Would a Vulkan program run on a device without a GPU (discrete or integrated)?

Perhaps this question could be rephrased as 'what would happen if I were to try and run a Vulkan program on a CPU-only build?'.
I'm wondering whether the program would run but not produce output, crash, or not build in the first place (although I expect the build process to target a CPU architecture rather than a GPU architecture).
Would it use the on-motherboard graphics to produce output? In that case, what would happen if the program were run on a CPU-only server?
It depends on how the program initializes Vulkan.
Any build can have the Vulkan loader installed; this is the dynamically loaded library that finds the actual driver. If it is missing, the program would be unable to load the loader and may either fail to start or show an error message, depending on how it tries to load it.
If no device is available, then the number of devices reported is 0. This is again up to the application to manage, either by falling back to an alternative graphics API (OpenGL) or by showing an error message and failing to start.
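A minimal sketch of that device check, assuming the loader itself is present (instance creation trimmed to the bare minimum):

// Hedged sketch: detect the "no Vulkan device" case described above.
// If the loader were missing, the program would already have failed to
// load the Vulkan library before reaching this point.
#include <vulkan/vulkan.h>
#include <cstdio>

int main()
{
    VkApplicationInfo app{VK_STRUCTURE_TYPE_APPLICATION_INFO};
    app.apiVersion = VK_API_VERSION_1_1;
    VkInstanceCreateInfo ci{VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO};
    ci.pApplicationInfo = &app;

    VkInstance instance;
    if (vkCreateInstance(&ci, nullptr, &instance) != VK_SUCCESS) {
        std::fprintf(stderr, "No usable Vulkan driver installed\n");
        return 1;
    }

    uint32_t deviceCount = 0;
    vkEnumeratePhysicalDevices(instance, &deviceCount, nullptr);
    if (deviceCount == 0) {
        // The CPU-only situation: the loader works, but no driver exposes
        // a physical device. Fall back to another API or exit gracefully.
        std::fprintf(stderr, "No Vulkan-capable device found\n");
        return 1;
    }
    return 0;
}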

Why is my pcl cuda code running on the CPU instead of the GPU?

I have a code where I use the pcl/gpu namespace:
pcl::gpu::Octree::PointCloud clusterCloud;
clusterCloud.upload(cloud_filtered->points);
pcl::gpu::Octree::Ptr octree_device (new pcl::gpu::Octree);
octree_device->setCloud(clusterCloud);
octree_device->build();
/*tree->setCloud (clusterCloud);*/
// Create the cluster extractor object for the planar model and set all the parameters
std::vector<pcl::PointIndices> cluster_indices;
pcl::gpu::EuclideanClusterExtraction ec;
ec.setClusterTolerance (0.1);
ec.setMinClusterSize (2000);
ec.setMaxClusterSize (250000);
ec.setSearchMethod (octree_device);
ec.setHostCloud (cloud_filtered);
ec.extract (cluster_indices);
I have installed CUDA and included the needed pcl/gpu ".hpp" headers to do this. It compiles (I have a catkin workspace with ROS), but when I run it, it is really slow. I used nvidia-smi and saw that my code is only running on the CPU, and I don't know why or how to solve it.
This code is an implementation of the gpu/segmentation example here:
pcl/seg.cpp
(Making this an answer since it's too long for a comment.)
I don't know pcl, but maybe it's because you pass a host-side std::vector rather than data that's on the device side.
... what is "host side" and "device side", you ask? And what's std?
Well, std is just a namespace used by the C++ standard library. std::vector is a (templated) class in the C++ standard library, which dynamically allocates memory for the elements you put in it.
The thing is, the memory std::vector uses is your main system memory (RAM), which doesn't have anything to do with the GPU. But it's likely that your pcl library requires that you pass data that's in GPU memory, which can't be the data in an std::vector. You would need to allocate device-side memory and copy your data there from the host-side memory.
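As a purely illustrative sketch of that host-to-device copy (plain CUDA runtime API, not PCL-specific; the PointXYZ struct below is a stand-in, not the actual PCL type, and PCL's gpu containers wrap this kind of copy for you):

// Hedged sketch: moving host-side std::vector data into device memory
// with the CUDA runtime API.
#include <cuda_runtime.h>
#include <vector>

struct PointXYZ { float x, y, z; };

int main()
{
    std::vector<PointXYZ> host_points(10000);   // lives in system RAM

    PointXYZ* device_points = nullptr;
    size_t bytes = host_points.size() * sizeof(PointXYZ);

    cudaMalloc(reinterpret_cast<void**>(&device_points), bytes);  // GPU memory
    cudaMemcpy(device_points, host_points.data(), bytes,
               cudaMemcpyHostToDevice);                           // host -> device

    // ... launch kernels / call GPU-side APIs on device_points ...

    cudaFree(device_points);
    return 0;
}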
See also:
Why we do not have access to device memory on host side?
and consult the CUDA programming guide regarding how to perform this allocation and copying (at least, how to perform it at the lowest possible level; your "pcl" may have its own facilities for this.)

How to send data to target through Trace32 debugger?

I need a way to send some data to the microcontroller through Trace32. I've heard that this is possible somehow, but I have no idea where to start.
What I am actually trying to do is run a piece of code on an Aurix TC297 microcontroller to do some measurements (runtime, RAM, etc.). This piece of code is a Kalman filter that needs as input a vector of structs that I have to send from the computer through Trace32. Please help!
"A way to send some data to the ucontroller through Trace32" is a little bit vague. There are various possibilities depending on what your actually try to achieve and might also depend on the used CPU family and target OS. Anyhow one of the following might work:
Simply writing some raw data to the target memory can be achieved with the Data.Set command.
To transfer a big amount of data (or even a whole application) from a file to the target memory the Data.LOAD commands might be the right choice. E.g. Data.LOAD.Binary command for a raw binary file.
To set variables in your application or even initiate C-style data arrays use the Var.Set command.
To write data to NOR flash or onchip flash memory you'll need the FLASH.AUTO command in addition to the previously mentioned commands (after declaring the flash memory to TRACE32).
To write data to a NAND, SPI or other serial flash memory you probably should use the FLASHFILE.Set command (after initialization of the FLASHFILE programming system).
To transfer data from TRACE32 to your target while the CPU is running, you might have to configure SYStem.MemAccess correctly and use the memory access class prefix "E", e.g. Data.Set E:<addr> <data> or Var.Set %E <expression>.
You can use FDX for a bidirectional data transfer between debugger and a running target application.
To enable the target application to open and read files from the computer running TRACE32, you have to compile your application with suitable semihosting code and initiate semihosting in TRACE32 with the TERM.GATE command.

What are common structures for firmware files?

I'm a total n00b in embedded programming. Suppose I'm building a firmware using a compiler. The result of this operation is a file that will be flashed into (I guess) the flash memory of an MCU such as an ARM or an AVR.
My question is: What common structures (if any) are used for such generated files containing the firmware?
I come from desktop development, and I understand that, for example, for Windows the compiler will most likely generate a PE or PE+, while on Unix-like systems I may get an ELF or COFF, but I have no idea for embedded systems.
I also understand that this highly depends on many factors (Compiler, ISA, MCU vendor, OS, etc.) so I'm fine with at least one example.
Update: I will upvote all answers providing examples of structures used and will select the one I feel best surveys the state of the art.
The firmware file is the Executable and Linkable Format (ELF) file, usually processed into a raw binary (.bin) or a text-represented binary (.hex).
This binary file is the exact memory image that is written to the embedded flash. When you first power the board, an internal bootloader will redirect execution to your firmware's entry point, normally at address 0x0.
From there on, it is your code that is running. This is why you have startup code (usually a startup.s file) that configures the clock, the stack pointer, and the vector table, loads the data section into RAM (your initialized variables), clears the zero-initialized section, maybe copies your code to RAM and jumps to the copy to avoid running code from flash (which can be faster on some platforms), and so on.
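The C part of such startup code often looks roughly like the sketch below; the section-boundary symbols (_sidata, _sdata, _edata, _sbss, _ebss) are placeholder names that a linker script would define, and actual names vary by toolchain:

// Hedged sketch of the C runtime setup a startup file typically performs.
#include <stdint.h>

extern uint32_t _sidata;  // start of .data initial values (in flash)
extern uint32_t _sdata;   // start of .data in RAM
extern uint32_t _edata;   // end of .data in RAM
extern uint32_t _sbss;    // start of .bss in RAM
extern uint32_t _ebss;    // end of .bss in RAM

extern int main(void);

void Reset_Handler(void)
{
    // Copy initialized variables from flash to RAM.
    uint32_t *src = &_sidata;
    for (uint32_t *dst = &_sdata; dst < &_edata; )
        *dst++ = *src++;

    // Zero the .bss section (zero-initialized variables).
    for (uint32_t *dst = &_sbss; dst < &_ebss; )
        *dst++ = 0;

    // Clock and vector table configuration would usually happen here too.
    main();
    for (;;) { }  // never return from the reset handler
}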
When running on top of an operating system, all these platform choices and resources are not in the control of user code; there you can only link to the OS libraries and use the provided API to do low-level actions. In embedded, it is 100% user code: you access the hardware and manage its resources yourself.
Not surprisingly, operating systems are booted in a similar manner to firmware, since both sit directly on top of the processor, memory, and I/O.
All of that is to say: the structure of firmware is similar to the structure of any compiled program. There are data sections and code sections that are organized in memory during loading, by the operating system or by the program itself when running on embedded.
One main difference is the memory addressing in the firmware binary: addresses are usually physical addresses, since most microcontrollers have no memory-mapping (MMU) feature. This is transparent to the user; the toolchain abstracts it.
Another significant difference is the stack pointer: under an OS, user code does not reserve memory for the stack by itself, it relies on the OS for that. In firmware, you have to do it in user code, for the same reason as before: there is no middleman to manage it for you. The linker script will reserve stack and heap memory as configured, and there will be a stack-pointer symbol in your .map file telling you where it points. You won't find that in an OS program's map file.
Most tools output either an ELF, or a COFF, or something similar that can eventually boil down to a HEX/bin file.
That isn't necessarily what your target wants to see, however. Every vendor has their own format of "firmware" files. Sometimes they're encrypted and signed, sometimes plain text. Sometimes there's compression, sometimes it's raw. It might be a simple file, or something complex that is more than just your program.
An integral part of doing embedded work is the build flow and system startup/booting procedure, plus getting your code onto the part. Don't underestimate the effort.
Ultimately the data written to the ROM is normally just the code and constant data from which your application is composed, and therefore has no structure other than perhaps being segmented into code and data, and possibly custom segments if you have created them. The structure in this sense is defined by the linker script or configuration used to build the code. The file containing this code/data may be raw binary, or an encoded binary format such as Intel Hex or Motorola S-Record for example.
Typically your toolchain will also generate an object code file that contains not only the code/data but also symbolic and debug information for use by a debugger. In this case, when the debugger runs, it loads the code to the target (as in the binary file case above) and the symbol/debug information to the host to allow source-level debugging. These files may be in a proprietary object file format specific to the toolchain, but are often standard "open" formats such as ELF. However, strictly speaking, the metadata components of an object file are not part of the firmware, since they are not loaded onto the target.
I've recently run across another firmware format not listed here. It's a binary format, might be called ".EEP" but might not. I think it is used by NXP. I've seen it used for ARM THUMB2 and for mystery stuff that may be a DSP/BSP.
All the following are 32-bit values, all stored in reverse endian except for CAFEBABE (so... BEBAFECA?):
CAFEBABE
length_in_16_bit_words(yes, 16-bit...?!)
base_addr
CRC32
length*2 bytes of data
FFFF (optional filler if the length is an odd number)
If there are more data blocks:
length
base
checksum that is not a CRC but something bizarre
data
FFFF (optional filler if the length is an odd number)
...
When no more data blocks remain:
length == 0
base == 0
checksum that is not a CRC but something bizarre
Then all of that is repeated for another memory bank/device.
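Assuming that description is accurate, a rough reader for one such data block might look like the sketch below (field names are invented for illustration, and the non-CRC checksum used by later blocks is left alone since its algorithm is unknown):

// Hedged sketch: parse one data block of the format described above.
#include <cstdint>
#include <vector>

// Reads a 32-bit value stored "reverse endian" relative to the CAFEBABE magic.
static uint32_t read_u32_le(const uint8_t* p)
{
    return p[0] | (p[1] << 8) | (p[2] << 16) | (uint32_t(p[3]) << 24);
}

struct Block {
    uint32_t length_words;        // length in 16-bit words
    uint32_t base_addr;           // load address
    uint32_t checksum;            // CRC32 for the first block
    std::vector<uint8_t> data;
};

// Returns bytes consumed, or 0 for the terminating block (length == 0).
size_t parse_block(const uint8_t* p, Block& out)
{
    out.length_words = read_u32_le(p + 0);
    out.base_addr    = read_u32_le(p + 4);
    out.checksum     = read_u32_le(p + 8);
    if (out.length_words == 0)
        return 0;                             // end of blocks for this bank

    size_t data_bytes = out.length_words * 2; // length is in 16-bit words
    out.data.assign(p + 12, p + 12 + data_bytes);

    // Odd word counts are padded with an FFFF filler word.
    size_t padding = (out.length_words % 2) ? 2 : 0;
    return 12 + data_bytes + padding;
}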