Would a Vulkan program run on a device without gpu (discrete or integrated)? - vulkan

Perhaps this question could be rephrased as 'what would happen if I were to try to run a Vulkan program on a CPU-only build?'
I'm wondering whether the program would run but not produce output, crash, or not build in the first place (although I expect the build to target a CPU architecture rather than a GPU architecture).
Would it use the on-motherboard graphics to produce output? In that case, what would happen if the program was run on a CPU-only server?

It depends on how the program initializes Vulkan.
Any system can have the Vulkan loader installed. This is the dynamically loaded library that finds the actual driver. If it is missing, the program will be unable to load it and may either fail to start or show an error message, depending on how it tries to load the library.
If no device is available, then the number of devices reported is 0. Again, it is up to the application to handle this, either by falling back to an alternative graphics API (OpenGL) or by showing an error message and failing to start.
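The two failure points above (missing loader, zero devices) can be sketched as follows. This is only a minimal sketch in Python, not real Vulkan initialization: `load_vulkan_loader` and `choose_backend` are hypothetical names, the file names are the conventional loader library names, and the device count is passed in by the caller rather than queried from the driver.

```python
import ctypes

# Conventional file names of the Vulkan loader on Linux, Windows and macOS.
LOADER_NAMES = ("libvulkan.so.1", "vulkan-1.dll", "libvulkan.1.dylib")


def load_vulkan_loader(names=LOADER_NAMES):
    """Return a handle to the first loader library that loads, or None."""
    for name in names:
        try:
            return ctypes.CDLL(name)
        except OSError:
            continue  # not installed under this name, try the next one
    return None


def choose_backend(loader, device_count):
    """Hypothetical policy mirroring the cases described in the answer."""
    if loader is None:
        return "error: Vulkan loader not installed"
    if device_count == 0:
        return "fallback: no Vulkan device, try OpenGL or exit with a message"
    return "vulkan"
```

A real application would obtain `device_count` from the Vulkan API after creating an instance; here it is an input so the decision logic can be read on its own.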

Related

Do I have to use Full System Mode after adding a new device on gem5?

I'm trying to add an ORAM module to gem5 that would modify the address on its way from the CPU to memory. After reading the introduction on how to add a device named HelloDevice to gem5 in the ASPLOS 2008 tutorial, I am still confused: if I add a new device to gem5, do I have to use Full System Mode to run tests/test-progs/hello/bin/x86/linux/hello?
tests/test-progs/hello/bin/x86/linux/hello is a userland executable, meant to be run with se.py.
I think devices are not visible from se.py, since it only emulates userland by translating simple instructions and capturing syscalls; you can't see, for example, arbitrary hardware registers or physical memory.
Therefore yes, I think you need to use full system emulation with your build.
If you don't know how to use fs.py, give this setup a try.

What is the exact role of an interpreter?

I am having trouble understanding the exact role of an interpreter. To quote Wikipedia: "Programs in interpreted languages[1] are not translated into machine code however, although their interpreter (which may be seen as an executor or processor) typically consists of directly executable machine code (generated from assembly and/or high level language source code)."
My doubt is about this statement: "interpreter (which may be seen as an executor or processor) typically consists of directly executable machine code". What does that mean? An interpreter is supposed to be a program. How can it 'execute' code by itself? They have restated this by saying "interpreter is different from language translators like compilers". Can anyone clarify, please? Also, what is the difference (if any) between an interpreted language and machine code?
Compiler:
Transforms your code into binary machine code which can be directly executed by the CPU. Example: C, Fortran
Interpreter:
Is a program that executes the code written by the programmer without an additional step of transformation. Example: Bash scripts, Formulas in Excel
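To make the interpreter case concrete, here is a minimal sketch in Python of interpreting a spreadsheet-style formula. It is not how Excel works internally: `evaluate_formula` and the three supported function names are assumptions made for illustration.

```python
def evaluate_formula(text):
    """Interpret a tiny spreadsheet-style formula such as '=SUM(1, 2, 3)'."""
    if not text.startswith("="):
        raise ValueError("a formula must start with '='")
    name, _, rest = text[1:].partition("(")
    if not rest.endswith(")"):
        raise ValueError("missing closing parenthesis")
    # Parsing step: turn the argument text into numbers.
    args = [float(part) for part in rest[:-1].split(",") if part.strip()]
    # Execution step: dispatch to code that is already compiled into the
    # interpreter itself; no machine code is generated for the formula.
    functions = {"SUM": sum, "MAX": max, "MIN": min}
    try:
        function = functions[name.strip().upper()]
    except KeyError:
        raise ValueError(f"unknown function: {name!r}") from None
    return function(args)
```

The formula text is never turned into machine code. The interpreter reads it, decides what it means, and calls routines that already exist inside the interpreter.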
Actually, it is not that simple any more. There are many concepts between these two poles. Java is compiled into an intermediate language that is then interpreted, and just-in-time compilers compile small parts of interpreted code to speed them up.
"How can it 'execute' code by itself?" Take the Excel example. If you type a calculation into a cell, Excel somehow executes the code, right? Excel does not compile the code and run it; it parses it and executes it in a generic way. Excel has a sum function that in the end is executed on the processor as an add machine instruction, but there is a lot for Excel to do in between.
I will briefly describe an emulator to explain the main concept mentioned in the question.
Suppose I am using MAME, a video game emulator, and select the old arcade classic "Ms. Pac-Man". Looking at the schematic, or directly at the PCB inside the arcade cabinet, it is easy to find the processor: the Zilog Z80, the only large chip with 40 pins. Now, if we get the technical data for that processor, we can find the binary encoding of each instruction it can execute. Basically, it fetches an 8-bit value (ranging from 0 to 255) which tells the processor what to do. The emulator reads that byte (the exact same bytes the Z80 would read on the original Ms. Pac-Man board), determines what a Z80 would do, and simulates the instruction.
Some classic video games may have used an x86 processor, similar to the one currently used in most PCs. Even when selecting such a game in MAME, the emulator would still read the bytes found in that game and interpret each one the way the x86 processor would. In other words, the emulator would not take advantage of the fact that the PC and the emulated game use a similar processor. It would perform the same steps to emulate any game, no matter whether the PC on which MAME is running shares any similarity with the original game.
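The read-a-byte, decide, simulate loop described above can be sketched in a few lines of Python. This is a toy covering only four Z80 opcodes (NOP, LD A,n, INC A, HALT) and it ignores flags and timing; `run_z80_subset` is a hypothetical name and has nothing to do with MAME's actual code.

```python
def run_z80_subset(program):
    """Interpret a few Z80 opcodes and return the final value of register A."""
    a = 0    # emulated accumulator, held in an ordinary host variable
    pc = 0   # emulated program counter
    while pc < len(program):
        opcode = program[pc]
        pc += 1
        if opcode == 0x00:      # NOP
            pass
        elif opcode == 0x3E:    # LD A,n : the operand is the next byte
            a = program[pc]
            pc += 1
        elif opcode == 0x3C:    # INC A (flags are not modelled in this sketch)
            a = (a + 1) & 0xFF
        elif opcode == 0x76:    # HALT
            break
        else:
            raise NotImplementedError(f"opcode {opcode:#04x} not handled")
    return a
```

For example, `run_z80_subset(bytes([0x3E, 0x41, 0x3C, 0x76]))` loads 0x41 into A, increments it, and halts. The same loop runs unchanged whatever processor the host machine has.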
You are asking how an interpreter could execute code? The interpreter is a program (just software, not a physical processor). The wording is indeed confusing. For the sentence to make sense literally, all of the following conditions would have to hold:
1 - the program to interpret is already in binary, in a machine language that can be executed directly by the processor used in your PC
2 - the program location, the exact address used, is the same as the location that you can reserve in your PC
3 - any library and any I/O occupy the exact same address
When all these conditions are met, the interpreter could just tell the processor in your PC to stop executing the interpreter's own code and instead "jump" into the code of the program to be interpreted. Anyone could then say: it is not an interpreter, it is just a launcher.
Maybe such an interpreter, which does not actually interpret but lets your processor do the real job, is still useful in the following way: it could let your processor perform most of the work, but request an exception whenever the code being interpreted executes certain types of instruction. For example, let the code run, but generate a "general protection fault", "trap" or "exception" when it tries to execute any variant of "IN" or "OUT". The interpreter would take note of the I/O port being written, or it would choose a value to return instead of allowing a real I/O port to be read. The interpreter would then get the processor to "jump" back into the interpreted program at the location just after the "IN" or "OUT" instruction.
Normally, an interpreter reads a text file containing the original source code (ASCII or Unicode), determines line by line, word by word, what a compiler would do, and then simulates the task on the fly. Where a compiler would need to read many lines to fully understand the current task, the interpreter also needs to read all of those lines before it can simulate the same task.
A big advantage of an interpreter is that a bug in the interpreted program is much less likely to crash the machine. Because every instruction is simulated, the interpreter can check each operation before carrying it out, which contains most bugs and malicious code (though the interpreter itself can still have bugs). That was a big advantage at a time when computers needed to reboot after encountering a bug, and a reboot took 10 minutes or more.
Today, with fast SSDs that reboot in 5 seconds and with reliable operating systems that can trap an error in one process and close that process without affecting the stability of the machine, there is less incentive to prefer a slow interpreter over a much faster JIT or an even faster binary executable.

How to detect ECC errors in memory testing under the UEFI shell

I wrote an EFI binary to test physical DIMMs under the UEFI shell. The process is quite simple: first write a test pattern to a physical address, then read it back and compare it with the original pattern.
However, the DIMMs might encounter correctable or uncorrectable errors. Normally, correctable ECC errors are corrected automatically by hardware and the BIOS handles them (logs the error and clears the error registers), while uncorrectable errors typically cause the BIOS to issue an NMI, after which the system hangs.
The problem is that my test program doesn't know an error happened: correctable errors are masked by the BIOS firmware and uncorrectable errors make the system hang...
Is there any way to let the test program know that an ECC error happened? I would appreciate any advice you may have. Thanks!
I believe that to do this your program will need complete control of the hardware. That means it needs to boot fully and tear down the EFI environment.
Once you have done that, your program can handle all of the interrupts and CPU registers that indicate ECC errors.
When finished, your program would do a soft reset, which would boot the system back into EFI.

Cuda profiler shows strange gaps?

I am trying to figure out what a profile result means before I start to optimize. I am very new to CUDA and to profiling in general, and I am confused by the result.
Specifically, I want to know what is happening during seemingly unoccupied chunks of computation. When I look from top to bottom at the CPU and GPU, there appears to be nothing happening during large portions of the code. These look like columns with nothing in Thread1 and nothing in GeForce. Is this normal? What's happening here?
The run was done on a multicore machine under no load with nvprof. The GPU code was compiled with -arch=sm_20 -m32 -g -G for CUDA 5.
The error here was to profile the code in debug mode (-G compiler flag: "Generate debug information for device code"). This deeply changes the behavior of the program, so a debug build should not be used to profile and optimize one's code.
One other thing: thorough documentation of nvcc's debug mode is hard to find. nvcc probably dumps the registers/shared memory in global memory for easier host access and debugging, which may in turn hide problems such as race conditions in shared memory (cf. the discussion here: https://stackoverflow.com/a/10726970/1043187). Thus, tools such as cuda-memcheck --tool racecheck should be used in release mode too.

How does one use dynamic recompilation?

It came to my attention that some emulators and virtual machines use dynamic recompilation. How do they do that? In C I know how to call a function in RAM using typecasting (although I have never tried), but how does one read opcodes and generate code for them? Does the person need to have premade assembly chunks and copy/batch them together? Is the assembly written in C? If so, how do you find the length of the code? How do you account for system interrupts?
-edit-
System interrupts and how to (re)compile the data are what I am most interested in. After more research I heard of one person (no source available) who used JS: read the machine code, output JS source and use eval to 'compile' the JS source. Interesting.
It sounds like I MUST have knowledge of the target platform's machine code to dynamically recompile.
Yes, absolutely. That is why parts of the Java Virtual Machine must be rewritten (namely, the JIT) for every architecture.
When you write a virtual machine, you have a particular host-architecture in mind, and a particular guest-architecture. A portable VM is better called an emulator, since you would be emulating every instruction of the guest-architecture (guest-registers would be represented as host-variables, rather than host-registers).
When the guest and host architectures are the same, as with VMware, there are a ton of (pretty neat) optimizations you can do to speed up the virtualization. Today we are at the point where this type of virtual machine is BARELY slower than running directly on the processor. Of course, it is extremely architecture-dependent: you would probably be better off rewriting most of VMware from scratch than trying to port it.
It's quite possible - though obviously not trivial - to disassemble code from a memory pointer, optimize the code in some way, and then write back the optimized code - either to the original location or to a new location with a jump patched into the original location.
Of course, emulators and VMs don't have to RE-write, they can do this at load-time.
This is a wide-open question; I am not sure where you want to go with it. Wikipedia covers the generic topic with a generic answer. The native code being emulated or virtualized is replaced with native code for the host. The more the code is run, the more of it is replaced.
I think you need to do a few things. First, decide whether you are talking about an emulator or a virtual machine like VMware or VirtualBox. In an emulator, the processor and hardware are emulated in software, so the next instruction is read by the emulator, the opcode is pulled apart by code, and you determine what to do with it. I have been doing some 6502 emulation and static binary translation, which is dynamic recompilation but preprocessed instead of done in real time.
So your emulator may take an LDA #10 (load A with immediate). The emulator sees the load-A-immediate instruction, knows it has to read the next byte, which is the immediate, has a variable in its code for the A register, and puts the immediate value in that variable. Before completing the instruction the emulator needs to update the flags: in this case the Z (zero) flag is cleared, the N flag is cleared, and C and V are untouched.
But what if the next instruction was a load X immediate? No big deal, right? Well, the load X will also modify the Z and N flags, so the next time you execute the load A instruction you may figure out that you don't have to compute the flags because they will be overwritten; it is dead code in the emulation.
You can continue with this kind of thinking. Say you see code that copies the X register to the A register, pushes the A register on the stack, then copies the Y register to the A register and pushes it on the stack; you could replace that chunk with simply pushing the X and Y registers on the stack. Or you may see a couple of add-with-carry instructions chained together to perform a 16-bit add and store the result in adjacent memory locations. Basically you are looking for operations that the emulated processor couldn't do but that are easy to do in the emulation.
Static binary translation, which I suggest you look into before dynamic recompilation, performs this analysis and translation statically, that is, before you run the code. Instead of emulating, you translate the opcodes to C, for example, and remove as much dead code as you can (a nice feature is that the C compiler can remove more dead code for you).
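As a rough illustration of the dead-flag idea, here is a minimal sketch of a static translator for straight-line 6502 code. It emits Python source rather than C, handles only LDA #imm (0xA9) and LDX #imm (0xA2), and the names `translate` and `run` are made up for this example. The flag update is dropped only when the very next instruction overwrites Z and N, which is a deliberately conservative rule.

```python
LOAD_IMMEDIATE = {0xA9: "a", 0xA2: "x"}  # 6502 LDA #imm and LDX #imm


def translate(code):
    """Statically translate a straight-line 6502 snippet into Python source."""
    lines = []
    pc = 0
    while pc < len(code):
        opcode = code[pc]
        if opcode not in LOAD_IMMEDIATE:
            raise NotImplementedError(f"opcode {opcode:#04x} not handled")
        register, value = LOAD_IMMEDIATE[opcode], code[pc + 1]
        pc += 2
        lines.append(f"{register} = {value}")
        # Z and N are dead if the very next instruction overwrites them.
        next_sets_flags = pc < len(code) and code[pc] in LOAD_IMMEDIATE
        if not next_sets_flags:
            lines.append(f"z = {register} == 0")
            lines.append(f"n = {register} >= 0x80")
    return "\n".join(lines)


def run(code):
    """Execute the translated source and return the resulting state."""
    state = {"a": 0, "x": 0, "z": False, "n": False}
    exec(translate(code), {}, state)
    return state
```

For `LDA #10` followed by `LDX #$80`, the translated source contains a flag update only after the second load, yet the final register and flag values match what an instruction-by-instruction emulator would produce.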
Once the concepts of emulation and translation are understood, you can try to do it dynamically; it is certainly not trivial. I would suggest again trying a static translation of a binary, this time to the machine code of the target processor, which is a good exercise. I wouldn't attempt dynamic run-time optimizations until I had succeeded in performing them statically against the binary.
Virtualization is a different story: you are talking about running the same processor on the same processor, so x86 on x86 for example. The beauty here is that, on reasonably recent x86 processors, you can take the program being virtualized and run the actual opcodes on the actual processor, with no emulation. You set up traps built into the processor to catch things. Loading values into AX, adding BX and so on all happen in real time on the processor. When the program wants to read or write memory, what happens depends on your trap mechanism: if the addresses are within the virtual machine's RAM space, there are no traps. But let's say the program writes to an address which is the virtualized UART: you have the processor trap that, and then VMware or whatever decodes the write and emulates it by talking to a real serial port. That one instruction, though, wasn't real time; it took quite a while to execute. What you could do, if you chose to, is replace that instruction or set of instructions that writes a value to the virtualized serial port, and maybe have it write to a different address that could be the real serial port, or some other location that is not going to cause a fault and force the VM manager to emulate the instruction. Or add some code in the virtual memory space that performs a write to the UART without a trap, and have the original code branch to this UART write routine instead. The next time you hit that chunk of code it runs in real time.
Another thing you can do is, for example, emulate and, as you go, translate to a virtual intermediate bytecode, like LLVM's. From there you can translate from the intermediate machine to the native machine, eventually replacing large sections of the program if not the whole thing. You still have to deal with the peripherals and I/O.
Here's an explanation of how dynamic recompilation is done for the 'Rubinius' Ruby interpreter:
http://www.engineyard.com/blog/2010/making-ruby-fast-the-rubinius-jit/
This approach is typically used by environments with an intermediate bytecode representation (like Java or .NET). The bytecode contains enough "high-level" structure (high-level in the sense of higher level than machine code) that the VM can take chunks of the bytecode and replace them with a compiled memory block. The VM typically decides which parts get compiled by counting how many times the code has already been interpreted, since compilation itself is a complex and time-consuming process. So it is useful to compile only the parts that are executed many times.
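The counting scheme can be sketched with a toy stack bytecode. Everything here is an assumption for illustration: the four opcodes, the `HOT_THRESHOLD` value and the `CountingVM` class are invented, and "compiling" means building a Python lambda rather than emitting machine code as a real JIT would.

```python
HOT_THRESHOLD = 3  # hypothetical; real VMs tune this value


def interpret(bytecode, x):
    """Slow path: walk the bytecode with an explicit operand stack."""
    stack = []
    for op, *operand in bytecode:
        if op == "arg":
            stack.append(x)
        elif op == "push":
            stack.append(operand[0])
        elif op == "add":
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif op == "mul":
            right, left = stack.pop(), stack.pop()
            stack.append(left * right)
        else:
            raise ValueError(f"unknown op: {op}")
    return stack.pop()


def compile_bytecode(bytecode):
    """'Compile' by building one Python expression, then a host function."""
    stack = []
    for op, *operand in bytecode:
        if op == "arg":
            stack.append("x")
        elif op == "push":
            stack.append(repr(operand[0]))
        elif op in ("add", "mul"):
            right, left = stack.pop(), stack.pop()
            symbol = "+" if op == "add" else "*"
            stack.append(f"({left} {symbol} {right})")
        else:
            raise ValueError(f"unknown op: {op}")
    return eval(f"lambda x: {stack.pop()}")


class CountingVM:
    """Interprets a routine until it is hot, then switches to compiled code."""

    def __init__(self, bytecode):
        self.bytecode = bytecode
        self.calls = 0
        self.compiled = None

    def call(self, x):
        if self.compiled is not None:
            return self.compiled(x)       # fast path
        self.calls += 1
        if self.calls >= HOT_THRESHOLD:   # code is hot: compile for next time
            self.compiled = compile_bytecode(self.bytecode)
        return interpret(self.bytecode, x)
```

The first calls pay only the interpretation cost. Compilation is paid once, and only for code that has shown it runs often.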
but how does one read opcodes and generate code for it?
The layout of the opcodes is defined by the specification of the VM, so the VM opens the program file and interprets it according to the spec.
Does the person need to have premade assembly chunks and copy/batch them together? is the assembly written in C?
This process is an implementation detail of the VM. Typically there is an embedded compiler that is able to transform the VM opcode stream into machine code.
How do you account for system interrupts?
Very simple: you don't. The code in the VM can't interact with real hardware. The VM interacts with the OS and passes OS events on to the code by jumping to or calling specific parts of the interpreted code. Every event in the code or from the OS must pass through the VM.
Hardware virtualization products can also use some kind of JIT. A typical use case in the x86 world is the translation of 16-bit real mode code to 32- or 64-bit protected mode code, so as not to be forced to emulate a CPU in real mode. Also, a software-only VM replaces jump instructions in the executing code with jumps into the VM control software, which at each branch scans the following code path for jump instructions and replaces them before it jumps to the real code destination. But I doubt whether the jump replacement qualifies as JIT compilation.
IIS does this by shadow copying: after compilation it copies the assemblies to a temporary location and runs them from there.
Imagine that the user changes some files. Then IIS will recompile the assemblies in the following steps:
Recompile (all requests are handled by the old code)
Copy the new assemblies (all requests are handled by the old code)
All new requests are handled by the new code, and all existing requests by the old code.
I hope this helps.
A virtual machine loads "byte code" or "intermediate language" rather than machine code, so I suppose that it just recompiles the byte code more efficiently once it has more runtime data.
http://en.wikipedia.org/wiki/Just-in-time_compilation