I have added a variable to struct thread_info to count a certain event.
This is done in the guest OS.
While the virtual machine is running, I read this variable from my host every now and then.
I have observed that sometimes I get the expected value, but sometimes I read junk values. I presume that GCC is optimizing my variable, and the memory I am reading is in a garbage state.
I want to know of a possible way to prevent this.
Turning off GCC optimization for the kernel is out of the question, because my objective is to speed up the virtual machine based on the event I have counted.
#pragma optimize("",off)
makes it less efficient, because I would then have to break my event-counting code (which is just two lines) out into a function, and the event I am counting occurs very often.
Is there a #pragma technique I can use?
Will making my variable volatile help the cause?
Thanks
Making the variables volatile will prevent GCC from optimizing them out. You don't need to disable optimization altogether.
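For example, a minimal sketch (the field and wrapper names here are hypothetical, not your actual thread_info layout):

    /* Sketch: a counter the guest increments and the host polls.
     * 'volatile' tells GCC that every access must really touch memory,
     * so the increment is not kept in a register or optimized away. */
    struct thread_info_counters {
        volatile unsigned long event_count;
    };

    static struct thread_info_counters ti;

    static inline void count_event(void)
    {
        ti.event_count++;   /* compiles to a real load/modify/store */
    }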
However, you might need to deal with the race condition that results from reading the struct while the kernel may still be updating it. I don't know how you'd do that in a VM context, though. Maybe there's some special mechanism for guest-host communication provided by the hypervisor you're using; VMware, for example, has VMCI.
Related
My understanding of GPUs is that they handle branches by executing all paths while suspending the instances that are not supposed to execute a given path. This works well for if/then/else constructs and for loops (an instance that has exited the loop can be suspended until all instances have exited).
This flat out does not work if the branch is indirect. Yet modern GPUs (Fermi and beyond for NVIDIA; not sure when it appeared for AMD, R600?) claim to support indirect branches (function pointers, virtual dispatch, ...).
The question is: what kind of magic is going on in the chip to make this happen?
According to the CUDA programming guide there are some strong restrictions on virtual functions and dynamic dispatch.
See http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#functions for more information. Another interesting article about how code is mapped to the GPU hardware is http://yosefk.com/blog/simd-simt-smt-parallelism-in-nvidia-gpus.html .
GPUs are like CPUs in regards to indirect branches. They both have an IP (instruction pointer) that points to physical memory. This IP is incremented for each hardware instruction that gets executed. An indirect branch just sets the IP to the new location. How this is done is a little bit more complicated. I will use PTX for Nvidia and GCN Assembly for AMD.
An AMD GCN GPU can have its IP simply set from any register (for example, "s_branch S8"); the IP can be set to any value. In fact, on an AMD GPU it's possible to write to program memory from within a kernel and then set the IP to execute it (self-modifying code).
On NVidia's PTX there is no indirect jump. I have been waiting for real hardware indirect branch support since 2009. The most current version of the PTX ISA 4.3 still does not have indirect branching. In the current PTX ISA manual, http://docs.nvidia.com/cuda/parallel-thread-execution, it still reads that "Indirect branch is currently unimplemented".
However, "indirect calls" are supported via jump tables. These are slightly different then indirect branches but do the same thing. I did some testing with jump tables in the past and the performance was not great. I believe the way this works is that the kernel is lunched with a table of already known call locations. Then when it runs across a "call %r10(params)" (something like that) it saves the current IP and then references the jump table by an index and then sets the IP to that address. I'm not 100% sure but its something like that.
Like you said, besides branching, both AMD and NVIDIA GPUs also allow instructions to be executed but ignored: the instruction runs, but the output is not written. This is another way of handling an if/then/else, as some cores are ignored while others run. It does not really have much to do with branching; it's a trick to avoid time-consuming branches. Some CPUs, like the Intel Itanium, also do this.
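As a rough scalar illustration of that execute-but-ignore idea (plain C, not GPU code), a divergent branch can be flattened so both sides are computed and the unwanted result is simply discarded:

    /* Branchy version: only one side executes. */
    int pick_branch(int x) {
        if (x > 0)
            return x * 3;
        else
            return x + 7;
    }

    /* "Predicated" version: both sides are computed and the unwanted
     * result is thrown away by the select. On a GPU the hardware
     * execution mask plays the role of 'cond'. */
    int pick_predicated(int x) {
        int cond     = (x > 0);
        int then_val = x * 3;
        int else_val = x + 7;
        return cond ? then_val : else_val;
    }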
You can also try searching under these other names: indirect calls, indirect branches, dynamic branching, virtual functions, function pointers, or jump tables.
Hope this helps. Sorry I went on so long.
I am profiling a userland application on NetBSD with gprof and seeing clock_gettime using upwards of 30% of the cycles. Gprof does not show where it is getting called from (it shows some function which clearly does not call clock_gettime).
The application uses third-party code, including libevent 1.4 (which appears to use clock_gettime). I looked into removing the call from there but could not determine much.
I don't understand why it would take that much time. Any input will be appreciated. I also saw gettimeofday taking a lot of cycles. In general, why would getting the time involve so many processing cycles?
Is there a way to optimize clock_gettime(), or can we use some other call?
Is it possible that gcc itself adds this call to the code when it is compiled with -pg for profiling purposes?
Thanks for any answers
It's all relative to whatever else your program is doing, and keep in mind that if you're doing any I/O, the actual CPU time your program uses may be small, and gprof doesn't see anything else.
So if some calls to timing routines get stuck in there, and they are called often enough, sure they can show a high percent.
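If you want a ballpark figure for the raw per-call cost on your machine, a crude loop like this (just a sketch, nothing gprof-specific) is usually enough:

    #include <stdio.h>
    #include <time.h>

    /* Crude micro-benchmark: time N calls to clock_gettime() with itself. */
    int main(void)
    {
        enum { N = 1000000 };
        struct timespec start, end, tmp;

        clock_gettime(CLOCK_MONOTONIC, &start);
        for (int i = 0; i < N; i++)
            clock_gettime(CLOCK_MONOTONIC, &tmp);
        clock_gettime(CLOCK_MONOTONIC, &end);

        double ns = (end.tv_sec - start.tv_sec) * 1e9 +
                    (end.tv_nsec - start.tv_nsec);
        printf("%.1f ns per clock_gettime() call\n", ns / N);
        return 0;
    }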
Why doesn't gprof show where they're being called from?
For routines compiled with -pg, it tries to figure out who the caller is when any routine is entered.
It tries, but that doesn't mean it succeeds.
Anyway, that's gprof.
I am using an MSP430F5418 with IAR Embedded Workbench 5.10.
A graphical LCD (ST7565R) is connected to the MSP430 through SPI.
The MSP430, as SPI master, uses 8-bit, MSB-first mode with SMCLK.
Normally we have to check the busy bit before transferring a byte using SPI, right?
But for my case, even if I send data continuously without checking the busy bit, it works fine and I can view the display data correctly.
Can anybody explain why it is working?
Is there any need to check the ready bit, or is it safe without it?
Thank you,
Your software is probably slow enough that the SPI transaction completes every time. If you can verify that this is the case, and always will be, then you can argue not to add even more code to do the check. On the other hand, removing the code that does the check might speed up your routine just enough to be too fast for the SPI interface and cause collisions.
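For reference, the check itself is tiny; on the F5418's USCI (assuming USCI_B0 and the register names from the usual msp430.h headers) it would look something like this:

    #include <msp430.h>

    /* Assumed USCI_B0 in SPI mode; adjust to the module you actually use. */
    static void spi_send(unsigned char byte)
    {
        while (!(UCB0IFG & UCTXIFG))   /* wait until the TX buffer is free */
            ;
        UCB0TXBUF = byte;              /* then hand the byte to the USCI   */
    }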
In general you should make sure one thing finishes before another starts, and in general the way you make sure can be through hardware features, analysis, or experiments. If the hardware has the feature and you somehow determine you don't need the check, it is still a good idea to do a performance test with and without the check. If performance is not critical, or there isn't much difference, it is still probably safer to leave the check in: somewhere down the road, even if your code is heavily commented with warnings, a compiler or code change might be just enough to make it stop working without the check.
It came to my attention that some emulators and virtual machines use dynamic recompilation. How do they do that? In C I know how to call a function in RAM using typecasting (although I never tried it), but how does one read opcodes and generate code for them? Does the person need to have premade assembly chunks and copy/batch them together? Is the assembly written in C? If so, how do you find the length of the code? How do you account for system interrupts?
-edit-
System interrupts and how to (re)compile the data are what I am most interested in. Upon more research I heard of one person (no source available) who used JS: read the machine code, output JS source, and use eval to 'compile' that JS source. Interesting.
It sounds like I MUST have knowledge of the target platform's machine code to dynamically recompile.
Yes, absolutely. That is why parts of the Java Virtual Machine must be rewritten (namely, the JIT) for every architecture.
When you write a virtual machine, you have a particular host-architecture in mind, and a particular guest-architecture. A portable VM is better called an emulator, since you would be emulating every instruction of the guest-architecture (guest-registers would be represented as host-variables, rather than host-registers).
When the guest- and host-architectures are the same, like VMWare, there are a ton of (pretty neat) optimizations you can do to speed up the virtualization - today we are at the point that this type of virtual machine is BARELY slower than running directly on the processor. Of course, it is extremely architecture-dependent - you would probably be better off rewriting most of VMWare from scratch than trying to port it.
It's quite possible - though obviously not trivial - to disassemble code from a memory pointer, optimize the code in some way, and then write back the optimized code - either to the original location or to a new location with a jump patched into the original location.
Of course, emulators and VMs don't have to RE-write, they can do this at load-time.
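As a minimal sketch of the "write code to memory and call it through a pointer cast" idea from the question (POSIX mmap assumed; the byte string is x86-64 machine code for "mov eax, 42; ret"):

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    /* x86-64: mov eax, 42 ; ret */
    static const unsigned char code[] = { 0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3 };

    int main(void)
    {
        /* Get a page we are allowed to execute, copy the opcodes in,
         * then call it through a function-pointer cast. */
        void *buf = mmap(NULL, sizeof code, PROT_READ | PROT_WRITE | PROT_EXEC,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
            return 1;
        memcpy(buf, code, sizeof code);

        int (*fn)(void) = (int (*)(void))buf;
        printf("generated code returned %d\n", fn());

        munmap(buf, sizeof code);
        return 0;
    }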
This is a wide-open question, and I'm not sure where you want to go with it. Wikipedia covers the generic topic with a generic answer: the code being emulated or virtualized is replaced with native host code, and the more the code is run, the more of it gets replaced.
I think you need to do a few things. First, decide whether you are talking about emulation or a virtual machine like VMware or VirtualBox. In an emulation the processor and hardware are emulated in software: the next instruction is read by the emulator, the opcode is pulled apart by code, and you determine what to do with it. I have been doing some 6502 emulation and static binary translation, which is like dynamic recompilation but preprocessed instead of done in real time.

So your emulator may take an LDA #10, load A with immediate. The emulator sees the load-A-immediate instruction and knows it has to read the next byte, which is the immediate; the emulator has a variable in its code for the A register and puts the immediate value in that variable. Before completing the instruction the emulator needs to update the flags: in this case the Zero flag is clear, the N flag is clear, and C and V are untouched.

But what if the next instruction is a load X immediate? No big deal, right? Well, the load X will also modify the Z and N flags, so the next time you execute the load A instruction you may figure out that you don't have to compute the flags, because they will be destroyed anyway; it is dead code in the emulation. You can continue with this kind of thinking: say you see code that copies the X register to the A register, pushes the A register on the stack, then copies the Y register to the A register and pushes that on the stack; you could replace that chunk with simply pushing the X and Y registers on the stack. Or you may see a couple of add-with-carries chained together to perform a 16-bit add and store the result in adjacent memory locations. Basically you look for operations that the processor being emulated couldn't do directly but that are easy to do in the emulation.

Static binary translation, which I suggest you look into before dynamic recompilation, performs this analysis and translation in a static manner, i.e. before you run the code. Instead of emulating, you translate the opcodes to C, for example, and remove as much dead code as you can (a nice feature is that the C compiler can remove more dead code for you).
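To make the LDA #10 example above concrete, the interpreting emulator's inner step for those two instructions might look roughly like this (a sketch in C; the struct layout is invented for illustration):

    #include <stdint.h>

    /* Minimal 6502-ish CPU state for the example. */
    struct cpu {
        uint8_t  a, x, y;     /* registers       */
        uint16_t pc;          /* program counter */
        uint8_t  flag_z;      /* zero flag       */
        uint8_t  flag_n;      /* negative flag   */
        uint8_t  mem[65536];  /* guest memory    */
    };

    /* One interpreted step, handling only LDA #imm (0xA9) and LDX #imm (0xA2). */
    static void step(struct cpu *c)
    {
        uint8_t opcode = c->mem[c->pc++];
        switch (opcode) {
        case 0xA9:                           /* LDA #imm */
            c->a = c->mem[c->pc++];
            c->flag_z = (c->a == 0);         /* flags recomputed every time...   */
            c->flag_n = (c->a & 0x80) != 0;  /* ...even if the next LDX kills them */
            break;
        case 0xA2:                           /* LDX #imm */
            c->x = c->mem[c->pc++];
            c->flag_z = (c->x == 0);
            c->flag_n = (c->x & 0x80) != 0;
            break;
        }
    }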
Once the concepts of emulation and translation are understood, then you can try to do it dynamically; it is certainly not trivial. I would suggest first doing a static translation of a binary to the machine code of the target processor, which is a good exercise. I wouldn't attempt dynamic run-time optimizations until I had succeeded in performing them statically against a binary.
Virtualization is a different story: you are talking about running code for a processor on the same kind of processor, so x86 on x86 for example. The beauty here is that, using non-ancient x86 processors, you can take the program being virtualized and run the actual opcodes on the actual processor, with no emulation. You set up traps built into the processor to catch things, so loading values into AX and adding BX, etc., all happens in real time on the processor. When AX wants to read or write memory, it depends on your trap mechanism: if the address is within the virtual machine's RAM space, there is no trap. But say the program writes to an address which is the virtualized UART: you have the processor trap that, and then VMware (or whatever) decodes the write and emulates it by talking to a real serial port. That one instruction, though, wasn't real time; it took quite a while to execute.

What you could do, if you chose to, is replace the instruction (or set of instructions) that writes a value to the virtualized serial port and have it write to a different address instead, which could be the real serial port or some other location that will not cause a fault forcing the VM manager to emulate the instruction. Or add some code in the virtual memory space that performs a write to the UART without a trap, and have the original code branch to this UART-write routine instead. The next time you hit that chunk of code it runs in real time.
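Conceptually, the handler for such a trapped store is just an address-range dispatch; very schematically (no real hypervisor API, all addresses invented):

    #include <stdint.h>
    #include <stdio.h>

    #define RAM_BASE   0x00000000u
    #define RAM_SIZE   0x04000000u          /* made-up guest RAM size   */
    #define UART_ADDR  0x10000000u          /* made-up virtual UART reg */

    /* Called when a guest store faults: decide RAM vs. emulated device. */
    void handle_guest_store(uint32_t addr, uint8_t value, uint8_t *guest_ram)
    {
        if (addr - RAM_BASE < RAM_SIZE) {
            guest_ram[addr - RAM_BASE] = value;   /* plain RAM: no emulation      */
        } else if (addr == UART_ADDR) {
            putchar(value);                       /* forward to the real console  */
        } else {
            /* unknown device: ignore or raise a guest exception */
        }
    }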
Another thing you can do is, for example, emulate and, as you go, translate to a virtual intermediate bytecode, like LLVM's. From there you can translate from the intermediate machine to the native machine, eventually replacing large sections of the program if not the whole thing. You still have to deal with the peripherals and I/O.
Here's an explanation of how dynamic recompilation is done for the 'Rubinius' Ruby interpreter:
http://www.engineyard.com/blog/2010/making-ruby-fast-the-rubinius-jit/
This approach is typically used by environments with an intermediate byte-code representation (like Java or .NET). The byte code contains enough "high-level" structure (high level in the sense of higher level than machine code) that the VM can take chunks of the byte code and replace them with a compiled memory block. The VM typically decides which parts get compiled by counting how many times the code has already been interpreted, since compilation itself is a complex and time-consuming process; so it is useful to compile only the parts which get executed many times.
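Schematically, the counting part is simple; in the C sketch below, interpret_block and jit_compile are hypothetical stand-ins for the VM's interpreter and compiler:

    #define HOT_THRESHOLD 1000   /* made-up tuning value */

    struct block {
        int exec_count;          /* how often this byte-code block has run   */
        void (*compiled)(void);  /* native version, NULL until JIT-compiled  */
        /* ... byte code, length, etc. ... */
    };

    /* Hypothetical helpers provided elsewhere by the VM. */
    void interpret_block(struct block *b);
    void (*jit_compile(struct block *b))(void);

    void execute_block(struct block *b)
    {
        if (b->compiled) {                     /* already compiled: run native code */
            b->compiled();
            return;
        }
        if (++b->exec_count >= HOT_THRESHOLD)
            b->compiled = jit_compile(b);      /* hot enough: compile it now */
        interpret_block(b);
    }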
but how does one read opcodes and generate code for it?
The scheme of the opcodes is defined by the specification of the VM, so the VM opens the program file, and interprets it according to the spec.
Does the person need to have premade assembly chunks and copy/batch them together? is the assembly written in C?
This process is an implementation detail of the VM; typically there is an embedded compiler which is capable of transforming the VM opcode stream into machine code.
How do you account for system interrupts?
Very simply: you don't. Code running in the VM can't interact with real hardware directly. The VM interacts with the OS and forwards OS events to the code by jumping to or calling specific parts of the interpreted code. Every event in the code, or from the OS, must pass through the VM.
Hardware-virtualization products can also use some kind of JIT. A typical use case in the x86 world is the translation of 16-bit real-mode code to 32- or 64-bit protected-mode code, so that a CPU in real mode does not have to be emulated. A software-only VM may also replace jump instructions in the executing code with jumps into the VM control software, which at each branch scans the following code path for jump instructions and replaces them before jumping to the real code destination. But I doubt that this jump replacement qualifies as JIT compilation.
IIS does this by shadow copying: after compilation it copies the assemblies to some temporary place and runs them from there.
Imagine that a user changes some files. Then IIS recompiles the assemblies in these steps:
1) Recompile (all requests handled by the old code)
2) Copy the new assemblies (all requests still handled by the old code)
3) All new requests are handled by the new code; requests already in progress are handled by the old code.
I hope this is helpful.
A virtual machine loads "byte code" or "intermediate language" rather than machine code; therefore, I suppose, it just recompiles the byte code more efficiently once it has more runtime data.
http://en.wikipedia.org/wiki/Just-in-time_compilation
In our embedded system (using a PowerPC processor), we want to disable the processor cache. What steps do we need to take?
To clarify a bit, the application in question must have as constant a speed of execution as we can make it.
Variability in executing the same code path is not acceptable. This is the reason to turn off the cache.
I'm kind of late to the question, and it's been a while since I did all the low-level processor init code on PPCs, but I seem to remember the cache and MMU being pretty tightly coupled (one had to be enabled to enable the other), and I think that in the MMU page tables you could define the cacheable attribute.
So my point is this: if there's a certain subset of code that must run in deterministic time, maybe you locate that code (via a linker command file) in a region of memory that is defined as non-cacheable in the page tables? That way all the code that can/should benefit from the cache does, and the (hopefully) subset of code that shouldn't, doesn't.
I'd handle it this way anyway, so that later, if you want to enable caching for part of the system, you just need to flip a few bits in the MMU page tables, instead of (re-)writing the init code to set up all the page tables & caching.
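On the code side, with GCC the placement part can be as simple as the sketch below; the ".nocache" section name and the matching cache-inhibited region are things you would define yourself in the linker command file and page tables:

    /* Put the timing-critical routine into its own output section; the
     * linker script then places .nocache into a region whose page-table
     * (or BAT/TLB) entry is marked cache-inhibited. */
    __attribute__((section(".nocache"), noinline))
    int critical_step(int x)
    {
        /* deterministic work here */
        return x * 2 + 1;
    }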
From the E600 reference manual:
The HID0 special-purpose register contains several bits that invalidate, disable, and lock the instruction and data caches.
You should use HID0[DCE] = 0 to disable the data cache.
You should use HID0[ICE] = 0 to disable the instruction cache.
Note that at power up, both caches are disabled.
You will need to write this in assembly code.
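If you wrap that assembly in C, a GCC inline-asm sketch might look like this (the SPR number and bit masks are assumptions taken from the e600/74xx documentation; verify them against your core's manual):

    /* Sketch using GCC inline assembly for HID0 access. */
    #define SPR_HID0  1008
    #define HID0_ICE  0x00008000u   /* instruction cache enable bit */
    #define HID0_DCE  0x00004000u   /* data cache enable bit        */

    static inline unsigned long read_hid0(void)
    {
        unsigned long val;
        __asm__ __volatile__("mfspr %0, %1" : "=r"(val) : "i"(SPR_HID0));
        return val;
    }

    static inline void write_hid0(unsigned long val)
    {
        __asm__ __volatile__("mtspr %0, %1" : : "i"(SPR_HID0), "r"(val));
        __asm__ __volatile__("isync");
    }

    void disable_caches(void)
    {
        /* A real implementation should flush/invalidate the data cache
         * before clearing DCE so that dirty lines are not lost. */
        write_hid0(read_hid0() & ~(HID0_ICE | HID0_DCE));
    }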
Perhaps you don't want to globally disable cache, you only want to disable it for a particular address range?
On some processors you can configure TLB (translation lookaside buffer) entries for address ranges such that each range could have caching enabled or disabled. This way you can disable caching for memory mapped I/O, and still leave caching on for the main block of RAM.
The only PowerPC I've done this on was a PowerPC 440EP (from IBM, then AMCC), so I don't know if all PowerPCs work the same way.
What kind of PPC core is it? Cache control is very different between cores from different vendors. Also, disabling the cache is in general considered a really bad thing to do to the machine; performance becomes so crawlingly slow that you would do as well with an old 8-bit processor (exaggerating a bit). Some ARM variants have TCMs, tightly-coupled memories, that work instead of caches, but I am not aware of any PPC variant with that facility.
Maybe a better solution is to keep Level 1 caches active, and use the on-chip L2 caches as statically mapped RAM instead? That is common on modern PowerQUICC devices, at least.
Turning off the cache will do you no good at all. Your execution speed will drop by an order of magnitude. You would never ship a system like this, so its performance under these conditions is of no interest.
To achieve a steady execution speed, consider one of these approaches:
1) Lock some or all of the cache. All current PowerPC chips from Freescale, IBM, and AMCC offer this feature.
2) If it's a Freescale chip with L2 cache, consider mapping part of that cache as on-chip memory.