About the necessity of GOT & PLT structures in ELF header - elf

In some architectures (like x86_64) where it's possible to reference data (mov e.g.) and code (jmp, call) using PC(RIP)-relative addressing mode, is there really a technical reason justifying the need of such structures (got, plt) ?
I mean, if I want to mov a global data (for instance) to a register, I could make the following instruction (standard PIE) :
mov rax,QWORD PTR [rip+0x2009db]
mov eax,DWORD PTR [rax]
(where 0x2009db is the offset between rip and the right entry in the got containing the symbol address)
And why couldn't we do something like that :
mov rax, rip+0xYYYYYY
mov eax,DWORD PTR [rax]
(0xYYYYYYY being the direct delta between the RIP value and the symbol (a global variable e.g.))
I'm not used to do ASM, so my example is perhaps false. But my idea is : why not just simply compute the absolute address of the symbol based on RIP, put it in EAX, and then access its content. If the instruction set allows to do whatever we want with relative addressing, why use such structures (got, plt) ?
The same question would apply for call/jmp instructions.
Is it because the instruction set does not allow it ?
Is it because the offset value cannot cover the entire address space ? But.. is it important ? Since the structure of a section is maintained one mapped into the virtual addressing space of a process (e.g. .dat section followed by .got or something like). I mean, why would the offset be bigger in referring directly to the symbol address instead of the entry address in the got ?
Other reason ?
Thanks !

Basically, the reason for these structures is exactly having an extra level of indirection.
In this way, you can interpose symbols in dynamic libraries with LD_PRELOAD. And even without it, the dynamic binding rules are such that a symbol defined in an executable overrides one defined in a shared library even for calls from that library (see this).
Also, consider these points.
The address at which a shared library holding the implementation of the called function gets loaded is not known beforehand (this is by design, in particular it is done on purpose: it's a feature of ld.so known as ASLR), so the dynamic loader needs to apply relocations to at least all call sites that are executed at run time.
If not for PLT, this would kill the advantage of sharing code segments of libraries mapped to memory by different process images, because in different processes the same library might have a different address, leading to different patched code. PLT is a relatively small piece of data that is not shared. See this post.
PLT allows to bind functions lazily, upon first invocation. PLT slots initially hold the address of the resolver. After the resolution is done, the result is cached in the PLT stot.
Relocation mechanism for GOT/PLT is covered here. All in all, there's enough information on the internet on how (and why) PLT and GOT work.
Also, check out GCC's -fno-plt option. This is an optimization, but note that GOT is still needed and that lazy binding is not supported for functions without PLT entries.

Related

Can neon registers be indexed?

Consider a neon register such as:
uint16x8_t foo;
To access an individual lane, one is supposed to use vgetq_lane_u16(foo, 3). However, one might be tempted to write foo[3] instead given the intuition that foo is an array of shorts. When doing so, the gcc (10) compiles without warnings, but it is not clear that it does what it was intended to do.
The gcc documentation does not specifically mention the indexed access, but it says that operations behave like C++ valarrays. Those do support indexed access in the intuitive way.
Regardless of what foo[3] evaluates to, doing so seems to be faster than vgetq_lane_u16(foo, 3), so probably they're different or we wouldn't need both.
So what exactly does foo[3] mean? Is its behavior defined at all? If not, why does gcc happily compile it?
The foo[3] form is the GCC Vector extension form, as you have found and linked documentation to; it behaves as so:
Vectors can be subscripted as if the vector were an array with the same number of elements and base type. Out of bound accesses invoke undefined behavior at run time. Warnings for out of bound accesses for vector subscription can be enabled with -Warray-bounds.
This can have surprising results when used on big-endian systems, so Arm’s Arm C Language Extensions recommend to use vget_lane if you are using other Neon intrinsics like vld1 in the same code path.

IJVM using a local variable for GOTO statement

I am working with IJVM and trying to use the GOTO instruction using a local variable in place of a static offset (or label). It won't work. I suppose it is simply treating the variable name as a label and trying to branch to it, but no such label exists. Is there any way I can force it to read the contents of the variable (which contains an offset), or some other solution?
Thanks in advance.
For security reasons, JVM bytecode doesn't let you jump to arbitrary instructions based on the contents of a variable. This restriction makes it possible for the JVM to verify various security properties of the bytecode by statically enumerating all control paths through a particular method. If you were able to jump anywhere, the static analyzer couldn't prove that all necessary program invariants held.
If you do need to jump to an arbitrary index, consider looking into the tableswitch or lookupswitch instructions, which would let you enumerate possible destinations in advance. It's not exactly what you're looking for, but to the best of my knowledge the sort of arbitrary jump you're trying to make isn't possible in JVM bytecode.
Hope this helps!
The GOTO instruction is implemented in MIC1. It interprets the 2 bytes after the opcode as an offset to the PC at the start of the instruction.
I think that the assignment must be asking you to write a new GOTO in MIC1 that interprets the byte after the opcode as the offset to a local variable that contains the branch offset.

STM32 programming tips and questions

I could not find any good document on internet about STM32 programming. STM's own documents do not explain anything more than register functions. I will greatly appreciate if anyone can explain my following questions?
I noticed that in all example programs that STM provides, local variables for main() are always defined outside of the main() function (with occasional use of static keyword). Is there any reason for that? Should I follow a similar practice? Should I avoid using local variables inside the main?
I have a gloabal variable which is updated within the clock interrupt handle. I am using the same variable inside another function as a loop condition. Don't I need to access this variable using some form of atomic read operation? How can I know that a clock interrupt does not change its value in the middle of the function execution? Should I need to cancel clock interrupt everytime I need to use this variable inside a function? (However, this seems extremely ineffective to me as I use it as loop condition. I believe there should be better ways of doing it).
Keil automatically inserts a startup code which is written in assembly (i.e. startup_stm32f4xx.s). This startup code has the following import statements:
IMPORT SystemInit
IMPORT __main
.In "C", it makes sense. However, in C++ both main and system_init have different names (e.g. _int_main__void). How can this startup code can still work in C++ even without using "extern "C" " (I tried and it worked). How can the c++ linker (armcc --cpp) can associate these statements with the correct functions?
you can use local or global variables, using local in embedded systems has a risk of your stack colliding with your data. with globals you dont have that problem. but this is true no matter where you are, embedded microcontroller, desktop, etc.
I would make a copy of the global in the foreground task that uses it.
unsigned int myglobal;
void fun ( void )
{
unsigned int myg;
myg=myglobal;
and then only use myg for the rest of the function. Basically you are taking a snapshot and using the snapshot. You would want to do the same thing if you are reading a register, if you want to do multiple things based on a sample of something take one sample of it and make decisions on that one sample, otherwise the item can change between samples. If you are using one global to communicate back and forth to the interrupt handler, well I would use two variables one foreground to interrupt, the other interrupt to foreground. yes, there are times where you need to carefully manage a shared resource like that, normally it has to do with times where you need to do more than one thing, for example if you had several items that all need to change as a group before the handler can see them change then you need to disable the interrupt handler until all the items have changed. here again there is nothing special about embedded microcontrollers this is all basic stuff you would see on a desktop system with a full blown operating system.
Keil knows what they are doing if they support C++ then from a system level they have this worked out. I dont use Keil I use gcc and llvm for microcontrollers like this one.
Edit:
Here is an example of what I am talking about
https://github.com/dwelch67/stm32vld/tree/master/stm32f4d/blinker05
stm32 using timer based interrupts, the interrupt handler modifies a variable shared with the foreground task. The foreground task takes a single snapshot of the shared variable (per loop) and if need be uses the snapshot more than once in the loop rather than the shared variable which can change. This is C not C++ I understand that, and I am using gcc and llvm not Keil. (note llvm has known problems optimizing tight while loops, very old bug, dont know why they have no interest in fixing it, llvm works for this example).
Question 1: Local variables
The sample code provided by ST is not particularly efficient or elegant. It gets the job done, but sometimes there are no good reasons for the things they do.
In general, you use always want your variables to have the smallest scope possible. If you only use a variable in one function, define it inside that function. Add the "static" keyword to local variables if and only if you need them to retain their value after the function is done.
In some embedded environments, like the PIC18 architecture with the C18 compiler, local variables are much more expensive (more program space, slower execution time) than global. On the Cortex M3, that is not true, so you should feel free to use local variables. Check the assembly listing and see for yourself.
Question 2: Sharing variables between interrupts and the main loop
People have written entire chapters explaining the answers to this group of questions. Whenever you share a variable between the main loop and an interrupt, you should definitely use the volatile keywords on it. Variables of 32 or fewer bits can be accessed atomically (unless they are misaligned).
If you need to access a larger variable, or two variables at the same time from the main loop, then you will have to disable the clock interrupt while you are accessing the variables. If your interrupt does not require precise timing, this will not be a problem. When you re-enable the interrupt, it will automatically fire if it needs to.
Question 3: main function in C++
I'm not sure. You can use arm-none-eabi-nm (or whatever nm is called in your toolchain) on your object file to see what symbol name the C++ compiler assigns to main(). I would bet that C++ compilers refrain from mangling the main function for this exact reason, but I'm not sure.
STM's sample code is not an exemplar of good coding practice, it is merely intended to exemplify use of their standard peripheral library (assuming those are the examples you are talking about). In some cases it may be that variables are declared external to main() because they are accessed from an interrupt context (shared memory). There is also perhaps a possibility that it was done that way merely to allow the variables to be watched in the debugger from any context; but that is not a reason to copy the technique. My opinion of STM's example code is that it is generally pretty poor even as example code, let alone from a software engineering point of view.
In this case your clock interrupt variable is atomic so long as it is 32bit or less so long as you are not using read-modify-write semantics with multiple writers. You can safely have one writer, and multiple readers regardless. This is true for this particular platform, but not necessarily universally; the answer may be different for 8 or 16 bit systems, or for multi-core systems for example. The variable should be declared volatile in any case.
I am using C++ on STM32 with Keil, and there is no problem. I am not sure why you think that the C++ entry points are different, they are not here (Keil ARM-MDK v4.22a). The start-up code calls SystemInit() which initialises the PLL and memory timing for example, then calls __main() which performs global static initialisation then calls C++ constructors for global static objects before calling main(). If in doubt, step through the code in the debugger. It is important to note that __main() is not the main() function you write for your application, it is a wrapper with different behaviour for C and C++, but which ultimately calls your main() function.

Building a COM object vtable in x86 assembly

I am building a COM object in x86 assembly using NASM. I understand COM quite well and I understand x86 assembly pretty well, but getting the two to mesh is getting me hung up... (by the way, if you're thinking of attempting to dissuade me from using x86 assembly, please refrain, I have very particular reasons why I'm building this in x86 assembly!)
I am trying to build a vtable to use in my COM object, but I keep getting strange pointers, rather than actual pointers to my functions. (I'm thinking that I'm getting relative offsets or that NASM is embedding temporary values in there and they're not being replaced with the real values during linking)
The current interface I'm trying to build is the IClassFactory interface, with code as follows:
%define S_OK 0x00000000
%define E_NOINTERFACE 0x80004002
section .text
; All of these have very simple shells rather than implementations, but that is just until I can get the vtable worked out
ClassFactory_QueryInterface:
mov eax, E_NOINTERFACE
retn 12
ClassFactory_AddRef:
mov eax, 1
retn 4
ClassFactory_Release:
mov eax, 1
retn 4
ClassFactory_CreateInstance:
mov eax, E_NOINTERFACE
retn 16
ClassFactory_LockServer:
mov eax, S_OK
retn 8
global ClassFactory_vtable
ClassFactory_vtable dd ClassFactory_QueryInterface, ClassFactory_AddRef, ClassFactory_Release, ClassFactory_CreateInstance, ClassFactory_LockServer
global ClassFactory_object
ClassFactory_object dd ClassFactory_vtable
Note: This is not all of the code, I have DllGetClassObject, DllMain, etc. in a different file.
But when I assemble (using NASM: nasm -f win32 comobject.asm) and link (using MS Link: link /dll /subsystem:windows /out:comobject.dll comobject.obj), and examine the executable using OllyDbg, the vtable comes out with strange values. For example, in my last build, the actual addresses for the functions are as follows:
QueryInterface - 0x00381012
AddRef - 0x0038101A
Release - 0x00381020
CreateInstance - 0x00381026
LockServer - 0x0038102E
But the vtable came out with these values:
QueryInterface - 0x00F51012
AddRef - 0x00F5101A
Release - 0x00F51020
CreateInstance - 0x00F51026
LockServer - 0x00F5102E
These values look awfully suspicious... almost like the relocation didn't take. Also, the vtable comes out as 0x00F5104A, all of which are inaccessible memory addresses. (for informational purposes, these values come out different every time)
I tried doing the same thing in C++ using Visual Studio 2010 Express and everything comes out fine. So I'm assuming that it's just something that I'm missing in my assembly...
Can anyone point out to me why these values aren't coming out properly?
I must apologize, the problem turned out to be my own fault... In all of the scuffle building the thing, I had removed the /dll from the linker invocation, causing it to be built as an EXE, not a DLL...
Let me explain this a little better for the next person who runs across this.
All Windows executables have a base address which is assumed to be the virtual address that the executable will be loaded into. Executables that are loaded into a running process in most cases will not be loaded at the "preferred" base address, because another DLL (or the application itself) is probably already occupying the address. For this reason, Windows PE executables use what is called a Relocation Table. The Relocation Table tells Windows which locations in the executable need to be rewritten in case of a relocation to a new base address.
However, with the advent of Virtual Memory, most linkers will omit the relocation table from EXEs as an optimization, because the executable will always be loaded at it's base address (unless it conflicts with the reserved kernel addresses, in which case it will fail to load all-together). So because I stopped compiling as a DLL, my executable was not being given a Relocation Table and as a result, would not load properly into running process' address space.
Update:
By default, MSVC only includes relocation tables in DLL projects, as described on MSDN:
By default, /FIXED:NO is the default when building a DLL, and /FIXED is the default for any other project type.
This behavior can be changed by supplying the /FIXED:NO switch to the linker. The default for non-DLL projects is /FIXED which tells the linker that the target has a fixed base address and does not require a relocation table.
have you tried to build a stub COM interface in C and disassemble the result? That should give you a clue what is going wrong in your implementation.
Have you tried to also declare your globals as export? I haven't done x86 for a long time. But reading nasm's doc seems to imply you need to both global and export for the DLL relocation fixup to work.

JIT code generation techniques

How does a virtual machine generate native machine code on the fly and execute it?
Assuming you can figure out what are the native machine op-codes you want to emit, how do you go about actually running it?
Is it something as hacky as mapping the mnemonic instructions to binary codes, stuffing it into an char* pointer and casting it as a function and executing?
Or would you generate a temporary shared library (.dll or .so or whatever) and load it into memory using standard functions like LoadLibrary ?
You can just make the program counter point to the code you want to execute. Remember that data can be data or code. On x86 the program counter is the EIP register. The IP part of EIP stands for instruction pointer. The JMP instruction is called to jump to an address. After the jump EIP will contain this address.
Is it something as hacky as mapping the mnemonic instructions to binary codes, stuffing it into an char* pointer and casting it as a function and executing?
Yes. This is one way of doing it. The resulting code would be cast to a pointer to function in C.
Is it something as hacky as mapping the mnemonic instructions to binary codes, stuffing it into an char* pointer and casting it as a function and executing?
Yes, if you were doing it in C or C++ (or something similar), that's exactly what you'd do.
It appears hacky, but that's actually an artifact of the language design. Remember, the actual algorithm you want to use is very simple: determine what instructions you want to use, load them into a buffer in memory, and jump to the beginning of that buffer.
If you really try to do this, though, make sure you get the calling convention right when you return to your C program. I think if I wanted to generate code I'd look for a library to take care of that aspect for me. Nanojit's been in the news recently; you could look at that.
Yup. You just build up a char* and execute it. However, you need to note a couple details. The char* must be in an executable section of memory and must have proper alignment.
In addition to nanojit you can also check out LLVM which is another library that's capable of compiling various program representations down to a function pointer. It's interface is clean and the generated code tends to be efficient.
As far as i know it compiles everything in memory because it has to run some heuristics to to optimize the code (i.e.: inlining over time) but you can have a look at the Shared Source Common Language Infrastructure 2.0 rotor release. The whole codebase is identical to .NET except for the Jitter and the GC.
As well as Rotor 2.0 - you could also take a look at the HotSpot virtual machine in the OpenJDK.
About generating a DLL: the additional required I/O for that, plus linking, plus the complexity of generating the DLL format, would make that much more complicate, and above all they'd kill performance; additionally, in the end you still call a function pointer to the loaded code, so...
Also, JIT compilation can happen one method at a time, and if you want to do that you'd generate lots of small DLLs.
About the "executable section" requirement, calling mprotect() on POSIX systems can fix the permissions (there's a similar API on Win32). You need to do that for a big memory segment instead that once per method since it'd be too slow otherwise.
On plain x86 you wouldn't notice the problem, on x86 with PAE or 64bit AMD64/Intel 64 bit machines you'd get a segfault.
Is it something as hacky as mapping
the mnemonic instructions to binary
codes, stuffing it into an char*
pointer and casting it as a function
and executing?
Yes, that works.
To do this in windows you must set PAGE_EXECUTE_READWRITE to the allocated block:
void (*MyFunc)() = (void (*)()) VirtualAlloc(NULL, sizeofblock, MEM_COMMIT, PAGE_EXECUTE_READWRITE);
//Now fill up the block with executable code and issue-
MyFunc();