How to get a variable that is stored in the stack?

While programming (for example, in C), a lot of variables are kept on the stack.
A stack is a LIFO data structure, and we can pop only the top value of the stack.
Let's say I have 100 variables stored on the stack, and I want to get the value of one of them, which is not at the top of the stack.
How do I get it? Does the operating system pop all the variables that are above it until it reaches the wanted variable, then push all of them back?
Or is there a different way the operating system can reach a variable inside the stack?
Thanks

The stack, as used in languages like C, is not a typical LIFO. It's called a stack because it is used in a way similar to a LIFO: When a procedure is called, a new frame is pushed onto the stack. The frame typically contains local variables and bookkeeping information like where to return to. Similarly, when a procedure returns, its frame is popped off the stack.
There's nothing magical about this. The compiler (not the operating system) allocates a register to be used as a stack pointer - let's call it SP. By convention, SP points to the memory location of the next free stack word:
+----------------+ (high address)
|   argument 0   |
+----------------+
|   argument 1   |
+----------------+
| return address |
+----------------+
|    local 0     |
+----------------+
|    local 1     |
+----------------+              +----+
|   free slot    | <----------- | SP |
+----------------+ (low address)+----+
To push a value onto the stack, we do something like this (in pseudo-assembly):
STORE [SP], 42 ; store the value 42 at the address where SP points
SUB SP, 1 ; move down (the stack grows down!) to the next stack location
Where the notation [SP] is read as "the contents of the memory cell to which SP points". Some architectures, notably x86, provide a push instruction that does both the storing and subtraction. To pop (and discard) the n top values on the stack, we just add n to SP*.
Now, suppose we want to access the local 0 field above. Easy enough if our CPU has a base+offset addressing mode! Assume SP points to the free slot as in the picture above.
LOAD R0, [SP+2] ; load "local 0" into register R0
Notice how we didn't need to pop local 0 off the stack first, because we can reference any field using its offset from the stack pointer.
Depending on the compiler and machine architecture, there may be another register pointing to the area between locals and arguments (or thereabouts). This register, typically called a frame pointer, remains fixed as the stack pointer moves around.
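To make this concrete, here is a tiny C function together with the kind of code a compiler might emit for it, in the same pseudo-assembly as above (a sketch - the actual registers, offsets, and calling convention are entirely up to the compiler):

int sum_locals(void)
{
    int a = 1;     /* the compiler might place this at [FP-1] (one-byte words, as in the footnote) */
    int b = 2;     /* ... and this at [FP-2] */
    return a + b;  /* e.g. LOAD R0, [FP-1] ; LOAD R1, [FP-2] ; ADD R0, R1 */
}

Every access to a or b in the generated code is just a fixed offset from FP (or SP); nothing is ever popped to reach them.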
I want to stress the fact that normally, the operating system isn't involved in stack manipulation at all. The kernel allocates the initial stack, and possibly monitors its growth, but leaves the pushing and popping of values to the user program.
*For simplicity, I've assumed that the machine word size is 1 byte, which is why we subtract 1 from SP. On a 32-bit machine, pushing a word onto the stack means subtracting (at least) four bytes.

I. The operating system doesn't do anything directly with your variables.
II. Don't think of the stack as a physical stack (the reason is quite simple: it isn't one). The elements of the stack can be accessed directly, and the compiler generates code that does so. Google "stack pointer relative addressing".

Related

Is there a way to swap long (or double) and reference values on JVM stack?

Let's say I have the following bytecode sequence:
aload 0 // this
lload 1
aload 3
For the sake of the question, let's assume that these instructions are generated by other code and I don't have control over it.
I need to swap the last two items on the stack, a long and a reference. I can't do it with swap, because a long takes two slots on the stack and swap doesn't care about that.
I'll get something like this when loading the class:
java.lang.VerifyError: Bad type on operand stack
Exception Details:
  Location:
    bytecode/generated/SomeClasName.someMethod(Ljava/lang/Object;)Z #18: swap
  Reason:
    Type long_2nd (current frame, stack[3]) is not assignable to category1 type
Is there a way to swap category1 and category2 types on a stack without resorting to locals?
Use dup_x2 + pop. This requires one extra stack slot though.
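To see why this works, trace the operand stack slot by slot (a sketch; top of the stack is on the right, and the long occupies two slots):

aload 0    // ..., this
lload 1    // ..., this, long
aload 3    // ..., this, long, ref
dup_x2     // ..., this, ref, long, ref   (form 2 of dup_x2: the category1 ref is copied beneath the category2 long)
pop        // ..., this, ref, long

dup_x2 inserts a copy of the top category1 value beneath both slots of the long, and pop then discards the original on top - hence the one extra slot of temporary stack depth.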

How can we jump to different memory address in Smalltalk?

I am trying to build an assembly language interpreter in Smalltalk. Is there any command if I want to jump to a different memory location?
Example: there is an array of memory addresses from 1-10.
1 LDI 10 // Load 10 into a register
2 XCH    // Exchange value with a different register
3 LDI 20 // Load 20 into a register
4 ADD    // Add the values 10 and 20
5 JMP 1  // Jump to memory address 1
6 HLT
To jump from memory address 5 to address 1, is there any command?
If you are trying to model an assembly interpreter you need to represent several objects. At least you will need to have objects (i.e., classes) for registers, instructions and memory. In this design, a program (or routine) would be a sequence of instructions and your interpreter would have an instruction pointer ip that moves along the routine.
At every position of the ip, the interpreter would have to "execute" the current instruction, which would result in modifications to the registers and/or specific memory locations.
For instance, you start the interpretation by assigning 1 to ip. Now you read the instruction with index ip, in this case:
1. LDI 10
Then you have to send the #execute message to the instruction. In this case the execution assigns the value 10 to the object representing register I. Now you increment ip and repeat until you run out of instructions.
In this "simulation" of the processor the jmp instruction would be one of the easiest ones to interpret: it would simply change the value of the instruction pointer ip to the target location.

Storing things in isa

The 64-bit runtime took away the ability to directly access the isa field of an object, something Clang engineers had been warning us about for a while. It has been replaced by a rather inventive (and magic) set of ever-changing ABI rules about which sections of the newly christened isa header contain information about the object, or even other state (in the case of NSNumber/NSString). There seems to be a loophole, in that you can opt out of the new "magic" isa and use one of your own (a raw isa) at the expense of taking the slow road through certain runtime code paths.
My question is twofold, then:
If it's possible to opt out and object_setClass() an arbitrary class into an object in +allocWithZone:, is it also possible to put anything up there in the extra space with the class, or will the runtime try to read it through the fast paths?
What exactly in the isa header is tagged to let the runtime differentiate it from a normal isa?
If it's possible to opt out and object_setClass() an arbitrary class into an object in +allocWithZone:
According to this article by Greg Parker
If you override +allocWithZone:, you may initialize your object's isa field to a "raw" isa pointer. If you do, no extra data will be stored in that isa field and you may suffer the slow path through code like retain/release. To enable these optimizations, instead set the isa field to zero (if it is not already) and then call object_setClass().
So yes, you can opt out and manually set a raw isa pointer. To inform the runtime about this, you have to set the least significant bit of the isa to 0 (see below).
Also, there's an environment variable that you can set, named OBJC_DISABLE_NONPOINTER_ISA, which is pretty self-explanatory.
is it also possible to put anything up there in the extra space with the class, or will the runtime try to read it through the fast paths?
The extra space is not being wasted. It's used by the runtime for useful in-place information about the object, such as the current state and - most importantly - its retain count (this is a big improvement since it used to be fetched every time from an external hash table).
So no, you cannot use the extra space for your own purposes, unless you opt out (as discussed above). In that case the runtime will go through the long path, ignoring the information contained in the extra bits.
Again according to Greg Parker's article, here's the new layout of the isa (note that this is very likely to change over time, so don't rely on it):
(LSB)
 1 bit  | indexed           | 0 is raw isa, 1 is non-pointer isa.
 1 bit  | has_assoc         | Object has or once had an associated reference. Objects with no associated references can deallocate faster.
 1 bit  | has_cxx_dtor      | Object has a C++ or ARC destructor. Objects with no destructor can deallocate faster.
30 bits | shiftcls          | Class pointer's non-zero bits.
 9 bits | magic             | Equals 0xd2. Used by the debugger to distinguish real objects from uninitialized junk.
 1 bit  | weakly_referenced | Object is or once was pointed to by an ARC weak variable. Objects not weakly referenced can deallocate faster.
 1 bit  | deallocating      | Object is currently deallocating.
 1 bit  | has_sidetable_rc  | Object's retain count is too large to store inline.
19 bits | extra_rc          | Object's retain count above 1. (For example, if extra_rc is 5 then the object's real retain count is 6.)
(MSB)
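For illustration only, that layout can be mirrored as a C bit-field union, similar in spirit to the isa_t union in the open-source objc4 runtime (a sketch based on the table above, not the real definition, whose field names, widths and order differ between architectures and releases):

#include <stdint.h>

union isa_t {
    uintptr_t bits;                        /* the raw 64-bit isa word */
    struct {
        /* bit-field order is compiler-dependent; listed LSB-first to match the table */
        uintptr_t indexed           : 1;   /* 0 = raw isa pointer, 1 = non-pointer isa */
        uintptr_t has_assoc         : 1;
        uintptr_t has_cxx_dtor      : 1;
        uintptr_t shiftcls          : 30;  /* class pointer's non-zero bits */
        uintptr_t magic             : 9;
        uintptr_t weakly_referenced : 1;
        uintptr_t deallocating      : 1;
        uintptr_t has_sidetable_rc  : 1;
        uintptr_t extra_rc          : 19;  /* retain count above 1 */
    } fields;
};

Checking bits & 1 (the indexed bit) is then exactly the raw-vs-non-pointer test described below.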
What exactly in the isa header is tagged to let the runtime differentiate it from a normal isa?
As mentioned above, you can discriminate between a raw isa and the new rich isa by looking at the least significant bit.
To wrap it up, while it looks feasible to opt out and start messing with the extra bits available on a 64-bit architecture, I personally discourage it. The new isa layout is carefully crafted to optimize runtime performance and it's far from guaranteed to stay the same over time.
Apple may also decide in the future to drop backward compatibility with the raw isa representation, preventing opt-out. Any code assuming the isa to be a pointer would then break.
You can't safely do this, since if (when, really) the usable address space expands beyond 33 bits, the layout will presumably need to change again. Currently though, the bottom bit of the isa controls whether it's treated as having extra info or not.

Memory addresses, pointers, variables, values - what goes on behind the scenes

This is going to be a pretty loaded question, but ever since I started learning about pointers I've been very curious about what happens behind the scenes when a program is run.
As far as I know, computer memory is commonly thought of as a long strip of memory divided evenly into individual bytes - certainly the usual textbook diagrams of numbered cells evoke such a metaphor.
One thing I've been wondering: what do the memory addresses themselves represent? I'm sure it's no coincidence that memory addresses appear as 8-digit hexadecimal values (e.g. 00EB5748). Why is this?
Furthermore, when I declare a variable x, what is happening at the memory level? Is the compiler simply reserving a random address (+however many consecutive addresses it needs for the variable type) for data storage?
Now suppose x is an unsigned int that occupies 2 bytes of memory (i.e. values ranging from 0 to 65535). When I declare x = 12, what is happening? What is it that I'm making equal to 12? When I draw conceptual diagrams, I usually have a box for an address (say &x) pointing to a variable (x) that occupies seemingly nothing, and I'm sure that can't be a fully accurate picture of what's going on.
And what's happening at the binary level? Is the address 00EB5748 treated as 111010110101011101001000 and storing a value of 12 somewhere, or 1100?
Mostly my confusion and curiosity stem from the relationship between memory addresses and the actual values being declared (e.g. 12, 'a', -355.2). As another example, suppose our address 00EB5748 is pointing to a char 's' whose value is 115 according to ASCII charts. Is the address describing a position that stores the value 115 in 1 byte, by flipping the appropriate 1s and 0s at that position in memory?
Just open any book. You will see pages. Every page has a number. Consecutive pages are numbered with consecutive numbers. Do you have any confusion with numbered pages? I think not. Then you should not have any confusion with computer memory.
Books were the main memory storage devices before the computer era. Computer memory derives its basic concepts from books: a book has pages -> computer memory has memory cells; a book has page numbers -> computer memory has memory addresses.
One thing I've been wondering, what do the memory addresses themselves represent?
Numbers. Every memory cell has a number, like every page in a book.
Furthermore, when I declare a variable x, what is happening at the memory level? Is the compiler simply reserving a random address (+however many consecutive addresses it needs for the variable type) for data storage?
The memory manager marks some memory cells as occupied and tells the compiler the address of the first reserved cell. The compiler associates the name and type of the variable with this address. (This picture is from my head; it may be inaccurate.)
When I declare x = 12, what is happening?
When you declared variable x, memory cells were reserved for this variable. Now you write 12 into these memory cells. Note that 12 is binary coded in some way, depending on the type of variable x. If x is an unsigned int which occupies 2 memory cells, then one cell will contain 0 and the other will contain 12, because the binary integer representation of 12 is
0000 0000 0000 1100
|_______| |_______|
   cell      cell
If 12 is a floating-point number, it will be coded in a different way.
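Going back to the two-cell integer: a short C program makes the cells visible (a sketch assuming unsigned short occupies 2 bytes; which cell holds which half depends on the machine's byte order):

#include <stdio.h>

int main(void)
{
    unsigned short x = 12;                  /* assume a 2-byte unsigned type */
    unsigned char *cell = (unsigned char *)&x;
    printf("cell at %p holds %02x\n", (void *)&cell[0], cell[0]);
    printf("cell at %p holds %02x\n", (void *)&cell[1], cell[1]);
    /* little-endian machines print 0c then 00; big-endian machines print 00 then 0c */
    return 0;
}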
A memory address is simply the position of a given byte in memory. The zeroth byte is at 0x00000000. The tenth at 0x0000000A. The 65535th at 0x0000FFFF. And so on.
Local variables live on the stack*. When compiling a block of code, the compiler counts how many bytes are needed to hold all the local variables, and then increments the stack pointer so that all the variables can fit below it (along with some other stuff like frame pointers and return addresses and whatnot). Then it just remembers that, for example, local variable x is at an offset -2 from the stack pointer, foo is at an offset -4 and so on, and uses those addresses whenever those variables are referenced in the following code.
Since the compiler knows that x is at address (stack pointer - 2), that's the location that is set to the value 12 when you do x = 12.
Not entirely sure if I understand this question, but say you want to read the memory at address 0x00EB5748. The control unit in the CPU reads the instruction, sees that it is a load instruction, and passes the address (in binary of course) to the load/store unit, along with some other junk like how many bytes to read. Then the LSU sends that address to some memory (probably L1 cache), and after a certain time gets the value 12 back. Then this data is available to, say, put in a register, or send to the ALU to do arithmetic, or whatever.
That seems to be accurate, yes. Going back to the first question, an address simply means "byte number 0xWHATEVER in memory".
Hope this clarified things a bit at least.
*I should probably explain the stack as well. A stack is a portion of memory reserved for local variables (and some other stuff). It starts at a fixed location in memory, and stops at the memory address contained in a special register called the stack pointer. To begin with, the stack is empty, so the stack pointer just contains the start of the stack. As you put more data on the stack, the SP is incremented. This means that you can always put more data on it simply by putting it at the address in the SP, and then incrementing the SP so that once again anything past that address is free memory.

What are the most hardcore optimisations you've seen?

I'm not talking about algorithmic stuff (e.g. use quicksort instead of bubblesort), and I'm not talking about simple things like loop unrolling.
I'm talking about the hardcore stuff: Tiny Teensy ELF, The Story of Mel, practically everything in the demoscene, and so on.
I once wrote a brute-force RC5 key search that processed two keys at a time: the first key used the integer pipeline, the second key used the SSE pipelines, and the two were interleaved at the instruction level. This was then coupled with a supervisor program that ran an instance of the code on each core in the system. In total, the code ran about 25 times faster than a naive C version.
In one (here unnamed) video game engine I worked with, they had rewritten the model-export tool (the thing that turns a Maya mesh into something the game loads) so that instead of just emitting data, it would actually emit the exact stream of microinstructions that would be necessary to render that particular model. It used a genetic algorithm to find the one that would run in the minimum number of cycles. That is to say, the data format for a given model was actually a perfectly-optimized subroutine for rendering just that model. So, drawing a mesh to the screen meant loading it into memory and branching into it.
(This wasn't for a PC, but for a console that had a vector unit separate and parallel to the CPU.)
In the early days of DOS, when we used floppy discs for all data transport, there were viruses as well. One common way for viruses to infect different computers was to copy a virus bootloader into the boot sector of an inserted floppy disc. When the user inserted the floppy disc into another computer and rebooted without remembering to remove the floppy, the virus was run and infected the hard drive's boot sector, thus permanently infecting the host PC. A particularly annoying virus I was infected by was called "Form". To battle this I wrote a custom floppy boot sector that had the following features:
Validate the boot sector of the host hard drive and make sure it was not infected.
Validate the floppy boot sector and make sure that it was not infected.
Code to remove the virus from the hard drive if it was infected.
Code to duplicate the antivirus boot sector to another floppy if a special key was pressed.
Code to boot the hard drive if all was well and no infection was found.
This was done in the program space of a boot sector, about 440 bytes :)
The biggest problem for my mates was the very cryptic messages displayed because I needed all the space for code. It was like "FFVD RM?", which meant "FindForm Virus Detected, Remove?"
I was quite happy with that piece of code. The optimization was program size, not speed. Two quite different optimizations in assembly.
My favorite is the floating-point inverse square root via integer operations. This is a cool little hack that exploits how floating-point values are stored, and it can execute faster (even computing 1/result is faster than the stock-standard square root function) or produce more accurate results than the standard methods.
In C/C++ the code is (sourced from Wikipedia):
float InvSqrt(float x)
{
    float xhalf = 0.5f * x;
    int i = *(int *)&x;              // reinterpret the float's bits as an integer
    i = 0x5f3759df - (i >> 1);       // now this is what you call a real magic number
    x = *(float *)&i;                // reinterpret the adjusted bits as a float again
    x = x * (1.5f - xhalf * x * x);  // one step of Newton's method refines the estimate
    return x;
}
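A quick sanity check (a sketch, assuming the InvSqrt definition above is in scope):

#include <stdio.h>
#include <math.h>

int main(void)
{
    printf("InvSqrt(4) = %f\n", InvSqrt(4.0f));   /* roughly 0.4992 */
    printf("1/sqrt(4)  = %f\n", 1.0 / sqrt(4.0)); /* exactly 0.5 */
    return 0;
}

With the single Newton step shown above, the result is typically within about 0.2% of the true value.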
A Very Biological Optimisation
Quick background: Triplets of DNA nucleotides (A, C, G and T) encode amino acids, which are joined into proteins, which are what make up most of most living things.
Ordinarily, each different protein requires a separate sequence of DNA triplets (its "gene") to encode its amino acids -- so e.g. 3 proteins of lengths 30, 40, and 50 would require 90 + 120 + 150 = 360 nucleotides in total. However, in viruses, space is at a premium -- so some viruses overlap the DNA sequences for different genes, using the fact that there are 6 possible "reading frames" to use for DNA-to-protein translation (namely, starting from a position congruent to 0, 1, or 2 modulo 3, and the same three again reading the sequence in reverse).
For comparison: Try writing an x86 assembly language program where the 300-byte function doFoo() begins at offset 0x1000... and another 200-byte function doBar() starts at offset 0x1001! (I propose a name for this competition: Are you smarter than Hepatitis B?)
That's hardcore space optimisation!
UPDATE: Links to further info:
Reading Frames on Wikipedia suggests Hepatitis B and "Barley Yellow Dwarf" virus (a plant virus) both overlap reading frames.
Hepatitis B genome info on Wikipedia. Seems that different reading-frame subunits produce different variations of a surface protein.
Or you could google for "overlapping reading frames"
Seems this can even happen in mammals! Extensively overlapping reading frames in a second mammalian gene is a 2001 scientific paper by Marilyn Kozak that talks about a "second" gene in rat with "extensive overlapping reading frames". (This is quite surprising as mammals have a genome structure that provides ample room for separate genes for separate proteins.) Haven't read beyond the abstract myself.
I wrote a tile-based game engine for the Apple IIgs in 65816 assembly language a few years ago. This was a fairly slow machine and programming "on the metal" is a virtual requirement for coaxing out acceptable performance.
In order to quickly update the graphics screen one has to map the stack to the screen in order to use some special instructions that allow one to update 4 screen pixels in only 5 machine cycles. This is nothing particularly fantastic and is described in detail in IIgs Tech Note #70. The hard-core bit was how I had to organize the code to make it flexible enough to be a general-purpose library while still maintaining maximum speed.
I decomposed the graphics screen into scan lines and created a 246 byte code buffer to insert the specialized 65816 opcodes. The 246 bytes are needed because each scan line of the graphics screen is 80 words wide and 1 additional word is required on each end for smooth scrolling. The Push Effective Address (PEA) instruction takes up 3 bytes, so 3 * (80 + 1 + 1) = 246 bytes.
The graphics screen is rendered by jumping to an address within the 246-byte code buffer that corresponds to the right edge of the screen and patching a BRanch Always (BRA) instruction into the code at the word immediately following the left-most word. The BRA instruction takes a signed 8-bit offset as its argument, so it just barely has the range to jump out of the code buffer.
Even this isn't too terribly difficult, but the real hard-core optimization comes in here. My graphics engine actually supported two independent background layers and animated tiles by using different 3-byte code sequences depending on the mode:
Background 1 uses a Push Effective Address (PEA) instruction
Background 2 uses a Load Indirect Indexed (LDA ($00),y) instruction followed by a push (PHA)
Animated tiles use a Load Direct Page Indexed (LDA $00,x) instruction followed by a push (PHA)
The critical restriction is that both of the 65816 index registers (X and Y) are used to reference data and cannot be modified. Further, the direct page register (D) is set based on the origin of the second background and cannot be changed; the data bank register is set to the data bank that holds pixel data for the second background and cannot be changed; and the stack pointer (S) is mapped to the graphics screen, so there is no possibility of jumping to a subroutine and returning.
Given these restrictions, I had the need to quickly handle cases where a word that is about to be pushed onto the stack is mixed, i.e. half comes from Background 1 and half from Background 2. My solution was to trade memory for speed. Because all of the normal registers were in use, I only had the Program Counter (PC) register to work with. My solution was the following:
Define a code fragment to do the blend in the same 64K program bank as the code buffer
Create a copy of this code for each of the 82 words
There is a 1-1 correspondence, so the return from the code fragment can be a hard-coded address
Done! We have a hard-coded subroutine that does not affect the CPU registers.
Here are the actual code fragments:
code_buff:   PEA $0000       ; rightmost word (16 bits = 4 pixels)
             PEA $0000       ; background 1
             PEA $0000       ; background 1
             PEA $0000       ; background 1
             LDA (72),y      ; background 2
             PHA
             LDA (70),y      ; background 2
             PHA
             JMP word_68     ; mix the data
word_68_rtn: PEA $0000       ; more background 1
             ...
             PEA $0000
             BRA *+40        ; patched exit code
             ...
word_68:     LDA (68),y      ; load data for background 2
             AND #$00FF      ; mask off the high byte
             ORA #$AB00      ; blend with data from background 1
             PHA
             JMP word_68_rtn ; jump back
word_66:     LDA (66),y
             ...
The end result was a near-optimal blitter that has minimal overhead and cranks out more than 15 frames per second at 320x200 on a 2.5 MHz CPU with a 1 MB/s memory bus.
Michael Abrash's "Zen of Assembly Language" had some nifty stuff, though I admit I don't recall specifics off the top of my head.
Actually it seems like everything Abrash wrote had some nifty optimization stuff in it.
The Stalin Scheme compiler is pretty crazy in that respect.
I once saw a switch statement with a lot of empty cases, a comment at the head of the switch said something along the lines of:
Added case statements that are never hit because the compiler only turns the switch into a jump-table if there are more than N cases
I forget what N was. This was in the source code for Windows that was leaked in 2004.
I've gone through the Intel (and AMD) architecture references to see what instructions there are. movsx - move with sign extension - is awesome for moving small signed values into big spaces, for example, in one instruction.
Likewise, if you know you only use 16-bit values but can access all of EAX, EBX, ECX, EDX, etc., then you have 8 very fast locations for values - just rotate the registers by 16 bits to access the other values, as sketched below.
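For instance (a sketch in Intel/NASM-style syntax; the values are arbitrary):

mov ax, 1234    ; first 16-bit value lives in the low half of EAX
rol eax, 16     ; rotate the halves: the low word is now the free slot
mov ax, 5678    ; second 16-bit value lives in what was the high half
rol eax, 16     ; rotate back: AX holds 1234 again, 5678 is parked up top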
The EFF DES cracker used custom-built hardware to generate candidate keys (the hardware they made could prove a key wasn't the solution, but could not prove a key was the solution), which were then tested with more conventional code.
The FSG 2.0 packer, made by a Polish team, was designed specifically for packing executables written in assembly. If packing assembly isn't impressive enough (it's supposed to be almost as small as possible already), the loader it comes with is 158 bytes and fully functional. If you try packing any assembly-made .exe with something like UPX, it will throw a NotCompressableException at you ;)