What should memory be filled with if my Malbolge program contains only 1 instruction?

According to the original '98 specification, Ben Olmstead's Malbolge VM fills empty memory cells by applying the crazy op to the two previous cells: "Cells which are not initialized are set by performing op on the previous two cells repetitively." I.e.
[m] = crz [m-2], [m-1]
For the sake of sanity, what should I do if the program contains only 1 instruction?
Or should I assume the last character always to be EOF?

Judging by the implementation and by language-lawyering, there are two options:
If we take the definition of "two previous cells" literally, then a single-character or empty Malbolge program is illegal in the language, because it cannot be executed according to the spec.
If we take the definition [m] = crz [m-2], [m-1], it gets interesting. The reference implementation (along with probably most of the others) uses an unsigned short (or int) for the memory pointer. When you try subtracting 2 from 1 (m-2), the result wraps around to 0xffff, decimal 65535 (see this answer for details), which is a bit over Malbolge's 59049-cell memory limit. That glitch runs (almost) perfectly on a normal machine, using the 0xffff cell for the crazy-op computation (without even harming memory outside the environment!), but it will fail on a limited-memory or virtual machine.
You might end up with 0xffffffff instead of 0xffff, depending on the way you use the pointer.
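Here is a minimal C sketch of that wraparound (assuming, as the reference interpreter does, that the memory index is kept in an unsigned short):
#include <stdio.h>

int main(void)
{
    unsigned short m = 1;          /* filling starts right after a
                                      one-instruction program          */
    unsigned short prev2 = m - 2;  /* the int result -1 wraps to 65535
                                      when stored back into the short  */

    printf("m - 2 = %u\n", prev2); /* prints 65535 (0xffff), well past
                                      Malbolge's 59049 cells           */
    return 0;
}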
In short,
If you run it by hand, assume it fails.
If you run it on a virtual machine, it fails.
If you run it on a simulator, it will probably work, but it defeats the point of running it in the first place, since 0xffff is a random-valued memory cell, leading to random values propagating through the environment memory. On the other hand, what can you expect from a single-byte Malbolge program?


A general-purpose warp-level std::copy-like function - what should it account for?

A C++ standard library implements std::copy (ignoring all sorts of wrappers and concept checks etc.) with the simple loop:
for (; __first != __last; ++__result, ++__first)
    *__result = *__first;
Now, suppose I want a general-purpose std::copy-like function for warps (not blocks; not grids) to use for collaboratively copying data from one place to another. Let's even assume for simplicity that the function takes pointers rather than an arbitrary iterator.
Of course, writing general-purpose code in CUDA is often a useless pursuit - since we might be sacrificing a lot of the benefit of using a GPU in the first place in favor of generality - so I'll allow myself some boolean/enum template parameters to possibly select between frequently-occurring cases, avoiding runtime checks. So the signature might be, say:
template <typename T, bool SomeOption, my_enum_t AnotherOption>
T* copy(
    T* __restrict__        destination,
    const T* __restrict__  source,
    size_t                 length
);
but for each of these cases I'm aiming for optimal performance (or optimal expected performance given that we don't know what other warps are doing).
Which factors should I take into consideration when writing such a function? Or in other words: Which cases should I distinguish between in implementing this function?
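For concreteness, the naive warp-collaborative baseline I have in mind looks roughly like this (just a sketch; it ignores all of the factors I ask about below):
#include <cstddef>

template <typename T>
__device__ T* naive_copy(T* __restrict__ destination,
                         const T* __restrict__ source,
                         std::size_t length)
{
    // Lane index within the warp; consecutive lanes touch consecutive
    // addresses, so stores coalesce reasonably well for 4-byte elements.
    unsigned lane = threadIdx.x % warpSize;
    for (std::size_t i = lane; i < length; i += warpSize)
        destination[i] = source[i];
    return destination + length;
}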
Notes:
This should target Compute Capabilities 3.0 or better (i.e. Kepler or newer micro-architectures)
I don't want to make a Runtime API memcpy() call. At least, I don't think I do.
Factors I believe should be taken into consideration:
Coalescing memory writes - ensuring that consecutive lanes in a warp write to consecutive memory locations (no gaps).
Type size vs Memory transaction size I - if sizeof(T) is 1 or 2, and we have each lane write a single element, the entire warp would write less than 128B, wasting part of the memory transaction. Instead, each thread should pack 2 or 4 input elements into a register and write that (see the packing sketch after this list).
Type size vs Memory transaction size II - For type sizes such that lcm(4, sizeof(T)) > 4, it's not quite clear what to do. How well does the compiler/the GPU handle writes when each lane writes more than 4 bytes? I wonder.
Slack due to the reading of multiple elements at a time - If each thread wishes to read 2 or 4 elements for each write, and write 4-byte integers - we might have 1 or 2 elements at the beginning and the end of the input which must be handled separately.
Slack due to input address mis-alignment - The input is read in 32B transactions (under reasonable assumptions); we thus have to handle the first elements, up to a multiple of 32B, and the last elements (after the last such multiple) differently.
Slack due to output address mis-alignment - The output is written in transactions of up to 128B (or is it just 32B?); we thus have to handle the first elements, up to a multiple of this number, and the last elements (after the last such multiple) differently.
Whether or not T is trivially-copy-constructible. But let's assume that it is.
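Regarding the "Type size vs Memory transaction size I" factor, the kind of packing I have in mind for sizeof(T) == 2 is roughly the following (a sketch that assumes an even length and 4-byte-aligned pointers; the leftover elements would need separate handling):
#include <cstddef>

__device__ void copy_u16_packed(unsigned short*       __restrict__ destination,
                                const unsigned short* __restrict__ source,
                                std::size_t length)
{
    // Reinterpret as 4-byte words so each lane issues a full-word store.
    unsigned int*       dst32 = reinterpret_cast<unsigned int*>(destination);
    const unsigned int* src32 = reinterpret_cast<const unsigned int*>(source);
    unsigned lane = threadIdx.x % warpSize;
    for (std::size_t i = lane; i < length / 2; i += warpSize)
        dst32[i] = src32[i];
}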
But it could be that I'm missing some considerations, or that some of the above are redundant.
Factors I've been wondering about:
The block size (i.e. how many other warps are there)
The compute capability (given that it's at least 3)
Whether the source/target is in shared memory / constant memory
Choice of caching mode

Memory addresses, pointers, variables, values - what goes on behind the scenes

This is going to be a pretty loaded question but ever since I started learning about pointers I've been very curious about what happens behind the scenes when a program is run.
As far as I know, computer memory is commonly thought of as a long strip of memory divided evenly into individual bytes; the typical diagrams certainly evoke such a metaphor.
One thing I've been wondering: what do the memory addresses themselves represent? I'm sure it's no coincidence that memory addresses appear as 8-digit hexadecimal values (e.g. 00EB5748). Why is this?
Furthermore, when I declare a variable x, what is happening at the memory level? Is the compiler simply reserving a random address (+however many consecutive addresses it needs for the variable type) for data storage?
Now suppose x is an unsigned int that occupies 2 bytes of memory (i.e. values ranging from 0 to 65535). When I declare x = 12, what is happening? What is it that I'm making equal to 12? When I draw conceptual diagrams, I usually have a box for an address (say &x) pointing to a variable (x) that occupies seemingly nothing, and I'm sure that can't be a fully accurate picture of what's going on.
And what's happening at the binary level? Is the address 00EB5748 treated as 111010110101011101001000 and storing a value of 12 somewhere, or 1100?
Most of my confusion and curiosity stems from the relationship between memory addresses and the actual values being declared (e.g. 12, 'a', -355.2). As another example, suppose our address 00EB5748 is pointing to a char 's' whose value is 115 according to ASCII charts. Is the address describing a position that stores the value 115 in 1 byte, by flipping the appropriate 1s and 0s at that position in memory?
Just open any book. You will see pages. Every page has a number, and consecutive pages are numbered with consecutive numbers. Is there any confusion with numbered pages? I think not. Then you should not be confused by computer memory.
Books were the main information-storage devices before the computer era. Computer memory borrows its basic concepts from books: a book has pages -> computer memory has memory cells; a book has page numbers -> computer memory has memory addresses.
One thing I've been wondering, what do the memory addresses themselves represent?
Numbers. Every memory cell has a number, just like every page in a book.
Furthermore, when I declare a variable x, what is happening at the memory level? Is the compiler simply reserving a random address (+however many consecutive addresses it needs for the variable type) for data storage?
The memory manager marks some memory cells as occupied and tells the compiler the address of the first reserved cell. The compiler associates the variable's name and type with this address. (This picture is from my head; it may be inaccurate.)
When I declare x = 12, what is happening?
When you declared the variable x, memory cells were reserved for it. Now you write 12 into those cells. Note that 12 is binary-encoded in some way, depending on the type of x. If x is an unsigned int occupying 2 memory cells, then one cell will contain 0 and the other will contain 12, because the binary integer representation of 12 is
0000 0000  0000 1100
|_______|  |_______|
  cell       cell
If 12 were a floating-point number, it would be encoded in a different way.
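A tiny C program illustrates the same idea; the exact address it prints will vary from run to run, and the byte order shown assumes a little-endian machine:
#include <stdio.h>

int main(void)
{
    unsigned short x = 12;                 /* a 2-byte variable          */
    unsigned char *p = (unsigned char *)&x;

    printf("address of x: %p\n", (void *)&x);
    printf("byte 0: %u, byte 1: %u\n", p[0], p[1]);  /* 12 and 0 on a
                                                         little-endian CPU */
    return 0;
}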
A memory address is simply the position of a given byte in memory. The zeroth byte is at 0x00000000. The tenth at 0x0000000A. The 65535th at 0x0000FFFF. And so on.
Local variables live on the stack*. When compiling a block of code, the compiler counts how many bytes are needed to hold all the local variables, and then increments the stack pointer so that all the variables can fit below it (along with some other stuff like frame pointers and return addresses and whatnot). Then it just remembers that, for example, local variable x is at an offset -2 from the stack pointer, foo is at an offset -4 and so on, and uses those addresses whenever those variables are referenced in the following code.
Since the compiler knows that x is at address (stack pointer - 2), that's the location that is set to the value 12 when you do x = 12.
Not entirely sure if I understand this question, but say you want to read the memory at address 0x00EB5748. The control unit in the CPU reads the instruction, sees that it is a load instruction, and passes the address (in binary of course) to the load/store unit, along with some other junk like how many bytes to read. Then the LSU sends that address to some memory (probably L1 cache), and after a certain time gets the value 12 back. Then this data is available to, say, put in a register, or send to the ALU to do arithmetic, or whatever.
That seems to be accurate, yes. Going back to the first question, an address simply means "byte number 0xWHATEVER in memory".
Hope this clarified things a bit at least.
*I should probably explain the stack as well. A stack is a portion of memory reserved for local variables (and some other stuff). It starts at a fixed location in memory, and stops at the memory address contained in a special register called the stack pointer. To begin with, the stack is empty, so the stack pointer just contains the start of the stack. As you put more data on the stack, the SP is incremented. This means that you can always put more data on it simply by putting it at the address in the SP, and then incrementing the SP so that once again anything past that address is free memory.
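If you want to see this for yourself, printing the addresses of a couple of locals shows them sitting close together in the current stack frame (the exact offsets and growth direction are up to the compiler and ABI):
#include <stdio.h>

int main(void)
{
    int x   = 12;   /* both locals live in this function's stack frame */
    int foo = 34;

    printf("&x   = %p\n", (void *)&x);
    printf("&foo = %p\n", (void *)&foo);
    return 0;
}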

How are the digits in ObjC method type encoding calculated?

This is a follow-up to my previous question:
What are the digits in an ObjC method type encoding string?
Say there is an encoding:
v24#0:4:8#12B16#20
How are those numbers calculated? B is a char so it should occupy just 1 byte (not 4 bytes). Does it have something to do with "alignment"? What is the size of void?
Is it correct to calculate the numbers as follows? Ask sizeof on every item and round up the result to multiple of 4? And the first number becomes the sum of all the other ones?
The numbers were used in the m68K days to denote stack layout. That is, you could literally decode the method signature and, for just about all types, know exactly which bytes at what offset within the stack frame you could diddle to get/set arguments.
This worked because the m68K's ABI was entirely stack-based argument/return passing [IIRC -- it's been a long, long time]. Nothing was shoved into registers across call boundaries.
However, as Objective-C was ported to other platforms, always-on-the-stack was no longer the calling convention. Arguments and return values are often passed in registers.
Thus, those offsets are now useless. As well, the type encoding used by the compiler is no longer complete (because it never was terribly useful) and there will be types that won't be encoded. Not to mention that encoding some C++ templatized types yields method type encoding strings that can be many kilobytes in size (I think the record I ran into was around 30K of type information).
So, no, it isn't correct to use sizeof() to generate the numbers because they are effectively meaningless to everything. The only reason why they still exist is for binary compatibility; there are bits of esoteric code here and there that still parse the type encoding string with the expectation that there will be random numbers sprinkled here and there.
Note that there are vestiges of API in the ObjC runtime that still lead one to believe that it might be possible to encode/decode stack frames on the fly. It really isn't as the C ABI doesn't guarantee that argument registers will be preserved across call boundaries in the face of optimization. You'd have to drop to assembly and things get ugly really really fast (>shudder<).
The full encoding string is constructed (in clang) by the method ASTContext::getObjCEncodingForMethodDecl, which you can find in lib/AST/ASTContext.cpp.
The method that does the size rounding is ASTContext::getObjCEncodingTypeSize, in the same file. It forces each size to be at least the size of an int. On all of Apple's current platforms, an int is 4 bytes.
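As a rough illustration of that rounding (not how clang actually builds the string), applying "pad each argument to at least sizeof(int)" to the arguments of "v24#0:4:8#12B16#20" on a 32-bit platform reproduces the offsets and the leading 24:
#include <stdio.h>

int main(void)
{
    /* Raw sizeof() of the encoded arguments, assuming 4-byte pointers:
       Class, SEL, SEL, Class, BOOL, Class.                             */
    size_t raw[] = { 4, 4, 4, 4, 1, 4 };
    size_t n = sizeof raw / sizeof raw[0];

    size_t offset = 0;
    for (size_t i = 0; i < n; ++i) {
        printf("argument %zu at offset %zu\n", i, offset);
        /* each argument is padded up to at least sizeof(int) == 4 */
        offset += raw[i] < sizeof(int) ? sizeof(int) : raw[i];
    }
    printf("total argument size: %zu\n", offset);   /* prints 24 */
    return 0;
}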
The stack frame size and argument offsets are calculated by the compiler. I'm actually trying to track this down in the Clang source myself this week; it possibly has something to do with CodeGenTypes::arrangeObjCMessageSendSignature. (Looks like Rob just made my life a lot easier!)
The first number is the sum of the others, yes -- it's the total space occupied by the arguments. To get the size of the type represented by an ObjC type encoding in your code, you should use NSGetSizeAndAlignment().

Where does the limitation of 10^15 in D.J. Bernstein's 'primegen' program come from?

At http://cr.yp.to/primegen.html you can find the sources of a program that uses Atkin's sieve to generate primes. As the author says that it may take a few months to answer an e-mail sent to him (I understand that; he sure is a busy man!), I'm posting this question here.
The page states that 'primegen can generate primes up to 1000000000000000'. I am trying to understand why that is. There is of course a limit around 2^64 ~ 2 * 10^19 (the size of a long unsigned int), because this is how the numbers are represented. I know for sure that if there were a huge prime gap (> 2^31), then the printing of numbers would fail; however, I don't think there is such a prime gap in this range.
Either the author overestimated the bound (and really it is around 10^19) or there is a place in the source code where the arithmetic operation can overflow or something like that.
The funny thing is that you actually MAY run it for numbers > 10^15:
./primes 10000000000000000 10000000000000100
10000000000000061
10000000000000069
10000000000000079
10000000000000099
and if you believe Wolfram Alpha, it is correct.
Some facts I had "reverse-engineered":
numbers are sifted in batches of 1,920 * PRIMEGEN_WORDS = 3,932,160 numbers (see primegen_fill function in primegen_next.c)
PRIMEGEN_WORDS controls how big a single sifting is - you can adjust it in primegen_impl.h to fit your CPU cache,
the implementation of the sieve itself is in primegen.c file - I assume it is correct; what you get is a bitmask of primes in pg->buf (see primegen_fill function)
The bitmask is analyzed and primes are stored in pg->p array.
I see no point where the overflow may happen.
I wish I was on my computer to look, but I suspect you would have different success if you started at 1 as your lower bound.
Just from the algorithm, I would conclude that the upper bound comes from 32-bit numbers.
The page mentions a Pentium III as the CPU, so my guess is that it is very old and does not use 64-bit arithmetic.
2^32 is approximately 4 * 10^9. The Sieve of Atkin (which the algorithm uses) requires about N^(1/2) bits (it uses a big bitfield), which means that with roughly 2^32 of memory you can (conservatively) reach N of about 10^15. As this number is a rough, conservative upper bound (the system and other programs occupy memory, address ranges are reserved for I/O, ...), the real upper bound is/might be higher.

What are the most hardcore optimisations you've seen?

I'm not talking about algorithmic stuff (e.g. use quicksort instead of bubble sort), and I'm not talking about simple things like loop unrolling.
I'm talking about the hardcore stuff. Things like Tiny Teensy ELF, The Story of Mel, practically everything in the demoscene, and so on.
I once wrote a brute force RC5 key search that processed two keys at a time, the first key used the integer pipeline, the second key used the SSE pipelines and the two were interleaved at the instruction level. This was then coupled with a supervisor program that ran an instance of the code on each core in the system. In total, the code ran about 25 times faster than a naive C version.
In one (here unnamed) video game engine I worked with, they had rewritten the model-export tool (the thing that turns a Maya mesh into something the game loads) so that instead of just emitting data, it would actually emit the exact stream of microinstructions that would be necessary to render that particular model. It used a genetic algorithm to find the one that would run in the minimum number of cycles. That is to say, the data format for a given model was actually a perfectly-optimized subroutine for rendering just that model. So, drawing a mesh to the screen meant loading it into memory and branching into it.
(This wasn't for a PC, but for a console that had a vector unit separate and parallel to the CPU.)
In the early days of DOS, when we used floppy discs for all data transport, there were viruses as well. One common way for viruses to infect different computers was to copy a virus bootloader into the boot sector of an inserted floppy disc. When the user inserted the floppy disc into another computer and rebooted without remembering to remove it, the virus was run and infected the hard drive's boot sector, thus permanently infecting the host PC. A particularly annoying virus I was infected by was called "Form". To battle it, I wrote a custom floppy boot sector with the following features:
Validate the boot sector of the host hard drive and make sure it was not infected.
Validate the floppy boot sector and make sure that it was not infected.
Code to remove the virus from the hard drive if it was infected.
Code to duplicate the antivirus boot sector to another floppy if a special key was pressed.
Code to boot the hard drive if all was well and no infections were found.
This was done in the program space of a boot sector, about 440 bytes :)
The biggest problem for my mates was the very cryptic messages displayed because I needed all the space for code. It was like "FFVD RM?", which meant "FindForm Virus Detected, Remove?"
I was quite happy with that piece of code. The optimization was program size, not speed. Two quite different optimizations in assembly.
My favorite is the floating-point inverse square root via integer operations. This is a cool little hack that exploits how floating-point values are stored, and it can execute faster (even computing 1/result afterwards is faster than the stock-standard square root function) or produce more accurate results than the standard methods.
In C/C++ the code is (sourced from Wikipedia):
float InvSqrt (float x)
{
    float xhalf = 0.5f*x;
    int i = *(int*)&x;           // reinterpret the float's bits as an int
    i = 0x5f3759df - (i>>1);     // Now this is what you call a real magic number
    x = *(float*)&i;             // back to float
    x = x*(1.5f - xhalf*x*x);    // one Newton-Raphson refinement step
    return x;
}
A Very Biological Optimisation
Quick background: Triplets of DNA nucleotides (A, C, G and T) encode amino acids, which are joined into proteins, which are what make up most of most living things.
Ordinarily, each different protein requires a separate sequence of DNA triplets (its "gene") to encode its amino acids -- so e.g. 3 proteins of lengths 30, 40, and 50 would require 90 + 120 + 150 = 360 nucleotides in total. However, in viruses, space is at a premium -- so some viruses overlap the DNA sequences for different genes, using the fact that there are 6 possible "reading frames" for DNA-to-protein translation (namely, starting from a position congruent to 0, 1, or 2 modulo 3, plus the same three again reading the sequence in reverse).
For comparison: Try writing an x86 assembly language program where the 300-byte function doFoo() begins at offset 0x1000... and another 200-byte function doBar() starts at offset 0x1001! (I propose a name for this competition: Are you smarter than Hepatitis B?)
That's hardcore space optimisation!
UPDATE: Links to further info:
Reading Frames on Wikipedia suggests Hepatitis B and "Barley Yellow Dwarf" virus (a plant virus) both overlap reading frames.
Hepatitis B genome info on Wikipedia. Seems that different reading-frame subunits produce different variations of a surface protein.
Or you could google for "overlapping reading frames"
Seems this can even happen in mammals! Extensively overlapping reading frames in a second mammalian gene is a 2001 scientific paper by Marilyn Kozak that talks about a "second" gene in rat with "extensive overlapping reading frames". (This is quite surprising as mammals have a genome structure that provides ample room for separate genes for separate proteins.) Haven't read beyond the abstract myself.
I wrote a tile-based game engine for the Apple IIgs in 65816 assembly language a few years ago. This was a fairly slow machine and programming "on the metal" is a virtual requirement for coaxing out acceptable performance.
In order to quickly update the graphics screen one has to map the stack to the screen in order to use some special instructions that allow one to update 4 screen pixels in only 5 machine cycles. This is nothing particularly fantastic and is described in detail in IIgs Tech Note #70. The hard-core bit was how I had to organize the code to make it flexible enough to be a general-purpose library while still maintaining maximum speed.
I decomposed the graphics screen into scan lines and created a 246 byte code buffer to insert the specialized 65816 opcodes. The 246 bytes are needed because each scan line of the graphics screen is 80 words wide and 1 additional word is required on each end for smooth scrolling. The Push Effective Address (PEA) instruction takes up 3 bytes, so 3 * (80 + 1 + 1) = 246 bytes.
The graphics screen is rendered by jumping to an address within the 246-byte code buffer that corresponds to the right edge of the screen and patching a BRanch Always (BRA) instruction into the code at the word immediately following the left-most word. The BRA instruction takes a signed 8-bit offset as its argument, so it just barely has the range to jump out of the code buffer.
Even this isn't too terribly difficult, but the real hard-core optimization comes in here. My graphics engine actually supported two independent background layers and animated tiles by using different 3-byte code sequences depending on the mode:
Background 1 uses a Push Effective Address (PEA) instruction
Background 2 uses a Load Indirect Indexed (LDA ($00),y) instruction followed by a push (PHA)
Animated tiles use a Load Direct Page Indexed (LDA $00,x) instruction followed by a push (PHA)
The critical restriction is that both of the 65816 index registers (X and Y) are used to reference data and cannot be modified. Further, the direct page register (D) is set based on the origin of the second background and cannot be changed; the data bank register is set to the data bank that holds pixel data for the second background and cannot be changed; and the stack pointer (S) is mapped to the graphics screen, so there is no possibility of jumping to a subroutine and returning.
Given these restrictions, I had the need to quickly handle cases where a word that is about to be pushed onto the stack is mixed, i.e. half comes from Background 1 and half from Background 2. My solution was to trade memory for speed. Because all of the normal registers were in use, I only had the Program Counter (PC) register to work with. My solution was the following:
Define a code fragment to do the blend in the same 64K program bank as the code buffer
Create a copy of this code for each of the 82 words
There is a 1-1 correspondence, so the return from the code fragment can be a hard-coded address
Done! We have a hard-coded subroutine that does not affect the CPU registers.
Here are the actual code fragments:
code_buff:    PEA $0000        ; rightmost word (16-bits = 4 pixels)
              PEA $0000        ; background 1
              PEA $0000        ; background 1
              PEA $0000        ; background 1
              LDA (72),y       ; background 2
              PHA
              LDA (70),y       ; background 2
              PHA
              JMP word_68      ; mix the data
word_68_rtn:  PEA $0000        ; more background 1
              ...
              PEA $0000
              BRA *+40         ; patched exit code
              ...
word_68:      LDA (68),y       ; load data for background 2
              AND #$00FF       ; mask
              ORA #$AB00       ; blend with data from background 1
              PHA
              JMP word_68_rtn  ; jump back
word_66:      LDA (66),y
              ...
The end result was a near-optimal blitter that has minimal overhead and cranks out more than 15 frames per second at 320x200 on a 2.5 MHz CPU with a 1 MB/s memory bus.
Michael Abrash's "Zen of Assembly Language" had some nifty stuff, though I admit I don't recall specifics off the top of my head.
Actually it seems like everything Abrash wrote had some nifty optimization stuff in it.
The Stalin Scheme compiler is pretty crazy in that aspect.
I once saw a switch statement with a lot of empty cases, a comment at the head of the switch said something along the lines of:
Added case statements that are never hit because the compiler only turns the switch into a jump-table if there are more than N cases
I forget what N was. This was in the source code for Windows that was leaked in 2004.
I've gone through the Intel (or AMD) architecture references to see what instructions there are. movsx -- move with sign extension -- is awesome for moving small signed values into big spaces, for example, in one instruction.
Likewise, if you know you only use 16-bit values but you can access all of EAX, EBX, ECX, EDX, etc., then you have 8 very fast locations for values -- just rotate the registers by 16 bits to access the other values.
The EFF DES cracker, which used custom-built hardware to generate candidate keys (the hardware they made could prove a key isn't the solution, but could not prove a key was the solution), which were then tested with more conventional code.
The FSG 2.0 packer, made by a Polish team, specifically made for packing executables written in assembly. If packing assembly isn't impressive enough (it's supposed to be about as small as it can get already), the loader it comes with is 158 bytes and fully functional. If you try packing any assembly-made .exe with something like UPX, it will throw a NotCompressableException at you ;)