Playing a simulation in fast-forward - optimization

I am working on a simple game where users create structures in a sector, which is a 10x10 grid. Some structures generate resources and some consume resources. The sector itself might contain some resources outside of any structure. The generators and consumers are related. For example, a well might be generating water, a splitter might be consuming that water and making hydrogen and oxygen, while a refinery is consuming hydrogen and oxygen and making rocket fuel, etc.
The rate at which they generate or consume resources can vary by structure - I call this the tick rate. Each time a consumer ticks, it will first attempt to extract those resources from the structures that surround it in the sector. If there are not enough, it will try to get them from the sector's storage. If it is still not enough, the structure will stop. Structures hold the resources that they generate up to some maximum. Once they are full, they will not generate more until some are consumed. If a structure is stopped, it will also not generate more resources, but the resources it already has can still be used by another adjacent structure.
It is not uncommon that there are patterns. For example, if the well is very slow, the splitter will turn off when the well runs out of water, and then the refinery will turn off when the splitter runs out of gases. Then when the well generates again, everything will turn back on.
When the user is playing a sector, I tick the sector continuously at the resolution of shortest tick rate of the sector's structures. This works fine. The pseudo-code looks like this:
const numTicks = Math.floor((Date.now() - lastTickTime) / shortestTickTime);
let currentTickTime = lastTickTime;
for (let i = 0; i < numTicks; i++) {
  currentTickTime += shortestTickTime;

  // check the consumers - all structures that are consumers
  for (const curConsumer of consumers) {
    if (curConsumer.isRunning &&
        (currentTickTime - curConsumer.lastTickTime >= curConsumer.tickRate)) {
      // ... check surrounding structures for resources
      if (curConsumer.stillNeedsResources) {
        // ... check the sector's storage for resources
      }
      if (curConsumer.stillNeedsResources) {
        // ... no resources available
        curConsumer.isRunning = false;
      }
    }
  }

  // check the generators - all structures that are generators
  for (const curGenerator of generators) {
    if (curGenerator.isRunning &&
        (currentTickTime - curGenerator.lastTickTime >= curGenerator.tickRate)) {
      // ... add the generated resources
    }
  }
}
Now I am dealing with the case of a user coming back to a sector after a long absence - say, a few days - when hundreds or thousands of ticks have passed. If I just naively try to play all of the ticks, it can take several seconds or several minutes to complete.
I am wondering if there are any tips or tricks for simulations of this kind for computing the net change without playing each tick. Or, alternately, if there are changes I can make to the simulation to make this easier to compute. Thanks!

Step 1: Convert the raw data into "graph of nodes" form, where each node represents a machine, and producers are at the bottom and consumers are at the top. For example it might look like:
          |
        (fuel)
          |
       Refinery
        /    \
       /      \
(hydrogen)  (oxygen)
       \      /
        \    /
       Splitter
          |
       (water)
          |
         Well
Note: If a machine has an output buffer (or input buffers), then those buffers should be separate nodes. For example, if everything has output buffers it might look like this:
                |
              (fuel)
                |
      Refinery output buffer
                |
              (fuel)
                |
             Refinery
              /    \
             /      \
      (hydrogen)  (oxygen)
           |          |
       Hydrogen     Oxygen
        output      output
        buffer      buffer
           |          |
      (hydrogen)  (oxygen)
             \      /
              \    /
             Splitter
                |
             (water)
                |
       Well output buffer
                |
             (water)
                |
               Well
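To make that concrete, here is a minimal sketch of what the node/edge data might look like (TypeScript-style, with illustrative names - the real representation is up to you):

// Minimal sketch of the "graph of nodes" form. All names are illustrative.
type Resource = "water" | "hydrogen" | "oxygen" | "fuel";

interface GraphNode {
  id: string;
  kind: "machine" | "buffer";
  stored?: number;                              // buffers: current contents
  capacity?: number;                            // buffers: maximum contents
  inputs?: Partial<Record<Resource, number>>;   // machines: units consumed per operation
  outputs?: Partial<Record<Resource, number>>;  // machines: units produced per operation
  tickRate?: number;                            // machines: ticks per operation
}

interface GraphEdge { from: string; to: string; resource: Resource; }

const nodes: GraphNode[] = [
  { id: "well",     kind: "machine", outputs: { water: 1 }, tickRate: 4 },
  { id: "wellBuf",  kind: "buffer",  stored: 0, capacity: 10 },
  { id: "splitter", kind: "machine", inputs: { water: 1 },
    outputs: { hydrogen: 2, oxygen: 1 }, tickRate: 3 },
];

const edges: GraphEdge[] = [
  { from: "well",    to: "wellBuf",  resource: "water" },
  { from: "wellBuf", to: "splitter", resource: "water" },
];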
Step 2: Determine "current steady state average rates" by (initially) working from bottom up (producers to consumers). For example, if a well produces 1 unit of water every 4 ticks then assume it does produce an average of 0.25 water per tick; and if a splitter can convert 1 unit of water into 2 units of hydrogen and 1 unit of oxygen every 3 ticks then that's a max. rate of 0.333 water converted to 0.666 hydrogen and 0.333 oxygen, but you already know the well isn't producing water fast enough and can determine that the splitter will actually consume 0.25 water to produce 0.5 hydrogen and 0.25 oxygen.
Note that if a producer overproduces you will need to backtrack. For example, if a well produces 1 unit of water every 2 ticks then you'd assume it does produce an average of 0.5 water per tick; and if a splitter can convert 1 unit of water into 2 units of hydrogen and 1 unit of oxygen every 3 ticks then you know the well is producing more water than the splitter can consume and have to go back to the well and clamp its output to 0.333 water per tick.
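For a single producer feeding a single consumer, the clamping in Step 2 boils down to something like this (illustrative sketch, all rates in units per tick):

// Step 2 sketch for one producer feeding one consumer (illustrative only).
function steadyStateRates(
  producerRate: number,                    // e.g. well: 1 water / 4 ticks = 0.25
  consumerMaxInRate: number,               // e.g. splitter: 1 water / 3 ticks = 0.333
  outputsPerInput: Record<string, number>  // e.g. { hydrogen: 2, oxygen: 1 }
) {
  // Neither side can run faster than the other allows.
  const actualIn = Math.min(producerRate, consumerMaxInRate);
  const outputs: Record<string, number> = {};
  for (const [res, qty] of Object.entries(outputsPerInput)) {
    outputs[res] = actualIn * qty;
  }
  return { producerRate: actualIn, consumerInRate: actualIn, outputs };
}

// steadyStateRates(0.25, 1 / 3, { hydrogen: 2, oxygen: 1 })
//   -> 0.25 water consumed, 0.5 hydrogen and 0.25 oxygen produced per tick
// steadyStateRates(0.5, 1 / 3, { hydrogen: 2, oxygen: 1 })
//   -> the well is clamped to 0.333 water per tick (the backtracking case)

With branching graphs you propagate the same min() constraints up the graph and backtrack as described above.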
Step 3: Determine when (how many ticks) until the next thing happens that will change the "current steady state average rates". If a resource can become depleted (e.g. the well runs dry) you need to know when that will happen. In the same way, if a "storage vessel" (an output buffer, input buffer, water tank, fuel storage chest, ...) becomes full or empty then you need to know when that will happen too. This is all based on the "current steady state average rates" you have - e.g. if a well's output buffer is empty and gains water (from well) at a rate of 0.5 water per tick and loses water (to splitter) at a rate of 0.33 water per tick; then you can calculate that the quantity it is storing increases at a rate of "0.5 - 0.33 = 0.17 per tick" and (combined with the buffer's capacity) calculate when the output buffer will become full.
Note that "how many ticks until the next thing happens" must also be limited to when you want to stop simulating.
Step 4: Advance time up until the next thing that happens. This mostly means updating the amount of stuff stored in "storage vessels" using the "current steady state average rates" you have; and then modifying any information (e.g. setting a well to "not functioning, ran dry").
Step 5: Repeat the previous steps until you reach the time you want. This is just "if(current_time < stop_time) goto step 2".
Step 6: Update the world with the final state of everything. This is mostly the reverse of step 1 - setting quantities in "storage vessels", marking resources as depleted, etc.
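Pulled together, Steps 2-5 form an event-driven loop along these lines (a sketch only; Sector, recomputeSteadyStateRates, ticksUntilNextEvent, advanceStorage and applyNextEvent are stand-in names for the work described in the steps above, not a finished implementation):

// Sketch of the whole fast-forward loop (Steps 2-5). Helper names are
// hypothetical stand-ins for the work described in the steps above.
type Sector = unknown;
type Rates = unknown;
declare function recomputeSteadyStateRates(sector: Sector): Rates;               // Step 2
declare function ticksUntilNextEvent(sector: Sector, rates: Rates): number;      // Step 3
declare function advanceStorage(sector: Sector, rates: Rates, dt: number): void; // Step 4
declare function applyNextEvent(sector: Sector, rates: Rates): void;             // Step 4

function fastForward(sector: Sector, startTick: number, stopTick: number): void {
  let currentTick = startTick;
  while (currentTick < stopTick) {                        // Step 5
    const rates = recomputeSteadyStateRates(sector);
    const untilEvent = ticksUntilNextEvent(sector, rates);
    const dt = Math.min(untilEvent, stopTick - currentTick);
    advanceStorage(sector, rates, dt);
    if (dt === untilEvent) {
      applyNextEvent(sector, rates);  // e.g. mark a well as "ran dry", a buffer as full
    }
    currentTick += dt;
  }
  // Step 6 (writing the final state back into the world) happens after this.
}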
Notes:
You might need to add "transport" as a type of node. For example, if you have conveyor belts taking water from the well to the splitter then that can be simulated like a "storage vessel" but with lag time (e.g. if the belt was empty and items start being put on it, then it's going to take "length / speed" time before items arrive at the other end).
If you want; you could add "breakdown" to the game (e.g. maybe there's a small chance that a splitter malfunctions and needs to be repaired). This is just an extra thing to take into account at step 3.
Don't forget that you can modify the game design to make it easier. For a start, I would avoid feedback loops (e.g. if the fuel from the reactor is fed into a generator to create power that is consumed by a splitter that ...) because this makes everything significantly harder. I'd also be tempted to avoid "variable speed" things (e.g. miners travelling between an ore field and a drop off point, where the distance the miners travel increases as closer parts of the ore field are consumed so the "average ore per tick" is increasing and never constant).
Don't forget that it's a game - it doesn't have to be 100% perfectly accurate, and only needs to be convincing enough to fool the player. If it's "slightly wrong" (e.g. an output buffer should have 23 items but only has 21 items) it's likely that nobody will ever notice.
Depending on other details; you might (or might not) consider stopping early and switching to "simulate one tick at a time". For example, if you need to simulate 12345600 ticks, then you could use the approach I've described for the first 12345500 ticks, then use the approach you already have to do the last 100 ticks. This can help to make some things (e.g. position of items on a conveyor belt) seem a lot more realistic.
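In code, that split is trivial (sketch, reusing the fastForward and Sector from the sketch above; tickOnce stands in for your existing per-tick loop):

// Hybrid approach sketch: fast-forward the bulk of the gap analytically,
// then replay the last few ticks with the normal per-tick simulation.
declare function tickOnce(sector: Sector): void;  // your existing per-tick code

const TAIL_TICKS = 100;

function catchUp(sector: Sector, missedTicks: number): void {
  const bulk = Math.max(0, missedTicks - TAIL_TICKS);
  fastForward(sector, 0, bulk);                   // Steps 1-6 above
  for (let i = 0; i < missedTicks - bulk; i++) {
    tickOnce(sector);                             // last ~100 ticks, tick by tick
  }
}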

Related

CorePlot - dynamic x-axis data using two arrays

This is more of an open discussion topic than anything else. Currently I'm storing 50 Float32 values in my NSMutableArray *voltageArray before I refresh my CPTPlot *plot. Every time I obtain 50 values, I remove the previous 50 from the voltageArray and repeat the process....always displaying the 50 values in "real time" on my plot.
However, the data I'm receiving (which is voltage coming from a Cypress BLE module equipped with a pressure transducer) is so quick that any variation (0.4 V to 4.0 V; no pressure to lots of pressure) cannot be seen on my graph. It just shows up as a straight line, varying up and down without showing increased or decreased slopes.
To show overall change, I wanted to take those 50 values, store them in the first index of another NSMutableArray *stampArray and use the index of stampArray to display information. Meanwhile, the numberOfRecordsForPlot: method would look like this:
- (NSUInteger)numberOfRecordsForPlot:(CPTPlot *)plot {
    return (DATA_PER_STAMP * _stampCount);
}
This would initially be 50, then after 50 pieces of data are captured from the BLE module, _stampCount would increase by one, and the number of records for the plot would increase by 50 (until about the 2500-10000 range, then I'd refresh the whole thing and restart the process).
Is this the right approach? How would I be able to make the first 50 points stay on the graph, while building the next 50, etc.? Imagine a y = x^2 graph, and what the graph looks like when applying integration (the whole breaking of the area under the curve into rectangles).
Look at the "Real Time Plot" demo in the Plot Gallery example app included with Core Plot. It starts off with an empty plot, adding a new point each cycle until reaching the maximum number of points. After that, one old point is removed for each new one added so the total number stays constant. The demo uses a timer to pass random data to the plot, but your app can of course collect data from anywhere. Be sure to always interact with the graph from the main thread.
I doubt you'll be able to display 10,000 data points on one plot (does your display have enough pixels to resolve that many points?). If not, you'll get much better drawing performance if you filter and/or smooth the data to remove some of the points before sending them to the plot.
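The bookkeeping behind that rolling window is simple; in language-agnostic pseudocode (TypeScript here purely for illustration, not Core Plot API) it amounts to:

// Rolling-window bookkeeping (illustration only - not Core Plot API).
const MAX_POINTS = 50;
const samples: number[] = [];

function addSample(value: number): void {
  samples.push(value);             // add the new point
  if (samples.length > MAX_POINTS) {
    samples.shift();               // drop the oldest so the count stays constant
  }
  // Then notify the plot that one point was removed at the front and one was
  // appended, and have your data source report samples.length records.
}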

Effective Access Time

I am given the following question:
Now assume the system has no page faults. We are considering adding a TLB that will take 1 nanosecond to look up an address translation. What hit rate (to the nearest 5%) in the TLB is required to reduce the effective access time to memory by a factor of 2.5?
I am told an average memory access takes 100 ns. Since there are 4 memory accesses for a 3-level page table (3 for the page table walk, 1 for physical memory), I deduced that it takes 400 ns.
I am then asked to reduce that by a factor of 2.5, so (2/5) * 400 = 160 ns.
My goal EAT is 160 ns. I started setting up the problem but I can't figure out where to go from here.
I am given the following solution but I am just unable to follow it:
Access time = 100x + (1 - x) * 400    (100 ns for a hit (read memory), 400 ns for a miss)
160 = 100x + 400 - 400x  ->  x = 0.8  ->  80% TLB hit rate
Can someone explain to me how they got to this step? I thought EAT was p * (time it takes for a hit) + (1 - p) * (time it takes for a miss), where p is the hit rate. Isn't the time it takes for a hit 300 ns, and the time it takes for a miss 400 ns?
From that logic I tried p(300) + (1 - p)(400), but when I went to compute it, I did not end up with the same setup as the solution. Can someone explain where my logic is going wrong? Am I wrong about how many memory accesses a hit takes?

measuring time between two rising edges in beaglebone

I am reading a sensor's output as a square wave (0-5 volts) on an oscilloscope. Now I want to measure the frequency of one period with a Beaglebone, so I should measure the time between two rising edges. However, I don't have any experience working with the Beaglebone. Can you give some advice or sample code for measuring the time between rising edges?
How deterministic do you need this to be? If you can tolerate some inaccuracy, you can probably do it on the main Linux OS; if you want to be fancy pants, this seems like a potential use case for the BBB's PRUs (which I unfortunately haven't used, so take this with substantial amounts of salt). I would expect you'd be able to write PRU code that just sits in an infinite outer loop; inside that loop, it loops until it sees the pin read 0, then loops until the pin reads 1 (this is the first rising edge), then counts until either the pin reads 0 again (this would be the falling edge) or, with another loop, until the next rising edge. Either way, you can take the counter value and directly convert that into time (the PRU is stated as having a fixed execution time per instruction, and runs at 200 MHz, i.e. 5 ns per instruction). Assuming your loop is something like
#starting with pin low
inner loop 1:
registerX = loadPin
increment counter
jump if zero registerX to inner loop 1
# pin is now high
inner loop 2:
registerX = loadPin
increment counter
jump if one registerX to inner loop 2
# pin is now low again
That should take 3 instructions per counter increment, so you can get the time as 3 * counter * 5 ns.
As suggested by Foon in his answer, the PRUs are a good fit for this task (although depending on your requirements it may be fine to use the ARM processor and standard GPIO). Please note that (as far as I know) both the regular GPIOs and the PRU inputs are based on 3.3V logic, and connecting a 5V signal might fry your board! You will need an additional component or circuit to convert from 5V to 3.3V.
I've written a basic example that measures timing between rising edges on the header pin P8.15 for my own purpose of measuring an engine's rpm. If you decide to use it, you should check the timing results against a known reference. It's about right but I haven't checked it carefully at all. It is implemented using PRU assembly and uses the pypruss python module to simplify interfacing.

Software memory testing for bus failures

I have a board with quite a few flash chips; some of them are showing intermittent failures. Standard memory tests are not showing any specific problem addresses, other than that certain chips are failing intermittently under mechanical and thermal stress.
Suspecting the actual connections and not the flash cells themselves, I'm looking for a way to test the parallel bus for address or data pin errors.
There are some memory tests but they apply better to RAM than to flash memory (http://www.ganssle.com/testingram.htm). Specifically, the parallel flash requires a sequence of bus writes to write each value; a write/verify failure could easily be in the write operation, which could involve any pin on the bus.
Ideas welcome...
The typical memory tests are there to do exactly that. I prefer a pseudo-randomizer (deterministic, using an LFSR) to the 0xAA, 0x55, 0xFF, 0x00 tests. This allows for an address bus test as well as a data bus test in two passes (repeat inverted). I say typical in the sense of wiggling the data bits and address bits through both states each and varying the states of signals and their neighbors. As for pounding on a RAM to create thermal or other stresses - well, you can't write very fast to a flash, so you can't really do fast write/read cycles.
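For the pseudo-random pattern, a deterministic LFSR like this works (sketch; the tap mask is a commonly used maximal-length choice, and write8 is a stand-in for whatever your bus write routine is):

// Deterministic pseudo-random pattern from a 32-bit Galois LFSR (sketch).
declare function write8(addr: number, value: number): void;  // your bus write

function lfsrNext(state: number): number {
  const lsb = state & 1;
  state >>>= 1;
  if (lsb) state ^= 0x80200003;   // taps 32,22,2,1 - a maximal-length mask
  return state >>> 0;
}

// Fill the part with the sequence, then re-seed with the same value and
// read back/compare; repeat with inverted values for the second pass.
let s = 0xDEADBEEF;
for (let addr = 0; addr < 0x400000; addr++) {   // example: 4 MB part
  s = lfsrNext(s);
  write8(addr, s & 0xff);
}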
Flash creates another problem, and that is that writing and then immediately reading back isn't that interesting; you want to write, then read back later - hours, days, weeks - to determine if the part is actually holding data.
When you say thermal or mechanical stress, do you mean it only fails during the time it is above X degrees, or do you mean that due to the thermal stress it is broken all the time after the event? Likewise with mechanical: does the part fail while vibrating or under mechanical stress but work again when relieved of that stress, or has the mechanical stress done permanent damage that can be detected whether under stress or not?
Now, although you can't do fast write/read cycles, you can punish a flash by reading heavily. I have seen read-disturb problems caused by constant reading of one block or location. It is not necessarily something you have time to do for every location, but you might fill the part with a pseudo-random pattern and concentrate on one location for a while (minutes, tens of minutes); if you have a part that you know is bad, see if this accelerates the detection of the problem, and whether any location will fail or only certain ones. Another thing is to read all the locations repetitively for hours/days, or leave it sitting for hours/days/weeks and then do a read pass without an erase or write and see if it has lost anything.
Unfortunately, as you probably know, each new failure case takes its own research project and the development of a new test.
The first step in testing a memory is a data bus test. In this test, the data bus wiring is checked to confirm that the value placed on the data bus by the processor is correctly received by the memory device at the other end. An obvious way to test it is to write all possible data values and verify each one, but each bit can be tested independently instead: to perform a walking-1s test, write the first value in the walking-1s table (00000001, 00000010, 00000100, ... for an 8-bit bus), verify it by reading it back, write the second value, verify, and so on. When you reach the end of the table, the test is complete.
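In code the walking-1s pass is just this (sketch; busWrite/busRead stand in for your bus access routines, and on flash the write will of course be the part's program sequence rather than a plain store):

// Walking-1s data bus test (sketch). Returns the first failing pattern, or
// null if every data line checks out.
declare function busWrite(addr: number, value: number): void;
declare function busRead(addr: number): number;

function walkingOnesTest(addr: number, busWidth = 8): number | null {
  for (let bit = 0; bit < busWidth; bit++) {
    const pattern = 1 << bit;          // 00000001, 00000010, 00000100, ...
    busWrite(addr, pattern);
    if (busRead(addr) !== pattern) {
      return pattern;                  // this data line (or a neighbour) is suspect
    }
  }
  return null;
}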
In the linked article Jack Ganssle says: "Critical to this [test], and every other RAM test algorithm, is that you write the pattern to all of RAM before doing the read test."
Since reading should be isolated from writing, testing the flash is easier. Perform the writing portion of the tests while the system is not under stress. Then perform the reading portion with the system under stress. By recording the address, expected value, and actual value in enough error cases, you should be able to determine the source of the errors.
If the system never fails when doing the above, you can then perform the whole tests while under stress. Any errors that appear are most likely write errors.
I've decided to design a memory pattern that I think I can deduce both data and address errors from. The idea is to use values that differ significantly from each other as key indicators of possible read errors, and to detect a failure on one pin at a time.
The test will read alternately from only the bottom and top addresses (0x000000 and 0x3FFFFF - my chip has 22 address lines). In those locations I will put 0xFF and 0x00 respectively (byte wide). The idea is to flip all address and data lines and see what happens. (All other values in the flash have at least 3 bits different from 0x00 and 0xFF.)
There are 44 addresses that a single pin failure could send me to in error. In each of those addresses I put one of 22 values to represent which of the 22 address pins was flipped. Each is 2 bits different from the others, and 3 bits different from 00 and FF. (I tried for 3 bits different from each other, but with 8 bits I could only get 14 values.)
07,0B,0D,0E,16,1A,1C,1F,25,29,2C,
2F,34,38,3D,3E,43,49,4A,4F,52,58
In the remaining addresses I put a nice pattern of six values - 33, 55, 66, 99, AA, CC (each 3 bits different from all the other values): value(address) = nicePattern[(sum of bits set in address) % 6].
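A sketch of that value function (my reconstruction of the scheme described above, so treat the details, including which marker goes with which pin, as illustrative):

// Sketch of the pattern described above (a reconstruction, not the author's
// actual code). 0x000000 gets 0xFF, 0x3FFFFF gets 0x00, the 44 "single pin
// flip" addresses get one of the 22 marker values, and everything else gets
// the six-value "nice pattern" keyed off the popcount of the address.
const TOP = 0x3FFFFF;                       // 22 address lines
const markers = [
  0x07, 0x0B, 0x0D, 0x0E, 0x16, 0x1A, 0x1C, 0x1F, 0x25, 0x29, 0x2C,
  0x2F, 0x34, 0x38, 0x3D, 0x3E, 0x43, 0x49, 0x4A, 0x4F, 0x52, 0x58,
];
const nicePattern = [0x33, 0x55, 0x66, 0x99, 0xAA, 0xCC];

function popcount(n: number): number {
  let c = 0;
  for (; n; n &= n - 1) c++;
  return c;
}

function patternValue(address: number): number {
  if (address === 0x000000) return 0xFF;
  if (address === TOP) return 0x00;
  // A single flipped address pin turns 0x000000 or 0x3FFFFF into an address
  // with exactly one bit different; mark those with that pin's marker value.
  for (let pin = 0; pin < 22; pin++) {
    if (address === (1 << pin) || address === (TOP ^ (1 << pin))) {
      return markers[pin];
    }
  }
  return nicePattern[popcount(address) % 6];
}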
I tested this and have statistically collected 100s of intermittent failure incidents synchronized to the mechanical stress.
single bit errors detectable
double bit errors deducible (Explainable by a combination of frequent single bit errors)
3 or more bit errors (generally inconclusive)
Even though some of the chips had 3 failing pins, 70% of the incidents were single bit (they usually didn't fail at the same time)
The testing group is now using this to identify which specific connections are failing.

What are the most hardcore optimisations you've seen?

I'm not talking about algorithmic stuff (eg use quicksort instead of bubblesort), and I'm not talking about simple things like loop unrolling.
I'm talking about the hardcore stuff. Like Tiny Teensy ELF, The Story of Mel; practically everything in the demoscene, and so on.
I once wrote a brute force RC5 key search that processed two keys at a time, the first key used the integer pipeline, the second key used the SSE pipelines and the two were interleaved at the instruction level. This was then coupled with a supervisor program that ran an instance of the code on each core in the system. In total, the code ran about 25 times faster than a naive C version.
In one (here unnamed) video game engine I worked with, they had rewritten the model-export tool (the thing that turns a Maya mesh into something the game loads) so that instead of just emitting data, it would actually emit the exact stream of microinstructions that would be necessary to render that particular model. It used a genetic algorithm to find the one that would run in the minimum number of cycles. That is to say, the data format for a given model was actually a perfectly-optimized subroutine for rendering just that model. So, drawing a mesh to the screen meant loading it into memory and branching into it.
(This wasn't for a PC, but for a console that had a vector unit separate and parallel to the CPU.)
In the early days of DOS when we used floppy discs for all data transport there were viruses as well. One common way for viruses to infect different computers was to copy a virus bootloader into the bootsector of an inserted floppydisc. When the user inserted the floppydisc into another computer and rebooted without remembering to remove the floppy, the virus was run and infected the harddrive bootsector, thus permanently infecting the host PC. A particularly annoying virus I was infected by was called "Form"; to battle this I wrote a custom floppy bootsector that had the following features:
Validate the bootsector of the host harddrive and make sure it was not infected.
Validate the floppy bootsector and make sure that it was not infected.
Code to remove the virus from the harddrive if it was infected.
Code to duplicate the antivirus bootsector to another floppy if a special key was pressed.
Code to boot the harddrive if all was well, and no infections were found.
This was done in the program space of a bootsector, about 440 bytes :)
The biggest problem for my mates was the very cryptic messages displayed because I needed all the space for code. It was like "FFVD RM?", which meant "FindForm Virus Detected, Remove?"
I was quite happy with that piece of code. The optimization was program size, not speed. Two quite different optimizations in assembly.
My favorite is the floating point inverse square root via integer operations. This is a cool little hack on how floating point values are stored and can execute faster (even doing a 1/result is faster than the stock-standard square root function) or produce more accurate results than the standard methods.
In C/C++ the code is (sourced from Wikipedia):
float InvSqrt (float x)
{
    float xhalf = 0.5f*x;
    int i = *(int*)&x;            // reinterpret the float's bits as an integer
    i = 0x5f3759df - (i>>1);      // Now this is what you call a real magic number
    x = *(float*)&i;              // reinterpret the bits back into a float
    x = x*(1.5f - xhalf*x*x);     // one Newton-Raphson step refines the estimate
    return x;
}
A Very Biological Optimisation
Quick background: Triplets of DNA nucleotides (A, C, G and T) encode amino acids, which are joined into proteins, which are what make up most of most living things.
Ordinarily, each different protein requires a separate sequence of DNA triplets (its "gene") to encode its amino acids -- so e.g. 3 proteins of lengths 30, 40, and 50 would require 90 + 120 + 150 = 360 nucleotides in total. However, in viruses, space is at a premium -- so some viruses overlap the DNA sequences for different genes, using the fact that there are 6 possible "reading frames" for DNA-to-protein translation (namely starting from a position divisible by 3, from a position that leaves remainder 1 when divided by 3, or from a position that leaves remainder 2; and the same three again, but reading the sequence in reverse).
For comparison: Try writing an x86 assembly language program where the 300-byte function doFoo() begins at offset 0x1000... and another 200-byte function doBar() starts at offset 0x1001! (I propose a name for this competition: Are you smarter than Hepatitis B?)
That's hardcore space optimisation!
UPDATE: Links to further info:
Reading Frames on Wikipedia suggests Hepatitis B and "Barley Yellow Dwarf" virus (a plant virus) both overlap reading frames.
Hepatitis B genome info on Wikipedia. Seems that different reading-frame subunits produce different variations of a surface protein.
Or you could google for "overlapping reading frames"
Seems this can even happen in mammals! Extensively overlapping reading frames in a second mammalian gene is a 2001 scientific paper by Marilyn Kozak that talks about a "second" gene in rat with "extensive overlapping reading frames". (This is quite surprising as mammals have a genome structure that provides ample room for separate genes for separate proteins.) Haven't read beyond the abstract myself.
I wrote a tile-based game engine for the Apple IIgs in 65816 assembly language a few years ago. This was a fairly slow machine and programming "on the metal" is a virtual requirement for coaxing out acceptable performance.
In order to quickly update the graphics screen one has to map the stack to the screen in order to use some special instructions that allow one to update 4 screen pixels in only 5 machine cycles. This is nothing particularly fantastic and is described in detail in IIgs Tech Note #70. The hard-core bit was how I had to organize the code to make it flexible enough to be a general-purpose library while still maintaining maximum speed.
I decomposed the graphics screen into scan lines and created a 246 byte code buffer to insert the specialized 65816 opcodes. The 246 bytes are needed because each scan line of the graphics screen is 80 words wide and 1 additional word is required on each end for smooth scrolling. The Push Effective Address (PEA) instruction takes up 3 bytes, so 3 * (80 + 1 + 1) = 246 bytes.
The graphics screen is rendered by jumping to an address within the 246 byte code buffer that corresponds to the right edge of the screen and patching a BRanch Always (BRA) instruction into the code at the word immediately following the left-most word. The BRA instruction takes a signed 8-bit offset as its argument, so it just barely has the range to jump out of the code buffer.
Even this isn't too terribly difficult, but the real hard-core optimization comes in here. My graphics engine actually supported two independent background layers and animated tiles by using different 3-byte code sequences depending on the mode:
Background 1 uses a Push Effective Address (PEA) instruction
Background 2 uses a Load Indirect Indexed (LDA ($00),y) instruction followed by a push (PHA)
Animated tiles use a Load Direct Page Indexed (LDA $00,x) instruction followed by a push (PHA)
The critical restriction is that both of the 65816 registers (X and Y) are used to reference data and cannot be modified. Further the direct page register (D) is set based on the origin of the second background and cannot be changed; the data bank register is set to the data bank that holds pixel data for the second background and cannot be changed; the stack pointer (S) is mapped to graphics screen, so there is no possibility of jumping to a subroutine and returning.
Given these restrictions, I had the need to quickly handle cases where a word that is about to be pushed onto the stack is mixed, i.e. half comes from Background 1 and half from Background 2. My solution was to trade memory for speed. Because all of the normal registers were in use, I only had the Program Counter (PC) register to work with. My solution was the following:
Define a code fragment to do the blend in the same 64K program bank as the code buffer
Create a copy of this code for each of the 82 words
There is a 1-1 correspondence, so the return from the code fragment can be a hard-coded address
Done! We have a hard-coded subroutine that does not affect the CPU registers.
Here are the actual code fragments:
code_buff: PEA $0000 ; rightmost word (16-bits = 4 pixels)
PEA $0000 ; background 1
PEA $0000 ; background 1
PEA $0000 ; background 1
LDA (72),y ; background 2
PHA
LDA (70),y ; background 2
PHA
JMP word_68 ; mix the data
word_68_rtn: PEA $0000 ; more background 1
...
PEA $0000
BRA *+40 ; patched exit code
...
word_68: LDA (68),y ; load data for background 2
AND #$00FF ; mask
ORA #$AB00 ; blend with data from background 1
PHA
JMP word_68_rtn ; jump back
word_66: LDA (66),y
...
The end result was a near-optimal blitter that has minimal overhead and cranks out more than 15 frames per second at 320x200 on a 2.5 MHz CPU with a 1 MB/s memory bus.
Michael Abrash's "Zen of Assembly Language" had some nifty stuff, though I admit I don't recall specifics off the top of my head.
Actually it seems like everything Abrash wrote had some nifty optimization stuff in it.
The Stalin Scheme compiler is pretty crazy in that aspect.
I once saw a switch statement with a lot of empty cases, a comment at the head of the switch said something along the lines of:
Added case statements that are never hit because the compiler only turns the switch into a jump-table if there are more than N cases
I forget what N was. This was in the source code for Windows that was leaked in 2004.
I've gone to the Intel (or AMD) architecture references to see what instructions there are. movsx - move with sign extension is awesome for moving little signed values into big spaces, for example, in one instruction.
Likewise, if you know you only use 16-bit values but you can access all of EAX, EBX, ECX, EDX, etc., then you have 8 very fast locations for values - just rotate the registers by 16 bits to access the other values.
The EFF DES cracker, which used custom-built hardware to generate candidate keys (the hardware they made could prove a key wasn't the solution, but could not prove a key was the solution); those candidates were then tested with more conventional code.
The FSG 2.0 packer, made by a Polish team, was designed specifically for packing executables written in assembly. If packing assembly isn't impressive enough (it's supposed to be almost as small as possible already), the loader it comes with is 158 bytes and fully functional. If you try packing any assembly-made .exe with something like UPX, it will throw a NotCompressableException at you ;)