How does gzip compression rate change when streaming data? - apache

I've been trying to understand the gzip algorithm, in the context of HTTP streaming connections (SSE, and the various comet technologies). I tested some alternative representations of data, with these filesizes:
40 csv.txt
63 json-short.txt
80 json-readable.txt
27 rawbin.txt
46 sse.csv.txt
69 sse.json-short.txt
86 sse.json-readable.txt
When compressed with gzip -9v, I get:
csv.txt: 25.0%
json-readable.txt: 16.2%
json-short.txt: 20.6%
rawbin.txt: 18.5%
sse.csv.txt: 25.0%
sse.json-readable.txt: 15.1%
sse.json-short.txt: 18.8%
Those are not very good compression rates, and they were also the reverse of what I expected: the more verbose JSON formats appear to compress worse.
My question is: does the compression get better as more and more data is streamed? Does it dynamically and implicitly learn which bits are scaffolding and which bits are variable data? If it is a learning algorithm, is there a point where it stops learning, or is it theoretically always adapting to the data stream? And if so, is extra weight given to more recent data?
I did a crude test by cat-ing 25 copies of sse.json-readable.txt into one file. Gzip then gave me a 95.7% compression ratio. But I describe this as crude for two reasons. First, each line of data was identical, whereas in realistic data the numbers and timestamps will differ slightly and only the scaffolding is the same. Second, gzip is being given a single file: does the gzip algorithm do a pre-scan of the data to learn it, or jump around in the file? If so, those results won't apply to Apache streaming data (as it will already have compressed and sent the first line of data before it sees the second line).
As a secondary question, can I assume time is not a factor? E.g. assuming there is no socket reconnection involved, there might be a 1-second gap between each line of data, or a 60-second gap.
Useful reference on how gzip works: http://www.infinitepartitions.com/art001.html
(By the way, my current understanding is that the compression when streaming will be based solely on an analysis of the first block of data; so I'm wondering if I could get better compression by sending a few lines of dummy data first, to give it a chance to learn better compression?!?)
UPDATE: USEFUL LINKS
Here is the Apache module code:
http://svn.apache.org/repos/asf/httpd/httpd/trunk/modules/filters/mod_deflate.c
The window size of 15 is what gives the 32KB window that Mark Adler mentions in his answer.
Here are some pages that can help understand the Apache code:
http://www.zlib.net/zlib_how.html
http://www.zlib.net/zlib_tech.html
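To make the window-size point concrete, here is a minimal zlib sketch (this is not the actual mod_deflate code; the event strings and buffer size are invented) of one deflate stream being reused across SSE events, with windowBits = 15 giving the 32KB window and Z_SYNC_FLUSH letting each event be sent immediately:

#include <stdio.h>
#include <string.h>
#include <zlib.h>

int main(void)
{
    z_stream strm;
    memset(&strm, 0, sizeof(strm));
    /* windowBits = 15 -> 32KB sliding window; adding 16 asks for a gzip wrapper */
    if (deflateInit2(&strm, 9, Z_DEFLATED, 16 + 15, 8, Z_DEFAULT_STRATEGY) != Z_OK)
        return 1;

    const char *events[] = {
        "data:{\"t\":\"2013-03-29 06:09:03\",\"b\":1.303,\"a\":1.304}\n\n",
        "data:{\"t\":\"2013-03-29 06:09:04\",\"b\":1.304,\"a\":1.305}\n\n",
    };
    unsigned char out[4096];

    for (int i = 0; i < 2; i++) {
        strm.next_in   = (unsigned char *)events[i];
        strm.avail_in  = (uInt)strlen(events[i]);
        strm.next_out  = out;
        strm.avail_out = sizeof(out);
        deflate(&strm, Z_SYNC_FLUSH);   /* flush so this event can go out right away */
        printf("event %d -> %u compressed bytes\n",
               i, (unsigned)(sizeof(out) - strm.avail_out));
        /* the 32KB history is kept between calls, so later events that repeat
           the same scaffolding compress better than the first one */
    }

    deflate(&strm, Z_FINISH);   /* on connection close */
    deflateEnd(&strm);
    return 0;
}

Whether Apache actually flushes after every write depends on how mod_deflate is configured, so treat this only as an illustration of the sliding window persisting across writes.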
Here are the above test files, in case you care:
csv.txt
2013-03-29 03:15:24,EUR/USD,1.303,1.304
json-short.txt
{"t":"2013-03-29 06:09:03","s":"EUR\/USD","b":1.303,"a":1.304}
json-readable.txt
{"timestamp":"2013-03-29 06:09:03","symbol":"EUR\/USD","bid":1.303,"ask":1.304}
sse.csv.txt
data:2013-03-29 03:15:24,EUR/USD,1.303,1.304
sse.json-short.txt
data:{"t":"2013-03-29 06:09:03","s":"EUR\/USD","b":1.303,"a":1.304}
sse.json-readable.txt
data:{"timestamp":"2013-03-29 06:09:03","symbol":"EUR\/USD","bid":1.303,"ask":1.304}
NOTE: the sse.* versions end in two LFs, the others end in one LF.
rawbin.txt was made with this PHP script:
// pack(): "l" = signed 32-bit integer (timestamp), "a7" = 7-byte string, "d" = double
$s=pack("la7dd",time(),"USD/JPY",1.303,1.304);
file_put_contents("rawbin.txt",$s);

gzip uses a sliding window of the last 32K of data in which it searches for matching strings. It will accumulate 16K literals and matching strings, which may go back a few of those windows in order to generate a block with a single set of Huffman codes. That is as far back as gzip looks, and it never "jumps around", but rather just maintains a sliding history that it forgets once the old data drops off the back end.
There is a way with zlib (not with gzip) to provide a "priming" dictionary, which is simply up to 32K of data that can be used for matching strings when compressing the first 32K of the actual data. That is useful, sometimes very useful, for compressing small amounts of data, e.g. much less than 32K. Without that, zlib or gzip will do a poor job compressing short strings. They really need a few times 32K of data to get rolling.
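As an illustration of that priming mechanism, here is a minimal sketch using zlib's deflateSetDictionary() (the helper name and parameter choices are mine, not from the answer; the decompressing side has to prime its inflate stream with the same bytes):

#include <string.h>
#include <zlib.h>

/* Sketch only: prime a raw-deflate stream with up to 32K of "scaffolding"
   that the real messages are expected to resemble, so that even the very
   first message can find back-references. */
static int make_primed_stream(z_stream *strm, const unsigned char *dict,
                              unsigned dict_len)
{
    memset(strm, 0, sizeof(*strm));
    /* negative windowBits = raw deflate; preset dictionaries are a zlib-level
       feature, and the gzip file format has no field for them */
    if (deflateInit2(strm, Z_BEST_COMPRESSION, Z_DEFLATED, -15, 8,
                     Z_DEFAULT_STRATEGY) != Z_OK)
        return -1;
    return deflateSetDictionary(strm, dict, dict_len) == Z_OK ? 0 : -1;
}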
For the extremely short files you are testing with, you are getting expansion, not compression.

Related

Erlang binary protocol serialization

I'm currently using Erlang for a big project, but I have a question regarding the proper way to proceed.
I receive bytes over a TCP socket. The bytes follow a fixed protocol; the sender is a Python client. The Python client uses class inheritance to create the bytes from its objects.
Now I would like to (in Erlang) take the bytes and convert them to their equivalent messages; they all have a common message header.
How can I do this as generically as possible in Erlang?
Kind Regards,
Me
Pattern matching/binary header consumption using Erlang's binary syntax. But you will need to know either exactly what bytes or bits you are expecting to receive, or the field sizes in bytes or bits.
For example, let's say that you are expecting a string of bytes that will either begin with the equivalent of the ASCII strings "PUSH" or "PULL", followed by some other data you will place somewhere. You can create a function head that matches those, and captures the rest to pass on to a function that does "push()" or "pull()" based on the byte header:
operation_type(<<"PUSH", Rest/binary>>) -> push(Rest);
operation_type(<<"PULL", Rest/binary>>) -> pull(Rest).
The bytes after the first four will now be in Rest, leaving you free to interpret whatever subsequent headers or data remain in turn. You could also match on the whole binary:
operation_type(Bin = <<"PUSH", _/binary>>) -> push(Bin);
operation_type(Bin = <<"PULL", _/binary>>) -> pull(Bin).
In this case the "_" variable works like it always does -- you're just checking for the lead, essentially peeking the buffer and passing the whole thing on based on the initial contents.
You could also skip around in it. Say you knew you were going to receive a binary with 4 bytes of fluff at the front, 6 bytes of type data, and then the rest you want to pass on:
filter_thingy(<<_:4/binary, Type:6/binary, Rest/binary>>) ->
    % Do stuff with Rest based on Type...
It becomes very natural to split binaries in function headers (whether the data equates to character strings or not), letting the "Rest" fall through to appropriate functions as you go along. If you are receiving Python pickle data or something similar, you would want to write the parsing routine in a recursive way, so that the conclusion of each data type returns you to the top to determine the next type, with an accumulated tree that represents the data read so far.
I only covered 8-bit bytes above, but there is also a pure bitstring syntax, which lets you go as far into the weeds with bits and bytes as you need with the same ease of syntax. Matching is a real lifesaver here.
Hopefully this informed more than confused. Binary syntax in Erlang makes this the most pleasant binary parsing environment in a general programming language I've yet encountered.
http://www.erlang.org/doc/programming_examples/bit_syntax.html

Trying to understand nbits value from stratum protocol

I'm looking at the stratum protocol and I'm having a problem with the nbits value of the mining.notify method. I have trouble calculating it; I assume it's the currency difficulty.
I pulled a notify from a Dogecoin pool and it returned 1b3cc366, and at the time the difficulty was 1078.52975077.
I'm assuming here that 1b3cc366 should give me 1078.52975077 when converted. But I can't seem to do the conversion right.
I've looked here, here and also tried the .NET function BitConverter.Int64BitsToDouble.
Can someone help me understand what the nbits value signifies?
You are right, nbits is the current network difficulty.
Difficulty encoding is thoroughly described here.
A hexadecimal representation like 0x1b3cc366 consists of two parts:
0x1b -- the number of bytes in the target
0x3cc366 -- the target prefix
This means that a valid hash should be less than 0x3cc366000000000000000000000000000000000000000000000000 (which is exactly 0x1b = 27 bytes long).
The floating-point representation of difficulty shows how much harder the current target is than the one used in the genesis block.
Satoshi decided to use 0x1d00ffff as the difficulty for the genesis block, so the target was
0x00ffff0000000000000000000000000000000000000000000000000000.
And 1078.52975077 is how many times smaller the current target is than the initial one:
$ echo 'ibase=16;FFFF0000000000000000000000000000000000000000000000000000 / 3CC366000000000000000000000000000000000000000000000000' | bc -l
1078.52975077482646448605
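The same calculation as a small C sketch (the function name is mine, and it uses doubles, so the result is only approximate in the last digits):

#include <math.h>
#include <stdint.h>
#include <stdio.h>

/* Convert a compact "nbits" value into a difficulty relative to the
   genesis target encoded by 0x1d00ffff. */
static double nbits_to_difficulty(uint32_t nbits)
{
    uint32_t exponent = nbits >> 24;         /* number of bytes in the target */
    uint32_t mantissa = nbits & 0x00ffffff;  /* target prefix */
    double target  = (double)mantissa * pow(256.0, (double)exponent - 3);
    double genesis = (double)0x00ffff * pow(256.0, (double)0x1d - 3);
    return genesis / target;                 /* how many times smaller the current target is */
}

int main(void)
{
    printf("%.8f\n", nbits_to_difficulty(0x1b3cc366));  /* about 1078.52975077 */
    return 0;
}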

What can cause erroneous or missing index in AVI file?

I've built an AVI M-JPEG encoder which basically builds an AVI RIFF header with all the info.
I'm adding a frame index at the end of the video stream as specified in the specs.
The index is built as follows:
idx1 [Size], then 00dc [0x10,0x00,0x00,0x00] [Offset of frame X] [Size of frame X] until the end. I compared it to other AVI files, and everything is the same, so I can't understand why software doesn't find (or even look for) the index in my AVI file. I also verified several times that each tag is followed by the correct byte length. By the way, each offset has the correct padding, and the length is the size of the JPEG only.
I attached the current rendered file: movie.avi
I spent the whole day trying to figure out what is the problem with my index. AVI spec is really simple, so I'm smashing my head on the desk.
[Edit]
As soon as my video is longer than 1 second, it fails. That makes no sense to me at the moment, as the algorithm is the same no matter how many frames are written.
Your AVI file violates the alignment rule: every chunk must start at an even byte.
Add a zero byte after every odd-length frame, and update the index accordingly. The chunk size stored in the header should still be the true (odd) size of the data, but all offsets should be even.
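A minimal sketch of writing one padded frame chunk (illustrative only, not the questioner's encoder; it also assumes a little-endian machine when writing the size field):

#include <stdint.h>
#include <stdio.h>

/* Write one "00dc" chunk and pad the data to an even length, so the next
   chunk (and eventually the "idx1" chunk) starts on an even offset. The
   size field keeps the real, possibly odd, data length; only the file
   position gets the extra pad byte. */
static void write_frame_chunk(FILE *f, const uint8_t *jpeg, uint32_t size)
{
    fwrite("00dc", 1, 4, f);
    fwrite(&size, 4, 1, f);          /* true JPEG size, little-endian host assumed */
    fwrite(jpeg, 1, size, f);
    if (size & 1) {
        uint8_t pad = 0;
        fwrite(&pad, 1, 1, f);       /* alignment byte, not counted in size */
    }
}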

Analysis of pgpdump output

I have used pgpdump on an encrypted file (via BouncyCastle) to get more information about it and found several lines about partial start, partial continue and partial end.
So I was wondering what exactly this was describing. Is it some sort of fragmentation of plain text?
Furthermore, what does the bit count after the RSA algorithm stand for? In this case it's 1022 bits, but I've seen files with 1023 and 1024 bits.
Partial body lengths are pretty well explained by this tumblr post. OpenPGP messages are composed of packets of a given length. Sometimes, for large outputs (or, in the case of packets from GnuPG, short messages), there will be partial body lengths that specify that another header will show up telling the reader to continue reading. From the post:
A partial body length tells the parser: “I know there are at least N more bytes in this packet. After N more bytes, there will be another header to tell if how many more bytes to read.” The idea being, I guess, that you can encrypt a stream of data as it comes in without having to know when it ends. Maybe you are PGP encrypting a speech, or some off-the-air TV. I don’t know. It can be infinite length — you can just keep throwing more partial body length headers in there, each one can handle up to a gigabyte in length. Every gigabyte it informs the parser: “yeah, there’s more coming!”
So in the case of your screenshot, pgpdump reads 8192 bytes, then encounters another header that says to read another 2048 bytes. After those 2048 bytes, it hits another header for 1037 bytes, and so on and so forth until the last continue header; 489 bytes after that is the end of the message.
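For reference, a small sketch of how those length octets decode under RFC 4880 as I read it (the function and the example value are only illustrative):

#include <stdint.h>
#include <stdio.h>

/* First length octet of a new-format packet:
   0..191   one-octet length, the value itself
   192..223 two-octet length, needs the next octet as well
   224..254 partial body length, a power of two from 1 byte to 1 GB
   255      five-octet length, needs four more octets */
static long partial_body_length(uint8_t octet)
{
    if (octet >= 224 && octet <= 254)
        return 1L << (octet & 0x1F);
    return -1;  /* not a partial body length */
}

int main(void)
{
    printf("%ld\n", partial_body_length(0xED));  /* 0xED & 0x1F = 13 -> 8192 */
    return 0;
}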
The 1022 bits is the length of the public modulus. It is always going to be close to 1024 (if you have a 1024-bit key), but it can end up being slightly shorter than that given the initial selection of the RSA parameters. They are still called "1024-bit keys", though, even though they are slightly shorter than that.

What are the most hardcore optimisations you've seen?

I'm not talking about algorithmic stuff (e.g. use quicksort instead of bubble sort), and I'm not talking about simple things like loop unrolling.
I'm talking about the hardcore stuff. Like Tiny Teensy ELF, The Story of Mel; practically everything in the demoscene, and so on.
I once wrote a brute-force RC5 key search that processed two keys at a time: the first key used the integer pipeline, the second key used the SSE pipelines, and the two were interleaved at the instruction level. This was then coupled with a supervisor program that ran an instance of the code on each core in the system. In total, the code ran about 25 times faster than a naive C version.
In one (here unnamed) video game engine I worked with, they had rewritten the model-export tool (the thing that turns a Maya mesh into something the game loads) so that instead of just emitting data, it would actually emit the exact stream of microinstructions that would be necessary to render that particular model. It used a genetic algorithm to find the one that would run in the minimum number of cycles. That is to say, the data format for a given model was actually a perfectly-optimized subroutine for rendering just that model. So, drawing a mesh to the screen meant loading it into memory and branching into it.
(This wasn't for a PC, but for a console that had a vector unit separate and parallel to the CPU.)
In the early days of DOS, when we used floppy discs for all data transport, there were viruses as well. One common way for viruses to infect different computers was to copy a virus bootloader into the boot sector of an inserted floppy disc. When the user inserted the floppy disc into another computer and rebooted without remembering to remove the floppy, the virus was run and infected the hard drive boot sector, thus permanently infecting the host PC. A particularly annoying virus I was infected by was called "Form". To battle this I wrote a custom floppy boot sector that had the following features:
Validate the boot sector of the host hard drive and make sure it was not infected.
Validate the floppy boot sector and make sure that it was not infected.
Code to remove the virus from the hard drive if it was infected.
Code to duplicate the antivirus boot sector to another floppy if a special key was pressed.
Code to boot the hard drive if all was well and no infections were found.
This was done in the program space of a boot sector, about 440 bytes :)
The biggest problem for my mates was the very cryptic messages displayed because I needed all the space for code. It was like "FFVD RM?", which meant "FindForm Virus Detected, Remove?"
I was quite happy with that piece of code. The optimization was program size, not speed. Two quite different optimizations in assembly.
My favorite is the floating-point inverse square root via integer operations. This is a cool little hack based on how floating-point values are stored, and it can execute faster (even doing a 1/result is faster than the stock-standard square root function) or produce more accurate results than the standard methods.
In C/C++ the code is (sourced from Wikipedia):
float InvSqrt (float x)
{
    float xhalf = 0.5f*x;
    int i = *(int*)&x;
    i = 0x5f3759df - (i>>1); // Now this is what you call a real magic number
    x = *(float*)&i;
    x = x*(1.5f - xhalf*x*x);
    return x;
}
A Very Biological Optimisation
Quick background: Triplets of DNA nucleotides (A, C, G and T) encode amino acids, which are joined into proteins, which are what make up most of most living things.
Ordinarily, each different protein requires a separate sequence of DNA triplets (its "gene") to encode its amino acids -- so e.g. 3 proteins of lengths 30, 40, and 50 would require 90 + 120 + 150 = 360 nucleotides in total. However, in viruses, space is at a premium -- so some viruses overlap the DNA sequences for different genes, using the fact that there are 6 possible "reading frames" to use for DNA-to-protein translation (namely starting from a position that is divisible by 3, from a position that leaves remainder 1 when divided by 3, or from a position that leaves remainder 2 when divided by 3; and the same three again, but reading the sequence in reverse).
For comparison: Try writing an x86 assembly language program where the 300-byte function doFoo() begins at offset 0x1000... and another 200-byte function doBar() starts at offset 0x1001! (I propose a name for this competition: Are you smarter than Hepatitis B?)
That's hardcore space optimisation!
UPDATE: Links to further info:
Reading Frames on Wikipedia suggests Hepatitis B and "Barley Yellow Dwarf" virus (a plant virus) both overlap reading frames.
Hepatitis B genome info on Wikipedia. Seems that different reading-frame subunits produce different variations of a surface protein.
Or you could google for "overlapping reading frames"
Seems this can even happen in mammals! Extensively overlapping reading frames in a second mammalian gene is a 2001 scientific paper by Marilyn Kozak that talks about a "second" gene in rat with "extensive overlapping reading frames". (This is quite surprising as mammals have a genome structure that provides ample room for separate genes for separate proteins.) Haven't read beyond the abstract myself.
I wrote a tile-based game engine for the Apple IIgs in 65816 assembly language a few years ago. This was a fairly slow machine and programming "on the metal" is a virtual requirement for coaxing out acceptable performance.
In order to quickly update the graphics screen one has to map the stack to the screen in order to use some special instructions that allow one to update 4 screen pixels in only 5 machine cycles. This is nothing particularly fantastic and is described in detail in IIgs Tech Note #70. The hard-core bit was how I had to organize the code to make it flexible enough to be a general-purpose library while still maintaining maximum speed.
I decomposed the graphics screen into scan lines and created a 246 byte code buffer to insert the specialized 65816 opcodes. The 246 bytes are needed because each scan line of the graphics screen is 80 words wide and 1 additional word is required on each end for smooth scrolling. The Push Effective Address (PEA) instruction takes up 3 bytes, so 3 * (80 + 1 + 1) = 246 bytes.
The graphics screen is rendered by jumping to an address within the 246-byte code buffer that corresponds to the right edge of the screen and patching a BRanch Always (BRA) instruction into the code at the word immediately following the left-most word. The BRA instruction takes a signed 8-bit offset as its argument, so it just barely has the range to jump out of the code buffer.
Even this isn't too terribly difficult, but the real hard-core optimization comes in here. My graphics engine actually supported two independent background layers and animated tiles by using different 3-byte code sequences depending on the mode:
Background 1 uses a Push Effective Address (PEA) instruction
Background 2 uses a Load Indirect Indexed (LDA ($00),y) instruction followed by a push (PHA)
Animated tiles use a Load Direct Page Indexed (LDA $00,x) instruction followed by a push (PHA)
The critical restriction is that both of the 65816 registers (X and Y) are used to reference data and cannot be modified. Further the direct page register (D) is set based on the origin of the second background and cannot be changed; the data bank register is set to the data bank that holds pixel data for the second background and cannot be changed; the stack pointer (S) is mapped to graphics screen, so there is no possibility of jumping to a subroutine and returning.
Given these restrictions, I had the need to quickly handle cases where a word that is about to be pushed onto the stack is mixed, i.e. half comes from Background 1 and half from Background 2. My solution was to trade memory for speed. Because all of the normal registers were in use, I only had the Program Counter (PC) register to work with. My solution was the following:
Define a code fragment to do the blend in the same 64K program bank as the code buffer
Create a copy of this code for each of the 82 words
There is a 1-1 correspondence, so the return from the code fragment can be a hard-coded address
Done! We have a hard-coded subroutine that does not affect the CPU registers.
Here are the actual code fragments:
code_buff: PEA $0000 ; rightmost word (16-bits = 4 pixels)
PEA $0000 ; background 1
PEA $0000 ; background 1
PEA $0000 ; background 1
LDA (72),y ; background 2
PHA
LDA (70),y ; background 2
PHA
JMP word_68 ; mix the data
word_68_rtn: PEA $0000 ; more background 1
...
PEA $0000
BRA *+40 ; patched exit code
...
word_68: LDA (68),y ; load data for background 2
AND #$00FF ; mask
ORA #$AB00 ; blend with data from background 1
PHA
JMP word_68_rtn ; jump back
word_66: LDA (66),y
...
The end result was a near-optimal blitter that has minimal overhead and cranks out more than 15 frames per second at 320x200 on a 2.5 MHz CPU with a 1 MB/s memory bus.
Michael Abrash's "Zen of Assembly Language" had some nifty stuff, though I admit I don't recall specifics off the top of my head.
Actually it seems like everything Abrash wrote had some nifty optimization stuff in it.
The Stalin Scheme compiler is pretty crazy in that aspect.
I once saw a switch statement with a lot of empty cases, a comment at the head of the switch said something along the lines of:
Added case statements that are never hit because the compiler only turns the switch into a jump-table if there are more than N cases
I forget what N was. This was in the source code for Windows that was leaked in 2004.
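A toy C illustration of the trick described above (the handler names, the case count, and the threshold are all invented; the real N was whatever that compiler used):

static int handle_a(void) { return 1; }
static int handle_b(void) { return 2; }
static int handle_c(void) { return 3; }

int dispatch(int msg)
{
    switch (msg) {
    case 0: return handle_a();
    case 1: return handle_b();
    case 2: return handle_c();
    /* never hit: present only to push the case count past the compiler's
       jump-table threshold */
    case 3: case 4: case 5: case 6: case 7:
        break;
    }
    return 0;
}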
I've gone to the Intel (or AMD) architecture references to see what instructions there are. movsx (move with sign extension) is awesome for moving small signed values into big spaces, for example, in one instruction.
Likewise, if you know you only use 16-bit values but you can access all of EAX, EBX, ECX, EDX, etc., then you have 8 very fast locations for values: just rotate the registers by 16 bits to access the other values.
The EFF DES cracker, which used custom-built hardware to generate candidate keys (the hardware they made could prove a key wasn't the solution, but could not prove a key was the solution), which were then tested with more conventional code.
The FSG 2.0 packer, made by a Polish team, specifically made for packing executables written in assembly. If packing assembly isn't impressive enough (it's supposed to be almost as small as possible already), the loader it comes with is 158 bytes and fully functional. If you try packing an assembly-made .exe with something like UPX, it will throw a NotCompressableException at you ;)