Does GWT serialize java.lang.Longs efficiently? - serialization

I'm sending object IDs back and forth from client to server through the GWT RPC mechanism. The ids are coming out of the datastore as Longs (8 bytes). I think all of my ids will only need 4 bytes, but something random could happen that gives me a 5-byte (or whatever) value.
Is GWT going to be smart about packing these values in some variable-length encoding that will save space on average? Can I specify that it do so somewhere? Or should I write my own code to copy the Longs to ints and watch out for those exceptional situations?
Thanks~

As stated in the GWT documentation:
long: JavaScript has no 64-bit integral type, so long needs special consideration. Prior to GWT 1.5, the long type was simply mapped to the integral range of a 64-bit JavaScript floating-point value, giving long variables an actual range of less than the full 64 bits. As of GWT 1.5, long primitives are emulated as a pair of 32-bit integers, and work reliably over the entire 64-bit range. Overflow is emulated to match the expected behavior. There are a couple of caveats. Heavy use of long operations will have a performance impact due to the underlying emulation. Additionally, long primitives cannot be used in JSNI code because they are not a native JavaScript numeric type.
If your ids can fit in an Integer, you may be better off with that. Otherwise, if you're using a DTO, make the ids a double, which actually exists in JavaScript.
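If you do decide to copy the Longs to ints yourself, a minimal sketch of the copy-with-a-range-check the question asks about might look like this (the class and method names here are made up for illustration):

    // Hypothetical helper: copy a Long id into an int for the wire,
    // failing loudly on the rare value that does not fit in 32 bits.
    public final class IdCodec {

        public static int toIntId(Long id) {
            long value = id.longValue();
            if (value < Integer.MIN_VALUE || value > Integer.MAX_VALUE) {
                // The "something random" case: refuse rather than silently truncate.
                throw new IllegalArgumentException("id does not fit in 32 bits: " + value);
            }
            return (int) value;
        }

        public static Long toLongId(int wireId) {
            return Long.valueOf(wireId);
        }
    }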

GWT uses gzip compression for responses with a payload of 256 bytes or greater. That should work well if you have a lot of zero bytes in your response.
From RemoteServiceServlet.shouldCompressResponse:
Determines whether the response to a given servlet request should or should not be GZIP compressed. This method is only called in cases where the requester accepts GZIP encoding. This implementation currently returns true if the response string's estimated byte length is longer than 256 bytes. Subclasses can override this logic.
So, the server first checks if the requester (the browser, usually) accepts GZIP encoding. Internally, java.util.zip.GZIPOutputStream is used - see RPCServerUtils. On the client side, it's the browser's job to decompress the gzipped payload - since this is done in native code, it should be fairly quick.
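If you want different behaviour, the javadoc quoted above says subclasses can override that method; a rough sketch, assuming the protected shouldCompressResponse(HttpServletRequest, HttpServletResponse, String) override point described above (check the RemoteServiceServlet in your GWT version before relying on it; MyService/MyServiceImpl are placeholder names):

    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;
    import com.google.gwt.user.server.rpc.RemoteServiceServlet;

    public class MyServiceImpl extends RemoteServiceServlet implements MyService {

        @Override
        protected boolean shouldCompressResponse(HttpServletRequest request,
                                                 HttpServletResponse response,
                                                 String responsePayload) {
            // Example: compress every response the client accepts, regardless of size;
            // the default only compresses payloads estimated at 256 bytes or more.
            return true;
        }

        // ... service methods ...
    }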

Related

CoMarshalInterface() - any way to determine the maximum byte length of result beforehand?

I'm working on a project in which two processes communicate via a TCP-based message bus. For efficiency, I'm considering prepending each message with the byte length of the message.
Some messages however convey information about COM objects; namely, process A calls CoMarshalInterface() and submits the resulting bytes to Process B for deserialization.
In order to determine the byte length of my messages without actually serializing them yet, I'm trying to figure out whether there is any way of knowing the specific, or at least the maximum, number of bytes that CoMarshalInterface() would yield, without actually having to call that method yet (at least not at this point in the code).
Would anybody know if there's any way?
I haven't noticed any big variations in data length for the objects I have tested this with, but I'm not quite sure how CoMarshalInterface works internally. Does it depend on some mechanism implemented by each COM object individually, and hence produce output of completely unknown size, or is it safe to assume it would never generate more than XYZ bytes of serialized information?
Thanks!

Is it possible to use one variable to represent numbers of unlimited length in programming? [duplicate]

I've been using C# for three years to make games, and I've played with various simulations where numbers sometimes get big and Int32 is not enough to store the value. Eventually even Int64 became insufficient for my experiments; it took several such fields (actually an array of variable length) and a special property to handle such big numbers correctly. And so I wondered: is there a way to declare a numeric variable with unlimited (unknown beforehand) length so I can relax and let the computer do the math?
We can write any kind of number we like on paper without needing any special kind of paper. We can also type a lot of words in a text file without needing special file system alterations to make it save and load correctly. Isn't there a variable for declaring a who-knows-how-long-it-will-be number in any programming language?
Starting with .NET 4, the .NET framework contains a BigInteger structure, which can handle integers of arbitrary size.
Since your question is language-agnostic, it might be worth mentioning that internally BigInteger stores the value in an array of unsigned integers; see the following SO question for details:
How does the BigInteger store values internally?
BigInteger is immutable, so there is no need to "resize" the array. Arithmetic operations create new instances of BigInteger, with appropriately sized arrays.
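For illustration, the Java counterpart (java.math.BigInteger, also immutable) behaves the same way:

    import java.math.BigInteger;

    public class BigIntegerDemo {
        public static void main(String[] args) {
            // 2^200 overflows every primitive integer type, but BigInteger just grows.
            BigInteger big = BigInteger.valueOf(2).pow(200);

            // Operations return new immutable instances; 'big' itself never changes.
            BigInteger bigger = big.multiply(big).add(BigInteger.ONE);

            System.out.println(big);                 // prints a 61-digit integer
            System.out.println(bigger.bitLength());  // 401 bits for 2^400 + 1
        }
    }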
Most modern dynamic languages, such as Perl6, Tcl8 and Ruby, go one step further by allowing you to store numbers of unlimited size (up to available RAM) in their number types.
Most of these languages don't have separate integer and floating-point types, but rather a single "number" type that automatically gets converted to whatever it needs to be to be stored in RAM. Some, like Perl6, even include complex numbers in their "number" type.
How it's implemented at the machine level is that, by default, numbers are assumed to be integers - so int32 or int64. Numbers are converted to floats or doubles if the result of a calculation or assignment isn't an integer. If an integer grows too large, the interpreter/runtime environment silently converts it to a bigint object/struct (which is simply a big, growable array or linked list of ints).
How it appears to the programmer is that numbers have unlimited size (again, up to available RAM).
Still, there are gotchas with this system (kind of like the 0.1+0.2!=0.3 issue with floats) so you'd still need to be aware of the underlying implementation even if you can ignore it 99.99% of the time.
For example, if at any point your super large number gets converted to a floating-point number (most likely a double in hardware), you'll lose precision. Because that's just how floating-point numbers work. Sometimes you can do it accidentally. In some languages, for example, the power function (like pow() in C) returns a floating-point result, so raising an integer to the power of another integer may silently round the result if it's too large.
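To make that concrete, here is a small Java illustration of the same pitfall (the effect is the same in any language that routes large integers through a double):

    public class PowPrecision {
        public static void main(String[] args) {
            long exact = (1L << 55) + 1;          // 36028797018963969, needs 56 significant bits
            double asDouble = (double) exact;      // a double has only 53 bits of significand

            System.out.println(exact);             // 36028797018963969
            System.out.println((long) asDouble);   // 36028797018963968 - the trailing +1 is lost

            // Math.pow always returns a double, so a large integer result can be
            // silently rounded before you ever cast it back.
            long viaPow = (long) Math.pow(3, 38);
            System.out.println(viaPow);            // may differ from the exact 3^38 in the low digits
        }
    }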
For the most part, it works. And I personally feel that this is the sane way of dealing with numbers. Lots of language designers have apparently come up with this solution independently.
Is it possible to [...] represent numbers of unlimited length [...]?
No.
On existing computers it is not possible to represent unlimited numbers because the machines are finite. Even when using all existing storage it is not possible to store unlimited numbers.
It is possible, though, to store very large numbers. Wikipedia has information on the concept of arbitrary precision integers.
"Unlimited" - no, as Nikolai Ruhe soundly pointed out. "Unknown" - yes, qualified by the first point. :}
A BigInteger type is available in .NET 4.0 and in Java as others point out.
For .NET 2.0+, take a look at IntX.
More generally, languages (or at least a de facto library used with them) typically have some support for arbitrarily long integers, which provides a means of dealing with the "unknown" you describe.
A discussion on the Ubuntu forums somewhat addresses this question more generally and touches on specifics in more languages - some of which provide simpler means of leveraging arbitrarily large integers (e.g. Python and Common Lisp). Personally, the "relax and let the computer do the math" factor was highest for me in Common Lisp years ago: so it may pay to look around broadly for perspective as you seem inclined to do.

Binary Serialization Backend for Orange

Why doesn't the library Orange support a binary serialization backend in addition to its XML? Is it because D currently cannot access/reflect on its binary representation or is it just not prioritized yet? If possible what D language features and/or Phobos modules should I use to realize a binary serialization backend for Orange?
For D2 I guess it should be straightforward, considering we have std.bitmanip, right?
You can check out msgpack-d library which provides binary serialization in MessagePack format. From http://msgpack.org/:
MessagePack is an efficient binary serialization format. It lets you exchange data among multiple languages like JSON but it's faster and smaller. For example, small integers (like flags or error code) are encoded into a single byte, and typical short strings only require an extra byte in addition to the strings themselves.

Will Initialization Vectors grow in size in the future?

I'm currently using AES (256) with CBC mode to encrypt data. I store the initialization vector with the encrypted data. Right now I'm just adding the IV to the beginning of the encrypted data, then on decrypt, reading it in as a hard-coded length of bytes.
If the initialization vector length changes in the future, this method will break.
So my question is:
Will longer AES key sizes in the future = longer IVs? Or, in other words, will the block size of AES change in the future?
If so, what would be the best way of dealing with this? Using the first byte as an indicator of how long the IV is, then reading in that many bytes?
Rijndael does support larger block sizes, but AES is currently fixed at a 128-bit block. It seems relatively unlikely that the larger Rijndael block sizes will be standardized by NIST, since this would effectively be a completely new algorithm, one that hasn't been implemented by anyone. If NIST feels the need for a block cipher with a larger block size, it's likely they would simply run a new contest.
However, what I would recommend is that, rather than the IV length, you include near the start of your message some kind of algorithm identifier (a single byte is all you'll need), which will give you not just the flexibility to handle larger IVs, but also the ability to extend your format in other ways in the future, for instance with a new algorithm. E.g. 0 = AES-256/CBC, 1 = AES-256/GCM, 2 = AES-2.0/CBC, 3 = AES-256/CBC with a special extra header somewhere, etc., etc.
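As a sketch of that framing in Java (the format-id values are made up for illustration, and only AES-256/CBC is actually wired in here):

    import java.nio.ByteBuffer;
    import java.security.SecureRandom;
    import javax.crypto.Cipher;
    import javax.crypto.SecretKey;
    import javax.crypto.spec.IvParameterSpec;

    public final class EnvelopeCrypto {

        // Hypothetical format identifiers - extend as new algorithms are added.
        private static final byte FORMAT_AES256_CBC = 0;

        public static byte[] encrypt(SecretKey key, byte[] plaintext) throws Exception {
            byte[] iv = new byte[16];                        // AES block size, today
            new SecureRandom().nextBytes(iv);

            Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
            cipher.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(iv));
            byte[] ciphertext = cipher.doFinal(plaintext);

            // Layout: [format byte][IV][ciphertext]
            return ByteBuffer.allocate(1 + iv.length + ciphertext.length)
                    .put(FORMAT_AES256_CBC).put(iv).put(ciphertext).array();
        }

        public static byte[] decrypt(SecretKey key, byte[] message) throws Exception {
            ByteBuffer buf = ByteBuffer.wrap(message);
            byte format = buf.get();
            if (format != FORMAT_AES256_CBC) {
                throw new IllegalArgumentException("unknown format: " + format);
            }
            byte[] iv = new byte[16];                        // length implied by the format byte
            buf.get(iv);
            byte[] ciphertext = new byte[buf.remaining()];
            buf.get(ciphertext);

            Cipher cipher = Cipher.getInstance("AES/CBC/PKCS5Padding");
            cipher.init(Cipher.DECRYPT_MODE, key, new IvParameterSpec(iv));
            return cipher.doFinal(ciphertext);
        }
    }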
PS - don't forget to also use a message authentication code, since otherwise you expose yourself to a variety of easy message modification attacks.
The purpose of the initialization vector is to randomize the first block, so that the same data encrypted twice with the same key will not produce the same output.
From an information-theoretic point of view, there are "only" 2^128 distinct IVs for AES, because those are all the possible random values you might XOR with your first block of actual data. So there is never any reason to have an IV larger than the cipher's block size.
Larger block sizes could justify larger IVs. Larger key sizes do not.
A larger block size would mean a different algorithm by definition. So however you tag your data to indicate what algorithm you are using, that is how you will tell what block size (and therefore IV size) to use.
As an alternative solution you could switch to AES-CTR mode. Counter mode requires a Nonce, but the Nonce does not have to be tied to the AES block size. If the AES block size were increased (unlikely, as Jack says) then you could retain the same size Nonce.
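A rough Java sketch of that approach; the 8-byte-nonce / 8-byte-counter split inside the 16-byte counter block is just this example's convention, not something mandated by CTR mode:

    import java.nio.ByteBuffer;
    import javax.crypto.Cipher;
    import javax.crypto.SecretKey;
    import javax.crypto.spec.IvParameterSpec;

    public final class CtrSketch {

        // Build the initial counter block from an 8-byte per-message nonce.
        // The nonce size stays the same even if a future cipher had a larger block;
        // only the amount of counter padding would change.
        // Never reuse a nonce with the same key.
        static byte[] counterBlock(long nonce) {
            return ByteBuffer.allocate(16).putLong(nonce).putLong(0L).array();
        }

        public static byte[] encrypt(SecretKey key, long nonce, byte[] plaintext) throws Exception {
            Cipher cipher = Cipher.getInstance("AES/CTR/NoPadding");
            cipher.init(Cipher.ENCRYPT_MODE, key, new IvParameterSpec(counterBlock(nonce)));
            return cipher.doFinal(plaintext);
        }
    }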

What would be a good (de)compression routine for this scenario

I need a FAST decompression routine, optimized for a restricted-resource environment like an embedded system, for binary (hex) data that has the following characteristics:
Data is 8-bit (byte) oriented (the data bus is 8 bits wide).
Byte values do NOT range uniformly from 0 - 0xFF, but have a Poisson distribution (bell curve) in each dataset.
The dataset is fixed in advance (to be burnt into Flash) and each set is rarely > 1 - 2 MB.
Compression can take as much time as required, but decompression of a byte should take 23 µs in the worst-case scenario with a minimal memory footprint, as it will be done in a restricted-resource environment like an embedded system (3 MHz - 12 MHz core, 2 KB of RAM).
What would be a good decompression routine?
Basic run-length encoding seems too wasteful - I can immediately see that adding a header section to the compressed data that puts unused byte values to use for representing oft-repeated patterns would give phenomenal performance!
And that's with me having only invested a few minutes - surely much better algorithms must already exist from people who love this stuff?
I would like to have some "ready to go" examples to try out on a PC so that I can compare the performance vis-a-vis a basic RLE.
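(For reference, the "basic RLE" baseline mentioned here can be as simple as the following Java sketch; the (count, value) pair format is just this sketch's choice, and it will expand data with few repeats, which is exactly the weakness described above.)

    import java.io.ByteArrayOutputStream;

    // Deliberately naive byte-oriented RLE to serve as the comparison baseline:
    // the output is a sequence of (count, value) pairs, count in 1..255.
    public final class NaiveRle {

        public static byte[] encode(byte[] in) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            int i = 0;
            while (i < in.length) {
                byte value = in[i];
                int run = 1;
                while (i + run < in.length && in[i + run] == value && run < 255) {
                    run++;
                }
                out.write(run);
                out.write(value);
                i += run;
            }
            return out.toByteArray();
        }

        public static byte[] decode(byte[] in) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            for (int i = 0; i < in.length; i += 2) {
                int run = in[i] & 0xFF;      // counts > 127 come back as negative bytes
                byte value = in[i + 1];
                for (int j = 0; j < run; j++) {
                    out.write(value);
                }
            }
            return out.toByteArray();
        }
    }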
The two solutions I use when performance is the only concern:
LZO Has a GPL License.
liblzf Has a BSD License.
miniLZO.tar.gz This is LZO, just repacked into a 'minified' version that is better suited to embedded development.
Both are extremely fast when decompressing. I've found that LZO will create slightly smaller compressed data than liblzf in most cases. You'll need to do your own benchmarks for speeds, but I consider them to be "essentially equal". Both are light-years faster than zlib, though neither compresses as well (as you would expect).
LZO, in particular miniLZO, and liblzf are both excellent for embedded targets.
If you have a preset distribution of values, meaning the probability of each value is fixed over all datasets, you can create a Huffman encoding with fixed codes (the code tree does not have to be embedded in the data).
Depending on the data, I'd try Huffman with fixed codes or LZ77 (see Brian's links).
Well, the main two algorithms that come to mind are Huffman and LZ.
The first basically just creates a dictionary. If you restrict the dictionary's size sufficiently, it should be pretty fast...but don't expect very good compression.
The latter works by adding back-references to repeating portions of output file. This probably would take very little memory to run, except that you would need to either use file i/o to read the back-references or store a chunk of the recently read data in RAM.
I suspect LZ is your best option, if the repeated sections tend to be close to one another. Huffman works by having a dictionary of often repeated elements, as you mentioned.
Since this seems to be audio, I'd look at either differential PCM or ADPCM, or something similar, which will reduce it to 4 bits/sample without much loss in quality.
With the most basic differential PCM implementation, you just store a 4-bit signed difference between the current sample and an accumulator, add that difference to the accumulator, and move to the next sample. If the difference is outside of [-8, 7], you have to clamp the value, and it may take several samples for the accumulator to catch up. Decoding is very fast and uses almost no memory: just add each value to the accumulator and output the accumulator as the next sample.
A small improvement over basic DPCM, to help the accumulator catch up faster when the signal gets louder and higher-pitched, is to use a lookup table to decode the 4-bit values to a larger non-linear range, where they're still 1 apart near zero but increase in larger increments toward the limits. And/or you could reserve one of the values to toggle a multiplier; deciding when to use it is up to the encoder. With these improvements, you can either achieve better quality or get away with 3 bits per sample instead of 4.
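A minimal decoder for the basic scheme described above, again just as a PC-side sketch; the low-nibble-first packing and 8-bit unsigned samples are assumptions of this example, not part of the answer:

    // Minimal 4-bit DPCM decoder: each nibble is a signed delta in [-8, 7]
    // added to a running accumulator, which is emitted as the next sample.
    public final class DpcmDecoder {

        public static byte[] decode(byte[] packedDeltas, int sampleCount, int firstSample) {
            byte[] samples = new byte[sampleCount];
            int acc = firstSample;
            for (int i = 0; i < sampleCount; i++) {
                int nibble = (packedDeltas[i / 2] >> ((i % 2) * 4)) & 0x0F;
                int delta = nibble < 8 ? nibble : nibble - 16;   // sign-extend 4 bits
                acc += delta;
                // Clamp to the 8-bit sample range; the encoder is expected to have
                // clamped its deltas already, so this is just a safety net.
                if (acc < 0) acc = 0;
                if (acc > 255) acc = 255;
                samples[i] = (byte) acc;
            }
            return samples;
        }
    }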
If your device has a non-linear μ-law or A-law ADC, you can get quality comparable to 11-12 bits from 8-bit samples. Or you can probably do it yourself in your decoder. http://en.wikipedia.org/wiki/M-law_algorithm
There might be inexpensive chips out there that already do all this for you, depending on what you're making. I haven't looked into any.
You should try different compression algorithms with either a compression software tool with command line switches or a compression library where you can try out different algorithms.
Use typical data for your application.
Then you know which algorithm is best-fitting for your needs.
I have used zlib in embedded systems for a bootloader that decompresses the application image to RAM on start-up. The licence is nicely permissive, no GPL nonsense. It does make a single malloc call, but in my case I simply replaced this with a stub that returned a pointer to a static block, and a corresponding free() stub. I did this by monitoring its memory allocation usage to get the size right. If your system can support dynamic memory allocation, then it is much simpler.
http://www.zlib.net/