Convert compress functions from Python to Kotlin - kotlin

I have functions in Python for compression and decompression of a string (bytearray):
import zlib

def compress_message(data):
    compressor = zlib.compressobj(-1, zlib.DEFLATED, 31, 8, zlib.Z_DEFAULT_STRATEGY)
    compressed_data = compressor.compress(data)
    compressed_data += compressor.flush()
    return compressed_data

def decompress_message(compressed_data):
    return zlib.decompress(compressed_data, wbits=31)
I need to convert these functions to Kotlin so I can use them in my mobile app. So far I have tried this:
import java.util.zip.Deflater

fun String.zlibCompress(): ByteArray {
    val input = this.toByteArray(charset("UTF-8"))
    val output = ByteArray(input.size * 4)
    val compressor = Deflater().apply {
        setLevel(-1)
        setInput(input)
        finish()
    }
    val compressedDataLength: Int = compressor.deflate(output)
    return output.copyOfRange(0, compressedDataLength)
}
However, it gives totally different results, for example for the string abcdefghijklmnouprstuvwxyz:
Python: 1f8b080000000000000a4b4c4a4e494d4bcfc8cccacec9cdcb2f2d282a2e292d2bafa8ac0200c197b2d21a000000
Kotlin: 789c4b4c4a4e494d4bcfc8cccacec9cdcb2f2d282a2e292d2bafa8ac020090b30b24
Is there any way I can modify the Kotlin code so that it gives the same result as Python?
Thanks for your replies. <3

The 31 parameter in the Python code (a window size of 15, plus 16 to select the gzip wrapper) is requesting a gzip stream, not a zlib stream. In Kotlin, you are generating a zlib stream. (zlib is described in RFC 1950, gzip in RFC 1952.)
It does not appear that Java's Deflater class has that option. It does have a nowrap option that gives raw deflate compressed data, around which you can construct your own gzip wrapper, using the RFC to see how.
By the way, the results are not "totally different". You have gzip and zlib wrappers around exactly the same raw deflate compressed data: 4b4c...0200.
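If byte-for-byte parity with CPython's output is not required, a simpler route is to let java.util.zip handle the gzip framing. This is only a sketch under my own naming (gzipCompress/gzipDecompress are not from the question); the header it emits can differ from Python's in a few bytes (OS field, modification time, compression flags), but it is a valid RFC 1952 gzip stream that the Python decompress_message above will read:

import java.io.ByteArrayOutputStream
import java.util.zip.GZIPInputStream
import java.util.zip.GZIPOutputStream

fun String.gzipCompress(): ByteArray {
    val bytes = this.toByteArray(Charsets.UTF_8)
    val bos = ByteArrayOutputStream()
    // GZIPOutputStream writes the gzip header and trailer around the deflate data for us
    GZIPOutputStream(bos).use { it.write(bytes) }
    return bos.toByteArray()
}

fun ByteArray.gzipDecompress(): String =
    GZIPInputStream(this.inputStream()).bufferedReader(Charsets.UTF_8).use { it.readText() }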

Related

Falcon - Difference in stream type between unittests and actual API on post

I'm trying to write unittests for my falcon api, and I encountered a really weird issue when I tried reading the body I added to the unittests.
This is my unittest:
class TestDetectionApi(DetectionApiSetUp):
    def test_valid_detection(self):
        headers = {"Content-Type": "application/x-www-form-urlencoded"}
        body = {'test': 'test'}
        detection_result = self.simulate_post(
            '/environments/e6ce2a50-f68f-4a7a-8562-ca50822b805d/detectionEvaluations',
            body=urlencode(body), headers=headers)
        self.assertEqual(detection_result.json, None)
and this is the part in my API that reads the body:
def _get_request_body(request: falcon.Request) -> dict:
    request_stream = request.stream.read()
    request_body = json.loads(request_stream)
    validate(request_body, REQUEST_VALIDATION_SCHEMA)
    return request_body
Now for the weird part: my function for reading the body works without any issue when I run the API, but when I run the unittests the stream type seems to be different, which affects how it is read.
The stream type when running the API is gunicorn.http.body.Body, while in the unittests it is wsgiref.validate.InputWrapper.
So when reading the body from the API all I need to do is request.stream.read(), but when using the unittests I need to do request.stream.input.read(), which is pretty annoying since I would have to change my original code to handle both cases, and I don't want to do that.
Is there a way to fix this issue? Thanks!!
It seems like the issue was with how I read it. Instead of using stream I used bounded_stream, which seemed to work; I also removed the headers and just encoded my body.
my unittest:
class TestDetectionApi(DetectionApiSetUp):
    def test_valid_detection(self):
        body = '''{"test": "test"}'''
        detection_result = self.simulate_post(
            '/environments/e6ce2a50-f68f-4a7a-8562-ca50822b805d/detectionEvaluations',
            body=body.encode())
        self.assertEqual(detection_result.json, None)
how I read it:
def _get_request_body(request: falcon.Request) -> dict:
    request_stream = request.bounded_stream.read()
    request_body = json.loads(request_stream)
    validate(request_body, REQUEST_VALIDATION_SCHEMA)
    return request_body

Convert html content to PDF Byte Array with kotlin

val sanitizedHTML = Jsoup.clean(html, whitelist)
val textRenderer = ITextRenderer()
val outputStream = ByteArrayOutputStream()
textRenderer.setDocumentFromString(sanitizedHTML)
textRenderer.layout()
textRenderer.createPDF(outputStream)
textRenderer.finishPDF()
return Base64.getDecoder().decode(outputStream.toByteArray())
I would like to generate a PDF from HTML content and, rather than saving it as a file, upload it to a server which expects a ByteArray.
I tried to do the above using Jsoup to clean the HTML and ITextRenderer to generate the PDF, but I keep receiving an error about an invalid Base64 character 25. Could someone point out what I am doing wrong here?
return Base64.getDecoder().decode(outputStream.toByteArray())
This was incorrect; if I remove the Base64 decoding it works well. The output stream already contains raw PDF bytes (a PDF starts with "%PDF", and '%' is 0x25, which is not a valid Base64 character), so there is nothing to decode.
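For reference, here is a corrected sketch that mirrors the calls from the question and simply returns the raw bytes. The function name is mine, and the imports assume the Flying Saucer (org.xhtmlrenderer) and jsoup artifacts the snippet appears to use; newer jsoup versions rename Whitelist to Safelist:

import java.io.ByteArrayOutputStream
import org.jsoup.Jsoup
import org.jsoup.safety.Whitelist
import org.xhtmlrenderer.pdf.ITextRenderer

fun htmlToPdfBytes(html: String, whitelist: Whitelist): ByteArray {
    val sanitizedHTML = Jsoup.clean(html, whitelist)
    val renderer = ITextRenderer()
    val outputStream = ByteArrayOutputStream()
    renderer.setDocumentFromString(sanitizedHTML)
    renderer.layout()
    renderer.createPDF(outputStream)
    renderer.finishPDF()
    // outputStream already holds the finished PDF; return the bytes as-is for upload
    return outputStream.toByteArray()
}

If the upload endpoint really does expect Base64 text rather than raw bytes, the fix would be to encode rather than decode: Base64.getEncoder().encode(outputStream.toByteArray()).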

Uncompressing a Gzip format?

I am facing a problem with Gzip uncompressing.
The situation is like this: I have some text in UTF-8 format. This text is compressed using the gzdeflate() function in PHP and then stored as a BLOB in MySQL.
I tried to retrieve the BLOB and then used Java's GZIP stream to uncompress it, but it throws an error saying that it is not in GZIP format.
I even used Inflater in Java to do the same but now I get "DataFormatException:incorrect header check". The code for the inflater is as below.
//rs is the resultset
//blobAsBytes is the byte array
while (rs.next()) {
    blob = rs.getBlob("old_text");
    int blobLength = (int) blob.length();
    blobAsBytes = blob.getBytes(1, blobLength);
}

Inflater decompresser = new Inflater();
decompresser.setInput(blobAsBytes);
byte[] result = new byte[100];
int resultLength = decompresser.inflate(result); // this is the line where the exception is occurring
decompresser.end();

// Decode the bytes into a String
String outputString = new String(result, 0, resultLength, "UTF-8");
System.out.println(outputString);
I have to do this using Java and get all the text back that is stored in the database.
Can someone please help me with this.
Use gzencode(), not gzdeflate(). The latter does not produce the gzip format, it produces the deflate format. The former does produce the gzip format. The PHP functions are horribly and confusingly named.
Alternatively, use the java.util.zip.Inflater class with nowrap true in the Inflater constructor. That will decode raw deflate data on the Java end.
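A minimal sketch of the nowrap route on the JVM (written in Kotlin here to match the rest of this page; the helper name and the growing-output handling are mine, not from the answer):

import java.io.ByteArrayOutputStream
import java.util.zip.Inflater

// Inflater(nowrap = true) expects raw deflate data, which is what PHP's gzdeflate() produces.
fun inflateRaw(blobAsBytes: ByteArray): String {
    val inflater = Inflater(true)
    inflater.setInput(blobAsBytes)
    val out = ByteArrayOutputStream()
    val buf = ByteArray(4096)
    while (!inflater.finished()) {
        val n = inflater.inflate(buf)
        if (n == 0 && inflater.needsInput()) break // truncated or malformed input; avoid spinning
        out.write(buf, 0, n)
    }
    inflater.end()
    return out.toString("UTF-8")
}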

Play Framework 2, Rest Services and gzip decompression

I'm facing what seems to be a charset issue in Play when decompressing gzip content from REST services. When I try to run the code snippet below, an error is thrown, saying "Malformed JSON. Illegal character ((CTRL-CHAR, code 31))":
val url: String = "https://api.stackexchange.com/2.0/info?site=stackoverflow"
Async {
  WS.url(url)
    .withHeaders("Accept-Encoding" -> "gzip, deflate")
    .get()
    .map { response =>
      Ok("Response: " + (response.json \ "items"))
    }
}
At first I thought it would be a problem in the StackExchange API itself, but I tried a similar service, which uses gzip compression as well, and the same error happens. It's hard to fix the code because I don't even know where the "Illegal character" is. Is there something missing, or is it actually a bug in Play?
The clue I can provide is that the first byte of a gzip stream is 31 (0x1f). So you probably need to do something else to cause the gzip stream to be decompressed.
By the way, I recommend that you not accept deflate encoding, just gzip.
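To illustrate the point about byte 31: a gzip stream starts with the two magic bytes 0x1f 0x8b, so if those appear at the start of the response body, the client has not decompressed it. The Play-specific fix is in the next answer; purely as an illustration (a Kotlin sketch with a name of my own choosing), manually gunzipping raw body bytes looks like this:

import java.util.zip.GZIPInputStream

fun gunzipIfNeeded(body: ByteArray): ByteArray {
    // 0x1f 0x8b is the gzip magic number (RFC 1952)
    val isGzip = body.size >= 2 && body[0] == 0x1f.toByte() && body[1] == 0x8b.toByte()
    return if (isGzip) GZIPInputStream(body.inputStream()).use { it.readBytes() } else body
}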
Here is how it can be done with Play 2.3
// set Http compression: https://www.playframework.com/documentation/2.3.x/ScalaWS
val clientConfig = new DefaultWSClientConfig()
val secureDefaults: AsyncHttpClientConfig = new NingAsyncHttpClientConfigBuilder(clientConfig).build()
val builder = new AsyncHttpClientConfig.Builder(secureDefaults)
builder.setCompressionEnabled(true)
val secureDefaultsWithSpecificOptions: AsyncHttpClientConfig = builder.build()
implicit val implicitClient = new NingWSClient(secureDefaultsWithSpecificOptions)
val response = WS.clientUrl("http://host/endpoint/item").withHeaders(("Accept-Encoding", "gzip")).get()

Performant Entity Serialization: BSON vs MessagePack (vs JSON)

Recently I've found MessagePack, an alternative binary serialization format to Google's Protocol Buffers and JSON which also outperforms both.
Also there's the BSON serialization format that is used by MongoDB for storing data.
Can somebody elaborate on the differences and the advantages/disadvantages of BSON vs MessagePack?
Just to complete the list of performant binary serialization formats: there are also Gobs, which are intended as a successor to Google's Protocol Buffers. However, in contrast to all the other formats mentioned, Gobs are not language-agnostic and rely on Go's built-in reflection, although there are Gobs libraries for at least one language other than Go.
// Please note that I'm the author of MessagePack. This answer may be biased.
Format design
Compatibility with JSON
In spite of its name, BSON's compatibility with JSON is not so good compared with MessagePack.
BSON has special types like "ObjectId", "Min key", "UUID" or "MD5" (I think these types are required by MongoDB). These types are not compatible with JSON. That means some type information can be lost when you convert objects from BSON to JSON, but of course only when these special types are in the BSON source. It can be a disadvantage to use both JSON and BSON in a single service.
MessagePack is designed to be transparently converted from/to JSON.
MessagePack is smaller than BSON
MessagePack's format is less verbose than BSON. As a result, MessagePack can serialize objects into smaller payloads than BSON.
For example, a simple map {"a":1, "b":2} is serialized in 7 bytes with MessagePack, while BSON uses 19 bytes.
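A byte-level sketch of where those numbers come from (worked out here from the two format specifications, not part of the original answer):

MessagePack (7 bytes):  82 a1 61 01 a1 62 02
  82 = map with 2 entries, a1 61 = "a", 01 = 1, a1 62 = "b", 02 = 2
BSON (19 bytes):  13 00 00 00  10 61 00 01 00 00 00  10 62 00 02 00 00 00  00
  a 4-byte document length, then per field a type byte (10 = int32), the name as a C string, and a 4-byte value, followed by a trailing 00 terminator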
BSON supports in-place updating
With BSON, you can modify part of a stored object without re-serializing the whole object. Let's suppose a map {"a":1, "b":2} is stored in a file and you want to update the value of "a" from 1 to 2000.
With MessagePack, 1 uses only 1 byte but 2000 uses 3 bytes (it needs a uint16 encoding), so "b" has to be shifted by 2 bytes even though "b" itself is not modified.
With BSON, both 1 and 2000 use 5 bytes, because the value is stored as a fixed-size int32. Thanks to this verbosity, you don't have to move "b".
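Continuing the byte-level sketch from above (again worked out from the format specs, not from the original answer), updating "a" from 1 to 2000 in place:

MessagePack before: 82 a1 61 01 a1 62 02
MessagePack after:  82 a1 61 cd 07 d0 a1 62 02   (2000 needs the 3-byte uint16 form cd 07 d0, so "b" shifts by 2 bytes)
BSON before: 13 00 00 00 10 61 00 01 00 00 00 10 62 00 02 00 00 00 00
BSON after:  13 00 00 00 10 61 00 d0 07 00 00 10 62 00 02 00 00 00 00   (only the fixed-width int32 value bytes change; nothing moves)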
MessagePack has RPC
MessagePack, Protocol Buffers, Thrift and Avro support RPC. But BSON doesn't.
These differences imply that MessagePack was originally designed for network communication, while BSON is designed for storage.
Implementation and API design
MessagePack has type-checking APIs (Java, C++ and D)
MessagePack supports static typing.
Dynamic typing, as used with JSON or BSON, is useful for dynamic languages like Ruby, Python or JavaScript, but it is troublesome for statically typed languages: you have to write boring type-checking code.
MessagePack provides a type-checking API. It converts dynamically-typed objects into statically-typed objects. Here is a simple example (C++):
#include <msgpack.hpp>

class myclass {
private:
    std::string str;
    std::vector<int> vec;
public:
    // This macro enables this class to be serialized/deserialized
    MSGPACK_DEFINE(str, vec);
};

int main(void) {
    // serialize
    myclass m1 = ...;
    msgpack::sbuffer buffer;
    msgpack::pack(&buffer, m1);

    // deserialize
    msgpack::unpacked result;
    msgpack::unpack(&result, buffer.data(), buffer.size());

    // you get dynamically-typed object
    msgpack::object obj = result.get();

    // convert it to statically-typed object
    myclass m2 = obj.as<myclass>();
}
MessagePack has IDL
Related to the type-checking API, MessagePack also supports an IDL. (The specification is available at: http://wiki.msgpack.org/display/MSGPACK/Design+of+IDL)
Protocol Buffers and Thrift require an IDL (they don't support dynamic typing) and provide more mature IDL implementations.
MessagePack has streaming API (Ruby, Python, Java, C++, ...)
MessagePack supports streaming deserializers. This feature is useful for network communication. Here is an example (Ruby):
require 'msgpack'
# write objects to stdout
$stdout.write [1,2,3].to_msgpack
$stdout.write [1,2,3].to_msgpack
# read objects from stdin using streaming deserializer
unpacker = MessagePack::Unpacker.new($stdin)
# use iterator
unpacker.each { |obj|
  p obj
}
I think it's very important to mention that it depends on what your client/server environments look like.
If you are passing bytes multiple times without inspection, such as with a message queue system or streaming log entries to disk, then you may well prefer a binary encoding to emphasize the compact size. Otherwise it's a case-by-case issue with different environments.
Some environments can have very fast serialization and deserialization to/from msgpack/protobuf, others not so much. In general, the lower-level the language/environment, the better binary serialization will work. In higher-level languages (node.js, .Net, JVM) you will often see that JSON serialization is actually faster. The question then becomes: is your network overhead more or less constrained than your memory/CPU?
With regards to msgpack vs BSON vs Protocol Buffers... msgpack produces the fewest bytes of the group, with Protocol Buffers being about the same. BSON defines broader native types than the other two and may be a better match for your object model, but this makes it more verbose. Protocol Buffers have the advantage of being designed to stream... which makes them a more natural format for binary transfer/storage.
Personally, I would lean towards the transparency that JSON offers directly, unless there is a clear need for lighter traffic. Over HTTP with gzipped data, the difference in network overhead is even less of an issue between the formats.
A key difference not yet mentioned is that BSON contains size information in bytes for the entire document and further nested sub-documents.
document ::= int32 e_list "\x00"
This has two major benefits for restricted environments (e.g. embedded) where size and performance are important.
You can immediately check whether the data you're going to parse represents a complete document or whether you'll need to request more at some point (be it from some connection or storage). Since this is most likely an asynchronous operation, you might already send a new request before parsing.
Your data might contain entire sub-documents with information that is irrelevant to you. BSON allows you to easily traverse to the next object past a sub-document by using the sub-document's size information to skip it. msgpack, on the other hand, contains the number of elements inside what's called a map (similar to BSON's sub-documents). While this is undoubtedly useful information, it doesn't help the parser: you'd still have to parse every single object inside the map and couldn't just skip it. Depending on the structure of your data, this might have a huge impact on performance.
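A small sketch of the skipping trick described here (Kotlin, with a name of my own choosing; it relies only on the BSON rule that the leading int32 is the total size of the (sub-)document, including the length field itself and the trailing 0x00):

import java.nio.ByteBuffer
import java.nio.ByteOrder

// Advances the buffer past an entire BSON (sub-)document without parsing its contents.
fun skipBsonDocument(buf: ByteBuffer) {
    val len = buf.order(ByteOrder.LITTLE_ENDIAN).getInt(buf.position())
    buf.position(buf.position() + len)
}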
Well, as the author said, MessagePack was originally designed for network communication while BSON is designed for storage.
MessagePack is compact while BSON is verbose.
MessagePack is meant to be space-efficient while BSON is designed for CRUD (time-efficient).
Most importantly, MessagePack's type prefixes follow a Huffman-style encoding: the most common small values (small integers, short strings, small maps and arrays) get one-byte prefixes, while rarer and larger types use longer ones. I drew a Huffman tree of the MessagePack prefixes (the linked image is not reproduced here).
A quick test shows that minified JSON is deserialized faster than binary MessagePack. In the tests, Article.json is 550 kB of minified JSON and Article.mpack is the 420 kB MessagePack version of it. It may be an implementation issue, of course.
MessagePack:
//test_mp.js
var msg = require('msgpack');
var fs = require('fs');
var article = fs.readFileSync('Article.mpack');
for (var i = 0; i < 10000; i++) {
    msg.unpack(article);
}
JSON:
// test_json.js
var msg = require('msgpack');
var fs = require('fs');
var article = fs.readFileSync('Article.json', 'utf-8');
for (var i = 0; i < 10000; i++) {
    JSON.parse(article);
}
So times are:
Anarki:Downloads oleksii$ time node test_mp.js
real 2m45.042s
user 2m44.662s
sys 0m2.034s
Anarki:Downloads oleksii$ time node test_json.js
real 2m15.497s
user 2m15.458s
sys 0m0.824s
So space is saved, but faster? No.
Tested versions:
Anarki:Downloads oleksii$ node --version
v0.8.12
Anarki:Downloads oleksii$ npm list msgpack
/Users/oleksii
└── msgpack#0.1.7
I made a quick benchmark to compare the encoding and decoding speed of MessagePack vs BSON. BSON is faster, at least if you have large binary arrays:
BSON writer: 2296 ms (243487 bytes)
BSON reader: 435 ms
MESSAGEPACK writer: 5472 ms (243510 bytes)
MESSAGEPACK reader: 1364 ms
Using C# Newtonsoft.Json and MessagePack by neuecc:
public class TestData
{
    public byte[] buffer;
    public bool foobar;
    public int x, y, w, h;
}

static void Main(string[] args)
{
    try
    {
        int loop = 10000;
        var buffer = new TestData();
        TestData data2;
        byte[] data = null;
        int val = 0, val2 = 0, val3 = 0;
        buffer.buffer = new byte[243432];
        var sw = new Stopwatch();

        sw.Start();
        for (int i = 0; i < loop; i++)
        {
            data = SerializeBson(buffer);
            val2 = data.Length;
        }
        var rc1 = sw.ElapsedMilliseconds;

        sw.Restart();
        for (int i = 0; i < loop; i++)
        {
            data2 = DeserializeBson(data);
            val += data2.buffer[0];
        }
        var rc2 = sw.ElapsedMilliseconds;

        sw.Restart();
        for (int i = 0; i < loop; i++)
        {
            data = SerializeMP(buffer);
            val3 = data.Length;
            val += data[0];
        }
        var rc3 = sw.ElapsedMilliseconds;

        sw.Restart();
        for (int i = 0; i < loop; i++)
        {
            data2 = DeserializeMP(data);
            val += data2.buffer[0];
        }
        var rc4 = sw.ElapsedMilliseconds;

        Console.WriteLine("Results:", val);
        Console.WriteLine("BSON writer: {0} ms ({1} bytes)", rc1, val2);
        Console.WriteLine("BSON reader: {0} ms", rc2);
        Console.WriteLine("MESSAGEPACK writer: {0} ms ({1} bytes)", rc3, val3);
        Console.WriteLine("MESSAGEPACK reader: {0} ms", rc4);
    }
    catch (Exception e)
    {
        Console.WriteLine(e);
    }
    Console.ReadLine();
}

static private byte[] SerializeBson(TestData data)
{
    var ms = new MemoryStream();
    using (var writer = new Newtonsoft.Json.Bson.BsonWriter(ms))
    {
        var s = new Newtonsoft.Json.JsonSerializer();
        s.Serialize(writer, data);
        return ms.ToArray();
    }
}

static private TestData DeserializeBson(byte[] data)
{
    var ms = new MemoryStream(data);
    using (var reader = new Newtonsoft.Json.Bson.BsonReader(ms))
    {
        var s = new Newtonsoft.Json.JsonSerializer();
        return s.Deserialize<TestData>(reader);
    }
}

static private byte[] SerializeMP(TestData data)
{
    return MessagePackSerializer.Typeless.Serialize(data);
}

static private TestData DeserializeMP(byte[] data)
{
    return (TestData)MessagePackSerializer.Typeless.Deserialize(data);
}