Zip a Ktor ByteChannel - kotlin

I need to consume some data¹ and stream it as part of a zip archive in a multipart request. I'm using a ByteChannel but I need something in the middle to zip the written data.
There's Java's ZipOutputStream, but that's not async/suspendable, and I'm afraid that getting an OutputStream from the channel and passing it to a zip stream will impact performance considerably.
Is there a better method? If not, what measures can and should I take to minimize the impact of integrating a Java stream with a coroutine-based channel?
Thanks
¹ Potentially quite large, which is why I must stream it instead of getting it whole, zipping it and then pushing it elsewhere.
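For reference, the bridge being described (channel -> OutputStream -> ZipOutputStream, with the blocking work pushed onto Dispatchers.IO) might look roughly like the sketch below. It assumes the toOutputStream() adapter from io.ktor.utils.io.jvm.javaio (present in Ktor 1.x/2.x) and a made-up produceData source:

```kotlin
import io.ktor.utils.io.*
import io.ktor.utils.io.jvm.javaio.*
import kotlinx.coroutines.*
import java.util.zip.ZipEntry
import java.util.zip.ZipOutputStream

// Hypothetical producer of the (potentially large) source data.
suspend fun produceData(emit: suspend (ByteArray) -> Unit) { TODO() }

// Returns a channel the multipart body can read the finished zip stream from.
fun CoroutineScope.zipToChannel(): ByteReadChannel {
    val channel = ByteChannel(autoFlush = true)
    launch(Dispatchers.IO) {
        // toOutputStream() adapts the suspending write side to a blocking
        // OutputStream; Dispatchers.IO keeps those blocking writes off the
        // main coroutine dispatchers.
        ZipOutputStream(channel.toOutputStream()).use { zip ->
            zip.putNextEntry(ZipEntry("data.bin"))
            produceData { chunk -> zip.write(chunk) }
            zip.closeEntry()
        } // closing the ZipOutputStream also closes the channel
    }
    return channel
}
```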

Related

Why can't we just use an ArrayBuffer and convert it to an int array to upload files?

I got this silly question which originates from my college assignment.
Basically, what I was trying to do at the time was upload an image to a Flask backend in a REST way, and the backend would use OpenCV to do image recognition. Because JSON does not support binary data, I followed some online instructions to use base64, which is of course feasible (it seems to be used a lot for file uploading over REST, though I'm not sure of the underlying reason). But later I realized I could actually read the image into an ArrayBuffer, convert it to an int array, and then post that to the backend. I just tried it today and it succeeded. Then, on both sides, the encoding overhead is avoided and the payload size is also reduced, since base64 increases size by around 33%.
I want to ask: since we can avoid using base64, why do we still use base64? Is it just because it avoids issues with line-ending encodings across systems? That seems unrelated to uploading binary data.
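The ~33% figure follows directly from base64 mapping every 3 input bytes to 4 output characters. A quick JVM check (the 3 MB size here is just an example):

```kotlin
import java.util.Base64

fun main() {
    val raw = ByteArray(3_000_000)                         // e.g. a ~3 MB image
    val encoded = Base64.getEncoder().encodeToString(raw)
    println(raw.size)        // 3000000 bytes if sent as binary
    println(encoded.length)  // 4000000 characters, i.e. ~33% larger
}
```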

Is it better to open or to read large matrices in Julia?

I'm in the process of switching over to Julia from other programming languages, and one of the things that Julia will let you hang yourself on is memory. I think this is likely a good thing: a programming language where you actually have to think about some amount of memory management forces the coder to write more efficient code. This is in contrast to something like R, where you can seemingly load datasets that are larger than the allocated memory. Of course, you can't actually do that, so I wonder how R gets around that problem.
Part of what I've done in other programming languages is work on large tabular datasets, often converted over to an R dataframe or a matrix. I think the way this is handled in Julia is to stream data in wherever possible, so my main question is this:
Is it better to use readline("my_file.txt") to access data or is it better to use open("my_file.txt", "w")? If possible, wouldn't it be better to access a large dataset all at once for speed? Or would it be better to always stream data?
I hope this makes sense. Any further resources would be greatly appreciated.
I'm not an extensive user of Julia's data-ecosystem packages, but CSV.jl offers the Chunks and Rows alternatives to File, and these might let you process the files incrementally.
While it may not be relevant to your use case, the mechanisms mentioned in @Przemyslaw Szufel's answer are used in other places as well. Two I'm familiar with are the TiffImages.jl and NRRD.jl packages, both of which are I/O packages mostly for loading image data into Julia. With these, you can load terabyte-sized datasets on a laptop. There may be more packages that use the same mechanism, and many package maintainers would probably be grateful to receive a pull request that adds optional memory-mapping support where applicable.
In R you cannot have a data frame larger than memory. There is no magical buffering mechanism. However, when running R-based analytics you could use the disk.frame package for that.
Similarly, in Julia, if you want to process data frames larger than memory you need to use an appropriate package. The most reasonable and natural option in the Julia ecosystem is JuliaDB.
If you want a more low-level solution, have a look at:
Mmap, which provides memory-mapped I/O and exactly solves the issue of conveniently handling data too large to fit into memory
SharedArrays, which offers a disk-mapped array implemented on top of Mmap.
In conclusion, if your data is data-frame based, try JuliaDB; otherwise have a look at Mmap and SharedArrays (look at the filename parameter).
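The answer above is about Julia's Mmap, but the underlying mechanism is plain memory-mapped I/O. Purely as an illustration of that mechanism, here is the same idea as a minimal Kotlin/JVM sketch using java.nio (the file path and element count are made up, and a single JVM mapping is limited to 2 GB):

```kotlin
import java.io.RandomAccessFile
import java.nio.ByteOrder
import java.nio.channels.FileChannel

// Maps a file of little-endian Float64 values into virtual memory: the OS
// pages data in on demand, so the whole file never has to fit in RAM.
fun sumOfDoubles(path: String, count: Int): Double {
    RandomAccessFile(path, "r").use { file ->
        val buf = file.channel
            .map(FileChannel.MapMode.READ_ONLY, 0, count.toLong() * 8)
            .order(ByteOrder.LITTLE_ENDIAN)
        var sum = 0.0
        for (i in 0 until count) sum += buf.getDouble(i * 8)
        return sum
    }
}
```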

Improving the performance of the titanic pattern

I am referring to the Titanic pattern explained in the ZeroMQ guide. Can someone please explain why it recommends not using a key-value store, as compared to reading/writing disk files, for persistence? Quoting from the guide:
"What I'd not recommend is storing messages in a database, not even a "fast" key/value
store, unless you really like a specific database and don't have performance worries. You
will pay a steep price for the abstraction, ten to a thousand times over a raw disk file."
There are other recommendations in the guide, like storing the messages in a disk file in a circular-buffer fashion. But would it not be faster to store the messages in, and retrieve them from, a Redis store? Any ideas? Thank you.
In the ZeroMQ guide, the example provided for this pattern uses simple files in a very naive way (using buffered I/O, without any fsync'ing). The broker is directly responsible for storing things on the filesystem, so the performance is mostly linked to the efficiency of the VFS and the filesystem cache. There is no real I/O in the picture.
In this context, the cost of an extra hop to store the data in and retrieve it from Redis will be very noticeable, especially if it is implemented using synchronous queries/replies.
Redis is very efficient as a remote key/value store, but it cannot compete with an embedded store (even a store implemented on top of a filesystem cache).
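For concreteness, "simple files in a very naive way" means something like the sketch below (the class and record format are invented for illustration): every persist() call just appends to a buffered stream, and because nothing calls fsync/FileChannel.force(), the write effectively lands in the filesystem cache rather than on the platter.

```kotlin
import java.io.DataOutputStream
import java.io.File
import java.io.FileOutputStream

// Naive append-only persistence in the spirit of the guide's example.
class NaiveMessageStore(path: String) {
    // Buffered, append-mode stream: writes go to a user-space buffer, then
    // to the OS page cache. No force()/fsync, so no physical I/O on the hot path.
    private val out = DataOutputStream(FileOutputStream(File(path), true).buffered())

    fun persist(message: ByteArray) {
        out.writeInt(message.size)   // length-prefixed record so it can be read back
        out.write(message)
        out.flush()                  // flushes the user-space buffer only; durability is deferred
    }
}
```

Replacing those in-process calls with a Redis SET and GET per message adds a socket round trip (often with a synchronous wait for the reply) on that same hot path, which is the kind of overhead the quoted paragraph warns about.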

Distributed file system as a distributed buffer system?

The Problem
I've been developing an application which needs to support reads on a data object asynchronously with appending writes. In other words, a buffer. There will be many data objects at any given time.
I've been researching available distributed file systems to find one that supports reading a file as it's being written to, but my search has come up with nothing. I know from experience that Amazon S3 does not support this, and I am unsure about others such as HadoopDFS.
Solution: Chunking?
I have thought of chunking the data as a solution: splitting the incoming writes into n-byte chunks, each written to the DFS as a separate file. Chunks that are no longer needed can be deleted without interfering with the new data being written, since they are separate files on the DFS.
The problem with this strategy is that it would result in pauses when a buffer reader consumes data faster than the buffer writer creates it. Smaller chunks would mitigate this effect, but not perfectly.
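To make the chunking idea concrete, a writer along these lines could split the append stream into fixed-size objects. In this sketch, DfsClient, putObject/deleteObject and the key format are hypothetical placeholders for whatever the chosen DFS actually provides:

```kotlin
import java.io.ByteArrayOutputStream

// Hypothetical DFS client; putObject/deleteObject stand in for whatever
// write/delete primitives the chosen DFS actually exposes.
interface DfsClient {
    fun putObject(key: String, data: ByteArray)
    fun deleteObject(key: String)
}

// Splits an incoming stream of writes into fixed-size chunk objects
// ("<name>/00000000", "<name>/00000001", ...) so readers can consume and
// delete finished chunks while the writer keeps appending new ones.
class ChunkedBufferWriter(
    private val dfs: DfsClient,
    private val name: String,
    private val chunkSize: Int = 4 * 1024 * 1024,
) {
    private val pending = ByteArrayOutputStream()
    private var nextChunk = 0L

    fun append(data: ByteArray) {
        pending.write(data)
        while (pending.size() >= chunkSize) flushChunk()
    }

    fun close() {
        if (pending.size() > 0) flushChunk()
    }

    private fun flushChunk() {
        val bytes = pending.toByteArray()
        val chunk = bytes.copyOfRange(0, minOf(chunkSize, bytes.size))
        dfs.putObject("%s/%08d".format(name, nextChunk++), chunk)
        pending.reset()
        pending.write(bytes, chunk.size, bytes.size - chunk.size)  // keep the remainder buffered
    }
}
```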
Summarized Questions
Does a DFS exist which supports reading/writing an object as a buffer?
If not, is chunking data on the DFS the best way to simulate a buffer?
Lustre (lustre.org) supports concurrent coherent reads and writes, including O_APPEND writes. Concurrency is controlled by the Lustre distributed lock manager (LDLM), which grants extent locks to clients and revokes them on conflict. Multiple readers and multiple concurrent writers to the same file are supported, and they will see consistent file data, much as with a local file system.

What are good compression-oriented application programming interfaces (APIs)?

Do people still use the 1991 "data compression interface" draft standard and the 1991 "stream transformation algorithm interface" draft standard (both draft standards by Ross Williams)?
Are there any alternatives to those draft standards?
(I'm particularly looking for C APIs, but links to compression-oriented APIs in C++ and other languages would also be appreciated).
I'm experimenting with some data compression algorithms.
Typically the compressed file I'm producing is composed of a series of blocks, with a block header indicating which compression algorithm needs to be used to decompress the remaining data in that block -- Huffman, LZW, LZP, "stored uncompressed", etc.
The block header also indicates which filter(s) need to be used to convert the intermediate stream or buffer of data from the decompressor into a lossless copy of the original plaintext -- Burrows–Wheeler transform, delta encoding, XML end-tag restoration, "copy unchanged", etc.
Rather than using a huge switch statement that selects the decompression or filter algorithm based on the "compression type", with each procedure taking its own special number and order of parameters, it simplifies my code if every algorithm has exactly the same API -- the same number and order of parameters, etc.
Rather than waiting for the decompressor to run through the entire input stream before handing its output to the first filter, it would be nice if the API supported decompressed data coming out of the final filter "relatively quickly" (low latency) after relatively little compressed data has been fed into the initial decompressor.
It would be nice if the API could be used in systems that have only one thread or process.
Currently I'm kludging together my own internal API, re-using existing compression algorithm implementations by writing short wrapper functions to convert between my internal API and the special number and order of parameters used by each implementation.
Is there an already-existing API that I could use rather than designing my own from scratch?
Where can I find such an API?
I fear such an "API" does not exist.
In particular, a requirement such as "starting stage-2 while stage-1 is ongoing and unfinished" is completely implementation-dependent and cannot be added later by an API layer.
By the way, Maciej Adamczyk just tried the same thing as you.
He made an open-source benchmark comparing multiple compression algorithms over a block-compression scenario. The code can be consulted here:
http://encode.ru/threads/1371-Filesystem-benchmark?p=26630&viewfull=1#post26630
He was obliged to "encapsulate" all these different compressor interfaces in order to cope with the differences.
Now for the good thing: most compressors tend to have a relatively similar C interface when it comes to compressing a block of data.
As an example, it can be as simple as this one:
http://code.google.com/p/lz4/source/browse/trunk/lz4.h
So, in the end, the adaptation layer is not so heavy.
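To make the "adaptation layer" idea concrete, here is a minimal sketch of a uniform block-codec interface plus one adapter. The question is about C APIs, but the shape is the same in any language; BlockCodec and its methods are invented for illustration, and only java.util.zip's Deflater/Inflater are real library calls.

```kotlin
import java.io.ByteArrayOutputStream
import java.util.zip.Deflater
import java.util.zip.Inflater

// Invented uniform API: every algorithm gets exactly the same signature,
// so the block reader can dispatch through a table instead of a big switch.
interface BlockCodec {
    fun compress(input: ByteArray): ByteArray
    fun decompress(input: ByteArray, originalSize: Int): ByteArray
}

// Adapter wrapping java.util.zip's Deflater/Inflater behind that interface.
class DeflateCodec : BlockCodec {
    override fun compress(input: ByteArray): ByteArray {
        val deflater = Deflater().apply { setInput(input); finish() }
        val out = ByteArrayOutputStream()
        val buffer = ByteArray(8 * 1024)
        while (!deflater.finished()) out.write(buffer, 0, deflater.deflate(buffer))
        deflater.end()
        return out.toByteArray()
    }

    // Assumes a well-formed stream and an exact originalSize from the block header.
    override fun decompress(input: ByteArray, originalSize: Int): ByteArray {
        val inflater = Inflater().apply { setInput(input) }
        val result = ByteArray(originalSize)
        var written = 0
        while (!inflater.finished() && written < originalSize) {
            written += inflater.inflate(result, written, originalSize - written)
        }
        inflater.end()
        return result
    }
}

// "Stored uncompressed" fits the same shape, so a block-header id is all
// that's needed to pick the right implementation.
object StoredCodec : BlockCodec {
    override fun compress(input: ByteArray) = input
    override fun decompress(input: ByteArray, originalSize: Int) = input
}

val codecs: Map<Int, BlockCodec> = mapOf(0 to StoredCodec, 1 to DeflateCodec())
```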