How to force Haskell not to store the whole ByteString? - optimization

I'm writing a small (relatively) application in Haskell for academic purposes. I'm implementing Huffman compression, based on this code: http://www.haskell.org/haskellwiki/Toy_compression_implementations .
My variant of this code is here https://github.com/kravitz/har/blob/a5d221f227c27fd1c5587217a29a169a377521a6/huffman.hs and it uses lazy bytestrings. When I implemented RLE compression everything was smooth, because it processes the input stream in one pass. But Huffman processes it twice, and as a result I end up with the fully evaluated ByteString held in memory, which is bad for big files (and even for relatively small files it allocates too much heap). This is not just a suspicion: profiling also shows that most of the heap is eaten by ByteString allocation.
I also serialize the stream length into the file, and that may likewise cause the full ByteString to be loaded into memory. Is there any simple way to kindly ask GHC to re-evaluate the stream several times instead?

Instead of passing a bytestring to the encoder, you can pass something that computes a bytestring, then explicitly recompute the value each time you need it.
-- Assumed imports; makeCodebook and encode stand for your own Huffman functions.
import Control.Monad.ST (ST, stToIO)
import Control.Monad.ST.Unsafe (unsafeIOToST)  -- lives in Control.Monad.ST on older GHCs
import Data.ByteString.Lazy (ByteString)
import qualified Data.ByteString.Lazy as ByteString

compress :: ST s ByteString -> ST s ByteString
compress makeInput = do
    len      <- (return $!) . ByteString.length =<< makeInput
    codebook <- (return $!) . makeCodebook =<< makeInput
    return . encode len codebook =<< makeInput

compressIO :: IO ByteString -> IO ByteString
compressIO m = stToIO (compress (unsafeIOToST m))
The parameter to compress should actually compute the value. Simply wrapping a value with return won't work. Also, each call to makeInput must actually have its result evaluated, else there will remain a lazy, un-evaluated copy of the input in memory when the input is recomputed.
The usual approach, as barsoap said, is to just compress one block at a time.

The usual approach when (Huffman-)compressing, since one can't avoid processing the input twice (once to collect the probability distribution, and once to do the actual compression), is to chunk the input into blocks and compress each block separately. While that still eats memory, it only eats, at most, a constant amount.
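For instance, here is a minimal sketch of that idea for lazy bytestrings, assuming a hypothetical compressBlock that does the full count-then-encode work on one chunk (the name and block size are made up, not part of your code):
import Data.Int (Int64)
import qualified Data.ByteString.Lazy as L

blockSize :: Int64
blockSize = 1024 * 1024   -- 1MB blocks; tune for your memory budget

-- Each block is compressed independently with its own codebook,
-- so only one block needs to be resident at a time.
compressChunked :: (L.ByteString -> L.ByteString) -> L.ByteString -> L.ByteString
compressChunked compressBlock input
    | L.null input = L.empty
    | otherwise    = compressBlock block `L.append` compressChunked compressBlock rest
  where
    (block, rest) = L.splitAt blockSize input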
That said, you might want to have a look at bytestring-mmap, though that won't work with standard input, sockets, and other file descriptors that aren't backed by a file system which supports mmap.
You can also re-read the ByteString from the file (again, provided you're not receiving it from anything pipe-like) after collecting the probability distribution, but that will still make your code bail out on, say, 1TB files.
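A rough sketch of that two-pass variant, where countFrequencies, makeCodebook, and encode stand in for your own Huffman functions (hypothetical names), and the frequency table is forced so the first read can be garbage-collected before the second one starts:
import Control.Exception (evaluate)
import Control.DeepSeq (force)   -- assumes the frequency table has an NFData instance
import qualified Data.ByteString.Lazy as L

compressFile :: FilePath -> FilePath -> IO ()
compressFile src dst = do
    -- Pass 1: stream the file once, only to build the frequency table.
    freqs <- evaluate . force . countFrequencies =<< L.readFile src
    -- Pass 2: stream the file again and encode it with the finished codebook.
    input <- L.readFile src
    L.writeFile dst (encode (makeCodebook freqs) input)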

Related

How do I know in advance that the buffer size is enough in nanopb?

I'm trying to use nanopb, following this example:
https://github.com/nanopb/nanopb/blob/master/examples/simple/simple.c
The buffer size there is initialized to 128:
uint8_t buffer[128];
My question is: how do I know (in advance) that this 128-byte buffer is enough to transmit my message? How do I decide on a proper buffer size (big enough, but without wasting too much by over-allocating) before declaring (or coding) it?
Looks like a noob question :), but thanks for your quick suggestions.
When possible, nanopb adds a define in the generated .pb.h file that has the maximum encoded size of a message. In the file examples/simple/simple.pb.h you'll find:
/* Maximum encoded size of messages (where known) */
#define SimpleMessage_size 11
so you could specify uint8_t buffer[SimpleMessage_size];.
This define is only available if all repeated and string fields have the (nanopb).max_count and (nanopb).max_size options specified.
For many practical purposes, you can pick a buffer size that you estimate will be large enough and handle the error conditions. It is also possible to use pb_get_encoded_size() to calculate the encoded size and dynamically allocate storage, but in general that is not a great solution in embedded applications. When total system memory is limited, it is often better to have a constant-sized buffer that you can test with, instead of having the available amount of dynamic memory vary at runtime.
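As a rough illustration of both approaches, using the SimpleMessage from the simple example (check the exact field-descriptor types against your nanopb version, since older releases use pb_field_t arrays instead of pb_msgdesc_t):
#include <pb_encode.h>
#include "simple.pb.h"

bool encode_simple(const SimpleMessage *msg)
{
    /* Option 1: a static buffer sized by the generated define. */
    uint8_t buffer[SimpleMessage_size];
    pb_ostream_t stream = pb_ostream_from_buffer(buffer, sizeof(buffer));
    if (!pb_encode(&stream, SimpleMessage_fields, msg))
        return false;                    /* encoding failed */

    /* Option 2: ask for the exact encoded size, e.g. to size a dynamic buffer. */
    size_t needed;
    if (!pb_get_encoded_size(&needed, SimpleMessage_fields, msg))
        return false;

    /* transmit(buffer, stream.bytes_written);  -- hypothetical send routine */
    (void)needed;
    return true;
}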

How ext4 works with fallocate

Recently I have been testing the proper usage of the ext4 filesystem. What I expect is:
when the system crashes, data whose write has already returned OK must not be lost, but metadata may be.
Here is my usage:
1. Call fallocate to allocate a certain amount of space:
fallocate(fd, 0, 0, 4*1024*1024); // 4MB
2. Call fsync(fd) so that data and metadata are written to disk.
3. Then I call a function that writes 4KB chunks (random data, not zeros) at random offsets in the file, with the O_DIRECT flag but without calling fsync. I log each offset for which the write returned OK.
4. Check the logged offsets. At some of them, reading back 4KB of data returns zeros. That seems to mean the offset was never actually used, as in a sparse (hole) file.
My questions are:
1. Why, after calling fallocate and fsync, does the file's metadata still seem to indicate that some blocks are unused, so that reading them returns zeros? That is my current understanding of what is happening.
2. Is there another API I can call to make sure the allocated space in the file contains no holes, so that afterwards, when a write with O_DIRECT returns OK, the data will not be lost even if the system crashes?
Thanks.
Only writing to the file space can eliminate the hole. Without writing, there is no dirty page and fsync simply does nothing.
I am wondering how you executed your step 4. It seems that you did it via a deliberate crash, did you? If you read the data after the write without a crash, it should not be zero, provided you wrote non-zeros. If you read it after a crash, zeros can happen if a disk cache was involved. However, this kind of zero is not like a hole: these are zeros read from the disk (very probably the disk actually contains zeros).
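To make the point concrete, here is a minimal sketch (assuming Linux/glibc, error handling trimmed) of allocating, writing one aligned 4KB block with O_DIRECT, and then syncing, which is the step that actually makes the written data crash-safe:
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* offset must be block-aligned, as required by O_DIRECT */
int write_block(const char *path, off_t offset)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0)
        return -1;

    /* Reserve space up front; the blocks are allocated but still read back
     * as zeros until real data is written to them. */
    if (fallocate(fd, 0, 0, 4 * 1024 * 1024) != 0) {
        close(fd);
        return -1;
    }

    /* O_DIRECT requires an aligned buffer; 4096 matches a typical block size. */
    void *buf = NULL;
    if (posix_memalign(&buf, 4096, 4096) != 0) {
        close(fd);
        return -1;
    }
    memset(buf, 0xAB, 4096);

    int rc = -1;
    if (pwrite(fd, buf, 4096, offset) == 4096 && fsync(fd) == 0)
        rc = 0;    /* only now is this 4KB durable across a crash */

    free(buf);
    close(fd);
    return rc;
}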

How do EMACS Lisp programmers read text files for non-editing purposes?

What do EMACS Lisp programmers do, when they want to write something roughly the equivalent of...
for line in open("foo.txt", "r", encoding="utf-8").readlines():
...(split on ws and call a fn, or whatever)...
..?
When I look in the EMACS lisp help, I see functions about opening files into text editing buffers -- not exactly what I was intending. I suppose I could write functions to visit the lines of the file, but if I did that, I wouldn't want the user to see it, and besides, it doesn't seem very efficient from a text-processing standpoint.
I think a more direct translation of the original Python code is as follows:
(with-temp-buffer
  (insert-file-contents "foo.txt")
  (while (search-forward-regexp "\\(.*\\)\n?" nil t)
    ;; do something with this line in (match-string 1)
    ))
I think with-temp-buffer/insert-file-contents is generally preferable to with-current-buffer/find-file-noselect, because the former guarantees that you're working with a fresh copy of the entire file contents. With the latter construction, if you happen to already have a buffer visiting the target file, then that buffer is returned by find-file-noselect, so if that buffer has been narrowed, you'll only see that part of the file when you process it.
Keep in mind that it may very well be more convenient not to process the file line-by-line. For example, this is an expression that returns a list of all sequences of consecutive digits in the file:
(with-temp-buffer
  (insert-file-contents "foo.txt")
  (loop while (search-forward-regexp "[0-9]+" nil t)
        collect (match-string 0)))
You'll need to (require 'cl) first to bring in the loop macro.
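And here is a sketch closer to the original Python loop, splitting each line on whitespace and handing the pieces to a function of your own (process-tokens is just a hypothetical stand-in):
(with-temp-buffer
  (insert-file-contents "foo.txt")
  (goto-char (point-min))
  (while (not (eobp))
    ;; split-string with no separator splits on whitespace and drops empty strings
    (process-tokens
     (split-string (buffer-substring-no-properties
                    (line-beginning-position) (line-end-position))))
    (forward-line 1)))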
Yes, that is what you want to do: visit the file in a buffer, and operate on the text in that buffer.
You do not have to display the buffer, i.e., the user need not see it.
And as for efficiency: manipulating text in a buffer is typically the most efficient way to manipulate text.
You can visit a file in a buffer in several ways. You might want to use an existing file buffer for this, depending on the use case. That is, if the file is already "open" in Emacs then you might want to use its buffer.
Or you might want to disregard any existing file buffer for an already "open" file, and read the file anew into a new buffer. For that, as #Sean mentions, you can use insert-file-contents with a buffer that you create. You can create the buffer using with-temp-buffer or generate-new-buffer, depending, again, on what you want/need to do with it.
If you do want to reuse a buffer that is already visiting the file, you can test whether it has been modified in memory, whether it is narrowed, etc., and do whatever is appropriate for your use case. You can check whether there is already a buffer visiting the file (using any path/file name) using function find-buffer-visiting.
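A small sketch of that kind of check, reusing an existing buffer only when it is unmodified and not narrowed, and otherwise reading a fresh copy (the body forms are placeholders):
(let ((buf (find-buffer-visiting "foo.txt")))
  (if (and buf
           (not (buffer-modified-p buf))
           (with-current-buffer buf
             ;; no narrowing: the accessible region covers the whole buffer
             (= (buffer-size) (- (point-max) (point-min)))))
      (with-current-buffer buf
        ;; Safe to work with the existing buffer's full contents.
        )
    (with-temp-buffer
      (insert-file-contents "foo.txt")
      ;; Work with a fresh copy instead.
      )))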
To visit the file, taking advantage of any existing buffer that is visiting it, you can use find-file-noselect. That function returns the buffer that visits the file, so you can pass that buffer as the first argument to with-current-buffer. Here is a simple example.
(with-current-buffer (let ((enable-local-variables ()))
                       (find-file-noselect file))
  ;; Do some stuff with the text in the buffer.
  ;; Optionally save the buffer back to the file.
  )
(The binding of enable-local-variables to nil is a minor optimization, for the common case where you don't need to bother with buffer-local variables.)

Erlang binary protocol serialization

I'm currently using Erlang for a big project, but I have a question regarding the proper way to proceed.
I receive bytes over a TCP socket. The bytes follow a fixed protocol; the sender is a Python client. The Python client uses class inheritance to create the bytes from its objects.
Now I would like to (in Erlang) take the bytes and convert them into their equivalent messages; they all share a common message header.
How can I do this as generically as possible in Erlang?
Kind Regards,
Me
Pattern matching/binary header consumption using Erlang's binary syntax. But you will need to know either exactly what bytes or bits you are expecting to receive, or the field sizes in bytes or bits.
For example, let's say that you are expecting a string of bytes that will either begin with the equivalent of the ASCII strings "PUSH" or "PULL", followed by some other data you will place somewhere. You can create a function head that matches those, and captures the rest to pass on to a function that does "push()" or "pull()" based on the byte header:
operation_type(<<"PUSH", Rest/binary>>) -> push(Rest);
operation_type(<<"PULL", Rest/binary>>) -> pull(Rest).
The bytes after the first four will now be in Rest, leaving you free to interpret whatever subsequent headers or data remain in turn. You could also match on the whole binary:
operation_type(Bin = <<"PUSH", _/binary>>) -> push(Bin);
operation_type(Bin = <<"PULL", _/binary>>) -> pull(Bin).
In this case the "_" variable works like it always does -- you're just checking for the lead, essentially peeking the buffer and passing the whole thing on based on the initial contents.
You could also skip around in it. Say you knew you were going to receive a binary with 4 bytes of fluff at the front, 6 bytes of type data, and then the rest you want to pass on:
filter_thingy(<<_:4/binary, Type:6/binary, Rest/binary>>) ->
    % Do stuff with Rest based on Type...
It becomes very natural to split binaries in function headers (whether the data equates to character strings or not), letting the "Rest" fall through to appropriate functions as you go along. If you are receiving Python pickle data or something similar, you would want to write the parsing routine in a recursive way, so that the conclusion of each data type returns you to the top to determine the next type, with an accumulated tree that represents the data read so far.
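As a sketch of that recursive shape, assuming a hypothetical common header of a 1-byte message type and a 4-byte big-endian payload length (your real protocol's fields will differ):
%% parse/2 peels one framed message per recursion.
parse(<<Type:8, Len:32/big, Payload:Len/binary, Rest/binary>>, Acc) ->
    parse(Rest, [{Type, Payload} | Acc]);
parse(Incomplete, Acc) ->
    %% <<>> or a partial frame: return the messages so far plus the
    %% unconsumed tail, so the caller can prepend the next TCP segment.
    {lists:reverse(Acc), Incomplete}.
Called as parse(Bin, []), this returns {Messages, LeftOver}, which fits naturally with gen_tcp's active or passive receive loops.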
I only covered 8-bit bytes above, but there is also a pure bitstring syntax, which lets you go as far into the weeds with bits and bytes as you need with the same ease of syntax. Matching is a real lifesaver here.
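For example, a made-up header byte with two flag bits and six reserved bits could be unpacked like this:
parse_flags(<<Compressed:1, Encrypted:1, _Reserved:6, Rest/binary>>) ->
    {Compressed =:= 1, Encrypted =:= 1, Rest}.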
Hopefully this informed more than confused. Binary syntax in Erlang makes this the most pleasant binary parsing environment in a general programming language I've yet encountered.
http://www.erlang.org/doc/programming_examples/bit_syntax.html

File reading and checksums in Go. Difference between methods

Recently I've been creating checksums for files in Go. My code works with small and big files. I tried two methods: the first uses ioutil.ReadFile("filename"), and the second works with os.Open("filename").
Examples:
The first function uses io/ioutil and works for small files. When I try to copy a big file, my RAM gets blasted: for a 1.5GB iso it uses 3GB of RAM.
func byteCopy(fileToCopy string) {
    file, err := ioutil.ReadFile(fileToCopy) // 1.5GB file
    omg(err)                                 // error handling function
    ioutil.WriteFile("2.iso", file, 0777)
    os.Remove("2.iso")
}
Even worse is when I want to create a checksum with crypto/sha512 and io/ioutil.
It never finishes and aborts because it runs out of memory.
func ioutilHash() {
    file, _ := ioutil.ReadFile(iso)
    h := sha512.New()
    fmt.Printf("%x", h.Sum(file))
}
When using the function below everything works fine.
func ioHash() {
    f, err := os.Open(iso) // iso is the same big ~1.5GB file
    omg(err)               // error handling function
    defer f.Close()
    h := sha512.New()
    io.Copy(h, f)
    fmt.Printf("%x", h.Sum(nil))
}
My questions:
Why is the ioutil.ReadFile() function not working right? The 1.5GB file shouldn't fill my 16GB of RAM. I don't know where to look right now.
Could somebody explain the differences between the methods? I don't get it from reading the go-doc and examples.
Having usable code is nice, but understanding why it works matters far more.
Thanks in advance!
The following code doesn't do what you think it does.
func ioutilHash() {
    file, _ := ioutil.ReadFile(iso)
    h := sha512.New()
    fmt.Printf("%x", h.Sum(file))
}
This first reads your 1.5GB iso. As jnml pointed out, it continuously makes bigger and bigger buffers to fill it. In the end, the total buffer size is no less than 1.5GB and no greater than 1.875GB (by the current implementation).
However, after that you then make another buffer! h.Sum(file) doesn't hash file. It appends the current hash to file! This may or may not cause yet another allocation.
The real problem is that you are taking that file, now appended with the hash, and printing it with %x. Fmt actually pre-computes using the same type of method jnml pointed out that ioutil.ReadAll uses, so it keeps allocating bigger and bigger buffers to store the hex encoding of your file. Since each hex character encodes only 4 bits but occupies a full byte, the hex string is twice the size of the data, which means we are talking about no less than a 3GB buffer for that, and no greater than 3.75GB.
This means your active buffers may be as large as 5.625GB. Combine that with the GC not being perfect and not removing all the intermediate buffers, and it can very easily fill your available memory.
The correct way to write that code would have been:
func ioutilHash() {
    file, _ := ioutil.ReadFile(iso)
    h := sha512.New()
    h.Write(file)
    fmt.Printf("%x", h.Sum(nil))
}
This doesn't do nearly the same number of allocations.
The bottom line is that ReadFile is rarely what you want to use. IO streaming (using readers and writers) is always the best way when it is an option. Not only do you allocate much less when you use io.Copy, you also hash and read the disk concurrently. In your ReadFile example, the two resources are used synchronously when they don't depend on each other.
ioutil.ReadFile is working right. It's your fault for abusing the system resources by using that function for things you know are huge.
ioutil.ReadFile is a handy helper for files you're pretty sure in advance are going to be small, like configuration files, most source code files, etc. (Actually, it optimizes for files <= 1e9 bytes, but that's an implementation detail and not part of the API contract. Your 1.5GB file forces it to use slice growing and thus allocate more than one big buffer for your data while reading the file.)
Even your other approach using os.File is not okay. You definitely should be using the "bufio" package for sequential processing of large files; see bufio.NewReader.
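For completeness, a small sketch of that combination, streaming the file through the hash with a buffered reader instead of reading it whole (the file name is just an example):
package main

import (
    "bufio"
    "crypto/sha512"
    "fmt"
    "io"
    "log"
    "os"
)

func streamHash(path string) (string, error) {
    f, err := os.Open(path)
    if err != nil {
        return "", err
    }
    defer f.Close()

    h := sha512.New()
    // io.Copy pulls fixed-size chunks through the buffered reader into the
    // hash, so memory use stays constant no matter how big the file is.
    if _, err := io.Copy(h, bufio.NewReader(f)); err != nil {
        return "", err
    }
    return fmt.Sprintf("%x", h.Sum(nil)), nil
}

func main() {
    sum, err := streamHash("big.iso")
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(sum)
}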