How to create a lazy-evaluated range from a file? - file-io

The File I/O API in Phobos is relatively easy to use, but right now I feel like it's not very well integrated with D's range interface.
I could create a range delimiting the full contents by reading the entire file into an array:
import std.file;
auto mydata = cast(ubyte[]) read("filename");
processData(mydata); // takes a range of ubytes
But this eager evaluation of the data might be undesired if I only want to retrieve a file's header, for example. The upTo parameter doesn't solve this issue if the file's format assumes a variable-length header or any other element we wish to retrieve. It could even be in the middle of the file, and read forces me to read all of the file up to that point.
But indeed, there are alternatives. readf, readln, byLine and most particularly byChunk let me retrieve pieces of data until I reach the end of the file, or just when I want to stop reading the file.
import std.stdio;
File file("filename");
auto chunkRange = file.byChunk(1000); // a range of ubyte[]s
processData(chunkRange); // oops! not expecting chunks!
But now I have introduced the complexity of dealing with fixed size chunks of data, rather than a continuous range of bytes.
So how can I create a simple input range of bytes from a file that is lazy evaluated, either by characters or by small chunks (to reduce the number of reads)? Can the range in the second example be seamlessly encapsulated in a way that the data can be processed like in the first example?

You can use std.algorithm.joiner:
auto r = File("test.txt").byChunk(4096).joiner();
Note that byChunk reuses the same buffer for each chunk, so you may need to add .map!(chunk => chunk.idup) to lazily copy the chunks to the heap.

Related

Send heavy data through protobuf. Custom field

I'm developing the API for the application using protobuf and grpc.
I need to send the data with the arbitrary size. Sometimes it is small, sometimes huge. For example Nympy array. If the size is small I want to send it through protobuf, if the size if huge I want to dump data into file and send the filepath to this file through protobuf.
To do so I've created a following .proto messages:
message NumpyTroughProtobuf {
repeated int32 shape = 1;
repeated float array = 2;
}
message NumpyTroughfile {
string filepath = 1;
}
message NumpyTrough {
google.protobuf.Any data = 1;
}
The logic is simple: If the size is big I use data as NumpyTroughfile or if small data as NumpyTroughProtobuf.
Problem (what I want to avoid):
The mechanism of data transformation is the part of my app.
In the current approach I have to check and covert the data before I create the message NumpyTrough. So I have to add some logic into my application which will care of data check and cast. The same I have to do for any language which I use (for example if I send massages from Python to C++).
What I want to do:
The mechanism of data transformation is the part of customized protobuf.
I want to hide the data transformation. I want that my app to send a pure Numpy array into NumpyTrough.data field. All data transformation should be hided.
So I want that the logic of data transformation be the part of custom Protobuf field, not the part of my application.
Which meant that I would like to create a custom type of the field. I just implement the behavior of this filed (marshal/unmarshal) for any languages which I use. Then I can just directly send Numpy data into this custom field and this field will decide how to proceed: turn the data in into file or via other method, send trough Protobuf and restore on the receiver side.
Somethig like this https://github.com/gogo/protobuf/blob/master/custom_types.md but it seems this is not a part of protobuf ecosystem.
Protobuf only defines schema.
You can't add logic to a protobuf definition.
Protobug Any represents arbitrary binary data and so -- somewhere -- you'll need to explain to your users what it represents in order that they can ship data in the correct format to your service.
You get to decide how to distribute the processing of the data:
Either partly client-side functionality that performs preprocessing of the data and ships the output (either as structured data using non-Any types or, if still necessary as Any).
Or partly server-side that receives entirely unprocessed client-side data shipped through Any
Or some combination of the two
NOTE You may want to consider shipping the data regardless of size as file references to simplify your implementation. You're correct to bias protobuf to smaller message sizes but, depending on the file size distribution, does it make sense to complicate your implementation with 2 paths?

How do EMACS Lisp programmers read text files for non-editing purposes?

What do EMACS Lisp programmers do, when they want to write something roughly the equivalent of...
for line in open("foo.txt", "r", encoding="utf-8").readlines():
...(split on ws and call a fn, or whatever)...
..?
When I look in the EMACS lisp help, I see functions about opening files into text editing buffers -- not exactly what I was intending. I suppose I could write functions to visit the lines of the file, but if I did that, I wouldn't want the user to see it, and besides, it doesn't seem very efficient from a text-processing standpoint.
I think a more direct translation of the original Python code is as follows:
(with-temp-buffer
(insert-file-contents "foo.txt")
(while (search-forward-regexp "\\(.*\\)\n?" nil t)
; do something with this line in (match-string 1)
))
I think with-temp-buffer/insert-file-contents is generally preferable to with-current-buffer/find-file-noselect, because the former guarantees that you're working with a fresh copy of the entire file contents. With the latter construction, if you happen to already have a buffer visiting the target file, then that buffer is returned by find-file-noselect, so if that buffer has been narrowed, you'll only see that part of the file when you process it.
Keep in mind that it may very well be more convenient not to process the file line-by-line. For example, this is an expression that returns a list of all sequences of consecutive digits in the file:
(with-temp-buffer
(insert-file-contents "foo.txt")
(loop while (search-forward-regexp "[0-9]+" nil t)
collect (match-string 0)))
(require 'cl) first to bring in the loop macro.
Yes, that is what you want to do: visit the file in a buffer, and operate on the text in that buffer.
You do not have to display the buffer, i.e., the user need not see it.
And as for efficiency: manipulating text in a buffer is typically the most efficient way to manipulate text.
You can visit a file in a buffer in several ways. You might want to use an existing file buffer for this, depending on the use case. That is, if the file is already "open" in Emacs then you might want to use its buffer.
Or you might want to disregard any existing file buffer for an already "open" file, and read the file anew into a new buffer. For that, as #Sean mentions, you can use insert-file-contents with a buffer that you create. You can create the buffer using with-temp-buffer or generate-new-buffer, depending, again, on what you want/need to do with it.
If you do want to reuse a buffer that is already visiting the file, you can test whether it has been modified in memory, whether it is narrowed, etc., and do whatever is appropriate for your use case. You can check whether there is already a buffer visiting the file (using any path/file name) using function find-buffer-visiting.
To visit the file, taking advantage of any existing buffer that is visiting it, you can use find-file-noselect. That function returns the buffer that visits the file, so you can pass that buffer as the first argument to with-current-buffer. Here is a simple example.
(with-current-buffer (let ((enable-local-variables ())) (find-file-noselect file))
;; Do some stuff with the text in the buffer.
;; Optionally save the buffer back to the file.
)
(The binding of enable-local-variables to nil is a minor optimization, for the common case where you don't need to bother with buffer-local variables.)

Byte InputRange from file

How to construct easily a raw byte-by-byte InputRange/ForwardRange/RandomAccessRange from a file?
file.byChunk(4096).joiner
This reads a file in 4096-byte chunks and lazily joins the chunks together into a single ubyte input range.
joiner is from std.algorithm, so you'll have to import it first.
The easiest way to make a raw byte range from a file is to just read it all right into memory:
import std.file;
auto data = cast(ubyte[]) read("filename");
// data is a full-featured random access range of the contents
If the file is too large for that to be reasonable, you could try a memory-mapped file http://dlang.org/phobos/std_mmfile.html and use the opSlice to get an array off it. Since it is an array, you get full range features, but since it is memory mapped by the operating system, you get lazy reading as you touch the file.
For a simple InputRange, there's LockingTextReader (undocumented) in Phobos, or you could construct one yourself over byChunk or even fgetc, the C function. fgetc would be the easiest to write:
struct FileByByte {
ubyte front;
void popFront() { front = cast(ubyte) fgetc(fp); }
bool empty() { return feof(fp); }
FILE* fp;
this(FILE* fp) { this.fp = fp; popFront(); /* prime it */ }
}
I haven't actually tested that but i'm pretty sure it'd work. (BTW the file open and close is separate from this because ranges are supposed to be just views into data, not managed containers. You wouldn't want the file closed just because you passed this range into a function.)
This is not a forward nor random access range though. Those are trickier to do on streams without a lot of buffering code and I think that'd be a mistake to try to write - generally, ranges should be cheap, not emulating features the underlying container doesn't natively support.
EDIT: The other answer has a non-buffering way! https://stackoverflow.com/a/30278933/1457000 That's awesome.

Erlang binary protocol serialization

I'm currently using Erlang for a big project but i have a question regarding a proper proceeding.
I receive bytes over a tcp socket. The bytes are according to a fixed protocol, the sender is a pyton client. The python client uses class inheritance to create bytes from the objects.
Now i would like to (in Erlang) take the bytes and convert these to their equivelant messages, they all have a common message header.
How can i do this as generic as possible in Erlang?
Kind Regards,
Me
Pattern matching/binary header consumption using Erlang's binary syntax. But you will need to know either exactly what bytes or bits your are expecting to receive, or the field sizes in bytes or bits.
For example, let's say that you are expecting a string of bytes that will either begin with the equivalent of the ASCII strings "PUSH" or "PULL", followed by some other data you will place somewhere. You can create a function head that matches those, and captures the rest to pass on to a function that does "push()" or "pull()" based on the byte header:
operation_type(<<"PUSH", Rest/binary>>) -> push(Rest);
operation_type(<<"PULL", Rest/binary>>) -> pull(Rest).
The bytes after the first four will now be in Rest, leaving you free to interpret whatever subsequent headers or data remain in turn. You could also match on the whole binary:
operation_type(Bin = <<"PUSH", _/binary>>) -> push(Bin);
operation_type(Bin = <<"PULL", _/binary>>) -> pull(Bin).
In this case the "_" variable works like it always does -- you're just checking for the lead, essentially peeking the buffer and passing the whole thing on based on the initial contents.
You could also skip around in it. Say you knew you were going to receive a binary with 4 bytes of fluff at the front, 6 bytes of type data, and then the rest you want to pass on:
filter_thingy(<<_:4/binary, Type:6/binary, Rest/binary>>) ->
% Do stuff with Rest based on Type...
It becomes very natural to split binaries in function headers (whether the data equates to character strings or not), letting the "Rest" fall through to appropriate functions as you go along. If you are receiving Python pickle data or something similar, you would want to write the parsing routine in a recursive way, so that the conclusion of each data type returns you to the top to determine the next type, with an accumulated tree that represents the data read so far.
I only covered 8-bit bytes above, but there is also a pure bitstring syntax, which lets you go as far into the weeds with bits and bytes as you need with the same ease of syntax. Matching is a real lifesaver here.
Hopefully this informed more than confused. Binary syntax in Erlang makes this the most pleasant binary parsing environment in a general programming language I've yet encountered.
http://www.erlang.org/doc/programming_examples/bit_syntax.html

File reading and checksums in go. Difference between methods

Recently I'm into creating checksums for files in go. My code is working with small and big files. I tried two methods, the first uses ioutil.ReadFile("filename") and the second is working with os.Open("filename").
Examples:
The first function is working with the io/ioutil and works for small files. When I try to copy a big file my ram gets blastet and for a 1.5GB iso it uses 3GB of ram.
func byteCopy(fileToCopy string) {
file, err := ioutil.ReadFile(fileToCopy) //1.5GB file
omg(err) //error handling function
ioutil.WriteFile("2.iso", file, 0777)
os.Remove("2.iso")
}
Even worse when I want to create a checksum with crypto/sha512 and io/ioutil.
It will never finish and abort because it runs out of memory.
func ioutilHash() {
file, _ := ioutil.ReadFile(iso)
h := sha512.New()
fmt.Printf("%x", h.Sum(file))
}
When using the function below everything works fine.
func ioHash() {
f, err := os.Open(iso) //iso is a big ~ 1.5tb file
omg(err) //error handling function
defer f.Close()
h := sha512.New()
io.Copy(h, f)
fmt.Printf("%x", h.Sum(nil))
}
My Question:
Why is the ioutil.ReadFile() function not working right? The 1.5GB file should not fill my 16GB of ram. I don't know where to look right now.
Could somebody explain the differences between the methods? I don't get it with reading the go-doc and examples.
Having usable code is nice, but understanding why its working is way above that.
Thanks in advance!
The following code doesn't do what you think it does.
func ioutilHash() {
file, _ := ioutil.ReadFile(iso)
h := sha512.New()
fmt.Printf("%x", h.Sum(file))
}
This first reads your 1.5GB iso. As jnml pointed out, it continuously makes bigger and bigger buffers to fill it. In the end, And total buffer size is no less than 1.5GB and no greater than 1.875GB (by the current implementation).
However, after that you then make another buffer! h.Sum(file) doesn't hash file. It appends the current hash to file! This may or may not cause yet another allocation.
The real problem is that you are taking that file, now appended with the hash, and printing it with %x. Fmt actually pre-computes using the same type of method jnml pointed out that ioutil.ReadAll used. So it constantly allocated bigger and bigger buffers to store the hex of your file. Since each letter is 4 bits, that means we are talking about no less than a 3GB buffer for that and no greater than 3.75GB.
This means your active buffers may be as big 5.625GB. Combine that with the GC not being perfect and not removing all the intermediate buffers, and it could very easily fill your space.
The correct way to write that code would have been.
func ioutilHash() {
file, _ := ioutil.ReadFile(iso)
h := sha512.New()
h.Write(file)
fmt.Printf("%x", h.Sum(nil))
}
This doesn't do nearly the number the allocations.
The bottom line is that ReadFile is rarely what you want to use. IO streaming (using readers and writers) is always the best way when it is an option. Not only do you allocate much less when you use io.Copy, you also hash and read the disk concurrently. In your ReadFile example, the two resources are used synchronously when they don't depend on each other.
ioutil.ReadFile is working right. It's your fault to abuse the system resources by using that function for things you know are huge.
ioutil.ReadFile is a handy helper for files you're pretty sure in advance that they're going to be small. Like configuration files, most source code files etc. (Actually it's optimizing things for files <= 1e9 bytes, but that's an implementation detail and not part of the API contract. Your 1.5GB file forces it to use slice growing and thus allocating more than one big buffer for your data in the process of reading the file.)
Even your other approach using os.File is not okay. You definitely should be using the "bufio" package for sequential processing of large files, see bufio.NewReader.