File reading and checksums in Go. Difference between methods - file-io

Recently I've been creating checksums for files in Go. My code needs to work with both small and big files. I tried two methods: the first uses ioutil.ReadFile("filename"), and the second works with os.Open("filename").
Examples:
The first function uses io/ioutil and works for small files, but when I try to copy a big file my RAM usage explodes: for a 1.5GB ISO it uses 3GB of RAM.
func byteCopy(fileToCopy string) {
    file, err := ioutil.ReadFile(fileToCopy) // 1.5GB file
    omg(err) // error handling function
    ioutil.WriteFile("2.iso", file, 0777)
    os.Remove("2.iso")
}
It's even worse when I want to create a checksum with crypto/sha512 and io/ioutil: it never finishes because it runs out of memory and aborts.
func ioutilHash() {
    file, _ := ioutil.ReadFile(iso)
    h := sha512.New()
    fmt.Printf("%x", h.Sum(file))
}
When using the function below everything works fine.
func ioHash() {
    f, err := os.Open(iso) // iso is a big ~1.5GB file
    omg(err) // error handling function
    defer f.Close()
    h := sha512.New()
    io.Copy(h, f)
    fmt.Printf("%x", h.Sum(nil))
}
My Question:
Why is the ioutil.ReadFile() function not working right? The 1.5GB file should not fill my 16GB of RAM. I don't know where to look right now.
Could somebody explain the differences between the methods? I can't figure it out from the Go docs and examples.
Having usable code is nice, but understanding why it works matters even more to me.
Thanks in advance!

The following code doesn't do what you think it does.
func ioutilHash() {
    file, _ := ioutil.ReadFile(iso)
    h := sha512.New()
    fmt.Printf("%x", h.Sum(file))
}
This first reads your 1.5GB ISO. As jnml pointed out, it continuously makes bigger and bigger buffers to fill it. In the end, the total buffer size is no less than 1.5GB and no greater than 1.875GB (by the current implementation).
However, after that you then make another buffer! h.Sum(file) doesn't hash file. It appends the current hash to file! This may or may not cause yet another allocation.
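A minimal sketch (not from the original answer) that shows Sum appending the digest rather than hashing its argument; the byte-string contents here are just for illustration:
package main

import (
    "bytes"
    "crypto/sha512"
    "fmt"
)

func main() {
    h := sha512.New()
    h.Write([]byte("some data")) // this is what actually gets hashed
    prefix := []byte("unrelated bytes")
    sum := h.Sum(prefix) // appends the digest of "some data" to prefix; prefix is NOT hashed
    fmt.Println(bytes.HasPrefix(sum, prefix))        // true
    fmt.Println(len(sum) == len(prefix)+sha512.Size) // true
}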
The real problem is that you are taking that file, now with the hash appended, and printing it with %x. fmt actually pre-computes the output using the same kind of growing-buffer method jnml pointed out that ioutil.ReadAll uses, so it keeps allocating bigger and bigger buffers to store the hex encoding of your file. Since each hex character encodes 4 bits, the hex form is twice the size of the data, which means no less than a 3GB buffer for that and no greater than 3.75GB.
This means your active buffers may be as big as 5.625GB. Combine that with the GC not being perfect and not removing all the intermediate buffers, and it could very easily fill your space.
The correct way to write that code would have been:
func ioutilHash() {
    file, _ := ioutil.ReadFile(iso)
    h := sha512.New()
    h.Write(file)
    fmt.Printf("%x", h.Sum(nil))
}
This doesn't do nearly the same number of allocations.
The bottom line is that ReadFile is rarely what you want to use. IO streaming (using readers and writers) is always the best way when it is an option. Not only do you allocate much less when you use io.Copy, you also hash and read the disk concurrently. In your ReadFile example, all of the reading finishes before any hashing starts, even though the two don't depend on each other.

ioutil.ReadFile is working right. It's your fault for abusing system resources by using that function on things you know are huge.
ioutil.ReadFile is a handy helper for files you're pretty sure in advance are going to be small, like configuration files, most source code files, etc. (Actually, it optimizes for files <= 1e9 bytes, but that's an implementation detail and not part of the API contract. Your 1.5GB file forces it to use slice growing and thus allocate more than one big buffer for your data while reading the file.)
Even your other approach using os.File is not okay. You definitely should be using the "bufio" package for sequential processing of large files; see bufio.NewReader.
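For example, a sketch of that suggestion (the file name is made up), combining os.Open, bufio.NewReader and the streaming io.Copy approach from the answer above:
package main

import (
    "bufio"
    "crypto/sha512"
    "fmt"
    "io"
    "log"
    "os"
)

func main() {
    f, err := os.Open("big.iso") // hypothetical large file
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    r := bufio.NewReaderSize(f, 1<<20) // 1MB read buffer in front of the file
    h := sha512.New()
    if _, err := io.Copy(h, r); err != nil { // streams the file through the hash in chunks
        log.Fatal(err)
    }
    fmt.Printf("%x\n", h.Sum(nil))
}
Note that io.Copy already reads in fixed-size chunks on its own, so the bufio wrapper mostly pays off when you go on to do many small sequential reads yourself; either way, memory use stays constant instead of scaling with the file size.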

Related

Read binary files without having them buffered in the volume block cache

Older, now deprecated, macOS file system APIs provided flags to read a file unbuffered.
I seek a modern way to accomplish the same, so that I can read a file's data into memory without it being cached needlessly somewhere else in memory (such as the volume cache).
Reading with fread after first calling setvbuf(fp, NULL, _IONBF, 0) does not have the desired effect in my tests, for example. I am looking for other low-level functions that let me read into a prepared memory buffer and avoid buffering of the whole data.
Background
I am writing a file search program. It reads large amounts of file content (many GBs) that isn't and won't be used by the user otherwise. It would be a waste to have all this data cached in the volume cache as it'll soon get purged by further reads again, anyway. It'll also likely lead to purging file data that's actually in use by the user or system, causing more cache misses.
Therefore, I should be able to tell the system that I do not need the file data cached. The little caching needed for cluster boundaries is not an issue; it's the many large chunks that I read briefly into memory to search that don't need to be cached.
Two suggestions:
Use the read() system call instead of stdio.
Disable data caching with the F_NOCACHE option for fcntl().
In Swift that would be something like (error checking omitted for brevity):
import Foundation
let path = "/path/to/file"
let fd = open(path, O_RDONLY)
fcntl(fd, F_NOCACHE, 1)
var buffer = Data(count: 1024 * 1024)
buffer.withUnsafeMutableBytes { ptr in
    let amount = read(fd, ptr.baseAddress, ptr.count)
}
close(fd)
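For what it's worth, here is the same pair of suggestions translated to Go as a rough sketch (this assumes the golang.org/x/sys/unix package, which exposes the macOS-only F_NOCACHE fcntl command; the path is a placeholder and error handling is minimal):
package main

import (
    "log"
    "os"

    "golang.org/x/sys/unix"
)

func main() {
    f, err := os.Open("/path/to/file")
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    // Ask the kernel not to keep this file's data in the unified buffer cache.
    if _, err := unix.FcntlInt(f.Fd(), unix.F_NOCACHE, 1); err != nil {
        log.Fatal(err)
    }

    buf := make([]byte, 1024*1024) // reused 1MB buffer; plain read() under the hood
    for {
        n, err := f.Read(buf)
        if n > 0 {
            _ = buf[:n] // search this chunk here
        }
        if err != nil {
            break // io.EOF or a real error
        }
    }
}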

How ext4 works with fallocate

Recently I have been testing the proper usage of the ext4 filesystem. What I expect is that:
when the system crashes, data whose write has returned OK must not be lost, but metadata may be.
Here is my usage:
1. Call fallocate to allocate a certain amount of space:
fallocate(fd, 0, 0, 4*1024*1024); // 4MB
2. Call fsync(fd) so that the data and metadata are written to disk.
3. Then I write random 4k-sized chunks (random data, but not zeros) at random offsets in the file with the O_DIRECT flag, but without calling fsync. I log each offset whose write returns OK.
4. Check the logged offsets. At some offsets, reading 4k of data returns zeros. That seems to mean those offsets are unused, like holes in a sparse file.
My questions are:
1. Why, after calling fallocate and fsync, does the file's metadata still seem to indicate that some blocks are unused, so that reading them returns zeros? That is how I understand what is happening.
2. Is there another API I can call to make sure the allocated space in the file contains no holes, so that afterwards, when a write returns OK with O_DIRECT, the data will not be lost even if the system crashes?
Thanks.
Only writing to the file space can eliminate the hole. Without writing, there is no dirty page and fsync simply does nothing.
I am wondering how you carried out step 4. It seems that you did it after a manual crash, did you? If you read the data back after writing, without a crash, it should not be zero, provided you wrote non-zeros. If you read it after a crash, zeros can appear if a disk write cache was involved. However, this kind of zero is not like a hole; these are zeros actually read from the disk (very probably the disk contains zeros).
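To illustrate the point that only an actual write (plus a sync) turns the preallocated, unwritten extents into durable data, here is a small sketch in Go (assuming Linux and the golang.org/x/sys/unix package; the file name is made up, and O_DIRECT with its alignment requirements is left out for brevity):
package main

import (
    "log"
    "os"

    "golang.org/x/sys/unix"
)

func main() {
    f, err := os.OpenFile("testfile", os.O_RDWR|os.O_CREATE, 0644)
    if err != nil {
        log.Fatal(err)
    }
    defer f.Close()

    // Reserve 4MB. The extents are allocated but flagged "unwritten",
    // so reads from this range return zeros until real data lands in it.
    if err := unix.Fallocate(int(f.Fd()), 0, 0, 4*1024*1024); err != nil {
        log.Fatal(err)
    }

    // Write real data somewhere inside the reserved range...
    if _, err := f.WriteAt([]byte("real data"), 4096); err != nil {
        log.Fatal(err)
    }

    // ...and fsync, so that both the data and the extent-state change
    // (unwritten -> written) reach the disk before we rely on them.
    if err := f.Sync(); err != nil {
        log.Fatal(err)
    }
}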

How do Emacs Lisp programmers read text files for non-editing purposes?

What do Emacs Lisp programmers do when they want to write something roughly equivalent to...
for line in open("foo.txt", "r", encoding="utf-8").readlines():
    ...(split on ws and call a fn, or whatever)...
..?
When I look in the Emacs Lisp help, I see functions about opening files into text editing buffers -- not exactly what I was intending. I suppose I could write functions to visit the lines of the file, but if I did that, I wouldn't want the user to see it, and besides, it doesn't seem very efficient from a text-processing standpoint.
I think a more direct translation of the original Python code is as follows:
(with-temp-buffer
  (insert-file-contents "foo.txt")
  (while (search-forward-regexp "\\(.*\\)\n?" nil t)
    ;; do something with this line in (match-string 1)
    ))
I think with-temp-buffer/insert-file-contents is generally preferable to with-current-buffer/find-file-noselect, because the former guarantees that you're working with a fresh copy of the entire file contents. With the latter construction, if you happen to already have a buffer visiting the target file, then that buffer is returned by find-file-noselect, so if that buffer has been narrowed, you'll only see that part of the file when you process it.
Keep in mind that it may very well be more convenient not to process the file line-by-line. For example, this is an expression that returns a list of all sequences of consecutive digits in the file:
(with-temp-buffer
  (insert-file-contents "foo.txt")
  (loop while (search-forward-regexp "[0-9]+" nil t)
        collect (match-string 0)))
You'll need to (require 'cl) first to bring in the loop macro.
Yes, that is what you want to do: visit the file in a buffer, and operate on the text in that buffer.
You do not have to display the buffer, i.e., the user need not see it.
And as for efficiency: manipulating text in a buffer is typically the most efficient way to manipulate text.
You can visit a file in a buffer in several ways. You might want to use an existing file buffer for this, depending on the use case. That is, if the file is already "open" in Emacs then you might want to use its buffer.
Or you might want to disregard any existing file buffer for an already "open" file, and read the file anew into a new buffer. For that, as #Sean mentions, you can use insert-file-contents with a buffer that you create. You can create the buffer using with-temp-buffer or generate-new-buffer, depending, again, on what you want/need to do with it.
If you do want to reuse a buffer that is already visiting the file, you can test whether it has been modified in memory, whether it is narrowed, etc., and do whatever is appropriate for your use case. You can check whether there is already a buffer visiting the file (using any path/file name) using function find-buffer-visiting.
To visit the file, taking advantage of any existing buffer that is visiting it, you can use find-file-noselect. That function returns the buffer that visits the file, so you can pass that buffer as the first argument to with-current-buffer. Here is a simple example.
(with-current-buffer (let ((enable-local-variables ())) (find-file-noselect file))
  ;; Do some stuff with the text in the buffer.
  ;; Optionally save the buffer back to the file.
  )
(The binding of enable-local-variables to nil is a minor optimization, for the common case where you don't need to bother with buffer-local variables.)

Byte InputRange from file

How to construct easily a raw byte-by-byte InputRange/ForwardRange/RandomAccessRange from a file?
file.byChunk(4096).joiner
This reads a file in 4096-byte chunks and lazily joins the chunks together into a single ubyte input range.
joiner is from std.algorithm, so you'll have to import it first.
The easiest way to make a raw byte range from a file is to just read it all right into memory:
import std.file;
auto data = cast(ubyte[]) read("filename");
// data is a full-featured random access range of the contents
If the file is too large for that to be reasonable, you could try a memory-mapped file http://dlang.org/phobos/std_mmfile.html and use the opSlice to get an array off it. Since it is an array, you get full range features, but since it is memory mapped by the operating system, you get lazy reading as you touch the file.
For a simple InputRange, there's LockingTextReader (undocumented) in Phobos, or you could construct one yourself over byChunk or even fgetc, the C function. fgetc would be the easiest to write:
import core.stdc.stdio : FILE, fgetc, feof;

struct FileByByte {
    ubyte front;
    void popFront() { front = cast(ubyte) fgetc(fp); }
    bool empty() { return feof(fp) != 0; } // feof returns int, so compare explicitly
    FILE* fp;
    this(FILE* fp) { this.fp = fp; popFront(); /* prime it */ }
}
I haven't actually tested that, but I'm pretty sure it'd work. (BTW the file open and close are separate from this because ranges are supposed to be just views into data, not managed containers. You wouldn't want the file closed just because you passed this range into a function.)
This is not a forward nor random access range though. Those are trickier to do on streams without a lot of buffering code and I think that'd be a mistake to try to write - generally, ranges should be cheap, not emulating features the underlying container doesn't natively support.
EDIT: The other answer has a non-buffering way! https://stackoverflow.com/a/30278933/1457000 That's awesome.

How to force Haskell not to store the whole bytestring?

I'm writing a relatively small application in Haskell for academic purposes. I'm implementing Huffman compression, based on this code http://www.haskell.org/haskellwiki/Toy_compression_implementations .
My variant of this code is here https://github.com/kravitz/har/blob/a5d221f227c27fd1c5587217a29a169a377521a6/huffman.hs and it uses lazy bytestrings. When I implemented RLE compression everything was smooth, because it processes the input stream in one pass. But Huffman processes it twice, and as a result I end up with a fully evaluated bytestring stored in memory, which is bad for big files (and even for relatively small files it allocates too much heap). That is not just my suspicion: profiling also shows that most of the heap is eaten by bytestring allocation.
I also serialize the stream length into the file, which may likewise force the full bytestring to be loaded into memory. Is there any simple way to ask GHC to kindly re-evaluate the stream several times?
Instead of passing a bytestring to the encoder, you can pass something that computes a bytestring, then explicitly recompute the value each time you need it.
compress :: ST s ByteString -> ST s ByteString
compress makeInput = do
    len      <- (return $!) . ByteString.length =<< makeInput
    codebook <- (return $!) . makeCodebook =<< makeInput
    return . encode len codebook =<< makeInput

compressIO :: IO ByteString -> IO ByteString
compressIO m = stToIO (compress (unsafeIOToST m))
The parameter to compress should actually compute the value. Simply wrapping a value with return won't work. Also, each call to makeInput must actually have its result evaluated, else there will remain a lazy, un-evaluated copy of the input in memory when the input is recomputed.
The usual approach, as barsoap said, is to just compress one block at a time.
The usual approach when Huffman-compressing, since one can't get around processing the input twice (once to collect the probability distribution, once to do the actual compressing), is to chunk the input into blocks and compress each separately. While that still eats memory, it only eats, at most, a constant amount.
That said, you might want to have a look at bytestring-mmap, though that won't work with standard input, sockets, and other file descriptors that aren't backed by a file system which supports mmap.
You can also re-read the bytestring from file (again, provided you're not receiving it from anything pipe-like) after collecting the probability distribution, but that will still make your code bail out on say 1TB files.