Byte InputRange from file - file-io

How can I easily construct a raw byte-by-byte InputRange/ForwardRange/RandomAccessRange from a file?

file.byChunk(4096).joiner
This reads a file in 4096-byte chunks and lazily joins the chunks together into a single ubyte input range.
joiner is from std.algorithm, so you'll have to import it first.
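For reference, a complete version of that approach might look like this (untested sketch; "filename" is just a placeholder):

import std.algorithm : joiner;
import std.stdio : File;

void main()
{
    // Lazily yields one ubyte at a time, reading the file 4096 bytes per I/O call.
    auto bytes = File("filename").byChunk(4096).joiner;
    foreach (b; bytes)
    {
        // consume each ubyte here
    }
}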

The easiest way to make a raw byte range from a file is to just read it all right into memory:
import std.file;
auto data = cast(ubyte[]) read("filename");
// data is a full-featured random access range of the contents
If the file is too large for that to be reasonable, you could try a memory-mapped file (http://dlang.org/phobos/std_mmfile.html) and use its opSlice to get an array over it. Since it is an array, you get full range features, but since it is memory mapped by the operating system, you get lazy reading as you touch the file.
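The memory-mapped version would look something like this (untested sketch; the file name is a placeholder):

import std.mmfile : MmFile;

void main()
{
    // The OS maps the file into memory; pages are only read from disk as you touch them.
    auto mmfile = new MmFile("filename");
    auto data = cast(ubyte[]) mmfile[]; // opSlice gives the whole file as an array
    // data now behaves like a full random access range over the file's bytes
}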
For a simple InputRange, there's LockingTextReader (undocumented) in Phobos, or you could construct one yourself over byChunk or even fgetc, the C function. fgetc would be the easiest to write:
import core.stdc.stdio : FILE, feof, fgetc;

struct FileByByte {
    FILE* fp;
    ubyte front;     // the byte most recently read
    this(FILE* fp) { this.fp = fp; popFront(); /* prime it */ }
    void popFront() { front = cast(ubyte) fgetc(fp); }
    bool empty() { return feof(fp) != 0; }
}
I haven't actually tested that, but I'm pretty sure it'd work. (BTW, the file open and close are separate from this because ranges are supposed to be just views into data, not managed containers. You wouldn't want the file closed just because you passed this range into a function.)
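Usage would look something like this (again untested; "filename" is a placeholder), with the open and close kept outside the range:

import core.stdc.stdio : FILE, fopen, fclose;

void main()
{
    FILE* fp = fopen("filename", "rb");
    scope(exit) fclose(fp); // lifetime is managed outside the range
    foreach (b; FileByByte(fp))
    {
        // consume each ubyte lazily here
    }
}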
This is not a forward or random access range, though. Those are trickier to do on streams without a lot of buffering code, and I think that'd be a mistake to try to write - generally, ranges should be cheap, not emulating features the underlying container doesn't natively support.
EDIT: The other answer has a non-buffering way! https://stackoverflow.com/a/30278933/1457000 That's awesome.

Related

How to read all bytes of a file in winrt with ReadBufferAsync?

I have an array of objects that each needs to load itself from binary file data. I create an array of these objects and then call an AsyncAction for each of them that starts it reading in its data. Trouble is, they are not loading entirely - they tend to get only part of the data from the files. How can I make sure that the whole thing is read? Here is an outline of the code: first I enumerate the folder contents to get a StorageFile for each file it contains. Then, in a for loop, each receiving object is created and passed the next StorageFile, and it creates its own Buffer and DataReader to handle the read. m_file_bytes is a std::vector.
m_buffer = co_await FileIO::ReadBufferAsync(nextFile);
m_data_reader = winrt::Windows::Storage::Streams::DataReader::FromBuffer(m_buffer);
m_file_bytes.resize(m_buffer.Length());
m_data_reader.ReadBytes(m_file_bytes);
My thought was that since the buffer and reader are class members of the object they would not go out of scope and could finish their work uninterrupted as the next objects were asked to load themselves in separate AsyncActions. But the DataReader only gets maybe half of the file data or less. What can be done to make sure it completes? Thanks for any insights.
[Update] Perhaps what is going on is that the file system can handle only one read task at a time, and by starting all these async reads each one interrupts the previous one? But there must be a way to progressively read a folder full of files.
[Update] I think I have it working, by adopting the principle of concentric loops - the idea is not to proceed to the next load until the first one has completed. I think (someone can correct me if I'm wrong) that the file system cannot do simultaneous reads. If there is an accepted and secure example of how to do this I would still love to hear about it, so I'm not answering my own question.
#include <wrl.h>
#include <robuffer.h>
uint8_t* GetBufferData(winrt::Windows::Storage::Streams::IBuffer& buffer)
{
    ::IUnknown* unknown = winrt::get_unknown(buffer);
    ::Microsoft::WRL::ComPtr<::Windows::Storage::Streams::IBufferByteAccess> bufferByteAccess;
    HRESULT hr = unknown->QueryInterface(__uuidof(::Windows::Storage::Streams::IBufferByteAccess), &bufferByteAccess);
    if (FAILED(hr))
        return nullptr;
    byte* bytes = nullptr;
    bufferByteAccess->Buffer(&bytes);
    return bytes;
}
https://learn.microsoft.com/en-us/cpp/cppcx/obtaining-pointers-to-data-buffers-c-cx?view=vs-2017
https://learn.microsoft.com/en-us/windows/uwp/xbox-live/storage-platform/connected-storage/connected-storage-using-buffers

Converting boost streambuf to string_ref

The following code compiles fine for me, but given the somewhat complex design of ASIO buffers, I'm not certain it's correct. The intention is to allow the contents of the streambuf to be given to an HTTP parser without creating an intermediate std::string object, which is what other ASIO code examples seem to do.
boost::string_ref makeStringRef(const boost::asio::streambuf& streambuf)
{
    auto&& bufferType = streambuf.data();
    return {
        boost::asio::buffer_cast<const char*>(bufferType),
        boost::asio::buffer_size(bufferType)
    };
}
I think this is, indeed, not correct, because the streambuf may have several non-contiguous regions.
So you need to copy anyway. Alternatively, just read into a fixed buffer instead. Of course, this requires you to know the maximum size ahead of time, or to read in several steps.
By the way, by taking a const& you risk creating a string_ref that refers to a temporary. Always try to be explicit about lifetime expectations.

How to serialize slice without length using bincode?

I'm using the bincode crate to write a structure into a file. The structure contains a slice with a fixed size. How can I force bincode to write only the slice's content without the slice's length?
#![allow(unstable)]
#![feature(custom_derive, plugin)]
#![plugin(serde_macros)]
extern crate serde;
extern crate bincode;
use std::fs::File;
use bincode::serde::serialize_into;
use bincode::SizeLimit;
#[derive(Serialize)]
struct Foo([u8; 16]);
fn main() {
    let data = Foo([0; 16]);
    let mut writer = File::create("test.bin").unwrap();
    serialize_into::<File, Foo>(&mut writer, &data, SizeLimit::Infinite).unwrap();
}
The file 'test.bin' is 24 bytes instead of 16.
I saw a related remark in the documentation of bincode, but I did not understand how to use it.
a slice with a fixed size
[u8; 16] is not a slice. It is an array which may be coerced to a slice.
Anyway... I do not believe that you can. The important function appears to be Serializer::serialize_fixed_size_array which is not implemented by the current serializer. That means it defaults to behaving the same as a slice.
Since slices do not have a length known at compile time, they must have their size written when serialized.
If no one else can provide a better suggestion, it's possible that the maintainer could find a way to make this happen. You may want to politely ask the maintainer for this feature or offer to help with the work.
Beyond that, it sounds like you are trying to make the bincode output fit a pre-existing format. That doesn't really make sense; bincode is its own format and has already made various choices and tradeoffs.
If you need to, you could implement your own encoder / decoder (either using serde or not). If you are concerned about file size, you can combine bincode with a compression step as well.

How to create a lazy-evaluated range from a file?

The File I/O API in Phobos is relatively easy to use, but right now I feel like it's not very well integrated with D's range interface.
I could create a range delimiting the full contents by reading the entire file into an array:
import std.file;
auto mydata = cast(ubyte[]) read("filename");
processData(mydata); // takes a range of ubytes
But this eager evaluation of the data might be undesired if I only want to retrieve a file's header, for example. The upTo parameter doesn't solve this issue if the file's format assumes a variable-length header or any other element we wish to retrieve. It could even be in the middle of the file, and read forces me to read all of the file up to that point.
But indeed, there are alternatives. readf, readln, byLine and most particularly byChunk let me retrieve pieces of data until I reach the end of the file, or just when I want to stop reading the file.
import std.stdio;
auto file = File("filename");
auto chunkRange = file.byChunk(1000); // a range of ubyte[]s
processData(chunkRange); // oops! not expecting chunks!
But now I have introduced the complexity of dealing with fixed size chunks of data, rather than a continuous range of bytes.
So how can I create a simple input range of bytes from a file that is lazy evaluated, either by characters or by small chunks (to reduce the number of reads)? Can the range in the second example be seamlessly encapsulated in a way that the data can be processed like in the first example?
You can use std.algorithm.joiner:
auto r = File("test.txt").byChunk(4096).joiner();
Note that byChunk reuses the same buffer for each chunk, so you may need to add .map!(chunk => chunk.idup) to lazily copy the chunks to the heap.
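A sketch of that variant (building on the example above):

import std.algorithm : joiner, map;
import std.stdio : File;

void main()
{
    // Each reused chunk buffer is copied to the heap before joining,
    // so bytes you've already read stay valid as the range advances.
    auto r = File("test.txt").byChunk(4096).map!(chunk => chunk.idup).joiner();
}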

File reading and checksums in Go. Difference between methods

Recently I've been creating checksums for files in Go. My code works with small and big files. I tried two methods: the first uses ioutil.ReadFile("filename") and the second works with os.Open("filename").
Examples:
The first function works with io/ioutil and is fine for small files. When I try to copy a big file, my RAM gets blasted - for a 1.5GB iso it uses 3GB of RAM.
func byteCopy(fileToCopy string) {
    file, err := ioutil.ReadFile(fileToCopy) // 1.5GB file
    omg(err)                                 // error handling function
    ioutil.WriteFile("2.iso", file, 0777)
    os.Remove("2.iso")
}
It's even worse when I want to create a checksum with crypto/sha512 and io/ioutil.
It never finishes and aborts because it runs out of memory.
func ioutilHash() {
    file, _ := ioutil.ReadFile(iso)
    h := sha512.New()
    fmt.Printf("%x", h.Sum(file))
}
When using the function below everything works fine.
func ioHash() {
    f, err := os.Open(iso) // iso is a big ~1.5GB file
    omg(err)               // error handling function
    defer f.Close()
    h := sha512.New()
    io.Copy(h, f)
    fmt.Printf("%x", h.Sum(nil))
}
My Question:
Why is the ioutil.ReadFile() function not working right? The 1.5GB file should not fill my 16GB of RAM. I don't know where to look right now.
Could somebody explain the differences between the methods? I don't get it from reading the go-doc and the examples.
Having usable code is nice, but understanding why it works is way above that.
Thanks in advance!
The following code doesn't do what you think it does.
func ioutilHash() {
    file, _ := ioutil.ReadFile(iso)
    h := sha512.New()
    fmt.Printf("%x", h.Sum(file))
}
This first reads your 1.5GB iso. As jnml pointed out, it continuously makes bigger and bigger buffers to fill it. In the end, the total buffer size is no less than 1.5GB and no greater than 1.875GB (by the current implementation).
However, after that you then make another buffer! h.Sum(file) doesn't hash file. It appends the current hash to file! This may or may not cause yet another allocation.
The real problem is that you are taking that file, now with the hash appended, and printing it with %x. Fmt actually pre-computes the result using the same kind of growing-buffer method that jnml pointed out ioutil.ReadAll uses, so it constantly allocates bigger and bigger buffers to store the hex of your file. Since each byte becomes two hex characters, that means we are talking about no less than a 3GB buffer for that and no greater than 3.75GB.
This means your active buffers may be as big as 5.625GB. Combine that with the GC not being perfect and not removing all the intermediate buffers, and it could very easily fill your space.
The correct way to write that code would have been:
func ioutilHash() {
    file, _ := ioutil.ReadFile(iso)
    h := sha512.New()
    h.Write(file)
    fmt.Printf("%x", h.Sum(nil))
}
This doesn't do nearly the same number of allocations.
The bottom line is that ReadFile is rarely what you want to use. IO streaming (using readers and writers) is always the best way when it is an option. Not only do you allocate much less when you use io.Copy, you also hash and read the disk concurrently. In your ReadFile example, the two resources are used synchronously when they don't depend on each other.
ioutil.ReadFile is working right. It's your fault for abusing the system resources by using that function for things you know are huge.
ioutil.ReadFile is a handy helper for files you're pretty sure in advance that they're going to be small. Like configuration files, most source code files etc. (Actually it's optimizing things for files <= 1e9 bytes, but that's an implementation detail and not part of the API contract. Your 1.5GB file forces it to use slice growing and thus allocating more than one big buffer for your data in the process of reading the file.)
Even your other approach using os.File is not okay. You definitely should be using the "bufio" package for sequential processing of large files, see bufio.NewReader.