How do Emacs Lisp programmers read text files for non-editing purposes?

What do Emacs Lisp programmers do when they want to write something roughly equivalent to...
for line in open("foo.txt", "r", encoding="utf-8").readlines():
...(split on ws and call a fn, or whatever)...
..?
When I look in the Emacs Lisp help, I see functions for opening files into text-editing buffers -- not exactly what I was intending. I suppose I could write functions to visit the lines of the file, but if I did that, I wouldn't want the user to see it, and besides, it doesn't seem very efficient from a text-processing standpoint.

I think a more direct translation of the original Python code is as follows:
(with-temp-buffer
  (insert-file-contents "foo.txt")
  ;; The (not (eobp)) guard stops the loop at end of buffer, where the
  ;; regexp could otherwise keep matching the empty string.
  (while (and (not (eobp))
              (search-forward-regexp "\\(.*\\)\n?" nil t))
    ;; do something with this line in (match-string 1)
    ))
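For the "split on whitespace and call a function" part of the question, a minimal sketch might look like this (process-fields is a hypothetical stand-in for whatever you want to call on each line's fields):
(with-temp-buffer
  (insert-file-contents "foo.txt")
  (while (not (eobp))
    ;; Take the current line without text properties, split it on
    ;; whitespace, and hand the fields to a function of your choosing.
    (let ((fields (split-string (buffer-substring-no-properties
                                 (line-beginning-position)
                                 (line-end-position)))))
      (process-fields fields))  ; `process-fields' is hypothetical
    (forward-line 1)))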
I think with-temp-buffer/insert-file-contents is generally preferable to with-current-buffer/find-file-noselect, because the former guarantees that you're working with a fresh copy of the entire file contents. With the latter construction, if you happen to already have a buffer visiting the target file, then that buffer is returned by find-file-noselect, so if that buffer has been narrowed, you'll only see that part of the file when you process it.
Keep in mind that it may very well be more convenient not to process the file line-by-line. For example, this is an expression that returns a list of all sequences of consecutive digits in the file:
(with-temp-buffer
  (insert-file-contents "foo.txt")
  (loop while (search-forward-regexp "[0-9]+" nil t)
        collect (match-string 0)))
You'll need to (require 'cl) first to bring in the loop macro.

Yes, that is what you want to do: visit the file in a buffer, and operate on the text in that buffer.
You do not have to display the buffer, i.e., the user need not see it.
And as for efficiency: manipulating text in a buffer is typically the most efficient way to manipulate text.
You can visit a file in a buffer in several ways. You might want to use an existing file buffer for this, depending on the use case. That is, if the file is already "open" in Emacs then you might want to use its buffer.
Or you might want to disregard any existing file buffer for an already "open" file, and read the file anew into a new buffer. For that, as @Sean mentions, you can use insert-file-contents with a buffer that you create. You can create the buffer using with-temp-buffer or generate-new-buffer, depending, again, on what you want/need to do with it.
If you do want to reuse a buffer that is already visiting the file, you can test whether it has been modified in memory, whether it is narrowed, etc., and do whatever is appropriate for your use case. You can check whether there is already a buffer visiting the file (using any path/file name) using function find-buffer-visiting.
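For instance, a rough sketch of reusing such a buffer (the file name "foo.txt" is just an example; the widen only matters if the buffer happens to be narrowed):
;; Reuse a buffer that is already visiting the file, if there is one,
;; temporarily widening it so the whole file is visible.
(let ((buf (find-buffer-visiting "foo.txt")))
  (when buf
    (with-current-buffer buf
      (save-restriction
        (widen)
        ;; ... operate on the full buffer contents here ...
        ))))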
To visit the file, taking advantage of any existing buffer that is visiting it, you can use find-file-noselect. That function returns the buffer that visits the file, so you can pass that buffer as the first argument to with-current-buffer. Here is a simple example.
(with-current-buffer (let ((enable-local-variables ()))
                       (find-file-noselect file))
  ;; Do some stuff with the text in the buffer.
  ;; Optionally save the buffer back to the file.
  )
(The binding of enable-local-variables to nil is a minor optimization, for the common case where you don't need to bother with buffer-local variables.)

how ext4 works with fallocate

Recently I have been testing the proper usage of the ext4 filesystem. What I expect is that:
when the system crashes, data whose write has already returned OK must not be lost, but metadata may be.
Here is my usage:
1. Call fallocate to allocate a certain amount of space:
fallocate(fd, 0, 0, 4*1024*1024); // 4MB
2. Call fsync(fd) so that both data and metadata are written to disk.
3. Then I call a function that randomly writes 4 KB blocks to the file (random data, not zeros) with the O_DIRECT flag, but without calling fsync. I log each offset whose write returns OK.
4. Check the logged offsets. At some of them, reading 4 KB of data returns zeros. That seems to mean the offset is still unused, like a hole in the file.
My questions are:
1. Why, after calling fallocate and fsync, does the file's metadata still seem to indicate that some blocks are unused, so that reading them returns zeros? Is my understanding correct?
2. Is there another API I can call to make sure the allocated space contains no holes, so that once a write with O_DIRECT returns OK the data will not be lost even if the system crashes?
Thanks.
Only writing to the file space can eliminate the hole. Without writing, there is no dirty page and fsync simply does nothing.
I am wondering how you executed step 4. It seems that you did it after a manual crash, did you? If you read the data back after the write, without a crash, it should not be zero, provided you wrote non-zeros. If you read it after a crash, zeros can appear if a disk write cache was involved. However, this kind of zero is not like a hole; these are zeros actually read from the disk (very probably the disk contains zeros there).
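For reference, a minimal C sketch of the sequence described in the question (Linux-specific: fallocate and O_DIRECT are not portable; O_DIRECT needs aligned buffers, hence posix_memalign; error handling mostly omitted):
#define _GNU_SOURCE            /* for O_DIRECT and fallocate() */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("testfile", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0)
        return 1;

    fallocate(fd, 0, 0, 4 * 1024 * 1024);  /* 1. preallocate 4 MB          */
    fsync(fd);                             /* 2. flush data and metadata   */

    void *buf;
    posix_memalign(&buf, 4096, 4096);      /* aligned buffer for O_DIRECT  */
    memset(buf, 'A', 4096);                /* non-zero payload             */

    pwrite(fd, buf, 4096, 8 * 4096);       /* 3. direct write, no fsync    */

    free(buf);
    close(fd);
    return 0;
}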

How to create a lazy-evaluated range from a file?

The File I/O API in Phobos is relatively easy to use, but right now I feel like it's not very well integrated with D's range interface.
I could create a range delimiting the full contents by reading the entire file into an array:
import std.file;
auto mydata = cast(ubyte[]) read("filename");
processData(mydata); // takes a range of ubytes
But this eager evaluation of the data might be undesired if I only want to retrieve a file's header, for example. The upTo parameter doesn't solve this issue if the file's format assumes a variable-length header or any other element we wish to retrieve. It could even be in the middle of the file, and read forces me to read all of the file up to that point.
But indeed, there are alternatives. readf, readln, byLine and most particularly byChunk let me retrieve pieces of data until I reach the end of the file, or just when I want to stop reading the file.
import std.stdio;
File file("filename");
auto chunkRange = file.byChunk(1000); // a range of ubyte[]s
processData(chunkRange); // oops! not expecting chunks!
But now I have introduced the complexity of dealing with fixed size chunks of data, rather than a continuous range of bytes.
So how can I create a simple input range of bytes from a file that is lazy evaluated, either by characters or by small chunks (to reduce the number of reads)? Can the range in the second example be seamlessly encapsulated in a way that the data can be processed like in the first example?
You can use std.algorithm.joiner:
auto r = File("test.txt").byChunk(4096).joiner();
Note that byChunk reuses the same buffer for each chunk, so you may need to add .map!(chunk => chunk.idup) to lazily copy the chunks to the heap.
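Putting the two together, a sketch (the file name and the use of count to drive the range are just for illustration):
import std.stdio : File, writeln;
import std.algorithm : count, joiner, map;

void main()
{
    // Lazily stream the file as a range of bytes; idup copies each chunk
    // off byChunk's reused internal buffer so earlier bytes stay valid.
    auto bytes = File("test.txt")
        .byChunk(4096)
        .map!(chunk => chunk.idup)
        .joiner();

    writeln(bytes.count);  // consume the range without loading the whole file at once
}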

Erlang binary protocol serialization

I'm currently using Erlang for a big project, but I have a question regarding the proper way to proceed.
I receive bytes over a TCP socket. The bytes follow a fixed protocol; the sender is a Python client. The Python client uses class inheritance to create the bytes from its objects.
Now I would like (in Erlang) to take the bytes and convert them to their equivalent messages; they all have a common message header.
How can I do this as generically as possible in Erlang?
Kind Regards,
Me
Pattern matching/binary header consumption using Erlang's binary syntax. But you will need to know either exactly what bytes or bits you are expecting to receive, or the field sizes in bytes or bits.
For example, let's say that you are expecting a string of bytes that will either begin with the equivalent of the ASCII strings "PUSH" or "PULL", followed by some other data you will place somewhere. You can create a function head that matches those, and captures the rest to pass on to a function that does "push()" or "pull()" based on the byte header:
operation_type(<<"PUSH", Rest/binary>>) -> push(Rest);
operation_type(<<"PULL", Rest/binary>>) -> pull(Rest).
The bytes after the first four will now be in Rest, leaving you free to interpret whatever subsequent headers or data remain in turn. You could also match on the whole binary:
operation_type(Bin = <<"PUSH", _/binary>>) -> push(Bin);
operation_type(Bin = <<"PULL", _/binary>>) -> pull(Bin).
In this case the "_" variable works like it always does -- you're just checking for the lead, essentially peeking the buffer and passing the whole thing on based on the initial contents.
You could also skip around in it. Say you knew you were going to receive a binary with 4 bytes of fluff at the front, 6 bytes of type data, and then the rest you want to pass on:
filter_thingy(<<_:4/binary, Type:6/binary, Rest/binary>>) ->
    %% Do stuff with Rest based on Type...
    {Type, Rest}.
It becomes very natural to split binaries in function headers (whether the data equates to character strings or not), letting the "Rest" fall through to appropriate functions as you go along. If you are receiving Python pickle data or something similar, you would want to write the parsing routine in a recursive way, so that the conclusion of each data type returns you to the top to determine the next type, with an accumulated tree that represents the data read so far.
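A rough sketch of that recursive shape, assuming a made-up framing of a one-byte type tag followed by a 16-bit big-endian payload length (called as parse(Bin, [])):
parse(<<>>, Acc) ->
    lists:reverse(Acc);
parse(<<Type:8, Len:16/big, Payload:Len/binary, Rest/binary>>, Acc) ->
    %% Decode one frame, accumulate it, and recurse on whatever remains.
    parse(Rest, [{Type, Payload} | Acc]).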
I only covered 8-bit bytes above, but there is also a pure bitstring syntax, which lets you go as far into the weeds with bits and bytes as you need with the same ease of syntax. Matching is a real lifesaver here.
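For example, a hypothetical header packed at the bit level, where a 3-bit version and 5 flag bits share a single byte:
version_and_flags(<<Version:3, Flags:5, Rest/binary>>) ->
    %% Version and Flags come out as plain integers built from 3 and 5 bits.
    {Version, Flags, Rest}.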
Hopefully this informed more than confused. Binary syntax in Erlang makes this the most pleasant binary parsing environment in a general programming language I've yet encountered.
http://www.erlang.org/doc/programming_examples/bit_syntax.html

File reading and checksums in go. Difference between methods

Recently I've been creating checksums for files in Go. My code needs to work with both small and big files. I tried two methods: the first uses ioutil.ReadFile("filename"), and the second works with os.Open("filename").
Examples:
The first function works with io/ioutil and is fine for small files. When I try to copy a big file, my RAM gets blasted, and for a 1.5GB iso it uses 3GB of RAM.
func byteCopy(fileToCopy string) {
    file, err := ioutil.ReadFile(fileToCopy) // 1.5GB file
    omg(err)                                 // error handling function
    ioutil.WriteFile("2.iso", file, 0777)
    os.Remove("2.iso")
}
It gets even worse when I want to create a checksum with crypto/sha512 and io/ioutil: it never finishes and aborts because it runs out of memory.
func ioutilHash() {
    file, _ := ioutil.ReadFile(iso)
    h := sha512.New()
    fmt.Printf("%x", h.Sum(file))
}
When using the function below everything works fine.
func ioHash() {
    f, err := os.Open(iso) // iso is a big ~1.5GB file
    omg(err)               // error handling function
    defer f.Close()
    h := sha512.New()
    io.Copy(h, f)
    fmt.Printf("%x", h.Sum(nil))
}
My Question:
Why is the ioutil.ReadFile() function not working right? The 1.5GB file should not fill my 16GB of RAM. I don't know where to look right now.
Could somebody explain the differences between the methods? I don't get it from reading the godoc and examples.
Having usable code is nice, but understanding why it works is worth even more.
Thanks in advance!
The following code doesn't do what you think it does.
func ioutilHash() {
    file, _ := ioutil.ReadFile(iso)
    h := sha512.New()
    fmt.Printf("%x", h.Sum(file))
}
This first reads your 1.5GB iso. As jnml pointed out, it continuously makes bigger and bigger buffers to fill it. In the end, the total buffer size is no less than 1.5GB and no greater than 1.875GB (by the current implementation).
However, after that you then make another buffer! h.Sum(file) doesn't hash file. It appends the current hash to file! This may or may not cause yet another allocation.
The real problem is that you are taking that file, now appended with the hash, and printing it with %x. Fmt actually pre-computes using the same type of method jnml pointed out that ioutil.ReadAll uses, so it constantly allocates bigger and bigger buffers to store the hex of your file. Since each hex character encodes only 4 bits, the hex output is twice the size of the input: that means no less than a 3GB buffer for that, and no greater than 3.75GB.
This means your active buffers may be as big as 5.625GB. Combine that with the GC not being perfect and not removing all the intermediate buffers, and it could very easily fill your available memory.
The correct way to write that code would have been:
func ioutilHash() {
    file, _ := ioutil.ReadFile(iso)
    h := sha512.New()
    h.Write(file)
    fmt.Printf("%x", h.Sum(nil))
}
This doesn't do nearly the same number of allocations.
The bottom line is that ReadFile is rarely what you want to use. IO streaming (using readers and writers) is always the best way when it is an option. Not only do you allocate much less when you use io.Copy, you also interleave reading the disk and hashing, instead of finishing the whole read before any hashing starts. In your ReadFile example, the two resources are used strictly one after the other even though they don't depend on each other.
ioutil.ReadFile is working correctly. It's your fault for abusing system resources by using that function for things you know are huge.
ioutil.ReadFile is a handy helper for files you're pretty sure in advance that they're going to be small. Like configuration files, most source code files etc. (Actually it's optimizing things for files <= 1e9 bytes, but that's an implementation detail and not part of the API contract. Your 1.5GB file forces it to use slice growing and thus allocating more than one big buffer for your data in the process of reading the file.)
Even your other approach using os.File is not okay. You definitely should be using the "bufio" package for sequential processing of large files, see bufio.NewReader.
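For example, a sketch of the hashing function using a buffered reader (reusing the iso variable and omg helper from the question; imports needed: bufio, crypto/sha512, fmt, io, os):
func bufioHash() {
    f, err := os.Open(iso)
    omg(err)
    defer f.Close()

    h := sha512.New()
    // Feed the hash from buffered sequential reads instead of
    // loading the whole file into memory first.
    r := bufio.NewReader(f)
    if _, err := io.Copy(h, r); err != nil {
        omg(err)
    }
    fmt.Printf("%x", h.Sum(nil))
}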

Multithreading a filestream in vb2005

I am trying to build a resource file for a website, basically jamming all the images into a compressed file that is then unpacked onto the output buffers to the client.
My question is: in VB2005, can a FileStream be multithreaded if you know the size of the converted file, à la BitTorrent, so that you work on pieces of the filestream (the individual files in this case) and add them to the resource filestream as they are done, instead of one at a time?
If you need something similar to the torrent's way of writing to a file, this is how I would implement it:
1. Open a FileStream on thread T1, and create a queue "monitor" for step 2.
2. Create a queue that will be read from T1, but written by multiple network reader threads. (Each queue entry would look like this: (position to write at, size of the data buffer, the data buffer).)
3. Fire up the threads.
:)
Anyway, from your comments, your problem seems to be a different one.
I have found something in a book, but I'm not sure if it works:
If you want to write data to a file, two parallel methods are available, WriteByte() and Write(). WriteByte() writes a single byte to the stream:
byte NextByte = 100;
fs.WriteByte(NextByte);
Write(), on the other hand, writes out an array of bytes. For instance, if you initialized the ByteArray mentioned before with some values, you could use the following code to write out the first nBytes of the array:
fs.Write(ByteArray, 0, nBytes);
Citation from: Nagel, Christian, Bill Evjen, Jay Glynn, Morgan Skinner, and Karli Watson. "Chapter 24 - Manipulating Files and the Registry". Professional C# 2005 with .NET 3.0. Wrox Press. © 2007. Books24x7. http://common.books24x7.com/book/id_20568/book.asp (accessed July 22, 2009)
I'm not sure if you're asking if a System.IO.FileStream object can be read from or written to in a multi-threaded fashion. But the answer in both cases is no. This is not a supported scenario. You will need to add some form of locking to ensure serialized access to the resource.
The documentation calls out multi-threaded access to the object as an unsupported scenario:
http://msdn.microsoft.com/en-us/library/system.io.filestream.aspx
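As a minimal sketch of that kind of serialization in VB2005 (the shared stream, offsets and buffers are assumed to come from your own code; SyncLock guarantees the seek and write are not interleaved by another thread):
' Shared state used by all worker threads (hypothetical names).
Private ReadOnly _streamLock As New Object()
Private _resourceStream As IO.FileStream

Sub WritePiece(ByVal offset As Long, ByVal buffer As Byte())
    ' Only one thread at a time may position and write the stream,
    ' so a seek can never be interleaved with another thread's write.
    SyncLock _streamLock
        _resourceStream.Seek(offset, IO.SeekOrigin.Begin)
        _resourceStream.Write(buffer, 0, buffer.Length)
    End SyncLock
End Sub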