Should a class or method which processes a file close the file as a side effect? - file-io

I'm wondering which is the more 'Pythonic' / better way to write methods which process files. Should the method which processes the file close that file as a side effect? Should the concept of the data being a 'file' be completely abstracted from the method which is processing the data, meaning it should expect some 'stream' but not necessarily a file?:
As an example, is it OK to do this:
process(open('somefile','r'))
... carry on
Where process() closes the file handle:
def process(somefile):
# do some stuff with somefile
somefile.close()
Or is this better:
file = open('somefile','r')
process(file)
file.close()
For what it's worth, I generally am using Python to write relatively simple scripts that are extremely specifically purposed, where I'll likely be the only one ever using them. That said, I don't want to teach myself any bad practices, and I'd rather learn the best way to do things, as even a small job is worth doing well.

Generally, it is better practice for the opener of a file to close the file. In your question, the second example is better.
This is to prevent possible confusion and invalid operations.
Edit: If your real code isn't any more complex than your example code, then it might be better just to have process() open and close the file, since the caller doesn't use the file for anything else. Just pass in the path/name of the file. But if it is conceivable that a caller to process() will use the file before the file should be closed, then leave the file open and close operations outside of process().

Using with means not having to worry about this.
At least, not on 2.5+.

No, because it reduces flexibility: if function doesn't close file and caller doesn't need it - caller can close file. If it does but caller still need file open - caller stuck with reopening file, using StringIO, or something else.
Also closing file object require additional assumptions on it's real type (object can support .read() or .write(), but have no meaningful .close()), hindering duck-typing.
And files-left-open is not a problem for CPython - it will be closed on garbage collection instantly after something(open('somefile')). (other implementation will gc and close file too, but at unspecified moment)

Related

Track external file content changes

I've fully implemented the NSFilePresenter protocol using NSFileCoordinator to track external changes made to an imported folder/file tree. And it generally works.
However, I'm still a bit confused.
I've implemented almost all (if not all) of the protocol's functions.
presentedSubitemDidChangeAtURL:, presentedSubitemAtURL:didMoveToURL: and presentedItemDidChange are the only ones getting called
I am able to successfully track new folders being added, or files moving. However when going to track file content changes (edited and saved by another app) things get a bit complicated: The method to be called is presentedItemDidChange (which gets obviously called in other cases as well)
What am I missing?
Shouldn't either presentedItemDidGainVersion: or savePresentedItemChangesWithCompletionHandler: get called?
Any help or pointers are more than welcome! :)

Random access files at clean VB.NET way

After working some time in VB.NET I would like to get rid of Microsoft.VisualBasic dependencies.
Since with text files and string manipulation goes easy here I don't know what to do.
Is it possible to write equivalent code in VB.NET without using Microsoft.VisualBasic namespace and how this code should look like?
Dim fnum As Integer = FreeFile()
FileOpen(fnum, "Setup\myadmin", OpenMode.Random, OpenAccess.ReadWrite, OpenShare.Shared, Len(idstruct))
FilePut(fnum, idstruct, 1) 'structure data to file in record 1
FileClose(fnum)
As much as I sympathise with your desire to remove all references to the Microsoft.VisualBasic namespace, and as much as I think that doing so has value, sometimes it's just not worth the trouble. The namespace does contain some useful tools which are not easily reproduced without it.
For instance, the TextFieldParser comes to mind. It allows you to easily read CSV and fixed-width files. There is no other class like it in the .NET framework. So, is it worth it to reinvent the wheel just so that you don't reference the Microsoft.VisualBasic namespace? I would argue that it's not worth it.
While it would be possible to reproduce the behavior of FileGet and FilePut using FileStream and the StreamReader, StreamWriter, BinaryReader, and BinaryWriter classes, it's probably not worth all the trouble. The FileGet and FilePut methods are provided specifically for backwards compatibility, so if compatibility with old systems is your goal, as much as it pains me to say it, using FileGet and FilePut is an appropriate solution.
However, some of this advice hinges on the type of data. If, for instance, the structure only contains fixed width strings, that would be very easy to duplicate with the StreamReader and StreamWriter, or with the TextFieldParser. Or, if it contains just integers, perhaps it will be easy to reproduce with the BinaryReader and BinaryWriter.
However, even if you can easily reproduce the logic using other non-VB-only classes, doing so doesn't gain you anything. In fact, your code will be more complicated and it will be less self-documenting. When you see code using FileGet and FilePut, not only is it easy to tell what is being done, but it is also obvious that it is for backwards compatibility. If you replace them with your own logic, the necessity for backwards compatibility would not be obvious without adding comments to the code.
If you don't like looking at them, which I can certainly understand, it may be worth wrapping them in a wrapper class. For instance, you could create a data-access style class with load/save methods which internally just use FileGet and FilePut. Doing so would be good practice anyway. That way, if you ever choose to store the data in a different format, or a different data source (such as a database), you could change it in the one class without having to rewrite all of your code.
One other thing I've just found is this MSDN page about My.Computer.FileSystem:
http://msdn.microsoft.com/en-us/library/0b485hf7%28v=vs.90%29.aspx
which I found was referred to from the MSDN page on FilePut:
http://msdn.microsoft.com/en-us/library/0s9sa1ab%28v=vs.90%29.aspx
From posts like this one It's my understanding that it's just a wrapper for System.IO anyway, but it supposedly provides a "more convenient and understandable" interface to the underlying IO functions.
If you're referring to your use of the LEN() function (which is the only thing I can see that refers to Microsoft.VisualBasic namespace, unless I'm missing something), then you can just use String.Length instead. E.g. idstruct.Length, assuming idstruct is a String.
Look at System.IO.FileStream. The approach is a bit different, but it is quite simple.
Dim fs As New FileStream(mUserFile, FileMode.XXX, FileAccess.XXX)
or:
Using fs as New FileStream....
End Using
The main thing that will change is that rather than writing a structure, you would convert it to an array of bytes (Count would be the length of your array, offset 0 in your example):
fs.Write(byt(), lOffset, lCount)
You COULD wrap it all up in a class to emulate the old random access file method if there it a lot of legacy code to support. There is also a BinaryReader and BinaryWriter and you could also look into serialization if the data is large but mostly static.

Serializing Running Programs in a Functional Interpreter

I am writing an interpreter implemented functionally using a variations of the Cont Monad. Inspired by Smalltalk's use of images to capture a running program, I am investigating how to serialize the executing hosted program and need help determining how to accomplish this at a high level.
Problem Statement
Using the Cont monad, I can capture the current continuation of a running program in my interpreter. Storing the current continuation allows resuming interpreter execution by calling the continuation. I would like to serialize this continuation so that the state of a running program can be saved to disk or loaded by another interpreter instance. However, my language (I am both targeting and working in Javascript) does not support serializing functions this way.
I would be interested in an approach that can be used to build up the continuation at a given point of execution given some metadata without running the entire program again until it reaches that point. Preferably, minimal changes to the implementation of the interpreter itself would be made.
Considered Approach
One approach that may work is to push all control flow logic into the program state. For example, I currently express a C style for loop using the host language's recursion for the looping behavior:
var forLoop = function(init, test, update, body) {
var iter = function() {
// When test is true, evaluate the loop body and start iter again
// otherwise, evaluate an empty statement and finish
return branch(test,
next(
next(body, update),
iter()),
empty);
};
return next(
init,
iter());
};
This is a nice solution but if I pause the program midway though a for loop, I don't know how I can serialize the continuation that has been built up.
I know I can serialize a transformed program using jumps however, and the for loop can be constructed from jump operations. A first pass of my interpreter would generate blocks of code and save these in the program state. These blocks would capture some action in the hosted language and potentially execute other blocks. The preprocessed program would look something like this:
Label Actions (Block of code, there is no sequencing)
-----------------------------------
start: init, GOTO loop
loop: IF test GOTO loop_body ELSE GOTO end
loop_body: body, GOTO update
update: update, GOTO loop
end: ...
This makes each block of code independent, only relying on values stored in the program state.
To serialize, I would save off the current label name and the state when it was entered. Deserialization would preprocess the input code to build the labels again and then resume at the given label with the given state. But now I have to think in terms of these blocks when implementing my interpreter. Even using composition to hide some of this seems kind of ugly.
Question
Are there any good existing approaches for addressing this problem? Am I thinking about serializing a program the entirely wrong way? Is this even possible for structures like this?
After more research, I have some thoughts on how I would do this. However, I'm not sure that adding serialization is something I want to do at this point as it would effect the rest of the implementation so much.
I'm not satisfied with this approach and would greatly like to hear any alternatives.
Problem
As I noted, transforming the program into a list of statements makes serialization easier. The entire program can be transformed into something like assembly language, but I wanted to avoid this.
Keeping a concept of expressions, what I didn't originally consider is that function calls can occur inside of deeply nested expressions. Take this program for example:
function id(x) { return x; }
10 + id(id(5)) * id(3);
The serializer should be able to serialize the program at any statement, but the statement could potentially be evaluated inside of an expression.
Host Functions In the State
The reason the program state cannot be easily serialized is that it contains host functions for continuations. These continuations must be transformed into data structures that can be serialized and independently reconstructed into the action the original continuation represented. Defunctionalization is most often used to express a higher order language in a first order language, but I believe it would also enable serialization.
Not all continuations can be easily defunctionalized without drastically rewriting the interpreter. As we are only interested in serialization at specific points, serialization at these points requires the entire continuation stack be defunctionalized. So all statements and expressions must be defunctionalized, but internal logic can remain unchanged in most cases because we don't want to allow serialization partway though an internal operation.
However, to my knowledge, defunctionalization does not work the Cont Monad because of bind statements. The lack of a good abstraction makes it difficult to work with.
Thoughts on a Solution
The goal is to create an object made up of only simple data structures that can be used to reconstruct the entire program state.
First, to minimize the amount of work required, I would rewrite the statements level interpreter to use something more like a state machine that can be more easily serialized. Then, I would defunctionalize expressions. Function calls would push the defunctionlized continuation for the remaining expression onto an internal stack for the state machine.
Making Program State a Serializable Object
Looking at how statements work, I'm not convinced that the Cont Monad is the best approach for chaining statements together (I do think it works pretty well at the expression level and for internal operations however). A state machine seems more natural approach, and this would also be easier to serialize.
A machine that steps between statements would be written. All other structures in the state would also be made serializable. Builtin functions would have to use serializable handles to identify them so that no functions are in the state.
Handling Expressions
Expressions would be rewritten to pass defunctionalized continuations instead of host function continuations. When a function call is encountered in an expression, it captures the defunctionalized current continuation and pushes it onto the statement machine's internal stack (This would only happen for hosted functions, not builtin ones), creating a restore point where computation can resume.
When the function returns, the defunctionalized continuation is passed the result.
Concerns
Javascript also allows hosted functions to be evaluated inside almost any expression (getters, setters, type conversion, higher order builtins) and this may complicate things if we allow serialization inside of those functions.
Defunctionalization seems to require working directly with continuations and would make the entire interpreter less flexible.

Erlang serialization

I need to serialize a function in Erlang, send it over to another note, deserialize and execute it there. The problem I am having is with files. If the function reads from a file which is not in the second node I am getting an error. Is there a way how I can differentiate between serializable and not serializable constructs in Erlang? Thus if a function makes use of a file or pid, then it fails to serialize?
Thanks
First of all, if you are sending anonymous functions, be extremely careful with that. Or, rather, just don't do it
There are a couple of cases when this function won't even be executed or will be executed in a completely wrong way.
Every function in Erlang, even an anonymous one, belongs to some module, in the one it's been constructed inside, to be precise. If this function has been built in REPL, it's bound to erl_eval module, which is even more dangerous (I'll explain further, why).
Say, you start two nodes, one of them has a module named 'foo', and one doesn't have such a module loaded (and cannot load it). If you construct a lambda inside the module 'foo', send it to the second node and try to call it, you'll fail with {error, undef}.
There can be another funny problem. Try to make two versions of module 'foo', implement a 'bar' function inside each of them and implement a lambda inside of which (but the lambdas will be different). You'll get yet another error when trying to call a sent lambda.
I think, there could possibly be other tricky parts of sending lambdas to different nodes, but trust me, that's already quite a lot.
Secondly, there are tons of way you can get a process or a port inside a lambda without knowing it in advance
Even though there is a way of catching closured variables from a lambda (if you take a look at a binarized lambda, all the external variables that are being used inside it are listed starting from the 2nd byte), they are not the only source of potential pids or ports.
Consider an easy example: you call a self() function inside your lambda. What will it return? Right, a pid. Okay, we can probably parse the binary and catch this function call, along with a dozen of other built-in functions. But what will you do when you are calling some external function? ets:lookup(sometable, somekey)? some_module:some_function_that_returns_god_knows_what()? You don't know what they are going to return.
Now, to what you can actually do here
When working with files, always send filenames, not descriptors. If you need file's position or something, send it as well. File descriptors shouldn't be known outside the process they've been opened.
As I mentioned, do everything to avoid lambdas to be sent to other nodes. It's hard to tell how to avoid that, since I don't know your exact task. Maybe, you can send a list of functions to execute, like:
[{module1, parse_query},
{module1, dispatch_parsed_query},
{module2, validate_response},
{module2, serialize_query}]
and pass arguments through this functions sequence (be sure all the modules exist everywhere). Maybe, you can actually stick to some module that is going to be frequently changed and deployed over the entire cluster. Maybe, you might want to switch to JS/Lua and use externally started ports (Riak is using spidermonkey to process JS-written lambdas for Map/Reduce requests). Finally, you can actually get module's object code, send it over to another node and load it there. Just keep in mind it's not safe too. You can break some running processes, lose some constructed lambdas, and so on.

Custom performance profiler for Objective C

I want to create a simple to use and lightweight performance profile framework for Objective C. My goal is to measure the bottlenecks of my application.
Just to mention that I am not a beginner and I am aware of Instruments/Time Profiler. This is not what I am looking for. Time Profiler is a great tool but is too developer oriented. I want a framework that can collect performance data from a QA or pre production users and even incorporate in a real production environment to gather the real data.
The main part of this framework is the ability to measure how much time was spent in Objective C message (I am going to profile only Objective C messages).
The easiest way is to start timer in the beginning of a message and stop it at the end. It is the simplest way but its disadvantage is that it is to tedious and error prone - if any message has more than 1 return path then it will require to add the "stop timer" code before each return.
I am thinking of using method swizzling (just to note that I am aware that Apple are not happy with method swizzling but these profiled builds will be used internally only - will not be uploaded on the App Store).
My idea is to mark each message I want to profile and to generate automatically code for the method swizzling method (maybe using macros). When started, the application will swizzle the original selector with the generated one. The generated one will just start a timer, will call the original method and then will stop the timer. So in general the swizzled method will be just a wrapper of the original one.
One of the problems of the above idea is that I cannot think of an easy way how to automatically generate the methods to use for swizzling.
So I greatly will appreciate if anyone has any ideas how to automate the whole process. The perfect scenario is just to write one line of code anywhere mentioning the class and the selector I want to profile and the rest to be generated automatically.
Also will be very thankful if you have any other idea (beside method swizzling) of how to measure the performance.
I came up with a solution that works for me pretty well. First just to clarify that I was unable to find out an easy (and performance fast) way to automatically generate the appropriate swizzled methods for arbitrary selectors (i.e. with arbitrary arguments and return value) using only the selector name. So I had to add the arguments types and the return value for each selector, not only the selector name. In reality it should be relatively easy to create a small tool that would be able to parse all source files and detect automatically what are the arguments types and the returned value of the selector which we want to profile (and prepare the swizzled methods) but right now I don't need such an automated solution.
So right now my solution includes the above ideas for method swizzling, some C++ code and macros to automate and minimize some coding.
First here is the simple C++ class that measures time
class PerfTimer
{
public:
PerfTimer(PerfProfiledDataCounter* perfProfiledDataCounter);
~PerfTimer();
private:
uint64_t _startTime;
PerfProfiledDataCounter* _perfProfiledDataCounter;
};
I am using C++ to use that the destructor will be executed when object has exited the current scope. The idea is to create PerfTimer in the beginning of each swizzled method and it will take care of measuring the elapsed time for this method
The PerfProfiledDataCounter is a simple struct that counts the number of execution and the whole elapsed time (so it may find out what is the average time spent).
Also I am creating for each class I'd like profile, a category named "__Performance_Profiler_Category" and to conforms to "__Performance_Profiler_Marker" protocol. For easier creating I am using some macros that automatically create such categories. Also I have a set of macros that take selector name, return type and arguments type and create selectors for each selector name.
For all of the above tasks, I've created a set of macros to help me. Also I have a single file with .mm extension to register all classes and all selectors I'd like to profile. On app start, I am using the runtime to retrieve all classes that conforms to "__Performance_Profiler_Marker" protocol (i.e. the registered ones) and search for selectors that are marked for profiling (these selectors starts with predefined prefix). Note that this .mm file is the only file that needs .mm extension and there is no need to change file extension for each class I want to profile.
Afterwards the code swizzles the original selectors with the profiled ones. In each profiled one, I just create PerfTimer and call the swizzled method.
In brief that is my idea which turned out to work pretty smoothly.