Writing a large number of files from a long-running process? - optimization

I have a project which scans a large file (2.5GB) picking out strings which will then be written to some subset of several hundred files.
It would be fastest just to use normal buffered writes but
I'm worried about running out of filehandles.
I want to be able to watch the progress of the files while they're being written.
I would prefer as little loss as possible if the process is interrupted. Incomplete files are still partially useful.
So instead I open in read/write mode, append the new line, and close again.
This was fast enough much of the time but I have found that on certain OSes this behaviour is a severe pessimization. Last time I ran it on my Windows 7 netbook I interrupted it after several days!
I can implement some kind of MRU filehandle manager which keeps so many files open and flushes after so many write operations each. But is this overkill?
This must be a common situation, is there a "best practice", a "pattern"?
Current implementation is in Perl and has run on Linux, Solaris, and Windows, netbooks to phat servers. But I'm interested in the general problem: language-independent and cross-platform. I've thought of writing the next version in C or node.js.

On Linux, you can open a lot of files (thousands). You can limit the number of open handles in a single process with the setrlimit syscall and the ulimit shell builtin. You can query them with the getrlimit syscall and also through /proc/self/limits (or /proc/1234/limits for the process with pid 1234). The system-wide maximum number of open files is available through /proc/sys/fs/file-max (on my system it is 1623114).
So on Linux you might not need to bother, and could simply keep many files open at once.
I would suggest maintaining a memoized cache of opened files and reusing them where possible (with an MRU eviction policy). Don't open and close each file too often; only do so when some limit has been reached (e.g. when an open fails).
In other words, you could have your own file abstraction (or just a struct) which knows the file name, may hold an opened FILE* (or a null pointer), and keeps the current offset, perhaps also the last open or write time; then manage a collection of such things in a FIFO discipline (for those having an opened FILE*). You certainly want to avoid close-ing (and later re-open-ing) a file descriptor too often.
You might occasionally (e.g. once every few minutes) call sync(2), but don't call it too often (certainly not more than once per 10 seconds). If using buffered FILE-s, don't forget to fflush them from time to time. Again, don't do that very often.
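The cache-of-open-files idea above can be sketched as follows. This is a minimal illustration in Python (the question's implementation is Perl); the `FileCache` class and its limit are illustrative, not from the original:

```python
import collections

class FileCache:
    """Keep at most max_open files open; the least-recently-used handle is
    closed (the file itself is kept) and transparently reopened in append
    mode on the next write, so no data is lost across evictions."""

    def __init__(self, max_open=100):
        self.max_open = max_open
        # Maps path -> open file object, ordered from least to most recently used.
        self.handles = collections.OrderedDict()

    def write_line(self, path, line):
        fh = self.handles.pop(path, None)
        if fh is None:
            if len(self.handles) >= self.max_open:
                # Evict the least-recently-used handle before opening another.
                _, oldest = self.handles.popitem(last=False)
                oldest.close()
            fh = open(path, "a")      # append mode preserves earlier content
        self.handles[path] = fh       # re-inserting marks it most recently used
        fh.write(line + "\n")         # buffered; data is durable after close/flush

    def close_all(self):
        for fh in self.handles.values():
            fh.close()
        self.handles.clear()
```

A consumer would simply call `cache.write_line(target_path, extracted_string)` for each hit during the scan, and `close_all()` at the end; interrupting mid-run loses at most the unflushed buffer contents of the currently open handles.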

Related

Is a new process created when a syscall is made in Minix?

For example, when we call write(...) in a program in Minix, is a new process created (as with fork()), or is it done within the current process?
Is it efficient to make a lot of syscalls?
Process creation is strictly fork's / exec's job. What kind of process could a system call like write possibly spawn?
Now, Minix is a microkernel, meaning that things like file systems run in userland processes. Writing to a file could therefore possibly spawn a new process somewhere else, but that depends on your file system driver. I haven't paid attention to the MinixFS driver so far, so I can't tell you whether that happens -- but it's not very likely, process creation still being relatively expensive.
It's almost never efficient to make a lot of syscalls (context switches being involved). However, "performant", "efficient" and "a lot" are all very relative things, so I can't tell you something you probably don't know already.
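The per-syscall overhead mentioned above can be made concrete with a quick (admittedly crude) measurement: many one-byte write calls versus a single write of the same data. This is a sketch, not a benchmark; absolute numbers vary by OS and hardware, but the ratio illustrates the point:

```python
import os
import tempfile
import time

N = 20000

# N separate one-byte write(2) calls: one syscall per byte.
fd, path = tempfile.mkstemp()
t0 = time.perf_counter()
for _ in range(N):
    os.write(fd, b"x")
many_calls = time.perf_counter() - t0
os.close(fd)
os.unlink(path)

# One write(2) call carrying all N bytes.
fd, path = tempfile.mkstemp()
t0 = time.perf_counter()
os.write(fd, b"x" * N)
one_call = time.perf_counter() - t0
os.close(fd)
os.unlink(path)

# The per-call (mode switch / bookkeeping) cost dominates the first version.
print(many_calls > one_call)
```

This is exactly why buffered I/O libraries batch small writes in userspace and issue fewer, larger syscalls.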

Why is it necessary to close a file after using it?

I read a question from a book that said "If the OS closes a file after the program terminates, why does the programmer need to close a file manually (i.e. call file.close())?"
The only reason I could come up with is that the program may not terminate correctly, so the file may still be open, consuming system resources, and data sitting in its buffer may never be written out.
Are there any other reasons?
EDIT: I thought of another reason. Calling file.close() obliges the OS to flush to disk any changes that haven't been committed to the file.
If the programmer manually closes the file they have control over when/how the resources are released.
If it is left to the OS you can't be sure when (or if) clean-up will take place; it's generally bad practice as well.
In some situations the program will call open so many times that it'll run out of file descriptors if they aren't released again.
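The flush-on-close point can be demonstrated directly. A small write through a buffered file object sits in the process's userspace buffer, invisible to other readers, until the file is flushed or closed (a sketch assuming CPython's default buffering; the file name is illustrative):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "log.txt")

f = open(path, "w")
f.write("hello")                  # small write: stays in the userspace buffer
before_close = open(path).read()  # another reader sees nothing yet
f.close()                         # flushes the buffer and releases the fd
after_close = open(path).read()   # now the data is visible

print(repr(before_close), repr(after_close))
```

If the program had crashed between the write and the close, the buffered bytes would have been lost, which is the asker's second reason in action.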

How to autostart a program from floppy disk on a Commodore c64

Good news: my C64 is still running after many years spent in my attic.
But what I always wanted to know is:
How can I automatically load & run a program from a floppy disk that is already inserted
when I switch on the c64?
Some auto-running command like load "*",8,1 would be adequate...
Regards
MoC
You write that a command that you type in, like LOAD"*",8,1 would be adequate. Can I assume, then, that the only problem with that particular command is that it only loads, but doesn't automatically run, the program? If so, you have a number of solutions:
If it's a machine language program, then you should type LOAD"<FILENAME>",8,1: and then (without pressing <RETURN>) press <SHIFT>+<RUN/STOP>.
If it's a BASIC program, type LOAD"<FILENAME>",8: and then (without pressing <RETURN>) press <SHIFT>+<RUN/STOP>.
It is possible to write a BASIC program such that it automatically runs when you load it with LOAD"<FILENAME>",8,1. To do so, first add the following line to the beginning of your program:
0 POKE770,131:POKE771,164
Then issue the following commands to save the program:
PRINT"{CLR}":POKE770,113:POKE771,168:POKE43,0:POKE44,3:POKE157,0:SAVE"<FILENAME>",8
This is not possible without some custom cartridge.
One way to fix this would be getting the Retro Replay cartridge and hacking your own code for it.
I doubt there is a way to do it; you would need a cartridge which handles this case and I don't think one like that exists.
Actually, a better and more suitable solution is EasyFlash. Retro Replay is commonly used with its own ROM; since it is a very useful cartridge with its default ROM, I would never flash another ROM onto it. It is also more expensive than EasyFlash if you don't already own either cartridge.
At the moment I have a Prince Of Persia (!) ROM written to my EasyFlash, and when I switch on my C64 it autoruns just like you asked for.
Not 100% relevant, but the C128 can autoboot disks in C128 mode. For example, Ultima V (which has music on the C128, but not on the C64 nor on a C128 in C64 mode) autoboots.
As for cartridges, I'd recommend the 1541 Ultimate 2. It can also run games from cartridge ROM images (although Prince of Persia doesn't work for me for some reason, perhaps a software issue?), and you also get a rather good floppy emulator (which also makes it easier to transfer stuff to real disks), REU, a tape interface (if you order it), etc.
If you are working with an ML program, there are several methods. If you aren't worried about ever returning to the normal READY prompt without a RESET, you can have a small loader that loads into the stack ($0100-$01FF). The loader would just load the next section of code, then jump to it. It would start at $0102 and needs to be as small as possible. Many times the file name of the next piece to load is only 2 characters, so it can be placed at $0100 & $0101. Then all you need to do is call SETLFS, SETNAM, and LOAD, then JMP to the loaded code. Fill the rest of the stack area with $01. It is also rather safe to save only $0100-$010D so that the entire loader fits in a single disk block.
One issue with this is that it clears out past stack entries (so your program will need to reset the stack pointer back to the top). If your program tries to do a normal RTS out of itself, random things can occur. If you want to exit the program, you'll need to JMP through the reset vector (stored at $FFFC) to do so.

Start/Stop JVM [duplicate]

Possible Duplicate:
Are there any Java VMs which can save their state to a file and then reload that state?
Is there a way to stop the JVM, save its state to the hard disk, and then resume execution from the same point where it was stopped? Something like the hibernation feature of Windows, Linux, etc.?
When it's running on a Unix, you can certainly suspend it, e.g. by hitting ^Z on the command line you started it from, or by sending it a SIGSTOP signal. This doesn't quite do what you want: the state is not written directly to disk (though it may be swapped out), it won't survive a system restart, and unlike an image file you can't copy or restore it.
There are also a variety of hacks that let some image based systems (smalltalk, emacs, etc.) "unexec()" themselves and save a copy on a disk. These will break any network connections or open files. Most approaches also require the cooperation of the program being saved, especially to gracefully handle the connections to the outside world being severed.
Finally, you could run the JVM in a VM, and suspend the VM. There, at least, connections to files inside the VM will be saved.
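The SIGSTOP/SIGCONT behaviour described above can be sketched on a POSIX system. The child command and the Linux-specific /proc check are illustrative; on other Unixes you would inspect the state via ps or waitpid instead:

```python
import os
import signal
import subprocess
import time

# Spawn a child process to stand in for the JVM.
proc = subprocess.Popen(["sleep", "30"])

os.kill(proc.pid, signal.SIGSTOP)   # what the shell does when you hit ^Z
time.sleep(0.2)

# On Linux, /proc/<pid>/stat reports state 'T' (stopped) while suspended.
# The process image stays in memory (or swap); nothing is saved to disk.
with open(f"/proc/{proc.pid}/stat") as f:
    state = f.read().rsplit(")", 1)[1].split()[0]
print(state)

os.kill(proc.pid, signal.SIGCONT)   # resume exactly where it left off
proc.terminate()
proc.wait()
```

This makes the answer's caveats tangible: the suspended state lives only in RAM/swap, tied to this boot of this machine, which is why it can't be copied or restored like a VM snapshot.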

Daemon with Clojure/JVM

I'd like to have a small (not doing too damn much) daemon running on a little server, watching a directory for new files being added to it (and any directories in the main one), and calling another Clojure program to deal with that new file.
Ideally, each file would be added to a queue (a list represented by a ref in Clojure?) and the main process would take care of those files in the queue on a FIFO basis.
My question is: is having a JVM up running this little program all the time too much a resource hog? And do you have any suggestions as to how go about doing this?
Thank you very much!
EDIT: Another question I should ask: should I run this as its own instance (using less memory) and have it launch a new JVM when a file is seen, or have it on the same JVM the Clojure code that will process the file?
As long as it is running fine now and it has no memory leaks it should be fine.
From the daemon terminology I gather it is on a Unix clone; in that case it is best to start it from an init script, or from the rc.local script. Unfortunately, the details differ from OS to OS.
Limit the memory using -Xmx64m or similar to make sure it fails before taking down the rest of the services. Play a bit with the number to find the lowest reliable size.
Also, since Clojure's claim to fame is its ability to deal with concurrency, it makes a lot of sense to run only one JVM, with all functionality running on it in multiple threads. The overhead of spawning new processes is already very big, and if it is a JVM that needs to JIT and warm up its memory management, doubly so. On a resource-constrained machine this could pose a problem, and on a resource-rich machine it is a waste.
I have always found that the JVM is not made to quickly run something script-like and exit again. It is really not made for that use case, in my opinion.
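The watch-directory-and-enqueue loop the question describes is language-agnostic; here is a minimal sketch using simple polling rather than OS file notifications (shown in Python for brevity; the function names and the polling approach are illustrative, not from the original):

```python
import os
import queue
import time

def scan_once(dir_path, seen, work_queue):
    """One polling pass: enqueue any file not seen before, in FIFO order."""
    for name in sorted(os.listdir(dir_path)):
        path = os.path.join(dir_path, name)
        if os.path.isfile(path) and path not in seen:
            seen.add(path)
            work_queue.put(path)   # queue preserves arrival order (FIFO)

def watch(dir_path, work_queue, interval=1.0, passes=None):
    """Poll dir_path forever, or for a fixed number of passes (for testing)."""
    seen = set()
    n = 0
    while passes is None or n < passes:
        scan_once(dir_path, seen, work_queue)
        n += 1
        if passes is None or n < passes:
            time.sleep(interval)
    return seen
```

A consumer thread in the same process would loop on `work_queue.get()` and process each file, which matches the answer's advice: one long-lived JVM (or here, one process) with the watcher and the workers as threads, rather than spawning a fresh runtime per file.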