I have 500,000 files that I need to read in Fortran, and each file has ~14,000 entries (each entry is only about 100 characters long). I need to process one line at a time across all files. For example, I need to process line 1 of all 500,000 files before moving on to line 2 of each file, and so forth.
I cannot open them all at once (I tried making an array of file pointers and opening them all) because that leaves too many files open at once. Instead, I would like to do something like the following:
do iline = 1,Nlines
   do ifile = 1,Nfiles
      ! open the file
      ! read a line
      ! close the file
   enddo
enddo
I hoped this would allow me to read one line at a time (from each file) and then move on to the next line (in each file). Unfortunately, each time I open a file it starts me off at line 1 again. Is there any way to close a file and then open it again where I left off previously?
Thanks
Unfortunately, this is not possible in standard Fortran. Even if you specify
position="ASIS"
the actual position is unspecified for a unit that is not already connected, and on most systems it will in fact be the beginning of the file.
That means you have to call
read(u,*)
enough times to reach the right place in the file.
You could also use stream access. The file would be again opened at the beginning, but you can use
read(u,*,pos=n) number
where n is the position saved from the previous open. You can get the position from
inquire(unit=u, pos=n)
You would open the file with access="STREAM".
Also, 500000 open files is indeed too many. There are ways to query the system limits and to change them, but your compiler may also impose limits of its own: http://www.cyberciti.biz/faq/linux-increase-the-maximum-number-of-open-files/
Another solution: couldn't you store the content of the files in memory? A couple of gigabytes is fine today, but it may not be enough for you.
You can try using fseek and ftell in something like the following.
! initialize an array of 0's
do iline = 1,Nlines
   do ifile = 1,Nfiles
      ! open the file
      ! fseek(fd, array(ifile))
      ! read a line
      ! array(ifile)=ftell(fd)
      ! close the file
   enddo
enddo
The (untested) idea is to store the offset of each file in an array and position the cursor at that place upon opening the file. Then, once a line is read, the ftell retrieves the current position which is saved to memory for next round. If all entries have the same length, you can spare the array and just store one value.
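The offset bookkeeping described above can be sketched in Python, whose seek and tell file methods play the role of fseek and ftell. This is only a minimal sketch, not the answerer's code: the function name read_round_robin and the process callback are hypothetical, and the files are opened in binary mode so that the saved offsets are plain byte counts.

```python
def read_round_robin(paths, nlines, process):
    """Visit line 1 of every file, then line 2 of every file, and so on.

    Only one file is open at a time; offsets[i] remembers where file i
    stopped so that the next open can resume from there.
    """
    offsets = [0] * len(paths)              # the "array of 0's"
    for iline in range(nlines):
        for ifile, path in enumerate(paths):
            with open(path, "rb") as f:     # open the file
                f.seek(offsets[ifile])      # fseek to the saved position
                line = f.readline()         # read a line
                offsets[ifile] = f.tell()   # ftell: save for the next round
            # the file is closed again before the line is processed
            process(iline, ifile, line.decode().rstrip("\r\n"))
```

Each call to process receives the line index, the file index and the text of that line, so the caller sees all first lines before any second line.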
If the files have fixed, i.e., constant, record lengths, you could use direct access. Then you could "directly" read a specific record. A big "if" however.
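The arithmetic behind direct access is simple: if every record occupies reclen bytes, record k starts at byte (k-1)*reclen. A minimal Python sketch of that idea follows; read_record is a hypothetical helper, and reclen is assumed to include any line terminator stored with each record.

```python
def read_record(path, k, reclen):
    """Return record number k (1-based) from a file of fixed-length records."""
    with open(path, "rb") as f:
        f.seek((k - 1) * reclen)   # jump straight to the start of record k
        return f.read(reclen)      # read exactly one record
```

With this, no per-file offset needs to be stored at all, because the position follows from the line number.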
The overhead of all the file opening/closing will be a big performance bottleneck.
You should try to read as much as you can for each open operation, given whatever memory you have:
pseudocode:
loop until done:
    loop over all files:
        open
        fseek                      ! as in damien's answer
        read N lines into array    ! e.g. N=100
        save ftell value for file
        close
    end file loop
    loop over N output files:
        open
        write array data
        close
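The reading half of the pseudocode above might look like this in Python. It is a sketch under assumptions: read_in_batches and handle_batch are hypothetical names, offsets are byte positions obtained from tell, and writing the output files is left to the handle_batch callback.

```python
def read_in_batches(paths, batch_size, handle_batch):
    """Read up to batch_size lines from each file per pass, one open file at a time."""
    offsets = [0] * len(paths)       # saved ftell value for every file
    done = [False] * len(paths)      # True once a file has been read to its end
    while not all(done):
        batch = []                   # batch[i] holds the lines just read from file i
        for i, path in enumerate(paths):
            lines = []
            if not done[i]:
                with open(path, "rb") as f:
                    f.seek(offsets[i])           # resume where the last pass stopped
                    for _ in range(batch_size):
                        line = f.readline()
                        if not line:             # empty bytes means end of file
                            done[i] = True
                            break
                        lines.append(line)
                    offsets[i] = f.tell()
            batch.append(lines)
        if any(batch):               # skip a final pass that found nothing new
            handle_batch(batch)
```

With a batch size of 100, each file is opened roughly 140 times instead of 14,000 times for the file sizes mentioned in the question.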
I haven't used OpenVMS for 20+ years. It was my first OS. I've been asked whether it is possible to copy the data in RMS files from an OpenVMS server to Windows as text files, so that it's readable.
No one has experience with or knowledge of the record structures etc.
The files are xyz.DAT and are relative files. I'm hoping the .DAT files are fixed length.
My first attempt was to use Datatrieve (DTR), but I get an error that the image isn't loaded.
I thought it might be as easy as using CONVERT/FDL = nnnn.FDL, changing Relative to Sequential. The file still seems to be unreadable.
Is there an easy way to stream an RMS indexed file to a flat ASCII file?
I used to use COBOL and C to access the data in the past, but I had lots of libraries to help....
I've noticed some solutions may use ODBC to connect, but I'm not sure what I can or cannot install on the server.
I can FTP to the server using FileZilla....
Another plan is writing a C application to read a file and output it as strings..... or DCL too..... It doesn't have to be quick...
Any ideas?
As mentioned before:
The simple solution MIGHT be to just use: $ TYPE/OUT=test.TXT test.DAT.
This will handle Relative and Indexed files alike.
It is much the same as $ CONVERT / FDL=NL: test.DAT test.TXT
Both will just read records from the source and transfer the bytes, byte for byte, to the records in a sequential file.
FTP in ASCII mode will transfer that nicely to windows.
You can also use an 'inline' FDL file to generate a 'unix' LF file like:
$ conv /fdl="record; format stream_lf" test.DAT test.TXT
Or CR-LF file using:
$ conv /fdl="record; format stream" test.DAT test.TXT
Both can be transferred in Binary or ASCII mode with FTP.
MOSTLY - because this really only works well for a TEXT ONLY source .DAT file.
There should be no CR, LF, FF or NUL characters in the source or things will break.
As 'habo' points out, use DUMP /RECORD=COUNT=3 to see how 'readable' the source data is.
If you spot 'binary' data using DUMP, then you will need to find a record definition somewhere which maps bytes to integers, floating-point values or dates as needed.
These definitions can be COBOL LIB files or BASIC MAPs, and are often stored in the CDD (Common Data Dictionary) or indeed in DATATRIEVE .DIC dictionaries.
To use such a definition you likely need a program that just reads following the 'map' and writes/prints as text. Normally that's not too hard, notably when you can find an example program on the server to tweak.
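As an illustration of what "reading following the map" means once the file has been transferred in binary mode, here is a small Python sketch. The record layout is entirely hypothetical (a 10-byte ASCII name followed by a 32-bit little-endian signed integer); the real layout has to come from the record definition, and floating-point or date fields on VMS may need format-specific conversion that this sketch does not attempt.

```python
import struct

# Hypothetical layout, for illustration only:
# 10-byte ASCII name, then a 32-bit little-endian signed integer.
RECORD = struct.Struct("<10si")

def record_to_text(raw):
    """Decode one binary record into a comma-separated text line."""
    name, count = RECORD.unpack(raw[:RECORD.size])
    return f"{name.decode('ascii').rstrip()},{count}"
```

Applied to every record in turn, a function like this yields one readable text line per record.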
If it is just one or two 'suspect' byte ranges, then you can create a DCL loop to read and write and use F$EXTRACT to select the chunks you like.
If you want further help, kindly describe in words what kind of data is expected and perhaps provide the output from DUMP for 3 or 5 rows.
Good luck!
Hein.
I am writing a Scheme macro for a simulation tool. I create thousands of files and I want to add a header (6 lines) to each file. The code that builds my header runs well and the header is created correctly.
But adding the header to my file is buggy. It does not add the 6 header lines to the top of my files while leaving the rest untouched; it deletes the first part of the information in my file. How much is deleted depends on the total length of the header.
(let* ((out (open-input-output-file filename0)))
  (display header out)
  (newline out)
  (close-output-port out))
This is how my file looks without the header:
TracePro Release: 20 6 0
Irradiance Map Data for D:****\TracePro\Aktive\sim_mod_09.oml
Linear Units in millimeters
Data for absorbing_area_focuscircle Surface 0
Data generated at 10:55:56 May 28, 2021
This is how my file looks with the header:
axle x y z a b c
pyra 0 0 0 0 0 0
lens 0 0 0 0 0 0
coll 0 0 0 0 0 0
mir1 0 0 0 0 0 0
glass1 0 0 0 0 0 0
ing_area_focuscircle Surface 0
Data generated at 10:57:29 May 28, 2021
Raytrace Time: mins: 0, secs: 0*
Projected Plane Extent from surface geometry
TopLeft:(-1.05125,-214.755,-1.05125)
TopRight:(1.05125,-214.755,-1.05125)
BottomLeft:(-1.05125,-214.755,1.05125)
BottomRight:(1.05125,-214.755,1.05125)
This isn't really an answer, especially as my knowledge of the Scheme language standards isn't good enough to even know if this is possible within a strictly-defined Scheme. However I'll show why it's hard, and then give an example of how to solve it in Racket, first by cheating to make a probably-correct answer and then by trying to do it the hard way to make a probably-not-correct answer.
Why it is hard
No modern filesystem I know of allows you to open a file for 'insertion' where new content is inserted into the file, pushing existing content 'down' the file. Instead you can open a file for writing, conceptually, in two ways:
for appending, which will append new content at the end;
for overwriting, which will overwrite existing content with new.
(Actually these may be the same: opening for appending may just open for overwriting and then move the current location to the end of the file.)
So what you're doing in your sample is opening for overwriting, and then clobbering the content of the file with the header.
How to do it, in outline
The way to do what you need to do, in outline, is:
create and open a temporary file in the same directory as the file you care about;
write the new content to the temporary file;
copy all the content of the existing file to the temporary file;
close the temporary file;
if all is well, rename the temporary file on top of the existing file; if all is not OK, delete it.
If you do this carefully it is safe, because file renames are atomic, or should be, in the filesystem, if the two files are in the same directory. That means that the rename should either completely succeed or completely fail, even if the system crashes part way through or the filesystem fills or something like that. If the filesystem doesn't guarantee that then you're pretty much stuck.
But doing it carefully is not easy (I should admit here that some of my background is doing things like this to system-critical files, so I've spent too long thinking about how to make this safe in a context where getting it wrong is very serious indeed).
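To make the outline concrete, here is a minimal Python sketch of the same steps. It is not the careful version alluded to above: prepend_to_file is a hypothetical helper, it does not preserve the original file's permissions or force data to disk, and it relies on os.replace to perform the rename.

```python
import os
import shutil
import tempfile

def prepend_to_file(path, header):
    """Prepend header (bytes) to the file at path by way of a temporary file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the target so that the rename
    # stays within one directory (and therefore one filesystem).
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as out, open(path, "rb") as src:
            out.write(header)              # write the new content first
            shutil.copyfileobj(src, out)   # then copy the existing content
        os.replace(tmp_path, path)         # rename on top of the existing file
    except BaseException:
        # Something went wrong: delete the temporary file and re-raise.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

The original file is never opened for writing, so a failure part way through leaves it untouched.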
Solving this in Racket by cheating
As I said, getting the above process right is hard, and it is therefore something you often want to rely on a battle-tested library for. Racket has such a thing: call-with-atomic-output-file. This seems to be designed to solve exactly this problem: it deals with creating and opening the temporary file for you, deals with the renaming at the end and cleans up appropriately. All you need is a function which copies things around.
So here is a function, prepend-to-file which uses call-with-atomic-output-file to try and do what you want. This is Racket-specific, in many ways, and it is also somewhat overengineered.
(define (prepend-to-file file content #:buffer-size (buffer-size 40960))
  ;; prepend content to file
  ;;
  ;; Try to be a bit too clever about whether we're copying strings or bytes
  ;; based on the argument
  (let-values ([(read-it! write-it make-it)
                (if (bytes? content)
                    (values read-bytes! write-bytes make-bytes)
                    (values read-string! write-string make-string))])
    (call-with-atomic-output-file file
      (λ (out path)
        ;; out is open for writing to the temporary file, path is the
        ;; temporary file's pathname which we don't care about
        (call-with-input-file file
          (λ (in)
            ;; in is now open for reading from the real file
            (write-it content out)
            (let ([buffer (make-it buffer-size)])
              ;; copy in to out using a buffer
              (for ([nread (in-producer (thunk
                                         (read-it! buffer in))
                                        eof-object?)])
                (write-it buffer out 0 nread)))
            ;; OK just return the file for want of anything better
            file))))))
I think it's reasonably likely that the above code actually works in most reasonable cases.
Solving this in Racket without cheating
If we could write call-with-atomic-output-file then we could solve the problem without cheating. But getting this right is hard. Here is an attempt to do this, which is almost certainly incorrect:
(define (call/temporary-output-file file proc)
  (let ([tmpname (string-append file
                                "-"
                                (number->string (random (expt 2 24))))]
        [managed #f]
        [once #t])
    ;; tmpname is the name of the temporary file: this assumes pathnames are
    ;; only strings, which is wrong. managed is a flag which tells us if
    ;; proc returned normally, once is a flag which attempts to prevent any
    ;; continuation nasties so the whole thing can only happen once.
    (call-with-output-file tmpname
      (λ (out)
        (dynamic-wind
         (thunk
          (when (not once)
            ;; if this is the case we're getting back in, and this
            ;; is not OK
            (error 'call/temporary-output-file
                   "this is hopeless")))
         (thunk
          ;; call proc and if it returns normally note that
          (begin0 (proc out tmpname)
            (set! managed #t)))
         (thunk
          ;; close the output port regardless
          (close-output-port out)
          (if managed
              ;; We did OK, so rename the file in place
              (rename-file-or-directory tmpname file #t)
              ;; failed, nuke the temporary file
              (when (file-exists? tmpname)
                (delete-file tmpname)))
          ;; finally set once false to prevent shenanigans
          (set! once #f)))))))
Notes:
this is still Racket-specific, but it now depends only on simpler functions which have, probably, more obvious counterparts in other implementations (or in the standard);
it tries to deal with some of the edge cases, but almost certainly misses some;
it certainly does not cope in cases such as the rename failing and so on;
Again: don't use this: it's almost certainly buggy.
However if you did use this, then you could simply splice it in instead of call-with-atomic-output-file in the above code and it will, often but probably not always, work.
TL;DR: can I do an incremental read of binary files with Red or Rebol?
I would like to use Red to process some large (13MB to 2GB) structured binary files (Kurzweil synthesizer files). I've used other languages (C, Go, Tcl, Ruby, Dart) to walk through these files, and now I'd like to do the same with Red or Rebol.
Is there a way to incrementally read binary files, byte by byte? All I see is read/binary which seems to slurp the entire file at once (or a part of a file).
I'll need to jump around a little bit, too (either peek at the next byte, or skip to the end of a section, or skip past variable length strings to the start of data).
(Yes, I could make some helpers that tracked the position and used read/part/seek.)
I would like to make a call to the low level OS read/seek if that is possible - something new to learn.
This is on macOS, but a portable solution would be great.
Thanks!
PS: "open/read %abc" gives an error "*** Script Error: open does not allow file! for its port argument", even though the help message says the port argument is "port [port! file! url! block!]"
Rebol has ports for that, which are planned for 0.7.0 release in Red. So, current I/O is very basic and buffer-only, and open is a preliminary stub.
"I would like to make a call to the low level OS read/seek if that is possible - something new to learn."
You can leverage Rebol or Red/System FFI as a learning exercise.
Here is how you would do it in Rebol:
>> file: open/direct/binary %file.dat
>> until [none? probe copy/part file 20]
>> close file
#{732F7072696E74657253657474696E6773312E62}
#{696E504B01022D00140006000800000021006149}
#{0910890100001103000010000000000000000000}
...
#{000000006A290000646F6350726F70732F617070}
#{2E786D6C504B0506000000000D000D0068030000}
#{292C00000000}
none
first file or pick file 1 will return the next byte value (integer!)
This even works with text files: open/lines/direct, in that case copy/part file 20 will return 20 lines, or you can use pick file 1 or first file to get the next line.
Soon this will be available on Red too.
The title says it all.
I have 4 awk processes logging to the same file, and output seems fine, not mangled, but I'm not sure that just redirecting print output like this: print "xxx" >> file in every process is the right way to do it.
There are many similar questions around the site, but this one is particularly about awk and a pragmatic, code-correct way to approach the problem.
EDIT
Sorry folks, of course I wasn't "just redirecting" like I wrote, I was appending.
No, it is not safe.
In awk, print "foo" > "file" truncates the file the first time the script opens it, and then keeps writing to that same open file until the end of the script.
That is, if your 4 awk processes start writing to the same file at different times, they overwrite each other's results.
To reproduce it, you could start two (or more) awk like this:
awk '{while(++i<9){system("sleep 2");print "p1">"file"}}' <<<"" &
awk '{while(++i<9){system("sleep 2");print "p2">"file"}}' <<<"" &
If you monitor the content of the file at the same time, you will see that in the end there are not exactly 8 "p1" and 8 "p2" lines.
Using >> avoids losing entries, but the order of the entries from the 4 processes could be mixed up.
EDIT
Ok, the > was a typo.
I don't know why you really need 4 processes writing to the same file. As I said, with >> the entries won't get lost (if your awk scripts work correctly). Personally, however, I wouldn't do it this way. If I had to have 4 processes, I would write to different files. I don't know your requirements, though; I'm just speaking in general.
Writing to different files makes testing and debugging easier. Imagine one of your processes has a problem and you want to fix it, etc.
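For comparison, here is a sketch of the append approach outside awk, in Python. On POSIX systems, a descriptor opened with O_APPEND has each write positioned at the current end of the file, so whole lines written with a single write call are not overwritten by other appenders (network filesystems can be an exception). The helper name append_line is hypothetical, and the sketch assumes short lines that the OS accepts in one call.

```python
import os

def append_line(path, text):
    """Append one line to path using a single write on an O_APPEND descriptor."""
    data = (text + "\n").encode()
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o644)
    try:
        os.write(fd, data)   # one call per line, so the line reaches the OS whole
    finally:
        os.close(fd)
```

The order in which lines from different processes arrive is still not controlled, which matches the caveat about entry sequence above.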
I think using the operating system print command is safe. In fact, this appends the string you provide as a log entry to the file's write buffer. The system then manages the actual writing of the data to disk. If another process wants to use the same file, the system will see that the resource is already claimed and will wait for the first thread to finish its processing, then allow the second process to write to the buffer.
So I've been using the DotNetZip library for some time now, and it worked pretty well until yesterday, when I maxed out the zip file size. On any given day, I need to zip PDFs and transfer them to an SFTP site that only accepts zip files. The number of PDFs ranges from a couple hundred or a couple thousand to well over 10K. I had about 24K PDFs yesterday when the DotNetZip process broke. There is a way to split zip files using the DotNetZip library, but for some reason the system used on the SFTP server can't handle split zip files.
What's the best way to grab, say, 5K files (or any other set number), zip them, delete them, grab another 5K, zip, delete, and repeat the process until all files are zipped?
Here is my current code for the zip process...
Dim PathToPDFs As String = "C:\Temp" 'PDF LOCATION
Using Zip As ZipFile = New ZipFile()
    Zip.AddSelectedFiles("(name = *.pdf)", PathToPDFs, "", True)
    Zip.CompressionMethod = CompressionMethod.Deflate
    Zip.CompressionLevel = Ionic.Zlib.CompressionLevel.BestCompression
    Zip.Save("C:\Temp\Zipfile.zip")
End Using
Try enumerating all the files first to get a list of FileInfo objects, then go through them in a loop, creating a new ZIP file every 5K files (or whatever your batch size is). You don't need to delete anything; just keep a batch id in memory and derive the zip file names from it (e.g. pdf_batch_01.zip).
So when the batch size is reached, you call Save, create a new ZipFile, and keep adding files in the loop. Don't forget to also "commit" at the last file (the last batch will most likely be incomplete). To sum up, you "commit" when the batch size is reached OR when processing the last entry (a variation of i=FileCount-1).
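The batching logic can be sketched in Python using the standard zipfile module rather than DotNetZip. The function name zip_in_batches and its parameters are hypothetical, and unlike the AddSelectedFiles call in the question this sketch does not recurse into subdirectories. Slicing the file list in steps of batch_size takes care of the incomplete last batch, so no separate "last entry" check is needed.

```python
import os
import zipfile

def zip_in_batches(src_dir, out_dir, batch_size=5000, ext=".pdf"):
    """Write the matching files in src_dir into numbered archives of batch_size files."""
    files = sorted(
        os.path.join(src_dir, name)
        for name in os.listdir(src_dir)
        if name.lower().endswith(ext)
    )
    archives = []
    for start in range(0, len(files), batch_size):
        batch_id = start // batch_size + 1          # 1, 2, 3, ...
        archive = os.path.join(out_dir, f"pdf_batch_{batch_id:02d}.zip")
        with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
            for path in files[start:start + batch_size]:
                zf.write(path, arcname=os.path.basename(path))
        archives.append(archive)                    # archive is closed ("committed") here
    return archives
```

Writing the archives to a directory other than src_dir keeps them out of the next directory listing.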