Scheme Macro - Adding a Header to a file - header

I am writing a Scheme macro for a simulation tool. I create thousands of files and I want to add a header (6 lines) to each file. The code that builds my header works fine and the header is created correctly.
But adding the header to my file is buggy. It does not add the 6 header lines to the top of my files while leaving the rest untouched; instead it deletes the first part of the information that was in the file. How much is deleted depends on the total length of the header.
(let* ((out (open-input-output-file filename0)))
  (display header out)
  (newline out)
  (close-output-port out))
This is how my file looks without the header:
TracePro Release: 20 6 0
Irradiance Map Data for D:****\TracePro\Aktive\sim_mod_09.oml
Linear Units in millimeters
Data for absorbing_area_focuscircle Surface 0
Data generated at 10:55:56 May 28, 2021
This is how my file looks with the header:
axle x y z a b c
pyra 0 0 0 0 0 0
lens 0 0 0 0 0 0
coll 0 0 0 0 0 0
mir1 0 0 0 0 0 0
glass1 0 0 0 0 0 0
ing_area_focuscircle Surface 0
Data generated at 10:57:29 May 28, 2021
Raytrace Time: mins: 0, secs: 0*
Projected Plane Extent from surface geometry
TopLeft:(-1.05125,-214.755,-1.05125)
TopRight:(1.05125,-214.755,-1.05125)
BottomLeft:(-1.05125,-214.755,1.05125)
BottomRight:(1.05125,-214.755,1.05125)

This isn't really an answer, especially as my knowledge of the Scheme language standards isn't good enough to even know if this is possible within a strictly-defined Scheme. However, I'll show why it's hard, and then give an example of how to solve it in Racket, first by cheating to make a probably-correct answer and then by trying to do it the hard way to make a probably-not-correct answer.
Why it is hard
No modern filesystem I know of allows you to open a file for 'insertion' where new content is inserted into the file, pushing existing content 'down' the file. Instead you can open a file for writing, conceptually, in two ways:
for appending, which will append new content at the end;
for overwriting, which will overwrite existing content with new.
(Actually these may be the same: opening for appending may just open for overwriting and then move the current location to the end of the file.)
So what you're doing in your sample is opening for overwriting, and then clobbering the content of the file with the header.
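To make the two modes concrete, here is a tiny illustration in Python (chosen purely for brevity, it is not part of the Racket solution below; "data.txt" is a made-up filename):
# appending: new content goes after whatever is already in the file
with open("data.txt", "a") as f:
    f.write("appended line\n")

# opening for update and writing from position 0: the first bytes of the
# existing content get clobbered, which is exactly what happens to the
# files in the question
with open("data.txt", "r+") as f:
    f.write("this overwrites the start of the file")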
How to do it, in outline
The way to do what you need to do, in outline, is:
create and open a temporary file in the same directory as the file you care about;
write the new content to the temporary file;
copy all the content of the existing file to the temporary file;
close the temporary file;
if all is well, rename the temporary file on top of the existing file; if all is not OK, delete it.
If you do this carefully it is safe, because file renames are atomic, or should be, in the filesystem, if the two files are in the same directory. That means that the rename should either completely succeed or completely fail, even if the system crashes part way through or the filesystem fills or something like that. If the filesystem doesn't guarantee that then you're pretty much stuck.
But doing it carefully is not easy (I should admit here that some of my background is doing things like this to system-critical files, so I've spent too long thinking about how to make this safe in a context where getting it wrong is very serious indeed).
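For what it's worth, the outline above can be written down in a few lines in any language with a temporary-file helper and an atomic rename. Here is a minimal Python sketch of the idea (not the Racket code below; prepend_to_file and its arguments are made up for illustration, and the error handling is deliberately thin):
import os
import tempfile

def prepend_to_file(path, header):
    # create and open a temporary file in the same directory as path
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmpname = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as out, open(path) as original:
            out.write(header)           # write the new content first
            for line in original:       # then copy the existing content
                out.write(line)
        os.replace(tmpname, path)       # atomic rename over the original
    except BaseException:
        os.unlink(tmpname)              # something went wrong: drop the temp file
        raise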
Solving this in Racket by cheating
As I said, getting the above process right is hard, and it is therefore something you often want to rely on a battle-tested library for. Racket has such a thing: call-with-atomic-output-file. This seems to be designed to solve exactly this problem: it deals with creating and opening the temporary file for you, deals with the renaming at the end and cleans up appropriately. All you need is a function which copies things around.
So here is a function, prepend-to-file, which uses call-with-atomic-output-file to try to do what you want. This is Racket-specific, in many ways, and it is also somewhat overengineered.
(define (prepend-to-file file content #:buffer-size (buffer-size 40960))
  ;; prepend content to file
  ;;
  ;; Try to be a bit too clever about whether we're copying strings or bytes
  ;; based on the argument
  (let-values ([(read-it! write-it make-it)
                (if (bytes? content)
                    (values read-bytes! write-bytes make-bytes)
                    (values read-string! write-string make-string))])
    (call-with-atomic-output-file file
      (λ (out path)
        ;; out is open for writing to the temporary file, path is the
        ;; temporary file's pathname which we don't care about
        (call-with-input-file file
          (λ (in)
            ;; in is now open for reading from the real file
            (write-it content out)
            (let ([buffer (make-it buffer-size)])
              ;; copy in to out using a buffer
              (for ([nread (in-producer (thunk (read-it! buffer in))
                                        eof-object?)])
                (write-it buffer out 0 nread)))
            ;; OK just return the file for want of anything better
            file))))))
I think it's reasonably likely that the above code actually works in most reasonable cases.
Solving this in Racket without cheating
If we could write call-with-atomic-output-file then we could solve the problem without cheating. But getting this right is hard. Here is an attempt to do this, which is almost certainly incorrect:
(define (call/temporary-output-file file proc)
  (let ([tmpname (string-append file
                                "-"
                                (number->string (random (expt 2 24))))]
        [managed #f]
        [once #t])
    ;; tmpname is the name of the temporary file: this assumes pathnames are
    ;; only strings, which is wrong.  managed is a flag which tells us if
    ;; proc returned normally, once is a flag which attempts to prevent any
    ;; continuation nasties so the whole thing can only happen once.
    (call-with-output-file tmpname
      (λ (out)
        (dynamic-wind
          (thunk
           (when (not once)
             ;; if this is the case we're getting back in, and this
             ;; is not OK
             (error 'call/temporary-output-file
                    "this is hopeless")))
          (thunk
           ;; call proc and if it returns normally note that
           (begin0 (proc out tmpname)
             (set! managed #t)))
          (thunk
           ;; close the output port regardless
           (close-output-port out)
           (if managed
               ;; We did OK, so rename the file in place
               (rename-file-or-directory tmpname file #t)
               ;; failed, nuke the temporary file
               (when (file-exists? tmpname)
                 (delete-file tmpname)))
           ;; finally set once false to prevent shenanigans
           (set! once #f)))))))
Notes:
this is still Racket-specific, but it now depends only on simpler functions which have, probably, more obvious counterparts in other implementations (or in the standard);
it tries to deal with some of the edge cases, but almost certainly misses some;
it certainly does not cope with cases such as the rename failing, and so on;
Again: don't use this: it's almost certainly buggy.
However if you did use this, then you could simply splice it in instead of call-with-atomic-output-file in the above code and it will, often but probably not always, work.

Related

How to do an incremental read of binary files

TL;DR: can I do an incremental read of binary files with Red or Rebol?
I would like to use Red to process some large (13MB to 2GB) structured binary files (Kurzweil synthesizer files). I've used other languages (C, Go, Tcl, Ruby, Dart) to walk through these files, and now I'd like to do the same with Red or Rebol.
Is there a way to incrementally read binary files, byte by byte? All I see is read/binary which seems to slurp the entire file at once (or a part of a file).
I'll need to jump around a little bit, too (either peek at the next byte, or skip to the end of a section, or skip past variable length strings to the start of data).
(Yes, I could make some helpers that tracked the position and used read/part/seek.)
I would like to make a call to the low level OS read/seek if that is possible - something new to learn.
This is on macos, but a portable solution would be great.
Thanks!
PS: "open/read %abc" gives an error "*** Script Error: open does not allow file! for its port argument", even though the help message say the port argument is "port [port! file! url! block!]"
Rebol has ports for that, which are planned for the 0.7.0 release in Red. So current I/O is very basic and buffer-only, and open is a preliminary stub.
I would like to make a call to the low level OS read/seek if that is possible - something new to learn.
You can leverage Rebol or Red/System FFI as a learning exercise.
Here is how you would do it in Rebol:
>> file: open/direct/binary %file.dat
>> until [none? probe copy/part file 20]
>> close file
#{732F7072696E74657253657474696E6773312E62}
#{696E504B01022D00140006000800000021006149}
#{0910890100001103000010000000000000000000}
...
#{000000006A290000646F6350726F70732F617070}
#{2E786D6C504B0506000000000D000D0068030000}
#{292C00000000}
none
first file or pick file 1 will return the next byte value (integer!)
This even works with text files: with open/lines/direct, copy/part file 20 will return 20 lines, or you can use pick file 1 or first file to get the next line.
Soon this will be available on Red too.

Determining if two rar files are part of the same set

Let's say I have two files, (name).n.rar and (name).n+1.rar, which appear to be part of the same set (same size, etc). Is there any easy way to tell if they're actually part of the same set, without first downloading the full set? Currently the only way I can tell is by downloading an instance of every file and then seeing if WinRAR gives me an error when I try to unpack them.
(And on a related note, assuming there is such a method, can I do the same without having adjacent parts?)
Ideally there's an existing program that can do this, but I can code my own if necessary.
Further notes: These are two sets of archives of the same file. They appear identical to obvious checks: filenames are sequential, contents are sane, sizes are identical, same number of parts. I then receive a full set of files. If they're not from the same set, I can't unrar them - though it seems that WinRAR will proceed to 100% before giving me the CRC error (file corrupt).
New Answer
All tests were made using WinRAR 5.01 32-bit. Since the algorithm should remain the same, the following statements should also be valid for previous versions. Feel free to comment if you know that's not true.
I'll give a short summary of the chat discussion. I tried packing a file larger than 1 GB several times, then mixed up the parts and tried to extract the archives: it worked. So the problem was not the size of the file.
I thought about three possible explanations for the problem:
The architecture influenced the packaging process: different people packed the files, and mixing the parts would result in an error;
Different people packed the files with a slightly different part size (for example 250 MB versus 250000 KB). This would have been noticeable in the file properties, though;
The files were corrupted during the download: re-downloading them would confirm this hypothesis.
I was most curious about the first one: could the architecture influence the packaging process?
I found out the answer is yes, it can. Here are the steps to reproduce the experiment:
Pack your files into an archive, specifying a precise part size, on computer A;
Pack the same exact files, with the same exact part size, on computer B (TODO: check whether this experiment still holds with similar architectures, e.g. an Intel i7 and an Intel i5), which has a different architecture (e.g. an Intel processor versus an AMD processor);
Transfer one part (or more, if you wish, but of course not all of them!) from computer B to computer A. Remember to delete those parts from computer A before the transfer;
Place all the files in the same directory and check that they all have the same name (e.g. "AAA part1", "AAA part2"...);
Extract them;
Enjoy your CRC error!
Tests were made using an Intel i7-3632QM and an AMD FX 6300.
I suspect that the compressed data is the same, but the CRC codes are different.
Old Answer
There is a way indeed. During my Computer Science studies we had a Computer Forensics class, where I learned that every file has a fixed beginning (a header, we could say) that lets a program recognize its type and how to interpret it. To see it, you just have to open the file with a text editor (Notepad++ is the best so far, I guess).
For example, jpeg images begin with ÿØÿá.
I tried to store a video in some split .rar files, and telling whether they are part of the same archive was simpler than I thought.
Every RAR file begins with Rar!. On the second or third line, the name of the file stored in the archive should appear: in my case, myVideo.mp4. If all your archives contain that filename, they're probably part of the same archive.
Things get worse if there are several files in the archive and you don't know their names. In fact, if there is more than one file, the structure of the RAR files is as follows:
File 1:
Rar!
NUL NUL NUL //Random things here
NUL NUL NUL NUL NUL myVideo.mp4 NUL NUL NUL NUL
//Random things here. If the size of the file exceeds this archive part,
//the next part will begin with the same name.
//Let's assume that this is happening.
EOF
File 2:
Rar!
NUL NUL NUL //Random things here
NUL NUL myVideo.mp4 NUL NUL NUL
//This time the file is complete. Since there is still space in the archive,
//it will add another file
NUL NUL NUL NUL mySecondVideo.mp4 NUL NUL NUL NUL
EOF
Let's assume that at the end of the second archive, mySecondVideo hasn't been fully compressed yet.
File 3:
Rar!
NUL NUL NUL
NUL NUL NUL NUL mySecondVideo.mp4 NUL
NUL NUL NUL
NUL myTextFile.txt
NUL NUL NUL mySecondTextFile.txt NUL
EOF
If mySecondTextFile.txt isn't yet fully compressed, my fourth file will begin with its name.
I hope this is clear; I tried to keep it as simple as possible. In the case of more files, I would start from the last archive, write down the first filename found in that part, and search for it in the previous one. If I found that name, I'd repeat the procedure back to the first archive.
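If you want to automate that check, a rough sketch (Python, for illustration; the function name and expected_name are made up, myVideo.mp4 stands in for whatever filename you expect, and this is only a heuristic, not a real RAR parser) is to verify the RAR signature at the start of each part and look for the expected filename in the header bytes:
RAR4_SIGNATURE = b"Rar!\x1a\x07\x00"   # RAR 5.x archives use Rar!\x1a\x07\x01\x00 instead

def looks_like_same_set(parts, expected_name=b"myVideo.mp4"):
    # every part must start with the RAR signature, and the stored
    # filename should show up somewhere in the first header bytes
    for part in parts:
        with open(part, "rb") as f:
            head = f.read(64 * 1024)
        if not head.startswith(RAR4_SIGNATURE):
            return False
        if expected_name not in head:
            return False
    return True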
I'm not that familiar with the RAR format, but in case you decide to write your program in Java I can recommend using 7-Zip-JBinding.
http://sevenzipjbind.sourceforge.net/
http://sevenzipjbind.sourceforge.net/basic_snippets.html#open-multipart-rar-archives
You can download the first n+1 parts of the archive and then call the extract() method, ignoring the output data and only caring about the
IArchiveExtractCallback.setOperationResult(ExtractOperationResult)
calls (checking that the CRC was OK), and monitoring which files get opened through
IArchiveOpenVolumeCallback.getStream(java.lang.String)
If volume n+2 gets requested, you can conclude that volume n+1 was the right one.
(I'm not 100% sure about this conclusion, but I would give it a try.)

Reading MANY files at once in Fortran

I have 500,000 files which I need to read in Fortran, and each file has ~14,000 entries in it (each entry is only about 100 characters long). I need to process the files one line at a time: for example, I need to process line 1 of all 500,000 files before moving on to line 2 of the files, and so forth.
I cannot open them all at once (I tried making an array of file pointers and opening them all) because there would be too many files open at once. Instead, I would like to do something like the following:
do iline = 1,Nlines
  do ifile = 1,Nfiles
    ! open the file
    ! read a line
    ! close the file
  enddo
enddo
In hopes that this would allow me to read one line at a time (from each file) and then move on to the next line (in each file). Unfortunately, each time I open the file it starts me off at line 1 again. Is there any way to open/close a file and then open it again where you left off previously?
Thanks
Unfortunately this is not possible in standard Fortran. Even if you specify
position="ASIS"
the actual position will be unspecified for a unit that was not already connected, and will in fact be the beginning of the file on most systems.
That means you have to use
read(*,*)
enough times to get to the right place in the file.
You could also use stream access. The file would again be opened at the beginning, but you can use
read(u,*,pos=n) number
where n is the position saved from the previous open. You can get the position from
inquire(unit=u, pos=n)
You would open the file with access="STREAM".
Also, 500,000 open files is indeed too much. There are ways to inquire about the system limits and to control them, but your compiler may also have some limits: http://www.cyberciti.biz/faq/linux-increase-the-maximum-number-of-open-files/
Another solution: couldn't you store the content of the files in memory? A couple of gigabytes is OK today, but it may not be enough for you.
You can try using fseek and ftell in something like the following.
! initialize an array of 0's
do iline = 1,Nlines
  do ifile = 1,Nfiles
    ! open the file
    ! fseek(fd, array(ifile))
    ! read a line
    ! array(ifile) = ftell(fd)
    ! close the file
  enddo
enddo
The (untested) idea is to store the offset of each file in an array and to position the cursor at that place upon opening the file. Then, once a line is read, ftell retrieves the current position, which is saved for the next round. If all entries have the same length, you can skip the array and just store one value.
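To make the bookkeeping explicit, here is the same idea written out in Python rather than Fortran (purely illustrative; process stands in for whatever you do with each line): one offset per file, seek to it when the file is reopened, and remember the new position after reading a line.
def process_line_by_line(filenames, nlines):
    offsets = [0] * len(filenames)          # initialize an array of 0's
    for iline in range(nlines):
        for ifile, name in enumerate(filenames):
            with open(name) as f:           # open the file
                f.seek(offsets[ifile])      # jump to where we left off last time
                line = f.readline()         # read a line
                offsets[ifile] = f.tell()   # remember the new position
            process(line)                   # placeholder for the real work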
If the files have fixed, i.e., constant, record lengths, you could use direct access. Then you could "directly" read a specific record. A big "if" however.
The overhead of all the file opening/closing will be a big performance bottleneck.
You should try to read as much as you can for each open operation given whatever memory you have:
pseudocode:
loop until done:
  loop over all files:
    open
    fseek                      ! as in damien's answer
    read N lines into array    ! N = 100, e.g.
    save ftell value for file
    close
  end file loop
  loop over N output files:
    open
    write array data
    close

correct way to write to the same file from multiple processes awk

The title says it all.
I have 4 awk processes logging to the same file, and output seems fine, not mangled, but I'm not sure that just redirecting print output like this: print "xxx" >> file in every process is the right way to do it.
There are many similar questions around the site, but this one is particularly about awk and a pragmatic, code-correct way to approach the problem.
EDIT
Sorry folks, of course I wasn't "just redirecting" like I wrote, I was appending.
No it is not safe.
the awk print "foo" > "file" will open the file and overwrite the file content, till the end of script.
That is, if your 4 awk processes started writing to the same file on different time, they overwrite the result of each other.
To reproduce it, you could start two (or more) awk like this:
awk '{while(++i<9){system("sleep 2");print "p1">"file"}}' <<<"" &
awk '{while(++i<9){system("sleep 2");print "p2">"file"}}' <<<"" &
and if you monitor the content of the file at the same time, you will see that in the end there are not exactly 8 "p1" and 8 "p2" lines.
Using >> avoids losing entries, but the order of the entries from the 4 processes could be mixed up.
EDIT
Ok, the > was a typo.
I don't know why you really need 4 processes to write into the same file. As I said, with >> the entries won't get lost (if your awk scripts work correctly). However, personally I wouldn't do it this way: if I had to have 4 processes, I would write to different files. I don't know your requirements, I'm just speaking in general.
Outputting to different files makes testing and debugging easier: imagine one of your processes has a problem and you want to track it down, etc.
I think using the operating system's print facility is safe. In effect it appends the string you provide as a log line to the file's write buffer, so the system manages the actual writing of the data to disk; if another process wants to use the same file, the system will see that the resource is already claimed and wait for the first process to finish before allowing the second process to write to the buffer.

How do I delete a program header from an ELF binary

I want to write a utility to remove a program header from an ELF binary. For example, when I run readelf -l /my/elf I get a listing of all the program headers: PHDR INTERP ... GNU_STACK GNU_RELRO. When I run my utility, I would like to get all the same program headers back in the same order, minus the one I deleted. Is there any easier way to do this than recreating the entire ELF from scratch, skipping the unwanted header?
Is there any easier way to do this than recreating the entire ELF from scratch
Sure: program headers form a fixed-record table at an offset given by ehdr.e_phoff, containing .e_phnum entries of .e_phentsize bytes.
To delete one entry, simply copy the rest of the entries over it, and decrement .e_phnum. That's all there is to it.
Beware: deleting some entries will likely cause the dynamic loader to crash. GNU_STACK is about the only header that can be deleted without too much harm (that I can think of).
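As a concrete illustration of that edit, here is a rough Python sketch for a 64-bit little-endian ELF (the offsets are hard-coded from the ELF64 header layout, GNU_STACK is picked as the entry to drop, and the table is zero-padded so nothing else in the file moves); it is a sketch of the copy-and-decrement idea, not a hardened tool:
import struct

PT_GNU_STACK = 0x6474e551                   # about the only entry safe to drop

def delete_phdr(path, p_type=PT_GNU_STACK):
    # Delete the first program header of the given type by copying the
    # following entries over it and decrementing e_phnum.
    # 64-bit little-endian ELF only; edits the file in place.
    with open(path, "r+b") as f:
        ehdr = f.read(64)
        assert ehdr[:4] == b"\x7fELF" and ehdr[4] == 2     # ELFCLASS64
        e_phoff     = struct.unpack_from("<Q", ehdr, 0x20)[0]
        e_phentsize = struct.unpack_from("<H", ehdr, 0x36)[0]
        e_phnum     = struct.unpack_from("<H", ehdr, 0x38)[0]

        # read the whole program header table
        f.seek(e_phoff)
        table = bytearray(f.read(e_phnum * e_phentsize))

        # find the entry to delete by its p_type (first 4 bytes of each entry)
        for i in range(e_phnum):
            off = i * e_phentsize
            if struct.unpack_from("<I", table, off)[0] == p_type:
                # copy the remaining entries over the deleted one and pad
                # with zeros so the table keeps its original size on disk
                del table[off:off + e_phentsize]
                table += b"\x00" * e_phentsize
                f.seek(e_phoff)
                f.write(table)
                f.seek(0x38)
                f.write(struct.pack("<H", e_phnum - 1))    # decrement e_phnum
                return True
        return False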
Update:
Yes, setting .p_type to PT_NULL is another (and simpler) approach. But such entries are generally not expected to be present, and you may find some systems where PT_NULL will trigger an assertion in the loader (or in some other program).
Finally, adding a new Phdr might be tricky. Usually there is no space to expand the table (as it is immediately followed by some other data, e.g. .text). You can relocate the table to the end of the file, and set .e_phoff and .e_phnum to correspond to the new table, but many programs expect the entire Phdr table to be loaded and available at runtime, and that is not easy to arrange, as the new location at the end of the file will not be "covered" by any PT_LOAD segment.
The GNU Binary File Descriptor library (libbfd) may be helpful.