Scanning big binary with Erlang - file-io

I'd like to scan large (>500 MB) binary files for structs/patterns. I am new to the language and hope that someone can give me a start. The files are a database containing segments. A segment starts with a fixed-size header, followed by a fixed-size optional part, followed by the payload/data part of variable length. For a first test I'd just like to log the number of segments in the file. I have already googled for a tutorial but found nothing that helped. I need a hint or a tutorial that is not too far from my use case to get started.
Greets
Stefan

You need to learn about Bit Syntax and Binary Comprehensions. More useful links to follow: http://www.erlang.org/documentation/doc-5.6/doc/programming_examples/bit_syntax.html and http://goto0.cubelogic.org/a/90. You will also need to learn how to process files: reading from files (line by line, chunk by chunk, at given positions in a file, etc.) and writing to files in several ways. The file-processing functions are explained in the documentation for the file module. You can also choose to look at the source code of large file-processing libraries that ship with Erlang, e.g. disk_log, dets and Mnesia. These libraries read and write files heavily, and their source code is open for you to see. I hope that helps.
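For example, here is a minimal sketch of the bit syntax matching a fixed-size header (the field layout below is invented for illustration; substitute your real segment format):

%% Sketch: assumes a hypothetical header of a 4-byte magic tag followed
%% by a 16-bit big-endian payload length.
parse_header(<<Magic:4/binary, PayloadLen:16/big-unsigned, Rest/binary>>) ->
    {Magic, PayloadLen, Rest}.

A match like this binds the header fields and leaves the remainder of the binary in Rest, so you can keep walking the file segment by segment.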

Here is a synthesized sample problem: I have a binary file (test.txt) that I want to parse. I want to find all the binary patterns of <<$a, $b, $c>> in the file.
The content of "test.txt" is arbitrary: I chose the string "abc" as my target string and want to find all the abc's in my test file.
A sample program (lab.erl):
-module(lab).
-compile(export_all).

find(BinPattern, InputFile) ->
    BinPatternLength = byte_size(BinPattern),
    {ok, S} = file:open(InputFile, [read, binary, raw]),
    loop(S, BinPattern, 0, BinPatternLength, 0),
    file:close(S),
    io:format("Done!~n", []).

%% Slide a window the size of the pattern over the file, one byte at a time.
loop(S, BinPattern, StartPos, Length, Acc) ->
    case file:pread(S, StartPos, Length) of
        {ok, Bin} ->
            case Bin of
                BinPattern ->
                    io:format("Found one at position: ~p.~n", [StartPos]),
                    loop(S, BinPattern, StartPos + 1, Length, Acc + 1);
                _ ->
                    loop(S, BinPattern, StartPos + 1, Length, Acc)
            end;
        eof ->
            io:format("I've proudly found ~p matches:)~n", [Acc])
    end.
Run it:
1> c(lab).
{ok,lab}
2> lab:find(<<"abc">>, "./test.txt").
Found one at position: 43.
Found one at position: 103.
I've proudly found 2 matches:)
Done!
ok
Note that the code above is not very efficient (the scanning process shifts one byte at a time) and it is sequential (not utilizing all the "cores" on your computer). It is meant only to get you started.
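If the file fits in memory, one faster alternative is to read it in a single call and let the binary module do the searching; binary:matches/2 has been in OTP since R14. This is only a sketch, not a drop-in replacement for the program above (for one thing, it reports non-overlapping matches):

%% Sketch: find all matches in one pass with the binary module.
find_all(BinPattern, InputFile) ->
    {ok, Bin} = file:read_file(InputFile),
    Matches = binary:matches(Bin, BinPattern), % [{Offset, Length}, ...]
    io:format("Found ~p matches.~n", [length(Matches)]).

The binary module compiles the pattern to a Boyer-Moore (single pattern) or Aho-Corasick (multiple patterns) search, so it avoids both the byte-by-byte shifting and the per-byte pread calls.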

When your data fits into memory, the best thing you can do is read the data in whole using file:read_file/1. If you can't, use the file in raw mode. You can then parse the data using the bit syntax. If you write it in the right manner you can achieve parsing speeds of tens of MB/s, especially when the parsing module is compiled with HiPE. The exact parsing technique depends on the exact segment data format and on how robust/accurate a result you are looking for. For parallel parsing you can take inspiration from Tim Bray's Wide Finder project.
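As a sketch of how the bit syntax could answer the original question (counting segments), assume, purely for illustration, that each segment starts with a 16-bit big-endian field holding the total segment length in bytes:

%% Sketch only: the 16-bit length prefix is an invented layout; replace it
%% with the real header format of the database file.
count_segments(File) ->
    {ok, Bin} = file:read_file(File),
    count(Bin, 0).

count(<<Len:16/big-unsigned, _/binary>> = Bin, N)
  when Len >= 2, byte_size(Bin) >= Len ->
    <<_Segment:Len/binary, Rest/binary>> = Bin,
    count(Rest, N + 1);
count(_, N) ->
    N.

Because the length is decoded before anything else, each recursion skips over a whole segment instead of scanning byte by byte.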

Related

OpenVMS: Extracting RMS Indexed file to Windows as a sequential flat file

I haven't used OpenVMS for 20+ years. It was my first OS. I've been asked if it is possible to copy the data from RMS files on an OpenVMS server to Windows as a text file - so that it's readable.
No-one has experience or knowledge of the record structures etc.
The files are xyz.DAT and are relative files. I'm hoping the DAT files are fixed length.
My first attempt would be to try to use Datatrieve (DTR), but I get an error that the image isn't loaded.
I thought it might be as easy as using CONVERT/FDL = nnnn.FDL - changing the Relative to Sequential. The file still seems to be unreadable.
Is there an easy way to stream an RMS indexed file to a flat ASCII file?
I used to use COBOL and C to access the data in the past, but had lots of libraries to help....
I've noticed some solutions may use ODBC to connect, but I'm not sure what I can or cannot install on the server.
I can FTP to the server using FileZilla....
Another plan is writing a C application to read the file and output it as strings..... or DCL too..... it doesn't have to be quick...
Any ideas?
As mentioned before:
The simple solution MIGHT be to just use: $ TYPE/OUT=test.TXT test.DAT.
This will handle Relative and Indexed files alike.
It is much the same as $ CONVERT /FDL=NL: test.DAT test.TXT
Both will just read records from the source and transfer the bytes, byte for byte, to the records in a sequential file.
FTP in ASCII mode will transfer that nicely to Windows.
You can also use an 'inline' FDL file to generate a 'unix' LF file like:
$ conv /fdl="record; format stream_lf" test.DAT test.TXT
Or CR-LF file using:
$ conv /fdl="record; format stream" test.DAT test.TXT
Both can be transferred in binary or ASCII mode with FTP.
MOSTLY - because this really only works well for TEXT-ONLY source .DAT files.
There should be no CR, LF, FF or NUL characters in the source or things will break.
As 'habo' points out, use DUMP /RECORD=COUNT=3 to see how 'readable' the source data is.
If you spot 'binary' data using DUMP then you will need to find a record definition somewhere which maps bytes to integers, floating points or dates as needed.
These definitions can be COBOL LIB files or BASIC MAPs, and are often stored IN the CDD (Common Data Dictionary) or indeed in DATATRIEVE .DIC dictionaries.
To use such definition you likely need a program to just read following the 'map' and write/print as text. Normally that's not too hard - notably not when you can find an example program on the server to tweak.
If it is just one or two 'suspect' byte ranges, then you can create a DCL loop to read and write and use F$EXTRACT to select the chunks you like.
If you want further help, kindly describe in words what kind of data is expected and perhaps provide the output from DUMP for 3 or 5 rows.
Good luck!
Hein.

Scheme Macro - Adding a Header to a file

I am writing a Scheme macro for a simulation tool. I create thousands of files and I want to add a header (6 lines) to each file. The code that builds my header runs well and the header is created correctly.
But adding the header to my file is buggy. It does not add the 6 header lines to the top of my file while leaving the rest untouched; instead it deletes the information at the start of the file. How much is deleted depends on the total length of the header.
(let* ((out (open-input-output-file filename0)))
  (display header out)
  (newline out)
  (close-output-port out))
This is how my file looks without the header:
TracePro Release: 20 6 0
Irradiance Map Data for D:****\TracePro\Aktive\sim_mod_09.oml
Linear Units in millimeters
Data for absorbing_area_focuscircle Surface 0
Data generated at 10:55:56 May 28, 2021
This is how my file looks with the header:
axle x y z a b c
pyra 0 0 0 0 0 0
lens 0 0 0 0 0 0
coll 0 0 0 0 0 0
mir1 0 0 0 0 0 0
glass1 0 0 0 0 0 0
ing_area_focuscircle Surface 0
Data generated at 10:57:29 May 28, 2021
Raytrace Time: mins: 0, secs: 0*
Projected Plane Extent from surface geometry
TopLeft:(-1.05125,-214.755,-1.05125)
TopRight:(1.05125,-214.755,-1.05125)
BottomLeft:(-1.05125,-214.755,1.05125)
BottomRight:(1.05125,-214.755,1.05125)
This isn't really an answer, especially as my knowledge of the Scheme language standards isn't good enough to even know if this is possible within a strictly-defined Scheme. However I'll show why it's hard, and then give an example of how to solve it in Racket, first by cheating to make a probably-correct answer and then by trying to do it the hard way to make a probably-not-correct answer.
Why it is hard
No modern filesystem I know of allows you to open a file for 'insertion' where new content is inserted into the file, pushing existing content 'down' the file. Instead you can open a file for writing, conceptually, in two ways:
for appending, which will append new content at the end;
for overwriting, which will overwrite existing content with new.
(Actually these may be the same: opening for appending may just open for overwriting and then move the current location to the end of the file.)
So what you're doing in your sample is opening for overwriting, and then clobbering the content of the file with the header.
How to do it, in outline
The way to do what you need to do, in outline, is:
create and open a temporary file in the same directory as the file you care about;
write the new content to the temporary file;
copy all the content of the existing file to the temporary file;
close the temporary file;
if all is well, rename the temporary file on top of the existing file, if all is not OK, delete it.
If you do this carefully it is safe, because file renames are atomic, or should be, in the filesystem, if the two files are in the same directory. That means that the rename should either completely succeed or completely fail, even if the system crashes part way through or the filesystem fills up or something like that. If the filesystem doesn't guarantee that then you're pretty much stuck.
But doing it carefully is not easy (I should admit here that some of my background is doing things like this to system-critical files, so I've spent too long thinking about how to make this safe in a context where getting it wrong is very serious indeed).
Solving this in Racket by cheating
As I said, getting the above process right is hard, and it is therefore something you often want to rely on a battle-tested library for. Racket has such a thing: call-with-atomic-output-file. This seems to be designed to solve exactly this problem: it deals with creating and opening the temporary file for you, deals with the renaming at the end and cleans up appropriately. All you need is a function which copies things around.
So here is a function, prepend-to-file which uses call-with-atomic-output-file to try and do what you want. This is Racket-specific, in many ways, and it is also somewhat overengineered.
(define (prepend-to-file file content #:buffer-size (buffer-size 40960))
  ;; prepend content to file
  ;;
  ;; Try to be a bit too clever about whether we're copying strings or bytes
  ;; based on the argument
  (let-values ([(read-it! write-it make-it)
                (if (bytes? content)
                    (values read-bytes! write-bytes make-bytes)
                    (values read-string! write-string make-string))])
    (call-with-atomic-output-file file
      (λ (out path)
        ;; out is open for writing to the temporary file, path is the
        ;; temporary file's pathname which we don't care about
        (call-with-input-file file
          (λ (in)
            ;; in is now open for reading from the real file
            (write-it content out)
            (let ([buffer (make-it buffer-size)])
              ;; copy in to out using a buffer
              (for ([nread (in-producer (thunk (read-it! buffer in))
                                        eof-object?)])
                (write-it buffer out 0 nread)))
            ;; OK just return the file for want of anything better
            file))))))
I think it's reasonably likely that the above code actually works in most reasonable cases.
Solving this in Racket without cheating
If we could write call-with-atomic-output-file then we could solve the problem without cheating. But getting this right is hard. Here is an attempt to do this, which is almost certainly incorrect:
(define (call/temporary-output-file file proc)
  (let ([tmpname (string-append file
                                "-"
                                (number->string (random (expt 2 24))))]
        [managed #f]
        [once #t])
    ;; tmpname is the name of the temporary file: this assumes pathnames
    ;; are only strings, which is wrong.  managed is a flag which tells us
    ;; if proc returned normally, once is a flag which attempts to prevent
    ;; any continuation nasties so the whole thing can only happen once.
    (call-with-output-file tmpname
      (λ (out)
        (dynamic-wind
          (thunk
           (when (not once)
             ;; if this is the case we're getting back in, and this
             ;; is not OK
             (error 'call/temporary-output-file
                    "this is hopeless")))
          (thunk
           ;; call proc and if it returns normally note that
           (begin0 (proc out tmpname)
             (set! managed #t)))
          (thunk
           ;; close the output port regardless
           (close-output-port out)
           (if managed
               ;; We did OK, so rename the file in place
               (rename-file-or-directory tmpname file #t)
               ;; failed, nuke the temporary file
               (when (file-exists? tmpname)
                 (delete-file tmpname)))
           ;; finally set once false to prevent shenanigans
           (set! once #f)))))))
Notes:
this is still Racket-specific, but it now depends only on simpler functions which have, probably, more obvious counterparts in other implementations (or in the standard);
it tries to deal with some of the edge cases, but almost certainly misses some;
it certainly does not cope in cases such as the rename failing and so on;
Again: don't use this: it's almost certainly buggy.
However if you did use this, then you could simply splice it in instead of call-with-atomic-output-file in the above code and it will, often but probably not always, work.

How to do an incremental read of binary files

TL;DR: can I do an incremental read of binary files with Red or Rebol?
I would like to use Red to process some large (13MB to 2GB) structured binary files (Kurzweil synthesizer files). I've used other languages (C, Go, Tcl, Ruby, Dart) to walk through these files, and now I'd like to do the same with Red or Rebol.
Is there a way to incrementally read binary files, byte by byte? All I see is read/binary which seems to slurp the entire file at once (or a part of a file).
I'll need to jump around a little bit, too (either peek at the next byte, or skip to the end of a section, or skip past variable length strings to the start of data).
(Yes, I could make some helpers that tracked the position and used read/part/seek.)
I would like to make a call to the low level OS read/seek if that is possible - something new to learn.
This is on macos, but a portable solution would be great.
Thanks!
PS: "open/read %abc" gives an error "*** Script Error: open does not allow file! for its port argument", even though the help message say the port argument is "port [port! file! url! block!]"
Rebol has ports for that, which are planned for 0.7.0 release in Red. So, current I/O is very basic and buffer-only, and open is a preliminary stub.
I would like to make a call to the low level OS read/seek if that is possible - something new to learn.
You can leverage the Rebol or Red/System FFI as a learning exercise.
Here is how you would do it in Rebol:
>> file: open/direct/binary %file.dat
>> until [none? probe copy/part file 20]
>> close file
#{732F7072696E74657253657474696E6773312E62}
#{696E504B01022D00140006000800000021006149}
#{0910890100001103000010000000000000000000}
...
#{000000006A290000646F6350726F70732F617070}
#{2E786D6C504B0506000000000D000D0068030000}
#{292C00000000}
none
first file or pick file 1 will return the next byte value (integer!)
This even works with text files: open/lines/direct, in that case copy/part file 20 will return 20 lines, or you can use pick file 1 or first file to get the next line.
Soon this will be available on Red too.

Ansys multiphysics: blank output file

I have a model of a heating process on Ansys Multiphysics, V11.
After running the simulation, I have a script to plot a temperature profile:
!---------------- POST PROCESSING -----------------------
/post1 ! database postprocessor
!---define temperature profile
path,s_temp1,2,,100 ! define a path
ppath,1,,dop/2,0,0 ! create a path point
ppath,2,,dop/2,1.5,0 ! create a path point
PDEF,surf_t1,TEMP, ,noav ! map temperature onto the path
plpath,surf_t1 ! plot a path
What I now need, is to save the resulting path in a text file. I have already looked online for a solution, and found the following code to do it, which I appended after the lines above:
/OUTPUT,filename,extension
PRPATH,surf_t1
/OUTPUT
Ansys generates the file filename.extension but it is empty. I tried to place the OUTPUT command in a few locations in the script, but without any success.
I suspect I need to define something else, but I have no idea where to look, as Ansys documentation online is terribly chaotic, and all internet pages I've opened before writing this question are not better.
A final note: Ansys V11 is an old version of the software, but I don't want to upgrade it and fit the old model to the new software.
For the output of the simulation (which includes all calculation steps, sub-step descriptions and node-by-node results) the output must be declared at the beginning of the script, not in the postprocessing phase.
Declaring
/OUTPUT,filename,extension
in the preamble of the main script ensures that the output is stored in the right location, with the desired extension. At the end of the script, you must then declare
/OUTPUT
to reset the output file location for ANSYS.
The output of the PATH commands made in the postprocessing script is, however, not printed to that file.
It is convenient to use
*CFOPEN,file,ext
*VWRITE,Vector(1,1),Vector(1,2)
(2F12.6)
*CFCLOSE
where Vector is a two-column array created by *DIM that holds the data you want to write to the file.
As *VWRITE is a special command, run it from a file, i.e. a macro such as macro_output.mac

Open a .cfile from rtl_sdr after convert with GNU Radio

I have a binary file (capture.bin) from the rtl_sdr tool. I converted it to a .cfile following this manual: http://sdr.osmocom.org/trac/wiki/rtl-sdr#Usingthedata
How can I get at the data in this file? The goal is to get output in a numerical format from the source. Is this possible?
That actually is covered by a GNU Radio FAQ entry.
What is the file format of a file_sink? How can I read files produced by a file sink?
All files are in pure binary format. Just bits. That’s it. A floating point data stream is saved as 32 bits in the file, one after the other. A complex signal has 32 bits for the real part and 32 bits for the imaginary part. Reading back a complex number means reading in 32 bits, saving that to the real part of a complex data structure, and then reading in the next 32 bits as the imaginary part of the data structure. And just keep reading the data.
Take a look at the Octave and Python files in gr-utils for reading in data using Octave and Python’s Scipy module.
The exception to the format is when using the metadata file format. These files are produced by the File Meta Sink block (http://gnuradio.org/doc/doxygen/classgr_1_1blocks_1_1file__meta__sink.html) and read by the File Meta Source block. See the manual page on the metadata file format for more information about how to deal with these files.
A one-line Python command to read the entire file into a numpy array is:
f = scipy.fromfile(open("filename"), dtype=scipy.uint8)
Replace the dtype with scipy.int16, scipy.int32, scipy.float32, scipy.complex64 or whatever type you were using.
Update
scipy.fromfile will be deprecated in v2.0, so use the numpy library instead:
import numpy
f = numpy.fromfile("filename", dtype=numpy.uint8)