How to do an incremental read of binary files - Rebol

TL;DR: can I do an incremental read of binary files with Red or Rebol?
I would like to use Red to process some large (13MB to 2GB) structured binary files (Kurzweil synthesizer files). I've used other languages (C, Go, Tcl, Ruby, Dart) to walk through these files, and now I'd like to do the same with Red or Rebol.
Is there a way to incrementally read binary files, byte by byte? All I see is read/binary which seems to slurp the entire file at once (or a part of a file).
I'll need to jump around a little bit, too (either peek at the next byte, or skip to the end of a section, or skip past variable length strings to the start of data).
(Yes, I could make some helpers that tracked the position and used read/part/seek.)
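To show the kind of helper I mean, here is a rough sketch of a position-tracking reader (in Python, purely for illustration; the names and file are made up):
# Illustrative only: a reader that tracks its position and supports peek/skip.
class BinReader:
    def __init__(self, path):
        self.f = open(path, "rb")
    def read_bytes(self, n):
        return self.f.read(n)          # consume n bytes
    def peek_byte(self):
        b = self.f.read(1)             # look at the next byte...
        if b:
            self.f.seek(-1, 1)         # ...then step back so it stays unread
        return b
    def skip(self, n):
        self.f.seek(n, 1)              # e.g. jump past a variable-length string

r = BinReader("file.dat")              # hypothetical Kurzweil file
magic = r.read_bytes(4)
r.skip(12)                             # skip to the start of the data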
I would like to make a call to the low level OS read/seek if that is possible - something new to learn.
This is on macOS, but a portable solution would be great.
Thanks!
PS: "open/read %abc" gives an error "*** Script Error: open does not allow file! for its port argument", even though the help message say the port argument is "port [port! file! url! block!]"

Rebol has ports for that; in Red they are planned for the 0.7.0 release. So current I/O is very basic and buffer-only, and open is a preliminary stub.
I would like to make a call to the low level OS read/seek if that is possible - something new to learn.
You can leverage the Rebol or Red/System FFI as a learning exercise.
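For reference, the OS level really is just three calls - open(2), lseek(2), read(2). A minimal sketch of that pattern (shown in Python, whose os module wraps the POSIX calls directly; the file name is hypothetical, and a Red/System FFI binding would target the same three functions):
import os

fd = os.open("file.dat", os.O_RDONLY)  # wraps open(2)
os.lseek(fd, 1024, os.SEEK_SET)        # wraps lseek(2): seek to offset 1024
chunk = os.read(fd, 16)                # wraps read(2): next 16 bytes
os.lseek(fd, 256, os.SEEK_CUR)         # skip 256 bytes from here
os.close(fd)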

Here is how you would do it in Rebol:
>> file: open/direct/binary %file.dat
>> until [none? probe copy/part file 20]
>> close file
#{732F7072696E74657253657474696E6773312E62}
#{696E504B01022D00140006000800000021006149}
#{0910890100001103000010000000000000000000}
...
#{000000006A290000646F6350726F70732F617070}
#{2E786D6C504B0506000000000D000D0068030000}
#{292C00000000}
none
first file or pick file 1 will return the next byte value (integer!).
This even works with text files: open/lines/direct - in that case copy/part file 20 will return 20 lines, or you can use pick file 1 or first file to get the next line.
Soon this will be available on Red too.

Related

OpenVMS: Extracting an RMS indexed file to Windows as a sequential flat file

I haven't used OpenVMS for 20+ years. It was my 1st OS. I've been asked if it is possible to copy the data from RMS files on an OpenVMS server to Windows as text files - so that they're readable.
No one has experience or knowledge of the record structures etc.
The files are xyz.DAT and are relative files. I'm hoping the DAT files are fixed length.
My 1st attempt would be to try and use Datatrieve (DTR), but I get an error that the image isn't loaded.
I thought it might be as easy as using CONVERT /FDL=nnnn.FDL and changing the organization from Relative to Sequential, but the file still seems to be unreadable.
Is there an easy way to stream an RMS indexed file to a flat ASCII file?
I used to use COBOL and C to access the data in the past, but had lots of libraries to help...
I've noticed some solutions may use ODBC to connect, but I'm not sure what I can or cannot install on the server.
I can FTP to the server using FileZilla...
Another plan is writing a C application to read a file and output it as strings... or DCL too... it doesn't have to be quick...
Any ideas?
As mentioned before, the simple solution MIGHT be to just use: $ TYPE/OUT=test.TXT test.DAT.
This will handle Relative and Indexed files alike.
It is much the same as $ CONVERT / FDL=NL: test.DAT test.TXT
Both will just read records from the source and transfer the bytes, byte for byte, to the records in a sequential file.
FTP in ASCII mode will transfer that nicely to Windows.
You can also use an 'inline' FDL file to generate a 'unix' LF file like:
$ conv /fdl="record; format stream_lf" test.DAT test.TXT
Or CR-LF file using:
$ conv /fdl="record; format stream" test.DAT test.TXT
Both can be transferred in Binary or ASCII mode with FTP.
MOSTLY - because this really only works well for a TEXT ONLY source .DAT file.
There should be no CR, LF, FF or NUL characters in the source or things will break.
As 'habo' points out, use DUMP /RECORD=COUNT=3 to see how 'readable' the source data is.
If you spot 'binary' data using DUMP then you will need to find a record definition somewhere which maps bytes to integers or floating points or dates as needed.
These definitions can be COBOL LIB files or BASIC MAPs, and are often stored IN the CDD (Common Data Dictionary) or indeed in DATATRIEVE .DIC DICTIONARIES.
To use such a definition you likely need a program which just reads following the 'map' and writes/prints as text. Normally that's not too hard - notably not when you can find an example program on the server to tweak.
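For what it's worth, such a 'read following the map' program can be very small. A sketch in Python (the record length and field layout here are invented; substitute the real map from the CDD or COBOL LIB file):
import struct

RECLEN = 32                            # made-up fixed record length
LAYOUT = "<i d 20s"                    # made-up map: int32, float64, 20 text bytes

with open("test.DAT", "rb") as f:
    while True:
        rec = f.read(RECLEN)
        if len(rec) < RECLEN:
            break                      # end of file
        num, val, text = struct.unpack(LAYOUT, rec)
        print(num, val, text.decode("ascii", "replace").rstrip())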
If it is just one or two 'suspect' byte ranges, then you can create a DCL loop to read and write and use F$EXTRACT to select the chunks you like.
If you want further help, kindly describe in words what kind of data is expected and perhaps provide the output from DUMP for 3 or 5 rows.
Good luck!
Hein.

Moving the "cursor" back a line for stdout

I have a little command line tool (written in Objective-C, runs under macOS) that tracks changes to folders and applies rules to files. This tool also informs the user about its progress, with lines like:
"Found 3 files of type Z and applied rule"
"Found 6 files of type X and applied rules"
Currently the tool outputs this feedback as an endless list, but that does not look very handy. What I'm after is a way to print the line for each file type only once, and then update the number in the terminal whenever the tool finds another file of that type - very similar to how "top" under Unix gives its feedback.
However, to do so, I'll need to move the cursor in the terminal back to the beginning of the line, and also one or multiple lines backwards.
Is this possible, and does anybody know how to do so?
Thanks
Norbert
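One portable approach is ANSI/VT100 escape sequences, which the macOS Terminal understands: ESC[nA moves the cursor up n lines and ESC[K clears to the end of the line. A minimal sketch of the redraw loop (in Python for brevity; the counters and sleep stand in for the real tool's work):
import sys, time

z, x = 0, 0
print("Found %d files of type Z and applied rule" % z)   # draw both lines once
print("Found %d files of type X and applied rules" % x)
for _ in range(3):                      # pretend we keep finding files
    time.sleep(1)
    z += 1
    x += 2
    sys.stdout.write("\x1b[2A")         # ESC[2A: move cursor up two lines
    sys.stdout.write("Found %d files of type Z and applied rule\x1b[K\n" % z)
    sys.stdout.write("Found %d files of type X and applied rules\x1b[K\n" % x)
    sys.stdout.flush()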

Generated corrupt large ply file - how to find the error

I just wrote a Java class to generate meshes from a cylinder list and store them to a PLY file. I tested the files with a hand-generated list of 3 cylinders. The resulting file opens fine in both MeshLab and CloudCompare.
When I use the class in my real program, I have to write a mesh for more than 13000 cylinders.
CloudCompare gives me the following error: Reading error (no access right?)
MeshLab gives this one: error details, unexpected eof
I already checked that my PLY file contains the exact number of vertices and faces defined in the header. I also made sure there are no NaN values contained (checked for 'n', 'a', etc. in winedit).
I can reproduce the errors with my test file of 3 hand-made cylinders by deleting the last line. But as mentioned earlier, I already checked that the line counts are correct (there might be an empty line my eyes missed, though, as scrolling down half a million lines is impossible).
So are there any programs available to parse a PLY file for errors? Open source tools would be appreciated here. Or are the files just too large? 436302 lines, to be exact. I use the ASCII version of PLY.
I found a non-open-source tool called nugraf, which provides information about the corrupted line numbers.
Java seems to print NaN as '?'. I did not check for this character, so the problem seems to be solved and I can now debug my Java software again.
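For anyone hitting the same thing: the header-versus-body check is quick to script. A rough sketch in Python (assumes an ASCII PLY with the header ending at end_header; the file name is made up):
counts = {}                            # element name -> declared count
with open("mesh.ply") as f:
    for line in f:
        parts = line.split()
        if parts[:1] == ["element"]:   # e.g. "element vertex 216000"
            counts[parts[1]] = int(parts[2])
        elif parts[:1] == ["end_header"]:
            break
    body = sum(1 for line in f if line.strip())
    print("declared:", sum(counts.values()), "found:", body)
    # Also worth grepping the body for '?', which Java prints for NaN.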

Determining if two rar files are part of the same set

Let's say I have two files, (name).n.rar and (name).n+1.rar, which appear to be part of the same set (same size, etc.). Is there any easy way to tell whether they're actually part of the same set, without first downloading the full set? Currently the only way I can tell is by downloading an instance of every file and then seeing if WinRAR gives me an error when I try to unpack them.
(And on a related note, assuming there is such a method, can I do the same without having adjacent parts?)
Ideally there's an existing program that can do this, but I can code my own if necessary.
Further notes: These are two sets of archives of the same file. They appear identical to obvious checks: filenames are consecutive, contents are sane, sizes are identical, same number of parts. I then receive a full set of files. If they're not from the same set, I can't unrar them - though it seems that WinRAR will proceed to 100% before giving me the CRC error (file corrupt).
New Answer
All tests were made using WinRAR 5.01 32-bit. Since the algorithm should remain the same, the following statements should be valid for any previous version. Feel free to comment if you know that's not true.
I'll give a short briefing about the chat. I tried to pack a file larger than 1 GB several times; then I mixed up the files and tried to extract the archives: it worked. So the problem was not the size of the file.
I thought about three possible solutions to the problem:
The architecture influenced the packaging process: different people packed the files, and mixing them up would result in an error;
Different people packed the files, producing slightly different part sizes (for example 250 MB vs 250000 KB). This would have been noticed in the file properties, though;
The files were corrupted during the download: re-downloading them would confirm this hypothesis.
I was most curious about the first one: could the architecture influence the packaging process?
I found out the answer is yes, it can. Here are the steps to repeat the experiment:
Pack your files into an archive, giving a precise part size, on computer A;
Pack the same exact files, giving the same exact part size, on computer B with a different architecture (e.g. an Intel processor vs an AMD processor) (TODO: check whether this experiment still holds with similar architectures, e.g. an Intel i7 vs an Intel i5);
Transfer one part (or more, if you wish, but of course not all of them!) from computer B to computer A. Remember to delete those parts from computer A before the transfer;
Place all the files in the same directory and check that they all have the same name (e.g. "AAA part1", "AAA part2"...);
Extract them;
Enjoy your CRC error!
Tests were made using an Intel i7-3632QM and an AMD FX 6300.
My suspicion is that the compressed data is the same, but the CRC codes are different.
Old Answer
There is a way indeed. During my Computer Science studies we had a Computer Forensics class, where I learned that every file has a static beginning (a header, we could say) that lets a program recognize its type and how to decode it. To see it, you just have to open the file with a text editor (Notepad++ is the best so far, I guess).
For example, JPEG images begin with ÿØÿá.
I tried to store a video in some split .rar files, and finding out whether they are part of the same archive was simpler than I thought.
Every RAR file begins with Rar!. On the second or third line the name of the file stored in the archive should appear: in my case, myVideo.mp4. If all your archives contain that filename, they're probably parts of the same archive.
Things get worse if there are several files in the archive and you don't know their names. In fact, if there is more than one file, the structure of the RAR volumes is as follows:
File 1:
Rar!
NUL NUL NUL //Random things here
NUL NUL NUL NUL NUL myVideo.mp4 NUL NUL NUL NUL
//Random things here. If the dimensions of the file exceed the archive,
//the next file will begin with the same name.
//Let's assume that this is happening.
EOF
File 2:
Rar!
NUL NUL NUL //Random things here
NUL NUL myVideo.mp4 NUL NUL NUL
//This time the file is complete. Since there is still space in the archive,
//it will add another file
NUL NUL NUL NUL mySecondVideo.mp4 NUL NUL NUL NUL
EOF
Let's assume that at the end of the second archive, mySecondVideo hasn't been fully compressed yet.
File 3:
Rar!
NUL NUL NUL
NUL NUL NUL NUL mySecondVideo.mp4 NUL
NUL NUL NUL
NUL myTextFile.txt
NUL NUL NUL mySecondTextFile.txt NUL
EOF
If mySecondTextFile.txt isn't fully compressed yet, my fourth file will begin with its name.
I hope it's clear; I tried to keep it as simple as possible. In the case of more files, I would start from the last archive: write down the first filename found in that file, then search for it in the previous one. If I found that name, I'd repeat the sequence back to the first archive.
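If you want to script the check above rather than eyeball it in an editor, the marker bytes are easy to test. A rough sketch in Python (the part names and the spotted filename are placeholders; the RAR 4.x marker is the seven bytes 52 61 72 21 1A 07 00, i.e. "Rar!" plus 1A 07 00):
SIG_RAR4 = b"Rar!\x1a\x07\x00"         # RAR 4.x marker block
SIG_RAR5 = b"Rar!\x1a\x07\x01\x00"     # RAR 5.x marker block

def head(path, n=4096):                # the first few KB hold the archive headers
    with open(path, "rb") as f:
        return f.read(n)

a = head("name.part1.rar")             # placeholder names
b = head("name.part2.rar")
print("same marker:", a[:8] == b[:8])
name = b"myVideo.mp4"                  # whatever filename you spotted in part 1
print("name in both:", name in a and name in b)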
I'm not that familiar with the RAR format, but in case you decide to write your program in Java I can recommend using 7-Zip-JBinding.
http://sevenzipjbind.sourceforge.net/
http://sevenzipjbind.sourceforge.net/basic_snippets.html#open-multipart-rar-archives
You can download the first n+1 parts of the archive and then call the extract() method, ignoring the output data and only caring about the
IArchiveExtractCallback.setOperationResult(ExtractOperationResult)
calls (checking that the CRC was OK), while monitoring which files get opened through
IArchiveOpenVolumeCallback.getStream(java.lang.String)
If volume n+2 gets requested, you can conclude that volume n+1 was the right one.
(I'm not 100% sure about this conclusion, but I would give it a try.)

Scanning big binary with Erlang

I'd like to scan large (>500 MB) binary files for structs/patterns. I am new to the language and hope that someone can give me a start. The files are actually a database containing segments. A segment starts with a fixed-size header, followed by a fixed-size optional part, followed by the payload/data part of variable length. For a first test I'd just like to log the number of segments in the file. I've already googled for a tutorial but found nothing that helped. I need a hint or a tutorial that is not too far from my use case to get started.
Greets
Stefan
You need to learn about Bit Syntax and Binary Comprehensions. More useful links to follow: http://www.erlang.org/documentation/doc-5.6/doc/programming_examples/bit_syntax.html and http://goto0.cubelogic.org/a/90. You will also need to learn how to process files: reading from them (line by line, chunk by chunk, at given positions in a file, etc.) and writing to them in several ways. The file processing functions are explained in the file module documentation. You can also look at the source code of large-file-processing libraries that ship with Erlang, e.g. Disk Log, Dets and Mnesia. These libraries read and write files heavily, and their source code is open for you to see. I hope that helps.
Here is a synthesized sample problem: I have a binary file (test.txt) that I want to parse, finding every occurrence of the binary pattern <<$a, $b, $c>> in the file.
The content of "test.txt":
I arbitrarily chose the string "abc" as the target for my test; I want to find all the abc's in my test file.
A sample program (lab.erl):
-module(lab).
-compile(export_all).

%% Scan InputFile for every occurrence of the binary BinPattern.
find(BinPattern, InputFile) ->
    BinPatternLength = byte_size(BinPattern),
    {ok, S} = file:open(InputFile, [read, binary, raw]),
    loop(S, BinPattern, 0, BinPatternLength, 0),
    file:close(S),
    io:format("Done!~n", []).

%% Read Length bytes at StartPos, compare them with the pattern,
%% then slide the window forward one byte.
loop(S, BinPattern, StartPos, Length, Acc) ->
    case file:pread(S, StartPos, Length) of
        {ok, Bin} ->
            case Bin of
                BinPattern ->
                    io:format("Found one at position: ~p.~n", [StartPos]),
                    loop(S, BinPattern, StartPos + 1, Length, Acc + 1);
                _ ->
                    loop(S, BinPattern, StartPos + 1, Length, Acc)
            end;
        eof ->
            io:format("I've proudly found ~p matches:)~n", [Acc])
    end.
Run it:
1> c(lab).
{ok,lab}
2> lab:find(<<"abc">>, "./test.txt").
Found one at position: 43.
Found one at position: 103.
I've proudly found 2 matches:)
Done!
ok
Note that the code above is not very efficient (the scanning window shifts one byte at a time) and it is sequential (not utilizing all the "cores" on your computer). It is meant only to get you started.
When your data fits into memory, the best thing you can do is read it in whole with file:read_file/1; if you can't, use the file in raw mode. Then you can parse the data using bit syntax. Written the right way, you can achieve parsing speeds of tens of MB/s when the parsing module is compiled with HiPE. The exact parsing technique depends on the exact segment data format and on how robust/accurate a result you are looking for. For parallel parsing you can take inspiration from Tim Bray's Wide Finder project.
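The one-byte-at-a-time shifting noted above is avoidable in any language by scanning big chunks with a small overlap between them; here is a sketch of the pattern (in Python for brevity; an Erlang version would read chunks with file:read/2 and search them with binary:matches/2):
def find_all(path, pattern, chunk_size=1 << 20):
    # Overlap consecutive chunks by len(pattern) - 1 bytes so a match
    # straddling a chunk boundary is not missed.
    keep = len(pattern) - 1
    hits, offset, tail = [], 0, b""
    with open(path, "rb") as f:
        while True:
            chunk = tail + f.read(chunk_size)
            if len(chunk) < len(pattern):
                return hits            # nothing left that could match
            i = chunk.find(pattern)
            while i != -1:
                hits.append(offset + i)
                i = chunk.find(pattern, i + 1)
            tail = chunk[len(chunk) - keep:] if keep else b""
            offset += len(chunk) - keep

print(find_all("test.txt", b"abc"))    # same matches as the byte-wise version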