Pseudocode for defining binary file - structure

For describing how an algorithm works you can use pseudocode. What would you use to define a structure of binary file? For instance, I have a file that has the following structure:
(n) - int32 containing number of files
then, repeated n times:
(len) - int32 containing length of file name
(name) - len bytes containing file name
(size) - file size
(offset) - offset from file start from which starts file contents
Which way is the best to describe file structure like this?

Related

How do I turn a file into a RAW bitstring?

How do I read a file and turn it to a RAW bit string? For example I open an image that is 512kb, It reads the file byte by byte, and it spits out the long bit string that is the file? I would like to apply some functions to the strings but I can't figure a way to unpack files consistently.
I imagine what I need is something that reads a file byte by byte with no care of the original file format... As it reads byte by byte, a giant integer like thing file bit string is created.
I used a Python's bit generator and NumPy, that seemed to work well, but The program didn't behave well with actual files. What is the best way to unpack files into 1's and 0's?
How do I read any file and store the contents as an easy to read HEX file? or BIN file? And how do I stop the "open" function from truncating leading 0's!
UGH!
Using Python or GOLANG, how do I open any file and create an uninterrupted bit string of the contents where every leading zero in a BYTE read is significant?
After looking and asking everyone I'm acquainted to I found my answer. The best way to turn any file into a RAW HEX string is by
f = open("file_name", "rb")
content = f.read().hex()
with open("File HEX bitstream.txt", "w") as text_file:
print(f"HEX Bitstream Import", content, file=text_file)
f.close()

Null char returning from reading a file in Common Lisp

I’m reading files and storing them as a string using this function:
(defun file-to-str (path)
(with-open-file (stream path) :external-format 'utf-8
(let ((data (make-string (file-length stream))))
(read-sequence data stream)
data)))
If the file has only ASCII characters, I get the content of the files as expected; but if there are characters beyond 127, I get a null character (^#), at the end of the string, for each such character beyond 127. So, after $ echo "~a^?" > ~/teste I get
CL-USER> (file-to-string "~/teste")
"~a^?
"
; but after echo "aaa§§§" > ~/teste , the REPL gives me
CL-USER> (file-to-string "~/teste")
"aaa§§§
^#^#^#"
and so forth. How can I fix this? I’m using SBCL 1.4.0 in an utf-8 locale.
First of all, your keyword argument :external-format is misplaced and has no effect. It should be inside the parenteses with stream and path. However, this has no effect to the end result, as UTF-8 is the default encoding.
The problem here is that in UTF-8 encoding, it takes a different number of bytes to encode different characters. ASCII characters all encode into single bytes, but other characters take 2-4 bytes. You are now allocating, in your string, data for every byte of the input file, not every character in it. The unused characters end up unchanged; make-string initializes them as ^#.
The (read-sequence) function returns the index of the first element not changed by the function. You are currently just discarding this information, but you should use it to resize your buffer after you know how many elements have been used:
(defun file-to-str (path)
(with-open-file (stream path :external-format :utf-8)
(let* ((data (make-string (file-length stream)))
(used (read-sequence data stream)))
(subseq data 0 used))))
This is safe, as length of the file is always greater or equal to the number of UTF-8 characters encoded in it. However, it is not terribly efficient, as it allocates an unnecessarily large buffer, and finally copies the whole output into a new string for returning the data.
While this is fine for a learning experiment, for real-world use cases I recommend the Alexandria utility library that has a ready-made function for this:
* (ql:quickload "alexandria")
To load "alexandria":
Load 1 ASDF system:
alexandria
; Loading "alexandria"
* (alexandria:read-file-into-string "~/teste")
"aaa§§§
"
*

Why is a text file generated by a SQL job double the size of a manually created one?

I have a SQL job running that generates data on a daily basis and amends a text file with this data.
If I copy the data out from the SQL generated text file (currently 24kb) into a new text file, why is it that this new text file is 12kb?
I'm trying to maniplate the file using a batch script and it will only work if I copy out the data from the SQL generated into a new file (the smaller one) which is why I came across it.
When a text file is unexpectedly double in size (i.e. it contains 1000 characters but occupies 2000 bytes on disk), it's usually because it's encoded as some full 2-bytes-per-character unicode represntation rather than ascii, or a unicode scheme that aims to save bytes (like UTF8) by encoding characters using a mix of byte ranges (e.g. UTF8 is 1 to 4 bytes, with the aim that most characters will be 1 byte)
If you want to see more, open both files in a hex editor; it will make instantly clear where all the surplus bytes have gone in the double size file

Internal structure of PDF file: decode params

What the next parametres of decoding does mean?
<</DecodeParms<</Columns 4/Predictor 12>>/Filter/FlateDecode/ID[<4DC888EB77E2D649AEBD54CA55A09C54><227DCAC2C364E84A9778262D41602AD4>]/Info 37 0 R/Length 69/Root 39 0 R/Size 38/Type/XRef/W[1 2 1]>>
I know, that Filter/FlateDecode -- it's filter, which was used to compress the stream. But what are ID, Info, Length, Root, Size? Are these parametres realeted with compression/decompression?
Please consult ISO-32000-1:
You are showing the dictionary of a compressed cross reference table (/Type/XRef):
7.5.8 Cross-Reference Streams
Cross-reference streams are stream objects, and contain a dictionary and a data stream.
Flatedecode: the way the stream is compressed.
Length: This is the number of bytes in the stream. Your PDF is at least a PDF 1.5 file and it has a compressed xref table.
DecodeParms: contains information about the way the stream is encoded.
A Cross-reference stream has some typical dictionary entries:
W: An array of integers representing the size of the fields in a single cross-reference entry. In your case [1 2 1].
Size: The number one greater than the highest object number used in this section or in any section for which this shall be an update. It shall be equivalent to the Size entry in a trailer dictionary.
I also see some entries that belong in the /Root dictionary (aka Catalog) of a PDF file:
14.4 File Identifiers
File identifiers shall be defined by the optional ID entry in a PDF
file’s trailer dictionary. The ID entry is optional but should be
used. The value of this entry shall be an array of two byte strings.
The first byte string shall be a permanent identifier based on the
contents of the file at the time it was originally created and shall
not change when the file is incrementally updated. The second byte
string shall be a changing identifier based on the file’s contents at
the time it was last updated. When a file is first written, both
identifiers shall be set to the same value.
14.3.3 Document Information Dictionary
What you see is a reference to another indirectory object that is a dictionary called the Info dictionary:
The optional Info entry in the trailer of a PDF file shall hold a
document information dictionary containing metadata for the document.
Note: this question isn't really suited for StackOverflow. StackOverflow is a forum where you can post programming problems. Your question isn't a programming problem. You are merely asking us to copy/paste quotes from ISO-32000-1.

Reading file line by line in Lua

As per Lua documentation, file:read("*l") reads next line skipping end of line.
Note:- "*l": reads the next line skipping the end of line, returning nil on end of file. This is the default format
Is this documentation right? Because file:read("*l") reads the current line,instead of next line or my understanding is wrong? Pretty confusing...
Lua manages files using the same model of the underlying C implementation (this model is used also by other programming languages and it is fairly common). If you are not familiar with this way of looking at files, the terminology could be unclear, indeed.
In this model a file is represented as a stream of bytes having a so called current position. The current position is a sort of conceptual pointer to the first byte in the file that will be read or written by the next I/O operation. When you open a file for reading, a new stream is set-up so that its current position is the beginning of the file, i.e. the current position "points" to the first byte in the file.
In Lua you manage streams through so-called file handles, which are a sort of intermediaries for the underlying streams. Any operation you perform using the handle is carried over to the corresponding stream.
Lua io.open opens a file, associates a C stream with it and returns a file handle that represents that stream:
local file_handle = io.open( "myfile.txt" ) -- file opened for reading
Therefore, if you perform any operation that reads some bytes (usually interpreted as characters, if you work with text files) those are read from the stream and for each byte read the current position of the stream advances by one, pointing each time to the next byte to be read.
Lua documentation implies this model. Thus when it says next line, it means that the input operation will read all characters in the stream starting from the current position until an end-of-line character is found.
Note that if you look at text files as a sequence of lines you could be misled, since you could think of a "current line" and a "next line". That would be an higher level model compared to the C model. There is no "current line" in C. In C text files are nothing more than a sequence of bytes where some special characters (end-of-line characters) undergo some special treatment (which is mostly implementation-dependent) and are used by some C standard functions as line terminators, i.e. as marks to detect when stop reading characters.
Another source of confusion for newbies or people coming from higher level languages is that in C, for an historical accident, bytes are handled as characters (the basic data type to handle single bytes is char, which is the smallest numeric type in C!). Therefore for people with a C background it is natural to think of bytes as characters and vice versa.
Although Lua is a much higher level language than C, its close relationship with C (it was designed to be easily interfaced with C code) makes it inherit part of this C "bytes-as-characters" approach. In fact, for example, Lua strings can hold arbitrary bytes and can be used to process raw binary data.
Like Lorenso said above, read starts at the current file position and reads from that position some portion of the file. How much of the file it reads depends on read instruction. For reference, in Lua 5.3:
"*all" : reads to the end of the file
"*line" : reads from the current position to the end of the line.
The end of the line is marked by a special character usually denoted
LfCr (Line feed, carriage return )
"*number" : reads a number, that is, it will read up to the end of what
it recognizes in the text as a number, stopping at, for example, a
comma ",".
num : reads a string with up to num characters
Here's an example that reads a file with a list of numbers into an array (a table), then returns the array. (Just change the "*number" to "*line" and it would read a file line by line):
function read_array(file)
local arr = {}
local handle = assert( io.open(file,"r") )
local value = handle:read("*number")
while value do
table.insert( arr, value )
value = handle:read("*number")
end
handle:close()
return arr
end