How exactly does file compression work at a low level using Huffman coding? (in C) - file-io

TL;DR: How does the compression of plaintext using a Huffman code actually work?
I'm currently learning the Huffman coding algorithm and its application to text file compression. I understand that we can store the same data in less space by using an encoding technique (e.g. Huffman coding) that is determined by the frequency distribution of each character in the text file.
In Huffman coding we want the most frequent character in a text file to get the shortest binary representation (variable-length encoding), so the total amount of storage needed for the file is less than with a fixed-length encoding such as ASCII.
However, I still have no idea how to actually implement the compression. What kind of file should I use to store the Huffman-encoded binary representation of the text file? How does the process of compressing the plaintext (probably in .txt format) into a compressed file actually work? Does decompression work the same way as compression, just in the reverse direction?
I've tried using a binary file in C to store the binary representation of a .txt file. As you might expect, the binary file actually became bigger than the original file.
I've read that converting a plain-text file into a compressed file is just a matter of replacing each letter with an appropriate bit string and then handling the possibility of having some extra bits that need to be written. However, I still haven't found any good reference on what a bit string is and how to work with one.
Any reference would be helpful, and any answer with C implementation would be perfect. Thank you.

There is only one kind of file. A sequence of bytes. Each byte has eight bits. For Huffman coding you consider the file to be a sequence of bits as opposed to bytes. You accumulate the bits in a buffer, and when you have bytes you write them out to the file. Something like:
// Write the low "bits" bits of code to stdout. The remaining (higher) bits of
// code must be zero. Call with bits == -1 to flush the final partial byte.
void put_bits(int bits, unsigned code) {
    static int have = 0;        // number of bits currently waiting in the buffer
    static unsigned buf = 0;    // bit buffer, filled from the low end
    if (bits == -1) {
        // flush remaining bits (the unused high bits of the last byte are zero)
        if (have) {
            putchar(buf);
            have = 0;
            buf = 0;
        }
        return;
    }
    buf |= code << have;        // append the new code above the bits already buffered
    have += bits;
    while (have >= 8) {         // write out every completed byte, low byte first
        putchar(buf);
        buf >>= 8;
        have -= 8;
    }
}
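For example, suppose the Huffman tree (hypothetically) assigned one character the 2-bit code 0b10 and another the 4-bit code 0b0111. Compressing those two characters and finishing the stream would then look like the sketch below; the codes are made up for illustration, not produced by any particular tree.
// Emit a 2-bit code, then a 4-bit code, then flush the final partial byte
// (put_bits pads the unused high bits of the last byte with zero bits).
put_bits(2, 0x2);
put_bits(4, 0x7);
put_bits(-1, 0);
Decompression is the same idea in reverse: read the file bit by bit and walk the Huffman tree from the root, taking one branch per bit, until you land on a leaf, which gives you a character. Below is a minimal counterpart to put_bits; get_bit is a hypothetical helper written to match the bit order put_bits uses (low bit of each byte first), not part of any library.
// Return the next bit (0 or 1) from stdin, or -1 at end of file.
// Bits come out of each byte low bit first, matching put_bits() above.
int get_bit(void) {
    static int have = 0;
    static unsigned buf = 0;
    if (have == 0) {
        int c = getchar();
        if (c == EOF)
            return -1;
        buf = (unsigned)c;
        have = 8;
    }
    int bit = buf & 1;
    buf >>= 1;
    have--;
    return bit;
}
Note that the last byte of the compressed file may contain padding bits, so the decoder also needs to know when to stop, for example by storing the original character count in a small header or by encoding a dedicated end-of-stream symbol.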

Related

How to read a binary file with TCL

So I have a function I'm using to read data from a file. It works fine if the file is plain text, but when I try to read a binary file, like a PNG, it returns different text (diff confirms that). I opened a hex editor to see what was wrong and found out it is putting some c2 bytes along with the file (I don't know if the position is random or if there are other bytes besides this c2 one).
This is my function. I just want it to read and save to a variable.
proc read_file {path} {
    set channel [open $path r]
    fconfigure $channel -translation binary
    set return_string "[read $channel]"
    close $channel
    return "$return_string"
}
To actually print, I'm doing this:
puts -nonewline [read_file file.png]
When you open a file, it defaults to being in text mode. In text mode (which is really a combination of options) the IO layer translates characters from whatever encoding they are in into Tcl's internal encoding, and does the reverse operation on output. The default encoding scheme is platform specific, but in your case it sounds like it is UTF-8. (Tcl uses a complex internal system of encodings; it doesn't expose those to the outside world.)
By contrast, when you put the channel into binary mode, the bytes on the outside are directly mapped to characters in the range 0-255 (and vice versa on output). You get a perfect copy, provided you put both input and output channels in binary mode. (There are other optimisations for binary mode, but they don't matter here.)
When you only put one of the channels in binary mode, you get what looks like corruption. It isn't random though. In particular, when the input is binary but the output is UTF-8, input bytes in the range 128-255 get converted into multiple output bytes, where the first of those bytes is in the sort of range you observed. There are other combinations that mess things up; the whole range of problems is collectively known as mojibake.
tl;dr Don't mix up binary and text data unless you're very careful. The results of getting it wrong are "surprising".
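The same pitfall exists in C, the language of the main question above: on some platforms a stream opened in text mode translates line endings (and, historically on Windows, treats a Ctrl-Z byte as end of file), so binary data should always be opened with "rb"/"wb". A minimal sketch, with placeholder file names:
#include <stdio.h>

/* Copy a file byte for byte. Opening both ends in binary mode ("rb"/"wb")
   avoids the newline and end-of-file translation that text mode performs
   on some platforms. The file names here are placeholders. */
int main(void) {
    FILE *in  = fopen("input.bin",  "rb");
    FILE *out = fopen("output.bin", "wb");
    if (in == NULL || out == NULL) {
        perror("fopen");
        return 1;
    }
    int c;
    while ((c = fgetc(in)) != EOF)
        fputc(c, out);
    fclose(in);
    fclose(out);
    return 0;
}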

How do I turn a file into a RAW bitstring?

How do I read a file and turn it into a RAW bit string? For example, I open an image that is 512 KB, it reads the file byte by byte, and it spits out the long bit string that is the file. I would like to apply some functions to the strings, but I can't figure out a way to unpack files consistently.
I imagine what I need is something that reads a file byte by byte with no care for the original file format... As it reads byte by byte, a giant integer-like bit string of the file is created.
I used Python's bit generator and NumPy, which seemed to work well, but the program didn't behave well with actual files. What is the best way to unpack files into 1's and 0's?
How do I read any file and store the contents as an easy-to-read HEX file? Or BIN file? And how do I stop the "open" function from truncating leading 0's?
UGH!
Using Python or Golang, how do I open any file and create an uninterrupted bit string of the contents where every leading zero in a BYTE read is significant?
After looking around and asking everyone I'm acquainted with, I found my answer. The best way to turn any file into a raw HEX string is:
# Read the file as raw bytes and convert to a hex string
# (two hex digits per byte, so leading zeros within each byte are preserved).
with open("file_name", "rb") as f:
    content = f.read().hex()
with open("File HEX bitstream.txt", "w") as text_file:
    print("HEX Bitstream Import", content, file=text_file)
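To get actual 1's and 0's rather than hex, and to tie this back to the C question above, here is a small sketch in C that prints a file as one continuous bit string, eight characters per byte, most significant bit first, so every leading zero in a byte is kept. The file name is a placeholder.
#include <stdio.h>

/* Print the contents of a file as an uninterrupted string of '0'/'1'
   characters, eight per byte, most significant bit first. */
int main(void) {
    FILE *fp = fopen("file_name", "rb");   /* binary mode: no translation */
    if (fp == NULL) {
        perror("fopen");
        return 1;
    }
    int c;
    while ((c = fgetc(fp)) != EOF) {
        for (int i = 7; i >= 0; i--)
            putchar(((c >> i) & 1) ? '1' : '0');
    }
    fclose(fp);
    putchar('\n');
    return 0;
}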

Null char returning from reading a file in Common Lisp

I’m reading files and storing them as a string using this function:
(defun file-to-str (path)
  (with-open-file (stream path) :external-format 'utf-8
    (let ((data (make-string (file-length stream))))
      (read-sequence data stream)
      data)))
If the file has only ASCII characters, I get the content of the files as expected; but if there are characters beyond 127, I get a null character (^#) at the end of the string for each such character beyond 127. So, after $ echo "~a^?" > ~/teste I get
CL-USER> (file-to-string "~/teste")
"~a^?
"
; but after echo "aaa§§§" > ~/teste , the REPL gives me
CL-USER> (file-to-string "~/teste")
"aaa§§§
^#^#^#"
and so forth. How can I fix this? I’m using SBCL 1.4.0 in an utf-8 locale.
First of all, your keyword argument :external-format is misplaced and has no effect. It should be inside the parentheses with stream and path. However, this makes no difference to the end result, as UTF-8 is the default encoding.
The problem here is that in UTF-8 encoding, different characters take different numbers of bytes to encode. ASCII characters all encode into single bytes, but other characters take 2-4 bytes. You are currently allocating, in your string, room for every byte of the input file, not for every character in it. The unused characters at the end are never written to; make-string initializes them as ^#.
The read-sequence function returns the index of the first element not changed by the function. You are currently just discarding this information, but you should use it to trim your buffer once you know how many elements have actually been used:
(defun file-to-str (path)
  (with-open-file (stream path :external-format :utf-8)
    (let* ((data (make-string (file-length stream)))
           (used (read-sequence data stream)))
      (subseq data 0 used))))
This is safe, as the length of the file is always greater than or equal to the number of UTF-8 characters encoded in it. However, it is not terribly efficient, as it allocates an unnecessarily large buffer and then copies the whole output into a new string just to return the data.
While this is fine as a learning experiment, for real-world use cases I recommend the Alexandria utility library, which has a ready-made function for this:
* (ql:quickload "alexandria")
To load "alexandria":
Load 1 ASDF system:
alexandria
; Loading "alexandria"
* (alexandria:read-file-into-string "~/teste")
"aaa§§§
"
*

Bzip2 block header: 1AY&SY

This is a question about the bzip2 archive format. Any bzip2 archive consists of a file header, one or more blocks, and a tail structure. Every block should start with "1AY&SY", 6 bytes holding the BCD-encoded digits of pi: 0x314159265359. According to the source of bzip2:
/*--
A 6-byte block header, the value chosen arbitrarily
as 0x314159265359 :-). A 32 bit value does not really
give a strong enough guarantee that the value will not
appear by chance in the compressed datastream. Worst-case
probability of this event, for a 900k block, is about
2.0e-3 for 32 bits, 1.0e-5 for 40 bits and 4.0e-8 for 48 bits.
For a compressed file of size 100Gb -- about 100000 blocks --
only a 48-bit marker will do. NB: normal compression/
decompression do *not* rely on these statistical properties.
They are only important when trying to recover blocks from
damaged files.
--*/
The question is: is it true that all bzip2 archives will have blocks whose start is aligned to a byte boundary? I mean all archives created by the reference implementation of bzip2, the bzip2-1.0.5+ utility.
I think that bzip2 may treat the stream not as a byte stream but as a bit stream (the block itself is Huffman-encoded, which is not byte-aligned by design).
So, in other words: is grep -c 1AY&SY greater than (Huffman coding may generate 1AY&SY inside a block) or equal to the count of bzip2 blocks in the file?
BZIP2 looks at a bit stream.
From http://blastedbio.blogspot.com/2011/11/random-access-to-bzip2.html:
Anyway, the important bits are that a BZIP2 file contains one or more
"streams", which are byte aligned, each containing one (zero?) or more
"blocks", which are not byte aligned, followed by an end of stream
marker (the six bytes 0x177245385090 which is the square root of pi as
a binary coded decimal (BCD), a four byte checksum, and empty bits for
byte alignment).
The bzip2 Wikipedia article also alludes to bit-level block alignment (see the File Format section), which is in line with what I remember from school (I had to implement the algorithm...).
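Because blocks are not byte-aligned, a byte-oriented search such as grep -c 1AY&SY can undercount them: a block header that starts in the middle of a byte never matches the literal six-byte string. Here is a small sketch in C of a bit-level scan (reading from stdin; purely illustrative, not how bzip2 itself locates blocks, and it can also count chance occurrences of the pattern inside compressed data):
#include <stdio.h>
#include <stdint.h>

/* Count occurrences of the 48-bit bzip2 block magic 0x314159265359 at ANY
   bit offset in the input, not just at byte boundaries. bzip2 writes its
   bit stream most significant bit first, so that is the order used here. */
int main(void) {
    const uint64_t magic = 0x314159265359ULL;
    const uint64_t mask  = (1ULL << 48) - 1;
    uint64_t window = 0;
    long count = 0;
    int c;
    while ((c = getchar()) != EOF) {
        for (int i = 7; i >= 0; i--) {
            window = ((window << 1) | (uint64_t)((c >> i) & 1)) & mask;
            if (window == magic)
                count++;
        }
    }
    printf("%ld candidate block headers\n", count);
    return 0;
}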

How-to emit and parse raw binary data using yaml-cpp

Is it possible to emit and read (parse) binary data (an image, a file, etc.)?
Like what is shown here:
http://yaml.org/type/binary.html
How can I do this in yaml-cpp?
As of revision 425, yes! (for emitting)
YAML::Emitter emitter;
emitter << YAML::Binary("Hello, World!", 13);
std::cout << emitter.c_str();
outputs
--- !!binary "SGVsbG8sIFdvcmxkIQ=="
The syntax is
YAML::Binary(const char *bytes, std::size_t size);
I wasn't sure how to pass the byte array: char isn't necessarily one byte, so I'm not sure how portable the algorithm is. What format is your byte array typically in?
(The problem is that uint8_t isn't standard C++ yet, so I'm a little worried about using it.)
As for parsing, yaml-cpp will certainly parse the data as a string, but there's no decoding algorithm yet.
Here is how to read/parse binary data from a YAML file with the yaml-cpp library.
This answer assumes that you are able to load a YAML::Node object from a YAML file, as explained in the yaml-cpp tutorial: https://github.com/jbeder/yaml-cpp/wiki/Tutorial.
The code to parse binary data from a yaml node is:
YAML::Binary binary = node.as<YAML::Binary>();
const unsigned char * data = binary.data();
std::size_t size = binary.size();
Then you have an array of bytes "data" with a known size "size".