Is it possible to emit and read (parse) binary data (images, files, etc.)?
Like what is shown here:
http://yaml.org/type/binary.html
How can I do this in yaml-cpp?
As of revision 425, yes! (for emitting)
YAML::Emitter emitter;
emitter << YAML::Binary("Hello, World!", 13);
std::cout << emitter.c_str();
outputs
--- !!binary "SGVsbG8sIFdvcmxkIQ=="
The syntax is
YAML::Binary(const char *bytes, std::size_t size);
I wasn't sure how to pass the byte array: char isn't necessarily eight bits, so I'm not sure how portable the algorithm is. What format is your byte array typically in?
(The problem is that uint8_t isn't standard C++ yet, so I'm a little worried about using it.)
As for parsing, yaml-cpp will certainly parse the data as a string, but there's no decoding algorithm yet.
Here is how to read/parse binary data from a YAML file with the yaml-cpp library.
This answer assumes that you are able to load a YAML::Node object from a YAML file, as explained in the yaml-cpp tutorial: https://github.com/jbeder/yaml-cpp/wiki/Tutorial
The code to parse binary data from a YAML node is:
YAML::Binary binary = node.as<YAML::Binary>();
const unsigned char * data = binary.data();
std::size_t size = binary.size();
Then you have an array of bytes "data" with a known size "size".
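Putting it together, a minimal sketch (the file name "config.yaml" and the key "blob" are made-up placeholders for your own):
#include <yaml-cpp/yaml.h>
#include <cstdio>

int main() {
    // Load the document, then decode the !!binary scalar stored under "blob".
    YAML::Node node = YAML::LoadFile("config.yaml");
    YAML::Binary binary = node["blob"].as<YAML::Binary>();
    const unsigned char * data = binary.data();
    std::size_t size = binary.size();
    std::printf("decoded %zu bytes\n", size);
    return 0;
}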
So I have a function I'm using to read data from a file. It works fine if the file is plain text, but when I try to read a binary file, like a PNG, it returns different text (diff confirms that). I opened a hex editor to see what was wrong and found out it is putting some 0xC2 bytes along with the file (I don't know if the position is random or if there are other bytes besides this 0xC2 one).
This is my function. I just want it to read and save to a variable.
proc read_file {path} {
    set channel [open $path r]
    fconfigure $channel -translation binary
    set return_string [read $channel]
    close $channel
    return $return_string
}
To actually print, I'm doing this:
puts -nonewline [read_file file.png]
When you open a file, it defaults to being in text mode. In text mode (which is really a combination of options) the IO layer translates characters from whatever encoding they are in into Tcl's internal encoding, and does the reverse operation on output. The default encoding scheme is platform-specific, but in your case it sounds like it is UTF-8. (Tcl uses a complex internal system of encodings; it doesn't expose those to the outside world.)
By contrast, when you put the channel into binary mode, the bytes on the outside are directly mapped to characters in the range 0-255 (and vice versa on output). You get a perfect copy, provided you put both input and output channels in binary mode. (There are other optimisations for binary mode, but they don't matter here.)
When you only put one of the channels in binary mode, you get what looks like corruption. It isn't random though. In particular, when the input is binary but the output is UTF-8, input bytes in the range 128-255 get converted into multiple output bytes, where the first of those bytes is in the sort of range you observed. There are other combinations that mess things up; the whole range of problems is collectively known as mojibake.
tl;dr Don't mix up binary and text data unless you're very careful. The results of getting it wrong are "surprising".
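A minimal fix for the example above, then, is to put the output channel (stdout here) into binary mode as well before printing:
fconfigure stdout -translation binary
puts -nonewline [read_file file.png]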
I'm trying to save a string as a file (PDF), but the encoding isn't what it seems.
I'm wondering if anyone knows what char encoding or converting is required for the following snippet?
%PDF-1.3
%����
1 0 obj
<</Title (Faktura 0)/Producer (ComponentOne C1Pdf)/CreationDate (D:20210122122339+01'00')/ModDate (D:20210122122339+01'00')>>
endobj
2 0 obj
<</Length 1789/Filter /FlateDecode>>stream
x��w6PH/��2P\00� w=S�r ����LM|�c��^.css=K##3V(J��
...more pdf code.
PDF is a binary format, and reading binary data into a string can damage the data, depending on the encoding assumed during that read.
Thus, if possible, don't retrieve the data as a string but instead as a byte array or byte buffer and store these bytes as they are.
As you confirmed in a comment,
I just received a array buffer and then wrote to file. It's working now!
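For illustration, a sketch of that approach in Node.js (the URL and file name are invented placeholders); the point is to stay in bytes from response to disk and never decode to a string:
import { writeFile } from "node:fs/promises";

// Hypothetical endpoint serving the generated PDF.
const response = await fetch("https://example.com/invoice.pdf");
const bytes = Buffer.from(await response.arrayBuffer()); // raw bytes, no text decoding
await writeFile("invoice.pdf", bytes);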
TL;DR: How does the compression of plaintext using a Huffman code actually work?
I'm currently learning the Huffman coding algorithm and its application to text file compression. I understand that we could store the same data with less size by using an encoding technique (e.g. Huffman coding) which is determined by the frequency distribution of each character in the text file.
In Huffman coding we want the most frequent character in a text file to get the shortest binary representation (variable-length encoding), so in total the amount of storage needed for the file is smaller than with a fixed-length encoding such as ASCII.
However, I still have no idea how to actually implement the compression. What kind of file should I use to store the Huffman-encoded binary representation of the text file? How does the process of compressing the plaintext (probably in .txt format) into a compressed file actually work? Does decompression also work the same way as compression, just in the reverse direction?
I've tried using a binary file in C to store the binary representation of a .txt file. As you might expect, the binary file actually became bigger than the original file.
I've read that converting a plain-text file into a compressed file is just a matter of replacing each letter with an appropriate bit string and then handling the possibility of having some extra bits that need to be written. However, I still haven't found any good reference on what a bit string is and how to work with it.
Any reference would be helpful, and any answer with C implementation would be perfect. Thank you.
There is only one kind of file. A sequence of bytes. Each byte has eight bits. For Huffman coding you consider the file to be a sequence of bits as opposed to bytes. You accumulate the bits in a buffer, and when you have bytes you write them out to the file. Something like:
// Write the low `bits` bits of code to stdout. The remaining bits of code
// must be zero. Call with bits == -1 to flush the final partial byte.
void put_bits(int bits, unsigned code) {
    static int have = 0;        // number of bits currently buffered
    static unsigned buf = 0;    // bit buffer, filled from the low end
    if (bits == -1) {
        // flush remaining bits
        if (have) {
            putchar(buf);
            have = 0;
            buf = 0;
        }
        return;
    }
    buf |= code << have;
    have += bits;
    while (have >= 8) {
        putchar(buf);           // emit the low eight bits
        buf >>= 8;
        have -= 8;
    }
}
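For example, a caller could emit a few made-up variable-length codes and then flush:
put_bits(3, 0x5);    /* 3-bit code 101, emitted low bit first */
put_bits(2, 0x2);    /* 2-bit code 10 */
put_bits(11, 0x4af); /* codes longer than a byte work too */
put_bits(-1, 0);     /* flush: zero-pads the high bits of the last byte */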
I have a binary file that I would like to read with Fortran. The problem is that it was not written by Fortran, so it doesn't have the record length indicators. So the usual unformatted Fortran read won't work.
I had a thought that I could be sneaky and read the file as a formatted file, byte-by-byte (or 4 bytes by 4 bytes, really) into a character array and then convert the contents of the characters into integers and floats via the transfer function or the dreaded equivalence statement. But this doesn't work: I try to read 4 bytes at a time and, according to the POS output from the inquire statement, the read skips over like 6000 bytes or so, and the character array gets loaded with junk.
So that's a no-go. Is there some detail in this approach I am forgetting? Or is there just a fundamentally different and better way to do this in Fortran? (BTW, I also tried reading into an integer*1 array and a byte array. Even though that code would compile, it crashed when it came to the read statement.)
Yes.
Fortran 2003 introduced stream access into the language. Prior to this most processors supported something equivalent as an extension, perhaps called "binary" or similar.
Unformatted stream access imposes no record structure on the file. As an example, to read data from the file that corresponds to a single int in the companion C processor (if any) for a particular Fortran processor:
USE, INTRINSIC :: ISO_C_BINDING, ONLY: C_INT
INTEGER, PARAMETER :: unit = 10
CHARACTER(*), PARAMETER :: filename = 'name of your file'
INTEGER(C_INT) :: data
!***
OPEN(unit, FILE=filename, ACCESS='STREAM', FORM='UNFORMATTED')
READ (unit) data
CLOSE(unit)
PRINT "('data was ',I0)", data
You may still have issues with endianness and data type size, but those aspects are language independent.
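Stream access also lets you position a read at a particular byte with the POS= specifier (positions are 1-based). A sketch, assuming a hypothetical 16-byte header followed by C floats:
USE, INTRINSIC :: ISO_C_BINDING, ONLY: C_FLOAT
INTEGER, PARAMETER :: unit = 10
REAL(C_FLOAT) :: values(100)
!***
OPEN(unit, FILE='name of your file', ACCESS='STREAM', FORM='UNFORMATTED')
READ (unit, POS=17) values  ! byte 17 is the first byte after a 16-byte header
CLOSE(unit)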
If you are writing to a language standard prior to Fortran 2003 then unformatted direct access reading into a suitable integer variable may work - it is Fortran processor specific but works for many of the current processors.
I need to use something like NSLog but without the timestamp and newline character, so I'm using printf. How can I use this with NSString?
You can convert an NSString into a UTF8 string by calling the UTF8String method:
printf("%s", [string UTF8String]);
// Public method that accepts a string argument
- (void)sayThis:(NSString *)aString
{
    printf("%s", [aString cString]);
}
According to NSString.h (the HTML version), the UTF8String method is only available on Mac OS X (see below).
All the other methods I looked at are marked 'Availability: OpenStep'.
There are further methods that will return regular char* strings, but they might throw character-conversion exceptions.
NOTE: The string pointers point to memory that might go away, so you have to copy the strings if you want to keep the string contents; immediate printing should be fine, though.
There are also methods that will return an encoded string, and a method to test whether the encoding you want will work (I think), so you can check that your required encoding will work and then request a string that has been encoded as required.
From reading through the .h file itself, there are many encodings and translations between encodings.
These are managed using enumerations, so you can pass the type of encoding you want as an argument.
On Linux etc. do:
locate NSString.h
** Note: this found the HTML doc file as well.
Otherwise do:
find /usr -name NSString.h
NOTE: Your mileage may vary :)
Thanks.
From the NSString.h HTML doc file:
cString
- (const char*) cString;
Availability: OpenStep
Returns a pointer to a null terminated string of 8-bit characters in the default encoding. The memory pointed to is not owned by the caller, so the caller must copy its contents to keep it. Raises an NSCharacterConversionException if loss of information would occur during conversion. (See -canBeConvertedToEncoding:.)
cStringLength
- (NSUInteger) cStringLength;
Availability: OpenStep
Returns length of a version of this unicode string converted to bytes using the default C string encoding. If the conversion would result in information loss, the results are unpredictable. Check -canBeConvertedToEncoding: first.
cStringUsingEncoding:
- (const char*) cStringUsingEncoding: (NSStringEncoding)encoding;
Availability: MacOS-X 10.4.0, Base 1.2.0
Returns a pointer to a null terminated string of characters in the specified encoding.
NB. Under GNUstep you can use this to obtain a null terminated UTF-16 string (sixteen-bit characters) as well as eight-bit strings.
The memory pointed to is not owned by the caller, so the caller must copy its contents to keep it.
Raises an NSCharacterConversionException if loss of information would occur during conversion.
canBeConvertedToEncoding:
- (BOOL) canBeConvertedToEncoding: (NSStringEncoding)encoding;
Availability: OpenStep
Returns whether this string can be converted to the given string encoding without information loss.
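Putting the last two together, a small sketch (the string and target encoding are chosen arbitrarily): check the conversion first, then print the encoded C string:
NSString *s = @"Hej världen";
NSStringEncoding enc = NSISOLatin1StringEncoding;
if ([s canBeConvertedToEncoding: enc]) {
    printf("%s\n", [s cStringUsingEncoding: enc]);
}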