Why doesn't 'utf8-c8' encoding work when reading filehandles - raku

I wish to read byte sequences that will not decode as valid UTF-8, specifically byte sequences that correspond to high and low surrogate code points. The result should be a Raku string.
I read that, in Raku, the 'utf8-c8' encoding can be used for this purpose.
Consider code point U+D83F. It is a high surrogate (reserved for the high half of UTF-16 surrogate pairs).
U+D83F has a byte sequence of 0xED 0xA0 0xBF, if encoded as UTF-8.
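The byte sequence above can be cross-checked outside Raku. The following is a minimal Python sketch, not the Raku mechanism itself: it assumes Python's "surrogatepass" error handler as a rough analogue of a lenient encoder, and the variable names are purely illustrative.

```python
# Encode the lone high surrogate U+D83F with a lenient UTF-8 encoder.
# "surrogatepass" is Python's handler for this; it is only an analogy
# and is not the same mechanism as Raku's utf8-c8.
lone_surrogate = "\ud83f"
encoded = lone_surrogate.encode("utf-8", errors="surrogatepass")
print(encoded.hex(" "))  # ed a0 bf

# A strict UTF-8 encoder refuses the same code point.
try:
    lone_surrogate.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
print(strict_ok)  # False
```

The strict failure is the same kind of refusal as the "could not encode Unicode Surrogate codepoint" error quoted later in the question.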
Slurping a file? Works
If I slurp a file containing this byte sequence, using 'utf8-c8' as the encoding, I get the expected result:
echo -n $'\ud83f' >testfile # Create a test file containing the byte sequence
myprog1.raku:
#!/usr/local/bin/raku
$*OUT.encoding('utf8-c8');
print slurp('testfile', enc => 'utf8-c8');
$ ./myprog1.raku | od -An -tx1
ed a0 bf
✔️ expected result
Slurping a filehandle? Doesn't work
But if I switch from slurping a file path to slurping a filehandle, it doesn't work, even though I set the filehandle's encoding to 'utf8-c8':
myprog2.raku
#!/usr/local/bin/raku
$*OUT.encoding('utf8-c8');
my $fh = open "testfile", :r, :enc('utf8-c8');
print slurp($fh, enc => 'utf8-c8');
#print $fh.slurp; # I tried this too: same error
$ ./myprog2.raku
Error encoding UTF-8 string: could not encode Unicode Surrogate codepoint 55359 (0xD83F)
in block <unit> at ./myprog2.raku line 4
Environment
Edit 2022-10-30: I originally used my distro's package (Fedora Linux 36: Rakudo version 2020.07). I just downloaded the latest Rakudo binary release (2022.07-01). Result was the same.
$ /usr/local/bin/raku --version
Welcome to Rakudo™ v2022.07.
Implementing the Raku® Programming Language v6.d.
Built on MoarVM version 2022.07.
$ uname -a
Linux hx90 5.19.16-200.fc36.x86_64 #1 SMP PREEMPT_DYNAMIC Sun Oct 16 22:50:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
$ lsb_release -a
LSB Version: :core-4.1-amd64:core-4.1-noarch
Distributor ID: Fedora
Description: Fedora release 36 (Thirty Six)
Release: 36
Codename: ThirtySix

perldoc: 'ŷ' output as 'X' with "=encoding utf8"

When writing POD documentation, I realized that Unicode character ŷ became X on output.
Input:
=pod
=encoding utf8
=over
=item I<yt> (ŷ(t))
The value predicted for time I<t>.
=back
Output in PuTTY:
Input in Emacs:
The perldoc version in use is that of Perl 5.18.2 (SLES12 SP4, perl-5.18.2-12.20.1.x86_64), with LANG=en_US.UTF-8.
Update:
It seems to be a bug in Perl or in the package of SLES12 SP4: Using the same test on OpenSUSE Leap 15.1 with Perl 5.26.1, the output looks OK:
yt (ŷ(t))
The value predicted for time t.
However, using pod2man from perl-5.26.1-15.87.x86_64 of openSUSE Leap 15.3, the output is not correct.
OTOH, using perldoc, the output is correct.

SHA-256 Hash and base64 encoding

With the string "123456Adrian Wapcaplet" I need to:
1. generate the SHA-256 hash
2. take the least significant 128 bits of the hash
3. represent these bits in base64 encoding
I need help with #1 and #3. Below is how I generate this in Linux
$ echo -n "123456Adrian Wapcaplet" | shasum -a 256 | cut -c33-64 | xxd -r -p | base64
jNWe9TyaqmgxG3N2fgl15w==
Install the pgcrypto extension:
create extension pgcrypto;
Then you can use the digest function to do sha256. The rest of the functions you'll need are built in:
select encode(substring(digest('123456Adrian Wapcaplet', 'sha256') from 17), 'base64');
Note that there's an implicit cast from text to bytea performed here when calling digest. It will use the default encoding for the database. To avoid problems due to environmental differences you can specify the encoding for the conversion:
digest(convert_to('123456Adrian Wapcaplet', 'utf8'), 'sha256')
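For cross-checking the database result from outside PostgreSQL, the same three steps can be sketched in Python. This is an illustration under assumptions, not part of the pgcrypto solution: the function name low128_sha256_b64 is hypothetical, and "least significant 128 bits" is taken to mean the last 16 bytes of the digest, which is what both the shell pipeline and the substring(... from 17) call select.

```python
import base64
import hashlib


def low128_sha256_b64(text: str, encoding: str = "utf-8") -> str:
    """SHA-256 the text, keep the last 16 bytes, return them as base64."""
    digest = hashlib.sha256(text.encode(encoding)).digest()  # 32 bytes
    return base64.b64encode(digest[-16:]).decode("ascii")


result = low128_sha256_b64("123456Adrian Wapcaplet")
print(result)
```

The shell pipeline in the question reports jNWe9TyaqmgxG3N2fgl15w== for this input, which gives a value to compare against. Passing the encoding explicitly mirrors the convert_to call above.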

Why do 7zip and gzip add 0x0A at the end of gzip compressed data

Wikipedia states (wrongly, apparently, at least in the real world) that the gzip format demands that the last 4 bytes are the uncompressed size (mod 4GB).
I have found a credible answer on SO that explains that sometimes there is junk at the end of the gzip data, so you cannot rely on the last 4 bytes being the size.
Unfortunately this matches my experiments (both terminal gzip and the 7zip archiver add a 0x0A byte for my small test example).
My question is: what is the reason for gzip and 7zip doing this?
Obviously they do it because they are written to do that, but I wonder about the motivation to break the format specification.
I know that some formats have padding requirements, but I found nothing for gzip.
edit:process:
echo "Testing rocks:) Debugging sucks :(" >> test_data
rm test_data.gz
gzip -6 test_data
vim -c "noautocmd edit test_data.gz"
In Vim: :%!xxd -c 4
The last 5 bytes are the size (35) followed by 0x0a (23 hex = 35, then 00 00 00 0a).
The 7zip process is just using the GUI to make an archive.
Your testing process is wrong. Vim is what adds 0x0A to the end of the file. Here is a simpler test, using xxd directly (why did you even use Vim?):
echo "Testing rocks:) Debugging sucks :(" >> test_data
gzip -6 test_data
xxd -c 4 test_data.gz
Output:
0000000: 1f8b 0808 ....
0000004: 453c 5d59 E<]Y
0000008: 0003 7465 ..te
000000c: 7374 5f64 st_d
0000010: 6174 6100 ata.
0000014: 0b49 2d2e .I-.
0000018: c9cc 4b57 ..KW
000001c: 28ca 4fce (.O.
0000020: 2eb6 d254 ...T
0000024: 7049 4d2a pIM*
0000028: 4d4f 0789 MO..
000002c: 1497 0245 ...E
0000030: 14ac 34b8 ..4.
0000034: 00f4 a724 ...$
0000038: 5623 0000 V#..
000003c: 00 .
As you can see, there is no 0x0A at the end. I think Vim adds newlines to the end of files by default, if they are not present.
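The trailer can also be checked without any editor in the loop. Below is a minimal Python sketch using the standard gzip module rather than the gzip command-line tool (an assumption of this example). It compresses the same 35-byte test string in memory and reads the final four bytes as the little-endian size field.

```python
import gzip
import struct

# Same content as the test file; echo appends the trailing newline.
data = b"Testing rocks:) Debugging sucks :(\n"
blob = gzip.compress(data, compresslevel=6)

# The last 4 bytes of a gzip member hold the uncompressed length
# modulo 2**32, stored little-endian.
(isize,) = struct.unpack("<I", blob[-4:])
print(len(data), isize)   # 35 35
print(blob[-4:].hex(" "))  # 23 00 00 00
```

The stream ends with 23 00 00 00 and nothing after it, matching the xxd output above.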

How do I get all the information regarding the header of an audio file?

How do I get all the information regarding the header of an audio file, so it can be displayed in a readable format like ASCII values?
The audio file may be of any format, preferably .wav.
EDIT: The OS can be Windows 8.1 or Ubuntu. I actually need to understand all the properties of the file, such as whether it is mono or stereo, its encoding, etc., specifically for a .wav file.
I know C++, so a solution in that language would be preferable.
There is a very powerful command you can use in a bash script: sox.
To get all the info you need about a wav file, you just have to run:
soxi file.wav
and you'll get something like:
Input File : 'file.wav'
Channels : 1
Sample Rate : 8000
Precision : 16-bit
Duration : 00:02:08.40 = 1027236 samples ~ 9630.34 CDDA sectors
File Size : 2.05M
Bit Rate : 128k
Sample Encoding: 16-bit Signed Integer PCM
sox is available for Windows as well, although I have never used it there.
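If a programmatic route is wanted instead of a command-line tool, the same header fields can be read with Python's standard wave module. This is a hedged sketch, not a replacement for the tool above: the wave module is generally limited to uncompressed PCM WAV files, and the in-memory file and the info dictionary keys are made up for illustration.

```python
import io
import wave

# Build a tiny in-memory WAV so the sketch is self-contained; in practice
# a path such as "file.wav" would be passed to wave.open instead.
buf = io.BytesIO()
with wave.open(buf, "wb") as writer:
    writer.setnchannels(1)      # mono
    writer.setsampwidth(2)      # 2 bytes per sample = 16-bit
    writer.setframerate(8000)   # 8000 Hz
    writer.writeframes(b"\x00\x00" * 8000)  # one second of silence

buf.seek(0)
with wave.open(buf, "rb") as reader:
    info = {
        "channels": reader.getnchannels(),
        "sample_rate": reader.getframerate(),
        "precision_bits": reader.getsampwidth() * 8,
        "frames": reader.getnframes(),
        "duration_s": reader.getnframes() / reader.getframerate(),
    }
print(info)
```

The fields correspond to the Channels, Sample Rate, Precision and Duration lines in the listing above.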
You can use the ffprobe utility from FFmpeg: http://ffmpeg.org/
"ffprobe" gathers information from multimedia streams and prints it in human- and machine-readable fashion.
For example it can be used to check the format of the container used by a multimedia stream
and the format and type of each media stream contained in it.
Code
ffprobe -i <file_name>
ffprobe -i myfile.wav
Output will look something like this
Input #0, wav, from 'myfile.wav':
Duration: 00:04:16.88, bitrate: 16 kb/s
Stream #0:0: Audio: g729 ([131][0][0][0] / 0x0083), 8000 Hz, 2 channels, s16p, 16 kb/s
Output Explanation:
g729 is encoding type here.
16 kb/s is the bit-rate.
2 channels
Sample rate 8000 Hz
For detailed information
ffprobe -i <file_name> -show_streams

Weird pcap header of byte sequence 0a 0d 0d 0a created on Mac?

I have a PCAP file that was created on a Mac with mergecap that can be parsed on a Mac with Apple's libpcap but cannot be parsed on a Linux system. The combined file has an extra 16-byte header that contains 0a 0d 0d 0a 78 00 00 00 before the 4d 3c 2b 1a intro that's common in pcap files. Here is a hex dump:
0000000: 0a0d 0d0a 7800 0000 4d3c 2b1a 0100 0000 ....x...M<+.....
0000010: ffff ffff ffff ffff 0100 4700 4669 6c65 ..........G.File
0000020: 2063 7265 6174 6564 2062 7920 6d65 7267 created by merg
0000030: 696e 673a 200a 4669 6c65 313a 2037 2e70 ing: .File1: 7.p
0000040: 6361 7020 0a46 696c 6532 3a20 362e 7063 cap .File2: 6.pc
0000050: 6170 200a 4669 6c65 333a 2034 2e70 6361 ap .File3: 4.pca
0000060: 7020 0a00 0400 0800 6d65 7267 6563 6170 p ......mergecap
Does anybody know what this is? or how I can read it on a Linux system with libpcap?
I have a PCAP file
No, you don't. You have a pcap-ng file.
that can be parsed on a Mac with Apple's libpcap
libpcap 1.1.0 and later can also read some pcap-ng files (the pcap API only allows a file to have one link-layer header type, one snapshot length, and one byte order, so only pcap-ng files where all sections have the same byte order and all interfaces have the same link-layer header type and snapshot length are supported), and OS X Snow Leopard and later have a libpcap based on 1.1.x, so they can read those files.
(OS X Mountain Lion and later have tweaked libpcap to allow it to write pcap-ng files as well; the -P flag makes tcpdump write out pcap-ng files, with text comments attached to some outgoing packets indicating the process ID and process name of the process that sent them - pcap-ng allows text comments to be attached to packets.)
but cannot be parsed on a Linux system
Your Linux system probably has an older libpcap version. (Note: do not be confused by Debian and Debian derivatives calling the libpcap package "libpcap0.8" - they're not still using libpcap 0.8.)
combined file has an extra 16-byte header that contains 0a 0d 0d 0a 78 00 00 00
A pcap-ng file is a sequence of "blocks" that start with a 4-byte block type and a 4-byte length, both in the byte order of the host that wrote them.
They're divided into "sections", each one beginning with a "Section Header Block" (SHB); the block type for the SHB is 0x0a0d0d0a, which is byte-order-independent (so that you don't have to know the byte order to read the SHB) and contains carriage returns and line feeds (so that if the file is, for example, transferred between a UN*X system and a Windows system by a tool that thinks it's transferring a text file and that "helpfully" tries to fix line endings, the SHB magic number will be damaged and it will be obvious that the file was corrupted by being transferred in that fashion; think of it as the equivalent of a shock indicator).
The 0x78000000 is the length; what follows it is the "byte-order magic" field, which is 0x1A2B3C4D (which is not the same as the 0xA1B2C3D4 magic number for pcap files), and which serves the same purposes as the pcap magic number, namely:
it lets code identify that the file is a pcap-ng file
it lets code determine the byte order of the section.
(No, you don't need to know the length before looking for the pcap magic number; once you've found the magic number, you then check the length to make sure it's at least 28 and, if it's less than 28, you reject the block as not being valid.)
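The check described above can be sketched in a few lines of Python, applied to the first 12 bytes of the hex dump in the question. This is an illustrative sketch only, not libpcap's actual code: the function name parse_shb_prefix and the constant names are hypothetical.

```python
import struct

PCAPNG_SHB_TYPE = 0x0A0D0D0A   # same value in either byte order
BYTE_ORDER_MAGIC = 0x1A2B3C4D
MIN_SHB_LENGTH = 28            # smallest valid Section Header Block


def parse_shb_prefix(first12: bytes):
    """Return (byte_order, block_length) for the start of a pcap-ng file."""
    if len(first12) < 12:
        raise ValueError("need at least 12 bytes")
    (block_type,) = struct.unpack("<I", first12[0:4])
    if block_type != PCAPNG_SHB_TYPE:
        raise ValueError("not a pcap-ng Section Header Block")
    # Try both byte orders on the byte-order magic, then read the length
    # using whichever order matched.
    for fmt, name in (("<", "little"), (">", "big")):
        (magic,) = struct.unpack(fmt + "I", first12[8:12])
        if magic == BYTE_ORDER_MAGIC:
            (length,) = struct.unpack(fmt + "I", first12[4:8])
            if length < MIN_SHB_LENGTH:
                raise ValueError("SHB length too small")
            return name, length
    raise ValueError("byte-order magic not found")


header = bytes.fromhex("0a0d0d0a 78000000 4d3c2b1a")
print(parse_shb_prefix(header))  # ('little', 120)
```

The length is only interpreted after the byte-order magic has been matched, which is why the length field does not need to be understood first.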
Does anybody know what this is?
A (little-endian) pcap-ng file.
or how I can read it on a Linux system with libpcap?
Either read it on a Linux system with a newer version of libpcap (which may mean a newer version of whatever distribution you're using, or may just mean doing an update if that will get you a 1.1.0 or later version of libpcap), read it with Wireshark or TShark (which have their own library for reading capture files, which supports the native pcap and pcap-ng formats, as well as a lot of other formats), or download a newer version of libpcap from tcpdump.org, build it, install it, and then build whatever other tools need to read pcap-ng files with that version of libpcap rather than the one that comes with the system.
Newer versions of Wireshark write pcap-ng files by default, including in tools such as mergecap; you can get them to write pcap files with a flag argument of -F pcap.