Why would Linux syscall read() return less than the requested size? - file-io

On success, the read(2) system call returns the number of bytes read.
The man page says:
It is not an error if this number is smaller than the number of bytes requested; this may happen for example because fewer bytes are actually available right now (maybe because we were close to end-of-file, or because we are reading from a pipe, or from a terminal), or because read() was interrupted by a signal.
To make sure you've read the full file, the solution is apparently just to call read() again in a loop, since a return of 0 always means EOF. However, my question is: for what reason, besides receiving a signal, might read() return fewer bytes on a regular file?
For regular files, how would being "close to end-of-file" make fewer bytes get read?
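For reference, the usual fix is a retry loop like the following (a minimal sketch in C; read_full is just an illustrative name): keep calling read() until the requested count is satisfied, treat a return of 0 as EOF, and retry on EINTR.

#include <errno.h>
#include <unistd.h>

/* Read exactly count bytes unless EOF or a real error intervenes.
   Returns the number of bytes read (less than count only at EOF),
   or -1 on error. */
ssize_t read_full(int fd, void *buf, size_t count) {
    size_t done = 0;
    while (done < count) {
        ssize_t n = read(fd, (char *)buf + done, count - done);
        if (n == 0)                 /* end of file */
            break;
        if (n < 0) {
            if (errno == EINTR)     /* interrupted by a signal: retry */
                continue;
            return -1;
        }
        done += (size_t)n;
    }
    return (ssize_t)done;
}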

Related

How to read a binary file with TCL

So I have a function I'm using to read data from a file. It works fine if the file is plain text, but when I try to read a binary file, like a PNG, it returns different data (diff confirms that). I opened a hex editor to see what was wrong and found out that it is inserting some 0xC2 bytes along with the file's bytes (I don't know whether the positions are random or whether there are extra bytes other than this 0xC2 one).
This is my function. I just want it to read and save to a variable.
proc read_file {path} {
    # Open the file and put the channel into binary mode so the I/O
    # layer does no encoding translation on the bytes.
    set channel [open $path r]
    fconfigure $channel -translation binary
    set return_string [read $channel]
    close $channel
    return $return_string
}
To actually print, I'm doing this:
puts -nonewline [read_file file.png]
When you open a file, it defaults to being in text mode. In text mode (which is really a combination of options) the IO layer translates characters from whatever encoding they are in into Tcl's internal encoding, and does the reverse operation on output. The default encoding scheme is platform specific, but in your case it sounds like it is UTF-8. (Tcl uses a complex internal system of encodings; it doesn't expose those to the outside world.)
By contrast, when you put the channel into binary mode, the bytes on the outside are directly mapped to characters in the range 0-255 (and vice versa on output). You get a perfect copy, provided you put both input and output channels in binary mode. (There are other optimisations for binary mode, but they don't matter here.)
When you only put one of the channels in binary mode, you get what looks like corruption. It isn't random though. In particular, when the input is binary but the output is UTF-8, input bytes in the range 128-255 get converted into multiple output bytes, where the first of those bytes is in the sort of range you observed. There are other combinations that mess things up; the whole range of problems is collectively known as mojibake.
tl;dr Don't mix up binary and text data unless you're very careful. The results of getting it wrong are "surprising".
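A minimal fix for the example above, assuming the output really is meant to go to stdout: put the output channel in binary mode too, e.g. run fconfigure stdout -translation binary before the puts, so the bytes pass through untranslated on both sides.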

Managing EOF in Trx Library

I am using the TRX library to process ISO8583 messages. I am receiving raw data with an EOF character at the end, but the last byte is not removed from the buffer, as it's not defined in the packager, and it's causing an issue when parsing the next transaction. How do I manage this?
And while sending a response back, how do I add the EOF character?
Normally, this kind of stuff (protocol characters) is removed before decoding ISO8583 data.
For example, you read 100 bytes from the socket, and it's ISO data + an EOF character. You would remove the EOF character and run the remaining 99 bytes of ISO data through a decoder.
And the reverse is true when you send data: you encode your data first, then add the EOF character. The resulting byte array goes into the socket.
Sorry, I don't know anything about the TRX library, but hopefully this general advice helps you somewhat.
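To make that concrete, here is a hedged C sketch. The terminator value (0x04, ASCII EOT) and the function names are assumptions; the actual byte is defined by your link-level spec, not by ISO8583 itself.

#include <stddef.h>
#include <stdint.h>

#define EOF_BYTE 0x04  /* assumed terminator; use whatever your spec defines */

/* Strip the trailing protocol terminator before handing the buffer to the
   ISO8583 decoder; returns the payload length (e.g. 100 bytes in, 99 out). */
size_t strip_eof(const uint8_t *buf, size_t len) {
    if (len > 0 && buf[len - 1] == EOF_BYTE)
        return len - 1;
    return len;        /* no terminator present */
}

/* Append the terminator after encoding a response. The caller must ensure
   the buffer has room for one extra byte; the result goes to the socket. */
size_t append_eof(uint8_t *buf, size_t len) {
    buf[len] = EOF_BYTE;
    return len + 1;
}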

Erasing flash memory in blocks (1024 bytes)

I am working on making a bootloader. I have to erase 1024 bytes of memory before I write anything to the locations in that block; even if I want to write only 2 bytes, I am forced to erase 1024. My problem is that I don't know where each block starts. For example, let's say I want to write the following bytes to this address.
Address: 0x198F0
Bytes: C80E00010001616FDFECD6F08C8C92EC
When I try to erase 1024 bytes starting from address 0x198F0, I noticed that it starts erasing from 0x19800 instead.
How do I know where each block starts from so I can calculate it in software?
The reason I want to know this is so I can copy the entire block into RAM before I erase it, then modify it, and write it back to the same block. I am using a PIC18F87J11 with the MPLAB XC8 compiler. I hope it's clear what I am trying to do; otherwise let me know in the comments.
Thanks!
The flash memory blocks of the PIC18F87J11 are aligned to 1024 bytes. To calculate the start address of a block, set the last 10 bits of the address to 0, so you can use:
StartAddress = AddressPtr and 0xFFFC00
In your case:
0x198F0 and 0xFFFC00 = 0x19800
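Putting that together, a read-modify-write sketch in C: the flash_read, flash_erase_block, and flash_write names are placeholders for whatever your XC8 project provides (on a PIC18 these would be built on TBLRD/TBLWT), and the sketch assumes the write does not cross a block boundary.

#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 1024u
#define BLOCK_MASK (~(uint32_t)(BLOCK_SIZE - 1))   /* clears the low 10 bits */

/* Placeholder device-specific routines (names assumed). */
extern void flash_read(uint32_t addr, uint8_t *dst, uint32_t len);
extern void flash_erase_block(uint32_t block_start);
extern void flash_write(uint32_t addr, const uint8_t *src, uint32_t len);

/* Preserve the rest of the 1024-byte block while updating len bytes at addr. */
void flash_update(uint32_t addr, const uint8_t *data, uint32_t len) {
    uint8_t ram[BLOCK_SIZE];
    uint32_t start  = addr & BLOCK_MASK;    /* 0x198F0 -> 0x19800 */
    uint32_t offset = addr - start;         /* 0xF0 */

    flash_read(start, ram, BLOCK_SIZE);     /* copy the block into RAM */
    memcpy(ram + offset, data, len);        /* patch in the new bytes  */
    flash_erase_block(start);               /* erase the whole block   */
    flash_write(start, ram, BLOCK_SIZE);    /* write it back           */
}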

Bzip2 block header: 1AY&SY

This is a question about the bzip2 archive format. Any bzip2 archive consists of a file header, one or more blocks, and a tail structure. All blocks should start with "1AY&SY", six bytes that are the BCD-encoded leading digits of π, 0x314159265359. According to the source of bzip2:
/*--
A 6-byte block header, the value chosen arbitrarily
as 0x314159265359 :-). A 32 bit value does not really
give a strong enough guarantee that the value will not
appear by chance in the compressed datastream. Worst-case
probability of this event, for a 900k block, is about
2.0e-3 for 32 bits, 1.0e-5 for 40 bits and 4.0e-8 for 48 bits.
For a compressed file of size 100Gb -- about 100000 blocks --
only a 48-bit marker will do. NB: normal compression/
decompression do *not* rely on these statistical properties.
They are only important when trying to recover blocks from
damaged files.
--*/
The question is: is it true that all bzip2 archives have blocks whose starts are aligned to byte boundaries? I mean all archives created by the reference implementation, the bzip2 1.0.5+ utility.
I think that bzip2 may parse the stream not as a byte stream but as a bit stream (the block contents are Huffman-coded, which is not byte-aligned by design).
So, in other words: is grep -c 1AY&SY greater than (Huffman coding may generate 1AY&SY inside a block) or equal to the count of bzip2 blocks in the file?
BZIP2 looks at a bit stream.
From http://blastedbio.blogspot.com/2011/11/random-access-to-bzip2.html:
Anyway, the important bits are that a BZIP2 file contains one or more
"streams", which are byte aligned, each containing one (zero?) or more
"blocks", which are not byte aligned, followed by an end of stream
marker (the six bytes 0x177245385090 which is the square root of pi as
a binary coded decimal (BCD), a four byte checksum, and empty bits for
byte alignment).
The bzip2 Wikipedia article also alludes to bit-level block alignment (see the File Format section), which is in line with what I remember from school (I had to implement the algorithm...).
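To illustrate the consequence: grep only finds byte-aligned matches, so a recovery tool has to slide a 48-bit window across the file one bit at a time. A minimal C sketch of that scan (count_block_magics is just an illustrative name):

#include <stddef.h>
#include <stdint.h>

#define BLOCK_MAGIC 0x314159265359ULL   /* "1AY&SY" when byte-aligned */
#define MAGIC_BITS  48

size_t count_block_magics(const uint8_t *buf, size_t len) {
    const uint64_t mask = ((uint64_t)1 << MAGIC_BITS) - 1;
    uint64_t window = 0;
    size_t bits = 0, hits = 0;

    for (size_t i = 0; i < len; i++) {
        for (int b = 7; b >= 0; b--) {   /* bzip2 packs bits MSB-first */
            window = ((window << 1) | ((buf[i] >> b) & 1)) & mask;
            if (++bits >= MAGIC_BITS && window == BLOCK_MAGIC)
                hits++;
        }
    }
    return hits;
}

This can still overcount, since the pattern can appear by chance inside compressed data (exactly the risk the bzip2 source comment above quantifies), but unlike a byte-aligned grep it will not miss any block headers.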

What is the meaning of \x00 and \xff in Websockets?

Why do messages going through websockets always start with \x00 and end with \xff, as in \x00Your message\xff?
This documentation might help...
Excerpt from section 1.2:
Data is sent in the form of UTF-8 text. Each frame of data starts
with a 0x00 byte and ends with a 0xFF byte, with the UTF-8 text in
between.
The WebSocket protocol uses this framing so that specifications that
use the WebSocket protocol can expose such connections using an
event-based mechanism instead of requiring users of those
specifications to implement buffering and piecing together of
messages manually.
To close the connection cleanly, a frame consisting of just a 0xFF
byte followed by a 0x00 byte is sent from one peer to ask that the
other peer close the connection.
The protocol is designed to support other frame types in future.
Instead of the 0x00 and 0xFF bytes, other bytes might in future be
defined. Frames denoted by bytes that do not have the high bit set
(0x00 to 0x7F) are treated as a stream of bytes terminated by 0xFF.
Frames denoted by bytes that have the high bit set (0x80 to 0xFF)
have a leading length indicator, which is encoded as a series of
7-bit bytes stored in octets with the 8th bit being set for all but
the last byte. The remainder of the frame is then as much data as
was specified. (The closing handshake contains no data and therefore
has a length byte of 0x00.)
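As a concrete illustration of that legacy framing, a minimal C sketch (send_text_frame is just an illustrative name; a real sender would write to a socket):

#include <stdio.h>
#include <string.h>

/* Frame a UTF-8 message per the old draft: 0x00, the text, then 0xFF. */
int send_text_frame(FILE *out, const char *utf8_msg) {
    size_t len = strlen(utf8_msg);
    if (fputc(0x00, out) == EOF) return -1;
    if (fwrite(utf8_msg, 1, len, out) != len) return -1;
    if (fputc(0xFF, out) == EOF) return -1;
    return fflush(out) == EOF ? -1 : 0;
}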
The working spec has since changed and no longer uses 0x00 and 0xFF as the start and end bytes:
http://tools.ietf.org/id/draft-ietf-hybi-thewebsocketprotocol-04.html
I am not 100% sure about this, but my guess would be that they signify the start and end of the message, since \x00 is a single-byte representation of 0 and \xFF is a single-byte representation of 255.