non-matching XRef stream entry size


I'm trying to read an XRef stream object but something doesn't add up. This is my object:
<<
/DecodeParms << /Columns 5/Predictor 12 >>
/Filter /FlateDecode
/ID [<9597C618BC90AFA4A078CA72B2DD061C><48726007F483D547A8BEFF6E9CDA072F>]
/Index [124332 848]
/Info 124331 0 R
/Length 137
/Prev 8983958
/Root 124333 0 R
/Size 125180
/Type /XRef
/W [1 3 1]
>>
I read the 137 bytes of stream, uncompress them through zlib and I get 5088 bytes. This is the beginning of the uncompressed stream (hexdump -C output):
00000000 02 01 00 00 10 00 02 00 00 27 ec 00 02 00 00 01 |.........'......|
00000010 f4 00 02 00 00 01 f7 00 02 00 00 04 5b 00 02 00 |............[...|
00000020 00 02 68 00 02 00 00 0b ac 00 02 00 00 0f e5 00 |..h.............|
00000030 02 00 00 0e 93 00 02 00 00 0d 14 00 02 00 00 0d |................|
What I don't understand is that I should have 5 bytes per entry: /W [1 3 1] means 1+3+1=5 bytes, but the stream's length of 5088 isn't divisible by 5. I also noticed that 5088 is divisible by 6: 5088/6=848, which is exactly the number of entries given by the second value of the /Index key. Reading the stream with the [1 3 1] scheme also fails already at the second entry (the byte 0xEC isn't a valid entry type).
Where's my mistake?
Thanks a lot for any help.

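After decompression each row has 6 bytes: 1 byte for the predictor tag and 5 bytes of 'predicted' data (/Columns 5), which is why 5088 = 848 * 6.
After you undo the predictor (/Predictor 12 is the PNG "Up" filter) you get the actual 5-byte entries described by /W [1 3 1].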
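A minimal sketch of that step in Python, assuming the only filter tag that occurs is 2 (PNG "Up"), which is what every row in the dump above starts with. The function name undo_png_up is made up for illustration; the sample bytes are the first three 6-byte rows of the hexdump in the question.

```python
def undo_png_up(data: bytes, columns: int) -> list[bytes]:
    """Undo the PNG 'Up' filter on rows of 1 tag byte + `columns` data bytes."""
    row_len = columns + 1
    if len(data) % row_len:
        raise ValueError("stream length is not a multiple of columns + 1")
    prev = bytes(columns)  # the row "above" the first row counts as all zeros
    rows = []
    for i in range(0, len(data), row_len):
        tag, delta = data[i], data[i + 1:i + row_len]
        if tag != 2:
            raise ValueError(f"this sketch only handles tag 2 (Up), got {tag}")
        # Up: every byte is stored as the difference to the byte above it (mod 256)
        prev = bytes((d + p) & 0xFF for d, p in zip(delta, prev))
        rows.append(prev)
    return rows


# First three rows of the decompressed stream from the question
sample = bytes.fromhex("020100001000" "02000027ec00" "02000001f400")

for row in undo_png_up(sample, 5):
    # Split according to /W [1 3 1]: type, 3-byte offset, 1-byte generation
    kind, offset, gen = row[0], int.from_bytes(row[1:4], "big"), row[4]
    print(kind, hex(offset), gen)
```

For the three sample rows this prints type 1 entries with offsets 0x10, 0x27fc and 0x28f0. Applied to the full 5088-byte stream, the same loop would produce 848 rows of 5 bytes each.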

Related

Trying to understand data in cross-reference (XRef) stream in PDF

I'm trying to read a PDF file that is linearized and uses cross-reference streams. I believe that I mostly understand what's happening except for the last two entries in the table. Those two, for objects 5 and 6, appear to be in use but show file offsets that vastly exceed the file size. Also, the PDF file I have doesn't even have objects number 5 or 6 in it.
Here is the cross-reference stream:
4 0 obj
<</DecodeParms<</Columns 4/Predictor 12>>/Filter/FlateDecode/ID[<ED772C59D33BA74FA1DEE567740067A0><ED772C59D33BA74FA1DEE567740067A0>]/Info 6 0 R/Length 39/Root 8 0 R/Size 7/Type/XRef/W[1 3 0]>>stream
hﬁbb&F…ˆl&ﬁt ¡ÿ"∏ôügÕ≤=‘
endstream
And here are the raw data after FlateDecode, arranged in rows. FlateDecode reports that 35 bytes of data were inflated.
02 00 00 00 00
02 01 19 87 6b
02 00 00 0d 67
02 00 00 01 8c
02 00 00 01 0b
02 01 e7 6a 99
02 00 00 00 01
I also applied a PNG Predictor function (up) which yielded 7 rows of 4 bytes each:
00 00 00 00
01 19 87 6b
01 19 94 d2
00 00 0e f3
00 00 02 97
01 e7 6b a4
01 e7 6a 9a
Row 0 is all zero, check. The offsets for object 1 and 2 do in fact address object 1 and 2 in the PDF file. So far, so good. Object 3 is marked unused, and for sure there is no object 3 in the PDF file.
But then, I'm a little confused that object 4, this cross-reference stream, is marked as unused. Still, since it is object 4 that I am parsing, I've clearly had no difficulty finding it.
But where I am completely confused are the rows for object 5 and 6. The "01" in the first column tells me that they are in use. But their offsets exceed the size of the entire file, and in any case, there are no object 5 nor 6 in the file. The Size entry in the dictionary clearly has a value of 7, telling me the table should contain data for objects 0 thru 6. After filtering, I have 28 bytes of data, which makes sense for seven rows of four bytes each.
Why are entries for 5 and 6 there at all? And, given that they are there, why are they marked as "in use" with apparently nonsense offsets?
The file seems valid. Both Adobe Illustrator and Acrobat Reader open it without complaint. I haven't found anything in the PDF spec about special treatment for the last two rows of an Xref stream. What am I missing?

You interpret the predictor to add the current input row and the previous input row to retrieve the current data row. Shouldn't you add the current input row and the previous data row? That would change results for object 3 onward:
02 00 00 00 00  ->  00 00 00 00
02 01 19 87 6b  ->  01 19 87 6b
02 00 00 0d 67  ->  01 19 94 d2
02 00 00 01 8c  ->  01 19 95 5e
02 00 00 01 0b  ->  01 19 96 69
02 01 e7 6a 99  ->  02 00 00 02
02 00 00 00 01  ->  02 00 00 03
Now objects 3 and 4 have proper offsets matching the data from your pastebin paste and objects 5 and 6 would be marked as objects in object streams.
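The corrected decoding can be sketched in Python as follows. This is an illustration under stated assumptions, not a full XRef parser: it only handles filter tag 2 (Up), the function name decode_xref_rows is made up, and a zero-width /W field is simply reported as None rather than given a default.

```python
def decode_xref_rows(data: bytes, columns: int, widths: tuple[int, int, int]):
    """Undo PNG 'Up' (tag 2) and split each decoded row into the /W fields."""
    prev = bytes(columns)  # previous *decoded* row, initially all zeros
    entries = []
    for i in range(0, len(data), columns + 1):
        tag, delta = data[i], data[i + 1:i + 1 + columns]
        if tag != 2:
            raise ValueError("this sketch only handles the Up filter")
        # Add the input row to the previous decoded row, byte by byte
        prev = bytes((d + p) & 0xFF for d, p in zip(delta, prev))
        fields, pos = [], 0
        for w in widths:
            # A zero-width field is absent from the stream
            fields.append(int.from_bytes(prev[pos:pos + w], "big") if w else None)
            pos += w
        entries.append(tuple(fields))
    return entries


# The seven raw rows from the question (tag byte + 4 data bytes each)
raw = bytes.fromhex(
    "0200000000" "020119876b" "0200000d67" "020000018c"
    "020000010b" "0201e76a99" "0200000001"
)

for obj_num, (kind, field2, field3) in enumerate(decode_xref_rows(raw, 4, (1, 3, 0))):
    print(obj_num, kind, hex(field2), field3)
```

With /W [1 3 0], objects 1 to 4 come out as type 1 with offsets 0x19876b, 0x1994d2, 0x19955e and 0x199669, and objects 5 and 6 as type 2, where the second field is the number of the containing object stream rather than a file offset.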

Gzip deflate noncompressed data format

After reading RFC 1951, I manually wrote a simple gzip file that contains non-compressed data. The uncompressed data file only has one character 'a' with no additional spaces or line breaks. The content of the gzip file is
1f 8b 08 00 00 00 00 00 00 03 01 80 00 7f ff 86 43 be b7 e8 01 00 00 00.
When I tried to unzip it on a Linux system, it gave me the error "gzip: xxx.gz: unexpected end of file".
I think I followed the deflate format of non-compressed data block mentioned in 3.2.4. After 10 bytes gzip header,
01: BFINAL=1 and BTYPE=00
8000: LEN=1
7fff: NLEN
86: a
Followed by CRC and Size.
Can anyone point out anything wrong or missing in the gzip file? Thanks a lot.

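80 00 is length 128, not 1, because LEN is stored least-significant byte first. 01 00 would be length 1, and NLEN, its one's complement, is then fe ff. (Interestingly, you managed to correctly represent the total uncompressed length at the end as 01 00 00 00.)
Also, an a is hex 61, not 86.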
So the correct stream would be:
1f 8b 08 00 00 00 00 00 00 03 01 01 00 fe ff 61 43 be b7 e8 01 00 00 00
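As a cross-check, the same file can be assembled in Python and fed back through the standard gzip module. This is a minimal sketch: the helper name gzip_stored is made up, the 10-byte header is copied from the question, and only a single stored block (at most 65535 bytes) is supported.

```python
import gzip
import struct
import zlib


def gzip_stored(payload: bytes) -> bytes:
    """Wrap `payload` in a gzip member holding one stored (BTYPE=00) block."""
    if len(payload) > 0xFFFF:
        raise ValueError("a single stored block holds at most 65535 bytes")
    header = bytes.fromhex("1f8b0800000000000003")  # 10-byte header from the question
    block = (
        b"\x01"  # BFINAL=1, BTYPE=00, rest of the byte is padding
        # LEN and NLEN (one's complement of LEN), both little-endian
        + struct.pack("<HH", len(payload), len(payload) ^ 0xFFFF)
        + payload
    )
    # Trailer: CRC-32 of the uncompressed data, then its length (ISIZE)
    trailer = struct.pack("<II", zlib.crc32(payload), len(payload))
    return header + block + trailer


data = gzip_stored(b"a")
print(data.hex(" "))
print(gzip.decompress(data))
```

The printed bytes match the corrected stream above, and gzip.decompress returns the single byte a.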

ssl client_hello, unidentified data

I am trying to make sense of an SSL Client Hello packet, but I am stuck on the last few bytes.
0000 16 03 00 00 58 01 00 00 54 03 03 52 f3 8a b2 f6 ....X...T..R....
0010 35 b8 08 39 25 5f 61 73 d5 b6 af 4d 3c 1a 2d 70 5..9%_as...M<.-p
0020 58 2e be 8a 89 b6 5c e1 9a 3f 81 00 00 18 00 35 X.....\..?.....5
0030 00 2f 00 0a 00 05 00 04 00 38 00 32 00 13 00 66 ./.......8.2...f
0040 00 39 00 33 00 16 01 00 00 13 ff 01 00 01 00 00 .9.3............
0050 0d 00 0a 00 08 04 02 04 01 02 01 02 02 .............
What I got so far:
16: msg type
03 00: SSL version
00 58: Record Length
01: Handshake Type - Client_Hello
00 00 54: Message Length
03 03: Client preferred version
52 f3 8a b2 f6 35 ... 5c e1 9a 3f 81: random data/ timestamp
00: Session ID Length 0
00 18: Cipher suites length
00 35 .. 00 16: cipher suites
01: compression method length
00: compression method
00 13 ff 01 00 01 00 00 0d 00 0a 00 08 04 02 04 01 02 01 02 02: what is this ?
At first I thought it was challenge data, but it seems to be constant across all the packets.
My main guide for deciphering the packet was: http://www.ntu.edu.sg/home/ehchua/programming/webprogramming/HTTP_SSL.html (under Client_Hello)
(sorry for the bad formatting)

The bytes after the compression method are TLS extensions (see RFC 5246, section 7.4.1.2 Client Hello).
0x00 0x13 length of extensions (19 bytes)
The first one is the renegotiation_info extension (see RFC 5746, Section 3.2 Extension Definition):
0xff 0x01 renegotiation_info
0x00 0x01 length
0x00 empty renegotiated_connection, as sent in initial handshakes
The other one is the signature_algorithms extension (RFC 5246, section 7.4.1.4.1):
0x00 0x0d signature_algorithms
0x00 0x0a length
0x00 0x08 length of the list of hash/signature pairs
0x04 0x02 HashAlgorithm: sha256, SignatureAlgorithm: dsa
0x04 0x01 HashAlgorithm: sha256, SignatureAlgorithm: rsa
0x02 0x01 HashAlgorithm: sha1, SignatureAlgorithm: rsa
0x02 0x02 HashAlgorithm: sha1, SignatureAlgorithm: dsa
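The layout can be checked mechanically. The following Python sketch walks the 21 trailing bytes from the question as a 2-byte total length followed by (type, length, data) records. The function name parse_extensions and the two lookup tables are illustrative; the tables only list the RFC 5246 values that occur in this packet.

```python
def parse_extensions(block: bytes) -> list[tuple[int, bytes]]:
    """Split a ClientHello extensions block into (type, data) pairs."""
    total = int.from_bytes(block[:2], "big")
    if total != len(block) - 2:
        raise ValueError("extensions length does not match the data")
    pos, result = 2, []
    while pos < len(block):
        ext_type = int.from_bytes(block[pos:pos + 2], "big")
        ext_len = int.from_bytes(block[pos + 2:pos + 4], "big")
        result.append((ext_type, block[pos + 4:pos + 4 + ext_len]))
        pos += 4 + ext_len
    return result


# Subset of the RFC 5246 registries, just enough for this packet
HASHES = {2: "sha1", 4: "sha256"}
SIGNATURES = {1: "rsa", 2: "dsa"}

# The bytes after the compression method in the question's packet
tail = bytes.fromhex("0013ff01000100000d000a00080402040102010202")

for ext_type, ext_data in parse_extensions(tail):
    print(f"extension 0x{ext_type:04x}: {ext_data.hex(' ')}")
    if ext_type == 0x000D:  # signature_algorithms
        list_len = int.from_bytes(ext_data[:2], "big")
        pairs = ext_data[2:2 + list_len]
        for h, s in zip(pairs[0::2], pairs[1::2]):
            print("  ", HASHES.get(h, h), SIGNATURES.get(s, s))
```

The length check passes only if the first two bytes are read as 0x0013, and the walk ends exactly at the last byte of the record, which confirms that renegotiation_info carries a single 0x00 byte here.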

Why is there information missing in objdump?

I can't figure out why there are sometimes .words missing in my assembly code when I run objdump. What do the "..." alone on a line represent?

In the output of objdump -d or -D (disassemble) there are often lines containing only an ellipsis. Each one stands for a run of null (0x00) bytes between the line above it and the line below it.
Below is the output of a disassembled 32-bit program. Everything from 0x400238 (the word after 0x400234) up to 0x400240 is 0x00 in the executable file.
40022c: 00000034 0x34
400230: 0000016a 0x16a
400234: 000001ac 0x1ac
...
400240: 00000098 0x98
400244: 00000000 nop
400248: 000000a9 0xa9
...
400254: 000000cf 0xcf
Looking at a hex dump of the application we disassembled, you can see that the bytes covered by each ellipsis (file offsets 0x238 to 0x23F and 0x24C to 0x253) are all null. There is no point in printing these repeatedly, so objdump simply skips them. I should also note that if there is only a single word (32 / 64 bits) of null bytes, objdump will show it as nop or similar, depending on the machine.
Offset(h) 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
00000220                                      34 00 00 00              4...
00000230  6A 01 00 00 AC 01 00 00 00 00 00 00 00 00 00 00  j...¬...........
00000240  98 00 00 00 00 00 00 00 A9 00 00 00 00 00 00 00  ˜.......©.......
00000250  00 00 00 00 CF 00 00 00                          ....Ï...
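To make the effect concrete, here is a small Python sketch that reproduces the shape of the listing above from the raw bytes. It is not objdump's algorithm: collapsing any run of two or more zero words is an assumption chosen to match this particular listing, and the function name dump_words and the parameter min_zero_words are made up.

```python
import struct


def dump_words(data: bytes, base: int, min_zero_words: int = 2) -> list[str]:
    """List 32-bit little-endian words, collapsing long zero runs into '...'."""
    words = struct.unpack(f"<{len(data) // 4}I", data)
    lines, i = [], 0
    while i < len(words):
        # Find the end of the zero run starting at word i (may be empty)
        j = i
        while j < len(words) and words[j] == 0:
            j += 1
        if j - i >= min_zero_words:
            lines.append("...")  # the whole run becomes a single ellipsis line
            i = j
            continue
        lines.append(f"{base + 4 * i:x}: {words[i]:08x}")
        i += 1
    return lines


# The bytes from file offset 0x22C to 0x257 in the hex dump above
blob = bytes.fromhex(
    "34000000" "6a010000" "ac010000" "00000000" "00000000" "98000000"
    "00000000" "a9000000" "00000000" "00000000" "cf000000"
)
print("\n".join(dump_words(blob, 0x40022C)))
```

The single zero word at 0x400244 is printed, while the two longer zero runs each turn into one "..." line, as in the disassembly.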

I've used the -z (--disassemble-zeroes) argument to objdump, which stops it from skipping blocks of zeroes. You should then see the .word entries with zeroes.
This seems to be useful when you're passing the output of objdump to another program.

understand hexedit of an elf

Consider the following hexedit display of an ELF file.
00000000 7F 45 4C 46 01 01 01 00 00 00 00 00 .ELF........
0000000C 00 00 00 00 02 00 03 00 01 00 00 00 ............
00000018 30 83 04 08 34 00 00 00 50 14 00 00 0...4...P...
00000024 00 00 00 00 34 00 20 00 08 00 28 00 ....4. ...(.
00000030 24 00 21 00 06 00 00 00 34 00 00 00 $.!.....4...
0000003C 34 80 04 08 34 80 04 08 00 01 00 00 4...4.......
00000048 00 01 00 00 05 00 00 00 04 00 00 00 ............
How many section headers does it have?
Is it an object file or an executable file?
How many program headers does it have?
If there are any program headers, what does the first program header do?
If there are any section headers, at what offset is the section header table?

Strange, this hexdump looks like your homework to me...
There are 36 section headers.
It is an executable.
It has 8 program headers.
As you can tell by the first word (offset 0x34: 0x0006) in the first program header, it is of type PT_PHDR, which just informs about the characteristics of the program header table itself.
The section header table begins at byte 5200 (which is 0x1450 in hex).
How do I know this stuff? By dumping the hex into a binary and reading it with readelf -a (because I am lazy). Except for question no. 4, which I had to figure out manually by reading man 5 elf.
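The header fields can also be read directly from the 84 bytes shown in the question. The sketch below unpacks a 32-bit little-endian ELF header with Python's struct module; the function name parse_elf32_header is made up, and the field names follow man 5 elf.

```python
import struct

# The seven 12-byte rows of the hexedit display in the question
dump = bytes.fromhex(
    "7F454C460101010000000000"
    "000000000200030001000000"
    "308304083400000050140000"
    "000000003400200008002800"
    "240021000600000034000000"
    "348004083480040800010000"
    "000100000500000004000000"
)

FIELDS = (
    "e_type e_machine e_version e_entry e_phoff e_shoff e_flags "
    "e_ehsize e_phentsize e_phnum e_shentsize e_shnum e_shstrndx"
).split()


def parse_elf32_header(data: bytes) -> dict[str, int]:
    """Unpack the Elf32_Ehdr fields that follow the 16-byte e_ident."""
    if data[:4] != b"\x7fELF" or data[4] != 1 or data[5] != 1:
        raise ValueError("this sketch only handles 32-bit little-endian ELF")
    return dict(zip(FIELDS, struct.unpack_from("<HHIIIIIHHHHHH", data, 16)))


hdr = parse_elf32_header(dump)
# p_type is the first 32-bit word of the first program header
p_type = struct.unpack_from("<I", dump, hdr["e_phoff"])[0]

print("section headers:", hdr["e_shnum"])
print("type:", hdr["e_type"])  # 2 = ET_EXEC
print("program headers:", hdr["e_phnum"])
print("first p_type:", p_type)  # 6 = PT_PHDR
print("section header table offset:", hdr["e_shoff"])
```

This gives the same values as the readelf output: 36 section headers, type 2 (executable), 8 program headers, a first program header of type 6 (PT_PHDR) and a section header table at offset 5200.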
