Reading a PDF version >= 1.5: how to handle the Cross Reference Stream Dictionary

I'm trying to read the xref table of a PDF version >= 1.5.
The xref table is an object:
58 0 obj
<</DecodeParms<</Columns 4/Predictor 12>>/Filter/FlateDecode/ID[<CB05990F613E2FCB6120F059A2BCA25B><E2ED9D17A60FB145B03010B70517FC30>]/Index[38 39]/Info 37 0 R/Length 96/Prev 67529/Root 39 0 R/Size 77/Type/XRef/W[1 2 1]>>stream
hÞbbd``b`:$AD`­Ì ‰Õ Vˆ8âXAÄ×HÈ$€t¨ – ÁwHp·‚ŒZ$ìÄb!&F†­ .#5‰ÿŒ>(more here but can't paste)
endstream
endobj
As you can see:
/FlateDecode
/Index [38 39], that is, 39 entries in the stream
/W [1 2 1], that is, each entry is 1 + 2 + 1 = 4 bytes long
/Root 39 0 R, that is, the root object is number 39
BUT:
the decompressed stream is 195 bytes long (39 * 5 = 195), so each entry seems to be 5 bytes long, not the expected 4.
Here is the first inflated bytes
02 01 00 10 00 02 00 02 cd 00 02 00 01 51 00 02 00 01 70 00 02 00 05 7a 00 02
^^
if the entry length is 4, then the root entry is a free object (see the ^^)!!
if the entry length is 5: how do I interpret the fields of one entry (reference is implicitly made to PDF Reference, chapter 3.4.7, Table 3.16)?
For object 38, the first in the stream: being of type 2, it seems to be the 16th object of object stream number 256, but there is no object 256 in my PDF file!!!
The question is: how shall I handle the 195 bytes?

A compressed xref table may have been compressed with one of the PNG filters. If the /Predictor value is set to 10 or greater ("a Predictor value greater than or equal to 10 merely indicates that a PNG predictor is in use; the specific predictor function used is explicitly encoded in the incoming data")1, the PNG row filters are supplied inside the compressed data "as usual", i.e., in the first byte of each 'row', where a row is as wide as the sum of /W.
Width [1 2 1] plus Predictor byte:
02 01 00 10 00
02 00 02 cd 00
02 00 01 51 00
02 00 01 70 00
02 00 05 7a 00
02 .. .. .. ..
After applying the row filters ('2', or 'up', for all of these rows), you get this:
01 00 10 00
01 02 dd 00
01 03 2e 00
01 04 9e 00
01 09 18 00
.. .. .. ..
Note: calculated by hand; I might have made the odd mistake here and there. Note that the PNG 'up' filter is a byte filter, and the result of the "up" filter is truncated to 8 bits for each addition.
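For reference, here is a small Python sketch of this un-filtering step (a hand-rolled 'Up' filter only, fed with the five inflated rows quoted above; a real parser would obtain raw from zlib.decompress of the stream bytes and would have to handle the other PNG filter types as well):

def unfilter_up(raw: bytes, columns: int) -> bytes:
    # Undo the PNG 'Up' filter: each output byte is the stored byte plus
    # the byte directly above it in the previous decoded row, mod 256.
    stride = columns + 1        # every stored row starts with a filter-type byte
    prev = bytearray(columns)   # the row "above" the first row is all zeros
    out = bytearray()
    for i in range(0, len(raw), stride):
        row = raw[i:i + stride]
        assert row[0] == 2, "only filter type 2 ('Up') is handled here"
        prev = bytearray((row[j + 1] + prev[j]) & 0xFF for j in range(columns))
        out += prev
    return bytes(out)

# The five inflated rows from the question (/Columns 4):
raw = bytes.fromhex("0201001000 020002cd00 0200015100 0200017000 0200057a00")
print(unfilter_up(raw, columns=4).hex(" "))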
This leads to the following type 1 XRef references ("type 1 entries define objects that are in use but are not compressed (corresponding to n entries in a cross-reference table)"):2
#38 type 1: offset 10h, generation 0
#39 type 1: offset 2DDh, generation 0
#40 type 1: offset 32Eh, generation 0
#41 type 1: offset 49Eh, generation 0
#42 type 1: offset 918h, generation 0
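Slicing the un-filtered rows according to /W and mapping the fields onto Table 3.16 is then mechanical. A sketch continuing from unfilter_up above (note that, per the spec, a zero-width first field would default to type 1, which this sketch ignores):

def parse_xref_rows(decoded: bytes, w=(1, 2, 1), first_obj=38):
    # Yield (object number, entry type, field 2, field 3) per Table 3.16.
    stride = sum(w)
    for n, i in enumerate(range(0, len(decoded), stride)):
        row, fields, pos = decoded[i:i + stride], [], 0
        for width in w:
            fields.append(int.from_bytes(row[pos:pos + width], "big"))
            pos += width
        yield (first_obj + n, *fields)

for num, etype, f2, f3 in parse_xref_rows(unfilter_up(raw, 4)):
    # type 0: free       (f2 = next free object,     f3 = generation)
    # type 1: in use     (f2 = byte offset,          f3 = generation)
    # type 2: compressed (f2 = object stream number, f3 = index in it)
    print(f"#{num} type {etype}: {f2:#x}, {f3}")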
1 See "LZW and Flate Predictor Functions" in PDF Reference 1.7, 6th Ed., Section 3.3: Filters.
2 As described in Table 3.16 of PDF Reference 1.7.

Related

Writing a streamed cross-reference in PDF: file is detected as damaged?

I am using TypeScript to write a streamed cross-reference table at the end of my PDF file, as suggested in the answer to my previous post.
Here is the XREF table I would have written without stream:
xref
0 1
0000000000 65535 f
21 2
0000084670 00000 n
0000085209 00000 n
73 6
0000085585 00000 n
0000086335 00000 n
0000150988 00000 n
0000151086 00000 n
0000151528 00000 n
0000151707 00000 n
trailer
<<
/Size 79
/Root 21 0 R
/Info 19 0 R
/Prev 116
>>
startxref
152861
%%EOF
And here is the streamed version:
79 0 obj
<<
/Type /XRef /Filter /FlateDecode
/Root 21 0 R
/Info 19 0 R
/Index [0 1 21 2 73 7 ] /W[1 3 0] /DecodeParms<</Columns 4/Predictor 12>> /Prev 116 /Length 45 /Size 80>>
stream
(...data..)
endstream
endobj
startxref
152870
%%EOF
As for the content of the stream, here it is as a byte array. It was deflated using pako.deflate() with the default compression level:
120,156,99,98,0,2,38,70,70,175,125,76,140,12,76,210,64,130,177,2,196,122,7,36,254,244,2,9,134,36,144,216,46,16,107,51,144,96,105,2,0,141,44,6,140
I have reversed the process I found here.
The arranged inflated version is as follows:
02 00 00 00 00
02 01 01 4a be // = 84670 -> object 21 is at byte position 84670
02 01 00 02 1b // = 539 -> object 22 is at byte position 85209
02 01 00 01 78 // = 376 -> object 73 is at byte position 85585
02 01 00 02 ee // etc
02 01 00 fc 8d
02 01 00 00 62
02 01 00 01 ba
02 01 00 00 b3
02 01 00 04 82
However, when I try to open the resulting file, all I get is an error saying the file is damaged.
The oddest thing about this process is the leading "02" on each line (found in the referenced post here); however, even without it, the same problem seems to occur. What am I missing?
Three problems can be identified easily.
Incorrect startxref offset
Your startxref points to the start of the stream dictionary, but it should point to the beginning of the indirect object housing the stream, i.e., to the 79 0 obj line.
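As an illustration only (a sketch; the placeholder file name and the assumption that the header 79 0 obj occurs exactly once in the file are mine):

with open("out.pdf", "rb") as f:
    data = f.read()
# The startxref value must be the byte offset of "79 0 obj" itself,
# not of the "<<" that opens the stream dictionary.
print(data.rfind(b"79 0 obj"))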
Missing predictor bytes
You claim that the arranged inflated version is
02 00 00 00 00
02 01 01 4a be // = 84670 -> object 21 is at byte position 84670
02 01 00 02 1b // = 539 -> object 22 is at byte position 85209
02 01 00 01 78 // = 376 -> object 73 is at byte position 85585
02 01 00 02 ee // etc
...
But inflating your stream gives
00 00 00 00
01 01 4a be
01 00 02 1b
01 00 01 78
01 00 02 ee
...
I.e., you forgot to add the predictor bytes.
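For the writing direction, here is a minimal sketch of adding the predictor bytes before deflating (assuming fixed-width rows and the 'Up' filter, type 2, throughout; zlib.compress stands in for whatever deflate implementation is actually used):

import zlib

def predict_up_and_deflate(rows: list[bytes]) -> bytes:
    # Prepend filter-type byte 2 to each row, store the byte-wise
    # difference (mod 256) from the previous unfiltered row, then deflate.
    prev = bytes(len(rows[0]))    # the row "above" the first row is all zeros
    out = bytearray()
    for row in rows:
        out.append(2)             # PNG filter type 2 = 'Up'
        out += bytes((b - p) & 0xFF for b, p in zip(row, prev))
        prev = row
    return zlib.compress(bytes(out))

# e.g. the first two entries of the intended table (/W [1 3 0]):
data = predict_up_and_deflate([bytes.fromhex("00000000"),
                               bytes.fromhex("01014abe")])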
Prediction only applied to offsets
You fixed the two errors mentioned above and shared that file. Looking at that file, another error became clear: you only apply the prediction to the offset part of the cross-reference entry, not to the initial byte indicating the type of the entry!
Your inflated stream is now as follows
02, 01, 01, 4a, be,
02, 01, 00, 02, 1b,
02, 01, 00, 01, 78,
02, 01, 00, 02, ee,
02, 01, 00, fc, 8d,
02, 01, 00, 00, 62,
02, 01, 00, 01, bc,
02, 01, 00, 00, b3,
02, 01, 00, 04, 82
Resolving the prediction, therefore, results in
01, 01, 4a, be,
02, 01, 4c, d9,
03, 01, 4d, 51,
04, 01, 4f, 3f,
05, 01, 4b, cc,
06, 01, 4b, 2e,
07, 01, 4c, ea,
08, 01, 4c, 9d,
09, 01, 50, 1f
This contains a lot of incorrect type bytes.

Trying to understand data in cross-reference (XRef) stream in PDF

I'm trying to read a PDF file that is linearized and uses cross-reference streams. I believe that I mostly understand what's happening, except for the last two entries in the table. Those two, for objects 5 and 6, appear to be in use but show file offsets that vastly exceed the file size. Also, the PDF file I have doesn't even contain an object number 5 or 6. Here is the cross-reference stream:
4 0 obj
<</DecodeParms<</Columns 4/Predictor 12>>/Filter/FlateDecode/ID[<ED772C59D33BA74FA1DEE567740067A0><ED772C59D33BA74FA1DEE567740067A0>]/Info 6 0 R/Length 39/Root 8 0 R/Size 7/Type/XRef/W[1 3 0]>>stream
hfibb&F…ˆl&fit ¡ÿ"∏ôügÕ≤=‘
endstream
And here are the raw data after FlateDecode, arranged in rows. FlateDecode reports that 35 bytes of data were inflated.
02 00 00 00 00
02 01 19 87 6b
02 00 00 0d 67
02 00 00 01 8c
02 00 00 01 0b
02 01 e7 6a 99
02 00 00 00 01
I also applied a PNG Predictor function (up) which yielded 7 rows of 4 bytes each:
00 00 00 00
01 19 87 6b
01 19 94 d2
00 00 0e f3
00 00 02 97
01 e7 6b a4
01 e7 6a 9a
Row 0 is all zero, check. The offsets for objects 1 and 2 do in fact address objects 1 and 2 in the PDF file. So far, so good. Object 3 is marked unused, and for sure there is no object 3 in the PDF file.
But then, I'm a little confused that object 4, this cross-reference stream, is marked as unused. Still, since it is object 4 that I am parsing, I've clearly had no difficulty finding it. Where I am completely confused is the rows for objects 5 and 6. The "01" in the first column tells me that they are in use, but their offsets exceed the size of the entire file, and in any case, there are no objects 5 or 6 in the file. The Size entry in the dictionary clearly has a value of 7, telling me the table should contain data for objects 0 through 6. After filtering, I have 28 bytes of data, which makes sense for seven rows of four bytes each.
Why are entries for 5 and 6 there at all? And, given that they are there, why are they marked as "in use" with apparently nonsense offsets? The file seems valid: both Adobe Illustrator and Acrobat Reader open it without complaint. I haven't found anything in the PDF spec about special treatment for the last two rows of an XRef stream. What am I missing?
You interpret the predictor as adding the current input row and the previous input row to retrieve the current data row. Shouldn't you add the current input row and the previous data row? That would change the results from object 3 onward (input row on the left, decoded row on the right):
02 00 00 00 00 00 00 00 00
02 01 19 87 6b 01 19 87 6b
02 00 00 0d 67 01 19 94 d2
02 00 00 01 8c 01 19 95 5e
02 00 00 01 0b 01 19 96 69
02 01 e7 6a 99 02 00 00 02
02 00 00 00 01 02 00 00 03
Now objects 3 and 4 have proper offsets matching the data from your pastebin paste, and objects 5 and 6 are marked as objects in object streams.

How to Parse ISO 8583 message

How can I determine where the MTI starts in an ISO 8583 message?
00 1F 60 00 05 80 53 08 00 20 20 01 00 00 80 00 00 92 00 00 00 31 07 00 05 31 32 33 34 31 32 33 34
In that message 00 1F is the length, and 60 00 05 80 53 is the TPDU. (Those aren't part of ISO8583). 08 00 is the MTI. The next 8 bytes are the primary bitmap.
You can buy a copy of the ISO 8583 specification from ISO. There is an introduction on Wikipedia.
The position of the MTI is network specific and should be explained in the network's technical specifications document.
You can eyeball the MTI by looking for values such as 0100, 0110, 0220, 0230, 0800, etc. in the first 20 bytes; they are typically followed by 8 to 16 bytes of bitmap data.
Your data shows MTI = 0800 with a bitmap of 20 20 01 00 00 80 00 00.
That means the following fields are present: 3, 11, 24, 41, with DE 3 (processing code) = 920000 and DE 11 (STAN) = 003107; the remaining bytes are shared between DE 24 and DE 41, but I am not sure about their sizes.
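For illustration, a small sketch that decodes a primary bitmap into the list of present data elements (bit 1 is the most significant bit of the first byte; when bit 1 itself is set, a secondary bitmap follows):

def bitmap_fields(bitmap: bytes) -> list[int]:
    # Return the data-element numbers whose bits are set in the bitmap.
    return [i + 1 for i in range(len(bitmap) * 8)
            if bitmap[i // 8] & (0x80 >> (i % 8))]

print(bitmap_fields(bytes.fromhex("2020010000800000")))   # -> [3, 11, 24, 41]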
In this message a 2-byte header length is used:
00 1F
But some hosts use a 4-byte header length for ISO 8583 messages, so you cannot generalize it; it depends on what you have arranged with the sending host.

non-matching XRef stream entry size

I'm trying to read an XRef stream object, but something doesn't add up. This is my object:
<<
/DecodeParms << /Columns 5/Predictor 12 >>
/Filter /FlateDecode
/ID [<9597C618BC90AFA4A078CA72B2DD061C><48726007F483D547A8BEFF6E9CDA072F>]
/Index [124332 848]
/Info 124331 0 R
/Length 137
/Prev 8983958
/Root 124333 0 R
/Size 125180
/Type /XRef
/W [1 3 1]
>>
I read the 137 bytes of stream, uncompress them through zlib and I get 5088 bytes. This is the beginning of the uncompressed stream (hexdump -C output):
00000000 02 01 00 00 10 00 02 00 00 27 ec 00 02 00 00 01 |.........'......|
00000010 f4 00 02 00 00 01 f7 00 02 00 00 04 5b 00 02 00 |............[...|
00000020 00 02 68 00 02 00 00 0b ac 00 02 00 00 0f e5 00 |..h.............|
00000030 02 00 00 0e 93 00 02 00 00 0d 14 00 02 00 00 0d |................|
What I don't understand is that I should have 5 bytes per entry: /W [1 3 1] means 1+3+1 = 5 bytes; but the stream's length of 5088 isn't divisible by 5. Also, I realized that 5088 is divisible by 6: 5088/6 = 848, and that's the number of entries, as the second value of the /Index key confirms. Reading the stream with the [1 3 1] scheme is also impossible already at the second entry (the byte 0xEC isn't a valid entry type).
Where's my mistake?
Thanks a lot for any help.
After decompression each row has 6 bytes: 1 byte for the predictor (the PNG filter type) and 5 bytes of 'predicted' data.
Only after you apply the predictor do you get the actual data.
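In other words, with a predictor in play the inflated length should always be entries * (sum(/W) + 1). A quick check with the numbers from this question:

w = [1, 3, 1]           # /W
entries = 848           # second number in /Index
stride = sum(w) + 1     # one extra predictor byte per row; equals /Columns + 1
assert stride * entries == 5088   # matches the inflated length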

USB SCSI Inquiry and RequestSense

I am currently writing a USB (EHCI) driver for pendrives in assembly for my OS.
I can successfully get the Device, String, and Configuration (Interface and Endpoint) descriptors.
However, the SCSI Inquiry and other commands fail.
This is what I do (after getMaxLUN):
SetConfiguration(1)
Inquiry
Another way:
SetConfiguration(1)
TestUnitReady
RequestSense
TestUnitReady
I correctly set the EndPt in the QueueHead (BulkIn or BulkOut) and the device address.
I am implementing the USB driver following the book "USB: The Universal Serial Bus" by B. D. Lunt.
If I send Inquiry, this is what I get (return buffer, 36 bytes, hex):
01 00 00 00 01 00 00 00 80 0D 24 00
40 15 16 00 00 20 16 00 00 30 16 00
00 40 16 00 00 50 16 00 00 00 00 00
The 11th byte is 0x24, the length of the returned data (36).
This looks like garbage or an error message.
If I send TestUnitReady, then RequestSense returns (18 bytes, hex):
01 00 00 00 01 00 00 00 80 0D 12 00
C0 16 16 00 00 20
The 11th byte is 0x12, the length of the returned data (18).
The contents of the two returned buffers are similar. The CBWs of Inquiry and TestUnitReady look good (I use LUN=0). The SCSI spec (SPC-4) says something about an incorrect logical unit if the result is not the expected one, but I have only LUN 0.
I am testing with a Sony 4GB pendrive.
Inquiry should work immediately (return correct data in the buffer) regardless of the CSW after it, and I don't understand what the problem can be.
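For comparison, here is a reference encoding of an Inquiry CBW per the USB Mass Storage Bulk-Only Transport spec (a sketch in Python for readability, since the driver itself is in assembly; the tag value is arbitrary):

import struct

def inquiry_cbw(tag: int) -> bytes:
    cdb = bytes([0x12, 0x00, 0x00, 0x00, 0x24, 0x00])   # INQUIRY, allocation length 36
    return struct.pack("<IIIBBB16s",
                       0x43425355,   # dCBWSignature 'USBC'
                       tag,          # dCBWTag, echoed back in the CSW
                       36,           # dCBWDataTransferLength
                       0x80,         # bmCBWFlags: device-to-host (IN)
                       0,            # bCBWLUN
                       len(cdb),     # bCBWCBLength
                       cdb)          # command block, zero-padded to 16 bytes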
I am experimenting with BulkReset and ClearFeature (for BulkIn and BulkOut), but that doesn't seem to fix the Inquiry.
The CSW of the TestUnitReady returns Status=01, but after a BulkReset and ClearFeature it returns Status=0 (success).
What can be the reason for the incorrect data in the buffer(s)?
Any help is appreciated.
EDIT: I tried it with delays too (100 ms before sending CBWs and getting CSWs). It didn't help.
EDIT2: Calling GetConfiguration() right after SetConfiguration(1) correctly returns 1.
EDIT3: This must have been due to an incorrect pointer. I consider it solved.