How to make sense of DICT data in CFF font format - pdf

Problem
I'm trying to parse a OTF/CFF font and is struggling with top DICT part, more specifically the top DICT data part.
CFF File
The beginning of CFF table looks like this in hex editor:
The top DICT starts from the second line from offset 0xC2 with 00 01 "top DICT INDEX count", 01 "top DICT INDEX offsetsize", 01 77 "top DICT INDEX offsets".
The large yellow section is the data part for the DICT, but I simply cannot make sense of it. I referenced: https://typekit.files.wordpress.com/2013/05/5176.cff.pdf
http://wwwimages.adobe.com/content/dam/Adobe/en/devnet/font/pdfs/T1_SPEC.pdf
Things I tried
Since top DICT starts with version, Notice, Copyright which are SID, so I tried to look up the offsetted strings but they were way off the strings.
I tried to encode them using Table 3 in page 10 of the CFF reference pdf, essentially taking two bytes, b0, b1, and calculating the value, but the values seemed unrelated.
Further Information
It seems I'm having difficulty understanding Table 3 and Table 4. So the DICT data is supposed to be 1 or 2 byte operators and variable sized operands, and these are concatenated throughout the data? Some examples would be helpful.

I misunderstood the encoding procedure. You need to start from the beginning, and based on the first byte, need to find which encoding it uses, integer encodings, real encoding, or instructions etc.
Btw, this font has CIDFont Operator Extensions eg F8 1B F8 1C 8D 0C 1E meaning it is a CID font. So it doesn't have encoding offset, don't waste time like me trying to find one!

Related

VB.net Read Cobol File Fields (Pure Binary, EBCDIC, Packed)

I need to read a Cobol file into VB.net. Here is the description of the data types from the documentation:
All Magnetic tape files are recorded in 9-track, 8OOBPI mode with odd parity. They are created IBM equipment disk operating system. IBM System 360 Standard.
Binary - Data is coded in pure binary code.
BCD - Data is coded in binary coded decimal format. (Primarily
for files created by the IBM 1401 System).
EBCDIC - Data is coded in extended binary coded decimal interchange code. :(An IBM developed code.)
Packed - Data is coded in packed decimal format.
File Format:
1-2 Record Count [Numeric] (Binary)
3-4 Filler (Binary)
5-5 Record Type [B or R] (EBCDIC)
6-10 Sales Location Numeric [9 digit number] (Packed)
11-13 Sales Identifier (3 character Alpha) (EBCDIC]
etc
So, I know I should read the entire file into a byte array and that's about the limit of what I know to do...
A) I saw another post on EBCDIC conversation using
System.Text.Encoding.GetEncoding(37)
but it is for an entire file. If I run the whole file through it I see intelligible text, but of course the other fields are junk. I don't know the language to decode a single field properly.
B) I have no idea what to do with PURE Binary format.
C) I don't know how to read Packed, particularly as a single field
I've tried a variety of decoding options for PURE BINARY, but the number I get for the first field is not consistent with the stated length of the rows in the docs.
Packed decimal format:
For s9(5)V9(4) comp-3, 123.45 is represented in byte format as
00 12 34 50 0c
Each digit is represented by 4 bits, there is a 4 bit sign (c) at the end and an assumed decimal after the 3.
Most languages provide a routine for converting byte/bytes into a string i.e. byte x'34' -->> String '34'. So you can:
Convert the bytes to a String representation
Add the decimal point in
Strip off the sign character from the end and add the appropriate sign to the front
There are other ways:
Create an translation array and do an array lookup. (See https://github.com/bmTas/JRecord/blob/master/Source/JRecord_Project/JRecord_Common/src/main/java/net/sf/JRecord/Types/smallBin/TypePackedDecimal9.java for an example)
Process it 4 bits at a time
Other fields
The first field (binary) might be a big endian binary integer or another packed-decimal. There is probably a utility built in the .net to do this.
Convert the character fields from ebcdic to ascii one field at a time
In VBA you did not need to read the whole file in, you could read it record by record. I would presume you can do the same in vb.net
Useful Utilities
These tools might be useful for testing.
The RecordEditor should be able to display the file. The Layout Wizard should be able determine the format of the file. Alternatively use the Cobol copybook below
The Java program CobolToCsv should be able to convert the file to Csv
01 tape-record.
05 record-count pic s9(3) comp.
05 filler pic x(2).
05 record-type pic x.
05 Sales-Location pic s9(9) comp-3.
05 Sales-Identifier pic x(3).

Fill and sign for PDF file not working using Acrobat Reader DC

I'm asking this here because given the searches I've done, it appears Adobe's support is next to non-existent. I have, according to this online validation tool:
https://www.pdf-online.com/osa/validate.aspx
A perfectly valid PDF, which is generated from code. However, when using Acrobat Reader DC I am unable to use Fill And Sign - when attempting to sign, it throws this error:
The operation failed because Adobe Acrobat encountered an unknown error
This is the offending PDF:
https://github.com/DelphiWorlds/MiscStuff/blob/master/Test/PDF/SigningNoWork.pdf
This is one which is very similar, where Fill and Sign works:
https://github.com/DelphiWorlds/MiscStuff/blob/master/Test/PDF/SigningWorks.pdf
Foxit Reader has no issue with either of them - Fill and Sign works without fail.
I would post the source of the files, however because they have binary data, I figure links to them is better.
The question is: why does the first one fail to work, but not the second?
In your non-working file all the fonts are defined with
/FirstChar 30
/LastChar 255
i.e. having 226 glyphs. Their respective Widths arrays only have 224 entries, though, so they are incomplete.
After adding two entries to each Widths array, Adobe Reader here does not run into that unknown error anymore during Fill And Sign.
As the OP inquired how exactly I changed those widths arrays:
I wanted the change to have as few side effects as possible, so I was glad to see that there was some empty space in the font dictionaries in question, so a trivial hex editing sufficed, no need to shift indirect objects and update cross references:
In each of those font definitions in the objects 5, 7, 9, and 11 the Widths array is the last dictionary entry value and ends with some white space, after the last width we have these bytes:
20 0D 0A 5D 0D 0A 3E 3E --- space CR NL ']' CR NL '>' '>'
I added two 0 values using the white space:
20 30 20 30 20 5D 3E 3E --- space '0' space '0' space ']' '>' '>'
Acrobat Reader DC - the free version, does not allow you to do the fill and sign anymore if your document has metadata attached to it.
You need to purchase the Pro DC version, which is like $14.99, in order to continue using the fill and sign on here.
I just got done with a 4 months support exchange of emails with Adobe, and that was their final answer.

Extra "hidden" characters messing with equals test in SQL

I am doing a database (Oracle) migration validation and I am writing scripts to make sure the target matches the source. My script is returning values that, when you look at them, look equal. However, they are not.
For instance, the target has PREAPPLICANT and the source has PREAPPLICANT. When you look at them in text, they look fine. But when I converted them to hex, it shows 50 52 45 41 50 50 4c 49 43 41 4e 54 for the target and 50 52 45 96 41 50 50 4c 49 43 41 4e 54 for the source. So there is an extra 96 in the hex.
So, my questions are:
What is the 96 char?
Would you say that the target has incorrect data because it did not bring the char over? I realize this question may be a little subjective, but I'm asking it from the standpoint of "what is this character and how did it get here?"
Is there a way to ignore this character in the SQL script so that the equality check passes? (do I want the equality to pass or fail here?)
It looks like you have Windows-1252 character set here.
https://en.wikipedia.org/wiki/Windows-1252
Character 96 is an En Dash. This makes sense, as the data was PREAPPLICANT.
One user provided "PREAPPLICANT" and another provided "PRE-APPLICANT" and Windows helpfully converted their proper dash into an en dash.
As such, this doesn't appear to be an error in data, more an error in character sets. You should be able to filter these out without too much effort but then you are changing data. It's kind of like when one person enters "Mr Jones" and another enters "Mr. Jones"--you have to decide how much data massaging you want to do.
As you probably already have done, use the DUMP function to get the byte representation of the data in code of you wish to inspect for weirdness.
Here's some text with plain ASCII:
select dump('Dashes-and "smart quotes"') from dual;
Typ=96 Len=25: 68,97,115,104,101,115,45,97,110,100,32,34,115,109,97,114,116,32,113,117,111,116,101,115,34
Now introduce funny characters:
select dump('Dashes—and “smart quotes”') from dual;
Typ=96 Len=31: 68,97,115,104,101,115,226,128,148,97,110,100,32,226,128,156,115,109,97,114,116,32,113,117,111,116,101,115,226,128,157
In this case, the number of bytes increased because my DB is using UTF8. Numbers outside of the valid range for ASCII stand out and can be inspected further.
Here's another way to see the special characters:
select asciistr('Dashes—and “smart quotes”') from dual;
Dashes\2014and \201Csmart quotes\201D
This one converts non-ASCII characters into backslashed Unicode hex.

Detect if Base 64 string is image or text

Is there a way to detect if the Base 64 string contained in an NSData instance is an image or a text or any other object?
You can't generally just look at the base 64 string and decide, but you can decode the first few bytes of data, look at the hex codes (you can do this by decoding your base-64 string into a NSData and just NSLog it or examining it in the debugger), and draw some conclusions. For example:
Image files generally start with special byte sequences (e.g. JPEG start with the hex bytes FF D8; PNG generally start with hex bytes 89 50 4E 47 0D 0A 1A 0A (e.g. 89 "PNG" CR LF EOF LF, etc.). Note, there are a dizzying number of different image formats, so this is a non-trivial exercise, but sometimes you can get lucky and it will be self-evident that it's one of these common format when you glance at the first few bytes.
NSKeyedArchiver archives generally start with the string "bplist".
ASCII text consists of codes between 20 and 7F (with linefeeds represented by 0A; carriage return and linefeeds represented by OD 0A; tab characters as 09; etc.). Then, again, if it was a text, it's unlikely they'd be base-64 encoding it.
If it was UTF-8 it would conform to the coding pattern outlined here. For example, you can look at the first few high bits of the first byte that might conceivably represent a UTF-8 character, and conclude (a) how many bytes the character is represented by and (b) what high bits will be turned on those subsequent bytes. You can often quickly look at it and confirm whether the data conforms to this UTF-8 pattern or not (especially easy to do for most western languages)
If the first three characters were EF BB BF, that often indicates a UTF-8 byte order mark.
This is, by no means, an exhaustive list of codes, but just a few that leapt out at me.
To do this programmatically and do so exhaustively would be a non-trivial exercise. But if you're just "eye-balling" a base-64 string and trying to draw some logical inferences, decode it and look at the hex bytes and you can quickly narrow down the possibilities, at the very least. If you're unsure about how to interpret it, update your question with the hex representation of the decoded base-64 string (just the first 16-32 bytes, please), and we might be able to point you in the right direction.
It is impossible to clearly distinguish text string and Base64 image encoding string. The only way - check if your string is valid Base 64 encoding string. If it is - probably it is an image. If not - you can be sure it is a text.
How to check if string is valid Base 64 you can ere How to check whether the string is base64 encoded or not.

How do text comments in JPG files work?

JPG files can contain text comments via the FF FE marker. I have a few questions about this:
How do I specify the length of the comment? Is it possible to not specify the length at all, if the comment is at the end of the file?
Is it possible to have a valid jpg file without an image that only consists of a comment? How would such a file look like in binary? I'm assuming it would be:
FF D8 - SOI: start of image (note that no frame data follow)
FF D9 - EOI: end of image
FF FE - COM: text comment
(binary) - (text)
JPEG metadata is stored in a tag structure as follows:
0xFF - tag introducer
0xXX - tag value
0xXX 0xXX - tag length in big-endian order including the length of the length (2)
< tag data (length-2 bytes)>
This structure requires that each tag can contain a maximum of 65534 bytes of metadata. For larger structures, a true length value is stored within the tag data and multiple tags contain the entire structure.
An example of a comment tag. It includes a zero terminator, but this is not required.
FF FE 00 08 48 45 4C 4C 4F 00 - "HELLO"
Most JPEG segments contains a 2 byte marker (0xFFFE in the case of COM), followed by the segment length (2 bytes). See JPEG syntax and structure (Wikipedia) for more details. You must specify the length field for the COM marker.
It is valid to have a tables only (only DHT and DQT segments) JPEG, with no image data. I don't think a no tables nor image data one is valid, but at least you don't need the image data. Not sure how useful it is, or how most JPEG software would interpret it...
The use case for a tables only JPEG, is to use it with "abbreviated streams", (JPEG with only image data, no tables), to share common tables between multiple images.