Where are named PDF characters like "f_f", "uni00D0" and "a204" defined?

I'm trying to read the official pdf specification "Document management — Portable document format — Part 1: PDF 1.7" (PDF32000_2008.pdf) as bytes and then interpret them according to that specification.
In Annex D, Character Sets and Encodings, there is a list of all named characters (example excerpts from that list were attached as images).
When I parse PDF32000_2008.pdf itself, there are also named characters like "f_f", "uni00D0" and "a204", which are missing from that specification.
My guess is that "f_f" is a symbol for two 'f' characters, which might be printed with a special glyph. There is a Unicode character "Latin Small Ligature Ff" for 'ff'.
For example, there is also "f_i" in that file, which I expect to mean 'fi': one glyph showing the two characters 'f' and 'i'. However, the PDF specification already has 'fi' as the named character "fi", and what would be the point of having two named characters for the same symbol?
I can imagine that "uni00D0" means the Unicode character 'Ð'. However, PDF already defines it as the named character "Eth".
What could "a204" be? Maybe ANSI 204, 'Ì', which already has the named character "Igrave"?
And why would they also use "a62", which would be just '<'?
However, my main question is: where can I find a specification for these additional named characters?
Of course, Adobe Acrobat understands them, and Gmail also seems to have no problem with them. So I guess their meaning must be specified somewhere.
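To make these guesses concrete, here is a small Python sketch (my own illustration of the hypotheses above, not taken from any specification) that interprets "uniXXXX" as the Unicode code point U+XXXX and "x_y" as the character sequence x, y:

# A sketch encoding only the hypotheses from the question; it is not
# a complete or authoritative glyph-name resolver.
def guess_glyph_text(name: str) -> str | None:
    if name.startswith("uni") and len(name) == 7:
        return chr(int(name[3:], 16))  # "uni00D0" -> 'Ð'
    if "_" in name:
        # "f_f" -> "ff", "f_i" -> "fi": components joined in order
        return "".join(guess_glyph_text(part) or part
                       for part in name.split("_"))
    if len(name) == 1:
        return name  # single-letter names map to themselves
    return None  # names like "a204" would need an external table

for n in ["f_f", "f_i", "uni00D0", "a204"]:
    print(n, "->", guess_glyph_text(n))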

Related

asking for the Unicode of letter conjunctions

I occasionally encounter special characters while parsing PDF documents. They are actually two English letters, like 'fi', 'tt', or 'ti', but visually they look conjoined, and they actually exist in the PDF string as one character.
I checked the 'ToUnicode' CMap for these characters, but I found the 'ToUnicode' CMap tables are corrupted, so I cannot find their Unicode values.
For example, <012E> Tj will print fi as in the attached picture. But its corresponding font's ToUnicode CMap has <012E> <0001>, which is meaningless.
Could anybody tell me their Unicode code points? Is it possible to find them in the corresponding font program?
Thanks for any advice.
(Attached images showed the rendered glyphs for fi, tt, and ti.)
First of all, what you call letter conjunctions is usually known as ligatures, so I will use that term from now on.
Unicode discourages the use of specific code points for ligatures:
The existing ligatures exist basically for compatibility and round-tripping with non-Unicode character sets. Their use is discouraged. No more will be encoded in any circumstances.
Ligaturing is a behavior encoded in fonts: if a modern font is asked to display “h” followed by “r”, and the font has an “hr” ligature in it, it can display the ligature. Some fonts have no ligatures, while others (especially fonts for non-Latin scripts) have hundreds of ligatures. It does not make sense to assign Unicode code points to all these font-specific possibilities.
(Unicode FAQ on ligatures)
Thus, you should not use the existing ligature code points.
You appear to be trying to find the correct ToUnicode mapping for ligature glyphs. For this, simply remember that the values of ToUnicode mappings need not be single code points but may be multiple ones:
n beginbfchar
srcCode dstString
endbfchar
where dstString may be a string of up to 512 bytes.
(ISO 32000-1, section 9.10.3 ToUnicode CMaps)
Concerning your example, therefore:
For example, <012E> Tj will print fi as in the attached picture. But its corresponding font's ToUnicode CMap has <012E> <0001>, which is meaningless.
Simply use
<012E> <00660069>
If you want to use ligature code points nonetheless, consult the Wikipedia article on orthographic ligatures; it lists some ligature code points, in particular <FB01> for fi. So for your example:
<012E> <FB01>
But remember, their use is discouraged.
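As an illustration in plain Python (no PDF library; the hex values are the ones from this example): a dstString is a UTF-16BE byte sequence, so one bfchar entry can map a single character code to several Unicode code points:

# Decode the dstString of a ToUnicode bfchar entry. dstStrings are
# UTF-16BE, so one entry may expand to multiple Unicode code points.
def decode_dststring(hex_digits: str) -> str:
    return bytes.fromhex(hex_digits).decode("utf-16-be")

print(decode_dststring("00660069"))  # <012E> <00660069> -> 'fi' (f + i)
print(decode_dststring("FB01"))      # <012E> <FB01> -> 'ﬁ' (discouraged)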

how to render a DICOM file's header unreadable

Kind of a strange question, but I'm doing some testing of error handling for when a DICOM file's tags can't be read.
Unfortunately, I don't have a damaged DICOM file available.
Specifically, can anyone advise how to apply some sort of incorrectly encoded text tag or invalid numeric data tag to the file, such that it can't be read by Python's pydicom package?
You could have a look at the dcmodify tool from the DCMTK. It can be used to insert, modify and delete attributes. I doubt that it is possible to specify invalid attribute values through the command line, but you could surely modify the source code to accomplish that (although you can definitely write attribute values that exceed the maximum length allowed by the Value Representation).
My approach would be to create a buffer of characters and write binary data to it, then pass it to the method that writes the value to the attribute (see the sketch after the list below).
Examples:
write UTF-8 byte sequences that do not form a valid Unicode character
write ASCII characters that are not covered by the character set specified by (0008,0005) - not sure whether pydicom would run into problems, but it would be wrong from the DICOM perspective
write non-numeric characters to attributes with Value Representation "Decimal String" or "Integer String".
formats other than YYYYMMDD for VR "Date"
formats other than HHMMSS.FFFFFF for VR "Time"
other characters than ['0'-'9'], '.' for VR "Unique Identifier"
[edit]: DCMTK, dcmodify: http://dicom.offis.de/dcmtk.php.en
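If you'd rather stay in Python, here is a minimal sketch of the byte-buffer idea (file names are hypothetical; it assumes 'ok.dcm' is a valid DICOM file and pydicom is installed). It flips a run of bytes after the preamble and the "DICM" magic number so that the element stream can no longer be parsed:

# Corrupt a copy of a valid DICOM file for error-handling tests.
from pathlib import Path
import pydicom

data = bytearray(Path("ok.dcm").read_bytes())

# Bytes 0-127 are the preamble, bytes 128-131 the "DICM" magic number.
# Damage bytes after that, so the file is still recognized as DICOM
# but its data elements become unreadable.
for i in range(140, 200):
    data[i] ^= 0xFF  # invert a run of bytes in the element stream

Path("broken.dcm").write_bytes(data)

try:
    pydicom.dcmread("broken.dcm")
except Exception as exc:  # deliberately broad: this is test scaffolding
    print("pydicom failed as intended:", exc)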

PDF extracted text seems to be unreadable

Situation: I have a PDF using version 1.6. In that PDF, there are several streams containing Flate-compressed text, so I decompressed these streams. After that, I extracted the Tj parts of the decompressed streams. I assumed that there would be readable text between the brackets before the Tj command, but the result was unreadable.
Actual Question: As I have no idea what I've got there, I would like to know what type of content it is. Furthermore: is it possible to get plain text out of these strings, or do I need further information to extract plain text?
Further research: The PDFs I am trying to analyze were generated by iTextSharp (apparently a C# library for generating PDFs). I don't know whether this is relevant, but that library might use a special way of encrypting its text data or something...
I assumed that there would be readable text between the brackets before the Tj command
This assumption only holds for simple PDFs.
To quote from the PDF specification (ISO 32000-1):
A string operand of a text-showing operator shall be interpreted as a sequence of character codes identifying the glyphs to be painted.
With a simple font, each byte of the string shall be treated as a separate character code. The character code shall then be looked up in the font’s encoding to select the glyph, as described in 9.6.6, "Character Encoding".
With a composite font (PDF 1.2), multiple-byte codes may be used to select glyphs. In this instance, one or more consecutive bytes of the string shall be treated as a single character code. The code lengths and the mappings from codes to glyphs are defined in a data structure called a CMap, described in 9.7, "Composite Fonts".
(Section 9.4.3 - Text-Showing Operators - ISO 32000-1)
Thus,
I would like to know what type of content it is.
As quoted above, these "strings" consist of single-byte or multi-byte character codes. These codes depend on the current font's encoding. Each font object in a PDF can have a different encoding.
Those encodings may be some standard encoding (MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding) or some custom encoding. In particular, in the case of embedded font subsets, you often find encodings where 1 is the code of the first glyph drawn on a page, 2 is the code for the second, different glyph, 3 for the third, different one, and so on.
Furthermore: is it possible to get plain text out of these strings, or do I need further information to extract plain text?
As the encoding of the string arguments of text-showing instructions depends on the current font, you at least need to keep track of the current font name (Tf instruction) and look up encoding information (the Encoding or ToUnicode map) in the current font object.
Section 9.10 - Extraction of Text Content - of ISO 32000-1 explains this in some more detail.
Furthermore, the order of the text-showing instructions need not be the reading order. The word "Hello" can, e.g., be shown by first drawing the 'o', then going left, then the 'el', then again left, then the 'H', then going right, and finally the remaining 'l'. And two words need not be separated by a space glyph; there might simply be a text-positioning instruction going right a bit.
Thus, in general you also have to keep track of the positions of the separate strings drawn.
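To illustrate the code-versus-text distinction with a toy sketch (the mapping table is made up, mimicking the sequential subset encodings described above):

# Raw Tj string bytes are character codes, not text; their meaning
# comes from the current font's encoding / ToUnicode map. This table
# mimics a subset font whose codes were assigned in order of first use.
to_unicode = {0x01: "H", 0x02: "e", 0x03: "l", 0x04: "o"}

def decode_tj(raw: bytes, cmap: dict[int, str]) -> str:
    return "".join(cmap.get(b, "\ufffd") for b in raw)

print(decode_tj(b"\x01\x02\x03\x03\x04", to_unicode))  # -> Hello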

is there a regex for PDF font-family and font-names

I am working with FontNames from PDF documents and wish to convert them to one of the 14 standard fonts if possible. An example is:
KAIKCD+Helvetica-Oblique
Are there standards for the punctuation and values of the prefixes and suffixes? I have found -Oblique, .I, and -Italic as suffixes, all - presumably - meaning italic. Or are the names semantically void?
The prefix is as defined in the PDF specification:
For a font subset, the PostScript name of the font — the value of the font's BaseFont entry and the font descriptor's FontName entry — shall begin with a tag followed by a plus sign (+). The tag shall consist of exactly six uppercase letters; the choice of letters is arbitrary, but different subsets in the same PDF file shall have different tags.
Otherwise there are merely some common naming patterns; Jon Tan has tried to catalogue some here.
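For what it's worth, here is a small Python sketch that strips the subset tag defined by the specification (exactly six uppercase letters followed by '+') and then an optional style suffix; the suffix list is an assumption based on commonly seen names, not part of any standard:

import re

SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")
STYLE_SUFFIX = re.compile(r"[-,.](Oblique|Italic|BoldItalic|Bold|BI|B|I)$")

def base_font_family(name: str) -> str:
    name = SUBSET_TAG.sub("", name)    # "KAIKCD+Helvetica-Oblique" -> "Helvetica-Oblique"
    return STYLE_SUFFIX.sub("", name)  # "Helvetica-Oblique" -> "Helvetica"

print(base_font_family("KAIKCD+Helvetica-Oblique"))  # Helvetica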

Is there a field in which PDF files specify their encoding?

I understand that it is impossible to determine the character encoding of any string-form data just by looking at the data. This is not my question.
My question is: Is there a field in a PDF file where, by convention, the encoding scheme is specified (e.g.: UTF-8)? This would be something roughly analogous to <html> <head> <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> in HTML.
A quick look at the PDF specification seems to suggest that you can have different encodings inside a PDF file. Have a look at page 86. So a PDF library with some kind of low-level access should be able to provide you with the encoding used for a string. But if you just want the text and don't care about the internal encodings used, I would suggest letting the library take care of the conversions for you.
PDF uses "named" characters, in the sense that a character is a name and not a numeric code. Character "a" has name "a", character "2" has name "two" and the euro sign has name "euro", to give a few examples. PDF defines a few "standard" "base" encodings (named "WinAnsiEncoding", "MacRomanEncoding" and a few more, can't remember exactly), an encoding being a one-to-one correspondence between character names and byte values (yes, only 0 to 255). The exact, normative values for these predefined encodings are in the PDF specification. All these encodings use the ASCII values for the US-ASCII characters, but they differ in higher byte values.
A PDF file may define new encodings by taking a "base" encoding (say, WinAnsiEncoding) and redefining a few bytes, so a PDF author may, for example, define a new encoding named "MySuperbEncoding" as WinAnsiEncoding but with byte value 65 changed to mean character "ntilde" (this definition goes inside the PDF file), and then specify that some strings in the file use the encoding "MySuperbEncoding". In this case, a string containing the byte values 65-66-67 would mean the characters "ñBC" and not "ABC". And note that I mean characters, nothing to do with glyphs or fonts. Different strings within the PDF file may use different encodings (this provides a way of using more than 256 characters in the PDF file, even though every string is defined as a byte sequence, and one byte always corresponds to one character).
So, the answer to your question is: characters within a PDF file may well be encoded internally in an ad-hoc encoding made on the spot for that specific PDF file. PDF parsers should make the appropriate substitutions when necessary. I do not know PDFMiner, but I'm surprised that it (being a PDF parser) gives incorrect values, as the specification is very clear on how this must be interpreted. It IS possible to get all the necessary information from the PDF file but, as Mattias said, it might be a large project, and I think a program named PDFMiner should do exactly this kind of job.
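To make the "MySuperbEncoding" example concrete, here is a toy Python sketch (the tables are tiny stand-ins, not the real WinAnsiEncoding):

# Start from a base encoding, apply a /Differences-style override,
# then decode a byte string through the resulting table.
base_encoding = {65: "A", 66: "B", 67: "C"}  # fragment of a base encoding

my_superb_encoding = dict(base_encoding)
my_superb_encoding[65] = "ñ"  # byte 65 now means "ntilde"

def decode(data: bytes, encoding: dict[int, str]) -> str:
    return "".join(encoding.get(b, "\ufffd") for b in data)

print(decode(b"ABC", base_encoding))       # ABC
print(decode(b"ABC", my_superb_encoding))  # ñBC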