I'm parsing a PDF file. I have decoded all the streams and extracted the text from the text objects and the ToUnicode CMaps, but I don't know when I need to replace the symbols in the text with the symbols (strings) from the ToUnicode CMaps.
When I see something like <01239099>, I use the conversion table and everything is OK. But for some files I also have to use the conversion table when parsing simple text like
[(.&)-2(.K)-5(.D)-8(.S)], and then everything is OK too.
Does anybody know the rule for which symbols need to be replaced using the conversion table?
I have Unicode text (a sequence of Unicode codes) and a TTF font (bytes of a TTF file). I would like to write that text into a PDF file using that font.
I understand PDF quite well. I don't mind using two bytes per character. I would like to attach the TTF file as it is (charcode-to-glyf map should be used from a TTF file).
What font Subtype and Encoding value should I use? Is it possible to avoid having a ToUnicode record?
I tried to use Subtype = "/TrueType", but it requires me to specify FirstChar, LastChar and Widths (which are already inside the TTF).
You cannot use Unicode with a Font, at all (except in the limited case of Latin, or nearly Latin, languages), because Fonts use an Encoding, and an Encoding is a single byte array. So you can't reference more than 256 characters from a Font, and a character code can't be more than a single byte.
The first problem with 'using Unicode' is that Unicode is not a simple 2-byte Encoding. It's a multi-byte format with variable lengths, and sometimes a single glyph is represented by multiple Unicode code points.
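To make the variable-length point concrete, here is a small Python sketch. It has nothing to do with PDF itself; it only counts code points and encoded bytes for a few sample strings (the helper name encoded_lengths is made up for this illustration).

```python
import unicodedata


def encoded_lengths(text):
    """Return the number of code points and encoded bytes for a string."""
    return {
        "code_points": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        "utf16_bytes": len(text.encode("utf-16-be")),
    }


plain_a = "A"
emoji = "\U0001F600"   # outside the BMP: needs a surrogate pair in UTF-16
e_acute = "e\u0301"    # 'e' followed by a combining acute accent: one visible glyph

for sample in (plain_a, emoji, e_acute):
    print(repr(sample), encoded_lengths(sample))

# The two-code-point sequence composes to the single code point U+00E9.
print(unicodedata.normalize("NFC", e_acute) == "\u00e9")
```

So neither "one byte per character" nor "two bytes per character" holds in general, and a glyph does not always correspond to exactly one code point.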
So, in order to deal with this you need to use a CIDFont, not a Font. You cannot 'use the charcode-to-glyf map', by which I assume you mean the CMAP subtable in the TTF font. You must compose the CIDFont with a CMap in order to map the multiple bytes in the text string into the character codes for lookup in the CMap, which gives you the CID to reference the precise character program in the font.
It may be possible to construct a single CMap which would cover every Unicode code point, but I have my doubts; it would certainly be a huge task. However, certain CMaps already exist. Adobe publish a standard list on their web site which includes CMaps such as UniCNS-UCS2-H and UniCNS-UCS2-V or UniGB-UTF8-H etc.
You can probably use one of the standard CMaps.
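As a rough sketch of what the string operand could look like with one of the UCS2 CMaps: assuming the font is a composite (Type0) font whose Encoding is a 2-byte UCS-2 CMap such as UniCNS-UCS2-H, each character code is simply the big-endian 16-bit code of the character. The function name to_ucs2_hex_string is invented for this example, and the sketch only covers the Basic Multilingual Plane.

```python
def to_ucs2_hex_string(text):
    """Encode text as a PDF hex string of 2-byte big-endian character codes.

    Assumption: the current font uses a UCS-2 based CMap (for example
    UniCNS-UCS2-H), so only code points up to U+FFFF can be represented.
    """
    for ch in text:
        if ord(ch) > 0xFFFF:
            raise ValueError("U+%04X is outside the BMP" % ord(ch))
    return "<" + text.encode("utf-16-be").hex().upper() + ">"


# Operand for a Tj operator in a content stream.
print(to_ucs2_hex_string("\u4e2d\u6587") + " Tj")
```

Whether the embedded font actually has glyphs for the CIDs that the chosen CMap produces is a separate question that this sketch does not address.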
Note that it doesn't matter that the FirstChar, LastChar etc are already stored in the TrueType font, you still need to specify them in the PDF Font object. That's because a PDF consumer might not be rendering the text at all, it could (for example) be extracting the text, in which case it doesn't need to interpret the font provided this information is available.
I have multiple PDF files without a ToUnicode CMap table. The absence of the CMap table prevents me from copying text out of the PDF files.
As far as I know, it is possible to add a ToUnicode mapping to a PDF file, but in my case adding static values is not an option, because different files have different glyph codes.
So the question is the following: is there any way to restore the ToUnicode CMap table, perhaps with the help of Ghostscript, or are there any options at all?
Thanks.
No, you cannot add ToUnicode CMaps to an existing PDF file using Ghostscript.
In the general case, you can't do it at all, except manually. As you note in the question, different files will be constructed to use different character code->Glyph mappings, which means that the character code to Unicode mapping will also be different.
Since the character code selection is often based on the order in which glyphs are used in a file (so the first glyph is character code 1, the second is character code 2 etc) you can see that there is no prospect of identifying a 'one size fits all' solution.
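A toy illustration of that order-of-use numbering, in Python. This is not how any particular PDF producer is implemented; assign_subset_codes and encode are hypothetical helpers that only mimic the effect.

```python
def assign_subset_codes(text):
    """Give each distinct character a code in order of first use, starting at 1."""
    codes = {}
    for ch in text:
        if ch not in codes:
            codes[ch] = len(codes) + 1
    return codes


def encode(text, codes):
    """Encode text as single-byte character codes using the table above."""
    return bytes(codes[ch] for ch in text)


hello_codes = assign_subset_codes("Hello")
world_codes = assign_subset_codes("World")

print(encode("Hello", hello_codes).hex())  # codes as used by the first file
print(encode("World", world_codes).hex())  # codes as used by the second file

# The same byte value means different glyphs in the two files.
print([ch for ch, c in hello_codes.items() if c == 3],
      [ch for ch, c in world_codes.items() if c == 3])
```

Code 3 is 'l' in the first file and 'r' in the second, which is why a static mapping cannot work across files.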
You could use some kind of OCR to scan the rendered output, identify each glyph and find the Unicode code point for it. Then you could construct a CMap by identifying the character code for the glyph and mapping it to the Unicode value.
You could then add the ToUnicode CMap to the PDF file as a stream, and update the font dictionary with a ToUnicode entry that references the object number of the CMap.
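If you did manage to work out the character code to Unicode mapping (by OCR or by hand), generating the CMap text itself is the easy part. The following Python sketch writes a ToUnicode CMap from a dictionary. The function name build_tounicode_cmap is made up here, the header is the boilerplate commonly seen in such CMaps rather than something taken from your files, and it assumes fixed-width character codes.

```python
from typing import Dict


def build_tounicode_cmap(mapping: Dict[int, str], code_bytes: int = 1) -> str:
    """Build the text of a ToUnicode CMap from {character code: unicode text}."""
    width = code_bytes * 2  # number of hex digits per character code
    lines = [
        "/CIDInit /ProcSet findresource begin",
        "12 dict begin",
        "begincmap",
        "/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def",
        "/CMapName /Adobe-Identity-UCS def",
        "/CMapType 2 def",
        "1 begincodespacerange",
        "<%s> <%s>" % ("00" * code_bytes, "FF" * code_bytes),
        "endcodespacerange",
    ]
    items = sorted(mapping.items())
    # A single bfchar section may hold at most 100 entries.
    for start in range(0, len(items), 100):
        chunk = items[start:start + 100]
        lines.append("%d beginbfchar" % len(chunk))
        for code, text in chunk:
            dst = text.encode("utf-16-be").hex().upper()
            lines.append(f"<{code:0{width}X}> <{dst}>")
        lines.append("endbfchar")
    lines += [
        "endcmap",
        "CMapName currentdict /CMap defineresource pop",
        "end",
        "end",
    ]
    return "\n".join(lines) + "\n"


print(build_tounicode_cmap({1: "H", 2: "e", 3: "l", 4: "o"}))
```

The resulting text still has to be added to the PDF as a stream object and referenced from the font, which requires a PDF library or manual editing; that part is not shown.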
Ghostscript won't do any of that for you, and I haven't heard of any tool which will.
Situation: I have a PDF using version 1.6. In that PDF, there are several streams. The streams contained compressed (Flate) text, so I decompressed them. After that, I extracted the Tj parts of the corresponding decompressed streams. I assumed that there would be readable text between the brackets before the Tj command, but the result was the following:
Actual Question: As I have no idea what I've got there, I would like to know what type of content it is. Furthermore: Is it possible to get a plain text out of these strings or do I need further information to extract plain texts?
Further research: The PDFs which I am trying to analyze were generated by iTextSharp (which seems to be a C# library for generating PDFs). I don't know whether this is relevant information, but it might be that the library uses a special way of encrypting its text data or something...
I assumed that there would be readable text between the brackets before the Tj command
This assumption only holds for simple PDFs.
To quote from the PDF specification (ISO 32000-1):
A string operand of a text-showing operator shall be interpreted as a sequence of character codes identifying the glyphs to be painted.
With a simple font, each byte of the string shall be treated as a separate character code. The character code shall then be looked up in the font’s encoding to select the glyph, as described in 9.6.6, "Character Encoding".
With a composite font (PDF 1.2), multiple-byte codes may be used to select glyphs. In this instance, one or more consecutive bytes of the string shall be treated as a single character code. The code lengths and the mappings from codes to glyphs are defined in a data structure called a CMap, described in 9.7, "Composite Fonts".
(Section 9.4.3 - Text-Showing Operators - ISO 32000-1)
Thus,
I would like to know what type of content it is.
As quoted above, these "strings" consist of single-byte or multi-byte character codes. These codes depend on the current font's encoding. Each font object in a PDF can have a different encoding.
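To illustrate the difference, here is a minimal Python sketch that splits the bytes of a string operand into character codes. It assumes fixed-width codes (one byte for a simple font, two bytes for a composite font with a CMap such as Identity-H); a real CMap may define codes of varying length through its codespace ranges, which this sketch ignores. The name split_character_codes is invented for the example.

```python
from typing import List


def split_character_codes(data: bytes, bytes_per_code: int) -> List[int]:
    """Split string bytes into character codes of a fixed width."""
    if len(data) % bytes_per_code:
        raise ValueError("string length is not a multiple of the code width")
    return [
        int.from_bytes(data[i:i + bytes_per_code], "big")
        for i in range(0, len(data), bytes_per_code)
    ]


raw = bytes.fromhex("01239099")          # the bytes of the hex string <01239099>
print(split_character_codes(raw, 1))     # simple font: four codes
print(split_character_codes(raw, 2))     # composite font, 2-byte codes: two codes
```

The same bytes therefore mean different things depending on which font is current when the string is shown.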
Those encodings may be some standard encoding (MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding) or some custom encoding. In particular in case of embedded font subsets you often find encodings where 1 is the code of the first glyph drawn on a page, 2 is the code for the second, different glyph, 3 for the third, different one, etc.
Furthermore: Is it possible to get a plain text out of these strings or do I need further information to extract plain texts?
As the encoding of the string arguments of text showing instructions depends on the current font, you at least need to keep track of the current font name (Tf instruction) and look up encoding information (Encoding or ToUnicode map) from the current font object.
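A heavily simplified Python sketch of that bookkeeping follows. It assumes the content stream has already been tokenized into (operator, operand) pairs and that each font comes with a ToUnicode CMap; parse_tounicode and extract are hypothetical helpers, not part of any library. The parser only handles bfchar entries and the simple form of bfrange with a single BMP destination, not the array form.

```python
import re

HEX = r"<([0-9A-Fa-f]+)>"


def parse_tounicode(cmap_text):
    """Return {character code: unicode text} from a ToUnicode CMap."""
    mapping = {}
    for block in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(HEX + r"\s*" + HEX, block):
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    for block in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.S):
        for lo, hi, dst in re.findall(HEX + r"\s*" + HEX + r"\s*" + HEX, block):
            first = int(dst, 16)  # assumes a single BMP code point
            for offset, code in enumerate(range(int(lo, 16), int(hi, 16) + 1)):
                mapping[code] = chr(first + offset)
    return mapping


def extract(operations, fonts):
    """Decode Tj strings using whichever font the last Tf selected.

    fonts maps a font resource name to (bytes per code, ToUnicode table).
    """
    current = None
    result = []
    for operator, operand in operations:
        if operator == "Tf":
            current = fonts[operand]
        elif operator == "Tj":
            width, table = current
            codes = [int.from_bytes(operand[i:i + width], "big")
                     for i in range(0, len(operand), width)]
            result.append("".join(table.get(c, "\ufffd") for c in codes))
    return result


CMAP = """
2 beginbfchar
<01> <0048>
<02> <0069>
endbfchar
1 beginbfrange
<03> <05> <0041>
endbfrange
"""

fonts = {"F1": (1, parse_tounicode(CMAP))}
ops = [("Tf", "F1"), ("Tj", b"\x01\x02"), ("Tj", b"\x03\x04\x05")]
print(extract(ops, fonts))
```

Note that the lookup is applied to every string shown with that font, whether it was written in the file as a hex string or as a parenthesized literal string; the notation does not matter, only the font does.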
Section 9.10 - Extraction of Text Content - of ISO 32000-1 explains this in some more detail.
Furthermore, the order of the text showing instructions need not be the order of reading. The word "Hello" can e.g. be shown by first drawing the 'o', then going left, then the 'el', then again left, then the 'H', then going right, and finally the remaining 'l'. And two words need not be separated by a space glyph, there simply might be a text positioning instruction going right a bit.
Thus, in general you also have to keep track of the position of the separate strings drawn.
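As a sketch of what that position tracking buys you, the following Python function reorders already-decoded fragments by their coordinates. It assumes each fragment is given as (x, y, text) in PDF user space, where y grows upwards, and that fragments within a small vertical tolerance belong to the same line; the name reading_order and the tolerance value are arbitrary choices for this illustration. It does not try to insert spaces for horizontal gaps.

```python
def reading_order(fragments, line_tolerance=2.0):
    """Group (x, y, text) fragments into lines, top to bottom, left to right."""
    ordered = sorted(fragments, key=lambda f: (-f[1], f[0]))
    lines = []  # list of (y of the line, [(x, text), ...])
    for x, y, text in ordered:
        if lines and abs(lines[-1][0] - y) <= line_tolerance:
            lines[-1][1].append((x, text))
        else:
            lines.append((y, [(x, text)]))
    return ["".join(text for _, text in sorted(parts)) for _, parts in lines]


# "Hello" drawn out of order, as in the example above, plus a second line.
fragments = [
    (40, 700, "o"),
    (10, 700, "el"),
    (0, 700, "H"),
    (30, 700, "l"),
    (0, 680, "World"),
]
print(reading_order(fragments))
```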
I have a PostScript file which contains a Type 3 font. After converting that PostScript file to PDF using the "gs" command, I am unable to extract the text from the PDF file. Is there any way to change the Type 3 fonts to some other font, by substitution or some other means, so that I can copy the text?
This is another case of miscomprehension regarding type 3 fonts. The fact that a font is a type 3 font has little to do with whether a PostScript program or PDF file using the font is 'searchable' or not.
Fonts in PostScript and PDF have an 'Encoding' which maps the character codes 0-255 to a named procedure in the font. Executing that procedure draws the glyph. The character codes can be anything, but are often (for Latin fonts) chosen to match the ASCII encoding.
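To show what such an Encoding looks like in practice, here is a small Python sketch that applies a PDF-style Differences array (a code followed by one or more glyph names, each name taking the next code) on top of a base encoding. The base table is a three-entry excerpt and the glyph names in the example are arbitrary; apply_differences is a made-up helper name.

```python
def apply_differences(base, differences):
    """Return a copy of base ({code: glyph name}) with a Differences array applied."""
    encoding = dict(base)
    code = None
    for item in differences:
        if isinstance(item, int):
            code = item          # a number restarts the running code
        else:
            if code is None:
                raise ValueError("glyph name before any starting code")
            encoding[code] = item
            code += 1            # following names take consecutive codes
    return encoding


base = {65: "A", 66: "B", 67: "C"}   # excerpt of an ASCII-like base encoding
custom = apply_differences(base, [65, "alpha", "beta", 128, "Euro"])
print(custom)
```

After this, character code 65 no longer selects the glyph named 'A', so reading the codes as ASCII would give the wrong text.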
PDF has the additional concept of a ToUnicode CMap, additional information which maps a character code in a font to a Unicode code point. PostScript has no analogue; that's not what PostScript is for (it's also not what PDF was originally for, which is why ToUnicode CMaps are a later addition to the PDF standard).
In the absence of a ToUnicode CMap Acrobat uses undocumented heuristics to try and guess what the text is. The obvious one (and the only one we know of) is that it treats the character codes as ASCII.
Now, if your original PostScript program has an encoding that maps the character codes as if they were ASCII, then provided you do not subset the font, the resulting PDF file should also contain ASCII character codes. If you do subset the font then the pdfwrite device will reorder the glyphs and the character codes will no longer be ASCII.
If your original PostScript file does not order the glyphs in the font using ASCII character codes then there is nothing you can do other than apply OCR; the information simply is not present.
But forget about altering the font type, not only is it not likely to be possible, it isn't the problem.
PDF content is saved in several ways: "(abc) Tj", "(<0035><0035>) Tj" or "\u065".
I want to know if there is a way to convert the PDF code to one type, no matter whether it is direct text "(abc) Tj", hexadecimal "(<0035><0035>) Tj", or octal "\u065".
I think that if I convert and encode the PDF to one type, it will be easier to analyse the content.
Is it possible to use Ghostscript or something to do that? Thanks
Essentially, no, there is no way to do so. There are two kinds of string, regular strings '(' and ')' delimited, and hex strings '<' and '>' delimited. Hex strings need not be escaped whereas regular text strings do need to be for 'special' characters, like carriage return and linefeed. Octal is also permitted in regular strings.
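That said, if the goal is only analysis, you can normalize the two notations yourself when reading: both decode to the same thing, a sequence of bytes. A minimal Python sketch follows, assuming you pass in the body between the delimiters (without the outer parentheses or angle brackets). The function names are invented, and the sketch does not normalize unescaped end-of-line markers inside literal strings.

```python
_ESCAPES = {
    ord("n"): 0x0A, ord("r"): 0x0D, ord("t"): 0x09, ord("b"): 0x08,
    ord("f"): 0x0C, ord("("): 0x28, ord(")"): 0x29, ord("\\"): 0x5C,
}
_OCTAL = b"01234567"


def decode_hex_string(body):
    """Decode the body of a <...> hex string; whitespace is ignored."""
    digits = b"".join(body.split()).decode("ascii")
    if len(digits) % 2:
        digits += "0"            # a missing final digit is taken as 0
    return bytes.fromhex(digits)


def decode_literal_string(body):
    """Decode the body of a (...) literal string, resolving backslash escapes."""
    out = bytearray()
    i = 0
    while i < len(body):
        byte = body[i]
        i += 1
        if byte != 0x5C:         # not a backslash: copy through
            out.append(byte)
            continue
        if i >= len(body):
            break
        esc = body[i]
        if esc in _ESCAPES:
            out.append(_ESCAPES[esc])
            i += 1
        elif esc in _OCTAL:      # one to three octal digits
            j = i
            while j < min(i + 3, len(body)) and body[j] in _OCTAL:
                j += 1
            out.append(int(body[i:j].decode("ascii"), 8) & 0xFF)
            i = j
        elif esc in b"\r\n":     # backslash + end of line: line continuation
            i += 1
            if esc == 0x0D and i < len(body) and body[i] == 0x0A:
                i += 1
        else:                    # unknown escape: the backslash is dropped
            out.append(esc)
            i += 1
    return bytes(out)


print(decode_literal_string(rb"abc"))
print(decode_literal_string(rb"\141\142\143"))
print(decode_hex_string(b"61 62 63"))
```

All three calls yield the same bytes. What those bytes mean as text is still determined by the font in use, so this normalization is only the first step.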
PDF producers are free to mix and match these all they like, but in general a given PDF producer will usually use one technique throughout.
Because Ghostscript's pdfwrite device is a PDF producer, it will (I believe) generally produce all its output the same way.
What it won't do is 'convert' your original PDF file. It produces a brand new PDF file which should look visually identical but whose internals bear no resemblance to your original PDF. In addition some metadata or fidelity may be lost.