Adding encoding in PostScript: Ghostscript renders the text correctly, but converting to PDF does not show the characters

We need to construct a PostScript file that contains Arabic text as well as English text.
Ghostscript displays the Arabic text correctly, but converting the file to PDF does not show the Arabic letters.
The PS file contains the following:
% Re-encode the font: copy the font dictionary, install a modified
% Encoding, and define the result under a new name.
/TraditionalArabic findfont dup length dict copy begin
/Encoding Encoding 256 array copy def
Encoding 1 /kafinitialarabic put
Encoding 2 /behinitialarabic put
Encoding 3 /yehmedialarabic put
Encoding 4 /seenfinalarabic put
Encoding 5 /eacute put
Encoding 6 /a put
/ArabicTradDict currentdict definefont pop
end
%%Page: 1 1
%%BeginPageSetup
%%PageMedia: Color Weight Type
% xx converts millimetres to points (1 mm = 2.83464567 pt); it is
% defined as a side effect while the setpagedevice dictionary is built.
<< /MediaColor (Blue) /MediaWeight 75 /MediaType ()
   /xx {2.83464567 mul} def
   /PageSize [240 xx 345 xx] >> setpagedevice
%%EndPageSetup
/ArabicTradDict 18 selectfont
72 xx 300 xx moveto
(\004\003\002\001) show   % character codes 4, 3, 2, 1 from the Encoding above
showpage
To run Ghostscript from the command line so that it picks up all the Windows fonts:
gswin64.exe -sFONTPATH=%windir%/fonts -dEmbedAllFonts=true
To convert the PS file to a PDF file, run the following command:
gswin64.exe -dBATCH -dNOPAUSE -sOutputFile=c:/Users/mob/Desktop/TimesNewRomanPSMT.pdf -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress -dCompressFonts=false -dSubsetFonts=false -sFONTPATH=%windir%/fonts -dEmbedAllFonts=true -f c:/Users/mob/Desktop/TimesNewRomanPSMT.ps
When converting to PDF, the Arabic characters do not display correctly; they show up as meaningless squares.
If I use an Adobe tool to convert to PDF, the resulting PDF is the same, except that the eacute (code 005), if included in the PS file, does show after conversion, whereas with the command line above none of the characters added through the Encoding is shown correctly.
Any help with that?

Thanks to KenS's hints I was able to solve my problem. The encoding used wrong character names like kafinitialarabic (wrong in the sense that PDF could not understand them); every name ending in "arabic" was wrong, because the Traditional Arabic font does not use those names for its characters. To find out which names it actually understands, I converted the TTF font to AFM and PFA using the following command, which turns the TrueType font into a Type 42 font that is understood once embedded in the PostScript file during conversion to PDF:
C:\Program Files\gs\gs9.10\bin>gswin64c.exe -dNODISPLAY -q -- ttf2pf.ps times timesPS timesAFM
where times is the TTF font name. I then checked the generated PFA file for the characters I wanted to add: instead of kafinitialarabic there was kafinitial, for kafmedialarabic there was kafmedial, and so on.
Adding those names to the Encoding now works fine, but instead of adding all those characters to a dictionary I would like to use the font the way we normally use one with setfont in PostScript, if that is possible.
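One idea I want to test, sketched below: PostScript's glyphshow operator draws a glyph by its name directly, bypassing the Encoding altogether, so no custom dictionary is needed. The glyph names are the corrected ones found above; I have not verified how well pdfwrite preserves text produced this way.

/TraditionalArabic 18 selectfont
72 300 moveto
% Draw the word glyph by glyph, by name (in visual order):
/kafinitial glyphshow
/behinitial glyphshow
/yehmedial glyphshow
/seenfinal glyphshow
showpage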

As already suggested, you need to ensure the glyph names you use are in the font you use, or create a new font.
I haven't found anything that will choose the correct glyph from the set of initial, medial, final, isolated, depending on context, though.
I resorted to writing a program which takes Unicode Arabic, reverses the Arabic characters, and then decides which form of each character to use based on its position in the word and on whether the previous or next characters are forced into isolated or final forms. Unfortunately I had to embed quite a lot of intrinsic knowledge about the font in use and the glyph names it has (as well as the typos in them) into the program.
If that's of interest, I've put it on GitHub, but it's very raw and early.
It does work, though.
https://github.com/gbjk/arabic2ps
The font I used was a Traditional Arabic font with quite a few idiosyncrasies.

Related

How can I properly create multilingual metadata in pdftk

pdftk lets you set the title of a PDF with the following command:
pdftk input.pdf update_info metadata.txt output output.pdf
However, if I use special characters in the metadata.txt file (such as German or Chinese characters), it doesn't seem to work.
Here's an example of changing the title:
InfoBegin
InfoKey: Title
InfoValue: Fingerspitzengefühl is a German term.
However, the PDF ends up with a strange character in place of the ü.
The pdftk documentation says that non-ASCII characters should be encoded as XML numerical entities. However, I Googled myself silly and couldn't find anything that works.
The best reference I've found is the Numerical Character Reference, which applies to XML (and XHTML and SGML).
It is generally used to represent characters that are not directly encodable.
In your case the character is U+00FC (ü, decimal 252), which can be written as &#252; (decimal) or &#xFC; (hexadecimal).
Using a decimal reference, your file should be encoded as:
InfoBegin
InfoKey: Title
InfoValue: Fingerspitzengef&#252;hl is a German term.
Note:
If you're on *nix, you can use recode to encode the file.
% cat metadata.txt | recode ..xml
This answer seems better, as there is no need to install extra tools. Instead it uses pdftk's built-in flags dump_data_utf8 and update_info_utf8:
pdftk input.pdf update_info_utf8 metadata.txt output output.pdf
It works perfectly for Chinese.
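A sketch of the full round trip with the UTF-8 variants (file names are illustrative):
pdftk input.pdf dump_data_utf8 output metadata.txt
# edit metadata.txt (keep it UTF-8), e.g. set InfoValue: Fingerspitzengefühl ...
pdftk input.pdf update_info_utf8 metadata.txt output output.pdf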

Ghostscript mangles umlauts when reading PDFs

I use this on Linux
gs -dBATCH -dNOPAUSE -sDEVICE=txtwrite -o res.txt 1.pdf
when extracting text from a few hundred PDFs. However, umlauts and other special characters up to code 255 get mangled. Any ideas?
cf https://archive.org/download/bnmm_gmx_1/1.pdf (contains two "ä")
Partial translation table (the last one, and all the other special letters of the Turkish alphabet, are mangled into non-printable characters, otherwise I could help myself):
ä = À¤
é = À©
ç = À§
Looks like it ought to work, as the fonts have a ToUnicode CMap. I'd suggest you open a bug report.
Note that you are not actually dealing with ASCII: the embedded and subset fonts are CIDFonts, and the CMap in use creates 2-byte character codes (though, ridiculously, all the high bytes are 0). For example, the space is actually encoded as character code 0x0003, the '0' as code 0x0013, and so on.
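To illustrate (the /F1 resource name is made up), a text run in such a content stream carries hex-encoded 2-byte codes rather than ASCII:
BT
  /F1 12 Tf
  <001300030013> Tj   % '0', space, '0' using the codes quoted above
ET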
By the way, a simple example would be useful; it's quite hard to pick out the accented glyphs from the regular text in this file.

How can I change type 3 font using ghostscript?

I have a PostScript file which contains a Type 3 font. After converting that PostScript to PDF with the gs command, I am unable to extract the text from the PDF file. Is there any way to change the Type 3 font to some other font, by substitution or otherwise, so that I can copy the text?
This is another case of misunderstanding regarding Type 3 fonts. The fact that a font is a Type 3 font has little to do with whether a PostScript program or PDF file using the font is 'searchable' or not.
Fonts in PostScript and PDF have an Encoding which maps the character codes 0-255 to a named procedure in the font; executing that procedure draws the glyph. The character codes can be anything, but for Latin fonts they are often chosen to match the ASCII encoding.
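As a concrete illustration, here is a minimal sketch of a Type 3 font (the name /DemoT3 and the box-drawing glyph are made up) whose Encoding lines up with ASCII, so that character code 65 is 'A':
8 dict begin
  /FontType 3 def
  /FontMatrix [0.001 0 0 0.001 0 0] def
  /FontBBox [0 0 750 700] def
  /Encoding 256 array def
  0 1 255 { Encoding exch /.notdef put } for
  Encoding 65 /A put                % code 65 (ASCII 'A') -> glyph /A
  /BuildGlyph {                     % stack: font-dict glyph-name
    pop pop                         % demo: draw the same box for any glyph
    750 0 setcharwidth
    0 0 moveto 750 0 lineto 750 700 lineto 0 700 lineto closepath fill
  } def
currentdict end
/DemoT3 exch definefont pop

/DemoT3 24 selectfont 72 72 moveto (A) show showpage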
PDF has the additional concept of a ToUnicode CMap: extra information which maps a character code in a font to a Unicode code point. PostScript has no such analogue; that's not what PostScript is for (it's also not what PDF was originally for, which is why ToUnicode CMaps are a later addition to the PDF standard).
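For reference, a ToUnicode CMap is itself a small program in CMap (PostScript) syntax embedded in the PDF. A minimal sketch mapping the single character code 1 to U+0041 ('A') looks like this (the CMapName is made up):
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo << /Registry (Demo) /Ordering (UCS) /Supplement 0 >> def
/CMapName /Demo-ToUnicode def
/CMapType 2 def
1 begincodespacerange
<00> <FF>
endcodespacerange
1 beginbfchar
<01> <0041>   % character code 1 -> U+0041 'A'
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end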
In the absence of a ToUnicode CMap, Acrobat uses undocumented heuristics to try to guess what the text is. The obvious one (and the only one we know of) is that it treats the character codes as ASCII.
Now, if your original PostScript program has an encoding that maps the character codes as if they were ASCII, then, provided you do not subset the font, the resulting PDF file should also contain ASCII character codes. If you do subset the font, the pdfwrite device will reorder the glyphs and the character codes will no longer be ASCII.
If your original PostScript file does not order the glyphs in the font using ASCII character codes, then there is nothing you can do other than apply OCR; the information simply is not present.
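In practice that means converting with font subsetting disabled, using the same flag that appears in the first question above, e.g.:
gs -o out.pdf -sDEVICE=pdfwrite -dSubsetFonts=false input.ps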
But forget about altering the font type: not only is it unlikely to be possible, it isn't the problem.

Replace colors in PDF using ghostscript

How can I convert a single color from a PDF document into another color, for example convert all instances of #ff0000 (red) to #ffffff (white)?
I've seen a number of ghostscript commands doing something similar (using setcolor, setcolortransfer), but I can't find a solution for this exact problem.
For example, the following will create an image negative of the input PDF:
gs -o output.pdf -sDEVICE=pdfwrite -c "{1 exch sub}{1 exch sub}{1 exch sub}{1 exch sub} setcolortransfer" -f input.pdf
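As far as I understand, setcolortransfer takes four procedures (red, green, blue and gray), and an empty procedure {} leaves that component unchanged, so presumably something like this would invert only the red component:
gs -o red-inverted.pdf -sDEVICE=pdfwrite -c "{1 exch sub} {} {} {} setcolortransfer" -f input.pdf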
I'd like to go beyond this with a higher level of control, replacing a single color with a different color (not its negative).
Essentially, you can't (or at least not using Ghostscript).
Firstly, you seem to be assuming that the colours will be specified in RGB, when in fact they could be specified in CMYK, ICC, CalRGB or Lab. You also need to consider Indexed colour spaces.
Secondly, Ghostscript does not 'edit' PDF files: when you send a PDF file as input to Ghostscript, it is fully interpreted into graphics primitives, and the primitives are processed.
When the output is PDF, the primitives are reassembled into a new PDF file. The goal of this process is that the visual appearance of the new PDF file should match the original. It is NOT the same PDF file; its internals will likely be completely different.
Finally, how do you plan to handle images? Will you process them byte by byte to massage the colours? Or do you plan to ignore them? The same goes for shadings, where the colours aren't even present in the PDF file directly, but are generated from functions.
Without knowing why you want to do this I can't even offer a different approach, other than: decompress the PDF file, read it, and replace the colours manually.
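If you take that route, a decompression pass makes the file editable by hand or by script; qpdf's QDF mode is one way to do it (assuming qpdf is installed; fix-qdf repairs stream lengths after editing):
qpdf --qdf --object-streams=disable input.pdf editable.pdf
# ... edit the colour operators in editable.pdf ...
fix-qdf editable.pdf > output.pdf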

PDF font mapping error

While rendering a PDF file generated by PDFCreator 0.9.x, I noticed it contains an error in the character mapping. Now, an error in a PDF file is nothing to be wondered at; Acrobat does wonders rendering faulty PDF files, hence a lot of PDF generators create PDFs that do not fully adhere to the PDF standard.
I tried to create a small example file: http://test.continuit.nl/temp/Document.pdf
The single page renders a single glyph (a capital A) using a Tj command (see stream 5 0 obj). The selected font (7 0 obj) is a font with a single glyph embedded. So far so good. The character is referenced by char code #1, and the font's Encoding contains a Differences array: [ 1 /A ], so char code 1 maps to the character /A. However, in the embedded subset font the cmap matches no glyph at character code 65 (i.e. capital A); instead, the font's cmap section defines the characters in exactly the order of the PDF file's Font -> Encoding -> Differences array.
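For reference, the structure I am describing looks roughly like this (object numbers as in my file; the width and descriptor values are illustrative, not copied from it):
7 0 obj
<< /Type /Font
   /Subtype /TrueType
   /BaseFont /ABCDEF+SomeFont
   /FirstChar 1
   /LastChar 1
   /Widths [ 722 ]
   /Encoding << /Type /Encoding /Differences [ 1 /A ] >>
   /FontDescriptor 8 0 R
>>
endobj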
It looks like the character mapping/encoding is done twice. Only files from PDFCreator 0.9.x seem to be affected.
My question is: is this correct (or did I make a mistake and is the PDF correct), and what would you do to detect this situation in order to solve the rendering problem?
Note: I do need to be able to render these PDFs.
Solution
In ISO 32000 there is a remark that for symbolic TrueType fonts (flag bit 3 is on in the font descriptor) the Encoding is not allowed and you should IGNORE it, always using a simple one-to-one encoding. So, all in all: if it is a symbolic font, I ignore the Encoding object altogether, and this solves the problem.
The first point is that the file opens and renders correctly in Acrobat, so it's almost certain that the file is correct. In fact it opens and renders correctly in a wide range of PDF consumers, so it is in fact correct.
The font in question is a TrueType font, so actually yes, there are two kinds of 'encoding'. First there is PDF/PostScript Encoding. This maps a character code into a glyph name. In your case it maps character code 1 to glyph name /A.
In a PostScript font we would then look up the name /A in the CharStrings dictionary, and that would give us the character description, which we would then execute. Things are different with a TrueType font though.
You can find this on page 430 of the PDF Reference Manual 1.7, where it states:
"A TrueType font program’s built-in encoding maps directly from character codes to glyph descriptions by means of an internal data structure called a “cmap” (not to be confused with the CMap described in Section 5.6.4, “CMaps”)."
I believe that in your case you simply need to use the character code (0x01) directly in the cmap subtable. This will give you a GID of 36.