How can I properly create multilingual metadata in pdftk - pdf

pdftk lets you set the title of a PDF with the following command:
pdftk input.pdf update_info metadata.txt output output.pdf
However, if I use special characters in the metadata.txt file (such as German umlauts or Chinese characters), it doesn't seem to work.
Here's an example of changing the title:
InfoBegin
InfoKey: Title
InfoValue: Fingerspitzengefühl is a German term.
However, the PDF ends up with a strange character in place of the ü.
The pdftk documentation says that non-ASCII characters should be encoded as XML numerical entities. However, I Googled myself silly and couldn't find anything that works.

The best reference I've found is Numeric Character References, which apply to XML (and XHTML and SGML).
This is generally used to represent characters that are not directly encodable.
In your case, the character is U+00FC (ü), which can be substituted with &#252; (decimal) or &#xFC; (hexadecimal).
Using a decimal reference, your file should be encoded as:
InfoBegin
InfoKey: Title
InfoValue: Fingerspitzengef&#252;hl is a German term.
Note: if you're on *nix, you can use recode to encode the file:
% cat metadata.txt | recode ..xml
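Putting it together, a minimal workflow might look like this (a sketch; the file names are placeholders, and metadata.txt is assumed to contain only the InfoBegin/InfoKey/InfoValue lines shown above):
recode ..xml < metadata.txt > metadata-encoded.txt
pdftk input.pdf update_info metadata-encoded.txt output output.pdf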

This answer seems better as there is no need to install extra tools. Instead, it uses pdftk's built-in dump_data_utf8 and update_info_utf8 operations:
pdftk input.pdf update_info_utf8 metadata.txt output output.pdf
It works perfectly for Chinese.
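A minimal round trip with the UTF-8 variants might look like this (a sketch; the file names are placeholders):
pdftk input.pdf dump_data_utf8 > metadata.txt
# edit the InfoValue lines directly in UTF-8 (e.g. German or Chinese text)
pdftk input.pdf update_info_utf8 metadata.txt output output.pdf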

Related

Asking for the Unicode of letter conjunctions

I occasionally encounter some special characters while parsing PDF documents. They are actually two English letters, like 'fi', 'tt', or 'ti', but visually they look conjoined, and they exist in the PDF string as a single character.
I checked the ToUnicode CMap for these characters, but the table appears to be broken, so I cannot find their Unicode values.
For example, <012E> Tj prints fi as in the attached picture, but the corresponding font's ToUnicode CMap contains <012E> <0001>, which is meaningless.
Could anybody tell me their Unicode code points? Is it possible to find them from the corresponding font program?
Thanks for any advice.
(attached images show the rendered glyphs for fi, tt, and ti)
First of all, what you call letter conjunctions is usually known as ligatures. Thus, I will use that term from now on.
Unicode discourages the use of specific code points for ligatures:
The existing ligatures exist basically for compatibility and round-tripping with non-Unicode character sets. Their use is discouraged. No more will be encoded in any circumstances.
Ligaturing is a behavior encoded in fonts: if a modern font is asked to display “h” followed by “r”, and the font has an “hr” ligature in it, it can display the ligature. Some fonts have no ligatures, while others (especially fonts for non-Latin scripts) have hundreds of ligatures. It does not make sense to assign Unicode code points to all these font-specific possibilities.
(Unicode FAQ on ligatures)
Thus, you should not use the existing ligature code points.
You appear to be trying to find the correct ToUnicode mapping for ligature glyphs. For this, simply remember that the values of ToUnicode mappings need not be single code points but may be multiple ones:
n beginbfchar
srcCode dstString
endbfchar
where dstString may be a string of up to 512 bytes.
(ISO 32000-1, section 9.10.3 ToUnicode CMaps)
Concerning your example, therefore:
For example, <012E> Tj prints fi as in the attached picture, but the corresponding font's ToUnicode CMap contains <012E> <0001>, which is meaningless.
Simply use
<012E> <00660069>
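In context, the relevant fragment of the ToUnicode CMap would then look something like this (a sketch; only the bfchar section is shown, the surrounding CMap boilerplate stays as it is):
1 beginbfchar
<012E> <00660069>
endbfchar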
If you want to use ligature code points nonetheless, consult the Wikipedia article on orthographic ligatures; it lists some ligature code points, in particular U+FB01 for fi. So for your example:
<012E> <FB01>
But remember, their use is discouraged.

Ghostscript mangles umlauts when reading PDFs

I use this on Linux
gs -dBATCH -dNOPAUSE -sDEVICE=txtwrite -o res.txt 1.pdf
to extract text from a few hundred PDFs; however, umlauts and other special characters up to code 255 get mangled. Any ideas?
cf https://archive.org/download/bnmm_gmx_1/1.pdf (contains two "ä")
Partial translation table (the last one and all other special letters of the Turkish alphabet are mangled into non-printable characters, otherwise I could help myself):
ä = À¤
é = À©
ç = À§
Looks like it ought to work as the fonts have a ToUnicode CMap. I'd suggest you open a bug report.
Note that you are not using ASCII: the embedded, subset fonts are CIDFonts, and the CMap in use creates 2-byte character codes (though, oddly, all the high bytes are 0). For example, the space is actually encoded as character code 0x0003, the '0' as code 0x0013, and so on.
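For illustration, a hypothetical snippet of such a content stream under that encoding might look like this (code values taken from the note above; the surrounding text-object operators are omitted):
<00130003> Tj    % '0' followed by a space, written as 2-byte character codes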
By the way, a simple example would be useful; it's quite hard to pick out the accented glyphs from the regular text in this file.

How can I get localized quotation marks with pandoc?

I would like to get quotation marks like these „...” but when I process my Markdown text with pandoc, it gives me “...”. It probably boils down to the question of how to make pandoc use locale settings. Here is my command line:
pandoc -o out.pdf -f markdown -t latex in.md
You may want to specify the language for the whole document, so that it not only affects the quotes, but also ligatures, unbreakable spaces, and other locale specifics.
In order to do that, you may specify the lang option. From pandoc's manual:
lang: identifies the main language of the document, using a code according to BCP 47 (e.g. en or en-GB). For some output formats, pandoc will convert it to an appropriate format stored in the additional variables babel-lang, polyglossia-lang (LaTeX) and context-lang (ConTeXt).
Moreover, the same manual states that:
if your LaTeX template or any included header file call for the csquotes package, pandoc will detect this automatically and use \enquote{...} for quoted text.
In my opinion, the best way to ensure localized quotes is thus to add:
---
lang: fr-FR
header-includes:
- \usepackage{csquotes}
---
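With that metadata block at the top of in.md, the command from the question should then produce localized quotes (a sketch, reusing the file names from the question):
pandoc -o out.pdf -f markdown -t latex in.md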
Or, even better, edit the pandoc default template:
pandoc -D latex > ~/.pandoc/templates/default.latex
and permanently add \usepackage{csquotes} to it.
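If you go the template route, the addition is just one line placed somewhere in the template's preamble (a sketch; the template path is the one from the command above):
% in ~/.pandoc/templates/default.latex, in the preamble:
\usepackage{csquotes}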
Pandoc currently (Nov. 2015) only generates English-style quotes. But you have two options:
You can use the --no-tex-ligatures option to turn off the --smart typography option, which is turned on by default for LaTeX output. Then use the proper Unicode characters (e.g. „...”) you desire and use a --latex-engine that supports Unicode (lualatex or xelatex); see the example command after the quoted note below.
You could use \usepackage[danish=quotes]{csquotes} or similar in your Pandoc LaTeX template. From the README:
If the --smart option is specified, pandoc will produce typographically correct output, converting straight quotes to curly quotes, --- to em-dashes, -- to en-dashes, and ... to ellipses. Nonbreaking spaces are inserted after certain abbreviations, such as “Mr.”
Note: if your LaTeX template calls for the csquotes package, pandoc will detect this automatically and use \enquote{...} for quoted text.
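For the first option, the command would look roughly like this (a sketch; the flag names are those of pandoc 1.x, and in.md is assumed to contain the literal „...” characters):
pandoc -o out.pdf --latex-engine=xelatex --no-tex-ligatures -f markdown -t latex in.md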
Even though @mb21 already provided an answer, I would like to add that nowadays it's possible to simply include what you need in (for example) your YAML metadata block, so it is no longer necessary to create your own template. Example:
---
title: foo
author: bar
# other stuff
header-includes:
- \usepackage[german=quotes]{csquotes}
---
# contents

How to convert the PDF content code to the type like "(<0034>) Tj"?

PDF content is saved in several ways: "(abc) Tj", "(<0035><0035>) Tj", or "\065".
I want to know if there is a way to convert the PDF code to one consistent form, no matter whether it is direct text "(abc) Tj", hexadecimal "(<0035><0035>) Tj", or octal "\065".
I think that if the PDF were converted and encoded in one form, it would be easier to analyse the content.
Is it possible to use Ghostscript or something to do that? Thanks
Essentially, no, there is no way to do so. There are two kinds of string: regular strings, delimited by '(' and ')', and hex strings, delimited by '<' and '>'. Hex strings need no escaping, whereas regular text strings must escape 'special' characters such as carriage return and line feed. Octal escapes are also permitted in regular strings.
PDF producers are free to mix and match these all they like, but in general a given PDF producer will usually use one technique throughout.
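For illustration only (a sketch assuming a font encoding where these codes map to the ASCII letters A, B, C), the same text could be written in any of these equivalent forms:
(ABC) Tj
<414243> Tj
(\101\102\103) Tj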
Because Ghostscript's pdfwrite device is a PDF producer, it will (I believe) generally produce all its output the same way.
What it won't do is 'convert' your original PDF file. It produces a brand new PDF file which should look visually identical but whose internals bear no resemblance to your original PDF. In addition some metadata or fidelity may be lost.
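For reference, a typical pdfwrite invocation that produces such a rewritten file looks like this (a sketch; file names are placeholders):
gs -sDEVICE=pdfwrite -o out.pdf in.pdf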

Adding encoding in postscript, ghostscript renders text correctly, but converting to PDF does not show the characters

We have to construct a PostScript file that contains Arabic text as well as English text.
Ghostscript shows the Arabic text correctly, but converting it to PDF does not show the Arabic letters.
The PS file contains the following:
/TraditionalArabic findfont dup length dict copy begin
/Encoding Encoding 256 array copy def
Encoding 1 /kafinitialarabic put
Encoding 2 /behinitialarabic put
Encoding 3 /yehmedialarabic put
Encoding 4 /seenfinalarabic put
Encoding 5 /eacute put
Encoding 6 /a put
/ArabicTradDict currentdict definefont pop
end
%%Page: 1 1
%%BeginPageSetup
%%PageMedia: Color Weight Type
<< /MediaColor (Blue)/MediaWeight 75 /MediaType () /xx {2.803464567 mul} def /xx {2.83464567 mul} def /PageSize [240 xx 345 xx]>> setpagedevice
%%EndPageSetup
/ArabicTradDict 18 selectfont
72 xx 300 xx moveto
(\004\003\002\001) show
showpage
To run Ghostscript from the command line, including all Windows fonts:
gswin64.exe -sFONTPATH=%windir%/fonts -dEmbedAllFonts=true
To convert the PS file to a PDF file, I run the following command:
gswin64.exe -dBATCH -dNOPAUSE -sOutputFile=c:/Users/mob/Desktop/TimesNewRomanPSMT.pdf -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress -dCompressFonts=false -dSubsetFonts=false -sFONTPATH=%windir%/fonts -dEmbedAllFonts=true -f c:/Users/mob/Desktop/TimesNewRomanPSMT.ps
So when converting to PDF, the Arabic characters are not shown correctly, but appear as meaningless squares...
If I use the Adobe tool to convert to PDF, the resulting PDF is the same, except that eacute (\005), if included in the PS file, does show after conversion, whereas with the command line above none of the characters added via the Encoding are shown correctly.
Any help with that?
Thanks to KenS's hints I was able to solve my problem. The Encoding used wrong character names like kafinitialarabic (by wrong I mean that the PDF could not understand them); every name ending in 'arabic' was wrong, because the Traditional Arabic font does not use those names for its characters. To find out which names it actually understands, I converted the TTF font to AFM and PFA, i.e. to a Type 42 font, which will be understood once embedded in the PostScript file during conversion to PDF, using the following command:
C:\Program Files\gs\gs9.10\bin>gswin64c.exe -dNODISPLAY -q -- ttf2pf.ps times timesPS timesAFM
where times is the TTF font name. I then checked the generated pfa file for the characters I wanted to add: instead of kafinitialarabic there was kafinitial, for kafmedialarabic there was kafmedial, and so on...
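Based on that, the Encoding lines from the original file would become something like the following (only kafinitial and kafmedial are confirmed above; the other names are guesses following the same pattern and must be checked against the generated pfa):
Encoding 1 /kafinitial put
Encoding 2 /behinitial put    % guessed name, verify in the pfa
Encoding 3 /yehmedial put     % guessed name, verify in the pfa
Encoding 4 /seenfinal put     % guessed name, verify in the pfa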
Adding those names to the Encoding works fine now, but instead of adding all those characters to the dictionary I would like to find a way to use the font the way we normally use it with setfont in PostScript, if that is possible...
As already suggested, you need to ensure the glyph names you use are in the font you use, or create a new font.
I haven't found anything that will choose the correct glyph from the set of initial, medial, final, isolated, depending on context, though.
I resorted to writing a program which takes Unicode Arabic, reverses the Arabic characters, and then decides which form of each character to use based on its position in a word and on whether the previous or next characters are forced into isolated or final forms. Unfortunately I had to embed quite a lot of intrinsic knowledge about the font in use and the glyph names it has, as well as typos in them, into the program.
If that's of interest, I've stuck it on GitHub, but it's very raw and initial.
It does work, though.
https://github.com/gbjk/arabic2ps
The font I used was a Traditional Arabic font, with quite a few idiosyncrasies.