PDF Digital Signature has "Bad parameter" in Acrobat

PDF Digital Signature has "Bad parameter" in Acrobat - pdf

I'm trying to create an open source library for Digital Signing of PDF files.
Bad parameter
I got most of it done, but I have a problem that the signature shows the following error:
Error during signature verification.
Adobe Acrobat error.
Bad parameter.
I tried to find the problem, but until now have not found it.
I have created 2 pdf files that are striped of almost all other data, except the needed info.
Does anyone know where this error might originate from?
I have already tried different online and offline validators, but non of them pointed me in the right direction.
Does anyone know if this error might originate from the certificate and not the pdf struture itself?
Invalid byte range
While creating this post I also tested it on other pdf file too, but got the error:
Error during signature verification.
Unexpected byte range values defining scope of signed data.
Details: The signature byte range is invalid
Note a slice of the pdf describes it as:
...
/SubFilter/adbe.pkcs7.detached
/ByteRange[0 4197 22193 30080 ]
/Contents<30820...
I have multiple times recalculated the ByteRange attribute and even tried changing it by one byte in each direction, but that will always result in Signature processing error..
I don't know what else can be incorrect about the ByteRange. (the added spaces are the same as how Acrobat pads the byterange.)
If anyone might have an idea on what the problem might is, let me know.
Files
Here are my resulting files:
result_bad_param_with_image.pdf (mirror1) (mirror2)
result_bad_param_no_image.pdf (mirror1) (mirror2)
result2_invalid_byte_range_with_image.pdf (mirror1) (mirror2)
result2_invalid_byte_range_no_image.pdf (mirror1) (mirror2)
A signature file (same as the Contents field in a pdf, excepts directly in separate file):
signature.der (mirror1) (mirror2)
The content of the signature.der is printed here: https://pastebin.com/W4EGJ2fX
(using openssl cms -inform DER -in signature.der -cmsout -print command)
(I know the signature is self signed and does not contains a lot of info, but this should not matter for this, I think, this was just to create these examples)
Edit: New links after solving some problems and added some extra files:
result.pdf
signature.der
signed_content.der

There are some errors in your signature and an uncommon structure which in the context of digital signatures may result in rejection by a validator.
Incorrect Signed Hash Value Inside Signature Container
Signing in CMS signature containers with signed attributes makes use of two hash values:
the hash value of the signed byte ranges of the PDF; that value is correct in your example files;
the hash value of the signed attributes in the SignerInfo of the signature container; that value is not correct in your example files.
PS: Looking into the mismatch once again, it turns out that your signed attributes are not DER encoded: The DER encoding in particular sorts the elements of a SET in a specific order, and in your case the attribute order is not the DER order. The specification requires the signed attributes to be DER encoded, though.
PPS: In a comment you argued
I just checked the order of the SET and I can not find anything that is wrong with it. Here is my reasoning, let me know what part is incorrect. The items should be order according to the 'key' which in my case is an Oid.
First of all, this reasoning is flawed: The type in question is a set type (more exactly an ASN.1 SET OF), not some map type; and the DER encoding rule set only knows the ASN.1 base types. Thus, that OID (which just is an arbitrary part of the attribute structure) cannot be the generic ordering key.
And indeed, a quick glance at the specification shows:
11.6 Set-of components
The encodings of the component values of a set-of value shall appear in ascending order, the encodings being compared
as octet strings with the shorter components being padded at their trailing end with 0-octets.
NOTE – The padding octets are for comparison purposes only and do not appear in the encodings.
(ISO/IEC 8825-1 / ITU-T Rec. X.690, section 11 "Restrictions on BER employed by both CER and DER")
Thus, in case of the signed attributes essentially you first DER encode the attribute elements and then sort the resulting byte arrays as described above.
As an aside, this issue of your signature merely causes problems in probably half the validators around. Some validators do not check or DER re-encode the signed attributes, so they get the same hash as you get. Others either check the encoding up front (and, therefore, throw an error because of the issue) or simply re-encode the attributes in DER (and, therefore, get a different hash than you get).
Problematic Extended Key Usage of Signer Certificate
Your signer certificate has an extended key usage value 1.3.6.1.4.1.311.80.1 (Microsoft's OID for Document Encryption) and only that. Adobe validation used to only support certificates with either no extended key usage or one or more of the following:
emailProtection
codeSigning
anyExtendedKeyUsage
1.2.840.113583.1.1.5 (Adobe Authentic Documents Trust)
See Enterprise Toolkit » Digital Signatures Guide for IT » A: Changes Across Releases.
Incorrect Incremental Updates
You sign in an incremental update to the original PDF. This in general is a good idea as it allows to extract the unsigned original document.
But one needs to add the incremental update correctly, and in case of result2_invalid_byte_range_no_image.pdf and result2_invalid_byte_range_with_image.pdf it is done incorrectly: The original revision there is created using cross reference tables but your incremental updates use pure cross reference streams. This is incorrect, you have to continue with the same kind of cross references.
When opening documents with a mix of cross reference tables and pure cross reference streams, Adobe Acrobat internally repairs this which in particular relocates signatures and so makes byte ranges incorrect.
Uncommon Signature Field Structure
You use an uncommon signature field structure in your example PDFs, you separate the widget from the field and only update the field, not the widget, in signing.
While this strictly speaking is ok, I would implement the common structures while making the code work at all, and deviate only thereafter.
PS: In a comment you asked whether I could elaborate on this.
Your signing implementation in a first step adds an incremental update with an empty signature field and a widget as indirectly referenced kid, e.g.:
16 0 obj
<<
/Type /Annot
/F 4
/Subtype /Widget
/BS << /Type /Border /S /S /W 0 >>
/Parent 17 0 R
/P 2 0 R
/Rect [141.75 664.89 276.75 702.39]
/AP << /N 18 0 R >>
/MK << /BC [.1882353 .1882353 .1882353] /BG [1.00 1.00 1.00] /R 0 >>
/DA (/TiRo 0 Tf 0 0 0 rg\r
)
>>
endobj
17 0 obj
<<
/Kids [16 0 R]
/FT /Sig
/T (eyJ1c2VySWQiOiIyNzIifQ==)
>>
In another incremental update you then sign the field with a direct signature value but don't change the widget, e.g.:
17 0 obj
<<
/Kids [16 0 R]
/FT /Sig
/T (eyJ1c2VySWQiOiIyNzIifQ==)
/V << /Type/Sig ... >>
>>
This is uncommon in some ways:
Usually for signatures the option to merge field and widget is used.
Usually for signatures (except for usage right signatures) the signature dictionary is not a direct but an indirect value of the key V.
Usually the appearance of a signature field is updated together with the signature dictionary if there is an appearance at all.
Also, unless other form fill-ins shall happen between adding an empty signature field and signing it, fields usually are added and filled in the same document update.
Thus, more common would be a single incremental update (or even full re-save) containing something like this:
92 0 obj
<<
/AP << /N 94 0 R >>
/DA (/MyriadPro-Regular 0 Tf 0 Tz 0 g)
/F 132
/FT /Sig
/MK <<>>
/P 1 0 R
/Rect [117.575 499.561 515.968 520.938]
/Subtype /Widget
/T (Signature3)
/Type /Annot
/V 93 0 R
>>
endobj
93 0 obj
<<
/ByteRange [ 0 3227714 5751810 2789]
...
>>
As said above, though, your structure strictly speaking is ok, too. But the "Bad parameter" only occurs when validating from the widget in the document or from the signature panel, but it does not occur when validating your signature using the "Validate Signature" button of the "Signature Properties" dialog. Because of that I think it's possible that Adobe is iritated by an uncommon structure.

Related

The signature byte range is invalid in Acrobat

I'm trying to make a timestamp of the PDF document using our library that we're making. I added a new section to the PDF document. I added new annotation object for the signature and the signature object containing the actual signature, and also a new xref table for the new section. When I check the xref entries everything seems right.
When I try to verify my signature in Acrobat I get the following error message "there are errors in the formatting of information contained in this signature (The signature byte range is invalid)".
However, when I check the byte range, everything seems right. I goes from the beginning of the document to the opening brackets of the Contents part, and from the end of it to the end of the document. I compared it with the document that has valid signature and it seems that byte ranges look the same.
I really don't understand what's wrong and why is Acrobat showing this error.
Here's the link to the signed file if anybody wants to have a look: https://ufile.io/mckajk9h
PS: I can share some part of the code, but the actual question is about the Acrobat reader and how it interprets PDF signatures, not my code. So, the relevant part should be the resulting PDF file that I shared.

There are some issues in your PDF.
The Major Issue
The major one which results in the non-intuitive error message: The first entry in your incremental update cross reference table is one byte too short:
As you can see, the first cross reference table entry (0000000000 65535 f\n) is one byte too short, according to the specification it has to be exactly 20 bytes long but yours is only 19 bytes long.
Whenever Adobe Acrobat sees structurally broken cross reference tables, it internally repairs the file. In the repaired file the objects are rearranged which renders the signature byte range invalid.
As soon as I had fixed this by adding a space between f and \n, Adobe Acrobat did not complain about a formatting error anymore. Of course it claimed that the document had been altered or corrupted, after all I had altered it. But at least it accepted the signature structure.
Minor Issues
Some of your offsets are incorrect, in your incremental update you have
xref
0 2
0000000000 65535 f
0000003029 00000 n
12 2
0000003144 00000 n
0000003265 00000 n
Thus, object 12 should start at 3144 but it actually starts at 3145.
Furthermore, you have
startxref
19548
%%EOF
So the xref keyword should start at 19548 but it starts at 19549.

Digital signature not visible in Adobe reader but visible in Foxit reader

I'm using this library https://github.com/vbuch/node-signpdf to sign a pdf document. After I have signed the document I can see the signature when I open the pdf with Foxit reader but not when I open it with Adobe reader DC. I also tried Adobe reader XI but there I can't see it either.
When I open the document in pdf xchange viewer I get this error: non critical errors detected in the xref table.
Any ideas what the problem could be?
That's the file I signed: https://drive.google.com/file/d/1AZvS4sP2Y3FwW4Deod87Dgxc9I0QZkoc/view?usp=sharing

In your example PDF the name of the signature field consists of 10 bytes, 9 bytes with value 0x00 and one byte with value 0x01. Apparently Adobe Reader does not like that field name.
After some experiments it looks like Adobe Reader does not like a field name starting with a 0x00 byte.
Maybe it contains some code that determines string lengths in a c'ish manner and interprets a 0x00 as end-of-string. A field name with a leading 0x00 byte, therefore, is interpreted as empty string, a field name not accepted by Adobe Reader either.
Thus, please use a signature field name made of (in particular starting with) some meaningful characters. As validators usually display the name of the signature field, this is a good idea anyways.
In terms of lowlevel PDF objects:
The signature field object looks like this:
18 0 obj
<<
/Type /Annot
/Subtype /Widget
/FT /Sig
/Rect [0 0 0 0]
/V 17 0 R
/T ( )
/F 4
/P 1 0 R
>>
endobj
but only like this, the string value of the T entry actually contains the above mentioned nine 0x00 bytes and one 0x01 byte. This is the value that must be changed to a non-empty string not starting with 0x00. I would propose not using bytes < 0x20 at all. Furthermore, the dot, 0x2e, must not be used in the name, it is reserved for separating partial names.

Text object gets deleted in Adobe Reader 11

In adobe acrobat x i was inserting text objects and when it is opened in adobe reader 10 it was opening properly.but in adobe reader 11 when i click on that pdf file text objects gets deleted.why this happens? How to solve it?
The source pdf file click here
The pdf file which has problem when double clicking on it in adobe reader 11.
click here

In a nutshell:
You try to change the contents of a free text annotation by changing its normal appearance stream.
This is insufficient: A compliant PDF viewer may ignore this entry and provide their own appearances. So it's mere luck that older Adobe Reader versions chose to not ignore your change.
Thus, you also need to change the information a PDF viewer is expected to create their own appearance from, i.e. foremost the rich text value of RC (in the free text annotation dictionary) that shall be used to generate the appearance of the annotation, and also the Contents value which is the Text that shall be displayed for the annotation.
Furthermore there are defects in your PDFs:
the cross reference table in your first attempt result.pdf was broken;
the intent (IT value) of the free text annotation in your source files is spelled incorrectly.
In detail:
Your result.pdf is broken. Different PDF viewers may display broken PDFs differently.
Some details:
It has been created based on your Src.pdf in append mode but additionally the following change in the original revision has been made to its /Pages object:
In the source:
6 0 obj
<</Count 6
/Type /Pages
/Kids [ 7 0 R 8 0 R 9 0 R 10 0 R 11 0 R 12 0 R ]
>>
endobj
In the result:
6 0 obj
<</Count 3
/Type /Pages
/Kids [ 7 0 R 8 0 R 9 0 R 12 0 R 11 0 R 10 0 R ]
>>
endobj
So the order of the last three pages was changed (which is ok) and the /Count was reduced from 6 to 3. This is inconsistent as there still are 6 child objects but according to the PDF specification ISO 32000-1, Count is
The number of leaf nodes (page objects) that are descendants of this node within the page tree.
Furthermore the cross reference stream of the appended revision is broken.
xref
0 1
0000000000 65535 f
24 1
0001465240 00000 n
57 1
0001466075 00000 n
66 1
0001466909 00000 n
73 1
0001467744 00000 n
93 1
0001473484 00000 n
131 1
0001478703 00000 n
The entries are 19 bytes long including their respectively ending single byte newline character According to the spec, though,
Each entry shall be exactly 20 bytes long, including the end-of-line marker.
The format of an in-use entry shall be: nnnnnnnnnn ggggg n eol
where [...] eol shall be a 2-character end-of-line sequence
There may be more errors in the PDF but you may want to start fixing these.
EDIT
Now with the new PDF Pay-in.pdf with a proper cross reference at hand, let's look at it more in-depth.
Adobe Preflight complains about many occurances of:
[...]
An unexpected value is associated with the key
Key: IT
Value: /FreeTextTypewriter
Type: CosName
Formal Representation: Annot.AnnotFreeText
Cos ID: 86
Traversal Path: ->Pages->Kids->[0]->Annots->[13]
[...]
Ok, let's look at that object 86:
86 0 obj
<< /P 8 0 R
/Type /Annot
/CreationDate (D:20130219194939+05'30')
/T (winman)
/NM (0f202782-2274-44b8-9081-af4010be86d4)
/Subj (Typewritten Text)
/M (D:20130219195100+05'30')
/F 4
/Rect [ 53.2308 33.488 552.088 826.019 ]
/DS (font: Helv 12.0pt;font-stretch:Normal; text-align:left; color:#000000 )
/AP <</N 107 0 R >>
/Contents (wwww)
/IT /FreeTextTypewriter
/BS 108 0 R
/Subtype /FreeText
/Rotate 90
/DA (16.25 TL /Cour 12 Tf)
/RC (<?xml version="1.0"?>
<body xmlns="http://www.w3.org/1999/xhtml"
xmlns:xfa="http://www.xfa.org/schema/xfa-data/1.0/"
xfa:APIVersion="Acrobat:10.0.0"
xfa:spec="2.0.2"
style="font-size:12.0pt;text-align:left;color:#000000;font-weight:normal;
font-style:normal;font-family:Helv;font-stretch:normal">
<p dir="ltr">
<span style="line-height:16.3pt;font-family:Helvetica">wwww</span>
</p>
</body>)
>>
endobj
Preflight stated that it is unhappy about the line /IT /FreeTextTypewriter. Looking at the PDF specification again uncovers for annotations with /Subtype /FreeText, i.e. Free Text Annotations specified in section 12.5.6.6:
IT name
(Optional; PDF 1.6) A name describing the intent of the free text annotation (see also the IT entry in Table 170). The following values shall be valid:
FreeText The annotation is intended to function as a plain free-text annotation. A plain free-text annotation is also known as a text box comment.
FreeTextCallout The annotation is intended to function as a callout. The callout is associated with an area on the page through the callout line specified in CL.
FreeTextTypeWriter The annotation is intended to function as a click-to-type or typewriter object and no callout line is drawn.
Default value: FreeText
Thus, your value FreeTextTypewriter is invalid (remember, PDF names are case sensitive!). Therefore, the annotation is (slightly) broken which may already result in all kinds of problems.
But there are other important entries here, too, to understand your issue: All you do in your appended changes is to replace the appearance stream in object 107 (as per /AP <</N 107 0 R >>) of this annotation by a different one. But this annotation contains an RC value, too, which according to the specification is
A rich text string (see 12.7.3.4, “Rich Text Strings”) that shall be used to generate the appearance of the annotation.
Thus, any PDF viewer may regenerate the appearance from that rich text description, especially as the specification in section 12.5.2 says about the content of the AP dictionary
Individual annotation handlers may ignore this entry and provide their own appearances.
Thus, simply replacing the normal appearance stream does not suffice to permanently change the appearance of that annotation, you have to change the appearance dictionary and at least remove any alternative source for the appearance.
Furthermore the entry /Contents (wwww) is not replaced by your appended changes either. So a PDF viewer trying to decide whether to use the appearance stream or not will feel tempted to somehow create a new appearance as your appearance stream in no way represents that value.
Especially when starting to manipulate the free text (e.g. when clicking into the PDF in your case), the PDF viewer knows it eventually will have to create a new appearance anyways, and unless the current appearance is as it would have created it anyway, the viewer may prefer to begin anew starting with an appearance derived from the rich text or even the contents value.

Create PDF file without dictionary data type (angle brackets)

I want to create a PDF file which does not contain angle brackets in it's source.
This apparently implies not using the dictionary data type, as this involves << and >>.
Is it possible to completely avoid angle brackets and still create a PDF file with formatted content?
Can it be done hiding in a stream, using a character encoding technique or with an alternative dictionary notation?
The solution is needed for an obfuscation technique; the bracket-problem cannot be circumvented.

I think this is not possible. Every element in a PDF file is contained in some dictionary: the document catalog (the root dictionary), the page objects, the page content streams, all of them are dictionaries that require the char sequence << >>.
Sample catalog dictionary:
1 0 obj
<<
/Pages 2 0 R
/Type /Catalog
>>
endobj
If you want to use a "instructions sequence only" presentation format, you may try using PostScript instead.
Edit after comments:
Using a stream object with some filter encoding will not solve your problem, since you still need to specify the filter type in the stream dictionary.
Example:
5 0 obj
<</Length 6 0 R /Filter /FlateDecode>>
stream
***illegible characters***
endobj

Unicode in PDF

My program generates relatively simple PDF documents on request, but I'm having trouble with unicode characters, like kanji or odd math symbols. To write a normal string in PDF, you place it in brackets:
(something)
There is also the option to escape a character with octal codes:
(\527)
but this only goes up to 512 characters. How do you encode or escape higher characters? I've seen references to byte streams and hex-encoded strings, but none of the references I've read seem to be willing to tell me how to actually do it.
Edit: Alternatively, point me to a good Java PDF library that will do the job for me. The one I'm currently using is a version of gnujpdf (which I've fixed several bugs in, since the original author appears to have gone AWOL), that allows you to program against an AWT Graphics interface, and ideally any replacement should do the same.
The alternatives seem to be either HTML -> PDF, or a programmatic model based on paragraphs and boxes that feels very much like HTML. iText is an example of the latter. This would mean rewriting my existing code, and I'm not convinced they'd give me the same flexibility in laying out.
Edit 2: I didn't realise before, but the iText library has a Graphics2D API and seems to handle unicode perfectly, so that's what I'll be using. Though it isn't an answer to the question as asked, it solves the problem for me.
Edit 3: iText is working nicely for me. I guess the lesson is, when faced with something that seems pointlessly difficult, look for somebody who knows more about it than you.

In the PDF reference in chapter 3, this is what they say about Unicode:
Text strings are encoded in
either PDFDocEncoding or Unicode character encoding. PDFDocEncoding is a
superset of the ISO Latin 1 encoding and is documented in Appendix D. Unicode
is described in the Unicode Standard by the Unicode Consortium (see the Bibliography).
For text strings encoded in Unicode, the first two bytes must be 254 followed by
255. These two bytes represent the Unicode byte order marker, U+FEFF, indicating
that the string is encoded in the UTF-16BE (big-endian) encoding scheme specified
in the Unicode standard. (This mechanism precludes beginning a string using
PDFDocEncoding with the two characters thorn ydieresis, which is unlikely to
be a meaningful beginning of a word or phrase).

The simple answer is that there's no simple answer. If you take a look at the PDF specification, you'll see an entire chapter — and a long one at that — devoted to the mechanisms of text display. I implemented all of the PDF support for my company, and handling text was by far the most complex part of exercise. The solution you discovered — use a 3rd party library to do the work for you — is really the best choice, unless you have very specific, special-purpose requirements for your PDF files.

Algoman's answer is wrong in many things. You can make a PDF document with Unicode in it and it's not rocket science, though it needs some work.
Yes he is right, to use more than 255 characters in one font you have to create a composite font (CIDFont) pdf object.
Then you just mention the actual TrueType font you want to use as a DescendatFont entry of CIDFont.
The trick is that after that you have to use glyph indices of a font instead of character codes. To get this indices map you have to parse cmap section of a font - get contents of the font with GetFontData function and take hands on TTF specification.
And that's it! I've just did it and now I have a Unicode PDF!
Sample Code for parsing cmap section is here: https://web.archive.org/web/20150329005245/http://support.microsoft.com/en-us/kb/241020
And yes, don't forget /ToUnicode entry as #user2373071 pointed out or user will not be able to search your PDF or copy text from it.

As dredkin pointed out, you have to use the glyph indices instead of the Unicode character value in the page content stream. This is sufficient to display Unicode text in PDF, but the Unicode text would not be searchable. To make the text searchable or have copy/paste work on it, you will also need to include a /ToUnicode stream. This stream should translate each glyph in the document to the actual Unicode character.

See Appendix D (page 995) of the PDF specification. There is a limited number of fonts and character sets pre-defined in a PDF consumer application. To display other characters you need to embed a font that contains them. It is also preferable to embed only a subset of the font, including only required characters, in order to reduce file size. I am also working on displaying Unicode characters in PDF and it is a major hassle.
Check out PDFBox or iText.
http://www.adobe.com/devnet/pdf/pdf_reference.html

I have worked several days on this subject now and what I have learned is that unicode is (as good as) impossible in pdf. Using 2-byte characters the way plinth described only works with CID-Fonts.
seemingly, CID-Fonts are a pdf-internal construct and they are not really fonts in that sense - they seem to be more like graphics-subroutines, that can be invoked by addressing them (with 16-bit addresses).
So to use unicode in pdf directly
you would have to convert normal fonts to CID-Fonts, which is probably extremely hard - you'd have to generate the graphics routines from the original font(?), extract character metrics etc.
you cannot use CID-Fonts like normal fonts - you cannot load or scale them the way you load and scale normal fonts
also, 2-byte characters don't even cover the full Unicode space
IMHO, these points make it absolutely unfeasible to use unicode directly.
What I am doing instead now is using the characters indirectly in the following way:
For every font, I generate a codepage (and a lookup-table for fast lookups) - in c++ this would be something like
std::map<std::string, std::vector<wchar_t> > Codepage;
std::map<std::string, std::map<wchar_t, int> > LookupTable;
then, whenever I want to put some unicode-string on a page, I iterate its characters, look them up in the lookup-table and - if they are new, I add them to the code-page like this:
for(std::wstring::const_iterator i = str.begin(); i != str.end(); i++)
{
if(LookupTable[fontname].find(*i) == LookupTable[fontname].end())
{
LookupTable[fontname][*i] = Codepage[fontname].size();
Codepage[fontname].push_back(*i);
}
}
then, I generate a new string, where the characters from the original string are replaced by their positions in the codepage like this:
static std::string hex = "0123456789ABCDEF";
std::string result = "<";
for(std::wstring::const_iterator i = str.begin(); i != str.end(); i++)
{
int id = LookupTable[fontname][*i] + 1;
result += hex[(id & 0x00F0) >> 4];
result += hex[(id & 0x000F)];
}
result += ">";
for example, "H€llo World!" might become <01020303040506040703080905>
and now you can just put that string into the pdf and have it printed, using the Tj operator as usual...
but you now have a problem: the pdf doesn't know that you mean "H" by a 01. To solve this problem, you also have to include the codepage in the pdf file. This is done by adding an /Encoding to the Font object and setting its Differences
For the "H€llo World!" example, this Font-Object would work:
5 0 obj
<<
/F1
<<
/Type /Font
/Subtype /Type1
/BaseFont /Times-Roman
/Encoding
<<
/Type /Encoding
/Differences [ 1 /H /Euro /l /o /space /W /r /d /exclam ]
>>
>>
>>
endobj
I generate it with this code:
ObjectOffsets.push_back(stream->tellp()); // xrefs entry
(*stream) << ObjectCounter++ << " 0 obj \n<<\n";
int fontid = 1;
for(std::list<std::string>::iterator i = Fonts.begin(); i != Fonts.end(); i++)
{
(*stream) << " /F" << fontid++ << " << /Type /Font /Subtype /Type1 /BaseFont /" << *i;
(*stream) << " /Encoding << /Type /Encoding /Differences [ 1 \n";
for(std::vector<wchar_t>::iterator j = Codepage[*i].begin(); j != Codepage[*i].end(); j++)
(*stream) << " /" << GlyphName(*j) << "\n";
(*stream) << " ] >>";
(*stream) << " >> \n";
}
(*stream) << ">>\n";
(*stream) << "endobj \n\n";
Notice that I use a global font-register - I use the same font names /F1, /F2,... throughout the whole pdf document. The same font-register object is referenced in the /Resources Entry of all pages. If you do this differently (e.g. you use one font-register per page) - you might have to adapt the code to your situation...
So how do you find the names of the glyphs (/Euro for "€", /exclam for "!" etc.)? In the above code, this is done by simply calling "GlyphName(*j)". I have generated this method with a BASH-Script from the list found at
http://www.jdawiseman.com/papers/trivia/character-entities.html
and it looks like this
const std::string GlyphName(wchar_t UnicodeCodepoint)
{
switch(UnicodeCodepoint)
{
case 0x00A0: return "nonbreakingspace";
case 0x00A1: return "exclamdown";
case 0x00A2: return "cent";
...
}
}
A major problem I have left open is that this only works as long as you use at most 254 different characters from the same font. To use more than 254 different characters, you would have to create multiple codepages for the same font.
Inside the pdf, different codepages are represented by different fonts, so to switch between codepages, you would have to switch fonts, which could theoretically blow your pdf up quite a bit, but I for one, can live with that...

dredkin's answer has worked fine for me in the forward direction (unicode text to PDF representation).
I was writing an increasingly convoluted comment there about the reverse direction (PDF representation to text, when copying from the PDF document), explained by user2373071. The method referred to throughout this thread is the definition of a /ToUnicode map (which, incidentally, is optional). I found it simplest to map from glyphs to characters using the beginbfrange srcCode1 srcCode2 [ dstString1 m ] endbfrange construct.
This seems to work OK in Adobe Reader, but two glyphs (0x100 and 0x1ef) cause the mapping for cyrillic characters to fail in browsers and SumatraPDF (the copy/paste provides the glyph IDs instead of the characters. By excluding those two glyphs I made it work there. (I really can't see what's special about these glyphs, and it's independent of font (i.e. it's the same glyphs, but different characters, in Times/Georgia/Palatino, and these values are afaik identically mapped in UTF-16. Any ideas welcome!)
However, and more importantly,
I have reached the conclusion that the whole /ToUnicode mechanism is fundamentally flawed in concept, because many fonts re-use glyphs for multiple characters. Consider simple ones like 0x20 and 0xa0 (ordinary and non-breaking space); 0x2d and 0xad (hyphen and soft hyphen); these two are in the 8-bit character range. Slightly beyond that are 0x3b and 0x37e (semi-colon and greek question mark). And it would be quite reasonable to re-use cyrillic small a and latin small a, and similar homoglyphs. So the point is, in the non-ASCII world that prompts us to worry about Unicode at all, we will encountering a one-to-many mapping from glyphs to characters, and will therefore be bound to pick up the wrong character at some point - which rather removes the point of being able to extract the text in the first place.
The other method in the (1.7) PDF reference is to use /ActualText instead of /ToUnicode. This is better in principle, because completely avoids the homoglyph problem I've mentioned above, and the overhead is probably bearable, but it only seems to be implemented in Adobe Reader (i.e. I haven't got anything consistent or meaningful from SumatraPdf or four browsers).

I'm not a PDF expert, and (as Ferruccio said) the PDF specs at Adobe should tell you everything, but a thought popped up in my mind:
Are you sure you are using a font that supports all the characters you need?
In our application, we create PDF from HTML pages (with a third party library), and we had this problem with cyrillic characters...

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas