I'm looking for a grammar of PDF 1.7 (BNF or variant)
It's absolutely not googleable.
I am not aware of any formal specification of the PDF file format in the form of a grammar, BNF or not.
But I happen to know that the ISO technical committee 171/SC2, which currently works on the specification of PDF 2.0, has an agenda topic of "Updates from ad hoc committees: [...] iv. File format syntax for validating PDF files (L. Rosenthol)" for its next face-to-face meeting, taking place in Berlin, Sept 11-12, 2012 -- an agenda item I take as "some more people seem to be interested in a more formal description of the PDF syntax"... :-)
Leonard Rosenthol is an Adobe PDF higher-up, and he frequently answers questions in the Adobe user forums. Maybe it is a good idea to ask a question there? Chances are you'll get a better answer there than here.
PDF is a binary format whose syntax is not context-free. In PDF, for example, you need to read and interpret the length of a binary stream before you can parse past the stream.
Example:
10 0 obj
<</Type /XObject
/Subtype /Image
/Width 260
/Height 52
/ColorSpace /DeviceRGB
/SMask 10 0 R
/BitsPerComponent 8
/Filter /FlateDecode
/Length 4570>> stream
--- insert binary data here ---
endstream
endobj
There is no way to tell whether the binary data will itself contain the tokens endstream or endobj, so you have no choice but to read the length of the stream before parsing it.
BNF can only be used for context-free grammars, so it is not possible to construct a BNF grammar for PDF.
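To see where a grammar attempt breaks down, here is a naive EBNF-style sketch of a stream object (the notation and nonterminal names are mine, for illustration only):
stream_obj  ::= dict "stream" EOL stream_data EOL "endstream"
dict        ::= "<<" { key value } ">>"
stream_data ::= ?   (* exactly /Length bytes - a context-sensitive constraint *)
The stream_data production cannot be completed without consulting the value of the /Length entry parsed earlier, which is precisely the kind of dependency a context-free grammar cannot express.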
Take a look at the specification here:
PDF Reference Document
Related
I have a few PDF files which are in the Urdu language, and some of the PDF files are in the Arabic language.
I want to convert the PDF files to text format. I have issued the following Ghostscript code from the command line in my Windows 10 system:
gswin64c.exe -sDEVICE=txtwrite -o output.txt new.pdf
The text file is generated; however, the contents of the text file are not in Urdu or Arabic.
This is what it looks like (I have pasted a portion of the output, as it is huge):
ی첺جⰧ�� ہ셈ے
How can I properly convert PDF to text using Ghostscript?
Well basically the answer is that the PDF files you have supplied have 'not terribly good' ToUnicode CMap tables.
Looking at your first file we see that it uses one font:
26 0 obj
<<
/BaseFont /CCJSWK+JameelNooriNastaleeq
/DescendantFonts 28 0 R
/Encoding /Identity-H
/Subtype /Type0
/ToUnicode 29 0 R
/Type /Font
>>
endobj
That has a ToUnicode CMap in object 29; the ToUnicode CMap is meant to map character codes to Unicode code points. Looking at the first piece of text as an example we see:
/C2_0 1 Tf
13 0 0 13 39.1302 561.97 Tm
<0003>Tj
/Span<</ActualText<FEFF0645062A>>> BDC
<38560707>Tj
So that's character code 0x0003 (notice there is no marked content for the first character); looking at the ToUnicode CMap we see:
<0003> <0020>
So character code 0x0003 maps to Unicode code point U+0020, a space. The next two character codes are 3856 and 0707. Again consulting the ToUnicode CMap we see:
<3856> <062A0645>
So that single character code maps to two Unicode code points, U+062A and U+0645, which are 'Teh' ت and 'Meem' م.
So far so good. The next code is 0707; when we look it up in the ToUnicode CMap it comes up as 0xFFFD, which is the 'replacement character' �. Obviously that's meaningless.
We then have this:
0.391 0 Td
[<011C07071FEE>1 <0003>243.8 <2E93>]TJ
/Span<</ActualText<FEFF0644>>> BDC
<0707>Tj
EMC
So that's character codes 0x011C, 0x0707, 0x1FEE, 0x0003, 0x2E93 followed by 0x0707. Notice that the final <0707> is associated with a Marked Content definition which says the ActualText is Unicode 0x0644, which is the 'Lam' glyph ل
So clearly the ToUnicode CMap should be mapping 0707 to U+0644, and it doesn't.
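For illustration, the relevant bfchar entry presumably looks like the first line below, while a correct one would look like the second (reconstructed from the analysis above, not copied from the file):
<0707> <FFFD>   % what the broken ToUnicode CMap effectively yields
<0707> <0644>   % what a correct entry for 'Lam' should look like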
Now, when given a ToUnicode CMap, the text extraction code trusts it. So your problem with this file is that the ToUnicode CMap is 'wrong', and that's why the text is coming out incorrectly.
I haven't tried to debug further through the file, it is possible there are other errors.
Your second file has this ToUnicode CMap:
26 0 obj
<<
/Length 606
>>
stream
/CIDInit /ProcSet findresource begin 12 dict begin begincmap /CIDSystemInfo <<
/Registry (AABACF+TT1+0) /Ordering (T42UV) /Supplement 0 >> def
/CMapName /AABACF+TT1+0 def
/CMapType 2 def
1 begincodespacerange <0003> <0707> endcodespacerange
15 beginbfchar
<0003> <0020>
<0011> <002E>
<00e7> <062A>
<00ec> <062F>
<00ee> <0631>
<00f3> <0636>
<00f8> <0641>
<00fa> <0644>
<00fc> <0646>
<00fe> <0648>
<0119> <0647>
<011a> <064A>
<0134> <0066>
<013b> <006D>
<0707> <2423>
endbfchar
2 beginbfrange
<00e4> <00e5> <0627>
<011f> <0124> <0661>
endbfrange
endcmap CMapName currentdict /CMap defineresource pop end end
The first text in the file is:
<3718>Tj
And again, that's not in the CMap. Because the text extraction code prioritises the CMap (because it's usually reliable), the missing entries cause the extraction to basically fail.
In addition to the fact that the ToUnicode CMaps are incorrect, the embedded fonts are subset and use an Identity-H CMap for drawing. That eliminates another source of information we could use.
Fundamentally, the only way you're going to get text out of that PDF file is manual transcription or OCR software.
Since you are using Ghostscript on Windows, the distributed binary includes Tesseract so you could try using that with pdfwrite and an Urdu training file to produce a PDF file with a possibly better ToUnicode CMap. You could then extract the text from that PDF file.
You would have to tell the pdfwrite device not to use the embedded ToUnicode CMaps; see the UseOCR switch documented here: https://ghostscript.com/doc/9.56.1/VectorDevices.htm#PDF
And information on setting up the OCR engine and getting output here https://ghostscript.com/doc/9.56.1/Devices.htm#OCR-Devices
You may get better results by using an 'image' OCR output and then using the text extraction on that file to get the text out.
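For example, something along these lines might work (a sketch only; the exact switches are described in the linked documentation, and you need Urdu Tesseract training data available to Ghostscript):
gswin64c.exe -sDEVICE=pdfwrite -sUseOCR=Always -sOCRLanguage=urd -o ocr.pdf new.pdf
gswin64c.exe -sDEVICE=txtwrite -o output.txt ocr.pdf
Or, going the 'image' route mentioned above, the ocr device renders each page to a bitmap and runs Tesseract on that directly:
gswin64c.exe -sDEVICE=ocr -sOCRLanguage=urd -o output.txt new.pdf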
The Adobe Glyph List (AGL) is described as
a mapping of 4,281 glyph names to one or more Unicode characters.
From what I understand, these are PDF names like /Adieresis that allow specifying the respective Unicode character (U+00C4), and if my understanding is correct, those 4,281 names can be used to specify a mapping, as done here for the font named /F1 in a page's /Resources dictionary:
<<
/Type /Page
/Resources <<
/Font <<
/F1 <<
/Type /Font
/Subtype /Type1
/BaseFont /Times-Roman
/Encoding <<
/Differences [ 1 /Adieresis /adieresis ]
>>
>>
>>
>>
>>
The key issue, which I cannot wrap my head around, is that via the /Differences array and the predefined AGL names I would only be able to use those 4,281 glyphs/characters from the base/built-in/standard set of PDF fonts, wouldn't I?
Basically what I am asking is whether it is correct that displaying text containing any character not included in those 4,281 AGL characters would be impossible without embedding those glyphs into the produced PDF?
Also I am confused that there is a /ToUnicode feature in PDF allowing one to associate glyphs/CMaps of embedded fonts with the Unicode characters those glyphs should represent (hence there was some thinking about "Unicode"), yet I cannot seem to find a way to use any reasonable Unicode code points or halfway-working encoding (e.g. UTF-8) to make use of the built-in fonts in PDF.
So is my assumption correct that, without going to the length of generating a font to embed within a PDF file, the text can only ever come from the set of those 4,281 characters?
In order to support all 65,536 code points of Unicode's Basic Multilingual Plane, it would be required to generate a font containing the glyphs used in the text, since apart from those 4,281 AGL glyphs there seems to be no way to reference those Unicode characters, correct?
Motivation
It would be nice to have a way in PDF that would be the equivalent of HTML5's
<meta charset="utf-8">, allowing text to be encoded in one simple Unicode-compatible encoding without having to deal with complicated things such as CIDs/GIDs/PostScript glyph names etc.
This answer first discusses the use of non-AGL names in differences arrays and the more encompassing encodings of composite fonts. Then it discusses which fonts a viewer actually does have to have available. Finally it considers all this in light of the clarifications accompanying your bounty offer.
AGL names and Differences arrays
First let's consider the focal point of your original question,
The key issue, which I cannot wrap my head around, is that via the /Differences array and the predefined AGL names I would only be able to use those 4,281 glyphs/characters from the base/built-in/standard set of PDF fonts, wouldn't I?
Basically what I am asking is whether it is correct that displaying text containing any character not included in those 4,281 AGL characters would be impossible without embedding those glyphs into the produced PDF?
i.e. your assumption is that only those 4,281 AGL glyph names can be used in the Differences array of the encoding entry of a simple font.
This is not the case; you can also use arbitrary names not found on the AGL. E.g. using this font
7 0 obj
<<
/Type /Font
/Subtype /TrueType
/BaseFont /Arial
/FirstChar 32
/LastChar 32
/Widths [500]
/FontDescriptor 8 0 R
/Encoding 9 0 R
>>
endobj
8 0 obj
<<
/Type /FontDescriptor
/FontName /Arial
/FontFamily (Arial)
/Flags 32
/FontBBox [-665.0 -325.0 2000.0 1040.0]
/ItalicAngle 0
/Ascent 1040
/Descent -325
/CapHeight 716
/StemV 88
/XHeight 519
>>
endobj
9 0 obj
<<
/Type /Encoding
/BaseEncoding /WinAnsiEncoding
/Differences [32 /uniAB55]
>>
endobj
the instruction
( ) Tj
shows you a ꭕ ('LATIN SMALL LETTER CHI WITH LOW LEFT SERIF', U+AB55, which, as far as I can see, is not on the AGL) on a system with Arial (ArialMT.ttf) installed.
Thus, to display an arbitrary glyph, you merely need a font you know containing that glyph with a name you know available to the PDF viewer in question. The name doesn't have to be an AGL name, it can be arbitrary!
Encodings of composite fonts
Furthermore, with composite fonts you often aren't even required to enumerate the characters you need, as long as the required characters are covered by the same named encoding!
Here the Encoding shall be
The name of a predefined CMap, or a stream containing a CMap that maps character codes to font numbers and CIDs. If the descendant is a Type 2 CIDFont whose associated TrueType font program is not embedded in the PDF file, the Encoding entry shall be a predefined CMap name (see 9.7.4.2, "Glyph Selection in CIDFonts").
And among the predefined CMaps there are numerous CJK ones. As long as the viewer in question has access to a matching font, you can use a composite font with such an encoding to get access to a lot of CJK glyphs.
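For instance, a composite font using a predefined CMap might look like this minimal sketch (the object numbers and the Adobe-GB1 example font STSong-Light are illustrative, and the font descriptor object is not shown; the viewer needs a matching Adobe-GB1 font available, e.g. via Adobe Reader's Extended Asian Language Pack):
14 0 obj
<<
/Type /Font
/Subtype /Type0
/BaseFont /STSong-Light
/Encoding /UniGB-UCS2-H
/DescendantFonts [15 0 R]
>>
endobj
15 0 obj
<<
/Type /Font
/Subtype /CIDFontType0
/BaseFont /STSong-Light
/CIDSystemInfo << /Registry (Adobe) /Ordering (GB1) /Supplement 2 >>
/FontDescriptor 16 0 R
>>
endobj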
Which fonts does a viewer have to have available?
Thus, if the viewer in question has appropriate fonts available, you don't need to embed font programs to display any glyph. But which fonts does a viewer have available?
Usually a viewer will allow access to all fonts registered with the operating system it is running on, but strictly speaking it only has to have very few fonts accessible: PDF processors supporting PDF 1.0 to PDF 1.7 files only need to know the so-called standard 14 fonts, and pure PDF 2.0 processors need to know none at all.
Annex D of the specification clarifies the character ranges to support:
All characters listed in D.2, "Latin character set and encodings" shall be supported for the Times, Helvetica, and Courier font families, as listed in 9.6.2.2, "Standard Type 1 fonts (standard 14 fonts) (PDF 1.0-1.7)" by a PDF processor that supports PDF 1.0 to 1.7.
D.4, "Symbol set and encoding" and D.5, "ZapfDingbats set and encoding" describe the character sets and built-in encodings for the Symbol and ZapfDingbats (ITC Zapf Dingbats) font programs, which belong to the standard 14 predefined fonts.
D.2 essentially is a table describing the StandardEncoding, MacRomanEncoding, WinAnsiEncoding, and PDFDocEncoding. These all are very similar single byte encodings.
D.4 and D.5 contain a single table each describing additional single byte encodings.
Thus, all you can actually expect from a PDF 1.x viewer are these fewer than 1,000 characters!
(You wondered about this in comments to this answer to another question of yours.)
Concerning your clarifications
In your text accompanying your bounty offer you expressed a desire for
being enabled to create a "no frills" program that is able to generate PDF files, where the input data are UTF-8 Unicode strings. "No frills" being a reference to the fact that such software would ideally be able to skip handling font program data (such as creating a subset font program for inclusion into the PDF).
As explained above, you can do so, either by customized encodings of a number of simple fonts or by the more encompassing named encodings of composite fonts. If you know that the target PDF viewer has these fonts available, that is!
sketch a way that would actually allow characters from at least the Adobe-GB1 charset, as referenced via "UniCNS-UTF16-H", to be rendered in PDF viewers while the PDF file does not have any font program embedded to achieve this.
"UniCNS-UTF16-H" just happens to be one of the predefined encodings allowable for composite fonts. Thus, you can use a composite font with this encoding without embedding the font program, as long as the viewer has the appropriate fonts accessible. As far as Adobe Reader is concerned, this usually amounts to having the Extended Asian Language Pack installed.
the limitations to use anything else the WinAnsiEncoding, MacRomanEncoding, MacExpertEncoding with those 14 standard fonts.
As explained above, you can merely count on fewer than 1,000 glyphs being available for sure in an arbitrary PDF 1.x viewer. In a pure PDF 2.0 viewer you actually cannot count on even that!
The specification quotes above are from ISO 32000-2; similar requirements can already be found in ISO 32000-1.
Without embedded fonts, is PDF limited to only 4,281 characters (of the AGL)?
No. Though you should embed fonts to help ensure that the PDF looks the same everywhere.
Basically what I am asking is whether it is correct that displaying text containing any character not included in those 4,281 AGL characters would be impossible without embedding those glyphs into the produced PDF?
It is possible, yes, though you would ideally stick with a "standard" encoding, such as one of the Orderings. See the "Predefined CMaps" in the PDF specification for these.
If you start making changes to the encoding, such as using Differences, then you are making run time font substitution for the PDF processing program much more difficult.
Regarding /ToUnicode: that is just for text extraction and has nothing to do with rendering. If you stick with a standard encoding as recommended above, it is not needed.
There is no 4,281-glyph limit inherent in PDF. I think you are a bit confused; you don't have to embed fonts in a PDF. Besides the standard 14 fonts all PDF viewers should be able to handle, PDF software is going to look for fonts installed on the system when they are not embedded, so it's not as if, without embedded fonts, you lose the ability to display glyphs at all.
You would define a different encoding with the Differences array if the base encoding doesn't reflect what is in the font.
ToUnicode comes into play for text extraction vs text showing.
I've built a PDF using PDFBox. I have a visible signature too. I write some text like this:
...
builderSting.append("Tm\n");
builderSting.append(" /F1 " + fontSize + "\n");
builderSting.append("Tf\n");
builderSting.append("(hello world)");
builderSting.append("Tj\n");
builderSting.append("ET");
...
PDStream stream= ...;
stream.createOutputStream().write(builder.toString().getBytes("ISO-8859-1"));
Everything works well, but if I write some Unicode characters in builderSting, there are "???"s instead of text.
Here's a sample PDF: link here
QUESTION 1) When I look at the PDF structure, there are question marks instead of text. I don't know how to write Unicode characters - how can I do that?
9 0 obj
<<
/Type /XObject
/Subtype /Form
/BBox [100 50 0 0]
/Matrix [1 0 0 1 0 0]
/Resources <<
/Font 11 0 R
/XObject <<
/img0 12 0 R
>>
/ProcSet [/PDF /Text /ImageB /ImageC /ImageI]
>>
/FormType 1
/Length 13 0 R
>>
stream
q 93.70079 0 0 50 0 0 cm /img0 Do Q
BT
1 0 0 1 93.70079 25 Tm
/F1 2
Tf
(????)Tj
ET
endstream
endobj
I have a font with encoding WinAnsiEncoding. Can I use another encoding in PDFBox?
PDFont font = PDTrueTypeFont.loadTTF(template, new File("//fontName.ttf"));
font.setFontEncoding(new WinAnsiEncoding());
QUESTION 2) I've embedded a font in the PDF, but the text is not written with this font (in the visible signature rectangle). Why?
Question 3) When I remove the font, the text is still there (when the text is in English). What is the default font? /F1 - which is the 1st font?
Question 4) How do I calculate the width of my text in the visible signature? Any ideas?
QUESTION 1) When I look at the PDF structure, there are question marks instead of text. I don't know how to write Unicode characters - how can I do that?
I assume that by unicode characters you mean characters present in Unicode but not in e.g. Latin-1. (The letter 'a', for example, has a Unicode representation too, but most likely won't cause you trouble.)
You call getBytes("ISO-8859-1") on your StringBuilder result. Your Unicode characters most likely are not in ISO 8859-1. Thus, String.getBytes returns the ASCII code for a question mark in their respective places.
If the question were merely how to write to an output stream with Unicode characters in Java, the answer would be easy: choose an encoding which contains all your characters, e.g. UTF-8, which all consumers of your program support, and call String.getBytes for that encoding.
The case at hand is different, though, as you want to serialize that information as a PDF form XObject stream. In this context your whole approach is somewhere along the route from highly questionable to completely wrong:
In PDFs, each font might come along with its own encoding, which might be similar to a common encoding, e.g. /WinAnsiEncoding, or completely custom. These encodings, furthermore, in many cases are restricted to one byte per character, but in the case of composite fonts they can also be multi-byte encodings.
As a corollary, not all elements of the stream need to be encoded using the same encoding. E.g. the operator names Tm, Tf, and Tj are encoded using their ASCII codes, while the characters of a string to be displayed have to be encoded using the respective font's encoding (and may thereafter be yet again hex-encoded if added in angle brackets <>).
Thus, creating the stream as a string and then converting it to bytes with a single encoding only works if all used fonts use the same encoding (for the actually used code points), which furthermore needs to be ASCII-like to correctly represent the operators.
Essentially, you should directly construct the stream in some byte buffer and for each inserted element use the appropriate encoding. In case of characters to be displayed, therefore, you have to be aware of the encoding used by the currently selected font.
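A minimal sketch of that idea in Java (encodeForFont is a hypothetical stand-in; a real implementation would map each character to the byte code(s) defined by the currently selected font's encoding):
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

public class ContentStreamSketch {
    // Hypothetical stand-in: a real implementation would translate each
    // character to the code(s) the selected font's encoding defines for it.
    static byte[] encodeForFont(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1); // placeholder only
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream content = new ByteArrayOutputStream();
        // Operators and operands are written as ASCII bytes ...
        content.write("BT\n/F0 12 Tf\n1 0 0 1 93.7 25 Tm\n(".getBytes(StandardCharsets.US_ASCII));
        // ... while the string to be shown uses the font's own encoding.
        content.write(encodeForFont("hello"));
        content.write(")Tj\nET".getBytes(StandardCharsets.US_ASCII));
        System.out.write(content.toByteArray());
        System.out.flush();
    }
}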
If you want to do it right, first study the PDF specification ISO 32000-1, especially the sections on general syntax and chapter 9 Text.
QUESTION 2) I've embedded a font in the PDF, but the text is not written with this font (in the visible signature rectangle). Why?
In the resources of the stream XObject in question there is exactly one embedded font, associated with the name /F0. In your stream, though, you have /F1 2 Tf, i.e. you select a font /F1 at size 2.
Question 3) When I remove the font, the text is still there (when the text is in English). What is the default font?
According to the specification, section 9.3.1,
font shall be the name of a font resource in the Font subdictionary of the current
resource dictionary [...]
There is no initial value for either font or size
Most likely, though, PDF viewers for the sake of compatibility with old or broken documents use some default font.
Question 4) How do I calculate the width of my text in the visible signature? Any ideas?
The width obviously depends on the metrics of the font used (glyph widths in this case) and the graphics state you set (font size, character spacing, word spacing, current transformation matrix, text transformation matrix, ...).
In your case you hardly do anything in the graphics state and, therefore, only the selected font size from it is of interest. So the more interesting part is the character widths from the font metrics. As long as you use the standard 14 fonts, you can find the metrics here. As soon as you start using other, custom fonts, you have to read the metrics from the font definition files yourself.
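With PDFBox, which the question uses, a sketch along these lines should work for the standard 14 fonts (PDFBox 1.x/2.x API; getStringWidth returns the width in 1/1000 text-space units, so it has to be scaled by the font size):
import org.apache.pdfbox.pdmodel.font.PDFont;
import org.apache.pdfbox.pdmodel.font.PDType1Font;

public class TextWidth {
    public static void main(String[] args) throws Exception {
        PDFont font = PDType1Font.HELVETICA; // one of the standard 14 fonts
        float fontSize = 2f;
        // Width of the shown string in user-space units at the given size.
        float width = font.getStringWidth("hello world") / 1000f * fontSize;
        System.out.println("Text width: " + width);
    }
}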
Ad 1)
Could it be that
stream.createOutputStream().write(builder.toString().getBytes("ISO-8859-1"));
should be
stream.createOutputStream().write(builderString.toString().getBytes("UTF-8"));
The conversion in getBytes to ISO-8859-1 turns every character missing from ISO-8859-1 into a ?.
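A quick way to observe that effect (standard Java behavior: String.getBytes substitutes '?' for characters the target charset cannot represent):
import java.nio.charset.StandardCharsets;

public class LossyDemo {
    public static void main(String[] args) {
        // ARABIC LETTER MEEM (U+0645) is not representable in ISO-8859-1,
        // so getBytes substitutes the byte for '?' (0x3F).
        String s = "hello \u0645";
        byte[] b = s.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(new String(b, StandardCharsets.ISO_8859_1)); // prints: hello ?
    }
}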
I want to create a PDF file which does not contain angle brackets in its source.
This apparently implies not using the dictionary data type, as this involves << and >>.
Is it possible to completely avoid angle brackets and still create a PDF file with formatted content?
Can it be done by hiding the data in a stream, using a character-encoding technique, or with an alternative dictionary notation?
The solution is needed for an obfuscation technique; the bracket problem cannot be circumvented.
I think this is not possible. Every element in a PDF file is contained in some dictionary: the document catalog (the root dictionary), the page objects, the page content streams; all of these involve dictionaries that require the character sequences << and >>.
Sample catalog dictionary:
1 0 obj
<<
/Pages 2 0 R
/Type /Catalog
>>
endobj
If you want to use an "instruction sequence only" presentation format, you may try using PostScript instead.
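For instance, a minimal PostScript program needs no dictionary syntax at all (a sketch; this is PostScript, not PDF):
%!PS
/Times-Roman findfont 12 scalefont setfont
72 720 moveto
(no angle brackets anywhere in this file) show
showpage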
Edit after comments:
Using a stream object with some filter encoding will not solve your problem, since you still need to specify the filter type in the stream dictionary.
Example:
5 0 obj
<</Length 6 0 R /Filter /FlateDecode>>
stream
***illegible characters***
endstream
endobj
I am demoing an idea I have been playing around with, and while the Adobe specification says that including PS XObjects is not a good idea, some PDF readers should still support this functionality. Anyway, that is beside the point. I have been using the Adobe PDF specification and have the following PDF object. This merely uses PostScript to generate a pseudo-random value and then print it to the page. Ideally, each time this page is rendered, a new value should be displayed:
5 0 obj
<< /Type/XObject
/Subtype/PS
/Length 103
>>
stream
/Times findfont 10 scalefont setfont
/str 32 string def
10 20 moveto
rand str cvs show
endstream
endobj
Each time any PDF viewer I have tested this against reads this object, I get errors such as "Error (741): Missing 'endstream'", and similarly for every token in that stream. I am sure my offsets are correct. And while I know my PDF viewer does support some PS for forms and such, is there anything obviously incorrect? If anyone has a sample PDF I can work from, that would be nice. The form examples that I tested my reader against have not been too helpful. If I run just the PS code through GhostView, it works fine. Thanks for any insight.
I've scoured my back collection of PDF files and come up with 2 which contain PS XObjects (this really is deprecated). I can't, unfortunately, share them as they are customer data files :-(
However, here is an extract from one of them:
74 0 obj
<<
/Type /XObject
/Subtype /PS
/Filter /FlateDecode
/Length 77 0 R
/Name /Ps1
>>
stream
....endstream
Note 1: there is no EOL between the end of the data and the 'endstream' token.
77 0 obj
4480
endobj
The offset of the 0x0A following the 'stream' token is 0xdab15, and the offset of the 'e' in 'endstream' is 0xdbc96. The difference is 4481 bytes, i.e. 4480 bytes of stream data after the EOL, matching the /Length of 4480. So it looks to me like the /Length should count all the bytes after the EOL for the 'stream' token, right up to the last byte before the 'e' in the 'endstream' token.
I think it would be OK to insert a 0x0A after the stream data and before the endstream. That would amount to whitespace between the stream data and the token, and PDF is supposed to be tolerant of whitespace.
This is consistent with the description of the /Length entry for stream dictionaries in Table 3.4 (p62 of the 1.7 PDF reference):
The number of bytes from the beginning of the line following the keyword stream to the last byte just before the keyword endstream. (There may be an additional EOL marker, preceding endstream, that is not included in the count and is not logically part of the stream data.) See "Stream Extent," above, for further discussion.
I think (if I've counted correctly) that the /Length in your example should be 87, assuming one byte line terminators in the PostScript fragment.
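If that counting is right, a corrected version of the object from the question would look like this (assuming single-byte 0x0A line endings throughout the stream; reconstructed for illustration):
5 0 obj
<< /Type /XObject
/Subtype /PS
/Length 87
>>
stream
/Times findfont 10 scalefont setfont
/str 32 string def
10 20 moveto
rand str cvs show
endstream
endobj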