Are byte order marks allowed in PDF documents? - pdf

I'm having an issue with a filter program I wrote. It detects whether a file is a PDF document by reading the first 5 bytes of the file and comparing them to a fixed buffer:
25 50 44 46 2D
This works fine, except that I'm seeing a few files that start with a byte order mark instead:
EF BB BF 25 50 44 46 2D
^-------^
I'm wondering if that is actually allowed by the PDF specs. If I check section 7.5 of that documentation, I read it as "no":
The first line of a PDF file shall be a header consisting of the 5 characters %PDF- followed by a version number of the form 1.N, where N is a digit between 0 and 7
Yet I see these documents in the wild, and users get confused because PDF reader programs can open these documents but my filter rejects them.
So: are BOM markers allowed at the start of PDF documents? (I'm NOT talking about string objects here but the PDF file itself)

So: are BOM markers allowed at the start of PDF documents?
No, just like you read in the specification, nothing is allowed before the "%PDF" bytes.
But Adobe Reader has a long history of accepting files in spite of some leading or trailing trash bytes.
Cf. the implementation notes in Appendix H of Adobe's pdf_reference_1-7:
3.4.1, “File Header”
Acrobat viewers require only that the header appear somewhere within
the first 1024 bytes of the file.
Acrobat viewers also accept a header of the form
%!PS-Adobe-N.n PDF-M.m
...
3.4.4, “File Trailer”
Acrobat viewers require only that the %%EOF marker appear somewhere
within the last 1024 bytes of the file.
And because people tend to think that a PDF which Adobe Reader displays as desired is valid, there are many PDFs in the wild that do have trash bytes up front.
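For a filter like the one in the question, one option is to report separately whether a file is strictly valid or merely acceptable to a lenient viewer. The following is a minimal sketch, not a definitive implementation: the function name and the three labels are made up for illustration, and the lenient branch simply mirrors the "somewhere within the first 1024 bytes" note quoted above.

```python
PDF_MAGIC = b"%PDF-"


def classify_pdf_header(head: bytes) -> str:
    """Classify the first bytes of a file (pass at least the first 1024 bytes).

    Returns one of three hypothetical labels:
      "strict"  - the file starts with %PDF-, as the specification requires
      "lenient" - %PDF- appears later within the first 1024 bytes
                  (e.g. after a UTF-8 BOM), which Acrobat-style viewers accept
      "none"    - no PDF header found in that window
    """
    if head.startswith(PDF_MAGIC):
        return "strict"
    if PDF_MAGIC in head[:1024]:
        return "lenient"
    return "none"
```

A file starting with EF BB BF 25 50 44 46 2D would come back as "lenient", so the filter can accept it while still logging that the producer wrote an invalid file.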

No, a BOM is not valid at the front of a PDF file.
A PDF is a binary file format so a BOM wouldn't actually make sense, it would be like having a BOM at the front of a ZIP file or a JPEG.
I'm guessing the PDFs that you are consuming come from misconfigured applications that either already have something at the front of their output buffer or, more likely, were created with the incorrect assumption that PDF is a text-based format.

Related

Arabic pdf text extraction

I'm trying to extract text from Arabic PDFs (raw data extraction, not OCR).
I tried many packages and tools: Python packages, pdfBox, the Adobe API, and many others. All of them failed to extract the text correctly; they either read the text LTR or decode it wrongly.
Here are two samples from different tools:
sample 1:
املحتويات
7 الثانية الطبعة مقدمة
9 وتاريخه األدب -١
51 الجاهليون -٢
95 الشعر نحل أسباب -٣
149 والشعراء الشعر -٤
213 مرض شعر -٥
271 الشعر -٦
285 الجاهيل النثر -٧
sample 2:
ﺔﻴﻧﺎﺜﻟا ﺔﻌﺒﻄﻟا ﺔﻣﺪﻘﻣ
ﻪﺨﻳرﺎﺗو بدﻷا -١
نﻮﻴﻠﻫﺎﺠﻟا -٢
ﺮﻌﺸﻟا ﻞﺤﻧ بﺎﺒﺳأ -٣
ءاﺮﻌﺸﻟاو ﺮﻌﺸﻟا -٤
ﴬﻣ ﺮﻌﺷ -٥
ﺮﻌﺸﻟا -٦
ﲇﻫﺎﺠﻟا ﺮﺜﻨﻟا -٧
original text
And yes, I can copy it and get the same rendered text.
Is there any tool that can extract Arabic text correctly?
the book link can be found here
The text in a PDF is not the same as the text used for its construction. We can see that in your example, where page 7 is shown in Arabic on the surface but is coded as 7 in the plain text.
However, a greater problem is the languages supported by fonts, so in Notepad I had to accept a script font to see a similarity, but that is using a font substitution.
Another complication is Unicode and whitespace ordering.
So the result from
pdftotext -f 5 -l 5 في_الأدب_الجاهلي.pdf try.txt
at best will look like
Thus, in summary, your Sample 1 is as good as, if not better than, any other simple attempt.
Later Edit from B.A. comment below
I found a way around this: after extracting the text I open the txt file and normalize its content using the unicodedata Python module, which offers the unicodedata.normalize() function. So I can now say that pdftotext is the best tool for Arabic text extraction.
Unicode normalization should fix that issue (you can choose NFKC).
Most programming languages have a normalization function.
See here for more information about normalization:
https://unicode.org/reports/tr15/
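The normalization step described above might look like the following in Python. This is only a sketch: the function name is made up, and NFKC is used because it is the form suggested here. It maps Arabic presentation forms (as in sample 2) back to the base letters, but it does not reorder text that was extracted in visual order.

```python
import unicodedata


def normalize_extracted_text(text: str) -> str:
    """Apply NFKC normalization to text extracted from a PDF.

    NFKC replaces compatibility characters, such as Arabic presentation
    forms and ligatures, with their ordinary equivalents.
    """
    return unicodedata.normalize("NFKC", text)
```

For example, the isolated lam-alef ligature U+FEFB becomes the two letters U+0644 U+0627 after this call.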

Minimal PDF size according to specs

I'm reading PDF specs and I have a few questions about the structure it has.
First of all, the file signature is %PDF-n.m (8 bytes).
After that the docs say there might be at least 4 bytes of binary data (but there also might not be any). The docs don't say how many binary bytes there could be, so that is my first question. If I were trying to parse a PDF file, how should I parse that part? How would I know how many binary bytes (if any) were placed in there? Where should I stop parsing?
After that, there should be a body, an xref table, a trailer and an %%EOF.
What could be the minimal file size of a PDF, assuming there isn't anything at all (no objects, whatsoever) in the PDF file and assuming the file doesn't contain the optional binary bytes section at the beginning?
Third and last question: If there were more than one body+xref+trailer section, where would the offset just before the %%EOF be pointing to? The first or the last xref table?
First of all, the file signature is %PDF-n.m (8 bytes). After that the docs say there might be at least 4 bytes of binary data (but there also might not be any). The docs don't say how many binary bytes there could be, so that is my first question. If I were trying to parse a PDF file, how should I parse that part? How would I know how many binary bytes (if any) were placed in there? Where should I stop parsing?
Which docs do you have? The PDF specification ISO 32000-1 says:
If a PDF file contains binary data, as most do (see 7.2, "Lexical Conventions"), the header line shall be
immediately followed by a comment line containing at least four binary characters—that is, characters whose
codes are 128 or greater.
Thus, those at least 4 bytes of binary data do not immediately follow the file signature without any structure; they are on a comment line! This implies that they are preceded by a % (which starts a comment, i.e. data you have to ignore while parsing anyway) and followed by an end-of-line, i.e. CR, LF, or CR LF.
So it is easy to recognize while parsing. In particular it merely is a special case of a comment line and nothing to treat specially.
(sigh, I just saw you and @Jongware cleared that up in comments while I wrote this...)
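Treating the header and the binary marker as ordinary comment lines can be sketched as follows. This is a simplified illustration, not a full tokenizer: the function name is made up, and it only handles comment lines at the very start of the data, each ended by CR, LF, or CR LF.

```python
def skip_leading_comments(data: bytes) -> int:
    """Return the offset of the first byte after the leading comment lines.

    Each comment starts with % and runs to the next end-of-line marker,
    so the binary marker line needs no special handling.
    """
    pos = 0
    while pos < len(data) and data[pos:pos + 1] == b"%":
        # Advance to the end-of-line marker that terminates the comment.
        while pos < len(data) and data[pos] not in (0x0A, 0x0D):
            pos += 1
        # Consume CR LF as one marker, otherwise a single CR or LF.
        if data[pos:pos + 2] == b"\r\n":
            pos += 2
        elif pos < len(data):
            pos += 1
    return pos
```

The parser never needs to know how many binary bytes there are; it stops at the end of the line like for any other comment.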
What could be the minimal file size of a PDF, assuming there isn't anything at all (no objects, whatsoever) in the PDF file and assuming the file doesn't contain the optional binary bytes section at the beginning?
If there are no objects, you don't have a PDF file as certain objects are required in a PDF file, in particular the catalog. So do you mean a minimal valid PDF file?
As you commented you indeed mean a minimal valid PDF.
Please have a look at the question "What is the smallest possible valid PDF?" on Stack Overflow; there are some attempts to create minimal PDFs adhering more or less strictly to the specification. Reading e.g. @plinth's answer, you will see stuff that is not PDF anymore but is still accepted by Adobe Reader.
Third and last question: If there were more than one body+xref+trailer section, where would the offset just before the %%EOF be pointing to?
Normally it would be the last cross reference table/stream, as the usual use case is:
you start with a PDF which has but one cross reference section;
you append an incremental update with a cross reference section pointing to the original as previous, and the new offset before %%EOF points to that new cross reference;
you append yet another incremental update with a cross reference section pointing to the cross references from the first update as previous, and the new offset before %%EOF points to that newest cross reference;
etc...
The exception is the case of linearized documents in which the offset before the %%EOF points to the initial cross references which in turn point to the section at the end of the file as previous. For details cf. Annex F of ISO 32000-1.
And as you can of course apply incremental updates to a linearized document, you can have mixed forms.
In general it is best for a parser to be able to parse any order of partial cross references. And don't forget, there are not only cross reference sections but also alternatively cross reference streams.
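For the common, non-linearized case described above, reading the offset before the final %%EOF could be sketched like this. The function name is hypothetical, and the regular expression is a simplification that assumes the usual "startxref, number, %%EOF" layout. A real parser would then follow the previous-section links from there and would handle linearized files and cross reference streams as well.

```python
import re

# Matches "startxref <offset> %%EOF", allowing any whitespace in between.
_STARTXREF = re.compile(rb"startxref\s+(\d+)\s+%%EOF")


def last_startxref(data: bytes) -> int:
    """Return the byte offset given by the last startxref entry in the file.

    With incremental updates there are several such entries; the last one
    points at the newest cross reference section.
    """
    matches = _STARTXREF.findall(data)
    if not matches:
        raise ValueError("no startxref entry found")
    return int(matches[-1])
```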

How to search my PDF with grep?

I have followed ideas from this thread but it does not work.
https://unix.stackexchange.com/questions/6704/how-can-i-grep-in-pdf-files
pdftotext PercivalWalden.pdf - | grep 'Slepian'
pdftotext PercivalWalden.pdf - | grep 'Naive'
pdftotext PercivalWalden.pdf - | grep 'Filter'
I know for sure that 'Filter' appears at least 100 times in this book.
Any ideas?
If you really can grep a given string (that you can 'see' and read on a rendered or printed PDF page) from a PDF, even with the help of pdftotext, then you must be very lucky indeed.
First off: most of the advice from the link you provided to unix.stackexchange.com is very uninformed (to put it most politely). Most of the answers there are clearly written by people who are not familiar with the huge range of PDF variations out there.
In your case, you are trying to convert the file with the help of pdftotext first, streaming the output to stdout.
There are many types of PDF where pdftotext cannot extract the text at all. The reasons for this may be (the list below is not complete):
The "text" that you see is not based on using a font. It may be one big raster image generated by a scan or other production process, then embedded into a PDF file shell. This may make the page only appear to be text strings.
The "text" that you see is not based on using a font. It may be a series of small vector drawings (or small raster images), that only look like text strings to our eyes and brain.
There are many software applications, which do convert fonts to so-called 'outlines'. The reason for this seemingly strange behaviour may be:
Circumvent licensing problems (when a certain font disallows its embedding).
Impose a handicap upon attempts to extract the text.
Accidentally wrong setting in the PDF generating application.
The font is embedded as a subset in the PDF file (by the PDF generating software -- users usually do not have much control over the details of this operation) and uses a 'custom' encoding, but the file does not provide a toUnicode table to map the glyphs to characters.
'Glyphs' are the well-defined shapes in each font drawn on screen. Glyphs map to characters for the computer -- our eyes merely see these shapes and our brains translate these to characters without needing a toUnicode table. Programs like pdftotext require a toUnicode table to reverse the translation of glyphs back to characters.
You can use a command line utility named pdffonts to gain a first insight into the fonts used by your PDF file. Example output:
pdffonts paper-projectiris---final.pdf
name type encoding emb sub uni object ID
-------------------------- ------------ -------------- --- --- --- ---------
TCQJEF+CMCSC10 Type 1 Builtin yes yes no 96 0
VPAFLY+CMBX12 Type 1 Builtin yes yes no 97 0
CWAIXW+CMTI12 Type 1 Builtin yes yes no 98 0
OBMDLT+CMR12 Type 1 Builtin yes yes no 99 0
In this case, text extraction (and your method of grepping for strings) should work:
Even though the column named uni (telling whether a toUnicode map is embedded in the PDF file) says no for every single font, the encoding column does not contain custom but builtin (meaning that a glyph->character mapping is provided with the font file, which is of type Type 1).
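The check described above (custom encoding without a toUnicode map) can be automated on pdffonts output. The following is a rough sketch under assumptions: the function name is made up, and it assumes the column layout shown in the example, splitting each row from the right because the type column itself may contain spaces.

```python
def fonts_without_mapping(pdffonts_output: str) -> list[str]:
    """List fonts that use a custom encoding and have no toUnicode map.

    Expects rows laid out as in the example above:
    name, type, encoding, emb, sub, uni, object ID (two numbers).
    """
    risky = []
    for line in pdffonts_output.splitlines():
        # The last six fields never contain spaces, so split from the right.
        parts = line.rsplit(None, 6)
        if len(parts) != 7:
            continue
        name_and_type, encoding, _emb, _sub, uni, _obj, _gen = parts
        # Header and separator rows never match this condition.
        if encoding.lower() == "custom" and uni == "no":
            risky.append(name_and_type.split()[0])
    return risky
```

An empty result does not guarantee that grepping will work, since the text may also be an image or outlines, but a non-empty one names the fonts most likely to defeat pdftotext.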
To sum it up: Without access to your PDF file it is impossible to tell why you cannot "grep" for the strings you are looking for!

PDF format. function of %-started sequence

What is the function of the hex sequence "25 E2 E3 CF D3", found at the beginning of some documents? It should be a comment as far as I understand, but its content is not any meaningful text and the same sequence occurs in many documents.
It identifies the PDF file as containing binary data.
From the freely available PDF Reference (section 7.5.2, p. 40):
If a PDF file contains binary data, as most do (see 7.2, "Lexical Conventions"), the header line shall be
immediately followed by a comment line containing at least four binary characters—that is, characters whose
codes are 128 or greater. This ensures proper behaviour of file transfer applications that inspect data near the
beginning of a file to determine whether to treat the file’s contents as text or as binary.
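To see that the sequence from the question fits this rule, one can check that the second line of a file is a comment with at least four bytes of value 128 or greater. This is a small sketch with a made-up function name; it only inspects the first 1024 bytes.

```python
def has_binary_marker(data: bytes) -> bool:
    """Return True if the second line is a comment with >= 4 high bytes."""
    lines = data[:1024].splitlines()
    if len(lines) < 2 or not lines[1].startswith(b"%"):
        return False
    # Count the bytes after the % whose codes are 128 or greater.
    high_bytes = sum(1 for byte in lines[1][1:] if byte >= 128)
    return high_bytes >= 4
```

Here 25 is the % that starts the comment, and E2, E3, CF and D3 are the four binary characters.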

PDF font mapping error

While rendering a PDF file generated by PDFCreator 0.9.x, I noticed it contains an error in the character mapping. Now, an error in a PDF file is nothing to be surprised about: Acrobat does wonders in rendering faulty PDF files, hence a lot of PDF generators create PDFs that do not adhere fully to the PDF standard.
I tried to create a small example file: http://test.continuit.nl/temp/Document.pdf
The single page renders a single glyph (a capital A) using a Tj command (see stream 5 0 obj). The selected font (7 0 obj) has a single glyph embedded. So far so good. The character is referenced by character code 1. The font's Encoding contains a Differences part: [ 1 /A ]. Thus char 1 -> character /A. Now, in the embedded subset font there is a cmap that matches no glyph at character 65 (i.e. capital A); instead, the cmap section of the font defines the characters in exactly the order of the PDF file's Font -> Encoding -> Differences array.
It looks like the character mapping / encoding is done twice. Only Files from PDFCreator 0.9.x seem to be affected.
My question is: is this correct (or did I make a mistake and the PDF is correct), and what would you do to detect this situation in order to solve the rendering problem?
Note: I do need to be able to render these PDFs.
Solution
ISO 32000 contains a remark that for symbolic TrueType fonts (flag bit 3 is set in the font descriptor) the Encoding entry is not allowed and you should IGNORE it, always using a simple one-to-one encoding. So all in all, if it is a symbolic font, I ignore the Encoding object altogether, and this solves the problem.
The first point is that the file opens and renders correctly in Acrobat, so it's almost certain that the file is correct. In fact it opens and renders correctly in a wide range of PDF consumers, so in fact it is correct.
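The detection step from this solution can be sketched as a small flag test. The names below are made up for illustration. The mask assumes the specification's convention of numbering flag bits from 1 at the low-order end, so bit position 3 corresponds to the value 4.

```python
# Bit position 3 of the font descriptor Flags entry (positions start at 1).
SYMBOLIC_FLAG = 1 << 2


def should_ignore_encoding(font_subtype: str, descriptor_flags: int) -> bool:
    """Return True for a symbolic TrueType font.

    For such fonts the Encoding entry is skipped and the character code
    is looked up directly in the font's own cmap.
    """
    is_symbolic = bool(descriptor_flags & SYMBOLIC_FLAG)
    return font_subtype == "TrueType" and is_symbolic
```

With this check, character code 1 from the example file would be passed to the embedded cmap as-is instead of being translated to /A first.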
The font in question is a TrueType font, so actually yes, there are two kinds of 'encoding'. First there is PDF/PostScript Encoding. This maps a character code into a glyph name. In your case it maps character code 1 to glyph name /A.
In a PostScript font we would then look up the name /A in the CharStrings dictionary, and that would give us the character description, which we would then execute. Things are different with a TrueType font though.
You can find this on page 430 of the 1.7 PDF Reference Manual, where it states that:
"A TrueType font program’s built-in encoding maps directly from character codes to glyph descriptions by means of an internal data structure called a “cmap” (not to be confused with the CMap described in Section 5.6.4, “CMaps”)."
I believe in your case that you simply need to use the character code (0x01) directly in the CMAP sub table. This will give you a GID of 36.