Convert PDF from version 1.1 to 1.4 (or higher)

How can I convert PDF files from version 1.1 to 1.4 (or higher)?
Actually, I need some sort of command-line tool for batch converting, or an API so I can convert several documents dynamically.

Use the Ghostscript tool:
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -o output.pdf input.pdf
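For batch conversion, you can wrap the same command in a small script. A minimal Python sketch (the directory names are placeholders, and it assumes the gs binary is on your PATH):

# Minimal batch sketch: re-write every PDF in a directory as PDF 1.4 with Ghostscript.
import pathlib
import subprocess

src_dir = pathlib.Path("input_pdfs")      # placeholder source directory
out_dir = pathlib.Path("converted_pdfs")  # placeholder output directory
out_dir.mkdir(exist_ok=True)

for pdf in src_dir.glob("*.pdf"):
    subprocess.run(
        ["gs", "-sDEVICE=pdfwrite", "-dCompatibilityLevel=1.4",
         "-o", str(out_dir / pdf.name), str(pdf)],
        check=True,
    )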

PDF 1.1 is forward compatible with PDF 1.4. Everything in PDF 1.1 will work with PDF 1.4 - it's guaranteed by the spec. Let's assume that you've got some justifiable reason why this is not good enough for you (let's assume, for example, that you have a non-spec-compliant tool that consumes PDF and explodes on any file version less than 1.4).
We can focus on the main syntactic differences between versions.
All PDF files have a header somewhere in the first 1024 bytes. In most cases, it's the very first line, but that's not guaranteed (I'm looking at you, Ghostscript!). The header looks like this in PDF 1.1:
%PDF-1.1
in PDF 1.4, it looks like this:
%PDF-1.4
So in theory, all you need is a tool that will look in the first 1024 bytes of a file for "%PDF-1.1" and change it to "%PDF-1.4". You could use sed, perl, etc. to do something like that for you. You could write it in C, and you would be tempted to do something like this:
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define PDFHEADERSIZE 1024

bool ChangeFileToNewPdfVersion(char *file)
{
    char *replacePoint = NULL;
    char buf[PDFHEADERSIZE + 1];
    FILE *fp = fopen(file, "r+b");  /* open for reading and in-place updating */
    if (fp == NULL) { return false; }
    buf[PDFHEADERSIZE] = '\0';
    if (fread(buf, 1, PDFHEADERSIZE, fp) != PDFHEADERSIZE) { fclose(fp); return false; }
    fseek(fp, 0, SEEK_SET);
    if ((replacePoint = strstr(buf, "%PDF-1.1")) == NULL) { fclose(fp); return false; }
    replacePoint[7] = '4';
    if (fwrite(buf, 1, PDFHEADERSIZE, fp) != PDFHEADERSIZE) { fclose(fp); return false; }
    fflush(fp);
    fclose(fp);
    return true;
}
which will work in most sane cases. It will not work if the file starts, for example, with NUL (zero) bytes, which would serve as null terminators in the block of data.
A better choice (really) would be to cobble up a simple state machine that finds "%PDF-1." by reading one byte at a time until it either finds it or passes byte 1017 (1024 less the header length). It then reads the next byte; if it's a '1', it seeks back one byte and writes a '4'.
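For illustration, a minimal Python sketch of that idea (it works on raw bytes, using a bytes search in place of an explicit state machine, so NUL bytes in the header are harmless):

# Minimal sketch: patch "%PDF-1.1" to "%PDF-1.4" in the first 1024 bytes, in place.
def bump_pdf_header_version(path):
    with open(path, "r+b") as f:
        head = f.read(1024)
        pos = head.find(b"%PDF-1.1")
        if pos == -1:
            return False          # header not found (or already a newer version)
        f.seek(pos + 7)           # offset of the minor-version digit
        f.write(b"4")
    return True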
The only other thing you would need to worry about is that PDF 1.4 suggests that the document catalog should contain a Version key with the file version. Since this is defined as optional in the spec, you are safe to ignore it.
So this will solve your problem. I do not, however, believe that you should need to do this. Really.
You should take some time to read part of the PDF spec, specifically section I.2 about version numbers and compatibility.

I just had this problem. Trying to submit some PDFs to a financial institution. "We only support PDF 1.4 or newer". Apparently our HP scanner creates version 1.3 PDFs.
I opened the PDF file with Notepad++ and changed the 3 to a 4 and saved it. It was that simple.
It's the very first part of the file and it's in plain text.

Another option for a small number of PDF files is to open them in Chrome or another browser, then save as PDF or print to PDF. In my case, using Chrome, it saved to a newer PDF version and the bank accepted it.


How to check if a PDF has any kind of digital signature

I need to understand if a PDF has any kind of digital signature. I have to manage huge PDFs, e.g. 500MB each, so I just need to find a way to separate non-signed from signed (so I can send just signed PDFs to a method that manages them). Every procedure I have found until now involves an attempt to extract the certificate via e.g. BouncyCastle libs (in my case, for Java): if it is present, the PDF is signed; if it is not present or an exception is raised, it is not (sic!). But this is obviously time/memory consuming, and an example of a resource-wasting implementation.
Is there any quick, language-independent way, e.g. opening the PDF file, reading the first bytes, and finding some info telling that the file is signed?
Alternatively, is there any reference manual describing in detail how a PDF is built internally?
Thank you in advance
You are going to want to use a PDF Library rather than trying to implement this all yourself, otherwise you will get bogged down with handling the variations of Linearized documents, Filters, Incremental updates, object streams, cross-reference streams, and more.
With regards to reference material; per my cursory search, it looks like Adobe is no longer providing its version of the ISO 32000:2008 specification to any and all, though that specification is mainly a translation of the PDF v1.7 Reference manual to ISO-conforming language.
So assuming the PDF v1.7 Reference, the most relevant sections are going to be 8.7 (Digital Signatures), 3.6.1 (Document Catalog), and 8.6 (Interactive Forms).
The basic process is going to be:
Read the Document Catalog for 'Perms' and 'AcroForm' entries.
Read the 'Perms' dictionary for 'DocMDP', 'UR', or 'UR3' entries. If these entries exist, in all likelihood you have either a certified document or a Reader-enabled document.
Read the 'AcroForm' entry (make sure that you do not have an 'XFA' entry, because in the words of Frazier from Porgy and Bess: dat's a complication!). You basically want to first check whether there is an (optional) 'SigFlags' entry, in which case a non-zero value indicates that there is a signature in the Fields array. Otherwise, you need to walk each entry of the 'Fields' array looking for a field dictionary with an 'FT' (Field Type) entry set to 'Sig' (signature) and a 'V' (Value) entry that is not null.
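If you would rather prototype this walk with a library, here is a rough sketch using the Python pypdf package (the library choice and key handling are assumptions on my part; it covers only the AcroForm part, not 'Perms', and ignores encryption and XFA):

# Rough sketch of the AcroForm walk described above, using pypdf.
from pypdf import PdfReader

def has_signature_field(path):
    reader = PdfReader(path)
    root = reader.trailer["/Root"]
    acroform = root.get("/AcroForm")
    if acroform is None:
        return False
    acroform = acroform.get_object()      # resolve an indirect reference, if any
    if acroform.get("/SigFlags", 0):      # non-zero SigFlags: a signature exists
        return True
    fields = acroform.get("/Fields")
    if fields is None:
        return False
    for field in fields.get_object():
        field = field.get_object()
        if field.get("/FT") == "/Sig" and field.get("/V") is not None:
            return True
    return False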
Using a PDF library that can use the document's cross-reference table to navigate you to the right indirect objects should be faster and less resource-intensive than a brute-force search of the document for a certificate.
This is not the optimal solution, but it is another one... you can check for "/SigFlags" and stop at the first match:
grep -m1 "/SigFlags" "${PDF_FILE}"
or get such files inside a directory:
grep -r --include=*.pdf -m1 -l "/SigFlags" . > signed_pdfs.txt
grep -r --include=*.pdf -m1 -L "/SigFlags" . > non_signed_pdfs.txt
Grep can be very fast for big files. You can run that in a batch for a while and process the resulting lists (.txt files) afterwards.
Note that the file could have been modified incrementally after a signature was applied, and the last revision might not be signed; that is what "signed" should really mean here.
Anyway, if the file doesn't contain a /SigFlags string, it is almost certain that it was never signed.
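If you prefer to do the same pre-filtering from code instead of grep, a rough Python equivalent might look like this (directory and file names are placeholders; like the grep approach, it only tells you a signature field may exist, not that the last revision is signed):

# Rough sketch: flag files whose raw bytes contain "/SigFlags".
import pathlib

maybe_signed, not_signed = [], []
for pdf in pathlib.Path("pdfs").rglob("*.pdf"):
    data = pdf.read_bytes()           # fine for a quick pass; stream in chunks for huge files
    (maybe_signed if b"/SigFlags" in data else not_signed).append(pdf)

pathlib.Path("signed_pdfs.txt").write_text("\n".join(map(str, maybe_signed)))
pathlib.Path("non_signed_pdfs.txt").write_text("\n".join(map(str, not_signed)))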
Note that conforming readers start reading backwards (from the end of the file), because that is where the cross-reference table, which records the position of every object, is located.
I advise you to use peepdf to check the inner structure of the file. It supports executing commands against the file. For example:
$ peepdf -C "search /SigFlags" signed.pdf
[6]
$ peepdf -C "search /SigFlags" non-signed.pdf
Not found!!
But I have not tested the performance of that. You can use it to browse the internal structure of the PDF and learn from the PDF v1.7 Reference. Check the annexes with PDF examples there.
From the command line you can check whether a file has a digital signature with the pdfsig tool from the poppler-utils package (works on Ubuntu 20.04).
pdfsig pdffile.pdf
will produce output with detailed data on the signatures included and validation data. If you need to scan a PDF file tree and get a list of signed PDFs, you can use a bash command like:
find ./path/to/files -iname '*.pdf' \
-exec bash -c 'pdfsig "$0"; \
if [[ $? -eq 0 ]]; then \
echo "$0" >> signed-files.txt; fi' {} \;
You will get a list of signed files in signed-files.txt file in the local directory.
I have found this to be much more reliable than trying to grep some text out of a pdf file (for example, the pdfs produced by signing services in Lithuania do not contain the string "SigFlags" which was mentioned in the previous answers).
After six years, this is the solution I implemented in Java via iText; it can detect the presence of any PAdES signature in an unprotected PDF file.
This easy method returns a 3-state Boolean (don't wallop me for that, lol): Boolean.TRUE means "signed"; Boolean.FALSE means "not signed"; null means that something nasty happened reading the PDF (and in this case, I send the file to the old slow analysis procedure). After about half a million PAdES-signed PDFs were scanned, I didn't have any false negatives, and after about 7 million unsigned PDFs I didn't have any false positives.
Maybe I was just lucky (my PDF files were just signed once, and always in the same way), but it seems that this method works - at least for me. Thanks @Patrick Gallot
private Boolean isSigned(URL url)
{
    try {
        PdfReader reader = new PdfReader(url);
        PRAcroForm acroForm = reader.getAcroForm();
        if (acroForm == null) {
            return false;
        }
        // The following can lead to false negatives
        // boolean hasSigflags = acroForm.getKeys().contains(PdfName.SIGFLAGS);
        // if (!hasSigflags) {
        //     return false;
        // }
        List<?> fields = acroForm.getFields();
        for (Object k : fields) {
            FieldInformation fi = (FieldInformation) k;
            PdfObject ft = fi.getInfo().get(PdfName.FT);
            if (PdfName.SIG.equals(ft)) {
                logger.info("Found signature named {}", fi.getName());
                return true;
            }
        }
    } catch (Exception e) {
        logger.error("Whazzup?", e);
        return null;
    }
    return false;
}
Another function that should work correctly (I found it recently while checking a paper written by Bruno Lowagie, Digital Signatures for PDF documents, page 124) is the following one:
private Boolean isSignedShorter(URL url)
{
    try {
        PdfReader reader = new PdfReader(url);
        AcroFields fields = reader.getAcroFields();
        return !fields.getSignatureNames().isEmpty();
    } catch (Exception e) {
        logger.warn("Whazzup?", e);
        return null;
    }
}
I personally tested it on about a thousand signed/unsigned PDFs and it seems to work too, probably better than mine in case of complex signatures.
I hope to have given a good starting point to solve my original issue :)

Batch OCR Program for PDFs [closed]

This has been asked before, but I don't really know if the answers help me. Here is my problem: I have a bunch of (10,000 or so) PDF files. Some were text files that were saved using Adobe's print feature (so their text is perfect and I don't want to risk screwing them up). And some were scanned images (so they don't have any text and I will have to settle for OCR). The files are in the same directory and I can't tell which is which. Ultimately I want to turn them into .txt files and then do string processing on them. So I want the most accurate OCR possible.
It seems like people have recommended:
Adobe Acrobat (I don't have a licensed copy of this, so... plus if ABBYY FineReader or something is better, why pay for it if I won't use it)
ocropus (I can't figure out how to use this thing),
Tesseract (which seems like it was great in 1995, but I'm not sure if there's something more accurate. Plus it doesn't do PDFs natively and I'd have to convert to TIFF; that raises its own problem, as I don't have a licensed copy of Acrobat, so I don't know how I'd convert 10,000 files to TIFF. Plus I don't want 10,000 30-page documents turned into 30,000 individual TIFF images).
wowocr
pdftextstream (that was from 2009)
ABBYY FineReader (apparently it's $$$, but I will spend $600 to get this done if this thing is significantly better, i.e. has more accurate OCR).
Also I am a n00b to programming so if it's going to take like weeks to learn how to do something, I would rather pay the $$$. Thx for input/experiences.
BTW, I'm running Linux Mint 11 64 bit and/or windows 7 64 bit.
Here are the other threads:
Batch OCRing PDFs that haven't already been OCR'd
Open source OCR
PDF Text Extraction Approach Using OCR
https://superuser.com/questions/107678/batch-ocr-for-many-pdf-files-not-already-ocred
Just to put some of your misconceptions straight...
" I don't have a licensed copy of acrobat so I don't know how I'd convert 10,000 files to tiff."
You can convert PDFs to TIFF with the help of Free (as in liberty) and free (as in beer) Ghostscript. Your choice if you want to do it on Linux Mint or on Windows 7. The commandline for Linux is:
gs \
-o input.tif \
-sDEVICE=tiffg4 \
input.pdf
"i don't want 10,000 30 page documents turned into 30,000 individual tiff images"
You can have "multipage" TIFFs easily. The above command creates such TIFFs, of the G4 (fax TIFF) flavor. Should you want single-page TIFFs instead, you can modify the command:
gs \
-o input_page_%03d.tif \
-sDEVICE=tiffg4 \
input.pdf
The %03d part of the output filename will automatically translate into a series of 001, 002, 003 etc.
Caveats:
The default resolution for the tiffg4 output device is 204x196 dpi. You probably want a better value. To get 720 dpi you should add -r720x720 to the commandline.
Also, if your Ghostscript installation uses letter as its default media size, you may want to change it. You can use -gXxY to set width x height in device pixels. So to get ISO A4 output page dimensions in landscape you can add a -g8420x5950 parameter.
So the full command which controls these two parameters, to produce 720 dpi output on A4 in portrait orientation, would read:
gs \
-o input.tif \
-sDEVICE=tiffg4 \
-r720x720 \
-g5950x8420 \
input.pdf
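Since you have roughly 10,000 files, you will probably want to script this. A minimal Python sketch (directory names are placeholders; it assumes gs is on your PATH):

# Minimal batch sketch: render every PDF in a directory to a multipage G4 TIFF.
import pathlib
import subprocess

src = pathlib.Path("pdfs")    # placeholder: directory holding the PDFs
dst = pathlib.Path("tiffs")   # placeholder: output directory
dst.mkdir(exist_ok=True)

for pdf in src.glob("*.pdf"):
    subprocess.run(
        ["gs", "-o", str(dst / (pdf.stem + ".tif")),
         "-sDEVICE=tiffg4", "-r720x720", str(pdf)],   # resolution as suggested above
        check=True,
    )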
Figured I would try to contribute by answering my own question (have written some nice code for myself and could not have done it without help from this board). If you cat the PDF files in Unix (well, OS X for me), then the PDF files that have text will have the word "Font" in them (as a string, but mixed in with other text) b/c that's how the file tells Adobe what fonts to display.
The cat command in bash seems to have the same output as reading the file in binary mode in Python (using 'rb' mode when opening the file instead of 'w' or 'r' or 'a'). So I'm assuming that all PDF files that contain text will have the word "Font" in the binary output and that no image-only files ever will. If that's always true, then this code will make a list of all PDF files in a single directory that have text and a separate list of those that have only images. It saves each list to a separate .txt file, then you can use a command in bash to move the PDF files to the appropriate folder.
Once you have them in their own folders, then you can run your batch ocr solution on just the pdf files in the images_only folder. I haven't gotten that far yet (obviously).
import os, re

# path is the directory with the files; the other 2 are the names of the files you will store your lists in
path = 'C:/folder_with_pdfs'
files_with_text = open('files_with_text.txt', 'a')
image_only_files = open('image_only_files.txt', 'a')

# have os make a list of all files in that dir for a loop
filelist = os.listdir(path)

# compile a regular expression that matches "Font" (a bytes pattern, since the files are opened in binary mode)
mysearch = re.compile(rb'.*Font.*', re.DOTALL)

# loop over all files in the directory, open them in binary ('rb'), search that binary for "Font"
# if they have "Font" they have text, if not they don't
# (pdf does something to understand the Font type and uses this word every time the pdf contains text)
for pdf in filelist:
    openable_file = os.path.join(path, pdf)
    cat_file = open(openable_file, 'rb')
    usable_cat_file = cat_file.read()
    cat_file.close()
    # print(usable_cat_file)
    if mysearch.match(usable_cat_file):
        files_with_text.write(pdf + '\n')
    else:
        image_only_files.write(pdf + '\n')

files_with_text.close()
image_only_files.close()
To move the files, I entered this command in bash shell:
cat files_with_text.txt | while read i; do mv "$i" /Volumes/hard_drive_name/new_destination_directory_name; done
Also, I didn't re-run the python code above, I just hand-edited the thing, so it might be buggy, Idk.
This is an interesting problem. If you are willing to work on Windows in .NET, you can do this with dotImage (disclaimer, I work for Atalasoft and wrote most of the OCR engine code). Let's break the problem down into pieces - the first is iterating over all your PDFs:
string[] candidatePDFs = Directory.GetFiles(sourceDirectory, "*.pdf");
PdfDecoder decoder = new PdfDecoder();

foreach (string path in candidatePDFs) {
    using (FileStream stm = new FileStream(path, FileMode.Open)) {
        if (decoder.IsValidFormat(stm)) {
            ProcessPdf(path, stm);
        }
    }
}
This gets a list of all files that end in .pdf and if the file is a valid pdf, calls a routine to process it:
public void ProcessPdf(string path, Stream stm)
{
    using (Document doc = new Document(stm)) {
        int i = 0;
        foreach (Page p in doc.Pages) {
            if (p.SingleImageOnly) {
                ProcessWithOcr(path, stm, i);
            }
            else {
                ProcessWithTextExtract(path, stm, i);
            }
            i++;
        }
    }
}
This opens the file as a Document object and asks whether each page is image-only. If so, it will OCR the page; otherwise it will extract the text:
public void ProcessWithOcr(string path, Stream pdfStm, int page)
{
    using (Stream textStream = GetTextStream(path, page)) {
        PdfDecoder decoder = new PdfDecoder();
        using (AtalaImage image = decoder.Read(pdfStm, page)) {
            ImageCollection coll = new ImageCollection();
            coll.Add(image);
            ImageCollectionImageSource source = new ImageCollectionImageSource(coll);
            OcrEngine engine = GetOcrEngine();
            engine.Initialize();
            engine.Translate(source, "text/plain", textStream);
            engine.Shutdown();
        }
    }
}
What this does is rasterize the PDF page into an image and put it into a form that is palatable for engine.Translate. This doesn't strictly need to be done this way - one could get an OcrPage object from the engine from an AtalaImage by calling Recognize, but then it would be up to client code to loop over the structure and write out the text.
You'll note that I've left out GetOcrEngine() - we make available 4 OCR engines for client use: Tesseract, GlyphReader, RecoStar, and Iris. You would select the one that would be best for your needs.
Finally, you would need the code to extract text from the pages that already have perfectly good text on them:
public void ProcessWithTextExtract(string path, Stream pdfStream, int page)
{
    using (Stream textStream = GetTextStream(path, page)) {
        StreamWriter writer = new StreamWriter(textStream);
        using (PdfTextDocument doc = new PdfTextDocument(pdfStream)) {
            // extract the text of this page and write it to the output stream
            PdfTextPage textPage = doc.GetPage(page);
            writer.Write(textPage.GetText(0, textPage.CharCount));
        }
        writer.Flush();
    }
}
This extracts the text from the given page and writes it to the output stream.
Finally, you need GetTextStream():
public Stream GetTextStream(string sourcePath, int pageNo)
{
    string dir = Path.GetDirectoryName(sourcePath);
    string fname = Path.GetFileNameWithoutExtension(sourcePath);
    string finalPath = Path.Combine(dir, String.Format("{0}p{1}.txt", fname, pageNo));
    return new FileStream(finalPath, FileMode.Create);
}
Will this be a 100% solution? No. Certainly not. You could imagine PDF pages that contain a single image with a box drawn around it - this would clearly fail the image-only test but return no useful text. Probably a better approach is to just use the extracted text and, if that doesn't return anything, then try an OCR engine. Changing from one approach to the other is a matter of writing a different predicate.
The simplest approach would be to use a single tool such as ABBYY FineReader, OmniPage, etc. to process the images in one batch without having to sort them out into scanned vs. not-scanned images. I believe FineReader converts the PDFs to images before performing OCR anyway.
Using an OCR engine will give you features such as automatic deskew, page orientation detection, image thresholding, despeckling, etc. These are features you would have to buy an image processing library for and program yourself, and it could prove difficult to find an optimal set of parameters for your 10,000 PDFs.
Using the automatic OCR approach will have other side effects depending on the input images, and you would find you would get better results if you sorted the images and set optimal parameters for each type of image. For accuracy it would be much better to use a proper PDF text extraction routine to extract the PDFs that already have perfect text.
At the end of the day it will come down to time and money versus the quality of the results that you need. A commercial OCR program will be the quickest and easiest solution. If you have clean, text-only documents then a cheap OCR program will work as well as an expensive solution. The more complex your documents, the more money you will need to spend to process them.
I would try finding some demo / trial versions of commercial OCR engines and just see how they perform on your different document types before spending too much time and money.
I have written a small wrapper for the ABBYY OCR4LINUX CLI engine (IMHO, it doesn't cost that much) and Tesseract 3.
The wrapper can batch convert files like:
$ pmocr.sh --batch --target=pdf --skip-txt-pdf /some/directory
The script uses pdffonts to determine whether a PDF file has already been OCRed, in order to skip it. Also, the script can work as a system service to monitor a directory and launch an OCR action as soon as a file enters the directory.
The script can be found here:
https://github.com/deajan/pmOCR
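The pdffonts trick the script relies on is easy to reproduce on its own; a rough Python sketch (assuming poppler-utils' pdffonts is installed, and using the heuristic that pdffonts prints two header lines plus one line per font):

# Rough sketch: use pdffonts (poppler-utils) to decide whether a PDF already
# contains text. Any output beyond the two-line header means at least one font.
import pathlib
import subprocess

def has_text(pdf_path):
    out = subprocess.run(["pdffonts", str(pdf_path)],
                         capture_output=True, text=True, check=True).stdout
    return len(out.strip().splitlines()) > 2

for pdf in pathlib.Path("pdfs").glob("*.pdf"):   # placeholder directory
    print(pdf, "has text" if has_text(pdf) else "needs OCR")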
Hopefully, this helps someone.

How can I extract embedded fonts from a PDF as valid font files?

I'm aware of the pdftk.exe utility that can indicate which fonts are used by a PDF, and whether they are embedded or not.
Now the problem: given I had PDF files with embedded fonts -- how can I extract those fonts in a way that they are re-usable as regular font files? Are there (preferably free) tools which can do that? Also: can this be done programmatically with, say, iText?
You have several options. All these methods work on Linux as well as on Windows or Mac OS X. However, be aware that most PDFs do not include the full, complete typeface when they have a font embedded. Mostly they include just the subset of glyphs used in the document.
Using pdftops
One of the most frequently used methods to do this on *nix systems consists of the following steps:
Convert the PDF to PostScript, for example by using XPDF's pdftops (on Windows: pdftops.exe) helper program.
The fonts will now be embedded in .pfa (PostScript) format, and you can extract them using a text editor.
You may need to convert the .pfa (ASCII) to a .pfb (binary) file using the t1utils and pfa2pfb.
In PDFs there are never .pfm or .afm files (font metric files) embedded (because PDF viewers have internal knowledge about these). Without these, font files are hardly usable in a visually pleasing way.
Using fontforge
Another method is to use the Free font editor FontForge:
Use the "Open Font" dialog box when opening files.
Then select "Extract from PDF" in the filter section of the dialog.
Select the PDF file with the font to be extracted.
A "Pick a font" dialog box opens -- select which font to open.
Check the FontForge manual. You may need to follow a few specific steps which are not necessarily straightforward in order to save the extracted font data as a file which is re-usable.
Using mupdf
Next, MuPDF. This application comes with a utility called pdfextract (on Windows: pdfextract.exe) which can extract fonts and images from PDFs. (In case you don't know about MuPDF, which still is relatively unknown and new: "MuPDF is a Free lightweight PDF viewer and toolkit written in portable C.", written by Artifex Software developers, the same company that gave us Ghostscript.)
(Update: Newer versions of MuPDF have moved the former functionality of 'pdfextract' to the command 'mutool extract'. Download it here: mupdf.com/downloads)
Note: pdfextract.exe is a command-line program. To use it, do the following:
c:\> pdfextract.exe c:\path\to\filename.pdf # (on Windows)
$> pdfextract /path/to/filename.pdf # (on Linux, Unix, Mac OS X)
This command will dump all of the extractable files from the pdf file referenced into the current directory. Generally you will see a variety of files: images as well as fonts. These include PNG, TTF, CFF, CID, etc. The image names will be like img-0412.png if the PDF object number of the image was 412. The fontnames will be like FGETYK+LinLibertineI-0966.ttf, if the font's PDF object number was 966.
CFF (Compact Font Format) files are a recognized format that can be converted to other formats via a variety of converters for use on different operating systems.
Again: be aware that most of these font files may have only a subset of characters and may not represent the complete typeface.
Update: (Jul 2013) Recent versions of MuPDF have seen an internal reshuffling and renaming of their binaries, not just once, but several times. The main utility used to be a 'swiss knife'-alike binary called mubusy (name inspired by busybox?), which more recently was renamed to mutool. These support the sub-commands info, clean, extract, poster and show. Unfortunately, the official documentation for these tools isn't up to date (yet). If you're on a Mac using 'MacPorts': then the utility was renamed in order to avoid name clashes with other utilities using identical names, and you may need to use mupdfextract.
To achieve (roughly) the equivalent results with mutool as its previous tool pdfextract did, just run mutool extract ....
So to extract fonts and images, you may need to run one of the following commandlines:
c:\> mutool.exe extract filename.pdf # (on Windows)
$> mutool extract filename.pdf # (on Linux, Unix, Mac OS X)
Downloads are here: mupdf.com/downloads
Using gs (Ghostscript)
Then, Ghostscript can also extract fonts directly from PDFs. However, it needs the help of a special utility program named extractFonts.ps, written in PostScript language, which is available from the Ghostscript source code repository.
Now, to use it, you need to run Ghostscript with both this extractFonts.ps file and your PDF file. Ghostscript will then use the instructions from the PostScript program to extract the fonts from the PDF. It looks like this on Windows (yes, Ghostscript understands the 'forward slash', /, as a path separator also on Windows!):
gswin32c.exe ^
-q -dNODISPLAY ^
c:/path/to/extractFonts.ps ^
-c "(c:/path/to/your/PDFFile.pdf) extractFonts quit"
or on Linux, Unix or Mac OS X:
gs \
-q -dNODISPLAY \
/path/to/extractFonts.ps \
-c "(/path/to/your/PDFFile.pdf) extractFonts quit"
I tested the Ghostscript method a few years ago. At the time it extracted *.ttf (TrueType) just fine. I don't know if other font types will also be extracted at all, and if so, in a re-usable way. I don't know if the utility blocks extraction of fonts which are marked as protected.
Using pdf-parser.py
Finally, Didier Stevens' pdf-parser.py: this one is probably not as easy to use, because you need to have some know-how about internal PDF structures. pdf-parser.py is a Python script which can do a lot of other things too. It can also decompress and extract arbitrary streams from objects, and therefore it can extract embedded font files too.
But you need to know what to look for. Let's see it with an example. I have a file named big1.pdf. As a first step I use the -s parameter to search the PDF for any occurrence of the keyword FontFile (pdf-parser.py's search is not case sensitive):
pdf-parser.py -s fontfile big1.pdf
In my case, for my big1.pdf, I get this result:
obj 9 0
Type: /FontDescriptor
Referencing: 15 0 R
<<
/Ascent 728
/CapHeight 716
/Descent -210
/Flags 32
/FontBBox [ -665 -325 2000 1006 ]
/FontFile2 15 0 R
/FontName /ArialMT
/ItalicAngle 0
/StemV 87
/Type /FontDescriptor
/XHeight 519
>>
obj 11 0
Type: /FontDescriptor
Referencing: 16 0 R
<<
/Ascent 728
/CapHeight 716
/Descent -210
/Flags 262176
/FontBBox [ -628 -376 2000 1018 ]
/FontFile2 16 0 R
/FontName /Arial-BoldMT
/ItalicAngle 0
/StemV 165
/Type /FontDescriptor
/XHeight 519
>>
It tells me that there are two instances of FontFile2 inside the PDF, and these are in PDF objects no. 15 and no. 16, respectively. Object no. 15 holds the /FontFile2 for font /ArialMT, object no. 16 holds the /FontFile2 for font /Arial-BoldMT.
To show this more clearly:
pdf-parser.py -s fontfile big1.pdf | grep -i fontfile
/FontFile2 15 0 R
/FontFile2 16 0 R
A quick peek into the PDF specification reveals that the keyword /FontFile2 relates to a 'stream containing a TrueType font program' (/FontFile would relate to a 'stream containing a Type 1 font program' and /FontFile3 would relate to a 'stream containing a font program whose format is specified by the Subtype entry in the stream dictionary' {hence being either a Type1C or a CIDFontType0C subtype}.)
To look specifically at PDF object no. 15 (which holds the font /ArialMT), one can use the -o 15 parameter:
pdf-parser.py -o 15 big1.pdf
obj 15 0
Type:
Referencing:
Contains stream
<<
/Length1 778552
/Length 1581435
/Filter /ASCIIHexDecode
>>
This pdf-parser.py output tells us that this object contains a stream (which it will not directly display) that has a length of 1.581.435 Bytes and is encoded ( == "compressed") with ASCIIHexEncode and needs to be decoded ( == "de-compressed" or "filtered") with the help of the standard /ASCIIHexDecode filter.
To dump any stream from an object, pdf-parser.py can be called with the -d dumpname parameter. Let's do it:
pdf-parser.py -o 15 -d dumped-data.ext big1.pdf
Our extracted data dump will be in the file named dumped-data.ext. Let's see how big it is:
ls -l dumped-data.ext
-rw-r--r-- 1 kurtpfeifle staff 1581435 Apr 11 00:29 dumped-data.ext
Oh look, it is 1.581.435 Bytes. We saw this figure in the previous command's output. Opening this file with a text editor confirms that its content is ASCII hex encoded data.
Opening the file with a font reading tool like otfinfo (this is a part of the lcdf-typetools package) will lead to some disappointment at first:
otfinfo -i dumped-data.ext
otfinfo: dumped-data.ext: not an OpenType font (bad magic number)
OK, this is because we did not (yet) let pdf-parser.py make use of its full magic: to dump a filtered, decoded stream. For this we have to add the -f parameter:
pdf-parser.py -o 15 -f -d dumped-data-decoded.ext big1.pdf
What's the size of this new file?
ls -l dumped-data-decoded.ext
-rw-r--r-- 1 kurtpfeifle staff 778552 Apr 11 00:39 dumped-data-decoded.ext
Oh, look: that exact number was also already stored in the PDF object no. 15 dictionary as the value for key /Length1...
What does file think it is?
file dumped-data-decoded.ext
dumped-data-decoded.ext: TrueType font data
What does otfinfo tell us about it?
otfinfo -i dumped-data-decoded.ext
Family: Arial
Subfamily: Regular
Full name: Arial
PostScript name: ArialMT
Version: Version 5.10
Unique ID: Monotype:Arial Regular:Version 5.10 (Microsoft)
Designer: Monotype Type Drawing Office - Robin Nicholas, Patricia Saunders 1982
Manufacturer: The Monotype Corporation
Trademark: Arial is a trademark of The Monotype Corporation.
Copyright: © 2011 The Monotype Corporation. All Rights Reserved.
License Description: You may use this font to display and print content as permitted by
the license terms for the product in which this font is included.
You may only (i) embed this font in content as permitted by the
embedding restrictions included in this font; and (ii) temporarily
download this font to a printer or other output device to help
print content.
Vendor ID: TMC
So Bingo!, we have a winner: pdf-parser.py did indeed extract a valid font file for us. Given the size of this file (778.552 Bytes), it looks like this font had even been embedded completely in the PDF...
We could rename it to arial-regular.ttf and install it as such and happily make use of it.
Caveats:
In any case you need to follow the license that applies to the font. Some font licences do not allow free use and/or distribution. Pirating fonts is like pirating any software or other copyrighted material.
Most PDFs which are in the wild out there do not embed the full font anyway, but only subsets. Extracting a subset of a font is only useful in a very limited scope, if at all.
Please do also read the following about Pros and (more) Cons regarding font extraction efforts:
http://typophile.com/node/34377 — not available anymore, but can be seen on the Wayback Machine at https://web.archive.org/web/20110717120241/typophile.com/node/34377
Use online service http://www.extractpdf.com. No need to install anything.
Even though this question is 10 years old, it is still valid, and as technology changes so does a valid answer.
In searching the current answers I noticed that none of them mention WOFF (Web Open Font Format) (W3C) (Wikipedia), which can be used to recreate the individual characters (glyphs) and display them in a web page accurately.
Using the free online web page by IDR Solutions, PDF to HTML5 (link), convert a PDF to a zip file. The resulting zip will contain a font directory of WOFF files. Current Internet browsers support WOFF files, if you were not aware (reference). These can be examined at the online site FontDrop! (link).
WOFF files can be converted to/from OTF or TTF at WOFFer – WOFF font converter
Also the zip file from PDF to HTML5 will contain an HTML file for each page of the PDF that can be opened in an Internet browser and is one of the best and most accurate PDF translations I have found or seen.
Eventually found the FontForge Windows installer package and opened the PDF through the installed program. Worked a treat, so happy.
http://www.verypdf.com/app/pdf-font-extractor/pdf-font-extracting-tool.html
IMO easiest way to extract fonts (Windows).
PDF2SVG version 6.0 from PDFTron does a reasonable job. It produces OpenType (.otf) fonts by default. Use --preserve_fontnames to preserve "the font/font-family naming scheme as obtained from the source file."
PDF2SVG is a commercial product, but you can download a free demo executable (which includes watermarks on the SVG output but doesn't otherwise restrict usage). There may be other PDFTron products that also extract fonts, but I only recently discovered PDF2SVG myself.
One of the best online tools currently available to extract pdf fonts is http://www.pdfconvertonline.com/extract-pdf-fonts-online.html
This is a follow-up to the FontForge section of @Kurt Pfeifle's answer, specific to Red Hat (and possibly other Linux distros).
After opening the PDF and selecting the font you want, you will want to select the "File -> Generate Fonts..." option.
If there are errors in the file, you can choose to ignore them or save the file and edit them. Most of the errors can be fixed automatically if you click "Fix" enough times.
Click "Element -> Font Info..." and check that "Fontname", "Family Name" and "Name for Humans" are all set to values you like. If not, modify them and save the file somewhere. These names will determine how your font appears on the system.
Select your file name and click "Save..."
Once you have your TTF file, you can install it on your system by
Copying it to folder /usr/share/fonts (as root)
Running fc-cache -f /usr/share/fonts/ (as root)

Programmatically add comments to PDF header

Has anyone had any success with adding additional information to a PDF file?
We have an electronic medical record system which produces medical documents for our users. In the past, those documents have been Print-To-File (.prn) files which we have fed to a system that displayed them as part of an enterprise medical record.
Now the hospital's enterprise medical record vendor wants to receive the documents as PDF, but still wants all of the same information stored in the header.
Honestly, we can't figure out how to put information into a PDF file that doesn't break the PDF file.
Here is the start of one of our PDFs...
%PDF-1.4
%âãÏÓ
6 0 obj
<<
/Type /XObject
/Subtype /Image
/BitsPerComponent 8
/Width 854
/Height 130
/ColorSpace /DeviceRGB
/Filter /DCTDecode
/Length 17734>>
stream
In our PRN files, we would insert information like this:
%MRN% TEST000001
%ACCT% TEST0000000000001
%DATE% 01/01/2009^16:44
%DOC_TYPE% Clinical
%DOC_NUM% 192837475
%DOC_VER% 1
My question is, can I insert this information into a PDF in a manner which allows the document server to perform post-processing, yet is NOT visible to the doctor who views the PDF?
Thank you,
David Walker
Yes, you can. Any line in a PDF file that starts with a percent sign is a comment and as such ignored (the first two lines of the PDF actually are comments as well). So you can pretty much insert your information into the PDF as you did into the PRN.
However:
The PDF format works with byte position references, so if you insert data into a finished PDF file, this will push the rest of the data away from its original position and thus break the file. You also cannot simply append it to the file, because a PDF file has to end with
startxref
123456
%%EOF
(the 123456 is an example). You could insert your data right before these three lines. The byte position of the "startxref" part is never referenced anywhere, so you won't break anything if you push this final part towards the end.
Edit: This of course assumes there is no checksumming, signing or encryption going on. That would make things more complicated.
Edit 2: As Javier pointed out correctly, you can also just add your data to the end and just add a copy of the three lines to the end of that. Boils down to the same thing, but it's a little easier.
PDFs are designed to support multiple revisions by just appending at the end; but the very end must contain the offset to the main cross-reference table. Just read the last three lines, append your data, and reattach the original ending.
You can either remove the original ending or leave it there. PDF readers will just go to the end and use the second-to-last line to find the cross-reference table.
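For illustration, a small Python sketch of this append-and-reattach idea (file names are placeholders; it assumes an unencrypted, unsigned PDF, as noted above):

# Sketch: append %-comment lines to a finished PDF by copying the original
# "startxref / <offset> / %%EOF" tail after them.
def append_comments(path, out_path, comment_lines):
    data = open(path, "rb").read()
    tail_start = data.rfind(b"startxref")
    if tail_start == -1:
        raise ValueError("no startxref found; not a well-formed PDF?")
    tail = data[tail_start:]                          # startxref + offset + %%EOF
    block = b"".join(line.encode("latin-1") + b"\n" for line in comment_lines)
    with open(out_path, "wb") as out:
        out.write(data)                               # the original file, untouched
        out.write(b"\n" + block + tail)               # comments, then a copy of the tail

append_comments("record.pdf", "record_tagged.pdf",
                ["%MRN% TEST000001", "%ACCT% TEST0000000000001"])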
Have you ever thought of embedding your additional info inside the PDF as a separate file?
The generic PDF specification allows you to "attach files" to PDFs. Attached files can be anything: *.txt, *.doc, *.xsl, *.html or even .pdf. Attached files are contained in the PDF "container" file without corrupting the container's own content. (Special-purpose PDF specifications such as PDF/A-* and PDF/X-* may impose some restrictions about embedded/attached files.)
That allows you to tie additional info and/or data to PDF files and allows for common storage and processing. Attached files are supposed not to disturb any PDF viewer's rendering.
I've used that feature frequently, for various purposes:
store the parent document (like .doc) inside the .pdf from which the .pdf was created in the first place;
attach job ticketing information to a print file that is sent to the print shop;
etc.
Of course, recently discovered and published flaws in PDF processing software (and in the PDF spec itself) suggest staying away from embedding/attaching binary files to PDF files --
because more and more readers will by default stop you from easily extracting/detaching the embedded/attached files.
However, there is no reason why you shouldn't be able to put your additional info into a medical-record-info.txt file of arbitrary length and internal format and attach it to the PDF:
MRN TEST000001
ACCT TEST0000000000001
DATE 2009-01-01
TIME 16:44:33.76
DOC_TYPE Clinical
DOC_NUM 192837475
DOC_VER 1
MORE_INFO blah blah
Hi, guys,
can you please process this file faster than usual? If you don't,
someone will be dying.
Seriously, David.
FWIW, the commandline tools pdftk.exe (Windows) and pdftk (Linux) are able to attach and detach embedded files from their container PDF. Acrobat Reader can also handle attachments.
You could setup/program/script your document server handling the PDF to automatically detach the embedded .txt file and trigger actions according to its content.
Of course, the doctor who views the PDF would be able to see there is a file attachment in the PDF. But it wouldn't appear in his "normal" viewing. He'd have to take specific additional actions in order to extract and view it. (And then there is the option to set a password on the PDF to protect it from un-authorized file detachments. And/or encode, obscure, rot13 the .txt. Not exactly rock-solid methods, but 99% of doctors wouldn't be able to accomplish it even if you teach them how to...)
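For example, the pdftk attach/detach round trip could be scripted roughly like this (file and directory names are placeholders; pdftk must be installed):

# Rough sketch: attach a sidecar .txt to a PDF with pdftk, then detach it again
# on the receiving side.
import pathlib
import subprocess

# Attach the metadata file to the PDF container.
subprocess.run(["pdftk", "record.pdf", "attach_files", "medical-record-info.txt",
                "output", "record_with_info.pdf"], check=True)

# On the document server: detach all embedded files into ./unpacked/.
pathlib.Path("unpacked").mkdir(exist_ok=True)
subprocess.run(["pdftk", "record_with_info.pdf", "unpack_files",
                "output", "unpacked"], check=True)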
You can still insert comments into a PDF file using the % character, but anyone would be able to access them with a text editor.
Your vendor could remove these comments after post-processing, so they don't actually reach the doctors.
You can store the data as real PDF metadata. For example, with CAM::PDF you can write metadata like this:
use CAM::PDF;
my $pdf = CAM::PDF->new('temp.pdf') || die;
my $info = $pdf->getValue($pdf->{trailer}->{Info}) || die;
$info->{PRN} = CAM::PDF::Node->new('dictionary', {
    DOC_TYPE => CAM::PDF::Node->new('string', 'Clinical'),
    DOC_NUM  => CAM::PDF::Node->new('number', 192837475),
    DOC_VER  => CAM::PDF::Node->new('number', 1),
});
$pdf->cleanoutput('out.pdf');
The Info node of the PDF then looks like this:
8 0 obj
<< /CreationDate (D:20080916083455-04'00')
/ModDate (D:20080916083729-04'00')
/PRN << /DOC_NUM 192837475 /DOC_TYPE (Clinical) /DOC_VER 1 >> >>
endobj
You can read the PRN data back out like so (simplistic code...)
my $pdf = CAM::PDF->new('out.pdf') || die;
my $info = $pdf->getValue($pdf->{trailer}->{Info}) || die;
my $prn = $info->{PRN};
if ($prn) {
    my $prndict = $pdf->getValue($prn);
    for my $key (sort keys %{$prndict}) {
        print "$key = ", $pdf->getValue($prndict->{$key}), "\n";
    }
}
Which makes output like this:
DOC_NUM = 192837475
DOC_TYPE = Clinical
DOC_VER = 1
PDF supports arbitrarily nested arrays, dictionaries and references so just about any data can be represented. For example, I built an entire filesystem embedded in a PDF just for fun!
At one point we were changing some Acrobat JS code by doing a text replace in a plain (unencrypted) PDF. The trick was that the lengths of each PDF block were hard coded in the document. So, we could not change the number of characters. We would just add extra spaces.
It worked great; the JS code executed and all.
Have you thought about using XMP?

Ghostscript Multipage PDF to PNG

I've been using Ghostscript to generate an image of a single page from a PDF. Now I need to be able to pull multiple pages from the PDF and produce a long vertical image.
Is there an argument that I'm missing that would allow this?
So far I'm using the following arguments when I call out to ghostscript:
string[] args = {
"-q",
"-dQUIET",
"-dPARANOIDSAFER", // Run this command in safe mode
"-dBATCH", // Keep gs from going into interactive mode
"-dNOPAUSE", // Do not prompt and pause for each page
"-dNOPROMPT", // Disable prompts for user interaction
"-dFirstPage="+start,
"-dLastPage="+stop,
"-sDEVICE=png16m",
"-dTextAlphaBits=4",
"-dGraphicsAlphaBits=4",
"-r300x300",
// Set the input and output files
String.Format("-sOutputFile={0}", tempFile),
originalPdfFile
};
I ended up adding "%d" to the "OutputFile" parameter so that it would generate one file per page. Then I just read up all of the files and stitched them together in my c# code like so:
var images = pdf.GetPreview(1, 8); // All of the individual images, read in one per file
using (Bitmap b = new Bitmap(images[0].Width, images.Sum(img => img.Height))) {
    using (var g = Graphics.FromImage(b)) {
        for (int i = 0; i < images.Count; i++) {
            g.DrawImageUnscaled(images[i], 0, images.Take(i).Sum(img => img.Height));
        }
    }
    // Do Stuff
}
If you can use ImageMagick, you could use one of its good commands:
montage -mode Concatenate -tile 1x -density 144 -type Grayscale input.pdf output.png
where
-density 144 determines the resolution in dpi; increase it if needed, the default is 72
-type Grayscale use it if your PDF has no colors, you'll save some KBs in the resulting image
First check the output device options; but I don't think there's an option for that.
Most probably you'll need to do some imposition yourself: either make Ghostscript do it (you'll have to write a PostScript program), or stitch the resulting rendered pages with ImageMagick or something similar.
If you want to try the PostScript route (likely to be the most efficient), check the N-up examples included in the Ghostscript package.