Ghostscript Multipage PDF to PNG

I've been using ghostscript to generate an image of a single page from a PDF. Now I need to be able to pull multiple pages from the PDF and produce one long vertical image.
Is there an argument that I'm missing that would allow this?
So far I'm using the following arguments when I call out to ghostscript:
string[] args = {
    "-q",
    "-dQUIET",
    "-dPARANOIDSAFER",   // Run this command in safe mode
    "-dBATCH",           // Keep gs from going into interactive mode
    "-dNOPAUSE",         // Do not prompt and pause for each page
    "-dNOPROMPT",        // Disable prompts for user interaction
    "-dFirstPage=" + start,
    "-dLastPage=" + stop,
    "-sDEVICE=png16m",
    "-dTextAlphaBits=4",
    "-dGraphicsAlphaBits=4",
    "-r300x300",
    // Set the input and output files
    String.Format("-sOutputFile={0}", tempFile),
    originalPdfFile
};

I ended up adding "%d" to the "OutputFile" parameter so that it would generate one file per page. Then I read all of the files back in and stitched them together in my C# code like so:
var images = pdf.GetPreview(1, 8); // All of the individual page images, read in one per file
using (Bitmap b = new Bitmap(images[0].Width, images.Sum(img => img.Height)))
{
    using (var g = Graphics.FromImage(b))
    {
        for (int i = 0; i < images.Count; i++)
        {
            // Draw each page directly below the previous ones
            g.DrawImageUnscaled(images[i], 0, images.Take(i).Sum(img => img.Height));
        }
    }
    // Do Stuff
}
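For reference, the standalone Ghostscript call that produces the same per-page output looks roughly like this (a sketch using only the flags from the question; the page range and file names are placeholders):
gs -q -dQUIET -dPARANOIDSAFER -dBATCH -dNOPAUSE -dNOPROMPT \
   -dFirstPage=1 -dLastPage=8 \
   -sDEVICE=png16m -dTextAlphaBits=4 -dGraphicsAlphaBits=4 -r300x300 \
   -sOutputFile=page_%d.png \
   original.pdf
Ghostscript expands the %d itself, so each page lands in its own file (page_1.png, page_2.png, ...), which you can then stitch together as above.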

If you can use ImageMagick, you could use its montage command:
montage -mode Concatenate -tile 1x -density 144 -type Grayscale input.pdf output.png
where
-density 144 determines the resolution in dpi; increase it if needed (the default is 72)
-type Grayscale use it if your PDF has no colors; you'll save some KBs in the resulting image
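Since the original question only wants a page range, it may also help that ImageMagick accepts a zero-based page selector in brackets (quote it so the shell doesn't try to expand it); for pages 1-8 that would look roughly like:
montage -mode Concatenate -tile 1x -density 144 'input.pdf[0-7]' output.png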

First check the output device options; but I don't think there's an option for that.
Most probably you'll need to do some imposition yourself, either make GhostScript do it (you'll have to write a PostScript program), or stitch the resulting rendered pages with ImageMagick or something similar.
If you want to try the PostScript route (likely to be the most efficient), check the N-up examples included in the GhostScript package.
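If you take the stitching route instead, ImageMagick can do the vertical concatenation in one line. A sketch, assuming the rendered pages were written with zero-padded names so they sort correctly:
convert page_*.png -append long.png
(-append stacks the images top to bottom; +append would place them side by side.)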

Related

How to make a vector PDF searchable?

My workflow includes making figures in Inkscape, which are then converted to PDF and included into LaTeX documents. In these figures, I often have to include mathematical formulas. For that, I use TexText. For font consistency and simplicity, when I want to add some plain text to my figure, I also use TexText. When the resulting SVG is converted to PDF, the TexText-generated text is not searchable.
How can I make a PDF from the SVG such that it is searchable while remaining a vector PDF?
I know I could rasterize the figure and then use e.g. Tesseract to create a searchable PDF. But the resulting PDF will of course contain a rasterized version of my figure. I would like the figure itself to remain vector graphics.
I am guessing there has to be a way that would go something like this: indeed rasterize the PDF and use Tesseract to extract the text. But then take the output of Tesseract and somehow add it to the original vector PDF. Unfortunately, I don't know how to do this.
It turns out a question that is directly relevant to mine was answered on another StackExchange, here. The script that answers my actual question is svgToSearchablePDF.sh, and I post it below. It uses, as the key element, the script pdf-merge-text.sh from the accepted answer to that other question. For completeness, I will repost pdf-merge-text.sh in this answer.
The solution
Note that perhaps you will need to magnify the SVG file before converting it to a searchable PDF: larger image sizes help the OCR process. To magnify, in Inkscape, select the entire image, then go to Object -> Transform… . In the Transform tab, select Scale. Then select 'Scale proportionally', and finally, in either 'Width' or 'Height', enter something like '300' (make sure % is selected in the menu immediately to the right of the 'Width' field). Next, File->Document Properties…. In the 'Document Properties' tab, under Custom size, click on Resize page to content. Make sure that either nothing is selected or else that the entire image is selected. Then click the button 'Resize page to drawing or selection'. Save the SVG.
The script svgToSearchablePDF.sh uses an SVG file as input and produces a searchable vector PDF file as output. It is assumed that all of the following are installed: Tesseract, Inkscape, and Ghostscript.
For example, assume that we used Inkscape to create the file mygraphics.svg. Then the following command will produce a searchable PDF file mygraphics.pdf:
svgToSearchablePDF.sh mygraphics.svg
The scripts
First, svgToSearchablePDF.sh:
#!/bin/bash
filename="$1"
# Rasterize the SVG for OCR, and export a vector PDF to keep the graphics
inkscape "${filename%.*}.svg" -o "${filename%.*}_auxfile.png"
inkscape "${filename%.*}.svg" -o "${filename%.*}_auxfileVCT.pdf"
# OCR the raster image into a searchable (image-based) PDF
tesseract "${filename%.*}_auxfile.png" "${filename%.*}_auxfileTXT" -l eng pdf
# Overlay the OCR text layer onto the vector PDF
pdf-merge-text.sh "${filename%.*}_auxfileTXT.pdf" "${filename%.*}_auxfileVCT.pdf" "${filename%.*}.pdf"
# Remove the intermediate files
rm -f "${filename%.*}_auxfile.png" "${filename%.*}_auxfileVCT.pdf" "${filename%.*}_auxfileTXT.pdf"
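As an alternative to scaling the drawing by hand in Inkscape (see the note above about larger images helping OCR), you may be able to get the same effect by rasterizing the aux PNG at a higher resolution. This assumes Inkscape 1.x and its --export-dpi option; drawing.svg is a placeholder:
inkscape drawing.svg -o drawing_auxfile.png --export-dpi=300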
As I said, that script uses the script pdf-merge-text.sh from here. For completeness, here it is:
#!/usr/bin/env bash
set -eu

pdf_merge_text() {
    local txtpdf; txtpdf="$1"
    local imgpdf; imgpdf="$2"
    local outpdf; outpdf="${3--}"
    if [ "-" != "${txtpdf}" ] && [ ! -f "${txtpdf}" ]; then echo "error: text PDF does not exist: ${txtpdf}" 1>&2; return 1; fi
    if [ "-" != "${imgpdf}" ] && [ ! -f "${imgpdf}" ]; then echo "error: image PDF does not exist: ${imgpdf}" 1>&2; return 1; fi
    if [ "-" != "${outpdf}" ] && [ -e "${outpdf}" ]; then echo "error: not overwriting existing output file: ${outpdf}" 1>&2; return 1; fi
    (
        local txtonlypdf; txtonlypdf="$(TMPDIR=. mktemp --suffix=.pdf)"
        trap "rm -f -- '${txtonlypdf//'/'\\''}'" EXIT
        # Strip the page images from the OCR PDF, keeping only its (invisible) text layer
        gs -o "${txtonlypdf}" -sDEVICE=pdfwrite -dFILTERIMAGE "${txtpdf}"
        # Stamp that text layer onto the vector PDF
        pdftk "${txtonlypdf}" multistamp "${imgpdf}" output "${outpdf}"
    )
}

pdf_merge_text "$@"

Convert PDF text into outlines?

Does anybody know a way to vectorize the text in a PDF document? That is, I want each letter to be a shape/outline, without any textual content. I'm using a Linux system, and open source or a non-Windows solution would be preferred.
The context: I'm trying to edit some old PDFs, for which I no longer have the fonts. I'd like to do it in Inkscape, but that will replace all the fonts with generic ones, and that's barely readable. I've also been converting back and forth using pdf2ps and ps2pdf, but the font info stays there. So when I load it into Inkscape, it still looks awful.
Any ideas? Thanks.
To achieve this, you will have to:
Split your PDF into individual pages;
Convert your PDF pages into SVG;
Edit the pages you want
Reassemble the pages
This answer will omit step 3, since that's not programmable.
Splitting the PDF
If you don't want a programmatic way to split documents, the modern way would be to use stapler. In your favorite shell:
stapler burst file.pdf
Would generate {file_1.pdf,...,file_N.pdf}, where 1...N are the PDF pages. Stapler itself uses PyPDF2 and the code for splitting a PDF file is not that complex. The following function splits a file and saves the individual pages in the current directory. (shamelessly copying from the commands.py file)
import math
import os

from PyPDF2 import PdfFileWriter, PdfFileReader


def split(filename):
    # PDFs must be opened in binary mode
    with open(filename, 'rb') as inputfp:
        inputpdf = PdfFileReader(inputfp)
        base, ext = os.path.splitext(os.path.basename(filename))
        # Prefix the output template with zeros so that ordering is preserved
        # (page 10 after page 09)
        output_template = ''.join([
            base,
            '_',
            '%0',
            str(math.ceil(math.log10(inputpdf.getNumPages()))),
            'd',
            ext,
        ])
        for page in range(inputpdf.getNumPages()):
            outputpdf = PdfFileWriter()
            outputpdf.addPage(inputpdf.getPage(page))
            outputname = output_template % (page + 1)
            with open(outputname, 'wb') as fp:
                outputpdf.write(fp)
Converting the individual pages to SVG
Now to convert the PDFs to editable files, I'd probably use pdf2svg.
pdf2svg input.pdf output.svg
If we take a look at the pdf2svg.c file, we can see that the code in principle is not that complex (assuming the input filename is in the filename variable and the output file name is in the outputname variable). A minimal working example in python follows. It requires the pycairo and pypoppler libraries:
import os

import cairo
import poppler


def convert(inputname, outputname):
    # Convert the input file name to a URI to please poppler
    uri = 'file://' + os.path.abspath(inputname)
    pdffile = poppler.document_new_from_file(uri, None)
    # We only have one page, since we split prior to converting. Get the page
    page = pdffile.get_page(0)
    # Get the page dimensions
    width, height = page.get_size()
    # Open the SVG file to write on
    surface = cairo.SVGSurface(outputname, width, height)
    context = cairo.Context(surface)
    # Now we finally can render the PDF to SVG
    page.render_for_printing(context)
    context.show_page()
    # Flush and close the SVG output
    surface.finish()
At this point you should have an SVG in which all text has been converted to paths, and will be able to edit with Inkscape without rendering issues.
Combining steps 1 and 2
You can call pdf2svg in a for loop to do that. But you would need to know the number of pages beforehand. The code below figures the number of pages and does the conversion in a single step. It requires only pycairo and pypoppler:
import math
import os

import cairo
import poppler


def convert(inputname, base=None):
    '''Converts a multi-page PDF to multiple SVG files.

    :param inputname: Name of the PDF to be converted
    :param base: Base name for the SVG files (optional)
    '''
    if base is None:
        base, ext = os.path.splitext(os.path.basename(inputname))
    # Convert the input file name to a URI to please poppler
    uri = 'file://' + os.path.abspath(inputname)
    pdffile = poppler.document_new_from_file(uri, None)
    pages = pdffile.get_n_pages()
    # Prefix the output template with zeros so that ordering is preserved
    # (page 10 after page 09)
    output_template = ''.join([
        base,
        '_',
        '%0',
        str(math.ceil(math.log10(pages))),
        'd',
        '.svg',
    ])
    # Iterate over all pages
    for nthpage in range(pages):
        page = pdffile.get_page(nthpage)
        # Output file name based on template
        outputname = output_template % (nthpage + 1)
        # Get the page dimensions
        width, height = page.get_size()
        # Open the SVG file to write on
        surface = cairo.SVGSurface(outputname, width, height)
        context = cairo.Context(surface)
        # Now we finally can render the PDF to SVG
        page.render_for_printing(context)
        context.show_page()
        # Free some memory
        surface.finish()
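If you prefer to stay on the command line instead of using pycairo/pypoppler, a shell loop does the same job. A sketch, assuming poppler's pdfinfo is available to report the page count and that pdf2svg takes the page number as its third argument:
pages=$(pdfinfo input.pdf | awk '/^Pages:/ {print $2}')
for ((p = 1; p <= pages; p++)); do
    pdf2svg input.pdf "$(printf 'output_%03d.svg' "$p")" "$p"
done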
Assembling the SVGs into a single PDF
To reassemble you can use the pair inkscape / stapler to convert the files manually. But it is not hard to write code that does this. The code below uses rsvg and cairo. To convert from SVG and merge everything into a single PDF:
import cairo
import rsvg


def convert_merge(inputfiles, outputname):
    # We have to create a PDF surface and inform a size. The size is
    # irrelevant, though, as we will define the sizes of each page
    # individually.
    outputsurface = cairo.PDFSurface(outputname, 1, 1)
    outputcontext = cairo.Context(outputsurface)

    for inputfile in inputfiles:
        # Open the SVG
        svg = rsvg.Handle(file=inputfile)
        # Set the size of the page itself
        outputsurface.set_size(svg.props.width, svg.props.height)
        # Draw on the PDF
        svg.render_cairo(outputcontext)
        # Finish the page and start a new one
        outputcontext.show_page()

    # Free some memory
    outputsurface.finish()
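For the "manual" route mentioned above (converting each SVG back to PDF and then concatenating), a rough shell equivalent, assuming Inkscape 1.x and pdftk rather than stapler, would be:
for f in output_*.svg; do
    inkscape "$f" -o "${f%.svg}.pdf"
done
pdftk output_*.pdf cat output merged.pdf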
PS: It should be possible to use the command pdftocairo, but it doesn't seem to call render_for_printing(), which makes the output SVG maintain the font information.
I'm afraid that to vectorize the PDFs you would still need the original fonts (or a lot of work).
Some possibilities that come to mind:
dump the uncompressed PDF with pdftk and discover what the font names are, then look for them on FontMonster or other font service.
use some online font recognition service to get a close match with your font, in order to preserve kerning (I guess kerning and alignment are what's making your text unreadable)
try replacing the fonts manually (again using pdftk to convert the PDF into a form that is editable with sed; this editing will break the PDF, but pdftk will then be able to recompress the damaged PDF into a usable one).
Here's what you really want - font substitution. You want some code/app to be able to go through the file and make appropriate changes to the embedded fonts.
This task is doable and is anywhere from easy to non-trivial. It's easy when you have a font that matches the metrics of the font in the file and the encoding used for the font is sane. You could probably do this with iText or DotPdf (the latter is not free beyond the evaluation, and is my company's product). If you modified pdf2ps, you could probably manage changing the fonts on the way through too.
If the fonts used in the file are font subsets that have creative reencoding, then you are in hell and will likely have all manner of pain doing the change. Here's why:
PostScript was designed at a point when there was no Unicode. Adobe used a single byte for characters and whenever you rendered any string, the glyph to draw was taken from a 256 entry table called the encoding vector. If a standard encoding didn't have what you wanted, you were encouraged to make fonts on the fly based on the standard font that differed only in encoding.
When Adobe created Acrobat, they wanted to make the transition from PostScript as easy as possible, so that font mechanism was carried over. When the ability to embed fonts into PDFs was added, it was clear that this would bloat the files, so PDF also included the ability to have font subsets. Font subsets are made by taking an existing font, removing all the glyphs that won't be used, and re-encoding what remains into the PDF. There may be no standard relationship between the encoding vector and the code points in the file - all of those may be changed. Instead, there may be an embedded /ToUnicode CMap which translates encoded characters to a Unicode representation.
So yeah, non-trivial.
For the folks who come after me:
The best solutions I found were to use Evince to print as SVG, or to use the pdf2svg program that's accessible via Synaptic on Mint. However, Inkscape wasn't able to cope with the resulting SVGs--it entered an infinite loop with the error message:
File display/nr-arena-item.cpp line 323 (?): Assertion item->state & NR_ARENA_ITEM_STATE_BBOX failed
I'm giving up this quest for now, but maybe I'll try again in a year or two. In the meantime, maybe one of these solutions will work for you.

Comparison of two pdf files

I need to compare the contents of two almost similar files and highlight the dissimilar portions in the corresponding PDF file. I am using PDFBox. Please help me at least with the logic.
If you prefer a tool with a GUI, you could try diffpdf. It's by Mark Summerfield, and since it's written with Qt, it should be available (or buildable) on all platforms where Qt runs.
Here's a screenshot:
You can do the same thing with a shell script on Linux. The script wraps 3 components:
ImageMagick's compare command
the pdftk utility
Ghostscript
It's rather easy to translate this into a .bat Batch file for DOS/Windows...
Here are the building blocks:
pdftk
Use this command to split multipage PDF files into multiple singlepage PDFs:
pdftk first.pdf burst output somewhere/firstpdf_page_%03d.pdf
pdftk 2nd.pdf burst output somewhere/2ndpdf_page_%03d.pdf
compare
Use this command to create a "diff" PDF page for each of the pages:
compare \
-verbose \
-debug coder -log "%u %m:%l %e" \
somewhere/firstpdf_page_001.pdf \
somewhere/2ndpdf_page_001.pdf \
-compose src \
somewhereelse/diff_page_001.pdf
Note that compare is part of ImageMagick. But for PDF processing it needs Ghostscript as a 'delegate', because it cannot handle PDFs natively itself.
Once more, pdftk
Now you can again concatenate your "diff" PDF pages with pdftk:
pdftk \
somewhereelse/diff_page_*.pdf \
cat \
output somewhereelse/diff_allpages.pdf
Ghostscript
Ghostscript automatically inserts metadata (such as the current date and time) into its PDF output. Therefore its output does not work well for MD5-hash-based file comparisons.
If you want to automatically discover all cases which consist of purely white pages (that means: there are no visible differences in your input pages), you could also convert to a meta-data free bitmap format using the bmp256 output device. You can do that for the original PDFs (first.pdf and 2nd.pdf), or for the diff-PDF pages:
gs \
-o diff_page_001.bmp \
-r72 \
-g595x842 \
-sDEVICE=bmp256 \
diff_page_001.pdf
md5sum diff_page_001.bmp
Just create an all-white BMP page with its MD5sum (for reference) like this:
gs \
-o reference-white-page.bmp \
-r72 \
-g595x842 \
-sDEVICE=bmp256 \
-c "showpage quit"
md5sum reference-white-page.bmp
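To tie the building blocks together, a minimal driver loop could look like the following; it assumes both PDFs have the same page count and reuses the directory names from above:
#!/bin/bash
mkdir -p somewhere somewhereelse
pdftk first.pdf burst output somewhere/firstpdf_page_%03d.pdf
pdftk 2nd.pdf   burst output somewhere/2ndpdf_page_%03d.pdf

for a in somewhere/firstpdf_page_*.pdf; do
    b=${a/firstpdf/2ndpdf}                 # matching page from the second PDF
    d=somewhereelse/diff_${a##*firstpdf_}  # e.g. somewhereelse/diff_page_001.pdf
    compare "$a" "$b" -compose src "$d"
done

pdftk somewhereelse/diff_page_*.pdf cat output somewhereelse/diff_allpages.pdf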
I had this very problem myself and the quickest way that I've found is to use PHP and its bindings for ImageMagick (Imagick).
<?php
$im1 = new \Imagick("file1.pdf");
$im2 = new \Imagick("file2.pdf");

$result = $im1->compareImages($im2, \Imagick::METRIC_MEANSQUAREERROR);

if ($result[1] > 0.0) {
    // Files are DIFFERENT
} else {
    // Files are IDENTICAL
}

$im1->destroy();
$im2->destroy();
Of course, you need to install the ImageMagick bindings first:
sudo apt-get install php5-imagick # Ubuntu/Debian
I have come up with a jar using Apache PDFBox to compare PDF files - it can compare pixel by pixel and highlight the differences.
Check my blog: http://www.testautomationguru.com/introducing-pdfutil-to-compare-pdf-files-extract-resources/ for examples and downloads.
To get page count
import com.taguru.utility.PDFUtil;
PDFUtil pdfUtil = new PDFUtil();
pdfUtil.getPageCount("c:/sample.pdf"); //returns the page count
To get page content as plain text
//returns the pdf content - all pages
pdfUtil.getText("c:/sample.pdf");
// returns the pdf content from page number 2
pdfUtil.getText("c:/sample.pdf",2);
// returns the pdf content from page number 5 to 8
pdfUtil.getText("c:/sample.pdf", 5, 8);
To extract attached images from PDF
//set the path where we need to store the images
pdfUtil.setImageDestinationPath("c:/imgpath");
pdfUtil.extractImages("c:/sample.pdf");
// extracts & saves the pdf content from page number 3
pdfUtil.extractImages("c:/sample.pdf", 3);
// extracts & saves the pdf content from page 2
pdfUtil.extractImages("c:/sample.pdf", 2, 2);
To store PDF pages as images
//set the path where we need to store the images
pdfUtil.setImageDestinationPath("c:/imgpath");
pdfUtil.savePdfAsImage("c:/sample.pdf");
To compare PDF files in text mode (faster – but it does not compare the formatting, images etc. in the PDF)
String file1 = "c:/files/doc1.pdf";
String file2 = "c:/files/doc2.pdf";
// compares the pdf documents & returns a boolean
// true if both files have same content. false otherwise.
pdfUtil.comparePdfFilesTextMode(file1, file2);
// compare the 3rd page alone
pdfUtil.comparePdfFilesTextMode(file1, file2, 3, 3);
// compare the pages from 1 to 5
pdfUtil.comparePdfFilesTextMode(file1, file2, 1, 5);
To compare PDF files in binary mode (slower – compares the PDF documents pixel by pixel, highlights the differences & stores the result as an image)
String file1 = "c:/files/doc1.pdf";
String file2 = "c:/files/doc2.pdf";
// compares the pdf documents & returns a boolean
// true if both files have same content. false otherwise.
pdfUtil.comparePdfFilesBinaryMode(file1, file2);
// compare the 3rd page alone
pdfUtil.comparePdfFilesBinaryMode(file1, file2, 3, 3);
// compare the pages from 1 to 5
pdfUtil.comparePdfFilesBinaryMode(file1, file2, 1, 5);
//if you need to store the result
pdfUtil.highlightPdfDifference(true);
pdfUtil.setImageDestinationPath("c:/imgpath");
pdfUtil.comparePdfFilesBinaryMode(file1, file2);
To compare PDFs on macOS Monterey (i.e. version 12), I was able to install diff-pdf using homebrew, and run it.
The --view option didn't work for me, but the --output-diff did.

Batch OCR Program for PDFs [closed]

This has been asked before, but I don't really know if the answers help me. Here is my problem: I got a bunch of (10,000 or so) pdf files. Some were text files that were saved using adobe's print feature (so their text is perfect and I don't want to risk screwing them up). And some were scanned images (so they don't have any text and I will have to settle for OCR). The files are in the same directory and I can't tell which is which. Ultimately I want to turn them into .txt files and then do string processing on them. So I want the most accurate OCR possible.
It seems like people have recommended:
adobe pdf (I don't have a licensed copy of this so ... plus if ABBYY finereader or something is better, why pay for it if I won't use it)
ocropus (I can't figure out how to use this thing),
Tesseract (which seems like it was great in 1995 but I'm not sure if there's something more accurate. Plus it doesn't do PDFs natively and I'd have to convert to TIFF. That raises its own problem, as I don't have a licensed copy of Acrobat, so I don't know how I'd convert 10,000 files to TIFF. Plus I don't want 10,000 30-page documents turned into 30,000 individual TIFF images).
wowocr
pdftextstream (that was from 2009)
ABBYY FineReader (apparently it's $$$, but I will spend $600 to get this done if this thing is significantly better, i.e. has more accurate OCR).
Also I am a n00b to programming so if it's going to take like weeks to learn how to do something, I would rather pay the $$$. Thx for input/experiences.
BTW, I'm running Linux Mint 11 64 bit and/or windows 7 64 bit.
Here are the other threads:
Batch OCRing PDFs that haven't already been OCR'd
Open source OCR
PDF Text Extraction Approach Using OCR
https://superuser.com/questions/107678/batch-ocr-for-many-pdf-files-not-already-ocred
Just to put some of your misconceptions straight...
" I don't have a licensed copy of acrobat so I don't know how I'd convert 10,000 files to tiff."
You can convert PDFs to TIFF with the help of Free (as in liberty) and free (as in beer) Ghostscript. Your choice if you want to do it on Linux Mint or on Windows 7. The commandline for Linux is:
gs \
-o input.tif \
-sDEVICE=tiffg4 \
input.pdf
"i don't want 10,000 30 page documents turned into 30,000 individual tiff images"
You can have "multipage" TIFFs easily. Above command does create such TIFFs of the G4 (fax tiff) flavor. Should you even want single-page TIFFs instead, you can modify the command:
gs \
-o input_page_%03d.tif \
-sDEVICE=tiffg4 \
input.pdf
The %03d part of the output filename will automatically translate into a series of 001, 002, 003 etc.
Caveats:
The default resolution for the tiffg4 output device is 204x196 dpi. You probably want a better value. To get 720 dpi you should add -r720x720 to the commandline.
Also, if your Ghostscript installation uses letter as its default media size, you may want to change it. You can use -gXxY to set width x height in device pixels (so the values scale with the resolution you request). So to get ISO A4 output page dimensions in landscape at 720 dpi you can add a -g8420x5950 parameter.
So the full command which controls these two parameters, to produce 720 dpi output on A4 in portrait orientation, would read:
gs \
-o input.tif \
-sDEVICE=tiffg4 \
-r720x720 \
-g5950x8420 \
input.pdf
Figured I would try to contribute by answering my own question (have written some nice code for myself and could not have done it without help from this board). If you cat the pdf files in unix (well, OS X for me), then the pdf files that have text will have the word "Font" in them (as a string, but mixed in with other text), b/c that's how the file tells Adobe what fonts to display.
The cat command in bash seems to have the same output as reading the file in binary mode in python (using 'rb' mode when opening the file instead of 'w' or 'r' or 'a'). So I'm assuming that all pdf files that contain text will have the word "Font" in the binary output and that no image-only files ever will. If that's always true, then this code will make a list of all pdf files in a single directory that have text and a separate list of those that have only images. It saves each list to a separate .txt file, then you can use a command in bash to move the pdf files to the appropriate folder.
Once you have them in their own folders, then you can run your batch ocr solution on just the pdf files in the images_only folder. I haven't gotten that far yet (obviously).
import os
import re

# path is the directory with the files; the other two are the names of the files you will store your lists in
path = 'C:/folder_with_pdfs'
files_with_text = open('files_with_text.txt', 'a')
image_only_files = open('image_only_files.txt', 'a')

# have os make a list of all files in that dir for a loop
filelist = os.listdir(path)

# compile a regular expression that matches "Font" (as bytes, since the files are read in binary mode)
mysearch = re.compile(rb'Font')

# loop over all files in the directory, open them in binary ('rb'), search that binary for "Font"
# if they have "Font" they have text, if not they don't
# (pdf does something to understand the Font type and uses this word every time the pdf contains text)
for pdf in filelist:
    openable_file = os.path.join(path, pdf)
    with open(openable_file, 'rb') as cat_file:
        usable_cat_file = cat_file.read()
    # print(usable_cat_file)
    if mysearch.search(usable_cat_file):
        files_with_text.write(pdf + '\n')
    else:
        image_only_files.write(pdf + '\n')

files_with_text.close()
image_only_files.close()
To move the files, I entered this command in bash shell:
cat files_with_text.txt | while read i; do mv "$i" Volumes/hard_drive_name/new_destination_directory_name; done
Also, I didn't re-run the python code above, I just hand-edited the thing, so it might be buggy, Idk.
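A related heuristic, if poppler-utils is installed, is to ask pdffonts whether a file declares any fonts at all. A sketch that writes the same two list files as the Python script above, run from the directory containing the PDFs (pdffonts prints two header lines, hence the tail -n +3):
for f in *.pdf; do
    if [ "$(pdffonts "$f" 2>/dev/null | tail -n +3 | wc -l)" -gt 0 ]; then
        echo "$f" >> files_with_text.txt
    else
        echo "$f" >> image_only_files.txt
    fi
done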
This is an interesting problem. If you are willing to work on Windows in .NET, you can do this with dotImage (disclaimer, I work for Atalasoft and wrote most of the OCR engine code). Let's break the problem down into pieces - the first is iterating over all your PDFs:
string[] candidatePDFs = Directory.GetFiles(sourceDirectory, "*.pdf");
PdfDecoder decoder = new PdfDecoder();

foreach (string path in candidatePDFs) {
    using (FileStream stm = new FileStream(path, FileMode.Open)) {
        if (decoder.IsValidFormat(stm)) {
            ProcessPdf(path, stm);
        }
    }
}
This gets a list of all files that end in .pdf and if the file is a valid pdf, calls a routine to process it:
public void ProcessPdf(string path, Stream stm)
{
    using (Document doc = new Document(stm)) {
        int i = 0;
        foreach (Page p in doc.Pages) {
            if (p.SingleImageOnly) {
                ProcessWithOcr(path, stm, i);
            }
            else {
                ProcessWithTextExtract(path, stm, i);
            }
            i++;
        }
    }
}
This opens the file as a Document object and asks if each page is image only. If so it will OCR the page, else it will text extract:
public void ProcessWithOcr(string path, Stream pdfStm, int page)
{
    using (Stream textStream = GetTextStream(path, page)) {
        PdfDecoder decoder = new PdfDecoder();
        using (AtalaImage image = decoder.Read(pdfStm, page)) {
            ImageCollection coll = new ImageCollection();
            coll.Add(image);
            ImageCollectionImageSource source = new ImageCollectionImageSource(coll);
            OcrEngine engine = GetOcrEngine();
            engine.Initialize();
            engine.Translate(source, "text/plain", textStream);
            engine.Shutdown();
        }
    }
}
What this does is rasterize the PDF page into an image and put it into a form that is palatable for engine.Translate. This doesn't strictly need to be done this way - one could get an OcrPage object from the engine from an AtalaImage by calling Recognize, but then it would be up to the client code to loop over the structure and write out the text.
You'll note that I've left out GetOcrEngine() - we make available 4 OCR engines for client use: Tesseract, GlyphReader, RecoStar, and Iris. You would select the one that would be best for your needs.
Finally, you would need the code to extract text from the pages that already have perfectly good text on them:
public void ProcessWithTextExtract(string path, Stream pdfStream, int page)
{
    using (Stream textStream = GetTextStream(path, page)) {
        StreamWriter writer = new StreamWriter(textStream);
        using (PdfTextDocument doc = new PdfTextDocument(pdfStream)) {
            PdfTextPage textPage = doc.GetPage(page);
            writer.Write(textPage.GetText(0, textPage.CharCount));
        }
        writer.Flush();
    }
}
This extracts the text from the given page and writes it to the output stream.
Finally, you need GetTextStream():
public Stream GetTextStream(string sourcePath, int pageNo)
{
    string dir = Path.GetDirectoryName(sourcePath);
    string fname = Path.GetFileNameWithoutExtension(sourcePath);
    string finalPath = Path.Combine(dir, String.Format("{0}p{1}.txt", fname, pageNo));
    return new FileStream(finalPath, FileMode.Create);
}
Will this be a 100% solution? No. Certainly not. You could imagine PDF pages that contain a single image with a box drawn around it - this would clearly fail the image-only test but return no useful text. Probably a better approach is to just use the extracted text and, if that doesn't return anything, then try an OCR engine. Changing from one approach to the other is a matter of writing a different predicate.
The simplest approach would be to use a single tool such as ABBYY FineReader, OmniPage etc. to process the images in one batch, without having to sort them into scanned vs. not-scanned images. I believe FineReader converts the PDFs to images before performing OCR anyway.
Using an OCR engine will give you features such as automatic deskew, page orientation detection, image thresholding, despeckling etc. These are features you would have to buy an image processing library for and program yourself, and it could prove difficult to find an optimal set of parameters for your 10,000 PDFs.
Using the automatic OCR approach will have other side effects depending on the input images, and you would find you get better results if you sorted the images and set optimal parameters for each type of image. For accuracy it would be much better to use a proper PDF text extraction routine on the PDFs that already have perfect text.
At the end of the day it will come down to time and money versus the quality of the results that you need. A commercial OCR program will be the quickest and easiest solution. If you have clean text-only documents then a cheap OCR program will work as well as an expensive solution. The more complex your documents, the more money you will need to spend to process them.
I would try finding some demo / trial versions of commercial OCR engines and just see how they perform on your different document types before spending too much time and money.
I have written a small wrapper for Abbyy OCR4LINUX CLI engine (IMHO, doesn't cost that much) and Tesseract 3.
The wrapper can batch convert files like:
$ pmocr.sh --batch --target=pdf --skip-txt-pdf /some/directory
The script uses pdffonts to determine whether a PDF file has already been OCRed, and skips those that have. The script can also run as a system service to monitor a directory and launch an OCR action as soon as a file enters the directory.
Script can be found here:
https://github.com/deajan/pmOCR
Hopefully, this helps someone.

Convert pdf from version 1.1 to 1.4 (or higher)

How can I convert pdf files from version 1.1 to 1.4 (or higher)?
Actually I need some sort of command-line tool for batch converting, or some API, so that I can convert several documents dynamically.
Use the Ghostscript tool:
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -o output.pdf input.pdf
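Since the question asks about batch conversion, a simple shell loop over a directory is enough. A sketch that writes the rewritten files into a separate converted/ directory so the originals stay untouched:
mkdir -p converted
for f in *.pdf; do
    gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -o "converted/$f" "$f"
done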
PDF 1.1 is forward compatible with PDF 1.4. Everything in PDF 1.1 will work with PDF 1.4 - it's guaranteed by the spec. Let's assume that you've got some justifiable reason why this is not good enough for you (for example, that you have a non-spec-compliant tool that consumes PDF and explodes on any file version less than 1.4).
We can focus on the main syntactic differences between versions.
All PDF files have a header somewhere in the first 1024 bytes. In most cases, it's the very first line, but that's not guaranteed (I'm looking at you GhostScript!). The header looks like this in PDF 1.1:
%PDF-1.1
in PDF 1.4, it looks like this:
%PDF-1.4
So in theory, all you need is a tool that will look in the first 1024 bytes for a file for "%PDF-1.1" and change it to "%PDF-1.4". You could use sed, perl, etc to do something like that for you. You could write it in C and you would be tempted to do something like this:
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define PDFHEADERSIZE 1024

bool ChangeFileToNewPdfVersion(char *file)
{
    char *replacePoint = NULL;
    FILE *fp = fopen(file, "r+b"); /* read/update in binary mode; "rw" is not a valid mode string */
    char buf[PDFHEADERSIZE + 1];
    buf[PDFHEADERSIZE] = '\0';
    if (fp == NULL) { return false; }
    if (fread(buf, 1, PDFHEADERSIZE, fp) != PDFHEADERSIZE) { fclose(fp); return false; }
    fseek(fp, 0, SEEK_SET);
    if ((replacePoint = strstr(buf, "%PDF-1.1")) == NULL) { fclose(fp); return false; }
    replacePoint[7] = '4';
    if (fwrite(buf, 1, PDFHEADERSIZE, fp) != PDFHEADERSIZE) { fclose(fp); return false; }
    fflush(fp);
    fclose(fp);
    return true;
}
which will work in most sane cases. It will not work if the file starts, for example, with 0 bytes, which would serve as null terminators in the block of data.
A better choice (really) would be to cobble up a simple state machine to find %PDF-1. by reading 1 byte at a time until it either finds it or passes 1017 (1024 less the header length), then reads the next byte, if it's a '1', it seeks back a byte and writes a '4'.
The only other thing you would need to worry about is that PDF 1.4 suggests that the document catalog should contain a Version key with the file version. Since this is defined as optional in the spec, you are safe to ignore it.
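If the header really does start at byte 0 (the common case, though as noted above not guaranteed), a command-line patch avoids writing any C at all. A sketch; work on a copy, since it edits the file in place:
printf '%%PDF-1.4' | dd of=file.pdf bs=1 seek=0 conv=notrunc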
So this will solve your problem. I do not, however, believe that you should need to do this. Really.
You should take some time to read part of the PDF spec, specifically section I.2 about version numbers and compatibility.
I just had this problem. Trying to submit some PDFs to a financial institution. "We only support PDF 1.4 or newer". Apparently our HP scanner creates version 1.3 PDFs.
I opened the PDF file with Notepad++ and changed the 3 to a 4 and saved it. It was that simple.
It's the very first part of the file and it's in plain text.
Another option for a small number of PDF files is to open them in Chrome or another browser, then save as PDF or print to PDF. In my case, using Chrome, it saved to a newer PDF version and the bank accepted it.