Extract area from PDF

I want to extract an area given by x-y coordinates from a pdf page. The extracted area may be stored as a page in a new pdf document. This needs to be done several times and so I would want the process to be scripted. Are there any tools / libraries that can help do this?

If iText (for Java) or iTextSharp (for .NET) are acceptable libraries for you, you can use them to import an existing page from some PDF as a template, of which sections can be displayed in another PDF.
Have a look at the example TilingHero.java / TilingHero.cs from chapter 6 of iText in Action — 2nd Edition. The central code is:
PdfImportedPage page = writer.getImportedPage(reader, 1);
// Add the same imported page 16 times, scaled 4x with a different offset
// each time, so every new page shows one sixteenth of the original (a 4x4 grid)
float x, y;
for (int i = 0; i < 16; i++) {
    x = -pagesize.getWidth() * (i % 4);
    y = pagesize.getHeight() * (i / 4 - 3);
    content.addTemplate(page, 4, 0, 0, 4, x, y);
    document.newPage();
}
As you see, the original page is imported once and different sections of it are displayed on different pages.
(iText and iTextSharp are available either for free, subject to the AGPL, or commercially.)

You may use 'pdftoppm' to do this task:
pdftoppm -f <first page> -l <last page> -jpeg -x <start x> -y <start y> -W <width> -H <height> <in file> > <out file>
For example, cropping the area of the first PDF page starting at point (x, y) = (100, 200), which is the upper-left corner of your crop area, with a width of 50 and a height of 80, and saving it to a JPEG file is done using:
pdftoppm -f 1 -l 1 -jpeg -x 100 -y 200 -W 50 -H 80 'my.pdf' > 'crop.jpg'
If you get in trouble with your document's resolution, you can use the '-r' option of 'pdftoppm' (see the man page of 'pdftoppm' for more).
Certainly, you can easily convert the JPEG file into a PDF, if needed.
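If the same crop has to be applied to every page of a document, a small shell loop will do it. Here is a hedged sketch (it assumes pdfinfo and ImageMagick's convert are installed alongside pdftoppm; the coordinates and file names are placeholders):
pages=$(pdfinfo 'my.pdf' | awk '/^Pages:/ {print $2}')
for p in $(seq 1 "$pages"); do
  # crop the same 50x80 area at (100, 200) from page $p ...
  pdftoppm -f "$p" -l "$p" -jpeg -x 100 -y 200 -W 50 -H 80 'my.pdf' > "crop-$p.jpg"
  # ... and wrap the JPEG as a single-page PDF
  convert "crop-$p.jpg" "crop-$p.pdf"
done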

Using Ghostscript, you can crop the PDF the following way:
gs -f original.pdf -o final.pdf -sDEVICE=pdfwrite \
-c "[/CropBox [x-left y-bottom x-right y-top] /PAGES pdfmark"
Substitute x-left, y-bottom, etc., with the required coordinates. Note that for gs, the origin (0, 0) is at the bottom-left of the page.
This can then be easily scripted.
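For instance, a minimal batch sketch (the glob and the coordinates are placeholders to adapt):
for f in *.pdf; do
  gs -f "$f" -o "cropped-$f" -sDEVICE=pdfwrite \
     -c "[/CropBox [100 200 300 500] /PAGES pdfmark"
done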

Related

PDF Dimensions of Page Out of Range Errors from Ghostscript

I'm trying to produce new PDFs that alter the dimensions of only the first page (using CropBox). I used a modified version of How do I crop pages 3&4 in a multipage pdf using ghostscript.
Here is what's strange: everything runs properly, but when I open the PDFs in typical applications (Preview, Acrobat, etc.), they either crash or I get a "Warning: Dimensions of Page May be Out of Range" error. In Acrobat, only one page will display, even though the page count is 2, 45, 60, or whatever.
Even stranger: I emailed the PDFs to someone to see if it was a machine-specific issue. In Gmail, everything looks fine in Google Apps's PDF viewer. So the process 'worked,' but it looks like there's something about the dimensions or page size that is throwing other apps off.
I've tried multiple GS options (dPDFFitPage, dPrinted=false, dUseCropBox, changing paper size to something other than legal), but nothing seems to work.
I'm attaching a version of a PDF that underwent this process and generates these errors as well. https://www.dropbox.com/s/ka13b7bvxmql4d2/imfwb.pdf?dl=0
The modified command is below. xmin, ymin, xmax, ymax, height and width are variables defined elsewhere in the bigger script of which GS is a part; the data are grabbed using pdfinfo.
gs \
-o output/#{filename} \
-sDEVICE=pdfwrite \
-c \"<</EndPage {
0 eq {
pop /Page# where {
/Page# get
1 eq {
(page 1) == flush
[/CropBox [#{xmin} #{ymin} #{xmax} #{ymax}] /PAGE pdfmark
true
}
{
(not page 1) == flush
[/CropBox [0 #{height.to_f} #{width.to_f} #{height.to_f}] /PAGE pdfmark
true
} ifelse
}{
true
} ifelse
}
{
false
}
ifelse
}
>> setpagedevice\" \
-f #{filename}"
`#{cmd}`
For pages after the first you set
[/CropBox [0 #{height.to_f} #{width.to_f} #{height.to_f}] /PAGE pdfmark
I.e. a crop box with zero height!
E.g. in case of your sample document page 2 has the crop box [0 792.0 612.0 792.0].
This surely is not what you want...
If you really want to "produce new PDFs that alter dimensions only the first page (using CropBox)", why do you change the crop box of later pages at all? Simply don't do anything in that case!
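A corrected sketch might look like this (the coordinates are placeholders; for every page other than the first, the EndPage procedure just returns true and leaves the crop box alone):
gs \
  -o cropped.pdf \
  -sDEVICE=pdfwrite \
  -c '<</EndPage {
        0 eq {                 % reason 0 = showpage
          pop                  % discard the page count
          /Page# where {
            /Page# get 1 eq {
              [/CropBox [100 200 500 700] /PAGE pdfmark
            } if
          } if
          true                 % emit the page
        } { pop false } ifelse % ignore device deactivation
      }>> setpagedevice' \
  -f input.pdf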
Why "Dimensions of Page May be Out of Range"?
Well, ISO 32000-1 in its normative Annex C declares:
The minimum page size should be 3 by 3 units in default user space
Thus, according to that older PDF specification a page height of 0 indeed is out of range for PDF!
Meanwhile, though, ISO 32000-2 has dropped that requirement, so strictly speaking a page height of zero should be nothing to complain about...

-dSubsetFonts=false option stops showing TrueType fonts /glyphshow

I have a PostScript file that uses TrueType fonts. However, I want to include rarely used characters like the registered sign (®) and right/left single/double quotes (’, “, etc.).
So I used glyphshow and called the glyphs by name:
%!
<< /PageSize [419.528 595.276] >> setpagedevice
/DeviceRGB setcolorspace
% Page 1
%
% Set the Original to be the top left
%
0 595.276 translate
1 -1 scale
gsave
%
% Save this state before moving x y specifically for images
%
1 -1 scale
/BauerBodoniBT-Roman findfont 30 scalefont setfont % set the pt size %-3.792 - 16
1 0 0 setrgbcolor
10 -40 moveto /quoteright glyphshow
10 -80 moveto /registered glyphshow
/Museo-700 findfont 30 scalefont setfont % set the pt size %-3.792 - 16
1 0 1 setrgbcolor
10 -120 moveto /quoteright glyphshow
10 -180 moveto /registered glyphshow
showpage
When I execute this PostScript using the following command (I require the PDF to be editable in Illustrator, i.e. it can be opened with all fonts intact), the PDF shows nothing, but it seems to contain the glyphs: if you copy and paste from the PDF into a text file, they are there.
gs -o gly_subsetfalse.pdf -sDEVICE=pdfwrite -dCompatibilityLevel=1.3 -dSubsetFonts=false -dPDFSETTINGS=/prepress textglyph.ps
However, this command now causes issues with pulling the PDF into Illustrator. The rare glyphs become unrecognisable (', Æ). Normal characters and regular glyphs seem fine, i.e. /a glyphshow and plain show text appear in both the PDF and Illustrator.
So it seems that having the SubsetFonts option set to true shows rare glyphs, but this stops me from pulling the PDF into Illustrator.
Attached are the TrueType fonts for reference and two PDFs (one with the SubsetFonts option being true and the other not - the default).
I have also tried the following command, with the same ill results (no visible glyphs appearing in the PDF, and Illustrator incorrectly showing the glyphs).
gs -o gly_subsetfalse_embedallfonts.pdf -sDEVICE=pdfwrite -dCompatibilityLevel=1.3 -dPDFSETTINGS=/prepress -dSubsetFonts=false -dEmbedAllFonts=true textglyph.ps
But with this command I also get a PreFlight error from the PDF if that helps:
"Glyph width info in PDF does not match width info in embedded font"
Attached are all the files spoken about above - click here.
Encoding the font also does not produce good results.
I have encoded a TrueType(and a Type42) font in my PostScript and listed a few new characters to glyphshow.
Results are:
Command 1:
gs -o encode_ttf_subset_false.pdf -sDEVICE=pdfwrite -dSubsetFonts=false encode.ps
Results 1:
Opening the PDF in Acrobat does NOT display any glyphshow characters.
Command 2:
gs -o encode_ttf_subset_true.pdf -sDEVICE=pdfwrite encode.ps
Results 2:
Opening the PDF in Acrobat DOES show the glyphshow characters, but Illustrator does not.
Command 3:
gs -o encode_ttf_subset_false_embedtrue.pdf -sDEVICE=pdfwrite -dSubsetFonts=false -dEmbedAllFonts=true encode.ps
Results 3:
Same as Result 1 (glyphshow characters do not appear).
Below is my new PostScript with Encoded TTF and Type42 (I've also included them in my file further below).
Is this a bug at least with Ghostscript?
/museobold findfont dup %%%%% This is the Type42 Font
length dict
copy begin
/Encoding Encoding 256 array copy def
Encoding 1 /oacute put
Encoding 2 /aacute put
Encoding 3 /eacute put
Encoding 4 /questiondown put
Encoding 5 /quotedblleft put
Encoding 6 /quoteright put
Encoding 7 /quotedblbase put
/museobold-Esp currentdict definefont pop
end
/museobold-Esp 18 selectfont
72 600 moveto
(\005D\001lnde est\002 el camino a San Jos\003? More characters \006 and \007) show
%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%
/BauerBodoniBT-Roman findfont dup
length dict
copy begin
/Encoding Encoding 256 array copy def
Encoding 1 /oacute put
Encoding 2 /aacute put
Encoding 3 /eacute put
Encoding 4 /questiondown put
Encoding 5 /quotedblleft put
Encoding 6 /quoteright put
Encoding 7 /quotedblbase put
/BauerBodoniBT-Roman-Esp currentdict definefont pop
end
/BauerBodoniBT-Roman-Esp 18 selectfont
72 630 moveto
(\005D\001lnde est\002 el camino a San Jos\003? More characters \006 and \007) show
showpage
Click here to download the following: BBBTRom.ttf (TrueType font); 3 PDFs (results 1, 2 and 3); museobold (TrueType font converted to Type42 using ttftotype42) and encode.ps.
This is back to your problem with using Illustrator as a general PDF application. It can't do that. As you note you've found ways round that in the past; this time I believe you are out of luck.
The PostScript glyphshow operator doesn't have a PDF equivalent. Also, because of the way glyphshow works, we cannot simply use any existing font instance to store the glyph (because the glyph may not be, and probably isn't, present in the Encoding). As a result pdfwrite does the only thing it can: it makes a new font which consists only of the glyphs used by glyphshow from the specific original font's CharStrings.
Because we don't have an Encoding to work from, we have to use a custom (symbolic) Encoding (because fonts in a PDF file have to have an Encoding), which from your previous experience I suspect means that Illustrator is unable to read the font we embed.
Using glyphshow with pdfwrite is something I would not encourage.
Now having said that, there should not be a problem with the PDF file when SubsetFonts is true, though I do have an open bug report which sounds similar. You haven't actually said which version of Ghostscript you are using, so I can't be sure if it's the same problem (nor do I have the same fonts, etc.). Note that this is not (I believe) related to your problem with Illustrator; that is caused by your use of glyphshow and some Illustrator limitation.
As a general rule I would not use -dPDFSETTINGS, certainly not while trying to debug a problem, nor would I limit the output to PDF 1.3.
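In other words, while debugging, a stripped-down command would be just (a hedged suggestion, not a guaranteed fix):
gs -o gly_subsetfalse.pdf -sDEVICE=pdfwrite -dSubsetFonts=false textglyph.ps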

ImageMagick generate pdf with special page numbering

I generate a PDF from a set of PNG files using this command:
convert -- $(ls -v -- src/*.png) out/book.pdf
where there are some files with names like -03.png, which I need to have smaller page numbers than the others. But I get a PDF in which -01 has page number 1, -02 has number 2, etc., and 01 starts from page number 6.
The PDF is a scanned book containing some elements, like the table of contents, which aren't included in the page numbering. I remember having seen PDFs with special page numbers like vii before the normal Arabic numbers start.
I've tried using -scene -5 to add an offset to page numbers, but this didn't change the result.
So what should I instead do to make page "01.png" have page number 1, etc., and previous ones have some other numbers (negative or Latin, anything) and appear at the beginning of the document?
First, you want to sort the files numerically, taking the optional minus sign into account, which the command you show won't do.
Second, you are talking about PageLabels for PDF pages, which you can add using Ghostscript and the pdfmark operator.
Try this command:
ls src/*.png | \
sort -n | \
convert @- pdf:- | \
gs \
-sDEVICE=pdfwrite \
-o out/book.pdf \
-c '[{Catalog}<</PageLabels<</Nums[0<</P(-3)>>1<</P(-2)>>2<</P(-1)>>3<</S/D>>]>>>>/PUT pdfmark' \
-f -
It's for 3 pages -3, -2 and -1, followed by any number of pages labelled 1, 2, 3 etc. Modify according to your needs.
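Since the question mentions front-matter numbers like vii: the standard PDF label styles include /r (lowercase Roman) and /D (decimal), so a hedged variant that labels five front-matter pages i-v and then restarts decimal numbering would be:
gs \
  -sDEVICE=pdfwrite \
  -o out/book.pdf \
  -c '[{Catalog}<</PageLabels<</Nums[0<</S/r>>5<</S/D>>]>>>>/PUT pdfmark' \
  -f in.pdf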

Convert subset of postscript file to pdf documents

I have a system that generates large quantities of PostScript files that each contain multiple, multi-page documents. I want to write a script that takes these large PostScript documents and outputs multiple PDF documents from each.
For example, one PostScript file contains 200 letters to customers, each of which is 10 pages long, so the file contains 2000 pages. From this one PS document I want to output 200 ten-page PDFs, one for each customer.
I'm thinking Ghostscript is the way to go for this level of document manipulation, but I'm not sure of the best approach. Is there a function in Ghostscript to take 'pages 1-10' of the input PS file? Do I have to output the entire PS file as 2000 separate PS files (one per page) and then combine them back together again?
Or is there a much simpler way of achieving my goal with something other than Ghostscript?
Many Thanks,
Ben
Technically this will be possible in the next release of Ghostscript, or using the HEAD code in the Git repository. It is now possible to switch devices when using pdfwrite which will cause the device to close and complete the current PDF file. Switching back again will start a new one.
Combine this with a BeginPage and/or EndPage procedure in the page device dictionary, and you should be able to do something like what you want.
Caveat; I haven't tried any of this, and it will take some PostScript programming to get it to work.
Because of the nature of PostScript, there is no way to extract the 'N'th page from a file, so there is no way to specify a range of pages.
As lsemi suggests you could first convert to one large PDF file and then extract the ranges you want. Ghostscript is able to use the FirstPage and LastPage switches to do this (unlike PostScript, it is possible to extract a specific page from a PDF file).
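For example, once the big PDF exists, something like this (file names assumed for illustration) extracts the second ten-page letter:
gs -o letter-02.pdf -sDEVICE=pdfwrite -dFirstPage=11 -dLastPage=20 2000p.pdf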
Well, you might first make the PS into one big PDF (or directly generate a PDF from Ghostscript by printing to the pdfwrite device), and then "cut" the pieces you want from the big PDF using pdftk, which would be quite fast.
Create the complete PDF file first with the help of Ghostscript:
gs \
-o 2000p.pdf \
-sDEVICE=pdfwrite \
-dPDFSETTINGS=/prepress \
2000p.ps
Use pdftk to extract PDF files with 10 pages each:
for i in $(seq 0 10 190); do \
  start=$(( i + 1 )); \
  end=$(( start + 9 )); \
  pdftk \
    2000p.pdf \
    cat ${start}-${end} \
    output pages---${start}..${end}.pdf; \
done
You can have Ghostscript generate a 2000-page sample and test PDF for you by first creating a sample PostScript file named '2000p.ps' with these contents:
%!PS
% H1: 48 pt Helvetica in blue; pageframe: red page border; gopageno: move to
% the position where the page number will be shown.
/H1 {/Helvetica findfont 48 scalefont setfont .2 .2 1 setrgbcolor} def
/pageframe {1 0 0 setrgbcolor 2 setlinewidth 10 10 575 822 rectstroke} def
/gopageno {H1 300 700 moveto} def
% Emit 2000 pages, each with its page number right-aligned at x = 300.
1 1 2000 {
  pageframe gopageno
  4 string cvs          % convert the loop counter to a string
  dup stringwidth pop   % measure the string's width
  -1 mul 0 rmoveto      % shift left by that width (right-align)
  show
  showpage
} for
and then run this command:
gs -o 2000p.pdf -sDEVICE=pdfwrite -g5950x8420 2000p.ps

High-res images from PDFs

I'm working on a project in which I need to extract a TIFF per page from multi-page PDFs. The PDFs contain images only and there is one image per page (I believe they were made on some kind of photocopier/scanner, but haven't confirmed this). The TIFFs are then used to create several other derivative versions of the document so the higher the resolution the better.
I've found two recipes, both with helpful aspects, but neither is ideal. Hoping someone can help me tune one of them, or offer a third option.
Recipe 1, pdfimages and ImageMagick:
First do:
$ pdfimages $MY_PDF.pdf foo
Which results in several .pbm files (named foo-000.pbm, foo-001.pbm, etc.).
Then for each *.pbm do:
$ convert $each -resize 3200x3200\> -quality 100 $new_name.tif
Pro: The resultant TIFFs are a healthy 3300+ pixels on the long dimension (-resize just serves to normalize everything).
Con: The orientation of the pages is lost, and they come out rotated in different directions (they follow logical patterns, so they are probably in the orientation in which they were fed to the scanner?).
Recipe 2, ImageMagick solo:
convert +adjoin $MY_PDF.pdf pages.tif
This gives me a TIFF per page (pages-0.tif, pages-1.tif, etc.).
Pro: Orientation stays!
Con: The long dimension of the resultant file is < 800 px, which is too small to be useful, and it looks as though there is some compression applied.
How can I ditch the scaling of the image stream in the PDF, but retain the orientation? Is there some more magick in ImageMagick that I'm missing? Something else entirely?
Sorry for the noise on this old topic, but Google took me here as one of the top results, and it might take others, so I thought I'd post the solution to the OP's question that I found here: http://robfelty.com/2008/03/11/convert-pdf-to-png-with-imagemagick
In short: you have to tell ImageMagick at which density it should rasterize the PDF.
So convert -density 600x600 foo.pdf foo.png will tell ImageMagick to treat the PDF as if it had a 600 dpi resolution and thus output much larger PNGs. In my case, the resulting foo.png was sized 5000x6600 px. You can optionally add -resize 3000x3000 or whatever size you require, and it will be scaled down.
Note that as long as you only have vector images or text in your PDF files, the density may be set as high as needed. If the PDF contains rasterized images, it won't look good if you set the density higher than those images' dpi, surprise! :)
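Applied to recipe 2 from the question, a hedged one-liner along these lines should keep the orientation and still give a usable resolution (300 dpi is an assumption to tune against the source images):
convert -density 300 +adjoin $MY_PDF.pdf pages.tif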
Chris
I wanted to share my solution... it may not work for everyone, but since nothing else has come around, maybe it will help someone else. I wound up going with the first option in my question, which was to use pdfimages to get large images that were rotated every which way. I then found a way to use OCR and word counts to guess at the orientation, which got me from (an estimated) 25% rotated accurately to above 90%.
The flow is as follows:
1. Use pdfimages (apt-get install poppler-utils) to get a set of pbm files (not shown below).
2. For each file:
   - Make four versions, rotated 0, 90, 180, and 270 degrees (I refer to them as "north", "east", "south", and "west" in my code).
   - OCR each. The two with the lowest word count are likely the right-side-up and upside-down versions. This was over 99% accurate in my set of images processed to date.
   - From those two, run the OCR output through a spell check. The file with the fewest spelling errors (i.e. the most recognizable words) is likely to be correct. For my set this was about 93% accurate (up from 25%), based on a sample of 500.
YMMV. My files are bitonal and highly textual. The source images average 3300 px on the long side. I can't speak to greyscale or color files, or files with a lot of images. Most of my source PDFs are bad scans of old photocopies, so the accuracy might be even better with cleaner files. Using -despeckle during the rotation made no difference and slowed things down considerably (~5×). I chose ocrad for speed, not accuracy, since I only need rough numbers and am throwing away the OCR output. Re: performance, my nothing-special Linux desktop machine can run the whole script at about 2-3 files per second.
Here's the implementation in a simple bash script:
#!/bin/bash
# Rotates a pbm file in place.
# Pass a .pbm as the only arg.
file=$1
TMP="/tmp/rotation-calc"
mkdir -p $TMP
# Dependencies:
# convert: apt-get install imagemagick
# ocrad: sudo apt-get install ocrad
ASPELL="/usr/bin/aspell"
AWK="/usr/bin/awk"
BASENAME="/usr/bin/basename"
CONVERT="/usr/bin/convert"
DIRNAME="/usr/bin/dirname"
HEAD="/usr/bin/head"
OCRAD="/usr/bin/ocrad"
SORT="/usr/bin/sort"
WC="/usr/bin/wc"
# Make copies in all four orientations (the src file is north; copy it to make
# things less confusing)
file_name=$($BASENAME $file)
north_file="$TMP/$file_name-north"
east_file="$TMP/$file_name-east"
south_file="$TMP/$file_name-south"
west_file="$TMP/$file_name-west"
cp $file $north_file
$CONVERT -rotate 90 $file $east_file
$CONVERT -rotate 180 $file $south_file
$CONVERT -rotate 270 $file $west_file
# OCR each (just append ".txt" to the path/name of the image)
north_text="$north_file.txt"
east_text="$east_file.txt"
south_text="$south_file.txt"
west_text="$west_file.txt"
$OCRAD -f -F utf8 $north_file -o $north_text
$OCRAD -f -F utf8 $east_file -o $east_text
$OCRAD -f -F utf8 $south_file -o $south_text
$OCRAD -f -F utf8 $west_file -o $west_text
# Get the word count for each txt file (least 'words' == least whitespace junk
# resulting from vertical lines of text that should be horizontal.)
wc_table="$TMP/wc_table"
echo "$($WC -w $north_text) $north_file" > $wc_table
echo "$($WC -w $east_text) $east_file" >> $wc_table
echo "$($WC -w $south_text) $south_file" >> $wc_table
echo "$($WC -w $west_text) $west_file" >> $wc_table
# Take the bottom two; these are likely right side up and upside down, but
# generally too close to call beyond that.
bottom_two_wc_table="$TMP/bottom_two_wc_table"
$SORT -n $wc_table | $HEAD -2 > $bottom_two_wc_table
# Spellcheck. The lowest number of misspelled words is most likely the
# correct orientation.
misspelled_words_table="$TMP/misspelled_words_table"
while read record; do
txt=$(echo $record | $AWK '{ print $2 }')
misspelled_word_count=$(cat $txt | $ASPELL -l en list | wc -w)
echo "$misspelled_word_count $record" >> $misspelled_words_table
done < $bottom_two_wc_table
# Do the sort, overwrite the input file, save out the text
winner=$($SORT -n $misspelled_words_table | $HEAD -1)
rotated_file=$(echo $winner | $AWK '{ print $4 }')
mv $rotated_file $file
# Clean up.
if [ -d $TMP ]; then
rm -r $TMP
fi