How can I extract the last four pages from 1000+ PDFs in batch?

I have 1,000+ PDFs. I need to extract the last four pages of each document and then save each result as a new file.
What is the easiest way to do this?
I don't have access to Adobe Pro, and the number of documents means this will need to be automated.

Use your favourite shell or scripting language to call:
cpdf in.pdf ~4-end -o out.pdf
You may need to check that there are at least four pages in the file as well.
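For the batch itself, a minimal shell sketch (it assumes cpdf is on the PATH and that cpdf -pages prints just the page count; check your cpdf version's documentation for the exact behaviour):
mkdir -p trimmed
for f in *.pdf; do
  pages=$(cpdf -pages "$f")            # number of pages in this file (assumed output: a bare number)
  if [ "$pages" -ge 4 ]; then
    cpdf "$f" ~4-end -o "trimmed/$f"   # keep only the last four pages
  else
    cp "$f" "trimmed/$f"               # fewer than four pages: copy the file unchanged
  fi
done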

Related

Use ghostscript to delete a page (not extracting a range)

I know Ghostscript can use -dFirstPage and -dLastPage to produce a file from just a range of pages, but I need it (or another command-line program) to delete the 2nd page of any PDF, without the page range being specified explicitly. I thought this would be far easier, because most printers let you specify "1,3-end", and I have been using PDFCreator to do it that way.
The one way I can think of doing it (very, very messy) is to extract page 1, extract pages 3 to the end, and then merge the two PDFs. But I also don't know how to have GS determine the number of pages.
Use the right tool for the job!
For reasons outlined by KenS, Ghostscript is not the best tool for what you want to achieve. A better tool for this task is pdftk. To remove the 2nd page from input.pdf, you should run this command line:
pdftk input.pdf cat 1 3-end output output.pdf
OK, first things first: if you use Ghostscript's pdfwrite device you are NOT extracting, deleting, or performing any other 'manipulation' operation on your source PDF file. I keep reiterating this, but I'm going to say it again.
When you pass an input file through Ghostscript, it is completely interpreted into a series of graphical primitives which are passed to the device; in general the device will render the primitives to a bitmap. In the case of the 'high level' devices such as pdfwrite, the primitives are re-assembled into a brand new file, in the case of pdfwrite a PDF file.
This flexibility allows for input in a number of different page description languages (PostScript, PDF, PCL, PCL-XL, XPS) and then output in a few different high level formats (PostScript, EPS, flavours of PDF, XPS, PCL, PCL-XL).
But the new file bears no relation to the original, other than its appearance.
Now, having got that out of the way... You can use the pdf_info.ps PostScript program, supplied in the 'toolbin' directory of the Ghostscript installation, to get a variety of information about PDF files; one of the things you can get is the number of pages in the PDF. You don't actually need to bother, though: run the file once with -dLastPage=1, then run it again with -dFirstPage=3 (don't set LastPage), then run both resulting files together to create a file with the pages from each combined.
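A rough shell sketch of that approach for dropping page 2 (the exact commands are my own assumption; remember that pdfwrite produces brand-new PDFs rather than editing the original):
gs -o page1.pdf  -sDEVICE=pdfwrite -dLastPage=1  input.pdf    # page 1 only
gs -o rest.pdf   -sDEVICE=pdfwrite -dFirstPage=3 input.pdf    # pages 3 to the end
gs -o output.pdf -sDEVICE=pdfwrite page1.pdf rest.pdf         # run both results again to combine them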

Ghostscript skips characters when merging PDFs

I have a problem when using Ghostscript (version 8.71) on Ubuntu to merge PDF files created with wkhtmltopdf.
The problem I experience on random occasions is that some characters get lost in the merge process and replaced by nothing (or a space) in the merged PDF. If I look at the original PDF it looks fine, but after the merge some characters are missing.
Note that a missing character, such as the number 9 or the letter a, can be lost in one place in the document but show up fine somewhere else, so it is not a display problem or a font issue as such.
The command I am using is:
gs \
-q \
-dNOPAUSE \
-sDEVICE=pdfwrite \
-sOutputFile=/tmp/outputfilename \
-dBATCH \
/var/www/documents/docs/input1.pdf \
/var/www/documents/docs/input2.pdf \
/var/www/documents/docs/input3.pdf
Has anyone else experienced this, or, even better, does anyone know a solution for it?
I've seen this happening if the names for embedded font subsets are identical, but the real content of these subsets are different (containing different glyph sets).
Check all your input files for the fonts used. Use Poppler's pdffonts utility for this:
for i in input*.pdf; do
  pdffonts "${i}" | tee "${i}.pdffonts.txt"
done
Look for the font names used in each PDF.
My theory/bet is that you'll see identical font names (names similar to BAAAAA+ArialMT) used by different input files.
The BAAAAA+ prefix used for subset font names is supposed to be random (though the official specification is not very clear about this). Some applications use predictable prefixes, however, starting with BAAAAA+, then CAAAAA+, DAAAAA+, etc. (OpenOffice.org and LibreOffice are notorious for this). This means that the prefix BAAAAA+ gets used in every single file where at least one subset font is used...
It can easily happen that your input files do not use the exact same subset of characters. However, the identical names can make Ghostscript think the fonts really are the same. It then (falsely) 'optimizes' the merged PDF and embeds only one of the two font instances (both having the same name, for example BAAAAA+Arial). However, this instance may not include some glyphs which were part of the other instance(s).
This leads to characters missing from the merged output.
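A quick way to hunt for such name collisions across all inputs (a sketch; it assumes pdffonts' usual two header lines and font names without spaces):
for i in input*.pdf; do
  # skip the two pdffonts header lines; keep one entry per font name per file
  pdffonts "${i}" | awk -v file="${i}" 'NR > 2 {print $1, file}' | sort -u
done | awk '{print $1}' | sort | uniq -d
# every name printed here is used by more than one input file -- a collision candidate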
I know that more recent versions of Ghostscript have seen a heavy overhaul of their font handling code. Maybe you'll have more luck with Ghostscript v9.06 (the most recent release to date).
I'm very much interested in investigating this in even bigger detail. If you can provide a sample of your input files (as well as the merged output produced by GS v8.70), I can test whether it works better with v9.06.
What you could do to avoid this problem
1. Try to always embed fonts as full sets, not subsets:
I don't know if or how you can enforce full font embedding when using wkhtmltopdf.
If you generate your input PDFs from Libre/OpenOffice, you're out of luck and you'll have no control over it.
If you use Acrobat to generate your input PDFs, you can tweak font embedding details in the Distiller settings.
If Ghostscript generates your input PDFs, the command-line parameters to enforce full font embedding are:
gs -o output.pdf -sDEVICE=pdfwrite -dSubsetFonts=false input.file
Some types of fonts cannot be embedded fully, only subsetted (TrueType, Type3, CIDFontType0, CIDFontType1, CIDFontType2). See this answer to the question "Why doesn't Acrobat Distiller embed all fonts fully?" for more details.
2. Do the following only if you are sure that no one else gets to see, print or otherwise use your individual input files: do not embed the fonts at all -- only embed them when merging the final result PDF from your inputs with Ghostscript.
I don't know if or how you can prevent font embedding when using wkhtmltopdf.
If you generate your input PDFs from Libre/OpenOffice, you're out of luck and you'll have no control over it.
If you use Acrobat to generate your input PDFs, you can tweak font embedding details in the Distiller settings.
If Ghostscript generates your input PDFs, the command-line parameters to prevent font embedding are:
gs -o output.pdf -sDEVICE=pdfwrite -dEmbedAllFonts=false -c "<</AlwaysEmbed [ ]>>setpagedevice" input.file
Some types of fonts cannot be embedded fully, only subsetted (Type3, CIDFontType1). See this answer to the question "Why doesn't Acrobat Distiller embed all fonts fully?" for more details.
3. Do not use Ghostscript; use pdftk for merging PDFs instead. pdftk is a more 'dumb' utility than Ghostscript (at least older versions of pdftk are) when it comes to merging PDFs, and this dumbness can be an advantage...
Update
To answer once more, but this time more explicitly (following the extra question from @sacohe in the comments below). In many (not all) cases the following procedure will work:
Re-'distill' the input PDF files with the help of Ghostscript (preferably the most recent version from the 9.0x series).
The command to use is this (or similar):
gs -o redistilled-out.pdf -sDEVICE=pdfwrite input.pdf
The resulting output PDF should then use different (unique) prefixes for the font names, even when the input PDF used the same name prefix for different font (subsets).
This procedure worked for me when I processed a sample of original input files provided to me by 'Mr R', the author of the original question. After that fix, the "skipped character problem" was gone in the final result (a merged PDF created from the fixed input files).
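With many inputs the re-distilling step can at least be scripted; a minimal sketch, assuming the inputs are named input*.pdf:
mkdir -p redistilled
for f in input*.pdf; do
  gs -o "redistilled/$f" -sDEVICE=pdfwrite "$f"            # re-distill each input before merging
done
gs -o merged.pdf -sDEVICE=pdfwrite redistilled/input*.pdf  # then merge the fixed copies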
I wanted to give some feedback: unfortunately the re-processing trick doesn't seem to work with Ghostscript 8.70 (the version in Red Hat/CentOS releases) and files exported as PDF from Word 2010 (which seems to use the ABCDEE+ prefix for everything), and I haven't been able to find any pre-built versions of Ghostscript 9 for my platform.
You mention that older versions of pdftk might work. We moved away from pdftk (newer versions) to gs because some PDF files would cause pdftk to core dump. @Kurt, do you think that trying to find an older version of pdftk might help? If so, what version do you recommend?
Another ugly method that halfway works is to use:
-sDEVICE=pdfwrite -dCompatibilityLevel=1.2 -dHaveTrueType=false
which converts the fonts to bitmaps. But it then makes the characters on the page a bit light (not a big deal), selecting text is off by about one line height (mildly annoying), and, worst of all, even though the characters display OK, copy/paste gives random garbage.
(I was hoping this would be a comment, but I guess I can't do that; is the answer closed?)
From what I can tell, this issue is fixed in Ghostscript version 9.21. We were having a similar issue where merged PDFs were missing characters, and while @Kurt Pfeifle's suggestion of re-distilling those PDFs did work, it seemed a little infeasible to us. Some of our merged PDFs consisted of 600 or more individual PDFs, and re-distilling every single one of those just to merge them seemed excessive.
Our production version of Ghostscript was 9.10, which was causing this problem. But when I ran some tests on 9.21 the problem seemed to vanish. I have been unable to produce a document with missing or mangled characters using GS 9.21, so I think that's the real solution here.

Cross-platform library for separating and converting PDFs to images?

I'm currently planning an application which involves manipulating PDFs. My goal is to have a program that I can pass a PDF as input, and which then saves separated grayscale images of the colour channels that the PDF consists of as output. This is basically a simple RIP.
I'm currently using a Ghostscript-based solution, but I want to rewrite the application to improve speed and usability. (Ghostscript doesn't separate PDFs, for example.)
Do you know of any other open-source libraries that I may find useful for achieving this?
Did you ever try to run (I'm assuming Windows here):
mkdir separated
gswin32c.exe ^
-o separated/page_%04d.tif ^
-sDEVICE=tiffsep ^
d:/path/to/input.pdf
(you can also try -sDEVICE=tiffsep1) and then looked at the files you get in the separated subdirectory? Is that not a case of Ghostscript separating PDFs, in your mind?
The device tiffsep creates multiple output files:
One single 32bit composite CMYK file (tiff32nc format) per PDF page.
Multiple tiffgray files (each compressed with LZW) per PDF page, one for each separation.
The device tiffsep1 behaves similarly, but it doesn't create the composite output file, and it creates tiffg4 output files for the separations.
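On Linux or macOS the same thing looks like this (a sketch; -r300, which sets the rendering resolution, is my own addition):
mkdir -p separated
gs \
  -o separated/page_%04d.tif \
  -sDEVICE=tiffsep \
  -r300 \
  /path/to/input.pdf
# per page: one composite tiff32nc file plus one tiffgray file for each separation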

How to extract text from a PDF? [closed]

Can anyone recommend a library/API for extracting the text and images from a PDF?
We need to be able to get at text that is contained in pre-known regions of the document, so the API will need to give us positional information for each element on the page.
We would like that data to be output in XML or JSON format. We're currently looking at PdfTextStream, which seems pretty good, but we would like to hear other people's experiences and suggestions.
Are there alternatives (commercial or free) for extracting text from a PDF programmatically?
I was given a 400 page pdf file with a table of data that I had to import - luckily no images. Ghostscript worked for me:
gswin64c -sDEVICE=txtwrite -o output.txt input.pdf
The output file was split into pages with headers, etc., but it was then easy to write an app to strip out blank lines, etc, and suck in all 30,000 records. -dSIMPLE and -dCOMPLEX made no difference in this case.
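The clean-up step mentioned above can often be a one-liner; for example (assuming a Unix-like shell), to drop blank lines from the txtwrite output:
grep -v "^[[:space:]]*$" output.txt > records.txt   # remove blank and whitespace-only lines
wc -l records.txt                                   # sanity-check the record count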
An efficient command-line tool, open source, free of charge, and available on both Linux and Windows: simply named pdftotext. This tool is part of the Xpdf library.
http://en.wikipedia.org/wiki/Pdftotext
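Typical usage (the -layout option tries to preserve the physical page layout, which helps with tables; -f and -l select a page range):
pdftotext -layout input.pdf output.txt        # whole document, layout preserved
pdftotext -layout -f 3 -l 7 input.pdf -       # pages 3-7 only, written to stdout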
Since today I know it: the best tool for text extraction from PDFs is TET, the Text Extraction Toolkit. TET is part of the PDFlib.com family of products.
PDFlib.com is Thomas Merz's company. In case you don't recognize his name: Thomas Merz is the author of the "PostScript and PDF Bible".
TET's first incarnation is a library. It can probably do everything Budda006 wanted, including positional information about every element on the page. Oh, and it can also extract images, recombining images which are fragmented into pieces.
pdflib.com also offers another incarnation of this technology, the TET plugin for Acrobat. And the third incarnation is the PDFlib TET iFilter, a standalone tool for user desktops. Both of these are free (as in beer) to use for private, non-commercial purposes.
And it's really powerful. Way better than Adobe's own text extraction. It extracted text for me where other tools (including Adobe's) spat out only garbage.
I just tested the desktop standalone tool, and what they say on their web page is true. It has a very good command line. It handled some of my "problematic" PDF test files to my full satisfaction.
From now on this will be my recommendation for any sophisticated and challenging PDF text extraction requirement.
TET is simply awesome. It detects tables. Inside tables, it identifies cells spanning multiple columns. It identifies table rows and contents of each table cell separately. It deals very well with hyphenations: it removes hyphens and restores complete words. It supports non-ASCII languages (including CJK, Arabic and Hebrew). When encountering ligatures, it restores the original characters...
Give it a try.
For Python, there are PDFMiner and pyPDF2. For more information on these, see Python module for converting PDF to text.
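PDFMiner also ships a small command-line wrapper, which is the quickest way to try it (a sketch; the script is called pdf2txt.py in the PDFMiner releases I know of):
pdf2txt.py -o output.txt input.pdf          # plain text
pdf2txt.py -t xml -o output.xml input.pdf   # XML output, which includes positional information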
Here is my suggestion.
If you want to extract text from a PDF, you could import the PDF file into Google Docs and then export it to a friendlier format such as .html, .odf, .rtf or .txt, all of this using the Drive API. It is free* and robust. Take a look at:
https://developers.google.com/drive/v2/reference/files/insert https://developers.google.com/drive/v2/reference/files/get
Because it is a REST API, it is compatible with all programming languages. The links I posted above have working examples for many languages, including Java, .NET, Python, PHP and Ruby.
I hope it helps.
PdfTextStream (which you said you have been looking at) is now free for single threaded applications. In my opinion its quality is much better than other libraries (esp. for things like funky embedded fonts, etc).
It is available in Java and C#.
Alternatively, you should have a look at Apache PDFBox, open source.
One of the comments here used gs on Windows. I had some success with that on Linux/OSX too, with the following syntax:
gs \
-q \
-dNODISPLAY \
-dSAFER \
-dDELAYBIND \
-dWRITESYSTEMDICT \
-dSIMPLE \
-f ps2ascii.ps \
"${input}" \
-dQUIET \
-c quit
I used -dSIMPLE instead of -dCOMPLEX because the latter outputs one character per line.
Docotic.Pdf library may be used to extract text from PDF files as plain text or as a collection of text chunks with coordinates for each chunk.
Docotic.Pdf can be used to extract images from PDFs, too.
Disclaimer: I work for Bit Miracle.
As the question is specifically about alternative tools for getting data out of PDFs as XML, you may be interested in the commercial tool "ByteScout PDF Extractor SDK", which is capable of doing exactly this: extracting text from a PDF as XML along with the positioning data (x, y) and font information:
Text in the source PDF:
Products | Units | Price
Output XML:
<row>
<column>
<text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="212" y="126" width="47" height="11">Products</text>
</column>
<column>
<text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="428" y="126" width="27" height="11">Units</text>
</column>
<column>
<text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="503" y="126" width="26" height="11">Price</text>
</column>
</row>
P.S.: additionally, it also breaks the text into a table-based structure.
Disclosure: I work for ByteScout
The best thing I can currently think of (within the list of "simple" tools) is Ghostscript (current version is v.8.71) and the PostScript utility program ps2ascii.ps. Ghostscript ships it in its lib subdirectory. Try this (on Windows):
gswin32c.exe ^
-q ^
-sFONTPATH=c:/windows/fonts ^
-dNODISPLAY ^
-dSAFER ^
-dDELAYBIND ^
-dWRITESYSTEMDICT ^
-dCOMPLEX ^
-f ps2ascii.ps ^
-dFirstPage=3 ^
-dLastPage=7 ^
input.pdf ^
-dQUIET ^
-c quit
This command processes pages 3-7 of input.pdf. Read the comments in the ps2ascii.ps file itself to see what the "weird" numbers and additional info mean (they indicate strings, positions, widths, colors, pictures, rectangles, fonts and page breaks...). To get a "simple" text output, replace the -dCOMPLEX part with -dSIMPLE.
I know that this topic is quite old, but the need is still alive. I read many documents, forums and scripts and built a new, advanced one which supports both compressed and uncompressed PDFs:
https://gist.github.com/smalot/6183152
In some cases the command line is forbidden for security reasons, so a native PHP class can fit many needs.
Hope it helps everyone.
For image extraction, pdfimages is a free command line tool for Linux or Windows (win32):
pdfimages: Extract and Save Images From A Portable Document Format ( PDF ) File
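Basic usage is (a sketch; the last argument is the filename prefix for the extracted images):
pdfimages input.pdf img        # writes img-000.ppm, img-001.ppm, ... (or .pbm for monochrome images)
pdfimages -j input.pdf img     # -j saves DCT-encoded images as JPEG files instead of converting them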
Apache PDFBox has this feature; the text part is described in:
http://pdfbox.apache.org/apidocs/org/apache/pdfbox/util/PDFTextStripper.html
For an example implementation, see
https://github.com/WolfgangFahl/pdfindexer
The test case TestPdfIndexer.testExtracting shows how it works.
QuickPDF seems to be a reasonable library that should do what you want for a reasonable price.
http://www.quickpdflibrary.com/ - They have a 30 day trial.
On my Macintosh systems, I find that "Adobe Reader" does a reasonably good job. I created an alias on my Desktop that points to "Adobe Reader.app", and all I do is drop a PDF file on the alias, which makes it the active document in Adobe Reader; then from the File menu I choose "Save as Text...", give it a name and a location to save to, click "Save", and I'm done.

How to wget a website and then export all the pages into a single pdf?

Is there any way I can do that?
I did this once. It was almost 10 years ago, however, so I don't remember the details.
I used:
wget to download the pages
html2ps to convert the individual pages to PostScript
ps2ps to splice the individual PostScript files together
ps2ps again, to put 4 pages on 1
Then I sent the PostScript file to the printer. Since you want PDF you could add an additional step of ps2pdf.
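A rough sketch of that pipeline today (tool names and options here are my assumptions, and Ghostscript's pdfwrite is used to splice the PostScript files and produce the PDF in one go, instead of the separate ps2ps and ps2pdf steps above):
wget --recursive --level=1 --convert-links --no-parent http://example.com/   # 1. fetch the pages
find example.com -name "*.html" | while read -r page; do
  html2ps -o "$page.ps" "$page"                                              # 2. each HTML page to PostScript
done
gs -o site.pdf -sDEVICE=pdfwrite $(find example.com -name "*.ps" | sort)     # 3. combine into a single PDF
# (assumes filenames without spaces; add an n-up step, e.g. psnup, if you want 4 pages per sheet)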
Yes, you could write an AppleScript to automate Safari to do the printing (and the fetching, if you'd like, though you can drive wget from AppleScript using do shell script). If you don't fancy using AppleScript, you could use Automator instead; the result would be pretty much the same.
BTW, if you aren't using Mac OS X: that wasn't clear from the question you asked :-)