How does Ghostscript convert PDF to .txt?

GNU Ghostscript is able to convert PDF files to .txt (text files) from the terminal:
gs -sDEVICE=txtwrite -o output.txt input.pdf
I was wondering how it accomplishes this task? Does it use OCR?
I'm not looking for a very hefty explanation, but just a push in the right direction (links to guides etc. would also do it).
Thank you!

No, it doesn't do OCR, and that's why it has limitations. It has several techniques and applies them in a hierarchical fashion:
If the font has a ToUnicode CMap, use that to get the Unicode code points.
If not, then check the glyph names (if available) against a standard list.
Otherwise, assume the character codes are ASCII.
Since Ghostscript and the associated txtwrite device are open source, you can easily just read the source code for more information.
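To make the order of those three steps concrete, here is a minimal Python sketch. It is not how txtwrite is actually implemented: the function name, the tiny GLYPH_NAMES table and the per-code fall-through are all assumptions made for illustration.

```python
from typing import Mapping, Optional

# Hypothetical, tiny stand-in for a standard glyph-name list (illustration only).
GLYPH_NAMES = {"A": "A", "space": " ", "eacute": "\u00e9"}


def code_to_text(code: int,
                 to_unicode: Optional[Mapping[int, str]] = None,
                 glyph_names: Optional[Mapping[int, str]] = None) -> str:
    """Map one character code to text using the three-step fallback."""
    # 1. Prefer the font's ToUnicode CMap when it covers this code.
    if to_unicode and code in to_unicode:
        return to_unicode[code]
    # 2. Otherwise look the code's glyph name up in the standard list.
    if glyph_names and glyph_names.get(code) in GLYPH_NAMES:
        return GLYPH_NAMES[glyph_names[code]]
    # 3. Last resort: treat the character code as if it were ASCII.
    return chr(code)
```

The last step is where garbage comes from: a font with a custom encoding and no ToUnicode CMap or usable glyph names still gets "decoded", just wrongly.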

Related

Output PDF with Arabic as text

I had this task to convert PDFs containing Persian (Farsi) to Text. I naturally turned to pdfminer to achieve this, however it didn't perform well and the extracted Farsi was different to that in the PDF. It looked correct (because I don't know the alphabet) but someone who can read it said that there are extra letters.
I suspect this might be a problem with all right-to-left written text.
To save others the time, here is what I did (I answer this myself below; other answers are welcome).
Let me phrase this as a question so as to fit the SO guidelines:
PDFMiner isn't working for converting Persian (Farsi) PDF documents to text. What are the other options?
Examples are found under https://www.humanservices.gov.au/individuals/information-in-your-language. Specifically I was looking at:
https://www.humanservices.gov.au/sites/default/files/documents/4863-1506ar.pdf
https://www.humanservices.gov.au/sites/default/files/2017/01/9284-1607ar.pdf
I installed Poppler (https://en.wikipedia.org/wiki/Poppler_(software)) and used its pdftotext tool:
mac$ brew install poppler
mac$ pdftotext file.pdf file.txt

Why does the combination pdf2ps / ps2pdf shrink the PDF?

When researching how to compress a bunch of PDFs with pictures inside (ideally in a lossless fashion, but I'll settle for lossy) I found that a lot of people recommend doing this:
$ pdf2ps file.pdf
$ ps2pdf file.ps
This works! The resulting file is smaller and looks at least good enough.
How / why does this work?
Which settings can I tweak in this process?
If there is some lossy conversion, which one is that?
Where is the catch?
People who recommend this procedure rarely do so from a background of expertise or knowledge -- it's rather based on gut feelings.
The detour of generating a new PDF via PostScript and back (also called "refrying a PDF") is never going to give you the optimal results. Sometimes it is useful, e.g. in cases where the original PDF won't print at all, or cannot be processed by another application. But these cases are very rare.
In any case, this "roundtrip" conversion will never lead to the same PDF file as initially.
Also, the pdf2ps and ps2pdf tools aren't independent tools at all: they are just simple wrapper scripts around a Ghostscript (gs or gswin32c.exe) command line. You can check that yourself by doing:
cat $(which ps2pdf)
cat $(which pdf2ps)
This will also reveal the (default) parameters these simple wrappers use for the respective conversions.
If you are unlucky, you will have an ancient Ghostscript installed. The PostScript which is then generated by pdf2ps will be Level 1 PS, and this will be "lossy" for many fonts which could be used by more modern PDF files, resulting in rasterization of previous vector fonts. Not exactly the output you'd like to look at...
Since both tools are using Ghostscript anyway (but behind your back), you are better off running Ghostscript yourself. This gives you more control over the parameters it uses. Especially advantageous is the fact that this way you can get a direct PDF->PDF conversion, without any detour via an intermediate PostScript file.
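As a rough illustration of such a direct PDF->PDF run, here is a small Python sketch that builds the command line. The function names are made up, and the -dPDFSETTINGS=/ebook preset is my own assumption (it is a standard pdfwrite option that downsamples images, so it is lossy), not something the wrapper scripts necessarily use.

```python
import subprocess
from typing import List


def gs_pdf_to_pdf_cmd(src: str, dst: str, preset: str = "/ebook") -> List[str]:
    """Build a direct PDF->PDF Ghostscript command line (no PostScript detour)."""
    return [
        "gs",                        # gswin32c.exe / gswin64c.exe on Windows
        "-o", dst,                   # output file; -o also implies -dBATCH -dNOPAUSE
        "-sDEVICE=pdfwrite",         # the high-level PDF output device
        f"-dPDFSETTINGS={preset}",   # assumption: preset that downsamples images (lossy)
        src,
    ]


def shrink(src: str, dst: str) -> None:
    """Run the command; needs Ghostscript on PATH and raises if gs fails."""
    subprocess.run(gs_pdf_to_pdf_cmd(src, dst), check=True)
```

Calling shrink("file.pdf", "file-small.pdf") would then do in one step what the pdf2ps / ps2pdf pair does in two.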
Here are a few answers which would give you some hints about what parameters you could use in order to drive the file size down in a semi-controlled way in your output PDF:
Optimize PDF files (with Ghostscript or other) (StackOverflow)
Remove / Delete all images from a PDF using Ghostscript or ImageMagick (StackOverflow)

Use ghostscript to delete a page (not extracting a range)

I know Ghostscript can use -dFirstPage and -dLastPage to make a file from only a range of pages, but I need to make it (or another command line program) delete the 2nd page of any PDF without being told the total number of pages. I thought this would be far easier, because most printers let you specify "1,3-end", and I have been using PDFCreator to do it that way.
The one way I can think of doing it (very very messy) is to extract page 1, extract pages 3 to end, and then merge the two pdfs. But I also don't know how to have GS determine the number of pages.
Use the right tool for the job!
For reasons outlined by KenS, Ghostscript is not the best tool for what you want to achieve. A better tool for this task is pdftk. To remove the 2nd page from input.pdf, you should run this command line:
pdftk input.pdf cat 1 3-end output output.pdf
OK first things first, if you use Ghostscript's pdfwrite device you are NOT extracting, or deleting, or performing any other 'manipulation' operation on your source PDF file. I keep on reiterating this, but I'm going to say it again.
When you pass an input file through Ghostscript, it is completely interpreted into a series of graphical primitives which are passed to the device. In general, the device renders the primitives to a bitmap. In the case of the 'high level' devices such as pdfwrite, the primitives are re-assembled into a brand new file; in the case of pdfwrite, a PDF file.
This flexibility allows for input in a number of different page description languages (PostScript, PDF, PCL, PCL-XL, XPS) and then output in a few different high level formats (PostScript, EPS, flavours of PDF, XPS, PCL, PCL-XL).
But the new file bears no relation to the original, other than its appearance.
Now, having got that out of the way... You can use the pdf_info.ps PostScript program, supplied in the 'toolbin' directory of the Ghostscript installation, to get a variety of information about PDF files; one of the things you can get is the number of pages in the PDF. But you don't need to bother: run the file once with -dLastPage=1, then run it again with -dFirstPage=3 (don't set LastPage), then run both resulting files through Ghostscript together to create a single file with the pages from each combined.
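The split-and-rejoin approach can be sketched in Python as three Ghostscript command lines. This is only a sketch: the function name and the temporary file names are hypothetical, and it assumes the page to drop is not the last page of the document.

```python
from typing import List


def drop_page_cmds(src: str, dst: str, page: int, gs: str = "gs") -> List[List[str]]:
    """Return gs command lines that rebuild src without the given page."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    cmds, parts = [], []
    if page > 1:
        # Pages 1 .. page-1 go into a temporary "head" file.
        head = f"{dst}.head.pdf"
        cmds.append([gs, "-o", head, "-sDEVICE=pdfwrite",
                     f"-dLastPage={page - 1}", src])
        parts.append(head)
    # Pages page+1 .. end: LastPage is left unset, so no page count is needed.
    tail = f"{dst}.tail.pdf"
    cmds.append([gs, "-o", tail, "-sDEVICE=pdfwrite",
                 f"-dFirstPage={page + 1}", src])
    parts.append(tail)
    # Final run: feed the partial files in order to produce one output file.
    cmds.append([gs, "-o", dst, "-sDEVICE=pdfwrite", *parts])
    return cmds
```

Each list can be passed to subprocess.run(cmd, check=True). As explained above, the result is a newly generated PDF, not the original with a page removed.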

Ghostscript skips characters when merging PDFs

I have a problem when using Ghostscript (version 8.71) on Ubuntu to merge PDF files created with wkhtmltopdf.
The problem I experience on random occasions is that some characters get lost in the merge process and replaced by nothing (or space) in the merged PDF. If I look at the original PDF it looks fine but after merge some characters are missing.
Note that one missing character, such as number 9 or the letter a, can be lost in one place in the document but show up fine somewhere else in the document so it is not a problem displaying it or a font issue as such.
The command I am using is:
gs \
-q \
-dNOPAUSE \
-sDEVICE=pdfwrite \
-sOutputFile=/tmp/outputfilename \
-dBATCH \
/var/www/documents/docs/input1.pdf \
/var/www/documents/docs/input2.pdf \
/var/www/documents/docs/input3.pdf
Has anyone else experienced this, or even better, does anyone know a solution for it?
I've seen this happening if the names for embedded font subsets are identical, but the real content of these subsets is different (containing different glyph sets).
Check all your input files for the fonts used. Use Poppler's pdffonts utility for this:
for i in input*.pdf; do
pdffonts ${i} | tee ${i}.pdffonts.txt
done
Look for the font names used in each PDF.
My theory/bet is on you seeing identical font names used (names which are similar to BAAAAA+ArialMT) by different input files.
The font name prefix to be used for subset fonts (BAAAAA+ in this example) is supposed to be random (though the official specification is not very clear about this). Some applications use predictable prefixes, however, starting with BAAAAA+, CAAAAA+, DAAAAA+ etc. (OpenOffice.org and LibreOffice are notorious for this). This means that the prefix BAAAAA+ gets used in every single file where at least one subset font is used...
It can easily happen that your input files do not use the exact same subset of characters. However, the identical names used could make Ghostscript think that the font really is the same. It (falsely) 'optimizes' the merged PDF and embeds only one of the 2 font instances (both having the same name, for example BAAAAA+Arial). However, this instance may not include some glyphs which were part of the other instance(s).
This leads to some characters missing in merged output.
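To check a batch of input files for this situation, a small Python helper can flag subset font names that occur in more than one file. This is a sketch under assumptions: the function name is made up, and the font names per file are assumed to have been collected already (for example from the name column of pdffonts output).

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# A subset font name carries a tag of six uppercase letters followed by "+".
SUBSET_NAME = re.compile(r"^[A-Z]{6}\+.+")


def shared_subset_names(fonts_by_file: Dict[str, Iterable[str]]) -> Dict[str, List[str]]:
    """Return subset font names used by more than one file, with those files."""
    seen = defaultdict(set)
    for path, names in fonts_by_file.items():
        for name in names:
            # Only subset fonts matter here; fully embedded fonts are skipped.
            if SUBSET_NAME.match(name):
                seen[name].add(path)
    return {name: sorted(paths) for name, paths in seen.items() if len(paths) > 1}
```

A non-empty result does not prove that glyphs will go missing, only that two files share a subset name and are therefore candidates for the problem described above.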
I know that more recent versions of Ghostscript have seen a heavy overhaul of their font handling code. Maybe you'll have more luck trying Ghostscript v9.06 (the most recent release to date).
I'm very much interested in investigating this in even bigger detail. If you can provide a sample of your input files (as well as the merged output given by GS v8.70), I can test if it works better with v9.06.
What you could do to avoid this problem
Try to always embed fonts as full sets, not subsets:
I don't know if and how you can control to have full font embedding when using wkhtmltopdf.
If you generate your input PDFs from Libre/OpenOffice, you're out of luck and you'll have no control over it.
If you use Acrobat to generate your input PDFs, you can tweak font embedding details in the Distiller settings.
If Ghostscript generates your input PDFs the commandline parameters to enforce full font embeddings are:
gs -o output.pdf -sDEVICE=pdfwrite -dSubsetFonts=false input.file
Some types of fonts cannot be embedded fully, but only subsetted (TrueType, Type3, CIDFontType0, CIDFontType1, CIDFontType2). See this answer to the question "Why doesnt Acrobat Distiller embed all fonts fully?" for more details.
Do the following only if you are sure that no-one else gets to see or print or use your individual input files: Do not embed the fonts at all -- only embed when merging with Ghostscript the final result PDF from your inputs.
I don't know if and how you can control to have no font embedding when using wkhtmltopdf.
If you generate your input PDFs from Libre/OpenOffice, you're out of luck and you'll have no control over it.
If you use Acrobat to generate your input PDFs, you can tweak font embedding details in the Distiller settings.
If Ghostscript generates your input PDFs the commandline parameters to prevent font embedding are:
gs -o output.pdf -sDEVICE=pdfwrite -dEmbedAllFonts=false -c "<</AlwaysEmbed [ ]>>setpagedevice" input.file
Some types of fonts cannot be embedded fully, but only subsetted (Type3, CIDFontType1). See this answer to the question "Why doesnt Acrobat Distiller embed all fonts fully?" for more details.
Do not use Ghostscript, but rather use pdftk for merging PDFs. pdftk is a more 'dumb' utility than Ghostscript (at least older versions of pdftk are) when it comes to merging PDFs, and this dumbness can be an advantage...
Update
To answer once more, but this time more explicitly (following the extra question of @sacohe in the comments below): in many (not all) cases the following procedure will work:
Re-'distill' the input PDF files with the help of Ghostscript (preferably the most recent version from the 9.0x series).
The command to use is this (or similar):
gs -o redistilled-out.pdf -sDEVICE=pdfwrite input.pdf
The resulting output PDF should then be using different (unique) prefixes to the font names, even when the input PDF used the same name prefix for different font (subsets).
This procedure worked for me when I processed a sample of original input files provided to me by 'Mr R', the author of the original question. After that fix, the "skipped character problem" was gone in the final result (a merged PDF created from the fixed input files).
I wanted to give some feedback that unfortunately the re-processing trick doesn't seem to work with Ghostscript 8.70 (in Red Hat/CentOS releases) and files exported as PDF from Word 2010 (which seems to use the ABCDEE+ prefix for everything). And I haven't been able to find any pre-built versions of Ghostscript 9 for my platform.
You mention that older versions of pdftk might work. We moved away from pdftk (newer versions) to gs, because some PDF files would cause pdftk to core dump. @Kurt, do you think that trying to find an older version of pdftk might help? If so, what version do you recommend?
Another ugly method that halfway works is to use:
-sDEVICE=pdfwrite -dCompatibilityLevel=1.2 -dHaveTrueType=false
which converts the fonts to bitmaps, but it then causes the characters on the page to be a bit light (not a big deal), text selection to be off by about one line height (mildly annoying), and, worst of all, even though the characters display OK, copy/paste gives random garbage in the text.
(I was hoping this would be a comment, but I guess I can't do that, is answer closed?)
From what I can tell, this issue is fixed in Ghostscript version 9.21. We were having a similar issue where merged PDFs were missing characters, and while @Kurt Pfeifle's suggestion of re-distilling those PDFs did work, it seemed a little infeasible/silly to us. Some of our merged PDFs consisted of up to 600 or more individual PDFs, and re-distilling every single one of those to merge them just seemed nuts.
Our production version of Ghostscript was 9.10 which was causing this problem. But when I did some tests on 9.21 the problem seemed to vanish. I have been unable to produce a document with missing or mangled characters using GS 9.21 so I think that's the real solution here.

How to extract text from a PDF? [closed]

Can anyone recommend a library/API for extracting the text and images from a PDF?
We need to be able to get at text that is contained in pre-known regions of the document, so the API will need to give us positional information of each element on the page.
We would like that data to be output in XML or JSON format. We're currently looking at PdfTextStream, which seems pretty good, but would like to hear other people's experiences and suggestions.
Are there alternatives (commercial ones or free) for extracting text from a PDF programmatically?
I was given a 400 page pdf file with a table of data that I had to import - luckily no images. Ghostscript worked for me:
gswin64c -sDEVICE=txtwrite -o output.txt input.pdf
The output file was split into pages with headers, etc., but it was then easy to write an app to strip out blank lines, etc, and suck in all 30,000 records. -dSIMPLE and -dCOMPLEX made no difference in this case.
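The clean-up step mentioned above could start out as simple as this Python sketch. The function name is made up, and recognising the page headers is left out because it depends entirely on the document.

```python
from typing import List


def non_blank_lines(text: str) -> List[str]:
    """Drop blank lines and trailing whitespace from extracted text."""
    # Leading spaces are kept on purpose: txtwrite uses them to mimic the
    # column layout of the page, which helps when splitting table rows later.
    return [line.rstrip() for line in text.splitlines() if line.strip()]
```

Typical use would be non_blank_lines(open("output.txt", encoding="utf-8").read()), followed by whatever record parsing the table needs. The encoding is an assumption; check what your Ghostscript build actually writes.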
An efficient command line tool, open source, free of charge, available on both Linux and Windows, is simply named pdftotext. This tool is part of Xpdf.
http://en.wikipedia.org/wiki/Pdftotext
As of today I know: the best thing for text extraction from PDFs is TET, the Text Extraction Toolkit. TET is part of the PDFlib.com family of products.
PDFlib.com is Thomas Merz's company. In case you don't recognize his name: Thomas Merz is the author of the "PostScript and PDF Bible".
TET's first incarnation is a library. That one can probably do everything Budda006 wanted, including positional information about every element on the page. Oh, and it can also extract images. It recombines images which are fragmented into pieces.
PDFlib.com also offers another incarnation of this technology, the TET plugin for Acrobat. And the third incarnation is the PDFlib TET iFilter. This is a standalone tool for user desktops. Both of these are free (as in beer) to use for private, non-commercial purposes.
And it's really powerful. Way better than Adobe's own text extraction. It extracted text for me where other tools (including Adobe's) spit out only garbage.
I just tested the desktop standalone tool, and what they say on their webpage is true. It has a very good command line. The tool handled some of my "problematic" PDF test files to my full satisfaction.
This thing will from now on be my recommendation for every sophisticated and challenging PDF text extraction requirement.
TET is simply awesome. It detects tables. Inside tables, it identifies cells spanning multiple columns. It identifies table rows and contents of each table cell separately. It deals very well with hyphenations: it removes hyphens and restores complete words. It supports non-ASCII languages (including CJK, Arabic and Hebrew). When encountering ligatures, it restores the original characters...
Give it a try.
For Python, there are PDFMiner and PyPDF2. For more information on these, see Python module for converting PDF to text.
Here is my suggestion.
If you want to extract text from PDF, you could import the pdf file into Google Docs, then export it to a more friendly format such as .html, .odf, .rtf, .txt, etc. All of this using the Drive API. It is free* and robust. Take a look at:
https://developers.google.com/drive/v2/reference/files/insert
https://developers.google.com/drive/v2/reference/files/get
Because it is a REST API, it can be used from any programming language that can make HTTP requests. The links I posted above have working examples for many languages, including Java, .NET, Python, PHP, Ruby, and others.
I hope it helps.
PdfTextStream (which you said you have been looking at) is now free for single threaded applications. In my opinion its quality is much better than other libraries (esp. for things like funky embedded fonts, etc).
It is available in Java and C#.
Alternatively, you should have a look at Apache PDFBox, open source.
One of the comments here used gs on Windows. I had some success with that on Linux/OSX too, with the following syntax:
gs \
-q \
-dNODISPLAY \
-dSAFER \
-dDELAYBIND \
-dWRITESYSTEMDICT \
-dSIMPLE \
-f ps2ascii.ps \
"${input}" \
-dQUIET \
-c quit
I used -dSIMPLE instead of -dCOMPLEX because the latter outputs 1 character per line.
Docotic.Pdf library may be used to extract text from PDF files as plain text or as a collection of text chunks with coordinates for each chunk.
Docotic.Pdf can be used to extract images from PDFs, too.
Disclaimer: I work for Bit Miracle.
As the question is specifically about alternative tools to get data from PDF as XML, you may be interested in the commercial tool "ByteScout PDF Extractor SDK", which is capable of doing exactly this: extracting text from PDF as XML along with the positioning data (x, y) and font information:
Text in the source PDF:
Products | Units | Price
Output XML:
<row>
<column>
<text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="212" y="126" width="47" height="11">Products</text>
</column>
<column>
<text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="428" y="126" width="27" height="11">Units</text>
</column>
<column>
<text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="503" y="126" width="26" height="11">Price</text>
</column>
</row>
P.S.: additionally it also breaks the text into a table based structure.
Disclosure: I work for ByteScout
The best thing I can currently think of (within the list of "simple" tools) is Ghostscript (current version is v.8.71) and the PostScript utility program ps2ascii.ps. Ghostscript ships it in its lib subdirectory. Try this (on Windows):
gswin32c.exe ^
-q ^
-sFONTPATH=c:/windows/fonts ^
-dNODISPLAY ^
-dSAFER ^
-dDELAYBIND ^
-dWRITESYSTEMDICT ^
-dCOMPLEX ^
-f ps2ascii.ps ^
-dFirstPage=3 ^
-dLastPage=7 ^
input.pdf ^
-dQUIET ^
-c quit
This command processes pages 3-7 of input.pdf. Read the comments in the ps2ascii.ps file itself to see what the "weird" numbers and additional infos mean (they indicate strings, positions, widths, colors, pictures, rectangles, fonts and page breaks...). To get a "simple" text output, replace the -dCOMPLEX part by -dSIMPLE.
I know that this topic is quite old, but the need is still alive. I read many documents, forums and scripts, and built a new, more advanced one which supports compressed and uncompressed PDFs:
https://gist.github.com/smalot/6183152
In some cases, using the command line is forbidden for security reasons.
So a native PHP class can fit many needs.
Hope it helps everyone.
For image extraction, pdfimages is a free command line tool for Linux or Windows (win32):
pdfimages: Extract and Save Images From A Portable Document Format ( PDF ) File
Apache PDFBox has this feature. The text part is described in:
http://pdfbox.apache.org/apidocs/org/apache/pdfbox/util/PDFTextStripper.html
For an example implementation, see
https://github.com/WolfgangFahl/pdfindexer
The test case TestPdfIndexer.testExtracting shows how it works.
QuickPDF seems to be a reasonable library that should do what you want for a reasonable price.
http://www.quickpdflibrary.com/ - They have a 30 day trial.
On my Macintosh systems, I find that "Adobe Reader" does a reasonably good job. I created an alias on my Desktop that points to the "Adobe Reader.app", and all I do is drop a pdf-file on the alias, which makes it the active document in Adobe Reader, and then from the File-menu, I choose "Save as Text...", give it a name and where to save it, click "Save", and I'm done.