Output PDF with Arabic as text

I had this task to convert PDFs containing Persian (Farsi) to text. I naturally turned to pdfminer to achieve this; however, it didn't perform well and the extracted Farsi was different from the text in the PDF. It looked correct to me (because I don't know the alphabet), but someone who can read it said there were extra letters.
I suspect this might be a problem with all right-to-left written text.
To save others the time, here is what I did (I answer this myself below; other answers are welcome).
Let me phrase this as a question so as to fit the SO guidelines:
PDFMiner isn't correctly converting Persian (Farsi) PDF documents to text. What are the other options?
Examples are found under https://www.humanservices.gov.au/individuals/information-in-your-language. Specifically I was looking at:
https://www.humanservices.gov.au/sites/default/files/documents/4863-1506ar.pdf
https://www.humanservices.gov.au/sites/default/files/2017/01/9284-1607ar.pdf

I installed Poppler (https://en.wikipedia.org/wiki/Poppler_(software)) and used its pdftotext tool:
mac$ brew install poppler
mac$ pdftotext file.pdf file.txt
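If you need to drive this from Python rather than the shell, a thin wrapper around the same pdftotext binary is enough. This is only a minimal sketch, assuming Poppler's pdftotext is on your PATH; the file names are placeholders:

import subprocess
from pathlib import Path

def pdf_to_text(pdf_path, txt_path=None):
    """Run Poppler's pdftotext and return the extracted text."""
    txt_path = txt_path or str(Path(pdf_path).with_suffix(".txt"))
    # -enc UTF-8 keeps right-to-left scripts such as Farsi intact in the output file
    subprocess.run(["pdftotext", "-enc", "UTF-8", pdf_path, txt_path], check=True)
    return Path(txt_path).read_text(encoding="utf-8")

if __name__ == "__main__":
    print(pdf_to_text("file.pdf")[:500])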

Related

How does ghostscript convert PDF to .txt?

GNU Ghostscript is able to convert pdf files to .txt (text files) in terminal.
gs -sDEVICE=txtwrite -o output.txt input.pdf
I was wondering how it accomplishes this task. Does it use OCR?
I'm not looking for a very hefty explanation, but just a push in the right direction (links to guides etc. would also do it).
Thank you!
No, it doesn't do OCR, and that's why it has limitations. It uses several techniques, applied in a hierarchical fashion:
If the font has a ToUnicode CMap, use it to get the Unicode code points.
If not, check the glyph names (if available) against a standard list.
Otherwise, assume the character codes are ASCII.
Since Ghostscript and the associated txtwrite device are open source, you can easily just read the source code for more information.
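To make the decision order described above concrete, here is a purely hypothetical Python sketch of that fallback logic; it is not Ghostscript's actual code, and the helper names and dictionaries are made up for illustration:

def code_to_unicode(char_code, to_unicode_cmap=None, glyph_name=None, glyph_list=None):
    """Map a font character code to Unicode, trying the richest source first."""
    # 1. A ToUnicode CMap embedded in the font gives an explicit mapping.
    if to_unicode_cmap and char_code in to_unicode_cmap:
        return to_unicode_cmap[char_code]
    # 2. Otherwise look the glyph name up in a standard list (e.g. the Adobe Glyph List).
    if glyph_name and glyph_list and glyph_name in glyph_list:
        return glyph_list[glyph_name]
    # 3. Last resort: assume the character code is already ASCII.
    return chr(char_code)

# Example: a subset font with no ToUnicode CMap but with glyph names
print(code_to_unicode(1, glyph_name="alef", glyph_list={"alef": "\u05D0"}))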

ghostPCL: why is this file not converted properly to PDF?

I am using ghostpcl-9.18-win64. This is the script that I used to generate the pdf file:
gpcl6win64-9.18.exe -sDEVICE=pdfwrite -sOutputFile=%1.pdf -dNOPAUSE %1.txt
The file to test can be found here and the result of running ghostpcl can be found here.
If you take a look at the PDF file, it contains only one page (there should be 2) and some of the text is missing. Why is that? I always pictured in my mind that GhostPCL would produce a PDF identical to a printout. Am I missing something, parameters perhaps?
As a matter of fact, when I used the lpr command to print the file on RHEL, it printed exactly what I expected. I wonder how reliable the GhostPCL tool is at converting PCL files to PDF. And if it's not that reliable, a broader question is: is there another tool to do it? I am mainly interested in the Linux version.
The txt file is based on a file generated using SQR.
Thanks
In fact the OP did raise a bug report (but didn't mention it here):
http://bugs.ghostscript.com/show_bug.cgi?id=696509
The opinion of our PCL maintainer is that the output is correct, inasmuch as it matches at least one HP printer. See the URL above for slightly more details.
Based on the discussion on the bug thread, the input file is invalid because it should have had CRLFs instead of LFs only.
If I convert the LFs to CRLFs, my input file is converted to PDF as expected. However, converting LFs to CRLFs is not a general solution. According to support, LF bytes can also be used inside image data; in that case, converting such an LF to CRLF could break the image.
It seems there is one thing I was wrong about on the bug thread: on our system, lpr does include carriage returns in the final file that gets sent to the printer. I followed the instructions at https://wiki.ubuntu.com/DebuggingPrintingProblems, in the 'Getting the data which would go to the printer' section, to print to a file, and the resulting file includes carriage returns.
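For plain-text jobs like this one, the LF-to-CRLF fix can be scripted. This is only a rough sketch, safe only when the file contains no binary raster data (for the reason given above); the file names are placeholders:

# Convert bare LFs to CRLFs in a plain-text PCL job before feeding it to GhostPCL.
with open("input.txt", "rb") as f:
    data = f.read()

# Normalise first so any existing CRLFs are not doubled.
data = data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")

with open("input-crlf.txt", "wb") as f:
    f.write(data)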

Ghostscript skips characters when merging PDFs

I have a problem when using Ghostscript (version 8.71) on Ubuntu to merge PDF files created with wkhtmltopdf.
The problem I experience on random occasions is that some characters get lost in the merge process and are replaced by nothing (or a space) in the merged PDF. If I look at the original PDF it looks fine, but after the merge some characters are missing.
Note that one missing character, such as the number 9 or the letter a, can be lost in one place in the document but show up fine somewhere else, so it is not a display problem or a font issue as such.
The command I am using is:
gs \
-q \
-dNOPAUSE \
-sDEVICE=pdfwrite \
-sOutputFile=/tmp/outputfilename \
-dBATCH \
/var/www/documents/docs/input1.pdf \
/var/www/documents/docs/input2.pdf \
/var/www/documents/docs/input3.pdf
Has anyone else experienced this, or even better, does anyone know a solution for it?
I've seen this happening when the names of embedded font subsets are identical, but the actual contents of those subsets are different (they contain different glyph sets).
Check all your input files for the fonts used. Use Poppler's pdffonts utility for this:
for i in input*.pdf; do
pdffonts ${i} | tee ${i}.pdffonts.txt
done
Look for the font names used in each PDF.
My theory/bet is that you'll see identical font names (names similar to BAAAAA+ArialMT) used by different input files.
The BAAAAA+ font name prefix used for subset fonts is supposed to be random (though the official specification is not very clear about this). Some applications use predictable prefixes, however, starting with BAAAAA+, CAAAAA+, DAAAAA+ etc. (OpenOffice.org and LibreOffice are notorious for this). This means that the prefix BAAAAA+ gets used in every single file where at least one subset font is used...
It can easily happen that your input files do not use the exact same subset of characters. However, the identical names used could make Ghostscript think that the font really is the same. It (falsely) 'optimizes' the merged PDF and embeds only one of the two font instances (both having the same name, for example BAAAAA+Arial). However, this instance may not include some glyphs which were part of the other instance(s).
This leads to some characters missing in merged output.
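A quick way to check whether this collision is what you are seeing is to parse the pdffonts output and flag subset font names that occur in more than one input file. This is only a sketch assuming Poppler's pdffonts is installed; the file names are placeholders:

import subprocess
from collections import defaultdict

files = ["input1.pdf", "input2.pdf", "input3.pdf"]
seen = defaultdict(set)  # font name -> set of files using it

for pdf in files:
    out = subprocess.run(["pdffonts", pdf], capture_output=True, text=True, check=True).stdout
    for line in out.splitlines()[2:]:          # skip the two header lines
        if line.strip():
            name = line.split()[0]             # first column is the font name
            if "+" in name:                    # subset fonts look like BAAAAA+ArialMT
                seen[name].add(pdf)

for name, owners in seen.items():
    if len(owners) > 1:
        print(f"possible collision: {name} used by {sorted(owners)}")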
I know that more recent versions of Ghostscript have seen a heavy overhaul of their font handling code. Maybe you'll be more lucky with trying Ghostscript v9.06 (the most recent release to date).
I'm very much interested in investigating this in even bigger detail. If you can provide a sample of your input files (as well as the merged output given by GS v8.70), I can test if it works better with v9.06.
What you could do to avoid this problem
Try to always embed fonts as full sets, not subsets:
I don't know if and how you can control to have full font embedding when using wkhtmltopdf.
If you generate your input PDFs from Libre/OpenOffice, you're out of luck and you'll have no control over it.
If you use Acrobat to generate your input PDFs, you can tweak font embedding details in the Distiller settings.
If Ghostscript generates your input PDFs the commandline parameters to enforce full font embeddings are:
gs -o output.pdf -sDEVICE=pdfwrite -dSubsetFonts=false input.file
Some types of fonts cannot be embedded fully, only subsetted (TrueType, Type3, CIDFontType0, CIDFontType1, CIDFontType2). See this answer to the question "Why doesn't Acrobat Distiller embed all fonts fully?" for more details.
Do the following only if you are sure that no-one else gets to see or print or use your individual input files: Do not embed the fonts at all -- only embed when merging with Ghostscript the final result PDF from your inputs.
I don't know if and how you can control to have no font embedding when using wkhtmltopdf.
If you generate your input PDFs from Libre/OpenOffice, you're out of luck and you'll have no control over it.
If you use Acrobat to generate your input PDFs, you can tweak font embedding details in the Distiller settings.
If Ghostscript generates your input PDFs the commandline parameters to prevent font embedding are:
gs -o output.pdf -sDEVICE=pdfwrite -dEmbedAllFonts=false -c "<</AlwaysEmbed [ ]>>setpagedevice" input.file
Some types of fonts cannot be embedded fully, only subsetted (Type3, CIDFontType1). See this answer to the question "Why doesn't Acrobat Distiller embed all fonts fully?" for more details.
Do not use Ghostscript, but rather use pdftk for merging PDFs. pdftk is a more 'dumb' utility than Ghostscript (at least older versions of pdftk are) when it comes to merging PDFs, and this dumbness can be an advantage...
Update
To answer once more, but this time more explicitly (following the extra question from @sacohe in the comments below). In many (but not all) cases the following procedure will work:
Re-'distill' the input PDF files with the help of Ghostscript (preferably the most recent version from the 9.0x series).
The command to use is this (or similar):
gs -o redistilled-out.pdf -sDEVICE=pdfwrite input.pdf
The resulting output PDFs should then use different (unique) prefixes for the font names, even when the input PDFs used the same name prefix for different font subsets.
This procedure worked for me when I processed a sample of original input files provided to me by 'Mr R', the author of the original question. After that fix, the "skipped character problem" was gone in the final result (a merged PDF created from the fixed input files).
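Scripted, the whole procedure looks roughly like this. It is only a sketch: the file names are placeholders and a 9.0x-series gs binary is assumed to be on the PATH:

import subprocess

inputs = ["input1.pdf", "input2.pdf", "input3.pdf"]
fixed = []

# Step 1: re-distill each input so pdfwrite assigns fresh subset-font prefixes.
for i, pdf in enumerate(inputs):
    out = f"redistilled-{i}.pdf"
    subprocess.run(["gs", "-o", out, "-sDEVICE=pdfwrite", pdf], check=True)
    fixed.append(out)

# Step 2: merge the re-distilled files.
subprocess.run(["gs", "-q", "-dNOPAUSE", "-dBATCH", "-sDEVICE=pdfwrite",
                "-sOutputFile=merged.pdf", *fixed], check=True)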
I wanted to give some feedback: unfortunately the re-processing trick doesn't seem to work with Ghostscript 8.70 (as shipped in RedHat/CentOS releases) and files exported as PDF from Word 2010 (which seems to use the ABCDEE+ prefix for everything), and I haven't been able to find any pre-built versions of Ghostscript 9 for my platform.
You mention that older versions of pdftk might work. We moved away from pdftk (newer versions) to gs because some PDF files would cause pdftk to core dump. @Kurt, do you think that trying to find an older version of pdftk might help? If so, what version do you recommend?
Another ugly method that halfway works is to use:
-sDEVICE=pdfwrite -dCompatibilityLevel=1.2 -dHaveTrueType=false
which converts the fonts to bitmaps, but the characters on the page then come out a bit light (not a big deal), selecting text is off by about one line height (mildly annoying), and worst of all, even though the characters display OK, copy/paste gives random garbage in the text.
(I was hoping this would be a comment, but I guess I can't do that; is the answer closed?)
From what I can tell, this issue is fixed in Ghostscript version 9.21. We were having a similar issue where merged PDFs were missing characters, and while @Kurt Pfeifle's suggestion of re-distilling those PDFs did work, it seemed a little infeasible/silly to us. Some of our merged PDFs consisted of 600 or more individual PDFs, and re-distilling every single one of those just to merge them seemed nuts.
Our production version of Ghostscript was 9.10, which was exhibiting this problem, but when I ran some tests with 9.21 the problem seemed to vanish. I have been unable to produce a document with missing or mangled characters using GS 9.21, so I think that's the real solution here.

How to extract text from a PDF? [closed]

Can anyone recommend a library/API for extracting the text and images from a PDF?
We need to be able to get at text that is contained in pre-known regions of the document, so the API will need to give us positional information of each element on the page.
We would like that data to be output in XML or JSON format. We're currently looking at PdfTextStream, which seems pretty good, but would like to hear other people's experiences and suggestions.
Are there alternatives (commercial or free) for extracting text from a PDF programmatically?
I was given a 400-page PDF file with a table of data that I had to import - luckily no images. Ghostscript worked for me:
gswin64c -sDEVICE=txtwrite -o output.txt input.pdf
The output file was split into pages with headers, etc., but it was then easy to write an app to strip out blank lines, etc, and suck in all 30,000 records. -dSIMPLE and -dCOMPLEX made no difference in this case.
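The post-processing step can be as small as a few lines. This is only a sketch of the kind of clean-up meant above; the per-page header pattern is an assumption and will differ per document:

# Strip blank lines and repeated page headers from Ghostscript's txtwrite output.
records = []
with open("output.txt", encoding="utf-8") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue                      # drop blank lines
        if line.startswith("Page "):      # hypothetical per-page header to skip
            continue
        records.append(line)

print(len(records), "records kept")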
An efficient command-line tool, open source, free of any fee, and available on both Linux and Windows: simply named pdftotext. This tool is part of the Xpdf library.
http://en.wikipedia.org/wiki/Pdftotext
As of today I know it: the best thing for text extraction from PDFs is TET, the Text Extraction Toolkit. TET is part of the PDFlib.com family of products.
PDFlib.com is Thomas Merz's company. In case you don't recognize his name: Thomas Merz is the author of the "PostScript and PDF Bible".
TET's first incarnation is a library. That one can probably do everything Budda006 wanted, including positional information about every element on the page. Oh, and it can also extract images. It recombines images which are fragmented into pieces.
pdflib.com also offers another incarnation of this technology, the TET plugin for Acrobat. And the third incarnation is the PDFlib TET iFilter. This is a standalone tool for user desktops. Both these are free (as in beer) to use for private, non-commercial purposes.
And it's really powerful. Way better than Adobe's own text extraction. It extracted text for me where other tools (including Adobe's) spit out only garbage.
I just tested the desktop standalone tool, and what they say on their webpage is true. It has a very good command line. The tool handled some of my "problematic" PDF test files to my full satisfaction.
From now on, this will be my recommendation for every sophisticated and challenging PDF text extraction requirement.
TET is simply awesome. It detects tables. Inside tables, it identifies cells spanning multiple columns. It identifies table rows and contents of each table cell separately. It deals very well with hyphenations: it removes hyphens and restores complete words. It supports non-ASCII languages (including CJK, Arabic and Hebrew). When encountering ligatures, it restores the original characters...
Give it a try.
For python, there is PDFMiner and pyPDF2. For more information on these, see Python module for converting PDF to text.
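A minimal sketch of both, assuming the pdfminer.six package and a PyPDF2 3.x-style API (the file name is a placeholder):

# pdfminer.six: whole-document text extraction
from pdfminer.high_level import extract_text
text = extract_text("input.pdf")
print(text[:200])

# PyPDF2 (3.x API): page-by-page extraction
from PyPDF2 import PdfReader
reader = PdfReader("input.pdf")
for page in reader.pages:
    print(page.extract_text())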
Here is my suggestion.
If you want to extract text from PDF, you could import the pdf file into Google Docs, then export it to a more friendly format such as .html, .odf, .rtf, .txt, etc. All of this using the Drive API. It is free* and robust. Take a look at:
https://developers.google.com/drive/v2/reference/files/insert https://developers.google.com/drive/v2/reference/files/get
Because it is a REST API, it is compatible with all programming languages. The links I posted above have working examples for many languages, including Java, .NET, Python, PHP, Ruby, and others.
I hope it helps.
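A rough sketch of that round trip with the Python client follows. Note this uses the v3 API rather than the v2 endpoints linked above, and it assumes you have already obtained OAuth credentials (passed in as creds):

from googleapiclient.discovery import build
from googleapiclient.http import MediaFileUpload

def pdf_to_text_via_drive(creds, pdf_path="input.pdf"):
    """Upload a PDF to Drive, convert it to a Google Doc, and export it as plain text."""
    service = build("drive", "v3", credentials=creds)
    # Asking for the Google Doc MIME type makes Drive convert the upload.
    meta = {"name": pdf_path, "mimeType": "application/vnd.google-apps.document"}
    media = MediaFileUpload(pdf_path, mimetype="application/pdf")
    doc = service.files().create(body=meta, media_body=media, fields="id").execute()
    # Export the converted document as plain text (returned as bytes).
    data = service.files().export_media(fileId=doc["id"], mimeType="text/plain").execute()
    return data.decode("utf-8")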
PdfTextStream (which you said you have been looking at) is now free for single-threaded applications. In my opinion, its quality is much better than that of other libraries (especially for things like funky embedded fonts).
It is available in Java and C#.
Alternatively, you should have a look at Apache PDFBox, open source.
One of the comments here used gs on Windows. I had some success with that on Linux/OSX too, with the following syntax:
gs \
-q \
-dNODISPLAY \
-dSAFER \
-dDELAYBIND \
-dWRITESYSTEMDICT \
-dSIMPLE \
-f ps2ascii.ps \
"${input}" \
-dQUIET \
-c quit
I used -dSIMPLE instead of -dCOMPLEX because the latter outputs one character per line.
Docotic.Pdf library may be used to extract text from PDF files as plain text or as a collection of text chunks with coordinates for each chunk.
Docotic.Pdf can be used to extract images from PDFs, too.
Disclaimer: I work for Bit Miracle.
As the question specifically asks about alternative tools to get data from PDF as XML, you may be interested in taking a look at the commercial tool "ByteScout PDF Extractor SDK", which is capable of doing exactly this: extracting text from PDF as XML along with the positioning data (x, y) and font information:
Text in the source PDF:
Products | Units | Price
Output XML:
<row>
<column>
<text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="212" y="126" width="47" height="11">Products</text>
</column>
<column>
<text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="428" y="126" width="27" height="11">Units</text>
</column>
<column>
<text fontName="Arial" fontSize="11.0" fontStyle="Bold" x="503" y="126" width="26" height="11">Price</text>
</column>
</row>
P.S.: it also breaks the text into a table-based structure.
Disclosure: I work for ByteScout
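Whatever tool produces XML in this shape, pulling the text and its coordinates back out is straightforward. A sketch using only the Python standard library, based on the sample output above (the output file name is a placeholder and the file is assumed to have a single root element):

import xml.etree.ElementTree as ET

root = ET.parse("output.xml").getroot()
for el in root.iter("text"):
    print(el.text, "at", (el.get("x"), el.get("y")),
          "font:", el.get("fontName"), el.get("fontSize"), el.get("fontStyle"))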
The best thing I can currently think of (within the list of "simple" tools) is Ghostscript (current version is v.8.71) and the PostScript utility program ps2ascii.ps. Ghostscript ships it in its lib subdirectory. Try this (on Windows):
gswin32c.exe ^
-q ^
-sFONTPATH=c:/windows/fonts ^
-dNODISPLAY ^
-dSAFER ^
-dDELAYBIND ^
-dWRITESYSTEMDICT ^
-dCOMPLEX ^
-f ps2ascii.ps ^
-dFirstPage=3 ^
-dLastPage=7 ^
input.pdf ^
-dQUIET ^
-c quit
This command processes pages 3-7 of input.pdf. Read the comments in the ps2ascii.ps file itself to see what the "weird" numbers and additional infos mean (they indicate strings, positions, widths, colors, pictures, rectangles, fonts and page breaks...). To get a "simple" text output, replace the -dCOMPLEX part by -dSIMPLE.
I know this topic is quite old, but the need is still alive. I read many documents, forums and scripts, and built a new advanced one that supports compressed and uncompressed PDFs:
https://gist.github.com/smalot/6183152
In some cases, using the command line is forbidden for security reasons, so a native PHP class can fit many needs.
Hope it helps everyone.
For image extraction, pdfimages is a free command line tool for Linux or Windows (win32):
pdfimages: Extract and Save Images From A Portable Document Format (PDF) File
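Typical usage is a single command; driven from Python it might look like this sketch (the -j flag writes JPEG images as JPEG files; the file names are placeholders):

import subprocess

# Extract all images from input.pdf as files named img-000.jpg, img-000.ppm, etc.
subprocess.run(["pdfimages", "-j", "input.pdf", "img"], check=True)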
Apache pdfbox has this feature - the text part is described in:
http://pdfbox.apache.org/apidocs/org/apache/pdfbox/util/PDFTextStripper.html
For an example implementation, see
https://github.com/WolfgangFahl/pdfindexer
The testcase TestPdfIndexer.testExtracting shows how it works.
QuickPDF seems to be a reasonable library that should do what you want for a reasonable price.
http://www.quickpdflibrary.com/ - They have a 30 day trial.
On my Macintosh systems, I find that "Adobe Reader" does a reasonably good job. I created an alias on my Desktop that points to the "Adobe Reader.app", and all I do is drop a pdf-file on the alias, which makes it the active document in Adobe Reader, and then from the File-menu, I choose "Save as Text...", give it a name and where to save it, click "Save", and I'm done.

What are the relative merits of pdflatex?

Not sure this is a programming question, but we use LaTeX for all our API documentation and user documentation, so I hope it will go through.
Can someone please explain the relative merits of using pdflatex as opposed to the "classic" technique of
latex foo
dvips -Ppdf foo
ps2pdf foo.ps
From time to time I run into people who have difficulty because things don't work in pdflatex, and I know that using pdflatex gives up two things I have grown to value:
Can't use the very speedy xdvi viewer
Can't use the PStricks package
I should add that I typically get PDF with hyperlinks by using something on the order of
\usepackage[ps2pdf,colorlinks=true]{hyperref}
so it's not necessary to use pdflatex to get good PDF.
So
What are the advantages of pdflatex that I don't know about?
What are the disadvantages of the old tools that I've overlooked?
My favorite pdflatex feature is the microtype package, which is available only when using pdflatex to go directly to PDF, and really produces stunning results with no effort on my part. Apart from that, the only caveats I run into are image formats:
pdflatex supports PDF, PNG, and JPG images.
the postscript drivers support (at least) EPS.
Also, if you want to install fonts, the procedures are slightly different depending on what fonts that driver supports. (Hint: use XeTeX to instantly enable OpenType fonts.)
As it turns out, I recently read a post that shows the difference directly. Any document that uses tables or narrow columns will be improved automatically. I also find the inter-word spacing to be far more pleasing with pdflatex.
Is xdvi much faster than xpdf? I find the edit, TeX, view cycle to be very quick with pdflatex.
Have you tried MetaPost or MetaFun for graphics? I tend to put graphics creation in the hands of the capable, but MetaFun would likely be the package I'd use. Just reading the manuals is a pleasure.
Also, pdftex is the engine under active development (towards luatex) and maintenance. I'm not sure the DVI counterparts are as actively maintained.
PStricks is supplanted by Tikz.
I haven't used xdvi in years, so pardon the trollish rhetorical questions: Does xdvi display vector fonts? Does it support SyncTeX (jumping to and from the source)? Is it as comfortable to use as PDF readers like Skim?
Taco Hoekwater is working on Escrito, a Postscript interpreter written in Lua, which would allow you to use pstricks in Luatex. He has an impressive project completion record: maybe I should have used "will" rather than "would" in the previous sentence.
I used pdflatex to generate the PDF for my ICFP 2009 paper. (I still needed to use standard latex to generate the PostScript file.) I did so for two reasons:
I couldn't seem to get ps2pdf to generate Letter, rather than A4 output, no matter what command line options I used.
For the printers, I needed to produce a version 1.3 PDF file, not 1.4. pdflatex made this easy to do. I set the PDF author and title information while I was at it.
Both of these problems may be fixable in some way, but as a first-time latex user, I didn't find any obvious solutions, nor did more experienced users whom I'd asked.