PDF text extraction from given coordinates - pdf

I would like to extract text from a portion (using coordinates) of PDF using Ghostscript.
Can anyone help me out?

Yes, with Ghostscript, you can extract text from PDFs. But no, it is not the best tool for the job. And no, you cannot do it in "portions" (parts of single pages). What you can do: extract the text of a certain range of pages only.
First: Ghostscript's txtwrite output device (not so good)
gs \
-dBATCH \
-dNOPAUSE \
-sDEVICE=txtwrite \
-dFirstPage=3 \
-dLastPage=5 \
-sOutputFile=- \
/path/to/your/pdf
This will output all text contained on pages 3-5 to stdout. If you want output to a text file, use
-sOutputFile=textfilename.txt
Update:
Recent versions of Ghostscript have seen major improvements in the txtwrite device and bug fixes. See recent Ghostscript changelogs (search for txtwrite on that page) for details.
Second: Ghostscript's ps2ascii.ps PostScript utility (better)
This one requires you to download the latest version of the file ps2ascii.ps from the Ghostscript Git source code repository. You'd have to convert your PDF to PostScript, then run this command on the PS file:
gs \
-q \
-dNODISPLAY \
-P- \
-dSAFER \
-dDELAYBIND \
-dWRITESYSTEMDICT \
-dSIMPLE \
/path/to/ps2ascii.ps \
input.ps \
-c quit
If the -dSIMPLE parameter is not defined, each output line contains additional information about the fonts and font sizes used, beyond the pure text content.
If you replace that parameter by -dCOMPLEX, you'll get additional info about colors and images used.
Read the comments inside ps2ascii.ps to learn more about this utility. It's not comfortable to use, but for me it worked in most cases where I needed it...
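The PDF-to-PostScript conversion mentioned above can be done with the pdf2ps wrapper that ships with Ghostscript. A minimal sketch (file names and paths are placeholders):

```shell
# Step 1: convert the PDF to PostScript (pdf2ps ships with Ghostscript)
pdf2ps input.pdf input.ps

# Step 2: run ps2ascii.ps on the result, redirecting the text to a file
gs -q -dNODISPLAY -P- -dSAFER -dDELAYBIND -dWRITESYSTEMDICT -dSIMPLE \
   /path/to/ps2ascii.ps input.ps -c quit > output.txt
```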
Third: XPDF's pdftotext CLI utility (more comfortable than Ghostscript)
A more comfortable way to do text extraction: use pdftotext (available for Windows as well as Linux/Unix or Mac OS X). This utility is based either on Poppler or on XPDF. This is a command you could try:
pdftotext \
-f 13 \
-l 17 \
-layout \
-opw supersecret \
-upw secret \
-eol unix \
-nopgbrk \
/path/to/your/pdf \
- | less
This will display the page range from 13 (first page) to 17 (last page) of a double-password-protected PDF file (user password secret, owner password supersecret), preserve the layout, use the Unix EOL convention, skip inserting page breaks between PDF pages, and pipe the result through less...
pdftotext -h displays all available commandline options.
Of course, both tools only work on the text parts of PDFs (if they have any). Oh, and mathematical formulas also won't come out too well... ;-)
pdftotext Update:
Recent versions of Poppler's pdftotext now have options to extract "a portion (using coordinates) of PDF" pages, just as the OP asked for. The parameters are:
-x <int> : top left corner's x-coordinate of crop area
-y <int> : top left corner's y-coordinate of crop area
-W <int> : crop area's width in pixels (defaults to 0)
-H <int> : crop area's height in pixels (defaults to 0)
Works best if combined with the -layout parameter.
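A sketch combining the crop options with -layout (the page number and region values here are made up, purely for illustration):

```shell
# Extract a 200x50 region whose top-left corner sits 100 units from the
# left edge and 300 units from the top of page 2 (values illustrative):
pdftotext -f 2 -l 2 -layout \
  -x 100 -y 300 -W 200 -H 50 \
  /path/to/your.pdf -
```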
Fourth: MuPDF's mutool draw command can also extract text
The cross-platform, open source MuPDF application (made by the same company that also develops Ghostscript) comes bundled with a command line tool, mutool. To extract text from a PDF with this tool, use:
mutool draw -F txt the.pdf
This will emit the extracted text to <stdout>. Use -o filename.txt to write it into a file.
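mutool draw also accepts a page range after the file name, and a %d placeholder in the output name to get one text file per page. A sketch (file names are assumptions):

```shell
# Pages 1-5 into a single text file:
mutool draw -F txt -o out.txt the.pdf 1-5

# One text file per page (out1.txt, out2.txt, ...):
mutool draw -F txt -o out%d.txt the.pdf
```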
Fifth: PDFLib's Text Extraction Toolkit (TET) (best of all... but it is PayWare)
TET, the Text Extraction Toolkit from the PDFlib family of products, can find the x-y coordinates of text content in a PDF file (and much more). TET has a command line interface, and it's the most powerful of all text extraction tools I'm aware of. (It can even handle ligatures...) Quote from their website:
Geometry
TET provides precise metrics for the text, such as the position on the page, glyph widths, and text direction. Specific areas on the page can be excluded or included in the text extraction, e.g. to ignore headers and footers or margins.
In my experience, while it does not sport the most straightforward CLI interface you can imagine: once you get used to it, it will do what it promises to do, for most PDFs you throw at it...
And there are even more options:
podofotxtextract (CLI tool) from the PoDoFo project (Open Source)
calibre (normally a GUI program to handle eBooks, Open Source) has a commandline option that can extract text from PDFs
AbiWord (a GUI word processor, Open Source) can import PDFs and save its files as .txt: abiword --to=txt --to-name=output.txt input.pdf

I'm not sure Ghostscript can accept coordinates, but you can convert the PDF to an image and send it to an OCR engine, either as a sub-image cropped to the given coordinates, or as the whole image along with the coordinates. Some OCR APIs accept a rectangle parameter to narrow the region for OCR.
Look at VietOCR for a working example; it uses Tesseract as its OCR engine and Ghostscript as its PDF-to-image converter.
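The rasterize-then-OCR route could look like this on the command line, assuming Ghostscript, ImageMagick and Tesseract are installed (resolution, crop geometry and file names are illustrative):

```shell
# 1. Rasterize the PDF to 300-dpi PNGs, one per page:
gs -dBATCH -dNOPAUSE -sDEVICE=png16m -r300 \
   -sOutputFile=page-%03d.png input.pdf

# 2. Crop the region of interest from the first page
#    (width x height + x-offset + y-offset, in pixels):
convert page-001.png -crop 400x120+100+300 +repage region.png

# 3. OCR the cropped region (writes region.txt):
tesseract region.png region
```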

Debenu Quick PDF Library can extract text from a defined area on a page. The SetTextExtractionArea function lets you specify the x and y coordinates and then you can also specify the width and height of the area.
Left = The horizontal coordinate of the left edge of the area
Top = The vertical coordinate of the top edge of the area
Width = The width of the area
Height = The height of the area
Then the GetPageText function can be called immediately after this to extract the text from that defined area.
Here's an example using C# (though the library is multi-platform and can be used with many different programming languages):
DPL.LoadFromFile(@"Sample.pdf", "");
DPL.SetOrigin(1); // Sets 0,0 coordinate position to top left of page, default is bottom left
DPL.SetTextExtractionArea(35, 35, 229, 30); // Left, Top, Width, Height
string ExtractedContent = DPL.GetPageText(8);
Console.WriteLine(ExtractedContent);
Using GetPageText it is also possible to return just the text located in that area, or the text in that area together with information about its font, such as name, color and size.

Related

Using Ghostscript to look only at specific area when using inkcov

I use Ghostscript to check if part of a pdf page is empty of content or not by cropping the area I'm interested in (using iTextSharp in vb.net) and then running:
gswin64c.exe -o - -sDEVICE=inkcov c:\cropped.pdf
This works well enough, but it would be much better if I could get Ghostscript to look at a specific rectangle area on the page (instead of the whole page), to remove the need for cropping first.
Is this possible?
You can use -dDEVICEWIDTHPOINTS and -dDEVICEHEIGHTPOINTS along with -dFIXEDMEDIA to specify a fixed media size. You want to set that to the 'window' size you want to use.
You then need to shift the PDF content so that the portion of the page you are interested in lies under the window. To do that you need to write some PostScript code.
You need to define a BeginPage procedure, that will be executed at the start of every page, in there you need to translate the page origin.
So lets take a specific example:
Say that you have a Letter-sized page (612x792 points) and you want to limit the rendered content to a 1x2 inch rectangle at the top right. We start by defining the media size:
-dDEVICEWIDTHPOINTS=72 -dDEVICEHEIGHTPOINTS=144 -dFIXEDMEDIA
We then need to shift the origin (0,0) which is the bottom left corner in PostScript and PDF, so that only the top right section of the content lies on the page. That means we need to shift the content 540 points to the left, and 648 points down. So our BeginPage procedure is:
/BeginPage {-540 -648 translate}
and we install that in the current page device thus:
<<
/BeginPage {-540 -648 translate}
>> setpagedevice
In order to introduce arbitrary PostScript we use the -c switch (and we need to tell GS when we're done; we usually use the -f switch for that, but any switch would do).
So your original command line would then be:
gswin64c -dDEVICEWIDTHPOINTS=72 -dDEVICEHEIGHTPOINTS=144 -dFIXEDMEDIA -sDEVICE=inkcov -o - -c "<</BeginPage {-540 -648 translate}>> setpagedevice" -f original.pdf
Obviously the exact numbers here are going to be up to you to determine.
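A sketch of how those numbers fall out for a window of width W and height H anchored at the page's top-right corner (page dimensions are those of the example above; file name is a placeholder):

```shell
# Compute the BeginPage translation for a top-right window
PAGE_W=612; PAGE_H=792        # page size in points (Letter)
WIN_W=72;  WIN_H=144          # 1x2 inch window
X=$((PAGE_W - WIN_W))         # shift left by 540 points
Y=$((PAGE_H - WIN_H))         # shift down by 648 points

gs -dDEVICEWIDTHPOINTS=$WIN_W -dDEVICEHEIGHTPOINTS=$WIN_H -dFIXEDMEDIA \
   -sDEVICE=inkcov -o - \
   -c "<</BeginPage {-$X -$Y translate}>> setpagedevice" -f original.pdf
```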

Batch convert svg to pdf page size

I have a number of svg files created with inkscape that contain text in non-standard fonts. As far as I understand, in order to have them printed I need to convert the text to paths. It seems that if I just use
convert input.svg output.pdf
the text is automatically converted to paths. Is this correct?
However, my problem is with the page size. The input SVGs have a page size of A5, landscape. However, the converted PDFs seem to be cut off at the right and bottom of the image by about 5% of the image width/height.
Why is that? How do I fix it?
As long as you have Inkscape on your system, ImageMagick convert actually delegates the PDF export to Inkscape. You can use it directly on the command line as
inkscape -zA output.pdf input.svg
Quote from man:
Used fonts are subset and embedded.
There are some options to manipulate the export area: -C explicitly sets the page area, -D the drawing bounding box.
You could even preserve the SVG format by using
inkscape -Tl output.svg input.svg
which would convert the text to paths.
Lastly, since you have to batch-process multiple files, you should open a shell with
inkscape --shell
and process all files in one go. Otherwise, startup time of inkscape would be 1-3 seconds for every file. Something like:
ls -1 *.svg | awk -F. \
'{ print "-AC " $1 ".pdf " $0 }
END { print "quit" }' | \
inkscape --shell
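If shell mode proves awkward, a plain per-file loop also works, at the cost of one Inkscape startup per file (same option letters as above; file names are placeholders):

```shell
# Convert every SVG in the directory to a PDF of the same base name
for f in *.svg; do
    inkscape -zC -A "${f%.svg}.pdf" "$f"
done
```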

Reverse white and black colors in a PDF

Given a black and white PDF, how do I reverse the colors such that background is black and everything else is white?
Adobe Reader does it (Preferences -> Accessibility) for viewing purposes only in the program. But does not change the document inherently such that the colors are reversed also in other PDF readers.
How to reverse colors permanently?
You can run the following Ghostscript command:
gs -o inverted.pdf \
-sDEVICE=pdfwrite \
-c "{1 exch sub}{1 exch sub}{1 exch sub}{1 exch sub} setcolortransfer" \
-f input.pdf
Acrobat will show the colors inverted.
The four identical parts {1 exch sub} are meant for CMYK color spaces and are applied to C(yan), M(agenta), Y(ellow) and (blac)K color channels in the order of appearance.
You may use only three of them -- then it is meant for RGB color spaces and is applied to R(ed), G(reen) and B(lue).
Of course you can "invent" your own transfer functions too, instead of the simple 1 exch sub one: for example, {0.5 mul} will just use 50% of the original color values for each color channel.
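For instance, a 50%-dimming transfer function would slot into the same command line like this (output file name is a placeholder):

```shell
# Dim all four CMYK channels to 50% instead of inverting them
gs -o dimmed.pdf \
   -sDEVICE=pdfwrite \
   -c "{0.5 mul}{0.5 mul}{0.5 mul}{0.5 mul} setcolortransfer" \
   -f input.pdf
```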
Note: The above command will show ALL colors inverted, not just black and white!
Caveats:
Some PDF viewers won't display the inverted colors: notably Preview.app on Mac OS X, Evince, MuPDF and PDF.js (the Firefox PDF viewer) won't. But Chrome's native PDF viewer (PDFium) will, as will Ghostscript and Adobe Reader.
It will not work with all PDFs (or for all pages of the PDF), because it is also dependent on how exactly the document's colors are defined.
Update
The command above was updated with the added -f parameter (required) before the input.pdf. Sorry for not noticing this flaw in my command line earlier. I only became aware of it again because some good soul gave it its first upvote today...
Additional update: The most recent versions of Ghostscript do not require the added -f parameter any more. Verified with v9.26 (may also be true even with v9.25 or earlier versions).
The best method would be to use pdf2ps, the "Ghostscript PDF to PostScript translator", to convert the PDF to a PS file.
Once the PS file is created, open it with any text editor and add {1 exch sub} settransfer before the first line.
Now re-convert the PS file back to PDF with the companion ps2pdf tool.
If you have the Adobe PDF printer installed, you can go to Print -> Adobe PDF -> Advanced... -> Output and select the "Invert" checkbox. The printed PDF file will then be inverted permanently.
None of the previously posted solutions worked for me so I wrote this simple bash script. It depends on pdftk and awk. Just copy the code into a file and make it executable. Then run it like:
$ /path/to/this_script.sh /path/to/mypdf.pdf
The script:
#!/bin/bash
pdftk "$1" output - uncompress | \
awk '
/^1 1 1 / {
sub(/1 1 1 /,"0 0 0 ",$0);
print;
next;
}
/^0 0 0 / {
sub(/0 0 0 /,"1 1 1 ",$0);
print;
next;
}
{ print }' | \
pdftk - output "${1/%.pdf/_inverted.pdf}" compress
This script works for me but your mileage may vary. In particular sometimes the colors are listed in the form 1.000 1.000 1.000 instead of 1 1 1. The script can easily be modified as needed. If desired, additional color conversions could be added as well.
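For example, the first awk pattern could be widened to also catch the 1.000 1.000 1.000 spelling (a sketch; check what your uncompressed PDF actually contains first):

```shell
# Swap white ("1 1 1" or "1.000 1.000 1.000") for black on each line
printf '1.000 1.000 1.000 rg\n' | awk '
/^1(\.0+)? 1(\.0+)? 1(\.0+)? / {
    sub(/^1(\.0+)? 1(\.0+)? 1(\.0+)? /, "0 0 0 ");
    print; next;
}
{ print }'
# prints "0 0 0 rg"
```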
For me, the pdf2ps -> edit -> ps2pdf solution did not work. The intermediate .ps file is inverted correctly, but the final .pdf is the same as the original. The final .pdf in the suggested gs solution was also the same as the original.
Cross Platform try MuPDF
mutool draw -I -o out.pdf in.pdf [range of pages]
This should permanently invert the colours in many viewers.
Later Edit
A sample file that did not invert was one with linework only (no images). The method needed there was to save the graphics as inverted images, then reuse those to build a replacement PDF. Beware, however: converting whole pages to images turns any searchable text into unsearchable pixels, so the rebuild would need to be run with OCR active.
The two commands needed will be something like this (%4d means the image numbers start at output0001):
mutool draw -o output%4d.png -I input.pdf
For Linux users, the following second pass should work easily:
mutool convert -O compress -o output.pdf output*.png
For Windows users, you will for now (v1.19) need to combine the files by scripting or use groups:
mutool convert -O compress -o output.pdf output0001.png output0002.png output0003.png
The next version may include a #filelist option; see https://bugs.ghostscript.com/show_bug.cgi?id=703163
This is probably just a front-end for the Ghostscript command Kurt Pfeifle posted, but you could also use ImageMagick with something like:
convert -density 300 -colorspace RGB -channel RGB -negate input.pdf output.pdf

Ghostscript: convert PDF to EPS with embeded font rather than outlined curve

I use the following command to convert a PDF to EPS:
gswin32 -dNOCACHE -dNOPAUSE -dBATCH -dSAFER -sDEVICE=epswrite -dLanguageLevel=2 -sOutputFile=test.eps -f test.pdf
I then use the following command to convert the EPS to another PDF (test2.pdf) to view the EPS figure.
gswin32 -dSAFER -dNOPLATFONTS -dNOPAUSE -dBATCH -dEPSCrop -sDEVICE=pdfwrite -dPDFSETTINGS=/printer -dCompatibilityLevel=1.4 -dMaxSubsetPct=100 -dSubsetFonts=true -dEmbedAllFonts=true -sOutputFile=test2.pdf -f test.eps
I found that the text in the generated test2.pdf has been converted to outline curves. There are no fonts embedded anymore either.
Is it possible to convert PDF to EPS without converting text to outlines? I mean, to EPS with embedded fonts and text.
Also, after the conversion (test.pdf -> test.eps -> test2.pdf), the height and width of the PDF figure (test2.pdf) are a little bit smaller than in the original PDF (test.pdf):
test.pdf:
test2.pdf:
Is it possible to keep the width and height of the figure after conversion?
Here is the test.pdf: https://dl.dropboxusercontent.com/u/45318932/test.pdf
I tried KenS's suggestion:
gswin32 -dNOPAUSE -dBATCH -dSAFER -sDEVICE=eps2write -dLanguageLevel=2 -sOutputFile=test.eps -f test.pdf
gswin32 -dSAFER -dNOPLATFONTS -dNOPAUSE -dBATCH -dEPSCrop -sDEVICE=pdfwrite -dPDFSETTINGS=/printer -dCompatibilityLevel=1.4 -dMaxSubsetPct=100 -dSubsetFonts=true -dEmbedAllFonts=true -sOutputFile=test2.pdf -f test.eps
I can see that the converted test2.pdf has a very weird font:
that is different from the original font in test.pdf:
When I copy the text from test2.pdf, I only get a couple of symbols like:
✕ ✖ ✗✘✙ ✚✛
Here is the test2.pdf: https://dl.dropboxusercontent.com/u/45318932/test2.pdf
I was using the latest Ghostscript 9.15. So what is the problem?
I just noticed you are using epswrite; you don't want to do that. That device is terrible and has been deprecated (and by now removed). Use the eps2write device instead (you will need a relatively recent version of Ghostscript).
There's nothing you can do with epswrite except throw it away; it makes terrible EPS files. It also can't make level 2 files, no matter what you set -dLanguageLevel to.
Oh, and don't use -dNOCACHE; that prevents fonts from being processed and decomposes everything to outlines or bitmaps.
UPDATE
You set subset fonts to true. By doing so the character codes which are used are more or less random. The first glyph in the document (say for example the 'H' in 'Hello World') gets the code 1, the second one (eg 'e') gets the code 2 and so on.
If you have a ToUnicode CMap, then Acrobat and other readers can convert these character codes to Unicode code points, without that the readers have to fall back on heuristics, the final one being 'treat it as ASCII'. Because the encoding arrangement isn't ASCII, then you get gibberish. MS Windows' PostScript output can contain additional ToUnicode information, but that's not something we try to mimic in ps2write. After all, presumably you had a PDF file already....
Every time you do a conversion you run the risk of this kind of degradation, you should really try and minimise this in your workflow.
The problem is even worse in this case, the input PDF file has a TrueType CID Font. Basic language level 2 PostScript can't handle CIDFonts (IIRC this was introduced in version 2015). Since eps2write only emits basic level 2 it cannot write the font as a CIDFont. So instead it captures the glyph outlines and stores them in a type 3 font.
However, our EPS/PS output doesn't attempt to embed ToUnicode information in the PostScript (it's non-standard, very few applications can make use of it, and it therefore makes the files larger for little benefit). In addition, CIDFonts use multiple (2 or more) bytes for the character code, so there's no way to encode the type 3 fonts as ASCII.
Fundamentally you cannot use Ghostscript to go PDF->PS->PDF and still be able to copy/paste/search text, if the input contains CIDFonts.
By the way, there's no point in setting -dLanguageLevel at all. eps2write only creates level 2 output.
I used Inkscape to convert a .pdf to .eps. Just open the .pdf file in Inkscape, choose the high mesh in the import options, and save it as an EPS file.

Cropped PCL after gswin PDF to PCL conversion

I have a PDF, which I want to convert to PCL
I convert PDF to PCL using the following command:
(gs 8.70)
gswin32c.exe -q -dNOPAUSE -dBATCH \
-sDEVICE=ljetplus -dDuplex=false -dTumble=false \
-sPAPERSIZE=a4 -sOutputFile="d:\doc1.pcl" \
-f"d:\doc1.pdf" -c -quit
When I view or print the output PCL, it is cropped. I would expect the output to start right at the edge of the paper (at least in the viewer).
Is there any to way get the whole output without moving the contents of the page away from the paper edge?
I tried the -dPDFFitpage option which works, but results in a scaled output.
You are using -sPAPERSIZE=a4. This causes the PCL to render for A4 sized media.
Very likely, your input PDF is made for a non-A4 size. That leaves you with 3 options:
...you either use that exact page size for the PCL too (which your printer possibly cannot handle),
...or you have to add -dPDFFitPage (as you tried, but didn't like),
...or you skip the -sPAPERSIZE=... parameter altogether (which will most likely automatically use the same size as the PDF, and which your printer possibly cannot handle...)
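Spelled out, options 2 and 3 would look something like this, following the asker's command (file names are placeholders):

```shell
# Option 2: keep A4 media but scale the PDF page to fit it
gswin32c.exe -q -dNOPAUSE -dBATCH -sDEVICE=ljetplus \
    -sPAPERSIZE=a4 -dPDFFitPage -sOutputFile=doc1.pcl -f doc1.pdf

# Option 3: drop -sPAPERSIZE so the PDF's own page size is used
gswin32c.exe -q -dNOPAUSE -dBATCH -sDEVICE=ljetplus \
    -sOutputFile=doc1.pcl -f doc1.pdf
```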
Update 1:
In case ljetplus is not a hard requirement for your requested PCL format variant, you could try this:
gs -sDEVICE=pxlmono -o pxlmono.pcl a4-fo.pdf
gs -sDEVICE=pxlcolor -o pxlcolor.pcl a4-fo.pdf
Update 2:
I can confirm now that even the most recent version of Ghostscript (v9.06) cannot handle non-Letter page sizes for ljetplus output.
I'd regard this as a bug... but it may well be that it won't get fixed, even if reported in the GS bug tracker. However, the least that can be expected is that it gets documented as a known limitation of ljetplus output...