Auto-crop PDF from the command line

I need to automatically crop a PDF file (remove the white margins). So far I have tried two tools, neither of which is perfect:
pdfcrop
Issue: it doesn't crop some PDFs.
pdf-crop-margins
Issue: sometimes it crops too much (cutting off fine details).

I had the same problem. I render very long single-page PDFs with wkhtmltopdf like this:
wkhtmltopdf \
--disable-javascript \
--print-media-type \
--zoom 2 \
--page-width 750px \
--page-height 100000px \
https://www.foobar.com \
foobar.pdf;
Then I need to trim off all the whitespace at the bottom.
I tried Briss https://formulae.brew.sh/formula/briss#default, but that did not work for me.
So I tried pdfcropmargins, the one that did not work for you, and bingo!
On macOS 11.6 I need to call the command like so:
/Users/<usernamehere>/Library/Python/3.8/bin/pdfcropmargins -p 0 foobar.pdf;
That will crop the file and write the output as foobar_cropped.pdf.
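If cropping flush to the content turns out too tight for some documents, retaining a small percentage of the margins may help. A hedged variation, using the -p (percent to retain) and -o (output name) options of pdfcropmargins; check pdfcropmargins --help for your version:
/Users/<usernamehere>/Library/Python/3.8/bin/pdfcropmargins -p 5 -o foobar_trimmed.pdf foobar.pdf;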

Related

Ghostscript: ERROR: A pdfmark destination page x points beyond the last page y when using %003d

This command:
gs -sOutputFile=/destination/%003d.pdf \
-sDEVICE=pdfwrite \
-dBATCH \
-dNOPAUSE \
initial.pdf
gives me this error:
GPL Ghostscript 9.26: ERROR: A pdfmark destination page 10 points beyond the last page 1.
for each page, but the same command with plain %d instead of %003d doesn't return any error:
gs -sOutputFile=/destination/%d.pdf \
-sDEVICE=pdfwrite \
-dBATCH \
-dNOPAUSE \
initial.pdf
You are producing each page of the input PDF file as a separate PDF file when you specify %d. Thus each destination output file only has one page.
Your input file has 'something' (could be an outline, a Link, a Dest or possibly something else) which points to page 10 in the original file. Ghostscript's PDF interpreter converts that to a pdfmark and emits it.
Now in both cases you should be emitting one file per page, so I would expect both command lines to give you an error because page 10 is, clearly, outside the range of pages in any file.
It's hard to see why %d instead of %003d doesn't give an error; I would expect that it should. However, without the original PDF file to experiment with, I can't tell what is going on. Your best bet, if you think this is a bug, is to open a bug report at https://bugs.ghostscript.com
You should also try the current version (9.50); the one you are using is somewhat out of date.
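As an aside, if the point of %003d was simply zero-padded filenames, the conventional printf-style width specifier is %03d. A variant of the same command under that assumption (whether it avoids the pdfmark error I can't say without the original file):
gs -sOutputFile=/destination/%03d.pdf \
-sDEVICE=pdfwrite \
-dBATCH \
-dNOPAUSE \
initial.pdf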

Ghostscript to convert PDF 2.0 back to PDF version 1.7

I need to build a PDF serverside by reading in a number of PDFs and inserting each page into a new multipage PDF. The problem is that the PDFs are provided in version 2.0 format, but my application can only read version 1.7. I would like to convert the version 2 files back into a version 1.7 file so that my application can read it.
I am using Ghostscript version 9.27 and have tried several commands, but each time I end up with an empty PDF. Example:
/usr/local/bin/gs \
-q -dNOPROMPT \
-dBATCH \
-dDEVICEWIDTH=595 \
-dDEVICEHEIGHT=842 \
-sDEVICE=pdfwrite \
-dCompatibilityLevel=1.7 \
-sFileName=pdf-version-2.pdf \
-sOutputFile=fileout.pdf
There is no error, just an empty PDF. The "file" command does give the expected output "PDF document, version 1.7" but that's not much good when the file is blank. Any help greatly appreciated!
OK so I think the problem is your command line (as pointed out to me by one of my colleagues). You've specified -sFileName=pdf-version-2.pdf, which looks like you're trying to specify the input file.
There is no Ghostscript switch -sFileName, you specify the input filename(s) by simply putting the name on the command line. So you really want:
/usr/local/bin/gs \
-q -dNOPROMPT \
-dBATCH \
-dDEVICEWIDTH=595 \
-dDEVICEHEIGHT=842 \
-sDEVICE=pdfwrite \
-dCompatibilityLevel=1.7 \
-sOutputFile=fileout.pdf \
pdf-version-2.pdf
For reasons that were good 30 years ago, the Ghostscript command line switches are copied into the PostScript environment, where they can be accessed by PostScript programs. So while it's true (and possibly the source of your confusion) that some of the utility programs shipped with Ghostscript do use -sFileName, Ghostscript itself doesn't; it just defines a PostScript variable of that name so that programs can read it.
Because you've specified BATCH and NOPROMPT, but haven't specified an input file, the interpreter starts up, erases the current page to white, then exits. Closing the pdfwrite device causes it to write out the current content of the page, which is, well, white, resulting in your empty PDF file.
The slightly modified command line above worked well for me but, as I noted in my comments, specifying DEVICEWIDTHPOINTS and DEVICEHEIGHTPOINTS won't actually do anything here.
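For completeness: if the intent behind those switches really was to force A4-sized output pages, the usual way to do that with pdfwrite is -dDEVICEWIDTHPOINTS/-dDEVICEHEIGHTPOINTS combined with -dFIXEDMEDIA (plus -dPDFFitPage to scale the content onto the new page size). A sketch under that assumption, not something the version conversion itself needs:
/usr/local/bin/gs \
-q -dNOPROMPT \
-dBATCH \
-sDEVICE=pdfwrite \
-dCompatibilityLevel=1.7 \
-dDEVICEWIDTHPOINTS=595 \
-dDEVICEHEIGHTPOINTS=842 \
-dFIXEDMEDIA \
-dPDFFitPage \
-sOutputFile=fileout-a4.pdf \
pdf-version-2.pdf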

Reverse white and black colors in a PDF

Given a black and white PDF, how do I reverse the colors such that background is black and everything else is white?
Adobe Reader can do it (Preferences -> Accessibility), but for viewing purposes only, inside the program; it does not change the document itself, so the colors are not reversed in other PDF readers.
How can I reverse the colors permanently?
You can run the following Ghostscript command:
gs -o inverted.pdf \
-sDEVICE=pdfwrite \
-c "{1 exch sub}{1 exch sub}{1 exch sub}{1 exch sub} setcolortransfer" \
-f input.pdf
Acrobat will show the colors inverted.
The four identical parts {1 exch sub} are meant for CMYK color spaces and are applied to C(yan), M(agenta), Y(ellow) and (blac)K color channels in the order of appearance.
You may use only three of them -- then it is meant for RGB color spaces and is applied to R(ed), G(reen) and B(lue).
Of course you can "invent" your own transfer functions too, instead of the simple 1 exch sub one: for example, {0.5 mul} will use just 50% of the original color values for each color channel.
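Plugged into the same command structure, that example function would look like this (a sketch only; the caveats below about viewer support and how the document's colors are defined apply here too):
gs -o halved.pdf \
-sDEVICE=pdfwrite \
-c "{0.5 mul}{0.5 mul}{0.5 mul}{0.5 mul} setcolortransfer" \
-f input.pdf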
Note: the above command will invert ALL colors, not just black and white!
Caveats:
Some PDF viewers won't display the inverted colors, notably Preview.app on Mac OS X, Evince, MuPDF and PDF.js (Firefox's PDF viewer). But Chrome's native PDF viewer (PDFium) will, as will Ghostscript and Adobe Reader.
It will not work with all PDFs (or for all pages of the PDF), because it is also dependent on how exactly the document's colors are defined.
Update
The command above has been updated with the (required) -f parameter before input.pdf. Sorry for not noticing this flaw in my command line before. I became aware of it again only because some good soul gave it its first upvote today...
Additional update: the most recent versions of Ghostscript no longer require the added -f parameter. Verified with v9.26 (this may also be true of v9.25 or earlier versions).
The best method would be to use pdf2ps, the "Ghostscript PDF to PostScript translator", which converts the PDF to a PS file.
Once the PS file is created, open it with any text editor and add {1 exch sub} settransfer before the first line.
Now "re-convert" the PS file back to PDF with the same software used above.
If you have the Adobe PDF printer installed, go to Print -> Adobe PDF -> Advanced... -> Output and select the "Invert" checkbox. Your printed PDF file will then be inverted permanently.
None of the previously posted solutions worked for me so I wrote this simple bash script. It depends on pdftk and awk. Just copy the code into a file and make it executable. Then run it like:
$ /path/to/this_script.sh /path/to/mypdf.pdf
The script:
#!/bin/bash
pdftk "$1" output - uncompress | \
awk '
/^1 1 1 / {
sub(/1 1 1 /,"0 0 0 ",$0);
print;
next;
}
/^0 0 0 / {
sub(/0 0 0 /,"1 1 1 ",$0);
print;
next;
}
{ print }' | \
pdftk - output "${1/%.pdf/_inverted.pdf}" compress
This script works for me but your mileage may vary. In particular, sometimes the colors are listed in the form 1.000 1.000 1.000 instead of 1 1 1; the script can easily be modified to handle that (as sketched below), and additional color conversions could be added as well if desired.
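For example, handling the 1.000 1.000 1.000 form could be done with extra awk rules alongside the existing ones. A sketch; match the patterns against what pdftk's uncompressed output actually contains:
/^1\.000 1\.000 1\.000 / {
sub(/1\.000 1\.000 1\.000 /,"0.000 0.000 0.000 ",$0);
print;
next;
}
/^0\.000 0\.000 0\.000 / {
sub(/0\.000 0\.000 0\.000 /,"1.000 1.000 1.000 ",$0);
print;
next;
}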
For me, the pdf2ps -> edit -> ps2pdf solution did not work. The intermediate .ps file is inverted correctly, but the final .pdf is the same as the original. The final .pdf in the suggested gs solution was also the same as the original.
Cross-platform: try MuPDF.
mutool draw -I -o out.pdf in.pdf [range of pages]
It should permanently change colours in many viewers.
Later Edit
A sample file that did not reverse was one with linework only (no image). The method needed there was to save the graphics as an inverted image and then reuse that to build a replacement PDF. However, beware that converting whole pages to images turns any searchable text into plain unsearchable pixels, so the rebuild would need to be run with OCR active.
The two commands needed will be something like this (%4d means the image numbers start at output0001):
mutool draw -o output%4d.png -I input.pdf
For Linux users the following second pass should work easily:
mutool convert -O compress -o output.pdf output*.png
For Windows users you will, for now (v1.19), need to combine the files by scripting or list them as a group:
mutool convert -O compress -o output.pdf output0001.png output0002.png output0003.png
The next version may include an #filelist option; see https://bugs.ghostscript.com/show_bug.cgi?id=703163
This is probably just a frontend for the Ghostscript command Kurt Pfeifle posted, but you could also use ImageMagick with something like:
convert -density 300 -colorspace RGB -channel RGB -negate input.pdf output.pdf

When converting the first page of a PDF into an image using Ghostscript, sometimes I get "extra" space. Why?

I am building a simple script which converts the first page of a PDF into an image using Ghostscript. Here is the command I use:
gs -q -o output.png -sDEVICE=pngalpha -dLastPage=1 input.pdf
This works beautifully with some PDFs: I get the first page as an image and there aren't any problems.
But I have noticed that for the first pages of some other PDFs, the same gs command produces a .png with extra white space on the left side of the image. Why does Ghostscript do this? Where does that extra blank white space come from?
Most likely, your PDFs do not use identical values for /MediaBox and /CropBox (the German Wikipedia has a good illustration of these page-box terms).
In other words: the /CropBox values (if given) for a PDF page determine which (smaller) part of the overall page content, which lives inside the /MediaBox, the PDF viewer should make visible to the user (or to the printer).
Solution
To determine the different box values for all the pages of your file(s), run this command:
pdfinfo -f 1 -l 1000 -box my.pdf
To see these values for just the first page, run:
pdfinfo -l 1 -box my.pdf
For Ghostscript to give the results you want, add -dUseCropBox to your command line:
gs -q -o output.png -sDEVICE=pngalpha -dLastPage=1 -dUseCropBox input.pdf
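If the resulting image also needs a higher resolution, Ghostscript's standard -r switch can be added; 300 dpi here is an arbitrary choice:
gs -q -o output.png -sDEVICE=pngalpha -r300 -dLastPage=1 -dUseCropBox input.pdf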

PDF text extraction from given coordinates

I would like to extract text from a portion (using coordinates) of PDF using Ghostscript.
Can anyone help me out?
Yes, with Ghostscript, you can extract text from PDFs. But no, it is not the best tool for the job. And no, you cannot do it in "portions" (parts of single pages). What you can do: extract the text of a certain range of pages only.
First: Ghostscript's txtwrite output device (not so good)
gs \
-dBATCH \
-dNOPAUSE \
-sDEVICE=txtwrite \
-dFirstPage=3 \
-dLastPage=5 \
-sOutputFile=- \
/path/to/your/pdf
This will output all text contained on pages 3-5 to stdout. If you want output to a text file, use
-sOutputFile=textfilename.txt
gs Update:
Recent versions of Ghostscript have seen major improvements in the txtwrite device and bug fixes. See recent Ghostscript changelogs (search for txtwrite on that page) for details.
Second: Ghostscript's ps2ascii.ps PostScript utility (better)
This one requires you to download the latest version of the file ps2ascii.ps from the Ghostscript Git source code repository. You'd have to convert your PDF to PostScript, then run this command on the PS file:
gs \
-q \
-dNODISPLAY \
-P- \
-dSAFER \
-dDELAYBIND \
-dWRITESYSTEMDICT \
-dSIMPLE \
/path/to/ps2ascii.ps \
input.ps \
-c quit
If the -dSIMPLE parameter is not defined, each output line contains, beyond the pure text content, some additional info about the fonts and font sizes used.
If you replace that parameter by -dCOMPLEX, you'll get additional info about the colors and images used.
Read the comments inside ps2ascii.ps to learn more about this utility. It's not comfortable to use, but it worked for me in most of the cases where I needed it...
Third: XPDF's pdftotext CLI utility (more comfortable than Ghostscript)
A more comfortable way to do text extraction: use pdftotext (available for Windows as well as Linux/Unix or Mac OS X). This utility is based either on Poppler or on XPDF. This is a command you could try:
pdftotext \
-f 13 \
-l 17 \
-layout \
-opw supersecret \
-upw secret \
-eol unix \
-nopgbrk \
/path/to/your/pdf \
- | less
This will display the page range from 13 (first page) to 17 (last page), preserve the layout of a doubly password-protected PDF file (user password secret, owner password supersecret), use the Unix EOL convention, but insert no page breaks between PDF pages, piped through less...
pdftotext -h displays all available commandline options.
Of course, both tools only work on the text parts of PDFs (if they have any). Oh, and mathematical formulas won't work too well either... ;-)
pdftotext Update:
Recent versions of Poppler's pdftotext now have options to extract "a portion (using coordinates) of PDF" pages, like the OP asked for. The parameters are:
-x <int> : top left corner's x-coordinate of crop area
-y <int> : top left corner's y-coordinate of crop area
-W <int> : crop area's width in pixels (defaults to 0)
-H <int> : crop area's height in pixels (defaults to 0)
Best if used together with the -layout parameter.
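For example, to extract only a 300x200 area whose top-left corner sits at (72, 72) on page 13 (the numbers are placeholders; substitute your own coordinates):
pdftotext \
-f 13 \
-l 13 \
-x 72 \
-y 72 \
-W 300 \
-H 200 \
-layout \
/path/to/your/pdf \
-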
Fourth: MuPDF's mutool draw command can also extract text
The cross-platform, open source MuPDF application (made by the same company that also develops Ghostscript) has bundled a command line tool, mutool. To extract text from a PDF with this tool, use:
mutool draw -F txt the.pdf
This will emit the extracted text to <stdout>. Use -o filename.txt to write it to a file.
Fifth: PDFLib's Text Extraction Toolkit (TET) (best of all... but it is PayWare)
TET, the Text Extraction Toolkit from the pdflib family of products, can find the x-y coordinates of text content in a PDF file (and much more). TET has a command-line interface, and it's the most powerful of all the text extraction tools I'm aware of. (It can even handle ligatures...) Quote from their website:
Geometry
TET provides precise metrics for the text, such as the position on the page, glyph widths, and text direction. Specific areas on the page can be excluded or included in the text extraction, e.g. to ignore headers and footers or margins.
In my experience, while it does not sport the most straightforward CLI interface you can imagine, once you get used to it, it will do what it promises to do for most PDFs you throw at it...
And there are even more options:
podofotxtextract (CLI tool) from the PoDoFo project (Open Source)
calibre (normally a GUI program to handle eBooks, Open Source) has a commandline option that can extract text from PDFs
AbiWord (a GUI word processor, Open Source) can import PDFs and save its files as .txt: abiword --to=txt --to-name=output.txt input.pdf
I'm not sure Ghostscript can accept coordinates, but you can convert the PDF to an image and send it to an OCR engine, either as a sub-image cropped to the given coordinates or as the whole image along with the coordinates. Some OCR APIs accept a rectangle parameter to narrow the region for OCR.
Look at VietOCR for a working example; it uses Tesseract as its OCR engine and Ghostscript as its PDF-to-image converter.
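A minimal command-line sketch of that pipeline, assuming Ghostscript and the tesseract CLI are installed (crop the rendered image to your coordinates with any image tool before the OCR step if you only need a region):
gs -q -o page1.png -sDEVICE=png16m -r300 -dFirstPage=1 -dLastPage=1 input.pdf
tesseract page1.png page1-text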
Debenu Quick PDF Library can extract text from a defined area on a page. The SetTextExtractionArea function lets you specify the x and y coordinates and then you can also specify the width and height of the area.
Left = The horizontal coordinate of the left edge of the area
Top = The vertical coordinate of the top edge of the area
Width = The width of the area
Height = The height of the area
Then the GetPageText function can be called immediately after this to extract the text from that defined area.
Here's an example using C# (though the library is multi-platform and can be used with many different programming languages):
DPL.LoadFromFile(@"Sample.pdf", "");
DPL.SetOrigin(1); // Sets 0,0 coordinate position to top left of page, default is bottom left
DPL.SetTextExtractionArea(35, 35, 229, 30); // Left, Top, Width, Height
string ExtractedContent = DPL.GetPageText(8);
Console.WriteLine(ExtractedContent);
Using GetPageText it is also possible to return just the text located in that area, or that text together with information about its font, such as name, color and size.