Saving the output from DiffPDF / comparepdf on the command line: comparing folders of PDFs

We need to compare about 1500 PDFs in one folder with 1500 PDFs in another to check for visual differences.
We have found DiffPDF (and its command-line counterpart, comparepdf) for Windows, which is a lot faster than our automated Acrobat Pro comparisons.
So far I have used:
comparepdf -v=2 -c=a old.pdf new.pdf
but the problem is that it only reports "these files are different". Does anyone know a way to save the diff output from the command line? You can do this from the GUI, but that would mean using something like TestComplete to automate it :(
Or are there better ways of comparing two PDFs visually, with output/highlighting?
Bonus points for C# .net libraries.

You could have a look at these answers to similar questions:
PDF compare on linux command line
How to compare two pdf files through command line
How to unit test a Python function that draws PDF graphics?
However, I have no idea if any of these would be performing faster than what your automated Acrobat Pro comparison does... Let me know if you found out, will you?
Shortcut:
For simplicity, let's assume the input files to be compared are similar enough and each has only 1 page. (For multi-page input, expand on the base idea of this answer...)
The two most essential commands any such comparison boils down to are these:
compare.exe ^
%input1% ^
%input2% ^
-compose src ^
%output%.tmp.pdf
and
pdftk.exe ^
%output%.tmp.pdf ^
background %input1% ^
output %output%.pdf
The first command generates a PDF with all differential pixels colored in red. (A default resolution is used here, 72 dpi. For a more fine-grained view on pixel differences add -density 200 (that will mean: 200 dpi) or higher -- but your processing time will increase accordingly as will the disk space needed by the output...)
The second command tries to merge the resulting PDF with a background taken from %input1%.
Optionally, you may add -verbose -debug coder after the compare command for a better idea about what's going on.
compare.exe is a commandline tool from the great, great ImageMagick family of utilities (available for Linux, Windows, Unix and MacOSX). But it requires a Ghostscript installation to use as a 'delegate' in order to be able to process PDF input. pdftk.exe is also a commandline utility, available for the same platforms. Both are Free Software.
After the first command, you'll have an output file which has only red pixels where there are differences found on the page.
After the second command, you'll have an output with all red 'diff' pixels in the context of the first input PDF.
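To run the two commands above over whole folders, they can be generated per file pair. The following is only a minimal Python sketch, assuming that both folders use identical file names, that `compare` and `pdftk` are on the PATH, and that every PDF has a single page; the function names `build_diff_commands` and `diff_folders` are made up for illustration.

```python
import subprocess
from pathlib import Path


def build_diff_commands(old_pdf, new_pdf, out_base, density=None):
    """Build the compare and pdftk argument lists for one pair of PDFs.

    out_base is the output path without extension (hypothetical convention).
    """
    tmp_pdf = str(out_base) + ".tmp.pdf"
    compare_cmd = ["compare"]
    if density is not None:
        # -density has to precede the inputs so it applies when they are read
        compare_cmd += ["-density", str(density)]
    compare_cmd += [str(old_pdf), str(new_pdf), "-compose", "src", tmp_pdf]
    pdftk_cmd = ["pdftk", tmp_pdf, "background", str(old_pdf),
                 "output", str(out_base) + ".pdf"]
    return compare_cmd, pdftk_cmd


def diff_folders(old_dir, new_dir, out_dir, density=None):
    """Run both commands for every PDF name present in both folders."""
    old_dir, new_dir, out_dir = Path(old_dir), Path(new_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for old_pdf in sorted(old_dir.glob("*.pdf")):
        new_pdf = new_dir / old_pdf.name
        if not new_pdf.exists():
            continue  # no counterpart in the second folder
        compare_cmd, pdftk_cmd = build_diff_commands(
            old_pdf, new_pdf, out_dir / old_pdf.stem, density)
        # no check=True here: compare may exit non-zero when the files differ
        subprocess.run(compare_cmd)
        subprocess.run(pdftk_cmd, check=True)
```

Calling `diff_folders("old", "new", "diffs", density=200)` would then leave one highlighted PDF per input pair in `diffs`.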
Example output:
Here are screenshots of two 1-page PDF files with differences in their content:
Here are screenshots of the output produced by the two commands above:
The left one shows the intermediate result (after first command), with only the difference pixels displaying as red (identical pixels being white).
The screenshot on the right shows the red difference pixels, but this time with the input PDF file number 1 as a (gray) background (after second command).
(PDF input files courtesy of Mark Summerfield, author of the beautiful DiffPDF tool.)

I had the same problem: diffpdf is quick and nice, but GUI only.
[comparepdf] is a console tool, but it reports only an exit code (no diff itself).
[diff-pdf] has both a console mode and diff.pdf output, but it is slow and its output is not user-friendly.
I have tried to add the required code to diffpdf;
you can find it here: http://github.com/taurus-forever/diffpdf-console

Related

Tesseract : Line detection too sensitive

I am trying to detect the text in .pdf files.
They are first converted to an image, then given to Tesseract.
The detection is good, but it produces too many line breaks.
For example, if the page is slightly tilted to the right, the sentence:
"I like Tesseract for reading text"
becomes:
"text read for Tesseract like I"
And that is already after post-processing, because the raw text is:
"textreadforTesseractlikeI"
The bug has occurred since the source .pdf files are at 300 DPI. I understand that the problem comes from the resolution, but I cannot find how to solve it.
Here is my Tesseract command:
Tesseract.exe dummy.pdf dumy-ocr.pdf --psm 12 --dpi 300 -l bvr+fra+eng+deu hocr pdf
First, I would like to solve the problem of too many lines;
then I would like to find out how to make the image perfectly straight.
Thank you in advance for your help.
https://i.stack.imgur.com/crmdO.jpg
You seem to be working backwards.
The "many" lines and thus word reversal are due to the anti-clockwise rotation.
text"
reading
for
Tesseract
like
"I
Fix that first and then the words will naturally all be placed on the same lines.
If you use Leptonica in conjunction with Tesseract, it is supposed to help with the pre-processing, including deskew.
However, there is a very small but powerful open-source GUI and command-line tool for Windows, Linux, and macOS that you could use from a shell; see https://galfar.vevb.net/wp/projects/deskew/. It is also available on GitHub as an AppVeyor CI artifact, so for the most up-to-date version (currently 5 days ago), follow the green tick at https://github.com/galfar/deskew

Ghostscript: How to auto-crop STDIN to "bounding box" and write to PDF?

There have already been quite a few questions and answers here about cropping documents with Ghostscript.
However, the answers are not matching my exact needs and are still confusing to me.
I expected that there would be a single option e.g. "-AutoCropToBBox" or something like this.
For clarification, as a bounding box, I understand the smallest rectangular box which contains all (non-white(?)) printed objects completely.
Furthermore, I want/have to use a printer port redirection (RedMon) to generate a cropped PDF via printing to a Postscript-printer from basically any application.
So, under Win7/64bit, I set the redirected port properties:
The output is redirected to
C:\Windows\system32\cmd.exe
The arguments for the program are:
/c gswin64c.exe -sDEVICE=pdfwrite -o -sOutputFile="%1".pdf -
"%1" contains the user input for filename. With this, I get a full-page PDF. Fine!
But how to add the cropping options?
Additional question:
If I have a multipage document will such an (auto-)cropping be individual for each page? Or would there be an option to keep it all the same e.g. like the first page or like the largest bounding box of all pages?
Another related issue:
the window for prompting for the filename is always popping up behind the application I am printing from. Any ideas to always bring it to the front?
Another question:
There is the Perl script "ps2eps" and the program bbox.exe (see http://ctan.org/pkg/ps2eps). It is said there that Ghostscript (or ps2epsi) occasionally(?) calculates wrong bounding boxes. Is this (still) true?
Thanks for your help.
Well your first problem is that PostScript programs are normally written to expect to be rendered to a specific media size, and are usually not tightly bounded to it. White space is important for readability.
So ordinarily the PostScript program you generate will request a specific media size, and the interpreter will do its best to match that. If it can't match it then it will use a strategy to try and get as close as possible, and scale the entire content to fit that media.
You can't expect the printer to perform any of those things if it doesn't know the required size until it's finished, and you can't be certain of the bounding box until you have rendered all the marking content. It is true that some files, generally EPS files, have a %%BoundingBox comment, but that's a comment; it has no effect in PostScript, and it's there for the benefit of applications which don't want to interpret the PostScript.
So that's why the simple switch you want isn't there, it would break the interpreter's normal functioning, for rendering.
So, the first thing you need to do is determine the bounding box of the content. You can do that, as Stefan says, by using the bbox device. And on that note, as far as I know the bbox device produces accurate output. If it does not, then we would appreciate a bug report proving it so we can fix it. If people don't report bugs, how are we supposed to know about them? It's disappointing to see someone spreading FUD instead of helping out with a bug report.
ps2epsi isn't Ghostscript, it's a crappy cheap and cheerful script; I wouldn't use it. However, if the original PostScript leaves stuff on the stack then it will end up as a corrupted (or invalid) EPS file, and the original PostScript should be fixed before trying to use it, as it will break any PostScript program that tries to use it (e.g. if you include the EPS in a document and then print it).
So if you are using Ghostscript, and you want to take a PostScript program and get an EPS out of it, use the eps2write device. It won't have a preview, but frankly who cares.
Now if I remember correctly the bbox device (and eps2write) record all marking operations, you can't simply record all the non-white marking operations; what if the white overwrites an existing mark on the page ? What if the media is not white ? Note that if you render to a PNG with Ghostscript, the untouched portion of the output is transparent, whereas white marks are not.
So the bbox is the extent of all the marking operations, regardless of the colour. The only other way to proceed would be to render the content and count the non-white pixels. But that only works at a specific resolution, change the resolution and the precise bounding box may change as well.
Once you have the Bounding Box you can tell Ghostscript to use media that size. Note that you will almost certainly also have to translate the origin, as it's unlikely that the content will start tightly at the bottom left corner. You will need -dDEVICEWIDTHPOINTS and -dDEVICEHEIGHTPOINTS to set the media size, and you will need to use -c and -f to send PostScript to alter the origin appropriately. In simple cases an '-x -y translate' will suffice, but if the program executes initgraphics you will instead have to set a BeginPage procedure to alter the initial CTM.
If you set the media size with -dDEVICEWIDTHPOINTS etc. then all pages will be the same size. If you don't want that, then you need to write a BeginPage procedure to resize each page individually (you will also need to hook setpagedevice and remove the /PageSize entries from the dictionary).
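For the simple case described above (one media size for all pages, and a program that does not execute initgraphics), the arithmetic can be sketched as follows. This is only an illustration in Python, not a tested recipe: `parse_bbox` and `build_crop_command` are hypothetical helper names, and adding -dFIXEDMEDIA (so that a page size requested by the PostScript program does not override the media size) is my own assumption rather than something stated above.

```python
import re

# matches the integer %%BoundingBox line printed by the bbox device
BBOX_RE = re.compile(r"%%BoundingBox:\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)\s+(-?\d+)")


def parse_bbox(bbox_output):
    """Return (llx, lly, urx, ury) from the text printed by -sDEVICE=bbox."""
    match = BBOX_RE.search(bbox_output)
    if match is None:
        raise ValueError("no %%BoundingBox line found")
    return tuple(int(value) for value in match.groups())


def build_crop_command(bbox, ps_in, pdf_out, gs="gswin64c.exe"):
    """Build a Ghostscript argument list that sets the media to the bbox size."""
    llx, lly, urx, ury = bbox
    return [
        gs, "-o", pdf_out, "-sDEVICE=pdfwrite",
        f"-dDEVICEWIDTHPOINTS={urx - llx}",
        f"-dDEVICEHEIGHTPOINTS={ury - lly}",
        "-dFIXEDMEDIA",  # assumption: keep the media size set on the command line
        # move the lower-left corner of the bbox to the origin (simple case only)
        "-c", f"{-llx} {-lly} translate",
        "-f", ps_in,
    ]
```

The resulting list can be passed to `subprocess.run`; files that reset the graphics state will still need the BeginPage approach.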
I've no idea why Windows is putting the dialog box behind the active Window, it seems to have started doing that with Windows 7 (or possibly Vista). I don't see any way to alter that because I'm not sure what is generating the dialog.....
Personally I would suggest that you try the 2-step approach of running the original through Ghostscript's eps2write device and then take the EPS and create a PDF file using the pdfwrite device and the -dEPSCrop switch. Double converting is bad, but other solutions are worse. Note that EPS files cannot be multi-page, so you will have to create 'n' EPS files from an n-page PostScript program, and then supply a command line listing each EPS file as input to the pdfwrite device.
Take an example file and try this out from the command line before you try scripting it.
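The 2-step approach could be scripted roughly like this. It is a hedged Python sketch only: `build_eps_commands` is a made-up name, the page count has to be known in advance, and the `page-%d.eps` pattern relies on Ghostscript writing one numbered output file per page.

```python
def build_eps_commands(ps_in, page_count, pdf_out, gs="gswin64c.exe"):
    """Build the two Ghostscript argument lists for the EPS round trip."""
    eps_pattern = "page-%d.eps"  # Ghostscript replaces %d with the page number
    # step 1: one tightly bounded EPS file per page
    to_eps = [gs, "-o", eps_pattern, "-sDEVICE=eps2write", ps_in]
    # step 2: list every EPS file as input and crop each page to its bbox
    eps_files = [eps_pattern % number for number in range(1, page_count + 1)]
    to_pdf = [gs, "-o", pdf_out, "-sDEVICE=pdfwrite", "-dEPSCrop"] + eps_files
    return to_eps, to_pdf
```

Both lists are meant for `subprocess.run`, which avoids the extra escaping that `%d` would need inside a cmd.exe batch file.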
As I understood from @KenS's explanations:
the way eps2write works, it may not or will not or actually cannot result in the minimum possible bounding box
it needs to be a 2-step process via -sDEVICE=bbox
So, I now ended up with the following process to "print" a PDF with a correct minimum possible bounding box:
Redirected Printer Port to cmd.exe
C:\Windows\system32\cmd.exe
Arguments for the program:
/c gswin64c.exe -q -o "%1".ps -sDEVICE=ps2write - && gswin64c.exe -q -dBATCH -dNOPAUSE -sDEVICE=bbox -dLastPage=1 "%1".ps 2>&1 >nul | perl.exe C:\myFiles\CropPS2PDF.pl "%1"
Unfortunately, it requires a little Perl script (let's call it: CropPS2PDF.pl):
#!/usr/bin/perl -w
use strict;
my $FileName = $ARGV[0];
$/ = undef; # slurp mode: read all of STDIN at once
my $Crop = <STDIN>;
# get the bbox coordinates; stop if Ghostscript reported none
$Crop =~ /%%BoundingBox: (\d+) (\d+) (\d+) (\d+)/s or die "No %%BoundingBox found\n";
my ($llx, $lly, $urx, $ury) = ($1, $2, $3, $4);
print "\n$FileName: $llx, $lly, $urx, $ury \n"; # print just to check
# quote the file names so that names containing spaces survive
my $Command = qq{gswin64c.exe -q -o "$FileName.pdf" -sDEVICE=pdfwrite -c "[/CropBox [$llx $lly $urx $ury]" -c " /PAGE pdfmark" -f "$FileName.ps"};
print $Command; # print just to check
system($Command); # execute command
It seems to work... :-)
Improvements are welcome.
My questions are still:
Can this be done somehow without Perl? Just Win7, cmd.exe and Ghostscript?
Is there maybe a way without writing the PS-File to disk which I do not need? Of course, I could also delete it afterwards with the Perl-script.

PDF to PostScript Using Ghostscript: large files having issues printing

I'm currently using Ghostscript to convert 500 page PDF files into PostScript.
I'm using Windows 7, Ghostscript x64 v 9.16, and a Kodak Digimaster Commercial Printer.
I use the following arguments for GhostScript to convert a PDF into PS:
"C:\Program Files\gs\gs9.16\bin\gswin64c.exe"
-dCompressFonts=true
-dSubsetFonts=true
-dEmbedAllFonts=true
-sFONTPATH=C:\Windows\Fonts\
-dNOPAUSE
-dBATCH
-sDEVICE=ps2write
-sOutputFile="PostScript.ps"
"MyPdf.pdf"
I then add %KDK (proprietary) commands to dictate which pages need to print on which paper by using the %KDKSlip command based on the Printer documentation.
The example below would print all pages on Letter duplex except for pages 1/2 and 5/6. Pages 1/2 would print on a paper defined under the name of "YellowPerf", while 5/6 would print on "TriPerf":
%!PS-Adobe-3.0
%%BoundingBox: 0 0 612 792
%%HiResBoundingBox: 0 0 612.00 792.00
%%Creator: GPL Ghostscript 916 (ps2write)
%%LanguageLevel: 2
%%CreationDate: D:20150506143059-05'00'
%%Pages: 8
%%DocumentMedia: Letter 612 792 0 white ()
%%+ YellowPerf 612 792 0 yellow ()
%%+ TriPerf 612 792 0 white ()
%KDKRequirements: duplex
%KDKSlip: YellowPerf duplex 1
%%+ TriPerf duplex 5
%%EndComments
%%BeginProlog
This is then sent to a Kodak Digimaster printer using a Windows command:
> COPY PostScript.ps PrinterName
This has worked fine with smaller documents, but I'm having issues with larger page sets.
When I attempted to print a 500-page PDF converted to PostScript on the Digimaster, errors occurred: "Busy, do not reset the RIP".
File size of those that didn't work:
PostScript File Size: 52 MB
PDF File Size: 41 MB
File sizes of those that did work:
PostScript File Size: 1 MB
PDF File Size: 0.8 MB
Why does this work fine with smaller files but get hosed on larger files?
Would anyone have any advice?
It is not necessarily the filesize of the PostScript that causes your problem:
It could be the PostScript itself, or
it could be that you made a mistake with your editing of the PS files when you inserted the (proprietary) %KDK-comments.
Are you sure your text editor doesn't silently change your linefeed characters?! This could also change the binary parts of the PostScript!
Also, I'm not sure if the copy command does handle print jobs like it should. I would prefer the lpr command (ah... is that even still available on your version of Windows?!)
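One way to rule out the text-editor problem is to insert the comment lines programmatically and treat the file strictly as bytes. The following is a minimal Python sketch under that assumption; `insert_header_comments` and `patch_file` are hypothetical names, and placing the lines directly before %%EndComments simply mirrors the header shown in the question.

```python
def insert_header_comments(ps_bytes, comment_lines):
    """Insert comment lines before %%EndComments, leaving all other bytes intact."""
    marker = b"%%EndComments"
    position = ps_bytes.find(marker)
    if position < 0:
        raise ValueError("no %%EndComments line found")
    # reuse the newline style already present in the header
    newline = b"\r\n" if b"\r\n" in ps_bytes[:position] else b"\n"
    block = b"".join(line.encode("ascii") + newline for line in comment_lines)
    return ps_bytes[:position] + block + ps_bytes[position:]


def patch_file(ps_path, out_path, comment_lines):
    """Read and write in binary mode so no linefeed translation can happen."""
    with open(ps_path, "rb") as source:
        data = source.read()
    with open(out_path, "wb") as target:
        target.write(insert_header_comments(data, comment_lines))
```

Comparing the sizes of the input and output files is then a quick sanity check: the difference should be exactly the length of the inserted lines.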
To debug this and to explore a few different roads to successful printing, I would try a few different steps:
To debug
Send the original PostScript, without the added %KDK header comments, to the printer.
That printer model has a nice feature you can utilize: you can check whether its RIP processes the input file completely and successfully without printing your 500 pages on (wrong) paper and wasting it (you'd also need to discard it afterwards, which is too much work). Just click the red "Stop" button on its user interface monitor.
Does that one complete the RIP process successfully?
Yes? Then you can even print it. Before you do so, you can modify the job settings to select a particular paper tray by clicking on some button on the interface (can't recall the exact button label though). Then "release" the job and it will print.
If it worked, you can again turn your attention to getting your %KDK lines right.
If it didn't you have to try another route.
Check if a different PDF-to-PS converter is working
Create a PostScript file with the help of pdftops (see here for the pdftops.exe version -- read the README to see which options are available).
Proceed analogously to the above: first see if it completes the RIP process. Then continue with your %KDK manipulations....
Check if the direct PDF printing is working
The Digimaster model can consume PDF directly. (Well, internally it uses its own PDF-to-PS converter, but that isn't visible to the outside -- so it doesn't really count as a PDF RIP...)
If that works, you can even prepend your appropriate %KDK comments to the PDF file, similar to the lines below (don't rely on me getting the details right, it's from the top of my head, and memory is decades old!):
%!PS-Adobe-3.0
%%.........................
%%DocumentMedia: ..........
%KDKRequirements: .........
%KDKInserts: ..............
%KDKSlip: .................
%KDKBody: .................
%KDKCovers: ...............
%KDKPDFPrintAnnotations: on
%KDKPDFFitToPage: on
%KDKBinaryOK: on
<esc>%-12345X
%%Emulation: pdf
%PDF-1.5
%...here follow the lines of the original PDF file...
...
Send jobs via "Kodak Printfile Downloader" (KPD)
For Windows there used to be the so-called 'Kodak Print File Downloader' (KPD). The KPD is an application, not a printer driver. Not sure if it is still available.
You could open its GUI, then load a PS, PDF, PCL or TIFF file into its to-be-printed-list of jobs. Then select a few job options (like trays, stapling, sorting etc.). Lastly, send the job off to the Digimaster...
The KPD essentially does the same thing, as you want to achieve: insert %KDK commands into the file header. But you want to do it with a script or an editor (and possibly automatically via a batch process, once it works).
The KPD requires interactive user activity and cannot be scripted.
But you can (ab-)use it to intercept the files it creates from the Windows spooling system, study them and then adapt your scripted efforts so that they also work....
Update
(I had wanted to add this already in my initial answer. But time ran out, so I skipped it for the time being..)
Observe the RIP processing directly at the printer UI
Digimaster printers have their own built-in touchscreen or flatscreen or tube monitor (depending on the age of the model). They also typically have a full-time operator who knows the machine and its tweaks and peculiarities quite well. The machine may be quite a distance from the user sending a job.
So the following should be done when debugging a print problem:
Ask the operator to set the printer to "stop printing", but still "receiving new jobs".
Submit any job(s) you want.
Walk up to the printer and its operator.
Release the job for RIP-ping and observe what happens:
You may see everything going alright and completing until the last page (you know how many pages you submitted, right?)
Or you may see the job aborting at a certain page number.
Or you may see the printer RIP chewing extremely long on a certain page (or several pages), but finally completing the job.
Or you may see the printer RIP hanging with a certain page forever.
Or...
In any case, the details which are observable here may give important clues about where to look next...

Undo Pdfnup Operation

I have a Pdf file which contains several slides per page, including text (not only images).
This pdf was probably created using pdfnup.
Can I revert the pdfnup operation so that each slide is shown on one page?
As far as I know, there is no simple, ready-to-use 'undo' operation.
However, the following answers show the basic approach for achieving the equivalent of an undo with Ghostscript:
Convert PDF 2 sides per page to 1 side per page (Superuser)
How can I split a PDF's pages down the middle? (Superuser)
Cropping a PDF using Ghostscript 9.01 (Stackoverflow)
PDF - Remove White Margins (Stackoverflow)
(Should these not help you to find the final solution, ask again. But then to come up with a fully working commandline, I'd need the complete output of the following command first: pdfinfo -f 1 -l 100 -box your.pdf.)
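The common idea in those answers is to render each sheet several times, each time cropped to one slide. The geometry behind that can be sketched in Python as below. This is purely illustrative: `nup_cells` is a made-up helper, and it assumes a uniform grid without margins or gaps, which real pdfnup output often does not have (hence the request for the pdfinfo output).

```python
def nup_cells(page_width, page_height, columns, rows):
    """Return one (llx, lly, urx, ury) box per slide, in reading order.

    Coordinates are in PDF points with the origin at the bottom left,
    so the top row has the largest y values.
    """
    cell_width = page_width / columns
    cell_height = page_height / rows
    cells = []
    for row in range(rows):  # row 0 is the top row
        for column in range(columns):
            llx = column * cell_width
            lly = page_height - (row + 1) * cell_height
            cells.append((llx, lly, llx + cell_width, lly + cell_height))
    return cells
```

Each box gives the size of the new page (its width and height) and the offset by which the content has to be shifted (its lower-left corner) when re-processing the sheet.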

How to merge many PDF files into a single one? [duplicate]

I have 16 pdfs that I want to convert into a single one... I am on Ubuntu 10.10, how can I do it?
First, get Pdftk:
sudo apt-get install pdftk
Now, as shown on the example page, use
pdftk 1.pdf 2.pdf 3.pdf cat output 123.pdf
for merging pdf files into one.
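With 16 inputs, typing every name gets tedious, and a plain alphabetical sort would put 10.pdf before 2.pdf. A small Python sketch that builds the same pdftk command with the files in numeric order might look like this; `natural_key` and `build_merge_command` are hypothetical helpers, and the file names are only examples.

```python
import re


def natural_key(name):
    """Sort key that compares embedded numbers by value, not character by character."""
    return [int(part) if part.isdigit() else part.lower()
            for part in re.split(r"(\d+)", name)]


def build_merge_command(pdf_names, output="merged.pdf"):
    """Build the pdftk argument list: inputs in natural order, then 'cat output'."""
    inputs = sorted(pdf_names, key=natural_key)
    return ["pdftk", *inputs, "cat", "output", output]
```

The list can be handed to `subprocess.run(build_merge_command(names), check=True)` once the order looks right.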
You can also use Ghostscript to merge different PDFs. You can even use it to merge a mix of PDFs, PostScript (PS) and EPS into one single output PDF file:
gs \
-o merged.pdf \
-sDEVICE=pdfwrite \
-dPDFSETTINGS=/prepress \
input_1.pdf \
input_2.pdf \
input_3.eps \
input_4.ps \
input_5.pdf
However, I agree with other answers: for your use case of merging PDF file types only, pdftk may be the best (and certainly fastest) option.
Update:
If processing time is not the main concern, but if the main concern is file size (or a fine-grained control over certain features of the output file), then the Ghostscript way certainly offers more power to you. To highlight a few of the differences:
Ghostscript can 'consolidate' the fonts of the input files which leads to a smaller file size of the output. It also can re-sample images, or scale all pages to a different size, or achieve a controlled color conversion from RGB to CMYK (or vice versa) should you need this (but that will require more CLI options than outlined in above command).
pdftk will just concatenate each file, and will not convert any colors. If each of your 16 input PDFs contains 5 subsetted fonts, the resulting output will contain 80 subsetted fonts. The resulting PDF's size is (nearly exactly) the sum of the input file bytes.
You can use http://www.mergepdf.net/ for example
Or:
PDFTK http://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/
If you are NOT on Ubuntu and you have the same problem (and you wanted to start a new topic on SO and SO suggested to have a look at this question) you can also do it like this:
Things You'll Need:
* Full Version of Adobe Acrobat
Open all the .pdf files you wish to merge. These can be minimized on your desktop as individual tabs.
Pull up what you wish to be the first page of your merged document.
Click the 'Combine Files' icon on the top left portion of the screen.
The 'Combine Files' window that pops up is divided into three sections. The first section is titled, 'Choose the files you wish to combine'. Select the 'Add Open Files' option.
Select the other open .pdf documents on your desktop when prompted.
Rearrange the documents as you wish in the second window, titled, 'Arrange the files in the order you want them to appear in the new PDF'
The final window, titled, 'Choose a file size and conversion setting' allows you to control the size of your merged PDF document. Consider the purpose of your new document. If it's to be sent as an e-mail attachment, use a low size setting. If the PDF contains images or is to be used for presentation, choose a high setting. When finished, select 'Next'.
A final choice: choose between either a single PDF document, or a PDF package, which comes with the option of creating a specialized cover sheet. When finished, hit 'Create', and save to your preferred location.
Tips & Warnings
Double check the PDF documents prior to merging to make sure all pertinent information is included. It's much easier to re-create a single PDF page than a multi-page document.
There are lots of free tools that can do this.
I use PDFTK (an open-source cross-platform command-line tool) for things like that.
See also pdfjam: http://www2.warwick.ac.uk/fac/sci/statistics/staff/academic/firth/software/pdfjam/