In order to test the results of most of my business requirements I need to find out if a PDF is rendered correctly.
A typical test involves a few UI interactions with a web application and the download of the resulting PDF.
The result should then be compared against an expected output.
Is there a testing framework capable of examining a PDF?
There are already lots and lots of answers on Stack Overflow dealing with similar questions.
Look at this list:

How to unit test a Python function that draws PDF graphics? (StackOverflow)
Visual diff PDF files in order to determine pixel perfectness (StackOverflow)
How to compare two PDF files through command line (StackOverflow)
Comparison of two PDF files (StackOverflow)
PDF compare on Linux command line (StackOverflow)
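For what it's worth, the approach most of those answers converge on is to rasterize both the downloaded PDF and an approved reference PDF, then diff them pixel by pixel. A minimal sketch of that idea in Python, assuming poppler is installed and the pdf2image and Pillow packages are available (the file names are placeholders):

# Rasterize both PDFs and compare them page by page, pixel by pixel.
from pdf2image import convert_from_path
from PIL import ImageChops

def pdfs_match(candidate_path, reference_path, dpi=150):
    candidate_pages = convert_from_path(candidate_path, dpi=dpi)
    reference_pages = convert_from_path(reference_path, dpi=dpi)
    if len(candidate_pages) != len(reference_pages):
        return False
    for got, expected in zip(candidate_pages, reference_pages):
        if got.size != expected.size:
            return False
        # getbbox() returns None when the difference image is completely black,
        # i.e. the two pages are identical at this resolution.
        if ImageChops.difference(got.convert("RGB"), expected.convert("RGB")).getbbox():
            return False
    return True

# Example use in a test (file names are placeholders):
# assert pdfs_match("downloaded.pdf", "reference.pdf")

A test runner such as pytest can then wrap a check like this in an ordinary test case.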
We are changing systems and the new system only outputs .DOC or .TXT files for reports. Several of the reports that come out need to be converted to PDF so they are available to our web users on a daily basis. Currently I am testing with about 1500 copies of a single report, and before the system is ready I will need to support at least 10 types of reports, each possibly with 1500 or so to convert.
So far I have not found a way to convert this many reports effectively. Part of the problem is that the reports must be converted to a specific PDF page size for them to be read easily. I have tested some software solutions but so far I have not been able to find one that works.
I really like Batch Document Converter Pro. We have used software from this company before and it worked very well for our needs. Whenever I try it, though, it gives the error:
Problem with conversion: word to pdf, check word 2007 or greater is installed and the MS PDF Addon pack for office 2007
I have tried installing different versions of Office (including 2007) on the machine and installing the add-on pack, with no change.
One tool to try is LibreOffice, since:
it can run on multiple platforms
it can be driven from the command line or programmatic API
you can use it manually to confirm whether it will do what you need before doing any scripting/programming
it does pretty good conversions
the page format of the .docx files will carry over naturally to the PDF
the text files will be converted into a "normal" page layout
I would suggest you first install LibreOffice, open some of your documents by hand, and export them to PDF. If the results are good enough, you can then automate this to run in batches.
If the first step is promising, then the simplest automation is to use the command line, e.g.:
"C:\Program Files\LibreOffice 5\program\soffice" --convert-to pdf myDoc.docx
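If that single-file conversion produces the output you need, a rough sketch of batching it from Python might look like this; the soffice path and folder names are assumptions you would adjust to your installation:

import pathlib
import subprocess

SOFFICE = r"C:\Program Files\LibreOffice 5\program\soffice.exe"  # assumed install path
SRC_DIR = pathlib.Path(r"C:\reports\incoming")                   # placeholder folders
OUT_DIR = pathlib.Path(r"C:\reports\pdf")

OUT_DIR.mkdir(parents=True, exist_ok=True)
for doc in list(SRC_DIR.glob("*.docx")) + list(SRC_DIR.glob("*.txt")):
    # --headless avoids opening the GUI; --outdir collects the PDFs in one place.
    subprocess.run([SOFFICE, "--headless", "--convert-to", "pdf",
                    "--outdir", str(OUT_DIR), str(doc)], check=True)

You can also pass several input files to a single soffice invocation to cut down on start-up overhead; for 1500 reports per run it is worth measuring throughput before committing to this route.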
I hope that helps.
I wrote a script which tests several rasterization programs by using the official W3C SVG test suites and comparing the rasterized PNGs with the expected PNGs pixel by pixel.
The problem is, with the v. 1.1 second edition (2011) and v. tiny 1.2 (2008) test suites, in a lot of images the rasterized output doesn't match the expected PNG because the revision number is not the same, producing a lot of false positives (more than 90%), like this one.
However, it's ok with the v. 1.1 first edition test suite.
I could crop the PNGs to remove the area with the revision number, but that's a really derpy solution.
So which PNGs should I compare the rasterized vectors against?
Thanks.
There is no non-derpy solution to this problem, for a few reasons. The test images from this time were never meant to be ref-tests (that is, they are not pixel-by-pixel matches). Also, some of the tests that appear in the later test suites were not accepted as legitimate tests, so the revision numbers were not updated.
The later SVG 1.1 2nd edition test suite should be considered canonical, but even that contains some revision-mark errors, like coords-trans-06-t.
This is actually an issue for the SVG WG to resolve, and I'll raise it with them. The revision numbers in all approved tests should match the PNG references, and we can revise the tests so that the revision numbers match.
In the future we'll be converting these tests (and writing new ones) for SVG 2 as reftests and scripted tests in the web-platform-tests project. The SVG 1.1 tests are at this point unmaintained.
If you really need up-to-date PNG reference images, you could regenerate them. They are generated using Batik's command-line SVG to PNG conversion tool. In the SVG Working Group's old CVS repository, there is a script (script/generate_reference_images.pl) to do the conversions, plus a set of patched SVG files (imagePatches/) to use for tests that we knew Batik didn't get right with the original markup.
I've zipped up the SVG 1.1 Second Edition test suite sources and put them here in case you want to try re-generating the images.
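If you go that route, a hedged sketch of the regeneration loop (not the Working Group's actual script, just a simplified stand-in) could look like this in Python, assuming Java and a Batik binary distribution are available locally; the jar path and directory names are placeholders:

import pathlib
import subprocess

BATIK_JAR = "batik-rasterizer.jar"          # path to Batik's rasterizer jar (assumed)
SVG_DIR = pathlib.Path("svg")               # unpacked SVG test suite sources
PNG_DIR = pathlib.Path("png-regenerated")   # where the new reference images go

PNG_DIR.mkdir(exist_ok=True)
for svg in sorted(SVG_DIR.glob("*.svg")):
    # -d sets the output directory; Batik writes <name>.png there.
    subprocess.run(["java", "-jar", BATIK_JAR, "-d", str(PNG_DIR), str(svg)],
                   check=True)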
I wasn't able to find anything on the internet, and I get the feeling that what I want is not such a trivial thing. To make a long story short: I'd like to get my hands on the underlying PDF code that describes a selected area of a .pdf file. I've been looking for libraries or open source readers but couldn't find anything useful yet.
Does anything exist that might accomplish this, or anything that might be reused (like an open source reader) to get there a little faster without having to write everything from scratch?
You can convert a whole PDF document to PostScript using pdftops, one of the utilities from the poppler PDF rendering library.
This utility enables you to convert individual pages, which is at least a start.
If you just want to extract bitmapped images, try pdfimages from the same package. This extraction can also be restricted to individual pages.
The poppler library was originally written for UNIX-like systems, but there are a couple of Windows builds available.
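As a small illustration, both utilities can be restricted to a page range with -f and -l, so a sketch of pulling apart just the page you care about from Python might look like this (file names and the page number are placeholders):

import subprocess

PDF = "input.pdf"
PAGE = "3"  # the page containing the area you are interested in

# Convert just that page to PostScript so you can inspect the drawing operators.
subprocess.run(["pdftops", "-f", PAGE, "-l", PAGE, PDF, "page3.ps"], check=True)

# Extract any bitmapped images embedded on the same page.
subprocess.run(["pdfimages", "-f", PAGE, "-l", PAGE, PDF, "page3-img"], check=True)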
The open source tool from iText called iText RUPS does what you want, showing you all the PDF commands for a particular PDF and allowing you to visualize the structure and relationships.
http://sourceforge.net/projects/itextrups/
I have to extract text from invoice and bill PDF files.
The file layouts can get complex, though they are mostly filled with tables.
I've read a few dozens articles already about the pdf format, how easy it is for our brain to grasp it and how hard it is for a machine to understand its structure.
I also downloaded a few tools like Python's pdfminer and some Java tools; some even have rule-based layout extraction, like LA-PDFText. These are all great libraries, but they leave the final step to you.
Adobe also has an online service called ExportPDF, but it can't be customized.
Bottom line, I understand that in order to extract text from structured PDF files and convert it to XML, for example, there will be some level of manual work.
I also found A-PDF Form Data Extractor, a non-free tool with the ability to set extraction rules that claims to do the job, though it's hard to find a proper manual and it runs only on Windows.
I thought I might even try converting those files to images and running tesseract-ocr, but decided to ask for advice here before I spend more time on it.
I'll be very grateful if someone with such experience could give me a hint.
I've done a lot of PDF extraction and I can confirm, as you've already discovered, that it can be a painful process to start. One of the important things to understand is that there is no concept of "tables" within a PDF, just text that happens to have lines around it.
Also, there's no guarantee that the linear order of text within the PDF code actually matches the visual order when printed. In other words, there's no guarantee that "hello world" is written in that order; it could be draw 'world' at coord 20, then draw 'hello' at coord 10. Most PDF creators don't do this, but still there's no guarantee.
The more creative a PDF creator is (InDesign, Illustrator, etc.), the harder the text is going to be to get out. And once a designer starts messing with fonts too much, some programs will actually output words one character at a time, changing the font just slightly each time.
That said, I'd recommend the first one that you looked at, LA-PDFText. You can run it in discovery mode (blockify) from which you can create rules. I don't have Java installed anymore so I can't test it but it seems very promising.
Your second one, A-PDF Form Data Extractor, only really works with actual PDF forms. If this is your case I'd recommend just using an open source solution like iText/iTextSharp.
The last OCR one makes me cringe. I just can't imagine going through those hoops would get you better text representation than parsing the PDF. But then again, PDF is a visual format so maybe it would.
Personally I use iText/iTextSharp for this kind of thing but I also like to do things the hard way.
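For completeness, here is a minimal sketch of the plain-text starting point using pdfminer, which the question already mentions (this uses the pdfminer.six fork's high-level API; the file name is a placeholder). It only gives you the raw text, so tables and layout still have to be reconstructed on top of it:

from pdfminer.high_level import extract_text

text = extract_text("invoice.pdf")  # placeholder file name
print(text)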
It is not clear whether you are looking for a development tool to automate data extraction from bills and invoices, or just a one-time utility that can be used by a non-developer.
Anyway, here are some specialized tools, including the engines they use:
Tabula (open source, designed specifically to extract data from tables in PDFs; can export shell scripts for batch processing, runs as a localhost web service, powered by the JRuby Tabula engine; see the scripted sketch after this answer)
VietOCR (open-source .NET desktop utility for text extraction from PDFs and images, based on the Tesseract OCR engine)
ByteScout PDF Viewer (freeware, closed-source .NET utility; detects and extracts tables, including from scanned invoices; powered by the PDF Extractor SDK)
DISCLAIMER: I work for ByteScout.
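If Tabula looks like a fit, one way to script it is through tabula-py, a separate Python wrapper around the same Java engine (a suggestion on my part, not something the list above covers); it requires Java on the machine, and the file name below is a placeholder:

import tabula

# Returns one pandas DataFrame per table that Tabula detects.
tables = tabula.read_pdf("invoice.pdf", pages="all")
for i, table in enumerate(tables):
    table.to_csv(f"invoice_table_{i}.csv", index=False)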
I'm currently planning an application which involves manipulating PDFs. My goal is to have a program that I can pass a PDF as input, and which then saves separated grayscale images of the colour channels that the PDF consists of as output. This is basically a simple RIP.
I'm currently using a solution based on Ghostscript, but I want to rewrite the application to optimise speed and usability. (Ghostscript doesn't separate PDFs, for example.)
Do you know of any other open source libraries that i may find useful to achieve this?
Did you ever try to run (I'm assuming Windows here):
mkdir separated
gswin32c.exe ^
-o separated/page_%04d.tif ^
-sDEVICE=tiffsep ^
d:/path/to/input.pdf
(you can also try -sDEVICE=tiffsep1) and then looked at the files you get in the separated subdirectory? And this is not a case of Ghostscript separating PDFs, in your mind?
The device tiffsep creates multiple output files:
One single 32bit composite CMYK file (tiff32nc format) per PDF page.
Multiple tiffgray files (each compressed with LZW) per PDF page, one for each separation.
The device tiffsep1 behaves similarly, but...
...doesn't create the composite output file...
...and it creates tiffg4 output files for the separations.
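Since you're planning an application around this, a sketch of driving the same tiffsep device from Python might look like the following; the Ghostscript executable name and paths are assumptions to adjust for your platform:

import pathlib
import subprocess

GS = "gswin32c.exe"  # or "gs" on Linux
pdf = pathlib.Path("d:/path/to/input.pdf")
out_dir = pathlib.Path("separated")
out_dir.mkdir(exist_ok=True)

subprocess.run([GS,
                "-o", str(out_dir / "page_%04d.tif"),
                "-sDEVICE=tiffsep",
                str(pdf)], check=True)

# After the run, separated/ holds one composite CMYK TIFF per page plus one
# grayscale TIFF per ink (Cyan, Magenta, Yellow, Black, and any spot colours).
print(sorted(out_dir.glob("*.tif")))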