PostScript versus PDF as an output format

I'm currently writing a typesetting application and I'm using PSG as the backend for producing PostScript files. I'm now wondering whether that choice makes sense: it seems the ReportLab Toolkit offers all the features PSG offers, and more. ReportLab outputs PDF, however.
Advantages PDF offers:
transparency
better support for character encodings (Unicode, for example)
ability to embed TrueType and even OpenType fonts
hyperlinks and bookmarks
Is there any reason to use PostScript instead of outputting directly to PDF? PostScript is a full programming language, unlike PDF, but for a basic document output format that doesn't seem to offer any advantage. I assume a PDF can be readily converted to PostScript for printing?
Some useful links:
Wikipedia: PDF
Adobe: PostScript vs. PDF
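
For concreteness, every feature in the list above maps onto ReportLab's canvas API. A minimal sketch (the font file path is a placeholder; substitute any TrueType font you have):

    from reportlab.pdfgen import canvas
    from reportlab.pdfbase import pdfmetrics
    from reportlab.pdfbase.ttfonts import TTFont

    # Embed a TrueType font (path is a placeholder)
    pdfmetrics.registerFont(TTFont("DejaVu", "DejaVuSans.ttf"))

    c = canvas.Canvas("demo.pdf")
    c.setFont("DejaVu", 14)
    c.drawString(72, 760, "Unicode text: naïve café ąčę")   # encodings
    c.setFillAlpha(0.5)                                     # transparency
    c.rect(72, 700, 100, 40, fill=1)
    c.linkURL("https://example.com", (72, 650, 200, 665))   # hyperlink
    c.bookmarkPage("top")                                   # bookmark
    c.addOutlineEntry("First page", "top", level=0)
    c.showPage()
    c.save()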

If you're planning on only outputting to a PostScript printer, then use PostScript. Otherwise, use PDF.
PDF is more widely supported by non-printer devices, and for your purposes there aren't any technical advantages of PS over PDF (other than the ability to dump the file straight to a PostScript printer).

Here are some things to consider:
gzipped PostScript is often much smaller than an equivalent PDF (see the quick check after this list)
PDF is basically a generalized container format; if you didn't know that you can embed videos in a PDF, that should give you pause
PDF can contain scripts, which have been used for exploits (though this may be more the fault of bad PDF reader software)
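
A quick way to check the first point against your own files (a sketch; the file names are placeholders, and it assumes you have the same document rendered to both formats):

    import gzip, os

    ps, pdf = "document.ps", "document.pdf"  # same document, both backends
    gz_size = len(gzip.compress(open(ps, "rb").read()))
    print(f"PostScript: {os.path.getsize(ps)} bytes, "
          f"gzipped: {gz_size} bytes, PDF: {os.path.getsize(pdf)} bytes")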

PDF is a much more self-contained format with a high level of functionality, and it has better tooling. Unless you specifically need PostScript, stick to PDF.

Avoid PDF like the plague. Adobe invented PDF and pushed it to consumers to make more money from suckers who believed all the hype Adobe told its users. PDF is a bloated format that requires a slow and non-free reader to read and process correctly. Most free readers do not support 100% of Adobe's features, and likely support a subset of features that are also found in PostScript. For instance, ReportLab does not support 100% of PDF features.
Historically, the fake technical arguments for using PDF have been:
No loops in PDF, which helps processing. False: other loop-free formats such as XML still have memory and processing issues.
More fully featured. False: PostScript is more powerful and can do what PDF does with fewer features.
PostScript has to load all of the pages because it is a language. This is of course not true, as C, C++, Java and many other languages can load code at runtime.
PostScript is missing feature X. True, but mostly because Adobe invented a new format to make money, not because feature X cannot be added to PostScript.
The real reason to use PDF instead of PostScript is that PDF readers are more common than PostScript readers.

Related

quickly inspect OCR text layer on PDF file

Is there any program that will allow me to superimpose the text (OCR) layer of a PDF on top of the PDF rendering?
I want to quickly see if the text layer has errors or not.
It would be most convenient if that could be done with a program; if not, some CLI command or script would also work.
Superimpose? That implies you'd like to add text, while I believe you'd like access to the text for inspection and possibly further analysis of the OCR quality. Perhaps this needs further clarification.
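
One scriptable way to get that "superimposed" view is to draw a box around every word in the text layer and then page through the result next to the original. A sketch assuming PyMuPDF (my library choice; file names are placeholders):

    import fitz  # PyMuPDF (pip install pymupdf)

    doc = fitz.open("scanned.pdf")
    for page in doc:
        # get_text("words") yields (x0, y0, x1, y1, word, ...) per word
        for x0, y0, x1, y1, word, *_ in page.get_text("words"):
            page.draw_rect(fitz.Rect(x0, y0, x1, y1),
                           color=(1, 0, 0), width=0.5)
    doc.save("scanned_overlay.pdf")

Opening the saved file shows the OCR word boxes on top of the page image, so misplaced or missing words stand out immediately.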
Our developers worked for some time on algorithms to detect the presence of text in PDFs and then evaluate its quality. There are many cases that can trick a basic algorithm: a Bates number or imprinter stamp added to an image-only PDF makes it seem as if the PDF has high-quality text when it has no actual text. Some copiers produce "searchable PDFs" using very low-quality OCR that contains many errors, but not necessarily on the first page, which is typically some kind of title page with large fonts, so the first line of text an algorithm encounters seems high quality. Or the first page may have text while other pages do not, yet the algorithm may believe the whole PDF has text.
In our commercial high-volume server-based OCR software (used by service bureaus, SaaS platforms, libraries, backlog conversions, etc.) we now have advanced detection of PDFs with an existing text layer, and "smart decisions" that can filter out many of these false-positive situations, so our OCR can skip re-OCRing PDFs that already contain high-quality text. If you are looking for a high-quality, inexpensive OCR platform, such detection is a feature of it, but it cannot be used separately from our OCR, since the OCR workflow is part of that filter. Our developers wrote and integrated these algorithms without external tools.
I am with www.wisetrend.com where we provide software solutions and consulting for various OCR projects.

Searching text inside AFP files

I've been asked to convert files from PDF to AFP, and I've managed it using the IBM AFP printer driver. I was wondering if there's a way to search inside the AFP file. I know I can do it on the PDF file, but I've been asked to cross-check the converted files by searching inside them.
Is there a reason a 370 KB PDF file is converted to an 11.5 MB AFP file? Is it converted as an image? (This would explain why I couldn't search inside it.)
C is the best option for searching for a string in AFP PTX records. However, it depends on how you are converting your PDF to AFP: if you use the IBM print drivers, the text will be rasterized, so you will not be able to search it.
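
To illustrate the idea without committing to C, here is a rough Python sketch that walks the MO:DCA structured fields and dumps PTX (Presentation Text Data, identifier X'D3EE9B') payloads. The leading 0x5A carriage-control byte and the cp500 (EBCDIC) code page are assumptions that may not hold for your files, and real PTX payloads wrap text in control sequences, so expect noise around the strings:

    PTX_ID = b"\xd3\xee\x9b"  # Presentation Text Data

    def iter_fields(data):
        # Each record: 0x5A, then 2-byte length, 3-byte identifier,
        # flag byte, 2-byte sequence number, payload.
        pos = 0
        while pos < len(data) and data[pos] == 0x5A:
            length = int.from_bytes(data[pos + 1:pos + 3], "big")
            field = data[pos + 1:pos + 1 + length]
            yield field[2:5], field[8:]
            pos += 1 + length

    with open("document.afp", "rb") as f:
        afp = f.read()
    for sf_id, payload in iter_fields(afp):
        if sf_id == PTX_ID:
            print(payload.decode("cp500", errors="replace"))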
AFP Explorer is one of the best freeware tools if this is a one-time request.
http://www.compulsivecode.com/project_afpexplorer.aspx
We use COMPART CPMCOPY and CPMILL to convert POS and PDF files into AFP, where you have MFF filters to get the required output. However, it is a licensed product.
The IBM AFP printer driver can be configured, to some extent. Check the manual page "Creating AFP Resources Using the IBM AFP Printer Drivers" for further details.
Make sure that "Print Text as Graphics" is turned off.
Some AFP viewers can search for text within AFP files. Consider BTB Viewer (warning: it looks ridiculously outdated).
If you wish to develop your own solution, consider that, in general, searching for text in AFP documents is complicated, since each "logical" text block can be split into a series of MO:DCA text instructions, each positioned individually, and it is not guaranteed that these instructions will be sequential. So expect problems when searching for multi-word strings.
"Conversion" PDF to AFP is a generic term. It depends on what software you used to convert, and what settings were used for conversion. For instance, consider embedded images. Since many AFP devices do not support JPEG compression for I:OCA, the conversion app may convert raster images to raw 24-bit bitmap which is ridiculously ineffective in terms of file size; an innocent background image of 1000×1000 px would take a whopping 3Mb of file size (while the original JPEG stream can be tens kbytes).

PDF tools to analyze PDF attributes

Are there any PDF tools that generate information about the loading time and memory usage needed to display a PDF in a browser, and also the total number of elements inside the PDF?
Unfortunately not really. I've done some of this research, not for PDF in a browser but (and perhaps this is what you are looking at as well) PDF on mobile devices.
There are a number of factors that contribute and that to some extent can be tested for:
Whether or not big images exist in the PDF and what resolution they are. This is linked directly to memory usage.
What compression method is used for image compression. Decompressing JPEG-2000 images specifically can increase load time significantly. Even worse, as JPEG-2000 can be progressively decompressed, it can give the appearance of a really bad PDF until the images have been fully decompressed and loaded (this is especially ugly on somewhat older tablets, for example).
How complex the transparency effects are that are used in the document.
How many fonts are used in the document.
How many line-art objects (vector elements) with a large number of nodes (points) are used on a page.
You can test what is in the document using Acrobat Pro to some extent (there is a well-hidden tool, reached when you save an optimised PDF file, that can audit how much of the space in a PDF document each kind of object uses). You can also use a preflight solution such as pdfToolbox from callas (I'm affiliated with this company) or PitStop from Enfocus; these tools let you get a report with the results of custom checks on image resolution, compression, vector objects, color spaces, etc.
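
If you want a scripted starting point for that kind of audit, here is a rough sketch with PyMuPDF (my library choice, not something the tools above require) that counts fonts and flags large or JPEG-2000 images:

    import fitz  # PyMuPDF (pip install pymupdf)

    doc = fitz.open("input.pdf")
    fonts = set()
    for page in doc:
        fonts.update(f[3] for f in page.get_fonts())  # base font names
        for img in page.get_images(full=True):
            width, height, filt = img[2], img[3], img[8]
            note = " (JPEG-2000, slow to decode)" if filt == "JPXDecode" else ""
            print(f"page {page.number + 1}: {width}x{height}px image, "
                  f"filter={filt or 'uncompressed'}{note}")
    print(f"{len(fonts)} distinct fonts: {sorted(fonts)}")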

Are all PDF files compressed?

So there are some threads here on PDF compression saying that there is some, but not a lot of, gain in compressing PDFs as PDFs are already compressed.
My question is: is this true for all PDFs, including older versions of the format?
Also, I'm sure it's possible for someone (an idiot, maybe) to place bitmaps into a PDF rather than JPEGs etc. Our company has a lot of PDFs in its DBs (some in older formats, maybe). We are considering using gzip to compress during transmission, but don't know if it's worth the hassle.
PDFs in general use internal compression for the objects they contain. But this compression is by no means compulsory according to the file format specifications. All (or some) objects may appear completely uncompressed, and they would still make a valid PDF.
There are commandline tools out there which are able to decompress most (if not all) of the internal object streams (even of the most modern versions of PDFs) -- and the new, uncompressed version of the file will render exactly the same on screen or on paper (if printed).
So to answer your question: No, you cannot assume that a gzip compression is adding only hassle and no benefit. You have to test it with a representative sample set of your files. Just gzip them and take note of the time used and of the space saved.
It also depends on the type of PDF-producing software that was used...
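
A throwaway harness for that test could look like this (the samples/ path is a placeholder for your representative set):

    import glob, gzip, os, time

    for path in glob.glob("samples/*.pdf"):
        with open(path, "rb") as f:
            raw = f.read()
        t0 = time.perf_counter()
        packed = gzip.compress(raw)
        elapsed = time.perf_counter() - t0
        saved = 1 - len(packed) / len(raw)
        print(f"{os.path.basename(path)}: {len(raw)} -> {len(packed)} bytes "
              f"({saved:.0%} saved, {elapsed:.2f}s)")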
Instead of applying gzip compression, you would get much better gain by using PDF utilities to apply compression to the contents within the format as well as remove things like unneeded embedded fonts. Such utilities can downsample images and apply the proper image compression, which would be far more effective than gzip. JBIG2 can be applied to bilevel images and is remarkably effective, and JPEG can be applied to natural images with the quality level selected to suit your needs. In Acrobat Pro, you can use Advanced -> PDF Optimizer to see where space is used and selectively attack those consumers. There is also a generic Document -> Reduce File Size to automatically apply these reductions.
Update:
Ika's answer has a link to a PDF optimization utility that can be used from Java. You can look at their sample Java code there. That code lists exactly the things I mentioned:
Remove duplicated fonts, images, ICC profiles, and any other data stream.
Optionally convert high-quality or print-ready PDF files to small, efficient and web-ready PDF.
Optionally down-sample large images to a given resolution.
Optionally compress or recompress PDF images using JBIG2 and JPEG2000 compression formats.
Compress uncompressed streams and remove unused PDF objects.
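
The last item can be reproduced with open-source tooling as well; a sketch using pikepdf (my choice of library, not the utility linked above), which recompresses streams but does not touch images:

    import pikepdf

    with pikepdf.open("input.pdf") as pdf:
        pdf.save(
            "optimized.pdf",
            compress_streams=True,   # compress uncompressed streams
            recompress_flate=True,   # redo existing Flate compression
            object_stream_mode=pikepdf.ObjectStreamMode.generate,
        )
    # Unreferenced objects are not written out on save.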

which version of PDF for general use/distribution?

I do a lot of quick-and-dirty PDF creation of long documents (100+ pages), for distribution to clients; my clients are often individuals, but sometimes corporate managers at banks and insurance companies.
Acrobat Pro allows you to save in many versions of PDF, from Acrobat 4 - Acrobat 10. Which should I use, as a general rule?
I don't often use advanced features in my documents: usually pictures and text. Since I send via email, I want the best compression possible; my documents often have lots of images. However, since my clients are banks and the like, not cutting-edge technologists, I don't think they have the most recent Acrobat/PDF reader installed.
What is the best PDF version, as a compromise between document compression and widespread adoption?
I recommend PDF 1.4 (Acrobat 5). The PDF/A-1 (PDF for archiving) standard is also based on PDF 1.4.