Searching text inside AFP files - pdf

I've been asked to convert files from PDF to AFP and I've managed it using the IBM afp printer's driver. I was wondering if there's a way to search inside the afp file . I know I can do it on the pdf file but I've been asked to crosscheck the converted files searching inside it.
Is there a reason since a pdf file of 370kb is converted to a 11.5Mb afp file ? is it converted as an image ? (this would clarify why I couldn't search inside it)

C is the best option to you to search a string in AFP PTX records. However it depends on how are you converting your PDF to AFP. If you use IBM print dirvers it will rasterize the text. So, you'll be not able to search.
AFP Explorer is one of the best freeware tool if your request is one time.
http://www.compulsivecode.com/project_afpexplorer.aspx
We use COMPART CPMCOPY and CPMILL to convert POS and PDF files into AFP. where you will have MFF filters to get the required output. However it is licensed product.

IBM AFP printer driver can be configured, to some extent. Check this manual page: Creating AFP Resources Using the IBM AFP Printer Drivers for further details.
Make sure that "Print Text as Graphics" is turned off.
Some AFP viewers have the feature of text search within AFP files. Consider BTB Viewer (warning, it looks ridiculously outdated).
If you wish to develop your own solution, consider that in general, searching for text in AFP documents is complicated since each "logical" text block can be split into a series of MO:DCA text instructions, each positioned individually. And it is not for granted that these instructions will be sequential. So expect for problems searching for multi-word strings.
"Conversion" PDF to AFP is a generic term. It depends on what software you used to convert, and what settings were used for conversion. For instance, consider embedded images. Since many AFP devices do not support JPEG compression for I:OCA, the conversion app may convert raster images to raw 24-bit bitmap which is ridiculously ineffective in terms of file size; an innocent background image of 1000×1000 px would take a whopping 3Mb of file size (while the original JPEG stream can be tens kbytes).

Related

How to convert scanned document images to a PDF document with high compression?

I need to convert scanned document images to a PDF document with high compression. Compression ratio is very important. Can someone recommend any solution on C# for this task?
Best regards, Alexander
There is a free program called PDFBeads that can do it. It requires Ruby, ImageMagick and optionally jbig2enc.
The PDF format itself will probably add next to no overhead in your case. I mean your images will account for most of the output file size.
So, you should compress your images with highest possible compression. For black-and-white images you might get smallest output using FAX4 or JBIG2 compression schemes (both supported in PDF files).
For other images (grayscale, color) either use smallest possible size, lowest resolution and quality, or convert images to black-and-white and use FAX4/JBIG2 compression scheme.
Please note, that most probably you will lose some detail of any image while converting to black-and-white.
If you are looking for a library that can help you with recompression then have a look at Docotic.Pdf library (Disclaimer: I am one of developers of the library).
The Optimize images sample code shows how to recompress images before adding them to PDF. The sample shows how to recompress with JPEG, but for FAX4 the code will be almost the same.

Pdf tools to analyze pdf attributes

Is there any pdf tools that generate information regarding the loading time and memory usage to display pdf in browser, and also total element inside the pdf?
Unfortunately not really. I've done some of this research, not for PDF in a browser but (and perhaps this is what you are looking at as well) PDF on mobile devices.
There are a number of factors that contribute and that to some extent can be tested for:
Whether or not big images exist in the PDF and what resolution they are. This is linked directly to memory usage.
What compression method is used for image compression. Decompressing JPEG-2000 images specifically can increase load time significantly. Even worse, as JPEG-2000 can be progressively decompressed, it can give the appearance of a really bad PDF until the images has been fully decompressed and loaded (this is ugly specifically on somewhat older tablets for example).
How complex the transparency effects are that are used in the document.
How many fonts are used in the document.
How many line-art objects (vector elements) with a large number of nodes (points) are used on a page.
You can test what is in the document using Acrobat Pro to some extent (there is a well-hidden tool when you save an optimised PDF file that can audit what objects use how much of the space in a PDF document). You can also use a preflight solution such as pdfToolbox from callas (I'm affiliated with this company) or pitstop from enfocus; these tools would allow you to get a report with the results of custom checks such as image resolution, compression, vector objects, color spaces etc.

Reducing the size of pdf generated from software using proprietary fonts

I am trying to bring an Indian Magazine online. This magazine is typed in CorelDraw using the proprietary Devenagari font (http://www.modular-infotech.com/html/shreelipi.html). So these guys have provided a USB dongle that you have to have attached to the machine when you want to access the fonts, and this software has been in use for past 10 years.
To put the magazine online, we've tried to convert it to pdf (by printing). The resultant pdf size is of the order of 30-50MB, even when the pdf does not have even a single image. I am guessing it converts the whole text into an image
It would be really difficult for users to read this magazine given its size. Though when I convert it to .swf format (for add flipbook kind of functionality) - the size reduces to 5-6MB. But there are people who like to download the magazine and then read. I have had no luck reducing the size of pdf.
I have done lot of research on web. The postscript, primo pdf do not help much. The best I could get was 30% reduction using DocuCom pdf printer. But it is still 20MB. I have tried to play with resolution, compression and quality but the best I could get was 18MB.
Ideally I would like to reduce it to less than 2MB.
I would be really grateful if you could help me reduce the size of the pdf! Considering that it has no images, I am hopeful that I can get some really good compression.
The (35MB) magazine can be downloaded from: http://merajhola.in/jin-march.pdf
I can't see any easy way to reduce the size of this PDF. There are no embedded fonts and all the text is drawn using vector graphics primitives. No amount of tweaking the resolution, compression and quality will have a significant improvement.
One possible option would be to embed the font as a subset rather than use vector graphics. That will almost certainly make a big difference, however I doubt the proprietary font license will allow it.
I'm sorry, but this Shree-Lipi thing just sounds wrong in 2012. It would be much better to use proper OpenType fonts with modern (say InDesign) or free (say LuaTeX) software.

Is it possible to extract tiff files from PDFs without external libraries?

I was able to use Ned Batchelder's python code, which I converted to C++, to extract jpgs from pdf files. I'm wondering if the same technique can be used to extract tiff files and if so, does anyone know the appropriate offsets and markers to find them?
Thanks,
David
PDF files may contain different image data (not surprisingly).
Most common cases are:
Fax data (CCITT Group 3 and 4)
raw raster data with decoding parameters and optional palette all compressed with Deflate or LZW compression
JPEG data
Recently, I (as developer of a PDF library) start noticing more and more PDFs with JBIG2 image data. Also, JPEG2000 sometimes can be put into a PDF.
I should say, that you probably can extract JPEG/JBIG2/JPEG2000 data into corresponding *.jpeg / *.jp2 / *.jpx files without external libraries but be prepared for all kinds of weird PDFs emitted by broken generators. Also, PDFs quite often use object streams so you'll need to implement sophisticated parser for PDF.
Fax data (i.e. what you probably call TIFF) should be at least packed into a valid TIFF. You can borrow some code for that from open source libtiff for example.
And then comes raw raster data. I don't think that it makes sense to try to extract such data without help of a library. You could do that, of course, but it will take months of work.
So, if you are trying to extract only specific kind of image data from a set of PDFs all created with the same generator, then your task is probably feasible. In all other cases I would recommend to save time, money and hair and use a library for the task.
PDF files store Jpegs as actual JPEGS (DCT and JPX encoding) so in most cases you can rip the data out. With Tiffs, you are looking for CCITT data (but you will need to add a header to the data to make it a Tiff). I wrote 2 blog articles on images in PDF files at http://www.jpedal.org/PDFblog/2010/09/understanding-the-pdf-file-format-images/ and http://www.jpedal.org/PDFblog/2011/07/extract-raw-jpeg-images-from-a-pdf-file/ which might help.

How to optimize PDF file size?

I have an input PDF file (usually, but not always generated by pdfTeX), which I want to convert to an output PDF, which is visually equivalent (no matter the resolution), it has the same metadata (Unicode text info, hyperlinks, outlines etc.), but the file size is as small as possible.
I know about the following methods:
java -cp Multivalent.jar tool.pdf.Compress input.pdf (from http://multivalent.sourceforge.net/). This recompresses all streams, removes unused objects, unifies equivalent objects, compresses whitespace, removes default values, compresses the cross-reference table.
Recompressing suitable images with jbig2 and PNGOUT.
Re-encoding Type1 fonts as CFF fonts.
Unifying equivalent images.
Unifying subsets of the same font to a bigger subset.
Remove fillable forms.
When distilling or otherwise converting (e.g. gs -sDEVICE=pdfwrite), make sure it doesn't degrade image quality, and doesn't increase (!) the image sizes.
I know about the following techniques, but they don't apply in my case, since I already have a PDF:
Use smaller and/or less fonts.
Use vector images instead bitmap images.
Do you have any other ideas how to optimize PDF?
Optimize PDF Files
Avoid Refried Graphics
For graphics that must be inserted as bitmaps, prepare them for maximum compressibility and minimum dimensions. Use the best quality images that you can at the output resolution of the PDF. Inserting compressed JPEGs into PDFs and Distilling them may recompress JPEGs, which can create noticeable artifacts. Use black and white images and text instead of color images to allow the use of the newer JBIG2 standard that excels in monochromatic compression. Be sure to turn off thumbnails when saving PDFs for the Web.
Use Vector Graphics
Use vector-based graphics wherever possible for images that would normally be made into GIFs. Vector images scale perfectly, look marvelous, and their mathematical formulas usually take up less space than bitmapped graphics that describe every pixel (although there are some cases where bitmap graphics are actually smaller than vector graphics). You can also compress vector image data using ZIP compression, which is built into the PDF format. Acrobat Reader version 5 and 6 also support the SVG standard.
Minimize Fonts
How you use fonts, especially in smaller PDFs, can have a significant impact on file size. Minimize the number of fonts you use in your documents to minimize their impact on file size. Each additional fully embedded font can easily take 40K in file size, which is why most authors create "subsetted" fonts that only include the glyphs actually used.
Flatten Fat Forms
Acrobat forms can take up a lot of space in your PDFs. New in Acrobat 8 Pro you can flatten form fields in the Advanced -> PDF Optimizer -> Discard Objects dialog. Flattening forms makes form fields unusable and form data is merged with the page. You can also use PDF Enhancer from Apago to reduce forms by 50% by removing information present in the file but never actually used. You can also combine a refried PDF with the old form pages to create a hybrid PDF in Acrobat (see "Refried PDF" section below).
see article
From PDF specification version 1.5 there are two new methods of compression, object streams and cross reference streams.
You mention that the Multivalent.jar compress tool compresses the cross reference table. This usually means the cross reference table is converted into a stream and then compressed.
The format of this cross reference stream is not fixed. You can change the bit size of the three "columns" of data. It's also possible to pre-process the stream data using a predictor function which will improve the compression level of the data. If you look inside the PDF with a text editor you might be able to find the /Predictor entry in the cross reference stream dictionary to check whether the tool you're using is taking advantage of this feature.
Using a predictor on the compression might be handy for images too.
The second type of compression offered is the use of object streams.
Often in a PDF you have many similar objects. These can now be combined into a single object and then compressed. The documentation for the Multivalent Compress tool mentions that object streams are used but doesn't have many details on the actual choice of which objects to group together. The compression will be better if you group similar objects together into an object stream.