How to create a Lucene index where the documents are scanned images, among other things?

My database stores resumes in a blob field. Resumes may be Microsoft Word, PDF, or images (.jpg, etc.). How can we create a Lucene index from these disparate file types, especially .jpg files? Can Tika understand scanned images?

When extracting from images, it is also possible to chain in Tesseract, via the TesseractOCRParser, to have OCR performed on the contents of the image.
See the Apache Tika documentation on image formats: https://tika.apache.org/1.20/formats.html#Image_formats
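As a rough illustration of how this fits together, here is a minimal Java sketch that feeds Tika's extracted text (including OCR output from TesseractOCRParser when the tesseract binary is installed and on the PATH) into a Lucene index. It assumes a recent Tika 1.x and Lucene 5+; the openBlob helper, the index path, and the field names "id"/"content" are placeholders you would replace with your own database and schema code.

    import java.io.InputStream;
    import java.nio.file.Paths;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.tika.Tika;

    public class ResumeIndexer {

        public static void main(String[] args) throws Exception {
            // Tika auto-detects the type (Word, PDF, JPEG, ...) and picks a parser.
            // For images it can delegate to TesseractOCRParser if Tesseract is installed.
            Tika tika = new Tika();

            try (IndexWriter writer = new IndexWriter(
                    FSDirectory.open(Paths.get("resume-index")),
                    new IndexWriterConfig(new StandardAnalyzer()))) {

                // resumeId and blobStream would come from your database rows.
                String resumeId = "42";
                try (InputStream blobStream = openBlob(resumeId)) {
                    String text = tika.parseToString(blobStream);

                    Document doc = new Document();
                    doc.add(new StringField("id", resumeId, Field.Store.YES));
                    doc.add(new TextField("content", text, Field.Store.NO));
                    writer.addDocument(doc);
                }
            }
        }

        // Placeholder for whatever JDBC code reads the blob column.
        private static InputStream openBlob(String resumeId) {
            throw new UnsupportedOperationException("fetch blob from your database");
        }
    }

Searching the "content" field afterwards works the same whether the text came from a Word document, a PDF, or an OCRed scan.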

Related

How to directly stream large content to PDF with minimal memory footprint?

I am trying to stream large amounts (say 200 MB) of formatted data to PDF with a minimal memory footprint (say 20 MB per client/thread). The PDF format is derived from Adobe PostScript, and writing raw PDF syntax directly is complex. I have been using the following APIs to stream content to PDF.
Jasper Reports
iText
The problem I am facing with Jasper Reports is that it needs all the input data in memory and only supports an OutputStream. Jasper Reports does have a method that accepts an InputStream of data, but behind the scenes it loads the entire InputStream into memory, effectively exhausting it.
The problem with iText is that it is commercial. I am now looking to write my own Java API to stream formatted data, including tables and images, directly to PDF. I have referred to the following books to understand the PDF structure:
PDF Structure by Adobe
PDF Explained (O'Reilly)
The above books cover only basic PDF formatting such as text and 2D graphics. How do I draw tables, icons, and all of the other formatting that I can generate with HTML/CSS into the PDF?
I need some pointers on understanding the PDF structure in depth. Or, is there already a Java API which supports direct streaming of input content to PDF without holding the entire data in memory?
Note: Headless browsers (PhantomJS, wkhtmltopdf), Apache FOP, and Apache PDFBox render PDFs by loading the entire data set into memory.
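The asker rules out iText on licensing grounds, but for reference, here is a rough Java sketch of the incremental pattern iText 5 supports: completed pages are flushed to the OutputStream as they are finished, and the deleteBodyRows/setSkipFirstHeader idiom keeps a large table from accumulating in memory. The row counts and flush interval are arbitrary illustrations, not tuned values.

    import java.io.FileOutputStream;

    import com.itextpdf.text.Document;
    import com.itextpdf.text.pdf.PdfPTable;
    import com.itextpdf.text.pdf.PdfWriter;

    public class StreamingPdfSketch {
        public static void main(String[] args) throws Exception {
            Document document = new Document();
            // Pages are written to the OutputStream as they are completed,
            // so only the current page needs to stay in memory.
            PdfWriter.getInstance(document, new FileOutputStream("large.pdf"));
            document.open();

            PdfPTable table = new PdfPTable(3);
            table.setHeaderRows(1);
            table.addCell("Col A"); table.addCell("Col B"); table.addCell("Col C");

            for (long row = 0; row < 1_000_000; row++) {
                table.addCell("a" + row);
                table.addCell("b" + row);
                table.addCell("c" + row);
                // Periodically push the completed rows out to the writer so the
                // table object itself does not keep growing in memory.
                if (row % 10_000 == 0) {
                    document.add(table);
                    table.deleteBodyRows();
                    table.setSkipFirstHeader(true);
                }
            }
            document.add(table);
            document.close();
        }
    }

The same page-at-a-time idea applies to any library that writes objects to the output as soon as a page is closed.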

Minimizing IO and Memory Usage When Extracting Pages from a PDF

I am working on a cloud-hosted web application that needs to serve up extracted pages from a library of larger PDFs. For example, 5 pages from a 50,000-page PDF that is > 1 GB in size.
To facilitate this, I am using iTextSharp to extract page ranges from the large PDFs using the advised approach found in this blog article.
The trouble I am running into is that during testing, I have found that the PdfReader is reading the entire source PDF in order to extract the few pages I need. I know enough about PDF structure to be dangerous, and I know that resources can be spread around such that random read access all over the file is going to be expected, but I was hoping to avoid the need to read ALL the file content.
I even found several mentions of RandomAccessFileOrArray being the silver bullet for high memory usage when opening large PDFs, but alas, even when I use that, the source PDF is still being read in its entirety.
Is there a more efficient method (using iText or otherwise) to access just the content I need from the source PDF in order to extract a few pages?
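For context, a minimal Java sketch of the usual iText 5 approach (the .NET iTextSharp API is essentially the same): open the reader through RandomAccessFileOrArray, which is documented as reading only the cross-reference table up front, select the wanted pages, and copy them out with PdfStamper. The file names and page range are placeholders, and, as the asker observes, scattered resources can still force reads across much of the file.

    import java.io.FileOutputStream;

    import com.itextpdf.text.pdf.PdfReader;
    import com.itextpdf.text.pdf.PdfStamper;
    import com.itextpdf.text.pdf.RandomAccessFileOrArray;

    public class PageExtractor {
        public static void main(String[] args) throws Exception {
            // This constructor variant parses the xref first and loads
            // individual objects lazily as they are needed.
            PdfReader reader = new PdfReader(
                    new RandomAccessFileOrArray("huge-source.pdf"), null);

            reader.selectPages("1-5"); // keep only the wanted page range
            PdfStamper stamper = new PdfStamper(
                    reader, new FileOutputStream("extract.pdf"));
            stamper.close();
            reader.close();
        }
    }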

Import/embed XML OCR/text info from one PDF into a different PDF

I am trying to optimize the quality/file size of an image-scanned PDF while retaining OCR quality.
I could downsample the high-quality PDF document after OCR, but the tools I'm using (primarily Acrobat) do not produce file sizes as small as exporting lower-DPI, optimized pages from Photoshop and building a PDF from those pages.
A better solution, if possible, would be to take an image PDF (800 MB in the current case) that has already been OCRed and apply its OCR layer to a lower-resolution, downsampled document.
I can successfully extract the OCR info with coordinates as XML using pdfminer, but I would like to take this and apply it to the same file after it has been downsampled in Photoshop. I thought I read that this was possible with pdftk, but I can no longer find that information.
Any suggestions would be greatly appreciated.
jack
Can you describe the current way you create your PDFs?
With iText it's possible to set the compression level of the images you add.
May be useful
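One programmatic alternative (an assumption, not something the asker's tools necessarily support) is to stamp an invisible text layer onto the downsampled pages with Apache PDFBox, using the words and coordinates pulled from the pdfminer XML. A minimal PDFBox 2.x sketch follows; the file names and the hard-coded word/coordinates are placeholders, and coordinates would need rescaling if the downsampled pages have different dimensions than the originals.

    import java.io.File;

    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.pdmodel.PDPageContentStream;
    import org.apache.pdfbox.pdmodel.PDPageContentStream.AppendMode;
    import org.apache.pdfbox.pdmodel.font.PDType1Font;
    import org.apache.pdfbox.pdmodel.graphics.state.RenderingMode;

    public class OcrLayerStamper {
        public static void main(String[] args) throws Exception {
            try (PDDocument doc = PDDocument.load(new File("downsampled.pdf"))) {
                PDPage page = doc.getPage(0);

                // APPEND keeps the existing page image; the text is drawn on top,
                // but with rendering mode NEITHER it stays invisible (searchable only).
                try (PDPageContentStream cs = new PDPageContentStream(
                        doc, page, AppendMode.APPEND, true, true)) {
                    cs.beginText();
                    cs.setFont(PDType1Font.HELVETICA, 10);
                    cs.setRenderingMode(RenderingMode.NEITHER);
                    // Example word and coordinates; in practice these come from
                    // the pdfminer XML (x/y in points from the bottom-left corner).
                    cs.newLineAtOffset(72, 700);
                    cs.showText("Invoice");
                    cs.endText();
                }
                doc.save("downsampled-with-text-layer.pdf");
            }
        }
    }

In practice you would loop over every word on every page of the pdfminer output and repeat the beginText/showText/endText sequence at each coordinate.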

Is it possible to extract tiff files from PDFs without external libraries?

I was able to use Ned Batchelder's Python code, which I converted to C++, to extract JPEGs from PDF files. I'm wondering if the same technique can be used to extract TIFF files and, if so, does anyone know the appropriate offsets and markers to find them?
Thanks,
David
PDF files may contain different image data (not surprisingly).
Most common cases are:
Fax data (CCITT Group 3 and 4)
raw raster data with decoding parameters and optional palette all compressed with Deflate or LZW compression
JPEG data
Recently, I (as a developer of a PDF library) have started noticing more and more PDFs with JBIG2 image data. JPEG 2000 can also sometimes be put into a PDF.
I should say that you can probably extract JPEG/JBIG2/JPEG2000 data into corresponding *.jpeg / *.jp2 / *.jpx files without external libraries, but be prepared for all kinds of weird PDFs emitted by broken generators. Also, PDFs quite often use object streams, so you'll need to implement a fairly sophisticated PDF parser.
Fax data (i.e., what you probably call TIFF) needs at least to be wrapped in a valid TIFF header. You could borrow code for that from the open-source libtiff, for example.
And then comes raw raster data. I don't think it makes sense to try to extract such data without the help of a library. You could do it, of course, but it would take months of work.
So, if you are trying to extract only a specific kind of image data from a set of PDFs all created by the same generator, your task is probably feasible. In all other cases I would recommend saving time, money, and hair, and using a library for the task.
PDF files store JPEGs as actual JPEG data (DCT and JPX encoding), so in most cases you can rip the data out directly. For TIFFs, you are looking for CCITT data (but you will need to add a header to the data to make it a TIFF). I wrote two blog articles on images in PDF files at http://www.jpedal.org/PDFblog/2010/09/understanding-the-pdf-file-format-images/ and http://www.jpedal.org/PDFblog/2011/07/extract-raw-jpeg-images-from-a-pdf-file/ which might help.
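To make the "rip the JPEG data out" idea concrete, here is a brute-force Java sketch that scans the raw file bytes for JPEG SOI/EOI markers and dumps each candidate range to disk. It only works when the JPEG streams sit unfiltered at the top level of the file (no object streams), it can produce false positives, and it does nothing for CCITT or raw raster images, which, as noted above, need a TIFF header or a real library. The input file name is a placeholder.

    import java.io.FileOutputStream;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    public class RawJpegRipper {
        public static void main(String[] args) throws Exception {
            byte[] pdf = Files.readAllBytes(Paths.get("input.pdf"));
            int count = 0;

            for (int i = 0; i + 2 < pdf.length; i++) {
                // JPEG streams start with the SOI marker FF D8 FF ...
                if ((pdf[i] & 0xFF) == 0xFF && (pdf[i + 1] & 0xFF) == 0xD8
                        && (pdf[i + 2] & 0xFF) == 0xFF) {
                    int end = -1;
                    for (int j = i + 2; j + 1 < pdf.length; j++) {
                        // ... and end with the EOI marker FF D9.
                        if ((pdf[j] & 0xFF) == 0xFF && (pdf[j + 1] & 0xFF) == 0xD9) {
                            end = j + 2;
                            break;
                        }
                    }
                    if (end > 0) {
                        try (FileOutputStream out =
                                new FileOutputStream("image-" + (count++) + ".jpg")) {
                            out.write(pdf, i, end - i);
                        }
                        i = end;
                    }
                }
            }
            System.out.println("Wrote " + count + " candidate JPEG files");
        }
    }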

How to optimize PDF file size?

I have an input PDF file (usually, but not always, generated by pdfTeX), which I want to convert to an output PDF that is visually equivalent (at any resolution) and has the same metadata (Unicode text info, hyperlinks, outlines, etc.), but whose file size is as small as possible.
I know about the following methods:
java -cp Multivalent.jar tool.pdf.Compress input.pdf (from http://multivalent.sourceforge.net/). This recompresses all streams, removes unused objects, unifies equivalent objects, compresses whitespace, removes default values, compresses the cross-reference table.
Recompressing suitable images with jbig2 and PNGOUT.
Re-encoding Type1 fonts as CFF fonts.
Unifying equivalent images.
Unifying subsets of the same font to a bigger subset.
Remove fillable forms.
When distilling or otherwise converting (e.g. gs -sDEVICE=pdfwrite), make sure it doesn't degrade image quality, and doesn't increase (!) the image sizes.
I know about the following techniques, but they don't apply in my case, since I already have a PDF:
Use smaller and/or less fonts.
Use vector images instead of bitmap images.
Do you have any other ideas how to optimize PDF?
Optimize PDF Files
Avoid Refried Graphics
For graphics that must be inserted as bitmaps, prepare them for maximum compressibility and minimum dimensions. Use the best quality images that you can at the output resolution of the PDF. Inserting compressed JPEGs into PDFs and Distilling them may recompress JPEGs, which can create noticeable artifacts. Use black and white images and text instead of color images to allow the use of the newer JBIG2 standard that excels in monochromatic compression. Be sure to turn off thumbnails when saving PDFs for the Web.
Use Vector Graphics
Use vector-based graphics wherever possible for images that would normally be made into GIFs. Vector images scale perfectly, look marvelous, and their mathematical formulas usually take up less space than bitmapped graphics that describe every pixel (although there are some cases where bitmap graphics are actually smaller than vector graphics). You can also compress vector image data using ZIP compression, which is built into the PDF format. Acrobat Reader version 5 and 6 also support the SVG standard.
Minimize Fonts
How you use fonts, especially in smaller PDFs, can have a significant impact on file size. Minimize the number of fonts you use in your documents to minimize their impact on file size. Each additional fully embedded font can easily take 40K in file size, which is why most authors create "subsetted" fonts that only include the glyphs actually used.
Flatten Fat Forms
Acrobat forms can take up a lot of space in your PDFs. New in Acrobat 8 Professional, you can flatten form fields in the Advanced -> PDF Optimizer -> Discard Objects dialog. Flattening makes form fields unusable and merges the form data with the page. You can also use PDF Enhancer from Apago to reduce forms by 50% by removing information that is present in the file but never actually used. You can also combine a refried PDF with the old form pages to create a hybrid PDF in Acrobat (see the "Refried PDF" section of the linked article).
see article
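The "Remove fillable forms" / "Flatten Fat Forms" step can also be done programmatically rather than through Acrobat. A minimal sketch using Apache PDFBox 2.x, with placeholder file names and assuming the document actually contains an AcroForm:

    import java.io.File;

    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;

    public class FormFlattener {
        public static void main(String[] args) throws Exception {
            try (PDDocument doc = PDDocument.load(new File("with-forms.pdf"))) {
                PDAcroForm form = doc.getDocumentCatalog().getAcroForm();
                if (form != null) {
                    // Draws the current field values into the page content and
                    // removes the interactive form fields.
                    form.flatten();
                }
                doc.save("flattened.pdf");
            }
        }
    }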
Since PDF specification version 1.5 there are two new methods of compression: object streams and cross-reference streams.
You mention that the Multivalent.jar compress tool compresses the cross reference table. This usually means the cross reference table is converted into a stream and then compressed.
The format of this cross-reference stream is not fixed. You can change the byte widths of the three "columns" of data. It's also possible to pre-process the stream data using a predictor function, which improves how well the data compresses. If you look inside the PDF with a text editor, you might be able to find the /Predictor entry in the cross-reference stream dictionary to check whether the tool you're using takes advantage of this feature.
Using a predictor on the compression might be handy for images too.
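To see why a predictor helps: consecutive cross-reference rows differ only slightly (offsets grow monotonically), so applying the PNG "Up" filter (/Predictor 12) turns each row into its byte-wise difference from the row above, which is mostly zeros and deflates far better. A self-contained Java sketch with made-up rows and an assumed /W [1 4 2] layout (7-byte rows); the numbers are illustrative only:

    import java.io.ByteArrayOutputStream;
    import java.util.zip.Deflater;
    import java.util.zip.DeflaterOutputStream;

    public class XrefPredictorSketch {

        // Apply the PNG "Up" filter (PDF /Predictor 12): each byte becomes the
        // difference from the byte directly above it in the previous row, and
        // every row is prefixed with the filter-type byte 2.
        static byte[] applyUpPredictor(byte[] rows, int rowLength) {
            int rowCount = rows.length / rowLength;
            byte[] out = new byte[rowCount * (rowLength + 1)];
            byte[] previous = new byte[rowLength];
            int o = 0;
            for (int r = 0; r < rowCount; r++) {
                out[o++] = 2; // PNG filter type: Up
                for (int c = 0; c < rowLength; c++) {
                    byte current = rows[r * rowLength + c];
                    out[o++] = (byte) (current - previous[c]);
                    previous[c] = current;
                }
            }
            return out;
        }

        static byte[] deflate(byte[] data) throws Exception {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (DeflaterOutputStream z = new DeflaterOutputStream(
                    buffer, new Deflater(Deflater.BEST_COMPRESSION))) {
                z.write(data);
            }
            return buffer.toByteArray();
        }

        public static void main(String[] args) throws Exception {
            // Fake cross-reference rows with /W [1 4 2]: type (1 byte),
            // byte offset (4 bytes), generation/index (2 bytes).
            int rowLength = 7, rowCount = 10_000;
            byte[] rows = new byte[rowCount * rowLength];
            for (int r = 0; r < rowCount; r++) {
                int offset = 12_345 + r * 120;   // monotonically growing offsets
                rows[r * rowLength] = 1;         // type 1: in-use object
                rows[r * rowLength + 1] = (byte) (offset >>> 24);
                rows[r * rowLength + 2] = (byte) (offset >>> 16);
                rows[r * rowLength + 3] = (byte) (offset >>> 8);
                rows[r * rowLength + 4] = (byte) offset;
                // last two bytes (generation) stay 0
            }

            System.out.println("deflate only:      " + deflate(rows).length + " bytes");
            System.out.println("predictor+deflate: "
                    + deflate(applyUpPredictor(rows, rowLength)).length + " bytes");
        }
    }

Running it prints both compressed sizes so you can compare how much the predictor gains on this kind of slowly-changing tabular data.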
The second type of compression offered is the use of object streams.
Often in a PDF you have many similar objects. These can now be combined into a single object stream and then compressed. The documentation for the Multivalent Compress tool mentions that object streams are used, but it doesn't give many details on how it chooses which objects to group together. The compression will be better if you group similar objects together into an object stream.