Want to configure tika parsing to do OCR on PDFs only

I am trying to manipulate the Tika configuration file (using Tika Server) to exclude all documents except PDFs from OCR processing. I have tried a number of combinations, such as excluding OCR from the default parser while configuring the PDF parser to do inline processing, configuring the auto strategy, and excluding both PDF and Tesseract from the default parser. No luck. I ended up running two Tika instances, one with OCR configured and one without, and directing files based on extension to one or the other in my code. I am using the Python tika client. Is there a better way? More generally, is there a comprehensive guide to configuring parser parameters in Tika? Most of what I have seen has been fragmentary. Thank you.

Do you know about ocrStrategy?
pdfParserConfig.setOcrStrategy(ocrStrategy)
where ocrStrategy is a value of the OCR_STRATEGY enum on PDFParserConfig. You can set OCR_ONLY for PDFs and leave NO_OCR for other docs.
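A minimal sketch of that approach using the embedded Java API (assuming Tika 1.x; the class name here is just for illustration):

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

public class OcrPdfsOnly {
    public static String extract(Path file) throws Exception {
        AutoDetectParser parser = new AutoDetectParser();
        ParseContext context = new ParseContext();

        // OCR the rendered pages of PDFs only; other document types go
        // through their regular parsers and never hit this config.
        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.OCR_ONLY);
        context.set(PDFParserConfig.class, pdfConfig);

        BodyContentHandler handler = new BodyContentHandler(-1); // -1 = no write limit
        try (InputStream stream = Files.newInputStream(file)) {
            parser.parse(stream, handler, new Metadata(), context);
        }
        return handler.toString();
    }
}

With tika-server, the same ocrStrategy parameter should be settable on the PDFParser in tika-config.xml, and if standalone image files also need to skip OCR you can exclude the TesseractOCRParser from the DefaultParser there, which would avoid running two instances.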

Related

Conversion from PDF to TIFF file using XSLT

Is it possible to convert a PDF to a TIFF file using XSLT? Can someone point me to an article or code I can refer to regarding image conversion using XSLT?
Thanks!
No, it is not possible using just XSLT. XSLT is for transforming XML to other textual structures (usually XML, HTML, or plain text). Using XSL-FO, you can output a PDF from XML data, but that is a one-way process as far as XSL-FO is concerned. Apache FOP does support outputting to TIFF instead of PDF, but again this is a one-way process.
Assuming you could get a PDF-to-XML conversion working (a quick Google search suggests such libraries exist, but it's unclear what they'd actually provide), it would be possible to use XSLT to transform that XML into something Apache FOP could render into a TIFF file, but at that point you'd really be better off investigating a direct PDF-to-TIFF conversion library (perhaps with an OCR library).
Possible? Maybe (but likely not). The real question is why do you even want to try to create a TIFF file from a PDF file using XSLT?
You do not need XSLT.
You want a raster image processor like Ghostscript (or many others). It can convert PDF (and PostScript) to other image formats like TIFF.
http://ghostscript.com/doc/current/Devices.htm
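If you need to drive the conversion from code, here is a rough sketch in Java that shells out to the gs binary (assuming Ghostscript is installed and on the PATH; tiff24nc is one of its standard colour TIFF output devices):

import java.io.File;

public class PdfToTiff {
    public static void convert(File pdf, File tiff) throws Exception {
        // -r300 renders at 300 dpi; see the Devices page above for other TIFF devices
        Process gs = new ProcessBuilder(
                "gs", "-dNOPAUSE", "-dBATCH", "-dSAFER",
                "-sDEVICE=tiff24nc", "-r300",
                "-sOutputFile=" + tiff.getAbsolutePath(),
                pdf.getAbsolutePath())
                .inheritIO()
                .start();
        if (gs.waitFor() != 0) {
            throw new RuntimeException("Ghostscript exited with an error");
        }
    }
}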
On DataPower specifically, the only way to do it is to call a conversion service, e.g. aspose.com, or to create another service external to the DataPower box.
There might be some Node.js modules that could do it running in GatewayScript (GWS) (if you are on firmware 7+), but I believe they all depend on external binaries to function, and that won't work in GWS.

How to merge PDF files without external dependencies

In one of my applications I need to merge many single-page PDF documents into one document, where each of the original PDFs becomes a page. Although many PDF libraries exist for most languages, I would like to write this myself if it's not too hard.
Is it necessary to implement a full-fledged PDF parser in order to merge PDF documents? Where and what would I start to read to find out what is needed for the task?
You can use the Debenu QuickPDF Library Lite (free) version to do it. Here is a very good example of how to do it:
http://www.debenu.com/kb/merge-pdf-files-together-programmatically/
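If a single Java dependency turns out to be acceptable after all, Apache PDFBox (which Tika itself uses, as noted further down this page) ships a ready-made merger. A minimal sketch, assuming PDFBox 2.x:

import java.io.File;

import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.multipdf.PDFMergerUtility;

public class MergePdfs {
    public static void main(String[] args) throws Exception {
        PDFMergerUtility merger = new PDFMergerUtility();
        merger.setDestinationFileName("merged.pdf");
        for (String path : args) {
            merger.addSource(new File(path)); // one source per input PDF
        }
        // Merge entirely in memory; use a temp-file setting for very large inputs
        merger.mergeDocuments(MemoryUsageSetting.setupMainMemoryOnly());
    }
}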

How to detect image in a document

How can I detect images in a document, say .doc, .xls, .ppt, or .pdf?
I came across Apache Tika and am trying its command-line option.
http://tika.apache.org/1.2/gettingstarted.html
But I'm not quite sure how it will detect images.
Any help is appreciated.
Thanks
You've said you want to use a command line solution, and not write any Java code, so it's not going to be the prettiest way to do it... If you are happy to write a little bit of Java, and create a new program to call from Python, then you can do it much nicer!
The first thing to do is to have the Tika App extract out any embedded resources within your file. Use the --extract option for this, and have the extraction occur in a special temp directory your app controls, e.g.
$ java -jar tika.jar --extract ../testWORD_embedded_pdf.doc
Extracting 'image1.emf' (application/x-emf)
Extracting '_1402837031.pdf' (application/pdf)
Grab the output of the extraction if you can, and parse it looking for images (but be aware that some images have an application/ prefix on their canonical mimetype!). You might need to run a second --detect step on a few; I'm not sure, so test how the parsers get on with the extraction.
Now, if there were images, they'll be in your temp dir. Process them as you want. Finally, zap the temp dir when you're done with the file!
Having used Tika in the past, I couldn't see how Tika could help with images embedded within Office documents or PDFs; I was wrong to answer no (see the update below). You may still have to resort to the native APIs, Apache POI and Apache PDFBox. Tika uses both libraries to parse text and metadata, and using Tika makes these APIs automatically available (a side effect of using Tika).
UPDATE:
Since Tika 0.8, look for EmbeddedResourceHandler and the related examples (thanks to Gagravarr).
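A rough sketch of that route (assuming a recent Tika 1.x; the class name and the image-type test are illustrative):

import java.io.File;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import org.apache.tika.extractor.EmbeddedResourceHandler;
import org.apache.tika.extractor.ParserContainerExtractor;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.mime.MediaType;

public class EmbeddedImageLister {
    public static List<String> listImages(File file) throws Exception {
        List<String> images = new ArrayList<>();
        ParserContainerExtractor extractor = new ParserContainerExtractor();
        try (TikaInputStream stream = TikaInputStream.get(file.toPath())) {
            // Passing the extractor again as the second argument makes it
            // recurse into nested containers (e.g. a PDF inside a .doc).
            extractor.extract(stream, extractor, new EmbeddedResourceHandler() {
                @Override
                public void handle(String filename, MediaType type, InputStream resource) {
                    String t = type.toString();
                    // Some images have an application/ prefix on their
                    // canonical mimetype, e.g. application/x-emf.
                    if (t.startsWith("image/") || t.equals("application/x-emf")) {
                        images.add(filename + " (" + t + ")");
                    }
                }
            });
        }
        return images;
    }
}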

Suggestions on extracting text from uploaded documents

I currently have a number of documents uploaded to my website on a daily basis (.doc, .docx, .odt, .pdf), and these docs are stored in a SQL database (mediumblob).
Currently I open the docs from the database and cut and paste a text version into a field in the database for a quick reference and search function.
I'm looking to automate this "cut & paste" process (formatting isn't a real concern, just as long as I can extract the text) and was hoping that some people may be able to suggest a good route to go down.
I've tried manipulating the content of the blob field using regex but it is not really working.
I've been looking at Apache POI with a view to extracting the text at the point of upload, but I can't help thinking that this may be a bit of overkill given my relatively simple needs.
Given the various document formats I encounter and the current storing of the content in a blob field would Apache POI be the best solution to use in this instance or can anybody suggest an alternative?
Help and suggestions greatly appreciated.
Chris
Apache POI will only work for the Microsoft Office formats (.xls, .docx, .msg, etc.). For these formats, it provides classes for working with the files (read support always, and write support for many too), as well as text extractors.
For a general text extraction framework, you should look at Apache Tika. Tika uses POI internally to handle the Microsoft formats, and uses a number of other libraries to handle different formats. Tika will, for example, handle both PDF and ODF/ODT, which are the other two file formats you mentioned in the question.
There are some quick-start tutorials and examples on the Apache Tika website; I'd suggest you have a look through them. It's very quick to get started with, and you should be able to easily change your code to send the document through Tika during upload to get a plain text version, or even XHTML if that's more helpful to you.
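As an example, a minimal sketch of the upload-time extraction using the Tika facade class (assuming the bytes come straight from your mediumblob column):

import java.io.ByteArrayInputStream;

import org.apache.tika.Tika;

public class BlobTextExtractor {
    private static final Tika TIKA = new Tika();

    // blob: the raw document bytes read from the mediumblob column
    public static String extractText(byte[] blob) throws Exception {
        // Detects the format and returns the plain text in one call
        return TIKA.parseToString(new ByteArrayInputStream(blob));
    }
}

You could then store the returned string in your search field instead of cutting and pasting by hand.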

Can we extract pdf pages using lua scripts

Our application receives a PDF file of about 150 pages from the business line, and I want to extract pages from this PDF file using Lua scripts.
Can anybody share their experience?
Thanks
Sure, you can do this. As long as you write a Lua module that can read PDF files.
There are some Lua modules for writing PDFs, but none for reading them. No public ones, at any rate. You may want to switch to Python for this, as there are quite a few Python modules for dealing with PDFs.
You could write a Lua wrapper calling something like pdftk.