Can anyone suggest a method by which a CHM file can be indexed, in the same way that PDFBox can be used for PDFs?
If you also have other document formats which you need to index, you might find a better and more general solution in Apache Tika.
They just added a CHM Parser recently (for reference: Support of CHM Format) and it will be in the next version.
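Once you are on a Tika version that ships that parser, the extraction side can go through Tika's generic facade; a minimal sketch, assuming the CHM parser is present in your build (the file name is just a placeholder):

import java.io.File;
import org.apache.tika.Tika;

public class ChmToText {
    public static void main(String[] args) throws Exception {
        // Tika auto-detects the format; with a CHM-capable build this
        // returns the plain text of the compiled help file.
        Tika tika = new Tika();
        String text = tika.parseToString(new File("manual.chm"));
        System.out.println(text);
        // Feed the extracted string to whatever indexer you use
        // (e.g. Lucene), just as you would with PDFBox output for PDFs.
    }
}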
If you're talking about Microsoft Compiled HTML Help files, you can just extract text from them with JChm and then index it in a normal way.
We are converting a sizeable document for hosting on ReadTheDocs. We weren't happy with the simple presentation enabled by Markdown table syntax, so we coded our tables as HTML. Very nice in the HTML viewer (e.g., the end of http://manual.cytoscape.org/en/latest/Command_Line_Arguments.html).
In the PDF version generated by ReadTheDocs, each of our tables is completely missing (see page 9 on https://media.readthedocs.org/pdf/cytoscape-working-copy/latest/cytoscape-working-copy.pdf).
Have we made a mistake by coding tables as HTML? Could we have taken a different route and gotten nice tables in both HTML and PDF?
Any advice would be helpful ...
Thanks!
I have not used ReadTheDocs myself, but from reading their Getting Started guide, I assume you are using Sphinx? While Markdown supports embedding raw HTML, Sphinx does not support converting it to other formats.
You should consider moving to reStructuredText (Sphinx's native markup format), as it is much more advanced than Markdown. It can even be extended with custom directives and roles, should you need this. But be sure to first check whether reStructuredText tables offer the flexibility you require. Pandoc can convert your Markdown files to reStructuredText.
I see you are using a table to document command line options. reStructuredText supports documenting command line options using option lists. In theory, you could change how option lists are represented in the output document, but this might not be easy to accomplish, especially for PDF output using LaTeX (shameless plug: using rinohtype for PDF output should make this much easier in the future).
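For illustration, an option list is just the options, at least two spaces, then the description; if you need more layout control, the list-table directive renders as an ordinary table in both HTML and LaTeX/PDF output. A small sketch (the option names are made up for the example):

-P <port>, --port <port>  Listen on the given port.
-h, --help                Show this message and exit.

.. list-table:: Command line arguments
   :header-rows: 1
   :widths: 30 70

   * - Option
     - Description
   * - ``-P <port>``
     - Listen on the given port.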
In one of my applications I need to merge many single PDF documents into one document, where each of the original PDFs is a page. Although many PDF libraries exist for most languages, I would like to write this myself if it's not too hard.
Is it necessary to implement a full-fledged PDF parser in order to merge PDF documents? Where and what would I start to read to find out what is needed for the task?
You can use the Debenu QuickPDF Library Lite (free) version to do it. Here is a very good example of how to do it:
http://www.debenu.com/kb/merge-pdf-files-together-programmatically/
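If you would rather stay with a Java library already mentioned elsewhere in this thread, Apache PDFBox ships a merge utility. A rough sketch, assuming PDFBox 2.x on the classpath and hypothetical file names (this is an alternative, not what the answer above uses):

import org.apache.pdfbox.io.MemoryUsageSetting;
import org.apache.pdfbox.multipdf.PDFMergerUtility;

public class MergePdfs {
    public static void main(String[] args) throws Exception {
        PDFMergerUtility merger = new PDFMergerUtility();
        // Each source document is appended in the order it is added.
        merger.addSource("page1.pdf");
        merger.addSource("page2.pdf");
        merger.setDestinationFileName("combined.pdf");
        // Keep everything in main memory; switch to a temp-file setting for huge inputs.
        merger.mergeDocuments(MemoryUsageSetting.setupMainMemoryOnly());
    }
}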
How can I detect images in a document, say .doc, .xls, .ppt, or .pdf?
I came across Apache Tika, and I am trying its command line option.
http://tika.apache.org/1.2/gettingstarted.html
But I am not quite sure how it will detect images.
Any help is appreciated.
Thanks
You've said you want to use a command line solution, and not write any Java code, so it's not going to be the prettiest way to do it... If you are happy to write a little bit of Java, and create a new program to call from Python, then you can do it much more cleanly!
The first thing to do is to have the Tika App extract out any embedded resources within your file. Use the --extract option for this, and have the extraction occur in a special temp directory your app controls, e.g.
$ java -jar tika.jar --extract ../testWORD_embedded_pdf.doc
Extracting 'image1.emf' (application/x-emf)
Extracting '_1402837031.pdf' (application/pdf)
Grab the output of the extraction if you can, and parse it looking for images (but be aware that some images have an application/ prefix on their canonical mimetype!). You might need to run a second --detect step on a few; I'm not sure, so test how the parsers get on with the extraction.
Now, if there were images, they'll be in your temp dir. Process them as you want. Finally, zap the temp dir when you're done with the file!
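Putting that together as a rough command-line sketch (the --extract-dir flag and the grep pattern are assumptions; check the --help output of your Tika version):

$ mkdir /tmp/mydoc-extract
$ java -jar tika.jar --extract --extract-dir=/tmp/mydoc-extract ../testWORD_embedded_pdf.doc 2>&1 | grep -Ei "image/|x-emf"
$ ls /tmp/mydoc-extract      # any extracted images land here
$ rm -rf /tmp/mydoc-extract  # zap the temp dir when done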
Having used Tika in the past, I originally answered "no" because I could not see how Tika could help with images embedded within Office documents or PDFs, but I was wrong. You may still resort to native APIs like Apache POI and Apache PDFBox; Tika does use both libraries to parse text and metadata, but I thought it had no embedded image support (see the update below).
Using Tika makes these APIs automatically available (side effect of using Tika).
UPDATE:
Since Tika 0.8: look for EmbeddedResourceHandler and examples - thanks to Gagravarr.
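As a rough illustration of that route (assuming a Tika version with the container-extraction API; the file name and handler below are placeholders, not a definitive recipe):

import java.io.File;
import java.io.InputStream;
import org.apache.tika.extractor.EmbeddedResourceHandler;
import org.apache.tika.extractor.ParserContainerExtractor;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.mime.MediaType;

public class ListEmbeddedImages {
    public static void main(String[] args) throws Exception {
        ParserContainerExtractor extractor = new ParserContainerExtractor();
        try (TikaInputStream stream =
                 TikaInputStream.get(new File("testWORD_embedded_pdf.doc"))) {
            // Passing the extractor itself as the second argument recurses
            // into embedded containers (e.g. a PDF inside a .doc).
            extractor.extract(stream, extractor, new EmbeddedResourceHandler() {
                @Override
                public void handle(String filename, MediaType type, InputStream data) {
                    // Report anything that looks like an image; remember some image
                    // formats carry an application/ prefix (e.g. application/x-emf).
                    System.out.println(filename + " -> " + type);
                }
            });
        }
    }
}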
I currently have a number of documents uploaded to my website on a daily basis (.doc, .docx, .odt, .pdf), and these docs are stored in a SQL database (mediumblob).
Currently I open the docs from the database and cut and paste a text version into a field in the database for a quick reference and search function.
I'm looking to automate this "cut & paste" process - formatting isn't a real concern just as long as I can extract the text - and was hoping that some people may be able to suggest a good route to go down?
I've tried manipulating the content of the blob field using regex but it is not really working.
I've been looking at Apache POI with a view to extracting the text at the point of upload, but I can't help thinking that this may be a bit of overkill given my relatively simple needs.
Given the various document formats I encounter and the current storing of the content in a blob field would Apache POI be the best solution to use in this instance or can anybody suggest an alternative?
Help and suggestions greatly appreciated.
Chris
Apache POI will only work for the Microsoft Office formats (.xls, .docx, .msg etc.). For these formats, it provides classes for working with the files (always read, and for many write support too), as well as text extractors.
For a general text extraction framework, you should look at Apache Tika. Tika uses POI internally to handle the Microsoft formats, and uses a number of other libraries to handle different formats. Tika will, for example, handle both PDF and ODF/ODT, which are the other two file formats you mentioned in the question.
There are some quick start tutorials and examples on the Apache Tika website; I'd suggest you have a look through them. It's very quick to get started with, and you should be able to easily change your code to send the document through Tika during upload to get a plain text version, or even XHTML if that's more helpful to you.
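A minimal sketch of what that upload-time step could look like with the Tika facade (the byte array stands in for your mediumblob contents; the class and method names are illustrative):

import java.io.ByteArrayInputStream;
import org.apache.tika.Tika;

public class BlobTextExtractor {
    private final Tika tika = new Tika();

    // Call this at upload time and store the returned string alongside the blob.
    public String extractText(byte[] uploadedDocument) throws Exception {
        // Tika auto-detects .doc, .docx, .odt and .pdf and picks the right parser.
        return tika.parseToString(new ByteArrayInputStream(uploadedDocument));
    }
}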
I have a website where users upload documents in .doc and .pdf format. I am using Sphinx to conduct full text searches on my SQL database (MySQL). What is the best way to index these file formats with Sphinx?
The method I use for this is pdftotext and antiword. I use both of these to dump the contents of the PDFs and Word documents into the database. From there it's easy to crawl with Sphinx.
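Roughly, the dump step looks like this (file names and redirection are illustrative):

$ pdftotext report.pdf report.txt    # from poppler-utils
$ antiword letter.doc > letter.txt   # plain-text dump of an old-style .doc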
Unfortunately, Sphinx can't index those file types directly. You'll need to either import the textual contents into a database, or into an XML format that Sphinx can understand.
Has anyone used Tika to index other types of documents, much like the SOLR plugin?
Apache Tika
Some links:
pdftotext is in poppler or poppler-utils on Linux
antiword -- seems to be for old .doc, not newer .docx