How do we create a simple search engine using Lucene, Solr or Nutch? - lucene

Our company has thousands of PDF documents. How do we create a simple search engine using Lucene, Solr or Nutch? We'll provide a basic Java/JSP web page where people can type in words and perform basic and/or queries, then show them the document links for all matching PDFs.

I have had good luck with Lucene, but it is not click, install and search; it does require a bit of work.
If you need something that you can download, install and be searching with within 10 minutes, look at the free OmniFind Yahoo! Edition http://omnifind.ibm.yahoo.net/. It uses Lucene, but is packaged so that it is configured and ready to run upon install - a much easier way to try Lucene.

Nutch + Lucene + the PDF plugin enabled in Nutch is your solution. Nutch allows you to parse PDFs by enabling the PDF plugin.
Lucene will allow you to index the crawled and parsed data, and Nutch has a servlet which gives you a search interface.
We use the same for our internal LANs.

None of the projects in the Lucene family can natively process PDFs, but there are utilities you can drop in and well-written examples of how to roll your own.
Lucene will do pretty much whatever you need it to do, but there is overhead in terms of your time, as Tony said above. Thousands of documents really isn't that many, so you might be able to get away with a lighter weight alternative.
That said, I would still recommend looking at Solr - it's much, much easier to set up than Lucene, has support for backups, replication, etc., as well as a nifty JSON interface which would fit your use case very well: http://wiki.apache.org/solr/SolJSON
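To give an idea of what "rolling your own" looks like, here is a minimal, hedged sketch of indexing a folder of PDFs with Lucene plus Apache PDFBox for text extraction (PDFBox is just one of the drop-in utilities mentioned above; exact class names and signatures vary between Lucene/PDFBox versions, so treat this as a sketch rather than copy-paste code):

    import java.io.File;
    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.text.PDFTextStripper;

    public class PdfIndexer {
        public static void main(String[] args) throws Exception {
            // args[0] = folder containing the PDFs; index is written to /tmp/pdf-index
            IndexWriter writer = new IndexWriter(
                    FSDirectory.open(Paths.get("/tmp/pdf-index")),
                    new IndexWriterConfig(new StandardAnalyzer()));
            for (File pdf : new File(args[0]).listFiles((dir, name) -> name.endsWith(".pdf"))) {
                try (PDDocument pd = PDDocument.load(pdf)) {
                    String text = new PDFTextStripper().getText(pd);  // extract raw text
                    Document doc = new Document();
                    doc.add(new StringField("path", pdf.getAbsolutePath(), Field.Store.YES));
                    doc.add(new TextField("contents", text, Field.Store.NO));
                    writer.addDocument(doc);
                }
            }
            writer.close();
        }
    }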

Google Search Appliance http://www.google.com/enterprise/gsa/

It sounds like you want a system to manage your PDF files. Try the DSpace system: DSpace is a digital repository platform whose search is based on Lucene. www.dspace.org

Take a look at EPrints. It includes a workflow for adding new documents, automatically indexes and thumbnails PDFs, and has fairly comprehensive full-text search functionality. It can also be easily customised and branded.
Why re-invent the wheel. Again.

Answering such a broad question in this forum will be tough. I'd recommend you check out the book Lucene in Action, which covers the basics of indexing and searching in a quite readable fashion.
Given your application, it sounds like Nutch and Solr probably will not be necessary. Since all of your documents are available locally, Nutch probably won't be helpful. Solr may help you manage a cluster of searchers if you have a high query load, but Lucene is highly performant, and handles large document sets in a very scalable manner.
The one area that might consume a lot of your effort is the use of PDF. It's possible to index PDF documents, and there are Lucene contributions to facilitate the extraction of raw text from PDFs, but depending on the document, the quality of results can vary. Often, the context of a keyword in a PDF document is unclear because of formatting instructions, and that can make it hard to do proximity searches or show the context of a hit.
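To make the search side concrete, here is a minimal sketch against an index like the one in the earlier example (the "contents" and "path" field names are assumptions carried over from that sketch); Lucene's QueryParser already understands the basic AND/OR syntax the question asks for:

    import java.nio.file.Paths;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.store.FSDirectory;

    public class PdfSearcher {
        public static void main(String[] args) throws Exception {
            DirectoryReader reader =
                    DirectoryReader.open(FSDirectory.open(Paths.get("/tmp/pdf-index")));
            IndexSearcher searcher = new IndexSearcher(reader);
            // e.g. args[0] = "invoice AND (2008 OR 2009)"
            Query query = new QueryParser("contents", new StandardAnalyzer()).parse(args[0]);
            for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("path"));  // print matching PDF paths
            }
            reader.close();
        }
    }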

A great free search technology you might look at is the IBM Yahoo! free search. I'm not sure whether they followed through on plans to use Lucene under the covers, but it remains one of the really great, easy to use free search technologies. It handles up to 500K documents, I believe, and it supports PDF and other non-text formats as well. It has a graphical user interface, easy-to-customize search results, and basic search analytics, plus a basic thesaurus and a powerful API so you can do pretty much whatever you want if the out-of-the-box results are not to your liking. We've suggested this to a number of clients with fewer than half a million documents, and they love it.

If you have a Linux server, you could use Beagle to index them, and then just use the search functionality that comes with it. It has an (experimental) web search interface, and it can be hooked into the Firefox search box as well.
It automatically indexes files as they're added, and I'd suspect you'll find it much more efficient to enhance or fix Beagle than to write your own search interface to Lucene.

Having the (IMHO) distinct advantage of being on a Mac, I use SearchLight on a somewhat older G5. It provides a nice web interface to Spotlight, the Mac OS's built-in indexing service.

Related

Searching contents of files

Okay, I am planning to create a local search engine on my intranet which searches the contents of files like xls, xlsx, doc, docx, pdb, etc.
After searching on the internet, I am thinking that Luke/Lucene can be used for this. Am I right?
Can Lucene be integrated into a website?
I have around 500 GB of files; can Lucene handle that many files? Is there any alternative?
I know only the basics of C and C++ and don't have any prior knowledge of this. I am a self-learner, so please suggest a good book on Lucene.
Yes, Lucene can be used for this, but there is some code you need to write yourself (as Lucene is just a library):
- crawling code
- text extraction
- a searcher app
So you might be better off looking at Solr, which is built on top of Lucene and has many built-in features you would use: a solid server you can access from any language, the DataImportHandler (DIH) for your crawling needs, and Tika integration for text extraction, among many other things.
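As a rough illustration of that route (not from the original post): Solr's extracting request handler (Solr Cell, which wraps Tika) lets you push a binary document straight into the index from Java via SolrJ. The core name "docs", the URL, the use of the file path as the id, and the builder-style client (SolrJ 6+) are all assumptions here:

    import java.io.File;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

    public class SolrCellUpload {
        public static void main(String[] args) throws Exception {
            // Assumes a Solr core named "docs" running on the default port.
            HttpSolrClient solr =
                    new HttpSolrClient.Builder("http://localhost:8983/solr/docs").build();
            ContentStreamUpdateRequest req =
                    new ContentStreamUpdateRequest("/update/extract");
            req.addFile(new File(args[0]), "application/pdf");
            req.setParam("literal.id", args[0]);  // use the file path as the document id
            solr.request(req);                    // Tika extracts the text server-side
            solr.commit();
            solr.close();
        }
    }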

Can we customize Lucene which is embedded in Solr?

Can we customize the Lucene that is embedded in Solr just as we can with raw Lucene, so that we have "everything" we have in Lucene in Solr as well?
I am asking this because we are stuck deciding between Solr and Lucene, thinking like so:
Argument 1:
"We might hit a dead zone in future if we choose Solr, and Lucene is a better choice, hence... So we might as well start writing HTTP wrappers and almost half of Solr ourselves on top of Lucene to be on the safer side."
Argument 2:
"Solr already has all the features we want to use, so why not just use it? Since people who commit to Lucene are also responsible for committing to Solr, all features of Lucene are available to Solr too..."
I went through many blogs and posts that say something like:
"For situations where you have very customized requirements requiring low-level access to the Lucene API classes, Solr would be more a hindrance than a help, since it is an extra layer of indirection." - http://www.lucenetutorial.com/lucene-vs-solr.html
One way of defending Argument 2 is by confirming that we can customize the underlying Lucene in Solr just as we would if we had only Lucene.
Can someone provide a better way of closing this argument? :)
PS: We need fast search with indexing and sharding over terabytes of data...
Can we customize the Lucene that is embedded in Solr?
Yes, you can. But keep this in mind:
Lucene and Solr committers are some of the foremost experts in the field of full-text search. They have several years of experience in this field. If you think you can do better than them, then go ahead and change Solr to your needs (it's Apache-licensed, so there aren't any commercial restrictions), and if you do, try to do it in a way that lets you contribute it back to the project later, so everyone can benefit and the project moves forward.
For the vast majority of Solr users though, the stock product is more than enough and satisfies all needs.
In other words, before jumping in to change the code, ask on a mailing list (stackoverflow or solr-user), there's a good chance that you don't really need to change any code.
"Fast search with indexing and sharding terabytes of data" is precisely what Solr was built for. It would be a bad case of Not-Invented-Here not to use it or any of the other similar solutions, such as ElasticSearch, Sphinx, Xapian, etc. If you think you'll need to customize or extend the search server in any way, consider the license and underlying code of each one. Solr and ElasticSearch are both Apache-licensed so they don't have commercial restrictions and are built on top of Lucene, a well-known library.

mg4j vs. apache lucene

Can anyone provide a simple comparative analysis of these search engines? What advantages does either framework have?
BTW, I've seen the following basic reasons for choosing MG4J in several academic papers:
combining indices over the same collection
multi-index queries
Update:
These slides (from mir2ed.org) contain a fresher overview of open source search engines, including Lucene and MG4J, benchmarking various aspects: memory and CPU usage, index size, search performance, search quality, etc.
Jeff Dalton reviewed many open source search engines, including Lucene and MG4J, in 2007, and updated the comparison in 2009.
I have not used mg4j. I have used Lucene, though. The number one feature of Lucene IMO is its wide adoption and wonderful community of users/developers/committers. This means that there is a fair chance that somebody worked on a use case similar to yours using Lucene.
Current weak points of Lucene are its scoring model and its ability to scale to large collections of text. The Lucene developers are working on these issues.
I believe that the choice of a search library is very dependent on your (academic or industrial) setting, the other parts of your application and your use case.

Where do I begin learning Lucene.NET, Solr, Hadoop and MapReduce?

I'm a .NET developer and I need to learn Lucene so we can run a very large scale search service that removes entries that the end user doesn't have access to (i.e. a user can search for all documents with clearance level 3 or higher, but not clearance level 2 or 1).
Where do I start learning, which products should I consider? To be honest, I'm a little overwhelmed, but I'm determined to figure it all out... eventually.
If you want a book that covers all the basics of Lucene, consider "Lucene in Action". Even though the code samples are Java, you can easily port them to .NET. Of course, there also are tonnes of resources on the web, such as SO and the Lucene mailing lists which should help you along.
For the project you describe, you should look at Solr since it abstracts away many of the scalability issues, and via SolrNet it can easily integrate into your .NET app. To restrict access by level, your index documents should contain a field called "Level" (say), and in the background of the user's query you append the "Level:Level-1" clause using a boolean query construct.
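To illustrate the boolean-query idea in Java/Lucene terms (rather than SolrNet, and with assumed field names "contents" and "level", the latter indexed as an IntPoint, Lucene 6+): combine the parsed user query with a level restriction so the filter is enforced server-side rather than trusted to the UI. In Solr itself this is usually expressed as a filter query (fq) appended behind the scenes.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.IntPoint;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.Query;

    public class ClearanceFilter {
        // Returns the user's query restricted to documents at or above their clearance
        // level (matching the question's example); flip the range if your model is inverted.
        public static Query restrict(String userInput, int userLevel) throws Exception {
            Query userQuery = new QueryParser("contents", new StandardAnalyzer()).parse(userInput);
            Query levelFilter = IntPoint.newRangeQuery("level", userLevel, Integer.MAX_VALUE);
            return new BooleanQuery.Builder()
                    .add(userQuery, BooleanClause.Occur.MUST)
                    .add(levelFilter, BooleanClause.Occur.FILTER)
                    .build();
        }
    }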
At this stage, my recommendation would be to stay away from Hadoop (the Apache MapReduce implementation) for your project and stick with Solr. If you are, however, keen to learn about it, it too has a very useful book - you guessed it, "Hadoop in Action" (also from Manning Publications).
You seem to be confused about what exactly each project (Lucene/Solr/Hadoop/etc) does. So the first thing to do would be understanding the purpose of each project. Read the docs and blogs about them. If possible, buy and read books about them.
For example, MapReduce and Hadoop have nothing to do with your security requirements. Hadoop is a platform for distributed, scalable computing. But Solr is scalable on its own. You might want to use Hadoop to distribute a crawler though (e.g. Nutch).

Company insists on using a binary format for all our documentation [closed]

I work at a company that, for some reason, insists that all our development documentation should be in MS Word format. Which, being a binary format, means we cannot:
Diff versions of a document against each other (so peer reviewing them is a pain - because of the domain we work in, peer reviews for all changes are essential)
Grep a folder-full of documents for keywords
What do you use to write documentation in and why?
Please also give me ammo to change this situation with...
I recently started using DocBook XML to author my documentation.
On the upside, it's a pure text format. You can break a large document into multiple files, and use nodes to bring them all together into a single book. Table of contents and index are automatically generated. Intra-document links (within arbitrary text, pointing to chapters or sections) are very easy. And with a push of a button, I can create a single-html-file version, a chunked-html version (one file per chapter), and a PDF version.
After some tweaking and customization, I'm very happy with the output. The documents look great!!
DocBook is used extensively by real publishers (most notably, O'Reilly), and it's been around for more than fifteen years, so it's reached a certain level of maturity.
On the other hand, all of the processing is done with XSLT, using an ad-hoc collection of tools. (My own DocBook pipeline includes Python, Java, Xerces, Xalan, Apache FOP, and PDF-SAM, plus the official XSLT stylesheet distribution and my own XSLT customizations.)
DocBook is not a turnkey solution. You won't be able to get going quickly, without reading the manual. And if you don't know anything about XSLT, you'll have to learn.
On the other hand, there are only a dozen or two XML tags that you really need to know to write the documents. (The real expertise comes into play during doc generation from the XML sources.) If one person on your team was willing to be responsible for writing the doc build script, then everyone else on the team could just learn the DTD and do a decent job contributing.
Anyhow... DocBook definitely has some faults. It's not the easiest system for tech authorship. But it's the best open source tool I know of.
The "Subversion Book" is written in DocBook. Here's a page with links to the different book versions (single-html, chunked-html, and PDF):
http://svnbook.red-bean.com/
And here's a link to the DocBook XML sources for the first chapter, so that you can get an idea for how it works:
http://sourceforge.net/p/svnbook/source/HEAD/tree/branches/1.7/en/book/ch01-fundamental-concepts.xml
For ammo, there's the trusty old Pragmatic Programmer, chapter 14: The Power of Plain Text.
"As Pragmatic Programmers, our base material isn't wood or iron, it's knowledge. We gather requirements as knowledge, and then express that knowledge in our designs, implementations, tests, and documents. And we believe the best format for storing knowledge persistently is plain text. With plain text, we give ourselves the ability to manipulate knowledge, both manually and programmatically, using virtually every tool at our disposal."
We use a wiki (specifically the one provided by Trac) for the two reasons you mentioned. Plus, if we really need to we can get the text version of the markup and manipulate it in a text-only environment, too (e.g. as part of svn comments during commit).
A format that can be easily reduced to text-only (non-binary) is definitely a must. Having the ability to upconvert it to a pretty format like a PDF is, for us, not terribly important.
Word has change tracking for documents (although it only works up until you accept the changes) and you can also grep them (the text isn't encrypted). So I'm not sure either of your arguments will hold up under scrutiny. I'd love to give you the ammo to change this but I've become jaded and cynical with age.
We use MS Word for our docs (which is a huge improvement over the earlier choice, Lotus WordPro - ugh!).
We use a wiki - specifically Confluence by Atlassian.
It's a commercial product, and it's great. One of the reasons we picked it over free/open wiki engines is that it has a full-blown WYSIWYG editor and various other features that make it more easily accessible to users who are familiar with Word.
We've also come up with a neat trick where we store images, designs, wireframes, etc. in Subversion, and then embed links in the wiki documents to those resources' URLs via the Apache/SVN web interface module; notes on how we do this are here if you're interested.
Like Dylan's organisation, we also use the excellent Confluence wiki. I wrote an article about why this is a better approach, called Wiki is my word-processor, which should give you some reasons to change the situation.
Benefits of using a wiki for internal documentation include the following.
Word-processor users get sucked into changing the layout and typography, however good your templates are, which wastes time and reduces consistency.
A wiki provides full-text search, which you are unlikely to have for a body of MS Word documents written by everyone.
A wiki provides a document version history; I have never heard of a team successfully keeping all revisions in Word documents and always being able to compare old versions, or using a version control system (with the possible exception of SharePoint, but that's a whole different failure scenario).
A wiki makes hyperlinks between documents easy; it is too hard to reliably link between documents in a collection of Word documents, so new documents end up duplicating older content into new monolithic documents which means they take more time to read and write.
Separate wiki pages can be edited by different people at the same time, and Confluence can merge changes when multiple people edit the same page at the same time; collaboration is harder with a Word document that only one person can edit at a time.
A wiki like Confluence automatically generates navigation pages based on wiki structure and tags; you need a librarian and lots of discipline to make it possible to browse a large collection of Word documents.
A wiki page usually loads and displays more quickly than a Word document.
A wiki page has more automatic meta-data; you need templates and discipline to make sure that Word documents always have Title, Author and Version set in the document properties and visible in the document on-screen and in print.
If you want more ammunition than this, then there is lots of wiki-promotion on The Atlassian Blog.
You could ask for documentation to be in OOXML (.docx, in the case of Word) format. Not as ideal as using ODT, in my opinion, however, it's still just a zip file with a bunch of XML files inside. :-)
A textual format facilitates merging your documentation with generated items such as Javadoc, API references or data dictionaries. It also scales much better than Word, which is hard to use for large documents. Finally, a format that supports includes lets multiple authors work on a document concurrently.
LaTeX and FrameMaker (the two systems I have used for this) both have vastly superior indexing and cross-referencing capabilities, and both have either a native textual format or a textual version of their native format that can be included (MIF in the case of FrameMaker). They are also both much more stable than Word.
I've built tools that read data dictionaries and generate documentation that can be included in a larger document with stable indexing and two-way cross-referencing. The functional specification for this product was done with LaTeX in this way and got me another gig with the company. I have also developed a similar process with FrameMaker.
Is the entire development team against this requirement, or is it a small group? If it's the entire team, just ignore the mandate and use a text-based format -- wouldn't be the first time employees ignored a silly rule. Works especially well if you've not made a big fuss about it in the past. If you have, management might look especially hard at your docs.
MS Word supports change tracking and peer review.
The new MS Office format is fully XML-based (to see this, rename an MS Word .docx file to .zip, then unpack it).
Maybe Office 2007 would fit both your company's requirements and your concerns?
You can at least compare Word documents - see the "Track changes" command in the "Extra" menu - or use software like DeltaView (found via a Google search; first link at lifehacker.com). Searching in Word documents should be possible with Google Desktop Search or other similar programs that index all files they are able to read.
Do they insist that you write it in Word or only that it's available in Word format? You could write in a text format and convert it to Word automatically.
Don't you store documentation files in some kind of version control system, ideally together with the source code? I would recommend doing this (it makes it easy to get the documentation for old software releases).
And if you do store the docs in a VCS, you will notice that plain text or XML-based files are much better for this, because you can get diffs; also, changes between text files are usually stored more efficiently than changes between binary files.
Not to defend MS products here, but MS Word can diff documents.
If you use Beyond Compare as the diff tool for your source-control system (as we do, with Perforce), it will show you differences between revisions of your Word docs. Admittedly, it only shows the textual differences - formatting changes are not shown - but this is usually enough for you to see what changed.
This is just another reason to invest in Beyond Compare, as it is one of the most polished pieces of software I've ever used - and it's the best $30 (less if you buy several) I've spent on software.
There are many tools for Word document comparison. I currently use a Python script that puts a command-line interface on the built-in compare and merge functionality of Word.
http://nicolas.lehuen.com/index.php/post/2005/06/30/60-comparing-microsoft-word-documents-stored-in-a-subversion-repository
It should be easy to automate Word to extract all the text from a Word document into a text file. So you could write a script that creates text files from the Word docs, and then grep, compare, version-control and review those text files.
Of course this is not an ideal solution, since you lose your pretty formatting, but it should work.
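As a hedged sketch of that kind of script (the original post names no library; Apache POI is one option for .docx files, and the output file name here is just an illustration):

    import java.io.FileInputStream;
    import java.io.PrintWriter;
    import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
    import org.apache.poi.xwpf.usermodel.XWPFDocument;

    public class DocxToText {
        public static void main(String[] args) throws Exception {
            // args[0] = path to a .docx file; writes args[0] + ".txt" alongside it
            try (FileInputStream in = new FileInputStream(args[0]);
                 XWPFDocument doc = new XWPFDocument(in);
                 XWPFWordExtractor extractor = new XWPFWordExtractor(doc);
                 PrintWriter out = new PrintWriter(args[0] + ".txt")) {
                out.print(extractor.getText());  // plain text suitable for grep/diff/VCS
            }
        }
    }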
I think there are programs that convert Word docs to plain text. Use one of them to convert the Word doc to plain text and then use diff, grep, etc.
Also have a look at the recommended toolchain(s) for DocBook.