Find indexed terms in a non-indexed document/string - Lucene

Sorry if I'm using the wrong terminology here, I'm new to Lucene :D
Assume that I have indexed all titles of the English Wikipedia in Lucene.
Let's say I'm visiting a news website. Within the article, I want to convert every phrase that matches a Wikipedia title into a link to the corresponding Wikipedia page.
To clarify: I don't want to put the news article into the Lucene index, but rather use the indexed WP titles to find matches within a given string (the article). We also don't want to bother with the JS/HTML stuff, just focus on Lucene for now.
I'd also like to match greedily: i.e. if the text contains "Stack Overflow", I'd like to link to SO rather than to "Stack" and "Overflow". But if I can get shorter matches as well, that would be neat, too. (I kind of want to do both, but I'll settle for either one if having both is difficult.)
Naive solution: I can see that I'd be able to query for single words iteratively and, whenever I get a hit, try to find the current word plus the next word, and so on until I miss. Then convert the last match into a link and continue after that, until I'm through the complete document.
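Something like this rough sketch is what I have in mind (the field name "title_exact", the index path and the crude tokenization are placeholders I made up, assuming each title was also indexed as a single lowercased value):

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

public class NaiveTitleLinker {
    // Returns the greedy longest title matches found while walking the article.
    public static List<String> findTitleMatches(String article) throws Exception {
        IndexSearcher searcher = new IndexSearcher(
                DirectoryReader.open(FSDirectory.open(Paths.get("wp-title-index"))));
        String[] words = article.toLowerCase().split("\\W+");  // crude tokenization
        List<String> matches = new ArrayList<>();

        for (int i = 0; i < words.length; i++) {
            String longest = null;
            StringBuilder candidate = new StringBuilder();
            for (int j = i; j < words.length; j++) {
                if (j > i) candidate.append(' ');
                candidate.append(words[j]);
                // one exact-match lookup per candidate phrase
                TermQuery q = new TermQuery(new Term("title_exact", candidate.toString()));
                if (searcher.search(q, 1).scoreDocs.length > 0) {
                    longest = candidate.toString();  // hit: keep extending greedily
                } else {
                    break;                           // miss: stop extending
                }
            }
            if (longest != null) {
                matches.add(longest);
                i += longest.split(" ").length - 1;  // continue after the match
            }
        }
        return matches;
    }
}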
But, that seems really awkward and I have a suspicion that Lucene might have some functionality that could support me here (or at least I hope so :D), but I have no clue what I'd be looking for.
Lucene's inverted index should make this a pretty fast operation, so I guess this might perform reasonably well, even with the naive approach.
Any pointers? I'm stuck :3

There is such a thing; it's called the Tagger Handler:
Given a dictionary (a Solr index) with a name-like field, you can post text to this request handler and it will return every occurrence of one of those names with offsets and other desired document metadata. It's used for named entity recognition (NER).
It seems a bit fiddly to set up, but it's exactly what I wanted :D
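For anyone else trying this, the call ends up looking roughly like the sketch below (the collection name "wp_titles" and the request parameters are just what I used; check the Solr documentation for the tagger's schema and solrconfig setup, which has to be done first):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class TagArticle {
    public static void main(String[] args) throws Exception {
        String article = "When Stack Overflow went down, ...";  // the news article text

        // Post the raw article text to the tagger request handler.
        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:8983/solr/wp_titles/tag"
                    + "?overlaps=LONGEST_DOMINANT_RIGHT"   // prefer the longest (greedy) match
                    + "&fl=id,title&wt=json"))
            .header("Content-Type", "text/plain")
            .POST(HttpRequest.BodyPublishers.ofString(article))
            .build();

        HttpResponse<String> response =
            HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());

        // The response lists each tag with start/end offsets into the article text,
        // plus the matching documents, which is enough to build the links.
        System.out.println(response.body());
    }
}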

Related

Phrase and word suggestions using Lucene.net

I need to create Google-like suggestions using Lucene.net. I am currently using ShingleAnalyzerWrapper for phrase suggestions, successfully. But I need to fall back to single-word suggestions if no phrase is found.
I am completely new to the Lucene world, and I need to implement this in a short time. I would appreciate any advice.
Thanks.
Edit
I want simple answers to my questions.
Should I use SpellChecker?
How should I index phrases?
How do I search for phrases (and what if there are misspelled words)?
If you are new to Lucene, this might not be that easy. However, what you need to do at a higher level is check your results from the phrase query, and if it comes back with zero results, simply create a new Query without the phrase.
I am not sure how your phrase is set up, but you could do any of the following (there's a rough sketch after the list):
- do a keyword search on the phrase and eliminate stopwords: the phrase "the big bus" could become "big bus" or just "bus"
- add a slop setting to your phrase search
- use fuzzy search
- use a MoreLikeThis search
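A rough sketch of the zero-results fallback in Java, against a recent Lucene (the Lucene.net 3.0.3 API is analogous, e.g. PhraseQuery is configured directly instead of via a Builder; the field name "content" and the slop / edit-distance values are placeholders, not recommendations):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.FuzzyQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.TopDocs;

public class SuggestionFallback {
    public static TopDocs search(IndexSearcher searcher, String[] words) throws Exception {
        // 1) phrase query with some slop ("big bus" also matches "big red bus")
        PhraseQuery.Builder phrase = new PhraseQuery.Builder().setSlop(2);
        for (String w : words) {
            phrase.add(new Term("content", w));
        }
        TopDocs hits = searcher.search(phrase.build(), 10);
        if (hits.scoreDocs.length > 0) {
            return hits;
        }

        // 2) zero results: drop the phrase and OR together fuzzy word queries,
        //    which also catches misspelled words
        BooleanQuery.Builder fallback = new BooleanQuery.Builder();
        for (String w : words) {
            fallback.add(new FuzzyQuery(new Term("content", w), 2), BooleanClause.Occur.SHOULD);
        }
        return searcher.search(fallback.build(), 10);
    }
}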
I would recommend the book "Lucene in Action", as it covers Lucene 3.0.3. It is for Java, however the current Lucene.net version is 3.0.3, so there is symmetry between the two APIs and the examples in the book. The book dedicates a chapter to what you are looking for and the strategies involved: suggesting searches on a non-exact match (spell checking, suggesting close documents, etc.).

Lucene Based Searching

I have a problem with Lucene-based searching. I have designed a document with five fields. Consider the document to be an Address with addressline1, addressline2, city, state and pin. A search has to be performed across all of the fields, so I'm using Boolean term queries, and the results are retrieved that way. Now I also have to respond not only with the results but also with the matching field. For example, if the city field matches the search, then I should respond that the city matched, along with the actual search response. Is there any Lucene API to accommodate this?
AFAIK there's no simple solution to find out which field matched the query.
Your options are:
try using the hit highlighter (it knows where the match occurred, but it's noticeably slow on large result sets)
fiddle with IndexSearcher's explain method (a rough sketch follows this list)
build your own custom solution
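A sketch of the explain-based option (the string inspection is crude and only meant to show the idea; the field names are the ones from the question):

import org.apache.lucene.search.Explanation;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;

import java.util.ArrayList;
import java.util.List;

public class MatchingFields {
    static final String[] FIELDS = {"addressline1", "addressline2", "city", "state", "pin"};

    // Returns the fields whose terms show up in the score explanation for this hit.
    public static List<String> matchedFields(IndexSearcher searcher, Query query, ScoreDoc hit)
            throws Exception {
        Explanation explanation = searcher.explain(query, hit.doc);
        String text = explanation.toString();   // contains "field:term" fragments for matches
        List<String> matched = new ArrayList<>();
        for (String field : FIELDS) {
            if (text.contains(field + ":")) {
                matched.add(field);
            }
        }
        return matched;
    }
}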
Hit highlighter experience and workaround findings.
IMHO it shouldn't be hard to implement that yourself, since Lucene at some point surely knows which field yielded a match, but it discards that information as unnecessary weight by the time it composes your response.
I stumbled upon this custom approach.
Try to find more resources on search-lucene.com, the best Lucene/Solr related search engine.

Increasing the weight of particular terms (e.g. headings) when indexing documents in Lucene

I have documents which I am indexing with Lucene. These documents basically have a title (text) and body (text). Currently I am creating an index out of Lucene Documents with (amongst other fields) a single searchable field, which is basically title+" "+body. In this way, if you search for anything which occurs in the title or in the body, you will find the document.
However, now I have learned of the new requirement that matches in the title should cause the document to be "more relevant" than matches in the body. Thus, if there is a document with the title "Software design", and the user searches for "Software design", then that document should be placed higher up in the search results than a document called something else, which mentions software design a lot in the body.
I don't really have any idea how to begin implementing this requirement. I know that Google, for example, treats certain parts of a document as "more relevant" (e.g. text within <h1> tags), and everyone here assumes Lucene supports something similar.
However,
The Javadoc for the Document class clearly states that fields contain text, i.e. not structured text where some parts are "more important" than other parts.
This blog post states "With Lucene, it is impossible to increase or decrease the weight of individual terms in a document."
I'm not really sure where to look. What would you suggest?
Any specific information (e.g. links to Lucene documentation) stating flatly that such a thing is not possible would also be helpful, then I needn't spend any further time looking for how to do it. (The software is already written with Lucene, so we won't re-write it now, so if Lucene doesn't support it, then there's nothing anyone (my boss) can do about that.)
Just use two fields, title and body, and while indexing boost the 'title' field:
title.setBoost(float)
see here
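A minimal sketch of that, assuming the Lucene 3.x API this answer refers to (the 2.0f boost value is arbitrary; note that index-time field boosts were removed in later Lucene versions, where the query-time boost from the other answer is the way to go):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

public class BoostedDocumentFactory {
    // Index title and body as separate fields, with the title boosted at index time.
    public static Document build(String titleText, String bodyText) {
        Document doc = new Document();

        Field title = new Field("title", titleText, Field.Store.YES, Field.Index.ANALYZED);
        title.setBoost(2.0f);   // matches in the title count for more
        doc.add(title);

        doc.add(new Field("body", bodyText, Field.Store.YES, Field.Index.ANALYZED));
        return doc;
    }
}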
You should probably split the combined field into separate title and body fields, then use a run-time boost to give more relevance to the title field.
The run-time query will look like:
title:apache^20 body:apache
see - http://lucene.apache.org/java/2_4_0/queryparsersyntax.html#Boosting%20a%20Term
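In code, the same run-time boost can be applied to every user query with MultiFieldQueryParser, roughly like this (a sketch against a recent Lucene; older versions use slightly different package names and constructors, and the boost values are just the ones from the example above):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.MultiFieldQueryParser;
import org.apache.lucene.search.Query;

import java.util.HashMap;
import java.util.Map;

public class BoostedQueryFactory {
    // Expands e.g. "apache" into the equivalent of "title:apache^20 body:apache".
    public static Query build(String userInput) throws Exception {
        Map<String, Float> boosts = new HashMap<>();
        boosts.put("title", 20f);
        boosts.put("body", 1f);

        MultiFieldQueryParser parser = new MultiFieldQueryParser(
                new String[] {"title", "body"}, new StandardAnalyzer(), boosts);
        return parser.parse(userInput);
    }
}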

Question Answering with Lucene

For a toy project, I want to implement an automated question answering system with Lucene and I'm trying to figure out a reasonable way to implement it. The basic operation is as follows:
1) The user will enter a question.
2) The system will identify the keywords in the question.
3) The keywords will be searched in a large knowledgebase and matching sentences will be shown as answers.
My knowledgebase (i.e., corpus) is not structured. It is just a large, continuous text (say, a user manual without any chapters). I mean that the only structure is that sentences and paragraphs are identified.
I plan to treat each sentence or paragraph as a separate document. To present the answer in a context, I may consider keeping one sentence/paragraph before/after the indexed one as payload. I would like to know if that makes sense. Also, I'm wondering if there are other tried and well-known approaches for that kind of systems. As an example, another approach that comes to mind is to index large chunks of the corpus as documents with the token positions, then process the vicinity of found keywords to construct my answers.
I would appreciate direct recommendations based on experience or intuition, but also tutorials or introductory materials to question-answering systems with Lucene in mind.
Thanks.
It's not an unreasonable approach to take.
One enhancement you might consider is incorporating learning feedback, so that you can continually improve the scoring of content vs search terms. To do this you would ask users to rate the answers that come back ('helpful vs unhelpful'), that way you can start to rank documents against keywords based on the historical data. You could classify potential documents as helpful/unhelpful for given keywords by using a simple Bayesian classifier.
Indexing each sentence as a document will give you some problems. You've pointed out one: you would need to store the surrounding texts as payloads. That means you'll need to store each sentence three times (as the sentence before, the sentence itself, and the sentence after), and you'll have to manually dig into the payload.
If you want to go the route of each sentence being a document, I would recommend coming up with an ID for each sentence and storing that as a separate field. Then you can display [ID-1, ID, ID+1] in each result.
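A rough sketch of that ID idea (field names are placeholders, and splitting the corpus into sentences is assumed to happen elsewhere):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;

public class SentenceIndex {
    // Index each sentence as its own document, with a stored, not-analyzed sequence ID.
    public static void indexSentences(IndexWriter writer, String[] sentences) throws Exception {
        for (int i = 0; i < sentences.length; i++) {
            Document doc = new Document();
            doc.add(new StringField("id", Integer.toString(i), Field.Store.YES));
            doc.add(new TextField("text", sentences[i], Field.Store.YES));
            writer.addDocument(doc);
        }
    }

    // Fetch the stored text of the sentence with the given sequence ID (or null),
    // so a result can be displayed as [ID-1, ID, ID+1].
    public static String sentenceById(IndexSearcher searcher, int id) throws Exception {
        TopDocs hit = searcher.search(new TermQuery(new Term("id", Integer.toString(id))), 1);
        if (hit.scoreDocs.length == 0) {
            return null;
        }
        return searcher.doc(hit.scoreDocs[0].doc).get("text");
    }
}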
The bigger question though is: how should you break up the text into documents? Identifying semantically related areas seems difficult, so doing it by sentence/paragraph might be the only way to go. A better way would be if you could find which text is the header of a section, and then put everything in that section as a document.
You might also want to use the index (if your corpus has one). The terms there could be boosted, as they are presumably more important.
Instead of Lucene, which does text indexing, search and retrieval, I think using something like Apache Mahout would help with this. Mahout treats text as knowledge, and doing that makes answering the question better than just text matching. Mahout is a machine learning and data mining framework which fits this domain better. Just a very high-level thought.
--Sai

Code related web searches

Is there a way to search the web which does NOT remove punctuation? For example, I want to search for window.window->window (Yes, I actually do, this is a structure in mozilla plugins). I figure that this HAS to be a fairly rare string.
Unfortunately, Google, Bing, AltaVista, Yahoo, and Excite all strip the punctuation and just show anything with the word "window" in it. And according to Google, on their site, at least, there is NO WAY AROUND IT.
In general, searching for chunks of code must be hard for this reason... anyone have any hints?
Google Code Search ("window.window->window", but it doesn't seem to return any relevant results for this request)
There are similar tools all over the internet, like Codase or Koders, but I'm not sure they let you search for exactly this string. Anyway, they might be useful to you, so I think they're worth mentioning.
Edit: It is very unlikely you'll find a general purpose search engine that will allow you to search for something like "window.window->window", because most search engines do some processing on the document before storing it. For instance, they might represent it internally as vectors of words (a vector space model) and use that to do the search, not the actual original string. And creating such a vector involves first cutting the document up according to punctuation and other critters. This is a very complex and interesting subject which I can't tell you much more about. My bad memory did a pretty good job, since I studied it at school!
BTW, they might do the same kind of processing on your query too. You might want to read about tf-idf, which is probably light years from what Google and its friends are doing, but it can give you a hint about what happens to your query.
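For reference, the classic tf-idf weight of a term t in a document d is (one common formulation only; real engines use far more elaborate scoring):

w_{t,d} = \mathrm{tf}_{t,d} \cdot \log \frac{N}{\mathrm{df}_t}

where tf_{t,d} is how often t occurs in d, df_t is the number of documents containing t, and N is the total number of documents in the collection.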
There is no way to do that by itself in the main Google engine, as you discovered -- however, if you are looking for information about Mozilla, then the best bet would be to structure your query something more like this:
"window.window->window" +Mozilla
OR +XUL
+ Another search string related to what you are
trying to do.
SymbolHound is a web search engine that does not remove punctuation from queries. There is an option to search source code repositories (like the now-discontinued Google Code Search), but it also has the option to search the Internet for special characters (primarily programming-related sites such as Stack Overflow).
try it here: http://www.symbolhound.com
-Tom (co-founder)