Lucene query-time boosting culture code

I'm using the Lucene.Net implementation packaged with the Kentico CMS. The site that we're indexing has articles in various languages. If a user is viewing the Japanese version of the site (for example) and runs a search for 'VPN', we'd like them to see Japanese articles about VPN first, but also see other language articles in the results.
I'm trying to achieve this with query-time boosting of the _culture field. Since we're using the standard analyzer (we really don't want to change that), and the standard analyzer treats hyphens as whitespace, I thought I'd try appending '(_culture:jp)^4' to the user's query. But the Luke tool's Explain output shows that this isn't doing anything to boost documents with 'jp' in the field. What gives?
I've also tried:
_culture:"en-jp"
_culture:en AND _culture:jp
_culture:"en jp"
Update: It's something with the field. There's another field in the index named 'documentculture' that contains the same data (I don't know why). But when I try '(documentculture:jp)^4', it works as I expect. That solves my problem, but I still have the academic question of how the two fields differ.
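My guess, and it is only a guess, is that the two fields were given different index settings. A minimal Lucene.Net sketch of two indexing modes that would produce exactly this behavior (Lucene.Net 2.x enum names; older versions call NOT_ANALYZED UN_TOKENIZED):

    using Lucene.Net.Documents;

    var doc = new Document();

    // Stored as a single untokenized term: the only term in the index is
    // "en-jp", so the query term "jp" can never match it.
    doc.Add(new Field("_culture", "en-jp",
        Field.Store.YES, Field.Index.NOT_ANALYZED));

    // Run through an analyzer that splits on the hyphen: the index holds
    // the terms "en" and "jp", so 'documentculture:jp' matches.
    doc.Add(new Field("documentculture", "en-jp",
        Field.Store.YES, Field.Index.ANALYZED));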

Even though the standard analyzer ignores hyphens, I don't believe it will treat the two parts of your culture code as separate terms. Under normal circumstances, therefore, a wildcard would help you here. For example, the query vpn (_culture:en*)^4 would boost all documents with a culture starting with 'en'.
However, in your case you want to match the end of the term. Unfortunately, Lucene syntax doesn't support wildcards at the start of terms for some reason (according to this reference). Therefore I think you're going to have to consider changing the analyzer you're using. I generally find the WhitespaceAnalyzer fits my needs best. I've just tried your scenario using the WhitespaceAnalyzer and found that vpn (_culture:en-jp)^4 will give you what you need.
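A minimal sketch of that setup in Lucene.Net (2.x-style QueryParser constructor; the default field name 'content' is an assumption):

    using Lucene.Net.Analysis;
    using Lucene.Net.QueryParsers;
    using Lucene.Net.Search;

    // WhitespaceAnalyzer splits on whitespace only, so "en-jp" survives as a
    // single term. Note that, unlike StandardAnalyzer, it does not lowercase.
    var analyzer = new WhitespaceAnalyzer();
    var parser = new QueryParser("content", analyzer);

    // Boosts documents whose _culture term is exactly "en-jp" by a factor of 4.
    Query query = parser.Parse("vpn (_culture:en-jp)^4");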
I understand if you don't accept this answer though since you stated you didn't want to change the analyzer!

Related

Azure Search - issues with Phonetic Analyzer

Our clients query our Azure Search index, mostly for people's names. We are using the Lucene analyzer for all of our fields. We build the query string by turning the client's input name into a phrase and adding a proximity rate of 3. Because we search using a phrase, we cannot use the fuzzy search capability of the Lucene query syntax, as it only works on single words.
We were therefore in search of a solution for being able to bring back results with names that weren't spelled exactly as the client input them. We came across the phonetic analyzer, and have just implemented the Metaphone algorithm into our index. We've run some tests and while it gets us closer to what we need, we still see some issues:
The analyzer's scope is so wide that it's bringing back a lot of false positives. For example, when searching on Kenneth Gooden, it brings back Kenneth Cotton. That's just a little too far to be considered phonetically similar, in our opinion. Can the sensitivity be tweaked in any way, or can something be done to boost some other parameter to remedy this?
When doing a search on Barry Soper, the first and highest-scored result that comes back is "Barry Spear." The second result, scored lower, is "Soper, Barry Russell." To a certain extent, I can maybe see why it's scored that way (b/c of the 2nd one being last name first) but then... not really. The 2nd result contains both exact terms within the required proximity. Maybe Azure Search gives priority to the order of words in the phrase before applying the analyzer? Still doesn't make sense to me. (Side note - this query also brings back "Barh Super" - see issue #1 above)
I would like to know if someone could offer suggestions to tweak Azure Search's behavior to work more along the lines of what we need, OR perhaps suggest an alternative to the phonetic analyzer. We haven't tried any of the other available phonetic algorithms yet, only b/c it seems Metaphone is the best and most commonly used. But we're open to suggestions regarding the other algorithms as well.
Thanks.
You are correct that the fuzzy operator only works on single terms. In this case, you can use a custom analyzer (phonetic token filter) or the Synonyms feature (in preview). I am not sure what you meant by "we have just implemented the Metaphone algorithm into our index," but there are several phonetic token filters you can choose from in the Azure Search custom analysis stack. Synonyms is a newer feature only available in preview; you can take a look here. For synonyms, you will need to define synonym rules, say 'Nate, Nathan, Nathaniel' for example, and at query time, searching for one automatically includes the results for the others.
Okay, then how should I use these building blocks to control relevance for my search? One way to model this is to use a separate field for each expansion strategy. For example, instead of a single field for the name, you can have three fields, say 'name', 'name_synonym', and 'name_phonetic'. The first field, 'name', is for exact matches; 'name_synonym' has synonyms enabled; and the third uses a phonetic analyzer and broadens the search the most. You can then use a scoring profile to boost scores from matches in each field. You can give a boost value of 10 for exact matches, 5 for synonyms, and 1 for phonetic expansions, for example. Your search will be issued against these three internal fields.
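To make that concrete, here is a rough sketch of such an index in the Azure Search REST API (all names are made up, and the synonym field is omitted since that feature is still in preview):

    {
      "name": "people",
      "fields": [
        { "name": "id", "type": "Edm.String", "key": true },
        { "name": "name", "type": "Edm.String", "searchable": true },
        { "name": "name_phonetic", "type": "Edm.String", "searchable": true,
          "analyzer": "name_metaphone" }
      ],
      "analyzers": [
        { "name": "name_metaphone",
          "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
          "tokenizer": "standard_v2",
          "tokenFilters": [ "lowercase", "metaphone_filter" ] }
      ],
      "tokenFilters": [
        { "name": "metaphone_filter",
          "@odata.type": "#Microsoft.Azure.Search.PhoneticTokenFilter",
          "encoder": "metaphone",
          "replace": true }
      ],
      "scoringProfiles": [
        { "name": "nameBoost",
          "text": { "weights": { "name": 10, "name_phonetic": 1 } } }
      ]
    }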
Regarding your question as to why 'Soper, Barry Russell' is ranked lower than 'Barry Spear': after phonetic analysis, the words 'soper' and 'spear' reduce to the same form at both indexing and query time and are treated as if they were identical terms. In computing the score and ranking, the search engine uses the analyzed form of the terms, so phonetic similarity has no influence on the score. That's why secondary factors, like field length, play a more significant role in the relevance score.
Hope this helps. I provided one example to model this, but you could also take a look at term boosting in the full Lucene query syntax.
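For example (field names hypothetical), the full Lucene syntax lets you boost the exact field over the phonetic one directly in the query string:

    name:"barry soper"^10 name_phonetic:"barry soper"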
Let me know if you have any additional questions.
Nate

Stemming + wildcarding: unexpected effects

I am editing a Lucene.Net implementation (2.3.2) at work to include stemming and automatic wildcarding (adding of * at the ends of words).
I have found that exact words with wildcarding don't work (so stack* gets a hit for stackoverflow, but stackoverflow* does not), and I was wondering what causes this and how it might be fixed.
Thanks in advance. (Also thanks for not asking why I am implementing both automatic wildcarding and stemming.)
I am about to make the query always a prefix query so I don't have to do any nasty adding of "*"s to queries; we will see if anything becomes clear then.
Edit: Only words that are stemmed do not work wildcarded. For example, Silicate* doesn't work, but silic* does.
The reason it doesn't work is that you stem the content, thus changing the Term.
For example consider the word "valve". The snowball analyzer will stem it down to "valv".
So at search time, since you stem the input query, both "valve" and "valves" will be stemmed down to "valv". A TermQuery using the stemmed Term "valv" will yield a match on both "valve" and "valves" occurrences.
But now, since in the Index you stored the Term "valv", a query for "valve*" will not match anything. That is because the QueryParser does not run the Analyzer on Wildcard Queries.
There is the AnalyzingQueryParser that can handle some of these cases, but I don't think it was in the 2.3.x versions of Lucene. Anyway, it's not a universal fit; the documentation says:
Warning: This class should only be used with analyzers that do not use stopwords or that add tokens. Also, several stemming analyzers are inappropriate: for example, GermanAnalyzer will turn Häuser into hau, but H?user will become h?user when using this parser and thus no match would be found (i.e. using this parser will be no improvement over QueryParser in such cases).
The solution mentioned in the duplicate I linked works for all cases, but you will get bigger indexes.
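A related workaround sketch (not the linked solution itself, and Lucene.Net 2.x-era API assumed): since QueryParser won't analyze wildcard terms, you can stem the prefix yourself before building a PrefixQuery, so that it matches what is actually stored in the index. This only helps when the prefix is itself a whole stemmable word, like 'silicate'.

    using System.IO;
    using Lucene.Net.Analysis;
    using Lucene.Net.Analysis.Snowball;
    using Lucene.Net.Index;
    using Lucene.Net.Search;

    static Query BuildStemmedPrefixQuery(string field, string prefix)
    {
        // Run the analyzer over the prefix by hand; QueryParser skips this
        // step for wildcard and prefix queries.
        var analyzer = new SnowballAnalyzer("English");
        TokenStream stream = analyzer.TokenStream(field, new StringReader(prefix));

        // Take the first token produced: "valve" comes out as "valv".
        Token token = stream.Next();
        string stemmed = token == null ? prefix : token.TermText();

        return new PrefixQuery(new Term(field, stemmed));
    }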

Code related web searches

Is there a way to search the web which does NOT remove punctuation? For example, I want to search for window.window->window (yes, I actually do; this is a structure in Mozilla plugins). I figure that this HAS to be a fairly rare string.
Unfortunately, Google, Bing, AltaVista, Yahoo, and Excite all strip the punctuation and just show anything with the word "window" in it. And according to Google, on their site, at least, there is NO WAY AROUND IT.
In general, searching for chunks of code must be hard for this reason... anyone have any hints?
Google Code Search (I tried "window.window->window", but it doesn't seem to get any relevant results out of this request)
There are similar tools all over the internet, like Codase or Koders, but I'm not sure they let you search for exactly this string. Anyway, they might be useful to you, so I think they're worth mentioning.
edit: It is very unlikely you'll find a general-purpose search engine that will allow you to search for something like "window.window->window", because most search engines will do some processing on the document before storing it. For instance, they might represent it internally as vectors of words (a vector space model) and use those to do the search, not the actual original string. And creating such a vector involves first cutting the document up according to punctuation and other critters. This is a very complex and interesting subject which I can't tell you much more about; my bad memory did a pretty good job, since I studied it at school!
BTW, they might do the same kind of processing on your query too. You might want to read about tf-idf, which is probably light years from what Google and its friends are doing, but it can give you a hint about what happens to your query.
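As a toy illustration of the idea (nothing like a real engine's scoring): a term's tf-idf weight grows with how often it occurs in a document and shrinks with how many documents contain it.

    using System;

    // Classic tf-idf: term frequency in this document times the (log) inverse
    // of the fraction of documents containing the term.
    static double TfIdf(int termCountInDoc, int docLength,
                        int totalDocs, int docsContainingTerm)
    {
        double tf = (double)termCountInDoc / docLength;
        double idf = Math.Log((double)totalDocs / docsContainingTerm);
        return tf * idf;
    }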
There is no way to do that by itself in the main Google engine, as you discovered. However, if you are looking for information about Mozilla, then the best bet would be to structure your query something more like this:

    "window.window->window" +Mozilla OR +XUL
    + <another search string related to what you are trying to do>
SymbolHound is a web search engine that does not remove punctuation from queries. It has an option to search source code repositories (like the now-discontinued Google Code Search), but it can also search the web for special characters, primarily on programming-related sites such as StackOverflow.
try it here: http://www.symbolhound.com
-Tom (co-founder)

Problems using MySQL FULLTEXT indexing for programming-related data (SO Data Dump)

I'm trying to implement a search feature for an offline-accessible StackOverflow, and I'm noticing some problems with using MySQL's FULLTEXT indexing.
Specifically, by default FULLTEXT indexing is restricted to words between 4 and 84 characters long. Terms such as "PHP" or "SQL" would not meet the minimum length and searching for those terms would yield no results.
It is possible to modify the variable that controls the minimum length a word must have to be indexed (ft_min_word_len), but this is a system-wide change that requires the indexes in all databases to be rebuilt. On the off chance others find this app useful, I'd rather keep these sorts of variables as vanilla as possible. I found a post on this site the other day stating that changing that value is just a bad idea anyway.
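For reference, the change would look roughly like this (table name hypothetical; MyISAM assumed, since FULLTEXT indexes required it at the time): a server-wide config edit followed by a rebuild of each FULLTEXT index.

    # my.cnf / my.ini - server-wide setting
    [mysqld]
    ft_min_word_len = 3

    -- after restarting mysqld, rebuild each FULLTEXT index:
    REPAIR TABLE posts QUICK;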
Another issue is with terms like "VB.NET" where, as far as I can tell, the period in the middle of the term separates it into two indexed values - VB and NET. Again, this means searches for "VB.NET" would return nothing.
Finally, since I'm doing a direct dump of the monthly XML-based dumps, all values are converted to HTML Entities and I'm concerned that this might have an impact on my search results.
I found a blog post which tries to address these issues with the following advice:
keep two copies of your data - one with markup, etc. for display, and one modified for searching (remove unwanted words, markup, etc.); see the sketch after this list
pad short terms so they will be indexed, I assume with a prefix/suffix.
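A sketch of that two-copy layout (table and column names made up; MyISAM assumed, since FULLTEXT required it back then):

    CREATE TABLE posts (
        id          INT PRIMARY KEY,
        body_html   MEDIUMTEXT,  -- original markup, for display
        body_search MEDIUMTEXT,  -- entities decoded, markup stripped, short terms padded
        FULLTEXT KEY ft_body (body_search)
    ) ENGINE=MyISAM;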
What I'd like to know is, are these really the best workarounds for these issues? It seems like semi-duplicating a > 1GB table is wasteful, but maybe that's just me.
Also, if anyone could recommend a good site to understand MySQL's FULLTEXT indexing, I'd appreciate it. To keep this question from being too cluttered, please leave the site recommendations in the question comments, or email me directly (address on my user profile).
Thanks!
Additional Info:
I think I should clarify a couple of things.
I know "MySQL" tends to lead to the assumption of "web application", but that's not what I'm going for here. I could install Apache and PHP and run things that way, but I'm trying to keep this light. I can use my website for playing with PHP, so I don't feel the need to install it on my home machine too. I also hope this could be useful for others, and I don't want to force anyone else into installing a bunch of extra utilities. I went with MySQL since it was easy, and installing some sort of DB was unavoidable.
The specifics of the project were going to be:
Desktop application written in C# (WinForms)
MySQL backend
I'm starting to wonder if I should just say to hell with it and install everything I'd need to make this an (offline) webapp. As much as we'd all like to think our pet project is going to be used and loved by the community at large, I should know by now that this is likely going to end up being used by a single user.
From what was already said, I understand that MySQL FULLTEXT is not for you ;) But why stick to MySQL? Try Sphinx:
http://www.sphinxsearch.com/
It will solve most of your problems.

Semantic analysis using Solr

I'm considering about adding semantic analysis to my Solr installation, but I don't exactly know where to start.
Basically, I'd like Solr to be able to find "similar" words (taken from the body of the indexed documents).
For example, if I search for "music", I should be able to query the semantic engine and obtain "rock", "pop", etc. (of course, only if these words appear near "music" in some of the indexed documents).
I found this project, but I don't know if it is the correct place to start:
http://code.google.com/p/semanticvectors/
Semantic indexing is a good place to start. However, in my experience, these kinds of technologies don't work that well in practice. You often end up with very bizarre results. Also, because of Google, people have a certain expectation of how keyword search should behave - i.e., your search term should appear in the matching document.
You may use the Lucene WordNet contrib package to look for synonyms.
Optimizing Findability in Lucene and Solr gives other ways to expand queries.
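For the hand-curated synonym route on the Solr side, wiring a synonym filter into a field type in schema.xml is a small change (type name and file contents illustrative); SemanticVectors-style analysis is what you would add if you want the "related words" discovered automatically from the corpus instead:

    <fieldType name="text_syn" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- synonyms.txt contains lines such as: music, rock, pop -->
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
                ignoreCase="true" expand="true"/>
      </analyzer>
    </fieldType>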