Skipping optimization causes wrong search results - lucene

I just took over our Solr/Lucene code from a former colleague, and there is a weird bug.
If the index is not optimized after a data import, or more precisely if there are multiple segment files, the search results are wrong. We are using a customized Solr SearchComponent. As far as I know, optimization should not affect search results in Lucene. I suspect this may be related to multithreading, an unclosed searcher/reader, or something similar.
Can anybody help? Thank you.

This is still a guess. I found a custom Lucene filter that is used by the custom search component. In that filter, SolrIndexSearcher.search is called against the filter queries. Chances are high that this is the cause.
This could be a hint for people who are familiar with Lucene.
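One possible explanation, offered as an assumption rather than a diagnosis: since Lucene 2.9, searches run segment by segment, and a filter is asked for document IDs relative to each segment, while a top-level searcher reports index-wide IDs. With a single optimized segment the two numberings coincide, which would fit the symptom. The Python sketch below only models the numbering; the segment sizes and function names are hypothetical, and it is not Lucene code.

```python
from bisect import bisect_right

# Hypothetical index with three segments holding 4, 3 and 5 documents.
segment_sizes = [4, 3, 5]

# doc_bases[i] is the index-wide ID of the first document in segment i.
doc_bases = [0]
for size in segment_sizes[:-1]:
    doc_bases.append(doc_bases[-1] + size)


def to_segment_local(global_id):
    """Map an index-wide doc ID to (segment number, ID within that segment)."""
    seg = bisect_right(doc_bases, global_id) - 1
    return seg, global_id - doc_bases[seg]


def split_by_segment(global_ids):
    """Group index-wide IDs into per-segment lists of segment-local IDs."""
    per_segment = {i: [] for i in range(len(segment_sizes))}
    for gid in global_ids:
        seg, local = to_segment_local(gid)
        per_segment[seg].append(local)
    return per_segment


# IDs as a top-level search would report them (hypothetical hits).
hits = [1, 5, 9]
print(split_by_segment(hits))
```

If a filter handed the raw IDs 1, 5 and 9 to every segment unchanged, each segment would read 1 as its own second document, while 5 and 9 would exceed every segment's range in this example.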

Related

Solr spellcheck vs fuzzy search

I don't quite understand the difference between Apache Solr's spellcheck and fuzzy search functionality.
I understand that fuzzy search matches your search term with the indexed value based on some difference expressed in distance.
I also understand that spellcheck gives you suggestions based on how close your search term is to a value in the index.
So to me those two things are not that different though I am sure that this is due to my shortcoming in understanding each feature thoroughly.
If anyone could provide an explanation preferably via an example, I would greatly appreciate it.
Thanks,
Bob
I'm not a Solr professional, but I'll try to explain.
Fuzzy search is a simple instruction that tells Solr to apply a kind of spellchecking at query time. Solr's standard query parser supports fuzzy search, and you can use it without any additional settings, for example: roam~ or roam~1. This so-called spellchecking uses the Damerau-Levenshtein distance (edit distance) algorithm.
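To make the edit distance idea concrete, here is a minimal Python sketch of the optimal string alignment variant of Damerau-Levenshtein distance, where an insertion, deletion, substitution or swap of two adjacent characters each counts as one edit. It illustrates the metric only and is not how Solr computes matches internally; the function names and the vocabulary are hypothetical.

```python
def osa_distance(a, b):
    """Optimal string alignment distance between strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i          # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j          # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Swap of two adjacent characters counts as a single edit.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def fuzzy_matches(term, vocabulary, max_edits):
    """Return the vocabulary words within max_edits of term."""
    return [w for w in vocabulary if osa_distance(term, w) <= max_edits]


# Hypothetical indexed terms.
vocabulary = ["roam", "roams", "foam", "room", "raom", "roaming", "table"]
print(fuzzy_matches("roam", vocabulary, max_edits=1))
```

In this sketch, max_edits=1 plays the role of the 1 in roam~1.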
To use spellchecking you need to configure it in solrconfig.xml (please see here). It gives you some flexibility in how spellchecking is implemented (there are a couple of out-of-the-box implementations); for example, you can use a separate index for spellchecking and thereby decrease the load on the main index. Spellchecking also uses another URL, /spell, so it is not a search query like a fuzzy query.
Why should I use spellchecking or fuzzy search? I guess it depends on your server load, because fuzzy search is more expensive and not recommended by the Solr team.
P.S. This is my understanding of fuzzy search and spellchecking, so if somebody has a more correct and clear explanation, please advise us on how to deal with them.

incremental query vs. continuous query

I know that a continuous query is a query that is registered once and evaluated continuously over a data stream. But I don't understand what an incremental query is. I am reading about continuous data streams and the way we query for a specific pattern in the stream.
Can anyone explain to me what an incremental query is? An explanation with an example would be really helpful.
After googling a lot I found some definitions, but none of them explains it clearly.
UPDATE:
I can't find the exact paper in which I first saw this term, but in this paper it appears on page 6.
You might already have researched incremental algorithms; I think that is what you're looking for.
I have never heard of an 'incremental' query. However, it sounds a lot like Doctrine's schema update command, described here in Symfony's docs.
Food for thought until someone comes up with a better answer :)

Lucene.NET Faceted Search

I am building a faceted search with Lucene.NET, not using Solr. I want to get a list of navigation items within the current query. I just want to make sure I'm pointed in the right direction. I've got an idea in mind that will work, but I'm not sure if it's the right way to do this.
My plan at the moment is to create a hierarchy of all available filters, then walk through the list using the technique described here to get a count for each, excluding filters which produce zero results. Does that sound alright, or am I missing something?
Yeah, you're missing Solr. The math behind its faceted search is very impressive, and there is almost no good reason not to use it. The only exception I can find is if your index is small enough that you can roll your own; otherwise, it's a good idea to stand on their shoulders.
Ok, so I finished my implementation. I did a lot of digging in the Lucene and Solr source code in the process, and I'd recommend not using the implementation described in the linked question for several reasons, not the least of which is that it relies on a deprecated method. It is needlessly clever; just writing your own collector will get you faster code that uses less RAM.
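The "write your own collector" idea can be sketched as a single pass over the matching documents that tallies every facet value as it goes, so values with zero results simply never appear. Python stands in for Lucene.NET here; the class name, fields and sample documents are all hypothetical, and this is not the implementation described in the answer above.

```python
from collections import Counter

# Hypothetical stored facet values, keyed by document ID.
documents = {
    0: {"brand": "acme", "color": "red"},
    1: {"brand": "acme", "color": "blue"},
    2: {"brand": "globex", "color": "red"},
    3: {"brand": "initech", "color": "red"},
}


class FacetCountingCollector:
    """Tallies facet values for each matching document it is handed."""

    def __init__(self, facet_fields):
        self.counts = {field: Counter() for field in facet_fields}

    def collect(self, doc_id):
        doc = documents[doc_id]
        for field, counter in self.counts.items():
            value = doc.get(field)
            if value is not None:
                counter[value] += 1


# Pretend the current query matched documents 0, 1 and 3.
collector = FacetCountingCollector(["brand", "color"])
for doc_id in [0, 1, 3]:
    collector.collect(doc_id)

for field, counter in collector.counts.items():
    print(field, dict(counter))
```

Each hit is visited once regardless of how many filters exist, which is the main difference from running one counting query per filter.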

Code related web searches

Is there a way to search the web which does NOT remove punctuation? For example, I want to search for window.window->window (Yes, I actually do, this is a structure in mozilla plugins). I figure that this HAS to be a fairly rare string.
Unfortunately, Google, Bing, AltaVista, Yahoo, and Excite all strip the punctuation and just show anything with the word "window" in it. And according to Google, on their site, at least, there is NO WAY AROUND IT.
In general, searching for chunks of code must be hard for this reason... anyone have any hints?
Google Code Search (searching for "window.window->window" doesn't seem to return any relevant results, though)
There are similar tools all over the internet, like codase or koders, but I'm not sure they let you search for exactly this string. Anyway, they might be useful to you, so I think they're worth mentioning.
Edit: It is very unlikely that you'll find a general-purpose search engine that lets you search for something like "window.window->window", because most search engines process a document before storing it. For instance, they might represent it internally as a vector of words (a vector space model) and search against that rather than the original string. Creating such a vector involves first splitting the document on punctuation and other delimiters. This is a very complex and interesting subject which I can't tell you much more about. My bad memory did a pretty good job since I studied it at school!
BTW, they might do the same kind of processing on your query too. You might want to read about tf-idf, which is probably light years from what Google and its peers are doing but can give you a hint about what happens to your query.
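As a rough illustration of that processing, the Python sketch below splits text on punctuation and then builds tf-idf weighted word vectors. It is a textbook toy under stated assumptions, not what any real search engine does; the tokenizer rule, the sample documents and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep only runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tf_idf_vectors(docs):
    """Return one {term: weight} dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append(
            {t: (c / len(tokens)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return vectors


# The punctuation is gone before anything is indexed.
print(tokenize("window.window->window"))

# Hypothetical mini corpus.
docs = [
    "window.window->window in a Mozilla plugin",
    "close the window",
    "open a window",
]
for vector in tf_idf_vectors(docs):
    print(vector)
```

After tokenizing, the query is indistinguishable from the word "window" repeated three times, and because "window" occurs in every sample document its tf-idf weight is zero.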
There is no way to do that directly in the main Google engine, as you discovered. However, if you are looking for information about Mozilla, then your best bet would be to structure your query more like this:
"window.window->window" +Mozilla
OR +XUL
+ Another search string related to what you are
trying to do.
SymbolHound is a web search engine that does not remove punctuation from queries. There is an option to search source code repositories (like the now-discontinued Google Code Search), but it can also search the Internet for special characters (primarily on programming-related sites such as StackOverflow).
try it here: http://www.symbolhound.com
-Tom (co-founder)

Problems using MySQL FULLTEXT indexing for programming-related data (SO Data Dump)

I'm trying to implement a search feature for an offline-accessible StackOverflow, and I'm noticing some problems with using MySQL's FULLTEXT indexing.
Specifically, by default FULLTEXT indexing is restricted to words between 4 and 84 characters long. Terms such as "PHP" or "SQL" would not meet the minimum length and searching for those terms would yield no results.
It is possible to modify the variable that controls the minimum length a word must have to be indexed (ft_min_word_len), but this is a system-wide change requiring indexes in all databases to be rebuilt. On the off chance others find this app useful, I'd rather keep these sorts of variables as vanilla as possible. I found a post on this site the other day stating that changing that value is just a bad idea anyway.
Another issue is with terms like "VB.NET" where, as far as I can tell, the period in the middle of the term separates it into two indexed values - VB and NET. Again, this means searches for "VB.NET" would return nothing.
Finally, since I'm doing a direct dump of the monthly XML-based dumps, all values are converted to HTML Entities and I'm concerned that this might have an impact on my search results.
I found a blog post which tries to address these issues with the following advice:
keep two copies of your data - one with markup, etc. for display, and one modified for searching (remove unwanted words, markup, etc)
pad short terms so they will be indexed, I assume with a pre/suffix.
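The two pieces of advice above could be combined roughly as follows. This Python sketch builds the search-only copy of a post: it decodes HTML entities, drops markup, joins dotted terms and pads short words to the minimum indexed length. The padding character, the dot-joining rule and the function name are assumptions for illustration, not something the blog post specifies.

```python
import html
import re

MIN_WORD_LEN = 4   # minimum indexed word length quoted in the question
PAD_CHAR = "_"     # hypothetical padding character


def to_search_text(raw):
    """Build the search-only copy of a post body (naive sketch)."""
    text = html.unescape(raw)                    # &lt;p&gt; becomes <p>
    text = re.sub(r"<[^>]+>", " ", text)         # naive removal of markup tags
    text = re.sub(r"(?<=\w)\.(?=\w)", "", text)  # VB.NET becomes VBNET
    words = re.findall(r"\w+", text.lower())
    # Pad short words on the right so they reach the minimum length.
    return " ".join(w.ljust(MIN_WORD_LEN, PAD_CHAR) for w in words)


print(to_search_text("&lt;p&gt;Using VB.NET with SQL &amp; PHP&lt;/p&gt;"))
```

The user's search terms would need to go through the same function before being used in the query, otherwise "PHP" would never match the padded form stored in the search column.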
What I'd like to know is, are these really the best workarounds for these issues? It seems like semi-duplicating a > 1GB table is wasteful, but maybe that's just me.
Also, if anyone could recommend a good site for understanding MySQL's FULLTEXT indexing, I'd appreciate it. To keep this question from being too cluttered, please leave site recommendations in the question comments, or email me directly at the site on my user profile.
Thanks!
Additional Info:
I think I should clarify a couple of things.
I know "MySQL" tends to lead to the assumption of "web application", but that's not what I'm going for here. I could install Apache and PHP and run things that way, but I'm trying to keep this light. I can use my website for playing with PHP, so I don't feel the need to install it on my home machine too. I also hope this could be useful for others as well, and I don't want to force anyone else into installing a bunch of extra utilities. I went with MySQL since it was easy and needing to install some sort of DB was unavoidable.
The specifics of the project were going to be:
Desktop application written in C# (WinForms)
MySQL backend
I'm starting to wonder if I should just say to hell with it and install everything I'd need to make this an (offline) webapp. As much as we'd all like to think our pet project is going to be used and loved by the community at large, I should know by now that this is likely going to end up being used by only a single user.
From what has already been said, I understand that MySQL FULLTEXT is not for you ;) But why stick to MySQL? Try Sphinx:
http://www.sphinxsearch.com/
It will solve most of your problems.