sql server 2005 full text index query to help find noise words in content - sql-server-2005

Is there a way to query a full text index to help determine additional noise words? I would like to add some custom noise words and wondered if theres a way to analyse the index to help determine suggestions.

As simple as in
http://arcanecode.com/2008/05/29/creating-and-customizing-noise-words-in-sql-server-2005-full-text-search/
where this is explained (how to do it). Coming up with proper ones, though, is hard.

I decided to look into lucene.net because I wasn't happy with the relevance calculations in sql server full text indexing.
I managed to figure out how to index all the content pretty quickly and then used Luke to find noise words. I have now edited the sql server noise files based on this analysis. Now I have a search solution that works reasonably well using sql server full text indexing, but I plan to move to lucene.net in the future.
Using sql server full text indexing as a base, I developed a domain centric approach to finding relevant content using tool I understood. After some serious thinking and testing, I used many other measures to determine the relevance of a search result other than what is provided by analysing text content for term frequency and word distance. SQL Server full text indexing provided me a great start, and now I have a strategy I can express using lucene that will work very well.
It would have taken me a whole lot longer to understand lucene, and develop a strategy for the search. If anyone out there is still reading this, use full text indexing for testing your idea and then move to lucene once you have a strategy you know will work for your domain.

Related

Solr spellcheck vs fuzzy search

I don't quite understand the difference between apache solr's spell check vs fuzzy search functionality.
I understand that fuzzy search matches your search term with the indexed value based on some difference expressed in distance.
I also understand that spellcheck also give you suggestions based on how close your search term is to a value in the index.
So to me those two things are not that different though I am sure that this is due to my shortcoming in understanding each feature thoroughly.
If anyone could provide an explanation preferably via an example, I would greatly appreciate it.
Thanks,
Bob
I'm not a professional in the Solr but I try to explain.
Fuzzy search is a simple instruction for Solr to use a kind of spellchecking during requests - Solr’s standard query parser supports the fuzzy search and you can use this one without any additional settings, for example: roam~ or roam~1. And this so-colled spellcheking is used a Damerau-Levenshtein Distance or Edit Distance algorithm.
To use spellchecking you need to configure it in the solrconfig.xml (please, see here). It gives you sort of flexibility how to implement spellcheking (there are a couple of OOTB implementation) so, for example, you can use another index for spellcheck thereby you decrease load on main index. Also for spellchecking you use another URL: /spell so it is not a search query like fuzzy query.
Why should I use spellcheking or fuzzy search? I guess it is depended on your server loading because the fuzzy search is more expensive and not recommended by the Solr team.
P.S. It is my understanding of fuzzy and spellcheking so if somebody has more correct and clear explanation, please, give us advice how to deal with them.

Search String parsing algorithm

I am writing a prototype of a new app for an enterprise. I want to include a great search engine, which is something they have never had before. What I am looking for is something that can translate a lucene style query language into SQL statements on a key value pair data model. (three fields, grouping id, key, value)
Ive been looking for a while now and havn't had any luck. Im about to open the source for lucene and see if I can pull the query algorithms out and have them generate sql instead of index search commands. but im not very hopefull.
I can't just run lucene or any other indexing system on this enterprise for political and regulatory reasons so thats not an option.
Does this type of system exist?
see if I can pull the query algorithms out and have them generate sql instead
Don't waste your time. SQL and Lucene queries work in a completely different way; this is because they use different underlying data structures, algorithms, etc.
The best you can do is to write SQL query parser and rewrite those queries into Lucene queries. But you'd have to be naive to think you can write full-blown SQL query parser. You can easily solve simple cases, but what are you going to do when somebody sends you a JOIN? Or a GROUP BY bar HAVING foo>3?
If you can't jump over political hurdles, just use one of the full text indexing algorithms databases can offer; this is better than nothing.

Using Lucene QueryAPI to access SQL

Can you advise on whether I can use just the Query functionality from Lucene to generate SQL queries? Something like an SQLQueryBuilder?
I have a massive SQL database of logs from a webserver cluster containing the original request and response strings plus some other useful/less bits and bobs. What I need to do is analyse the parameters in the original request and compare with the generated responses, looking at ratios, volatility, variability, consistency etc.
This question does not relate to the analysis stage, but only the retrieval of data from database which matches the parameters I'm interested in. So, I could just do this in good old sql queries, manually building the exact queries I need on a case-by-case basis. But that's kinda lame; I reckon we can be a bit smarter than that. Particularly as I can already see large numbers of similar but subtly different queries being useful. And as I'm hoping that I can expose a single search box via a web interface to non-technical end-users, adding sql queries seems like a bad idea... and a recipe for permanent maintenance requests (and can I be the first to say, er no thanks!).
In an ideal world I expose a search form, with the option to write simple queries like
request:"someAttribute=\"someValue\"" AND response="some hoped for result" AND daterange:30
which would then hopefully find all instances of requests which contain someAttribute="someValue" over the last 30 days. The results will then be put through standard statistical analyses on the given response text and printed out on-screen. At least, that's the idea.
Much of the actual logic to determine how to handle custom field definitions or special words I'll need to write myself, and that's ok. And NB, my non-technical end users are familiar enough with xml that they can handle a bit of attr="value" syntax, at least for the first iteration of the tool :D
In summary, I want to:
1) allow users to use google-like search syntax (e.g. via Lucene's QueryAPI) to specify text to match in the logs
2) allow a layer to manipulate the query based on special words or fields (e.g. this layer could be during a Java object phase)
3) convert the final query into an sql query appropriate for my database schema
4) query the database and spit back the resultset for statistical analysis
5) pretty-print on website:)
Am I completely barking up the wrong tree? It looks like it should be possible, but I can't seem to find much on it. I've been googling for a bit on this, for example trying "Lucene SQLQueryBuilder" as a possible start but didn't really find much by way of a lead.
So, my questions are:
Has anyone tried using Lucene's QueryAPI like this before? Did it work? Any gotchas?
Are there better query api libraries out there?
Examples, finished discussions and open-source implementations would be most helpful.
Many thanks.
NB: I don't think I want Lucene's search capabilities as such, as I'm only ever looking for exact matches. I just need a query layer on top of the database.
Lucene and SQL have very little in common as they're using totally different syntax (as HefferWolf mentioned) and different underlying data models. As you said yourself, I'm afraid you're barking the wrong tree.
There are however attempts, such as Hibernate Search to bridge this gap. These are interesting experiments as such, but I would be very careful to use any of that code in production.
You could possibly use Full Text Search features available in some SQL databases, or reindex all data in Lucene and use it without database.
I doubt you can reuse any code from lucene for this. Lucene does an internal rewrite of such queries but into a syntax which wouldn't be of much help for SQL I think.
name: Phil AND lastname: Miller AND NOT age: 26
would be rewritten to
+name Phil +lastname: Miller -age: 26
So I think you would have to write your on transition into a SQL Query syntax.
But maybe you can use Lucene as such for this. Have a look into hibernate-search which is quite handy to easily create a lucene index of a sql table.

Relevant Search Results Across Multiple Databases

I have three databases that all have the contents of several web pages in them. What would be the best way to go about searching all three and having the most relevant web page at the top of the search results?
The only way I can think of is break down content by word count and/or creating a complex set of search rules to give one content priority over another. This might be more trouble than what it's worth, but I was wondering if anybody knows a way or product out there that would be able to help me.
To further support Ivans answer above Lucene is the way to go. You haven't mentioned what platform you're on so I'll point out that you can use a .NET port of this too.
If you do use Lucene there is a very good book from Manning on the subject which I recommend you look at.
When it comes to populating your index, you have a couple of choices. For starters you can just dump all of your text into the index and allow the engine to just search on it. However, I'd recommend adding fixed fields to your index which will allow you to support things such as partitioned searches or searches against those fields only.
To explain, lets say you have a field for the website. Then you can partition your index by restricting the index search to those documents that have that website in that field.
The other process is to extract points of interest from your document and allow searches on those without searching the entire index entry. Your mileage may vary with this as the lucene engine is very well written so it may simply allow you to collect your searches into more logical units which helps you with your solution.
I've done this myself and it helps when answering management questions about what exactly is searched and indexed.
HTH!
If you're using MS SQL Server then the full text search can return a ranking for you. I haven't used it, so you'll need to check the documentation or online for specifics.

Can someone give me a high overview of how lucene.net works?

I have an MS SQL database and have a varchar field that I would like to do queries like where name like '%searchTerm%'. But right now it is too slow, even with SQL enterprise's full text indexing.
Can someone explain how Lucene .Net might help my situation? How does the indexer work? How do queries work?
What is done for me, and what do I have to do?
I saw this guy (Michael Neel) present on Lucene at a user group meeting - effectively, you build index files (using Lucene) and they have pointers to whatever you want (database rows, whatever)
http://code.google.com/p/vinull/source/browse/#svn/Examples/LuceneSearch
Very fast, flexible and powerful.
What's good with Lucene is the ability to index a variety of things (files, images, database rows) together in your own index using Lucene and then translating that back to your business domain, whereas with SQL Server, it all has to be in SQL to be indexed.
It doesn't look like his slides are up there in Google code.
This article (strangely enough it's on the top of the Google search results :) has a fairly good description of how the Lucene search could be optimised.
Properly configured Lucene should easily beat SQL (pre 2005) full-text indexing search. If you on MS SQL 2005 and your search performance is still too slow you might consider checking your DB setup.