Lucene.NET with SQL Server 2000

I have a SQL Server 2000 database with around 10 million rows, and I need to query product information using full or partial text search.
Based on the results, I need to join back to other tables to check my business process.
I have implemented this as a SQL stored procedure, but I can only validate around 6 rows per second (without threads; the business logic is long). I am trying to find ways to improve performance.
Lucene.NET might help with this. I have a couple of questions:
Can you point me to the right sources?
While building the Lucene index, how would I keep it in sync with the SQL database?
Do you think Lucene can give a real performance gain?

You can start with Mark Krellenstein's 'Search Engine versus DBMS', to see whether a full text search engine, such as Lucene, is the solution for you. In theory, Lucene should be faster than SQL for textual search, but your mileage may vary.
You can do incremental updates with Lucene, which are a bit similar to database replication. This keeps the Lucene index synchronized with the database.
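As a rough illustration of the incremental update idea, here is a minimal sketch in Python. It is not Lucene.NET code: an in-memory SQLite table stands in for the SQL Server database, a plain dict stands in for the Lucene index, and the `products` table, its `version` column, and the `sync_index` function are all hypothetical names. It assumes every row carries a value that increases on each change (in SQL Server a `timestamp`/rowversion column could play that role), so each run only reads rows changed since the previous run.

```python
import sqlite3

def sync_index(conn, index, last_seen):
    """Copy rows changed since `last_seen` into `index`; return the new watermark."""
    rows = conn.execute(
        "SELECT id, name, version FROM products WHERE version > ? ORDER BY version",
        (last_seen,),
    ).fetchall()
    for row_id, name, version in rows:
        # In a real search index this would replace the document stored for row_id.
        index[row_id] = name
        last_seen = version
    return last_seen

# Stand-in database (assumption: SQLite instead of SQL Server, for a runnable demo).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, version INTEGER)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [(1, "red widget", 1), (2, "blue widget", 2)],
)

index = {}
watermark = sync_index(conn, index, 0)          # initial full load

conn.execute("UPDATE products SET name = 'green widget', version = 3 WHERE id = 1")
watermark = sync_index(conn, index, watermark)  # picks up only the changed row
```

Note that this sketch does not handle deleted rows; those would need something extra, such as a "deleted" flag or a log table.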

Here is an article on using LINQ to Lucene to work with SQL. This may point you in the right direction.

Related

Make search query with LIKE faster

I have an application where search queries take too much time. There are different search queries, and most of them use the LIKE operator (with '%__%'). I need some general guidelines (dos and don'ts) for writing better and faster search queries.
Querying in SQL with a wildcard at the end is quite efficient, but there is definitely a performance issue with having a wildcard at the beginning.
One way around this is Full Text Search, which SQL Server 2008 supports.
You'll have to change how your query works to utilize full text indexing, but it should dramatically improve your text querying performance.
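To see why a trailing wildcard is cheap and a leading wildcard is not, here is a small sketch in Python. It assumes a sorted list behaves roughly like an ordered index; the function names and sample data are made up for illustration, and this is not how any particular database implements LIKE.

```python
from bisect import bisect_left

names = sorted(["apple", "applesauce", "banana", "grape", "pineapple"])

def prefix_search(sorted_names, prefix):
    """Like LIKE 'prefix%': binary-search to the first match, then read forward."""
    start = bisect_left(sorted_names, prefix)
    matches = []
    for name in sorted_names[start:]:
        if not name.startswith(prefix):
            break  # sorted order guarantees no later entry can match
        matches.append(name)
    return matches

def contains_search(sorted_names, fragment):
    """Like LIKE '%fragment%': sort order does not help, so every entry is checked."""
    return [name for name in sorted_names if fragment in name]

prefix_hits = prefix_search(names, "apple")
contains_hits = contains_search(names, "apple")
```

The prefix search stops as soon as it leaves the matching range, while the substring search has to look at every entry, including "pineapple", which it finds only because it scans the whole list.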

Search String parsing algorithm

I am writing a prototype of a new app for an enterprise. I want to include a great search engine, which is something they have never had before. What I am looking for is something that can translate a lucene style query language into SQL statements on a key value pair data model. (three fields, grouping id, key, value)
I've been looking for a while now and haven't had any luck. I'm about to open the Lucene source and see if I can pull the query algorithms out and have them generate SQL instead of index search commands, but I'm not very hopeful.
I can't just run Lucene or any other indexing system at this enterprise for political and regulatory reasons, so that's not an option.
Does this type of system exist?
"see if I can pull the query algorithms out and have them generate sql instead"
Don't waste your time. SQL and Lucene queries work in a completely different way; this is because they use different underlying data structures, algorithms, etc.
The best you can do is to write a SQL query parser and rewrite those queries into Lucene queries. But you'd have to be naive to think you can write a full-blown SQL query parser. You can easily solve simple cases, but what are you going to do when somebody sends you a JOIN? Or a GROUP BY bar HAVING foo>3?
If you can't get over the political hurdles, just use one of the full-text indexing features your database offers; this is better than nothing.

sql server 2005 full text index query to help find noise words in content

Is there a way to query a full text index to help determine additional noise words? I would like to add some custom noise words and wondered if there's a way to analyse the index to help determine suggestions.
Adding custom noise words is simple; how to do it is explained here:
http://arcanecode.com/2008/05/29/creating-and-customizing-noise-words-in-sql-server-2005-full-text-search/
Coming up with the right ones, though, is hard.
I decided to look into lucene.net because I wasn't happy with the relevance calculations in sql server full text indexing.
I managed to figure out how to index all the content pretty quickly and then used Luke to find noise words. I have now edited the sql server noise files based on this analysis. Now I have a search solution that works reasonably well using sql server full text indexing, but I plan to move to lucene.net in the future.
Using SQL Server full text indexing as a base, I developed a domain-centric approach to finding relevant content using a tool I understood. After some serious thinking and testing, I used many other measures to determine the relevance of a search result beyond what is provided by analysing text content for term frequency and word distance. SQL Server full text indexing gave me a great start, and now I have a strategy I can express using Lucene that will work very well.
It would have taken me a whole lot longer to understand lucene, and develop a strategy for the search. If anyone out there is still reading this, use full text indexing for testing your idea and then move to lucene once you have a strategy you know will work for your domain.
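The kind of analysis mentioned above (indexing the content and looking for terms that show up almost everywhere) can be approximated without any search engine. The following is only a sketch of one simple heuristic, document frequency, written in Python; the function name, the threshold, and the sample documents are invented for illustration, and it is not what Luke or SQL Server does internally.

```python
from collections import Counter
import re

def noise_word_candidates(documents, min_fraction=0.8):
    """Return words that occur in at least `min_fraction` of the documents."""
    doc_freq = Counter()
    for text in documents:
        # Count each word once per document, regardless of how often it repeats.
        doc_freq.update(set(re.findall(r"[a-z']+", text.lower())))
    threshold = min_fraction * len(documents)
    return sorted(word for word, count in doc_freq.items() if count >= threshold)

docs = [
    "The invoice for the order was sent",
    "The order shipped on Monday",
    "Please review the invoice",
    "The refund for the order is pending",
]
candidates = noise_word_candidates(docs, min_fraction=0.75)
```

With this sample data the candidates are "order" and "the": one generic stop word and one word that is only noise within this particular domain, which is the sort of suggestion a built-in noise list will not give you.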

Implementing a massive search application

We have an email service that hosts close to 10000 domains, and we store the headers of messages in a SQL Server database.
I need to implement an application that will search the message body for keywords. The messages are stored as files on a NAS storage system.
As a proof of concept, I implemented a SQL Server based search system where I would parse the message and store all the words in a database table along with the memberid and the messageid. The database was on a separate server from the headers database.
The problem with that system was that I ended up with a table with 600 million rows after processing messages on just one domain. Obviously this is not a very scalable solution.
Since the headers are stored in a SQL Server table, I am going to need to join the messageIDs from the search application to the header table to display the messages that contain the searched for keywords.
Any suggestions on a better architecture? Any better alternative to using SQL server? We receive over 20 million messages a day.
We are a small company with limited resources with respect to servers, maintenance etc.
Thanks
Have a look at Hadoop. It's a complete "map-reduce" framework for working with huge datasets, inspired by Google. I think (but I could be wrong) Rackspace is using it for email search for their clients.
lucene.net will help you a lot, but no matter how you approach this, it's going to be a lot of work.
Consider not using SQL for this. It isn't helping.
GREP and other flat-file techniques for searching the text of the headers are MUCH faster and much simpler.
You can also check out the Java Lucene stuff, which might be useful to you. Both Katta, which is a distributed Lucene index, and Solr, which can use rsync for index syncing, might be useful. While I don't consider either to be very elegant, it is often better to use something that is already built and known to work before embarking on actual development. Without knowing more details, it's hard to make a more specific recommendation.
If you can break up your 600 million rows, look into database sharding. Any query across all rows is going to be slow. At the very least you could break up by language. If they're all English, well, find some way to split the data that makes sense based on common searches. I'm just guessing here, but maybe domains could be grouped by TLD (.com, .net, .org, etc).
For fulltext search, compare SQL Server vs Lucene.NET vs cLucene vs MySQL vs PostgreSQL. Note full-text search will be faster if you don't need to rank the results. If a database is still slow look into performance tuning and if that fails look into a Linux-based db.
http://incubator.apache.org/lucene.net/
http://sourceforge.net/projects/clucene/
I wonder if BigTable (http://en.wikipedia.org/wiki/BigTable) does searching.
Look into the SQL Server full text search services/functionality. I haven't used it myself, but I once read that Stack Overflow uses it.
Three solutions:
1. Use an existing text search engine (Lucene is the most mentioned; there are several more).
2. Store the whole message in the SQL database and use the included full-text search (most DBs have it these days).
3. Don't create a new record for each word occurrence; just add a new value to a big field in the word record. Even better, don't use SQL for this table: use a key-value store where the key is the word and the value is the list of occurrences. Check the literature on inverted indexes for inspiration.
But to be honest, I think the only reasonable approach is #1.
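For anyone curious what the "key is the word, value is the list of occurrences" idea looks like, here is a minimal inverted index sketch in Python. It is a toy: everything lives in memory, the message ids and bodies are made up, and the function names are hypothetical. A real engine such as Lucene adds on-disk storage, compression, ranking, and much more on top of this basic structure.

```python
from collections import defaultdict
import re

def build_inverted_index(messages):
    """Map each word to the set of message ids whose body contains it."""
    index = defaultdict(set)
    for message_id, body in messages.items():
        for word in re.findall(r"\w+", body.lower()):
            index[word].add(message_id)
    return index

def search(index, query):
    """Return the ids of messages that contain every word in the query."""
    words = re.findall(r"\w+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())  # intersect the posting sets
    return result

messages = {
    101: "Quarterly budget review meeting",
    102: "Budget approved for the new server",
    103: "Server maintenance on Friday",
}
index = build_inverted_index(messages)
hits = search(index, "budget server")
```

The point of the structure is that there is one entry per distinct word rather than one row per word occurrence, and the resulting message ids can then be joined back to the header table.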

Can someone give me a high overview of how lucene.net works?

I have an MS SQL database and have a varchar field that I would like to do queries like where name like '%searchTerm%'. But right now it is too slow, even with SQL enterprise's full text indexing.
Can someone explain how Lucene.NET might help my situation? How does the indexer work? How do queries work?
What is done for me, and what do I have to do?
I saw this guy (Michael Neel) present on Lucene at a user group meeting - effectively, you build index files (using Lucene) and they have pointers to whatever you want (database rows, whatever)
http://code.google.com/p/vinull/source/browse/#svn/Examples/LuceneSearch
Very fast, flexible and powerful.
What's good with Lucene is the ability to index a variety of things (files, images, database rows) together in your own index using Lucene and then translating that back to your business domain, whereas with SQL Server, it all has to be in SQL to be indexed.
It doesn't look like his slides are up there in Google code.
This article (strangely enough it's on the top of the Google search results :) has a fairly good description of how the Lucene search could be optimised.
A properly configured Lucene should easily beat SQL (pre-2005) full-text indexing search. If you are on MS SQL 2005 and your search performance is still too slow, you might consider checking your DB setup.