SQL full-text thesaurus - sql

Is there somewhere on the web where one can get the XML for the English thesaurus (for MS SQL, that is)? I'd really hate to populate it by hand...

Here is a free one hosted by Project Gutenberg, although I think it is in plain-text (TXT) format:
http://www.gutenberg.org/dirs/etext02/mthes10.zip
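Since that download is plain text rather than XML, some conversion would be needed. Below is a minimal Python sketch, assuming each line of the text file is a comma-separated list with the head word first (I have not verified this against the archive), and assuming the target is the SQL Server thesaurus layout with `expansion` and `sub` elements. The function name `thesaurus_xml` is made up for illustration.

```python
import xml.etree.ElementTree as ET


def thesaurus_xml(lines, diacritics_sensitive=False):
    """Build a SQL Server style thesaurus document from comma-separated lines.

    Assumption: each input line looks like "head,synonym1,synonym2,...".
    Every line with at least two terms becomes one <expansion> set.
    """
    root = ET.Element("XML", {"ID": "Microsoft Search Thesaurus"})
    thesaurus = ET.SubElement(root, "thesaurus", {"xmlns": "x-schema:tsSchema.xml"})
    flag = ET.SubElement(thesaurus, "diacritics_sensitive")
    flag.text = "1" if diacritics_sensitive else "0"

    for line in lines:
        terms = [term.strip() for term in line.split(",") if term.strip()]
        if len(terms) < 2:
            continue  # a lone word has nothing to expand to
        expansion = ET.SubElement(thesaurus, "expansion")
        for term in terms:
            # ElementTree escapes &, < and > in the text for us
            ET.SubElement(expansion, "sub").text = term

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(thesaurus_xml(["happy,glad,joyful", "lonely"]))
```

Note that an expansion set treats all of its terms as interchangeable, which is broader than a "head word to synonyms" relationship, so the result may need trimming before it is useful for search.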

Related

Fulltext search with Simple.Data

I am trying out the Simple.Data ORM. Is there a way to take advantage of SQL Server's full-text search with Simple.Data? I found methods for wildcard search but did not see anything for full-text search. Wildcard search is not very useful in my case because I have close to half a million rows to deal with.
I went through the Simple.Data documentation and didn't see any mention of full-text search.
Thanks in advance.
You should be able to use Contains or Freetext as methods on a TEXT or NTEXT column, passing a string parameter, but there is no support for anything like FORMSOF; that's a bit too specific to SQL Server.
So:
var dubstepAlbums = db.Albums.FindAll(db.Albums.Description.Contains("dub-step"));
If that doesn't work, please report it as a bug on the project site.

SQL Server get value from a xml file in internet site

There is an XML file at www.samplexxxxx.com/myfile.xml.
I want to read or query this file with SQL Server on my computer.
Can I do this?
Here is one example: http://pratchev.blogspot.com/2008/11/import-xml-file-to-sql-table.html
Also see dba.stackexchange.com for another option
That is quite a bit of manual work, and a bit of a square peg in a round hole: DBMSs aren't great at working with XML data. I would recommend parsing the XML with Java, PHP, Python, or whatever you prefer, and inserting the data into a DBMS if needed.
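As a rough illustration of the parse-then-insert approach in Python: the sketch below assumes the file contains repeated `item` elements with `id` and `name` children (hypothetical names, since the real structure of the file is unknown), and it uses the standard-library `sqlite3` module as a stand-in for SQL Server so that it runs anywhere. With SQL Server you would swap in a driver connection and adjust the SQL accordingly.

```python
import sqlite3
import xml.etree.ElementTree as ET
from urllib.request import urlopen


def fetch_xml(url):
    """Download the XML document as text (not called in the demo below)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


def parse_items(xml_text):
    """Return (id, name) tuples; the element names are assumptions."""
    root = ET.fromstring(xml_text)
    return [(item.findtext("id"), item.findtext("name")) for item in root.iter("item")]


def load_items(conn, rows):
    """Insert the parsed rows, replacing any row with the same id."""
    conn.execute("CREATE TABLE IF NOT EXISTS items (id TEXT PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT OR REPLACE INTO items (id, name) VALUES (?, ?)", rows)
    conn.commit()


# Inline sample so the sketch runs without network access.
SAMPLE = (
    "<items>"
    "<item><id>1</id><name>first</name></item>"
    "<item><id>2</id><name>second</name></item>"
    "</items>"
)

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load_items(conn, parse_items(SAMPLE))
    print(conn.execute("SELECT id, name FROM items ORDER BY id").fetchall())
```

For the real file, `parse_items(fetch_xml(url))` would replace the inline sample.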

The best online resources for full-text searching in Microsoft SQL?

So I've learned the difference between FREETEXT, FREETEXTTABLE, CONTAINS, and CONTAINSTABLE. And I've created a pretty cool search engine that combines a full-text enabled search with a tagging system (with a little help from you guys).
But where have you gone to really learn about and master full-text searching and get the most out of it in real-world scenarios? I'm struggling now with things like database design with full-text indexing in mind, and writing efficient queries that reference multiple tables each with their own full-text-indexed columns.
Any good articles or tutorials you know of are welcome.
Not an article or a tutorial, but if you're willing to spend a few bucks your single best source of information would be Pro Full-Text Search in SQL Server 2008 by Michael Coles and Hilary Cotter.
http://apress.com/book/view/9781430215943
You could start by going straight to the source (assuming that you haven't already).
Full-Text Search (SQL Server)
SQL Server 2008 Full-Text Search: Internals and Enhancements

Implementing a massive search application

We have an email service that hosts close to 10,000 domains, and we store the headers of messages in a SQL Server database.
I need to implement an application that will search the message body for keywords. The messages are stored as files on a NAS storage system.
As a proof of concept, I implemented a SQL Server based search system where I parsed each message and stored all the words in a database table along with the memberid and the messageid. The database was on a separate server from the headers database.
The problem with that system was that I ended up with a table with 600 million rows after processing messages on just one domain. Obviously this is not a very scalable solution.
Since the headers are stored in a SQL Server table, I am going to need to join the messageIDs from the search application to the header table to display the messages that contain the searched-for keywords.
Any suggestions on a better architecture? Any better alternative to using SQL server? We receive over 20 million messages a day.
We are a small company with limited resources with respect to servers, maintenance etc.
Thanks
Have a look at Hadoop. It's a complete "map-reduce" framework for working with huge datasets, inspired by Google. I think (but I could be wrong) Rackspace is using it for email search for their clients.
Lucene.NET will help you a lot, but no matter how you approach this, it's going to be a lot of work.
Consider not using SQL for this. It isn't helping.
GREP and other flat-file techniques for searching the text of the headers are MUCH faster and much simpler.
You can also check out the Java Lucene projects, which might be useful to you: Katta, which is a distributed Lucene index, and Solr, which can use rsync for index syncing. While I don't consider either to be very elegant, it is often better to use something that is already built and known to work before embarking on actual development. Without knowing more details it's hard to make a more specific recommendation.
If you can break up your 600 million rows, look into database sharding. Any query across all rows is going to be slow. At the very least you could break up by language. If they're all English, well, find some way to split the data that makes sense based on common searches. I'm just guessing here, but maybe domains could be grouped by TLD (.com, .net, .org, etc.).
For full-text search, compare SQL Server vs Lucene.NET vs CLucene vs MySQL vs PostgreSQL. Note that full-text search will be faster if you don't need to rank the results. If the database is still slow, look into performance tuning, and if that fails, look into a Linux-based database.
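To make the sharding idea concrete, here is a small Python sketch that routes each mail domain to a shard. It hashes the whole domain name rather than grouping by TLD, which is a swapped-in alternative to the TLD suggestion and not the same scheme: the assumption is that a search usually stays within one domain, and hashing spreads domains more evenly than .com/.net/.org would. The function name `shard_for` is hypothetical.

```python
import hashlib


def shard_for(domain, shard_count):
    """Map a mail domain to a shard number in the range [0, shard_count)."""
    # A cryptographic digest is used only because it is stable across
    # processes and machines, unlike Python's built-in hash() for strings.
    digest = hashlib.sha1(domain.lower().encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


if __name__ == "__main__":
    for name in ("example.com", "example.net", "example.org"):
        print(name, "-> shard", shard_for(name, 8))
```

One trade-off of plain modulo hashing is that changing the shard count later moves most domains to a different shard.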
http://incubator.apache.org/lucene.net/
http://sourceforge.net/projects/clucene/
I wonder whether BigTable (http://en.wikipedia.org/wiki/BigTable) does searching.
Look into the SQL Server full text search services/functionality. I haven't used it myself, but I once read that Stack Overflow uses it.
Three solutions:
1. Use an existing text search engine (Lucene is the most often mentioned; there are several more).
2. Store the whole message in the SQL database and use the built-in full-text search (most DBs have it these days).
3. Don't create a new record for each word occurrence; just add a new value to a big field in the word record. Even better, don't use SQL for this table: use a key-value store where the key is the word and the value is the list of occurrences. Check some inverted-index literature for inspiration.
But to be honest, I think the only reasonable approach is #1.
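The third option describes an inverted index. A minimal in-memory Python sketch follows, with a plain dict standing in for the key-value store; the class name, the tokenizer, and the AND semantics for multi-word queries are all assumptions made for illustration, not anything from the question.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN.findall(text.lower())


class InvertedIndex:
    """Maps each word to the set of message ids that contain it."""

    def __init__(self):
        self.postings = defaultdict(set)

    def add(self, message_id, body):
        # One entry per distinct word, not one row per occurrence.
        for word in set(tokenize(body)):
            self.postings[word].add(message_id)

    def search(self, query):
        """Return ids of messages containing every word in the query."""
        words = tokenize(query)
        if not words:
            return set()
        # .get() avoids creating empty entries for unknown words
        candidate_sets = [self.postings.get(word, set()) for word in words]
        return set.intersection(*candidate_sets)


if __name__ == "__main__":
    index = InvertedIndex()
    index.add(1, "Quarterly invoice attached")
    index.add(2, "Invoice overdue, please pay")
    print(index.search("invoice overdue"))
```

The ids returned by `search` are what would then be joined against the header table.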

Can someone give me a high-level overview of how Lucene.NET works?

I have an MS SQL database with a varchar field on which I would like to run queries like where name like '%searchTerm%'. Right now it is too slow, even with SQL Server Enterprise's full-text indexing.
Can someone explain how Lucene.NET might help my situation? How does the indexer work? How do queries work?
What is done for me, and what do I have to do?
I saw this guy (Michael Neel) present on Lucene at a user group meeting. Effectively, you build index files (using Lucene), and they hold pointers to whatever you want (database rows, files, and so on).
http://code.google.com/p/vinull/source/browse/#svn/Examples/LuceneSearch
Very fast, flexible and powerful.
What's good about Lucene is the ability to index a variety of things (files, images, database rows) together in your own index and then translate the hits back to your business domain, whereas with SQL Server, everything has to be in SQL Server to be indexed.
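To illustrate the "index holds pointers, you translate hits back" idea without tying it to Lucene's actual API, here is a toy Python sketch. It is not how Lucene is implemented; the reference format `(kind, key)`, the `people` table, and both function names are invented for the example, and `sqlite3` stands in for the real database.

```python
import sqlite3
from collections import defaultdict


def build_index(docs):
    """docs: iterable of (ref, text), where ref is a (kind, key) pointer
    such as ("row", 1) or ("file", "notes.txt")."""
    index = defaultdict(set)
    for ref, text in docs:
        for term in text.lower().split():
            index[term].add(ref)
    return index


def resolve(conn, refs):
    """Translate pointers back into domain objects."""
    results = []
    for kind, key in sorted(refs, key=str):  # stable order for display
        if kind == "row":
            row = conn.execute(
                "SELECT id, name FROM people WHERE id = ?", (key,)
            ).fetchone()
            results.append(row)
        else:
            results.append((kind, key))  # e.g. a file path, returned as-is
    return results


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO people VALUES (1, 'Ada Lovelace')")
    index = build_index([
        (("row", 1), "Ada Lovelace"),
        (("file", "notes.txt"), "meeting notes about ada"),
    ])
    print(resolve(conn, index.get("ada", set())))
```

The lookup touches only the entries for the search term, which is why this style of search avoids the full scan that a leading-wildcard LIKE requires.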
It doesn't look like his slides are up there in Google code.
This article (strangely enough, it's at the top of the Google search results :) has a fairly good description of how Lucene search can be optimised.
A properly configured Lucene should easily beat SQL Server (pre-2005) full-text indexing. If you are on MS SQL 2005 and your search performance is still too slow, you might consider checking your DB setup.