Implementing a massive search application - SQL

We have an email service that hosts close to 10,000 domains, and we store the headers of messages in a SQL Server database.
I need to implement an application that will search the message bodies for keywords. The messages are stored as files on a NAS storage system.
As a proof of concept, I implemented a SQL Server based search system where I would parse each message and store all the words in a database table along with the member ID and the message ID. The database was on a separate server from the headers database.
The problem with that system was that I ended up with a table with 600 million rows after processing messages on just one domain. Obviously this is not a very scalable solution.
Since the headers are stored in a SQL Server table, I am going to need to join the message IDs from the search application to the header table to display the messages that contain the searched-for keywords.
Any suggestions on a better architecture? Any better alternative to using SQL server? We receive over 20 million messages a day.
We are a small company with limited resources with respect to servers, maintenance etc.
Thanks

Have a look at Hadoop. It's a complete "map-reduce" framework for working with huge datasets, inspired by Google. I think (but I could be wrong) Rackspace is using it for email search for their clients.

lucene.net will help you a lot, but no matter how you approach this, it's going to be a lot of work.

Consider not using SQL for this. It isn't helping.
GREP and other flat-file techniques for searching the text of the headers are MUCH faster and much simpler.

You can also check out the Java Lucene stuff, which might be useful to you. Both Katta, which is a distributed Lucene index, and Solr, which can use rsync for index syncing, might be useful. While I don't consider either to be very elegant, it is often better to use something that is already built and known to work before embarking on actual development. Without knowing more details it's hard to make a more specific recommendation.

If you can break up your 600 million rows, look into database sharding. Any query across all rows is going to be slow. At the very least you could break up by language. If they're all English, well, find some way to split the data that makes sense based on common searches. I'm just guessing here, but maybe domains could be grouped by TLD (.com, .net, .org, etc.).
For full-text search, compare SQL Server vs Lucene.NET vs CLucene vs MySQL vs PostgreSQL. Note that full-text search will be faster if you don't need to rank the results. If a database is still slow, look into performance tuning, and if that fails, look into a Linux-based DB.
http://incubator.apache.org/lucene.net/
http://sourceforge.net/projects/clucene/

I wonder if BigTable (http://en.wikipedia.org/wiki/BigTable) does searching.

Look into the SQL Server full text search services/functionality. I haven't used it myself, but I once read that Stack Overflow uses it.

Three solutions:
1) Use an already-existing text search engine (Lucene is the most mentioned; there are several more).
2) Store the whole message in the SQL database, and use its built-in full-text search (most DBs have it these days).
3) Don't create a new record for each word occurrence; just add a new value to a big field in the word record. Even better, don't use SQL for this table; use a key-value store where the key is the word and the value is the list of occurrences (see the sketch below). Check some inverted-index bibliography for inspiration.
But to be honest, I think the only reasonable approach is #1.
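For illustration, here's the shape such an inverted index could take, sketched as SQL for familiarity (a real key-value store would replace the table; all names here are hypothetical):

CREATE TABLE inverted_index (
    word        VARCHAR(64) PRIMARY KEY,
    message_ids TEXT  -- packed list of message IDs, e.g. comma-separated
);
-- append an occurrence instead of inserting one row per word hit
-- (standard SQL || concatenation; dialects vary):
UPDATE inverted_index
SET message_ids = message_ids || ',12345'
WHERE word = 'invoice';

One row per distinct word rather than one per occurrence keeps the row count bounded by the vocabulary size instead of the corpus size.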

Related

Using Lucene QueryAPI to access SQL

Can you advise on whether I can use just the Query functionality from Lucene to generate SQL queries? Something like an SQLQueryBuilder?
I have a massive SQL database of logs from a webserver cluster containing the original request and response strings plus some other useful/less bits and bobs. What I need to do is analyse the parameters in the original request and compare with the generated responses, looking at ratios, volatility, variability, consistency etc.
This question does not relate to the analysis stage, but only to the retrieval of data from the database which matches the parameters I'm interested in. So, I could just do this in good old SQL queries, manually building the exact queries I need on a case-by-case basis. But that's kinda lame; I reckon we can be a bit smarter than that. Particularly as I can already see large numbers of similar but subtly different queries being useful. And as I'm hoping to expose a single search box via a web interface to non-technical end users, exposing SQL queries seems like a bad idea... and a recipe for permanent maintenance requests (and can I be the first to say, er, no thanks!).
In an ideal world I expose a search form, with the option to write simple queries like
request:"someAttribute=\"someValue\"" AND response="some hoped for result" AND daterange:30
which would then hopefully find all instances of requests which contain someAttribute="someValue" over the last 30 days. The results will then be put through standard statistical analyses on the given response text and printed out on-screen. At least, that's the idea.
Much of the actual logic to determine how to handle custom field definitions or special words I'll need to write myself, and that's OK. And NB, my non-technical end users are familiar enough with XML that they can handle a bit of attr="value" syntax, at least for the first iteration of the tool :D
In summary, I want to:
1) allow users to use Google-like search syntax (e.g. via Lucene's QueryAPI) to specify text to match in the logs
2) allow a layer to manipulate the query based on special words or fields (e.g. this layer could be during a Java object phase)
3) convert the final query into an SQL query appropriate for my database schema
4) query the database and spit back the resultset for statistical analysis
5) pretty-print on the website :)
Am I completely barking up the wrong tree? It looks like it should be possible, but I can't seem to find much on it. I've been googling for a bit on this, for example trying "Lucene SQLQueryBuilder" as a possible start but didn't really find much by way of a lead.
So, my questions are:
Has anyone tried using Lucene's QueryAPI like this before? Did it work? Any gotchas?
Are there better query api libraries out there?
Examples, finished discussions and open-source implementations would be most helpful.
Many thanks.
NB: I don't think I want Lucene's search capabilities as such, as I'm only ever looking for exact matches. I just need a query layer on top of the database.
Lucene and SQL have very little in common, as they use totally different syntax (as HefferWolf mentioned) and different underlying data models. As you said yourself, I'm afraid you're barking up the wrong tree.
There are, however, attempts such as Hibernate Search to bridge this gap. These are interesting experiments as such, but I would be very careful about using any of that code in production.
You could possibly use Full Text Search features available in some SQL databases, or reindex all data in Lucene and use it without database.
I doubt you can reuse any code from Lucene for this. Lucene does an internal rewrite of such queries, but into a syntax which wouldn't be of much help for SQL, I think.
name:Phil AND lastname:Miller AND NOT age:26
would be rewritten to
+name:Phil +lastname:Miller -age:26
So I think you would have to write your own translation into SQL query syntax.
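For example, the kind of query described above might translate into something along these lines (a rough sketch; the request_log table and its columns are made up, and the date arithmetic is MySQL-flavored):

-- input: request:"someAttribute=\"someValue\"" AND response:"some hoped for result" AND daterange:30
SELECT *
FROM request_log
WHERE request LIKE '%someAttribute="someValue"%'
  AND response = 'some hoped for result'
  AND logged_at >= NOW() - INTERVAL 30 DAY;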
But maybe you can use Lucene as such for this. Have a look at hibernate-search, which is quite handy for easily creating a Lucene index of a SQL table.

Querying a large text file containing JSON objects

I have a text file of a few gigabytes in the format:
{"user_ip":"x.x.x.x", "action_type":"xxx", "action_data":{"some_key":"some_value"...},...}
Each entry is one line.
First I would like to easily find entries for a given IP. This part is easy because I can use grep, for example. However, even for this I would like to find a better solution, because I would like to get a response as fast as possible.
The next part is more complicated, because I would like to find entries from a selected IP, of a selected type, and with a particular value of some_key in action_data.
I would probably have to convert this file to a SQL DB (probably SQLite, because it will be a desktop app), but I would ask whether better solutions exist.
You could take a look at MongoDB, a document-based database. With it you essentially store JSON objects that you can then index and easily query in an efficient way. You can find out how to query in the docs: Querying.
Yes, put it into a database, any database. Then querying it will be straightforward.
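As a rough sketch of what that could look like in SQLite (hypothetical table and column names; extracting some_key into its own column at load time avoids needing JSON support in the database):

CREATE TABLE actions (
    user_ip     TEXT,
    action_type TEXT,
    some_key    TEXT,  -- pulled out of action_data while loading the file
    action_data TEXT   -- the full original JSON line
);
CREATE INDEX idx_actions ON actions (user_ip, action_type, some_key);

SELECT * FROM actions
WHERE user_ip = '1.2.3.4' AND action_type = 'click' AND some_key = 'some_value';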
Just wanted to mention that Oracle Berkeley DB 11gR2 (released on April 1st, 2010) introduces support for a SQL API. In fact, the SQL API is the sqlite3() API. So, as Jason mentioned, if you'd like the ease-of-use of SQLite, combined with the scalability and concurrency of Berkeley DB, you can now get both things in a single library.
Regards,
Dave
If you need the relational guarantees of an SQL-based DB, definitely go ahead with SQLite. It will allow for fast queries, joins, aggregations, sorts, and overall any sort of search you could possibly dream up. It sounds like this is just a big list of Actions performed by users at some IP, so you'll probably want to use some sort of sequence as your primary key since none of the other attributes look like good candidates.
On the other hand, if you just need to do very simple queries, e.g. look up entries by IP, look up entries by action type, etc., you might want to look into Oracle Berkeley DB. As long as you don't need any searches that are too fancy, Berkeley DB will let you store Terabytes of data and access them at record speed.
So look over both and see what's best for your use case. They're good for different things, which might be why both are available as storage systems on Android, for instance. I think SQLite will probably win out, but when thinking about embedded local DB systems you should always at least consider both of these technologies.

MySQL with ColdFusion - Solutions to create search capabilities?

I'm using MySQL & ColdFusion. Currently, for searching TEXT fields, I'm using LIKE in the database. Luckily my database is empty, but soon the table will fill up, and I fear the LIKE SQL query will kill my app.
I'm looking for a solution that works with both MySQL & ColdFusion that will allow me to scalably offer search capabilities with my MySQL & ColdFusion app.
Thanks
Consider using ColdFusion's built-in Verity search engine, or the Solr search engine in ColdFusion 9, which is Apache Lucene. Good luck!
Update: ColdFusion 9.0.1 has addressed several quirks in the Solr (Apache Lucene) search engine. Use it!
You are right to worry about the LIKE operator's performance having scalability problems. But keep two things in mind.
First: column LIKE 'pattern%' works well if your column is indexed. It's column LIKE '%pattern%' that can cause real performance problems.
Second, MySQL has a good full-text search system built into it. See http://dev.mysql.com/doc/refman/5.1/en/fulltext-search.html
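To make both points concrete, a minimal sketch (the items table, its ordinary index on name, and the FULLTEXT index are all hypothetical):

-- can use an ordinary index on name (prefix match):
SELECT id FROM items WHERE name LIKE 'widget%';
-- cannot use that index; the leading wildcard forces a full scan:
SELECT id FROM items WHERE name LIKE '%widget%';
-- MySQL full-text alternative, given a FULLTEXT index on name:
SELECT id FROM items WHERE MATCH(name) AGAINST('widget');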
What makes you think that it will be a problem? Have you done any load testing? What is the worst-case maximum size of the table? Have you filled it to that level and tried it? Finally, do you actually need it to be TEXT? MySQL has some very large varchars; would that do instead?
My point being, it sounds like you already have the simplest solution that might possibly work. Maybe you should prove that it does not work before over-engineering something else?
Lastly, to actually answer your question, you could cache the database into a Verity search index and then search that (CF 9 offers another index engine as well). But you're going to lose it being a live search.
I don't know if it is an option for your app, but what I usually do is reserve LIKE '%pattern%' for advanced searches defined by the user, where a performance hit could be expected. When possible I default the search option selected by the user to 'Starts With'. I've searched '%pattern%' in a MySQL 5 DB with 1.25 million records on a low-traffic site. The database doesn't seem to be the bottleneck, even on a field that isn't indexed. The customer wants all the records shown on the screen. Showing 10,000+ records seems to be the problem (lol). The DB may be less of a problem than you think, depending on traffic.

Fulltext Search with InnoDB

I'm developing a high-volume web application, where part of it is a MySQL database of discussion posts that will need to grow to 20M+ rows, smoothly.
I was originally planning on using MyISAM for the tables (for the built-in fulltext search capabilities), but the thought of the entire table being locked due to a single write operation makes me shudder. Row-level locks make so much more sense (not to mention InnoDB's other speed advantages when dealing with huge tables). So, for this reason, I'm pretty determined to use InnoDB.
The problem is... InnoDB doesn't have built-in fulltext search capabilities.
Should I go with a third-party search system, like Lucene (C++) / Sphinx? Do any of you database ninjas have any suggestions/guidance? LinkedIn's zoie (based on Lucene) looks like the best option at the moment, having been built around realtime capabilities (which is pretty critical for my application). I'm a little hesitant to commit yet without some insight...
(FYI: going to be on EC2 with high-memory rigs, using PHP to serve the frontend)
Along with the general phasing out of MyISAM, InnoDB full-text search (FTS) is finally available in the MySQL 5.6.4 release.
Lots of juicy details at https://dev.mysql.com/doc/refman/5.6/en/innodb-fulltext-index.html.
While other engines have lots of different features, this one is InnoDB, so it's native (which means there's an upgrade path), and that makes it a worthwhile option.
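A minimal sketch of the syntax, assuming a hypothetical posts table (requires MySQL 5.6.4 or later):

CREATE TABLE posts (
    id   INT AUTO_INCREMENT PRIMARY KEY,
    body TEXT,
    FULLTEXT KEY ft_body (body)
) ENGINE=InnoDB;

SELECT id FROM posts
WHERE MATCH(body) AGAINST('+innodb +fulltext' IN BOOLEAN MODE);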
I can vouch for MyISAM fulltext being a bad option - even leaving aside the various problems with MyISAM tables in general, I've seen the fulltext stuff go off the rails and start corrupting itself and crashing MySQL regularly.
A dedicated search engine is definitely going to be the most flexible option here - store the post data in MySQL/InnoDB, and then export the text to your search engine. You can set up a periodic full index build/publish pretty easily, and add real-time index updates if you feel the need and want to spend the time.
Lucene and Sphinx are good options, as is Xapian, which is nice and lightweight. If you go the Lucene route, don't assume that CLucene will be better even if you'd prefer not to wrestle with Java (although I'm not really qualified to discuss the pros and cons of either).
You should spend an hour and go through installation and test-drive of Sphinx and Lucene. See if either meets your needs, with respect to data updates.
One of the things that disappointed me about Sphinx is that it doesn't support incremental inserts very well. That is, it's very expensive to reindex after an insert, so expensive that their recommended solution is to split your data into older, unchanging rows and newer, volatile rows. So every search your app does would have to search twice: once on the larger index for old rows and once on the smaller index for recent rows. If that doesn't fit your usage patterns, then Sphinx is not a good solution (at least not in its current implementation).
I'd like to point out another possible solution you could consider: Google Custom Search. If you can apply some SEO to your web application, then outsource the indexing and search function to Google, and embed a Google search textfield into your site. It could be the most economical and scalable way to make your site searchable.
Perhaps you shouldn't dismiss MySQL's FT so quickly. Craigslist used to use it.
"MySQL's speed and Full Text Search has enabled craigslist to serve their users ... craigslist uses MySQL to serve approximately 50 million searches per month at a rate of up to 60 searches per second."
Edit: As commented below, Craigslist seems to have switched to Sphinx some time in early 2009.
Sphinx, as you point out, is quite nice for this stuff. All the work is in the configuration file. Make sure that whatever table holds the strings has some unique integer id key, and you should be fine.
Try this; it matches rows where the search text occurs at least once (by counting occurrences of the search string):
ROUND((LENGTH(text) - LENGTH(REPLACE(text, 'searchtext', ''))) / LENGTH('searchtext'), 0) != 0
You should take a look at Sphinx; it is worth a try. Its indexing is super fast, and it is distributed. You should take a look at this webinar (http://www.percona.com/webinars/2012-08-22-full-text-search-throwdown). It talks about searching and has some neat benchmarks. You may find it helpful.
If everything else fails, there's always soundex_match, which sadly isn't really fast or accurate.
For anyone stuck on an older version of MySQL / MariaDB (i.e. CentOS users) where InnoDB doesn't support Fulltext searches, my solution when using InnoDB tables was to create a separate MyISAM table for the thing I wanted to search.
For example, my main InnoDB table was products with various keys and referential integrity. I then created a simple MyISAM table called product_search containing two fields, product_id and product_name where the latter was set to a FULLTEXT index. Both fields are effectively a copy of what's in the main product table.
I then search on the MyISAM table using fulltext, and do an inner join back to the InnoDB table.
The contents of the MyISAM table can be kept up-to-date via either triggers or the application's model.
I wouldn't recommend this if you have multiple tables that require fulltext, but for a single table it seems like an adequate workaround until you can upgrade.
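A sketch of the workaround described above, using the example names from this answer (the products table's primary key is assumed to be id):

CREATE TABLE product_search (
    product_id   INT NOT NULL PRIMARY KEY,
    product_name VARCHAR(255) NOT NULL,
    FULLTEXT KEY ft_name (product_name)
) ENGINE=MyISAM;

-- fulltext search on the MyISAM copy, joined back to the InnoDB table:
SELECT p.*
FROM product_search ps
INNER JOIN products p ON p.id = ps.product_id
WHERE MATCH(ps.product_name) AGAINST('some keyword');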

Relevant Search Results Across Multiple Databases

I have three databases that all have the contents of several web pages in them. What would be the best way to go about searching all three and having the most relevant web page at the top of the search results?
The only way I can think of is to break down content by word count and/or create a complex set of search rules to give one content priority over another. This might be more trouble than it's worth, but I was wondering if anybody knows a way, or a product out there, that would be able to help me.
To further support Ivan's answer above, Lucene is the way to go. You haven't mentioned what platform you're on, so I'll point out that you can use a .NET port of this too.
If you do use Lucene there is a very good book from Manning on the subject which I recommend you look at.
When it comes to populating your index, you have a couple of choices. For starters, you can just dump all of your text into the index and allow the engine to search on it. However, I'd recommend adding fixed fields to your index, which will allow you to support things such as partitioned searches or searches against those fields only.
To explain, let's say you have a field for the website. Then you can partition your index by restricting the search to those documents that have that website in that field.
The other approach is to extract points of interest from your documents and allow searches on those without searching the entire index entry. Your mileage may vary with this, as the Lucene engine is very well written; it may simply allow you to collect your searches into more logical units, which helps you with your solution.
I've done this myself and it helps when answering management questions about what exactly is searched and indexed.
HTH!
If you're using MS SQL Server, then the full-text search can return a ranking for you. I haven't used it, so you'll need to check the documentation or online for specifics.
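For the curious, a rough sketch of what that can look like (FREETEXTTABLE returns KEY and RANK columns; the Pages table, its PageID key, and its full-text indexed Content column are made-up names):

SELECT ft.[KEY], ft.RANK, p.Title
FROM FREETEXTTABLE(Pages, Content, 'search terms') AS ft
INNER JOIN Pages p ON p.PageID = ft.[KEY]
ORDER BY ft.RANK DESC;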