SQL Indexes and Multicolumn Database search - sql

I have to implement search that finds substring in the name of the user.
User has FirstName and LastName in 2 columns. It is good enough to do WHERE FirstName LIKE '%searchText%' OR LastName LIKE '%searchText%'
What is my problem that I want to solve is performance. Let's say that currently I expect like 1000 users tops. I do not want to search to take ages. So I thought of indexes (those columns will not change much, I expect that the value of these columns will almost never change). I know that I will be looking for both columns I need multi column index.
Is this correct way of doing this?
Or it is better to use SQL full text search for this (please provide some good link)?
Would it be better to create a View where FirstName and LastName will be concatenated and search there?
Or it is better to just use e.g. Azure Search (Currently, this is the only and it may be the last entity I would need to search for)?
I am using .NET 4.6 and EntityFramework hosted on Azure web apps.
Thanks

Because of the leading wildcard character, an index will not be used. The only time an index will be used is if the wildcard is either in the middle or at the end of the string. Adding one index on LastName, FirstName can be a good suggestion as well if your table is very wide, like David Browne said.
If you are truly needing to search with both wildcards (i.e. a partial match), then I would take a look at the amount of data in your table. If it's just a few thousand rows we're talking about, a table scan will still be very fine performance wise. If we're talking about something like 50.000 rows or more, then a full text index would be best.
This seems like a good tutorial on the matter: https://learn.microsoft.com/en-us/sql/relational-databases/search/get-started-with-full-text-search

Related

Searching efficiently with keywords

I'm working with a big table (millions of rows) on a postgresql database, each row has a name column and i would like to perform a search on that column.
For instance, if i'm searching for the movie Django Unchained, i would like the query to return the movie whether i search for Django or for Unchained (or Dj or Uncha), just like the IMDB search engine.
I've looked up full text search but i believe it is more intended for long text, my name column will never be more than 4-5 words.
I've thought about having a table keywords with a many to many relationship, but i'm not sure that's the best way to do it.
What would be the most efficient way to query my database ?
My guess is that for what you want to do, full text search is the best solution. (Documented here.)
It does allow you to search for any complete words. It allows you to search for prefixes on words (such as "Dja"). Plus, you can add synonyms as necessary. It doesn't allow for wildcards at the beginning of a word, so "Jango" would need to be handled with a synonym.
If this doesn't meet your needs and you need the capabilities of like, I would suggest the following. Put the title into a separate table that basically has two columns: an id and the title. The goal is to make the scanning of the table as fast as possible, which in turn means getting the titles to fit in the smallest space possible.
There is an alternative solution, which is n-gram searching. I'm not sure if Postgres supports it natively, but here is an interesting article on the subject that include Postgres code for implementing it.
The standard way to search for a sub-string anywhere in a larger string is using the LIKE operator:
SELECT *
FROM mytable
WHERE name LIKE '%Unchai%';
However, in case you have millions of rows it will be slow because there are no significant efficiencies to be had from indexes.
You might want to dabble with multiple strategies, such as first retrieving records where the value for name starts with the search string (which can benefit from an index on the name column - LIKE 'Unchai%';) and then adding middle-of-the-string hits after a second non-indexed pass. Humans tend to be significantly slower than computers on interpreting strings, so the user may not suffer.
This question is very much related to the autocomplete in forms. You will find several threads for that.
Basically, you will need a special kind of index, a space partitioning tree. There is an extension called SP-GiST for Postgres which supports such index structures. You will find a bunch of useful stuff if you google for that.

Defining four indexes Versus defining one index on the search fields

I have an employee table that have the following columns:
employeeID (PK)
First_name
Last_name
Third_name
Age
Sex
Telephone
Dateofbirth
Position
And i have a search functionality to search for the employee using any of these fields:-
First_name
Last_name
Telephone
Dateofbirth
And the user can specify any combinations of these fields to be included for the search.
Now to improve the search what is the best way to create indexes ,, should i :-
create separate four indexes on the four search fields?
OR
it will be better to create a single index that contains the four columns together?
OR
there is a third better solution?
BR
The second option (one index on all four columns) is not very likely to be useful unless you know that users always include one column, almost always include a second column, etc. Any query that doesn't look at the first column in an index isn't going to make use of that index, at least for the purpose of index seek. Such an index still might be used to retrieve the data in those cases, if no other columns are included in the select list.
Whether all four indexes are necessary, I'm not sure. Is it really likely that a search for first name will occur without last name? How often do you think users will search for John? I'd let those scan and combine them as a single index (LastName,FirstName). I'd also be surprised if you had many searches for birth date. Is it really common practice where employees will know (or can easily find out) their co-workers' age / birthday?
There's not really an ideal solution using traditional indices. If you think about how it works, an index is really like sorting the table against a specific field. So if you only have one index with all four columns, it can only really use the first one anyway. If you use four indices, then SQL Server can search against each one in sequence.
However, the above is moot if you're searching against a wildcard patterns:
COLUMN LIKE '%searchstring%'
That search cannot use the index-- it will have to search the contents of each row.
All of which suggests that you should consider a full-text index, which is meant for exactly this kind of scenario.
http://msdn.microsoft.com/en-us/library/ms142571.aspx

Sql Search in millions of records. Possible?

I have a table in my sql server 2005 database which contains about 50 million records.
I have firstName and LastName columns, and I would like to be able to allow the user to search on these columns without it taking forever.
Out of indexing these columns, is there a way to make my query work fast?
Also, I want to search similar sounded names. for example, if the user searches for Danny, I would like to return records with the name Dan, Daniel as well. It would be nice to show the user a rank in % how close the result he got to what he actually searched.
I know this is a tuff task, but I bet I'm not the first one in the world that face this issue :)
Thanks for your help.
We have databases with half a billion of records (Oracle, but should have similar performances). You can search in it within a few milli seconds if you have proper indexes. In your case, place an index on firstname and lastname. Using binary-tree index will perform good and will scale with the size of your database. Careful, LIKE clauses often break the use of the index and degrades largely the performances. I know MySQL can keep using indexes with LIKE clauses when wildcards are only at the right of the string. You would have to make similar search for SQL Server.
String similarity is indeed not simple. Have a look at http://en.wikipedia.org/wiki/Category:String_similarity_measures, you'll see some of the possible algorithms. Cannot say if SQL Server do implement one of them, dont know this database. Try to Google "SQL Server" + the name of the algorithms to maybe find what you need. Otherwise, you have code provided on Wiki for various languages (maybe not SQL but you should be able to adapt them for a stored procedure).
Have you tried full text indexing? I used it on free text fields in a table over 1 million records, and found it to be pretty fast. Plus you can add synonyms to it, so that Dan, Danial, and Danny all index as the same (where you get the dictionary of name equivalents is a different story). It does allow wildcard searches as well. Full text indexing can also do rank, though I found it to be less useful on names (better for documents).
use FUll TEXT SEARCH enable for this table and those columns, that will create full text index for those columns.

How to search millions of record in SQL table faster?

I have SQL table with millions of domain name. But now when I search for let's say
SELECT *
FROM tblDomainResults
WHERE domainName LIKE '%lifeis%'
It takes more than 10 minutes to get the results. I tried indexing but that didn't help.
What is the best way to store this millions of record and easily access these information in short period of time?
There are about 50 million records and 5 column so far.
Most likely, you tried a traditional index which cannot be used to optimize LIKE queries unless the pattern begins with a fixed string (e.g. 'lifeis%').
What you need for your query is a full-text index. Most DBMS support it these days.
Assuming that your 50 million row table includes duplicates (perhaps that is part of the problem), and assuming SQL Server (the syntax may change but the concept is similar on most RDBMSes), another option is to store domains in a lookup table, e.g.
CREATE TABLE dbo.Domains
(
DomainID INT IDENTITY(1,1) PRIMARY KEY,
DomainName VARCHAR(255) NOT NULL
);
CREATE UNIQUE INDEX dn ON dbo.Domains(DomainName);
When you load new data, check if any of the domain names are new - and insert those into the Domains table. Then in your big table, you just include the DomainID. Not only will this keep your 50 million row table much smaller, it will also make lookups like this much more efficient.
SELECT * -- please specify column names
FROM dbo.tblDomainResults AS dr
INNER JOIN dbo.Domains AS d
ON dr.DomainID = d.DomainID
WHERE d.DomainName LIKE '%lifeis%';
Of course except on the tiniest of tables, it will always help to avoid LIKE clauses with a leading wildcard.
Full-text indexing is the far-and-away best option here - how this is accomplished will depend on the DBMS you're using.
Short of that, ensuring that you have an index on the column being matched with the pattern will help performance, but by the sounds of it, you've tried this and it didn't help a great deal.
Stop using LIKE statement. You could use fulltext search, but it will require MyISAM table and isn't all that good solution.
I would recommend for you to examine available 3rd party solutions - like Lucene and Sphinx. They will be superior.
One thing you might want to consider is having a separate search engine for such lookups. For example, you can use a SOLR (lucene) server to search on and retrieve the ids of entries that match your search, then retrieve the data from the database by id. Even having to make two different calls, its very likely it will wind up being faster.
Indexes are slowed down whenever they have to go lookup ("bookmark lookup") data that the index itself doesn't contain. For instance, if your index has 2 columns, ID, and NAME, but you're selecting * (which is 5 columns total) the database has to read the index for the first two columns, then go lookup the other 3 columns in a less efficient data structure somewhere else.
In this case, your index can't be used because of the "like". This is similar to not putting any where filter on the query, it will skip the index altogether since it has to read the whole table anyway it will do just that ("table scan"). There is a threshold (i think around 35-50% where the engine normally flips over to this).
In short, it seems unlikely that you need all 50 million rows from the DB for a production application, but if you do... use a machine with more memory and try methods that keep that data in memory. Maybe a No-SQL DB would be a better option - mongoDB, couch DB, tokyo cabinet. Things like this. Good luck!
You could try breaking up the domain into chunks and then searh the chunks themselves. I did some thing like that years ago when I needed to search for words in sentences. I did not have full text searching available so I broke up the sentences into a word list and searched the words. It was really fast to find the results since the words were indexed.

should I use LIKE to query tables with 4 million rows

I am designing a search form, and I am wondering whether should I give the possibility to search by using LIKE %search_string% for a table that is going to have up to 4 million rows
In general, I would say no. This is a good candidate for full-text indexing. The leading % in your search string is going to eliminate the possibility of using any indexes.
There may be cases where the wait is acceptable and/or you do not want the additional administrative overhead of maintaining full-text indexes, in which case you might opt for LIKE.
No, you should really only use LIKE '%...%' when your tables are relatively small or you don't care about the performance of your own or other peoples' queries on your database.
There are other ways to achieve this capability which scale much better, full text indexing or, if that's unavailable or not flexible enough, using insert/update triggers to extract non-noise words for querying later.
I mention that last possibility since you may not want a full text index. In other words, do you really care about words like "is", "or" and "but" (these are the noise-words I was alluding to before).
You can separate the field into words and place the relevant ones in another table and use blindingly fast queries on that table to find the actual rows.
The search with LIKE %search_string% is very slow even on indexed columns. Worstcase the search does a full table scan.
If a search LIKE search_string% is enough I'd just provide this possibility.
It depends - without knowing how responsive the search has to be, it could either be fine or completely no go. You'll only really know if you profile your search with likely data patterns and search criteria.
And as RedFilter points out, you might want to consider Full Text Search, if plain search isn't performing well