Any way to achieve fulltext-like search on InnoDB - sql

I have a very simple query:
SELECT ... WHERE row LIKE '%some%' OR row LIKE '%search%' OR row LIKE '%string%'
to search for some search string, but as you can see, it searches for each string individually and it's also not good for performance.
Is there a way to recreate a fulltext-like search using LIKE on an InnoDB table. Of course, I know I can use something like Sphinx to achieve this but I'm looking for a pure MySQL solution.

use a myisam fulltext table to index back into your innodb tables for example:
Build your system using innodb:
create table users (...) engine=innodb;
create table forums (...) engine=innodb;
create table threads
(
forum_id smallint unsigned not null,
thread_id int unsigned not null default 0,
user_id int unsigned not null,
subject varchar(255) not null, -- gonna want to search this... !!
created_date datetime not null,
next_reply_id int unsigned not null default 0,
view_count int unsigned not null default 0,
primary key (forum_id, thread_id) -- composite clustered PK index
)
engine=innodb;
Now the fulltext search table which we will use just to index back into our innodb tables. You can maintain rows in this table either by using a trigger or nightly batch updates etc.
create table threads_ft
(
forum_id smallint unsigned not null,
thread_id int unsigned not null default 0,
subject varchar(255) not null,
fulltext (subject), -- fulltext index on subject
primary key (forum_id, thread_id) -- composite non-clustered index
)
engine=myisam;
Finally the search stored procedure which you call from your php/application:
drop procedure if exists ft_search_threads;
delimiter #
create procedure ft_search_threads
(
in p_search varchar(255)
)
begin
select
t.*,
f.title as forum_title,
u.username,
match(tft.subject) against (p_search in boolean mode) as rank
from
threads_ft tft
inner join threads t on tft.forum_id = t.forum_id and tft.thread_id = t.thread_id
inner join forums f on t.forum_id = f.forum_id
inner join users u on t.user_id = u.user_id
where
match(tft.subject) against (p_search in boolean mode)
order by
rank desc
limit 100;
end;
call ft_search_threads('+innodb +clustered +index');
Hope this helps :)

Using PHP to construct the query. This is an horrible hack. Once seen, it can't be unseen...
$words=dict($userQuery);
$numwords = sizeof($words);
$innerquery="";
for($i=0;$i<$numwords;$i++) {
$words[$i] = mysql_real_escape_string($words[$i]);
if($i>0) $innerquery .= " AND ";
$innerquery .= "
(
field1 LIKE \"%$words[$i]%\" OR
field2 LIKE \"%$words[$i]%\" OR
field3 LIKE \"%$words[$i]%\" OR
field4 LIKE \"%$words[$i]%\"
)
";
}
SELECT fields FROM table WHERE $innerquery AND whatever;
dict is a dictionary function

InnoDB full-text search (FTS) is finally available in MySQL 5.6.4 release.
These indexes are physically represented as entire InnoDB tables, which are acted upon by SQL keywords such as the FULLTEXT clause of the CREATE INDEX statement, the MATCH() ... AGAINST syntax in a SELECT statement, and the OPTIMIZE TABLE statement.
From FULLTEXT Indexes

Related

SQL Server Indexing and Composite Keys

Given the following:
-- This table will have roughly 14 million records
CREATE TABLE IdMappings
(
Id int IDENTITY(1,1) NOT NULL,
OldId int NOT NULL,
NewId int NOT NULL,
RecordType varchar(80) NOT NULL, -- 15 distinct values, will never increase
Processed bit NOT NULL DEFAULT 0,
CONSTRAINT pk_IdMappings
PRIMARY KEY CLUSTERED (Id ASC)
)
CREATE UNIQUE INDEX ux_IdMappings_OldId ON IdMappings (OldId);
CREATE UNIQUE INDEX ux_IdMappings_NewId ON IdMappings (NewId);
and this is the most common query run against the table:
WHILE #firstBatchId <= #maxBatchId
BEGIN
-- the result of this is used to insert into another table:
SELECT
NewId, -- and lots of non-indexed columns from SOME_TABLE
FROM
IdMappings map
INNER JOIN
SOME_TABLE foo ON foo.Id = map.OldId
WHERE
map.Id BETWEEN #firstBatchId AND #lastBatchId
AND map.RecordType = #someRecordType
AND map.Processed = 0
-- We only really need this in case the user kills the binary or SQL Server service:
UPDATE IdMappings
SET Processed = 1
WHERE map.Id BETWEEN #firstBatchId AND #lastBatchId
AND map.RecordType = #someRecordType
SET #firstBatchId += 4999
SET #lastBatchId += 4999
END
What are the best indices to add? I figure Processed isn't worth indexing since it only has 2 values. Is it worth indexing RecordType since there are only about 15 distinct values? How many distinct values will a column likely have before we consider indexing it?
Is there any advantage in a composite key if some of the fields are in the WHERE and some are in a JOIN's ON condition? For example:
CREATE INDEX ix_IdMappings_RecordType_OldId
ON IdMappings (RecordType, OldId)
... if I wanted both these fields indexed (I'm not saying I do), does this composite key gain any advantage since both columns don't appear together in the same WHERE or same ON?
Insert time into IdMappings isn't really an issue. After we insert all records into the table, we don't need to do so again for months if ever.

Suggested Indexing for table with 50 million rows is queried using its CREATED_DATE column and USER_TYPE column

Table Users:
ID PK INT
USER_TYPE VARCHAR(50) NOT NULL
CREATED_DATE DATETIME2(7) NOT NULL
I have this table with 50 million rows, and it is queries using the following where clause:
WHERE
u.USER_TYPE= 'manager'
AND u.CREATED_DATE >= #StartDate
AND u.CREATED_DATE < #EndDate
What would be a good starting point for an index on this table to optimize for the above query where clause?
For that query, the index you want is a composite index with two columns: (user_type, created_date). The order matters, you want user_type first because of the equality comparison.
You'll be well served by creating a table with user types having an arbitrary INT ID and referring to the manager type by ID, instead of having the manager type directly in the users table. This will narrow the table data as well as any index referring to the user type.
CREATE TABLE user_type (
id INT NOT NULL IDENTITY(1,1),
description NVARCHAR(128) NOT NULL,
CONSTRAINT pk_user_type PRIMARY KEY CLUSTERED(id)
);
CREATE TABLE users (
id INT NOT NULL IDENTITY(1,1),
user_type_id INT NOT NULL,
created_date DATETIME2(7) NOT NULL,
CONSTRAINT pk_users PRIMARY KEY CLUSTERED(id),
CONSTRAINT fk_users_user_type FOREIGN KEY(user_type_id) REFERENCES user_type(id)
);
CREATE NONCLUSTERED INDEX
ix_users_type_created
ON
users (
user_type_id,
created_date
);
You would be querying using the user_type ID rather than directly with the text of course.
For any query. Run the query in SSMS with "Include Actual Execution Plan" on. SSMS will advice an index if it feels proper index doesn't exist.

SQLite speed up select with collate nocase

I have SQLite db:
CREATE TABLE IF NOT EXISTS Commits
(
GlobalVer INTEGER PRIMARY KEY,
Data blob NOT NULL
) WITHOUT ROWID;
CREATE TABLE IF NOT EXISTS Streams
(
Name char(40) NOT NULL,
GlobalVer INTEGER NOT NULL,
PRIMARY KEY(Name, GlobalVer)
) WITHOUT ROWID;
I want to make 1 select:
SELECT Commits.Data
FROM Streams JOIN Commits ON Streams.GlobalVer=Commits.GlobalVer
WHERE
Streams.Name = ?
ORDER BY Streams.GlobalVer
LIMIT ? OFFSET ?
after that i want to make another select:
SELECT Commits.Data,Streams.Name
FROM Streams JOIN Commits ON Streams.GlobalVer=Commits.GlobalVer
WHERE
Streams.Name = ? COLLATE NOCASE
ORDER BY Streams.GlobalVer
LIMIT ? OFFSET ?
The problem is that 2nd select works super slow. I think this is because COLLATE NOCASE. I want to speed up it. I tried to add index but it doesn't help (may be i did sometinhg wrong?). How to execute 2nd query with speed approximately equals to 1st query's speed?
An index can be used to speed up a search only if it uses the same collation as the query.
By default, an index takes the collation from the table column, so you could change the table definition:
CREATE TABLE IF NOT EXISTS Streams
(
Name char(40) NOT NULL COLLATE NOCASE,
GlobalVer INTEGER NOT NULL,
PRIMARY KEY(Name, GlobalVer)
) WITHOUT ROWID;
However, this would make the first query slower.
To speed up both queries, you need two indexes, one for each collation. So to use the default collation for the implicit index, and NOCASE for the explicit index:
CREATE TABLE IF NOT EXISTS Streams
(
Name char(40) NOT NULL COLLATE NOCASE,
GlobalVer INTEGER NOT NULL,
PRIMARY KEY(Name, GlobalVer)
) WITHOUT ROWID;
CREATE INDEX IF NOT EXISTS Streams_nocase_idx ON Streams(Name COLLATE NOCASE, GlobalVar);
(Adding the second column to the index speeds up the ORDER BY in this query.)

Can "auto_increment" on "sub_groups" be enforced at a database level?

In Rails, I have the following
class Token < ActiveRecord
belongs_to :grid
attr_accessible :turn_order
end
When you insert a new token, turn_order should auto-increment. HOWEVER, it should only auto-increment for tokens belonging to the same grid.
So, take 4 tokens for example:
Token_1 belongs to Grid_1, turn_order should be 1 upon insert.
Token_2 belongs to Grid_2, turn_Order should be 1 upon insert.
If I insert Token_3 to Grid_1, turn_order should be 2 upon insert.
If I insert Token_4 to Grid_2, turn_order should be 2 upon insert.
There is an additional constraint, imagine I execute #Token_3.turn_order = 1, now #Token_1 must automatically set its turn_order to 2, because within these "sub-groups" there can be no turn_order collision.
I know MySQL has auto_increment, I was wondering if there is any logic that can be applied at the DB level to enforce a constraint such as this. Basically auto_incrementing within sub-groups of a query, those sub-groups being based on a foreign key.
Is this something that can be handled at a DB level, or should I just strive for implementing rock-solid constraints at the application layer?
If i understood your question properly then you could use one of the following two methods (innodb vs myisam). Personally, I'd take the innodb road as i'm a fan of clustered indexes which myisam doesnt support and I prefer performance over how many lines of code I need to type, but the decision is yours...
http://dev.mysql.com/doc/refman/5.0/en/innodb-table-and-index.html
Rewriting mysql select to reduce time and writing tmp to disk
full sql script here : http://pastie.org/1259734
innodb implementation (recommended)
-- TABLES
drop table if exists grid;
create table grid
(
grid_id int unsigned not null auto_increment primary key,
name varchar(255) not null,
next_token_id int unsigned not null default 0
)
engine = innodb;
drop table if exists grid_token;
create table grid_token
(
grid_id int unsigned not null,
token_id int unsigned not null,
name varchar(255) not null,
primary key (grid_id, token_id) -- note clustered PK order (innodb only)
)
engine = innodb;
-- TRIGGERS
delimiter #
create trigger grid_token_before_ins_trig before insert on grid_token
for each row
begin
declare tid int unsigned default 0;
select next_token_id + 1 into tid from grid where grid_id = new.grid_id;
set new.token_id = tid;
update grid set next_token_id = tid where grid_id = new.grid_id;
end#
delimiter ;
-- TEST DATA
insert into grid (name) values ('g1'),('g2'),('g3');
insert into grid_token (grid_id, name) values
(1,'g1 t1'),(1,'g1 t2'),(1,'g1 t3'),
(2,'g2 t1'),
(3,'g3 t1'),(3,'g3 t2');
select * from grid;
select * from grid_token;
myisam implementation (not recommended)
-- TABLES
drop table if exists grid;
create table grid
(
grid_id int unsigned not null auto_increment primary key,
name varchar(255) not null
)
engine = myisam;
drop table if exists grid_token;
create table grid_token
(
grid_id int unsigned not null,
token_id int unsigned not null auto_increment,
name varchar(255) not null,
primary key (grid_id, token_id) -- non clustered PK
)
engine = myisam;
-- TEST DATA
insert into grid (name) values ('g1'),('g2'),('g3');
insert into grid_token (grid_id, name) values
(1,'g1 t1'),(1,'g1 t2'),(1,'g1 t3'),
(2,'g2 t1'),
(3,'g3 t1'),(3,'g3 t2');
select * from grid;
select * from grid_token;
My opinion: Rock-solid constraints at the app level. You may get it to work in SQL -- I've seen some people do some pretty amazing stuff. A lot of SQL logic used to be squirreled away in triggers, but I don't see much of that lately.
This smells more like business logic and you absolutely can get it done in Ruby without wrapping yourself around a tree. And... people will be able to see the tests and read the code.
This to me sounds like something you'd want to handle in an after_save method or in an observer. If the model itself doesn't need to be aware of when or how something increments then I'd stick the business logic in the observer. This approach will make the incrementing logic more expressive to other developers and database agnostic.

Using MySQL's "IN" function where the target is a column?

In a certain TABLE, I have a VARTEXT field which includes comma-separated values of country codes. The field is named cc_list. Typical entries look like the following:
'DE,US,IE,GB'
'IT,CA,US,FR,BE'
Now given a country code, I want to be able to efficiently find which records include that country. Obviously there's no point in indexing this field.
I can do the following
SELECT * from TABLE where cc_list LIKE '%US%';
But this is inefficient.
Since the "IN" function is supposed to be efficient (it bin-sorts the values), I was thinking along the lines of
SELECT * from TABLE where 'US' IN cc_list
But this doesn't work - I think the 2nd operand of IN needs to be a list of values, not a string. Is there a way to convert a CSV string to a list of values?
Any other suggestions? Thanks!
SELECT *
FROM MYTABLE
WHERE FIND_IN_SET('US', cc_list)
In a certain TABLE, I have a VARTEXT field which includes comma-separated values of country codes.
If you want your queries to be efficient, you should create a many-to-many link table:
CREATE TABLE table_country (cc CHAR(2) NOT NULL, tableid INT NOT NULL, PRIMARY KEY (cc, tableid))
SELECT *
FROM tablecountry tc
JOIN mytable t
ON t.id = tc.tableid
WHERE t.cc = 'US'
Alternatively, you can set ft_min_word_len to 2, create a FULLTEXT index on your column and query like this:
CREATE FULLTEXT INDEX fx_mytable_cclist ON mytable (cc_list);
SELECT *
FROM MYTABLE
WHERE MATCH(cc_list) AGAINST('+US' IN BOOLEAN MODE)
This only works for MyISAM tables and the argument should be a literal string (you won't be able to join on this condition).
The first rule of normalization says you should change multi-value columns such as cc_list into a single value field for this very reason.
Preferably into it's own table with IDs for each country code and a pivot table to support a many-to-many relationship.
CREATE TABLE my_table (
my_id INT(11) UNSIGNED NOT NULL AUTO_INCREMENT,
mystuff VARCHAR NOT NULL,
PRIMARY KEY(my_id)
);
# this is the pivot table
CREATE TABLE my_table_countries (
my_id INT(11) UNSIGNED NOT NULL,
country_id SMALLINT(5) UNSIGNED NOT NULL,
PRIMARY KEY(my_id, country_id)
);
CREATE TABLE countries {
country_id SMALLINT(5) UNSIGNED NOT NULL AUTO_INCREMENT,
country_code CHAR(2) NOT NULL,
country_name VARCHAR(100) NOT NULL,
PRIMARY KEY (country_id)
);
Then you can query it making use of indexes:
SELECT * FROM my_table JOIN my_table_countries USING (my_id) JOIN countries USING (country_id) WHERE country_code = 'DE'
SELECT * FROM my_table JOIN my_table_countries USING (my_id) JOIN countries USING (country_id) WHERE country_code IN('DE','US')
You may have to group the results my my_id.
find_in_set seems to be the MySql function you want. If you could actually store those comma-separated strings as MySql sets (no more than 64 possible countries, or splitting countries into two groups of no more than 64 each), you could keep using find_in_set and go a bit faster.
There's no efficient way to find what you want. A table scan will be necessary. Putting multiple values into a single text field is a terrible misuse of relational database technology. If you refactor (if you have access to the database structure) so that the country codes are properly stored in a separate table you will be able to easily and quickly retrieve the data you want.
One approach that I've used successfully before (not on mysql, though) is to place a trigger on the table that splits the values (based on a specific delimiter) into discrete values, inserting them into a sub-table. Your select can then look like this:
SELECT * from TABLE where cc_list IN
(
select cc_list_name from cc_list_subtable
where c_list_subtable.table_id = TABLE.id
)
where the trigger parses cc_list in TABLE into separate entries in column cc_list_name in table cc_list_subtable. It involves a bit of work in the trigger, too, as every change to TABLE means that associated rows in cc_list_table have to be deleted/updated/inserted as appropriate, but is an approach that works in situations where the original table TABLE has to retain its original structure, but where you are free to adapt the query as you see fit.