MySQL, need some performance suggestions on my match query - sql

I need some guidance on improving performance. My query takes several seconds to run, and this is causing problems on the server. This query runs on the most common page on my site. I think a radical rethink may be required.
~ EDIT ~
This query produces a list of records whose keywords match those of the program (record) being queried. My site is a software download directory, and this list is used on the program listing page to show other similar programs. PadID is the primary key of the program records in my database.
~ EDIT ~
Here's my query:
select match_keywords.PadID, count(match_keywords.Word) as matching_words
from keywords current_program_keywords
inner join keywords match_keywords on
match_keywords.Word=current_program_keywords.Word
where match_keywords.Word IS NOT NULL
and current_program_keywords.PadID=44243
group by match_keywords.PadID
order by matching_words DESC
LIMIT 0,11;
Here's the query explained.
Here's some sample data. However, I doubt you'd be able to see the effects of any performance tweaks without more data, which I can provide if you'd like.
CREATE TABLE IF NOT EXISTS `keywords` (
`Word` varchar(20) NOT NULL,
`PadID` bigint(20) NOT NULL,
`LetterIdx` varchar(1) NOT NULL,
KEY `Word` (`Word`),
KEY `LetterIdx` (`LetterIdx`),
KEY `PadID_2` (`PadID`,`Word`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
INSERT INTO `keywords` (`Word`, `PadID`, `LetterIdx`) VALUES
('tv', 44243, 'T'),
('satellite tv', 44243, 'S'),
('satellite tv to pc', 44243, 'S'),
('satellite', 44243, 'S'),
('your', 44243, 'X'),
('computer', 44243, 'C'),
('pc', 44243, 'P'),
('soccer on your pc', 44243, 'S'),
('sports on your pc', 44243, 'S'),
('television', 44243, 'T');
I've tried adding an index, but this doesn't make much difference.
ALTER TABLE `keywords` ADD INDEX ( `PadID` )

You might find this helpful, if I understood you correctly. The solution takes advantage of InnoDB's clustered primary key indexes (http://pastie.org/1195127).
EDIT: here are some links that may prove of interest:
http://dev.mysql.com/doc/refman/5.0/en/innodb-index-types.html
http://dev.mysql.com/doc/refman/5.0/en/innodb-adaptive-hash.html
drop table if exists programmes;
create table programmes
(
prog_id mediumint unsigned not null auto_increment primary key,
name varchar(255) unique not null
)
engine=innodb;
insert into programmes (name) values
('prog1'),('prog2'),('prog3'),('prog4'),('prog5'),('prog6');
drop table if exists keywords;
create table keywords
(
keyword_id mediumint unsigned not null auto_increment primary key,
name varchar(255) unique not null
)
engine=innodb;
insert into keywords (name) values
('tv'),('satellite tv'),('satellite tv to pc'),('pc'),('computer');
drop table if exists programme_keywords;
create table programme_keywords
(
keyword_id mediumint unsigned not null,
prog_id mediumint unsigned not null,
primary key (keyword_id, prog_id), -- note clustered composite primary key
key (prog_id)
)
engine=innodb;
insert into programme_keywords values
-- keyword 1
(1,1),(1,5),
-- keyword 2
(2,2),(2,4),
-- keyword 3
(3,1),(3,2),(3,5),(3,6),
-- keyword 4
(4,2),
-- keyword 5
(5,2),(5,3),(5,4);
/*
efficiently list all other programmes whose keywords match that of the
programme currently being queried (for instance prog_id = 1)
*/
drop procedure if exists list_matching_programmes;
delimiter #
create procedure list_matching_programmes
(
in p_prog_id mediumint unsigned
)
proc_main:begin
select
p.*
from
programmes p
inner join
(
select distinct -- other programmes with same keywords as current
pk.prog_id
from
programme_keywords pk
inner join
(
select keyword_id from programme_keywords where prog_id = p_prog_id
) current_programme -- the current program keywords
on pk.keyword_id = current_programme.keyword_id
inner join programmes p on pk.prog_id = p.prog_id
) matches
on matches.prog_id = p.prog_id
order by
p.prog_id;
end proc_main #
delimiter ;
call list_matching_programmes(1);
call list_matching_programmes(6);
explain
select
p.*
from
programmes p
inner join
(
select distinct
pk.prog_id
from
programme_keywords pk
inner join
(
select keyword_id from programme_keywords where prog_id = 1
) current_programme
on pk.keyword_id = current_programme.keyword_id
inner join programmes p on pk.prog_id = p.prog_id
) matches
on matches.prog_id = p.prog_id
order by
p.prog_id;
EDIT: added char_idx functionality as requested
alter table keywords add column char_idx char(1) null after name;
update keywords set char_idx = upper(substring(name,1,1));
select * from keywords;
explain
select
p.*
from
programmes p
inner join
(
select distinct
pk.prog_id
from
programme_keywords pk
inner join
(
select keyword_id from keywords where char_idx = 'P' -- just change the driver query
) keywords_starting_with
on pk.keyword_id = keywords_starting_with.keyword_id
) matches
on matches.prog_id = p.prog_id
order by
p.prog_id;

Try this approach. I'm not sure it will help, but at least it is different:
select PadID, count(Word) as matching_words
from keywords k
where Word in (
select Word
from keywords
where PadID=44243 )
group by PadID
order by matching_words DESC
LIMIT 0,11
In any case, the job you want done is heavy and full of string comparisons. Moving the keywords out into their own table and storing only numeric ids in the keyword table may reduce the query time.

OK, after reviewing your database I think there is not a lot of room to improve the query itself. On my test server, with the index on Word, it takes only about 0.15s to complete; without the index it is almost 4x slower.
That said, I think that implementing the change in database structure that f00 and I have suggested will improve the response time.
Also drop the index PadID_2: as it stands it is useless and will only slow down your writes.
What you should do, although it requires cleaning the database, is avoid duplicate keyword-PadID pairs. First remove all the duplicates currently in the DB (around 90k in my test with 3/4 of your DB); that will reduce query time and give meaningful results. If you query a PadID that has the keyword ABC, and that keyword is stored more than once for a second PadID, then the second PadID will rank above other PadIDs that have ABC stored only once. In my tests I have seen a PadID that gets several more matches than the very PadID I am querying.
After dropping the duplicates from the DB, you will need to change your application to avoid this problem in the future, and just to be safe you could add a primary key (or a unique index) on Word + PadID.
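To make the effect of duplicate rows concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for MySQL, purely so the example is self-contained; the PadIDs 1 to 3, the helper name top_matches and the index name uq_word_pad are invented for illustration. The rowid-based clean-up is SQLite-specific, so a MySQL clean-up would need a different statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE keywords (Word TEXT NOT NULL, PadID INTEGER NOT NULL)")
conn.executemany(
    "INSERT INTO keywords (Word, PadID) VALUES (?, ?)",
    [
        ("tv", 1), ("pc", 1),              # the program being queried
        ("tv", 2), ("tv", 2), ("tv", 2),   # the same pair stored three times
        ("tv", 3), ("pc", 3),              # genuinely shares two keywords
    ],
)

# Same shape as the query in the question, with a tie-breaker on PadID.
MATCH_SQL = """
    SELECT m.PadID, COUNT(m.Word) AS matching_words
    FROM keywords cur
    JOIN keywords m ON m.Word = cur.Word
    WHERE cur.PadID = ?
    GROUP BY m.PadID
    ORDER BY matching_words DESC, m.PadID
"""

def top_matches(pad_id):
    return conn.execute(MATCH_SQL, (pad_id,)).fetchall()

# PadID 2 shares one keyword but is counted three times.
with_duplicates = top_matches(1)

# Keep one row per (Word, PadID) pair, then forbid new duplicates.
conn.execute("""
    DELETE FROM keywords
    WHERE rowid NOT IN (SELECT MIN(rowid) FROM keywords GROUP BY Word, PadID)
""")
conn.execute("CREATE UNIQUE INDEX uq_word_pad ON keywords (Word, PadID)")
deduplicated = top_matches(1)

try:
    conn.execute("INSERT INTO keywords (Word, PadID) VALUES ('tv', 2)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(with_duplicates)   # PadID 2 ranks first, above the queried program itself
print(deduplicated)      # PadID 2 drops to a single match
```

With the duplicates in place, the program with one shared keyword outranks both the queried program and the program that really shares two keywords; after the clean-up the ranking reflects distinct shared keywords only.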

Related

SQL Server Indexing and Composite Keys

Given the following:
-- This table will have roughly 14 million records
CREATE TABLE IdMappings
(
Id int IDENTITY(1,1) NOT NULL,
OldId int NOT NULL,
NewId int NOT NULL,
RecordType varchar(80) NOT NULL, -- 15 distinct values, will never increase
Processed bit NOT NULL DEFAULT 0,
CONSTRAINT pk_IdMappings
PRIMARY KEY CLUSTERED (Id ASC)
)
CREATE UNIQUE INDEX ux_IdMappings_OldId ON IdMappings (OldId);
CREATE UNIQUE INDEX ux_IdMappings_NewId ON IdMappings (NewId);
and this is the most common query run against the table:
WHILE @firstBatchId <= @maxBatchId
BEGIN
-- the result of this is used to insert into another table:
SELECT
NewId, -- and lots of non-indexed columns from SOME_TABLE
FROM
IdMappings map
INNER JOIN
SOME_TABLE foo ON foo.Id = map.OldId
WHERE
map.Id BETWEEN @firstBatchId AND @lastBatchId
AND map.RecordType = @someRecordType
AND map.Processed = 0
-- We only really need this in case the user kills the binary or SQL Server service:
UPDATE IdMappings
SET Processed = 1
WHERE Id BETWEEN @firstBatchId AND @lastBatchId
AND RecordType = @someRecordType
SET @firstBatchId += 4999
SET @lastBatchId += 4999
END
What are the best indices to add? I figure Processed isn't worth indexing since it only has 2 values. Is it worth indexing RecordType since there are only about 15 distinct values? How many distinct values will a column likely have before we consider indexing it?
Is there any advantage in a composite key if some of the fields are in the WHERE and some are in a JOIN's ON condition? For example:
CREATE INDEX ix_IdMappings_RecordType_OldId
ON IdMappings (RecordType, OldId)
... if I wanted both these fields indexed (I'm not saying I do), does this composite key gain any advantage since both columns don't appear together in the same WHERE or same ON?
Insert time into IdMappings isn't really an issue. After we insert all records into the table, we don't need to do so again for months if ever.

How to optimize SQL query that uses GROUP BY and joined many-to-many relation tables?

I have tables with many-to-many relations:
CREATE TABLE `item` (
`id` mediumint(8) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(100) NOT NULL DEFAULT '',
`size_id` tinyint(3) NOT NULL DEFAULT 0,
PRIMARY KEY (`id`),
INDEX `size` (`size_id`)
);
CREATE TABLE `items_styles` (
`style_id` smallint(5) unsigned NOT NULL,
`item_id` mediumint(8) unsigned NOT NULL,
PRIMARY KEY (`item_id`, `style_id`),
INDEX `style` (`style_id`),
INDEX `item` (`item_id`),
CONSTRAINT `items_styles_item_id_item_id` FOREIGN KEY (`item_id`) REFERENCES `item` (`id`)
);
CREATE TABLE `items_themes` (
`theme_id` tinyint(3) unsigned NOT NULL,
`item_id` mediumint(8) unsigned NOT NULL,
PRIMARY KEY (`item_id`, `theme_id`),
INDEX `theme` (`theme_id`),
INDEX `item` (`item_id`),
CONSTRAINT `items_themes_item_id_item_id` FOREIGN KEY (`item_id`) REFERENCES `item` (`id`)
);
I'm trying to get the report that shows style_id and the number of items that use this style but with applying filters to the item table and/or to another table, like this:
SELECT i_s.style_id, COUNT(i.id) total FROM item i
JOIN items_themes i_t ON i.id = i_t.item_id AND i_t.theme_id IN (6, 7)
JOIN items_styles i_s ON i.id = i_s.item_id
GROUP BY i_s.style_id;
-- or like this
SELECT i_s.style_id, COUNT(i.id) total FROM item i
JOIN items_themes i_t ON i.id = i_t.item_id AND i_t.theme_id IN (6, 7)
JOIN items_styles i_s ON i.id = i_s.item_id
WHERE i.size_id != 3
GROUP BY i_s.style_id;
The problem is that tables are pretty big so queries take a long time to execute (~8 seconds)
item - 8M rows
items_styles - 12M rows
items_themes - 11M rows
Is there any way to optimize these queries? If not, what approach can be used to receive such reports.
I will be grateful for any help. Thanks.
First, you don't need the item table for the first query (the second one filters on i.size_id, so it still needs that join). Dropping it probably doesn't have much impact on performance, but there is no need for it.
So you can write the query as:
SELECT i_s.style_id, COUNT(*) as total
FROM items_themes i_t JOIN
items_styles i_s
ON i_s.item_id = i_t.item_id
WHERE i_t.theme_id IN (6, 7)
GROUP BY i_s.style_id;
For this query, you want an index on items_themes(theme_id, item_id). There is not much you can do about the GROUP BY.
Then, I don't think this is what you really want, because it will double count an item that has both themes. So, use EXISTS instead:
SELECT i_s.style_id, COUNT(*) as total
FROM items_styles i_s
WHERE EXISTS (SELECT 1
FROM items_themes i_t
WHERE i_t.item_id = i_s.item_id AND
i_t.theme_id IN (6, 7)
)
GROUP BY i_s.style_id;
For this, you want an index on items_themes(item_id, theme_id). You can also try an index on items_styles(style_id). Some databases might be able to use that one, but I am guessing not MariaDB.
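The double counting mentioned above is easy to reproduce on a handful of rows. The following is a small sketch using Python's built-in sqlite3 module (chosen only to keep it self-contained, not because the question uses SQLite); the item, style and theme ids are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items_styles (
        style_id INTEGER NOT NULL,
        item_id  INTEGER NOT NULL,
        PRIMARY KEY (item_id, style_id)
    );
    CREATE TABLE items_themes (
        theme_id INTEGER NOT NULL,
        item_id  INTEGER NOT NULL,
        PRIMARY KEY (item_id, theme_id)
    );
    -- item 1 has both themes 6 and 7, item 2 only theme 6, item 3 only theme 9
    INSERT INTO items_themes (theme_id, item_id) VALUES (6, 1), (7, 1), (6, 2), (9, 3);
    -- all three items use style 10
    INSERT INTO items_styles (style_id, item_id) VALUES (10, 1), (10, 2), (10, 3);
""")

# JOIN version: item 1 produces two joined rows, one per matching theme.
join_counts = conn.execute("""
    SELECT i_s.style_id, COUNT(*) AS total
    FROM items_themes i_t
    JOIN items_styles i_s ON i_s.item_id = i_t.item_id
    WHERE i_t.theme_id IN (6, 7)
    GROUP BY i_s.style_id
""").fetchall()

# EXISTS version: each items_styles row is counted at most once.
exists_counts = conn.execute("""
    SELECT i_s.style_id, COUNT(*) AS total
    FROM items_styles i_s
    WHERE EXISTS (SELECT 1
                  FROM items_themes i_t
                  WHERE i_t.item_id = i_s.item_id
                    AND i_t.theme_id IN (6, 7))
    GROUP BY i_s.style_id
""").fetchall()

print(join_counts)    # style 10 counted 3 times (item 1 twice)
print(exists_counts)  # style 10 counted 2 times (items 1 and 2)
```

Only two items with style 10 have theme 6 or 7, so the EXISTS result is the one that matches "number of items that use this style".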
In a many-to-many table, it is optimal to have these two indexes:
PRIMARY KEY (`item_id`, `style_id`),
INDEX `style` (`style_id`, `item_id`)
And be sure to use InnoDB.
More discussion: http://mysql.rjweb.org/doc.php/index_cookbook_mysql#many_to_many_mapping_table
Still, you have two many-to-many mappings, so there probably is no great solution.

Bad SQLite query performance with outer joins

I have an SQLite database as part of an iOS app which works fine for the most part, but certain small changes to a query can result in it taking 1000x longer to complete. Here are the two tables involved:
create table "journey_item" ("id" SERIAL NOT NULL PRIMARY KEY,
"position" INTEGER NOT NULL,
"last_update" BIGINT NOT NULL,
"rank" DOUBLE PRECISION NOT NULL,
"skipped" BOOLEAN NOT NULL,
"item_id" INTEGER NOT NULL,
"journey_id" INTEGER NOT NULL);
create table "content_items" ("id" SERIAL NOT NULL PRIMARY KEY,
"full_id" VARCHAR(32) NOT NULL,
"title" VARCHAR(508),
"timestamp" BIGINT NOT NULL,
"item_size" INTEGER NOT NULL,
"http_link" VARCHAR(254),
"local_url" VARCHAR(254),
"creator_id" INTEGER NOT NULL,
"from_id" INTEGER,"location_id" INTEGER);
Tables have indexes on primary and foreign keys.
And here are 2 queries which give a good example of my problem
SELECT * FROM content_items ci
INNER JOIN journey_item ji ON ji.item_id = ci.id WHERE ji.journey_id = 1
SELECT * FROM content_items ci
LEFT OUTER JOIN journey_item ji ON ji.item_id = ci.id WHERE ji.journey_id = 1
The first query takes 167 ms to complete while the second takes 3.5 minutes and I don't know why the outer join would make such a huge difference.
Edit:
Without the WHERE part the second query only takes 267 ms
The two queries should have the same result set (the where clause turns the left join into an inner join). However, SQLite probably doesn't recognize this.
If you have an index on journey_item(journey_id, item_id), then this would be used for the inner join version. However, the second version is probably scanning the first table for the join. An index on journey_item(item_id) would help, but probably still not match the performance of the first query.
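The claim that the WHERE clause turns the left join into an inner join can be checked on a few rows. This is a sketch using Python's built-in sqlite3 module with trimmed-down versions of the two tables; the sample rows and the index name idx_journey are invented. It also shows what changes if the filter is moved into the ON clause, which is the form to use when the unmatched rows are actually wanted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE content_items (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE journey_item (
        id         INTEGER PRIMARY KEY,
        item_id    INTEGER NOT NULL,
        journey_id INTEGER NOT NULL
    );
    -- the composite index suggested in the answer
    CREATE INDEX idx_journey ON journey_item (journey_id, item_id);
    INSERT INTO content_items (id, title) VALUES (1, 'a'), (2, 'b'), (3, 'c');
    -- item 1 is in journey 1, item 2 in journey 2, item 3 in no journey
    INSERT INTO journey_item (id, item_id, journey_id) VALUES (1, 1, 1), (2, 2, 2);
""")

def rows(sql):
    return conn.execute(sql).fetchall()

inner_join = rows("""
    SELECT ci.id, ji.journey_id FROM content_items ci
    INNER JOIN journey_item ji ON ji.item_id = ci.id
    WHERE ji.journey_id = 1 ORDER BY ci.id
""")

# The WHERE filter rejects the NULL-extended rows, so nothing "outer" survives.
left_join_where = rows("""
    SELECT ci.id, ji.journey_id FROM content_items ci
    LEFT OUTER JOIN journey_item ji ON ji.item_id = ci.id
    WHERE ji.journey_id = 1 ORDER BY ci.id
""")

# Filter inside ON: every content item is kept, unmatched ones get NULLs.
left_join_on = rows("""
    SELECT ci.id, ji.journey_id FROM content_items ci
    LEFT OUTER JOIN journey_item ji ON ji.item_id = ci.id AND ji.journey_id = 1
    ORDER BY ci.id
""")

print(inner_join)
print(left_join_where)
print(left_join_on)
```

Since the first two result sets are identical, writing the query as an INNER JOIN loses nothing and leaves the planner free to start from journey_item.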

Moving table columns to new table and referencing as foreign key in PostgreSQL

Suppose we have a DB table with fields
"id", "category", "subcategory", "brand", "name", "description", etc.
What's a good way of creating separate tables for
category, subcategory and brand
and the corresponding columns and rows in the original table becoming foreign key references?
To outline the operations involved:
get all unique values in each column of the original table which should become foreign keys;
create tables for those
create foreign key reference columns in the original table (or a copy)
In this case, the PostgreSQL DB is accessed via Sequel in a Ruby app, so available interfaces are the command line, Sequel, PGAdmin, etc...
The question: how would you do this?
-- Some test data
CREATE TABLE animals
( id SERIAL NOT NULL PRIMARY KEY
, name varchar
, category varchar
, subcategory varchar
);
INSERT INTO animals(name, category, subcategory) VALUES
( 'Chimpanzee' , 'mammals', 'apes' )
,( 'Urang Utang' , 'mammals', 'apes' )
,( 'Homo Sapiens' , 'mammals', 'apes' )
,( 'Mouse' , 'mammals', 'rodents' )
,( 'Rat' , 'mammals', 'rodents' )
;
-- [empty] table to contain the "squeezed out" domain
CREATE TABLE categories
( id SERIAL NOT NULL PRIMARY KEY
, category varchar
, subcategory varchar
, UNIQUE (category,subcategory)
);
-- The original table needs a "link" to the new table
ALTER TABLE animals
ADD column category_id INTEGER -- NOT NULL
REFERENCES categories(id)
;
-- FK constraints are helped a lot by a supportive index.
CREATE INDEX animals_categories_fk ON animals (category_id);
-- Chained query to:
-- * populate the domain table
-- * initialize the FK column in the original table
WITH ins AS (
INSERT INTO categories(category, subcategory)
SELECT DISTINCT a.category, a.subcategory
FROM animals a
RETURNING *
)
UPDATE animals ani
SET category_id = ins.id
FROM ins
WHERE ins.category = ani.category
AND ins.subcategory = ani.subcategory
;
-- Now that we have the FK pointing to the new table,
-- we can drop the redundant columns.
ALTER TABLE animals DROP COLUMN category, DROP COLUMN subcategory;
-- show it to the world
SELECT a.*
, c.category, c.subcategory
FROM animals a
JOIN categories c ON c.id = a.category_id
;
Note: the fragment:
WHERE ins.category = ani.category
AND ins.subcategory = ani.subcategory
will lead to problems if these columns contain NULLs.
It would be better to compare them using
(ins.category,ins.subcategory)
IS NOT DISTINCT FROM
(ani.category,ani.subcategory)
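The NULL problem can be seen on two rows. The sketch below uses Python's built-in sqlite3 module rather than PostgreSQL, so two things are assumptions of the demo: SQLite's IS operator stands in for IS NOT DISTINCT FROM, and a correlated subquery stands in for the UPDATE ... FROM of the answer. The 'Platypus' row and the helper name link are invented test data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE categories (
        id INTEGER PRIMARY KEY, category TEXT, subcategory TEXT
    );
    CREATE TABLE animals (
        id INTEGER PRIMARY KEY, name TEXT,
        category TEXT, subcategory TEXT, category_id INTEGER
    );
    INSERT INTO animals (name, category, subcategory) VALUES
        ('Mouse',    'mammals', 'rodents'),
        ('Platypus', 'mammals', NULL);      -- subcategory unknown
    INSERT INTO categories (category, subcategory)
        SELECT DISTINCT category, subcategory FROM animals;
""")

def link(op):
    """Fill category_id comparing with `op`, then report which rows got linked."""
    conn.execute("UPDATE animals SET category_id = NULL")
    conn.execute(f"""
        UPDATE animals SET category_id = (
            SELECT c.id FROM categories c
            WHERE c.category    {op} animals.category
              AND c.subcategory {op} animals.subcategory)
    """)
    return conn.execute(
        "SELECT name, category_id IS NOT NULL FROM animals ORDER BY name"
    ).fetchall()

linked_with_equals = link("=")    # NULL = NULL is not true, so Platypus stays unlinked
linked_null_safe = link("IS")     # NULL IS NULL is true, so both rows are linked

print(linked_with_equals)
print(linked_null_safe)
```

With plain equality the row whose subcategory is NULL silently keeps a NULL category_id, which is exactly the problem the note warns about.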
I'm not sure I completely understand your question. If this doesn't seem to answer it, please leave a comment and possibly edit your question to clarify. It sounds like you want to do a CREATE TABLE xxx AS. For example:
CREATE TABLE category AS (SELECT DISTINCT(category) AS id FROM parent_table);
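Then give the new table a primary key and alter the parent_table to add a foreign key constraint. (PostgreSQL only accepts a foreign key that references a primary key or unique constraint, and CREATE TABLE ... AS creates neither.)
ALTER TABLE category ADD PRIMARY KEY (id);
ALTER TABLE parent_table ADD CONSTRAINT category_fk FOREIGN KEY (category) REFERENCES category (id);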
Repeat this for each table you want to create.
Here is the related documentation:
CREATE TABLE
ALTER TABLE
Note: code and references are for Postgresql 9.4

SQL JOIN To Find Records That Don't Have a Matching Record With a Specific Value

I'm trying to speed up some code that I wrote years ago for my employer's purchase authorization app. Basically I have a SLOW subquery that I'd like to replace with a JOIN (if it's faster).
When the director logs into the application he sees a list of purchase requests he has yet to authorize or deny. That list is generated with the following query:
SELECT * FROM SA_ORDER WHERE ORDER_ID NOT IN
(SELECT ORDER_ID FROM SA_SIGNATURES WHERE TYPE = 'administrative director');
There are only about 900 records in sa_order and 1800 records in sa_signature and this query still takes about 5 seconds to execute. I've tried using a LEFT JOIN to retrieve records I need, but I've only been able to get sa_order records with NO matching records in sa_signature, and I need sa_order records with "no matching records with a type of 'administrative director'". Your help is greatly appreciated!
The schema for the two tables is as follows:
CREATE TABLE sa_order
(
`order_id` BIGINT PRIMARY KEY AUTO_INCREMENT,
`order_number` BIGINT NOT NULL,
`submit_date` DATE NOT NULL,
`vendor_id` BIGINT NOT NULL,
`DENIED` BOOLEAN NOT NULL DEFAULT FALSE,
`MEMO` MEDIUMTEXT,
`year_id` BIGINT NOT NULL,
`advisor` VARCHAR(255) NOT NULL,
`deleted` BOOLEAN NOT NULL DEFAULT FALSE
);
CREATE TABLE sa_signature
(
`signature_id` BIGINT PRIMARY KEY AUTO_INCREMENT,
`order_id` BIGINT NOT NULL,
`signature` VARCHAR(255) NOT NULL,
`proxy` BOOLEAN NOT NULL DEFAULT FALSE,
`timestamp` TIMESTAMP NOT NULL DEFAULT NOW(),
`username` VARCHAR(255) NOT NULL,
`type` VARCHAR(255) NOT NULL
);
Create an index on sa_signatures (type, order_id).
It is not necessary to convert the query into a LEFT JOIN unless sa_signatures allows nulls in order_id. With the index, the NOT IN will perform just as well. However, just in case you're curious:
SELECT o.*
FROM sa_order o
LEFT JOIN
sa_signatures s
ON s.order_id = o.order_id
AND s.type = 'administrative director'
WHERE s.type IS NULL
You should pick a NOT NULL column from sa_signatures for the WHERE clause to perform well.
You could replace the [NOT] IN operator with EXISTS for faster performance.
So you'll have:
SELECT * FROM SA_ORDER WHERE NOT EXISTS
(SELECT ORDER_ID FROM SA_SIGNATURES
WHERE TYPE = 'administrative director'
AND ORDER_ID = SA_ORDER.ORDER_ID);
Reason : "When using “NOT IN”, the query performs nested full table scans, whereas for “NOT EXISTS”, query can use an index within the sub-query."
Source : http://decipherinfosys.wordpress.com/2007/01/21/32/
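For reference, the three formulations discussed in these answers (NOT IN, NOT EXISTS and LEFT JOIN ... IS NULL) can be compared side by side. This is a sketch using Python's built-in sqlite3 module, not the poster's MySQL setup, so it says nothing about relative speed; the tables are cut down to the relevant columns, the sample rows and the index name idx_sig_type_order are invented, and the table is called sa_signature as in the posted schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sa_order (order_id INTEGER PRIMARY KEY, advisor TEXT NOT NULL);
    CREATE TABLE sa_signature (
        signature_id INTEGER PRIMARY KEY,
        order_id     INTEGER NOT NULL,
        type         TEXT NOT NULL
    );
    -- the composite index recommended in the first answer
    CREATE INDEX idx_sig_type_order ON sa_signature (type, order_id);
    INSERT INTO sa_order (order_id, advisor) VALUES (1, 'ann'), (2, 'bob'), (3, 'cat');
    -- order 1 is fully signed, order 2 lacks the director, order 3 has no signatures
    INSERT INTO sa_signature (order_id, type) VALUES
        (1, 'advisor'), (1, 'administrative director'), (2, 'advisor');
""")

def order_ids(sql):
    return [row[0] for row in conn.execute(sql)]

not_in = order_ids("""
    SELECT order_id FROM sa_order
    WHERE order_id NOT IN (SELECT order_id FROM sa_signature
                           WHERE type = 'administrative director')
    ORDER BY order_id
""")

not_exists = order_ids("""
    SELECT o.order_id FROM sa_order o
    WHERE NOT EXISTS (SELECT 1 FROM sa_signature s
                      WHERE s.type = 'administrative director'
                        AND s.order_id = o.order_id)
    ORDER BY o.order_id
""")

# The type filter lives in ON; the WHERE tests a NOT NULL column of the joined table.
left_join = order_ids("""
    SELECT o.order_id FROM sa_order o
    LEFT JOIN sa_signature s
           ON s.order_id = o.order_id AND s.type = 'administrative director'
    WHERE s.signature_id IS NULL
    ORDER BY o.order_id
""")

print(not_in, not_exists, left_join)
```

All three return the orders still waiting for the director, including the order that has other signatures, which is the case the question's own LEFT JOIN attempt missed.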
The following query should work; however, I suspect your real issue is that you don't have the proper indices in place. You should have an index on the ORDER_ID column of the SA_SIGNATURES table.
SELECT *
FROM
SA_ORDER
LEFT JOIN
SA_SIGNATURES
ON
SA_ORDER.ORDER_ID = SA_SIGNATURES.ORDER_ID AND
TYPE = 'administrative director'
WHERE
SA_SIGNATURES.ORDER_ID IS NULL;
select * from sa_order as o inner join sa_signature as s on o.order_id = s.order_id and s.type = 'administrative director'
Also, you can create a non-clustered index on type in the sa_signature table.
Even better: have a master table for types with typeid and typename, and then, instead of saving type as text in your sa_signature table, simply save it as an integer. That's because comparing integers is much faster than comparing text.