MySQL - help me optimize this query

About the system:
- The system has a total of 8 tables:
- Users
- Tutor_Details (tutors are a type of user; the Tutor_Details table is linked to Users)
- Learning_Packs (stores packs created by tutors)
- Learning_Packs_Tag_Relations (holds tag relations used for search)
- Tutors_Tag_Relations
- Tags
- Orders (contains purchase details of tutors' packs)
- Order_Details (linked to Orders and Tutor_Details)
For a clearer idea of the tables involved, please check the section "The tables" at the end.
- A tag-based search approach is followed. Tag relations are created when new tutors register and when tutors create packs (this makes tutors and packs searchable). For details, please check the section "How tags work in this system?" below.
The following is a simplified representation (not the actual query) of the more complex query I am trying to optimize. Backtick-quoted statements stand in for explanations of parts of the query.
============================================================================
select
SUM(DISTINCT( t.tag LIKE "%Dictatorship%" )) as key_1_total_matches,
SUM(DISTINCT( t.tag LIKE "%democracy%" )) as key_2_total_matches,
td.*, u.*, count(distinct(od.id_od)), `if (lp.id_lp > 0) then some conditional logic on lp fields else 0 as tutor_popularity`
from Tutor_Details AS td JOIN Users as u on u.id_user = td.id_user
LEFT JOIN Learning_Packs_Tag_Relations AS lptagrels ON td.id_tutor = lptagrels.id_tutor
LEFT JOIN Learning_Packs AS lp ON lptagrels.id_lp = lp.id_lp
LEFT JOIN `some other tables on lp.id_lp - let's call learning pack tables set (including
Learning_Packs table)`
LEFT JOIN Order_Details as od on td.id_tutor = od.id_author LEFT JOIN Orders as o on
od.id_order = o.id_order
LEFT JOIN Tutors_Tag_Relations as ttagrels ON td.id_tutor = ttagrels.id_tutor
JOIN Tags as t on (t.id_tag = ttagrels.id_tag) OR (t.id_tag = lptagrels.id_tag)
where `some condition on Users table's fields`
AND CASE WHEN ((t.id_tag = lptagrels.id_tag) AND (lp.id_lp > 0)) THEN `some
conditions on learning pack tables set` ELSE 1 END
AND CASE WHEN ((t.id_tag = wtagrels.id_tag) AND (wc.id_wc > 0)) THEN `some
conditions on webclasses tables set` ELSE 1 END
AND CASE WHEN (od.id_od>0) THEN od.id_author = td.id_tutor and `some conditions on Orders table's fields` ELSE 1 END
AND ( t.tag LIKE "%Dictatorship%" OR t.tag LIKE "%democracy%")
group by td.id_tutor HAVING key_1_total_matches = 1 AND key_2_total_matches = 1
order by tutor_popularity desc, u.surname asc, u.name asc limit 0,20
=====================================================================
What does the above query do?
It performs an AND-logic search on the search keywords (2 in this example - "Democracy" and "Dictatorship").
It returns only those tutors for which both keywords are present in the union of two sets: the tutor's details and the details of all the packs created by that tutor.
To make things clear, suppose a tutor named "Sandeepan Nath" has created a pack "My first pack". Then:
Searching "Sandeepan Nath" returns Sandeepan Nath.
Searching "Sandeepan first" returns Sandeepan Nath.
Searching "Sandeepan second" does not return Sandeepan Nath.
======================================================================================
The problem
The results returned by the above query are correct (the AND logic works as expected), but on heavily loaded databases the query takes about 25 seconds, against normal query timings of the order of 0.0002 - 0.005 seconds, which makes it totally unusable.
Some of the delay may be because not all the relevant fields have been indexed yet, but I would appreciate a better query as a solution, optimized as much as possible and returning the same results.
==========================================================================================
How tags work in this system?
When a tutor registers, tags are entered and tag relations are created with respect to the tutor's details like name, surname etc.
When a tutor creates packs, tags are again entered and tag relations are created with respect to the pack's details like pack name, description etc.
Tag relations for tutors are stored in tutors_tag_relations and those for packs are stored in learning_packs_tag_relations. All individual tags are stored in the tags table.
====================================================================
The tables
Most of the following tables contain many other fields which I have omitted here.
CREATE TABLE IF NOT EXISTS `users` (
`id_user` int(10) unsigned NOT NULL AUTO_INCREMENT,
`name` varchar(100) NOT NULL DEFAULT '',
`surname` varchar(155) NOT NULL DEFAULT '',
PRIMARY KEY (`id_user`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=636 ;
CREATE TABLE IF NOT EXISTS `tutor_details` (
`id_tutor` int(10) NOT NULL AUTO_INCREMENT,
`id_user` int(10) NOT NULL DEFAULT '0',
PRIMARY KEY (`id_tutor`),
KEY `Users_FKIndex1` (`id_user`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 AUTO_INCREMENT=51 ;
CREATE TABLE IF NOT EXISTS `orders` (
`id_order` int(10) unsigned NOT NULL AUTO_INCREMENT,
PRIMARY KEY (`id_order`),
KEY `Orders_FKIndex1` (`id_user`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=275 ;
ALTER TABLE `orders`
ADD CONSTRAINT `Orders_ibfk_1` FOREIGN KEY (`id_user`) REFERENCES `users`
(`id_user`) ON DELETE NO ACTION ON UPDATE NO ACTION;
CREATE TABLE IF NOT EXISTS `order_details` (
`id_od` int(10) unsigned NOT NULL AUTO_INCREMENT,
`id_order` int(10) unsigned NOT NULL DEFAULT '0',
`id_author` int(10) NOT NULL DEFAULT '0',
PRIMARY KEY (`id_od`),
KEY `Order_Details_FKIndex1` (`id_order`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=284 ;
ALTER TABLE `order_details`
ADD CONSTRAINT `Order_Details_ibfk_1` FOREIGN KEY (`id_order`) REFERENCES `orders`
(`id_order`) ON DELETE NO ACTION ON UPDATE NO ACTION;
CREATE TABLE IF NOT EXISTS `learning_packs` (
`id_lp` int(10) unsigned NOT NULL AUTO_INCREMENT,
`id_author` int(10) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`id_lp`),
KEY `Learning_Packs_FKIndex2` (`id_author`),
KEY `id_lp` (`id_lp`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 AUTO_INCREMENT=23 ;
CREATE TABLE IF NOT EXISTS `tags` (
`id_tag` int(10) unsigned NOT NULL AUTO_INCREMENT,
`tag` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id_tag`),
UNIQUE KEY `tag` (`tag`),
KEY `id_tag` (`id_tag`),
KEY `tag_2` (`tag`),
KEY `tag_3` (`tag`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1 AUTO_INCREMENT=3419 ;
CREATE TABLE IF NOT EXISTS `tutors_tag_relations` (
`id_tag` int(10) unsigned NOT NULL DEFAULT '0',
`id_tutor` int(10) DEFAULT NULL,
KEY `Tutors_Tag_Relations` (`id_tag`),
KEY `id_tutor` (`id_tutor`),
KEY `id_tag` (`id_tag`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
ALTER TABLE `tutors_tag_relations`
ADD CONSTRAINT `Tutors_Tag_Relations_ibfk_1` FOREIGN KEY (`id_tag`) REFERENCES
`tags` (`id_tag`) ON DELETE NO ACTION ON UPDATE NO ACTION;
CREATE TABLE IF NOT EXISTS `learning_packs_tag_relations` (
`id_tag` int(10) unsigned NOT NULL DEFAULT '0',
`id_tutor` int(10) DEFAULT NULL,
`id_lp` int(10) unsigned DEFAULT NULL,
KEY `Learning_Packs_Tag_Relations_FKIndex1` (`id_tag`),
KEY `id_lp` (`id_lp`),
KEY `id_tag` (`id_tag`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
ALTER TABLE `learning_packs_tag_relations`
ADD CONSTRAINT `Learning_Packs_Tag_Relations_ibfk_1` FOREIGN KEY (`id_tag`)
REFERENCES `tags` (`id_tag`) ON DELETE NO ACTION ON UPDATE NO ACTION;
===================================================================================
Following is the exact query (this includes classes also - tutors can create classes and search terms are matched with classes created by tutors):-
SELECT SUM(DISTINCT( t.tag LIKE "%Dictatorship%" )) AS key_1_total_matches,
SUM(DISTINCT( t.tag LIKE "%democracy%" )) AS key_2_total_matches,
COUNT(DISTINCT( od.id_od )) AS tutor_popularity,
CASE
WHEN ( IF(( wc.id_wc > 0 ), ( wc.wc_api_status = 1
AND wc.wc_type = 0
AND wc.class_date > '2010-06-01 22:00:56'
AND wccp.status = 1
AND ( wccp.country_code = 'IE'
OR wccp.country_code IN ( 'INT' )
) ), 0)
) THEN 1
ELSE 0
END AS 'classes_published',
CASE
WHEN ( IF(( lp.id_lp > 0 ), ( lp.id_status = 1
AND lp.published = 1
AND lpcp.status = 1
AND ( lpcp.country_code = 'IE'
OR lpcp.country_code IN ( 'INT' )
) ), 0)
) THEN 1
ELSE 0
END AS 'packs_published',
td.*,
u.*
FROM tutor_details AS td
JOIN users AS u
ON u.id_user = td.id_user
LEFT JOIN learning_packs_tag_relations AS lptagrels
ON td.id_tutor = lptagrels.id_tutor
LEFT JOIN learning_packs AS lp
ON lptagrels.id_lp = lp.id_lp
LEFT JOIN learning_packs_categories AS lpc
ON lpc.id_lp_cat = lp.id_lp_cat
LEFT JOIN learning_packs_categories AS lpcp
ON lpcp.id_lp_cat = lpc.id_parent
LEFT JOIN learning_pack_content AS lpct
ON ( lp.id_lp = lpct.id_lp )
LEFT JOIN webclasses_tag_relations AS wtagrels
ON td.id_tutor = wtagrels.id_tutor
LEFT JOIN webclasses AS wc
ON wtagrels.id_wc = wc.id_wc
LEFT JOIN learning_packs_categories AS wcc
ON wcc.id_lp_cat = wc.id_wp_cat
LEFT JOIN learning_packs_categories AS wccp
ON wccp.id_lp_cat = wcc.id_parent
LEFT JOIN order_details AS od
ON td.id_tutor = od.id_author
LEFT JOIN orders AS o
ON od.id_order = o.id_order
LEFT JOIN tutors_tag_relations AS ttagrels
ON td.id_tutor = ttagrels.id_tutor
JOIN tags AS t
ON ( t.id_tag = ttagrels.id_tag )
OR ( t.id_tag = lptagrels.id_tag )
OR ( t.id_tag = wtagrels.id_tag )
WHERE ( u.country = 'IE'
OR u.country IN ( 'INT' ) )
AND CASE
WHEN ( ( t.id_tag = lptagrels.id_tag )
AND ( lp.id_lp > 0 ) ) THEN lp.id_status = 1
AND lp.published = 1
AND lpcp.status = 1
AND ( lpcp.country_code = 'IE'
OR lpcp.country_code IN (
'INT'
) )
ELSE 1
END
AND CASE
WHEN ( ( t.id_tag = wtagrels.id_tag )
AND ( wc.id_wc > 0 ) ) THEN wc.wc_api_status = 1
AND wc.wc_type = 0
AND
wc.class_date > '2010-06-01 22:00:56'
AND wccp.status = 1
AND ( wccp.country_code = 'IE'
OR wccp.country_code IN (
'INT'
) )
ELSE 1
END
AND CASE
WHEN ( od.id_od > 0 ) THEN od.id_author = td.id_tutor
AND o.order_status = 'paid'
AND CASE
WHEN ( od.id_wc > 0 ) THEN od.can_attend_class = 1
ELSE 1
END
ELSE 1
END
GROUP BY td.id_tutor
HAVING key_1_total_matches = 1
AND key_2_total_matches = 1
ORDER BY tutor_popularity DESC,
u.surname ASC,
u.name ASC
LIMIT 0, 20
Please note - The provided database structure does not show all the fields and tables as in this query
=====================================================================================
The explain query output:-
Please see this screenshot
http://www.test.examvillage.com/Explain_query.jpg

It would help to have information on row counts, value distributions, indexes, the size of the database, the size of memory, the disk layout (RAID 0, 5, etc.), how many users are hitting your database when queries are slow, and what other queries are running. All these things factor into performance.
A printout of the explain plan output may also shed some light on the cause if it's simply a query or index issue. The exact query would be needed as well.

You really should use some better formatting for the query.
Just add at least 4 spaces to the beginning of each row to get this nice code formatting.
SELECT * FROM sometable
INNER JOIN anothertable ON sometable.id = anothertable.sometable_id
Or have a look here: https://stackoverflow.com/editing-help
Could you provide the execution plan from MySQL? Add "EXPLAIN" in front of the query and copy the result.
EXPLAIN SELECT * FROM ...complexquery...
This will give you some useful hints (execution order, returned rows, available/used indexes).

Your question is, "how can I find tutors that match certain tags?" That's not a hard question, so the query to answer it shouldn't be hard either.
Something like:
SELECT *
FROM tutors
WHERE tags LIKE '%Dictator%' AND tags LIKE '%Democracy%'
That will work, if you modify your design to have a "tags" field in your "tutors" table, in which you put all the tags that apply to that tutor. It will eliminate layers of joins and tables.
Are all those layers of joins and tables providing real functionality, or just more programming headaches? Think about the functionality that your app REALLY needs, and then simplify your database design!!

Answering my own question.
The main problem with this approach was that too many tables were joined in a single query. Some of those tables, like Tags, hold a large number of records (in future it could hold as many as all the words in the English vocabulary); joining such a table with so many others causes a multiplication effect in the number of intermediate rows that cannot be countered within a single query.
The solution is basically to make sure that too many joins are not made in a single query. Breaking one large join query into steps, and using the results of one query (involving joins on some of the tables) as input for the next query (involving joins on the other tables), reduces the multiplication effect.
I will try to provide a better explanation of this later.
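To give a rough idea of what "breaking it into steps" can look like, here is a minimal sketch. It is not the query I ended up with: it uses Python's sqlite3 module with a heavily trimmed copy of the schema so that it runs anywhere, and the sample rows, the `search_tutors` helper and the omission of the pack/order conditions are all assumptions made for illustration. Step 1 resolves each keyword to a set of tutor ids using only the tag tables; step 2 joins the (small) surviving id set to the remaining tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Trimmed, hypothetical copy of the schema with a few sample rows.
conn.executescript("""
CREATE TABLE users (id_user INTEGER PRIMARY KEY, name TEXT, surname TEXT);
CREATE TABLE tutor_details (id_tutor INTEGER PRIMARY KEY, id_user INTEGER);
CREATE TABLE tags (id_tag INTEGER PRIMARY KEY, tag TEXT);
CREATE TABLE tutors_tag_relations (id_tag INTEGER, id_tutor INTEGER);
CREATE TABLE learning_packs_tag_relations (id_tag INTEGER, id_tutor INTEGER, id_lp INTEGER);

INSERT INTO users VALUES (1, 'Sandeepan', 'Nath'), (2, 'Other', 'Tutor');
INSERT INTO tutor_details VALUES (10, 1), (20, 2);
INSERT INTO tags VALUES (1, 'Sandeepan'), (2, 'Nath'), (3, 'first'), (4, 'pack'), (5, 'second');
INSERT INTO tutors_tag_relations VALUES (1, 10), (2, 10);
INSERT INTO learning_packs_tag_relations VALUES (3, 10, 100), (4, 10, 100), (5, 20, 200);
""")

# Step 1 query: tutors reachable from one keyword, via tutor tags or pack tags.
TUTORS_FOR_KEYWORD = """
SELECT r.id_tutor FROM tutors_tag_relations r
JOIN tags t ON t.id_tag = r.id_tag WHERE t.tag LIKE ?
UNION
SELECT r.id_tutor FROM learning_packs_tag_relations r
JOIN tags t ON t.id_tag = r.id_tag WHERE t.tag LIKE ?
"""

def search_tutors(conn, keywords):
    # Step 1: one small query per keyword, then intersect the id sets (AND logic).
    per_keyword = []
    for kw in keywords:
        pattern = f"%{kw}%"
        rows = conn.execute(TUTORS_FOR_KEYWORD, (pattern, pattern)).fetchall()
        per_keyword.append({row[0] for row in rows})
    ids = set.intersection(*per_keyword) if per_keyword else set()
    if not ids:
        return []
    # Step 2: join only the surviving tutor ids to the remaining tables.
    placeholders = ",".join("?" * len(ids))
    sql = f"""
        SELECT td.id_tutor, u.name, u.surname
        FROM tutor_details td JOIN users u ON u.id_user = td.id_user
        WHERE td.id_tutor IN ({placeholders})
        ORDER BY u.surname, u.name
    """
    return conn.execute(sql, sorted(ids)).fetchall()

print(search_tutors(conn, ["Sandeepan", "first"]))
print(search_tutors(conn, ["Sandeepan", "second"]))
```

The point of the split is that the tags table is only ever joined to one relation table at a time, so the row multiplication from joining everything at once never happens. Popularity counts and the published/country conditions would go into step 2 or a further step.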

Related

Duplicate rows returned even though group by is used

This is my query
SELECT p.book FROM customers_books p
INNER JOIN books b ON p.book = b.id
INNER JOIN bookprices bp ON bp.book = p.book
WHERE b.status = 'PUBLISHED' AND bp.currency_code = 'GBP'
AND p.book NOT IN (SELECT cb.book FROM customers_books cb WHERE cb.customer = 1)
GROUP BY p.book, p.created_date ORDER BY p.created_date DESC
Given the data in my customers_books table, I expect only book IDs 8, 6 and 1 to be returned, but the query returns 8, 6, 1, 1.
The table structures are:
CREATE TABLE "public"."customers_books" (
"id" int8 NOT NULL,
"created_date" timestamp(6),
"book" int8,
"customer" int8
);
CREATE TABLE "public"."books" (
"id" int8 NOT NULL,
"created_date" timestamp(6),
"status" varchar(255) COLLATE "pg_catalog"."default"
);
CREATE TABLE "public"."bookprices" (
"id" int8 NOT NULL,
"currency_code" varchar(255) COLLATE "pg_catalog"."default",
"book" int8
);
What do you think I am doing wrong here?
I really don't want to use p.created_date in the GROUP BY, but I was forced to because of the ORDER BY.
You have too many joins in the outer query:
SELECT b.id
FROM books b INNER JOIN
bookprices bp
ON bp.book = b.id
WHERE b.status = 'PUBLISHED' AND bp.currency_code = 'GBP' AND
NOT EXISTS (SELECT 1
FROM customers_books cb
WHERE cb.book = b.id AND cb.customer = 1
) ;
Note that I replaced the NOT IN with NOT EXISTS. I strongly, strongly discourage you from using NOT IN with a subquery. If the subquery returns any NULL values, then NOT IN returns no rows at all. It is better to sidestep this issue just by using NOT EXISTS.
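The NULL behaviour is easy to see on a toy example. The sketch below uses Python's sqlite3 module rather than PostgreSQL, and the three books and two customers_books rows are made-up sample data, but the three-valued logic involved is standard SQL: a single NULL in the subquery result turns every non-matching NOT IN comparison into NULL, so no row passes the filter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data: customer 1 owns book 6 and has one row with a NULL book.
conn.executescript("""
CREATE TABLE books (id INTEGER PRIMARY KEY);
CREATE TABLE customers_books (book INTEGER, customer INTEGER);
INSERT INTO books VALUES (1), (6), (8);
INSERT INTO customers_books VALUES (6, 1), (NULL, 1);
""")

# NOT IN: "id NOT IN (6, NULL)" is never true, so nothing is returned.
not_in = conn.execute("""
    SELECT id FROM books
    WHERE id NOT IN (SELECT book FROM customers_books WHERE customer = 1)
    ORDER BY id
""").fetchall()

# NOT EXISTS: the NULL row simply never matches cb.book = b.id.
not_exists = conn.execute("""
    SELECT b.id FROM books b
    WHERE NOT EXISTS (SELECT 1 FROM customers_books cb
                      WHERE cb.book = b.id AND cb.customer = 1)
    ORDER BY b.id
""").fetchall()

print(not_in)      # []
print(not_exists)  # [(1,), (8,)]
```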

Complex conditional SQL statement in SQLite

I'm trying to build a support system in which I now face a complex query. I've got a couple of tables in my SQLite database which look like this (slightly simplified):
CREATE TABLE "assign" (
"id" INTEGER NOT NULL PRIMARY KEY,
"created" DATETIME NOT NULL,
"is_assigned" SMALLINT NOT NULL,
"user_id" INTEGER NOT NULL REFERENCES "user" ("id")
);
CREATE TABLE "message" (
"id" INTEGER NOT NULL PRIMARY KEY,
"created" DATETIME NOT NULL,
"user_id" INTEGER REFERENCES "user" ("id") ,
"text" TEXT NOT NULL
);
CREATE TABLE "user" (
"id" INTEGER NOT NULL PRIMARY KEY,
"name" VARCHAR(255) NOT NULL
);
I now want to do a query which gives me *a list of users for which the last created Assign.is_assigned == False and the last created Message is later than the last created Assign*. So I now have the following (pseudo) query:
SELECT *
FROM user
WHERE ((IF (
SELECT is_assigned
FROM assign
WHERE assign.user_id = user.id
ORDER BY created DESC LIMIT 1
) = False)
AND ((
SELECT created
FROM message
WHERE message.user_id = user.id
ORDER BY created DESC
LIMIT 1
) > (
SELECT created
FROM assign
WHERE assign.user_id = user.id
ORDER BY created DESC
LIMIT 1))
);
This makes sense to me, but unfortunately not to the computer. I guess I need to make use of case statements or even joins or something but I have no clue how. Does anybody have a tip on how to do this?
You don't need the IF in there, and older SQLite versions have no FALSE keyword (TRUE and FALSE were added in SQLite 3.23.0), but otherwise your query is quite correct:
SELECT *
FROM "user"
WHERE NOT (SELECT is_assigned
FROM assign
WHERE user_id = "user".id
ORDER BY created DESC
LIMIT 1)
AND (SELECT created
FROM message
WHERE user_id = "user".id
ORDER BY created DESC
LIMIT 1
) > (
SELECT created
FROM assign
WHERE user_id = "user".id
ORDER BY created DESC
LIMIT 1)
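To check the query against some data, here is a small sketch using Python's sqlite3 module. The three users and their assign/message rows are invented for the test, and `created` is stored as ISO date text (an assumption; such strings compare correctly as text). Only user 1 should qualify: user 2's latest assign has is_assigned = 1, and user 3's latest message is older than the latest assign.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data for the three tables from the question.
conn.executescript("""
CREATE TABLE "user" (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE assign (id INTEGER PRIMARY KEY, created TEXT NOT NULL,
                     is_assigned INTEGER NOT NULL, user_id INTEGER NOT NULL);
CREATE TABLE message (id INTEGER PRIMARY KEY, created TEXT NOT NULL,
                      user_id INTEGER, text TEXT NOT NULL);
INSERT INTO "user" VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
INSERT INTO assign VALUES
    (1, '2024-01-01', 1, 1),
    (2, '2024-01-02', 0, 1),
    (3, '2024-01-02', 1, 2),
    (4, '2024-01-05', 0, 3);
INSERT INTO message VALUES
    (1, '2024-01-03', 1, 'hi'),
    (2, '2024-01-03', 2, 'hi'),
    (3, '2024-01-04', 3, 'hi');
""")

# Same shape as the query above: three correlated "latest row" subqueries.
QUERY = """
SELECT id, name
FROM "user"
WHERE NOT (SELECT is_assigned FROM assign
           WHERE user_id = "user".id
           ORDER BY created DESC LIMIT 1)
  AND (SELECT created FROM message
       WHERE user_id = "user".id
       ORDER BY created DESC LIMIT 1)
    > (SELECT created FROM assign
       WHERE user_id = "user".id
       ORDER BY created DESC LIMIT 1)
ORDER BY id
"""

rows = conn.execute(QUERY).fetchall()
print(rows)  # [(1, 'ann')]
```

Users with no assign rows drop out as well, because NOT applied to a NULL subquery result is NULL rather than true.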
Try the following query, which I created in MySQL:
SELECT u.id AS 'user',u.name AS 'User_Name', ass.created AS 'assign_created',ass.is_assigned AS 'is_assigned',
msg.created AS 'message_created'
FROM `user` AS u
LEFT JOIN `assign` AS ass ON ass.`user_id` = u.`id`
LEFT JOIN `message` AS msg ON msg.`user_id` = u.id
LEFT JOIN (SELECT u.id AS 'user_id',u.name AS 'username',ass.created AS 'max_ass_created',ass.is_assigned AS 'assigned'
FROM `user` AS u
LEFT JOIN `assign` AS ass ON ass.`user_id` = u.`id`
LEFT JOIN `message` AS msg ON msg.`user_id` = u.`id`
GROUP BY u.id ORDER BY ass.created DESC) AS sub ON sub.user_id = u.id
WHERE (sub.assigned IS FALSE AND msg.created < sub.max_ass_created)
Check it against an SQL Fiddle of your scenario; I hope this solves your problem.

sql queries slower than expected

Before I show the query here are the relevant table definitions:
CREATE TABLE phpbb_posts (
topic_id mediumint(8) UNSIGNED DEFAULT '0' NOT NULL,
poster_id mediumint(8) UNSIGNED DEFAULT '0' NOT NULL,
KEY topic_id (topic_id),
KEY poster_id (poster_id)
);
CREATE TABLE phpbb_topics (
topic_id mediumint(8) UNSIGNED NOT NULL auto_increment
);
Here's the query I'm trying to do:
SELECT p.topic_id, p.poster_id
FROM phpbb_topics AS t
LEFT JOIN phpbb_posts AS p
ON p.topic_id = t.topic_id
AND p.poster_id <> ...
WHERE p.poster_id IS NULL;
Basically, the query is an attempt to find all topics in which nobody other than the target user has posted. In other words, the topics where the only person who has posted is the target user.
The problem is that the query takes a very long time. Here's the EXPLAIN for it:
Array
(
[id] => 1
[select_type] => SIMPLE
[table] => t
[type] => index
[possible_keys] =>
[key] => topic_approved
[key_len] => 1
[ref] =>
[rows] => 146484
[Extra] => Using index
)
Array
(
[id] => 1
[select_type] => SIMPLE
[table] => p
[type] => ref
[possible_keys] => topic_id,poster_id,tid_post_time
[key] => tid_post_time
[key_len] => 3
[ref] => db_name.t.topic_id
[rows] => 1
[Extra] => Using where; Not exists
)
My general assumption when it comes to SQL is that JOINs of any kind are super fast and can be done in no time at all, assuming all relevant columns are primary or foreign keys (which in this case they are).
I tried out a few other queries:
SELECT COUNT(1)
FROM phpbb_topics AS t
JOIN phpbb_posts AS p
ON p.topic_id = t.topic_id;
That returns 353340 pretty quickly.
I then do these:
SELECT COUNT(1)
FROM phpbb_topics AS t
JOIN phpbb_posts AS p
ON p.topic_id = t.topic_id
AND p.poster_id <> 77198;
SELECT COUNT(1)
FROM phpbb_topics AS t
JOIN phpbb_posts AS p
ON p.topic_id = t.topic_id
WHERE p.poster_id <> 77198;
And both of those take quite a while (between 15-30 seconds). If I change the <> to a = it takes no time at all.
Am I making some incorrect assumptions? Maybe my DB is just foobar'd?
I think replacing the index on phpbb_posts(topic_id) with a composite index on the two fields should improve the performance of your query:
CREATE TABLE phpbb_posts (
topic_id mediumint(8) UNSIGNED DEFAULT '0' NOT NULL,
poster_id mediumint(8) UNSIGNED DEFAULT '0' NOT NULL,
-- KEY topic_id (topic_id),
KEY topic_id_poster_id (topic_id, poster_id),
KEY poster_id (poster_id)
);
Your indexes look sufficient to me... could you try this query and let me know how the performance compares to your original?
SELECT sub.topic_id
FROM (
SELECT t.topic_id
FROM phpbb_topics AS t
WHERE
EXISTS (
SELECT *
FROM phpbb_posts p
WHERE
p.topic_id = t.topic_id
AND p.poster_id = 77198
)
) sub
WHERE
NOT EXISTS (
SELECT *
FROM phpbb_posts p
WHERE
p.topic_id = sub.topic_id
AND p.poster_id <> 77198
)
My thinking is that by limiting the topics to only those in which the poster in question has actually posted, the anti-join (implemented here with NOT EXISTS instead of a LEFT JOIN) will have to check far fewer topics for posters other than the one being searched.
SELECT t.topic_id
FROM phpbb_topics AS t
JOIN phpbb_posts AS p1
ON p1.topic_id = t.topic_id
AND p1.poster_id = $poster_id
LEFT JOIN phpbb_posts AS p2
ON p2.topic_id = t.topic_id
AND p2.poster_id <> $poster_id
WHERE p2.poster_id IS NULL
That made it a ton faster. I'm getting all the posts where the target user has posted, with the topic info attached, and then looking for anyone other than the target who has posted in those topics.
There'll be lots of duplicates in the p1.poster_id column, but since I'm not actually selecting that column I figure duplicates in it don't matter a whole lot.
Thanks!
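For anyone who wants to try the pattern on a small data set, here is a sketch using Python's sqlite3 module. The three topics, five posts and poster id 7 are made-up sample data, and nothing here says anything about timings on the real phpBB tables. One thing it does make visible: the join form returns a topic once per post by the target user, so the second run adds DISTINCT.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample data: poster 7 is alone in topic 1 (two posts),
# shares topic 2 with poster 8, and has not posted in topic 3.
conn.executescript("""
CREATE TABLE phpbb_topics (topic_id INTEGER PRIMARY KEY);
CREATE TABLE phpbb_posts (topic_id INTEGER NOT NULL, poster_id INTEGER NOT NULL);
CREATE INDEX topic_id_poster_id ON phpbb_posts (topic_id, poster_id);
INSERT INTO phpbb_topics VALUES (1), (2), (3);
INSERT INTO phpbb_posts VALUES (1, 7), (1, 7), (2, 7), (2, 8), (3, 8);
""")

# Inner join narrows to topics the target posted in; the LEFT JOIN ... IS NULL
# part is the anti-join that rejects topics with any other poster.
TEMPLATE = """
SELECT {select} t.topic_id
FROM phpbb_topics AS t
JOIN phpbb_posts AS p1
  ON p1.topic_id = t.topic_id AND p1.poster_id = :poster
LEFT JOIN phpbb_posts AS p2
  ON p2.topic_id = t.topic_id AND p2.poster_id <> :poster
WHERE p2.poster_id IS NULL
ORDER BY t.topic_id
"""

with_dups = conn.execute(TEMPLATE.format(select=""), {"poster": 7}).fetchall()
distinct = conn.execute(TEMPLATE.format(select="DISTINCT"), {"poster": 7}).fetchall()

print(with_dups)  # [(1,), (1,)]
print(distinct)   # [(1,)]
```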

Select rows with highest version for a shared key

Here's the Schema in mysql:
CREATE TABLE IF NOT EXISTS `labs_test`.`games` (
`game_id` INT NOT NULL AUTO_INCREMENT ,
`key` VARCHAR(45) NOT NULL ,
`config` BLOB NOT NULL ,
`game_version` BIGINT NOT NULL DEFAULT 1 ,
PRIMARY KEY (`game_id`) ,
INDEX `key` (`key`(8) ASC) ,
INDEX `version` (`game_version` ASC) )
ENGINE = InnoDB
DEFAULT CHARACTER SET = utf8
I've tried using MAX(game_version) to no avail. Should I give up the dream and use a sub query?
Use:
SELECT g.*
FROM GAMES g
JOIN (SELECT t.key,
MAX(t.game_version) AS max_version
FROM GAMES t
GROUP BY t.key) x ON x.key = g.key
AND x.max_version = g.game_version
You asked: "Should I give up the dream and use a sub query?"
Some would call what I used in my answer a subquery, but it'd be more accurate to call it a derived table/inline view. Unless I'm missing something, I don't see how you can get the records without some form of self-join (joining the same table onto itself). Here's the EXISTS alternative:
SELECT g.*
FROM GAMES g
WHERE EXISTS(SELECT NULL
FROM GAMES t
WHERE t.key = g.key
GROUP BY t.key
HAVING MAX(t.game_version) = g.game_version)
Use the EXPLAIN plan to determine which performs best, rather than be concerned with whether or not to use a subquery.
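A quick way to sanity-check the derived-table version is to run it on a few rows. The sketch below uses Python's sqlite3 module instead of MySQL, quotes the `key` column defensively, and the three sample games are invented; it only shows that the join keeps the highest game_version per key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical sample rows: key 'a' has versions 1 and 2, key 'b' only version 1.
conn.executescript("""
CREATE TABLE games (
    game_id INTEGER PRIMARY KEY,
    "key" TEXT NOT NULL,
    config BLOB NOT NULL,
    game_version INTEGER NOT NULL DEFAULT 1
);
INSERT INTO games VALUES (1, 'a', x'00', 1), (2, 'a', x'00', 2), (3, 'b', x'00', 1);
""")

# Derived table x holds the max version per key; joining back selects those rows.
latest = conn.execute("""
    SELECT g.game_id, g."key", g.game_version
    FROM games g
    JOIN (SELECT t."key" AS k, MAX(t.game_version) AS max_version
          FROM games t
          GROUP BY t."key") x
      ON x.k = g."key" AND x.max_version = g.game_version
    ORDER BY g.game_id
""").fetchall()

print(latest)  # [(2, 'a', 2), (3, 'b', 1)]
```

If two rows share the same key and the same maximum version, both are returned; whether that is acceptable depends on whether (key, game_version) is meant to be unique.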

How do you perform an AND with a join?

I have the following data structure and data:
CREATE TABLE `parent` (
`id` int(11) NOT NULL auto_increment,
`name` varchar(10) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
INSERT INTO `parent` VALUES(1, 'parent 1');
INSERT INTO `parent` VALUES(2, 'parent 2');
CREATE TABLE `other` (
`id` int(11) NOT NULL auto_increment,
`name` varchar(10) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
INSERT INTO `other` VALUES(1, 'other 1');
INSERT INTO `other` VALUES(2, 'other 2');
CREATE TABLE `relationship` (
`id` int(11) NOT NULL auto_increment,
`parent_id` int(11) NOT NULL,
`other_id` int(11) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
INSERT INTO `relationship` VALUES(1, 1, 1);
INSERT INTO `relationship` VALUES(2, 1, 2);
INSERT INTO `relationship` VALUES(3, 2, 1);
I want to find the parent records that are related to both other 1 and other 2.
This is what I've figured out, but I'm wondering if there is a better way:
SELECT p.id, p.name
FROM parent AS p
LEFT JOIN relationship AS r1 ON (r1.parent_id = p.id)
LEFT JOIN relationship AS r2 ON (r2.parent_id = p.id)
WHERE r1.other_id = 1 AND r2.other_id = 2;
The result is 1, "parent 1" which is correct. The problem is that once you get a list of 5+ joins, it gets messy and as the relationship table grows, it gets slow.
Is there a better way?
I'm using MySQL and PHP, but this is probably pretty generic.
Ok, I tested this. The queries from best to worst were:
Query 1: Joins (0.016s; basically instant)
SELECT p.id, name
FROM parent p
JOIN relationship r1 ON p.id = r1.parent_id AND r1.other_id = 100
JOIN relationship r2 ON p.id = r2.parent_id AND r2.other_id = 101
JOIN relationship r3 ON p.id = r3.parent_id AND r3.other_id = 102
JOIN relationship r4 ON p.id = r4.parent_id AND r4.other_id = 103
Query 2: EXISTS (0.625s)
SELECT id, name
FROM parent p
WHERE EXISTS (SELECT 1 FROM relationship WHERE parent_id = p.id AND other_id = 100)
AND EXISTS (SELECT 1 FROM relationship WHERE parent_id = p.id AND other_id = 101)
AND EXISTS (SELECT 1 FROM relationship WHERE parent_id = p.id AND other_id = 102)
AND EXISTS (SELECT 1 FROM relationship WHERE parent_id = p.id AND other_id = 103)
Query 3: Aggregate (1.016s)
SELECT p.id, p.name
FROM parent p
WHERE (SELECT COUNT(*) FROM relationship WHERE parent_id = p.id AND other_id IN (100,101,102,103)) = 4
Query 4: UNION Aggregate (2.39s)
SELECT id, name FROM (
SELECT p1.id, p1.name
FROM parent AS p1 LEFT JOIN relationship as r1 ON(r1.parent_id=p1.id)
WHERE r1.other_id = 100
UNION ALL
SELECT p2.id, p2.name
FROM parent AS p2 LEFT JOIN relationship as r2 ON(r2.parent_id=p2.id)
WHERE r2.other_id = 101
UNION ALL
SELECT p3.id, p3.name
FROM parent AS p3 LEFT JOIN relationship as r3 ON(r3.parent_id=p3.id)
WHERE r3.other_id = 102
UNION ALL
SELECT p4.id, p4.name
FROM parent AS p4 LEFT JOIN relationship as r4 ON(r4.parent_id=p4.id)
WHERE r4.other_id = 103
) a
GROUP BY id, name
HAVING count(*) = 4
Actually the above was producing the wrong data so it's either wrong or I did something wrong with it. Whatever the case, the above is just a bad idea.
If that's not fast then you need to look at the explain plan for the query. You're probably just lacking appropriate indices. Try it with:
CREATE INDEX idx1 ON relationship (parent_id, other_id)
Before you go down the route of aggregation (SELECT COUNT(*) FROM ...) you should read SQL Statement - “Join” Vs “Group By and Having”.
Note: The above timings are based on:
CREATE TABLE parent (
id INT PRIMARY KEY,
name VARCHAR(50)
);
CREATE TABLE other (
id INT PRIMARY KEY,
name VARCHAR(50)
);
CREATE TABLE relationship (
id INT PRIMARY KEY,
parent_id INT,
other_id INT
);
CREATE INDEX idx1 ON relationship (parent_id, other_id);
CREATE INDEX idx2 ON relationship (other_id, parent_id);
and nearly 800,000 records created with:
<?php
ini_set('max_execution_time', 600);
$start = microtime(true);
echo "<pre>\n";
mysql_connect('localhost', 'scratch', 'scratch');
if (mysql_error()) {
echo "Connect error: " . mysql_error() . "\n";
}
mysql_select_db('scratch');
if (mysql_error()) {
echo "Select DB error: " . mysql_error() . "\n";
}
define('PARENTS', 100000);
define('CHILDREN', 100000);
define('MAX_CHILDREN', 10);
define('SCATTER', 10);
$rel = 0;
for ($i=1; $i<=PARENTS; $i++) {
query("INSERT INTO parent VALUES ($i, 'Parent $i')");
$potential = range(max(1, $i - SCATTER), min(CHILDREN, $i + SCATTER));
$elements = sizeof($potential);
$other = rand(1, min(MAX_CHILDREN, $elements - 4));
$j = 0;
while ($j < $other) {
$index = rand(0, $elements - 1);
if (isset($potential[$index])) {
$c = $potential[$index];
$rel++;
query("INSERT INTO relationship VALUES ($rel, $i, $c)");
unset($potential[$index]);
$j++;
}
}
}
for ($i=1; $i<=CHILDREN; $i++) {
query("INSERT INTO other VALUES ($i, 'Other $i')");
}
$count = PARENTS + CHILDREN + $rel;
$stop = microtime(true);
$duration = $stop - $start;
$insert = $duration / $count;
echo "$count records added.\n";
echo "Program ran for $duration seconds.\n";
echo "Insert time $insert seconds.\n";
echo "</pre>\n";
function query($str) {
mysql_query($str);
if (mysql_error()) {
echo "$str: " . mysql_error() . "\n";
}
}
?>
So once again joins carry the day.
Given that the relationship table has a unique key on (parent_id, other_id), you can do this:
select p.id, p.name
from parent as p
where (select count(*)
from relationship as r
where r.parent_id = p.id
and r.other_id in (1,2)
) >= 2
Simplifying a bit, this should work, and efficiently.
SELECT DISTINCT p.id, p.name
FROM parent p
INNER JOIN relationship r1 ON p.id = r1.parent_id AND r1.other_id = 1
INNER JOIN relationship r2 ON p.id = r2.parent_id AND r2.other_id = 2
The inner joins require at least one joined record for each "other" value. The optimizer should know that it only has to find one match for each, and that it only needs to read the index, not either of the subsidiary tables, one of which isn't even referenced at all.
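Since the question mentions the joins getting messy at 5+ values, the join list can be generated rather than written by hand. A minimal sketch, assuming Python's sqlite3 module and the sample rows from the question; `parents_with_all` is a hypothetical helper name, and the other_id values are bound as parameters rather than pasted into the SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Same data as in the question, on SQLite instead of MySQL.
conn.executescript("""
CREATE TABLE parent (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE relationship (id INTEGER PRIMARY KEY, parent_id INTEGER NOT NULL,
                           other_id INTEGER NOT NULL);
CREATE INDEX idx1 ON relationship (parent_id, other_id);
INSERT INTO parent VALUES (1, 'parent 1'), (2, 'parent 2');
INSERT INTO relationship VALUES (1, 1, 1), (2, 1, 2), (3, 2, 1);
""")

def parents_with_all(conn, other_ids):
    # One inner join per required other_id; a parent survives only if every join matches.
    joins = [
        f"JOIN relationship r{n} ON r{n}.parent_id = p.id AND r{n}.other_id = ?"
        for n in range(len(other_ids))
    ]
    sql = ("SELECT DISTINCT p.id, p.name FROM parent p "
           + " ".join(joins)
           + " ORDER BY p.id")
    return conn.execute(sql, list(other_ids)).fetchall()

print(parents_with_all(conn, [1, 2]))  # [(1, 'parent 1')]
print(parents_with_all(conn, [1]))     # [(1, 'parent 1'), (2, 'parent 2')]
```

Only the alias numbers are interpolated into the SQL text, so the generated statement stays safe even when the id list comes from user input.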
I haven't actually tested it, but something along the lines of:
SELECT id, name FROM (
SELECT p1.id, p1.name
FROM parent AS p1 LEFT JOIN relationship as r1 ON(r1.parent_id=p1.id)
WHERE r1.other_id = 1
UNION ALL
SELECT p2.id, p2.name
FROM parent AS p2 LEFT JOIN relationship as r2 ON(r2.parent_id=p2.id)
WHERE r2.other_id = 2
-- etc
) AS a GROUP BY id, name
HAVING count(*) = 2
The idea is you don't have to do multi-way joins; just concatenate the results of regular joins, group by your ids, and pick the rows that showed up in every segment.
This is a common problem when searching multiple associated records via a many-to-many join. It is often encountered in services that use the 'tag' concept, e.g. Stack Overflow.
See my other post on a better architecture for tag (in your case 'other') storage.
Searching is a two-step process:
Find all possible candidate TagCollections that have any/all of the tags you require (this may be easier using a cursor or loop construct).
Select the data that matches those TagCollections.
Performance is better because there are significantly fewer TagCollections than data items to search.
You can do it with a nested select. I tested it in MSSQL 2005, but as you said it should be pretty generic:
SELECT * FROM parent p
WHERE p.id IN (
SELECT r.parent_id
FROM relationship r
WHERE r.other_id IN (1, 2)
GROUP BY r.parent_id
HAVING COUNT(r.parent_id) = 2
)
(The number 2 in COUNT(r.parent_id) = 2 is the number of other_id values you need to match.)
If you can put your list of other_id values into a table that would be ideal. The code below looks for parents with AT LEAST the ids given. If you want it to have EXACTLY the same ids (i.e. no extras) you would have to change the query slightly.
SELECT
P.id,
P.name
FROM
My_Other_IDs MOI
INNER JOIN relationship R ON
R.other_id = MOI.other_id
INNER JOIN parent P ON
P.id = R.parent_id
GROUP BY
P.id,
P.name
HAVING
COUNT(*) = (SELECT COUNT(*) FROM My_Other_IDs)
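Here is a small runnable sketch of the lookup-table idea, using Python's sqlite3 module and the question's sample rows. The `my_other_ids` temporary table and the unique index on (parent_id, other_id) are assumptions: without that uniqueness, COUNT(*) could be inflated by duplicate relationship rows and the comparison would no longer mean "has all of them".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The question's data, plus a hypothetical lookup table holding the wanted ids.
conn.executescript("""
CREATE TABLE parent (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE relationship (id INTEGER PRIMARY KEY, parent_id INTEGER NOT NULL,
                           other_id INTEGER NOT NULL);
CREATE UNIQUE INDEX uq_parent_other ON relationship (parent_id, other_id);
INSERT INTO parent VALUES (1, 'parent 1'), (2, 'parent 2');
INSERT INTO relationship VALUES (1, 1, 1), (2, 1, 2), (3, 2, 1);

CREATE TEMP TABLE my_other_ids (other_id INTEGER PRIMARY KEY);
INSERT INTO my_other_ids VALUES (1), (2);
""")

# A parent qualifies when it matches as many lookup rows as the lookup table holds.
matches = conn.execute("""
    SELECT p.id, p.name
    FROM my_other_ids moi
    JOIN relationship r ON r.other_id = moi.other_id
    JOIN parent p ON p.id = r.parent_id
    GROUP BY p.id, p.name
    HAVING COUNT(*) = (SELECT COUNT(*) FROM my_other_ids)
    ORDER BY p.id
""").fetchall()

print(matches)  # [(1, 'parent 1')]
```

The query text stays the same however many ids are searched for; only the contents of the lookup table change.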