How do you query for comments stackoverflow style? - sql

I saw this question on meta: https://meta.stackexchange.com/questions/33101/how-does-so-query-comments
I wanted to set the record straight and ask the question in a proper technical way.
Say I have 2 tables:
Posts
id
content
parent_id (null for questions, question_id for answer)
Comments
id
body
is_deleted
post_id
upvotes
date
Note: I think this is how the schema for SO is set up; answers have a parent_id pointing at their question, and questions have null there. Questions and answers are stored in the same table.
How do I pull out comments stackoverflow style in a very efficient way with minimal round trips?
The rules:
A single query should pull out all the comments needed for a page with multiple posts to render
Needs to pull out only 5 comments per answer, with preference for upvoted ones
Needs to provide enough information to inform the user that there are more comments beyond the 5 shown (and the actual count, e.g. "2 more comments")
Sorting is really hairy for comments, as you can see on the comments in this question. The rules are: display comments by date; HOWEVER, if a comment has an upvote it gets preferential treatment and is displayed as well at the bottom of the list. (This is nasty hard to express in SQL.)
If any denormalizations make stuff better what are they? What indexes are critical?

I wouldn't bother to filter the comments using SQL (which may surprise you because I'm an SQL advocate). Just fetch them all sorted by CommentId, and filter them in application code.
It's actually pretty infrequent that there are more than five comments for a given post and any filtering is needed at all. In StackOverflow's October data dump, 78% of posts have zero or one comment, and 97% have five or fewer. Only 20 posts have 50 or more comments, and only two posts have over 100.
So writing complex SQL to do that kind of filtering would increase complexity when querying all posts. I'm all for using clever SQL when appropriate, but this would be penny-wise and pound-foolish.
You could do it this way:
SELECT q.PostId, a.PostId, c.CommentId
FROM Posts q
LEFT OUTER JOIN Posts a
ON (a.ParentId = q.PostId)
LEFT OUTER JOIN Comments c
ON (c.PostId IN (q.PostId, a.PostId))
WHERE q.PostId = 1234
ORDER BY q.PostId, a.PostId, c.CommentId;
But this gives you redundant copies of the q and a columns for every comment row, which becomes significant once you select the columns you actually need, text blobs included. The extra cost of copying that redundant text from the RDBMS to the app adds up.
So it's probably better to do this in two queries. For the comments, given that the client is viewing a Question with PostId = 1234, do the following:
SELECT c.PostId, c.Text
FROM Comments c
JOIN (SELECT 1234 AS PostId UNION ALL
SELECT a.PostId FROM Posts a WHERE a.ParentId = 1234) p
ON (c.PostId = p.PostId);
And then sort through them in application code, collecting them by the referenced post and filtering out extra comments beyond the five most interesting ones per post.
I tested these two queries against a MySQL 5.1 database loaded with StackOverflow's data dump from October. The first query takes about 50 seconds. The second query is pretty much instantaneous (after I pre-cached indexes for the Posts and Comments tables).
The bottom line is that insisting on fetching all the data you need using a single SQL query is an artificial requirement (probably based on a misconception that the round-trip of issuing a query against an RDBMS is overhead that must be minimized at any cost). Often a single query is a less efficient solution. Do you try to write all your application code in a single function? :-)
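For a sense of the complexity being avoided, here is a sketch of what the in-database filtering could look like with a window function. This assumes a DBMS that supports them (SQL Server, PostgreSQL, MySQL 8+ — notably not the MySQL 5.1 used for the timings above), and Upvotes/CreationDate stand in for whatever the vote and date columns are actually called:
SELECT PostId, CommentId, Text
FROM (
    SELECT c.PostId, c.CommentId, c.Text,
           -- rank each post's comments: most upvoted first, then oldest first
           ROW_NUMBER() OVER (PARTITION BY c.PostId
                              ORDER BY c.Upvotes DESC, c.CreationDate) AS rn
    FROM Comments c
    JOIN (SELECT 1234 AS PostId UNION ALL
          SELECT a.PostId FROM Posts a WHERE a.ParentId = 1234) p
      ON c.PostId = p.PostId
) ranked
WHERE rn <= 5;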

The real question is not the query, but the schema, especially the clustered indexes. The comment ordering requirements are ambiguous as you defined them (is it only 5 per answer or not?). I interpreted the requirements as "pull 5 comments per post (answer or question) and give preference to upvoted ones, then to newer ones". I know this is not how SO comments are shown, but you've got to define your requirements more precisely.
Here is my query:
declare @postId int;
set @postId = ?;

with cteQuestionAndResponses as (
    select post_id
    from Posts
    where post_id = @postId
    union all
    select post_id
    from Posts
    where parent_id = @postId)
select *
from cteQuestionAndResponses p
outer apply (
    select count(*) as CommentsCount
    from Comments c
    where is_deleted = 0
    and c.post_id = p.post_id) as cc
outer apply (
    select top(5) *
    from Comments c
    where is_deleted = 0
    and p.post_id = c.post_id
    order by upvotes desc, date desc
) as c
I have some 14k posts and 67k comments in my test tables; the query gets the posts in 7 ms:
Table 'Comments'. Scan count 12, logical reads 50, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Posts'. Scan count 1, logical reads 5, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 7 ms.
Here is the schema I tested with:
create table Posts (
    post_id int identity(1,1) not null
    , content varchar(max) not null
    , parent_id int null -- (null for questions, question_id for answer)
    , constraint fkPostsParent_id
        foreign key (parent_id)
        references Posts(post_id)
    , constraint pkPostsId primary key nonclustered (post_id)
);

create clustered index cdxPosts on
    Posts(parent_id, post_id);
go

create table Comments (
    comment_id int identity(1,1) not null
    , body varchar(max) not null
    , is_deleted bit not null default 0
    , post_id int not null
    , upvotes int not null default 0
    , date datetime not null default getutcdate()
    , constraint pkComments primary key nonclustered (comment_id)
    , constraint fkCommentsPostId
        foreign key (post_id)
        references Posts(post_id)
);

create clustered index cdxComments on
    Comments (is_deleted, post_id, upvotes, date, comment_id);
go
and here is my test data generation:
insert into Posts (content)
select 'Lorem Ipsum'
from master..spt_values;

insert into Posts (content, parent_id)
select 'Ipsum Lorem', post_id
from Posts p
cross apply (
    -- abs() guards against a negative checksum producing a negative TOP value
    select top(abs(checksum(newid(), p.post_id)) % 10) Number
    from master..spt_values) as r
where parent_id is NULL;

insert into Comments (body, is_deleted, post_id, upvotes, date)
select 'Sit Amet'
    -- 5% deleted comments
    , case when abs(checksum(newid(), p.post_id, r.Number)) % 100 > 95 then 1 else 0 end
    , p.post_id
    -- up to 10 upvotes
    , abs(checksum(newid(), p.post_id, r.Number)) % 10
    -- up to 1 year old posts
    , dateadd(minute, -abs(checksum(newid(), p.post_id, r.Number) % 525600), getutcdate())
from Posts p
cross apply (
    select top(abs(checksum(newid(), p.post_id)) % 10) Number
    from master..spt_values) as r;

Use:
WITH post_hierarchy AS (
    SELECT p.id,
           p.content,
           p.parent_id,
           1 AS post_level
    FROM POSTS p
    WHERE p.parent_id IS NULL
    UNION ALL
    SELECT p.id,
           p.content,
           p.parent_id,
           ph.post_level + 1 AS post_level
    FROM POSTS p
    JOIN post_hierarchy ph ON ph.id = p.parent_id)
SELECT ph.id,
       ph.post_level,
       c.upvotes,
       c.body
FROM COMMENTS c
JOIN post_hierarchy ph ON ph.id = c.post_id
ORDER BY ph.post_level, c.date
Couple of things to be aware of:
StackOverflow displays the first 5 comments whether or not they were upvoted; subsequent comments that were upvoted are also displayed immediately
You can't accommodate a limit of 5 comments per post without devoting a SELECT to each post; adding TOP 5 to what I posted would only return the first five rows of the whole result, based on the ORDER BY clause
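That said, where window functions are available (SQL Server 2005+, PostgreSQL, MySQL 8+), one way around that limitation in a single statement is ROW_NUMBER. A sketch, reusing the hierarchy CTE from above:
WITH post_hierarchy AS (
    SELECT p.id, p.content, p.parent_id, 1 AS post_level
    FROM POSTS p
    WHERE p.parent_id IS NULL
    UNION ALL
    SELECT p.id, p.content, p.parent_id, ph.post_level + 1
    FROM POSTS p
    JOIN post_hierarchy ph ON ph.id = p.parent_id)
SELECT id, post_level, upvotes, body
FROM (SELECT ph.id, ph.post_level, c.upvotes, c.body, c.date,
             -- number each post's comments, most upvoted and newest first
             ROW_NUMBER() OVER (PARTITION BY c.post_id
                                ORDER BY c.upvotes DESC, c.date DESC) AS rn
      FROM COMMENTS c
      JOIN post_hierarchy ph ON ph.id = c.post_id) ranked
WHERE rn <= 5
ORDER BY post_level, date;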

Nested SQL call

EDIT:
As requested, our table schema is,
posts:
postid (primary key),
post_text
comments:
commentid (primary key),
postid (foreign key referencing posts.postid),
comment_text
replies:
replyid (primary key),
commentid (foreign key referencing comments.commentid),
reply_text
I have the tables posts, comments, and replies in a SQL database. (Obviously, a post can have comments, and a comment can have replies)
I want to return a post based on its id, postid.
So I would like a database function with the following inputs and outputs:
input:
postid
output:
post = {
  postid,
  post_text,
  comments: [comment, ...]
}
Where the comments and replies are nested in the post,
comment = {
  commentid,
  comment_text,
  replies: [reply, ...]
}
reply = {
  replyid,
  reply_text
}
I have tried using joins, but the returned data is highly redundant, and it seems stupid. For instance, fetching the data for two different replies will give:

postid  post_text  commentid  comment_text  replyid  reply_text
======  =========  =========  ============  =======  ==========
1       POST_TEXT  78         COMMENT_TEXT  14       REPLY1_TEXT
1       POST_TEXT  78         COMMENT_TEXT  15       REPLY2_TEXT
It seems instead I want to make 3 separate queries in sequence (first to the table posts, then to comments, then to replies).
How do I do this?
The “highly redundant” join result is normally the best way, because it is the natural thing in a relational database. Relational databases aim at avoiding redundancy in data storage, but not in query output. Avoiding that redundancy comes at an extra cost: you have to aggregate the data on the server side, and the client probably has to unpack the nested JSON data again.
Here is some sample code that demonstrates how you could aggregate the results:
SELECT postid, post_text,
jsonb_agg(
jsonb_build_object(
'commentid', commentid,
'comment_text', comment_text,
'replies', replies
)
) AS comments
FROM (SELECT postid, post_text, commentid, comment_text,
jsonb_agg(
jsonb_build_object(
'replyid', replyid,
'reply_text', reply_text
)
) AS replies
FROM /* your join */
GROUP BY postid, post_text, commentid, comment_text) AS q
GROUP BY postid, post_text;
The redundant data stems from a cross join of a post's comments and replies, i.e. for each post you join each comment with each reply. Comment 78 relates neither to reply 14 nor to reply 15, but merely to the same post.
The typical approach to select the data would hence be three queries:
select * from posts;
select * from comments;
select * from replies;
You can also reduce this to two queries: join the posts table into the comments query, into the replies query, or both. This again will select some redundant data, but may ease data handling in your app.
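For example, a sketch of one such two-query split (using the schema from the question, for the post with postid = 1):
-- query 1: the post with its comments
SELECT p.postid, p.post_text, c.commentid, c.comment_text
FROM posts p
LEFT JOIN comments c ON c.postid = p.postid
WHERE p.postid = 1;

-- query 2: all replies for that post's comments
SELECT r.replyid, r.commentid, r.reply_text
FROM replies r
JOIN comments c ON c.commentid = r.commentid
WHERE c.postid = 1;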
If you want to avoid the redundancy of joins but must also keep database round trips down, you can glue query results together:
select *
from
(
    select postid as id, postid, 'post' as type, post_text as text from posts
    union all
    select commentid as id, postid, 'comment' as type, comment_text as text from comments
    union all
    select r.replyid as id, c.postid, 'reply' as type, r.reply_text as text
    from replies r
    join comments c on c.commentid = r.commentid -- replies carry no postid, so fetch it via comments
) glued
order by postid, type, id;
Finally, you can create the JSON in your DBMS. Again, don't cross join comments with replies; instead, join the aggregated comments and the aggregated replies to the post.
select p.postid, p.post_text, c.comments, r.replies
from posts p
left join
(
    select
        postid,
        jsonb_agg(jsonb_build_object('commentid', commentid,
                                     'comment_text', comment_text)
        ) as comments
    from comments
    group by postid
) c on c.postid = p.postid
left join
(
    select
        c.postid,
        jsonb_agg(jsonb_build_object('replyid', r.replyid,
                                     'reply_text', r.reply_text)
        ) as replies
    from replies r
    join comments c on c.commentid = r.commentid -- again, postid comes via comments
    group by c.postid
) r on r.postid = p.postid;
Your idea to store things in JSON is a good one if you have something to parse it down the line.
As an alternative to the previous answers that involve JSON, you can also get a normal SQL result set (table definition and sample data are below the query):
WITH MyFilter(postid) AS (
VALUES (1),(2) /* rest of your filter */
)
SELECT 'Post' AS PublicationType, postid, NULL AS CommentID, NULL As ReplyToID, post_text
FROM Posts
WHERE postID IN (SELECT postid from MyFilter)
UNION ALL
SELECT CASE WHEN ReplyToID IS NULL THEN 'Comment' ELSE 'Reply' END, postid, commentid, ReplyToID, comment_text
FROM Comments
WHERE postid IN (SELECT postid from MyFilter)
ORDER BY postid, CommentID NULLS FIRST, ReplyToID NULLS FIRST
Note: the PublicationType column was added for the sake of clarity. You can alternatively inspect CommentID and ReplyToId and see what is null to determine the type of publication.
This should leave you with very little, if any, redundant data to transfer back to the SQL client.
This approach with UNION ALL will work with 3 tables too (you only have to add 1 UNION ALL) but in your case, I would rather go with a 2-table schema:
CREATE TABLE posts (
postid SERIAL primary key,
post_text text NOT NULL
);
CREATE TABLE comments (
commentid SERIAL primary key,
ReplyToID INTEGER NULL REFERENCES Comments(CommentID) /* ON DELETE CASCADE? */,
postid INTEGER NOT NULL references posts(postid) /* ON DELETE CASCADE? */,
comment_text Text NOT NULL
);
INSERT INTO posts(post_text) VALUES ('Post 1'),('Post 2'),('Post 3');
INSERT INTO Comments(postid, comment_text) VALUES (1, 'Comment 1.1'), (1, 'Comment 1.2'), (2, 'Comment 2.1');
INSERT INTO Comments(replytoId, postid, comment_text) VALUES (1, 1, 'Reply to comment 1.1'), (3, 2, 'Reply to comment 2.1');
This makes one fewer table and allows replies at level 2 (replies to replies), or deeper, rather than just replies to comments. A recursive query (there are plenty of samples of that on SO) can always link a reply back to the original comment if you want, as sketched below.
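A sketch of such a recursive query against the two-table schema above (PostgreSQL syntax), attaching to every comment the id of the top-level comment its thread started from:
WITH RECURSIVE thread AS (
    -- anchor: top-level comments are their own root
    SELECT commentid, replytoid, postid, comment_text,
           commentid AS root_commentid
    FROM comments
    WHERE replytoid IS NULL
    UNION ALL
    -- walk down: each reply inherits its parent's root
    SELECT c.commentid, c.replytoid, c.postid, c.comment_text,
           t.root_commentid
    FROM comments c
    JOIN thread t ON c.replytoid = t.commentid
)
SELECT *
FROM thread
ORDER BY postid, root_commentid, commentid;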
Edit: I noticed your comment just a bit late. Of course, no matter what solution you take, there is no need to execute a request to get the replies to each and every comment.
Even with 3 tables, even without JSON, the query to get all the replies for all the comments at once is:
SELECT *
FROM replies
WHERE commentid IN (
SELECT commentid
FROM comments
WHERE postid IN (
/* List your post ids here or nest another SELECT postid FROM posts WHERE ... */
)
)

Optimize query to get rows with highest, filtered count in another table

I'm trying to create the most optimal query where the database would return the names of readers who often borrow sci-fi books. That's what I'm trying to optimize:
SELECT reader.name,
COUNT (CASE WHEN book.status_id = 1 AND book.category_id = 2 THEN 1 END)
FROM reader
JOIN book ON book.reader_id = reader.id
GROUP BY reader.name
ORDER BY COUNT (CASE WHEN book.status_id = 1 AND book.category_id = 2 THEN 1 END) DESC
LIMIT 10;
How can I improve my query, other than by using an INNER JOIN or increasing memory consumption?
This is my ERD diagram:
You could try adding your criteria to the join condition and just using the total count. It really depends on how much data you have, etc.
SELECT reader.name,
COUNT(1) AS COUNTER
FROM reader
JOIN book ON book.reader_id = reader.id
AND book.status_id = 1
AND book.category_id = 2
GROUP BY reader.name
ORDER BY COUNTER DESC
LIMIT 10;
This assumes at least 10 readers pass the criteria (as another answer also silently assumes); otherwise you get fewer than 10 result rows.
Start with the filter. Aggregate & limit before joining to the second table. Much cheaper:
SELECT r.id AS reader_id, r.surname, r.name, b.ct
FROM (
SELECT reader_id, count(*) AS ct
FROM book
WHERE status_id = 1
AND category_id = 2
GROUP BY reader_id
ORDER BY ct DESC, reader_id -- tiebreaker
LIMIT 10
) b
JOIN reader r ON r.id = b.reader_id
ORDER BY b.ct DESC, r.id; -- tiebreaker
A multicolumn index on (status_id, category_id) would help a lot, or an index on just one of the two columns if either predicate is very selective. If performance of this particular query is your paramount objective, have this partial multicolumn index:
CREATE INDEX book_reader_special_idx ON book (reader_id)
WHERE status_id = 1 AND category_id = 2;
Typically, though, you'd vary the query, and then this last index would be too specialized; the more general variant is sketched below.
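A sketch of that general index (the name is illustrative); if this is PostgreSQL, appending reader_id even allows the counts to be computed from the index alone (an index-only scan):
CREATE INDEX book_status_cat_idx ON book (status_id, category_id, reader_id);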
Additional points:
Group by reader_id, which is the primary key (I assume) and guaranteed to be unique - as opposed to reader.name! Your original is likely to return wrong results outright, lumping together distinct readers who share a name; name is just the "first name" from the looks of your ERD.
It's also typically substantially faster to group by an integer than by one varchar(25) (or two of them). But that's secondary; correctness comes first.
Also output surname and reader_id to disambiguate identical names. (Even name & surname are not reliably unique.)
count(*) is faster than count(1), while doing exactly the same thing.
Add a tiebreaker to the ORDER BY clause to get a stable sort order and deterministic results. (Else, the result can be different every time with ties on the count.)

mariadb not using all fields of composite index

MariaDB is not fully using the composite index. The fast select and the slow select both return the same data, but EXPLAIN shows that the slow select uses only the ix_test_relation.entity_id part of the index and not the ix_test_relation.stamp part.
I tried many variants (inner join, WITH, FROM) but couldn't make MariaDB use both fields of the index together with the recursive query. I understand that I need to somehow tell MariaDB to materialize the recursive query.
Please help me optimize the slow select, which uses a recursive query, to reach a speed similar to the fast select.
Some details about the task... I need to query user activity. One user activity record may relate to multiple entities. Entities are hierarchical. I need to query user activity for some parent entity and all its children, for a specified stamp range. Stamp is simplified from TIMESTAMP to BIGINT for demonstration purposes. There can be a lot (1 million) of entities, and each entity may relate to a lot (1 million) of user activity entries. The entity hierarchy is expected to be about 10 levels deep. I assume that the stamp range used reduces the number of user activity records to 10-100. I denormalized the schema and copied stamp from test_entry to test_relation to be able to include it in the test_relation index.
I use 10.4.11-Mariadb-1:10:4.11+maria~bionic.
I can upgrade or patch or whatever mariadb if needed, I have full control over building docker image.
Schema:
CREATE TABLE test_entity(
    id BIGINT NOT NULL,
    parent_id BIGINT NULL,
    CONSTRAINT pk_test_entity PRIMARY KEY (id),
    CONSTRAINT fk_test_entity_pid FOREIGN KEY (parent_id) REFERENCES test_entity(id)
);

CREATE TABLE test_entry(
    id BIGINT NOT NULL,
    name VARCHAR(100) NOT NULL,
    stamp BIGINT NOT NULL,
    CONSTRAINT pk_test_entry PRIMARY KEY (id)
);

CREATE TABLE test_relation(
    entry_id BIGINT NOT NULL,
    entity_id BIGINT NOT NULL,
    stamp BIGINT NOT NULL,
    CONSTRAINT pk_test_relation PRIMARY KEY (entry_id, entity_id),
    CONSTRAINT fk_test_relation_erid FOREIGN KEY (entry_id) REFERENCES test_entry(id),
    CONSTRAINT fk_test_relation_enid FOREIGN KEY (entity_id) REFERENCES test_entity(id)
);

CREATE INDEX ix_test_relation ON test_relation(entity_id, stamp);

CREATE SEQUENCE sq_test_entry;
Test data:
CREATE OR REPLACE PROCEDURE test_insert()
BEGIN
    DECLARE v_entry_id BIGINT;
    DECLARE v_parent_entity_id BIGINT;
    DECLARE v_child_entity_id BIGINT;

    FOR i IN 1..1000 DO
        SET v_parent_entity_id = i * 2;
        SET v_child_entity_id = i * 2 + 1;

        INSERT INTO test_entity(id, parent_id)
        VALUES(v_parent_entity_id, NULL);
        INSERT INTO test_entity(id, parent_id)
        VALUES(v_child_entity_id, v_parent_entity_id);

        FOR j IN 1..1000000 DO
            SELECT NEXT VALUE FOR sq_test_entry
            INTO v_entry_id;

            INSERT INTO test_entry(id, name, stamp)
            VALUES(v_entry_id, CONCAT('entry ', v_entry_id), j);
            INSERT INTO test_relation(entry_id, entity_id, stamp)
            VALUES(v_entry_id, v_parent_entity_id, j);
            INSERT INTO test_relation(entry_id, entity_id, stamp)
            VALUES(v_entry_id, v_child_entity_id, j);
        END FOR;
    END FOR;
END;
CALL test_insert;
Slow select (> 100ms):
SELECT entry_id
FROM test_relation TR
WHERE TR.entity_id IN (
    WITH RECURSIVE recursive_child AS (
        SELECT id
        FROM test_entity
        WHERE id IN (2, 4)
        UNION ALL
        SELECT C.id
        FROM test_entity C
        INNER JOIN recursive_child P
            ON P.id = C.parent_id
    )
    SELECT id
    FROM recursive_child
)
AND TR.stamp BETWEEN 6 AND 8
Fast select (1-2ms):
SELECT entry_id
FROM test_relation TR
WHERE TR.entity_id IN (2,3,4,5)
AND TR.stamp BETWEEN 6 AND 8
UPDATE 1
I can demonstrate the problem with an even shorter example.
First, explicitly store the required entity_id records in a temporary table:
CREATE OR REPLACE TEMPORARY TABLE tbl
WITH RECURSIVE recursive_child AS (
    SELECT id
    FROM test_entity
    WHERE id IN (2, 4)
    UNION ALL
    SELECT C.id
    FROM test_entity C
    INNER JOIN recursive_child P
        ON P.id = C.parent_id
)
SELECT id
FROM recursive_child
Then run the select using the temporary table (below). The select is still slow, yet the only difference from the fast query is that the IN clause now reads from a table instead of inline constants.
SELECT entry_id
FROM test_relation TR
WHERE TR.entity_id IN (SELECT id FROM tbl)
AND TR.stamp BETWEEN 6 AND 8
For your queries (both of them) it looks to me like you should, as you mentioned, flip the column order on your compound index:
CREATE INDEX ix_test_relation ON test_relation(stamp, entity_id);
Why?
Your queries have a range filter, TR.stamp BETWEEN 6 AND 8, on that column. For a range filter to use an index range scan (whether on a TIMESTAMP or a BIGINT column), the column being filtered must come first in the multicolumn index.
You also want a sargable filter, that is, something like this:
TR.stamp >= CURDATE() - INTERVAL 7 DAY
AND TR.stamp < CURDATE()
in place of
DATE(TR.stamp) BETWEEN DATE(NOW() - INTERVAL 7 DAY) AND DATE(NOW())
That is, don't put a function on the column you're scanning in your WHERE clause.
With a multi-part query like your first one, the query planner turns it into several subqueries. You can see this with ANALYZE FORMAT=JSON; see the example below. The planner may choose different indexes, or different parts of indexes, for each of those subqueries.
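For example, running it against the slow query shows how the work is split up (MariaDB 10.1 or later):
ANALYZE FORMAT=JSON
SELECT entry_id
FROM test_relation TR
WHERE TR.entity_id IN (SELECT id FROM tbl)
  AND TR.stamp BETWEEN 6 AND 8;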
And, a word to the wise: don't get too wrapped around the axle trying to outguess the query planner built into the DBMS. It's an extraordinarily complex and highly wrought piece of software, created by decades of programming work by world-class experts in optimization. Our job as MariaDB / MySQL users is to find the right indexes.
The order of columns in a composite index matters. (O.Jones explains it nicely -- using SQL that has been removed from the Question?!)
I would rewrite
SELECT entry_id
FROM test_relation TR
WHERE TR.entity_id IN (SELECT id FROM tbl)
AND TR.stamp BETWEEN 6 AND 8
as
SELECT TR.entry_id
FROM tbl
JOIN test_relation TR ON tbl.id = TR.entity_id
WHERE TR.stamp BETWEEN 6 AND 8
or
SELECT entry_id
FROM test_relation TR
WHERE TR.stamp BETWEEN 6 AND 8
AND EXISTS ( SELECT 1 FROM tbl
WHERE tbl.id = TR.entity_id )
And have these in either case:
TR: INDEX(stamp, entity_id, entry_id) -- With `stamp` first
tbl: INDEX(id) -- maybe
Since tbl is a freshly built TEMPORARY TABLE, and it seems that only 3 rows need checking, it may not be worth adding INDEX(id).
Also needed:
test_entity: INDEX(parent_id, id)
Assuming that test_relation is a many:many mapping table, it is likely that you will also need (though not necessarily for the current query):
INDEX(entity_id, entry_id)

SQL Server : index for finding latest value which is greater than a passed value

I have a table with 4 columns
USER_ID: numeric
EVENT_DATE: date
VERSION: date
SCORE: decimal
I have a clustered index on (USER_ID, EVENT_DATE, VERSION). These three values together are unique.
I need to get the maximum EVENT_DATE for a set of USER_IDs (~1000 different ids) where SCORE is larger than a specific value, considering only entries with a specific VERSION.
SELECT M.*
FROM (VALUES
    ( 5237 ),
    ………1000 more
    ( 27054 ) ) C (USER_ID)
CROSS APPLY
    (SELECT TOP 1 C.USER_ID, M.EVENT_DATE, M.SCORE
     FROM MY_HUGE_TABLE M
     WHERE C.USER_ID = M.USER_ID
       AND M.VERSION = 'xxxx-xx-xx'
       AND M.SCORE > 2 -- removing this filter makes the query ~10x faster (see below)
     ORDER BY M.EVENT_DATE DESC) M
When I execute the query, the runtime is poor, due (I suppose) to a missing index on the SCORE column.
If I delete the filtering on M.SCORE > 2, I get my results ten times faster; however, the latest scores may then be less than 2.
Could anyone please hint at how to set up an index that would improve this query's performance?
Thank you very much in advance
For your query, the optimal index would be on (USER_ID, VERSION, EVENT_DATE DESC, SCORE).
Unfortunately, your clustered index doesn't match. Only USER_ID lines up, because the columns need to match in order: EVENT_DATE and VERSION are swapped. So only USER_ID can help, and that probably doesn't do much to filter the data.
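A sketch of that index as DDL (the index name is illustrative):
CREATE NONCLUSTERED INDEX IX_MY_HUGE_TABLE_User_Version_Date
    ON MY_HUGE_TABLE (USER_ID, VERSION, EVENT_DATE DESC, SCORE);
With USER_ID and VERSION fixed by equality predicates, the rows are already ordered by EVENT_DATE descending, so the TOP 1 can stop at the first row whose SCORE exceeds the threshold.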

Rewriting mysql select to reduce time and writing tmp to disk

I have a MySQL query that's taking several minutes, which isn't very good as it's used to create a web page.
Three tables are used: poster_data contains information on individual posters; poster_categories lists all the categories (movies, art, etc.); poster_prodcat lists the poster id and the categories it can be in, e.g. one poster would have multiple rows for, say, movies, indiana jones, harrison ford, adventure films, etc.
This is the slow query:
select *
from poster_prodcat,
poster_data,
poster_categories
where poster_data.apnumber = poster_prodcat.apnumber
and poster_categories.apcatnum = poster_prodcat.apcatnum
and poster_prodcat.apcatnum='623'
ORDER BY aptitle ASC
LIMIT 0, 32
According to the EXPLAIN, the query was taking a few minutes. poster_data has just over 800,000 rows, while poster_prodcat has just over 17 million. Other category queries with this select are barely noticeable, while poster_prodcat.apcatnum = '623' has about 400,000 results and writes its temporary results out to disk.
hope you find this helpful - http://pastie.org/1105206
drop table if exists poster;
create table poster
(
poster_id int unsigned not null auto_increment primary key,
name varchar(255) not null unique
)
engine = innodb;
drop table if exists category;
create table category
(
cat_id mediumint unsigned not null auto_increment primary key,
name varchar(255) not null unique
)
engine = innodb;
drop table if exists poster_category;
create table poster_category
(
cat_id mediumint unsigned not null,
poster_id int unsigned not null,
primary key (cat_id, poster_id) -- note the clustered composite index !!
)
engine = innodb;
-- FYI http://dev.mysql.com/doc/refman/5.0/en/innodb-index-types.html
select count(*) from category
count(*)
========
500,000
select count(*) from poster
count(*)
========
1,000,000
select count(*) from poster_category
count(*)
========
125,675,688
select count(*) from poster_category where cat_id = 623
count(*)
========
342,820
explain
select
p.*,
c.*
from
poster_category pc
inner join category c on pc.cat_id = c.cat_id
inner join poster p on pc.poster_id = p.poster_id
where
pc.cat_id = 623
order by
p.name
limit 32;
id  select_type  table  type    possible_keys  key      key_len  ref                       rows
==  ===========  =====  ======  =============  =======  =======  ========================  ====
1   SIMPLE       c      const   PRIMARY        PRIMARY  3        const                     1
1   SIMPLE       p      index   PRIMARY        name     257      null                      32
1   SIMPLE       pc     eq_ref  PRIMARY        PRIMARY  7        const,foo_db.p.poster_id  1
select
p.*,
c.*
from
poster_category pc
inner join category c on pc.cat_id = c.cat_id
inner join poster p on pc.poster_id = p.poster_id
where
pc.cat_id = 623
order by
p.name
limit 32;
Statement:21/08/2010
0:00:00.021: Query OK
Is the query you listed how the final query will look? (I.e., will it always have apcatnum = /ID/?)
where poster_data.apnumber = poster_prodcat.apnumber and poster_categories.apcatnum = poster_prodcat.apcatnum and poster_prodcat.apcatnum = '623'
The poster_prodcat.apcatnum = '623' condition will vastly decrease the data set MySQL has to work on, so it should be the first part of the query to be evaluated.
Then reorder the remaining WHERE comparisons so that those which shrink the data set the most are evaluated first.
You may also want to try sub-queries. I'm not sure that will help, but MySQL probably won't fetch all 3 tables first; it will run the sub-query first and then the outer one. This should minimize memory consumption while querying.
Although this is not an option if you really want to select all columns (as you're using a * there).
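A sketch of that sub-query variant (it returns only poster_data columns, per the caveat above):
SELECT *
FROM poster_data
WHERE apnumber IN (SELECT apnumber
                   FROM poster_prodcat
                   WHERE apcatnum = '623')
ORDER BY aptitle ASC
LIMIT 0, 32;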
You need to have an index on apnumber in POSTER_DATA. Scanning 841,152 records is killing the performance.
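For example (the index name is illustrative):
create index data_apnumber_idx on poster_data (apnumber);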
Looks like the query is using the aptitle index to get the ordering, but it is doing a full scan to filter the results. I think it might help to have a composite index across both aptitle and apnumber on poster_data. MySQL might then be able to use it to satisfy both the sort order and the filter.
create index data_title_anum_idx on poster_data(aptitle,apnumber);