Selecting multiple "most recent by timestamp" in mysql - sql

I have a table containing log entries for various servers. I need to create a view with the most recent (by time) log entry for each idServer.
mysql> describe serverLog;
+----------+-----------+------+-----+-------------------+----------------+
| Field | Type | Null | Key | Default | Extra |
+----------+-----------+------+-----+-------------------+----------------+
| idLog | int(11) | NO | PRI | NULL | auto_increment |
| idServer | int(11) | NO | MUL | NULL | |
| time | timestamp | NO | | CURRENT_TIMESTAMP | |
| text | text | NO | | NULL | |
+----------+-----------+------+-----+-------------------+----------------+
mysql> select * from serverLog;
+-------+----------+---------------------+------------+
| idLog | idServer | time | text |
+-------+----------+---------------------+------------+
| 1 | 1 | 2009-12-01 15:50:27 | log line 2 |
| 2 | 1 | 2009-12-01 15:50:32 | log line 1 |
| 3 | 3 | 2009-12-01 15:51:43 | log line 3 |
| 4 | 1 | 2009-12-01 10:20:30 | log line 0 |
+-------+----------+---------------------+------------+
What makes this difficult (for me) is:
Entries for earlier dates/times may be inserted later, so I can't just rely on idLog.
timestamps are not unique, so I need to use idLog as a tiebreaker for "latest".
I can get the result I want using a subquery, but I can't put a subquery into a view. Also, I hear that subquery performance sucks in MySQL.
mysql> SELECT * FROM (
SELECT * FROM serverLog ORDER BY time DESC, idLog DESC
) q GROUP BY idServer;
+-------+----------+---------------------+------------+
| idLog | idServer | time | text |
+-------+----------+---------------------+------------+
| 2 | 1 | 2009-12-01 15:50:32 | log line 1 |
| 3 | 3 | 2009-12-01 15:51:43 | log line 3 |
+-------+----------+---------------------+------------+
What is the correct way to write my view?

I recommend using:
CREATE OR REPLACE VIEW vw_your_view AS
SELECT t.*
FROM SERVERLOG t
JOIN (SELECT sl.idserver,
MAX(sl.time) 'max_time'
FROM SERVERLOG sl
GROUP BY sl.idserver) x ON x.idserver = t.idserver
AND x.max_time = t.time
Never define an ORDER BY in a VIEW, because there's no guarantee that the order you specify is needed for every time you use the view.

Related

SQL problem Code problem about removing the spam

Table: Actions
+---------------+---------+
| Column Name | Type |
+---------------+---------+
| user_id | int |
| post_id | int |
| action_date | date |
| action | enum |
| extra | varchar |
+---------------+---------+
There is no primary key for this table
It may have duplicate rows.
The action column is an ENUM type of ('view', 'like', 'reaction', 'comment', 'report', 'share').
The extra column has optional information about the action such as a reason for report or a type of reaction.
Example data:
+---------+---------+-------------+--------+--------+
| user_id | post_id | action_date | action | extra |
+---------+---------+-------------+--------+--------+
| 1 | 1 | 2019-07-01 | view | null |
| 1 | 1 | 2019-07-01 | like | null |
| 1 | 1 | 2019-07-01 | share | null |
| 2 | 2 | 2019-07-04 | view | null |
| 2 | 2 | 2019-07-04 | report | spam |
| 3 | 4 | 2019-07-04 | view | null |
| 3 | 4 | 2019-07-04 | report | spam |
| 4 | 3 | 2019-07-02 | view | null |
| 4 | 3 | 2019-07-02 | report | spam |
| 5 | 2 | 2019-07-03 | view | null |
| 5 | 2 | 2019-07-03 | report | racism |
| 5 | 5 | 2019-07-03 | view | null |
| 5 | 5 | 2019-07-03 | report | racism |
+---------+---------+-------------+--------+--------+
Table: Removals
+---------------+---------+
| Column Name | Type |
+---------------+---------+
| post_id | int |
| remove_date | date |
+---------------+---------+
post_id is the primary key of this table.
Each row in this table indicates that some post was removed as a result of being reported or as a result of an admin review.
Example data:
+---------+-------------+
| post_id | remove_date |
+---------+-------------+
| 2 | 2019-07-20 |
| 3 | 2019-07-18 |
+---------+-------------+
Write an SQL query to find the average for daily percentage of posts that got removed after being reported as spam, rounded to 2 decimal places.
Example results:
+-----------------------+
| average_daily_percent |
+-----------------------+
| 75.00 |
+-----------------------+
The percentage for 2019-07-04 is 50% because only one post of two spam reported posts was removed.
The percentage for 2019-07-02 is 100% because one post was reported as spam and it was removed.
The other days had no spam reports so the average is (50 + 100) / 2 = 75%
Note that the output is only one number and that we do not care about the remove dates.
My code is below:
select round(avg(100*a.rem/b.num),2) as 'average_daily_percent' from
(select count(post_id) as 'rem' from actions
where post_id in (select post_id from removals) and extra= 'spam'
group by action_date) a,
(select count(post_id) as 'num' from actions
where action = 'report' and extra = 'spam'
group by action_date) b
May I know why my code is wrong? Thank you!
Here's an approach using MySQL.
WITH daily_spam_counts AS
(SELECT A.`action_date`,
COUNT(*) AS num_posts,
SUM(CASE WHEN R.`remove_date` IS NOT NULL THEN 1 ELSE 0 END) AS num_posts_removed
FROM Actions A
LEFT OUTER JOIN Removals R ON A.`post_id` = R.`post_id`
WHERE A.`action` = 'report'
AND A.`extra` = 'spam'
GROUP BY A.`action_date`
)
SELECT ROUND(AVG(CAST(100 AS float) * Num_posts_removed / num_posts),2) AS average_daily_percent
FROM daily_spam_counts;
The first thing it does (within the CTE) is to calculate, for each day,
how many posts are flagged as spam
how many of those posts have been removed
The result of the CTE (with your data) is a virtual table as follows
action_date num_posts num_posts_removed
2019-07-02 1 1
2019-07-04 2 1
The second part simply does an average calculation of those across all the relevant days.
Here's a db<>fiddle with it running; note that I added an extra line of spam in the data so that the percentage becomes something with decimal points.
I also wrote this in SQL Server first (then converted to MySQL) so here's a db<>fiddle in SQL Server if you prefer that.

How to determine the name of an Impala object corresponds to a view

Is there a way in Impala to determine whether an object name returned by SHOW TABLES corresponds to a table or a view since:
this statement only return the object names, without their type
SHOW CREATE VIEW is just an alias for SHOW CREATE TABLE (same result, no view/table distinction)
DESCRIBE does not give any clue about the type of the item
Ideally I'd like to list all the tables + views and their types using a single operation, not one to retrieve the tables + views and then another call for each name to determine the type of the object.
(please note the question is about Impala, not Hive)
You can use describe formatted to know the type of an object
impala-shell> CREATE TABLE table2(
id INT,
name STRING
);
impala-shell> CREATE VIEW view2 AS SELECT * FROM table2;
impala-shell> DESCRIBE FORMATTED table2;
+------------------------------+--------------------------------------------------------------------+----------------------+
| name | type | comment |
+------------------------------+--------------------------------------------------------------------+----------------------+
| Retention: | 0 | NULL |
| Location: | hdfs://quickstart.cloudera:8020/user/hive/warehouse/test.db/table2 | NULL |
| Table Type: | MANAGED_TABLE | NULL |
+------------------------------+--------------------------------------------------------------------+----------------------+
impala-shell> DESCRIBE FORMATTED view2;
+------------------------------+-------------------------------+----------------------+
| name | type | comment |
+------------------------------+-------------------------------+----------------------+
| Protect Mode: | None | NULL |
| Retention: | 0 | NULL |
| Table Type: | VIRTUAL_VIEW | NULL |
| Table Parameters: | NULL | NULL |
| | transient_lastDdlTime | 1601632695 |
| | NULL | NULL |
| # Storage Information | NULL | NULL |
+------------------------------+-------------------------------+----------------------+
In the case of the table type is Table Type: MANAGED_TABLE and for the view is Table Type: VIRTUAL_VIEW
Other way is querying metastore database (if you can) to know about metadata in Impala(or Hive)
mysql> use metastore;
mysql> select * from TBLS;
+--------+-------------+-------+------------------+-----------+-----------+-------+----------+---------------+------------------------------------------------------------+---------------------------+----------------+
| TBL_ID | CREATE_TIME | DB_ID | LAST_ACCESS_TIME | OWNER | RETENTION | SD_ID | TBL_NAME | TBL_TYPE | VIEW_EXPANDED_TEXT | VIEW_ORIGINAL_TEXT | LINK_TARGET_ID |
+--------+-------------+-------+------------------+-----------+-----------+-------+----------+---------------+------------------------------------------------------------+---------------------------+----------------+
| 9651 | 1601631971 | 9331 | 0 | anonymous | 0 | 27996 | table1 | MANAGED_TABLE | NULL | NULL | NULL |
| 9652 | 1601632121 | 9331 | 0 | anonymous | 0 | 27997 | view1 | VIRTUAL_VIEW | SELECT `table1`.`id`, `table1`.`name` FROM `test`.`table1` | SELECT * FROM table1 | NULL |
| 9653 | 1601632676 | 9331 | 0 | cloudera | 0 | 27998 | table2 | MANAGED_TABLE | NULL | NULL | NULL |
| 9654 | 1601632695 | 9331 | 0 | cloudera | 0 | 27999 | view2 | VIRTUAL_VIEW | SELECT * FROM test.table2 | SELECT * FROM test.table2 | NULL |
+--------+-------------+-------+------------------+-----------+-----------+-------+----------+---------------+------------------------------------------------------------+---------------------------+----------------+

optimizing an sql query using inner join and order by

I'm trying to optimize the following query without success. Any idea where it could be indexed to prevent the temporary table and the filesort?
EXPLAIN SELECT SQL_NO_CACHE `groups`.*
FROM `groups`
INNER JOIN `memberships` ON `groups`.id = `memberships`.group_id
WHERE ((`memberships`.user_id = 1)
AND (`memberships`.`status_code` = 1 AND `memberships`.`manager` = 0))
ORDER BY groups.created_at DESC LIMIT 5;`
+----+-------------+-------------+--------+--------------------------+---------+---------+---------------------------------------------+------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------------+--------+--------------------------+---------+---------+---------------------------------------------+------+----------------------------------------------+
| 1 | SIMPLE | memberships | ref | grp_usr,grp,usr,grp_mngr | usr | 5 | const | 5 | Using where; Using temporary; Using filesort |
| 1 | SIMPLE | groups | eq_ref | PRIMARY | PRIMARY | 4 | sportspool_development.memberships.group_id | 1 | |
+----+-------------+-------------+--------+--------------------------+---------+---------+---------------------------------------------+------+----------------------------------------------+
2 rows in set (0.00 sec)
+--------+------------+-----------------------------------+--------------+-----------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+--------+------------+-----------------------------------+--------------+-----------------+-----------+-------------+----------+--------+------+------------+---------+
| groups | 0 | PRIMARY | 1 | id | A | 6 | NULL | NULL | | BTREE | |
| groups | 1 | index_groups_on_name | 1 | name | A | 6 | NULL | NULL | YES | BTREE | |
| groups | 1 | index_groups_on_privacy_setting | 1 | privacy_setting | A | 6 | NULL | NULL | YES | BTREE | |
| groups | 1 | index_groups_on_created_at | 1 | created_at | A | 6 | NULL | NULL | YES | BTREE | |
| groups | 1 | index_groups_on_id_and_created_at | 1 | id | A | 6 | NULL | NULL | | BTREE | |
| groups | 1 | index_groups_on_id_and_created_at | 2 | created_at | A | 6 | NULL | NULL | YES | BTREE | |
+--------+------------+-----------------------------------+--------------+-----------------+-----------+-------------+----------+--------+------+------------+---------+
+-------------+------------+----------------------------------------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+-------------+------------+----------------------------------------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| memberships | 0 | PRIMARY | 1 | id | A | 2 | NULL | NULL | | BTREE | |
| memberships | 0 | grp_usr | 1 | group_id | A | 2 | NULL | NULL | YES | BTREE | |
| memberships | 0 | grp_usr | 2 | user_id | A | 2 | NULL | NULL | YES | BTREE | |
| memberships | 1 | grp | 1 | group_id | A | 2 | NULL | NULL | YES | BTREE | |
| memberships | 1 | usr | 1 | user_id | A | 2 | NULL | NULL | YES | BTREE | |
| memberships | 1 | grp_mngr | 1 | group_id | A | 2 | NULL | NULL | YES | BTREE | |
| memberships | 1 | grp_mngr | 2 | manager | A | 2 | NULL | NULL | YES | BTREE | |
| memberships | 1 | complex_index | 1 | group_id | A | 2 | NULL | NULL | YES | BTREE | |
| memberships | 1 | complex_index | 2 | user_id | A | 2 | NULL | NULL | YES | BTREE | |
| memberships | 1 | complex_index | 3 | status_code | A | 2 | NULL | NULL | YES | BTREE | |
| memberships | 1 | complex_index | 4 | manager | A | 2 | NULL | NULL | YES | BTREE | |
| memberships | 1 | index_memberships_on_user_id_and_status_code_and_manager | 1 | user_id | A | 2 | NULL | NULL | YES | BTREE | |
| memberships | 1 | index_memberships_on_user_id_and_status_code_and_manager | 2 | status_code | A | 2 | NULL | NULL | YES | BTREE | |
| memberships | 1 | index_memberships_on_user_id_and_status_code_and_manager | 3 | manager | A | 2 | NULL | NULL | YES | BTREE | |
+-------------+------------+----------------------------------------------------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
Index on memberships column user_id (should already have one if it's PK)
Index on memberships columns status_code and manager (both of them on same index)
Index on groups column created_at (with default DESC if possible, I don't know if you can in mySql)
This is what I would do in a MS SQL Server but I guess same optimization can be used in mySql too.
Do you have all the "obvious" single-column indexes on the fields you join on, the fields in your where clause and the created_at field you order by?
The trouble is you need an index on groups in order to eliminate the filesort, but all of your where conditions are on memberships.
Try adding an index on groups on (id, created_at).
If that doesn't work, try tricking the optimizer like so using a subquery (keeping the aforementioned index on groups):
SELECT SQL_NO_CACHE `groups`.*
FROM `groups`
INNER JOIN (select group_id from `memberships`
WHERE
`memberships`.user_id = 1
AND `memberships`.`status_code` = 1
AND `memberships`.`manager` = 0
) m on m.group_id=`groups`.id
ORDER BY groups.created_at DESC LIMIT 5;
There should be an index on at least membershipships.user_id, but you could also gain some benefit from an index like (user_id, status, manager). I assume status and manager are flags that don't have a large range of possible values so it isn't that important as long as there's an index on user_id.
A (user_id, status_code, manager) (in any order) index on memberships would help.
Avoiding the sort would be difficult, because then you have to start the join in the groups table, which means you can't use all the (presumably pretty selective) where clauses that reference the memberships table until it is too late.
Thanks for posting details about the indexes you're using.
I've tested this, and tried omitting some of the indexes. The most index is memberships.complex_index which serves as a covering index. This allows the query to achieve its results by reading only the index; it doesn't have to read data rows for memberships at all.
None of the indexes on groups make any difference. It appears to use a filesort no matter what. Filesort in MySQL simply means it's doing a table-scan, which in some cases can be less costly than using an index. For instance, if every row of groups needs to be read to produce the result of the query anyway, why bother doing an unnecessary double-lookup by using an index? The optimizer can sense these cases and can appropriately refuse to use an index.
So aside from primary key indexes and the complex_index, I'd drop all the others, since they're not helping and can only add to the cost of maintaining these tables.

INSERT INTO..SELECT..ON DUPLICATE KEYS ambiguous ids

I have the following table:
mysql> SELECT * FROM `bright_promotion_earnings`;
+----+----------+------------+----------+-------+
| id | promoter | generation | turnover | payed |
+----+----------+------------+----------+-------+
| 1 | 4 | 1 | 10 | 0 |
| 3 | 4 | 5 | 100 | 0 |
| 4 | 4 | 3 | 10000 | 1 |
| 5 | 4 | 3 | 200 | 0 |
+----+----------+------------+----------+-------+
4 rows in set (0.00 sec)
There is one unique key(promoter, generation, payed):
+---------------------------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| Table | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment |
+---------------------------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
| bright_promotion_earnings | 0 | promoter_2 | 1 | promoter | A | 2 | NULL | NULL | YES | BTREE | |
| bright_promotion_earnings | 0 | promoter_2 | 2 | generation | A | 4 | NULL | NULL | | BTREE | |
| bright_promotion_earnings | 0 | promoter_2 | 3 | payed | A | 4 | NULL | NULL | | BTREE | |
+---------------------------+------------+------------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+
3 rows in set (0.00 sec)
Now I want to mark every earning for a promoter as paid by updating the same entry with paid=1 (if it exists).
So if I wanted to mark the earnings of promoter 4 as paid this is what the table should look like:
+----+----------+------------+----------+-------+
| id | promoter | generation | turnover | payed |
+----+----------+------------+----------+-------+
| 4 | 4 | 3 | 10200 | 1 |
| 6 | 4 | 5 | 100 | 1 |
| 7 | 4 | 1 | 10 | 1 |
+----+----------+------------+----------+-------+
3 rows in set (0.00 sec)
This is my current approach(without the DELETE which is trivial):
INSERT INTO
bright_promotion_earnings
(
promoter,
generation,
turnover,
payed
)
SELECT
commission.promoter,
commission.generation,
commission.turnover as turnover2,
'1' as payed
FROM
bright_promotion_earnings as commission
WHERE
promoter=4
AND payed=0
ON DUPLICATE KEY UPDATE turnover=turnover+turnover2;
But mysql keeps telling me that turnover is ambiguous:
#1052 - Column 'turnover' in field list is ambiguous
Does anybody have a hint as I can't alias the table that I'm inserting to.
How can I give the table I'm inserting to a name so that mysql can identify the column?
Thanks in advance.
you have a turnover field in both tables, so mysql can't decide which one you mean at the last row.

LIMIT the return of non-unique values

I have two tables. Posts and Replies. Think of posts as a blog entry while replies are the comments.
I want to display X number of posts and then the latest three comments for each of the posts.
My replies has a foreign key "post_id" which matches the "id" of every post.
I am trying to create a main page that has something along the lines of
Post
--Reply
--Reply
--Reply
Post
--Reply
so on and so fourth. I can accomplish this by using a for loop in my template and discarding the unneeded replies but I hate grabbing data from a db I won't use. Any ideas?
This is actually a pretty interesting question.
HA HA DISREGARD THIS, I SUCK
On edit: this answer works, but on MySQL it becomes tediously slow when the number of parent rows is as few as 100. However, see below for a performant fix.
Obviously, you can run this query once per post: select * from comments where id = $id limit 3 That creates a lot of overhead, as you end up doing one database query per post, the dreaded N+1 queries.
If you want to get all posts at once (or some subset with a where) the following will surprisingly work. It assumes that comments have a monotonically increasing id (as a datetime is not guaranteed to be unique), but allows for comment ids to be interleaved among posts.
Since an auto_increment id column is monotonically increasing, if comment has an id, you're all set.
First, create this view. In the view, I call post parent and comment child:
create view parent_top_3_children as
select a.*,
(select max(id) from child where parent_id = a.id) as maxid,
(select max(id) from child where id < maxid
and parent_id = a.id) as maxidm1,
(select max(id) from child where id < maxidm1
and parent_id = a.id) as maxidm2
from parent a;
maxidm1 is just "max id minus 1"; maxidm2, "max id minus 2" -- that is, the second and third greatest child ids within a particular parent id.
Then join the view to whatever you need from the comment (I'll call that text):
select a.*,
b.text as latest_comment,
c.text as second_latest_comment,
d.text as third_latest_comment
from parent_top_3_children a
left outer join child b on (b.id = a.maxid)
left outer join child c on (c.id = a.maxidm1)
left outer join child d on (c.id = a.maxidm2);
Naturally, you can add whatever where clause you want to that, to limit the posts: where a.category = 'foo' or whatever.
Here's what my tables look like:
mysql> select * from parent;
+----+------+------+------+
| id | a | b | c |
+----+------+------+------+
| 1 | 1 | 1 | NULL |
| 2 | 2 | 2 | NULL |
| 3 | 3 | 3 | NULL |
+----+------+------+------+
3 rows in set (0.00 sec)
And a portion of child. Parent 1 has noo children:
mysql> select * from child;
+----+-----------+------+------+------+------+
| id | parent_id | a | b | c | d |
+----+-----------+------+------+------+------+
. . . .
| 18 | 3 | NULL | NULL | NULL | NULL |
| 19 | 2 | NULL | NULL | NULL | NULL |
| 20 | 2 | NULL | NULL | NULL | NULL |
| 21 | 3 | NULL | NULL | NULL | NULL |
| 22 | 2 | NULL | NULL | NULL | NULL |
| 23 | 2 | NULL | NULL | NULL | NULL |
| 24 | 3 | NULL | NULL | NULL | NULL |
| 25 | 2 | NULL | NULL | NULL | NULL |
+----+-----------+------+------+------+------+
24 rows in set (0.00 sec)
And the view gives us this:
mysql> select * from parent_top_3;
+----+------+------+------+-------+---------+---------+
| id | a | b | c | maxid | maxidm1 | maxidm2 |
+----+------+------+------+-------+---------+---------+
| 1 | 1 | 1 | NULL | NULL | NULL | NULL |
| 2 | 2 | 2 | NULL | 25 | 23 | 22 |
| 3 | 3 | 3 | NULL | 24 | 21 | 18 |
+----+------+------+------+-------+---------+---------+
3 rows in set (0.21 sec)
The explain plan for the view is only slightly hairy:
mysql> explain select * from parent_top_3;
+----+--------------------+------------+------+---------------+------+---------+------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+------------+------+---------------+------+---------+------+------+-------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 3 | |
| 2 | DERIVED | a | ALL | NULL | NULL | NULL | NULL | 3 | |
| 5 | DEPENDENT SUBQUERY | child | ALL | PRIMARY | NULL | NULL | NULL | 24 | Using where |
| 4 | DEPENDENT SUBQUERY | child | ALL | PRIMARY | NULL | NULL | NULL | 24 | Using where |
| 3 | DEPENDENT SUBQUERY | child | ALL | NULL | NULL | NULL | NULL | 24 | Using where |
+----+--------------------+------------+------+---------------+------+---------+------+------+-------------+
However, if we add an index for parent_fks,it gets a better:
mysql> create index pid on child(parent_id);
mysql> explain select * from parent_top_3;
+----+--------------------+------------+------+---------------+------+---------+-----------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+------------+------+---------------+------+---------+-----------+------+-------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 3 | |
| 2 | DERIVED | a | ALL | NULL | NULL | NULL | NULL | 3 | |
| 5 | DEPENDENT SUBQUERY | child | ref | PRIMARY,pid | pid | 5 | util.a.id | 2 | Using where |
| 4 | DEPENDENT SUBQUERY | child | ref | PRIMARY,pid | pid | 5 | util.a.id | 2 | Using where |
| 3 | DEPENDENT SUBQUERY | child | ref | pid | pid | 5 | util.a.id | 2 | Using where |
+----+--------------------+------------+------+---------------+------+---------+-----------+------+-------------+
5 rows in set (0.04 sec)
As noted above, this begins to fall apart when the number of parent rows is few as 100, even if we index into parent using its primary key:
mysql> select * from parent_top_3 where id < 10;
+----+------+------+------+-------+---------+---------+
| id | a | b | c | maxid | maxidm1 | maxidm2 |
+----+------+------+------+-------+---------+---------+
| 1 | 1 | 1 | NULL | NULL | NULL | NULL |
| 2 | 2 | 2 | NULL | 25 | 23 | 22 |
| 3 | 3 | 3 | NULL | 24 | 21 | 18 |
| 4 | NULL | 1 | NULL | 65 | 64 | 63 |
| 5 | NULL | 2 | NULL | 73 | 72 | 71 |
| 6 | NULL | 3 | NULL | 113 | 112 | 111 |
| 7 | NULL | 1 | NULL | 209 | 208 | 207 |
| 8 | NULL | 2 | NULL | 401 | 400 | 399 |
| 9 | NULL | 3 | NULL | 785 | 784 | 783 |
+----+------+------+------+-------+---------+---------+
9 rows in set (1 min 3.11 sec)
(Note that I intentionally test on a slow machine, with data saved on a slow flash disk.)
Here's the explain, looking for exactly one id (and the first one, at that):
mysql> explain select * from parent_top_3 where id = 1;
+----+--------------------+------------+------+---------------+------+---------+-----------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+------------+------+---------------+------+---------+-----------+------+-------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 1000 | Using where |
| 2 | DERIVED | a | ALL | NULL | NULL | NULL | NULL | 1000 | |
| 5 | DEPENDENT SUBQUERY | child | ref | PRIMARY,pid | pid | 5 | util.a.id | 179 | Using where |
| 4 | DEPENDENT SUBQUERY | child | ref | PRIMARY,pid | pid | 5 | util.a.id | 179 | Using where |
| 3 | DEPENDENT SUBQUERY | child | ref | pid | pid | 5 | util.a.id | 179 | Using where |
+----+--------------------+------------+------+---------------+------+---------+-----------+------+-------------+
5 rows in set (56.01 sec)
Over 56 seconds for one row, even on my slow machine, is two orders of magnitude unacceptable.
So can we save this query? It works, it's just too slow.
Here's the explain plan for the modified query. It looks as bad or worse:
mysql> explain select * from parent_top_3a where id = 1;
+----+--------------------+------------+------+---------------+------+---------+-----------+------+-------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+------------+------+---------------+------+---------+-----------+------+-------------+
| 1 | PRIMARY | <derived2> | ALL | NULL | NULL | NULL | NULL | 100 | Using where |
| 2 | DERIVED | <derived4> | ALL | NULL | NULL | NULL | NULL | 100 | |
| 4 | DERIVED | <derived6> | ALL | NULL | NULL | NULL | NULL | 100 | |
| 6 | DERIVED | a | ALL | NULL | NULL | NULL | NULL | 100 | |
| 7 | DEPENDENT SUBQUERY | child | ref | pid | pid | 5 | util.a.id | 179 | Using where |
| 5 | DEPENDENT SUBQUERY | child | ref | PRIMARY,pid | pid | 5 | a.id | 179 | Using where |
| 3 | DEPENDENT SUBQUERY | child | ref | PRIMARY,pid | pid | 5 | a.id | 179 | Using where |
+----+--------------------+------------+------+---------------+------+---------+-----------+------+-------------+
7 rows in set (0.05 sec)
But it completes three orders of magnitude faster, in 1/20th of a second!
How do we get to the much speedier parent_top_3a? We create three views, each one dependent on the previous one:
create view parent_top_1 as
select a.*,
(select max(id) from child where parent_id = a.id)
as maxid
from parent a;
create view parent_top_2 as
select a.*,
(select max(id) from child where parent_id = a.id and id < a.maxid)
as maxidm1
from parent_top_1 a;
create view parent_top_3a as
select a.*,
(select max(id) from child where parent_id = a.id and id < a.maxidm1)
as maxidm2
from parent_top_2 a;
Not only does this work much more quickly, it's legal on RDBMSes other than MySQL.
Let's increase the number of parent rows to 12800, the number of child rows to 1536 (most blog posts don't get comments, right? ;) )
mysql> select * from parent_top_3a where id >= 20 and id < 40;
+----+------+------+------+-------+---------+---------+
| id | a | b | c | maxid | maxidm1 | maxidm2 |
+----+------+------+------+-------+---------+---------+
| 39 | NULL | 2 | NULL | NULL | NULL | NULL |
| 38 | NULL | 1 | NULL | NULL | NULL | NULL |
| 37 | NULL | 3 | NULL | NULL | NULL | NULL |
| 36 | NULL | 2 | NULL | NULL | NULL | NULL |
| 35 | NULL | 1 | NULL | NULL | NULL | NULL |
| 34 | NULL | 3 | NULL | NULL | NULL | NULL |
| 33 | NULL | 2 | NULL | NULL | NULL | NULL |
| 32 | NULL | 1 | NULL | NULL | NULL | NULL |
| 31 | NULL | 3 | NULL | NULL | NULL | NULL |
| 30 | NULL | 2 | NULL | 1537 | 1536 | 1535 |
| 29 | NULL | 1 | NULL | 1529 | 1528 | 1527 |
| 28 | NULL | 3 | NULL | 1513 | 1512 | 1511 |
| 27 | NULL | 2 | NULL | 1505 | 1504 | 1503 |
| 26 | NULL | 1 | NULL | 1481 | 1480 | 1479 |
| 25 | NULL | 3 | NULL | 1457 | 1456 | 1455 |
| 24 | NULL | 2 | NULL | 1425 | 1424 | 1423 |
| 23 | NULL | 1 | NULL | 1377 | 1376 | 1375 |
| 22 | NULL | 3 | NULL | 1329 | 1328 | 1327 |
| 21 | NULL | 2 | NULL | 1281 | 1280 | 1279 |
| 20 | NULL | 1 | NULL | 1225 | 1224 | 1223 |
+----+------+------+------+-------+---------+---------+
20 rows in set (1.01 sec)
Note that these timings are for MyIsam tables; I'll leave it to someone else to do timings on Innodb.
But using Postgresql, on a similar but not identical data set, we get similar timings on where predicates involving parent's columns:
postgres=# select (select count(*) from parent) as parent_count, (select count(*)
from child) as child_count;
parent_count | child_count
--------------+-------------
12289 | 1536
postgres=# select * from parent_top_3a where id >= 20 and id < 40;
id | a | b | c | maxid | maxidm1 | maxidm2
----+---+----+---+-------+---------+---------
20 | | 18 | | 1464 | 1462 | 1461
21 | | 88 | | 1463 | 1460 | 1457
22 | | 72 | | 1488 | 1486 | 1485
23 | | 13 | | 1512 | 1510 | 1509
24 | | 49 | | 1560 | 1558 | 1557
25 | | 92 | | 1559 | 1556 | 1553
26 | | 45 | | 1584 | 1582 | 1581
27 | | 37 | | 1608 | 1606 | 1605
28 | | 96 | | 1607 | 1604 | 1601
29 | | 90 | | 1632 | 1630 | 1629
30 | | 53 | | 1631 | 1628 | 1625
31 | | 57 | | | |
32 | | 64 | | | |
33 | | 79 | | | |
34 | | 37 | | | |
35 | | 60 | | | |
36 | | 75 | | | |
37 | | 34 | | | |
38 | | 87 | | | |
39 | | 43 | | | |
(20 rows)
Time: 91.139 ms
Sounds like you just want the LIMIT clause for a SELECT statement:
SELECT comment_text, other_stuff FROM comments WHERE post_id = POSTID ORDER BY comment_time DESC LIMIT 3;
You'll have to run this query once per post you want to show comments for. There are a few ways to get around that, if you're willing to sacrifice maintainability and your sanity in the Quest for Ultimate Performance:
As above, one query per post to retrieve comments. Simple, but probably not all that fast.
Retrieve a list of post_ids that you want to show comments for, then retrieve all comments for those posts, and filter them client-side (or you could do it server-side if you had windowing functions, I think, though those aren't in MySQL). Simple on the server side, but the client-side filtering will be ugly, and you're still moving a lot of data from server to client, so this probably won't be all that fast either.
As #1, but use an unholy UNION ALL of as many queries as you have posts to display, so you're running one abominable query instead of N small ones. Ugly, but it'll be faster than options 1 or 2. You'll still have to do a bit of filtering client-side, but careful writing of the UNION will make that much easier than the filtering required for #2, and no wasted data will be sent over the wire. It'll make for an ugly query, though.
Join the posts and comments table, partially pivoting the comments. This is pretty clean if you only need one comment, but if you want three it'll get messy quickly. Great on the client side, but even worse SQL than #3, and probably harder for the server, to boot.
At the end of the day, I'd go with option 1, the simple query above, and not worry about the overhead of doing it once per post. If you only needed one comment, then the join option might be acceptable, but you want three and that rules it out. If windowing functions ever get added to MySQL (they're in release 8.4 of PostgreSQL), option 2 might become palatable or even preferable. Until that day, though, just pick the simple, easy-to-understand query.
Although there may be a clever way to get this in one query with no schema changes I'm guessing it wouldn't be performant anyway. Edit: Looks like tpdi has the clever solution. It looks potentially pretty fast, but I'd be curious to see a benchmark on specific databases.
Given the constraints of high performance and minimal data transfer I have two suggestions.
Solution with no schema changes or maintenance
First:
SELECT * FROM Posts
Collect the ids, then:
SELECT id FROM Replies WHERE post_id IN (?) ORDER BY id DESC
Finally, loop through those ids, grabbing only the first 3 for each post_id, then do:
SELECT * FROM Replies WHERE post_id IN (?)
More efficient solution if you are willing to maintain a few cache columns
The second solution is assuming that there are far more reads than writes, you can minimize lookups by storing the last three comment ids on the Posts table every time you add a Reply. In that case you would simply add three columns last_reply_id, second_reply_id, third_reply_id or some such. Then you can look up with either two queries like:
SELECT * FROM Posts
Collect the ids from those fields, then:
SELECT * FROM Replies WHERE post_id IN (?)
If you have those fields you could also manually construct a triple join, which would get the data in one query, although the field list would be quite verbose. Something like
SELECT posts.*, r1.title, r2.title ... FROM Posts
LEFT JOIN Replies as r1 ON Posts.last_reply_id = Replies.id
LEFT JOIN Replies as r2 ON Posts.second_reply_id = Replies.id
...
Which you prefer probably depends on your ORM or language.