MySQL GROUP BY optimization

This question is a more specific version of a previous question I asked.
Table
CREATE TABLE Test4_ClusterMatches
(
`match_index` INT UNSIGNED,
`cluster_index` INT UNSIGNED,
`id` INT NOT NULL AUTO_INCREMENT,
`tfidf` FLOAT,
PRIMARY KEY (`cluster_index`,`match_index`,`id`)
);
The query I want to run
mysql> explain SELECT `match_index`, SUM(`tfidf`) AS total
FROM Test4_ClusterMatches WHERE `cluster_index` IN (1,2,3 ... 3000)
GROUP BY `match_index`;
The Problem with the query
It uses temporary and filesort, so it's too slow:
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+-----------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+-----------------------------------------------------------+
| 1 | SIMPLE | Test4_ClusterMatches | range | PRIMARY | PRIMARY | 4 | NULL | 51540 | Using where; Using index; Using temporary; Using filesort |
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+-----------------------------------------------------------+
With the current indexing the query would need to sort by cluster_index first to eliminate the use of temporary and filesort, but doing so gives the wrong results for sum(tfidf).
Changing the primary key to
PRIMARY KEY (`match_index`,`cluster_index`,`id`)
Doesn't use filesort or temporary tables, but it examines 14,932,441 rows, so it is also too slow:
+----+-------------+----------------------+-------+---------------+---------+---------+------+----------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------------+-------+---------------+---------+---------+------+----------+--------------------------+
| 1 | SIMPLE | Test5_ClusterMatches | index | NULL | PRIMARY | 16 | NULL | 14932441 | Using where; Using index |
+----+-------------+----------------------+-------+---------------+---------+---------+------+----------+--------------------------+
Tight Index Scan
Running the search for just one cluster_index uses a tight index scan:
mysql> explain SELECT match_index, SUM(tfidf) AS total
FROM Test4_ClusterMatches WHERE cluster_index =3000
GROUP BY match_index;
This eliminates the temporary tables and filesort:
+----+-------------+----------------------+------+---------------+---------+---------+-------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------------+------+---------------+---------+---------+-------+------+--------------------------+
| 1 | SIMPLE | Test4_ClusterMatches | ref | PRIMARY | PRIMARY | 4 | const | 27 | Using where; Using index |
+----+-------------+----------------------+------+---------------+---------+---------+-------+------+--------------------------+
I'm not sure if this can be exploited with some magic sql-fu that I haven't come across yet?
Question
How can I change my query so that it uses the 3,000 cluster_indexes and avoids using temporary and filesort, without needing to examine 14,932,441 rows?
Update
Using the table
CREATE TABLE Test6_ClusterMatches
(
match_index INT UNSIGNED,
cluster_index INT UNSIGNED,
id INT NOT NULL AUTO_INCREMENT,
tfidf FLOAT,
PRIMARY KEY (id),
UNIQUE KEY(cluster_index,match_index)
);
The query below then gives 10 rows in set (0.41 sec) :)
SELECT `match_index`, SUM(`tfidf`) AS total FROM Test6_ClusterMatches WHERE
`cluster_index` IN (.....)
GROUP BY `match_index` ORDER BY total DESC LIMIT 0,10;
but it's using temporary and filesort:
+----+-------------+----------------------+-------+---------------+---------------+---------+------+-------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------------+-------+---------------+---------------+---------+------+-------+----------------------------------------------+
| 1 | SIMPLE | Test6_ClusterMatches | range | cluster_index | cluster_index | 5 | NULL | 78663 | Using where; Using temporary; Using filesort |
+----+-------------+----------------------+-------+---------------+---------------+---------+------+-------+----------------------------------------------+
I'm wondering if there's any way to get it faster by eliminating the using temporary and using filesort?

I had a quick look and this is what I came up with - hope it helps...
SQL Table
drop table if exists cluster_matches;
create table cluster_matches
(
cluster_id int unsigned not null,
match_id int unsigned not null,
...
tfidf float not null default 0,
primary key (cluster_id, match_id) -- if this isn't unique, add id to the end !!
)
engine=innodb;
Test Data
select count(*) from cluster_matches
count(*)
========
17974591
select count(distinct(cluster_id)) from cluster_matches;
count(distinct(cluster_id))
===========================
1000000
select count(distinct(match_id)) from cluster_matches;
count(distinct(match_id))
=========================
6000
explain select
cm.match_id,
sum(tfidf) as sum_tfidf,
count(*) as count_tfidf
from
cluster_matches cm
where
cm.cluster_id between 5000 and 10000
group by
cm.match_id
order by
sum_tfidf desc limit 10;
id select_type table type possible_keys key key_len ref rows Extra
== =========== ===== ==== ============= === ======= === ==== =====
1 SIMPLE cm range PRIMARY PRIMARY 4 290016 Using where; Using temporary; Using filesort
runtime - 0.067 seconds.
Pretty respectable runtime of 0.067 seconds but I think we can make it better.
Stored Procedure
You will have to forgive me for not wanting to type/pass in a list of 5000+ random cluster_ids !
call sum_cluster_matches(null,1); -- for testing
call sum_cluster_matches('1,2,3,4,....5000',1);
The bulk of the sproc isn't very elegant, but all it does is split a csv string into individual cluster_ids and populate a temp table.
drop procedure if exists sum_cluster_matches;
delimiter #
create procedure sum_cluster_matches
(
in p_cluster_id_csv varchar(65535),
in p_show_explain tinyint unsigned
)
proc_main:begin
declare v_id varchar(10);
declare v_done tinyint unsigned default 0;
declare v_idx int unsigned default 1;
create temporary table tmp(cluster_id int unsigned not null primary key);
-- not very elegant - split the string into tokens and put them into a temp table...
if p_cluster_id_csv is not null then
while not v_done do
set v_id = trim(substring(p_cluster_id_csv, v_idx,
if(locate(',', p_cluster_id_csv, v_idx) > 0,
locate(',', p_cluster_id_csv, v_idx) - v_idx, length(p_cluster_id_csv))));
if length(v_id) > 0 then
set v_idx = v_idx + length(v_id) + 1;
insert ignore into tmp values(v_id);
else
set v_done = 1;
end if;
end while;
else
-- instead of passing in a huge comma separated list of cluster_ids im cheating here to save typing
insert into tmp select cluster_id from clusters where cluster_id between 5000 and 10000;
-- end cheat
end if;
if p_show_explain then
select count(*) as count_of_tmp from tmp;
explain
select
cm.match_id,
sum(tfidf) as sum_tfidf,
count(*) as count_tfidf
from
cluster_matches cm
inner join tmp on tmp.cluster_id = cm.cluster_id
group by
cm.match_id
order by
sum_tfidf desc limit 10;
end if;
select
cm.match_id,
sum(tfidf) as sum_tfidf,
count(*) as count_tfidf
from
cluster_matches cm
inner join tmp on tmp.cluster_id = cm.cluster_id
group by
cm.match_id
order by
sum_tfidf desc limit 10;
drop temporary table if exists tmp;
end proc_main #
delimiter ;
Results
call sum_cluster_matches(null,1);
count_of_tmp
============
5001
id select_type table type possible_keys key key_len ref rows Extra
== =========== ===== ==== ============= === ======= === ==== =====
1 SIMPLE tmp index PRIMARY PRIMARY 4 5001 Using index; Using temporary; Using filesort
1 SIMPLE cm ref PRIMARY PRIMARY 4 vldb_db.tmp.cluster_id 8
match_id sum_tfidf count_tfidf
======== ========= ===========
1618 387 64
1473 387 64
3307 382 64
2495 373 64
1135 373 64
3832 372 57
3203 362 58
5464 358 67
2100 355 60
1634 354 52
runtime 0.028 seconds.
Explain plan and runtime much improved.

If the cluster_index values in the WHERE condition are continuous, then instead of IN use:
WHERE (cluster_index >= 1) and (cluster_index <= 3000)
If the values are not continuous then you can create a temporary table to hold the cluster_index values with an index and use an INNER JOIN to the temporary table.
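A minimal sketch of that approach, assuming the Test6_ClusterMatches table from the update above (the temporary table name is illustrative):
CREATE TEMPORARY TABLE tmp_clusters
(cluster_index INT UNSIGNED NOT NULL PRIMARY KEY);
INSERT INTO tmp_clusters VALUES (1),(2),(3); -- and so on for the full list
SELECT cm.match_index, SUM(cm.tfidf) AS total
FROM Test6_ClusterMatches cm
INNER JOIN tmp_clusters tc ON tc.cluster_index = cm.cluster_index
GROUP BY cm.match_index
ORDER BY total DESC LIMIT 0,10;
This is essentially what the stored procedure in the first answer automates.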

Related

SQL N:M query merging results by condition flag in intermediate table

[First of all, if this is a duplicate, sorry; I couldn't find a response for this, as this is a strange solution for a limitation of an ORM and I'm clearly a newbie at SQL]
Domain requirements:
A brigade must be composed of one user (the commissar one) and, optionally, one and only one assistant (1:1)
A user can only be part of one brigade (1:1)
CREATE TABLE Users
(
id SERIAL PRIMARY KEY,
username VARCHAR(100) NOT NULL UNIQUE,
password VARCHAR(100) NOT NULL
);
CREATE TABLE Brigades
(
id SERIAL PRIMARY KEY,
name VARCHAR(100) NOT NULL
);
-- N:M relationship with a flag inside which determine if that user is a commissar or not
CREATE TABLE Brigade_User
(
brigade_id INT NOT NULL REFERENCES Brigades(id)
ON DELETE CASCADE
ON UPDATE CASCADE,
user_id INT NOT NULL REFERENCES Users(id)
ON DELETE CASCADE
ON UPDATE CASCADE,
is_commissar BOOLEAN NOT NULL,
PRIMARY KEY(brigade_id, user_id)
);
Ideally, as the relations are 1:1, the Brigade_User intermediate table could be dropped and a Brigades table with two foreign keys could be created instead (this is not supported by the Diesel Rust ORM, so I think I'm stuck with the first approach):
CREATE TABLE Brigades
(
id SERIAL PRIMARY KEY,
name VARCHAR(100) NOT NULL,
-- 1:1
commisar_id INT NOT NULL REFERENCES Users(id)
ON DELETE CASCADE
ON UPDATE CASCADE,
-- 1:1
assistant_id INT NOT NULL REFERENCES Users(id)
ON DELETE CASCADE
ON UPDATE CASCADE
);
An example...
> SELECT * FROM brigade_user LEFT JOIN brigades ON brigade_user.brigade_id = brigades.id;
brigade_id | user_id | is_commissar | id | name
------------+---------+--------------+----+------------------
1 | 1 | t | 1 | Patrulla gatuna
1 | 2 | f | 1 | Patrulla gatuna
2 | 3 | t | 2 | Patrulla perruna
2 | 4 | f | 2 | Patrulla perruna
3 | 6 | t | 3 | Patrulla canina
3 | 5 | f | 3 | Patrulla canina
(6 rows)
Is it possible to make a query which returns a table like this?
brigade_id | commissar_id | assistant_id | name
-----------+--------------+--------------+--------------------
1 | 1 | 2 | Patrulla gatuna
2 | 3 | 4 | Patrulla perruna
3 | 6 | 5 | Patrulla canina
See that each pair of rows has been merged into one (remember, a brigade is composed of one commissar and, optionally, one assistant) depending on the flag.
Could this model be improved (having in mind the limitation on multiple foreign keys referencing the same table, discussed here)?
Try the following:
with cte as
(
SELECT A.brigade_id,A.user_id,A.is_commissar,B.name
FROM brigade_user A LEFT JOIN brigades B ON A.brigade_id = B.id
)
select C1.brigade_id, C1.user_id as commissar_id , C2.user_id as assistant_id, C1.name from
cte C1 left join cte C2
on C1.brigade_id=C2.brigade_id
and C1.user_id<>C2.user_id
where C1.is_commissar=true
See a demo from here.

Result of query as column value

I've got three tables:
Lessons:
CREATE TABLE lessons (
id SERIAL PRIMARY KEY,
title text NOT NULL,
description text NOT NULL,
vocab_count integer NOT NULL
);
+----+------------+------------------+-------------+
| id | title | description | vocab_count |
+----+------------+------------------+-------------+
| 1 | lesson_one | this is a lesson | 3 |
| 2 | lesson_two | another lesson | 2 |
+----+------------+------------------+-------------+
Lesson_vocabulary:
CREATE TABLE lesson_vocabulary (
lesson_id integer REFERENCES lessons(id),
vocabulary_id integer REFERENCES vocabulary(id)
);
+-----------+---------------+
| lesson_id | vocabulary_id |
+-----------+---------------+
| 1 | 1 |
| 1 | 2 |
| 1 | 3 |
| 2 | 2 |
| 2 | 4 |
+-----------+---------------+
Vocabulary:
CREATE TABLE vocabulary (
id integer PRIMARY KEY,
hiragana text NOT NULL,
reading text NOT NULL,
meaning text[] NOT NULL
);
Each lesson contains multiple vocabulary, and each vocabulary can be included in multiple lessons.
How can I get the vocab_count column of the lessons table to be calculated and updated whenever I add more rows to the lesson_vocabulary table? Is this possible, and how would I go about doing this?
Thanks
You can use SQL triggers to serve your purpose. This would be similar to mysql after insert trigger which updates another table's column.
The trigger would look somewhat like this. I am using Oracle SQL, but there would just be minor tweaks for any other implementation.
CREATE TRIGGER vocab_trigger
AFTER INSERT ON lesson_vocabulary
FOR EACH ROW
begin
for lesson_cur in (select LESSON_ID, COUNT(VOCABULARY_ID) voc_cnt from LESSON_VOCABULARY group by LESSON_ID) LOOP
update LESSONS
set VOCAB_COUNT = LESSON_CUR.VOC_CNT
where id = LESSON_CUR.LESSON_ID;
end loop;
END;
It's better to create a view that calculates that (and get rid of the column in the lessons table):
select l.*, lv.vocab_count
from lessons l
left join (
select lesson_id, count(*)
from lesson_vocabulary
group by lesson_id
) as lv(lesson_id, vocab_count) on l.id = lv.lesson_id
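If you want it as a named view, a sketch (the view name is illustrative, and coalesce covers lessons that have no vocabulary yet):
create view lessons_with_vocab_count as
select l.id, l.title, l.description, coalesce(lv.vocab_count, 0) as vocab_count
from lessons l
left join (
select lesson_id, count(*)
from lesson_vocabulary
group by lesson_id
) as lv(lesson_id, vocab_count) on l.id = lv.lesson_id;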
If you really want to update the lessons table each time the lesson_vocabulary changes, you can run an UPDATE statement like this in a trigger:
update lessons l
set vocab_count = t.cnt
from (
select lesson_id, count(*) as cnt
from lesson_vocabulary
group by lesson_id
) t
where t.lesson_id = l.id;
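If you do go the trigger route in PostgreSQL, that UPDATE needs a wrapping trigger function; here is a minimal sketch (function and trigger names are illustrative):
create or replace function refresh_vocab_count() returns trigger as $$
begin
  update lessons l
  set vocab_count = t.cnt
  from (
    select lesson_id, count(*) as cnt
    from lesson_vocabulary
    group by lesson_id
  ) t
  where t.lesson_id = l.id;
  return null; -- the return value is ignored for an AFTER ... FOR EACH STATEMENT trigger
end;
$$ language plpgsql;
create trigger lesson_vocabulary_vocab_count
after insert or update or delete on lesson_vocabulary
for each statement execute procedure refresh_vocab_count();
Note that, as written, a lesson whose last vocabulary row is deleted keeps its stale count, because the grouped subquery no longer produces a row for it.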
I would recommend using a query for this information:
select l.*,
(select count(*)
from lesson_vocabulary lv
where lv.lesson_id = l.id
) as vocabulary_cnt
from lessons l;
With an index on lesson_vocabulary(lesson_id), this should be quite fast.
I recommend this over an update, because the data remains correct.
I recommend this over a trigger, because it is simpler.
I recommend this over a subquery with aggregation because it should be faster, particularly if you are filtering on the lessons table.
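The supporting index, in case it does not exist yet (the index name is illustrative):
create index lesson_vocabulary_lesson_id_idx on lesson_vocabulary (lesson_id);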

Merging Complicated Tables

I'm trying to merge tables where rows correspond to a many:1 relationship with "real" things.
I'm writing a blackjack simulator that stores game history in a database with a new set of tables generated each run. The tables are really more like templates, since each game gets its own set of the 3 mutable tables (players, hands, and matches). Here's the layout, where suff is a user-specified suffix to use for the current run:
- cards
- id INTEGER PRIMARY KEY
- cardValue INTEGER NOT NULL
- suit INTEGER NOT NULL
- players_suff
- whichPlayer INTEGER PRIMARY KEY
- aiType TEXT NOT NULL
- hands_suff
- id BIGSERIAL PRIMARY KEY
- whichPlayer INTEGER REFERENCES players_suff(whichPlayer) *
- whichHand BIGINT NOT NULL
- thisCard INTEGER REFERENCES cards(id)
- matches_suff
- id BIGSERIAL PRIMARY KEY
- whichGame INTEGER NOT NULL
- dealersHand BIGINT NOT NULL
- whichPlayer INTEGER REFERENCES players_suff(whichPlayer)
- thisPlayersHand BIGINT NOT NULL **
- playerResult INTEGER NOT NULL --AKA who won
Only one cards table is created because its values are constant.
So after running the simulator twice you might have:
hands_firstrun
players_firstrun
matches_firstrun
hands_secondrun
players_secondrun
matches_secondrun
I want to be able to combine these tables if you used the same AI parameters for both of those runs (i.e. players_firstrun and players_secondrun are exactly the same). The problem is that the way I'm inserting hands makes this really messy: whichHand can't be a BIGSERIAL because the relationship of hands_suff rows to "actual hands" is many:1. matches_suff is handled the same way because a blackjack "game" actually consists of a set of games: the set of pairs of each player vs. the dealer. So for 3 players, you actually have 3 rows for each round.
Currently I select the largest whichHand in the table, add 1 to it, then insert all of the rows for one hand. I'm worried this "query-and-insert" will be really slow if I'm merging 2 tables that might both be arbitrarily huge.
When I'm merging tables, I feel like I should be able to (entirely in SQL) query the largest values in whichHand and whichGame once then use them combine the tables, incrementing them for each unique whichHand and whichGame in the table being merged.
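That max-once-then-offset merge can be expressed as a single INSERT ... SELECT; a sketch, assuming the column names shown in the EDIT below (the scalar subquery is evaluated once, against the target table as it was before the statement started):
insert into hands_first (whichplayer, whichhand, thiscard)
select s.whichplayer,
       s.whichhand + (select coalesce(max(whichhand), -1) + 1 from hands_first),
       s.thiscard
from hands_second s
order by s.id;
The matches tables would be merged the same way, offsetting whichgame, dealershand and thisplayershand by the corresponding maxima.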
(I saw this question, but it doesn't handle using a generated ID in 2 different places). I'm using Postgres and it's OK if the answer is specific to it.
* sadly postgres doesn't allow parameterized table names so this had to be done by manual string substitution. Not the end of the world since the program isn't web-facing and no one except me is likely to ever bother with it, but the SQL injection vulnerability does not make me happy.
** matches_suff(whichPlayersHand) was originally going to reference hands_suff(whichHand) but foreign keys must reference unique values. whichHand isn't unique because a hand is made up of multiple rows, with each row "holding" one card. To query for a hand you select all of those rows with the same value in whichHand. I couldn't think of a more elegant way to do this without resorting to arrays.
EDIT:
This is what I have now:
thomas=# \dt
List of relations
Schema | Name | Type | Owner
--------+----------------+-------+--------
public | cards | table | thomas
public | hands_first | table | thomas
public | hands_second | table | thomas
public | matches_first | table | thomas
public | matches_second | table | thomas
public | players_first | table | thomas
public | players_second | table | thomas
(7 rows)
thomas=# SELECT * FROM hands_first
thomas-# \g
id | whichplayer | whichhand | thiscard
----+-------------+-----------+----------
1 | 0 | 0 | 6
2 | 0 | 0 | 63
3 | 0 | 0 | 41
4 | 1 | 1 | 76
5 | 1 | 1 | 23
6 | 0 | 2 | 51
7 | 0 | 2 | 29
8 | 0 | 2 | 2
9 | 0 | 2 | 92
10 | 0 | 2 | 6
11 | 1 | 3 | 101
12 | 1 | 3 | 8
(12 rows)
thomas=# SELECT * FROM hands_second
thomas-# \g
id | whichplayer | whichhand | thiscard
----+-------------+-----------+----------
1 | 0 | 0 | 78
2 | 0 | 0 | 38
3 | 1 | 1 | 24
4 | 1 | 1 | 18
5 | 1 | 1 | 95
6 | 1 | 1 | 40
7 | 0 | 2 | 13
8 | 0 | 2 | 84
9 | 0 | 2 | 41
10 | 1 | 3 | 29
11 | 1 | 3 | 34
12 | 1 | 3 | 56
13 | 1 | 3 | 52
thomas=# SELECT * FROM matches_first
thomas-# \g
id | whichgame | dealershand | whichplayer | thisplayershand | playerresult
----+-----------+-------------+-------------+-----------------+--------------
1 | 0 | 0 | 1 | 1 | 1
2 | 1 | 2 | 1 | 3 | 2
(2 rows)
thomas=# SELECT * FROM matches_second
thomas-# \g
id | whichgame | dealershand | whichplayer | thisplayershand | playerresult
----+-----------+-------------+-------------+-----------------+--------------
1 | 0 | 0 | 1 | 1 | 0
2 | 1 | 2 | 1 | 3 | 2
(2 rows)
I'd like to combine them to have:
hands_combined table:
id | whichplayer | whichhand | thiscard
----+-------------+-----------+----------
1 | 0 | 0 | 6 --Seven of Spades
2 | 0 | 0 | 63 --Queen of Spades
3 | 0 | 0 | 41 --Three of Clubs
4 | 1 | 1 | 76
5 | 1 | 1 | 23
6 | 0 | 2 | 51
7 | 0 | 2 | 29
8 | 0 | 2 | 2
9 | 0 | 2 | 92
10 | 0 | 2 | 6
11 | 1 | 3 | 101
12 | 1 | 3 | 8
13 | 0 | 4 | 78
14 | 0 | 4 | 38
15 | 1 | 5 | 24
16 | 1 | 5 | 18
17 | 1 | 5 | 95
18 | 1 | 5 | 40
19 | 0 | 6 | 13
20 | 0 | 6 | 84
21 | 0 | 6 | 41
22 | 1 | 7 | 29
23 | 1 | 7 | 34
24 | 1 | 7 | 56
25 | 1 | 7 | 52
matches_combined table:
id | whichgame | dealershand | whichplayer | thisplayershand | playerresult
----+-----------+-------------+-------------+-----------------+--------------
1 | 0 | 0 | 1 | 1 | 1
2 | 1 | 2 | 1 | 3 | 2
3 | 2 | 4 | 1 | 5 | 0
4 | 3 | 6 | 1 | 7 | 2
Each value of "thiscard" represents a playing card in the range [1..104]--52 playing cards with an extra bit representing if it's face up or face down. I didn't post the actual table for space reasons.
So player 0 (aka the dealer) had a hand of (Seven of Spades, Queen of Spaces, 3 of Clubs) in the first game.
I think you're not using PostgreSQL the way it's intended to be used, plus your table design may not be suitable for what you want to achieve. Whilst it was difficult to understand what you want your solution to achieve, I wrote this, which seems to solve everything you want using a handful of tables only, and functions that return recordsets for simulating your requirement for individual runs. I used Enums and complex types to illustrate some of the features that you may wish to harness from the power of PostgreSQL.
Also, I'm not sure what parameterized table names are (I have never seen anything like it in any RDBMS), but PostgreSQL does allow something perfectly suitable: recordset returning functions.
CREATE TYPE card_value AS ENUM ('1', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K');
CREATE TYPE card_suit AS ENUM ('Clubs', 'Diamonds', 'Hearts', 'Spades');
CREATE TYPE card AS (value card_value, suit card_suit, face_up bool);
CREATE TABLE runs (
run_id bigserial NOT NULL PRIMARY KEY,
run_date timestamptz NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE players (
run_id bigint NOT NULL REFERENCES runs,
player_no int NOT NULL, -- 0 can be assumed as always the dealer
ai_type text NOT NULL,
PRIMARY KEY (run_id, player_no)
);
CREATE TABLE matches (
run_id bigint NOT NULL REFERENCES runs,
match_no int NOT NULL,
PRIMARY KEY (run_id, match_no)
);
CREATE TABLE hands (
hand_id bigserial NOT NULL PRIMARY KEY,
run_id bigint NOT NULL REFERENCES runs,
match_no int NOT NULL,
hand_no int NOT NULL,
player_no int NOT NULL,
UNIQUE (run_id, match_no, hand_no),
FOREIGN KEY (run_id, match_no) REFERENCES matches,
FOREIGN KEY (run_id, player_no) REFERENCES players
);
CREATE TABLE deals (
deal_id bigserial NOT NULL PRIMARY KEY,
hand_id bigint NOT NULL REFERENCES hands,
card card NOT NULL
);
CREATE OR REPLACE FUNCTION players(int) RETURNS SETOF players AS $$
SELECT * FROM players WHERE run_id = $1 ORDER BY player_no;
$$ LANGUAGE SQL;
CREATE OR REPLACE FUNCTION matches(int) RETURNS SETOF matches AS $$
SELECT * FROM matches WHERE run_id = $1 ORDER BY match_no;
$$ LANGUAGE SQL;
CREATE OR REPLACE FUNCTION hands(int) RETURNS SETOF hands AS $$
SELECT * FROM hands WHERE run_id = $1 ORDER BY match_no, hand_no;
$$ LANGUAGE SQL;
CREATE OR REPLACE FUNCTION hands(int, int) RETURNS SETOF hands AS $$
SELECT * FROM hands WHERE run_id = $1 AND match_no = $2 ORDER BY hand_no;
$$ LANGUAGE SQL;
CREATE OR REPLACE FUNCTION winner_player (int, int) RETURNS int AS $$
SELECT player_no
FROM hands
WHERE run_id = $1 AND match_no = $2
ORDER BY hand_no DESC
LIMIT 1
$$ LANGUAGE SQL;
CREATE OR REPLACE FUNCTION next_player_no (int) RETURNS int AS $$
SELECT CASE WHEN EXISTS (SELECT 1 FROM runs WHERE run_id = $1) THEN
COALESCE((SELECT MAX(player_no) FROM players WHERE run_id = $1), 0) + 1 END
$$ LANGUAGE SQL;
CREATE OR REPLACE FUNCTION next_match_no (int) RETURNS int AS $$
SELECT CASE WHEN EXISTS (SELECT 1 FROM runs WHERE run_id = $1) THEN
COALESCE((SELECT MAX(match_no) FROM matches WHERE run_id = $1), 0) + 1 END
$$ LANGUAGE SQL;
CREATE OR REPLACE FUNCTION next_hand_no (int) RETURNS int AS $$
SELECT CASE WHEN EXISTS (SELECT 1 FROM runs WHERE run_id = $1) THEN
COALESCE((SELECT MAX(hand_no) + 1 FROM hands WHERE run_id = $1), 0) END
$$ LANGUAGE SQL;
CREATE OR REPLACE FUNCTION card_to_int (card) RETURNS int AS $$
SELECT ((SELECT enumsortorder::int-1 FROM pg_enum WHERE enumtypid = 'card_suit'::regtype AND enumlabel = ($1).suit::name) * 13 +
(SELECT enumsortorder::int-1 FROM pg_enum WHERE enumtypid = 'card_value'::regtype AND enumlabel = ($1).value::name) + 1) *
CASE WHEN ($1).face_up THEN 2 ELSE 1 END
$$ LANGUAGE SQL; -- SELECT card_to_int(('3', 'Spades', false))
CREATE OR REPLACE FUNCTION int_to_card (int) RETURNS card AS $$
SELECT ((SELECT enumlabel::card_value FROM pg_enum WHERE enumtypid = 'card_value'::regtype AND enumsortorder = ((($1-1)%13)+1)::real),
(SELECT enumlabel::card_suit FROM pg_enum WHERE enumtypid = 'card_suit'::regtype AND enumsortorder = (((($1-1)/13)::int%4)+1)::real),
$1 > (13*4))::card
$$ LANGUAGE SQL; -- SELECT i, int_to_card(i) FROM generate_series(1, 13*4*2) i
CREATE OR REPLACE FUNCTION deal_cards(int, int, int, int[]) RETURNS TABLE (player_no int, hand_no int, card card) AS $$
WITH
hand AS (
INSERT INTO hands (run_id, match_no, player_no, hand_no)
VALUES ($1, $2, $3, next_hand_no($1))
RETURNING hand_id, player_no, hand_no),
mydeals AS (
INSERT INTO deals (hand_id, card)
SELECT hand_id, int_to_card(card_id)::card AS card
FROM hand, UNNEST($4) card_id
RETURNING hand_id, deal_id, card
)
SELECT h.player_no, h.hand_no, d.card
FROM hand h, mydeals d
$$ LANGUAGE SQL;
CREATE OR REPLACE FUNCTION deals(int) RETURNS TABLE (deal_id bigint, hand_no int, player_no int, card int) AS $$
SELECT d.deal_id, h.hand_no, h.player_no, card_to_int(d.card)
FROM hands h
JOIN deals d ON (d.hand_id = h.hand_id)
WHERE h.run_id = $1
ORDER BY d.deal_id;
$$ LANGUAGE SQL;
INSERT INTO runs DEFAULT VALUES; -- Add first run
INSERT INTO players VALUES (1, 0, 'Dealer'); -- dealer always zero
INSERT INTO players VALUES (1, next_player_no(1), 'Player 1');
INSERT INTO matches VALUES (1, next_match_no(1)); -- First match
SELECT * FROM deal_cards(1, 1, 0, ARRAY[6, 63, 41]);
SELECT * FROM deal_cards(1, 1, 1, ARRAY[76, 23]);
SELECT * FROM deal_cards(1, 1, 0, ARRAY[51, 29, 2, 92, 6]);
SELECT * FROM deal_cards(1, 1, 1, ARRAY[101, 8]);
INSERT INTO matches VALUES (1, next_match_no(1)); -- Second match
SELECT * FROM deal_cards(1, 2, 0, ARRAY[78, 38]);
SELECT * FROM deal_cards(1, 2, 1, ARRAY[24, 18, 95, 40]);
SELECT * FROM deal_cards(1, 2, 0, ARRAY[13, 84, 41]);
SELECT * FROM deal_cards(1, 2, 1, ARRAY[29, 34, 56, 52]);
SELECT * FROM deals(1); -- This is the output you need (hands_combined table)
-- This view can be used to retrieve the list of all winning hands
CREATE OR REPLACE VIEW winning_hands AS
SELECT DISTINCT ON (run_id, match_no) *
FROM hands
ORDER BY run_id, match_no, hand_no DESC;
SELECT * FROM winning_hands;
Wouldn't using the UNION operator work?
For the hands relation:
SELECT * FROM hands_first
UNION ALL
SELECT * FROM hands_second
For the matches relation:
SELECT * FROM matches_first
UNION ALL
SELECT * FROM matches_second
As a more long term solution I'd consider restructuring the DB because it will quickly become unmanageable with this schema. Why not improve normalization by introducing a games table?
In other words Games have many Matches, matches have many players for each game and players have many hands for each match.
I'd recommend drawing the UML for the entity relationships on paper (http://dawgsquad.googlecode.com/hg/docs/database_images/Database_Model_Diagram(Title).png), then improving the schema so it can be queried using normal SQL operators.
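A rough sketch of that hierarchy (names illustrative; the longer answer above works out a fuller version):
create table games (game_id bigserial primary key);
create table matches (match_id bigserial primary key,
game_id bigint not null references games);
create table players (player_id serial primary key,
ai_type text not null);
create table hands (hand_id bigserial primary key,
match_id bigint not null references matches,
player_id int not null references players);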
Hope this helps.
EDIT:
In that case you can use a subquery on the union of both tables with the row_number() PG window function to generate the row number:
SELECT
row_number() OVER () AS id,
whichplayer,
whichhand,
thiscard
FROM
(
SELECT * FROM hands_first
UNION ALL
SELECT * FROM hands_second
) AS h;
The same principle would apply to the matches table. Obviously this doesn't scale well to even a small number of tables, so I would prioritize normalizing your schema.
Docs on some PG functions: http://www.postgresql.org/docs/current/interactive/functions-window.html
To build a new table with all rows of the two tables, do:
CREATE TABLE hands AS
select 1 as hand, id, whichplayer, whichhand, thiscard
from hands_first
union all
select 2 as hand, id, whichplayer, whichhand, thiscard
from hands_second
After that, to insert data for a new match, create a sequence starting at the current last value + 1:
CREATE SEQUENCE matche START 3;
Before inserting, read the sequence value and use it in the inserts:
SELECT nextval('matche');
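For example (values illustrative, continuing the combined data above):
SELECT nextval('matche'); -- suppose it returns 3: the label for the next run's rows
INSERT INTO hands (hand, id, whichplayer, whichhand, thiscard)
VALUES (3, 1, 0, 0, 17);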
Your database structure is not great, and I know for sure that creating tables on the fly is not a scalable approach. There are performance drawbacks to creating physical tables instead of using an existing structure. I suggest you refactor your db structure if you can.
You can however use the UNION operator to merge your data.

SQL Query 2 tables null results

I was asked this question in an interview:
From the 2 tables below, write a query to pull customers with no sales orders.
How many ways to write this query and which would have best performance.
Table 1: Customer - CustomerID
Table 2: SalesOrder - OrderID, CustomerID, OrderDate
Query:
SELECT *
FROM Customer C
RIGHT OUTER JOIN SalesOrder SO ON C.CustomerID = SO.CustomerID
WHERE SO.OrderID = NULL
Is my query correct and are there other ways to write the query and get the same results?
Answering for MySQL instead of SQL Server, because you only tagged it with SQL Server later, so I thought (since this was an interview question) that it wouldn't bother you which DBMS this is for. Note, though, that the queries I wrote are standard sql; they should run in every RDBMS out there. How each RDBMS handles those queries is another issue.
I wrote this little procedure for you, to have a test case. It creates the tables customers and orders like you specified and I added primary keys and foreign keys, like one would usually do it. No other indexes, as every column worth indexing here is already primary key. 250 customers are created, 100 of them made an order (though out of convenience none of them twice / multiple times). A dump of the data follows, posted the script just in case you want to play around a little by increasing the numbers.
delimiter $$
create procedure fill_table()
begin
create table customers(customerId int primary key) engine=innodb;
set @x = 1;
while (@x <= 250) do
insert into customers values(@x);
set @x := @x + 1;
end while;
create table orders(orderId int auto_increment primary key,
customerId int,
orderDate timestamp,
foreign key fk_customer (customerId) references customers(customerId)
) engine=innodb;
insert into orders(customerId, orderDate)
select
customerId,
now() - interval customerId day
from
customers
order by rand()
limit 100;
end $$
delimiter ;
call fill_table();
For me, this resulted in this:
CREATE TABLE `customers` (
`customerId` int(11) NOT NULL,
PRIMARY KEY (`customerId`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
INSERT INTO `customers` VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15),(16),(17),(18),(19),(20),(21),(22),(23),(24),(25),(26),(27),(28),(29),(30),(31),(32),(33),(34),(35),(36),(37),(38),(39),(40),(41),(42),(43),(44),(45),(46),(47),(48),(49),(50),(51),(52),(53),(54),(55),(56),(57),(58),(59),(60),(61),(62),(63),(64),(65),(66),(67),(68),(69),(70),(71),(72),(73),(74),(75),(76),(77),(78),(79),(80),(81),(82),(83),(84),(85),(86),(87),(88),(89),(90),(91),(92),(93),(94),(95),(96),(97),(98),(99),(100),(101),(102),(103),(104),(105),(106),(107),(108),(109),(110),(111),(112),(113),(114),(115),(116),(117),(118),(119),(120),(121),(122),(123),(124),(125),(126),(127),(128),(129),(130),(131),(132),(133),(134),(135),(136),(137),(138),(139),(140),(141),(142),(143),(144),(145),(146),(147),(148),(149),(150),(151),(152),(153),(154),(155),(156),(157),(158),(159),(160),(161),(162),(163),(164),(165),(166),(167),(168),(169),(170),(171),(172),(173),(174),(175),(176),(177),(178),(179),(180),(181),(182),(183),(184),(185),(186),(187),(188),(189),(190),(191),(192),(193),(194),(195),(196),(197),(198),(199),(200),(201),(202),(203),(204),(205),(206),(207),(208),(209),(210),(211),(212),(213),(214),(215),(216),(217),(218),(219),(220),(221),(222),(223),(224),(225),(226),(227),(228),(229),(230),(231),(232),(233),(234),(235),(236),(237),(238),(239),(240),(241),(242),(243),(244),(245),(246),(247),(248),(249),(250);
CREATE TABLE `orders` (
`orderId` int(11) NOT NULL AUTO_INCREMENT,
`customerId` int(11) DEFAULT NULL,
`orderDate` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
PRIMARY KEY (`orderId`),
KEY `fk_customer` (`customerId`),
CONSTRAINT `orders_ibfk_1` FOREIGN KEY (`customerId`) REFERENCES `customers` (`customerId`)
) ENGINE=InnoDB AUTO_INCREMENT=128 DEFAULT CHARSET=utf8;
INSERT INTO `orders` VALUES (1,247,'2013-06-24 19:50:07'),(2,217,'2013-07-24 19:50:07'),(3,8,'2014-02-18 20:50:07'),(4,40,'2014-01-17 20:50:07'),(5,52,'2014-01-05 20:50:07'),(6,80,'2013-12-08 20:50:07'),(7,169,'2013-09-10 19:50:07'),(8,135,'2013-10-14 19:50:07'),(9,115,'2013-11-03 20:50:07'),(10,225,'2013-07-16 19:50:07'),(11,112,'2013-11-06 20:50:07'),(12,243,'2013-06-28 19:50:07'),(13,158,'2013-09-21 19:50:07'),(14,24,'2014-02-02 20:50:07'),(15,214,'2013-07-27 19:50:07'),(16,25,'2014-02-01 20:50:07'),(17,245,'2013-06-26 19:50:07'),(18,182,'2013-08-28 19:50:07'),(19,166,'2013-09-13 19:50:07'),(20,69,'2013-12-19 20:50:07'),(21,85,'2013-12-03 20:50:07'),(22,44,'2014-01-13 20:50:07'),(23,103,'2013-11-15 20:50:07'),(24,19,'2014-02-07 20:50:07'),(25,33,'2014-01-24 20:50:07'),(26,102,'2013-11-16 20:50:07'),(27,41,'2014-01-16 20:50:07'),(28,94,'2013-11-24 20:50:07'),(29,43,'2014-01-14 20:50:07'),(30,150,'2013-09-29 19:50:07'),(31,218,'2013-07-23 19:50:07'),(32,131,'2013-10-18 19:50:07'),(33,77,'2013-12-11 20:50:07'),(34,2,'2014-02-24 20:50:07'),(35,45,'2014-01-12 20:50:07'),(36,230,'2013-07-11 19:50:07'),(37,101,'2013-11-17 20:50:07'),(38,31,'2014-01-26 20:50:07'),(39,56,'2014-01-01 20:50:07'),(40,176,'2013-09-03 19:50:07'),(41,223,'2013-07-18 19:50:07'),(42,145,'2013-10-04 19:50:07'),(43,26,'2014-01-31 20:50:07'),(44,62,'2013-12-26 20:50:07'),(45,195,'2013-08-15 19:50:07'),(46,153,'2013-09-26 19:50:07'),(47,179,'2013-08-31 19:50:07'),(48,104,'2013-11-14 20:50:07'),(49,7,'2014-02-19 20:50:07'),(50,209,'2013-08-01 19:50:07'),(51,86,'2013-12-02 20:50:07'),(52,110,'2013-11-08 20:50:07'),(53,204,'2013-08-06 19:50:07'),(54,187,'2013-08-23 19:50:07'),(55,114,'2013-11-04 20:50:07'),(56,38,'2014-01-19 20:50:07'),(57,236,'2013-07-05 19:50:07'),(58,79,'2013-12-09 20:50:07'),(59,96,'2013-11-22 20:50:07'),(60,37,'2014-01-20 20:50:07'),(61,207,'2013-08-03 19:50:07'),(62,22,'2014-02-04 20:50:07'),(63,120,'2013-10-29 20:50:07'),(64,200,'2013-08-10 19:50:07'),(65,51,'2014-01-06 20:50:07'),(66,181,'2013-08-29 19:50:07'),(67,4,'2014-02-22 20:50:07'),(68,123,'2013-10-26 19:50:07'),(69,108,'2013-11-10 20:50:07'),(70,55,'2014-01-02 20:50:07'),(71,76,'2013-12-12 20:50:07'),(72,6,'2014-02-20 20:50:07'),(73,18,'2014-02-08 20:50:07'),(74,211,'2013-07-30 19:50:07'),(75,53,'2014-01-04 20:50:07'),(76,216,'2013-07-25 19:50:07'),(77,32,'2014-01-25 20:50:07'),(78,74,'2013-12-14 20:50:07'),(79,138,'2013-10-11 19:50:07'),(80,197,'2013-08-13 19:50:07'),(81,221,'2013-07-20 19:50:07'),(82,118,'2013-10-31 20:50:07'),(83,61,'2013-12-27 20:50:07'),(84,28,'2014-01-29 20:50:07'),(85,16,'2014-02-10 20:50:07'),(86,39,'2014-01-18 20:50:07'),(87,3,'2014-02-23 20:50:07'),(88,46,'2014-01-11 20:50:07'),(89,189,'2013-08-21 19:50:07'),(90,59,'2013-12-29 20:50:07'),(91,249,'2013-06-22 19:50:07'),(92,127,'2013-10-22 19:50:07'),(93,47,'2014-01-10 20:50:07'),(94,178,'2013-09-01 19:50:07'),(95,141,'2013-10-08 19:50:07'),(96,188,'2013-08-22 19:50:07'),(97,220,'2013-07-21 19:50:07'),(98,15,'2014-02-11 20:50:07'),(99,175,'2013-09-04 19:50:07'),(100,206,'2013-08-04 19:50:07');
Okay, now to the queries. Three ways came to my mind, I omitted the right join that MDiesel did, because it's actually just another way of writing left join. It was invented for lazy sql developers, that don't want to switch table names, but instead just rewrite one word.
Anyway, first query:
select
c.*
from
customers c
left join orders o on c.customerId = o.customerId
where o.customerId is null;
Results in an execution plan like this:
+----+-------------+-------+-------+---------------+-------------+---------+------------------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+-------------+---------+------------------+------+--------------------------+
| 1 | SIMPLE | c | index | NULL | PRIMARY | 4 | NULL | 250 | Using index |
| 1 | SIMPLE | o | ref | fk_customer | fk_customer | 5 | wtf.c.customerId | 1 | Using where; Using index |
+----+-------------+-------+-------+---------------+-------------+---------+------------------+------+--------------------------+
Second query:
select
c.*
from
customers c
where c.customerId not in (select distinct customerId from orders);
Results in an execution plan like this:
+----+--------------------+--------+----------------+---------------+-------------+---------+------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+--------+----------------+---------------+-------------+---------+------+------+--------------------------+
| 1 | PRIMARY | c | index | NULL | PRIMARY | 4 | NULL | 250 | Using where; Using index |
| 2 | DEPENDENT SUBQUERY | orders | index_subquery | fk_customer | fk_customer | 5 | func | 2 | Using index |
+----+--------------------+--------+----------------+---------------+-------------+---------+------+------+--------------------------+
Third query:
select
c.*
from
customers c
where not exists (select 1 from orders o where o.customerId = c.customerId);
Results in an execution plan like this:
+----+--------------------+-------+-------+---------------+-------------+---------+------------------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------------+-------+-------+---------------+-------------+---------+------------------+------+--------------------------+
| 1 | PRIMARY | c | index | NULL | PRIMARY | 4 | NULL | 250 | Using where; Using index |
| 2 | DEPENDENT SUBQUERY | o | ref | fk_customer | fk_customer | 5 | wtf.c.customerId | 1 | Using where; Using index |
+----+--------------------+-------+-------+---------------+-------------+---------+------------------+------+--------------------------+
We can see in all execution plans, that the customers table is read as a whole, but from the index (the implicit one as the only column is primary key). This may change, when you select other columns from the table, that are not in an index.
The first one seems to be the best. For each row in customers only one row in orders is read. The id column suggests, that MySQL can do this in one step, as only indexes are involved.
The second query seems to be the worst (though all 3 queries shouldn't perform too bad). For each row in customers the subquery is executed (the select_type column tells this).
The third query is not much different in that it uses a dependent subquery, but should perform better than the second query. Explaining the small differences would lead to far now. If you're interested, here's the manual page that explains what each column and their values mean here: EXPLAIN output
Finally: I'd say, that the first query will perform best, but as always, in the end one has to measure, to measure and to measure.
I can think of two other ways to write this query:
SELECT C.*
FROM Customer C
LEFT OUTER JOIN SalesOrder SO ON C.CustomerID = SO.CustomerID
WHERE SO.CustomerID IS NULL
SELECT C.*
FROM Customer C
WHERE NOT C.CustomerID IN(SELECT CustomerID FROM SalesOrder)
The solutions involving outer joins will generally perform better than a solution using NOT IN. Beware also that NOT IN returns no rows at all if the subquery produces a NULL CustomerID, so the outer-join form is safer as well.

Is it possible to update an "order" column from within a trigger in MySQL?

We have a table in our system that would benefit from a numeric column so we can easily grab the 1st, 2nd, 3rd records for a job. We could, of course, update this column from the application itself, but I was hoping to do it in the database.
The final method must handle cases where users insert data that belongs in the "middle" of the results, as they may receive information out of order. They may also edit or delete records, so there will be corresponding update and delete triggers.
The table:
CREATE TABLE `test` (
`id` int(11) unsigned NOT NULL AUTO_INCREMENT,
`seq` int(11) unsigned NOT NULL,
`job_no` varchar(20) NOT NULL,
`date` date NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=7 DEFAULT CHARSET=latin1
And some example data:
mysql> SELECT * FROM test ORDER BY job_no, seq;
+----+-----+--------+------------+
| id | seq | job_no | date |
+----+-----+--------+------------+
| 5 | 1 | 123 | 2009-10-05 |
| 6 | 2 | 123 | 2009-10-01 |
| 4 | 1 | 123456 | 2009-11-02 |
| 3 | 2 | 123456 | 2009-11-10 |
| 2 | 3 | 123456 | 2009-11-19 |
+----+-----+--------+------------+
I was hoping to update the "seq" column from a trigger, but this isn't allowed by MySQL, with the error "Can't update table 'test' in stored function/trigger because it is already used by statement which invoked this stored function/trigger".
My test trigger is as follows:
CREATE TRIGGER `test_after_ins_tr` AFTER INSERT ON `test`
FOR EACH ROW
BEGIN
SET @seq = 0;
UPDATE
`test` t
SET
t.`seq` = @seq := (SELECT @seq + 1)
WHERE
t.`job_no` = NEW.`job_no`
ORDER BY
t.`date`;
END;
Is there any way to achieve what I'm after other than remembering to call a function after each update to this table?
What about this?
CREATE TRIGGER `test_after_ins_tr` BEFORE INSERT ON `test`
FOR EACH ROW
BEGIN
SET @seq = (SELECT COALESCE(MAX(seq),0) + 1 FROM test t WHERE t.job_no = NEW.job_no);
SET NEW.seq = @seq;
END;
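For example, with the data from the question (a hypothetical insert):
INSERT INTO test (job_no, date) VALUES ('123', '2009-10-07');
-- the trigger sets NEW.seq to 3, since MAX(seq) for job_no '123' was 2
Note that this assigns seq in arrival order; a row that arrives out of date order would still need a later renumbering pass (or the corresponding update and delete triggers the question anticipates).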
From Sergi's comment above:
http://dev.mysql.com/doc/refman/5.1/en/stored-program-restrictions.html - "Within a stored function or trigger, it is not permitted to modify a table that is already being used (for reading or writing) by the statement that invoked the function or trigger."