MySQL GROUP BY optimization
This question is a more specific version of a previous question I asked.
Table
CREATE TABLE Test4_ClusterMatches
(
`match_index` INT UNSIGNED,
`cluster_index` INT UNSIGNED,
`id` INT NOT NULL AUTO_INCREMENT,
`tfidf` FLOAT,
PRIMARY KEY (`cluster_index`,`match_index`,`id`)
);
The query I want to run
mysql> explain SELECT `match_index`, SUM(`tfidf`) AS total
FROM Test4_ClusterMatches WHERE `cluster_index` IN (1,2,3 ... 3000)
GROUP BY `match_index`;
The Problem with the query
It uses temporary and filesort, so it's too slow:
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+-----------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+-----------------------------------------------------------+
| 1 | SIMPLE | Test4_ClusterMatches | range | PRIMARY | PRIMARY | 4 | NULL | 51540 | Using where; Using index; Using temporary; Using filesort |
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+-----------------------------------------------------------+
With the current index, the GROUP BY would have to lead with cluster_index (i.e. GROUP BY cluster_index, match_index) to follow index order and so eliminate the temporary table and filesort, but grouping that way gives the wrong results for SUM(tfidf).
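To illustrate (my own example, not from the original question): grouping that follows the index order produces one sum per (cluster_index, match_index) pair, so a match that appears in several clusters is no longer rolled up into a single row:
SELECT `match_index`, SUM(`tfidf`) AS total
FROM Test4_ClusterMatches
WHERE `cluster_index` IN (1,2,3)
GROUP BY `cluster_index`, `match_index`; -- wrong: one row per cluster/match pair, not per match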
Changing the primary key to
PRIMARY KEY (`match_index`,`cluster_index`,`id`)
doesn't use a filesort or temporary table, but it examines 14,932,441 rows, so it is also too slow:
+----+-------------+----------------------+-------+---------------+---------+---------+------+----------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------------+-------+---------------+---------+---------+------+----------+--------------------------+
| 1 | SIMPLE | Test5_ClusterMatches | index | NULL | PRIMARY | 16 | NULL | 14932441 | Using where; Using index |
+----+-------------+----------------------+-------+---------------+---------+---------+------+----------+--------------------------+
Tight Index Scan
A tight index scan, running the search for just one cluster_index,
mysql> explain SELECT match_index, SUM(tfidf) AS total
FROM Test4_ClusterMatches WHERE cluster_index =3000
GROUP BY match_index;
eliminates the temporary table and the filesort:
+----+-------------+----------------------+------+---------------+---------+---------+-------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------------+------+---------------+---------+---------+-------+------+--------------------------+
| 1 | SIMPLE | Test4_ClusterMatches | ref | PRIMARY | PRIMARY | 4 | const | 27 | Using where; Using index |
+----+-------------+----------------------+------+---------------+---------+---------+-------+------+--------------------------+
I'm not sure if this can be exploited with some magic sql-fu that I haven't come across yet?
Question
How can I change my query so that it uses 3,000 cluster_index values and avoids using temporary and filesort, without needing to examine 14,932,441 rows?
Update
Using the table
CREATE TABLE Test6_ClusterMatches
(
match_index INT UNSIGNED,
cluster_index INT UNSIGNED,
id INT NOT NULL AUTO_INCREMENT,
tfidf FLOAT,
PRIMARY KEY (id),
UNIQUE KEY(cluster_index,match_index)
);
The query below then gives 10 rows in set (0.41 sec) :)
SELECT `match_index`, SUM(`tfidf`) AS total FROM Test6_ClusterMatches WHERE
`cluster_index` IN (.....)
GROUP BY `match_index` ORDER BY total DESC LIMIT 0,10;
but it's using temporary and filesort:
+----+-------------+----------------------+-------+---------------+---------------+---------+------+-------+----------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+----------------------+-------+---------------+---------------+---------+------+-------+----------------------------------------------+
| 1 | SIMPLE | Test6_ClusterMatches | range | cluster_index | cluster_index | 5 | NULL | 78663 | Using where; Using temporary; Using filesort |
+----+-------------+----------------------+-------+---------------+---------------+---------+------+-------+----------------------------------------------+
I'm wondering if there's any way to make it faster by eliminating Using temporary and Using filesort?
I had a quick look and this is what I came up with - hope it helps...
SQL Table
drop table if exists cluster_matches;
create table cluster_matches
(
cluster_id int unsigned not null,
match_id int unsigned not null,
...
tfidf float not null default 0,
primary key (cluster_id, match_id) -- if this isn't unique, add id to the end !!
)
engine=innodb;
Test Data
select count(*) from cluster_matches
count(*)
========
17974591
select count(distinct(cluster_id)) from cluster_matches;
count(distinct(cluster_id))
===========================
1000000
select count(distinct(match_id)) from cluster_matches;
count(distinct(match_id))
=========================
6000
explain select
cm.match_id,
sum(tfidf) as sum_tfidf,
count(*) as count_tfidf
from
cluster_matches cm
where
cm.cluster_id between 5000 and 10000
group by
cm.match_id
order by
sum_tfidf desc limit 10;
id select_type table type possible_keys key key_len ref rows Extra
== =========== ===== ==== ============= === ======= === ==== =====
1 SIMPLE cm range PRIMARY PRIMARY 4 290016 Using where; Using temporary; Using filesort
runtime - 0.067 seconds.
Pretty respectable runtime of 0.067 seconds but I think we can make it better.
Stored Procedure
You will have to forgive me for not wanting to type/pass in a list of 5000+ random cluster_ids!
call sum_cluster_matches(null,1); -- for testing
call sum_cluster_matches('1,2,3,4,....5000',1);
The bulk of the sproc isn't very elegant, but all it does is split a CSV string into individual cluster_ids and populate a temp table.
drop procedure if exists sum_cluster_matches;
delimiter #
create procedure sum_cluster_matches
(
in p_cluster_id_csv varchar(65535),
in p_show_explain tinyint unsigned
)
proc_main:begin
declare v_id varchar(10);
declare v_done tinyint unsigned default 0;
declare v_idx int unsigned default 1;
create temporary table tmp(cluster_id int unsigned not null primary key);
-- not very elegant - split the string into tokens and put them into a temp table...
if p_cluster_id_csv is not null then
while not v_done do
set v_id = trim(substring(p_cluster_id_csv, v_idx,
if(locate(',', p_cluster_id_csv, v_idx) > 0,
locate(',', p_cluster_id_csv, v_idx) - v_idx, length(p_cluster_id_csv))));
if length(v_id) > 0 then
set v_idx = v_idx + length(v_id) + 1;
insert ignore into tmp values(v_id);
else
set v_done = 1;
end if;
end while;
else
-- instead of passing in a huge comma separated list of cluster_ids, I'm cheating here to save typing
insert into tmp select cluster_id from clusters where cluster_id between 5000 and 10000;
-- end cheat
end if;
if p_show_explain then
select count(*) as count_of_tmp from tmp;
explain
select
cm.match_id,
sum(tfidf) as sum_tfidf,
count(*) as count_tfidf
from
cluster_matches cm
inner join tmp on tmp.cluster_id = cm.cluster_id
group by
cm.match_id
order by
sum_tfidf desc limit 10;
end if;
select
cm.match_id,
sum(tfidf) as sum_tfidf,
count(*) as count_tfidf
from
cluster_matches cm
inner join tmp on tmp.cluster_id = cm.cluster_id
group by
cm.match_id
order by
sum_tfidf desc limit 10;
drop temporary table if exists tmp;
end proc_main #
delimiter ;
Results
call sum_cluster_matches(null,1);
count_of_tmp
============
5001
id select_type table type possible_keys key key_len ref rows Extra
== =========== ===== ==== ============= === ======= === ==== =====
1 SIMPLE tmp index PRIMARY PRIMARY 4 5001 Using index; Using temporary; Using filesort
1 SIMPLE cm ref PRIMARY PRIMARY 4 vldb_db.tmp.cluster_id 8
match_id sum_tfidf count_tfidf
======== ========= ===========
1618 387 64
1473 387 64
3307 382 64
2495 373 64
1135 373 64
3832 372 57
3203 362 58
5464 358 67
2100 355 60
1634 354 52
runtime 0.028 seconds.
Explain plan and runtime much improved.
If the cluster_index values in the WHERE condition are contiguous, then instead of IN use a range:
WHERE (cluster_index >= 1) and (cluster_index <= 3000)
If the values are not contiguous, then you can create a temporary table to hold the cluster_index values (with an index on it) and use an INNER JOIN to the temporary table, as sketched below.
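For illustration, a minimal sketch of that temp-table approach against the Test6_ClusterMatches table from the question; the wanted_clusters name is my own, and I haven't benchmarked this:
CREATE TEMPORARY TABLE wanted_clusters (
    cluster_index INT UNSIGNED NOT NULL PRIMARY KEY
);
INSERT INTO wanted_clusters VALUES (1),(2),(3); -- ... and the rest of the 3,000 ids
SELECT cm.match_index, SUM(cm.tfidf) AS total
FROM Test6_ClusterMatches cm
INNER JOIN wanted_clusters w ON w.cluster_index = cm.cluster_index
GROUP BY cm.match_index
ORDER BY total DESC
LIMIT 0, 10;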
Related
SQL N:M query merging results by condition flag in intermediate table
[First of all, if this is a duplicate, sorry, I couldn't find a response for this, as this is a strange solution for a limitation on an ORM, and I'm clearly a noobie on SQL]
Domain requirements:
- A brigade must be composed of one user (the commissar one) and, optionally, one and only one assistant (1:1)
- A user can only be part of one brigade (1:1)
CREATE TABLE Users (
    id SERIAL PRIMARY KEY,
    username VARCHAR(100) NOT NULL UNIQUE,
    password VARCHAR(100) NOT NULL
);
CREATE TABLE Brigades (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100) NOT NULL
);
-- N:M relationship with a flag inside which determines if that user is a commissar or not
CREATE TABLE Brigade_User (
    brigade_id INT NOT NULL REFERENCES Brigades(id) ON DELETE CASCADE ON UPDATE CASCADE,
    user_id INT NOT NULL REFERENCES Users(id) ON DELETE CASCADE ON UPDATE CASCADE,
    is_commissar BOOLEAN NOT NULL,
    PRIMARY KEY(brigade_id, user_id)
);
Ideally, as relations are 1:1, the Brigade_User intermediate table could be erased and a Brigades table with two foreign keys could be created instead (this is not supported by the Diesel Rust ORM, so I think I'm coupled to the first approach):
CREATE TABLE Brigades (
    id SERIAL PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    -- 1:1
    commisar_id INT NOT NULL REFERENCES Users(id) ON DELETE CASCADE ON UPDATE CASCADE,
    -- 1:1
    assistant_id INT NOT NULL REFERENCES Users(id) ON DELETE CASCADE ON UPDATE CASCADE
);
An example...
> SELECT * FROM brigade_user LEFT JOIN brigades ON brigade_user.brigade_id = brigades.id;
 brigade_id | user_id | is_commissar | id |       name
------------+---------+--------------+----+------------------
          1 |       1 | t            |  1 | Patrulla gatuna
          1 |       2 | f            |  1 | Patrulla gatuna
          2 |       3 | t            |  2 | Patrulla perruna
          2 |       4 | f            |  2 | Patrulla perruna
          3 |       6 | t            |  3 | Patrulla canina
          3 |       5 | f            |  3 | Patrulla canina
(6 rows)
Is it possible to make a query which returns a table like this?
 brigade_id | commissar_id | assistant_id |       name
------------+--------------+--------------+------------------
          1 |            1 |            2 | Patrulla gatuna
          2 |            3 |            4 | Patrulla perruna
          3 |            6 |            5 | Patrulla canina
See that each two rows have been merged into one (remember, a brigade is composed of one commissary and, optionally, one assistant) depending on the flag. Could this model be improved (having in mind the limitation on multiple foreign keys referencing the same table, discussed here)?
Try the following:
with cte as (
    SELECT A.brigade_id, A.user_id, A.is_commissar, B.name
    FROM brigade_user A
    LEFT JOIN brigades B ON A.brigade_id = B.id
)
select C1.brigade_id, C1.user_id as commissar_id, C2.user_id as assistant_id, C1.name
from cte C1
left join cte C2 on C1.brigade_id = C2.brigade_id and C1.user_id <> C2.user_id
where C1.is_commissar = true;
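As an editorial aside, not from the original answer: conditional aggregation is a common alternative that collapses the two rows per brigade without a self-join. A hedged sketch against the schema above (brigades with no assistant simply get a NULL assistant_id):
SELECT bu.brigade_id,
       MAX(CASE WHEN bu.is_commissar THEN bu.user_id END) AS commissar_id,
       MAX(CASE WHEN NOT bu.is_commissar THEN bu.user_id END) AS assistant_id,
       b.name
FROM brigade_user bu
LEFT JOIN brigades b ON b.id = bu.brigade_id
GROUP BY bu.brigade_id, b.name;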
Result of query as column value
I've got three tables:
Lessons:
CREATE TABLE lessons (
    id SERIAL PRIMARY KEY,
    title text NOT NULL,
    description text NOT NULL,
    vocab_count integer NOT NULL
);
+----+------------+------------------+-------------+
| id | title      | description      | vocab_count |
+----+------------+------------------+-------------+
|  1 | lesson_one | this is a lesson |           3 |
|  2 | lesson_two | another lesson   |           2 |
+----+------------+------------------+-------------+
Lesson_vocabulary:
CREATE TABLE lesson_vocabulary (
    lesson_id integer REFERENCES lessons(id),
    vocabulary_id integer REFERENCES vocabulary(id)
);
+-----------+---------------+
| lesson_id | vocabulary_id |
+-----------+---------------+
|         1 |             1 |
|         1 |             2 |
|         1 |             3 |
|         2 |             2 |
|         2 |             4 |
+-----------+---------------+
Vocabulary:
CREATE TABLE vocabulary (
    id integer PRIMARY KEY,
    hiragana text NOT NULL,
    reading text NOT NULL,
    meaning text[] NOT NULL
);
Each lesson contains multiple vocabulary items, and each item can be included in multiple lessons. How can I get the vocab_count column of the lessons table to be calculated and updated whenever I add more rows to the lesson_vocabulary table? Is this possible, and how would I go about doing this? Thanks
You can use SQL triggers to serve your purpose. This would be similar to a MySQL after-insert trigger which updates another table's column. The trigger would look somewhat like this. I am using Oracle SQL, but there would just be minor tweaks for any other implementation.
CREATE TRIGGER vocab_trigger
AFTER INSERT ON lesson_vocabulary
FOR EACH ROW
begin
  for lesson_cur in (select LESSON_ID, COUNT(VOCABULARY_ID) voc_cnt from LESSON_VOCABULARY group by LESSON_ID) LOOP
    update LESSONS set VOCAB_COUNT = LESSON_CUR.VOC_CNT where id = LESSON_CUR.LESSON_ID;
  end loop;
END;
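Editorial note: since the question's schema uses SERIAL and text[], which look like PostgreSQL, here is a hedged sketch of the same idea in plpgsql rather than Oracle syntax. The function name refresh_vocab_count is my own invention, and unlike the loop above it only recounts the one lesson touched by the insert:
CREATE OR REPLACE FUNCTION refresh_vocab_count() RETURNS trigger AS $$
BEGIN
    -- recompute the count for just the lesson affected by this insert
    UPDATE lessons l
    SET vocab_count = (SELECT count(*) FROM lesson_vocabulary lv WHERE lv.lesson_id = NEW.lesson_id)
    WHERE l.id = NEW.lesson_id;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER vocab_trigger
AFTER INSERT ON lesson_vocabulary
FOR EACH ROW EXECUTE PROCEDURE refresh_vocab_count();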
It's better to create a view that calculates that (and get rid of the column in the lessons table):
select l.*, lv.vocab_count
from lessons l
left join (
    select lesson_id, count(*)
    from lesson_vocabulary
    group by lesson_id
) as lv(lesson_id, vocab_count) on l.id = lv.lesson_id
If you really want to update the lessons table each time lesson_vocabulary changes, you can run an UPDATE statement like this in a trigger:
update lessons l
set vocab_count = t.cnt
from (
    select lesson_id, count(*) as cnt
    from lesson_vocabulary
    group by lesson_id
) t
where t.lesson_id = l.id;
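To make the view suggestion concrete, a minimal sketch wrapping that query in an actual view; lessons_with_count is a name of my own choosing, and COALESCE covers lessons that have no vocabulary rows yet:
CREATE VIEW lessons_with_count AS
select l.id, l.title, l.description, coalesce(lv.vocab_count, 0) as vocab_count
from lessons l
left join (
    select lesson_id, count(*)
    from lesson_vocabulary
    group by lesson_id
) as lv(lesson_id, vocab_count) on l.id = lv.lesson_id;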
I would recommend using a query for this information:
select l.*,
       (select count(*)
        from lesson_vocabulary lv
        where lv.lesson_id = l.id
       ) as vocabulary_cnt
from lessons l;
With an index on lesson_vocabulary(lesson_id), this should be quite fast. I recommend this over an update, because the data remains correct. I recommend this over a trigger, because it is simpler. I recommend this over a subquery with aggregation because it should be faster, particularly if you are filtering on the lessons table.
Merging Complicated Tables
I'm trying to merge tables where rows correspond to a many:1 relationship with "real" things.
I'm writing a blackjack simulator that stores game history in a database, with a new set of tables generated each run. The tables are really more like templates, since each game gets its own set of the 3 mutable tables (players, hands, and matches). Here's the layout, where suff is a user-specified suffix to use for the current run:
- cards
  - id INTEGER PRIMARY KEY
  - cardValue INTEGER NOT NULL
  - suit INTEGER NOT NULL
- players_suff
  - whichPlayer INTEGER PRIMARY KEY
  - aiType TEXT NOT NULL
- hands_suff
  - id BIGSERIAL PRIMARY KEY
  - whichPlayer INTEGER REFERENCES players_suff(whichPlayer) *
  - whichHand BIGINT NOT NULL
  - thisCard INTEGER REFERENCES cards(id)
- matches_suff
  - id BIGSERIAL PRIMARY KEY
  - whichGame INTEGER NOT NULL
  - dealersHand BIGINT NOT NULL
  - whichPlayer INTEGER REFERENCES players_suff(whichPlayer)
  - thisPlayersHand BIGINT NOT NULL **
  - playerResult INTEGER NOT NULL -- AKA who won
Only one cards table is created, because its values are constant. So after running the simulator twice you might have:
hands_firstrun, players_firstrun, matches_firstrun
hands_secondrun, players_secondrun, matches_secondrun
I want to be able to combine these tables if you used the same AI parameters for both of those runs (i.e. players_firstrun and players_secondrun are exactly the same). The problem is that the way I'm inserting hands makes this really messy: whichHand can't be a BIGSERIAL, because the relationship of hands_suff rows to "actual hands" is many:1. matches_suff is handled the same way, because a blackjack "game" actually consists of a set of games: the set of pairs of each player vs. the dealer. So for 3 players, you actually have 3 rows for each round.
Currently I select the largest whichHand in the table, add 1 to it, then insert all of the rows for one hand. I'm worried this "query-and-insert" will be really slow if I'm merging 2 tables that might both be arbitrarily huge.
When I'm merging tables, I feel like I should be able to (entirely in SQL) query the largest values in whichHand and whichGame once, then use them to combine the tables, incrementing them for each unique whichHand and whichGame in the table being merged. (I saw this question, but it doesn't handle using a generated ID in 2 different places.)
I'm using Postgres, and it's OK if the answer is specific to it.
* Sadly Postgres doesn't allow parameterized table names, so this had to be done by manual string substitution. Not the end of the world, since the program isn't web-facing and no one except me is likely to ever bother with it, but the SQL injection vulnerability does not make me happy.
** matches_suff(whichPlayersHand) was originally going to reference hands_suff(whichHand), but foreign keys must reference unique values. whichHand isn't unique, because a hand is made up of multiple rows, with each row "holding" one card. To query for a hand you select all of those rows with the same value in whichHand. I couldn't think of a more elegant way to do this without resorting to arrays.
EDIT: This is what I have now:
thomas=# \dt
            List of relations
 Schema |      Name      | Type  | Owner
--------+----------------+-------+--------
 public | cards          | table | thomas
 public | hands_first    | table | thomas
 public | hands_second   | table | thomas
 public | matches_first  | table | thomas
 public | matches_second | table | thomas
 public | players_first  | table | thomas
 public | players_second | table | thomas
(7 rows)
thomas=# SELECT * FROM hands_first
thomas-# \g
 id | whichplayer | whichhand | thiscard
----+-------------+-----------+----------
  1 |           0 |         0 |        6
  2 |           0 |         0 |       63
  3 |           0 |         0 |       41
  4 |           1 |         1 |       76
  5 |           1 |         1 |       23
  6 |           0 |         2 |       51
  7 |           0 |         2 |       29
  8 |           0 |         2 |        2
  9 |           0 |         2 |       92
 10 |           0 |         2 |        6
 11 |           1 |         3 |      101
 12 |           1 |         3 |        8
(12 rows)
thomas=# SELECT * FROM hands_second
thomas-# \g
 id | whichplayer | whichhand | thiscard
----+-------------+-----------+----------
  1 |           0 |         0 |       78
  2 |           0 |         0 |       38
  3 |           1 |         1 |       24
  4 |           1 |         1 |       18
  5 |           1 |         1 |       95
  6 |           1 |         1 |       40
  7 |           0 |         2 |       13
  8 |           0 |         2 |       84
  9 |           0 |         2 |       41
 10 |           1 |         3 |       29
 11 |           1 |         3 |       34
 12 |           1 |         3 |       56
 13 |           1 |         3 |       52
(13 rows)
thomas=# SELECT * FROM matches_first
thomas-# \g
 id | whichgame | dealershand | whichplayer | thisplayershand | playerresult
----+-----------+-------------+-------------+-----------------+--------------
  1 |         0 |           0 |           1 |               1 |            1
  2 |         1 |           2 |           1 |               3 |            2
(2 rows)
thomas=# SELECT * FROM matches_second
thomas-# \g
 id | whichgame | dealershand | whichplayer | thisplayershand | playerresult
----+-----------+-------------+-------------+-----------------+--------------
  1 |         0 |           0 |           1 |               1 |            0
  2 |         1 |           2 |           1 |               3 |            2
(2 rows)
I'd like to combine them to have:
hands_combined table:
 id | whichplayer | whichhand | thiscard
----+-------------+-----------+----------
  1 |           0 |         0 |        6   --Seven of Spades
  2 |           0 |         0 |       63   --Queen of Spades
  3 |           0 |         0 |       41   --Three of Clubs
  4 |           1 |         1 |       76
  5 |           1 |         1 |       23
  6 |           0 |         2 |       51
  7 |           0 |         2 |       29
  8 |           0 |         2 |        2
  9 |           0 |         2 |       92
 10 |           0 |         2 |        6
 11 |           1 |         3 |      101
 12 |           1 |         3 |        8
 13 |           0 |         4 |       78
 14 |           0 |         4 |       38
 15 |           1 |         5 |       24
 16 |           1 |         5 |       18
 17 |           1 |         5 |       95
 18 |           1 |         5 |       40
 19 |           0 |         6 |       13
 20 |           0 |         6 |       84
 21 |           0 |         6 |       41
 22 |           1 |         7 |       29
 23 |           1 |         7 |       34
 24 |           1 |         7 |       56
 25 |           1 |         7 |       52
matches_combined table:
 id | whichgame | dealershand | whichplayer | thisplayershand | playerresult
----+-----------+-------------+-------------+-----------------+--------------
  1 |         0 |           0 |           1 |               1 |            1
  2 |         1 |           2 |           1 |               3 |            2
  3 |         2 |           4 |           1 |               5 |            0
  4 |         3 |           6 |           1 |               7 |            2
Each value of thiscard represents a playing card in the range [1..104]: 52 playing cards, with an extra bit representing whether it's face up or face down. I didn't post the actual cards table for space reasons. So player 0 (aka the dealer) had a hand of (Seven of Spades, Queen of Spades, Three of Clubs) in the first game.
I think you're not using PostgreSQL the way it's intended to be used, plus your table design may not be suitable for what you want to achieve. Whilst it was difficult to understand what you want your solution to achieve, I wrote this, which seems to solve everything you want using a handful of tables only, plus functions that return recordsets to simulate your requirement for individual runs. I used enums and composite types to illustrate some of the features that you may wish to harness from the power of PostgreSQL. Also, I'm not sure what parameterized table names are (I have never seen anything like it in any RDBMS), but PostgreSQL does allow something perfectly suitable: recordset-returning functions.
CREATE TYPE card_value AS ENUM ('1', '2', '3', '4', '5', '6', '7', '8', '9', '10', 'J', 'Q', 'K');
CREATE TYPE card_suit AS ENUM ('Clubs', 'Diamonds', 'Hearts', 'Spades');
CREATE TYPE card AS (value card_value, suit card_suit, face_up bool);

CREATE TABLE runs (
  run_id bigserial NOT NULL PRIMARY KEY,
  run_date timestamptz NOT NULL DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE players (
  run_id bigint NOT NULL REFERENCES runs,
  player_no int NOT NULL, -- 0 can be assumed as always the dealer
  ai_type text NOT NULL,
  PRIMARY KEY (run_id, player_no)
);

CREATE TABLE matches (
  run_id bigint NOT NULL REFERENCES runs,
  match_no int NOT NULL,
  PRIMARY KEY (run_id, match_no)
);

CREATE TABLE hands (
  hand_id bigserial NOT NULL PRIMARY KEY,
  run_id bigint NOT NULL REFERENCES runs,
  match_no int NOT NULL,
  hand_no int NOT NULL,
  player_no int NOT NULL,
  UNIQUE (run_id, match_no, hand_no),
  FOREIGN KEY (run_id, match_no) REFERENCES matches,
  FOREIGN KEY (run_id, player_no) REFERENCES players
);

CREATE TABLE deals (
  deal_id bigserial NOT NULL PRIMARY KEY,
  hand_id bigint NOT NULL REFERENCES hands,
  card card NOT NULL
);

CREATE OR REPLACE FUNCTION players(int) RETURNS SETOF players AS $$
  SELECT * FROM players WHERE run_id = $1 ORDER BY player_no;
$$ LANGUAGE SQL;

CREATE OR REPLACE FUNCTION matches(int) RETURNS SETOF matches AS $$
  SELECT * FROM matches WHERE run_id = $1 ORDER BY match_no;
$$ LANGUAGE SQL;

CREATE OR REPLACE FUNCTION hands(int) RETURNS SETOF hands AS $$
  SELECT * FROM hands WHERE run_id = $1 ORDER BY match_no, hand_no;
$$ LANGUAGE SQL;

CREATE OR REPLACE FUNCTION hands(int, int) RETURNS SETOF hands AS $$
  SELECT * FROM hands WHERE run_id = $1 AND match_no = $2 ORDER BY hand_no;
$$ LANGUAGE SQL;

CREATE OR REPLACE FUNCTION winner_player (int, int) RETURNS int AS $$
  SELECT player_no FROM hands WHERE run_id = $1 AND match_no = $2 ORDER BY hand_no DESC LIMIT 1
$$ LANGUAGE SQL;

CREATE OR REPLACE FUNCTION next_player_no (int) RETURNS int AS $$
  SELECT CASE WHEN EXISTS (SELECT 1 FROM runs WHERE run_id = $1)
    THEN COALESCE((SELECT MAX(player_no) FROM players WHERE run_id = $1), 0) + 1 END
$$ LANGUAGE SQL;

CREATE OR REPLACE FUNCTION next_match_no (int) RETURNS int AS $$
  SELECT CASE WHEN EXISTS (SELECT 1 FROM runs WHERE run_id = $1)
    THEN COALESCE((SELECT MAX(match_no) FROM matches WHERE run_id = $1), 0) + 1 END
$$ LANGUAGE SQL;

CREATE OR REPLACE FUNCTION next_hand_no (int) RETURNS int AS $$
  SELECT CASE WHEN EXISTS (SELECT 1 FROM runs WHERE run_id = $1)
    THEN COALESCE((SELECT MAX(hand_no) + 1 FROM hands WHERE run_id = $1), 0) END
$$ LANGUAGE SQL;

CREATE OR REPLACE FUNCTION card_to_int (card) RETURNS int AS $$
  SELECT ((SELECT enumsortorder::int - 1 FROM pg_enum
           WHERE enumtypid = 'card_suit'::regtype AND enumlabel = ($1).suit::name) * 13
        + (SELECT enumsortorder::int - 1 FROM pg_enum
           WHERE enumtypid = 'card_value'::regtype AND enumlabel = ($1).value::name) + 1)
        * CASE WHEN ($1).face_up THEN 2 ELSE 1 END
$$ LANGUAGE SQL;
-- SELECT card_to_int(('3', 'Spades', false))

CREATE OR REPLACE FUNCTION int_to_card (int) RETURNS card AS $$
  SELECT ((SELECT enumlabel::card_value FROM pg_enum
           WHERE enumtypid = 'card_value'::regtype AND enumsortorder = ((($1 - 1) % 13) + 1)::real),
          (SELECT enumlabel::card_suit FROM pg_enum
           WHERE enumtypid = 'card_suit'::regtype AND enumsortorder = (((($1 - 1) / 13)::int % 4) + 1)::real),
          $1 > (13 * 4))::card
$$ LANGUAGE SQL;
-- SELECT i, int_to_card(i) FROM generate_series(1, 13*4*2) i

CREATE OR REPLACE FUNCTION deal_cards(int, int, int, int[]) RETURNS TABLE (player_no int, hand_no int, card card) AS $$
  WITH hand AS (
    INSERT INTO hands (run_id, match_no, player_no, hand_no)
    VALUES ($1, $2, $3, next_hand_no($1))
    RETURNING hand_id, player_no, hand_no),
  mydeals AS (
    INSERT INTO deals (hand_id, card)
    SELECT hand_id, int_to_card(card_id)::card AS card FROM hand, UNNEST($4) card_id
    RETURNING hand_id, deal_id, card
  )
  SELECT h.player_no, h.hand_no, d.card FROM hand h, mydeals d
$$ LANGUAGE SQL;

CREATE OR REPLACE FUNCTION deals(int) RETURNS TABLE (deal_id bigint, hand_no int, player_no int, card int) AS $$
  SELECT d.deal_id, h.hand_no, h.player_no, card_to_int(d.card)
  FROM hands h JOIN deals d ON (d.hand_id = h.hand_id)
  WHERE h.run_id = $1 ORDER BY d.deal_id;
$$ LANGUAGE SQL;

INSERT INTO runs DEFAULT VALUES; -- Add first run
INSERT INTO players VALUES (1, 0, 'Dealer'); -- dealer always zero
INSERT INTO players VALUES (1, next_player_no(1), 'Player 1');

INSERT INTO matches VALUES (1, next_match_no(1)); -- First match
SELECT * FROM deal_cards(1, 1, 0, ARRAY[6, 63, 41]);
SELECT * FROM deal_cards(1, 1, 1, ARRAY[76, 23]);
SELECT * FROM deal_cards(1, 1, 0, ARRAY[51, 29, 2, 92, 6]);
SELECT * FROM deal_cards(1, 1, 1, ARRAY[101, 8]);

INSERT INTO matches VALUES (1, next_match_no(1)); -- Second match
SELECT * FROM deal_cards(1, 2, 0, ARRAY[78, 38]);
SELECT * FROM deal_cards(1, 2, 1, ARRAY[24, 18, 95, 40]);
SELECT * FROM deal_cards(1, 2, 0, ARRAY[13, 84, 41]);
SELECT * FROM deal_cards(1, 2, 1, ARRAY[29, 34, 56, 52]);

SELECT * FROM deals(1); -- This is the output you need (hands_combined table)

-- This view can be used to retrieve the list of all winning hands
CREATE OR REPLACE VIEW winning_hands AS
  SELECT DISTINCT ON (run_id, match_no) * FROM hands ORDER BY run_id, match_no, hand_no DESC;
SELECT * FROM winning_hands;
Wouldn't using the UNION operator work?
For the hands relation:
SELECT * FROM hands_first
UNION ALL
SELECT * FROM hands_second
For the matches relation:
SELECT * FROM matches_first
UNION ALL
SELECT * FROM matches_second
As a more long-term solution I'd consider restructuring the DB, because it will quickly become unmanageable with this schema. Why not improve normalization by introducing a games table? In other words: games have many matches, matches have many players for each game, and players have many hands for each match.
I'd recommend drawing the UML for the entity relationships on paper (http://dawgsquad.googlecode.com/hg/docs/database_images/Database_Model_Diagram(Title).png), then improving the schema so it can be queried using normal SQL operators. Hope this helps.
EDIT: In that case you can use a subquery on the union of both tables, with the row_number() window function generating the row number:
SELECT row_number() OVER () AS id, whichplayer, whichhand, thiscard
FROM (
    SELECT * FROM hands_first
    UNION ALL
    SELECT * FROM hands_second
) AS combined;
The same principle would apply to the matches table. Obviously this doesn't scale well to even a small number of tables, so I would prioritize normalizing your schema.
Docs on some PG functions: http://www.postgresql.org/docs/current/interactive/functions-window.html
To build a new table with all rows of the two tables, do:
CREATE TABLE hands AS
select 1 as hand, id, whichplayer, whichhand, thiscard from hands_first
union all
select 2 as hand, id, whichplayer, whichhand, thiscard from hands_second;
After that, to insert data for a new match, create a sequence that starts at the current last value + 1:
CREATE SEQUENCE matche START 3;
Before each insert, read the next sequence value and use it in the inserts:
SELECT nextval('matche');
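A hedged usage sketch of my own (assuming a matches table merged the same way as hands above; the column values are placeholders): nextval reserves the next game number once, and currval reuses it for each per-player row of the same round:
SELECT nextval('matche');  -- reserves the next whichgame value, e.g. 3
INSERT INTO matches (hand, whichgame, dealershand, whichplayer, thisplayershand, playerresult)
VALUES (2, currval('matche'), 6, 1, 7, 2);
INSERT INTO matches (hand, whichgame, dealershand, whichplayer, thisplayershand, playerresult)
VALUES (2, currval('matche'), 6, 2, 8, 1);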
Your database structure is not great, and I know for sure that creating tables on the fly is not a scalable approach. There are performance drawbacks to creating physical tables instead of using an existing structure. I suggest you refactor your DB structure if you can. You can, however, use the UNION operator to merge your data.
SQL Query 2 tables null results
I was asked this question in an interview: from the 2 tables below, write a query to pull customers with no sales orders. How many ways are there to write this query, and which would have the best performance?
Table 1: Customer - CustomerID
Table 2: SalesOrder - OrderID, CustomerID, OrderDate
Query:
SELECT *
FROM Customer C
RIGHT OUTER JOIN SalesOrder SO ON C.CustomerID = SO.CustomerID
WHERE SO.OrderID = NULL
Is my query correct, and are there other ways to write the query and get the same results?
Answering for MySQL instead of SQL Server, because you tagged it later with SQL Server, so I thought that, since this was an interview question, it wouldn't bother you which DBMS this is for. Note though, that the queries I wrote are standard SQL; they should run in every RDBMS out there. How each RDBMS handles those queries is another issue, though.
I wrote this little procedure for you, to have a test case. It creates the tables customers and orders like you specified, and I added primary keys and foreign keys, like one would usually do. No other indexes, as every column worth indexing here is already a primary key. 250 customers are created, 100 of them made an order (though out of convenience none of them twice / multiple times). A dump of the data follows; I posted the script just in case you want to play around a little by increasing the numbers.
delimiter $$
create procedure fill_table()
begin
    create table customers(customerId int primary key) engine=innodb;
    set @x = 1;
    while (@x <= 250) do
        insert into customers values(@x);
        set @x := @x + 1;
    end while;
    create table orders(orderId int auto_increment primary key,
        customerId int,
        orderDate timestamp,
        foreign key fk_customer (customerId) references customers(customerId)
    ) engine=innodb;
    insert into orders(customerId, orderDate)
    select customerId, now() - interval customerId day
    from customers
    order by rand()
    limit 100;
end $$
delimiter ;
call fill_table();
For me, this resulted in this:
CREATE TABLE `customers` (
  `customerId` int(11) NOT NULL,
  PRIMARY KEY (`customerId`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
INSERT INTO `customers` VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12),(13),(14),(15),(16),(17),(18),(19),(20),(21),(22),(23),(24),(25),(26),(27),(28),(29),(30),(31),(32),(33),(34),(35),(36),(37),(38),(39),(40),(41),(42),(43),(44),(45),(46),(47),(48),(49),(50),(51),(52),(53),(54),(55),(56),(57),(58),(59),(60),(61),(62),(63),(64),(65),(66),(67),(68),(69),(70),(71),(72),(73),(74),(75),(76),(77),(78),(79),(80),(81),(82),(83),(84),(85),(86),(87),(88),(89),(90),(91),(92),(93),(94),(95),(96),(97),(98),(99),(100),(101),(102),(103),(104),(105),(106),(107),(108),(109),(110),(111),(112),(113),(114),(115),(116),(117),(118),(119),(120),(121),(122),(123),(124),(125),(126),(127),(128),(129),(130),(131),(132),(133),(134),(135),(136),(137),(138),(139),(140),(141),(142),(143),(144),(145),(146),(147),(148),(149),(150),(151),(152),(153),(154),(155),(156),(157),(158),(159),(160),(161),(162),(163),(164),(165),(166),(167),(168),(169),(170),(171),(172),(173),(174),(175),(176),(177),(178),(179),(180),(181),(182),(183),(184),(185),(186),(187),(188),(189),(190),(191),(192),(193),(194),(195),(196),(197),(198),(199),(200),(201),(202),(203),(204),(205),(206),(207),(208),(209),(210),(211),(212),(213),(214),(215),(216),(217),(218),(219),(220),(221),(222),(223),(224),(225),(226),(227),(228),(229),(230),(231),(232),(233),(234),(235),(236),(237),(238),(239),(240),(241),(242),(243),(244),(245),(246),(247),(248),(249),(250);
CREATE TABLE `orders` (
  `orderId` int(11) NOT NULL AUTO_INCREMENT,
  `customerId` int(11) DEFAULT NULL,
  `orderDate` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
  PRIMARY KEY (`orderId`),
  KEY `fk_customer` (`customerId`),
  CONSTRAINT `orders_ibfk_1` FOREIGN KEY (`customerId`) REFERENCES `customers` (`customerId`)
) ENGINE=InnoDB AUTO_INCREMENT=128 DEFAULT CHARSET=utf8;
INSERT INTO `orders` VALUES (1,247,'2013-06-24 19:50:07'),(2,217,'2013-07-24 19:50:07'),(3,8,'2014-02-18 20:50:07'),(4,40,'2014-01-17 20:50:07'),(5,52,'2014-01-05 20:50:07'),(6,80,'2013-12-08 20:50:07'),(7,169,'2013-09-10 19:50:07'),(8,135,'2013-10-14 19:50:07'),(9,115,'2013-11-03 20:50:07'),(10,225,'2013-07-16 19:50:07'),(11,112,'2013-11-06 20:50:07'),(12,243,'2013-06-28 19:50:07'),(13,158,'2013-09-21 19:50:07'),(14,24,'2014-02-02 20:50:07'),(15,214,'2013-07-27 19:50:07'),(16,25,'2014-02-01 20:50:07'),(17,245,'2013-06-26 19:50:07'),(18,182,'2013-08-28 19:50:07'),(19,166,'2013-09-13 19:50:07'),(20,69,'2013-12-19 20:50:07'),(21,85,'2013-12-03 20:50:07'),(22,44,'2014-01-13 20:50:07'),(23,103,'2013-11-15 20:50:07'),(24,19,'2014-02-07 20:50:07'),(25,33,'2014-01-24 20:50:07'),(26,102,'2013-11-16 20:50:07'),(27,41,'2014-01-16 20:50:07'),(28,94,'2013-11-24 20:50:07'),(29,43,'2014-01-14 20:50:07'),(30,150,'2013-09-29 19:50:07'),(31,218,'2013-07-23 19:50:07'),(32,131,'2013-10-18 19:50:07'),(33,77,'2013-12-11 20:50:07'),(34,2,'2014-02-24 20:50:07'),(35,45,'2014-01-12 20:50:07'),(36,230,'2013-07-11 19:50:07'),(37,101,'2013-11-17 20:50:07'),(38,31,'2014-01-26 20:50:07'),(39,56,'2014-01-01 20:50:07'),(40,176,'2013-09-03 19:50:07'),(41,223,'2013-07-18 19:50:07'),(42,145,'2013-10-04 19:50:07'),(43,26,'2014-01-31 20:50:07'),(44,62,'2013-12-26 20:50:07'),(45,195,'2013-08-15 19:50:07'),(46,153,'2013-09-26 19:50:07'),(47,179,'2013-08-31 19:50:07'),(48,104,'2013-11-14 20:50:07'),(49,7,'2014-02-19 20:50:07'),(50,209,'2013-08-01 19:50:07'),(51,86,'2013-12-02 20:50:07'),(52,110,'2013-11-08 20:50:07'),(53,204,'2013-08-06 19:50:07'),(54,187,'2013-08-23 19:50:07'),(55,114,'2013-11-04 20:50:07'),(56,38,'2014-01-19 20:50:07'),(57,236,'2013-07-05 19:50:07'),(58,79,'2013-12-09 20:50:07'),(59,96,'2013-11-22 20:50:07'),(60,37,'2014-01-20 20:50:07'),(61,207,'2013-08-03 19:50:07'),(62,22,'2014-02-04 20:50:07'),(63,120,'2013-10-29 20:50:07'),(64,200,'2013-08-10 19:50:07'),(65,51,'2014-01-06 20:50:07'),(66,181,'2013-08-29 19:50:07'),(67,4,'2014-02-22 20:50:07'),(68,123,'2013-10-26 19:50:07'),(69,108,'2013-11-10 20:50:07'),(70,55,'2014-01-02 20:50:07'),(71,76,'2013-12-12 20:50:07'),(72,6,'2014-02-20 20:50:07'),(73,18,'2014-02-08 20:50:07'),(74,211,'2013-07-30 19:50:07'),(75,53,'2014-01-04 20:50:07'),(76,216,'2013-07-25 19:50:07'),(77,32,'2014-01-25 20:50:07'),(78,74,'2013-12-14 20:50:07'),(79,138,'2013-10-11 19:50:07'),(80,197,'2013-08-13 19:50:07'),(81,221,'2013-07-20 19:50:07'),(82,118,'2013-10-31 20:50:07'),(83,61,'2013-12-27 20:50:07'),(84,28,'2014-01-29 20:50:07'),(85,16,'2014-02-10 20:50:07'),(86,39,'2014-01-18 20:50:07'),(87,3,'2014-02-23 20:50:07'),(88,46,'2014-01-11 20:50:07'),(89,189,'2013-08-21 19:50:07'),(90,59,'2013-12-29 20:50:07'),(91,249,'2013-06-22 19:50:07'),(92,127,'2013-10-22 19:50:07'),(93,47,'2014-01-10 20:50:07'),(94,178,'2013-09-01 19:50:07'),(95,141,'2013-10-08 19:50:07'),(96,188,'2013-08-22 19:50:07'),(97,220,'2013-07-21 19:50:07'),(98,15,'2014-02-11 20:50:07'),(99,175,'2013-09-04 19:50:07'),(100,206,'2013-08-04 19:50:07');
Okay, now to the queries. Three ways came to my mind; I omitted the right join that MDiesel did, because it's actually just another way of writing a left join. It was invented for lazy SQL developers who don't want to switch table names, but instead just rewrite one word.
Anyway, first query:
select c.*
from customers c
left join orders o on c.customerId = o.customerId
where o.customerId is null;
Results in an execution plan like this:
+----+-------------+-------+-------+---------------+-------------+---------+------------------+------+--------------------------+
| id | select_type | table | type  | possible_keys | key         | key_len | ref              | rows | Extra                    |
+----+-------------+-------+-------+---------------+-------------+---------+------------------+------+--------------------------+
|  1 | SIMPLE      | c     | index | NULL          | PRIMARY     | 4       | NULL             |  250 | Using index              |
|  1 | SIMPLE      | o     | ref   | fk_customer   | fk_customer | 5       | wtf.c.customerId |    1 | Using where; Using index |
+----+-------------+-------+-------+---------------+-------------+---------+------------------+------+--------------------------+
Second query:
select c.*
from customers c
where c.customerId not in (select distinct customerId from orders);
Results in an execution plan like this:
+----+--------------------+--------+----------------+---------------+-------------+---------+------+------+--------------------------+
| id | select_type        | table  | type           | possible_keys | key         | key_len | ref  | rows | Extra                    |
+----+--------------------+--------+----------------+---------------+-------------+---------+------+------+--------------------------+
|  1 | PRIMARY            | c      | index          | NULL          | PRIMARY     | 4       | NULL |  250 | Using where; Using index |
|  2 | DEPENDENT SUBQUERY | orders | index_subquery | fk_customer   | fk_customer | 5       | func |    2 | Using index              |
+----+--------------------+--------+----------------+---------------+-------------+---------+------+------+--------------------------+
Third query:
select c.*
from customers c
where not exists (select 1 from orders o where o.customerId = c.customerId);
Results in an execution plan like this:
+----+--------------------+-------+-------+---------------+-------------+---------+------------------+------+--------------------------+
| id | select_type        | table | type  | possible_keys | key         | key_len | ref              | rows | Extra                    |
+----+--------------------+-------+-------+---------------+-------------+---------+------------------+------+--------------------------+
|  1 | PRIMARY            | c     | index | NULL          | PRIMARY     | 4       | NULL             |  250 | Using where; Using index |
|  2 | DEPENDENT SUBQUERY | o     | ref   | fk_customer   | fk_customer | 5       | wtf.c.customerId |    1 | Using where; Using index |
+----+--------------------+-------+-------+---------------+-------------+---------+------------------+------+--------------------------+
We can see in all execution plans that the customers table is read as a whole, but from the index (the implicit one, as the only column is the primary key). This may change when you select other columns from the table that are not in an index.
The first one seems to be the best. For each row in customers, only one row in orders is read. The id column suggests that MySQL can do this in one step, as only indexes are involved.
The second query seems to be the worst (though all 3 queries shouldn't perform too badly). For each row in customers the subquery is executed (the select_type column tells this).
The third query is not much different in that it uses a dependent subquery, but it should perform better than the second query. Explaining the small differences would lead too far now.
If you're interested, here's the manual page that explains what each column and its values mean: EXPLAIN output. Finally: I'd say that the first query will perform best, but as always, in the end one has to measure, measure, and measure again.
I can think of two other ways to write this query:
SELECT C.*
FROM Customer C
LEFT OUTER JOIN SalesOrder SO ON C.CustomerID = SO.CustomerID
WHERE SO.CustomerID IS NULL
SELECT C.*
FROM Customer C
WHERE NOT C.CustomerID IN (SELECT CustomerID FROM SalesOrder)
The solutions involving outer joins will perform better than a solution using NOT IN.
Is it possible to update an "order" column from within a trigger in MySQL?
We have a table in our system that would benefit from a numeric column so we can easily grab the 1st, 2nd, 3rd records for a job. We could, of course, update this column from the application itself, but I was hoping to do it in the database.
The final method must handle cases where users insert data that belongs in the "middle" of the results, as they may receive information out of order. They may also edit or delete records, so there will be corresponding update and delete triggers.
The table:
CREATE TABLE `test` (
  `id` int(11) unsigned NOT NULL AUTO_INCREMENT,
  `seq` int(11) unsigned NOT NULL,
  `job_no` varchar(20) NOT NULL,
  `date` date NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=7 DEFAULT CHARSET=latin1
And some example data:
mysql> SELECT * FROM test ORDER BY job_no, seq;
+----+-----+--------+------------+
| id | seq | job_no | date       |
+----+-----+--------+------------+
|  5 |   1 | 123    | 2009-10-05 |
|  6 |   2 | 123    | 2009-10-01 |
|  4 |   1 | 123456 | 2009-11-02 |
|  3 |   2 | 123456 | 2009-11-10 |
|  2 |   3 | 123456 | 2009-11-19 |
+----+-----+--------+------------+
I was hoping to update the "seq" column from a trigger, but this isn't allowed by MySQL, with an error "Can't update table 'test' in stored function/trigger because it is already used by statement which invoked this stored function/trigger".
My test trigger is as follows:
CREATE TRIGGER `test_after_ins_tr` AFTER INSERT ON `test`
FOR EACH ROW
BEGIN
    SET @seq = 0;
    UPDATE `test` t SET t.`seq` = @seq := (SELECT @seq + 1)
    WHERE t.`job_no` = NEW.`job_no`
    ORDER BY t.`date`;
END;
Is there any way to achieve what I'm after, other than remembering to call a function after each update to this table?
What about this?
CREATE TRIGGER `test_after_ins_tr` BEFORE INSERT ON `test`
FOR EACH ROW
BEGIN
    SET @seq = (SELECT COALESCE(MAX(seq), 0) + 1 FROM test t WHERE t.job_no = NEW.job_no);
    SET NEW.seq = @seq;
END;
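A quick hedged test of that suggestion, assuming the test table from the question; a BEFORE INSERT trigger may modify NEW, which sidesteps the self-update restriction:
-- the caller can pass any placeholder for seq; the trigger overwrites it
INSERT INTO test (seq, job_no, `date`) VALUES (0, '123', '2009-10-20');
SELECT seq, job_no, `date` FROM test WHERE job_no = '123' ORDER BY seq;
-- expected: the new row gets seq = 3, i.e. MAX(seq) for job 123 plus one
Note that this only appends to the end of the sequence for a job; rows arriving out of date order would still need renumbering afterwards, as the question's "middle insert" requirement describes.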
From Sergi's comment above: http://dev.mysql.com/doc/refman/5.1/en/stored-program-restrictions.html - "Within a stored function or trigger, it is not permitted to modify a table that is already being used (for reading or writing) by the statement that invoked the function or trigger."