MariaDB not using all fields of composite index

MariaDB is not fully using my composite index. The fast select and the slow select below both return the same data, but EXPLAIN shows that the slow select uses only the ix_test_relation.entity_id part of the index and not the ix_test_relation.stamp part.
I tried many variants (INNER JOIN, WITH, FROM) but couldn't make MariaDB use both fields of the index together with a recursive query. I understand that I somehow need to tell MariaDB to materialize the recursive query.
Please help me optimize the slow select, which uses a recursive query, so that it runs at a speed similar to the fast select.
Some details about the task: I need to query user activity. One user activity record may relate to multiple entities, and entities are hierarchical. I need to query user activity for some parent entity and all of its children, for a specified stamp range. stamp is simplified from TIMESTAMP to BIGINT for demonstration purposes. There can be many (~1 million) entities, and each entity may relate to many (~1 million) user activity entries. The entity hierarchy is expected to be about 10 levels deep. I assume that the stamp range used reduces the number of matching user activity records to 10-100. I denormalized the schema and copied stamp from test_entry to test_relation so that I could include it in the test_relation index.
I use 10.4.11-MariaDB-1:10.4.11+maria~bionic.
I can upgrade or patch or whatever mariadb if needed, I have full control over building docker image.
Schema:
CREATE TABLE test_entity(
id BIGINT NOT NULL,
parent_id BIGINT NULL,
CONSTRAINT pk_test_entity PRIMARY KEY (id),
CONSTRAINT fk_test_entity_pid FOREIGN KEY (parent_id) REFERENCES test_entity(id)
);
CREATE TABLE test_entry(
id BIGINT NOT NULL,
name VARCHAR(100) NOT NULL,
stamp BIGINT NOT NULL,
CONSTRAINT pk_test_entry PRIMARY KEY (id)
);
CREATE TABLE test_relation(
entry_id BIGINT NOT NULL,
entity_id BIGINT NOT NULL,
stamp BIGINT NOT NULL,
CONSTRAINT pk_test_relation PRIMARY KEY (entry_id, entity_id),
CONSTRAINT fk_test_relation_erid FOREIGN KEY (entry_id) REFERENCES test_entry(id),
CONSTRAINT fk_test_relation_enid FOREIGN KEY (entity_id) REFERENCES test_entity(id)
);
CREATE INDEX ix_test_relation ON test_relation(entity_id, stamp);
CREATE SEQUENCE sq_test_entry;
Test data:
CREATE OR REPLACE PROCEDURE test_insert()
BEGIN
DECLARE v_entry_id BIGINT;
DECLARE v_parent_entity_id BIGINT;
DECLARE v_child_entity_id BIGINT;
FOR i IN 1..1000 DO
SET v_parent_entity_id = i * 2;
SET v_child_entity_id = i * 2 + 1;
INSERT INTO test_entity(id, parent_id)
VALUES(v_parent_entity_id, NULL);
INSERT INTO test_entity(id, parent_id)
VALUES(v_child_entity_id, v_parent_entity_id);
FOR j IN 1..1000000 DO
SELECT NEXT VALUE FOR sq_test_entry
INTO v_entry_id;
INSERT INTO test_entry(id, name, stamp)
VALUES(v_entry_id, CONCAT('entry ', v_entry_id), j);
INSERT INTO test_relation(entry_id, entity_id, stamp)
VALUES(v_entry_id, v_parent_entity_id, j);
INSERT INTO test_relation(entry_id, entity_id, stamp)
VALUES(v_entry_id, v_child_entity_id, j);
END FOR;
END FOR;
END;
CALL test_insert;
Slow select (> 100ms):
SELECT entry_id
FROM test_relation TR
WHERE TR.entity_id IN (
WITH RECURSIVE recursive_child AS (
SELECT id
FROM test_entity
WHERE id IN (2, 4)
UNION ALL
SELECT C.id
FROM test_entity C
INNER JOIN recursive_child P
ON P.id = C.parent_id
)
SELECT id
FROM recursive_child
)
AND TR.stamp BETWEEN 6 AND 8
Fast select (1-2ms):
SELECT entry_id
FROM test_relation TR
WHERE TR.entity_id IN (2,3,4,5)
AND TR.stamp BETWEEN 6 AND 8
UPDATE 1: I can demonstrate the problem with an even shorter example.
First, explicitly store the required entity_id records in a temporary table:
CREATE OR REPLACE TEMPORARY TABLE tbl
WITH RECURSIVE recursive_child AS (
SELECT id
FROM test_entity
WHERE id IN (2, 4)
UNION ALL
SELECT C.id
FROM test_entity C
INNER JOIN recursive_child P
ON P.id = C.parent_id
)
SELECT id
FROM recursive_child
Then run the select using the temporary table (below). The select is still slow, but the only difference from the fast query is that the IN clause now queries a table instead of inline constants.
SELECT entry_id
FROM test_relation TR
WHERE TR.entity_id IN (SELECT id FROM tbl)
AND TR.stamp BETWEEN 6 AND 8

For your queries (both of them) it looks to me like you should, as you mentioned, flip the column order on your compound index:
CREATE INDEX ix_test_relation ON test_relation(stamp, entity_id);
Why?
Your queries have a range filter, TR.stamp BETWEEN 6 AND 8, on that column. For a range filter to use an index range scan (whether on a TIMESTAMP or a BIGINT column), the column being filtered must come first in a multicolumn index.
You also want a sargable filter, that is, something like this:
TR.stamp >= CURDATE() - INTERVAL 7 DAY
AND TR.stamp < CURDATE()
in place of
DATE(TR.stamp) BETWEEN DATE(NOW() - INTERVAL 7 DAY) AND DATE(NOW())
That is, don't put a function on the column you're scanning in your WHERE clause.
With a structured query like your first one, the query planner turns it into several subqueries. You can see this with ANALYZE FORMAT=JSON. The planner may choose different indexes and/or different chunks of indexes for each of those subqueries.
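For example, a sketch against the temporary-table variant of the query (the exact JSON output shape varies by MariaDB version):
ANALYZE FORMAT=JSON
SELECT entry_id
FROM test_relation TR
WHERE TR.entity_id IN (SELECT id FROM tbl)
AND TR.stamp BETWEEN 6 AND 8;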
And, a word to the wise: don't get too wrapped around the axle trying to outguess the query planner built into the DBMS. It's an extraordinarily complex and highly wrought piece of software, created by decades of programming work by world-class experts in optimization. Our job as MariaDB / MySQL users is to find the right indexes.

The order of columns in a composite index matters. (O.Jones explains it nicely -- using SQL that has been removed from the Question?!)
I would rewrite
SELECT entry_id
FROM test_relation TR
WHERE TR.entity_id IN (SELECT id FROM tbl)
AND TR.stamp BETWEEN 6 AND 8
as
SELECT TR.entry_id
FROM tbl
JOIN test_relation TR ON tbl.id = TR.entity_id
WHERE TR.stamp BETWEEN 6 AND 8
or
SELECT entry_id
FROM test_relation TR
WHERE TR.stamp BETWEEN 6 AND 8
AND EXISTS ( SELECT 1 FROM tbl
WHERE tbl.id = TR.entity_id )
And have these in either case:
TR: INDEX(stamp, entity_id, entry_id) -- With `stamp` first
tbl: INDEX(id) -- maybe
Since tbl is a freshly built TEMPORARY TABLE, and it seems that only 3 rows need checking, it may not be worth adding INDEX(id).
Also needed:
test_entity: INDEX(parent_id, id)
Assuming that test_relation is a many:many mapping table, it is likely that you will also need (though not necessarily for the current query):
INDEX(entity_id, entry_id)
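Spelled out as DDL (the index names are my own invention):
CREATE INDEX ix_tr_stamp_entity_entry ON test_relation(stamp, entity_id, entry_id);
CREATE INDEX ix_te_parent_id ON test_entity(parent_id, id);
CREATE INDEX ix_tr_entity_entry ON test_relation(entity_id, entry_id);
CREATE INDEX ix_tbl_id ON tbl(id); -- maybe; see the note above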

Related

Is it possible that nested loop joins different data to same id in different loops

We have an interesting phenomenon with a SQL query on an Oracle database that we could not reproduce. The example below has been simplified. We believe it is not oversimplified, but it possibly is.
Main question: Given a nested loop where the inner (not driving) table has an analytic function whose result is ambiguous (multiple rows could be the first row of the ORDER BY), is it feasible that said analytic function returns different results for different outer loops?
Secondary question: If yes, how can we reproduce this behaviour?
If no, do you have any other ideas why this query would produce multiple rows for the same company?
Not the question: Should the assumption about what is wrong be correct, correcting the SQL would be easy: just make the ORDER BY in the analytic function unambiguous, e.g. by adding the id column as a second criterion (see the sketch below).
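A minimal sketch of that fix against the rau_address table defined below; only the added tie-breaker differs from the subquery in the full statement:
-- id as a second ORDER BY criterion makes the ranking deterministic
SELECT company_id, street,
row_number() over ( partition by company_id order BY prio asc, id asc ) as row_num
FROM rau_address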
Problem:
Company has an n:m relation to owner and a 1:n relation to address.
The SQL joins all tables while reading only a single address per company (making use of the analytic function row_number()), groups by company AND address, and accumulates the owner names.
We use the query for multiple purposes; the other purposes involve reading the “best” address, while the problematic one does not. We got multiple error reports with results like this:
Company A has owners N1, N2, N3.
The result was:
Company | Owner list
--------+-----------
A       | N1
A       | N2, N3
All reported cases involve companies with multiple “best” addresses, hence the theory that the subquery that should deliver a single address is somehow broken. But we could not reproduce the result.
Full Details:
(listagg() is the function used in the original; it works for smaller numbers but fails for bigger ones, so count(*) should be a suitable replacement for the demonstration)
--cleanup
DROP TABLE rau_companyowner;
DROP TABLE rau_owner;
DROP TABLE rau_address;
DROP TABLE rau_company;
--create structure
CREATE TABLE rau_company (
id NUMBER CONSTRAINT pk_rau_company PRIMARY KEY USING INDEX (CREATE UNIQUE INDEX idx_rau_company_p ON rau_company(id))
);
CREATE TABLE rau_owner (
id NUMBER CONSTRAINT pk_rau_owner PRIMARY KEY USING INDEX (CREATE UNIQUE INDEX idx_rau_owner_p ON rau_owner(id)),
name varchar2(1000)
);
CREATE TABLE rau_companyowner (
company_id NUMBER,
owner_id NUMBER,
CONSTRAINT pk_rau_companyowner PRIMARY KEY (company_id, owner_id) USING INDEX (CREATE UNIQUE INDEX idx_rau_companyowner_p ON rau_companyowner(company_id, owner_id)),
CONSTRAINT fk_companyowner_company FOREIGN KEY (company_id) REFERENCES rau_company(id),
CONSTRAINT fk_companyowner_owner FOREIGN KEY (owner_id) REFERENCES rau_owner(id)
);
CREATE TABLE rau_address (
id NUMBER CONSTRAINT pk_rau_address PRIMARY KEY USING INDEX (CREATE UNIQUE INDEX idx_rau_address_p ON rau_address(id)),
company_id NUMBER,
prio NUMBER NOT NULL,
street varchar2(1000),
CONSTRAINT fk_address_company FOREIGN KEY (company_id) REFERENCES rau_company(id)
);
--create testdata
DECLARE
TYPE t_address IS TABLE OF rau_address%rowtype INDEX BY pls_integer;
address t_address;
TYPE t_owner IS TABLE OF rau_owner%rowtype INDEX BY pls_integer;
owner t_owner;
TYPE t_companyowner IS TABLE OF rau_companyowner%rowtype INDEX BY pls_integer;
companyowner t_companyowner;
ii pls_integer;
company_id pls_integer := 1;
test_count PLS_INTEGER := 10000;
--test_count PLS_INTEGER := 50;
BEGIN
--rau_company
INSERT INTO rau_company VALUES (company_id);
--rau_owner,rau_companyowner
FOR ii IN 1 .. test_count
LOOP
owner(ii).id:=ii;
owner(ii).name:='N'||to_char(ii);
companyowner(ii).company_id:=company_id;
companyowner(ii).owner_id:=ii;
END LOOP;
forall ii IN owner.FIRST .. owner.LAST
INSERT INTO rau_owner VALUES (owner(ii).id, owner(ii).name);
forall ii IN companyowner.FIRST .. companyowner.LAST
INSERT INTO rau_companyowner VALUES (companyowner(ii).company_id, companyowner(ii).owner_id);
--rau_address
FOR ii IN 1 .. test_count
LOOP
address(ii).id:=ii;
address(ii).company_id:=company_id;
address(ii).prio:=1;
address(ii).street:='S'||to_char(ii);
END LOOP;
forall ii IN address.FIRST .. address.LAST
INSERT INTO rau_address VALUES (address(ii).id, address(ii).company_id, address(ii).prio, address(ii).street);
COMMIT;
END;
-- check testdata
SELECT 'rau_company' tab, COUNT(*) count FROM rau_company
UNION all
SELECT 'rau_owner', COUNT(*) FROM rau_owner
UNION all
SELECT 'rau_companyowner', COUNT(*) FROM rau_companyowner
UNION all
SELECT 'rau_address', COUNT(*) FROM rau_address;
-- the sql: NL with address as inner loop enforced
-- ‘order BY prio’ is ambiguous because all addresses have the same prio
-- => the single row in ad could be any row
SELECT /*+ leading(hh hhoo oo ad) use_hash(hhoo oo) USE_NL(hh ad) */
hh.id company,
ad.street,
-- LISTAGG(oo.name || ', ') within group (order by oo.name) owner_list,
count(oo.id) owner_count
FROM rau_company hh
LEFT JOIN rau_companyowner hhoo ON hh.id = hhoo.company_id
LEFT JOIN rau_owner oo ON hhoo.owner_id = oo.id
LEFT JOIN (
SELECT *
FROM (
SELECT company_id, street,
row_number() over ( partition by company_id order BY prio asc ) as row_num
FROM rau_address
)
WHERE row_num = 1
) ad ON hh.id = ad.company_id
GROUP BY hh.id,
ad.street;
Chris Saxon was so kind as to answer my question: https://asktom.oracle.com/pls/apex/f?p=100:11:::::P11_QUESTION_ID:9546263400346154452
In short: as long as the ORDER BY is ambiguous (non-deterministic), there is always a chance of different results, even within the same SQL statement.
To reproduce, add this to my test data:
ALTER TABLE rau_address PARALLEL 8;
and try the select from the question; it should deliver multiple rows.

Incremental DISTINCT / GROUP BY operation

I have a simple two-stage SQL query that operates on two tables A and B, where I use a sub-select to retrieve a number of IDs of table A that are stored as foreign keys in B, using a (possibly complex) query on table B (and possibly other joined tables). Then I want to simply return the first x IDs of A. I tried a query like this:
SELECT sq.id
FROM (
SELECT a_id AS id, created_at
FROM B
WHERE ...
ORDER BY created_at DESC
) sq
GROUP BY sq.id
ORDER BY max(sq.created_at) DESC
LIMIT 10;
which is quite slow as Postgres seems to perform the GROUP BY / DISTINCT operation on the whole result set before limiting it. If I LIMIT the sub-query (e.g. to 100), the performance is just fine (as I'd expect), but of course it's no longer guaranteed that there will be at least 10 distinct a_id values in the resulting rows of sq.
Similarly, the query
SELECT a_id AS id
FROM B
WHERE ...
GROUP BY id
ORDER BY max(created_at) DESC
LIMIT 10
is quite slow as Postgres seems to perform a sequential scan on B instead of using an (existing) index. If I remove the GROUP BY clause it uses the index just fine.
The data in table B is such that most rows contain different a_ids; hence, even without the GROUP BY, most of the returned IDs will be different. The goal I pursue with the grouping is to ensure that the result set always contains a given number of entries from A.
Is there a way to perform an "incremental DISTINCT / GROUP BY"? In my naive thinking it would suffice for Postgres to produce result rows and group them incrementally until it reaches the number specified by LIMIT, which in most cases should be nearly instantaneous as most a_id values are different. I tried various ways to query the data but so far I didn't find anything that works reliably.
The Postgres version is 9.6, the data schema as follows:
Table "public.a"
Column | Type | Modifiers
--------+-------------------+------------------------------------------------
id | bigint | not null default nextval('a_id_seq'::regclass)
bar | character varying |
Indexes:
"a_pkey" PRIMARY KEY, btree (id)
"ix_a_bar" btree (bar)
Referenced by:
TABLE "b" CONSTRAINT "b_a_id_fkey" FOREIGN KEY (a_id) REFERENCES a(id)
Table "public.b"
Column | Type | Modifiers
------------+-----------------------------+--------------------------------------------------
id | bigint | not null default nextval('b_id_seq'::regclass)
foo | character varying |
a_id | bigint | not null
created_at | timestamp without time zone |
Indexes:
"b_pkey" PRIMARY KEY, btree (id)
"ix_b_created_at" btree (created_at)
"ix_b_foo" btree (foo)
Foreign-key constraints:
"b_a_id_fkey" FOREIGN KEY (a_id) REFERENCES a(id)
This problem is much more complex than it might seem at first glance.
If your criteria are not very selective (many more than 10 distinct a_id qualify) and you don't have many duplicate a_id values in table B (as you stated), then there is a very fast way.
To simplify a bit, I assume created_at is also defined NOT NULL; otherwise you need to do more.
WITH RECURSIVE top10 AS (
( -- extra parentheses required
SELECT a_id, ARRAY[a_id] AS id_arr, created_at
FROM b
WHERE ... -- your other filter conditions here
ORDER BY created_at DESC, a_id DESC -- both NOT NULL
LIMIT 1
)
UNION ALL -- UNION ALL, not UNION, since we exclude dupes a priori
(
SELECT b.a_id, id_arr || b.a_id, b.created_at
FROM top10 t
JOIN b ON (b.created_at, b.a_id)
< (t.created_at, t.a_id) -- comparing ROW values
AND b.a_id <> ALL (t.id_arr)
WHERE ... -- repeat conditions
ORDER BY created_at DESC, a_id DESC
LIMIT 1
)
)
SELECT a_id
FROM top10
LIMIT 10;
Ideally supported by an index on (created_at DESC, a_id DESC) (or just (created_at, a_id)).
Depending on your other WHERE conditions, other (partial?) indexes may serve even better.
This is particularly efficient for a small result set. Else, and depending on various other details, other solutions may be faster.
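Spelled out as DDL (the index name is my own invention):
CREATE INDEX b_created_at_a_id_idx ON b (created_at DESC, a_id DESC);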
Related (with much more explanation):
Can spatial index help a “range - order by - limit” query
Optimize GROUP BY query to retrieve latest record per user
SQL syntax term for 'WHERE (col1, col2) < (val1, val2)'
Best way to select random rows PostgreSQL
PostgreSQL sort by datetime asc, null first?
Select first row in each GROUP BY group?
The only way the planner has a chance to avoid sorting the whole table is if you have an index on the complete ORDER BY clause.
Then an index scan can be chosen to get the correct ordering, and the first ten result rows may be found quickly.
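For illustration, assuming the schema from the question: without the GROUP BY, the existing ix_b_created_at index already matches the ORDER BY, so the planner can walk the index backwards and stop after ten rows instead of sorting:
EXPLAIN
SELECT a_id, created_at
FROM b
ORDER BY created_at DESC
LIMIT 10;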

SQL Queries instead of Cursors

I'm creating a database for a hypothetical video rental store.
All I need to do is a procedure that checks the availability of a specific movie (obviously the movie can have several copies). So I have to check whether there is a copy available for rent, and get the number of that copy (because it will affect another trigger later).
I already did everything with cursors and it actually works very well, but I need (i.e. "must") to do it without using cursors, with just "pure SQL" (i.e. queries).
I'll briefly explain the schema of my DB:
This procedure uses three tables: 'Copia Film' (Movie Copy), 'Include' (Includes), and 'Noleggio' (Rent).
The Copia Film table has these attributes:
idCopia
Genere (FK references to Film)
Titolo (FK references to Film)
dataUscita (FK references to Film)
Include Table:
idNoleggio (FK references to Noleggio. Means idRent)
idCopia (FK references to Copia film. Means idCopy)
Noleggio Table:
idNoleggio (PK)
dataNoleggio (dateOfRent)
dataRestituzione (dateReturn)
dataRestituito (dateReturned)
CF (FK to Person)
Prezzo (price)
Every movie can have more than one copy.
Every copy can be available in two cases:
The copy ID is not present in the Include table (meaning that the specific copy has never been rented)
The copy ID is present in the Include table and dataRestituito (dateReturned) is not null (meaning that the specific copy has been rented but has already been returned)
Both cases reduce to the same predicate: the copy has no open rental.
The query I've tried is the following, and it is not working at all:
SELECT COUNT(*)
FROM NOLEGGIO
WHERE dataNoleggio IS NOT NULL AND dataRestituito IS NOT NULL AND idNoleggio IN (
SELECT N.idNoleggio
FROM NOLEGGIO N JOIN INCLUDE I ON N.idNoleggio=I.idNoleggio
WHERE idCopia IN (
SELECT idCopia
FROM COPIA_FILM
WHERE titolo='Pulp Fiction')) -- Of course the title is just an example
Well, from the query above I can't tell whether a copy of the selected movie is available or not, AND I can't get the copy ID if a copy were available.
(If you want, I can paste the cursors lines that work properly)
------ USING THE 'WITH' SOLUTION ------
I modified your code a little bit, to this:
WITH film
as
(
SELECT idCopia,titolo
FROM COPIA_FILM
WHERE titolo = 'Pulp Fiction'
),
copy_info as
(
SELECT N.idNoleggio, N.dataNoleggio, N.dataRestituito, I.idCopia
FROM NOLEGGIO N JOIN INCLUDE I ON N.idNoleggio = I.idNoleggio
),
avl as
(
SELECT film.titolo, copy_info.idNoleggio, copy_info.dataNoleggio,
copy_info.dataRestituito, film.idCopia
FROM film LEFT OUTER JOIN copy_info
ON film.idCopia = copy_info.idCopia
)
SELECT COUNT(*),idCopia FROM avl
WHERE (dataRestituito IS NOT NULL OR idNoleggio IS NULL)
GROUP BY idCopia
As I said in the comment, this code works properly if I use it just as a query, but once I try to make a procedure out of it, I get errors.
The problem is the final SELECT:
SELECT COUNT(*), idCopia INTO CNT,COPYFILM
FROM avl
WHERE (dataRestituito IS NOT NULL OR idNoleggio IS NULL)
GROUP BY idCopia
The error is:
ORA-01422: exact fetch returns more than requested number of rows
ORA-06512: at "VIDEO.PR_AVAILABILITY", line 9.
So it seems the INTO clause is wrong, because the query obviously returns more than one row. What can I do? I need to get the copy ID (even just the first one in the list of rows) without using cursors.
You can try this -
WITH film
as
(
SELECT idCopia, titolo
FROM COPIA_FILM
WHERE titolo='Pulp Fiction'
),
copy_info as
(
select N.idNoleggio, N.dataNoleggio, N.dataRestituito, I.idCopia
FROM NOLEGGIO N JOIN INCLUDE I ON N.idNoleggio=I.idNoleggio
),
avl as
(
select film.titolo, copy_info.idNoleggio, copy_info.dataNoleggio,
copy_info.dataRestituito
from film LEFT OUTER JOIN copy_info
ON film.idCopia = copy_info.idCopia
)
select * from avl
where (dataRestituito IS NOT NULL OR idNoleggio IS NULL);
You should think in terms of sets, rather than records.
If you find the set of all the films that are out, you can exclude them from your stock, and the rest is rentable.
select copiafilm.* from #f copiafilm
left join
(
select idCopia from #r Noleggio
inner join #i include on Noleggio.idNoleggio = include.idNoleggio
where dataRestituito is null
) out
on copiafilm.idCopia = out.idCopia
where out.idCopia is null
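Translated to the actual table names from the question, the same idea might look like this (a sketch, untested):
SELECT CF.idCopia
FROM COPIA_FILM CF
WHERE CF.titolo = 'Pulp Fiction'
AND NOT EXISTS (
SELECT 1
FROM INCLUDE I
JOIN NOLEGGIO N ON N.idNoleggio = I.idNoleggio
WHERE I.idCopia = CF.idCopia
AND N.dataRestituito IS NULL -- an open, not-yet-returned rental
);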
I solved the problem by editing the last query into this one:
SELECT COUNT(*),idCopia INTO CNT,idCopiaFilm
FROM avl
WHERE (dataRestituito IS NOT NULL OR idNoleggio IS NULL) AND rownum = 1
GROUP BY idCopia;
IF CNT > 0 THEN
-- FOUND AVAILABLE COPY
END IF;
EXCEPTION
WHEN NO_DATA_FOUND THEN
-- NOT FOUND AVAILABLE COPY
Thank you @Aditya Kakirde! Your suggestion almost solved the problem.

In SQL, best way to join first and last instance of child table without NOT EXISTS?

In PostgreSQL, I have an issue table and a child issue_step table; an issue contains one or more steps.
The view issue_v pulls things from the issue and from its first and last steps: author and from_ts are pulled from the first step, while status and thru_ts are pulled from the last step.
The tables:
create table if not exists seeplai.issue(
isu_id serial primary key,
subject varchar(240)
);
create table if not exists seeplai.issue_step(
stp_id serial primary key,
isu_id int not null references seeplai.issue on delete cascade,
status varchar(12) default 'open',
stp_ts timestamp(0) default current_timestamp,
author varchar(40),
notes text
);
The view:
create view seeplai.issue_v as
select isu.*,
first.stp_ts as from_ts,
first.author as author,
first.notes as notes,
last.stp_ts as thru_ts,
last.status as status
from seeplai.issue isu
join seeplai.issue_step first on( first.isu_id = isu.isu_id and not exists(
select 1 from seeplai.issue_step where isu_id=isu.isu_id and stp_id>first.stp_id ) )
join seeplai.issue_step last on( last.isu_id = isu.isu_id and not exists(
select 1 from seeplai.issue_step where isu_id=isu.isu_id and stp_id<last.stp_id ) );
Note 1: issue_step.stp_id is guaranteed to be chronologically sequential, so I use it instead of stp_ts because it's already indexed.
This works, but it's ugly as sin, and it cannot be the most efficient query in the world.
In this code, I use a sub-query to find the first and last step IDs, and then join to the two instances of the step table by using those found values.
SELECT ISU.*
,S1.STP_TS AS FROM_TS
,S1.AUTHOR AS AUTHOR
,S1.NOTES AS NOTES
,S2.STP_TS AS THRU_TS
,S2.STATUS AS STATUS
FROM SEEPLAI.ISSUE ISU
INNER JOIN
(
SELECT ISU_ID
,MIN(STP_ID) AS MIN_ID
,MAX(STP_ID) AS MAX_ID
FROM SEEPLAI.ISSUE_STEP
GROUP BY
ISU_ID
) SQ
ON SQ.ISU_ID = ISU.ISU_ID
INNER JOIN
SEEPLAI.ISSUE_STEP S1
ON S1.STP_ID = SQ.MIN_ID
INNER JOIN
SEEPLAI.ISSUE_STEP S2
ON S2.STP_ID = SQ.MAX_ID
Note: you really shouldn't be using SELECT * in a view. It is much better practice to explicitly list all the fields that you need in the view.
Have you considered using window functions?
http://www.postgresql.org/docs/9.2/static/tutorial-window.html
http://www.postgresql.org/docs/9.2/static/functions-window.html
A starting point:
select steps.*,
first_value(steps.stp_id) over w as first_id,
last_value(steps.stp_id) over w as last_id
from issue_step steps
-- the explicit frame is needed: with the default frame, last_value()
-- only sees rows up to the current row and would return stp_id itself
window w as (partition by steps.isu_id order by steps.stp_id
rows between unbounded preceding and unbounded following)
By the way, if you know the IDs in advance, you'll be much better off getting the details in a separate query. (Trying to fetch everything in one go will just yield lousy plans due to subqueries or joins on aggregates, which results in inefficiently considering/joining the entire tables.)
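Building on that starting point, a sketch of the whole view body (my own rewrite, untested against the original data; every row in a partition carries identical window values, so DISTINCT ON can collapse the result to one row per issue):
select distinct on (isu.isu_id)
isu.*,
first_value(s.stp_ts) over w as from_ts,
first_value(s.author) over w as author,
first_value(s.notes) over w as notes,
last_value(s.stp_ts) over w as thru_ts,
last_value(s.status) over w as status
from seeplai.issue isu
join seeplai.issue_step s on s.isu_id = isu.isu_id
-- explicit frame so last_value sees the whole partition
window w as (partition by s.isu_id order by s.stp_id
rows between unbounded preceding and unbounded following)
order by isu.isu_id;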

Rewriting mysql select to reduce time and writing tmp to disk

I have a MySQL query that's taking several minutes, which isn't very good as it's used to build a web page.
Three tables are used: poster_data contains information on individual posters; poster_categories lists all the categories (movies, art, etc.); and poster_prodcat lists each poster ID and the categories it can be in, e.g. one poster would have multiple rows for, say, movies, Indiana Jones, Harrison Ford, adventure films, etc.
This is the slow query:
select *
from poster_prodcat,
poster_data,
poster_categories
where poster_data.apnumber = poster_prodcat.apnumber
and poster_categories.apcatnum = poster_prodcat.apcatnum
and poster_prodcat.apcatnum='623'
ORDER BY aptitle ASC
LIMIT 0, 32
According to the EXPLAIN output (posted as an image, not reproduced here), it was taking a few minutes. poster_data has just over 800,000 rows, while poster_prodcat has just over 17 million. Other category queries with this select are barely noticeable, while poster_prodcat.apcatnum='623' has about 400,000 results and writes its temporary table out to disk.
Hope you find this helpful - http://pastie.org/1105206
drop table if exists poster;
create table poster
(
poster_id int unsigned not null auto_increment primary key,
name varchar(255) not null unique
)
engine = innodb;
drop table if exists category;
create table category
(
cat_id mediumint unsigned not null auto_increment primary key,
name varchar(255) not null unique
)
engine = innodb;
drop table if exists poster_category;
create table poster_category
(
cat_id mediumint unsigned not null,
poster_id int unsigned not null,
primary key (cat_id, poster_id) -- note the clustered composite index !!
)
engine = innodb;
-- FYI http://dev.mysql.com/doc/refman/5.0/en/innodb-index-types.html
select count(*) from category
count(*)
========
500,000
select count(*) from poster
count(*)
========
1,000,000
select count(*) from poster_category
count(*)
========
125,675,688
select count(*) from poster_category where cat_id = 623
count(*)
========
342,820
explain
select
p.*,
c.*
from
poster_category pc
inner join category c on pc.cat_id = c.cat_id
inner join poster p on pc.poster_id = p.poster_id
where
pc.cat_id = 623
order by
p.name
limit 32;
id select_type table type possible_keys key key_len ref rows
== =========== ===== ==== ============= === ======= === ====
1 SIMPLE c const PRIMARY PRIMARY 3 const 1
1 SIMPLE p index PRIMARY name 257 null 32
1 SIMPLE pc eq_ref PRIMARY PRIMARY 7 const,foo_db.p.poster_id 1
select
p.*,
c.*
from
poster_category pc
inner join category c on pc.cat_id = c.cat_id
inner join poster p on pc.poster_id = p.poster_id
where
pc.cat_id = 623
order by
p.name
limit 32;
Statement:21/08/2010
0:00:00.021: Query OK
Is the query you listed how the final query will look? (So they all have apcatnum = /ID/?)
where poster_data.apnumber=poster_prodcat.apnumber and poster_categories.apcatnum=poster_prodcat.apcatnum and poster_prodcat.apcatnum='623'
poster_prodcat.apcatnum='623'
will vastly decrease the data set MySQL has to work on, so it should be the first part of the query to be evaluated. Then reorder the remaining WHERE comparisons so that those that shrink the data set the most are evaluated first.
You may also want to try subqueries. I'm not sure that will help, but MySQL probably won't fetch all 3 tables first; it will run the subquery first and then the outer one. This should reduce memory consumption while querying.
This is not an option, though, if you really want to select all columns (as you're using a * there).
You need to have an index on apnumber in POSTER_DATA. Scanning 841,152 records is killing the performance.
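For example (the index name is my own invention):
create index poster_data_apnumber_idx on poster_data(apnumber);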
Looks like the query is using the aptitle index to get the ordering, but it is doing a full scan to filter the results. I think it might help to have a composite index across both aptitle and apnumber on poster_data. MySQL might then be able to use it to do both the sort order and the filter.
create index data_title_anum_idx on poster_data(aptitle,apnumber);