I'd like to rewrite the following subquery, as it's used over and over again in a larger query. The DBMS is Postgres, and the table has the following structure: table (id uuid, seq int, value int).
Given a value for id (id_value), the query finds all records in "table" whose seq is less than the seq of the row with id = id_value.
My naive (slow) solution so far is the following:
select * from table
where seq < (select seq from table where id = id_value)
table
id, seq, value
a, 1, 12
b, 2, 22
c, 3, 32
x, 4, 43
d, 5, 54
s, 6, 32
a, 7, 54
e.g. a query
select * from table where seq < (select seq from table where id = 'x')
returns
a, 1, 12
b, 2, 22
c, 3, 32
For testing purposes, I've tried hardcoding the relevant seq value, and it speeds up the whole query significantly, but I really don't like querying for seq in a separate, two-stage process. Ideally this would happen as part of the query. Any ideas or inspiration would be appreciated.
CREATE TABLE foo
(
seq integer NOT NULL,
id uuid NOT NULL,
CONSTRAINT foo_pkey PRIMARY KEY (id),
CONSTRAINT foo_id_key UNIQUE (id),
CONSTRAINT foo_seq_key UNIQUE (seq)
);
CREATE UNIQUE INDEX idx_foo_id
ON public.foo USING btree
(id)
TABLESPACE pg_default;
CREATE UNIQUE INDEX idx_foo_seq
ON public.foo USING btree
(seq)
TABLESPACE pg_default;
You may have so many redundant indexes that you are confusing Postgres. Simply defining a column as primary key or unique already creates an index; you don't need the additional index declarations.
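Given the DDL above, that means keeping only the primary key and the unique constraint on seq; the rest can go. A sketch using the object names from your definition:
drop index idx_foo_id;   -- duplicates the index behind foo_pkey
drop index idx_foo_seq;  -- duplicates the index behind foo_seq_key
alter table foo drop constraint foo_id_key;  -- duplicates the primary key on id
-- foo_pkey (id) and foo_seq_key (seq) remain, which is all you need.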
For what you want to do, this should be optimal:
select f.*
from foo f
where f.seq < (select f2.seq from foo f2 where f2.id = :id_value)
This should use the index to fetch the seq value in the subquery. Then it should return the appropriate rows.
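If you want to verify that, EXPLAIN should show the subquery as an InitPlan evaluated once through the index on id. A sketch; replace :id_value with your actual uuid literal:
explain (analyze, buffers)
select f.*
from foo f
where f.seq < (select f2.seq from foo f2 where f2.id = :id_value);
-- expect an Index Scan on the primary key inside the InitPlan, and an
-- index or bitmap scan on the seq index for the outer filter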
You could also try:
select f.*
from (select f.*, min(seq) filter (where id = :id_value) over () as min_seq
from foo f
) f
where seq < min_seq;
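If you'd rather avoid the scalar subquery altogether, an equivalent self-join is worth comparing; a sketch with the same :id_value placeholder:
select f.*
from foo f
join foo anchor on anchor.id = :id_value
where f.seq < anchor.seq;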
However, my suspicion is simply that the query is returning a large number of rows and that is affecting performance.
I'm trying to generate several thousand rows of random test data for some synthetic load testing, but I've run into a weird bug I don't understand.
Here's the minimal reproducible example I managed to narrow it down to. Let's create a table that has some unique rows:
CREATE TABLE vals (
value INT PRIMARY KEY
);
INSERT INTO vals SELECT generate_series(1, 10);
Let's check there are unique values:
SELECT array(SELECT * FROM vals);
>> {1,2,3,4,5,6,7,8,9,10}
Yep, that's good. Now let's create a table that has lots of user data and references the vals table:
CREATE TABLE tmp (
a INT REFERENCES vals,
b INT[]
);
And fill it with lots of random data:
WITH test_count AS (SELECT generate_series(1, 10000))
-- some more CTEs, so I cannot give up on them
INSERT
INTO tmp
SELECT
(SELECT value FROM vals ORDER BY random() LIMIT 1),
array(SELECT value FROM vals WHERE random() > 0.85)
FROM test_count;
But when we check it, there are 10000 rows filled with the same values:
SELECT DISTINCT a, b FROM tmp;
>> a | b
---------
2 | {8,5}
I've found that Postgres sometimes optimizes random() calls in the same row to a single value, e.g. SELECT random(), random() would return 0.345, 0.345.
But in this case the values within a row differ; it's the values across rows that are all the same.
What is the way to fix it?
The problem is premature optimization: Postgres evaluates the uncorrelated subqueries just once and reuses the results for every row. Although there are other ways to phrase the query, adding a (nonsensical) correlation clause causes the subqueries to be re-run for each row:
WITH test_count AS (
SELECT generate_series(1, 10000) as id
)
INSERT INTO tmp
SELECT (SELECT value FROM vals WHERE tc.id is not null ORDER BY random() LIMIT 1),
array(SELECT value FROM vals WHERE tc.id is not null AND random() > 0.85) -- b is INT[]; the correlation keeps this per-row
FROM test_count tc;
Here is a db<>fiddle.
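An alternative phrasing of the same trick is a LATERAL join; note it is still the (nonsensical) correlation on tc.id that keeps the subqueries from being evaluated just once. A sketch against the tables above, not benchmarked:
WITH test_count AS (SELECT generate_series(1, 10000) AS id)
INSERT INTO tmp
SELECT pick.value,
       array(SELECT value FROM vals WHERE tc.id IS NOT NULL AND random() > 0.85)
FROM test_count tc
CROSS JOIN LATERAL (
    SELECT value
    FROM vals
    WHERE tc.id IS NOT NULL  -- correlation: forces one evaluation per row of tc
    ORDER BY random()
    LIMIT 1
) AS pick;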
There are two queries below. The first returns the count of the ID column, excluding NULL values;
the second returns the count of all rows in the table, including rows with NULLs.
select COUNT(ID) from TableName
select COUNT(*) from TableName
My confusion: is there any performance difference?
TL;DR: The plans might not be the same; you should test on appropriate data, make sure you have the correct indexes, and then choose the best solution based on your investigation.
The query plans might not be the same depending on the indexing and the nullability of the column which is used in the COUNT function.
In the following example I create a table and fill it with one million rows.
All the columns have been indexed except column 'b'.
The conclusion is that some of these queries do result in the same execution plan but most of them are different.
This was tested on SQL Server 2014; I do not have access to a 2012 instance at the moment. You should test this yourself to figure out the best solution.
create table t1(id bigint identity,
dt datetime2(7) not null default(sysdatetime()),
a char(800) null,
b char(800) null,
c char(800) null);
-- We will use these 4 indexes. Only column 'b' does not have any supporting index on it.
alter table t1 add constraint [pk_t1] primary key NONCLUSTERED (id);
create clustered index cix_dt on t1(dt);
create nonclustered index ix_a on t1(a);
create nonclustered index ix_c on t1(c);
insert into T1 (a, b, c)
select top 1000000
a = case when low = 1 then null else left(REPLICATE(newid(), low), 800) end,
b = case when low between 1 and 10 then null else left(REPLICATE(newid(), 800-low), 800) end,
c = case when low between 1 and 192 then null else left(REPLICATE(newid(), 800-low), 800) end
from master..spt_values
cross join (select 1 from master..spt_values) m(ock)
where type = 'p';
checkpoint;
-- All rows, no matter if any columns are null or not
-- Uses primary key index
select count(*) from t1;
-- All not null,
-- Uses primary key index
select count(id) from t1;
-- Some values of 'a' are null
-- Uses the index on 'a'
select count(a) from t1;
-- Some values of b are null
-- Uses the clustered index
select count(b) from t1;
-- No values of dt are null and the table have a clustered index on 'dt'
-- Uses primary key index and not the clustered index as one could expect.
select count(dt) from t1;
-- Most values of c are null
-- Uses the index on c
select count(c) from t1;
Now what would happen if we were more explicit about what we want our count to do? If we tell the query planner that we only want rows where the column is not null, will that change anything?
-- Homework!
-- What happens if we explicitly count only rows where the column is not null? What if we add a filtered index to support this query?
-- Hint: It will once again be different than the other queries.
create index ix_c2 on t1(c) where c is not null;
select count(*) from t1 where c is not null;
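If you want hard numbers to go with the plan shapes, logical reads are a quick way to compare; a sketch, run the statements one at a time and compare the STATISTICS IO output:
set statistics io, time on;
select count(*) from t1;   -- all rows
select count(id) from t1;  -- rows where id is not null (id is never null here)
select count(b) from t1;   -- no index on b, so expect the clustered index
set statistics io, time off;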
I have two selects on the same view. One select is filtered on the primary key, the other on a non-unique index. The underlying view is complicated. The select with the primary key takes approximately 15 seconds; the select with the non-unique index takes 0.5 seconds.
Why is the query using the primary key so slow?
I use "EXPLAIN PLAN FOR" to create an execution plan for both.
(The execution plans for the fast and the slow select were attached as screenshots.)
--Pseudocode
create table TableA
(
ID number, --(Primary Key)
ProjectID number, --(Not unique index)
TableB_id number, --(Foreign Key to Table TableB)
TableC_id number, --(Foreign Key to Table TableC)
TableD_id number --(Foreign Key to Table TableD)
);
Create view viewX
as
Select
ID as TableB_ID,
0 as TableC_ID,
0 as TableD_ID,
Value1,
Value2
from TableB
union all
Select
0 as TableB_ID,
ID as TableC_ID,
0 as TableD_ID,
Value1,
value2
from TableC
union all
Select
0 as TableB_ID,
0 as TableC_ID,
id as TableD_ID,
value1,
value2
from viewz;
Create view viewA
as
Select
t.id,
t.ProjectID,
x.TableB_ID,
x.TableC_ID,
x.TableD_ID
from TableA t
inner join viewX x
on t.TableB_ID = x.TableB_ID and
t.TableC_ID = x.TableC_ID and
t.TableD_ID = x.TableD_ID;
--this select needs 0.5 seconds
Select *
from ViewA
where ProjectID = 2220;
--this select needs 15 seconds
Select *
from viewA
where id = 5440;
The selects on TableA and on viewX separately are fast.
--this select needs 0.5 seconds
select *
from TableA
where id = 5440;
Result: ID = 5440, ProjectID = 2220, TableB_ID = 123, TableC_ID = 5325, TableD_ID = 7654
--this select needs 0.3 seconds
Select *
from viewX x
where TableB_ID = 123 and
TableC_ID = 5325 and
TableD_ID = 7654;
Thanks for your support
I would say it is because the optimizer decomposes the select against the view into selects against the base tables. In the second case, you are not unioning all the rows of the other tables, only the rows that meet the where clause for each table, so the second query is faster because it has to go through fewer rows.
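If the optimizer won't do that decomposition for the primary-key predicate, you can spell it out yourself by resolving the keys from TableA first and probing the view with them. A sketch based on the fast queries above; whether this helps depends on Oracle being able to push the keys into the union all branches:
Select x.*
from viewX x
where (x.TableB_ID, x.TableC_ID, x.TableD_ID) in
      (Select t.TableB_ID, t.TableC_ID, t.TableD_ID
       from TableA t
       where t.id = 5440);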
My database schema looks like this:
Table t1:
id
valA
valB
Table t2:
id
valA
valB
What I want to do is, for a given set of rows in one of these tables, find rows in both tables that have the same valA or valB (comparing valA with valA and valB with valB, not valA with valB). Then I want to look for rows with the same valA or valB as the rows in the result of the previous query, and so on.
Example data:
t1 (id, valA, valB):
1, a, B
2, b, J
3, d, E
4, d, B
5, c, G
6, h, J
t2 (id, valA, valB):
1, b, E
2, d, H
3, g, B
Example 1:
Input: Row 1 in t1
Output:
t1/4, t2/3
t1/3, t2/2
t2/1
...
Example 2:
Input: Row 6 in t1
Output:
t1/2
t2/1
I would like the result to include the level of the search at which each row was found (e.g. in Example 1: level 1 for t1/4 and t2/3, level 2 for t1/3 and t2/2, ...). A limited depth of recursion is okay. Over time, I may want to include more tables following the same schema in the query. It would be nice if the query were easy to extend for that purpose.
But what matters most, is the performance. Can you tell me the fastest possible way to accomplish this?
Thanks in advance!
Try this. It's not fully tested, but it looked like it was working :P (http://pastie.org/1140339)
drop table if exists t1;
create table t1
(
id int unsigned not null auto_increment primary key,
valA char(1) not null,
valB char(1) not null
)
engine=innodb;
drop table if exists t2;
create table t2
(
id int unsigned not null auto_increment primary key,
valA char(1) not null,
valB char(1) not null
)
engine=innodb;
drop view if exists t12;
create view t12 as
select 1 as tid, id, valA, valB from t1
union
select 2 as tid, id, valA, valB from t2;
insert into t1 (valA, valB) values
('a','B'),
('b','J'),
('d','E'),
('d','B'),
('c','G'),
('h','J');
insert into t2 (valA, valB) values
('b','E'),
('d','H'),
('g','B');
drop procedure if exists find_children;
delimiter #
create procedure find_children
(
in p_tid tinyint unsigned,
in p_id int unsigned
)
proc_main:begin
declare done tinyint unsigned default 0;
declare dpth smallint unsigned default 0;
create temporary table children(
tid tinyint unsigned not null,
id int unsigned not null,
valA char(1) not null,
valB char(1) not null,
depth smallint unsigned default 0,
primary key (tid, id, valA, valB)
)engine = memory;
insert into children select p_tid, t.id, t.valA, t.valB, dpth from t12 t where t.tid = p_tid and t.id = p_id;
create temporary table tmp engine=memory select * from children;
/* http://dev.mysql.com/doc/refman/5.0/en/temporary-table-problems.html */
while done <> 1 do
if exists(
select 1 from t12 t
inner join tmp on (tmp.valA = t.valA or tmp.valB = t.valB) and tmp.depth = dpth) then
insert ignore into children
select
t.tid, t.id, t.valA, t.valB, dpth+1
from t12 t
inner join tmp on (tmp.valA = t.valA or tmp.valB = t.valB) and tmp.depth = dpth;
set dpth = dpth + 1;
truncate table tmp;
insert into tmp select * from children where depth = dpth;
else
set done = 1;
end if;
end while;
select * from children order by depth;
drop temporary table if exists children;
drop temporary table if exists tmp;
end proc_main #
delimiter ;
call find_children(1,1);
call find_children(1,6);
You can do it with stored procedures (see listings 7 and 7a):
http://www.artfulsoftware.com/mysqlbook/sampler/mysqled1ch20.html
You just need to figure out a query for the step of the recursion - taking the already-found rows and finding some more rows.
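For these two tables, that step could look something like this (a sketch; found is a hypothetical table holding the rows discovered so far):
select 't1' as tbl, t.id, t.valA, t.valB
from t1 t
join found f on f.valA = t.valA or f.valB = t.valB
union
select 't2', t.id, t.valA, t.valB
from t2 t
join found f on f.valA = t.valA or f.valB = t.valB;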
If you had a database which supported SQL-99 recursive common table expressions (like PostgreSQL or Firebird, hint hint), you could take the same approach as in the above link, but using a rCTE as the framework, so avoiding the need to write a stored procedure.
EDIT: I had a go at doing this with an rCTE in PostgreSQL 8.4, and although I can find the rows, I can't find a way to label them with the depth at which they were found. First, I create a view to unify the tables:
create view t12 (tbl, id, vala, valb) as (
(select 't1', id, vala, valb from t1)
union
(select 't2', id, vala, valb from t2)
)
Then do this query:
with recursive descendants (tbl, id, vala, valb) as (
(select *
from t12
where tbl = 't1' and id = 1) -- the query that identifies the seed rows, here just t1/1
union
(select c.*
from descendants p, t12 c
where (p.vala = c.vala or p.valb = c.valb)) -- the recursive term
)
select * from descendants;
You would imagine that capturing depth would be as simple as adding a depth column to the rCTE, set to zero in the seed query, then somehow incremented in the recursive step. However, I couldn't find any way to do that, given that you can't write subqueries against the rCTE in the recursive step (so nothing like select max(depth) + 1 from descendants in the column list), and you can't use an aggregate function in the column list (so no max(p.depth) + 1 in the column list coupled with a group by c.* on the select).
You would also need to add a restriction to the query to exclude already-selected rows; you don't need to do that in the basic version, because of the distincting effect of the union, but if you add a count column, then a row can be included in the results more than once with different counts, and you'll get a Cartesian explosion. But you can't easily prevent it, because you can't have subqueries against the rCTE, which means you can't say anything like and not exists (select * from descendants d where d.tbl = c.tbl and d.id = c.id)!
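For completeness, one pattern that seems to dodge both problems is to carry the depth, hard-limit it (the question says a limited recursion depth is okay), and collapse the duplicates with min(depth) outside the rCTE. A sketch, not benchmarked:
with recursive descendants (tbl, id, vala, valb, depth) as (
    (select tbl, id, vala, valb, 0
     from t12
     where tbl = 't1' and id = 1)
  union
    (select c.tbl, c.id, c.vala, c.valb, p.depth + 1
     from descendants p, t12 c
     where (p.vala = c.vala or p.valb = c.valb)
       and p.depth < 5)  -- hard limit, so the ever-growing depth cannot recurse forever
)
select tbl, id, vala, valb, min(depth) as depth
from descendants
group by tbl, id, vala, valb
order by depth;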
I know all this stuff about recursive queries is of no use to you, but I find it riveting, so please do excuse me.
This is my input data
GroupId Serial Action
1 1 Start
1 2 Run
1 3 Jump
1 8 End
2 9 Shop
2 10 Start
2 11 Run
For each activity sequence in a group, I want to find pairs of actions where NextAction.Serial = FirstAction.Serial + k, and how many times each pair occurs.
Suppose k = 1, then output will be
FirstAction NextAction Frequency
Start Run 2
Run Jump 1
Shop Start 1
How can I do this in SQL, fast enough given that the input table contains millions of entries?
tful, this should produce the result you want, but I don't know if it will be as fast as you'd like. It's worth a try.
create table Actions(
GroupId int,
Serial int,
"Action" varchar(20) not null,
primary key (GroupId, Serial)
);
insert into Actions values
(1,1,'Start'), (1,2,'Run'), (1,3,'Jump'),
(1,8,'End'), (2,9,'Shop'), (2,10,'Start'),
(2,11,'Run');
go
declare @k int = 1;
with ActionsDoubled(GroupId,Serial,Tag,"Action") as (
select
GroupId, Serial, 'a', "Action"
from Actions as A
union all
select
GroupId, Serial-@k, 'b', "Action"
from Actions
as B
), Pivoted(GroupId,Serial,a,b) as (
select GroupId,Serial,a,b
from ActionsDoubled
pivot (
max("Action") for Tag in ([a],[b])
) as P
)
select
a, b, count(*) as ct
from Pivoted
where a is not NULL and b is not NULL
group by a,b
order by a,b;
go
drop table Actions;
If you will be doing the same computation for various @k values on stable data, this may work better in the long run:
declare @k int = 1;
select
GroupId, Serial, 'a' as Tag, "Action"
into ActionsDoubled
from Actions as A
union all
select
GroupId, Serial-@k, 'b', "Action"
from Actions
as B;
go
create unique clustered index AD_S on ActionsDoubled(GroupId,Serial,Tag);
create index AD_a on ActionsDoubled(Tag,GroupId,Serial);
go
with Pivoted(GroupId,Serial,a,b) as (
select GroupId,Serial,a,b
from ActionsDoubled
pivot (
max("Action") for Tag in ([a],[b])
) as P
)
select
a, b, count(*) as ct
from Pivoted
where a is not NULL and b is not NULL
group by a,b
order by a,b;
go
drop table ActionsDoubled;
SELECT a1.Action AS FirstAction, a2.Action AS NextAction, COUNT(*) AS Frequency
FROM Activities a1 JOIN Activities a2
ON (a1.GroupId = a2.GroupId AND a2.Serial = a1.Serial + @k)
GROUP BY a1.Action, a2.Action;
The problem is this: your query has to go through EVERY row regardless.
You can make it more manageable for your database by tackling each group separately, as separate queries, especially if the size of each group is SMALL.
There's a lot going on under the hood, and when the query has to scan the entire table, it ends up many times slower than running small chunks that together cover all million rows.
So for instance:
--Stickler for clean formatting...
SELECT
a1.Action AS FirstAction,
a2.Action AS NextAction,
COUNT(*) AS Frequency
FROM
Activities a1 JOIN Activities a2
ON (a1.groupid = a2.groupid
AND a2.Serial = a1.Serial + @k)
WHERE
a1.groupid = 1
GROUP BY
a1.Action,
a2.Action;
By the way, you have an index (GroupId, Serial) on the table, right?
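If not, adding one is probably the cheapest win available here; a sketch (the index name is made up):
create index ix_activities_group_serial on Activities (GroupId, Serial) include ("Action");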