Wrong plan when inner-joining a view/subquery that has a left join

I'm trying to build a query that inner joins a view (which exists for reusability), but the fact that this view contains a left join seems to confuse the optimizer, and I can't understand why (index statistics are up to date).
Below is an MCVE. It's very simple: picture it as a customer (B) / order (C) design where the customer's (optional) address is in another table (A), plus a view that joins the customer to its address (vw_B).
Metadata and example data:
create table A (
    id int not null,
    fieldA char(10) not null,
    constraint pk_A primary key (id)
);

create table B (
    id int not null,
    fieldB char(10) not null,
    idA int,
    constraint pk_B primary key (id),
    constraint fk_A foreign key (idA) references A (id)
);

create view VW_B as
select b.*, a.fieldA
from B
left join A on a.id = b.idA;

create table C (
    id int not null,
    mydate date not null,
    idB int not null,
    constraint pk_C primary key (id),
    constraint fk_B foreign key (idB) references B (id)
);

create index ix_C on C (mydate);
insert into A (id, fieldA)
with recursive n as (
    select 1 as n from rdb$database
    union all
    select n.n + 1 from n
    where n < 10
)
select n.n, 'A' from n;

SET STATISTICS INDEX PK_A;

insert into B (id, fieldB, idA)
with recursive n as (
    select 1 as n from rdb$database
    union all
    select n.n + 1 from n
    where n < 100
)
select n.n, 'B', IIF(MOD(n.n, 5) = 0, null, MOD(n.n, 10)+1) from n;

SET STATISTICS INDEX PK_B;
SET STATISTICS INDEX FK_A;

insert into C (id, mydate, idB)
with recursive n as (
    select 1 as n from rdb$database
    union all
    select n.n + 1 from n
    where n < 1000
)
select n.n, cast('01.01.2020' as date) + 100*rand(), mod(n.n, 100)+1 from n;

SET STATISTICS INDEX PK_C;
SET STATISTICS INDEX FK_B;
SET STATISTICS INDEX IX_C;
With this design, I want a query that joins all the tables so that I can efficiently search orders by date (c.mydate) or by any indexed customer information (table B). The obvious choice is an inner join between B and C, and that works fine. But if I want to add the customer's address to the result by using vw_B instead of B, the optimizer no longer selects the best plan.
Here are some queries to show this:
Manually joining all tables and filtering by date. Optimizer works fine.
select c.*, b.fieldB, a.fieldA
from C
inner join B on b.id = c.idB
left join A on a.id = b.idA
where c.mydate = '01.01.2020'
PLAN JOIN (JOIN (C INDEX (IX_C), B INDEX (PK_B)), A INDEX (PK_A))
Reusing vw_B to have table A joined automatically. The optimizer selects a NATURAL plan on (VW_B B).
select c.*, b.fieldB, b.fieldA
from C
inner join VW_B b on b.id = c.idB
where c.mydate = '01.01.2020'
PLAN JOIN (JOIN (B B NATURAL, B A INDEX (PK_A)), C INDEX (FK_B, IX_C))
Why does that happen? I thought these two queries would result in exactly the same operations in the engine. This is a very simple MCVE; in reality I have much more complex, highly reusable views, and with larger tables joining against those views is causing performance issues.
Do you have any suggestions to improve performance/PLAN selection, but preserving the convenience of reusability that views provide?
Server version is WI-V3.0.4.33054.

The Firebird optimizer is not intelligent enough to consider the queries equivalent.
Your query with the view is equivalent to:
select c.*, b.fieldB, a.fieldA
from C
inner join (B left join A on a.id = b.idA)
    on b.id = c.idB
where c.mydate = '01.01.2020'
This will produce (almost) the same plan. So the problem is not with the use of views itself, but with how table expressions are nested. This changes how they are evaluated by the engine, and which join reorderings the engine considers possible.
As BrakNicku indicated in the comments, there is no general solution for this.


faster way to retrieve parent values in SQL

Here is a description of my problem with dummy data:
I have a table in SQL Server as below:
Table
id  parentid  extid  isparent
0   a         m      0
1   a         m      1
2   a         s      0
3   a         s      0
4   b         q      1
5   b         z      0
For each group of records with the same parentid, there is exactly one record with isparent = 1.
For each record, I want to find the extid of its parent.
So for id = 0, the parent record is id = 1, and extid = m for id = 1 is the value I need.
Here is the output I want.
childid  parentid  child_extid  parent_extid
0        a         m            m
1        a         m            m
2        a         s            m
3        a         s            m
4        b         q            q
5        b         z            q
I'm doing this with a self join, but since the table is large, the performance is really slow. I also need to do this multiple times for several different tables, which makes things even worse.
SELECT
    a.id AS 'ChildId',
    a.parentid AS 'ParentId',
    a.extid AS 'child_extid',
    b.extid AS 'parent_extid'
FROM Table a
LEFT JOIN Table b ON (a.parentid = b.parentid)
WHERE b.isparent = 1
Just wondering if there is a better way to do this.
Thanks!
This design is incredibly unorthodox. Not only is this structure incapable of representing true hierarchies, it also appears to represent a ternary relationship rather than a binary one. If you have control over the design of this data and this was not your intent, I can help you reformat it for your intent if desired. Until then, here is a very basic representation of what you might be looking for, but it falls short because there are unanswered questions about the intent of the data, such as: what happens if you insert a row with ParentID = a, ExtID = y, IsParent = 1?
Having said that, here's a crack at it. Note that this executes as a self semi-join and will require an index to perform well. It also omits an ORDER BY clause because of the question above.
The code below is T-SQL until the DBMS is clarified.
CREATE FUNCTION ParentExitID (@ChildID INT)
RETURNS VARCHAR(30)
AS
BEGIN
    RETURN (
        SELECT TOP 1 A.ExtID
        FROM SampleTable A
        WHERE A.IsParent = 1
          AND EXISTS (
              SELECT 1
              FROM SampleTable B
              WHERE A.ParentID = B.ParentID
                AND B.ChildID = @ChildID
          )
    );
END
The "faster" way is to use a relational database relationally and employ normalization in the design. What the question demonstrates is cramming as much as possible into one table without any normalization, then building complex (and inefficient) query logic on top of it.
create table #thing ( id int ); --id is PK
create table #ext ( id varchar(1) ); -- id is PK
create table #thingext ( thing int, ext varchar(1) ); -- thing,ext is PK
create table #parent ( id varchar(1), ext varchar(1)); -- id is PK, ext is FK
create table #thingparent ( thing int, parent varchar(1)); -- thing,parent is PK
insert into #thing values (0),(1),(2),(3),(4),(5);
insert into #ext values ('m'),('s'),('q'),('z');
insert into #thingext values (0,'m'),(1,'m'),(2,'s'),(3,'s'),(4,'q'),(5,'z');
insert into #parent values ('a','m'),('b','q');
insert into #thingparent values (0,'a'),(1,'a'),(2,'a'),(3,'a'),(4,'b'),(5,'b');
select t.id as childid, p.ext as extid
from #thing t
join #thingparent tp on t.id = tp.thing
join #parent p on tp.parent = p.id
What's getting jumbled up in the complexity of the original question is that extid depends on the parent primary key but not on the child primary key, and the model has to reflect that.
Not sure if it will really speed things up, but you can drop the WHERE clause by moving the b.isparent condition into the JOIN.
Example using a table variable:
declare @Table table (id int identity(0,1) primary key, parentid varchar(30), extid varchar(30), isparent bit);

insert into @Table (parentid, extid, isparent) values
    ('a','m',0),
    ('a','m',1),
    ('a','s',0),
    ('a','s',0),
    ('b','q',1),
    ('b','z',0);

SELECT
    a.id AS 'ChildId',
    a.parentid AS 'ParentId',
    a.extid AS 'child_extid',
    b.extid AS 'parent_extid'
FROM @Table a
LEFT JOIN @Table b ON (a.parentid = b.parentid and b.isparent = 1);
But the Explain Plan will probably be the same as in your query.
Adding a non-unique index on parentid could speed things up.
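If the DBMS supports window functions (SQL Server 2012 does), the self join can be avoided entirely: within each parentid partition, pick the extid of the row flagged isparent = 1. Below is a minimal sketch of that approach, run through SQLite from Python purely for illustration; the table name t is made up, the data is taken from the question, and the same MAX(CASE ...) OVER (...) expression is valid T-SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (id INT, parentid TEXT, extid TEXT, isparent INT);
INSERT INTO t VALUES
 (0,'a','m',0),(1,'a','m',1),(2,'a','s',0),
 (3,'a','s',0),(4,'b','q',1),(5,'b','z',0);
""")
# For each row, take the extid of the single isparent = 1 row
# in the same parentid group -- no self join needed.
rows = conn.execute("""
SELECT id       AS childid,
       parentid,
       extid    AS child_extid,
       MAX(CASE WHEN isparent = 1 THEN extid END)
           OVER (PARTITION BY parentid) AS parent_extid
FROM t
ORDER BY id
""").fetchall()
for r in rows:
    print(r)
```

Since every partition has exactly one isparent = 1 row, MAX over the CASE expression simply picks that row's extid; the engine makes a single pass per partition instead of probing a second copy of the table per row.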

Index over multiple lookup tables in SQL Server

In SQL Server 2012, suppose we have three tables, Foos, Lookup1 and Lookup2, created with the following SQL:
CREATE TABLE Foos (
    Id int NOT NULL,
    L1 int NOT NULL,
    L2 int NOT NULL,
    Value int NOT NULL,
    CONSTRAINT PK_Foos PRIMARY KEY CLUSTERED (Id ASC)
);

CREATE TABLE Lookup1 (
    Id int NOT NULL,
    Name nvarchar(50) NOT NULL,
    CONSTRAINT PK_Lookup1 PRIMARY KEY CLUSTERED (Id ASC),
    CONSTRAINT IX_Lookup1 UNIQUE NONCLUSTERED (Name ASC)
);

CREATE TABLE Lookup2 (
    Id int NOT NULL,
    Name nvarchar(50) NOT NULL,
    CONSTRAINT PK_Lookup2 PRIMARY KEY CLUSTERED (Id ASC),
    CONSTRAINT IX_Lookup2 UNIQUE NONCLUSTERED (Name ASC)
);

CREATE NONCLUSTERED INDEX IX_Foos ON Foos (
    L1 ASC,
    L2 ASC,
    Value ASC
);

ALTER TABLE Foos WITH CHECK ADD CONSTRAINT FK_Foos_Lookup1
    FOREIGN KEY (L2) REFERENCES Lookup1 (Id);
ALTER TABLE Foos CHECK CONSTRAINT FK_Foos_Lookup1;

ALTER TABLE Foos WITH CHECK ADD CONSTRAINT FK_Foos_Lookup2
    FOREIGN KEY (L1) REFERENCES Lookup2 (Id);
ALTER TABLE Foos CHECK CONSTRAINT FK_Foos_Lookup2;
BAD PLAN:
The following SQL query to get Foos by the lookup tables:
select top(1) f.* from Foos f
join Lookup1 l1 on f.L1 = l1.Id
join Lookup2 l2 on f.L2 = l2.Id
where l1.Name = 'a' and l2.Name = 'b'
order by f.Value
does not fully utilize the IX_Foos index, see http://sqlfiddle.com/#!6/cd5c1/1/0 and the plan with data.
(It just chooses one of the lookup tables.)
GOOD PLAN:
However if I rewrite the query:
declare #l1Id int = (select Id from Lookup1 where Name = 'a');
declare #l2Id int = (select Id from Lookup2 where Name = 'b');
select top(1) f.* from Foos f
where f.L1 = #l1Id and f.L2 = #l2Id
order by f.Value
it works as expected: it first looks up both lookup tables and then uses the results to seek on the IX_Foos index.
Is it possible to use a hint to force SQL Server, in the first query (with joins), to look up the ids first and then use them against IX_Foos?
Because if the Foos table is quite large, the first query (with joins) locks the whole table:(
NOTE: The inner-join query comes from LINQ. Alternatively, is it possible to force LINQ in Entity Framework to rewrite the queries using declare? Doing the lookups in separate requests would add round-trip delay in more complex queries.
NOTE2: In Oracle it works ok, it seems like a problem of SQL Server.
NOTE3: The locking issue is more apparent when adding TOP(1) to the select f.* from Foos .... (For instance you need to get only the min or max value.)
UPDATE:
Following Hoots's suggestion, I changed IX_Lookup1 and IX_Lookup2:
CONSTRAINT IX_Lookup1 UNIQUE NONCLUSTERED (Name ASC, Id ASC)
CONSTRAINT IX_Lookup2 UNIQUE NONCLUSTERED (Name ASC, Id ASC)
It helps, but it is still sorting all results:
Why is it taking all 10,000 rows from Foos that match f.L1 and f.L2 instead of just the first row? IX_Foos includes Value ASC, so it could find the first row without reading and sorting all 10,000 rows. The earlier plan with declared variables uses IX_Foos and does not sort.
Looking at the query plans, SQL Server is using the same indexes in both versions of the SQL you've put down; it's just that the second version executes three separate pieces of SQL rather than one, and so evaluates the indexes at different times.
I have checked and I think the solution is to change the indexes as below...
CONSTRAINT IX_Lookup1 UNIQUE NONCLUSTERED (Name ASC, ID ASC)
and
CONSTRAINT IX_Lookup2 UNIQUE NONCLUSTERED (Name ASC, ID ASC)
When it evaluates the index it won't need to go off and fetch the ID from the table data, since the ID is already in the index. This changes the plan to what you want, hopefully preventing the locking you're seeing, but I won't guarantee that side of it since locking isn't something I can reproduce here.
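The effect of widening the unique index to (Name, Id) can be illustrated even in SQLite: once Id is stored in the index, the lookup is answered from the index alone, which SQLite's EXPLAIN QUERY PLAN reports as a covering index. A small sketch (Python/SQLite purely for illustration; the table and index names are borrowed from the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Lookup1 (Id INTEGER NOT NULL, Name TEXT NOT NULL);
-- Id is part of the index, so Name-based lookups never touch the table.
CREATE UNIQUE INDEX IX_Lookup1 ON Lookup1 (Name, Id);
INSERT INTO Lookup1 VALUES (1, 'a'), (2, 'b');
""")
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT Id FROM Lookup1 WHERE Name = 'a'"
).fetchall()
detail = " ".join(row[-1] for row in plan)
print(detail)
```

The plan text names IX_Lookup1 as a covering index, i.e. the base table is never read for this query; that is the same property the (Name ASC, Id ASC) change gives SQL Server.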
UPDATE: I now see the issue...
The second piece of SQL is effectively not using set-based operations. Simplified, you're doing...
select f.*
from Foos f
where f.L1 = 1
and f.L2 = 1
order by f.Value desc
This only has to seek on a single index and gets the results already ordered.
In the first piece of SQL (shown below) you're combining different data sets that have indexes only on the individual tables. The following two pieces of SQL do the same thing, with the same query plan...
select f.* -- cost 0.7099
from Foos f
join Lookup1 l1 on f.L1 = l1.Id
join Lookup2 l2 on f.L2 = l2.Id
where l1.Name = 'a' and l2.Name = 'b'
order by f.Value
select f.* -- cost 0.7099
from Foos f
inner join (SELECT l1.id l1Id, l2.id l2Id
from Lookup1 l1, Lookup2 l2
where l1.Name = 'a' and l2.Name='b') lookups on (f.L1 = lookups.l1Id and f.L2=lookups.l2Id)
order by f.Value desc
The reason I've shown both is that in the second version you can easily hint that the input is not a set but a single row, by writing it as...
select f.* -- cost 0.095
from Foos f
inner join (SELECT TOP 1 l1.id l1Id, l2.id l2Id
from Lookup1 l1, Lookup2 l2
where l1.Name = 'a' and l2.Name='b') lookups on (f.L1 = lookups.l1Id and f.L2=lookups.l2Id)
order by f.Value desc
Of course you can only do this knowing that the subquery will return a single record, whether TOP 1 is specified or not. This brings the cost down from 0.7099 to 0.095. I can only surmise that, now that the input is explicitly a single record, the optimiser knows the ordering can be handled by the index rather than having to sort 'manually'.
Note: 0.7099 isn't very large for a query that runs once, i.e. you'll hardly notice, but if it's part of a larger set of executions you can drive the cost down if you like. I suspect the question is more about the reason why, which I believe comes down to set-based operations versus singular seeks.
Try using a CTE like this:
with cte as (
    select min(Value) as Value
    from Foos f
    join Lookup1 l1 on f.L1 = l1.Id
    join Lookup2 l2 on f.L2 = l2.Id
    where l1.Name = 'a' and l2.Name = 'b'
)
select top(1) *
from Foos
where exists (select * from cte where cte.Value = Foos.Value)
option (recompile)
This roughly halves the logical reads from the Foos table, and the execution time.
set statistics io,time on
1) Your first query, with the indexes suggested by Hoots:
Estimated Subtree Cost = 0.888
Table 'Foos'. Scan count 1, logical reads 59
CPU time = 15 ms, elapsed time = 151 ms.
2) The CTE query, with the same indexes:
Estimated Subtree Cost = 0.397
Table 'Foos'. Scan count 2, logical reads 34
CPU time = 15 ms, elapsed time = 66 ms.
But with billions of rows in Foos this technique can still be slow, since we touch the table twice instead of once as in your first query.
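The shape of the CTE rewrite, separately from any cost numbers, can be sketched in SQLite from Python: first compute MIN(Value) over the filtered join, then fetch one matching row (TOP(1) becomes LIMIT 1 in SQLite). The tiny data set below is made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Foos (Id INT, L1 INT, L2 INT, Value INT);
CREATE TABLE Lookup1 (Id INT, Name TEXT);
CREATE TABLE Lookup2 (Id INT, Name TEXT);
INSERT INTO Lookup1 VALUES (1, 'a');
INSERT INTO Lookup2 VALUES (2, 'b');
INSERT INTO Foos VALUES (1, 1, 2, 30), (2, 1, 2, 10), (3, 1, 2, 20);
""")
# Step 1 (cte): find the minimum Value among matching Foos rows.
# Step 2: fetch one Foos row carrying that Value.
row = conn.execute("""
WITH cte AS (
    SELECT MIN(f.Value) AS Value
    FROM Foos f
    JOIN Lookup1 l1 ON f.L1 = l1.Id
    JOIN Lookup2 l2 ON f.L2 = l2.Id
    WHERE l1.Name = 'a' AND l2.Name = 'b'
)
SELECT * FROM Foos
WHERE EXISTS (SELECT * FROM cte WHERE cte.Value = Foos.Value)
LIMIT 1
""").fetchone()
print(row)  # -> (2, 1, 2, 10)
```

The aggregate pass and the row fetch are two cheap index-friendly operations, which is why the plan avoids sorting the whole matching set, at the price of touching Foos twice.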

postgis spatial query assistance

Assuming I have three tables:
A. Municipalities (MultiPolygon)
B. Postcode centroids (Point)
C. User data (Point)
Entries in C match entries in B via a FK (code).
I am looking for an efficient way to count the number of user-data points (C) inside each municipality (A) using ST_Contains.
BUT here is the catch: if an entry in C has a NULL geometry (or matches another condition), fall back to the matching entry in B via the FK, if it exists.
I have tried various patterns; although spatially querying A & B and A & C is sub-second in each case, once I combine them all in one query (the goal) it takes over 4 seconds.
Sample of what I've tried:
This is the worst (60+ secs):
SELECT
    A.*,
    (SELECT COUNT(*)
     FROM (SELECT CASE WHEN C.GEOM IS NULL THEN B.GEOM ELSE C.GEOM END AS GEOM
           FROM C LEFT JOIN B ON C.ID = B.ID) AS b
     WHERE ST_CONTAINS(A.GEOM, b.GEOM)
    ) AS count
FROM
    A
This is 15 sec:
SELECT
    A.ID, ..., -- other A fields
    COUNT(b.GEOM)
FROM
    A,
    (SELECT CASE WHEN C.GEOM IS NULL THEN B.GEOM ELSE C.GEOM END AS GEOM
     FROM C LEFT JOIN B ON C.ID = B.ID) AS b
WHERE
    ST_Contains(A.GEOM, b.GEOM)
GROUP BY
    A.ID, ... -- other A fields
As I said
SELECT COUNT(*) FROM A LEFT JOIN B ON ST_Contains(A.GEOM, B.GEOM)
and
SELECT COUNT(*) FROM A LEFT JOIN C ON ST_Contains(A.GEOM, C.GEOM)
both return in under a second.
All indexes are in place for the foreign key as well (B.ID = C.ID)
Thanks
Did you create indexes on A.geom and B.geom?
They would be:
CREATE INDEX idx_A ON A USING GIST ( GEOM );
VACUUM ANALYZE A (GEOM);
CREATE INDEX idx_B ON B USING GIST ( GEOM );
VACUUM ANALYZE B (GEOM);

How do I select unique pairs of rows from a table at random?

I have two tables like these:
CREATE TABLE people (
    id INT NOT NULL,
    PRIMARY KEY (id)
);

CREATE TABLE pairs (
    person_a_id INT,
    person_b_id INT,
    FOREIGN KEY (person_a_id) REFERENCES people(id),
    FOREIGN KEY (person_b_id) REFERENCES people(id)
);
I want to select pairs of people at random from the people table, and after selecting them I add the randomly selected pair to the pairs table. person_a_id always refers to the person with the lower id of the pair (since the order within the pair is not relevant).
The thing is that I never want to select the same pair twice, so I need to check the pairs table before I return my randomly selected pair.
Is it possible to do this using just a single SQL query in a reasonably efficient and elegant manner?
(I'm doing this using the Java Persistence API, but hopefully I'll be able to translate any answers into JPA code)
select a.id, b.id
from people a
inner join people b on a.id < b.id
where not exists (
    select *
    from pairs c
    where c.person_a_id = a.id
      and c.person_b_id = b.id)
order by a.id * rand()
limit 1;
Limit 1 returns just one pair if you are "drawing lots" one at a time. Otherwise, up the limit to however many pairs you need.
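The NOT EXISTS pattern from the query above can be exercised end to end in SQLite from Python (SQLite spells RAND() as RANDOM(); otherwise the query is the same). Drawing one pair at a time and recording each draw in pairs eventually yields every unordered pair exactly once:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE people (id INT PRIMARY KEY);
CREATE TABLE pairs (person_a_id INT, person_b_id INT);
INSERT INTO people VALUES (1), (2), (3), (4);
""")
while True:
    # Draw one random pair (a.id < b.id) that is not yet in pairs.
    pair = conn.execute("""
        SELECT a.id, b.id
        FROM people a
        JOIN people b ON a.id < b.id
        WHERE NOT EXISTS (
            SELECT 1 FROM pairs c
            WHERE c.person_a_id = a.id AND c.person_b_id = b.id)
        ORDER BY RANDOM()
        LIMIT 1
    """).fetchone()
    if pair is None:          # every pair has been drawn
        break
    conn.execute("INSERT INTO pairs VALUES (?, ?)", pair)
drawn = conn.execute("SELECT COUNT(*) FROM pairs").fetchone()[0]
print(drawn)  # 4 people -> C(4,2) = 6 unique pairs
```

Because the a.id < b.id join enumerates each unordered pair once and NOT EXISTS filters out pairs already drawn, the loop terminates after exactly n·(n-1)/2 draws with no repeats.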
The above query assumes that you can get
1 - 2
2 - 7
and that the pairing 2 - 7 is valid since it doesn't exist yet, even though 2 features again. If you want each person to feature in at most one pair ever, then:
select a.id, b.id
from people a
inner join people b on a.id < b.id
where not exists (
    select *
    from pairs c
    where c.person_a_id in (a.id, b.id))
  and not exists (
    select *
    from pairs c
    where c.person_b_id in (a.id, b.id))
order by a.id * rand()
limit 1;
If multiple pairs are to be generated in one single query, AND the destination table is still empty, you could use this single query. Take note that LIMIT 6 returns only 3 pairs.
select min(a) a, min(b) b
from
(
    select
        case when mod(@p, 2) = 1 then id end a,
        case when mod(@p, 2) = 0 then id end b,
        @p := @p + 1 grp
    from (
        select id
        from (select @p := 1) p, people
        order by rand()
        limit 6
    ) x
) y
group by floor(grp/2)
This cannot be accomplished in a single-query set-based approach because your set will not have knowledge of what pairs are inserted into the pairs table.
Instead, you should loop:
WHILE EXISTS (SELECT * FROM people
              WHERE id NOT IN (SELECT person_a_id FROM pairs)
                AND id NOT IN (SELECT person_b_id FROM pairs))
This loops while there are unmatched people.
Then generate two random numbers from 1 to the COUNT(*) of that table,
which gives you the number of unmatched people... if you get the same number twice, roll again. (If you're worried about this, draw the numbers from the two halves of the set, but then you lose some randomness based on your sort criteria.)
Pair those people.
Wash, rinse, repeat...
Your only "redo" is when you generate the same random number twice, which becomes more likely as fewer people remain, but is still at most a 25% chance (much better than 1/n^2).

Oracle sql query running for (almost) forever

An application of mine is trying to execute a count(*) query which returns after about 30 minutes. What's strange is that the query is very simple and the tables involved are large, but not gigantic (10,000 and 50,000 records).
The query which takes 30 minutes is:
select count(*)
from RECORD r inner join GROUP g
on g.GROUP_ID = r.GROUP_ID
where g.BATCH_ID = 1 and g.ENABLED = 'Y'
The database schema is essentially:
create table BATCH (
BATCH_ID int not null,
[other columns]...,
CONSTRAINT PK_BATCH PRIMARY KEY (BATCH_ID)
);
create table GROUP (
GROUP_ID int not null,
BATCH_ID int,
ENABLED char(1) not null,
[other columns]...,
CONSTRAINT PK_GROUP PRIMARY KEY (GROUP_ID),
CONSTRAINT FK_GROUP_BATCH_ID FOREIGN KEY (BATCH_ID)
REFERENCES BATCH (BATCH_ID),
CONSTRAINT CHK_GROUP_ENABLED CHECK(ENABLED in ('Y', 'N'))
);
create table RECORD (
GROUP_ID int not null,
RECORD_NUMBER int not null,
[other columns]...,
CONSTRAINT PK_RECORD PRIMARY KEY (GROUP_ID, RECORD_NUMBER),
CONSTRAINT FK_RECORD_GROUP_ID FOREIGN KEY (GROUP_ID)
REFERENCES GROUP (GROUP_ID)
);
create index IDX_GROUP_BATCH_ID on GROUP(BATCH_ID);
I checked whether there are any locks in the database and there are none. I also ran the following pieces of the query, and all except the last two returned instantly:
select count(*) from RECORD -- 55,501
select count(*) from GROUP -- 11,693
select count(*)
from RECORD r inner join GROUP g
on g.GROUP_ID = r.GROUP_ID
-- 55,501
select count(*)
from GROUP g
where g.BATCH_ID = 1 and g.ENABLED = 'Y'
-- 3,112
select count(*)
from RECORD r inner join GROUP g
on g.GROUP_ID = r.GROUP_ID
where g.BATCH_ID = 1
-- 27,742 - took around 5 minutes to run
select count(*)
from RECORD r inner join GROUP g
on g.GROUP_ID = r.GROUP_ID
where g.ENABLED = 'Y'
-- 51,749 - took around 5 minutes to run
Can someone explain what's going on? How can I improve the query's performance? Thanks.
A coworker figured out the issue: the table statistics weren't being updated, and the last time the table was analyzed was a couple of months ago (when the table was essentially empty). I ran analyze table RECORD compute statistics and now the query returns in less than a second.
I'll have to talk to the DBA about why the table statistics weren't being updated.
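The failure mode generalises beyond Oracle: optimizers estimate row counts from stored statistics, and statistics gathered when a table was nearly empty lead to terrible join orders once it fills up. As an illustration (SQLite from Python, not Oracle), ANALYZE is the equivalent refresh step; it populates the sqlite_stat1 table that SQLite's planner reads:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE record (group_id INT, record_number INT,
                     PRIMARY KEY (group_id, record_number));
""")
conn.executemany("INSERT INTO record VALUES (?, ?)",
                 [(i % 10, i) for i in range(1000)])
conn.commit()
conn.execute("ANALYZE")  # refresh statistics, like ANALYZE TABLE in Oracle
stats = conn.execute(
    "SELECT tbl, idx, stat FROM sqlite_stat1").fetchall()
print(stats)
```

After ANALYZE, sqlite_stat1 holds per-index row counts and selectivity figures for the record table; without it, the planner falls back on defaults, which is the same class of problem the stale Oracle statistics caused.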
SELECT COUNT(*)
FROM RECORD R
LEFT OUTER JOIN GROUP G ON G.GROUP_ID = R.GROUP_ID
AND G.BATCH_ID = 1
AND G.ENABLED = 'Y'
Try that and let me know how it turns out. Not saying this IS the answer, but since I don't have access to a DB right now, I can't test it. Hope it works for ya.
An explain plan would be a good place to start.
See here:
Strange speed changes with sql query
for how to use the explain plan syntax (and query to see the result.)
If that doesn't show anything suspicious, you'll probably want to look at a trace.