SQL query performance question (multiple sub-queries) - sql

I have this query:
SELECT p.id, r.status, r.title
FROM page AS p
INNER JOIN page_revision as r ON r.pageId = p.id AND (
r.id = (SELECT MAX(r2.id) from page_revision as r2 WHERE r2.pageId = r.pageId AND r2.status = 'active')
OR r.id = (SELECT MAX(r2.id) from page_revision as r2 WHERE r2.pageId = r.pageId)
)
Which returns each page and the latest active revision for each, unless no active revision is available, in which case it simply returns the latest revision.
Is there any way this can be optimised to improve performance or just general readability? I'm not having any issues right now, but my worry is that when this gets into a production environment (where there could be a lot of pages) it's going to perform badly.
Also, are there any obvious problems I should be aware of? The use of sub-queries always bugs me, but to the best of my knowledge this can't be done without them.
Note:
The reason the conditions are in the JOIN rather than a WHERE clause is that in other queries (where this same logic is used) I'm LEFT JOINing from the "site" table to the "page" table, and if no pages exist I still want the site returned.
Jack
Edit: I'm using MySQL

If "active" sorts first alphabetically among your status values, you might be able to reduce the subqueries to:
SELECT p.id, r.status, r.title
FROM page AS p
INNER JOIN page_revision as r ON r.pageId = p.id AND
r.id = (SELECT r2.id
FROM page_revision as r2
WHERE r2.pageId = r.pageId
ORDER BY r2.status, r2.id DESC
LIMIT 1)
Otherwise you can replace ORDER BY line with
ORDER BY CASE r2.status WHEN 'active' THEN 0 ELSE 1 END, r2.id DESC
These all come from my assumptions on SQL Server, your mileage with MySQL may vary.
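As a rough check of the ORDER BY CASE variant, here is a minimal sketch that runs it on Python's built-in sqlite3 module rather than on MySQL or SQL Server. The sample rows and the 'draft' status are made up for illustration, and the sketch says nothing about how MySQL would plan the query. It also runs the OR form from the question on the same data for comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page (id INTEGER PRIMARY KEY);
CREATE TABLE page_revision (
    id INTEGER PRIMARY KEY,
    pageId INTEGER NOT NULL,
    status TEXT NOT NULL,
    title TEXT NOT NULL
);
INSERT INTO page (id) VALUES (1), (2);
-- Page 1: active revisions exist, but a newer draft follows them.
-- Page 2: no active revision at all.
INSERT INTO page_revision (id, pageId, status, title) VALUES
    (1, 1, 'active', 'p1 v1'),
    (2, 1, 'active', 'p1 v2'),
    (3, 1, 'draft',  'p1 v3'),
    (4, 2, 'draft',  'p2 v1'),
    (5, 2, 'draft',  'p2 v2');
""")

# Single subquery: active revisions sort first, then highest id wins.
rows = conn.execute("""
SELECT p.id, r.status, r.title
FROM page AS p
INNER JOIN page_revision AS r ON r.pageId = p.id AND
    r.id = (SELECT r2.id
            FROM page_revision AS r2
            WHERE r2.pageId = r.pageId
            ORDER BY CASE r2.status WHEN 'active' THEN 0 ELSE 1 END, r2.id DESC
            LIMIT 1)
ORDER BY p.id
""").fetchall()

# The two-subquery OR form from the question, on the same data.
or_rows = conn.execute("""
SELECT p.id, r.id
FROM page AS p
INNER JOIN page_revision AS r ON r.pageId = p.id AND (
    r.id = (SELECT MAX(r2.id) FROM page_revision AS r2
            WHERE r2.pageId = r.pageId AND r2.status = 'active')
    OR r.id = (SELECT MAX(r2.id) FROM page_revision AS r2
               WHERE r2.pageId = r.pageId)
)
ORDER BY p.id, r.id
""").fetchall()

print(rows)
print(or_rows)
```

On this sample data the single-subquery form yields one row per page, while the OR form yields two rows for page 1, because the latest active revision and the latest revision overall both satisfy the condition when they differ.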

Maybe a little re-factoring is in order?
If you added a latest_revision_id column onto pages your problem would disappear, hopefully with only a couple of lines added to your page editor.
I know it's not normalized but it would simplify (and greatly speed up) the query, and sometimes you do have to denormalize for performance.
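A minimal sketch of that idea, again using Python's sqlite3 module as a stand-in for MySQL. The latest_revision_id column comes from the suggestion above; the save_revision helper is hypothetical and stands for whatever code path your page editor uses to store a revision. It applies the "latest active, else latest" rule from the question when refreshing the pointer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page (
    id INTEGER PRIMARY KEY,
    latest_revision_id INTEGER  -- denormalized pointer, NULL until a revision exists
);
CREATE TABLE page_revision (
    id INTEGER PRIMARY KEY,
    pageId INTEGER NOT NULL,
    status TEXT NOT NULL,
    title TEXT NOT NULL
);
INSERT INTO page (id) VALUES (1);
""")


def save_revision(page_id, status, title):
    """Hypothetical editor hook: store a revision, then refresh the pointer."""
    with conn:  # one transaction, so the pointer never lags behind the insert
        conn.execute(
            "INSERT INTO page_revision (pageId, status, title) VALUES (?, ?, ?)",
            (page_id, status, title),
        )
        conn.execute(
            """
            UPDATE page
            SET latest_revision_id = (
                SELECT r.id FROM page_revision AS r
                WHERE r.pageId = page.id
                ORDER BY CASE r.status WHEN 'active' THEN 0 ELSE 1 END, r.id DESC
                LIMIT 1)
            WHERE id = ?
            """,
            (page_id,),
        )


save_revision(1, "draft", "first draft")
save_revision(1, "active", "published")
save_revision(1, "draft", "newer draft")

# The read side becomes a plain join with no sub-queries.
rows = conn.execute("""
SELECT p.id, r.status, r.title
FROM page AS p
INNER JOIN page_revision AS r ON r.id = p.latest_revision_id
""").fetchall()
print(rows)
```

The cost moves to the write path: any code that inserts, deletes, or changes the status of a revision has to refresh the pointer too, otherwise it goes stale.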

Your problem is a particular case of what is described in this question.
The best you can get using standard ANSI SQL seems to be:
SELECT p.id, r.status, r.title
FROM page AS p
INNER JOIN page_revision as r ON r.pageId = p.id
AND r.id = (SELECT MAX(r2.id) from page_revision as r2 WHERE r2.pageId = r.pageId)
Other approaches are available but dependent on what database you're using. I'm not really sure it can be improved much for MySQL.

In MS SQL 2005+ and Oracle:
SELECT o.id, o.status, o.title
FROM (
SELECT p.id, r.status, r.title,
ROW_NUMBER() OVER (PARTITION BY p.id ORDER BY CASE WHEN r.status = 'active' THEN 0 ELSE 1 END, r.id DESC) AS rn
FROM page p
INNER JOIN page_revision r ON r.pageId = p.id
) o
WHERE o.rn = 1
In MySQL that can become a problem, as subqueries cannot use the INDEX RANGE SCAN as the expression from the outer query is not considered constant.
You'll need to create two indexes and a function that returns the last page revision to use those indexes:
CREATE INDEX ix_revision_page_status_id ON page_revision (pageId, status, id);
CREATE INDEX ix_revision_page_id ON page_revision (pageId, id);
CREATE FUNCTION `fn_get_last_revision`(input_id INT) RETURNS INT
READS SQL DATA
BEGIN
-- Named rev_id so it cannot shadow the id column inside the queries
DECLARE rev_id INT;
SELECT o.id
INTO rev_id
FROM (
-- Latest active revision, if there is one
(SELECT r.id
FROM page_revision r
FORCE INDEX (ix_revision_page_status_id)
WHERE r.pageId = input_id
AND r.status = 'active'
ORDER BY r.id DESC
LIMIT 1)
UNION ALL
-- Latest revision regardless of status
(SELECT r.id
FROM page_revision r
FORCE INDEX (ix_revision_page_id)
WHERE r.pageId = input_id
ORDER BY r.id DESC
LIMIT 1)
) o
LIMIT 1;
RETURN rev_id;
END;
SELECT po.id, r.status, r.title
FROM (
SELECT p.*, fn_get_last_revision(p.id) AS rev_id
FROM page p
) po, page_revision r
WHERE r.id = po.rev_id;
This will use the indexes efficiently to get the last revision of each page.
P.S. If you use numeric codes for statuses, with 0 for active, you can get rid of the second index and simplify the query.

Related

Selecting within a selection using SQL from a single table

I have to make a query inside another query in order to find entries in a table that have characteristics but not others. The characteristics are derived from a connection to another table.
Basically, I have a plans table and a parcels table. I need to find the plans that relate to both (building strata, bareland strata, common ownership) and (road, subdivision, park, interest). These plans should contain entries in one list, but not both.
Here is what I have so far.
SELECT *
FROM parcelfabric_plans
WHERE
(name in
(select pl.name from parcelfabric_parcels p inner join
parcelfabric_plans pl on p.planid = pl.objectid
WHERE
p.parcelclass IN ( 'ROAD', 'SUBDIVISION', 'PARK', 'INTEREST')))
This is the first query, which gets all the plans that have parcels related to them in this list. How do I query this selection to get plans within this selection that are also related to the second list (subdivisions, interests, roads, parks)?
This query returns 268983 results of plans. Of these results, I would like to be able to query them and get the number of plans that are also related to subdivisions, interests, roads, parks.
This would require elements from both lists:
select pl.name
from parcelfabric_plans pl
where exists (
select 1 from parcelfabric_parcels p
where p.planid = pl.objectid
and p.parcelclass in ('ROAD', 'SUBDIVISION', 'PARK', 'INTEREST')
) and exists (
select 1 from parcelfabric_parcels p
where p.planid = pl.objectid
and p.parcelclass in (<list 2>)
)
I'm not clear about the requirement though. If you want them to be mutually exclusive then I think this is a better idea:
with data as (
select p.planid,
count(case when p.parcelclass in
('ROAD', 'SUBDIVISION', 'PARK', 'INTEREST') then 1 end) as cnt1,
count(case when p.parcelclass in
(<list 2>) then 1 end) as cnt2
from parcelfabric_plans pl inner join parcelfabric_parcels p
on p.planid = pl.objectid
-- possible optimization
/* where p.parcelclass in (<combined list>) */
group by p.planid
)
select * from data
where cnt1 > 0 and cnt2 = 0 or cnt1 = 0 and cnt2 > 0;
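To see how the two counts behave, here is a small sketch on Python's sqlite3 module with three made-up plans. The second class list is taken from the asker's follow-up further down, the plan names are invented, and the counting query is written as a plain derived table rather than a CTE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE parcelfabric_plans (objectid INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE parcelfabric_parcels (planid INTEGER, parcelclass TEXT);
INSERT INTO parcelfabric_plans VALUES (1, 'PLAN-A'), (2, 'PLAN-B'), (3, 'PLAN-C');
INSERT INTO parcelfabric_parcels VALUES
    (1, 'ROAD'), (1, 'BUILDING STRATA'),   -- classes from both lists
    (2, 'PARK'), (2, 'ROAD'),              -- list 1 only
    (3, 'COMMON OWNERSHIP');               -- list 2 only
""")

# One pass over the parcels: count matches against each list per plan.
COUNTS = """
SELECT pl.name,
       COUNT(CASE WHEN p.parcelclass IN
           ('ROAD', 'SUBDIVISION', 'PARK', 'INTEREST') THEN 1 END) AS cnt1,
       COUNT(CASE WHEN p.parcelclass IN
           ('BUILDING STRATA', 'COMMON OWNERSHIP', 'BARE LAND STRATA') THEN 1 END) AS cnt2
FROM parcelfabric_plans AS pl
INNER JOIN parcelfabric_parcels AS p ON p.planid = pl.objectid
GROUP BY pl.objectid, pl.name
"""

# Plans with parcels from both lists.
both = conn.execute(
    f"SELECT name FROM ({COUNTS}) WHERE cnt1 > 0 AND cnt2 > 0 ORDER BY name"
).fetchall()

# Plans with parcels from exactly one of the lists.
exactly_one = conn.execute(
    f"SELECT name FROM ({COUNTS}) "
    "WHERE (cnt1 > 0 AND cnt2 = 0) OR (cnt1 = 0 AND cnt2 > 0) ORDER BY name"
).fetchall()

print(both, exactly_one)
```

Whichever reading of the requirement is right, only the final filter on cnt1 and cnt2 changes; the counting query stays the same.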
I would like to thank everyone for their comments and answers. I figured out a solution, though it is quite clunky. But at least it works.
SELECT *
FROM pmbcprod.pmbcowner.ParcelFabric_Plans
WHERE
(name in
(select pl.name from parcelfabric_parcels p inner join parcelfabric_plans pl on p.planid = pl.objectid
WHERE
p.parcelclass IN ('ROAD','INTEREST','SUBDIVISION','PARK')
)and name in
(select pl.name from parcelfabric_parcels p inner join parcelfabric_plans pl on p.planid = pl.objectid
WHERE
p.parcelclass IN ('BUILDING STRATA','COMMON OWNERSHIP','BARE LAND STRATA')
)
)
What I was after was simpler than I thought, I just needed to wrap my head around the structure. It's basically a nested query (subquery?). The inner query is made, then the next one is formed around it.
Again, thank you and it is much appreciated. Cheers to all.

Trying to update column values in SQL Server based on time of insertion and getting "The column "ID" was specified multiple times for "p""

Here is my query that's throwing the "The column "ID" was specified multiple times for "p"":
update tracking.tag set
tracking.tag.PageViewID = p.id
, tracking.tag.BrowserInfoID = p.BrowserInfoID
from (
select
t.id, t.[name], t.VisitID, t.CreatedDate, p.id, p.VisitID, p.BrowserInfoID
from [Tracking].[Tag] as t
inner join (
select id, visitid, BrowserInfoID, createddate, uri
from [tracking].[PageView]
) as p on abs(datediff(second, p.CreatedDate, t.createddate)) < 1 and p.VisitID = t.VisitID
order by 1 desc
) as p
I've seen quite a few questions with the same error on SO but can't seem to see what to apply in this scenario. Any help is greatly appreciated.
Unfortunately there is a lot broken with your statement. The error you are getting is the least of your worries and is in fact just a typo. Let me go through the problems.
The error: if you consider the following query, which is in essence what you have, how does SQL Server know which of the 2 columns in your sub-query to refer to? They are both called id! Hence if you need to select both columns you need to alias one of them to a unique name.
select id
from (
select
t.id, p.id
from [Tracking].[Tag] as t
inner join [tracking].[PageView] as p
on ABS(datediff(second, p.CreatedDate, t.createddate)) < 1
and p.VisitID = t.VisitID
) as p
Fixed:
select id -- Now we have a unique id column, so SQL Server knows which to select.
from (
select
t.id TagID, p.id
from [Tracking].[Tag] as t
inner join [tracking].[PageView] as p
on ABS(datediff(second, p.CreatedDate, t.createddate)) < 1
and p.VisitID = t.VisitID
) as p
You have a syntax error with your ORDER BY: SQL Server does not allow ORDER BY in a sub-query or derived table unless TOP, OFFSET or FOR XML is also specified, because ordering there doesn't mean anything.
This is a recommendation, but don't reuse the same table alias (in your case P) in multiple nested sub-queries, because it's really confusing to know which table or derived table you are referencing.
Your inner-most sub-query is unnecessary; just join the table directly.
Finally, you aren't actually joining the table you are updating onto the query you are producing. Yes, you do have a join inside, but that's not the same table reference as the one you are updating. I assume that's why you attempted to add an ORDER BY inside your sub-query despite the fact that it gives you a syntax error. In fact all you need is a simple UPDATE + JOIN as follows:
-- Note you use the table alias here for the update rather than the table name
update t set
PageViewID = p.id
, BrowserInfoID = p.BrowserInfoID
-- I assume this select is what you were running into issues with as you tried to test that your update was correct.
-- In this format you no longer need to alias the duplicate column names, but you could for clarity
-- select t.id TagID, t.[name], t.VisitID TagVisitId, t.CreatedDate, p.id, p.VisitID, p.BrowserInfoID
from [Tracking].[Tag] as t
inner join [tracking].[PageView] as p on abs(datediff(second, p.CreatedDate, t.createddate)) < 1 and p.VisitID = t.VisitID

How to use SELECT DISTINCT with RANDOM() function in PostgreSQL?

I am trying to run a SQL query to get four random items. As the table product_filter has more than one tuple per product, I have to use DISTINCT in the SELECT, so I get this error:
for SELECT DISTINCT, ORDER BY expressions must appear in select list
But if I put RANDOM() in my SELECT list, it defeats the DISTINCT.
Does anyone know how to use DISTINCT with the RANDOM() function? Below is my problematic query.
SELECT DISTINCT
p.id,
p.title
FROM
product_filter pf
JOIN product p ON pf.cod_product = p.cod
JOIN filters f ON pf.cod_filter = f.cod
WHERE
p.visible = TRUE
LIMIT 4
ORDER BY RANDOM();
You either do a subquery
SELECT * FROM (
SELECT DISTINCT p.cod, p.title ... JOIN... WHERE
) AS t ORDER BY RANDOM() LIMIT 4;
or you try GROUPing for those same fields:
SELECT p.cod, p.title, MIN(RANDOM()) AS o FROM ... JOIN ...
WHERE ... GROUP BY p.cod, p.title ORDER BY o LIMIT 4;
Which of the two will run faster depends on table structure and indexing. With proper indexing on cod and title, the subquery version is likely to run faster: cod is the only key needed for the JOIN, so if you index product by title, cod and visible (used in the WHERE), cod and title can be read from the index and the physical table may not be accessed at all.
I am not so sure whether this would happen with the second expression too.
You can simplify your query to avoid the problem a priori:
SELECT p.cod, p.title
FROM product p
WHERE p.visible
AND EXISTS (
SELECT 1
FROM product_filter pf
JOIN filters f ON f.cod = pf.cod_filter
WHERE pf.cod_product = p.cod
)
ORDER BY random()
LIMIT 4;
Major points:
You have only columns from table product in the result, other tables are only checked for existence of a matching row. For a case like this the EXISTS semi-join is likely the fastest and simplest solution. Using it does not multiply rows from the base table product, so you don't need to remove them again with DISTINCT.
LIMIT has to come last, after ORDER BY.
I simplified WHERE p.visible = 't' to p.visible, because this should be a boolean column.
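The points above can be sketched as follows, using Python's sqlite3 module in place of PostgreSQL (both accept random() and this LIMIT syntax). The rows are made up: products 1 to 6 each match two filters, product 6 is hidden, and product 7 has no filter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (cod INTEGER PRIMARY KEY, title TEXT, visible BOOLEAN);
CREATE TABLE filters (cod INTEGER PRIMARY KEY);
CREATE TABLE product_filter (cod_product INTEGER, cod_filter INTEGER);
INSERT INTO filters VALUES (10), (20);
INSERT INTO product VALUES
    (1, 'a', 1), (2, 'b', 1), (3, 'c', 1), (4, 'd', 1), (5, 'e', 1),
    (6, 'hidden', 0), (7, 'unfiltered', 1);
-- Products 1-6 each match both filters; product 7 matches none.
INSERT INTO product_filter
    SELECT p.cod, f.cod FROM product AS p, filters AS f WHERE p.cod <= 6;
""")

# Plain joins multiply each visible product by its number of filters.
joined = conn.execute("""
SELECT p.cod
FROM product_filter AS pf
JOIN product AS p ON pf.cod_product = p.cod
JOIN filters AS f ON pf.cod_filter = f.cod
WHERE p.visible
""").fetchall()

# EXISTS only tests for a match, so each product appears at most once
# and no DISTINCT is needed before ORDER BY random().
picked = conn.execute("""
SELECT p.cod, p.title
FROM product AS p
WHERE p.visible
AND EXISTS (
    SELECT 1
    FROM product_filter AS pf
    JOIN filters AS f ON f.cod = pf.cod_filter
    WHERE pf.cod_product = p.cod
)
ORDER BY random()
LIMIT 4
""").fetchall()

print(len(joined), picked)
```

The join version returns ten rows for five visible, filtered products, while the EXISTS version returns four distinct products chosen at random from those five.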
Use a subquery. Don't forget the table alias, t. LIMIT comes after ORDER BY.
SELECT *
FROM (SELECT DISTINCT a, b, c
FROM datatable WHERE a = 'hello'
) t
ORDER BY random()
LIMIT 10;
I think you need a subquery:
select *
from (select DISTINCT p.cod, p.title
from product_filter pf join
product p
on pf.cod_product = p.cod
where p.visible = 't'
) t
order by RANDOM()
LIMIT 4
Calculate the distinct values first, and then do the limit.
Do note, this does have performance implications, because this query does a distinct on everything before selecting what you want. Whether this matters depends on the size of your table and how you are using the query.
SELECT DISTINCT U.* FROM
(
SELECT p.cod, p.title FROM product_filter pf
JOIN product p on pf.cod_product = p.cod
JOIN filters f on pf.cod_filter = f.cod
WHERE p.visible = 't'
ORDER BY RANDOM()
) AS U
LIMIT 4
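This does the RANDOM first, then the LIMIT afterwards. Be aware that the outer DISTINCT is not guaranteed to preserve the ordering produced inside the subquery.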

SQL picking same entry multiple times

SELECT p.*
FROM StatusUpdates p
JOIN FriendRequests fr
ON (fr.From = p.AuthorId OR fr.To = p.AuthorId)
WHERE fr.To = ".$Id." OR fr.From = ".$Id."
AND fr.Accepted = 1
ORDER BY p.DatePosted DESC
I'm using this SQL code at the moment, which someone wrote for me on a different question. I'm using PHP, but that shouldn't make much difference, since the only thing I'm doing with it is concatenating a variable into it.
What it's meant to do is go through all your friends, get all their status posts, and order them. It works fine, but it picks out "$Id"'s own posts either not at all, or once for each friend you have.
E.g., if you had 5 friends, it would pick out your posts 5 times. I only want it to do this once. How could I do this?
You need to use LEFT JOIN instead of JOIN if you want to display your posts regardless of the number of friends, and GROUP BY p.id so that the same post is not displayed more than once:
SELECT p.*
FROM StatusUpdates p
LEFT JOIN FriendRequests fr
ON ((fr.From = p.AuthorId OR fr.To = p.AuthorId) AND fr.Accepted = 1)
WHERE p.AuthorId = ".$Id."
GROUP BY p.id
ORDER BY p.DatePosted DESC
Either GROUP BY p.id or SELECT DISTINCT p.id, p.*
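To illustrate where the duplicates come from, here is a small sketch on Python's sqlite3 module with made-up data: user 1 has two accepted friends (2 and 3), and user 4 is a stranger. It keeps the question's JOIN, adds parentheses around the OR in the WHERE clause so that the Accepted test applies to both branches (an assumption about the intent), and passes the id as a bound parameter instead of concatenating it. From and To are quoted because they are reserved words.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE StatusUpdates (id INTEGER PRIMARY KEY, AuthorId INTEGER, DatePosted TEXT);
CREATE TABLE FriendRequests ("From" INTEGER, "To" INTEGER, Accepted INTEGER);
INSERT INTO StatusUpdates VALUES
    (1, 1, '2024-01-01'),  -- the user's own post
    (2, 2, '2024-01-02'),  -- friend
    (3, 3, '2024-01-03'),  -- friend
    (4, 4, '2024-01-04');  -- not a friend
INSERT INTO FriendRequests VALUES (1, 2, 1), (3, 1, 1);
""")

user_id = 1

BASE = """
SELECT p.id
FROM StatusUpdates AS p
JOIN FriendRequests AS fr
    ON (fr."From" = p.AuthorId OR fr."To" = p.AuthorId)
WHERE (fr."To" = :uid OR fr."From" = :uid)
  AND fr.Accepted = 1
"""

# Without grouping, the user's own post matches every friendship row.
with_dupes = [row[0] for row in conn.execute(
    BASE + " ORDER BY p.DatePosted DESC", {"uid": user_id})]

# GROUP BY p.id collapses the repeats to one row per post.
deduped = [row[0] for row in conn.execute(
    BASE + " GROUP BY p.id ORDER BY p.DatePosted DESC", {"uid": user_id})]

print(with_dupes, deduped)
```

The user's own post matches once per friendship row because the user appears in every one of them, which is why it shows up as many times as there are friends until the rows are grouped.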

Approach to Selecting top item matching a criteria

EDIT: my apologies, this was a MSSQL2008 issue.
I have a SQL problem that I've come up against routinely, and normally just solved w/ a nested query. I'm hoping someone can suggest a more elegant solution.
It often happens that I need to select a result set for a user, conditioned upon it being the most recent, or the most sizeable or whatever.
For example: Their complete list of pages created, but I only want the most recent name they applied to a page. It so happens that the database contains many entries for each page, and only the most recent one is desired.
I've been using a nested select like:
SELECT pg.customName, pg.id
FROM (
select id, max(createdAt) as mostRecent
from pages
where userId = #UserId
GROUP BY id
) as MostRecentPages
JOIN pages pg
ON pg.id = MostRecentPages.id
AND pg.createdAt = MostRecentPages.mostRecent
Is there a better syntax to perform this selection?
Looks like you want
SELECT id, customname
FROM (SELECT id, customname,
row_number() OVER(PARTITION BY id ORDER BY createdat DESC) as pos
FROM pages
WHERE pages.userid = #UserId
) x
WHERE x.pos = 1
(I'm assuming you're using SQL Server from the #UserId parameter. row_number() will work for SQL Server 2005, and tbh the above should also work for Oracle, Postgresql 8.4...)
This will select all the pages by userid and work out which is the most recent using a sort. An alternative would be something like:
SELECT id, (SELECT TOP 1 customname
FROM pages pages_inner
WHERE pages_inner.id = pages_outer.id
ORDER BY pages_inner.createdat DESC) as customname
FROM (SELECT DISTINCT id FROM pages WHERE pages.userid = #UserId) pages_outer
Which approach is better depends on how many pages rows per id you have compared to pages per userid, I guess.
I'm not sure about better but a different syntax you could try is
SELECT pg.customName, pg.id
FROM pages pg
WHERE userId = #UserId
AND NOT EXISTS
(
SELECT * FROM pages pg2
WHERE pg2.UserId = pg.UserId
AND pg2.id = pg.id
AND pg2.createdAt > pg.createdAt
)
Another alternative would be an OUTER JOIN as in Bill Karwin's answer here How to get all the fields of a row using the SQL MAX function?
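A quick sketch of the NOT EXISTS form above, run on Python's sqlite3 module with made-up rows; the user id 7 and the page names are only for illustration, and a named parameter stands in for #UserId.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE pages (id INTEGER, userId INTEGER, customName TEXT, createdAt TEXT);
INSERT INTO pages VALUES
    (1, 7, 'home (old)', '2010-01-01'),
    (1, 7, 'home',       '2010-02-01'),
    (2, 7, 'about',      '2010-01-15'),
    (3, 8, 'other user', '2010-03-01');
""")

# Keep a row only if no newer row exists for the same user and page id.
rows = conn.execute("""
SELECT pg.customName, pg.id
FROM pages AS pg
WHERE pg.userId = :user_id
AND NOT EXISTS (
    SELECT * FROM pages AS pg2
    WHERE pg2.userId = pg.userId
    AND pg2.id = pg.id
    AND pg2.createdAt > pg.createdAt
)
ORDER BY pg.id
""", {"user_id": 7}).fetchall()

print(rows)
```

If two rows for the same page share the same createdAt, both survive the filter, so this form relies on timestamps being unique per page.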
For what database (including version)? What you posted could be MySQL, SQL Server, or Sybase...
Using:
SELECT pg.customName,
pg.id
FROM PAGES pg
JOIN (SELECT t.id,
MAX(t.createdAt) as mostRecent
FROM PAGES t
WHERE t.userid = #UserId
GROUP BY t.id) x ON x.id = pg.id
AND x.mostRecent = pg.createdAt
This is the best approach for a portable query, assuming column references are correct. But there are alternatives for limiting the data set: SQL Server uses TOP, MySQL/PostgreSQL/SQLite use LIMIT, Oracle uses ROWNUM.
What's best depends on your data & how the respective optimizer sees it, and your needs (portable vs not). Check the explain plan for the respective database to see how efficient the query is.
Are you using Oracle? Try to see if this query that uses analytic function would work for you. (Don't have access to db right now, so can't test myself.)
SELECT DISTINCT pg.id,
FIRST_VALUE(pg.customName) OVER (PARTITION BY pg.id ORDER BY pg.createdAt DESC) AS customName
FROM pages pg
Assuming SQL Server and your Pages table like so:
CREATE TABLE Pages (
Id int IDENTITY(1, 1) PRIMARY KEY
, CustomName nvarchar(20) NOT NULL
, CreatedAt datetime NOT NULL DEFAULT GETDATE()
, UserId int references Users(Id)
)
I would do the following:
select TOP 1 p.Id as PageId
, p.CustomName
from Pages p
where p.UserId = #UserId
order by p.CreatedAt desc
Or even:
select TOP 1 p.Id as PageId
, p.CustomName
, MAX(p.CreatedAt) DateTimeCreated
from Pages p
where p.UserId = #UserId
group by p.Id
, p.CustomName
order by DateTimeCreated desc
I hope this helps! (If not, please provide further details so that we can be of more help.)
I don't know what your table looks like
Select top 1 pg.createdAt
,pg.customName
,pg.id
from pages pg
where pg.UserId = #UserId
order by pg.createdAt Desc
I need a bit more info on your table(s)