Selecting within a selection using SQL from a single table - sql

I have to make a query inside another query in order to find entries in a table that have characteristics but not others. The characteristics are derived from a connection to another table.
Basically, I have a plans table and a parcels table. I need to find the plans that relate to both (building strata, bareland strata, common ownership) and (road, subdivision, park, interest). These plans should contain entries in one list, but not both.
Here is what I have so far.
SELECT *
FROM parcelfabric_plans
WHERE
(name in
(select pl.name from parcelfabric_parcels p inner join
parcelfabric_plans pl on p.planid = pl.objectid
WHERE
p.parcelclass IN ( 'ROAD', 'SUBDIVISION', 'PARK', 'INTEREST')))
This is the first query, which gets all the plans that have parcels related to them in this list. How do I query this selection to get plans within this selection that are also related to the second list (subdivisions, interests, roads, parks)?
This query returns 268983 results of plans. Of these results, I would like to be able to query them and get the number of plans that are also related to subdivisions, interests, roads, parks.

This would require elements from both lists:
select pl.name
from parcelfabric_plans pl
where exists (
select 1 from parcelfabric_parcels p
where p.planid = pl.objectid
and p.parcelclass in ('ROAD', 'SUBDIVISION', 'PARK', 'INTEREST')
) and exists (
select 1 from parcelfabric_parcels p
where p.planid = pl.objectid
and p.parcelclass in (<list 2>)
)
I'm not clear about the requirement though. If you want them to be mutually exclusive then I think this is a better idea:
with data as (
select p.planid,
count(case when p.parcelclass in
('ROAD', 'SUBDIVISION', 'PARK', 'INTEREST') then 1 end) as cnt1,
count(case when p.parcelclass in
(<list 2>) then 1 end) as cnt2
from parcelfabric_plans pl inner join parcelfabric_parcels p
on p.planid = pl.objectid
-- possible optimization
/* where p.parcelclass in (<combined list>) */
group by p.planid
)
select * from data
where cnt1 > 0 and cnt2 = 0 or cnt1 = 0 and cnt2 > 0;

I would like to thank everyone for their comments and answers. I figured out a solution, though it is quite clunky. But at least it works.
SELECT *
FROM pmbcprod.pmbcowner.ParcelFabric_Plans
WHERE
(name in
(select pl.name from parcelfabric_parcels p inner join parcelfabric_plans pl on p.planid = pl.objectid
WHERE
p.parcelclass IN ('ROAD','INTEREST','SUBDIVISION','PARK')
)and name in
(select pl.name from parcelfabric_parcels p inner join parcelfabric_plans pl on p.planid = pl.objectid
WHERE
p.parcelclass IN ('BUILDING STRATA','COMMON OWNERSHIP','BARE LAND STRATA')
)
)
What I was after was simpler than I thought, I just needed to wrap my head around the structure. It's basically a nested query (subquery?). The inner query is made, then the next one is formed around it.
Again, thank you and it is much appreciated. Cheers to all.

Related

Sum fields of an Inner join

How I can add two fields that belong to an inner join?
I have this code:
select
SUM(ACT.NumberOfPlants ) AS NumberOfPlants,
SUM(ACT.NumOfJornales) AS NumberOfJornals
FROM dbo.AGRMastPlanPerformance MPR (NOLOCK)
INNER JOIN GENRegion GR ON (GR.intGENRegionKey = MPR.intGENRegionLink )
INNER JOIN AGRDetPlanPerformance DPR (NOLOCK) ON
(DPR.intAGRMastPlanPerformanceLink =
MPR.intAGRMastPlanPerformanceKey)
INNER JOIN vwGENPredios P ​​(NOLOCK) ON ( DPR.intGENPredioLink =
P.intGENPredioKey )
INNER JOIN AGRSubActivity SA (NOLOCK) ON (SA.intAGRSubActivityKey =
DPR.intAGRSubActivityLink)
LEFT JOIN (SELECT RA.intGENPredioLink, AR.intAGRActividadLink,
AR.intAGRSubActividadLink, SUM(AR.decNoPlantas) AS
intPlantasTrabajads, SUM(AR.decNoPersonas) AS NumOfJornales,
SUM(AR.decNoPlants) AS NumberOfPlants
FROM AGRRecordActivity RA WITH (NOLOCK)
INNER JOIN AGRActividadRealizada AR WITH (NOLOCK) ON
(AR.intAGRRegistroActividadLink = RA.intAGRRegistroActividadKey AND
AR.bitActivo = 1)
INNER JOIN AGRSubActividad SA (NOLOCK) ON (SA.intAGRSubActividadKey
= AR.intAGRSubActividadLink AND SA.bitEnabled = 1)
WHERE RA.bitActive = 1 AND
AR.bitActive = 1 AND
RA.intAGRTractorsCrewsLink IN(2)
GROUP BY RA.intGENPredioLink,
AR.decNoPersons,
AR.decNoPlants,
AR.intAGRAActivityLink,
AR.intAGRSubActividadLink) ACT ON (ACT.intGENPredioLink IN(
DPR.intGENPredioLink) AND
ACT.intAGRAActivityLink IN( DPR.intAGRAActivityLink) AND
ACT.intAGRSubActivityLink IN( DPR.intAGRSubActivityLink))
WHERE
MPR.intAGRMastPlanPerformanceKey IN(4) AND
DPR.intAGRSubActivityLink IN( 1153)
GROUP BY
P.vchRegion,
ACT.NumberOfFloors,
ACT.NumOfJournals
ORDER BY ACT.NumberOfFloors DESC
However, it does not perform the complete sum. It only retrieves all the values ​​of the columns and adds them 1 by 1, instead of doing the complete sum of the whole column.
For example, the query returns these results:
What I expect is the final sums. In NumberOfPlants the result of the sum would be 163,237 and of NumberJornales would be 61.
How can I do this?
First of all the (nolock) hints are probably not accomplishing the benefit you hope for. It's not an automatic "go faster" option, and if such an option existed you can be sure it would be already enabled. It can help in some situations, but the way it works allows the possibility of reading stale data, and the situations where it's likely to make any improvement are the same situations where risk for stale data is the highest.
That out of the way, with that much code in the question we're better served with a general explanation and solution for you to adapt.
The issue here is GROUP BY. When you use a GROUP BY in SQL, you're telling the database you want to see separate results per group for any aggregate functions like SUM() (and COUNT(), AVG(), MAX(), etc).
So if you have this:
SELECT Sum(ColumnB) As SumB
FROM [Table]
GROUP BY ColumnA
You will get a separate row per ColumnA group, even though it's not in the SELECT list.
If you don't really care about that, you can do one of two things:
Remove the GROUP BY If there are no grouped columns in the SELECT list, the GROUP BY clause is probably not accomplishing anything important.
Nest the query
If option 1 is somehow not possible (say, the original is actually a view) you could do this:
SELECT SUM(SumB)
FROM (
SELECT Sum(ColumnB) As SumB
FROM [Table]
GROUP BY ColumnA
) t
Note in both cases any JOIN is irrelevant to the issue.

Designing and Querying Product / Review system

I created a product / review system from scratch and I´m having a hard time to do the following query in SQL Server.
My schema has different tables, for: products, reviews, categories, productPhotos and Brand. I have to query them all to find the brand and category name, photos details, Average Rating and Number of Reviews.
I´m having a hard time to get No. of reviews and average rating.
Reviews can be hidden (user has deleted) or blocked (waiting for moderation). My product table doesn't have No. of Reviews or Average Rating columns, so I need to count it on that query, but not counting the blocked and hidden ones (r.bloqueado=0 and r.hidden=0).
I have the query below, but it´s counting the blocked and hidden. If I uncomment the "and r.bloqueado=0 and r.hidden=0" part I get the right counting, but then it doesn't show products that has 0 reviews (something I need!).
select top 20
p.id, p.brand, m.nome, c.name,
count(r.product) AS NoReviews, Avg(r.nota) AS AvgRating,
f.id as cod_foto,f.nome as nome_foto
from
tblBrands AS m
inner join
(tblProducts AS p
left join
tblProductsReviews AS r ON p.id = r.product) ON p.brand = m.id
left join
tblProductsCategorias as c on p.categoria = c.id
left join
(select
id_product, id, nome
from
tblProductsFotos O
where
id = (SELECT min(I.id)
FROM tblProductsFotos I
WHERE I.id_product = O.id_product)) as f on p.id = f.id_product
where
p.bloqueado = 0
//Problem - and r.bloqueado=0 and r.hidden=0
group by
p.id, p.brand, p.modalidade, m.nome, c.name, f.id,f.nome"
Need your advice:
I have seen other systems that has Avg Rating and No. of Reviews in the product table. This would help a lot in the complexity of this query (probably also performance), but then I have to do extra queries in every new review, blocked and hidden actions. I can easily to that. Considering that includes and updates occurs much much less than showing the products, this sounds nice.
Would be a better idea to do that ?
Or is it better to find a way to fix this query ? Can you help me find a solution ?
Thanks
For count the number of product you can use case when and sum assigning 1 there the value is not r.bloqueado=0 or r.hidden=0 and 0 for these values (so you can avoid the filter in where)
"select top 20 p.id, p.brand, m.nome, c.name, sum(
case when r.bloqueado=0 then 0
when r.hidden=0 then 0
else 1
end ) AS NoReviews,
Avg(r.nota) AS AvgRating, f.id as cod_foto,f.nome as nome_foto
from tblBrands AS m
inner join (tblProducts AS p
left join tblProductsReviews AS r ON p.id=r.product ) ON p.brand = m.id
left join tblProductsCategorias as c on p.categoria=c.id
left join (select id_product,id,nome from tblProductsFotos O
where id = (SELECT min(I.id) FROM tblProductsFotos I
WHERE I.id_product = O.id_product)) as f on p.id = f.id_product where p.bloqueado=0
group by p.id, p.brand, p.modalidade, m.nome, c.name, f.id,f.nome"
for avg could be you can do somethings similar
It's very easy to lose records when combining a where clause with an outer join. Rows that do not exist in the outer table are returned as NULL. Your filter has accidentally excluded these nulls.
Here's an example that demonstrates what's happening:
/* Sample data.
* There are two tables: product and review.
* There are two products: 1 & 2.
* Only product 1 has a review.
*/
DECLARE #Product TABLE
(
ProductId INT
)
;
DECLARE #Review TABLE
(
ReviewId INT,
ProductId INT,
Blocked BIT
)
;
INSERT INTO #Product
(
ProductId
)
VALUES
(1),
(2)
;
INSERT INTO #Review
(
ReviewId,
ProductId,
Blocked
)
VALUES
(1, 1, 0)
;
Outer joining the tables, without a where clause, returns:
Query
-- No where.
SELECT
p.ProductId,
r.ReviewId,
r.Blocked
FROM
#Product AS p
LEFT OUTER JOIN #Review AS r ON r.ProductId = p.ProductId
;
Result
ProductId ReviewId Blocked
1 1 0
2 NULL NULL
Filtering for Blocked = 0 would remove the second record, and therefore ProductId 2. Instead:
-- With where.
SELECT
p.ProductId,
r.ReviewId,
r.Blocked
FROM
#Product AS p
LEFT OUTER JOIN #Review AS r ON r.ProductId = p.ProductId
WHERE
r.Blocked = 0
OR r.Blocked IS NULL
;
This query retains the NULL value, and ProductId 2. Your example is a little more complicated because you have two fields.
SELECT
...
WHERE
(
Blocked = 0
AND Hidden = 0
)
OR Blocked IS NULL
;
You do not need to check both fields for NULL, as they appear in the same table.

Unable to Group on MSAccess SQL multiple search query

please can you help me before I go out of my mind. I've spent a while on this now and resorted to asking you helpful wonderful people. I have a search query:
SELECT Groups.GroupID,
Groups.GroupName,
( SELECT Sum(SiteRates.SiteMonthlySalesValue)
FROM SiteRates
WHERE InvoiceSites.SiteID = SiteRates.SiteID
) AS SumOfSiteRates,
( SELECT Count(InvoiceSites.SiteID)
FROM InvoiceSites
WHERE SiteRates.SiteID = InvoiceSites.SiteID
) AS CountOfSites
FROM (InvoiceSites
INNER JOIN (Groups
INNER JOIN SitesAndGroups
ON Groups.GroupID = SitesAndGroups.GroupID
) ON InvoiceSites.SiteID = SitesAndGroups.SiteID)
INNER JOIN SiteRates
ON InvoiceSites.SiteID = SiteRates.SiteID
GROUP BY Groups.GroupID
With the following table relationship
http://m-ls.co.uk/ExtFiles/SQL-Relationship.jpg
Without the GROUP BY entry I can get a list of the entries I want but it drills the results down by SiteID where instead I want to GROUP BY the GroupID. I know this is possible but lack the expertise to complete this.
Any help would be massively appreciated.
I think all you need to do is add groups.Name to the GROUP BY clause, however I would adopt for a slightly different approach and try to avoid the subqueries if possible. Since you have already joined to all the required tables you can just use normal aggregate functions:
SELECT Groups.GroupID,
Groups.GroupName,
SUM(SiteRates.SiteMonthlySalesValue) AS SumOfSiteRates,
COUNT(InvoiceSites.SiteID) AS CountOfSites
FROM (InvoiceSites
INNER JOIN (Groups
INNER JOIN SitesAndGroups
ON Groups.GroupID = SitesAndGroups.GroupID
) ON InvoiceSites.SiteID = SitesAndGroups.SiteID)
INNER JOIN SiteRates
ON InvoiceSites.SiteID = SiteRates.SiteID
GROUP BY Groups.GroupID, Groups.GroupName;
I think what you are looking for is something like the following:
SELECT Groups.GroupID, Groups.GroupName, SumResults.SiteID, SumResults.SumOfSiteRates, SumResults.CountOfSites
FROM Groups INNER JOIN
(
SELECT SitesAndGroups.SiteID, Sum(SiteRates.SiteMonthlySalesValue) AS SumOfSiteRates, Count(InvoiceSites.SiteID) AS CountOfSites
FROM SitesAndGroups INNER JOIN (InvoiceSites INNER JOIN SiteRates ON InvoiceSites.SiteID = SiteRates.SiteID) ON SitesAndGroups.SiteID = InvoiceSites.SiteID
GROUP BY SitesAndGroups.SiteID
) AS SumResults ON Groups.SiteID = SumResults.SiteID
This query will group your information based on the SiteID like you want. That query is referenced in the from statement linking to the Groups table to pull the group information that you want.

A messy SQL statement

I have a case where I wanna choose any database entry that have an invalid Country, Region, or Area ID, by invalid, I mean an ID for a country or region or area that no longer exists in my tables, I have four tables: Properties, Countries, Regions, Areas.
I was thinking to do it like this:
SELECT * FROM Properties WHERE
Country_ID NOT IN
(
SELECT CountryID FROM Countries
)
OR
RegionID NOT IN
(
SELECT RegionID FROM Regions
)
OR
AreaID NOT IN
(
SELECT AreaID FROM Areas
)
Now, is my query right? and what do you suggest that i can do and achieve the same result with better performance?!
Your query in fact is optimal.
LEFT JOIN's proposed by others are worse, as they select ALL values and then filter them out.
Most probably your subquery will be optimized to this:
SELECT *
FROM Properties p
WHERE NOT EXISTS
(
SELECT 1
FROM Countries i
WHERE i.CountryID = p.CountryID
)
OR
NOT EXISTS
(
SELECT 1
FROM Regions i
WHERE i.RegionID = p.RegionID
)
OR
NOT EXISTS
(
SELECT 1
FROM Areas i
WHERE i.AreaID = p.AreaID
)
, which you should use.
This query selects at most 1 row from each table, and jumps to the next iteration right as it finds this row (i. e. if it does not find a Country for a given Property, it will not even bother checking for a Region).
Again, SQL Server is smart enough to build the same plan for this query and your original one.
Update:
Tested on 512K rows in each table.
All corresponding ID's in dimension tables are CLUSTERED PRIMARY KEY's, all measure fields in Properties are indexed.
For each row in Property, PropertyID = CountryID = RegionID = AreaID, no actual missing rows (worst case in terms of execution time).
NOT EXISTS 00:11 (11 seconds)
LEFT JOIN 01:08 (68 seconds)
You could rewrite it differently as follows:
SELECT p.*
FROM Properties p
LEFT JOIN Countries c ON p.Country_ID = c.CountryID
LEFT JOIN Regions r on p.RegionID = r.RegionID
LEFT JOIN Areas a on p.AreaID = a.AreaID
WHERE c.CountryID IS NULL
OR r.RegionID IS NULL
OR a.AreaID IS NULL
Test the performance difference (if there is any - there should be as NOT IN is a nasty search, especially over a lot of items as it HAS to test every single one).
You can also make this faster by indexing the IDS being searched - in each master table (Country, Region, Area) they should be clustered primary keys.
Since this seems to be cleanup sql, this should be ok. But how about using foreign keys so that it does not bother you next time around?
Well, you could try things like UNION (instead of OR) - but I expect that the optimizer is already doing the best it can given the information available:
SELECT * FROM Properties
WHERE NOT EXISTS (SELECT 1 FROM Areas WHERE Areas.AreaID = Properties.AreaID)
UNION
SELECT * FROM Properties
WHERE NOT EXISTS (SELECT 1 FROM Regions WHERE Regions.RegionID = Properties.RegionID)
UNION
SELECT * FROM Properties
WHERE NOT EXISTS (SELECT 1 FROM Countries WHERE Countries.CountryID = Properties.CountryID)
Subqueries in the conditions can be quite inefficient. Instead you can do left joins against the related tables. Where there are no matching record you get a null value. You can use this in the condition to select only the records where there is a matching record missing:
select p.*
from Properties p
left join Countries c on c.CountryID = p.Country_ID
left join Regions r on r.RegionID = p.RegionID
left join Areas a on a.AreaID = p.AreaID
where c.CountryID is null or r.RegionID is null or a.AreaID is null
If you're not grabbing the row data from countries/regions/areas you can try using "exists":
SELECT Properties.*
FROM Properties
WHERE Properties.CountryID IS NOT NULL AND NOT EXISTS (SELECT 1 FROM Countries WHERE Countries.CountryID = Properties.CountryID)
OR Properties.RegionID IS NOT NULL AND NOT EXISTS (SELECT 1 FROM Regions WHERE Regions.RegionID = Properties.RegionID)
OR Properties.AreaID IS NOT NULL AND NOT EXISTS (SELECT 1 FROM Areas WHERE Areas.AreaID = Properties.AreaID)
This will typically hint to use the pkey indices of countries et al for the existence check... but whether that is an improvement depends on your data stats, you simply have to plug it into query analyzer and try it.

SQL query performance question (multiple sub-queries)

I have this query:
SELECT p.id, r.status, r.title
FROM page AS p
INNER JOIN page_revision as r ON r.pageId = p.id AND (
r.id = (SELECT MAX(r2.id) from page_revision as r2 WHERE r2.pageId = r.pageId AND r2.status = 'active')
OR r.id = (SELECT MAX(r2.id) from page_revision as r2 WHERE r2.pageId = r.pageId)
)
Which returns each page and the latest active revision for each, unless no active revision is available, in which case it simply returns the latest revision.
Is there any way this can be optimised to improve performance or just general readability? I'm not having any issues right now, but my worry is that when this gets into a production environment (where there could be a lot of pages) it's going to perform badly.
Also, are there any obvious problems I should be aware of? The use of sub-queries always bugs me, but to the best of my knowledge this cant be done without them.
Note:
The reason the conditions are in the JOIN rather than a WHERE clause is that in other queries (where this same logic is used) I'm LEFT JOINing from the "site" table to the "page" table, and If no pages exist I still want the site returned.
Jack
Edit: I'm using MySQL
If "active" is the first in alphabetical order you migt be able to reduce subqueries to:
SELECT p.id, r.status, r.title
FROM page AS p
INNER JOIN page_revision as r ON r.pageId = p.id AND
r.id = (SELECT r2.id
FROM page_revision as r2
WHERE r2.pageId = r.pageId
ORDER BY r2.status, r2.id DESC
LIMIT 1)
Otherwise you can replace ORDER BY line with
ORDER BY CASE r2.status WHEN 'active' THEN 0 ELSE 1 END, r2.id DESC
These all come from my assumptions on SQL Server, your mileage with MySQL may vary.
Maybe a little re-factoring is in order?
If you added a latest_revision_id column onto pages your problem would disappear, hopefully with only a couple of lines added to your page editor.
I know it's not normalized but it would simplify (and greatly speed up) the query, and sometimes you do have to denormalize for performance.
Your problem is a particular case of what is described in this question.
The best you can get using standard ANSI SQL seems to be:
SELECT p.id, r.status, r.title
FROM page AS p
INNER JOIN page_revision as r ON r.pageId = p.id
AND r.id = (SELECT MAX(r2.id) from page_revision as r2 WHERE r2.pageId = r.pageId)
Other approaches are available but dependent on what database you're using. I'm not really sure it can be improved much for MySQL.
In MS SQL 2005+ and Oracle:
SELECT p.id, r.status, r.title
FROM (
SELECT p.*, r,*,
ROW_NUMBER() OVER (PARTITION BY p.pageId ORDER BY CASE WHEN p.status = 'active' THEN 0 ELSE 1 END, r.id DESC) AS rn
FROM page AS p, page_revision r
WHERE r.id = p.pageId
) o
WHERE rn = 1
In MySQL that can become a problem, as subqueries cannot use the INDEX RANGE SCAN as the expression from the outer query is not considered constant.
You'll need to create two indexes and a function that returns the last page revision to use those indexes:
CREATE INDEX ix_revision_page_status_id ON page_revision (page_id, id, status);
CREATE INDEX ix_revision_page_id (page_id, id);
CREATE FUNCTION `fn_get_last_revision`(input_id INT) RETURNS int(11)
BEGIN
DECLARE id INT;
SELECT r_id
INTO id
FROM (
SELECT r.id
FROM page_revisions
FORCE INDEX (ix_revision_page_status_id)
WHERE page_id = input_id
AND status = 'active'
ORDER BY id DESC
LIMIT 1
UNION ALL
SELECT r.id
FROM page_revisions
FORCE INDEX (ix_revision_page_id)
WHERE page_id = input_id
ORDER BY id DESC
LIMIT 1
) o
LIMIT 1;
RETURN id;
END;
SELECT po.id, r.status, r.title
FROM (
SELECT p.*, fn_get_last_revision(p.page_id) AS rev_id
FROM page p
) po, page_revision r
WHERE r.id = po.rev_id;
This will efficiently use index to get the last revision of the page.
P. S. If you will use codes for statuses and use 0 for active, you can get rid of the second index and simplify the query.