Efficiently pull different columns using a common correlated subquery - sql

I need to pull multiple columns from a subquery which also requires a WHERE filter referencing columns of the FROM table. I have a couple of questions about this:
Is there another solution to this problem besides mine below?
Is another solution even necessary or is this solution efficient enough?
Example:
In the following example I'm writing a view to present test scores, particularly to discover failures that may need to be addressed or retaken.
I cannot simply use JOIN because I need to filter my actual subquery first (notice I'm getting TOP 1 for the "examinee", sorted either by score or date descending)
My goal is to avoid writing (and executing) essentially the same subquery repeatedly.
SELECT ExamineeID, LastName, FirstName, Email,
(SELECT COUNT(examineeTestID)
FROM exam.ExamineeTest tests
WHERE E.ExamineeID = ExamineeID AND TestRevisionID = 3 AND TestID = 2) Attempts,
(SELECT TOP 1 ExamineeTestID
FROM exam.ExamineeTest T
WHERE E.ExamineeID = ExamineeID AND TestRevisionID = 3 AND TestID = 2
ORDER BY Score DESC) bestExamineeTestID,
(SELECT TOP 1 Score
FROM exam.ExamineeTest T
WHERE E.ExamineeID = ExamineeID AND TestRevisionID = 3 AND TestID = 2
ORDER BY Score DESC) bestScore,
(SELECT TOP 1 DateDue
FROM exam.ExamineeTest T
WHERE E.ExamineeID = ExamineeID AND TestRevisionID = 3 AND TestID = 2
ORDER BY Score DESC) bestDateDue,
(SELECT TOP 1 TimeCommitted
FROM exam.ExamineeTest T
WHERE E.ExamineeID = ExamineeID AND TestRevisionID = 3 AND TestID = 2
ORDER BY Score DESC) bestTimeCommitted,
(SELECT TOP 1 ExamineeTestID
FROM exam.ExamineeTest T
WHERE E.ExamineeID = ExamineeID AND TestRevisionID = 3 AND TestID = 2
ORDER BY DateDue DESC) currentExamineeTestID,
(SELECT TOP 1 Score
FROM exam.ExamineeTest T
WHERE E.ExamineeID = ExamineeID AND TestRevisionID = 3 AND TestID = 2
ORDER BY DateDue DESC) currentScore,
(SELECT TOP 1 DateDue
FROM exam.ExamineeTest T
WHERE E.ExamineeID = ExamineeID AND TestRevisionID = 3 AND TestID = 2
ORDER BY DateDue DESC) currentDateDue,
(SELECT TOP 1 TimeCommitted
FROM exam.ExamineeTest T
WHERE E.ExamineeID = ExamineeID AND TestRevisionID = 3 AND TestID = 2
ORDER BY DateDue DESC) currentTimeCommitted
FROM exam.Examinee E

To answer your second question first: yes, a better way is in order. The query you're using is hard to understand and hard to maintain, and even if its performance is acceptable now, it's a shame to query the same table multiple times when you don't need to. Performance may also stop being acceptable if your application ever grows to an appreciable size.
To answer your first question, I have a few methods for you. These assume SQL Server 2005 or later unless otherwise noted.
Note that you don't need BestExamineeID and CurrentExamineeID because they will always be the same as ExamineeID unless no tests were taken and they're NULL, which you can tell from the other columns being NULL.
You can think of OUTER/CROSS APPLY as an operator that lets you move correlated subqueries from the WHERE clause into the JOIN clause. They can have an outer reference to a previously-named table, and can return more than one column. This enables you to do the job only once per logical query rather than once for each column.
SELECT
ExamineeID,
LastName,
FirstName,
Email,
B.Attempts,
BestScore = B.Score,
BestDateDue = B.DateDue,
BestTimeCommitted = B.TimeCommitted,
CurrentScore = C.Score,
CurrentDateDue = C.DateDue,
CurrentTimeCommitted = C.TimeCommitted
FROM
exam.Examinee E
OUTER APPLY ( -- change to CROSS APPLY if you only want examinees who've tested
SELECT TOP 1
Score, DateDue, TimeCommitted,
Attempts = Count(*) OVER ()
FROM exam.ExamineeTest T
WHERE
E.ExamineeID = T.ExamineeID
AND T.TestRevisionID = 3
AND T.TestID = 2
ORDER BY Score DESC
) B
OUTER APPLY ( -- change to CROSS APPLY if you only want examinees who've tested
SELECT TOP 1
Score, DateDue, TimeCommitted
FROM exam.ExamineeTest T
WHERE
E.ExamineeID = T.ExamineeID
AND T.TestRevisionID = 3
AND T.TestID = 2
ORDER BY DateDue DESC
) C
You should experiment to see if my Count(*) OVER () is better than having an additional OUTER APPLY that just gets the count. If you're not restricting the Examinee from the exam.Examinee table, it may be better to just do a normal aggregate in a derived table.
Here's another method that (sort of) gets all the data in one swoop. It could conceivably perform better than the other queries, but in my experience windowing functions can get surprisingly expensive in some situations, so testing is in order.
WITH Data AS (
SELECT
*,
Count(*) OVER (PARTITION BY ExamineeID) Cnt,
Row_Number() OVER (PARTITION BY ExamineeID ORDER BY Score DESC) ScoreOrder,
Row_Number() OVER (PARTITION BY ExamineeID ORDER BY DateDue DESC) DueOrder
FROM
exam.ExamineeTest
WHERE
TestRevisionID = 3
AND TestID = 2
), Vals AS (
SELECT
ExamineeID,
Max(Cnt) Attempts,
Max(CASE WHEN ScoreOrder = 1 THEN Score ELSE NULL END) BestScore,
Max(CASE WHEN ScoreOrder = 1 THEN DateDue ELSE NULL END) BestDateDue,
Max(CASE WHEN ScoreOrder = 1 THEN TimeCommitted ELSE NULL END) BestTimeCommitted,
Max(CASE WHEN DueOrder = 1 THEN Score ELSE NULL END) CurrentScore,
Max(CASE WHEN DueOrder = 1 THEN DateDue ELSE NULL END) CurrentDateDue,
Max(CASE WHEN DueOrder = 1 THEN TimeCommitted ELSE NULL END) CurrentTimeCommitted
FROM Data
GROUP BY
ExamineeID
)
SELECT
E.ExamineeID,
E.LastName,
E.FirstName,
E.Email,
V.Attempts,
V.BestScore, V.BestDateDue, V.BestTimeCommitted,
V.CurrentScore, V.CurrentDateDue, V.CurrentTimeCommitted
FROM
exam.Examinee E
LEFT JOIN Vals V ON E.ExamineeID = V.ExamineeID
-- change join to INNER if you only want examinees who've tested
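To try the ROW_NUMBER() idea without a SQL Server instance, here is a minimal sketch that runs the same pattern through Python's built-in sqlite3 module. The table contents are made up, the table is called ExamineeTest because SQLite has no exam schema, TimeCommitted is left out for brevity, and the TestRevisionID/TestID filter is included to match the question's subqueries. It needs SQLite 3.25 or later for window functions.

```python
import sqlite3

# Toy stand-in for exam.ExamineeTest; every row here is hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ExamineeTest (
    ExamineeTestID INTEGER PRIMARY KEY,
    ExamineeID INTEGER, TestID INTEGER, TestRevisionID INTEGER,
    Score INTEGER, DateDue TEXT
);
INSERT INTO ExamineeTest VALUES
    (1, 10, 2, 3, 55, '2012-01-01'),
    (2, 10, 2, 3, 90, '2012-02-01'),
    (3, 10, 2, 3, 70, '2012-03-01'),
    (4, 20, 2, 3, 40, '2012-01-15'),
    (5, 20, 9, 1, 99, '2012-04-01');
""")
# Row 5 belongs to a different test, so the WHERE clause must exclude it.

query = """
WITH Data AS (
    SELECT ExamineeID, Score, DateDue,
           COUNT(*) OVER (PARTITION BY ExamineeID) AS Cnt,
           ROW_NUMBER() OVER (PARTITION BY ExamineeID ORDER BY Score DESC) AS ScoreOrder,
           ROW_NUMBER() OVER (PARTITION BY ExamineeID ORDER BY DateDue DESC) AS DueOrder
    FROM ExamineeTest
    WHERE TestRevisionID = 3 AND TestID = 2
)
SELECT ExamineeID,
       MAX(Cnt) AS Attempts,
       MAX(CASE WHEN ScoreOrder = 1 THEN Score END) AS BestScore,
       MAX(CASE WHEN ScoreOrder = 1 THEN DateDue END) AS BestDateDue,
       MAX(CASE WHEN DueOrder = 1 THEN Score END) AS CurrentScore,
       MAX(CASE WHEN DueOrder = 1 THEN DateDue END) AS CurrentDateDue
FROM Data
GROUP BY ExamineeID
ORDER BY ExamineeID
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Each examinee comes back as a single row: the "best" columns are taken from the row ranked first by Score, the "current" columns from the row ranked first by DateDue, and the table is read once.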
Finally, here's a SQL 2000 method:
SELECT
E.ExamineeID,
E.LastName,
E.FirstName,
E.Email,
Y.Attempts,
Y.BestScore, Y.BestDateDue, Y.BestTimeCommitted,
Y.CurrentScore, Y.CurrentDateDue, Y.CurrentTimeCommitted
FROM
exam.Examinee E
LEFT JOIN ( -- change to inner if you only want examinees who've tested
SELECT
X.ExamineeID,
X.Cnt Attempts,
Max(CASE Y.Which WHEN 1 THEN T.Score ELSE NULL END) BestScore,
Max(CASE Y.Which WHEN 1 THEN T.DateDue ELSE NULL END) BestDateDue,
Max(CASE Y.Which WHEN 1 THEN T.TimeCommitted ELSE NULL END) BestTimeCommitted,
Max(CASE Y.Which WHEN 2 THEN T.Score ELSE NULL END) CurrentScore,
Max(CASE Y.Which WHEN 2 THEN T.DateDue ELSE NULL END) CurrentDateDue,
Max(CASE Y.Which WHEN 2 THEN T.TimeCommitted ELSE NULL END) CurrentTimeCommitted
FROM
(
SELECT ExamineeID, Max(Score) MaxScore, Max(DateDue) MaxDueDate, Count(*) Cnt
FROM exam.ExamineeTest
WHERE
TestRevisionID = 3
AND TestID = 2
GROUP BY ExamineeID
) X
CROSS JOIN (SELECT 1 UNION ALL SELECT 2) Y (Which)
INNER JOIN exam.ExamineeTest T
ON X.ExamineeID = T.ExamineeID
AND (
(Y.Which = 1 AND X.MaxScore = T.Score)
OR (Y.Which = 2 AND X.MaxDueDate = T.DateDue)
)
WHERE
T.TestRevisionID = 3
AND T.TestID = 2
GROUP BY
X.ExamineeID,
X.Cnt
) Y ON E.ExamineeID = Y.ExamineeID
This query will return unexpected results if the combination of (ExamineeID, Score) or (ExamineeID, DateDue) can match multiple rows, because the Max() expressions would then mix values from the tied rows. That's probably not unlikely with Score. If neither is unique, then you need to use (or add) some additional column that can grant uniqueness so it can be used to select one row. If only Score can be duplicated, then an additional pre-query that gets the max Score first, dovetailed with the max DateDue, would pull the most recent score that was a tie for the highest at the same time as getting the most recent data. Let me know if you need more SQL 2000 help.
Note: The biggest thing that is going to control whether CROSS APPLY or a ROW_NUMBER() solution is better is whether you have an index on the columns that are being looked up and whether the data is dense or sparse.
Index + you're pulling only a few examinees with lots of tests each = CROSS APPLY wins.
Index + you're pulling a huge number of examinees with only a few tests each = ROW_NUMBER() wins.
No index = string concatenation/value packing method wins (not shown here).
The group by solution that I gave for SQL 2000 will probably perform the worst, but not guaranteed. Like I said, testing is in order.
If any of my queries do give performance problems let me know and I'll see what I can do to help. I'm sure I probably have typos as I didn't work up any DDL to recreate your tables, but I did my best without trying it.
If performance really does become crucial, I would create ExamineeTestBest and ExamineeTestCurrent tables that get pushed to by a trigger on the ExamineeTest table that would always keep them updated. However, this is denormalization and probably not necessary or a good idea unless you've scaled so awfully big that retrieving results becomes unacceptably long.

It's not the same subquery. It's three different subqueries:
count() on all
TOP (1) ORDER BY Score DESC
TOP (1) ORDER BY DateDue DESC
You can't get away with executing it fewer than 3 times.
The question is, how to make it execute no more than 3 times.
One option would be to write 3 inline table functions and use them with outer apply. Make sure they are actually inline, otherwise your performance will drop a hundred times. One of these three functions might be:
create function dbo.topexaminee_byscore(@ExamineeID int)
returns table
as
return (
SELECT top (1)
ExamineeTestID as bestExamineeTestID,
Score as bestScore,
DateDue as bestDateDue,
TimeCommitted as bestTimeCommitted
FROM exam.ExamineeTest
WHERE (ExamineeID = @ExamineeID) AND (TestRevisionID = 3) AND (TestID = 2)
ORDER BY Score DESC
)
Another option would be to do essentially the same, but with subqueries. Because you fetch data for all students anyway, there shouldn't be too much of a difference performance-wise. Create three subqueries, for example:
select bestExamineeTestID, bestScore, bestDateDue, bestTimeCommitted
from (
SELECT
ExamineeTestID as bestExamineeTestID,
Score as bestScore,
DateDue as bestDateDue,
TimeCommitted as bestTimeCommitted,
row_number() over (partition by ExamineeID order by Score DESC) as takeme
FROM exam.ExamineeTest
WHERE (TestRevisionID = 3) AND (TestID = 2)
) as foo
where foo.takeme = 1
Same for ORDER BY DateDue DESC and for all records, with respective columns being selected.
Join these three on the examineeid.
What is going to be better/more performant/more readable is up to you. Do some testing.

It looks like you can replace the three columns that are based on the alias "bestTest" with a view. All three of those subqueries have the same WHERE clause and the same ORDER BY clause.
Ditto for the subquery aliased "bestNewTest". Ditto ditto for the subquery aliased "currentTest".
If I counted right, that would replace 8 subqueries with 3 views. You can join on the views. I think the joins would be faster, but if I were you, I'd check the execution plan of both versions.

You could use a CTE and OUTER APPLY.
;WITH testScores AS
(
SELECT ExamineeID, ExamineeTestID, Score, DateDue, TimeCommitted
FROM exam.ExamineeTest
WHERE TestRevisionID = 3 AND TestID = 2
)
SELECT E.ExamineeID, E.LastName, E.FirstName, E.Email, total.Attempts,
bestTest.*, currentTest.*
FROM exam.Examinee AS E
LEFT OUTER JOIN
(
SELECT ExamineeID, COUNT(ExamineeTestID) AS Attempts
FROM testScores
GROUP BY ExamineeID
) AS total ON E.ExamineeID = total.ExamineeID
OUTER APPLY
(
SELECT TOP 1 ExamineeTestID, Score, DateDue, TimeCommitted
FROM testScores AS t
WHERE E.ExamineeID = t.ExamineeID
ORDER BY Score DESC
) AS bestTest (bestExamineeTestID, bestScore, bestDateDue, bestTimeCommitted)
OUTER APPLY
(
SELECT TOP 1 ExamineeTestID, Score, DateDue, TimeCommitted
FROM testScores AS t
WHERE E.ExamineeID = t.ExamineeID
ORDER BY DateDue DESC
) AS currentTest (currentExamineeTestID, currentScore, currentDateDue,
currentTimeCommitted)

Related

How to do this query using self join or anything but without using window function

Below is the solution but I want to know other ways to accomplish the same results (preferably in PostgreSQL).
This is the DB
Question: How many customers have churned straight after their initial free trial? What percentage is this, rounded to the nearest whole number?
WITH ranking AS (
SELECT
s.customer_id,
s.plan_id,
p.plan_name,
ROW_NUMBER() OVER (
PARTITION BY s.customer_id
ORDER BY s.plan_id) AS plan_rank
FROM dbo.subscriptions s
JOIN dbo.plans p
ON s.plan_id = p.plan_id)
SELECT
COUNT(*) AS churn_count,
ROUND(100 * COUNT(*) / (
SELECT COUNT(DISTINCT customer_id)
FROM dbo.subscriptions),0) AS churn_percentage
FROM ranking
WHERE plan_id = 4 -- Filter to churn plan
AND plan_rank = 2
You can achieve the same results with a single aggregation on customer_id with a few CASE WHEN statements:
SELECT count(*) as total_customers
,count(case when total_subscriptions = 2
and includes_free = 1
and includes_churn = 1 then 1 end) as churn_count
,100 * count(case when total_subscriptions = 2
and includes_free = 1
and includes_churn = 1 then 1 end) / count(*) as target_percent
FROM (
SELECT customer_id
,count(*) as total_subscriptions
,max(case when plan_id = 0 then 1 else 0 end) as includes_free
,max(case when plan_id = 4 then 1 else 0 end) as includes_churn
FROM dbo.subscriptions
GROUP BY customer_id
) AS tbl
-- Remove any records for people who didnt use the free trial
-- or people who are still on the free trial
WHERE includes_free = 1 AND total_subscriptions > 1
The differences between our solutions are:
Yours doesn't specify that the customer actually had a free trial
Mine doesn't include customers who went from Free -> Churn -> (something else)
Depending on your requirements you might want to make further alterations/use a different approach.
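As a quick check of the aggregation-only approach, here is a small sketch using Python's built-in sqlite3 module. The four customers are invented for illustration; plan_id 0 is taken to be the free trial and plan_id 4 the churn plan, as in the queries above. The percentage uses 100.0 and ROUND so the division is not truncated to an integer, and its denominator only covers customers who moved on from the trial, because of the outer WHERE.

```python
import sqlite3

# Hypothetical data:
#   customer 1: trial -> churn            (churned straight after the trial)
#   customer 2: trial -> paid plan
#   customer 3: trial -> paid plan -> churn
#   customer 4: still on the trial
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE subscriptions (customer_id INTEGER, plan_id INTEGER);
INSERT INTO subscriptions VALUES
    (1, 0), (1, 4),
    (2, 0), (2, 1),
    (3, 0), (3, 1), (3, 4),
    (4, 0);
""")

query = """
SELECT COUNT(*) AS total_customers,
       COUNT(CASE WHEN total_subscriptions = 2 AND includes_churn = 1
                  THEN 1 END) AS churn_count,
       ROUND(100.0 * COUNT(CASE WHEN total_subscriptions = 2 AND includes_churn = 1
                                THEN 1 END) / COUNT(*)) AS churn_percentage
FROM (
    SELECT customer_id,
           COUNT(*) AS total_subscriptions,
           MAX(CASE WHEN plan_id = 0 THEN 1 ELSE 0 END) AS includes_free,
           MAX(CASE WHEN plan_id = 4 THEN 1 ELSE 0 END) AS includes_churn
    FROM subscriptions
    GROUP BY customer_id
) AS tbl
WHERE includes_free = 1 AND total_subscriptions > 1
"""
result = conn.execute(query).fetchone()
print(result)
```

Only customer 1 is counted as churned; customer 3 is excluded because there are three subscription rows, and customer 4 is filtered out before the percentage is computed.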

I just started learning SQL and I couldn't do the query, can you help me?

There is one part of the SQL query that I can't work out. A new column must be added to the result below, holding the percent complete for each cupboard. For example, Cupboard = 1 has 7 shelves, and IsCounted shows that 3 of them have been counted, so every row with Cupboard = 1 should show the percentage value of 3/7 in the new column. If IsCounted is 0 for all shelves of a cupboard, the column should show zero percent. How can I do this?
My Sql Code:
SELECT a.RegionName,
a.Cupboard,
a.Shelf,
(CASE WHEN ToplamSayım > 0 THEN 1 ELSE 0 END) AS IsCounted
FROM (SELECT p.RegionName,
r.Shelf,
r.Cupboard,
(SELECT COUNT(*)
FROM FAZIKI.dbo.PM_ProductCountingNew
WHERE RegionCupboardShelfTypeId = r.Id) AS ToplamSayım
FROM FAZIKI.dbo.DF_PMRegionType p
JOIN FAZIKI.dbo.DF_PMRegionCupboardShelfType r ON p.Id = r.RegionTypeId
WHERE p.WarehouseId = 45) a
ORDER BY a.RegionName;
The result is as in the picture below:
It looks like a windowed AVG should do the trick, although it's not entirely clear what the partitioning column should be.
The SELECT COUNT can also be simplified to an EXISTS:
SELECT a.RegionName,
a.Cupboard,
a.Shelf,
a.IsCounted,
AVG(a.IsCounted * 1.0) OVER (PARTITION BY a.RegionName, a.Cupboard) Percentage
FROM (
SELECT p.RegionName,
r.Shelf,
r.Cupboard,
CASE WHEN EXISTS (SELECT 1
FROM FAZIKI.dbo.PM_ProductCountingNew pcn
WHERE pcn.RegionCupboardShelfTypeId = r.Id
) THEN 1 ELSE 0 END AS IsCounted
FROM FAZIKI.dbo.DF_PMRegionType p
JOIN FAZIKI.dbo.DF_PMRegionCupboardShelfType r ON p.Id = r.RegionTypeId
WHERE p.WarehouseId = 45
) a
ORDER BY a.RegionName;
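To see what the windowed AVG produces, here is a minimal sketch with Python's built-in sqlite3 module. The shelves table is a hypothetical flattened version of the derived table a (one row per shelf, with IsCounted already computed), and the result is multiplied by 100.0 to show a percentage rather than a fraction. It needs SQLite 3.25 or later for window functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE shelves (RegionName TEXT, Cupboard INTEGER, Shelf INTEGER, IsCounted INTEGER)"
)
# Hypothetical data: cupboard 1 has 7 shelves with 3 counted,
# cupboard 2 has 2 shelves with none counted.
rows_in = [("A", 1, s, 1 if s <= 3 else 0) for s in range(1, 8)]
rows_in += [("A", 2, s, 0) for s in range(1, 3)]
conn.executemany("INSERT INTO shelves VALUES (?, ?, ?, ?)", rows_in)

query = """
SELECT Cupboard, Shelf, IsCounted,
       100.0 * AVG(IsCounted) OVER (PARTITION BY RegionName, Cupboard) AS Percentage
FROM shelves
ORDER BY Cupboard, Shelf
"""
result = conn.execute(query).fetchall()

# Every row of a cupboard carries the same percentage, so keep one per cupboard.
percent_by_cupboard = {cupboard: pct for cupboard, _, _, pct in result}
print(percent_by_cupboard)
```

Every row keeps its own IsCounted flag, while the Percentage column repeats the cupboard-level value (3/7 as a percentage for cupboard 1, zero for cupboard 2).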

SQL Joined Tables - Multiple rows on joined table per 'on' matched field merged into one row?

I have two tables I am pulling data from. Here is a minimal recreation of what I have:
Select
Jobs.Job_Number,
Jobs.Total_Amount,
Job_Charges.Charge_Code,
Job_Charges.Charge_Amount
From
DB.Jobs
Inner Join
DB.Job_Charges
On
Jobs.Job_Number = Job_Charges.Job_Number;
So, what happens is that I end up getting a row for each different Charge_Code and Charge_Amount per Job_Number. Everything else on the row is the same. Is it possible to have it return something more like:
Job_Number - Total_Amount - Charge_Code[1] - Charge_Amount[1] - Charge_Code[2] - Charge_Amount[2]
ETC?
This way it creates one line per job number with each associated charge and amount on the same line. I have been reading through W3 but haven't been able to tell definitively if this is possible or not. Anything helps, thank you!
To pivot your resultset over a fixed number of columns, you can use row_number() and conditional aggregation:
select
job_number,
total_amount,
max(case when rn = 1 then charge_code end) charge_code1,
max(case when rn = 1 then charge_amount end) charge_amount1,
max(case when rn = 2 then charge_code end) charge_code2,
max(case when rn = 2 then charge_amount end) charge_amount2,
max(case when rn = 3 then charge_code end) charge_code3,
max(case when rn = 3 then charge_amount end) charge_amount3
from (
select
j.job_number,
j.total_amount,
c.charge_code,
c.charge_amount,
row_number() over(partition by j.job_number, j.total_amount order by c.charge_code) rn
from DB.Jobs j
inner join DB.Job_Charges c on j.job_number = c.job_number
) t
group by job_number, total_amount
The above query handles up to 3 charge codes and amounts per job number (ordered by charge code). You can expand the select clause with more max(case ...) expressions to handle more of them.
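Here is a small runnable sketch of the same pivot, using Python's built-in sqlite3 module with made-up Jobs and Job_Charges rows (the DB. prefix is dropped because SQLite has no such schema, and only two charge slots are shown to keep it short). It needs SQLite 3.25 or later for row_number().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Jobs (job_number INTEGER PRIMARY KEY, total_amount INTEGER);
CREATE TABLE Job_Charges (job_number INTEGER, charge_code TEXT, charge_amount INTEGER);
INSERT INTO Jobs VALUES (1, 150), (2, 30);
INSERT INTO Job_Charges VALUES (1, 'LABOR', 100), (1, 'FUEL', 50), (2, 'FUEL', 30);
""")

query = """
SELECT job_number, total_amount,
       MAX(CASE WHEN rn = 1 THEN charge_code END)   AS charge_code1,
       MAX(CASE WHEN rn = 1 THEN charge_amount END) AS charge_amount1,
       MAX(CASE WHEN rn = 2 THEN charge_code END)   AS charge_code2,
       MAX(CASE WHEN rn = 2 THEN charge_amount END) AS charge_amount2
FROM (
    -- number each job's charges 1, 2, ... in charge_code order
    SELECT j.job_number, j.total_amount, c.charge_code, c.charge_amount,
           ROW_NUMBER() OVER (PARTITION BY j.job_number ORDER BY c.charge_code) AS rn
    FROM Jobs j
    JOIN Job_Charges c ON j.job_number = c.job_number
) t
GROUP BY job_number, total_amount
ORDER BY job_number
"""
pivoted = conn.execute(query).fetchall()
for row in pivoted:
    print(row)
```

Job 2 has only one charge, so its second pair of columns comes back as NULL (None in Python); unused slots are always padded this way.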

Check whether an employee is present on three consecutive days

I have a table called tbl_A with the following schema:
After insert, I have the following data in tbl_A:
Now the question is how to write a query for the following scenario:
Put (1) in front of any employee who was present three days consecutively
Put (0) in front of employee who was not present three days consecutively
The output screenshot:
I think we should use a CASE expression, but I don't know how to check for three consecutive days from the date. Any help would be appreciated.
Thank you
select name, case when max(cons_days) >= 3 then 1 else 0 end as presence
from (
select name, count(*) as cons_days
from tbl_A, (values (0),(1),(2)) as a(dd)
group by name, adate + dd
)x
group by name
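The query above relies on a shifting trick: each attendance row is copied to its own date, the next day, and the day after, so a shifted date reaches a count of 3 only when three consecutive days are present. Below is a minimal sketch of that idea using Python's built-in sqlite3 module. It assumes one row per employee per day, stores only days on which the employee was present, and represents adate as an integer day number to keep the date arithmetic trivial; the names and rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tbl_A (name TEXT, adate INTEGER);
INSERT INTO tbl_A VALUES
    ('Ann', 1), ('Ann', 2), ('Ann', 3),
    ('Bob', 1), ('Bob', 2), ('Bob', 4), ('Bob', 5);
""")
# Ann is present on days 1-3 (three in a row); Bob never manages more than two.

query = """
WITH shifts(dd) AS (VALUES (0), (1), (2))
SELECT name,
       CASE WHEN MAX(cons_days) >= 3 THEN 1 ELSE 0 END AS presence
FROM (
    -- each present day d votes for d, d + 1 and d + 2;
    -- a date collects 3 votes only if d - 2, d - 1 and d are all present
    SELECT name, COUNT(*) AS cons_days
    FROM tbl_A CROSS JOIN shifts
    GROUP BY name, adate + dd
) x
GROUP BY name
ORDER BY name
"""
flags = conn.execute(query).fetchall()
print(flags)
```

If the real table also stores absent days, a filter such as available = 'Y' would be needed before the cross join.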
With a self-join on name and available = 'Y', we create an inner table with different combinations of dates for a given name and take a count of those entries in which the dates of the two instances of the table are less than 2 units apart i.e. for each value of a date adate, it will check for entries with its own value adate as well as adate + 1 and adate + 2. If all 3 entries are present, the count will be 3 and you will have a flag with value 1 for such names(this is done in the outer query). Try the below query:
SELECT Z.NAME,
CASE WHEN MAX(Z.CONSEQ_AVAIL) >= 3 THEN 1 ELSE 0 END AS YOUR_FLAG
FROM
(
-- one row per (name, start date): count the available days in [ADATE, ADATE + 2]
SELECT A.NAME, A.ADATE,
SUM(CASE WHEN B.ADATE >= A.ADATE AND B.ADATE <= A.ADATE + 2 THEN 1 ELSE 0 END) AS CONSEQ_AVAIL
FROM
TBL_A A INNER JOIN TBL_A B
ON A.NAME = B.NAME AND A.AVAILABLE = 'Y' AND B.AVAILABLE = 'Y'
GROUP BY A.NAME, A.ADATE
) Z
GROUP BY Z.NAME;
Due to the complexity of the problem, I have not been able to test it out. If something is really wrong, please let me know and I will be happy to take down my answer.
-- Below is my approach
select Name,
Case WHen Max_Count>=3 Then 1 else 0 end as Presence
from
(
Select Name,MAx(Coun) as Max_Count
from
(
select Name, (count(*) over (partition by Name,Ref_Date)) as Coun from
(
select Name,adate + row_number() over (partition by Name order by Adate desc) as Ref_Date
from temp
where available='Y'
)
) group by Name
);
select name as employee, max(diff) as presence
from
(select name,
-- 1 when the row two positions later (per employee, by date) is exactly two days after this one;
-- assumes at most one row per employee per day
case when datediff(day, Adate, lead(Adate, 2) over(partition by name order by Adate)) = 2 then 1 else 0 end as diff
from table_A
where Available = 'Y') A
group by name;

SQL Query Best techniques to get MAX data from a foreign key linked table

I have written two queries, but I feel they are inefficient.
I have two queries: one prepares the data (the data originally came from an old FoxPro database in which the dates etc. were nvarchars, so I convert them to dates etc.), and the second collates all of the data ready to be exported to a CSV, which is eventually sent to a web service.
So the first query...
I have a table of people and a table of placements (a placement being a job that they have had). The placements table will have lots of different rows for a single person and I need only the latest (based on start and end date). Is the below the most efficient way of doing this?
PersonCode = unique id for the person, Code = unique id for the placement
SELECT * FROM Person c
LEFT JOIN
(
SELECT MAX(StartDate) AS StartDate, MAX(EndDate) AS EndDate, MAX(Code) AS Code, PersonCode
FROM PersonPlacement
GROUP BY PersonCode
) cp ON c.PersonCode = cp.PersonCode
LEFT JOIN PersonPlacement cp2 ON cp.Code = cp2.Code
So my second query is below...
The second query reads from the first query and needs to do the following:
Get only unique candidates based on last contact date (the original data had dupes)
Get the latest placement
Get Resume data
Only get people that are not currently in a job based on start and end date of placement
If they are in a job that is ending soon then show them
See query below...
SELECT *
FROM Pre_PersonView c
INNER JOIN (
SELECT PersonCode, Code, row_number() over(partition by PersonCode order by StartDate desc) as rn
FROM Pre_PersonView
) pj ON c.PersonCode = pj.PersonCode AND pj.rn = 1
LEFT JOIN Pre_PersonView cp ON pj.Code = cp.Code
INNER JOIN (
SELECT PersonCode, row_number() over(partition by PersonCode order by LastContactDate desc) as rn
FROM Person
) uc ON c.PersonCode = uc.PersonCode AND uc.rn = 1
LEFT JOIN [PersonResumeText] ct ON c.PersonId = ct.PersonId
WHERE c.PersonCode NOT IN
(
SELECT pcv.PersonCode
FROM Pre_PersonView pcv
WHERE pcv.Department IN ('x','y','z')
AND pcv.StartDate <= GETDATE()
AND (CASE WHEN pcv.EndDate = '1899-12-30' THEN GETDATE() + 1 ELSE pcv.EndDate END) > GETDATE()
)
AND DATEDIFF(DAY, ISNULL((CASE WHEN cp.StartDate = '0216-07-22' THEN '2016-07-22' ELSE cp.StartDate END), GETDATE() -365), ISNULL((CASE WHEN cp.EndDate = '1899-12-30' THEN NULL ELSE cp.EndDate END), GETDATE() + 1))
>
(CASE WHEN cp.Department IN ('x','y','z') THEN 365 ELSE 2 END)
Again, my question is: is this the most efficient way to be doing this?