Derived Tables for summary statistics - sql

I often find myself running a query to get the number of people who meet a certain criterion, the total number of people in that population, and then the percentage that meets the criterion. I've been doing it the same way for a while and was wondering how SO would solve the same type of problem. Below is how I wrote the query:
select m.state_cd
      ,m.injurylevel
      ,COUNT(distinct m.patid) as pplOnRx
      ,x.totalPatientsPerState
      ,round((COUNT(distinct m.patid) / cast(x.totalPatientsPerState as float)) * 100, 2) as percentPrescribedNarcotics
from members as m
inner join rx on rx.patid = m.PATID
inner join DrugTable as dt on dt.drugClass = rx.drugClass
inner join
(
    select m2.state_cd, m2.injurylevel, COUNT(distinct m2.patid) as totalPatientsPerState
    from members as m2
    inner join rx on rx.patid = m2.PATID
    group by m2.STATE_CD, m2.injuryLevel
) x on x.state_cd = m.state_cd and m.injuryLevel = x.injurylevel
where drugText like '%narcotics%'
group by m.state_cd, m.injurylevel, x.totalPatientsPerState
order by m.STATE_CD, m.injuryLevel
In this example not everyone who appears in the members table is in the rx table. The derived table makes sure that everyone who's in rx is also counted from members, without the drugText like '%narcotics%' condition. From what little I've played with it, it seems the OVER (PARTITION BY) clause might work here. I have no idea if it does; it just seems like it to me. How would someone else go about tackling this problem?

This is exactly what MDX and SSAS are designed to do. If you insist on doing it in SQL (nothing wrong with that), are you asking for a way to do it with better performance? In that case, it would depend on how the tables are indexed, tempdb speed, and, if the tables are partitioned, on that too.
Also, the distinct count is going to be one of the larger performance hits. The like '%narcotics%' in the predicate is going to force a full table scan and should be avoided at all costs (can this be an integer key in the data model?).
To answer your question, I'm not really sure windowing (OVER (PARTITION BY)) is going to perform any better. I would test it and see, but there is nothing "wrong" with the query.
You could rewrite the COUNT(DISTINCT)s as virtual tables or temp tables with GROUP BYs, or a combination of the two.
To illustrate, this is a stub for windowing that you could grow into the same query:
select a.state_cd, a.injurylevel, a.totalpatid,
       count(*) over (partition by a.state_cd, a.injurylevel) as grpCount
from
(
    select state_cd, injurylevel, count(*) as totalpatid, count(distinct patid) as patid
    from #members
    group by state_cd, injurylevel
) a
See what I mean about it not really being that helpful? Then again, sometimes rewriting a query slightly can improve performance by selecting a better execution plan, but rather than taking stabs in the dark, I'd first find the bottlenecks in the query you have, since you already took the time to write it.
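For comparison, here is a hedged sketch that collapses the derived table into a single pass using conditional aggregation. It assumes every rx.drugClass has a match in DrugTable (otherwise the denominator would differ from the original) and that drugText lives on DrugTable; it is a sketch to test against the original, not a drop-in replacement:
-- numerator: distinct patients with a narcotics prescription
-- denominator: distinct patients with any prescription (assumes rx.drugClass always matches DrugTable)
select m.state_cd
      ,m.injurylevel
      ,count(distinct case when dt.drugText like '%narcotics%' then m.patid end) as pplOnRx
      ,count(distinct m.patid) as totalPatientsPerState
      ,round(count(distinct case when dt.drugText like '%narcotics%' then m.patid end)
             / cast(count(distinct m.patid) as float) * 100, 2) as percentPrescribedNarcotics
from members as m
inner join rx on rx.patid = m.PATID
inner join DrugTable as dt on dt.drugClass = rx.drugClass
group by m.state_cd, m.injurylevel
order by m.state_cd, m.injurylevel
Unlike the original, this also returns groups with zero narcotics patients (pplOnRx = 0).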

Related

Time based accumulation based on type: Speed considerations in SQL

Based on surfing the web, I came up with two methods of counting the records in a table "Table1". The counter field increments according to a date field "TheDate". It does this by summing records with an older TheDate value. Furthermore, records with different values for the compound field (Field1,Field2) are counted using separate counters. Field3 is just an informational field that is included for added awareness and does not affect the counting or how records are grouped for counting.
Method 1: Use correlated subquery
SELECT MainQuery.Field1,
       MainQuery.Field2,
       MainQuery.Field3,
       MainQuery.TheDate,
       (
           SELECT SUM(1) FROM Table1 InnerQuery
           WHERE InnerQuery.Field1 = MainQuery.Field1 AND
                 InnerQuery.Field2 = MainQuery.Field2 AND
                 InnerQuery.TheDate <= MainQuery.TheDate
       ) AS RunningCounter
FROM Table1 MainQuery
ORDER BY MainQuery.Field1,
         MainQuery.Field2,
         MainQuery.TheDate,
         MainQuery.Field3
Method 2: Use join and group-by
SELECT MainQuery.Field1,
       MainQuery.Field2,
       MainQuery.Field3,
       MainQuery.TheDate,
       SUM(1) AS RunningCounter
FROM Table1 MainQuery INNER JOIN Table1 InnerQuery
     ON InnerQuery.Field1 = MainQuery.Field1 AND
        InnerQuery.Field2 = MainQuery.Field2 AND
        InnerQuery.TheDate <= MainQuery.TheDate
GROUP BY MainQuery.Field1,
         MainQuery.Field2,
         MainQuery.Field3,
         MainQuery.TheDate
ORDER BY MainQuery.Field1,
         MainQuery.Field2,
         MainQuery.TheDate,
         MainQuery.Field3
There is no inner query per se in Method 2, but I use the table alias InnerQuery so that a ready parallel with Method 1 can be drawn. The role is the same; the second instance of Table1 is for accumulating the counts of the records which have TheDate less than that of any record in MainQuery (the first instance of Table1) with the same Field1 and Field2 values.
Note that in Method 2, Field3 is included in the GROUP BY clause even though I said that it does not affect how the records are grouped for counting. This is still true, since the counting is done using the matching records in InnerQuery, whereas the GROUP BY applies to Field3 in MainQuery.
I found that Method 1 is noticeably faster. I'm surprised by this because it uses a correlated subquery. The way I think of a correlated subquery is that it is executed for each record in MainQuery (whether or not that is what happens in practice after optimization). On the other hand, Method 2 doesn't run an inner query over and over again. However, the inner join still has multiple records in InnerQuery matching each record in MainQuery, so in a sense it deals with a similar order of complexity.
Is there a decent intuitive explanation for this speed difference, as well as best practices or considerations in choosing an approach for time-based accumulation?
I've posted this to Microsoft Answers and Stack Exchange.
In fact, I think the easiest way is to do this:
SELECT MainQuery.Field1,
       MainQuery.Field2,
       MainQuery.Field3,
       MainQuery.TheDate,
       COUNT(*)
FROM Table1 MainQuery
GROUP BY MainQuery.Field1,
         MainQuery.Field2,
         MainQuery.Field3,
         MainQuery.TheDate
ORDER BY MainQuery.Field1,
         MainQuery.Field2,
         MainQuery.TheDate,
         MainQuery.Field3
(The order by isn't required to get the same data, just to order it. In other words, removing it will not change the number or contents of each row returned, just the order in which they are returned.)
You only need to specify the table once. Doing a self-join (joining a table to itself as both your queries do) is not required. The performance of your two queries will depend on a whole load of things which I don't know - what the primary keys are, the number of rows, how much memory is available, and so on.
First, your experience makes a lot of sense. I'm not sure why you need more intuition. I imagine you learned, somewhere along the way, that correlated subqueries are evil. Well, just as some of the things we teach kids as being really bad ("don't cross the street when the walk sign is not green") turn out to be not so bad, the same is true of correlated subqueries.
The easiest intuition is that the uncorrelated subquery has to aggregate all the data in the table. The correlated version only has to aggregate matching fields, although it has to do this over and over.
To put numbers to it, say you have 1,000 rows with 10 rows per group. The output is 100 rows. The first version does 100 aggregations of 10 rows each. The second does one aggregation of 1,000 rows. Well, aggregation generally scales in a super-linear fashion (O(n log n), technically). That means that 100 aggregations of 10 records takes less time than 1 aggregation of 1000 records.
You asked for intuition, so the above is to provide some intuition. There are a zillion caveats that go both ways. For instance, the correlated subquery might be able to make better use of indexes for the aggregation. And, the two queries are not equivalent, because the correct join would be LEFT JOIN.
Actually, I was wrong in my original post. The inner join is way, way faster than the correlated subquery. However, the correlated subquery is able to display its results records as they are generated, so it appears faster.
As a side curiosity, I'm finding that if the correlated sub-query approach is modified to use sum(-1) instead of sum(1), the number of returned records seems to vary from N-3 to N (where N is the correct number, i.e., the number of records in Table1). I'm not sure if this is due to some misbehaviour in Access's rush to display initial records or what-not.
While it seems that the INNER JOIN wins hands-down, there is a major insidious caveat. If the GROUP BY fields do not uniquely distinguish each record in Table1, then you will not get an individual SUM for each record of Table1. Imagine that a particular combination of GROUP BY field values matches (say) three records in Table1. You will then get a single SUM for all of them. The problem is, each of these three records in MainQuery also matches all three of the same records in InnerQuery, so those instances in InnerQuery get counted multiple times. Very insidious (I find).
So it seems that the sub-query may be the way to go, which is awfully disturbing in view of the above problem with repeatability (2nd paragraph above). That is a serious problem that should send shivers down any spine. Another possible solution that I'm looking at is to turn MainQuery into a subquery by SELECTing the fields of interest and DISTINCTifying them before INNER JOINing the result with InnerQuery.
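As an aside: on an engine that supports window functions (the discussion above suggests Access, which does not), the running count can be expressed directly, without a self-join or correlated subquery. A minimal sketch:
-- Hedged sketch: a windowed running count per (Field1, Field2), ordered by TheDate.
-- The default RANGE frame counts ties on TheDate the same way the <= predicate does.
SELECT Field1,
       Field2,
       Field3,
       TheDate,
       COUNT(*) OVER (PARTITION BY Field1, Field2 ORDER BY TheDate) AS RunningCounter
FROM Table1
ORDER BY Field1, Field2, TheDate, Field3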

Why does breaking out this correlated subquery vastly improve performance?

I tried running this query against two tables which were very different sizes - #temp was about 15,000 rows, and Member is about 70,000,000, about 68,000,000 of which do not have the ID 307.
SELECT COUNT(*)
FROM #temp
WHERE CAST(individual_id as varchar) NOT IN (
SELECT IndividualID
FROM Member m
INNER JOIN Person p ON p.PersonID = m.PersonID
WHERE CompanyID <> 307)
This query ran for 18 hours, before I killed it and tried something else, which was:
SELECT IndividualID
INTO #source
FROM Member m
INNER JOIN Person p ON p.PersonID = m.PersonID
WHERE CompanyID <> 307
SELECT COUNT(*)
FROM #temp
WHERE CAST(individual_id AS VARCHAR) NOT IN (
SELECT IndividualID
FROM #source)
And this ran for less than a second before giving me a result.
I was pretty surprised by this. I'm a middle-tier developer rather than a SQL expert and my understanding of what goes on under the hood is a little murky, but I would have presumed that, since the sub-query in my first attempt is the exact same code asking for the exact same data as in the second attempt, these would be roughly equivalent.
But that's obviously wrong. I can't look at the execution plan for my original query to see what SQL Server is trying to do. So can someone kindly explain why splitting the data out into a temp table is so much faster?
EDIT: Table schemas and indexes
The #temp table has two columns, Individual_ID int and Source_Code varchar(50)
Member and Person are more complex. They have 29 and 13 columns respectively, so I don't really want to post them in full. PersonID is an int and is the PK on Person and an FK on Member. IndividualID is a column on Person - this is not clear in the query as written.
I tried using a LEFT JOIN instead of NOT IN before asking the question. The performance on the second query wasn't noticeably different - both were sub-second. On the first query I let it run for an hour before stopping it, presuming it would make no significant difference.
I also added an index on #source, just like on the original table, so the performance impact should be identical.
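The LEFT JOIN variant mentioned above isn't shown in the post; for reference, a hedged sketch of that anti-join form, using the column names from the edit:
-- Hedged sketch of the LEFT JOIN / IS NULL anti-join alluded to above
-- (assumes individual_id and IndividualID are comparable without the CAST).
SELECT COUNT(*)
FROM #temp t
LEFT JOIN (SELECT p.IndividualID
           FROM Member m
           INNER JOIN Person p ON p.PersonID = m.PersonID
           WHERE CompanyID <> 307) s
    ON s.IndividualID = t.individual_id
WHERE s.IndividualID IS NULL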
First, your query has two faux pas that really stick out. You are converting to varchar(), but you do not include a length argument. This should not be allowed! The default length varies by context, and you need to be explicit.
Second, you are matching two keys in different tables and they seemingly have different types. Foreign key references should always have the same type. This can have a very big impact on performance. If you are dealing with tables that have millions of rows, then you need to pay some attention to the data structure.
To understand the difference in performance, you need to understand execution plans. The two queries have very different execution plans. My (educated) guess is that the first version is using a nested loop join algorithm. The second version is using a more sophisticated algorithm, thanks in your case to SQL Server's ability to maintain statistics on tables. So, instantiating the intermediate results actually helps the optimizer produce a better query plan.
The subject of how best to write this logic has been investigated a lot. Here is a very good discussion on the subject by Aaron Bertrand.
I do agree with Aaron on the preference for not exists in this case:
SELECT COUNT(*)
FROM #temp t
WHERE NOT EXISTS (SELECT 1
                  FROM Member m
                  JOIN Person p ON p.PersonID = m.PersonID
                  WHERE CompanyID <> 307 AND p.IndividualID = t.individual_id
                 );
However, I don't know if this will have better performance in this particular case.
This line is probably what kills the first query:
WHERE CAST(individual_id as varchar) NOT IN
My guess would be that this forces a table scan rather than using any indexes.
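If individual_id and IndividualID really are the same integer type (the schemas above suggest they might be), the predicate can stay sargable by dropping the CAST entirely. A hedged sketch, with an IS NOT NULL guard since NOT IN behaves badly when the subquery returns NULLs:
-- Hedged sketch: no CAST, so an index on the compared columns can still be used.
SELECT COUNT(*)
FROM #temp
WHERE individual_id NOT IN (SELECT p.IndividualID
                            FROM Member m
                            INNER JOIN Person p ON p.PersonID = m.PersonID
                            WHERE CompanyID <> 307
                              AND p.IndividualID IS NOT NULL)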

Queries within queries: Is there a better way?

As I build bigger, more advanced web applications, I'm finding myself writing extremely long and complex queries. I tend to write queries within queries a lot because I feel making one call to the database from PHP is better than making several and correlating the data.
However, anyone who knows anything about SQL knows about JOINs. Personally, I've used a JOIN or two before, but quickly stopped when I discovered subqueries because they felt easier and quicker for me to write and maintain.
Commonly, I'll write subqueries that may themselves contain one or more subqueries against related tables.
Consider this example:
SELECT
    (SELECT username FROM users WHERE records.user_id = user_id) AS username,
    (SELECT last_name||', '||first_name FROM users WHERE records.user_id = user_id) AS name,
    in_timestamp,
    out_timestamp
FROM records
ORDER BY in_timestamp
Rarely, I'll do subqueries after the WHERE clause.
Consider this example:
SELECT
    user_id,
    (SELECT name FROM organizations
     WHERE (SELECT organization FROM locations WHERE records.location = location_id) = organization_id) AS organization_name
FROM records
ORDER BY in_timestamp
In these two cases, would I see any sort of improvement if I decided to rewrite the queries using a JOIN?
As more of a blanket question, what are the advantages/disadvantages of using subqueries or a JOIN? Is one way more correct or accepted than the other?
In simple cases, the query optimiser should be able to produce identical plans for a simple join versus a simple sub-select.
But in general (and where appropriate), you should favour joins over sub-selects.
Plus, you should avoid correlated subqueries (a query in which the inner expression refers to the outer), as they are effectively a for loop within a for loop. In most cases a correlated subquery can be written as a join.
JOINs are preferable to separate [sub]queries.
If the subselect (AKA subquery) is not correlated to the outer query, it's very likely the optimizer will scan the table(s) in the subselect once, because the value isn't likely to change. When you have correlation, as in the example provided, the likelihood of single-pass optimization becomes very low. In the past, it's been believed that correlated subqueries execute RBAR -- Row By Agonizing Row. With a JOIN, the same result can be achieved while ensuring a single pass over the table.
This is a proper re-write of the query provided:
SELECT u.username,
u.last_name||', '|| u.first_name AS name,
r.in_timestamp,
r.out_timestamp
FROM RECORDS r
LEFT JOIN USERS u ON u.user_id = r.user_id
ORDER BY r.in_timestamp
...because the subselect can return NULL if the user_id doesn't exist in the USERS table. Otherwise, you could use an INNER JOIN:
SELECT u.username,
u.last_name ||', '|| u.first_name AS name,
r.in_timestamp,
r.out_timestamp
FROM RECORDS r
JOIN USERS u ON u.user_id = r.user_id
ORDER BY r.in_timestamp
Derived tables/inline views are also possible using JOIN syntax.
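For the second example in the question, a comparable join rewrite might look like this (a sketch; it assumes location_id and organization_id are the key columns the nested subqueries imply, and uses LEFT JOINs to preserve the scalar-subquery NULL behaviour):
-- Hedged sketch of the organizations/locations example rewritten with joins.
SELECT r.user_id,
       o.name AS organization_name
FROM records r
LEFT JOIN locations l ON l.location_id = r.location
LEFT JOIN organizations o ON o.organization_id = l.organization
ORDER BY r.in_timestamp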
a) I'd start by pointing out that the two are not necessarily interchangeable. Nesting as you have requires there to be 0 or 1 matching value, otherwise you will get an error. A join puts no such requirement on the data and may exclude the record or introduce more, depending on your data and the type of join.
b) In terms of performance, you will need to check the query plans, but your nested examples are unlikely to be more efficient than a table join. Typically sub-queries are executed once per row, but that very much depends on your database, unique constraints, foreign keys, not null, etc. Maybe the DB can rewrite them more efficiently, but joins can use a wider variety of techniques, drive the data from different tables, etc., because they do different things (though you may not observe any difference in your output depending on your data).
c) Most DB aware programmers I know would look at your nested queries and rewrite using joins, subject to the data being suitably 'clean'.
d) Regarding "correctness" - I would favour joins backed up with proper constraints on your data where necessary (e.g. a unique user ID). You as a human may make certain assumptions but the DB engine cannot unless you tell it. The more it knows, the better job it (and you) can do.
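As a concrete illustration of point (d), the kind of constraint that lets both the engine and the reader assume at most one matching user might be declared like this (a sketch; the constraint name is invented, and it's unnecessary if user_id is already the primary key):
-- Hedged sketch: a unique constraint on users.user_id, so a scalar subquery or join
-- on user_id can return at most one row per record.
ALTER TABLE users ADD CONSTRAINT uq_users_user_id UNIQUE (user_id);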
Joins in most cases will be much faster.
Let's take an example.
Let's use your first query:
SELECT
(SELECT username FROM users WHERE records.user_id = user_id) AS username,
(SELECT last_name||', '||first_name FROM users WHERE records.user_id = user_id) AS name,
in_timestamp,
out_timestamp
FROM records
ORDER BY in_timestamp
Now consider that we have 100 records in records and 100 records in users (assuming we don't have an index on user_id).
So if we trace your algorithm, it says:
For each record
Scan all 100 records in users to find the username
Scan all 100 records in users to find the last name and first name
So it's like we scanned the users table 100*100*2 times. Is it really worth it? An index on user_id would make things better, but is it still worth it?
Now consider a join (a nested loop join will behave much like the above, but consider a hash join):
It works like this:
Build a hash map of users.
For each record
Find the matching entry in the hash map, which is certainly much faster than looping through the table to find a record.
So clearly, joins should be favored.
NOTE: The 100-record example may well produce an identical plan, but the idea is to illustrate how the choice can affect performance.
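For what it's worth, the missing index the example assumes away could be added along these lines (a sketch; the index name is invented):
-- Hedged sketch: an index on users.user_id turns the per-row scan into a seek,
-- whichever join strategy the optimizer picks.
CREATE INDEX idx_users_user_id ON users (user_id);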

Is there a better way to sort this query?

We generate a lot of SQL procedurally and SQL Server is killing us. Because of some issues documented elsewhere we basically do SELECT TOP 2 ** 32 instead of TOP 100 PERCENT.
Note: we must use the subqueries.
Here's our query:
SELECT * FROM (
SELECT [me].*, ROW_NUMBER() OVER( ORDER BY (SELECT(1)) )
AS rno__row__index FROM (
SELECT [me].[id], [me].[status] FROM (
SELECT TOP 4294967296 [me].[id], [me].[status] FROM
[PurchaseOrders] [me]
LEFT JOIN [POLineItems] [line_items]
ON [line_items].[id] = [me].[id]
WHERE ( [line_items].[part_id] = ? )
ORDER BY [me].[id] ASC
) [me]
) [me]
) rno_subq
WHERE rno__row__index BETWEEN 1 AND 25
Are there better ways to do this that anyone can see?
UPDATE: here is some clarification on the whole subquery issue:
The key word of my question is "procedurally". I need the ability to reliably encapsulate resultsets so that they can be stacked together like building blocks. For example, I want to get the first 10 cds ordered by the name of the artist who produced them, and also get the related artist for each cd. What I do is assemble a monolithic subselect representing the cds ordered by the joined artist names, then apply a limit to it, and then join the nested subselects to the artist table, and only then execute the resulting query. The isolation is necessary because the code that requests the ordered cds is unrelated to and oblivious of the code selecting the top 10 cds, which in turn is unrelated to and oblivious of the code that requests the related artists.
Now you may say that I could move the inner ORDER BY into the OVER() clause, but then I break the encapsulation, as I would have to SELECT the columns of the joined table, so I can order by them later. An additional problem would be the merging of two tables under one alias; if I have identically named columns in both tables, the select me.* would stop right there with an ambiguous column name error.
I am willing to sacrifice a bit of the optimizer performance, but the 2**32 seems like too much of a hack to me. So I am looking for middle ground.
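To make the building-block assembly above concrete, here is a sketch of the cds/artists example as the described generator would emit it (table and column names are invented for illustration):
-- block 1: cds ordered by the joined artist name (the TOP 2**32 hack keeps the ORDER BY legal)
-- block 2: a ROW_NUMBER() wrapper so a limit can be applied later
-- block 3: the limited page joined back to the artist table
SELECT page.*, a.name AS artist_name
FROM (
    SELECT ordered.*, ROW_NUMBER() OVER ( ORDER BY (SELECT(1)) ) AS rno__row__index
    FROM (
        SELECT TOP 4294967296 c.id, c.title, c.artist_id
        FROM cds c
        JOIN artists ar ON ar.id = c.artist_id
        ORDER BY ar.name ASC
    ) ordered
) page
JOIN artists a ON a.id = page.artist_id
WHERE page.rno__row__index BETWEEN 1 AND 10
It leans on the same assumption the answers below dispute: that the TOP ... ORDER BY in the innermost block survives into the numbering.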
If you want top rows by me.id, just ask for that in the ROW_NUMBER's ORDER BY. Don't chase your tail around subqueries and TOP.
If you have a WHERE clause on a joined table's field, you can't really have an outer JOIN: all the outer rows' fields will be NULL and get filtered out by the WHERE, so it is effectively an inner join.
WITH cteRowNumbered AS (
    SELECT [me].id, [me].status,
           ROW_NUMBER() OVER (ORDER BY me.id ASC) AS rno__row__index
    FROM [PurchaseOrders] [me]
    JOIN [POLineItems] [line_items] ON [line_items].[id] = [me].[id]
    WHERE [line_items].[part_id] = ?)
SELECT id, status
FROM cteRowNumbered
WHERE rno__row__index BETWEEN 1 AND 25
I use CTEs instead of subqueries just because I find them more readable.
Use:
SELECT x.*
FROM (SELECT po.id,
             po.status,
             ROW_NUMBER() OVER (ORDER BY po.id) AS rno__row__index
      FROM [PurchaseOrders] po
      JOIN [POLineItems] li ON li.id = po.id
      WHERE li.part_id = ?) x
WHERE x.rno__row__index BETWEEN 1 AND 25
ORDER BY x.id ASC
Unless you've omitted details in order to simplify the example, there's no need for all your subqueries in what you provided.
Kudos to the only person who saw through the naysaying and actually tried the query on a large table we do not have access to. To all the rest saying this simply will not work (will return random rows) - we know what the manual says, and we know it is a hack - this is why we asked the question in the first place. However, outright dismissing a query without even trying it is rather shallow. Can someone provide us with a real example (with preceding CREATE/INSERT statements) demonstrating the above query malfunctioning?
Your update makes things much clearer. I think that the approach you're using is seriously flawed. While it's nice to be able to have encapsulated, reusable code in your applications, front-end applications are a much different animal than a database. They typically deal with small structures and small, discrete processes that run against those structures. Databases, on the other hand, often deal with tables that are measured in the millions of rows, and sometimes more than that. Using the same methodologies will often result in code that simply performs so badly as to be unusable. Even if it works now, it's very likely that it won't scale and will cause major problems down the road.
Best of luck to you, but I don't think that this approach will end well in all but the smallest of databases.

Avoiding a nested subquery in SQL

I have a SQL table that contains data of the form:
Id int
EventTime dateTime
CurrentValue int
The table may have multiple rows for a given id that represent changes to the value over time (the EventTime identifying the time at which the value changed).
Given a specific point in time, I would like to be able to calculate the count of distinct Ids for each given Value.
Right now, I am using a nested subquery and a temporary table, but it seems it could be much more efficient.
SELECT [Id],
       (
           SELECT TOP 1 [CurrentValue]
           FROM [ValueHistory]
           WHERE [Ids].[Id] = [ValueHistory].[Id] AND
                 [EventTime] < @StartTime
           ORDER BY [EventTime] DESC
       ) as [LastValue]
INTO #temp
FROM [Ids]

SELECT [LastValue], COUNT([LastValue])
FROM #temp
GROUP BY [LastValue]

DROP TABLE #temp
Here is my first go:
select ids.Id, count(distinct currentvalue)
from ids
join valuehistory vh on ids.id = vh.id
where vh.eventtime < @StartTime
group by ids.id
However, I am not sure I understand your table model very clearly, or the specific question you are trying to solve.
This would be: the distinct 'currentvalues' from valuehistory before a certain date, for each Id.
Is that what you are looking for?
I think I understand your question.
You want to get the most recent value for each id, group by that value, and then see how many ids have that same value? Is this correct?
If so, here's my first shot:
declare @StartTime datetime
set @StartTime = '20090513'

select ValueHistory.CurrentValue, count(ValueHistory.id)
from
(
    select id, max(EventTime) as LatestUpdateTime
    from ValueHistory
    where EventTime < @StartTime
    group by id
) CurrentValues
inner join ValueHistory on CurrentValues.id = ValueHistory.id
    and CurrentValues.LatestUpdateTime = ValueHistory.EventTime
group by ValueHistory.CurrentValue
No guarantee that this is actually faster though - for this to work with any decent speed you'll need an index on EventTime.
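Another option, sketched here under the assumption of SQL Server 2005 or later (column names taken from the question), is to pick the latest row per id with ROW_NUMBER() and then group:
-- Hedged sketch: latest value per Id before @StartTime via ROW_NUMBER(), then a count per value.
declare @StartTime datetime
set @StartTime = '20090513'

select CurrentValue, count(*) as IdCount
from
(
    select Id, CurrentValue,
           ROW_NUMBER() OVER (PARTITION BY Id ORDER BY EventTime DESC) as rn
    from ValueHistory
    where EventTime < @StartTime
) latest
where rn = 1
group by CurrentValue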
Let us keep in mind that, because the SQL language describes what you want and not how to get it, there are many ways of expressing a query that will eventually be turned into the same query execution plan by a good query optimizer. Of course, the level of "good" depends on the database you're using.
In general, subqueries are just a syntactically different way of describing joins. The query optimizer is going to recognize this and determine the most optimal way, to the best of its knowledge, to execute the query. Temporary tables may be created as needed. So in many cases, re-working the query is going to do nothing for your actual execution time -- it may come out to the same query execution plan in the end.
If you're going to attempt to optimize, you need to examine the query plan by doing a describe on that query. Make sure it's not doing full-table scans against large tables, and is picking the appropriate indices where possible. If, and only if, it is making sub-optimal choices here, should you attempt to manually optimize the query.
Now, having said all that, the query you pasted isn't entirely compatible with your stated goal of "calculat[ing] the count of distinct Ids for each given Value". So forgive me if I don't quite answer your need, but here's something to perf-test against your current query. (Syntax is approximate, sorry -- away from my desk).
SELECT ids.[Id], vh1.[CurrentValue], COUNT(vh2.[CurrentValue])
FROM [IDs] AS ids
JOIN [ValueHistory] AS vh1 ON ids.[Id] = vh1.[Id]
JOIN [ValueHistory] AS vh2 ON vh1.[CurrentValue] = vh2.[CurrentValue]
GROUP BY ids.[Id], vh1.[CurrentValue];
Note that you'll probably see better performance increases by adding indices to make those joins optimal than re-working the query, assuming you're willing to take the performance hit to update operations.
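In that spirit, the kind of index that would support the lookups above might look like this (a sketch; the index name is invented, and the INCLUDE clause assumes SQL Server):
-- Hedged sketch: lets the "latest value before @StartTime per Id" lookups seek instead of scan.
CREATE INDEX IX_ValueHistory_Id_EventTime
    ON ValueHistory (Id, EventTime DESC) INCLUDE (CurrentValue);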