Assume a table CAR with two columns, CAR_ID (int) and VERSION (int).
I want to retrieve the maximum version of each car.
There are (at least) two solutions:
select car_id, max(version) as max_version
from car
group by car_id;
Or:
select car_id, max_version
from ( select car_id, version
, max(version) over (partition by car_id) as max_version
from car
) max_ver
where max_ver.version = max_ver.max_version
Are these two queries similarly performant?
I know this is extremely old but thought it should be pointed out.
select car_id, max_version
from (select car_id
, version
, max(version) over (partition by car_id) as max_version
from car ) max_ver
where max_ver.version = max_ver.max_version
Not sure why you did option two like that... in this case the subselect should theoretically be slower, because you're selecting from the same table twice and then joining the results back to itself.
Just remove version from your inline view and they are the same thing.
select car_id, max(version) over (partition by car_id) as max_version
from car
The performance really depends on the optimizer in this situation, but yes, as the original answer suggests, inline views can help because they narrow the results. This is not a good example of that, though, since it's the same table with no filters in the selections given.
Partitioning is also helpful when you are selecting a lot of columns but need different aggregations alongside them in the result set. Otherwise you are forced to group by every other column.
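For instance, here is a sketch of what I mean (the color and price columns are hypothetical, added purely for illustration):
select car_id,
       color,                                                  -- detail column, no GROUP BY needed
       price,                                                  -- detail column, no GROUP BY needed
       max(version) over (partition by car_id) as max_version, -- one aggregate per car
       avg(price)   over (partition by car_id) as avg_price    -- a second, different aggregate
from   car;
-- the GROUP BY equivalent would have to either group by color and price
-- or wrap them in aggregates of their own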
Yes, it may affect performance.
The second query is an example of an inline view.
It's a very useful method for building reports with various kinds of counts or other aggregate functions.
Oracle executes the subquery and then uses the resulting rows as a view in the FROM clause.
As far as performance goes, inline views are generally recommended over other subquery types.
One more thing: the second query will return every row whose version equals the maximum, while the first one returns only a single row per car_id.
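To illustrate with made-up data (assuming duplicate versions can exist):
-- hypothetical sample rows
insert into car (car_id, version) values (1, 2);
insert into car (car_id, version) values (1, 3);
insert into car (car_id, version) values (1, 3);
-- query 1 (GROUP BY)    returns one row : car_id 1, max_version 3
-- query 2 (inline view) returns two rows: both rows where version = 3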
It will depend on your indexing scheme and the amount of data in the table. The optimizer will likely make different decisions based on the data that's actually inside the table.
I have found, at least in SQL Server (I know you asked about Oracle), that the optimizer is more likely to perform a full scan with the PARTITION BY query than with the GROUP BY query. But that's only in cases where you have an index which contains CAR_ID and VERSION (DESC) in it.
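As a sketch, this is the kind of index I mean (the name is illustrative):
-- SQL Server syntax; index name is illustrative
CREATE INDEX IX_CAR_CARID_VERSION ON CAR (CAR_ID, VERSION DESC);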
The moral of the story is that I would test thoroughly to choose the right one. For small tables, it doesn't matter. For really, really big data sets, neither may be fast...
Based on surfing the web, I came up with two methods of counting the records in a table "Table1". The counter field increments according to a date field "TheDate": each record's counter is the number of records with an older (or equal) TheDate value. Furthermore, records with different values for the compound field (Field1, Field2) are counted using separate counters. Field3 is just an informational field that is included for added awareness and does not affect the counting or how records are grouped for counting.
Method 1: Use correlated subquery
SELECT MainQuery.Field1,
MainQuery.Field2,
MainQuery.Field3,
MainQuery.TheDate,
(
SELECT SUM(1) FROM Table1 InnerQuery
WHERE InnerQuery.Field1 = MainQuery.Field1 AND
InnerQuery.Field2 = MainQuery.Field2 AND
InnerQuery.TheDate <= MainQuery.TheDate
) AS RunningCounter
FROM Table1 MainQuery
ORDER BY MainQuery.Field1,
MainQuery.Field2,
MainQuery.TheDate,
MainQuery.Field3
Method 2: Use join and group-by
SELECT MainQuery.Field1,
MainQuery.Field2,
MainQuery.Field3,
MainQuery.TheDate,
SUM(1) AS RunningCounter
FROM Table1 MainQuery INNER JOIN Table1 InnerQuery
ON InnerQuery.Field1 = MainQuery.Field1 AND
InnerQuery.Field2 = MainQuery.Field2 AND
InnerQuery.TheDate <= MainQuery.TheDate
GROUP BY MainQuery.Field1,
MainQuery.Field2,
MainQuery.Field3,
MainQuery.TheDate
ORDER BY MainQuery.Field1,
MainQuery.Field2,
MainQuery.TheDate,
MainQuery.Field3
There is no inner query per se in Method 2, but I use the table alias InnerQuery so that a ready parallel with Method 1 can be drawn. The role is the same; the 2nd instance of Table1 is for accumulating the counts of the records whose TheDate is less than or equal to that of any record in MainQuery (the 1st instance of Table1) with the same Field1 and Field2 values.
Note that in Method 2, Field3 is included in the GROUP BY clause even though I said that it does not affect how the records are grouped for counting. This is still true, since the counting is done using the matching records in InnerQuery, whereas the GROUP BY applies to Field3 in MainQuery.
I found that Method 1 is noticeably faster. I'm surprised by this because it uses a correlated subquery. The way I think of a correlated subquery is that it is executed once for each record in MainQuery (whether or not that is what happens in practice after optimization). On the other hand, Method 2 doesn't run an inner query over and over again. However, the inner join still matches multiple records in InnerQuery to each record in MainQuery, so in a sense it deals with a similar order of complexity.
Is there a decent intuitive explanation for this speed difference, as well as best practices or considerations in choosing an approach for time-based accumulation?
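(For what it's worth, I gather that an engine with window function support, which Access lacks, could presumably express the whole thing as a single windowed count, something like the sketch below; I mention it only for comparison with the two methods above.)
SELECT Field1,
       Field2,
       Field3,
       TheDate,
       -- the default RANGE frame includes ties on TheDate, matching the <= comparison
       COUNT(*) OVER (PARTITION BY Field1, Field2
                      ORDER BY TheDate) AS RunningCounter
FROM Table1
ORDER BY Field1, Field2, TheDate, Field3;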
I've also posted this to Microsoft Answers and Stack Exchange.
In fact, I think the easiest way is to do this:
SELECT MainQuery.Field1,
MainQuery.Field2,
MainQuery.Field3,
MainQuery.TheDate,
COUNT(*)
FROM Table1 MainQuery
GROUP BY MainQuery.Field1,
MainQuery.Field2,
MainQuery.Field3,
MainQuery.TheDate
ORDER BY MainQuery.Field1,
MainQuery.Field2,
MainQuery.TheDate,
MainQuery.Field3
(The order by isn't required to get the same data, just to order it. In other words, removing it will not change the number or contents of each row returned, just the order in which they are returned.)
You only need to specify the table once. Doing a self-join (joining a table to itself as both your queries do) is not required. The performance of your two queries will depend on a whole load of things which I don't know - what the primary keys are, the number of rows, how much memory is available, and so on.
First, your experience makes a lot of sense. I'm not sure why you need more intuition. I imagine you learned, somewhere along the way, that correlated subqueries are evil. Well, just as some of the things we teach kids as being really bad ("don't cross the street when the walk sign is not green") turn out to be not so bad, the same is true of correlated subqueries.
The easiest intuition is that the uncorrelated version has to aggregate all the data in the table. The correlated version only has to aggregate the matching rows, although it has to do this over and over.
To put numbers to it, say you have 1,000 rows with 10 rows per group. The output is 100 rows. The first version does 100 aggregations of 10 rows each. The second does one aggregation of 1,000 rows. Well, aggregation generally scales in a super-linear fashion (O(n log n), technically). That means that 100 aggregations of 10 records takes less time than 1 aggregation of 1000 records.
You asked for intuition, so the above is meant to provide some. There are a zillion caveats that go both ways. For instance, the correlated subquery might be able to make better use of indexes for the aggregation. And the two queries are not strictly equivalent, because the correct join would be a LEFT JOIN.
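To make that last point concrete, here is a sketch (untested) of Method 2 with the outer join. Note that SUM(1) becomes a COUNT of an inner-table column so that unmatched rows count as zero rather than one; in this particular query every row matches at least itself (TheDate <= TheDate), so both joins happen to return the same counts here, but the LEFT JOIN is the generally safe form:
SELECT MainQuery.Field1,
       MainQuery.Field2,
       MainQuery.Field3,
       MainQuery.TheDate,
       COUNT(InnerQuery.TheDate) AS RunningCounter   -- counts matched rows only; NULLs are ignored
FROM Table1 MainQuery LEFT JOIN Table1 InnerQuery
     ON InnerQuery.Field1 = MainQuery.Field1 AND
        InnerQuery.Field2 = MainQuery.Field2 AND
        InnerQuery.TheDate <= MainQuery.TheDate
GROUP BY MainQuery.Field1,
         MainQuery.Field2,
         MainQuery.Field3,
         MainQuery.TheDate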
Actually, I was wrong in my original post. The inner join is way, way faster than the correlated subquery. However, the correlated subquery is able to display its results records as they are generated, so it appears faster.
As a side curiosity, I'm finding that if the correlated sub-query approach is modified to use sum(-1) instead of sum(1), the number of returned records seems to vary from N-3 to N (where N is the correct number, i.e., the number of records in Table1). I'm not sure if this is due to some misbehaviour in Access's rush to display initial records or what-not.
While it seems that the INNER JOIN wins hands-down, there is a major insidious caveat. If the GROUP BY fields do not uniquely distinguish each record in Table1, then you will not get an individual SUM for each record of Table1. Imagine that a particular combination of GROUP BY field values matches (say) THREE records in Table1. You will then get a single SUM for all of them. The problem is that each of these 3 records in MainQuery also matches all 3 of the same records in InnerQuery, so those InnerQuery rows get counted multiple times. Very insidious (I find).
So it seems that the sub-query may be the way to go, which is awfully disturbing in view of the above problem with repeatability (2nd paragraph above). That is a serious problem that should send shivers down any spine. Another possible solution that I'm looking at is to turn MainQuery into a subquery by SELECTing the fields of interest and DISTINCTifying them before INNER JOINing the result with InnerQuery.
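A sketch (untested) of that DISTINCT idea; Field3 is left out because reintroducing it is what causes the duplication in the first place:
SELECT DQ.Field1,
       DQ.Field2,
       DQ.TheDate,
       SUM(1) AS RunningCounter
FROM (SELECT DISTINCT Field1, Field2, TheDate FROM Table1) AS DQ
     INNER JOIN Table1 AS InnerQuery
     ON InnerQuery.Field1 = DQ.Field1 AND
        InnerQuery.Field2 = DQ.Field2 AND
        InnerQuery.TheDate <= DQ.TheDate
GROUP BY DQ.Field1,
         DQ.Field2,
         DQ.TheDate
ORDER BY DQ.Field1,
         DQ.Field2,
         DQ.TheDate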
I am using this query:
select o.orderno,
o.day,
a.name,
o.description,
a.adress,
o.quantity,
a.orderType,
o.status,
a.Line,
a.price,
a.serial
from orders o
inner join account a
on o.orderid=a.orderid
order by o.day
I am ordering by day. After the results are sorted by day, what (if anything) determines the order of rows within the same day?
There is no further sorting. You'll get the results within each day in whatever order Oracle happened to retrieve them, which is not guaranteed in any way, and can be different for the same query being run multiple times. It depends on many things under the hood which you generally have no control over or even visibility of. You may see the results in an apparent order that suits you at the moment, but it could change for a future execution. Changing data will affect the execution plan, for example, which can affect the order in which you see the results.
If you need a specific order, or just want them returned in a consistent order every time you run the query, you must specify it in the order by clause.
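For example, to get a deterministic order within each day you could add a tie-breaker, say the order number (assuming that is the order you want):
order by o.day, o.orderno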
This is something Tom Kyte often stresses; for example in this often-quoted article.
It tries to order by the unique/primary key; in this case, orderno, if that is your primary key.
However, your query is laden with errors.
e.g. the table aliases are used in the SELECT clause but are not specified in the FROM clause.
I often find myself running a query to get the number of people who meet a certain criterion, the total number of people in that population, and the percentage that meets the criterion. I've been doing it the same way for a while, and I was wondering what SO would do to solve the same type of problem. Below is how I wrote the query:
select m.state_cd
,m.injurylevel
,COUNT(distinct m.patid) as pplOnRx
,x.totalPatientsPerState
,round((COUNT(distinct m.patid) /cast(x.totalPatientsPerState as float))*100,2) as percentPrescribedNarcotics
from members as m
inner join rx on rx.patid=m.PATID
inner join DrugTable as dt on dt.drugClass=rx.drugClass
inner join
(
select m2.state_cd, m2.injurylevel, COUNT(distinct m2.patid) as totalPatientsPerState
from members as m2
inner join rx on rx.patid=m2.PATID
group by m2.STATE_CD,m2.injuryLevel
) x on x.state_cd=m.state_cd and m.injuryLevel=x.injurylevel
where drugText like '%narcotics%'
group by m.state_cd,m.injurylevel,x.totalPatientsPerState
order by m.STATE_CD,m.injuryLevel
In this example, not everyone who appears in the members table is in the rx table. The derived table makes sure that everyone who is in rx is also in members, without the drugText like narcotics condition. From what little I've played with it, it seems the OVER (PARTITION BY ...) clause might work here. I have no idea if it does; it just seems like it to me. How would someone else go about tackling this problem?
This is exactly what MDX and SSAS are designed to do. If you insist on doing it in SQL (nothing wrong with that), are you asking for a way to do it with better performance? In that case, it would depend on how the tables are indexed, tempdb speed, and, if the tables are partitioned, on that too.
Also, the distinct count is going to be one of the larger performance hits. The LIKE '%narcotics%' in the predicate is going to force a full table scan and should be avoided at all costs (can this be an integer key in the data model?).
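As a sketch of what I mean (the isNarcotic column is hypothetical, and this assumes drugText lives on DrugTable), you could precompute the classification once so the reporting query can use an equality predicate instead of a leading-wildcard LIKE:
-- one-time setup; column name is hypothetical
ALTER TABLE DrugTable ADD isNarcotic bit NOT NULL DEFAULT 0;
UPDATE DrugTable SET isNarcotic = 1 WHERE drugText LIKE '%narcotics%';
-- the reporting query could then filter with:  WHERE dt.isNarcotic = 1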
To answer your question, I'm not really sure windowing (OVER (PARTITION BY ...)) is going to perform any better. I would test it and see, but there is nothing "wrong" with the query.
You could rewrite the distinct counts as derived tables or temp tables with GROUP BYs, or a combination of those two.
To illustrate, this is a stub for windowing that you could grow into the same query:
select a.state_cd,a.injurylevel,a.totalpatid, count(*) over (partition by a.state_cd, a.injurylevel)
from
(select state_cd,injurylevel,count(*) as totalpatid, count(distinct patid) as patid
from
#members
group by state_cd,injurylevel
) a
See what I mean about it not really being that helpful? Then again, sometimes rewriting a query slightly can improve performance by selecting a better execution plan, but rather than taking stabs in the dark, I'd first find the bottlenecks in the query you have, since you already took the time to write it.
I'm stress testing an app by adding loads and loads of items and forcing it to do lots of work.
select *, (
select price
from prices
WHERE widget_id = widget.id
ORDER BY id DESC
LIMIT 1
) as maxprice
FROM widgets
ORDER BY created_at DESC
LIMIT 20 OFFSET 0
That query selects from widgets (approx. 8,500 rows), and prices has 777,000 or so entries in it.
The query is timing out on the test environment, which is using the basic Heroku shared database (193 MB in use of the 5 GB max).
What will solve that timeout issue? The prices update each hour, so every hour you get roughly 8,500 new rows.
These are hugely excessive amounts for the app (in reality it's unlikely it would ever have 8,500 widgets), but I'm wondering what's appropriate to solve this?
Is my query stupid? (i.e. is it a bad style of query to do that subselect - my SQL knowledge is terrible, one of the goals of this project is to improve it!)
Or am I just hitting a limit of a shared db and should expect to move onto a dedicated db (e.g. the min $200 per month dedicated postgres instance from Heroku.) given the size of the prices table? Is there a deeper issue in terms of how I've designed the DB? (i.e. it's a one to many, one widget has many prices.) Is there a more sensible approach?
I'm totally new to the world of sql and queries etc. at scale, hence the utter ignorance expressed above. :)
Final version after comments below:
@Dave wants the latest price per widget. You could do that with sub-queries and LIMIT 1 per widget, but in modern PostgreSQL a window function does the job more elegantly. Consider first_value() / last_value():
SELECT DISTINCT
       w.*
     , first_value(p.price) OVER (PARTITION BY w.id
                                  ORDER BY p.created_at DESC) AS latest_price
FROM  (
   SELECT *
   FROM   widgets
   ORDER  BY created_at DESC
   LIMIT  20
   ) w
JOIN   prices p ON p.widget_id = w.id;
-- DISTINCT collapses the duplicate rows the join produces; latest_price is
-- constant per widget, so exactly one row per widget remains.
-- This assumes prices.created_at marks when each price was recorded.
Original post for the maximum price per widget:
SELECT w.*
, max(p.price) AS max_price
FROM (
SELECT *
FROM widgets
ORDER BY created_at DESC
LIMIT 20
) w
JOIN prices p ON p.widget_id = w.id
GROUP BY w.col1, w.col2 -- spell out all columns of w.*
Fix table aliases.
Retrieve all columns of widgets like the question demonstrates
In PostgreSQL 8.3 you must spell out all non-aggregated columns of the SELECT list in the GROUP BY clause. In PostgreSQL 9.1 or later, the primary key column would cover the whole table. I quote the manual here:
Allow non-GROUP BY columns in the query target list when the primary
key is specified in the GROUP BY clause
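In other words, on 9.1 something like this sketch is legal (assuming id is the primary key of widgets, and ignoring the LIMIT 20 subquery for a moment):
-- PostgreSQL 9.1+: grouping by the primary key covers all columns of widgets
SELECT w.*
     , max(p.price) AS max_price
FROM   widgets w
JOIN   prices  p ON p.widget_id = w.id
GROUP  BY w.id;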
I advise never using mixed-case identifiers like maxWidgetPrice. Unquoted identifiers are folded to lower case by default in PostgreSQL. Do yourself a favor and use lower-case identifiers exclusively.
Always use explicit JOIN conditions where possible. It's the canonical SQL way and it's more readable.
OFFSET 0 is just noise
Indexes:
However, the key to performance is the right set of indexes. I would go with two indexes like these:
CREATE INDEX widgets_created_at_idx ON widgets (created_at DESC);
CREATE INDEX prices_widget_id_idx ON prices(widget_id, price DESC);
The second one is a multicolumn index that should provide the best performance for retrieving the maximum price after you have determined the top 20 widgets using the first index. I'm not sure whether PostgreSQL 8.3 (the default on the Heroku shared db) is already smart enough to make the most of it. PostgreSQL 9.1 certainly is.
For the latest price (see comments), use this index instead:
CREATE INDEX prices_widget_id_idx ON prices(widget_id, created_at DESC);
You don't have to (and shouldn't) just trust me. Test performance and query plans with EXPLAIN ANALYZE with and without indexes and see for yourself. Index creation should be very fast, even for a million rows.
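For instance, repeating the query from above:
EXPLAIN ANALYZE
SELECT w.*
     , max(p.price) AS max_price
FROM  (
   SELECT *
   FROM   widgets
   ORDER  BY created_at DESC
   LIMIT  20
   ) w
JOIN   prices p ON p.widget_id = w.id
GROUP  BY w.col1, w.col2;  -- spell out all columns of w.*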
If you are considering a switch to a standalone PostgreSQL database on Heroku, you may be interested in a recent Heroku blog post:
The default is now PostgreSQL 9.1.
You can also cancel long-running queries there now.
I'm not quite clear on what you are asking, but here is my understanding:
Find the widgets you want to price. In this case it looks like you are looking for the most recent 20 widgets:
SELECT w.id
FROM widgets
ORDER BY created_at DESC
LIMIT 20 OFFSET 0
For each of the 20 widgets you found, it seems you want to find the highest associated price from the prices table:
SELECT s.id, MAX(p.price) AS maxWidgetPrice
FROM (SELECT w.id
FROM widgets
ORDER BY created_at DESC
LIMIT 20 OFFSET 0
) s -- widget subset
, prices p
WHERE s.id = p.widget_id
GROUP BY s.id
prices.widget_id needs to be indexed for this to be effective. You don't want to process the entire prices table each time if it is relatively large, just the subset of rows you need.
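Something like this, for instance (the index name is just illustrative):
-- widget_id alone covers the join lookup for this query
CREATE INDEX idx_prices_widget_id ON prices (widget_id);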
EDIT: added "group by" (and no, this was not tested)
We generate a lot of SQL procedurally and SQL Server is killing us. Because of some issues documented elsewhere we basically do SELECT TOP 2 ** 32 instead of TOP 100 PERCENT.
Note: we must use the subqueries.
Here's our query:
SELECT * FROM (
SELECT [me].*, ROW_NUMBER() OVER( ORDER BY (SELECT(1)) )
AS rno__row__index FROM (
SELECT [me].[id], [me].[status] FROM (
SELECT TOP 4294967296 [me].[id], [me].[status] FROM
[PurchaseOrders] [me]
LEFT JOIN [POLineItems] [line_items]
ON [line_items].[id] = [me].[id]
WHERE ( [line_items].[part_id] = ? )
ORDER BY [me].[id] ASC
) [me]
) [me]
) rno_subq
WHERE rno__row__index BETWEEN 1 AND 25
Are there better ways to do this that anyone can see?
UPDATE: here is some clarification on the whole subquery issue:
The key word of my question is "procedurally". I need the ability to reliably encapsulate resultsets so that they can be stacked together like building blocks. For example, I want to get the first 10 CDs ordered by the name of the artist who produced them, and also get the related artist for each CD. What I do is assemble a monolithic subselect representing the CDs ordered by the joined artist names, then apply a limit to it, then join the nested subselects to the artist table, and only then execute the resulting query. The isolation is necessary because the code that requests the ordered CDs is unrelated to and oblivious of the code selecting the top 10 CDs, which in turn is unrelated to and oblivious of the code that requests the related artists.
Now you may say that I could move the inner ORDER BY into the OVER() clause, but then I break the encapsulation, as I would have to SELECT the columns of the joined table, so I can order by them later. An additional problem would be the merging of two tables under one alias; if I have identically named columns in both tables, the select me.* would stop right there with an ambiguous column name error.
I am willing to sacrifice a bit of the optimizer performance, but the 2**32 seems like too much of a hack to me. So I am looking for middle ground.
If you want top rows by me.id, just ask for that in the ROW_NUMBER's ORDER BY. Don't chase your tail around subqueries and TOP.
If you have a WHERE clause on a field from the outer-joined table, the outer JOIN buys you nothing: for unmatched rows all the outer fields are NULL and get filtered out by the WHERE, so it is effectively an inner join.
WITH cteRowNumbered AS (
SELECT [me].id, [me].status,
       ROW_NUMBER() OVER (ORDER BY [me].id ASC) AS rno__row__index
FROM [PurchaseOrders] [me]
JOIN [POLineItems] [line_items] ON [line_items].[id] = [me].[id]
WHERE [line_items].[part_id] = ?)
SELECT me.id, me.status
FROM cteRowNumbered
WHERE rno__row__index BETWEEN 1 and 25
I use CTEs instead of subqueries just because I find them more readable.
Use:
SELECT x.*
FROM (SELECT po.id,
po.status,
ROW_NUMBER() OVER( ORDER BY po.id) AS rno__row__index
FROM [PurchaseOrders] po
JOIN [POLineItems] li ON li.id = po.id
WHERE li.part_id = ?) x
WHERE x.rno__row__index BETWEEN 1 AND 25
ORDER BY x.id ASC
Unless you've omitted details in order to simplify the example, there's no need for all your subqueries in what you provided.
Kudos to the only person who saw through the naysaying and actually tried the query on a large table we do not have access to. To all the rest saying this simply will not work (will return random rows): we know what the manual says, and we know it is a hack; this is why we asked the question in the first place. However, outright dismissing a query without even trying it is rather shallow. Can someone provide us with a real example (with preceding CREATE/INSERT statements) demonstrating the above query malfunctioning?
Your update makes things much clearer. I think that the approach you're using is seriously flawed. While it's nice to be able to have encapsulated, reusable code in your applications, front-end applications are a much different animal than a database. They typically deal with small structures and small, discrete processes that run against those structures. Databases, on the other hand, often deal with tables that are measured in the millions of rows and sometimes more than that. Using the same methodologies will often result in code that simply performs so badly as to be unusable. Even if it works now, it's very likely that it won't scale and will cause major problems down the road.
Best of luck to you, but I don't think that this approach will end well in all but the smallest of databases.