Efficiently order by columns from different tables - sql

We're currently trying to improve a system that allows the user to filter and sort a large list of objects (> 100k) by fields that are not being displayed. Since the fields can be selected dynamically, we plan to build the query dynamically as well.
That doesn't sound too hard and the basics are easily done, but the problem lies in how the data is structured. In some cases more or less expensive joins are needed, which can add up to quite an expensive query, especially when those joins are combined (i.e. select * from table join some_expensive_join join another_expensive_join ...).
Filtering wouldn't be that big a problem since we can use intersections.
Ordering, however, would require us to first build a result set containing all the necessary data, which, if done via one huge select statement with all those joins, would become quite expensive.
So the question is: is there a more efficient way to do that?
I could think of doing it like this:
do a select query for the first column and order by that
for all elements that basically have the same order (e.g. same value) do another query to resolve that
repeat the step above until the order is unambiguous or we run out of sort criteria
Does that make sense? If yes, how could this be done in PostgreSQL 9.4? (We currently can't upgrade, so 9.5+ solutions, though welcome, wouldn't help us right now.)

Does this help, or is it too trivial? (the subqueries could be prefab join views)
SELECT t0.id, t0.a, t0.b, t0.c, ...
FROM main_table t0
JOIN ( SELECT t1.id AS id
            , rank() OVER (ORDER BY whatever) AS rnk
       FROM different_tables_or_JOINS
     ) AS t1 ON t1.id = t0.id
JOIN ( SELECT t2.id AS id
            , rank() OVER (ORDER BY whatever) AS rnk
       FROM different_tables_or_JOINS2
     ) AS t2 ON t2.id = t0.id
...
ORDER BY t1.rnk
       , t2.rnk
       , ...
       , t0.id
;
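For what it's worth, here is a filled-in version of that skeleton. The table and column names (details_a, details_b, main_id, some_field, other_field) are purely hypothetical stand-ins for the expensive joins; it is only meant to show the shape of the query, which works on PostgreSQL 9.4:

-- Hypothetical names; each subquery could equally be a prefab join view.
SELECT t0.id, t0.a, t0.b, t0.c
FROM   main_table t0
JOIN  ( SELECT da.main_id AS id
             , rank() OVER (ORDER BY da.some_field) AS rnk
        FROM   details_a da
      ) AS t1 ON t1.id = t0.id
JOIN  ( SELECT db.main_id AS id
             , rank() OVER (ORDER BY db.other_field DESC) AS rnk
        FROM   details_b db
      ) AS t2 ON t2.id = t0.id
ORDER BY t1.rnk, t2.rnk, t0.id;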

Related

How do you explain the difference in performance of these two queries?

I have the following two queries:
WITH
  T AS (
    SELECT
      ROW_NUMBER() OVER() AS uid,
      position_geom
    FROM
      `bigquery-public-data.catalonian_mobile_coverage.mobile_data_2015_2017`
    WHERE
      date > "2017-12-26" )
SELECT
  T1.uid,
  T2.uid,
  ST_DISTANCE(T1.position_geom, T2.position_geom) distance
FROM
  T AS T1,
  T AS T2
WHERE
  T1.uid <> T2.uid
QUALIFY
  ROW_NUMBER() OVER (PARTITION BY T1.uid ORDER BY ST_DISTANCE(T1.position_geom, T2.position_geom)) < 101
ORDER BY
  T1.uid,
  ST_DISTANCE(T1.position_geom, T2.position_geom)
and
WITH
  T AS (
    SELECT
      ROW_NUMBER() OVER() AS uid,
      position_geom
    FROM
      `bigquery-public-data.catalonian_mobile_coverage.mobile_data_2015_2017`
    WHERE
      date > "2017-12-26" ),
  T2 AS (
    SELECT
      T1.position_geom pointA,
      T1.uid uidA,
      T2.position_geom pointB,
      T2.uid uidB,
      ST_DISTANCE(T1.position_geom, T2.position_geom) distance
    FROM
      T AS T1,
      T AS T2 )
SELECT
  uidA,
  uidB,
  distance
FROM
  T2
WHERE
  uidA <> uidB
QUALIFY
  ROW_NUMBER() OVER (PARTITION BY uidA ORDER BY distance) < 101
ORDER BY
  uidA,
  distance
Both queries give the same result, but the second runs almost twice as fast (in terms of elapsed time and slot time consumed). Why is the first query not internally optimized by BigQuery?
You can see in the execution details how the query stages are processed. The first query reads all the data you are selecting after joining the T1 and T2 tables, and the sort is then fast because the data is already in a temporary table; that makes the sorting step faster, but the way the data is read makes the overall query execution slower.
The second query is faster because it reads the data during the sorting process. The sorting step itself is slower than in the first query, but there is no repartition step needed to read the data, which makes the overall execution faster.
The difference is in how the query is written: in the second one you provide a subquery that shapes how the work is distributed, so there is only one reading pass, and that consumes less slot time.
To give some hints, you need to specify a few elements:
Which database are you using?
Which index(es) are defined?
The number of rows in bigquery-public-data.catalonian_mobile_coverage.mobile_data_2015_2017
Are you querying the database directly, or do you go through a connector?
Depending on the number of rows, your database engine may not use the index.
Each database normally has a set of tools that help analyse queries.
MySQL (this should work for MariaDB too), for example, provides the EXPLAIN keyword and ANALYZE TABLE (a small sketch follows below).
If you have the Enterprise Edition, you can get access to MySQL Query Analyzer.
SQL Server provides the SQL Server Query Analyzer.
...
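As an illustration only, here is roughly how those MySQL/MariaDB tools are invoked; coverage_points is a hypothetical stand-in table, not one from the question:

-- Show the plan the optimizer chose for a query.
EXPLAIN
SELECT uid, position_geom
FROM   coverage_points
WHERE  date > '2017-12-26';

-- Refresh the table's index statistics so the optimizer has accurate estimates.
ANALYZE TABLE coverage_points;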

LEFT JOIN with ROW NUM DB2

I am using two tables that have a one-to-many mapping (in DB2).
I need to fetch 20 records at a time using ROW_NUMBER from the two tables using a LEFT JOIN. But due to the one-to-many mapping, the result is not consistent: I might get 20 records, but those records do not contain 20 unique records of the first table.
SELECT
    A.*,
    B.*,
    ROW_NUMBER() OVER (ORDER BY A.COLUMN_1 DESC) AS rn
FROM
    table1 A
LEFT JOIN
    table2 B ON A.COLUMN_3 = B.COLUMN3
WHERE
    rn BETWEEN 1 AND 20
Please suggest some solution.
Sure, this is easy... once you know that you can use subqueries as a table reference:
SELECT <relevant columns from Table1 and Table2>, rn
FROM (SELECT <relevant columns from Table1>,
             ROW_NUMBER() OVER (ORDER BY <relevant columns> DESC) AS rn
      FROM table1) Table1
LEFT JOIN Table2
       ON <relevant equivalent columns>
WHERE rn >= :startOfRange
  AND rn < :startOfRange + :numberOfElements
For production code, never do SELECT * - always explicitly list the columns you want (there are several reasons for this).
Prefer an inclusive lower bound (>=) and an exclusive upper bound (<) for (positive) ranges. For everything except integral types this is required to query the values sanely/cleanly. Do it with integral types too, both for consistency and for ease of querying (note that you don't actually need to know which value you "stop" on). This pattern is also considered the standard when dealing with iterated value constructs.
Note that this query currently has two problems:
You need to list enough columns in the ORDER BY to return consistent results. This is best done by using a unique value - you probably want something in an index that the optimizer can use (see the sketch after this list).
Every time you run this query, you (usually) have to order the ENTIRE result set before you can get whatever slice of it you want (especially for anything after the first page). If your dataset is large, look at the answers to this question for some ideas for performance improvements. The optimizer may be able to cache the results for you, but that's not guaranteed (especially on tables that receive many updates).
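Here is the same pattern filled in with purely hypothetical column names (order_id, order_date, item_id), just to show a unique tiebreaker in the ORDER BY and the inclusive/exclusive range bounds:

SELECT t1.order_id, t1.order_date, t2.item_id, t1.rn
FROM  (SELECT order_id, order_date,
              ROW_NUMBER() OVER (ORDER BY order_date DESC, order_id DESC) AS rn
       FROM   table1) t1
LEFT JOIN table2 t2
       ON t2.order_id = t1.order_id
WHERE t1.rn >= :startOfRange
  AND t1.rn <  :startOfRange + :numberOfElements
ORDER BY t1.rn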

SELECT MAX() too slow - any alternatives?

I've inherited a SQL Server based application and it has a stored procedure that contains the following, but it hits timeout. I believe I've isolated the issue to the SELECT MAX() part, but I can't figure out how to use alternatives, such as ROW_NUMBER() OVER( PARTITION BY...
Anyone got any ideas?
Here's the "offending" code:
SELECT BData.*, B.*
FROM BData
INNER JOIN
(
SELECT MAX( BData.StatusTime ) AS MaxDate, BData.BID
FROM BData
GROUP BY BData.BID
) qryMaxDates
ON ( BData.BID = qryMaxDates.BID ) AND ( BData.StatusTime = qryMaxDates.MaxDate )
INNER JOIN BItems B ON B.InternalID = qryMaxDates.BID
WHERE B.ICID = 2
ORDER BY BData.StatusTime DESC;
Thanks in advance.
SQL performance problems are seldom addressed by rewriting the query; the compiler already knows how to rewrite it anyway. The problem is always indexing. To get MAX(StatusTime) ... GROUP BY BID efficiently, you need an index on BData(BID, StatusTime). For an efficient seek on WHERE B.ICID = 2, you need an index on BItems.ICID.
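For reference, a sketch of the two indexes described above - the index names are made up:

-- Hypothetical names; the columns match the recommendation above.
CREATE INDEX IX_BData_BID_StatusTime ON dbo.BData (BID, StatusTime);
CREATE INDEX IX_BItems_ICID ON dbo.BItems (ICID);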
The query could probably also be expressed as a correlated APPLY, because it seems that is what's really desired:
SELECT D.*, B.*
FROM BItems B
CROSS APPLY
(
SELECT TOP(1) *
FROM BData
WHERE B.InternalID = BData.BID
ORDER BY StatusTime DESC
) AS D
WHERE B.ICID = 2
ORDER BY D.StatusTime DESC;
This is not semantically the same query as the OP's: the OP's would return multiple rows on a StatusTime collision. My guess, though, is that this is what's desired ('the most recent BData for this BItem').
Consider creating the following index:
CREATE INDEX LatestTime ON dbo.BData(BID, StatusTime DESC);
This will support a query with a CTE such as:
;WITH x AS
(
SELECT *, rn = ROW_NUMBER() OVER (PARTITION BY BID ORDER BY StatusTime DESC)
FROM dbo.BData
)
SELECT * FROM x
INNER JOIN dbo.BItems AS bi
ON x.BID = bi.InternalID
WHERE x.rn = 1 AND bi.ICID = 2
ORDER BY x.StatusTime DESC;
Whether the query still gets efficiencies from any indexes on BItems is another issue, but this should at least make the aggregate a simpler operation (though it will still require a lookup to get the rest of the columns).
Another idea would be to stop using SELECT * from both tables and only select the columns you actually need. If you really need all of the columns from both tables (this is rare, especially with a join), then you'll want to have covering indexes on both sides to prevent lookups.
I also suggest calling any identifier the same thing throughout the model. Why is the ID that links these tables called BID in one table and InternalID in another?
Also please always reference tables using their schema.
Bad habits to kick: avoiding the schema prefix
This may be a late response, but I recently ran into the same performance issue, where a simple query involving max() was taking more than an hour to execute.
After looking at the execution plan, it seems that in order to perform the max() function, every record meeting the WHERE clause condition will be fetched. In your case, every record in the table will need to be fetched before performing the max(). Also, indexing BData.StatusTime will not speed up the query: indexing is useful for looking up a particular record, but it will not help with performing the comparison.
In my case, I didn't have the GROUP BY, so all I did was use an ORDER BY ... DESC clause and SELECT TOP 1. The query went from over an hour down to under 5 minutes. Perhaps you can do what Gordon Linoff suggested and use PARTITION BY. Hopefully your query speeds up.
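To make that concrete, a sketch of the rewrite being described - @SomeBID and the WHERE clause are hypothetical stand-ins for whatever filter the original query used:

DECLARE @SomeBID int = 1;

SELECT TOP (1) *
FROM   dbo.BData
WHERE  BID = @SomeBID
ORDER BY StatusTime DESC;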
Cheers!
The following is the version of your query using row_number():
SELECT bd.*, b.*
FROM (select bd.*,
             row_number() over (partition by bid order by statustime desc) as seqnum
      from BData bd
     ) bd INNER JOIN
     BItems b
     ON b.InternalID = bd.BID and bd.seqnum = 1
WHERE b.ICID = 2
ORDER BY bd.StatusTime DESC;
If this is not faster, then it would be useful to see the query plans for your query and this query to figure out how to optimize them.
Depends entirely on what kind of data you have there. One alternative that may be faster is using CROSS APPLY instead of the MAX subquery. But more than likely it won't yield any faster results.
The best option would probably be to add an index on BID with an INCLUDE containing StatusTime, and, if possible, filtering it to the InternalIDs that match BItems.ICID = 2.
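As a sketch only - the cross-table filter on BItems.ICID can't be expressed in a filtered index on BData, so only the INCLUDE part is shown, and the index name is made up:

CREATE INDEX IX_BData_BID_IncStatusTime
    ON dbo.BData (BID)
    INCLUDE (StatusTime);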
[UNSOLVED] But I've moved on!
Thanks to everyone who provided answers / suggestions. Unfortunately I couldn't get any further with this, so have given-up trying for now.
It looks like the best solution is to re-write the application to UPDATE the latest data into a different table; that way it's a really quick and simple SELECT to get the latest readings.
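For anyone following the same route, a rough sketch of that idea - BDataLatest is a hypothetical one-row-per-BID table, and the MERGE would run whenever a new reading arrives:

DECLARE @BID int = 42,
        @StatusTime datetime2 = SYSUTCDATETIME();  -- the incoming reading

MERGE dbo.BDataLatest AS tgt
USING (SELECT @BID AS BID, @StatusTime AS StatusTime) AS src
    ON tgt.BID = src.BID
WHEN MATCHED AND src.StatusTime > tgt.StatusTime THEN
    UPDATE SET StatusTime = src.StatusTime
WHEN NOT MATCHED THEN
    INSERT (BID, StatusTime) VALUES (src.BID, src.StatusTime);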
Thanks again for the suggestions.

Is derived table executed once or three times?

Every time you make use of a derived table, that query is going to be executed. When using a CTE, that result set is pulled back once and only once within a single query.
Does the quote suggest that the following query will cause the derived table to be executed three times (once for each aggregate function call)?
SELECT
    AVG(OrdersPlaced), MAX(OrdersPlaced), MIN(OrdersPlaced)
FROM (
    SELECT
        v.VendorID,
        v.[Name] AS VendorName,
        COUNT(*) AS OrdersPlaced
    FROM Purchasing.PurchaseOrderHeader AS poh
    INNER JOIN Purchasing.Vendor AS v ON poh.VendorID = v.VendorID
    GROUP BY v.VendorID, v.[Name]
) AS x
thanx
No, that should be one pass; take a look at the execution plan.
Here is an example where something will run for every row in table2:
select *,
       (select COUNT(*) from table1 t1 where t1.id <= t2.id) as Bla
from table2 t2
Stuff like this with a running count will fire for each row in the table2 table.
A CTE and a nested (uncorrelated) subquery will generally have the same execution plan. Whether a CTE or a subquery is used has never had an effect on whether my intermediate queries get spooled.
With regard to the Tony Rogerson link - the explicit temp table performs better than the self-join to the CTE because it's indexed better - many times when you go beyond declarative SQL and start to anticipate the work process for the engine, you can get better results.
Sometimes, the benefit of a simpler and more maintainable query with many layered CTEs instead of a complex multi-temp-table process outweighs the performance benefits of a multi-table process. A CTE-based approach is a single SQL statement, which cannot be as quietly broken by a step being accidentally commented out or a schema changing.
Probably not, but it may spool the derived results so it only needs to access it once.
In this case, there should be no difference between a CTE and derived table.
Where is the quote from?

Is there a better way to sort this query?

We generate a lot of SQL procedurally and SQL Server is killing us. Because of some issues documented elsewhere we basically do SELECT TOP 2 ** 32 instead of TOP 100 PERCENT.
Note: we must use the subqueries.
Here's our query:
SELECT * FROM (
    SELECT [me].*, ROW_NUMBER() OVER( ORDER BY (SELECT(1)) ) AS rno__row__index
    FROM (
        SELECT [me].[id], [me].[status] FROM (
            SELECT TOP 4294967296 [me].[id], [me].[status]
            FROM [PurchaseOrders] [me]
            LEFT JOIN [POLineItems] [line_items]
                   ON [line_items].[id] = [me].[id]
            WHERE ( [line_items].[part_id] = ? )
            ORDER BY [me].[id] ASC
        ) [me]
    ) [me]
) rno_subq
WHERE rno__row__index BETWEEN 1 AND 25
Are there better ways to do this that anyone can see?
UPDATE: here is some clarification on the whole subquery issue:
The key word of my question is "procedurally". I need the ability to reliably encapsulate result sets so that they can be stacked together like building blocks. For example, I want to get the first 10 CDs ordered by the name of the artist who produced them, and also get the related artist for each CD. What I do is assemble a monolithic subselect representing the CDs ordered by the joined artist names, then apply a limit to it, then join the nested subselects to the artist table, and only then execute the resulting query. The isolation is necessary because the code that requests the ordered CDs is unrelated to and oblivious of the code selecting the top 10 CDs, which in turn is unrelated to and oblivious of the code that requests the related artists.
Now you may say that I could move the inner ORDER BY into the OVER() clause, but then I break the encapsulation, as I would have to SELECT the columns of the joined table so that I can order by them later. An additional problem would be merging two tables under one alias: if I have identically named columns in both tables, the SELECT me.* would stop right there with an ambiguous column name error.
I am willing to sacrifice a bit of the optimizer performance, but the 2**32 seems like too much of a hack to me. So I am looking for middle ground.
If you want the top rows by me.id, just ask for that in the ROW_NUMBER's ORDER BY. Don't chase your tail around subqueries and TOP.
If you have a WHERE clause on a joined table's field, you can still use an outer JOIN: all the outer fields would be NULL and filtered out by the WHERE, so it is effectively an inner join.
WITH cteRowNumbered AS (
    SELECT [me].id, [me].status,
           ROW_NUMBER() OVER (ORDER BY me.id ASC) AS rno__row__index
    FROM [PurchaseOrders] [me]
    JOIN [POLineItems] [line_items] ON [line_items].[id] = [me].[id]
    WHERE [line_items].[part_id] = ?)
SELECT id, status
FROM cteRowNumbered
WHERE rno__row__index BETWEEN 1 AND 25
I use CTEs instead of subqueries just because I find them more readable.
Use:
SELECT x.*
FROM (SELECT po.id,
             po.status,
             ROW_NUMBER() OVER (ORDER BY po.id) AS rno__row__index
      FROM [PurchaseOrders] po
      JOIN [POLineItems] li ON li.id = po.id
      WHERE li.part_id = ?) x
WHERE x.rno__row__index BETWEEN 1 AND 25
ORDER BY x.id ASC
Unless you've omitted details in order to simplify the example, there's no need for all your subqueries in what you provided.
Kudos to the only person who saw through the naysaying and actually tried the query on a large table we do not have access to. To all the rest saying this simply will not work (will return random rows): we know what the manual says, and we know it is a hack - this is why we asked the question in the first place. However, outright dismissing a query without even trying it is rather shallow. Can someone provide us with a real example (with preceding CREATE/INSERT statements) demonstrating the above query malfunctioning?
Your update makes things much clearer. I think the approach you're using is seriously flawed. While it's nice to be able to have encapsulated, reusable code in your applications, front-end applications are a very different animal from a database. They typically deal with small structures and small, discrete processes that run against those structures. Databases, on the other hand, often deal with tables that are measured in the millions of rows, and sometimes more than that. Using the same methodologies will often result in code that simply performs so badly as to be unusable. Even if it works now, it's very likely that it won't scale and will cause major problems down the road.
Best of luck to you, but I don't think that this approach will end well in all but the smallest of databases.