Sort PostgreSQL table using previously calculated data - sql

I have an operation that gives me a list of IDs and a score for each of them.
I then need to query the database and sort the rows using the data above.
I tried something like this (I'm using PostgreSQL):
SELECT * FROM sometable
LEFT OUTER JOIN (VALUES (629, 3), (624, 1)) /* Here is my data */
AS x(id, ordering)
USING (id)
WHERE some_column_id=1
ORDER BY x.ordering;
But for ~10,000 rows it runs for about 15 seconds on my machine.
Is there a better way to sort my table using previously calculated data?

What is the performance of this version?
SELECT st.*
FROM sometable st
WHERE st.some_column_id = 1
ORDER BY (CASE WHEN st.id = 629 THEN 3 WHEN st.id = 624 THEN 1 END);
An index on sometable(some_column_id) might also speed up the query.
However, I don't understand why your version on a table with 10,000 rows would take 15 seconds.
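Both approaches can be sketched with SQLite via Python's sqlite3 module; the table and sample rows below are hypothetical. A small lookup table stands in for the VALUES list, and the CASE variant from the answer is shown for comparison:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical stand-in for "sometable"
cur.execute("CREATE TABLE sometable (id INTEGER PRIMARY KEY, some_column_id INTEGER)")
cur.executemany("INSERT INTO sometable VALUES (?, ?)",
                [(624, 1), (629, 1), (700, 2)])

# Previously calculated (id, score) pairs loaded into a small lookup table
cur.execute("CREATE TEMP TABLE scores (id INTEGER PRIMARY KEY, ordering INTEGER)")
cur.executemany("INSERT INTO scores VALUES (?, ?)", [(629, 3), (624, 1)])

# Join against the lookup table and sort by the precomputed score
join_rows = cur.execute("""
    SELECT s.id FROM sometable s
    JOIN scores x ON x.id = s.id
    WHERE s.some_column_id = 1
    ORDER BY x.ordering
""").fetchall()

# Equivalent CASE-based ordering
case_rows = cur.execute("""
    SELECT id FROM sometable
    WHERE some_column_id = 1
    ORDER BY CASE WHEN id = 629 THEN 3 WHEN id = 624 THEN 1 END
""").fetchall()

print(join_rows, case_rows)  # both sort 624 before 629
```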

Related

SQL Server - Optimizing MAX() on large tables

My company has a series of SQL views. One critical view has a subselect that fetches the max(id) from a large table and joins it with another table.
As a sample, I populated a test table with 1M rows. MAX(id) (id is an integer column) takes 8 minutes. TOP 1 with ORDER BY id DESC also takes 8 minutes. Just experimenting, I tried max(id) over (partition by id), which takes one second, and the result set is correct. I'm not sure why this sped things up so much; any ideas are much appreciated. The new test table with 1M rows is tblmsg_nicholas:
INNER JOIN LongviewHoldTable lvhold WITH (NOLOCK)
    ON lvhold.MsgID = case tm.MsgType
        when 'LV_BLIM'
            /* then (select max(tm2.ID) from [dbo].[TBLMSG_NICHOLAS] tm2
               where msgtype = 'LV_ALLOC' and TM.GroupID = tm2.groupID) */
            /* then (SELECT TOP 1 ID FROM TBLMSG_NICHOLAS TM2
               WHERE msgtype = 'LV_ALLOC' and TM.GroupID = tm2.GroupID ORDER BY ID DESC) */
            then (select max(tm2.ID) OVER (PARTITION BY ID)
                  from [dbo].[TBLMSG_NICHOLAS] tm2
                  where msgtype = 'LV_ALLOC' and TM.GroupID = tm2.groupID)
        else tm.ID
    end
WHERE
    TA.TARGETTASKID IS NOT NULL AND
    TA.RESPONSE IS NULL
About MAX(): it looks like you are computing your MAX() more or less like this:
select max(tm2.ID)
from [dbo].[TBLMSG_NICHOLAS] tm2
where msgtype = 'LV_ALLOC' and TM.GroupID = tm2.groupID
An index on TBLMSG_NICHOLAS (msgtype, GroupID, ID DESC) accelerates that subquery. The query planner can random-access that index directly: the first matching row contains the MAX(ID) value you want.
But you also use a so-called dependent -- correlated -- subquery. It's usually a good idea to refactor such subqueries into JOINed independent subqueries. But it's hard to help you do that because you didn't show your entire query.
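As a sketch of why that index helps, here is the shape of the subquery in SQLite via Python's sqlite3 (the table and rows are made up). With an index whose leading columns match the WHERE clause and whose last column is ID, both MAX(ID) and the TOP 1/ORDER BY DESC form can read a single index entry, and they agree:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical stand-in for TBLMSG_NICHOLAS
cur.execute("CREATE TABLE msgs (ID INTEGER PRIMARY KEY, msgtype TEXT, GroupID INTEGER)")
cur.executemany("INSERT INTO msgs VALUES (?, ?, ?)",
                [(1, 'LV_ALLOC', 10), (2, 'LV_BLIM', 10),
                 (3, 'LV_ALLOC', 10), (4, 'LV_ALLOC', 20)])

# Composite index: equality columns first, ID last, so the max sits at one end
cur.execute("CREATE INDEX ix_msgs ON msgs (msgtype, GroupID, ID)")

(max_id,) = cur.execute(
    "SELECT MAX(ID) FROM msgs WHERE msgtype = 'LV_ALLOC' AND GroupID = 10").fetchone()

# The TOP 1 ... ORDER BY ID DESC form (LIMIT 1 in SQLite) finds the same row
(top_id,) = cur.execute(
    "SELECT ID FROM msgs WHERE msgtype = 'LV_ALLOC' AND GroupID = 10 "
    "ORDER BY ID DESC LIMIT 1").fetchone()

print(max_id, top_id)  # 3 3
```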

sql server does not take most restrictive condition for execution plan

We have a query with multiple joins where SQL Server 2016 does not take the optimal path, and we cannot convince it to without hints (which we prefer not to use).
Simplified, the problem is as follows:
Table A (12 million rows)
Table B (type table, 5 rows)
Table C (12 million rows)
The query (simplified for clarity):
SELECT
    [A].[ID]
    ,[A].[DATE_CREATED]
    ,[A].[DATE_LAST_MODIFIED]
    ,[A].[CODE]
    ,[B].[CODE]
    ,[B].[DESCRIPTION]
    ,[C].[EVENT_ID]
    ,[C].[SOURCE_REFERENCE]
    ,[C].[EVTY_ID]
    ,[C].[BUSINESS_KEY]
    ,[C].[DATA]
    ,[C].[EVENT_DATE]
FROM A
JOIN B ON [B].[ID] = [A].[PSTY_ID] AND [B].[ACTIVE] = 1
JOIN C ON [C].[ID] = [B].[EVEN_ID] AND [C].[ACTIVE] = 1
WHERE [B].[CODE] = 'nopr' OR [B].[CODE] = 'inpr'
The selected codes from B correspond to the values 1 and 2.
Table A contains at most 10 rows with PSTY_ID value 1 or 2; the rest are 3, 4, or 5.
There is a foreign key from A.PSTY_ID to B.ID.
There is a filtered index on table A for PSTY_ID values 1 and 2, with all selected columns as included columns.
The optimizer does not seem to recognize that we are selecting the values 1 and 2, and it does not use the index or start with table B. Trying to force this with subqueries or by changing the table order does not help; only the hint OPTION (FORCE ORDER) can convince the optimizer, and we do not want that.
Only when we hard-code the B.ID or A.PSTY_ID values 1 and 2 in the WHERE clause does the optimizer take the correct path, starting with table B.
If we do not do this, it joins table A with table C first, and only then with table B, leading to vastly more processing time (approx. 50x).
We also tried declaring the values and using them as variables, but still no luck.
Does anyone know if this is a known issue, or whether it can be worked around?
Your filtered index will not be used in this case unless you include the values 1 and 2 in the WHERE clause. You cannot change this even if you join with a table that contains ONLY 1 and 2 in its rows.
A filtered index will never be chosen based on "assumptions" about what values some table (physical, or derived like a CTE or subquery) contains, which is why your subquery did not help.
So if you want the index to be used, you should add a WHERE condition equivalent to that of the filtered index to your query.
Since you don't want to add this condition, but still want to change the join order to start with table B, you can use a temporary table/table variable like this:
select [ID]
    ,[CODE]
    ,[DESCRIPTION]
    ,[EVEN_ID]
into #tmp
from B
where ([CODE] = 'nopr' OR [CODE] = 'inpr') and [ACTIVE] = 1
Now use this #tmp instead of B in your query.
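A rough sketch of the same idea in SQLite via Python's sqlite3 (names shortened, data invented): filter B once into a small temporary table, then drive the join from it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE B (ID INTEGER PRIMARY KEY, CODE TEXT, ACTIVE INTEGER)")
cur.executemany("INSERT INTO B VALUES (?, ?, ?)",
                [(1, 'nopr', 1), (2, 'inpr', 1), (3, 'xxxx', 1), (4, 'nopr', 0)])

cur.execute("CREATE TABLE A (ID INTEGER PRIMARY KEY, PSTY_ID INTEGER)")
cur.executemany("INSERT INTO A VALUES (?, ?)",
                [(100, 1), (101, 2), (102, 3), (103, 5)])

# Materialize the tiny filtered slice of B first (the #tmp of the answer)
cur.execute("""
    CREATE TEMP TABLE tmp AS
    SELECT ID, CODE FROM B
    WHERE CODE IN ('nopr', 'inpr') AND ACTIVE = 1
""")

# Then join A against the temp table, which starts the plan from the small side
rows = cur.execute("""
    SELECT A.ID, t.CODE
    FROM tmp t
    JOIN A ON A.PSTY_ID = t.ID
    ORDER BY A.ID
""").fetchall()

print(rows)  # only the rows of A whose PSTY_ID is 1 or 2
```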

Order of Operation in SQL Server Query

I have the query below, which selects items and one of their features from a table of a hundred thousand items.
But I am concerned about the performance of the subquery. Will it be executed before or after the WHERE clause?
Suppose I am selecting 25 items out of 10,000: will this subquery be executed for only those 25 items, or for all 10,000?
declare @BlockStart int = 1
      , @BlockSize int = 25
;
select *, (
    select Value_Float
    from Features B
    where B.Feature_ModelID = Itm.ModelID
      and B.Feature_PropertyID = 5
) as Price
from (
    select *
         , row_number() over (order by ModelID desc) as RowNumber
    from Models
) Itm
where Itm.RowNumber >= @BlockStart
  and Itm.RowNumber < @BlockStart + @BlockSize
order by ModelID desc
The subquery in the FROM clause produces a full result set, but the subquery in the SELECT clause will (generally!) only be run for the records included in the final result set.
As with all things SQL, there is a query optimizer involved, which may at times decide to create seemingly strange execution plans. In this case I believe we can be pretty confident, but I need to caution against making sweeping generalizations about SQL language order of operations.
Moving on, have you seen the OFFSET/FETCH syntax available in SQL Server 2012 and later? It seems like a better way to handle the @BlockStart and @BlockSize values, especially as it looks like you're paging on the clustered key. (If you end up paging on an alternate column, there are much faster methods.)
Also, at the risk of making generalizations again: if you know that only one Features record exists per ModelID with Feature_PropertyID = 5, you will tend to get better performance using a JOIN:
SELECT m.*, f.Value_Float As Price
FROM Models m
LEFT JOIN Features f ON f.Feature_ModelID = m.ModelID AND f.Feature_PropertyID = 5
ORDER BY m.ModelID DESC
OFFSET @BlockStart - 1 ROWS FETCH NEXT @BlockSize ROWS ONLY -- @BlockStart is 1-based
If you can't make that guarantee, you may get better performance from an APPLY operation:
SELECT m.*, f.Value_Float As Price
FROM Models m
OUTER APPLY (
    SELECT TOP 1 Value_Float
    FROM Features f
    WHERE f.Feature_ModelID = m.ModelID AND f.Feature_PropertyID = 5
) f
ORDER BY m.ModelID DESC
OFFSET @BlockStart - 1 ROWS FETCH NEXT @BlockSize ROWS ONLY -- @BlockStart is 1-based
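The paging-plus-join variant can be sketched in SQLite through Python's sqlite3 (SQLite spells the paging as LIMIT/OFFSET rather than OFFSET ... FETCH; the tables and rows are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE Models (ModelID INTEGER PRIMARY KEY)")
cur.executemany("INSERT INTO Models VALUES (?)", [(i,) for i in range(1, 101)])

cur.execute("""CREATE TABLE Features (
    Feature_ModelID INTEGER, Feature_PropertyID INTEGER, Value_Float REAL)""")
# One price row (property 5) per model; value is ModelID * 1.5
cur.executemany("INSERT INTO Features VALUES (?, 5, ?)",
                [(i, i * 1.5) for i in range(1, 101)])

block_start, block_size = 1, 25  # 1-based page start, page size

rows = cur.execute("""
    SELECT m.ModelID, f.Value_Float AS Price
    FROM Models m
    LEFT JOIN Features f
      ON f.Feature_ModelID = m.ModelID AND f.Feature_PropertyID = 5
    ORDER BY m.ModelID DESC
    LIMIT ? OFFSET ?
""", (block_size, block_start - 1)).fetchall()

print(rows[0], len(rows))  # (100, 150.0) 25
```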
Finally, this smells like yet another variation of the Entity-Attribute-Value pattern... which, while it has its places, should typically be a pattern of last resort.

The most efficient way to check whether a specific number of rows exists in a table

I have an Oracle query, shown below, which works well:
INSERT /*+APPEND*/ INTO historical
SELECT a.* FROM TEMP_DATA a WHERE NOT EXISTS(SELECT 1 FROM historical WHERE KEY=a.KEY)
When I run EXPLAIN PLAN on this query, I notice that the optimizer chooses a HASH JOIN plan, and the cost is fairly low.
However, there is a new requirement specifying how many matching rows may already exist in the historical table for each key in TEMP_DATA, and hence the query was changed to:
INSERT /*+APPEND*/ INTO historical
SELECT a.* FROM TEMP_DATA a WHERE (SELECT COUNT(1) FROM historical WHERE KEY=a.KEY) < 2
This means that if one row already exists in the historical table for a given key (not a primary key), the data can still be inserted.
However, with this approach the query slows down a lot, with a cost more than 10 times the original. I also noticed that the optimizer now chooses a NESTED LOOPS plan.
Note that the historical table is a partitioned table with indexes.
Is there any way I can optimize this?
Thanks.
The following query should do the same thing and should be more performant:
select a.*
from temp_data a
left join (select key, count(*) cnt
           from historical
           group by key) b
       on a.key = b.key
where nvl(b.cnt, 0) < 2;
Hope it helps
An alternative to @DirkNM's answer would be:
select a.*
from temp_data a
where not exists (select null
                  from historical h
                  where h.key = a.key
                    and rownum <= 2
                  group by h.key
                  having count(*) > 1);
You would have to test with your data sets to work out which is the best solution for you.
NB: I wouldn't expect the new query (whichever one you choose) to be as performant as your original query.
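The LEFT JOIN/count variant from the first answer can be sketched in SQLite via Python's sqlite3 (IFNULL plays the role of Oracle's NVL; the data is invented): only keys with fewer than 2 existing historical rows survive the filter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE historical (key INTEGER)")
# key 1 appears twice, key 2 once, key 3 never
cur.executemany("INSERT INTO historical VALUES (?)", [(1,), (1,), (2,)])

cur.execute("CREATE TABLE temp_data (key INTEGER, payload TEXT)")
cur.executemany("INSERT INTO temp_data VALUES (?, ?)",
                [(1, 'a'), (2, 'b'), (3, 'c')])

rows = cur.execute("""
    SELECT a.key, a.payload
    FROM temp_data a
    LEFT JOIN (SELECT key, COUNT(*) AS cnt
               FROM historical
               GROUP BY key) b
      ON a.key = b.key
    WHERE IFNULL(b.cnt, 0) < 2
    ORDER BY a.key
""").fetchall()

print(rows)  # key 1 is excluded: it already has 2 historical rows
```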

Performance of nested select

I know this is a common question, and I have read several other posts and papers, but I could not find one that takes into account indexed fields and the volume of records both queries could return.
My question is really simple: which of the two queries below (written in an SQL-like syntax) is recommended in terms of performance?
First query:
Select *
from someTable s
where s.someTable_id in
(Select someTable_id
from otherTable o
where o.indexedField = 123)
Second query:
Select *
from someTable s
where someTable_id in
(Select someTable_id
from otherTable o
where o.someIndexedField = s.someIndexedField
and o.anotherIndexedField = 123)
My understanding is that the second query will hit the database for every tuple the outer query returns, whereas the first query will evaluate the inner select once and then apply the filter to the outer query.
Now, the second query may run superfast given that someIndexedField is indexed, but if we have thousands or millions of records, wouldn't the first query be faster?
Note: In an Oracle database.
In MySQL, if nested selects are over the same table, the execution time of the query can be terrible.
A good way to improve performance in MySQL is to create a temporary table for the nested select and run the main select against that table.
For example:
Select *
from someTable s1
where s1.someTable_id in
(Select someTable_id
from someTable s2
where s2.Field = 123);
Can have a better performance with:
create temporary table temp_table as (
    Select someTable_id
    from someTable s2
    where s2.Field = 123
);

Select *
from someTable s1
where s1.someTable_id in
    (Select someTable_id
     from temp_table s2);
I'm not sure about performance for a large amount of data.
About the first query:

first query will evaluate the inner select first and then apply the
filter to the outer query.

It is not that simple. In SQL it is mostly NOT possible to tell what will be executed first and what later, because SQL is a declarative language. Your "nested selects" are nested only visually, not technically.
Example 1: "someTable" has 10 rows, "otherTable" has 10,000 rows.
In most cases the database optimizer will read "someTable" first and then check "otherTable" for matches. It may or may not use indexes for that, depending on the situation; my feeling is that in this case it will use the "indexedField" index.
Example 2: "someTable" has 10,000 rows, "otherTable" has 10 rows.
In most cases the database optimizer will read all rows from "otherTable" into memory, filter them by 123, and then find matches in someTable's PK (someTable_id) index. As a result, no indexes on "otherTable" will be used.
About the second query:
It is completely different from the first, so I don't know how to compare them:
The first query links the two tables by one pair: s.someTable_id = o.someTable_id.
The second query links the two tables by two pairs: s.someTable_id = o.someTable_id AND o.someIndexedField = s.someIndexedField.
The common practice for linking two tables is your first query.
But o.someTable_id should be indexed.
So the common rules are:
all PKs should be indexed (they are indexed by default)
all columns used for filtering (as in the WHERE clause) should be indexed
all columns used to match rows between tables (including IN, JOIN, etc.) are also filters, so they should be indexed too
The DB engine chooses the best order of operations itself (possibly in parallel); in most cases you cannot determine this.
Use Oracle's EXPLAIN PLAN (similar tools exist for most DBs) to compare execution plans of different queries on real data.
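The uncorrelated IN form (the first query) can be sketched in SQLite via Python's sqlite3 (tables and values are hypothetical): an index on the filter column supports the inner select, and the primary key supports the outer match, following the rules above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE someTable (someTable_id INTEGER PRIMARY KEY, name TEXT)")
cur.executemany("INSERT INTO someTable VALUES (?, ?)",
                [(1, 'one'), (2, 'two'), (3, 'three')])

cur.execute("CREATE TABLE otherTable (someTable_id INTEGER, indexedField INTEGER)")
cur.executemany("INSERT INTO otherTable VALUES (?, ?)",
                [(1, 123), (3, 123), (2, 999)])

# Index supporting the inner filter, per the indexing rules above
cur.execute("CREATE INDEX ix_other ON otherTable (indexedField, someTable_id)")

rows = cur.execute("""
    SELECT s.name
    FROM someTable s
    WHERE s.someTable_id IN
          (SELECT someTable_id FROM otherTable o WHERE o.indexedField = 123)
    ORDER BY s.someTable_id
""").fetchall()

print(rows)  # only the rows whose id matches the inner select
```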
When I used directly
where not exists (select VAL_ID FROM #newVals WHERE VAL_ID = OLDPAR.VAL_ID)
it cost 20 seconds. When I added the temp table, it cost ~0 seconds. I don't understand why; as a C++ developer, I imagine that internally there is a loop over the values.
-- Temp table for the index gives me a big speedup
declare @newValID table (VAL_ID int INDEX IX1 CLUSTERED);
insert into @newValID select VAL_ID FROM #newVals

insert into #deleteValues
select OLDPAR.VAL_ID
from #oldVal AS OLDPAR
where
    not exists (select VAL_ID from @newValID where VAL_ID = OLDPAR.VAL_ID)
    or exists (select VAL_ID from #VaIdInternals where VAL_ID = OLDPAR.VAL_ID);
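The anti-join pattern above can be sketched in SQLite via Python's sqlite3 (table names shortened, data invented); an indexed lookup table turns each NOT EXISTS probe into an index seek rather than a scan:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

cur.execute("CREATE TABLE old_vals (VAL_ID INTEGER)")
cur.executemany("INSERT INTO old_vals VALUES (?)", [(1,), (2,), (3,), (4,)])

# Indexed lookup table standing in for the @newValID table variable
cur.execute("CREATE TABLE new_vals (VAL_ID INTEGER PRIMARY KEY)")
cur.executemany("INSERT INTO new_vals VALUES (?)", [(2,), (4,)])

# Anti-join: old values with no matching new value are candidates for deletion
to_delete = cur.execute("""
    SELECT o.VAL_ID
    FROM old_vals o
    WHERE NOT EXISTS (SELECT 1 FROM new_vals n WHERE n.VAL_ID = o.VAL_ID)
    ORDER BY o.VAL_ID
""").fetchall()

print(to_delete)  # ids with no counterpart in new_vals
```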