CROSS APPLY query very slow when additional column added - sql

I have a CROSS APPLY query which executes very quickly (1 second). However, if I add certain additional columns to the top SELECT, the query runs very slowly (many minutes). I can't see what is causing this.
SELECT
cs.show_title, im.primaryTitle
FROM
captive_state cs
CROSS APPLY
(SELECT TOP 1
imdb.tconst, imdb.titleType, imdb.primaryTitle,
imdb.genres, imdb.genre1, imdb.genre2, imdb.genre3
FROM
imdb_data imdb
WHERE
(imdb.primaryTitle LIKE cs.show_title+'%')
AND (imdb.titleType like 'tv%' OR imdb.titleType = 'movie')
ORDER BY
imdb.titleType, imdb.tconst DESC) AS im
WHERE
cs.genre1 IS NULL
I've tried adding/removing various columns and only when adding the 'genre' fields - e.g. genre2 (varchar(50)) - does the slowness occur. For example,
SELECT cs.show_title, im.primaryTitle, im.genre2
I would expect the query to basically have the same performance whether adding one additional column or not.
Here are the query plans without the extra column, and with.
The first table (cs) has a primary key index and an index on genre1. The second table (imdb) has a primary key index and an index on primaryTitle.
I'm not sure if those would cause any problems though.
Thanks for any suggestions.

In your second screenshot, you're performing an Index Scan on the primary key for imdb_data. This is essentially scanning the table as if there is no index.
You have two options. Either change your query to use the indexed columns of imdb_data or create a new index to cover this query.
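For example, a covering index along these lines might let the inner query seek on primaryTitle and return the genre columns without touching the clustered index. This is only a sketch based on the columns visible in the query; the index name is made up and the key order should be adjusted to your workload:
CREATE NONCLUSTERED INDEX IX_imdb_data_primaryTitle_covering
ON imdb_data (primaryTitle, titleType)
INCLUDE (tconst, genres, genre1, genre2, genre3);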

Maybe switch to an alternative for the topped CROSS APPLY
SELECT TOP 1 WITH TIES
cs.show_title,
imdb.tconst, imdb.titleType, imdb.primaryTitle,
imdb.genres, imdb.genre1, imdb.genre2, imdb.genre3
FROM captive_state cs
JOIN imdb_data imdb
ON imdb.primaryTitle LIKE cs.show_title+'%'
AND (imdb.titleType = 'movie' OR imdb.titleType LIKE 'tv%')
WHERE cs.genre1 IS NULL
ORDER BY ROW_NUMBER() OVER (PARTITION BY cs.show_title ORDER BY imdb.titleType, imdb.tconst DESC)

You could add the genre columns as included columns to the existing index on [imdb_data] (the index name, assumed here to be [idx_primary_table], is not readable from the screenshot):
CREATE INDEX [idx_primary_table] ON [imdb_data] (same cols as in original)
INCLUDE (genre1, genre2, genre3) WITH (DROP_EXISTING=ON)

Try to use "join" with "row_number()" instead of "apply"
select
dat.primaryTitle
,dat.show_title
from (
select
imdb.primaryTitle
,cs.show_title
,row_number() over (partition by cs.show_title order by imdb.titleType, imdb.tconst DESC) as rn
from imdb_data imdb
inner join captive_state cs on imdb.primaryTitle LIKE cs.show_title+'%'
where (imdb.titleType like 'tv%' OR imdb.titleType = 'movie')
and cs.genre1 IS NULL
) dat
where dat.rn = 1

Related

Is there something I can change to make my view run faster?

I just created a view but it is really slow, since my actual table has something around 800k rows.
Is there something I can change in the actual sql code to make it run faster?
Here is how it looks now:
Select B.*
FROM
(Select A.*, (select count(B.KEY_ID)/77
FROM book_new B
where B.KEY_ID = A.KEY_ID) as COUNT_KEY
FROM
(select *
from book_new
where region = 'US'
and (actual_release_date is null or
actual_release_date >= To_Date( '01/07/16','dd/mm/yy'))
) A
) B
WHERE B.COUNT_KEY = 1
OR (B.COUNT_KEY > 1 AND B.NEW_OLD <> 'Old')
The most obvious things to do are add indexes:
Add an index on book_new(key_id)
Add an index on book_new(region, actual_release_date)
These are probably sufficient. It is possible that rewriting the query would help, but this is a good beginning. If you want to rewrite the query, it would help if you described the logic you are trying to implement.
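For reference, the two suggested indexes would look roughly like this (index names are illustrative):
CREATE INDEX idx_book_new_key_id ON book_new (key_id);
CREATE INDEX idx_book_new_region_reldate ON book_new (region, actual_release_date);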
There are many ways to solve this issue, based on your needs:
You can create an indexed view.
You can create indexes on the base tables used in this view.
You can select only the required columns instead of using SELECT *.
If the table contains many columns but you only require a few, you can create a NONCLUSTERED INDEX with the INCLUDE columns option, which will reduce the logical reads.
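As a rough sketch of that last point, in SQL Server syntax (INCLUDE columns are a SQL Server feature; the index name is made up and the column list is taken from the query above, so adapt it to your actual engine and schema):
CREATE NONCLUSTERED INDEX IX_book_new_region_reldate
ON book_new (region, actual_release_date)
INCLUDE (key_id, new_old);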
For starters, replace the scalar subquery for COUNT_KEY with a windowed COUNT(*).
SELECT * FROM
(
select book_new.*, COUNT(*) OVER ( PARTITION BY book_new.key_id)/77 COUNT_KEY
from book_new
where region = 'US'
and (actual_release_date is null or
actual_release_date >= To_Date( '01/07/16','dd/mm/yy'))
)
WHERE count_key = 1 OR ( count_key > 1 AND new_old <> 'Old' )
This way, you only go through the BOOK_NEW table one time.
BTW, I agree with other comments that this query makes little sense.

Ensuring two columns only contain valid results from same subquery

I have the following table:
id  symbol_01  symbol_02
 1  abc        xyz
 2  kjh        okd
 3  que        qid
I need a query that ensures symbol_01 and symbol_02 are both contained in a list of valid symbols. In other words I would need something like this:
select *
from mytable
where symbol_01 in (
select valid_symbols
from somewhere)
and symbol_02 in (
select valid_symbols
from somewhere)
The above example would work correctly, but the subquery used to determine the list of valid symbols is identical both times and is quite large. It would be very inefficient to run it twice as in the example.
Is there a way to do this without duplicating two identical subqueries?
Another approach:
select *
from mytable t1
where 2 = (select count(distinct symbol)
from valid_symbols vs
where vs.symbol in (t1.symbol_01, t1.symbol_02));
This assumes that the valid symbols are stored in a table valid_symbols that has a column named symbol. The query would also benefit from an index on valid_symbols.symbol.
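For example (the index name is illustrative):
CREATE INDEX idx_valid_symbols_symbol ON valid_symbols (symbol);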
You could try using a CTE, like this:
WITH ValidSymbols AS (
SELECT DISTINCT valid_symbol
FROM somewhere
)
SELECT mt.*
FROM MyTable mt
INNER JOIN ValidSymbols v1
ON mt.symbol_01 = v1.valid_symbol
INNER JOIN ValidSymbols v2
ON mt.symbol_02 = v2.valid_symbol
From a performance perspective, your query is the right way to do this. I would write it as:
select *
from mytable t
where exists (select 1
from valid_symbols vs
where t.symbol_01 = vs.valid_symbol
) and
exists (select 1
from valid_symbols vs
where t.symbol_02 = vs.valid_symbol
) ;
The important component is that you need an index on valid_symbols(valid_symbol). With this index, the lookup should be pretty fast. Appropriate indexes can even work if valid_symbols is a view, although the effect depends on the complexity of the view.
You seem to have a situation where you have two foreign key relationships. If you explicitly declare these relationships, then the database will enforce that the columns in your table match the valid symbols.
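If your engine supports it, declaring those relationships might look roughly like this (constraint names are illustrative, and it assumes valid_symbols.valid_symbol can be made a primary key):
ALTER TABLE valid_symbols ADD CONSTRAINT pk_valid_symbols PRIMARY KEY (valid_symbol);
ALTER TABLE mytable ADD CONSTRAINT fk_mytable_symbol_01
    FOREIGN KEY (symbol_01) REFERENCES valid_symbols (valid_symbol);
ALTER TABLE mytable ADD CONSTRAINT fk_mytable_symbol_02
    FOREIGN KEY (symbol_02) REFERENCES valid_symbols (valid_symbol);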

PostgreSQL: How to select on non-aggregating column?

Seems like a simple question but I'm having trouble accomplishing it. What I want to do is return all names that have duplicate ids. The view looks as such:
id | name | other_col
---+--------+----------
1 | James | x
2 | John | x
2 | David | x
3 | Emily | x
4 | Cameron| x
4 | Thomas | x
And so in this case, I'd just want the result:
name
-------
John
David
Cameron
Thomas
The following query works but it seems like overkill to have two separate selects:
select name
from view where id = ANY(select id from view
WHERE other_col='x'
group by id
having count(id) > 1)
and other_col='x';
I believe it should be possible to do something along the lines of:
select name from view WHERE other_col='x' group by id, name having count(id) > 1;
But this returns nothing at all! What is the 'proper' query?
Do I just have to do it like my first working suggestion, or is there a better way?
You state you want to avoid two "queries", which isn't really possible. There are plenty of solutions available, but I would use a CTE like so:
WITH cte AS
(
SELECT
id,
name,
other_col,
COUNT(name) OVER(PARTITION BY id) AS id_count
FROM
table
)
SELECT name FROM cte WHERE id_count > 1;
You can reuse the CTE, so you don't have to duplicate logic and I personally find it easier to read and understand what it is doing.
SELECT name FROM Table
WHERE id IN (SELECT id FROM Table GROUP BY id HAVING COUNT(*) > 1)
Use the EXISTS operator:
SELECT * FROM table t1
WHERE EXISTS(
SELECT null FROM table t2
WHERE t1.id = t2.id
AND t1.name <> t2.name
)
Use a join:
select distinct name
from view v1
join view v2 on v1.id = v2.id
and v1.name != v2.name
The use of distinct is there in case there are more than 2 rows sharing the same id. If that's not possible, you can omit distinct.
A note: Naming a column id when it's not unique will likely cause confusion, because it's the industry standard for the unique identifier column. If there isn't a unique column at all, it will cause coding difficulties.
Do not use a CTE. That's typically more expensive because Postgres has to materialize the intermediary result.
An EXISTS semi-join is typically fastest for this. Just make sure to repeat predicates (or match the values):
SELECT name
FROM view v
WHERE other_col = 'x'
AND EXISTS (
SELECT 1 FROM view
WHERE other_col = 'x' -- or: other_col = v.other_col
AND id <> v.id -- exclude join to self
);
That's a single query, even if you see the keyword SELECT twice here. An EXISTS expression does not produce a derived table, it will be resolved to simple index look-ups.
Speaking of which: a multicolumn index on (other_col, id) should help. Depending on data distribution and access patterns, appending the payload column name to enable index-only scans might help: (other_col, id, name). Or even a partial index, if other_col = 'x' is a constant predicate:
CREATE INDEX ON view (id) WHERE other_col = 'x';
Related: PostgreSQL does not use a partial index
The upcoming Postgres 9.6 would even allow an index-only scan on the partial index:
CREATE INDEX ON view (id, name) WHERE other_col = 'x';
You will love this improvement (quoting the /devel manual):
Allow using an index-only scan with a partial index when the index's
predicate involves column(s) not stored in the index (Tomas Vondra,
Kyotaro Horiguchi)
An index-only scan is now allowed if the query mentions such columns
only in WHERE clauses that match the index predicate
Verify performance with EXPLAIN (ANALYZE, TIMING OFF) SELECT ...
Run a couple of times to rule out caching effects.

How can I stop joins from adding rows in my match query?

I'm having difficulty translating what I want into functional programming, since I think imperatively. Basically, I have a table of forms, and a table of expectations. In the Expectation view, I want it to look through the forms table and tell me if each one found a match. However, when I try to use joins to accomplish this, the joins are adding rows to the Expectation table when two or more forms match. I do not want this.
In an imperative fashion, I want the equivalent of this:
ForEach (row in Expectation table)
{
if (any form in the Form table matches the criteria)
{
MatchID = form.ID;
SignDate = form.SignDate;
...
}
}
What I have in SQL is this:
SELECT
e.*, match.ID, match.SignDate, ...
FROM
POFDExpectation e LEFT OUTER JOIN
(SELECT MIN(ID) as MatchID, MIN(SignDate) as MatchSignDate,
COUNT(*) as MatchCount, ...
FROM Form f
GROUP BY (matching criteria columns)
) match
ON (form.[match criteria] = expectation.[match criteria])
Which works okay, but very slowly, and every time there are TWO matches, a row is added to the Expectation results. Mathematically I understand that a join is a cross multiply and this is expected, but I'm unsure how to do this without them. Subquery perhaps?
I'm not able to give too many further details about the implementation, but I'll be happy to try any suggestion and respond with the results. I have 880 Expectation rows, and 942 results being returned. If I only allow results that match one form, I get 831 results. Neither are desirable, so if yours gets me to exactly 880, yours is the accepted answer.
Edit: I am using SQL Server 2008 R2, though a generic solution would be best.
Sample code:
--DROP VIEW ExpectationView; DROP TABLE Forms; DROP TABLE Expectations;
--Create Tables and View
CREATE TABLE Forms (ID int IDENTITY(1,1) PRIMARY KEY, ReportYear int, Name varchar(100), Complete bit, SignDate datetime)
GO
CREATE TABLE Expectations (ID int IDENTITY(1,1) PRIMARY KEY, ReportYear int, Name varchar(100))
GO
CREATE VIEW ExpectationView AS
SELECT
    e.*, filed.MatchID, filed.SignDate,
    ISNULL(filed.FiledCount, 0) as FiledCount,
    ISNULL(name.NameCount, 0) as NameCount
FROM Expectations e
LEFT OUTER JOIN
    (select MIN(ID) as MatchID, ReportYear, Name, Complete, Min(SignDate) as SignDate, COUNT(*) as FiledCount
     from Forms f GROUP BY ReportYear, Name, Complete) filed
    on filed.ReportYear = e.ReportYear AND filed.Name like '%'+e.Name+'%' AND filed.Complete = 1
LEFT OUTER JOIN
    (select MIN(ID) as MatchID, ReportYear, Name, COUNT(*) as NameCount
     from Forms f GROUP BY ReportYear, Name) name
    on name.ReportYear = e.ReportYear AND name.Name like '%'+e.Name+'%'
GO
--Insert Text Data
INSERT INTO Forms (ReportYear, Name, Complete, SignDate)
SELECT 2011, 'Bob Smith', 1, '2012-03-01' UNION ALL
SELECT 2011, 'Bob Jones', 1, '2012-10-04' UNION ALL
SELECT 2011, 'Bob', 1, '2012-07-20'
GO
INSERT INTO Expectations (ReportYear, Name)
SELECT 2011, 'Bob'
GO
SELECT * FROM ExpectationView --Should only return 1 result, returns 9
The 'filed' join shows that they have completed a form; 'name' shows that they may have started one but not finished it. My view has four different 'match criteria', each a little more strict, and counts each: 'Name Only Matches', 'Loose Matches', 'Matches' (the default), and 'Tight Matches' (used if there is more than one default match).
This is how I do it when I want to keep to a JOIN-type query format:
SELECT
e.*,
match.ID,
match.SignDate,
...
FROM POFDExpectation e
OUTER APPLY (
SELECT TOP 1
MIN(ID) as MatchID,
MIN(SignDate) as MatchSignDate,
COUNT(*) as MatchCount,
...
FROM Form f
WHERE f.[match criteria] = e.[match criteria]
GROUP BY (matching criteria columns)
-- Add ORDER BY here to control which row is TOP 1
) match
It usually performs better as well.
Semantically, {CROSS|OUTER} APPLY (table-expression) specifies a table-expression that is called once for each row in the preceding table expressions of the FROM clause and then joined to them. Pragmatically, however, the compiler treats it almost identically to a JOIN.
The practical difference is that unlike a JOIN table-expression, the APPLY table-expression is dynamically re-evaluated for each row. So instead of an ON clause, it relies on its own logic and WHERE clauses to limit/match its rows to the preceding table-expressions. This also allows it to make reference to the column-values of the preceding table-expressions, inside its own internal subquery expression. (This is not possible in a JOIN)
The reason that we want this here, instead of a JOIN, is that we need a TOP 1 in the sub-query to limit its returned rows, however, that means that we need to move the ON clause conditions to the internal WHERE clause so that it will get applied before the TOP 1 is evaluated. And that means that we need an APPLY here, instead of the more usual JOIN.
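For a concrete picture, here is a rough sketch of that pattern against the sample Forms/Expectations tables from the question. The WHERE clause only approximates the real match criteria and the ORDER BY choice is arbitrary; a COUNT(*) OVER () is used instead of GROUP BY so the sketch stays runnable as written:
SELECT
    e.*,
    match.MatchID,
    match.MatchSignDate,
    ISNULL(match.MatchCount, 0) as MatchCount
FROM Expectations e
OUTER APPLY (
    SELECT TOP 1
        f.ID as MatchID,
        f.SignDate as MatchSignDate,
        COUNT(*) OVER () as MatchCount
    FROM Forms f
    WHERE f.ReportYear = e.ReportYear
      AND f.Name like '%'+e.Name+'%'
      AND f.Complete = 1
    ORDER BY f.SignDate
) match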
@RBarryYoung answered the question as I asked it, but there was a second question that I didn't make very clear. What I really wanted was a combination of his answer and this question, so for the record here's what I used:
SELECT
e.*,
...
match.ID,
match.SignDate,
match.MatchCount
FROM
POFDExpectation e
OUTER APPLY (
SELECT TOP 1
ID as MatchID,
ReportYear,
...
SignDate as MatchSignDate,
COUNT(*) OVER () as MatchCount
FROM
Form f
WHERE
f.[match criteria] = e.[match criteria]
-- Add ORDER BY here to control which row is TOP 1
) match

sql server-query optimization with many columns

we have "Profile" table with over 60 columns like (Id, fname, lname, gender, profilestate, city, state, degree, ...).
users search other peopel on website. query is like :
WITH TempResult as (
select ROW_NUMBER() OVER(ORDER BY #sortColumn DESC) as RowNum, profile.id from Profile
where
(#a is null or a = #a) and
(#b is null or b = #b) and
...(over 60 column)
)
SELECT profile.* FROM TempResult join profile on TempResult.id = profile.id
WHERE
(RowNum >= #FirstRow)
AND
(RowNum <= #LastRow)
SQL Server by default uses the clustered index to execute the query, but total execution time is over 300. We tested other solutions, such as a multi-column index covering all the columns in the WHERE clause, but total execution time was over 400.
Do you have any solution to make the total execution time lower than 100?
We are using SQL Server 2008.
Unfortunately I don't think there is a pure SQL solution to your issue. Here are a couple alternatives:
Dynamic SQL - build up a query that only includes WHERE clause statements for values that are actually provided (a rough sketch follows after this list). Assuming the average search actually only fills in 2-3 fields, indexes could be added and utilized.
Full Text Search - go to something more like a Google keyword search. No individual options.
Lucene (or something else) - Search outside of SQL; This is a fairly significant change though.
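For the first (dynamic SQL) option above, the shape might be roughly this. It is only a sketch: it assumes the code lives in a stored procedure whose parameters @sortColumn, @a, @b (and so on) correspond to the #-placeholders in the question, and the parameter types are guesses:
DECLARE @sql nvarchar(max) =
    N'SELECT ROW_NUMBER() OVER (ORDER BY ' + QUOTENAME(@sortColumn) + N' DESC) as RowNum, id
      FROM Profile
      WHERE 1 = 1';
IF @a IS NOT NULL SET @sql += N' AND a = @a';
IF @b IS NOT NULL SET @sql += N' AND b = @b';
-- ...one IF per searchable column...
EXEC sp_executesql @sql, N'@a int, @b int', @a = @a, @b = @b; -- parameter types are assumptions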
One other option that I just remembered implementing in a system once. Create a vertical table that includes all of the data you are searching on and build up a query for it. This is easiest to do with dynamic SQL, but could be done using Table Value Parameters or a temp table in a pinch.
The idea is to make a table that looks something like this:
Profile ID
Attribute Name
Attribute Value
The table should have a unique index on (Profile ID, Attribute Name) (unique to make the search work properly, index will make it perform well).
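A minimal sketch of that table and index (names and column types are illustrative):
CREATE TABLE ProfileAttributes (
    ProfileID      int          NOT NULL,
    AttributeName  varchar(50)  NOT NULL,
    AttributeValue varchar(200) NOT NULL
);
CREATE UNIQUE INDEX UX_ProfileAttributes_ProfileID_AttributeName
    ON ProfileAttributes (ProfileID, AttributeName);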
In this table you'd have rows of data like:
(1, 'city', 'grand rapids')
(1, 'state', 'MI')
(2, 'city', 'detroit')
(2, 'state', 'MI')
Then your SQL will be something like:
SELECT *
FROM Profile
JOIN (
SELECT ProfileID
FROM ProfileAttributes
WHERE (AttributeName = 'city' AND AttributeValue = 'grand rapids')
   OR (AttributeName = 'state' AND AttributeValue = 'MI')
GROUP BY ProfileID
HAVING COUNT(*) = 2
) SelectedProfiles ON Profile.ProfileID = SelectedProfiles.ProfileID
... -- Add your paging here
Like I said, you could use a temp table that has attribute name/values:
SELECT *
FROM Profile
JOIN (
SELECT ProfileID
FROM ProfileAttributes
JOIN PassedInAttributeTable ON ProfileAttributes.AttributeName = PassedInAttributeTable.AttributeName
AND ProfileAttributes.AttributeValue = PassedInAttributeTable.AttributeValue
GROUP BY ProfileID
HAVING COUNT(*) = CountOfRowsInPassedInAttributeTable -- calculate or pass in
) SelectedProfiles ON Profile.ProfileID = SelectedProfiles.ProfileID
... -- Add your paging here
As I recall, this ended up performing very well, even on fairly complicated queries (though I think we only had 12 or so columns).
As a single query, I can't think of a clever way of optimising this.
Provided that each column's check is highly selective, however, the following (very long-winded) code might prove faster, assuming each individual column has its own separate index...
WITH
filter AS (
SELECT
[a].*
FROM
(SELECT * FROM Profile WHERE #a IS NULL OR a = #a) AS [a]
INNER JOIN
(SELECT id FROM Profile WHERE b = #b UNION ALL SELECT NULL WHERE #b IS NULL) AS [b]
ON ([a].id = [b].id) OR ([b].id IS NULL)
INNER JOIN
(SELECT id FROM Profile WHERE c = #c UNION ALL SELECT NULL WHERE #c IS NULL) AS [c]
ON ([a].id = [c].id) OR ([c].id IS NULL)
.
.
.
INNER JOIN
(SELECT id FROM Profile WHERE zz = #zz UNION ALL SELECT NULL WHERE #zz IS NULL) AS [zz]
ON ([a].id = [zz].id) OR ([zz].id IS NULL)
)
, TempResult as (
SELECT
ROW_NUMBER() OVER(ORDER BY #sortColumn DESC) as RowNum,
[filter].*
FROM
[filter]
)
SELECT
*
FROM
TempResult
WHERE
(RowNum >= #FirstRow)
AND (RowNum <= #LastRow)
EDIT
Also, thinking about it, you may even get the same result just by having the 60 individual indexes. SQL Server can do INDEX MERGING...
You have several issues, imho. One is that you're going to end up with a seq scan no matter what you do.
But I think your more crucial issue here is that you have an unnecessary join:
SELECT profile.* FROM TempResult
WHERE
(RowNum >= #FirstRow)
AND
(RowNum <= #LastRow)
This is a classic "SQL Filter" query problem. I've found that the typical approaches of "(#b is null or b = #b)" & it's common derivatives all yeild mediocre performance. The OR clause tends to be the cause.
Over the years I've done a lot of Perf/Tuning & Query Optimisation. The Approach I've found best is to generate Dynamic SQL inside a Stored Proc. Most times you also need to add "with Recompile" on the statement. The Stored Proc helps reduce potential for SQL injection attacks. The Recompile is needed to force the selection of indexes appropriate to the parameters you are searching on.
Generally it is at least an order of magnitude faster.
I agree you should also look at points mentioned above like :-
If you commonly only refer to a small subset of the columns, you could create non-clustered "covering" indexes.
Highly selective columns (i.e. those with many unique values) will work best if they are the lead column in the index.
If many columns have a very small number of values, consider using the BIT datatype, or create your own bitmasked BIGINT to represent many columns, i.e. a form of "enumerated datatype" (a rough sketch appears at the end of this answer). But be careful, as any function in the WHERE clause (like MOD or bitwise AND/OR) will prevent the optimiser from choosing an index. It works best if you know the value for each and can combine them into an equality or range query.
While it is often good to find row IDs with a small query and then join to get all the other columns you want to retrieve (as you are doing above), this approach can sometimes backfire. If the first part of the query does a clustered index scan, then it is often faster to get the other columns you need in the select list and save the second table scan.
So it is always good to try it both ways and see what works best.
Remember to run SET STATISTICS IO ON and SET STATISTICS TIME ON before running your tests. Then you can see where the IO is, and it may help you with index selection for the most frequent combination of parameters.
I hope this makes sense without long code samples. (it is on my other machine)
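As a rough illustration of the bitmask idea mentioned above (the bit assignments, the encodings, and the precomputed flags column are all assumptions, not part of the original schema):
-- Purely illustrative encoding:
--   bit 0    : gender       (0 = male, 1 = female)
--   bits 1-2 : profilestate (0-3)
DECLARE @search_flags bigint =
      1          -- gender = female
    + 1 * 2;     -- profilestate = 1
-- Because the whole combination is folded into one value, the predicate stays a
-- plain equality and a single index on the flags column can be used:
SELECT id
FROM Profile
WHERE flags = @search_flags; -- 'flags' is a hypothetical precomputed BIGINT column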