Bad performance when not selecting a specific column from a view - sql

Using SQL Server 2016 SP1. I have a view Users that goes like
SELECT
ROW_NUMBER() OVER (ORDER BY ID) AS DataModelID, *
FROM
(Some query) AS tbl
I then select from it
SELECT
U1.ID UserId, U1.IdentityNumber IdentityNumber,
U1.ArabicFirstName, U1.ArabicSecondName
FROM
USERS U1
LEFT JOIN
USERS U2 ON U1.IdentityNumber = U2.IdentityNumber
AND U1.ID <> U2.ID
AND U1.RoleId = 2
WHERE
U2.ID IS NOT NULL
AND U1.IdentityNumber <> ''
AND PATINDEX('[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]', U1.IdentityNumber) = 1
The thing here is that, with the above query, when I select * or include the DataModelID column it runs in 3 seconds, but when I select any columns without that one it runs for more than 2 minutes.
Why is this happening, running faster when including a column?
I tried everything to clear the cache, and ran it multiple times with the same results.

Without seeing the actual execution plan there is no way to say for sure, but, as @mvisser mentioned, the likely cause is that the optimizer is choosing a better index when you do a
SELECT * or include the DataModelID column than when you don't. There are a number of solutions here. One suggestion would be to look at the execution plan for the queries that run in 3 seconds, note what index is being used, and use an index hint (see section G) to force the optimizer to use that index in the queries that don't reference those columns. I would not suggest this though - there are too many unanswered variables to consider it a viable option.
Here's what I recommend:
First, as @Lukasz Szozda mentioned, this is not SARGable:
AND PATINDEX('[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]', U1.IdentityNumber) = 1
But this is:
U1.IdentityNumber LIKE '[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]'
So I'd fix that first. Next, the fastest, most sure-fire way to resolve this is to simply include DataModelID in your queries even if you don't need it. You can either filter that column out at the application level, or create a stored proc that populates a temp table and then, for the final result set, retrieve your results from that temp table excluding DataModelID.
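A minimal sketch of that temp-table stored proc, reusing the query from the question (the proc name GetDuplicateIdentityUsers is made up for illustration):

CREATE PROC dbo.GetDuplicateIdentityUsers
AS
BEGIN
    -- include DataModelID so the optimizer picks the fast plan
    SELECT U1.ID UserId, U1.IdentityNumber, U1.ArabicFirstName,
           U1.ArabicSecondName, U1.DataModelID
    INTO #Results
    FROM USERS U1
    LEFT JOIN USERS U2 ON U1.IdentityNumber = U2.IdentityNumber
        AND U1.ID <> U2.ID
        AND U1.RoleId = 2
    WHERE U2.ID IS NOT NULL
      AND U1.IdentityNumber <> ''
      AND U1.IdentityNumber LIKE '[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]';

    -- final result set: everything except DataModelID
    SELECT UserId, IdentityNumber, ArabicFirstName, ArabicSecondName
    FROM #Results;
END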
OPTION #2
You can create an indexed view on your USERS table that looks something like this:
CREATE VIEW dbo.vwUSERS_clean
WITH SCHEMABINDING AS
SELECT U1.ID, U1.IdentityNumber,
U1.ArabicFirstName, U1.ArabicSecondName, U1.RoleId
FROM dbo.USERS U1
WHERE U1.IdentityNumber <> ''
AND U1.IdentityNumber LIKE '[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]';
GO
Then create a unique, clustered index on it. Next you would change the query that you posted to reference your indexed view (e.g. change both references to USERS to dbo.vwUSERS_clean WITH (NOEXPAND)).
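For example (the index name is illustrative):

CREATE UNIQUE CLUSTERED INDEX CIX_vwUSERS_clean ON dbo.vwUSERS_clean (ID);

SELECT U1.ID UserId, U1.IdentityNumber, U1.ArabicFirstName, U1.ArabicSecondName
FROM dbo.vwUSERS_clean U1 WITH (NOEXPAND)
LEFT JOIN dbo.vwUSERS_clean U2 WITH (NOEXPAND)
    ON U1.IdentityNumber = U2.IdentityNumber
    AND U1.ID <> U2.ID
    AND U1.RoleId = 2
WHERE U2.ID IS NOT NULL;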
Note that ROW_NUMBER is not allowed in indexed views but, if you make ID your clustered index (or the first column in a composite clustered index), there will be no sort cost to adding ROW_NUMBER() OVER (ORDER BY ID) to queries that reference that indexed view.

Related

sql, query optimisation with an inner join?

I'm trying to optimise my query, it has an inner join and coalesce.
The join table is simply a table with one integer field; I've added a unique key.
For my where clause I've created a key on the three fields.
But when I look at the plan it still says it's using a table scan.
Where am I going wrong?
Here's my query
select date(a.startdate, '+'||(b.n*a.interval)||' '||a.intervaltype) as due
from billsndeposits a
inner join util_nums b
  on date(a.startdate, '+'||(b.n*a.interval)||' '||a.intervaltype)
     <= coalesce(a.enddate, date('2013-02-26'))
where not (intervaltype = 'once' or interval = 0)
  and factid = 1
order by due, pid;
Most likely your JOIN expression cannot use any index, so it is evaluated with a full table scan, computing date(a.startdate, '+'||(b.n*a.interval)||' '||a.intervaltype) for every row.
BTW: That is a really weird join condition in itself. I suggest you find a better way to join billsndeposits to util_nums (if that is actually needed).
I think I understand what you are trying to achieve. But this kind of join is a recipe for slow performance. Even if you remove date computations and the coalesce (i.e. compare one date against another), it will still be slow (compared to integer joins) even with an index. And because you are creating new dates on the fly you cannot index them.
I suggest creating a temp table with two columns: (1) pid (or whatever id you use in billsndeposits) and (2) recurrence_dt.
Populate the new table using this query:
INSERT INTO temp (pid, recurrence_dt)
SELECT a.pid, date(a.startdate, '+'||(b.n*a.interval)||' '||a.intervaltype)
FROM billsndeposits a, util_nums b;
Then create an index on the recurrence_dt column and update statistics (runstats). Now your select statement can look like this:
SELECT recurrence_dt
FROM temp t, billsndeposits a
WHERE t.pid = a.pid
AND recurrence_dt <= coalesce(a.enddate, date('2013-02-26'))
You can add an exp_ts column to this new table and expire temporary data afterwards.
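A sketch of the supporting index and cleanup, using the names above (SQLite syntax, to match the query):

CREATE INDEX idx_temp_recurrence ON temp (recurrence_dt);
ANALYZE;  -- refresh optimizer statistics (the "runstats" step)

-- with an exp_ts column, expired rows can be purged periodically:
DELETE FROM temp WHERE exp_ts < date('now');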
I know this adds more work to your original query, but this should be a solid performance improvement, and it fits naturally in a script that runs frequently.
Edit
Another thing I would do, is make enddate default value = date('2013-02-26'), unless it will affect other code and/or does not make business sense. This way you don't have to work with coalesce.
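If changing the column default isn't practical, a one-off backfill achieves the same effect (a sketch, assuming no other code relies on enddate being NULL):

UPDATE billsndeposits
SET enddate = date('2013-02-26')
WHERE enddate IS NULL;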

How do I go about optimizing an Oracle query?

I was given a SQL query, saying that I have to optimize this query.
I came across explain plan. So, in SQL Developer, I ran EXPLAIN PLAN FOR the query.
It divided the query into different parts and showed the cost for each of them.
How do I go about optimizing the query? What do I look for? Elements with high costs?
I am a bit new to DB, so if you need more information, please ask me, and I will try to get it.
I am trying to understand the process rather than just posting the query itself and getting the answer.
The query in question:
SELECT cr.client_app_id,
cr.personal_flg,
r.requestor_type_id
FROM credit_request cr,
requestor r,
evaluator e
WHERE cr.evaluator_id = 96 AND
cr.request_id = r.request_id AND
cr.evaluator_id = e.evaluator_id AND
cr.request_id != 143462 AND
((r.soc_sec_num_txt = 'xxxxxxxxx' AND
r.soc_sec_num_txt IS NOT NULL) OR
(lower(r.first_name_txt) = 'test' AND
lower(r.last_name_txt) = 'newprogram' AND
to_char(r.birth_dt, 'MM/DD/YYYY') = '01/02/1960' AND
r.last_name_txt IS NOT NULL AND
r.first_name_txt IS NOT NULL AND
r.birth_dt IS NOT NULL))
On running explain plan, this is the output (in place of the screenshot):
OPERATION  OBJECT_NAME  OPTIONS  COST
SELECT STATEMENT                                                   15
  NESTED LOOPS
    NESTED LOOPS                                                   15
      HASH JOIN                                                    12
        Access Predicates
          CR.EVALUATOR_ID=E.EVALUATOR_ID
        INDEX  EVALUATOR_PK  UNIQUE SCAN                           0
          Access Predicates
            E.EVALUATOR_ID=96
        TABLE ACCESS  CREDIT_REQUEST  BY INDEX ROWID               11
          INDEX  CRDRQ_DONE_EVAL_TASK_REQ_NDX  SKIP SCAN           10
            Access Predicates
              CR.EVALUATOR_ID=96
            Filter Predicates
              AND
                CR.EVALUATOR_ID=96
                CR.REQUEST_ID<>143462
      INDEX  REQUESTOR_PK  RANGE SCAN                              1
        Access Predicates
          CR.REQUEST_ID=R.REQUEST_ID
        Filter Predicates
          R.REQUEST_ID<>143462
    TABLE ACCESS  REQUESTOR  BY INDEX ROWID                        3
      Filter Predicates
        OR
          R.SOC_SEC_NUM_TXT='XXXXXXXX'
          AND
            R.BIRTH_DT IS NOT NULL
            R.LAST_NAME_TXT IS NOT NULL
            R.FIRST_NAME_TXT IS NOT NULL
            LOWER(R.FIRST_NAME_TXT)='test'
            LOWER(R.LAST_NAME_TXT)='newprogram'
            TO_CHAR(INTERNAL_FUNCTION(R.BIRTH_DT),'MM/DD/YYYY')='01/02/1960'
As a quick update to your query, you're going to want to refactor it to something like this:
SELECT
cr.client_app_id,
cr.personal_flg,
r.requestor_type_id
FROM
credit_request cr
inner join requestor r on
cr.request_id = r.request_id
inner join evaluator e on
cr.evaluator_id = e.evaluator_id
WHERE
cr.evaluator_id = 96
and cr.request_id != 143462
and (r.soc_sec_num_txt = 'xxxxxxxxx'
or (
lower(r.first_name_txt) = 'test'
and lower(r.last_name_txt) = 'newprogram'
and r.birth_dt = date '1960-01-02'
)
)
Firstly, joining by commas creates a cross join, which you want to avoid. Luckily, Oracle's smart enough to do it as an inner join since you specified join conditions, but you want to be explicit so you don't accidentally miss something.
Secondly, your IS NOT NULL checks are pointless: if a column is null, any = check against it will not return true for that row. In fact, any comparison with a null, even null = null, does not evaluate to true. You can try this with select 1 from dual where null = null and select 1 from dual where null is null. Only the second one returns a row.
Thirdly, Oracle's smart enough to compare dates with the ISO format (at least the last time I used it, it was). You can just do r.birth_dt = date '1960-01-02' and avoid doing a string format on that column.
That being said, your query isn't exactly poorly written in terms of egregious performance mistakes. What you want to look for are indices. Does evaluator have one on evaluator_id? Does credit_request? What types are they? Typically, evaluator will have one on the PK evaluator_id, and credit_request will have one for that column as well. The same for requestor and the request_id columns.
Other indices you may want to consider are all the fields you're using to filter. In this case, soc_sec_num_txt, first_name_txt, last_name_txt, birth_dt. Consider putting a multi-column index on the latter three, and a single column index on the soc_sec_num_txt column.
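As a sketch (index names are illustrative; note that because the query filters on lower(first_name_txt) and lower(last_name_txt), those columns need function-based indexes in Oracle for the index to be usable):

CREATE INDEX requestor_ssn_ix ON requestor (soc_sec_num_txt);

CREATE INDEX requestor_name_dob_ix
    ON requestor (LOWER(first_name_txt), LOWER(last_name_txt), birth_dt);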
After refactoring the query comes indexes, so following on from @eric's post:
credit_request:
You're joining this onto requestor on request_id, which I hope is unique. In your where clause you then have a condition on evaluator_id, and you select client_app_id and personal_flg in the query. So you probably need a unique index on credit_request of (request_id, evaluator_id, client_app_id, personal_flg).
By putting the columns you're selecting into the index you avoid the BY INDEX ROWID step, which means going back into the table to pick up more information after selecting your values from the index. If that information is already in the index then there's no need.
You're joining it onto evaluator on evaluator_id, which is included in the first index.
requestor:
This is being joined onto on request_id, and your where clause includes soc_sec_num_txt, lower(first_name_txt), lower(last_name_txt) and birth_dt. So you need a unique, if possible, index on (request_id, soc_sec_num_txt). Because of the or, this is further complicated: you should really have an index on as many of the conditions as possible. You're also selecting requestor_type_id.
In this case, to avoid a function-based index with many columns, I'd index on (request_id, soc_sec_num_txt, birth_dt). If you have the space, time and inclination, then adding lower(first_name_txt) etc. to this may improve the speed, depending on how selective the columns are. This means that if there are far more distinct values in, for instance, first_name_txt than in birth_dt, you'd be better off putting it in front of birth_dt in the index, so your query has less to scan if it's a non-unique index.
You'll notice that I haven't added the selected column into this index, as you're already going to have to go into the table, so you gain nothing by adding it.
evaluator:
This is only being joined on evaluator_id, so you need a unique, if possible, index on this column.
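Putting that advice into DDL, under the stated assumptions (index names are illustrative, and the existing PK indexes requestor_pk and evaluator_pk may already cover some of this):

CREATE UNIQUE INDEX credit_request_ix
    ON credit_request (request_id, evaluator_id, client_app_id, personal_flg);

CREATE INDEX requestor_ix
    ON requestor (request_id, soc_sec_num_txt, birth_dt);

CREATE UNIQUE INDEX evaluator_ix
    ON evaluator (evaluator_id);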

Why is a UDF so much slower than a subquery?

I have a case where I need to translate (lookup) several values from the same table. The first way I wrote it, was using subqueries:
SELECT
(SELECT id FROM user WHERE user_pk = created_by) AS creator,
(SELECT id FROM user WHERE user_pk = updated_by) AS updater,
(SELECT id FROM user WHERE user_pk = owned_by) AS owner,
[name]
FROM asset
As I'm using this subquery a lot (I have about 50 tables with these fields), and I might need to add some more code to the subquery (for example, "AND active = 1"), I thought I'd put it into a user-defined function (UDF) and use that. But the performance using that UDF was abysmal.
CREATE FUNCTION dbo.get_user ( @user_pk INT )
RETURNS INT
AS BEGIN
    RETURN ( SELECT id
             FROM ice.dbo.[user]
             WHERE user_pk = @user_pk )
END
SELECT dbo.get_user(created_by) as creator, [name]
FROM asset
The performance of #1 is less than 1 second. Performance of #2 is about 30 seconds...
Why, or more importantly, is there any way I can code this in SQL Server 2008 so that I don't have to use so many subqueries?
Edit:
Just a little more explanation of when this is useful. This simple query (get a user id) gets a lot more complex when I want to have a text for a user, since I have to join with profile to get the language, with company to see if the language should be fetched from there instead, and with the translation table to get the translated text. And for most of these queries, performance is a secondary issue to readability and maintainability.
The UDF is a black box to the query optimiser so it's executed for every row.
You are doing a row-by-row cursor. For each row in an asset, look up an id three times in another table. This happens when you use scalar or multi-statement UDFs (In-line UDFs are simply macros that expand into the outer query)
One of many articles on the problem is "Scalar functions, inlining, and performance: An entertaining title for a boring post".
The sub-queries can be optimised to correlate and avoid the row-by-row operations.
What you really want is this:
SELECT
uc.id AS creator,
uu.id AS updater,
uo.id AS owner,
a.[name]
FROM
asset a
JOIN
[user] uc ON uc.user_pk = a.created_by
JOIN
[user] uu ON uu.user_pk = a.updated_by
JOIN
[user] uo ON uo.user_pk = a.owned_by
Update Feb 2019
SQL Server 2019 starts to fix this problem with scalar UDF inlining.
As other posters have suggested, using joins will definitely give you the best overall performance.
However, since you've stated that you don't want the headache of maintaining 50-ish similar joins or subqueries, try using an inline table-valued function as follows:
CREATE FUNCTION dbo.get_user_inline (@user_pk INT)
RETURNS TABLE AS
RETURN
(
    SELECT TOP 1 id
    FROM ice.dbo.[user]
    WHERE user_pk = @user_pk
    -- AND active = 1
)
Your original query would then become something like:
SELECT
(SELECT TOP 1 id FROM dbo.get_user_inline(created_by)) AS creator,
(SELECT TOP 1 id FROM dbo.get_user_inline(updated_by)) AS updater,
(SELECT TOP 1 id FROM dbo.get_user_inline(owned_by)) AS owner,
[name]
FROM asset
An inline table-valued function should have better performance than either a scalar function or a multistatement table-valued function.
The performance should be roughly equivalent to your original query, but any future changes can be made in the UDF, making it much more maintainable.
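If you prefer, the same inline TVF can be consumed with OUTER APPLY instead of scalar subqueries; this is a sketch with the same semantics as the query above:

SELECT
    uc.id AS creator,
    uu.id AS updater,
    uo.id AS owner,
    a.[name]
FROM asset a
OUTER APPLY dbo.get_user_inline(a.created_by) uc
OUTER APPLY dbo.get_user_inline(a.updated_by) uu
OUTER APPLY dbo.get_user_inline(a.owned_by) uo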
To get the same result (NULL if the user is deleted or not active):
select
u1.id as creator,
u2.id as updater,
u3.id as owner,
a.[name]
FROM asset a
LEFT JOIN [user] u1 ON (u1.user_pk = a.created_by AND u1.active = 1)
LEFT JOIN [user] u2 ON (u2.user_pk = a.updated_by AND u2.active = 1)
LEFT JOIN [user] u3 ON (u3.user_pk = a.owned_by AND u3.active = 1)
Am I missing something? Why can't this work? You are only selecting the id which you already have in the table:
select created_by as creator, updated_by as updater,
owned_by as owner, [name]
from asset
By the way, when designing tables you really should avoid reserved words, like name, as field names.

SQL - Temp Table: Storing all columns in temp table versus only Primary key

I would need to create a temp table for paging purposes. I would be selecting all records into a temp table and then doing further processing with it.
I am wondering which of the following is a better approach:
1) Select all the columns of my Primary Table into the Temp Table and then select the rows I need
OR
2) Select only the primary key of the Primary Table into the Temp Table and then joining with the Primary Table later on?
Is there any size consideration when working with approach 1 versus approach 2?
[EDIT]
I am asking because I would have done the first approach, but looking at PROCEDURE [dbo].[aspnet_Membership_FindUsersByName], which is included with ASP.NET Membership, they are doing Approach 2.
[EDIT2]
For people without access to the stored procedure:
-- Insert into our temp table
INSERT INTO #PageIndexForUsers (UserId)
SELECT u.UserId
FROM dbo.aspnet_Users u, dbo.aspnet_Membership m
WHERE u.ApplicationId = @ApplicationId AND m.UserId = u.UserId AND u.LoweredUserName LIKE LOWER(@UserNameToMatch)
ORDER BY u.UserName
SELECT u.UserName, m.Email, m.PasswordQuestion, m.Comment, m.IsApproved,
m.CreateDate,
m.LastLoginDate,
u.LastActivityDate,
m.LastPasswordChangedDate,
u.UserId, m.IsLockedOut,
m.LastLockoutDate
FROM dbo.aspnet_Membership m, dbo.aspnet_Users u, #PageIndexForUsers p
WHERE u.UserId = p.UserId AND u.UserId = m.UserId AND
p.IndexId >= @PageLowerBound AND p.IndexId <= @PageUpperBound
ORDER BY u.UserName
If you have a non-trivial number of rows (more than 100), then a table variable's performance is generally going to be worse than the temp table equivalent. But test it to make sure.
Option 2 would use less resources, because there is less data duplication.
Tony's points about this being a dirty read are really something you should be considering.
With approach 1, the data in the temp table may be out of step with the real data, i.e. if other sessions make changes to the real data. This may be OK if you are just viewing a snapshot of the data taken at a certain point, but would be dangerous if you were also updating the real table based on changes made to the temporary copy.
This is exactly the approach I use for paging on the server:
Create a table variable (why incur the overhead of transaction logging?) with just the key values. Create it with an IDENTITY column as the primary key; this will be RowNum.
Insert keys into the table based on the user's sort/filter criteria. The identity column is now a row number which can be used for paging.
Select from the table variable joined to the other tables with the real data required, joined on the key value,
Where RowNum Between ((@PageNumber - 1) * @PageSize) + 1 And @PageNumber * @PageSize
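A minimal sketch of those three steps, assuming the membership tables from the earlier snippet and hypothetical @PageNumber/@PageSize values:

DECLARE @PageNumber INT = 3, @PageSize INT = 25;

-- step 1: table variable holding just the keys, IDENTITY gives the row number
DECLARE @PageIndex TABLE (RowNum INT IDENTITY(1,1) PRIMARY KEY, UserId UNIQUEIDENTIFIER);

-- step 2: insert keys in the user's sort order
INSERT INTO @PageIndex (UserId)
SELECT u.UserId
FROM dbo.aspnet_Users u
ORDER BY u.UserName;

-- step 3: join back for the real data, restricted to one page
SELECT u.UserName, u.LastActivityDate
FROM @PageIndex p
JOIN dbo.aspnet_Users u ON u.UserId = p.UserId
WHERE p.RowNum BETWEEN ((@PageNumber - 1) * @PageSize) + 1
                   AND @PageNumber * @PageSize
ORDER BY p.RowNum;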
Think about it this way. Suppose your query would return enough records to populate 1000 pages. How many users do you think would really look at all those pages? By returning only the ids, you aren't returning a lot of information you may or may not need to see. So it should save on network and server resources. And if they really do go through a lot of pages, it would take enough time that the data details might indeed need to be refreshed.
An alternative to paging (the way my company does it) is to use CTEs.
Check out this example from http://softscenario.blogspot.com/2007/11/sql-2005-server-side-paging-using-cte.html
CREATE PROC GetPagedEmployees (@NumbersOnPage INT = 25, @PageNumb INT = 1)
AS BEGIN
WITH AllEmployees AS
(SELECT ROW_NUMBER() OVER (ORDER BY [Person].[Contact].[LastName]) AS RowID,
[FirstName],[MiddleName],[LastName],[EmailAddress] FROM [Person].[Contact])
SELECT [FirstName],[MiddleName],[LastName],[EmailAddress]
FROM AllEmployees WHERE RowID BETWEEN
((@PageNumb - 1) * @NumbersOnPage) + 1 AND @PageNumb * @NumbersOnPage
ORDER BY RowID
END
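Calling it for, say, the third page of 25 rows would look like:

EXEC GetPagedEmployees @NumbersOnPage = 25, @PageNumb = 3;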

T-SQL Query Optimization

I'm working on some upgrades to an internal web analytics system we provide for our clients (in the absence of a preferred vendor or Google Analytics), and I'm working on the following query:
select
path as EntryPage,
count(Path) as [Count]
from
(
/* Sub-query 1 */
select
pv2.path
from
pageviews pv2
inner join
(
/* Sub-query 2 */
select
pv1.sessionid,
min(pv1.created) as created
from
pageviews pv1
inner join Sessions s1 on pv1.SessionID = s1.SessionID
inner join Visitors v1 on s1.VisitorID = v1.VisitorID
where
pv1.Domain = isnull(@Domain, pv1.Domain) and
v1.Campaign = @Campaign
group by
pv1.sessionid
) t1 on pv2.sessionid = t1.sessionid and pv2.created = t1.created
) t2
group by
Path;
I've tested this query with 2 million rows in the PageViews table and it takes about 20 seconds to run. I'm noticing a clustered index scan twice in the execution plan, both times it hits the PageViews table. There is a clustered index on the Created column in that table.
The problem is that in both cases it appears to iterate over all 2 million rows, which I believe is the performance bottleneck. Is there anything I can do to prevent this, or am I pretty much maxed out as far as optimization goes?
For reference, the purpose of the query is to find the first page view for each session.
EDIT: After much frustration, despite the help received here, I could not make this query work. Therefore, I decided to simply store a reference to the entry page (and now exit page) in the sessions table, which allows me to do the following:
select
pv.Path,
count(*)
from
PageViews pv
inner join Sessions s on pv.SessionID = s.SessionID
and pv.PageViewID = s.ExitPage
inner join Visitors v on s.VisitorID = v.VisitorID
where
(
@Domain is null or
pv.Domain = @Domain
) and
v.Campaign = @Campaign
group by pv.Path;
This query runs in 3 seconds or less. Now I either have to update the entry/exit page in real time as the page views are recorded (the optimal solution) or run a batch update at some interval. Either way, it solves the problem, but not like I'd intended.
Edit Edit: Adding a missing index (after cleaning up from last night) reduced the query to mere milliseconds. Woo hoo!
For starters,
where pv1.Domain = isnull(@Domain, pv1.Domain)
won't SARG. You can't optimize a match on a function, as I remember.
I'm back. To answer your first question, you could probably just do a union on the two conditions, since they are obviously disjoint.
Actually, you're trying to cover both the case where you provide a domain, and where you don't. You want two queries. They may optimize entirely differently.
What's the nature of the data in these tables? Do you find most of the data is inserted/deleted regularly?
Is that the full schema for the tables? The query plan shows different indexing..
Edit: Sorry, just read the last line of text. I'd suggest that if the tables are routinely cleared/inserted, you could think about ditching the clustered index and using the tables as heap tables.. just a thought
Definitely put non-clustered index(es) on Campaign and Domain, as John suggested.
Your inner query (pv1) will require a nonclustered index on (Domain).
The second query (pv2) can already find the rows it needs due to the clustered index on Created, but pv1 might be returning so many rows that SQL Server decides that a table scan is quicker than all the locks it would need to take. As pv1 groups on SessionID (and hence has to order by SessionID), a nonclustered index on SessionID, Created, and including path should permit a MERGE join to occur. If not, you can force a merge join with "SELECT .. FROM pageviews pv2 INNER MERGE JOIN ..."
The two indexes listed above will be:
CREATE NONCLUSTERED INDEX ncixcampaigndomain ON PageViews (Domain)
CREATE NONCLUSTERED INDEX ncixsessionidcreated ON PageViews(SessionID, Created) INCLUDE (path)
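And a sketch of the hinted join mentioned above; only the MERGE hint is new, the rest mirrors sub-query 1 from the question:

select pv2.path
from pageviews pv2
inner merge join
(
    select pv1.sessionid, min(pv1.created) as created
    from pageviews pv1
    inner join Sessions s1 on pv1.SessionID = s1.SessionID
    inner join Visitors v1 on s1.VisitorID = v1.VisitorID
    where pv1.Domain = isnull(@Domain, pv1.Domain)
      and v1.Campaign = @Campaign
    group by pv1.sessionid
) t1 on pv2.sessionid = t1.sessionid and pv2.created = t1.created;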
SELECT
sessionid,
MIN(created) AS created
FROM
pageviews pv
JOIN
visitors v ON pv.visitorid = v.visitorid
WHERE
v.campaign = @Campaign
GROUP BY
sessionid
so that gives you the sessions for a campaign. Now let's see what you're doing with that.
OK, this gets rid of your grouping:
SELECT
v.campaign,
pv.sessionid,
pv.path
FROM
pageviews pv
JOIN
visitors v ON pv.visitorid = v.visitorid
WHERE
v.campaign = #Campaign
AND NOT EXISTS (
SELECT 1 FROM pageviews pv0
WHERE pv0.sessionid = pv.sessionid
AND pv0.created < pv.created
)
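Since the stated goal is the first page view per session, a ROW_NUMBER() variant is another way to express this on SQL Server 2005 and later (a sketch using the same tables and parameters as the question):

select path as EntryPage, count(*) as [Count]
from (
    select pv.path,
           row_number() over (partition by pv.sessionid order by pv.created) as rn
    from pageviews pv
    inner join Sessions s on pv.SessionID = s.SessionID
    inner join Visitors v on s.VisitorID = v.VisitorID
    where (@Domain is null or pv.Domain = @Domain)
      and v.Campaign = @Campaign
) t
where rn = 1
group by path;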
To continue from doofledorf.
Try this:
where
(@Domain is null or pv1.Domain = @Domain) and
v1.Campaign = @Campaign
OK, I have a couple of suggestions:
Create this covering index:
create index idx2 on [PageViews]([SessionID], Domain, Created, Path)
If you can amend the Sessions table so that it stores the entry page, e.g. EntryPageViewID, you will be able to heavily optimise this.
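A hedged sketch of that denormalisation, assuming an EntryPageViewID column is added to Sessions (it could equally be maintained as page views are inserted, which the asker's edit above settled on):

UPDATE s
SET s.EntryPageViewID = pv.PageViewID
FROM Sessions s
JOIN PageViews pv ON pv.SessionID = s.SessionID
WHERE pv.Created = (SELECT MIN(pv2.Created)
                    FROM PageViews pv2
                    WHERE pv2.SessionID = s.SessionID);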