How to reuse an already calculated column in SELECT - sql

How to reuse an already calculated SELECT column?
Current query
SELECT
SUM(Mod),
SUM(Mod) - SUM(Spent)
FROM
tblHelp
GROUP BY
SourceID
Pseudo query
SELECT
SUM(Mod),
USE ALREADY CALCULATED VALUE - SUM(Spent)
FROM
tblHelp
GROUP BY
SourceID
Question: since SUM(Mod) is already calculated, can I put it in a temp variable and reuse it in other columns of the SELECT clause? Will doing so increase the efficiency of the SQL query?

You can't, at least not directly.
You can use tricks such as a derived table, a CTE, or CROSS APPLY, but you can't reference a value computed in the SELECT clause from within the same SELECT clause.
Example:
SELECT SumMode, SumMode - SumSpent
FROM
(
SELECT
SUM(Mod) As SumMode,
SUM(Spent) As SumSpent
FROM tblHelp GROUP BY SourceID
) As DerivedTable;
It will probably not increase performance, but for complicated computations it can help with code clarity.
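For completeness, the CTE variant mentioned above looks almost the same; a sketch using the same column names (not part of the original answer):
WITH Totals AS
(
SELECT
SUM(Mod) As SumMode,
SUM(Spent) As SumSpent
FROM tblHelp GROUP BY SourceID
)
SELECT SumMode, SumMode - SumSpent
FROM Totals;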

A subquery could do this for you, but it won't make any difference to SQL Server. If you think this would make the query more readable, then go ahead; here is an example:
select t.modsum,
t.modsum - t.modspent
from ( SELECT SUM(Mod) as modsum,
SUM(Spent) as modspent
FROM tblHelp
GROUP BY SourceID
) t
But, is this more readable for you than
SELECT
SUM(Mod),
SUM(Mod) - SUM(Spent)
FROM tblHelp GROUP BY SourceID
IMHO I don't find the first query more readable. That could of course change when the query gets much bigger and more complicated.
There won't be any improvement in performance, so the only reason to do this is to make the query clearer/more readable for you.

SQL Server has quite an intelligent query parser, so while I can't prove it, I would be very surprised if it calculated the sum more than once. However, you can make sure of it with:
select x.SourceId, x.Mod, x.Mod - x.Spent
from
(
select SourceId, sum(Mod) Mod, sum(Spent) Spent
from tblHelp
group by SourceId
) x

The other answers already cover some good ground, but please note:
You should certainly not SELECT the sum into a variable first and then run a second SELECT against your table using that value, because you would be scanning the table twice.
I understand how one would try to optimize performance by thinking low-level (CPU operations), which would lead you to think of avoiding extra summations. However, SQL Server is a different beast. You have to learn to read the execution plan and the data pages involved. If your code avoids unnecessary page reads, doing more CPU work (if that even happens) is usually negligible. In layman's terms for your case: if the table has few rows, it probably isn't worth even thinking about. If it has many, reading all of those pages from disk (and sorting them for the GROUP BY if no index exists) will take 99.99% of the time relative to adding the values for the sums.
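If you want to see those page reads for yourself, SQL Server can report them per query; a minimal sketch using the query from the question:
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT SUM(Mod), SUM(Mod) - SUM(Spent)
FROM tblHelp
GROUP BY SourceID;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
The logical/physical reads reported alongside the results are usually far more telling than the CPU time spent on the extra SUM.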

Related

If you do a simple SELECT-WHERE on a CTE that is already sorted, are your results guaranteed to still be in that same order, just filtered?

Wondering about expected/deterministic ordering output from Oracle 11g for queries based on sorted CTEs.
Consider this (extremely oversimplified for the sake of the example) SQL query. Again, note how the CTE has an ORDER BY clause in it.
WITH SortedArticles as (
SELECT *
FROM Articles
ORDER BY DatePublished
)
SELECT *
FROM SortedArticles
WHERE Author = 'Joe';
Can it be assumed that the outputted rows are guaranteed to be in the same order as the CTE, or do I have to re-sort them a second time?
Again, this is an extremely over-simplified example but it contains the important parts of what I'm asking. They are...
The CTE is sorted
The final SELECT statement selects only against the CTE, nothing else (no joins, etc.), and
The final SELECT statement only specifies a WHERE clause. It is purely a filtering statement.
The short answer is no. The only way to guarantee ordering is with an ORDER BY clause on your outer query. But there is no need to sort the results in the CTE in that situation.
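For the example above, that simply means moving the ORDER BY to the outer query (a minimal sketch):
WITH SortedArticles as (
SELECT *
FROM Articles
)
SELECT *
FROM SortedArticles
WHERE Author = 'Joe'
ORDER BY DatePublished;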
However, if the sort expression is complex, and you need sorting in the derived CTEs (e.g. because of using OFFSET/FETCH or ROWNUM), you could simplify the subsequent sorting by adding a row number field to the original CTE based on its sort criteria and then just sorting the derived CTEs by that row number. For your example:
WITH SortedArticles as (
SELECT *,
ROW_NUMBER() OVER (ORDER BY DatePublished) AS rn
FROM Articles
)
SELECT *
FROM SortedArticles
WHERE Author = 'Joe'
ORDER BY rn
No, the results are not guaranteed to be in the same order as in the subquery. Never was, never will be. You may observe a certain behaviour, especially if the CTE is materialized, which you can try to influence with optimizer hints like /*+ MATERIALIZE */ and /*+ INLINE */. However, the behaviour of the query optimizer also depends on data volume, I/O vs CPU speed, and most importantly the database version. For instance, Oracle 12.2 introduces a feature called "In-Memory Cursor Duration Temp Table" that tries to speed up queries like yours without preserving the order in the subquery.
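For reference, those hints go in the CTE's SELECT; a sketch only, since as noted above this influences materialization but still does not guarantee output order:
WITH SortedArticles as (
SELECT /*+ MATERIALIZE */ *
FROM Articles
ORDER BY DatePublished
)
SELECT *
FROM SortedArticles
WHERE Author = 'Joe';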
I'd go along with @Nick's suggestion of adding a row number field in the subquery.

Can I alias and reuse my subqueries?

I'm working with a data warehouse doing report generation. As the name would suggest, I have a LOT of data. One of the queries that pulls a LOT of data is getting to take longer than I like (these aren't performed ad-hoc, these queries run every night and rebuild tables to cache the reports).
I'm looking at optimizing it, but I'm a little limited on what I can do. I have one query that's written along the lines of...
SELECT column1, column2,... columnN, (subQuery1), (subquery2)... and so on.
The problem is, the subqueries are repeated a fair amount because each one has a CASE around it, such as...
SELECT
column1
, column2
, columnN
, (SELECT
CASE
WHEN (subQuery1) > 0 AND (subquery2) > 0
THEN CAST((subQuery1)/(subquery2) AS decimal)*100
ELSE 0
END) AS "longWastefulQueryResults"
Our data comes from multiple sources and there are occasional data entry errors, so this prevents potential errors when dividing by a zero. The problem is, the sub-queries can repeat multiple times even though the values won't change. I'm sure there's a better way to do it...
I'd love something like what you see below, but I get errors about needing sq1 and sq2 in my group by clause. I'd provide an exact sample, but it'd be painfully tedious to go over.
SELECT
column1
, column2
, columnN
, (subQuery1) as sq1
, (subquery2) as sq2
, (SELECT
CASE
WHEN (sq1) > 0 AND (sq2) > 0
THEN CAST((sq1)/(sq2) AS decimal)*100
ELSE 0
END) AS "lessWastefulQueryResults"
I'm using Postgres 9.3 but haven't been able to get a successful test yet. Is there anything I can do to optimize my query?
Yup, you can create a Temp Table to store your results and query them again in the same session.
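A minimal sketch of that idea; the two scalar subqueries are placeholders, since the real subQuery1/subquery2 bodies aren't shown in the question:
CREATE TEMP TABLE sub_results AS
SELECT (SELECT count(*) FROM other_table)       AS sq1,  -- stands in for subQuery1
       (SELECT count(*) FROM yet_another_table) AS sq2;  -- stands in for subquery2

SELECT CASE
         WHEN sq1 > 0 AND sq2 > 0 THEN CAST(sq1 AS decimal) / sq2 * 100
         ELSE 0
       END AS "lessWastefulQueryResults"
FROM sub_results;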
I'm not sure how good the Postgres optimizer is, so I'm not sure whether optimizing in this way will do any good. (In my opinion, it shouldn't because the DBMS should be taking care of this kind of thing; but it's not at all surprising if it isn't.) OTOH if your current form has you repeating query logic, then you can benefit from doing something different whether or not it helps performance...
You could put the subqueries in with clauses up front, and that might help.
with subquery1 as (select ...)
, subquery2 as (select ...)
select ...
This is similar to putting the subqueries in the FROM clause as Allen suggests, but may offer more flexibility if your queries are complex.
If you have the freedom to create a temp table as Andrew suggests, that too might work but could be a double-edged sword. At this point you're limiting the optimizer's options by insisting that the temp tables be populated first and then used in the way that makes sense to you, which may not always be the way that actually gets the most efficiency. (Again, this comes down to how good the optimizer is... it's often folly to try to outsmart a really good one.) On the other hand, if you do create temp or working tables, you might be able to apply useful indexes or stats (if they contain large datasets) that would further improve downstream steps' performance.
It looks like many of your subqueries might return single values. You could put the queries into a procedure and capture those individual values as variables. This is similar to the temp table approach, but doesn't require creation of objects (as you may not be able to do that) and will have less risk of confusing the optimizer by making it worry about a table where there's really just one value.
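In Postgres that would mean a PL/pgSQL function; here is a rough sketch under the assumption that each subquery really does return a single value (all names below are placeholders, not from the original question):
CREATE OR REPLACE FUNCTION less_wasteful_ratio() RETURNS numeric AS $$
DECLARE
    sq1 numeric;
    sq2 numeric;
BEGIN
    -- Each subquery runs exactly once and its value is captured in a variable
    SELECT count(*) INTO sq1 FROM other_table;       -- stands in for subQuery1
    SELECT count(*) INTO sq2 FROM yet_another_table; -- stands in for subquery2

    IF sq1 > 0 AND sq2 > 0 THEN
        RETURN (sq1 / sq2) * 100;
    ELSE
        RETURN 0;
    END IF;
END;
$$ LANGUAGE plpgsql;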
Sub-queries in the column list tend to be a questionable design. The first approach I'd take to solving this is to see if you can move them down to the from clause.
In addition to allowing you to use the result of those queries in multiple columns, doing this often helps the optimizer to come up with a better plan for your query. This is because the queries in the column list have to be executed for every row, rather than merged into the rest of the result set.
Since you only included a portion of the query in your question, I can't demonstrate this particularly well, but what you should be looking for would look more like:
SELECT column1,
column2,
columnN,
subquery1.sq1,
subquery2.sq2,
(SELECT CASE
WHEN (subquery1.sq1) > 0 AND (subquery2.sq2) > 0 THEN
CAST ( (subquery1.sq1) / (subquery2.sq2) AS DECIMAL) * 100
ELSE
0
END)
AS "lessWastefulQueryResults"
FROM some_table
JOIN (SELECT *
FROM other_table
GROUP BY some_columns) subquery1
ON some_table.some_columns = subquery1.some_columns
JOIN (SELECT *
FROM yet_another_table
GROUP BY more_columns) subquery2
ON some_table.more_columns = subquery2.more_columns

SELECT DISTINCT Inside WHERE IN clause performance

I have a performance question about the following code...
SELECT * FROM GCL_Loans WHERE Loan_ID IN
(
SELECT Loan_ID FROM GCL_Loan_Items
)
GCL_Loans has a list of loans with basic information
GCL_Loan_Items has information about a specific item in a loan. There can be duplicate Loan_IDs in GCL_Loan_Items
Can anyone explain why this query would be faster or slower than the one above?
SELECT * FROM GCL_Loans WHERE Loan_ID IN
(
SELECT DISTINCT Loan_ID FROM GCL_Loan_Items
)
The "DISTINCT" version is probably faster, because the IN clause will have a smaller data set to search to determine if any given GCL_Loans.Loan_ID is in the set. Without the DISTINCT, the data set will be larger.
There's a reasonably good argument to be made that the query optimizer will automatically recognize the IN test is a set-wise, not a list-wise test and do the DISTINCT during auto-indexing ... but I've seen that fail before.
Note that subselects can be a fail here too, because some databases (MySQL) will execute the subselect for each element in the primary select.
The plan and performance of both are equal.
Because by selecting DISTINCT, there are fewer values in the subquery (IN) to check.
My understanding is SQL will run the subquery first to generate the list of items that are to be included in the IN.
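One way to check these claims yourself is to compare the plans; a sketch assuming SQL Server, run in SSMS or sqlcmd (SHOWPLAN_TEXT prints the estimated plan instead of running the queries):
SET SHOWPLAN_TEXT ON;
GO
SELECT * FROM GCL_Loans WHERE Loan_ID IN (SELECT Loan_ID FROM GCL_Loan_Items);
SELECT * FROM GCL_Loans WHERE Loan_ID IN (SELECT DISTINCT Loan_ID FROM GCL_Loan_Items);
GO
SET SHOWPLAN_TEXT OFF;
GO
If the optimizer treats the IN as a semi-join in both cases, the two plans will come out identical.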

SQL "WITH" Performance and Temp Table (possible "Query Hint" to simplify)

Given the example queries below (Simplified examples only)
DECLARE @DT int; SET @DT=20110717; -- yes this is an INT
WITH LargeData AS (
SELECT * -- This is a MASSIVE table indexed on dt field
FROM mydata
WHERE dt=@DT
), Ordered AS (
SELECT TOP 10 *
, ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
FROM LargeData
)
SELECT * FROM Ordered
and ...
DECLARE @DT int; SET @DT=20110717;
IF OBJECT_ID('tempdb..#LargeData') IS NOT NULL DROP TABLE #LargeData; -- dump any possible table.
SELECT * -- This is a MASSIVE table indexed on dt field
INTO #LargeData -- put smaller results into temp
FROM mydata
WHERE dt=@DT;
WITH Ordered AS (
SELECT TOP 10 *
, ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
FROM #LargeData
)
SELECT * FROM Ordered
Both produce the same results, which is a limited and ranked list of values from a list based on a fields data.
When these queries get considerably more complicated (many more tables, lots of criteria, multiple levels of "with" table aliases, etc...) the bottom query executes MUCH faster than the top one. Sometimes on the order of 20x-100x faster.
The Question is...
Is there some kind of query HINT or other SQL option that would tell SQL Server to perform the same kind of optimization automatically, or another format of this that would involve a cleaner approach (trying to keep the format as much like query 1 as possible)?
Note that the "ranking" in the secondary query is just fluff for this example; the actual operations performed really don't matter too much.
This is sort of what I was hoping for (or similar but the idea is clear I hope). Remember this query below does not actually work.
DECLARE @DT int; SET @DT=20110717;
WITH LargeData AS (
SELECT * -- This is a MASSIVE table indexed on dt field
FROM mydata
WHERE dt=@DT
OPTION (USE_TEMP_OR_HARDENED_OR_SOMETHING) -- EXAMPLE ONLY
), Ordered AS (
SELECT TOP 10 *
, ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
FROM LargeData
)
SELECT * FROM Ordered
EDIT: Important follow up information!
If in your sub query you add
TOP 999999999 -- improves speed dramatically
Your query will behave in a similar fashion to using a temp table as in the previous query. I found the execution times improved in almost exactly the same fashion. WHICH IS FAR SIMPLER than using a temp table and is basically what I was looking for.
However
TOP 100 PERCENT -- does NOT improve speed
Does NOT perform in the same fashion (you must use the static number style TOP 999999999)
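Applied to the first query above, the change described in this edit would look like the following (a sketch based on the description; only the TOP in the first CTE is new):
DECLARE @DT int; SET @DT=20110717;
WITH LargeData AS (
SELECT TOP 999999999 * -- the static TOP forces the CTE to be evaluated on its own
FROM mydata
WHERE dt=@DT
), Ordered AS (
SELECT TOP 10 *
, ROW_NUMBER() OVER (ORDER BY valuefield DESC) AS Rank_Number
FROM LargeData
)
SELECT * FROM Ordered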
Explanation:
From what I can tell from the actual execution plan of the query in both formats (the original one with normal CTEs and the one with each sub query having TOP 999999999):
The normal query joins everything together as if all the tables are in one massive query, which is what is expected. The filtering criteria are applied almost at the join points in the plan, which means many more rows are being evaluated and joined together all at once.
In the version with TOP 999999999, the actual execution plan clearly separates the sub queries from the main query in order to apply the TOP statement's action, thus forcing creation of an in-memory "Bitmap" of the sub query that is then joined to the main query. This appears to do exactly what I wanted, and in fact it may even be more efficient, since servers with large amounts of RAM will be able to do the query execution entirely in memory without any disk IO. In my case we have 280 GB of RAM, so far more than could ever really be used.
Not only can you use indexes on temp tables, but they also allow the use of statistics and hints. I can find no reference to being able to use statistics in the documentation on CTEs, and it says specifically that you can't use hints.
Temp tables are often the most performant way to go when you have a large data set and the choice is between temp tables and table variables, even when you don't use indexes (possibly because the plan will be developed using statistics), and I suspect the implementation of the CTE is more like the table variable than the temp table.
I think the best thing to do, though, is to see how the execution plans differ to determine if it is something that can be fixed.
What exactly is your objection to using the temp table when you know it performs better?
The problem is that in the first query the SQL Server query optimizer is able to generate a query plan. In the second query a good query plan can't be generated because you're inserting the values into a new temporary table. My guess is there is a full table scan going on somewhere that you're not seeing.
What you may want to do in the second query is insert the values into the #LargeData temporary table like you already do and then create a non-clustered index on the "valuefield" column. This might help to improve your performance.
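That suggestion would look something like this (a sketch; the index name is made up):
CREATE NONCLUSTERED INDEX IX_LargeData_valuefield
ON #LargeData (valuefield DESC);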
It is quite possible that SQL is optimizing for the wrong value of the parameters.
There are a couple of options
Try using OPTION (RECOMPILE). There is a cost to this, as it recompiles the query every time, but if different plans are needed it might be worth it.
You could also try using OPTION (OPTIMIZE FOR (@DT = SomeRepresentativeValue)). The problem with this is that you might pick the wrong value.
See I Smell a Parameter! from The SQL Server Query Optimization Team blog
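For reference, both hints attach to the end of the statement, i.e. the final SELECT of the CTE query above (a sketch):
-- ... the WITH LargeData AS (...), Ordered AS (...) part is unchanged
SELECT * FROM Ordered
OPTION (RECOMPILE);

-- or, pinning the plan to a representative parameter value:
SELECT * FROM Ordered
OPTION (OPTIMIZE FOR (@DT = 20110717));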

Does the order of the columns in a SELECT statement make a difference?

This question was inspired by a previous question posted on SO, "Does the order of the WHERE clause make a difference?". Would it improve a SELECT statement's performance if the columns used in the WHERE section are placed at the beginning of the SELECT statement?
example:
SELECT customer.id,
transaction.id,
transaction.efective_date,
transaction.a,
[...]
FROM customer, transaction
WHERE customer.id = transaction.id;
I do know that limiting the list of columns to only the needed ones in a SELECT statement improves performance as opposed to using SELECT *, because the returned column list is smaller.
For Oracle and Informix and any other self-respecting DBMS, the order of the columns should have no impact on performance. Similarly, it should be the case that the query engine finds the optimal order in which to process the WHERE clause, so the order should not matter, all things being equal (i.e., looking past constructs which might force an execution order).