I'm working with a data warehouse doing report generation. As the name would suggest, I have a LOT of data. One of the queries that pulls a lot of data is starting to take longer than I'd like (these aren't ad-hoc queries; they run every night and rebuild tables that cache the reports).
I'm looking at optimizing it, but I'm a little limited in what I can do. I have one query that's written along the lines of...
SELECT column1, column2,... columnN, (subQuery1), (subquery2)... and so on.
The problem is that the subqueries are repeated a fair amount, because each one has a CASE wrapped around it, such as...
SELECT
    column1
    , column2
    , columnN
    , (SELECT
           CASE
               WHEN (subQuery1) > 0 AND (subquery2) > 0
                   THEN CAST((subQuery1) / (subquery2) AS decimal) * 100
               ELSE 0
           END) AS "longWastefulQueryResults"
Our data comes from multiple sources and there are occasional data entry errors, so this prevents potential errors when dividing by zero. The problem is that the sub-queries can be repeated multiple times even though their values won't change. I'm sure there's a better way to do it...
I'd love something like what you see below, but I get errors about needing sq1 and sq2 in my group by clause. I'd provide an exact sample, but it'd be painfully tedious to go over.
SELECT
    column1
    , column2
    , columnN
    , (subQuery1) as sq1
    , (subquery2) as sq2
    , (SELECT
           CASE
               WHEN (sq1) > 0 AND (sq2) > 0
                   THEN CAST((sq1) / (sq2) AS decimal) * 100
               ELSE 0
           END) AS "lessWastefulQueryResults"
I'm using Postgres 9.3 but haven't been able to get a successful test yet. Is there anything I can do to optimize my query?
Yup, you can create a temp table to store your results and query it again in the same session.
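A minimal sketch of that idea, assuming the two scalar subqueries from the question can be computed once per run (the table and column names below are placeholders):
-- Compute the expensive scalars once and keep them for the rest of the session.
CREATE TEMP TABLE report_scalars AS
SELECT (subQuery1) AS sq1,
       (subquery2) AS sq2;

-- Reuse the stored values instead of repeating the subqueries per column.
SELECT t.column1,
       t.column2,
       CASE WHEN s.sq1 > 0 AND s.sq2 > 0
            THEN CAST(s.sq1 / s.sq2 AS decimal) * 100
            ELSE 0
       END AS "lessWastefulQueryResults"
FROM some_table t
CROSS JOIN report_scalars s;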
I'm not sure how good the Postgres optimizer is, so I'm not sure whether optimizing in this way will do any good. (In my opinion, it shouldn't because the DBMS should be taking care of this kind of thing; but it's not at all surprising if it isn't.) OTOH if your current form has you repeating query logic, then you can benefit from doing something different whether or not it helps performance...
You could put the subqueries in WITH clauses up front, and that might help.
with subquery1 as (select ...)
, subquery2 as (select ...)
select ...
This is similar to putting the subqueries in the FROM clause as Allen suggests, but may offer more flexibility if your queries are complex.
If you have the freedom to create a temp table as Andrew suggests, that too might work but could be a double-edged sword. At this point you're limiting the optimizer's options by insisting that the temp tables be populated first and then used in the way that makes sense to you, which may not always be the way that actually gets the most efficiency. (Again, this comes down to how good the optimizer is... it's often folly to try to outsmart a really good one.) On the other hand, if you do create temp or working tables, you might be able to apply useful indexes or stats (if they contain large datasets) that would further improve downstream steps' performance.
It looks like many of your subqueries might return single values. You could put the queries into a procedure and capture those individual values as variables. This is similar to the temp table approach, but doesn't require creation of objects (as you may not be able to do that) and will have less risk of confusing the optimizer by making it worry about a table where there's really just one value.
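A sketch of that approach in PL/pgSQL (the function, table, and column names below are made up; the two SELECT ... INTO statements stand in for the question's scalar subqueries):
CREATE OR REPLACE FUNCTION build_report()
RETURNS TABLE (column1 integer, column2 integer, less_wasteful numeric) AS $$
DECLARE
    sq1 numeric;   -- stands in for subQuery1
    sq2 numeric;   -- stands in for subquery2
BEGIN
    -- Each scalar is computed exactly once...
    SELECT count(*) INTO sq1 FROM source_table_a;
    SELECT count(*) INTO sq2 FROM source_table_b;

    -- ...and reused for every row of the report.
    RETURN QUERY
    SELECT t.column1,
           t.column2,
           CASE WHEN sq1 > 0 AND sq2 > 0
                THEN (sq1 / sq2) * 100
                ELSE 0
           END
    FROM some_table t;
END;
$$ LANGUAGE plpgsql;
The nightly job would then just run SELECT * FROM build_report() when rebuilding the cached report table.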
Sub-queries in the column list tend to be a questionable design. The first approach I'd take to solving this is to see if you can move them down to the from clause.
In addition to allowing you to use the result of those queries in multiple columns, doing this often helps the optimizer to come up with a better plan for your query. This is because the queries in the column list have to be executed for every row, rather than merged into the rest of the result set.
Since you only included a portion of the query in your question, I can't demonstrate this particularly well, but what you should be looking for would look more like:
SELECT column1,
       column2,
       columnN,
       subquery1.sq1,
       subquery2.sq2,
       (SELECT CASE
                   WHEN subquery1.sq1 > 0 AND subquery2.sq2 > 0 THEN
                       CAST(subquery1.sq1 / subquery2.sq2 AS DECIMAL) * 100
                   ELSE
                       0
               END)
           AS "lessWastefulQueryResults"
FROM some_table
JOIN (SELECT *
      FROM other_table
      GROUP BY some_columns) subquery1
    ON some_table.some_columns = subquery1.some_columns
JOIN (SELECT *
      FROM yet_another_table
      GROUP BY more_columns) subquery2
    ON some_table.more_columns = subquery2.more_columns
Related: How to reuse an already calculated SELECT column?
Current query
SELECT
SUM(Mod),
SUM(Mod) - SUM(Spent)
FROM
tblHelp
GROUP BY
SourceID
Pseudo query
SELECT
SUM(Mod),
USE ALREADY CALCULATED VALUE - SUM(Spent)
FROM
tblHelp
GROUP BY
SourceID
Question: since SUM(Mod) is already calculated, can I put it in a temp variable and use it in the other columns of the SELECT clause? Will doing so increase the efficiency of the SQL query?
You can't, at least not directly.
You can use tricks such as a derived table, a CTE, or CROSS APPLY, but you can't use a value computed in the SELECT clause elsewhere in the same SELECT clause.
example:
SELECT SumMode, SumMode - SumSpent
FROM
(
SELECT
SUM(Mod) As SumMode,
SUM(Spent) As SumSpent
FROM tblHelp GROUP BY SourceID
) As DerivedTable;
It will probably not increase performance, but for complicated computations it can help with code clarity.
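For completeness, a sketch of the CTE variant mentioned above (the same tblHelp columns are assumed):
-- The aggregation happens once in the CTE; the outer query just reuses it.
WITH agg AS (
    SELECT SourceID,
           SUM(Mod)   AS SumMod,
           SUM(Spent) AS SumSpent
    FROM tblHelp
    GROUP BY SourceID
)
SELECT SumMod,
       SumMod - SumSpent
FROM agg;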
A subquery could do this for you, but it won't make any difference to SQL Server. If you think this makes the query more readable, then go ahead; here is an example:
select t.modsum,
t.modsum - t.modspent
from ( SELECT SUM(Mod) as modsum,
SUM(Spent) as modspent
FROM tblHelp
GROUP BY SourceID
) t
But, is this more readable for you than
SELECT
SUM(Mod),
SUM(Mod) - SUM(Spent)
FROM tblHelp GROUP BY SourceID
IMHO I don't find the first query more readable. That could change, of course, when the query gets much bigger and more complicated.
There won't be any improvement to performance, so the only reason to do this is to make it more clear/readable for you
SQL Server has a quite intelligent query optimizer, so while I can't prove it, I would be very surprised if it calculated the sum more than once. However, you can make sure of it with:
select x.SourceId, x.Mod, x.Mod - x.Spent
from
(
select SourceId, sum(Mod) Mod, sum(Spent) Spent
from tblHelp
group by SourceId
) x
The other answers already cover some good ground, but please note:
You should not select the sum into a variable and then run a second SELECT against the table alongside that value, because you would be scanning the table twice.
I understand how one would try to optimize performance by thinking low-level (CPU operations), which leads you to think about avoiding the extra summation. However, SQL Server is a different beast. You have to learn to read the execution plan and look at the data pages involved. If your code avoids unnecessary page reads, doing a bit more CPU work (if that even happens) is usually negligible. In layman's terms for your case: if the table has few rows, it probably isn't worth even thinking about. If it has many, reading all of those pages from disk (and sorting them for the GROUP BY if no suitable index exists) will take 99.99% of the time relative to adding up the values for the sums.
I have a table which includes 30 records, and a smaller table that has 10 records; both tables have the same schema. All I want to do is return the rows that are in the big table but not in the small table. The solution I found is to use the EXCEPT operator. However, when I ran the query, it took about 30 minutes, so I am wondering whether EXCEPT is computationally expensive and takes a lot of resources.
Are there any functions that can replace EXCEPT? Thanks for any help!
EXCEPT is a set operator and it should be reasonably optimized. It does remove duplicate values, so there is a bit more overhead than one might expect.
It is not so unoptimized that it would take 30 seconds on such small tables, unless you have columns whose size measures in many megabytes. Something else might be going on -- such as network or server contention.
EXCEPT is a very reasonable approach. NOT IN has a problem with NULL values and only works with one column. NOT EXISTS is going to work best when you have an appropriate index. Under some circumstances, EXCEPT is faster than NOT EXISTS.
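For reference, the EXCEPT form would look something like this (the column names are placeholders; both sides need the same column list):
-- Rows that appear in the big table but not in the small table (duplicates removed).
SELECT col1, col2, col3 FROM big_table
EXCEPT
SELECT col1, col2, col3 FROM small_table;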
In this case, you should be using NOT EXISTS; EXISTS is one of the most performant operations in SQL Server:
SELECT *
FROM big_table b
WHERE NOT EXISTS (
SELECT 1
FROM small_table s
WHERE s.id = b.id)
There is no need to make things complicated for something so simple.
SELECT * FROM Table1 WHERE ID NOT IN (SELECT ID FROM Table2)
I have received a SQL query that makes use of the distinct keyword. When I tried running the query it took at least a minute to join two tables with hundreds of thousands of records and actually return something.
I then took out the DISTINCT and it came back in 0.2 seconds. Does the DISTINCT keyword really make things that bad?
Here's the query:
SELECT DISTINCT
c.username, o.orderno, o.totalcredits, o.totalrefunds,
o.recstatus, o.reason
FROM management.contacts c
JOIN management.orders o ON (c.custID = o.custID)
WHERE o.recDate > to_date('2010-01-01', 'YYYY-MM-DD')
Yes, because using DISTINCT will (sometimes, according to the comments) cause the results to be sorted. Sorting hundreds of thousands of records takes time.
Try GROUP BY on all your columns; it can sometimes lead the query optimiser to choose a more efficient algorithm (at least with Oracle I noticed a significant performance gain).
Distinct always sets off alarm bells to me - it usually signifies a bad table design or a developer who's unsure of themselves. It is used to remove duplicate rows, but if the joins are correct, it should rarely be needed. And yes there is a large cost to using it.
What's the primary key of the orders table? Assuming it's orderno then that should be sufficient to guarantee no duplicates. If it's something else, then you may need to do a bit more with the query, but you should make it a goal to remove those distincts! ;-)
Also you mentioned the query was taking a while to run when you were checking the number of rows - it can often be quicker to wrap the entire query in "select count(*) from ( )" especially if you're getting large quantities of rows returned. Just while you're testing obviously. ;-)
Finally, make sure you have indexed the custID on the orders table (and maybe recDate too).
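For example, the count(*) wrapper for testing and the suggested indexes might look like this (the index names are made up):
-- Testing only: count the rows without pulling them all back to the client.
SELECT COUNT(*)
FROM (SELECT DISTINCT c.username, o.orderno, o.totalcredits,
                      o.totalrefunds, o.recstatus, o.reason
      FROM management.contacts c
      JOIN management.orders o ON (c.custID = o.custID)
      WHERE o.recDate > to_date('2010-01-01', 'YYYY-MM-DD')) q;

-- Indexes to support the join and the date filter.
CREATE INDEX idx_orders_custid ON management.orders (custID);
CREATE INDEX idx_orders_recdate ON management.orders (recDate);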
The purpose of DISTINCT is to prune duplicate records from the result set across all the selected columns.
If any of the selected columns is unique after join you can drop DISTINCT.
If you don't know that, but you know that the combination of the values of selected column is unique, you can drop DISTINCT.
Actually, with properly designed databases you rarely need DISTINCT, and in the cases where you do it is usually obvious that you need it. The RDBMS, however, cannot leave it to chance and must actually do the work (typically a sort or hash) to enforce it.
Normally you find DISTINCT all over the place when people are not sure about JOINs and relationships between tables.
Also, in classes about pure relational databases, where the result should be a proper set (with no repeating elements, i.e. records), you will find it quite common for people to stick DISTINCT in to guarantee this property for the sake of theoretical correctness. Sometimes this creeps into production systems.
You can try a GROUP BY like this:
SELECT c.username,
o.orderno,
o.totalcredits,
o.totalrefunds,
o.recstatus,
o.reason
FROM management.contacts c,
management.orders o
WHERE c.custID = o.custID
AND o.recDate > to_date('2010-01-01', 'YYYY-MM-DD')
GROUP BY c.username,
o.orderno,
o.totalcredits,
o.totalrefunds,
o.recstatus,
o.reason
Also verify that you have an index on o.recDate.
I have the SQL query:
SELECT ISNULL(t.column1, t.column2) as [result]
FROM t
I need to filter out data by [result] column. What is the best approach regarding performance from the two listed below:
WHERE ISNULL(t.column1, t.column2) = @filterValue
or:
WHERE t.column1 = @filterValue OR t.column2 = @filterValue
UPDATE: Sorry, I forgot to mention that column2 is always NULL if column1 is filled.
Measure, don't guess! This is something you should be doing yourself, with production-like data. We don't know the make-up of your data and that makes a big difference.
Having said that, I wouldn't do it either way. I'd create another column, column3 to store column1 if non-NULL and column2 if column1 is NULL.
Then I'd have an insert/update trigger to populate that column correctly, index it and use the screaming-banshee-speed:
select t.column3 as [result] from t
The vast majority of databases are read more often than written and it's better if this calculation is done as few times as possible (i.e., when the data changes, not every time you select it). If you want your databases to be scalable, don't use per-row functions.
It's perfectly valid to sacrifice disk space for speed and the triggers ensure that the data doesn't become inconsistent.
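In SQL Server specifically, a persisted computed column is a simpler way to get a similar effect without hand-writing the trigger (a sketch, assuming the table t and columns from the question):
-- column3 is computed on write, stored, and indexable.
ALTER TABLE t ADD column3 AS ISNULL(column1, column2) PERSISTED;
CREATE INDEX IX_t_column3 ON t (column3);

-- The filter then becomes a simple indexed predicate.
SELECT t.column3 AS [result]
FROM t
WHERE t.column3 = @filterValue;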
If adding another column and triggers is out of the question, I'd go for the OR solution, since it can often be split into two parallel queries by the smarter DBMS engines.
An alternative, which MarkB gave but has since deleted his answer (so I'll have to go hunting for another good answer of his to upvote :-)), is to use UNION ALL. If your DBMS isn't quite smart enough to recognise OR as a chance for parallelism, it may be smart enough to recognise UNION ALL in that context, something like:
select column1 as c from t where column1 is not NULL
union all
select column2 as c from t where column1 is NULL
But again, it depends on both your database and your data. A smart DBA would put the whole thing in a stored procedure so they could swap in a new method seamlessly should the data change its properties.
On an MSSQL table (MSSQL 2000) with 13,000,000 entries and indexes on Col1 and Col2, I get the following results:
select top 1000000 * from Table1 with(nolock) where isnull(Col1,Col2) > '0'
-- Compile-Time: 4ms
-- CPU-Time: 18265ms
-- Elapsed-Time: 24882ms = ~25s
select top 1000000 * from Table1 with(nolock) where Col1 > '0' or (Col1 is null and Col2 > '0')
-- Compile-Time: 9ms
-- CPU-Time: 7781ms
-- Elapsed-Time: 25734ms = ~26s
The measured values are subject to strong fluctuations based on the workload of the server.
The first statement needs less time to compile but takes more CPU time to execute (clustered index scan).
It's important to know that many storage engines have an optimizer that reorganizes the statement for better results and execution times. Ultimately, both statements are rewritten by the optimizer into mostly the same statement.
I think your replacement expression does not mean the same thing. Assume filterValue is 2; then ISNULL(1, 2) = 2 is false, but 1 = 2 OR 2 = 2 is true. The expression you need looks more like:
(c1=filter) or ((c1 is null) and (c2 = filter));
There is a chance that a server can answer this from the index on c1. The first part of the solution is an index scan over c1 = filter. The second part is a scan over c1 IS NULL and then a linear search for c2 = filter. I'd even say that a clustered index on (c1, c2) could work here.
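If you want to test that idea, a sketch of the index (the name is made up, and this assumes the table doesn't already have a clustered index):
-- Supports both branches: column1 = @filterValue, and column1 IS NULL with column2 = @filterValue.
CREATE CLUSTERED INDEX IX_t_c1_c2 ON t (column1, column2);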
OTOH, you should measure before making assumptions like this; speculation doesn't usually work in SQL unless you have intimate knowledge of the implementation. For example, I'm pretty sure the query planner already knows that ISNULL(X, Y) can be decomposed into a boolean expression, with its implications for searching, but I would not rely on that; rather, measure and then decide what to do.
I have measured the performance of both queries on SQL Server 2008 and got the following results:
Both approaches had almost the same Estimated Subtree Cost metric.
But the OR approach had a more accurate value for the Estimated Number of Rows metric.
So the query optimizer will build a more appropriate execution plan for the OR approach than for the ISNULL approach.
The following query works as expected, but I suspect there is room for optimization. Any help?
SELECT a.cond_providentid,
b.flag1
FROM c_master a
WHERE a.cond_status = 'OnService'
ORDER BY a.cond_providentid,
a.rto_number;
May I suggest placing the query within your left join in a database view; that way, the code can be much cleaner and easier to maintain.
Also, check the columns that you use most often: they could be candidates for indexing, so that when you run your query, it can be faster.
You also might check your column data types... I see that you have this type of code:
(CASE
WHEN b.tray_type IS NULL
THEN 1
ELSE 0
END) flag2
If you have a chance to change the design of your tables (e.g. change b.tray_type to bit, or use a computed column to determine the flag), it would run faster because you wouldn't have to use CASE statements to determine the flag. You could just add it as another column in your query.
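A sketch of the computed-column idea (the table name tray_table is made up to stand in for whatever is behind alias b):
-- flag2 is derived automatically, so the SELECT no longer needs the CASE expression.
ALTER TABLE tray_table
ADD flag2 AS (CASE WHEN tray_type IS NULL THEN 1 ELSE 0 END);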
Hope this helps! :)
Ann