SQL 'case when' vs 'where' efficiency

Which is more efficient:
Select SUM(case when col2=2 then col1 Else 0 End) From myTable
OR
Select SUM(Col1) From myTable where col2=2
Or are they the same speed?

Definitely, the second one should be faster. This comes down to the concept of "access": the amount of data the query needs to read in order to produce the result. It has a big impact on which operators the database engine's optimizer decides to include in the execution plan.
Save for some exceptions, the first query needs to access all the table rows and then compute the result, including rows that have nothing to do with the CASE.
The second query only touches the specific rows needed to compute the result, so it has the potential to be faster. For that potential to materialize, the presence of indexes is crucial. For example:
create index ix1 on myTable (col2);
In this case it will only access the subset of rows that match the filtering predicate col2 = 2.
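If the engine supports index-only access, the same idea can be taken one step further: a composite index that also carries col1 lets the SUM be computed from the index alone, without visiting the table. A minimal sketch, assuming a plain composite index (the index name ix2 is illustrative; some engines, e.g. SQL Server, offer INCLUDE columns for the same effect):
-- composite index: col2 filters the rows, col1 is available for the aggregate
create index ix2 on myTable (col2, col1);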

The second is more efficient:
It generally processes fewer rows (assuming some rows have col2 values other than 2), because rows are filtered out before the aggregation function is called.
It allows the optimizer to take advantage of indexes.
It allows the optimizer to take advantage of table partitions.
Under some circumstances, they might appear to take the same amount of time, particularly on small tables.
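When in doubt, compare the execution plans of the two forms on your own data. A minimal sketch, assuming PostgreSQL syntax (other engines expose the same information through their own tools, e.g. EXPLAIN PLAN in Oracle or the actual execution plan in SQL Server):
-- run both forms and compare the plans and timings reported
EXPLAIN ANALYZE SELECT SUM(CASE WHEN col2 = 2 THEN col1 ELSE 0 END) FROM myTable;
EXPLAIN ANALYZE SELECT SUM(col1) FROM myTable WHERE col2 = 2;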

Related

SQL: index on "order by" column when a query also has a lot of "where" predicates

Suppose I have this sql query:
select * from my_table
where col1 = 'abc' and col2 = 'qwe' and ... --e.g. 10 predicates or more
order by my_date desc
Will the index on the my_date column alone even be used by the DB? Will it improve performance somehow?
I'm more interested in Postgres.
The PostgreSQL optimizer will use the index if it thinks that that is cheaper than fetching the rows that match the WHERE condition and sorting them.
This will probably be the case if:
there are many such rows, and sorting would be more expensive than the index scan
there are no indexes to support the WHERE condition
Without a LIMIT, the chances of using the single-column index to provide order here are pretty low. Indeed, I can't contrive a situation to do so without monkeying around with enable_sort or enable_seqscan.
Even with a LIMIT, after applying 10 equality conditions it will be pretty unusual for the expected number of rows left over to be high enough to make the index appear to be worthwhile.
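If you control the indexing, a single b-tree index whose leading columns match the equality predicates and whose last column is my_date can serve both the WHERE clause and the ORDER BY, letting PostgreSQL read the rows already in the requested order (scanning the index backwards for DESC). A sketch using just the first two predicates from the example (the index name is illustrative):
-- equality columns first, then the sort column
create index my_table_filter_date_idx on my_table (col1, col2, my_date);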

DB2 query using zero value in IN CLAUSE is causing table scan, index on column is ignored

SELECT * FROM TABLE1 WHERE COL1 in( 597966104, 597966100);
SELECT * FROM TABLE1 WHERE COL1 in( 0, 597966100)
In the above two queries, the first query uses the index created on COL1, but the second query does not. The only difference between the queries is that zero (0) is used in the IN clause of the second one. Why does the zero cause the index to be ignored? This leads to a table scan and slows down the query. Is there any solution for this problem? Any help on this issue is welcome and appreciated. The database used is DB2.
DB2 has a cost-based optimizer. It tries to figure out the best access plan, using its statistics and configuration to determine it.
In your case, the number of rows with col1 = 0 could really matter. For example, when col1 = 0 holds for 40% of your data, it could be cheaper to do the table scan.
If you want to figure out more details, explain the query and you will see how the data is accessed and how many rows the optimizer estimates for the result set.
Make sure you have correct and up-to-date statistics by running runstats for the table(s), as this will be the most important source of information for the optimizer.
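A minimal sketch of both steps, assuming DB2 for LUW (MYSCHEMA is a placeholder; RUNSTATS is a command-line/CLP command, and EXPLAIN PLAN requires the explain tables to exist):
-- refresh statistics, including value distributions, so the optimizer can judge how common 0 is
RUNSTATS ON TABLE MYSCHEMA.TABLE1 WITH DISTRIBUTION AND DETAILED INDEXES ALL;
-- capture the access plan for the slow form of the query
EXPLAIN PLAN FOR SELECT * FROM TABLE1 WHERE COL1 IN (0, 597966100);
-- then format the captured plan, for example: db2exfmt -d <dbname> -1 -o plan.txt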

Optimize Oracle SELECT on large dataset

I am new to Oracle (working on 11gR2). I have a table TABLE with something like ~10 million records in it, and this pretty simple query:
SELECT t.col1, t.col2, t.col3, t.col4, t.col5, t.col6, t.col7, t.col8, t.col9, t.col10
FROM TABLE t
WHERE t.col1 = val1
AND t.col11 = val2
AND t.col12 = val3
AND t.col13 = val4
The query is currently taking about 30s/1min.
My question is: how can I improve performance? After a lot of research, I am aware of the most classical ways to improve performance, but I have some problems:
Partitioning: not really an option; the table is used in another project and it would be too impactful. Plus, it only delays the problem, given the number of rows inserted into the table every day.
Add an index: the thing is, the columns used in the WHERE clause are not the ones returned by the query (except for one). Thus, I have not been able to find an appropriate index yet. As far as I know, setting an index on 12~13 columns does not make a lot of sense (or does it?).
Materialized views: I must say I never used them, but I understood the maintenance cost is pretty high and my table is updated quite often.
I think the best way to do this would be to add an appropriate index, but I can't find the right columns on which it should be created.
An index makes sense provided that your query results in a small percentage of all rows. You would create one index on all four columns used in the WHERE clause.
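A sketch of such an index, using the column names from the question (TABLE stands for the placeholder table name used there; the index name is illustrative):
-- one composite index covering all four WHERE-clause columns
CREATE INDEX ix_t_filter ON TABLE (col1, col11, col12, col13);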
If too many records match, then a full table scan will be done. You may be able to speed this up by having this done in parallel threads using the PARALLEL hint:
SELECT /*+parallel(t,4)*/
t.col1, t.col2, t.col3, t.col4, t.col5, t.col6, t.col7, t.col8, t.col9, t.col10
FROM TABLE t
WHERE t.col1 = val1 AND t.col11 = val2 AND t.col12 = val3 AND t.col13 = val4;
A table with 10 million records is quite a small table. You just need to create an appropriate index. Which columns to index depends on their content. For example, if you have a column that contains only "1" and "0", or "yes" and "no", you shouldn't index it. The more distinct values a column contains, the more effective an index on it is. You can also create an index on two or three (or more) columns, or a function-based index (in which case the index contains the results of your SQL function, not the column values). You can also create more than one index on a table.
In any case, if your query selects more than 20-30% of all table records, an index will not help.
You also said that the table is used by other people; in that case, you need to coordinate with them to avoid duplicate indexes.
Indexes on each of the columns referenced in the WHERE clause will help performance of a query against a table with a large number of rows, where you are seeking a small subset, even if the columns in the WHERE clause are not returned in the SELECT column list.
The downside of course is that indexes impede insert/update performance. So when loading the table with large numbers of records, you might need to disable/drop the indexes prior to loading and then re-create/enable them again afterwards.
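A sketch of that disable/rebuild pattern in Oracle (the index name is illustrative, carried over from the sketch above):
ALTER INDEX ix_t_filter UNUSABLE;   -- before the bulk load, so DML no longer maintains it
-- ... load the data ...
ALTER INDEX ix_t_filter REBUILD;    -- afterwards, rebuild it in one pass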

Indexed query along with non index conditions

The following is a query that I'm executing.
select col1, col2 from table1 where col1 in (select table2.col1 from table2) and col2 = 'ABC' ;
In table1, an index is available on col2, no index is available on
col1.
I have around 4 million records for 'ABC' in table1.
The total size of the table1 is around 50 million.
The size of table2 is less. Around 1 million. (No indexes are available on this table)
The query takes a long time to return. If I remove the "in (select table2.col1 from table2)" condition, the query behaves the way an indexed query should.
My question is,
In case we have an indexed column being used in the where clause and we include a condition for a non indexed column (specifically an in condition), is there a possibility of a performance hit? Explain plan on the query does not give any hint of non index fetch.
Also, does the order of the conditions matter?
i.e., if I put the indexed clause before the non-indexed clause, will Oracle apply the non-indexed clause only to the subset chosen?
Thanks in advance.
The order of your predicates does not matter. The optimizer determines that.
It's not as simple as "index is always better", so the optimizer tries to evaluate the "selectivity" of each predicate, and then determine the "best" order.
If one predicate is determined to result in a very small portion of the table, and an index exist, it is likely that indexed access will be used. In your case, I don't think an index will help you, unless the rows are physically sorted (on disk) by col2.
If you can share the execution plans, we could probably help you.
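For reference, a sketch of how to capture the plan in Oracle so it can be shared, using the standard EXPLAIN PLAN / DBMS_XPLAN facilities:
-- capture the plan for the query from the question
EXPLAIN PLAN FOR
select col1, col2 from table1 where col1 in (select table2.col1 from table2) and col2 = 'ABC';
-- display the captured plan
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);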

Does ISNULL or OR have better performance?

I have the SQL query:
SELECT ISNULL(t.column1, t.column2) as [result]
FROM t
I need to filter out data by [result] column. What is the best approach regarding performance from the two listed below:
WHERE ISNULL(t.column1, t.column2) = @filterValue
or:
WHERE t.column1 = @filterValue OR t.column2 = @filterValue
UPDATE: Sorry, I forgot to mention that column2 is always null if column1 is filled.
Measure, don't guess! This is something you should be doing yourself, with production-like data. We don't know the make-up of your data and that makes a big difference.
Having said that, I wouldn't do it either way. I'd create another column, column3 to store column1 if non-NULL and column2 if column1 is NULL.
Then I'd have an insert/update trigger to populate that column correctly, index it and use the screaming-banshee-speed:
select t.column3 as [result] from t
The vast majority of databases are read more often than written and it's better if this calculation is done as few times as possible (i.e., when the data changes, not every time you select it). If you want your databases to be scalable, don't use per-row functions.
It's perfectly valid to sacrifice disk space for speed and the triggers ensure that the data doesn't become inconsistent.
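If the database is a reasonably recent SQL Server, a persisted computed column can achieve the same precomputation without hand-written triggers. This is a sketch of an alternative to the trigger approach described above, not what that answer prescribes; the index name is illustrative:
-- let the engine maintain column3 itself instead of using triggers
ALTER TABLE t ADD column3 AS ISNULL(column1, column2) PERSISTED;
CREATE INDEX ix_t_column3 ON t (column3);
-- the filter then becomes a plain, indexable predicate:
-- WHERE column3 = @filterValue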
If adding another column and triggers is out of the question, I'd go for the OR solution, since it can often be split into two parallel queries by smarter DBMS engines.
An alternative, which MarkB gave but since deleted his answer so I'll have to go hunting for another good answer of his to upvote :-), is to use UNION ALL. If your DBMS isn't quite smart enough to recognise OR as a chance for parallelism, it may be smart enough to recognise UNION ALL in that context, something like:
select column1 as c from t where column1 is not NULL
union all
select column2 as c from t where column1 is NULL
But again, it depends on both your database and your data. A smart DBA would put the whole thing in a stored procedure so they could swap in a new method seamlessly should the data change its properties.
On an MSSQL table (MSSQL 2000) with 13,000,000 entries and indexes on Col1 and Col2, I get the following results:
select top 1000000 * from Table1 with(nolock) where isnull(Col1,Col2) > '0'
-- Compile-Time: 4ms
-- CPU-Time: 18265ms
-- Elapsed-Time: 24882ms = ~25s
select top 1000000 * from Table1 with(nolock) where Col1 > '0' or (Col1 is null and Col2 > '0')
-- Compile-Time: 9ms
-- CPU-Time: 7781ms
-- Elapsed-Time: 25734ms = ~26s
The measured values are subject to strong fluctuations based on the workload of the server.
The first statement needs less time to compile but takes more CPU time to execute (clustered index scan).
It's important to know that many storage engines have an optimizer that reorganizes the statement for better results and execution times. Ultimately, both statements are rewritten by the optimizer into mostly the same statement.
I think your replacement expression does not mean the same thing. Assume filterValue is 2: then ISNULL(1,2)=2 is false, but 1=2 or 2=2 is true. The expression you need looks more like:
(c1=filter) or ((c1 is null) and (c2 = filter));
There is a chance that a server can answer this from the index on c1. The first part of the solution is an index scan over c1=filter. The second part is a scan over c1 is null and then a linear search for c2=filter. I'd even say that a clustered index on (c1,c2) could work here.
OTOH, you should measure before making assumptions like this; speculation usually doesn't work in SQL unless you have intimate knowledge of the implementation. For example, I'm pretty sure the query planner already knows that ISNULL(X,Y) can be decomposed into a boolean statement, with its implications for searching, but I would not rely on that; rather, measure and then decide what to do.
I have measured the performance of both queries on SQL Server 2008.
And have got the following results:
Both approaches had almost the same Estimated Subtree Cost metric.
But the OR approach had a more accurate value for the Estimated Number of Rows metric.
So the query optimizer will build a more appropriate execution plan for the OR approach than for the ISNULL approach.