Say, I have a table my_table with field kind:string and an index on this field.
I've noticed that Postgres builds two different query plans for the queries:
SELECT * FROM my_table
WHERE kind = 'kind1' OR kind IS NULL;
and
SELECT * FROM my_table
WHERE kind = 'kind1';
The first one does not use the index, whereas the second one does. Why?
I know there are a lot of conditions that determine whether an index is used or not, and I've read a lot about query plans, but this case is still not clear to me.
Abelisto explains that the two versions of the query are not the same. SQL engines (in general) can do a poor job of using indexes for ORs. It is possible that there are so many NULL values that Postgres simply does not think an index is useful when comparing to NULLs. That depends on the data.
You can try rewriting the query as:
SELECT *
FROM my_table
WHERE kind = 'kind1'
UNION ALL
SELECT *
FROM my_table
WHERE kind IS NULL;
Postgres might choose to use indexes on each subquery, if they are appropriate for the data.
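To see whether the rewrite actually helps, a minimal sketch of comparing the plans (using the my_table and kind names from the question):
EXPLAIN ANALYZE
SELECT * FROM my_table
WHERE kind = 'kind1' OR kind IS NULL;
EXPLAIN ANALYZE
SELECT * FROM my_table WHERE kind = 'kind1'
UNION ALL
SELECT * FROM my_table WHERE kind IS NULL;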
Related
I need to write a query to fetch some data from a table and need to use the concat method. I ended up preparing two different queries, and I'm not sure which one will have better performance when we have huge amounts of data.
Please help me understand which query will perform better, and why.
I need to concat twice, once for displaying and once for the where condition:
select
id, concat(boolean_value, long_value, string_value, double_value) as Value
from
table
where
concat(boolean_value, long_value, string_value, double_value) = 'XX'
Or write the above query as a subquery and add the where condition:
Select *
from
(select
id, concat(boolean_value, long_value, string_value, double_value) as Value
from
table) as output
where
Value = 'XX'
Note: this is a sample query; the actual query will have multiple joins and will need to concat columns from multiple tables.
A SQL query describes the result set being produced, not the steps used to compute it. You seem to be asking whether there is common subexpression elimination -- that is, whether the concat() will be performed exactly once.
You have a better chance of this with the subquery than with the version that repeats the expression. But SQL Server could be smart enough to evaluate it only once in either case.
If you want to guarantee single evaluation, then I think cross apply does that:
select t.id, v.value
from table t cross apply
     (values (concat(boolean_value, long_value, string_value, double_value))
     ) v(value)
where v.value = 'XX';
I would, however, question why you are comparing the result of the concat() rather than the base columns. Comparing the base columns would allow the optimizer to take advantage of indexes, partitions, and statistics.
You can examine the execution plans of both queries to see exactly what SQL Server has in mind. I would expect the first version to be preferable because, unlike the second version, it does not force SQL Server to materialize an intermediate table. As for whether SQL Server actually has to evaluate CONCAT twice in the first version, it may not even matter unless the expression is expensive to compute.
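If you want to measure rather than guess, one hedged option (assuming you run both versions in SQL Server Management Studio) is to enable runtime statistics and compare the I/O and CPU numbers:
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
-- run version 1 (expression repeated in SELECT and WHERE), then version 2 (subquery with the Value alias)
SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;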
I want to know what is faster, assuming I have the following queries and they retrieve the same data
select * from tableA a inner join tableB b on a.id = b.id where b.columnX = value
or
select * from tableA a inner join (select * from tableB where columnX = value) b on a.id = b.id
I think it makes sense to reduce the dataset from tableB in advance, but I can't find anything to back up my perception.
In a database such as Teradata, the two should have exactly the same performance characteristics.
SQL is not a procedural language. A SQL query describes the result set. It does not specify the sequence of actions.
SQL engines process a query in three steps:
Parse the query.
Optimize the parsed query.
Execute the optimized query.
The second step gives the engine a lot of flexibility. And most query engines will be quite intelligent about ignoring subqueries, using indexes and partitions based on where clauses, and so on.
Most SQL dialects compile your query into an execution plan. Teradata and most SQL systems show the expected execution plan with the "explain" command. Teradata has a visual explain too, which is easy to learn from.
Whether either method is advantageous depends on the data volumes and key types in each table.
Most SQL compilers will work this out correctly using the current table statistics (data size and spread).
In some SQL systems your second query would be worse, as it may force a full temporary table to be built with ALL the fields of tableB.
It should be (not that I recommend this query style at all)
select * from tableA a inner join (select id from tableB where columnX = value) b on a.id = b.id
In most cases, don't worry about this unless you have a specific performance issue, and then use the explain commands to work out why.
A better way in general is to use common table expressions (CTEs) to break the problem down. This leads to better queries that can be tested and maintained over the long term.
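For illustration, a hedged CTE version of the same query (table and column names taken from the question; value is still a placeholder):
with filtered_b as (
    select id
    from tableB
    where columnX = value
)
select a.*
from tableA a
inner join filtered_b b on a.id = b.id;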
Whenever you come across such scenarios and wonder which query would yield results faster in Teradata, use the EXPLAIN plan, which shows exactly how the PE (parsing engine) is going to retrieve records. If you are using Teradata SQL Assistant, you can select the query and press F6.
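For example, a minimal sketch of checking the plan (query copied from the question):
EXPLAIN
select * from tableA a inner join tableB b on a.id = b.id where b.columnX = value;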
The DBMS decides the access path that will be used to resolve the query; you can't choose it directly. But you can do certain things, such as declaring indexes, so that the DBMS takes those indexes into consideration when deciding which access path to use, and then you will get better performance.
For instance, in this example you are filtering tableB by columnX. Normally, if there are no indexes declared on tableB, the DBMS will have to do a full table scan to determine which rows fulfill that condition. But suppose you declare an index on tableB by columnX; in that case the DBMS will probably consider that index and pick an access path that makes use of it, getting much better performance than a full table scan, especially if the table is big.
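As a hedged sketch (the index name ix_tableB_columnX is hypothetical; most dialects use the first form, while Teradata's secondary index syntax puts the column list before ON, as in the second):
CREATE INDEX ix_tableB_columnX ON tableB (columnX);
CREATE INDEX ix_tableB_columnX (columnX) ON tableB;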
I have a query of this form that will on average have ~100 IN-clause elements, and in rare cases more than 1000 elements. If there are more than 1000 elements, we will chunk the IN clause down to 1000 (an Oracle maximum).
The SQL is in the form of
SELECT * FROM tab WHERE PrimaryKeyID IN (1,2,3,4,5,...)
The tables I am selecting from are huge and will contain millions more rows than what is in my in clause. My concern is that the optimizer may elect to do a table scan (our database does not have up-to-date statistics - yeah - I know ...)
Is there a hint I can pass to force the use of the primary key - WITHOUT knowing the index name of the primary key, perhaps something like ... /*+ DO_NOT_TABLE_SCAN */?
Are there any creative approaches to pulling back the data such that
We perform the least number of round-trips
We read the least number of blocks (at the logical IO level?)
Will this be faster?
SELECT * FROM tab WHERE PrimaryKeyID = 1
UNION
SELECT * FROM tab WHERE PrimaryKeyID = 2
UNION
SELECT * FROM tab WHERE PrimaryKeyID = 3
UNION ....
If the statistics on your table are accurate, it should be very unlikely that the optimizer would choose to do a table scan rather than using the primary key index when you only have 1000 hard-coded elements in the WHERE clause. The best approach would be to gather (or set) accurate statistics on your objects since that should cause good things to happen automatically rather than trying to do a lot of gymnastics in order to work around incorrect statistics.
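For instance, a hedged sketch of gathering statistics with DBMS_STATS (the schema name is a placeholder):
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(
    ownname => 'YOUR_SCHEMA',  -- placeholder schema name
    tabname => 'TAB',
    cascade => TRUE            -- also gather statistics on the indexes
  );
END;
/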
If we assume that the statistics are inaccurate to the degree that the optimizer would be led to believe that a table scan would be more efficient than using the primary key index, you could potentially add a DYNAMIC_SAMPLING hint that would force the optimizer to gather more accurate statistics before optimizing the statement, or a CARDINALITY hint to override the optimizer's default cardinality estimate. Neither of those would require knowing anything about the available indexes; it would just require knowing the table alias (or name if there is no alias). DYNAMIC_SAMPLING would be the safer, more robust approach, but it would add time to the parsing step.
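A hedged sketch of both hints, assuming the tab table from the question with an alias t (the sampling level and the cardinality figure are illustrative):
SELECT /*+ DYNAMIC_SAMPLING(t 4) */ *
FROM tab t
WHERE PrimaryKeyID IN (1, 2, 3, 4, 5);
SELECT /*+ CARDINALITY(t 1000) */ *
FROM tab t
WHERE PrimaryKeyID IN (1, 2, 3, 4, 5);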
If you are building up a SQL statement with a variable number of hard-coded parameters in an IN clause, you're likely going to be creating performance problems for yourself by flooding your shared pool with non-sharable SQL and forcing the database to spend a lot of time hard parsing each variant separately. It would be much more efficient if you created a single sharable SQL statement that could be parsed once. Depending on where your IN clause values are coming from, that might look something like
SELECT *
FROM table_name
WHERE primary_key IN (SELECT primary_key
FROM global_temporary_table);
or
SELECT *
FROM table_name
WHERE primary_key IN (SELECT primary_key
FROM TABLE( nested_table ));
or
SELECT *
FROM table_name
WHERE primary_key IN (SELECT primary_key
FROM some_other_source);
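For completeness, a minimal sketch of setting up the global temporary table variant (the table name gtt_pk_filter and the NUMBER key type are assumptions):
CREATE GLOBAL TEMPORARY TABLE gtt_pk_filter (
    primary_key NUMBER
) ON COMMIT DELETE ROWS;
-- at runtime, insert the ID list (ideally with bind variables), then run the single sharable statement
INSERT INTO gtt_pk_filter (primary_key) VALUES (:id);
SELECT *
FROM tab
WHERE PrimaryKeyID IN (SELECT primary_key FROM gtt_pk_filter);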
If you got yourself down to a single sharable SQL statement, then in addition to avoiding the cost of constantly re-parsing the statement, you'd have a number of options for forcing a particular plan that don't involve modifying the SQL statement. Different versions of Oracle have different options for plan stability-- there are stored outlines, SQL plan management, and SQL profiles among other technologies depending on your release. You can use these to force particular plans for particular SQL statements. If you keep generating new SQL statements that have to be re-parsed, however, it becomes very difficult to use these technologies.
Do you need to create an index for fields of group by fields in an Oracle database?
For example:
select *
from some_table
where field_one is not null and field_two = ?
group by field_three, field_four, field_five
I was testing the indexes I created for the above and the only relevant index for this query is an index created for field_two. Other single-field or composite indexes created on any of the other fields will not be used for the above query. Does this sound correct?
It could be correct, but that would depend on how much data you have. Typically I would create an index for the columns I was using in a GROUP BY, but in your case the optimizer may have decided that, after using the field_two index, there wouldn't be enough data returned to justify using another index for the GROUP BY.
No, this can be incorrect.
If you have a large table, Oracle can prefer deriving the fields from the indexes rather than from the table, even if there is no single index that covers all values.
In the latest article on my blog, NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: Oracle, there is a query for which Oracle does not use a full table scan but rather joins two indexes to get the column values:
SELECT l.id, l.value
FROM t_left l
WHERE NOT EXISTS
(
SELECT value
FROM t_right r
WHERE r.value = l.value
)
The plan is:
SELECT STATEMENT
HASH JOIN ANTI
VIEW , 20090917_anti.index$_join$_001
HASH JOIN
INDEX FAST FULL SCAN, 20090917_anti.PK_LEFT_ID
INDEX FAST FULL SCAN, 20090917_anti.IX_LEFT_VALUE
INDEX FAST FULL SCAN, 20090917_anti.IX_RIGHT_VALUE
As you can see, there is no TABLE SCAN on t_left here.
Instead, Oracle takes the indexes on id and value, joins them on rowid and gets the (id, value) pairs from the join result.
Now, to your query:
SELECT *
FROM some_table
WHERE field_one is not null and field_two = ?
GROUP BY
field_three, field_four, field_five
First, it will not compile, since you are selecting * from a table with a GROUP BY clause.
You need to replace * with expressions based on the grouping columns and aggregates of the non-grouping columns.
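For example, a hedged rewrite that would compile (COUNT(*) is just an illustrative aggregate):
SELECT field_three, field_four, field_five, COUNT(*) AS cnt
FROM some_table
WHERE field_one IS NOT NULL
  AND field_two = ?
GROUP BY field_three, field_four, field_five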
You will most probably benefit from the following index:
CREATE INDEX ix_sometable_23451 ON some_table (field_two, field_three, field_four, field_five, field_one)
, since it will contain everything needed for filtering on field_two, sorting on field_three, field_four, field_five (useful for the GROUP BY), and checking that field_one is NOT NULL.
Do you need to create an index for fields of group by fields in an Oracle database?
No. You don't need to, in the sense that a query will run irrespective of whether any indexes exist or not. Indexes are provided to improve query performance.
It can, however, help; but I'd hesitate to add an index just to help one query, without thinking about the possible impact of the new index on the database.
...the only relevant index for this query is an index created for field_two. Other single-field or composite indexes created on any of the other fields will not be used for the above query. Does this sound correct?
Not always. Often a GROUP BY will require Oracle to perform a sort (but not always); and you can eliminate the sort operation by providing a suitable index on the column(s) to be sorted.
Whether you actually need to worry about the GROUP BY performance, however, is an important question for you to think about.
Suppose I want to query a table based on multiple WHERE clauses.
Would either of these statements be faster than the other?
SELECT *
FROM table
WHERE (line_type='section_intro' OR line_type='question')
AND (line_order BETWEEN 0 AND 12)
ORDER BY line_order;
...or:
SELECT *
FROM table
WHERE (line_order BETWEEN 0 AND 12)
AND (line_type='section_intro' OR line_type='question')
ORDER BY line_order;
I guess what it would come down to is whether the first one would select more than 12 records, and then pare down from there.
No, the order does not matter. The query optimizer is going to estimate all conditions separately and decide on the best order based on which indexes are applicable, the size of the targeted selection, and so on.
It depends on your indexes. If you have a multi-column index on (line_type, line_order), the first query is faster. If you have an index on (line_order, line_type), the second one is faster. This is because, with multi-column indexes, MySQL can only do the comparisons in index-column order. Otherwise, there is no difference.
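For reference, a hedged sketch of the two composite indexes mentioned above (the index names are hypothetical; backticks are needed because the sample table is literally named table):
CREATE INDEX idx_type_order ON `table` (line_type, line_order);
CREATE INDEX idx_order_type ON `table` (line_order, line_type);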