Does indexing affect only the WHERE clause?

If I have something like:
CREATE INDEX idx_myTable_field_x
ON myTable
USING btree (field_x);
SELECT COUNT(field_x), field_x FROM myTable GROUP BY field_x ORDER BY field_x;
Imagine myTable with around 500,000 rows and most of field_x values being unique.
Since I don't use any WHERE clause, will the created index have any effect at all on my query?
Edit: I'm asking because I don't see any relevant difference in query times before and after creating the index; they always take about 8 seconds (which, of course, is far too long!). Is this behaviour expected?

The index will not help here: since you are reading the whole table anyway, there is no use in going to an index first (PostgreSQL does not yet have index-only scans).
Because nearly all values in the index are unique, it wouldn't really help in this situation anyway. Index lookups (including index scans in other DBMSs) tend to be helpful for looking up a small number of rows.
There is a slight possibility that the index might be used for ordering, but I doubt it.
If you look at the output of EXPLAIN ANALYZE VERBOSE you can see if the sorting is done in memory or (due to the size of the result) is done on disk.
If the sorting is done on disk, you can speed up the query by increasing work_mem - either globally or just for your session.
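For example, a minimal sketch (the 64MB value is only illustrative; size it to your result set):
EXPLAIN ANALYZE VERBOSE
SELECT COUNT(field_x), field_x FROM myTable GROUP BY field_x ORDER BY field_x;
-- look for a line like "Sort Method: external merge  Disk: ..." in the output;
-- if the sort spills to disk, raise work_mem for the session and re-run:
SET work_mem = '64MB';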

Since field_x is the only column referenced in your query, your index covers the query and should help you avoid lookups into actual rows of myTable.
EDIT: As indicated in the comment discussion below, while this answer is valid for most RDBMS implementations, it does not apply to PostgreSQL.

The index should be used. If you ever want to see how your indexes are being used (or not), the execution plan of the query is a great place to see what the database has decided to do. In your case you should execute something like:
explain SELECT COUNT(field_x), field_x FROM myTable GROUP BY field_x ORDER BY field_x;
More information about what the output means can be found in the Postgres docs: http://www.postgresql.org/docs/8.4/static/sql-explain.html
There is also: http://wiki.postgresql.org/wiki/Image:Explaining_EXPLAIN.pdf which is a bit more in depth.

Related

Increase PostgreSQL query performance using indices

I have a Rails app on a Postgres database which relies heavily on queries like this:
SELECT DISTINCT client_id FROM orders WHERE orders.total>100
I need, essentially, the ids of all the clients who have orders which meet a certain condition. I only need the id, so I figured this is way faster than using joins.
Would I benefit from adding an index to the column "total"? I don't mind insert speed, I just need the query to run extremely fast.
I would expect the following multicolumn index to be fastest:
CREATE INDEX orders_foo_idx ON orders (total DESC, client_id);
PostgreSQL 9.2 could benefit even more. With its index-only scans feature, it can serve the query without hitting the table under favorable circumstances: no writes since the last VACUUM.
DESC or ASC hardly matters in this case. A B-tree index can be searched in both directions almost equally efficiently.
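As a rough sketch of how you could verify this on 9.2 or later (the VACUUM is what updates the visibility map that index-only scans depend on):
VACUUM orders;
EXPLAIN ANALYZE
SELECT DISTINCT client_id FROM orders WHERE orders.total > 100;
-- with the multicolumn index in place and a mostly-static table,
-- the plan should show "Index Only Scan using orders_foo_idx"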
Absolutely. With no index on the total column, this query will require a table scan. With an index on the total column, it will require an index seek and key lookup. This will provide your query with huge performance gains as the size of the table grows.
> I only need the id, so I figured this is way faster than using joins.
True, though I'm not sure why you would consider using joins in the first place in this case.
As cmotley said, you're going to require an index on the total column for this query. However, optimal performance is going to depend on exactly which queries you're running. For example, for this query, with this table structure, the fastest you're going to get is to create an index like so:
CREATE INDEX IX_OrderTotals ON orders (total, client_id)
By including client_id in the index, you create what is called a covering index, so the database engine won't have to look up the underlying table row in order to fetch your data.

Tuning table select SQL having a RAW column in Oracle 10g

I have a table with several columns and a unique RAW column. I created a unique index on the RAW column.
My query selects all columns from the table (6 million rows).
When I look at the cost of the query, it's too high (51K), and it's still using an INDEX FULL scan. The query does not have any filter conditions; it's a plain select * from the table.
Please suggest how I can tune this query.
Thanks in advance.
Why are you hinting it to use the index if you're retrieving all columns from all rows? The index would only help if you were filtering on the indexed column. If you were only retrieving the indexed column then an INDEX_FFS hint might help. But if you have to go back to the data for any non-indexed columns then using the index at all becomes counterproductive beyond a certain proportion of returned data as you're having to access both the index data blocks and the table data blocks repeatedly.
So, your query is:
select /*+ index (rawdata idx_test) */
rawdata.*
from v_wis_cds_cp_rawdata_test rawdata
and you want to know why Oracle is choosing an INDEX FULL scan?
Well, as Alex said, the reason is the "index (rawdata idx_test)" hint. This is a directive that tells the Oracle optimizer, "when you access rawdata, use an index access on the idx_test index", which means that's what Oracle will do if at all possible - even if that's not the best plan.
Hints don't make queries faster automatically. They are a way of telling the optimizer what not to do.
I've seen queries like this before - sometimes a hint like this is added in order to return the rows in sorted order without actually doing a sort. However, if that was the requirement, I'd strongly recommend adding an ORDER BY clause anyway, because if the hint becomes invalid for some reason (e.g. the index gets dropped or renamed), the sorting would no longer happen and no error would be reported.
If you don't need the rows returned in any particular order, I suggest you remove the hint and see if the performance improves.
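For instance, if the hint was only there to get sorted output, a sketch of the safer rewrite (raw_col stands in for whatever the indexed RAW column is actually named):
select rawdata.*
from v_wis_cds_cp_rawdata_test rawdata
order by rawdata.raw_col;  -- hint removed; the ORDER BY guarantees the ordering even if the index changes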

Postgresql 8.3, simple query not using index

I have two tables:
table1 (about 200000 records)
number varchar(8)
table2 (about 2000000 records)
number varchar(8)
Fields 'number' in both tables have standard indexes.
For each record in table1 there are about 10 assigned records in table2.
I execute query:
explain select table1.number from table1, table2 where table1.number = table2.number;
Query plan shows that indexes won't be used, Seq Scans all over ;)
But if I reduce the number of records in table1 to ~2000, the query plan starts showing that the index will be used.
Maybe somebody can tell me why PostgreSQL behaves this way?
Sequential scans are normal (and optimal) for queries with very low selectivity - that is, for queries that traverse whole tables.
When you deleted most rows from table1, it no longer covered all possible distinct values from table2 - that's why the index scan came into play.
For starters, I'd recommend trying this query:
select * from pg_stats where tablename in ('table1','table2');
That's the information that PostgreSQL uses to build a query plan.
The planner itself is quite complicated - consult the docs (mentioned by Jonathan) and sources [http://doxygen.postgresql.org/ -> src/backend/optimizer ] if you are so curious.
Yes, the PostgreSQL docs can tell you!
Here are some highlights:
When indexes are not used, it can be useful for testing to force their use. There are run-time parameters that can turn off various plan types (see Section 18.6.1). For instance, turning off sequential scans (enable_seqscan) and nested-loop joins (enable_nestloop), which are the most basic plans, will force the system to use a different plan. If the system still chooses a sequential scan or nested-loop join then there is probably a more fundamental reason why the index is not being used; for example, the query condition does not match the index. (What kind of query can use what kind of index is explained in the previous sections.)
If forcing index usage does use the index, then there are two possibilities: Either the system is right and using the index is indeed not appropriate, or the cost estimates of the query plans are not reflecting reality. So you should time your query with and without indexes. The EXPLAIN ANALYZE command can be useful here.
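Applied to the query from the question, a testing sketch along those lines might look like this (session-local settings only):
SET enable_seqscan = off;  -- discourage sequential scans for this session
EXPLAIN ANALYZE
SELECT table1.number FROM table1, table2 WHERE table1.number = table2.number;
SET enable_seqscan = on;   -- restore the default
-- compare the timings: if the forced index plan is slower, the planner was right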
It could depend on the way your indexes were created. If "number" is actually a number, you should think about changing the column type to bigint. Again, not 100% sure, but I think indexing on character columns works differently than on numeric fields... I could, however, be talking out of my butt.

How To Reduce Number of Rows Scanned by MySQL

I have a query that pulls 5 records from a table of ~10,000. The ORDER BY clause isn't covered by an index, but the WHERE clause is.
The query scans about 7,700 rows to pull these 5 results, and that seems like a bit much. I understand, though, that the complexity of the ordering criteria complicates matters. How, if at all, can I reduce the number of rows scanned?
The query looks like this:
SELECT *
FROM `mediatypes_article`
WHERE `mediatypes_article`.`is_published` = 1
ORDER BY `mediatypes_article`.`published_date` DESC, `mediatypes_article`.`ordering` ASC, `mediatypes_article`.`id` DESC LIMIT 5;
mediatypes_article.is_published is indexed.
How many rows match "is_published = 1"? I assume it's something like... 7,700 rows?
Either way you take it, the full result set matching the WHERE clause has to be fetched and completely sorted by all the ordering criteria. Only then is the sorted list of published articles truncated to the first 5 results.
Maybe it will help you to look at the MySQL documentation article about ORDER BY optimization, but first you should try adding indexes on the columns listed in the ORDER BY clause. It is very likely that this will speed things up greatly.
Executing OPTIMIZE TABLE may not help, but it doesn't hurt either.
When you have ordering, the whole B-tree has to be traversed to figure out the proper order.
10,000 records is not that big an amount to worry about ordering. Remember, with proper indexing, the RDBMS doesn't fetch the whole record to figure out the order. It keeps the indexed columns in B-tree pages saved on disk, and with a few page reads the whole B-tree is loaded into memory and can be traversed.
In MySQL you can make an index that includes multiple columns. I think what you probably need to do is make an index that includes is_published and published_date. You should look at the output from the EXPLAIN statement to make sure it's doing things the smart way, and add an index if it is not.
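As a sketch (note that the mixed ASC/DESC directions in the ORDER BY mean MySQL versions before 8.0 cannot satisfy the whole sort from one index, so a filesort of the matching rows may remain):
CREATE INDEX idx_published_date ON mediatypes_article (is_published, published_date);
EXPLAIN SELECT * FROM mediatypes_article
WHERE is_published = 1
ORDER BY published_date DESC, ordering ASC, id DESC
LIMIT 5;
-- check the "rows" and "Extra" columns: the index should narrow the scan,
-- even if "Using filesort" still appears because of the mixed directions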

Do indexes work with an "IN" clause?

If I have a query like:
Select EmployeeId
From Employee
Where EmployeeTypeId IN (1,2,3)
and I have an index on the EmployeeTypeId field, does SQL server still use that index?
Yeah, that's right. If your Employee table has 10,000 records, and only 5 records have EmployeeTypeId in (1,2,3), then it will most likely use the index to fetch the records. However, if it finds that 9,000 records have the EmployeeTypeId in (1,2,3), then it would most likely just do a table scan to get the corresponding EmployeeIds, as it's faster just to run through the whole table than to go to each branch of the index tree and look at the records individually.
SQL Server does a lot of stuff to try and optimize how the queries run. However, sometimes it doesn't get the right answer. If you know that SQL Server isn't using the index, by looking at the execution plan in query analyzer, you can tell the query engine to use a specific index with the following change to your query.
SELECT EmployeeId FROM Employee WITH (INDEX(Index_EmployeeTypeId)) WHERE EmployeeTypeId IN (1,2,3)
Assuming the index you have on the EmployeeTypeId field is named Index_EmployeeTypeId.
Usually it would, unless the IN clause covers too much of the table, and then it will do a table scan. Best way to find out in your specific case would be to run it in the query analyzer, and check out the execution plan.
Unless technology has improved in ways I can't imagine of late, the "IN" query shown will produce a result that's effectively the OR-ing of three result sets, one for each of the values in the "IN" list. The IN clause becomes an equality condition for each item in the list and will use an index if appropriate. In the case of unique IDs and a large enough table, I'd expect the optimiser to use an index.
If the items in the list were to be non-unique however, and I guess in the example that a "TypeId" is a foreign key, then I'm more interested in the distribution. I'm wondering if the optimiser will check the stats for each value in the list? Say it checks the first value and finds it's in 20% of the rows (of a large enough table to matter). It'll probably table scan. But will the same query plan be used for the other two, even if they're unique?
It's probably moot - something like an Employee table is likely to be small enough that it will stay cached in memory and you probably wouldn't notice a difference between that and indexed retrieval anyway.
And lastly, while I'm preaching, beware the query in the IN clause: it's often a quick way to get something working and (for me at least) can be a good way to express the requirement, but it's almost always better restated as a join. Your optimiser may be smart enough to spot this, but then again it may not. If you don't currently performance-check against production data volumes, do so - in these days of cost-based optimisation you can't be certain of the query plan until you have a full load and representative statistics. If you can't, then be prepared for surprises in production...
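To illustrate the restatement (EmployeeType here is a hypothetical lookup table; with a unique key on its Id column the two forms return the same rows):
-- IN with a subquery
SELECT EmployeeId FROM Employee
WHERE EmployeeTypeId IN (SELECT Id FROM EmployeeType WHERE IsActive = 1)
-- the same requirement restated as a join
SELECT e.EmployeeId
FROM Employee e
JOIN EmployeeType t ON t.Id = e.EmployeeTypeId
WHERE t.IsActive = 1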
So there's the potential for an "IN" clause to run a table scan, but the optimizer will try to work out the best way to deal with it?
Whether an index is used doesn't depend so much on the type of query as on the type and distribution of the data in the table(s), how up to date your table statistics are, and the actual datatype of the column.
The other posters are correct that an index will be used over a table scan if:
The query won't access more than a certain percentage of the rows indexed (say ~10%, though this varies between DBMSs).
Alternatively, if there are a lot of rows, but relatively few unique values in the column, it also may be faster to do a table scan.
The other variable that might not be that obvious is making sure that the datatypes of the values being compared are the same. In PostgreSQL, I don't think that indexes will be used if you're filtering on a float but your column is made up of ints. There are also some operators that don't support index use (again, in PostgreSQL, the ILIKE operator is like this).
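A small sketch of that datatype point (hypothetical table; the numeric literal forces a cast on the indexed integer column, so the planner can skip the index):
CREATE TABLE items (quantity integer);
CREATE INDEX items_quantity_idx ON items (quantity);
EXPLAIN SELECT * FROM items WHERE quantity = 5;    -- can use the index
EXPLAIN SELECT * FROM items WHERE quantity = 5.0;  -- column is cast to numeric; index may be skipped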
As noted though, always check the query analyser when in doubt and your DBMS's documentation is your friend.
#Mike: Thanks for the detailed analysis. There are definitely some interesting points you make there. The example I posted is somewhat trivial, but the basis of the question came from using NHibernate.
With NHibernate, you can write a clause like this:
int[] employeeIds = new int[] { 1, 5, 23463, 32523 };
var employees = NHibernateSession.CreateCriteria(typeof(Employee))
    .Add(Restrictions.InG("EmployeeId", employeeIds))
    .List<Employee>();  // executes the query
NHibernate then generates a query which looks like
select * from employee where employeeid in (1, 5, 23463, 32523)
So as you and others have pointed out, it looks like there are going to be times where an index will be used or a table scan will happen, but you can't really determine that until runtime.
Select EmployeeId From Employee USE INDEX (EmployeeTypeId)
This query forces a search using the index you created (note that USE INDEX is MySQL's index-hint syntax; SQL Server uses the WITH (INDEX(...)) form shown above). It works for me; please give it a try.