PostgreSQL 8.3, simple query not using index - sql

I have two tables:
table1 (about 200000 records)
number varchar(8)
table2 (about 2000000 records)
number varchar(8)
The 'number' fields in both tables have standard indexes.
For each record in table1 there are about 10 matching records in table2.
I execute this query:
explain select table1.number from table1, table2 where table1.number = table2.number;
The query plan shows that the indexes won't be used - Seq Scans all over ;)
But if I reduce the number of records in table1 to ~2000, the query plan starts showing that the index will be used.
Can somebody tell me why PostgreSQL behaves this way?

Sequential scans are normal (and optimal) for queries with very low selectivity - that is, for queries that traverse whole tables.
When you deleted most rows from table1, it no longer covered all possible distinct values from table2 - that's why an index scan came into use.
For starters, I'd recommend trying this query:
select * from pg_stats where tablename in ('table1','table2');
That's the information that PostgreSQL uses to build a query plan.
The planner itself is quite complicated - consult the docs (mentioned by Jonathan) and the sources [http://doxygen.postgresql.org/ -> src/backend/optimizer] if you are curious.

Yes, the PostgreSQL docs can tell you!
Here are some highlights:
When indexes are not used, it can be useful for testing to force their use. There are run-time parameters that can turn off various plan types (see Section 18.6.1). For instance, turning off sequential scans (enable_seqscan) and nested-loop joins (enable_nestloop), which are the most basic plans, will force the system to use a different plan. If the system still chooses a sequential scan or nested-loop join then there is probably a more fundamental reason why the index is not being used; for example, the query condition does not match the index. (What kind of query can use what kind of index is explained in the previous sections.)
If forcing index usage does use the index, then there are two possibilities: either the system is right and using the index is indeed not appropriate, or the cost estimates of the query plans are not reflecting reality. So you should time your query with and without indexes. The EXPLAIN ANALYZE command can be useful here.
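For example, here is a minimal sketch of that test against the tables from the question (the setting is changed for the current session only, and should be turned back on afterwards):
-- Temporarily discourage sequential scans, then compare plans and timings.
SET enable_seqscan = off;
EXPLAIN ANALYZE
SELECT table1.number
FROM table1, table2
WHERE table1.number = table2.number;
SET enable_seqscan = on;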

It could depend on the way your indexes were created. If "number" is actually a number, you should think about changing the column type to bigint. Again, not 100% sure, but I think indexing on character columns works differently than on numeric fields... I could, however, be talking out of my butt.

Related

SQL: Can a WHERE clause increase a SELECT DISTINCT query's speed?

So here's the specific situation: I have primary unique indexed keys set on each entry in the database, but each row has a secondID referring to an attribute of the entry, and as such, the secondIDs are not unique. There is also another attribute of these rows, let's call it isTitle, which is NULL by default, but each group of entries with the same secondID has at least one entry with an isTitle value of 1.
Considering the conditions above, would a WHERE clause increase the processing speed of the query or not? See the following:
SELECT DISTINCT secondID FROM table;
vs.
SELECT DISTINCT secondID FROM table WHERE isTitle = 1;
EDIT:
The first query, without the WHERE clause, is faster, but could someone explain to me why? Algorithmically the process should be faster with only one extra 'if' in the loop, no?
In general, to benchmark query performance, you usually use commands that give you the execution plan of the query they receive as input (every small step the engine performs to resolve your request).
You did not mention your database engine (e.g. PostgreSQL, SQL Server, MySQL), but in PostgreSQL, for example, the query is the following:
EXPLAIN SELECT DISTINCT secondID FROM table WHERE isTitle = 1;
Going back to your question: since isTitle is not indexed, I think the first thing the engine will do is a full scan of the table to check that attribute, and only then perform the SELECT. Hence, in my opinion, the first query:
SELECT DISTINCT secondID FROM table;
will be faster.
If you want to optimize it, you can create an index on the isTitle column. In that scenario, the query with the WHERE clause will become faster.
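As a hedged sketch of what that might look like (the question just calls the table 'table', so 'mytable' below is a placeholder):
-- Hypothetical index on the filtered column.
CREATE INDEX idx_mytable_istitle ON mytable (isTitle);
-- In PostgreSQL you could also consider a partial index, since only isTitle = 1 is queried:
-- CREATE INDEX idx_mytable_title_rows ON mytable (secondID) WHERE isTitle = 1;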
This is a very hard question to answer, particularly without specifying the database. Here are three important considerations:
Will the database engine use the index on secondID for select distinct? Any decent database optimizer should, but that doesn't mean that all do.
How wide is the table relative to the index? That is, is scanning the index really that much faster than scanning the table?
What is the ratio of isTitle = 1 to all rows with the same value of secondId?
For the first query, there are essentially two ways to process the query:
Scan the index, taking each unique value as it comes.
Scan the table, sort or hash the table, and choose the unique values.
If it is not obvious, (1) is much faster than (2), except perhaps in trivial cases where there are a small number of rows.
For the second query, the only real option is:
Scan the table, filter out the non-matching values, sort or hash the table, and choose the unique values.
The key issues here are how much data needs to be scanned and how much is filtered out. It is even possible -- if you had, say, zillions of rows per secondID, no additional columns, and a small number of distinct values -- that this might be comparable to or slightly faster than (1) above. There is a little overhead to scanning an index, and sorting a small amount of data is often quite fast.
And, this method is almost certainly faster than (2).
As mentioned in the comments, you should test the queries on your system with your data (use a reasonable amount of data!). Or, update the table statistics and learn to read execution plans.
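For instance, a minimal sketch in PostgreSQL syntax (the question's engine is unspecified, and 'mytable' stands in for the real table name):
-- Refresh planner statistics, then compare the two execution plans and timings.
ANALYZE mytable;
EXPLAIN ANALYZE SELECT DISTINCT secondID FROM mytable;
EXPLAIN ANALYZE SELECT DISTINCT secondID FROM mytable WHERE isTitle = 1;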

Does indexing affect only the WHERE clause?

If I have something like:
CREATE INDEX idx_myTable_field_x
ON myTable
USING btree (field_x);
SELECT COUNT(field_x), field_x FROM myTable GROUP BY field_x ORDER BY field_x;
Imagine myTable with around 500,000 rows and most of the field_x values being unique.
Since I don't use any WHERE clause, will the created index have any effect at all in my query?
Edit: I'm asking this question because I don't see any relevant difference in query times before and after creating the index; they always take about 8 seconds (which, of course, is too much time!). Is this behaviour expected?
The index will not help here: since you are reading the whole table anyway, there is no point in going to an index first (PostgreSQL does not yet have index-only scans).
Because nearly all values in the index are unique, it wouldn't really help in this situation anyway. Index lookups (including index scans in other DBMSs) tend to be really helpful only for looking up a small number of rows.
There is a slight possibility that the index might be used for ordering but I doubt that.
If you look at the output of EXPLAIN ANALYZE VERBOSE you can see if the sorting is done in memory or (due to the size of the result) is done on disk.
If sorting is done on disk, you can speed up the query by increasing the work_mem - either globally or just for your session.
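A minimal sketch of the session-level variant (the 64MB value is only an illustration, not a recommendation):
-- Raise the sort memory for this session only, then re-check where the sort happens.
SET work_mem = '64MB';
EXPLAIN ANALYZE VERBOSE
SELECT COUNT(field_x), field_x FROM myTable GROUP BY field_x ORDER BY field_x;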
Since field_x is the only column referenced in your query, your index covers the query and should help you avoid lookups into actual rows of myTable.
EDIT: As indicated in the comment discussion below, while this answer is valid for most RDBMS implementations, it does not apply to PostgreSQL.
The index should be used. If you ever want to see how your indexes are being used (or not), the execution plan of the query is a great place to see what the database has decided to do. In your case you should execute something like:
explain SELECT COUNT(field_x), field_x FROM myTable GROUP BY field_x ORDER BY field_x;
More information about what the output means can be found in the Postgres docs: http://www.postgresql.org/docs/8.4/static/sql-explain.html
There is also http://wiki.postgresql.org/wiki/Image:Explaining_EXPLAIN.pdf which is a bit more in-depth.

How to optimize a SQL Server table for faster response?

I found that a table has 50 thousand records and it takes one minute to fetch data from the SQL Server table just by issuing a SQL query. There is one primary key, which means a clustered index already exists. I just do not understand why it takes one minute. Besides an index, what are the ways to optimize a table to get the data faster? In this situation, what do I need to do for a faster response? Also tell me how to always write optimized SQL. Please tell me all the steps in detail for optimization.
thanks.
The fastest way to optimize indexes in a table is to use the SQL Server Tuning Advisor. Take a look here: http://www.youtube.com/watch?v=gjT8wL92mqE
Select only the columns you need, rather than SELECT *. If your table has some large columns, e.g. OLE types or other binary data (maybe used for storing images etc.), then you may be transferring vastly more data off disk and over the network than you need.
As others have said, an index is no help to you when you are selecting all rows (no WHERE clause). Using an index would be slower in such cases because of the index read and table lookup for each row, versus a full table scan.
If you are running select * from employee (as per the question comment) then no amount of indexing will help you. It's an "every column for every row" query: there is no magic for this.
Adding a WHERE clause usually won't help for a SELECT * query either.
What you can check is index and statistics maintenance. Do you do any? Here's a Google search
Or change how you use the data...
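As a hedged sketch of what routine maintenance might look like in SQL Server (the table name comes from the question's comments; adjust to your schema):
-- Rebuild all indexes on the table and refresh its statistics.
ALTER INDEX ALL ON employee REBUILD;
UPDATE STATISTICS employee;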
Edit:
Why a WHERE clause usually won't help...
If you add a WHERE clause that is not on the PK...
you'll still need to scan the table unless you add an index on the searched column
then you'll need a key/bookmark lookup unless you make it covering
with SELECT * you need to add all columns to the index to make it covering
for many hits, the index will probably be ignored to avoid key/bookmark lookups.
Unless there is a network issue or similar, the problem is reading all columns, not the lack of a WHERE clause.
If you did SELECT col13 FROM MyTable and had an index on col13, the index would probably be used.
A SELECT * FROM MyTable WHERE DateCol < '20090101' with an index on DateCol that matched 40% of the table would probably ignore the index, or you'd have expensive key/bookmark lookups.
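To illustrate the narrow-column case, a hedged sketch (MyTable and col13 are the hypothetical names used above):
-- With a single-column index, this query can be answered from the index alone,
-- with no key/bookmark lookups into the base table.
CREATE INDEX IX_MyTable_col13 ON MyTable (col13);
SELECT col13 FROM MyTable WHERE col13 = 'some value';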
Irrespective of the merits of returning the whole table to your application, that does sound like an unexpectedly long time to retrieve just 50000 rows of employee data.
Does your query have an ORDER BY or is it literally just select * from employee?
What is the definition of the employee table? Does it contain any particularly wide columns? Are you storing binary data such as their CVs or employee photo in it?
How are you issuing the SQL and retrieving the results?
What isolation level are your SELECT statements running at? (You can use SQL Profiler to check this.)
Are you encountering blocking? Does adding NOLOCK to the query speed things up dramatically?
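For reference, a minimal sketch of that NOLOCK test (it reads uncommitted data, so treat it as a diagnostic rather than a fix):
-- If this is dramatically faster than the plain SELECT, blocking is the likely culprit.
SELECT * FROM employee WITH (NOLOCK);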

Optimizing MySQL Queries: Is it always possible to optimize a query so that it doesn't use "ALL"

According to the MySQL documentation regarding Optimizing Queries With Explain:
* ALL: A full table scan is done for each combination of rows from the previous tables. This is normally not good if the table is the first table not marked const, and usually very bad in all other cases. Normally, you can avoid ALL by adding indexes that allow row retrieval from the table based on constant values or column values from earlier tables.
Does this mean that any query that uses ALL can be optimized so that it no longer does a full table scan?
In other words, by adding the correct indexes to the table, is it possible to always avoid using ALL? Or are there some cases where ALL is unavoidable, no matter what indexes you add?
It's almost always possible (there are cases where doing a full scan is actually cheaper) to optimize ONE query to avoid a full scan by creating appropriate indexes. However, if you run multiple queries against the same table, there are scenarios where either some of them will end up doing a full scan or you'll end up with more indexes than you have columns in your table :-)
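A hedged before/after sketch of adding such an index (the table and column names are purely illustrative):
-- Before: EXPLAIN reports type = ALL because there is no usable index.
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;
-- Add an index on the filtered column...
ALTER TABLE orders ADD INDEX idx_customer_id (customer_id);
-- After: EXPLAIN should now report type = ref and far fewer examined rows.
EXPLAIN SELECT * FROM orders WHERE customer_id = 42;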
Yes, there are some queries where you'd be hard-pressed to produce an appropriate index. For example:
SELECT * FROM mytable WHERE colA * arg0 - colB > arg1
I'm not entirely sure why you'd want to make such a query, though :)
That said, too many indexes will use up more cache memory and disk space, and slow down updates and inserts.

SQL Server Index - Any improvement for LIKE queries?

We have a query that runs off a fairly large table that unfortunately needs to use LIKE '%ABC%' on a couple of varchar fields so the user can search on partial names, etc. This is SQL Server 2005.
Would adding an index on these varchar fields help at all in terms of SELECT query performance when using LIKE, or does it basically ignore the indexes and do a full scan in those cases?
Any other possible ways to improve performance when using LIKE?
Only if you add full-text searching to those columns, and use the full-text query capabilities of SQL Server.
Otherwise, no, an index will not help.
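As a hedged illustration (the table and column names are placeholders, and a full-text catalog plus a full-text index on the column must already exist):
-- A full-text predicate such as CONTAINS can use the full-text index, unlike LIKE '%ABC%'.
-- Note that full-text search matches whole words or prefixes, not arbitrary substrings.
SELECT * FROM mytable WHERE CONTAINS(name_column, '"ABC"');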
You can potentially see performance improvements by adding index(es); it depends a lot on the specifics :)
How much of the total row size do your predicated columns account for? How many rows do you expect to match? Do you need to return all rows that match the predicate, or just the top 1 or top n rows?
If you are searching for values with high selectivity/uniqueness (so few rows to return), and the predicated columns are a smallish portion of the entire row size, an index could be quite useful. It will still be a scan, but your index will fit more rows per page than the source table.
Here is an example where the total row size is much greater than the column size to search across:
create table t1 (v1 varchar(100), b1 varbinary(8000))
go
--add 10k rows of filler
insert t1 values ('abc123def', cast(replicate('a', 8000) as varbinary(8000)))
go 10000
--add 1 row to find
insert t1 values ('abc456def', cast(replicate('a', 8000) as varbinary(8000)))
go
set statistics io on
go
select * from t1 where v1 like '%456%'
--shows 10001 logical reads
--create index that only contains the column(s) to search across
create index t1i1 on t1(v1)
go
select * from t1 where v1 like '%456%'
--shows 37 logical reads
--(or you can force the optimizer to use the index, as shown below)
If you look at the actual execution plan you can see the engine scanned the index and did a bookmark lookup on the matching row. Or you can tell the optimizer directly to use the index, if it hadn't decided to use this plan on its own:
select * from t1 with (index(t1i1)) where v1 like '%456%'
If you have a bunch of columns to search across, only a few of which are highly selective, you could create multiple indexes and use a reduction approach. E.g. first determine a set of IDs (or whatever your PK is) from your highly selective index, then search your less selective columns with a filter against that small set of PKs.
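A rough, hedged sketch of that reduction approach (the table and column names here are hypothetical):
-- Step 1: the derived table scans only the narrow index on the selective column
--         to produce a small set of primary keys.
-- Step 2: the outer query applies the less selective predicate only to those rows.
SELECT w.*
FROM widgets AS w
JOIN (SELECT id FROM widgets WHERE serial_no LIKE '%AB12%') AS s
  ON s.id = w.id
WHERE w.description LIKE '%blue%';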
If you always need to return a large set of rows you would almost certainly be better off with a table scan.
So the possible optimizations depend a lot on the specifics of your table definition and the selectivity of your data.
HTH!
-Adrian
The only other way (other than using full-text indexing) you could improve performance is to use LIKE 'ABC%' - don't add the wildcard on both ends of your search term - in that case, an index could work.
If your requirements are such that you have to have wildcards on both ends of your search term, you're out of luck...
Marc
Like '%ABC%' will always perform a full table scan. There is no way around that.
You do have a couple of alternative approaches. Firstly, full-text searching: it's really designed for this sort of problem, so I'd look at that first.
Alternatively, in some circumstances it might be appropriate to denormalize the data and pre-process the target fields into appropriate tokens, then add these possible search terms into a separate one-to-many search table. For example, if my data always consisted of a field containing the pattern 'AAA/BBB/CCC' and my users were searching on BBB, then I'd tokenize that out at insert/update (and remove it on delete). This is also one of those cases where using triggers, rather than application code, would be much preferred.
I must emphasise that this is not really an optimal technique and should only be used if the data is a good match for the approach, you for some reason do not want to use full-text search, and the database performance on the LIKE scan really is unacceptable. It's also likely to produce maintenance headaches further down the line.
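A rough, hedged sketch of such a search table (all names here are hypothetical, and the population step would live in a trigger or in application code):
-- One row per searchable token, pointing back at the main table's primary key.
CREATE TABLE mytable_tokens (
    mytable_id int NOT NULL,   -- FK to the main table
    token varchar(50) NOT NULL
);
CREATE INDEX IX_mytable_tokens_token ON mytable_tokens (token);
-- The search then becomes an indexable equality match instead of LIKE '%BBB%'.
SELECT m.*
FROM mytable AS m
JOIN mytable_tokens AS t ON t.mytable_id = m.id
WHERE t.token = 'BBB';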
Create statistics on that column. SQL Server 2005 has optimized in-string searches, so you might benefit from that.
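If it helps, a minimal hedged sketch (using the t1/v1 names from the example above; the statistics name is a placeholder):
-- Column-level statistics give the optimizer better cardinality estimates for the LIKE predicate.
CREATE STATISTICS stats_v1 ON t1 (v1);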