I have a large table: several million rows, over a dozen columns, many of them holding long values.
I have a lot of common operations where I need to select two columns, using one of the columns as a lookup.
I have several indexes tuned to a handful of operations, including some that only contain an ID and a few boolean fields. This has largely worked out well.
I just ran into a problem where an IN() select on a field that contains the md5 sum of another field became a bottleneck: the planner fell back to a sequential scan and ignored every index that included the md5 sum.
A normal scan took 45 seconds. With enable_seqscan turned off, the query took a few milliseconds.
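A minimal sketch of how the two plans can be compared (table and column names here are placeholders standing in for the real ones):

EXPLAIN ANALYZE
SELECT field_a, field_b FROM my_table
WHERE field_md5 IN ('9e107d9d372bb6826bd81d3542a419d6', 'e4d909c290d0fb1ca068ffaddf22cbd0');
-- Seq Scan, tens of seconds

SET enable_seqscan = off;  -- session-local planner toggle, for diagnosis only
EXPLAIN ANALYZE
SELECT field_a, field_b FROM my_table
WHERE field_md5 IN ('9e107d9d372bb6826bd81d3542a419d6', 'e4d909c290d0fb1ca068ffaddf22cbd0');
-- Index Scan, milliseconds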
After playing around for a bit, I realized that this index would work:
CREATE INDEX speed_idx_YAY ON table( field_md5 );
But an index containing any other fields would not be used:
CREATE INDEX speed_idx_BOO ON table( field_md5 , field_other );
The shift from using a multi-column index to a sequential scan happened "overnight", as the database grew. At one time it worked, and then it didn't.
Does anyone have tips on how to best prepare for potential situations like this? Part of me is tempted to create single-column indexes for every indexed field on some tables as a backup.
Referenced:
Why doesn't Postgresql use index for IN query?
PostgreSQL: Why is this query not using my index?
Why isn't Postgres using the index?
Have you thought about creating an index based on md5_field and including the other fields?
CREATE NONCLUSTERED INDEX IX_speed_idx
ON my_table (md5_field)               -- the ON clause names the table being indexed
INCLUDE (field_a, field_b, field_c);
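Side note: the INCLUDE syntax above is SQL Server's. Since the question is about PostgreSQL, the equivalent covering index (supported in PostgreSQL 11 and later; names are the question's placeholders) would be:

CREATE INDEX speed_idx ON my_table (field_md5)
INCLUDE (field_a, field_b, field_c);
-- field_md5 is the searchable key; the INCLUDE columns are carried in the
-- leaf pages so the query can be answered without extra heap fetches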
Related
In one of the projects I am working on, there is a table with about one million records. For better performance I created a non-clustered index and defined the sid field as the index key column. When I execute this query
SELECT [id]
,[sid]
,[idm]
,[origin]
,[status]
,[pid]
FROM [EpollText_Db].[dbo].[PhoneNumbers] where sid = 9
The execution plan shows a scan rather than a seek on the sid index. My question is: why does SQL Server ignore the sid index and scan all one million records to find the result instead? Your help is greatly appreciated.
I believe the problem is the size of your result. You are selecting ten thousand records from your database, which is quite a lot once you consider the query plan an index seek would require: a seek on the sid index followed by key lookups into the clustered index.
That means ten thousand key lookups plus a significant number of random logical reads. Because of this, if your table rows are small, the optimizer may decide that a clustered index scan is cheaper. If you are really concerned about the performance of this query, create a covering index:
CREATE INDEX idx_PhoneNumbers_sid
ON [EpollText_Db].[dbo].[PhoneNumbers](sid)
INCLUDE ([id],[idm],[origin],[status],[pid])
However, this may slow down inserts, deletes, and updates, and it may also double the size of your table.
I'm having this issue in Oracle 11g R2. The table contains a NOT NULL column that is indexed with a non-unique index. The index does not contain any other columns.
I assumed that if I queried the distinct values of that column, Oracle would use the index to get them (sounds logical to me). However, the explain plan tells me it's doing a full table scan. The query also took some time, so the plan probably wasn't changed at run time. An optimizer index hint didn't help.
I searched for an answer but had no luck. Is there a way to read the values stored in the index, or to query them without "touching" the table at all (the way multi-column index joins can)?
Thanks!
EDIT: This was about the Oracle EBS gl_balances table and the gl_balances_n2 index. I got an answer, and this hint changed the explain plan:
select /*+ index_ffs(gl gl_balances_n2) */
distinct gl.period_name
from gl_balances gl;
It may not be more efficient to scan the index than to scan the table -- don't forget that the index segment also contains branch nodes, and each index entry has to contain a ROWID of about 16 bytes (if memory serves).
So a "fast full index scan", which is the plan you're looking to get, may not be as fast as a full table scan. (You'd use an index_ffs() hint for that, by the way.)
edit: It may be possible to use a more exotic method:
Maintaining your own list by periodically querying the table using DBMS_Scheduler.
A materialized view. A complete refresh on demand might be adequate, though barely better than just periodically querying the data and maintaining your own unique list (see the sketch after this list).
Making the index compressed, though that would only be of value for longish index keys.
A bitmap index -- not for a concurrently modified table though.
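A sketch of the materialized-view option, with hypothetical names modeled on the gl_balances example:

CREATE MATERIALIZED VIEW distinct_periods
  REFRESH COMPLETE ON DEMAND
  AS SELECT DISTINCT period_name FROM gl_balances;

BEGIN
  DBMS_MVIEW.REFRESH('DISTINCT_PERIODS', 'C');  -- 'C' = complete refresh, run as needed
END;
/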
Table1 has around 10 lakh (1 million) records and does not have a primary key. Retrieving data with a SELECT (with a specific WHERE condition) takes a large amount of time. Can we reduce the retrieval time by adding a primary key to the table, or are there other ways to do it? Kindly help me.
A primary key does not have a direct effect on performance. But indirectly, it does, because when you add a primary key to a table, SQL Server creates a unique index (clustered by default) that is used to enforce entity integrity. You can also create your own unique indexes on a table. So, strictly speaking, the primary key itself does not affect performance, but the index that backs it does.
WHEN SHOULD A PRIMARY KEY BE USED?
A primary key is needed for referring to a specific record.
To make your SELECTs run fast, consider adding an index on the columns you use in your WHERE clause.
E.g. to speed up SELECT * FROM "Customers" WHERE "State" = 'CA', create an index on the State column.
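For example (the index name is illustrative):

CREATE INDEX idx_customers_state ON "Customers" ("State");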
A primary key will not help if the primary key column isn't in your WHERE clause.
If you want to make your query faster, create a non-clustered index on the columns in your WHERE clause. You may also want to include columns on top of your index (it depends on your SELECT clause).
The SQL optimizer will then seek on your indexes, which makes the query faster.
(But think about when data is added to the table: inserts may take longer if you create indexes on many columns.)
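A sketch of that in SQL Server syntax, with hypothetical table and column names:

CREATE NONCLUSTERED INDEX IX_orders_customer
ON dbo.Orders (customer_id)           -- the column from the WHERE clause
INCLUDE (order_date, total_amount);   -- columns needed by the SELECT list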
It depends on the SELECT statement; on the size of each row and the number of rows in the table; on whether you are retrieving all the data in each row or only a small subset (and, if a subset, whether the needed columns are all present in a single index); and on whether the rows must be sorted.
If all the columns of all the rows in the table must be returned, then you can't speed things up by adding an index. If, on the other hand, you are only trying to retrieve a tiny fraction of the rows, then providing appropriate indexes on the columns involved in the filter conditions will greatly improve the performance of the query. If you are selecting all, or most, of the rows but only selecting a few of the columns, then if all those columns are present in a single index and there are no conditions on columns not in the index, an index can help.
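As a hypothetical illustration of that last case: with an index on (state, city), a query that filters on and selects only those two columns can be answered from the index alone, without reading the table rows.

CREATE INDEX idx_customers_state_city ON customers (state, city);

SELECT state, city
FROM customers
WHERE state = 'CA';  -- satisfiable entirely from the index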
Without a lot more information, it is hard to be more specific. There are whole books written on the subject, including:
Relational Database Index Design and the Optimizers
One way you can do it is to create indexes on your table. It's always better to create a primary key, which creates a unique index and by default will reduce the retrieval time.
The optimizer chooses an index scan if the index columns are referenced in the SELECT statement and if the optimizer estimates that an index scan will be faster than a table scan. Index files generally are smaller and require less time to read than an entire table, particularly as tables grow larger. In addition, the entire index may not need to be scanned. The predicates that are applied to the index reduce the number of rows to be read from the data pages.
Read more: Advantages of using indexes in database?
I have a very specific query. I tried lots of things, but I couldn't reach the performance I want.
SELECT *
FROM
items
WHERE
user_id=1
AND
(item_start < 20000 AND item_end > 30000)
I created an index on (user_id, item_start, item_end).
That didn't work, so I dropped all the indexes and created new ones:
user_id, and (item_start, item_end)
That didn't work either.
(user_id, item_start and item_end are int)
edit: database is MySQL 5.1.44, engine is InnoDB
UPDATE: per your comment below, you need all the columns in the query (hence your SELECT *). If that's the case, you have a few options to maximize query performance:
Create (or change) your clustered index to be on (user_id, item_start, item_end). This ensures that as few rows as possible are examined for each query. Per my original answer below, this approach may speed up this particular query but slow down others, so you'll need to be careful.
If it's not practical to change your clustered index, you can create a non-clustered index on (user_id, item_start, item_end) plus any other columns your query needs. This will slow down inserts somewhat and roughly double the storage required for the table, but it will speed up this particular query.
There are always other ways to increase performance (e.g. by reducing the size of each row) but the primary way is to decrease the number of rows which must be accessed and to increase the % of rows which are accessed sequentially rather than randomly. The indexing suggestions above do both.
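A sketch of both options for InnoDB, which clusters rows by PRIMARY KEY; this assumes a surrogate key id, and the ALTER rewrites the whole table, so test it first:

-- option 1: re-cluster the table on the query's columns
ALTER TABLE items
  DROP PRIMARY KEY,
  ADD PRIMARY KEY (user_id, item_start, item_end, id),
  ADD UNIQUE KEY uk_items_id (id);  -- keeps an AUTO_INCREMENT id valid

-- option 2: a secondary index covering the filter columns
CREATE INDEX idx_items_user_range ON items (user_id, item_start, item_end);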
ORIGINAL ANSWER BELOW:
Without knowing the exact schema or query plan, the main performance problem with this query is that SELECT * forces a lookup back to your clustered index for every matching row. If there are large numbers of matching rows for a particular user ID, and if your clustered index's first column is not user_id, this will likely be a very inefficient operation, because the disk will be fetching lots of randomly distributed rows from the clustered index.
In other words, even though filtering the rows you want is fast (because of your index), actually fetching the data is slower.
If, however, your clustered index is ordered by (user_id, item_start, item_end), that should speed things up. Note that this is not a panacea: if you have other queries that depend on a different ordering, or if you're inserting rows in a different order, you could end up slowing those down.
A less drastic solution is to create a covering index that contains only the columns you need (also ordered by user_id, item_start, item_end, with the other needed columns added after). Then change your query to pull back only those columns instead of using SELECT *.
If you post more info about the DBMS brand and version and the schema of your table, we can help with more details.
Do you need to SELECT *?
If not, you can create an index on (user_id, item_start, item_end) with the fields you need in the SELECT list as included columns. This all assumes you're using Microsoft SQL Server 2005+.
Using DBD::SQLite with SQLite3:
If I am going to run a SELECT only once,
should I CREATE an INDEX first and then run the SELECT,
or
just run the SELECT without an INDEX?
Which is faster?
If it needs to be specified: the column to be indexed is an INTEGER that is either undef or 1, just these two possibilities.
Building an index takes longer than just doing a table scan. So, if your single query — which you're only running once — is just a table scan, adding an index will be slower.
However, if your single query is not just a table scan, adding the index may be faster. For example, without an index, the database may implement a join as repeated table scans, one scan of the inner table for each row of the outer table. In that case the index would probably be faster.
I'd say to benchmark it, but that sounds silly for a one-off query that you're only ever going to run once.
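If you do want to see what the one-off query would do in each case, SQLite's EXPLAIN QUERY PLAN is cheap to run (table and column names are placeholders):

EXPLAIN QUERY PLAN SELECT * FROM t WHERE flag = 1;
-- without an index, typically: SCAN t

CREATE INDEX idx_t_flag ON t (flag);

EXPLAIN QUERY PLAN SELECT * FROM t WHERE flag = 1;
-- with the index, typically: SEARCH t USING INDEX idx_t_flag (flag=?)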
If you're considering an index on a column that has only two possible values, it's not worth the effort: the index will give very little improvement. Indexes are useful on columns that have a high degree of uniqueness and are frequently queried for a certain value or range. On the other hand, indexes make inserting and updating slower, so in this case you should skip it.