indexes for group by on two columns

I have a large temp table (~160 million rows), #itemsTemp:
itemId | style    | styleWeight
-------|----------|------------
int    | smallint | float(53)
and the following query on it:
    select
        itemId,
        style,
        SUM(styleWeight) itemCount
    from
        #itemsTemp
    group by itemId, style
Currently #itemsTemp has no indexes. I'm a little confused about what would be best here:
A composite index on itemId and style (and probably include styleWeight)
Separate indexes on itemId and style
Which way should I go? Why? Any other options?

A composite index on (itemId, style) with styleWeight included would be the best option.
It lets the optimizer use a Stream Aggregate without a sort, and because the index covers the query it also avoids clustered seek/RID lookup overhead.
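A minimal sketch of the effect described above, using Python's built-in sqlite3 module as a stand-in (an assumption: the question is about SQL Server, whose plans and operators differ). SQLite has no INCLUDE clause, so styleWeight is appended as a trailing key column; the table contents and the index name idx_item_style are made up for illustration. In SQL Server the equivalent index would be declared on (itemId, style) with INCLUDE (styleWeight).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE itemsTemp (itemId INTEGER, style INTEGER, styleWeight REAL)")
conn.executemany(
    "INSERT INTO itemsTemp VALUES (?, ?, ?)",
    [(i % 50, i % 3, 0.5) for i in range(1000)],  # made-up sample rows
)

QUERY = """
    SELECT itemId, style, SUM(styleWeight) AS itemCount
    FROM itemsTemp
    GROUP BY itemId, style
"""

def plan(sql):
    """Return the 'detail' column of SQLite's EXPLAIN QUERY PLAN output."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

# No index: SQLite has to sort the rows into groups itself.
plan_before = plan(QUERY)

# Composite index in GROUP BY order; styleWeight is added so the index covers the query.
conn.execute("CREATE INDEX idx_item_style ON itemsTemp (itemId, style, styleWeight)")
plan_after = plan(QUERY)

print(plan_before)
print(plan_after)
```

Before the index, the plan contains a temporary B-tree step for the GROUP BY; afterwards the rows are read from the covering index already in group order, which is the SQLite analogue of avoiding the sort and the lookups.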

SQL Server 2008 actually suggests missing indexes if you include the actual execution plan. The Database Engine Tuning Advisor also suggests indexes for you.
However, the optimal indexes depend on the other queries run against this table:
Every index you add to a table has both a storage penalty and a performance penalty when writing, so if you write to this table you want to keep the number of indexes reasonably low in order to keep write performance acceptable.
If many other queries use the same 2 columns then you may want to use a composite index, as long as those queries can all take advantage of that index (remember that the order of columns in a composite index matters).
Conversely, if other queries cannot take advantage of a composite index it may be better to use two separate indexes. The performance may be lower for this query, but this could be worth it overall if the index re-use reduces the number of indexes on this table.
In reality the index suggestion feature tends to work pretty well. I usually just do what it suggests (after a quick think / sanity check) and then run some simple tests to make sure that the query is actually performing with the new index(es).
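The storage penalty mentioned in this answer can be made visible with a small sketch. It uses Python's sqlite3 module rather than SQL Server (an assumption made so the example is runnable anywhere), and the table t and index names are hypothetical. It only shows the size growing with each index; it does not measure write speed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER, c TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [(i, i % 100, "x" * 20) for i in range(20000)],  # made-up sample rows
)
conn.commit()

def page_count():
    """Number of database pages currently allocated (table plus all indexes)."""
    return conn.execute("PRAGMA page_count").fetchone()[0]

pages_no_index = page_count()

conn.execute("CREATE INDEX idx_a ON t (a)")
pages_one_index = page_count()

conn.execute("CREATE INDEX idx_b ON t (b)")
pages_two_indexes = page_count()

print(pages_no_index, pages_one_index, pages_two_indexes)
```

Every extra index is another structure that has to be stored and kept in step with the table on each write, which is why the number of indexes is worth keeping in check on write-heavy tables.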

Aside from evaluating the performance both ways (manually), you can use query optimizing hints -- for example: http://msdn.microsoft.com/en-us/library/ms181714.aspx.
Also -- if your temp table is so big, I wonder if there isn't a better way to solve the problem than using a temp table.
Also -- how often are you writing versus reading? How long is the session? Are you making it available to other procedures?

Related

oracle - two unrelated categories on row - how to index?

I have an OLTP application with three tables
Item Table - ItemId, CategoryId, AgeGroupId, ... 100K rows.
CategoryTable - CategoryId, ... (only 5-10 rows)
AgeGroupTable - AgeGroupId, ... (only 4-5 rows)
What is an appropriate index for CategoryId and AgeGroupId on the Item table? It would be nice to query items by Category or AgeGroup or both of them!
I was thinking that a bitmap index might work due to low cardinality, but I don't know how exactly they work with multiple bitmap indexes per table? How would horizontal partitioning help, if at all?

Since this is an OLTP application, you almost certainly don't want to use a bitmap index. Bitmap indexes tend not to work well with OLTP applications. They tend to grow in size very rapidly when you do a lot of single-row operations on the data (though this effect is lessened in more recent versions). But more importantly, the locking impact tends to radically reduce the scalability of an application. If you had a bitmap index on CategoryID, for example, updating a single row's CategoryID would effectively require locking every row in the table that has a CategoryID of either the source or target value.
It sounds like, at most, you need composite indexes on (AgeGroupID, CategoryID) and (CategoryID, AgeGroupID). Potentially, you could use just the composite index on (AgeGroupID, CategoryID) and let Oracle use an index skip scan if only CategoryID is specified. It depends on the trade-offs you want to make-- multiple indexes will make queries just on CategoryID more efficient at the expense of additional index maintenance on DML operations and additional disk space usage.
Are you licensed to use partitioning? That is an extra cost option on top of the enterprise edition license. Potentially, I suppose, you could partition the table. A table with just 100,000 rows is pretty small to consider partitioning, though. And whatever you partition by would tend to make queries that don't use the partition key less efficient. That might make sense if you know that queries that specify AgeGroupID are much more common than CategoryID (or vice versa) but that doesn't sound like what you are describing.

This started off as a comment, but it's getting too long.
What is appropriate index for CategoryId and AgeGroupId?
In what context? Both data domains appear as primary and foreign keys in your example schema. However, this is beside the point.
You should only add indices where they are actually going to add value, and with fewer than 10 rows in each table, unless the data is very skewed, there's no benefit to indexing either domain at all. Inserts/updates will be slower and accessing the data via such an index will be slower than performing a full table scan on each of the 3 tables.
There may be implicit relationships between other attributes in the item table whereby it makes sense to add the domains to other indexes (but not at the front), but without knowing a lot more about the data and the queries being run against it, I'd ignore this for now.

It really depends on what your queries look like. If you are always going to be filtering on or joining on only one column at a time, then bitmap indexes will work fine. If you will be filtering or joining based on both columns, a composite index can work as well.
In my experience, the best way to know for sure is to test both options. I have had success putting multiple bitmap indexes on a table, as well as using composite indexes. With only 100K rows in a table, you should be able to create and drop indexes very quickly. Then you can test your most common queries with different sets of indexes.

Database indexes: A good thing, a bad thing, or a waste of time?

Adding indexes is often suggested here as a remedy for performance problems.
(I'm talking about reading & querying ONLY, we all know indexes can make writing slower).
I have tried this remedy many times, over many years, both on DB2 and MSSQL, and the results were invariably disappointing.
My finding has been that no matter how 'obvious' it was that an index would make things better, it turned out that the query optimiser was smarter, and my cleverly-chosen index almost always made things worse.
I should point out that my experiences relate mostly to small tables (<100'000 rows).
Can anyone provide some down-to-earth guidelines on choices for indexing?
The correct answer would be a list of recommendations something like:
Never/always index a table with less than/more than NNNN records
Never/always consider indexes on multi-field keys
Never/always use clustered indexes
Never/always use more than NNN indexes on a single table
Never/always add an index when [some magic condition I'm dying to learn about]
Ideally, the answer will give some instructive examples.

Indexes are kind of like chemotherapy...too much and it kills you...too little and you die...do it the wrong way and you die. You gotta know just how much, how often, and what kind to make it not kill you.
Your hardware, platform, environment, load all play a role. So to answer your questions..
Yes, possibly sometimes.

As a rule of thumb, primary keys and foreign keys need to be indexed. Usually primary keys are indexed just by defining them as such, but FKs are not indexed automatically in every database (they definitely are not in SQL Server; I can't really speak for other DBs). You will be using these in joins, so it is generally critical to performance to define these indexes.
Now if you have fields you often use in where clauses, they can benefit from indexes as well, provided several things hold:
First, the field must have a range of values. A bit field or a field with only 2 or 3 values will almost never use an index.
Second the queries you write must be sargable. That is they must be designed to use indexes. I suspect if you never get performance improvements from what look like likely candidates for indexes, then you probably have queries that are not sargable. For instance take "WHERE Name like '%Smith'" as a where clause. Without knowing the first characters, the optimizer can't use the index.
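The sargability point can be sketched with Python's sqlite3 module (an assumption: a stand-in engine, not the SQL Server or DB2 systems from the question). The people table and the index name are hypothetical, and the column is declared COLLATE NOCASE only because SQLite needs that before its default case-insensitive LIKE can use an index.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# NOCASE collation is a SQLite-specific requirement for index-assisted LIKE.
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT COLLATE NOCASE)")
conn.execute("CREATE INDEX idx_people_name ON people (name)")

def plan(sql):
    """Return the 'detail' column of SQLite's EXPLAIN QUERY PLAN output."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

# Known leading characters: the index can be range-searched.
prefix_plan = plan("SELECT id FROM people WHERE name LIKE 'Smith%'")

# Leading wildcard: every entry has to be examined.
suffix_plan = plan("SELECT id FROM people WHERE name LIKE '%Smith'")

print(prefix_plan)
print(suffix_plan)
```

The first plan is reported as a SEARCH on the index and the second as a SCAN, which is the difference between a sargable and a non-sargable predicate.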
Small tables rarely benefit much from indexes. If the optimizer can hold the whole thing in memory, then it is often faster to do so. If you were working with multimillion record tables, you would see that indexes are critical.
Indexing can be very complex and if you are interested in the subject, I suggest you get a good book on performance tuning your particular database and read in depth about them.

An index that's never used is a waste of disk space, as well as adding to the insert/update/delete time. It's probably best to define the clustering index first, then define additional indexes as you find yourself writing WHERE clauses.
One common index mistake I see is people wondering why a select on col2 (or col3) takes so long when the index is defined as col1 ASC, col2 ASC, col3 ASC. When you have a multiple column index, your WHERE clause must use the first column in the index, or the first and second column in the index, and so forth.
If you need to access the data by col2, then you need an additional index that's defined as col2 ASC.
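The leading-column rule can be checked with a short sketch in Python's sqlite3 module (an assumption: SQLite stands in for whichever DBMS is in use, and some engines, Oracle for example, can sometimes skip-scan past the first column). The table t and the index names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col1 INTEGER, col2 INTEGER, col3 INTEGER, payload TEXT)")
conn.execute("CREATE INDEX idx_c1_c2_c3 ON t (col1, col2, col3)")

def first_step(sql):
    """Return the first line of SQLite's query plan for `sql`."""
    return conn.execute("EXPLAIN QUERY PLAN " + sql).fetchone()[3]

# Filters that start at the first index column can seek into the index.
by_col1 = first_step("SELECT payload FROM t WHERE col1 = 1")
by_col1_col2 = first_step("SELECT payload FROM t WHERE col1 = 1 AND col2 = 2")

# A filter on col2 alone cannot seek into an index that starts with col1.
by_col2 = first_step("SELECT payload FROM t WHERE col2 = 2")

# A separate index that starts with col2 fixes that.
conn.execute("CREATE INDEX idx_c2 ON t (col2)")
by_col2_indexed = first_step("SELECT payload FROM t WHERE col2 = 2")

for step in (by_col1, by_col1_col2, by_col2, by_col2_indexed):
    print(step)
```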
With small domain tables, it's sometimes faster to do a table scan than it is to read rows from the table using an index. This depends on the speed of your database machine and the speed of the network.

You need indexes. Only with indexes can you access data fast enough.
To make it as short as possible:
Add indexes for columns you frequently filter (or group) by (e.g. a state or name).
LIKE and SQL functions can prevent the DBMS from using indexes.
Add indexes only on columns which have many different values (e.g. no boolean fields).
It is common to add indexes to foreign keys, but it is not always needed.
Don't add indexes to very short tables.
Never add indexes when you don't know how they should enhance performance.
Finally: look into execution plans to decide how to optimize queries.
You'll add indexes just for a single, critical query. In this case, you'll add exactly the indexes that are needed in the query in question (multi-column indexes).

Basically, while a DB is live and collecting data, its indexes have to evolve with that flow. There may be a really good index on a table, but after the table grows beyond XXX records the same index may become useless, and in that case it should be refactored.
The only way to keep a DB optimized and fast is to monitor it all the time and refactor it as records come in.
A real-life example I ran into some time ago was a super fast query restricted to some time range (created_at between A and B) and a super slow query where the time range was different. Same query, same database, same application; the only difference was the time range.

Always use clustered indexes.
In fact you can't help but use them. The data in a table will be laid out on disk in some particular order anyway; it can't be saved as a pile or something. You have the chance of specifying exactly how this data will be laid out. Why waste it?
When you have a table which gets new records appended and you observe that some value in those records always grows (like the StackOverflow question number), make a clustered index out of it. Then the new data will not be inserted in the middle but will basically be appended to a file on disk, which is a relatively cheap operation.

If a table is expected to be the target of a join then it is best to have a clustered index on that table so that the joins can be performed sequentially through the data pages. The columns in the clustered index will (on some DB systems) be included in all of the other indexes on that table, since those are the values that the indexes will use to reference the table data. To keep the other indexes from getting too large, the columns in the clustered index should be as narrow as possible, so it is best to use only numeric—rather than character—data types in the clustered index. In general, fewer columns are better than more columns, but notice that three int columns (12 bytes per row) are much better than one nvarchar(32) column (potentially 64 bytes per row).
If the clustered index is narrow, then a few additional indexes should not negatively impact performance very much even on very large tables.

Seems you are confusing two concepts here.
Adding indices generally can only make a read query faster, and very, very rarely (almost never) slower. Adding an index never forces the query optimizer to use it. It will only use it if it thinks it can benefit from it, and it is generally very smart about those decisions.
For inserts/updates, of course, every index hurts performance a bit more... But at the other end of the spectrum, for, say, a read-only database (like a USPS address database which is distributed monthly), in operational use there would be no inserts/updates, so the only negative impact of additional indices is the disk space they take up.
This is entirely different from specifying that the query optimizer USE an index, in effect overriding what it would do on its own... That can potentially make a query slower.
EDIT: Edited to eliminate opportunity for misinterpretation by overly literal readers.

Creating indexes for 'OR' operator in queries

I have some MySQL queries with conditions like
where field1=val1 or field2=val2
and some like
where fieldx=valx and fieldy=valy and (field1=val1 or field2=val2)
How can I optimize these queries by creating indexes? My intuition is to create separate indexes for field1 and field2 for first query as it is an OR, so a composite index probably won't do much good.
For the second query I intend to create 2 indexes: fieldx, fieldy, field1 and fieldx,fieldy,field2 again for the above stated reason.
Is this solution correct? This is a really large table so I can't just experiment by applying indexes and explaining the query.

As with all DBMS optimisation questions, it depends on your execution engine.
I would start with the simplest scenario, four separate indexes on each of the columns.
This will ensure that any queries using those columns in a way you haven't anticipated will still run okay (a fieldx/fieldy/field1 index will be of zero use to a query only using fieldy).
Any decent execution engine will efficiently choose the most selective index first (the one expected to match the fewest rows) so as to reduce the result set, and then perform the other filters based on that.
Then, and only if you have a performance problem, you can look into improving it with different indexes. You should test performance on production-type data, not any test databases you have built yourself (unless they mirror the attributes of production anyway).
And keep in mind that database tuning is rarely a set-and-forget operation. You should periodically re-tune because performance depends both on the schema and the data you hold.
Even if the schema never changes, the data may vary wildly. Re your comment "I just cant experiment by applying indexes and explaining the query", that's exactly what you should be doing.
If you're worried about playing in production (and you should be), you should have another environment set up with similar specs, copy the production data across to it, then fiddle around with your indexes there.

My intuition is to create separate indexes for field1 and field2 for first query as it is an OR, so a composite index probably won't do much good.
That's correct.
For the second query I intend to create 2 indexes: fieldx, fieldy, field1 and fieldx,fieldy,field2 again for the above stated reason.
That's one option; the other would be an index on fieldx, fieldy, field1 and an index on field2 (same as for your first query!). Now you also have 2 indexes, but the second one will be much smaller. Your second query can use both indexes: the bigger one for the AND part of your query and the small index for the OR part on field2. MySQL should be smart enough nowadays.
EXPLAIN will help you out.
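A small sketch of the OR case, using Python's sqlite3 module instead of MySQL (an assumption: MySQL's index merge is a different implementation, so its EXPLAIN output will look different). The table t and the index names are hypothetical; field1 and field2 come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (id INTEGER PRIMARY KEY, field1 INTEGER, field2 INTEGER, other TEXT)"
)

def plan(sql):
    """Return the 'detail' column of SQLite's EXPLAIN QUERY PLAN output."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

QUERY = "SELECT other FROM t WHERE field1 = 1 OR field2 = 2"

# One composite index: the field2 branch of the OR cannot seek into it.
conn.execute("CREATE INDEX idx_composite ON t (field1, field2)")
composite_plan = plan(QUERY)
conn.execute("DROP INDEX idx_composite")

# Two single-column indexes: each branch of the OR gets its own index lookup.
conn.execute("CREATE INDEX idx_field1 ON t (field1)")
conn.execute("CREATE INDEX idx_field2 ON t (field2)")
separate_plan = plan(QUERY)

print(composite_plan)
print(separate_plan)
```

With only the composite index the plan is a full scan; with the two separate indexes both are used and the results are combined, which matches the intuition in the question.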

When should you consider indexing your sql tables?

How many records should there be before I consider indexing my sql tables?

There's no good reason to forgo obvious indexes (FKs, etc.) when you're creating the table. It will never noticeably affect performance to have unnecessary indexes on tiny tables, and it's good to take a first cut when your mind is into schema design. Also, some indexes serve to prevent duplicates, which can be useful regardless of table size.
I guess the proper answer to your question is that the number of records in the table should have nothing to do with when to create indexes.

I would create the indexes when I create my table. If you decide to create indices after the table has grown to 100, 1000, 100000 entries it can take a lot of time and perhaps make your database unavailable while you are doing it.
Think about the table first, create the indices you think you'll need, and then move on.
In some cases you will discover that you should have indexed a column. If that's the case, fix it when you discover it.
Creating an index on a searched field is not a pre-optimization, it's just what should be done.

When the query time is unacceptable. Better yet, create a few indexes now that are likely to be useful, and run an EXPLAIN or EXPLAIN ANALYZE on your queries once your database is populated by representative data. If the indexes aren't helping, drop them. If there are slow queries that could benefit from more or different indexes, change the indexes.
You are not going to be locked in to an initial choice of indexes. Experiment, and make sure you measure performance!

In general I agree with the previous advice.
Always declare the referential integrity for the tables (Primary Key, Foreign Keys), column constraints (not null, check). Saves you from nightmares when apps put bad data into the tables (even in development).
I'd consider adding indexes for the common access columns (columns in your where clauses which are used in =, <> tests), as well.
Most modern RDBMS implementations are quite good at keeping your indexes up to date without hitting your performance. So, the cost of having indexes is minimal.
Also, most RDBMSs have query plan evaluators which look at the relative costs of going to the data rows via the index versus using some sort of table scan. So, again, the performance hits are minimal.

Two.
I'm serious. If there are two rows now, and there will always be two rows, the cost of indexing is almost zero. It's quicker to index than to ponder whether you should. It won't take the optimizer very long to figure out that scanning the table is quicker than using the index.
If there are two rows now, but there will be 200,000 in the near future, the cost of not indexing could become prohibitively high. The right time to consider indexing is now.
Having said this, remember that you get an index automatically when you declare a primary key. Creating a table with no primary key is asking for trouble in most cases. So the only time you really need to consider indexing is when you want an index other than the index on the primary key. You need to know the traffic, and the anticipated volume to make this call. If you get it wrong, you'll know, and you can reverse the decision.
I once saw a reference table that had been created with no index when it contained 20 rows. Due to a business change, this table had grown to about 900 rows, but the person who should have noticed the absence of an index didn't. The time to insert a new order had grown from about 10 seconds to 15 minutes.

As a matter of routine I perform the following on read heavy tables:
Create indexes on common join fields such as Foreign Keys when I create the table.
Check the query plan for Views or Stored Procedures and add indexes wherever a table scan is indicated.
Check the query plan for queries by my application and add indexes wherever a table scan is indicated. (and often try to make them into Stored Procedures)
On write heavy tables (like activity logs) I avoid indexes unless they are absolutely necessary. I also tend to archive such data into indexed tables at regular intervals.

It depends.
How much data is in the table? How often is data inserted? A lot of indexes can slow down insertion time. Do you always query all the rows of the table? In this case indexes probably won't help much.
Those aren't common usages though. In most cases, you know you're going to be querying a subset of data. On what fields? Are there common fields that are always joined on? Look at query plans for common or typical queries; they will generally show you where the time is being spent.

If there's a unique constraint on the table (and there should be at least one), then that will usually be enforced by a unique index.
Otherwise, you add indexes when the query performance is bad and adding the index will demonstrably improve the performance. There are books on the subject of how to create good sets of indexes on tables, including Relational Database Index Design and the Optimizers. It will give you a lot of ideas and the reasons why they are good.
See also:
No indexes on small tables
When to create a new SQL Server index
Best Practices and Anti-Patterns in Creating Indexes
and, no doubt, a host of others.

SQL Relationships and indexes

I have an MS SQL server application where I have defined my relationships and primary keys.
However do I need to further define indexes on relationship fields which are sometimes not used in joins and just as part of a where clause?
I am working on the assumption that defining a relationship creates an index, which the sql engine can reuse.

Some very thick books have been written on this subject!
Here are some rules of thumb:
Don't bother indexing (apart from PK) any table with < 1000 rows.
Otherwise index all your FKs.
Examine your SQL and look for the where clauses that will most reduce your result sets, and index those columns.
e.g. given:
SELECT OWNER FROM CARS WHERE COLOUR = 'RED' AND MANUFACTURER = 'BMW' AND ECAP = '2.0';
You may have 5000 red cars out of 20,000, so indexing COLOUR won't help much.
However, you may only have 100 BMWs, so indexing MANUFACTURER will immediately reduce your result set to 100 and you can eliminate the blue and white cars by simply scanning through the hundred rows.
Generally the dbms will pick one or two of the indexes available based on cardinality so it pays to second guess and define only those indexes that are likely to be used.
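The CARS example can be reproduced as a rough sketch with Python's sqlite3 module (an assumption: SQLite stands in for SQL Server, and the data is synthetic, built so that 5000 of 20,000 cars are red and 100 are BMWs). The index names and the other manufacturer names are hypothetical, and the ECAP column is left out to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cars (owner TEXT, colour TEXT, manufacturer TEXT)")

# Synthetic data: 4 colours with 5000 cars each, 200 manufacturers with 100 cars each.
colours = ["RED", "BLUE", "WHITE", "BLACK"]
rows = []
for i in range(20000):
    maker = "BMW" if i % 200 == 0 else "MAKER%d" % (i % 200)
    colour = colours[(i // 200) % 4]
    rows.append(("owner%d" % i, colour, maker))
conn.executemany("INSERT INTO cars VALUES (?, ?, ?)", rows)

conn.execute("CREATE INDEX idx_colour ON cars (colour)")
conn.execute("CREATE INDEX idx_manufacturer ON cars (manufacturer)")
conn.execute("ANALYZE")  # gather the statistics the planner uses to compare indexes

QUERY = "SELECT owner FROM cars WHERE colour = 'RED' AND manufacturer = 'BMW'"
chosen = conn.execute("EXPLAIN QUERY PLAN " + QUERY).fetchone()[3]
red_bmws = len(conn.execute(QUERY).fetchall())

print(chosen)
print(red_bmws)
```

With statistics available, the planner picks the manufacturer index (about 100 candidate rows) over the colour index (about 5000) and filters the colour on the few rows that remain.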

No indexes will be automatically created for foreign key constraints. But unique and primary key constraints will create theirs.
Creating indexes for the queries you use, be it on join columns or on WHERE clause columns, is the way to go.
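The same behaviour can be observed in SQLite through Python's sqlite3 module, used here only as a runnable stand-in (an assumption: the question is about MS SQL Server, where the catalog views differ). The customers and orders tables and the index name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_code TEXT PRIMARY KEY)")
conn.execute("""
    CREATE TABLE orders (
        order_no TEXT PRIMARY KEY,
        customer_code TEXT REFERENCES customers (customer_code)
    )
""")

def indexed_columns(table):
    """Map each index on `table` to the list of columns it covers."""
    result = {}
    for row in conn.execute("PRAGMA index_list(%s)" % table).fetchall():
        index_name = row[1]
        info = conn.execute("PRAGMA index_info(%s)" % index_name).fetchall()
        result[index_name] = [col[2] for col in info]
    return result

# Only the primary key has an index; the foreign key column has none.
before = indexed_columns("orders")

# The foreign key column has to be indexed explicitly.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_code)")
after = indexed_columns("orders")

print(before)
print(after)
```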

Like everything in the programming world, it depends. You obviously want to create indexes and relationships to preserve normalization and speed up database lookups. But you also want to balance that by not having so many indexes that SQL Server takes more time to maintain every index. Also, the more indexes you have, the more fragmentation can occur in your database.
So what I do is put in the obvious indexes and relationships and then, once the application is built, optimize for the queries that turn out to be slow.

Defining a relationship does not create the index.
Usually in places where you have a where clause against some field you want an index but be careful not to just throw indexes out all over the place because they can and do have an effect on insert/update performance.

I would start by making sure that every PK and FK has an index.
Further to that, I have found that using the Index Tuning Wizard in SSMS provides excellent recommendations when you feed it the right information.

Database Considerations
When you design an index, consider the following database guidelines:
Large numbers of indexes on a table affect the performance of INSERT, UPDATE, DELETE, and MERGE statements because all indexes must be adjusted appropriately as data in the table changes.
Avoid over-indexing heavily updated tables and keep indexes narrow, that is, with as few columns as possible.
Use many indexes to improve query performance on tables with low update requirements, but large volumes of data. Large numbers of indexes can help the performance of queries that do not modify data, such as SELECT statements, because the query optimizer has more indexes to choose from to determine the fastest access method.
Indexing small tables may not be optimal because it can take the query optimizer longer to traverse the index searching for data than to perform a simple table scan. Therefore, indexes on small tables might never be used, but must still be maintained as data in the table changes.
Indexes on views can provide significant performance gains when the view contains aggregations, table joins, or a combination of aggregations and joins. The view does not have to be explicitly referenced in the query for the query optimizer to use it.

Indexes aren't very expensive, and speed up queries more than you realize. I would recommend adding indexes to all key and non-key fields that are often used in queries. You can even use the execution plan to recommend additional indexes that would speed up your queries.
The only point where indexes aren't in your favour is when you're doing large amounts of data inserts. Each insert requires each index in a table to be updated along with the table's data.
You can opt to wait until the application is running and you have some known queries against the database that you want to improve, or you could do it now, if you have a good idea.