I have four Lucene indexes with Hibernate Search, each holding 2 million documents. Recently I needed to add @Facet fields, but rebuilding the whole index takes too long.
No, you have to rebuild the index.
The process of rebuilding the index does indeed take some time, but with some tuning you can speed it up significantly.
Since there are many situations in which you'll need to rebuild the index, it's worth spending a bit of time investigating how to make it fast enough to be acceptable: you will also need this in case of disaster recovery.
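In the pre-6 Hibernate Search API, the rebuild is driven by the MassIndexer, and most of the tuning happens through its knobs. A minimal sketch, assuming an open Session and an indexed entity MyEntity (a placeholder; the parameter values are starting points to tune, not recommendations):

```java
import org.hibernate.CacheMode;
import org.hibernate.Session;
import org.hibernate.search.FullTextSession;
import org.hibernate.search.Search;

public class Reindexer {
    public static void rebuild(Session session) throws InterruptedException {
        FullTextSession fullTextSession = Search.getFullTextSession(session);
        fullTextSession.createIndexer(MyEntity.class)
            .batchSizeToLoadObjects(100)   // entities loaded per query
            .threadsToLoadObjects(8)       // parallel entity-loading threads
            .idFetchSize(1000)             // JDBC fetch size for the id scroll
            .cacheMode(CacheMode.IGNORE)   // skip the 2nd-level cache entirely
            .startAndWait();               // block until the rebuild finishes
    }
}
```

Raising the thread count and batch size until the database or CPU becomes the bottleneck is usually where most of the speedup comes from.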
I have an Azure SQL Database that has proved pretty successful so far. It's about 20 months old, no maintenance done... but it has handled a lot. Some tables have millions of rows, and when querying on columns that are indexed, query response times are acceptable when using the web application that talks to it.
However, I read conflicting advice on rebuilding indexes.
This guy says there is no point in doing it: http://beyondrelational.com/modules/2/blogs/76/posts/15290/index-fragmentation-in-sql-azure.aspx
This guy says to go ahead and rebuild:
https://alexandrebrisebois.wordpress.com/2013/02/06/dont-forget-about-index-maintenance-on-windows-azure-sql-database/
I have run some index rebuild statements on some of the smaller tables storing a few thousand rows. Some of the fragmentation would drop by about half... and running it a second time might bring it down a bit further.
These rebuilds ran in about 2 to 10 seconds, depending on the size of the table.
Then I ran a rebuild on a table that has about 2 million rows. Fragmentation (%) before:

PK__tmp_ms_x__CDEC17C03A4CDB46    55.2335782060565
PK__tmp_ms_x__CDEC17C03A4CDB46    0
IX_this_is_my_fk_index            15.7538877620014

It took 33 minutes. Fragmentation (%) after:

PK__tmp_ms_x__CDEC17C03A4CDB46    0.01
PK__tmp_ms_x__CDEC17C03A4CDB46    0
IX_this_is_my_fk_index            0
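For reference, a rebuild like the one timed above is issued per index with ALTER INDEX (the table name here is a placeholder):

```sql
ALTER INDEX IX_this_is_my_fk_index ON dbo.MyLargeTable REBUILD;

-- Or all indexes on the table at once:
ALTER INDEX ALL ON dbo.MyLargeTable REBUILD;
```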
Questions:
Query speeds have not really changed since doing the above. Is this normal?
Given that there are many things I have no control over in SQL Azure, does it even make sense to rebuild indexes?
BTW: I am not and never have been a DBA... just a developer
Rebuilding indexes will matter if the indexes are actually being used. If an index isn't being used by the query you're running, you won't see a difference. If it's only lightly used, you'll see a minor difference once statistics are updated. If it's heavily used, you should see a good performance increase most of the time.

The other thing to note with Microsoft SQL Server is that index fragmentation is sometimes irrelevant. When I'm choosing whether or not to rebuild an index, I look at the page count combined with the fragmentation: if I'm having performance issues with a query that uses the index, the index has more than 16000 pages, and it is more than 50% fragmented, I'll rebuild it. If the table is small, or if I can use the online option, I'll just go ahead and rebuild all of them at the same time.
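That heuristic can be sketched as a query against the physical-stats DMV (the thresholds are the ones mentioned above; adjust to taste):

```sql
-- Rebuild candidates: indexes with > 16000 pages and > 50% fragmentation.
SELECT OBJECT_NAME(ps.object_id)        AS table_name,
       i.name                           AS index_name,
       ps.page_count,
       ps.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') ps
JOIN sys.indexes i
  ON i.object_id = ps.object_id AND i.index_id = ps.index_id
WHERE ps.page_count > 16000
  AND ps.avg_fragmentation_in_percent > 50;
```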
Specifically for Azure, my opinion is that if you are trying to improve performance, it's still a good step to take because it's so easy, even if you can't be sure of the results. Whether or not it's a shared service, and whether or not you can control the hardware layer, reviewing index fragmentation and rebuilding indexes are things you have access to, so why not make use of them?
So I guess the short answer is yes, in certain situations.
What I would suggest, rather than manually reviewing indexes and rebuilding them, is to set up a nightly or weekly job that runs when your DB is least active. Have it go through all the tables and rebuild their indexes. You can also give it a set running time if you have lots of tables, and make it "stateful" (you can use a table to retain progress info) so it remembers where it left off and resumes at the next scheduled run.
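A stateful job along those lines might look like this sketch (the progress table and its columns are hypothetical names):

```sql
-- One-time setup: a progress table, seeded so the first run starts from 0.
CREATE TABLE dbo.IndexRebuildProgress (last_object_id INT NOT NULL);
INSERT INTO dbo.IndexRebuildProgress VALUES (0);

-- Job body: resume after the last table processed in the previous run.
DECLARE @last INT = (SELECT last_object_id FROM dbo.IndexRebuildProgress);
DECLARE @obj INT, @sql NVARCHAR(MAX);

DECLARE tables_cur CURSOR FOR
    SELECT object_id FROM sys.tables
    WHERE object_id > @last
    ORDER BY object_id;
OPEN tables_cur;
FETCH NEXT FROM tables_cur INTO @obj;
WHILE @@FETCH_STATUS = 0
BEGIN
    SET @sql = N'ALTER INDEX ALL ON '
             + QUOTENAME(OBJECT_SCHEMA_NAME(@obj)) + N'.'
             + QUOTENAME(OBJECT_NAME(@obj)) + N' REBUILD;';
    EXEC sp_executesql @sql;
    UPDATE dbo.IndexRebuildProgress SET last_object_id = @obj;
    -- A real job would also check elapsed time here and break out when the
    -- maintenance window closes; reset last_object_id to 0 once all done.
    FETCH NEXT FROM tables_cur INTO @obj;
END
CLOSE tables_cur;
DEALLOCATE tables_cur;
```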
One index in one of my databases has the strange behaviour of getting slower as time goes by, even though my maintenance plan includes a 'Rebuild index' step on all user databases. After a while it gets so slow that my entire application/server grinds to a halt.
But when I do a manual rebuild on the particular index, the query time is brought down from minutes to half a second.
Why does the 'rebuild index' step of the maintenance plan seem to skip this index, and why does it work manually? (the maintenance plan runs correctly without errors, every night)
Fragmentation can make a big difference in indexing. You may need to drop the indexes altogether and re-create them.
I have a table that is recreated each night, with about 10 columns and 1,000,000 rows. The data is completely deleted and re-inserted.
I have full-text search turned on for this table, and I rebuild the full-text index each night.
Recently I noticed that my indexes get extremely fragmented by doing this. Is rebuilding my indexes after each insert the solution to this?
I'm trying to speed up my search and filtering of the data, should I just rebuild the indexes each night as well as the full text index?
Is the fragmentation really slowing me down that much?
I'm also open to other ways to improve performance on this table on a nightly basis if you have any suggestions.
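One common pattern for a nightly full reload (a sketch; dbo.MyTable and the index name are placeholders) is to disable the nonclustered indexes before the insert and rebuild everything afterwards, so each day starts with unfragmented indexes and fresh statistics:

```sql
-- Disable nonclustered indexes so the bulk insert doesn't have to
-- maintain them row by row. (Don't disable the clustered index:
-- that makes the table inaccessible.)
ALTER INDEX IX_my_filter_column ON dbo.MyTable DISABLE;

DELETE FROM dbo.MyTable;
-- ... re-insert the nightly 1,000,000 rows here ...

-- REBUILD re-enables the disabled indexes and leaves them defragmented.
ALTER INDEX ALL ON dbo.MyTable REBUILD;
```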
We currently have a SQL Agent job that runs once a week to identify highly fragmented indexes and rebuild them. For certain large indexes on large tables, this ends up causing the system to time out, as the index is unavailable during the rebuild.
We have identified a strategy that should significantly reduce the fragmentation that occurs, but that won't be implemented for some time, and it doesn't cover everything.
We looked into upgrading to the Enterprise edition, which allows for online index rebuilding. However, the cost is prohibitive for us at this point.
The indexes don't really change that much, so we can assume that they are static, at least for the most part.
I did envision a way that we could perhaps simulate the online index rebuilding. It could work as follows:
For each of the large indexes identified, run a script to:
Check the fragmentation and proceed if it exceeds a certain threshold.
Create a new index named CurrentIndex_TEMP with the same definition as the current index.
Initiate a rebuild on the original index.
Remove the temporary index.
It seems that once the temporary index has been built, it would be possible to rebuild the original index without causing any downtime, since SQL Server would have another index available to use for queries that would otherwise have used the index being rebuilt.
Iterating through this for each index would hopefully minimize the increase in overall index size, as each temporary index would be removed before any other temporary indexes were created.
This strategy would also retain the historical data on the indexes. I had originally considered a strategy of first renaming the current index, then creating it again with the original name, and then removing the index that had been renamed. This, however, would result in a loss of history.
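Sketched in T-SQL, one iteration of the proposed loop might look like this (all object names and the threshold are placeholders):

```sql
-- 1. Proceed only if fragmentation exceeds a threshold (30% here).
IF EXISTS (SELECT 1
           FROM sys.dm_db_index_physical_stats(
                    DB_ID(), OBJECT_ID('dbo.BigTable'), NULL, NULL, 'LIMITED')
           WHERE avg_fragmentation_in_percent > 30)
BEGIN
    -- 2. Duplicate index, so queries keep an index to use during the rebuild.
    CREATE NONCLUSTERED INDEX CurrentIndex_TEMP
        ON dbo.BigTable (SomeColumn);

    -- 3. Rebuild the original index (offline on Standard edition).
    ALTER INDEX CurrentIndex ON dbo.BigTable REBUILD;

    -- 4. Drop the duplicate before moving on to the next index.
    DROP INDEX CurrentIndex_TEMP ON dbo.BigTable;
END
```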
So, my question...
Is this a feasible strategy? Are there any significant problems I may run into? I understand that this will take some manual oversight from time to time, but I'm willing to accept that at this point.
Thanks for the help.
Any offline index rebuild will lock the table, so you don't gain anything by creating a duplicate index.
With great effort you can simulate online index rebuilds. You have to rebuild all indexes on the table at once.
Create a copy of the table T with identical schema ("T_new")
Rename T to T_old
Create a view T defined as select * from T_old and set up INSTEAD OF DML triggers which perform all DML on both T_old and T_new
In a background job copy over batches from T_old to T_new using the MERGE statement
Finally, after the copy is completed, perform some renaming and dropping to make T_new the new T
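A compressed sketch of those steps (T(id, val) stands in for the real schema; the UPDATE/DELETE triggers and fresh index creation on T_new are omitted for brevity):

```sql
-- Assume T(id INT PRIMARY KEY, val INT). All names are placeholders.
SELECT * INTO T_new FROM T WHERE 1 = 0;   -- empty copy of T's columns
EXEC sp_rename 'T', 'T_old';
GO
CREATE VIEW T AS SELECT * FROM T_old;
GO
-- INSTEAD OF trigger: apply incoming inserts to both copies.
CREATE TRIGGER trg_T_insert ON T INSTEAD OF INSERT AS
BEGIN
    INSERT INTO T_old SELECT * FROM inserted;
    INSERT INTO T_new SELECT * FROM inserted;
END;
GO
-- Background job: migrate existing rows over in batches.
MERGE T_new AS tgt
USING (SELECT TOP (10000) * FROM T_old ORDER BY id) AS src
    ON tgt.id = src.id
WHEN NOT MATCHED THEN
    INSERT (id, val) VALUES (src.id, src.val);
-- Finish: DROP VIEW T; DROP TABLE T_old; EXEC sp_rename 'T_new', 'T';
```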
This requires an insanely high effort and good testing, but you can carry out pretty much arbitrary schema changes online this way.
There have been several questions recently about database indexing and clustered indexing and it has been kind of new to me until the last couple weeks. I was wondering how important it is and what kind of performance gains can be expected from creating them.
Edit: What is usually the best type of fields to look at when putting in a clustered index when you are first starting out?
Very, very important. In my opinion, wise indexing is the absolute most important thing in DB performance optimization.
This is not an easy topic to cover in a single answer. Good indexing requires knowledge of the queries that will hit the database, making a large number of trade-offs, and understanding the implications of a specific index in the specific DB engine. But it's very important nevertheless.
EDIT: Basically, clustered indexes should usually have short keys, should be created to serve range queries, and should not have duplicate entries. But these guidelines are very general and are by no means always the right thing. The right thing is to analyze the queries that are going to be executed, carefully benchmark, analyze execution plans, and understand what the best approach is. That requires years of experience and knowledge and is by no means something that can be explained in a single paragraph. It's the primary thing that makes DB experts experts (it's not the only thing, but it's a prerequisite for other important things, such as concurrency issues, availability, ...)!
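To make the "range queries" guideline concrete, a sketch (table and column names are hypothetical): clustering on a date column keeps rows for a date range physically adjacent, so a range scan reads contiguous pages.

```sql
CREATE TABLE dbo.Orders (
    order_id   INT IDENTITY,
    order_date DATE NOT NULL,
    amount     DECIMAL(10, 2) NOT NULL
);

-- Short key that matches the range queries; order_id is appended
-- to keep the key unique.
CREATE CLUSTERED INDEX CX_Orders_order_date
    ON dbo.Orders (order_date, order_id);

-- This range query can now scan one contiguous slice of the table.
SELECT order_id, amount
FROM dbo.Orders
WHERE order_date >= '2024-01-01' AND order_date < '2024-02-01';
```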
Indexing: extremely important. Having the wrong indexes makes queries slower, sometimes to the point that they can't be completed in a sensible time.
Indexes also impact insert performance and disc usage (negatively), so keeping lots of superfluous indexes around on large tables is a bad idea too.
Clustering is something worth thinking about, I think it's really dependent on the behaviour of the specific database. If you can cluster your data correctly, you can dramatically reduce the amount of IOPs required to satisfy requests for rows not in memory.
Without proper indexes, you force the RDBMS to do table scans to query for anything. Terribly inefficient.
I'd also infer that you don't have primary keys, which is a cardinal sin in relational design.
Indexing is very important when the table contains many rows.
With a few rows, performance is better without indexes.
With larger tables indexes are very important to get good performance.
It is not easy to define them well. Clustered means that the data are stored in the clustered index order.
To get good index suggestions, you could use a tool such as Toad.
Indexing is vitally important.
The right index for a query can improve performance so dramatically it can seem like witchcraft.
As the other answers have said, indexing is crucial.
As you might infer from other answers, clustered indexing is much less crucial.
Decent indexing gives you first order performance gains - orders of magnitude are common.
Clustered indexing is a second order or incremental performance gain - usually giving small (<100%) percentages of performance increase.
(We also get into questions of 'what is a 100% performance gain'; I'm interpreting the percentage as ((oldtime - newtime)/newtime) * 100, so if the old time is 10 seconds and the new time is 5 seconds, the performance increase is 100%.)
Different DBMS have different interpretations of what a clustered index means. Beware.
In particular, some DBMS cluster the data once and thereafter, the clustering decays over time until the data is reclustered. Others take a more active view of clustering, I believe.
The clustered index is usually, but not always, your primary key. One way of looking at a clustered index is to think of the data as being physically ordered based on the values of the clustered index.
This may very well not be the case in reality; however, referencing clustered indexes usually gets you the following performance bonuses anyway:
1. All columns of the table are accessible for free when resolved from a clustered index hit, as if they were contained within a covering index (i.e. a query resolvable using just the index data, without having to reference the data pages of the table itself).
2. Update operations can be made directly against a clustered index without intermediate processing. If you are doing a lot of updates against a table, you usually want to be referencing the clustered columns.
3. Depending on the implementation, there may be a sequential access benefit, where data stored on disk is retrieved quicker with fewer expensive disk seek operations.
4. Depending on the implementation, there may be a free index benefit, where a physical index is not necessary because data access can be resolved via simple guessing-game algorithms.
Don't count on #3 and especially #4. #1 and #2 are usually safe bets on most RDBMS platforms.
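Benefit #1 shows up in a simple lookup (names here are placeholders): when the WHERE clause hits the clustered key, every column comes back with no extra lookup, whereas the same query through a nonclustered index would need a key lookup for any column the index doesn't cover.

```sql
CREATE TABLE dbo.Customers (
    customer_id INT PRIMARY KEY,   -- clustered by default in SQL Server
    name        NVARCHAR(100),
    email       NVARCHAR(200),
    created_at  DATETIME2
);

-- Resolved entirely from the clustered index: all columns are "free".
SELECT customer_id, name, email, created_at
FROM dbo.Customers
WHERE customer_id = 42;
```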