I have a table with over 1 million entries.
The problem is with the speed of the SELECT queries. This one is very fast:
SELECT *
FROM tmp_pages_data
WHERE site_id = 14294
Showing rows 0 - 29 (1,273,042 total, Query took 0.0009 sec)
And this one is very slow:
SELECT *
FROM tmp_pages_data
WHERE page_status = 0
Showing rows 0 - 29 (15,394 total, Query took 0.3018 sec)
There is an index on the id column only, which is not needed in either of the selects. So there is no index on site_id or page_status.
The 0.30-second query is very disturbing, especially when there are thousands of requests.
So how can this be possible? What can I do to see what's slowing it down?
"What can I do to see what's slowing it down?"
It's quite obvious what is slowing it down - as you've already pointed out you don't have an index on the page_status column, and you should have one.
The only surprise is that your first query is so fast without the index. Looking more closely, it seems that whatever client you are running these queries on is adding an implicit LIMIT 30 that you aren't showing in your question. Because so many rows match, it doesn't take long to find the first 30 of them, at which point the scan can stop and return the result. However, your second query matches far fewer rows, so it takes longer to find them. Adding the index would solve this problem and make your query almost instant.
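If that client is phpMyAdmin or similar, the first query is effectively being run with a paging limit along these lines (the LIMIT 30 is an assumption based on the "Showing rows 0 - 29" output):

SELECT *
FROM tmp_pages_data
WHERE site_id = 14294
LIMIT 30 -- paging limit appended implicitly by the client (assumed)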
Short answer: add an index on the column page_status.
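A minimal sketch of that (the index name is a placeholder; pick one that fits your conventions):

CREATE INDEX idx_page_status ON tmp_pages_data (page_status);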
Ok, from our discussion in the comments we now know that the db somehow knows that the first query will return all rows. That's why it's so fast.
The second query is slow because it doesn't have an index. OMG Ponies already stated that a normal index won't work because the value set is too small. I'd just like to point you to 'bitmap indexes'. I've not used them myself yet but they are known to be designed for exactly this case.
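For context, bitmap indexes are a feature of Oracle (and some other engines) rather than of MySQL's InnoDB; where they are available, the syntax looks roughly like this (the index name is a placeholder):

CREATE BITMAP INDEX idx_page_status_bm ON tmp_pages_data (page_status);
-- suited to low-cardinality columns, but expensive under concurrent writes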
Related
I am an operations guy tasked with pulling data from a very large table. I'm not a DBA and cannot partition it or change the indexing. The table has nearly a billion entries, is not partitioned, and could probably be indexed "better". I need two fields, which we'll call mod_date and obj_id (mod_date is indexed). EDIT: I also added a filter for 'client', which I've blurred out in my screenshot of the explain plan.
My data:
Within the group of almost a billion rows, we have fewer than 10,000 obj_id values to query across several years (a few might even be NULL). Some of the <10k obj_ids -- probably between 1,000 and 2,500 -- have more than 10 million mod_date values each. When the obj_ids have over a few million mod_dates, each obj_id takes several minutes to scan and sort using MAX(mod_date). The full result set takes over 12 hours to query, and no one has made it to completion without some "issue" (locked out, unplugged laptop, etc.). Even if we got the first 50 rows returned we'd still need to export to Excel ... it's only going to be about 8,000 rows with 2 columns, but we can never make it to the end.
So here is a simplified query I'd use if it were a small table:
select MAX(trunc(mod_date,'dd')) as last_modified_date, obj_id
from my_table
where client = 'client_name'
and obj_type_id = 12
group by obj_id;
Cardinality is 317917582, "Cost" is 12783449
The issue:
The issue is the speed of the query with such a large unpartitioned table, given the current indexes. All the other answers I've seen about "most recent date" tend to use MAX, possibly in combination with FIRST_VALUE, which seem to require a full scan of all rows in order to sort them and then determine which is the most recent.
I am hoping there is a way to avoid that, to speed up the results. It seems that Oracle (I am using Oracle SQL Developer) should be able to take an obj_id, look for the most recent mod_date row starting from "now" and working backwards, and move on as soon as it finds any mod_date value … because it's a date. Is there a way to do that?
Even with such a large table, obj_ids having fewer than 10,000 mod_dates can return the MAX(mod_date) very quickly (seconds or less). The issue we are having is the obj_ids having the most mod_dates (over 10 million) take the longest to scan and sort, when they "should" be the quickest if I could get Oracle to start looking at the most recent first … because it would find a recent date quickly and move on!
First, I'd say it's a common misconception that in order to make a query run faster, you need an index (or better indexes). A full table scan makes sense when you're pulling more than about 10% of the data (a rough estimate; it depends on multiblock read count, block size, etc.).
My advice is to set up a materialized view (MY_MV or whatever) that simply does the group by query (across all ids). If you need to limit the ids to a 10k subset, just make sure you full scan the table (check the explain plan). You can add a FULL hint if needed (select /*+ full(t) */ .. from big_table t ...)
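A rough sketch of such an MV, reusing the simplified query from the question (BUILD DEFERRED so the initial population happens in the refresh call below; the rest is an assumption about your environment):

CREATE MATERIALIZED VIEW MY_MV
BUILD DEFERRED
REFRESH COMPLETE ON DEMAND
AS
SELECT /*+ full(t) */
       MAX(trunc(mod_date,'dd')) AS last_modified_date,
       obj_id
FROM   my_table t
WHERE  client = 'client_name'
AND    obj_type_id = 12
GROUP BY obj_id;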
Then do:
exec dbms_mview.refresh('MY_MV','C',atomic_refresh=>false);
That's it. No issues with a client only returning the first x rows and then re-running the entire query when you go to pull everything (ugh). Full scans are also easier to track in long ops (it's harder to tell what progress you've made if you are doing nested loops on an index, for example).
Once it's done, dump the entire MV table to a file or whatever you need.
tbone has it right, I think. Or, if you do not have the authority to create a materialized view as he suggests, you might create a shell script on the database server to run your query via SQL*Plus and spool the output to a file. Then run that script using nohup and you shouldn't need to worry about laptops getting turned off, etc.
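A minimal sketch of that script (the file name, SET options and spool target are placeholders; the query is the simplified one from the question). You would launch it with something like nohup sqlplus -s user/pwd @export_max_dates.sql & so it survives a dropped session:

-- export_max_dates.sql
SET PAGESIZE 0
SET LINESIZE 200
SET TRIMSPOOL ON
SET FEEDBACK OFF
SPOOL max_mod_dates.txt

SELECT MAX(trunc(mod_date,'dd')) AS last_modified_date, obj_id
FROM my_table
WHERE client = 'client_name'
AND obj_type_id = 12
GROUP BY obj_id;

SPOOL OFF
EXIT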
But I wanted to explain something about your comment:
Oracle should be able to take an obj_id, look for the most recent mod_date row starting from "now" and working backwards, and move on as soon as it finds any mod_date value … because it's a date. Is there a way to do that?
That would be a horrible way for Oracle to run your query, given the indexes you have listed. Let's step through it...
There is no index on obj_id, so Oracle needs to do a full table scan to make sure it gets all the distinct obj_id values.
So, it starts the FTS and finds obj_id 101. It then says "I need max(mod_date) for 101... ah ha! I have an index!" So, it does a reverse index scan. For each entry in the index, it looks up the row from the table and checks whether it is obj_id 101. If the obj_id was recently updated, we're good, because we find it and stop early. But if the obj_id has not been updated in a long time, we have to read many index entries and, for each, access the table row(s) to perform the check.
In the worst case, if the obj_id is one of those few you mentioned where max(mod_date) will be NULL, we would use the index to look up EVERY SINGLE ROW in your table that has a non-null mod_date.
Doing so many index lookups would be an awful plan if it did that just once, but you're talking about doing it for several old or never-updated obj_id values.
Anyway, it's all academic. There is no Oracle query plan that will run the query that way. It's for good reason.
Without better indexing, you're just not going to improve upon a single full table scan.
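For what it's worth, "better indexing" here would mean a composite index covering the filter, group-by and aggregate columns, so the whole query could be answered from the index without touching the table (the index name is a placeholder, and this assumes someone with the rights to create it):

CREATE INDEX idx_client_type_obj_mod
ON my_table (client, obj_type_id, obj_id, mod_date);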
I have 2 queries taking too long and timing out when run inside an Azure website.
1st.
SELECT Value FROM SEN.ValueTable WHERE OptId = #optId
2nd
INSERT INTO SEN.ValueTable (Value, OptId)
SELECT Value, OptId FROM REF.ValueTable WHERE OptId = #optId
Both SELECTs will always return 7,860 values. The problem is that I run around 10 of these queries with different #optId values. First I ran without any indexes, and the 1st query would time out every now and then. I then added a non-clustered index to SEN.ValueTable, and then the 2nd query began to time out.
1st Query from an Azure VM
2nd Query from an Azure Web App
I've tried to increase the timeout through the .config files, but they still time out within 30 seconds. (There is no time limit from the customer; retrieving data from the SQL database will not be the slow part of the application anyway.)
Is there any way to speed it up / get rid of the timeouts? Will indexing REF.ValueTable speed up the insert at all?
First, the obvious solution is to add an index to SEN.ValueTable(OptId, Value) and to have no index on REF.ValueTable(OptId, Value). I think this gets around your performance problem.
More importantly, it should not be taking 30 seconds to fetch or insert 7,860 rows -- nothing like that. So, what else is going on? Is there a trigger on REF.ValueTable that might be slowing things down? Are there other constraints? Are the columns particularly wide? I mean, if Value is VARCHAR(MAX) and normally 100 Mbytes, then inserting values might be an issue.
If you really run such a query:
SELECT Value, OptId
FROM REF.ValueTable
WHERE OptId = #optId;
The best index for it would be the following:
CREATE INDEX idx_ValueTable_OptId_Value
ON REF.ValueTable (OptId)
INCLUDE (Value);
Any index will slow inserts down, but will benefit read queries. If you want a more elaborate answer, post more details: table DDLs and execution plans.
Try resumable online index rebuild -
https://azure.microsoft.com/en-us/blog/resumable-online-index-rebuild-is-in-public-preview-for-azure-sql-db/
I have a query that returns me 1400 rows.
It's basic.
SELECT * FROM dbo.entity_event ee
That is taking between 250 and 380 ms, averaging around 360 ms. I'd expect this to be much quicker. I am running it on a laptop, though: an i7, 8 GB RAM, SSD. Does that seem normal, or should it be quicker? The table only has the total result set. No WHERE clause.
Running:
SELECT * FROM dbo.entity_event ee WHERE entity_event_type_id = 1
Takes the same time.
There is a clustered index on the primary key (id) in the table.
The table has around 15 columns. Mainly dates, ints and decimal(16,2).
If it seems slow, what can I look at to improve this? I do expect the table to become rather large as the system gets used.
I'm unable to track this on my live site, as the host doesn't allow SQL Profiler to connect (permissions). I can only check on my dev machine. It doesn't seem quick when live, though.
The issue originates from a VIEW that I have, which takes 643 ms on average. It has 8 joined tables, 4 of which are OUTER joins, ruling out the option of an indexed view.
The view, however, does use the column names, including other logic (CASE... ISNULL etc).
EDIT: I notice that SELECT TOP 10 ... takes 1 ms. SELECT TOP 500 ... takes 161 ms. So it does seem linear, and related to the volume of data. I'm wondering if this can be improved?
Execution plan:
This looks normal to me, as it appears it's doing a full scan regardless of the WHERE clause, because entity_event_type_id is not your PK and is not indexed. So adding that WHERE clause will not help, since the column is not indexed.
Doing the TOP N will help because it knows it doesn't need to scan the whole clustered PK index.
So... yes, it can be improved: if you index entity_event_type_id you would expect a faster time with that WHERE clause. But as always, index it only if you really will be using that field in WHERE clauses often.
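A minimal sketch of such an index (the name is a placeholder):

CREATE NONCLUSTERED INDEX IX_entity_event_type
ON dbo.entity_event (entity_event_type_id);

Keep in mind that with SELECT * the engine still has to fetch the remaining columns via key lookups, so this pays off mainly when the filter is selective.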
--Jim
I found a very strange behaviour for which I have no explanation. We have a simple table with around 450,000 entries (MSSQL 2008 R2).
The indexes for this table are very simple:
Index #1 contains:
[OwnerUserID] -> int, 4 byte
[TargetQuestionID] -> int, 4 byte
[LastChange] -> date, 8 byte
Index #2 contains:
[LastChange] -> date, 8 byte
[OwnerUserID] -> int, 4 byte
[TargetQuestionID] -> int, 4 byte
As you can see, the difference is only in the order of the columns; in both indexes, the leaves have the same size, 16 bytes (far from what I've seen some DBAs do on really big databases).
The queries are simple:
Query #1:
- Asks just for the most recently entered element (TOP(1)) ordered by LastChange, so it takes only LastChange into account
Query #2:
- Asks just for the most recently entered element (TOP(1)) per distinct OwnerUserID, so it takes OwnerUserID and LastChange into account
Results are:
Index #1 is super slow for query #1, although I thought it should be OK since the data leaves are really not big (16 bytes)
Index #2 is super slow for query #2 (but since that query takes only two values into account, OwnerUserID + LastChange = 12 bytes, I do not see any reason why it should be much slower/faster)
Our idea was to have only one index, but since the performance for each query scenario differs by 10 - 11 times, we ended up creating BOTH of these indexes in parallel where we thought we could go with one, since the index is not so big/complex that you would expect this slight difference in column order to hurt.
So now we are wasting double the space, and since the table grows by around 10k rows per day, we will have disk-space issues at some point in the future...
First, I thought this was because of some internal NHibernate issue, but we checked in Performance Monitor and the results are absolutely reproducible.
It seems like MSSQL index performance depends highly on how datetime columns are used, since this simple example shows that this alone can ruin performance :-/
Commonly, indices are used to make a fast binary search possible instead of a slow sequential search. To achieve this they store the index keys in sorted order or in a tree. But a binary search is only possible if the start of the key is known, and thus the order of the elements is important. In your case this means:
Query #1 needs the record with the latest LastChange. This query can be optimized with an index that starts with LastChange, e.g. Index #2. With Index #1 it needs to fall back to a sequential search.
Query #2 first needs to find all unique OwnerIds, and an index that starts with OwnerId can help here. Then it needs to find the latest LastChange for a specific OwnerId. Index #1 does not help anymore at that point, because the next field in the index is not LastChange. Index #2 might help if there are lots of records for the same OwnerId. Otherwise it will probably do a sequential search.
So for an index, the order of fields should match the queries. Also, you might need to update your statistics so that the query planner has an idea whether it is better to do a sequential search (few entries per OwnerId) or to use Index #2 (lots of entries per OwnerId). I don't know if and how this can be done with MySQL; I only know it from PostgreSQL.
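To make the "order of fields" point concrete, index definitions lined up with each query would look roughly like this in T-SQL (the table name is a placeholder, since it isn't given in the question):

-- serves query #1: TOP(1) ordered by LastChange
CREATE INDEX IX_LastChange
ON dbo.MyTable (LastChange DESC);

-- serves query #2: latest entry per OwnerUserID
CREATE INDEX IX_Owner_LastChange
ON dbo.MyTable (OwnerUserID, LastChange DESC);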
An index is always a trade-off: it slows down inserts but speeds up queries. So how many indices you have and how they are constructed depends highly on your application.
Let's suppose I have a table in my database with 1.000.000 records.
If I execute:
SELECT * FROM [Table] LIMIT 1000
Will this query take the same time as if I have that table with 1000 records and just do:
SELECT * FROM [Table]
?
I'm not looking for if it will take exactly the same time. I just want to know if the first one will take much more time to execute than the second one.
I said 1.000.000 records, but it could be 20.000.000. That was just an example.
Edit:
Of course, when comparing queries with and without LIMIT on the same table, the one using LIMIT should execute faster, but I'm not asking that...
To make it generic:
Table1: X records
Table2: Y records
(X << Y)
What I want to compare is:
SELECT * FROM Table1
and
SELECT * FROM Table2 LIMIT X
Edit 2:
Here is why I'm asking this:
I have a database with 5 tables and relationships between some of them. One of those tables will (I'm 100% sure) contain about 5.000.000 records. I'm using SQL Server CE 3.5, Entity Framework as the ORM, and LINQ to SQL to make the queries.
I need to perform basically three kinds of non-simple queries, and I was thinking about showing the user a limited number of records (just like lots of websites do). If the user wants to see more records, the option he/she has is to narrow the search.
So, the question came up because I was deciding between doing this (limiting to X records per query) or storing only X results (the most recent ones) in the database, which would require doing some deletions, but I was just thinking...
So, that table could contain 5.000.000 records or more, and what I don't want is to show the user only 1000 or so and still have the query be as slow as if it were returning all 5.000.000 rows.
TAKE 1000 from a table of 1,000,000 records will be roughly 1,000,000/1,000 (= 1,000) times faster, because it only needs to look at (and return) 1,000 of the 1,000,000 records. Since it does less, it is naturally faster.
The result will be pretty (pseudo-)random, since you haven't specified any order in which to TAKE. However, if you do introduce an order, then one of the two cases below becomes true:
The ORDER BY clause follows an index - the above statement is still true.
The ORDER BY clause cannot use any index - it will be only marginally faster than without the TAKE, because
it has to inspect ALL records, and sort by ORDER BY
deliver only a subset (TAKE count)
so it is not faster in the first step, but the 2nd step involves less IO/network than ALL records
If you TAKE 1000 records from a table of 1000 records, it will be equivalent (with no significant difference) to TAKING 1000 records from 1 billion, as long as you are in case (1), no ORDER BY, or case (2), ORDER BY against an index.
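As an illustration of the indexed vs. non-indexed ORDER BY cases (hypothetical table and columns; created_at is assumed to be indexed, total_amount is not):

-- fast: the ORDER BY follows an index, so the engine reads the first 1000 index entries and stops
SELECT * FROM orders ORDER BY created_at DESC LIMIT 1000

-- slower on a big table: no usable index, so every row must be read and sorted before the first 1000 are returned
SELECT * FROM orders ORDER BY total_amount DESC LIMIT 1000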
Assuming both tables are equivalent in terms of index, row-sizing and other structures. Also assuming that you are running that simple SELECT statement. If you have an ORDER BY clause in your SQL statements, then obviously the larger table will be slower. I suppose you're not asking that.
If X = Y, then obviously they should run in similar speed, since the query engine will be going through the records in exactly the same order -- basically a table scan -- for this simple SELECT statement. There will be no difference in query plan.
If Y > X only by a little bit, then also similar speed.
However, if Y >> X (meaning Y has many, many more rows than X), then the LIMIT version MAY be slower. Not because of the query plan -- again, it should be the same -- but simply because the internal data layout may have several more levels. For example, if data is stored as leaves on a tree, there may be more tree levels, so it may take slightly more time to access the same number of pages.
In other words, 1000 rows may be stored in 1 tree level in 10 pages, say. 1000000 rows may be stored in 3-4 tree levels in 10000 pages. Even when taking only 10 pages from those 10000 pages, the storage engine still has to go through 3-4 tree levels, which may take slightly longer.
Now, if the storage engine stores data pages sequentially or as a linked list, say, then there will be no difference in execution speed.
It would be approximately linear, as long as you specify no fields, no ordering, and all the records. But that doesn't buy you much. It falls apart as soon as your query wants to do something useful.
This would be quite a bit more interesting if you intended to draw some useful conclusion and tell us about the way it would be used to make a design choice in some context.
Thanks for the clarification.
In my experience, real applications with real users seldom have interesting or useful queries that return entire million-row tables. Users want to know about their own activity, or a specific forum thread, etc. So unless yours is an unusual case, by the time you've really got their selection criteria in hand, you'll be talking about reasonable result sizes.
In any case, users wouldn't be able to do anything useful with many rows over several hundred, transporting them would take a long time, and they couldn't scroll through it in any reasonable way.
MySQL has the LIMIT and OFFSET (starting record #) modifiers primarily for the exact purpose of creating chunks of a list for paging, as you describe.
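For example, a page of 50 rows starting at the 101st record would look like this (hypothetical table and columns):

SELECT id, title
FROM articles
ORDER BY id
LIMIT 50 OFFSET 100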
It's way counterproductive to start thinking about schema design and record purging until you've used up this and a bunch of other strategies. In this case don't solve problems you don't have yet. Several-million-row tables are not big, practically speaking, as long as they are correctly indexed.