Here is my query:
select word_id, count(sentence_id)
from sentence_word
group by word_id
having count(sentence_id) > 100;
The table sentence_word contains three fields: word_id, sentence_id, and a primary key id.
It has 350k+ rows.
This query takes a whopping 85 seconds and I'm wondering (hoping, praying?) there is a faster way to find all the word_ids that have more than 100 sentence_ids.
I've tried taking the count out of the select list, and just doing 'having count(1)', but neither change speeds it up.
I'd appreciate any help you can lend. Thanks!
If you don't already have one, create a composite index on (word_id, sentence_id).
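A minimal sketch of that index (MySQL syntax assumed; the index name is illustrative). With word_id leading, the GROUP BY can walk the index in order, and including sentence_id makes the index cover the whole query:
CREATE INDEX idx_word_sentence ON sentence_word (word_id, sentence_id);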
having count(sentence_id) > 100;
There's a problem with this... Either the table has duplicate word/sentence pairs, or it doesn't.
If it does have duplicate word/sentence pairs, you should be using this code to get the correct answer:
HAVING COUNT(DISTINCT Sentence_ID) > 100
If the table does not have duplicate word/sentence pairs... then you shouldn't count sentence_ids, you should just count rows.
HAVING COUNT(*) > 100
In which case, you can create an index on word_id only, for optimum performance.
If that query is often performed, and the table rarely updated, you could keep an auxiliary table with word ids and corresponding sentence counts -- hard to think of any further optimization beyond that!
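As a rough sketch of that auxiliary table, assuming MySQL and illustrative names; it would need to be refreshed whenever sentence_word changes (e.g. from a scheduled job):
CREATE TABLE word_sentence_counts (
    word_id        INT NOT NULL PRIMARY KEY,
    sentence_count INT NOT NULL
);

-- periodic refresh
TRUNCATE TABLE word_sentence_counts;
INSERT INTO word_sentence_counts (word_id, sentence_count)
SELECT word_id, COUNT(*)
FROM sentence_word
GROUP BY word_id;

-- the original question then becomes a cheap lookup
SELECT word_id FROM word_sentence_counts WHERE sentence_count > 100;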
Your query is fine, but it needs a bit of help (indexes) to get faster results.
I don't have my resources at hand (or access to SQL), but I'll try to help you from memory.
Conceptually, the only way to answer that query is to count all the records that share the same word_id. That means that the query engine needs a fast way to find those records. Without an index on word_id, the only thing the database can do is go through the table one record at a time and keep running totals of every single distinct word_id it finds. That would usually require a temporary table and no results can be dispatched until the whole table is scanned. Not good.
With an index on word_id, it still has to go through the table, so you would think it wouldn't help much. However, the SQL engine can now compute the count for each word_id without waiting until the end of the table: it can dispatch the row and the count for that value of word_id (if it passes your HAVING clause), or discard the row (if it doesn't); that will result in lower memory load on the server, possibly partial responses, and the temporary table is no longer needed. A second aspect is parallelism; with an index on word_id, SQL can split the job into chunks and use separate processor cores to run the query in parallel (depending on hardware capabilities and existing workload).
That might be enough to help your query; but you will have to try to see:
CREATE INDEX someindexname ON sentence_word (word_id)
(T-SQL syntax; you didn't specify which SQL product you are using)
If that's not enough (or doesn't help at all), there are two other solutions.
First, SQL allows you to precompute the COUNT(*) by using indexed views and other mechanisms. I don't have the details at hand (and I don't do this often). If your data doesn't change often, that would give you faster results but with a cost in complexity and a bit of storage.
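From memory, the SQL Server flavour of that idea looks roughly like this; treat it as a sketch (indexed views require SCHEMABINDING and COUNT_BIG, and the names are illustrative):
CREATE VIEW dbo.word_sentence_counts
WITH SCHEMABINDING
AS
SELECT word_id, COUNT_BIG(*) AS sentence_count
FROM dbo.sentence_word
GROUP BY word_id;
GO

-- the unique clustered index materializes the counts so they are maintained automatically on every insert/update
CREATE UNIQUE CLUSTERED INDEX IX_word_sentence_counts
ON dbo.word_sentence_counts (word_id);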
Also, you might want to consider storing the results of the query in a separate table. That is practical only if the data never changes, or changes on a precise schedule (say, during a data refresh at 2 in the morning), or if it changes very little and you can live with non perfect results for a few hours (you would have to schedule a periodic data refresh); that's the moral equivalent of a poor-man's data warehouse.
The best way to find out for sure what works for you is to run the query and look at the query plan with and without some candidate indexes like the one above.
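For example, in MySQL you would prefix the query with EXPLAIN (in SQL Server you would look at the graphical execution plan instead):
EXPLAIN
SELECT word_id, COUNT(sentence_id)
FROM sentence_word
GROUP BY word_id
HAVING COUNT(sentence_id) > 100;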
There is, surprisingly, an even faster way to accomplish that on large data sets:
SELECT totals.word_id, totals.num
FROM (SELECT word_id, COUNT(*) AS num FROM sentence_word GROUP BY word_id) AS totals
WHERE num > 100;
I would like to know which statement (see below) will be more efficient for determining the size of a cluster table, or at least for determining whether the table size reaches a certain threshold {n}.
By efficiency I mean using less PSAPTEMP tablespace.
The problem with cluster tables is that, in order to read one entry of a logical table, its fields need to be looked up in several tables of the cluster, where they are dispersed. So more than just the counted table needs to be read, and for every entry several entries have to be looked up. This makes reads inefficient, and it can also cause a dump because the COUNT uses an INT datatype that can overflow.
SELECT COUNT(*)
...
UP TO {n} ROWS.

versus

SELECT *
...
UP TO {n} ROWS.
ENDSELECT. " and then determine the size of the result
To me they seem equivalent, but maybe they are not when using a threshold. Maybe the limitation makes a difference depending on how the data is read. EDIT: Of course, SELECT ... ENDSELECT is a loop and is thus in principle less efficient.
But I would like to know how it actually works under the hood and understand the difference better. So far it seems like I will have to try it out.
I assume the database will differ but will most often be Oracle.
We could not really create the test environment we needed, so there is no final answer. But some learnings:
Reading from cluster tables should be done with the full primary key sequence (access via the primary key gives very fast retrieval; otherwise it is very slow).
There are no secondary indexes.
SELECT * is OK because all columns are retrieved anyway. Performing an operation on multiple rows is more efficient than single-row operations, so you still want to select into an internal table.
If many rows are being selected into the internal table, you might still want to retrieve only specific columns to cut down on the memory required.
There is a way to convert a cluster table to a transparent one, but that requires downtime, and this was not an option for us.
Aggregate SQL functions (SUM, AVG, MIN, MAX, etc.) are not supported.
Basically, SELECT ... ENDSELECT will run a loop, and there will be multiple trips to the DB server.
Technically, SELECT COUNT(*) does all the work on the DB server itself, in one shot.
After that you can simply put the data in an internal table and work on it.
As per the standards, SELECT ... ENDSELECT is not recommended at all, even for normal transparent tables, let alone cluster tables.
Access to cluster tables is very expensive. Also, to make matters worse, you cannot use any indexes on cluster tables. It's always better to provide as much data in the WHERE clause as possible.
The priority is always to fetch the data from the database server in one shot, using
select * from table into table where ....
and then loop over it locally.
Specifically, in your use case it will be fastest to use COUNT(*) and not SELECT ... ENDSELECT.
Using native SQL with COUNT_BIG instead of COUNT does not make it any more memory efficient, but it does prevent the dump caused by a counter overflow.
I found that a table has 50 thousand records, and it takes one minute to fetch the data from the SQL Server table just by issuing a SQL statement. There is a primary key, which means a clustered index is already there. I just do not understand why it takes one minute. Besides an index, what are the ways to optimize a table to get the data faster? In this situation, what do I need to do for a faster response? Also, tell me how to always write optimized SQL. Please tell me all the steps for optimization in detail.
Thanks.
The fastest way to optimize the indexes on a table is to use the SQL Server Tuning Advisor. Take a look here: http://www.youtube.com/watch?v=gjT8wL92mqE
Select only the columns you need, rather than select *. If your table has some large columns e.g. OLE types or other binary data (maybe used for storing images etc) then you may be transferring vastly more data off disk and over the network than you need.
As others have said, an index is no help to you when you are selecting all rows (no where clause). Using an index would be slower in such cases because of the index read and table lookup for each row, vs full table scan.
If you are running select * from employee (as per question comment) then no amount of indexing will help you. It's an "Every column for every row" query: there is no magic for this.
Adding a WHERE clause usually won't help a select * query either.
What you can check is index and statistics maintenance. Do you do any? Here's a Google search
Or change how you use the data...
Edit:
Why a WHERE clause usually won't help...
If you add a WHERE that is not on the PK...
you'll still need to scan the table unless you add an index on the searched column
then you'll need a key/bookmark lookup for each match unless you make the index covering
with SELECT * you would need to add all columns to the index to make it covering
for many hits, the index will probably be ignored to avoid key/bookmark lookups.
Unless there is a network issue or similar, the issue is reading all columns, not the lack of a WHERE clause.
If you did SELECT col13 FROM MyTable and had an index on col13, the index would probably be used.
For SELECT * FROM MyTable WHERE DateCol < '20090101' with an index on DateCol that matched 40% of the table, the index would probably be ignored, or you'd pay for expensive key/bookmark lookups.
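As a hedged illustration of the covering-index point (SQL Server syntax; MyTable, col13 and DateCol are the placeholder names from above, and the INCLUDE list is an assumption):
-- every selected column must be in the key or the INCLUDE list for the index to be covering
CREATE NONCLUSTERED INDEX IX_MyTable_DateCol_covering
ON MyTable (DateCol)
INCLUDE (col13);

-- this variant can be answered from the index alone, with no key/bookmark lookups
SELECT col13
FROM MyTable
WHERE DateCol < '20090101';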
Irrespective of the merits of returning the whole table to your application, that does sound like an unexpectedly long time to retrieve just 50,000 rows of employee data.
Does your query have an ORDER BY or is it literally just select * from employee?
What is the definition of the employee table? Does it contain any particularly wide columns? Are you storing binary data such as their CVs or employee photo in it?
How are you issuing the SQL and retrieving the results?
What isolation level are your select statements running at? (You can use SQL Profiler to check this.)
Are you encountering blocking? Does adding NOLOCK to the query speed things up dramatically?
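If you want to test the blocking theory, a quick-and-dirty check might look like the following (employee is the table from the question; NOLOCK reads uncommitted data, so treat it as a diagnostic, not a fix):
SELECT *
FROM employee WITH (NOLOCK);  -- READUNCOMMITTED: ignores locks held by other sessions
If this comes back dramatically faster than the plain SELECT, blocking is the likely culprit.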
I need help in indexing in MySQL.
I have a table in MySQL with the following columns:
ID Store_ID Feature_ID Order_ID Viewed_Date Deal_ID IsTrial
The ID is auto-generated. Store_ID goes from 1 to 8, and Feature_ID from 1 to, let's say, 100. Viewed_Date is the date and time at which the data is inserted. IsTrial is either 0 or 1. You can ignore Order_ID and Deal_ID for this discussion.
There are millions of rows in the table, and we have a reporting backend that needs the number of views in a certain period (or overall) where IsTrial is 0, for a particular store id and a particular feature.
The query takes the form of:
select count(viewed_date)
from theTable
where viewed_date between '2009-12-01' and '2010-12-31'
and store_id = '2'
and feature_id = '12'
and Istrial = 0
In SQL Server you can have a filtered index to use for IsTrial. Is there anything similar to this in MySQL? Also, Store_ID and Feature_ID have a lot of duplicate data. I created an index using Store_ID and Feature_ID. Although this seems to have decreased the search time, I need a better improvement than this. Right now I have more than 4 million rows. To answer a query like the one above, it looks at 3.5 million rows in order to give me the count of 500k rows.
PS: I forgot to add the viewed_date filter in the query. I have now done this.
Well, you could expand your index to consist of Store_ID, Feature_ID and IsTrial. You won't get any better than that, performance-wise.
My first idea would be an index on (feature_id, store_id, istrial), since feature_id seems to be the column with the highest Shannon entropy. But without knowing the statistics on feature_id, I'm not sure. Maybe you should create two indexes, with (store_id, feature_id, istrial) being the other, and let the optimizer sort it out. Using all three columns also has the advantage that the database can answer your query from the index alone, which should improve performance too.
But if neither of your columns is selective enough to sufficiently improve index performance, you might have to resort to denormalization by using INSERT/UPDATE triggers to fill a second table (feature_id, store_id, istrial, view_count). This would slow down inserts and updates, of course...
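A rough sketch of that trigger-maintained summary table, assuming MySQL and illustrative names (it ignores the viewed_date range; per-period reporting would need a date column in the key):
CREATE TABLE view_counts (
    feature_id INT NOT NULL,
    store_id   INT NOT NULL,
    istrial    TINYINT NOT NULL,
    view_count BIGINT NOT NULL DEFAULT 0,
    PRIMARY KEY (feature_id, store_id, istrial)
);

-- keep the counts current as rows are inserted into the main table
CREATE TRIGGER trg_thetable_ins AFTER INSERT ON theTable
FOR EACH ROW
    INSERT INTO view_counts (feature_id, store_id, istrial, view_count)
    VALUES (NEW.feature_id, NEW.store_id, NEW.istrial, 1)
    ON DUPLICATE KEY UPDATE view_count = view_count + 1;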
You might want to think about splitting that table horizontally. You could run a nightly job that puts each store_id in a separate table. Or take a look at feature_id; yes, that's a lot of tables, but if you don't need real-time data, it's the route I would take.
If you need to optimize this query specifically in MySQL, why not add istrial to the end of the existing index on Store_ID and Feature_ID. This will completely index away the WHERE clause and will be able to grab the COUNT from the cardinality summary of the index if the table is MyISAM. All of your existing queries that leverage the current index will be unchanged as well.
edit: also, I'm unsure why you're doing COUNT(viewed_date) instead of COUNT(*). Is viewed_date ever NULL? If not, you can just use COUNT(*), which will eliminate the need to go to the .MYD file if you take it in conjunction with my other suggestion.
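Putting both suggestions together might look roughly like this (MySQL syntax; the index names are assumptions, as is whether viewed_date should also be appended to the index):
-- replace the existing (store_id, feature_id) index with one that also carries istrial
ALTER TABLE theTable
    DROP INDEX idx_store_feature,                                    -- hypothetical name of the current index
    ADD INDEX idx_store_feature_trial (store_id, feature_id, istrial);

-- count rows rather than a column
SELECT COUNT(*)
FROM theTable
WHERE viewed_date BETWEEN '2009-12-01' AND '2010-12-31'
  AND store_id = 2
  AND feature_id = 12
  AND istrial = 0;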
The best way I found in tackling this problem is to skip DTA's recommendation and do it on my own in the following way:
Use Profiler to find the costliest queries in terms of CPU usage (probably blocking queries) and apply indexes to tables based on those queries. If the query execution plan can be changed to decrease the reads, writes and overall execution time, do that first. If not, in which case the query is what it is, apply the clustered/non-clustered index combination that suits best. This depends on the nature of the existing table indexes, the total bytes of the columns participating in the index, etc.
Run queries in SSMS to find the most frequently executed queries and do the same as above.
Create a defragmentation schedule to either reorganize or rebuild indexes, depending on how fragmented they are.
I am pretty sure others can suggest good ideas. Doing these gave me good results; I hope someone finds this helpful. I think DTA does not really make things faster in terms of indexing, because you really need to go through all the indexes it is going to create. This is especially true for a database that gets hit a lot.
I have a SQL statement to select results from a table. I need to know the total number of records found, and then list a sub-set of them (pagination).
Normally, I would make 2 SQL calls:
one for counting the total number of records (using COUNT),
the other for returning the sub-set (using LIMIT).
But, this way, you are really duplicating the same operation on MySQL: the WHERE statements are the same in both calls.
Isn't there a way to gain speed by NOT duplicating the select on MySQL?
That first query is going to result in data being pulled into the cache, so presumably the second query should be fast. I wouldn't be too worried about this.
You have to make both SQL queries, and the COUNT is very fast with no WHERE clause. Cache the data where possible.
You should just run the COUNT a single time and then cache it somewhere. Then you can just run the pagination query as needed.
If you really don't want to run the COUNT() query -- and as others have stated, it's not something that slows things down appreciably -- then you have to decide on your chunk size (i.e. the LIMIT number) up front. This will save you the COUNT() query, but you may end up with unfortunate pagination results (like 2 pages where the 2nd page has only 1 result).
So, a quick COUNT() and then a sensible LIMIT set-up, or no COUNT() and an arbitrary LIMIT that may increase the number of more expensive queries you have to do.
You could try selecting just one field (say, the IDs) and see if that helps, but I don't think it will - I imagine the biggest overhead is MySQL finding the correct rows in the first place.
If you simply want to count the total number of rows in the entire table (i.e. without a WHERE clause) then I believe SELECT COUNT(*) FROM table is fairly efficient.
Otherwise, the only solution if you need to have the total number visible is to select all the rows. However, you can cache this in another table. If you are selecting something from a category, say, store the category UID and the total rows selected. Then whenever you add/delete rows, count the totals again.
Another option - though it may sacrifice usability a little - is to only select the rows needed for the current page and next page. If there are some rows available for the next page, add a "Next" link. Do the same for the previous page. If you have 20 rows per page, you're selecting at most 60 rows on each page load, and you don't need to count all the rows available.
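One common variation of that idea, as a sketch (MySQL; articles and its columns are hypothetical, 20 rows per page): fetch one row more than the page needs, and show the "Next" link only if that extra row comes back.
-- page 3 (rows 41-60): ask for 21 rows; a 21st row means there is a next page
SELECT id, title
FROM articles
ORDER BY id
LIMIT 21 OFFSET 40;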
If you write your query to include one column that contains the count (in every row), and then the rest of the columns from your second query, you can:
avoid the second database round-trip (which is probably more expensive than your query anyway)
increase the likelihood that MySQL's parser will generate an optimized execution plan that reuses the base query
make the operation atomic
Unfortunately, it also creates a little repetition, returning more data than you really need. But I would expect it to be much more efficient anyway. This is the sort of strategy used by a lot of ORM products when they eagerly load objects from connected tables with many-to-one or many-to-many relationships.
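A hedged sketch of that pattern in MySQL (my_table, category_id and the page size are placeholders); every row carries the total alongside the page data, so only one round trip is needed even though the filter still appears twice:
SELECT t.id, t.title, c.total
FROM my_table AS t
CROSS JOIN (SELECT COUNT(*) AS total
            FROM my_table
            WHERE category_id = 7) AS c
WHERE t.category_id = 7
ORDER BY t.id
LIMIT 20 OFFSET 0;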
As others have already pointed out, it's probably not worth much concern in this case -- as long as 'field' is indexed, both selects will be extremely fast.
If you have (for whatever reason) a situation where that's not enough, you could create a memory-based temporary table (i.e. a temporary table backed by the memory storage engine), and select your records into that temporary table. Then you could do selects from the temporary table and be quite well assured they'll be fast. This can use a lot of memory though (i.e. it forces that data to all stay in memory for the duration), so it's pretty unfriendly unless you're sure that:
The amount of data is really small;
You have so much memory it doesn't matter; or
The machine will be nearly idle otherwise anyway.
The main time this comes in handy is if you have a really complex select that can't avoid scanning all of a large table (or more than one) but yields only a tiny amount of data.
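For what it's worth, a minimal sketch of the memory-backed temporary table approach in MySQL (all table and column names here are hypothetical):
-- materialize the expensive result once, in memory
CREATE TEMPORARY TABLE tmp_results ENGINE=MEMORY AS
SELECT o.id, o.customer_id, o.total
FROM orders AS o
JOIN order_items AS i ON i.order_id = o.id
WHERE i.sku = 'ABC-123';

-- both the count and the paged reads are now cheap
SELECT COUNT(*) FROM tmp_results;
SELECT * FROM tmp_results ORDER BY id LIMIT 20 OFFSET 0;

DROP TEMPORARY TABLE tmp_results;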
I have a MySQL 5.1 InnoDB table (customers) with the following structure:
int record_id (PRIMARY KEY)
int user_id (ALLOW NULL)
varchar[11] postcode (ALLOW NULL)
varchar[30] region (ALLOW NULL)
..
..
..
There are roughly 7 million rows in the table. Currently, the table is being queried like this:
SELECT * FROM customers WHERE user_id IN (32343, 45676, 12345, 98765, 66010, ...
in the actual query, currently over 560 user_ids are in the IN clause. With several million records in the table, this query is slow!
There are secondary indexes on table, the first of which being on user_id itself, which I thought would help.
I know that SELECT * is A Bad Thing and this will be expanded to the full list of fields required. However, the fields not listed above are just more ints and doubles. There are another 50 of those being returned, but they are needed for the report.
I imagine there's a much better way to access the data for the user_ids, but I can't think how to do it. My initial reaction is to remove the ALLOW NULL on the user_id field, as I understand NULL handling slows down queries?
I'd be very grateful if you could point me in a more efficient direction than using the IN ( ) method.
EDIT
Ran EXPLAIN, which said:
select_type = SIMPLE
table = customers
type = range
possible_keys = userid_idx
key = userid_idx
key_len = 5
ref = (NULL)
rows = 637640
Extra = Using where
does that help?
First, check if there is an index on USER_ID and make sure it's used.
You can do it with running EXPLAIN.
Second, create a temporary table and use it in a JOIN:
CREATE TEMPORARY TABLE temptable (user_id INT NOT NULL, PRIMARY KEY (user_id));

INSERT INTO temptable (user_id)
VALUES (32343), (45676), (12345), (98765), (66010) /* ...and the rest of your ~560 ids */;

SELECT c.*
FROM temptable AS t
JOIN customers AS c ON c.user_id = t.user_id;
Third, how many rows does your query return?
If it returns almost all rows, then it just will be slow, since it will have to pump all these millions over the connection channel, to begin with.
NULL will not slow your query down, since the IN condition only satisfies non-NULL values which are indexed.
Update:
The index is used, the plan is fine except that it returns more than half a million rows.
Do you really need to put all these 638,000 rows into the report?
Hope it's not printed: bad for rainforests, global warming and stuff.
Speaking seriously, you seem to need either aggregation or pagination on your query.
"Select *" is not as bad as some people think; row-based databases will fetch the entire row if they fetch any of it, so in situations where you're not using a covering index, "SELECT *" is essentially no slower than "SELECT a,b,c" (NB: There is sometimes an exception when you have large BLOBs, but that is an edge-case).
First things first: does your database fit in RAM? If not, get more RAM. No, seriously. Now, supposing your database is too huge to reasonably fit into RAM (say, > 32 GB), you should try to reduce the number of random I/Os, as they are probably what's holding things up.
I'll assume from here on that you're running proper server-grade hardware with a RAID controller in RAID1 (or RAID10, etc.) and at least two spindles. If you're not, go away and get that.
You could definitely consider using a clustered index. In MySQL InnoDB you can only cluster the primary key, which means that if something else is currently the primary key, you'll have to change it. Composite primary keys are ok, and if you're doing a lot of queries on one criterion (say user_id) it is a definite benefit to make it the first part of the primary key (you'll need to add something else to make it unique).
Alternatively, you might be able to make your query use a covering index, in which case you don't need user_id to be the primary key (in fact, it must not be). This will only happen if all of the columns you need are in an index which begins with user_id.
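As a hedged sketch of that clustering suggestion (MySQL/InnoDB; it assumes you can afford to rebuild a 7-million-row table, and it keeps record_id in the key for uniqueness):
ALTER TABLE customers
    MODIFY user_id INT NOT NULL,              -- primary key columns cannot be NULL
    DROP PRIMARY KEY,
    ADD PRIMARY KEY (user_id, record_id),     -- rows are now clustered by user_id first
    ADD UNIQUE KEY uk_record_id (record_id);  -- keep record_id unique in its own right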
As far as query efficiency is concerned, WHERE user_id IN (big list of IDs) is almost certainly the most efficient way of doing it from SQL.
BUT my biggest tips are:
Have a goal in mind, work out what it is, and when you reach it, stop.
Don't take anybody's word for it - try it and see
Ensure that your performance test system is the same hardware spec as production
Ensure that your performance test system has the same data size and kind as production (same schema is not good enough!).
Use synthetic data if it is not possible to use production data (Copying production data may be logistically difficult (Remember your database is >32Gb) ; it may also violate security policies).
If your query is optimal (as it probably already is), try tuning the schema, then the database itself.
Is this your most important query? Is this a transactional table?
If so, try creating a clustered index on user_id. Your query might be slow because it still must make random disk reads to retrieve the columns (key lookups), even after finding the records that match (index seek on the user_Id index).
If you cannot change the clustered index, then you might want to consider an ETL process (simplest is a trigger that inserts into another table with the best indexing). This should yield faster results.
Also note that such large queries may take some time to parse, so help it out by putting the queried ids into a temp table if possible.
Are they the same ~560 ids every time, or a different set of ids on different runs of the query?
You could just insert your 560 UserIDs into a separate table (or even a temp table), stick an index on that table and inner join it to your original table.
You can try to insert the ids you need to query on in a temp table and inner join both tables. I don't know if that would help.