SQL: using count() vs. keeping a separate field - sql

I have a question regarding SQL performance and was hoping someone would have the answer.
I have the database table tbl_users and I want to get the total number of users I have. I could write it as SELECT COUNT(*) FROM tbl_users. I presume such query would have performance implications were I to have a handful of users vs. several millions of them. (So, assumption #1 is that the more rows I have, the more resources this query will consume).
In this particular case I need to run this query at a relatively high frequency and each time I need to get up-to-date data (so, caching is not an option).
Assuming my assmption #1 is correct, I then thought of structuring it the following way:
create tbl_stats with a field userCounter
each time there is an insert in tbl_users, userCounter is updated +1
each time I need to get my user count, I can pull that one field from tbl_stats
Now, I realize that by doing it this way, the data in userCounter is technically a duplicate, which is bad form.
So, will my first query (assuming millions of rows of data) consume that many resources to warrant me to implement my alternative design? If so (or if possibly yes), then is my alternative design consistent with best practices?

If your table is indexed, which it almost certainly will be, then the performance of select count(*) probably will not be as bad as you might anticipate - even if you have millions of rows.
But, if it does become a concern, then rather than roll your own solution, look into using an indexed view.

I have a database table with almost 5 million records, the following query returns in less than a second
select count(userID) from tblUsers
This query returns in 2 seconds
select count(*) from tblUsers
I'd personally just go with select count() rather than creating a duplicate field

I think this is one of those scenarios where you really need to measure the performance to make a good decision. I would wager that a simple COUNT() isn't going to create enough latency that you would need to implement your proposed work-around.
If you are worried I would encapsulate your COUNT() in a function or stored procedure so you can quickly swap it out later if performance does become a problem.

On some systems you can ask the system to maintain the counts for you. For example, in SQL Server you can have an indexed view on the count:
create view vwCountUsers
with schema binding
as
select count_big(*) as count
from dbo.tbl_users;
create clustered index cdxCountUsers on vwCountUsers (count);
The system will maintain the count for you and will always be available at nearly no cost.

If you have a desperate need and a real business case for up to the minute accurate counts, then the trigger would be the way to go. Just make sure it caters for all multi-user issues such as concurrency and transactions.
It could become a bottleneck because instead of 5 transactions being able to insert into a new table, they will queue up waiting to update the userCounter table, and you may even get deadlocks.
There are other options for less accurate counts, but if you want accurate then there are very few other choices, but I'll try to think of some:
You could partition the data and in userCounter store a count by day. If the data only gets added for the current day, do a select sum(dailycount) from counter + select count(*) from table where {date=today}
You could at least use the nolock or readpast options to lessen resource usage:
select * from tbl with (readpast)
select * from tbl with (nolock)

There are somethings it makes sense to precalculate for performace reasons (comlex calculations over years of data). That's why data warehouses exists much of the time to speed reporting. Select count(*) is generally not one of them if you have any indexing on the table at all. There are far worse performance problems to solve than that. I get 1 second to return the count on a table with 13 million rows.
I'm all about writing code that will is more likely to perform well than the alternative (avoiding correlated subqueries, using set-based operatiosn instead of cursors, having sargable where clauses), but this is a mirco optimization that should not be addressed until there is a real performance problem.

Related

Complexity comparison between temporary table + index creation vice multi-table group by without index

I have two potential roads to take on the following problem, the try it and see methodology won't pay off for this solution as the load on the server is constantly in flux. The two approaches I have are as follows:
select *
from
(
select foo.a,bar.b,baz.c
from foo,bar,baz
-- updated for clarity sake
where foo.a=b.bar
and b.bar=baz.c
)
group by a,b,c
vice
create table results as
select foo.a,bar.b,baz.c
from foo,bar,baz
where foo.a=b.bar
and b.bar=baz.c ;
create index results_spanning on results(a,b,c);
select * from results group by a,b,c;
So in case it isn't clear. The top query performs the group by outright against the multi-table select thus preventing me from using an index. The second query allows me to create a new table that stores the results of the query, proceeding to create a spanning index, then finishing the group by query to utilize the index.
What is the complexity difference of these two approaches, i.e. how do they scale and which is preferable in the case of large quantities of data. Also, the main issue is the performance of the overall select so that is what I am attempting to fix here.
Comments
Are you really doing a CROSS JOIN on three tables? Are those three
columns indexed in their own right? How often do you want to run the
query which delivers the end result?
1) No.
2) Yes, where clause omitted for the sake of discussion as this is clearly a super trivial example
3) Doesn't matter.
2nd Update
This is a temporary table as it is only valid for a brief moment in time, so yes this table will only be queried against one time.
If your query is executed frequently and unacceptably slow, you could look into creating materialized views to pre-compute the results. This gives you the benefit of an indexable "table", without the overhead of creating a table every time.
You'll need to refresh the materialized view (preferably fast if the tables are large) either on commit or on demand. There are some restrictions on how you can create on commit, fast refreshable views, and they will add to your commit time processing slightly, but they will always give the same result as running the base query. On demand MVs will become stale as the underlying data changes until these are refreshed. You'll need to determine whether this is acceptable or not.
So the question is, which is quicker?
Run a query once and sort the result set?
Run a query once to build a table, then build an index, then run the query again and sort the result set?
Hmmm. Tricky one.
The use cases for temporary tables are pretty rare in Oracle. They normally onlya apply when we need to freeze a result set which we are then going to query repeatedly. That is apparently not the case here.
So, take the first option and just tune the query if necessary.
The answer is, as is so often the case with tuning questions, it depends.
Why are you doing a GROUP BY in the first place. The query as you posted it doesn't do any aggregation so the only reason for doing GROUP BY woudl be to eliminate duplicate rows, i.e. a DISTINCT operation. If this is actually the case then you doing some form of cartesian join and one tuning the query would be to fix the WHERE clause so that it only returns discrete records.

Pre-fetching row counts before query - performance

I recently answered this question based on my experience:
Counting rows before proceeding to actual searching
but I'm not 100% satisfied with the answer I gave.
The question is basically this: Can I get a performance improvement by running a COUNT over a particular query before deciding to run the query that brings back the actual rows?
My intuition is this: you will only save the I/O and wire time associated with retrieving the data instead of the count because to count the data, you need to actually find the rows. The possible exception to this is when the query is a simple function of the indexes.
My question then is this: Is this always true? What other exception cases are there? From a pure performance perspective, in what cases would one want to do a COUNT before running the full query?
First, the answer to your question is highly dependent on the database.
I cannot think of a situation when doing a COUNT() before a query will shorten the overall time for both the query and the count().
In general, doing a count will pre-load tables and indexes into the page cache. Assuming the data fits in memory, this will make the subsequent query run faster (although not much faster if you have fast I/O and the database does read-ahead page reading). However, you have just shifted the time frame to the COUNT(), rather than reducing overall time.
To shorten the overall time (including the run time of the COUNT()) would require changing the execution plan. Here are two ways this could theoretically happen:
A database could update statistics as a table is read in, and these statistics, in turn, change the query plan for the main query.
A database could change the execution plan based on whether tables/indexes are already in the page cache.
Although theoretically possible, I am not aware of any database that does either of these.
You could imagine that intermediate results could be stored, but this would violate the dynamic nature of SQL databases. That is, updates/inserts could occur on the tables between the COUNT() and the query. A database engine could not maintain integrity and maintain such intermediate results.
Doing a COUNT() has disadvantages, relative to speeding up the subsequent query. The query plan for the COUNT() might be quite different from the query plan for the main query. Your example with indexes is one case. Another case would be in a columnar database, where different vertical partitions of the data do not need to be read.
Yet another case would be a query such as:
select t.*, r.val
from table t left outer join
ref r
on t.refID = r.refID
and refID is a unique index on the ref table. This join can be eliminated for a count, since there are not duplicates and all records in t are used. However, the join is clearly needed for this query. Once again, whether a SQL optimizer recognizes and acts on this situation is entirely the decision of the writers of the database. However, the join could theoretically be optimized away for the COUNT().

Speed of paged queries in Oracle

This is a never-ending topic for me and I'm wondering if I might be overlooking something. Essentially I use two types of SQL statements in an application:
Regular queries with a "fallback" limit
Sorted and paged queries
Now, we're talking about some queries against tables with several million records, joined to 5 more tables with several million records. Clearly, we hardly want to fetch all of them, that's why we have the above two methods to limit user queries.
Case 1 is really simple. We just add an additional ROWNUM filter:
WHERE ...
AND ROWNUM < ?
That's quite fast, as Oracle's CBO will take this filter into consideration for its execution plan and probably apply a FIRST_ROWS operation (similar to the one enforced by the /*+FIRST_ROWS*/ hint.
Case 2, however is a bit more tricky with Oracle, as there is no LIMIT ... OFFSET clause as in other RDBMS. So we nest our "business" query in a technical wrapper as such:
SELECT outer.* FROM (
SELECT * FROM (
SELECT inner.*, ROWNUM as RNUM, MAX(ROWNUM) OVER(PARTITION BY 1) as TOTAL_ROWS
FROM (
[... USER SORTED business query ...]
) inner
)
WHERE ROWNUM < ?
) outer
WHERE outer.RNUM > ?
Note that the TOTAL_ROWS field is calculated to know how many pages we will have even without fetching all data. Now this paging query is usually quite satisfying. But every now and then (as I said, when querying 5M+ records, possibly including non-indexed searches), this runs for 2-3minutes.
EDIT: Please note, that a potential bottleneck is not so easy to circumvent, because of sorting that has to be applied before paging!
I'm wondering, is that state-of-the-art simulation of LIMIT ... OFFSET, including TOTAL_ROWS in Oracle, or is there a better solution that will be faster by design, e.g. by using the ROW_NUMBER() window function instead of the ROWNUM pseudo-column?
The main problem with Case 2 is that in many cases the whole query result set has to be obtained and then sorted before the first N rows can be returned - unless the ORDER BY columns are indexed and Oracle can use the index to avoid a sort. For a complex query and a large set of data this can take some time. However there may be some things you can do to improve the speed:
Try to ensure that no functions are called in the inner SQL - these may get called 5 million times just to return the first 20 rows. If you can move these function calls to the outer query they will be called less.
Use a FIRST_ROWS_n hint to nudge Oracle into optimising for the fact that you will never return all the data.
EDIT:
Another thought: you are currently presenting the user with a report that could return thousands or millions of rows, but the user is never realistically going to page through them all. Can you not force them to select a smaller amount of data e.g. by limiting the date range selected to 3 months (or whatever)?
You might want to trace the query that takes a lot of time and look at its explain plan. Most likely the performance bottleneck comes from the TOTAL_ROWS calculation. Oracle has to read all the data, even if you only fetch one row, this is a common problem that all RDBMS face with this type of query. No implementation of TOTAL_ROWS will get around that.
The radical way to speed up this type of query is to forego the TOTAL_ROWS calculation. Just display that there are additional pages. Do your users really need to know that they can page through 52486 pages? An estimation may be sufficient. That's another solution, implemented by google search for example: estimate the number of pages instead of actually counting them.
Designing an accurate and efficient estimation algorithm might not be trivial.
A "LIMIT ... OFFSET" is pretty much syntactic sugar. It might make the query look prettier, but if you still need to read the whole of a data set and sort it and get rows "50-60", then that's the work that has to be done.
If you have an index in the right order, then that can help.
It may perform better to run two queries instead of trying to count() and return the results in the same query. Oracle may be able to answer the count() without any sorting or joining to all the tables (join table elimination based on declared foreign key constraints). This is what we generally do in our application. For performance important statements, we write a separate query that we know will return the correct count as we can sometimes do better than Oracle.
Alternatively, you can make a tradeoff between performance and recency of the data. Bringing back the first 5 pages is going to be nearly as quick as bringing back the first page. So you could consider storing the results from 5 pages in a temporary table along with an expiry date for the information. Take the result from the temporary table if valid. Put a background task in to delete the expired data periodically.

Fastest way to count total number and then list a set of records in MySQL

I have a SQL statement to select results from a table. I need to know the total number of records found, and then list a sub-set of them (pagination).
Normally, I would make 2 SQL calls:
one for counting the total number of records (using COUNT),
the other for returning the sub-set (using LIMIT).
But, this way, you are really duplicating the same operation on MySQL: the WHERE statements are the same in both calls.
Isn't there a way to gain speed NOT duplicating the select on MySQL ?
That first query is going to result in data being pulled into the cache, so presumable the second query should be fast. I wouldn't be too worried about this.
You have to make both SQL queries, and the COUNT is very fast with no WHERE clause. Cache the data where possible.
You should just run the COUNT a single time and then cache it somewhere. Then you can just run the pagination query as needed.
If you really don't want to run the COUNT() query- and as others have stated, it's not something that slows things down appreciably- then you have to decide on your chunk size (ie the LIMIT number) up front. This will save you the COUNT() query, but you may end up with unfortunate pagination results (like 2 pages where the 2nd page has only 1 result).
So, a quick COUNT() and then a sensible LIMIT set-up, or no COUNT() and an arbitrary LIMIT that may increase the number of more expensive queries you have to do.
You could try selecting just one field (say, the IDs) and see if that helps, but I don't think it will - I imagine the biggest overhead is MySQL finding the correct rows in the first place.
If you simply want to count the total number of rows in the entire table (i.e. without a WHERE clause) then I believe SELECT COUNT(*) FROM table is fairly efficient.
Otherwise, the only solution if you need to have the total number visible is to select all the rows. However, you can cache this in another table. If you are selecting something from a category, say, store the category UID and the total rows selected. Then whenever you add/delete rows, count the totals again.
Another option - though it may sacrifice usability a little - is to only select the rows needed for the current page and next page. If there are some rows available for the next page, add a "Next" link. Do the same for the previous page. If you have 20 rows per page, you're selecting at most 60 rows on each page load, and you don't need to count all the rows available.
If you write your query to include one column that contains the count (in every row), and then the rest of the columns from your second query, you can:
avoid the second database round-trip (which is probably more expensive than your query anyways)
Increase the likelihood that MySQL's parser will generate an optimized execution plan that reuses the base query.
Make the operation atomic.
Unfortunately, it also creates a little repetition, returning more data than you really need. But I would expect it to be much more efficient anyway. This is the sort of strategy used by a lot of ORM products when they eagerly load objects from connected tables with many-to-one or many-to-many relationships.
As others have already pointed out, it's probably not worth much concern in this case -- as long as 'field' is indexed, both select's will be extremely fast.
If you have (for whatever reason) a situation where that's not enough, you could create a memory-based temporary table (i.e. a temporary table backed by the memory storage engine), and select your records into that temporary table. Then you could do selects from the temporary table and be quite well assured they'll be fast. This can use a lot of memory though (i.e. it forces that data to all stay in memory for the duration), so it's pretty unfriendly unless you're sure that:
The amount of data is really small;
You have so much memory it doesn't matter; or
The machine will be nearly idle otherwise anyway.
The main time this comes in handy is if you have a really complex select that can't avoid scanning all of a large table (or more than one) but yields only a tiny amount of data.

is there something faster than "having count" for large tables?

Here is my query:
select word_id, count(sentence_id)
from sentence_word
group by word_id
having count(sentence_id) > 100;
The table sentenceword contains 3 fields, wordid, sentenceid and a primary key id.
It has 350k+ rows.
This query takes a whopping 85 seconds and I'm wondering (hoping, praying?) there is a faster way to find all the wordids that have more than 100 sentenceids.
I've tried taking out the select count part, and just doing 'having count(1)' but neither speeds it up.
I'd appreciate any help you can lend. Thanks!
If you don't already have one, create a composite index on sentence_id, word_id.
having count(sentence_id) > 100;
There's a problem with this... Either the table has duplicate word/sentence pairs, or it doesn't.
If it does have duplicate word/sentence pairs, you should be using this code to get the correct answer:
HAVING COUNT(DISTINCT Sentence_ID) > 100
If the table does not have duplicate word/sentence pairs... then you shouldn't count sentence_ids, you should just count rows.
HAVING COUNT(*) > 100
In which case, you can create an index on word_id only, for optimum performance.
If that query is often performed, and the table rarely updated, you could keep an auxiliary table with word ids and corresponding sentence counts -- hard to think of any further optimization beyond that!
Your query is fine, but it needs a bit of help (indexes) to get faster results.
I don't have my resources at hand (or access to SQL), but I'll try to help you from memory.
Conceptually, the only way to answer that query is to count all the records that share the same word_id. That means that the query engine needs a fast way to find those records. Without an index on word_id, the only thing the database can do is go through the table one record at a time and keep running totals of every single distinct word_id it finds. That would usually require a temporary table and no results can be dispatched until the whole table is scanned. Not good.
With an index on word_id, it still has to go through the table, so you would think it wouldn't help much. However, the SQL engine can now compute the count for each word_id without waiting until the end of the table: it can dispatch the row and the count for that value of word_id (if it passes your where clause), or discard the row (if it doesn't); that will result in lower memory load on the server, possibly partial responses, and the temporary table is no longer needed. A second aspect is parallelism; with an index on word_id, SQL can split the job in chunks and use separate processor cores to run the query in parallel (depending on hardware capabilities and existing workload).
That might be enough to help your query; but you will have to try to see:
CREATE INDEX someindexname ON sentence_word (word_id)
(T-SQL syntax; you didn't specify which SQL product you are using)
If that's not enough (or doesn't help at all), there are two other solutions.
First, SQL allows you to precompute the COUNT(*) by using indexed views and other mechanisms. I don't have the details at hand (and I don't do this often). If your data doesn't change often, that would give you faster results but with a cost in complexity and a bit of storage.
Also, you might want to consider storing the results of the query in a separate table. That is practical only if the data never changes, or changes on a precise schedule (say, during a data refresh at 2 in the morning), or if it changes very little and you can live with non perfect results for a few hours (you would have to schedule a periodic data refresh); that's the moral equivalent of a poor-man's data warehouse.
The best way to find out for sure what works for you is to run the query and look at the query plan with and without some candidate indexes like the one above.
There is, surprisingly, an even faster way to accomplish that on large data sets:
SELECT totals.word_id, totals.num
FROM (SELECT word_id, COUNT(*) AS num FROM sentence_word GROUP BY word_id) AS totals
WHERE num > 1000;