MySQL performance and COUNT(*) - sql

I want to know whether MySQL executes COUNT queries in linear time or in log(n) time. I think that if the queried column is indexed, it could do it in logarithmic time.

MyISAM will return immediately.
InnoDB will do a PK scan, so the time will increase linearly with the number of records.
If you need to see approximately how many records an InnoDB table holds, the fastest way is using:
EXPLAIN select * from student;
(but InnoDB statistics may be wrong, so a 40% error is quite possible)
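If a rough estimate is enough, the same statistic can also be read via SHOW TABLE STATUS; the Rows value is InnoDB's estimate and is subject to the same error:
SHOW TABLE STATUS LIKE 'student';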

It all depends on the query, or more precisely, on the query plan MySQL eventually selects to process the query.
It also depends on what we mean by 'n' in the big-O expression. For example, if 'n' is the count value eventually returned, and that count is produced by a query which requires iteratively scanning multiple tables, the complexity could be worse than linear.
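For instance, a count driven by a join (the enrollment table and its columns below are made up for illustration) has to do the join work before anything can be counted, so its cost depends on the join strategy rather than on the value eventually returned:
-- hypothetical schema, purely to illustrate the point
select count(*)
from student s
join enrollment e on e.student_id = s.id;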

The answer to this is complicated. Not only does it depend on the number of tables involved, but it can also depend on what storage engine you're using.
Having said that, this is what the manual says:
COUNT(*) is optimized to return very quickly if the SELECT retrieves from one table, no other columns are retrieved, and there is no WHERE clause. For example:
mysql> SELECT COUNT(*) FROM student;
This optimization applies only to MyISAM tables, because an exact row count is stored for this storage engine and can be accessed very quickly. For transactional storage engines such as InnoDB, storing an exact row count is more problematic because multiple transactions may be occurring, each of which may affect the count.
-- MySQL Manual

Related

Computational Efficiency - I/O

If I have two queries, why does it seem the second query is more computationally efficient (just in terms of I/O) than the first?
The first query only returns eight fields, runs in 1.1 sec and processes 115.6 MB. The second, however, returns over a million records, but runs in just 3.4 sec and only accesses 8.2 MB.
I am really trying to understand writing queries more efficiently as I am beginning to use substantially larger pools of data. Thanks!
SELECT *
FROM `table1`
LIMIT 10;

SELECT id
FROM `table1`;
BigQuery is basically a columnar database (this is not exactly true, but it is a useful approximation). That is, it stores each column separately. So accessing one column only requires finding and reading that one column. Accessing multiple columns requires finding all those columns and reading them -- even if you only want one value.
This is not only a matter of performance. The number of columns read also determines billing. For users of other databases, it can be really surprising when:
select t.*
from t
limit 10;
ends up costing $10 or $100 because t is really big and wide. But:
select count(id)
from t;
costs almost nothing at all.
As another note: when you refer to a table multiple times in a query, you only pay for access once. So self-joins are not more expensive than selecting directly from the table.
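As a rough sketch of that last point (the project and dataset names are made up), a self-join like the following is billed for one scan of the id column, not two:
-- hypothetical table reference; both sides read the same column once
select a.id
from `myproject.mydataset.t` a
join `myproject.mydataset.t` b
  on a.id = b.id
limit 10;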

Pre-fetching row counts before query - performance

I recently answered this question based on my experience:
Counting rows before proceeding to actual searching
but I'm not 100% satisfied with the answer I gave.
The question is basically this: Can I get a performance improvement by running a COUNT over a particular query before deciding to run the query that brings back the actual rows?
My intuition is this: you will only save the I/O and wire time associated with retrieving the data instead of the count because to count the data, you need to actually find the rows. The possible exception to this is when the query is a simple function of the indexes.
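For example (hypothetical table, assuming an index on customer_id), the first statement below can often be answered from the index alone, while the second must still fetch the rows:
-- count satisfied by the index on customer_id
SELECT COUNT(*) FROM orders WHERE customer_id = 42;
-- full query must also read the row data
SELECT * FROM orders WHERE customer_id = 42;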
My question then is this: Is this always true? What other exception cases are there? From a pure performance perspective, in what cases would one want to do a COUNT before running the full query?
First, the answer to your question is highly dependent on the database.
I cannot think of a situation when doing a COUNT() before a query will shorten the overall time for both the query and the count().
In general, doing a count will pre-load tables and indexes into the page cache. Assuming the data fits in memory, this will make the subsequent query run faster (although not much faster if you have fast I/O and the database does read-ahead page reading). However, you have just shifted the time frame to the COUNT(), rather than reducing overall time.
To shorten the overall time (including the run time of the COUNT()) would require changing the execution plan. Here are two ways this could theoretically happen:
A database could update statistics as a table is read in, and these statistics, in turn, change the query plan for the main query.
A database could change the execution plan based on whether tables/indexes are already in the page cache.
Although theoretically possible, I am not aware of any database that does either of these.
You could imagine that intermediate results could be stored, but this would violate the dynamic nature of SQL databases. That is, updates/inserts could occur on the tables between the COUNT() and the query. A database engine could not maintain transactional integrity while reusing such intermediate results.
Doing a COUNT() has disadvantages, relative to speeding up the subsequent query. The query plan for the COUNT() might be quite different from the query plan for the main query. Your example with indexes is one case. Another case would be in a columnar database, where different vertical partitions of the data do not need to be read.
Yet another case would be a query such as:
select t.*, r.val
from t left outer join
     ref r
     on t.refID = r.refID
and refID is a unique index on the ref table. This join can be eliminated for a count, since there are no duplicates and all records in t are used, yet the join is clearly needed for the full query. Once again, whether a SQL optimizer recognizes and acts on this situation is entirely the decision of the writers of the database; still, the join could theoretically be optimized away for the COUNT().
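To make that concrete, under the stated assumptions an optimizer could in principle rewrite
-- the left outer join neither adds nor removes rows when refID is unique
select count(*)
from t left outer join
     ref r
     on t.refID = r.refID
as simply
select count(*) from t;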

Why a union is faster than a group by

Well, maybe I am too old school and I would like to understand the following.
query 1.
select count(*), gender from customer
group by gender
query 2.
select count(*), 'M' from customer
where gender ='M'
union
select count(*), 'F' from customer
where gender ='F'
The first query is simpler, but for some reason, when I execute both at the same time, the profiler says that query 2 uses 39% of the time and query 1 uses 61%.
I would like to understand the reason, maybe I have to rewrite all my queries.
Your query 2 is actually a nice trick. It works like this: you have an index on gender. The DBMS can seek into that index twice to get two ranges of rows (one for M and one for F). It doesn't need to read anything from those rows; it only needs to know that they exist, so it can count the number of rows in the two ranges.
In the first query the DBMS needs to decode the rows to read the gender, then it needs to either sort the rows or build a hashtable to aggregate them. That is more expensive than just counting rows.
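For reference, the seek trick above presumes an index along these lines (the index name is made up):
CREATE INDEX IX_customer_gender ON customer (gender);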
Are you sure?
Maybe the second query is just using cached resources from the first one.
Run them in two separate batches, and before each one run DBCC FREEPROCCACHE to clear the plan cache. Then compare the execution plans.
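A sketch of that test, assuming SQL Server (DBCC DROPCLEANBUFFERS is added here because FREEPROCCACHE alone clears only compiled plans, not cached data pages):
DBCC FREEPROCCACHE;
DBCC DROPCLEANBUFFERS;
select count(*), gender from customer group by gender;

DBCC FREEPROCCACHE;
DBCC DROPCLEANBUFFERS;
select count(*), 'M' from customer where gender = 'M'
union
select count(*), 'F' from customer where gender = 'F';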
The optimization of a query depends on the database. What you are seeing is database specific.
The union, as written, would naively require two passes through the data, doing a filter and a count. Basically no other storage is necessary.
The aggregation might sort the data and then do a count. Or, it might generate a hash table. Given the performance difference, I would guess a sort is being used. Clearly, this is overkill for this type of query.
If you have an index on gender, both methods would essentially scan the index, so the performance should be similar (the union version might scan it twice).
Does the database that you are using offer a way to calculate statistics on tables? If so, you should update the statistics and see if you still get the same results.
Also, can you post the results of "explain" or the execution plan? That would precisely explain why one is faster than the other.
I tried an equivalent query, but found the opposite result; the union took 65% and the 'group by' took 35%. (Using SQL Server 2008). I do not have an index on gender so my execution plan shows a clustered index scan. Unless you examine the execution plan in detail, it really isn't possible to explain this result.
Adding an index for this query is probably not a good idea, since you are probably not going to be running this query nearly as often as you are going to insert records in the customer table. In some other database engines with bitmap indexes (Oracle, PostgreSQL), the database engine can combine multiple indexes, so that can alter the utility of single column indexes. But in SQL Server, you need to design the indexes to 'cover' the commonly used queries.

Effect of number of projections on query performance

I am looking to improve the performance of a query which selects several columns from a table. I was wondering whether limiting the number of columns would have any effect on the performance of the query.
Reducing the number of columns would, I think, have only a very limited effect on the speed of the query, but potentially a larger effect on the transfer speed of the data. The less data you select, the less needs to be transferred over the wire to your application.
I might be misunderstanding the question, but here goes anyway:
The absolute number of columns you select doesn't make a huge difference. However, which columns you select can make a significant difference depending on how the table is indexed.
If you are selecting only columns that are covered by the index, then the DB engine can use just the index for the query without ever fetching table data. If you use even one column that's not covered, though, it has to fetch the entire row (key lookup) and this will degrade performance significantly. Sometimes it will kill performance so much that the DB engine opts to do a full scan instead of even bothering with the index; it depends on the number of rows being selected.
So, if by removing columns you are able to turn this into a covering query, then yes, it can improve performance. Otherwise, probably not. Not noticeably anyway.
Quick example for SQL Server 2005+ - let's say this is your table:
CREATE TABLE MyTable (
    ID int NOT NULL IDENTITY PRIMARY KEY CLUSTERED,
    Name varchar(50) NOT NULL,
    Status tinyint NOT NULL
);
If we create this index:
CREATE INDEX IX_MyTable
ON MyTable (Name)
Then this query will be fast:
SELECT ID
FROM MyTable
WHERE Name = 'Aaron'
But this query will be slow(er):
SELECT ID, Name, Status
FROM MyTable
WHERE Name = 'Aaron'
If we change the index to a covering index, i.e.
CREATE INDEX IX_MyTable
ON MyTable (Name)
INCLUDE (Status)
Then the second query becomes fast again because the DB engine never needs to read the row.
Limiting the number of columns has no measurable effect on the query. Almost universally, an entire row is fetched to cache. The projection happens last in the SQL pipeline.
The projection part of the processing must happen last (after GROUP BY, for instance) because it may involve creating aggregates. Also, many columns may be required for JOIN, WHERE and ORDER BY processing, more than are finally returned in the result set. It's hardly worth adding a step to the query plan to do projections earlier just to save a little I/O.
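For instance (hypothetical table and columns), this query returns a single column, but the engine still has to read price and created_at to evaluate the WHERE and ORDER BY:
SELECT name
FROM products
WHERE price > 100
ORDER BY created_at;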
Check your query plan documentation. There's no "project" node in the query plan. It's a small part of formulating the result set.
To get away from "whole row fetch", you have to go for a columnar ("Inverted") database.
It can depend on the server you're dealing with (and, in the case of MySQL, the storage engine). Just for example, there's at least one MySQL storage engine that does column-wise storage instead of row-wise storage, and in this case more columns really can take more time.
The other major possibility would be if you had segmented your table so some columns were stored on one server, and other columns on another (aka vertical partitioning). In this case, retrieving more columns might involve retrieving data from different servers, and it's always possible that the load is imbalanced so different servers have different response times. Of course, you usually try to keep the load reasonably balanced, so that should be fairly unusual, but it's still possible (especially if, for example, one of the servers handles some other data whose usage might vary independently from the rest).
Yes, if your query can be covered by a non-clustered index it will be faster, since all the data is already in the index and the base table (if you have a heap) or clustered index does not need to be touched by the engine.
To demonstrate what tvanfosson has already written, namely that there is a "transfer" cost, I ran the following two statements on an MSSQL 2000 DB from Query Analyzer:
SELECT datalength(text) FROM syscomments
SELECT text FROM syscomments
Both results returned 947 rows but the first one took 5 ms and the second 973 ms.
Also, because the fields are the same, I would not expect indexing to be a factor here.

Is there something faster than "having count" for large tables?

Here is my query:
select word_id, count(sentence_id)
from sentence_word
group by word_id
having count(sentence_id) > 100;
The table sentence_word contains three fields: word_id, sentence_id, and a primary key id.
It has 350k+ rows.
This query takes a whopping 85 seconds, and I'm wondering (hoping, praying?) whether there is a faster way to find all the word_ids that have more than 100 sentence_ids.
I've tried taking out the select count part and just doing 'having count(1)', but neither speeds it up.
I'd appreciate any help you can lend. Thanks!
If you don't already have one, create a composite index on sentence_id, word_id.
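Something along these lines (the index name is made up; depending on the engine, ordering it as (word_id, sentence_id) may serve the GROUP BY better):
CREATE INDEX idx_sentence_word ON sentence_word (sentence_id, word_id);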
having count(sentence_id) > 100;
There's a problem with this... Either the table has duplicate word/sentence pairs, or it doesn't.
If it does have duplicate word/sentence pairs, you should be using this code to get the correct answer:
HAVING COUNT(DISTINCT Sentence_ID) > 100
If the table does not have duplicate word/sentence pairs... then you shouldn't count sentence_ids, you should just count rows.
HAVING COUNT(*) > 100
In which case, you can create an index on word_id only, for optimum performance.
If that query is often performed, and the table rarely updated, you could keep an auxiliary table with word ids and corresponding sentence counts -- hard to think of any further optimization beyond that!
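A minimal sketch of such an auxiliary table (the names are made up; it has to be refreshed whenever sentence_word changes):
CREATE TABLE word_sentence_counts AS
SELECT word_id, COUNT(*) AS num
FROM sentence_word
GROUP BY word_id;

-- the original question then becomes a cheap lookup
SELECT word_id, num FROM word_sentence_counts WHERE num > 100;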
Your query is fine, but it needs a bit of help (indexes) to get faster results.
I don't have my resources at hand (or access to SQL), but I'll try to help you from memory.
Conceptually, the only way to answer that query is to count all the records that share the same word_id. That means that the query engine needs a fast way to find those records. Without an index on word_id, the only thing the database can do is go through the table one record at a time and keep running totals of every single distinct word_id it finds. That would usually require a temporary table and no results can be dispatched until the whole table is scanned. Not good.
With an index on word_id, it still has to go through the table, so you would think it wouldn't help much. However, the SQL engine can now compute the count for each word_id without waiting until the end of the table: it can dispatch the row and the count for that value of word_id (if it passes your HAVING clause), or discard the row (if it doesn't); that will result in lower memory load on the server, possibly partial responses, and no need for a temporary table. A second aspect is parallelism: with an index on word_id, SQL can split the job into chunks and use separate processor cores to run the query in parallel (depending on hardware capabilities and existing workload).
That might be enough to help your query; but you will have to try to see:
CREATE INDEX someindexname ON sentence_word (word_id)
(T-SQL syntax; you didn't specify which SQL product you are using)
If that's not enough (or doesn't help at all), there are two other solutions.
First, SQL allows you to precompute the COUNT(*) by using indexed views and other mechanisms. I don't have the details at hand (and I don't do this often). If your data doesn't change often, that would give you faster results but with a cost in complexity and a bit of storage.
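For SQL Server, the indexed-view variant would look roughly like this (the view and index names are made up; SCHEMABINDING and COUNT_BIG are required for indexed views):
CREATE VIEW dbo.vw_word_counts WITH SCHEMABINDING AS
SELECT word_id, COUNT_BIG(*) AS num
FROM dbo.sentence_word
GROUP BY word_id;
GO
CREATE UNIQUE CLUSTERED INDEX IX_vw_word_counts
ON dbo.vw_word_counts (word_id);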
Also, you might want to consider storing the results of the query in a separate table. That is practical only if the data never changes, or changes on a precise schedule (say, during a data refresh at 2 in the morning), or if it changes very little and you can live with imperfect results for a few hours (you would have to schedule a periodic data refresh); that's the moral equivalent of a poor man's data warehouse.
The best way to find out for sure what works for you is to run the query and look at the query plan with and without some candidate indexes like the one above.
There is, surprisingly, an even faster way to accomplish that on large data sets:
SELECT totals.word_id, totals.num
FROM (
    SELECT word_id, COUNT(*) AS num
    FROM sentence_word
    GROUP BY word_id
) AS totals
WHERE num > 100;