Why an union is faster than a group by - sql

Well, maybe I am too old school and I would like to understand the following.
query 1.
select count(*), gender from customer
group by gender
query 2.
select count(*), 'M' from customer
where gender ='M'
union
select count(*), 'F' from customer
where gender ='F'
the 1st query is simpler, but for some reason in the profiler,when I execute both at the same time, it says that query 2 uses 39% of the time, and query 1, 61%.
I would like to understand the reason, maybe I have to rewrite all my queries.

Your query 2 is actually a nice trick. It works like this: You have an index on gender. The DBMS can seek into that index two times to get two ranges of rows (one for M and one for F). It doesn't need to read anything from these rows, just that they exist. It can count the number of rows that exist in the two ranges.
In the first query the DBMS needs to decode the rows to read the gender, then it needs to either sort the rows or build a hashtable to aggregate them. That is more expensive than just counting rows.

Are you sure?
Maybe the second query is just using cached resources from the first on.
run them in two separately batches and before each one run DBCC FREEPROCCACHE to clean the cache. Then compare the values of each execution plan.

The optimization of a query depends on the database. What you are seeing is database specific.
The union, as written, would naively require two passes through the data, doing a filter and a count. Basically no other storage is necessary.
The aggregation might sort the data and then do a count. Or, it might generate a hash table. Given the performance difference, I would guess a sort is being used. Clearly, this is overkill for this type of query.
If you have an index on gender, both methods would essentially scan the index so the performance should be similar (the union version might scan it twice=.
Does the database that you are using offer a way to calculate statistics on tables? If so, you should update the statistics and see if you still get the same results.
Also, can you post the results of "explain" or the execution plan? That would precisely explain why one is faster than the other.

I tried an equivalent query, but found the opposite result; the union took 65% and the 'group by' took 35%. (Using SQL Server 2008). I do not have an index on gender so my execution plan shows a clustered index scan. Unless you examine the execution plan in detail, it really isn't possible to explain this result.
Adding an index for this query is probably not a good idea, since you are probably not going to be running this query nearly as often as you are going to insert records in the customer table. In some other database engines with bitmap indexes (Oracle, PostgreSQL), the database engine can combine multiple indexes, so that can alter the utility of single column indexes. But in SQL Server, you need to design the indexes to 'cover' the commonly used queries.

Related

Please explain COUNT and GROUP BY/ORDER BY in SQL for a beginner [duplicate]

I would like to understand how exactly does sql count work. Is it a whole table scan that happens or is it some property of the table that is read. However I feel a table scan would be an overhead in case of huge tables with lots of records.
In general either a table or index scan is performed. This is chiefly because in a MVCC-supporting engine, different transactions could see different rows, so there is no single "row count" which is simultaneously correct for everyone.
Likewise, if you have a WHERE clause, then the where condition could be different for different clients, so they see different numbers.
If you need to do a lot of counts of large tables, consider storing your own counters in a different table. Exactly how you do this is entirely application specific.
This will depend very much on which SQL implementation you are using (MS SQL Server, MySQL, Oracle, PostgreSQL etc), and how clever its optimiser is.
It may also depend on the query. For example, with something like
SELECT COUNT(primary_key) FROM table;
the optimiser may realise that there is no need to scan the table (since there is no filtering with WHERE and no possibility that any values are NULL) and just return the size of the table. With a more complicated query (where there is filtering, or the possibility of NULLs), the database may have to scan the table, or it may be able to do some optimisation with the use of an index.
This is obviously implementation dependant (i.e. different RDBMS may employ different strategies) and usage dependant (i.e. select count(*) from mytable and select count(*) from mytable where myfield < somevalue) may use different methods even in the same DB.
If you are trying to get the count based on some partitioning that is already expressed by an Index, smart DBs will try to use the index alone. Or something like the old "rushmore" used in Foxbase.
So, "it depends", but at the end of the day, if no better methods are available, yes, the DB will perform a table scan.
It is usually some sort of index scan, unless there is no unique index on the table.
Strangely enough, most database engines can only count by scanning. They even provide alternate solutions to count using table metadata. For instance SQL Server supports SELECT rowcnt FROM sysindexes .... However, these are usually not 100% accurate.
YSE COUNT FUNCTION DOSE TABLE SCAN, rather than using count on table to get total number of row you can use :
SELECT
Total_Rows= SUM(st.row_count)
FROM
sys.dm_db_partition_stats st
WHERE
object_name(object_id) = 'TABLENAME'
or
SELECT sysobjects.[name], max(sysindexes.[rows]) AS TableRows
FROM sysindexes INNER JOIN sysobjects ON sysindexes.[id] = sysobjects.[id]
WHERE sysobjects.xtype = 'U' and sysobjects.[name]='tablename'
GROUP BY sysobjects.[name]
ORDER BY max(rows) DESC
OTHER WAY TO GET TOTAL COUNT : http://www.codeproject.com/Tips/58796/Number-of-different-way-to-get-total-no-of-row-fro.aspx
It depends on the DBMS used.
If there is an index, there should be one index row for each table row. A smart DBMS will likely choose the smallest index and count the index rows.
Finally, if the table is small enough, it may count the table rows and bypass the index.
In postgreSQL a table scan is performed.
I think it's implementation dependant.
Edit:
See this link
It really doesn't matter!
I assume you want the row count for some sort of paging... so just make sure your paging algorithm is into the best practices and forget about how the engine works.
Let people in database business care about this, just follow the recommendation of those who are experts in the database your are using.
SQL Server - https://web.archive.org/web/20211020131201/https://www.4guysfromrolla.com/webtech/042606-1.shtml
Oracle - Paging with Oracle
MySQL - http://php.about.com/od/phpwithmysql/ss/php_pagination.htm

SQL : Can WHERE clause increase a SELECT DISTINCT query's speed?

So here's the specific situation: I have primary unique indexed keys set to each entry in the database, but each row has a secondID referring to an attribute of the entry, and as such, the secondIDs are not unique. There is also another attribute of these rows, let's call it isTitle, which is NULL by default, but each group of entries with the same secondID have at least one entry with 1 isTitle value.
Considering the conditions above, would a WHERE clause increase the processing speed of the query or not? See the following:
SELECT DISTINCT secondID FROM table;
vs.
SELECT DISTINCT secondID FROM table WHERE isTitle = 1;
EDIT:
The first query without the WHERE clause is faster, but could someone explain me why? Algorithmically the process should be faster with having only one 'if' in the cycle, no?
In general, to benchmark performances of queries, you usually use queries that gives you the execution plan the query they receive in input (Every small step that the engine is performing to solve your request).
You are not mentioning your database engine (e.g. PostgreSQL, SQL Server, MySQL), but for example in PostgreSQL the query is the following:
EXPLAIN SELECT DISTINCT secondID FROM table WHERE isTitle = 1;
Going back to your question, since the isTitle is not indexed, I think the first action the engine will do is a full scan of the table to check that attribute and then perform the SELECT hence, in my opinion, the first query:
SELECT DISTINCT secondID FROM table;
will be faster.
If you want to optimize it, you can create an index on isTitle column. In such scenario, the query with the WHERE clause will become faster.
This is a very hard question to answer, particularly without specifying the database. Here are three important considerations:
Will the database engine use the index on secondID for select distinct? Any decent database optimizer should, but that doesn't mean that all do.
How wide is the table relative to the index? That is, is scanning the index really that much faster than scanning the table?
What is the ratio of isTitle = 1 to all rows with the same value of secondId?
For the first query, there are essentially two ways to process the query:
Scan the index, taking each unique value as it comes.
Scan the table, sort or hash the table, and choose the unique values.
If it is not obvious, (1) is much faster than (2), except perhaps in trivial cases where there are a small number of rows.
For the second query, the only real option is:
Scan the table, filter out the non-matching values, sort or hash the table, and choose the unique values.
The key issues here are how much data needs to be scanned and how much is filtered out. It is even possible -- if you had, say, zillions of rows per secondaryId, no additional columns, and small number of values -- that this might be comparable or slightly faster than (1) above. There is a little overhead for scanning an index and sorting a small amount of data is often quite fast.
And, this method is almost certainly faster than (2).
As mentioned in the comments, you should test the queries on your system with your data (use a reasonable amount of data!). Or, update the table statistics and learn to read execution plans.

Use index for GROUP BY

I have the following query:
SELECT * FROM messages GROUP BY peer
(really it's more complicated with joins, but I omitted them here for simplicity)
The problem is that SQLite doesn't use any indexes and always performs a full scan of the table. Expectedly, it works fast on small data sets but it's noticeably slow with a big table containing thousands of rows. Here's the output of the EXPLAIN QUERY PLAN command:
0|0|0|SCAN TABLE messages USING INDEX messages_peer_mid (~1000000 rows)
Despite it says "USING INDEX" it still performs a full scan. Is there any way to make SQLite use index for this query or it's better to give up with GROUP BY and look for some other approach?
The plan takes into account the amount of data and performs a scan because it's algorithm probably concludes it's faster to do so.
Other comments, your query has no WHERE condition and you are returning ALL columns so why wouldn't you expect a table scan?
Indexes assist in selecting records from a table (using a WHERE clause or as a result of a JOIN operation). GROUP BY is performed on a set of records after they've been selected and retrieved from the table. It cannot be assisted by indexes.
If you want to know more about what options are available for index use in your query, please post the entire query.
Also, you note that the SQL you gave is a symbolic representation of the code you're running, but if you're really using *, or any non-aggregated field names other than peer in your statement you may not be getting the results you want.
Finally, you ask "it's better to give up with GROUP BY and look for some other approach?" GROUP BY is used for a specific function in SQL (producing new aggregated result sets from non-aggregated data). If that's your goal, GROUP BY is likely to be the best solution (because it defers to the database engine, which is highly optimized and cognizant of database statistics the decision about how to retrieve and process the data). If that's not your goal and you're trying to do something else using GROUP BY as an "approach" to that other functionality, let us know what it is you're actually trying to achieve.

how does sql count work?

I would like to understand how exactly does sql count work. Is it a whole table scan that happens or is it some property of the table that is read. However I feel a table scan would be an overhead in case of huge tables with lots of records.
In general either a table or index scan is performed. This is chiefly because in a MVCC-supporting engine, different transactions could see different rows, so there is no single "row count" which is simultaneously correct for everyone.
Likewise, if you have a WHERE clause, then the where condition could be different for different clients, so they see different numbers.
If you need to do a lot of counts of large tables, consider storing your own counters in a different table. Exactly how you do this is entirely application specific.
This will depend very much on which SQL implementation you are using (MS SQL Server, MySQL, Oracle, PostgreSQL etc), and how clever its optimiser is.
It may also depend on the query. For example, with something like
SELECT COUNT(primary_key) FROM table;
the optimiser may realise that there is no need to scan the table (since there is no filtering with WHERE and no possibility that any values are NULL) and just return the size of the table. With a more complicated query (where there is filtering, or the possibility of NULLs), the database may have to scan the table, or it may be able to do some optimisation with the use of an index.
This is obviously implementation dependant (i.e. different RDBMS may employ different strategies) and usage dependant (i.e. select count(*) from mytable and select count(*) from mytable where myfield < somevalue) may use different methods even in the same DB.
If you are trying to get the count based on some partitioning that is already expressed by an Index, smart DBs will try to use the index alone. Or something like the old "rushmore" used in Foxbase.
So, "it depends", but at the end of the day, if no better methods are available, yes, the DB will perform a table scan.
It is usually some sort of index scan, unless there is no unique index on the table.
Strangely enough, most database engines can only count by scanning. They even provide alternate solutions to count using table metadata. For instance SQL Server supports SELECT rowcnt FROM sysindexes .... However, these are usually not 100% accurate.
YSE COUNT FUNCTION DOSE TABLE SCAN, rather than using count on table to get total number of row you can use :
SELECT
Total_Rows= SUM(st.row_count)
FROM
sys.dm_db_partition_stats st
WHERE
object_name(object_id) = 'TABLENAME'
or
SELECT sysobjects.[name], max(sysindexes.[rows]) AS TableRows
FROM sysindexes INNER JOIN sysobjects ON sysindexes.[id] = sysobjects.[id]
WHERE sysobjects.xtype = 'U' and sysobjects.[name]='tablename'
GROUP BY sysobjects.[name]
ORDER BY max(rows) DESC
OTHER WAY TO GET TOTAL COUNT : http://www.codeproject.com/Tips/58796/Number-of-different-way-to-get-total-no-of-row-fro.aspx
It depends on the DBMS used.
If there is an index, there should be one index row for each table row. A smart DBMS will likely choose the smallest index and count the index rows.
Finally, if the table is small enough, it may count the table rows and bypass the index.
In postgreSQL a table scan is performed.
I think it's implementation dependant.
Edit:
See this link
It really doesn't matter!
I assume you want the row count for some sort of paging... so just make sure your paging algorithm is into the best practices and forget about how the engine works.
Let people in database business care about this, just follow the recommendation of those who are experts in the database your are using.
SQL Server - https://web.archive.org/web/20211020131201/https://www.4guysfromrolla.com/webtech/042606-1.shtml
Oracle - Paging with Oracle
MySQL - http://php.about.com/od/phpwithmysql/ss/php_pagination.htm

Do indexes work with "IN" clause

If I have a query like:
Select EmployeeId
From Employee
Where EmployeeTypeId IN (1,2,3)
and I have an index on the EmployeeTypeId field, does SQL server still use that index?
Yeah, that's right. If your Employee table has 10,000 records, and only 5 records have EmployeeTypeId in (1,2,3), then it will most likely use the index to fetch the records. However, if it finds that 9,000 records have the EmployeeTypeId in (1,2,3), then it would most likely just do a table scan to get the corresponding EmployeeIds, as it's faster just to run through the whole table than to go to each branch of the index tree and look at the records individually.
SQL Server does a lot of stuff to try and optimize how the queries run. However, sometimes it doesn't get the right answer. If you know that SQL Server isn't using the index, by looking at the execution plan in query analyzer, you can tell the query engine to use a specific index with the following change to your query.
SELECT EmployeeId FROM Employee WITH (Index(Index_EmployeeTypeId )) WHERE EmployeeTypeId IN (1,2,3)
Assuming the index you have on the EmployeeTypeId field is named Index_EmployeeTypeId.
Usually it would, unless the IN clause covers too much of the table, and then it will do a table scan. Best way to find out in your specific case would be to run it in the query analyzer, and check out the execution plan.
Unless technology has improved in ways I can't imagine of late, the "IN" query shown will produce a result that's effectively the OR-ing of three result sets, one for each of the values in the "IN" list. The IN clause becomes an equality condition for each of the list and will use an index if appropriate. In the case of unique IDs and a large enough table then I'd expect the optimiser to use an index.
If the items in the list were to be non-unique however, and I guess in the example that a "TypeId" is a foreign key, then I'm more interested in the distribution. I'm wondering if the optimiser will check the stats for each value in the list? Say it checks the first value and finds it's in 20% of the rows (of a large enough table to matter). It'll probably table scan. But will the same query plan be used for the other two, even if they're unique?
It's probably moot - something like an Employee table is likely to be small enough that it will stay cached in memory and you probably wouldn't notice a difference between that and indexed retrieval anyway.
And lastly, while I'm preaching, beware the query in the IN clause: it's often a quick way to get something working and (for me at least) can be a good way to express the requirement, but it's almost always better restated as a join. Your optimiser may be smart enough to spot this, but then again it may not. If you don't currently performance-check against production data volumes, do so - in these days of cost-based optimisation you can't be certain of the query plan until you have a full load and representative statistics. If you can't, then be prepared for surprises in production...
So there's the potential for an "IN" clause to run a table scan, but the optimizer will
try and work out the best way to deal with it?
Whether an index is used doesn't so much vary on the type of query as much of the type and distribution of data in the table(s), how up-to-date your table statistics are, and the actual datatype of the column.
The other posters are correct that an index will be used over a table scan if:
The query won't access more than a certain percent of the rows indexed (say ~10% but should vary between DBMS's).
Alternatively, if there are a lot of rows, but relatively few unique values in the column, it also may be faster to do a table scan.
The other variable that might not be that obvious is making sure that the datatypes of the values being compared are the same. In PostgreSQL, I don't think that indexes will be used if you're filtering on a float but your column is made up of ints. There are also some operators that don't support index use (again, in PostgreSQL, the ILIKE operator is like this).
As noted though, always check the query analyser when in doubt and your DBMS's documentation is your friend.
#Mike: Thanks for the detailed analysis. There are definately some interesting points you make there. The example I posted is somewhat trivial but the basis of the question came from using NHibernate.
With NHibernate, you can write a clause like this:
int[] employeeIds = new int[]{1, 5, 23463, 32523};
NHibernateSession.CreateCriteria(typeof(Employee))
.Add(Restrictions.InG("EmployeeId",employeeIds))
NHibernate then generates a query which looks like
select * from employee where employeeid in (1, 5, 23463, 32523)
So as you and others have pointed out, it looks like there are going to be times where an index will be used or a table scan will happen, but you can't really determine that until runtime.
Select EmployeeId From Employee USE(INDEX(EmployeeTypeId))
This query will search using the index you have created. It works for me. Please do a try..