Use index for GROUP BY

I have the following query:
SELECT * FROM messages GROUP BY peer
(really it's more complicated with joins, but I omitted them here for simplicity)
The problem is that SQLite doesn't use any indexes and always performs a full scan of the table. As expected, it works fast on small data sets, but it's noticeably slow with a big table containing thousands of rows. Here's the output of the EXPLAIN QUERY PLAN command:
0|0|0|SCAN TABLE messages USING INDEX messages_peer_mid (~1000000 rows)
Even though it says "USING INDEX", it still performs a full scan. Is there any way to make SQLite use an index for this query, or is it better to give up on GROUP BY and look for some other approach?

The plan takes into account the amount of data and performs a scan because its algorithm probably concludes it's faster to do so.
Other comments: your query has no WHERE condition and you are returning ALL columns, so why wouldn't you expect a table scan?

Indexes assist in selecting records from a table (using a WHERE clause or as a result of a JOIN operation). GROUP BY is performed on a set of records after they've been selected and retrieved from the table. It cannot be assisted by indexes.
If you want to know more about what options are available for index use in your query, please post the entire query.
Also, you note that the SQL you gave is a simplified representation of the query you're running, but if you're really using *, or any non-aggregated field names other than peer, in your statement, you may not be getting the results you want.
Finally, you ask whether "it's better to give up with GROUP BY and look for some other approach?" GROUP BY serves a specific function in SQL: producing new aggregated result sets from non-aggregated data. If that's your goal, GROUP BY is likely to be the best solution, because it defers the decision about how to retrieve and process the data to the database engine, which is highly optimized and aware of database statistics. If that's not your goal and you're trying to do something else using GROUP BY as an "approach" to that other functionality, let us know what it is you're actually trying to achieve.
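For example, if the goal is one row per peer with aggregated values, a query along these lines avoids the non-aggregated-column problem (the aggregate columns are illustrative; mid is taken from the index name in the plan):
SELECT peer,
       COUNT(*) AS message_count,
       MAX(mid) AS latest_mid
FROM messages
GROUP BY peer;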

Related

Using temp table for sorting data in SQL Server

Recently, I came across a pattern (not sure - it could be an anti-pattern) for sorting data in a SELECT query. The pattern is a verbose, non-declarative way of ordering data: dump the relevant data from the actual table into a temporary table and then apply ORDER BY to a field on the temporary table. I guess the only reason someone would do that is to improve performance (which I doubt); I see no other benefit.
For example, let's say there is a Users table. The table might contain millions of rows. We want to retrieve all the users whose first name starts with 'G', sorted by first name. The natural and more declarative way to write a SQL query for this scenario is:
More natural and declarative way
SELECT * FROM Users
WHERE NAME LIKE 'G%'
ORDER BY Name
Verbose way
SELECT * INTO TempTable
FROM Users
WHERE NAME LIKE 'G%'
SELECT * FROM TempTable
ORDER BY Name
With that context, I have a few questions:
Will there be any performance difference between the two ways if there is no index on the first name field? If yes, which one would be better?
Will there be any performance difference between the two ways if there is an index on the first name field? If yes, which one would be better?
Shouldn't the SQL Server optimizer generate the same execution plan for both ways?
Is there any benefit in writing it the verbose way from any other perspective, like locking/blocking?
Thanks in advance.
REGULARLY: an anti-pattern by people without an idea of what they're doing.
SOMETIMES: OK, because SQL Server has a problem that is not resolvable otherwise - I haven't seen such a case in years, though.
It makes things slower because it forces the tempdb table to be fully populated FIRST, while otherwise the query could POSSIBLY be resolved more efficiently.
The last time I saw this was about 3 years ago. We made it 3 times as fast by not being "smart" and dropping the tempdb table ;)
Answers:
1: No, it still needs a table scan, obviously.
2: Possibly - it depends on the amount of data, but an index seek would return the data already in order (as the index is ordered by its key); see the sketch after this list.
3: No, obviously. Query plan optimization is statement by statement. By cutting the execution in two, the query optimizer CANNOT merge the two statements into one plan.
4: Only if you run into a query optimizer issue or a limitation on how many tables you can join - not in this degenerate case (degenerate in the technical sense, i.e. very simplistic). But if you need to join MANY, MANY tables, it may be better to go with an interim step.
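To illustrate answer 2, here is the kind of index that lets the seek return rows pre-ordered (the index name is illustrative):
CREATE INDEX IX_Users_Name ON Users (Name);
-- WHERE Name LIKE 'G%' ... ORDER BY Name can now be answered by a single
-- ordered range seek on this index, with no separate sort operator.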
If the field you want to do an order by on is not indexed, you could put everything into a temp table and index it and then do the ordering and it might be faster. You would have to test to make sure.
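A sketch of that test (the temp table and index names are illustrative):
SELECT * INTO #G_Users FROM Users WHERE Name LIKE 'G%';
CREATE INDEX IX_GUsers_Name ON #G_Users (Name);
SELECT * FROM #G_Users ORDER BY Name;
DROP TABLE #G_Users;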
There is never any benefit of the second approach that I can think of.
It means that if the data is available pre-ordered, SQL Server can't take advantage of this, and the approach adds an unnecessary blocking operator and an additional sort to the plan.
In the case that the data is not available pre-ordered, SQL Server will sort it in a work table, either in memory or in tempdb, anyway, and adding an explicit #temp table just adds an unnecessary additional step.
Edit
I suppose one case where the second approach could give an apparent benefit might be if the presence of the ORDER BY caused SQL Server to choose a different plan that turned out to be suboptimal. In that case, I would resolve it in a different way, by either improving statistics or using hints/query rewrite to avoid the undesired plan.
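For instance, a sketch of the hint route (the particular hints are illustrative and should only be applied after the plan problem has been confirmed):
SELECT *
FROM Users WITH (FORCESEEK)
WHERE Name LIKE 'G%'
ORDER BY Name
OPTION (RECOMPILE);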

Why is a UNION faster than a GROUP BY?

Well, maybe I am too old school, but I would like to understand the following.
query 1.
select count(*), gender from customer
group by gender
query 2.
select count(*), 'M' from customer
where gender ='M'
union
select count(*), 'F' from customer
where gender ='F'
The 1st query is simpler, but for some reason, when I execute both at the same time in the profiler, it says that query 2 uses 39% of the time and query 1 uses 61%.
I would like to understand the reason, maybe I have to rewrite all my queries.
Your query 2 is actually a nice trick. It works like this: You have an index on gender. The DBMS can seek into that index two times to get two ranges of rows (one for M and one for F). It doesn't need to read anything from these rows, just that they exist. It can count the number of rows that exist in the two ranges.
In the first query the DBMS needs to decode the rows to read the gender, then it needs to either sort the rows or build a hashtable to aggregate them. That is more expensive than just counting rows.
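A minimal sketch of the setup that makes this trick possible (the index name is illustrative):
CREATE INDEX ix_customer_gender ON customer (gender);
-- Each branch of the UNION now becomes a narrow range seek on this index
-- plus a row count; no row data has to be decoded.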
Are you sure?
Maybe the second query is just using cached resources from the first one.
Run them in two separate batches, and before each one run DBCC FREEPROCCACHE to clear the cache. Then compare the values of each execution plan.
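For example (note that DBCC FREEPROCCACHE clears plans server-wide, so only do this on a test box; DBCC DROPCLEANBUFFERS additionally clears the data cache if that is the concern):
DBCC FREEPROCCACHE;
GO
SELECT COUNT(*), gender FROM customer GROUP BY gender;
GO
DBCC FREEPROCCACHE;
GO
SELECT COUNT(*), 'M' FROM customer WHERE gender = 'M'
UNION
SELECT COUNT(*), 'F' FROM customer WHERE gender = 'F';
GO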
The optimization of a query depends on the database. What you are seeing is database specific.
The union, as written, would naively require two passes through the data, doing a filter and a count. Basically no other storage is necessary.
The aggregation might sort the data and then do a count. Or, it might generate a hash table. Given the performance difference, I would guess a sort is being used. Clearly, this is overkill for this type of query.
If you have an index on gender, both methods would essentially scan the index, so the performance should be similar (the union version might scan it twice).
Does the database that you are using offer a way to calculate statistics on tables? If so, you should update the statistics and see if you still get the same results.
Also, can you post the results of "explain" or the execution plan? That would precisely explain why one is faster than the other.
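In SQL Server, for instance, refreshing the statistics looks like this (other databases use ANALYZE or an equivalent):
UPDATE STATISTICS customer;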
I tried an equivalent query, but found the opposite result; the union took 65% and the 'group by' took 35%. (Using SQL Server 2008). I do not have an index on gender so my execution plan shows a clustered index scan. Unless you examine the execution plan in detail, it really isn't possible to explain this result.
Adding an index for this query is probably not a good idea, since you are probably not going to be running this query nearly as often as you are going to insert records in the customer table. In some other database engines with bitmap indexes (Oracle, PostgreSQL), the database engine can combine multiple indexes, so that can alter the utility of single column indexes. But in SQL Server, you need to design the indexes to 'cover' the commonly used queries.

Why didn't SAS use my index?

I have a large SAS dataset sorted by field 'A'. I'd like to do a query that references fields 'A' and 'B'. To speed up performance I created an index on 'B'. This results in an unhelpful message:
INFO: Index B not used. Sorting into index order may help.
Of course sorting on B would help. But that's not the point. Indexes are for the case when you are already sorted on some other field.
In a similar query, SAS gives this message:
INFO: Use of index C for WHERE clause optimization canceled.
Any tips on getting SAS to use my indexes? In one case the query is taking 2 hours to run because SAS doesn't use the index.
If the query is not selective enough - returning most of the source records in the result - using the index may not help performance and can even make things worse. That's probably why the optimizer decided not to use the index.
To force the use of the index, try using the IDXNAME data set option (on both tables, probably).
Refer to http://support.sas.com/documentation/cdl/en/lrdict/64316/HTML/default/viewer.htm#a000414058.htm.
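A sketch of what that might look like in PROC SQL, assuming both tables have a simple index named B on the join key (the library and dataset names are illustrative):
proc sql;
  select a.*
  from mylib.master (idxname=B) as a,
       mylib.lookup (idxname=B) as b
  where a.B = b.B;
quit;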
Without seeing the query and knowing some characteristics of the data (at least the record counts of the input tables and the expected size of the query result), it's hard to tell the optimal approach.
Anyway, for optimal performance when joining tables, both tables need to be indexed similarly, and all the join keys need to be part of the index.
Can't answer a question like this without seeing the query you are trying to run. An index will only be useful if the SAS optimizer determines it will improve performance. Can you show a simple example of the code you want to run?

Tuning table select SQL having a RAW column in Oracle 10g

I have a table with several columns and a unique RAW column. I created a unique index on the RAW column.
My query selects all columns from the table (6 million rows).
When I look at the cost of the query, it's too high (51K), and it's still using an INDEX FULL scan. The query doesn't have any filter conditions; it's a plain select * from the table.
Please suggest how I can tune this query.
Thanks in advance.
Why are you hinting it to use the index if you're retrieving all columns from all rows? The index would only help if you were filtering on the indexed column. If you were only retrieving the indexed column then an INDEX_FFS hint might help. But if you have to go back to the data for any non-indexed columns then using the index at all becomes counterproductive beyond a certain proportion of returned data as you're having to access both the index data blocks and the table data blocks repeatedly.
So, your query is:
select /*+ index (rawdata idx_test) */
rawdata.*
from v_wis_cds_cp_rawdata_test rawdata
and you want to know why Oracle is choosing an INDEX FULL scan?
Well, as Alex said, the reason is the "index (rawdata idx_test)" hint. This is a directive that tells the Oracle optimizer, "when you access rawdata, use an index access on the idx_test index", which means that's what Oracle will do if at all possible - even if that's not the best plan.
Hints don't make queries faster automatically. They are a way of telling the optimizer what not to do.
I've seen queries like this before - sometimes a hint like this is added in order to return the rows in sorted order, without actually doing a sort. However, if this was the requirement, I'd strongly recommend adding an ORDER BY clause in anyway, because if the hint becomes invalid for some reason (e.g. the index gets dropped or renamed), the sorting would no longer happen and no error would be reported.
If you don't need the rows returned in any particular order, I suggest you remove the hint and see if the performance improves.
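That is, a plain unhinted query, with an explicit ORDER BY only if ordered output is genuinely required (the sort column below is hypothetical):
select rawdata.*
from v_wis_cds_cp_rawdata_test rawdata;

-- only if ordered output is actually a requirement:
select rawdata.*
from v_wis_cds_cp_rawdata_test rawdata
order by rawdata.raw_col;  -- raw_col is a hypothetical sort column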

Do indexes work with the "IN" clause?

If I have a query like:
Select EmployeeId
From Employee
Where EmployeeTypeId IN (1,2,3)
and I have an index on the EmployeeTypeId field, does SQL server still use that index?
Yeah, that's right. If your Employee table has 10,000 records, and only 5 records have EmployeeTypeId in (1,2,3), then it will most likely use the index to fetch the records. However, if it finds that 9,000 records have the EmployeeTypeId in (1,2,3), then it would most likely just do a table scan to get the corresponding EmployeeIds, as it's faster just to run through the whole table than to go to each branch of the index tree and look at the records individually.
SQL Server does a lot of stuff to try and optimize how the queries run. However, sometimes it doesn't get the right answer. If you know that SQL Server isn't using the index, by looking at the execution plan in query analyzer, you can tell the query engine to use a specific index with the following change to your query.
SELECT EmployeeId FROM Employee WITH (Index(Index_EmployeeTypeId )) WHERE EmployeeTypeId IN (1,2,3)
Assuming the index you have on the EmployeeTypeId field is named Index_EmployeeTypeId.
Usually it would, unless the IN clause covers too much of the table, and then it will do a table scan. Best way to find out in your specific case would be to run it in the query analyzer, and check out the execution plan.
Unless technology has improved in ways I can't imagine of late, the "IN" query shown will produce a result that's effectively the OR-ing of three result sets, one for each of the values in the "IN" list. The IN clause becomes an equality condition for each of the list and will use an index if appropriate. In the case of unique IDs and a large enough table then I'd expect the optimiser to use an index.
If the items in the list were to be non-unique however, and I guess in the example that a "TypeId" is a foreign key, then I'm more interested in the distribution. I'm wondering if the optimiser will check the stats for each value in the list? Say it checks the first value and finds it's in 20% of the rows (of a large enough table to matter). It'll probably table scan. But will the same query plan be used for the other two, even if they're unique?
It's probably moot - something like an Employee table is likely to be small enough that it will stay cached in memory and you probably wouldn't notice a difference between that and indexed retrieval anyway.
And lastly, while I'm preaching, beware the query in the IN clause: it's often a quick way to get something working and (for me at least) can be a good way to express the requirement, but it's almost always better restated as a join. Your optimiser may be smart enough to spot this, but then again it may not. If you don't currently performance-check against production data volumes, do so - in these days of cost-based optimisation you can't be certain of the query plan until you have a full load and representative statistics. If you can't, then be prepared for surprises in production...
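For example, here is an IN with a subquery next to the same requirement restated as a join (the EmployeeType table and its Active column are illustrative names, not from the original question):
SELECT e.EmployeeId
FROM Employee e
WHERE e.EmployeeTypeId IN (SELECT t.EmployeeTypeId FROM EmployeeType t WHERE t.Active = 1)

SELECT e.EmployeeId
FROM Employee e
INNER JOIN EmployeeType t ON t.EmployeeTypeId = e.EmployeeTypeId
WHERE t.Active = 1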
So there's the potential for an "IN" clause to run a table scan, but the optimizer will try and work out the best way to deal with it?
Whether an index is used doesn't depend so much on the type of query as on the type and distribution of data in the table(s), how up-to-date your table statistics are, and the actual datatype of the column.
The other posters are correct that an index will be used over a table scan if:
The query won't access more than a certain percentage of the rows indexed (say ~10%, though this varies between DBMSs).
Alternatively, if there are a lot of rows, but relatively few unique values in the column, it also may be faster to do a table scan.
The other variable that might not be that obvious is making sure that the datatypes of the values being compared are the same. In PostgreSQL, I don't think that indexes will be used if you're filtering on a float but your column is made up of ints. There are also some operators that don't support index use (again, in PostgreSQL, the ILIKE operator is like this).
As noted though, always check the query analyser when in doubt and your DBMS's documentation is your friend.
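A quick way to see the PostgreSQL datatype point for yourself (the table and column names are made up):
CREATE TABLE t (id int PRIMARY KEY, val int);
CREATE INDEX idx_t_val ON t (val);

EXPLAIN SELECT * FROM t WHERE val = 1.0;  -- numeric literal: the index may be skipped
EXPLAIN SELECT * FROM t WHERE val = 1;    -- integer literal: an index scan is possible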
#Mike: Thanks for the detailed analysis. There are definitely some interesting points you make there. The example I posted is somewhat trivial, but the basis of the question came from using NHibernate.
With NHibernate, you can write a clause like this:
int[] employeeIds = new int[] { 1, 5, 23463, 32523 };
NHibernateSession.CreateCriteria(typeof(Employee))
    .Add(Restrictions.InG("EmployeeId", employeeIds))
    .List<Employee>();  // execute the criteria query
NHibernate then generates a query which looks like
select * from employee where employeeid in (1, 5, 23463, 32523)
So as you and others have pointed out, it looks like there are going to be times where an index will be used or a table scan will happen, but you can't really determine that until runtime.
Select EmployeeId From Employee USE INDEX (EmployeeTypeId)
This query will search using the index you have created. (Note that USE INDEX is MySQL hint syntax; in SQL Server, use WITH (INDEX(...)) as shown above.) It works for me. Please give it a try.