Indexed View looking for null references without INNER JOIN or Subquery - sql

So I have a legacy database with table structure like this (simplified)
Create Table Transaction
{
TransactionId INT NOT NULL IDENTITY(1,1) PRIMARY KEY,
ReplacesTransactionId INT
..
..
}
So I want to create an indexed view such that the following example would return only the second role (because it replaces the first one)
Insert Into Transaction (TransactionId, ReplacesTransactionId, ..) Values (1,0 ..)
Insert Into Transaction (TransactionId, ReplacesTransactionId, ..) Values (2,1 ..)
There are a number of ways of creating this query but I would like to create an indexed view which means I cannot use Subqueries, Left joins or Excepts. An example query (using LEFT JOIN) could be.
SELECT trans1.* FROM Transaction trans1
LEFT JOIN Transaction trans2 on trans1.TransactionId = trans2.ReplacesTransactionId
Where trans2.TransacationId IS NULL
Clearly I'm stuck with the structure of the database and am looking to improve performance of the application using the data.
Any suggestions?

What you have here is essentially a hierarchical dataset in which you want to pre-traverse the hierarchy and store the result in an indexed view, but AFAIK, indexed views do not support that.
On the other hand, this may not be the only angle of attack to your larger goal of improving performance. First, the most obvious question: can we assume that TransactionId is clustered and ReplacesTransactionId is indexed? If not, those would be my first two changes. If the indexing is already good, then the next step would be to look at the query plan of your left join and see if anything leaps out.
In general terms (not having seen the query plan): one possible approach could be to try and convert your SELECT statement to a "covered query" (see https://www.simple-talk.com/sql/learn-sql-server/using-covering-indexes-to-improve-query-performance/). This would most likely entail some combination of:
Reducing the number of columns in the SELECT statement (replacing SELECT *)
Adding a few "included" columns to the index on ReplacesTransactionId (either in SSMS or using the INCLUDES clause of CREATE INDEX).
Good luck!

Related

Will a SQL DELETE with a sub query execute inefficiently if there are many rows in the source table?

I am looking at an application and I found this SQL:
DELETE FROM Phrase
WHERE PhraseId NOT IN(SELECT Id FROM PhraseSource)
The intention of the SQL is to delete rows from Phrase that are not in the PhraseSource table.
The two tables are identical and have the following structure
Id - GUID primary key
...
...
...
Modified int
the ... columns are about ten columns containing text and numeric data. The PhraseSource table may or may not contain more recent rows with a higher number in the Modified column and different text and numeric data.
Can someone tell me will this query execute the SELECT Id from PhraseSource for every row in the Phrase table? If so is there a more efficient way that this could be coded.
1. Will this query execute the SELECT Id from PhraseSource for every row?
No.
In SQL you express what you want to do, not how you want it to be done1. The engine will create an execution plan to do what you want in the most performant way it can.
For your query, executing the query for each row is not necessary. Instead the engine will create an execution plan that executes the subquery once, then does a left anti-semi join to determine what IDs are not present in the PhraseSource table.
You can verify this when you include the Actual Execution Plan in SQL Server Management Studio.
2. Is there a more efficient way that this could be coded?
A little bit more efficient, as follows:
DELETE
p
FROM
Phrase AS p
WHERE
NOT EXISTS (
SELECT
1
FROM
PhraseSource AS ps
WHERE
ps.Id=p.PhraseId
);
This has been shown in tests done by user Aaron Bertrand on sqlperformance.com: Should I use NOT IN, OUTER APPLY, LEFT OUTER JOIN, EXCEPT, or NOT EXISTS?:
Conclusion
[...] for the pattern of finding all rows in table A where some condition does not exist in table B, NOT EXISTS is typically going to be your best choice.
Another benefit of using NOT EXISTS with a correlated subquery is that it does not have problems when PhraseSource.Id can be NULL. I suggest you read up on IN/NOT IN vs NULL values in the subquery. E.g. you can read more about that on sqlbadpractices.com: Using NOT IN operator with null values.
The PhraseSource.Id column is probably not nullable in your schema, but I prefer using a method that is resilient in all possible schemas.
1. Exceptions exist when forcing the engine to use a specific path, e.g. with Table Hints or Query Hints. The engine doesn't always get things right.
In this case the sub-query could be evaluated for each row if the database system is not smart enough (but in case of MS SQL Server, I suppose it should be able to recognize the fact that you don't need to evaluate the subquery more than once).
Still there is a better solution:
DELETE p
FROM Phrase p
LEFT JOIN PhraseSource ps ON ps.Id = p.PhraseId
WHERE ps.Id IS NULL
This uses the LEFT JOIN which matches the rows of both tables, but in case there is no match it leaves the ps entry NULL. Now you just check for NULLs on the left side to see which Phrases do not have a match and will delete those.
All types of JOIN statements are very nicely described in this answer.
Here you can see three different approaches for a similar issue compared on MySQL. As #Drammy mentions, to actually see the performance of a given approach, you could see the execution plan on your target database and do performance testing on different approaches of the same problem.
That query should optimise into a join. Have you looked at the execution plan?
If you're experiencing poor performance it is likely because of the guid primary keys.
A primary key is clustered by default. If the guid primary key is clustered on your table that means the data in the tables is ordered by the primary key. The problem with guids as clustered keys is that when you delete one record the table has to be reordered and shuffled around on disk.
This article is a good read on the topic..
https://blog.codinghorror.com/primary-keys-ids-versus-guids/

Join to SELECT vs. Join to Tableset

For the DB gurus out there, I was wondering if there is any functional/performance difference between Joining to the results a SELECT statement and Joining to a previously filled table variable. I'm working in SQL Server 2008 R2.
Example (TSQL):
-- Create a test table
DROP TABLE [dbo].[TestTable]
CREATE TABLE [dbo].[TestTable](
[id] [int] NOT NULL,
[value] [varchar](max) NULL
) ON [PRIMARY]
-- Populate the test table with a few rows
INSERT INTO [dbo].[TestTable]
SELECT 1123, 'test1'
INSERT INTO [dbo].[TestTable]
SELECT 2234, 'test2'
INSERT INTO [dbo].[TestTable]
SELECT 3345, 'test3'
-- Create a reference table
DROP TABLE [dbo].[TestRefTable]
CREATE TABLE [dbo].[TestRefTable](
[id] [int] NOT NULL,
[refvalue] [varchar](max) NULL
) ON [PRIMARY]
-- Populate the reference table with a few rows
INSERT INTO [dbo].[TestRefTable]
SELECT 1123, 'ref1'
INSERT INTO [dbo].[TestRefTable]
SELECT 2234, 'ref2'
-- Scenario 1: Insert matching results into it's own table variable, then Join
-- Create a table variable
DECLARE #subset TABLE ([id] INT NOT NULL, [refvalue] VARCHAR(MAX))
INSERT INTO #subset
SELECT * FROM [dbo].[TestRefTable]
WHERE [dbo].[TestRefTable].[id] = 1123
SELECT t.*, s.*
FROM [dbo].[TestTable] t
JOIN #subset s
ON t.id = s.id
-- Scenario 2: Join directly to SELECT results
SELECT t.*, s.*
FROM [dbo].TestTable t
JOIN (SELECT * FROM [dbo].[TestRefTable] WHERE id = 1123) s
ON t.id = s.id
In the "real" world, the tables and table variable are pre-defined. What I'm looking at is being able to have the matched reference rows available for further operations, but I'm concerned that the extra steps will slow the query down. Are there technical reasons as to why one would be faster than the other? What sort of performance difference may be seen between the two approaches? I realize it is difficult (if not impossible) to give a definitive answer, just looking for some advice for this scenario.
The database engine has an optimizer to figure out the best way to execute a query. There is more under the hood than you probably imagine. For instance, when SQL Server is doing a join, it has a choice of at least four join algorithms:
Nested Loop
Index Lookup
Merge Join
Hash Join
(not to mention the multi-threaded versions of these.)
It is not important that you understand how each of these works. You just need to understand two things: different algorithms are best under different circumstances and SQL Server does its best to choose the best algorithm.
The choice of join algorithm is only one thing the optimizer does. It also has to figure out the ordering of the joins, the best way to aggregate results, whether a sort is needed for an order by, how to access the data (via indexes or directly), and much more.
When you break the query apart, you are making an assumption about optimization. In your case, you are making the assumption that the first best thing is to do a select on a particular table. You might be right. If so, your result with multiple queries should be about as fast as using a single query. Well, maybe not. When in a single query, SQL Server does not have to buffer all the results at once; it can stream results from one place to another. It may also be able to take advantage of parallelism in a way that splitting the query prevents.
In general, the SQL Server optimizer is pretty good, so you are best letting the optimizer do the query all in one go. There are definitely exceptions, where the optimizer may not choose the best execution path. Sometimes fixing this is as easy as being sure that statistics are up-to-date on tables. Other times, you can add optimizer hints. And other times you can restructure the query, as you have done.
For instance, one place where loading data into a local table is useful is when the table comes from a different server. The optimizer may not have full information about the size of the table to make the best decisions.
In other words, keep the query as one statement. If you need to improve it, then focus on optimization after it works. You generally won't have to spend much time on optimization, because the engine is pretty good at it.
This would give the same result?
SELECT t.*, s.*
FROM dbo.TestTable AS t
JOIN dbo.TestRefTable AS s ON t.id = s.id AND s.id = 1123
Basically, this is a cross join of all records from TestTable and TestRefTable with id = 1123.
Joining to table variables will also result in bad cardinality estimates by the optimizer. Table variables are always assumed by the optimizer to contain only a single row. The more rows it actually has the worse that estimate becomes. This causes the optimizer to assume the wrong number of rows for the table itself, but in other places, for operators that might then join to that result, it can result in wrong estimations of the number executions for that operation.
Personally I think Table parameters should be used for getting data into and out of the server conveniently using client apps (C# .Net apps make good use of them), or for passing data between Stored Procs, but should not be used too much within the proc itself. The importance of getting rid of them within the Proc code itself increases with the expected number of rows to be carried by the parameter.
Sub Selects will perform better, or immediately copying into a temp table will work well. There is overhead for copying into the temp table, but again, the more rows you have the more worth it that overhead becomes because the estimates by the optimizer get worse and worse.
In general a derived table in the query is probably going to be faster than joining to a table variable because it can make use of indexes and they are not available in table variables. However, temp tables can also have indexes creted and that might solve the potential performance difference.
Also if the number of table variable records is expected to be small, then indexes won't make a great deal of difference anyway and so there would be little or no differnce.
As alawys you need to test on your own system as number of records and table design and index design havea great deal to do with what works best.
I'd expect the direct Table join to be faster than the Table to TableVariable, and use less resources.

TSQL join performance

My problem is that this query takes forever to run:
Select
tableA.CUSTOMER_NAME,
tableB.CUSTOMER_NUMBER,
TableB.RuleID
FROM tableA
INNER JOIN tableB on tableA.CUST_PO_NUMBER like tableB.CustomerMask
Here is the structure of the tables:
CREATE TABLE [dbo].[TableA](
[CUSTOMER_NAME] [varchar](100) NULL,
[CUSTOMER_NUMBER] [varchar](50) NULL,
[CUST_PO_NUMBER] [varchar](50) NOT NULL,
[ORDER_NUMBER] [varchar](30) NOT NULL,
[ORDER_TYPE] [varchar](30) NULL)
CREATE TABLE [dbo].[TableB](
[RuleID] [varchar](50) NULL,
[CustomerMask] [varchar](500) NULL)
TableA has 14 million rows and TableB has 1000 rows. Data in column customermask can be anything like ‘%’,’ttt%’,’%ttt%’..etc
How can I tune it to make it faster?
Thanks!
The short answer is don't use the LIKE operator to join two tables containing millions of rows. It's not going to be fast, no matter how you tune it. You might be able to improve it incrementally, but it will just be putting lipstick on a pig.
You need to have a distinct value on which to join the tables. Right now it has to do a complete scan of tableA, and do an item-by-item wildcard comparison between Customer_Name and CustomerMask. You're looking at 14 billion comparisons, all using the slow LIKE operator.
The only suggestion I can give is to re-think the architecture of associating rules with Customers.
While you can't change what's already there, you can create a new table like this:
CREATE TABLE [dbo].[TableC](
[CustomerMask] [varchar](500) NULL)
[CUST_PO_NUMBER] [varchar](50) NOT NULL)
Then have a trigger on both TableA and TableB that inserts / updates / deletes records in TableC if they no longer match the condition CUST_PO_NUMBER LIKE CustomerMask (for the trigger on TableB you need to only update TableC if the CustomerMask field has been changed.
Then in your query will just become:
SELECT
tableA.CUSTOMER_NAME,
tableB.CUSTOMER_NUMBER,
TableB.RuleID
FROM tableA
INNER JOIN tableC on tableA.CUST_PO_NUMBER = tableC.CUST_PO_NUMBER
INNER JOIN tableB on tableC.CustomerMask = tableB.CustomerMask
This will greatly improve your query performance and it shouldn't greatly affect your write performance. You will basically only be performing the like query once for each record (unless they change).
Only change order join then faster and enjoy! use this query:
Select tableA.CUSTOMER_NAME, tableB.CUSTOMER_NUMBER, TableB.RuleID
FROM tableB
INNER JOIN tableA
on tableB.CustomerMask like tableA.CUST_PO_NUMBER
Am I missing something? What about the following:
Select
tableA.CUSTOMER_NAME,
tableA.CUSTOMER_NUMBER,
tableB.RuleID
FROM tableA, tableB
WHERE tableA.CUST_PO_NUMBER = tableB.CustomerMask
EDIT2: Thinking about it, how many of those masks start and end with wildcards? You might gain some performance by first:
Indexing CUST_PO_NUMBER
Creating a persisted computed column CUST_PO_NUMBER_REV that's the reverse of CUST_PO_NUMBER
Indexing the persisted column
Putting statistics on these columns
Then you might build three queries, and UNION ALL the results together:
SELECT ...
FROM ...
ON CUSTOM_PO_NUMBER LIKE CustomerMask
WHERE /* First character of CustomerMask is not a wildcard but last one is */
UNION ALL
SELECT ...
FROM ...
ON CUSTOM_PO_NUMBER_REV LIKE REVERSE(CustomerMask)
WHERE /* Last character of CustomerMask is not a wildcard but first one is */
UNION ALL
SELECT ...
FROM ...
ON CUSTOM_PO_NUMBER LIKE CustomerMask
WHERE /* Everything else */
That's just a quick example, you'll need to take some care that the WHERE clauses give you mutually exclusive results (or use UNION, but aim for mutually exclusive results first).
If you can do that, you should have two queries using index seeks and one query using index scans.
EDIT: You can implement a sharding system to spread out the customers and customer masks tables across multiple servers and then have each server evaluate 1/n% of the results. You don't need to partition the data -- simple replication of the entire contents of each table will do. Link the servers to your main server and you can do something to the effect of:
SELECT ... FROM OPENQUERY(LinkedServer1, 'SELECT ... LIKE ... WHERE ID BETWEEN 0 AND 99')
UNION ALL
SELECT ... FROM OPENQUERY(LinkedServer2, 'SELECT ... LIKE ... WHERE ID BETWEEN 100 AND 199')
Note: the OPENQUERY may be extraneous, SQL Server might be smart enough to evaluate queries on remote servers and stream the results back. I know it doesn't do that for JET linked servers, but it might handle its own kind better.
That or through more hardware at the problem.
You can create an Indexed View of your query to improve performance.
From Designing Indexed Views:
For a standard view, the overhead of dynamically building the result set for each query that references a view can be significant for views that involve complex processing of large numbers of rows, such as aggregating lots of data, or joining many rows. If such views are frequently referenced in queries, you can improve performance by creating a unique clustered index on the view. When a unique clustered index is created on a view, the result set is stored in the database just like a table with a clustered index is stored.
Another benefit of creating an index on a view is that the optimizer starts using the view index in queries that do not directly name the view in the FROM clause. Existing queries can benefit from the improved efficiency of retrieving data from the indexed view without having to be recoded.
This should improve the performance of this particular query, but note that inserts, updates and deleted into the tables it uses may be slowed.
You can't use LIKE if you care about performance.
If you are trying to do approximate string matching (e.g. Test and est and best, etc.) and you don't want to use Sql full-text search take a look at this article.
At least you can shortlist approximate matches then run your wildcard test on them.
--EDIT 2--
Your problem is interesting in the context of your limitation. Thinking about it again, I am pretty sure that using 3 gram would boost the performance (going back to my initial suggestion).
Let's say if you setup your 3gram data, you'll be having the following tables:
Customer : 14M
Customer3Grams : Maximum 700M //Considering the field is varchar(50)
3Grams : 78
Pattern : 1000
Pattern3Grams : 50K
To join pattern to customer then you need the following join:
Pattern x Pattern3Grams x Customer3Grams x Customer
With appropriate indexing (which is easy) each look-up can happen in O(LOG(50K)+LOG(700M)+LOG(14M)) which is equal to 47.6.
Considering appropriate indexes are present the whole join can be calculated with less than 50,000 look-ups and of course the cost of scanning after look ups. I expect it to be very efficient (matter of seconds).
The cost of creating 3grams for each new customer is also minimal because it would be maximum of 50x75 possible three grams that should be appended to the customer3Grams table.
--EDIT--
Depending to your data I can also suggest hash based clustering. I assume customer numbers are numbers with some character patterns in them (e.g. 123231ttt3x4). If this is the case you can create a simple hash function that calculates the result of bit-wise OR for every letter (not numbers) and add it as an indexed column to your table. You can filter on the result of the hash before applying LIKE.
Depending to your data this may cluster your data effectively and improve your search by factor of the number of clusters (number of hash). You can test it by applying the hash and counting the number of distinct generated hash.

IN vs OR of Oracle, which faster?

I'm developing an application which processes many data in Oracle database.
In some case, I have to get many object based on a given list of conditions, and I use SELECT ...FROM.. WHERE... IN..., but the IN expression just accepts a list whose size is maximum 1,000 items.
So I use OR expression instead, but as I observe -- perhaps this query (using OR) is slower than IN (with the same list of condition). Is it right? And if so, how to improve the speed of query?
IN is preferable to OR -- OR is a notoriously bad performer, and can cause other issues that would require using parenthesis in complex queries.
Better option than either IN or OR, is to join to a table containing the values you want (or don't want). This table for comparison can be derived, temporary, or already existing in your schema.
In this scenario I would do this:
Create a one column global temporary table
Populate this table with your list from the external source (and quickly - another whole discussion)
Do your query by joining the temporary table to the other table (consider dynamic sampling as the temporary table will not have good statistics)
This means you can leave the sort to the database and write a simple query.
Oracle internally converts IN lists to lists of ORs anyway so there should really be no performance differences. The only difference is that Oracle has to transform INs but has longer strings to parse if you supply ORs yourself.
Here is how you test that.
CREATE TABLE my_test (id NUMBER);
SELECT 1
FROM my_test
WHERE id IN (1,2,3,4,5,6,7,8,9,10,
21,22,23,24,25,26,27,28,29,30,
31,32,33,34,35,36,37,38,39,40,
41,42,43,44,45,46,47,48,49,50,
51,52,53,54,55,56,57,58,59,60,
61,62,63,64,65,66,67,68,69,70,
71,72,73,74,75,76,77,78,79,80,
81,82,83,84,85,86,87,88,89,90,
91,92,93,94,95,96,97,98,99,100
);
SELECT sql_text, hash_value
FROM v$sql
WHERE sql_text LIKE '%my_test%';
SELECT operation, options, filter_predicates
FROM v$sql_plan
WHERE hash_value = '1181594990'; -- hash_value from previous query
SELECT STATEMENT
TABLE ACCESS FULL ("ID"=1 OR "ID"=2 OR "ID"=3 OR "ID"=4 OR "ID"=5
OR "ID"=6 OR "ID"=7 OR "ID"=8 OR "ID"=9 OR "ID"=10 OR "ID"=21 OR
"ID"=22 OR "ID"=23 OR "ID"=24 OR "ID"=25 OR "ID"=26 OR "ID"=27 OR
"ID"=28 OR "ID"=29 OR "ID"=30 OR "ID"=31 OR "ID"=32 OR "ID"=33 OR
"ID"=34 OR "ID"=35 OR "ID"=36 OR "ID"=37 OR "ID"=38 OR "ID"=39 OR
"ID"=40 OR "ID"=41 OR "ID"=42 OR "ID"=43 OR "ID"=44 OR "ID"=45 OR
"ID"=46 OR "ID"=47 OR "ID"=48 OR "ID"=49 OR "ID"=50 OR "ID"=51 OR
"ID"=52 OR "ID"=53 OR "ID"=54 OR "ID"=55 OR "ID"=56 OR "ID"=57 OR
"ID"=58 OR "ID"=59 OR "ID"=60 OR "ID"=61 OR "ID"=62 OR "ID"=63 OR
"ID"=64 OR "ID"=65 OR "ID"=66 OR "ID"=67 OR "ID"=68 OR "ID"=69 OR
"ID"=70 OR "ID"=71 OR "ID"=72 OR "ID"=73 OR "ID"=74 OR "ID"=75 OR
"ID"=76 OR "ID"=77 OR "ID"=78 OR "ID"=79 OR "ID"=80 OR "ID"=81 OR
"ID"=82 OR "ID"=83 OR "ID"=84 OR "ID"=85 OR "ID"=86 OR "ID"=87 OR
"ID"=88 OR "ID"=89 OR "ID"=90 OR "ID"=91 OR "ID"=92 OR "ID"=93 OR
"ID"=94 OR "ID"=95 OR "ID"=96 OR "ID"=97 OR "ID"=98 OR "ID"=99 OR
"ID"=100)
I would question the whole approach. The client of the SP has to send 100000 IDs. Where does the client get those IDs from? Sending such a large number of ID as the parameter of the proc is going to cost significantly anyway.
If you create the table with a primary key:
CREATE TABLE my_test (id NUMBER,
CONSTRAINT PK PRIMARY KEY (id));
and go through the same SELECTs to run the query with the multiple IN values, followed by retrieving the execution plan via hash value, what you get is:
SELECT STATEMENT
INLIST ITERATOR
INDEX RANGE SCAN
This seems to imply that when you have an IN list and are using this with a PK column, Oracle keeps the list internally as an "INLIST" because it is more efficient to process this, rather than converting it to ORs as in the case of an un-indexed table.
I was using Oracle 10gR2 above.

Slow distinct query in SQL Server over large dataset

We're using SQL Server 2005 to track a fair amount of constantly incoming data (5-15 updates per second). We noticed after it has been in production for a couple months that one of the tables has started to take an obscene amount of time to query.
The table has 3 columns:
id -- autonumber (clustered)
typeUUID -- GUID generated before the insert happens; used to group the types together
typeName -- The type name (duh...)
One of the queries we run is a distinct on the typeName field:
SELECT DISTINCT [typeName] FROM [types] WITH (nolock);
The typeName field has a non-clusted, non-unique ascending index on it. The table contains approximately 200M records at the moment. When we run this query, the query took 5m 58s to return! Perhaps we're not understanding how the indexes work... But I didn't think we mis-understood them that much.
To test this a little further, we ran the following query:
SELECT DISTINCT [typeName] FROM (SELECT TOP 1000000 [typeName] FROM [types] WITH (nolock)) AS [subtbl]
This query returns in about 10 seconds, as I would expect, it's scanning the table.
Is there something we're missing here? Why does the first query take so long?
Edit: Ah, my apologies, the first query returns 76 records, thank you ninesided.
Follow up: Thank you all for your answers, it makes more sense to me now (I don't know why it didn't before...). Without an index, it's doing a table scan across 200M rows, with an index, it's doing an index scan across 200M rows...
SQL Server does prefer the index, and it does give a little bit of a performance boost, but nothing to be excited about. Rebuilding the index did take the query time down to just over 3m instead of 6m, an improvement, but not enough. I'm just going to recommend to my boss that we normalize the table structure.
Once again, thank you all for your help!!
You do misunderstand the index. Even if it did use the index it would still do an index scan across 200M entries. This is going to take a long time, plus the time it takes to do the DISTINCT (causes a sort) and it's a bad thing to run. Seeing a DISTINCT in a query always raises a red flag and causes me to double check the query. In this case, perhaps you have a normalization issue?
There is an issue with the SQL Server optimizer when using the DISTINCT keyword. The solution was to force it to keep the same query plan by breaking out the distinct query separately.
So we took queries such as:
SELECT DISTINCT [typeName] FROM [types] WITH (nolock);
and break it up into the following:
SELECT typeName INTO #tempTable1 FROM types WITH (NOLOCK)
SELECT DISTINCT typeName FROM #tempTable1
Another way to get around it is to use a GROUP BY, which gets a different optimization plan.
I doubt SQL Server will even try to use the index, it'd have to do practically the same amount of work (given the narrow table), reading all 200M rows regardless of whether it looks at the table or the index. If the index on typeName was clustered it may reduce the time taken as it shouldn't need to sort before grouping.
If the cardinality of your types is low, how about maintaining a summary table which holds the list of distinct type values? A trigger on insert/update of the main table would do a check on the summary table and insert a new record when a new type is found.
As others have already pointed out - when you do a SELECT DISTINCT (typename) over your table, you'll end up with a full table scan no matter what.
So it's really a matter of limiting the number of rows that need to be scanned.
The question is: what do you need your DISTINCT typenames for? And how many of your 200M rows are distinct? Do you have only a handful (a few hundred at most) distinct typenames??
If so - you could have a separate table DISTINCT_TYPENAMES or something and fill those initially by doing a full table scan, and then on inserting new rows to the main table, just always check whether their typename is already in DISTINCT_TYPENAMES, and if not, add it.
That way, you'd have a separate, small table with just the distinct TypeName entries, which would be lightning fast to query and/or to display.
Marc
A looping approach should use multiple seeks (but loses some parallelism). It might be worth a try for cases with relatively few distinct values compared to the total number of rows (low cardinality).
Idea was from this question:
select typeName into #Result from Types where 1=0;
declare #t varchar(100) = (select min(typeName) from Types);
while #t is not null
begin
set #t = (select top 1 typeName from Types where typeName > #t order by typeName);
if (#t is not null)
insert into #Result values (#t);
end
select * from #Result;
And looks like there are also some other methods (notably the recursive CTE #Paul White):
different-ways-to-find-distinct-values-faster-methods
sqlservercentral Topic873124-338-5
My first thought is statistics. To find last updated:
SELECT
name AS index_name,
STATS_DATE(object_id, index_id) AS statistics_update_date
FROM
sys.indexes
WHERE
object_id = OBJECT_ID('MyTable');
Edit: Stats are updated when indexes are rebuilt, which I see are not maintained
My second thought is that is the index still there? The TOP query should still use an index.
I've just tested on one of my tables with 57 million rows and both use the index.
An indexed view can make this faster.
create view alltypes
with schemabinding as
select typename, count_big(*) as kount
from dbo.types
group by typename
create unique clustered index idx
on alltypes (typename)
The work to keep the view up to date on each change to the base table should be moderate (depending on your application, of course -- my point is that it doesn't have to scan the whole table each time or do anything insanely expensive like that.)
Alternatively you could make a small table holding all values:
select distinct typename
into alltypes
from types
alter table alltypes
add primary key (typename)
alter table types add foreign key (typename) references alltypes
The foreign key will make sure that all values used appear in the parent alltypes table. The trouble is in ensuring that alltypes does not contain values not used in the child types table.
I should try something like this:
SELECT typeName FROM [types] WITH (nolock)
group by typeName;
And like other i would say you need to normalize that column.
An index helps you quickly find a row. But you're asking the database to list all unique types for the entire table. An index can't help with that.
You could run a nightly job which runs the query and stores it in a different table. If you require up-to-date data, you could store the last ID included in the nightly scan, and combine the results:
select type
from nightlyscan
union
select distinct type
from verybigtable
where rowid > lastscannedid
Another option is to normalize the big table into two tables:
talbe1: id, guid, typeid
type table: typeid, typename
This would be very beneficial if the number of types was relatively small.
I could be missing something but would it be more efficient if an overhead on load to create a view with distinct values and query that instead?
This would give almost instant responses to the select if the result set is significantly smaller with the overhead over populating it on each write though given the nature of the view that might be trivial in itself.
It does ask the question how many writes compared to how often you want the distinct and the importance of the speed when you do.