How to speed up a SQL Server query involving count(distinct())

I have a deceptively simple SQL Server query that's taking a lot longer than I would expect.
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED
SELECT COUNT(DISTINCT(guid)) FROM listens WHERE url='http://www.sample.com/'
'guid' is varchar(64) NULL
'url' is varchar(900) NULL
There is an index on guid and url.
There are over 7 million rows in the 'listens' table, of which 17,000 match the url in question, and the result of the query is 5,500.
It is taking over 1 minute to run this query on SQL Server 2008 on a fairly idle Dual-Core AMD Opteron 2GHz with 1GB RAM.
Any ideas how to get the execution time down? Ideally it should be under 1 second!

Create an index on url which would cover the GUID:
CREATE INDEX ix_listens_url__guid ON listens (url) INCLUDE (guid)
When dealing with urls as identifiers, it is much better to store and index the URL hash rather than the whole URL.
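For example, here is a minimal sketch of the hash idea, assuming we are free to add a persisted computed column to listens (the column and index names here are mine, and CHECKSUM is just one possible hash):
-- Narrow, persisted checksum of the URL, plus a covering index on it.
ALTER TABLE listens ADD UrlChecksum AS CHECKSUM(url) PERSISTED;

CREATE INDEX ix_listens_urlchecksum__guid
    ON listens (UrlChecksum) INCLUDE (guid);

-- CHECKSUM can collide, so keep the original url predicate as a residual filter.
SELECT COUNT(DISTINCT guid)
FROM listens
WHERE UrlChecksum = CHECKSUM('http://www.sample.com/')
  AND url = 'http://www.sample.com/';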

Scanning indexes that large will take a long time no matter what; what you need to do is shorten the indexes.
What you can do is add an integer column in which a checksum of the url is calculated and stored. That way your index will be narrow and the count will be fast.
Note that CHECKSUM is not unique, but it's unique enough.
Here's a complete code example of how to do it. I've included checksums for both columns, but it probably needs only one. You could also calculate the checksum yourself on insert or update and remove the trigger.
CREATE TABLE MyTable
(
ID INT IDENTITY(1,1) PRIMARY KEY,
[Guid] varchar(64),
Url varchar(900),
GuidChecksum int,
UrlChecksum int
)
GO
CREATE TRIGGER trgMyTableCheckSumCalculation ON MyTable
FOR UPDATE, INSERT
as
UPDATE t1
SET GuidChecksum = checksum(I.[Guid]),
UrlChecksum = checksum(I.Url)
FROM MyTable t1
join inserted I on t1.ID = I.ID
GO
CREATE NONCLUSTERED INDEX NCI_MyTable_GuidChecksum ON MyTable(GuidChecksum)
CREATE NONCLUSTERED INDEX NCI_MyTable_UrlChecksum ON MyTable(UrlChecksum)
INSERT INTO MyTable([Guid], Url)
select NEWID(), 'my url 1' union all
select NEWID(), 'my url 2' union all
select null, 'my url 3' union all
select null, 'my url 4'
SELECT *
FROM MyTable
SELECT COUNT(GuidChecksum)
FROM MyTable
WHERE UrlChecksum = CHECKSUM('my url 3') -- seek on the narrow checksum index
  AND Url = 'my url 3'                   -- residual check, since CHECKSUM can collide
GO
DROP TABLE MyTable

I know this post is a bit late. I was searching on another optimization issue.
Noting that:
guid is VARCHAR(64) and not really a 16-byte uniqueidentifier
url is varchar(900), and you have 7 million rows of it.
My suggestion:
Create a new field for the table: URLHash UNIQUEIDENTIFIER.
On creation of a new record, set URLHash = CONVERT(UNIQUEIDENTIFIER, HASHBYTES('MD5', url)).
Build an index on URLHash
then in your query:
SELECT COUNT(DISTINCT(guid)) FROM listens WHERE URLHash = CONVERT( UNIQUEIDENTIFIER, HASHBYTES( 'MD5', 'http://www.sample.com/' ) )
This will give you a very fast method of uniquely seeking a specific url, while maintaining a very small index size.
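A minimal sketch of that suggestion, assuming the column is populated at insert time by the application or a trigger (the column and index names are mine):
ALTER TABLE listens ADD URLHash UNIQUEIDENTIFIER NULL;

-- Backfill existing rows: MD5 returns 16 bytes, which converts cleanly to uniqueidentifier.
UPDATE listens
SET URLHash = CONVERT(UNIQUEIDENTIFIER, HASHBYTES('MD5', url))
WHERE url IS NOT NULL;

CREATE INDEX ix_listens_urlhash ON listens (URLHash) INCLUDE (guid);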
If you need FURTHER optimization, you may want to do the same hash on guid. Performing a DISTINCT on a 16-byte uniqueidentifier is faster than on a varchar(64).
The above assumes that you are not adding a lot of new rows to the listens table; i.e., that the insert rate is not that heavy. The reason is that the MD5 algorithm, although providing good dispersion, is notoriously slow. If you are adding new records in the thousands per second, then calculating the MD5 hash on record creation can slow down your server (unless you have a very fast server). The alternative approach is to implement your own version of the FNV-1a hashing algorithm, which is not built in. FNV-1a is a lot faster than MD5 and still provides very good dispersion and a low collision rate.
Hope the above helps whoever runs into this kind of problem in the future.

Your GUID column will, by nature, be a lot more labour-intensive than, say, a bigint as it takes up more space (16 bytes). Can you replace the GUID column with an auto-incremented numerical column, or failing that, introduce a new column of type bigint/int that is incremented for each new value of the GUID column (you could then use your GUID to ensure global uniqueness, and the bigint/int for indexing purposes)?
From the link above:
At 16 bytes, the uniqueidentifier data type is relatively large compared to other data types such as 4-byte integers. This means indexes built using uniqueidentifier keys may be relatively slower than implementing the indexes using an int key.
Is there any particular reason why you're using a varchar for your guid column rather than uniqueidentifier?

I bet it would perform better with more than 1 GB of memory in the machine (every DBA I've met expects at least 4 GB in a production SQL Server).
I've no idea if this matters, but if you do a
SELECT DISTINCT(guid) FROM listens WHERE url='http://www.sample.com/'
won't @@ROWCOUNT contain the result you want?

Your best possible plan is a range seek to obtain the 17k candidate URLs and the count distinct to rely on a guaranteed order of input so it does not have to sort. The proper data structure that can satisfy both of these requirements is an index on (url, guid):
CREATE INDEX idxListensURLGuid on listens(url, guid);
You already got plenty of feedback on the width of the keys used, and you can definitely seek to improve them; also increase that puny 1 GB of RAM if you can.
If it is possible to deploy on SQL Server 2008 EE, then make sure you turn on page compression for such a highly repetitive and wide index. It will do miracles for performance due to the reduced I/O.
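For instance, something along these lines (a sketch; page compression requires Enterprise Edition on 2008):
-- Rebuild the index from the previous step with page compression.
CREATE INDEX idxListensURLGuid ON listens (url, guid)
WITH (DATA_COMPRESSION = PAGE, DROP_EXISTING = ON);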

Some hints ...
1) Refactor your query, e.g. use a WITH clause (CTE):
with url_entries as (
select guid
from listens
where url='http://www.sample.com/'
)
select count(distinct(entries.guid)) as distinct_guid_count
from url_entries entries
2) Tell SQL Server exactly which index must be used while performing the query (of course, the index on the url field). Another way: simply drop the index on guid and leave the index on url alone. Look up table hints for more information, especially constructions like select ... from listens with (index(index_name_for_url_field)).
3) Verify the state of the indexes on the listens table and update index statistics.
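For point 3, something like this (a sketch; the index name is illustrative):
-- Refresh statistics on the table with a full scan.
UPDATE STATISTICS listens WITH FULLSCAN;

-- Rebuild a fragmented index (or use REORGANIZE for a lighter-weight alternative).
ALTER INDEX ix_listens_url__guid ON listens REBUILD;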


How can I improve the speed of a SQL query searching for a collection of strings

I have a table called T_TICKET with a column CallId varchar(30).
Here is an example of my data:
CallId | RelatedData
===========================================
MXZ_SQzfGMCPzUA | 0000
MXyQq6wQ7gVhzUA | 0001
MXwZN_d5krgjzUA | 0002
MXw1YXo7JOeRzUA | 0000
...
I am attempting to find records that match a collection of CallId's. Something like this:
SELECT * FROM T_TICKET WHERE CALLID IN(N'MXZInrBl1DCnzUA', N'MXZ0TWkUhHprzUA', N'MXZ_SQzfGMCPzUA', ... ,N'MXyQq6wQ7gVhzUA')
And I have anywhere from 200 - 300 CallId's that I am looking up at a time using this query. The query takes around 35 seconds to run. Is there anything I can do to either the table structure, the column type, the index, or the query itself to improve the performance of this query?
There are around 300,000 rows in T_TICKET currently. CallId is not unique, and RelatedData is not unique. I also have a non-clustered index on CallId.
I know the basics of SQL, but I'm not a pro. Some things I've thought of doing are:
Change the type of CallId from varchar to char.
Shorten the length of CallId (its length is 30, but in reality, right now, I am only using 15 bytes).
I have not tried any of these yet because it requires changes to live production data. And, I am not sure they would make a significant improvement.
Would either of these options make a significant improvement? Or, are there other things I could do to make this perform faster?
First, be sure that the types are the same -- either VARCHAR() or NVARCHAR(). Then, add an index:
create index idx_t_ticket_callid on t_ticket(callid);
If the types are compatible, SQL Server should make use of the index.
Your table is what we call a heap (a table without a clustered index). This kind of table is only good for data loading and/or as a staging table. I would recommend you convert your table to have a clustered key. A good clustering key should be unique, static, narrow, non-nullable, and ever-increasing (e.g. an int/bigint identity column).
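A minimal sketch of that conversion, assuming T_TICKET has no surrogate key yet (the names are illustrative):
-- Add a narrow, ever-increasing surrogate key and cluster the table on it.
ALTER TABLE T_TICKET ADD Id INT IDENTITY(1,1) NOT NULL;

ALTER TABLE T_TICKET
    ADD CONSTRAINT PK_T_TICKET PRIMARY KEY CLUSTERED (Id);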
Another downside of a heap is that when you have lots of UPDATEs/DELETEs on your table, forwarded records will slow down your SELECTs. Quoting Paul Randal on forwarded records:
If a forwarding record occurs in a heap, when the record locator points to that location, the Storage Engine gets there and says Oh, the record isn't really here – it's over there! And then it has to do another (potentially physical) I/O to get to the page with the forwarded record on. This can result in a heap being less efficient than an equivalent clustered index.
Lastly, make sure you list the columns you need in your SELECT and avoid SELECT *. I'm guessing you are getting a table scan when you execute the query. What you can do is INCLUDE the columns from your SELECT in your index, like this:
CREATE INDEX [IX_T_TICKET_CallId_INCLUDE] ON [T_TICKET] ([CallId]) INCLUDE ([RelatedData]) WITH (DROP_EXISTING=ON)
It turns out there is in fact a way to significantly optimize my query without changing any data types.
This query:
SELECT * FROM T_TICKET
WHERE CALLID IN(N'MXZInrBl1DCnzUA', N'MXZ0TWkUhHprzUA', N'MXZ_SQzfGMCPzUA', ... ,N'MXyQq6wQ7gVhzUA')
is using NVARCHAR types as the input params (N'MXZInrBl1DCnzUA', N'MXZ0TWkUhHprzUA'...). As I specified in my question, CallId is VARCHAR. SQL Server was converting CallId in every row of the table to NVARCHAR to do the comparison, which was taking a long time (even though I have an index on CallId).
I was able to optimize it by simply not passing the parameters as NVARCHAR:
SELECT * FROM T_TICKET
WHERE CALLID IN('MXZInrBl1DCnzUA', 'MXZ0TWkUhHprzUA', 'MXZ_SQzfGMCPzUA', ... ,'MXyQq6wQ7gVhzUA')
Now, instead of taking over 30 seconds to run, it only takes around .03 seconds. Thanks for all the input.

SQL index for date range query

For a few days, I've been struggling with improving the performance of my database and there are some issues that I'm still kind a confused about regarding indexing in a SQL Server database.
I'll try to be as informative as I can.
My database currently contains about 100k rows and will keep growing, therefore I'm trying to find a way to make it work faster.
I'm also writing to this table, so if your suggestion will drastically slow down the writing time, please let me know.
The overall goal is to select all rows with a specific name that are in a date range.
That will usually mean selecting over 3,000 rows out of a whole lot more...
Table schema:
CREATE TABLE [dbo].[reports]
(
[id] [int] IDENTITY(1,1) NOT NULL,
[IsDuplicate] [bit] NOT NULL,
[IsNotValid] [bit] NOT NULL,
[Time] [datetime] NOT NULL,
[ShortDate] [date] NOT NULL,
[Source] [nvarchar](350) NULL,
[Email] [nvarchar](350) NULL,
CONSTRAINT [PK_dbo.reports]
PRIMARY KEY CLUSTERED ([id] ASC)
) ON [PRIMARY]
This is the SQL query I'm using:
SELECT *
FROM [db].[dbo].[reports]
WHERE Source = 'name1'
AND ShortDate BETWEEN '2017-10-13' AND '2017-10-15'
As I understood it, my best approach to improve efficiency without hurting the writing time too much would be to create a nonclustered index on Source and ShortDate.
Which I did, like so; index schema:
CREATE NONCLUSTERED INDEX [Source&Time]
ON [dbo].[reports]([Source] ASC, [ShortDate] ASC)
Now we get to the tricky part that got me completely lost: the index above sometimes works, sometimes half works, and sometimes doesn't work at all...
(Not sure if it matters, but currently 90% of the database rows have the same Source, although this won't stay that way for long.)
With the query below, the index isn't used at all; I'm using SQL Server 2014, and in the execution plan it says it only uses a clustered index scan:
SELECT *
FROM [db].[dbo].[reports]
WHERE Source = 'name1'
AND ShortDate BETWEEN '2017-10-10' AND '2017-10-15'
With this query the index isn't used at all either, although I'm getting a suggestion from SQL Server to create an index with the date first and the source second... I read that the index should be built in the same order as the query? It also says to include all the columns I'm selecting; is that a must?... Again, I read that I should include in the index only the columns I'm searching on.
SELECT *
FROM [db].[dbo].[reports]
WHERE Source = 'name1'
AND ShortDate = '2017-10-13'
SQL Server index suggestion -
/* The Query Processor estimates that implementing the following
index could improve the query cost by 86.2728%. */
/*
USE [db]
GO
CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname,>]
ON [dbo].[reports] ([ShortDate], [Source])
INCLUDE ([id], [IsDuplicate], [IsNotValid], [Time], [Email])
GO
*/
Now I tried using the index SQL Server suggested, and it works; it seems it uses the nonclustered index 100% for both of the queries above.
I tried using this index but with the included columns removed, and it doesn't work... it seems I must include in the index all the columns I'm selecting?
BTW, the index I made also works if I include all the columns.
To summarize: it seems the order of the index columns didn't matter, as it worked both when creating Source + ShortDate and ShortDate + Source.
But for some reason it's a must to include all the columns... (which will drastically affect writes to this table?)
Thanks a lot for reading. My goal is to understand why this stuff happens and what I should do about it (not just the solution, as I'll need to apply it to other projects as well).
Cheers :)
Indexing in SQL Server is part know-how from long experience (and many hours of frustration), and part black magic. Don't beat yourself up over that too much - that's what a place like SO is ideal for - lots of brains, lots of experience from many hours of optimizing, that you can tap into.
I read that the index should be made by the order the query is?
If you read this - it is absolutely NOT TRUE - the order of the columns is relevant - but in a different way: a compound index (made up from multiple columns) will only ever be considered if you specify the n left-most columns in the index definition in your query.
Classic example: a phone book with an index on (city, lastname, firstname). Such an index might be used:
in a query that specifies all three columns in its WHERE clause
in a query that uses city and lastname (find all "Miller" in "Detroit")
or in a query that only filters by city
but it can NEVER EVER be used if you want to search only for firstname ..... that's the trick about compound indexes you need to be aware of. But if you always use all columns from an index, their ordering is typically not really relevant - the query optimizer will handle this for you.
As for the included columns - those are stored only in the leaf level of the nonclustered index - they are NOT part of the search structure of the index, and you cannot specify filter values for those included columns in your WHERE clause.
The main benefit of these included columns is this: if you search in a nonclustered index, and in the end, you actually find the value you're looking for - what do you have available at that point? The nonclustered index will store the columns in the non-clustered index definition (ShortDate and Source), and it will store the clustering key (if you have one - and you should!) - but nothing else.
So in this case, once a match is found, and your query wants everything from that table, SQL Server has to do what is called a Key lookup (often also referred to as a bookmark lookup) in which it takes the clustered key and then does a Seek operation against the clustered index, to get to the actual data page that contains all the values you're looking for.
If you have included columns in your index, then the leaf level page of your non-clustered index contains
the columns as defined in the nonclustered index
the clustering key column(s)
all those additional columns as defined in your INCLUDE statement
If those columns "cover" your query, e.g. provide all the values that your query needs, then SQL Server is done once it finds the value you searched for in the nonclustered index - it can take all the values it needs from that leaf-level page of the nonclustered index, and it does NOT need to do another (expensive) key lookup into the clustering index to get the actual values.
Because of this, trying to always explicitly specify only those columns you really need in your SELECT can be beneficial - in this case, you might be able to create an efficient covering index that provides all the values for your SELECT - always using SELECT * makes that really hard or next to impossible.....
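As a sketch, if the query can be narrowed to just the columns it really needs (the column choice below is only an example), a covering index could look like this:
CREATE NONCLUSTERED INDEX IX_reports_Source_ShortDate
    ON dbo.reports ([Source], [ShortDate])
    INCLUDE ([Time], [Email]);

-- Seeks on (Source, ShortDate); Time and Email come from the leaf level,
-- so no key lookup into the clustered index is needed.
SELECT [Source], [ShortDate], [Time], [Email]
FROM dbo.reports
WHERE [Source] = 'name1'
  AND [ShortDate] BETWEEN '2017-10-13' AND '2017-10-15';
Note that [id], as the clustering key, is carried in the nonclustered index anyway, so selecting it too would still be covered.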
In general, you want the index to be from most selective (i.e. filtering out the most possible records) to least selective; if a column has low cardinality, the query optimizer may ignore it.
That makes intuitive sense - if you have a phone book, and you're looking for people called "smith", with the initial "A", you want to start with searching for "smith" first, and then the "A"s, rather than all people whose initial is "A" and then filter out those called "Smith". After all, the odds are that one in 26 people have the initial "A".
So, in your example, I guess you have a wide range of values in ShortDate - so that's the first column the query optimizer is trying to filter on. You say you have few different values in Source, so the query optimizer may decide to ignore it; in that case, the second column in that index is no use either.
The order of predicates in your WHERE clause is irrelevant - you can swap them round and achieve the exact same results, so the query optimizer ignores their order.
EDIT:
So, yes, make the index. Imagine you have a pile of cards to sort - in your first pass, you want to remove as many cards as possible. Assuming the values are evenly spread: if you have 1,000 separate short_dates over a million rows, you end up with about 1,000 rows if your first filter is on short_date; if you filter by source first, you have 100,000 rows.
The included columns of an index are for the columns you are selecting.
Because you do SELECT * (which isn't good practice), the index won't be used: the engine would have to do lookups against the whole table to get the values for the remaining columns.
For your scenario, I would drop the default clustered index (if there is one) and create a new clustered index with the following statement:
USE [db]
GO
CREATE CLUSTERED INDEX CIX_reports
ON [dbo].[reports] ([ShortDate],[Source])
GO

Joining 100 tables

Assume that I have a main table which has 100 columns referencing (as foreign keys) to some 100 tables (containing primary keys).
The whole pack of information requires joining those 100 tables. And it is definitely a performance issue to join such a number of tables. Hopefully, we can expect that any user would like to request a bunch of data containing values from not more than some 5-7 tables (out of those 100) in queries that put conditions (in WHERE part of the query) on the fields from some 3-4 tables (out of those 100). Different queries have different combinations of the tables used to produce "SELECT" part of query and to put conditions in "WHERE". But, again, every SELECT would require some 5-7 tables and every WHERE would requre some 3-4 tables (definitely, the list of tables used to produce SELECT may overlap with the list of tables used to put conditions in WHERE).
I can write a VIEW with the underlying code joining all those 100 tables. Then I can write the mentioned above SQL-queries to this VIEW. But in this case it is a big issue for me how to instruct SQL Server that (despite the explicit instructions in the code to join all those 100 tables) only some 11 tables should be joined (11 tables are enough to be joined to produce SELECT outcome and take into account WHERE conditions).
Another approach may be to create a "feature" that converts the following "fake" code
SELECT field1, field2, field3 FROM TheFakeTable WHERE field1=12 and field4=5
into the following "real" code:
SELECT T1.field1, T2.field2, T3.field3 FROM TheRealMainTable
join T1 on ....
join T2 on ....
join T3 on ....
join T4 on ....
WHERE T1.field1=12 and T4.field4=5
From a grammatical point of view, it is not a problem even to allow any mixed combinations of this "TheFakeTable-mechanism" with real tables and constructions. The real problem here is how to realize this "feature" technically. I can create a function which takes the "fake" code as input and produces the "real" code. But it is not convenient because it requires using dynamic SQL tools everywhere this "TheFakeTable-mechanism" appears. A fantasy-land solution is to extend the grammar of the SQL language in my Management Studio to allow writing such fake code and then automatically converting this code into the real one before sending it to the server.
My questions are:
whether SQL Server can be instructed somehow (or be genius enough) to join only 11 tables instead of 100 in the VIEW described above?
If I decide to create this "TheFakeTable-mechanism" feature, what would be the best form for the technical realization of this feature?
Thanks to everyone for every comment!
PS
The structure with 100 tables arises from the following question that I asked here:
Normalizing an extremely big table
The SQL Server optimizer does contain logic to remove redundant joins, but there are restrictions, and the joins have to be provably redundant. To summarize, a join can have four effects:
It can add extra columns (from the joined table)
It can add extra rows (the joined table may match a source row more than once)
It can remove rows (the joined table may not have a match)
It can introduce NULLs (for a RIGHT or FULL JOIN)
To successfully remove a redundant join, the query (or view) must account for all four possibilities. When this is done correctly, the effect can be astonishing. For example:
USE AdventureWorks2012;
GO
CREATE VIEW dbo.ComplexView
AS
SELECT
pc.ProductCategoryID, pc.Name AS CatName,
ps.ProductSubcategoryID, ps.Name AS SubCatName,
p.ProductID, p.Name AS ProductName,
p.Color, p.ListPrice, p.ReorderPoint,
pm.Name AS ModelName, pm.ModifiedDate
FROM Production.ProductCategory AS pc
FULL JOIN Production.ProductSubcategory AS ps ON
ps.ProductCategoryID = pc.ProductCategoryID
FULL JOIN Production.Product AS p ON
p.ProductSubcategoryID = ps.ProductSubcategoryID
FULL JOIN Production.ProductModel AS pm ON
pm.ProductModelID = p.ProductModelID
The optimizer can successfully simplify the following query:
SELECT
c.ProductID,
c.ProductName
FROM dbo.ComplexView AS c
WHERE
c.ProductName LIKE N'G%';
To a plan that accesses only the Product table (the execution plan image is not reproduced here).
Rob Farley wrote about these ideas in depth in the original MVP Deep Dives book, and there is a recording of him presenting on the topic at SQLBits.
The main restrictions are that foreign key relationships must be based on a single key to contribute to the simplification process, and compilation time for the queries against such a view may become quite long particularly as the number of joins increases. It could be quite a challenge to write a 100-table view that gets all the semantics exactly correct. I would be inclined to find an alternative solution, perhaps using dynamic SQL.
That said, the particular qualities of your denormalized table may mean the view is quite simple to assemble, requiring only enforced FOREIGN KEYs, non-NULLable referenced columns, and appropriate UNIQUE constraints to make this solution work as you would hope, without the overhead of 100 physical join operators in the plan.
Example
Using ten tables rather than a hundred:
-- Referenced tables
CREATE TABLE dbo.Ref01 (col01 tinyint PRIMARY KEY, item varchar(50) NOT NULL UNIQUE);
CREATE TABLE dbo.Ref02 (col02 tinyint PRIMARY KEY, item varchar(50) NOT NULL UNIQUE);
CREATE TABLE dbo.Ref03 (col03 tinyint PRIMARY KEY, item varchar(50) NOT NULL UNIQUE);
CREATE TABLE dbo.Ref04 (col04 tinyint PRIMARY KEY, item varchar(50) NOT NULL UNIQUE);
CREATE TABLE dbo.Ref05 (col05 tinyint PRIMARY KEY, item varchar(50) NOT NULL UNIQUE);
CREATE TABLE dbo.Ref06 (col06 tinyint PRIMARY KEY, item varchar(50) NOT NULL UNIQUE);
CREATE TABLE dbo.Ref07 (col07 tinyint PRIMARY KEY, item varchar(50) NOT NULL UNIQUE);
CREATE TABLE dbo.Ref08 (col08 tinyint PRIMARY KEY, item varchar(50) NOT NULL UNIQUE);
CREATE TABLE dbo.Ref09 (col09 tinyint PRIMARY KEY, item varchar(50) NOT NULL UNIQUE);
CREATE TABLE dbo.Ref10 (col10 tinyint PRIMARY KEY, item varchar(50) NOT NULL UNIQUE);
The parent table definition (with page-compression):
CREATE TABLE dbo.Normalized
(
pk integer IDENTITY NOT NULL,
col01 tinyint NOT NULL REFERENCES dbo.Ref01,
col02 tinyint NOT NULL REFERENCES dbo.Ref02,
col03 tinyint NOT NULL REFERENCES dbo.Ref03,
col04 tinyint NOT NULL REFERENCES dbo.Ref04,
col05 tinyint NOT NULL REFERENCES dbo.Ref05,
col06 tinyint NOT NULL REFERENCES dbo.Ref06,
col07 tinyint NOT NULL REFERENCES dbo.Ref07,
col08 tinyint NOT NULL REFERENCES dbo.Ref08,
col09 tinyint NOT NULL REFERENCES dbo.Ref09,
col10 tinyint NOT NULL REFERENCES dbo.Ref10,
CONSTRAINT PK_Normalized
PRIMARY KEY CLUSTERED (pk)
WITH (DATA_COMPRESSION = PAGE)
);
The view:
CREATE VIEW dbo.Denormalized
WITH SCHEMABINDING AS
SELECT
item01 = r01.item,
item02 = r02.item,
item03 = r03.item,
item04 = r04.item,
item05 = r05.item,
item06 = r06.item,
item07 = r07.item,
item08 = r08.item,
item09 = r09.item,
item10 = r10.item
FROM dbo.Normalized AS n
JOIN dbo.Ref01 AS r01 ON r01.col01 = n.col01
JOIN dbo.Ref02 AS r02 ON r02.col02 = n.col02
JOIN dbo.Ref03 AS r03 ON r03.col03 = n.col03
JOIN dbo.Ref04 AS r04 ON r04.col04 = n.col04
JOIN dbo.Ref05 AS r05 ON r05.col05 = n.col05
JOIN dbo.Ref06 AS r06 ON r06.col06 = n.col06
JOIN dbo.Ref07 AS r07 ON r07.col07 = n.col07
JOIN dbo.Ref08 AS r08 ON r08.col08 = n.col08
JOIN dbo.Ref09 AS r09 ON r09.col09 = n.col09
JOIN dbo.Ref10 AS r10 ON r10.col10 = n.col10;
Hack the statistics to make the optimizer think the table is very large:
UPDATE STATISTICS dbo.Normalized WITH ROWCOUNT = 100000000, PAGECOUNT = 5000000;
Example user query:
SELECT
d.item06,
d.item07
FROM dbo.Denormalized AS d
WHERE
d.item08 = 'Banana'
AND d.item01 = 'Green';
Gives us this execution plan:
The scan of the Normalized table looks bad, but both Bloom-filter bitmaps are applied during the scan by the storage engine (so rows that cannot match do not even surface as far as the query processor). This may be enough to give acceptable performance in your case, and certainly better than scanning the original table with its overflowing columns.
If you are able to upgrade to SQL Server 2012 Enterprise at some stage, you have another option: creating a column-store index on the Normalized table:
CREATE NONCLUSTERED COLUMNSTORE INDEX cs
ON dbo.Normalized (col01,col02,col03,col04,col05,col06,col07,col08,col09,col10);
The execution plan is:
That probably looks worse to you, but column storage provides exceptional compression, and the whole execution plan runs in Batch Mode with filters for all the contributing columns. If the server has adequate threads and memory available, this alternative could really fly.
Ultimately, I'm not sure this normalization is the correct approach considering the number of tables and the chances of getting a poor execution plan or requiring excessive compilation time. I would probably correct the schema of the denormalized table first (proper data types and so on), possibly apply data compression...the usual things.
If the data truly belongs in a star-schema, it probably needs more design work than just splitting off repeating data elements into separate tables.
Why do you think joining 100 tables would be a performance issue?
If all the keys are primary keys, then all the joins will use indexes. The only question, then, is whether the indexes fit into memory. If they fit in memory, performance is probably not an issue at all.
You should try the query with the 100 joins before making such a statement.
Furthermore, based on the original question, the reference tables have just a few values in them. The tables themselves fit on a single page, plus another page for the index. This is 200 pages, which would occupy at most a few megabytes of your page cache. Don't worry about the optimizations, create the view, and if you have performance problems then think about the next steps. Don't presuppose performance problems.
ELABORATION:
This has received a lot of comments. Let me explain why this idea may not be as crazy as it sounds.
First, I am assuming that all the joins are done through primary key indexes, and that the indexes fit into memory.
The 100 keys on the page occupy 400 bytes. Let's say that the original strings are, on average 40 bytes each. These would have occupied 4,000 bytes on the page, so we have a savings. In fact, about 2 records would fit on a page in the previous scheme. About 20 fit on a page with the keys.
So, to read the records with the keys is about 10 times faster in terms of I/O than reading the original records. With the assumptions about the small number of values, the indexes and original data fit into memory.
How long does it take to read 20 records? The old way required reading 10 pages. With the keys, there is one page read and 100*20 index lookups (with perhaps an additional lookup to get the value). Depending on the system, the 2,000 index lookups may be faster -- even much faster -- than the additional 9 page I/Os. The point I want to make is that this is a reasonable situation. It may or may not happen on a particular system, but it is not way crazy.
This is a bit oversimplified. SQL Server doesn't actually read pages one-at-a-time. I think they are read in groups of 4 (and there might be look-ahead reads when doing a full-table scan). On the flip side, though, in most cases, a table-scan query is going to be more I/O bound than processor bound, so there are spare processor cycles for looking up values in reference tables.
In fact, using the keys could result in faster reading of the table than not using them, because spare processing cycles would be used for the lookups ("spare" in the sense that processing power is available when reading). In fact, the table with the keys might be small enough to fit into available cache, greatly improving performance of more complex queries.
The actual performance depends on lots of factors, such as the length of the strings, the original table (is it larger than available cache?), the ability of the underlying hardware to do I/O reads and processing at the same time, and the dependence on the query optimizer to do the joins correctly.
My original point was that assuming a priori that the 100 joins are a bad thing is not correct. The assumption needs to be tested, and using the keys might even give a boost to performance.
If your data doesn't change much, you may benefit from creating an Indexed View, which basically materializes the view.
If the data changes often, it may not be a good option, as the server has to maintain the indexed view for each change in the underlying tables of the view.
Here's a good blog post that describes it a bit better.
From the blog:
CREATE VIEW dbo.vw_SalesByProduct_Indexed
WITH SCHEMABINDING
AS
SELECT
Product,
COUNT_BIG(*) AS ProductCount,
SUM(ISNULL(SalePrice,0)) AS TotalSales
FROM dbo.SalesHistory
GROUP BY Product
GO
The script below creates the index on our view:
CREATE UNIQUE CLUSTERED INDEX idx_SalesView ON vw_SalesByProduct_Indexed(Product)
To show that an index has been created on the view and that it does take up space in the database, run the following script to find out how many rows are in the clustered index and how much space the view takes up.
EXECUTE sp_spaceused 'vw_SalesByProduct_Indexed'
The SELECT statement below is the same statement as before, except this time it performs a clustered index seek, which is typically very fast.
SELECT
Product, TotalSales, ProductCount
FROM vw_SalesByProduct_Indexed
WHERE Product = 'Computer'

Is There ANY Sense in SQL Data Type VARCHAR(1)?

I've bumped into a lot of VARCHAR(1) fields in a database I've recently had to work with. I rolled my eyes: obviously the designer didn't have a clue. But maybe I'm the one who needs to learn something. Is there any conceivable reason to use a VARCHAR(1) data type rather than CHAR(1)? I would think that the RDBMS would convert the one to the other automatically.
The database is MS SQL 2K5, but evolved from Access back in the day.
Yes there is sense to it.
It is easier to define in the language: it is consistent and simpler to let varchar allow lengths of 1-8000 than to say it must be 2 (or 3) to 8000.
The VARying CHARacter aspect of VARCHAR(1) is exactly that. It may not be optimal for storage but conveys a specific meaning, that the data is either 1 char (classroom code) or blank (outside activity) instead of NULL (unknown/not-yet-classified).
Storage plays very little part in this - looking at a database schema for CHAR(1), you would almost expect that it must always have a 1 char value, such as credit cards must have 16 digits. That is simply not the case with some data where it can be one or optionally none.
There are also differences between using VARCHAR(1) and the CHAR(1)+NULL combination, for those who say the tri-state [ 1-char | 0-char | NULL ] is completely useless. It allows for SQL statements like:
select activity + '-' + classroom
from ...
which would otherwise be more difficult if you use char(1)+NULL, which can convey the same information but has subtle differences.
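For example, a quick sketch of that subtle difference (hypothetical values; 'X' stands for a classroom code):
DECLARE @activity_v varchar(1), @activity_c char(1);
SET @activity_v = '';    -- varchar(1): blank means "outside activity"
SET @activity_c = NULL;  -- char(1) alternative: NULL means "outside activity"

SELECT @activity_v + '-' + 'X' AS varchar_concat,   -- returns '-X'
       @activity_c + '-' + 'X' AS char_null_concat; -- returns NULL (NULL propagates through +)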
AFAIK, no.
A VARCHAR(1) requires 3 bytes of storage (the storage size is the actual length of the data entered + 2 bytes).
A CHAR(1) requires 1 byte.
From a storage perspective, a rule of thumb is: if it's less than or equal to 5 chars, consider using a fixed-length char column.
A reason to avoid varchar(1) (aside from the fact that they convey poor design reasoning, IMO) is when using Linq2SQL: LINQ to SQL and varchar(1) fields
A varchar(1) can store a zero length ("empty") string. A char(1) can't as it will get padded out to a single space. If this distinction is important to you you may favour the varchar.
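A quick sketch showing that distinction:
DECLARE @v varchar(1), @c char(1);
SET @v = '';
SET @c = '';

SELECT DATALENGTH(@v) AS varchar_length,  -- 0: the zero-length string survives
       DATALENGTH(@c) AS char_length;     -- 1: padded out to a single space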
Apart from that, one use case for this may be if the designer wants to allow for the possibility that a greater number of characters may be required in the future.
Altering a fixed length datatype from char(1) to char(2) means that all the table rows need to be updated and any indexes or constraints that access this column dropped first.
Making these changes to a large table in production can be an extremely time consuming operation that requires down time.
Altering a column from varchar(1) to varchar(2) is much easier as it is a metadata only change (FK constraints that reference the column will need to be dropped and recreated but no need to rebuild the indexes or update the data pages).
Moreover, the 2-bytes-per-row saving might not always materialize anyway. If the row definition is already quite long, it won't always affect the number of rows that can fit on a data page. Another case is when using the compression feature in Enterprise Edition: the way the data is stored is then entirely different from that described in Mitch's answer, and both varchar(1) and char(1) end up stored the same way in the short data region.
@Thomas - e.g. try this table definition.
CREATE TABLE T2
(
Code VARCHAR(1),
Foo datetime2,
Bar int,
Filler CHAR(4000),
PRIMARY KEY CLUSTERED (Code, Foo, Bar)
)
INSERT INTO T2
SELECT TOP 100000 'A',
GETDATE(),
ROW_NUMBER() OVER (ORDER BY (SELECT 0)),
NULL
FROM master..spt_values v1, master..spt_values v2
CREATE NONCLUSTERED INDEX IX_T2_Foo ON T2(Foo) INCLUDE (Filler);
CREATE NONCLUSTERED INDEX IX_T2_Bar ON T2(Bar) INCLUDE (Filler);
For a varchar it is trivial to change the column definition from varchar(1) to varchar(2). This is a metadata only change.
ALTER TABLE T2 ALTER COLUMN Code VARCHAR(2) NOT NULL
If the change is from char(1) to char(2) the following steps must happen.
Drop the PK from the table. This converts the table into a heap and means all non clustered indexes need to be updated with the RID rather than the clustered index key.
Alter the column definition. This means all rows are updated in the table so that Code now is stored as char(2).
Add back the clustered PK constraint. As well as rebuilding the CI itself this means all non clustered indexes need to be updated again with the CI key as a row pointer rather than the RID.

Slow distinct query in SQL Server over large dataset

We're using SQL Server 2005 to track a fair amount of constantly incoming data (5-15 updates per second). We noticed after it has been in production for a couple months that one of the tables has started to take an obscene amount of time to query.
The table has 3 columns:
id -- autonumber (clustered)
typeUUID -- GUID generated before the insert happens; used to group the types together
typeName -- The type name (duh...)
One of the queries we run is a distinct on the typeName field:
SELECT DISTINCT [typeName] FROM [types] WITH (nolock);
The typeName field has a non-clustered, non-unique ascending index on it. The table contains approximately 200M records at the moment. When we run this query, it takes 5m 58s to return! Perhaps we're not understanding how the indexes work... but I didn't think we misunderstood them that much.
To test this a little further, we ran the following query:
SELECT DISTINCT [typeName] FROM (SELECT TOP 1000000 [typeName] FROM [types] WITH (nolock)) AS [subtbl]
This query returns in about 10 seconds, as I would expect, it's scanning the table.
Is there something we're missing here? Why does the first query take so long?
Edit: Ah, my apologies, the first query returns 76 records, thank you ninesided.
Follow up: Thank you all for your answers, it makes more sense to me now (I don't know why it didn't before...). Without an index, it's doing a table scan across 200M rows, with an index, it's doing an index scan across 200M rows...
SQL Server does prefer the index, and it does give a little bit of a performance boost, but nothing to be excited about. Rebuilding the index did take the query time down to just over 3m instead of 6m, an improvement, but not enough. I'm just going to recommend to my boss that we normalize the table structure.
Once again, thank you all for your help!!
You do misunderstand the index. Even if it did use the index, it would still do an index scan across 200M entries. This is going to take a long time, plus the time it takes to do the DISTINCT (which causes a sort), so it's a bad thing to run. Seeing a DISTINCT in a query always raises a red flag and causes me to double-check the query. In this case, perhaps you have a normalization issue?
There is an issue with the SQL Server optimizer when using the DISTINCT keyword. The solution was to force it to keep the same query plan by breaking out the distinct query separately.
So we took queries such as:
SELECT DISTINCT [typeName] FROM [types] WITH (nolock);
and broke them up into the following:
SELECT typeName INTO #tempTable1 FROM types WITH (NOLOCK)
SELECT DISTINCT typeName FROM #tempTable1
Another way to get around it is to use a GROUP BY, which gets a different optimization plan.
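For example, the GROUP BY form of the same query (same result set as the DISTINCT, but the optimizer may pick a different aggregate strategy):
SELECT typeName
FROM [types] WITH (NOLOCK)
GROUP BY typeName;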
I doubt SQL Server will even try to use the index; it'd have to do practically the same amount of work (given the narrow table), reading all 200M rows regardless of whether it looks at the table or the index. If the index on typeName were clustered, it might reduce the time taken, as it shouldn't need to sort before grouping.
If the cardinality of your types is low, how about maintaining a summary table which holds the list of distinct type values? A trigger on insert/update of the main table would do a check on the summary table and insert a new record when a new type is found.
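A rough sketch of that summary-table idea (the table name, trigger name, and column type are assumptions on my part):
CREATE TABLE dbo.distinctTypeNames
(
    typeName varchar(100) NOT NULL PRIMARY KEY  -- guessing at the column's real type
);
GO

CREATE TRIGGER trg_types_distinctTypeNames ON dbo.types
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    -- Add any type names we have not seen before.
    INSERT INTO dbo.distinctTypeNames (typeName)
    SELECT DISTINCT i.typeName
    FROM inserted AS i
    WHERE i.typeName IS NOT NULL
      AND NOT EXISTS (SELECT 1 FROM dbo.distinctTypeNames s WHERE s.typeName = i.typeName);
END
GO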
As others have already pointed out - when you do a SELECT DISTINCT (typename) over your table, you'll end up with a full table scan no matter what.
So it's really a matter of limiting the number of rows that need to be scanned.
The question is: what do you need your DISTINCT typenames for? And how many of your 200M rows are distinct? Do you have only a handful (a few hundred at most) distinct typenames??
If so - you could have a separate table DISTINCT_TYPENAMES or something and fill those initially by doing a full table scan, and then on inserting new rows to the main table, just always check whether their typename is already in DISTINCT_TYPENAMES, and if not, add it.
That way, you'd have a separate, small table with just the distinct TypeName entries, which would be lightning fast to query and/or to display.
Marc
A looping approach should use multiple seeks (but loses some parallelism). It might be worth a try for cases with relatively few distinct values compared to the total number of rows (low cardinality).
The idea came from this question:
select typeName into #Result from Types where 1=0;

declare @t varchar(100);
set @t = (select min(typeName) from Types);

while @t is not null
begin
    insert into #Result values (@t);
    set @t = (select top 1 typeName from Types where typeName > @t order by typeName);
end

select * from #Result;
And it looks like there are also some other methods (notably the recursive CTE from @Paul White):
different-ways-to-find-distinct-values-faster-methods
sqlservercentral Topic873124-338-5
My first thought is statistics. To find last updated:
SELECT
name AS index_name,
STATS_DATE(object_id, index_id) AS statistics_update_date
FROM
sys.indexes
WHERE
object_id = OBJECT_ID('MyTable');
Edit: Stats are updated when indexes are rebuilt, which I see are not maintained
My second thought: is the index still there? The TOP query should still use an index.
I've just tested on one of my tables with 57 million rows and both use the index.
An indexed view can make this faster.
create view alltypes
with schemabinding as
select typename, count_big(*) as kount
from dbo.types
group by typename
create unique clustered index idx
on alltypes (typename)
The work to keep the view up to date on each change to the base table should be moderate (depending on your application, of course -- my point is that it doesn't have to scan the whole table each time or do anything insanely expensive like that.)
Alternatively you could make a small table holding all values:
select distinct typename
into alltypes
from types
alter table alltypes
add primary key (typename)
alter table types add foreign key (typename) references alltypes
The foreign key will make sure that all values used appear in the parent alltypes table. The trouble is in ensuring that alltypes does not contain values not used in the child types table.
I would try something like this:
SELECT typeName FROM [types] WITH (nolock)
group by typeName;
And like others, I would say you need to normalize that column.
An index helps you quickly find a row. But you're asking the database to list all unique types for the entire table. An index can't help with that.
You could run a nightly job which runs the query and stores it in a different table. If you require up-to-date data, you could store the last ID included in the nightly scan, and combine the results:
select type
from nightlyscan
union
select distinct type
from verybigtable
where rowid > lastscannedid
Another option is to normalize the big table into two tables:
table1: id, guid, typeid
type table: typeid, typename
This would be very beneficial if the number of types was relatively small.
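A minimal sketch of that split (data types are assumptions based on the description above):
-- Small lookup table of distinct type names.
CREATE TABLE dbo.typeNames
(
    typeId   int IDENTITY(1,1) PRIMARY KEY,
    typeName varchar(100) NOT NULL UNIQUE
);

-- The big table keeps only the narrow surrogate key.
CREATE TABLE dbo.typeRows
(
    id       int IDENTITY(1,1) PRIMARY KEY,
    typeUUID uniqueidentifier NOT NULL,
    typeId   int NOT NULL REFERENCES dbo.typeNames (typeId)
);

-- "All distinct type names" becomes a scan of the tiny lookup table.
SELECT typeName FROM dbo.typeNames;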
I could be missing something, but would it be more efficient to take an overhead on load by creating a view (or table) with the distinct values and querying that instead?
This would give almost instant responses to the select if the result set is significantly smaller, with the overhead of populating it on each write, though given the nature of the view that might be trivial in itself.
It does raise the question of how many writes there are compared to how often you want the distinct values, and how important the speed is when you do.