Hashset equivalent in SQL Server - sql

I want to create a large table (about 45 billion rows) that is always accessed by a unique key.
Outside of the DB, the best structure to hold this is a Dictionary or a HashSet, but of course due to the size of data, it's not possible to do this outside of the database.
Does SQL Server provide a structure that's optimized for key-value access? I understand that a clustered key is very fast, but still it's an index and therefore there will be some additional disk reads associated with traversing index pages. What I would like to get from SQL Server is a "native" structure that stores data as key-value pairs and then makes it possible to access values based on keys.
In other words, my question is how to store in SQL Server 45 billion rows and efficiently access them WITHOUT having an index, clustered or non-clustered, because reading the index non-leaf pages may result in substantial IO, and since each value can be accessed by a unique key, it should be possible to have a structure where the hash of a key resolves into a physical location of the value. To get 1 value, we would need to do 1 read (unless there are hash collisions).
(an equivalent in Oracle is Hash Cluster)
Thanks for your help.

No such thing in SQL Server. Your only option is an index. If you're going to be requesting all columns for a given key, you should use a clustered index. If you're only going to be requesting a subset, you should use a non-clustered index that includes only the columns you want, like this:
create index IX_MyBigTable on MyBigTable(keyColumn) include (col1, col2, col3youneed);
This will be pretty efficient.
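To put a rough number on the index-traversal cost the question worries about, here is a minimal back-of-the-envelope sketch. The figures of 100 rows per leaf page and 400 keys per non-leaf page are assumptions chosen for illustration, not measurements from SQL Server, and `btree_levels` is a hypothetical helper.

```python
def btree_levels(total_rows: int, rows_per_leaf_page: int, keys_per_index_page: int) -> int:
    """Count B-tree levels (leaf level included) needed to hold total_rows."""
    # Ceiling division: number of leaf pages needed for all rows.
    pages = -(-total_rows // rows_per_leaf_page)
    levels = 1
    # Each level above needs one entry per page of the level below.
    while pages > 1:
        pages = -(-pages // keys_per_index_page)
        levels += 1
    return levels


# Assumed page capacities; real values depend on key width and row size.
levels = btree_levels(45_000_000_000, rows_per_leaf_page=100, keys_per_index_page=400)
print(levels)
```

With these assumed capacities the function returns 5, so a single-key lookup touches about five pages. The upper levels are small and are typically served from memory rather than disk, which is one reason a seek on a large index can stay cheap.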

According to my benchmarks, the best approach is to create a hash column for the key. Details.

Related

Are indexes on columns with always a different value worth it?

Does creating an index on a column that will always have a different value in each record (like a unique column) improve performance on SELECTs?
I understand that having an index on a column named e.g. status, which can have 3 values (such as PENDING, DONE, FAILED), and searching only for FAILED in 1kk (one million) records will be faster.
But what happens if I have a unique id (not the primary key) in 1kk records, and I'm doing a SELECT on that column?
An index on a unique column is actually better than an index on a column with a few values.
To understand why, you need a basic understanding of how databases manage storage. This is a high-level view.
The primary purpose of an index is to reduce the number of pages that need to be read for a query. The rows themselves are stored on data pages. If you don't have an index, then all the data needs to be read.
The index is a data structure that makes it efficient to find a particular value. You can think of it as a sorted list, where a binary search is used to identify the right location. In actual fact, these are usually stored in a structure called b-trees (where the "b" stands for "balanced", not "binary") but that is an implementation detail. And there are types of indexes that don't use b-trees.
So, if the values are unique, then an index is extremely helpful. Instead of doing a full table scan, the "row id" can efficiently be looked up in the index and then only one data page needs to be read.
Note that unique constraints are implemented using indexes. If you have declared a column to be unique, there is no need for an additional index because it is already there.
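As a small illustration of a unique constraint bringing its own index, the sketch below uses SQLite through Python's standard `sqlite3` module, simply because it runs anywhere. This is an assumption-laden stand-in: SQL Server names and reports its indexes differently, and the table and column names here are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# external_id is unique but is not the primary key.
conn.execute(
    "CREATE TABLE records (id INTEGER PRIMARY KEY, external_id TEXT UNIQUE, payload TEXT)"
)
conn.executemany(
    "INSERT INTO records (external_id, payload) VALUES (?, ?)",
    ((f"ext-{i}", "x") for i in range(1000)),
)

# Ask SQLite how it would execute a lookup on the unique column.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT payload FROM records WHERE external_id = ?",
    ("ext-500",),
).fetchall()
plan_text = " | ".join(row[3] for row in plan)  # 4th column holds the plan detail
print(plan_text)
```

No CREATE INDEX statement was issued, yet the plan reports a search through the index that SQLite created automatically for the UNIQUE constraint, rather than a scan of the whole table.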

Searching for record(s) in a table that has over 200 Million Rows

Which type of index should be used on the table? The data is initially inserted (once a month) into an empty table. I then place a non-clustered composite index on two of the columns. I'm wondering whether merging the two fields into one would increase performance when searching, or does it not matter? Should I be working with an identity column that has a primary key clustered index?
You should index the field(s) most likely to be used in the where clause as people query the table. Don't worry about the primary key - it already has an index.
If you can define a unique primary key that can be used when querying the table, this will be used as the clustered index and will be the fastest for selects.
If your select query has to use the two fields you mentioned, keep them separate. Performance will not be impacted and the schema is not spoiled.
"A clustered index is particularly efficient on columns that are often searched for ranges of values. After the row with the first value is found using the clustered index, rows with subsequent indexed values are guaranteed to be physically adjacent."
With this in mind, you probably won't see much benefit from having a clustered index on your primary key (ID) unless it has business meaning for your application. If you have a date value that you commonly query on, then it may make more sense to put the clustered index on that:
select * from table where created > '2013-01-01' and created < '2013-02-01'
I have seen data warehouses use a concatenated key approach. Whether this works for you depends on your queries. Obviously querying a single field value will be faster than multiple fields, particularly when there is one less lookup in the B-tree index.
Alternatively, if you have 200 million rows in a table you could look at breaking the data out into multiple tables if it makes sense to do so.
You're saying that you're loading all this data every month, so I have to assume that all the data is relevant. If there was data in your table that is considered "old" and not relevant to searches, then you could move that data out into an archive table (using the same schema) so your queries only run against "current" data.
Otherwise, you can look at a sharding approach as used by NoSQL databases like MongoDB. If MongoDB is not an option, you could implement the same shard-key-like logic in your application. I doubt that your database SQL drivers will support sharding natively.

Do I need to use this many indexes in my SQL Server 2008 database?

I'd appreciate some advice from SQL Server gurus here. Let me explain...
I have an SQL Server 2008 database table that has 21 columns. Here's a quick list of their types:
INT Primary Key
Several other INT's that are indexes already (used to reference this and other tables)
Several NVARCHAR(64) to hold user-provided text
Several NVARCHAR(256) to hold longer user-provided text
Several DATETIME2
One BIGINT
Several UNIQUEIDENTIFIER, one is already an index
The way this table is used is that it is presented to a user as a sortable table and a user can choose which column to sort it by. This table may contain many thousands of records (like currently it does 21,000 and it will be growing.)
So my question is, do I need to set each column as an INDEX to enable faster sorting?
PS. Forgot to say. The output obviously supports pagination, so the user sees no more than 100 rows at once.
Contrary to popular belief, just having an index on a column does not guarantee that any queries will be any faster!
If you constantly use SELECT *.. from that table, these non-clustered indices on a single column will most likely not be used at all.
A good nonclustered index is a covering index, which means it contains all the necessary columns to satisfy one or multiple given queries. If you have this situation, then a nonclustered index can make sense - otherwise, in more cases than not, the nonclustered index is likely to be ignored by the query optimizer. The reason for this being: if you need all the columns anyway, the query would have to do key lookups from the nonclustered index into the actual data (the clustered index) for each row found - and the key lookup is a very expensive operation, so doing this for a lot of hits becomes overly costly, and the query optimizer will rather quickly switch to an index scan (possibly the clustered index scan) to fetch the data.
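The covering versus non-covering distinction can be seen in a query plan. The sketch below uses SQLite via Python's `sqlite3` module as a convenient stand-in, so it is only an analogy: the table, index, and helper names are hypothetical, and it does not reproduce SQL Server's cost-based switch to a scan, only the difference between a query the index can answer alone and one that must go back to the table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tickets (id INTEGER PRIMARY KEY, status TEXT, created TEXT, notes TEXT)"
)
conn.executemany(
    "INSERT INTO tickets (status, created, notes) VALUES (?, ?, ?)",
    [
        ("FAILED" if i % 50 == 0 else "DONE", f"2013-01-{i % 28 + 1:02d}", "n/a")
        for i in range(1000)
    ],
)
conn.execute("CREATE INDEX ix_tickets_status_created ON tickets (status, created)")


def plan_for(sql: str) -> str:
    """Return the plan detail text SQLite reports for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)


# Only indexed columns are requested: the index alone can answer this.
covered = plan_for("SELECT status, created FROM tickets WHERE status = 'FAILED'")
# All columns are requested: every index hit needs an extra lookup into the table.
not_covered = plan_for("SELECT * FROM tickets WHERE status = 'FAILED'")
print(covered)
print(not_covered)
```

The first plan mentions a covering index, while the second does not, which corresponds to the extra per-row lookup described above.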
Don't over-index - use a well-designed clustered index, put indices on the foreign key columns to speed up joins - and then let it be for the time being. Observe your system, measure performance, maybe add an index here or there - but don't just overload the system with tons of indices!
Having too many indices can be worse than having none - every index must be maintained, e.g. updated for each INSERT, UPDATE and DELETE statement - and that does take time!
this table is ... presented to a user as a sortable table ... [that] may contain many thousands of records
If you're ordering many thousands of records for display, you're doing it wrong. Typical users can reasonably process at most around 500 typical records. Exceptional users can handle a couple thousand. Any more than that, and you're misleading your users into a false sense that they've seen a representative sample. This results in poor decision making and inefficient user workflow. Instead, you need to focus on a good search algorithm.
Another thing to keep in mind here is that more indexes mean slower inserts and updates. It's a balancing act. SQL Server keeps statistics on what queries and sorts it actually performs, and makes those statistics available to you. There are queries you can run that tell you exactly what indexes SQL Server thinks it could use. I would deploy without any sorting index and let it run for a week or two that way. Then look at the data, see what users actually sort on, and index just those columns.
Take a look at this link for an example and introduction on finding missing indexes:
http://sqlserverpedia.com/wiki/Find_Missing_Indexes
Generally, indexes are used to accelerate WHERE conditions (and in some cases JOINs), so I don't think creating an index on a column other than the PRIMARY KEY will accelerate sorting. You can do your sorting on the client (if you use WinForms or WPF) or in the database for web scenarios.
Good Luck

Do clustered indexes have to be unique?

What happens if a clustered index is not unique? Can it lead to bad performance because inserted rows flow to an "overflow" page of some sorts?
Is it "made" unique and if so how? What is the best way to make it unique?
I am asking because I am currently using a clustered index to divide my table in logical parts, but the performance is so-so, and recently I got the advice to make my clustered indexes unique. I'd like a second opinion on that.
They don't have to be unique but it certainly is encouraged.
I haven't encountered a scenario yet where I wanted to create a CI on a non-unique column.
What happens if you create a CI on a non-unique column
If the clustered index is not a unique index, SQL Server makes any duplicate keys unique by adding an internally generated value called a uniqueifier.
Does this lead to bad performance?
Adding a uniqueifier certainly adds some overhead in calculating and in storing it.
Whether this overhead is noticeable depends on several factors:
How much data the table contains.
What the rate of inserts is.
How often the CI is used in a select (when no covering indexes exist, pretty much always).
Edit
As has been pointed out by Remus in the comments, there do exist use cases where creating a non-unique CI would be a reasonable choice. My not having encountered one of those scenarios merely shows my own lack of exposure or competence (take your pick).
I like to check out what The Queen of Indexing, Kimberly Tripp, has to say on the topic:
I'm going to start with my recommendation for the Clustering Key - for a couple of reasons. First, it's an easy decision to make and second, making this decision early helps to proactively prevent some types of fragmentation. If you can prevent certain types of base-table fragmentation then you can minimize some maintenance activities (some of which, in SQL Server 2000 AND less of which, in SQL Server 2005) require that your table be offline. OK, I'll get to the rebuild stuff later.....
Let's start with the key things that I look for in a clustering key:
* Unique
* Narrow
* Static
Why Unique?
A clustering key should be unique because a clustering key (when one exists) is used as the lookup key from all non-clustered indexes. Take for example an index in the back of a book - if you need to find the data that an index entry points to - that entry (the index entry) must be unique otherwise, which index entry would be the one you're looking for? So, when you create the clustered index - it must be unique. But, SQL Server doesn't require that your clustering key is created on a unique column. You can create it on any column(s) you'd like. Internally, if the clustering key is not unique then SQL Server will “uniquify” it by adding a 4-byte integer to the data. So if the clustered index is created on something which is not unique then not only is there additional overhead at index creation, there's wasted disk space, additional costs on INSERTs and UPDATEs, and in SQL Server 2000, there's an added cost on a clustered index rebuild (which because of the poor choice for the clustering key is now more likely).
Source: Ever-increasing clustering key debate - again!
Do clustered indexes have to be unique?
They don't, and there are times where it's better if they're not.
Consider a table with a semi-random, unique EmployeeId, and a DepartmentId for each employee: if your select statement is
SELECT * FROM EmployeeTable WHERE DepartmentId=%DepartmentValue%
then it's best for performance if the DepartmentId is the clustered index even though (or even especially because) it's not the unique index (best for performance because it ensures that all the records within a given DepartmentId are clustered).
Do you have any references?
There's Clustered Index Design Guidelines for example, which says,
With few exceptions, every table should have a clustered index defined on the column, or columns, that offer the following:
Can be used for frequently used queries.
Provide a high degree of uniqueness.
Can be used in range queries.
My understanding of "high degree of uniqueness" for example is that it isn't good to choose "Country" as the clustered index if most of your queries want to select the records within a given town.
If you are tuning an old DB this is a godsend. I am working on perf issues on a 20-year-old DB. It has nonclustered PKs with 3 - 8 columns. Instead of using all 8 columns to be unique, I can pick one column with broad distribution as the clustering key, and SQL Server applies a uniqueifier. The uniqueifier is a 4-byte integer, so with a column like ProjectID it can distinguish up to 2,147,483,647 rows that share the same ProjectID, which is enough for most use cases. If it is not enough, add a second or third column to the clustered index.
This works without any coding modification in the App layer. 20 years in production and management doesn't have to order a major rewrite.

What is an index in SQL?

Also, when is it appropriate to use one?
An index is used to speed up searching in the database. MySQL has some good documentation on the subject (which is relevant for other SQL servers as well):
http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html
An index can be used to efficiently find all rows matching some column in your query and then walk through only that subset of the table to find exact matches. If you don't have indexes on any column in the WHERE clause, the SQL server has to walk through the whole table and check every row to see if it matches, which may be a slow operation on big tables.
The index can also be a UNIQUE index, which means that you cannot have duplicate values in that column, or a PRIMARY KEY which in some storage engines defines where in the database file the value is stored.
In MySQL you can use EXPLAIN in front of your SELECT statement to see if your query will make use of any index. This is a good start for troubleshooting performance problems. Read more here:
http://dev.mysql.com/doc/refman/5.0/en/explain.html
A clustered index is like the contents of a phone book. You can open the book at 'Hilditch, David' and find all the information for all of the 'Hilditch's right next to each other. Here the keys for the clustered index are (lastname, firstname).
This makes clustered indexes great for retrieving lots of data based on range based queries since all the data is located next to each other.
Since the clustered index is actually related to how the data is stored, there is only one of them possible per table (although you can cheat to simulate multiple clustered indexes).
A non-clustered index is different in that you can have many of them and they then point at the data in the clustered index. You could have e.g. a non-clustered index at the back of a phone book which is keyed on (town, address)
Imagine if you had to search through the phone book for all the people who live in 'London' - with only the clustered index you would have to search every single item in the phone book since the key on the clustered index is on (lastname, firstname) and as a result the people living in London are scattered randomly throughout the index.
If you have a non-clustered index on (town) then these queries can be performed much more quickly.
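The phone book analogy can be sketched in a few lines of Python. This is only a toy model under stated assumptions: a sorted list stands in for the clustered index, a dictionary stands in for the non-clustered index, and every entry except 'Hilditch, David' is made up for illustration.

```python
from bisect import bisect_left, bisect_right

# "Clustered index": the rows themselves, physically ordered by (lastname, firstname).
phone_book = sorted([
    ("Hilditch", "David", "London"),
    ("Hilditch", "Mary", "Leeds"),
    ("Adams", "Zoe", "London"),
    ("Smith", "John", "York"),
])
lastnames = [row[0] for row in phone_book]  # sorted, because phone_book is sorted


def find_by_lastname(lastname):
    """Range lookup on the clustering key: matching rows sit next to each other."""
    lo = bisect_left(lastnames, lastname)
    hi = bisect_right(lastnames, lastname)
    return phone_book[lo:hi]


# "Non-clustered index" on town: maps each town to the clustering keys of its rows.
town_index = {}
for lastname, firstname, town in phone_book:
    town_index.setdefault(town, []).append((lastname, firstname))


def find_by_town(town):
    """Lookup through the secondary index instead of reading the whole book."""
    return town_index.get(town, [])


print(find_by_lastname("Hilditch"))
print(find_by_town("London"))
```

Without `town_index`, finding everyone in London would mean checking every entry in `phone_book`, because the London rows are scattered across the name ordering.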
An index is used to speed up the performance of queries. It does this by reducing the number of database data pages that have to be visited/scanned.
In SQL Server, a clustered index determines the physical order of data in a table. There can be only one clustered index per table (the clustered index IS the table). All other indexes on a table are termed non-clustered.
SQL Server Index Basics
SQL Server Indexes: The Basics
SQL Server Indexes
Index Basics
Index (wiki)
Indexes are all about finding data quickly.
Indexes in a database are analogous to indexes that you find in a book. If a book has an index, and I ask you to find a chapter in that book, you can quickly find that with the help of the index. On the other hand, if the book does not have an index, you will have to spend more time looking for the chapter by looking at every page from the start to the end of the book.
In a similar fashion, indexes in a database can help queries find data quickly. If you are new to indexes, the following videos, can be very useful. In fact, I have learned a lot from them.
Index Basics
Clustered and Non-Clustered Indexes
Unique and Non-Unique Indexes
Advantages and disadvantages of indexes
Well, in general an index is a B-tree. There are two types of indexes: clustered and nonclustered.
A clustered index creates a physical order of rows (there can be only one, and in most cases it is also the primary key: by default, if you create a primary key on a table, you also create a clustered index on that table).
A nonclustered index is also a B-tree, but it doesn't determine the physical order of rows. The leaf nodes of a nonclustered index contain the clustered index key (if a clustered index exists) or a row identifier.
Indexes are used to increase the speed of search, because the lookup complexity is O(log N). Indexes are a very large and interesting topic. I can say that creating indexes on a large database is sometimes a kind of art.
INDEXES - to find data easily
UNIQUE INDEX - duplicate values are not allowed
Syntax for INDEX
CREATE INDEX INDEX_NAME ON TABLE_NAME(COLUMN);
Syntax for UNIQUE INDEX
CREATE UNIQUE INDEX INDEX_NAME ON TABLE_NAME(COLUMN);
First we need to understand how a normal query (without indexing) runs. It basically traverses the rows one by one and returns when it finds the data.
So suppose the query is to find 50: as a linear search, it will have to read 49 records before reaching it.
When we apply indexing, the query quickly finds the data without reading every row, eliminating half of the data in each step like a binary search. MySQL indexes are stored as B-trees where all the data is in the leaf nodes.
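The difference between the two access patterns can be sketched by counting how many rows each strategy examines. This is a simplified model that assumes 100 sorted integer rows and a plain binary search rather than a real B-tree; the function names are hypothetical.

```python
def linear_search_steps(rows, target):
    """Count rows examined by a scan from the start, including the match."""
    for steps, value in enumerate(rows, start=1):
        if value == target:
            return steps
    return len(rows)  # not found: every row was examined


def binary_search_steps(sorted_rows, target):
    """Count rows examined when half of the remaining range is discarded each step."""
    lo, hi, steps = 0, len(sorted_rows) - 1, 0
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if sorted_rows[mid] == target:
            return steps
        if sorted_rows[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return steps  # not found


rows = list(range(1, 101))
print(linear_search_steps(rows, 50))
print(max(binary_search_steps(rows, t) for t in rows))
```

The scan examines 50 rows to find the value 50 (the 49 before it plus the match itself), whereas the halving search never needs more than 7 steps for any value among these 100 rows.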
INDEX is a performance optimization technique that speeds up the data retrieval process. It is a persistent data structure that is associated with a Table (or View) in order to increase performance during retrieving the data from that table (or View).
Index-based search applies mainly when your queries include a WHERE filter. A query without a WHERE filter selects and processes the whole table. Searching the whole table without an index is called a table scan.
You will find detailed information on SQL indexes, presented in a clear and reliable way, at these links:
For concept-wise understanding:
http://dotnetauthorities.blogspot.in/2013/12/Microsoft-SQL-Server-Training-Online-Learning-Classes-INDEX-Overview-and-Optimizations.html
For implementation-wise understanding:
http://dotnetauthorities.blogspot.in/2013/12/Microsoft-SQL-Server-Training-Online-Learning-Classes-INDEX-Creation-Deletetion-Optimizations.html
If you're using SQL Server, one of the best resources is its own Books Online that comes with the install! It's the 1st place I would refer to for ANY SQL Server related topics.
If it's practical "how should I do this?" kind of questions, then StackOverflow would be a better place to ask.
Also, I haven't been back for a while but sqlservercentral.com used to be one of the top SQL Server related sites out there.
An index is used for several different reasons. The main reason is to speed up querying so that you can get rows or sort rows faster. Another reason is to define a primary key or unique index, which will guarantee that no two rows have the same values in the indexed columns.
So, how does indexing actually work?
Well, first off, the database table does not reorder itself when we put an index on a column to optimize query performance.
An index is a data structure (most commonly a B-tree, which is a balanced tree, not a binary tree) that stores the values of a specific column in a table.
The major advantage of a B-tree is that the data in it is kept sorted. In addition, the B-tree data structure is time efficient: operations such as searching, insertion and deletion can be done in logarithmic time.
So the index would look like this: each value of the indexed column is mapped to a database-internal identifier (pointer) that points to the exact location of the row. Now, if we run the same query:
[Figure: visual representation of the query execution]
So, indexing cuts the time complexity down from O(n) to O(log n).
A detailed info - https://pankajtanwar.in/blog/what-is-the-sorting-algorithm-behind-order-by-query-in-mysql
INDEX is not part of SQL. INDEX creates a Balanced Tree on physical level to accelerate CRUD.
SQL is a language which describes the Conceptual Level Schema and External Level Schema. SQL doesn't describe the Physical Level Schema.
The statement which creates an INDEX is defined by DBMS, not by SQL standard.
An index is an on-disk structure associated with a table or view that speeds retrieval of rows from the table or view. An index contains keys built from one or more columns in the table or view. These keys are stored in a structure (B-tree) that enables SQL Server to find the row or rows associated with the key values quickly and efficiently.
Indexes are automatically created when PRIMARY KEY and UNIQUE constraints are defined on table columns. For example, when you create a table with a UNIQUE constraint, Database Engine automatically creates a nonclustered index.
If you configure a PRIMARY KEY, Database Engine automatically creates a clustered index, unless a clustered index already exists. When you try to enforce a PRIMARY KEY constraint on an existing table and a clustered index already exists on that table, SQL Server enforces the primary key using a nonclustered index.
Please refer to this for more information about indexes (clustered and non clustered):
https://learn.microsoft.com/en-us/sql/relational-databases/indexes/clustered-and-nonclustered-indexes-described?view=sql-server-ver15
Hope this helps!