Search performance of non-clustered index on binary and integer columns - sql

I store fixed-size binary hashes of at most 64 bits in a Microsoft SQL Server 2012 database. The hash may also be 48 or 32 bits. Each hash has an identifier Id. The table structure looks like this:
Id int NOT NULL PRIMARY KEY,
Hash binary(8) NOT NULL
I created a non-clustered index on the Hash column for performance, to get fast hash lookups. I also tried creating integer columns instead of binary(n), depending on the byte count n. For example, I changed a column's type from binary(4) to int.
Are there differences between indexes on the column types binary(8) and bigint, or between binary(4) and int, and so on?
Is it reasonable to store hashes as integers to improve search performance?

Under the covers, the index is limited to a certain byte length; the smaller, the better for I/O. It's easy enough to convert between the data types using convert(varbinary(25),Hash) syntax once you have the value of interest. You just don't want to invoke a ton of converts while you're looking up the records.
If there is a difference, it might be due to the collations or statistics being used. A collation just says, given two values, whether one is greater, less, or equal. Statistics enable the query to skip past lots of values because the engine "knows" the data distribution.
When you have large strings and you attempt to do LIKE '%value' lookups, there's not much benefit from the index. Hashes, on the other hand, should be random, which means the focus is on the number of bytes that must be compared to make a query decision. The fewer, the better.
The unhelpful but accurate CYA that every database engineer will give you: it depends; you should test it.
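As a minimal test sketch of that advice (the table and column names come from the question; the index names and the sample hash value are illustrative), the two layouts can be compared side by side:

```sql
-- Two candidate layouts for the same 64-bit hash.
CREATE TABLE HashesBinary (
    Id   int       NOT NULL PRIMARY KEY,
    Hash binary(8) NOT NULL
);
CREATE INDEX IX_HashesBinary_Hash ON HashesBinary (Hash);

CREATE TABLE HashesBigint (
    Id   int    NOT NULL PRIMARY KEY,
    Hash bigint NOT NULL
);
CREATE INDEX IX_HashesBigint_Hash ON HashesBigint (Hash);

-- Both lookups are 8-byte index seeks; compare plans and page reads:
SET STATISTICS IO ON;
SELECT Id FROM HashesBinary WHERE Hash = 0x0123456789ABCDEF;
SELECT Id FROM HashesBigint WHERE Hash = CONVERT(bigint, 0x0123456789ABCDEF);
```

Note the conversion happens once, on the search value, rather than per row in the WHERE clause.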

Related

Understand sql joins and data type impact

If a sql server column is a string versus a guid, how would joins be impacted (assuming no indexes). Would it matter?
Also, when you put an index on a string column, does it become as efficient as an integer column with an index?
i.e. when you put an index on either a string or integer column, is the resulting index built the same way and therefore performs equally?
All other things being equal, less data is better. And by data I mean bytes.
For almost all SQL Server applications, the tightest bottleneck is disk I/O, and pulling less data from disk (or cache) makes everything faster.
This varies with your declared string length. Bear in mind that GUIDs are 16 bytes, varchar is 1 byte per character, and nvarchar is 2 bytes per character. (n)varchar columns also have a 2-byte overhead per row to record the string length.
Space/bytes-wise, a string is bigger than a GUID, which is bigger than an int.
The smaller and tighter your field definition, the better, so an int is faster than a GUID, which is faster than a string.
Without indices, the size of the column really doesn't make a huge difference, since SQL Server will have to basically do a table scan anyway, to link up the two values. Whether that's for a 4-byte INT or a 60-byte VARCHAR really doesn't make a big difference - that data is there in the data pages anyway.
But if you start using indices, smaller and fixed-length data (4-byte fixed-length INT) is significantly better than larger fixed-width structures (like 16-byte GUID / UNIQUEIDENTIFIER), and much better than variable-width columns like VARCHAR - but again: only with indices...
A string column will always have higher overhead than an int column. SQL Server indexes are B-trees rather than hash structures, and comparing string keys (especially long ones) will always take longer than comparing a simple 16/32/64-bit integer.
Scanning either index will most likely take a similar amount of time, but the overhead of producing and maintaining the index means the int columns will always win.
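To make the size argument concrete, here is a sketch with three hypothetical key types (all table and column names are illustrative); the per-entry key sizes below follow directly from the type widths discussed above:

```sql
-- Same logical table, three different key types.
CREATE TABLE CustomersInt  (Id int              NOT NULL PRIMARY KEY, Name varchar(100));
CREATE TABLE CustomersGuid (Id uniqueidentifier NOT NULL PRIMARY KEY, Name varchar(100));
CREATE TABLE CustomersStr  (Id varchar(36)      NOT NULL PRIMARY KEY, Name varchar(100));

-- Per index entry: 4 bytes (int) vs 16 bytes (GUID) vs up to 36 + 2 bytes
-- (varchar plus its length overhead) -- so the int index packs several
-- times more keys into each 8 KB page, and joins touch fewer pages.
```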

What's the database performance improvement from storing as numbers rather than text?

Suppose I have text such as "Win", "Lose", "Incomplete", "Forfeit", etc. I can store the text directly in the database. If instead I use numbers such as 0 = Win, 1 = Lose, etc., would I get a material improvement in database performance? Specifically on queries where the field is part of my WHERE clause.
At the CPU level, comparing two fixed-size integers takes just one instruction, whereas comparing variable-length strings usually involves looping through each character. So for a very large dataset there should be a significant performance gain with using integers.
Moreover, a fixed-size integer will generally take less space and can allow the database engine to perform faster algorithms based on random seeking.
Most database systems however have an enum type which is meant for cases like yours - in the query you can compare the field value against a fixed set of literals while it is internally stored as an integer.
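In MySQL, for instance, that looks like the following (the table and column names are illustrative):

```sql
-- MySQL: ENUM is stored internally as a small integer,
-- but queries compare against the string literal.
CREATE TABLE match_results (
    id     int UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    result ENUM('Win', 'Lose', 'Incomplete', 'Forfeit') NOT NULL,
    KEY idx_result (result)
);

-- Readable in the query, integer comparison under the hood:
SELECT COUNT(*) FROM match_results WHERE result = 'Win';
```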
There might be significant performance gains if the column is used in an index.
It could range anywhere from negligible to extremely beneficial depending on the table size, the number of possible values being enumerated and the database engine / configuration.
That said, it almost certainly will never perform worse to use a number to represent an enumerated type.
Don't guess. Measure.
Performance depends on how selective the index is (how many distinct values are in it), whether critical information is available in the natural key, how long the natural key is, and so on. You really need to test with representative data.
When I was designing the database for my employer's operational data store, I built a testbed with tables designed around natural keys and with tables designed around id numbers. Both those schemas have more than 13 million rows of computer-generated sample data. In a few cases, queries on the id number schema outperformed the natural key schema by 50%. (So a complex query that took 20 seconds with id numbers took 30 seconds with natural keys.) But 80% of the test queries had faster SELECT performance against the natural key schema. And sometimes it was staggeringly faster--a difference of 30 to 1.
The reason, of course, is that lots of the queries on the natural key schema need no joins at all--the most commonly needed information is naturally carried in the natural key. (I know that sounds odd, but it happens surprisingly often. How often is probably application-dependent.) But zero joins is often going to be faster than three joins, even if you join on integers.
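A sketch of the zero-join effect, with hypothetical table names, under the assumption that the natural key of an order line includes the order date:

```sql
-- Surrogate-key schema: the date lives only on the header,
-- so this count needs a join.
SELECT COUNT(*)
FROM OrderLines ol
JOIN Orders o ON o.OrderId = ol.OrderId
WHERE o.OrderDate >= '2012-01-01';

-- Natural-key schema: (OrderDate, OrderNumber) is carried on the
-- line table itself, so the same question needs no join at all.
SELECT COUNT(*)
FROM OrderLines
WHERE OrderDate >= '2012-01-01';
```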
Clearly if your data structures are shorter, they are faster to compare AND faster to store and retrieve.
How much faster? 1x, 2x, 1000x? It all depends on the size of the table and so on.
For example: say you have a table with a productId and a varchar text column.
Each row will take roughly 4 bytes for the int and then another 3 to 24 bytes for the text in your example (depending on whether the column is nullable or Unicode).
Compare that to 5 bytes per row for the same data with a byte status column.
This huge space saving means more rows fit in a page, more data fits in the cache, fewer writes happen when you load or store data, and so on.
Also, comparing strings is, in the best case, as fast as comparing bytes, and in the worst case much slower.
There is a second huge issue with storing text where you intended to have an enum: what happens when people start storing Incompete instead of Incomplete?
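One common way to get the compactness of a small integer without losing the readable names (table and column names here are illustrative) is a lookup table with a foreign key:

```sql
-- The spelling lives in exactly one place; the fact table stores 1 byte.
CREATE TABLE GameStatus (
    StatusId tinyint     NOT NULL PRIMARY KEY,
    Name     varchar(20) NOT NULL UNIQUE  -- 'Win', 'Lose', 'Incomplete', 'Forfeit'
);

CREATE TABLE Games (
    GameId   int     NOT NULL PRIMARY KEY,
    StatusId tinyint NOT NULL REFERENCES GameStatus (StatusId)
);

-- 'Incompete' can no longer sneak in: the foreign key rejects any
-- StatusId that does not exist in GameStatus.
```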
Having a skinnier column means that you can fit more rows per page.
There is a HUGE difference between a varchar(20) and an integer.

Is there an advantage on setting tinyint fields when I know that the value will not exceed 255?

Should I choose the smallest datatype possible, or if I am storing the value 1 for example, it doesn't matter what is the col datatype and the value will occupy the same memory size?
Part of the question is also whether it's worth it, since I will always have to convert it and work with it in the application.
UPDATE
I think varchar(1) and varchar(50) take the same memory size if the value is "a". I thought it was the same with int and tinyint, but according to the answers I understand it's not. Is that right?
Always choose the smallest data type possible. SQL can't guess what you want the maximum value to be, but it can optimize storage and performance once you tell it the data type.
To answer your update:
varchar does take up only as much space as you use, so you're right that the character "a" will take up 1 byte (in a Latin encoding) no matter how large a varchar field you choose. That is not the case with fixed-length field types in SQL.
However, you will likely be sacrificing efficiency for space if you make everything a varchar field. If everything is a fixed-size field, then SQL can do a simple constant-time multiplication to find your value (like an array). If you have varchar fields in there, then the only way to find out where your data is stored is to go through all the previous fields (like a linked list).
If you're beginning SQL then I advise just to stay away from varchar fields unless you expect to have fields that sometimes have very small amounts of text and sometimes very large amounts of text (like blog posts). It takes experience to know when to use variable length fields to the best effect and even I don't know most of the time.
It's a performance consideration particular to the design of your system. In general, the more data you can fit into a page of Sql Server data, the better the performance.
One page in Sql Server is 8k. Using tiny ints instead of ints will enable you to put more data into a single page but you have to consider whether or not it's worth it. If you're going to be serving up thousands of hits a minute, then yes. If this is a hobby project or something that just a few dozen users will ever see, then it doesn't matter.
The advantage is there, but it might not be significant unless you have lots of rows and perform lots of operations. There will be a performance improvement and smaller storage.
Traditionally, every bit saved in the row size would mean a little bit of speed improvement: narrower rows mean more rows per page, which means less memory consumed and fewer I/O requests, resulting in better speed. However, with SQL Server 2008 page compression, things start to get fuzzy. The compression algorithm may store 4-byte ints with values under 255 in even less than a byte.
Row compression algorithms will store a 4 byte int on a single byte for values under 127 (int is signed), 2 bytes for values under 32768 and so on and so forth.
However, given that the nice compression features are only available on Enterprise Edition servers, it makes sense to keep the habit of using the smallest possible data type.
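For reference, this is what enabling row compression looks like in T-SQL (the table name dbo.Scores is hypothetical); SQL Server can also estimate the savings before you change anything:

```sql
-- Enable row compression on an existing table (Enterprise Edition).
ALTER TABLE dbo.Scores REBUILD WITH (DATA_COMPRESSION = ROW);

-- Or estimate the savings first, without changing anything:
EXEC sp_estimate_data_compression_savings
     @schema_name      = 'dbo',
     @object_name      = 'Scores',
     @index_id         = NULL,
     @partition_number = NULL,
     @data_compression = 'ROW';
```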

Is there any harm choosing a large value for varchar in MySQL?

I'm about to add a new column to my table with 500,000 existing rows. Is there any harm in choosing a large value for the varchar? How exactly are varchars allocated for existing rows? Does it take up a lot of disk space? How about memory effects during run time?
I'm looking for MySQL-specific behavioral details, not general software design advice.
There's no harm in choosing a large value for a varchar field. Only the actual data will be stored, and MySQL doesn't allocate the full specified length for each record. It stores the actual length of the data along with the field, so it doesn't need to store any padding or allocate unused memory.
Depends on what you're doing. See the relevant documentation page for some of the details:
http://dev.mysql.com/doc/refman/5.0/en/char.html
The penalty in disk space isn't really any different than what you have for e.g. TEXT types, and from a performance perspective it MAY actually be faster.
The primary problem is the maximum row size. Note that the exact implications of this differ between storage engines. Consult the MySQL docs for your storage engine of choice for maximum row size information.
I should also add that there can be performance benefits to minimizing row size, but it really depends on your workload, indexing, and just how big the rows are, whether or not it will be meaningful for you.
MySQL VARCHAR fields store the contents plus 1 or 2 bytes for the length (2 bytes when the declared maximum exceeds 255 bytes). So even empty VARCHAR fields use up space to record their length.
Also, if this is the only VARCHAR field in your table, and your storage engine is MyISAM, it would force dynamic row format which may yield a performance hit (testing will confirm).
http://dev.mysql.com/doc/refman/5.0/en/column-count-limit.html
http://dev.mysql.com/doc/refman/5.0/en/dynamic-format.html
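A quick demonstration of the "only the actual data is stored" point (table and column names are illustrative):

```sql
-- MySQL: the stored size follows the data, not the declared length.
CREATE TABLE t (a VARCHAR(1), b VARCHAR(500));
INSERT INTO t VALUES ('x', 'x');

-- Both columns report the same data length for the same content:
SELECT LENGTH(a), LENGTH(b) FROM t;   -- 1, 1

-- On disk, each value takes its byte length plus a 1- or 2-byte
-- length prefix (2 bytes once the declared maximum exceeds 255 bytes).
```

The declared length still matters for the maximum row size and for how much memory some operations (such as sorts using temporary tables) may reserve, which is why the row-size links above are worth checking.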

SQL performance & MD5 strings

I've got a DB table where we store a lot of MD5 hashes (and yes I know that they aren't 100% unique...) where we have a lot of comparison queries against those strings.
This table can become quite large with over 5M rows.
My question is this: Is it wise to keep the data as hexadecimal strings or should I convert the hex to binary or decimals for better querying?
Binary is likely to be faster, since with text you're using 8 bits (a full character) to encode 4 bits of data. But I doubt you'll really notice much if any difference.
Where I'm at we have a very similar table. It holds dictation texts from doctors for billing purposes in a text column (still on sql server 2000). We're approaching four million records, and we need to be able to check for duplicates, where the doctor dictated the exact same thing twice for validation and compliance purposes. A dictation can run several pages, so we also have a hash column that's populated on insert via a trigger. The column is a char(32) type.
Binary data is a bummer to work with manually or if you have to dump your data to a text file or whatnot.
Just put an index on the hash column and you should be fine.
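A MySQL-flavored sketch of the binary alternative, for comparison (table and column names are illustrative): storing the MD5 as 16 raw bytes instead of 32 hex characters halves the key size, and the hex/binary conversion happens once at the query boundary rather than per row.

```sql
-- Store the MD5 as 16 raw bytes instead of a 32-character hex string.
CREATE TABLE docs (
    id       int        NOT NULL PRIMARY KEY,
    md5_hash binary(16) NOT NULL,
    KEY idx_md5 (md5_hash)
);

-- Convert at the boundary, not inside the WHERE clause per row:
INSERT INTO docs (id, md5_hash)
VALUES (1, UNHEX(MD5('some dictation text')));

SELECT id FROM docs
WHERE md5_hash = UNHEX(MD5('some dictation text'));
```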