SQL performance & MD5 strings

I've got a DB table where we store a lot of MD5 hashes (and yes, I know they aren't 100% unique...), and we run a lot of comparison queries against those strings.
This table can become quite large with over 5M rows.
My question is this: Is it wise to keep the data as hexadecimal strings or should I convert the hex to binary or decimals for better querying?

Binary is likely to be faster, since with text you're using 8 bits (a full character) to encode 4 bits of data. But I doubt you'll really notice much if any difference.
Where I'm at we have a very similar table. It holds dictation texts from doctors for billing purposes in a text column (still on sql server 2000). We're approaching four million records, and we need to be able to check for duplicates, where the doctor dictated the exact same thing twice for validation and compliance purposes. A dictation can run several pages, so we also have a hash column that's populated on insert via a trigger. The column is a char(32) type.

Binary data is a bummer to work with manually or if you have to dump your data to a text file or whatnot.
Just put an index on the hash column and you should be fine.
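As a rough sketch of that setup (hypothetical table and column names, with the hash assumed to be computed in the application or a trigger), a binary(16) column stores the raw 16-byte digest instead of 32 hex characters, and the index turns the duplicate check into a seek:

CREATE TABLE dictations (
  id            int NOT NULL PRIMARY KEY,
  dictation_md5 binary(16) NOT NULL  -- 16 raw bytes instead of a 32-character hex string
);

CREATE INDEX ix_dictations_md5 ON dictations (dictation_md5);

-- Duplicate check is an index seek on a short, fixed-width key;
-- @candidate_md5 is assumed to be supplied by the caller.
SELECT id
FROM dictations
WHERE dictation_md5 = @candidate_md5;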

Related

What's the database performance improvement from storing as numbers rather than text?

Suppose I have text such as "Win", "Lose", "Incomplete", "Forfeit" etc. I can directly store the text in the database. Instead, if I use numbers such as 0 = Win, 1 = Lose etc., would I get a material improvement in database performance? Specifically on queries where the field is part of my WHERE clause.
At the CPU level, comparing two fixed-size integers takes just one instruction, whereas comparing variable-length strings usually involves looping through each character. So for a very large dataset there should be a significant performance gain with using integers.
Moreover, a fixed-size integer will generally take less space and can allow the database engine to perform faster algorithms based on random seeking.
Most database systems however have an enum type which is meant for cases like yours - in the query you can compare the field value against a fixed set of literals while it is internally stored as an integer.
There might be significant performance gains if the column is used in an index.
It could range anywhere from negligible to extremely beneficial depending on the table size, the number of possible values being enumerated and the database engine / configuration.
That said, it almost certainly will never perform worse to use a number to represent an enumerated type.
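For example, here is a minimal MySQL-style sketch (made-up table and column names): the ENUM column is stored internally as a small integer, but queries still compare against the readable literals.

CREATE TABLE matches (
  id      int NOT NULL AUTO_INCREMENT PRIMARY KEY,
  outcome ENUM('Win', 'Lose', 'Incomplete', 'Forfeit') NOT NULL
);

-- 'Win' is stored internally as a 1-byte index, but the query stays readable:
SELECT COUNT(*) FROM matches WHERE outcome = 'Win';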
Don't guess. Measure.
Performance depends on how selective the index is (how many distinct values are in it), whether critical information is available in the natural key, how long the natural key is, and so on. You really need to test with representative data.
When I was designing the database for my employer's operational data store, I built a testbed with tables designed around natural keys and with tables designed around id numbers. Both those schemas have more than 13 million rows of computer-generated sample data. In a few cases, queries on the id number schema outperformed the natural key schema by 50%. (So a complex query that took 20 seconds with id numbers took 30 seconds with natural keys.) But 80% of the test queries had faster SELECT performance against the natural key schema. And sometimes it was staggeringly faster--a difference of 30 to 1.
The reason, of course, is that lots of the queries on the natural key schema need no joins at all--the most commonly needed information is naturally carried in the natural key. (I know that sounds odd, but it happens surprisingly often. How often is probably application-dependent.) But zero joins is often going to be faster than three joins, even if you join on integers.
Clearly if your data structures are shorter, they are faster to compare AND faster to store and retrieve.
How much faster? 1x, 2x, 1,000x? It all depends on the size of the table and so on.
For example: say you have a table with a productId and a varchar text column.
Each row will roughly take 4 bytes for the int and then another 3 to 24 bytes for the text in your example (depending on whether the column is nullable or Unicode).
Compare that to 5 bytes per row for the same data with a byte status column.
This huge space saving means more rows fit in a page, more data fits in the cache, fewer writes happen when you load or store data, and so on.
Also, comparing strings is at best as fast as comparing bytes, and at worst much slower.
There is a second huge issue with storing text where you intended to have an enum. What happens when people start storing Incompete as opposed to Incomplete?
Having a skinnier column means that you can fit more rows per page.
It is a HUGE difference between a varchar(20) and an integer.
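A hedged sketch of the lookup-table alternative (hypothetical names): the tinyint foreign key keeps each row narrow and makes a misspelling like 'Incompete' impossible to store.

CREATE TABLE outcome_types (
  outcome_id   tinyint NOT NULL PRIMARY KEY,
  outcome_name varchar(20) NOT NULL
);

INSERT INTO outcome_types (outcome_id, outcome_name)
VALUES (0, 'Win'), (1, 'Lose'), (2, 'Incomplete'), (3, 'Forfeit');

CREATE TABLE game_results (
  game_id    int NOT NULL PRIMARY KEY,
  outcome_id tinyint NOT NULL,
  FOREIGN KEY (outcome_id) REFERENCES outcome_types (outcome_id)
);
-- Each row carries 1 byte for the outcome instead of up to 20+ characters,
-- and only the four ids defined above are accepted.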

Why use shorter VARCHAR(n) fields?

It is frequently advised to choose database field sizes to be as narrow as possible. I am wondering to what degree this applies to SQL Server 2005 VARCHAR columns: Storing 10-letter English words in a VARCHAR(255) field will not take up more storage than in a VARCHAR(10) field.
Are there other reasons to restrict the size of VARCHAR fields to stick as closely as possible to the size of the data? I'm thinking of
Performance: Is there an advantage to using a smaller n when selecting, filtering and sorting on the data?
Memory, including on the application side (C++)?
Style/validation: How important do you consider restricting column size to force nonsensical data imports to fail (such as 200-character surnames)?
Anything else?
Background: I help data integrators with the design of data flows into a database-backed system. They have to use an API that restricts their choice of data types. For character data, only VARCHAR(n) with n <= 255 is available; CHAR, NCHAR, NVARCHAR and TEXT are not. We're trying to lay down some "good practices" rules, and the question has come up if there is a real detriment to using VARCHAR(255) even for data where real maximum sizes will never exceed 30 bytes or so.
Typical data volumes for one table are 1-10 million records with up to 150 attributes. Query performance (SELECT, with frequently extensive WHERE clauses) and application-side retrieval performance are paramount.
Data Integrity - By far the most important reason. If you create a column called Surname that is 255 characters, you will likely get more than surnames. You'll get first name, last name, middle name. You'll get their favorite pet. You'll get "Alice in the Accounting Department with the Triangle hair". In short, you will make it easy for users to use the column as a notes/surname column. You want the cap to impede the users that try to put something other than a surname into that column. If you have a column that calls for a specific length (e.g. a US tax identifier is nine characters) but the column is varchar(255), other developers will wonder what is going on, and you will likely get crap data as well.
Indexing and row limits. In SQL Server you have a limit of 8060 bytes per row IIRC. Lots of fat non-varchar(max) columns with lots of data can quickly exceed that limit. In addition, indexes have a 900-byte cap in width IIRC. So, if you wanted to index on your surname column and some others that contain lots of data, you could exceed this limit.
Reporting and external systems. As a report designer you must assume that if a column is declared with a max length of 255, it could have 255 characters. If the user can do it, they will do it. Thus, to say, "It probably won't have more than 30 characters." is not even remotely the same as "It cannot have more than 30 characters." Never rely on the former. As a report designer, you have to work around the possibilities that users will enter a bunch of data into a column. That either means truncating the values (and if that is the case why have the additional space available?) or using CanGrow to make a lovely mess of a report. Either way, you make it harder on other developers to understand the intent of the column if the column size is so far out of whack with the actual data being stored.
I think that the biggest issue is data validation. If you allow 255 characters for a surname, you WILL get a surname that's 200+ characters in your database.
Another reason is that if you allow the database to hold 255 characters you now have to account for that possibility in every system that touches your database. For example, if you exported to a fixed-width column file all of your columns would have to be 255 characters wide, which could be pretty annoying or even problematic. That's just one example where it could cause a problem.
One good reason is validation.
For example, in Holland a social security number is always 9 characters long; if you don't allow more, a longer value can never occur.
If you did allow more and for some unknown reason a 10-character value showed up, you would need to put in checks (which you otherwise wouldn't) to verify that it is 9 characters long.
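A minimal sketch of that validation idea (made-up names; the CHECK pattern is SQL Server style): the declared length alone rejects anything longer than 9 characters, and a constraint can tighten it to digits only.

CREATE TABLE persons (
  person_id int NOT NULL PRIMARY KEY,
  ssn       char(9) NOT NULL  -- a 10-character value is rejected by the type itself
);

-- Optional extra check: reject any value containing a non-digit.
ALTER TABLE persons
  ADD CONSTRAINT ck_persons_ssn_digits CHECK (ssn NOT LIKE '%[^0-9]%');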
1) Readability & Support
A database developer could look at a field called StateCode with a length of varchar(2) and get a good idea of what kind of data that field holds, without even looking at the contents.
2) Reporting
When your data has no length constraint, you are expecting the developer to enforce that the column data is all similar in length. When reporting on that data, if the developer has failed to keep the column data consistent, the reports on that data will be inconsistent and look funny.
3) SQL Server Data Storage
SQL Server stores data on 8k "pages" and from a performance standpoint it is ideal to be as efficient as possible and store as much data as possible on a page.
If your database is designed to store every string column as varchar(255), "bad" data could slip into one of those fields (for example a state name might slip into a StateCode field that is meant to be 2 characters long), and cause unnecessary & inefficient page and index splits.
The other thing is that a single row of data is limited to 8060 bytes, and SQL Server uses the max length of varchar fields to determine this.
Reference: http://msdn.microsoft.com/en-us/library/ms143432.aspx
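To make the row-size point concrete, a hedged illustration (hypothetical table): columns sized for the real data stay far below the 8060-byte row limit, whereas a table full of varchar(255) columns declares far more than a page can hold.

CREATE TABLE customer_import (
  id         int NOT NULL PRIMARY KEY,
  state_code varchar(2)  NOT NULL,  -- intent is obvious from the declaration
  surname    varchar(30) NOT NULL,  -- roomy enough for real surnames
  city       varchar(40) NOT NULL
  -- ~80 bytes declared per row; forty varchar(255) columns would instead
  -- declare up to 40 * 255 = 10,200 bytes, well past the 8060-byte row limit.
);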

Is there an advantage on setting tinyint fields when I know that the value will not exceed 255?

Should I choose the smallest datatype possible, or, if I am storing the value 1 for example, does the column datatype not matter because the value will occupy the same amount of space either way?
I'm also asking because I will always have to convert it and play around with it in the application.
UPDATE
I think that varchar(1) and varchar(50) take the same amount of space if the value is "a". I thought it was the same with int and tinyint, but according to the answers I understand it's not, is it?
Always choose the smallest data type possible. SQL can't guess what you want the maximum value to be, but it can optimize storage and performance once you tell it the data type.
To answer your update:
varchar does take up only as much space as you use, so you're right when you say that the character "a" will take up 1 byte (in a Latin encoding) no matter how large a varchar field you choose. That is not the case with any other type of field in SQL.
However, you will likely be sacrificing efficiency for space if you make everything a varchar field. If everything is a fixed-size field then SQL can do a simple constant-time multiplication to find your value (like an array). If you have varchar fields in there, then the only way to find out where your data is stored is to go through all the previous fields (like a linked list).
If you're beginning SQL then I advise just to stay away from varchar fields unless you expect to have fields that sometimes have very small amounts of text and sometimes very large amounts of text (like blog posts). It takes experience to know when to use variable length fields to the best effect and even I don't know most of the time.
It's a performance consideration particular to the design of your system. In general, the more data you can fit into a page of Sql Server data, the better the performance.
One page in Sql Server is 8k. Using tiny ints instead of ints will enable you to put more data into a single page but you have to consider whether or not it's worth it. If you're going to be serving up thousands of hits a minute, then yes. If this is a hobby project or something that just a few dozen users will ever see, then it doesn't matter.
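A rough sketch of that rows-per-page arithmetic (hypothetical tables, ignoring per-row and per-page overhead):

CREATE TABLE page_demo_int (
  id     int NOT NULL,  -- 4 bytes
  status int NOT NULL   -- 4 bytes  => ~8 bytes of data per row
);

CREATE TABLE page_demo_tiny (
  id     int     NOT NULL,  -- 4 bytes
  status tinyint NOT NULL   -- 1 byte   => ~5 bytes of data per row
);

-- With roughly 8060 usable bytes per page:
--   8060 / 8 ~ 1000 rows per page with int
--   8060 / 5 ~ 1600 rows per page with tinyint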
The advantage is there but might not be significant unless you have lots of rows and perform lots of operations. There will be a performance improvement and smaller storage.
Traditionally, every bit saved in the row size meant a little bit of speed improvement: narrower rows mean more rows per page, which means less memory consumed and fewer IO requests, resulting in better speed. However, with SQL Server 2008 page compression things start to get fuzzy. The compression algorithm may store 4-byte ints with values under 255 in even less than a byte.
Row compression will store a 4-byte int in a single byte for values under 127 (int is signed), in 2 bytes for values under 32768, and so on and so forth.
However, given that the nice compression features are only available on Enterprise Edition servers, it makes sense to keep the habit of using the smallest possible data type.
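For completeness, a hedged example of turning on row compression where it is available (SQL Server 2008+ Enterprise Edition; the table name is made up):

-- Rebuild the table with row compression so small int values are stored
-- in only as many bytes as they actually need.
ALTER TABLE order_lines
  REBUILD WITH (DATA_COMPRESSION = ROW);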

Storing time-temperature data in DB

I'm storing time-temperature data in a database, which is really just CSV data. The first column is time in seconds, starting at zero, with the following (one or more) column(s) being temperature:
0,197.5,202.4
1,196.0,201.5
2,194.0,206.5
3,192.0,208.1 ....etc
Each plot represents about 2000 seconds. Currently I'm compressing the data before storing it in an output_profile longtext field.
CREATE TABLE `outputprofiles` (
  `id` int(11) NOT NULL auto_increment,
  `output_profile` longtext NOT NULL,
  PRIMARY KEY (`id`)
);
This helps quite a bit...I can compress a plot which is 10K of plain text down to about 2.5K. There is no searching or indexing needed on this data since it's just referenced in another table.
My question: Is there any other way to store this data I'm not thinking about which is more efficient in terms of storage space?
Is there any reason to think that storage space will be a limiting constraint on your application? I'd try to be pretty sure that's the case before putting a higher priority on that, compared to ease of access and usage; for which purpose it sounds like what you have is satisfactory.
I don't quite understand what you mean by "compressing the plot". Does that mean you are compressing 2000 measurements at once, or compressing each line?
Anyway, space is cheap. I would do it the traditional way, i.e. two columns, one row for each measurement.
If for some reason this doesn't work and you want to save 2000 measurements as one record, then you can do quite a bit better:
1. Create a CSV file with your measurements.
2. Zip it (gzip -9 gives you the maximal compression).
3. Save it as a blob (or longblob depending on the DB you are using), NOT as a longtext.
Then just save it in the DB.
This will give you maximal compression.
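A hedged sketch of that approach (made-up data; MySQL's built-in COMPRESS()/UNCOMPRESS() use zlib, or you can gzip the CSV in the application and insert the raw bytes instead):

CREATE TABLE `outputprofiles` (
  `id` int(11) NOT NULL auto_increment,
  `output_profile` longblob NOT NULL,  -- compressed bytes, not text
  PRIMARY KEY (`id`)
);

INSERT INTO `outputprofiles` (`output_profile`)
VALUES (COMPRESS('0,197.5,202.4\n1,196.0,201.5\n2,194.0,206.5'));

SELECT UNCOMPRESS(`output_profile`) AS csv FROM `outputprofiles` WHERE `id` = 1;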
PostgreSQL has a big storage space overhead since every tuple (a representation of a row in a table) is 28 bytes excluding the data (PostgreSQL 8.3). There are 2-, 4- and 8-byte integers, and a timestamp is 8 bytes. Floats are 8 bytes I think. So, storing 1,000,000,000 rows in PostgreSQL will require several GiB more storage than MySQL (depending on which storage engine you use in MySQL). But PostgreSQL is also great at handling huge data compared to MySQL. Try to run some DDL queries on a huge MySQL table and you'll see what I mean. But this simple data you are storing should probably be easy to partition heavily, so maybe a plain MySQL setup can handle the job nicely. But, as I always say, if you're not really really sure you need a specific MySQL feature you should go for PostgreSQL.
I'm limiting this post to only MySQL and PostgreSQL since this question is tagged with only those two databases.
Edit: Sorry, I didn't see that you actually store the CSV in the DB.

How much more inefficient are text (blobs) than varchar/nvarchar's?

We're doing a lot of large, but straightforward forms for a fairly big project (about 600 users using it throughout the day - that's big for me at least ;-) ).
The forms have a lot of question/answer type sections, so it's natural for some people to type a sentence, while others type a novel. How beneficial would it really be to put a character limit on some of these fields?
(Please include references or citations, if necessary/possible - Thanks!)
If you have no limitations on the data size, then why worry. This doesn't sound like a mission-critical project, even with 600 users and several thousand records. Use CLOB/BLOB and be done with it. I have doubts as to whether you would see any major gains in limiting sizes and risking data loss. That said, you should lay out such boundaries before implementation.
Usually varchar is best for storing values that you wish to use logically and perform "whole value" comparisons against. Text is for unstructured data. If your project is a survey result with unstructured text, use CLOB/BLOB
Semi-Reference: I work with hundreds of thousands of call center records sometimes where we use a CLOB to store the dialog between employees and customers.
I say, focus on the needs of the users and only worry about database performance issues when/if those issues arise. Ask yourself "will my users benefit if I limit the amount of data they can enter".
I keep a great gapingvoid cartoon on my wall that says "it's not what the software does. it's what the user does".
You don't mention which SQL server you are using.
If you are using MySQL, there are definite advantages in speed to using fixed-length fields to keep the table in static format; however, if you have any variable-width fields the table will switch to dynamic format and you lose the benefit of specifying the length of the field.
http://dev.mysql.com/doc/refman/5.0/en/static-format.html
http://dev.mysql.com/doc/refman/5.0/en/dynamic-format.html
Microsoft SQL Server has similar performance gains when you use fixed-length columns. With fixed-length columns the server knows exactly what the offset and length of the data in the row is. With variable-length columns the server knows the offset but has to store the actual length of the data as a preceding 2-byte counter. This has a couple of implications that are covered in this interesting article, which discusses performance as a function of disk space and the advantages of variable-length columns.
If you are using SQL Server 2005 or newer you can take advantage of varchar(max). This column type has the same 2GB storage capacity of BLOBs but the data is stored in 8K chunks with the table data pages instead of in a separate store. So you get the large size advantage, only use 8K in your pages at a time, quick access for the DB engine, and the same query semantics that work with other column types work with varchar(max).
In the end specifying a max length on a variable column mainly lets you constrain the growth size of your database. Once you use variable length columns you lose the advantage of fixed size rows and varchar(max) will perform the same as varchar(10) when holding the same amount of data.
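For illustration, a small SQL Server 2005+ sketch (made-up names): varchar(max) gives BLOB-scale capacity while still supporting ordinary varchar query semantics.

CREATE TABLE form_answers (
  answer_id   int NOT NULL PRIMARY KEY,
  answer_text varchar(max) NOT NULL  -- up to 2 GB, kept in-row while it fits
);

-- Normal string predicates and functions still work:
SELECT answer_id
FROM form_answers
WHERE LEN(answer_text) > 1000
  AND answer_text LIKE '%follow-up%';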
blob and text / ntext are stored outside of the row context, and only a reference to the object is stored in the row, resulting in a smaller row size, which will improve performance on clustered indexes.
However, because text / ntext are not stored with the row, data retrieval takes longer, and these fields cannot be used in any comparison statements.
from: http://www.making-the-web.com/2008/03/24/saving-bytes-efficient-data-storage-mysql-part-1/
There are a few variations of the TEXT and BLOB types which affect size; they are:
Type                     Maximum Length     Storage
TINYBLOB, TINYTEXT       255                Length + 1 bytes
BLOB, TEXT               65,535             Length + 2 bytes
MEDIUMBLOB, MEDIUMTEXT   16,777,215         Length + 3 bytes
LONGBLOB, LONGTEXT       4,294,967,295      Length + 4 bytes
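As a quick illustration of that table (hypothetical column names), picking the smallest type that covers the expected size keeps the per-value overhead to one length byte where possible:

CREATE TABLE notes (
  note_id    int NOT NULL AUTO_INCREMENT PRIMARY KEY,
  short_note TINYTEXT,   -- up to 255 bytes,        stored as length + 1 byte
  body       TEXT,       -- up to 65,535 bytes,     stored as length + 2 bytes
  archive    MEDIUMTEXT  -- up to 16,777,215 bytes, stored as length + 3 bytes
);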