List of Best Practice MySQL Data Types

Is there a list of best-practice MySQL data types for common applications? For example, the list would contain the best data type and size for an id, IP address, email, subject, summary, description/content, URL, date (timestamp and human-readable), geo point, media height, media width, media duration, etc.
Thank you!!!

i don't know of any, so let's start one!
numeric ID/auto_increment primary keys: use an unsigned integer. do not use 0 as a value. and keep in mind the maximum values of the various sizes, i.e. don't use int if you don't need 4 billion values when the 16 million offered by mediumint will suffice.
dates: unless you specifically need dates/times that are outside the supported range of mysql's DATE and TIME types, use them! if you instead use unix timestamps, you have to convert them to use the built-in date and time functions. if your app needs unix timestamps, you can always convert the standard date and time data types on the way out using unix_timestamp().
ip addresses: use inet_aton() and inet_ntoa() since they easily compact an ip address into 4 bytes and give you the ability to do range searches that utilize indexes.
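to make those three tips concrete, here is a minimal sketch (table and column names are made up):

CREATE TABLE users (
    -- unsigned auto_increment PK; mediumint tops out around 16 million
    id         MEDIUMINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    -- a native temporal type keeps the built-in date/time functions usable
    created_at DATETIME NOT NULL,
    -- IPv4 address packed into 4 bytes
    ip         INT UNSIGNED NOT NULL,
    KEY idx_ip (ip)
);

INSERT INTO users (created_at, ip) VALUES (NOW(), INET_ATON('192.168.0.1'));

-- a range search that can use the index on ip; unix timestamp on the way out
SELECT id, INET_NTOA(ip), UNIX_TIMESTAMP(created_at)
FROM users
WHERE ip BETWEEN INET_ATON('192.168.0.0') AND INET_ATON('192.168.0.255');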

Integer Display Width: You likely define your integers something like "INT(4)" but have been baffled by the fact that the (4) has no real effect on the stored numbers. In other words, you can store numbers like 999999 just fine. The reason is that for integers, (4) is the display width, and it only has an effect if used with the ZEROFILL modifier. Further, this is for display purposes only, so you could define a column as "INT(4) ZEROFILL" and still store 99999. If you stored 999, the mysql console would output 0999 when you select this column.
In other words, if you don't need the ZEROFILL stuff, you can leave off the display width.
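A quick demonstration (the table name is invented):

CREATE TABLE zf_demo (n INT(4) ZEROFILL);
INSERT INTO zf_demo VALUES (999), (99999);
SELECT n FROM zf_demo;
-- 999 is displayed as 0999 (padded to the display width of 4)
-- 99999 is displayed in full; the display width never limits what you can store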

Money: Use the Decimal data type. Based on real-world production scenarios I recommend (19,8).
EDIT: My original recommendation was (19,4); however, I've recently run into a production issue where the client reported that they absolutely needed decimal with a "scale" of "8"; thus "4" wasn't enough and was causing improper tax calculations. I now recommend (19,8) based on a real-world scenario. I would love to hear stories needing a more granular scale.
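As a sketch, the column definition is simply this (table and column names are illustrative):

CREATE TABLE invoice_lines (
    -- DECIMAL(19,8): 19 digits in total, 8 of them after the decimal point
    amount   DECIMAL(19,8) NOT NULL,
    tax_rate DECIMAL(19,8) NOT NULL
);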


What is the benefit of the one "DATE" datatype over another in Laravel/SQLite?

In my app, I'm just using an SQLite database for development. In the migration I declare a DATE datatype, which Laravel seems to handle without any problem, but in the database itself it is created as a varchar.
According to this nice article (http://www.sqlitetutorial.net/sqlite-date/) SQLite has basically got three options for handling dates:
Using the TEXT storage class to store SQLite date and time values
Using the REAL storage class to store SQLite date and time values
Using the INTEGER storage class to store SQLite date and time values
So as I'm trying to formulate my approach, I'm thinking ahead that I will likely, at some point, need to step up and move to a higher-performance SQL database (MySQL / Postgres / etc.), and may then face datatype translation challenges.
But then also, at the application layer, Laravel itself has some manipulations.
Now, the question I'm asking is this: what is the benefit of one type over another? Is there some kind of reason to choose one over the others? My thinking is that TEXT is nice and human-readable for backend support, but it may require additional coding to manipulate strings.
INTEGERs are probably more efficient, and would translate to a bigger SQL server more easily than text.
Does anyone know of a comparison of the pro's and con's of various choices?
Any advice? Thanks in advance.
The size of an integer is 4 bytes, while each character of text takes 1 byte.
To represent a date and time you need only one UTC number when you use an integer, so it's much better to use 4 bytes of integer than many more bytes of text. I don't see how REAL can be better than INTEGER for the exact same reason. I would say you should use INTEGER.
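To illustrate the INTEGER approach in SQLite (table and column names are made up), you store a Unix timestamp and convert back with the built-in date functions on the way out:

CREATE TABLE events (
    id         INTEGER PRIMARY KEY,
    created_at INTEGER  -- seconds since the Unix epoch
);

INSERT INTO events (created_at) VALUES (strftime('%s', '2019-06-01 12:00:00'));

-- convert back to a human-readable form only for display
SELECT id, datetime(created_at, 'unixepoch') FROM events;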

precision gains where data move from one table to another in sql server

There are three tables in our SQL Server 2008 database:
transact_orders
transact_shipments
transact_child_orders
All three have a common column, carrying_cost. The data type is the same in all three tables: float, with NUMERIC_PRECISION 53 and NUMERIC_PRECISION_RADIX 2.
In table 1, transact_orders, this column has the value 5.1 for three rows. convert(decimal(20,15), carrying_cost) returns 5.100000... here.
In table 2, transact_shipments, three rows fetch carrying_cost from those three rows in transact_orders.
convert(decimal(20,15), carrying_cost) returns 5.100000... here also.
Table 3, transact_child_orders, sums up those three carrying costs from transact_shipments. The value shown there is 15.3 when I run a normal select.
But convert(decimal(20,15), carrying_cost) returns 15.299999999999999 in this table. And that extra-precision value shows up in the UI as well, though the UI only fetches the value and does no conversion. In the Java code, the variable that fetches the value from the db is defined as double.
The code in step 3, summing up the three carrying_costs, is simple:
...sum(isnull(transact_shipments.carrying_costs,0)) sum_carrying_costs,...
Any idea why this change occurs in the third step? Any help will be appreciated. Please let me know if any more information is needed.
Rather than post a bunch of comments, I'll write an answer.
Floats are not suitable for precise values where you can't accept rounding errors - for example, finance.
Floats can scale from very small numbers to very high numbers, but they don't do that without losing a degree of accuracy. You can look the details up online; there is a host of good work out there for you to read.
But, simplistically, it's because they're true binary numbers - some decimal numbers just can't be represented as a binary value with 100% accuracy. (Just like 1/3 can't be represented with 100% accuracy in decimal.)
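Here is a short T-SQL sketch that should reproduce the discrepancy from the question: 5.1 has no exact binary representation, so three of them don't sum to exactly 15.3.

DECLARE @cost float = 5.1;
SELECT CONVERT(decimal(20,15), @cost)                 AS one_cost,  -- 5.100000...
       CONVERT(decimal(20,15), @cost + @cost + @cost) AS summed;    -- 15.299999999999999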
I'm not sure exactly what is causing the issue you're seeing with the DECIMAL conversion; often it's because there is some implicit conversion going on. (You've got a float somewhere, or decimals with different definitions, etc.)
But regardless of the cause, nothing is faster than integer arithmetic. So, store your values as integers: £1.10 could be stored as 110p. Or, if you know you'll get fractions of a penny for some reason, as 1100dp (deci-pennies, tenths of a penny).
You do then need to consider the biggest value you will ever reach, and whether INT or BIGINT is more appropriate.
Also, when working with integers, be careful of division. If you divide £10 between 3 people, where does the last 1p need to go? £3.33 for two people and £3.34 for one person? £0.01 eaten by the bank? But, invariably, it should not get lost to the digital elves.
And, obviously, when presenting the number to a user, you then need to manipulate it back to £ rather than dp; but you need to do that often anyway, to get £10k or £10M, etc.
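A sketch of the idea in T-SQL (the names and the pence unit are just for illustration):

DECLARE @price_pence int = 110;  -- £1.10 stored as 110p
SELECT @price_pence * 3                              AS total_pence,    -- exact: 330
       CAST(@price_pence * 3 AS decimal(19,2)) / 100 AS display_pounds; -- 3.30 for the UI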
Whatever you do, and if you don't want rounding errors due to floating point values, don't use FLOAT.
(There is A LOT written online about how to use floats and, more importantly, how not to. It's a big topic; just don't fall into the trap of "it's so accurate, it's amazing, it can do anything" - I can't count the number of times people have screwed up data using that unfortunately common but naive assumption.)

How best to represent rational numbers in SQL Server?

I'm working with data that is natively supplied as rational numbers. I have a slick generic C# class which beautifully represents this data in C# and allows conversion to many other forms. Unfortunately, when I turn around and want to store this in SQL, I've got a couple solutions in mind but none of them are very satisfying.
Here is an example. I have the raw value 2/3 which my new Rational<int>(2, 3) easily handles in C#. The options I've thought of for storing this in the database are as follows:
Just as a decimal/floating point, i.e. value = 0.66666667 of various precisions and exactness.
Pros: this allows me to query the data, e.g. find values < 1.
Cons: it has a loss of exactness and it is ugly when I go to display this simple value back in the UI.
Store as two exact integer fields, e.g. numerator = 2, denominator = 3 of various precisions and exactness.
Pros: This allows me to precisely represent the original value and display it in its simplest form later.
Cons: I now have two fields to represent this value and querying is now complicated/less efficient as every query must perform the arithmetic, e.g. find numerator / denominator < 1.
Serialize as string data, i.e. "2/3". I would be able to know the max string length and have a varchar that could hold this.
Pros: I'm back to one field but with an exact representation.
Cons: querying is pretty much busted, and I pay a serialization cost.
A combination of #1 & #2.
Pros: easily/efficiently query for ranges of values, and have precise values in the UI.
Cons: three fields (!?!) to hold one piece of data, must keep multiple representations in sync which breaks D.R.Y.
A combination of #1 & #3.
Pros: easily/efficiently query for ranges of values, and have precise values in the UI.
Cons: back down to two fields to hold one piece of data, must keep multiple representations in sync (which breaks D.R.Y.), and must pay extra serialization costs.
Does anyone have another out-of-the-box solution which is better than these? Are there other things I'm not considering? Is there a relatively easy way to do this in SQL that I'm just unaware of?
If you're using SQL Server 2005 or 2008, you have the option to define your own CLR data types:
Beginning with SQL Server 2005, you can use user-defined types (UDTs) to extend the scalar type system of the server, enabling storage of CLR objects in a SQL Server database. UDTs can contain multiple elements and can have behaviors, differentiating them from the traditional alias data types which consist of a single SQL Server system data type.
Because UDTs are accessed by the system as a whole, their use for complex data types may negatively impact performance. Complex data is generally best modeled using traditional rows and tables. UDTs in SQL Server are well suited to the following:
Date, time, currency, and extended numeric types
Geospatial applications
Encoded or encrypted data
If you can live with the limitations, I can't imagine a better way to map data you're already capturing in a custom class.
I would probably go with Option #4, but use a calculated column for the 3rd column to avoid the sync/DRY issue (which also means you actually only store 2 columns, avoiding the "three fields" issue).
In SQL Server, a calculated column is defined like so:
CREATE TABLE dbo.Whatever(
    Numerator INT NOT NULL,
    Denominator INT NOT NULL,
    -- cast before dividing; otherwise INT / INT performs integer division
    Value AS (CAST(Numerator AS float) / Denominator) PERSISTED
)
(note you may have to do some type conversion and verification that Denominator is not zero, etc).
Also, SQL 2005 added a PERSISTED calculated column that would get rid of the calculation at query time.
How much precision do you need?
The language, C# or otherwise, will round 2/3 at some position in its precision. If it's acceptable for whatever you are working on to use decimal values of, say, 10 significant digits, then set the precision accordingly in the db.
If the precision is really a concern, then separate the numerator & denominator. This would ensure you always have access to whatever precision you want, and you can use a computed column to represent the value for quick filtering:
numerator INT,
denominator INT,
-- the cast avoids integer division; the CASE guards against division by zero
result AS CASE WHEN denominator <> 0 THEN CAST(numerator AS float) / denominator ELSE NULL END
I have experimented a little bit with using the geometry data type in SQL Server 2008 to store and manipulate rational numbers. Basically, I assume that the numerator goes in the X slot and the denominator goes in the Y slot of a fictitious geometry point.
This was good for my needs, but it might be useless for yours. That will depend on what your priorities are (performance, code readability, etc.). I personally found that T-SQL for geometry data manipulation is hard to write and read.
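To illustrate the trick (a hypothetical sketch, with the numerator in the X slot and the denominator in the Y slot):

DECLARE @r geometry = geometry::Point(2, 3, 0);  -- represents 2/3
SELECT @r.STX AS numerator,
       @r.STY AS denominator,
       @r.STX / @r.STY AS approx_value;  -- 0.666666...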
How much precision are you looking at? double/float provide decent precision (in my opinion). I'm pretty sure scientific/astronomical data need a lot more precision than that. I do know that libraries like Matlab and Mathematica are good at these. I found that you can use Mathematica with your .NET program. Here is the link
Edit: adding more links and quotes
"When Mathematica operates on rational numbers, it gives an exact result no matter how many digits are required" from here
Another good read, but you would have to implement it I guess

how to store an approximate number? (number is too small to be measured)

I have a table representing standards of alloys. The standard is partly based on the chemical composition of the alloys. The composition is presented in percentages. The percentage is determined by a chemical composition test. Sample data.
But sometimes, the lab cannot measure below a certain percentage. So they indicate that the element is present, but the percentage is less than they can measure.
I was confused about how to accurately store such a number in an SQL database. I thought of storing the number with a negative sign: no element can have a negative composition, of course, so I can interpret that as "less than the specified value". Another option is to add an extra column for each element!! The latter option I really don't like.
Any other ideas? It's a small issue if you think about it, but I think a crowd is always wiser. Somebody might have a neater solution.
Question updated:
Thanks for all the replies.
The test results come from different labs, so there is no common lower bound.
When the percentage of Titanium is less than 0.0004, for example, the number is still important; only the formula will differ slightly in this case.
Hence the value cannot be stored as NULL, and I don't know the lower bound for all values.
Tricky one.
Another possibility I thought of is to store it as a string. Any other ideas?
What you're talking about is a sentinel value. It's a common technique. C strings, after all, use 0 as a sentinel end-of-string value. You can do that. You just need to find a number that makes sense and isn't used for anything else. Many string functions will return -1 to indicate that what you're looking for isn't there.
0 might work because if the element isn't there, there shouldn't even be a record; but you face the problem that it might be mistaken for actually meaning 0. -1 is another option, and it obviously doesn't have that same problem.
Another column to indicate if the amount is measurable or not is also a viable option. The case for this one becomes stronger if you need to store different categories of trace elements (eg <1%, <0.1%, <0.01%, etc). Storing the negative of those numbers seems a bit hacky to me.
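For example, reading a negative sentinel back out might look like this (table and column names are invented; the SQL is dialect-neutral):

SELECT element,
       ABS(percentage) AS amount,
       CASE WHEN percentage < 0 THEN 'trace (less than)' ELSE 'measured' END AS kind
FROM composition_tests;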
You could just store it as NULL, meaning that the value exists but is undefined.
Any arithmetic operation with a NULL yields a NULL.
Division by NULL is safe.
NULLs are ignored by the aggregate functions, so queries like this:
SELECT SUM(metal_percent), COUNT(metal_percent)
FROM alloys
GROUP BY metal
will give you the sum and the count of the actual, defined values, not taking the unfilled values into account.
I would use a threshold value which is at least one significant digit smaller than your smallest expected value. This way you can logically say that any value less than, say, 0.01 can be presented to your application as a "trace" amount. This remains easy to understand and gives you flexibility in deciding where your threshold should lie.
Since the constraints of the values are well defined (there cannot be a negative composition), I would go for the "negative value to indicate less-than" approach. As long as this use of sentinel values is sufficiently documented, it should be reasonably easy to implement and maintain.
An alternative but similar method would be to add 100 to the values, assuming that you can't get more than 100%. So <0.001 becomes 100.001.
I would have a table modeling the certificate, in a one-to-many relation with another table storing the values for the elements. The elements table would contain the value in one column and a flag (less than) as a separate column.
Draft:
create table CERTIFICATES
(
    PK_ID integer,
    NAME  varchar(128)
)

create table ELEMENTS
(
    ELEMENT_ID     varchar(2),  -- chemical symbol, e.g. 'Ti'
    CERTIFICATE_ID integer,     -- references CERTIFICATES.PK_ID
    CONCENTRATION  number,
    MEASURABLE     integer      -- 0 = only a "less than" trace amount
)
Depending on the database engine you're using, the types of the columns may vary.
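Usage might look like this (the values are invented):

INSERT INTO ELEMENTS (ELEMENT_ID, CERTIFICATE_ID, CONCENTRATION, MEASURABLE)
VALUES ('Ti', 1, 0.0004, 0);  -- present, but below the lab's limit: "< 0.0004"

SELECT ELEMENT_ID, CONCENTRATION, MEASURABLE
FROM ELEMENTS
WHERE CERTIFICATE_ID = 1;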
Why not add another column to store whether or not it's a trace amount?
This will allow you to save the amount that the trace is less than, too.
Since there is no common lowest threshold value and NULL is not acceptable, the cleanest solution now is to have a marker column which indicates whether there is a quantifiable amount or a trace amount present. A value of "Trace" would indicate to anybody reading the raw data that only a trace amount was present. A value of "Quantity" would indicate that you should check an amount column to find the actual quantity present.
I would have to warn against storing numerical values as strings. It will inevitably add additional pain, since you now lose the assertions a strong type definition gives you. When your application consumes the values in that column, it has to read the string to determine whether it's a sentinel value, a numeric value or simply some other string it can't interpret. Trying to handle data conversion errors at this point in your application is something I'm sure you don't want to be doing.
Another field seems like the way to go; call it 'MinMeasurablePercent'.

Is there any reason for numeric rather than int in T-SQL?

Why would someone use numeric(12, 0) datatype for a simple integer ID column? If you have a reason why this is better than int or bigint I would like to hear it.
We are not doing any math on this column, it is simply an ID used for foreign key linking.
I am compiling a list of programming errors and performance issues about a product, and I want to be sure they didn't do this for some logical reason. If you follow this link:
http://msdn.microsoft.com/en-us/library/ms187746.aspx
... you can see that the numeric(12, 0) uses 9 bytes of storage and, being limited to 12 digits, gives a total of 2 trillion numbers if you include negatives. WHY would a person use this when they could use a bigint and get 10 million times as many numbers with one byte less storage? Furthermore, since this is being used as a product ID, the 4 billion numbers of a standard int would have been more than enough.
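You can verify the storage sizes yourself with DATALENGTH (a quick sketch):

DECLARE @n numeric(12,0), @b bigint;
SET @n = 123456789012;
SET @b = 123456789012;
SELECT DATALENGTH(@n) AS numeric_bytes,  -- 9
       DATALENGTH(@b) AS bigint_bytes;   -- 8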
So before I grab the torches and pitchforks - tell me, what are they going to say in their defense?
And no, I'm not making a huge deal out of nothing, there are hundreds of issues like this in the software, and it's all causing a huge performance problem and using too much space in the database. And we paid over a million bucks for this crap... so I take it kinda seriously.
Perhaps they're used to working with Oracle?
In Oracle, all numeric types, including ints, are normalized to a single standard representation (NUMBER) across all platforms.
There are many reasons to use numeric - for example, financial data and other stuff which needs to be accurate to a certain number of decimal places. However, for the example you cited above, a simple int would have done.
Perhaps sloppy programmers who didn't know how to design a database?
Before you take things too seriously, what is the data storage requirement for each row or set of rows for this item?
Your observation is correct, but you probably don't want to present it too strongly if you're reducing storage from 5000 bytes to 4090 bytes, for example.
You don't want to blow your credibility by bringing this up and having them point out that any measurable savings are negligible. ("Of course, many of our lesser-experienced staff also make the same mistake.")
Can you fill in these blanks?
with the data type change, we use
____ bytes of disk space instead of ____
____ ms per query instead of ____
____ network bandwidth instead of ____
____ network latency instead of ____
That's the kind of thing which will give you credibility.
How old is this application that you are looking into?
Prior to SQL Server 2000 there was no bigint. Maybe it's just something that has made it from release to release for many years without being changed, or the database schema was copied from an application that was that old?!?
In your example I can't think of any logical reason why you wouldn't use INT. I know there are probably reasons for other uses of numeric, but not in this instance.
According to: http://doc.ddart.net/mssql/sql70/da-db_1.htm
decimal
Fixed precision and scale numeric data from -10^38 + 1 through 10^38 - 1.
numeric
A synonym for decimal.
int
Integer (whole number) data from -2^31 (-2,147,483,648) through 2^31 - 1 (2,147,483,647).
It is impossible to know if there was a reason for them using decimal, though, since we have no code to look at.
In some databases, using a decimal(10,0) creates a packed field which takes up less space. I know there are many tables around my work that use that. They probably had the same kind of thought here, but you have gone to the documentation and proven that to be incorrect. More than likely, I would say it boils down to a case of "that's the way we have always done it, because someone once said it was better".
It is possible they spent a lot of time in MS Access, saw 'Number' often, and just figured: it's a number, why not use numeric?
Based on your findings, it doesn't sound like they were optimization experts - they probably just didn't know. I'm wondering if they used schema generation tools and relied on them too much.
I wonder how an index on a decimal value (even with a scale of 0) used as a primary key compares in efficiency to one on a pure integer value.
Like Mark H. said, other than the indexing factor, this particular scenario likely isn't growing the database THAT much, but if you're looking for ammo, I think you did find some to belittle them with.
In your citation, decimal with a precision of 1-9 uses 5 bytes; your column, at (12,0), falls in the 10-19 precision bracket, which uses 9 bytes - one more than a bigint.
Moreover, the INT datatype goes up to a power of 31:
-2^31 (-2,147,483,648) to 2^31 - 1 (2,147,483,647)
while decimal can be declared much larger, up to 38 digits:
-10^38 + 1 through 10^38 - 1
So the software creator was buying extra range headroom, but paying for it with extra storage rather than getting it for free.
Now, with the basics out of the way: the software creator actually limited the column to just 12 digits, e.g. 123,456,789,012 (just an example of the placeholders, not a maximum number). Had they used INT, the column could never grow past 10 digits; as numeric it can be widened up to the full 38. Perhaps there is a business reason to limit this column and its associated columns to 12 digits.
An INT is an INT, while a DECIMAL can be scaled to the precision you need.
Hope this helps.
PS:
The whole number argument is:
A) Whole numbers are 0..infinity
B) Counting (Natural) numbers are 1..infinity
C) Integers run from negative infinity to positive infinity
D) I would not cite WikiANYTHING for anything. Come on, use a real source! May as well be http://MyPersonalMathCite.com