SQL ORDER BY - Varchar vs Integer

Speaking with a friend of mine about DB structure, he told me that for telephone numbers he tends to create integer attributes and cast them in the code that extracts the data from the DB (adding the leading zero back onto the number). That method aside, which is questionable in itself, I suggested he use a varchar field instead.
He says that using a varchar is less efficient because:
It takes more memory to store the information (and this is true)
It takes more "time" to order by the field
I'm pretty confused, as I would guess that an RDBMS, with all its optimizations, will do this sort in O(n log n) or thereabouts regardless of the data type. Mining the internet for information, unfortunately, turned out to be useless.
Could someone help me understand whether what I'm saying here makes sense or not?
Thank you.

An RDBMS uses indexes to optimize ordering (sorting). Indexes are stored as B-trees on disk, and they grow as the indexed field(s) grow, so the number of disk I/O operations increases with the size of the data. On the other hand, the cost hidden inside O(n log n) differs between data types: the semantics of comparing strings and integers are different, and comparing strings is more complicated than comparing integers.
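As a rough, hedged illustration of the trade-off (table and column names here are hypothetical), the sketch below stores the same phone number both ways: the varchar column keeps the leading zero as entered, both columns can be indexed, and ORDER BY on the varchar works fine but uses character-by-character, collation-aware comparisons over a larger index.

-- Hypothetical table holding the same phone number both ways.
CREATE TABLE contacts (
    id        INT PRIMARY KEY,
    phone_int BIGINT,       -- leading zero is lost: 0471234567 becomes 471234567
    phone_txt VARCHAR(20)   -- stored exactly as entered: '0471234567'
);

-- Both columns can be indexed; the index on the varchar is larger,
-- and its comparisons are character-by-character and collation-aware.
CREATE INDEX idx_contacts_phone_txt ON contacts (phone_txt);

-- Sorting works on either column; the semantics just differ:
-- numeric order for phone_int, lexicographic order for phone_txt.
SELECT phone_txt FROM contacts ORDER BY phone_txt;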

Related

How to generate a numeric identifier for entries based on a string

I'm working in Redshift SQL syntax, and want to know a way to convert a string id for each entry in a table to a numeric id (since numeric joins between tables are supposedly much quicker and more efficient than string joins).
Currently the ids look like this - a bunch of strings with both numbers and letters:
01r00001ABCDeAAF
01r00001IJKLmAAN
...
01r00001OPQRtAAN
What I would like is to turn this into a purely numeric identifier, using the string id as an input and ensuring that each output is unique and corresponds only to a single input with no collisions (which can be replicated across tables so that accurate joins are possible).
I've tried using some hash functions within SQL like CHECKSUM() and BINARY_CHECKSUM() over the columns, but I'm a little unclear which would be the most applicable here - I understand some are case-sensitive and others aren't, while some generate collisions and others don't.
First, your reference for strings versus integers is based on an entirely different database. I would not generalize from SQL Server performance to other databases, particularly a massively parallel columnar database. There is also a lot of information that is taken out of context and generalized to wrong situations.
Second, you can test on tables in Amazon Redshift. Generating the data and doing the tests should be faster than modifying existing data. You will probably find no need to change anything.
You need to understand what is happening "under the hood" before making a change like this, particularly if you think it is for performance reasons.
Strings can be troublesome for a variety of reasons. First, they can have different collations or character sets -- information that is hidden. Such differences would preclude the use of indexes -- a major hit in a database such as SQL Server. Not using indexes is generally not an issue in Redshift.
Strings can also have variable lengths. This makes indexes slightly less efficient. They also require a wee bit more overhead to compare than numbers, because those collations and character sets need to be taken into account. They also need to be compared character by character, whereas most hardware has built-in comparisons for numbers. The extra cycles here are usually minimal compared to the cost of moving data.
When you do a join in Amazon Redshift, the first thing it is going to do is collocate the data, probably by hashing the values and sending the data to the same nodes in the parallel environment. Moving the data is expensive. Hashing the values, much less so.
In Redshift, you should be more concerned about how your data is distributed. Although I haven't tested it, adding a new column that is a number might make the query more expensive, because in a columnar database, the number of columns referenced has an impact on performance.
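If distribution really is the main concern, a minimal sketch of the kind of DDL involved might look like the following (table and column names are made up, and this is only an illustration of the idea, not a recommendation): distributing and sorting both tables on the existing string id lets Redshift collocate the join without introducing a numeric surrogate at all.

-- Hypothetical tables distributed on the existing string id.
CREATE TABLE accounts (
    account_id VARCHAR(18) NOT NULL,
    name       VARCHAR(255)
)
DISTSTYLE KEY
DISTKEY (account_id)
SORTKEY (account_id);

CREATE TABLE events (
    event_id   BIGINT,
    account_id VARCHAR(18) NOT NULL
)
DISTSTYLE KEY
DISTKEY (account_id)
SORTKEY (account_id);

-- The join key matches the distribution key on both sides,
-- so rows for the same account already live on the same slice.
SELECT a.account_id, COUNT(*)
FROM accounts a
JOIN events e ON e.account_id = a.account_id
GROUP BY a.account_id;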

Why can't database engines use an auto column datatype?

When defining a table in a database, we set column types such as int, varchar, etc.
Why can't they be set to auto?
The database would recognize the type from the input and set it itself, much like PHP handles variables. Also, YouTube wouldn't have to crash their stats counter.
In some instances, you can (e.g. SQL Server's sql_variant or Oracle's ANYDATA).
Whether that's a good idea is another matter...
First of all, not having explicit schema (in the database), doesn't mean you can avoid implicit schema (implied by what your client application expects). For example, just because you have a string in the database, doesn't mean your application will be able to work with it, if it was originally implemented to expect a number. This is really an argument of dynamic vs. static typing, and static typing is generally acknowledged as superior for non-trivial programs.
Furthermore, these "variant" types often have significant overhead and are problematic from the indexing and sargability perspective. For example, is 2 smaller than 10? Yes if these are numbers, no if they are strings!
So if you know what you expect from the database (which you do in 99% cases), tell that to the database, so it can type-check and optimize it for you!
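To make the indexing/sargability point a little more concrete, here is a small sketch (hypothetical table and index names): once the column's type is declared, the optimizer can compare and seek on it directly, whereas having to cast at query time to "fix" the type generally rules out an index seek.

-- Hypothetical table with an explicitly typed, indexed column.
CREATE TABLE orders (
    order_id INT PRIMARY KEY,
    quantity INT
);
CREATE INDEX idx_orders_quantity ON orders (quantity);

-- Sargable: the predicate compares the column as its declared type,
-- so the index on quantity can be used for a seek.
SELECT order_id FROM orders WHERE quantity < 10;

-- Had quantity been stored as a string (or a variant type),
-- a cast would be needed and the index generally could not be used:
-- SELECT order_id FROM orders WHERE CAST(quantity_txt AS INT) < 10;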

Removing repeated string values

Given a non-key varchar column where the string values may be repeated in many other rows, is a separate table mapping unique strings from the column to an integer a beneficial practice? It would clearly eliminate storage space, but is the performance lost from joining the first table to this mapping table worth it?
Generally speaking, integer comparisons are going to be faster because at the lowest level the machine compares them in a single operation, as opposed to comparing each character of a string.
However, whether conversion is a good idea or not is a difficult question without knowing how often the comparison takes place.
Personally speaking, if the comparison was likely to take place often (such as a key lookup in a join), then I'd make them integers.
The same thing holds for indexing, and also, because the indexes are smaller (space efficiency), you remove some backing-store latency -- again, that's the theory -- but in reality there may be many, many other factors to consider.
Commonly referred to as a lookup table, it's definitely worth adding if the values repeat frequently and the strings are substantial enough; e.g. a 2-character state code is not worth the trouble.
Integer comparisons are faster than string comparisons, but this is typically more about space savings than performance, because you've already got the string in-row, so separating the repeating values into a lookup table adds an extra JOIN. It's a step towards normalization, but there is such a thing as over-normalization, in my opinion; whether or not you should depends on what your data is like and how it's used.
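For reference, a minimal sketch of the lookup-table approach being weighed here (all names are hypothetical): the repeated strings move into a small table keyed by an integer, the original table stores only that integer, and reading the string back now costs a join.

-- Hypothetical lookup table for the repeated string values.
CREATE TABLE city_lookup (
    city_id   INT PRIMARY KEY,
    city_name VARCHAR(100) NOT NULL UNIQUE
);

-- The main table stores the small integer key instead of the string.
CREATE TABLE customers (
    customer_id INT PRIMARY KEY,
    city_id     INT NOT NULL REFERENCES city_lookup (city_id)
);

-- Reading the string back now requires a join.
SELECT c.customer_id, l.city_name
FROM customers c
JOIN city_lookup l ON l.city_id = c.city_id;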

For databases, does choosing the correct data type affect performance?

And if so, why? I mean, is a tinyint faster to search than int?
If so, what are the practical differences in performance?
Depending on the data types, yes, it does make a difference.
int vs. tinyint wouldn't make a noticeable difference in speed, but it would make a difference in data size. Assuming tinyint is 1 byte versus int's 4, that's 3 bytes saved on every row. It adds up after a while.
Now, if it was int against varchar, then there would be a bit of a drop, as things like sorts would be much faster on integer values than string values.
If it's a comparable type, and you're not very pressed for space, go with the one that's easier and more robust.
Theoretically, yes, a tinyint is faster than an int. But good database design and proper indexing has a far more substantial effect on performance, so I always use int for design simplicity.
I would venture that there are no practical performance differences in that case. Storage space is the more substantial factor, but even then, it's not much of a difference. The difference is perhaps 3 bytes per row? After 500,000 rows you've used maybe an extra megabyte and a half. Hopefully you aren't pinching megabytes if you are working with that much data.
Choosing the right data type can improve performance. In a lot of cases the practical difference might not be a lot but a bad choice can definitely have an impact. Imagine using a 1000 character char field instead of a varchar field when you are only going to be storing a string of a few characters. It's a bit of an extreme example but you would definitely be a lot better using a varchar. You would probably never notice a difference in performance between an int and a tinyint. Your overall database design (normalized tables, good indices, etc.) will have a far larger impact.
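A small sketch of the choices discussed above (SQL Server/MySQL-style types, illustrative names only): the tinyint-vs-int saving is easy to quantify and rarely decisive, while the oversized fixed-width char is the kind of choice that actually hurts.

-- tinyint (1 byte) vs. int (4 bytes): 3 bytes per row,
-- roughly 30 MB over 10 million rows -- real, but rarely decisive.
CREATE TABLE survey_responses (
    response_id INT PRIMARY KEY,
    rating      TINYINT        -- values 1-5 fit comfortably
);

-- CHAR(1000) pads every value to 1000 characters;
-- VARCHAR(1000) stores only what is actually there.
CREATE TABLE notes (
    note_id INT PRIMARY KEY,
    comment VARCHAR(1000)      -- not CHAR(1000)
);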
Of course, choosing the right data types always helps with faster execution.
Take a look at this article; it will surely help you out:
http://www.peachpit.com/articles/article.aspx?p=30885&seqNum=7
The performance consideration all depends on the scale of your model and usage. While the consideration for storage space in these modern times is almost a non-issue, you might need to think about performance:
Database engines tend to store data in chunks called pages. SQL Server uses 8 KB pages, Oracle blocks default to 8 KB (configurable down to 2 KB), and MySQL's InnoDB uses 16 KB pages by default -- not that big for any of these systems. Whenever you perform an operation on a bit of data (a field in a row), its entire page is fetched from disk and put into memory. When your data is smaller (tinyint vs. int), you can fit more rows and data items into a page, so the likelihood of having to fetch additional pages goes down and overall performance goes up. So yes, using the smallest reasonable representation of your data will have an impact on performance, because it allows the DB engine to be more efficient.
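To put rough numbers on that (assuming SQL Server's 8 KB pages with roughly 8,060 usable bytes each, and ignoring per-row overhead): a 40-byte row fits about 200 rows per page while a 32-byte row fits about 250, so scanning 10 million rows touches roughly 40,000 pages instead of 50,000 -- the same data, about 20% fewer page reads.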
One way it can affect performance is by not requiring you to convert the data to the correct type in order to manipulate it. This happens when someone uses varchar instead of a datetime datatype, for instance, and the values then have to be converted to do date math. It can also affect performance by giving a smaller record (which is why you shouldn't define everything at the maximum size), which affects how pages are stored and retrieved in the database.
Of course, using the correct data type also helps data integrity; you can't store a date that doesn't exist in a datetime field, but you can in a varchar field. If you use float instead of int, then your values aren't restricted to integer values, etc. And speaking of float, it is generally a bad choice if you intend to do math calculations, as you get rounding errors since it is not an exact type.
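As a hedged illustration of that conversion cost (SQL Server-style syntax, hypothetical table and column names): date math on a varchar column forces a cast on every row, while a datetime column can be used directly and also rejects impossible dates at insert time.

-- Hypothetical table storing the same timestamp both ways.
CREATE TABLE shipments (
    shipment_id INT PRIMARY KEY,
    shipped_txt VARCHAR(20),   -- '2013-02-30' would be happily accepted here
    shipped_at  DATETIME       -- '2013-02-30' is rejected at insert time
);

-- Date math on the varchar requires converting every row first.
SELECT DATEDIFF(day, CAST(shipped_txt AS DATETIME), GETDATE()) FROM shipments;

-- The typed column is used directly.
SELECT DATEDIFF(day, shipped_at, GETDATE()) FROM shipments;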

Why do we care about data types?

Specifically, in relational database management systems, why do we need to know the data type of a column (more likely, the attribute of an object) at creation time?
To me, data types feel like an optimization, because one data point can be implemented in any number of ways. Wouldn't it be better to assign semantic roles and constraints to a data point and then have the engine internally examine and optimize which data type best serves the user?
I suspect this is where the heavy lifting is and why it's easier to just ask the user rather than to do the work.
What do you think? Where are we headed? Is this a realistic expectation? Or do I have a misguided assumption?
The type expresses a desired constraint on the values of the column.
The answer is storage space and fixed size rows.
Fixed-size rows are much, MUCH faster to search than variable length rows, because you can seek directly to the correct byte if you know which record number and field you want.
Edit: Having said that, if you use proper indexing in your database tables, the fixed-size rows thing isn't as important as it used to be.
SQLite does not care.
Other RDBMSs use principles that were designed in the early '80s, when it was vital for performance.
Oracle, for instance, does not distinguish between a NULL and an empty string, and keeps its NUMBERs as sets of base-100 (centesimal) digits.
That hardly makes sense today, but these were very clever solutions when Oracle was being developed.
In one of the databases I developed, though, non-indexed values were stored as VARCHAR2s and cast dynamically into appropriate datatypes depending on several conditions.
That was quite a special case, though: it was used for bulk-loading key-value pairs into the database in one call using collections.
Dynamic SQL statements were used for parsing the data and putting it into the appropriate tables based on the key name.
All values were loaded into a temporary VARCHAR2 column as-is and then converted into NUMBERs and dates to be put into their columns.
Explicit data types are huge for efficiency and storage. If they are implicit, they have to be figured out, which costs speed. Indexes would be hard to implement as well.
I would suspect, although I'm not positive, that having explicit types also uses less storage space on average. For numbers especially, there is no comparison between a binary int and a string of digit characters.
Hm... your question is a little confusing.
If I understand it correctly, you're asking why it is that we specify data types for table columns, and why the "engine" doesn't just determine automatically what is needed for the user.
Data types act as a constraint - they secure the data's integrity. An int column will never have letters in it, which is a good thing. The data type isn't decided automatically for you; you specify it when you create the table - almost always using SQL.
You're right: assigning a data type to a column is an implementation detail and has nothing to do with the set theory or calculus behind a database engine. As a theoretical model, a database ought to be "typeless" and able to store whatever we throw at it.
But we have to implement the database on a real computer with real constraints. It's not practical, from a performance standpoint, to have the computer dynamically try to figure out how to best store the data.
For example, let's say you have a table in which you store a few million integers. The computer could -- correctly -- figure out that it should store each datum as an integral value. But if you were to one day suddenly try to store a string in that table, should the database engine stop everything until it converts all the data to a more general string format?
Unfortunately, specifying a data type is a necessary evil.
If you know that some data item is supposed to be a numeric integer, and you deliberately choose NOT to let the DBMS take care of enforcing this, then it becomes YOUR responsibility to ensure all sorts of things: data integrity (ensuring that no value 'A' or 1.5 can be entered in the column), consistency of system behaviour (ensuring that the value '01' is considered equal to the value '1', which is not the behaviour you get from the String type), and so on.
Types take care of all those sorts of things for you.
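A quick sketch of that enforcement (hypothetical table; exact behaviour varies slightly by DBMS): with a declared integer type the DBMS takes care of those checks for you, while a string column happily accepts values that merely look numeric.

-- Hypothetical table with an explicitly typed quantity.
CREATE TABLE stock (
    item_id  INT PRIMARY KEY,
    quantity INT NOT NULL
);

INSERT INTO stock VALUES (1, 5);      -- fine
-- INSERT INTO stock VALUES (2, 'A'); -- rejected: not an integer
-- INSERT INTO stock VALUES (3, 1.5); -- rejected or truncated, depending on the DBMS

-- With a string column, '01' and '1' would be two different values;
-- stored as INT, 01 and 1 are the same value.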
I'm not sure of the history of datatypes in databases, but to me it makes sense to know the datatype of a field.
When would you want to do a sum of some fields which are entirely varchar?
If I know that a field is an integer, it makes perfect sense to do a sum, avg, max, etc.
Not all databases work this way. SQLite was mentioned earlier, but a much older family of databases also does this: multivalued databases.
Consider UniVerse (now an IBM property). It does not do any data validation, nor does it require that you specify what type a field is. Searches are still (relatively) fast, and it takes up less space (due to the way it stores data dynamically).
You can describe what the data may look like using metadata (dictionary items), but that is the limit of how you restrict the data.
See the Wikipedia article on UniVerse.
When you're pushing half a billion rows within 5 months of go-live, every byte counts (in our system).
There is no such anti-pattern as "premature optimisation" in database design.
Disk space is cheap, of course, but you use the data in memory.
You should care about datatypes when it comes to filtering (WHERE clause) or sorting (ORDER BY). For example, "200" is LOWER than "3" if those values are strings, and the opposite when they are integers.
I believe sooner or later you will have to sort or filter your data ("200" > "3"?) or use some aggregate functions in reports (like sum() or avg()). Until then you are good with a text datatype :)
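A tiny sketch of that sorting difference (hypothetical table): the same three values come back in a very different order depending on the declared type.

-- Hypothetical table storing the same values as text and as integers.
CREATE TABLE measurements (
    value_txt VARCHAR(10),
    value_num INT
);
INSERT INTO measurements VALUES ('3', 3), ('20', 20), ('200', 200);

SELECT value_txt FROM measurements ORDER BY value_txt;  -- '20', '200', '3'
SELECT value_num FROM measurements ORDER BY value_num;  -- 3, 20, 200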
A book I've been reading on database theory tells me that the SQL standard defines a concept of a domain. For instance, height and width could be two different domains. Although both might be stored as numeric(10,2), a height and a width column could not be compared without casting. This allows for a "type" constraint that is not related to implementation.
I like this idea in general, though since I've never seen it implemented, I don't know what it would be like to use. I can see that it would reduce the chance of errors from mixing up values whose implementations happen to be the same even though their conceptual domains are quite different. It might also help keep people from comparing cm and inches, for instance.
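For what it's worth, PostgreSQL does implement something along these lines with CREATE DOMAIN; a minimal sketch (domain and table names made up) is below, although PostgreSQL will still let you compare the two columns without a cast, so in practice the value is mostly in the shared CHECK constraint rather than the strict cross-domain typing the book describes.

-- Two domains with the same underlying representation but different roles.
CREATE DOMAIN height_cm AS NUMERIC(10,2) CHECK (VALUE > 0);
CREATE DOMAIN width_cm  AS NUMERIC(10,2) CHECK (VALUE > 0);

CREATE TABLE boxes (
    box_id INT PRIMARY KEY,
    height height_cm,
    width  width_cm
);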
Constraint is perhaps the most important thing mentioned here. Data types exist to ensure the correctness of your data, so you can be sure you can manipulate it correctly. There are two ways we can store a date: in a date type, or as a string like "4th of January 1893". But the string could also have been "4/1 1893", "1/4 1893" or similar. Datatypes constrain that and define a canonical form for a date.
Furthermore, a datatype has the advantage that it can undergo checks. The string "0th of February 1975" is accepted as a string, but should not be accepted as a date. How about "30th of February 1983"? Poor databases, like MySQL, do not make these checks by default (although you can configure MySQL to do it -- and you should!).
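A short MySQL sketch of that check (exact behaviour depends on the server version and configured sql_mode): with strict mode on and ALLOW_INVALID_DATES off, an impossible calendar date is rejected instead of being stored.

-- Ask MySQL to reject invalid dates rather than coerce them.
SET SESSION sql_mode = 'STRICT_ALL_TABLES';

CREATE TABLE birthdays (
    person  VARCHAR(100),
    born_on DATE
);

INSERT INTO birthdays VALUES ('Ada', '1975-02-28');  -- fine
INSERT INTO birthdays VALUES ('Bob', '1983-02-30');  -- rejected: no such date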
Data types ensure the consistency of your data. This is one of the most important concepts, as keeping your data sane will spare your head from insanity.
RDBMSs generally require column types to be defined so they can perform lookups fast. If you want to get the 5th column of every row in a huge dataset, having the columns defined is a huge optimisation.
Instead of scanning each row for some kind of delimiter to find the 5th column (as it would have to if column widths were not fixed), the RDBMS can jump straight to the offset sizeOf(columns 1-4) within the row and read sizeOf(column5) bytes from there. Imagine how much quicker that is on a table of, say, 10,000,000 rows.
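For instance (ignoring row headers and similar bookkeeping), with four leading INT columns of 4 bytes each in a fixed-width row, the 5th column always starts 16 bytes into the row, so fetching it for row n is just a seek to n * rowSize + 16 followed by a read of sizeOf(column5) bytes -- no scanning for delimiters at all.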
Alternatively, if you don't want to specify the types of each column, you have two options that I'm aware of: specify each column as a varchar(255) and decide what to do with it in the calling program, or use a different database system that uses key-value pairs, such as Redis.
A database is ultimately about physical storage, and data types define how that storage is laid out.