Removing repeated string values - sql

Given a non-key varchar column where the string values may be repeated in many other rows, is a separate table mapping the unique strings from the column to integers a beneficial practice? It would clearly save storage space, but is the performance lost from joining the first table to this mapping table worth it?

Generally speaking, integer comparisons are going to be faster because at the lowest level the machine compares them in a single operation, as opposed to character by character for strings.
However, whether the conversion is a good idea is a difficult question without knowing how often the comparison takes place.
Personally speaking, if the comparison was likely to take place often (such as a key lookup in a join) then I'd make them integers.
The same thing holds for indexing, and because the indexes are smaller (space efficiency) you also remove some backing-store latency -- again, that's the theory, but in reality there may be many, many other factors to consider.

Commonly referred to as a lookup table, it's definitely worth adding if the values repeat frequently and the strings are substantial enough; e.g. a 2-character state code is not worth the trouble.
Integer comparisons are faster than string comparisons, but this is typically more about space savings than performance: you've already got the string in-row, so separating the repeating values into a lookup table adds an extra JOIN. It's a step towards normalization, but there is such a thing as over-normalization, in my opinion. Whether or not you should depends on what your data is like and how it's used.
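A minimal sketch of the pattern, with made-up table and column names:

    -- Lookup table: each distinct string stored exactly once
    CREATE TABLE status_lookup (
        status_id   int PRIMARY KEY,
        status_name varchar(100) NOT NULL UNIQUE
    );

    -- Main table stores only the small integer key
    CREATE TABLE orders (
        order_id  int PRIMARY KEY,
        status_id int NOT NULL REFERENCES status_lookup (status_id)
    );

    -- Reading the string back costs the extra JOIN mentioned above
    SELECT o.order_id, s.status_name
    FROM orders o
    JOIN status_lookup s ON s.status_id = o.status_id;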

Related

How to generate a numeric identifier for entries based on a string

I'm working in Redshift SQL syntax, and want to know a way to convert a string id for each entry in a table to a numeric id (since numeric joins between tables are supposedly much quicker and more efficient than string joins).
Currently the ids look like this - a bunch of strings with both numbers and letters
01r00001ABCDeAAF
01r00001IJKLmAAN
...
01r00001OPQRtAAN
What I would like is to turn this into a purely numeric identifier, using the string id as an input and ensuring that each output is unique and corresponds only to a single input with no collisions (which can be replicated across tables so that accurate joins are possible).
I've tried using some hash functions within SQL like CHECKSUM() and BINARY_CHECKSUM() over the columns, but I'm a little unclear which would be the most applicable here - I understand some are case-sensitive and others aren't, while some generate collisions and others don't.
First, your reference for strings versus integers is based on an entirely different database. I would not generalize from SQL Server performance to other databases, particularly a massively parallel columnar database. There is also a lot of information that is taken out of context and generalized to wrong situations.
Second, you can test on tables in Amazon Redshift. Generating the data and doing the tests should be faster than modifying existing data. You will probably find no need to change anything.
You need to understand what is happening "under the hood" before making a change like this, particularly if you think it is for performance reasons.
Strings can be troublesome for a variety of reasons. First, they can have different collations or character sets -- information that is hidden. Such differences would preclude the use of indexes -- a major hit in a database such as SQL Server. Not using indexes is generally not an issue in Redshift.
Strings can also have variable lengths. This makes indexes slightly less efficient. They also require a wee bit more overhead to compare than numbers, because those collations and character sets need to be taken into account. They also need to be compared character-by-character, whereas most hardware has built-in comparisons for numbers. The extra cycles here are usually minimal compared to the cost of moving data.
When you do a join in Amazon Redshift, the first thing it is going to do is collocate the data, probably by hashing the values and sending the data to the same nodes in the parallel environment. Moving the data is expensive. Hashing the values, much less so.
In Redshift, you should be more concerned about how your data is distributed. Although I haven't tested it, adding a new column that is a number might make the query more expensive, because in a columnar database, the number of columns referenced has an impact on performance.
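If you do decide to test a numeric surrogate anyway, a mapping table is a collision-free alternative to hashing. A sketch in Redshift SQL, with hypothetical table and column names:

    -- Number each distinct string id exactly once
    CREATE TABLE id_map AS
    SELECT string_id,
           ROW_NUMBER() OVER (ORDER BY string_id) AS numeric_id
    FROM (SELECT DISTINCT string_id FROM big_table) d;

    -- Apply the same mapping to every table that joins on string_id
    CREATE TABLE big_table_numeric AS
    SELECT m.numeric_id, t.*
    FROM big_table t
    JOIN id_map m ON m.string_id = t.string_id;

Unlike CHECKSUM()-style hashes, this guarantees that each string maps to exactly one integer with no collisions, and the same id_map can be reused across tables so the joins stay accurate.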

Replacing a SQL index text column with numeric values would make it faster?

I have an old and very bad database.
I have a child table with a text column for the users. All my users have numeric values, but there is an exception for the admin user: the code for the admin user is 'ADMIN'.
So I created a numeric code for the ADMIN user and I will update all the records with that numeric value, but I won't change the column type to integer.
So I want to know: if I make this change, and all the values of the user column are numeric, will the index on the user column be better, faster and stronger?
Indexing performance aside, it is always better to use the database type that matches the actual type in your model. Since the actual type of the ID is integer, changing database type to int would make it more natural to work with your database.
For example, ordering on ID would behave in a natural way, because it would no longer alphabetize your numbers (i.e. ordering 199 ahead of 2, because 199 comes first lexicographically). Searches using BETWEEN operator would produce correct results for the numbers as well.
Another important improvement is that the application relying on your database would no longer be able to insert non-numeric data into the ID column by mistake. This additional validation alone is worth making the change.
As far as the size and performance of an index goes, the size is very likely to shrink, which has a potential of improving performance by reducing the amount of reads.
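A sketch of the whole change, with hypothetical names (ALTER syntax varies by product, and you may need to drop and recreate any index on the column around the type change; SQL Server-style shown):

    -- Replace the lone non-numeric value with the new numeric admin code
    UPDATE child_table SET user_code = '0' WHERE user_code = 'ADMIN';

    -- Then let the engine enforce the type from now on
    ALTER TABLE child_table ALTER COLUMN user_code int NOT NULL;

Once the column is an int, inserting 'ADMIN' again fails at the type level instead of silently polluting the data.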
It sounds like you really want a reference table.
Integers have advantages over strings for indexes:
They are fixed length.
They are usually shorter (although at 32 bits each, your codes might be shorter).
I think they are easier to gather statistics on.
The first two are optimizations for the index, but they are pretty minor, and the third might affect the optimizer. These are the sort of things that are helpful, but you wouldn't change your data structure for them.
These also affect joins and foreign keys. The second is particularly important for foreign key references. If your values are wide, you end up repeating them in multiple tables -- eating up even more space.

Postgresql -- All else equal, is querying for (small) integer or float values faster than querying for (small) string values?

I'm about to mark maybe 100,000 records retroactively/posthoc-wise with category-indicating string or integer values. There are more to come. The categories to be marked by this column reflect a scalar continuum of different category types, going anywhere from "looser" to "tighter" essentially. I was thinking about using string values though, instead of integers, in case one day I come back to it and not know what means what.
So that's the reasoning for using strings, readability.
But I'll be relying on these columns pretty significantly, selecting swaths of records based off this criteria.
Obviously whatever it is I'm going to put an index on it, but with an index, I'm not sure how much faster querying on integers is than using strings. I've noticed the speediness of using booleans, and can reasonably assume small integers can be queried on more quickly than strings based off this.
I've been pondering this trade off for some time now so thought I'd fire off a question. Thanks
If it's really a string representing some ordered level between "looser" and "tighter", consider using an enum:
http://www.postgresql.org/docs/current/static/datatype-enum.html
That way, you'll get the best of both worlds.
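A minimal sketch of what that can look like, with made-up names (PostgreSQL syntax):

    -- Labels stay readable; internally each value is a compact fixed-size key
    CREATE TYPE tightness AS ENUM ('loosest', 'loose', 'tight', 'tightest');

    CREATE TABLE samples (
        id    serial PRIMARY KEY,
        level tightness NOT NULL
    );

    -- Comparisons follow the declared order, not alphabetical order
    SELECT * FROM samples WHERE level >= 'tight';

The last query matches 'tight' and 'tightest', because enum ordering follows the order in which the labels were declared.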
One tiny note, though: ideally, make sure you nail down all possible values in advance. Changing an enum is of course possible, but doing so adds an extra lookup and sort step internally (against a 32-bit float sort-order field) when the order of its numeric representation (its oid, which is a 32-bit integer) no longer matches its declared order. (The performance difference is minor, but one to keep in mind should your data ever grow to billions of rows. And, again: it only applies when you alter the order of an existing enum.)
Regarding the second part of your question, sorting small integers (16-bit) is, in my own admittedly limited testing from a few years back, a bit slower than sorting normal integers (32-bit). I imagine it's because they're manipulated as 32-bit integers anyway. And sorting or querying integers, as in the case of enums, is faster than sorting arbitrary strings. Ergo, use enums if you don't need the flexibility of adding arbitrary values down the road: they'll give you the best of each world.

Modeling database : many small tables or not?

I have a database with some information that is repeated in several tables.
I want to know if it's worthwhile to create a table with this information and, in the other tables, store only its id.
It's interesting because with this method I don't have redundancy. But I will have to do many joins between my tables in my queries, and I'm afraid my queries will be slower.
(I work with Symfony, if that changes anything.)
It sounds like the 'information' in question is data that makes up key values. If so, it sounds like the database designer likes to use natural keys and that you prefer to use surrogate keys.
First, these are both merely a question of style. If the natural key values are composite (i.e. involve more than one column) and are included in other tables for data integrity purposes then they are not redundant.
Second, as you have observed, when it comes to the performance of surrogate keys you have to weigh the advantage of the more efficient data type (e.g. a single integer column) against the degrading performance of needing to write more JOINs. Note that using surrogates tends to make constraints more troublesome to write: e.g. when the required values for a rule are in another table and your SQL product doesn't support subqueries in CHECK constraints, you will need to use a trigger, which degrades performance in a high-activity environment.
Further consider that performance is not the only consideration, e.g. using natural key values will tend to make the data more readable and therefore make the schema easier to maintain, because the physical model will reflect the logical model more closely (surrogate keys do not appear in the logical model at all).
You're talking about Normalisation. As with so many design aspects it's a trade-off.
Having duplication within the database leads to many problems - for example how to keep those duplicates in step when updating data. So Inserts and Updates may well go more slowly because of the duplication. Hence we tend to normalise the database to avoid such duplication. That does lead to more complex queries and possibly some retrieval overhead.
Modern database products tend to do such queries really well if you take a bit of care to have the right indexes in place.
Hence my starting position would be to normalise your data and avoid duplication. Then, in special cases, perhaps denormalise just the pieces where it really becomes essential. For example, suppose some part of your database is large and mostly queried rather than updated (e.g. historic order information); then perhaps denormalise that bit.
It is not a question of style.
The answer is, as the seeker has already identified, removal of duplication; Normalisation. Pull them all into one table, and place a Foreign Key wherever they are used.
Now an Integer FK may be "tidy", but any good, short, fixed length key will do. Variable length keys are very bad for performance, as the key needs to be unpacked every time the index is searched.
The nature of a Normalised database is more, smaller tables, which is much faster than an Unnormalised data heap, with fewer, larger tables. Get used to it.
As long as you are Joining on keys, Joins do not cost anything in themselves; ten joins to construct a row do not cost more than five. The cost is in the table sizes; the indices used; the distribution; the datatypes of the index columns; etc. Relational dbms are heavily engineered for Normalised databases.
If you need to do lookups of lookups, then that is the way it is. Just ensure that the tables are Normalised.
If you don't normalise:
How are you going to store values that could potentially be used?
How are you going to separate "Lookup value" from "Look up value" from "LookUpValue", etc.?
You'll be slower, because you are storing the variable-length string "Lookup value" across many rows, rather than a nice tidy integer key.
These are the more practical points to add to the other two answers...
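A sketch of how the normalised version heads off those divergent spellings (names are illustrative):

    CREATE TABLE lookup_values (
        lookup_id   int PRIMARY KEY,
        lookup_text varchar(50) NOT NULL UNIQUE
    );

    CREATE TABLE facts (
        fact_id   int PRIMARY KEY,
        lookup_id int NOT NULL REFERENCES lookup_values (lookup_id)
    );

The spelling now lives in exactly one row, so every fact that references it displays the same text, and correcting "LookUpValue" to "Lookup value" is a single UPDATE in one place.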

Why do we care about data types?

Specifically, in relational database management systems, why do we need to know the data type of a column (more likely, the attribute of an object) at creation time?
To me, data types feel like an optimization, because one data point can be implemented in any number of ways. Wouldn't it be better to assign semantic roles and constraints to a data point and then have the engine internally examine and optimize which data type best serves the user?
I suspect this is where the heavy lifting is and why it's easier to just ask the user rather than to do the work.
What do you think? Where are we headed? Is this a realistic expectation? Or do I have a misguided assumption?
The type expresses a desired constraint on the values of the column.
The answer is storage space and fixed size rows.
Fixed-size rows are much, MUCH faster to search than variable length rows, because you can seek directly to the correct byte if you know which record number and field you want.
Edit: Having said that, if you use proper indexing in your database tables, the fixed-size rows thing isn't as important as it used to be.
SQLite does not care.
Other RDBMS's use principles that were designed in early 80's, when it was vital for performance.
Oracle, for instance, does not distinguish between a NULL and an empty string, and stores its NUMBERs as sets of centesimal (base-100) digits.
That hardly makes sense today, but these were very clever solutions when Oracle was being developed.
In one of the databases I developed, though, non-indexed values were stored as VARCHAR2s and cast dynamically into appropriate datatypes depending on several conditions.
That was quite a special thing, though: it was used for bulk loading key-value pairs in one call to the database using collections.
Dynamic SQL statements were used for parsing data and putting them into appropriate tables based on key name.
All values were loaded into the temporary VARCHAR2 column as-is and then converted into NUMBERs and DATETIMEs to be put into their columns.
Explicit data types are huge for efficiency, and storage. If they are implicit they have to be 'figured' out and therefore incur speed costs. Indexes would be hard to implement as well.
I would suspect, although I'm not positive, that explicit types also use less storage space on average. For numbers especially, there is no comparison between a binary int and a string of digit characters.
Hm... Your question is sort of confusing.
If I understand it correctly, you're asking why it is that we specify data types for table columns, and why the "engine" doesn't automatically determine what is needed for the user.
Data types act as a constraint - they secure the data's integrity. An int column will never have letters in it, which is a good thing. The data type isn't automatically decided for you; you specify it when you create the database - almost always using SQL.
You're right: assigning a data type to a column is an implementation detail and has nothing to do with the set theory or calculus behind a database engine. As a theoretical model, a database ought to be "typeless" and able to store whatever we throw at it.
But we have to implement the database on a real computer with real constraints. It's not practical, from a performance standpoint, to have the computer dynamically try to figure out how to best store the data.
For example, let's say you have a table in which you store a few million integers. The computer could -- correctly -- figure out that it should store each datum as an integral value. But if you were to one day suddenly try to store a string in that table, should the database engine stop everything until it converts all the data to a more general string format?
Unfortunately, specifying a data type is a necessary evil.
If you know that some data item is supposed to be a numeric integer, and you deliberately choose NOT to let the DBMS take care of enforcing this, then it becomes YOUR responsibility to ensure all sorts of things: data integrity (ensuring that no value 'A' and no value 1.5 can be entered in the column), consistency of system behaviour (ensuring that the value '01' is considered equal to the value '1', which is not the behaviour you get from type String), ...
Types take care of all those sorts of things for you.
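A tiny illustration of what that responsibility looks like in practice (PostgreSQL shown; the tables are hypothetical):

    CREATE TABLE qty_typed   (qty int);
    CREATE TABLE qty_untyped (qty text);

    INSERT INTO qty_typed   VALUES ('A');  -- rejected: invalid input for integer
    INSERT INTO qty_untyped VALUES ('A');  -- accepted; now it is YOUR problem

    INSERT INTO qty_untyped VALUES ('01'), ('1');
    SELECT count(*) FROM qty_untyped WHERE qty = '1';  -- 1, not 2: '01' <> '1' as text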
I'm not sure of the history of datatypes in databases, but to me it makes sense to know the datatype of a field.
When would you want to do a sum of some fields which are entirely varchar?
If I know that a field is an integer, it makes perfect sense to do a sum, avg, max, etc.
Not all databases work this way. SQLite was mentioned earlier, but a much older set of databases also does this, multivalued databases.
Consider UniVerse (now an IBM property). It does not do any data validation, nor does it require that you specify a type. Searches are still (relatively) fast, and it takes up less space (due to the way it stores data dynamically).
You can describe what the data may look like using meta-data (dictionary items), but that is the limit of how you restrict the data.
See the wikipedia article on UniVerse
When you're pushing half a billion rows in 5 months after go live, every byte counts (in our system)
There is no such anti-pattern as "premature optimisation" in database design.
Disk space is cheap, of course, but you use the data in memory.
You should care about datatypes when it comes to filtering (WHERE clause) or sorting (ORDER BY). For example, "200" is LOWER than "3" if those values are strings, and the opposite when they are integers.
I believe sooner or later you will have to sort or filter your data ("200" > "3"?) or use some aggregate functions in reports (like sum() or avg()). Until then you are good with a text datatype :)
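A quick demonstration (PostgreSQL syntax, with explicit casts so the types are unambiguous):

    SELECT '200'::text < '3'::text;  -- true: compared character by character
    SELECT 200 < 3;                  -- false: compared numerically

    -- The same effect bites in ORDER BY on a varchar column:
    -- '1', '10', '2', '200', '3' instead of 1, 2, 3, 10, 200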
A book I've been reading on database theory tells me that the SQL standard defines a concept of a domain. For instance, height and width could be two different domains. Although both might be stored as numeric(10,2), a height and a width column could not be compared without casting. This allows for a "type" constraint that is not related to implementation.
I like this idea in general, though, since I've never seen it implemented, I don't know what it would be like to use it. I can see that it would reduce the chance of errors in using values whose implementation happen to be the same, when their conceptual domain is quite different. It might also help keep people from comparing cm and inches, for instance.
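For what it's worth, PostgreSQL's CREATE DOMAIN implements part of the idea: it names a constrained base type, though unlike the standard's concept it will still let you compare two domains that share a base type. A sketch:

    CREATE DOMAIN height_cm AS numeric(10,2) CHECK (VALUE > 0);
    CREATE DOMAIN width_cm  AS numeric(10,2) CHECK (VALUE > 0);

    CREATE TABLE panels (
        panel_id int PRIMARY KEY,
        height   height_cm,
        width    width_cm
    );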
Constraint is perhaps the most important thing mentioned here. Data types exist to ensure the correctness of your data, so you can be sure you can manipulate it correctly. There are two ways we can store a date: in a date type, or as a string like "4th of January 1893". But the string could also have been "4/1 1893", "1/4 1893" or similar. Datatypes constrain that and define a canonical form for a date.
Furthermore, a datatype has the advantage that it can undergo checks. The string "0th of February 1975" is accepted as a string, but should not be as a date. How about "30th of February 1983"? Poor databases, like MySQL, do not make these checks by default (although you can configure MySQL to do it -- and you should!).
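For example, a database that does make the checks rejects the impossible date outright (PostgreSQL shown; MySQL's behaviour depends on its sql_mode setting):

    SELECT DATE '1983-02-30';
    -- ERROR:  date/time field value out of range: "1983-02-30"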
Data types will ensure the consistency of your data. This is one of the most important concepts, as keeping your data sane will spare your head from insanity.
RDBMSs generally require column types to be defined so that lookups can be performed fast. If you want to get the 5th column of every row in a huge dataset, having the columns defined is a huge optimisation.
Instead of scanning each row for some form of delimiter to find the 5th column (as it would have to if column widths were not fixed), the RDBMS can jump straight to the byte offset given by the combined size of columns 1 through 4 and read sizeof(column5) bytes. Imagine how much quicker that is on a table of, say, 10,000,000 rows.
Alternatively, if you don't want to specify the types of each column, you have two options that I'm aware of: specify each column as a varchar(255) and decide what you want to do with it within the calling program, or use a different database system that uses key-value pairs, such as Redis.
A database is all about physical storage; data types define this!