Generally, are string (or varchar) fields used as join fields? - sql

We have two tables. The first contains a name (varchar) field. The second contains a field that references the name field from the first table. This foreign key in the second table will be repeated for every row associated with that name. Is it generally discouraged to use a varchar/string field as a join between two tables? When is the best case where a string field can be used as a join field?

It's certainly possible to use a varchar as a key field (or simply something to join on). The main problems with it are based on what you normally store in a varchar field; mutable data. Strictly speaking, it's not advisable to have key fields change. A person's name, telephone number, even their SSN can all change. However, the employee with internal ID 3 will always be ID 3, even if there are two John Smiths.
Second, string comparison is dependent on a number of nit-picky details, such as culture, collation, whitespace translation, etc. that can break a join for no immediately-apparent reason. Say you use a tabspace character \t for a certain string you're joining on. Later, you change your software to replace \t with 3 spaces to reduce character escapes in your raw strings. You have now broken any functionality requiring a string with escaped tabs to be matched to an identical-looking, but differently-composed, string.
Lastly, even given two perfectly identical strings, there is a slight performance benefit to comparing two integer numbers than comparing two strings. Integer comparison is effectively constant-time. String comparison is linear at best, based on the length of the string.

Is it generally discouraged to use a varchar/string field as a join between two tables?
If there's a natural key to be used (extremely rare in real life, but state/province abbreviations are a good example), then VARCHAR fields are fine.
When is the best case where a string field can be used as a join field?
Depends on the database because of the bits allocated to the data type, but generally VARCHAR(4) or less takes around the same amount of space (less the less number of characters) as INT would.

Generally speaking you shouldn't use anything that is editable by the end users as a FK as an edit would require not one update, but one update per table which references that key.
Everyone else has already mentioned the potenetial performance implications of a query, but the update cost is also worth noting. I strongly suggest the use of a generated key instead.

If you're concerned about performance, the best way to know is to create tables that implement your potential design choices, then load them up with massive amounts of data to see what happens.
In theory, very small strings should perform as well as a number in joins. In practice, it would definitely depend upon the database, indexing, and other implementation choices.

In a relational database, you shouldn't use a string in one table that references the same string in another table. If the second table is a look-up, create an identity column for the table, and then reference the integer value in the first. When displaying the data, use a join to the second table. Just make sure in the second table you never actually delete records.
The only exception would be if you are creating an archive table where you want to store exactly what was chosen at a given time.

Sometimes a join will happen on fields that are not "join fields", because that's just the nature of the query (e.g. most ways of identifying records that are duplicates in a particular column). If the query you want relates to those values, then that's what the join will be on, end of story.
If a field genuinely identifies a row, then it is possible to use it as a key. It's even possible to do so if it could change (it brings issues, but not insurmountable issues) as long as it remains a genuine identifier (it'll never change to a value that exists for another row).
The performance impact varies by common query and by database. By database the type of indexing strategies of some makes them better at using varchar and other textual keys than other databases (in particular, hash-indices are nice).
Common queries can be such that it becomes more performant to use varchar even without hash indices. A classic example is storing pieces of text for a multi-lingual website. Each such piece of text will have a particular languageID relating to the language it is in. However, obtaining other information about that language (it's name etc.) is rarely needed; what's much more often needed is to either filter by the RFC 5646 code, or to find out what that RFC 6546 code is. If we use a numeric id, then we will have to join for both types of query to obtain that code. If we use the code as the ID, then the most common queries concerned with the language won't need to look in the language table at all. Most queries that do care about the details of the language also won't need to do any join; pretty much the only time the key will be used as a foreign key is in maintaining referential integrity on update and insert of text or on deletion of languages. Hence while the join is less efficient when it is used the system as a whole will be more efficient by using fewer joins.

It depends on the nature of your data.
If the string is some user-entered and updated value then I would probably shy away from joining on it. You may run into consistency difficulties for storing the name in both the parent and the detail table.
Nothing has duplicate names?
I have used a string field as a join when using GUIDs or single char identifiers or when I know the string to be a natural key (though I almost always prefer a surrogate)

Natural primary keys like a zip code, phone number, email address or user name are by definition strings. There are unique and relatively short.
If you put an index on such a column there is no problem with using them a join. Impact on performance will usually be minimal.

Related

Replacing a SQL index text column with numeric values would make it faster?

I have and old and very bad database.
I have a child table with a text column for the users, all my users have numeric values, but there is an exception for the admin user, the code for the admin user is 'ADMIN'.
So I created numeric code for the ADMIN user and I will update all the records with that numeric value, but I wont change the column type to integer.
So I want to know if making this change, and having all the values of the user column with numeric value, the index for user column will be better, faster and stronger?
Indexing performance aside, it is always better to use the database type that matches the actual type in your model. Since the actual type of the ID is integer, changing database type to int would make it more natural to work with your database.
For example, ordering on ID would behave in a natural way, because it would no longer alphabetize your numbers (i.e. ordering 199 ahead of 2, because 199 comes first lexicographically). Searches using BETWEEN operator would produce correct results for the numbers as well.
Another important improvement is that the application relying on your database would no longer be able to insert non-numeric data into the ID column by mistake. This additional validation alone is worth making the change.
As far as the size and performance of an index goes, the size is very likely to shrink, which has a potential of improving performance by reducing the amount of reads.
It sounds like you really want a reference table.
Integers have advantages over strings for indexes:
They are fixed length.
They are usually shorter (although at 32 bits each, your codes might be shorter).
I think they are easier to gather statistics on.
The first two as optimizations for the index, but they are pretty minor, and the third might affect the optimizer. These are the sort of thing that is helpful, but you wouldn't change your data structure for it.
These also affect joins and foreign keys. The second is particularly important for foreign key references. If your values are wide, you end up repeating them in multiple tables -- eating up even more space.

Query on varchar vs foreign key performance

This is for SQL Server.
I have a table that will contain a lot of rows and that table will be queried multiple times so I need to make sure my design is optimized.
Just for the question let say that table contains 2 columns. Name and Type.
Name is a varchar and it will be unique.
Type can be 5 different value (type1... type5). (It possible can contains more values in the future)
Should I make type a varchar (and create an index) or would be it better to create a table of types that will contains 5 rows with only a column for the name and make type a foreign key?
Is there a performance difference between both approach? The queries will not always have the same condition. Sometime it will query the name, type, or both with different values.
EDIT: Consider that in my application if type would be a table, the IDs would be cached so I wouldn't have to query the Type table everytime.
Strictly speaking, you'll probably get better query performance if you keep all the data in one table. However doing this is known as "denormalization" and comes with a number of pretty significant drawbacks.
If your table has "a lot of rows", storing an extra varchar field for every row as opposed to say, a small, or even tinyint, can add a non-trivial amount of size to your table
If any of that data needs to change, you'll have to perform lots of updates against that table. This means transaction log growth and potential blocking on the table during modification locks. If you store it as a separate table with 5-ish rows, if you need to update the data associated with that data, you just update one of the 5 rows you need.
Denormalizing data means that the definition of that data is no longer stored in one place, but in multiple places (actually its stored across every single row that contains those values).
For all the reasons listed above, managing that data (inserts, updates, deletes, and simply defining the data) can quickly become far more overhead than simply normalizing the data correctly in the first place, and for little to no benefit beyond what can be done with proper indexing.
If you find the need to return both the "big" table and some other information from the type table and you're worried about join performance, truthfully, wouldn't be. That's a generalization, but If your big table has, say, 500M rows in it, I can't see many use cases where you'd want all those rows returned; you're probably going to get a subset. In which case, that join might be more manageable. Provided you index type, the join should be pretty snappy.
If you do go the route of denormalizing your data, I'd recommend still having the lookup table as the "master definition" of what a "type" is, so it's not a conglomeration of millions of rows of data.
If you STILL want to denormalize the data WITHOUT a lookup table, at least put a CHECK constraint on the column to limit which values are allowable or not.
How much is "a lot of rows"?.
If it is hundreds of thousands or more, then a Columnstore Index may be a good fit.
It depends on your needs, but usually you would want the type column to be of a numerical value (in your case tinyint).

When would combining columns into a single, delimited column be better in a RDB schema?

Consider for example the case where you have two peaces of data, where one value is rarely used without the other. As one example, here is a table holding user authentication data :
CREATE TABLE users
(
id INT PRIMARY KEY,
auth_name STRING,
auth_password STRING,
auth_password_salt STRING
)
I think that password is meaningless without salt, and the other way around. I also have the option on representing the data this way:
CREATE TABLE users
(
id INT PRIMARY KEY,
auth_name STRING,
auth_secret STRING,
)
And in auth_secret, store strings such as D5SDfsuuAedW:unguessable42
In general, are there any situations where combining columns into one, delimited column would be a better choice?
Even if it is never a "better choice" overall, are there any costs (performance, space, anything) to having more columns vs fewer columns (for the same data)? My motivation is better understanding and to be able to more competently argue against it when someone suggests this sort of thing.
--edited I changed the example... original example as follows:
CREATE TABLE points
(
id INT PRIMARY KEY,
x_coordinate INT,
y_coordinate INT,
z_coordinate INT
)
vs
CREATE TABLE points
(
id INT PRIMARY KEY,
position STRING
)
In position, storing strings such as 7:3:15
You do that when there is no chance of needing to join, query, report or aggregate the data.
In other words - never. It is bad database design.
First Normal form (NF1) states that attributes should be distinct - it is the basic requirement.
The only possible answer to this question is never. Never, ever, store delimited data in a column. It defeats the entire point of columns, which are there to delimit your data, and makes it inordinately difficult to do anything that a database has been designed to do. It's a violation of normalisation so huge that you'll spend hours on Stack Overflow trying to correct it in a months time.
Never do this.
However, "never say never".
In certain, extremely limited, circumstances it's okay. Never assume it's okay but it can be.
A good example is Stack Overflow's own Posts table, which stores the tags in a delimited format for quick reading. The tags a question has are read from the database far more often than they are edited. The tags are stored in a separate table, PostTags, and then denormalised to Posts when they are updated.
In short, even though you can denormalise your data in this way, don't. Try everything possible to avoid it. If you come across a situation where you've been optimizing for days and the only way to get something quicker is to denormalize, then it's okay. Just ensure that you are only ever going to read data from that column and you have a secondary process in place to ensure that it is kept up-to-date. If the update of the denormalised data fails, roll everything back to ensure that your data is consistent.
You left out a significant option: create an appropriate user-defined data type. (PostgreSQL has long had an intrinsic data type for 2-space.)
PostgreSQL
Oracle
SQL Server
DB2
These implementations differ quite a lot.
But you might not have the luxury of using one of those platforms. You might have to use MySQL, for example, which doesn't support user-defined data types.
Relational theory says that data types can be arbitrarily complex; they can have internal structure. The most common data type that has internal structure is the type "date". Relational theory specifies what the dbms is supposed to do with data types like that. The dbms must either
ignore the internal structure entirely, or
provide functions to manipulate the parts.
In the case of dates, every SQL dbms provides functions to manipulate the parts.
You can make a good argument for a single column that stores 3-space coordinates like "7:3:15" in MySQL. To keep in line with relational theory, you'd want the dbms to ignore the structure, and return only the single value "7:3:15"; manipulation of parts is left to application code.
One problem with implementing something like that in MySQL is that MySQL doesn't enforce CHECK constraints. So it's a lot harder to prevent values like "wibble:frog:foo" from finding their way into the database.

What's more efficient: string searches, or joins through SQL?

I have two lists of words and I need to find matches (intersection of the two sets.)
Should I store each list as a string and find matches through string functions (like a regular expression) or store the words in a table, and have SQL find matches by joining?
It is almost impossible to say without more information about the problem. Here are some things to consider:
How many different distinct items do you have?
How many different combinations would be on a typical row?
Do your searches require looking for wildcards?
How long are the individual items?
Specifics on the database engine and hardware you are running on.
I want to emphasize that in almost all situations, you want to store the values in another table. Performance is not necessarily the primary reason. More important are ease of updating and deleting individual values, and the ability to support many more types of queries (such as a list of all available values).
But, we can still think about the performance issues. Storing values in a single string simply requires fetching the page with the record on it, and then applying a function that goes through the string. For simple patterns (such as identifying the presence of a fixed substring), this should go quite fast. There are few things that computers do faster than looping through strings and comparing values (assuming a reasonable implementation).
In the fastest possible join, both tables need to be read in, and the keys need to be matched. This requires additional effort. The situation is even worse, because you really want two additional tables, one for the individual string items and the other for the relationship between the original records and the items.
At this point, you may think "gosh, strings seem like a better idea". This is wrong. One of the big differences is in average size. If you items are, on average, longer than, say, 4 characters, then you save space by using a reference table. This saved space immediately translates into improved performance, because there is less I/O. With indexes, the additional tables would be in memory anyway, so the matching would be quite fast.
And, there is the issue of querying. You can use standard SQL functions for queries such as records that have A and B (many string functions are database specific). You can easily find out exactly which items are in the database, and relatively easily find what pairs exist on records. You can keep track of when an item is added to a record, and the first time it appears in the database. Generally, this flexible functionality -- which is just basic SQL functionality -- is what you need when managing this type of data.
Storing in a table will be much faster than a SQL string manipulation function in most circumstances especially if you can index the words.
I think you're asking if this:
SELECT word FROM table_one WHERE word in (SELECT word FROM table_two)
is faster than this:
SELECT table_one.word FROM table_one
INNER JOIN table_two ON table_one.word = table_two.word
The first answer should be faster, because the second creates a (potentially large) temporary object (the joined table).
Note that I assume you have an index on word. Also: if the strings are very long (URLs, for example), this will be very slow, and you should match on a hash instead.

Sphinx question: Structuring database

I'm developing a job service that has features like radial search, full-text search, the ability to do full-text search + disable certain job listings (such as un-checking a textbox and no longer returning full-time jobs).
The developer who is working on Sphinx wants the database information to all be stored as intergers with a key (so under the table "Job Type" values might be stored such as 1="part-time" and 2="full-time")... whereas the other developers want to keep the database as strings (so under the table "Job Type" it says "part-time" or "full-time".
Is there a reason to keep the database as ints? Or should strings be fine?
Thanks!
Walker
Choosing your key can have a dramatic performance impact. Whenever possible, use ints instead of strings. This is called using a "surrogate key", where the key presents a unique and quick way to find the data, rather than the data standing on it's own.
String comparisons are resource intensive, potentially orders of magnitude worse than comparing numbers.
You can drive your UI off off the surrogate key, but show another column (such as job_type). This way, when you hit the database you pass the int in, and avoid looking through to the table to find a row with a matching string.
When it comes to joining tables in the database, they will run much faster if you have int's or another number as your primary keys.
Edit: In the specific case you have mentioned, if you only have two options for what your field may be, and it's unlikely to change, you may want to look into something like a bit field, and you could name it IsFullTime. A bit or boolean field holds a 1 or a 0, and nothing else, and typically isn't related to another field.
if you are normalizing your structure (i hope you are) then numeric keys will be most efficient.
Aside from the usual reasons to use integer primary keys, the use of integers with Sphinx is essential, as the result set returned by a successful Sphinx search is a list of document IDs associated with the matched items. These IDs are then used to extract the relevant data from the database. Sphinx does not return rows from the database directly.
For more details, see the Sphinx manual, especially 3.5. Restrictions on the source data.