Arrangement of database columns in any DBMS - sql

We had a discussion in class about whether putting an identity key as the first column of the table affects the performance when writing your queries.
I think whether the identity key is the first or last column does not really matter, since it will be used for linking one table with another. But then again I could be mistaken. I can't really find good articles that address this. What are your thoughts on this? And if you have good references, please do share them.
Thanks!

In short...
Depending on your DBMS, field order may actually make a difference, but that difference is likely to be too small to matter.
Slightly longer answer...
Fields are stored together in rows, and rows are grouped in database pages¹.
So when the DBMS loads a specific field, it also loads the row containing that field and the entire page containing that row². I/O, not CPU, tends to be most important for DB performance, so once the row is in memory, field order typically doesn't matter (much).
Depending on the physical row layout...
The DBMS may already know the offset of each field, in which case each field is equally fast.
Or it may need to "scan" through preceding fields, which is also quite fast but can make a small difference.
In fact, the DBMS may not even honor the order of fields in your CREATE TABLE statement - for example, it may move all fixed-length fields to the front of the row and relegate the variable-length fields to the back.
But as always: if in doubt, measure on realistic amounts of data.
¹ Which are some multiple of disk blocks. One database page may (and typically does) contain many rows.
² Unless the row cannot fit into a page, in which case some of its fields can be "chained" into another page or stored "out of line" in some way. But let's not complicate things too much here.
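Following the "measure on realistic amounts of data" advice: below is a minimal sketch of how one might compare an early column against a late one in PostgreSQL. The table and column names are made up for illustration, and any difference you observe is likely to be small.
-- Create a deliberately wide row; only a few of the ~100 columns are spelled out here.
CREATE TABLE wide_row_test (
    id   serial PRIMARY KEY,
    c001 int, c002 int, /* ...imagine many more columns... */ c100 int
);
-- Populate with enough rows to make timing differences visible.
INSERT INTO wide_row_test (c001, c002, c100)
SELECT i, i, i FROM generate_series(1, 1000000) AS s(i);
-- Compare reading an early column vs. a late one.
EXPLAIN ANALYZE SELECT c001 FROM wide_row_test;
EXPLAIN ANALYZE SELECT c100 FROM wide_row_test;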

In a truly relational DBMS there is no logical ordering of attributes in a table. In a SQL DBMS there is a logical ordering of columns but that doesn't necessarily match the ordering of data in physical storage. It is the structures used in physical storage that might influence performance and not the logical placement of columns in a table. So as a rule you are right, column order doesn't / shouldn't make any difference. Whether physical ordering makes a difference in any particular DBMS product very much depends on the physical storage details of that product.

Related

DB Architecture: One table using WHERE vs multiple

I wonder what the difference is between having one table with 6 million rows (i.e. a huge DB) and 100k active users:
CREATE TABLE shoes (
id serial primary key,
color text,
is_left_one boolean,
stock int
);
Along with 6 indexes like:
CREATE INDEX blue_left_shoes ON shoes(color, is_left_one) WHERE color='blue' AND is_left_one=true;
Versus: 6 tables with 1 million rows:
CREATE TABLE blue_left_shoes(
id serial primary key,
stock int
);
The latter one seems more efficient because users don't have to ask for the condition, since the table IS the condition, but perhaps creating the indexes mitigates this?
This table is used to query either left or right shoes in "blue", "green" or "red", and to check the number of remaining items. It is a simplified example, but for the workload and use case you can think of the "only 3 items left in stock" tooltip on Amazon (or any digital selling platform). It is the users (100k active daily) who will make the query.
NB: The question is mostly for PostgreSQL, but differences with other DBs are still relevant and interesting.
In the latter case, where you use a table called blue_left_shoes:
Your code needs to first work out which table to look at (as opposed to parameterising a value in the WHERE clause)
As permutations and options increase, you need to increase the number of tables, and increase the logic in your app that works out which table to use
Anything that needs to use this database (i.e. a reporting tool or an API) now needs to re-implement all of these rules
You are imposing logic at a higher layer to improve performance.
If you were to partition and/or index your table appropriately, you get the same effect - SQL queries only look through the records that matter. The difference is that you don't need to implement this logic in higher layers
As long as you can get the indexing right, keeping this in one table is almost always the right thing to do.
Partitioning
Database partitioning is where you select one or more columns to decide how to "split up" your table. In your case you could choose (color, is_left_one).
Now your table is logically split and ordered in this way and when you search for blue,true it automatically knows which partition to look in. It doesn't look in any other partitions (this is called partition pruning)
Note that this occurs automatically from the search criteria. You don't need to manually work out a particular table to look at.
Partitioning doesn't require any extra storage (beyond various metadata that has to be saved)
You can only apply one partitioning scheme to a table, not several.
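As a hedged sketch of what this could look like with PostgreSQL's declarative partitioning (version 11 or later; the partition names are illustrative, and note that a primary key on a partitioned table must include the partition key):
CREATE TABLE shoes (
    id          bigserial,
    color       text NOT NULL,
    is_left_one boolean NOT NULL,
    stock       int,
    PRIMARY KEY (color, id)           -- must contain the partition key
) PARTITION BY LIST (color);

CREATE TABLE shoes_blue  PARTITION OF shoes FOR VALUES IN ('blue');
CREATE TABLE shoes_green PARTITION OF shoes FOR VALUES IN ('green');
CREATE TABLE shoes_red   PARTITION OF shoes FOR VALUES IN ('red');

-- Partition pruning kicks in automatically for queries such as:
-- SELECT stock FROM shoes WHERE color = 'blue' AND is_left_one = true;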
Indexing
Creating an index also provides performance improvements. However, indexes take up space and can impact insert and update performance (as they need to be maintained). Practically speaking, the select trade-off almost always far outweighs any insert/update negatives.
You should always look at indexes before partitioning
Non selective indexes
In your particular case, there's an extra thing to consider: a boolean field is not "selective". I won't go into details, but suffice to say you shouldn't create an index on this field alone, as it won't be used because it only halves the number of records you have to look through. You'd need to include some other fields in any index (e.g. colour) to make it useful.
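A minimal sketch of such a composite index on the question's shoes table (the index name is made up):
CREATE INDEX shoes_color_side_idx ON shoes (color, is_left_one);
-- The planner can then use it for the common lookup:
-- SELECT stock FROM shoes WHERE color = 'blue' AND is_left_one = true;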
In general, you want to keep all "like" data in a single table, not split among multiples. There are good reasons for this:
Adding new combinations is easier.
Maintaining the tables is easier.
You can easily do queries "across" entities.
Overall, the database is more efficient, because it is more likely that pages will be filled.
And there are other reasons as well. In your case, you might have an argument for breaking the data into 6 separate tables. The gain here comes from not having the color and is_left_one in the data. That means that this data is not repeated 6 million times. And that could save many tens of megabytes of data storage.
I say the last a bit tongue-in-cheek (meaning I'm not that serious). Computers nowadays have so much memory that 100 Mbytes is just not significant in general. However, if you have a severely memory-limited environment (I'm thinking "watch" here, not even "smart phone") then it might be useful.
Otherwise, partitioning is a fine solution that pretty much meets your needs.
For this:
WHERE color='blue' AND is_left_one=true
The optimal index is
INDEX(color, is_left_one) -- in either order
Having id first makes it useless for that WHERE.
It is generally bad to have multiple identical tables instead of one.

Three SQL tables or one?

I have a choice of creating three tables with identical structure but different content, or one table with all of the data and one additional column that distinguishes the data. Each table will have about 10,000 rows in it, and it will be used exclusively for looking up data. The key design criterion is speed of lookup, so which is faster: three tables with 10K rows each or one table with 30K rows, or is there no substantive difference? Note: all columns that will be used as query parameters will have indices.
There should be no substantial difference between 10k or 30k rows in any modern RDBMS in terms of lookup time. In any case not enough difference to warrant the de-normalization. Indexed qualifier column is a common approach for such a design.
The only time you may consider de-normalizing is if your update pattern affects a limited set of data that you can put in a "short" table (say, today's messages in a social network) with few(er) indexes for fast inserts/updates, and there is a background process transferring the stabilized updates to a large, fully indexed table. The case where you really win during write operations will be a dramatic one though, with very particular and unfortunate requirements. RDBMS engines are sophisticated enough to handle most of the simple scenarios in a very efficient way. 30k rows does not sound like a candidate.
If still in doubt, it is very easy to write a test to check on your particular database / system setup. I think if you post your findings here with real data, it will be useful info for everyone who follows in your steps.
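As a sketch of the "indexed qualifier column" design mentioned above, with entirely hypothetical table and column names:
CREATE TABLE lookup_data (
    id         int PRIMARY KEY,
    category   varchar(20)  NOT NULL,   -- the qualifier distinguishing the three kinds of rows
    lookup_key varchar(50)  NOT NULL,
    value      varchar(100) NOT NULL
);
-- Index the qualifier together with the parameter actually used for lookups.
CREATE INDEX lookup_data_category_key_idx ON lookup_data (category, lookup_key);
-- SELECT value FROM lookup_data WHERE category = 'type_a' AND lookup_key = 'some key';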
Apart from the speed issue, which the other posters have covered and I agree with, you should also take into consideration the business model that you are replicating in your database, as this may affect the maintenance cost of your solution.
If it is possible that the 3 'things' may turn into 4, and you have chosen the separate-table path, then you will have to add another table. Whereas if you choose the discriminator path, then it is as simple as coming up with a new discriminator.
However, if you choose the discriminator path and then new requirements dictate that one of 'things' has more data to store then you are going to have to add extra columns to your table which have no relevance to the other 'things'.
I cannot say which is the right way to go, as only you know your business model.

SQL Server column design

I have always tried to make my SQL database as simple and as understandable as possible.
Until now I have always used a limited number of columns; I think I never had more than 20. Now there is one thing that would make my life easier if I had many more columns. Let's say 200 columns (not rows). What do you think about it?
I just want to know if it is a bad idea, not why I'm doing this or whether there are other possibilities; just whether somebody has already experienced something like that and whether it is a bad idea to make such a table.
Fewer, narrower columns are better than lots of columns and/or very wide columns.
Why? Because the narrower the row size, the more rows you fit on an 8K page. That means you do less I/O and use less memory to buffer pages. That is always a good thing.
In those (hopefully) rare cases where the domain requires many attributes on an object (with the assumption of 1-1 object-table mapping), you should consider splitting into two tables in a 1-1 relationship, one containing the frequently used columns.
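A rough sketch of that 1-1 split in T-SQL, with illustrative table and column names:
-- Narrow table holding the frequently used columns.
CREATE TABLE Customer (
    CustomerId int IDENTITY PRIMARY KEY,
    Name       nvarchar(100) NOT NULL,
    Email      nvarchar(200) NOT NULL
);
-- 1-1 companion table for the rarely used columns.
CREATE TABLE CustomerDetail (
    CustomerId int PRIMARY KEY REFERENCES Customer (CustomerId),
    Notes      nvarchar(max) NULL
);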
I don't think it is black and white. Having a large row size (implied by the large number of columns) will hurt performance (i.e., more I/O) -- but there are cases where taking a small hit in performance in one place will be offset by increased performance in others.
I'd say it depends on how many rows you expect this table to have, how often it will be queried, how many of those additional columns will really be accessed, and how it would compare to your alternative design in terms of efficiency and complexity.
Luke--
It really depends on the type of system you are working with. For example, in transactional systems most tables have at most 50 columns or so, with almost no redundant data attributes (if you have a process date, you would not need the process month or the process year as a separate column). This of course is because the records are updated/inserted frequently and you'll need to update all the redundant attributes every time you update one row.
In Data Warehouse/reporting environments, for Dimension tables (which have the attributes for an entity) it is typical to have 100+ columns, as there could be various ways you want to categorize a given entity. Updates here are not so much a problem, as data is typically loaded once during off-peak hours and then is used mostly in selects.
Take a look at these links to know more:
http://en.wikipedia.org/wiki/Database_normalization
http://en.wikipedia.org/wiki/Star_schema
So the answer is: it depends... If you want a perfectly relational system, then maybe 200+ columns is a red flag indicating you should look at normalizing your data (maybe not). Updates and indexes are two things you should be concerned with in such a system.
You are using SQL Server, which I think defaults to row-oriented storage (all fields in a row are stored together in a page), which can be a problem with a large number of columns. However, if you use column-oriented storage, the number of columns per table matters much less, because the values of each column are stored together. I don't know if this is possible with SQL Server.
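For what it's worth, more recent SQL Server releases do offer column-oriented storage through columnstore indexes. A minimal sketch, assuming SQL Server 2014 or later and a hypothetical wide table named dbo.WideTable:
-- Converts the table's storage to columnstore; the number of columns then matters much less for scans.
CREATE CLUSTERED COLUMNSTORE INDEX cci_WideTable ON dbo.WideTable;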

Modeling database : many small tables or not?

I have a database with some information that is repeated in several tables.
I want to know if it's worthwhile to create a table with this information and keep only its id in the other tables.
It's appealing because with this method I don't have redundancy. But I will have to do many joins between my tables in my queries, and I'm afraid my queries will be slower.
(I work with Symfony, if that changes anything)
It sounds like the 'information' in question is data that makes up key values. If so, it sounds like the database designer likes to use natural keys and that you prefer to use surrogate keys.
First, these are both merely a question of style. If the natural key values are composite (i.e. involve more than one column) and are included in other columns for data integrity purposes then they are not redundant.
Second, as you have observed, when it comes to the performance of surrogate keys you have to weigh the advantage of the more efficient data type (e.g. a single integer column) against the degrading performance of needing to write more JOINs. Note that using surrogates tends to make constraints more troublesome to write: e.g. when the required values for a rule are in another table and your SQL product doesn't support subqueries in CHECK constraints, then you will need to use a trigger, which degrades performance in a high-activity environment.
Further consider that performance is not the only consideration e.g. using natural key values will tend to make the data more readable and therefore make the schema easier to maintain because the physical model will reflect the logical model more closely (surrogate keys do not appear in the logical model at all).
You're talking about Normalisation. As with so many design aspects it's a trade-off.
Having duplication within the database leads to many problems - for example how to keep those duplicates in step when updating data. So Inserts and Updates may well go more slowly because of the duplication. Hence we tend to normalise the database to avoid such duplication. That does lead to more complex queries and possibly some retrieval overhead.
Modern database products tend to do such queries really well if you take a bit of care to have the right indexes in place.
Hence my starting position would be to normalise your data and avoid duplication. Then in special cases perhaps denormalise just the pieces where it really becomes essential. For example, suppose some part of your database is large and mostly queried rather than updated (e.g. historic order information); then perhaps denormalise that bit.
It is not a question of style.
The answer is, as the seeker has already identified, removal of duplication; Normalisation. Pull them all into one table, and place a Foreign Key wherever they are used.
Now an Integer FK may be "tidy", but any good, short, fixed length key will do. Variable length keys are very bad for performance, as the key needs to be unpacked every time the index is searched.
The nature of a Normalised database is more, smaller tables, which is much faster than an Unnormalised data heap, with fewer, larger tables. Get used to it.
As long as you are Joining on keys, Joins do not cost anything in themselves; ten joins to construct a row do not cost more than five. The cost is in the table sizes; the indices used; the distribution; the datatypes of the index columns; etc. Relational dbms are heavily engineered for Normalised databases.
If you need to do lookups of lookups, then that is the way it is. Just ensure that the tables are Normalised.
If you don't normalise
How are you going to store values that could potentially be used?
How are you going to separate "Lookup value" from "Look up value" from "LookUpValue", etc.?
You'll be slow because you are storing the variable-length string "Lookup value" across many rows, rather than a nice tidy integer key
These are the more practical points to add to the other 2 answers...
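A small sketch of the normalised lookup-table approach the answers above describe (all names hypothetical):
CREATE TABLE status_lookup (
    status_id   smallint PRIMARY KEY,
    status_name varchar(50) NOT NULL UNIQUE
);
CREATE TABLE orders (
    order_id  int PRIMARY KEY,
    status_id smallint NOT NULL,
    FOREIGN KEY (status_id) REFERENCES status_lookup (status_id)
);
-- One join on a short fixed-length key instead of repeating the string in every row:
-- SELECT o.order_id, s.status_name
-- FROM orders o JOIN status_lookup s ON s.status_id = o.status_id;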

MySQL Table with TEXT column

I've been working on a database and I have to deal with a TEXT field.
Now, I believe I've seen some place mentioning it would be best to isolate the TEXT column from the rest of the table (putting it in a table of its own).
However, now I can't find this reference anywhere and since it was quite a while ago, I'm starting to think that maybe I misinterpreted this information.
Some research revealed this, suggesting that
Separate text/blobs from metadata, don't put text/blobs in results if you don't need them.
However, I am not familiar with the definition of "metadata" being used here.
So I wonder if there are any relevant advantages in putting a TEXT column in a table of its own. What are the potential problems of having it with the rest of the fields? And potential problems of keeping it in a separated table?
This table (without the TEXT field) is supposed to be searched (SELECTed) rather frequently. Does "premature optimization is the root of all evil" apply here? (If there really is a penalty for TEXT columns, how relevant is it, considering it is fairly easy to change this later if needed.)
Besides, are there any good links on this topic? (Perhaps Stack Overflow questions & answers? I've tried to search this topic but I only found TEXT vs VARCHAR discussions.)
Yep, it seems you've misinterpreted the meaning of the sentence. What it says is that you should only do a SELECT including a TEXT field if you really need the contents of that field. This is because TEXT/BLOB columns can contain huge amounts of data which would need to be delivered to your application - this takes time and of course resources.
Best wishes,
Fabian
This is probably premature optimisation. Performance tuning MySQL is really tricky and can only be done with real performance data for your application. I've seen plenty of attempts to second guess what makes MySQL slow without real data and the result each time has been a messy schema and complex code which will actually make performance tuning harder later on.
Start with a normalised, simple schema, then when something proves too slow add complexity only where/if needed.
As others have pointed out, the quote you mentioned is more applicable to query results than to the schema definition; in any case, your choice of storage engine would affect the validity of the advice anyway.
If you do find yourself needing to add the complexity of moving TEXT/BLOB columns to a separate table, then it's probably worth considering the option of moving them out of the database altogether. Often file storage has advantages over database storage especially if you don't do any relational queries on the contents of the TEXT/BLOB column.
Basically, get some data before taking any MySQL tuning advice you get on the Internet, including this!
The data for a TEXT column is already stored separately. Whenever you SELECT * from a table with text column(s), each row in the result-set requires a lookup into the text storage area. This coupled with the very real possibility of huge amounts of data would be a big overhead to your system.
Moving the column to another table simply requires an additional lookup, one into the secondary table, and the normal one into the text storage area.
The only time that moving TEXT columns into another table will offer any benefit is if there is a tendency to usually select all columns from tables. This is merely introducing a second bad practice to compensate for the first. It should go without saying that two wrongs are not the same as three lefts.
The concern is that a large text field (say, well over 8,192 bytes) will cause excessive paging and/or file I/O during complex queries on unindexed fields. In such cases, it's better to migrate the large field to another table and replace it with the new table's row id or index (which would then be metadata, since it doesn't actually contain data).
The disadvantages are:
a) More complicated schema
b) If the large field is usually inspected or retrieved, there is no advantage
c) Ensuring data consistency is more complicated and a potential source of database malaise.
There might be some good reasons to separate a text field out of your table definition. For instance, if you are using an ORM that loads the complete record no matter what, you might want to create a properties table to hold the text field so it doesn't load all the time. However, if you are controlling the code 100%, for simplicity, leave the field on the table, then only select it when you need it, to cut down on data transfer and reading time.
Now, I believe I've seen some place mentioning it would be best to isolate the TEXT column from the rest of the table (putting it in a table of its own).
However, now I can't find this reference anywhere and since it was quite a while ago, I'm starting to think that maybe I misinterpreted this information.
You probably saw this, from the MySQL manual
http://dev.mysql.com/doc/refman/5.5/en/optimize-character.html
If a table contains string columns such as name and address, but many queries do not retrieve those columns, consider splitting the string columns into a separate table and using join queries with a foreign key when necessary. When MySQL retrieves any value from a row, it reads a data block containing all the columns of that row (and possibly other adjacent rows). Keeping each row small, with only the most frequently used columns, allows more rows to fit in each data block. Such compact tables reduce disk I/O and memory usage for common queries.
Which indeed is telling you that in MySQL you are discouraged from keeping TEXT (and BLOB, as written elsewhere) columns in frequently searched tables.
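To make the manual's suggestion concrete, here is a hedged sketch of moving a TEXT column into its own 1-1 table; all table and column names are illustrative:
CREATE TABLE article (
    article_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    title      VARCHAR(200) NOT NULL,
    created_at DATETIME NOT NULL
) ENGINE=InnoDB;

CREATE TABLE article_body (
    article_id INT UNSIGNED PRIMARY KEY,
    body       TEXT NOT NULL,
    FOREIGN KEY (article_id) REFERENCES article (article_id)
) ENGINE=InnoDB;

-- Frequent searches touch only the narrow table:
-- SELECT article_id, title FROM article WHERE created_at > '2012-01-01';
-- The TEXT is fetched only when it is actually needed:
-- SELECT body FROM article_body WHERE article_id = 42;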