SQL table with a single row? [closed] - sql

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 9 years ago.
What is the point (if any) in having a table in a database with only one row?
Note: I'm not talking about the possibility of having only one row in a table, but when a developer deliberately makes a table that is intended to always have exactly one row.
Edit:
The sales tax example is a good one.
I've just observed in some code I'm reviewing three different tables that contain three different kinds of certificates (a la SSL), each having exactly one row. I don't understand why this isn't made into one large table; I assume I'm missing something.

I've seen something like this when a developer was asked to create a configuration table to store name-value pairs of data that needs to persist without being changed often. He ended up creating a one-row table with a column for each configuration variable. I wouldn't say it's a good idea, but I can certainly see why the developer did it given his instructions. Needless to say it didn't pass review.
I've just observed in some code I'm reviewing three different tables that contain three different kinds of certificates (a la SSL), each having exactly one row. I don't understand why this isn't made into one row; I assume I'm missing something.
This doesn't sound like good design, unless there are some important details you don't know about. If there are three pieces of information that have the same constraints, the same use and the same structure, they should be stored in the same table, 99% of the time. That's a big part of what tables are for fundamentally.

For some things you only need one row - typically system configuration data. For example, "current sales tax rate". This might change in the future and so shouldn't be hardcoded, but you'll typically only ever need one at any given time. This kind of data needs to be in the database so that queries can use it in computations.

It's not necessarily a bad idea.
What if you had some global state (say, a boolean) that you wanted to store somewhere? And you wanted your stored procedures to easily access this state?
You could create a table with a primary key whose value range was limited to exactly one value.

Single row is like a singleton class. purpose: to control or manage some other process.
Single row table could act as a critical section or as deterministic automaton (kind of dispatcher based on row values)
Single row is use full in a table COMPANY_DESCRIPTION, to obtain consistent data about that company. Use full on company letters and addressing.
Single row is use full to contain an actual value like VAT or Date or Time, and so on.

It can be useful sometime to emulate some features the Database system doesn't provide. I'm thinking of sequences in MySQL for instance.

If your database is your application, then it probably makes sense for storing configuration data that might be required by stored procedures implementing business logic.
If you have an application that could use the file system to store information, then I don't think there is an advantage to using the database over an XML or flat file, except maybe that most developers are now far more well versed in using SQL to store and retrieve data than accessing the file system.

What is the point (if any) in having a table in a database with only one row?
A relational database stores things as relations: a tuples of data satisfying some relation.
Like, this one: "a VAT of this many percent is in effect in my country now".
If only one tuple satisifies this relation, then yes, it will be the only one in the table.
SQL cannot store variables: it can store a set consisting of 1 element, this is a one-row table.
Also, SQL is a set based language, and for some operations you need a fake set of only one row, like, to select a constant expression.
You cannot just SELECT out of nothing in Oracle, you need a FROM clause.
Oracle has a pseudotable, dual, which contains only one row and only one column.
Once, long time ago, it used to have two rows (hence the name dual), but lost its second row somewhere on its way to version 7.
MySQL has this pseudotable too, but MySQL is able to do selects without FROM clause. Still, it's useful when you need an empty rowset: SELECT 1 FROM dual WHERE NULL
I've just observed in some code I'm reviewing three different tables that contain three different kinds of certificates (a la SSL), each having exactly one row. I don't understand why this isn't made into one large table; I assume I'm missing something.
It may be a kind of "have it all or lose" scenario, when all three certificates are needed at once:
SELECT *
FROM ssl1
CROSS JOIN
ssl2
CROSS JOIN
ssl3
If any if the certificates is missing, the whole query returns nothing.

A table with a single row can be used to store application level settings that are shared across all database users. 'Maximum Allowed Users' for example.

Funny... I asked myself the same question. If you just want to store some simple value and your ONLY method of storage is an SQL server, that's pretty much what you have to do. If I have to do this, I usually end up creating a table with several columns and one row. I've seen a couple commercial products do this as well.

We have used a single-row table in the past (not often). In our case, this table was used to store system-wide configuration values that were updatable via a web interface. We could have gone the route of a simple name/value table, but the end client preferred a single row. I personally would have preferred the latter, but it really is up to preference, especially if this table will never have any sort of relationship with another table.

I really cannot figure out why this would be the best solution. It seams more efficient to just have some kind of config file that will contain the data that would be in the tables one row. The cost of connecting to the database and querying the one row would be more costly. However if this is going to be some kind of config for the database logic. Then this would make a little bit more sense depending on the type of database you are using.

I use the totally awesome rails-settings plugin for this http://github.com/Squeegy/rails-settings/tree/master
It's really easy to set up and provides a nice syntax:
Settings.admin_password = 'supersecret'
Settings.date_format = '%m %d, %Y'
Settings.cocktails = ['Martini', 'Screwdriver', 'White Russian']
Settings.foo = 123
Want a list of all the settings?
Settings.all # returns {'admin_password' => 'super_secret', 'date_format' => '%m %d, %Y'}
Set defaults for certain settings of your app. This will cause the defined settings to return with the Specified value even if they are not in the database. Make a new file in config/initializers/settings.rb with the following:
Settings.defaults[:some_setting] = 'footastic'

A use for this might be to store the current version of the database.
If one were storing database versions for schema changes it would need to reside within the database itself.
I currently analyse the schema and update accordingly but am thinking of moving to versioning. Unless someone has a better idea.
I use vb.net and sql express

Unless there are insert constraints on the table a timestamp for versioning then this sounds like a bad idea.

There was a table set up like this in a project I inherited. It was for configuration data, and the reason that was given was that it made for very simple queries:
SELECT WidgetSize FROM ConfigTable
SELECT FooLength FROM ConfigTable
Okay fine. We converted to a generalized configuration table:
ID Name IntValue StringValue TextValue
This has served our purposes well.

CREATE TABLE VERSION (VERSION_STRING VARCHAR2(20 BYTE))
?

I used a single datum in a SQLite database as a counter in a dynamic web page. That's the simplest way I can think of to make it thread-safe (or process-safe to be precise). But I am not sure whether it's a good idea.

I think the best way to deal with these scenarios is to, rather than using a database at all, use the configuration file (which is usually XML) or make your own configuration file that is read during start up of the application. It only takes a few minutes to write the code to read the file in.
The advantage here is that the there is no chance accidentally adding additional values for the same XML variable, and its great for testing because you don't need to write a lot of code to test the different inputs, just a simple change to the text value and re-run the application.

Related

SQL many value in one var [duplicate]

So, per Mehrdad's answer to a related question, I get it that a "proper" database table column doesn't store a list. Rather, you should create another table that effectively holds the elements of said list and then link to it directly or through a junction table. However, the type of list I want to create will be composed of unique items (unlike the linked question's fruit example). Furthermore, the items in my list are explicitly sorted - which means that if I stored the elements in another table, I'd have to sort them every time I accessed them. Finally, the list is basically atomic in that any time I wish to access the list, I will want to access the entire list rather than just a piece of it - so it seems silly to have to issue a database query to gather together pieces of the list.
AKX's solution (linked above) is to serialize the list and store it in a binary column. But this also seems inconvenient because it means that I have to worry about serialization and deserialization.
Is there any better solution? If there is no better solution, then why? It seems that this problem should come up from time to time.
... just a little more info to let you know where I'm coming from. As soon as I had just begun understanding SQL and databases in general, I was turned on to LINQ to SQL, and so now I'm a little spoiled because I expect to deal with my programming object model without having to think about how the objects are queried or stored in the database.
Thanks All!
John
UPDATE: So in the first flurry of answers I'm getting, I see "you can go the CSV/XML route... but DON'T!". So now I'm looking for explanations of why. Point me to some good references.
Also, to give you a better idea of what I'm up to: In my database I have a Function table that will have a list of (x,y) pairs. (The table will also have other information that is of no consequence for our discussion.) I will never need to see part of the list of (x,y) pairs. Rather, I will take all of them and plot them on the screen. I will allow the user to drag the nodes around to change the values occasionally or add more values to the plot.
No, there is no "better" way to store a sequence of items in a single column. Relational databases are designed specifically to store one value per row/column combination. In order to store more than one value, you must serialize your list into a single value for storage, then deserialize it upon retrieval. There is no other way to do what you're talking about (because what you're talking about is a bad idea that should, in general, never be done).
I understand that you think it's silly to create another table to store that list, but this is exactly what relational databases do. You're fighting an uphill battle and violating one of the most basic principles of relational database design for no good reason. Since you state that you're just learning SQL, I would strongly advise you to avoid this idea and stick with the practices recommended to you by more seasoned SQL developers.
The principle you're violating is called first normal form, which is the first step in database normalization.
At the risk of oversimplifying things, database normalization is the process of defining your database based upon what the data is, so that you can write sensible, consistent queries against it and be able to maintain it easily. Normalization is designed to limit logical inconsistencies and corruption in your data, and there are a lot of levels to it. The Wikipedia article on database normalization is actually pretty good.
Basically, the first rule (or form) of normalization states that your table must represent a relation. This means that:
You must be able to differentiate one row from any other row (in other words, you table must have something that can serve as a primary key. This also means that no row should be duplicated.
Any ordering of the data must be defined by the data, not by the physical ordering of the rows (SQL is based upon the idea of a set, meaning that the only ordering you should rely on is that which you explicitly define in your query)
Every row/column intersection must contain one and only one value
The last point is obviously the salient point here. SQL is designed to store your sets for you, not to provide you with a "bucket" for you to store a set yourself. Yes, it's possible to do. No, the world won't end. You have, however, already crippled yourself in understanding SQL and the best practices that go along with it by immediately jumping into using an ORM. LINQ to SQL is fantastic, just like graphing calculators are. In the same vein, however, they should not be used as a substitute for knowing how the processes they employ actually work.
Your list may be entirely "atomic" now, and that may not change for this project. But you will, however, get into the habit of doing similar things in other projects, and you'll eventually (likely quickly) run into a scenario where you're now fitting your quick-n-easy list-in-a-column approach where it is wholly inappropriate. There is not much additional work in creating the correct table for what you're trying to store, and you won't be derided by other SQL developers when they see your database design. Besides, LINQ to SQL is going to see your relation and give you the proper object-oriented interface to your list automatically. Why would you give up the convenience offered to you by the ORM so that you can perform nonstandard and ill-advised database hackery?
You can just forget SQL all together and go with a "NoSQL" approach. RavenDB, MongoDB and CouchDB jump to mind as possible solutions. With a NoSQL approach, you are not using the relational model..you aren't even constrained to schemas.
What I have seen many people do is this (it may not be the best approach, correct me if I am wrong):
The table which I am using in the example is given below(the table includes nicknames that you have given to your specific girlfriends. Each girlfriend has a unique id):
nicknames(id,seq_no,names)
Suppose, you want to store many nicknames under an id. This is why we have included a seq_no field.
Now, fill these values to your table:
(1,1,'sweetheart'), (1,2,'pumpkin'), (2,1,'cutie'), (2,2,'cherry pie')
If you want to find all the names that you have given to your girl friend id 1 then you can use:
select names from nicknames where id = 1;
Simple answer: If, and only if, you're certain that the list will always be used as a list, then join the list together on your end with a character (such as '\0') that will not be used in the text ever, and store that. Then when you retrieve it, you can split by '\0'. There are of course other ways of going about this stuff, but those are dependent on your specific database vendor.
As an example, you can store JSON in a Postgres database. If your list is text, and you just want the list without further hassle, that's a reasonable compromise.
Others have ventured suggestions of serializing, but I don't really think that serializing is a good idea: Part of the neat thing about databases is that several programs written in different languages can talk to one another. And programs serialized using Java's format would not do all that well if a Lisp program wanted to load it.
If you want a good way to do this sort of thing there are usually array-or-similar types available. Postgres for instance, offers array as a type, and lets you store an array of text, if that's what you want, and there are similar tricks for MySql and MS SQL using JSON, and IBM's DB2 offer an array type as well (in their own helpful documentation). This would not be so common if there wasn't a need for this.
What you do lose by going that road is the notion of the list as a bunch of things in sequence. At least nominally, databases treat fields as single values. But if that's all you want, then you should go for it. It's a value judgement you have to make for yourself.
In addition to what everyone else has said, I would suggest you analyze your approach in longer terms than just now. It is currently the case that items are unique. It is currently the case that resorting the items would require a new list. It is almost required that the list are currently short. Even though I don't have the domain specifics, it is not much of a stretch to think those requirements could change. If you serialize your list, you are baking in an inflexibility that is not necessary in a more-normalized design. Btw, that does not necessarily mean a full Many:Many relationship. You could just have a single child table with a foreign key to the parent and a character column for the item.
If you still want to go down this road of serializing the list, you might consider storing the list in XML. Some databases such as SQL Server even have an XML data type. The only reason I'd suggest XML is that almost by definition, this list needs to be short. If the list is long, then serializing it in general is an awful approach. If you go the CSV route, you need to account for the values containing the delimiter which means you are compelled to use quoted identifiers. Persuming that the lists are short, it probably will not make much difference whether you use CSV or XML.
If you need to query on the list, then store it in a table.
If you always want the list, you could store it as a delimited list in a column. Even in this case, unless you have VERY specific reasons not to, store it in a lookup table.
Many SQL databases allow a table to contain a subtable as a component. The usual method is to allow the domain of one of the columns to be a table. This is in addition to using some convention like CSV to encode the substructure in ways unknown to the DBMS.
When Ed Codd was developing the relational model in 1969-1970, he specifically defined a normal form that would disallow this kind of nesting of tables. Normal form was later called First Normal Form. He then went on to show that for every database, there is a database in first normal form that expresses the same information.
Why bother with this? Well, databases in first normal form permit keyed access to all data. If you provide a table name, a key value into that table, and a column name, the database will contain at most one cell containing one item of data.
If you allow a cell to contain a list or a table or any other collection, now you can't provide keyed access to the sub items, without completely reworking the idea of a key.
Keyed access to all data is fundamental to the relational model. Without this concept, the model isn't relational. As to why the relational model is a good idea, and what might be the limitations of that good idea, you have to look at the 50 years worth of accumulated experience with the relational model.
I'd just store it as CSV, if it's simple values then it should be all you need (XML is very verbose and serializing to/from it would probably be overkill but that would be an option as well).
Here's a good answer for how to pull out CSVs with LINQ.
Only one option doesn't mentioned in the answers. You can de-normalize your DB design. So you need two tables. One table contains proper list, one item per row, another table contains whole list in one column (coma-separated, for example).
Here it is 'traditional' DB design:
List(ListID, ListName)
Item(ItemID,ItemName)
List_Item(ListID, ItemID, SortOrder)
Here it is de-normalized table:
Lists(ListID, ListContent)
The idea here - you maintain Lists table using triggers or application code. Every time you modify List_Item content, appropriate rows in Lists get updated automatically. If you mostly read lists it could work quite fine. Pros - you can read lists in one statement. Cons - updates take more time and efforts.
I was very reluctant to choose the path I finally decide to take because of many answers. While they add more understanding to what is SQL and its principles, I decided to become an outlaw. I was also hesitant to post my findings as for some it's more important to vent frustration to someone breaking the rules rather than understanding that there are very few universal truthes.
I have tested it extensively and, in my specific case, it was way more efficient than both using array type (generously offered by PostgreSQL) or querying another table.
Here is my answer:
I have successfully implemented a list into a single field in PostgreSQL, by making use of the fixed length of each item of the list. Let say each item is a color as an ARGB hex value, it means 8 char. So you can create your array of max 10 items by multiplying by the length of each item:
ALTER product ADD color varchar(80)
In case your list items length differ you can always fill the padding with \0
NB: Obviously this is not necessarily the best approach for hex number since a list of integers would consume less storage but this is just for the purpose of illustrating this idea of array by making use of a fixed length allocated to each item.
The reason why:
1/ Very convenient: retrieve item i at substring i*n, (i +1)*n.
2/ No overhead of cross tables queries.
3/ More efficient and cost-saving on the server side. The list is like a mini blob that the client will have to split.
While I respect people following rules, many explanations are very theoretical and often fail to acknowledge that, in some specific cases, especially when aiming for cost optimal with low-latency solutions, some minor tweaks are more than welcome.
"God forbid that it is violating some holy sacred principle of SQL": Adopting a more open-minded and pragmatic approach before reciting the rules is always the way to go. Else you might end up like a candid fanatic reciting the Three Laws of Robotics before being obliterated by Skynet
I don't pretend that this solution is a breakthrough, nor that it is ideal in term of readability and database flexibility, but it can certainly give you an edge when it comes to latency.
What I do is that if the List required to be stored is small then I would just convert it to a string then split it later when required.
example in python -
for y in b:
if text1 == "":
text1 = y
else:
text1 = text1 + f"~{y}"
then when I required it I just call it from the db and -
out = query.split('~')
print(out)
this will return a list, and a string will be stored in the db. But if you are storing a lot of data in the list then creating a table is the best option.
If you really wanted to store it in a column and have it queryable a lot of databases support XML now. If not querying you can store them as comma separated values and parse them out with a function when you need them separated. I agree with everyone else though if you are looking to use a relational database a big part of normalization is the separating of data like that. I am not saying that all data fits a relational database though. You could always look into other types of databases if a lot of your data doesn't fit the model.
I think in certain cases, you can create a FAKE "list" of items in the database, for example, the merchandise has a few pictures to show its details, you can concatenate all the IDs of pictures split by comma and store the string into the DB, then you just need to parse the string when you need it. I am working on a website now and I am planning to use this way.
you can store it as text that looks like a list and create a function that can return its data as an actual list. example:
database:
_____________________
| word | letters |
| me | '[m, e]' |
| you |'[y, o, u]' | note that the letters column is of type 'TEXT'
| for |'[f, o, r]' |
|___in___|_'[i, n]'___|
And the list compiler function (written in python, but it should be easily translatable to most other programming languages). TEXT represents the text loaded from the sql table. returns list of strings from string containing list. if you want it to return ints instead of strings, make mode equal to 'int'. Likewise with 'string', 'bool', or 'float'.
def string_to_list(string, mode):
items = []
item = ""
itemExpected = True
for char in string[1:]:
if itemExpected and char not in [']', ',', '[']:
item += char
elif char in [',', '[', ']']:
itemExpected = True
items.append(item)
item = ""
newItems = []
if mode == "int":
for i in items:
newItems.append(int(i))
elif mode == "float":
for i in items:
newItems.append(float(i))
elif mode == "boolean":
for i in items:
if i in ["true", "True"]:
newItems.append(True)
elif i in ["false", "False"]:
newItems.append(False)
else:
newItems.append(None)
elif mode == "string":
return items
else:
raise Exception("the 'mode'/second parameter of string_to_list() must be one of: 'int', 'string', 'bool', or 'float'")
return newItems
Also here is a list-to-string function in case you need it.
def list_to_string(lst):
string = "["
for i in lst:
string += str(i) + ","
if string[-1] == ',':
string = string[:-1] + "]"
else:
string += "]"
return string
Imagine your grandmother's box of recipes, all written on index cards. Each of those recipes is a list of ingredients, which are themselves ordered pairs of items and quantities. If you create a recipe database, you wouldn't need to create one table for the recipe names and a second table where each ingredient was a separate record. That sounds like what we're saying here. My apologies if I've misread anything.
From Microsoft's T-SQL Fundamentals:
Atomicity of attributes is subjective in the same way that the
definition of a set is subjective. As an example, should an employee
name in an Employees relation be expressed with one attribute
(fullname), two (firstname and lastname), or three (firstname,
middlename, and lastname)? The answer depends on the application. If
the application needs to manipulate the parts of the employee’s name
separately (such as for search purposes), it makes sense to break them
apart; otherwise, it doesn’t.
So, if you needed to manipulate your list of coordinates via SQL, you would need to split the elements of the list into separate records. But is you just wanted to store a list and retrieve it for use by some other software, then storing the list as a single value makes more sense.

Having all contact information in one table vs. using key-value tables

(NB. The question is not a duplicate for this, since I am dealing with an ORM system)
I have a table in my database to store all Contacts information. Some of the columns for each contact is fixed (e.g. Id, InsertDate and UpdateDate). In my program I would like to give user the option to add or remove properties for each contact.
Now there are of course two alternatives here:
First is to save it all in one table and add and remove entire columns when user needs to;
Create a key-value table to save each property alongside its type and connect the record to user's id.
These alternatives are both doable. But I am wondering which one is better in terms of speed? In the program it will be a very common thing for the user to view the entire Contact list to check for updates. Plus, I am using an ORM framework (Microsoft's Entity Framework) to deal with database queries. So if the user is to add and remove columns from a table all the time, it will be a difficult task to map them to my program. But again, if alternative (1) is a significantly better option than (2), then I can reconsider the key-value option.
I have actually done both of these.
Example #1
Large, wide table with columns of data holding names, phone, address and lots of small integer values of information that tracked details of the clients.
Example #2
Many different tables separating out all of the Character Varying data fields, the small integer values etc.
Example #1 was a lot faster to code for but in terms of performance, it got pretty slow once the table filled with records. 5000 wasn't a problem. When it reached 50,000 there was a noticeable performance degradation.
Example #2 was built later in my coding experience and was built to resolve the issues found in Example #1. While it took more to get the records I was after (LEFT JOIN this and UNION that) it was MUCH faster as you could ultimately pick and choose EXACTLY what the client was after without having to search a massive wide table full of data that was not all being requested.
I would recommend Example #2 to fit your #2 in the question.
And your USER specified columns for their data set could be stored in a table just to their own (depending on how many you have I suppose) which would allow you to draw on the table specific to that USER, which would also give you unlimited ability to remove and add columns to suit that particular setup.
You could then also have another table which kept track of the custom columns in the custom column table, which would give you the ability to "recover" columns later, as in "Do you want to add this to your current column choices or to one of these columns you have deleted in the past".

Schema less SQL database table - practical compromise

This question is an attempt to find a practical solution for this question.
I need a semi-schema less design for my SQL database. However, I can limit the flexibility to shoehorn it into the entire SQL paradigm. Moving to schema less databases might be an option in the future but right now, I' stuck with SQL.
I have a table in a SQL database (let's call it Foo). When an row is added to this, it needs to be able to store an arbitrary number of "meta" fields with this. An example would be the ability to attach arbitrary metadata like tags, collaborators etc. All the fields are optional but the problem is that they're of different types. Some might be numeric, some might be textual etc.
A simple design linking Foo to a table of OptionalValues with fields like name, value_type, value_string, value_int, value_date etc. seems direct although it descends into the whole EAV model which Alex mentions on that last answer and it looks quite wasteful. Also, I imagine queries out of this when it grows will be quite slow. I don't expect to search or sort by anything in this table though. All I need is that when I get a row out of Foo, these extra attributes should be obtainable as well.
Are there any best practices for implementing this kind of a setup in a SQL database or am I simply looking at the whole thing wrongly?
Add a string column "Metafields" to your table "Foo" and store your metadata there as an XML or JSON string.

What is wrong with using SELECT * FROM sometable [duplicate]

I've heard that SELECT * is generally bad practice to use when writing SQL commands because it is more efficient to SELECT columns you specifically need.
If I need to SELECT every column in a table, should I use
SELECT * FROM TABLE
or
SELECT column1, colum2, column3, etc. FROM TABLE
Does the efficiency really matter in this case? I'd think SELECT * would be more optimal internally if you really need all of the data, but I'm saying this with no real understanding of database.
I'm curious to know what the best practice is in this case.
UPDATE: I probably should specify that the only situation where I would really want to do a SELECT * is when I'm selecting data from one table where I know all columns will always need to be retrieved, even when new columns are added.
Given the responses I've seen however, this still seems like a bad idea and SELECT * should never be used for a lot more technical reasons that I ever though about.
One reason that selecting specific columns is better is that it raises the probability that SQL Server can access the data from indexes rather than querying the table data.
Here's a post I wrote about it: The real reason select queries are bad index coverage
It's also less fragile to change, since any code that consumes the data will be getting the same data structure regardless of changes you make to the table schema in the future.
Given your specification that you are selecting all columns, there is little difference at this time. Realize, however, that database schemas do change. If you use SELECT * you are going to get any new columns added to the table, even though in all likelihood, your code is not prepared to use or present that new data. This means that you are exposing your system to unexpected performance and functionality changes.
You may be willing to dismiss this as a minor cost, but realize that columns that you don't need still must be:
Read from database
Sent across the network
Marshalled into your process
(for ADO-type technologies) Saved in a data-table in-memory
Ignored and discarded / garbage-collected
Item #1 has many hidden costs including eliminating some potential covering index, causing data-page loads (and server cache thrashing), incurring row / page / table locks that might be otherwise avoided.
Balance this against the potential savings of specifying the columns versus an * and the only potential savings are:
Programmer doesn't need to revisit the SQL to add columns
The network-transport of the SQL is smaller / faster
SQL Server query parse / validation time
SQL Server query plan cache
For item 1, the reality is that you're going to add / change code to use any new column you might add anyway, so it is a wash.
For item 2, the difference is rarely enough to push you into a different packet-size or number of network packets. If you get to the point where SQL statement transmission time is the predominant issue, you probably need to reduce the rate of statements first.
For item 3, there is NO savings as the expansion of the * has to happen anyway, which means consulting the table(s) schema anyway. Realistically, listing the columns will incur the same cost because they have to be validated against the schema. In other words this is a complete wash.
For item 4, when you specify specific columns, your query plan cache could get larger but only if you are dealing with different sets of columns (which is not what you've specified). In this case, you do want different cache entries because you want different plans as needed.
So, this all comes down, because of the way you specified the question, to the issue resiliency in the face of eventual schema modifications. If you're burning this schema into ROM (it happens), then an * is perfectly acceptable.
However, my general guideline is that you should only select the columns you need, which means that sometimes it will look like you are asking for all of them, but DBAs and schema evolution mean that some new columns might appear that could greatly affect the query.
My advice is that you should ALWAYS SELECT specific columns. Remember that you get good at what you do over and over, so just get in the habit of doing it right.
If you are wondering why a schema might change without code changing, think in terms of audit logging, effective/expiration dates and other similar things that get added by DBAs for systemically for compliance issues. Another source of underhanded changes is denormalizations for performance elsewhere in the system or user-defined fields.
You should only select the columns that you need. Even if you need all columns it's still better to list column names so that the sql server does not have to query system table for columns.
Also, your application might break if someone adds columns to the table. Your program will get columns it didn't expect too and it might not know how to process them.
Apart from this if the table has a binary column then the query will be much more slower and use more network resources.
There are four big reasons that select * is a bad thing:
The most significant practical reason is that it forces the user to magically know the order in which columns will be returned. It's better to be explicit, which also protects you against the table changing, which segues nicely into...
If a column name you're using changes, it's better to catch it early (at the point of the SQL call) rather than when you're trying to use the column that no longer exists (or has had its name changed, etc.)
Listing the column names makes your code far more self-documented, and so probably more readable.
If you're transferring over a network (or even if you aren't), columns you don't need are just waste.
Specifying the column list is usually the best option because your application won't be affected if someone adds/inserts a column to the table.
Specifying column names is definitely faster - for the server. But if
performance is not a big issue (for example, this is a website content database with hundreds, maybe thousands - but not millions - of rows in each table); AND
your job is to create many small, similar applications (e.g. public-facing content-managed websites) using a common framework, rather than creating a complex one-off application; AND
flexibility is important (lots of customization of the db schema for each site);
then you're better off sticking with SELECT *. In our framework, heavy use of SELECT * allows us to introduce a new website managed content field to a table, giving it all of the benefits of the CMS (versioning, workflow/approvals, etc.), while only touching the code at a couple of points, instead of a couple dozen points.
I know the DB gurus are going to hate me for this - go ahead, vote me down - but in my world, developer time is scarce and CPU cycles are abundant, so I adjust accordingly what I conserve and what I waste.
SELECT * is a bad practice even if the query is not sent over a network.
Selecting more data than you need makes the query less efficient - the server has to read and transfer extra data, so it takes time and creates unnecessary load on the system (not only the network, as others mentioned, but also disk, CPU etc.). Additionally, the server is unable to optimize the query as well as it might (for example, use covering index for the query).
After some time your table structure might change, so SELECT * will return a different set of columns. So, your application might get a dataset of unexpected structure and break somewhere downstream. Explicitly stating the columns guarantees that you either get a dataset of known structure, or get a clear error on the database level (like 'column not found').
Of course, all this doesn't matter much for a small and simple system.
Lots of good reasons answered here so far, here's another one that hasn't been mentioned.
Explicitly naming the columns will help you with maintenance down the road. At some point you're going to be making changes or troubleshooting, and find yourself asking "where the heck is that column used".
If you've got the names listed explicitly, then finding every reference to that column -- through all your stored procedures, views, etc -- is simple. Just dump a CREATE script for your DB schema, and text search through it.
Performance wise, SELECT with specific columns can be faster (no need to read in all the data). If your query really does use ALL the columns, SELECT with explicit parameters is still preferred. Any speed difference will be basically unnoticeable and near constant-time. One day your schema will change, and this is good insurance to prevent problems due to this.
definitely defining the columns, because SQL Server will not have to do a lookup on the columns to pull them. If you define the columns, then SQL can skip that step.
It's always better to specify the columns you need, if you think about it one time, SQL doesn't have to think "wtf is *" every time you query. On top of that, someone later may add columns to the table that you actually do not need in your query and you'll be better off in that case by specifying all of your columns.
The problem with "select *" is the possibility of bringing data you don't really need. During the actual database query, the selected columns don't really add to the computation. What's really "heavy" is the data transport back to your client, and any column that you don't really need is just wasting network bandwidth and adding to the time you're waiting for you query to return.
Even if you do use all the columns brought from a "select *...", that's just for now. If in the future you change the table/view layout and add more columns, you'll start bring those in your selects even if you don't need them.
Another point in which a "select *" statement is bad is on view creation. If you create a view using "select *" and later add columns to your table, the view definition and the data returned won't match, and you'll need to recompile your views in order for them to work again.
I know that writing a "select *" is tempting, 'cause I really don't like to manually specify all the fields on my queries, but when your system start to evolve, you'll see that it's worth to spend this extra time/effort in specifying the fields rather than spending much more time and effort removing bugs on your views or optimizing your app.
While explicitly listing columns is good for performance, don't get crazy.
So if you use all the data, try SELECT * for simplicity (imagine having many columns and doing a JOIN... query may get awful). Then - measure. Compare with query with column names listed explicitly.
Don't speculate about performance, measure it!
Explicit listing helps most when you have some column containing big data (like body of a post or article), and don't need it in given query. Then by not returning it in your answer DB server can save time, bandwidth, and disk throughput. Your query result will also be smaller, which is good for any query cache.
You should really be selecting only the fields you need, and only the required number, i.e.
SELECT Field1, Field2 FROM SomeTable WHERE --(constraints)
Outside of the database, dynamic queries run the risk of injection attacks and malformed data. Typically you get round this using stored procedures or parameterised queries. Also (although not really that much of a problem) the server has to generate an execution plan each time a dynamic query is executed.
It is NOT faster to use explicit field names versus *, if and only if, you need to get the data for all fields.
Your client software shouldn't depend on the order of the fields returned, so that's a nonsense too.
And it's possible (though unlikely) that you need to get all fields using * because you don't yet know what fields exist (think very dynamic database structure).
Another disadvantage of using explicit field names is that if there are many of them and they're long then it makes reading the code and/or the query log more difficult.
So the rule should be: if you need all the fields, use *, if you need only a subset, name them explicitly.
The result is too huge. It is slow to generate and send the result from the SQL engine to the client.
The client side, being a generic programming environment, is not and should not be designed to filter and process the results (e.g. the WHERE clause, ORDER clause), as the number of rows can be huge (e.g. tens of millions of rows).
Naming each column you expect to get in your application also ensures your application won't break if someone alters the table, as long as your columns are still present (in any order).
Performance wise I have seen comments that both are equal. but usability aspect there are some +'s and -'s
When you use a (select *) in a query and if some one alter the table and add new fields which do not need for the previous query it is an unnecessary overhead. And what if the newly added field is a blob or an image field??? your query response time is going to be really slow then.
In other hand if you use a (select col1,col2,..) and if the table get altered and added new fields and if those fields are needed in the result set, you always need to edit your select query after table alteration.
But I suggest always to use select col1,col2,... in your queries and alter the query if the table get altered later...
This is an old post, but still valid. For reference, I have a very complicated query consisting of:
12 tables
6 Left joins
9 inner joins
108 total columns on all 12 tables
I only need 54 columns
A 4 column Order By clause
When I execute the query using Select *, it takes an average of 2869ms.
When I execute the query using Select , it takes an average of 1513ms.
Total rows returned is 13,949.
There is no doubt selecting column names means faster performance over Select *
Select is equally efficient (in terms of velocity) if you use * or columns.
The difference is about memory, not velocity. When you select several columns SQL Server must allocate memory space to serve you the query, including all data for all the columns that you've requested, even if you're only using one of them.
What does matter in terms of performance is the excecution plan which in turn depends heavily on your WHERE clause and the number of JOIN, OUTER JOIN, etc ...
For your question just use SELECT *. If you need all the columns there's no performance difference.
It depends on the version of your DB server, but modern versions of SQL can cache the plan either way. I'd say go with whatever is most maintainable with your data access code.
One reason it's better practice to spell out exactly which columns you want is because of possible future changes in the table structure.
If you are reading in data manually using an index based approach to populate a data structure with the results of your query, then in the future when you add/remove a column you will have headaches trying to figure out what went wrong.
As to what is faster, I'll defer to others for their expertise.
As with most problems, it depends on what you want to achieve. If you want to create a db grid that will allow all columns in any table, then "Select *" is the answer. However, if you will only need certain columns and adding or deleting columns from the query is done infrequently, then specify them individually.
It also depends on the amount of data you want to transfer from the server. If one of the columns is a defined as memo, graphic, blob, etc. and you don't need that column, you'd better not use "Select *" or you'll get a whole bunch of data you don't want and your performance could suffer.
To add on to what everyone else has said, if all of your columns that you are selecting are included in an index, your result set will be pulled from the index instead of looking up additional data from SQL.
SELECT * is necessary if one wants to obtain metadata such as the number of columns.
Gonna get slammed for this, but I do a select * because almost all my data is retrived from SQL Server Views that precombine needed values from multiple tables into a single easy to access View.
I do then want all the columns from the view which won't change when new fields are added to underlying tables. This has the added benefit of allowing me to change where data comes from. FieldA in the View may at one time be calculated and then I may change it to be static. Either way the View supplies FieldA to me.
The beauty of this is that it allows my data layer to get datasets. It then passes them to my BL which can then create objects from them. My main app only knows and interacts with the objects. I even allow my objects to self-create when passed a datarow.
Of course, I'm the only developer, so that helps too :)
What everyone above said, plus:
If you're striving for readable maintainable code, doing something like:
SELECT foo, bar FROM widgets;
is instantly readable and shows intent. If you make that call you know what you're getting back. If widgets only has foo and bar columns, then selecting * means you still have to think about what you're getting back, confirm the order is mapped correctly, etc. However, if widgets has more columns but you're only interested in foo and bar, then your code gets messy when you query for a wildcard and then only use some of what's returned.
And remember if you have an inner join by definition you do not need all the columns as the data in the join columns is repeated.
It's not like listing columns in SQl server is hard or even time-consuming. You just drag them over from the object browser (you can get all in one go by dragging from the word columns). To put a permanent performance hit on your system (becasue this can reduce the use of indexes and becasue sending unneeded data over the network is costly) and make it more likely that you will have unexpected problems as the database changes (sometimes columns get added that you do not want the user to see for instance) just to save less than a minute of development time is short-sighted and unprofessional.
Absolutely define the columns you want to SELECT every time. There is no reason not to and the performance improvement is well worth it.
They should never have given the option to "SELECT *"
If you need every column then just use SELECT * but remember that the order could potentially change so when you are consuming the results access them by name and not by index.
I would ignore comments about how * needs to go get the list - chances are parsing and validating named columns is equal to the processing time if not more. Don't prematurely optimize ;-)

Dynamic Database Schema [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 2 years ago.
Improve this question
What is a recommended architecture for providing storage for a dynamic logical database schema?
To clarify: Where a system is required to provide storage for a model whose schema may be extended or altered by its users once in production, what are some good technologies, database models or storage engines that will allow this?
A few possibilities to illustrate:
Creating/altering database objects via dynamically generated DML
Creating tables with large numbers of sparse physical columns and using only those required for the 'overlaid' logical schema
Creating a 'long, narrow' table that stores dynamic column values as rows that then need to be pivoted to create a 'short, wide' rowset containing all the values for a specific entity
Using a BigTable/SimpleDB PropertyBag type system
Any answers based on real world experience would be greatly appreciated
What you are proposing is not new. Plenty of people have tried it... most have found that they chase "infinite" flexibility and instead end up with much, much less than that. It's the "roach motel" of database designs -- data goes in, but it's almost impossible to get it out. Try and conceptualize writing the code for ANY sort of constraint and you'll see what I mean.
The end result typically is a system that is MUCH more difficult to debug, maintain, and full of data consistency problems. This is not always the case, but more often than not, that is how it ends up. Mostly because the programmer(s) don't see this train wreck coming and fail to defensively code against it. Also, often ends up the case that the "infinite" flexibility really isn't that necessary; it's a very bad "smell" when the dev team gets a spec that says "Gosh I have no clue what sort of data they are going to put here, so let 'em put WHATEVER"... and the end users are just fine having pre-defined attribute types that they can use (code up a generic phone #, and let them create any # of them -- this is trivial in a nicely normalized system and maintains flexibility and integrity!)
If you have a very good development team and are intimately aware of the problems you'll have to overcome with this design, you can successfully code up a well designed, not terribly buggy system. Most of the time.
Why start out with the odds stacked so much against you, though?
Don't believe me? Google "One True Lookup Table" or "single table design". Some good results:
http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:10678084117056
http://thedailywtf.com/Comments/Tom_Kyte_on_The_Ultimate_Extensibility.aspx?pg=3
http://www.dbazine.com/ofinterest/oi-articles/celko22
http://thedailywtf.com/Comments/The_Inner-Platform_Effect.aspx?pg=2
A strongly typed xml field in MSSQL has worked for us.
Like some others have said, don't do this unless you have no other choice. One case where this is required is if you are selling an off-the-shelf product that must allow users to record custom data. My company's product falls into this category.
If you do need to allow your customers to do this, here are a few tips:
- Create a robust administrative tool to perform the schema changes, and do not allow these changes to be made any other way.
- Make it an administrative feature; don't allow normal users to access it.
- Log every detail about every schema change. This will help you debug problems, and it will also give you CYA data if a customer does something stupid.
If you can do those things successfully (especially the first one), then any of the architectures you mentioned will work. My preference is to dynamically change the database objects, because that allows you to take advantage of your DBMS's query features when you access the data stored in the custom fields. The other three options require you load large chunks of data and then do most of your data processing in code.
I have a similar requirement and decided to use the schema-less MongoDB.
MongoDB (from "humongous") is an open source, scalable, high-performance, schema-free, document-oriented database written in the C++ programming language. (Wikipedia)
Highlights:
has rich query functionality (maybe the closest to SQL DBs)
production ready (foursquare, sourceforge use it)
Lowdarks (stuff you need to understand, so you can use mongo correctly):
no transactions (actually it has transactions but only on atomic operations)
this stuff here: http://ethangunderson.com/blog/two-reasons-to-not-use-mongodb/
durability .. mostly ACID related stuff
I did it ones in a real project:
The database consisted of one table with one field which was an array of 50. It had a 'word' index set on it. All the data was typeless so the 'word index' worked as expected. Numeric fields were represented as characters and the actual sorting had been done at client side. (It still possible to have several array fields for each data type if needed).
The logical data schema for logical tables was held within the same database with different table row 'type' (the first array element). It also supported simple versioning in copy-on-write style using same 'type' field.
Advantages:
You can rearrange and add/delete your columns dynamically, no need for dump/reload of database. Any new column data may be set to initial value (virtually) in zero time.
Fragmentation is minimal, since all records and tables are same size, sometimes it gives better performance.
All table schema is virtual. Any logical schema stucture is possible (even recursive, or object-oriented).
It is good for "write-once, read-mostly, no-delete/mark-as-deleted" data (most Web apps actually are like that).
Disadvantages:
Indexing only by full words, no abbreviation,
Complex queries are possible, but with slight performance degradation.
Depends on whether your preferred database system supports arrays and word indexes (it was inplemented in PROGRESS RDBMS).
Relational model is only in programmer's mind (i.e. only at run-time).
And now I'm thinking the next step could be - to implement such a database on the file system level. That might be relatively easy.
The whole point of having a relational DB is keeping your data safe and consistent. The moment you allow users to alter the schema, there goes your data integrity...
If your need is to store heterogeneous data, for example like a CMS scenario, I would suggest storing XML validated by an XSD in a row. Of course you lose performance and easy search capabilities, but it's a good trade off IMHO.
Since it's 2016, forget XML! Use JSON to store the non-relational data bag, with an appropriately typed column as backend. You shouldn't normally need to query by value inside the bag, which will be slow even though many contemporary SQL databases understand JSON natively.
Sounds to me like what you really want is some sort of "meta-schema", a database schema which is capable of describing a flexible schema for storing the actual data. Dynamic schema changes are touchy and not something you want to mess with, especially not if users are allowed to make the change.
You're not going to find a database which is more suited to this task than any other, so your best bet is just to select one based on other criteria. For example, what platform are you using to host the DB? What language is the app written in? etc
To clarify what I mean by "meta-schema":
CREATE TABLE data (
id INTEGER NOT NULL AUTO_INCREMENT,
key VARCHAR(255),
data TEXT,
PRIMARY KEY (id)
);
This is a very simple example, you would likely have something more specific to your needs (and hopefully a little easier to work with), but it does serve to illustrate my point. You should consider the database schema itself to be immutable at the application level; any structural changes should be reflected in the data (that-is, the instantiation of that schema).
I know that models indicated in the question are used in production systems all over. A rather large one is in use at a large university/teaching institution that I work for. They specifically use the long narrow table approach to map data gathered by many varied data acquisition systems.
Also, Google recently released their internal data sharing protocol, protocol buffer, as open source via their code site. A database system modeled on this approach would be quite interesting.
Check the following:
Entity-attribute-value model
Google Protocol Buffer
Create 2 databases
DB1 contains static tables, and represents the "real" state of the data.
DB2 is free for users to do with as they wish - they (or you) will have to write code to populate their odd-shaped tables from DB1.
EAV approach i believe is the best approach, but comes with a heavy cost
I know it's an old topic, but I guess that it never loses actuality.
I'm developing something like that right now.
Here is my approach.
I use a server setting with a MySQL, Apache, PHP, and Zend Framework 2 as application framework, but it should work as well with any other settings.
Here is a simple implementation guide, you can evolve it yourself further from this.
You would need to implement your own query language interpreter, because the effective SQL would be too complicated.
Example:
select id, password from user where email_address = "xyz#xyz.com"
The physical database layout:
Table 'specs': (should be cached in your data access layer)
id: int
parent_id: int
name: varchar(255)
Table 'items':
id: int
parent_id: int
spec_id: int
data: varchar(20000)
Contents of table 'specs':
1, 0, 'user'
2, 1, 'email_address'
3, 1, 'password'
Contents of table 'items':
1, 0, 1, ''
2, 1, 2, 'xyz#xyz.com'
3, 1, 3, 'my password'
The translation of the example in our own query language:
select id, password from user where email_address = "xyz#xyz.com"
to standard SQL would look like this:
select
parent_id, -- user id
data -- password
from
items
where
spec_id = 3 -- make sure this is a 'password' item
and
parent_id in
( -- get the 'user' item to which this 'password' item belongs
select
id
from
items
where
spec_id = 1 -- make sure this is a 'user' item
and
id in
( -- fetch all item id's with the desired 'email_address' child item
select
parent_id -- id of the parent item of the 'email_address' item
from
items
where
spec_id = 2 -- make sure this is a 'email_address' item
and
data = "xyz#xyz.com" -- with the desired data value
)
)
You will need to have the specs table cached in an associative array or hashtable or something similar to get the spec_id's from the spec names. Otherwise you would need to insert some more SQL overhead to get the spec_id's from the names, like in this snippet:
Bad example, don't use this, avoid this, cache the specs table instead!
select
parent_id,
data
from
items
where
spec_id = (select id from specs where name = "password")
and
parent_id in (
select
id
from
items
where
spec_id = (select id from specs where name = "user")
and
id in (
select
parent_id
from
items
where
spec_id = (select id from specs where name = "email_address")
and
data = "xyz#xyz.com"
)
)
I hope you get the idea and can determine for yourself whether that approach is feasible for you.
Enjoy! :-)
Over at the c2.com wiki, the idea of "Dynamic Relational" was explored. You DON'T need a DBA: columns and tables are Create-On-Write, unless you start adding constraints to make it act more like a traditional RDBMS: as a project matures, you can incrementally "lock it down".
Conceptually you can think of each row as an XML statement. For example, an employee record could be represented as:
<employee lastname="Li" firstname="Joe" salary="120000" id="318"/>
This does not imply it has to be implemented as XML, it's just a handy conceptualization. If you ask for a non-existing column, such as "SELECT madeUpColumn ...", it's treated as blank or null (unless added constraints forbid such). And it's possible to use SQL, although one has to be careful about comparisons because of the implied type model. But other than type handling, users of a Dynamic Relational system would feel right at home because they can leverage most of their existing RDBMS knowledge. Now, if somebody would just build it...
In the past I've chosen option C -- Creating a 'long, narrow' table that stores dynamic column values as rows that then need to be pivoted to create a 'short, wide' rowset containing all the values for a specific entity.. However, I was using an ORM, and that REALLY made things painful. I can't think of how you'd do it in, say, LinqToSql. I guess I'd have to create a Hashtable to reference the fields.
#Skliwz: I'm guessing he's more interested in allowing users to create user-defined fields.
ElasticSearch. You should consider it especially if you're dealing with datasets that you can partition by date, you can use JSON for your data, and are not fixed on using SQL for retrieving the data.
ES infers your schema for any new JSON fields you send, either automatically, with hints, or manually which you can define/change by one HTTP command ("mappings").
Although it does not support SQL, it has some great lookup capabilities and even aggregations.
I know this is a super old post, and much has changed in the last 11 years, but thought I would added this as it might be helpful to future readers. One of the reason's why my co-founders and I created HarperDB is to natively accomplish Dynamic schema in a single, unduplicated data set while providing full index capability. You can read more about it here:
https://harperdb.io/blog/dynamic-schema-the-harperdb-way/
sql already provides a way to change your schema: the ALTER command.
simply have a table that lists the fields that users are not allowed to change, and write a nice interface for ALTER.