I want to store entries (a set of key=>value pairs) in a database, but the keys vary from entry to entry.
I thought of storing with two tables, (1) of the keys for each entry and (2) of the values of specific keys for each entry, where entries share a common id field in both tables, but I am not sure how to pull entries as a key=>value pairs in sql with this sort of configuration.
Is there a better method? If this is not possible in sqlite, is it possible in mysql? Thanks!
It sounds like you are looking for the Entity-Attribute-Value model.
Alternatives are to create different tables for different types of entities, or to have a table with a column for every possible key and set the value to NULL for entities that don't have that key.
You might want to take a look at Bill Karwin's presentation SQL Antipatterns where he covers some of the pros and cons of the EAV model and suggests possible alternatives. The relevant part starts from slide 16.
#Mark Byers is right, this is the EAV model. You should read Bad CaRMa before you go down that dark path. It's a story of how this database design practically destroyed a company.
In a relational database, every row in a relation must include the same columns. That's part of the definition for a relation. This is true in SQLite, MySQL, or any other relational database.
Also see my presentation Practical Object-Oriented Models in SQL or my book SQL Antipatterns, in which I show the problems caused by the EAV model.
If you need variable columns per entity, you need a non-relational database. There are document-oriented databases like CouchDB or MongoDB that are catching on in popularity.
Or try Berkeley DB if you want an embeddable single-user solution like SQLite.
Related
For the following hypothetical use case, I'm trying to understand why it may be desirable to have a pivot table instead of an alternative solution (outlined below).
Hypothetical Use Case
Let’s say that a movie has many actors and that an actor can belong to more than one movie.
"Standard" Pivot Table Solution
As outlined in this lesson (using Elixir's Ecto library), the "standard" solution recommends using a movies_actors pivot table, and both the movies and actors tables reference this movies_actors table.
Alternative Solution
Instead, could we achieve the same result by having the concept of a list of ids?
actor belongs to one or more movies by having the actors table include a movie_ids field (which is a list)
movie has many actors by having the movies table include a actor_ids field (which is a list)
Question
Is one solution preferable? Why?
The table you are referring to is more typically called a "junction" table or "association" table. It is the standard way to implement many-to-many relationships.
Junction tables have some key advantages. Notably, it guarantees data integrity when foreign keys are properly defined.
But that is not your question. Are other representations appropriate under some circumstances? I would say that Postgres provides powerful functionality through arrays and JSON which make them feasible for many-to-many relationships. In particular, Postgres supports indexes on arrays and JSON, overcoming one of the big hurdles of such a relationship.
When would such a list be appropriate? I don't think it is appropriate for Actors. That is an entity in its own right and there is lots of additional information you want about an actor.
But it might be appropriate for something like user-generated tags, particularly tags where you don't feel a need to maintain a master list (and don't care about misspellings). It might be appropriate for alternative names for something (assuming you don't want disjoint names across rows).
I think you should not use the "alternative solution" of storing arrays of referenced ids to model an many-to-many relationship. It seems simpler at first glance, but it will hurt you later.
You should write a simple test case for both scenarios and create test tables with a realistic number of entries and relationships (it doesn't matter if the data are artificial and repeating). Then try to write a join between the two tables. You will find that with the "alternative solution", the query looks much more complicated (at best, it will involve strange operators like #>) and doesn't perform as well (you can only get a nested loop join).
There is a good reason to keep data in the first normal form – it is better adapted to the way relational databases process data.
Of course this "normal form" stuff has to be taken with a grain of salt: it is fine to use an array to store data, as long as you don't use individual array entries in your query processing. But by joining over array elements you certainly step over that line.
So, I've read a lot about how stashing multiple values into one column is a bad idea and violates the first rule of data normalisation (which, surprisingly, is not "Do Not Talk About Data Normalisation") so I need some help.
At the moment I'm designing an ASP .NET webpage for the place I work for. I want to display data on a web page depending on what Active Directory groups the person belongs to. The first way of doing this that comes to mind is to have a table with, essentially, a column containing the AD group and the second column containing what list of computers belong to that list.
I've learnt that this is showing great disregard for relational databases, so what is a better way to do it? I want to control this access by SQL tables, so I can add/remove from these tables and change end users access accordingly.
Thanks for the help! :)
EDIT: To describe exactly what I want to do is this:
We have a certain group of computers that need to be checked up on, however these computers are in physically difficult to reach locations. The organisation I belong to has remote control enabled for these computers, however they're not in the business of giving out the remote control password (understandable).
The added layer of complexity is that, depending on who you are, our clients should only be able to see a certain group of computers (that is, the group of computers that their area owns). So, if Group A has Thomas in it, and Group B has Jones in it, if you belong to either group then you would just see one entry. However, if you belong to both groups you should see both Thomas and Jones computers in it.
The reason why I think that storing this data in a SQL cell is the way to go is because, to store them in tables would require (in my mind) a new table for each new "group" of computers. I don't want to crank out SQL tables for every new group, I'd much rather just have an added row in a SQL table somewhere.
Does this make any sense?
You basically have three options in SQL Server:
Storing the values in a single column.
Storing the values in a junction table.
Storing the values as XML (or as some other structured data format).
(Other databases have other options, such as arrays, nested tables, and JSON.)
In almost all cases, using a junction table is the correct approach. Why? Here are some reasons:
SQL Server has (relatively) lousy string manipulation, so doing something as simple as ensuring a unique list is really, really hard.
A junction table allows you to store lots of other information (When was a machine added? What is the full description of the machine? etc. etc.).
Most queries that you want are pretty easy with a junction table (with the one exception of getting a comma-delimited list, alas -- which is just counterintuitive rather than "hard").
All the types are stored natively.
A junction table allows you to enforce constraints (both check and foreign key) on the elements of the list.
Although a delimited list is almost never the right solution, it is possible to think of cases where it might be useful:
The list doesn't change and presentation of the list is very important.
Space usage is an issue (alas, denormalization often results in fewer pages).
Queries do not really access elements of the list, just the entire thing.
XML is also a reasonable choice under some circumstances. In the most recent versions of SQL Server, this can be made pretty efficient. However, it incurs the overhead of reading and parsing XML -- and things like duplicate elimination are still not obvious.
So, you do have options. In almost all cases, the junction table is the right approach.
There is an "it depends" that you should consider. If the data is never going to be queried (or queried very rarely) storing it as XML or JSON would be perfectly acceptable. Many DBAs would freak out but it is much faster to get the blob of data that you are going to send to the client than to recompose and decompose a set of columns from a secondary table. (There is a reason document and object databases are becoming so popular.)
... though I would ask why are you replicating active directory to your database and how are you planning on keeping these in sync.
I not really a bad idea to store multiple values in one column, but will depend the search you want.
If you just only want to know the persons that is part of a group then you can store persons in one column with a group id as key. For update you just update the entire list in a group.
But if you want to search a specified person that belongs to group, then its not recommended that you store this multiple persons in one column. In this case its better to store a itermedium table that store person id, and group id.
Sounds like you want a table that maps users to group IDs and a second table that maps group IDs to which computers are in that group. I'm not sure, your language describing the problem was a bit confusing to me.
a list has some columns like: name, family name, phone number etc.
and rows like name=john familyName= lee number=12321321
name=... familyname=... number=...
an sql database works same way. every row in a sql database is a record. so you jusr add records of your list into your database using insert query.
complete explanation in here:
http://www.w3schools.com/sql/sql_insert.asp
This sounds like a typical many-to-many problem. You have many groups and many computers and they are related to eachother. In this situation, it is often recommended to use a mapping table, a.k.a. "junction table" or "cross-reference" table. This table consist solely of the two foreign keys in your other tables.
If your tables look like this:
Computer
- computerId
- otherComputerColumns
Group
- groupId
- othergroupColumns
Then your mapping table would look like this:
GroupComputer
- groupId
- computerId
And you would insert a single record for every relationship between a group and computer. This is in compliance with the rules for third normal form in regards to database normalization.
You can have a table with the group and group id, another table with the computer and computer id and a third table with the relation of group id and computer id.
I've seen this design paradigm a couple of places, often when storing somewhat unpredictable data (custom user preferences, that sort of thing), where the table has only 4 columns:
row_id - unique
item_id - indexed, userid or whatever "owns" the preference
name - name of the field
value - field value
So it's basically unstructured data stored in SQL. Is there a term for this style of table? It might be the right way to solve a problem I'm having, but I don't want to use it without more research, and it's hard to research without a name
This is called an entity-attribute-value model (EAV). Wikipedia (of course) is a good place to start.
There are some limitations when using EAV in a relational database. In particular, the types of the values tend to be strings, regardless of the natural type. In addition, foreign key relationships can be difficult to express in some databases.
One-to-one relationship could usually be stored in the same table. Are there reasons not to store them in the same table?
Number and type of columns. There is a limit on the size of the columns in a table. See here. There is a maximum of 8,060 bytes per row.
Very large tables can also affect performance and can be difficult to optimize and index well.
This is apart from keeping data the is conceptually different, apart from each other. For example, a country and currency have a 1 to 1 relationship (illustrative example, I know this is not always the case). I would still not keep them together.
You'll find some information about when it's useful to create one-to-one relations under http://onlamp.com/pub/a/onlamp/2001/03/20/aboutSQL.html
The most important thing is following:
The key indicator of a possible need
for a one-to-one relationship is a
table that contains fields that are
only used for a certain subset of the
records in that table.
I've done this to prevent locking/blocking, put the read heavy columns in one table the update heavy columns in another, worked like a charm. A lot of big fat update transactions were slowing down a lot of reads.
One - to zero-or-one relationships are common and linked from the optional to the mandatory - the example given in http://onlamp.com/pub/a/onlamp/2001/03/20/aboutSQL.html is of this kind, not one-to-one. Type/subtype relations can be implemented like this.
one-to-one relations occur when each represents a clear, meaningful entity, which in a different context may be in some different relationship and where a minor change to the requirements may change the cardinality of the relation. It is arbitrary which links to which so its best to choose one to be optional and convert to one to zero-or-one.
I am designing a new laboratory database with MANY types of my main entities.
The table for each entity will hold fields common to ALL types of that entity (entity_id, created_on, created_by, etc). I will then use concrete inheritance (separate table for each unique set of attributes) to store all remaining fields.
I believe that this is the best design for the standard types of data which come through the laboratory daily. However, we often have a special samples which often are accompanied by specific values the originator wants stored.
Question: How should I model special (non-standard) types of entities?
Option 1: Use entity-value for special fields
One table (entity_id, attribute_name, numerical_value) would hold all data for any special entity.
+ Fewer tables.
- Cannot enforce requiring a particular attribute.
- Must convert (pivot) rows to columns which is inefficient.
Option 2: Strict concrete inheritance.
Create separate table for each separate special case.
+ Follows in accordance with all other rules
- Overhead of many tables with only a few rows.
Option 3: Concrete inheritance with special tables under a different user.
Put all special tables under a different user.
+ Keeps all special and standard tables separate.
+ Easier to search for common standard table in a list without searching through all special tables.
- Overhead of many tables with only a few rows.
Actually the design you described (common table plus subtype-specific tables) is called Class Table Inheritance.
Concrete Table Inheritance would have all the common attributes duplicated in the subtype tables, and you'd have no supertype table as you do now.
I'm strongly against EAV. I consider it an SQL antipattern. It may seem like an elegant solution because it requires fewer tables, but you're setting yourself up for a lot of headache later. You identified a couple of the disadvantages, but there are many others. IMHO, EAV is used appropriately only if you absolutely must not create a new table when you introduce a new subtype, or if you have an unbounded number of subtypes (e.g. users can define new attributes ad hoc).
You have many subtypes, but still a finite number of them, so if I were doing this project I'd stick with Class Table Inheritance. You may have few rows of each subtype, but at least you have some assurance that all rows in each subtype have the same columns, you can use NOT NULL if you need to, you can use SQL data types, you can use referential integrity constraints, etc. From a relational perspective, it's a better design than EAV.
One more option that you didn't mention is called Serialized LOB. That is, add a BLOB column for a semi-structured collection of custom attributes. Store XML, YAML, JSON, or your own DSL in that column. You won't be able to parse individual attributes out of that BLOB easily with SQL, you'll have to fetch the whole BLOB back into your application and extract individual attributes in code. So in some ways it's less convenient. But if that satisfies your usage of the data, then there's nothing wrong with that.
I think it depends mostly on how you want to use this data.
First of all, I don't really see the benefit of option 3 over option 2. I think separating the special tables in another schema will make your application harder to maintain, especially if later on commonalities are found between 'special values'.
As another option I would say:
- Store the special values in an XML fragment (or blob). Most databases have ability to query on XML structures these days, so without the need for many extra tables, you would keep your flexibility for a small performance hit.
If you put all the special values in one table, you get a very sparse table. Most normal DBMSes cannot handle this very well, but there are some implementations that specialize in this. You could benefit from that.
Do you often need to query the key-value pairs? if you basically access that table through it's entry_id, I think having a key-value table is not a bad design. An extra index on the kay column might even help you when you do need to query for special values. If you build an application layer on top of your database, the key-value table will map on a Map or Hash structure, which can also easily be used.
It also depends on the different types of values you want to store. If there are many different types, that need to be easily accessed (instead of being serialized/deserialized to XML/Character-String) you might want to store the type in a separate column, but that will usually lead to a very complicated design.
Hope this helps (a little bit).
-Maarten
http://en.wikipedia.org/wiki/Entity-Attribute-Value_model
Suggest you read about the problems with entity value tables before deciding to use them.
Oracle can deal with sparsely filled tables quite well. I think you can use a similar approach as company salesforce uses. They use tables with a lot of columns, they create columns when needed. You can index those columns much better than an eav model.
So it is flexible but it performs better than an eav model.
Read: Ask Tom 1, Ask Tom 2, High Scalabilty and SalesForce.
The "Option 1" patterns is also called the "Universal Relation" At first look it seems like a short cut to not doing potentially difficult data modeling. It trades effortless data modeling for not being able to do simple select, update, delete without dramatically more effort than it would take on more usual looking data model with multiple tables.