I have been wondering whether it's possible to do multidimensional tables in PostgreSQL. Here's an example table from my project:
id | created_by    | content                    | comments |
---+---------------+----------------------------+----------+
 1 | Anonymous     | does this thing work?      |          |
 2 | James         | this is the body           |          |
 3 | Chan          | this must work this time~! |          |
 4 | Freak         | just to add something new  |          |
 5 | Anonymous     | yahoo!                     |          |
What do I mean by a multidimensional table? It would look something like this, if such a thing exists.
id | created_by    | content                    | comments                    |
---+---------------+----------------------------+-----------------------------+
 1 | Anonymous     | does this thing work?      | id | created_by | comments  |
 2 | James         | this is the body           | id | created_by | comments  |
 3 | Chan          | this must work this time~! |                             |
 4 | Freak         | just to add something new  |                             |
 5 | Anonymous     | yahoo!                     |                             |
This is just an example. But the key concept is that in every comment, there's another set of columns, making comments sort of like a table by itself.
So yeah, does this exist in Postgres or is there any better way to implement this feature? :)
I would like to convince you, if possible, not to encode your data this way (independent of whether it can even be done, it's a terrible idea).
Let's suppose you have a really hot post that goes viral. That means all of your users are viewing it, and many are trying to comment on it. With all of the nested discussion embedded in a single row, every update must apply to that one row, which in turn means that every update on that discussion competes with every other to modify that one attribute. As you might imagine, this write contention will slow your database way down.
A second reason is that it violates first normal form: the comments attribute in the table you're showing contains more than one value. The motivation for this widely applied rule is that it makes a larger number of queries possible. In your design, it would be very difficult to delete from COMMENTS where USER = 'spammy-user', or even select * from COMMENTS where text like '%Trending Topic%'. In general, if you might ever want to look at part of a value in a column, rather than the whole thing, you're probably looking at an opportunity for normalization.
The rule I try to use is "each 'kind of thing' gets its own table". As comments are a 'kind of thing', we'll split them out:
create table COMMENTS (
    COMMENT_ID serial primary key,
    POST_ID integer not null references POSTS(ID),
    PARENT_COMMENT_ID integer references COMMENTS(COMMENT_ID),
    CREATED_BY ...,
    CONTENT ...
);
with the convention that comments having a null parent_comment_id are the roots of threaded discussions.
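A threaded discussion can then be read back with a single recursive query. Here is a minimal sketch, assuming PostgreSQL, text columns for CREATED_BY and CONTENT, and a hypothetical post id of 42:
-- walk one post's discussion tree, starting from the root comments
with recursive thread as (
    select COMMENT_ID, PARENT_COMMENT_ID, CREATED_BY, CONTENT, 1 as depth
    from COMMENTS
    where POST_ID = 42                 -- hypothetical post id
      and PARENT_COMMENT_ID is null    -- roots of the discussion
    union all
    select c.COMMENT_ID, c.PARENT_COMMENT_ID, c.CREATED_BY, c.CONTENT, t.depth + 1
    from COMMENTS c
    join thread t on c.PARENT_COMMENT_ID = t.COMMENT_ID
)
select * from thread;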
Related
I have a table that has a user_id and a new record for each return reason for that user, as shown here:
| user_id | return_reason |
|--------- |-------------- |
| 1 | broken |
| 2 | changed mind |
| 2 | overpriced |
| 3 | changed mind |
| 4 | changed mind |
What I would like to do is generate a foreign key for each applicable combination of values in a new table, and apply that key to the user_id in another new table, effectively creating a many-to-many relationship. The result would look like so:
Dimension Table ->
| reason_id | return_reason |
|----------- |--------------- |
| 1 | broken |
| 2 | changed mind |
| 2 | overpriced |
| 3 | changed mind |
Fact Table ->
| user_id | reason_id |
|--------- |----------- |
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
| 4 | 3 |
My first thought is to iterate through the table with a cursor, but this seems like a standard problem, so there is probably a more efficient way of doing it. Is there a specific name for this type of problem? I also thought about pivoting and unpivoting, but that didn't seem too clean either. Any help or reference to articles on how to approach this is appreciated.
The problem concerns data normalization and relational integrity. Your example doesn't really make sense as shown: the dimension table has two different reasons with the same ID, and the fact table loses a record. The conventional schema for this many-to-many relationship would be three tables, like so:
Users table (info about users; UserID is unique)
Reasons table (info about reasons; ReasonID is unique)
UserReasons junction table (associates users with reasons; this is your existing table). Assuming a user could be associated with the same reason multiple times, you probably also need ReturnDate and OrderID_FK fields in UserReasons.
So you need to replace the reason description in the first table (UserReasons) with a ReasonID. Add a long integer field ReasonID_FK to that table to hold the ReasonID key.
To build the Reasons table from the current data, use DISTINCT:
SELECT DISTINCT return_reason INTO Reasons FROM UserReasons
In the new table, rename the return_reason field to ReasonDescription and add an autonumber field ReasonID.
Now run an UPDATE query to populate the ReasonID_FK field in UserReasons:
UPDATE UserReasons INNER JOIN Reasons ON UserReasons.return_reason = Reasons.ReasonDescription SET UserReasons.ReasonID_FK = Reasons.ReasonID
When everything looks good, delete the return_reason field.
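If you happen to be on PostgreSQL rather than Access, the same sequence of steps might look roughly like this (a sketch only; it reuses the table and field names from above):
CREATE TABLE Reasons (
    ReasonID          serial PRIMARY KEY,
    ReasonDescription text UNIQUE NOT NULL
);

-- build the dimension table from the distinct descriptions
INSERT INTO Reasons (ReasonDescription)
SELECT DISTINCT return_reason FROM UserReasons;

-- add the foreign key column
ALTER TABLE UserReasons
    ADD COLUMN ReasonID_FK integer REFERENCES Reasons(ReasonID);

-- populate it by matching on the description
UPDATE UserReasons u
SET ReasonID_FK = r.ReasonID
FROM Reasons r
WHERE r.ReasonDescription = u.return_reason;

-- finally, drop the redundant description
ALTER TABLE UserReasons DROP COLUMN return_reason;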
I am having trouble normalising data from an RSS feed into a database.
Each post has an id and a list of categories.
The problem I am having is that categories is a list whose size is not predefined. By 1NF, I should split the list up so that each column holds only atomic data:
+----+----------+
| id | name |
+----+----------+
| 1 | flying |
| 2 | swimming |
| 3 | throwing |
| 4 | sleeping |
| 5 | etc |
+----+----------+
However, a blog post can be tagged with more than one category. This means the posts table would have to hold a list of ids of the categories tagged.
Alternatively, the categories table can have two ids:
+----+--------+----------+
| id | postId | name |
+----+--------+----------+
| 1 | 1 | flying |
| 2 | 1 | swimming |
| 3 | 1 | throwing |
| 4 | 2 | flying |
| 5 | 2 | swimming |
| 6 | 2 | etc |
+----+--------+----------+
And the posts table id will reference the postId column. However, there is repeated data, which is not good.
Lastly, another method I had thought of was to put all the categories in one table:
+----+--------+----------+----------+----------+-----+
| id | flying | swimming | throwing | sleeping | etc |
+----+--------+----------+----------+----------+-----+
| 1 | 1 | 1 | 1 | 1 | 1 |
| 2 | 0 | 1 | 0 | 0 | 0 |
| 3 | 1 | 1 | 0 | 0 | 1 |
| 4 | 0 | 0 | 1 | 1 | 1 |
+----+--------+----------+----------+----------+-----+
With 1 representing present and 0 representing absent, the id in the posts table references id. This method would not have any repeated data. However, blog categories can be created at will, which makes such a table hard to maintain: I would need to alter it every time there is a new category.
How do I put my database in 3NF, eliminating repetition while keeping it maintainable?
TL;DR "Repeated data" is a bugbear. Learn about design and normalization. Start with rows/tables that make clear straightforward relevant statements about an arbitrary situation. So far all you need is:
-- [id] identifies a post with ...
Post(id, ...)
-- post [id] is tagged [name]
Post_Category(id, name)
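A minimal sketch of that design in SQL (the column types are assumptions; the composite key says a post carries a given tag at most once):
create table Post (
    id serial primary key
    -- ... whatever else describes a post
);

create table Post_Category (
    id   integer not null references Post(id),
    name text    not null,
    primary key (id, name)
);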
there is repeated data, which is not good
What exactly do you think "repeated data" is? And why exactly do you think it's "not good"?
There is nothing intrinsically bad about having the same value appear multiple times as a column of a row or part of a value for a column of a row. What matters is whether rows in tables say overlapping things about a situation in certain ways.
Normalization replaces a table by projections of it that join back to it. That means that it replaces tables whose rows say (ie have predicate) "some stuff AND other stuff" about column values by tables whose rows say "some stuff" and "other stuff" separately. Having "AND"s in such a row/table meaning isn't always bad. When there's only one AND, normalization says to decompose to a particular pair of tables exactly when no shared column set always holds a unique set of values in either of the two tables.
put all the categories in one table
Although there is nothing about such a design that would cause normalization to decompose it, your last table is a "bad" design. (Sometimes this kind of design with repeated similar columns is said to violate some notion of "1NF" or "normalization", but that is a misconception.) Eg its rows say "(post [id] is tagged 'flying' AND [flying] = 1 OR post [id] is not tagged 'flying' AND [flying] = 0) AND (post [id] is tagged 'swimming' AND [swimming] = 1 OR post [id] is not tagged 'swimming' AND [swimming] = 0) AND ..." when instead we could just have a table Post_Category with rows saying "post [id] is tagged [name]". Eg we cannot write queries that ask about all categories without mentioning all categories explicitly. Eg if we add a new category then we must add a new column to the table, and if we want our past queries about all categories to keep meaning the same thing, then we must add the new column to them so that they still refer to all categories.
PS It's not clear why you introduced ids. There are reasons we do so, but you should do it for a reason. (Normalization does not introduce ids.) Eg introducing post ids if posts are not uniquely identifiable by other information we want to record.
I have some tables that look like this:
+------------+------------+------------+----------------+----------+
| Locations | HotelsA | HotelsB | HotelsB-People | People |
+------------+------------+------------+----------------+----------+
| LocationID | HotelAID | HotelBID | PersonID | PersonID |
| Address | HotelAName | HotelBName | HotelBID | Name |
| | LocationID | | | |
+------------+------------+------------+----------------+----------+
Currently, if I want to know the address of the hotel someone is staying at, there is no way to make that association without manually looking through the names in HotelsA for something that looks similar enough to the name in HotelsB.
I would like to remove HotelBName and replace it with a foreign key to HotelAID (in this example it would actually make more sense to change HotelsB-People to HotelsAPeople, but there are additional columns that I have omitted for simplicity that prevent that solution from being viable in my particular case). The end result would look like this:
+------------+------------+-------------+----------------+----------+
| Locations | HotelsA | HotelsB | HotelsB-People | People |
+------------+------------+-------------+----------------+----------+
| LocationID | HotelAID | HotelBID | PersonID | PersonID |
| Address | HotelAName | FK_HotelAID | HotelBID | Name |
| | LocationID | | | |
+------------+------------+-------------+----------------+----------+
HotelAName and HotelBName are likely very similar, but inconsistently so. You could have "Springfield Marriott" in one and "Marriott, Springfield" in the other, but there's no consistency (no guarantee anything is spelled correctly either).
Are there any strategies for how this could be done as well as considerations for how to make the applications that utilize this data continue to work during the time it takes to fix all of the data?
Thank you.
I would just add the FK_HotelAID column to the HotelsB table. Assigning the correct id to that column will largely be a manual process, although you could try joining HotelAName to HotelBName to at least cover the ids for names that have a perfect match. Your applications should continue to work while you do this. After you've assigned all the ids in HotelsB, you can define the foreign key and then delete the HotelBName column. Of course, any references the applications make to HotelBName will need to be modified.
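A rough sketch of those steps, assuming a PostgreSQL-style UPDATE ... FROM and made-up column types:
ALTER TABLE HotelsB ADD COLUMN FK_HotelAID integer;

-- pre-fill the ids where the names happen to match exactly
UPDATE HotelsB b
SET FK_HotelAID = a.HotelAID
FROM HotelsA a
WHERE a.HotelAName = b.HotelBName;

-- ...fix the remaining rows by hand, then enforce the relationship:
ALTER TABLE HotelsB
    ADD CONSTRAINT FK_HotelsB_HotelsA
    FOREIGN KEY (FK_HotelAID) REFERENCES HotelsA(HotelAID);

ALTER TABLE HotelsB DROP COLUMN HotelBName;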
I have been working to build a more abstract schema: where there had been several tables modeling remarkably similar relationships, I want to model just the "essence". Due to the environment I am working with (Drupal 7), I can't change the nature of the issue: a relationship of the same essential type could reference one of two different tables for the object in one role. Let's bring in an example to clarify (this is not my actual problem domain, but a similar problem). Here are the requirements:
First, if you are unfamiliar with Drupal, here's the gist: Users in one table, every other entity in a single second table (gross generalization, but enough).
Let's say we want to model the "works for" relationship, and let's take as given that "companies" are of type "entity" and "supervisor" is of type "user" (and by "type" I mean that's the table in the database where their tuples reside). Here are the simplified requirements:
A user can work for a company
A company can work for a company
These "works for" relationships should be in the same table.
I have two ideas, and neither sits entirely well with my current disposition toward schema quality; this is where I would like some insight.
One foreign-key column paired with a 'type' column
Two foreign-key columns, always at most one utilized (ick!)
In case you are a visual thinker, here are the two options representing the fact that users 123 and 632, as well as entity 123 all work for entity 435:
Option 1
+---------------+-------------+---------------+-------------+
| employment_id | employee_id | employee_type | employer_id |
+---------------+-------------+---------------+-------------+
| 1 | 123 | user | 435 |
+---------------+-------------+---------------+-------------+
| 2 | 123 | entity | 435 |
+---------------+-------------+---------------+-------------+
| 3 | 632 | user | 435 |
+---------------+-------------+---------------+-------------+
Option 2
+---------------+------------------+--------------------+-------------+
| employment_id | employee_user_id | employee_entity_id | employer_id |
+---------------+------------------+--------------------+-------------+
| 1 | 123 | <NULL> | 435 |
+---------------+------------------+--------------------+-------------+
| 2 | <NULL> | 123 | 435 |
+---------------+------------------+--------------------+-------------+
| 3 | 632 | <NULL> | 435 |
+---------------+------------------+--------------------+-------------+
Thoughts on option 1: I like that the employee_id column has a concrete role, but I despise that it has an ambiguous target. Option 2 has an ambiguous role (which column is the employee?) but a concrete target for any given FK, so I can think of it this way:
+-----------+-----------+----------+
|           |         ROLE         |
|           | ambiguous | concrete |
+-----------+-----------+----------+
| T         |           |          |
| A  ambig. |           |    1     |
| R         |           |          |
| G --------+-----------+----------+
| E         |           |          |
| T  concr. |     2     |    ?     |
|           |           |          |
+-----------+-----------+----------+
Option two has very pragmatic benefits for my project, but I do not feel comfortable with so many nulls (you might not even call it 1NF!)
So here's the crux of my question for SO: How can option 1 be improved, or else what knowledge gap might I have that leaves me unsettled? While I can't bring to mind a specific rule which it violates, the design clearly is not in keeping with the intentions of normalization (requiring two columns to uniquely identify a relationship is not doing me any favors for safeguarding against anomalies).
I do understand that the ideal solution would be to redesign the users entity to be the same as what I have been calling "entity" here, but please consider that beside the point/circumstantial (or at least let's draw the pragmatic line right exactly there for this question).
Again, the essential question: What, in terms of normalization, is wrong with schema option 1, and how might you model this relationship given the constraint of not refactoring "user" into "entity"?
note: For this, I am more interested in theoretical purity than a pragmatic solution
The solutions you present contravene 4th normal form, as #podiluska says. If this is recast into the form below, then the solution removes this difficulty and is in 5NF (and even 6NF?).
Adopt one of the patterns for sub/super types. This uses the relation definitions set out below, plus the super/subtype constraint. That constraint is that each tuple in the supertype relation must correspond to exactly one subtype tuple. In other words, the subtypes must form a disjoint, covering set over the supertype.
I suspect the performance of this in a real situation might require some heavy tuning:
Table: Employment
+---------------+-------------+
| employee_id | employer_id |
+---------------+-------------+
| 1 | 435 |
+---------------+-------------+
| 2 | 435 |
+---------------+-------------+
| 3 | 435 |
+---------------+-------------+
Table: Employee (SuperType)
+---------------+
| employee_id |
+---------------+
| 1 |
+---------------+
| 2 |
+---------------+
| 3 |
+---------------+
Table: User employee (SubType)
+---------------+-------------+
| employee_id | user_id |
+---------------+-------------+
| 1 | 123 |
+---------------+-------------+
| 3 | 632 |
+---------------+-------------+
Table: Entity employee (SubType)
+---------------+-------------+
| employee_id | entity_id |
+---------------+-------------+
| 2 | 123 |
+---------------+-------------+
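A minimal DDL sketch of those relations (column types are assumptions, and the user_id/entity_id/employer_id columns are placeholders for whatever the Drupal users and entity tables actually expose; note that the disjoint/covering constraint itself cannot be expressed with plain foreign keys and usually ends up enforced by triggers or application code):
create table Employee (                 -- supertype
    employee_id integer primary key
);

create table UserEmployee (              -- subtype: employees that are users
    employee_id integer primary key references Employee(employee_id),
    user_id     integer not null unique  -- points at the Drupal users table
);

create table EntityEmployee (            -- subtype: employees that are entities
    employee_id integer primary key references Employee(employee_id),
    entity_id   integer not null unique  -- points at the entity table
);

create table Employment (
    employee_id integer not null references Employee(employee_id),
    employer_id integer not null,        -- points at the entity table (the company)
    primary key (employee_id, employer_id)
);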
What is wrong with option 1 (and option 2) is that it contains a multivalued dependency and, as such, is a breach of 4th normal form. However, within the constraints you have given, there's not a lot you can do about that.
If you could replace the worksfor table with a view, then you could keep user-company and company-company relations separate.
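A sketch of that idea (table and column names are made up): the two base tables stay cleanly typed, and the view presents the combined "works for" relation.
create table user_employment (
    user_id     integer not null,   -- points at the users table
    employer_id integer not null,   -- points at the entity table (the company)
    primary key (user_id, employer_id)
);

create table entity_employment (
    entity_id   integer not null,   -- points at the entity table
    employer_id integer not null,   -- points at the entity table (the company)
    primary key (entity_id, employer_id)
);

create view works_for as
    select user_id   as employee_id, 'user'   as employee_type, employer_id
    from user_employment
    union all
    select entity_id as employee_id, 'entity' as employee_type, employer_id
    from entity_employment;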
Of your two choices, Option 2 has the advantage that it may be easier to enforce the referential integrity, depending on your platform.
One potential, if icky, pragmatic solution within your current constraints could be to give companies positive IDs and users negative IDs, which eliminates the empty column of option 2 and turns the type column of option 1 into an implication, but I feel dirty even suggesting it.
Similarly, if you don't need to know what type the entity is, as long as you can determine it via joining, then using GUIDs as IDs would eliminate the need for the type column.
Related
Using SO as an example, what is the most sensible way to manage tags if you anticipate they will change often?
Way 1: Seriously denormalized (comma delimited)
table posts
+--------+-----------------+
| postId | tags |
+--------+-----------------+
| 1 | c++,search,code |
Here tags are comma delimited.
Pros: Tags are retrieved at once with a single select query. Updating tags is simple, easy, and cheap.
Cons: Extra parsing on tag retrieval, difficult to count how many posts use which tags.
(alternatively, if limited to something like 5 tags)
table posts
+--------+-------+-------+-------+-------+-------+
| postId | tag_1 | tag_2 | tag_3 | tag_4 | tag_5 |
+--------+-------+-------+-------+-------+-------+
| 1 | c++ |search | code | | |
Way 2: "Slightly normalized" (separate table, no intersection)
table posts
+--------+-------------------+
| postId | title |
+--------+-------------------+
| 1 | How do u tag? |
table taggings
+--------+---------+
| postId | tagName |
+--------+---------+
| 1 | C++ |
| 1 | search |
Pros: Easy to see tag counts (count(*) from taggings where tagName='C++').
Cons: tagName will likely be repeated many, many times.
Way 3: The cool kid's (normalized with intersection table)
table posts
+--------+---------------------------------------+
| postId | title |
+--------+---------------------------------------+
| 1 | Why is a raven like a writing desk? |
table tags
+--------+---------+
| tagId | tagName |
+--------+---------+
| 1 | C++ |
| 2 | search |
| 3 | foofle |
table taggings
+--------+---------+
| postId | tagId |
+--------+---------+
| 1 | 1 |
| 1 | 2 |
| 1 | 3 |
Pros:
No repeating tag names.
More girls will like you.
Cons: More expensive to change tags than way #1.
These solutions are called mysqlicious, scuttle and toxi.
This article compares benefits and drawbacks of each.
I would argue that there is a fourth solution which is a variation on your third solution:
Create Table Posts
(
    id ...
    , title ...
);

Create Table Tags
(
    name varchar(30) not null primary key
    , ...
);

Create Table PostTags
(
    PostId ...
    , TagName varchar(30) not null
    , Constraint FK_PostTags_Posts
        Foreign Key ( PostId )
        References Posts( Id )
    , Constraint FK_PostTags_Tags
        Foreign Key ( TagName )
        References Tags( Name )
        On Update Cascade
        On Delete Cascade
);
Notice that I'm using the tag name as the primary key of the Tags table. In this way, you can filter on certain tags without the extra join to the Tags table itself. In addition, if you change a tag name, it will update the names in the PostTags table. If changing a tag name is a rare occurrence, then this shouldn't be a problem. If changing a tag name is a common occurrence, then I would go with your third solution where you use a surrogate key to reference the tag.
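For example, a query like the following never needs to touch the Tags table at all (a sketch; the post columns are assumed from the DDL above):
Select p.id, p.title
From Posts p
Join PostTags pt On pt.PostId = p.id
Where pt.TagName = 'search';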
I personally favour solution #3.
I don't agree that solution #1 is easier to maintain.
Think of the situation where you have to change the name of a tag.
Solution #1:
UPDATE posts SET tags = REPLACE(tags, 'oldname', 'newname') WHERE tags LIKE '%oldname%'
Solution #3:
UPDATE tags SET tagName = 'newname' WHERE tagName = 'oldname'
The first one is way heavier.
Also, you have to deal with the commas when deleting tags (OK, it's easily done, but it's still more difficult than just deleting one row in the taggings table).
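To make the contrast concrete, here is a sketch of detaching one tag from one post in both schemes (MySQL-flavoured string functions assumed for the comma-delimited case, with 'search' as the example tag):
-- solution #3: delete one row from the intersection table
DELETE FROM taggings WHERE postId = 1 AND tagId = 2;

-- solution #1: cut the tag out of the comma-delimited list, then tidy up the commas
UPDATE posts
SET tags = TRIM(BOTH ',' FROM REPLACE(CONCAT(',', tags, ','), ',search,', ','))
WHERE CONCAT(',', tags, ',') LIKE '%,search,%';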
As for solution #2... it is neither fish nor fowl.
I think that SO uses solution #1. I'd go with either #1 or #3.
One thing to consider is whether you have several things that you can tag (e.g. adding tags to both posts and products). This may affect the database solution.
Well, I had the same doubt, and I adopted the third solution for my website. I know there is another way of dealing with this problem of variable-length tuples, which consists in storing columns as rows: the information identifying the tuple is repeated, and the varying values are organized one per row.
+--------+-------+-------------------------------------+
| postId | label | value |
+--------+-------+-------------------------------------+
| 1 | tag |C++ |
+--------+-------+-------------------------------------+
| 1 | tag |search |
+--------+-------+-------------------------------------+
| 1 | tag |code |
+--------+-------+-------------------------------------+
| 1 | title | Why is a raven like a writing desk? |
+--------+-------+-------------------------------------+
This is really bad but sometimes it's the only feasible solution, and it's very far from the relational approach.