PostgreSQL + Django: SQL design pattern for multi-type data [closed]

I need to store message data in SQL. Cannot decide which way to go here.
There is a main class Message, say (simplified):
    class Message(models.Model):
        user_id = models.ForeignKey(User)
        text = models.TextField()
Plus, there are other Message classes that inherit this one.
    class MmsMessage(Message):
        imagedata = models.ForeignKey(ImageData)
And so on. These other message classes can, of course, have more than one additional field.
Now, I am evaluating the best (fastest) design pattern to make this work.
In around 25% of cases I will not need the additional fields, just raw Message objects (Message.objects.all()). In the other cases, I need all the data. The additional fields do not necessarily have to be searchable; nonetheless, it would be a nice thing to have.
I was thinking about:
A: Inheritance (concrete, abstract)
Abstract inheritance is out: I lose the ability to do Message.objects.all(), which is unacceptable.
Concrete inheritance seems to me like the way to go. I tried two approaches. The django-model-utils one (select_subclasses) needs no additional queries, but due to lots of inner joins and redundant data in the results it is very slow compared to the other solutions.
The django_polymorphic approach (still concrete inheritance; it uses contenttypes to work out what it is dealing with and then selects the related fields) is roughly 4 times faster than select_subclasses, at least on PostgreSQL, which was a small surprise to me. It requires n additional queries, where n is the number of child types, but it is still faster thanks to simpler joins and no unnecessary data in the results. Tested on 10,000 objects across 20 different Message child types.
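For reference, a minimal sketch of the two concrete-inheritance setups being compared, assuming the current django-model-utils and django-polymorphic APIs (field definitions abbreviated; on_delete added for newer Django versions):

    # Sketch only; assumes a configured Django project with django-model-utils installed.
    from django.db import models
    from model_utils.managers import InheritanceManager

    class Message(models.Model):
        user = models.ForeignKey('auth.User', on_delete=models.CASCADE)
        text = models.TextField()
        objects = InheritanceManager()  # enables select_subclasses()

    class MmsMessage(Message):
        imagedata = models.ForeignKey('ImageData', on_delete=models.CASCADE)

    # One query, child rows joined in:
    #   messages = Message.objects.select_subclasses()
    #
    # django-polymorphic alternative: inherit Message from PolymorphicModel
    # (from polymorphic.models import PolymorphicModel); Message.objects.all()
    # then downcasts via contenttypes, issuing one extra query per child type
    # present in the result.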
B: EAV model (many-to-many for additional attributes)
I haven't tested the EAV model, but I doubt it will be faster than the inheritance solution. When I already know which column names and types I want, the EAV model seems to lose all its charm.
[UPDATED - horse_with_no_name] B1: hstore - similar to EAV with many-to-many, but possibly much faster (no joins, backend support)
Great for adding dictionary-like custom fields.
Downsides: I lose compatibility with other Django database backends (which I would prefer not to), and it is type-agnostic: both key and value are TEXT. I am also worried that raw queries against the Message table will become slower in general due to the many TEXT fields in the hstore dict.
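For illustration, a minimal hstore sketch using the HStoreField that ships in django.contrib.postgres (at the time the question was asked this lived in the separate django-hstore package); the field and key names here are just examples:

    # Requires PostgreSQL, the hstore extension (HStoreExtension migration),
    # and 'django.contrib.postgres' in INSTALLED_APPS.
    from django.contrib.postgres.fields import HStoreField
    from django.db import models

    class Message(models.Model):
        user = models.ForeignKey('auth.User', on_delete=models.CASCADE)
        text = models.TextField()
        extra = HStoreField(default=dict)  # keys and values are both stored as text

    # Message.objects.create(user=u, text='hi', extra={'duration': '12'})
    # Message.objects.filter(extra__has_key='duration')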
C: XML field in Message table for additional data
An XML field in the Message table feels a little fishy to me. If I don't need these additional fields (from the message child types) to be searchable or indexable, is an XML field a good solution?
What is the best option in your opinion?

The simple answer here is to just use a single table. However, the real question is why you're putting this stuff in a database in the first place. If your intent is to scale up to large sizes, then you probably want to look at a hybrid storage model (indexing messages in the DB and storing the raw message in something like HBase).
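A hedged sketch of what that single-table layout could look like in Django terms, using a type discriminator and nullable per-type columns (names are illustrative, not the author's schema):

    from django.db import models

    class Message(models.Model):
        MESSAGE_TYPES = (('plain', 'Plain'), ('mms', 'MMS'))

        user = models.ForeignKey('auth.User', on_delete=models.CASCADE)
        text = models.TextField()
        message_type = models.CharField(max_length=20, choices=MESSAGE_TYPES,
                                        default='plain')
        # Columns used only by some message types stay NULL for the rest.
        imagedata = models.ForeignKey('ImageData', null=True, blank=True,
                                      on_delete=models.SET_NULL)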


RDBMS results return, ordering and returning sets/hashmaps instead of arrays/lists [closed]

Most (if not all) SQL-based RDBMS connector libraries I've come into contact with return results in array form. Why? If the order is arbitrary (no sorting SQL modifier), couldn't the natural return form be something like a Set or HashMap? These data structures would, in some cases, be more computationally favorable at scale than a typical array/list return in languages like C++ (with standard template library usage), JavaScript/Node, Go, and any other language that supports associative data types or pure sets.
In particular, do libraries such as knex.js offer such a feature in the form of a connection flag (I haven't found it yet)?
Do any of the major RDBMS systems (MySQL, PostgreSQL, ...), offer the ability to return results in a set/hashmap form?
Concretely, what I think would make sense using Node.js and a library like knex.js is to specify a flag like:
knex.forceMap('keycolumnpattern') or knex.forceSet()...
Again, the underlying assumption here is that you are not imposing order on the SQL (or other) query by adding a sort directive, i.e. ORDER BY.
The justification for this is in environments where scaling and computational complexity are important concerns.
Good question.
This is by no means a comprehensive answer, but just my opinion on this curious question.
Usually databases return a series of rows, which most documentation refers to as a "result set".
The database
Now, this result set is assembled when the query is executed and may take different forms. Most likely the database sends it as an "enumeration": that is, a list-like entity that produces rows when you request them. To save resources, the database will try not to materialize the whole result set at once, but to produce rows as you read them from your client application. This happens as long as the query can be "pipelined".
When the query cannot be pipelined, the whole data set is materialised on the database side.
The driver
Your client driver does not retrieve rows one by one, but in groups, by the use of buffering. Even when the query cannot be pipelined, your client driver will still retrieve the rows in groups, according to the "fetch size" and "buffer size".
The client technology
Your application can use the driver's basic primitive operations, or a more sophisticated ORM. It's common for ORMs to hide the inner workings of the driver and offer you a "simple" result like an array, list, or map, i.e. hiding the "streaming" an enumeration provides.
If you don't use an ORM, then you will probably call the driver primitives yourself and therefore you can get access to all inner, ugly details. The upside is that you can assemble the result set rows in any data structure you prefer.
In any case, the repertoire of data structures will depend on the specific query, since a "map" or a "set" will require some kind of unique identifier, while a list won't.
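To illustrate that last point (in Python rather than the knex.js setup the question mentions, and with made-up table and column names), turning the driver's row stream into a map or a set is a one-line reshaping on the client side:

    import sqlite3

    conn = sqlite3.connect(':memory:')
    conn.execute('CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)')
    conn.executemany('INSERT INTO users VALUES (?, ?)', [(1, 'ada'), (2, 'bob')])

    rows = conn.execute('SELECT id, name FROM users')   # arrives as an iterable of tuples
    users_by_id = {uid: name for uid, name in rows}     # map keyed by a unique column
    names = set(users_by_id.values())                    # or a set, if that is the need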

Does Stack Exchange's database schema follow good practice? [closed]

This is somewhat of a meta question, but because it relates to a database design, I thought I should post it here.
I'm building a site that includes Q+A and was wondering how I should structure my SQL database, so naturally, I looked to the best of the best. However, the Stack Exchange database schema seems to defy what I've learned about creating maintainable/extensible table hierarchies.
As you can see, Stack Exchange stores all of its "Posts" in one table, except for comments, which have their own table. Post types include questions, answers, and various wiki things. This results in a lot of NULL columns in the table. For example, questions have titles, tags, and answerCounts, while answers don't, so all answer entries have NULL for all three of those columns. If more post types are added over time, this will progressively become less maintainable. And the fact that comments are the only type of post with their own table just seems inconsistent.
What I've read states that it's generally preferred to use an object subclass hierarchy, in which there's a generic "Posts" table along with a bunch of tables for each type of post that all have one column that maps back to the corresponding entry in the "Posts" table. This keeps the number of null columns to a minimum and makes it more extensible, but slows down queries because they'll require more joins.
So why does Stack Exchange use this giant table method? Is it just the result of ages of modifications to an old database? More specifically, should I use this model for my own Q+A system or stick with an object subclass hierarchy (my Q+A/forum system will closely resemble SO's, with several types of posts including questions, answers, polls, reviews, etc.)?
This is a classic case of the so-called "object-relational impedance mismatch". Specifically, you are talking about mapping OO inheritance onto a relational database structure. There are several common ways of doing that -
A table per subclass,
A table per leaf subclass, and
A table per class hierarchy (with a discriminator)
Each of these strategies is perfectly valid. Moreover, the structures could be mixed as needed.
It looks like Stack Exchange used the table-per-class-hierarchy approach, with PostTypeId serving as the discriminator. This approach is as valid as any other they could have taken. It is also one of the simplest from a maintenance standpoint, because it lets you construct manual queries with less work.
There is another thing in the structure of the table that you did not mention: it is not normalized. Specifically, there are AnswerCount and CommentCount fields that store information that could be obtained by aggregating the table (i.e. running a SELECT COUNT(*) FROM ... WHERE ... AND other.ParentId = p.Id ...). This is a common tradeoff between normalization and speed of execution: most likely, profiling indicated that the aggregation takes a significant amount of time, so the counts were moved into the "parent" record.
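A small sketch of that layout and of the aggregate the stored counts replace (simplified column set; SQLite is used here purely to keep the example self-contained):

    import sqlite3

    conn = sqlite3.connect(':memory:')
    conn.executescript("""
    CREATE TABLE Posts (
        Id         INTEGER PRIMARY KEY,
        PostTypeId INTEGER NOT NULL,   -- discriminator: 1 = question, 2 = answer, ...
        ParentId   INTEGER,            -- answers point at their question
        Title      TEXT,               -- NULL for answers
        Body       TEXT NOT NULL
    );
    """)

    # Without a denormalized AnswerCount column, the count is recomputed per read:
    answer_counts = conn.execute("""
        SELECT q.Id, COUNT(a.Id)
        FROM Posts q
        LEFT JOIN Posts a ON a.ParentId = q.Id AND a.PostTypeId = 2
        WHERE q.PostTypeId = 1
        GROUP BY q.Id
    """).fetchall()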

Performance gains vs Normalizing your tables? [closed]

OK, OK, I know you're probably all going to kill me for asking this, but I got into a friendly programmer argument with a co-worker about one of our database tables, and he asked a question which I know the answer to, yet I couldn't explain why it is the better way.
I will simplify the situation for the sake of the question. We have a fairly large table of people/users. Amongst other data being stored, the data in question is as follows: we have a simNumber, a cellNumber, and the ipAddress of that sim.
Now I am saying that we should make a table, let's call it SimTable, put those 3 columns in it, and then put an FK in the UsersTable linking the two. Why? Because that's what I have always been taught: NORMALISE your tables! OK, so all is good in that regard.
But now my friend says to me: yes, but when you want to query a user's phone number, SQL has to go and:
search for the user
search for the sim fk
search for the correct sim row in the sim database
get the phone number
Now when I go and request 10,000 users' phone numbers, the number of operations performed grows considerably.
Vs the other approach
search for the user
find the phone number
Now the argument is purely performance-based. As much as I understand why we normalize data (to remove redundancy, for maintainability, so that changes to data in one table propagate, etc.), it does appear to me that the approach with the data in one table will be faster, or will at least require fewer tasks/operations to give me the data I want.
So what is the case in this situation? I do hope that I have not asked anything insanely silly; it is early in the morning, so forgive me if I'm not thinking clearly.
The technology involved is MS SQL Server 2012.
[EDIT]
The article below also touches on some of the concepts I have mentioned above:
http://databases.about.com/od/specificproducts/a/Should-I-Normalize-My-Database.htm
The goal of normalization is not performance. The goal is to model your data correctly with minimum redundancy so you avoid data anomalies.
Say, for example, two users share the same phone. If you store the phones in the user table, you'd have the sim number, IP address, and cell number stored on each user's row.
Then you change the IP address on one row but not the other. How can one sim number have two IP addresses? Is that even valid? Which one is correct? How would you fix such discrepancies? How would you even detect them?
There are times when denormalization is worthwhile, if you really need to optimize data access for one query that you run very frequently. But denormalization comes at a cost, so be prepared to commit yourself to a lot more manual work to take responsibility for data integrity. More code, more testing, more cleanup tasks. Do those count when considering "performance" of the project overall?
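To make the normalized option concrete, here is a minimal sketch (table and column names are illustrative; SQLite is used only for self-containment). Note that fetching 10,000 phone numbers is still a single set-based join, not 10,000 separate lookups:

    import sqlite3

    conn = sqlite3.connect(':memory:')
    conn.executescript("""
    CREATE TABLE Sims (
        SimId      INTEGER PRIMARY KEY,
        SimNumber  TEXT NOT NULL,
        CellNumber TEXT NOT NULL,
        IpAddress  TEXT NOT NULL
    );
    CREATE TABLE Users (
        UserId INTEGER PRIMARY KEY,
        Name   TEXT NOT NULL,
        SimId  INTEGER REFERENCES Sims(SimId)   -- the FK under discussion
    );
    """)

    phone_numbers = conn.execute("""
        SELECT u.Name, s.CellNumber
        FROM Users u
        JOIN Sims s ON s.SimId = u.SimId
    """).fetchall()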
Re comments:
I agree with @JoelBrown: as soon as you implement your first case of denormalization, you compromise on data integrity.
I'll expand on what Joel mentions as "well-considered." Denormalization benefits specific queries. So you need to know which queries you have in your app, and which ones you need to optimize for. Do this conservatively, because while denormalization can help a specific query, it harms performance for all other uses of the same data. So you need to know whether you need to query the data in different ways.
Example: suppose you are designing a database for StackOverflow, and you want to support tags for questions. Each question can have a number of tags, and each tag can apply to many questions. The normalized way to design this is to create a third table, pairing questions with tags. That's the physical data model for a many-to-many relationship:
Questions ----<- QuestionsTagged ->---- Tags
But you figure you don't want to do the join to get tags for a given question, so you put tags into a comma-separated string in the questions table. This makes it quicker to query a given question and its associated tags.
But what if you also want to query for one specific tag and find its related questions? If you use the normalized design, it's simply a query against the many-to-many table, but on the tag column.
But if you denormalize by storing tags as a comma-separated list in the Questions table, you'd have to search for tags as substrings within that comma-separated list. Searching for substrings can't be indexed with a standard B-tree style index, and therefore searching for related questions becomes a costly table-scan. It's also more complex and inefficient to insert and delete a tag, or to apply constraints like uniqueness or foreign keys.
That's what I mean by denormalization making an improvement for one type of query at the expense of other uses of the data. That's why it's a good idea to start out with everything in normal form, and then refactor to denormalized designs later on a case by case basis as your bottlenecks reveal themselves.
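A quick sketch of the normalized design and the lookup it enables, versus the substring scan the comma-separated column forces (SQLite used only to keep the example self-contained):

    import sqlite3

    conn = sqlite3.connect(':memory:')
    conn.executescript("""
    CREATE TABLE Questions (Id INTEGER PRIMARY KEY, Title TEXT NOT NULL);
    CREATE TABLE Tags      (Id INTEGER PRIMARY KEY, Name  TEXT UNIQUE NOT NULL);
    CREATE TABLE QuestionsTagged (
        QuestionId INTEGER REFERENCES Questions(Id),
        TagId      INTEGER REFERENCES Tags(Id),
        PRIMARY KEY (QuestionId, TagId)
    );
    """)

    # Indexable equality lookup against the many-to-many table:
    questions_for_tag = conn.execute("""
        SELECT q.Id, q.Title
        FROM Questions q
        JOIN QuestionsTagged qt ON qt.QuestionId = q.Id
        JOIN Tags t ON t.Id = qt.TagId
        WHERE t.Name = ?
    """, ('sql',)).fetchall()

    # The denormalized alternative forces something like
    #   WHERE ',' || Tags || ',' LIKE '%,sql,%'
    # which no standard B-tree index can serve.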
This goes back to old wisdom:
"Premature optimization is the root of all evil" -- Donald Knuth
In other words, don't denormalize until you can demonstrate during load testing that (a) it makes a real improvement to performance that justifies the loss of data integrity, and (b) it does not degrade performance of other cases unacceptably.
It sounds like you already understand the benefits of normalisation, so I won't cover them.
There are a couple of considerations here:
1. Does a user always have one and only one phone number?
If so, then it is still normalised to add these to the user table. However, if the user can have either no phone number or multiple phone numbers, then the phone details should be held in a separate table.
Assuming you have these in separate tables, if after conducting performance tests you find that joining these two tables has a significant effect on performance, then you may choose to deliberately denormalise the tables for performance gains.
Others have already provided some good points and you may also want to take a look at this.
I'd just like to mention one more aspect that is often overlooked: I/O tends to be the greatest component of the cost of most queries, and denormalization generally increases the storage size of data, therefore making the DBMS cache "smaller".
If your normalized database fits into the cache and the denormalized one doesn't, you may actually observe a performance decrease with the latter.
And you won't be able to spot that in development unless you actually have an amount of data similar to production. This is one of many reasons why you should never, ever denormalize without solid measurements (on representative amounts of data) to justify it.

Are user-defined SQL datatypes used much? [closed]

My DBA told me to use a user-defined SQL datatype to represent addresses, and then use a single column of that new type in our users table instead of multiple address columns. I've never done this before and am wondering if this is a common approach.
Also, what's the best place to get information about this - is it product-specific?
As far as I can tell, at least in the SQL Server world, UDTs aren't used very much.
The trouble with UDTs is the fact that you can't easily update them. Once created and used in databases, they're almost set in stone.
There's no "CREATE OR ALTER (UDT)" command :-( So to change something, you have to do a lot of shuffling around - possibly copying away existing data, then dropping lots of columns from other tables, then dropping your UDT, re-creating it with the new structure, and reapplying the data and everything.
That's just too much hassle - and you know: there will be change!
Right now, in SQL Server land, UDTs are just a nice idea - but really badly implemented. I wouldn't recommend using them extensively.
Marc
There are a number of other questions on SO about how to represent addresses in a database. As far as I can recall, none of them suggests a user-defined type for the purpose. I would not regard it as a common approach; that is not to say it is not a reasonable one. The main difficulties lie in deciding what methods to provide for manipulating the address data - formatting it to appear on an envelope or in specific places on a printed form, updating individual fields, worrying about the many ramifications of international addresses, and so on.
Defining user-defined types is very product specific. The ways you do it in Informix are different from the ways it is done in DB2 and Oracle, for example.
I would also rather avoid using user-defined datatypes, as their definition and usage will make your code dependent on a particular database.
Instead, if you are using an object-oriented language, create a composition relationship to define addresses for an employee (for example) and store the addresses in a separate table.
E.g. an Employees table and an Employee_Addresses table. One employee can have multiple addresses.
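A rough sketch of that separate-table layout (the table and column names are illustrative only; SQLite just keeps the example self-contained):

    import sqlite3

    conn = sqlite3.connect(':memory:')
    conn.executescript("""
    CREATE TABLE Employees (
        EmployeeId INTEGER PRIMARY KEY,
        Name       TEXT NOT NULL
    );
    CREATE TABLE Employee_Addresses (
        AddressId   INTEGER PRIMARY KEY,
        EmployeeId  INTEGER NOT NULL REFERENCES Employees(EmployeeId),
        AddressType TEXT,            -- e.g. 'home', 'billing'
        Line1       TEXT NOT NULL,
        City        TEXT,
        PostalCode  TEXT,
        Country     TEXT
    );
    """)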
user-defined SQL datatype to represent addresses
User-defined types can be quite useful, but a mailing address doesn't jump out as one of those cases (to me, at least). What is a mailing address to you? Is it something you print on an envelope to mail someone? If so, text is about as good as it's going to get. If you need to know what state someone is in for legal reasons, store that separately and it's not a problem.
Other posts here have criticized UDTs, but I think they do have some amazing uses. PostgreSQL had full-text search as a plugin based on UDTs for a long time before full-text search was actually integrated into the core product. Right now PostGIS is a very successful GIS product that is entirely a plugin based on UDTs (it has a GPL license, so it will never be integrated into core).

Where do you store long/complete queries used in code? [closed]

Here's my situation. I've noticed that code gets harder to maintain when you keep embedding queries in every function that uses them. Some queries tend to grow very fast and lose readability once every line is concatenated. Another issue that comes with all of the concatenation is when you must test a specific query by pasting it somewhere: you are forced to strip out all of the quotation marks that held the query string together.
So my question is: what methods are being used to separate queries from the code? I have tried searching, but I must not be using the right terms, because I'm not finding anything relevant.
I'd like to note that views and stored procedures are not possible, since my queries fetch data from a production database.
Thank you.
If you follow an MVC pattern, then your queries should all be in the model - i.e. in the objects representing actual data.
If not, then you could just put all your repetitive queries in script files, including only those needed in each request.
However, concatenating and that kind of stuff is hard to get rid of; that's why programmers exist :)
These two words will be your best friend: Stored Procedure
I avoid this problem by wrapping queries in classes that represent the entities stored in the table. So the accounts table has an Account object. It'll have an insert/update/delete query.
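As a hedged illustration of that wrapper idea (Python and sqlite3 here just to keep it self-contained; the accounts table and method names are invented for the example), all SQL for one entity lives in one class and callers never concatenate query strings themselves:

    import sqlite3

    class Account:
        def __init__(self, conn: sqlite3.Connection):
            self.conn = conn

        def insert(self, name: str, balance: float) -> int:
            cur = self.conn.execute(
                "INSERT INTO accounts (name, balance) VALUES (?, ?)",
                (name, balance),
            )
            return cur.lastrowid

        def update_balance(self, account_id: int, balance: float) -> None:
            self.conn.execute(
                "UPDATE accounts SET balance = ? WHERE id = ?",
                (balance, account_id),
            )

        def delete(self, account_id: int) -> None:
            self.conn.execute("DELETE FROM accounts WHERE id = ?", (account_id,))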
I've seen places where the query is stored in a file and templates are used to replace parts of the query.
Java had something called SQLJ - don't know if it ever took off.
LINQ might provide some way around this as an issue too.
Risking being accused of not answering the question, my suggestion to you would be simply don't. Use an O/RM of your choice and you'll see this problem disappear in no time.
I usually create a data class that represents the data requirements for the objects that are represented in the database. So if I have a Customer class, then I create a CustomerData class as well that houses all the data access logic. This keeps data logic out of your entity classes. You would add all your CRUD methods here, as well as custom data methods.
You can also use stored procedures or an ORM tool, depending on your language.
The key is to keep your data logic separate from your business and entity logic.
I use stored procedures in my production environment, and we process the rows too...