In my app, I have a unique id for each object (tables).
Because of that, when I see an id, I know what object type it is, whether it's a User or a Hotel.
I was wondering if I could skip storing item_type in polymorphic associations and instead patch the lookup to resolve the type from the id sequence in memory, thus saving space in the DB and in the index.
Can this be done?
I am working with Rails 3.0.9, Ruby 1.9.2
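To make the idea concrete, here is a rough, language-neutral sketch in Python (not Rails code) of what an in-memory "id sequence lookup" might look like. It assumes each table draws its ids from its own disjoint range; the ranges and the `type_for_id` helper are invented for illustration and are not part of ActiveRecord.

```python
from bisect import bisect_right

# Hypothetical allocation: each table draws ids from its own disjoint range.
# Entries are (first id of the range, type name), sorted by the first id.
ID_RANGES = [
    (1, "User"),
    (1_000_000, "Hotel"),
    (2_000_000, "Booking"),
]
_STARTS = [start for start, _ in ID_RANGES]


def type_for_id(object_id: int) -> str:
    """Return the type name whose id range contains object_id."""
    if object_id < _STARTS[0]:
        raise ValueError(f"id {object_id} is below every known range")
    # bisect_right finds the first range that starts after object_id;
    # the range just before it is the one that contains the id.
    index = bisect_right(_STARTS, object_id) - 1
    return ID_RANGES[index][1]


print(type_for_id(42))         # falls in the User range
print(type_for_id(1_000_000))  # first id of the Hotel range
```

The lookup itself is cheap; the hard part is convincing the ORM's polymorphic machinery to use it instead of a stored type column.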

This may not be the answer you're looking for. I'm sure this is possible somehow, but you'd be fighting against the grain of how polymorphism was designed in ActiveRecord, and that would probably cause a load of pain.
The first question that comes to me is why? You're asking for a performance optimization. Are you seeing a performance problem? Have you verified with instrumentation, New Relic, the Ruby profiler and other tools that this second lookup is really what is killing your performance? If you haven't done any of this, then you're probably wasting your time. Predicting performance bottlenecks is an inaccurate science and subject to the 80-20 rule.
If you really, really have this problem, and you've been thorough in analyzing your logs and your New Relic charts, isolated the problem and run performance tests against it, and you're still seeing a performance problem right at this lookup, then I'd suggest that some form of denormalization will probably give you some improvement. Denormalization is a common tool for fixing database performance problems. You'd be storing data in more than one place, but your queries will touch fewer tables (faster), at the extra overhead of keeping the multiple copies in sync when updating records (more complex application code).
If you could post a bit more detail about your example, it would be easier to make some more concrete suggestions or give examples.


What do I need to know about databases in order to create a quality Django app?

I'm trying to optimize my site and found this nice little Django doc:
Database Access Optimization, which suggests profiling followed by indexing and the selection of proper fields as the starting point for database optimization.
Normally, the django docs explain things pretty well, even things that more experienced programmers might consider "obvious". Not so in this case. After no explanation of indexing, the doc goes on to say:
We will assume you have done the obvious things above.
Uhhh. Wait! What the heck is indexing?
Obviously I can figure out what indexing is via Google. My question is: what do I need to know about databases in order to create a scalable website? What should I be aware of about the Django framework specifically? What other "obvious" things ought I to know? Where can I learn them?
I'm looking to get pointed in a direction here. I don't need to learn anything and everything about SQL, I just want to be informed enough to build my app the right way.
Thanks in advance!
I encourage you to read all that the other answers suggest and whatever else you can find on the subject, because it's all good information to know and will make you a better programmer.
That said, one of the nice things about Django and other similar frameworks is that for the most part you don't have to know what's going on behind the scenes in the DB. Django adds indexes automatically for fields that need them. The encouragement to add more is based on the use cases of your app. If you continually query based on one particular field, you should ensure that that field is indexed. It might be already (if it's a foreign key, primary key, etc.), but other random fields typically aren't.
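As a rough illustration of what an index changes, the sketch below uses Python's built-in sqlite3 module rather than Django. The `articles` table and its columns are made up for the example. In Django itself, the usual way to request an extra index is `db_index=True` on the model field.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE articles (id INTEGER PRIMARY KEY, author TEXT, title TEXT)"
)
conn.executemany(
    "INSERT INTO articles (author, title) VALUES (?, ?)",
    [(f"author{i % 100}", f"title{i}") for i in range(1000)],
)

QUERY = "SELECT title FROM articles WHERE author = ?"


def query_plan(connection):
    """Return SQLite's description of how it would run QUERY."""
    rows = connection.execute(
        "EXPLAIN QUERY PLAN " + QUERY, ("author7",)
    ).fetchall()
    # The last column of each row is the human-readable plan step.
    return " | ".join(row[-1] for row in rows)


# Without an index, the only option is to scan the whole table.
plan_without_index = query_plan(conn)

# With an index on the filtered column, SQLite can search instead of scan.
conn.execute("CREATE INDEX idx_articles_author ON articles (author)")
plan_with_index = query_plan(conn)

print(plan_without_index)
print(plan_with_index)
```

The exact wording of the plan differs between SQLite versions, but the first plan should mention a scan and the second should name the index.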
There are also various optimizations that are database client-specific. Django can't do much here because its goal is to remain database independent. So, if you're using PostgreSQL, MySQL, whatever, read about optimizations and best practices concerning those particular clients.
Database design and database normalization are two very important concepts, in addition to indexing. See Wikipedia: http://en.wikipedia.org/wiki/Database_design and http://en.wikipedia.org/wiki/Database_normalization
In addition to these, having a basic understanding of your database of choice is necessary. Being able to add users, set permissions, and create a database are key things that you should know.
Learning how to backup your data is also a crucial thing.
The list keeps getting longer: one should also be aware of the db relationships that Django handles for you (OneToOne, ManyToMany, ManyToOne). https://docs.djangoproject.com/en/dev/topics/db/models/
The performance impact of JOINs shouldn't be ignored. Accessing model properties in Django is so easy, but understanding that some foreign key relationships could have a huge performance impact is something to consider too.
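To show why easy attribute access can be costly, here is a small sketch with plain sqlite3 instead of the Django ORM. It compares following a foreign key one row at a time with a single JOIN. The tables and the `run` counter are assumptions made for the example. In Django, `select_related()` is the usual way to get the JOIN behaviour.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books (
        id INTEGER PRIMARY KEY,
        title TEXT,
        author_id INTEGER REFERENCES authors(id)
    );
""")
conn.executemany(
    "INSERT INTO authors (id, name) VALUES (?, ?)",
    [(i, f"author{i}") for i in range(1, 11)],
)
conn.executemany(
    "INSERT INTO books (title, author_id) VALUES (?, ?)",
    [(f"book{i}", i) for i in range(1, 11)],
)

query_count = 0


def run(sql, params=()):
    """Execute a query and count it, so the two approaches can be compared."""
    global query_count
    query_count += 1
    return conn.execute(sql, params).fetchall()


# Approach 1: one query for the books, then one more per book for its author.
query_count = 0
lazy_result = []
for title, author_id in run("SELECT title, author_id FROM books ORDER BY id"):
    (name,) = run("SELECT name FROM authors WHERE id = ?", (author_id,))[0]
    lazy_result.append((title, name))
lazy_queries = query_count

# Approach 2: a single JOIN that fetches books and authors together.
query_count = 0
joined_result = run(
    "SELECT b.title, a.name FROM books b "
    "JOIN authors a ON a.id = b.author_id ORDER BY b.id"
)
joined_queries = query_count

print(lazy_queries, joined_queries)
```

Both approaches return the same rows; the difference is the number of round trips to the database, which grows with the number of books in the first approach.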
Once you have a basic understanding of these things you should be at a pretty good starting point for creating a non-trivial django app!
Wikipedia has a nice article about database indexes. They are similar(ish) to an index in a book, i.e. an index lets you (the computer) find things faster because you just look at the index (probably a very bad example :-)
As for performance, there are many things you can do. Presumably, because it is a very detailed subject in itself and is particular to each RDBMS, it would be distracting / irrelevant for them (Django) to go into great detail. The best thing is really to google performance tips for your particular RDBMS. There are some general tips, such as indexing, limiting queries to only return the required data etc.
I think one of the main things is a good design, sticking as much as possible to Normal Form and in general actually taking your database into consideration before programming your models etc (which clearly you seem to be doing). Naming conventions are also a big plus, remembering explicit is better than implicit :-)
To summarise:
Learn/understand the fundamentals such as the relational model
Decide on a naming convention
Design your database perhaps using an ERM tool
Prefer surrogate ID's
Use the correct data type of minimum possible size
Use indexes appropriately and don't over index
Avoid unnecessary/over querying
Prioritise security and stability over raw performance
Once you have an up and running database, 'tune' it by analysing/profiling settings, queries, design etc
Backup and archive regularly - cron
Hang out here :-)
If required advance into replication (master/slave - django supports this quite well too)
Consider upgrading your hardware
Don't get too hung up about it

Raw SQL vs OOP based queries (ORM)?

I was doing a project that requires frequent database access, insertions and deletions. Should I go for Raw SQL commands or should I prefer to go with an ORM technique? The project can work fine without any objects and using only SQL commands? Does this affect scalability in general?
EDIT: The project is one of the types where the user isn't provided with my content, but the user generates content, and the project is online. So, the amount of content depends upon the number of users, and if the project has even 50000 users, and additionally every user can create content or read content, then what would be the most apt approach?
If you have no (or limited) experience with ORMs, then it will take time to learn a new API. Plus, you have to keep in mind that ORMs sacrifice speed for 'magic'. For example, most ORMs will select the wildcard '*' for fields, even when you just need a list of titles from your Articles table.
And ORMs will always fail in niche cases.
Most ORMs out there (the ones based on the ActiveRecord pattern) are extremely flawed from OOP's point of view. They create a tight coupling between your database structure and class/model.
You can think of ORMs as technical debt. They make the start of a project easier. But, as the code grows more complex, you will begin to encounter more and more problems caused by limitations in the ORM's API. Eventually, you will hit situations where it is impossible to do something with the ORM and you will have to start writing SQL fragments and entire statements directly.
I would suggest to stay away from ORMs and implement a DataMapper pattern in your code. This will give you separation between your Domain Objects and the Database Access Layer.
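As a minimal sketch of what that separation could look like (in Python with sqlite3, purely for illustration), the domain object below knows nothing about the database and a separate mapper owns all the SQL. The `Article` and `ArticleMapper` names and the table layout are hypothetical, not taken from any particular library.

```python
import sqlite3
from dataclasses import dataclass
from typing import Optional


@dataclass
class Article:
    """Plain domain object: knows nothing about the database."""
    title: str
    body: str
    id: Optional[int] = None


class ArticleMapper:
    """Hypothetical mapper: the only place that contains SQL for articles."""

    def __init__(self, connection: sqlite3.Connection) -> None:
        self._conn = connection

    def insert(self, article: Article) -> Article:
        cursor = self._conn.execute(
            "INSERT INTO articles (title, body) VALUES (?, ?)",
            (article.title, article.body),
        )
        article.id = cursor.lastrowid
        return article

    def find(self, article_id: int) -> Optional[Article]:
        row = self._conn.execute(
            "SELECT id, title, body FROM articles WHERE id = ?", (article_id,)
        ).fetchone()
        if row is None:
            return None
        return Article(id=row[0], title=row[1], body=row[2])

    def titles(self) -> list:
        # Select only the column that is needed, not '*'.
        rows = self._conn.execute("SELECT title FROM articles ORDER BY id")
        return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT, body TEXT)"
)
mapper = ArticleMapper(conn)
saved = mapper.insert(Article(title="Hello", body="First post"))
print(mapper.find(saved.id))
print(mapper.titles())
```

Because every query lives in the mapper, a slow one can be rewritten by hand without touching the domain class.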
I'd say it's better to try to achieve the objective in the most simple way possible.
If using an ORM has no real added advantage, and the application is fairly simple, I would not use an ORM.
If the application is really about processing large sets of data, and there is no business logic, I would not use an ORM.
That doesn't mean that you shouldn't design your application properly though, but again: if using an ORM doesn't give you any benefit, then why should you use it?
For speed of development, I would go with an ORM, in particular if most data access is CRUD.
This way you don't have to also develop the SQL and write data access routines.
Scalability shouldn't suffer, though you do need to understand what you are doing (you could hurt scalability with raw SQL as well).
If the project is oriented toward either:
- data editing (as in viewing simple tables of data and editing them)
- performance (as in designing the fastest algorithm to do a simple task)
then you could go with direct SQL commands in your code.
The thing you don't want to do is take this approach in a large piece of software, where you end up with many classes and lots of code. If you are in this case and you scatter SQL everywhere in your code, you will clearly regret it someday. You will have a hard time making changes to your domain model. Any modification would become really hard (except for adding functionality or entities independent of the existing ones).
More information would be good, though, as :
- What do you mean by frequent (how frequent) ?
- What performance do you need ?
It seems you're making some sort of CMS service. My bet is you don't want to start stuffing your code with SQL. @teresko's pattern suggestion seems interesting: it separates your application logic from the DB (which is always good) while giving you the possibility to customize every query. Nonetheless, adding a layer that fills in-memory objects can take more time than simply using the database result to write your page, but I don't think that small difference should matter in your case.
I'd suggest choosing a good pattern that separates your business logic and data access, like what @teresko suggested.
It depends a bit on timescale and your current knowledge of MySQL and ORM systems. If you don't have much time, just do whatever you know best, rather than wasting time learning a whole new set of code.
With more time, an ORM system like Doctrine or Propel can massively improve your development speed. When the schema is still changing a lot, you don't want to be spending a lot of time just rewriting queries. With an ORM system, it can be as simple as changing the schema file and clearing the cache.
Then when the design settles down, keep an eye on performance. If you do use ORM and your code is solid OOP, it's not too big an issue to migrate to SQL one query at a time.
That's the great thing about coding with OOP - a decision like this doesn't have to bind you forever.
I would always recommend using some form of ORM for your data access layer, as there has been a lot of time invested into the security aspect. That alone is a reason to not roll your own, unless you feel confident about your skills in protecting against SQL injection and other vulnerabilities.
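To illustrate the injection risk mentioned above, here is a small sketch using Python's sqlite3 module. The `users` table and the input string are invented for the example. Bound parameters, which ORMs typically use under the hood, are also available when writing raw SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])

# Input crafted to break out of the quoted string in the query below.
malicious_input = "nobody' OR '1'='1"

# Unsafe: the input is pasted into the SQL text, so its quote characters
# change the meaning of the WHERE clause and every row matches.
unsafe_sql = "SELECT name FROM users WHERE name = '" + malicious_input + "'"
unsafe_rows = conn.execute(unsafe_sql).fetchall()

# Safe: the input is passed as a bound parameter and is only ever
# treated as a value to compare against, never as SQL.
safe_rows = conn.execute(
    "SELECT name FROM users WHERE name = ?", (malicious_input,)
).fetchall()

print(len(unsafe_rows), len(safe_rows))
```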

SQL generalization/specialization, data redundancy

I have three tables: actions, messages, likes. They define an inheritance hierarchy: messages and likes are children of actions (specialization).
Message and Like both have the columns userId and createdAt. Those should of course be moved to the parent table Action and removed from Message and Like. But there's only one case when I need to select both messages and likes from the database; in other cases I select only one of them, either messages or likes.
Is it ok to duplicate userId and createdAt in the child and parent tables? It costs disk space but saves one join: otherwise I would have to join messages and likes with actions every time I needed userId and createdAt. What's more, I would need to change my current code...
What would you suggest?
In my opinion this is a case of premature optimization (or premature denormalization, if you prefer). You're guessing that the join overhead will cause significant problems, so you're guessing that duplicating the userId and createdAt columns in the dependent tables will improve performance significantly.
I suggest that you should not duplicate columns until you know there's a real problem. I keep a few observations on performance optimization tacked up on the wall to remind myself of what I should do in similar cases:
It ain’t broke ‘til it’s broke.
You can’t improve what you haven’t measured.
Programs spend surprising amounts of time in the damnedest places.
Make it run. Make it run right. Make it run right fast.
optimization is literally the last thing you should be doing.
doing things wrong faster is no great benefit.
Also a few comments on denormalization:
You can’t denormalize that which is not normalized.
Most developers wouldn’t know third-normal form if it leapt out from behind their screen, screamed like a banshee, and cracked a baseball bat over their heads.
Denormalization is suggested as a panacea for database performance issues. The problem is that too often those recommending denormalization have never normalized anything.
“Denormalization for performance reasons” is an excuse for sloppy, “do what we’ve always done” thinking, especially when such denormalization is enshrined in the design.
In my experience, I am not able to identify where performance problems will occur before writing code. Problems always seem to occur in places where I would never have thought to look. Thus, I've found that my best choice is always to write the simplest, clearest code that I can and to design the database as simply as I can, following the normalization rules to the best of my ability, and then to deal with what turns up. There may still be performance issues which need attention (but, surprisingly, not really all that often), but in the end I'll end up with simple, clear, and easily understood/maintained code, running on a simple, well-designed database.
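For reference, the normalized layout with the three tables from the question might look roughly like the sketch below (Python with sqlite3, for illustration only). The `body` and `messageId` columns and the helper functions are assumptions; only userId and createdAt come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE actions (
        id        INTEGER PRIMARY KEY,
        userId    INTEGER NOT NULL,
        createdAt TEXT    NOT NULL
    );
    CREATE TABLE messages (
        actionId INTEGER PRIMARY KEY REFERENCES actions(id),
        body     TEXT NOT NULL
    );
    CREATE TABLE likes (
        actionId  INTEGER PRIMARY KEY REFERENCES actions(id),
        messageId INTEGER NOT NULL REFERENCES messages(actionId)
    );
""")


def add_action(user_id, created_at):
    """Insert the shared parent row and return its id."""
    cursor = conn.execute(
        "INSERT INTO actions (userId, createdAt) VALUES (?, ?)",
        (user_id, created_at),
    )
    return cursor.lastrowid


def add_message(user_id, created_at, body):
    action_id = add_action(user_id, created_at)
    conn.execute(
        "INSERT INTO messages (actionId, body) VALUES (?, ?)", (action_id, body)
    )
    return action_id


def add_like(user_id, created_at, message_id):
    action_id = add_action(user_id, created_at)
    conn.execute(
        "INSERT INTO likes (actionId, messageId) VALUES (?, ?)",
        (action_id, message_id),
    )
    return action_id


message_id = add_message(1, "2012-01-01 10:00:00", "hello")
add_like(2, "2012-01-01 10:05:00", message_id)

# Messages with their shared columns: one join on the primary key.
messages = conn.execute(
    "SELECT a.userId, a.createdAt, m.body "
    "FROM messages m JOIN actions a ON a.id = m.actionId"
).fetchall()

# The one case that needs both kinds: read the parent table alone.
all_actions = conn.execute(
    "SELECT id, userId, createdAt FROM actions ORDER BY createdAt"
).fetchall()

print(messages)
print(all_actions)
```

The join is on a primary key, which is usually about the cheapest join a database can do, so it is worth measuring before duplicating the columns.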
Share and enjoy.

Storing multiple choice values in database

Say I let a user check off the languages she speaks and store them in a db. Important side note: I will not search the db for any of those values, as I will have a separate search engine for search.
Now, the obvious way of storing these values is to create a table like
UserID nvarchar(50),
LookupLanguageID int
but the site will be high load and we are trying to eliminate any overhead where possible, so in order to avoid joins with main member table when showing results on UI, I was thinking of storing languages for a user in the main table, having them comma separated, like "12,34,65"
Again, I don't search for them so I don't worry about having to do fulltext index on that column.
I don't really see any problems with this solution, but am I overlooking anything?
You don't search for them now
Data is useless to anything but this one situation
No data integrity (eg no FK)
You still have to change to "English,German" etc for display
"Give me all users who speak x" = FAIL
The list is actually a presentation issue
It's your system, though, and I look forward to answering the inevitable "help" questions later...
You might not be missing anything now, but when your requirements change you might regret that decision. You should store it normalized like your first instinct suggested. That's the correct approach.
What you're suggesting is a classic premature optimization. You don't know yet whether that join will be a bottleneck, and so you don't know whether you're actually buying any performance improvement. Wait until you can profile the thing, and then you'll know whether that piece needs to be optimized.
If it does, I would consider a materialized view, or some other approach that pre-computes the answer using the normalized data to a cache that is not considered the book of record.
More generally, there are a lot of possible optimizations that could be done, if necessary, without compromising your design in the way you suggest.
This type of storage has almost ALWAYS come back to haunt me. For one, you are not even in first normal form. For another, some manager or the other will definitely come back and say.. "hey, now that we store this, can you write me a report on... "
I would suggest going with a normalized design. Put it in a separate table.
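A minimal sketch of the separate-table design, using Python's sqlite3 module for illustration. The table names, the sample users and the helper functions are assumptions; the ids 12, 34 and 65 echo the example in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE languages (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE user_languages (
        user_id     TEXT    NOT NULL,
        language_id INTEGER NOT NULL
                    REFERENCES languages(id) ON DELETE CASCADE,
        PRIMARY KEY (user_id, language_id)
    );
    INSERT INTO languages (id, name)
        VALUES (12, 'English'), (34, 'German'), (65, 'French');
    INSERT INTO user_languages (user_id, language_id)
        VALUES ('alice', 12), ('alice', 34), ('bob', 34);
""")


def languages_of(user_id):
    """Display names of one user's languages, via a single join."""
    rows = conn.execute(
        "SELECT l.name FROM user_languages ul "
        "JOIN languages l ON l.id = ul.language_id "
        "WHERE ul.user_id = ? ORDER BY l.name",
        (user_id,),
    )
    return [row[0] for row in rows]


def speakers_of(language_name):
    """The 'give me all users who speak x' query."""
    rows = conn.execute(
        "SELECT ul.user_id FROM user_languages ul "
        "JOIN languages l ON l.id = ul.language_id "
        "WHERE l.name = ? ORDER BY ul.user_id",
        (language_name,),
    )
    return [row[0] for row in rows]


alice_before = languages_of("alice")
german_speakers = speakers_of("German")

# Removing a language is one statement; the cascade cleans up every profile.
conn.execute("DELETE FROM languages WHERE name = ?", ("German",))
alice_after = languages_of("alice")
remaining_links = conn.execute(
    "SELECT COUNT(*) FROM user_languages"
).fetchone()[0]

print(alice_before, german_speakers, alice_after, remaining_links)
```

With a comma-separated column, the same removal would mean rewriting a string in every affected user row.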
You lose join capability (obviously).
You have to reparse the list on each page load / post back. Which results in more code client side.
You lose all pretenses of trying to keep database integrity. Just imagine if you decide to REMOVE a language later on... What's the sql going to be to fix all of your user profiles?
Assuming your various profile options are stored in a lookup table in the DB, you still have to run "30 queries" per profile page. If they aren't then you have to code deploy for each little change. bad, very bad.
Basing a design decision on something that "won't happen" is an absolute recipe for failure. Sure, the business people said they won't ever do that... Until they think of a reason they absolutely must do it. Today. Which will be promptly after you finish coding this.
As I stated in a comment, 30 queries for a low use page is nothing. Don't sweat it, and definitely don't optimize unless you know for darn sure it's necessary. Guess how many queries SO does for its profile page?
I generally stay away from the solution you described; you are asking for trouble when you store relational data in such a fashion.
As alternative solution:
You could store as one bitmasked integer, for example:
0 - No selection
1 - English
2 - Spanish
4 - German
8 - French
16 - Russian
--and so on powers of 2
So if someone selected English and Russian the value would be 17, and you could easily query the values with Bitwise operators.
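A short sketch of the bitmask idea in Python, using the standard `enum.IntFlag`. The flag values follow the list above; the `speaks` and `decode` helpers are invented for illustration.

```python
from enum import IntFlag


class Language(IntFlag):
    NONE = 0
    ENGLISH = 1
    SPANISH = 2
    GERMAN = 4
    FRENCH = 8
    RUSSIAN = 16


# Combine selections with bitwise OR; the result fits in one integer column.
selection = Language.ENGLISH | Language.RUSSIAN
stored_value = int(selection)


def speaks(stored: int, language: Language) -> bool:
    """Test a single flag with bitwise AND."""
    return bool(stored & language)


def decode(stored: int) -> list:
    """Turn the stored integer back into a list of language names."""
    return [
        lang.name
        for lang in Language
        if lang.value and stored & lang.value  # skip the zero-valued NONE
    ]


print(stored_value)
print(speaks(stored_value, Language.RUSSIAN), speaks(stored_value, Language.GERMAN))
print(decode(stored_value))
```

Note that the list of languages now lives in application code, so adding one means a deploy, and an integer column limits how many flags fit.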
Premature optimization is the root of all evil.
EDIT: Apparently the context of my observation has been misconstrued by some - and hence the downvotes. So I will clarify.
Denormalizing your model to make things easier and/or 'more performant' - such as creating concatenated columns to represent business information (as in the OP case) - is what I refer to as a "premature optimization".
While there may be some extreme edge cases where there is no other way to get the performance necessary for a particular problem domain - one should rarely assume this is the case. In general, such premature optimizations cause long-term grief because they are hard to undo - changing your data model once it is in production takes a lot more effort than when it is initially deployed.
When designing a database, developers (and DBAs) should apply standard practices like normalization to ensure that their data model expresses the business information being collected and managed. I don't believe that proper use of data normalization is an "optimization" - it is a necessary practice. In my opinion, data modelers should always be on the lookout for models that could be restructured to (at least) third normal form (3NF).
If you're not querying against them, you don't lose anything by storing them in a form like your initial plan.
If you are, then storing them in the comma-delimited format will come back to haunt you, and I doubt that any speed savings would be significant, especially when you factor in the work required to translate them back.
You seem to be extremely worried about adding in a few extra lookup table joins. In my experience, the time it takes to actually transmit the HTML response and have the browser render it far exceeds the cost of a few extra table joins, especially if you are using indexes for your primary and foreign keys (as you should be). It's like you are planning a multi-day cross-country trip and you are worried about 1 extra 10 minute bathroom stop.
The lack of long-term flexibility and data integrity are not worth it for such a small optimization (which may not be necessary or even noticeable).
As stated very well in the above few posts.
If you want a contrary view to this debate, look at WordPress. Its tables are chock-full of delimited data, and it's a great, simple platform.

Large volume database updates with an ORM

I like ORM tools, but I have often thought that for large updates (thousands of rows), it seems inefficient to load, update and save when something like
UPDATE [table] set [column] = [value] WHERE [predicate]
would give much better performance.
However, assuming one wanted to go down this route for performance reasons, how would you then make sure that any objects cached in memory were updated correctly?
Say you're using LINQ to SQL, and you've been working on a DataContext, how do you make sure that your high-performance UPDATE is reflected in the DataContext's object graph?
This might be a "you don't" or "use triggers on the DB to call .NET code that drops the cache" etc etc, but I'm interested to hear common solutions to this sort of problem.
You're right, in this instance using an ORM to load, change and then persist records is not efficient. My process goes something like this:
1) In early implementation, use the ORM (in my case NHibernate) exclusively
2) As development matures, identify performance issues, which will include large updates
3) Refactor those out to a SQL or SP approach
4) Use the Refresh(object) command to update cached objects
My big problem has been informing other clients that the update has occurred. In most instances we have accepted that some clients will be stale, which is the case with standard ORM usage anyway, and then check a timestamp on update/insert.
Most ORMs also have facilities for performing large or "bulk" updates efficiently. The Stateless Session is one such mechanism available in Hibernate for Java which apparently will be available in NHibernate 2.x.
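The steps above can be sketched roughly as follows, in Python with sqlite3 rather than NHibernate. The `cache` dictionary stands in for an ORM's in-memory objects, and `load`, `refresh` and `bulk_reprice` are hypothetical helpers, not NHibernate or LINQ to SQL APIs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products (id INTEGER PRIMARY KEY, category TEXT, price REAL)"
)
conn.executemany(
    "INSERT INTO products (id, category, price) VALUES (?, ?, ?)",
    [(1, "books", 10.0), (2, "books", 20.0), (3, "toys", 5.0)],
)

# Stand-in for an ORM's object cache: id -> the object loaded earlier.
cache = {}


def load(product_id):
    """Return the cached object, reading it from the database on first use."""
    if product_id not in cache:
        row = conn.execute(
            "SELECT id, category, price FROM products WHERE id = ?",
            (product_id,),
        ).fetchone()
        cache[product_id] = {"id": row[0], "category": row[1], "price": row[2]}
    return cache[product_id]


def refresh(product_id):
    """Re-read one cached object from the database, updating it in place."""
    row = conn.execute(
        "SELECT category, price FROM products WHERE id = ?", (product_id,)
    ).fetchone()
    cache[product_id].update(category=row[0], price=row[1])


def bulk_reprice(category, factor):
    """One set-based UPDATE instead of load, change, save per row."""
    cursor = conn.execute(
        "UPDATE products SET price = price * ? WHERE category = ?",
        (factor, category),
    )
    return cursor.rowcount


book = load(1)
rows_changed = bulk_reprice("books", 2)
stale_price = book["price"]  # the cached object has not seen the UPDATE

for cached_id in list(cache):
    refresh(cached_id)
fresh_price = book["price"]  # same object, now matching the database

print(rows_changed, stale_price, fresh_price)
```

This only fixes the cache in the process that ran the update; other clients stay stale until they refresh, which is the trade-off described above.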
ORMs are great for rapid development, but you're right -- they're not efficient. They're great in that you don't need to think about the underlying mechanisms which convert your objects in memory to rows in tables and back again. However, many times the ORM doesn't pick the most efficient process to do that. If you really care about the performance of your app, it's best to work with a DBA to help you design the database and tune your queries appropriately. (or at least understand the basic concepts of SQL yourself)
Bulk updates are a questionable design. Sometimes they seem necessary; in many cases, however, a better application design can remove the need for bulk updates.
Often, some other part of the application already touched each object one at a time; the "bulk" update should have been done in the other part of the application.
In other cases, the update is a prelude to processing elsewhere. In this case, the update should be part of the later processing.
My general design strategy is to refactor applications to eliminate bulk updates.
ORMs just won't be as efficient as hand-crafted SQL. Period. Just like hand-crafted assembler will be faster than C#. Whether or not that performance difference matters depends on lots of things. In some cases the higher level of abstraction ORMs give you might be worth more than potentially higher performance, in other cases not.
Traversing relationships with object code can be quite nice but as you rightly point out there are potential problems.
That being said, I personally view ORMs to be largely a false economy. I won't repeat myself here but just point to the question "Using an ORM or plain SQL?"