Design Question - Storing Images as Objects - oop

OK, so my site is centered on a lot of dynamic, user-entered images for different user-defined objects throughout. So I have many objects, and those objects have images.
Is it good practice to treat images like an object since the "Image Name" and "Image Bytes" columns are repeated many times? Or is it more sensible just to include those two columns in the tables for each object?
I guess I'm answering my own question while typing... I would be creating an extra join and an extra column (there would be three, with Name, Id, and ImageId on each table).
However, there are several tables with multiple images per object... so I guess it would be better? Opinions?

I generally have a Files table that stores files more generically. Then in your other tables, you could have a column for each image (file) which is just a reference into the files table.
Your files table would have all the normal stuff like ID, Filename, Size, type, etc. Then yes, you'd just join into it to get what you need for whatever query you're running.
In case there is any doubt--I'd strongly discourage you from storing files directly in the database. I don't think that's what you're after, but if anyone else gets that idea--just don't do it!
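As a rough sketch of the files-table layout described in this answer, here is one way it could look, using Python's built-in sqlite3 so it runs as-is. The table and column names (files, products, image_file_id, path) are assumptions for illustration, not anything from the original post, and the image bytes are assumed to live on disk with only the path stored.

```python
import sqlite3

# Hypothetical schema: all table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE files (
    id        INTEGER PRIMARY KEY,
    filename  TEXT NOT NULL,
    size      INTEGER NOT NULL,
    mime_type TEXT NOT NULL,
    path      TEXT NOT NULL   -- location on disk; the bytes stay outside the database
);
CREATE TABLE products (
    id            INTEGER PRIMARY KEY,
    name          TEXT NOT NULL,
    image_file_id INTEGER REFERENCES files(id)   -- reference into the files table
);
""")
conn.execute(
    "INSERT INTO files VALUES (1, 'widget.png', 20480, 'image/png', '/var/uploads/1.png')"
)
conn.execute("INSERT INTO products VALUES (1, 'Widget', 1)")


def product_with_image(product_id):
    """Return (product name, image filename, image path), or None if no such product."""
    return conn.execute(
        """SELECT p.name, f.filename, f.path
           FROM products AS p
           LEFT JOIN files AS f ON f.id = p.image_file_id
           WHERE p.id = ?""",
        (product_id,),
    ).fetchone()


print(product_with_image(1))
```

Any other table that needs an image would get its own reference column pointing at files in the same way.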

The question to ask is "do I have multiple objects with the same image?" If so, then you might want to think about redundancy. Also, does the database you're using handle "blobs" well? Or would it be better to keep a file path to the image and store the images separately?

Intuitively, I would find it easier to deal with the images and objects as separate concepts. It just feels more flexible.
If you need to attach things other than images to particular objects, it becomes as simple as associating the object table entry with the "other things" table entry, although you'd probably have to swap which way the foreign key pointed. It would also give you the ability to use those images in other user objects, or to allow users to borrow other users' images.
The downside to this approach is you would pay a bit of a performance penalty for needing to join one or more tables together. Probably not the biggest deal, though, if your database's query analyzer is reasonably smart.
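To make the "separate concepts" idea concrete, here is a minimal sketch with the foreign keys moved into a link table, so one object can have several images and one image can be reused by several objects. The names images, objects, and object_images are made up for the example.

```python
import sqlite3

# Hypothetical schema: a link table lets objects and images vary independently.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE images  (id INTEGER PRIMARY KEY, name  TEXT NOT NULL);
CREATE TABLE objects (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE object_images (
    object_id INTEGER NOT NULL REFERENCES objects(id),
    image_id  INTEGER NOT NULL REFERENCES images(id),
    PRIMARY KEY (object_id, image_id)
);
INSERT INTO images  VALUES (1, 'front.jpg'), (2, 'back.jpg');
INSERT INTO objects VALUES (1, 'Guitar'), (2, 'Guitar case');
-- The guitar has two images; the case reuses one of them.
INSERT INTO object_images VALUES (1, 1), (1, 2), (2, 1);
""")


def image_names(object_id):
    """Return the names of all images attached to one object, in image-id order."""
    rows = conn.execute(
        """SELECT i.name
           FROM object_images AS oi
           JOIN images AS i ON i.id = oi.image_id
           WHERE oi.object_id = ?
           ORDER BY i.id""",
        (object_id,),
    )
    return [name for (name,) in rows]


print(image_names(1), image_names(2))
```

The cost is the extra join mentioned above; the gain is that sharing or borrowing an image is just one more row in object_images.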

Related

Should I be concerned that ORMs, by default, return all columns?

In my limited experience in working with ORMs (so far LLBL Gen Pro and Entity Framework 4), I've noticed that inherently, queries return data for all columns. I know NHibernate is another popular ORM, and I'm not sure that this applies with it or not, but I would assume it does.
Of course, I know there are workarounds:
Create a SQL view and create models and mappings on the view
Use a stored procedure and create models and mappings on the result set returned
I know that adhering to certain practices can help mitigate this:
Ensuring your row counts are reasonably limited when selecting data
Ensuring your tables aren't excessively wide (large number of columns and/or large data types)
So here are my questions:
Are the above practices sufficient, or should I still consider finding ways to limit the number of columns returned?
Are there other ways to limit returned columns other than the ones I listed above?
How do you typically approach this in your projects?
Thanks in advance.
UPDATE: This sort of stems from the notion that SELECT * is thought of as a bad practice. See this discussion.
One of the reasons to use an ORM of nearly any kind is to delay a lot of those lower-level concerns and focus on the business logic. As long as you keep your joins reasonable and your table widths sane, ORMs are designed to make it easy to get data in and out, and that requires having the entire row available.
Personally, I consider issues like this premature optimization until encountering a specific case that bogs down because of table width.
First off: great question, and about time someone asked this! :-)
Yes, the fact an ORM typically returns all columns for a database table is something you need to take into consideration when designing your systems. But as you've mentioned - there are ways around this.
The main fact for me is to be aware that this is what happens - either a SELECT * FROM dbo.YourTable, or (better) a SELECT (list of all columns) FROM dbo.YourTable.
This is not a problem when you really want the whole object and all its properties, and as long as you load a few rows, that's fine, too - the convenience beats the raw performance.
You might need to think about changing your database structures a little bit - things like:
maybe put large columns like BLOBs into separate tables with a 1:1 link to your base table - that way, a select on the parent tables doesn't grab all those large blobs of data
maybe put groups of columns that are optional, that might only show up in certain situations, into separate tables and link them - again, just to keep the base tables lean'n'mean
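The first suggestion (large columns in a 1:1 side table) might look roughly like this. It is only a sketch in Python with sqlite3, and the names documents and document_contents are invented for the example; the point is that listing rows from the lean parent table never reads the large column.

```python
import sqlite3

# Hypothetical 1:1 split: the lean parent table carries no BLOB column at all.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE documents (
    id    INTEGER PRIMARY KEY,
    title TEXT NOT NULL
);
CREATE TABLE document_contents (
    document_id INTEGER PRIMARY KEY REFERENCES documents(id),  -- same key, so 1:1
    body        BLOB NOT NULL
);
""")
conn.execute("INSERT INTO documents VALUES (1, 'Manual')")
conn.execute("INSERT INTO document_contents VALUES (1, ?)", (b"\x00" * 1_000_000,))


def list_titles():
    """List view: selects every column of the parent table, yet no large data comes back."""
    return conn.execute("SELECT id, title FROM documents ORDER BY id").fetchall()


def load_body(document_id):
    """Detail view: fetch the large column only when it is actually needed."""
    row = conn.execute(
        "SELECT body FROM document_contents WHERE document_id = ?", (document_id,)
    ).fetchone()
    return row[0] if row else None


print(list_titles(), len(load_body(1)))
```

An ORM mapped onto documents can keep returning "all columns" without dragging the megabyte-sized body along.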
Also: avoid trying to "arm-wrestle" your ORM into doing bulk operations - that's just not their strong point.
And: keep an eye on performance, and try to pick an ORM that allows you to change certain operations into e.g. stored procedures - Entity Framework 4 allows this. So if the deletes are killing you - maybe you just write a Delete stored proc for that table and handle that operation differently.
The question here covers your options fairly well. Basically you're limited to hand-crafting the HQL/SQL. It's something you want to do if you run into scalability problems, but if you do, in my experience, it can have a very large positive impact. In particular, it saves a lot of disk and network IO, so your scalability can take a big jump. Not something to do right away though: analyse, then optimise.
Are there other ways to limit returned columns other than the ones I listed above?
NHibernate lets you add projections to your queries so you wouldn't need to use views or procs just to limit your columns.
For me this has only been an issue if the table has LOTS of columns (more than 30) or if a column holds a lot of data, for example over 5000 characters in a field.
The approach I have used is to just map another object to the existing table but with only the fields I need. So for a search that populates a table with 100 rows I would have a MyObjectLite, but when I click to view the details of that row I would call a GetById and return a MyObject that has all the columns.
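Outside of any particular ORM, the lite/full split described above can be sketched like this in Python. MyObjectLite and MyObject are the names from the answer; the table name, column names, and the search and get_by_id helpers are assumptions made up for the example.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class MyObjectLite:
    """Narrow mapping used for search results and lists."""
    id: int
    name: str


@dataclass
class MyObject:
    """Full mapping used for the detail view."""
    id: int
    name: str
    description: str
    notes: str


# Hypothetical table; both classes map onto the same rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE my_objects (id INTEGER PRIMARY KEY, name TEXT, description TEXT, notes TEXT)"
)
conn.executemany(
    "INSERT INTO my_objects VALUES (?, ?, ?, ?)",
    [(1, "First", "long description", "long notes"),
     (2, "Second", "another description", "more notes")],
)


def search():
    """Populate a list with only the two columns the list needs."""
    rows = conn.execute("SELECT id, name FROM my_objects ORDER BY id")
    return [MyObjectLite(*row) for row in rows]


def get_by_id(object_id):
    """Load every column for a single row, or None if it does not exist."""
    row = conn.execute(
        "SELECT id, name, description, notes FROM my_objects WHERE id = ?",
        (object_id,),
    ).fetchone()
    return MyObject(*row) if row else None


print(search())
print(get_by_id(2))
```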
Another approach is to use custom SQL or stored procs, but I think you should only go down this path if you REALLY need the performance gain and have users complaining. So unless there is a performance problem, do not waste your time trying to fix a problem that does not exist.
You can limit the number of returned columns by using a Projection and Transformers.AliasToBean with a DTO. Here is how it looks in the Criteria API:
.SetProjection(Projections.ProjectionList()
.Add(Projections.Property("Id"), "Id")
.Add(Projections.Property("PackageName"), "Caption"))
.SetResultTransformer(Transformers.AliasToBean(typeof(PackageNameDTO)));
In LLBLGen Pro, you can return Typed Lists which not only allow you to define which fields are returned but also allow you to join data so you can pull a custom list of fields from multiple tables.
Overall, I agree that for most situations, this is premature optimization.
One of the big advantages of using LLBLGen and other ORMs as well (I just feel confident speaking about LLBLGen because I have used it since its inception) is that the performance of the data access has been optimized by folks who understand the issues better than your average bear.
Whenever they figure out a way to further speed up their code, you get those changes "for free" just by re-generating your data layer or by installing a new dll.
Unless you consider yourself an expert at writing data access code, ORMs probably improve most developers' efficacy and accuracy.

Address book database design: denormalize?

I'm designing a contact manager/address book-like application but can't settle on the database design.
In my current setup I have a Contact, which has Addresses, Phonenumbers, Emails, and Organizations. All contact properties are currently separate tables with a fk to the Contact table. Needless to say a contact can have any number of these properties.
Now, I find myself joining all these tables together if I want to read contacts into the app. Since no filters, reverse lookups, sorts etc. are performed on the related tables, isn't it a better/simpler solution to just store the related fields as json-encoded lists on direct properties of the Contact table?
E.g., instead of a Contact with a fk to a phonenumber table with 3 entries, just encode all phonenumbers and store them into a field of the Contact table?
Any insights really appreciated! (fyi I'm using Django although that doesn't really matter)
Can you guarantee that your app will never grow to need these other functionalities? Do you really want to paint yourself into the corner such that you can't easily support all of this later?
Generally, denormalization happens only for performance reasons. And then, a copy of the normalized data is still kept for live work and the denormalized data is used for offline processing where having a static snapshot is fine.
Get used to writing joins. That's the way SQL works. Having to do so doesn't mean something is wrong.
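For a sense of what the join buys you, here is a small sketch with just the Contact and Phonenumber tables from the question (lower-cased; the column names and sample data are invented). Reading a contact's numbers is one join, and the reverse lookup that a JSON-encoded column would make awkward is an ordinary indexed query.

```python
import sqlite3

# Hypothetical, cut-down version of the schema in the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE contact (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE phonenumber (
    id         INTEGER PRIMARY KEY,
    contact_id INTEGER NOT NULL REFERENCES contact(id),
    number     TEXT NOT NULL
);
CREATE INDEX phonenumber_number ON phonenumber(number);
INSERT INTO contact VALUES (1, 'Ada'), (2, 'Grace');
INSERT INTO phonenumber (contact_id, number)
VALUES (1, '555-0100'), (1, '555-0101'), (2, '555-0199');
""")


def numbers_for(contact_id):
    """Forward direction: all numbers belonging to one contact."""
    rows = conn.execute(
        "SELECT number FROM phonenumber WHERE contact_id = ? ORDER BY id",
        (contact_id,),
    )
    return [number for (number,) in rows]


def who_has(number):
    """Reverse lookup: which contacts own this number? Uses the index on number."""
    rows = conn.execute(
        """SELECT c.name
           FROM phonenumber AS p
           JOIN contact AS c ON c.id = p.contact_id
           WHERE p.number = ?
           ORDER BY c.id""",
        (number,),
    )
    return [name for (name,) in rows]


print(numbers_for(1), who_has("555-0199"))
```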
I know I'm too late, but for anyone with the same issue.
IMO, in this case metadata modeling is the way to go.
http://searchdatamanagement.techtarget.com/feature/Data-model-patterns-A-metadata-map
Sounds like you propose taking data currently modelled as five SQL tables and converting it to a common multi-valued type (does your SQL product have good support for this?) The only way I can see this would constitute 'denormalization' would be if you were proposing to violate 1NF, at which point you may as well abandon SQL as a data store because your data would no longer be relational! Otherwise, your data would still be normalized but you will have lost the ability to query its attributes using SQL (unless your SQL product has extensions for querying multi-value attributes). The deciding factor seems to be: do you need to query these attributes using SQL?

Whether to Split Data into a Separate PostgreSQL Table

I am creating an app with a WPF frontend and a PostgreSQL database. The data includes patient addresses and supplier addresses. There is an average of about 3 contacts per mailing address listed. I'm estimating 10,000 - 15,000 contact records per database.
When designing the database structure, it occurred to me that rather than storing mailing addresses in a single "contacts" table, I could have one table storing names and other individual data, with a second table holding addresses. I could then create a relationship between the tables, to match addresses with contacts.
I have a pretty good idea how I can neatly organise situations such as changing the address of a single contact, where the other contacts are staying at the same address.
The question is: is it worth it? Can I expect to save much in the way of storage size? Will this impact the speed of queries adversely? How about if I was using something other than PostgreSQL?
I would strongly suggest normalizing this. You never know what kind of trouble you will run into. LedgerSMB has a relatively decent entity/user/contact/location schema that creates a very flexible environment. You can see it here (starts at line 363):
http://ledger-smb.svn.sourceforge.net/viewvc/ledger-smb/trunk/sql/Pg-database.sql?revision=3042&view=markup
Unless you think a large number of your users will be sharing addresses and they'll be often changing, I don't see the need to normalize out the address portion. In the various places I've worked and seen users tables, sometimes it is, sometimes it isn't - it never really seemed to cause a terrible amount of trouble one way or another.
In terms of performance, with just 10-15k records and proper indexes, I can't imagine you'd notice too much difference one way or the other on modern hardware (although technically the separate table should be slower).
I agree with Joshua. Once it's set up properly (normalized) it's very easy to manage any changes in your app in the future.
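For what it's worth, here is a rough sketch of the normalized layout, including the case from the question where one contact moves while the others stay at the shared address. It uses SQLite through Python so it runs anywhere, not PostgreSQL, and every table, column, and helper name is an assumption for illustration.

```python
import sqlite3

# Hypothetical schema: several contacts can point at one address row.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE addresses (
    id     INTEGER PRIMARY KEY,
    street TEXT NOT NULL,
    city   TEXT NOT NULL
);
CREATE TABLE contacts (
    id         INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    address_id INTEGER REFERENCES addresses(id)
);
INSERT INTO addresses VALUES (1, '1 Main St', 'Springfield');
INSERT INTO contacts VALUES (1, 'Ann', 1), (2, 'Bob', 1), (3, 'Cy', 1);
""")


def move_contact(contact_id, street, city):
    """Give one contact a new address row; contacts sharing the old row are untouched."""
    cur = conn.execute(
        "INSERT INTO addresses (street, city) VALUES (?, ?)", (street, city)
    )
    conn.execute(
        "UPDATE contacts SET address_id = ? WHERE id = ?", (cur.lastrowid, contact_id)
    )


def city_of(contact_id):
    """Look up a contact's city through the address reference."""
    row = conn.execute(
        """SELECT a.city
           FROM contacts AS c
           JOIN addresses AS a ON a.id = c.address_id
           WHERE c.id = ?""",
        (contact_id,),
    ).fetchone()
    return row[0] if row else None


move_contact(2, "9 Elm Rd", "Shelbyville")
print(city_of(1), city_of(2), city_of(3))
```

Correcting a typo in the shared address, by contrast, is a single UPDATE on addresses that all three contacts would see.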

Help with setting up a Database

My site is going to have many products available, but they'll be categorised into completely different sites (domains).
My question is, am I better off lumping all products into one database and using an ID to distinguish between the sites, or should I set up a table and /or DB per site?
Here are my thoughts
SEPARATE DATABASES
Easier to read from a backend
Categorised better
Makes backups more difficult
If I need to make a change to the schema, it will need to be pushed out to all databases
SAME DATABASES
All in one place
Could get unwieldy
One database will have a massive file size and lookups could suffer
Can someone please offer me some advice on which way is best and why?
You didn't give too many details (which makes it difficult to provide a good answer), though the words you chose to use in your question lead me to believe that this is a single application with different "skins".
My site is going to have many products available, but they'll be categorised into completely different sites (domains).
My assumption is that you will have a single web store with several different store fronts: cool-widgets.com, awesome-sprockets.com, neato-things.com, etc. These will all be the same, save for maybe a CSS skin or something simple like that. The store admin stuff will all be done in some central system, and the domain name will simply act as a category name.
As such, splitting the same data into two different containers using an arbitrary criterion (category_name == 'cool-widgets.com') is data partitioning, which is an anti-pattern. Just as you wouldn't have two different user tables based on the user name ([Users$A-to-M] and [Users$N-to-Z]), it makes little sense to have two different tables (or databases) for category names.
There is, and will be, lots of code common among the categories: user management, admin, order processing, data import, etc. It will be far more difficult to aggregate the multiple datastores in the common code than it will be to segregate the categories in the store display code. Not only that, the segregation bugs will be much more obvious: the price comparison page shows items from all three stores. The aggregation bugs will be much less obvious: only three of the four stores were updated. This is why it's an anti-pattern.
Side note: yes, before you say that data partitioning has its uses (which it does), those uses come in far after performance problems occur. Many serious database platforms allow behind-the-scenes partitioning so as not to create a goofy data model.
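A minimal sketch of the single-database layout argued for here, where the domain is just a category on each product. The sites and products tables, their columns, and the sample rows are assumptions; the domains are the ones used as examples above.

```python
import sqlite3

# Hypothetical schema: one products table, with the site as an ordinary column.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sites (id INTEGER PRIMARY KEY, domain TEXT NOT NULL UNIQUE);
CREATE TABLE products (
    id          INTEGER PRIMARY KEY,
    site_id     INTEGER NOT NULL REFERENCES sites(id),
    name        TEXT NOT NULL,
    price_cents INTEGER NOT NULL
);
CREATE INDEX products_site ON products(site_id);
INSERT INTO sites VALUES (1, 'cool-widgets.com'), (2, 'awesome-sprockets.com');
INSERT INTO products (site_id, name, price_cents)
VALUES (1, 'Widget', 499), (1, 'Deluxe widget', 1299), (2, 'Sprocket', 250);
""")


def products_for(domain):
    """Store display code: segregate by site with a WHERE clause."""
    rows = conn.execute(
        """SELECT p.name
           FROM products AS p
           JOIN sites AS s ON s.id = p.site_id
           WHERE s.domain = ?
           ORDER BY p.id""",
        (domain,),
    )
    return [name for (name,) in rows]


def product_counts():
    """Common admin code: aggregate across every site in one query."""
    return conn.execute(
        """SELECT s.domain, COUNT(p.id)
           FROM sites AS s
           LEFT JOIN products AS p ON p.site_id = s.id
           GROUP BY s.domain
           ORDER BY s.domain"""
    ).fetchall()


print(products_for("cool-widgets.com"))
print(product_counts())
```

With one database per site, product_counts would instead have to open every database and merge the results by hand.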
If data needs to be shared among all the sites, then it will be recommended to share the same database since data transfer is eliminated. Also data is more centralized.
If data does not need to be shared among all sites, it'll be good to split up one database per site. As for the difficulty of updating table structures, you can simply record the database changes you make for one (saving the ALTER, UPDATE, DELETE queries in a SQL file) and update the other databases with the same SQL file.
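The "same SQL file for every database" routine could be sketched as below. This is only an illustration: the two in-memory SQLite databases stand in for the per-site databases, the MIGRATION string stands in for the saved SQL file, and all names are made up.

```python
import sqlite3

# Stand-in for the contents of a saved SQL file of schema changes.
MIGRATION = """
ALTER TABLE products ADD COLUMN weight_grams INTEGER;
"""

# Hypothetical per-site databases; in practice these would be separate files or servers.
databases = {name: sqlite3.connect(":memory:") for name in ("site_a", "site_b")}
for db in databases.values():
    db.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")


def apply_migration(script):
    """Run the same script against every site database, one after another."""
    for db in databases.values():
        db.executescript(script)


def column_names(db):
    """Read back the column names of the products table (field 1 of table_info)."""
    return [row[1] for row in db.execute("PRAGMA table_info(products)")]


apply_migration(MIGRATION)
for name, db in databases.items():
    print(name, column_names(db))
```

A real version would also need to remember which databases have already received which script, since a failure halfway through the loop leaves the schemas out of step.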
Storing in different databases might also help with security. You can set different user permissions for each site, and if one gets compromised, the other sites are protected.
Also, you are able to easily maintain and track database when the databases are clearly split up.
As you already say, both options have their pros and cons. Since you're talking about two stores, it probably doesn't matter much.
However, a few questions you might want to ask yourself:
Will it really be two stores, or possibly more? If more, one database might be smarter.
Are the products really the same? If you're gonna have to squeeze products in one general database, because they are of a different kind (eg. cars vs. food; the amount and nature of the details you want to store are completely different), then don't; use two databases / tables instead.
The central question is: what is most likely to become more elaborate in the future: the stores, or the products?
I think separate databases will be easier. You can have a quick-start template database from which you can build a new store database. You can even create a common database to hold the common tables and a list of stores and their databases. After all, you can access any database on the same server using a qualified name. Observe:
SELECT value FROM CommonDB.currencies WHERE type='euro';
SELECT price FROM OldTownDB.Products WHERE id=newtownprodid;
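To try the qualified-name idea without a database server, SQLite's ATTACH DATABASE can serve as a stand-in: it is a different mechanism from several databases on one server, but queries address tables as database.table in the same way. In this sketch CommonDB and currencies come from the queries above; the Products columns, the sample rates, and the helper name are assumptions.

```python
import sqlite3

# The main connection plays the role of one store's database.
conn = sqlite3.connect(":memory:")
# A second in-memory database is attached under the name CommonDB.
conn.execute("ATTACH DATABASE ':memory:' AS CommonDB")
conn.execute(
    "CREATE TABLE CommonDB.currencies (type TEXT PRIMARY KEY, value REAL NOT NULL)"
)
conn.execute(
    "CREATE TABLE Products (id INTEGER PRIMARY KEY, price REAL NOT NULL, currency TEXT NOT NULL)"
)
# Made-up conversion values, purely for the example.
conn.execute("INSERT INTO CommonDB.currencies VALUES ('euro', 1.0), ('dollar', 0.5)")
conn.execute("INSERT INTO Products VALUES (1, 10.0, 'dollar')")


def converted_price(product_id):
    """Join the store's own table with a table in the shared database."""
    row = conn.execute(
        """SELECT p.price * c.value
           FROM main.Products AS p
           JOIN CommonDB.currencies AS c ON c.type = p.currency
           WHERE p.id = ?""",
        (product_id,),
    ).fetchone()
    return row[0] if row else None


print(converted_price(1))
```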

Many-to-many relationship: use associative table or delimited values in a column?

Update 2009.04.24
The main point of my question is not developer confusion and what to do about it.
The point is to understand when delimited values are the right solution.
I've seen delimited data used in commercial product databases (Ektron lol).
SQL Server even has an XML datatype, so that could be used for the same purpose as delimited fields.
/end Update
The application I'm designing has some many-to-many relationships. In the past, I've often used associative tables to represent these in the database. This has caused some confusion to the developers.
Here's an example DB structure:
Document
---------------
ID (PK)
Title
CategoryIDs (varchar(4000))
Category
------------
ID (PK)
Title
There is a many-to-many relationship between Document and Category.
In this implementation, Document.CategoryIDs is a big pipe-delimited list of CategoryIDs.
To me, this is bad because it requires use of substring matching in queries -- which cannot make use of indexes. I think this will be slow and will not scale.
With that model, to get all Documents for a Category, you would need something like the following:
select * from documents where categoryids like '%|' + #targetCategoryId + '|%'
My solution is to create an associative table as follows:
Document_Category
-------------------------------
DocumentID (PK)
CategoryID (PK)
This is confusing to the developers. Is there some elegant alternate solution that I'm missing?
I'm assuming there will be thousands of rows in Document. Category may be like 40 rows or so. The primary concern is query performance. Am I over-engineering this?
Is there a case where it's preferred to store lists of IDs in database columns rather than pushing the data out to an associative table?
Consider also that we may need to create many-to-many relationships among documents. This would suggest an associative table Document_Document. Is that the preferred design or is it better to store the associated Document IDs in a single column?
Thanks.
This is confusing to the developers.
Get better developers. That is the right approach.
Your suggestion IS the elegant, powerful, best practice solution.
Since I don't think the other answers said the following strongly enough, I'm going to do it.
If your developers 1) can't understand how to model a many-to-many relationship in a relational database, and 2) strongly insist on storing your CategoryIDs as delimited character data,
Then they ought to immediately lose all database design privileges. At the very least, they need an actual experienced professional to join their team who has the authority to stop them from doing something this unwise and can give them the database design training they are completely lacking.
Last, you should not refer to them as "database developers" again until they are properly up to speed, as this is a slight to those of us who actually are competent developers & designers.
I hope this answer is very helpful to you.
Update
The main point of my question is not developer confusion and what to do about it.
The point is to understand when delimited values are the right solution.
Delimited values are the wrong solution except in extremely rare cases. When individual values will ever be queried/inserted/deleted/updated this proves it was the wrong decision, because you have to parse and touch all the other values just to work with the desired one. By doing this you're violating first (!!!) normal form (this phrase should sound to you like an unbelievably vile expletive). Using XML to do the same thing is wrong, too. Storing delimited values or multi-value XML in a column could make sense when it is treated as an indivisible and opaque "property bag" that is NOT queried on by the database but is always sent whole to another consumer (perhaps a web server or an EDI recipient).
This takes me back to my initial comment. Developers who think violating first normal form is a good idea are very inexperienced developers in my book.
I will grant there are some pretty sophisticated non-relational data storage implementations out there using text property bags (such as Facebook(?) and other multi-million user sites running on thousands of servers). Well, when your database, user base, and transactions per second are big enough to need that, you'll have the money to develop it. In the meantime, stick with best practice.
It's almost always a big mistake to use comma separated IDs.
RDBMS are designed to store relationships.
My solution is to create an associative table as follows: This is confusing to the developers
Really? This is database 101. If this is confusing to them, then maybe they need to step away from their wizard-generated code and learn some basic DB normalization.
What you propose is the right solution!!
The Document_Category table in your design is certainly the correct way to approach the problem. If it's possible, I would suggest that you educate the developers instead of coming up with a suboptimal solution (and taking a performance hit, and not having referential integrity).
Your other options may depend on the database you're using. For example, in SQL Server you can have an XML column that would allow you to store your array in a pre-defined schema and then do joins based on the contents of that field. Other database systems may have something similar.
The many-to-many mapping you are doing is fine and normalized. It also allows for other data to be added later if needed. For example, say you wanted to add a time that the category was added to the document.
I would suggest having a surrogate primary key on the document_category table as well. And a Unique(documentid, categoryid) constraint if that makes sense to do so.
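A sketch of that variation, with a surrogate key, the unique constraint on the pair, and the extra "time added" column mentioned above. It uses SQLite via Python; the column names and the tag helper are assumptions.

```python
import sqlite3

# Hypothetical link table: surrogate key plus a uniqueness rule on the real pair.
conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE document_category (
    id          INTEGER PRIMARY KEY,                      -- surrogate key
    document_id INTEGER NOT NULL,
    category_id INTEGER NOT NULL,
    added_at    TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,  -- extra data about the link
    UNIQUE (document_id, category_id)                     -- no duplicate pairings
)
""")


def tag(document_id, category_id):
    """Link a document to a category; return False if the pair already exists."""
    try:
        conn.execute(
            "INSERT INTO document_category (document_id, category_id) VALUES (?, ?)",
            (document_id, category_id),
        )
        return True
    except sqlite3.IntegrityError:
        return False


def link_count():
    """Number of document/category links currently stored."""
    return conn.execute("SELECT COUNT(*) FROM document_category").fetchone()[0]
```

Calling tag(1, 10) twice inserts one row and then reports the duplicate, which is the job the composite primary key did in the original design.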
Why are the developers confused?
The 'this is confusing to the developers' design means you have under-educated developers. It is the better relational database design - you should use it if at all possible.
If you really want to use the list structure, then use a DBMS that understands them. Examples of such databases would be the U2 (Unidata, Universe) DBMS, which are (or were, once upon a long time ago) based on the Pick DBMS. There are likely to be other similar DBMS providers.
This is the classic object-relational mapping problem. The developers are probably not stupid, just inexperienced or unaccustomed to doing things the right way. Shouting "3NF!" over and over again won't convince them of the right way.
I suggest you ask your developers to explain to you how they would get a count of documents by category using the pipe-delimited approach. It would be a nightmare, whereas the link table makes it quite simple.
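Here is roughly what that exercise looks like with both designs side by side, using the Document and Document_Category tables from the question and a handful of invented rows. With the link table the count is one GROUP BY; with the pipe-delimited column, every row has to be fetched and parsed in application code (or with vendor-specific string splitting).

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Document (ID INTEGER PRIMARY KEY, Title TEXT, CategoryIDs TEXT);
CREATE TABLE Document_Category (
    DocumentID INTEGER,
    CategoryID INTEGER,
    PRIMARY KEY (DocumentID, CategoryID)
);
-- The same made-up data stored both ways.
INSERT INTO Document VALUES (1, 'Spec', '|1|12|'), (2, 'Memo', '|12|'), (3, 'Plan', '|1|2|');
INSERT INTO Document_Category VALUES (1, 1), (1, 12), (2, 12), (3, 1), (3, 2);
""")


def counts_from_link_table():
    """Documents per category: the database does all the work."""
    rows = conn.execute(
        "SELECT CategoryID, COUNT(*) FROM Document_Category GROUP BY CategoryID"
    ).fetchall()
    return dict(rows)


def counts_from_delimited():
    """Same answer from the delimited column: read every row and split it by hand."""
    counts = Counter()
    for (ids,) in conn.execute("SELECT CategoryIDs FROM Document"):
        counts.update(int(part) for part in ids.split("|") if part)
    return dict(counts)


print(counts_from_link_table())
print(counts_from_delimited())
```

Both functions agree on this tiny data set, but only the first one stays a single indexed query as the Document table grows.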
The number one reason that my developers try this "comma-delimited values in a database column" approach is that they have a perception that adding a new table to address the need for multiple values will take too long to add to the data model and the database.
Most of them know that their workaround is bad for all kinds of reasons, but they choose this suboptimal method because they just can. They can do this and maybe never get caught, or they will get caught much later in the project when it is too expensive and risky to fix it. Why do they do this? Because their performance is measured solely on speed and not on quality or compliance.
It could also be, as on one of my projects, that the developers had a table to put the multi values in but were under the impression that duplicating that data in the parent table would speed up performance. They were wrong and they were called out on it.
So while you do need an answer to how to handle these costly, risky, and business-confidence damaging tricks, you should also try to find the reason why the developers believe that taking this course of action is better in the short and the long run for the project and company. Then fix both the perception and the data structures.
Yes, it could just be laziness, malicious intent, or cluelessness, but I'm betting most of the time developers do this stuff because they are constantly being told "just get it done". We on the data model and database design sides need to ensure that we aren't sending the wrong message about how responsive we can be to requests to fulfill a business requirement for a new entity/table/piece of information.
We should also see that data people need to be constantly monitoring the "as-built" part of our data architectures.
Personally, I never authorize the use of comma delimited values in a relational database because it is actually faster to build a new table than it is to build a parsing routine to create, update, and manage multiple values in a column and deal with all the anomalies introduced because sometimes that data has embedded commas, too.
Bottom line, don't do comma delimited values, but find out why the developers want to do it and fix that problem.