separate table for picture items with image field - sql

I store different items (notes, articles, pictures, files) in a single table (there are many metadata in common for all the item types - for example, categories, tags, rating, statistics etc.).
My first design was like this: table Items, plus another "detail" table for each of the item types (NoteItems, ArticleItems, PictureItems etc.). To retrieve a single item, tables must be joined one-to-one (SELECT * FROM Items INNER JOIN PictureItems ON Items.Id = PictureItems.Id WHERE Items.Id = N).
I'm pretty sure that this "by-the-book" design would work nicely (done that several times), however, I start to wonder whether the design is overkill. It would be much simpler to have a single table (Items).
Let's say that there are about 5% of items of picture or file type.
And now, the question: if I go for the (almost) single table design, would it be better to have detail tables for image fields anyway (for picture and file items, of course)?
Scenario 1: only one table: Items (for storing notes, articles, pictures, files...)
Scenario 2: two tables: Items (for storing notes, articles, picture files), ImageItems (for storing only image field of item types picture, file); one-to-one relation
(Scenario 3 would be a minor variation of Scenario 2; with 3 tables (Items, PictureItems, FileItems))
Advantages of scenario 1 are:
simpler select queries (no joins)
transaction-less updates (only one table is updated on INSERT/UPDATE)
performance, scalability due to transaction-less updates?
Advantages of scenario 2 are:
cleaner design
lower data consumption (in scenario 1, about 95% of items of type other than picture or file would have NULL value in the image field, that's about 16 bytes wasted for the pointer)
Which scenario would you choose: 1 (transaction-less updates) or 2 (lower data consumption)? Thanks for your opinions.

If programmers are sensible enough to query only the required columns from the table instead of "SELECT *", the first design approach looks okay.
Need to take care of indexing, referential constraints etc. for the second design.

If the database doesn't need to know what's in those items (won't be indexing or searching on them) then option 1 seems the best option (assuming you just have one 'Item' column as a BLOB) - you can just read out the items as binary data and deal with them yourself as appropriate - thus avoiding that inner join.
I don't believe scenario 2 gives you lower data consumption - you can just use a BLOB field (and anyway the overhead of an extra ImageItems table is probably comparable to 16 bytes per row)
So I would personally go for option 1, but of course it depends on how you handle the Items when they come out of the database.

The first approach is usually penalized if you are using some sort of ORM or automatically generate your DAL (SubSonic?). You will retrieve the Image column (and its data) every time you pass around a DAL object (or a collection), so normally I would use scenario 2 (or 3).
From the SQL perspective either scenario will work about the same depending on your storage engine (ISAM, InnoDB, etc.), but even there the benefits and differences between scenarios are minor.

If you're right that only about 5% of your rows actually have additional image/binary data, then I'd definitely say use the one table approach, combined with the hint Murthy gave - make sure not to do SELECT * on this table, but only request those columns that you really need - leave out the BLOB columns as often as possible.
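A minimal sketch of that one-table approach, using Python's built-in sqlite3 module purely for illustration (the question does not say which database is in use). The column names ItemType, Title, Rating and Image, and the helper names, are assumptions made for the example. The listing query names its columns and skips the BLOB; the image is fetched only for the one item that needs it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Items (
        Id       INTEGER PRIMARY KEY,
        ItemType TEXT NOT NULL,   -- 'note', 'article', 'picture', 'file'
        Title    TEXT NOT NULL,   -- hypothetical common metadata
        Rating   INTEGER,
        Image    BLOB             -- NULL for notes and articles
    )
""")
conn.executemany(
    "INSERT INTO Items (ItemType, Title, Rating, Image) VALUES (?, ?, ?, ?)",
    [
        ("note", "Shopping list", 3, None),
        ("picture", "Holiday photo", 5, b"\x89PNG fake image bytes"),
    ],
)


def list_items(conn):
    """Listing query: explicit column list, the BLOB column is left out."""
    return conn.execute(
        "SELECT Id, ItemType, Title, Rating FROM Items ORDER BY Id"
    ).fetchall()


def load_image(conn, item_id):
    """Fetch the BLOB only when a single picture or file is displayed."""
    row = conn.execute(
        "SELECT Image FROM Items WHERE Id = ?", (item_id,)
    ).fetchone()
    return row[0] if row else None


listing = list_items(conn)
image = load_image(conn, 2)
```

With an ORM that always materializes every mapped column, the same effect usually requires excluding or lazy-loading the image property, which is the concern raised in the previous answer.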
If your database grows, you might also want to check out a separate filegroup for the BLOB data to keep things separated and clean (but that's really only when you're dealing with hundreds of thousands of rows or more, and if you can split the filegroups onto several separate disks).
KISS - Keep It Smart and Simple - whenever possible! :-)
Marc

Related

Storing multiple entity orderings in a database

I have a database, with one table containing a set of entities - let's say films. I want users on the front end to be able to create and save multiple orderings of these films.
How could I sensibly store these orderings? What should I be thinking about (for making the database schema nice, or for performance, etc.)?
In this particular case, I'm only expecting about 100 films in the database, and probably fewer than 10 saved orderings, so performance is unlikely to be a major issue, although in theory it could get much larger.
My ideas so far, without restricting for niceness, include:
1. Having a table for orderings, storing ordering id, ordering name, and a JSON.stringified version of the film ids.
2. Having a table storing ordering id and ordering name, and a separate table storing, somehow (a linked list?), several sets of actual orders - by storing as you might a single order but adding an order id. (This second table might then contain a record for every pair of film and order...)
Your ideas basically cover it, omitting some implementation-specific options such as Postgres' array field support which is for practical purposes a different way to accomplish #1.
#1 is compact and easy to set up, but lacks referential integrity: if you delete a film, you have to ensure it is removed from all lists yourself. #2 involves more work and a slightly more complex structure, but the payoff is that the relationships are tracked by the database; taking the example of deleting a film, you have the option to prevent deletion of any film on a list, or of cascading the deletion to remove the entry from all lists.
Each has its advantages and disadvantages. Which is appropriate really comes down to your specific use case.
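As a rough sketch of option #2, here is one way the second table could look, with a position column instead of a linked list. It uses Python's sqlite3 module for illustration; the table and column names (films, orderings, ordering_films, position) are assumptions, not something the question specifies. The foreign keys use ON DELETE CASCADE, so deleting a film removes it from every saved ordering.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
    CREATE TABLE films (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE orderings (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    -- one row per (ordering, film) pair, as anticipated in the question
    CREATE TABLE ordering_films (
        ordering_id INTEGER NOT NULL REFERENCES orderings(id) ON DELETE CASCADE,
        film_id     INTEGER NOT NULL REFERENCES films(id) ON DELETE CASCADE,
        position    INTEGER NOT NULL,
        PRIMARY KEY (ordering_id, film_id),
        UNIQUE (ordering_id, position)
    );
    INSERT INTO films VALUES (1, 'Alien'), (2, 'Brazil'), (3, 'Casablanca');
    INSERT INTO orderings VALUES (1, 'favourites');
    INSERT INTO ordering_films VALUES (1, 3, 1), (1, 1, 2), (1, 2, 3);
""")


def films_in_order(conn, ordering_id):
    """Return film titles for one saved ordering, sorted by position."""
    rows = conn.execute(
        "SELECT f.title FROM ordering_films o "
        "JOIN films f ON f.id = o.film_id "
        "WHERE o.ordering_id = ? ORDER BY o.position",
        (ordering_id,),
    ).fetchall()
    return [title for (title,) in rows]


before = films_in_order(conn, 1)
conn.execute("DELETE FROM films WHERE id = 1")  # cascades into ordering_films
after = films_in_order(conn, 1)
```

The cascade leaves a gap in the position values (1 and 3 here), which is harmless for sorting. Replacing ON DELETE CASCADE with ON DELETE RESTRICT on film_id would instead block deletion of any film that is on a list.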

Having all contact information in one table vs. using key-value tables

(NB. The question is not a duplicate for this, since I am dealing with an ORM system)
I have a table in my database to store all Contacts information. Some of the columns for each contact are fixed (e.g. Id, InsertDate and UpdateDate). In my program I would like to give the user the option to add or remove properties for each contact.
Now there are of course two alternatives here:
(1) Save it all in one table and add and remove entire columns when the user needs to;
(2) Create a key-value table to save each property alongside its type and connect the record to user's id.
These alternatives are both doable. But I am wondering which one is better in terms of speed? In the program it will be a very common thing for the user to view the entire Contact list to check for updates. Plus, I am using an ORM framework (Microsoft's Entity Framework) to deal with database queries. So if the user is to add and remove columns from a table all the time, it will be a difficult task to map them to my program. But again, if alternative (1) is a significantly better option than (2), then I can reconsider the key-value option.
I have actually done both of these.
Example #1
Large, wide table with columns of data holding names, phone, address and lots of small integer values of information that tracked details of the clients.
Example #2
Many different tables separating out all of the Character Varying data fields, the small integer values etc.
Example #1 was a lot faster to code for but in terms of performance, it got pretty slow once the table filled with records. 5000 wasn't a problem. When it reached 50,000 there was a noticeable performance degradation.
Example #2 was built later in my coding experience and was built to resolve the issues found in Example #1. While it took more to get the records I was after (LEFT JOIN this and UNION that) it was MUCH faster as you could ultimately pick and choose EXACTLY what the client was after without having to search a massive wide table full of data that was not all being requested.
I would recommend Example #2 to fit your #2 in the question.
And your USER specified columns for their data set could be stored in a table just to their own (depending on how many you have I suppose) which would allow you to draw on the table specific to that USER, which would also give you unlimited ability to remove and add columns to suit that particular setup.
You could then also have another table which kept track of the custom columns in the custom column table, which would give you the ability to "recover" columns later, as in "Do you want to add this to your current column choices or to one of these columns you have deleted in the past".
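For comparison, the key-value alternative (2) from the question could be sketched roughly as follows. This uses Python's sqlite3 module rather than Entity Framework, and the ContactProperties table, its columns and the 'int'/'string' type tags are all assumptions made for the example. Viewing the whole contact list takes two queries, and the stored type tag is used to convert values back.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Contacts (
        Id         INTEGER PRIMARY KEY,
        InsertDate TEXT NOT NULL,
        UpdateDate TEXT NOT NULL
    );
    -- one row per (contact, property); adding a property needs no ALTER TABLE
    CREATE TABLE ContactProperties (
        ContactId INTEGER NOT NULL REFERENCES Contacts(Id),
        Name      TEXT NOT NULL,
        ValueType TEXT NOT NULL,   -- hypothetical tags: 'string' or 'int'
        Value     TEXT,
        PRIMARY KEY (ContactId, Name)
    );
    INSERT INTO Contacts VALUES
        (1, '2020-01-01', '2020-01-02'),
        (2, '2020-01-03', '2020-01-03');
    INSERT INTO ContactProperties VALUES
        (1, 'Phone', 'string', '555-0100'),
        (1, 'Age',   'int',    '42'),
        (2, 'Phone', 'string', '555-0199');
""")


def load_contacts(conn):
    """Build {contact id: {property name: typed value}} for the full list."""
    contacts = {cid: {} for (cid,) in conn.execute("SELECT Id FROM Contacts")}
    query = "SELECT ContactId, Name, ValueType, Value FROM ContactProperties"
    for cid, name, value_type, value in conn.execute(query):
        contacts[cid][name] = int(value) if value_type == "int" else value
    return contacts


contacts = load_contacts(conn)
```

The trade-off is that every value is stored as text and typed in application code, so the database cannot check types or index values as naturally as it can with real columns.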

Table with a lot of columns

If my table has a huge number of columns (over 80) should I split it into several tables with a 1-to-1 relationship or just keep it as it is? Why? My main concern is performance.
PS - my table is already in 3rd normal form.
PS2 - I am using MS Sql Server 2008.
PS3 - I do not need to access all table data at once, but rather have 3 different categories of data within that table, which I access separately. It is something like: member preferences, member account, member profile.
80 columns really isn't that many...
I wouldn't worry about it from a performance standpoint. Having a single table (if you're typically using all of the data in your standard operations) will probably outperform multiple tables with 1-1 relationships, especially if you're indexing appropriately.
I would worry about this (potentially) from a maintenance standpoint, though. The more columns of data in a single table, the less understandable the role of that table in your grand scheme becomes. Also, if you're typically only using a small subset of the data, and all 80 columns are not always required, splitting into 2+ tables might help performance.
Re the performance question - it depends. The larger a row is, the fewer rows can be read from disk in one read. If you have a lot of rows, and you want to be able to read the core information from the table very quickly, then it may be worth splitting it into two tables - one with small rows with only the core info that can be read quickly, and an extra table containing all the info you rarely use that you can look up when needed.
Taking another tack, from a maintenance & testing point of view, if as you say you have 3 distinct groups of data in the one table albeit all with the same unique id (e.g. member_id) it might make sense to split it out into separate tables.
If you need to add fields to, say, the profile details section of your members info table, do you really want to run the risk of having to re-test the preferences & account details elements of your app as well to ensure there are no knock-on impacts?
The same goes for audit trail purposes, if you want to track the last user ID/timestamp to change a member's data: if the admin app allows Preferences/Account Details/Profile Details to be updated separately, then it makes sense to have them in separate tables to more easily track updates.
Not quite a SQL/Performance answer but maybe something to look at from a DB & App design pov
Depends what those columns are. If you've got hard coded duplicated fields like Colour1, Colour2, Colour3, then these are candidates for child tables. My general rule of thumb is if there's more than one field of the same type (Colour), then you might as well code for N of them, not a fixed number.
Rob.
1-1 may be easier, if you have say Member_Info; Member_Pref; Member_Profile. Having too many columns in one table can cause trouble if you want lots of varchar(255) columns, as you may go over the row size limit, and it just makes the table too confusing.
Just make sure you have the correct foreign key constraints and such, so there's always 1 row in each table with the same member_id.
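A sketch of such a 1-to-1 split, again in Python's sqlite3 rather than SQL Server 2008, with made-up columns (email, newsletter, bio). Making member_id both the primary key and a foreign key in each side table guarantees at most one row per member and no orphans. The "always exactly one row" part is not enforced by the schema here; the sketch assumes the application inserts all three rows in one transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # required for SQLite to enforce REFERENCES
conn.executescript("""
    CREATE TABLE Member_Info (
        member_id INTEGER PRIMARY KEY,
        email     TEXT NOT NULL
    );
    -- primary key = foreign key: at most one row per member, never an orphan
    CREATE TABLE Member_Pref (
        member_id  INTEGER PRIMARY KEY
                   REFERENCES Member_Info(member_id) ON DELETE CASCADE,
        newsletter INTEGER NOT NULL DEFAULT 0
    );
    CREATE TABLE Member_Profile (
        member_id INTEGER PRIMARY KEY
                  REFERENCES Member_Info(member_id) ON DELETE CASCADE,
        bio       TEXT
    );
""")


def add_member(conn, member_id, email):
    """Create the rows in all three tables inside a single transaction."""
    with conn:
        conn.execute("INSERT INTO Member_Info VALUES (?, ?)", (member_id, email))
        conn.execute("INSERT INTO Member_Pref (member_id) VALUES (?)", (member_id,))
        conn.execute("INSERT INTO Member_Profile (member_id) VALUES (?)", (member_id,))


def orphan_rejected(conn):
    """Try to add preferences for a member that does not exist."""
    try:
        conn.execute("INSERT INTO Member_Pref (member_id) VALUES (99)")
    except sqlite3.IntegrityError:
        return True
    return False


add_member(conn, 1, "a@example.com")
rejected = orphan_rejected(conn)
prefs = conn.execute(
    "SELECT newsletter FROM Member_Pref WHERE member_id = 1"
).fetchone()
```

Each category of data (preferences, account, profile) can then be read with a single-table query on member_id, which matches the access pattern described in the question.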

Database table with just 1 or 2 optional fields... split into multiple tables?

In a DB I'm designing, there's one fairly central table representing something that's been sold or is for sale. It distinguishes between personal sales (like eBay) and sales from a proper company. This means there are literally one or two fields which are not equally appropriate to both cases... for instance one field is only used in one case, another field is optional in one case but mandatory in the other.
If there were more specialty it would be sensible to have a core table and then two tables with the fields relevant to the specific cases. But here, creating two tables just to contain like one field plus the reference to the core table seems both aesthetically bad, and painful to the query designer and DB software.
What do you think? Is it ok to bend the rules slightly by having a single table with weakened constraints - meaning the DB cannot 100% prevent inconsistent data being added (in a very limited way) - or do I suck it up and create dumb-looking 1-field tables?
What you're describing with one table for common columns and dependent tables for subtype-specific columns is called Class Table Inheritance. It's a perfectly good thing to do.
What @Scott Ferguson seems to be describing (two distinct tables for the two types of sales) is called Concrete Table Inheritance. It can also be a good solution depending on your needs, but more often it just makes it harder to write queries across both subtypes.
If all you need is one or two columns that apply only to a given subtype, I agree it seems like overkill to create dependent tables. Remember that most brands of SQL database support CHECK constraints or triggers, so you can design data integrity rules into the metadata.
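CREATE TABLE Sales (
  sale_id SERIAL,
  is_business INT NOT NULL, -- 1 for corporate, 0 for personal
  sku VARCHAR(20), -- only for corporate
  paypal_id VARCHAR(20), -- mandatory but only for personal
  -- every row must match exactly one of the two sale types
  CONSTRAINT sale_type_fields CHECK (
    (is_business = 1 AND paypal_id IS NULL)
    OR (is_business = 0 AND sku IS NULL AND paypal_id IS NOT NULL)
  )
);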
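A rule of this kind can be tried out quickly in SQLite through Python's sqlite3 module. This is only a sketch: SQLite types replace SERIAL and VARCHAR, the constraint name sale_type_fields and the helper try_insert are made up for the example, and the exact rule (sku only for corporate rows, paypal_id required for and limited to personal rows) is one reading of the column comments above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Sales (
        sale_id     INTEGER PRIMARY KEY,
        is_business INTEGER NOT NULL CHECK (is_business IN (0, 1)),
        sku         TEXT,   -- only for corporate
        paypal_id   TEXT,   -- mandatory but only for personal
        CONSTRAINT sale_type_fields CHECK (
            (is_business = 1 AND paypal_id IS NULL)
            OR (is_business = 0 AND sku IS NULL AND paypal_id IS NOT NULL)
        )
    )
""")


def try_insert(conn, is_business, sku, paypal_id):
    """Return True if the row is accepted, False if a CHECK rejects it."""
    try:
        with conn:  # commits on success, rolls back on failure
            conn.execute(
                "INSERT INTO Sales (is_business, sku, paypal_id) VALUES (?, ?, ?)",
                (is_business, sku, paypal_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False


results = {
    "business with sku": try_insert(conn, 1, "SKU-1", None),
    "business without sku": try_insert(conn, 1, None, None),
    "personal with paypal": try_insert(conn, 0, None, "pp-7"),
    "personal without paypal": try_insert(conn, 0, None, None),
    "personal with sku": try_insert(conn, 0, "SKU-2", "pp-8"),
    "business with paypal": try_insert(conn, 1, "SKU-3", "pp-9"),
}
```

The first three rows are accepted and the last three are rejected, so the single table keeps its simplicity without letting the inconsistent combinations in.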
I think the choice of having these fields is not going to hurt you today, and it is the choice I would go for. Just remember that as your database evolves you may need to make the decision to refactor to 2 separate tables (if you need more fields).
There are some who insist that inapplicable fields should never be allowed, but I think this is one of those rules that someone wrote in a book and now we're all supposed to follow it without questioning why. In the case you're describing, a single table sounds like the simple, intelligent solution.
I would certainly not create two tables. Then all the common fields would be duplicated, and all your queries would have to join or union two tables. So the real question is, One table or three. But you seem to realize that.
You didn't clarify what the additional fields are. If the presence or absence of one field implies the record type, then I sometimes use that fact as the record type indicator rather than creating a redundant type. Like, if the only difference between a "personal sale" and a "business sale" is that a business sale has a foreign key for a company filled in, then you could simply state that you define a business sale as one with a company filled in, and no ambiguity is possible. But if the situation gets even slightly more complicated, this can be a trap: I've seen applications that say if a is null and b=c d / 7 = then it's record type A, else if b is null and etc etc. If you can't do it with one test on one field, forget it and put in a record type field.
You can always enforce consistency with code or constraints.
I worry a lot more about redundant data creating consistency problems than about inapplicable fields. Redundant data creates all sorts of problems. Data inapplicable to a record type? In the worst case, just ignore it. If it's a "personal sale" and somehow a company got filled in, ignore it or null it out on sight. Problem solved.
If there are two distinct entities, "Personal Sales" and "Company Sales", then perhaps you ought to have two tables to represent those entities?
News flash: the DB cannot prevent 100% of corrupt data no matter which way you cut it. So far you have only considered what I call level 1 corruption (level 0 corruption is essentially what would happen if you wrote garbage over your database with a hex editor).
I have yet to see a database that could prevent level 2 corruption (syntactically correct records but when taken as a whole mean something perverse).
The PRO for keeping all fields in one table is that you get rid of JOINs, which makes your queries faster.
The CONTRA is that your table grows larger which makes your queries slower.
Which one impacts you more, totally depends on your data distribution and which queries you issue most often.
In general, splitting is better for OLTP systems, joining is better for data analysis (that tends to scan the tables).
Let's imagine 2 scenarios:
Split fields. There are 1,000,000 rows, the average row size is 20 bytes, the split field is filled once per 50 rows (i.e. 20,000 records in the split table).
We want to query like this:
SELECT SUM(mainfield + COALESCE(splitfield, 0))
FROM maintable
LEFT JOIN
splittable
ON splitid = mainid
This will require scanning 20,000,000 bytes and nested loops (or hash lookups) to find 20,000 records.
Each hash lookup is roughly equivalent to scanning 10 rows, so the total time will be equivalent of scanning 20,000,000 + 10 * 20,000 * 20 = 24,000,000 bytes
Joined fields. There are 1,000,000 rows, the average row size is 24 bytes, so the query will scan 24,000,000 bytes.
As you can see, the times tie.
However, if either parameter changes (field is filled more often or more rarely, the row size is more or less, etc), one or another solution will become better.
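The back-of-the-envelope estimate above can be written as a small Python sketch, which makes it easy to vary the parameters. The function names are made up, and the cost model (bytes scanned, with each lookup counted as scanning 10 rows) is just the rough assumption used in this answer, not a real optimizer model.

```python
def split_cost(rows, row_bytes, split_rows, rows_per_lookup=10):
    """Bytes 'scanned' with a split table: full scan of the main table
    plus one lookup per split record, each counted as rows_per_lookup rows."""
    scan = rows * row_bytes
    lookups = rows_per_lookup * split_rows * row_bytes
    return scan + lookups


def joined_cost(rows, row_bytes):
    """Bytes scanned when the extra field lives in the (wider) main table."""
    return rows * row_bytes


# The scenario from the answer: 1,000,000 rows, field filled once per 50 rows.
split = split_cost(1_000_000, 20, 20_000)
joined = joined_cost(1_000_000, 24)

# Same tables, but the field is filled ten times less / five times more often.
split_sparse = split_cost(1_000_000, 20, 2_000)
split_dense = split_cost(1_000_000, 20, 100_000)
```

Under these assumptions the original scenario ties at 24,000,000 bytes each, the sparse case favours the split table (20,400,000) and the dense case favours the single table (40,000,000 for the split design against 24,000,000).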

which is faster, mysql database with one table or multiple tables?

On my website you can search 'ads' or 'classifieds'. There are different categories.
Would the searches be faster with multiple tables, one for each category, or wouldn't it matter?
We are talking about around 500 thousand ads.
If it won't slow down the search, please explain yourself so that I understand why it won't, because it seems like common sense that the more ads you have, the slower the search!
Thanks
Your question is a little unclear. I'm assuming this scenario:
table: ads
id  category  ad_text
--  --------  -----------
1   pets      sample text
2   family    sample ad
If you are making one search of ads, then searching multiple tables on each search is slower than searching one table.
HOWEVER, if you're proposing to break "ads" into multiple tables according to "category", leaving you with table names like
pets-ads
family-ads
programmer-ads
And, programmatically, you know you're looking for programmer-ads so you can just go search the programmer-ads table, then breaking them out is faster. Barely.
Breaking them out, though, has many drawbacks. You'll need:
some cute code to know which table to search
a new table each time you create a new category
to rename a table if you decide a category name is wrong
Given the limited info we have, I would strongly advise one table with a category column, then go ahead and normalize that out into its own table. Slap an index on that column. Databases are built to handle tons of rows of data organized correctly, so don't worry about that so much.
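A small sketch of that advice, using Python's sqlite3 module instead of MySQL so that it runs anywhere. The categories table, the index name idx_ads_category and the generated sample ads are assumptions for the example. EXPLAIN QUERY PLAN is SQLite's way of showing whether a query uses the index; MySQL has its own EXPLAIN with different output.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE categories (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    CREATE TABLE ads (
        id          INTEGER PRIMARY KEY,
        category_id INTEGER NOT NULL REFERENCES categories(id),
        ad_text     TEXT NOT NULL
    );
    CREATE INDEX idx_ads_category ON ads(category_id);
    INSERT INTO categories VALUES (1, 'pets'), (2, 'family'), (3, 'programmer');
""")
# 300 made-up ads spread evenly over the three categories
conn.executemany(
    "INSERT INTO ads (category_id, ad_text) VALUES (?, ?)",
    [(1 + i % 3, f"sample ad {i}") for i in range(300)],
)


def search(conn, category, term):
    """Search one category of the single ads table for a text fragment."""
    return conn.execute(
        "SELECT a.id, a.ad_text FROM ads a "
        "JOIN categories c ON c.id = a.category_id "
        "WHERE c.name = ? AND a.ad_text LIKE ? ORDER BY a.id",
        (category, f"%{term}%"),
    ).fetchall()


results = search(conn, "programmer", "sample")

# Ask SQLite how it would find the rows of one category.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT id FROM ads WHERE category_id = ?", (3,)
).fetchall()
uses_index = any("idx_ads_category" in row[-1] for row in plan)
```

With the index in place, the database only visits rows of the requested category, which is the same narrowing a per-category table would give, without needing a new table for every new category.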
Obviously, it will be nominally faster to search a smaller table (one category) than a larger table. The larger table is probably still the correct design, however. Creating multiple identical tables will simply make the developer's and manager's lives miserable. Furthermore, certain kinds of searches are more difficult if you segment the data (for instance, searches across two categories).
Properly indexed, the single-table approach will yield results almost as good as the segmented approach while providing the benefits of proper design.
(Of course, when you say "single table", I assume that you mean a single table to hold the core attributes of the Advertisement entities. Presumably there will be other tables as well.)
It depends.
If you've built a single denormalised table containing text, it'll get progressively slower for a number of reasons. Indexes help to a certain point.
If you have a normalised structure with multiple tables, primary and foreign keys, indexes, etc., it can be more robust and scalable.
A database is very well equipped to deal with 500k ads. Add an index on the category, and you should be fine.
If you add the table definition and the distribution of categories to your question, you'd probably get a better answer :)