On my website you can search 'ads' or 'classifieds'. There are different categories.
Would the searches be faster with multiple tables, one for each category, or wouldn't it matter?
We are talking about around 500 thousand ads.
If it won't slow down the search, please explain yourself so that I understand why it won't, because it seems like common sense that the more ads you have, the slower the search!
Thanks
Your question is a little unclear. I'm assuming this scenario:
table: ads
id category ad_text
-- -------- ---------------------------
1 pets sample text
2 family sample ad
If you are making one search of ads, then searching multiple tables on each search is slower than searching one table.
HOWEVER, if you're proposing to break "ads" into multiple tables according to "category", leaving you with table names like
pets-ads
family-ads
programmer-ads
And, programatically, you know you're looking for programmer-ads so you can just go search the programmer-ads table, then breaking them out is faster. Barely.
Breaking them out, though, has many drawbacks. You'll need:
some cute code to know which table to
search.
a new table each time you create a new category
to rename a table if you decide a category name is wrong
Given the limited info we have, I would strongly advise one table with a category column, then go ahead and normalize that out into its own table. Slap an index on that column. Databases are built to handle tons of rows of data organized correctly, so don't worry about that so much.
Obviously, it will be nominally faster to search a smaller table (one category) than a larger table. The larger table is probably still the correct design, however. Creating multiple identical tables will simply make the developer's and manager's lives miserable. Furthermore, certain kind of searches are more difficult if you segment the data (for instance, searches across two categories).
Properly indexed, the single-table approach will yield results almost as good as the segmented approach while providing the benefits of proper design.
(Of course, when you say "single table", I assume that you mean a single table to hold the core attributes of the Advertistment entities. Presumably there will be other tables as well.)
It depends.
If you've built a single denormalised table containing text, it'll get progressively slower for a number of reasons. Indexes help to a certain point.
If you have a normalised structure with multiple tables, primary and foreign keys, indexes, etc., it can be more robust and scalable.
A database is very well equipped to deal with 500k adds. Add an index on the category, and you should be fine.
If you add the table definition and the distribution of categories to your question, you'd probably get a better answer :)
Related
I am developing a mssql db for stores that are in different cities. is it better to have a table for each city, or house all in one table. I also dont want users from different cities accessing data from cities that are not theirs
SQL is designed to handle large tables, really big tables. It is not designed to handle a zillion little tables. The clear answer to your question is that all examples of a particular entity should go in a single table. There are lots of good reasons for this:
You want to be able to write a query that will return data about any city or all cities. This is easy if the data is in one table; hard if the data is in multiple tables.
You want to optimize your queries by choosing correct indexes and data types and collecting statistics and defragging indexes and so on. Why multiply the work by multiplying the number of tables?
Foreign key relationships should be properly declared. You cannot do that if the foreign key could be to multiple tables.
Lots of small tables results in lots of partially filled data pages, which just makes the database bigger and slows it down.
I could go on. But you probably get the idea by now that one table per entity is the right way to go (at least under most circumstances).
Your issue of limiting users to see data only in one city can be handled in a variety of ways. Probably the most common is simply to use views.
I wonder what is the difference between having one table with 6 millions row (aka with a huge DB) and 100k active users:
CREATE TABLE shoes (
id serial primary key,
color text,
is_left_one boolean,
stock int
);
With also 6 index like:
CREATE INDEX blue_left_shoes ON shoes(color,is_left_one) WHERE color=blue AND is_left_one=true;
Versus: 6 tables with 1 million rows:
CREATE TABLE blue_left_shoes(
id serial primary key,
stock int
);
The latter one seems more efficient because users don't have to ask for the condition since the table IS the condition, but perhaps creating the indexes mitigate this?
This table is used to query either left, right, "blue", "green" or "red" shoes and to check the number of remaining items, but it is a simplified example but you can think of Amazon (or any digital selling platform) tooltip "only 3 items left in stock" for the workload and the usecase. It is the users (100k active daily) who will make the query.
NB: The question is mostly for PostgreSQL but differences with other DB is still relevant and interesting.
In the latter case, where you use a table called blue_left_shoes
Your code needs to first work out which table to look at (as opposed to parameterising a value in the where clause)
As permutations and options increase, you need to increase the number of tables, and increase the logic in your app that works out which table to use
Anything that needs to use this database (i.e. a reporting tool or an API) now needs to re implement all of these rules
You are imposing logic at a high layer to improve performance.
If you were to partition and/or index your table appropriately, you get the same effect - SQL queries only look through the records that matter. The difference is that you don't need to implement this logic in higher layers
As long as you can get the indexing right, keeping this is one table is almost always the right thing to do.
Partitioning
Database partitioning is where you select one or more columns to decide how to "split up" your table. In your case you could choose (color, is_left_one).
Now your table is logically split and ordered in this way and when you search for blue,true it automatically knows which partition to look in. It doesn't look in any other partitions (this is called partition pruning)
Note that this occurs automatically from the search criteria. You don't need to manually work out a particular table to look at.
Partitioning doesn't require any extra storage (beyond various metadata that has to be saved)
You can't apply multiple partitions to a table. Only one
Indexing
Creating an index also provides performance improvements. However indexes take up space and can impact insert and update performance (as they need to be maintained). Practically speaking, the select trade off almost always far outweighs any insert/update negatives
You should always look at indexes before partitioning
Non selective indexes
In your particular case, there's an extra thing to consider: a boolean field is not "selective". I won't go into details but suffice to say you shouldn't create an index on this field alone, as it won't be used because it only halves the number of records you have to look through. You'd need to include some other fields in any index (i.e. colour) to make it useful
In general, you want to keep all "like" data in a single table, not split among multiples. There are good reasons for this:
Adding new combinations is easier.
Maintaining the tables is easier.
You an easily do queries "across" entities.
Overall, the database is more efficient, because it is more likely that pages will be filled.
And there are other reasons as well. In your case, you might have an argument for breaking the data into 6 separate tables. The gain here comes from not having the color and is_left_one in the data. That means that this data is not repeated 6 million times. And that could save many tens of megabytes of data storage.
I say the last a bit tongue-in-cheek (meaning I'm not that serious). Computers nowadays have so much member that 100 Mbytes is just not significant in general. However, if you have a severely memory limited environment (I'm thinking "watch" here, not even "smart phone") then it might be useful.
Otherwise, partitioning is a fine solution that pretty much meets your needs.
For this:
WHERE color=blue AND is_left_one=true
The optimal index is
INDEX(color, is_left_one) -- in either order
Having id first makes it useless for that WHERE.
It is generally bad to have multiple identical tables instead of one.
I have a choice of creating three tables with identical structure but different content or one table with all of the data and one additional column that distinguishes the data. Each table will have about 10,000 rows in it, and it will be used exclusively for looking up data. The key design criteria is speed of lookup, so which is faster: three tables with 10K rows each or one table with 30K rows, or is there no substantive difference? Note: all columns that will be used as query parameters will have indices.
There should be no substantial difference between 10k or 30k rows in any modern RDBMS in terms of lookup time. In any case not enough difference to warrant the de-normalization. Indexed qualifier column is a common approach for such a design.
The only time you may consider de-normalizing if your update pattern affects a limited set of data that you can put in a "short" table (say, today's messages in social network) with few(er) indexes for fast inserts/updates and there is a background process transferring the stabilized updates to a large, fully indexed table. The case were you really win during write operations will be a dramatic one though, with very particular and unfortunate requirements. RDBMS engines are sophisticated enough to handle most of the simple scenarios in very efficient way. 30k or rows does not sound like a candidate.
If still in doubt, it is very easy to write a test to check on your particular database / system setup. I think if you post your findings here with real data, it will be a useful info for everyone in your steps.
Apart from the speed issue, which the other posters have covered and I agree with, you should also take into consideration the business model that your are replicating in your database, as this may affect the maintenance cost of your solution.
If is it possible that the 3 'things' may turn into 4, and you have chosen the separate table path, then you will have to add another table. Whereas if you choose the discriminator path then it is as simple as coming up with a new discriminator.
However, if you choose the discriminator path and then new requirements dictate that one of 'things' has more data to store then you are going to have to add extra columns to your table which have no relevance to the other 'things'.
I cannot say which is the right way to go, as only you know your business model.
Is it reasonable for an application to create database tables dynamically as a means of partitioning?
For example, say I have a large table "widgets" with a "userID" column identifying the owner of each row. If this table tended to grow extremely large, would it make sense to instead have the application create a new table called "widgets_{username}" for each new user? Assume that the application will only ever have to query for widgets belonging to a single user at a time (i.e. no need to try and join any of these user widget tables together).
Doing this would break up the one large table into more easily-managed chunks, but this doesn't seem like an elegant solution. In my mind, the database schema should be defined when the application is written, and any runtime data is stored as rows, not as additional tables.
As a more general question, is modifying the database schema at runtime ever ok?
Edit: This question is mostly hypothetical; I had a pretty good feeling that creating tables at runtime didn't make sense. That being said, we do have a table with millions of rows in our application. SELECTs perform fine, but things like deleting all rows owned by a particular user can take a while. Basically I'm looking for some solid reasoning why just dynamically creating a table for each user doesn't make sense for when I'm asked.
NO, NO, NO!! Now repeat after me, I will not do this because it will create many headaches and problems in the future! Databases are made to handle large amounts of information. they use indexes to quickly find what you are after. think phone book how effective is the index? would it be better to have a different book for each last name?
This will not give you anything performance wise. Keep a single table, but be sure to index on UserID and you'll be able to get the data fast. however if you split the table up, it becomes impossible/really really hard to get any info that spans multiple users, like search all users for a certain widget, count of all widgets of a certain type, etc. you need to have every query be built dynamically.
If deleting rows is slow, look into that. How many rows at one time are we talking about 10, 1000, 100000? What is your clustered index on this table? Could you use a "soft delete", where you have a status column that you UPDATE to "D" to mark the row as deleted. Can you delete the rows at a later time, with less database activity. is the delete slow because it is being blocked by other activity. look into those before you break up the table.
No, that would be a bad idea. However some DBMSs (e.g. Oracle) allow a single table to be partitioned on values of a column, which would achieve the objective without creating new tables at run time. Having said that, it is not "the norm" to partition tables like this: it is only usually done in very large databases.
Using an index on userID should result nearly in the same performance.
In my opinion, changing the database schema at runtime is bad practice.
Consider, for example, security issues...
Is it reasonable for an application to create database tables
dynamically as a means of partitioning?
No. (smile)
If my table has a huge number of columns (over 80) should I split it into several tables with a 1-to-1 relationship or just keep it as it is? Why? My main concern is performance.
PS - my table is already in 3rd normal form.
PS2 - I am using MS Sql Server 2008.
PS3 - I do not need to access all table data at once, but rather have 3 different categories of data within that table, which I access separately. It is something like: member preferences, member account, member profile.
80 columns really isn't that many...
I wouldn't worry about it from a performance standpoint. Having a single table (if you're typically using all of the data in your standard operations) will probably outperform multiple tables with 1-1 relationships, especially if you're indexing appropriately.
I would worry about this (potentially) from a maintenance standpoint, though. The more columns of data in a single table, the less understandable the role of that table in your grand scheme becomes. Also, if you're typically only using a small subset of the data, and all 80 columns are not always required, splitting into 2+ tables might help performance.
Re the performance question - it depends. The larger a row is, the less rows can be read from disk in one read. If you have a lot of rows, and you want to be able to read the core information from the table very quickly, then it may be worth splitting it into two tables - one with small rows with only the core info that can be read quickly, and an extra table containing all the info you rarely use that you can lookup when needed.
Taking another tack, from a maintenance & testing point of view, if as you say you have 3 distinct groups of data in the one table albeit all with the same unique id (e.g. member_id) it might make sense to split it out into separate tables.
If you need to add fields to say your profile details section of the members info table, do you really want to run the risk of having to re-test the preferences & account details elements of your app as well to ensure no knock on impacts.
Also for audit trail purposes if you want to track the last user ID/Timestamp to change a members data. If the admin app allows Preferences/Account Details/Profile Details to be updated separately then it makes sense to have them in separate tables to more easily track updates.
Not quite a SQL/Performance answer but maybe something to look at from a DB & App design pov
Depends what those columns are. If you've got hard coded duplicated fields like Colour1, Colour2, Colour3, then these are candidates for child tables. My general rule of thumb is if there's more than one field of the same type (Colour), then you might as well code for N of them, not a fixed number.
Rob.
1-1 may be easier, if you have say Member_Info; Member_Pref; Member_Profile. Having too many columns can make it run if you want lots of varchar(255) as you may go over the rowsize limit, and it just makes it too confusing.
Just make sure you have the correct forgein key constraints and suchwhat, so there's always 1 row in each table with the same member_id