DB Architecture: One table using WHERE vs multiple - sql

I wonder what the difference is between having one table with 6 million rows (i.e. a large database), queried by 100k active users:
CREATE TABLE shoes (
id serial primary key,
color text,
is_left_one boolean,
stock int
);
Along with 6 partial indexes like:
CREATE INDEX blue_left_shoes ON shoes(color,is_left_one) WHERE color='blue' AND is_left_one=true;
Versus: 6 tables with 1 million rows:
CREATE TABLE blue_left_shoes(
id serial primary key,
stock int
);
The latter seems more efficient because users don't have to specify the condition, since the table IS the condition. But perhaps creating the indexes mitigates this?
This table is queried for left or right shoes in "blue", "green" or "red", to check the number of remaining items. It is a simplified example, but for the workload and use case you can think of the "only 3 items left in stock" tooltip on Amazon (or any online selling platform). The queries come from the users (100k active daily).
NB: The question is mostly about PostgreSQL, but differences with other databases are still relevant and interesting.

In the latter case, where you use a table called blue_left_shoes
Your code needs to first work out which table to look at (as opposed to parameterising a value in the where clause)
As permutations and options increase, you need to increase the number of tables, and increase the logic in your app that works out which table to use
Anything that needs to use this database (e.g. a reporting tool or an API) now needs to reimplement all of these rules
You are imposing logic at a high layer to improve performance.
If you were to partition and/or index your table appropriately, you get the same effect - SQL queries only look through the records that matter. The difference is that you don't need to implement this logic in higher layers
As long as you can get the indexing right, keeping this in one table is almost always the right thing to do.
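To make the "index the single table" option concrete, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is only a stand-in chosen so the example runs anywhere (the question is about PostgreSQL, which also supports partial indexes), and the sample rows are made up. It builds the shoes table with the partial index from the question and prints the plan the engine picks:

```python
import sqlite3

QUERY = "SELECT stock FROM shoes WHERE color = 'blue' AND is_left_one = 1"


def plan_for_blue_left(conn):
    """Return the query plan text for the blue/left stock lookup."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY).fetchall()
    # The last column of each plan row is the human-readable detail.
    return " | ".join(row[-1] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE shoes ("
    " id INTEGER PRIMARY KEY, color TEXT, is_left_one BOOLEAN, stock INT)"
)
# Partial index: only rows matching its WHERE clause are stored in it.
conn.execute(
    "CREATE INDEX blue_left_shoes ON shoes (color, is_left_one) "
    "WHERE color = 'blue' AND is_left_one = 1"
)
# Made-up sample data, just enough to run the query.
conn.executemany(
    "INSERT INTO shoes (color, is_left_one, stock) VALUES (?, ?, ?)",
    [("blue", 1, 3), ("blue", 0, 7), ("red", 1, 5)],
)

print(plan_for_blue_left(conn))
print(conn.execute(QUERY).fetchall())
```

The application still issues an ordinary WHERE clause against one table; choosing the small index is left to the planner rather than to application code.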
Partitioning
Database partitioning is where you select one or more columns to decide how to "split up" your table. In your case you could choose (color, is_left_one).
Now your table is logically split and ordered in this way and when you search for blue,true it automatically knows which partition to look in. It doesn't look in any other partitions (this is called partition pruning)
Note that this occurs automatically from the search criteria. You don't need to manually work out a particular table to look at.
Partitioning doesn't require any extra storage (beyond various metadata that has to be saved)
A table can have only one partitioning scheme; you can't partition the same table in several different ways at once.
Indexing
Creating an index also provides performance improvements. However indexes take up space and can impact insert and update performance (as they need to be maintained). Practically speaking, the select trade off almost always far outweighs any insert/update negatives
You should always look at indexes before partitioning
Non selective indexes
In your particular case, there's an extra thing to consider: a boolean field is not "selective". I won't go into details, but suffice it to say you shouldn't create an index on this field alone: it only halves the number of records to look through, so the planner is unlikely to use it. You'd need to include other fields in the index (e.g. color) to make it useful.
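One rough way to see what "selective" means is to compare the number of distinct values in a column with the number of rows. The sketch below does that with Python's sqlite3 module; the `selectivity` helper and the generated sample data are illustrative assumptions, not part of the original question:

```python
import sqlite3


def selectivity(conn, table, column):
    """Distinct values divided by total rows; closer to 1.0 is more selective."""
    # Identifiers are interpolated directly, so only pass trusted names.
    distinct, total = conn.execute(
        f"SELECT COUNT(DISTINCT {column}), COUNT(*) FROM {table}"
    ).fetchone()
    return distinct / total if total else 0.0


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE shoes ("
    " id INTEGER PRIMARY KEY, color TEXT, is_left_one BOOLEAN, stock INT)"
)
colors = ["blue", "green", "red"]
# 600 made-up rows cycling through every color/side combination.
conn.executemany(
    "INSERT INTO shoes (color, is_left_one, stock) VALUES (?, ?, ?)",
    [(colors[i % 3], i % 2, i) for i in range(600)],
)

for col in ("id", "color", "is_left_one"):
    print(col, selectivity(conn, "shoes", col))
```

A primary key scores 1.0, while the boolean scores 2 divided by the row count, which is why an index on it alone rarely narrows the search enough to be worth using.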

In general, you want to keep all "like" data in a single table, not split among multiples. There are good reasons for this:
Adding new combinations is easier.
Maintaining the tables is easier.
You can easily do queries "across" entities.
Overall, the database is more efficient, because it is more likely that pages will be filled.
And there are other reasons as well. In your case, you might have an argument for breaking the data into 6 separate tables. The gain here comes from not having the color and is_left_one in the data. That means that this data is not repeated 6 million times. And that could save many tens of megabytes of data storage.
I say the last a bit tongue-in-cheek (meaning I'm not that serious). Computers nowadays have so much memory that 100 Mbytes is just not significant in general. However, if you have a severely memory-limited environment (I'm thinking "watch" here, not even "smart phone") then it might be useful.
Otherwise, partitioning is a fine solution that pretty much meets your needs.

For this:
WHERE color='blue' AND is_left_one=true
The optimal index is
INDEX(color, is_left_one) -- in either order
Having id first makes it useless for that WHERE.
It is generally bad to have multiple identical tables instead of one.
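The effect of column order can be checked by asking the engine for its plan. The following sketch uses Python's sqlite3 module as a stand-in (the index names `id_first` and `color_side` are made up for illustration, and other engines may plan differently):

```python
import sqlite3

QUERY = "SELECT stock FROM shoes WHERE color = 'blue' AND is_left_one = 1"


def plan(conn):
    """Return the query plan text for QUERY."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY).fetchall()
    return " | ".join(row[-1] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE shoes ("
    " id INTEGER PRIMARY KEY, color TEXT, is_left_one BOOLEAN, stock INT)"
)

# Index led by id: the WHERE clause never constrains its first column.
conn.execute("CREATE INDEX id_first ON shoes (id, color, is_left_one)")
plan_id_first = plan(conn)

# Index led by the filtered columns.
conn.execute("CREATE INDEX color_side ON shoes (color, is_left_one)")
plan_color_side = plan(conn)

print("id first:   ", plan_id_first)
print("color first:", plan_color_side)
```

With only the id-led index available the plan is a table scan; once the (color, is_left_one) index exists, the plan switches to an index search.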

Related

Normalizing an extremely big table

I face the following issue. I have an extremely big table. This table is a heritage from the people who previously worked on the project. The table is in MS SQL Server.
The table has the following properties:
it has about 300 columns. All of them have "text" type, but some of them should really represent other types (for example, integer or datetime), so one has to convert these text values to the appropriate types before using them
the table has more than 100 million rows. The space used by the table will soon reach 1 terabyte
the table does not have any indices
the table does not have any implemented mechanisms of partitioning.
As you may guess, it is impossible to run any reasonable query against this table. At present people only insert new records into the table and nobody uses it. So I need to restructure it. I plan to create a new structure and refill it with the data from the old table. Obviously, I will implement partitioning, but it is not the only thing to be done.
One of the most important features of the table is that those fields that are purely textual (i.e. they don't have to be converted into another type) usually have frequently repeated values. So the actual variety of values in a given column is in the range of 5-30 different values. This induces the idea to make normalization: for every such a textual column I will create an additional table with the list of all the different values that may appear in this column, then I will create a (tinyint) primary key in this additional table and then will use an appropriate foreign key in the original table instead of keeping those text values in the original table. Then I will put an index on this foreign key column. The number of the columns to be processed this way is about 100.
It raises the following questions:
Would this normalization really increase the speed of queries that impose conditions on some of those 100 fields? Leaving aside the space needed to store those columns, would there be any performance gain from replacing the original text columns with tinyint columns? If I do no normalization and simply put an index on the original text columns, will the performance be the same as with an index on the planned tinyint column?
If I do the described normalization, then building a view that shows the text values will require joining my main table with some 100 additional tables. On the positive side, all of those joins are of the form "primary key" = "foreign key". But that is still a large number of tables to join. So the question is: will queries against this view perform no worse than queries against the original non-normalized table? Will the SQL Server optimizer really be able to optimize the query in a way that takes advantage of the normalization?
Sorry for such a long text.
Thanks for every comment!
PS
I created a related question regarding joining 100 tables;
Joining 100 tables
You'll find other benefits to normalizing the data besides the speed of queries running against it... such as size and maintainability, which alone should justify normalizing it...
However, it will also likely improve the speed of queries. A single row containing 300 text columns is massive and is almost certainly past the 8,060-byte limit for storing row data on a single page, so part of it is instead being stored in the ROW_OVERFLOW_DATA or LOB_DATA allocation units.
By reducing the size of each row through normalization, such as replacing redundant text data with a TINYINT foreign key, and by also removing columns that aren't dependent on this large table's primary key into another table, the data should no longer overflow, and you'll also be able to store more rows per page.
As far as the overhead added by performing JOIN to get the normalized data... if you properly index your tables, this shouldn't add a substantial amount of overhead. However, if it does add an unacceptable overhead, you can then selectively de-normalize the data as necessary.
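The lookup-table idea can be sketched for a single column as follows. This uses Python's sqlite3 module rather than SQL Server so that it runs anywhere, and every table, column and value name in it (`big_table_old`, `status_ref`, and so on) is hypothetical; in SQL Server the key column would be declared TINYINT:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE big_table_old (id INTEGER PRIMARY KEY, status TEXT);
INSERT INTO big_table_old (status)
VALUES ('open'), ('closed'), ('open'), ('pending');

-- Reference table: one row per distinct text value.
CREATE TABLE status_ref (
    status_id INTEGER PRIMARY KEY,
    status    TEXT UNIQUE NOT NULL
);
INSERT INTO status_ref (status)
SELECT DISTINCT status FROM big_table_old;

-- New main table stores the small key instead of the repeated text.
CREATE TABLE big_table_new (
    id        INTEGER PRIMARY KEY,
    status_id INTEGER NOT NULL REFERENCES status_ref (status_id)
);
INSERT INTO big_table_new (id, status_id)
SELECT o.id, r.status_id
FROM big_table_old AS o
JOIN status_ref AS r ON r.status = o.status;
CREATE INDEX big_table_new_status ON big_table_new (status_id);

-- View that restores the text for readers of the data.
CREATE VIEW big_table_view AS
SELECT n.id, r.status
FROM big_table_new AS n
JOIN status_ref AS r USING (status_id);
""")


def ids_with_status(conn, status):
    """Return the ids whose status text matches, read through the view."""
    cursor = conn.execute(
        "SELECT id FROM big_table_view WHERE status = ? ORDER BY id",
        (status,),
    )
    return [row[0] for row in cursor]


print(ids_with_status(conn, "open"))
```

Repeating the same pattern for each of the roughly 100 columns gives the structure described in the question; whether the many joins in the view stay cheap is something to measure on the real data.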
Whether this is worth the effort depends on how long the values are. If the values are, say, state abbreviations (2 characters) or country codes (3 characters), the resulting table would be even larger than the existing one. Remember, you need to include the primary key of the reference table. That would typically be an integer and occupy four bytes.
There are other good reasons to do this. Having reference tables with valid lists of values maintains database consistency. The reference tables can be used both to validate inputs and for reporting purposes. Additional information can be included, such as a "long name" or something like that.
Also, SQL Server will spill varchar columns over onto additional pages. It does not spill other types. You only have 300 columns but eventually your record data might get close to the 8k limit for data on a single page.
And, if you decide to go ahead, I would suggest that you look for "themes" in the columns. There may be groups of columns that can be grouped together . . . detailed stop code and stop category, short business name and full business name. You are going down the path of modelling the data (a good thing). But be cautious about doing things at a very low level (managing 100 reference tables) versus identifying a reasonable set of entities and relationships.
1) The system is currently having to do a full table scan on very significant amounts of data, leading to the performance issues. There are many aspects of optimisation which could improve this performance. The conversion of columns to the correct data types would not only significantly improve performance by reducing the size of each record, but would allow data to be made correct. If querying on a column, you're currently looking at the text being compared to the text in the field. With just indexing, this could be improved, but changing to a lookup would allow the ID value to be looked up from a table small enough to keep in memory and then use this to scan just integer values, which is a much quicker process.
2) If data is normalised to 3rd normal form or the like, then you can see instances where performance suffers in the name of data integrity. This is most a problem if the engine cannot work out how to restrict the rows without projecting the data first. If this does occur, though, the execution plan can identify this and the query can be amended to reduce the likelihood of this.
Another point to note is that it sounds like if the database was properly structured it may be able to be cached in memory because the amount of data would be greatly reduced. If this is the case, then the performance would be greatly improved.
The quick way to improve performance would probably be to add indexes. However, this would further increase the overall database size, and doesn't address the issue of storing duplicate data and possible data integrity issues.
There are some other changes which can be made - if a lot of the data is not always needed, then this can be separated off into a related table and only looked up as needed. Fields that are not used for lookups to other tables are particular candidates for this, as the joins can then be on a much smaller table, while preserving a fairly simple structure that just looks up the additional data when you've identified the data you actually need. This is obviously not a properly normalised structure, but may be a quick and dirty way to improve performance (after adding indexing).
Construct in your head and onto paper a normalized database structure
Construct the database (with indexes)
De-construct that monolith. Things will not look so bad. I would guess that A LOT (I MEAN A LOT) of data is repeated
Create SQL insert statements to insert the data into the database
Go to the persons that constructed that nightmare in the first place with a shotgun. Have fun.

Multiple single field indexes vs multiple-fields indexes

I know there are similar questions on StackOverflow, but after testing different indexes on my tables, I think I don't quite understand how indexes work and I'd like it if someone could explain the behavior I'm experiencing on my queries' performance.
I'm using this query as an example, I'm going to try to explain it in detail:
SELECT ss1.PlayerID, ss1.Name, ss1.Series, ss1.LanesNum, ss1.Date, ss1.LeagueName, ss1.Season FROM SeriesScores ss1
JOIN (SELECT Series, Gender, LanesNum, Bowlout, Season FROM SeriesScores
WHERE Gender = ? AND LanesNum = ? AND Series > -1 AND Bowlout = 'No' AND Season = '2011-2012'
ORDER BY Series DESC LIMIT 0,?) as ss2
USING(series, gender, lanesNum, bowlout, season)
ORDER BY ss1.Series DESC
This query is used to get the highest series bowled in a given season for each pair of lanes in a bowling center for both male and female players.
I'm joining the table on itself instead of using the MAX aggregate function because if there's a tie on a given pair of lanes, I want all the names to come up.
Basically, I join all the fields that match what the inner SELECT returns. That inner SELECT returns the top X players for a given gender and a given pair of lanes.
The USING part makes sure only the players that haven't bowled out, with the same gender, series, lanesNum and season as I'm looking for get selected. I then order them by highest series to lowest series.
This query is in a for loop, which gets run 12 times for men and 12 times for women (12 pair of lanes in the bowling center) with only the lanesNum and gender parameters changing.
I then put all the results in two different vectors in Java to display the results in an application (one vector for men, one for women).
Without any indexes whatsoever, it takes around 11 seconds to run everything including putting the results in a vector and all of that. (5.5 seconds for the 12 queries for men, same for women).
With an index on (gender, lanesNum, series), it takes 0.04 seconds for the whole thing, which is amazing, since that's a more than acceptable speed for my needs.
I used that index because those are all the most important fields I'm using in my WHERE clause, but I don't get why it speeds things up that much, because I tried other things and using some other indexes actually made my queries SLOWER by more than 100%. Also, I'm wondering if I would get an even faster query if I added "bowlout" and "season" to that index.
I wanted to try a single column index on series first and test performance. That's the index that made all of those queries take a total of 22 seconds.
I came to the conclusion that I don't understand where I should be using my indexes and when I should be using them on multiple fields, or using multiple indexes on single fields, etc. Also, I don't understand how using (the wrong) indexes can actually make performance worse.
Optimizing an index too aggressively for just one query runs the risk of slowing down other queries (and thus a real world application, or the next version of it). However, let us do exactly that as an exercise in analysing index performance.
Indexes influence query performance in multiple ways; their existence can actually completely change the algorithm that the database server will use to get to the data. A nice overview is here, but as your query is simple, and you actually have very few relevant indexes in your database (the one you see, and also automatically created indexes to support the primary keys of your tables) we can simplify the story greatly.
A good index makes it faster to cross reference the data between the tables. Ideally it contains columns in your USING and WHERE clauses, and enough of them to reference a unique row in its table most of the time. If it contains less, it may still be used by the database server, but the remaining rows will have to be visited one by one.
A great index not only does all that, but also contains all the data that you will be selecting from the table (yes, this makes sense even when the two tables are actually the same physical table due to the self-join; the database server still processes them as if they were two different tables that happen to hold the same data). The benefit of such a "fully covering index" is that the database server does not have to visit its table at all; all the columns are available in the index.
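Whether an index covers a query shows up in the query plan. Here is a small sketch with Python's sqlite3 module, using a cut-down version of the SeriesScores table and the (gender, lanesNum, series) index from the question; the index name and the literal filter values are assumptions:

```python
import sqlite3


def plan(conn, sql):
    """Return the query plan text for the given SELECT statement."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[-1] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SeriesScores ("
    " PlayerID INTEGER, Name TEXT, Series INTEGER, Gender TEXT,"
    " LanesNum INTEGER, Bowlout TEXT, Season TEXT)"
)
conn.execute(
    "CREATE INDEX idx_gender_lanes_series "
    "ON SeriesScores (Gender, LanesNum, Series)"
)

# Series is stored in the index, so the table itself is never visited.
covered = plan(
    conn,
    "SELECT Series FROM SeriesScores WHERE Gender = 'M' AND LanesNum = 1",
)
# Name is not in the index, so each match needs an extra table lookup.
not_covered = plan(
    conn,
    "SELECT Name FROM SeriesScores WHERE Gender = 'M' AND LanesNum = 1",
)

print(covered)
print(not_covered)
```

SQLite labels the first plan as using a COVERING INDEX, while the second uses the same index but still has to fetch each matching row from the table.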
Order of columns in the index matters. It is especially essential that the leftmost column in the index appears in the USING clause, or WHERE clause; otherwise the index is pretty much unusable as matching data for a single lookup can appear in many locations in that index. It should also be highly selective (have many different values in the table). Do a few experiments now to see this first hand.
For this reason, the first choice index I'd suggest to you would be series, gender, lanesNum, bowlout; but yours is also a very good one for this query.
There is not much use in creating more than one index explicitly. There is basically no use for more than one of them during query execution, because your query is so simple. So the most useful one will supposedly win and all the others will be ignored.
To your last question: some people believe that superfluous indexes only slow down UPDATE, INSERT and DELETE statements (because these carry the overhead to update the indexes), but it is not that simple. As the database server considers multiple algorithms to compute your query (there are two logical tables to start from and automatic and explicit indexes to use, or not to use), it may choose the wrong plan: an index may look seductive without knowing the data distribution in the table, but be very counterproductive given the distribution.
There is actually a way to let the database server analyze the data and record some statistics that will greatly help it optimize your subsequent queries reasonably and probably to avoid any 22 second executions of your query (until you change your data so much that the statistics will no longer hold true). That is the ANALYZE command. Issue it every time after you change your indexes to see the subsequent sqlite performance at its best. In a production database, schedule ANALYZE to execute every night, so that your database does not gradually slow down over time, or abruptly after adding a harmless, useless index.
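As a sketch of what ANALYZE leaves behind in SQLite, the snippet below (Python's sqlite3 module, with made-up rows and a hypothetical index name) runs it and reads back the statistics the planner will consult:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE SeriesScores ("
    " PlayerID INTEGER, Series INTEGER, Gender TEXT, LanesNum INTEGER)"
)
conn.execute(
    "CREATE INDEX idx_gender_lanes_series "
    "ON SeriesScores (Gender, LanesNum, Series)"
)
# Made-up data: 2 genders x 12 lane pairs x 10 series each.
conn.executemany(
    "INSERT INTO SeriesScores VALUES (?, ?, ?, ?)",
    [
        (n, 400 + n, gender, lanes)
        for n in range(10)
        for gender in ("M", "F")
        for lanes in range(1, 13)
    ],
)


def index_stats(conn):
    """Run ANALYZE and return (index name, stat string) pairs."""
    conn.execute("ANALYZE")
    return conn.execute(
        "SELECT idx, stat FROM sqlite_stat1 WHERE idx IS NOT NULL"
    ).fetchall()


for name, stat in index_stats(conn):
    print(name, stat)
```

Each stat string starts with the row count, followed by the average number of rows per distinct prefix of the index columns; the planner uses these figures to judge how selective an index really is.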

What indexing implementations can handle arbitrary column combinations?

I am developing a little data warehouse system with a web interface where people can do filtered searches. There are currently about 50 columns that people may wish to filter on, and about 2.5 million rows. A table scan is painfully slow. The trouble is that the queries I'm getting have no common prefixes.
Right now I'm using sqlite3, which will only use an index if the columns required are the leftmost columns in that index. This seems to mean I'd need a lot of indexes. A quick glance at MySQL suggests it would also require many indexes for this kind of query.
My question is what indexing implementations are available for different database systems which can handle this kind of query on arbitrary combinations of columns?
I've prototyped my own indexing scheme; I store extra tables which list integer primary keys in my big table where each value for each column occur, and I keep enough statistics to be able to first examine the values with the smallest number of matches. It works okay; much better than a table scan but still a bit on the slow side, which is unsurprising for a first version in Python doing many SQL queries.
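The home-grown scheme described above might look roughly like this in plain Python. It is only an in-memory sketch of the idea (one id set per column value, intersected starting with the smallest set); the function names and sample rows are hypothetical, and the asker's real version keeps these lists in SQL tables:

```python
from collections import defaultdict


def build_value_index(rows, columns):
    """Map (column, value) to the set of primary keys where it occurs."""
    index = defaultdict(set)
    for pk, row in rows.items():
        for col in columns:
            index[(col, row[col])].add(pk)
    return index


def search(index, criteria):
    """Intersect the id sets for all criteria, smallest set first."""
    id_sets = sorted(
        (index.get((col, val), set()) for col, val in criteria.items()),
        key=len,
    )
    if not id_sets:
        return set()
    result = set(id_sets[0])
    for ids in id_sets[1:]:
        result &= ids
        if not result:
            break  # no row can match; stop early
    return result


# Made-up rows keyed by primary key.
rows = {
    1: {"country": "FR", "status": "open", "tier": "gold"},
    2: {"country": "FR", "status": "closed", "tier": "gold"},
    3: {"country": "DE", "status": "open", "tier": "silver"},
    4: {"country": "FR", "status": "open", "tier": "silver"},
}
index = build_value_index(rows, ["country", "status", "tier"])
print(search(index, {"country": "FR", "status": "open"}))
```

Starting from the smallest set keeps the intermediate result small, which is the purpose of the per-value statistics mentioned in the question.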
There are column-oriented databases that store data on a per-column basis, where every column is its own index. They are a very good fit for data warehouses, as they are extremely fast to read but fairly slow to update.
Kickfire is such an example, which is a customized MySQL engine and has held the TPC-H benchmark top crown for a number of weeks, at an impressive system cost. Note that Kickfire is an appliance, sold as a hardware box.
Infobright would be another similar example, and has a free community edition that runs on Windows and Linux.
When there's too many indexes to create for a table I usually fall back on Full Text Search. Can't say if it will fit your scenario though.
Since data warehouses are typically optimized for reading data, not writing it, I would consider simply indexing all the columns. Yes, this will slow down loading data into the warehouse, but typically that happens during non-peak hours and only once a day or less often.
One should only consider introducing "home grown" index structures, based on SQL tables, as a last resort, i.e. if there still exist [business-wise plausible] query cases not properly handled with a traditional index setup, for example if the list of such indexes were to become too big.
A few observations
You do not necessarily need indexes that include all of the columns that may be involved in one particular query; only the [collectively] selective ones may be required.
In other words, if the query uses, for example, columns a, b, c and d, and an index on a and b exists and statistically narrows the result to only a few thousand rows, it may be acceptable not to introduce indexes on a, b and c (or d, or both), provided c and d are not very plausible search criteria (used infrequently), or their width is such that it would unduly burden the a+b index (or there are other columns better suited to being "tacked on" to the a+b index).
Aside from the obvious additional demand they put on disk storage, additional indexes, while possibly helping with SELECT (read) operations may also become an impediment with CUD (Create/Update/Delete) operations. It appears the context here is akin to a datawarehouse, where few [unscheduled] CUD operations take place, but it is good to keep this in mind.
See SQLite Optimizer for valuable insight into the way SQLite determines the way a particular query is executed.
Making a list of indexes
A tentative basis for the index scheme for this application may look like this:
[A] A single column index for every column in the table (save maybe the ones which are ridiculously unselective, say a "Married" column w/ "Y/N" values in it....)
[B] A two (or three) column index for each of the likely/common use case queries
[C] Additional two/three column indexes for the cases where some non-common query case involves a set of columns none of which is individually selective.
From this basis we then can define the actual list of indexes needed by:
Adding one (or a few) extra columns at the end of the [B] indexes above (in a well thought out order...). Typically such columns are chosen because of their relatively small width (they do not grow the index unduly) and because they have a reasonable chance of being used in combination with the columns cited before them in the index.
Removing the [A] indexes which are generally equivalent to one or several [B] indexes. That is: indexes which start with the same column, and for which the extra columns do not burden the index much.
Reviewing the TREE of all possible (or all acceptable) cases, and marking off the branches adequately served by the indexes above. Then adding yet more indexes for the odd use cases not readily covered (if only with a partial index scan + main table lookup for an acceptable number of rows).
In this situation, I find a hand-written tree structure a useful tool to help manage the otherwise unmanageable lists of possible combinations. Assuming a maximum of 4 search criteria selected from the 50 columns indicated in the question, we have in excess of 230,000 combinations to consider... The tree helps prune this rather quickly.
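The combination count quoted above can be checked with a couple of lines of Python (`math.comb` needs Python 3.8 or later; the helper names are made up for illustration):

```python
import math


def combos_exactly(n_columns, k):
    """Number of ways to pick exactly k filter columns."""
    return math.comb(n_columns, k)


def combos_up_to(n_columns, k_max):
    """Number of ways to pick between 1 and k_max filter columns."""
    return sum(math.comb(n_columns, k) for k in range(1, k_max + 1))


print(combos_exactly(50, 4))  # exactly 4 of the 50 columns
print(combos_up_to(50, 4))    # 1, 2, 3 or 4 of the 50 columns
```

Choosing exactly 4 of 50 columns gives 230,300 combinations, and allowing 1 to 4 columns gives 251,175, so "in excess of 230,000" holds either way.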

Table with a lot of columns

If my table has a huge number of columns (over 80) should I split it into several tables with a 1-to-1 relationship or just keep it as it is? Why? My main concern is performance.
PS - my table is already in 3rd normal form.
PS2 - I am using MS Sql Server 2008.
PS3 - I do not need to access all table data at once, but rather have 3 different categories of data within that table, which I access separately. It is something like: member preferences, member account, member profile.
80 columns really isn't that many...
I wouldn't worry about it from a performance standpoint. Having a single table (if you're typically using all of the data in your standard operations) will probably outperform multiple tables with 1-1 relationships, especially if you're indexing appropriately.
I would worry about this (potentially) from a maintenance standpoint, though. The more columns of data in a single table, the less understandable the role of that table in your grand scheme becomes. Also, if you're typically only using a small subset of the data, and all 80 columns are not always required, splitting into 2+ tables might help performance.
Re the performance question - it depends. The larger a row is, the fewer rows can be read from disk in one read. If you have a lot of rows, and you want to be able to read the core information from the table very quickly, then it may be worth splitting it into two tables - one with small rows with only the core info that can be read quickly, and an extra table containing all the info you rarely use that you can look up when needed.
Taking another tack, from a maintenance & testing point of view, if as you say you have 3 distinct groups of data in the one table albeit all with the same unique id (e.g. member_id) it might make sense to split it out into separate tables.
If you need to add fields to, say, the profile details section of the members info table, do you really want to run the risk of having to re-test the preferences & account details elements of your app as well, to ensure there are no knock-on impacts?
The same goes for audit trail purposes, if you want to track the last user ID/timestamp that changed a member's data. If the admin app allows Preferences/Account Details/Profile Details to be updated separately, then it makes sense to have them in separate tables to track updates more easily.
Not quite a SQL/Performance answer but maybe something to look at from a DB & App design pov
Depends what those columns are. If you've got hard coded duplicated fields like Colour1, Colour2, Colour3, then these are candidates for child tables. My general rule of thumb is if there's more than one field of the same type (Colour), then you might as well code for N of them, not a fixed number.
Rob.
1-1 may be easier, if you have, say, Member_Info; Member_Pref; Member_Profile. Having too many columns can cause trouble if you want lots of varchar(255), as you may go over the row size limit, and it just makes things too confusing.
Just make sure you have the correct foreign key constraints and so on, so there's always 1 row in each table with the same member_id

`active' flag or not?

OK, so practically every database-based application has to deal with "non-active" records, either soft deletions or marking something as "to be ignored". I'm curious as to whether there are any radical alternatives to an `active' column (or a status column).
For example, if I had a list of people
CREATE TABLE people (
id INTEGER PRIMARY KEY,
name VARCHAR(100),
active BOOLEAN,
...
);
That means to get a list of active people, you need to use
SELECT * FROM people WHERE active=True;
Does anyone suggest that non-active records be moved off to a separate table, with a UNION done to join the two where appropriate?
Curiosity striking...
EDIT: I should make clear, I'm coming at this from a purist perspective. I can see how data archiving might be necessary for large amounts of data, but that is not where I'm coming from. If you do a SELECT * FROM people it would make sense to me that those entries are in a sense "active"
Thanks
You partition the table on the active flag, so that active records are in one partition, and inactive records are in the other partition. Then you create an active view for each table which automatically has the active filter on it. The database query engine automatically restricts the query to the partition that has the active records in it, which is much faster than even using an index on that flag.
Here is an example of how to create a partitioned table in Oracle. Oracle doesn't have boolean column types, so I've modified your table structure for Oracle purposes.
CREATE TABLE people
(
id NUMBER(10),
name VARCHAR2(100),
active NUMBER(1)
)
PARTITION BY LIST(active)
(
PARTITION active_records VALUES (1),
PARTITION inactive_records VALUES (0)
);
If you wanted to you could put each partition in different tablespaces. You can also partition your indexes as well.
Incidentally, this seems a repeat of this question, as a newbie I need to ask, what's the procedure on dealing with unintended duplicates?
Edit: As requested in comments, provided an example for creating a partitioned table in Oracle
Well, to ensure that you only draw active records in most situations, you could create views that only contain the active records. That way it's much easier to not leave out the active part.
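A minimal sketch of the view approach, using the people table from the question and Python's sqlite3 module (the view name `active_people` and the sample rows are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE people (
    id     INTEGER PRIMARY KEY,
    name   VARCHAR(100),
    active BOOLEAN
);
INSERT INTO people (name, active) VALUES ('Ann', 1), ('Bob', 0), ('Cy', 1);

-- The view carries the filter so callers cannot forget it.
CREATE VIEW active_people AS
SELECT id, name FROM people WHERE active = 1;
""")


def active_names(conn):
    """Return the names of active people, read through the view."""
    cursor = conn.execute("SELECT name FROM active_people ORDER BY id")
    return [row[0] for row in cursor]


print(active_names(conn))
```

Code that should only ever see active rows queries active_people, while the base table remains available for audits and reports.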
We use an enum('ACTIVE','INACTIVE','DELETED') in most tables so we actually have a 3-way flag. I find it works well for us in different situations. Your mileage may vary.
Moving inactive stuff is usually a stupid idea. It's a lot of overhead with lots of potential for bugs, everything becomes more complicated, like unarchiving the stuff etc. What do you do with related data? If you move all that, too, you have to modify every single query. If you don't move it, what advantage were you hoping to get?
That leads to the next point: WHY would you move it? A properly indexed table requires one additional lookup when the size doubles. Any performance improvement is bound to be negligible. And why would you even think about it until the distant future time when you actually have performance problems?
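The "one additional lookup when the size doubles" remark follows from the logarithmic cost of searching a sorted structure. As a simplified model, the sketch below counts binary-search steps; this is an assumption for illustration, since real B-tree indexes have a much higher fan-out and need even fewer levels:

```python
import math


def binary_search_steps(n_rows):
    """Upper bound on comparisons to find one key among n_rows sorted keys."""
    return math.ceil(math.log2(n_rows))


for n in (1_000_000, 2_000_000, 4_000_000):
    print(n, binary_search_steps(n))
```

Going from one million to two million rows raises the bound from 20 to 21 comparisons, which is why removing half the rows from an indexed table buys so little.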
I think looking at it strictly as a piece of data then the way that is shown in the original post is proper. The active flag piece of data is directly dependent upon the primary key and should be in the table.
That table holds data on people, irrespective of the current status of their data.
The active flag is sort of ugly, but it is simple and works well.
You could move them to another table as you suggested. I'd suggest looking at the percentage of active / inactive records. If you have over 20 or 30 % inactive records, then you might consider moving them elsewhere. Otherwise, it's not a big deal.
Yes, we would. We currently have the "active='T/F'" column in many of our tables, mainly to show the 'latest' row. When a new row is inserted, the previous T row is marked F to keep it for audit purposes.
Now we're moving to a 2-table approach: when a new row is inserted, the previous row is moved to a history table. This gives us better performance for the majority of cases - looking at the current data.
The cost is slightly more than the old method. Previously you had to update and insert; now you have to insert and update (i.e. instead of inserting a new T row, you modify the existing row with all the new data), so the extra cost is just that of passing in a whole row of data instead of just the changes. That's hardly going to have any effect.
The performance benefit is that your main table's index is significantly smaller, and you can optimise your tablespaces better (they won't grow quite so much!)
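The two-table approach can be sketched as one transaction that copies the current row into a history table and then overwrites it in place. This uses Python's sqlite3 module, and the `prices` and `prices_history` tables are hypothetical examples rather than the answerer's actual schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE prices (
    item_id    INTEGER PRIMARY KEY,
    price      INTEGER,
    changed_at TEXT
);
CREATE TABLE prices_history (
    item_id    INTEGER,
    price      INTEGER,
    changed_at TEXT
);
INSERT INTO prices VALUES (1, 100, '2024-01-01');
""")


def update_price(conn, item_id, new_price, changed_at):
    """Archive the current row, then overwrite it, in one transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO prices_history "
            "SELECT item_id, price, changed_at FROM prices WHERE item_id = ?",
            (item_id,),
        )
        conn.execute(
            "UPDATE prices SET price = ?, changed_at = ? WHERE item_id = ?",
            (new_price, changed_at, item_id),
        )


update_price(conn, 1, 120, "2024-02-01")
print(conn.execute("SELECT * FROM prices").fetchall())
print(conn.execute("SELECT * FROM prices_history").fetchall())
```

The main table always holds exactly one current row per item, so its index stays small, and the audit trail accumulates in the history table.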
Binary flags like this in your schema are a BAD idea. Consider the query
SELECT count(*) FROM users WHERE active=1
Looks simple enough. But what happens when you have so many users that this table needs an index? Again, it looks straightforward:
ALTER TABLE users ADD INDEX index_users_on_active (active)
EXCEPT!! This index is useless because the cardinality of this column is exactly two! Unless one of the two values is rare, most query optimisers will ignore this index because of its low cardinality and do a table scan.
Before filling up your schema with helpful flags consider how you are going to access that data.
https://stackoverflow.com/questions/108503/mysql-advisable-number-of-rows
We use active flags quite often. If your database is going to be very large, I could see the value in migrating inactive values to a separate table, though.
You would then only require a union of the tables when someone wants to see all records, active or inactive.
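That union can live in a view, so callers who want everything don't have to repeat it. A minimal sketch with Python's sqlite3 module as a stand-in; the users_active, users_inactive and users_all names are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users_active   (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE users_inactive (id INTEGER PRIMARY KEY, name TEXT NOT NULL);

-- One view that stitches both tables back together, with a derived flag.
CREATE VIEW users_all AS
    SELECT id, name, 1 AS active FROM users_active
    UNION ALL
    SELECT id, name, 0 AS active FROM users_inactive;

INSERT INTO users_active   VALUES (1, 'ann'), (3, 'cy');
INSERT INTO users_inactive VALUES (2, 'bob');
""")

all_users = conn.execute(
    "SELECT id, name, active FROM users_all ORDER BY id"
).fetchall()
print(all_users)
```

UNION ALL is used rather than UNION because a record lives in exactly one of the two tables, so there are no duplicates to remove.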
In most cases a binary field indicating deletion is sufficient. Often there is a clean up mechanism that will remove those deleted records after a certain amount of time, so you may wish to start the schema with a deleted timestamp.
Moving off to a separate table and bringing them back up takes time. Depending on how many records go offline and how often you need to bring them back, it might or might not be a good idea.
If they mostly don't come back once they are buried, and are only used for summaries/reports/whatever, then it will make your main table smaller, queries simpler and probably faster.
We use both methods for dealing with inactive records. The method we use is dependent upon the situation. For records that are essentially lookup values, we use the Active bit field. This allows us to deactivate entries so they won't be used, but also allows us to maintain data integrity with relations.
We use the "move to a separate table" method where the data is no longer needed and the data is not part of a relation.
The situation really dictates the solution, methinks:
If the table contains users, then several "flag" fields could be used: one for Deleted, one for Disabled, etc. Or, if space is an issue, a single flag for disabled would suffice, with the row actually being deleted when the user is deleted.
It also depends on policies for storing data. If there are policies for keeping data archived, then a separate table would most likely be necessary after any great length of time.
No - this is a pretty common thing - couple of variations depending on specific requirements (but you already covered them):
1) If you expect to have a whole BUNCH of data - like multiple terabytes or more - not a bad idea to archive deleted records immediately - though you might use a combination approach of marking as deleted then copying to archive tables.
2) Of course the option to hard delete a record still exists, though we developers tend to be data pack-rats. I suggest that you look at the business process and decide whether there is any need to keep the data at all. If there is, do so; if there isn't, you should probably feel free just to throw the stuff away, again according to the specific business scenario.
From a 'purist perspective' the relational model doesn't differentiate between a view and a table - both are relations. So that use of a view that uses the discriminator is perfectly meaningful and valid provided the entities are correctly named e.g. Person/ActivePerson.
Also, from a 'purist perspective' the table should be named person, not people as the name of the relation reflects a tuple, not the entire set.
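The Person/ActivePerson naming could look roughly like this. It is a sketch only, run through Python's sqlite3 module so it can be tried as-is; the snake_case names person and active_person and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE person (
    id     INTEGER PRIMARY KEY,
    name   TEXT NOT NULL,
    active INTEGER NOT NULL DEFAULT 1   -- the discriminator column
);

-- The view is just another relation; callers never mention the flag.
CREATE VIEW active_person AS
    SELECT id, name FROM person WHERE active = 1;

INSERT INTO person (id, name, active) VALUES (1, 'ann', 1), (2, 'bob', 0);
""")

names = [row[0] for row in conn.execute("SELECT name FROM active_person ORDER BY id")]
print(names)
```

Queries against active_person read like queries against any table, and the filtering rule lives in exactly one place.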
Regarding indexing the boolean, why not:
ALTER TABLE users ADD INDEX index_users_on_active (id, active) ;
Would that not improve the search?
However I don't know how much of that answer depends on the platform.
This is an old question, but for those searching for low cardinality/selectivity indexes, I'd like to propose the following approach that avoids partitioning, secondary tables, etc.:
The trick is to use a "dateInactivated" column that stores the timestamp of when the record is inactivated/deleted. As the name implies, the value is NULL while the record is active, but once it is inactivated, you write in the system datetime. Thus, an index on that column ends up having high selectivity as the number of "deleted" records grows, since each record will have a unique (not strictly speaking) value.
Then your query becomes:
SELECT * FROM people WHERE dateInactivated is NULL;
The index will pull in just the right set of rows that you care about.
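Here is a small end-to-end sketch of the dateInactivated idea, using Python's sqlite3 module as a stand-in so it runs anywhere. The inactivate helper, the index name and the sample rows are made up for the example, and whether the planner actually uses the index for the IS NULL lookup depends on your database and data distribution.

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE people (
    id              INTEGER PRIMARY KEY,
    name            TEXT NOT NULL,
    dateInactivated TEXT            -- NULL while the record is active
);
CREATE INDEX people_dateInactivated ON people (dateInactivated);
INSERT INTO people (id, name) VALUES (1, 'ann'), (2, 'bob'), (3, 'cy');
""")


def inactivate(conn, person_id):
    """Soft-delete a record by stamping it with the current UTC time."""
    now = datetime.now(timezone.utc).isoformat()
    with conn:
        conn.execute(
            "UPDATE people SET dateInactivated = ? "
            "WHERE id = ? AND dateInactivated IS NULL",
            (now, person_id),
        )


inactivate(conn, 2)
active = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM people WHERE dateInactivated IS NULL ORDER BY id"
    )
]
print(active)
```

A side benefit is that the column also records when each record was inactivated, which a plain boolean flag cannot do.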
Filtering data on a bit flag is not really good for performance on big tables. When 'active' represents virtual deletion, you can create a 'TableName_deleted' table with the same structure and move deleted data there using a delete trigger.
That solution will help with performance and simplifies data queries.
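A minimal sketch of such a delete trigger, written against SQLite through Python's sqlite3 module so it can be run directly. Trigger syntax differs between databases (PostgreSQL, for instance, needs a trigger function), and the users and users_deleted names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users         (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE users_deleted (id INTEGER PRIMARY KEY, name TEXT NOT NULL);

-- Copy each row into the archive table just before it is removed.
CREATE TRIGGER users_archive_on_delete
BEFORE DELETE ON users
BEGIN
    INSERT INTO users_deleted (id, name) VALUES (OLD.id, OLD.name);
END;

INSERT INTO users VALUES (1, 'ann'), (2, 'bob');
DELETE FROM users WHERE id = 2;
""")

remaining = conn.execute("SELECT name FROM users").fetchall()
archived = conn.execute("SELECT id, name FROM users_deleted").fetchall()
print(remaining, archived)
```

The application just issues an ordinary DELETE; the trigger takes care of the archiving, so no query against the main table needs a flag filter.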