perfomance of single table vs. joined dual structure - sql

This is not a question about using another tool. It is not a question about using different data structure. It is issue about WHY I see what I see -- please read to the end before answering. Thank you.
THE STORY
I have one single table which has one condition, the records are not deleted. Instead the record is marked as not active (there is field for that) and in such case all fields (except for identifiers and this isActive field) are considered irrelevant.
More on identifiers -- there are two fields:
id -- int, primary key, clustered
name -- unique, varchar, external index
How the update is done for example (I use C#/Linq/MSSQL2005): I fetch the record basing on name, then change required fields and commit the changes, so the update is executed (UPDATE uses id, not name).
However there is a problem with storage. So why not not break this table into dual structure -- "header" table (id, name, isActive) and data table (id, rest of the fields). In case of a problem with storage we can delete all records from data table for real (for isActive=false).
edit (by Shimmy): header+data are not retrieved by LINQ with join. Data records are loaded on demand (and this always occurs because of the code).
comment (by poster): AFAIR there is no join, so this is irrelevant. Data for headers were loaded manually. See below.
Performance -- (my) Theory
Now, what about performance? Which one will be faster? Let's say you have 10000 records in both tables (single, header, data) and you update them one by one (all 3 tables) -- fields isActive and some field from the "data" fields.
My calculation was/is:
mono table -- search using external index, then jumping into the structure, fetching all the data, update using primary key.
dual tables -- search using external index, jumping into the header table, fetching all the data, search using primary key on data table (no jumping here, it is clustered index), fetching all the data, update both tables using primary keys.
So, for me mono structure should be faster, because in dual case I have the same operations plus some extras.
The results
No matter what I do, update, select, insert, dual structure is either slightly better (speed) or up to 30% faster. And now I am all puzzled -- I would understand that I if were insert/update/select only header records, but in every case data records are used as well.
The question -- why/how dual structure can be faster?

I think this all boils down to how much data is being fetched, inserted, and updated.
SELECT case - in the dual-table configuration you're fetching less data. Database runtime is heavily dominated by I/O time, so having the "header" fields replicated on each row in the single-table configuration means you have to read that same data over and over again. In the dual-table configuration you only read the header data once.
INSERT case - similar to the above, but related to writing the data instead of reading it.
UPDATE case - your code updates the "isActive" field, which if I've read it correctly is part of the "header" fields. In the single-table configuration you're forcing many rows to be updated for each "isActive" change. In the dual-table configuration you're only updating a single header row for each "isActive" change.
I think this is a case of premature optimization. I get the feeling that you understood that according to data normalization rules the dual-table configuration was "better" - but because the single-table case seemed like it would provide better performance you wanted to go with that design. Thankfully you took the time to test what would happen and found that observed performance did not match your expectations. GOOD JOB! I wish more people would take the time to test things out like this. I think the lesson to learn here is that data normalization is a Good Thing.
Remember, the best time to optimize something is NEVER! The second-best time to optimize things is when you have an observed performance problem. The worst time to optimize is during analysis.
I hope this helps.

Assumption: Sql Server for database.
Sql Server tends to be higher in performance on narrow tables rather than wide. While this might not be true for something such as a mainframe.
This really points to normalization until you decide NOT to for performance reasons, and in this case the assumption that de-normalized tables would be more efficient is incorrect. Normalized structures can be better managed in the resources than de-normalized in this environment. I suspect (no citable basis of this) that the resource (hardware, multiprocessors, threading etc) makes the normalized structure faster because more stuff gets done at the same time.

Have you looked at the two query plans? That often gives it away.
As for speculation, the size of a row in a table affects how fast you can scan it. smaller rows means more rows fit into a data page. the brunt of a query is usually in the I/O time, so using two smaller tables greatly reduces the amount of data you have to sift through in the indexes.
Also, the locks are more granular -- the first update could write to table1, and then the second update could write to table1 while you're finishing up table2.

Related

DB Architecture: One table using WHERE vs multiple

I wonder what is the difference between having one table with 6 millions row (aka with a huge DB) and 100k active users:
CREATE TABLE shoes (
id serial primary key,
color text,
is_left_one boolean,
stock int
);
With also 6 index like:
CREATE INDEX blue_left_shoes ON shoes(color,is_left_one) WHERE color=blue AND is_left_one=true;
Versus: 6 tables with 1 million rows:
CREATE TABLE blue_left_shoes(
id serial primary key,
stock int
);
The latter one seems more efficient because users don't have to ask for the condition since the table IS the condition, but perhaps creating the indexes mitigate this?
This table is used to query either left, right, "blue", "green" or "red" shoes and to check the number of remaining items, but it is a simplified example but you can think of Amazon (or any digital selling platform) tooltip "only 3 items left in stock" for the workload and the usecase. It is the users (100k active daily) who will make the query.
NB: The question is mostly for PostgreSQL but differences with other DB is still relevant and interesting.
In the latter case, where you use a table called blue_left_shoes
Your code needs to first work out which table to look at (as opposed to parameterising a value in the where clause)
As permutations and options increase, you need to increase the number of tables, and increase the logic in your app that works out which table to use
Anything that needs to use this database (i.e. a reporting tool or an API) now needs to re implement all of these rules
You are imposing logic at a high layer to improve performance.
If you were to partition and/or index your table appropriately, you get the same effect - SQL queries only look through the records that matter. The difference is that you don't need to implement this logic in higher layers
As long as you can get the indexing right, keeping this is one table is almost always the right thing to do.
Partitioning
Database partitioning is where you select one or more columns to decide how to "split up" your table. In your case you could choose (color, is_left_one).
Now your table is logically split and ordered in this way and when you search for blue,true it automatically knows which partition to look in. It doesn't look in any other partitions (this is called partition pruning)
Note that this occurs automatically from the search criteria. You don't need to manually work out a particular table to look at.
Partitioning doesn't require any extra storage (beyond various metadata that has to be saved)
You can't apply multiple partitions to a table. Only one
Indexing
Creating an index also provides performance improvements. However indexes take up space and can impact insert and update performance (as they need to be maintained). Practically speaking, the select trade off almost always far outweighs any insert/update negatives
You should always look at indexes before partitioning
Non selective indexes
In your particular case, there's an extra thing to consider: a boolean field is not "selective". I won't go into details but suffice to say you shouldn't create an index on this field alone, as it won't be used because it only halves the number of records you have to look through. You'd need to include some other fields in any index (i.e. colour) to make it useful
In general, you want to keep all "like" data in a single table, not split among multiples. There are good reasons for this:
Adding new combinations is easier.
Maintaining the tables is easier.
You an easily do queries "across" entities.
Overall, the database is more efficient, because it is more likely that pages will be filled.
And there are other reasons as well. In your case, you might have an argument for breaking the data into 6 separate tables. The gain here comes from not having the color and is_left_one in the data. That means that this data is not repeated 6 million times. And that could save many tens of megabytes of data storage.
I say the last a bit tongue-in-cheek (meaning I'm not that serious). Computers nowadays have so much member that 100 Mbytes is just not significant in general. However, if you have a severely memory limited environment (I'm thinking "watch" here, not even "smart phone") then it might be useful.
Otherwise, partitioning is a fine solution that pretty much meets your needs.
For this:
WHERE color=blue AND is_left_one=true
The optimal index is
INDEX(color, is_left_one) -- in either order
Having id first makes it useless for that WHERE.
It is generally bad to have multiple identical tables instead of one.

Normalizing an extremely big table

I face the following issue. I have an extremely big table. This table is a heritage from the people who previously worked on the project. The table is in MS SQL Server.
The table has the following properties:
it has about 300 columns. All of them have "text" type but some of them eventually should represent other types (for example, integer or datetime). So one has to convert this text values in appropriate types before using them
the table has more than 100 milliom rows. The space for the table would soon reach 1 terabyte
the table does not have any indices
the table does not have any implemented mechanisms of partitioning.
As you may guess, it is impossible to run any reasonable query to this table. Now people only insert new records into the table but nobody uses it. So I need to restructure it. I plan to create a new structure and refill the new structure with the data from the old table. Obviously, I will implement partioning, but it is not the only thing to be done.
One of the most important features of the table is that those fields that are purely textual (i.e. they don't have to be converted into another type) usually have frequently repeated values. So the actual variety of values in a given column is in the range of 5-30 different values. This induces the idea to make normalization: for every such a textual column I will create an additional table with the list of all the different values that may appear in this column, then I will create a (tinyint) primary key in this additional table and then will use an appropriate foreign key in the original table instead of keeping those text values in the original table. Then I will put an index on this foreign key column. The number of the columns to be processed this way is about 100.
It raises the following questions:
would this normalization really increase the speed of the queires imposing conditions on some of those 100 fields? If we forget about the size needed to keep those columns, whether would there be any increase in the performance due to the substition of the initial text-columns with tinyint-columns? If I do not do any normalization and simply put an index on those initial text columns, whether the performace will be the same as for the index on the planned tinyint-column?
If I do the described normalization, then building a view showing the text values will require joining my main table with some 100 additional tables. A positive moment is that I'll do those joins for pairs "primary key"="foreign key". But still quite a big amount of tables should be joined. Here is the question: whether the performance of the queryes made to this view compare to the performance of the queries to the initial non-normalized table will be not worse? Whether the SQL Server Optimizer will really be able to optimize the query the way that allows taking the benefits of the normalization?
Sorry for such a long text.
Thanks for every comment!
PS
I created a related question regarding joining 100 tables;
Joining 100 tables
You'll find other benefits to normalizing the data besides the speed of queries running against it... such as size and maintainability, which alone should justify normalizing it...
However, it will also likely improve the speed of queries; currently having a single row containing 300 text columns is massive, and is almost certainly past the 8,060 byte limit for storing the row data page... and is instead being stored in the ROW_OVERFLOW_DATA or LOB_DATA Allocation Units.
By reducing the size of each row through normalization, such as replacing redundant text data with a TINYINT foreign key, and by also removing columns that aren't dependent on this large table's primary key into another table, the data should no longer overflow, and you'll also be able to store more rows per page.
As far as the overhead added by performing JOIN to get the normalized data... if you properly index your tables, this shouldn't add a substantial amount of overhead. However, if it does add an unacceptable overhead, you can then selectively de-normalize the data as necessary.
Whether this is worth the effort depends on how long the values are. If the values are, say, state abbreviations (2 characters) or country codes (3 characters), the resulting table would be even larger than the existing one. Remember, you need to include the primary key of the reference table. That would typically be an integer and occupy four bytes.
There are other good reasons to do this. Having reference tables with valid lists of values maintains database consistency. The reference tables can be used both to validate inputs and for reporting purposes. Additional information can be included, such as a "long name" or something like that.
Also, SQL Server will spill varchar columns over onto additional pages. It does not spill other types. You only have 300 columns but eventually your record data might get close to the 8k limit for data on a single page.
And, if you decide to go ahead, I would suggest that you look for "themes" in the columns. There may be groups of columns that can be grouped together . . . detailed stop code and stop category, short business name and full business name. You are going down the path of modelling the data (a good thing). But be cautious about doing things at a very low level (managing 100 reference tables) versus identifying a reasonable set of entities and relationships.
1) The system is currently having to do a full table scan on very significant amounts of data, leading to the performance issues. There are many aspects of optimisation which could improve this performance. The conversion of columns to the correct data types would not only significantly improve performance by reducing the size of each record, but would allow data to be made correct. If querying on a column, you're currently looking at the text being compared to the text in the field. With just indexing, this could be improved, but changing to a lookup would allow the ID value to be looked up from a table small enough to keep in memory and then use this to scan just integer values, which is a much quicker process.
2) If data is normalised to 3rd normal form or the like, then you can see instances where performance suffers in the name of data integrity. This is most a problem if the engine cannot work out how to restrict the rows without projecting the data first. If this does occur, though, the execution plan can identify this and the query can be amended to reduce the likelihood of this.
Another point to note is that it sounds like if the database was properly structured it may be able to be cached in memory because the amount of data would be greatly reduced. If this is the case, then the performance would be greatly improved.
The quick way to improve performance would probably be to add indexes. However, this would further increase the overall database size, and doesn't address the issue of storing duplicate data and possible data integrity issues.
There are some other changes which can be made - if a lot of the data is not always needed, then this can be separated off into a related table and only looked up as needed. Fields that are not used for lookups to other tables are particular candidates for this, as the joins can then be on a much smaller table, while preserving a fairly simple structure that just looks up the additional data when you've identified the data you actually need. This is obviously not a properly normalised structure, but may be a quick and dirty way to improve performance (after adding indexing).
Construct in your head and onto paper a normalized database structure
Construct the database (with indexes)
De-construct that monolith. Things will not look so bad. I would guess that A LOT (I MEAN A LOT) of data is repeated
Create SQL insert statements to insert the data into the database
Go to the persons that constructed that nightmare in the first place with a shotgun. Have fun.

select * vs select column

If I just need 2/3 columns and I query SELECT * instead of providing those columns in select query, is there any performance degradation regarding more/less I/O or memory?
The network overhead might be present if I do select * without a need.
But in a select operation, does the database engine always pull atomic tuple from the disk, or does it pull only those columns requested in the select operation?
If it always pulls a tuple then I/O overhead is the same.
At the same time, there might be a memory consumption for stripping out the requested columns from the tuple, if it pulls a tuple.
So if that's the case, select someColumn will have more memory overhead than that of select *
There are several reasons you should never (never ever) use SELECT * in production code:
since you're not giving your database any hints as to what you want, it will first need to check the table's definition in order to determine the columns on that table. That lookup will cost some time - not much in a single query - but it adds up over time
if you need only 2/3 of the columns, you're selecting 1/3 too much data which needs to be retrieving from disk and sent across the network
if you start to rely on certain aspects of the data, e.g. the order of the columns returned, you could get a nasty surprise once the table is reorganized and new columns are added (or existing ones removed)
in SQL Server (not sure about other databases), if you need a subset of columns, there's always a chance a non-clustered index might be covering that request (contain all columns needed). With a SELECT *, you're giving up on that possibility right from the get-go. In this particular case, the data would be retrieved from the index pages (if those contain all the necessary columns) and thus disk I/O and memory overhead would be much less compared to doing a SELECT *.... query.
Yes, it takes a bit more typing initially (tools like SQL Prompt for SQL Server will even help you there) - but this is really one case where there's a rule without any exception: do not ever use SELECT * in your production code. EVER.
It always pulls a tuple (except in cases where the table has been vertically segmented - broken up into columns pieces), so, to answer the question you asked, it doesn't matter from a performance perspective. However, for many other reasons, (below) you should always select specifically those columns you want, by name.
It always pulls a tuple, because (in every vendors RDBMS I am familiar with), the underlying on-disk storage structure for everything (including table data) is based on defined I/O Pages (in SQL Server for e.g., each Page is 8 kilobytes). And every I/O read or write is by Page.. I.e., every write or read is a complete Page of data.
Because of this underlying structural constraint, a consequence is that Each row of data in a database must always be on one and only one page. It cannot span multiple Pages of data (except for special things like blobs, where the actual blob data is stored in separate Page-chunks, and the actual table row column then only gets a pointer...). But these exceptions are just that, exceptions, and generally do not apply except in special cases ( for special types of data, or certain optimizations for special circumstances)
Even in these special cases, generally, the actual table row of data itself (which contains the pointer to the actual data for the Blob, or whatever), it must be stored on a single IO Page...
EXCEPTION. The only place where Select * is OK, is in the sub-query after an Exists or Not Exists predicate clause, as in:
Select colA, colB
From table1 t1
Where Exists (Select * From Table2
Where column = t1.colA)
EDIT: To address #Mike Sherer comment, Yes it is true, both technically, with a bit of definition for your special case, and aesthetically. First, even when the set of columns requested are a subset of those stored in some index, the query processor must fetch every column stored in that index, not just the ones requested, for the same reasons - ALL I/O must be done in pages, and index data is stored in IO Pages just like table data. So if you define "tuple" for an index page as the set of columns stored in the index, the statement is still true.
and the statement is true aesthetically because the point is that it fetches data based on what is stored in the I/O page, not on what you ask for, and this true whether you are accessing the base table I/O Page or an index I/O Page.
For other reasons not to use Select *, see Why is SELECT * considered harmful? :
You should always only select the columns that you actually need. It is never less efficient to select less instead of more, and you also run into fewer unexpected side effects - like accessing your result columns on client side by index, then having those indexes become incorrect by adding a new column to the table.
[edit]: Meant accessing. Stupid brain still waking up.
Unless you're storing large blobs, performance isn't a concern. The big reason not to use SELECT * is that if you're using returned rows as tuples, the columns come back in whatever order the schema happens to specify, and if that changes you will have to fix all your code.
On the other hand, if you use dictionary-style access then it doesn't matter what order the columns come back in because you are always accessing them by name.
This immediately makes me think of a table I was using which contained a column of type blob; it usually contained a JPEG image, a few Mbs in size.
Needless to say I didn't SELECT that column unless I really needed it. Having that data floating around - especially when I selected mulitple rows - was just a hassle.
However, I will admit that I otherwise usually query for all the columns in a table.
During a SQL select, the DB is always going to refer to the metadata for the table, regardless of whether it's SELECT * for SELECT a, b, c... Why? Becuase that's where the information on the structure and layout of the table on the system is.
It has to read this information for two reasons. One, to simply compile the statement. It needs to make sure you specify an existing table at the very least. Also, the database structure may have changed since the last time a statement was executed.
Now, obviously, DB metadata is cached in the system, but it's still processing that needs to be done.
Next, the metadata is used to generate the query plan. This happens each time a statement is compiled as well. Again, this runs against cached metadata, but it's always done.
The only time this processing is not done is when the DB is using a pre-compiled query, or has cached a previous query. This is the argument for using binding parameters rather than literal SQL. "SELECT * FROM TABLE WHERE key = 1" is a different query than "SELECT * FROM TABLE WHERE key = ?" and the "1" is bound on the call.
DBs rely heavily on page caching for there work. Many modern DBs are small enough to fit completely in memory (or, perhaps I should say, modern memory is large enough to fit many DBs). Then your primary I/O cost on the back end is logging and page flushes.
However, if you're still hitting the disk for your DB, a primary optimization done by many systems is to rely on the data in indexes, rather than the tables themselves.
If you have:
CREATE TABLE customer (
id INTEGER NOT NULL PRIMARY KEY,
name VARCHAR(150) NOT NULL,
city VARCHAR(30),
state VARCHAR(30),
zip VARCHAR(10));
CREATE INDEX k1_customer ON customer(id, name);
Then if you do "SELECT id, name FROM customer WHERE id = 1", it is very likely that you DB will pull this data from the index, rather than from the tables.
Why? It will likely use the index anyway to satisfy the query (vs a table scan), and even though 'name' isn't used in the where clause, that index will still be the best option for the query.
Now the database has all of the data it needs to satisfy the query, so there's no reason to hit the table pages themselves. Using the index results in less disk traffic since you have a higher density of rows in the index vs the table in general.
This is a hand wavy explanation of a specific optimization technique used by some databases. Many have several optimization and tuning techniques.
In the end, SELECT * is useful for dynamic queries you have to type by hand, I'd never use it for "real code". Identification of individual columns gives the DB more information that it can use to optimize the query, and gives you better control in your code against schema changes, etc.
I think there is no exact answer for your question, because you have pondering performance and facility of maintain your apps. Select column is more performatic of select *, but if you is developing an oriented object system, then you will like use object.properties and you can need a properties in any part of apps, then you will need write more methods to get properties in special situations if you don't use select * and populate all properties. Your apps need have a good performance using select * and in some case you will need use select column to improve performance. Then you will have the better of two worlds, facility to write and maintain apps and performance when you need performance.
The accepted answer here is wrong. I came across this when another question was closed as a duplicate of this (while I was still writing my answer - grr - hence the SQL below references the other question).
You should always use SELECT attribute, attribute.... NOT SELECT *
It's primarily for performance issues.
SELECT name FROM users WHERE name='John';
Is not a very useful example. Consider instead:
SELECT telephone FROM users WHERE name='John';
If there's an index on (name, telephone) then the query can be resolved without having to look up the relevant values from the table - there is a covering index.
Further, suppose the table has a BLOB containing a picture of the user, and an uploaded CV, and a spreadsheet...
using SELECT * will willpull all this information back into the DBMS buffers (forcing out other useful information from the cache). Then it will all be sent to client using up time on the network and memory on the client for data which is redundant.
It can also cause functional issues if the client retrieves the data as an enumerated array (such as PHP's mysql_fetch_array($x, MYSQL_NUM)). Maybe when the code was written 'telephone' was the third column to be returned by SELECT *, but then someone comes along and decides to add an email address to the table, positioned before 'telephone'. The desired field is now shifted to the 4th column.
There are reasons for doing things either way. I use SELECT * a lot on PostgreSQL because there are a lot of things you can do with SELECT * in PostgreSQL that you can't do with an explicit column list, particularly when in stored procedures. Similarly in Informix, SELECT * over an inherited table tree can give you jagged rows while an explicit column list cannot because additional columns in child tables are returned as well.
The main reason why I do this in PostgreSQL is that it ensures that I get a well-formed type specific to a table. This allows me to take the results and use them as the table type in PostgreSQL. This also allows for many more options in the query than a rigid column list would.
On the other hand, a rigid column list gives you an application-level check that db schemas haven't changed in certain ways and this can be helpful. (I do such checks on another level.)
As for performance, I tend to use VIEWs and stored procedures returning types (and then a column list inside the stored procedure). This gives me control over what types are returned.
But keep in mind I am using SELECT * usually against an abstraction layer rather than base tables.
Reference taken from this article:
Without SELECT *:
When you are using ” SELECT * ” at that time you are selecting more columns from the database and some of this column might not be used by your application.
This will create extra cost and load on database system and more data travel across the network.
With SELECT *:
If you have special requirements and created dynamic environment when add or delete column automatically handle by application code. In this special case you don’t require to change application and database code and this will automatically affect on production environment. In this case you can use “SELECT *”.
Just to add a nuance to the discussion which I don't see here: In terms of I/O, if you're using a database with column-oriented storage you can do A LOT less I/O if you only query for certain columns. As we move to SSDs the benefits may be a bit smaller vs. row-oriented storage but there's a) only reading the blocks that contain columns you care about b) compression, which generally greatly reduces the size of the data on disk and therefore the volume of data read from disk.
If you're not familiar with column-oriented storage, one implementation for Postgres comes from Citus Data, another is Greenplum, another Paraccel, another (loosely speaking) is Amazon Redshift. For MySQL there's Infobright, the now-nigh-defunct InfiniDB. Other commercial offerings include Vertica from HP, Sybase IQ, Teradata...
select * from table1 INTERSECT select * from table2
equal
select distinct t1 from table1 where Exists (select t2 from table2 where table1.t1 = t2 )

`active' flag or not?

OK, so practically every database based application has to deal with "non-active" records. Either, soft-deletions or marking something as "to be ignored". I'm curious as to whether there are any radical alternatives thoughts on an `active' column (or a status column).
For example, if I had a list of people
CREATE TABLE people (
id INTEGER PRIMARY KEY,
name VARCHAR(100),
active BOOLEAN,
...
);
That means to get a list of active people, you need to use
SELECT * FROM people WHERE active=True;
Does anyone suggest that non active records would be moved off to a separate table and where appropiate a UNION is done to join the two?
Curiosity striking...
EDIT: I should make clear, I'm coming at this from a purist perspective. I can see how data archiving might be necessary for large amounts of data, but that is not where I'm coming from. If you do a SELECT * FROM people it would make sense to me that those entries are in a sense "active"
Thanks
You partition the table on the active flag, so that active records are in one partition, and inactive records are in the other partition. Then you create an active view for each table which automatically has the active filter on it. The database query engine automatically restricts the query to the partition that has the active records in it, which is much faster than even using an index on that flag.
Here is an example of how to create a partitioned table in Oracle. Oracle doesn't have boolean column types, so I've modified your table structure for Oracle purposes.
CREATE TABLE people
(
id NUMBER(10),
name VARCHAR2(100),
active NUMBER(1)
)
PARTITION BY LIST(active)
(
PARTITION active_records VALUES (0)
PARTITION inactive_records VALUES (1)
);
If you wanted to you could put each partition in different tablespaces. You can also partition your indexes as well.
Incidentally, this seems a repeat of this question, as a newbie I need to ask, what's the procedure on dealing with unintended duplicates?
Edit: As requested in comments, provided an example for creating a partitioned table in Oracle
Well, to ensure that you only draw active records in most situations, you could create views that only contain the active records. That way it's much easier to not leave out the active part.
We use an enum('ACTIVE','INACTIVE','DELETED') in most tables so we actually have a 3-way flag. I find it works well for us in different situations. Your mileage may vary.
Moving inactive stuff is usually a stupid idea. It's a lot of overhead with lots of potential for bugs, everything becomes more complicated, like unarchiving the stuff etc. What do you do with related data? If you move all that, too, you have to modify every single query. If you don't move it, what advantage were you hoping to get?
That leads to the next point: WHY would you move it? A properly indexed table requires one additional lookup when the size doubles. Any performance improvement is bound to be negligible. And why would you even think about it until the distant future time when you actually have performance problems?
I think looking at it strictly as a piece of data then the way that is shown in the original post is proper. The active flag piece of data is directly dependent upon the primary key and should be in the table.
That table holds data on people, irrespective of the current status of their data.
The active flag is sort of ugly, but it is simple and works well.
You could move them to another table as you suggested. I'd suggest looking at the percentage of active / inactive records. If you have over 20 or 30 % inactive records, then you might consider moving them elsewhere. Otherwise, it's not a big deal.
Yes, we would. We currently have the "active='T/F'" column in many of our tables, mainly to show the 'latest' row. When a new row is inserted, the previous T row is marked F to keep it for audit purposes.
Now, we're moving to a 2-table approach, when a new row is inserted, the previous row is moved to an history table. This give us better performance for the majority of cases - looking at the current data.
The cost is slightly more than the old method, previously you had to update and insert, now you have to insert and update (ie instead of inserting a new T row, you modify the existing row with all the new data), so the cost is just that of passing in a whole row of data instead of passing in just the changes. That's hardly going to make any effect.
The performance benefit is that your main table's index is significantly smaller, and you can optimise your tablespaces better (they won't grow quite so much!)
Binary flags like this in your schema are a BAD idea. Consider the query
SELECT count(*) FROM users WHERE active=1
Looks simple enough. But what happens when you have a large number of users, so many that adding an index to this table would be required. Again, it looks straight forward
ALTER TABLE users ADD INDEX index_users_on_active (active)
EXCEPT!! This index is useless because the cardinality on this column is exactly two! Any database query optimiser will ignore this index because of it's low cardinality and do a table scan.
Before filling up your schema with helpful flags consider how you are going to access that data.
https://stackoverflow.com/questions/108503/mysql-advisable-number-of-rows
We use active flags quite often. If your database is going to be very large, I could see the value in migrating inactive values to a separate table, though.
You would then only require a union of the tables when someone wants to see all records, active or inactive.
In most cases a binary field indicating deletion is sufficient. Often there is a clean up mechanism that will remove those deleted records after a certain amount of time, so you may wish to start the schema with a deleted timestamp.
Moving off to a separate table and bringing them back up takes time. Depending on how many records go offline and how often you need to bring them back, it might or might not be a good idea.
If the mostly dont come back once they are buried, and are only used for summaries/reports/whatever, then it will make your main table smaller, queries simpler and probably faster.
We use both methods for dealing with inactive records. The method we use is dependent upon the situation. For records that are essentially lookup values, we use the Active bit field. This allows us to deactivate entries so they wont be used, but also allows us to maintain data integrity with relations.
We use the "move to separation table" method where the data is no longer needed and the data is not part of a relation.
The situation really dictates the solution, methinks:
If the table contains users, then several "flag" fields could be used. One for Deleted, Disabled etc. Or if space is an issue, then a flag for disabled would suffice, and then actually deleting the row if they have been deleted.
It also depends on policies for storing data. If there are policies for keeping data archived, then a separate table would most likely be necessary after any great length of time.
No - this is a pretty common thing - couple of variations depending on specific requirements (but you already covered them):
1) If you expect to have a whole BUNCH of data - like multiple terabytes or more - not a bad idea to archive deleted records immediately - though you might use a combination approach of marking as deleted then copying to archive tables.
2) Of course the option to hard delete a record still exists - though us developers tend to be data pack-rats - I suggest that you should look at the business process and decide if there is now any need to even keep the data - if there is - do so... if there isn't - you should probably feel free just to throw the stuff away.....again, according to the specific business scenario.
From a 'purist perspective' the realtional model doesn't differentiate between a view and a table - both are relations. So that use of a view that uses the discriminator is perfectly meaningful and valid provided the entities are correctly named e.g. Person/ActivePerson.
Also, from a 'purist perspective' the table should be named person, not people as the name of the relation reflects a tuple, not the entire set.
Regarding indexing the boolean, why not:
ALTER TABLE users ADD INDEX index_users_on_active (id, active) ;
Would that not improve the search?
However I don't know how much of that answer depends on the platform.
This is an old question but for those search for low cardinality/selectivity indexes, I'd like to propose the following approach that avoids partitioning, secondary tables, etc.:
The trick is to use "dateInactivated" column that stores the timestamp of when the record is inactivated/deleted. As the name implies, the value is NULL while the record is active, but once inactivated, write in the system datetime. Thus, an index on that column ends up having high selectivity as the number of "deleted" records grows since each record will have a unique (not strictly speaking) value.
Then your query becomes:
SELECT * FROM people WHERE dateInactivated is NULL;
The index will pull in just the right set of rows that you care about.
Filtering data on a bit flag for big tables is not really good in terms of performance. In case when 'active' determinate virtual deletion you can create 'TableName_delted' table with the same structure and move deleted data there using delete trigger.
That solution will help with performance and simplifies data queries.

What is the best way to implement soft deletion?

Working on a project at the moment and we have to implement soft deletion for the majority of users (user roles). We decided to add an is_deleted='0' field on each table in the database and set it to '1' if particular user roles hit a delete button on a specific record.
For future maintenance now, each SELECT query will need to ensure they do not include records where is_deleted='1'.
Is there a better solution for implementing soft deletion?
Update: I should also note that we have an Audit database that tracks changes (field, old value, new value, time, user, ip) to all tables/fields within the Application database.
I would lean towards a deleted_at column that contains the datetime of when the deletion took place. Then you get a little bit of free metadata about the deletion. For your SELECT just get rows WHERE deleted_at IS NULL
You could perform all of your queries against a view that contains the WHERE IS_DELETED='0' clause.
Having is_deleted column is a reasonably good approach.
If it is in Oracle, to further increase performance I'd recommend partitioning the table by creating a list partition on is_deleted column.
Then deleted and non-deleted rows will physically be in different partitions, though for you it'll be transparent.
As a result, if you type a query like
SELECT * FROM table_name WHERE is_deleted = 1
then Oracle will perform the 'partition pruning' and only look into the appropriate partition. Internally a partition is a different table, but it is transparent for you as a user: you'll be able to select across the entire table no matter if it is partitioned or not. But Oracle will be able to query ONLY the partition it needs. For example, let's assume you have 1000 rows with is_deleted = 0 and 100000 rows with is_deleted = 1, and you partition the table on is_deleted. Now if you include condition
WHERE ... AND IS_DELETED=0
then Oracle will ONLY scan the partition with 1000 rows. If the table weren't partitioned, it would have to scan 101000 rows (both partitions).
The best response, sadly, depends on what you're trying to accomplish with your soft deletions and the database you are implementing this within.
In SQL Server, the best solution would be to use a deleted_on/deleted_at column with a type of SMALLDATETIME or DATETIME (depending on the necessary granularity) and to make that column nullable. In SQL Server, the row header data contains a NULL bitmask for each of the columns in the table so it's marginally faster to perform an IS NULL or IS NOT NULL than it is to check the value stored in a column.
If you have a large volume of data, you will want to look into partitioning your data, either through the database itself or through two separate tables (e.g. Products and ProductHistory) or through an indexed view.
I typically avoid flag fields like is_deleted, is_archive, etc because they only carry one piece of meaning. A nullable deleted_at, archived_at field provides an additional level of meaning to yourself and to whoever inherits your application. And I avoid bitmask fields like the plague since they require an understanding of how the bitmask was built in order to grasp any meaning.
if the table is large and performance is an issue, you can always move 'deleted' records to another table, which has additional info like time of deletion, who deleted the record, etc
that way you don't have to add another column to your primary table
That depends on what information you need and what workflows you want to support.
Do you want to be able to:
know what information was there (before it was deleted)?
know when it was deleted?
know who deleted it?
know in what capacity they were acting when they deleted it?
be able to un-delete the record?
be able to tell when it was un-deleted?
etc.
If the record was deleted and un-deleted four times, is it sufficient for you to know that it is currently in an un-deleted state, or do you want to be able to tell what happened in the interim (including any edits between successive deletions!)?
Careful of soft-deleted records causing uniqueness constraint violations.
If your DB has columns with unique constraints then be careful that the prior soft-deleted records don’t prevent you from recreating the record.
Think of the cycle:
create user (login=JOE)
soft-delete (set deleted column to non-null.)
(re) create user (login=JOE). ERROR. LOGIN=JOE is already taken
Second create results in a constraint violation because login=JOE is already in the soft-deleted row.
Some techniques:
1. Move the deleted record to a new table.
2. Make your uniqueness constraint across the login and deleted_at timestamp column
My own opinion is +1 for moving to new table. Its take lots of
discipline to maintain the *AND delete_at = NULL* across all your
queries (for all of your developers)
You will definitely have better performance if you move your deleted data to another table like Jim said, as well as having record of when it was deleted, why, and by whom.
Adding where deleted=0 to all your queries will slow them down significantly, and hinder the usage of any of indexes you may have on the table. Avoid having "flags" in your tables whenever possible.
you don't mention what product, but SQL Server 2008 and postgresql (and others i'm sure) allow you to create filtered indexes, so you could create a covering index where is_deleted=0, mitigating some of the negatives of this particular approach.
Something that I use on projects is a statusInd tinyint not null default 0 column
using statusInd as a bitmask allows me to perform data management (delete, archive, replicate, restore, etc.). Using this in views I can then do the data distribution, publishing, etc for the consuming applications. If performance is a concern regarding views, use small fact tables to support this information, dropping the fact, drops the relation and allows for scalled deletes.
Scales well and is data centric keeping the data footprint pretty small - key for 350gb+ dbs with realtime concerns. Using alternatives, tables, triggers has some overhead that depending on the need may or may not work for you.
SOX related Audits may require more than a field to help in your case, but this may help.
Enjoy
Use a view, function, or procedure that checks is_deleted = 0; i.e. don't select directly on the table in case the table needs to change later for other reasons.
And index the is_deleted column for larger tables.
Since you already have an audit trail, tracking the deletion date is redundant.
I prefer to keep a status column, so I can use it for several different configs, i.e. published, private, deleted, needsAproval...
Create an other schema and grant it all on your data schema.
Implment VPD on your new schema so that each and every query will have the predicate allowing selection of the non-deleted row only appended to it.
http://download.oracle.com/docs/cd/E11882_01/server.112/e16508/cmntopc.htm#CNCPT62345
#AdditionalCriteria("this.status <> 'deleted'")
put this on top of your #entity
http://wiki.eclipse.org/EclipseLink/Examples/JPA/SoftDelete