Should I use a separate table for repetitive values (varchar)? - sql

I have a table in which 3 rows of data are added per second and in which I intend to keep around 30M rows. (Older data will be removed).
I need to add a column: varchar(1000). I can't tell in advance what it's content will be but I do know it will be very repetitive : thousands to millions of rows will have the same value. It is usually around 200 characters long.
Since data is being added using a Stored Procedure I see two option
Add a column varchar(1000)
Create a table (int id,varchar(1000) value)
Within the StoredProcedure, look if the value exist in that other table or create it
I would expect this other table to have a maximum of 100 value at all time.
I know some of the tradeoff between these two options but I have difficulty making up my mind on the question.
Option 1 is heavier but I get faster inserts. Requires less joins hence query are simpler.
Option 2 is lighter insert take longers but query have the potential to be faster. I think I'm closer to normal form but then I also have a table with a single meaningful column.
From the information I gave you, which option seems better? (You can also come up with another option).

You should also investigate page compression, perhaps you can do the simple thing and still get a a small(ish) table. Although, if you say is SQL Express, you won't be able to use it as is an Enterprise Edition requirement.
I have used repeatedly in my projects your second approach. Every insert would have to go through a stored procedure that gets the lookup value id, or inserts a new one if not found and returns the id. Specially for such large columns like your seems to be, with plenty of rows yet so few distinct values, the saving in space should trump the additional overhead of the foreign key and lookup cost in query joins. See also Disk is Cheap... That's not the point!.

Related

DB Architecture: One table using WHERE vs multiple

I wonder what is the difference between having one table with 6 millions row (aka with a huge DB) and 100k active users:
CREATE TABLE shoes (
id serial primary key,
color text,
is_left_one boolean,
stock int
);
With also 6 index like:
CREATE INDEX blue_left_shoes ON shoes(color,is_left_one) WHERE color=blue AND is_left_one=true;
Versus: 6 tables with 1 million rows:
CREATE TABLE blue_left_shoes(
id serial primary key,
stock int
);
The latter one seems more efficient because users don't have to ask for the condition since the table IS the condition, but perhaps creating the indexes mitigate this?
This table is used to query either left, right, "blue", "green" or "red" shoes and to check the number of remaining items, but it is a simplified example but you can think of Amazon (or any digital selling platform) tooltip "only 3 items left in stock" for the workload and the usecase. It is the users (100k active daily) who will make the query.
NB: The question is mostly for PostgreSQL but differences with other DB is still relevant and interesting.
In the latter case, where you use a table called blue_left_shoes
Your code needs to first work out which table to look at (as opposed to parameterising a value in the where clause)
As permutations and options increase, you need to increase the number of tables, and increase the logic in your app that works out which table to use
Anything that needs to use this database (i.e. a reporting tool or an API) now needs to re implement all of these rules
You are imposing logic at a high layer to improve performance.
If you were to partition and/or index your table appropriately, you get the same effect - SQL queries only look through the records that matter. The difference is that you don't need to implement this logic in higher layers
As long as you can get the indexing right, keeping this is one table is almost always the right thing to do.
Partitioning
Database partitioning is where you select one or more columns to decide how to "split up" your table. In your case you could choose (color, is_left_one).
Now your table is logically split and ordered in this way and when you search for blue,true it automatically knows which partition to look in. It doesn't look in any other partitions (this is called partition pruning)
Note that this occurs automatically from the search criteria. You don't need to manually work out a particular table to look at.
Partitioning doesn't require any extra storage (beyond various metadata that has to be saved)
You can't apply multiple partitions to a table. Only one
Indexing
Creating an index also provides performance improvements. However indexes take up space and can impact insert and update performance (as they need to be maintained). Practically speaking, the select trade off almost always far outweighs any insert/update negatives
You should always look at indexes before partitioning
Non selective indexes
In your particular case, there's an extra thing to consider: a boolean field is not "selective". I won't go into details but suffice to say you shouldn't create an index on this field alone, as it won't be used because it only halves the number of records you have to look through. You'd need to include some other fields in any index (i.e. colour) to make it useful
In general, you want to keep all "like" data in a single table, not split among multiples. There are good reasons for this:
Adding new combinations is easier.
Maintaining the tables is easier.
You an easily do queries "across" entities.
Overall, the database is more efficient, because it is more likely that pages will be filled.
And there are other reasons as well. In your case, you might have an argument for breaking the data into 6 separate tables. The gain here comes from not having the color and is_left_one in the data. That means that this data is not repeated 6 million times. And that could save many tens of megabytes of data storage.
I say the last a bit tongue-in-cheek (meaning I'm not that serious). Computers nowadays have so much member that 100 Mbytes is just not significant in general. However, if you have a severely memory limited environment (I'm thinking "watch" here, not even "smart phone") then it might be useful.
Otherwise, partitioning is a fine solution that pretty much meets your needs.
For this:
WHERE color=blue AND is_left_one=true
The optimal index is
INDEX(color, is_left_one) -- in either order
Having id first makes it useless for that WHERE.
It is generally bad to have multiple identical tables instead of one.

Select * from table vs Select col1,col2,col3 from table [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
select * vs select column
I was just having a discussion with one of my colleague on the SQL Server performance on specifying the query command in the stored procedure.
So I want to know which one is preferred over another and whats the concrete reason behind that.
Suppose, We do have one table called
Employees(EmpName,EmpAddress)
And we want to select all the records from the table. So we can write the query in two ways,
Select * from Employees
Select EmpName, EmpAddress from Employees
So I would like to know is there any specific difference or performance issue in the above queries or are they just equal to the SQL Server Engine.
UPDATE:
Lets say the table schema won't change anymore. So no point for future maintenance.
Performance wise, lets say, the usage is very very high i.e. millions of hits per seconds on the database server. I want a clear and precise performance rating on both approaches.
No Indexing is done on the entire table.
The specific difference would show its ugly head if you add a column to the table.
Suddenly, the query you expected to return two columns now returns three. If you coded specifically for the two columns, the rest of your code is now broken.
Performance-wise, there shouldn't be a difference.
I always take the approach that being as specific as possible is the best when dealing with databases. If the table has two columns and you only need those two columns, be specific. Specify those two columns. It'll save you headaches in the future.
I am an avid avokat of the "be as specific as possible" rule, too. Not following it will hurt you in the long run. However, your question seems to be coming from a different background, so let me attempt to answer it.
When you submit a query to SQL Server it goes through several stages:
transmitting of query string over the network.
parsing of query string, producing a parse-tree
linking the referenced objects in the parse tree to existing objects
optimizing based on statistics and row count/size estimates
executing
transmitting of result data over the network
Let's look at each one:
The * query is a few bytes shorter, so step this will be faster
The * query contains fewer "tokens" so this should(!) be faster
During linking the list of columns need to be puled and compared to the query string. Here the "*" gets resolved to the actual column reference. Without access to the code it is impossible to say which version takes less cycles, however the amount of data accessed is about the same so this should be similar.
-6. In these stages there is no difference between the two example queries, as they will both get compiled to the same execution plan.
Taking all this into account, you will probably save a few nanoseconds when using the * notation. However, you example is very simplistic. In a more complex example it is possible that specifying as subset of columns of a table in a multi table join will lead to a different plan than using a *. If that happens we can be pretty certain that the explicit query will be faster.
The above comparison also assumes that the SQL Server process is running alone on a single processor and no other queries are submitted at the same time. If the process has to yield during the compilation those extra cycles will be far more than the ones we are trying to save.
So, the amont of saving we are talking about is very minute compared to the actual execution time and should not be used as an excuse for a "bad" coding practice.
I hope this answers your question.
You should always reference columns explicitly. This way, if the table structure changes (and such changes are made in an intelligent, backward-compatible way), your queries will continue to work and can be modified over time.
Also, unless you actually need all of the columns from the table (not typical), using SELECT * is bringing more data to your application than is necessary, and potentially forcing a clustered index scan instead of what might have been satisfied by a narrower covering index.
Bad habits to kick : using SELECT * / omitting the column list
Performance wise there are no difference between those 2 i think.But those 2 are used in different cases what may be the difference.
Consider a slightly larger table.If your table(Employees) contains 10 columns,then the 1st query will retain all of the information of the table.But for 2nd query,you may specify which columns information you need.So when you need all of the information of employees no.1 is the best one rather than specifying all of the column names.
Ofcourse,when you need to ALTER a table then those 2 would not be equal.

One large table or split into two smaller tables?

Is there any performance benefit to splitting a large table with roughly 100 columns into 2 separate tables? This would be in terms of inserting, deleting and selecting tasks? I'm using SQL Server 2008.
If one of the fields is a CLOB or BLOB and you anticipate it holding a huge amount of data and you won't need that field very often and the result set will transmitted over a long pipe (like server to a web-based client), then I think putting that field in a separate table would be appropriate.
But just returning 100 regular fields probably won't tax your system so much as to justify a separate table and a join.
The only benefit you might see is if a number of columns are only occasionally populated. In which case putting those into their own table and only adding a row when there is data might make sense in terms of overall row overhead and, depending on the number of rows, overall page count for the table(s). That said, this is one of the reasons they introduced sparse columns in SQL Server 2008.
For the maintenance and other overhead of managing two tables instead of one (especially given that people can act on individual tables if they choose), it's unlikely it would be worth it.
Can you describe what type of entity needs to have over 100 columns? Perhaps the data model is just wrong in the first place.
I would say no as it would take more execution time to join the 2 tables whenever you wanted to do something.
I depends if you use these fields in the same time in your application.
These kind of performance improvements are really bad : you make your source code impossible to understand. If you have performance trouble with this table, add something (like a table containing the 15 fields you'll use in a request that'll updated via trigger), don't modify your clean solution.
If you don't have performance problem, don't do anything, you'll see later !

What is wrong with using SELECT * FROM sometable [duplicate]

I've heard that SELECT * is generally bad practice to use when writing SQL commands because it is more efficient to SELECT columns you specifically need.
If I need to SELECT every column in a table, should I use
SELECT * FROM TABLE
or
SELECT column1, colum2, column3, etc. FROM TABLE
Does the efficiency really matter in this case? I'd think SELECT * would be more optimal internally if you really need all of the data, but I'm saying this with no real understanding of database.
I'm curious to know what the best practice is in this case.
UPDATE: I probably should specify that the only situation where I would really want to do a SELECT * is when I'm selecting data from one table where I know all columns will always need to be retrieved, even when new columns are added.
Given the responses I've seen however, this still seems like a bad idea and SELECT * should never be used for a lot more technical reasons that I ever though about.
One reason that selecting specific columns is better is that it raises the probability that SQL Server can access the data from indexes rather than querying the table data.
Here's a post I wrote about it: The real reason select queries are bad index coverage
It's also less fragile to change, since any code that consumes the data will be getting the same data structure regardless of changes you make to the table schema in the future.
Given your specification that you are selecting all columns, there is little difference at this time. Realize, however, that database schemas do change. If you use SELECT * you are going to get any new columns added to the table, even though in all likelihood, your code is not prepared to use or present that new data. This means that you are exposing your system to unexpected performance and functionality changes.
You may be willing to dismiss this as a minor cost, but realize that columns that you don't need still must be:
Read from database
Sent across the network
Marshalled into your process
(for ADO-type technologies) Saved in a data-table in-memory
Ignored and discarded / garbage-collected
Item #1 has many hidden costs including eliminating some potential covering index, causing data-page loads (and server cache thrashing), incurring row / page / table locks that might be otherwise avoided.
Balance this against the potential savings of specifying the columns versus an * and the only potential savings are:
Programmer doesn't need to revisit the SQL to add columns
The network-transport of the SQL is smaller / faster
SQL Server query parse / validation time
SQL Server query plan cache
For item 1, the reality is that you're going to add / change code to use any new column you might add anyway, so it is a wash.
For item 2, the difference is rarely enough to push you into a different packet-size or number of network packets. If you get to the point where SQL statement transmission time is the predominant issue, you probably need to reduce the rate of statements first.
For item 3, there is NO savings as the expansion of the * has to happen anyway, which means consulting the table(s) schema anyway. Realistically, listing the columns will incur the same cost because they have to be validated against the schema. In other words this is a complete wash.
For item 4, when you specify specific columns, your query plan cache could get larger but only if you are dealing with different sets of columns (which is not what you've specified). In this case, you do want different cache entries because you want different plans as needed.
So, this all comes down, because of the way you specified the question, to the issue resiliency in the face of eventual schema modifications. If you're burning this schema into ROM (it happens), then an * is perfectly acceptable.
However, my general guideline is that you should only select the columns you need, which means that sometimes it will look like you are asking for all of them, but DBAs and schema evolution mean that some new columns might appear that could greatly affect the query.
My advice is that you should ALWAYS SELECT specific columns. Remember that you get good at what you do over and over, so just get in the habit of doing it right.
If you are wondering why a schema might change without code changing, think in terms of audit logging, effective/expiration dates and other similar things that get added by DBAs for systemically for compliance issues. Another source of underhanded changes is denormalizations for performance elsewhere in the system or user-defined fields.
You should only select the columns that you need. Even if you need all columns it's still better to list column names so that the sql server does not have to query system table for columns.
Also, your application might break if someone adds columns to the table. Your program will get columns it didn't expect too and it might not know how to process them.
Apart from this if the table has a binary column then the query will be much more slower and use more network resources.
There are four big reasons that select * is a bad thing:
The most significant practical reason is that it forces the user to magically know the order in which columns will be returned. It's better to be explicit, which also protects you against the table changing, which segues nicely into...
If a column name you're using changes, it's better to catch it early (at the point of the SQL call) rather than when you're trying to use the column that no longer exists (or has had its name changed, etc.)
Listing the column names makes your code far more self-documented, and so probably more readable.
If you're transferring over a network (or even if you aren't), columns you don't need are just waste.
Specifying the column list is usually the best option because your application won't be affected if someone adds/inserts a column to the table.
Specifying column names is definitely faster - for the server. But if
performance is not a big issue (for example, this is a website content database with hundreds, maybe thousands - but not millions - of rows in each table); AND
your job is to create many small, similar applications (e.g. public-facing content-managed websites) using a common framework, rather than creating a complex one-off application; AND
flexibility is important (lots of customization of the db schema for each site);
then you're better off sticking with SELECT *. In our framework, heavy use of SELECT * allows us to introduce a new website managed content field to a table, giving it all of the benefits of the CMS (versioning, workflow/approvals, etc.), while only touching the code at a couple of points, instead of a couple dozen points.
I know the DB gurus are going to hate me for this - go ahead, vote me down - but in my world, developer time is scarce and CPU cycles are abundant, so I adjust accordingly what I conserve and what I waste.
SELECT * is a bad practice even if the query is not sent over a network.
Selecting more data than you need makes the query less efficient - the server has to read and transfer extra data, so it takes time and creates unnecessary load on the system (not only the network, as others mentioned, but also disk, CPU etc.). Additionally, the server is unable to optimize the query as well as it might (for example, use covering index for the query).
After some time your table structure might change, so SELECT * will return a different set of columns. So, your application might get a dataset of unexpected structure and break somewhere downstream. Explicitly stating the columns guarantees that you either get a dataset of known structure, or get a clear error on the database level (like 'column not found').
Of course, all this doesn't matter much for a small and simple system.
Lots of good reasons answered here so far, here's another one that hasn't been mentioned.
Explicitly naming the columns will help you with maintenance down the road. At some point you're going to be making changes or troubleshooting, and find yourself asking "where the heck is that column used".
If you've got the names listed explicitly, then finding every reference to that column -- through all your stored procedures, views, etc -- is simple. Just dump a CREATE script for your DB schema, and text search through it.
Performance wise, SELECT with specific columns can be faster (no need to read in all the data). If your query really does use ALL the columns, SELECT with explicit parameters is still preferred. Any speed difference will be basically unnoticeable and near constant-time. One day your schema will change, and this is good insurance to prevent problems due to this.
definitely defining the columns, because SQL Server will not have to do a lookup on the columns to pull them. If you define the columns, then SQL can skip that step.
It's always better to specify the columns you need, if you think about it one time, SQL doesn't have to think "wtf is *" every time you query. On top of that, someone later may add columns to the table that you actually do not need in your query and you'll be better off in that case by specifying all of your columns.
The problem with "select *" is the possibility of bringing data you don't really need. During the actual database query, the selected columns don't really add to the computation. What's really "heavy" is the data transport back to your client, and any column that you don't really need is just wasting network bandwidth and adding to the time you're waiting for you query to return.
Even if you do use all the columns brought from a "select *...", that's just for now. If in the future you change the table/view layout and add more columns, you'll start bring those in your selects even if you don't need them.
Another point in which a "select *" statement is bad is on view creation. If you create a view using "select *" and later add columns to your table, the view definition and the data returned won't match, and you'll need to recompile your views in order for them to work again.
I know that writing a "select *" is tempting, 'cause I really don't like to manually specify all the fields on my queries, but when your system start to evolve, you'll see that it's worth to spend this extra time/effort in specifying the fields rather than spending much more time and effort removing bugs on your views or optimizing your app.
While explicitly listing columns is good for performance, don't get crazy.
So if you use all the data, try SELECT * for simplicity (imagine having many columns and doing a JOIN... query may get awful). Then - measure. Compare with query with column names listed explicitly.
Don't speculate about performance, measure it!
Explicit listing helps most when you have some column containing big data (like body of a post or article), and don't need it in given query. Then by not returning it in your answer DB server can save time, bandwidth, and disk throughput. Your query result will also be smaller, which is good for any query cache.
You should really be selecting only the fields you need, and only the required number, i.e.
SELECT Field1, Field2 FROM SomeTable WHERE --(constraints)
Outside of the database, dynamic queries run the risk of injection attacks and malformed data. Typically you get round this using stored procedures or parameterised queries. Also (although not really that much of a problem) the server has to generate an execution plan each time a dynamic query is executed.
It is NOT faster to use explicit field names versus *, if and only if, you need to get the data for all fields.
Your client software shouldn't depend on the order of the fields returned, so that's a nonsense too.
And it's possible (though unlikely) that you need to get all fields using * because you don't yet know what fields exist (think very dynamic database structure).
Another disadvantage of using explicit field names is that if there are many of them and they're long then it makes reading the code and/or the query log more difficult.
So the rule should be: if you need all the fields, use *, if you need only a subset, name them explicitly.
The result is too huge. It is slow to generate and send the result from the SQL engine to the client.
The client side, being a generic programming environment, is not and should not be designed to filter and process the results (e.g. the WHERE clause, ORDER clause), as the number of rows can be huge (e.g. tens of millions of rows).
Naming each column you expect to get in your application also ensures your application won't break if someone alters the table, as long as your columns are still present (in any order).
Performance wise I have seen comments that both are equal. but usability aspect there are some +'s and -'s
When you use a (select *) in a query and if some one alter the table and add new fields which do not need for the previous query it is an unnecessary overhead. And what if the newly added field is a blob or an image field??? your query response time is going to be really slow then.
In other hand if you use a (select col1,col2,..) and if the table get altered and added new fields and if those fields are needed in the result set, you always need to edit your select query after table alteration.
But I suggest always to use select col1,col2,... in your queries and alter the query if the table get altered later...
This is an old post, but still valid. For reference, I have a very complicated query consisting of:
12 tables
6 Left joins
9 inner joins
108 total columns on all 12 tables
I only need 54 columns
A 4 column Order By clause
When I execute the query using Select *, it takes an average of 2869ms.
When I execute the query using Select , it takes an average of 1513ms.
Total rows returned is 13,949.
There is no doubt selecting column names means faster performance over Select *
Select is equally efficient (in terms of velocity) if you use * or columns.
The difference is about memory, not velocity. When you select several columns SQL Server must allocate memory space to serve you the query, including all data for all the columns that you've requested, even if you're only using one of them.
What does matter in terms of performance is the excecution plan which in turn depends heavily on your WHERE clause and the number of JOIN, OUTER JOIN, etc ...
For your question just use SELECT *. If you need all the columns there's no performance difference.
It depends on the version of your DB server, but modern versions of SQL can cache the plan either way. I'd say go with whatever is most maintainable with your data access code.
One reason it's better practice to spell out exactly which columns you want is because of possible future changes in the table structure.
If you are reading in data manually using an index based approach to populate a data structure with the results of your query, then in the future when you add/remove a column you will have headaches trying to figure out what went wrong.
As to what is faster, I'll defer to others for their expertise.
As with most problems, it depends on what you want to achieve. If you want to create a db grid that will allow all columns in any table, then "Select *" is the answer. However, if you will only need certain columns and adding or deleting columns from the query is done infrequently, then specify them individually.
It also depends on the amount of data you want to transfer from the server. If one of the columns is a defined as memo, graphic, blob, etc. and you don't need that column, you'd better not use "Select *" or you'll get a whole bunch of data you don't want and your performance could suffer.
To add on to what everyone else has said, if all of your columns that you are selecting are included in an index, your result set will be pulled from the index instead of looking up additional data from SQL.
SELECT * is necessary if one wants to obtain metadata such as the number of columns.
Gonna get slammed for this, but I do a select * because almost all my data is retrived from SQL Server Views that precombine needed values from multiple tables into a single easy to access View.
I do then want all the columns from the view which won't change when new fields are added to underlying tables. This has the added benefit of allowing me to change where data comes from. FieldA in the View may at one time be calculated and then I may change it to be static. Either way the View supplies FieldA to me.
The beauty of this is that it allows my data layer to get datasets. It then passes them to my BL which can then create objects from them. My main app only knows and interacts with the objects. I even allow my objects to self-create when passed a datarow.
Of course, I'm the only developer, so that helps too :)
What everyone above said, plus:
If you're striving for readable maintainable code, doing something like:
SELECT foo, bar FROM widgets;
is instantly readable and shows intent. If you make that call you know what you're getting back. If widgets only has foo and bar columns, then selecting * means you still have to think about what you're getting back, confirm the order is mapped correctly, etc. However, if widgets has more columns but you're only interested in foo and bar, then your code gets messy when you query for a wildcard and then only use some of what's returned.
And remember if you have an inner join by definition you do not need all the columns as the data in the join columns is repeated.
It's not like listing columns in SQl server is hard or even time-consuming. You just drag them over from the object browser (you can get all in one go by dragging from the word columns). To put a permanent performance hit on your system (becasue this can reduce the use of indexes and becasue sending unneeded data over the network is costly) and make it more likely that you will have unexpected problems as the database changes (sometimes columns get added that you do not want the user to see for instance) just to save less than a minute of development time is short-sighted and unprofessional.
Absolutely define the columns you want to SELECT every time. There is no reason not to and the performance improvement is well worth it.
They should never have given the option to "SELECT *"
If you need every column then just use SELECT * but remember that the order could potentially change so when you are consuming the results access them by name and not by index.
I would ignore comments about how * needs to go get the list - chances are parsing and validating named columns is equal to the processing time if not more. Don't prematurely optimize ;-)

`active' flag or not?

OK, so practically every database based application has to deal with "non-active" records. Either, soft-deletions or marking something as "to be ignored". I'm curious as to whether there are any radical alternatives thoughts on an `active' column (or a status column).
For example, if I had a list of people
CREATE TABLE people (
id INTEGER PRIMARY KEY,
name VARCHAR(100),
active BOOLEAN,
...
);
That means to get a list of active people, you need to use
SELECT * FROM people WHERE active=True;
Does anyone suggest that non active records would be moved off to a separate table and where appropiate a UNION is done to join the two?
Curiosity striking...
EDIT: I should make clear, I'm coming at this from a purist perspective. I can see how data archiving might be necessary for large amounts of data, but that is not where I'm coming from. If you do a SELECT * FROM people it would make sense to me that those entries are in a sense "active"
Thanks
You partition the table on the active flag, so that active records are in one partition, and inactive records are in the other partition. Then you create an active view for each table which automatically has the active filter on it. The database query engine automatically restricts the query to the partition that has the active records in it, which is much faster than even using an index on that flag.
Here is an example of how to create a partitioned table in Oracle. Oracle doesn't have boolean column types, so I've modified your table structure for Oracle purposes.
CREATE TABLE people
(
id NUMBER(10),
name VARCHAR2(100),
active NUMBER(1)
)
PARTITION BY LIST(active)
(
PARTITION active_records VALUES (0)
PARTITION inactive_records VALUES (1)
);
If you wanted to you could put each partition in different tablespaces. You can also partition your indexes as well.
Incidentally, this seems a repeat of this question, as a newbie I need to ask, what's the procedure on dealing with unintended duplicates?
Edit: As requested in comments, provided an example for creating a partitioned table in Oracle
Well, to ensure that you only draw active records in most situations, you could create views that only contain the active records. That way it's much easier to not leave out the active part.
We use an enum('ACTIVE','INACTIVE','DELETED') in most tables so we actually have a 3-way flag. I find it works well for us in different situations. Your mileage may vary.
Moving inactive stuff is usually a stupid idea. It's a lot of overhead with lots of potential for bugs, everything becomes more complicated, like unarchiving the stuff etc. What do you do with related data? If you move all that, too, you have to modify every single query. If you don't move it, what advantage were you hoping to get?
That leads to the next point: WHY would you move it? A properly indexed table requires one additional lookup when the size doubles. Any performance improvement is bound to be negligible. And why would you even think about it until the distant future time when you actually have performance problems?
I think looking at it strictly as a piece of data then the way that is shown in the original post is proper. The active flag piece of data is directly dependent upon the primary key and should be in the table.
That table holds data on people, irrespective of the current status of their data.
The active flag is sort of ugly, but it is simple and works well.
You could move them to another table as you suggested. I'd suggest looking at the percentage of active / inactive records. If you have over 20 or 30 % inactive records, then you might consider moving them elsewhere. Otherwise, it's not a big deal.
Yes, we would. We currently have the "active='T/F'" column in many of our tables, mainly to show the 'latest' row. When a new row is inserted, the previous T row is marked F to keep it for audit purposes.
Now, we're moving to a 2-table approach, when a new row is inserted, the previous row is moved to an history table. This give us better performance for the majority of cases - looking at the current data.
The cost is slightly more than the old method, previously you had to update and insert, now you have to insert and update (ie instead of inserting a new T row, you modify the existing row with all the new data), so the cost is just that of passing in a whole row of data instead of passing in just the changes. That's hardly going to make any effect.
The performance benefit is that your main table's index is significantly smaller, and you can optimise your tablespaces better (they won't grow quite so much!)
Binary flags like this in your schema are a BAD idea. Consider the query
SELECT count(*) FROM users WHERE active=1
Looks simple enough. But what happens when you have a large number of users, so many that adding an index to this table would be required. Again, it looks straight forward
ALTER TABLE users ADD INDEX index_users_on_active (active)
EXCEPT!! This index is useless because the cardinality on this column is exactly two! Any database query optimiser will ignore this index because of it's low cardinality and do a table scan.
Before filling up your schema with helpful flags consider how you are going to access that data.
https://stackoverflow.com/questions/108503/mysql-advisable-number-of-rows
We use active flags quite often. If your database is going to be very large, I could see the value in migrating inactive values to a separate table, though.
You would then only require a union of the tables when someone wants to see all records, active or inactive.
In most cases a binary field indicating deletion is sufficient. Often there is a clean up mechanism that will remove those deleted records after a certain amount of time, so you may wish to start the schema with a deleted timestamp.
Moving off to a separate table and bringing them back up takes time. Depending on how many records go offline and how often you need to bring them back, it might or might not be a good idea.
If the mostly dont come back once they are buried, and are only used for summaries/reports/whatever, then it will make your main table smaller, queries simpler and probably faster.
We use both methods for dealing with inactive records. The method we use is dependent upon the situation. For records that are essentially lookup values, we use the Active bit field. This allows us to deactivate entries so they wont be used, but also allows us to maintain data integrity with relations.
We use the "move to separation table" method where the data is no longer needed and the data is not part of a relation.
The situation really dictates the solution, methinks:
If the table contains users, then several "flag" fields could be used. One for Deleted, Disabled etc. Or if space is an issue, then a flag for disabled would suffice, and then actually deleting the row if they have been deleted.
It also depends on policies for storing data. If there are policies for keeping data archived, then a separate table would most likely be necessary after any great length of time.
No - this is a pretty common thing - couple of variations depending on specific requirements (but you already covered them):
1) If you expect to have a whole BUNCH of data - like multiple terabytes or more - not a bad idea to archive deleted records immediately - though you might use a combination approach of marking as deleted then copying to archive tables.
2) Of course the option to hard delete a record still exists - though us developers tend to be data pack-rats - I suggest that you should look at the business process and decide if there is now any need to even keep the data - if there is - do so... if there isn't - you should probably feel free just to throw the stuff away.....again, according to the specific business scenario.
From a 'purist perspective' the realtional model doesn't differentiate between a view and a table - both are relations. So that use of a view that uses the discriminator is perfectly meaningful and valid provided the entities are correctly named e.g. Person/ActivePerson.
Also, from a 'purist perspective' the table should be named person, not people as the name of the relation reflects a tuple, not the entire set.
Regarding indexing the boolean, why not:
ALTER TABLE users ADD INDEX index_users_on_active (id, active) ;
Would that not improve the search?
However I don't know how much of that answer depends on the platform.
This is an old question but for those search for low cardinality/selectivity indexes, I'd like to propose the following approach that avoids partitioning, secondary tables, etc.:
The trick is to use "dateInactivated" column that stores the timestamp of when the record is inactivated/deleted. As the name implies, the value is NULL while the record is active, but once inactivated, write in the system datetime. Thus, an index on that column ends up having high selectivity as the number of "deleted" records grows since each record will have a unique (not strictly speaking) value.
Then your query becomes:
SELECT * FROM people WHERE dateInactivated is NULL;
The index will pull in just the right set of rows that you care about.
Filtering data on a bit flag for big tables is not really good in terms of performance. In case when 'active' determinate virtual deletion you can create 'TableName_delted' table with the same structure and move deleted data there using delete trigger.
That solution will help with performance and simplifies data queries.