Is it better to have a boolean (important flag) as an attribute or in a separate table? - sql

I have three cases and I don't know what the best solution is for each one, but all of them involve boolean attributes.
I have a table of links, and each link has attributes indicating whether it is visited, broken, or filtered; the value of each one is updated only once (except for rare cases of resetting them all).
The same links also have a visiting attribute that is updated constantly, but in a table with more than 1 million rows, at most 10,000 or 20,000 will be true.
I have a table of pages with one attribute indicating whether each one has been processed. In the end (after processing), all rows must be true.
I want to know the best solution for each of these cases.
I think it is: an attribute in the first case, a separate table in the second, and I don't know about the third.
Other solutions (an index, maybe) are welcome.
IMPORTANT: both tables (pages and links) can have more than a million rows.

I would say columns for the first case, tables for the second, and columns for the third.
Your main concern, depending on the scale of your database, might be to separate the often-updated data from the bulk of the rest. That's why I'd suggest a table for the second case. You could, however, make judicious use of the "HOT" feature of PostgreSQL, which means that updates do not cause table bloat if the columns being updated are not indexed. But it's probably still a good idea to keep the traffic away from large tables, because of potentially large seek times, keeping autovacuum happy, etc. If you're concerned, I would test this out.
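As an illustrative sketch of that separation (all names here are made up, not taken from the question): the write-once flags stay on the wide links table, while the constantly-changing visiting state lives in its own narrow table, where a row's presence means "currently being visited":

CREATE TABLE links (
link_id  BIGINT PRIMARY KEY,
url      TEXT NOT NULL,
visited  BOOLEAN NOT NULL DEFAULT FALSE,  -- set once
broken   BOOLEAN NOT NULL DEFAULT FALSE,  -- set once
filtered BOOLEAN NOT NULL DEFAULT FALSE   -- set once
);

-- Presence of a row here means "visiting = true". With at most
-- 10,000-20,000 links in flight at once, this table stays tiny and
-- the constant churn never touches the million-row table.
CREATE TABLE links_visiting (
link_id BIGINT PRIMARY KEY REFERENCES links (link_id)
);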

There is no "best" way. The only way to know if your approach is adequately performant is to do it and see. One approach where there are constant updates will not perform the same where there are large numbers of reads and few updates.
I'd suggest just putting everything in the table unless you have a reason not to, and giving that a whirl.
But most importantly: what DBMS?

Related

Arrangement of database columns in any DBMS

We had a discussion in class about whether putting an identity key as the first column of the table affects the performance when writing your queries.
I think whether the identity key is the first or last column does not really matter, since it will be used for linking one table with another. But then again, I could be mistaken. I can't really find good articles that address this. What are your thoughts? And if you have good references, please share them.
Thanks!
In short...
Depending on your DBMS, field order may actually make a difference, but that difference is likely to be too small to matter.
Slightly longer answer...
Fields are stored together in rows, and rows are grouped in database pages [1].
So when the DBMS loads a specific field, it also loads the row containing that field and the entire page containing that row [2]. I/O, not CPU, tends to be most important for DB performance, so once the row is in memory, field order typically doesn't matter (much).
Depending on the physical row layout...
The DBMS may already know the offset of each field, in which case each field is equally fast.
Or it may need to "scan" through preceding fields, which is also quite fast but can make a small difference.
In fact, the DBMS may not even honor the order of fields in your CREATE TABLE statement - for example, it may move all fixed-width fields to the front of the row and relegate the variable-width fields to the back.
But as always: if in doubt, measure on realistic amounts of data.
[1] Pages are some multiple of disk blocks. One database page may (and typically does) contain many rows.
[2] Unless the row cannot fit into a page, in which case some of its fields can be "chained" into another page or stored "out of line" in some way. But let's not complicate things too much here.
In a truly relational DBMS there is no logical ordering of attributes in a table. In a SQL DBMS there is a logical ordering of columns but that doesn't necessarily match the ordering of data in physical storage. It is the structures used in physical storage that might influence performance and not the logical placement of columns in a table. So as a rule you are right, column order doesn't / shouldn't make any difference. Whether physical ordering makes a difference in any particular DBMS product very much depends on the physical storage details of that product.

Implementing soft delete with minimal impact on performance and code

There are some similar questions on the topic, but they are not really helping me.
I want to implement a soft delete feature like on StackOverflow, where items are not really deleted, just hidden. I am using a SQL database. Here are three options:
1. Add an is_deleted boolean field.
Advantages: Simple.
Disadvantages: No date record. Forces me to add an is_deleted = 0 condition to every query.
2. Add a deleted_date date field, set to NULL if the item is not deleted.
Advantages: Has the date.
Disadvantages: Still clutters my queries.
For both of the above:
They also impact performance, because all these useless rows still have to be maintained in the indexes. And an index on the deleted column won't help when fetching the non-deleted (majority of the) rows; a full table scan is needed.
3. Create a separate table to hold deleted items.
Advantages: Improved performance when querying non-deleted rows. No need to add conditions to queries on non-deleted rows. Easier index maintenance.
Disadvantages: Complexity: requires data migration for both deletion and undeletion, and new tables. Referential integrity is harder to handle.
Is there a better option?
I personally would base my answer on how often you anticipate your users wanting to access or "restore" that deleted data.
If it's often, then I would go with a Date_Deleted field and put a calculated IsDeleted property in my POCO in the code.
If it's never (or almost never), then a history table or deleted table gives you the benefits you described.
I personally almost never use deleted tables (and opt for IsDeleted or Date_Deleted) because of the potential risk to referential integrity. You have A -> B, and you remove the record from table B... you now have to manage referential integrity yourself because of your design choice.
If the key is numeric, I handle a soft delete by negating the key. (Of course, this won't work for identity keys.) You don't need to change your code at all, and you can easily restore the record by multiplying the key by -1.
Just another approach to give some thought to... If the key is alphanumeric, you can do something similar by prepending a unique "marker" character. Since deleted records will all begin with this marker, they will end up off by themselves in the index.
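A minimal sketch of the negated-key idea, assuming a hypothetical items table with a signed numeric, non-identity key item_id:

-- Soft-delete row 42 by flipping the sign of its key:
UPDATE items SET item_id = -item_id WHERE item_id = 42;
-- Restore it the same way (multiply by -1 again):
UPDATE items SET item_id = -item_id WHERE item_id = -42;
-- Live rows are simply those with positive keys:
SELECT * FROM items WHERE item_id > 0;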
In my opinion, the best way forward, when thinking about scaling and eventual table/database sizes is your third option - a separate table for deleted items. Such a table can eventually be moved to a different database to support scaling.
I believe you have listed the three most common options. As you have seen, each has advantages and disadvantages. Personally, I like taking the longer view on things.
I think your analysis of the options is good, but you missed a few relevant points, which I list below. Almost all implementations I have seen use some sort of deleted or versioning field on the row, as you suggest in your first two options.
Using one table with deleted flag:
If your indexes all contain the deleted-flag field first, and your queries mostly contain a WHERE isdeleted = false type of condition, then it DOES solve your performance problems: the indexes very efficiently exclude the deleted rows. Similar logic applies to the deleted-date option.
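For example, something along these lines (orders and customer_id are hypothetical names; the point is that isdeleted leads the index):

CREATE INDEX ix_orders_isdeleted_customer ON orders (isdeleted, customer_id);
-- The leading isdeleted column lets this query skip deleted rows
-- inside the index itself:
SELECT * FROM orders WHERE isdeleted = false AND customer_id = 42;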
Using two Tables
In general you need to make massive changes to reports, because some reports may refer to deleted data (old sales figures might refer to a deleted sales category). One can overcome this by creating a view that is a UNION of the two tables for reading, while writing only to the active-records table.
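A sketch of that view, assuming hypothetical sales_categories and sales_categories_deleted tables with identical structure:

CREATE VIEW all_sales_categories AS
SELECT sc.*, 0 AS is_deleted FROM sales_categories sc
UNION ALL
SELECT scd.*, 1 AS is_deleted FROM sales_categories_deleted scd;
-- Reports read from the view; the application writes only to
-- sales_categories.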
Let's suppose we create a field called dead to mark deleted rows. We can then create an index restricted to rows where dead is false.
That way, queries for non-deleted rows search only that index, steering the optimizer with an index hint if necessary.
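In PostgreSQL the same idea can be written declaratively as a partial index; here is a sketch with a hypothetical items table:

-- The index contains only live rows, so queries for non-deleted rows
-- never touch the dead ones:
CREATE INDEX ix_items_live ON items (id) WHERE dead = false;
SELECT id FROM items WHERE dead = false;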

SQL Server column design

I have always tried to make my SQL databases as simple and understandable as possible.
Until now I have always used a limited number of columns; I think I never had more than 20. Now there is one thing that would make my life easier if I had many more columns, let's say 200 (columns, not rows). What do you think about it?
I just want to know if it is a bad idea, not why I'm doing this or whether there are other possibilities; just whether somebody has already experienced something like this and whether such a table is a bad idea.
Fewer, narrower columns are better than many columns and/or wide columns.
Why? Because the narrower the row, the more rows fit on an 8K page. That means you do less I/O and use less memory to buffer pages, which is always a good thing.
In those (hopefully rare) cases where the domain requires many attributes on an object (assuming a 1-1 object-table mapping), you should consider splitting it into two tables in a 1-1 relationship, with one containing the frequently used columns.
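A sketch of such a split, with hypothetical names; product keeps the hot columns and product_details absorbs the rarely-read rest:

CREATE TABLE product (
product_id INT PRIMARY KEY,
name VARCHAR(100),
price DECIMAL(10, 2)  -- frequently used columns stay here
);

CREATE TABLE product_details (
product_id INT PRIMARY KEY REFERENCES product (product_id),
spec_01 VARCHAR(255),  -- ...the many rarely-read attributes
spec_02 VARCHAR(255)   -- go into this 1-1 side table
);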
I don't think it is black and white. Having a large row size (implied by the large number of columns) will hurt performance (i.e., more I/O) -- but there are cases where taking a small hit in performance in one place will be offset by increased performance in others.
I'd say it depends on how many rows you expect this table to have, how often will it will be queried, how many of those additional columns will really be accessed, and how it would compare to your alternative design in terms of efficiency and complexity.
Luke--
It really depends on the type of system you are working with. For example, in transactional systems most tables have at most 50 columns or so, with almost no redundant data attributes (if you have a process date, you do not need the process month or process year as a separate column). This is because records are updated/inserted frequently, and you would need to update all the redundant attributes every time you update one row.
In data warehouse/reporting environments, it is typical for dimension tables (which hold the attributes of an entity) to have 100+ columns, as there can be various ways you want to categorize a given entity. Updates are not much of a problem here, as data is typically loaded once during off-peak hours and then used mostly in SELECTs.
Take a look at these links to know more..
http://en.wikipedia.org/wiki/Database_normalization
http://en.wikipedia.org/wiki/Star_schema
So the answer is: it depends... If you want a perfectly relational system, then 200+ columns is something of a red flag indicating you should look at normalizing your data (or maybe not). Updates and indexes are the two things you should be concerned with in such a system.
You are using SQL Server, which I think defaults to row-oriented storage (all fields of a row are stored together in a page); that can be a problem with a large number of columns. With column-oriented storage, however, the number of columns per table does not matter, because the values of each column are stored together. I don't know whether this is possible with SQL Server.

MySQL query performance

I have the following table structure:
EVENT_ID(INT) EVENT_NAME(VARCHAR) EVENT_DATE(DATETIME) EVENT_OWNER(INT)
I need to add a field, EVENT_COMMENTS, which should be a text field or a very big VARCHAR.
There are two places where I query this table: one is a page that lists all the events (on that page I do not need to display the event_comments field).
The other is a page that loads all the details for a specific event, on which I will need to display the event_comments field.
Should I create an extra table with the event_id and the event_comments for that event? Or should I just add that field on the current table?
In other words, what I'm asking is, if I have a text field in my table, but I don't SELECT it, will it affect the performance of the queries to my table?
Adding a field to your table makes it larger.
This means that:
Table scans will take more time
Fewer records will fit into a page, and hence into the cache, increasing the risk of cache misses.
Selecting this field through a join, however, would take more time.
So adding this field to this table will make the queries which don't select it run slower, and those which do select it run faster.
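For comparison, a sketch of the separate-table variant the asker describes, assuming events has primary key event_id:

CREATE TABLE event_comments (
event_id INT PRIMARY KEY,
event_comments TEXT,
FOREIGN KEY (event_id) REFERENCES events (event_id)
);

-- List page: touches only the narrow events table.
SELECT event_id, event_name, event_date FROM events;

-- Detail page: joins the comment text in for a single event.
SELECT e.*, c.event_comments
FROM events e
LEFT JOIN event_comments c ON c.event_id = e.event_id
WHERE e.event_id = 123;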
Yes, it affects performance. At least according to this article, published yesterday.
According to it, if you don't want to suffer performance issues, it's better to put such columns in a separate table and JOIN them in when needed.
This is the relevant section:
"Try to limit the number of columns in a table. Too many columns in a table can make the scan time for queries much longer than if there are just a few columns. In addition, if you have a table with many columns that aren't typically used, you are also wasting disk space with NULL value fields. This is also true with variable size fields, such as text or blob, where the table size can grow much larger than needed. In this case, you should consider splitting off the additional columns into a different table, joining them together on the primary key of the records."
You should put it in the same table.
Yes, it probably will affect other queries on the same table, and you should probably do it anyway, as you probably don't care.
Depending on the engine, blobs are either stored inline (MyISAM), partially off-page (InnoDB) or entirely off-page (InnoDB Plugin, in some cases).
These have the potential to decrease the number of rows per page, and therefore increase the number of IO operations to satisfy some query.
However, it is extremely unlikely that you care, so you should just do it anyway. How many rows does this table have? 10^9 ? How many of them have non-null values for the blob?
It shouldn't be too much of a hit, but if you're worried about performance, you should always run a few benchmarks and run EXPLAINs on your queries to see the true effect.
How many events are you expecting to have?
Chances are that unless you have a truckload of hundreds of thousands of events, your performance will be fine in any case.

'active' flag or not?

OK, so practically every database-based application has to deal with "non-active" records: either soft deletions or marking something as "to be ignored". I'm curious whether there are any radical alternative thoughts on an 'active' column (or a status column).
For example, if I had a list of people
CREATE TABLE people (
id INTEGER PRIMARY KEY,
name VARCHAR(100),
active BOOLEAN,
...
);
That means to get a list of active people, you need to use
SELECT * FROM people WHERE active=True;
Does anyone suggest that non-active records should be moved off to a separate table, with a UNION done to join the two where appropriate?
Curiosity striking...
EDIT: I should make clear that I'm coming at this from a purist perspective. I can see how data archiving might be necessary for large amounts of data, but that is not where I'm coming from. If you do a SELECT * FROM people, it would make sense to me that those entries are, in a sense, "active".
Thanks
You can partition the table on the active flag, so that active records are in one partition and inactive records are in the other. Then you create an active view for each table which automatically has the active filter on it. The database query engine automatically restricts the query to the partition containing the active records, which is much faster than even using an index on that flag.
Here is an example of how to create a partitioned table in Oracle. Oracle doesn't have boolean column types, so I've modified your table structure for Oracle purposes.
CREATE TABLE people
(
id NUMBER(10),
name VARCHAR2(100),
active NUMBER(1)
)
PARTITION BY LIST(active)
(
PARTITION active_records VALUES (1),
PARTITION inactive_records VALUES (0)
);
If you wanted to you could put each partition in different tablespaces. You can also partition your indexes as well.
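For example, a LOCAL index is partitioned the same way as the table (Oracle syntax, continuing the example above; the choice of name as the indexed column is just for illustration):

-- Each index partition covers exactly one table partition:
CREATE INDEX people_name_idx ON people (name) LOCAL;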
Incidentally, this seems to be a repeat of this question; as a newbie, I have to ask: what's the procedure for dealing with unintended duplicates?
Edit: As requested in the comments, I've provided an example of creating a partitioned table in Oracle.
Well, to ensure that you only pull active records in most situations, you could create views that contain only the active records. That way it's much easier not to leave out the active filter.
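A minimal sketch, reusing the people table from the question:

CREATE VIEW active_people AS
SELECT id, name FROM people WHERE active = TRUE;

-- Callers can no longer forget the filter:
SELECT * FROM active_people;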
We use an enum('ACTIVE','INACTIVE','DELETED') in most tables so we actually have a 3-way flag. I find it works well for us in different situations. Your mileage may vary.
Moving inactive stuff out is usually a stupid idea. It's a lot of overhead with lots of potential for bugs; everything becomes more complicated, like unarchiving the data, etc. What do you do with related data? If you move all of that too, you have to modify every single query. If you don't move it, what advantage were you hoping to get?
That leads to the next point: WHY would you move it? A properly indexed table requires only about one additional lookup each time its size doubles. Any performance improvement is bound to be negligible. And why would you even think about it until the distant future when you actually have performance problems?
I think looking at it strictly as a piece of data then the way that is shown in the original post is proper. The active flag piece of data is directly dependent upon the primary key and should be in the table.
That table holds data on people, irrespective of the current status of their data.
The active flag is sort of ugly, but it is simple and works well.
You could move them to another table, as you suggested. I'd suggest looking at the percentage of active vs. inactive records: if you have over 20 or 30% inactive records, you might consider moving them elsewhere. Otherwise, it's not a big deal.
Yes, we would. We currently have an active = 'T'/'F' column in many of our tables, mainly to mark the 'latest' row. When a new row is inserted, the previous T row is marked F to keep it for audit purposes.
Now we're moving to a two-table approach: when a new row is inserted, the previous row is moved to a history table. This gives us better performance for the majority of cases, which look at the current data.
The cost is slightly higher than the old method: previously you had to update and insert, now you have to insert and update (i.e., instead of inserting a new T row, you modify the existing row with all the new data), so the cost is just that of passing in a whole row of data instead of just the changes. That's hardly going to have any effect.
The performance benefit is that your main table's indexes are significantly smaller, and you can optimize your tablespaces better (they won't grow quite so much!).
Binary flags like this in your schema are a BAD idea. Consider the query
SELECT count(*) FROM users WHERE active=1
Looks simple enough. But what happens when you have a large number of users - so many that adding an index to this table becomes necessary? Again, it looks straightforward:
ALTER TABLE users ADD INDEX index_users_on_active (active)
EXCEPT!! This index is useless, because the cardinality of this column is exactly two. Any database query optimizer will ignore this index because of its low cardinality and do a table scan instead.
Before filling up your schema with helpful flags, consider how you are going to access that data.
https://stackoverflow.com/questions/108503/mysql-advisable-number-of-rows
We use active flags quite often. If your database is going to be very large, I could see the value in migrating inactive values to a separate table, though.
You would then only require a union of the tables when someone wants to see all records, active or inactive.
In most cases a binary field indicating deletion is sufficient. Often there is a clean up mechanism that will remove those deleted records after a certain amount of time, so you may wish to start the schema with a deleted timestamp.
Moving records off to a separate table and bringing them back takes time. Depending on how many records go offline and how often you need to bring them back, it might or might not be a good idea.
If they mostly don't come back once they are buried, and are used only for summaries/reports/whatever, then it will make your main table smaller and your queries simpler, and probably faster.
We use both methods for dealing with inactive records; which method depends on the situation. For records that are essentially lookup values, we use the active bit field. This lets us deactivate entries so they won't be used, while still maintaining data integrity through relations.
We use the "move to a separate table" method where the data is no longer needed and is not part of a relation.
The situation really dictates the solution, methinks:
If the table contains users, then several "flag" fields could be used: one for deleted, one for disabled, etc. Or, if space is an issue, a flag for disabled would suffice, actually deleting the row if the user has been deleted.
It also depends on policies for storing data. If there are policies for keeping data archived, then a separate table would most likely be necessary after any great length of time.
No - this is a pretty common thing - a couple of variations depending on specific requirements (but you already covered them):
1) If you expect to have a whole BUNCH of data - multiple terabytes or more - it's not a bad idea to archive deleted records immediately, though you might use a combined approach of marking them as deleted and then copying them to archive tables.
2) Of course the option to hard delete a record still exists - though we developers tend to be data pack-rats - so I suggest you look at the business process and decide whether there is any need to keep the data at all. If there is, do so; if there isn't, feel free to just throw the stuff away... again, according to the specific business scenario.
From a 'purist perspective', the relational model doesn't differentiate between a view and a table - both are relations. So the use of a view with the discriminator is perfectly meaningful and valid, provided the entities are correctly named, e.g. Person/ActivePerson.
Also, from a 'purist perspective', the table should be named person, not people, as the name of the relation reflects a single tuple, not the entire set.
Regarding indexing the boolean, why not:
ALTER TABLE users ADD INDEX index_users_on_active (id, active) ;
Would that not improve the search?
However, I don't know how much the answer depends on the platform.
This is an old question, but for those searching for ways to index low-cardinality/low-selectivity columns, I'd like to propose the following approach, which avoids partitioning, secondary tables, etc.:
The trick is to use a dateInactivated column that stores the timestamp of when the record was inactivated/deleted. As the name implies, the value is NULL while the record is active; once the record is inactivated, write in the system datetime. An index on that column thus ends up having high selectivity as the number of "deleted" records grows, since each such record carries a (nearly) unique value.
Then your query becomes:
SELECT * FROM people WHERE dateInactivated is NULL;
The index will pull in just the right set of rows that you care about.
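The supporting index is then just an ordinary single-column index; a sketch reusing the people table from the question:

CREATE INDEX ix_people_date_inactivated ON people (dateInactivated);
-- Active rows all share the NULL key. MySQL and PostgreSQL can use
-- this index for the IS NULL lookup; Oracle B-tree indexes omit
-- all-NULL keys, so there you would filter on the inactivated side.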
Filtering a big table on a bit flag is not really good in terms of performance. In cases where 'active' indicates virtual deletion, you can create a 'TableName_deleted' table with the same structure and move deleted data there using a delete trigger.
That solution helps with performance and simplifies data queries.
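A sketch of such a trigger in MySQL syntax, reusing the people table from the question; people_deleted is a hypothetical table mirroring the structure of people:

CREATE TRIGGER people_archive_before_delete
BEFORE DELETE ON people
FOR EACH ROW
-- Copy the departing row into the archive before it disappears:
INSERT INTO people_deleted (id, name, active)
VALUES (OLD.id, OLD.name, OLD.active);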