Indexing items marked as deleted - sql

Due to client requirements I need to implement the following scenario:
Whenever a user wants to delete a record representing a document, that record needs to be marked as deleted using a simple BOOLEAN is_deleted flag.
Document is a general name for one of the tables that store invoices, orders or offers.
Everything is pretty dead simple, but I wonder whether there is a way to index the records so that searches are quick and deleted items are skipped/omitted (or whether there is no need to worry about performance at all and a simple WHERE is_deleted = FALSE clause will do).
Other solutions/advice would be appreciated as well.

PostgreSQL supports partial indexes. You can do something like:
create index document_live_idx ON document(id) where not is_deleted;
You can even create partial unique indexes if you only need uniqueness over a portion of your data.
Of course getting the right columns into your index takes some thought, but it is quite manageable.
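For instance, a partial index on a search column, a partial unique index, and a query that can use them might look like this (the customer_id and document_number columns are just assumptions for the sketch):
create index document_customer_live_idx ON document(customer_id) where not is_deleted;
create unique index document_number_live_uq ON document(document_number) where not is_deleted;
select * from document where customer_id = 42 and not is_deleted;
Note that a query has to repeat the index's WHERE condition (or something the planner can prove implies it) before the partial index will be considered.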

Another option you might like to explore is to move those records to another table, or to use partitioning to separate the deleted and undeleted rows (which would amount to broadly the same thing).
That would let you keep all the records of interest in a smaller table that can be indexed differently from the table holding the deleted records.
If you went down the partitioning route you'd have a DOCUMENTS master table with DOCUMENTS_DELETED and DOCUMENTS_LIVE tables inheriting from it.
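In PostgreSQL the inheritance route might be sketched roughly like this (the column list is only a placeholder):
create table documents (id integer primary key, body text, is_deleted boolean not null default false);
create table documents_live (check (not is_deleted)) inherits (documents);
create table documents_deleted (check (is_deleted)) inherits (documents);
Queries against documents then see rows from both children, while queries against documents_live touch only the live rows; bear in mind that constraints such as the primary key are not enforced across child tables.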

Related

Conditionally linking Postgres rows to data in various other tables

I have a product table that is updated using CSV feeds from various suppliers. Each feed has its own table, however products can appear multiple times in the same supplier table, and in multiple supplier tables. Each product can only occur once in our main table though. I don't anticipate ever using more than about ten different supplier tables. Tables are updated at least daily, and at most every 6-8 hours, and read speeds are a much higher priority than write speeds. There are usually about 500,000 enabled products at any given time.
My first plan was to store the table name and primary key ID in that table for each product, then recalculate it during each update, but according to the responses here, having to do that is an indication that the database isn't designed correctly.
Using a view to combine these tables into a single virtual table seems like it'd help a lot with the organization. That way, I can just create a rule to make one column an SQL query, then index that column to increase search/read speed. The rules that determine where to pull supplier information from are somewhat involved, and need to take country and price into account, as well as perhaps a few other things.
So I guess the question here is, is there a correct way of doing this? Or is it going to be messy no matter how I do it? Also, am I on the right track?
Using a view unifying all your feed tables might well simplify the form of your queries, but you cannot index a view. (Well, in Oracle I think you can index a MATERIALIZED view, but that's a special case).
Structurally, I find it a bit suspect that you split your supplier feeds into separate tables; doing so may simplify and speed updates from the supplier feeds, and it is certainly the fastest alternative for queries against specific, individual feeds, but it's ugly for updating (recomputing?) the main table, and it is flatly unsound for supporting rows of the main table being related back to the particular supplier feed from which they were drawn.
If you need fast queries against the supplier feeds, independent of the main table, and you also need the main table to be related to a detail table containing supplier-specific information, then perhaps your best bet would be to maintain a physical auxiliary table as the UNION ALL of all the per-supplier tables (this requires those tables to have the same structure), each with a distinct supplier ID. In Oracle, you can automate that as a MATERIALIZED VIEW, but with most DBMSs you would need to maintain that table manually.
The auxiliary table can be indexed, can be joined to the main table as needed in queries, and can be queried fairly efficiently. If appropriate, it can be used to update the main table.
Hmm, why not just create one product table that contains data from all suppliers? Have a field in that table that identifies which supplier. When you get your input feeds, update this one table rather than having a separate table for each supplier. If you're using COPY to import a CSV file into a db table, fine, but then the imported table is just a temporary work table. Promptly copy the data from there into the "real", unified table. Then the import table can be dropped or truncated, or more likely you keep it around for troubleshooting. But you don't use it within the program.
You should be able to copy from the import table to the unified table with a single insert statement. Even if the tables are large I'd expect that to be fast. It would almost surely be faster overall to do one mass insert for each import than to have a view that does a union on 10 tables and try to work with that. If the unified table has all the data from all suppliers plus a supplier field, then I don't see why you would ever need to query the raw import tables. Except, that is, for trouble-shooting problems with the import, but fine, so you keep them around for that. Unless you're constrained on disk space so that keeping what amounts to duplicates of every record is a problem, I'd think this would be the easy solution. If disk space is an issue, then drop the import table immediately after copying the data to the unified table, and keep the original raw import on backup media somewhere.
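Sketching that flow in PostgreSQL, with made-up table and column names:
-- load the raw CSV into a throwaway staging table
copy supplier_acme_staging from '/path/to/acme_feed.csv' with (format csv, header true);
-- fold it into the single unified table, tagging each row with its supplier
insert into products (sku, name, price, supplier)
select sku, name, price, 'acme' from supplier_acme_staging;
-- keep the staging table around for troubleshooting, or clear it right away
truncate supplier_acme_staging;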

Implementing soft delete with minimal impact on performance and code

There are some similar questions on the topic, but they are not really helping me.
I want to implement a soft delete feature like on StackOverflow, where items are not really deleted, but just hidden. I am using a SQL database. Here are 3 options:
Add an is_deleted boolean field.
Advantages: Simple.
Disadvantages: No date record. Forces me to add an is_deleted = 0 condition to every query.
Add a deleted_date date field. This is set to NULL if it's not deleted.
Advantages: Has date.
Disadvantages: Still cluttering my queries.
For both of the above:
It will also impact performance because of all these useless rows. They still have to be maintained in indexes, and an index on the deleted column won't help when fetching the non-deleted (i.e. the majority of) rows; a full table scan is needed.
Another option is to create a separate table to hold deleted items:
Advantages: Improved performance when querying non-deleted rows. No need to add conditions to my queries on non-deleted rows. Easier on index maintenance.
Disadvantages: Complexity: Requires data migration for both deletion and undeletion. Need for new tables. Referential integrity is harder to handle.
Is there a better option?
I personally would base my answer on how often you anticipate your users wanting to access or "restore" that deleted data.
If it's often, then I would go with a "Date_Deleted" field and put a calculated "IsDeleted" property in my POCO in the code.
If it's never (or almost never) then a history table or deleted table is good for the benefits you explained.
I personally almost never use deleted tables (and opt for isDeleted or date_deleted) because of the potential risk to referential integrity. You have A -> B and you remove the record from table B... you now have to manage referential integrity because of your design choice.
If the key is numeric, I handle a "soft-delete" by negating the key. (Of course, this won't work for identity keys.) You don't need to change your code at all, and can easily restore the record by multiplying by -1.
Just another approach to give some thought to... If the key is alphanumeric, you can do something similar by prepending a unique "marker" character. Since deleted records will all begin with this marker, they will end up off by themselves in the index.
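A quick sketch of the negated-key idea (table, column and key values are made up):
-- soft-delete order 1042 by flipping the sign of its key
update orders set order_id = -order_id where order_id = 1042;
-- restore it later by flipping the sign back
update orders set order_id = -order_id where order_id = -1042;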
In my opinion, the best way forward, when thinking about scaling and eventual table/database sizes is your third option - a separate table for deleted items. Such a table can eventually be moved to a different database to support scaling.
I believe you have listed the three most common options. As you have seen, each has advantages and disadvantages. Personally, I like taking the longer view on things.
I think your analysis of the options is good but you missed a few relevant points which I list below. Almost all implementations that I have seen use some sort of deleted or versioning field on the row as you suggest in your first two options.
Using one table with deleted flag:
If your indexes all contain the deleted flag field first, and your queries mostly contain a WHERE is_deleted = false type structure, then it DOES solve your performance problems and the indexes very efficiently exclude the deleted rows. Similar logic could be used for the deleted date option.
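For example (table and column names assumed), putting the flag first lets the index serve the common case directly:
create index orders_live_customer_idx on orders (is_deleted, customer_id);
-- the leading is_deleted column lets the index efficiently exclude deleted rows
select * from orders where is_deleted = false and customer_id = 42;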
Using two tables:
In general you need to make massive changes to reports because some reports may refer to deleted data (like old sales figures might refer to a deleted sales category). One can overcome this by creating a view which is a union of the two tables to read from and only write to the active records table.
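A sketch of such a union view (table names assumed), where reads go through the view and writes go only to the active table:
create view sales_all as
select * from sales_active
union all
select * from sales_deleted;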
Let's suppose we create a field called dead to mark deleted rows. We can create an index where the dead field is false.
That way, we only search non-deleted rows, using the USE INDEX hint.

Dynamically creating tables as a means of partitioning: OK or bad practice?

Is it reasonable for an application to create database tables dynamically as a means of partitioning?
For example, say I have a large table "widgets" with a "userID" column identifying the owner of each row. If this table tended to grow extremely large, would it make sense to instead have the application create a new table called "widgets_{username}" for each new user? Assume that the application will only ever have to query for widgets belonging to a single user at a time (i.e. no need to try and join any of these user widget tables together).
Doing this would break up the one large table into more easily-managed chunks, but this doesn't seem like an elegant solution. In my mind, the database schema should be defined when the application is written, and any runtime data is stored as rows, not as additional tables.
As a more general question, is modifying the database schema at runtime ever ok?
Edit: This question is mostly hypothetical; I had a pretty good feeling that creating tables at runtime didn't make sense. That being said, we do have a table with millions of rows in our application. SELECTs perform fine, but things like deleting all rows owned by a particular user can take a while. Basically I'm looking for some solid reasoning why dynamically creating a table for each user doesn't make sense, for when I'm asked.
NO, NO, NO!! Now repeat after me: I will not do this because it will create many headaches and problems in the future! Databases are made to handle large amounts of information. They use indexes to quickly find what you are after. Think of a phone book: how effective is its index? Would it be better to have a different book for each last name?
This will not give you anything performance-wise. Keep a single table, but be sure to index on UserID and you'll be able to get the data fast. However, if you split the table up, it becomes impossible (or really, really hard) to get any info that spans multiple users, like searching all users for a certain widget or counting all widgets of a certain type, and every query has to be built dynamically.
If deleting rows is slow, look into that. How many rows at one time are we talking about: 10, 1,000, 100,000? What is your clustered index on this table? Could you use a "soft delete", where you have a status column that you UPDATE to "D" to mark the row as deleted? Could you delete the rows at a later time, with less database activity? Is the delete slow because it is being blocked by other activity? Look into those before you break up the table.
No, that would be a bad idea. However some DBMSs (e.g. Oracle) allow a single table to be partitioned on values of a column, which would achieve the objective without creating new tables at run time. Having said that, it is not "the norm" to partition tables like this: it is only usually done in very large databases.
Using an index on userID should give nearly the same performance.
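For example, something like this (using the table and column from the question):
create index widgets_userid_idx on widgets (userID);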
In my opinion, changing the database schema at runtime is bad practice.
Consider, for example, security issues...
Is it reasonable for an application to create database tables dynamically as a means of partitioning?
No. (smile)

'active' flag or not?

OK, so practically every database-based application has to deal with "non-active" records: either soft-deletions or marking something as "to be ignored". I'm curious whether there are any radical alternative thoughts on an 'active' column (or a status column).
For example, if I had a list of people
CREATE TABLE people (
id INTEGER PRIMARY KEY,
name VARCHAR(100),
active BOOLEAN,
...
);
That means to get a list of active people, you need to use
SELECT * FROM people WHERE active=True;
Does anyone suggest that non-active records should be moved off to a separate table, with a UNION done to join the two where appropriate?
Curiosity striking...
EDIT: I should make clear, I'm coming at this from a purist perspective. I can see how data archiving might be necessary for large amounts of data, but that is not where I'm coming from. If you do a SELECT * FROM people it would make sense to me that those entries are in a sense "active"
Thanks
You partition the table on the active flag, so that active records are in one partition, and inactive records are in the other partition. Then you create an active view for each table which automatically has the active filter on it. The database query engine automatically restricts the query to the partition that has the active records in it, which is much faster than even using an index on that flag.
Here is an example of how to create a partitioned table in Oracle. Oracle doesn't have boolean column types, so I've modified your table structure for Oracle purposes.
CREATE TABLE people
(
id NUMBER(10),
name VARCHAR2(100),
active NUMBER(1)
)
PARTITION BY LIST(active)
(
PARTITION active_records VALUES (1),
PARTITION inactive_records VALUES (0)
);
If you wanted to, you could put each partition in a different tablespace. You can partition your indexes as well.
Well, to ensure that you only draw active records in most situations, you could create views that only contain the active records. That way it's much easier to not leave out the active part.
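For example, with the people table from the question, such a view could be as simple as:
create view active_people as
select * from people where active = true;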
We use an enum('ACTIVE','INACTIVE','DELETED') in most tables so we actually have a 3-way flag. I find it works well for us in different situations. Your mileage may vary.
Moving inactive stuff is usually a stupid idea. It's a lot of overhead with lots of potential for bugs, everything becomes more complicated, like unarchiving the stuff etc. What do you do with related data? If you move all that, too, you have to modify every single query. If you don't move it, what advantage were you hoping to get?
That leads to the next point: WHY would you move it? A properly indexed table requires only about one additional lookup step each time its size doubles, so any performance improvement is bound to be negligible. And why would you even think about it until the distant future time when you actually have performance problems?
I think looking at it strictly as a piece of data then the way that is shown in the original post is proper. The active flag piece of data is directly dependent upon the primary key and should be in the table.
That table holds data on people, irrespective of the current status of their data.
The active flag is sort of ugly, but it is simple and works well.
You could move them to another table as you suggested. I'd suggest looking at the percentage of active / inactive records. If you have over 20 or 30 % inactive records, then you might consider moving them elsewhere. Otherwise, it's not a big deal.
Yes, we would. We currently have the "active='T/F'" column in many of our tables, mainly to show the 'latest' row. When a new row is inserted, the previous T row is marked F to keep it for audit purposes.
Now, we're moving to a 2-table approach: when a new row is inserted, the previous row is moved to a history table. This gives us better performance for the majority of cases - looking at the current data.
The cost is slightly more than the old method: previously you had to update and insert, now you have to insert and update (i.e. instead of inserting a new T row, you modify the existing row with all the new data), so the cost is just that of passing in a whole row of data instead of just the changes. That's hardly going to have any effect.
The performance benefit is that your main table's index is significantly smaller, and you can optimise your tablespaces better (they won't grow quite so much!)
Binary flags like this in your schema are a BAD idea. Consider the query
SELECT count(*) FROM users WHERE active=1
Looks simple enough. But what happens when you have a large number of users, so many that adding an index to this table is required? Again, it looks straightforward:
ALTER TABLE users ADD INDEX index_users_on_active (active)
EXCEPT!! This index is useless because the cardinality on this column is exactly two! Any database query optimiser will ignore this index because of its low cardinality and do a table scan.
Before filling up your schema with helpful flags consider how you are going to access that data.
https://stackoverflow.com/questions/108503/mysql-advisable-number-of-rows
We use active flags quite often. If your database is going to be very large, I could see the value in migrating inactive values to a separate table, though.
You would then only require a union of the tables when someone wants to see all records, active or inactive.
In most cases a binary field indicating deletion is sufficient. Often there is a clean up mechanism that will remove those deleted records after a certain amount of time, so you may wish to start the schema with a deleted timestamp.
Moving off to a separate table and bringing them back up takes time. Depending on how many records go offline and how often you need to bring them back, it might or might not be a good idea.
If they mostly don't come back once they are buried, and are only used for summaries/reports/whatever, then it will make your main table smaller, queries simpler and probably faster.
We use both methods for dealing with inactive records. The method we use is dependent upon the situation. For records that are essentially lookup values, we use the Active bit field. This allows us to deactivate entries so they won't be used, but also allows us to maintain data integrity with relations.
We use the "move to a separate table" method where the data is no longer needed and the data is not part of a relation.
The situation really dictates the solution, methinks:
If the table contains users, then several "flag" fields could be used - one for Deleted, Disabled, etc. Or if space is an issue, a flag for disabled would suffice, with the row actually being deleted when the user is deleted.
It also depends on policies for storing data. If there are policies for keeping data archived, then a separate table would most likely be necessary after any great length of time.
No - this is a pretty common thing - couple of variations depending on specific requirements (but you already covered them):
1) If you expect to have a whole BUNCH of data - like multiple terabytes or more - it's not a bad idea to archive deleted records immediately, though you might use a combination approach of marking them as deleted and then copying them to archive tables.
2) Of course the option to hard delete a record still exists - though we developers tend to be data pack-rats. I suggest you look at the business process and decide whether there is any need to even keep the data: if there is, do so; if there isn't, feel free to just throw the stuff away... again, according to the specific business scenario.
From a 'purist perspective' the relational model doesn't differentiate between a view and a table - both are relations. So the use of a view that uses the discriminator is perfectly meaningful and valid, provided the entities are correctly named, e.g. Person/ActivePerson.
Also, from a 'purist perspective' the table should be named person, not people as the name of the relation reflects a tuple, not the entire set.
Regarding indexing the boolean, why not:
ALTER TABLE users ADD INDEX index_users_on_active (id, active);
Would that not improve the search?
However I don't know how much of that answer depends on the platform.
This is an old question, but for those searching for a way to index low-cardinality/low-selectivity columns, I'd like to propose the following approach, which avoids partitioning, secondary tables, etc.:
The trick is to use a "dateInactivated" column that stores the timestamp of when the record is inactivated/deleted. As the name implies, the value is NULL while the record is active; once it is inactivated, the current system datetime is written in. Thus, an index on that column ends up having high selectivity as the number of "deleted" records grows, since each record will have an (almost) unique value.
Then your query becomes:
SELECT * FROM people WHERE dateInactivated is NULL;
The index will pull in just the right set of rows that you care about.
Filtering big tables on a bit flag is not really good in terms of performance. If 'active' indicates virtual deletion, you can create a 'TableName_deleted' table with the same structure and move deleted data there using a delete trigger.
That solution helps with performance and simplifies data queries.
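A rough sketch of that trigger in PostgreSQL (table names are made up; the syntax differs in other DBMSs):
-- archive table with the same structure as the live table
create table orders_deleted (like orders including all);
create function archive_deleted_order() returns trigger as $$
begin
    insert into orders_deleted select old.*;
    return old;
end;
$$ language plpgsql;
-- on PostgreSQL older than 11, use EXECUTE PROCEDURE instead of EXECUTE FUNCTION
create trigger orders_archive_on_delete
    before delete on orders
    for each row execute function archive_deleted_order();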

What is the best way to implement soft deletion?

Working on a project at the moment and we have to implement soft deletion for the majority of users (user roles). We decided to add an is_deleted='0' field on each table in the database and set it to '1' if particular user roles hit a delete button on a specific record.
For future maintenance, each SELECT query will need to ensure it does not include records where is_deleted='1'.
Is there a better solution for implementing soft deletion?
Update: I should also note that we have an Audit database that tracks changes (field, old value, new value, time, user, ip) to all tables/fields within the Application database.
I would lean towards a deleted_at column that contains the datetime of when the deletion took place. Then you get a little bit of free metadata about the deletion. For your SELECT just get rows WHERE deleted_at IS NULL
You could perform all of your queries against a view that contains the WHERE IS_DELETED='0' clause.
Having is_deleted column is a reasonably good approach.
If it is in Oracle, to further increase performance I'd recommend partitioning the table by creating a list partition on the is_deleted column.
Then deleted and non-deleted rows will physically be in different partitions, though for you it'll be transparent.
As a result, if you type a query like
SELECT * FROM table_name WHERE is_deleted = 1
then Oracle will perform the 'partition pruning' and only look into the appropriate partition. Internally a partition is a different table, but it is transparent for you as a user: you'll be able to select across the entire table no matter if it is partitioned or not. But Oracle will be able to query ONLY the partition it needs. For example, let's assume you have 1000 rows with is_deleted = 0 and 100000 rows with is_deleted = 1, and you partition the table on is_deleted. Now if you include condition
WHERE ... AND IS_DELETED=0
then Oracle will ONLY scan the partition with 1000 rows. If the table weren't partitioned, it would have to scan 101000 rows (both partitions).
The best response, sadly, depends on what you're trying to accomplish with your soft deletions and the database you are implementing this within.
In SQL Server, the best solution would be to use a deleted_on/deleted_at column with a type of SMALLDATETIME or DATETIME (depending on the necessary granularity) and to make that column nullable. In SQL Server, the row header data contains a NULL bitmask for each of the columns in the table so it's marginally faster to perform an IS NULL or IS NOT NULL than it is to check the value stored in a column.
If you have a large volume of data, you will want to look into partitioning your data, either through the database itself or through two separate tables (e.g. Products and ProductHistory) or through an indexed view.
I typically avoid flag fields like is_deleted, is_archive, etc because they only carry one piece of meaning. A nullable deleted_at, archived_at field provides an additional level of meaning to yourself and to whoever inherits your application. And I avoid bitmask fields like the plague since they require an understanding of how the bitmask was built in order to grasp any meaning.
If the table is large and performance is an issue, you can always move 'deleted' records to another table, which has additional info like time of deletion, who deleted the record, etc.
That way you don't have to add another column to your primary table.
That depends on what information you need and what workflows you want to support.
Do you want to be able to:
know what information was there (before it was deleted)?
know when it was deleted?
know who deleted it?
know in what capacity they were acting when they deleted it?
be able to un-delete the record?
be able to tell when it was un-deleted?
etc.
If the record was deleted and un-deleted four times, is it sufficient for you to know that it is currently in an un-deleted state, or do you want to be able to tell what happened in the interim (including any edits between successive deletions!)?
Careful of soft-deleted records causing uniqueness constraint violations.
If your DB has columns with unique constraints then be careful that the prior soft-deleted records don’t prevent you from recreating the record.
Think of the cycle:
create user (login=JOE)
soft-delete (set deleted column to non-null.)
(re) create user (login=JOE). ERROR. LOGIN=JOE is already taken
Second create results in a constraint violation because login=JOE is already in the soft-deleted row.
Some techniques:
1. Move the deleted record to a new table.
2. Make your uniqueness constraint across the login and deleted_at timestamp column
My own opinion is +1 for moving to a new table. It takes lots of discipline to maintain the *AND deleted_at IS NULL* across all your queries (for all of your developers).
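A related variant in PostgreSQL (SQL Server filtered indexes can do the same) is a partial unique index, so that the constraint only applies to live rows (names assumed):
-- login=JOE can be re-created once the old row has been soft-deleted
create unique index users_login_live_uq on users (login) where deleted_at is null;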
You will definitely have better performance if you move your deleted data to another table like Jim said, as well as having a record of when it was deleted, why, and by whom.
Adding WHERE deleted=0 to all your queries will slow them down significantly, and hinder the usage of any indexes you may have on the table. Avoid having "flags" in your tables whenever possible.
You don't mention which product, but SQL Server 2008 and PostgreSQL (and others, I'm sure) allow you to create filtered indexes, so you could create a covering index where is_deleted=0, mitigating some of the negatives of this particular approach.
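For example, in SQL Server such a filtered covering index might look something like this (the column names are only illustrative):
CREATE NONCLUSTERED INDEX ix_documents_live
    ON documents (customer_id)
    INCLUDE (title, created_at)
    WHERE is_deleted = 0;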
Something that I use on projects is a statusInd tinyint not null default 0 column
Using statusInd as a bitmask allows me to perform data management (delete, archive, replicate, restore, etc.). Using this in views, I can then do the data distribution, publishing, etc. for the consuming applications. If performance is a concern regarding views, use small fact tables to support this information; dropping the fact drops the relation and allows for scaled deletes.
Scales well and is data-centric, keeping the data footprint pretty small - key for 350 GB+ databases with real-time concerns. Using alternatives such as extra tables and triggers has some overhead that, depending on the need, may or may not work for you.
SOX-related audits may require more than a single field in your case, but this may help.
Enjoy
Use a view, function, or procedure that checks is_deleted = 0; i.e. don't select directly on the table in case the table needs to change later for other reasons.
And index the is_deleted column for larger tables.
Since you already have an audit trail, tracking the deletion date is redundant.
I prefer to keep a status column, so I can use it for several different configs, e.g. published, private, deleted, needsApproval...
Create another schema and grant it all privileges on your data schema.
Implement VPD on your new schema so that each and every query will have a predicate appended to it that allows selection of non-deleted rows only.
http://download.oracle.com/docs/cd/E11882_01/server.112/e16508/cmntopc.htm#CNCPT62345
@AdditionalCriteria("this.status <> 'deleted'")
Put this on top of your @Entity.
http://wiki.eclipse.org/EclipseLink/Examples/JPA/SoftDelete