Maintaining logical consistency with a soft delete, whilst retaining the original information - sql

I have a very simple table students, structure as below, where the primary key is id. This table is a stand-in for about 20 multi-million row tables that get joined together a lot.
+----+----------+------------+
| id | name | dob |
+----+----------+------------+
| 1 | Alice | 01/12/1989 |
| 2 | Bob | 04/06/1990 |
| 3 | Cuthbert | 23/01/1988 |
+----+----------+------------+
If Bob wants to change his date of birth, then I have a few options:
1. Update students with the new date of birth.
Positives: 1 DML operation; the table can always be accessed by a single primary key lookup.
Negatives: I lose the fact that Bob ever thought he was born on 04/06/1990
2. Add a column, created date default sysdate, to the table and change the primary key to id, created. Every update becomes:
insert into students(id, name, dob) values (:id, :name, :new_dob)
Then, whenever I want the most recent information do the following (Oracle but the question stands for every RDBMS):
select id, name, dob
from ( select a.*, rank() over ( partition by id
order by created desc ) as "rank"
from students a )
where "rank" = 1
Positives: I never lose any information.
Negatives: All queries over the entire database take that little bit longer. If the table were only the size shown this wouldn't matter, but once you're on your 5th left outer join, using range scans rather than unique scans begins to have an effect.
3. Add a different column, deleted date default to_date('2100/01/01','yyyy/mm/dd'), or whatever overly early, or futuristic, date takes my fancy. Change the primary key to id, deleted; then every update becomes:
update students x
set deleted = sysdate
where id = :id
and deleted = ( select max(deleted) from students where id = x.id );
insert into students(id, name, dob) values ( :id, :name, :new_dob );
and the query to get out the current information becomes:
select id, name, dob
from ( select a.*, rank() over ( partition by id
order by deleted desc ) as "rank"
from students a )
where "rank" = 1
Positives: I never lose any information.
Negatives: Two DML operations; I still have to use ranked queries, with the additional cost of a range scan rather than a unique index scan in every query.
4. Create a second table, say student_archive, and change every update into:
insert into student_archive select * from students where id = :id;
update students set dob = :newdob where id = :id;
Positives: Never lose any information.
Negatives: 2 DML operations; if you ever want to get all the information ever you have to use union or an extra left outer join.
5. For completeness, have a horribly de-normalised data structure: id, name1, dob1, name2, dob2... etc.
Number 1 is not an option if I never want to lose any information and always do a soft delete. Number 5 can be safely discarded as causing more trouble than it's worth.
I'm left with options 2, 3 and 4 with their attendant negative aspects. I usually end up using option 2 and the horrific 150 line (nicely-spaced) multiple sub-select joins that go along with it.
tl;dr I realise I'm skating close to the line on a "not constructive" vote here but:
What is the optimal (singular!) method of maintaining logical consistency while never deleting any data?
Is there a more efficient way than those I have documented? In this context I'll define efficient as "fewer DML operations" and / or "being able to remove the sub-queries". If you can think of a better definition when (if) answering, please feel free.

I'd stick to #4 with some modifications. There is no need to delete data from the original table; it's enough to copy the old values to the archive table before updating (or deleting) the original record. That can easily be done with a row-level trigger. Retrieving all the information is, in my opinion, not a frequent operation, and I don't see anything wrong with an extra join or union. Also, you can define a view, so all queries will be straightforward from the end user's perspective.
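A minimal sketch of the trigger-plus-view idea, run against an in-memory SQLite database from Python so that it is self-contained. The trigger names, the archived_at column, the student_history view and the new date of birth are assumptions made for illustration; Oracle trigger syntax differs, but the shape should be similar.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, dob TEXT);

-- Hypothetical archive table: same columns plus the time the row was superseded.
CREATE TABLE student_archive (id INTEGER, name TEXT, dob TEXT, archived_at TEXT);

-- Row-level trigger: copy the old values before every update.
CREATE TRIGGER students_before_update BEFORE UPDATE ON students
FOR EACH ROW
BEGIN
    INSERT INTO student_archive (id, name, dob, archived_at)
    VALUES (OLD.id, OLD.name, OLD.dob, datetime('now'));
END;

-- Same idea for deletes, so a deleted row is never lost either.
CREATE TRIGGER students_before_delete BEFORE DELETE ON students
FOR EACH ROW
BEGIN
    INSERT INTO student_archive (id, name, dob, archived_at)
    VALUES (OLD.id, OLD.name, OLD.dob, datetime('now'));
END;

-- Hypothetical view that hides the union from anyone who wants the full history.
CREATE VIEW student_history AS
SELECT id, name, dob, NULL AS archived_at FROM students
UNION ALL
SELECT id, name, dob, archived_at FROM student_archive;

INSERT INTO students VALUES
    (1, 'Alice', '1989-12-01'),
    (2, 'Bob', '1990-06-04'),
    (3, 'Cuthbert', '1988-01-23');
""")

# The application issues one plain UPDATE; the trigger does the archiving.
conn.execute("UPDATE students SET dob = ? WHERE id = ?", ("1990-06-05", 2))

current = conn.execute("SELECT name, dob FROM students WHERE id = 2").fetchone()
archived = conn.execute("SELECT name, dob FROM student_archive WHERE id = 2").fetchall()
history_count = conn.execute(
    "SELECT COUNT(*) FROM student_history WHERE id = 2"
).fetchone()[0]
```

The current table keeps its single-row primary key lookup, and the second DML operation still happens but is no longer something every caller has to remember to write.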

Related

SQL: Taking one column from two tables and putting them into one predefined table

Just a little bug off my shoulder, but for what I'm using this code for, it is not the end of the world if this one doesn't get answered. To preface, a few things: I know this is entirely improper, I know this should never be used -- let alone, done -- in a production environment, and I know that the root of this operation is totally unconventional, but I'm asking anyway:
If I have two tables with a set of values that I am looking to grab and put into one other, combined and predefined table, side by side, how might I do that?
Right now, I have two statements doing
INSERT INTO table ('leftCol') SELECT NAME FROM smolT1 ORDER BY num DESC LIMIT 3
INSERT INTO table ('rightCol') SELECT NAME FROM smolT2 ORDER BY num DESC LIMIT 3
but, as one would imagine, that query ends up with something like...
leftCol | rightCol
Jack |
James |
John |
| Jill
| Justina
| Jesebelle
and of course, it would be much more preferred if the left and right column lined up, though, for the sake of gathering just those six records, I suppose it is not too big of a concern.
To add on, yes, these two tables do have a NAME in common, but with how I am querying them, they are totally irrelevant to one another and should not be associated with one another, just displayed side by side.
I am simply curious as to whether or not one query would get these two unrelated queries to work together and print neatly into a form or if I just have to live with this data looking like this.
Cheers!
The most recent versions of SQLite support window functions. This allows you to do:
select min(name1) as name1, min(name2) as name2
from (select name as name1, null as name2, row_number() over (order by name) as seqnum
from smolt1
where name is not null
union all
select null, name, row_number() over (order by name) as seqnum
from smolt2
where name is not null
) lr
group by seqnum;
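Here is a small runnable sketch of the same pairing trick, driven from Python's sqlite3 module. It assumes SQLite 3.25 or later (for window functions), a hypothetical target table called combined, and made-up num values; unlike the query above it numbers the rows by num descending and keeps the top three, to match the two original INSERT statements.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE smolT1 (name TEXT, num INTEGER);
CREATE TABLE smolT2 (name TEXT, num INTEGER);
CREATE TABLE combined (leftCol TEXT, rightCol TEXT);  -- hypothetical target table

INSERT INTO smolT1 VALUES ('Jack', 9), ('James', 8), ('John', 7), ('Jim', 1);
INSERT INTO smolT2 VALUES ('Jill', 9), ('Justina', 8), ('Jesebelle', 7), ('Jane', 1);
""")

# Number the rows of each table independently, stack them with UNION ALL,
# then collapse rows that share a sequence number into one output row.
# MIN() ignores NULLs, so each group yields the one non-NULL name per side.
conn.execute("""
INSERT INTO combined (leftCol, rightCol)
SELECT MIN(name1), MIN(name2)
FROM (
    SELECT name AS name1, NULL AS name2,
           ROW_NUMBER() OVER (ORDER BY num DESC) AS seqnum
    FROM smolT1
    UNION ALL
    SELECT NULL, name,
           ROW_NUMBER() OVER (ORDER BY num DESC)
    FROM smolT2
) lr
WHERE seqnum <= 3
GROUP BY seqnum
ORDER BY seqnum
""")

rows = conn.execute(
    "SELECT leftCol, rightCol FROM combined ORDER BY rowid"
).fetchall()
```

The two tables are never joined on NAME; the only thing the rows share is their position in their own ordering.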

Best practice for setup and querying versioned records in T-SQL

I'm trying to optimize my SQL queries and I always come back to this one issue and I was hoping to get some insight into how I could best optimize this.
For brevity, lets say I have a simple employee table:
tbl_employees
Id HiredDateTime
------------------
1 ...
2 ...
That has versioned information in another table for each employee:
tbl_employees_versioned
Id Version Name HourlyWage
-------------------------------
1 1 Bob 10
1 2 Bob 20
1 3 Bob 30
2 1 Dan 10
2 2 Dan 20
And this is how the latest version records are retrieved in a View:
Select tbl_employees.Id, employees_LatestVersion.Name, employees_LatestVersion.HourlyWage, employees_LatestVersion.Version
From tbl_employees
Inner Join tbl_employees_versioned
ON tbl_employees.Id = tbl_employees_versioned.Id
CROSS APPLY
(SELECT Id, Max(Version) AS Version
FROM tbl_employees_versioned AS employees_LatestVersion
WHERE Id = tbl_employees_versioned.Id
GROUP BY Id) AS employees_LatestVersion
To get a response like this:
Id Version Name HourlyWage
-------------------------------
1 3 Bob 30
2 2 Dan 20
When running a query over more than 500 employee records, each of which has a few versions, this query starts choking up and takes a few seconds to run.
There are a couple strikes right off the bat, but I'm not sure how to overcome them.
Obviously the Cross Apply adds some performance loss. Is there a best practice when dealing with versioned information like this? Is there a better way to get just a record with the highest version?
The versioned table doesn't have a clustered index because neither Id nor Version is unique. Concatenated together they would be, but it doesn't work like that. Instead there is a non-clustered index for Id and another one for Version. Is there a better way to index this table to get any performance gain? Would an indexed view really help here?
I think the best way to structure the data is using start dates and end dates. So, the data structure for your original table would look like:
create table tbl_EmployeesHistory (
EmployeeHistoryId int,
EffDate date not null,
EndDate date,
-- Fields that describe the employee during this time
)
Then, you can see the current version using a view:
create view vw_Employees as
select *
from tbl_EmployeesHistory
where EndDate is NULL
In some cases, where future end dates are allowed, the where clause would be:
where coalesce(EndDate, getdate()) >= getdate()
Alternatively, in this case, you can default EndDate to some future date far, far away such as '01-01-9999'. You would add this as the default in the create table statement, make the column not null, and then you can always use the statement:
where getdate() between EffDate and EndDate
As Martin points out in his comment, the coalesce() might impede the use of an index (it does in SQL Server), whereas this does not have that problem.
This is called a slowly changing dimension. Ralph Kimball discusses this concept at some length in his books on data warehousing.
Here's one way you can get a view of the most recent version for each employee:
Select Id, Name, HourlyWage, Version
FROM (
Select E.Id, V.Name, V.HourlyWage, V.Version,
row_number() OVER (PARTITION BY V.ID ORDER BY V.Version DESC) as nRow
From tbl_employees E
Inner Join tbl_employees_versioned V ON E.Id = V.Id
) A
WHERE A.nRow = 1
I suspect that this will perform better than your previous solution. One index across Id and Version in tbl_employees_versioned would most likely also help.
Also, note that you only need to join on tbl_employees if you're selecting fields that are not in tbl_employees_versioned.
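As a rough check of the row_number() approach together with a single composite index, here is a sketch using SQLite from Python (SQLite 3.25+ assumed for window functions; SQL Server syntax for the view is almost identical). The index and view names are invented, and making the index UNIQUE is an assumption based on the statement that Id and Version are unique when taken together.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tbl_employees_versioned (
    Id INTEGER NOT NULL,
    Version INTEGER NOT NULL,
    Name TEXT,
    HourlyWage INTEGER
);

-- One index across both columns instead of two single-column indexes.
-- UNIQUE assumes an (Id, Version) pair never repeats.
CREATE UNIQUE INDEX ix_versioned_id_version
    ON tbl_employees_versioned (Id, Version);

INSERT INTO tbl_employees_versioned VALUES
    (1, 1, 'Bob', 10), (1, 2, 'Bob', 20), (1, 3, 'Bob', 30),
    (2, 1, 'Dan', 10), (2, 2, 'Dan', 20);

-- Latest version per employee: number the versions newest-first, keep row 1.
CREATE VIEW vw_employees_latest AS
SELECT Id, Version, Name, HourlyWage
FROM (
    SELECT V.*,
           ROW_NUMBER() OVER (PARTITION BY V.Id ORDER BY V.Version DESC) AS nRow
    FROM tbl_employees_versioned V
) A
WHERE A.nRow = 1;
""")

latest = conn.execute(
    "SELECT Id, Version, Name, HourlyWage FROM vw_employees_latest ORDER BY Id"
).fetchall()
```

Whether the optimizer actually uses the composite index for the partitioned sort is worth confirming in the real query plan rather than assuming.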

order by field with more than 10000 ids

I need to do specific ordering with use of order by field.
select * from table order by field(id,3,4,1,2.......upto 10000 ids)
The required ordering cannot be derived from anything in SQL, so how much will this affect performance, and is it feasible to do?
Updates from the comments:
Ordering depends on user and category IDs and can be anything the user wants.
The ordering specification changes (about) daily.
So, we need a custom ordering that depends on the user and category and this ordering needs to change daily.
The easiest way would be to put your ordering in a separate table (called ordering_table in this example):
id | position
----+----------
1 | 11
2 | 42
3 | 23
etc.
The above would mean "put an id of 1 at position 11, 2 at position 42, 3 at position 23, ...". Then you can join that ordering table in:
SELECT t.id, t.col1, t.col2
FROM some_table t
JOIN ordering_table o ON (t.id = o.id)
ORDER BY o.position
Where ordering_table is the table (as above) that defines your strange ordering. This approach simply represents your ordering function as a table (any function with a finite domain is, essentially, just a table after all).
This "ordering table" approach should work fine as long as the ordering table is complete.
If you only need this strange ordering in one place then you could merge the position column into your main table and add NOT NULL and UNIQUE constraints on that column to make sure you cover everything and have a consistent ordering.
Further commenting indicates that you want different orderings for different users and categories and that the ordering will change on a daily basis. You could make separate tables for each condition (which would lead to a combinatorial explosion) or, as Mikael Eriksson and ypercube suggest, add a couple more columns to the ordering table to hold the user and category:
CREATE TABLE ordering_table (
thing_id INT NOT NULL,
position INT NOT NULL,
user_id INT NOT NULL,
category_id INT NOT NULL
);
The thing_id, user_id, and category_id would be foreign keys to their respective tables and you'd probably want to index all the columns in ordering_table, but a couple of minutes spent looking at the query plans to see if the indexes get used would be worthwhile. You could also make all four columns the primary key to avoid duplicates. Then, the lookup query would be something like this:
SELECT t.id, t.col1, t.col2
FROM some_table t
LEFT JOIN ordering_table o
ON (t.id = o.thing_id AND o.user_id = $user AND o.category_id = $cat)
ORDER BY COALESCE(o.position, 99999)
Where $user and $cat are the user and category IDs (respectively). Note the change to a LEFT JOIN and the addition of COALESCE to allow for missing rows in ordering_table; these changes will push anything that doesn't have a specified position in the order to the bottom of the list rather than removing it from the results completely.
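To make the per-user ordering concrete, here is a small sketch using SQLite from Python. The sample rows, the ordered_ids helper, the tie-break on t.id and the three-column primary key are assumptions for illustration, not part of the schema above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE some_table (id INTEGER PRIMARY KEY, col1 TEXT);

CREATE TABLE ordering_table (
    thing_id    INTEGER NOT NULL,
    position    INTEGER NOT NULL,
    user_id     INTEGER NOT NULL,
    category_id INTEGER NOT NULL,
    -- Assumption: one position per thing for a given user and category.
    PRIMARY KEY (user_id, category_id, thing_id)
);

INSERT INTO some_table VALUES (1, 'a'), (2, 'b'), (3, 'c'), (4, 'd');

-- User 7 wants 3, 1, 2 in category 1; id 4 has no position at all.
INSERT INTO ordering_table VALUES (3, 1, 7, 1), (1, 2, 7, 1), (2, 3, 7, 1);
-- User 8 only ranks two of the rows.
INSERT INTO ordering_table VALUES (2, 1, 8, 1), (1, 2, 8, 1);
""")


def ordered_ids(conn, user_id, category_id):
    """Return ids in the custom order; rows without a position go last."""
    rows = conn.execute(
        """
        SELECT t.id
        FROM some_table t
        LEFT JOIN ordering_table o
          ON t.id = o.thing_id AND o.user_id = ? AND o.category_id = ?
        ORDER BY COALESCE(o.position, 99999), t.id
        """,
        (user_id, category_id),
    ).fetchall()
    return [r[0] for r in rows]


order_user7 = ordered_ids(conn, 7, 1)
order_user8 = ordered_ids(conn, 8, 1)
```

A daily change to the ordering then becomes a delete and re-insert of that user's rows in ordering_table, rather than a rewrite of a 10000-item ORDER BY FIELD list.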

Tricky SQL statement over 3 tables

I have 3 different transaction tables, which look very similar, but have slight differences. This comes from the fact that there are 3 different transaction types; depending on the transaction types the columns change, so to get them in 3NF I need to have them in separate tables (right?).
As an example:
t1:
date,user,amount
t2:
date,user,who,amount
t3:
date,user,what,amount
Now I need a query that is going to get me all transactions in each table for the same user, something like
select * from t1,t2,t3 where user='me';
(which of course doesn't work).
I am studying JOIN statements but haven't got around the right way to do this. Thanks.
EDIT: Actually I then need all of the columns from every table, not just the ones that are the same.
EDIT #2: Yeah, having transaction_type doesn't break 3NF, of course - so maybe my design is utterly wrong. Here is what really happens (it's an alternative currency system):
- Transactions are between users, like mutual credit. So units get swapped between users.
- Inventarizations are physical stuff brought into the system; a user gets units for this.
- Consumations are physical stuff consumed; a user has to pay units for this.
|--------------------------------------------------------------------------|
| type | transactions | inventarizations | consumations |
|--------------------------------------------------------------------------|
| columns | date | date | date |
| | creditor(FK user) | creditor(FK user) | |
| | debitor(FK user) | | debitor(FK user) |
| | service(FK service)| | |
| | | asset(FK asset) | asset(FK asset) |
| | amount | amount | amount |
| | | | price |
|--------------------------------------------------------------------------|
(Note that 'amount' is in different units; these are the entries, and calculations are made on those amounts. Explaining why is outside the scope here, but these are the fields.) So the question changes to "Can/should this be in one table or be multiple tables (as I have it for now)?"
I need the previously described SQL statement to display running balances.
(Should this now become a new question altogether or is that OK to EDIT?).
EDIT #3: As EDIT #2 actually transforms this to a new question, I also decided to post a new question. (I hope this is ok?).
You can supply defaults as constants in the select statements for columns where you have no data;
so
SELECT Date, User, Amount, 'NotApplicable' as Who, 'NotApplicable' as What from t1 where user = 'me'
UNION
SELECT Date, User, Amount, Who, 'NotApplicable' from t2 where user = 'me'
UNION
SELECT Date, User, Amount, 'NotApplicable', What from t3 where user = 'me'
which assumes that Who And What are string type columns. You could use Null as well, but some kind of placeholder is needed.
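A runnable sketch of the placeholder approach, using SQLite from Python. The sample rows are invented, and two details are my own choices rather than part of the query above: NULL is used as the placeholder, and UNION ALL replaces UNION so that two genuinely identical transactions are not merged into one. The extra source column is also hypothetical; it just records which table a row came from.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t1 (date TEXT, user TEXT, amount REAL);
CREATE TABLE t2 (date TEXT, user TEXT, who TEXT, amount REAL);
CREATE TABLE t3 (date TEXT, user TEXT, what TEXT, amount REAL);

INSERT INTO t1 VALUES ('2024-01-01', 'me', 5.0), ('2024-01-02', 'you', 1.0);
INSERT INTO t2 VALUES ('2024-01-03', 'me', 'alice', 7.5);
INSERT INTO t3 VALUES ('2024-01-04', 'me', 'bread', 2.0);
""")

# Every branch produces the same six columns in the same order;
# columns a table does not have are filled with NULL.
rows = conn.execute(
    """
    SELECT date, user, amount, NULL AS who, NULL AS what, 't1' AS source
    FROM t1 WHERE user = :u
    UNION ALL
    SELECT date, user, amount, who, NULL, 't2'
    FROM t2 WHERE user = :u
    UNION ALL
    SELECT date, user, amount, NULL, what, 't3'
    FROM t3 WHERE user = :u
    ORDER BY date
    """,
    {"u": "me"},
).fetchall()
```

With all three transaction types in one date-ordered result, a running balance can be computed over it in the application or in an outer query.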
I think that placing your additional information in a separate table and keeping all transactions in a single table will work better for you though, unless there is some other detail I've missed.
I think the meat of your question is here:
depending on the transaction types the columns change, so to get them in 3NF I need to have them in separate tables (right?).
I'm no 3NF expert, but I would approach your schema a little differently (which might clear up your SQL a bit).
It looks like your data elements are as such: date, user, amount, who, and what. With that in mind, a more normalized schema might look something like this:
User
----
id, user info (username, etc)
Who
---
id, who info
What
----
id, what info
Transaction
-----------
id, date, amount, user_id, who_id, what_id
Your foreign key constraint verbiage will vary based on database implementation, but this is a little clearer (and extendable).
You should consider STI "architecture" (single table inheritance). I.e. put all different columns into one table, and put them all under one index.
In addition, you may want to add indexes to the other columns you filter on.
What is the result schema going to look like? - If you only want the minimal columns that are in all 3 tables, then it's easy, you would just UNION the results:
SELECT Date, User, Amount from t1 where user = 'me'
UNION
SELECT Date, User, Amount from t2 where user = 'me'
UNION
SELECT Date, User, Amount from t3 where user = 'me'
Or you could 'SubClass' them
Create Table [Transaction]
(
TransactionId Integer Primary Key Not Null,
TransactionDateTime dateTime Not Null,
TransactionType Integer Not Null,
-- Other columns all transactions share
)
Create Table Type1Transactions
(
TransactionId Integer Primary Key Not Null,
-- Type 1 specific columns
)
ALTER TABLE Type1Transactions WITH CHECK ADD CONSTRAINT
[FK_Type1Transaction_Transaction] FOREIGN KEY([TransactionId])
REFERENCES [Transaction] ([TransactionId])
Repeat for other types of transactions...
What about simply leaving the unnecessary columns null and adding a TransactionType column? This would result in a simple SELECT statement.
select *
from (
select user from t1
union
select user from t2
union
select user from t3
) u
left outer join t1 on u.user=t1.user
left outer join t2 on u.user=t2.user
left outer join t3 on u.user=t3.user

SQL standard select current records from an audit log question

My memory is failing me. I have a simple audit log table based on a trigger:
ID int (identity, PK)
CustomerID int
Name varchar(255)
Address varchar(255)
AuditDateTime datetime
AuditCode char(1)
It has data like this:
ID  CustomerID  Name     Address              AuditDateTime            AuditCode
1   123         Bob      123 Internet Way     2009-07-17 13:18:06.353  I
2   123         Bob      123 Internet Way     2009-07-17 13:19:02.117  D
3   123         Jerry    123 Internet Way     2009-07-17 13:36:03.517  I
4   123         Bob      123 My Edited Way    2009-07-17 13:36:08.050  U
5   100         Arnold   100 SkyNet Way       2009-07-17 13:36:18.607  I
6   100         Nicky    100 Star Way         2009-07-17 13:36:25.920  U
7   110         Blondie  110 Another Way      2009-07-17 13:36:42.313  I
8   113         Sally    113 Yet another Way  2009-07-17 13:36:57.627  I
What would be an efficient select statement to get all the most current records between a start and end time? FYI: I for insert, D for delete, and U for update.
Am I missing anything in the audit table? My next step is to create an audit table that only records changes, yet you can extract the most recent records for the given time frame. For the life of me I cannot find it on any search engine easily. Links would work too. Thanks for the help.
Another (better?) method to keep audit history is to use a 'startDate' and 'endDate' column rather than an auditDateTime and AuditCode column. This is often the approach in tracking Type 2 changes (new versions of a row) in data warehouses.
This lets you more directly select the current rows (WHERE endDate is NULL), and you will not need to treat updates differently than inserts or deletes. You simply have three cases:
Insert: copy the full row along with a start date and NULL end date
Delete: set the End Date of the existing current row (endDate is NULL)
Update: do a Delete then Insert
Your select would simply be:
select * from AuditTable where endDate is NULL
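A sketch of the three cases as small helper functions, using SQLite from Python and replaying the audit rows from the question. The function names and the "as of" query are assumptions added for illustration; in the real system this logic would sit in the trigger.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE AuditTable (
    CustomerID INTEGER NOT NULL,
    Name       TEXT,
    Address    TEXT,
    startDate  TEXT NOT NULL,
    endDate    TEXT             -- NULL marks the current row
)""")


def audit_insert(conn, customer_id, name, address, when):
    # Insert: copy the full row with a start date and a NULL end date.
    conn.execute(
        "INSERT INTO AuditTable VALUES (?, ?, ?, ?, NULL)",
        (customer_id, name, address, when),
    )


def audit_delete(conn, customer_id, when):
    # Delete: close the existing current row.
    conn.execute(
        "UPDATE AuditTable SET endDate = ? "
        "WHERE CustomerID = ? AND endDate IS NULL",
        (when, customer_id),
    )


def audit_update(conn, customer_id, name, address, when):
    # Update: a delete followed by an insert.
    audit_delete(conn, customer_id, when)
    audit_insert(conn, customer_id, name, address, when)


audit_insert(conn, 123, "Bob", "123 Internet Way", "2009-07-17 13:18:06.353")
audit_delete(conn, 123, "2009-07-17 13:19:02.117")
audit_insert(conn, 123, "Jerry", "123 Internet Way", "2009-07-17 13:36:03.517")
audit_update(conn, 123, "Bob", "123 My Edited Way", "2009-07-17 13:36:08.050")
audit_insert(conn, 100, "Arnold", "100 SkyNet Way", "2009-07-17 13:36:18.607")
audit_update(conn, 100, "Nicky", "100 Star Way", "2009-07-17 13:36:25.920")
audit_insert(conn, 110, "Blondie", "110 Another Way", "2009-07-17 13:36:42.313")
audit_insert(conn, 113, "Sally", "113 Yet another Way", "2009-07-17 13:36:57.627")

current = conn.execute(
    "SELECT CustomerID, Name, Address FROM AuditTable "
    "WHERE endDate IS NULL ORDER BY CustomerID"
).fetchall()

# Rows that were current at a given instant (hypothetical point-in-time query).
t = "2009-07-17 13:36:05.000"
as_of = conn.execute(
    "SELECT CustomerID, Name FROM AuditTable "
    "WHERE startDate <= ? AND (endDate IS NULL OR endDate > ?)",
    (t, t),
).fetchall()
```

Note that no audit code is needed: a deleted customer is simply one with no row whose endDate is NULL.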
Anyway, here's my query for your existing schema:
declare @from datetime
declare @to datetime
select b.* from (
select
customerId,
/* latest insert or update in the window; deletes are only counted below */
max(case when auditcode in ('I', 'U') then auditdatetime end) 'auditDateTime'
from
AuditTable
where
auditdatetime between @from and @to
group by customerId
having
/* rely on "current" being defined as INSERTS > DELETES */
sum(case when auditcode = 'I' then 1 else 0 end) >
sum(case when auditcode = 'D' then 1 else 0 end)
) a
cross apply(
select top 1 customerId, name, address, auditdateTime
from AuditTable
where auditdatetime = a.auditdatetime and customerId = a.customerId
) b
References
A cribsheet for data warehouses, but has a good section on type 2 changes (what you want to track)
MSDN page on data warehousing
Ok, a couple of things for audit log tables.
For most applications, we want audit tables to be extremely quick on insertion.
If the audit log is truly for diagnostic or for very irregular audit reasons, then the quickest insertion criteria is to make the table physically ordered upon insertion time.
And this means to put the audit time as the first column of the clustered index, e.g.
create unique clustered index idx_mytable on mytable(AuditDateTime, ID)
This will allow for extremely efficient select queries upon AuditDateTime O(log n), and O(1) insertions.
If you wish to look up your audit table on a per CustomerID basis, then you will need to compromise.
You may add a nonclustered index upon (CustomerID, AuditDateTime), which will allow for O(log n) lookup of per-customer audit history; however, the cost will be the maintenance of that nonclustered index upon insertion, which is also O(log n).
However that insertion time penalty may be preferable to the table scan (that is, O(n) time complexity cost) that you will need to pay if you don't have an index on CustomerID and this is a regular query that is performed.
An O(n) lookup that locks the table during an irregular query may block writers. It is therefore sometimes in the writers' interest to be slightly slower, if that guarantees readers will not block their commits by table scanning for lack of a good index to support them.
Addition: if you are looking to restrict to a given timeframe, the most important thing first of all is the index upon AuditDateTime. And make it clustered as you are inserting in AuditDateTime order. This is the biggest thing you can do to make your query efficient from the start.
Next, if you are looking for the most recent update for every CustomerID within a given timespan, a full scan of the data within that timespan, restricted by insertion date, is required.
You will need to do a subquery upon your audit table, between the range,
select CustomerID, max(AuditDateTime) MaxAuditDateTime
from AuditTrail
where AuditDateTime >= @begin and AuditDateTime <= @end
group by CustomerID
and then incorporate that into your select query proper, eg.
select AuditTrail.* from AuditTrail
inner join
(select CustomerID, max(AuditDateTime) MaxAuditDateTime
from AuditTrail
where AuditDateTime >= @begin and AuditDateTime <= @end
group by CustomerID
) filtration
on filtration.CustomerID = AuditTrail.CustomerID and
filtration.MaxAuditDateTime = AuditTrail.AuditDateTime
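To sanity-check the "latest row per customer within a time range" join, here is a sketch using SQLite from Python, loaded with the rows from the question. The index name, the latest_in_range helper and the chosen time window are assumptions for illustration; SQLite has no clustered indexes, so only the supporting (CustomerID, AuditDateTime) index is shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE AuditTrail (
    ID            INTEGER PRIMARY KEY,
    CustomerID    INTEGER,
    Name          TEXT,
    Address       TEXT,
    AuditDateTime TEXT,
    AuditCode     TEXT
);
CREATE INDEX ix_audit_customer_time ON AuditTrail (CustomerID, AuditDateTime);

INSERT INTO AuditTrail VALUES
    (1, 123, 'Bob',     '123 Internet Way',    '2009-07-17 13:18:06.353', 'I'),
    (2, 123, 'Bob',     '123 Internet Way',    '2009-07-17 13:19:02.117', 'D'),
    (3, 123, 'Jerry',   '123 Internet Way',    '2009-07-17 13:36:03.517', 'I'),
    (4, 123, 'Bob',     '123 My Edited Way',   '2009-07-17 13:36:08.050', 'U'),
    (5, 100, 'Arnold',  '100 SkyNet Way',      '2009-07-17 13:36:18.607', 'I'),
    (6, 100, 'Nicky',   '100 Star Way',        '2009-07-17 13:36:25.920', 'U'),
    (7, 110, 'Blondie', '110 Another Way',     '2009-07-17 13:36:42.313', 'I'),
    (8, 113, 'Sally',   '113 Yet another Way', '2009-07-17 13:36:57.627', 'I');
""")


def latest_in_range(conn, begin, end):
    """Most recent audit row per customer with begin <= AuditDateTime <= end."""
    return conn.execute(
        """
        SELECT a.ID, a.CustomerID, a.Name, a.AuditCode
        FROM AuditTrail a
        INNER JOIN (
            SELECT CustomerID, MAX(AuditDateTime) AS MaxAuditDateTime
            FROM AuditTrail
            WHERE AuditDateTime >= ? AND AuditDateTime <= ?
            GROUP BY CustomerID
        ) filtration
          ON filtration.CustomerID = a.CustomerID
         AND filtration.MaxAuditDateTime = a.AuditDateTime
        ORDER BY a.CustomerID
        """,
        (begin, end),
    ).fetchall()


rows = latest_in_range(conn, "2009-07-17 13:00:00.000", "2009-07-17 13:36:20.000")
```

The result still includes the AuditCode, so a caller can drop customers whose latest row in the window is a 'D' if deleted records should not count as current.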
Another approach is using a sub select
select a.ID
, a.CustomerID
, a.Name
, a.Address
, a.AuditDateTime
, a.AuditCode
from myauditlogtable a,
(select max(s.ID) as maxid
from myauditlogtable as s
group by s.CustomerID)
as subq
where subq.maxid = a.ID;
Do you mean a start and end time, e.g. between 1am and 3am,
or a start and end date-time, e.g. 2009-07-17 13:36 to 2009-07-18 13:36?