Ordered insertion at next unused index, generic SQL

There have been various similar questions, but they either referred to one specific DB or assumed unsorted data.
In my case, the SQL should be portable if possible. The index column in question is a clustered PK containing a timestamp.
The timestamp is, 99% of the time, larger than the previously inserted value. On rare occasions, however, it can be smaller, or collide with an existing value.
I'm currently using this code to insert new values:
IF NOT EXISTS (SELECT * FROM Foo WHERE [Timestamp] = #ts) BEGIN
    INSERT INTO Foo ([Timestamp]) VALUES (#ts);
END
ELSE BEGIN
    INSERT INTO Foo ([Timestamp]) VALUES (
        (SELECT MAX(t1.[Timestamp]) - 1
         FROM Foo t1
         WHERE t1.[Timestamp] < #ts
           AND NOT EXISTS (SELECT * FROM Foo t2 WHERE t2.[Timestamp] = t1.[Timestamp] - 1))
    );
END;
If the timestamp is not used yet, just insert. Otherwise, find the closest free slot with a smaller value using an EXISTS check.
I am a novice when it comes to databases, so I'm not sure if there is a better way. I'm open to any ideas that make the code simpler and/or faster (around 100-1000 insertions per second), or to a different approach altogether.
Edit Thank you for your comments and answers so far.
To explain about the nature of my case: The timestamp is the only value ever used to sort the data, minor inconsistencies can be neglected. There are no FK relationships.
However, I agree that the flaws of my approach outweigh the reasons for using it in the first place. If I understand correctly, a simple way to fix the design is to have a regular, auto-incremented PK column in combination with the known (and renamed) timestamp column, which will be clustered.
From a performance POV, I don't see how this could be worse than the initial approach. It also simplifies the code a lot.
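A minimal sketch of that revised design, assuming SQL Server syntax and hypothetical names (FooId, RecordedAt) for the surrogate key and the renamed timestamp column:

-- Sketch only: surrogate identity PK, with the timestamp column clustered for ordering.
CREATE TABLE Foo (
    FooId      BIGINT IDENTITY(1,1) NOT NULL,
    RecordedAt DATETIME2 NOT NULL,
    CONSTRAINT PK_Foo PRIMARY KEY NONCLUSTERED (FooId)
);

-- Clustered index on the timestamp; duplicates are allowed, and insertion
-- order among duplicates is still recoverable from FooId.
CREATE CLUSTERED INDEX IX_Foo_RecordedAt ON Foo (RecordedAt);

The insert then becomes a plain single-row INSERT with no existence check.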

This method is a prescription for disaster. In the first place, you will have race conditions which will cause user annoyance when their insert won't work. Even worse, if you are adding to another table using that value as the foreign key and the whole thing is not in one transaction, you may be adding child data to the wrong record.
Further, looking for the lowest unused value is a recipe for further data integrity messes if you have not properly set up foreign key relationships and have deleted a record without deleting all of its child records. Now you have just joined the new record to child records that don't belong with it.
This manual method is flawed and unreliable. All the major databases have a way to create an autogenerated value. Use that instead; the problems have been worked out and tested.
Timestamp, by the way, is a SQL Server reserved word and should never be used as a field name.

If you can't guarantee that your PK values are unique, then it's not a good PK candidate. Especially if it's a timestamp - I'm sure Goldman Sachs would love it if their high-frequency trading programs could cause collisions on an insert and get inserted 1 microsecond earlier because the system fiddles with the timestamp of their trade.
Since you can't guarantee uniqueness of the timestamps, a better choice would be to use a plain-jane auto-increment int/bigint column, which takes care of the collision problem, gives you a nice method of getting insertion order, and you can still sort on the timestamp field to get a nice straight timeline if need be.

One idea would be to add a surrogate identity/autonumber/sequence key, so the primary key becomes (timestamp, newkey).
This way, you preserve row order and uniqueness without extra code.
To keep the code above working instead, you'd have to fiddle with lock granularity and concurrency hints, or use TRY/CATCH to retry with the alternate value (SQL Server). This removes portability. And under heavy load you'd have to keep retrying, because the alternate value may already exist by the time you insert it.
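A sketch of that composite-key idea, assuming SQL Server syntax and hypothetical column names:

-- Sketch only: the timestamp stays the leading key for ordering, and an
-- identity column makes the combination unique even when timestamps collide.
CREATE TABLE Foo (
    RecordedAt DATETIME2 NOT NULL,
    SeqNo      BIGINT IDENTITY(1,1) NOT NULL,
    CONSTRAINT PK_Foo PRIMARY KEY CLUSTERED (RecordedAt, SeqNo)
);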

A Timestamp as a key? Really? Every time a row is updated, its timestamp is modified. The SQL Server timestamp data type is intended for use in versioning rows. It is not the same as the ANSI/ISO SQL timestamp — that is the equivalent of SQL Server's datetime data type.
As far as "sorting" on a timestamp column goes: the only thing that guaranteed with a timestamp is that every time a row is inserted or updated it gets a new timestamp value and that value is a unique 8-octet binary value, different from the previous value assigned to the row, if any. There is no guarantee that that value has any correlation to the system clock.

Related

SQL autoincrement with no gaps -- workarounds / best practices

My application needs a table with an autoincrement primary key column with no gaps. As others have noted, AUTOINCREMENT implementations typically cause gaps (transaction rollbacks, deletes, etc.). Autoincrement with no gaps is straightforward to implement at the application layer, but I wonder if there's a better (more SQL'ish) way to approach this.
The reason why I prefer to have no gaps is because I imagine range-queries of the form
SELECT *
FROM chainTable
WHERE chn_id >= 10005003 AND chn_id <= 10005009
are faster than queries of the form
SELECT *
FROM chainTable
WHERE chn_id >= 10005003
ORDER BY chn_id
LIMIT 7
In my application, the selected rows were created in the same transaction. So my need that there be no gaps could be relaxed to values generated within the same transaction. So my question boils down to this:
Are AUTOINCREMENT column values generated within a transaction guaranteed to be contiguous (i.e. no gaps)?
My guess would still be "no", but I'd love to be wrong.
The overhead of managing the id in the app is going to be more expensive than letting your SQL engine handle it.
For your queries, there would be no noticeable performance difference as long as you have a proper index on that column.
However, the second query might be slightly faster because it has to check only one condition.
2 conclusions from your comments/answer:
The use case mentioned here is not apt. (No reason to expect a performance oomph; maybe the opposite.) But,
If your data model requires ascending numbers with no gaps, you're better off implementing it yourself.
Thank you all
I imagine range-queries of the form
SELECT *
FROM chainTable
WHERE chn_id >= 10005003 AND chn_id <= 10005009
are faster than queries of the form
SELECT *
FROM chainTable
WHERE chn_id >= 10005003
ORDER BY chn_id
LIMIT 7
You should try to prove this first. I don't think the first form is faster if an index is used for chn_id.
If an index is used, then the rows will be read in index order, therefore ORDER BY chn_id is a no-op. MySQL is already reading the rows in index order by chn_id, so it'll just continue reading the first 7 after the start of your range, and then stop because of the LIMIT.
I don't think you need a solution to make your auto-inc consecutive (that is, with no gaps).
For the record, it is certainly NOT the case that auto-increment will be consecutive within a transaction. If it were, then one transaction would block other sessions from inserting data.
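One way to settle the question, a sketch assuming MySQL and an index on chn_id, is to compare the plans rather than guessing:

-- Both should show a range scan on the chn_id index; the second stops after 7 rows.
EXPLAIN SELECT * FROM chainTable WHERE chn_id >= 10005003 AND chn_id <= 10005009;
EXPLAIN SELECT * FROM chainTable WHERE chn_id >= 10005003 ORDER BY chn_id LIMIT 7;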
my tables are supposed to be ledgers (append-only), and the row-number figures prominently in the data model.
The auto-increment is NOT a row number. Don't try to use it as a row number.
You will always have gaps, if you rollback or delete rows. Or if an INSERT fails because of an error for instance a constraint violation. Also depending on what brand of RDBMS you are using, the implementation of auto-inc may not guarantee against gaps even under normal usage.
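If the data model really does demand gapless numbers (the question mentions doing this at the application layer), the usual pattern is a counter table locked in the same transaction as the insert. A sketch, assuming MySQL/InnoDB and a hypothetical counter table:

-- One row per gapless sequence.
CREATE TABLE chain_counter (
    name    VARCHAR(30) PRIMARY KEY,
    next_id BIGINT NOT NULL
);

START TRANSACTION;
-- Lock the counter row; concurrent transactions serialize here.
SELECT next_id FROM chain_counter WHERE name = 'chainTable' FOR UPDATE;
UPDATE chain_counter SET next_id = next_id + 1 WHERE name = 'chainTable';
-- Use the selected value as chn_id in the INSERT, then COMMIT.
-- If the transaction rolls back, the counter rolls back too, so no gap appears.
COMMIT;

This trades concurrency for gaplessness: every insert serializes on the counter row, which is exactly the blocking behaviour described above.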

Create an index that only cares if a DateTime field is null or not?

I have many tables with a nullable DeletedDate column. Every query I write needs to check that the record is not deleted, but doesn't care when it was deleted.
I want to add an index to this column, but it seems like it would be more efficient if there was a way to index it in a way that only cared if it was null or not instead of trying to group by date. Is this possible to do or is SQL Server smart enough to handle this kind of optimization on its own?
You can add a computed column, which is a binary indicator of the deletion. However, indexing a binary column is not usually very useful.
If you want to speed SELECT queries, then including the delete flag (or deletion date) as the first column in a clustered index can be helpful. Queries that use the flag would only scan the pages with undeleted records. For this purpose, using the date itself is probably fine, assuming that the date is set to the current date in a deletion.
The downside, of course, is that the data has to physically move when it is deleted. If the deletions only go one way (i.e. no "undeletes"), then the overhead might not be too bad.
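A minimal sketch of the computed-column idea from the first paragraph, assuming SQL Server and a hypothetical table name (MyTable):

-- Persisted flag derived from the nullable DeletedDate column.
ALTER TABLE MyTable
    ADD IsDeleted AS CAST(CASE WHEN DeletedDate IS NULL THEN 0 ELSE 1 END AS BIT) PERSISTED;

-- On its own a two-valued index is rarely selective enough (as noted above),
-- but the flag can lead a composite index or the clustered index.
CREATE INDEX IX_MyTable_IsDeleted ON MyTable (IsDeleted);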

Database Design: replace a boolean column with a timestamp column?

Earlier I have created tables this way:
create table workflow (
    id            number primary key,
    name          varchar2(100 char) not null,
    is_finished   number(1) default 0 not null,
    date_finished date
);
Column is_finished indicates whether the workflow finished or not. Column date_finished is when the workflow was finished.
Then I had the idea "I don't need is_finished as I can just say: where date_finished is not null", and I designed without the is_finished column:
create table workflow (
    id            number primary key,
    name          varchar2(100 char) not null,
    date_finished date
);
(We use Oracle 10)
Is it a good or bad idea? I've heard you cannot have an index on a column with NULL values, so where date_finished is not null will be very slow on big tables.
Is it a good or bad idea?
Good idea.
You've eliminated space taken by a redundant column; the DATE column serves double duty--you know the work was finished, and when.
I've heard you can't have an index on a column with NULL values, so "where date_finished is not null" will be very slow on big tables.
That's incorrect. Oracle indexes ignore NULL values.
You can create a function based index in order to get around the NULL values not being indexed, but most DBAs I've encountered really don't like them so be prepared for a fight.
There is a right way to index NULL values, and it doesn't use an FBI. Oracle does index NULL values as long as at least one column in the index key is non-NULL; it only skips entries where every indexed column is NULL. So, you could eliminate the is_finished column and create the index like this (the constant keeps the composite key non-NULL):
CREATE INDEX workflow_finished_idx ON workflow (date_finished, 1);
Then, if you check the explain plan on this query:
SELECT count(*) FROM workflow WHERE date_finished is null;
You might see the index being used (if the optimizer is happy).
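To check whether the index is actually being picked up, a sketch using Oracle's standard plan tooling:

EXPLAIN PLAN FOR
    SELECT count(*) FROM workflow WHERE date_finished IS NULL;

-- Look for an INDEX RANGE SCAN on the new index rather than a full table scan.
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);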
Back to the original question: looking at the variety of answers here, I think there is no right answer. I may have a personal preference to eliminate a column if it is unnecessary, but I also don't like overloading the meaning of columns either. There are two concepts here:
The record has finished. is_finished
The record finished on a particular date. date_finished
Maybe you need to keep these separate, maybe you don't. When I think about eliminating the is_finished column, it bothers me. Down the road, the situation may arise where the record finished, but you don't know precisely when. Perhaps you have to import data from another source and the date is unknown. Sure, that's not in the business requirements now, but things change. What do you do then? Well, you have to put some dummy value in the date_finished column, and now you've compromised the data a bit. Not horribly, but there is a rub there. The little voice in my head is shouting YOU'RE DOING IT WRONG when I do things like that.
My advice, keep it separate. You're talking about a tiny column and a very skinny index. Storage should not be an issue here.
Rule of Representation: Fold knowledge into data so program logic can be stupid and robust.
-Eric S. Raymond
To all those who said the column is a waste of space:
Double Duty isn't a good thing in a database. Your primary goal should be clarity. Lots of systems, tools, people will use your data. If you disguise values by burying meaning inside of other columns you're BEGGING for another system or user to get it wrong.
And anyone who thinks it saves space is utterly wrong.
You'll need two indexes on that date column... one will be Function Based as OMG suggests. It will look like this:
NVL(Date_finished, TO_DATE('01-JAN-9999', 'DD-MON-YYYY'))
So to find unfinished jobs you'll have to make sure to write the where clause correctly
It will look like this:
WHERE
    NVL(Date_finished, TO_DATE('01-JAN-9999', 'DD-MON-YYYY')) = TO_DATE('01-JAN-9999', 'DD-MON-YYYY')
Yep. That's so clear. It's completely better than
WHERE
IS_Unfinished = 'YES'
The reason you'll want to have a second index on the same column is for EVERY OTHER query on that date... you won't want to use that index for finding jobs by date.
So let's see what you've accomplished with OMG's suggestion et al.
You've used more space, you've obfuscated the meaning of the data, you've made errors more likely... WINNER!
Sometimes it seems programmers are still living in the 70's, when a MB of hard drive space was a down payment on a house.
You can be space efficient about this without giving up a lot of clarity. Make Is_unfinished either 'Y' or NULL... IF you will only use that column to find 'work to do'. This will keep that index compact. It will only be as big as the number of unfinished rows (in this way you exploit the unindexed NULLs instead of being screwed by them). You add a little bit of space to your table, but overall it's less than the FBI. You need 1 byte for the column, and you'll only index the unfinished rows, so that's a small fraction of the jobs and probably stays pretty constant. The FBI will need 7 bytes for EVERY ROW, whether you're trying to find those rows or not. That index will keep pace with the size of the table, not just the size of the unfinished jobs.
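A sketch of that Y-or-NULL flag, assuming Oracle and the column name used above:

-- 'Y' while unfinished; set to NULL when the workflow completes.
ALTER TABLE workflow ADD (is_unfinished CHAR(1));

-- Only the unfinished rows get index entries, so the index stays tiny.
CREATE INDEX workflow_unfinished_idx ON workflow (is_unfinished);

-- Finding work to do:
SELECT * FROM workflow WHERE is_unfinished = 'Y';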
Reply to the comment by OMG
In his/her comment he/she states that to find unfinished jobs you'd just use
WHERE date_finished IS NULL
But in his answer he says
You can create a function based index in order to get around the NULL values not being indexed
If you follow the link he points you toward, which uses NVL to replace NULL values with some other arbitrary value, then I'm not sure what else there is to explain.
Is it a good or bad idea? I've heard you can't have an index on a column with NULL values, so "where date_finished is not null" will be very slow on big tables.
Oracle does index nullable fields, but does not index NULL values
This means that you can create an index on a nullable field, but the records holding NULL in this field won't make it into the index.
This, in turn, means that if you leave date_finished as NULL for unfinished rows, the index will be smaller, as the NULL values won't be stored in it.
So queries involving equality or range searches on date_finished will in fact perform better.
The downside of this solution, of course, is that the queries involving the NULL values of date_finished will have to revert to full table scan.
You can work around this by creating two indexes:
CREATE INDEX ix_mytable_finished ON mytable (date_finished);
CREATE INDEX ix_mytable_unfinished ON mytable (DECODE(date_finished, NULL, 1));
and use this query to find unfinished work:
SELECT *
FROM mytable
WHERE DECODE(date_finished, NULL, 1) = 1
This will behave like partitioned index: the complete works will be indexed by the first index; the incomplete ones will be indexed by the second.
If you don't need to search for complete or incomplete works, you can always get rid of the appropriate indexes.
In terms of table design, I think it's good that you removed the is_finished column as you said that it isn't necessary (it's redundant). There's no need to store extra data if it isn't necessary, it just wastes space. In terms of performance, I don't see this being a problem for NULL values. They should be ignored.
I would use NULLs and let the indexes work, as already mentioned in other answers, for all queries apart from "WHERE date_finished IS NULL" (so it depends on whether you need that query). I definitely wouldn't use outliers like year 9999 as suggested by the other answer:
you could also use a "dummy" value (such as 31 December 9999) as the date_finished value for unfinished workflows
Outliers like year 9999 affect performance, because (from http://richardfoote.wordpress.com/2007/12/13/outlier-values-an-enemy-of-the-index/):
The selectivity of a range scan is basically calculated by the CBO to be the number of values in the range of interest divided by the full range of possible values (IE. the max value minus the min value)
If you use a value like 9999, then the DB will think the range of values being stored in the field is e.g. 2008-9999 rather than the actual 2008-2010; so any range query (e.g. "between 2008 and 2009") will appear to cover a tiny % of the range of possible values, vs. actually covering about half the range. The optimizer uses this statistic to say: if the % of the possible values covered is high, probably a lot of rows will match, and then a full table scan will be faster than an index scan. It won't do this correctly if there are outliers in the data.
Good idea to remove the derivable column, as others have said.
One more thought is that by removing the column, you will avoid paradoxical states that you would need to code around, such as what happens when is_finished says "not finished" but date_finished is set to yesterday... etc.
To resolve the indexed / non-indexed columns, wouldn't it be easier to simply JOIN two tables, like this:
-- PostgreSQL
CREATE TABLE workflow(
    id   SERIAL PRIMARY KEY
  , name VARCHAR(100) NOT NULL
);
CREATE TABLE workflow_finished(
    id            INT NOT NULL PRIMARY KEY REFERENCES workflow
  , date_finished date NOT NULL
);
Thus, if a record exists in workflow_finished, this workflow's completed, else it isn't. It seems to me this is rather simple.
When querying for unfinished workflows, the query becomes:
-- Only unfinished workflow items
SELECT workflow.id
FROM workflow
WHERE NOT EXISTS(
    SELECT 1
    FROM workflow_finished
    WHERE workflow_finished.id = workflow.id);
Maybe you want the original query? With a flag and the date? Query like this then:
-- All items, with the flag and date
SELECT
    workflow.id
  , CASE
        WHEN workflow_finished.id IS NULL THEN 'f'
        ELSE 't'
    END AS is_finished
  , workflow_finished.date_finished
FROM
    workflow
    LEFT JOIN workflow_finished USING(id);
For consumers of the data, views can and should be created for their needs.
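A sketch of such a view, assuming the PostgreSQL schema above and a hypothetical view name:

-- Presents the original single-table shape (flag plus date) to consumers.
CREATE VIEW workflow_status AS
SELECT
    workflow.id
  , workflow.name
  , workflow_finished.id IS NOT NULL AS is_finished
  , workflow_finished.date_finished
FROM workflow
LEFT JOIN workflow_finished USING (id);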
As an alternative to a function-based index, you could also use a "dummy" value (such as 31 December 9999, or alternatively one day before the earliest expected date_finished value) as the date_finished value for unfinished workflows.
EDIT: Alternative dummy date value, following comments.
I prefer the single-column solution.
However, in the databases I use most often, NULLs are included in indexes, so the common case of searching for open workflows would be fast there; in your case (Oracle) it will be slower. Because searching for open workflows is likely to be one of the most common things you do, you may need the redundant column simply to support that search.
Test the performance to see if you can live with the single-column solution, and fall back to the less elegant one if necessary.

SQL Best Practices - Ok to rely on auto increment field to sort rows chronologically?

I'm working with a client who wants to add timestamps to a bunch of tables so that they may sort the records in those tables chronologically. All of the tables also have an auto incrementing integer field as their primary key (id).
The (simple) idea - save the overhead/storage and rely on the primary key to sort records chronologically. Sure this works, but I'm uncertain whether or not this approach is acceptable in sound database design.
Pros: less storage required per record, simpler VO classes, etc. etc.
Con: it implies a characteristic of that field - an otherwise simple identifier - whose definition does not in any way define or guarantee that it should/will function as such.
Assume for the sake of my question that the DB table definitions are set in stone. Still - is this acceptable in terms of best practices?
Thanks
You asked for "best practices", rather than "not terrible practices" so: no, you should not rely on an autoincremented primary key to establish chronology. One day you're going to introduce a change to the db design and that will break. I've seen it happen.
A datetime column whose default value is GETDATE() has very little overhead (about as much as an integer) and (better still) tells you not just sequence but actual date and time, which often turns out to be priceless. Even maintaining an index on the column is relatively cheap.
These days, I always put a CreateDate column on data objects connected to real-world events (such as account creation).
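A sketch of such a column, assuming SQL Server syntax and a hypothetical table:

-- CreateDate is filled in automatically at insert time.
CREATE TABLE Account (
    AccountId  INT IDENTITY(1,1) PRIMARY KEY,
    Name       NVARCHAR(100) NOT NULL,
    CreateDate DATETIME NOT NULL DEFAULT GETDATE()
);

-- Chronological sort, with the actual time also available for filtering.
SELECT AccountId, Name, CreateDate
FROM Account
ORDER BY CreateDate, AccountId;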
Edited to add:
If exact chronology is crucial to your application, you can't rely on either auto-increment or timestamps (since there can always be identical timestamps, no matter how high the resolution). You'll probably have to make something application-specific instead.
Further to egrunin's answer, a change to the persistence or processing logic of these rows may cause rows to be inserted into the database in a non-sequential or nondeterministic manner. You may implement a parallelized file processor that throws a row into the DB as soon as the thread finishes transforming it, which may be before another thread has finished processing a row that occurred earlier in the file. Using an ORM for record persistence may result in a similar behavior; the ORM may just maintain a "bag" (unordered collection) of object graphs awaiting persistence, and grab them at random to persist them to the DB when it's told to "flush" its object buffer.
In either case, trusting the autoincrement column to tell you the order in which records came into the SYSTEM is bad juju. It may or may not be able to tell you the order in which records hit the DATABASE; that depends on the DB implementation.
You can achieve the same goal in the short term by sorting on the ID column. This would be better than adding additional data to achieve the same result. I don't think it would be confusing for anyone to look at the data table and know that it's chronological when they see that it's an identity column.
There are a few drawbacks or limitations that I see however.
The chronological sort can be messed up if someone re-seeds the column
Chronology for a date period cannot be ascertained without the additional data
This setup prevents you from sorting chronologically if the system ever accepts new, non-chronological data
Based on the realistic evaluation of these "limitations" you should be able to advise a proper approach.
Auto-incrementing ID will give you an idea of order as Brad points out, but do it right - if you want to know WHEN something was added, have a datetime column. Then you can not only chronologically sort but also apply filters.
Don't do it. You should never rely on the actual value of your ID column. Treat it like a black box, only useful for doing key lookups.
You say "less storage required per record," but how important is that? How big are the rows we're talking about? If you've got 200-byte rows, another 4 bytes probably isn't going to matter much.
Don't optimize without measuring. Get it working right first, and THEN optimize.
#MadBreaker
There are two separate things here: if you need to know the order, you create an order column with autoincrement; however, if you want to know the date and time a row was inserted, you use datetime2.
Chronological order can be guaranteed if you don't allow updates or deletes, but if you want time-based control over selects you should use datetime2.
You didn't mention whether you are running on a single DB or clustered. If you are clustered, be wary of increment implementations, as you are not always guaranteed things will come out in the order you would naturally think. For example, Oracle sequences can cache groups of next values (depending on your setup) and give you a 1, 3, 2, 4, 5 sort of list...

MySQL SELECT statement using Regex to recognise existing data

My web application parses data from an uploaded file and inserts it into a database table. Due to the nature of the input data (bank transaction data), duplicate data can exist from one upload to another. At the moment I'm using hideously inefficient code to check for the existence of duplicates by loading all rows within the date range from the DB into memory, and iterating over them and comparing each with the uploaded file data.
Needless to say, this can become very slow as the data set size increases.
So, I'm looking to replace this with a SQL query (against a MySQL database) which checks for the existence of duplicate data, e.g.
SELECT count(*) FROM transactions WHERE desc = ? AND dated_on = ? AND amount = ?
This works fine, but my real-world case is a little bit more complicated. The description of a transaction in the input data can sometimes contain erroneous punctuation (e.g. "BANK 12323 DESCRIPTION" can often be represented as "BANK.12323.DESCRIPTION") so our existing (in memory) matching logic performs a little cleaning on this description before we do a comparison.
Whilst this works in memory, my question is can this cleaning be done in a SQL statement so I can move this matching logic to the database, something like:
SELECT count(*) FROM transactions WHERE CLEAN_ME(desc) = ? AND dated_on = ? AND amount = ?
Where CLEAN_ME is a stored function which strips the field of the erroneous punctuation.
Obviously the cleanest (no pun intended!) solution would be to store the already cleaned data in the database (either in the same column, or in a separate column), but before I resort to that I thought I'd try and find out whether there's a cleverer way around this.
Thanks a lot
can this cleaning be done in a SQL statement
Yes, you can write a stored function to do it in the database layer:
mysql> CREATE FUNCTION clean_me (s VARCHAR(255))
-> RETURNS VARCHAR(255) DETERMINISTIC
-> RETURN REPLACE(s, '.', ' ');
mysql> SELECT clean_me('BANK.12323.DESCRIPTION');
BANK 12323 DESCRIPTION
This will perform very poorly across a large table though.
Obviously the cleanest (no pun intended!) solution would be to store the already cleaned data in the database (either in the same column, or in a separate column), but before I resort to that I thought I'd try and find out whether there's a cleverer way around this.
No, as far as databases are concerned the cleanest way is always the cleverest way (as long as performance isn't awful).
Do that, and add indexes to the columns you're doing bulk compares on, to improve performance. If it's actually intrinsic to the type of data that desc/dated-on/amount are always unique, then express that in the schema by making it a UNIQUE index constraint.
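A sketch of that approach, assuming MySQL and a hypothetical cleaned_desc column kept alongside the original description:

-- Store the cleaned description once, at insert time, and let a unique index
-- enforce "no duplicate transactions".
ALTER TABLE transactions ADD COLUMN cleaned_desc VARCHAR(255) NOT NULL DEFAULT '';

CREATE UNIQUE INDEX ux_transactions_dedupe
    ON transactions (cleaned_desc, dated_on, amount);

Note that this treats two genuinely identical transactions (same description, date and amount) as duplicates, which may or may not be what you want for bank data.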
The easiest way to do that is to add a unique index on the appropriate columns and to use ON DUPLICATE KEY UPDATE. I would further recommend transforming the file into a csv and loading it into a temporary table to get the most out of mysql's builtin functions, which are surely faster than anything that you could write yourself - if you consider that you would have to pull the data into your own application, while mysql does everything in place.
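A sketch of that staging approach, assuming MySQL, the kind of unique key described above, and a hypothetical CSV layout and path:

-- Load the upload into a temporary staging table first.
CREATE TEMPORARY TABLE upload_stage (
    description VARCHAR(255),
    dated_on    DATE,
    amount      DECIMAL(10,2)
);

LOAD DATA LOCAL INFILE '/tmp/upload.csv'
INTO TABLE upload_stage
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
IGNORE 1 LINES;

-- Clean and insert in one pass; rows hitting the unique key become no-ops.
INSERT INTO transactions (cleaned_desc, dated_on, amount)
SELECT REPLACE(description, '.', ' '), dated_on, amount
FROM upload_stage
ON DUPLICATE KEY UPDATE amount = VALUES(amount);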
The cleanest way is indeed to make sure only correct data is in the database.
In this example the "BANK.12323.DESCRIPTION" would be returned by:
SELECT count(*) FROM transactions
WHERE desc LIKE 'BANK%12323%DESCRIPTION' AND dated_on = ? AND amount = ?
But this might impose performance issues when you have a lot of data in the table.
Another way that you could do it is as follows:
Clean the description before inserting.
Create a primary key for the table that is a combination of the columns that uniquely identify the entry. Sounds like that might be the cleaned description, date and amount.
Use either the 'replace' or 'on duplicate key' syntax, whichever is more appropriate. 'replace' actually replaces the existing row in the db with the updated one when a unique key conflict occurs, e.g:
REPLACE INTO transactions (desc, dated_on, amount) values (?,?,?)
'on duplicate key' allows you to specify which columns to update on a duplicate key error:
INSERT INTO transactions (desc, dated_on, amount) VALUES (?,?,?)
ON DUPLICATE KEY UPDATE amount = amount
By using the multi-column primary key, you will gain a lot of performance since primary key lookups are usually quite fast.
If you prefer to keep your existing primary key, you could also create a unique index on those three columns.
Whichever way you choose, I would recommend cleaning the description before going into the db, even if you also store the original description and just use the cleaned one for indexing.