My application needs a table with an autoincrement primary key column with no gaps. As others have noted, AUTOINCREMENT implementations typically cause gaps (transaction rollbacks, deletes, etc.). Autoincrement with no gaps is straightforward to implement at the application layer, but I wonder if there's a better (more SQL-ish) way to approach this.
The reason why I prefer to have no gaps is because I imagine range-queries of the form
SELECT *
FROM chainTable
WHERE chn_id >= 10005003 AND chn_id <= 10005009
are faster than queries of the form
SELECT *
FROM chainTable
WHERE chn_id >= 10005003
ORDER BY chn_id
LIMIT 7
In my application, the selected rows were created in the same transaction, so my no-gaps requirement could be relaxed to values generated within the same transaction. So my question boils down to this:
Are AUTOINCREMENT column values generated within a transaction guaranteed to be contiguous (i.e. no gaps)?
My guess would still be "no", but I'd love to be wrong.
The overhead of managing the id in the app is going to be more expensive than letting your SQL engine handle it.
For your queries, there would be no noticeable performance difference as long as you have a proper index on that column.
If anything, the second query might be slightly faster because it has only one condition to check.
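The "no noticeable difference" claim can be checked directly by comparing query plans. A minimal SQLite sketch (table name and ids borrowed from the question, data made up; other engines will show different plan text):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE chainTable (chn_id INTEGER PRIMARY KEY, payload TEXT)")
con.executemany("INSERT INTO chainTable VALUES (?, ?)",
                [(10005000 + i, "row%d" % i) for i in range(20)])

# Plan for the closed-range form of the query
range_plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM chainTable "
    "WHERE chn_id >= 10005003 AND chn_id <= 10005009").fetchall()

# Plan for the ORDER BY ... LIMIT form
limit_plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM chainTable "
    "WHERE chn_id >= 10005003 ORDER BY chn_id LIMIT 7").fetchall()

for _, _, _, detail in range_plan + limit_plan:
    print(detail)
```

On SQLite, both forms resolve to an index search on the primary key, and the LIMIT form needs no separate sort step (no "USE TEMP B-TREE FOR ORDER BY" line appears), which matches this answer's point.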
Two conclusions from your comments/answers:
The use case mentioned here is not apt. (No reason to expect a performance oomph; maybe the opposite.) But,
If your data model genuinely requires ascending numbers with no gaps, you're better off implementing it yourself.
Thank you all
I imagine range-queries of the form
SELECT *
FROM chainTable
WHERE chn_id >= 10005003 AND chn_id <= 10005009
are faster than queries of the form
SELECT *
FROM chainTable
WHERE chn_id >= 10005003
ORDER BY chn_id
LIMIT 7
You should try to prove this first. I don't think the first form is faster if an index is used for chn_id.
If an index is used, then the rows will be read in index order, therefore ORDER BY chn_id is a no-op. MySQL is already reading the rows in index order by chn_id, so it'll just continue reading the first 7 after the start of your range, and then stop because of the LIMIT.
I don't think you need a solution to make your auto-inc consecutive (that is, with no gaps).
For the record, it is certainly NOT the case that auto-increment will be consecutive within a transaction. If it were, then one transaction would block other sessions from inserting data.
my tables are supposed to be ledgers (append-only), and the row-number figures prominently in the data model.
The auto-increment is NOT a row number. Don't try to use it as a row number.
You will always have gaps, if you rollback or delete rows. Or if an INSERT fails because of an error for instance a constraint violation. Also depending on what brand of RDBMS you are using, the implementation of auto-inc may not guarantee against gaps even under normal usage.
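A quick way to see a permanent gap appear, using SQLite (one of the brands where the AUTOINCREMENT keyword explicitly never reuses ids):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (id INTEGER PRIMARY KEY AUTOINCREMENT, v TEXT)")
con.executemany("INSERT INTO t (v) VALUES (?)", [("a",), ("b",), ("c",)])

con.execute("DELETE FROM t WHERE id = 3")       # drop the newest row
con.execute("INSERT INTO t (v) VALUES ('d')")   # AUTOINCREMENT will not reuse 3

ids = [r[0] for r in con.execute("SELECT id FROM t ORDER BY id")]
print(ids)  # [1, 2, 4] -- the gap at 3 is permanent
```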
Related
Suppose the following rows are inserted in chronological order into a table:
row1, row2, row3, row4, ..., row1000, row1001.
After a while, we delete/remove the latest row, row1001.
As in this post: How to get Top 5 records in SqLite?
If the below command is run:
SELECT * FROM <table> LIMIT 1;
Will it reliably return row1000?
If not, is there an efficient way to get the latest row(s) without
traversing all the rows? -- i.e. without using a combination of
ORDER BY and DESC.
[Note: For now I am using "SQLite", but it will be interesting for me to know about SQL in general as well.]
You're misunderstanding how SQL works. You're thinking row-by-row, which is wrong. SQL does not "traverse rows" as per your concern; it operates on data as sets.
Others have pointed out that a relational database cannot be assumed to have any particular ordering, so you must use ORDER BY to specify the ordering explicitly.
However (and not mentioned yet): to ensure the query performs efficiently, you need to create an appropriate index.
Whether you have an index or not, the correct query is:
SELECT <cols>
FROM <table>
ORDER BY <sort-cols> [DESC] LIMIT <no-rows>
Note that if you don't have an index the database will load all data and probably sort in memory to find the TOP n.
If you do have the appropriate index, the database will use the best index available to retrieve the TOP n rows as efficiently as possible.
Note that the SQLite documentation is very clear on the matter. The section on ORDER BY explains that ordering is otherwise undefined, and nothing in the section on LIMIT contradicts this (it simply constrains the number of rows returned).
If a SELECT statement that returns more than one row does not have an ORDER BY clause, the order in which the rows are returned is undefined.
This behaviour is also consistent with the ANSI standard and all major SQL implementations. Note that any database vendor that guaranteed any kind of ordering would have to sacrifice performance to the detriment of queries trying to retrieve data but not caring about order. (Not good for business.)
As a side note, flawed assumptions about ordering are an easy mistake to make (much like flawed assumptions about uninitialised local variables).
RDBMS implementations are very likely to make ordering appear consistent. They follow a certain algorithm for adding data, a certain algorithm for retrieving data. And as a result, their operations are highly repeatable (it's what we love (and hate) about computers). So things repeatably look the same.
Theoretical examples:
Inserting a row results in the row being added to the next available free space. So data appears sequential. But an update would have to move the row to a new location if it no longer fits.
The DB engine might retrieve data sequentially from clustered index pages and seem to use the clustered index as the 'natural ordering'... until one day a page split puts one of the pages in a different location. Or a new version of the DBMS might cache certain data for performance, and suddenly the order changes.
Real-world example:
The MS SQL Server 6.5 implementation of GROUP BY had the side effect of also sorting by the group-by columns. When MS implemented some performance improvements (in version 7 or 2000), GROUP BY would by default return data in a hashed order. Many people blamed MS for breaking their queries, when in fact they had made false assumptions and failed to ORDER BY their results as needed.
This is why the only guarantee of a specific ordering is to use the ORDER BY clause.
No. Table records have no inherent order. So it is undefined which row(s) to get with a LIMIT clause without an ORDER BY.
SQLite in its current implementation may return the latest inserted row, but even if that is the case you must not rely on it.
Give the table a datetime column or some sort key if record order is important to you.
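For the scenario in the question, the portable way to get "the latest row" is an explicit ORDER BY ... DESC LIMIT 1 on a sort key. A SQLite sketch (schema invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT)")
con.executemany("INSERT INTO events (name) VALUES (?)",
                [("row%d" % i,) for i in range(1, 1002)])   # row1 .. row1001
con.execute("DELETE FROM events WHERE name = 'row1001'")    # remove the latest row

# Explicit ordering: guaranteed to return row1000 regardless of storage order
latest = con.execute(
    "SELECT name FROM events ORDER BY id DESC LIMIT 1").fetchone()[0]
print(latest)  # row1000
```

A bare `SELECT * FROM events LIMIT 1` might happen to return something convenient today, but only the ORDER BY version is guaranteed.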
In SQL, data is stored in tables unordered. What comes out first one day might not be the same the next.
ORDER BY, or some other specific selection criteria is required to guarantee the correct value.
It is my understanding that a SELECT is not guaranteed to always return the same result. For example, the following query is not guaranteed to return the same result every time:
select * from myTable offset 10000 limit 100
My question is: if myTable is not changed between executions of the SELECT (no deletions or inserts), can I rely on it returning the same result set every time?
Or, to put it another way: if my database is locked for changes, can I rely on the SELECT returning the same result?
I am using postgresql.
Tables and result sets (without order by) are simply not ordered. It really is that simple.
In some databases, under some circumstances, the order will be consistent. However, you should never depend on this. Subsequent releases, for instance, might invalidate the query.
For me, I think the simplest way to understand this is by thinking of parallel processing. When you execute a query, different threads might go out and start to fetch data; which values are returned first depends on non-reproducible factors.
Another way to think of it is to consider a page cache that already has pages in memory -- probably from the end of the table. The SQL engine could read the pages in any order (although in practice this doesn't really happen).
Or, some other query might have a row or page lock, so that page gets skipped when reading the records.
So, just accept that unordered means unordered. Add an ORDER BY if you want the data in a particular order. If you order by the clustered index key, there is basically no performance hit.
I have an SQL query which fetches the first N rows in a table which is designed as a low-level queue.
select top N * from my_table where status = 0 order by date asc
The intention behind this query is as follows:
First, this question is intended to be database agnostic, as my implementation will support SQL Server, Oracle, DB2 and Sybase. The SQL syntax above ("top N") is just an example.
The table can contain millions of rows.
N is a relatively small number in comparison, e.g. 100.
status is 0 when the row is in the queue. Later it is changed to 1 to indicate that it is in processing. After processing, the row is deleted. So it is expected that at least 90% of the rows in the table will have status 0.
rows in the table should be fetched according to their date, hence the order by clause.
What is the optimal index to make this query run fastest?
I initially thought the index should be on (date, status), but I am not sure about it anymore. Since the status column will contain mostly zeros, is there an added-value to it? Will it be sufficient to index by (date) alone?
Or maybe it should be (status, date)?
I don't think there is an efficient solution that will be RDBMS-independent. For example, Oracle has bitmap indexes and SQL Server has filtered indexes, and I don't see reasons not to use them if, say, MySQL or SQLite has nothing similar. Also, historically SQL Server implements clustered tables (or IOTs in the Oracle world) much better than Oracle does, so having a clustered index on the date column may work perfectly for SQL Server but not for Oracle.
I'd rather change the approach a bit. If, as you say, 90% of rows don't satisfy the status=0 condition, why not try refactoring the schema and adding a new table (or materialized view) that holds only the records you are interested in? The number of new programmable objects required for keeping that table up to date and merging data with the original table is relatively small, even if the RDBMS doesn't support materialized views directly. Also, if it's possible to redesign the underlying logic so rows are never updated, only inserted or deleted, it will help avoid lock contention, and as a result the whole system will perform better.
Have a clustered index on Date and a non-clustered index on Status.
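One way to evaluate the (status, date) option from the question is to inspect the query plan. A SQLite sketch (planners differ per engine, so treat this as illustrative only; table and data are made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE q (id INTEGER PRIMARY KEY, status INTEGER, date TEXT)")
con.execute("CREATE INDEX idx_q_status_date ON q (status, date)")
con.executemany("INSERT INTO q (status, date) VALUES (?, ?)",
                [(1 if i % 10 == 0 else 0, "2024-01-%02d" % (i % 28 + 1))
                 for i in range(300)])

# The equality on status plus the index's date ordering lets the engine
# read the first 100 matching rows without a separate sort step.
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM q "
    "WHERE status = 0 ORDER BY date LIMIT 100").fetchall()
details = " ".join(row[3] for row in plan)
print(details)
```

With (status, date), the plan is a single index search with no "USE TEMP B-TREE FOR ORDER BY". An index on (date) alone would also avoid the sort but must filter rows as it scans; which wins on a real workload still needs measuring on the target RDBMS.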
There have been various similar questions, but they either referred to a too specific DB or assumed unsorted data.
In my case, the SQL should be portable if possible. The index column in question is a clustered PK containing a timestamp.
The timestamp is 99% of the time larger than previously inserted value. On rare occasions however, it can be smaller, or collide with an existing value.
I'm currently using this code to insert new values:
IF NOT EXISTS (SELECT * FROM Foo WHERE Timestamp = #ts)
BEGIN
    INSERT INTO Foo ([Timestamp]) VALUES (#ts);
END
ELSE
BEGIN
    INSERT INTO Foo ([Timestamp]) VALUES (
        (SELECT MAX(t1.Timestamp) - 1
         FROM Foo t1
         WHERE t1.Timestamp < #ts
           AND NOT EXISTS (SELECT * FROM Foo t2
                           WHERE t2.Timestamp = t1.Timestamp - 1))
    );
END;
If the row is unused yet, just insert. Else, find the closest free row with a smaller value using an EXISTS check.
I am a novice when it comes to databases, so I'm not sure if there is a better way. I'm open for any ideas to make the code simpler and/or faster (around 100-1000 insertions per second), or to use a different approach altogether.
Edit: Thank you for your comments and answers so far.
To explain about the nature of my case: The timestamp is the only value ever used to sort the data, minor inconsistencies can be neglected. There are no FK relationships.
However, I agree that my approach is flawed, outweighing the reasons to use the presented idea in the first place. If I understand correctly, a simple way to fix the design is to have a regular, autoincremented PK column in combination with the known (and renamed) timestamp column, which will be clustered.
From a performance POV, I don't see how this could be worse than the initial approach. It also simplifies the code a lot.
This method is a prescription for disaster. In the first place you will have race conditions, which will cause user annoyance when their insert doesn't work. Even worse, if you are adding to another table using that value as the foreign key and the whole thing is not in one transaction, you may be adding child data to the wrong record.
Further, looking for the lowest unused value is a recipe for further data-integrity messes if you have not properly set up foreign key relationships and have deleted a record without deleting all of its child records. Now you have just joined to records which don't belong with the new record.
This manual method is flawed and unreliable. All the major databases have a way to create an autogenerated value. Use that instead, the problems have been worked out and tested.
Timestamp, BTW, is a SQL Server reserved word and should never be used as a field name.
If you can't guaranteed that your PK values are unique, then it's not a good PK candidate. Especially if it's a timestamp - I'm sure Goldman Sachs would love it if their high-frequency trading programs could cause collisions on an insert and get inserted 1 microsecond earlier because the system fiddles the timestamp of their trade.
Since you can't guarantee uniqueness of the timestamps, a better choice would be to use a plain-jane auto-increment int/bigint column, which takes care of the collision problem, gives you a nice method of getting insertion order, and you can still sort on the timestamp field to get a nice straight timeline if need be.
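The plain auto-increment layout suggested here can be sketched in SQLite (column names are made up; SQLite stands in for whatever RDBMS is actually targeted):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""
    CREATE TABLE foo (
        id       INTEGER PRIMARY KEY AUTOINCREMENT,  -- collision-free surrogate PK
        event_ts TEXT NOT NULL                       -- the renamed timestamp column
    )""")
con.execute("CREATE INDEX idx_foo_ts ON foo (event_ts)")  # for timeline sorting

# Duplicate timestamps no longer break inserts:
con.executemany("INSERT INTO foo (event_ts) VALUES (?)",
                [("2024-01-01 00:00:00.000",)] * 3)

rows = con.execute("SELECT id, event_ts FROM foo ORDER BY event_ts, id").fetchall()
print(rows)
```

The id preserves insertion order even when timestamps collide, and sorting on (event_ts, id) gives a stable timeline.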
One idea would be to add a surrogate identity/autonumber/sequence key, so the primary key becomes (timestamp, newkey).
This way, you preserve row order and uniqueness without code
To run the code above, you'd need to fiddle with lock granularity and concurrency hints, or use TRY/CATCH to retry with the alternate value (SQL Server). This removes portability. What's more, under heavy load you'd have to keep retrying, because the alternate value may itself already exist.
A timestamp as a key? Really? Every time a row is updated, its timestamp is modified. The SQL Server timestamp data type is intended for use in versioning rows. It is not the same as the ANSI/ISO SQL timestamp, which is the equivalent of SQL Server's datetime data type.
As far as "sorting" on a timestamp column goes: the only thing that is guaranteed with a timestamp is that every time a row is inserted or updated it gets a new timestamp value, and that value is a unique 8-octet binary value, different from the previous value assigned to the row, if any. There is no guarantee that the value has any correlation to the system clock.
In a certain app I must constantly query data that is likely to be amongst the last inserted rows. Since this table is going to grow a lot, I wonder if there's a standard way of optimizing the queries by making them start the lookup at the table's end. I think I would get the same optimization if the database stored data for the table in a stack-like structure, so the last inserted rows would be searched first.
The SQL spec doesn't say anything about maintaining insertion order. In practice, most decent DBs don't maintain it either. So it stops there. Sorting the table first isn't going to make it faster. Just index the column(s) of interest (at least the ones you use in the WHERE clause).
One of the "tenets" of a proper RDBMS is that this kind of matter shouldn't concern you or anyone else using the DB.
The DB engine is "free" to use whatever method it wants to store and retrieve records, so if you want to enforce a "top" behaviour, do what others suggested: add a timestamp field to the table (or tables), add an index on it, and use it as a sort and/or query criterion (e.g. you poll the table each minute and ask for records with timestamp >= systime - 1 minute).
There is no standard way.
In some databases you can specify the sort order on an index.
SQL Server allows you to write ASC or DESC on an index:
[ ASC | DESC ]
Determines the ascending or descending sort direction for the particular index column. The default is ASC.
In MySQL you can also write ASC or DESC when you create the index but currently this is ignored. It might be implemented in a future version.
Add a counter or a time field in your table, sort on it and get top rows.
In other words: You should forget the idea that SQL tables are accessed in any particular order by default. A seqscan does not mean the oldest rows will be searched first, only that all rows will be checked. If you want to optimize some search you add indexes on some fields. What you are looking for is probably indexes.
If your data is indexed, it won't matter. The index is doing a binary search, not a sequential scan.
Unless you're doing TOP 1 (or something like it), the SELECT will have to scan the whole table or index anyway.
According to data independence you shouldn't care. That said, a clustered index would probably suit your needs if you typically look for a date range. (Sorting asc/desc shouldn't matter, but you should try it out.)
If you find that you really need it you can also shard your database to increase perf on the most recently added data.
If you have enough rows that it's actually becoming a problem, and you know how many "the most recently inserted rows" should be, you could try a round-about method.
Note: even for pretty big tables, this is less efficient, but once your main table gets big enough, I've seen this work wonders for user-facing performance.
Create a "staging" table that exactly mimics your main table's structure. Whenever you insert into your main table, also insert into the "staging" area. Limit the "staging" area to n rows by using a trigger that deletes the lowest-id row whenever the row count exceeds your arbitrary maximum (say, 10,000 or whatever your limit is).
Then, queries can hit that smaller table first when looking for information. Since the table is arbitrarily limited to the last n rows, it's only looking at the most recent data. Only if that fails to find a match would your query (actually, at this point, a stored procedure because of the decision making) hit your main table.
Some Gotchas:
1) Make sure your trigger(s) is (are) set up properly to maintain the correct concurrency between your "main" and "staging" tables.
2) This can quickly become a maintenance nightmare if not handled properly, and depending on your scenario it can be a little finicky.
3) I cannot stress enough that this is only efficient/useful in very specific scenarios. If yours doesn't match it, use one of the other answers.
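The staging-table idea could be sketched with a SQLite trigger (cap, table and column names are invented for the sketch; a production version needs the concurrency care noted in gotcha 1):

```python
import sqlite3

CAP = 5  # keep only the newest CAP rows in staging (10,000 in the text's example)

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE main_t  (id INTEGER PRIMARY KEY, payload TEXT);
    CREATE TABLE staging (id INTEGER PRIMARY KEY, payload TEXT);

    -- Mirror every insert into staging, then trim the oldest overflow row.
    CREATE TRIGGER mirror AFTER INSERT ON main_t BEGIN
        INSERT INTO staging VALUES (NEW.id, NEW.payload);
        DELETE FROM staging
        WHERE id = (SELECT MIN(id) FROM staging)
          AND (SELECT COUNT(*) FROM staging) > %d;
    END;
""" % CAP)

for i in range(1, 21):
    con.execute("INSERT INTO main_t VALUES (?, ?)", (i, "p%d" % i))

staged = [r[0] for r in con.execute("SELECT id FROM staging ORDER BY id")]
print(staged)  # only the CAP most recent ids survive
```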
ISO/ANSI Standard SQL does not consider optimization at all. For example the widely recognized CREATE INDEX SQL DDL does not appear in the Standard. This is because the Standard makes no assumptions about the underlying storage medium and nor should it. I regularly use SQL to query data in text files and Excel spreadsheets, neither of which have any concept of database indexes.
You can't do this.
However, there is a way to do something that might be even better. Depending on the design of your table, you should be able to create an index that keeps things in almost the order of entry. For example, if you adopt the common practice of creating an id field that autoincrements, then that index is just about in chronological order.
Some RDBMSes permit you to declare a backwards index, that is, one that descends instead of ascending. If you create a backwards index on the ID field, and if the optimizer uses that index, it will look at the most recent entries first. This will give you a rapid response for the first row.
The next step is to get the optimizer to use the index. You need to use explain plan to see if the index is being used. If you ask for the rows in order of id descending, the optimizer will almost certainly use the backwards index. If not you may be able to use hints to guide the optimizer.
If you still need to avoid reading all the rows in order to avoid wasting time, you may be able to use the LIMIT feature to declare that you only want, say 10 rows, and no more, or 1 row and no more. That should do it.
Good luck.
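In SQLite specifically, the integer primary key is the table's b-tree key, so ORDER BY id DESC LIMIT n is answered by walking that b-tree backwards, with no extra index and no sort step. A sketch (table name made up; other engines may need an explicit descending index as described above):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE log (id INTEGER PRIMARY KEY AUTOINCREMENT, msg TEXT)")
con.executemany("INSERT INTO log (msg) VALUES (?)",
                [("m%d" % i,) for i in range(100)])

# Most recent entries first, stopping after 3 rows
newest = [r[0] for r in con.execute(
    "SELECT msg FROM log ORDER BY id DESC LIMIT 3")]
print(newest)  # ['m99', 'm98', 'm97']

# The plan confirms no sort step is needed
plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT msg FROM log ORDER BY id DESC LIMIT 3").fetchall()
details = " ".join(row[3] for row in plan)
print(details)
```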
If your table has a create date, then I'd reverse sort by that and take the top 1.