It's my understanding that the quickest way to access a particular row is by its ROWID. In INFORMIX-SE 7.3, when I do a SELECT ROWID FROM table I notice that its values are type SERIAL[INT]. In Oracle, they are SERIAL[HEX]. Has anyone ever used ROWID for any practical use? If I wanted to locate the most recent row added to a table, would SELECT MAX(ROWID) FROM table be quicker and more reliable than say SELECT MAX(pk_id) FROM table, where pk_id is a user-created SERIAL column? What other practical use have you ever put ROWID to work for you?
Your understanding is not necessarily correct. The ROWID property in SQL Server is primarily intended for replication as a way to guarantee that the table has a single-field unique index value. This way the replication system does not have to account for any specific primary key semantics that your design might employ, while still being able to identify every row by a single column. No table is required to have a ROWID column unless it is part of a merge replication publication, so it's not something that every table has, unlike Oracle. It also doesn't serve the same purpose (they're Guid's--or uniqueidentifier in T-SQL parlance--on SQL Server and are random, not sequential integers like they are on Oracle).
The quickest way to retrieve a row from a table is by accessing the row via the clustered index. A table can only have one clustered index, as it's what determines the physical ordering of the rows on the disk. Furthermore, if the table has a primary key, the primary key is the clustered index. While it's possible to declare a table without a primary key and assign the clustered index to something else, I can't (off the top of my head) fathom a reason why you'd want to do this (or, for practical purposes, how you can justify having a table without a primary key).
In short, that means that the quickest way to retireve a row is by using the primary key of the table. Unless the ROWID column is the primary key (which is certainly possible to do), then it isn't the fastest way.
Well, I can only really tell how it works in Oracle, using it for 19+ years :-)
Put simply, ROWID is an internel identification, that acts like an physical address. It can be split into database file no, block no, and row no. So obtaining the ROWID makes the db engine able to look the data up in a single direct IO.
In an index the B* tree will have ROWIDs on the leaf nodes pointing directly the location of the data, e.g. in a primary index.
Being an physical address it is submit to change on relocation on disk, which can happen after restoring a backup, rebuilding a table, or export/import of data.
The db engine can do some tricks, e.g. when moving a pluggable tablespace from one instance to another to avoid rebuilding indexes, however this is strickly db engine internals.
So to keep out of trouble leave the ROWID for internal use for the db engine. Storing the ROWID for your own usage will eventually lead to inconsistency.
In Informix-SE, the ROWID is basically the record number within the C-ISAM file that is used to hold the table. SE only deals in fixed size records, of course (no VARCHAR data).
In Informix Dynamic Server, the ROWID is (a) more complex (page number plus slot number) and (b) not always present (fragmented tables do not expose the ROWID, unless the table was created WITH ROWIDS, in which case the ROWID is a physical column that is indexed after all) - be aware!
If no data is ever deleted and you are using SE, then selecting the row with the maximum ROWID will be the most recently added row. If a row is deleted, then that space will eventually be reused, and then the most recently added row ceases to be the one with the maximum ROWID. (IDS does not make that promise for a variety of complex reasons.)
The SE implementation of ROWID does not store the ROWID in the table, and does not create an index on it, but it does not need an index because it knows the formula for where to go to find the data (offset in data file = ROWID * RowSize), give or take a plus one on RowSize or a minus one ROWID or both.
As to practical use for ROWID, the style that was used before fragmentation was added to IDS was to select a list of ROWID values for the records of interest in the table, maintaining that list in memory:
SELECT ROWID
FROM InterestingTable
WHERE SomeColumn = xxx
AND AnotherColumn < yyy;
Then, the program could present these rows one at time, fetching the current data via the stored ROWID. The ROWID for a record would not change while a program was running. This ensured that the current data - whether edits from the current user or someone else - was shown when the record was displayed.
There's a program you're familiar with, ISQL Perform, that behaves like this. And it does not work with fragmented tables (necessarily in IDS; SE does not support fragmented tables) unless they are created with a physical ROWID column with the WITH ROWIDS clause.
Perhaps the term "RDBMS" rather than "an SQL server"?
Attaching any purpose to a ROWID is a bad idea. Particularly if you're in the habit of dropping and recreating tables. If your table needs a SERIAL PK, then that's what it should have. No good can come of using ROWIDs within your application.
Related
I have a large domain set of tables in a database - over 100 tables. Every single one uses a uniqueidentifier as a PK.
I'm realizing now that my mistake is that these are also by default, the clustered index.
Consider a table with this type of structure:
Orders
Id (uniqueidentifier) Primary Key
UserId (uniqueidentifier)
.
.
.
.
Other columns
Most queries are going to be something like "Get top 10 orders for user X sorted by OrderDate".
In this case, would it make sense to create a clustered index on UserId,Id...that way the data is physically stored sorted by UserId?
I'm not too concerned about Inserts and Updates - those will be few enough that performance loss there isn't a big deal. I'm mostly concerned with READs.
A clustered index means that data is physically stored in the order of the values. By default, the primary key is used for the clustered index.
The problem with GUIDs is that they are generated is (essentially) random order. That means that inserts are happening "in the middle" of the table. And, such inserts result in fragmentation.
Without getting into database internals, this is a little hard to explain. But what it means is that inserts require much more work than just inserting the values "at the end" of the table, because new rows go in the middle of a data page so the other rows have to be moved around.
SQL Server offers a solution for this, newsequentialid(). On a given server, this returns a sequential value which is inserted at the end. Often, this is an excellent compromise if you have to use GUIDs.
That said, I have a preference for just plain old ints as ids -- identity columns. These are smaller, so they take up less space. This is particularly true for indexes. Inserts work well because new values go at the "end" of the table. I also find integers easier to work with visually.
Using identity columns for primary keys and foreign key references still allows you to have unique GUID columns for each identity, if that is a requirement for the database (say for interfacing to other applications).
Clustered index is when you want to retrieve rows for a range of values for a given column. As data is physically arranged in that order, the rows can be extracted very efficiently.
a GUID, while excellent for a primary key, could be positively detrimental to performance, as there will be additional cost for inserts and no perceptible benefit on selects.
So yes, don't cluster an index on GUID.
If I have a table with a primary key, i.e. a physically arranged, clustered index that is of type integer and has an identity value like so (pseudo-SQL-code):
MyTable
--------
Id ( int, primary key, identity(1, 1) )
MyField1
MyField2
Would an insert operation in this table take more time as the number of rows in the table grew? Why?
The only reason I can imagine it taking longer is if the table rows are stored as nodes of a linked list internally before being flushed to the disk.
I am assuming that giving a clustered index to a table makes a copy of the table data and stores that as an array, so traversing that array is a lot faster (constant time as you need only one JMP instruction by a single integer (or machine bit-ness, i.e. 32 bits on a 32-bit machine and 64 bits on a 64-bit machine) size) than traversing the linked list.
And would it make any difference to the differential insertion time if the table did not have an index? That is, if the primary key in the above case was missing?
Where may I read about how a relational database stores a table in the RAM and on the disk?
In general, the overhead for inserting a row consists of a few components. Off-hand, I can think of:
Finding a page to put the row.
Updating indexes.
Logging the transaction.
Any overhead for triggers and constraints.
For (1). Because of the clustered index on an identity column, a new row goes into the table at the "end" of the table -- meaning on the last page. There is not a relationship between the size of the table and finding space for the row, in this case.
For (2). There is a very small additional overhead for updating the clustered index as the table grows. But this is very small -- and fragmentation doesn't seem to be an issue.
For (3). This is not related to the table size.
For (4). You don't seem to have triggers or constraints, so this isn't an issue.
So, by my reckoning, there would be very little additional overhead for an insert as the table grows bigger.
Note: There may be other factors as well. For instance, you might need to grow the table space to support a larger table. However, that isn't really related just to the size of the table, just to the relationship between the size of data and the available resources.
Clustering factor - A Awesome Simple Explanation on how it is calculated:
Basically, the CF is calculated by performing a Full Index Scan and
looking at the rowid of each index entry. If the table block being
referenced differs from that of the previous index entry, the CF is
incremented. If the table block being referenced is the same as the
previous index entry, the CF is not incremented. So the CF gives an
indication of how well ordered the data in the table is in relation to
the index entries (which are always sorted and stored in the order of
the index entries). The better (lower) the CF, the more efficient it
would be to use the index as less table blocks would need to be
accessed to retrieve the necessary data via the index.
My Index statistics:
So, here are my indexes(index over just one column) under analysis.
Index starting PK_ is my Primary Key and UI is a Unique key. (Ofcourse both hold unique values)
Query1:
SELECT index_name,
UNIQUENESS,
clustering_factor,
num_rows,
CEIL((clustering_factor/num_rows)*100) AS cluster_pct
FROM all_indexes
WHERE table_name='MYTABLE';
Result:
INDEX_NAME UNIQUENES CLUSTERING_FACTOR NUM_ROWS CLUSTER_PCT
-------------------- --------- ----------------- ---------- -----------
PK_TEST UNIQUE 10009871 10453407 96 --> So High
UITEST01 UNIQUE 853733 10113211 9 --> Very Less
We can see the PK having the highest CF and the other unique index is not.
The only logical explanation that strikes me is, the data beneath is stored actually by order of column over the Unique index.
1) Am I right with this understanding?
2) Is there any way to give the PK , the lowest CF number?
3) Seeing the Query cost using both these index, it is very fast for single selects. But still, the CF number is what baffle us.
The table is relatively huge over 10M records, and also receives real time inserts/updates.
My Database version is Oracle 11gR2, over Exadata X2
You are seeing the evidence of a heap table indexed by an ordered tree structure.
To get extremely low CF numbers you'd need to order the data as per the index. If you want to do this (like SQL Server or Sybase clustered indexes), in Oracle you have a couple of options:
Simply create supplemental indexes with additional columns that can satisfy your common queries. Oracle can return a result set from an index without referring to the base table if all of the required columns are in the index. If possible, consider adding columns to the trailing end of your PK to serve your heaviest query (practical if your query has small number of columns). This is usually advisable over changing all of your tables to IOTs.
Use an IOT (Index Organized Table) - It is a table, stored as an index, so is ordered by the primary key.
Sorted hash cluster - More complicated, but can also yield gains when accessing a list of records for a certain key (like a bunch of text messages for a given phone number)
Reorganize your data and store the records in the table in order of your index. This option is ok if your data isn't changing, and you just want to reorder the heap, though you can't explicitly control the order; all you can do is order the query and let Oracle append it to a new segment.
If most of your access patterns are random (OLTP), single record accesses, then I wouldn't worry about the clustering factor alone. That is just a metric that is neither bad nor good, it just depends on the context, and what you are trying to accomplish.
Always remember, Oracle's issues are not SQL Server's issues, so make sure any design change is justified by performance measurement. Oracle is highly concurrent, and very low on contention. Its multi-version concurrency design is very efficient and differs from other databases. That said, it is still a good tuning practice to order data for sequential access if that is your common use case.
To read some better advice on this subject, read Ask Tom: what are oracle's clustered and nonclustered indexes
In SQL Server 2008 I have some million rows of data which needs be deleted. They are scattered across a handful of tables. Deletion takes up to 20 seconds which I think is way to slow! The data to be deleted is identified by a timestamp column. Here is what I have done so far in order to optimize:
Using isolation level read uncommitted. I don't care about transactions. If we fail the user will issue the delete operation again. And new data is ensured not to have the timestamp we are deleting.
Deleting leaf tables before parent tables.
The timestamp column is part of the PK clustered index, in fact its the first position of the PK/index.
Each table is emptied using a loop which deletes top 200000 entries in order to reduce the transaction log overhead.
Neither I/O nor CPU is maxed out on the server
What have I overlooked?
Also I am in doubt of the effect of moving the timestamp column to the first position in the PK. After doing so, must I reorganize the tables or is SQL Server smart enough to do this itself. My understanding of clustered index is that since it defines the physical layout of the rows, it is force into reorganizing the data. But we have no complaints from the customer that the changing clustered index operation took a long time to perform.
Please make sure the tables you want to delete data from has "primary key" specifically indicated.
Wrong: create table myTable (ID int)
True: create table myTable (ID int PRIMARY KEY)
In addition to that, please try to add "option (recompile)", which will help the performance:
DELETE FROM myTable
WHERE timestamp in (select timestamp from other_table)
OPTION (RECOMPILE)
I'm looking at a client application which retrieves several columns including ROWID, and
later uses ROWID to identify rows it needs to update:
update some_table t set col1=value1
where t.rowid = :selected_rowid
Is it safe to do so? As the table is being modified, can ROWID of a row change?
"From Oracle 8 the ROWID format and size changed from 8 to 10 bytes. Note that ROWID's will change when you reorganize or export/import a table. In case of a partitioned table, it also changes if the row migrates from a partition to another one during an UPDATE."
http://www.orafaq.com/wiki/ROWID
I'd say no. This could be safe if for instance the application stores ROWID temporarily(say generating a list of select-able items, each identified with ROWID, but the list is routinely regenerated and not stored). But if ROWID is used in any persistent way it's not safe.
Assuming that you are using the ROWID a short period of time after you SELECT it, that the table is a standard heap-organized table, and that the DBA isn't doing something to the table (which is a reasonably safe assumption if the application is online), the ROWID will be stable. It would be preferable to use the primary key but when the primary key isn't available, plenty of Oracle-developed tools and frameworks will use the ROWID for short periods of time. It would not be safe if you intended to use the ROWID a long period of time after you SELECT it-- for example, if you allow users to edit data locally and then synchronize with the master database some arbitrary length of time later.
The ROWID is just a physical location of a row so anything that causes that location to change will change the ROWID.
If you are using index-organized tables or partitioned tables, updates to the row can change where the row is physically located which will change the ROWID.
If a row is deleted from a heap-organized table, a subsequent INSERT might insert data with completely different data that happens to use the same ROWID the deleted row previously had.
Various administrative tasks can cause the ROWID to change. Exporting and importing the table will change the ROWID for example, but so will doing something like the new-ish online shrink command. These administrative tasks will not normally be done while the application is up, however, and will almost certainly not be done during the day. But it could lead to problems if the application isn't shut down when a DBA does this sort of thing or if the application persists the data.
Over time, it has become more and more common for new features to introduce new possibilities for ROWIDs to change. Index-organized tables and the online shrink option, for example, are relatively new features. In the future, it is likely that there will be more features that will involve the potential at least for a ROWID to change.
Of course, if we're being pedantic, it's also not safe to rely on the primary key. It is perfectly possible that some other session comes along and updates the primary key of the row after you read it or that some other session deletes the row after you select it and inserts a new row with the same data and a different primary key. In either case, it helps to have some local knowledge about what the applications using the database are actually supposed to be doing. It would be extremely uncommon, for example, to allow updates to primary keys or to reuse primary keys so you can generally determine that it's safe to use a primary key. Similarly, it is relatively common to conclude that given the way you're using partitioning or given the way you've defined the index in your index-organized table that updates won't actually change the ROWID. If you know that the table is partitioned by the LOAD_DATE, for example, and that you never update the LOAD_DATE, you won't actually experience changes to the ROWID because of an update. If you know that the table is index-organized but that you're not updating a column that is part of that index, the ROWID won't change on an UPDATE.
I do not think it is safe to do so, in theory it will not change, that is of course until someone "accidentally" deletes something on the actual DB...
I would just use the PK makes a lot more sense.