I need some help: is there a way to sort the rows of a table in Oracle based on their time of insertion?
For example, would we use some kind of function-based index?
I would like to get this automatic sorting without having to declare a new column just to record the time.
Does the Oracle server keep track of that information anywhere I can use for sorting?
Thanks in advance
Oracle doesn't record the "time of insertion" for you, so you must add that column yourself, set it to the current time (e.g. SYSTIMESTAMP) on every insert, then ORDER BY that column when you query the table.
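A minimal sketch of that approach (the table and column names here are just placeholders):

-- Record the insertion time explicitly with a defaulted column.
CREATE TABLE orders (
  id         NUMBER PRIMARY KEY,
  payload    VARCHAR2(100),
  created_at TIMESTAMP DEFAULT SYSTIMESTAMP NOT NULL
);

INSERT INTO orders (id, payload) VALUES (1, 'first row');

-- Sort by insertion time when querying:
SELECT * FROM orders ORDER BY created_at;

Because of the DEFAULT, existing INSERT statements don't have to change to populate the new column.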
The general answer is no.
However...
The ORA_ROWSCN pseudocolumn returns the conservative upper bound system change number (SCN) of the most recent change to the row. This pseudocolumn is useful for determining approximately when a row was last updated. It is not absolutely precise, because Oracle tracks SCNs by transaction committed for the block in which the row resides. You can obtain a more fine-grained approximation of the SCN by creating your tables with row-level dependency tracking.
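For illustration, that might look like this (the table name is an assumption, and SCN_TO_TIMESTAMP only works while the SCN is still mapped, so treat the result as approximate):

-- ROWDEPENDENCIES must be declared when the table is created;
-- it enables row-level rather than block-level SCN tracking.
CREATE TABLE audit_demo (
  id      NUMBER PRIMARY KEY,
  payload VARCHAR2(100)
) ROWDEPENDENCIES;

-- Approximate "last changed" ordering via the pseudocolumn:
SELECT d.*, ORA_ROWSCN, SCN_TO_TIMESTAMP(ORA_ROWSCN) AS approx_change_time
FROM   audit_demo d
ORDER  BY ORA_ROWSCN;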
Suppose the following rows are inserted into a table in chronological order:
row1, row2, row3, row4, ..., row1000, row1001.
After a while, we delete/remove the latest row1001.
As in this post: How to get Top 5 records in SqLite?
If the below command is run:
SELECT * FROM <table> LIMIT 1;
Is it guaranteed to return row1000?
If not, is there any efficient way to get the latest row(s) without traversing all the rows, i.e. without using a combination of ORDER BY and DESC?
[Note: For now I am using SQLite, but I am also interested in how this works in SQL in general.]
You're misunderstanding how SQL works. You're thinking row-by-row, which is wrong. SQL does not "traverse rows" as you fear; it operates on data as "sets".
Others have pointed out that a relational database cannot be assumed to have any particular ordering, so you must use ORDER BY to specify the ordering explicitly.
However, one thing not mentioned yet: to ensure the query performs efficiently, you need to create an appropriate index.
Whether you have an index or not, the correct query is:
SELECT <cols>
FROM <table>
ORDER BY <sort-cols> [DESC] LIMIT <no-rows>
Note that if you don't have an index the database will load all data and probably sort in memory to find the TOP n.
If you do have the appropriate index, the database will use the best index available to retrieve the TOP n rows as efficiently as possible.
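For example, in SQLite (the table here is purely illustrative):

-- With an index on the sort column, the TOP-n query below can
-- read the last few index entries instead of scanning the table.
CREATE TABLE log (id INTEGER PRIMARY KEY, created_at TEXT NOT NULL);
CREATE INDEX idx_log_created_at ON log (created_at);

SELECT *
FROM   log
ORDER  BY created_at DESC
LIMIT  5;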
Note that the SQLite documentation is very clear on the matter. The section on ORDER BY explains that ordering is otherwise undefined, and nothing in the section on LIMIT contradicts this (it simply constrains the number of rows returned).
If a SELECT statement that returns more than one row does not have an ORDER BY clause, the order in which the rows are returned is undefined.
This behaviour is also consistent with the ANSI standard and all major SQL implementations. Note that any database vendor that guaranteed any kind of ordering would have to sacrifice performance to the detriment of queries trying to retrieve data but not caring about order. (Not good for business.)
As a side note, flawed assumptions about ordering are an easy mistake to make (similar to flawed assumptions about uninitialised local variables).
RDBMS implementations are very likely to make ordering appear consistent. They follow a certain algorithm for adding data, a certain algorithm for retrieving data. And as a result, their operations are highly repeatable (it's what we love (and hate) about computers). So things repeatably look the same.
Theoretical examples:
Inserting a row results in the row being added to the next available free space. So data appears sequential. But an update would have to move the row to a new location if it no longer fits.
The DB engine might retrieve data sequentially from clustered index pages and seem to use the clustered index as the 'natural ordering' ... until one day a page split puts one of the pages in a different location.
Or a new version of the DBMS might cache certain data for performance, and suddenly the order changes.
Real-world example:
The MS SQL Server 6.5 implementation of GROUP BY had the side effect of also sorting by the group-by columns. When MS (in version 7 or 2000) implemented some performance improvements, GROUP BY would, by default, return data in a hashed order. Many people blamed MS for breaking their queries, when in fact they had made false assumptions and failed to ORDER BY their results as needed.
This is why the only guarantee of a specific ordering is to use the ORDER BY clause.
No. Table records have no inherent order. So it is undefined which row(s) to get with a LIMIT clause without an ORDER BY.
SQLite in its current implementation may return the latest inserted row, but even if that were the case you must not rely on it.
Give the table a datetime column or some sort key if record order is important to you.
In SQL, data is stored in tables unordered. What comes out first one day might not be the same the next.
ORDER BY, or some other specific selection criterion, is required to guarantee the correct result.
I googled around and found no answer. The question: for an existing SQL table (assume any of H2, MySQL, or Postgres), is there a way to get the last-update timestamp for a given table row? That is, without explicitly declaring a new column (altering the table) and/or adding triggers that update a timestamp column.
I'm using a JDBC driver, preparing statements, getting ResultSets and so forth. I need to be able to determine whether the data has changed recently or not, and for this a timestamp would help. If possible I want to avoid adding timestamp columns across all tables in the system.
There is no implicit standard approach to this problem. The standard way is to have an explicit column and logic in a DB trigger or app function...
As mentioned, there are ways to do it through the logs, but it's hard and usually won't be very accurate. For example, in Postgres you can enable commit timestamps in postgresql.conf and check the last update time, but those are approximate and are not kept for a long time...
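Roughly, the Postgres version of that looks like this (the table name is hypothetical, and track_commit_timestamp must already be on when the transactions of interest commit):

-- In postgresql.conf (a server restart is required):
--   track_commit_timestamp = on

-- The commit time of the transaction that last wrote each row can
-- then be read through the row's xmin system column:
SELECT pg_xact_commit_timestamp(xmin) AS last_commit, t.*
FROM   mytable t;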
I was asked to design a class for caching SQL query results. Calling the class's query method will query and cache the entire result set the first time; afterward, each subsequent query will retrieve only the updated portion and merge the result into the cache.
If the class is required to be generic, i.e. NO knowledge about the db and the tables, do you have any idea?
Is it possible, and how to retrieve only updated/new records since the last query?
MySQL (like every other SQL-based database, as far as I know) does not store the "last update time" or "insert time" of a row anywhere, at least not anywhere you can query (parsing the binary log is probably not something you're looking to do).
To get "new records" you will need to either
Query for all records with an autoincrement value higher than the last known max (sketched below)
Add a datetime/timestamp to the table that would store the insert/updated time of the record, and query based on that
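A sketch of the first option (the table name, key column, and high-water-mark value are all assumptions):

-- Fetch only rows inserted since the previous run, assuming an
-- autoincrement primary key and a remembered high-water mark.
SELECT *
FROM   my_table
WHERE  id > 41234     -- last known MAX(id) from the previous query
ORDER  BY id;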
You could consider having your code create its own table that would store a record every time an UPDATE or INSERT happened on another table, and then add triggers to all those other tables that would populate your table.
But that might be a bit much.
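If you did go that route, a minimal version might look like this (MySQL syntax; every name here is hypothetical):

-- One shared log table...
CREATE TABLE change_log (
  table_name VARCHAR(64) NOT NULL,
  row_id     BIGINT      NOT NULL,
  changed_at TIMESTAMP   NOT NULL DEFAULT CURRENT_TIMESTAMP
);

-- ...and a pair of triggers per watched table:
CREATE TRIGGER orders_ai AFTER INSERT ON orders
FOR EACH ROW INSERT INTO change_log (table_name, row_id) VALUES ('orders', NEW.id);

CREATE TRIGGER orders_au AFTER UPDATE ON orders
FOR EACH ROW INSERT INTO change_log (table_name, row_id) VALUES ('orders', NEW.id);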
Either your cache class has to be notified of all updates, or you need some kind of expiration.
The only other option is to query the database to check, but then you kind of lose the point of the cache :)
The above is true for a "generic" cache. If you implement a specialized cache for a particular table you might be able to exploit DML patterns.
In a certain app I must constantly query data that are likely to be amongst the last inserted rows. Since this table is going to grow a lot, I wonder if there's a standard way of optimizing the queries by making them start the lookup at the table's end. I think I would get the same optimization if the database stored the table's data in a stack-like structure, so the last inserted rows would be searched first.
The SQL spec doesn't say anything about maintaining insertion order, and in practice most decent DBs don't maintain it either. So it stops there. Sorting the table first isn't going to make it faster. Just index the column(s) of interest (at least the ones you use in the WHERE clause).
One of the "tenets" of a proper RDBMS is that this kind of matters shouldn't concern you or anyone else using the DB.
The DB engine is "free" to use whatever method it wants to store/retrieve records, so if you want to enforce a "top" behaviour do what other suggested: add a timestamp field to the table (or tables), add an index on it and query using it as a sort and/or query criteria (e.g.: you poll the table each minute, and ask for records with timestamp>=systime-1 minute)
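Concretely, that polling approach might look like this (the names are made up; the interval syntax shown is the Postgres spelling and varies by DB):

ALTER TABLE events ADD COLUMN created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP;
CREATE INDEX idx_events_created_at ON events (created_at);

-- Poll once a minute for anything new:
SELECT *
FROM   events
WHERE  created_at >= NOW() - INTERVAL '1 minute'
ORDER  BY created_at DESC;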
There is no standard way.
In some databases you can specify the sort order on an index.
SQL Server allows you to write ASC or DESC on an index:
[ ASC | DESC ]
Determines the ascending or descending sort direction for the particular index column. The default is ASC.
In MySQL you can also write ASC or DESC when you create the index but currently this is ignored. It might be implemented in a future version.
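For example, in SQL Server (table and column names assumed):

-- Store the index in descending order, so "newest first" scans
-- read it front to back.
CREATE INDEX IX_Orders_CreatedAt_Desc ON dbo.Orders (CreatedAt DESC);

SELECT TOP (10) *
FROM   dbo.Orders
ORDER  BY CreatedAt DESC;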
Add a counter or a time field to your table, sort on it, and get the top rows.
In other words: You should forget the idea that SQL tables are accessed in any particular order by default. A seqscan does not mean the oldest rows will be searched first, only that all rows will be checked. If you want to optimize some search you add indexes on some fields. What you are looking for is probably indexes.
If your data is indexed, it won't matter. The index is doing a binary search, not a sequential scan.
Unless you're doing TOP 1 (or something like it), the SELECT will have to scan the whole table or index anyway.
According to Data Independence you shouldn't care. That said, a clustered index would probably suit your needs if you typically look for a date range. (Sorting asc/desc shouldn't matter, but you should try it out.)
If you find that you really need it you can also shard your database to increase perf on the most recently added data.
If you have enough rows that it's actually becoming a problem, and you know how many "most recently inserted rows" you need, you could try a roundabout method.
Note: Even for pretty big tables, this is less efficient, but once your main table gets big enough, I've seen this work wonders for user-facing performance.
Create a "staging" table that exactly mimics your table's structure. Whenever you insert into your main table, also insert into your "staging" area. Limit your "staging" area to n rows by using a trigger to delete the lowest id row in the table when a new row over your arbitrary maximum is reached (say, 10,000 or whatever your limit is).
Then, queries can hit that smaller table first looking for the information. Since the table is arbitrarilly limited to the last n rows, it's only looking in the most recent data. Only if that fails to find a match would your query (actually, at this point a stored procedure because of the decision making) hit your main table.
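A rough sketch of the staging setup (SQL Server syntax; the table names, columns, and the 10,000-row cap are all assumptions):

CREATE TABLE main_staging (
  id      BIGINT        NOT NULL PRIMARY KEY,  -- mirrors the main table's key
  payload NVARCHAR(200) NULL
);
GO

CREATE TRIGGER trg_main_staging_cap
ON main_staging
AFTER INSERT
AS
BEGIN
  SET NOCOUNT ON;
  -- Keep only the newest 10,000 rows (highest ids) in the staging table.
  DELETE FROM main_staging
  WHERE id NOT IN (SELECT TOP (10000) id
                   FROM main_staging
                   ORDER BY id DESC);
END;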
Some Gotchas:
1) Make sure your trigger(s) is(are) set up properly to maintain the correct concurrency between your "main" and "staging" tables.
2) This can quickly become a maintenance nightmare if not handled properly, and depending on your scenario it can be a little finicky.
3) I cannot stress enough that this is only efficient/useful in very specific scenarios. If yours doesn't match it, use one of the other answers.
ISO/ANSI Standard SQL does not consider optimization at all. For example the widely recognized CREATE INDEX SQL DDL does not appear in the Standard. This is because the Standard makes no assumptions about the underlying storage medium and nor should it. I regularly use SQL to query data in text files and Excel spreadsheets, neither of which have any concept of database indexes.
You can't do this.
However, there is a way to do something that might be even better. Depending on the design of your table, you should be able to create an index that keeps things in almost the order of entry. For example, if you adopt the common practice of creating an id field that autoincrements, then that index is just about in chronological order.
Some RDBMSes permit you to declare a backwards index, that is, one that descends instead of ascending. If you create a backwards index on the ID field, and if the optimizer uses that index, it will look at the most recent entries first. This will give you a rapid response for the first row.
The next step is to get the optimizer to use the index. You need to use explain plan to see whether the index is being used. If you ask for the rows in order of id descending, the optimizer will almost certainly use the backwards index. If not, you may be able to use hints to guide the optimizer.
If you still want to keep from reading all the rows and wasting time, you may be able to use the LIMIT feature to declare that you only want, say, 10 rows and no more, or 1 row and no more. That should do it.
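Putting those pieces together (MySQL-flavoured syntax; descending indexes require MySQL 8.0+, and all names are illustrative):

-- A descending index on the autoincrement key:
CREATE INDEX idx_events_id_desc ON events (id DESC);

-- Check that the optimizer actually uses it:
EXPLAIN SELECT * FROM events ORDER BY id DESC LIMIT 1;

-- Then fetch just the most recent row:
SELECT * FROM events ORDER BY id DESC LIMIT 1;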
Good luck.
If your table has a create date, then I'd reverse sort by that and take the top 1.
I'm updating a table from another table in the same database - what would be the best way to count the updates/inserts?
I can think of a couple of methods:
Join the tables and count (i.e. inner join for update, left join where null for inserts) then perform the update/insert
Use the modification date in the target table (this is maintained correctly) and do a count where the mod date has changed; this would have to be done after the update, and before and after the insert... I'm sure you get the idea (a sketch follows below).
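(For concreteness, method two might look roughly like this in T-SQL; the target/source tables and the mod_date column are made-up names:)

DECLARE @cutoff DATETIME2 = SYSDATETIME();

UPDATE t
SET    t.amount   = s.amount,
       t.mod_date = SYSDATETIME()
FROM   target t
JOIN   source s ON s.id = t.id
WHERE  t.amount <> s.amount;

SELECT COUNT(*) AS updated_rows        -- rows touched by the update
FROM   target
WHERE  mod_date >= @cutoff;

DECLARE @before INT = (SELECT COUNT(*) FROM target);

INSERT INTO target (id, amount, mod_date)
SELECT s.id, s.amount, SYSDATETIME()
FROM   source s
WHERE  NOT EXISTS (SELECT 1 FROM target WHERE id = s.id);

SELECT (SELECT COUNT(*) FROM target) - @before AS inserted_rows;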
Currently I use method two, as I thought it might be faster not having to join the tables, and the modification timestamp data is there anyway.
What are people's thoughts on this? (I wanted to tag this best-practice, but that tag seems to have disappeared.)
EDITED: Sorry, I should have been more specific about the scenario - assume only one concurrent update (this is to update an archive/warehouse overnight) and that the provider for SSIS we're using won't return the number of rows updated.
I'd probably keep using the second option. If you know you're running daily (or at any other regular interval), you could just test for all modifications (updates and inserts) based on the datepart/day value (depending on the interval) in your timestamp column.
This way you'll not have to rewrite your update-testing procedure when your inserts/joins/other requirements change.
(Of course, you're vulnerable to changes by other agents).
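For a nightly run, that test can be as simple as (T-SQL; the table and column names are assumed):

-- Everything modified since midnight counts as this run's changes:
SELECT COUNT(*) AS changed_rows
FROM   target
WHERE  mod_date >= CAST(GETDATE() AS DATE);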
You could also introduce a 'helper column' where you set a unique update value, but that smells fishy.