I need to get the latest modified time of a table, so I came across
select relfilenode from pg_class where relname = 'test';
which gives me the relfilenode id; this seems to correspond to a file name under
L:\Databases\PostgreSQL\data\base\inodenumber
from which I later extract the latest modified time.
Is this the right way to do this, or is there a better method?
Testing the mtime of the table's relfilenode won't work well. As Eelke noted, VACUUM among other operations will modify the timestamp. Hint bit setting will modify the table too, causing it to appear to be "modified" by a SELECT. Additionally, the on-disk relation is usually split into more than one segment file (1GB chunks), so you'd have to check all of them to find the most recent.
If you want to keep a last modified time for a table, add an AFTER INSERT OR UPDATE OR DELETE OR TRUNCATE ... FOR EACH STATEMENT trigger that updates a timestamp row in a table you use for tracking modification times.
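A minimal sketch of such a trigger, assuming a tracking table and function names I've made up here (table_mod_times, track_table_modified); "test" is the table from the question:

-- table_mod_times and track_table_modified are made-up names for this sketch
CREATE TABLE table_mod_times (
    table_name    text PRIMARY KEY,
    last_modified timestamptz NOT NULL
);

CREATE OR REPLACE FUNCTION track_table_modified() RETURNS trigger AS $$
BEGIN
    -- TG_TABLE_NAME is the name of the table the trigger fired on
    UPDATE table_mod_times
       SET last_modified = current_timestamp
     WHERE table_name = TG_TABLE_NAME;
    IF NOT FOUND THEN
        INSERT INTO table_mod_times (table_name, last_modified)
        VALUES (TG_TABLE_NAME, current_timestamp);
    END IF;
    RETURN NULL;  -- return value is ignored for AFTER ... FOR EACH STATEMENT triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER test_track_modified
AFTER INSERT OR UPDATE OR DELETE OR TRUNCATE ON test
FOR EACH STATEMENT EXECUTE PROCEDURE track_table_modified();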
The downside of the trigger is that every writing transaction will contend for a single row lock in the tracking table, so it'll serialize all your write transactions. It'll also greatly increase the chance of deadlocks. What you really want is probably something non-transactional that doesn't have to roll back when the transaction does, where if multiple transactions update a counter the highest value wins. There's nothing like that built in, though it might not be too hard to write as a C extension.
A slightly more sophisticated option would be to create a trigger that uses dblink to do an update of the last-updated counter. That'll avoid most of the contention problems but it'll actually make deadlocking worse because PostgreSQL's deadlock detection won't be able to "see" the fact that the two sessions are deadlocked via an intermediary. You'd need a way to SELECT ... FOR UPDATE with a timeout to make it reliable without aborting transactions too often.
In any case, a trigger won't catch DDL. DDL triggers ("event triggers") are coming in Pg 9.3.
See also:
How do I find the last time that a PostgreSQL database has been updated?
How to get 'last modified time' of the table in postgres?
I don't think that would be completely reliable, as a vacuum will also modify the file(s) containing the table even though the logical content of the table does not change during a vacuum.
You could create triggers for INSERT, UPDATE and DELETE that maintain the last modified timestamp for each table in another table. This method has a slight performance impact but will provide accurate information.
I have a large table that I need to re-sort periodically. I am partly basing this on a suggestion I was given to stay away from using cluster keys since I am inserting data ordered differently (by time) from how I need it clustered (by ID), and that can cause re-clustering to get a little out of control.
Since I am writing to the table on an hourly basis, I am wary of these two processes conflicting: if I CTAS to a newly sorted temp table and then swap the table name, it seems like I am opening the door to a write on the source table never making it into the temp table.
I figure I could set a flag while I am re-sorting that causes the ETL to pause writing, but that seems a bit hacky and maybe fragile.
I was considering leveraging locking and transactions, but this doesn't seem to be the right use case, since I don't think I'd be locking the table I am copying from while I write to the new table. Any advice on how to approach this?
I've asked some clarifying questions in the comments regarding the clustering that you are avoiding, but in regard to your sort, have you considered creating a nice 4XL warehouse and leveraging the INSERT OVERWRITE option back into itself? It'd look something like:
INSERT OVERWRITE INTO table SELECT * FROM table ORDER BY id;
Assuming that your table isn't hundreds of TB in size, this will complete rather quickly (inside an hour, I would guess), and any inserts into the table during that period will queue up and wait for it to finish.
There are some reasons to avoid automatic reclustering, but they're basically all the same reasons why you shouldn't set up a job to re-cluster frequently. You're making the database do all the same work, but without the built-in management of it.
If your table is big enough that you are seeing performance issues with the clustering by time, and you know that the ID column is the main way that this table is filtered (in JOINs and WHERE clauses) then this is probably a good candidate for automatic clustering.
So I would recommend at least testing out a cluster key on the ID and then monitoring/comparing performance.
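If you do test it, defining the key and checking how well the table stays clustered is straightforward; my_table and id are placeholders here:

ALTER TABLE my_table CLUSTER BY (id);  -- my_table / id are placeholder names

-- later, inspect clustering health for that key
SELECT SYSTEM$CLUSTERING_INFORMATION('my_table', '(id)');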
To give a brief answer to the question about resorting without conflicts as written:
I might recommend using a time column to re-sort records older than a given time (probably in a separate table). While it's sorting, you may get some new records, but you will be able to use that time column to marry up those new records with the now-sorted older records.
You might consider revoking INSERT, UPDATE, DELETE privileges on the original table within the same script or procedure that performs the CTAS creating the newly sorted copy of the table. After a successful swap, you can re-grant the privileges to the roles that are used to perform updates.
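A rough sketch of that sequence, assuming Snowflake and made-up names (my_table, my_table_sorted, etl_role):

-- All object and role names below are placeholders.
-- Block the ETL role from writing while the sorted copy is built.
REVOKE INSERT, UPDATE, DELETE ON TABLE my_table FROM ROLE etl_role;

-- Build the newly sorted copy and swap it in.
CREATE OR REPLACE TABLE my_table_sorted AS
SELECT * FROM my_table ORDER BY id;
ALTER TABLE my_table SWAP WITH my_table_sorted;

-- Re-grant write access once the swap has succeeded.
GRANT INSERT, UPDATE, DELETE ON TABLE my_table TO ROLE etl_role;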
Twice a day, I run a heavy query and save the results (40MBs worth of rows) to a table.
I truncate this results table before inserting the new results such that it only ever has the latest query's results in it.
The problem is that while the new results are being written, there is technically no data and/or the table is locked. When that is the case, anyone interacting with the site could experience an interruption. I haven't run into this yet, but I am looking to mitigate it in the future.
What is the best way to remedy this? Is it proper to write the new results to a table named results_pending, then drop the results table and rename results_pending to results?
Two methods come to mind. One is to swap partitions for the table. To be honest, I haven't done this in SQL Server, but it should work at a low level.
I would normally have all access go through a view. Then, I would create the new day's data in a separate table -- and change the view to point to the new table. The view change is close to "atomic". Well, actually, there is a small period of time when the view might not be available.
Then, at your leisure you can drop the old version of the table.
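A sketch of that view-swap pattern in T-SQL; all object names here are made up:

-- All object names are placeholders.
-- Build the new day's data into its own table.
SELECT *
INTO   dbo.results_20240101
FROM   dbo.heavy_query_source;   -- stand-in for the expensive query
GO

-- Repoint the view; readers keep querying dbo.results throughout.
ALTER VIEW dbo.results AS
SELECT * FROM dbo.results_20240101;
GO

-- Later, at your leisure, drop the previous day's table.
DROP TABLE dbo.results_20231231;
GO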
TRUNCATE is a DDL operation, which is what causes problems like this. If you are using snapshot isolation with row versioning and want users to see either the old or the new data, then use a single transaction to DELETE the old records and INSERT the new data.
Another option if a lot of the data doesn't actually change is to UPDATE / INSERT / DELETE only those records that need it and leave unchanged records alone.
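Under snapshot isolation that could be as simple as the following; table and source names are placeholders:

-- dbo.results / dbo.heavy_query_source / col1, col2 are placeholders
BEGIN TRANSACTION;

DELETE FROM dbo.results;

INSERT INTO dbo.results (col1, col2)
SELECT col1, col2
FROM   dbo.heavy_query_source;

COMMIT TRANSACTION;

Readers using row versioning continue to see the old rows until the commit.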
I want to find when the last INSERT, UPDATE or DELETE statement was performed on a table in an Oracle database (just one table for now; in the future I want to do this for multiple tables).
I created a table and then I updated one of its rows. Now I have the following query:
SELECT SCN_TO_TIMESTAMP(ora_rowscn) from test_table;
This query returns a timestamp for each row, giving the time when each row was first created.
But the row that I've updated has the same timestamp as the others. Why? Shouldn't the timestamp be updated?
ORA_ROWSCN is not the right solution for this. Unless the table was created with ROWDEPENDENCIES, the SCN is tracked at the block level rather than per row, so it is not reliable at the row level. Moreover, it's not going to be useful at all for deleted rows.
If you have a real need to know when DML changes were made to a table, you should look at Oracle's auditing feature.
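For example, with traditional auditing enabled (AUDIT_TRAIL=DB), something like the following would record each DML statement; the schema name is a placeholder:

-- MY_SCHEMA is a placeholder; test_table is the table from the question
AUDIT INSERT, UPDATE, DELETE ON my_schema.test_table BY ACCESS;

-- most recent audited DML against the table
SELECT MAX(timestamp)
FROM   dba_audit_trail
WHERE  owner = 'MY_SCHEMA'
AND    obj_name = 'TEST_TABLE'
AND    action_name IN ('INSERT', 'UPDATE', 'DELETE');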
An alternative is to use triggers to record when changes are made to the table. Since you say you only care about the time of the most recent change, you can just create a single-column table to record the time, and write a trigger that fires on any DML statement to maintain it. If you're doing this in a production environment or even just in one where more than one session might be modifying the table, you'd want to think about how it should work when concurrent changes are made. You could force the table to have at most one row, but that would serialize every change to the table. You could allow each session to insert a separate row and take the max value when querying it, but then you probably want to think about clearing out old rows from time to time.
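A minimal sketch of the trigger approach, inserting one row per DML statement into a side table (the log table and trigger names are made up; test_table is from the question):

-- test_table_mod_log / trg_test_table_mod are made-up names
CREATE TABLE test_table_mod_log (mod_time TIMESTAMP);

CREATE OR REPLACE TRIGGER trg_test_table_mod
AFTER INSERT OR UPDATE OR DELETE ON test_table
BEGIN
  INSERT INTO test_table_mod_log (mod_time) VALUES (SYSTIMESTAMP);
END;
/

-- time of the most recent change
SELECT MAX(mod_time) FROM test_table_mod_log;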
This question already has answers here:
How can I get a hash of an entire table in postgresql?
Suppose you have a reasonably large (for local definitions of “large”), but relatively stable table.
Right now, I want to take a checksum of some kind (any kind) of the contents of the entire table.
The naïve approach might be to walk the entire table, taking a checksum (say, MD5) of the concatenation of every column on each row, then concatenating those per-row checksums and taking the MD5 sum of the result.
From the client side, that might be optimized a little by feeding the column values into the MD5 routine incrementally, mutating the running digest as you go.
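Pushed entirely server-side, the same idea can be expressed in one query; this is just a sketch, and the table and ordering column names are hypothetical:

-- the_big_table / id are hypothetical names; concatenates each row's text form in id order and hashes it
SELECT md5(string_agg(t::text, '' ORDER BY t.id))
FROM   the_big_table AS t;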
The reason for this is that at some point in the future we want to reconnect to the database and ensure that no other users have mutated the table: that includes INSERT, UPDATE, and DELETE.
Is there a nicer way to determine whether any changes have occurred to a particular table? Or a more efficient/faster way?
Update/clarification:
We are not able/permitted to make any alterations to the table itself (e.g. adding a “last-updated-at” column or triggers or so forth)
(This is for Postgres, if it helps. I'd prefer to avoid poking transaction journals or anything like that, but if there's a way to do so, I'm not against the idea.)
Adding columns and triggers is really quite safe
While I realise you've said it's a large table in a production DB that you can't modify, I want to explain how you can make a very low-impact change.
In PostgreSQL, an ALTER TABLE ... ADD COLUMN of a nullable column takes only moments and doesn't require a table re-write. It does require an exclusive lock, but the main consequence of that is that it can take a long time before the ALTER TABLE can actually proceed; it won't hold anything else up while it waits for a chance to get the lock.
The same is true of creating a trigger on the table.
This means that it's quite safe to add a modified_at or created_at column, plus an associated trigger function to maintain them, to a live table that's in intensive real-world use. Rows added before the column was created will have a null value, which makes perfect sense since you don't know when they were added/modified. Your trigger will set the modified_at field whenever a row changes, so the values will get progressively filled in.
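A sketch of such a column plus trigger, with made-up names (big_table, set_modified_at):

-- big_table / set_modified_at are made-up names
ALTER TABLE big_table ADD COLUMN modified_at timestamptz;

CREATE OR REPLACE FUNCTION set_modified_at() RETURNS trigger AS $$
BEGIN
    NEW.modified_at := current_timestamp;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER big_table_set_modified_at
BEFORE INSERT OR UPDATE ON big_table
FOR EACH ROW EXECUTE PROCEDURE set_modified_at();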
For your purposes it's probably more useful to have a trigger-maintained side table that tracks the timestamp of the last change (insert/update/delete) anywhere in the table. That'll save you from storing a whole bunch of timestamps on disk and will let you discover when deletes have happened. A side table with a single row that you update on each change using a FOR EACH STATEMENT trigger will be quite low-cost. It's not a good idea for most tables because of contention - it essentially serializes all transactions that attempt to write to the table on that row's update lock. In your case that might well be fine, since the table is large and rarely updated.
A third alternative is to have the side table accumulate a running log of the timestamps of insert/update/delete statements, or even of the individual rows. That allows your client to read the change-log table instead of the main table and make small changes to its cached data rather than invalidating and re-reading the whole cache. The downside is that you have to have a way to periodically purge old and unwanted change-log records.
So... there's really no operational reason why you can't change the table. There may well be business policy reasons that prevent you from doing so even though you know it's quite safe, though.
... but if you really, really, really can't:
Another option is to use the existing "md5agg" extension: http://llg.cubic.org/pg-mdagg/ . Or, if you built PostgreSQL from source, to apply the patch currently circulating on pgsql-hackers that adds an "md5_agg" for the next release.
Logical replication
The bi-directional replication for PostgreSQL project has produced functionality that allows you to listen for and replay logical changes (row inserts/updates/deletes) without requiring triggers on tables. The pg_receivellog tool would likely suit your purposes well when wrapped with a little scripting.
The downside is that you'd have to run a patched PostgreSQL 9.3, so I'm guessing that if you can't change a table, running a bunch of experimental code that's likely to change incompatibly in future isn't going to be high on your priority list ;-). It's included in the stock 9.4 release though; see "changeset extraction".
Testing the relfilenode timestamp won't work
You might think you could look at the modified timestamp(s) of the file(s) that back the table on disk. This won't be very useful:
The table is split into segment files that are 1GB each by default, so you'd have to find the most recent timestamp across all of them.
Autovacuum activity will cause these timestamps to change, possibly quite a while after the corresponding writes happened.
Autovacuum must periodically do an automatic 'freeze' of table contents to prevent transaction ID wrap-around. This involves progressively rewriting the table and will naturally change the timestamps, even if nothing has been written to the table for quite a long time.
Hint-bit setting results in small writes during SELECT. These writes will also affect the file timestamps.
Examine the transaction logs
In theory you could attempt to decode the transaction logs with pg_xlogreader and find records that affect the table of interest. You'd have to try to exclude activity caused by vacuum, full page writes after hint bit setting, and of course the huge amount of activity from every other table in the entire database cluster.
The performance impact of this is likely to be huge, since every change to every database on the entire system must be examined.
All in all, adding a trigger on a table is trivial in comparison.
What about creating a trigger on insert/update/delete events on the table? The trigger could call a function that inserts a timestamp into another table which would mark the time for any table-changing event.
The only concern would be an UPDATE that writes the same data that is already in the table. The trigger would fire even though the table didn't really change. If you're concerned about this case, you could make the trigger call a function that generates a checksum against just the updated rows and compares it against a previously generated checksum, which would usually be more efficient than scanning and checksumming the whole table.
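If that's a concern, a cheaper alternative to checksumming (a different technique, shown here with invented names) is a row-level trigger with a WHEN clause, so it simply doesn't fire for no-op updates:

-- watched_table / watched_table_changes / record_change_time are invented names
CREATE TABLE watched_table_changes (changed_at timestamptz NOT NULL);

CREATE OR REPLACE FUNCTION record_change_time() RETURNS trigger AS $$
BEGIN
    INSERT INTO watched_table_changes (changed_at) VALUES (current_timestamp);
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER log_real_changes
AFTER UPDATE ON watched_table
FOR EACH ROW
WHEN (OLD.* IS DISTINCT FROM NEW.*)
EXECUTE PROCEDURE record_change_time();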
Postgres documentation on triggers here: http://www.postgresql.org/docs/9.1/static/sql-createtrigger.html
If you simply want to know when a table last changed without doing anything to it, you can look at the timestamps of the actual file(s) on your database server.
SELECT relfilenode FROM pg_class WHERE relname = 'your_table_name';
If you need more detail on exactly where it's located, you can use:
select t.relname,
t.relfilenode,
current_setting('data_directory')||'/'||pg_relation_filepath(t.oid)
from pg_class t
join pg_namespace ns on ns.oid = t.relnamespace
where relname = 'your_table_name';
Since you did mention that it's quite a big table, it will definitely be broken into segments (and TOAST tables), but you can use the relfilenode as your base point and do an ls -ltr relfilenode.* or relfilenode_*, where relfilenode is the actual relfilenode from above.
These files get updated at every checkpoint if something has changed in that table, so how quickly you'll see the timestamps move depends on how often your checkpoints occur; if you haven't changed the default checkpoint interval, that's within a few minutes.
Another trivial but imperfect way to check whether INSERTs or DELETEs have occurred is to check the table size:
SELECT pg_total_relation_size('your_table_name');
I'm not entirely sure why a trigger is out of the question, though, since you don't have to make it retroactive. If your goal is to ensure nothing changes in the table, a trivial trigger that just catches an insert, update, or delete event could write to another table just to timestamp the attempt, without causing any activity on the actual table. That said, knowing that something changed doesn't by itself ensure that nothing changes.
Anyway, hope this helps you in this whacky problem you have...
A common practice would be to add a modified column. If it were MySQL, I'd use TIMESTAMP as the datatype for the field (it updates to the current date on each update). Postgres must have something similar.
Newish to Oracle programming (from Sybase and MS SQL Server). What is the "Oracle way" to avoid filling the trans log with large updates?
In my specific case, I'm doing an update of potentially a very large number of rows. Here's my approach:
UPDATE my_table
SET a_col = null
WHERE my_table_id IN
(SELECT my_table_id FROM my_table WHERE some_col < some_val and rownum < 1000)
...where I execute this inside a loop until the updated row count is zero.
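For reference, that loop in PL/SQL might look like the sketch below (names and the cutoff value are placeholders). Note that the predicate also has to exclude rows that are already null, otherwise the row count never drops to zero and the loop never exits:

DECLARE
    v_cutoff NUMBER := 100;   -- placeholder for some_val
BEGIN
    LOOP
        UPDATE my_table
           SET a_col = NULL
         WHERE some_col < v_cutoff
           AND a_col IS NOT NULL    -- skip rows already updated
           AND ROWNUM < 1000;
        EXIT WHEN SQL%ROWCOUNT = 0;
        COMMIT;
    END LOOP;
    COMMIT;
END;
/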
Is this the best approach?
Thanks,
The amount of redo and undo generated will not be reduced at all if you break up the UPDATE into multiple runs of, say, 1000 records. On top of that, the total query time will most likely be higher compared to running a single large SQL statement.
There's no real way to address the UNDO/REDO issue for UPDATEs. With INSERTs and CREATE TABLEs you can use a direct-path (APPEND) option, but I guess that doesn't easily work for you.
It depends on the percentage of rows almost as much as on the number. It also depends on whether the update makes the rows longer than before, i.e. going from null to 200 bytes in every row. That could have an effect on your performance - chained rows.
Either way, you might want to try this.
Build a new table with the column corrected as part of the select instead of an update. You can build that new table via CTAS (Create Table as Select) which can avoid logging.
Drop the original table.
Rename the new table.
Reindex, repoint constraints, rebuild triggers, recompile packages, etc.
You can avoid a lot of logging this way.
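A rough sketch of that approach, using the column names from the question (everything else, including the cutoff value, is made up):

-- my_table_new is a made-up name; 100 stands in for some_val
CREATE TABLE my_table_new NOLOGGING AS
SELECT my_table_id,
       CASE WHEN some_col < 100 THEN NULL ELSE a_col END AS a_col,
       some_col
FROM   my_table;

DROP TABLE my_table;
RENAME my_table_new TO my_table;

-- Then recreate indexes, constraints, grants and triggers on my_table.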
Any UPDATE is going to generate redo. Realistically, a single UPDATE that updates all the rows is going to generate the smallest total amount of redo and run for the shortest period of time.
Assuming you are updating the vast majority of the rows in the table, if there are any indexes that use A_COL, you may be better off disabling those indexes before the update and then rebuilding them with NOLOGGING specified after the massive UPDATE statement. In addition, if there are any triggers or foreign keys that would need to be fired/validated as a result of the update, getting rid of those temporarily might be helpful.
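A sketch of the index part, with a made-up index name and a placeholder cutoff value:

-- my_table_a_col_idx is a made-up index name; 100 stands in for some_val
ALTER INDEX my_table_a_col_idx UNUSABLE;
ALTER SESSION SET skip_unusable_indexes = TRUE;  -- let the UPDATE ignore the unusable index

UPDATE my_table SET a_col = NULL WHERE some_col < 100;
COMMIT;

ALTER INDEX my_table_a_col_idx REBUILD NOLOGGING;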