My question may be specific to Postgres, or it may not be.
A program that I cannot modify accesses Postgres via npgsql with a simple SELECT command; that is all I know about it.
I also have access via npgsql. The table is defined as:
-- Table: public.n_data
-- DROP TABLE public.n_data;
CREATE TABLE public.n_data
(
u_id integer,
p_id integer NOT NULL,
data text,
CONSTRAINT nc PRIMARY KEY (p_id)
)
WITH (
OIDS=FALSE
);
ALTER TABLE public.n_data
OWNER TO postgres;
(If that info is useful anyway)
I access one single big column, read from it and write back to it.
This all works fine so far.
The question is: how does Postgres handle it if we both write at the same time?
Are there any problems there?
And if Postgres does not handle that automatically, what happens when I read the data, process it, the data changes in the meantime, and I then write back my processed copy? That would mean lost data.
It's a bit tricky to test for data integrity, since this data block is huge and corruption is hard to find.
I do it with C#, if that matters.
Locking in most [1] relational databases (including Postgres) is always on the row level, never on the column level (a relational database has columns and rows, not "cells", "fields" or "records").
If two transactions modify the same row, the second one will have to wait until the first one commits or rolls back.
If two transactions modify different rows then they can do that without any problems as long as they don't modify columns that are part of a unique constraint or primary key to the same value.
Read access to data is never blocked in Postgres by regular DML statements. So yes, while one transaction modifies data, another one will see the old data until the first transaction commits its changes ("read consistency").
To handle lost updates you can either use the serializable isolation level or make all transactions follow the pattern that they first need to obtain a lock on the row (select ... for update) and hold that until they are finished. Search for "pessimistic locking" to get more details about this pattern.
Another option is to include a "modified" timestamp in your table. When a process reads the data, it also reads the modification timestamp. When it sends back the new changes, it includes a where modified_at = <value obtained when reading> condition. If the data has changed in the meantime, the condition will not hold, nothing will be updated, and you need to restart your transaction. Search for "optimistic locking" to find more details about this pattern.
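A minimal sketch of the optimistic locking pattern, using Python's built-in sqlite3 module so it runs without a server. The integer version column is an assumption standing in for the "modified" timestamp (a counter avoids clock-resolution problems), and read_row and write_if_unchanged are hypothetical helper names. The same UPDATE ... WHERE version = ? statement should carry over to Postgres via npgsql.

```python
import sqlite3

# Hypothetical schema: the n_data table plus a "version" counter that plays
# the role of the "modified" timestamp.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE n_data (p_id INTEGER PRIMARY KEY, data TEXT, version INTEGER NOT NULL)"
)
conn.execute("INSERT INTO n_data VALUES (1, 'original', 0)")
conn.commit()


def read_row(conn, p_id):
    """Return (data, version) as seen at read time."""
    return conn.execute(
        "SELECT data, version FROM n_data WHERE p_id = ?", (p_id,)
    ).fetchone()


def write_if_unchanged(conn, p_id, new_data, seen_version):
    """Update only if the version is still the one we read; True on success."""
    cur = conn.execute(
        "UPDATE n_data SET data = ?, version = version + 1 "
        "WHERE p_id = ? AND version = ?",
        (new_data, p_id, seen_version),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows touched means someone else wrote first


# Two writers read the same state, then both try to write back.
_, version_a = read_row(conn, 1)
_, version_b = read_row(conn, 1)
first_ok = write_if_unchanged(conn, 1, "from A", version_a)
second_ok = write_if_unchanged(conn, 1, "from B", version_b)
final_data = read_row(conn, 1)[0]
```

The second writer gets False back and is expected to re-read the row and redo its processing, rather than silently overwriting the first writer's change.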
[1] Some DBMS do page locking, and some escalate many row-level locks to a table lock. Neither is the case in Postgres.
While creating a tkinter application to store book information, I realized that simply deleting a row of information from the SQL database does not update the indexes. It is kind of hard to explain, but here is a picture of what I mean:
link to picture. (still young on this account, so pictures can't be embedded, sorry for the inconvenience)
As you can see, the first column represents the index and index 3 is missing because I deleted it. Is there a way such that upon deleting a row, anything below it just shifts up to cover for the empty spot?
Your use of the word "index" must be based on the application language, not the database language. In databases, indexes are additional data structures that speed certain operations on tables.
You are referring to an "id" column, presumably one that is defined automatically as identity, auto_increment, serial, or whatever the underlying database uses.
A very important point is that deleting a row from a table does not affect other rows in the table (unless you have gone through the work of writing triggers to make that happen). It just deletes that row.
The second, more important point is that you do not want to change the "identity" of rows, and that is what the column you are calling an "index" provides. It identifies the row. It not only identifies the row today, but it identifies the same row tomorrow. And, if it existed, yesterday. That is, you don't want to change the identity.
This is even more important when you have foreign key relationships -- that is, other tables that refer to this row. Those relationships could get all messed up if the ids start changing.
SQL does offer a simple way to get a number with no gaps:
select row_number() over (order by "index") as seqnum
from t;
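To see the effect, here is a small sketch using Python's built-in sqlite3 module (window functions need SQLite 3.25 or later). The books table and its contents are made up for illustration. The stored id keeps its gap, while the displayed number is computed at query time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table standing in for the book list in the question.
conn.execute("CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO books (id, title) VALUES (?, ?)",
    [(1, "A"), (2, "B"), (3, "C"), (4, "D")],
)
conn.execute("DELETE FROM books WHERE id = 3")  # leaves a gap in the ids

# seqnum is gap-free and only exists in the result set; id is untouched.
rows = conn.execute(
    "SELECT row_number() OVER (ORDER BY id) AS seqnum, id, title "
    "FROM books ORDER BY id"
).fetchall()
```

A tkinter list could show seqnum to the user and keep id hidden for updates and deletes.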
I'm starting a new application and was wondering what the best method of logging is. Some tables in the database will need to have every change recorded, and the user that made the change. Other tables may just need to have the last modified time recorded.
In previous applications I've used different methods to do this but want to hear what others have done.
I've tried the following:
1. Add a "modified" date-time field to the table to record the last time it was edited.
2. Add a secondary table just for recording changes in a primary table. Each row in the secondary table represents a changed field in the primary table, so one record update in the primary table could create several records in the secondary table.
3. Add a table similar to no. 2, but one that records edits across three or four tables and references the table each edit relates to in an additional field.
What methods do you use and would recommend?
Also, what is the best way to record deleted data? I never like the idea that a user can permanently delete a record from the DB, so usually I have a boolean field 'deleted' which is set to true when it's deleted, and then it is filtered out of all queries at model level. Any other suggestions on this?
Last one: what is the best method for recording user activity? At the moment I have a table which records logins, logouts, password changes etc., and gives each action a numeric code (1, 2, 3 etc.).
Hope I haven't crammed too much into this question. Thanks.
I know it's a very old question, but I wanted to add a more detailed answer, as this is the first link I got when googling about db logging.
There are basically two ways to log data changes:
on application server layer
on database layer.
If you can, just log on the application server side. It is much clearer and more flexible.
If you need to log on the database layer you can use triggers, as @StanislavL said. But triggers can slow down your database and limit you to storing the change log in the same database.
You can also look at transaction log monitoring.
For example, in PostgreSQL you can use the logical replication mechanism to stream changes in JSON format from your database to anywhere.
In a separate service you can then receive, handle and log the changes in any form and in any database (for example, just put the JSON you receive into Mongo).
You can add triggers to any tracked table to listen for insert/update/delete. In the triggers, just check the NEW and OLD values and write them to a special table with these columns:
table_name
entity_id
modification_time
previous_value
new_value
user
It's hard to figure out which user made a change, but it is possible if you add a changed_by column to the table you are tracking.
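A rough sketch of the trigger approach, using Python's built-in sqlite3 module so it is self-contained. The product table, the change_log table name and the product_audit trigger name are all assumptions for illustration, and only UPDATE is covered. Trigger syntax differs between databases (PostgreSQL, for instance, needs a trigger function), so treat this as the shape of the idea rather than portable DDL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical tracked table, with the changed_by column mentioned above.
CREATE TABLE product (id INTEGER PRIMARY KEY, price TEXT, changed_by TEXT);

-- Log table with the columns listed above.
CREATE TABLE change_log (
    table_name        TEXT,
    entity_id         INTEGER,
    modification_time TEXT DEFAULT CURRENT_TIMESTAMP,
    previous_value    TEXT,
    new_value         TEXT,
    user              TEXT
);

-- Copy the OLD and NEW values into the log on every update.
CREATE TRIGGER product_audit AFTER UPDATE ON product
BEGIN
    INSERT INTO change_log (table_name, entity_id, previous_value, new_value, user)
    VALUES ('product', OLD.id, OLD.price, NEW.price, NEW.changed_by);
END;
""")

conn.execute("INSERT INTO product VALUES (1, '10.00', 'alice')")
conn.execute("UPDATE product SET price = '12.50', changed_by = 'bob' WHERE id = 1")

log = conn.execute(
    "SELECT table_name, entity_id, previous_value, new_value, user FROM change_log"
).fetchall()
```

Separate AFTER INSERT and AFTER DELETE triggers would be needed to record new and removed rows.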
I have a table named 'games' which contains a column named 'title'. This column is unique, and the database used is PostgreSQL.
I have a user input form that allows the user to insert a new 'game' into the 'games' table. The function that inserts a new game checks whether a previously entered 'game' with the same 'title' already exists; for this, I get the count of rows with the same game 'title'.
I use transactions for this: the insert function starts with BEGIN, gets the row count, inserts the new row if the row count is 0, and COMMITs the changes once the process is completed.
The problem is that if two games with the same title are submitted at the same time, there is a chance that both would be inserted, since I only get the count of rows to check for duplicate records, and the transactions are isolated from each other.
I thought of locking the tables when getting the row count as:
LOCK TABLE games IN ACCESS EXCLUSIVE MODE;
SELECT count(id) FROM games WHERE games.title = 'new_game_title'
This would lock the table for reading too (which means the other transaction would have to wait until the current one has completed). I suspect this would solve the problem. Is there a better way around this (avoiding duplicate games with the same title)?
You should NOT need to lock your tables in this situation.
Instead, you can use one of the following approaches:
Define UNIQUE index for column that really must be unique. In this case, first transaction will succeed, and second will error out.
Define AFTER INSERT OR UPDATE OR DELETE trigger that will check your condition, and if it does not hold, it should RAISE error, which will abort offending transaction
In all these cases, your client code should be ready to properly handle possible failures (like failed transactions) that could be returned by executing your statements.
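As a sketch of what "ready to handle failures" can look like on the client side, here is the UNIQUE approach with Python's built-in sqlite3 module. The add_game helper is a hypothetical name, and with PostgreSQL the driver would raise its own unique-violation error instead of sqlite3.IntegrityError.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The UNIQUE constraint on title is what actually prevents duplicates.
conn.execute(
    "CREATE TABLE games (id INTEGER PRIMARY KEY, title TEXT NOT NULL UNIQUE)"
)


def add_game(conn, title):
    """Try to insert; return True if stored, False if the title already exists."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute("INSERT INTO games (title) VALUES (?)", (title,))
        return True
    except sqlite3.IntegrityError:
        # The database rejected the duplicate; report it instead of crashing.
        return False


first = add_game(conn, "Tetris")
second = add_game(conn, "Tetris")
count = conn.execute("SELECT count(*) FROM games").fetchone()[0]
```

No prior count query and no table lock are needed: the insert itself is the check.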
Using the highest transaction isolation level (Serializable) you can achieve something similar to what you actually asked for. But be aware that a transaction may then fail with: ERROR: could not serialize access due to concurrent update
I do not agree with the constraint approach entirely. You should have a constraint to protect data integrity, but relying on the constraint forces you to identify not only what error occurred, but which constraint caused the error. The trouble is not catching the error, as some have discussed, but identifying what caused it and providing a human-readable reason for the failure. Depending on which language your application is written in, this can be next to impossible, e.g. telling the user "Game title [foo] already exists" instead of "game must have a price" for a separate constraint.
There is a single statement alternative to your two stage approach:
INSERT INTO games ( [column1], ... )
SELECT [value1], ...
WHERE NOT EXISTS ( SELECT 1 FROM games AS g2 WHERE g2.title = [new_title] );
I want to be clear with this... this is not an alternative to having a unique constraint (which requires extra data for the index). You must have one to protect your data from corruption.
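Here is a small sketch of the single-statement insert, run through Python's built-in sqlite3 module. The insert_if_absent helper is a hypothetical name. The number of affected rows tells the application whether the title was new, which makes it easy to show a specific "already exists" message, while the UNIQUE constraint stays in place as the real safeguard.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE games (id INTEGER PRIMARY KEY, title TEXT NOT NULL UNIQUE)"
)


def insert_if_absent(conn, title):
    """Insert the title unless it exists; return the number of rows inserted."""
    cur = conn.execute(
        "INSERT INTO games (title) "
        "SELECT ? WHERE NOT EXISTS (SELECT 1 FROM games AS g2 WHERE g2.title = ?)",
        (title, title),
    )
    conn.commit()
    return cur.rowcount  # 1 = inserted, 0 = title was already present


inserted_first = insert_if_absent(conn, "Tetris")
inserted_second = insert_if_absent(conn, "Tetris")
```

Two concurrent transactions can still both pass the NOT EXISTS check, so the client should still be prepared for a constraint error from the losing insert.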
We have a table and a set of procedures that are used for generating pk ids. The table holds the last id, and the procedures get the id, increment it, update the table, and then return the newly incremented id.
This procedure could potentially run within a transaction. The problem is that if we have a rollback, it could roll back to an id that is lower than ids that came into use during the transaction (say, generated by a different user or thread). Then, when the id is incremented again, it will cause duplicates.
Is there any way to exclude the id generating table from a parent transaction to prevent this from happening?
To add detail on our current problem...
First, we have a system we are preparing to migrate a lot of data into. The system consists of an MS SQL (2008) database and a textml database. The SQL database houses data less than 3 days old, while the textml database acts as an archive for anything older. The textml db also relies on the SQL db to provide ids for particular fields. These fields are currently Identity PKs, and are generated on insertion before publishing to the textml db. We do not want to wash all our migrated data through SQL, since the records would flood the current system, both in terms of traffic and data. But at the same time we have no way of generating these ids, since they are auto-incremented values that SQL Server controls.
Secondly, we have a system requirement to be able to pull an old asset out of the textml database and insert it back into the SQL database with the original ids. This is done for correction and editing purposes, and if we alter the ids it will break relations downstream on clients' systems, which we have no control over. Of course, all this is an issue because the id columns are Identity columns.
procedures gets the id, increments it,
updates the table, and then returns
the newly incremented id
This will cause deadlocks. The procedure must increment and return in one single, atomic step, e.g. by using the OUTPUT clause in SQL Server:
update ids
set id = id + 1
output inserted.id
where name = @name;
You don't have to worry about concurrency. The fact that you generate ids this way implies that only one transaction can increment an id at a time, because the update will lock the row exclusively. You cannot get duplicates. You do get complete serialization of all operations (i.e. poor performance and low throughput), but that is a different issue. And this is why you should use built-in mechanisms for generating sequences and identities. These are specific to each platform: AUTO_INCREMENT in MySQL, SEQUENCE in Oracle, IDENTITY and SEQUENCE in SQL Server (sequence only in Denali) etc.
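For readers without SQL Server at hand, the "increment first, then read, under one lock" idea can be approximated with Python's built-in sqlite3 module. This is only an analogy: SQLite has no OUTPUT clause and locks the whole database rather than one row, so BEGIN IMMEDIATE is used here to take the write lock up front. The next_id helper and the 'orders' counter name are assumptions for illustration.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode, so the
# transaction boundaries below are controlled explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE ids (name TEXT PRIMARY KEY, id INTEGER NOT NULL)")
conn.execute("INSERT INTO ids VALUES ('orders', 41)")


def next_id(conn, name):
    """Increment the counter and return the new value as one atomic unit."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before touching the row
    try:
        conn.execute("UPDATE ids SET id = id + 1 WHERE name = ?", (name,))
        (new_id,) = conn.execute(
            "SELECT id FROM ids WHERE name = ?", (name,)
        ).fetchone()
        conn.execute("COMMIT")
        return new_id
    except Exception:
        conn.execute("ROLLBACK")
        raise


a = next_id(conn, "orders")
b = next_id(conn, "orders")
```

Because the write happens before the read, no other writer can slip in between the two statements, which is the property the single UPDATE ... OUTPUT statement provides in SQL Server.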
Updated
As I read your edit, the only reason why you want control of the generated identities is to be able to insert back archived records. This is already possible, simply use IDENTITY_INSERT:
Allows explicit values to be inserted
into the identity column of a table
Turn it on when you insert back the old record, then turn it back off:
SET IDENTITY_INSERT recordstable ON;
INSERT INTO recordstable (id, ...) values (@oldid, ...);
SET IDENTITY_INSERT recordstable OFF;
As for why manually generated ids serialize all operations: any transaction that generates an id will exclusively lock the row in the ids table. No other transaction can read or write that row until the first transaction commits or rolls back. Therefore there can be only one transaction generating an id on a table at any moment, ie. serialization.
Earlier today I asked this question which arose from A- My poor planning and B- My complete disregard for the practice of normalizing databases. I spent the last 8 hours reading about normalizing databases and the finer points of JOIN and worked my way through the SQLZoo.com tutorials.
I am enlightened. I understand the purpose of database normalization and how it can suit me. Except that I'm not entirely sure how to execute that vision from a procedural standpoint.
Here's my old vision: 1 table called "files" that held, let's say, a file id, a file url and the appropriate grade levels for that file.
New vision!: 1 table for "files", 1 table for "grades", and a junction table to mediate.
But that's not my problem. This is a really basic Q that I'm sure has an obvious answer- When I create a record in "files", it gets assigned the incremented primary key automatically (file_id). However, from now on I'm going to need to write that file_id to the other tables as well. Because I don't assign that id manually, how do I know what it is?
If I upload text.doc and it gets file_id 123, how do I know it got 123 in order to write it to "grades" and the junction table? I can't do a max(file_id) because if you have concurrent users, you might nab a different id. I just don't know how to get the file_id value without having manually assigned it.
You may want to use LAST_INSERT_ID() as in the following example:
START TRANSACTION;
INSERT INTO files (file_id, url) VALUES (NULL, 'text.doc');
INSERT INTO grades (file_id, grade) VALUES (LAST_INSERT_ID(), 'some-grade');
COMMIT;
The transaction ensures that the operation remains atomic: This guarantees that either both inserts complete successfully or none at all. This is optional, but it is recommended in order to maintain the integrity of the data.
For LAST_INSERT_ID(), the most
recently generated ID is maintained in
the server on a per-connection basis.
It is not changed by another client.
It is not even changed if you update
another AUTO_INCREMENT column with a
nonmagic value (that is, a value that
is not NULL and not 0).
Using
LAST_INSERT_ID() and AUTO_INCREMENT
columns simultaneously from multiple
clients is perfectly valid. Each
client will receive the last inserted
ID for the last statement that client
executed.
Source and further reading:
MySQL Reference: How to Get the Unique ID for the Last Inserted Row
MySQL Reference: START TRANSACTION, COMMIT, and ROLLBACK Syntax
In PHP, to get the automatically generated ID of a MySQL record, use the mysqli->insert_id property of your mysqli object.
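Most client libraries expose the same per-connection value. As an illustration, here is the files/grades insert with Python's built-in sqlite3 module, where the cursor's lastrowid attribute plays the role of LAST_INSERT_ID(). The table definitions are simplified assumptions based on the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE files (file_id INTEGER PRIMARY KEY AUTOINCREMENT, url TEXT)"
)
conn.execute(
    "CREATE TABLE grades (file_id INTEGER REFERENCES files(file_id), grade TEXT)"
)

with conn:  # one transaction: both inserts commit together or not at all
    cur = conn.execute("INSERT INTO files (url) VALUES (?)", ("text.doc",))
    file_id = cur.lastrowid  # id generated by this connection's own insert
    conn.execute(
        "INSERT INTO grades (file_id, grade) VALUES (?, ?)",
        (file_id, "some-grade"),
    )

linked = conn.execute(
    "SELECT f.url, g.grade FROM files f JOIN grades g ON g.file_id = f.file_id"
).fetchall()
```

The value comes from the connection that performed the insert, so concurrent users do not see each other's ids, unlike a max(file_id) query.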
How are you going to find the entry tomorrow, after your program has forgotten the value of last_insert_id()?
Using a surrogate key is fine, but your table still represents an entity, and you should be able to answer the question: what measurable properties define this particular entity? The set of these properties is the natural key of your table, and even if you use surrogate keys, such a natural key should always exist and you should use it to retrieve information from the table. Use the surrogate key to enforce referential integrity, for indexing purposes and to make joins easier on the eye. But don't let surrogate keys escape from the database.