I have a database table for articles, and it has a field for upvotes.
I was thinking about creating an SQL query that first gets the current value of upvotes and then increments it by 1.
But what if 5 people click the upvote button at once? What will happen then?
Or is there a better way to do this altogether?
My strong suggestion is to keep a record of every upvote and downvote in a votes table:
create table votes (
    votes_id <autoincrement> primary key,       -- use your DBMS's auto-increment syntax (serial, identity, auto_increment, ...)
    user_id int references users(user_id),      -- whodunnit
    topic_id int references topics(topic_id),   -- what they're voting on
    inc int,                                    -- +1 for an upvote, -1 for a downvote
    created_at datetime default current_timestamp,
    check (inc in (-1, 1))
);
You can then summarize the votes as you want. You can see trends in voting over time. You can ensure that someone can "unvote" if they have voted in the past.
And, inserting into a table runs no risk of having different users interfere with each other.
The downside is that summarizing the results takes a bit more time. You can optimize that when the issue arises.
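For example, a minimal summary query against the table above could compute the current score per topic on the fly:

select topic_id, sum(inc) as score
from votes
group by topic_id;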
There are two solutions:
If you really need to load the value into your application and increment it there, writing it back afterwards, get an appropriate lock before selecting the value. Release the lock once you are finished with the value, whether you end up cancelling the operation or writing back the incremented upvote count.
Otherwise a concurrent instance B could read the same value and write it back after the first instance A. Say both read 3 and both increment it to 4. A writes it back first, so the value in the database is 4; then B also writes it back, and the value in the database is still 4, even though 3 + 2 = 5. One upvote gets "lost" this way. This is known as the "lost update" problem.
You can prevent this with a lock as mentioned: B cannot read the value until A has updated it and released the lock. Afterwards B reads 4 instead of 3 and therefore writes back 5, which is correct.
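As a sketch of that locking variant, using the same placeholder names as the update statement below (the exact syntax differs between databases; SELECT ... FOR UPDATE as shown here works in e.g. PostgreSQL and MySQL/InnoDB, and a row lock serves the same purpose here as a table lock):

BEGIN;

-- lock the row so no concurrent transaction can read-modify-write it
SELECT votes
FROM votes
WHERE article = #some_id
FOR UPDATE;

-- the application increments the value it just read and writes it back
UPDATE votes
SET votes = 4              -- the incremented value computed in the application (3 + 1 in the example above)
WHERE article = #some_id;

COMMIT;                    -- the lock is released here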
But preferably, do it in a single update, like
UPDATE votes
SET votes = votes + 1
WHERE article = #some_id;
That is, you increment the actual value in the database, regardless of what your application thinks this value currently is.
Provided that your transaction has an appropriate isolation level, the database will take care of locking by itself and thus keep concurrent transactions from updating with "dirty", outdated data.
I suggest you read a little more about transactions, isolation levels and locking to fully understand the problem.
My question is probably very specific to Postgres, or maybe not.
A program which I cannot modify has access to Postgres via npgsql using a simple select command; that's all I know.
I also have access via npgsql. The table is defined as:
-- Table: public.n_data
-- DROP TABLE public.n_data;
CREATE TABLE public.n_data
(
    u_id integer,
    p_id integer NOT NULL,
    data text,
    CONSTRAINT nc PRIMARY KEY (p_id)
)
WITH (
    OIDS=FALSE
);
ALTER TABLE public.n_data
    OWNER TO postgres;
(If that info is useful anyway)
I access one single big column, read from it and write back to it.
This all works fine so far.
The question is: how does Postgres handle it if we write at the same time?
Any problems there?
And if Postgres does not handle that automatically: what about when I read the data, process it, the data changes in the meantime, and I write back the processed data afterwards? That would mean lost data.
It's a bit tricky to test for data integrity, since this data block is huge and corruptions are hard to find.
I do it with C#, if that means anything.
Locking in most¹ relational databases (including Postgres) is always on row level, never on column level (a relational database has columns and rows, not "cells", "fields" or "records").
If two transactions modify the same row, the second one will have to wait until the first one commits or rolls back.
If two transactions modify different rows then they can do that without any problems as long as they don't modify columns that are part of a unique constraint or primary key to the same value.
Read access to data is never blocked in Postgres by regular DML statements. So yes, while one transaction modifies data, another one will see the old data until the first transaction commits the changes ("read consistency").
To handle lost updates you can either use the serializable isolation level or make all transactions follow the pattern that they first need to obtain a lock on the row (select ... for update) and hold that until they are finished. Search for "pessimistic locking" to get more details about this pattern.
Another option is to include a "modified" timestamp in your table. When a process reads the data it also reads the modification timestamp. When it sends back the new changes it includes a where modified_at = <value obtained when reading> - if the data has changed the condition will not hold true and nothing will be updated and you need to restart your transaction. Search for "optimistic locking" to find more details about this pattern.
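A sketch of that optimistic pattern against the table above, assuming a hypothetical modified_at column has been added to n_data:

-- read the data together with its modification timestamp
SELECT data, modified_at
FROM n_data
WHERE p_id = 1;

-- ... process the data in the application, remembering the modified_at value that was read ...

-- write back only if nobody changed the row in the meantime
UPDATE n_data
SET data = 'new value',
    modified_at = current_timestamp
WHERE p_id = 1
  AND modified_at = '2017-01-01 12:00:00';  -- the value read above

-- if this updates 0 rows, someone else changed the data: re-read and retry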
¹ Some DBMS do page locking and some escalate many row-level locks to a table lock. Neither is the case in Postgres.
Apologies for the clumsy title; feel free to suggest an improvement.
I have a table Records, and there's a UserId column that refers to who made the deposition. There's also a Counter column which is identity(1,1) (it's important to keep in mind that it's not the same as the Id column, which is the primary key).
The problem became obvious when we started depositing from different accounts. Before, a user could ask for records number 123 through 127 and get 5 amounts, but now their picks might be 123, 125, 126 or, even worse, nothing at all.
The only option I can imagine to handle it is to create a business logic layer that checks for the highest deposition counter for a user and adds the new record with that value increased by one.
But it sure would be nice to have it work automagically. Something like identity(1,1,guid). Is it possible?
"The only option I can imagine to handle it is to create a business logic layer that checks for the highest deposition counter for a user and adds the new record with that value increased by one."
Time to learn:
Add the last given number to the account table.
Use a trigger to assign higher numbers in the insert event of SQL Server.
Finished. This obviously assumes your relational database is used in a relational fashion so the records are related to a user table, not just holding a user id (which would be a terrible design).
If not, you can also maintain a table of all seen GUID's by means of said trigger.
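A rough T-SQL sketch of that trigger, under assumed names: a Users table carrying a LastCounter column for the last number handed out, and a Counter column on Records that is no longer an identity column so the trigger can write to it:

CREATE TRIGGER trg_Records_Counter
ON Records
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;

    -- claim the numbers first; the lock taken on the Users row here
    -- serializes concurrent inserts for the same user
    UPDATE u
    SET u.LastCounter = u.LastCounter + c.cnt
    FROM Users u
    JOIN (SELECT UserId, COUNT(*) AS cnt FROM inserted GROUP BY UserId) c
        ON c.UserId = u.UserId;

    -- then hand the claimed numbers out to the freshly inserted rows
    ;WITH numbered AS
    (
        SELECT i.Id,
               i.UserId,
               ROW_NUMBER() OVER (PARTITION BY i.UserId ORDER BY i.Id) AS rn,
               COUNT(*)     OVER (PARTITION BY i.UserId)               AS cnt
        FROM inserted i
    )
    UPDATE r
    SET r.Counter = u.LastCounter - n.cnt + n.rn
    FROM Records r
    JOIN numbered n ON n.Id = r.Id
    JOIN Users u ON u.UserId = n.UserId;
END;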
To maintain such a column, you would need a trigger.
You might consider calculating the value when you query the table:
select r.*, row_number() over (partition by UserId order by Id) as seqnum
from Records r;
The Tables
Let us assume that we have a table of articles:
CREATE TABLE articles
(
    id integer PRIMARY KEY,
    last_update timestamp NOT NULL,
    ...
);
Users can bookmark articles:
CREATE TABLE bookmarks
(
    user_id integer NOT NULL REFERENCES users(id),
    article integer NOT NULL REFERENCES articles(id),
    PRIMARY KEY(user_id, article),
    last_seen timestamp NOT NULL
);
The Feature to be Implemented
What I want to do now is to inform users about articles which have been updated after the user has last seen them. The access to the whole system is via a web interface. Whenever a page is requested, the system should check whether the user should be notified about updated articles (similar to the notification bar on the top of a page here on SO).
The Question
What is the best and most efficient implementation of such a feature, given that both tables above contain tens of millions of rows?
My Solution #1
One could do a simple join like this:
SELECT ... FROM articles, bookmarks WHERE bookmarks.user_id = 1234
AND bookmarks.article = articles.id AND last_seen < last_update;
However, I'm worried that doing this JOIN might be expensive if the user has many bookmarked articles (which might happen more often than you think), especially if the database (in my case PostgreSQL) has to traverse the index on the primary key of articles for every bookmarked article. Also the last_seen < last_update predicate can only be checked after accessing the rows on the disk.
My Solution #2
Another method is more difficult, but might be better in my case. It involves expanding the bookmarks table by a notify column:
CREATE TABLE bookmarks
(
    user_id integer NOT NULL REFERENCES users(id),
    article integer NOT NULL REFERENCES articles(id),
    PRIMARY KEY(user_id, article),
    last_seen timestamp NOT NULL,
    notify boolean NOT NULL DEFAULT false
);
CREATE INDEX bookmark_article_idx ON bookmarks (article);
Whenever an article is updated, the update operation should trigger setting notify to true for every user who has bookmarked this article. The big disadvantage that comes to mind is that if an article has been bookmarked a lot, setting notify to true for lots of rows can be expensive. The advantage could be that checking for notifications is as simple as:
SELECT article FROM bookmarks WHERE user_id = 1234 AND notify = true;
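A minimal sketch of the update trigger described above, in PostgreSQL and assuming the schema shown (EXECUTE FUNCTION requires PostgreSQL 11+; older versions use EXECUTE PROCEDURE):

CREATE OR REPLACE FUNCTION flag_bookmarks() RETURNS trigger AS $$
BEGIN
    -- flag every bookmark of the updated article
    UPDATE bookmarks
    SET notify = true
    WHERE article = NEW.id;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER articles_flag_bookmarks
AFTER UPDATE OF last_update ON articles
FOR EACH ROW
EXECUTE FUNCTION flag_bookmarks();

Clearing the flag again when the user actually views the article would then just be an UPDATE setting notify back to false for that bookmark.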
Final Thoughts
I think that the second method can be a lot more efficient if the number of page views (and with it the number of times the system checks for notifications) outweighs the number of updates of articles. However, this might not always be the case. There might be users with lots of bookmarked articles which log in only once a month for a couple of minutes, and others who check for updates almost every minute.
There's also a third method that involves a notification table in which the system INSERTs notifications for every user once an article is updated. However, I consider that an inefficient variant of Method #2 since it involves saving notifications.
What method is the most efficient when both tables contain millions of rows? Do you have another method that might be better?
Want to improve this post? Provide detailed answers to this question, including citations and an explanation of why your answer is correct. Answers without enough detail may be edited or deleted.
I would certainly go for solution one, making sure that articles has an index on (id, last_update).
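For instance (the index name here is just an example):

CREATE INDEX articles_id_last_update_idx ON articles (id, last_update);

With that index in place, the join and the last_seen < last_update check can often be answered from the index alone, without visiting the table rows.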
Normalization theory takes you directly to solution #1. Rather than asking which design is faster, you might want to ask: how do I make my server execute this query efficiently, given my bog-standard BCNF tables? :-)
If your server cannot be made to execute your query fast enough (for whatever value of "enough" applies in your case), you need a faster server. Why? Because performance will only degrade as users and rows are added. Normalization was invented to minimize updates and update anomalies. Use it to your advantage, or pay the price in hours of your time and hard-to-detect errors in your system.
I see a third solution, to make things more interesting. ;-) It is a mixture of both solutions. I would assume that there is a time of day or night when there is little usage on the system, and make a daily/nightly run to mark all the bookmarks that have news.
That alone would delay the information "new article updates for you!" by up to a day, which is not what you want. So I would additionally store a column "updated today" (an enum "Yes"/"No", or a tinyint) which is set to "Yes" on article update and reset to "No" on that nightly update run.
Then show "has changes" for all bookmarks marked by the nightly cron job, and additionally add the information from the select in solution #1, but restricted to the articles which have changed today.
Probably most articles do not get updated daily, so you should come out ahead with that.
Of course I would approve of an answer based on measurement, but you need a lot of assumptions to make a good benchmark.
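A hedged sketch of the nightly run, assuming an added boolean updated_today column on articles and reusing the notify flag on bookmarks from solution #2:

-- nightly job: flag bookmarks of articles that changed today ...
UPDATE bookmarks b
SET notify = true
FROM articles a
WHERE a.id = b.article
  AND a.updated_today
  AND b.last_seen < a.last_update;

-- ... then reset the per-article marker for the next day
UPDATE articles
SET updated_today = false
WHERE updated_today;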
WARNING: This tale of woe contains examples of code smells, poor design decisions, and technical debt.
If you are conversant with SOLID principles, practice TDD and unit test your work, DO NOT READ ON. Unless you want a good giggle at someone's misfortune and gloat in your own awesomeness knowing that you would never leave behind such a monumental pile of crap for your successors.
So, if you're sitting comfortably then I'll begin.
In this app that I have inherited and been supporting and bug fixing for the last 7 months, I have been left with a DOOZY of a balls-up by a developer who left 6 and a half months ago. Yes, 2 weeks after I started.
Anyway. In this app we have clients, employees and visits tables.
There is also a table called AppNewRef (or something similar) that ... wait for it ... contains the next record ID to use for each of the other tables. So it may contain data such as:
TypeID   Description   NextRef
1        Employees     804
2        Clients       1708
3        Visits        56783
When the application creates new rows for Employees, it looks in the AppNewRef table, gets the value, uses that value for the ID, and then updates the NextRef column. Same thing for Clients, and Visits and all the other tables whose NextID to use is stored in here.
Yes, I know, no auto-numbering IDENTITY columns on this database. All under the excuse of "when it was an Access app". These ID's are held in the (VB6) code as longs. So, up to 2 billion 147 million records possible. OK, that seems to work fairly well. (apart from the fact that the app is updating and taking care of locking / updating, etc., and not the database)
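Just to illustrate the mechanism: the read-and-bump the app performs could, in principle, have been a single atomic T-SQL statement like the one below (the real app does it client-side in VB6 instead, which is part of the problem):

-- claim the next Visits ID and advance the counter in one atomic statement
UPDATE AppNewRef
SET NextRef = NextRef + 1
OUTPUT deleted.NextRef AS VisitIdToUse   -- the value before the increment, which the app would use
WHERE TypeID = 3;                        -- 3 = Visits in the sample data above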
So, our users are quite happily creating Employees, Clients, Visits etc. The Visits ID is steady increasing a few dozen at a time. Then the problems happen. Our clients are causing database corruptions while creating batches of visits because the server is beavering away nicely, and the app becomes unresponsive. So they kill the app using task manager instead of being patient and waiting. Granted the app does seem to lock up.
Roll on to earlier this year and developer Tim (real name. No protecting the guilty here) starts to modify the code to do the batch updates in stages, so that the UI remains 'responsive'. Then April comes along, and he's working his notice (you can picture the scene now, can't you ?) and he's beavering away to finish the updates.
End of April, and beginning of May we update some of our clients. Over the next few months we update more and more of them.
Unseen by Tim (real name, remember) and me (who started two weeks before Tim left) and the other new developer that started a week after, the ID's in the visit table start to take huge leaps upwards. By huge, I mean 10000, 20000, 30000 at a time. Sometimes a few hundred thousand.
Here's a graph that illustrates the rapid increase in IDs used.
Roll on November. Customer phones Tech Support and reports that he's getting an error. I look at the error message and ask for the database so I can debug the code. I find that the value is too large for a long. I do some queries, pull the information, drop it into Excel and graph it.
I don't think making the code handle anything longer than a long for the ID's is the right approach, as this app passes that ID into other DLL's and OCX's and breaking the interface on those just seems like a whole world of hurt that I don't want to encounter right now.
One potential idea that I'm investigating is to try to modify the ID's so that I can get them down into a lower range, essentially filling the gaps, using the ROW_NUMBER function.
What I'm thinking of doing is adding a new column to each of the tables that have a Foreign Key reference to these Visit ID's (not a proper foreign key mind, those constraints don't exist in this database). This new column could store the old (current) value of the Visit ID (oh, just to confuse things; on some tables it's called EventID, and on some it's called VisitID).
Then, for each of the other tables that refer to that VisitID, update to the new value.
Ideas? Suggestions? Snippets of T-SQL to help? All gratefully received.
Option one:
Explicitly constrain all of your foreign key relationships, and set them to be ON UPDATE CASCADE.
This will mean that whenever you change the ID, the foreign keys will automatically be updated.
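For example, for each child table (all names here are hypothetical, since the real schema has no declared constraints yet):

ALTER TABLE VisitDetails                         -- a hypothetical child table
ADD CONSTRAINT FK_VisitDetails_Visits
    FOREIGN KEY (VisitID) REFERENCES Visits (VisitID)
    ON UPDATE CASCADE;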
Then you just run something like this...
WITH resequenced AS
(
    SELECT
        ROW_NUMBER() OVER (ORDER BY id) AS newID,
        *
    FROM
        yourTable
)
UPDATE
    resequenced
SET
    id = newID
I haven't done this in ages, so I forget if it causes problems mid-update by having two records with the same id value. If it does, you could do something like this first...
UPDATE yourTable SET id = -id
Option two:
Ensure that none of your foreign key relationships are explicitly defined. If they are, note them down and remove them.
Then do something like...
CREATE TABLE temp
(
    newID INT IDENTITY (1,1),
    oldID INT
)

INSERT INTO temp (oldID) SELECT id FROM yourTable

/* Do this once for the table you are re-identifying */
/* Repeat this for all fact tables holding that ID as a foreign key */
UPDATE
    factTable
SET
    foreignID = temp.newID
FROM
    temp
WHERE
    foreignID = temp.oldID
Then re-apply any existing foreign key relationships.
This is a pretty scary option. If you forget to update a table, you just borked your data. But, you can give that temp table a much nicer name and KEEP it.
Good luck. And may the lord have mercy on your soul. And Tim's if you ever meet him in a dark alley.
I would create a numbers table that just has a sequence from 1 up to whatever the maximum for a long is, with an increment of 1, and then change the logic of getting the max ID for VisitID (and maybe the others) to do a right join between the numbers table and the visits table. Then you can just look for the min of the numbers that are not yet used:
select min(number) from visits right join numbers on visits.id = numbers.number where visits.id is null
That way you get all the gaps filled in without having to change any of the other tables.
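If you don't already have a numbers table, one quick way to populate it in T-SQL (capped at a million rows here just to keep the example short):

-- build a numbers table with values 1..1,000,000
;WITH n AS
(
    SELECT TOP (1000000)
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS number
    FROM sys.all_objects a
    CROSS JOIN sys.all_objects b
)
SELECT number INTO numbers FROM n;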
but I would just redo the whole database.
Earlier today I asked this question which arose from A- My poor planning and B- My complete disregard for the practice of normalizing databases. I spent the last 8 hours reading about normalizing databases and the finer points of JOIN and worked my way through the SQLZoo.com tutorials.
I am enlightened. I understand the purpose of database normalization and how it can suit me. Except that I'm not entirely sure how to execute that vision from a procedural standpoint.
Here's my old vision: one table called "files" that held, let's say, a file id, a file url, and the appropriate grade levels for that file.
New vision!: 1 table for "files", 1 table for "grades", and a junction table to mediate.
But that's not my problem. This is a really basic question that I'm sure has an obvious answer: when I create a record in "files", it gets assigned the incremented primary key automatically (file_id). However, from now on I'm going to need to write that file_id to the other tables as well. Because I don't assign that id manually, how do I know what it is?
If I upload text.doc and it gets file_id 123, how do I know it got 123 in order to write it to "grades" and the junction table? I can't do a max(file_id) because if you have concurrent users, you might nab a different id. I just don't know how to get the file_id value without having manually assigned it.
You may want to use LAST_INSERT_ID() as in the following example:
START TRANSACTION;
INSERT INTO files (file_id, url) VALUES (NULL, 'text.doc');
INSERT INTO grades (file_id, grade) VALUES (LAST_INSERT_ID(), 'some-grade');
COMMIT;
The transaction ensures that the operation remains atomic: either both inserts complete successfully or neither does. This is optional, but it is recommended in order to maintain the integrity of the data.
For LAST_INSERT_ID(), the most recently generated ID is maintained in the server on a per-connection basis. It is not changed by another client. It is not even changed if you update another AUTO_INCREMENT column with a nonmagic value (that is, a value that is not NULL and not 0).
Using LAST_INSERT_ID() and AUTO_INCREMENT columns simultaneously from multiple clients is perfectly valid. Each client will receive the last inserted ID for the last statement that client executed.
Source and further reading:
MySQL Reference: How to Get the Unique ID for the Last Inserted Row
MySQL Reference: START TRANSACTION, COMMIT, and ROLLBACK Syntax
In PHP, to get the automatically generated ID of a MySQL record, use the mysqli->insert_id property of your mysqli object.
How are you going to find the entry tomorrow, after your program has forgotten the value of last_insert_id()?
Using a surrogate key is fine, but your table still represents an entity, and you should be able to answer the question: what measurable properties define this particular entity? The set of these properties is the natural key of your table, and even if you use surrogate keys, such a natural key should always exist and you should use it to retrieve information from the table. Use the surrogate key to enforce referential integrity, for indexing purposes, and to make joins easier on the eye. But don't let it escape from the database.
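In the files example from the question above, the URL would be the obvious natural key, so tomorrow's lookup would be something like:

-- find the row again later by its natural key, not by a remembered surrogate
SELECT file_id
FROM files
WHERE url = 'text.doc';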