We have a system where customers are allocated a product on a first come first served basis.
Our products table contains an incrementing primary key that started at zero, which we use to keep track of how many products have been allocated: the first user to reserve a product is allocated 1, the next gets 2, and so on.
The problem is that potentially hundreds of thousands of users will access the system in any given hour, all of them hitting this one table.
Since we need to ensure that each customer is allocated only one product, and keep track of how many products have been allocated, we take a row lock for each customer accessing the system so that they write to the table before the next customer is processed - i.e. enforcing the first come first served rule.
We are concerned about the bottleneck that is the processing time of each request coming into SQL Server 2008 Enterprise Edition and the row lock.
We can't use multiple servers as we need to ensure the integrity of the primary key, so anything that requires replication isn't going to work.
Does anyone know of any good solutions that are particularly efficient at handling a massive number of requests on one database table?
A bit more info:
The table in question essentially contains two fields only - ID and CustomerID. The solution is for a free giveaway of a million products - hence the expectation of high demand and why using the incrementing primary key as a key makes sense for us - once the key hits a million, no more customers can register. Also, the products are all different, so allocation of the correct key is important, e.g. the first 100 customers entered receive a higher-value product than the next 100, and so on.
First, to remove the issue of key generation, I would generate them all in advance. It's only 1m rows and it means you don't have to worry about managing the key generation process. It also means you don't have to worry about generating too many rows accidentally, because once you have the table filled, you will only do UPDATEs, not INSERTs.
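As a rough sketch of what pre-generation could look like (a minimal sketch, assuming a dbo.Giveaway table with a plain int primary key ID that is not an IDENTITY column, and a nullable CustomerID, matching the example below):

;WITH N AS (
    -- generate a sequence of numbers; the cross join of system views easily exceeds 1m rows
    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
    FROM sys.all_objects a CROSS JOIN sys.all_objects b
)
INSERT INTO dbo.Giveaway (ID, CustomerID)
SELECT n, NULL          -- every slot starts unallocated
FROM N
WHERE n <= 1000000;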
One important question here is: are all 1m items identical or not? If they are, then it doesn't matter what order the keys are in (or even if they have an order), so as customers submit requests you just 'try' to UPDATE the table with something roughly like this:
UPDATE TOP (1) dbo.Giveaway    -- you can use OUTPUT to return the key value here
SET CustomerID = @CurrentCustomerID
WHERE CustomerID IS NULL;

IF @@ROWCOUNT = 0              -- no free items left
    PRINT 'Bad luck';
ELSE
    PRINT 'Winner';
If, on the other hand, the 1m items are different, then you need another solution, e.g. item 1 is X, items 2-10 are Y, 11-50 are Z etc. In this case it's important to assign customers to keys in the order the requests are submitted, so you should probably look into a queuing system of some kind, perhaps using Service Broker. Each customer adds a request to the queue, then a stored procedure processes them one at a time, assigns each the next free key in order, and returns the details of what they won.
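To illustrate just the "claim one row in order" step (not the Service Broker piece), here is a rough T-SQL sketch; @CurrentCustomerID is assumed to be a parameter, and the READPAST hint means a locked row is skipped rather than waited for, so ordering is best-effort under heavy concurrency:

DECLARE @CurrentCustomerID int = 12345;   -- example value; normally a procedure parameter
DECLARE @won TABLE (ID int);

WITH NextItem AS (
    -- the lowest-numbered item that has not been claimed yet
    SELECT TOP (1) ID, CustomerID
    FROM dbo.Giveaway WITH (ROWLOCK, UPDLOCK, READPAST)
    WHERE CustomerID IS NULL
    ORDER BY ID
)
UPDATE NextItem
SET CustomerID = @CurrentCustomerID
OUTPUT inserted.ID INTO @won;

IF NOT EXISTS (SELECT 1 FROM @won)
    PRINT 'Bad luck';                     -- nothing left
ELSE
    SELECT ID AS WonItemID FROM @won;     -- which prize this customer got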
Related
In my app I have users and invoices (amongst other things).
I'm not sure what the best way is to make sure each user's invoices start at 1 and go up incrementally. For example:
User 1 makes 2 invoices - ID 1 and ID 2
User 2 makes 3 invoices - ID 1 ID 2 and ID 3
User 1 makes 1 invoice - ID 3
I obviously can't use the default id column, as it would just increase by one on every entry. The solution I thought about is having a new column in my invoices table called "incrementalId" that I set manually when I save the invoice, making it one more than the current maximum count.
Is this the standard way to go about solving this problem? An issue I can think of is that if I delete an invoice, the next one I generate will most likely have an ID that is already taken. Or is deleting the entry when I delete the invoice not really the way to go?
What you want to achieve is a gapless sequence in SQL.
You cannot rely on an auto-generated ID, as gaps can appear if a transaction is rolled back (the id value is lost forever; getting it back is... something you should not want).
There are different ways to achieve this; it all depends on your requirements and whether performance is a major concern.
First choice: you simply don't populate the invoice number at creation. You run a batch every x seconds, minutes or hours that fetches the invoices that don't have a number, gets the highest invoice number from the database, and fills all these number columns in one transaction. The batch should be the only actor populating the invoice number column, with no other concurrent execution (beware of spawning another run while the batch is still running). As you populate all empty invoice numbers in one shot it is very performant, and you don't need to lock reads on the table either, but it is not real-time.
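For illustration, a sketch of such a batch in a single statement, numbering per user as the question requires (PostgreSQL flavour, since the linked solution is for PostgreSQL; the invoices table and its user_id, invoice_number and created_at columns are assumptions):

UPDATE invoices AS i
SET    invoice_number = numbered.new_number
FROM (
    SELECT unnumbered.id,
           -- continue from this user's highest existing number (0 if none yet)
           COALESCE((SELECT MAX(prev.invoice_number)
                       FROM invoices prev
                      WHERE prev.user_id = unnumbered.user_id), 0)
           + ROW_NUMBER() OVER (PARTITION BY unnumbered.user_id
                                ORDER BY unnumbered.created_at, unnumbered.id) AS new_number
      FROM invoices AS unnumbered
     WHERE unnumbered.invoice_number IS NULL
) AS numbered
WHERE i.id = numbered.id;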
Second choice, the real-time solution: you will have to use counter tables in your database, which will probably involve some PL/SQL or similar, because your database is the only place that knows the exact status of your invoice sequence (multiple invoice creations can happen concurrently, so the database is the only arbiter of this sequence). It implies locking rows, so if you insert many invoices every insert will slow the whole thing down, but you get your invoice number immediately, at the cost of delegating some work to your database.
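A sketch of the counter-table variant, again PostgreSQL-flavoured and with assumed names (invoice_counters holds one pre-created row per user; the data-modifying CTE needs PostgreSQL 9.1+). The UPDATE takes a row lock on the user's counter, so concurrent invoice creations for the same user serialise on it:

WITH next_number AS (
    UPDATE invoice_counters
       SET last_number = last_number + 1
     WHERE user_id = 42                   -- the user creating the invoice
    RETURNING last_number                 -- the new, gapless invoice number
)
INSERT INTO invoices (user_id, invoice_number, amount)
SELECT 42, last_number, 100.00            -- example values
  FROM next_number;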
Check this solution for postgresql for instance: PostgreSQL gapless sequences
Which of the following scenarios will a) provide better performance and b) be more reliable/accurate? I've simplified the process and tables used. I would provide code/workings but it's fairly simple stuff. I'm using MS-SQL 2008, but I would assume the question is platform independent.
1) An item is removed from stock (the stock item has a unique ID) and a trigger fires which updates [tblSold]: if the ID doesn't exist it creates a record with a value of 1; if it does exist it adds 1 to the current value. The details of the sale are recorded elsewhere.
When stock availability is requested, it's calculated from this table based on the item ID.
2) When stock availability is requested it simply sums the quantity in [tblSales] based on the ID.
Stock availability will be heavily requested and for obvious reasons can't ever be wrong.
I'm going to play devil's advocate to the previous answer and suggest using a query - here are my reasons.
SQL is designed for reads; a well-maintained database will have no problem with hundreds of millions of rows of data. If your data is well indexed and maintained, performance shouldn't be an issue.
Triggers can be hard to trace, they're a little less explicit and update information in the background - if you forget about them they can be a nightmare. A minor point but one which has annoyed me many times in the past!
The most important point: if you use a query (assuming it's right), your data can never get out of sync and can be regenerated easily. A running count would make this very difficult.
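To make the query option concrete, a minimal sketch assuming tblSales has StockItemID and Quantity columns:

-- covering index so the SUM never has to touch the base table
CREATE NONCLUSTERED INDEX IX_tblSales_StockItemID
ON tblSales (StockItemID) INCLUDE (Quantity);

DECLARE @StockItemID int = 123;   -- example value

-- quantity sold for one item; availability = initial stock minus this value
SELECT COALESCE(SUM(Quantity), 0) AS QuantitySold
FROM   tblSales
WHERE  StockItemID = @StockItemID;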
Ultimately this is a design decision which everyone will have a different view on. At the end of the day it will come down to your preferences and design.
I would go with the first approach. There is no reason to count rows when you can just read one value from the database, and the trigger won't do any harm because you won't be selling items as often as you request the quantity.
I have a database of order line items. Two of the columns are for controlling batch processing. They're called prod_batch_id (number) and prod_batch_index (varchar2 - 10 byte). They both need to be either null or filled, and prod_batch_index has a unique index on it. To fetch unbatched orders we simply pull all records where prod_batch_id is null.
The table (ord_lin) has 171,602 records that pertain to a particular product line that I'm about to start batching. Rather than limiting my batching queries by the date we start, I'd like to add all of those records to a dummy batch.
I'm not well versed in Oracle, or even SQL, so I don't know if it's possible but would there be a way to temporarily disable the unique constraint for the update, and then enable it again without it complaining about those records?
If that's not an option, how would I go about creating the unique values for each record during the update?
Notes: The batch system was originally made for a whole new product line and was designed this way because those products didn't exist before batching did. I will likely have several other product lines that will get switched to the batching system down the line, so I'll be back to this point again sooner or later.
I'm also aware that I could do things like make another field to act as a "batched" flag that doesn't have such constraints, but that would involve updating a lot of programs.
I have a database table, UserRewards, that has 30+ million rows. Each row has a userID and a rewardID (along with other fields).
There is a users table (has around 4 million unique users), that has the primary key userID, and other fields.
For performance reasons, I want to move the rewardIDs per user in userrewards into a concatenated field in users (a new nvarchar(4000) field called Rewards).
I need a script that can do this as fast as possible.
I have a cursor which joins up the rewards using the script below, but it only processes around 100 users per minute, which would take far too long to get through the roughly 4 million unique users I have.
SET @rewards = ( SELECT REPLACE( ( SELECT rewardsId AS [data()]
                                   FROM userrewards
                                   WHERE UsersID = @users_Id AND BatchId = @batchId
                                   FOR XML PATH('') ), ' ', ',') )
Any suggestions to optimise this? I am about to try a while loop to see how that works, but any other ideas would be greatly received.
EDIT:
My site does the following:
We have around 4 million users who have been pre-assigned 5-10 "awards". This relationship is in the userrewards table.
A user comes to the site, we identify them, and lookup in the database the rewards assigned to them.
The issue is, the site is very popular, so I have a large number of people hitting the site at the same time requesting their data. The above will reduce my joins, but I understand this may not be the best solution. My database server goes up to 100% CPU usage within 10 seconds of me turning the site on, so most people's requests time out (they are shown an error page), or they get results, but not in a satisfactory time.
Is anyone able to suggest a better solution to my issue?
There are several reasons why I think the approach you are attempting is a bad idea. First, how are you going to maintain the comma delimited list in the users table? It is possible that the rewards are loaded in batch, say at night, so this isn't really a problem now. Even so, one day you might want to assign the rewards more frequently.
Second, what happens when you want to delete a reward or change the name of one of them? Instead of updating one table, you need to update the information in two different places.
If you have 4 million users, with thousands of concurrent accesses, then small inconsistencies due to timing will be noticeable and may generate user complaints. A call from the CEO on why complaints are increasing is probably not something you want to deal with.
An alternative is to build an index on UserRewards(UserId, BatchId, RewardsId). Presumably, each field is a few bytes, so 30 million records should easily fit into 8 GB of memory (be sure that SQL Server is allocated almost all the memory!). The query that you want can be satisfied strictly by this index, without having to bring the UserRewards table into memory. So, only the index needs to be cached. And, it will be optimized for this query.
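A sketch of that index, using the table and column names from the question's snippet:

CREATE NONCLUSTERED INDEX IX_UserRewards_UsersID_BatchId_RewardsId
ON userrewards (UsersID, BatchId, rewardsId);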
One thing that might be slowing everything down is the frequency of assigning rewards. If these are being assigned at even 10% of the read rate, you could have the inserts/updates blocking the reads. You want to do the queries with NOLOCK (i.e. read uncommitted) to avoid this problem. You also want to be sure that locking is occurring at the record or page level, to avoid conflicts with the reads.
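For example, the lookup with the hint applied (column names taken from the question's snippet; remember that NOLOCK means you may read uncommitted data):

DECLARE @users_Id int = 1, @batchId int = 1;   -- example values

SELECT rewardsId
FROM   userrewards WITH (NOLOCK)
WHERE  UsersID = @users_Id
  AND  BatchId = @batchId;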
Maybe too late, but using uniqueidentifiers as keys will not only quadruple your storage space (compared to using ints as keys), but slow your queries by orders of magnitude. AVOID!!!
The Requirements
I have a following table (pseudo DDL):
CREATE TABLE MESSAGE (
    MESSAGE_GUID GUID PRIMARY KEY,
    INSERT_TIME  DATETIME
);
CREATE INDEX MESSAGE_IE1 ON MESSAGE (INSERT_TIME);
Several clients concurrently insert rows in that table, possibly many times per second. I need to design a "Monitor" application that will:
Initially, fetch all the rows currently in the table.
After that, periodically check if there are any new rows inserted and then fetch these rows only.
There may be multiple Monitors concurrently running. All the Monitors need to see all the rows (i.e. when a row is inserted, it must be "detected" by all the currently running Monitors).
This application will be developed for Oracle initially, but we need to keep it portable to every major RDBMS and would like to avoid as much database-specific stuff as possible.
The Problem
The naive solution would be to simply find the maximal INSERT_TIME in rows selected in step 1 and then...
SELECT * FROM MESSAGE WHERE INSERT_TIME >= :max_insert_time_from_previous_select
...in step 2.
However, I'm worried this might lead to race conditions. Consider the following scenario:
Transaction A inserts a new row but does not yet commit.
Transaction B inserts a new row and commits.
The Monitor selects rows and sees that the maximal INSERT_TIME is the one inserted by B.
Transaction A commits. At this point, A's INSERT_TIME is actually earlier than B's (A's INSERT was actually executed before B's, before we even knew who is going to commit first).
The Monitor selects rows newer than B's INSERT_TIME (as a consequence of step 3). Since A's INSERT_TIME is earlier than B's insert time, A's row is skipped.
So, the row inserted by transaction A is never fetched.
Any ideas how to design the client SQL or even change the database schema (as long as it is mildly portable), so these kinds of concurrency problems are avoided, while still keeping a decent performance?
Thanks.
Without using any of the platform-specific change data capture (CDC) technologies, there are a couple of approaches.
Option 1
Each Monitor registers a sort of subscription to the MESSAGE table. The code that writes messages then writes each MESSAGE once per Monitor, i.e.
CREATE TABLE message_subscription (
message_subscription_id NUMBER PRIMARY KEY,
message_id NUMBER NOT NULL,
monitor_id NUMBER NOT NULL,
CONSTRAINT uk_message_sub UNIQUE (message_id, monitor_id)
);
INSERT INTO message_subscription
SELECT message_subscription_seq.nextval,
       sys_guid,        -- the GUID of the MESSAGE row just written
       monitor_id
FROM monitor_subscribers;
Each Monitor then deletes the message from its subscription once it has been processed.
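For instance (bind variable names are placeholders):

DELETE FROM message_subscription
 WHERE monitor_id = :monitor_id
   AND message_id = :message_id;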
Option 2
Each Monitor maintains a cache of the recent messages it has processed that is at least as long as the longest-running transaction could be. If the Monitor maintained a cache of the messages it has processed for the last 5 minutes, for example, it would query your MESSAGE table for all messages later than its LAST_MONITOR_TIME. The Monitor would then be responsible for noting that some of the rows it had selected had already been processed. The Monitor would only process MESSAGE_ID values that were not in its cache.
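A sketch of that poll, using the MESSAGE table from the question and a 5-minute overlap window (Oracle-flavoured; :last_monitor_time is a bind variable the Monitor supplies):

SELECT MESSAGE_GUID, INSERT_TIME
  FROM MESSAGE
 WHERE INSERT_TIME > :last_monitor_time - INTERVAL '5' MINUTE;
-- the Monitor then discards any MESSAGE_GUID already present in its local cache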
Option 3
Just like Option 1, you set up subscriptions for each Monitor but you use some queuing technology to deliver the messages to the Monitor. This is less portable than the other two options but most databases can deliver messages to applications via queues of some sort (i.e. JMS queues if your Monitor is a Java application). This saves you from reinventing the wheel by building your own queue table and gives you a standard interface in the application tier to code against.
You need to be able to identify all rows added since the last time you checked (i.e. the monitor checks). You have a "time of insert" column. However, as you spell it out, that time of insert column cannot be used with "greater than [last check]" logic to reliably identify subsequently inserted new items. Commits do not occur in the same order as (initial) inserts. I am not aware of anything that works on all major RDBMSs that would clearly and safely apply such an "as of" tag at the actual time of commit. [This is not to say I would know it if such a thing existed, but it seems a pretty safe guess to me.] Thus, you will have to use something other than a "greater than last check" algorithm.
That leads to filtering. Upon insert, an item (row) is flagged as "not yet checked"; when a monitor logs in, it reads all not-yet-checked items, returns that set, and flips the flag to "checked" (and if there are multiple monitors, this must all be done within its own transaction). The monitors' queries will have to read all the data and pick out which have not yet been checked. The implication is, however, that this will be a fairly small set of data, at least relative to the entire set of data. From here, I see two likely options:
Add a column, perhaps "Checked". Store a binary 1/0 value for is/is-not checked. The cardinality of this value will be extreme -- 99.9% Checked, 0.1% Unchecked -- so it should be rather efficient. (Some RDBMSs provide filtered indexes, such that the Checked rows won't even be in the index; once flipped to checked, a row will presumably never be flipped back, so the overhead to support this shouldn't be too great. A sketch follows after the second option.)
Add a separate table identifying those rows in the "primary" table that have not yet been checked. When a monitor logs in, it reads and deletes the items from that table. This doesn't seem efficient... but again, if the data set involved is small, the overall performance pain might be acceptable.
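To illustrate the first option, a rough sketch in SQL Server syntax (the question is Oracle-first, so treat this as an example of the idea rather than the exact DDL; the column is named IS_CHECKED here, and it only covers the single-monitor case):

-- flag column, defaulting to "not yet checked"
ALTER TABLE MESSAGE ADD IS_CHECKED bit NOT NULL DEFAULT 0;

-- filtered index: only the (few) unchecked rows are indexed
CREATE NONCLUSTERED INDEX IX_MESSAGE_UNCHECKED
ON MESSAGE (INSERT_TIME)
WHERE IS_CHECKED = 0;

-- the monitor claims and returns the unchecked rows in one atomic statement
UPDATE MESSAGE
   SET IS_CHECKED = 1
OUTPUT inserted.MESSAGE_GUID, inserted.INSERT_TIME
 WHERE IS_CHECKED = 0;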
You should use Oracle AQ with a multi-subscriber queue.
This is Oracle specific, but you can create an abstraction layer of stored procedures (or abstract in Java if you like) so that you have a common API to enqueue the new messages and have each subscriber (monitor) dequeue any pending messages. Behind that API, for Oracle you use AQ.
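For reference, setting up a multi-subscriber queue in AQ looks roughly like this (queue, queue table and subscriber names are made up; the JMS text payload type is just one reasonable choice):

BEGIN
  DBMS_AQADM.CREATE_QUEUE_TABLE(
    queue_table        => 'message_qt',
    queue_payload_type => 'SYS.AQ$_JMS_TEXT_MESSAGE',
    multiple_consumers => TRUE);
  DBMS_AQADM.CREATE_QUEUE(
    queue_name  => 'message_q',
    queue_table => 'message_qt');
  DBMS_AQADM.START_QUEUE(queue_name => 'message_q');
  -- one subscriber per monitor
  DBMS_AQADM.ADD_SUBSCRIBER(
    queue_name => 'message_q',
    subscriber => SYS.AQ$_AGENT('monitor_1', NULL, NULL));
END;
/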
I am not sure if there is a queuing solution for other databases.
I don't think you will be able to come up with a totally database agnostic approach that meets your requirements. You could extend the example above that included the 'checked' column, to have a second table called monitor_checked - that would contain one row per message per monitor. That is basically what AQ does behind the scenes, so it is sort of reinventing the wheel.
With PostgreSQL, use PgQ. It has all those little details worked out for you.
I doubt you will find a robust and manageable database-agnostic solution for this.