PostgreSQL - How to prevent duplicate inserts in a competitive slot-taking contest

I don't know how to phrase my question exactly right, so to provide more detail about the problem I am trying to solve, let me describe my application. Suppose I am implementing a queue reservation application, and I maintain the number of slots in a table that looks roughly like this:
id | appointment | slots_available | slots_total
---------------------------------------------------
1 | apt 1 | 30 | 30
2 | apt 2 | 1 | 5
.. | .. | .. | ..
So, in a competitive scenario, and assuming everything works on the application side, the following can happen:
user 1 -> reserves apt 2 -> [validate if slot exists] -> update slot_available to 0 -> reserve (insert a record)
user 2 -> reserves apt 2 -> validate if slot exists -> [update slot_available to 0] -> reserve (insert a record)
What if users 1 and 2 happen to find a slot available for apt 2 at the same time in the user interface? (Of course I would validate first that there is one slot, but they would see the same value in the UI if neither of them has clicked yet.) Then the two submit a reservation at the same time.
Now what if user 1 validates that there is a slot available, even though user 2 has already taken it but the update operation has not yet completed? Then there will be two inserts.
In any case, how do I ensure that only one of them gets the reservation at the database level? I'm sure this is a common scenario, but I have no idea yet how to implement something like this. A suggestion to remodel would also be acceptable as long as it solves the scenario.
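One way to enforce this at the database level (a sketch, not taken from the thread; the appointments and reservations table names are assumptions, the slot columns follow the example above) is to make the decrement conditional and atomic, and only insert the reservation if the UPDATE actually claimed a slot:

-- Sketch: claim a slot atomically; only one concurrent request can succeed.
BEGIN;

UPDATE appointments                 -- hypothetical name for the slots table above
SET    slots_available = slots_available - 1
WHERE  id = 2                       -- apt 2
AND    slots_available > 0;         -- refuse to go below zero

-- If the UPDATE reports 1 row affected, the slot is ours: insert the reservation.
-- If it reports 0 rows affected, the other user won: roll back and skip the insert.
INSERT INTO reservations (appointment_id, user_id)
VALUES (2, 123);

COMMIT;

Under PostgreSQL's default READ COMMITTED isolation, the second concurrent UPDATE waits for the first to commit, re-checks slots_available > 0, matches zero rows, and the application can then roll back instead of inserting.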

Related

How to represent a sequence of actions in a database while keeping detailed information about each action?

We have many actions players can take in a game. Imagine a card game (like poker) or a board game where there are multiple choices at each decision point and there is a clear sequence of events. We keep track of each action taken by a player. We care about the action's size (if applicable), other action possibilities that weren't taken, the player who took the action, the action that player faced before his move. Additionally, we need to know whether some action happened or did not happen before the action we're looking at.
The database helps us answer questions like:
1. How often is action A taken given the opportunity? (sum(actionA)/sum(actionA_opp))
2. How often is action A taken given the opportunity and given that action B took place?
3. How often is action A taken with size X, or made within Y seconds given the opportunity and given that action B took place and action C did not?
4. How often is action A taken given that action B took place performed by player P?
So for each action, we need to keep information about the player that took the action, size, timing, the action performed, what action opportunities were available and other characteristics. There is a finite number of actions.
One game can have on average 6 actions with some going up to 15.
There could be millions of games, and we want the aggregate queries across all of them to run as fast as possible (seconds).
It could be represented in document database with an array of embedded documents like:
game: 123
actions: [
{
player: Player1,
action: deals,
time: 0.69,
deal_opp: 1
discard_opp: 1
},
{
player: Player2,
action: discards,
time: 1.21
deal_opp: 0,
discard_opp: 1,
}
...
]
Or in a relational model:
game | player | seq_n | action | time | deal_opp | discard_opp
123 | Player | 1 | deals | 0.28 | 1 | 1
All possible designs that I come up with can't satisfy all my conditions.
In the relational model presented, seeing the previous actions taken in the same game requires N inner joins, where N is the number of previous actions we want to filter for. Given that the table would hold billions of rows, that means several self-joins on a billion-row table, which seems very inefficient.
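To make that concrete, a query such as "how often is a discard taken, given the opportunity and given that a deal took place earlier in the same game" might look like this PostgreSQL-flavored sketch against the relational model above (the table name actions is assumed); each additional "action B happened / did not happen" condition adds another correlated lookup like the EXISTS below:

-- Sketch: one extra semi-join per previous action we want to condition on.
SELECT SUM(CASE WHEN a.action = 'discards' THEN 1 ELSE 0 END)::numeric
       / NULLIF(SUM(a.discard_opp), 0)    AS discard_freq
FROM   actions a
WHERE  EXISTS (                           -- "action B (deals) took place earlier"
         SELECT 1
         FROM   actions b
         WHERE  b.game   = a.game
         AND    b.seq_n  < a.seq_n
         AND    b.action = 'deals');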
If we instead store it in a wide column table and represent the entire sequence in one row, aggregates become very easy (we can filter on what happened and what did not by comparing column values, e.g. sum(deal)/sum(deal_opp) where deal_opp = 1 to get the frequency of the deal action given the player had the opportunity to do it), but we don't know WHO took a given action, which is a necessity. We cannot just append a player column next to each action column to represent who took that action, because an action like call or discard could involve many players in a row (in a poker game, one player raises and one or more players can call).
More possibilities:
Graph database (overkill given that we have at most 1 other connecting node? - basically a linked list)
Closure tables (more efficient querying of previous actions)
??
If I understand correctly, you're dealing with how to store a decision tree within your database. Right?
I remember I programmed a chess game years ago, where every action is the consequence of the sequence of previous actions by both users. So to keep a record of all the actions, with all the details you need, I think you should check the following:
+ In a relational database, the most efficient way to store a tree is a Modified Preorder Tree Traversal. Not easy, to be honest, but you can give it a try.
This will help you : https://gist.github.com/tmilos/f2f999b5839e2d42d751
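For what it's worth, a minimal nested-set (MPTT) layout along those lines might look like the sketch below; the table and column names are illustrative, not from the question:

-- Sketch of a nested-set ("Modified Preorder Tree Traversal") layout.
-- Each node stores a left/right boundary; a node's subtree is everything
-- whose boundaries fall inside its own.
CREATE TABLE action_tree (
    id     integer PRIMARY KEY,
    action text    NOT NULL,
    lft    integer NOT NULL,
    rgt    integer NOT NULL
);

-- All actions that descend from a given action, in one query, no recursion:
SELECT child.*
FROM   action_tree parent
JOIN   action_tree child
  ON   child.lft BETWEEN parent.lft AND parent.rgt
WHERE  parent.id = 1
AND    child.id <> parent.id;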

SSIS ForEach ADO Enumerator - Performance Issues

This is a best practice/other approach question about using an ADO Enumerator ForEach loop.
My data is financial accounts, coming from a source system into a data warehouse.
The current structure of the data is a list of financial transactions eg.
+-----------------------+----------+-----------+------------+------+
| AccountGUID | Increase | Decrease | Date | Tags |
+-----------------------+----------+-----------+------------+------+
| 00000-0000-0000-00000 | 0 | 100.00 | 01-01-2018 | Val1 |
| 00000-0000-0000-00000 | 200.00 | 0 | 03-01-2018 | Val3 |
| 00000-0000-0000-00000 | 400.00 | 0 | 06-01-2018 | Val1 |
| 00000-0000-0000-00000 | 0 | 170.00 | 08-01-2018 | Val1 |
| 00000-0000-0000-00002 | 200.00 | 0 | 04-01-2018 | Val1 |
| 00000-0000-0000-00002 | 0 | 100.00 | 09-01-2018 | Val1 |
+-----------------------+----------+-----------+------------+------+
My SSIS package currently has two ForEach loops:
All Time Balances
End Of Month Balances
All Time Balances
This passes the AccountGUID into the loop and selects all transactions for that account. It then orders them by date, earliest first, and assigns each a sequence number.
Once the sequence numbers are assigned, it calculates the running balances based on the increase and decrease columns, along with the tag column to work out which balance it is dealing with.
It finishes off by flagging the latest record as Current.
All Time Balances - Work Flow
->Get All Account ID's in Staging table
|-> Write all Account GUID's to object variable
|--> ADO Enumerator ForEach - Loop Account GUID List - Write GUID to variable
|---> (Data Flow) Select all transactions for Account GUID
|----> (Data Flow) Order all transactions by date and assign Sequence number
|-----> (Data Flow) Run each row through a script component transformation to calculate running totals for each record
|------> (Data Flow) Insert balance data into staging table
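For reference, the per-account sequence number and running balance that this workflow produces could also be expressed set-based with window functions (SQL Server 2012+); this is only a sketch, assuming the staging table is called Transactions and has the columns shown above:

-- Sketch: sequence number and running balance per account in one pass.
SELECT AccountGUID,
       [Date],
       Tags,
       ROW_NUMBER() OVER (PARTITION BY AccountGUID ORDER BY [Date]) AS SeqNo,
       SUM(Increase - Decrease) OVER (PARTITION BY AccountGUID, Tags
                                      ORDER BY [Date]
                                      ROWS UNBOUNDED PRECEDING)     AS RunningBalance
FROM   Transactions;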
End Of Month Balances
The second package, End of Month, does something very similar with the exception of a second loop. The select finds the earliest transaction record and the latest transaction record. Using those two dates it works out all the months between them and loops over each of those months.
Inside the date loop, it does pretty much the same thing, works out the balances based on tags and stamps the end of month record for each account.
The Issue/Question
All of this currently works fine, but the performance is horrible.
In one database with approximately 8,000 accounts and 500,000 transactions, this process takes upwards of a day to run. This being one of our smaller clients, I tremble at the idea of running it against our heavier databases.
Is there a better approach to doing this, using SQL cursors or some other neat way I have not seen?
Ok, so I have managed to take my package execution from around 3 days to about 11 minutes all up.
I ran a profiler and standard windows stats while running the loops and found a few interesting things.
Firstly, there was almost no utilization of HDD, CPU, RAM or network during the execution of the packages. It told me what I kind of already knew, that it was not running as quickly as it could.
What I did notice was that between each iteration of the loop there was a 1 to 2 ms delay before the next instance of the loop started executing.
Eventually I found that every time a new instance of the loop began, SSIS created a new connection to the SQL database; it appears this is SSIS's default behavior. Whenever you create a Source or Destination, you are adding a connection delay to your project.
The Fix:
Now this was an odd fix: you need to go into your connection manager (the odd bit: it must be the on-screen window, not the right-hand project manager window).
If you select the connection that is referenced in the loop, in the properties window on the right side (in my layout anyway) you will see an option called "RetainSameConnection", which by default is set to false.
By setting this to true, I eliminated the 2ms delay.
Considerations:
In doing this I created a heap of other issues, which really just highlighted areas of my package that I had not thought out well.
Some things that appeared to be impacted by this change were stored procedures that used temp tables; these seemed to break instantly. I assume that is because of how SQL Server handles temp tables: by closing the connection and reopening it, you can be pretty certain the temp table is gone. With the same connection retained, colliding with leftover temp tables becomes an issue again.
I removed all temp tables and replaced them with CTEs, which appears to have fixed the issue.
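As an illustration of that kind of rewrite (a generic sketch, not the actual procedures from this package; Transactions and its columns are taken from the example above):

-- Before: intermediate result in a temp table, which can linger or collide
-- when the connection is retained.
SELECT AccountGUID, SUM(Increase - Decrease) AS Movement
INTO   #AccountMovement
FROM   Transactions
GROUP  BY AccountGUID;

SELECT * FROM #AccountMovement WHERE Movement < 0;

-- After: the same intermediate result as a CTE, scoped to the single statement.
WITH AccountMovement AS (
    SELECT AccountGUID, SUM(Increase - Decrease) AS Movement
    FROM   Transactions
    GROUP  BY AccountGUID
)
SELECT * FROM AccountMovement WHERE Movement < 0;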
The second major issue I found was with tasks that ran in parallel and used the same connection manager. From this I received an error that SQL was still trying to run the previous statement, which bombed out my package.
To get around this, I created a duplicate connection manager (All up I made three connection managers for the same database).
Once I had my connections set up, I went into each of my parallel Source and Destinations and assigned them their own connection manager. This appears to have resolved the last error I received.
Conclusion:
There may be more unforeseen issues in doing this, but for now my packages are lightning quick, and this exercise highlighted some faults in my design.

How to handle concurrent requests that delete and create the same rows?

I have a table that looks like the following:
game_stats table:
id | game_id | player_id | stats | (many other cols...)
----------------------
1 | 'game_abc' | 8 | 'R R A B S' | ...
2 | 'game_abc' | 9 | 'S B A S' | ...
A user uploads data for a given game in bulk, submitting both players' data at once. For example:
"game": {
id: 'game_abc',
player_stats: {
8: {
stats: 'R R A B S'
},
9: {
stats: 'S B A S'
}
}
}
Submitting this to my server should result in the first table.
Instead of updating the existing rows when the same data is submitted again (with revisions, for example) what I do in my controller is first delete all existing rows in the game_stats table that have the given game_id:
class GameStatController
  def update
    # Throw away any existing rows for this game, then re-create them from the payload
    GameStat.where("game_id = ?", game_id).destroy_all
    params[:game][:player_stats].each do |player_id, stats|
      GameStat.new(game_id: game_id, player_id: player_id, stats: stats[:stats]).save
    end
  end
end
This works fine with a single threaded or single process server. The problem is that I'm running Unicorn, which is a multi-process server. If two requests come in at the same time, I get a race condition:
Request 1: GameStat.where(...).destroy_all
Request 2: GameStat.where(...).destroy_all
Request 1: Save new game_stats
Request 2: Save new game_stats
Result: Multiple game_stat rows with the same data.
I believe that somehow locking the rows or the table is the way to go to prevent multiple updates at the same time, but I can't figure out how to do it. Combining it with a transaction seems like the right thing to do, but I don't really understand why.
EDIT
To clarify why I can't figure out how to use locking: I can't lock a single row at a time, since the row is simply deleted and not modified.
AR doesn't support table-level locking by default. You'll have to either execute db-specific SQL or use a gem like Monogamy.
Wrapping up the save statements in a transaction will speed things up if nothing else.
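If you go the db-specific route, the raw SQL issued inside the transaction (PostgreSQL syntax assumed; the question doesn't say which database sits behind ActiveRecord, and the values are just the ones from the example) would be roughly:

-- Sketch: serialize the delete-then-insert by taking a table lock inside a transaction.
BEGIN;

-- Conflicts with itself and with other writers, but still allows plain reads,
-- so two concurrent uploads for the same game cannot interleave.
LOCK TABLE game_stats IN SHARE ROW EXCLUSIVE MODE;

DELETE FROM game_stats WHERE game_id = 'game_abc';

INSERT INTO game_stats (game_id, player_id, stats)
VALUES ('game_abc', 8, 'R R A B S'),
       ('game_abc', 9, 'S B A S');

COMMIT;

In ActiveRecord you would run the LOCK TABLE via connection.execute inside a transaction block.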
Another alternative is to implement the lock with Redis. Gems like redis-lock are also available. This will probably be less risky as it doesn't touch the DB, and you can set Redis keys to expire.

Looping a rotation into Excel cells

I have a database of users (with a number assignment) and an Excel listing of items that I am successfully pulling into two arrays. One array of users, the other of items from Excel (tasks). I am trying to think of the logic to represent this scenario:
A user in the DB can exist with a number 0-5. Basically this represents how many days off they have in the week, and it helps break up how many items in the Excel range each person should get, so the items are proportioned correctly (before, I was using a Boolean indicator to either include or exclude them). For example:
User | Present #
-------------------
Jared | 0 'present daily
John | 0 'present daily
Mary | 1 'off 1 day
Tom | 5 'off rotation entirely
Question is: what is the best way to relate this to how many items they should be getting overall?
I would expect Jared and John to get the most, Mary a bit less, and Tom would never be included. Let's say I have 50 items.
One way I have thought of is, while looping the names into Excel cells, to count each time I start back at the top of the array as a "pass".
Anyone with a 0 is never skipped through each pass
Anyone with a 1 is skipped every 4th pass
Anyone with a 2 is skipped every 3rd pass
Anyone with a 3 is skipped every 2nd pass
Anyone with a 4 is skipped every other pass
Anyone with a 5 is never included (easy)
For my example, Jared and John would always be used, Mary would be skipped every fourth pass, and Tom would never be used.
Does this make sense?
What is the best way to catch looping through an array every Nth time?
Am I going about this in the correct manner?
To avoid a lot of looping and the delays this might cause, I’d suggest calculating a ‘demand factor’.
There are 50 items (in the example) to be distributed according to availabilities. The availabilities are shown as 0 (present daily) to 5 (off rotation entirely), but it is easier to work with these the other way around: 'off rotation' has no resources available, so 0, and 'present daily' has all weekdays (?) available, so 5.
The User | Present # table would then become:
Jared  5
John   5
Mary   4
Tom    0
Total 14
So 14 person-days are available to cover 50 items, an average of 3.57 items per person-day. Presuming an item can't be split, that is 3 items per person-day with 8 left over. The '3 each' can be allocated in one pass by multiplying the (revised) table values by INT(item_total/table_total). So for Jared and John the result is 5x3 = 15 and for Mary 4x3 = 12.
That, though, only accounts for 42, so 8 have yet to be allocated. 3,3,2 'extras' is obvious (resulting in 18,18,14) but programming that is not so easy. I'd suggest that where there is any residual from the INT formula you use its result +1 (i.e. here 4 rather than 3), accept preliminary results of 20,20,16,0 (6 too many), then loop through each user knocking 1 off (where possible) until 6 have been knocked off.
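Just to show the base-allocation arithmetic in code form (a sketch only; the real logic would live in VBA against the two arrays, and the users table with a present_no column is hypothetical):

-- Sketch of the 'demand factor' base allocation: invert the 0-5 scale,
-- then hand out INT(50 / total availability) items per available day.
WITH avail AS (
    SELECT name, 5 - present_no AS days_avail
    FROM   users
)
SELECT name,
       days_avail * (50 / SUM(days_avail) OVER ()) AS base_items  -- 50/14 = 3 (integer division)
FROM   avail;
-- Jared 15, John 15, Mary 12, Tom 0  (42 of the 50 allocated; 8 remain)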
This doesn't entirely make sense since you appear to be assigning weekly tasks, one per day:
Anyone with a 0 is never skipped through each pass
Anyone with a 1 is skipped every 4th pass
Anyone with a 2 is skipped every 3rd pass
Anyone with a 3 is skipped every 2nd pass
Anyone with a 4 is skipped every other pass
Anyone with a 5 is never included (easy)
However, presuming the above, you skip users when their individual TaskAssignmentsAttempted Mod (6 - Present#) = 0.
Perhaps you need:
Anyone with a 0 is never skipped
Anyone with a 1 is skipped once every 5 passes
Anyone with a 2 is skipped twice every 5 passes
Anyone with a 3 is skipped 3 times every 5 passes
Anyone with a 4 is skipped 4 times every 5 passes
Anyone with a 5 is always skipped.
Presuming the above, you skip users when their individual 5 - Present# is less than their individual TaskAssignmentsAttempted Mod 5.
With either of these, you need to track the number of times that each user has an assignment attempt (successful or not), as well as the actual assignments.

Is it important to have automated acceptance tests to check whether a field saves to a database?

I'm using SpecFlow as the automated acceptance testing framework and NHibernate for persistence. Many of the UI pages for an intranet application that I'm working on are basic data entry pages. Obviously adding a field to one of these pages is considered a "feature", but I can't think of any scenarios for this feature other than:
Given that I enter data X for field Y on Record 1
And I click Save
When I edit Record 1
Then I should see data X for field Y
How common and necessary is it to automate tests like this? Additionally, I'm using NHibernate, so it's not like I'm hand-rolling my own data persistence layer. Once I add a property to my mapping file, there is a high chance that it won't get deleted by mistake. Considering this, isn't a "one-time" manual test enough? I'm eager to hear your suggestions and experience in this matter.
I usually have scenarios like "successful creation of ..." that test the success case (you fill in all required fields, all input is valid, you confirm, and finally it is really saved).
I don't think that you can easily define a separate scenario for one single field, because usually the scenario of successful creation requires several other criteria to be met "at the same time" (e.g. all required fields must be filled).
For example:
Scenario: Successful creation of a customer
Given I am on the customer creation page
When I enter the following customer details
| Name | Address |
| Cust | My addr |
And I save the customer details
Then I have a new customer saved with the following details
| Name | Address |
| Cust | My addr |
Later I can add additional fields to this scenario (e.g. the billing address):
Scenario: Successful creation of a customer
Given I am on the customer creation page
When I enter the following customer details
| Name | Address | Billing address |
| Cust | My addr | Bill me here |
And I save the customer details
Then I have a new customer saved with the following details
| Name | Address | Billing address |
| Cust | My addr | Bill me here |
Of course there can be more scenarios related to the new field (e.g. validations) that you have to define or extend.
I think if you take this approach you can avoid having a lot of "trivial" scenarios. And I would argue that this is the success case of the "create customer" feature, which deserves at least one test.