I have a table that looks like the following:
game_stats table:
id | game_id    | player_id | stats       | (many other cols...)
---+------------+-----------+-------------+---------------------
1  | 'game_abc' | 8         | 'R R A B S' | ...
2  | 'game_abc' | 9         | 'S B A S'   | ...
A user uploads data for a given game in bulk, submitting both players' data at once. For example:
"game": {
id: 'game_abc',
player_stats: {
8: {
stats: 'R R A B S'
},
9: {
stats: 'S B A S'
}
}
}
Submitting this to my server should result in the first table.
Instead of updating the existing rows when the same data is submitted again (with revisions, for example), what I do in my controller is first delete all existing rows in the game_stats table that have the given game_id, and then save the newly submitted ones:
class GameStatController
  def update
    game_id = params[:game][:id]
    # Remove every existing row for this game, then recreate them from the submitted payload
    GameStat.where("game_id = ?", game_id).destroy_all
    params[:game][:player_stats].each do |player_id, stat_attrs|
      GameStat.create!(game_id: game_id, player_id: player_id, stats: stat_attrs[:stats])
    end
  end
end
This works fine with a single threaded or single process server. The problem is that I'm running Unicorn, which is a multi-process server. If two requests come in at the same time, I get a race condition:
Request 1: GameStat.where(...).destroy_all
Request 2: GameStat.where(...).destroy_all
Request 1: Save new game_stats
Request 2: Save new game_stats
Result: Multiple game_stat rows with the same data.
I believe somehow locking the rows or table is the way to go to prevent multiple updates at the same time - but I can't figure out how to do it. Combining with a transaction seems the right thing to do, but I don't really understand why.
EDIT
To clarify why I can't figure out how to use locking: I can't lock a single row at a time, since the row is simply deleted and not modified.
ActiveRecord doesn't support table-level locking by default. You'll have to either execute DB-specific SQL or use a gem like Monogamy.
Wrapping the destroy and save statements in a single transaction will speed things up, if nothing else.
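For example, with PostgreSQL the delete-and-reinsert can be serialized with an explicit table lock inside that transaction. This is a sketch only: the lock mode is one reasonable choice, the syntax is DB-specific, and the table/column values are taken from the question.

BEGIN;
-- Block other writers to game_stats until this transaction commits
LOCK TABLE game_stats IN SHARE ROW EXCLUSIVE MODE;

DELETE FROM game_stats WHERE game_id = 'game_abc';

INSERT INTO game_stats (game_id, player_id, stats)
VALUES ('game_abc', 8, 'R R A B S'),
       ('game_abc', 9, 'S B A S');

COMMIT;

A second request running the same block waits at the LOCK TABLE statement until the first transaction commits, so the delete and the inserts behave as one atomic unit.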
Another alternative is to implement the lock with Redis. Gems like redis-lock are also available. This will probably be less risky as it doesn't touch the DB, and you can set Redis keys to expire.
Related
We have many actions players can take in a game. Imagine a card game (like poker) or a board game where there are multiple choices at each decision point and a clear sequence of events. We keep track of each action taken by a player. We care about the action's size (if applicable), the other action possibilities that weren't taken, the player who took the action, and the action that the player faced before their move. Additionally, we need to know whether some action happened or did not happen before the action we're looking at.
The database helps us answer questions like:
1. How often is action A taken given the opportunity? (sum(actionA) / sum(actionA_opp))
2. How often is action A taken given the opportunity and given that action B took place?
3. How often is action A taken with size X, or made within Y seconds given the opportunity and given that action B took place and action C did not?
4. How often is action A taken given that action B took place performed by player P?
So for each action, we need to keep information about the player that took the action, size, timing, the action performed, what action opportunities were available and other characteristics. There is a finite number of actions.
One game has about 6 actions on average, with some going up to 15.
There could be millions of games, and we want the aggregate queries across all of them to run as fast as possible (seconds).
It could be represented in a document database with an array of embedded documents, like:
game: 123,
actions: [
  {
    player: Player1,
    action: deals,
    time: 0.69,
    deal_opp: 1,
    discard_opp: 1
  },
  {
    player: Player2,
    action: discards,
    time: 1.21,
    deal_opp: 0,
    discard_opp: 1
  }
  ...
]
Or in a relational model:
game | player | seq_n | action | time | deal_opp | discard_opp
123 | Player | 1 | deals | 0.28 | 1 | 1
None of the designs I have come up with satisfy all of these conditions.
In the relational model presented, seeing the previous actions taken in the same game requires N inner joins, where N is the number of previous actions we want to filter on. Given that the table would hold billions of rows, that means several self-joins on a billion-row table, which seems very inefficient.
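To make the cost concrete, this is roughly what question 2 above ("how often is action A taken given the opportunity and given that action B took place?") looks like against that layout, written with an EXISTS (a semi-join) so duplicate matches don't skew the aggregate; the table name actions and the action labels are illustrative, not from an existing schema.

-- Frequency of "discards" given the opportunity, but only where a "deals"
-- action occurred earlier in the same game. Every extra "action X happened /
-- did not happen" condition adds another correlated lookup against the same
-- billion-row table.
SELECT SUM(CASE WHEN a.action = 'discards' THEN 1 ELSE 0 END) * 1.0
     / SUM(a.discard_opp) AS discard_freq
FROM actions a
WHERE a.discard_opp = 1
  AND EXISTS (SELECT 1
              FROM actions b
              WHERE b.game   = a.game
                AND b.seq_n  < a.seq_n
                AND b.action = 'deals');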
If we instead store it in a wide table and represent the entire sequence in one row, the aggregates become very easy: we can tell what happened and what did not by comparing column values, e.g. sum(deal)/sum(deal_opp) where deal_opp = 1 gives the frequency of the deal action given that the player had the opportunity to take it. But then we don't know WHO took a given action, which is a necessity. We cannot just append a player column next to each action column, because an action like call or discard can involve several players in a row (in a poker game, when one player raises, one or more players can call).
More possibilities:
Graph database (overkill given that we have at most 1 other connecting node? - basically a linked list)
Closure tables (more efficient querying of previous actions)
??
If I understand correctly, you're dealing with how to store a decision tree in your database, right?
I remember programming a chess game years ago, where every action follows from the consecutive set of previous actions of both players. To keep a record of all the actions, with all the details you need, I think you should check the following:
+ In a relational database, the most efficient way to store a tree is a Modified Preorder Tree Traversal (MPTT, also known as the nested set model). Not easy, to be honest, but you can give it a try.
This will help you: https://gist.github.com/tmilos/f2f999b5839e2d42d751
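To sketch the nested set idea behind MPTT (my own illustration, not taken from the linked gist): every node stores a left/right interval, and the whole chain of previous decisions for a node comes back in one range query, with no recursive self-joins.

-- Nested set (MPTT) layout: each node carries lft/rgt bounds that enclose
-- the bounds of all of its descendants.
CREATE TABLE action_tree (
    id     INT PRIMARY KEY,
    action VARCHAR(20),
    lft    INT NOT NULL,
    rgt    INT NOT NULL
);

-- All ancestors of node 42, i.e. every decision that led up to it:
SELECT anc.*
FROM action_tree anc
JOIN action_tree node ON node.id = 42
WHERE anc.lft < node.lft
  AND anc.rgt > node.rgt
ORDER BY anc.lft;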
This is a best practice / other approach question about using an ADO Enumerator ForEach loop.
My data is financial accounts, coming from a source system into a data warehouse.
The current structure of the data is a list of financial transactions, e.g.:
+-----------------------+----------+-----------+------------+------+
| AccountGUID | Increase | Decrease | Date | Tags |
+-----------------------+----------+-----------+------------+------+
| 00000-0000-0000-00000 | 0 | 100.00 | 01-01-2018 | Val1 |
| 00000-0000-0000-00000 | 200.00 | 0 | 03-01-2018 | Val3 |
| 00000-0000-0000-00000 | 400.00 | 0 | 06-01-2018 | Val1 |
| 00000-0000-0000-00000 | 0 | 170.00 | 08-01-2018 | Val1 |
| 00000-0000-0000-00002 | 200.00 | 0 | 04-01-2018 | Val1 |
| 00000-0000-0000-00002 | 0 | 100.00 | 09-01-2018 | Val1 |
+-----------------------+----------+-----------+------------+------+
My SSIS package currently has two ForEach loops:
All Time Balances
End Of Month Balances
All Time Balances
Passes each AccountGUID into the loop and selects all transactions for that account. It then orders them by date, earliest transaction first, and assigns each a sequence number.
Once the sequence numbers are assigned, it calculates the running balances from the Increase and Decrease columns, using the Tags column to work out which balance it is dealing with.
It finishes by flagging the latest record as Current.
All Time Balances - Work Flow
->Get All Account ID's in Staging table
|-> Write all Account GUID's to object variable
|--> ADO Enumerator ForEach - Loop Account GUID List - Write GUID to variable
|---> (Data Flow) Select all transactions for Account GUID
|----> (Data Flow) Order all transactions by date and assign Sequence number
|-----> (Data Flow) Run each row through a script component transformation to calculate running totals for each record
|------> (Data Flow) Insert balance data into staging table
End Of Month Balances
The second loop, End of Month Balances, does something very similar, except that it adds a second loop. The select finds the earliest transactional record and the latest transactional record; using those two dates it works out all the months between them and loops over each of those months.
Inside the date loop it does pretty much the same thing: works out the balances based on tags and stamps the end-of-month record for each account.
The Issue/Question
All of this currently works fine, but the performance is horrible.
In one database with approximately 8,000 accounts and 500,000 transactions, this process takes upwards of a day to run. That is one of our smaller clients, and I tremble at the idea of running it against our heavier databases.
Is there a better approach to doing this, using SQL cursors or some other neat way I have not seen?
Ok, so I have managed to take my package execution from around 3 days to about 11 minutes all up.
I ran a profiler and watched the standard Windows stats while running the loops, and found a few interesting things.
Firstly, there was almost no utilization of HDD, CPU, RAM or network during the execution of the packages. It told me what I kind of already knew, that it was not running as quickly as it could.
What I did notice was that between each iteration of the loop there was a 1 to 2 ms delay before the next instance of the loop started executing.
Eventually I found that every time a new instance of the loop began, SSIS created a new connection to the SQL database; this appears to be SSIS's default behavior. Whenever you create a Source or Destination, you are adding a connection delay to your project.
The Fix:
Now this was an odd fix: you need to go into your connection manager, and (the odd bit) it must be via the on-screen designer window, not the right-hand project manager window.
If you select the connection that is referenced in the loop, then in the properties window on the right-hand side (in my layout, anyway) you will see an option called "RetainSameConnection", which by default is set to False.
By setting this to True, I eliminated the 2 ms delay.
Considerations:
In doing this I created a heap of other issues, which really just highlighted areas of my package that I had not thought out well.
One thing that appeared to be impacted by this change was stored procedures that used temp tables; these seemed to break instantly. I assume that is because of how SQL handles temp tables: when the connection is closed and reopened, you can be pretty certain that the temp table is gone, but when the same connection is retained, colliding with a leftover temp table becomes a real possibility again.
I removed all temp tables and replaced them with CTEs, which appears to have fixed the issue.
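As a rough illustration of that swap (hypothetical table and column names, not the actual package code), an intermediate temp-table step like the first snippet can usually be folded into a CTE like the second, which is scoped to a single statement and so cannot collide with leftovers on a retained connection.

-- Before: intermediate result in a temp table, which may still exist
-- when the retained connection runs the step again
SELECT AccountGUID, SUM(Increase - Decrease) AS Balance
INTO   #AccountBalances
FROM   dbo.Transactions
GROUP BY AccountGUID;

SELECT * FROM #AccountBalances WHERE Balance < 0;

-- After: the same intermediate result as a CTE
WITH AccountBalances AS (
    SELECT AccountGUID, SUM(Increase - Decrease) AS Balance
    FROM   dbo.Transactions
    GROUP BY AccountGUID
)
SELECT * FROM AccountBalances WHERE Balance < 0;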
The second major issue I found was with tasks that ran in parallel and used the same connection manager. From these I received an error that SQL was still trying to run the previous statement, which bombed out my package.
To get around this, I created duplicate connection managers (all up, I made three connection managers for the same database).
Once I had my connections set up, I went into each of my parallel Sources and Destinations and assigned them their own connection manager. This appears to have resolved the last error I received.
Conclusion:
There may be more unforeseen issues in doing this, but for now my packages are lightning quick, and the exercise highlighted some faults in my design.
I don't know how to phrase my question quite right, but to provide further details about the problem I am trying to solve, let me describe my application. Suppose I am implementing a queue reservation application, and I maintain the number of slots in a table, roughly like this:
id | appointment | slots_available | slots_total
---+-------------+-----------------+------------
1  | apt 1       | 30              | 30
2  | apt 2       | 1               | 5
.. | ..          | ..              | ..
So, in a concurrent scenario, assuming that everything works on the application side of things, the following can happen in the application:
user 1 -> reserves apt 2 -> [validate if slot exists] -> update slots_available to 0 -> reserve (insert a record)
user 2 -> reserves apt 2 -> validate if slot exists -> [update slots_available to 0] -> reserve (insert a record)
What if users 1 and 2 happen to see a slot available for apt 2 at the same time in the user interface? (Of course I would validate first that there is a slot, but they would see the same value in the UI if neither of them has clicked yet.) Then the two submit a reservation at the same time.
And what if user 1 validates that a slot is available even though user 2 has already taken it, because user 2's update operation has not yet completed? Then there will be two inserts.
In any case, how do I ensure at the database level that only one of them gets the reservation? I'm sure this is a common scenario, but I have no idea yet how to implement something like this. A suggestion to remodel would also be acceptable, as long as it solves the scenario.
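For what it's worth, one common pattern (a sketch against the table above, assuming a SQL database with atomic single-statement updates such as PostgreSQL or MySQL/InnoDB; the appointments and reservations table names are illustrative) is to fold the "validate" and the "decrement" into one conditional UPDATE and only insert the reservation when a row was actually changed.

BEGIN;

-- Atomically "validate and claim" a slot: the WHERE clause is the validation,
-- so when a single slot is left only one of two concurrent requests can match.
UPDATE appointments
SET    slots_available = slots_available - 1
WHERE  id = 2                 -- apt 2
  AND  slots_available > 0;

-- In application code: if the UPDATE reports 0 affected rows, ROLLBACK and tell
-- the user the slot is gone; otherwise insert the reservation and COMMIT.
INSERT INTO reservations (appointment_id, user_id) VALUES (2, 1);

COMMIT;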
Suppose I have a reverse-linked list, i.e. a data structure where each node points to its predecessor:
A <- B <- C <- D
This is a pattern you can find in Git, for example, where each commit contains the ID of its predecessor.
Now let's assume the following scenario: Every time a new node is added to the list, another component gets notified. Unfortunately, from time to time, some of these notifications get lost, so there is no guarantee that all the notifications arrive. For the list given above, the notifications should be:
A
B
C
D
But, as an example, the following is received:
A
D
Now I would like to detect "holes" in the receiving component. I.e., when D is received, the component can detect that something is missing, since the predecessor of D has not been received as well. So it asks the sending component for the part that is missing. What can be told is: The last one that was received is A, and the newest one received is D.
So now the component managing the original list has the task of effectively figuring out that what is missing is the sublist B <- C.
And this is finally where my question comes into play: how to do this efficiently? I mean, the simplest approach would be to move backwards from D until we reach A. Everything in between is apparently what's missing.
Suppose every node is stored as a single record in a table in a relational database system: what is the most efficient way to figure out this sublist? Obviously, I could run a SELECT in a loop over and over again. Is there a better way?
The table layout of the Nodes table basically looks like this:
ID | PredecessorID | Data
-------------------------------------|--------------------------------------|-----------------
43b1e103-d8c6-40f9-b031-e5d9ef18a739 | null | ...
55f6951b-5ed3-46c8-9ad5-64e496cb521a | 43b1e103-d8c6-40f9-b031-e5d9ef18a739 | ...
3eaa0889-31a6-449d-a499-e4beb9e4cad1 | 55f6951b-5ed3-46c8-9ad5-64e496cb521a | ...
This means that, as the IDs are not numerical, you cannot simply select the range of the missing nodes.
PS: The only solution I'm aware of is introducing a position field, which is effectively an increasing numerical ID. But I explicitly do not want something like this, as it would require a single point of failure that consistently hands out the next ID, and that is something I would like to avoid.
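For what it's worth, on databases that support recursive CTEs (PostgreSQL syntax shown) the "walk backwards from D until we reach A" idea can be written as a single query instead of a SELECT loop. A sketch against the Nodes table above; the two parameter placeholders for the newest and the last successfully received ID are illustrative.

-- Walk the predecessor chain starting at the newest received node (D)
-- and stop once the walk reaches the last node that was already received (A).
WITH RECURSIVE missing AS (
    SELECT n.ID, n.PredecessorID, n.Data
    FROM   Nodes n
    WHERE  n.ID = :newest_received_id          -- D

    UNION ALL

    SELECT p.ID, p.PredecessorID, p.Data
    FROM   Nodes p
    JOIN   missing m ON p.ID = m.PredecessorID
    WHERE  p.ID <> :last_received_id           -- do not descend past A
)
SELECT *
FROM   missing
WHERE  ID <> :newest_received_id;              -- D itself arrived; keep only the gap (B <- C)

The database still walks the chain node by node internally, but it does so in one round trip instead of one SELECT per node.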
Below is My Scenario:
Scenario: Delete Customer
Given We declare a new Request
And We have below Path parameters
| userid | |
| magcode | |
And We have below Header parameters
| sharedsecret | |
And We log the Request
When We send Delete request to service "DeleteCustomerWebservice"
Then The response status code should be 200
Here I am deleting a customer only once, but I have to do the same thing multiple times, and the data should come from the database.
I don't think you can use your database as the data table for your story; Cucumber isn't designed for that style, I believe. Instead, handle this with a do/while or for loop inside your test to iterate over all the records of the result set.