How to represent a sequence of actions in a database while keeping detailed information about each action? - sql

We have many actions players can take in a game. Imagine a card game (like poker) or a board game where there are multiple choices at each decision point and there is a clear sequence of events. We keep track of each action taken by a player. We care about the action's size (if applicable), other action possibilities that weren't taken, the player who took the action, the action that player faced before his move. Additionally, we need to know whether some action happened or did not happen before the action we're looking at.
The database helps us answer questions like:
1. How often is action A taken given the opportunity? (sum(actionA)/sum(actionA_opp))
2. How often is action A taken given the opportunity and given that action B took place?
3. How often is action A taken with size X, or made within Y seconds, given the opportunity, given that action B took place, and given that action C did not?
4. How often is action A taken given that action B, performed by player P, took place?
So for each action, we need to keep information about the player that took the action, its size, its timing, the action performed, what action opportunities were available, and other characteristics. There is a finite number of actions.
One game has on average 6 actions, with some going up to 15.
There could be millions of games, and we want the aggregate queries across all of them to run as fast as possible (within seconds).
It could be represented in document database with an array of embedded documents like:
game: 123
actions: [
  {
    player: Player1,
    action: deals,
    time: 0.69,
    deal_opp: 1,
    discard_opp: 1
  },
  {
    player: Player2,
    action: discards,
    time: 1.21,
    deal_opp: 0,
    discard_opp: 1
  },
  ...
]
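For example, question 1 (how often is the deal action taken given the opportunity?) could be answered against this embedded layout with an aggregation roughly like the following. This is only a sketch: the collection name games is an assumption, and the field names come from the example above.
db.games.aggregate([
  { $unwind: "$actions" },
  { $match: { "actions.deal_opp": 1 } },            // only decision points where dealing was possible
  { $group: {
      _id: null,
      deals:     { $sum: { $cond: [{ $eq: ["$actions.action", "deals"] }, 1, 0] } },
      deal_opps: { $sum: 1 }
  } },
  { $project: { deal_frequency: { $divide: ["$deals", "$deal_opps"] } } }
])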
Or in a relational model:
game | player | seq_n | action | time | deal_opp | discard_opp
123 | Player | 1 | deals | 0.28 | 1 | 1
All possible designs that I come up with can't satisfy all my conditions.
In the relational model presented, seeing the previous actions taken in the same game requires N inner joins, where N is the number of previous actions we want to filter on. Given that the table would hold billions of rows, this means several self-joins on a billion-row table, which seems very inefficient.
If we instead store it in a wide-column table and represent the entire sequence in one row, we get very easy aggregates (we can filter on what happened and what did not by comparing column values, e.g. sum(deal)/sum(deal_opp) where deal_opp = 1 to get the frequency of the deal action given that the player had the opportunity to take it), but we don't know WHO took a given action, which is a necessity. We cannot just append a player column next to each action to represent who took it, because an action like call or discard could have many players in a row (in a poker game, one player raises and 1 or more players can call).
More possibilities:
Graph database (overkill given that we have at most 1 other connecting node? - basically a linked list)
Closure tables (more efficient querying of previous actions)
??

If I understand correctly, you're dealing with how to store a decision tree within your database, right?
I remember programming a chess game years ago, where every action follows from the consecutive set of previous actions of both players. To keep a record of all the actions, with all the details you need, I think you should check the following:
+ In a relational database, an efficient way to store a tree is the Modified Preorder Tree Traversal (MPTT). It's not easy, to be honest, but you can give it a try.
This will help you: https://gist.github.com/tmilos/f2f999b5839e2d42d751

Related

What do the entries in Lamport clock representations represent?

I'm trying to understand an illustrative example of how Lamport's algorithm is applied. In the course that I'm taking, we were presented with two representations of the clocks within three [distant] processes, one with the Lamport algorithm applied and the other without.
Without the Lamport algorithm:
With the Lamport algorithm applied:
My question concerns the validity of the change that was applied to the third entry of the table pertaining to process P1. Shouldn't it be, as the Lamport algorithm instructs, max(2, 2) + 1, which is 3, not 4?
When I asked some of my classmates about this issue, one of them informed me that the third entry of the table of P1 represents a "local" event that happened within P1, so when message A arrives, the entry is updated to max(2, 3) + 1, which is 4. However, if that were the case, shouldn't the receipt of the message be represented in a new entry of its own, instead of being put in the same entry that represents the local event that happened within P1?
Upon further investigation, I found, in the same course material, a figure taken from Tanenbaum's Distributed Systems: Principles and Paradigms, in which the new value of an entry that corresponds to the receipt of a message is computed by adding 1 to the max of the preceding entry in the same table and the timestamp of the received message, as shown below, which is quite different from what was done in the first illustration.
I'm unsure if the problem relates to a faulty understanding that I have regarding the algorithm, or to the possibility that the two illustrations are using different conventions with respect to what the entries represent.
validity of the change that was applied to the third entry of the table pertaining to the process P1
In the classical Lamport algorithm, there is no need to increase the local counter before taking the max. If you do that, it still works, but it seems like a useless operation. In the second example, all events are still properly ordered. In general, as long as counters go up, the algorithm works.
Another way of looking at correctness is trying to rebuild the total order manually. The hard requirement is that if an event A happens before an event B, then in the total order A will be placed before B. In both pictures 2 and 3, everything is fine.
Let's look at picture 2. Event X from the second cell of P0 happens before event Y from the third cell of P1. To make sure X comes before Y in the total order, the timestamp of Y must be larger than X's, and it is. It doesn't matter whether the time difference is 1, 2, or 100.
in which the new values of an entry that corresponds to the receipt of a message is updated by adding 1 to the max of the entry before it in the same table and the timestamp of the received message, as shown below, which is quite different from what was performed in the first illustration
It's actually pretty much the same logic, with the exception of incrementing the local counter before taking the max. Generally speaking, every process has its own clock and every event increases that clock by one. The only exception is when the clock of a different process is already ahead; then taking the max is required to make sure all events have a correct total order. So, in the third picture, P2 adjusts its clock (taking the max) because P3 is way ahead. The same goes for the P1 adjustment.
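For reference, here is a minimal sketch of the classical rules being described (my own illustration in JavaScript; the class and method names are made up):
class LamportClock {
  constructor() { this.time = 0; }
  localEvent() { this.time += 1; return this.time; }   // every local event ticks the clock
  send()       { this.time += 1; return this.time; }   // tick, then attach this timestamp to the message
  receive(msgTime) {                                   // classical rule: no extra local increment before the max
    this.time = Math.max(this.time, msgTime) + 1;
    return this.time;
  }
}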

Randomly select DynamoDB entry

I have a DynamoDB table called URLArray that contains a list of URLs (myURL), each with a unique video name (myKey).
I need to do two things:
When a user clicks the next video button, a random entry needs to be selected from this URLArray. There could be potentially tens of thousands of rows.
The user is logged into the app. Every time they finish watching a video, the video's unique video name is recorded. So when the user has seen a video, it's added to a list in a table called Users under the user's info row.
So this random entry that gets selected when the user clicks the next video button in point 1 has to be compared to the list of videos they've already seen, to make sure that it doesn't randomly appear again for that particular user.
So far I do something woefully inefficient that works, but it's not great.
By the way, I'm using AppSync + GraphQL to interact with the DynamoDB table. I first get a local copy of the URLArray:
//Gets a list of the Key/URL pairs in the UrlArrays table via GraphQL ****IN CONSTRUCTOR, so we have this URLArray data when componentDidMount()****
listUrlArrays = async () => {
  try {
    const urlData = await API.graphql(graphqlOperation(ListUrlArrays)); //GraphQL query
    this.setState({
      URLData: urlData.data.listURLArrays.items,                  //the Key/URL pairs, available to the whole class via state
      urlArrayLength: urlData.data.listURLArrays.items.length     //how many videos are in the database
    });
  } catch (err) {
    console.log('error fetching URLArray', err);
  }
}
As an overview, when user clicks for the next video:
//When clicking next video
async nextVideo(){
  await this.logVideosSeen(); //add myKey to the list of videos in *Users* table the logged in user has now seen
  await this.getURL(); //get the NEXT upcoming video's details, for Video Player to play and make sure it's not been seen before
}
//This will update the 'listOfVideosSeen[]' in Users table with videos unique myKey, the logged in user has seen
logVideosSeen = async () => {
.......
}
async getURL() {
  var dbIndex = this.getUniqueRandomNumber(this.state.urlArrayLength); //Choose a number between 0 and N number of videos in URLArray
  //hasVideoBeenSeen() gets the list of videos a user has already seen from the `Users` table with the GraphQL getusers command, creates a local copy of this list (which can get big), and uses JavaScript's indexOf() to check whether myKey already exists in the list
  while (await this.hasVideoBeenSeen(this.state.URLData[dbIndex].myKey)) //while true, i.e. the user has seen that video before
  {
    dbIndex = this.getUniqueRandomNumber(this.state.urlArrayLength); //get another random number to fetch a new myKey
  }
  //If false, we exit the loop knowing we've got a never-seen-before myKey, and proceed to set it to play...
  if (dbIndex != null) {
    this.setState({ playURL: this.state.URLData[dbIndex].vidURL }); //Retrieve the URL from the local URLArray that we're going to play (i.e. the next video to come)
  }
}
I can share a little more code if needed, but essentially I wanted to know how to:
1. Let a Lambda function select a random number based on the current URLArray size (I may need to keep a local copy of URLArray anyway). But I think point 2 below is where it's really inefficient:
2. Let a Lambda function check (the while loop) against the Users table whether myKey has already been seen, mainly to shift this computational burden to the cloud instead of the local device the app runs on.
AFTER SOME THOUGHT:
Thanks for the suggestion, Seth. I have been thinking about it for some time, and while the randomness requirement still holds, I think there is some truth in what you've suggested. The reason I need randomness is so that two users sat side by side, for example, can't predict which video is coming next; it shouldn't be a predictable sequence of videos. I'm not sure I can use the Scan function with AWS Amplify/GraphQL. So remember, there are 2 things going on here: (1) a video upload, recorded in the URLArray sensibly for future reference; (2) users viewing a previously unseen random video and then moving on to another unseen random video.
*(1)
I like your idea of using a number to index the URLArray, and it's helped to make life a bit easier. So the first URL is at index 0, the next at 1, etc.
My thinking here (to avoid doing a ListUrlArrays() and bringing the WHOLE array locally to the phone) is to create a GSI called VideoNumber for the URLArray table. This will be a unique VideoNumber column with a number 0-N. So imagine the diagram above having another column called VideoNumber: row 1 having VideoNumber set to 0, row 2 having VideoNumber set to 1, etc. THEN all I would need to do, locally on the device, is generate a random number between 0 and N, call a getURLArrayIdbyVideoNumber() query specific to that GSI with the number we just generated, and it'll unlock the information I need from the row. Voila! I think that shifts most of the heavy burden away.
Question: Before each video is uploaded, how do I easily get the current total number of rows N in the table (or row count)? I would then increment it by one.
The other thing I can do is save this current count number in another DynamoDB table that I use for persisted data, read the number from there before upload, and write an N+1 after upload to increment it (2 DynamoDB operations per upload). It’s not ideal.
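One way to collapse that read-then-write into a single call is a DynamoDB atomic counter, i.e. an UpdateItem with an ADD expression. A rough sketch with the AWS SDK v2 DocumentClient follows; the Counters table, counterName key, and total attribute are assumed names, not from the original setup.
const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();

// Increments the stored count and returns the new value in a single request.
async function nextVideoNumber() {
  const result = await docClient.update({
    TableName: 'Counters',                       // assumed table used for persisted data
    Key: { counterName: 'videoCount' },          // assumed key shape
    UpdateExpression: 'ADD #n :one',
    ExpressionAttributeNames: { '#n': 'total' },
    ExpressionAttributeValues: { ':one': 1 },
    ReturnValues: 'UPDATED_NEW'
  }).promise();
  return result.Attributes.total;                // e.g. use (total - 1) as the new zero-based VideoNumber
}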
*(2)
When a user has finished watching a video, I can log, in a list under the user's information in DynamoDB, which videos they've already seen. So, for example, the seen list could now be [3,12,73,108,57] for the 5 videos they've seen so far. When the user clicks nextVideo() we'll generate a random newNumber and straight away compare it with the numbers in the seen list. I use seenlist.indexOf(newNumber), and it will either go again or stop, depending on whether newNumber already exists in the list. THEN I can go through the GSI query and retrieve the relevant information from URLArray to display the video.
I think this indexOf() is the biggest computational burden on the device, and it obviously gets a little slower as the seen list grows. But it should be quicker with pure integer numbers than with an alphanumeric myKey as I was using before. Any other suggestion would be welcome :)
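To illustrate that check, here is a rough sketch using a JavaScript Set rather than indexOf, so each membership lookup is constant time instead of a linear scan (getRandomVideoNumber and totalVideos are hypothetical stand-ins for the random-number helper and count used above):
const seen = new Set(seenList);                    // seenList e.g. [3, 12, 73, 108, 57]
let candidate;
do {
  candidate = getRandomVideoNumber(totalVideos);   // hypothetical helper returning an integer 0..N-1
} while (seen.has(candidate));
seen.add(candidate);                               // candidate is now a VideoNumber the user has not seen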
I've yet to try it, but it was just an idea, as I need to keep the random element. But first, do you know how I can easily find the number of rows, or table count, of URLArray?
I think you'll have an easier time coming up with a solution to this problem if you drop the randomness requirement. It sounds like the more important requirement is presenting the user with a video they haven't seen before.
If that's correct, it sounds like your access pattern could be stated as
Fetch previously unseen video for user
which is an easier problem to solve.
Unlike SQL databases, there are often many ways to implement a given access pattern in DynamoDB. My answer here is just one way.
Imagine your URLArray table as a giant array. The first URL is at index 0, the next URL is at index 1, the one after that at index 2, and so on. Each user of your application would start by watching the video at URL index 0, then URL index 1, etc. This would ensure the user never sees the same video twice. You would not need to store a list of all the videos they've seen; instead, you could store the index of the last video they saw.
Your application could grab the first n videos from the table to present to your users. Once that list was exhausted, it could go grab the next n videos. And so on...
What I've described here is essentially how pagination is implemented in DynamoDB. To bring this abstraction back to the world of DynamoDB, your algorithm could look something like this:
1. Scan the URLArray table for the first "page" of URLs (a scan operation with no filter criteria).
2. Along with the results, DynamoDB will respond with a LastEvaluatedKey, which will allow you to retrieve the next page of results starting from this position.
3. Present your user with each video you pulled back from the scan operation, making sure to record the id (the Primary Key) of the last video they saw.
4. When you exhaust the URLs from step 1, execute another scan operation with the ExclusiveStartKey set to the LastEvaluatedKey returned in step 2.
5. When users return to your application, query for the next page from the URLArray table with ExclusiveStartKey set to the id of the last video they viewed.
This effectively uses the scan operation to search through your URLArray table one page at a time. Your application would effectively be searching the table from top to bottom, keeping track of where each user is at any given time. When a user revisits your application, just start where they left off.
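A minimal sketch of that paging pattern with the AWS SDK v2 DocumentClient (the table name and key shape are assumptions; in an AppSync setup the equivalent would be a paginated list query with a nextToken):
const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();

// Fetch the next "page" of videos, starting after the last key this user has reached.
async function nextPage(exclusiveStartKey) {
  const params = { TableName: 'URLArray', Limit: 10 };
  if (exclusiveStartKey) {
    params.ExclusiveStartKey = exclusiveStartKey;  // e.g. { myKey: '<id of the last video seen>' }
  }
  const page = await docClient.scan(params).promise();
  return { videos: page.Items, lastEvaluatedKey: page.LastEvaluatedKey };
}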
In response to your edit:
If your use case requires the next video to be unpredictable (e.g. no 2 users can predict what video is next), you have a few problems to solve at the same time:
Selecting an item in an unpredictable/random manner
Tracking what a user has already seen
Putting those two requirements together makes for a tricky access pattern. Let's say you have N videos in your table, and the user has viewed N-1 of these videos leaving only one video unseen. If you are fetching your next video randomly and need to ensure it has not yet been seen, how will you find the last unseen video? How many times would you need to guess before you came across the only unseen video? What query/scan operation could you perform that does this in a single request to DDB? I'm not saying it's impossible, it's just...complicated.
I think it's better to come up with a strategy that is unpredictable to the user, but predictable to you, when it comes to selecting the next unseen video.
For example, you could pre-calculate a random order of indexes from 1..N ahead of time, which would represent the order you present the videos for a given user. You could go through that list sequentially, keeping track of the last seen index. That way, you'd always know which video was next and that the video hadn't previously been seen by this user. Fetching that video would be a simple query operation to DDB.
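A short sketch of that idea (my own illustration): shuffle the indexes 0..N-1 once per user with a Fisher-Yates shuffle, store the order plus a lastSeenPosition on the user record, and walk through it sequentially.
// Returns the indexes 0..n-1 in a random order (Fisher-Yates shuffle).
function shuffledOrder(n) {
  const order = Array.from({ length: n }, (_, i) => i);
  for (let i = order.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [order[i], order[j]] = [order[j], order[i]];
  }
  return order;
}
// e.g. on first login: user.videoOrder = shuffledOrder(totalVideos); user.lastSeenPosition = -1;
// on "next video":    const nextIndex = user.videoOrder[user.lastSeenPosition + 1];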
You also asked how to find the number of items in DynamoDB. Unfortunately, there is no DynamoDB equivalent of the SQL count operation. The answer to this question is not straightforward. For the benefit of the community (and to get a diverse set of answers), I'd suggest you make a separate question on Stackoverflow regarding the number of items in a DDB table.

Repast: Agent execution order

I have an agent called truck, which will perform some actions (e.g. loading packages).
The problem here is related to the random order in which agents execute their actions. For instance, suppose I have three trucks; the loading sequence is random on each run:
Run-1: truck-1, truck-3, truck-2
Run-2: truck-2, truck-1, truck-3
Run-3: truck-3, truck-1, truck-2
...
How can I make sure the agents (trucks) execute their actions in a fixed sequence, e.g. ordered by their id, so that we always get consistent results from the simulation?
Run-1: truck-1, truck-2, truck-3
Run-2: truck-1, truck-2, truck-3
Run-3: truck-1, truck-2, truck-3
...
There are at least three ways to do this.
1. If you set the random seed, the order of the trucks should be the same across runs, all other things being equal. It most likely won't be ordered by id, but it should be the same.
2. Add all the trucks to an ArrayList when they are created. Sort this list by id, and on each tick of the simulation iterate through this list, executing the truck action on each truck. A quick Google search should show you how to order a Java List using a Comparator.
3. Adapt the scheduling to reflect truck id - for example, truck 1 executes at 1.0 and every tick thereafter, truck 2 at 1.1 and every tick thereafter, truck 3 at 1.2, and so on.
4. A variation on 3: set the scheduling priority by id - all the trucks could execute at 1.0 and every tick thereafter, but with truck 1 having the highest priority, truck 2 the next, and so on.
As a side note, the random iteration of the items in a schedule is the default to prevent common ABM behavior execution ordering issues, such as first mover advantage.

How to make the Automatic Record Permission field update itself as quickly as possible?

If you are working with access control, you must have faced the issue where the Automatic Record Permission field (with Rules) does not update itself on recalculating the record. You either have to launch full recalculation or wait for a considerable amount of time for the changes to take place.
I am facing this issue where based on 10 different field values in the record, I have to give read/edit access to 10 different groups respectively.
For instance:
if rule 1 is true, give edit access to the 1st group of users
if rules 1 and 2 are true, give edit access to the 1st AND 2nd groups of users
I have selected 'No Minimum' and 'No Maximum' in the Auto RP field.
How can I make the Automatic Record Permission field update itself as quickly as possible? Am I missing something important here?
If you are working with access control, you must have faced the issue where the Automatic Record Permission field (with Rules) does not update itself on recalculating the record. You either have to launch full recalculation or wait for a considerable amount of time for the changes to take place.
Tanveer, in general, this is not a correct statement. You should not face this issue with [a] well-designed architecture (relationships between your applications) and [b] correct calculation order within the application.
Regarding the case you described, I suggest you check and review the following possibilities:
1. Calculation order.
Automatic Record Permissions [ARP from here on] are treated by the Archer platform in the same way as calculated fields. This means that you can modify the calculation order in which calculated fields and automatic record permissions are updated when you save the record. So it is possible that your ARP field is calculated before certain calculated fields you use in the ARP rules. For example, let's say you have two rules in the ARP field:
if A>0 then group AAA
if B>0 then group BBB
Now, you will have a problem if the calculation order is the following:
"ARP", "A", "B"
The ARP will not be updated after you click "Save" or "Apply" once, but it will be updated after you click "Save" or "Apply" twice on the same record. With the calculation order "A", "B", "ARP", your ARP will get recalculated right away.
2. Full recalculation queue.
Since ARPs are treated as calculated fields, this means that every time an ARP needs to be updated, recalculation job(s) are created on the application server on the back end. If for some reason the calculation queue is full, the record permission will not get updated right away. The job engine recalculation queue can be full if you have a data feed running or if you have a massive number of recalculations triggered by manual data imports. The recalculation job related to the ARP update will be created, added to the queue, and processed based on the priorities defined for the job queue. You can monitor the job queue and alter the default job processing priorities in Archer v5.5 via the Archer Control Panel interface. I suggest you check the job queue state next time you see delays in ARP recalculations.
3. "Avalanche" of recalculations
It is important to design relationships and security inheritance between your applications so that the recalculation impact is minimal.
For example, let's say we have a Contacts application and a Department application.
- A record in the Contacts application inherits access from the Department record using an Inherited Record Permission.
- The Department record has an automatic record permission, and the Contacts record inherits it.
- Now the best part: Department D1 has 60,000 Contacts records linked to it, and Department D2 has 30,000 Contacts records linked to it.
The problem you described is reproducible in this configuration. I go to Department record D1 and update it in a way that forces the ARP in the department record to recalculate. This adds 60,000 jobs to the job engine queue to recalculate the 60k Contacts linked to the D1 record. Now, without waiting, I go to D2 and make a change forcing the ARP in this D2 record to recalculate. After I save record D2, a new job to recalculate D2 and the other 30,000 Contacts records will be created in the job engine queue. But record D2 will not be instantly recalculated, because the first set of 60k records has not been recalculated yet and the recalculation of the D2 record is still sitting in the queue.
Unfortunately, there is not a good solution available at this point. However, this is what you can do:
- review and minimize inheritance
- review and minimize relationships between records where 1 record references 1000+ records
- modify the architecture, break inheritance and relationships, and replace them with Archer-to-Archer data feeds if possible
- add more "recalculation" power to your application server(s); you can configure your web servers to process recalculation jobs as well if they are not heavily utilized, and add more job slots
Tanveer, I hope this helps. Good luck!

How to keep a list of 'used' data per user

I'm currently working on a project in MongoDB where I want to get a random sampling of new products from the DB. But my problem is not MongoDB specific, I think it's a general database question.
The scenario:
Let's say we have a collection (or table) of products. And we also have a collection (or table) of users. Every time a user logs in, they are presented with 10 products. These products are selected randomly from the collection/table. Easy enough, but the catch is that every time the user logs in, they must be presented with 10 products that they have NEVER SEEN BEFORE. The two obvious ways that I can think of solving this problem are:
Every user begins with their own private list of all products. Each time they get one of these products, the product is removed from their private list. The result is that the next time products are chosen from this previously trimmed list, it already contains only new items.
Every user has a private list of previously viewed products. When a user logs in, they select 10 random products from the master list and compare the id of each against their list of previously viewed products; if an item appears on the previously viewed list, the application throws it away, selects a new one, and iterates until there are 10 new items, which it then adds to the previously viewed list for next time.
The problem with #1 is that it seems like a tremendous waste. You would basically be duplicating the list data for n users. Also, removing/adding new items to the system would be a nightmare, since it would have to iterate through all users. #2 seems preferable, but it too has issues: you could end up making a lot of extra and unnecessary calls to the DB in order to guarantee 10 new products. As a user goes through more and more products, there are fewer new ones to choose from, so the chance of having to throw one away and get a new one from the DB greatly increases.
Is there an alternative solution? My first and primary concern is performance. I will give up disk space in order to optimize performance.
Those 2 ways are a complete waste of both primary and secondary memory.
You want to show only never-before-seen products, but is this a real must? If you have a lot of products, 10 random ones have a high chance of being unseen.
3. You could just list 10 random products; even though that is not as easy as in MySQL, it is still less complicated than options 1 and 2.
If you don't care how random the sequence of ids is, you could do this:
Create a single randomized table of just product ids and a sequential integer surrogate key column. Start each customer at a random point in the list on first login and cycle through the list ordered by that key. If you reach the end, start again from the top.
The customer record would contain a single value for the last product they saw (the surrogate from the randomized list, not the actual id). You'd then pull the next ten on login and do a single update to the customer. It wouldn't really be random, of course. But this kind of table-seed strategy is how a lot of simpler pseudo-random number generators work.
The only problem I see is if your product list grows more quickly than your users log in; then they'd never see the portions of the list which appear before wherever they started. Even so, with a large list of products and very active users, this should scale much better than storing everything they've seen. So if it doesn't matter that products appear in a set pseudo-random sequence, this might be a good fit for you.
Edit:
If you stored the first record they started with as well, you could still generate the list of all things seen: it would be everything between that value and the last-viewed one.
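In MongoDB terms, that could look roughly like this (a sketch only; the collection names randomizedProducts and customers and the field names are assumptions):
// One document per product, written once in a pre-shuffled order:
//   { seq: 0, productId: <id> }, { seq: 1, productId: <id> }, ...
var lastSeen = 57;                                   // the surrogate key stored on the customer record
var nextTen = db.randomizedProducts.find({ seq: { $gt: lastSeen } })
                                   .sort({ seq: 1 })
                                   .limit(10)
                                   .toArray();
// then store the new position back on the customer, e.g.:
// db.customers.update({ _id: customerId }, { $set: { lastSeen: nextTen[nextTen.length - 1].seq } })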
How about doing this: create a collection prodUser where you have just the id of the product and the list of customer IDs (who have seen that product):
{
  prodID : 1,
  userID : []
}
When a customer logs in, you find the 10 prodIDs which have not been assigned to that user:
db.prodUser.find({
  userID : {
    $nin : [yourUser]
  }
})
(For some reason $not is not working :-(. I do not have time to figure out why; if you do, please let me know.) After showing the person his products, you can update his prodUser collection. To mitigate MongoDB's inability to find random elements, you can insert elements randomly and just find the first 10.
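For that update after showing the products, something along these lines should work (a sketch; shownProdIds stands in for the array of the 10 ids just shown):
db.prodUser.update(
  { prodID: { $in: shownProdIds } },        // the 10 products just shown
  { $addToSet: { userID: yourUser } },      // add the user once per matching product document
  { multi: true }                           // update all matching documents, not just the first
)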
Everything should work really fast.