Custom user-supplied filters, display number of matches for each - sql

I have an app where users can track their website visitors in realtime. A user can create Groups, which is basically an array of JSON objects (filters) that they can use to filter a resource (here a website visitor).
Group(user_id:id, name:string, filters: JSONB[type, field, value])
Example of a group:
name: "my group"
filters: [
{field: "sessions", type: "greater_than", value: 5},
{field: "email", type: "contains", value: "#example.com"}
]
I am displaying each of a user's groups in the interface, but I'd like to also show the amount of records (visitors) matching each group.
As can be seen, it's possible for website visitors to dynamically be included/excluded in a user's group, depending on their behavior.
I've thought of using a materialized view to keep a mapping of all groups and the count of matches, which would be updated every 30 seconds. I fear that this will be very inefficient, however.
Is there a better approach?
Thanks

It really depends on the number of records involved, and how much of an impact regenerating the materialized view every 30 seconds will have on your system. For example, if the materialized view regenerates in 5 seconds or so, it wouldn't be much of an issue. If it takes 20 seconds while maxing out processors and disks, then it's a really bad idea.
An alternative is to implement a trigger on your table (or triggers on all involved tables) to increase/decrease counters where appropriate, plus a trigger on your Groups table to recalculate the current value whenever a new group is added or its conditions are changed.
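As a middle ground, each group's filter array can be compiled into a parameterized WHERE clause and counted on demand, with the result cached between refreshes. A minimal sketch using Python and SQLite; the visitors columns and the operator mapping are assumptions based on the question, not a definitive implementation:

```python
import sqlite3

# Map the filter "type" values from the question onto SQL operators.
OPS = {"greater_than": ">", "less_than": "<", "contains": "LIKE"}

def filters_to_where(filters):
    clauses, params = [], []
    for f in filters:
        # NOTE: in production, whitelist f["field"] against known columns
        # to avoid SQL injection; this sketch trusts the input.
        value = f"%{f['value']}%" if f["type"] == "contains" else f["value"]
        clauses.append(f"{f['field']} {OPS[f['type']]} ?")
        params.append(value)
    return " AND ".join(clauses), params

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visitors (email TEXT, sessions INTEGER)")
conn.executemany("INSERT INTO visitors VALUES (?, ?)", [
    ("a@example.com", 7), ("b@other.com", 9), ("c@example.com", 2),
])

group_filters = [
    {"field": "sessions", "type": "greater_than", "value": 5},
    {"field": "email", "type": "contains", "value": "example.com"},
]
where, params = filters_to_where(group_filters)
count = conn.execute(
    f"SELECT COUNT(*) FROM visitors WHERE {where}", params
).fetchone()[0]
print(count)  # only a@example.com matches both filters -> 1
```

Whether this beats the materialized view depends on how many groups are displayed at once and how well the filtered columns are indexed.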

Related

Checking Whether Table Data Exists, Updating / Inserting Into Two Tables & Posting End Outcome

I am working on my cron system, which gathers information via an API call. For the most part it has been fairly straightforward, but now I am faced with multiple difficulties, as the API response depends on who is making the request. It runs through each user's API key, and certain information will be visible/hidden to them, and vice versa for the public.
There are teams, and users are part of teams. A user can stealth their move; all information will be shown to them and their team, but it will not be visible to their opponent. Both teams share the same id and have access to the same information, just one can see more of it than the other.
Defender's Point of View
"attacks": {
"12345`": {
"timestamp": 1645345234,
"attacker_id": "",
"attacker_team_id": "",
"defender_id": 321,
"defender_team_id": 1,
"stealthed": 1
}
}
Attacker's Point of View
"attacks": {
"12345`": {
"timestamp": 1645345234,
"attacker_id": 123,
"attacker_team_id": 2
"defender_id": 321,
"defender_team_id": 1,
"stealthed": 1,
"boosters": {
"fair_fight": 3,
"retaliation": 1,
"group_attack": 1
}
}
}
So, if the defender's API key is used first, id 12345 will already be in the team_attacks table but will not include the attacker_id and attacker_team_id. For each insert thereafter, I need to check whether the new insert's ID already exists and has any additional information to add to the row.
Here is the part of my code that loops through the API and obtains the data; it loops through all the attacks per API key:
else if ($category === "attacks") {
    $database = new Database();
    foreach ($data as $attack_id => $info) {
        $database->query('INSERT INTO team_attacks (attack_id, attacker_id, attacker_team_id, defender_id, defender_team_id) VALUES (:attack_id, :attacker_id, :attacker_team_id, :defender_id, :defender_team_id)');
        $database->bind(':attack_id', $attack_id);
        $database->bind(':attacker_id', $info["attacker_id"]);
        $database->bind(':attacker_team_id', $info["attacker_team_id"]);
        $database->bind(':defender_id', $info["defender_id"]);
        $database->bind(':defender_team_id', $info["defender_team_id"]);
        $database->execute();
    }
}
I have also been submitting to the news table; typically I simply submit "X new entries have been added" or similar. However, I don't know whether there is a way, during the above loop, to distinguish new entries from updated entries so I can produce two news feeds:
2 attacks have been updated.
49 new attacks have been added.
For this part, I was simply counting how many entries are in the array, but that only works for the first ever upload. I know I cannot simply count the array length on future inserts, which require additional checks.
If the attack_id does NOT already exist, I also need to submit the boosters into another table. For this I was adding them to an array during the above loop and then looping through them to submit those, but this also depends on the above checks, not simply attempting to upload each one blindly. Boosters share the attack_id.
With over 1,000 teams who will potentially have at least one member join my site, I need this to be as efficient as possible. The API gives the last 100 attacks per call, and I want this to run within my cron, which collects any new data every 30 seconds, so I need to sort through potentially 100,000 records.
In SQL, you can check conditions when inserting new data using merge:
https://en.wikipedia.org/wiki/Merge_(SQL)
Depending on the database you are using, the name and syntax of the command might differ. Common names for the command are also upsert and replace.
But: if you are after high performance and near-realtime results, consider using a cache holding the critical aggregated data instead of doing the aggregation 100,000 times per minute.
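For example, in SQLite and PostgreSQL the merge takes the form INSERT ... ON CONFLICT DO UPDATE. A sketch via Python's sqlite3, using the column names from the question; the separate existence check (to tell "new" from "updated" for the news feed) is one possible approach, not the only one:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE team_attacks (
    attack_id INTEGER PRIMARY KEY,
    attacker_id INTEGER,
    attacker_team_id INTEGER,
    defender_id INTEGER,
    defender_team_id INTEGER)""")

def upsert(row):
    # Existence check so we can count "new" vs "updated" for the news feed.
    existing = conn.execute(
        "SELECT 1 FROM team_attacks WHERE attack_id = :attack_id", row
    ).fetchone()
    # COALESCE keeps an already-known attacker_id if the new row has NULL
    # (i.e. the defender's view arrives after the attacker's view).
    conn.execute(
        """INSERT INTO team_attacks
           VALUES (:attack_id, :attacker_id, :attacker_team_id,
                   :defender_id, :defender_team_id)
           ON CONFLICT(attack_id) DO UPDATE SET
             attacker_id = COALESCE(excluded.attacker_id, attacker_id),
             attacker_team_id = COALESCE(excluded.attacker_team_id, attacker_team_id)""",
        row,
    )
    return "updated" if existing else "inserted"

# Defender's API key runs first: attacker fields are unknown (NULL).
r1 = upsert({"attack_id": 12345, "attacker_id": None, "attacker_team_id": None,
             "defender_id": 321, "defender_team_id": 1})
# Attacker's key later fills in the missing columns for the same attack_id.
r2 = upsert({"attack_id": 12345, "attacker_id": 123, "attacker_team_id": 2,
             "defender_id": 321, "defender_team_id": 1})
print(r1, r2)  # inserted updated
```

Tallying the "inserted"/"updated" return values over one cron run gives exactly the two counts needed for the news feeds.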
This may or may not be the "answer" you're looking for. The question(s) imply use of a single table for both teams. It's worth considering one table per team for writes to avoid write contention altogether. The two data sets could be combined at query time in order to return "team" results via the API. At scale, you could have another process calculating and storing combined team results in an API-specific cache table that serves the API request.

React Admin - Make input for filter based on other resource

I am using React Admin to make a dashboard, and I have this Lead resource with a status field that is computed based on another resource, Call. I wanted to make a filter component for Lead's list. The way it works is that for each lead, I query the last call (sorted by a date field) associated with that lead and get its status. The lead status is the status of the last call.
{ filter: { lead }, sort: { date: -1 }, limit: 1 }
the lead status query
I use this query to make a field (that appears in the list in the row of a single lead), and I wanted to know how I can make an input component to use as a filter on the list. I know this pattern is weird, but it's hard to change in the backend because of how it's structured. I am open to suggestions concerning how to change this messy computed-field situation, but as I said, I would be satisfied with knowing how I can create the input component.
The solution I'm going with is a computed field. In my case, as I use MongoDB, it will be done through an aggregation pipeline. As I'm using REST instead of GraphQL, I cannot use a resolver that would only be called when the status field is needed, sometimes resulting in an unneeded aggregation (getting the last Call for a given Lead). However, it won't incur an additional round trip, and instead only consumes more processing time in the DB, which would otherwise be necessary for react-admin to compute this field through a reference. And status is an important field that will usually be needed anyway.
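For reference, the "last call wins" rule the aggregation expresses (a sort by date plus taking the first document per lead) can be sketched in pure Python; the lead_id/date/status field names here are assumptions, not the real schema:

```python
# Pure-Python sketch of the "last call wins" status computation;
# field names (lead_id, date, status) are illustrative assumptions.
calls = [
    {"lead_id": 1, "date": "2023-01-01", "status": "missed"},
    {"lead_id": 1, "date": "2023-02-01", "status": "answered"},
    {"lead_id": 2, "date": "2023-01-15", "status": "voicemail"},
]

def lead_status(lead_id, calls):
    """Status of the most recent call for a lead (sort: {date: -1}, limit: 1)."""
    matching = [c for c in calls if c["lead_id"] == lead_id]
    if not matching:
        return None
    return max(matching, key=lambda c: c["date"])["status"]

print(lead_status(1, calls))  # answered
```

Once the backend exposes status as a regular (computed) field, the filter input on the Lead list can be an ordinary SelectInput over the known status values.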

How to represent a sequence of actions in a database while keeping detailed information about each action?

We have many actions players can take in a game. Imagine a card game (like poker) or a board game where there are multiple choices at each decision point and there is a clear sequence of events. We keep track of each action taken by a player. We care about the action's size (if applicable), other action possibilities that weren't taken, the player who took the action, the action that player faced before his move. Additionally, we need to know whether some action happened or did not happen before the action we're looking at.
The database helps us answer questions like:
1. How often is action A taken given the opportunity? (sum(actionA)/sum(actionA_opp))
2. How often is action A taken given the opportunity and given that action B took place?
3. How often is action A taken with size X, or made within Y seconds given the opportunity and given that action B took place and action C did not?
4. How often is action A taken given that action B took place performed by player P?
So for each action, we need to keep information about the player that took the action, size, timing, the action performed, what action opportunities were available and other characteristics. There is a finite number of actions.
One game can have on average 6 actions with some going up to 15.
There could be millions of games, and we want the aggregate queries across all of them to run as fast as possible (seconds).
It could be represented in document database with an array of embedded documents like:
game: 123
actions: [
  {
    player: Player1,
    action: deals,
    time: 0.69,
    deal_opp: 1,
    discard_opp: 1
  },
  {
    player: Player2,
    action: discards,
    time: 1.21,
    deal_opp: 0,
    discard_opp: 1
  },
  ...
]
Or in a relational model:
game | player | seq_n | action | time | deal_opp | discard_opp
123 | Player | 1 | deals | 0.28 | 1 | 1
All possible designs that I come up with can't satisfy all my conditions.
In the relational model presented, seeing the previous actions taken in the same game requires N inner joins, where N is the number of previous actions we want to filter on. Given that the table would hold billions of rows, this would require several self joins on a billion-row table, which seems very inefficient.
If we instead store it in a wide-column table and represent the entire sequence in one row, aggregates become very easy (we can filter on what happened and what did not by comparing column values, e.g. sum(deal)/sum(deal_opp) where deal_opp = 1 to get the frequency of the deal action given the player had the opportunity), but we don't know WHO took a given action, which is a necessity. We cannot just append a player column next to each action column, because an action like call or discard could involve many players in a row (in a poker game, one player raises, and one or more players can call).
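The wide-row aggregate just described can be sketched like this (SQLite through Python; the games columns are illustrative):

```python
import sqlite3

# Sketch of the wide-row design: one row per game, with a flag per
# action and per opportunity (column names are illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE games (game INTEGER, deal INTEGER, deal_opp INTEGER)")
conn.executemany("INSERT INTO games VALUES (?, ?, ?)", [
    (1, 1, 1),   # had the opportunity to deal, and dealt
    (2, 0, 1),   # had the opportunity, did not deal
    (3, 0, 0),   # never had the opportunity (excluded by WHERE)
])

# Frequency of "deal" given the opportunity: sum(deal)/sum(deal_opp).
freq = conn.execute(
    "SELECT CAST(SUM(deal) AS REAL) / SUM(deal_opp) FROM games WHERE deal_opp = 1"
).fetchone()[0]
print(freq)  # 0.5
```

The attribution problem remains as stated: nothing in this row says which player took the deal action.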
More possibilities:
Graph database (overkill given that we have at most 1 other connecting node? - basically a linked list)
Closure tables (more efficient querying of previous actions)
??
If I understand correctly, you're dealing with how to store a decision tree in your database. Right?
I remember programming a chess game years ago, where every action is a consecutive set of previous actions by both players. So to keep a record of all the actions, with all the details you need, I think you should check the following:
In a relational database, the most efficient way to store a tree is a Modified Preorder Tree Traversal (MPTT). Not easy, to be honest, but you can give it a try.
This will help you: https://gist.github.com/tmilos/f2f999b5839e2d42d751
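To make the nested-set (MPTT) idea concrete, here is a minimal sketch under assumed names: each node stores lft/rgt bounds, and an entire subtree can be fetched with a single range predicate, no recursive self joins:

```python
import sqlite3

# Minimal nested-set (MPTT) sketch; the table, node names, and bounds
# are made up for illustration. Tree: root -> (raise -> call), fold.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE actions (name TEXT, lft INTEGER, rgt INTEGER)")
conn.executemany("INSERT INTO actions VALUES (?, ?, ?)", [
    ("root",  1, 8),
    ("raise", 2, 5),
    ("call",  3, 4),
    ("fold",  6, 7),
])

# Everything that followed "raise" (lft=2, rgt=5) is inside its bounds.
rows = conn.execute(
    "SELECT name FROM actions WHERE lft > 2 AND rgt < 5 ORDER BY lft"
).fetchall()
print([r[0] for r in rows])  # ['call']
```

The trade-off is that inserts must renumber the lft/rgt bounds, so MPTT favors read-heavy workloads like the aggregate queries described in the question.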

Caching aggregrate results of user supplied queries

I have an application which lets users track visitors on their own website. For this I have a table visitors(id, email, sessions, first_seen, last_seen, ...)
Users of my application can save filters/groups of visitors matching certain conditions on attributes. For this I have a table like:
groups(id, visitor_id, name, filters[{type, field, amount}...])
Example of a filters[]
(type: "greater_than", field: "sessions", amount: 5)
Each group can have multiple filters and for each group, I'd like to display the amount of visitors/results that match the filters.
What is the best way to handle this in the database?
I am thinking of something like a materialized view, but I'd still want the data to be fresh and up to date, and I'm not sure if this is the best approach.
Any suggestions?

How to keep a list of 'used' data per user

I'm currently working on a project in MongoDB where I want to get a random sampling of new products from the DB. But my problem is not MongoDB specific, I think it's a general database question.
The scenario:
Let's say we have a collection (or table) of products. And we also have a collection (or table) of users. Every time a user logs in, they are presented with 10 products. These products are selected randomly from the collection/table. Easy enough, but the catch is that every time the user logs in, they must be presented with 10 products that they have NEVER SEEN BEFORE. The two obvious ways that I can think of solving this problem are:
Every user begins with their own private list of all products. Each time they get one of these products, the product is removed from their private list. The result is that the next time products are chosen from this previously trimmed list, it already contains only new items.
Every user has a private list of previously viewed products. When a user logs in, they select 10 random products from the master list and compare the id of each against their list of previously viewed products; if an item appears on the previously viewed list, the application throws it away, selects a new one, and iterates until there are 10 new items, which it then adds to the previously viewed list for next time.
The problem with #1 is that it seems like a tremendous waste. You would basically be duplicating the list data for n users. Also, removing/adding items to the system would be a nightmare, since it would have to iterate through all users. #2 seems preferable, but it too has issues. You could end up making a lot of extra and unnecessary calls to the DB in order to guarantee 10 new products. As a user goes through more and more products, there are fewer new ones to choose from, so the chance of having to throw one away and fetch a new one from the DB greatly increases.
Is there an alternative solution? My first and primary concern is performance. I will give up disk space in order to optimize performance.
Those two approaches are a complete waste of both primary and secondary memory.
You want to show 10 never-before-seen products, but is this a real must?
If you have a lot of products, 10 random ones have a high chance of being unique.
3. You could just pick 10 random products; even though that's not as easy as in MySQL, it's still less complicated than options 1 and 2.
If you don't care how random the sequence of id's is you could do this:
Create a single randomized table of just product ids and a sequential integer surrogate key column. Start each customer at a random point in the list on first login and cycle through the list, ordered by that key. If you reach the end, start again from the top.
The customer record would contain a single value for the last product they saw (the surrogate from the randomized list, not the actual id). You'd then pull the next ten on login and do a single update to the customer. It wouldn't really be random, of course. But this kind of table-seed strategy is how a lot of simpler pseudo-random number generators work.
The only problem I see is if your product list grows more quickly than your users log in. Then they'd never see the portions of the list which appear before wherever they started. Even so, with a large list of products and very active users, this should scale much better than storing everything they've seen. So if it doesn't matter that products appear in a set pseudo-random sequence, this might be a good fit for you.
Edit:
If you stored the first record they started with as well, you could still generate the list of all things seen. It would be everything between that value and last viewed.
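The strategy above can be sketched in a few lines (Python; the list size and names are illustrative):

```python
import random

# Sketch of the randomized-list strategy: one shared shuffled list of
# product ids, and a per-user cursor into it (names are illustrative).
product_ids = list(range(100, 200))   # 100 hypothetical products
randomized = product_ids[:]
random.shuffle(randomized)            # built once, shared by all users

def next_batch(cursor, n=10):
    """Return the next n products for a user plus the advanced cursor."""
    batch = [randomized[(cursor + i) % len(randomized)] for i in range(n)]
    return batch, (cursor + n) % len(randomized)

# A new user starts at a random offset; each login stores only the cursor.
cursor = random.randrange(len(randomized))
first, cursor = next_batch(cursor)
second, cursor = next_batch(cursor)
assert not set(first) & set(second)   # no repeats until the list wraps
print(len(first), len(second))  # 10 10
```

Per user, the database stores only the cursor (and optionally the starting offset), instead of a full seen-products list.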
How about doing this: create a collection prodUser where you will have just the id of the product and the list of customer IDs who have seen that product.
{
prodID : 1,
userID : []
}
when a customer logs in you find the 10 prodID which has not been assigned to that user
db.prodUser.find({
  userID: { $nin: [yourUser] }
})
(For some reason $not is not working :-(, and I do not have time to figure out why; if you do, please let me know.) After showing the person his products, you can update his prodUser records. To mitigate Mongo's inability to find random elements, you can insert elements randomly and just take the first 10.
Everything should work really fast.