Most efficient way to find random item not present in join table and without repeating previous random item - sql

In my Rails 4 app I have an Item model and a Flag model. Item has_many Flags; Flag belongs_to Item. Flag has the attributes item_id, user_id, and reason. I am using an enum for the pending status. I need the most efficient way to get an item that doesn't exist in the flags table, because I have a VERY large table. I also need to make sure that when a user clicks to generate another random item, it will not repeat the current random item back to back. It would be OK to repeat any time afterwards, just not back to back.
This is what I have so far:
def pending_item
  @pending_items = Item.joins("LEFT OUTER JOIN flags ON flags.item_id = items.id").
    where("items.user_id != #{current_user.id} and flags.id is null")
  @pending_item = @pending_items.offset(rand(@pending_items.count)).first
end
Is there a more efficient way than this?
And how can I make sure there are no back to back repeats of the same pending_item?

What you have is the fastest way I know of to do it in the database (which is why I gave it here). Many other Stack Overflow posts discuss ways to efficiently select random rows from tables; there are more efficient methods if you're selecting from an entire table, but they don't apply when you're selecting a random result from a query whose results can be different every time.
If performance is critical, it would be much faster to do it in memory.
The first time you need to pick a random pending item for a given user, select all of the user's pending items from the database and store them in the Rails cache. (This only works if there's a reasonable number of pending items per user.)
Each time you need to pick a random pending item for a given user, get the full list from the cache and pick a random member with .sample or whatever.
Here's the tricky part: to keep the cache consistent, every time you do anything that could change a user's full list of pending items (including something like adding a new flag type), you'll need to invalidate the cache entry.
This is a lot of effort, so you really have to want to do it.
Regarding avoiding repeats, the only way to do that reliably is to store the last pending item displayed and exclude it from your query:
def pending_item
  @pending_items = Item.
    joins("LEFT OUTER JOIN flags ON flags.item_id = items.id").
    where("items.user_id != ?", current_user.id).
    where("flags.id is null").
    where("items.id != ?", previous_shown_item_id)
  @pending_item = @pending_items.offset(rand(@pending_items.count)).first
end
or, if you do the random selection in memory, exclude the last shown pending item when you do that.
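For reference, the plain SQL such a relation generates is roughly the following (the named placeholders are only illustrative stand-ins for the bound values):
SELECT items.*
FROM items
LEFT OUTER JOIN flags ON flags.item_id = items.id
WHERE items.user_id != :current_user_id
AND flags.id IS NULL
AND items.id != :previous_shown_item_id
LIMIT 1 OFFSET :random_offset
The OFFSET is what makes the random pick work against a result set that can change between requests, at the cost of the database having to walk past the skipped rows.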


Reducing database load from consecutive queries

I have an application which calls the database multiple times to achieve one simple goal.
A little information about this application: in short, it scrapes data from a webpage and stores specific information from that page in a database. The important fields here are Player name, Position, Kill points, and Class:
Player name has every potential to change or remain the same from day to day
Position can have multiple players sitting at one specific position
Kill points have the potential to increase or remain the same from day to day
Class has only 2 possibilities, e.g. A can change to B or remain A (and vice versa), but it can never be C, D, E, or F
The player name can change on any given day, and Position can also change depending on the kill point increase since the last update, which brings us back to the goal: search the database day by day, from the current date as far back as 2021-02-22, starting at the most recent entry for a player name and backtracking to the previous day to check whether that player name is still the same or has changed.
The main reference used to detect the change is the kill points. As the days go on, this number will either stay exactly the same or increase; it can never decrease.
So now onto the implementation of this application.
The first query that runs finds the most recent entry for the player name:
SELECT TOP(1) * FROM [changes] WHERE [CharacterName]=@charname AND [Territory]=@territory AND [Archived]=0 ORDER BY [Recorded] DESC
Then it continues to check the previous day's entries with the following query:
SELECT TOP(1) * FROM [changes] WHERE [Territory]=@territory AND [CharacterName]=@charname AND [Recorded]=@searchdate AND ([Class] LIKE '%{Class}%' OR [Class] LIKE '%{GetOpposite(Class)}%') AND [Archived]=0
If no results are found, it then proceeds to find an alternative name with the following query:
SELECT TOP(5) * FROM [changes] WHERE [Kills] <= @kills AND [Recorded]='{Data.Recorded.AddDays(-1):yyyy-MM-dd}' AND [Territory]=@territory AND [Mode]=@mode AND ([Class] LIKE @original OR [Class] LIKE @opposite) AND [Archived]=0 ORDER BY [Kills] DESC
The aim of the query above is to get the top 5 entries that are the closest possible matches, and then cross-reference each of them against the day ahead:
SELECT COUNT(*) FROM [changes] WHERE [CharacterName]=@CharacterName AND [Territory]=@Territory AND [Recorded]=@SearchedDate AND [Archived]=0
When checking the day ahead: if the character name is not found there, it is considered to be the old player name for this specific character; if all 5 results turn out to be present in the day-ahead searches, the name is considered new to the table.
From the date this application started running up to today's date, that adds up to over 400 individual queries against the database to achieve one goal.
It is also worth noting that this table grows by 14,400-14,500 rows each and every day.
The overall question for this specific case: is it possible to combine all these queries into fewer calls to the database, reduce the query count and improve performance?
What you can do to improve performance will be based on what parts of the application stack you can manipulate. Things to try:
Store Less Data - Database content retrieval speed is largely determined by how well the database is ordered/normalized and how much data needs to be searched for each query. Managing a cache of prior scraped pages and only storing data when there's been a change between the current scrape and the last one would guarantee fewer redundant requests to the db.
Separate specific classes of data - Separating data into dedicated tables would allow you to query a specific table for a specific character, etc., effectively removing one WHERE clause.
Reduce query frequency - Fewer incoming concurrent requests means less resource contention and faster response times for the requests already in flight.
Use another data structure - The only reason you're using TOP() is because you need data ordered in some specific way (most recent, etc.). If you kept the data in an in-memory structure that stays ordered and is still easy to query, you could offload some SQL requests to that structure instead of the db.
The suggestions above are not exhaustive, but what you do to improve performance is largely a function of what in the application stack you have the ability to modify.
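As one concrete illustration of cutting down the per-day round trips (a rough sketch only, reusing the table and columns from the question; @today is an assumed parameter): pull every non-archived row for a territory across the whole date range in a single query and walk the results in application memory, instead of issuing one query per character per day:
SELECT [CharacterName], [Recorded], [Kills], [Class], [Mode]
FROM [changes]
WHERE [Territory] = @territory
AND [Archived] = 0
AND [Recorded] BETWEEN '2021-02-22' AND @today
ORDER BY [CharacterName], [Recorded] DESC
If that result set is too large to hold at once, pulling a window of, say, 30 days at a time still replaces hundreds of single-row queries with a handful of set-based reads.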

How to set the explicit order for child table rows for one-to-many SQL relation?

Imagine a database with two tables, lists (with id and name) and items (with id, list_id, which is a foreign key linking to lists.id, and name) and the application with ORM and the corresponding models.
The task: have a way in the application to create/edit/view a list and the items inside it (that should be pretty easy), but also to save the order of the items within a list, to allow reordering items within a list (so a user creates the item list, then swaps two items, and when the list is displayed the item order is preserved), and to allow deleting items.
What is the best way to implement it, database-wise? Which db structure should I use for it?
I see these ways of solving it:
not using a separate table for items, but storing everything in a list document (as a Postgres jsonb column, for example) - this can work, but I suppose it's not the RDBMS way to do it, and if the user wants to update a single item, the whole list object has to be updated
having a position field in the items table and adding a way to manage the position in the API - this can work, but it's quite complicated (handling cases where the position is the same for some items, handling swapping items, handling item deletions and having to decrease the position of all the items coming after the deleted one, etc.)
Is there a simple way of implementing it? Like the one used in production by some big companies? I'm really curious about how such cases are handled in real life.
This is more of a theoretical question, so no code samples here (except for the db structure).
This is a good question, which as far as I know doesn't have any simple answers. I once came up with a solution for a high volume photo sharing site using an item table with columns list_id and position as you describe. The key to performance was to minimize renumbering as this database had millions of photos (and more than 2^32 likes).
The only operation was to move a single item to another point in the list (before or after another item in the list). This works by first assigning positions with large steps, e.g. 1000, 2000, 3000. Whenever an item is moved between two others, the average is used, e.g. move from pos=3000 to 1500. Eventually you may try to move an item between two items that have consecutive position numbers. Then you choose to renumber items either above or below, depending on which way requires fewer updates (e.g. if there was a run of consecutive positions). This was done using RANK and @vars, as I recall, on MySQL 5.7.
This worked well, resolving a problem where there was intermittent unavailability in production due to the massive renumberings that had been occurring when consecutive positions were used.
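To make that concrete, here is a hypothetical sketch in plain MySQL; only gallery_items, gallery_id and position come from the real schema, while the photo_id column and the literal values are made up:
-- initial inserts leave large gaps between positions
INSERT INTO gallery_items (gallery_id, photo_id, position) VALUES (1, 101, 1000);
INSERT INTO gallery_items (gallery_id, photo_id, position) VALUES (1, 102, 2000);
INSERT INTO gallery_items (gallery_id, photo_id, position) VALUES (1, 103, 3000);
-- moving photo 103 between 101 and 102 touches a single row: take the midpoint
UPDATE gallery_items
SET position = (1000 + 2000) DIV 2 -- i.e. 1500
WHERE gallery_id = 1 AND photo_id = 103;
Only when the midpoint would collide with an existing position do you fall back to renumbering a few neighbours, which is what the queries below do.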
I was able to dig up a couple of the queries (they were meant to go into a blog post ages ago). It turns out this was MySQL before RANK() was a thing, which is why the @shuffle_rank variable is used. The + 0 (and the + 1) appear because this is the actual SQL sent to the server; it was generated in code. This query finds the first gap below (greater than) position 120533287:
SELECT shuffle_rank, position
FROM (SELECT @shuffle_rank := @shuffle_rank + 1 AS shuffle_rank, position
FROM `gallery_items`
JOIN (SELECT @shuffle_rank := 0) initialize_rank_var
WHERE `gallery_items`.`gallery_id` = 14103882 AND (position >= 120533287)
ORDER BY position ASC) positionable_items
WHERE ABS(120533287 - position) >= shuffle_rank + 0 LIMIT 1
Here's the update query, run after the query above and supporting code decided that 3 rows need to be shifted to make a gap. The + 1 here may be larger if you renumber with some gap when there's room.
UPDATE `gallery_items`
SET position = -222 + (@shuffle_rank := @shuffle_rank + 1)
WHERE `gallery_items`.`gallery_id` = 24669422
AND (position >= -222)
AND ((SELECT @shuffle_rank := 0) = 0)
ORDER BY position ASC
LIMIT 3
Note that this pair of actual queries aren't for the same operation seeing as they have different gallery_id values (aka list_id).

Storing a php integer array

I need to store an array of ints. My issue is that there's an operation that's done quite a few times, so I'd like to limit it to one single query. In that query, I would need to add an int to a certain int from the array.
It's for a timer of the time spent on a certain page. Currently it's just a general counter that counts for all the pages in the same field, so I only have to do:
UPDATE user SET active = active+$totaltime WHERE id=:id
with $totaltime being the time difference between the last check and now. Now I'd like to store the time for certain pages separately. The problem is I don't know exactly how many pages there will be. I thought about using serialize, but then I'd often need to do 2 queries, which doesn't seem like a good solution.
Are there any other methods to do so?
What you need is a separate table for the levels which keeps track of active time associated with each user on each level.
Let's call this table userlevels, and give it the following columns:
userid INT
levelid INT
active INT
The primary key should be a combination of the userid and levelid columns, since there can only be one entry for a particular combination of user and level.
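In MySQL, a minimal sketch of that table might look like this (types and constraints are assumptions):
CREATE TABLE userlevels (
  userid INT NOT NULL,
  levelid INT NOT NULL,
  active INT NOT NULL DEFAULT 0,
  PRIMARY KEY (userid, levelid)
);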
Then when you want to update the amount of time a user has spent on a certain level, you would do something like:
INSERT INTO userlevels (userid,levelid,active)
VALUES (:userid,:levelid,$totaltime)
ON DUPLICATE KEY UPDATE active=active+$totaltime;
This creates a new entry in the table if the user has never been on that level before, or adds to the active time if there is already an entry.
This is MySQL-specific syntax, but the same thing can be achieved on other databases with different calls.

How to keep a list of 'used' data per user

I'm currently working on a project in MongoDB where I want to get a random sampling of new products from the DB. But my problem is not MongoDB specific, I think it's a general database question.
The scenario:
Let's say we have a collection (or table) of products. And we also have a collection (or table) of users. Every time a user logs in, they are presented with 10 products. These products are selected randomly from the collection/table. Easy enough, but the catch is that every time the user logs in, they must be presented with 10 products that they have NEVER SEEN BEFORE. The two obvious ways I can think of to solve this problem are:
Every user begins with their own private list of all products. Each time they get one of these products, the product is removed from their private list. The result is that the next time products are chosen from this previously trimmed list, it already contains only new items.
Every user has a private list of previously viewed products. When a user logs in, they select 10 random products from the master list, compare the id of each against their list of previously viewed products, and if an item appears on the previously viewed list, the application throws it away, selects a new one, and iterates until there are 10 new items, which it then adds to the previously viewed list for next time.
The problem with #1 is that it seems like a tremendous waste. You would basically be duplicating the list data for n users. Also, removing/adding new items to the system would be a nightmare since it would have to iterate through all users. #2 seems preferable, but it too has issues. You could end up making a lot of extra and unnecessary calls to the DB in order to guarantee 10 new products. As a user goes through more and more products, there are fewer new ones to choose from, so the chances of having to throw one away and get a new one from the DB greatly increase.
Is there an alternative solution? My first and primary concern is performance. I will give up disk space in order to optimize performance.
Those 2 ways are a complete waste of both primary and secondary memory.
You want to show 10 never-before-seen products, but is this a real must? If you have a lot of products, 10 random ones have a high chance of being unique.
3. You could just list 10 random products; even though that's not as easy as in MySQL, it's still less complicated than 1 and 2.
If you don't care how random the sequence of id's is you could do this:
Create a single randomized table of just product id's and a sequential integer surrogate key column. Start each customer at a random point in the list on first login and cycle through the list ordered by that key. If you reach the end, start again from the top.
The customer record would contain a single value for the last product they saw (the surrogate from the randomized list, not the actual id). You'd then pull the next ten on login and do a single update to the customer. It wouldn't really be random, of course. But this kind of table-seed strategy is how a lot of simpler pseudo-random number generators work.
The only problem I see is if your product list grows more quickly than your users log in. Then they'd never see the portions of the list which appear before wherever they started. Even so, with a large list of products and very active users this should scale much better than storing everything they've seen. So if it doesn't matter that products appear in a set pseudo-random sequence, this might be a good fit for you.
Edit:
If you stored the first record they started with as well, you could still generate the list of all things seen. It would be everything between that value and last viewed.
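A minimal MySQL sketch of that seed-table idea, assuming a products table and with every other name invented for illustration:
-- one-time (or periodically rebuilt) shuffled list of product ids
CREATE TABLE product_shuffle (
  seq INT AUTO_INCREMENT PRIMARY KEY,
  product_id INT NOT NULL
);
INSERT INTO product_shuffle (product_id)
SELECT id FROM products ORDER BY RAND();
-- on login, take the next ten entries after the customer's last seen position
SELECT product_id
FROM product_shuffle
WHERE seq > :last_seen_seq
ORDER BY seq
LIMIT 10;
After showing them, store the highest seq returned back on the customer record; when fewer than ten rows come back, wrap around to the top of the list.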
How about doing this: create a collection prodUser where you have just the id of the product and the list of customer IDs (who have seen that product):
{
prodID : 1,
userID : []
}
When a customer logs in, you find 10 prodIDs which have not been assigned to that user:
db.prodUser.find({
userID : {
$nin : [yourUser]
}
})
(For some reason $not is not working :-( and I do not have time to figure out why; if you work it out, please let me know.) After showing the person their products, you can update the prodUser collection. To mitigate Mongo's inability to find random elements, you can insert elements randomly and just take the first 10.
Everything should work really fast.

Iterate over a task model, counting users, ...then remove any user who exists more than once

I need to evaluate whether a user_id exists more than once in an array. Ultimately, I need to determine whether a user was able to complete a task (each task is saved as a record) in one attempt. I need to display a percentage of success, which is ultimately determined by the number of users who get it right on the first try. I added a :passed boolean to the task model, but then I have to write more logic to set that boolean for the first record, and then unset it if any subsequent record is created. That smells. My approach now is to simply create an array of task.users, then determine whether any user_id exists in that array more than once... and if it does, remove all instances of that integer from the array (so that they are not counted). I'm tripping over my own thought process and not having success...
How can I iterate over all tasks, count each iteration's user_id, and .delete(user_id) for any user whose count is > 1?
I don't fully understand what you're asking, but this will give you an array of user_ids that have completed the task, without duplicates:
task.users.map(&:user_id).uniq
Based on your comment you can try:
task.users.group_by{ |u| u.user_id }.collect{ |u, dups| u if dups.size == 1 }.compact
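If the records live in a SQL database, the same filtering can also be pushed into the query itself; a hypothetical equivalent, assuming a tasks table with a user_id column:
SELECT user_id
FROM tasks
GROUP BY user_id
HAVING COUNT(*) = 1;
This returns only the user_ids that appear exactly once, i.e. the users who completed the task in a single attempt.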