How to implement a Digg-like algorithm? - sql

How to implement a website with a recommendation system similar to stackoverflow/digg/reddit? I.e., users submit content and the website needs to calculate some sort of "hotness" according to how popular the item is. The flow is as follows:
Users submit content
Other users view and vote on the content (assume 90% of the users only views content and 10% actively votes up or down on content)
New content is continuously submitted
How do I implement an algorithm that calculates the "hotness" of a submitted item, preferably in real-time? Are there any best-practices or design patterns?
I would assume that the algorithm takes the following into consideration:
When an item was submitted
When each vote was cast
When the item was viewed
E.g. an item that gets a constant trickle of votes would stay somewhat "hot" constantly while an item that receives a burst of votes when it is first submitted will jump to the top of the "hotness"-list but then fall down as the votes stop coming in.
(I am using a MySQL+PHP but I am interested in general design patterns).

You could use something similar to the Reddit algorithm - the basic principle of which is you compute a value for a post based on the time it was posted and the score. What's neat about the Reddit algorithm is that you only need recompute the value when the score of a post changes. When you want to display your front page, you just get the top n posts from your database based on that score. As time goes on the scores will naturally increase, so you don't have to do any special processing to remove items from the front page.

On my own site, I assign each entry a unique integer from a monotonically increasing series (newer posts get higher numbers). Each up vote increases the number by one, and each down vote decreases it by one (you can tweak these values, of course). Then, simply sort by the number to display the 'hottest' entries.

I developed an social bookmarking site, Sites Favoritos, and used a complex algoritm:
First, the votes are finite, an user only have a limited number of votes, and the number of votes depends on the user points. To earn points each user must add links that get positive votes.
Then, users can vote -3,-2,-1,1,2 or 3 votes for each link. As the votes are limited, each user will vote only on those links that they like.
To prevent user to vote only on links for the same user, creating support groups, the points each vote adds to the link depends on a racio between total votes and votes to links of the owner of the voted link. If you always vote on the same users links, your votes will lose value.
Votes lose value with time.
New links from users who don't have points (new users) will have a starting 0 points. New links from older users will have points depending on their points. Ranging from +3 to -infinite. Links from users with negative points will have negative starting points, links from users with positive points will have positive starting points.
Users will get random points when their links are voted. Positive votes give positive points, negative votes for negative points.

Paul Graham wrote an essay on what he learned in developing Hacker News. The emphasis is more on the people/interactions he was trying to attract/create than on the algorithm per se, but still well worth a read. For example, he discusses the different outcomes when stories bubble up from the bottom (HN) versus exploding to the top (Digg) of the front page. (Although from what I've seen of HN, it looks like stories explode to the top there also).
He offers this quote:
The key to performance is elegance, not battalions of special cases.
which in light of the purported algorithm for generating the HN front page:
(p - 1) / (t + 2)^1.5
where
p = an article's points and
t = time from submission of article
might be a good starting point.

I implemented an SQL version of Reddit's ranking algorithm for a video aggregator like so:
SELECT id, title
FROM videos
ORDER BY
LOG10(ABS(cached_votes_total) + 1) * SIGN(cached_votes_total)
+ (UNIX_TIMESTAMP(created_at) / 300000) DESC
LIMIT 50
*cached_votes_total* is updated by a trigger whenever a new vote is cast. It runs fast enough on our current site, but I am planning on adding a ranking value column and updating it with the same trigger as the *cached_votes_total* column. After that optimization, it should be fast enough for most any size site.

Related

Infinite scroll algorithm for random items with different weight ( probability to show to the user )

I have a web / mobile application that should display an infinite scroll view (the continuation of the list of items is loaded periodically in a dynamic way) with items where each of the items have a weight, the bigger is the weight in comparison to the weights of other items the higher should be the chances/probability to load the item and display it in the list for the users, the items should be loaded randomly, just the chances for the items to be in the list should be different.
I am searching for an efficient algorithm / solution or at least hints that would help me achieve that.
Some points worth to mention:
the weight has those boundaries: 0 <= w < infinite.
the weight is not a static value, it can change over time based on some item properties.
every item with a weight higher than 0 should have a chance to be displayed to the user even if the weight is significantly lower than the weight of other items.
when the users scrolls and performs multiple requests to API, he/she should not see duplicate items or at least the chance should be low.
I use a SQL Database (PostgreSQL) for storing items so the solution should be efficient for this type of database. (It shouldn't be a purely SQL solution)
Hope I didn't miss anything important. Let me know if I did.
The following are some ideas to implement the solution:
The database table should have a column where each entry is a number generated as follows:
log(R) / W,
where—
W is the record's weight greater than 0 (itself its own column), and
R is a per-record uniform random number in (0, 1)
(see also Arratia, R., "On the amount of dependence in the prime factorization of a uniform random integer", 2002). Then take the records with the highest values of that column as the need arises.
However, note that SQL has no standard way to generate random numbers; DBMSs that implement SQL have their own ways to do so (such as RANDOM() for PostgreSQL), but how they work depends on the DBMS (for example, compare MySQL's RAND() with T-SQL's NEWID()).
Peter O had a good idea, but had some issues. I would expand it a bit in favor of being able to shuffle a little better as far as being user-specific, at a higher database space cost:
Use a single column, but store in multiple fields. Recommend you use the Postgres JSONB type (which stores it as json which can be indexed and queried). Use several fields where the log(R) / W. I would say roughly log(U) + log(P) where U is the number of users and P is the number of items with a minimum of probably 5 columns. Add an index over all the fields within the JSONB. Add more fields as the number of users/items get's high enough.
Have a background process that is regularly rotating the numbers in #1. This can cause duplication, but if you are only rotating a small subset of the items at a time (such as O(sqrt(P)) of them), the odds of the user noticing are low. Especially if you are actually querying for data backwards and forwards and stitch/dedup the data together before displaying the next row(s). Careful use of manual pagination adjustments helps a lot here if it's an issue.
Before displaying items, randomly pick one of the index fields and sort the data on that. This means you have a 1 in log(P) + log(U) chance of displaying the same data to the user. Ideally the user would pick a random subset of those index fields (to avoid seeing the same order twice) and use that as the order, but can't think of a way to make that work and be practical. Though a random shuffle of the index and sorting by that might be practical if the randomized weights are normalized, such that the sort order matters.

Best way for getting users friends top rating with Redis SORTED SET

I have SORTED SET user_id:rating for every level in the game(2000+ levels). There is 2 000 000 users in set.
I need to create 2 ratings - first - all users top 100, second - top 5 friends each player
First can be solved very easily with ZRANGE
But there is a problem with second, because in average - every user has 500 friends
There is 2 ways:
1) I can do 500 requests with ZSCORE\ZRANK and sort users on by backend (too many requests, bad performance)
2) I can create SORTED SET for each user and update it on background on every users update. (more data, more ram, more complex)
May be there are any others options I missed?
I believe your main concern here should be your data model. Does every user have a sorted set of his friends?
I would recommend something like this:
users:{id}:friends values as the ids of friends
users:scoreboard values as the users ids and score as the rating
of each
As an answer to your first concern, you can consider using pipelines, which will reduce the number of requests drastically, none the less you will still need to handle ordering the results.
The better answer for you problem would be, in case you have the two sorted sets as described earlier:
Get the intersection between the two, using the "zinterstore" command and storing the result in a sorted set created solely for this purpose. As a result, the new sorted set will contain all the user's friends ids with their rating as the score (need to be careful here since you will need to specify the score of the new sorted set, it can either be the SUM, MIN or MAX of the scores).
ref: http://redis.io/commands/zinterstore
At this point using a simple "zrevrangebyscore" and specifying a limit, will leverage the sorted result you are looking for.

Need to format table from BigQuery for specific words in the body of text

I'm working with Google BigQuery to scrape the reddit comments database. I'll start with the query I'm working on:
SELECT
DATE(SEC_TO_TIMESTAMP(created_utc)) AS date,
subreddit,
author AS comment_author,
ups AS upvotes,
LOWER(body)
FROM
[fh-bigquery:reddit_comments.2015_01]
WHERE
body CONTAINS 'acid'
OR body CONTAINS 'ecstasy'
OR body CONTAINS 'fire'
OR body CONTAINS 'heroin'
LIMIT 10;
I need to scrape the reddit database for a list of about 30 drug-related word (I limited it to 3 for brevity).
I'm having trouble with two things:
I want to be able to correctly query the DB, but a lot of the results that are returned do not meet the criteria a.k.a. do not contain any of the matching words.
I want to be able to create a column which displays the specific word which was matched....so if it matched the word 'drug', that word would appear in a 'word_matched' column, along with the body, author, date, etc.
I've tried regular expressions as well for matching the words, but that doesn't seem to be helping either:
WHERE (REGEXP_MATCH(body,'drug|acid|ecstacy|fire|heroin|joint|marijuana|weed|bud|ganja|hash|blazing|blaze|meth|molly|pcp|shrooms|speed|uppers|valium|xanax|tripping|smoke|liquor|beer|alcohol|booze|acid|benzos|blow|cocaine|crack|crank|dank|dope|downers'))
Any and all help will be greatly appreciated. Thanks all!
Below adressing both points of the question
1. Have in output only matching words and not those which are part of another/different word. This is easy to accomplish using REGEXP_MATCH function
2. Have column wich consists of all matching words. (i think it makes more sense to have all matching words vs. just one as it is asked in question.
SELECT
[date],
subreddit,
comment_author,
upvotes,
GROUP_CONCAT(word) AS matches,
body
FROM (
SELECT
[date],
subreddit,
comment_author,
upvotes,
body,
word
FROM (
SELECT
DATE(SEC_TO_TIMESTAMP(created_utc)) AS [date],
subreddit,
author AS comment_author,
ups AS upvotes,
LOWER(body) AS body
FROM
[fh-bigquery:reddit_comments.2015_01]
WHERE REGEXP_MATCH(body, r'\b(drug|ecstacy|fire|heroin|joint|marijuana|weed|bud|ganja|hash|blazing|blaze|meth|molly|pcp|shrooms|speed|uppers|valium|xanax|tripping|smoke|liquor|beer|alcohol|booze|acid|benzos|blow|cocaine|crack|crank|dank|dope|downers)\b')
) x
CROSS JOIN (
SELECT SPLIT(list,'|') AS word FROM
(SELECT 'drug|ecstacy|fire|heroin|joint|marijuana|weed|bud|ganja|hash|blazing|blaze|meth|molly|pcp|shrooms|speed|uppers|valium|xanax|tripping|smoke|liquor|beer|alcohol|booze|acid|benzos|blow|cocaine|crack|crank|dank|dope|downers' AS list)
) y
HAVING body CONTAINS word
)
GROUP BY [date], subreddit, comment_author, upvotes, body
LIMIT 1000
Above solution provides list of matching words on best-effort basis, so please note:
If column matches consists one word - it is for sure exact matched word
But if this columns consist of few words - still one of those is exact match, but others can be not exact match.
I think for lengthy body - it still valuable to have those at least as a hint to what to look for. For example as in
drug,meth,heroin,alcohol,benzos it also inhibits the reuptake of serotonin and norepinephrine which gives a hell of a lot worse withdrawal symptoms than most other drugs(incl. heroin, meth, coke and etc.). from what i have heard the only things that rival tramadol it terms of withdrawal are benzos and alcohol.
liquor,beer,alcohol,booze 1. reinforce #3 - it is not cheap to live here. not by any stretch. expect to pay more than the rest of the country pays for everything. even franchises that operate nation-wide have special wa/perth pricing. 2. petrol has literally just dropped to $1 this past month, i wouldn't go as far as quoting that as our average price just yet. average is still between $1.20-1.30. 3. parking is free at beaches & parks, do not expect to get free parking anywhere in the city though. if you're using public parking in the city all day, expect to pay $50 unless you get in early. 4. forget bribing the cops, don't even call them "mate". last time i was pulled over (last week, random stop) i said "evening mate" as i was handing him my license and was responded with "don't call me mate, i'm not your friend, i don't know you". 5. unlike the rest of the world, regular stores do not sell alcohol here. liquor stores only, don't expect to buy beer from a gas station or grocery store. 6. rent is expensive, food is expensive, booze is expensive, being alive is expensive.
drug,meth,heroin,beer that's simply not true. first there's a difference between legalization and decriminalization. second, some european countries have places to go to safely use drugs. there is middle ground between allowing heroin to be sold all over town and having users go to prison. heroin, meth and some other drugs are not good things for society and their use should encouraged by making it as easy to buy as a 6 pack of beer. i'm not really sure why you can't see a middle ground because it's clearly not as black and white as you say. you can go after the dealers while leaving the users alone.
drug,fire,joint,smoke not a story about a rave, but still relevant i think: i was working a job called "fire watch," which is just what it sounds like, at a nine inch nails concert a few years ago. our comrades, the security workers, were far from seasoned professionals. they were mostly college temps with a yellow security tee shirt and a flashlight; they didn't even have radios. the job is basically to make sure people don't go into restricted areas. ...but this one boy scout took it upon himself to tame the metal masses. mid-concert, he pulled me close and shouted "they're smoking pot!" i shrugged, and shot him an "and?" look. i guess he thought i should care because technically a joint is a tiny dangerous drug fire, and i was on the fire crew. he then proceeded to disappear into the crowd, shoving people out of the way on his heroic journey toward the countless smoke puff origins. the next time i saw him he was bleeding out of his face and getting a flashlight in the eyes from an onsite emt. i guess it's pretty harsh to say that he deserved the beating, but it's hard to argue that he didn't go asking for it. i guess the moral of my story is that security people are just people, and some people's shittyness is inflamed when combined with authority. it sounds like your event just happened to be warded by a gaggle of douches, probably being captained by king fuckwad who really wanted to be a cop, but couldn't pass the exams.
Note: If you need list of only exact matches, it is still relatively easy to do with BigQuery User-Defined Functions
I suggest debugging this using REGEXP_EXTRACT. I tried running your query, and it kept finding things like "meth" in "something", which might be what you're seeing. You probably want to check for word boundaries around the match, since some of your words you are searching for can be contained in several normal, non-drug-related words.
Something like the following should help in debugging:
SELECT
DATE(SEC_TO_TIMESTAMP(created_utc)) AS date,
subreddit,
author AS comment_author,
ups AS upvotes,
REGEXP_EXTRACT(body, '(drug|acid|ecstacy|fire|heroin|joint|marijuana|weed|bud|ganja|hash|blazing|blaze|meth|molly|pcp|shrooms|speed|uppers|valium|xanax|tripping|smoke|liquor|beer|alcohol|booze|acid|benzos|blow|cocaine|crack|crank|dank|dope|downers)') AS match,
LOWER(body),
FROM
[fh-bigquery:reddit_comments.2015_01]
WHERE (REGEXP_MATCH(body,'drug|acid|ecstacy|fire|heroin|joint|marijuana|weed|bud|ganja|hash|blazing|blaze|meth|molly|pcp|shrooms|speed|uppers|valium|xanax|tripping|smoke|liquor|beer|alcohol|booze|acid|benzos|blow|cocaine|crack|crank|dank|dope|downers'))
LIMIT 10;

Ranking algorithm in a rails app

We have a model in our ralis app whose objects are assigned a score based on positive user actions. We'll call them products for simplicity sake. If a user likes a product or buys a product or views a product, the score is incremented at various weights (a like might be worth more than a view, two views in the span of 30 seconds might be worth more than three views spread over an hour, etc.)
We'd like to use these scores to help sort and rank products, say for a popular products list, but for various reasons -- using the straight ranking is going to unevenly favor older products, since they'll have more time to amass a higher score.
My question is, how to normalize the scores between new and old products. I thought about dividing the products score by a unit of time, say the number of days it's been in existence, but am worried that will cut down the older products too much. Any thoughts on the best way to fairly normalize the scores between the old and new products?
I'm also considering an example of a bayesian rating system I found in another question:
rating = ((avg_num_votes * avg_rating) + (product_num_votes * product_rating)) / (avg_num_votes + product_num_votes)
Where theavg numbers are calculated by looking at the scores across all products that have more than one vote (or in our case, a positive action). This might not be the best way, because we don't have a negative rating in our system and it doesn't take time into consideration at all.
Your question reminds me the concept of Exponential Discounting Cash Flow in finance.
The concept is the following : 100$ in two years worth less than 100$ in one year, which worth less than 100$ now, ...
I think that we can make a good comparison here : a product of yesterday worth more that a product of the day before but less than a product of today.
The formula is simple :
Vn = V0 * (1-t)^n
with V0 the initial value (the real number of positives votes), t a discount rate (you have to fix it, like 10%) and n the time passed (for example n days). Thus a product will lose 10% of his value each day (but 10% of the precedent day, not of the initial value).
You can also see Hyperbolic discounting that is closer of your try. The formula can be sometyhing like that I guess :
Vn = V0 * (1/(1+k*n))
An other approach, simpler, but crudest : linear discounting. You can simply give an initial value for the scores, like 1000 and each day, you decrement all scores by 1 (or an other constant).
Vn = V0 - k*n

What formula is used for building a list of related items in a tag-based system?

There are a lot of sites out there that use 'tags' to categorize items in their system. For example, YouTube uses keywords to categorize videos, Stack Overflow uses tags to categorize questions, etc.
What formulas do these sites use (especially SO) to build a list of items related to another item based on the tags it has? I'm building a system much like the one on SO and I'd like to find a way to generate a list of 20 items or so based on the tags of one item, but also make it spread enough so that each photo generates a vastly different list, and so that clicking an item in any given related list could eventually lead you to almost every item in the database.
The technical term for an organization based on user tags is a folksonomy. A google search for that term brings up a huge amount of material on how these systems are put together. A good place to start is the Wikipedia article.
I had to solve this exact problem for a contract a few years back, and the company was nice enough to let me blog about how I did it at http://bentilly.blogspot.com/2011/02/finding-related-items.html.
You'll note that if you get a decent volume of data then you'll really, really want to do this out of the database.
Similarity between items is often represented as dot products between the vectors representing the items. So if you have a tag based system, each tag will define one dimension. The vector then for an item becomes 1 in dimension i if tag i is set for this item (or higher numbers if you allow multiple tagging). If you calculate the dot product of the vectors of two items you will get the similarity for those items (N.b. the vectors have to be normalized so that the absolute value is 1).
Note that the dimensionality will get very large (several tens of thousands of tags are common). This sounds like a show stopper for this kind of thing. But you will also not that the vectors are really sparse and multiple dot product become one big matrix multiplication of a sparse matrix with it's own transposition. Using efficient algorithms for sparse matrix multiplication, this can be done relatively fast.
Also note, that most systems do not only rely on tags, but rather on "user behavior" (whatever that means). I.e. for Youtube user behavior would be "Watching a video", "Subscribing to a channel", "looking for similar videos as video X" or "tagging video x with tag y".
I ended up using the following code (with different names), which finds all other items with at least one tag in common, and orders the results by number of common tags, descending, and subsorts by other criteria specific to my problem:
SELECT PT.WidgetID, COUNT(*) AS CommonTags, PS.OtherOrderingCriteria1, PS.OtherOrderingCriteria2, PS.OtherOrderingCriteria3, PS.Date FROM WidgetTags PT INNER JOIN WidgetStatistics PS ON PT.WidgetID = PS.WidgetID
WHERE PT.TagID IN (SELECT PTInner.TagID FROM WidgetTags PTInner WHERE PTInner.WidgetID = #WidgetID)
AND PT.WidgetID != #WidgetID
GROUP BY PT.WidgetID, PS.OtherOrderingCriteria1, PS.OtherOrderingCriteria2, PS.OtherOrderingCriteria3, PS.Date
ORDER BY CommonTags DESC, PS.OtherOrderingCriteria1 DESC, PS.OtherOrderingCriteria2 DESC, PS.OtherOrderingCriteria3 DESC, PS.Date DESC, PT.WidgetID DESC