Limiting the result of intersection between two sets in Redis efficiently - redis

I have an exam software system, and one of its features is to show students random questions from a huge set, provided that the question has never been shown to the student before. I'm using Redis to implement it, so I made two types of sets in my Redis DB: the first is the question bank, and each user also has their own set of previously viewed questions, which gets updated after the user sees a question in an exam.
However, to meet this requirement, I need to find 10 questions from the question bank for each exam that the user has never seen before. I thought of using:
SDIFFSTORE nextQuestionsToShow questionBank userQuestionsSet
SRANDMEMBER nextQuestionsToShow 10
and after processing the result, I delete the produced set nextQuestionsToShow.
However, I think this is inefficient (time- and memory-wise), since it's an anytime online exam system that users access throughout the day, and the question bank has a huge number of questions per category (some categories have over 100K questions). This means the difference is a huge set for each user that has to be stored only to select 10 random questions. So is there a more efficient way to select 10 random questions from the question bank that the user hasn't answered before? Thanks a lot in advance.

Instead of using a SET to store userQuestionsSet and questionBank, you can use bitmaps (Redis STRINGs) to store these two sets. Then you can use BITOP to efficiently get the difference between the two bitmaps.
UPDATE
First of all, you need to assign each question a unique number. Then use bitmaps to store userQuestionsSet and questionBank. Say you have the following questions in the bank: 1: question1, 2: question2, 3: question3, 4: question4, 5: question5. And the user has already viewed question3:
// initialize question bank: 01111100 (offset 0 is the leftmost bit)
SETBIT question-bank 1 1
SETBIT question-bank 2 1
...
SETBIT question-bank 5 1
// user has viewed question3: 00010000
SETBIT user 3 1
Get the difference between the question bank and the user's viewed questions. XOR equals the set difference here only because every viewed question is also in the bank:
// XOR to get the difference: 01111100 XOR 00010000
BITOP XOR result question-bank user
// 01101100
// questions not viewed: 1, 2, 4, 5
GET result
When you GET the binary string stored in result, you can scan the string and randomly get 10 questions for the user.
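The scan-and-pick step could be sketched on the client side roughly as follows. This is only a sketch in Python, assuming the value returned by GET is available as raw bytes; the function names `set_bit_offsets` and `pick_questions` are made up for illustration. It relies on the Redis convention that bit offset 0 is the most significant bit of the first byte.

```python
import random
from typing import List, Optional


def set_bit_offsets(bitmap: bytes) -> List[int]:
    """Return the offsets of all 1 bits, using Redis bit order
    (offset 0 is the most significant bit of the first byte)."""
    offsets = []
    for byte_index, byte in enumerate(bitmap):
        for bit in range(8):
            if byte & (0x80 >> bit):
                offsets.append(byte_index * 8 + bit)
    return offsets


def pick_questions(result: bytes, count: int,
                   rng: Optional[random.Random] = None) -> List[int]:
    """Pick up to `count` distinct question numbers at random
    from the bits set in the `result` bitmap."""
    rng = rng or random.Random()
    candidates = set_bit_offsets(result)
    # sample() never repeats an element; cap at the number available.
    return rng.sample(candidates, min(count, len(candidates)))
```

For a result bitmap with bits 1, 2, 4 and 5 set, `pick_questions(result, 10)` returns those four numbers in random order, since fewer than 10 are available.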
NOTE
Be aware that SETBIT can be an expensive operation, so it is better to pre-allocate memory for these bitmaps. See the WARNING part of the SETBIT documentation for details.

Related

Query time for a specific entity is 10000 times higher [closed]

We ran into a problem: a SELECT filtered by one particular id takes a very long time. For every other id it takes about 5 ms; for this one, 10 seconds.
This is the EXPLAIN output: normal on the left, slow on the right. It is exactly the same SQL query; the only difference is one digit in 'where id = ...'
It is striking that a filter is used on the right but, for some reason, not on the left, and that there is a huge number of 'rows removed'. Such a number can only be obtained by multiplying the number of rows in the joined tables. Once again, the SQL query is exactly the same except for the entity id, and the amount of data retrieved for the entities is comparable.
One of the tables also uses a btree index. The only thing special about this id is that it comes after a gap in the numbering (22, 23, 24, 30, for example). But I was not able to reproduce the problem on this basis.
Unfortunately, I cannot show the code, but I hope that this information will be enough to advise something.
upd:
I found the reason. For some reason Postgres expects one of the tables to return only 1 row, when in reality it returns 10k+, and therefore chooses the wrong algorithm. For other entity ids it "thinks" correctly and chooses better algorithms. How does Postgres estimate the row counts in a plan? What could be the problem?
If I understand correctly, your problem is the data histogram. We cannot help much because you cannot provide example code. Briefly, one of your tables has an id column whose values are unevenly distributed. For example, your table has 1 billion records and most ids have 500 records each, yet a few ids have (let's say) 20 or 200 million records. If you search for these highly non-selective ids, the database optimizer will not help you.
Check your data histogram!
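To make the idea of skew concrete, here is a small Python sketch. It is not how Postgres computes its estimates; it only compares each id's real row count with the naive "average rows per id" figure, and the function name `skewed_ids` and the `factor` threshold are assumptions made for illustration.

```python
from collections import Counter


def skewed_ids(ids, factor=10.0):
    """Return {id: row_count} for ids whose row count exceeds
    `factor` times the average number of rows per distinct id."""
    counts = Counter(ids)
    average = len(ids) / len(counts)
    return {key: n for key, n in counts.items() if n > factor * average}
```

An id that shows up in the result is one for which an "average" row estimate would be far too low, which is the situation described above.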

Best way for getting users friends top rating with Redis SORTED SET

I have a SORTED SET user_id:rating for every level in the game (2000+ levels). There are 2 000 000 users in the set.
I need to create 2 ratings: first, the top 100 of all users; second, the top 5 friends of each player.
The first can be solved very easily with ZRANGE.
But there is a problem with the second, because on average every user has 500 friends.
There are 2 ways:
1) I can do 500 requests with ZSCORE\ZRANK and sort the users on the backend (too many requests, bad performance)
2) I can create a SORTED SET for each user and update it in the background on every user's update (more data, more RAM, more complexity)
Maybe there are other options I missed?
I believe your main concern here should be your data model. Does every user have a sorted set of his friends?
I would recommend something like this:
users:{id}:friends values as the ids of friends
users:scoreboard values as the users ids and score as the rating of each
As an answer to your first concern, you can consider using pipelines, which will reduce the number of round trips drastically; nonetheless, you will still need to handle ordering the results yourself.
The better answer to your problem, in case you have the two sorted sets as described earlier, would be:
Get the intersection between the two using the "zinterstore" command, storing the result in a sorted set created solely for this purpose. As a result, the new sorted set will contain all the user's friends' ids with their rating as the score (be careful here: you need to specify how the scores of the new sorted set are aggregated, which can be the SUM, MIN or MAX of the input scores).
ref: http://redis.io/commands/zinterstore
At this point, a simple "zrevrangebyscore" with a limit will give you the sorted result you are looking for.
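To show what the intersect-then-rank approach computes, here is a plain Python model of it. It does not talk to Redis; the dict and set stand in for users:scoreboard and users:{id}:friends, and the function name `top_friends` is made up for this sketch.

```python
import heapq


def top_friends(scoreboard: dict, friends: set, limit: int = 5) -> list:
    """Return up to `limit` (friend_id, rating) pairs, highest rating first."""
    # Intersection step: keep only members present in both collections,
    # carrying over the scoreboard rating as the score.
    friend_scores = {uid: scoreboard[uid] for uid in friends if uid in scoreboard}
    # Ranking step: highest scores first, cut off at `limit`.
    return heapq.nlargest(limit, friend_scores.items(), key=lambda item: item[1])
```

The point of doing this inside Redis rather than in application code like the above is that the 500 friend scores never have to cross the network; only the final top 5 do.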

Redis unique increment

I am trying to implement a scoring system on Redis. I have no experience with it whatsoever.
What my app should be doing is increasing a value ONLY if the user has not already voted, so I was thinking of something like this:
INCR voteme
but only if this user has not increased it already, so I wanted to do the following:
SET voteme:voterip 1
and then I would count the elements. The problem is that I think this is not doable in Redis, and I have to think of another approach.
Any ideas?
EXTRA question:
I want to make this data persistent by writing the resulting count (e.g. 24) to the corresponding user in MongoDB. Some pseudocode would be of great help.
I would not store a counter but directly a set containing all the users who have already voted.
Let's suppose a vote is organized for user 1. Each time a user X votes for user 1, you can execute:
SADD user:1:votes X
The number of votes for user 1 can be easily retrieved:
SCARD user:1:votes
Now if you need to keep this count in sync with another store, you can execute (still supposing user X votes for user 1):
MULTI
SADD user:1:votes X
SCARD user:1:votes
EXEC
The trick is the SADD command returns the number of items effectively added to the set. If the item already exists, it returns 0. So it is quite easy to run this multi/exec block, check the result of SADD, get the cardinality of the set (number of votes), and push the cardinality to another store only if the set has been altered by the transaction.
This way, you keep the counter up-to-date in your persistent store (in real time), while filtering useless voting events.
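The decision logic around that transaction can be modelled in plain Python like this. It is only a sketch: the `votes` dict of sets stands in for the Redis sets, the `store` dict stands in for the persistent store, and `record_vote` is a made-up name.

```python
def record_vote(votes: dict, store: dict, target: str, voter: str) -> int:
    """Register `voter`'s vote for `target`; return the current vote count."""
    voters = votes.setdefault(target, set())
    added = 0 if voter in voters else 1  # mirrors SADD's return value
    voters.add(voter)
    count = len(voters)                  # mirrors SCARD
    if added:
        # Only push to the other store when the set actually changed.
        store[target] = count
    return count
```

A repeated vote by the same user leaves both the count and the persistent store untouched, which is the filtering effect described above.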

Developing Rainbow Tables

I am currently working on a parallel computing project where I am trying to crack passwords using rainbow tables.
The first step I have thought of is to implement a very small version that cracks passwords of length 5 or 6 (only numeric passwords to begin with). I have some questions about the configuration settings.
1 - What size should I start with? My first guess is to start with a table of 1000 (initial, final) pairs. Is this a good size to start with?
2 - Number of chains - I could not find any information online about what the length of a chain should be.
3 - Reduction function - If someone can give me any information about how I should go about building one.
Also, if anyone has any information or an example, it would be really helpful.
There is already a wealth of rainbow tables available online. Calculating rainbow tables simply moves the computation burden from when the attack is being run, to the pre-computation.
http://www.freerainbowtables.com/en/tables/
http://www.renderlab.net/projects/WPA-tables/
http://ophcrack.sourceforge.net/tables.php
http://www.codinghorror.com/blog/2007/09/rainbow-hash-cracking.html
It's a time-space tradeoff. The longer the chains are, the fewer of them you need, so the less space they'll take up, but the longer cracking each password will take.
So, the answer is always to build the biggest table you can in the space that you have available. This will determine your chain length and number of chains.
As for choosing the reduction function, it should be fast and behave pseudo-randomly. For your proposed plaintext set, you could just pick 20 bits from the hash and interpret them as a decimal number (choosing a different set of 20 bits at each step in the chain).
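A reduction function along those lines might look like the sketch below. Several things here are assumptions for illustration, not part of the advice above: SHA-256 as the hash, passwords of exactly 6 digits, and the names `reduce_hash` and `build_chain`. Since 2^20 is slightly larger than 10^6, the sketch also folds the 20-bit value back into range with a modulo, which introduces a small bias.

```python
import hashlib

PASSWORD_LENGTH = 6  # assumed: numeric passwords of exactly 6 digits


def reduce_hash(digest: bytes, step: int) -> str:
    """Map a hash digest to a candidate password, varying with `step`."""
    value = int.from_bytes(digest, "big")
    # Select a different 20-bit window at each step of the chain.
    shift = (step * 20) % (len(digest) * 8 - 20)
    bits = (value >> shift) & 0xFFFFF
    # 2**20 > 10**6, so fold the value back into the password space.
    return str(bits % 10 ** PASSWORD_LENGTH).zfill(PASSWORD_LENGTH)


def build_chain(start: str, length: int) -> tuple:
    """Hash and reduce `length` times; return the (initial, final) pair."""
    current = start
    for step in range(length):
        digest = hashlib.sha256(current.encode()).digest()
        current = reduce_hash(digest, step)
    return start, current
```

Only the (initial, final) pair returned by `build_chain` needs to be stored in the table; the intermediate passwords are recomputed during lookup.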

Storage algorithm question - verify sequential data with little memory

I found this on an "interview questions" site and have been pondering it for a couple of days. I will keep churning on it, but I am interested in what you guys think.
"10 Gbytes of 32-bit numbers on a magnetic tape, all there from 0 to 10G in random order. You have 64 32 bit words of memory available: design an algorithm to check that each number from 0 to 10G occurs once and only once on the tape, with minimum passes of the tape by a read head connected to your algorithm."
32-bit numbers can take 4G = 2^32 different values. There are 2.5*2^32 numbers on the tape in total, so by the pigeonhole principle at least one value is guaranteed to repeat once more than 2^32 numbers have been read. If there were <= 2^32 numbers on the tape, both cases would be possible: all numbers different, or at least one repeated.
It's a trick question, as Michael Anderson and I have figured out. You can't store 10G 32b numbers on a 10G tape. The interviewer (a) is messing with you and (b) is trying to find out how much you think about a problem before you start solving it.
The utterly naive algorithm, which takes as many passes as there are numbers to check, would be to walk through and verify that the lowest number is there. Then do it again checking that the next lowest is there. And so on.
This requires one word of storage to keep track of where you are - you could cut down the number of passes by a factor of 64 by using all 64 words to keep track of where you're up to in several different locations in the search space - checking all of your current ones on each pass. Still O(n) passes, of course.
You could probably cut it down even more by using portions of the words - given that your search space for each segment is smaller, you won't need to keep track of the full 32-bit range.
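Taking "portions of the words" down to single bits, the 64 words give a 2048-bit bitmap, so each pass can verify a window of 2048 consecutive values. The Python sketch below illustrates that idea under some assumptions: `read_tape` is a hypothetical callable that returns a fresh iterator per pass, `total` is the count of expected values (0 to total-1), and the loop counters are assumed to live in registers outside the 64-word budget.

```python
WORDS = 64                      # memory budget from the question
BITS_PER_WORD = 32
WINDOW = WORDS * BITS_PER_WORD  # 2048 values checked per pass


def verify_tape(read_tape, total: int) -> bool:
    """Check that the tape holds each of 0..total-1 exactly once.

    Each call to read_tape() represents one pass of the read head.
    """
    for low in range(0, total, WINDOW):
        high = min(low + WINDOW, total)
        words = [0] * WORDS              # the 64-word bitmap
        seen_count = 0
        for number in read_tape():
            if not 0 <= number < total:
                return False             # value out of range
            if low <= number < high:
                word, bit = divmod(number - low, BITS_PER_WORD)
                mask = 1 << bit
                if words[word] & mask:
                    return False         # duplicate inside this window
                words[word] |= mask
                seen_count += 1
        if seen_count != high - low:
            return False                 # a value in this window is missing
    return True
```

This needs total / 2048 passes, so it is still O(n) passes, just with a much smaller constant than one value (or 64 values) per pass.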
Perform an in-place mergesort or quicksort, using tape for storage? Then iterate through the numbers in sequence, tracking to see that each number = previous+1.
Requires cleverly implemented sort, and is fairly slow, but achieves the goal I believe.
Edit: oh bugger, it's never specified you can write.
Here's a second approach: scan through, trying to build up to 30-ish ranges of contiguous numbers, i.e. 1,2,3,4,5 would be one range, 8,9,10,11,12 would be another, etc. If ranges overlap with existing ones, they are merged. I think you only need to make a limited number of passes to either get the complete range or prove there are gaps... much fewer than just scanning through in blocks of a couple thousand to see if all numbers are present.
It'll take me a bit to prove or disprove the limits for this though.
Do 2 reduces on the numbers, a sum and a bitwise XOR.
The sum should be (10G + 1) * 10G / 2
The XOR should be ... something
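For reference, the XOR of 0..n has a closed form that depends only on n mod 4, so both reductions can be checked in a single pass. The sketch below uses made-up names (`expected_xor`, `passes_checksums`) and takes n as the largest expected value. Note that the running sum needs more than one 32-bit word.

```python
def expected_xor(n: int) -> int:
    """XOR of all integers 0..n, via the period-4 pattern."""
    return (n, 1, n + 1, 0)[n % 4]


def passes_checksums(numbers, n: int) -> bool:
    """Single pass: compare sum and XOR against the values for 0..n."""
    total = 0
    folded = 0
    for value in numbers:
        total += value
        folded ^= value
    return total == n * (n + 1) // 2 and folded == expected_xor(n)
```

These two checks are necessary but not sufficient: the sequence 0, 2, 2, 3, 4, 5, 5, 7 has the same sum (28) and XOR (0) as 0..7, so it passes despite containing duplicates.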
It looks like there is a catch in the question that no one has talked about so far; the interviewer has only asked the interviewee to write a program that CHECKS
(i) whether each number that makes up the 10G is present once and only once. What should the interviewee do if numbers in the given list are present multiple times? Should he assume that he should stop executing the program and throw an exception, or should he assume that he should correct the mistake by removing the repeated number and replacing it with another (this may actually be a costly exercise, as it involves a complete reshuffle of the number set)? Correcting this is required to perform the second step in the question, i.e. to verify that the data is stored in the best possible way, so that it requires the fewest possible passes.
(ii) whether the 10G data set of numbers is stored in such a way that accessing any of those numbers requires the fewest passes.
What should the interviewee do? Should he stop and throw an exception the moment he finds an issue in the way they were stored, or correct the mistake and continue until all the elements are sorted in the order requiring the fewest possible passes?
If the intention of the interviewer is to ask the interviewee to write an algorithm that finds the best combination of numbers that can be stored in 10GB, given 64 32-bit registers, and also to write an algorithm to save this chosen set of numbers in the best possible way, requiring the fewest passes to access each, he should have asked this directly, shouldn't he?
I suppose the intention of the interviewer may be only to see how the interviewee approaches the problem rather than to actually extract a working solution from the interviewee; would anyone buy this notion?
Regards,
Samba