What is a data structure for quickly finding non-empty intersections of a list of sets? - optimization

I have a set of N items, which are sets of integers, let's assume it's ordered and call it I[1..N]. Given a candidate set, I need to find the subset of I which have non-empty intersections with the candidate.
So, for example, if:
I = [{1,2}, {2,3}, {4,5}]
I'm looking to define valid_items(items, candidate), such that:
valid_items(I, {1}) == {1}
valid_items(I, {2}) == {1, 2}
valid_items(I, {3,4}) == {2, 3}
I'm trying to optimize for one fixed I and variable candidate sets. Currently I am doing this by caching items_containing[n] = {the sets which contain n}. In the above example, that would be:
items_containing = [{}, {1}, {1,2}, {2}, {3}, {3}]
That is, 0 is contained in no items, 1 is contained in item 1, 2 is contained in items 1 and 2, 3 is contained in item 2, and 4 and 5 are contained in item 3.
That way, I can define valid_items(I, candidate) = union(items_containing[n] for n in candidate).
Is there any more efficient data structure (of a reasonable size) for caching the result of this union? The obvious example of space 2^N is not acceptable, but N or N*log(N) would be.
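For concreteness, the inverted-index scheme described in the question can be sketched in Python (function names here are mine, not from the original):

```python
def build_index(items):
    """Map each integer n to the set of 1-based item indices whose sets contain n."""
    index = {}
    for i, s in enumerate(items, start=1):
        for n in s:
            index.setdefault(n, set()).add(i)
    return index

def valid_items(index, candidate):
    """Union of the cached index entries for every member of the candidate."""
    result = set()
    for n in candidate:
        result |= index.get(n, set())
    return result

I = [{1, 2}, {2, 3}, {4, 5}]
index = build_index(I)
```

Building the index costs O(total size of all sets); each query costs the sum of the sizes of the index entries it touches.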

I think your current solution is optimal big-O wise, though there are micro-optimizations that could improve its actual performance, such as using bitwise operations when merging the chosen sets in items_containing into the valid-items set.
i.e. you store items_containing as bitmasks (bit i set when item i+1 contains the index):
items_containing = [0b000, 0b001, 0b011, 0b010, 0b100, 0b100]
and your valid_items can use bit-wise OR to merge like this:
int valid_items(Set I, Set candidate) {
    // if you need more than 32 items, use int[] for valid
    // and int[][] for items_containing
    int valid = 0;
    for (int item : candidate) {
        valid |= items_containing[item]; // bit-wise OR
    }
    return valid;
}
but they don't really change the Big-O performance.

One representation that might help is storing each set in I as a vector V of size n whose entry V(i) is 0 when i is not in the set and positive otherwise. Then to take the intersection of two vectors you multiply the entries elementwise, and to take the union you add them.
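A minimal sketch of that vector representation, assuming a fixed universe 0..n-1 (union entries clamped to 1 so repeated unions stay 0/1):

```python
def to_vector(s, n):
    """0/1 characteristic vector of a set over the universe 0..n-1."""
    return [1 if i in s else 0 for i in range(n)]

def intersection(u, v):
    """Elementwise product: positive exactly where both sets contain i."""
    return [a * b for a, b in zip(u, v)]

def union(u, v):
    """Elementwise sum, clamped to 1."""
    return [min(a + b, 1) for a, b in zip(u, v)]

a = to_vector({1, 2}, 6)
b = to_vector({2, 3}, 6)
c = to_vector({4, 5}, 6)
```

The candidate intersects a set exactly when the product vector has a positive entry, i.e. when the dot product of the two vectors is positive.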

Related

How to get guaranteed unique list shuffles in Kotlin

I have a list of nine numbers (1-9), that I need to shuffle based on a seed, and guarantee that each permutation of that shuffle is unique. I'd like to do that like this:
list.shuffle(Random(seed))
There are 9! (362,880) possible permutations of this list, and I know that if I pass it the same Random seed twice, those two permutations will be identical, but I need a way to guarantee that for any given seed between 0 and 362,880, the list order will be unique from any other seed in that range.
Is this possible in Kotlin?
This isn't really a question about Kotlin, but algorithms in general.
There could be a much better solution, but you can represent your seed as a number with a variable base (the factorial number system). The first digit has base 9, the second base 8 and so on. When dealing with base-10 numbers, we repeatedly divide by 10 and note the remainder to split them into digits. In our case we divide by 9, 8, 7 and so on. This converts the seed into a list of 9 digits with ranges 0-8, 0-7, 0-6, ... . What is important: each seed has a unique list of such digits.
Now, if we create another list of numbers 1-9, then we can use the list of digits from the previous paragraph to pick numbers from it, removing them at the same time. Initially, we have 9 items in our list, so valid indexes are 0-8 and this is exactly the range of our first digit. Then we have only 8 remaining items, so they have indexes 0-7 and this is exactly what the second digit is. And so on.
This is not that easy to explain in words, code could be better:
fun shuffled1to9(seed: Int): List<Int> {
    require(seed in 0 until 362880)
    val remaining = (1..9).toMutableList()
    val result = mutableListOf<Int>()
    var curr = seed
    (9 downTo 2).forEach {
        val (next, pick) = curr divmod it
        result += remaining.removeAt(pick)
        curr = next
    }
    result += remaining.single()
    return result
}

infix fun Int.divmod(divisor: Int): Pair<Int, Int> {
    val quotient = this / divisor
    return quotient to (this - quotient * divisor)
}
shuffled1to9(0) returns original order of 1..9. shuffled1to9(362879) returns the order inverted: 9..1. Any number in between should generate a unique ordering.
Of course, it can be very easily generalized to different lists of numbers and to different sizes.
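The uniqueness claim is easy to check with a direct Python port of the Kotlin above (same factorial-number-system decoding):

```python
def shuffled_1_to_9(seed):
    """Decode seed (0 <= seed < 9!) into factorial-base digits and use each
    digit to pick-and-remove one element from the remaining items."""
    assert 0 <= seed < 362880
    remaining = list(range(1, 10))
    result = []
    curr = seed
    for base in range(9, 1, -1):     # bases 9, 8, ..., 2
        curr, pick = divmod(curr, base)
        result.append(remaining.pop(pick))
    result.append(remaining[0])      # single element left
    return result
```

Decoding is a bijection between seeds and permutations, so distinct seeds in range always give distinct orders.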

Trade off between Linear and Binary Search

I have a list of elements to be searched in a dataset of variable length. I have tried binary search and found it is not always efficient when the objective is to search for a list of elements.
I did the following study and concluded that if the number of elements to be searched is less than 5% of the data, binary search is efficient; otherwise linear search is better.
Below are the details
Number of elements : 100000
Number of elements to be searched: 5000
Number of iterations (binary search) = log2(N) x SearchCount = log2(100000) x 5000 ≈ 83,048
Further increases in the number of search elements lead to more iterations than the linear search.
Any thoughts on this?
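The arithmetic above can be checked directly; under this simple iteration-count model the crossover works out to about 6% rather than 5% (a quick sketch, variable names mine):

```python
import math

N = 100_000                       # dataset size
M = 5_000                         # number of elements searched
binary_cost = M * math.log2(N)    # M independent binary searches
linear_cost = N                   # one full linear scan

# break-even: M * log2(N) == N  =>  M == N / log2(N)
break_even_M = N / math.log2(N)
```

For N = 100000 this gives break_even_M ≈ 6021, i.e. roughly 6% of N, consistent with the ~83,048 iterations quoted above for M = 5000.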
I am calling the below function only if the number of elements to be searched is less than 5%.
private int SearchIndex(ref List<long> entitylist, ref long[] DataList, int i, int len, ref int listcount)
{
    int Start = i;
    int End = len - 1;
    int mid;
    while (Start <= End)
    {
        mid = (Start + End) / 2;
        long target = DataList[mid];
        if (target == entitylist[listcount])
        {
            i = mid;
            listcount++;
            return i;
        }
        else
        {
            if (target < entitylist[listcount])
            {
                Start = mid + 1;
            }
            if (target > entitylist[listcount])
            {
                End = mid - 1;
            }
        }
    }
    listcount++;
    return -1; // the element in the list is not in the dataset
}
In the code I return the index rather than the value because I need to work with the index in the calling function. If it returns -1, the calling function resets the value to the previous i and calls the function again with a new element to search.
In your problem you are looking for M values in an N long array, N > M, but M can be quite large.
Usually this is approached as M independent binary searches (or even with the slight optimization of using the previous result as a starting point): you are looking at O(M*log(N)).
However, using the fact that the M values are also sorted, you can find all of them in one pass with linear search. In this case your problem is O(N). This is better than O(M*log(N)) for large M.
But you have a third option: since the M values are sorted, binary-split M too, and every time you find a value you can limit the subsequent searches to the ranges on the left and on the right of the found index.
The first look-up is on all N values, the next two on (on average) N/2, then 4 on N/4 data, ... I think this scales as O(log(M)*log(N)). Not sure of it, comments welcome!
However, here is test code - I have slightly modified your code, but without altering its functionality.
In case you have M=100000 and N=1000000, the "M binary searches" approach takes about 1.8M iterations; that's more than the 1M needed to scan the N values linearly. But with what I suggest it takes just 272K iterations.
Even in case the M values are very "collapsed" (e.g., they are consecutive), and linear search is in its best condition (100K iterations would be enough to get all of them, see the comments in the code), the algorithm performs very well.
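The answerer's test code is not reproduced above; here is my own sketch of the doubly-recursive idea in Python (bisect stands in for the inner binary search; this is an illustrative reconstruction, not the original code):

```python
import bisect

def count_matches(data, queries):
    """Both lists sorted. Binary-split the queries; each found position
    restricts the data range searched for the queries on either side."""
    def rec(qlo, qhi, dlo, dhi):
        if qlo > qhi or dlo > dhi:
            return 0
        qmid = (qlo + qhi) // 2
        target = queries[qmid]
        pos = bisect.bisect_left(data, target, dlo, dhi + 1)
        found = 1 if pos <= dhi and data[pos] == target else 0
        # smaller queries live strictly left of pos, larger ones at/after it
        return (found
                + rec(qlo, qmid - 1, dlo, pos - 1)
                + rec(qmid + 1, qhi, pos + found, dhi))
    return rec(0, len(queries) - 1, 0, len(data) - 1)
```

Each level of the query recursion halves the data range available to the next level, which is what makes the total iteration count so much smaller than M independent full-range searches.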

Get the most occurring number amongst several integers without using arrays

DISCLAIMER: Rather theoretical question here, not looking for a correct answer, just asking for some inspiration!
Consider this:
A function is called repetitively and returns integers based on seeds (the same seed returns the same integer). Your task is to find out which integer is returned most often. Easy enough, right?
But: You are not allowed to use arrays or fields to store return values of said function!
Example:
int mostFrequentNumber = 0;
int occurencesOfMostFrequentNumber = 0;
int iterations = 10000000;
for (int i = 0; i < iterations; i++)
{
    int result = getNumberFromSeed(i);
    int occurencesOfResult = magic();
    if (occurencesOfResult > occurencesOfMostFrequentNumber)
    {
        mostFrequentNumber = result;
        occurencesOfMostFrequentNumber = occurencesOfResult;
    }
}
If getNumberFromSeed() returns 2,1,5,18,5,6 and 5 then mostFrequentNumber should be 5 and occurencesOfMostFrequentNumber should be 3 because 5 is returned 3 times.
I know this could easily be solved using a two-dimensional list to store results and occurrences. But imagine for a minute that you cannot use any kind of arrays, lists, dictionaries etc. (maybe because the system running the code has such limited memory that you cannot store enough integers at once, or because your prehistoric programming language has no concept of collections).
How would you find mostFrequentNumber and occurencesOfMostFrequentNumber? What does magic() do? (Of course you do not have to stick to the example code. Any ideas are welcome!)
EDIT: I should add that the integers returned by getNumberFromSeed() are calculated from a seed, so the same seed returns the same integer (i.e. int result = getNumberFromSeed(5); would always assign the same value to result).
Make a hypothesis: assume that the distribution of the integers is, e.g., Normal.
Start simple. Have two variables:
- N, the number of elements read so far
- M1, the average of said elements
Initialize both variables to 0.
Every time you read a new value x, update N to N + 1 and M1 to M1 + (x - M1)/N.
At the end M1 will equal the average of all values. If the distribution is Normal, this value will have a high frequency.
Now improve the above. Add a third variable:
M2, the sum of squared deviations (x - M1)^2 over all values of x read so far.
Initialize M2 to 0. Now set aside a small memory of, say, 10 elements or so. For every new value x that you read, update N and M1 as above, and update M2 (using the value of M1 from before this step's update) as:
M2 := M2 + (x - M1)^2 * (N - 1) / N
At every step M2/N is the variance of the distribution and sqrt(M2/N) its standard deviation.
As you proceed, remember the frequencies of only the values read so far whose distances to M1 are less than the standard deviation. This requires some additional storage; however, it will be very small compared to the high number of iterations you will run. This modification will let you guess the most frequent value better than simply answering the mean (or average) as above.
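The running mean/variance update described above is Welford's online algorithm; a minimal sketch, with M2 accumulating the sum of squared deviations so the variance is M2/N:

```python
import math

def running_stats(values):
    """One-pass mean and standard deviation in O(1) extra memory."""
    n = 0
    mean = 0.0   # M1 in the text above
    m2 = 0.0     # sum of squared deviations from the running mean
    for x in values:
        n += 1
        delta = x - mean          # uses the mean *before* this update
        mean += delta / n
        m2 += delta * (x - mean)  # equals delta**2 * (n - 1) / n
    return mean, (math.sqrt(m2 / n) if n else 0.0)
```

The two forms of the M2 update are algebraically identical; the delta-product form shown here is the numerically stable one usually quoted.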
UPDATE
Given that this is about insights for inspiration, there is plenty of room for considering and adapting the approach I've proposed to any particular situation. Here are some thoughts.
When I say assume that the distribution is Normal, you should think of it as: given that the problem has no exact solution under these constraints, let's see if there is some qualitative information I can use to decide what kind of distribution the data would have. Given that the algorithm is intended to find the most frequent number, it should be fine to assume that the distribution is not uniform. Let's try with Normal, LogNormal, etc. to see what can be found out (more on this below.)
If the game completely disallows the use of any array, then fine, keep track of only, say 10 numbers. This would allow you to count the occurrences of the 10 best candidates, which will give more confidence to your answer. In doing this choose your candidates around the theoretical most likely value according to the distribution of your hypothesis.
You cannot use arrays, but perhaps you can read the sequence of numbers two or three times, not just once. In that case you can read it once to check whether your hypothesis about its distribution is good or bad. For instance, if you compute not just the variance but also the skewness and the kurtosis, you will have more elements to check your hypothesis. If the first reading indicates that there is some bias, you could use a LogNormal distribution instead, etc.
Finally, in addition to providing the approximate answer you would be able to use the information collected during the reading to estimate an interval of confidence around your answer.
Alright, I found a decent solution myself:
int mostFrequentNumber = 0;
int occurencesOfMostFrequentNumber = 0;
int iterations = 10000000;
int maxNumber = -2147483647;
int minNumber = 2147483647;

// Step 1: find the largest and smallest number that _can_ occur
for (int i = 0; i < iterations; i++)
{
    int result = getNumberFromSeed(i);
    if (result > maxNumber)
    {
        maxNumber = result;
    }
    if (result < minNumber)
    {
        minNumber = result;
    }
}

// Step 2: for each possible number between minNumber and maxNumber, count occurences
for (int thisNumber = minNumber; thisNumber <= maxNumber; thisNumber++)
{
    int occurenceOfThisNumber = 0;
    for (int i = 0; i < iterations; i++)
    {
        int result = getNumberFromSeed(i);
        if (result == thisNumber)
        {
            occurenceOfThisNumber++;
        }
    }
    if (occurenceOfThisNumber > occurencesOfMostFrequentNumber)
    {
        occurencesOfMostFrequentNumber = occurenceOfThisNumber;
        mostFrequentNumber = thisNumber;
    }
}
I must admit this may take a long time, since the runtime is O((maxNumber - minNumber) * iterations). But it works without using arrays.

Find all pairs of consecutive numbers in BST

I need to write a code that will find all pairs of consecutive numbers in BST.
For example: let's take the BST T with key 9, T.left.key = 8, T.right.key = 19. There is only one pair - (8, 9).
The naive solution that I thought about is to do any traversal (pre-, in-, post-order) on the BST and, for each node, find its successor and predecessor; if one or both of them are consecutive to the node, we print them. But the problem is that this will be O(n^2), because we have n nodes and for each one we use a function that takes O(h), where in the worst case h ~ n.
Second solution is to copy all the elements to an array, and to find the consecutive numbers in the array. Here we use O(n) additional space, but the runtime is better - O(n).
Can you help me find an efficient algorithm? I'm trying to think of one that doesn't use additional space and runs better than O(n^2).
*The required output is the number of those pairs (no need to print the pairs).
*Any 2 consecutive integers in the BST form a pair.
*The BST contains only integers.
Thank you!
Why don't you just do an inorder traversal and count pairs on the fly? You'll need a global variable to keep track of the last number, and you'll need to initialize it to something which is not one less than the first number (e.g. the root of the tree). I mean:
// Last item
int last;

// Recursive function for in-order traversal
int countPairs(whichever_type treeRoot)
{
    int r = 0; // Return value
    if (treeRoot.leftChild != null)
        r = r + countPairs(treeRoot.leftChild);
    if (treeRoot.value == last + 1)
        r = r + 1;
    last = treeRoot.value;
    if (treeRoot.rightChild != null)
        r = r + countPairs(treeRoot.rightChild);
    return r; // Edit 2016-03-02: This line was missing
}

// Main function
main(whichever_type treeRoot)
{
    int r;
    if (treeRoot == null)
        r = 0;
    else
    {
        last = treeRoot.value; // to make sure this is not one less than the lowest element
        r = countPairs(treeRoot);
    }
    // Done. Now the variable r contains the result
}

Picking random binary flag

I have defined the following:
typedef enum {
    none = 0,
    alpha = 1,
    beta = 2,
    delta = 4,
    gamma = 8,
    omega = 16,
} Greek;

Greek t = beta | delta | gamma;
I would like to be able to pick one of the flags set in t randomly. The value of t can vary (it could be, anything from the enum).
One thought I had was something like this:
r = 0;
while (!(t & (1 << r))) { r = rand(0, 4); }
Anyone got any more elegant ideas?
If it helps, I want to do this in ObjC...
Assuming I've correctly understood your intent, if your definition of "elegant" includes table lookups the following should do the trick pretty efficiently. I've written enough to show how it works, but didn't fill out the entire table. Also, for Objective-C I recommend arc4random over using rand.
First, construct an array whose indices are the possible t values and whose elements are arrays of t's underlying Greek values. I ignored none, but that's a trivial addition to make if you want it. I also found it easiest to specify the lengths of the subarrays. Alternatively, you could do this with NSArrays and have them self-report their lengths:
int myArray[8][4] = {
    {0},
    {1},
    {2},
    {1,2},
    {4},
    {4,1},
    {4,2},
    {4,2,1}
};
int length[] = {1,1,1,2,1,2,2,3};
Then, for any given t you can randomly select one of its elements using:
int r = myArray[t][arc4random_uniform(length[t])];
Once you get past the setup, the actual random selection is efficient, with no acceptance/rejection looping involved.
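If the table becomes impractical (say, for many more flags), one set bit can also be chosen uniformly in a single scan with reservoir sampling; a Python sketch of the idea, with random.randrange standing in for arc4random_uniform:

```python
import random

def random_set_bit(t):
    """Return the value of one uniformly chosen set bit of t, or 0 if t == 0."""
    chosen = 0
    seen = 0
    bit = 1
    while bit <= t:
        if t & bit:
            seen += 1
            # keep this bit with probability 1/seen => uniform over all set bits
            if random.randrange(seen) == 0:
                chosen = bit
        bit <<= 1
    return chosen
```

This trades the table's O(1) lookup for an O(number of bits) scan, but needs no setup and works for any flag width.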