comparing/intersecting compare criteria - oop

If there's any open source code that does this already I'm interested in hearing about it. But I haven't seen it yet so I'm trying to roll my own.
Example:
variable x = compareCriteriaBetween 3 and 6
variable y = compareCriteriaLesserThanOrEqual 5
The difficult part for me is finding an elegant way to compare the compareCriteria and create an intersection. In the example the intersection between the two is 'between 3 and 5'.
How can I implement this in a 'tell don't ask' manner? Note that compareCriteria can be completely unrelated (eg startsWithLetter versus betweenNumber).

If you only have constants in your expressions you should be safe from undecidability (I think!). Problems arise as soon as you can express e.g. general statements about integers with +-*/ (see Peano arithmetic).
Even if you stay within the realm of decidability, there exists no algorithm that can take arbitrary statements P(x) and Q(x) and compute a statement R(x) equivalent to P(x) & Q(x) for all x, where x can range over any domain (integers, strings, matrices, real numbers, complex numbers, logical statements [whoops, back into undecidable territory!], ...). You need domain specific tricks to get there, and strictly delimited languages in which P, Q and R are formulated. There exist software products for certain domains -- one of them is called Mathematica...
Try to get back to basics: what problem are you trying to solve?

If you are just interested in simple criteria like less equal or between on integers/floats, you can rewrite between 3 and 6 as (greater equal 3 and less equal 6). If you combine this with a logical and with less equal 5 you can use Boolean algebra to obtain (greater equal 3 and (less equal 6 and less equal 5)) before simplifying the inner parenthesis to just less equal 5 and rewriting the result as between 3 and 5.

Related

How to sample rows from a table with a specific probability?

I'm using BigQuery at my new position, and I'm totally new to SQL/BigQuery.
I'm testing a machine learning model and monitoring an A/B test with a different ratio, e.g., 3 vs. 10. To compare the A/B results, e.g., # of page view, I want to make the ratios equal first so that I can compare easily. For example, say we have a table with 13 records (3 are from A and 10 are from B). In addition, each row contains an id field that is identical. What I want to do is to extract only 3 samples out of 10 for B to match the sample number to A.
I'm trying to use the FARM_FINGERPRINT function to map fields to integers. Then I'm taking ABS and then calculating MOD to convert the integer numbers to a specific range, e.g., [0, 10). Eventually, I would like to get 3 in 10 items using the following line:
MOD(ABS(FARM_FINGERPRINT(field)), 10) < 3
However, I found that even if I run A/B with exactly the same ML model with different A/B ratio, the result is different between A and B (The results should be same because A and B are running the same ML model with just the different ratio). This made me doubt that the above implementation may bring some biased data sampling. I also read this post and confirmed the FARM_FINGERPRINT might not bring a randomly distributed result.
*There's a critical reason why I cannot simply multiply 3/10 to B, which is confidential and cannot disclose here.
Is there a better way to accomplish the equally distributed sampling?
Thank you in advance. (I'm sorry if the question is vague, as I'm hiding the confidential parts.)

Searching for groups of objects given a reduction function

I have a few questions about a type of search.
First, is there a name and if so what is the name of the following type of search? I want to search for subsets of objects from some collection such that a reduction and filter function applied to the subset is true. For example, say I have the following objects, each of which contains an id and a value.
[A,10]
[B,10]
[C,10]
[D,9]
[E,11]
I want to search for "all the sets of objects whose summed values equal 30" and I would expect the output to be, {{A,B,C}, {A,D,E}, {B,D,E}, {C,D,E}}.
Second, is the only strategy to perform this search brute-force? Is there some type of general-purpose algorithm for this? Or are search optimizations dependent on the reduction function?
Third, if you came across this problem, what tools would you use to solve it in a general way? Assume the reduction and filter functions could be anything and are not necessarily the sum function. Does SQL provide a good API for this type of search? What about Prolog? Any interesting tips and tricks would be appreciated.
Thanks.
I cannot comment on the problem in general but brute forcing search can be easily done in prolog.
w(a,10).
w(b,10).
w(c,10).
w(d,9).
w(e,11).
solve(0, [], _).
solve(N, [X], [X|_]) :- w(X, N).
solve(N, [X|Xs], [X|Bs]) :-
w(X, W),
W < N,
N1 is N - W,
solve(N1, Xs, Bs).
solve(N, [X|Xs], [_|Bs]) :- % skip element if previous clause fails
solve(N, [X|Xs], Bs).
Which gives
| ?- solve(30, X, [a, b, c, d, e]).
X = [a,b,c] ? ;
X = [a,d,e] ? ;
X = [b,d,e] ? ;
X = [c,d,e] ? ;
(1 ms) no
Sql is TERRIBLE at this kind of problem. Until recently there was no way to get 'All Combinations' of row elements. Now you can do so with Recursive Common Table Expressions, but you are forced by its limitations to retain all partial results as well as final results which you would have to filter out for your final results. About the only benefit you get with SQL's recursive procedure is that you can stop evaluating possible combinations once a sub-path exceeds 30, your target total. That makes it slightly less ugly than an 'evaluate all 2^N combinations' brute force solution (unless every combination sums to less than the target total).
To solve this with SQL you would be running an algorithm that can be described as:
Seed your result set with all table entries less than your target total and their value as a running sum.
Iteratively join your prior result with all combinations of table that were not already used in the result set and whose value added to running sum is less than or equal to target total. Running sum becomes old running sum plus value, and append ID to ID LIST. Union this new result to the old results. Iterate until no more records qualify.
Make a final pass of the result set to filter out the partial sums that do not total to your target.
Oh, and unless you make special provisions, solutions {A,B,C}, {C,B,A}, and {A,C,B} all look like different solutions (order is significant).

Link numbers with an equation/algorithm

I am making an anagram solver in Visual Basic that gives you every possible combination when you enter a string. I need to work out how many combinations there are depending on the amount of characters in the string and how many different characters there are.
E.G.
Sample string:
abc
Total characters: 3, Different Characters: 3
Possible combinations: 6
abc, acb, bac, bca, cab, cba
I need an equation (using the number of characters and different characters) to link this to a string that contains a different amount of characters.
I've been using trial and error to try and figure is out, but I can't quite get my head around it. So far I have:
((letters - 1) ^ (different letters - 1)) + (letters - 1)
which works for a few different letter counts but now for all.
Help please???
I'll lead you to the answer, but I'll try to explain along the way. Let's say you had 10 different letters. You'd have 10 choices for the first, 9 for the second, 8 for the third, etc. Ultimately, there would be 10*9*8*7*6...*2*1 = 10! possibilities. However, sometimes you'll have multiple instances of the same letter. For example, using that for the string "aaabcd" would overcount possibilities, because it counts each of the a's as distinct letters, even though they're not. To correct for that, you would have to divide by the factorial of the number of repeated letters. A good way to calculate the total number of possibilities would be (total number of letters factorial)/ (product of the factorials of the number of repeated instances of each letter).
For example:
There are 6!/(3!) ways to arrange the letters in "aaabcd"
There are 6! ways to arrange the letters is "abcdef"
There are 6!/(3!*2!) ways to arrange the letters in "aaabbc"
There are 10!/(5!*3!*2!) ways to arrange the letters in "aaaaabbbcc"
I hope this helps.
For the possible counting number, it's exactly the same as computing Multinomial Coefficient
A simple explanation is that, for no repeating characters,
It's simply permutation = n!
(It is easy to understand if you draw a tree diagram, with first character has n choices, second character has n-1choices...etc.)
However as you may have repeating characters, you will double count many of them.
Let's see an simple example: for aaa, how many possible arrangements IF WE COUNT EVEN THE OUTCOME IS THE SAME?
Answer is 3!(aaa,aaa,aaa,aaa,aaa,aaa)
This gives us an idea that, when we have a character appearing for m times, we will count m! instead of 1
So the counting is just n!(all possible arrangements, including same outcome) / m! (a character appear for m times)
Same for more characters repeating: n!/a!b!c!.. (first character appear a times, another appear for b times...)
If you understand the concept behind, then you will find that, actually for those "non-repeating" characters, it's just dividing an 1!. For eg, character (multi)set = {a,a,a,b,b,c}, #a = 3, #b = 2, #c = 1, so the answer (without repeating count) is (3+2+1)!/3!2!1! and fraction of this format is named multinomial coefficient as stated above.
In programming point of view, you can just pre-compute all factorials (with a pretty small n though as n~30 is already too large for a variable to store) with simple for loop
declare frac = array(n);
frac[0] = 1;
FOR i=1; i<=n;i++
frac[i] = i*frac[i-1]
For a larger n, you may just calculate double/float division on the fly in the loop to avoid overflow..you may face precision problem though.
If you further need to output the different strings, you may use DFS to backtrack all the possible outcomes. Or if you could use another language like C++, you can use built-in function like next_permutation() after sort the character set.

Determine whether there is a subset of size n which has a standard deviation <= s

Given a bunch of numbers, I am trying to determine whether there is a "clump" anywhere where numbers are very densely packed.
To make things more precise, I thought I'd ask a more specific problem: given a set of numbers, I would like to determine whether there is a subset of size n which has a standard deviation <= s. If there are many such subsets, I'd like to find the subset with the lowest standard deviation.
So question #1 : does this formal problem definition effectively capture the intuitive concept of a "clump" of densely packed numbers?
EDIT: I don't actually care about determining which numbers belong to this "clump", I'm much more interested in determining where the clump is centred, which is why I think that specifying n in advance is okay. But feel free to correct me!
And question #2 : assuming it does, what is the best way to go about implementing something like this (in particular, I want a solution with lowest time complexity)? So far I think I have a solution that runs in n log n:
First, note that the lowest-standard-deviation-possessing subset of a given size must consist of consecutive numbers. So step 1 is sort the numbers (this is n log n)
Second, take the first n numbers and compute their standard deviation. If our array of numbers is 0-based, then the first n numbers are [0, n-1]. To get standard deviation, compute s1 and s2 as follows:
s1 = sum of numbers
s2 = sum of squares of numbers
Then, wikipedia says that the standard deviation is sqrt(n*s2 - s1^2)/n. Record this value as the highest standard deviation seen so far.
Find the standard deviation of [1, n], [2, n+1], [3, n+2] ... until you hit the the last n numbers. To do each computation takes only constant time if you keep track of s1 and s2 running totals: for example, to get std dev of [1, n], just subtract the 0th element from the s1 and s2 totals and add the nth element, then recalculate standard deviation. This means that the entire standard deviation calculating portion of the algorithm takes linear time.
So total time complexity n log n.
Is my assessment right? Is there a better way to do this? I really need this to run fast on fairly large sets, so the faster the better! Space is less of an issue (I think).
Having been working recently on a similar problem, both the definition of the clumps and the proposed implementation seem reasonable.
Another reasonable definition would be to find the minimum of all the ranges of n numbers. Thus, given that the list of numbers x is sorted, one would just find the minimum of x[n]-x[1], x[n+1]-x[2], etc. This would be slightly quicker than finding the standard deviation because it would avoid the multiplications and square roots. Indeed, you can avoid the square roots even when looking for the lowest standard deviation by finding the minimum variance (the square of the standard deviation), rather than the sd itself.
A caution would be that the location of the biggest clump might be quite sensitive to the choice of n. If there is an a priori reason to select a particular n, that won't be a problem. If not, however, it might require some experimentation to select the value of n that fairly reliably finds the clumps you are looking for, whether you are selecting by range or by standard deviation. Some ideas on this can be found in Chapter 6 of the online book ABC of EDA.

How can I test that my hash function is good in terms of max-load?

I have read through various papers on the 'Balls and Bins' problem and it seems that if a hash function is working right (ie. it is effectively a random distribution) then the following should/must be true if I hash n values into a hash table with n slots (or bins):
Probability that a bin is empty, for large n is 1/e.
Expected number of empty bins is n/e.
Probability that a bin has k balls is <= 1/ek! (corrected).
Probability that a bin has at least k collisions is <= ((e/k)**k)/e (corrected).
These look easy to check. But the max-load test (the maximum number of collisions with high probability) is usually stated vaguely.
Most texts state that the maximum number of collisions in any bin is O( ln(n) / ln(ln(n)) ).
Some say it is 3*ln(n) / ln(ln(n)). Other papers mix ln and log - usually without defining them, or state that log is log base e and then use ln elsewhere.
Is ln the log to base e or 2 and is this max-load formula right and how big should n be to run a test?
This lecture seems to cover it best, but I am no mathematician.
http://pages.cs.wisc.edu/~shuchi/courses/787-F07/scribe-notes/lecture07.pdf
BTW, with high probability seems to mean 1 - 1/n.
That is a fascinating paper/lecture-- makes me wish I had taken some formal algorithms class.
I'm going to take a stab at some answers here, based on what I've just read from that, and feel free to vote me down. I'd appreciate a correction, though, rather than just a downvote :) I'm also going to use n and N interchangeably here, which is a big no-no in some circles, but since I'm just copy-pasting your formulae, I hope you'll forgive me.
First, the base of the logs. These numbers are given as big-O notation, not as absolute formulae. That means that you're looking for something 'on the order of ln(n) / ln(ln(n))', not with an expectation of an absolute answer, but more that as n gets bigger, the relationship of n to the maximum number of collisions should follow that formula. The details of the actual curve you can graph will vary by implementation (and I don't know enough about the practical implementations to tell you what's a 'good' curve, except that it should follow that big-O relationship). Those two formulae that you posted are actually equivalent in big-O notation. The 3 in the second formula is just a constant, and is related to a particular implementation. A less efficient implementation would have a bigger constant.
With that in mind, I would run empirical tests, because I'm a biologist at heart and I was trained to avoid hard-and-fast proofs as indications of how the world actually works. Start with N as some number, say 100, and find the bin with the largest number of collisions in it. That's your max-load for that run. Now, your examples should be as close as possible to what you expect actual users to use, so maybe you want to randomly pull words from a dictionary or something similar as your input.
Run that test many times, at least 30 or 40. Since you're using random numbers, you'll need to satisfy yourself that the average max-load you're getting is close to the theoretical 'expectation' of your algorithm. Expectation is just the average, but you'll still need to find it, and the tighter your std dev/std err about that average, the more you can say that your empirical average matches the theoretical expectation. One run is not enough, because a second run will (most likely) give a different answer.
Then, increase N, to say, 1000, 10000, etc. Increase it logarithmically, because your formula is logarithmic. As your N increases, your max-load should increase on the order of ln(n) / ln(ln(n)). If it increases at a rate of 3*ln(n) / ln(ln(n)), that means that you're following the theory that they put forth in that lecture.
This kind of empirical test will also show you where your approach breaks down. It may be that your algorithm works well for N < 10 million (or some other number), but above that, it starts to collapse. Why could that be? Maybe you have some limitation to 32 bits in your code without realizing it (ie, using a 'float' instead of a 'double'), or some other implementation detail. These kinds of details let you know where your code will work well in practice, and then as your practical needs change, you can modify your algorithm. Maybe making the algorithm work for very large datasets makes it very inefficient for very small ones, or vice versa, so pinpointing that tradeoff will help you further characterize how you could adapt your algorithm to particular situations. Always a useful skill to have.
EDIT: a proof of why the base of the log function doesn't matter with big-O notation:
log N = log_10 (N) = log_b (N)/log_b (10)= (1/log_b(10)) * log_b(N)
1/log_b(10) is a constant, and in big-O notation, constants are ignored. Base changes are free, which is why you're encountering such variation in the papers.
Here is a rough start to the solution of this problem involving uniform distributions and maximum load.
Instead of bins and balls or urns or boxes or buckets or m and n, people (p) and doors (d) will be used as designations.
There is an exact expected value for each of the doors given a certain number of people. For example, with 5 people and 5 doors, the expected maximum door is exactly 1.2864 {(1429-625) / 625} above the mean (p/d) and the minimum door is exactly -0.9616 {(24-625) / 625} below the mean. The absolute value of the highest door's distance from the mean is a little larger than the smallest door's because all of the people could go through one door, but no less than zero can go through one of the doors. With large numbers of people (p/d > 3000), the difference between the absolute value of the highest door's distance from the mean and the lowest door's becomes negligible.
For an odd number of doors, the center door is essentially zero and is not scalable, but all of the other doors are scalable from certain values representing p=d. These rounded values for d=5 are:
-1.163 -0.495 0* 0.495 1.163
* slowly approaching zero from -0.12
From these values, you can compute the expected number of people for any count of people going through each of the 5 doors, including the maximum door. Except for the middle ordered door, the difference from the mean is scalable by sqrt(p/d).
So, for p=50,000 and d=5:
Expected number of people going through the maximum door, which could be any of the 5 doors, = 1.163 * sqrt(p/d) + p/d.
= 1.163 * sqrt(10,000) + 10,000 = 10,116.3
For p/d < 3,000, the result from this equation must be slightly increased.
With more people, the middle door slowly becomes closer and closer to zero from -0.11968 at p=100 and d=5. It can always be rounded up to zero and like the other 4 doors has quite a variance.
The values for 6 doors are:
-1.272 -0.643 -0.202 0.202 0.643 1.272
For 1000 doors, the approximate values are:
-3.25, -2.95, -2.79 … 2.79, 2.95, 3.25
For any d and p, there is an exact expected value for each of the ordered doors. Hopefully, a good approximation (with a relative error < 1%) exists. Some professor or mathematician somewhere must know.
For testing uniform distribution, you will need a number of averaged ordered sessions (750-1000 works well) rather than a greater number of people. No matter what, the variances between valid sessions are great. That's the nature of randomness. Collisions are unavoidable. *
The expected values for 5 and 6 doors were obtained by sheer brute force computation using 640 bit integers and averaging the convergence of the absolute values of corresponding opposite doors.
For d=5 and p=170:
-6.63901 -2.95905 -0.119342 2.81054 6.90686
(27.36099 31.04095 33.880658 36.81054 40.90686)
For d=6 and p=108:
-5.19024 -2.7711 -0.973979 0.734434 2.66716 5.53372
(12.80976 15.2289 17.026021 18.734434 20.66716 23.53372)
I hope that you may evenly distribute your data.
It's almost guaranteed that all of George Foreman's sons or some similar situation will fight against your hash function. And proper contingent planning is the work of all good programmers.
After some more research and trial-and-error I think I can provide something part way to to an answer.
To start off, ln and log seem to refer to log base-e if you look into the maths behind the theory. But as mmr indicated, for the O(...) estimates, it doesn't matter.
max-load can be defined for any probability you like. The typical formula used is
1-1/n**c
Most papers on the topic use
1-1/n
An example might be easiest.
Say you have a hash table of 1000 slots and you want to hash 1000 things. Say you also want to know the max-load with a probability of 1-1/1000 or 0.999.
The max-load is the maximum number of hash values that end up being the same - ie. collisions (assuming that your hash function is good).
Using the formula for the probability of getting exactly k identical hash values
Pr[ exactly k ] = ((e/k)**k)/e
then by accumulating the probability of exactly 0..k items until the total equals or exceeds 0.999 tells you that k is the max-load.
eg.
Pr[0] = 0.37
Pr[1] = 0.37
Pr[2] = 0.18
Pr[3] = 0.061
Pr[4] = 0.015
Pr[5] = 0.003 // here, the cumulative total is 0.999
Pr[6] = 0.0005
Pr[7] = 0.00007
So, in this case, the max-load is 5.
So if my hash function is working well on my set of data then I should expect the maxmium number of identical hash values (or collisions) to be 5.
If it isn't then this could be due to the following reasons:
Your data has small values (like short strings) that hash to the same value. Any hash of a single ASCII character will pick 1 of 128 hash values (there are ways around this. For example you could use multiple hash functions, but slows down hashing and I don't know much about this).
Your hash function doesn't work well with your data - try it with random data.
Your hash function doesn't work well.
The other tests I mentioned in my question also are helpful to see that your hash function is running as expected.
Incidentally, my hash function worked nicely - except on short (1..4 character) strings.
I also implemented a simple split-table version which places the hash value into the least used slot from a choice of 2 locations. This more than halves the number of collisions and means that adding and searching the hash table is a little slower.
I hope this helps.