Markov chains with Redis

For self-education purposes, I want to implement a Markov chain generator using as much Redis, and as little application-level logic, as possible.
Let's say I want to build a word generator based on a frequency table with history depth N (say, 2).
As a not very interesting example, for a dictionary of two words, bar and baz, the frequency table is as follows ("." is the terminator, numbers are weights):
. . -> b x2
. b -> a x2
b a -> r x1
b a -> z x1
a r -> . x1
a z -> . x1
When I generate the word, I start with a history of two terminators, . .
There is only one possible outcome for the first two letters, b a.
Third letter may be either r or z, with equal probabilities, since their weights are equal.
Fourth letter is always a terminator.
(Things would be more interesting with longer words in dictionary.)
Anyway, how to do this with Redis elegantly?
Redis sets have SRANDMEMBER, but do not have weights.
Redis sorted sets have weights, but do not have random member retrieval.
Redis lists allow weights to be represented as duplicate entries, but how would I do set intersections with them?
Looks like application code is doomed to do some data processing...

You can accomplish a weighted random selection with a redis sorted set, by assigning each member a score between zero and one, according to the cumulative probability of the members of the set considered thus far, including the current member.
The ordering you use is irrelevant; you may choose any order which is convenient for you. The random selection is then accomplished by generating a random floating point number r uniformly distributed between zero and one, and calling
ZRANGEBYSCORE zset r 1 LIMIT 0 1,
which will return the first element with a score greater than or equal to r.
A little bit of reasoning should convince you that the probability of choosing a member is thus weighted correctly.
Unfortunately, the fact that the scores assigned to the elements need to be proportional to the cumulative probability would seem to make it difficult to use the sorted set union or intersection operations in a way which preserves the significance of the scores for random selection of elements. That part would seem to require some significant application logic.
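Here is a minimal sketch of that idea with redis-py, applied to the bar/baz table from the question: one sorted set per two-letter history, with cumulative probabilities as scores. The key naming scheme, the helper names, and the client setup are my own assumptions, not a prescribed design.

import random
import redis

r = redis.Redis()

def load_transition(history, weighted_successors):
    # store the successors of a history as a sorted set whose scores are
    # cumulative probabilities in (0, 1]
    total = sum(w for _, w in weighted_successors)
    acc, mapping = 0.0, {}
    for i, (symbol, weight) in enumerate(weighted_successors):
        acc += weight / total
        # pin the last score to exactly 1.0 to avoid float round-off
        mapping[symbol] = 1.0 if i == len(weighted_successors) - 1 else acc
    r.zadd("markov:" + history, mapping)

def next_symbol(history):
    u = random.random()
    # ZRANGEBYSCORE key u 1 LIMIT 0 1: first member with score >= u
    hit = r.zrangebyscore("markov:" + history, u, 1, start=0, num=1)
    return hit[0].decode()

# the frequency table from the question
load_transition("..", [("b", 2)])
load_transition(".b", [("a", 2)])
load_transition("ba", [("r", 1), ("z", 1)])
load_transition("ar", [(".", 1)])
load_transition("az", [(".", 1)])

def generate_word():
    history, out = "..", ""
    while True:
        s = next_symbol(history)
        if s == ".":
            return out
        out += s
        history = history[1] + s

print(generate_word())  # "bar" or "baz" with equal probability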

How is the Gini-Index minimized in CART Algorithm for Decision Trees?

For neural networks for example I minimize the cost function by using the backpropagation algorithm. Is there something equivalent for the Gini Index in decision trees?
The CART algorithm always states "choose the partition of set A that minimizes the Gini index", but how do I actually get that partition mathematically?
Any input on this would be helpful :)
For a decision tree, there are different methods for splitting continuous variables like age, weight, income, etc.
A) Discretize the continuous variable so it can be used as a categorical variable in all aspects of the DT algorithm. This can be done:
- only once at the start, then keeping this discretization static, or
- at every stage where a split is required, using percentiles, interval ranges, or clustering to bucketize the variable.
B) Split at all possible distinct values of the variable and see where there is the highest decrease in the Gini index. This can be computationally expensive, so there are optimized variants where you sort the values and, instead of trying all distinct values, choose the midpoints between consecutive values as the candidate splits. For example, if the variable 'weight' takes the values 70, 80, 90 and 100 kg in the data points, try 75, 85 and 95 as splits and pick the best one (highest decrease in Gini or another impurity measure).
As for the exact split algorithm implemented in scikit-learn in Python, rpart in R, and MLlib in PySpark, and how they differ when splitting a continuous variable, I am not sure myself and am still researching it.
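The midpoint search in (B) is straightforward to sketch directly. Here is a small, self-contained illustration; the two-class setup, the function names, and the toy data are assumptions for the example, not what any particular library implements.

import numpy as np

def gini(labels):
    # Gini impurity of an array of 0/1 class labels
    if len(labels) == 0:
        return 0.0
    p = np.bincount(labels, minlength=2) / len(labels)
    return 1.0 - np.sum(p ** 2)

def best_split(values, labels):
    # try midpoints between consecutive distinct values of a continuous
    # feature and return the threshold with the largest decrease in Gini
    parent = gini(labels)
    uniq = np.unique(values)                 # sorted distinct values
    best = (None, 0.0)                       # (threshold, Gini decrease)
    for v1, v2 in zip(uniq[:-1], uniq[1:]):
        t = (v1 + v2) / 2.0
        left, right = labels[values <= t], labels[values > t]
        child = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if parent - child > best[1]:
            best = (t, parent - child)
    return best

weights = np.array([70, 80, 90, 100, 70, 90])
classes = np.array([0, 0, 1, 1, 0, 1])
print(best_split(weights, classes))          # threshold 85.0, Gini decrease 0.5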
Here is a good example of the CART algorithm. Basically, we get the Gini index like this:
For each attribute we have different values, each of which has a Gini index according to the classes its records belong to. For example, if we had two classes (positive and negative), each value of an attribute will have some records that belong to the positive class and some that belong to the negative class, so we can calculate the probabilities. Say an attribute is called weather and it has two values (e.g. rainy and sunny), and we have this information:
rainy: 2 positive, 3 negative
sunny: 1 positive, 2 negative
we could say:
Gini(rainy) = 1 - (2/5)^2 - (3/5)^2 = 0.48
Gini(sunny) = 1 - (1/3)^2 - (2/3)^2 ≈ 0.444
Then we can take the weighted sum of the Gini indexes for weather (assuming we had a total of 8 records):
Gini(weather) = (5/8) * 0.48 + (3/8) * 0.444 ≈ 0.467
We do this for all the other attributes (as we did for weather), and at the end we choose the attribute with the lowest Gini index as the one to split the tree on. We have to do all of this at each split (unless we can classify the sub-tree without needing to split it).
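To make the arithmetic above concrete, here is a tiny sketch that reproduces those numbers; the dictionary layout and the names are illustrative assumptions.

def gini(counts):
    # counts: number of records in each class for one attribute value
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# weather example: (positive, negative) record counts per value
values = {"rainy": (2, 3), "sunny": (1, 2)}
n_records = sum(sum(c) for c in values.values())   # 8 records in total

weighted = sum(sum(c) / n_records * gini(c) for c in values.values())
print(gini((2, 3)))   # 0.48
print(gini((1, 2)))   # ~0.444
print(weighted)       # ~0.467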

Effective number of samples (n_eff)

In Gelman's book, the effective number of samples is defined in terms of the following:
R hat
the between-sequence and within-sequence variances of the MCMC draws, B and W
the number of MCMC samples, denoted by n
the number of chains, denoted by m
I do not know how sampling() calculates the between-sequence variance in the case chains = 1, so I cannot calculate these terms (B, W, m). I want to implement an algorithm from the paper https://arxiv.org/abs/1804.06788.
Roughly speaking, this paper constructs a test statistic which is uniformly distributed under the null hypothesis that the MCMC sampling is correct. If the sampling is not correct, the histogram of the test statistic becomes skewed, and this deviation from uniformity tells us that the MCMC contains bias. I want to implement this, but it requires calculating the above quantities.
In rstan, is there a function to extract these quantities? I think that, in the process of calculating the R-hat statistic, the quantities B, W and m are retained somewhere in the stanfit S4 object.
I am sorry, I have found n_eff, but I do not know the choice of m in the case chains = 1.
In the case that only one chain is estimated (which should not be happening anyway), then m = 2 because the post-warmup draws from the single chain are split into the first half and the second half. This splitting method is discussed in the documentation.
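If it helps, the split-chain quantities are easy to recompute yourself from the raw draws. Below is a minimal numpy sketch following the split-R-hat formulas in Gelman et al.; the function name and array layout are my own assumptions, not an rstan API.

import numpy as np

def split_chain_quantities(draws):
    # draws: array of shape (n_iterations, n_chains) of post-warmup draws for
    # one scalar parameter; each chain is split in half, so the number of
    # sequences is m = 2 * n_chains (hence m = 2 when chains = 1)
    n_iter, n_chains = draws.shape
    half = n_iter // 2
    seqs = np.hstack([draws[:half, :], draws[half:2 * half, :]])
    n, m = seqs.shape

    chain_means = seqs.mean(axis=0)
    chain_vars = seqs.var(axis=0, ddof=1)

    B = n * chain_means.var(ddof=1)        # between-sequence variance
    W = chain_vars.mean()                  # within-sequence variance
    var_plus = (n - 1) / n * W + B / n     # marginal posterior variance estimate
    rhat = np.sqrt(var_plus / W)
    return B, W, m, rhat

# e.g. a single chain of 1000 draws
draws = np.random.randn(1000, 1)
print(split_chain_quantities(draws))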

Non-empty buckets in LSH

I'm reading this survey about LSH, in particular citing the last paragraph of section 2.2.1:
To improve the recall, L hash tables are constructed, and the items lying in the L (L', L' < L) hash buckets h_1(q), ..., h_L(q) are retrieved as near items of q for randomized R-near neighbor search (or randomized c-approximate R-near neighbor search).
To guarantee the precision, each of the L hash codes, y_i, needs to be a long code, which means that the total number of buckets is too large to index directly. Thus, only the non-empty buckets are retained, by resorting to conventional hashing of the hash codes h_l(x).
I have 3 questions:
The last sentence (in bold in the original) is not clear to me: what does it mean by "resorting to conventional hashing of the hash codes h_l(x)"?
Still about that sentence, I'm not sure I have understood the problem: I do understand that h_l(x) can be a long code, so the number of possible buckets can be huge. For example, if h_l(x) is a binary code and length is h_l(x)'s length, then we have in total L*2^length possible buckets (since we use L hash tables)... is that correct?
Last question: once we find which bucket the query vector q belongs to, do we have to use the original vector q and the original distance metric in order to find the nearest neighbor? For example, suppose that the original vector q has 128 dimensions, q=[1,0,12,...,14.3]^T, and that our application uses the Euclidean distance. Now suppose that the hashing function used in LSH (with L=1 for simplicity) maps this vector to a 20-dimensional binary code y=[0100...11]^T in order to decide which bucket to assign q to. So y is the index of a bucket B, which already contains 100 vectors. Now, in order to find the nearest neighbor, we have to compare q with all the other 100 128-dimensional vectors using the Euclidean distance. Is this correct?
The approach they use to improve recall constructs more hash tables and essentially stores multiple copies of the ID of each reference item, hence the space cost is larger [4]. If there are a lot of empty buckets, which increases the retrieval cost, a double-hash scheme or a fast search algorithm in the Hamming space can be used to retrieve the hash buckets quickly. I think in this case they are using a double hash function to retrieve the non-empty buckets; a small sketch of that idea follows after the references.
Number of buckets/memory cells [1][2][3] -> O(nL)
References:
[1] http://simsearch.yury.name/russir/03nncourse-hand.pdf
[2] http://joyceho.github.io/cs584_s16/slides/lsh-12.pdf
[3] https://users.soe.ucsc.edu/~niejiazhong/slides/kumar.pdf
[4] http://research.microsoft.com/en-us/um/people/jingdw/Pubs%5CLTHSurvey.pdf
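Here is the sketch mentioned above of how I read the "conventional hashing" remark, which also touches your third question about re-ranking: the long binary code is never used as an array index; it simply keys an ordinary hash table (a Python dict here), so only non-empty buckets consume memory, and the candidates found in a bucket are re-ranked with the original vectors and the original Euclidean distance. The random-hyperplane hash, L = 1, and all the names below are illustrative assumptions, not the survey's notation.

import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
d, code_len = 128, 20                           # original and code dimensions
planes = rng.standard_normal((code_len, d))     # random-hyperplane LSH, L = 1

def hash_code(x):
    # 20-bit sign code, returned as a string so it can key a dict
    return "".join("1" if v > 0 else "0" for v in planes @ x)

# "conventional hashing": the dict stores only the non-empty buckets,
# instead of allocating all 2^20 possible buckets up front
data = rng.standard_normal((1000, d))
buckets = defaultdict(list)
for i, x in enumerate(data):
    buckets[hash_code(x)].append(i)

def query(q):
    candidates = buckets.get(hash_code(q), [])
    # re-rank candidates with the original vectors and Euclidean distance
    return min(candidates,
               key=lambda i: np.linalg.norm(data[i] - q),
               default=None)

q = data[42] + 0.01 * rng.standard_normal(d)
print(query(q))   # most likely 42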

Generating Matched Pairs for Statistical Analysis

In my study, a person is represented as a pair of real numbers (x, y), where x is in [30, 80] and y is in [60, 120]. There are two types of people, A and B, and I have ~300 of each type. How can I generate the largest (or even a large) set of pairs of one person from A with one from B, ((xA, yA), (xB, yB)), such that each pair of points is close? Two points are close if abs(x1-x2) < dX and abs(y1-y2) < dY. Similar constraints are acceptable. (That is, this constraint is roughly a Manhattan metric, but Euclidean etc. is ok too.) Not all points need be used, but no point can be reused.
You're looking for the Hungarian Algorithm.
Suggested formulation: A are rows, B are columns, and each cell contains a distance metric between Ai and Bj, e.g. abs(X(Ai)-X(Bj)) + abs(Y(Ai)-Y(Bj)). (You can normalize the X and Y values to [0,1] if you want distances to be proportional to the range of each variable.)
Then use the Hungarian Algorithm to minimize matching weight.
You can filter out matches with distances over your threshold. If you're worried that this filtering might cause the approach to be sub-optimal, you could set distances over your threshold to a very high number.
There are many implementations of this algorithm. A short search found one in every conceivable language, including VBA for Excel, and some online solvers (not sure about matching a 300x300 matrix with them, though).
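For example, if you work in Python, scipy ships an implementation as linear_sum_assignment. A rough sketch along the lines above, in which the thresholds dX, dY and the random data are placeholders:

import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(1)
# ~300 people of each type, as (x, y) pairs in the stated ranges
A = np.column_stack([rng.uniform(30, 80, 300), rng.uniform(60, 120, 300)])
B = np.column_stack([rng.uniform(30, 80, 300), rng.uniform(60, 120, 300)])
dX, dY = 2.0, 4.0                                # closeness thresholds (placeholders)

# Manhattan-style cost matrix between every A[i] and B[j]
dx = np.abs(A[:, None, 0] - B[None, :, 0])
dy = np.abs(A[:, None, 1] - B[None, :, 1])
cost = dx + dy
violates = (dx > dX) | (dy > dY)
cost[violates] = 1e9                             # effectively forbid far pairs

rows, cols = linear_sum_assignment(cost)         # minimum-weight matching
pairs = [(i, j) for i, j in zip(rows, cols) if not violates[i, j]]
print(len(pairs), "matched pairs")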
Hungarian algorithm did it, thanks Etov.
Source code available here: http://www.filedropper.com/stackoverflow1

How to make a start on the "crackless wall" problem

Here's the problem statement:
Consider the problem of building a wall out of 2x1 and 3x1 bricks (horizontal×vertical dimensions) such that, for extra strength, the gaps between horizontally-adjacent bricks never line up in consecutive layers, i.e. never form a "running crack".
There are eight ways of forming a crack-free 9x3 wall, written W(9,3) = 8.
Calculate W(32,10). < Generalize it to W(x,y) >
http://www.careercup.com/question?id=67814&form=comments
The above link gives a few solutions, but I'm unable to understand the logic behind them. I'm trying to code this in Perl, and this is what I have so far:
input: W(x,y)
find all possible (i, j) such that x == 3*i + 2*j
for each pair (i, j):
    n = C(i+j, j)   # C: combinations
Adding all these n's should give the count of all possible row combinations. But I have no idea how to find the actual combinations for one row, or how to proceed further.
Based on the claim that W(9,3)=8, I'm inferring that a "running crack" means any continuous vertical crack of height two or more. Before addressing the two-dimensional problem as posed, I want to discuss an analogous one-dimensional problem and its solution. I hope this will make it more clear how the two-dimensional problem is thought of as one-dimensional and eventually solved.
Suppose you want to count the number of lists of length, say, 40, whose symbols come from a reasonably small set of, say, the five symbols {a,b,c,d,e}. Certainly there are 5^40 such lists. If we add the constraint that no letter can appear twice in a row, the mathematical solution is still easy: there are 5*4^39 lists without repeated characters. If, however, we instead wish to outlaw consonant combinations such as bc, cb, bd, etc., then things are more difficult. Of course we would like to count the number of ways to choose the first character, the second, etc., and multiply, but the number of ways to choose the second character depends on the choice of the first, and so on. This new problem is difficult enough to illustrate the right technique (though not difficult enough to make it completely resistant to mathematical methods!).
To solve the problem of lists of length 40 without consonant combinations (let's call this count f(40)), we might imagine using recursion. Can you calculate f(40) in terms of f(39)? No, because some of the lists of length 39 end with consonants and some end with vowels, and we don't know how many of each type we have. So instead of computing f(n) for each length n <= 40, we compute, for each n and for each character k, f(n,k), the number of lists of length n ending with k. Although f(40) cannot be calculated from f(39) alone, f(40,a) can be calculated in terms of f(39,a), f(39,b), etc.
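As a concrete, purely illustrative version of that recursion (the particular set of "consonants" {b, c, d} is an assumption just to make it runnable):

def count_lists(n, symbols="abcde", consonants=set("bcd")):
    # f[k] = number of valid lists of the current length ending with symbol k,
    # where "valid" means no two consonants are adjacent
    def allowed(prev, cur):
        return not (prev in consonants and cur in consonants)

    f = {k: 1 for k in symbols}                  # lists of length 1
    for _ in range(n - 1):
        f = {k: sum(v for p, v in f.items() if allowed(p, k))
             for k in symbols}
    return sum(f.values())

print(count_lists(2))    # 16 = 25 total pairs minus 9 forbidden consonant pairs
print(count_lists(40))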
The strategy described above can be used to solve your two-dimensional problem. Instead of characters, you have entire horizontal brick-rows of length 32 (or x). Instead of 40, you have 10 (or y). Instead of a no-consonant-combinations constraint, you have the no-adjacent-cracks constraint.
You specifically ask how to enumerate all the brick-rows of a given length, and you're right that this is necessary, at least for this approach. First, decide how a row will be represented. Clearly it suffices to specify the locations of the 3-bricks, and since each has a well-defined center, it seems natural to give a list of locations of the centers of the 3-bricks. For example, with a wall length of 15, the sequence (1,8,11) would describe a row like this: (ooo|oo|oo|ooo|ooo|oo). This list must satisfy some natural constraints:
The initial and final positions cannot be the centers of a 3-brick. Above, 0 and 14 are invalid entries.
Consecutive differences between numbers in the sequence must be odd, and at least three.
The position of the first entry must be odd.
The difference between the last entry and the last position of the wall (that is, one less than the wall length) must also be odd.
There are various ways to compute and store all such lists, but the conceptually easiest is a recursion on the length of the wall, ignoring condition 4 until you're done. Generate a table of all lists for walls of length 2, 3, and 4 manually, then for each n, deduce a table of all lists describing walls of length n from the previous values. Impose condition 4 when you're finished, because it doesn't play nice with recursion.
You'll also need a way, given any brick-row S, to quickly describe all brick-rows S' which can legally lie beneath it. For simplicity, let's assume the length of the wall is 32. A little thought should convince you that
S' must satisfy the same constraints as S, above.
1 is in S' if and only if 1 is not in S.
30 is in S' if and only if 30 is not in S.
For each entry q in S, S' must have a corresponding entry q+1 or q-1, and conversely every element of S' must be q-1 or q+1 for some element q in S.
For example, the list (1,8,11) can legally be placed on top of (7,10,30), (7,12,30), or (9,12,30), but not (9,10,30) since this doesn't satisfy the "at least three" condition. Based on this description, it's not hard to write a loop which calculates the possible successors of a given row.
Now we put everything together:
First, for fixed x, make a table of all legal rows of length x. Next, write a function W(y,S), which is to calculate (recursively) the number of walls of width x, height y, and top row S. For y=1, W(y,S)=1. Otherwise, W(y,S) is the sum over all S' which can be related to S as above, of the values W(y-1,S').
This solution is efficient enough to solve the problem W(32,10), but would fail for large x. For example, W(100,10) would almost certainly be infeasible to calculate as I've described. If x were large but y were small, we would break all sensible brick-laying conventions and consider the wall as being built up from left-to-right instead of bottom-to-top. This would require a description of a valid column of the wall. For example, a column description could be a list whose length is the height of the wall and whose entries are among five symbols, representing "first square of a 2x1 brick", "second square of a 2x1 brick", "first square of a 3x1 brick", etc. Of course there would be constraints on each column description and constraints describing the relationship between consecutive columns, but the same approach as above would work this way as well, and would be more appropriate for long, short walls.
I found this Python code online here, and it works quickly and correctly. I do not understand how it all works, though. I got my C++ code to the last step (counting the total number of solutions) but could not get it to work correctly.
def brickwall(w, h):
    # generate all single brick layers of width w (by recursion)
    def gen_layers(w):
        if w in (0, 1, 2, 3):
            return {0: [], 1: [], 2: [[2]], 3: [[3]]}[w]
        return [(layer + [2]) for layer in gen_layers(w - 2)] + \
               [(layer + [3]) for layer in gen_layers(w - 3)]

    # precompute info about whether pairs of layers are compatible
    def gen_conflict_mat(layers, nlayers, w):
        # precompute internal brick positions for easy comparison
        def get_internal_positions(layer, w):
            acc = 0; intpos = set()
            for brick in layer:
                acc += brick; intpos.add(acc)
            intpos.remove(w)
            return intpos
        intpos = [get_internal_positions(layer, w) for layer in layers]
        # mat[i] lists the layers j whose cracks don't line up with layer i
        mat = []
        for i in range(nlayers):
            mat.append([j for j in range(nlayers)
                        if intpos[i].isdisjoint(intpos[j])])
        return mat

    layers = gen_layers(w)
    nlayers = len(layers)
    mat = gen_conflict_mat(layers, nlayers, w)
    # dynamic programming to compute wall counts row by row:
    # nwalls[j] = number of walls of the current height whose top row is layer j
    nwalls = nlayers * [1]
    for i in range(1, h):
        nwalls = [sum(nwalls[k] for k in mat[j]) for j in range(nlayers)]
    return sum(nwalls)
print(brickwall(9,3)) #8
print(brickwall(9,4)) #10
print(brickwall(18,5)) #7958
print(brickwall(32,10)) #806844323190414