Best way to get joint probability matrix from categorical data - numpy

My goal is to build a joint probability matrix (counts are used here as an example) from data samples. I can already get the expected result, but I'm wondering how to optimize it. Here is my implementation:
import numpy as np

def Fill2DCountTable(arraysList):
    '''
    :param arraysList: List of arrays, length=2
        each array is of shape (k, sampleSize),
        k == 1 (or None. numpy will align it) if it's single variable
        else k for a set of variables of size k
    :return: xyJointCounts, xMarginalCounts, yMarginalCounts
    '''
    jointUniques, jointCounts = np.unique(np.vstack(arraysList), axis=1, return_counts=True)
    _, xReverseIndexs = np.unique(jointUniques[[0]], axis=1, return_inverse=True) ###HIGHLIGHT###
    _, yReverseIndexs = np.unique(jointUniques[[1]], axis=1, return_inverse=True)
    xyJointCounts = np.zeros((xReverseIndexs.max() + 1, yReverseIndexs.max() + 1), dtype=np.int32)
    xyJointCounts[tuple(np.vstack([xReverseIndexs, yReverseIndexs]))] = jointCounts
    xMarginalCounts = np.sum(xyJointCounts, axis=1) ###HIGHLIGHT###
    yMarginalCounts = np.sum(xyJointCounts, axis=0)
    return xyJointCounts, xMarginalCounts, yMarginalCounts
def Fill3DCountTable(arraysList):
    # :param arraysList: List of arrays, length=3
    jointUniques, jointCounts = np.unique(np.vstack(arraysList), axis=1, return_counts=True)
    _, xReverseIndexs = np.unique(jointUniques[[0]], axis=1, return_inverse=True)
    _, yReverseIndexs = np.unique(jointUniques[[1]], axis=1, return_inverse=True)
    _, SReverseIndexs = np.unique(jointUniques[2:], axis=1, return_inverse=True)
    SxyJointCounts = np.zeros((SReverseIndexs.max() + 1, xReverseIndexs.max() + 1, yReverseIndexs.max() + 1), dtype=np.int32)
    SxyJointCounts[tuple(np.vstack([SReverseIndexs, xReverseIndexs, yReverseIndexs]))] = jointCounts
    SMarginalCounts = np.sum(SxyJointCounts, axis=(1, 2))
    SxJointCounts = np.sum(SxyJointCounts, axis=2)
    SyJointCounts = np.sum(SxyJointCounts, axis=1)
    return SxyJointCounts, SMarginalCounts, SxJointCounts, SyJointCounts
My use case is running conditional independence tests over variables. The sample size is usually quite large (~10k) and each variable's cardinality is relatively small (~10). I still find the speed unsatisfying.
How can I best optimize this code, or even the logic around it? I have a few thoughts:
The ###HIGHLIGHT### lines. For a single X I may calculate (X;Y1), (Y2;X), (X;Y3|S1)... many times. What if I cache each variable's (and each conditioning set's) {uniqueValue: reverseIndex} dictionary and its marginal counts, so that marginalCounts can be read directly (no need to sum) and reverseIndexs can be obtained by lookup (no need to call unique)?
How can I use matrix parallelization to run CI tests in batch, i.e. calculate (X;Y|S1), (X;Y|S2), (X;Y|S3)... simultaneously?
Will torch be faster than numpy on the same CPU? Or on GPU?
It's an open question. Thank you for any ideas :)
================== A test example is as follows ==================
xs = np.array( [2, 4, 2, 3, 3, 1, 3, 1, 2, 1] )
ys = np.array( [5, 5, 5, 4, 4, 4, 4, 4, 6, 5] )
Ss = np.array([ [1, 0, 0, 0, 1, 0, 0, 0, 1, 1],
[1, 1, 1, 0, 1, 0, 1, 0, 1, 0] ])
xyJointCounts, xMarginalCounts, yMarginalCounts = Fill2DCountTable([xs, ys])
SxyJointCounts, SMarginalCounts, SxJointCounts, SyJointCounts = Fill3DCountTable([xs, ys, Ss])
The 2D result for (X;Y) is xMarginalCounts=[3 3 3 1], yMarginalCounts=[5 4 1], and xyJointCounts (axis labels added for reference):
xy| 4 5 6
--|-------
1 | 2 1 0
2 | 0 2 1
3 | 3 0 0
4 | 0 1 0
The 3D result for (X;Y|{Z1,Z2}): SxyJointCounts has shape 4x4x3, where the first 4 is the cardinality of {Z1,Z2} (00, 01, 10, 11, with respective SMarginalCounts=[3 3 1 3]). SxJointCounts has shape 4x4 and SyJointCounts has shape 4x3.
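One possible direction for the caching idea, as a minimal sketch rather than a tuned implementation: encode each variable to integer codes once, then build any joint table from cached codes with np.ravel_multi_index and np.bincount, so np.unique is not called again per pair. The helper names encode and joint_counts are made up for this illustration. Note one assumption that differs from the original code: the conditioning set's cardinality is taken as the product of the per-variable cardinalities, so unobserved combinations appear as all-zero slices.

```python
import numpy as np

def encode(values):
    """Return (unique values, integer codes) for one 1-D categorical variable."""
    uniques, codes = np.unique(values, return_inverse=True)
    return uniques, codes.reshape(-1)

def joint_counts(codes_list, cards):
    """Count table of shape `cards` built from pre-computed integer codes."""
    flat = np.ravel_multi_index(tuple(codes_list), cards)
    return np.bincount(flat, minlength=int(np.prod(cards))).reshape(cards)

xs = np.array([2, 4, 2, 3, 3, 1, 3, 1, 2, 1])
ys = np.array([5, 5, 5, 4, 4, 4, 4, 4, 6, 5])
Ss = np.array([[1, 0, 0, 0, 1, 0, 0, 0, 1, 1],
               [1, 1, 1, 0, 1, 0, 1, 0, 1, 0]])

# Encode every variable once; these results are what would be cached.
x_vals, x_codes = encode(xs)
y_vals, y_codes = encode(ys)
z_encoded = [encode(row) for row in Ss]

# 2D table for (X;Y).
xy = joint_counts([x_codes, y_codes], (len(x_vals), len(y_vals)))

# Combine the conditioning variables into a single code, then build the 3D table.
s_cards = tuple(len(vals) for vals, _ in z_encoded)
s_codes = np.ravel_multi_index(tuple(codes for _, codes in z_encoded), s_cards)
s_card = int(np.prod(s_cards))
sxy = joint_counts([s_codes, x_codes, y_codes], (s_card, len(x_vals), len(y_vals)))

print(xy)
print(sxy.sum(axis=(1, 2)))  # marginal counts of the conditioning set
```

With the codes cached per variable, each additional (X;Y|S) table costs one ravel_multi_index and one bincount over the samples; whether that is fast enough for a given workload would need to be measured.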

Related

How can I efficiently mask out certain pairs in (2, N) tensor?

I have a torch tensor edge_index of shape (2, N) that represents edges in a graph. For each (x, y) there is also a (y, x), where x and y are node IDs (ints). During the forward pass of my model I need to mask out certain edges. So, for example, I have:
n1 = [0, 3, 4] # list of node ids as x
n2 = [1, 2, 1] # list of node ids as y
edge_index = [[1, 2, 0, 1, 3, 4, 2, 3, 1, 4, 2, 4], # actual edges as (x, y) and (y, x)
[2, 1, 1, 0, 4, 3, 3, 2, 4, 1, 4, 2]]
# do something that efficiently removes (x, y) and (y, x) edges as formed by n1 and n2
Final edge_index should look like:
>>> edge_index
[[1, 2, 3, 4, 2, 4],
[2, 1, 4, 3, 4, 2]]
Preferably we need to efficiently make some kind of boolean mask that I can apply to edge index e.g. as edge_index[:, mask] or something like that.
Could also be done in numpy but I'd like to avoid converting back and forth.
Edit #1:
If that can't be done, I can think of an alternative: instead of n1 and n2, I have access to the indices of the positions I need to exclude in one tensor, e.g. _except=[2, 3, 6, 7, 8, 9] (by building a dict/index once at the beginning).
Is there a way to get the desired result by "telling" edge_index to drop the indices in _except? edge_index[:, _except] gives me the ones I want to get rid of. I need the complement of that operation.
Edit #2:
I managed to do it like this:
mask = torch.ones(edge_index.shape[1], dtype=torch.bool)
for i in range(len(n1)):
    mask = mask & ~(torch.tensor([n1[i], n2[i]], dtype=torch.long) == edge_index.T).all(dim=1) & ~(torch.tensor([n2[i], n1[i]], dtype=torch.long) == edge_index.T).all(dim=1)
edge_index[:, mask]
but it is too slow and I can't use it. How can I speed it up?
Edit #3: I managed to solve the Edit #1 variant efficiently with:
mask = torch.ones(edge_index.shape[1], dtype=torch.bool)
mask[_except] = False
edge_index[:, mask]
Still interested in solving the original problem if someone comes up with something...
If you're OK with the approach you suggested in Edit #1, you can get the complement with:
edge_index[:, [i for i in range(edge_index.shape[1]) if not (i in _except)]]
Hope this is fast enough for your requirement.
Edit 1:
from functools import reduce
ids = torch.stack([torch.tensor(n1), torch.tensor(n2)], dim=1)
ids = torch.cat([ids, ids[:, [1,0]]], dim=0)
res = edge_index.unsqueeze(0).repeat(ids.shape[0], 1, 1) == ids.unsqueeze(2).repeat(1, 1, edge_index.shape[1])
mask = ~reduce(lambda x, y: x | (reduce(lambda p, q: p & q, y)), res, reduce(lambda p, q: p & q, res[0]))
edge_index[:, mask]
Edit 2:
ids = torch.stack([torch.tensor(n1), torch.tensor(n2)], dim=1)
ids = torch.cat([ids, ids[:, [1,0]]], dim=0)
res = edge_index.unsqueeze(0).repeat(ids.shape[0], 1, 1) == ids.unsqueeze(2).repeat(1, 1, edge_index.shape[1])
mask = ~(res.sum(1) // 2).sum(0).bool()
edge_index[:, mask]
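Another way to approach the original problem, sketched under the assumption that node ids are non-negative integers and that torch.isin is available (PyTorch 1.10 or later): map every (x, y) pair to a single integer key, build the keys for the pairs to drop in both directions, and test membership in one call. The names base, edge_keys and banned_keys are only illustrative.

```python
import torch

edge_index = torch.tensor([[1, 2, 0, 1, 3, 4, 2, 3, 1, 4, 2, 4],
                           [2, 1, 1, 0, 4, 3, 3, 2, 4, 1, 4, 2]])
n1 = torch.tensor([0, 3, 4])
n2 = torch.tensor([1, 2, 1])

# Any base larger than every node id makes x * base + y unique per (x, y) pair.
base = int(torch.max(torch.cat([edge_index.flatten(), n1, n2]))) + 1

edge_keys = edge_index[0] * base + edge_index[1]
# Keys of the pairs to remove, in both directions: (x, y) and (y, x).
banned_keys = torch.cat([n1 * base + n2, n2 * base + n1])

# True for edges that should be kept.
mask = ~torch.isin(edge_keys, banned_keys)
filtered = edge_index[:, mask]
print(filtered)
# tensor([[1, 2, 3, 4, 2, 4],
#         [2, 1, 4, 3, 4, 2]])
```

On older PyTorch versions the membership test could be replaced by a broadcast comparison such as (edge_keys[:, None] == banned_keys[None, :]).any(dim=1), at the cost of an N-by-M intermediate tensor.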

How to implement a variable array of ones and zeros in TensorFlow

I'm totally new to TensorFlow, and I just want to implement a kind of selection function using matrix multiplication.
example below:
#input:
I = [[9.6, 4.1, 3.2]]
#selection:(single "1" value , and the other are "0s")
s = tf.transpose(tf.Variable([[a, b, c]]))
e.g. s could be [[0, 1, 0]] or [[0, 0, 1]] or [[1, 0, 0]]
#result:(multiplication)
o = tf.matmul(I, s)
Sorry for the poor explanation.
I want to find the 'solution' among distribution functions with different means and sigmas (values range from 0 to 1).
So now I have three variables: i, j, and index.
value1 = np.exp(-((index - m1[i]) ** 2.) / s1[i]** 2.)
value2 = np.exp(-((index - m2[j]) ** 2.) / s2[j]** 2.)
m1 = [1, 3, 5]; s1 = [0.2, 0.4, 0.5]  # first graph
m2 = [3, 5, 7]; s2 = [0.5, 0.5, 1.0]  # second graph
I want to maximize the total value,
e.g. value1 + value2 = 1 + 1 = 2, where one of the solutions is i = 2, j = 1, index = 5.
Or should I do this with another module?
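If doing this outside TensorFlow is acceptable, a brute-force search in NumPy is one option. The sketch below assumes that index only needs to be searched over a small grid of candidate values (the candidates array is an assumption, not part of the question) and evaluates value1 + value2 for every (i, j, index) combination at once.

```python
import numpy as np

m1 = np.array([1.0, 3.0, 5.0]); s1 = np.array([0.2, 0.4, 0.5])  # first graph
m2 = np.array([3.0, 5.0, 7.0]); s2 = np.array([0.5, 0.5, 1.0])  # second graph

# Assumed search grid for `index`; widen or refine it as needed.
candidates = np.arange(0.0, 8.0, 1.0)

# value1[i, n] and value2[j, n] for every candidate n.
value1 = np.exp(-((candidates[None, :] - m1[:, None]) ** 2) / s1[:, None] ** 2)
value2 = np.exp(-((candidates[None, :] - m2[:, None]) ** 2) / s2[:, None] ** 2)

# total[i, j, n] = value1[i, n] + value2[j, n]
total = value1[:, None, :] + value2[None, :, :]

# argmax returns the first maximum in flattened order; ties are possible.
i, j, n = np.unravel_index(np.argmax(total), total.shape)
print(int(i), int(j), float(candidates[n]), float(total[i, j, n]))
```

Because several combinations can reach the same maximum (here both index = 3 and index = 5 give a total of 2), np.argmax reports only the first one it meets; np.argwhere(total == total.max()) would list all of them.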

Tensorflow, i-th element min-max clamping

Given a tensor of rank 1, e.g. p = [x y z w], how can I "min-max clamp" within the provided boundaries max = [1 10 5 3] and min = [-1 -10 -5 -3], such that the i-th element of p is always within the boundaries defined by min[i] and max[i]?
Extra: Would it be possible to do this for ranks > 1?
I found the following solution adequate. See the documentation for tf.minimum and tf.maximum. Solution:
import tensorflow as tf
p = tf.Variable([-1, 1, 3, 7])
clamp_min = tf.Variable([1, 1, 1, 1])
clamp_max = tf.Variable([5, 5, 5, 5])
p = tf.minimum(p, clamp_max)
p = tf.maximum(p, clamp_min)
sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)
print(sess.run(p))
Produces:
[1 1 3 5]
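For the per-element bounds from the question, tf.clip_by_value also accepts tensor bounds that broadcast against the input, which covers the rank > 1 case too. A minimal sketch, assuming TensorFlow 2.x with eager execution (so no session is needed); the input values are made up for illustration.

```python
import tensorflow as tf

clamp_min = tf.constant([-1.0, -10.0, -5.0, -3.0])
clamp_max = tf.constant([1.0, 10.0, 5.0, 3.0])

# Rank 1: the i-th element is clamped to [clamp_min[i], clamp_max[i]].
p = tf.constant([0.5, -12.0, 7.0, -3.5])
clamped = tf.clip_by_value(p, clamp_min, clamp_max)

# Rank 2: the same bounds broadcast across every row.
batch = tf.constant([[0.5, -12.0, 7.0, -3.5],
                     [2.0, 3.0, -6.0, 0.0]])
clamped_batch = tf.clip_by_value(batch, clamp_min, clamp_max)

print(clamped.numpy())        # [ 0.5 -10.   5.  -3. ]
print(clamped_batch.numpy())
```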

How to sort a multi-dimensional tensor using the returned indices of tf.nn.top_k?

I have two multi-dimensional tensors a and b, and I want to sort both of them by the values of a.
I found that tf.nn.top_k can sort a tensor and return the indices that were used to sort the input. How can I use the indices returned by tf.nn.top_k(a, k=2) to sort b?
For example,
import tensorflow as tf
a = tf.reshape(tf.range(30), (2, 5, 3))
b = tf.reshape(tf.range(210), (2, 5, 3, 7))
k = 2
sorted_a, indices = tf.nn.top_k(a, k)
# How to sort b into
# sorted_b[0, 0, 0, :] = b[0, 0, indices[0, 0, 0], :]
# sorted_b[0, 0, 1, :] = b[0, 0, indices[0, 0, 1], :]
# sorted_b[0, 1, 0, :] = b[0, 1, indices[0, 1, 0], :]
# ...
Update
Combining tf.gather_nd with tf.meshgrid can be one solution. For example, the following code is tested on python 3.5 with tensorflow 1.0.0-rc0:
a = tf.reshape(tf.range(30), (2, 5, 3))
b = tf.reshape(tf.range(210), (2, 5, 3, 7))
k = 2
sorted_a, indices = tf.nn.top_k(a, k)
shape_a = tf.shape(a)
auxiliary_indices = tf.meshgrid(*[tf.range(d) for d in (tf.unstack(shape_a[:(a.get_shape().ndims - 1)]) + [k])], indexing='ij')
sorted_b = tf.gather_nd(b, tf.stack(auxiliary_indices[:-1] + [indices], axis=-1))
However, I wonder if there is a solution which is more readable and doesn't need to create auxiliary_indices above.
Your code has a problem:
b = tf.reshape(tf.range(60), (2, 5, 3, 7))
TensorFlow cannot reshape a tensor with 60 elements to shape [2, 5, 3, 7] (210 elements).
Also, you can't sort a rank-4 tensor (b) using the indices of a rank-3 tensor.
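On the readability question, one alternative to building auxiliary_indices is tf.gather with batch_dims, which treats the leading dimensions of b and indices as shared batch dimensions. This is a sketch assuming a TensorFlow version where tf.gather supports batch_dims (2.x, run eagerly here); it is not the code from the question's update.

```python
import tensorflow as tf

a = tf.reshape(tf.range(30), (2, 5, 3))
b = tf.reshape(tf.range(210), (2, 5, 3, 7))
k = 2

# indices has shape (2, 5, k) and indexes into axis 2 of a (and of b).
sorted_a, indices = tf.math.top_k(a, k=k)

# The first two dimensions are batch dimensions shared by b and indices,
# so sorted_b[i, j, l, :] == b[i, j, indices[i, j, l], :].
sorted_b = tf.gather(b, indices, axis=2, batch_dims=2)

print(sorted_b.shape)  # (2, 5, 2, 7)
```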

How to find an index of the first matching element in TensorFlow

I am looking for a TensorFlow way of implementing something similar to Python's list.index() function.
Given a matrix and a value to find, I want to know the first occurrence of the value in each row of the matrix.
For example,
m is a <batch_size, 100> matrix of integers
val = 23
result = [0] * batch_size
for i, row_elems in enumerate(m):
    result[i] = row_elems.index(val)
I cannot assume that 'val' appears only once in each row, otherwise I would have implemented it using tf.argmax(m == val). In my case, it is important to get the index of the first occurrence of 'val' and not any.
It seems that tf.argmax works like np.argmax (according to the test), which will return the first index when there are multiple occurrences of the max value.
You can use tf.argmax(tf.cast(tf.equal(m, val), tf.int32), axis=1) to get what you want. However, currently the behavior of tf.argmax is undefined in case of multiple occurrences of the max value.
If you are worried about undefined behavior, you can take the minimum over the return value of tf.where, as @Igor Tsvetkov suggested.
For example,
# test with tensorflow r1.0
import tensorflow as tf
val = 3
m = tf.placeholder(tf.int32)
m_feed = [[0  , 0  , val, 0  , val],
          [val, 0  , val, val, 0  ],
          [0  , val, 0  , 0  , 0  ]]
tmp_indices = tf.where(tf.equal(m, val))
result = tf.segment_min(tmp_indices[:, 1], tmp_indices[:, 0])
with tf.Session() as sess:
    print(sess.run(result, feed_dict={m: m_feed})) # [2, 0, 1]
Note that tf.segment_min will raise InvalidArgumentError when some row does not contain val. In your code, row_elems.index(val) will also raise an exception when row_elems doesn't contain val.
Looks a little ugly but works (assuming m and val are both tensors; tf.unpack and tf.pack were renamed to tf.unstack and tf.stack in TensorFlow 1.0):
idx = list()
for t in tf.unstack(m, axis=0):
    idx.append(tf.reduce_min(tf.where(tf.equal(t, val))))
idx = tf.stack(idx, axis=0)
EDIT:
As Yaroslav Bulatov mentioned, you could achieve the same result with tf.map_fn:
def index1d(t):
    return tf.reduce_min(tf.where(tf.equal(t, val)))
idx = tf.map_fn(index1d, m, dtype=tf.int64)
Here is another solution to the problem, assuming there is a hit on every row.
import tensorflow as tf
val = 3
m = tf.constant([
    [0  , 0  , val, 0  , val],
    [val, 0  , val, val, 0  ],
    [0  , val, 0  , 0  , 0  ]])
# replace all entries in the matrix either with its column index, or out-of-index-number
match_indices = tf.where(                          # [[5, 5, 2, 5, 4],
    tf.equal(val, m),                              #  [0, 5, 2, 3, 5],
    x=tf.range(tf.shape(m)[1]) * tf.ones_like(m),  #  [5, 1, 5, 5, 5]]
    y=(tf.shape(m)[1])*tf.ones_like(m))
result = tf.reduce_min(match_indices, axis=1)
with tf.Session() as sess:
    print(sess.run(result)) # [2, 0, 1]
Here is a solution that also handles the case where the element does not appear in a row (taken from a DeepMind GitHub repository):
def get_first_occurrence_indices(sequence, eos_idx):
    '''
    args:
        sequence: [batch, length]
        eos_idx: scalar
    '''
    batch_size, maxlen = sequence.get_shape().as_list()
    eos_idx = tf.convert_to_tensor(eos_idx)
    tensor = tf.concat(
        [sequence, tf.tile(eos_idx[None, None], [batch_size, 1])], axis = -1)
    index_all_occurrences = tf.where(tf.equal(tensor, eos_idx))
    index_all_occurrences = tf.cast(index_all_occurrences, tf.int32)
    index_first_occurrences = tf.segment_min(index_all_occurrences[:, 1],
                                             index_all_occurrences[:, 0])
    index_first_occurrences.set_shape([batch_size])
    index_first_occurrences = tf.minimum(index_first_occurrences + 1, maxlen)
    return index_first_occurrences
And:
import tensorflow as tf
mat = tf.Variable([[1,2,3,4,5], [2,3,4,5,6], [3,4,5,6,7], [0,0,0,0,0]], dtype = tf.int32)
idx = 3
first_occurrences = get_first_occurrence_indices(mat, idx)
sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())
sess.run(first_occurrences) # [3, 2, 1, 5]