Generate list of random numbers with condition - numpy [duplicate]

This question already has answers here:
Is there an efficient way to generate N random integers in a range that have a given sum or average?
I would like to generate a list of 15 integers with sum 12, where the minimum value is 0 and the maximum is 6.
I tried the following code:
import numpy as np

def generate(low, high, total, entity):
    while sum(entity) != total:
        entity = np.random.randint(low, high, size=15)
    return entity
But the above function is not working well: it takes far too much time.
What is an efficient way to generate such numbers?

The above will, strictly speaking, work. But for 15 numbers between 0 and 6, the odds that they sum to 12 are not that high. In fact, we can calculate the number of possibilities with:
F(s, 1) = 1   for 0 ≤ s ≤ 6, and
F(s, n) = Σ_{i=0..6} F(s-i, n-1).
We can calculate that value with:
from functools import lru_cache

@lru_cache()
def f(s, n, mn, mx):
    if n < 1:
        return 0
    if n == 1:
        return int(mn <= s <= mx)
    else:
        if s < mn:
            return 0
        return sum(f(s-i, n-1, mn, mx) for i in range(mn, mx+1))
That means that there are 9'483'280 possibilities, out of 4'747'561'509'943 total possibilities, to generate a sum of 12, or 0.00019975%. The rejection loop will thus take approximately 500'624 iterations to come up with such a solution.
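As a quick sanity check (my addition, reusing the memoized f defined above), both counts can be reproduced directly:

total = 7 ** 15               # all sequences of 15 values in 0..6
valid = f(12, 15, 0, 6)       # sequences that sum to 12
print(total)                  # 4747561509943
print(valid)                  # 9483280
print(valid / total)          # ~1.9975e-06, i.e. ~0.00019975%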
We should therefore aim for a more direct way to generate such a sequence. We can do that by calculating, at each step, the probability of generating a given number: the probability of generating i as the first number of a sequence of n numbers that sums up to s is F(s-i, n-1, 0, 6)/F(s, n, 0, 6). This guarantees that we sample uniformly over the set of valid sequences; if we instead simply drew each number uniformly, the result would not follow a uniform distribution over the sequences that satisfy the given condition.
We can do that recursively with:
from numpy.random import choice

def sumseq(n, s, mn, mx):
    if n > 1:
        den = f(s, n, mn, mx)
        val, = choice(
            range(mn, mx+1),
            1,
            p=[f(s-i, n-1, mn, mx)/den for i in range(mn, mx+1)]
        )
        yield val
        yield from sumseq(n-1, s-val, mn, mx)
    elif n > 0:
        yield s
With the above function, we can generate numpy arrays:
>>> np.array(list(sumseq(15, 12, 0, 6)))
array([0, 0, 0, 0, 0, 4, 0, 3, 0, 1, 0, 0, 1, 2, 1])
>>> np.array(list(sumseq(15, 12, 0, 6)))
array([0, 0, 1, 0, 0, 1, 4, 1, 0, 0, 2, 1, 0, 0, 2])
>>> np.array(list(sumseq(15, 12, 0, 6)))
array([0, 1, 0, 0, 2, 0, 3, 1, 3, 0, 1, 0, 0, 0, 1])
>>> np.array(list(sumseq(15, 12, 0, 6)))
array([5, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1])
>>> np.array(list(sumseq(15, 12, 0, 6)))
array([0, 0, 0, 0, 4, 2, 3, 0, 0, 0, 0, 0, 3, 0, 0])
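A quick empirical check (my addition, not part of the original answer; assumes numpy is imported as np) that the generator always respects the constraints:

# every generated sequence should have length 15, sum 12, and values in 0..6
samples = [np.array(list(sumseq(15, 12, 0, 6))) for _ in range(1000)]
print(all(len(s) == 15 and s.sum() == 12 and s.min() >= 0 and s.max() <= 6
          for s in samples))   # expected: True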

You could try implementing it a little bit differently.
import random

def generate(low, high, goal_sum, size=15):
    output = []
    for i in range(size):
        new_int = random.randint(low, high)
        if sum(output) + new_int <= goal_sum:
            output.append(new_int)
        else:
            output.append(0)
    random.shuffle(output)
    return output
Also, if you use np.random.randint, the upper bound is exclusive, so your effective maximum will actually be high-1.
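For example, a short sketch of what that means in practice (the inclusive alternative via the newer Generator API is my addition):

import numpy as np

np.random.randint(0, 6, size=15)                                 # values in 0..5 only
np.random.randint(0, 7, size=15)                                 # values in 0..6
np.random.default_rng().integers(0, 6, size=15, endpoint=True)   # also values in 0..6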

Well, there is a simple and natural solution: use a distribution which by definition provides an array of values with a fixed sum. The simplest one is the multinomial distribution. The only code to add is a check that rejects (and resamples) whenever some sampled value is above the maximum.
Along the lines of:
import numpy as np

def sample_sum_interval(n, p, maxv):
    while True:
        q = np.random.multinomial(n, p)
        v = np.where(q > maxv)
        if len(v[0]) == 0:  # no sampled value exceeds maxv, accept
            return q        # otherwise reject and sample again
    return None  # unreachable

np.random.seed(32345)

k = 15
n = 12
maxv = 6
p = np.full((k), np.float64(1.0)/np.float64(k), dtype=np.float64)  # equal probabilities

q = sample_sum_interval(n, p, maxv)
print(q)
print(np.sum(q))
q = sample_sum_interval(n, p, maxv)
print(q)
print(np.sum(q))
q = sample_sum_interval(n, p, maxv)
print(q)
print(np.sum(q))
UPDATE
I quickly looked at the method proposed by @WillemVanOnsem, and I believe it is different from the multinomial approach used here.
If we look at the multinomial PMF and assume equal probabilities for all k numbers, p_1 = ... = p_k = 1/k, then we can write the PMF as

PMF(x_1, ..., x_k) = n!/(x_1! ... x_k!) * p_1^x_1 ... p_k^x_k
                   = n!/(x_1! ... x_k!) * k^(-x_1) ... k^(-x_k)
                   = n!/(x_1! ... x_k!) * k^(-Σ_i x_i)
                   = n!/(x_1! ... x_k!) * k^(-n)
Obviously, the probabilities of particular x_1...x_k combinations will differ from each other due to the factorials in the denominator (modulo permutations, of course), which is different from @WillemVanOnsem's approach, where I believe all valid combinations have equal probability of appearing.
Moral of the story: those methods produce different distributions.
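To make the non-uniformity concrete, here is a small illustration (my own addition, not from the original post): two outcomes with the same sum n = 12 over k = 15 slots get very different multinomial probabilities because of the factorials in the denominator.

from math import factorial

def multinomial_pmf(xs, k):
    # n!/(x1! * ... * xk!) * k**(-n), assuming equal probabilities 1/k
    n = sum(xs)
    denom = 1
    for x in xs:
        denom *= factorial(x)
    return factorial(n) / denom * k ** (-n)

k = 15
a = [6, 6] + [0] * 13     # sum 12, concentrated in two slots
b = [1] * 12 + [0] * 3    # sum 12, spread over twelve slots
print(multinomial_pmf(a, k))   # ~7.1e-12
print(multinomial_pmf(b, k))   # ~3.7e-06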

Related

How can I efficiently mask out certain pairs in (2, N) tensor?

I have a torch tensor edge_index of shape (2, N) that represents edges in a graph. For each (x, y) there is also a (y, x), where x and y are node IDs (ints). During the forward pass of my model I need to mask out certain edges. So, for example, I have:
n1 = [0, 3, 4]  # list of node ids as x
n2 = [1, 2, 1]  # list of node ids as y
edge_index = [[1, 2, 0, 1, 3, 4, 2, 3, 1, 4, 2, 4],  # actual edges as (x, y) and (y, x)
              [2, 1, 1, 0, 4, 3, 3, 2, 4, 1, 4, 2]]
# do something that efficiently removes (x, y) and (y, x) edges as formed by n1 and n2
Final edge_index should look like:
>>> edge_index
[[1, 2, 3, 4, 2, 4],
 [2, 1, 4, 3, 4, 2]]
Preferably, we'd efficiently build some kind of boolean mask that I can apply to edge_index, e.g. as edge_index[:, mask] or something like that.
Could also be done in numpy but I'd like to avoid converting back and forth.
Edit #1:
If that can't be done, then I can think of a way so that, instead of n1 and n2, I have access to the indices of the positions I need to exclude in one tensor e.g. _except=[2, 3, 6, 7, 8, 9] (by making a dict/index once in the beginning).
Is there a way to get the desired result by "telling" edge_index to drop the indices in _except? edge_index[:, _except] gives me the ones I want to get rid of; I need its complement operation.
Edit #2:
I managed to do it like this:
mask = torch.ones(edge_index.shape[1], dtype=torch.bool)
for i in range(len(n1)):
    mask = mask & ~(torch.tensor([n1[i], n2[i]], dtype=torch.long) == edge_index.T).all(dim=1) \
                & ~(torch.tensor([n2[i], n1[i]], dtype=torch.long) == edge_index.T).all(dim=1)
edge_index[:, mask]
but it is too slow and I can't use it. How can I speed it up?
Edit #3: I managed to solve Edit #1 efficiently with:
mask = torch.ones(edge_index.shape[1], dtype=torch.bool)
mask[_except] = False
edge_index[:, mask]
Still interested in solving the original problem if someone comes up with something...
If you're OK with the way you suggested in Edit #1, you can get the complement result with:
edge_index[:, [i for i in range(edge_index.shape[1]) if i not in _except]]
Hope this is fast enough for your requirements.
Edit 1:
from functools import reduce

ids = torch.stack([torch.tensor(n1), torch.tensor(n2)], dim=1)
ids = torch.cat([ids, ids[:, [1, 0]]], dim=0)    # add the reversed (y, x) pairs
# 6 = number of forbidden pairs (2 * len(n1)), 12 = number of edges
res = edge_index.unsqueeze(0).repeat(6, 1, 1) == ids.unsqueeze(2).repeat(1, 1, 12)
mask = ~reduce(lambda x, y: x | (reduce(lambda p, q: p & q, y)), res, reduce(lambda p, q: p & q, res[0]))
edge_index[:, mask]
Edit 2:
ids = torch.stack([torch.tensor(n1), torch.tensor(n2)], dim=1)
ids = torch.cat([ids, ids[:, [1, 0]]], dim=0)
res = edge_index.unsqueeze(0).repeat(6, 1, 1) == ids.unsqueeze(2).repeat(1, 1, 12)
# an edge is banned when both of its rows match some pair, i.e. its column sums to 2
mask = ~(res.sum(1) // 2).sum(0).bool()
edge_index[:, mask]
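For completeness, here is a broadcasting-based sketch (my own variant, not part of the answer above) that avoids the hard-coded 6 and 12 by comparing every edge against every forbidden (x, y) / (y, x) pair at once:

import torch

n1 = torch.tensor([0, 3, 4])
n2 = torch.tensor([1, 2, 1])
edge_index = torch.tensor([[1, 2, 0, 1, 3, 4, 2, 3, 1, 4, 2, 4],
                           [2, 1, 1, 0, 4, 3, 3, 2, 4, 1, 4, 2]])

# forbidden pairs in both directions, shape (2, 2 * len(n1))
pairs = torch.stack([torch.cat([n1, n2]), torch.cat([n2, n1])])
# (2, E, 1) == (2, 1, P) -> (2, E, P); an edge is banned if both of its rows match some pair
banned = (edge_index.unsqueeze(2) == pairs.unsqueeze(1)).all(dim=0).any(dim=1)
print(edge_index[:, ~banned])
# tensor([[1, 2, 3, 4, 2, 4],
#         [2, 1, 4, 3, 4, 2]])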

Count how many times numbers repeat in a list, number by number

Consider the first number, say m. See how many times this number is repeated consecutively. If it is repeated k times in a row, it gives rise to two entries in the output list: first the number k, then the number m. (This is similar to how we say "four 2s" when we see [2,2,2,2].) Then we move on to the next number after this run of m. Repeat the process until every number in the list has been considered.
The process is perhaps best understood by looking at a few examples:
• readAloud([]) should return []
• readAloud([1,1,1]) should return [3,1]
• readAloud([-1,2,7]) should return [1,-1,1,2,1,7]
• readAloud([3,3,8,-10,-10,-10]) should return [2,3,1,8,3,-10]
• readAloud([3,3,1,1,3,1,1]) should return [2,3,2,1,1,3,2,1]
I have the following code:
from typing import List

def readAloud(lst: List[int]) -> List[int]:
    answer: List[int] = []
    l = len(lst)
    d = 1
    for i in range(l-1):
        if (lst[i] == lst[i]):
            d = d + 1
            answer.append(d)
            answer.append(lst[i])
        if (lst[i-1] != lst[i]):
            d = 1
            answer.append(d)
            answer.append(lst[i])
    return answer
Grouping adjacent elements is exactly what itertools.groupby is for.
from itertools import chain, groupby

def read_aloud(numbers):
    r = ((sum(1 for _ in v), k) for k, v in groupby(numbers))
    return list(chain.from_iterable(r))
Examples:
>>> read_aloud([])
[]
>>> read_aloud([1, 1, 1])
[3, 1]
>>> read_aloud([3, 3, 8, -10, -10, -10])
[2, 3, 1, 8, 3, -10]
>>> read_aloud([3, 3, 1, 1, 3, 1, 1])
[2, 3, 2, 1, 1, 3, 2, 1]
Here is a solution (but it is not the only one :) ):
def readAloud(lst):
    # note: assumes lst is non-empty
    answer = []
    count = 1
    prev_elt = lst[0]
    for m in lst[1:] + [None]:  # append None so the final run gets flushed
        if prev_elt == m:
            count += 1
        else:
            answer.extend([count, prev_elt])
            prev_elt = m
            count = 1
    return answer
print(readAloud([3,3,1,1,3,1,1]))
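Since the rest of this page is numpy-flavoured, a run-length-based sketch with numpy is also possible (my addition, not one of the original answers; assumes a 1-D input):

import numpy as np

def read_aloud_np(numbers):
    a = np.asarray(numbers)
    if a.size == 0:
        return []
    # a run starts at index 0 and wherever the value changes
    starts = np.flatnonzero(np.r_[True, a[1:] != a[:-1]])
    lengths = np.diff(np.r_[starts, a.size])
    return [int(x) for pair in zip(lengths, a[starts]) for x in pair]

print(read_aloud_np([3, 3, 1, 1, 3, 1, 1]))   # [2, 3, 2, 1, 1, 3, 2, 1]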

Best way to get joint probability matrix from categorical data

My goal is to get a joint probability matrix (here we use counts, for example) from data samples. I can already get the expected result, but I'm wondering how to optimize it. Here is my implementation:
import numpy as np

def Fill2DCountTable(arraysList):
    '''
    :param arraysList: List of arrays, length=2
        each array is of shape (k, sampleSize),
        k == 1 (or None, numpy will align it) if it's a single variable,
        else k for a set of variables of size k
    :return: xyJointCounts, xMarginalCounts, yMarginalCounts
    '''
    jointUniques, jointCounts = np.unique(np.vstack(arraysList), axis=1, return_counts=True)
    _, xReverseIndexs = np.unique(jointUniques[[0]], axis=1, return_inverse=True)  ###HIGHLIGHT###
    _, yReverseIndexs = np.unique(jointUniques[[1]], axis=1, return_inverse=True)
    xyJointCounts = np.zeros((xReverseIndexs.max() + 1, yReverseIndexs.max() + 1), dtype=np.int32)
    xyJointCounts[tuple(np.vstack([xReverseIndexs, yReverseIndexs]))] = jointCounts
    xMarginalCounts = np.sum(xyJointCounts, axis=1)  ###HIGHLIGHT###
    yMarginalCounts = np.sum(xyJointCounts, axis=0)
    return xyJointCounts, xMarginalCounts, yMarginalCounts

def Fill3DCountTable(arraysList):
    # :param arraysList: List of arrays, length=3
    jointUniques, jointCounts = np.unique(np.vstack(arraysList), axis=1, return_counts=True)
    _, xReverseIndexs = np.unique(jointUniques[[0]], axis=1, return_inverse=True)
    _, yReverseIndexs = np.unique(jointUniques[[1]], axis=1, return_inverse=True)
    _, SReverseIndexs = np.unique(jointUniques[2:], axis=1, return_inverse=True)
    SxyJointCounts = np.zeros((SReverseIndexs.max() + 1, xReverseIndexs.max() + 1, yReverseIndexs.max() + 1), dtype=np.int32)
    SxyJointCounts[tuple(np.vstack([SReverseIndexs, xReverseIndexs, yReverseIndexs]))] = jointCounts
    SMarginalCounts = np.sum(SxyJointCounts, axis=(1, 2))
    SxJointCounts = np.sum(SxyJointCounts, axis=2)
    SyJointCounts = np.sum(SxyJointCounts, axis=1)
    return SxyJointCounts, SMarginalCounts, SxJointCounts, SyJointCounts
My use case is conditional independence testing over variables. SampleSize is usually quite big (~10k) while each variable's categorical cardinality is relatively small (~10). I still find the speed unsatisfying.
How can I best optimize this code, or even the logic outside the code? Some thoughts:
• The ###HIGHLIGHT### lines: on a single X I may calculate (X;Y1), (Y2;X), (X;Y3|S1)... many times, so what if I cache each variable's (and each conditioning set's) {uniqueValue: reverseIndex} dictionary and its marginal counts, and then fetch marginalCounts directly (no need to sum) and map samples to reverseIndexs (no need to unique)? (A rough sketch of this idea follows the test example below.)
• How can I use matrix parallelization to do the CI tests in batch, i.e. calculate (X;Y|S1), (X;Y|S2), (X;Y|S3)... simultaneously?
• Will torch be faster than numpy on the same CPU? Or on a GPU?
It's an open question. Thank you for any possible ideas. Big thanks for your help :)
================== A test example is as follows ==================
xs = np.array([2, 4, 2, 3, 3, 1, 3, 1, 2, 1])
ys = np.array([5, 5, 5, 4, 4, 4, 4, 4, 6, 5])
Ss = np.array([[1, 0, 0, 0, 1, 0, 0, 0, 1, 1],
               [1, 1, 1, 0, 1, 0, 1, 0, 1, 0]])

xyJointCounts, xMarginalCounts, yMarginalCounts = Fill2DCountTable([xs, ys])
SxyJointCounts, SMarginalCounts, SxJointCounts, SyJointCounts = Fill3DCountTable([xs, ys, Ss])
From the 2D case (X;Y) we get xMarginalCounts=[3 3 3 1], yMarginalCounts=[5 4 1], and xyJointCounts (axis names added FYI):
xy| 4 5 6
--|-------
1 | 2 1 0
2 | 0 2 1
3 | 3 0 0
4 | 0 1 0
From the 3D case (X;Y|{Z1,Z2}) we get SxyJointCounts of shape 4x4x3, where the first 4 is the cardinality of {Z1,Z2} (00, 01, 10, 11, with respective SMarginalCounts=[3 3 1 3]). SxJointCounts is of shape 4x4 and SyJointCounts is of shape 4x3.
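As a rough sketch of the caching idea from the first bullet above (my own illustration, not part of the question): factorize each variable once with np.unique(..., return_inverse=True), cache the inverse indices, and then build any 2D joint table with a single np.bincount instead of re-running np.unique on the stacked samples:

import numpy as np

xs = np.array([2, 4, 2, 3, 3, 1, 3, 1, 2, 1])
ys = np.array([5, 5, 5, 4, 4, 4, 4, 4, 6, 5])

# factorize once per variable and cache these
x_uniques, x_idx = np.unique(xs, return_inverse=True)
y_uniques, y_idx = np.unique(ys, return_inverse=True)

kx, ky = len(x_uniques), len(y_uniques)
xyJointCounts = np.bincount(x_idx * ky + y_idx, minlength=kx * ky).reshape(kx, ky)
xMarginalCounts = xyJointCounts.sum(axis=1)
yMarginalCounts = xyJointCounts.sum(axis=0)
print(xyJointCounts)
# [[2 1 0]
#  [0 2 1]
#  [3 0 0]
#  [0 1 0]]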

numpy: Cleanly retrieve coordinates (indices) for highest k values - along a specific axis - in ndarray

I would like to be able to:
select k highest values along (or across?) the first dimension
find indices for those k values
assign those values to a new ndarray of equal shape at their respective positions.
I'm wondering if there is a quicker way to achieve the result exemplified below. In particular, I would like to avoid making the batch indices "manually".
Here's my solution:
import numpy as np

k = 2  # number of top values kept per batch element (matches the output below)

# Create unordered array (instrumental to the example)
arr = np.arange(24).reshape(2, 3, 4)
arr_1 = arr[0, ::2].copy()
arr_2 = arr[1, 1::].copy()
arr[0, ::2] = arr_2[:, ::-1]
arr[1, 1:] = arr_1[:, ::-1]
# reshape array to: (batch_size, H*W)
arr_batched = arr.reshape(arr.shape[0], -1)
# find indices for the k greatest values along all but the 1st dimension.
gr_ind = np.argpartition(arr_batched, -k)[:, -k:]
# flatten and unravel indices.
maxk_ind_flat = gr_ind.flatten()
maxk_ind_shape = np.unravel_index(maxk_ind_flat, arr.shape)
# maxk_ind_shape prints: (array([0, 0, 0, 0]), array([2, 2, 0, 0]), array([1, 0, 2, 3]))
# note: unraveling indices obtained by partitioning an array of shape (2, n) does not
# take the first dimension into account (hence the [0, 0, 0, 0])
# Craft batch indices...
batch_indices = np.repeat(np.arange(arr.shape[0]), k)
# ...and join
maxk_indices = tuple([batch_indices] + [ind for ind in maxk_ind_shape[1:]])
# The result is used to re-assign the k highest values for each batch element to a destination matrix:
arr2 = np.zeros_like(arr)
arr2[maxk_indices] = arr[maxk_indices]
# arr2 prints:
# array([[[ 0,  0,  0,  0],
#         [ 0,  0,  0,  0],
#         [23, 22,  0,  0]],
#
#        [[ 0,  0, 14, 15],
#         [ 0,  0,  0,  0],
#         [ 0,  0,  0,  0]]])
Any help would be appreciated.
One way would be to use np.[put/take]_along_axis:
gr_ind = np.argpartition(arr_batched, -k, axis=-1)[:, -k:]
arr_2 = np.zeros_like(arr)
np.put_along_axis(arr_2.reshape(arr_batched.shape), gr_ind,
                  np.take_along_axis(arr_batched, gr_ind, -1), -1)
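A small self-contained check of this approach (my addition), rebuilding the same arr as in the question and using k = 2:

import numpy as np

k = 2
arr = np.arange(24).reshape(2, 3, 4)
arr_1 = arr[0, ::2].copy()
arr_2 = arr[1, 1:].copy()
arr[0, ::2] = arr_2[:, ::-1]
arr[1, 1:] = arr_1[:, ::-1]
arr_batched = arr.reshape(arr.shape[0], -1)

gr_ind = np.argpartition(arr_batched, -k, axis=-1)[:, -k:]
out = np.zeros_like(arr)
np.put_along_axis(out.reshape(arr_batched.shape), gr_ind,
                  np.take_along_axis(arr_batched, gr_ind, -1), -1)
print(out)   # batch 0 keeps 23 and 22, batch 1 keeps 15 and 14; everything else is 0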

Distance between non-negative elements of two vectors

I have two vectors:
v1 = [1, 3, 2, 0, 0, 0, 6]
v2 = [2, 0, 1, 0, 4, 2, 1]
I need to compute a distance that is the sum of the absolute differences between the elements that are positive at the same position in both vectors. For the example above, that is:
D(v1, v2) = D(v2, v1) = Abs(1-2) + Abs(2-1) + Abs(6-1) = 7
How can I implement this in numpy?
Here is a solution I found with numpy:
v1 = np.array(v1)
v2 = np.array(v2)
mask = (v1 > 0) & (v2 > 0)
sum(abs(v1[mask] - v2[mask]))
Hope this helps
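As a quick sanity check on the example vectors (my addition):

import numpy as np

v1 = np.array([1, 3, 2, 0, 0, 0, 6])
v2 = np.array([2, 0, 1, 0, 4, 2, 1])
mask = (v1 > 0) & (v2 > 0)
print(sum(abs(v1[mask] - v2[mask])))   # 7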