Algorithm on how to Sample from a Multinomial Distribution - numpy

If we are given a multinomial distribution
p=[0.2,0.4,0.1,0.3]
and we have to sample from this distribution over a number of times and return the result, how do I write the algorithm for this?
Eg - if I have a fair die and I want to roll it 20 time and get the total number of times that it landed on which side,
[4, 1, 7, 5, 2, 1]
this should be the result(randomized) -> It landed 4 times on 1, once on 2, etc.
There is a function to do this in Numpy
in numpy we can use
numpy.random.multinomial()
>>> np.random.multinomial(20, [1/6.]*6, size=1)
array([[4, 1, 7, 5, 2, 1]]) # random
I want to understand how the algorithm is written for performing this action
I've tried this approach in python ->
import numpy as np
import random
probs = [0.2, 0.4, 0.1, 0.3]
def sample(count:int)->list:
output = [0,0,0,0]
for i in range(count):
num = random.random()
if(0 < num <= 0.15):
output[2]+=1
elif(0.15 < num <= 0.25):
output[0]+=1
elif(0.25 < num <= 0.35):
output[3]+=1
elif(0.35 < num <= 0.45):
output[1]+=1
return output
final_output = sample(10)
print(final_output)
np.random.multinomial(10, probs, size=1)
But I don't think this is the optimal way, maybe I'm lacking some concepts in Probability?
The actual Code written in Numpy in CPython:
Link to the Numpy file where the code for numpy.random.multinomial() is written starting from line 4176
Possible Duplicate:
How to sample from a multinomial distribution?
References:
https://numpy.org/doc/stable/reference/random/generated/numpy.random.multinomial.html
Random number generation from multinomial distribution in R using rmultinom() function

If you care about sampling from this distribution multiple times, then it is worth looking at the Aliasing method - https://en.wikipedia.org/wiki/Alias_method#:~:text=In%20computing%2C%20the%20alias%20method,arbitrary%20probability%20distribution%20pi.
You can sample in $O(1)$ time after an initial computation of $O(K\log K)$ where $K$ is the size of the support of the distribution.

Related

How to perform McNemar (or Chi-square) test on multi-dimensional data?

Question:
Assume for this part that the total count for every sample is 5000 (i.e., sum of column = 5000).
Imagine there was a row (gene G) in this dataset for which the count is expected to be 1 in 10% of samples and 0 in the remaining 90% of samples. We are doing an experiment where we would like to know if the expression of gene G changes in experimental vs control conditions, and we will measure n samples (single cells) from each condition.
Plot the statistical power to detect a 10% increase in the expression of G in experimental vs control at Bonferroni-corrected p < 0.05 as a function of n, assuming that we will be performing a similar test for significance on 1000 genes total. How many samples from each condition do we need to measure to achieve a power of 95%?
First, for gene G, I created a pandas dataframe for control and experimental conditions, where the ratio of 0:1 is 10% and 20%, respectively. I extrapolated the conditions for 1000 genes and then perform cross-tabulation analysis.
import pandas as pd
from statsmodels.stats.contingency_tables import mcnemar
from statsmodels.stats.gof import chisquare_effectsize
from statsmodels.stats.power import GofChisquarePower
n = 5000 # no of records
nog = 1000 # no of genes
gene_list = ["gene_" + str(i) for i in range(0,nog)]
def generate_gene_df(gene, n):
df = pd.DataFrame.from_dict(
{"Gene" : gene,
"Cells": (f'Cell{x}' for x in range(1, n+1)),
"Control": np.random.choice([1,0], p=[0.1, 0.9], size=n),
"Experimental": np.random.choice([1,0], p=[0.1+0.1, 0.9-0.1], size=n)},
orient='columns'
)
df = df.set_index(["Gene","Cells"])
return df
# List of simulated genes
gene_df_list = [generate_gene_df(gene, n) for gene in gene_list]
df = pd.concat(gene_df_list)
df = df.reset_index()
table = pd.crosstab([df["Gene"], df["Cells"]],
[df["Control"], df["Experimental"]]).to_numpy()
Table:
array([[1, 0, 0, 0],
[1, 0, 0, 0],
[0, 0, 1, 0],
...,
[0, 1, 0, 0],
[0, 1, 0, 0],
[0, 1, 0, 0]])
Now, I want to plot the statistical power at Bonferroni-corrected p < 0.05 as a function of n. I also want to find out how many samples from each condition do we need to measure to achieve a power of 95%.
My attempt:
McNemar's test
result = mcnemar(table, exact=True)
print('statistic=%.3f, p-value=%.3f' % (result.statistic, result.pvalue))
alpha=0.05
if result.pvalue > alpha:
print('Same proportions of errors (fail to reject H0)')
else:
print('Different proportions of errors (reject H0)')
Output:
statistic=0.000, p-value=1.000
Same proportions of errors (fail to reject H0)
I calculated the power analysis using:
nobs = 5000
alpha = 0.05
effect_size = chisquare_effectsize(0.5, 0.5*1.1, correction=None, cohen=True, axis=0)
analysis = GofChisquarePower()
power_chisquare = analysis.solve_power(effect_size=effect_size, nobs=nobs, alpha=alpha)
print('Based on Chi Square test, the minimum number of samples required to see an effect of the desired size: %.3f' % power_chisquare)
Based on Chi Square test, the minimum number of samples required to see an effect of the desired size: 0.050
Why does the power curve look atypical? Did I perform the analyses correctly? Is McNemar an appropriate statistical test method in this case?

Numpy Random Choice with Non-regular Array Size

I'm making an array of sums of random choices from a negative binomial distribution (nbd), with each sum being of non-regular length. Right now I implement it as follows:
import numpy
from numpy.random import default_rng
rng = default_rng()
nbd = rng.negative_binomial(1, 0.5, int(1e6))
gmc = [12, 35, 4, 67, 2]
n_pp = np.empty(len(gmc))
for i in range(len(gmc)):
n_pp[i] = np.sum(rng.choice(nbd, gmc[i]))
This works, but when I perform it over my actual data it's very slow (gmc is of dimension 1e6), and I would like to vary this for multiple values of n and p in the nbd (in this example they're set to 1 and 0.5, respectively).
I'd like to work out a pythonic way to do this which eliminates the loop, but I'm not sure it's possible. I want to keep default_rng for the better random generation than the older way of doing it (np.random.choice), if possible.
The distribution of the sum of m samples from the negative binomial distribution with parameters (n, p) is the negative binomial distribution with parameters (m*n, p). So instead of summing random selections from a large, precomputed sample of negative_binomial(1, 0.5), you can generate your result directly with negative_binomial(gmc, 0.5):
In [68]: gmc = [12, 35, 4, 67, 2]
In [69]: npp = rng.negative_binomial(gmc, 0.5)
In [70]: npp
Out[70]: array([ 9, 34, 1, 72, 7])
(The negative_binomial method will broadcast its inputs, so we can pass gmc as an argument to generate all the samples with one call.)
More generally, if you want to vary the n that is used to generate nbd, you would multiply that n by the corresponding element in gmc and pass the product to rng.negative_binomial.

Calculate weighted statistical moments in Python

I've been looking for a function or package that would allow me to calculate the skew and kurtosis of a distribution in a weighted way, as I have histogram data.
For instance I have the data
import numpy as np
np.array([[1, 2],
[2, 5],
[3, 6],
[4,12],
[5, 1])
where the first column [1,2,3,4,5] are the values and the second column [2,5,6,12,1] are the frequencies of the values.
I have found out how to do the first two moments (mean, standard deviation) in a weighted way using the weighted_avg_and_std function specified in this thread, but I was not quite sure how I could extend this to both the skew and kurtosis, or even the nth statistical moment.
I have found the definitions themselves here and could manually write functions to implement this from scratch, but before I go and do that I was wondering if there were any existing packages or functions that might be able to do this.
Thanks
EDIT:
I figured it out, the following code works (please note that this is for population moments)
skewnewss = np.average(((values-average)/np.sqrt(variance))**3, weights=weights)
and
kurtosis=np.average(((values-average)/np.sqrt(variance))**4-3, weights=weights)
I think you have already listed all the ingredients that you need, following the formulas in the link you provided:
import numpy as np
a = np.array([[1,2],[2,5],[3,6],[4,12],[5,1]])
values, weights = a.T
def n_weighted_moment(values, weights, n):
assert n>0 & (values.shape == weights.shape)
w_avg = np.average(values, weights = weights)
w_var = np.sum(weights * (values - w_avg)**2)/np.sum(weights)
if n==1:
return w_avg
elif n==2:
return w_var
else:
w_std = np.sqrt(w_var)
return np.sum(weights * ((values - w_avg)/w_std)**n)/np.sum(weights)
#Same as np.average(((values - w_avg)/w_std)**n, weights=weights)
Which results in:
for n in range(1,5):
print(f'Moment {n} value is {n_weighted_moment(values, weights, n)}')
Moment 1 value is 3.1923076923076925
Moment 2 value is 1.0784023668639053
Moment 3 value is -0.5962505715592139
Moment 4 value is 2.384432138280637
Notice that while you are calculating the excess kurtosis, the formula implemented for a generic n-moment doesn't account for that.
Taken from here
Here is the code
def weighted_mean(var, wts):
"""Calculates the weighted mean"""
return np.average(var, weights=wts)
def weighted_variance(var, wts):
"""Calculates the weighted variance"""
return np.average((var - weighted_mean(var, wts))**2, weights=wts)
def weighted_skew(var, wts):
"""Calculates the weighted skewness"""
return (np.average((var - weighted_mean(var, wts))**3, weights=wts) /
weighted_variance(var, wts)**(1.5))
def weighted_kurtosis(var, wts):
"""Calculates the weighted skewness"""
return (np.average((var - weighted_mean(var, wts))**4, weights=wts) /
weighted_variance(var, wts)**(2))

Count number of unique colours in image [duplicate]

This question already has answers here:
Most dominant color in RGB image - OpenCV / NumPy / Python
(3 answers)
Closed 3 years ago.
I am trying to count the number of unique colours in an image. I have some code that I think should work however when I run it on an image its saying a I have 252 different colours out of a possible 16,777,216‬. That seems wrong given the image is BGR so shouldn't their be much more different colours (thousands not hundreds?)?
def count_colours(src):
unique, counts = np.unique(src, return_counts=True)
print(counts.size)
return counts.size
src = cv2.imread('../../images/di8.jpg')
src = imutils.resize(src, height=300)
count_colours(src) # outputs 252 different colours!? only?
Is that value correct? And if not how can I fix my function count_colours()?
Source image:
Edit: is this correct?
def count_colours(src):
unique, counts = np.unique(src.reshape(-1, src.shape[-1]), axis=0, return_counts=True)
return counts.size
If you look at the uniques you are getting back, I'm pretty sure you'll find they are scalars.
You need to use the axis keyword:
>>> import numpy as np
>>> from scipy.misc import face
>>>
>>> img = face()
>>> np.unique(img.reshape(-1, img.shape[-1]), axis=0, return_counts=True)
(array([[ 0, 0, 5],
[ 0, 0, 7],
[ 0, 0, 9],
...,
[255, 248, 255],
[255, 249, 255],
[255, 252, 255]], dtype=uint8), array([1, 2, 2, ..., 1, 1, 1]))
The comment by # Edeki Okoh is correct. You need to find a way to take the color channels into account. There is probably a much cleaner solution but a hacky way to do this would be something like this. Each color channels has values from 0 to 255 so we add 1 in order to make sure that it gets multiplied. Blue will represent the last the digits, green the middle three ones and red the first three. Now every value is representing a unique color.
b,g,r = cv2.split(src)
shiftet_im = b + 1000 * (g + 1) + 1000 * 1000 * (r + 1)
The resulting image should have one channel with each value representing a unique color combination.
I think you only counted for a single channel e.g R-value out of full RGB channel. that's why you have only 252 discrete values.
In theory R G B each can have 256 discrete states.
256*256*256 =16777216
means in total you can have 16777216 possibilities of colors.
My suggestion is to convert RGB uchar CV_8UC3 into a single 32bit data structure like CV_32FC1
Let
Given image as input
# my test small sie text image. which I can count the number of the state by hand
import cv2
import numpy as np
image=cv2.imread('/home/usr/naneDownloads/vuQ9y.png' )# change here
b,g,r = cv2.split(image)
out_in_32U_2D = np.int32(b) << 16 + np.int32(g) << 8 + np.int32(r) #bit wise shift 8 for each channel.
out_in_32U_1D= out_in_32U_2D.reshape(-1) #convert to 1D
np.unique(out_in_32U_1D)
array([-2147483648, -2080374784, -1073741824, -1006632960, 0,
14336, 22528, 30720, 58368, 91136,
123904, 237568, 368640, 499712, 966656,
1490944, 2015232, 3932160, 6029312, 8126464,
15990784, 24379392, 32768000, 65011712, 67108864,
98566144, 132120576, 264241152, 398458880, 532676608,
536870912, 805306368, 1073741824, 1140850688, 1342177280,
1610612736, 1879048192], dtype=int32)
len(np.unique(out_in_32U_1D))
37 # correct for my test wirting paper when compare when my manual counting
The code here should be able to provide you with what you needed

how to avoid split and sum of pieces in pytorch or numpy

I want to split a long vector into smaller unequal pieces, do a summation on each piece and gather the results into a new vector.
I need to do this in pytorch but I am also interested to see how this is done with numpy.
This can easily be accomplish by splitting the vector.
sizes = [3, 7, 5, 9]
X = torch.ones(sum(sizes))
Y = torch.tensor([s.sum() for s in torch.split(X, sizes)])
or with np.ones and np.split.
Is there a more efficient way to do this?
Edit:
Inspired by the first comment:
indices = np.cumsum([0]+sizes)[:-1]
Y = np.add.reduceat(X, indices.tolist())
solves it for numpy. I am still looking for a solution with pytorch.
index_add_ is your friend!
# inputs
sizes = torch.tensor([3, 7, 5, 9], dtype=torch.long)
x = torch.ones(sizes.sum())
# prepare an index vector for summation (what elements of x are summed to each element of y)
ind = torch.zeros(sizes.sum(), dtype=torch.long)
ind[torch.cumsum(sizes, dim=0)[:-1]] = 1
ind = torch.cumsum(ind, dim=0)
# prepare the output
y = torch.zeros(len(sizes))
# do the actual summation
y.index_add_(0, ind, x)