If I am given a multinomial distribution, say
p = [0.2, 0.4, 0.1, 0.3]
and I have to sample from it a number of times and return the resulting counts, how do I write the algorithm for this?
E.g. if I have a fair die and I roll it 20 times, I want the number of times it landed on each side:
[4, 1, 7, 5, 2, 1]
This would be one (randomized) result: it landed 4 times on 1, once on 2, and so on.
NumPy has a function for this, numpy.random.multinomial():
>>> np.random.multinomial(20, [1/6.]*6, size=1)
array([[4, 1, 7, 5, 2, 1]]) # random
I want to understand how the algorithm behind this is written.
I've tried this approach in Python:
import numpy as np
import random

probs = [0.2, 0.4, 0.1, 0.3]

def sample(count: int) -> list:
    output = [0, 0, 0, 0]
    # bin edges are the cumulative sums of probs: 0.2, 0.6, 0.7, 1.0
    for _ in range(count):
        num = random.random()
        if num < 0.2:
            output[0] += 1
        elif num < 0.6:
            output[1] += 1
        elif num < 0.7:
            output[2] += 1
        else:
            output[3] += 1
    return output

final_output = sample(10)
print(final_output)
np.random.multinomial(10, probs, size=1)
But I don't think this is the optimal way; maybe I'm missing some probability concepts?
The actual code in the NumPy source:
Link to the NumPy file where the code for numpy.random.multinomial() is written, starting at line 4176
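I haven't reproduced the NumPy source here, but my understanding (an assumption, not a quote of the library) is that it boils down to conditional binomial sampling: draw each category's count from a binomial over the trials that are still unassigned, with that category's probability renormalised by the remaining probability mass. A rough sketch, where multinomial_sample is a made-up name and not NumPy's API:
import numpy as np

def multinomial_sample(n, pvals, rng=None):
    # Conditional-binomial scheme: peel off one category at a time.
    rng = rng or np.random.default_rng()
    counts = np.zeros(len(pvals), dtype=int)
    remaining_n, remaining_p = n, 1.0
    for i, p in enumerate(pvals[:-1]):
        counts[i] = rng.binomial(remaining_n, min(1.0, p / remaining_p))
        remaining_n -= counts[i]
        remaining_p -= p
        if remaining_n == 0:
            break
    counts[-1] = remaining_n  # whatever is left goes to the last category
    return counts

print(multinomial_sample(20, [1/6.] * 6))  # counts for a fair die rolled 20 times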
Possible Duplicate:
How to sample from a multinomial distribution?
References:
https://numpy.org/doc/stable/reference/random/generated/numpy.random.multinomial.html
Random number generation from multinomial distribution in R using rmultinom() function
If you care about sampling from this distribution multiple times, then it is worth looking at the alias method - https://en.wikipedia.org/wiki/Alias_method
You can sample in $O(1)$ time after an initial computation of $O(K\log K)$ where $K$ is the size of the support of the distribution.
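For illustration, here is a minimal sketch of the alias method (Vose's construction); build_alias and alias_draw are made-up helper names, not part of NumPy:
import numpy as np

def build_alias(p):
    # Build the probability/alias tables; roughly O(K) preprocessing.
    p = np.asarray(p, dtype=float)
    K = len(p)
    prob = np.zeros(K)               # chance of keeping column i
    alias = np.zeros(K, dtype=int)   # fallback category for column i
    scaled = p * K
    small = [i for i in range(K) if scaled[i] < 1.0]
    large = [i for i in range(K) if scaled[i] >= 1.0]
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l
        scaled[l] -= 1.0 - scaled[s]
        (small if scaled[l] < 1.0 else large).append(l)
    for i in small + large:          # leftovers have (numerically) weight 1
        prob[i] = 1.0
    return prob, alias

def alias_draw(prob, alias, rng):
    # O(1) per draw: pick a column uniformly, then keep it or take its alias.
    i = rng.integers(len(prob))
    return i if rng.random() < prob[i] else alias[i]

rng = np.random.default_rng()
prob, alias = build_alias([0.2, 0.4, 0.1, 0.3])
counts = np.zeros(4, dtype=int)
for _ in range(20):
    counts[alias_draw(prob, alias, rng)] += 1
print(counts)  # counts for the four categories over 20 draws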
Question:
Assume for this part that the total count for every sample is 5000 (i.e., sum of column = 5000).
Imagine there was a row (gene G) in this dataset for which the count is expected to be 1 in 10% of samples and 0 in the remaining 90% of samples. We are doing an experiment where we would like to know if the expression of gene G changes in experimental vs control conditions, and we will measure n samples (single cells) from each condition.
Plot the statistical power to detect a 10% increase in the expression of G in experimental vs control at Bonferroni-corrected p < 0.05 as a function of n, assuming that we will be performing a similar test for significance on 1000 genes total. How many samples from each condition do we need to measure to achieve a power of 95%?
First, for gene G, I created a pandas DataFrame for the control and experimental conditions, where the proportion of 1s is 10% and 20%, respectively. I extrapolated the conditions to 1000 genes and then performed a cross-tabulation analysis.
import numpy as np
import pandas as pd
from statsmodels.stats.contingency_tables import mcnemar
from statsmodels.stats.gof import chisquare_effectsize
from statsmodels.stats.power import GofChisquarePower
n = 5000 # no of records
nog = 1000 # no of genes
gene_list = ["gene_" + str(i) for i in range(0,nog)]
def generate_gene_df(gene, n):
    df = pd.DataFrame.from_dict(
        {"Gene": gene,
         "Cells": (f'Cell{x}' for x in range(1, n + 1)),
         "Control": np.random.choice([1, 0], p=[0.1, 0.9], size=n),
         "Experimental": np.random.choice([1, 0], p=[0.1 + 0.1, 0.9 - 0.1], size=n)},
        orient='columns'
    )
    df = df.set_index(["Gene", "Cells"])
    return df
# List of simulated genes
gene_df_list = [generate_gene_df(gene, n) for gene in gene_list]
df = pd.concat(gene_df_list)
df = df.reset_index()
table = pd.crosstab([df["Gene"], df["Cells"]],
[df["Control"], df["Experimental"]]).to_numpy()
Table:
array([[1, 0, 0, 0],
[1, 0, 0, 0],
[0, 0, 1, 0],
...,
[0, 1, 0, 0],
[0, 1, 0, 0],
[0, 1, 0, 0]])
Now, I want to plot the statistical power at Bonferroni-corrected p < 0.05 as a function of n. I also want to find out how many samples from each condition we need to measure to achieve a power of 95%.
My attempt:
McNemar's test
result = mcnemar(table, exact=True)
print('statistic=%.3f, p-value=%.3f' % (result.statistic, result.pvalue))
alpha = 0.05
if result.pvalue > alpha:
    print('Same proportions of errors (fail to reject H0)')
else:
    print('Different proportions of errors (reject H0)')
Output:
statistic=0.000, p-value=1.000
Same proportions of errors (fail to reject H0)
I calculated the power using:
nobs = 5000
alpha = 0.05
effect_size = chisquare_effectsize(0.5, 0.5*1.1, correction=None, cohen=True, axis=0)
analysis = GofChisquarePower()
power_chisquare = analysis.solve_power(effect_size=effect_size, nobs=nobs, alpha=alpha)
print('Power of the chi-square test with %d observations: %.3f' % (nobs, power_chisquare))
Power of the chi-square test with 5000 observations: 0.050
Why does the power curve look atypical? Did I perform the analyses correctly? Is McNemar an appropriate statistical test method in this case?
I'm making an array of sums of random draws from a negative binomial distribution (nbd), with each sum taken over a different number of draws. Right now I implement it as follows:
import numpy as np
from numpy.random import default_rng

rng = default_rng()
nbd = rng.negative_binomial(1, 0.5, int(1e6))
gmc = [12, 35, 4, 67, 2]
n_pp = np.empty(len(gmc))
for i in range(len(gmc)):
    n_pp[i] = np.sum(rng.choice(nbd, gmc[i]))
This works, but when I perform it over my actual data it's very slow (gmc is of dimension 1e6), and I would like to vary this for multiple values of n and p in the nbd (in this example they're set to 1 and 0.5, respectively).
I'd like to work out a Pythonic way to do this that eliminates the loop, but I'm not sure it's possible. If possible, I want to keep default_rng for its better random generation compared with the older np.random.choice interface.
The distribution of the sum of m samples from the negative binomial distribution with parameters (n, p) is the negative binomial distribution with parameters (m*n, p). So instead of summing random selections from a large, precomputed sample of negative_binomial(1, 0.5), you can generate your result directly with negative_binomial(gmc, 0.5):
In [68]: gmc = [12, 35, 4, 67, 2]
In [69]: npp = rng.negative_binomial(gmc, 0.5)
In [70]: npp
Out[70]: array([ 9, 34, 1, 72, 7])
(The negative_binomial method will broadcast its inputs, so we can pass gmc as an argument to generate all the samples with one call.)
More generally, if you want to vary the n that is used to generate nbd, you would multiply that n by the corresponding element in gmc and pass the product to rng.negative_binomial.
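A hedged sketch of that generalisation (n and p below are example values, not taken from the question):
import numpy as np

rng = np.random.default_rng()
gmc = np.array([12, 35, 4, 67, 2])
n, p = 3, 0.5  # per-draw negative binomial parameters (example values)

# The sum of gmc[i] draws from NB(n, p) is distributed as NB(gmc[i] * n, p),
# so one broadcast call replaces the loop entirely.
npp = rng.negative_binomial(gmc * n, p)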
I've been looking for a function or package that would allow me to calculate the skew and kurtosis of a distribution in a weighted way, as I have histogram data.
For instance I have the data
import numpy as np
np.array([[1,  2],
          [2,  5],
          [3,  6],
          [4, 12],
          [5,  1]])
where the first column [1,2,3,4,5] are the values and the second column [2,5,6,12,1] are the frequencies of the values.
I have found out how to do the first two moments (mean, standard deviation) in a weighted way using the weighted_avg_and_std function specified in this thread, but I was not quite sure how I could extend this to both the skew and kurtosis, or even the nth statistical moment.
I have found the definitions themselves here and could manually write functions to implement this from scratch, but before I go and do that I was wondering if there were any existing packages or functions that might be able to do this.
Thanks
EDIT:
I figured it out, the following code works (please note that this is for population moments)
skewness = np.average(((values - average) / np.sqrt(variance))**3, weights=weights)
and
kurtosis = np.average(((values - average) / np.sqrt(variance))**4 - 3, weights=weights)
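For completeness, a self-contained version of that snippet using the histogram data from the question (variable names are just illustrative):
import numpy as np

values = np.array([1, 2, 3, 4, 5])
weights = np.array([2, 5, 6, 12, 1])

average = np.average(values, weights=weights)
variance = np.average((values - average)**2, weights=weights)

skewness = np.average(((values - average) / np.sqrt(variance))**3, weights=weights)
excess_kurtosis = np.average(((values - average) / np.sqrt(variance))**4 - 3, weights=weights)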
I think you have already listed all the ingredients that you need, following the formulas in the link you provided:
import numpy as np
a = np.array([[1,2],[2,5],[3,6],[4,12],[5,1]])
values, weights = a.T
def n_weighted_moment(values, weights, n):
    assert n > 0 and values.shape == weights.shape
    w_avg = np.average(values, weights=weights)
    w_var = np.sum(weights * (values - w_avg)**2) / np.sum(weights)
    if n == 1:
        return w_avg
    elif n == 2:
        return w_var
    else:
        w_std = np.sqrt(w_var)
        # Same as np.average(((values - w_avg)/w_std)**n, weights=weights)
        return np.sum(weights * ((values - w_avg) / w_std)**n) / np.sum(weights)
Which results in:
for n in range(1, 5):
    print(f'Moment {n} value is {n_weighted_moment(values, weights, n)}')
Moment 1 value is 3.1923076923076925
Moment 2 value is 1.0784023668639053
Moment 3 value is -0.5962505715592139
Moment 4 value is 2.384432138280637
Notice that your EDIT computes the excess kurtosis (it subtracts 3), while the generic n-th moment formula implemented here does not.
Taken from here
Here is the code
import numpy as np

def weighted_mean(var, wts):
    """Calculates the weighted mean"""
    return np.average(var, weights=wts)

def weighted_variance(var, wts):
    """Calculates the weighted variance"""
    return np.average((var - weighted_mean(var, wts))**2, weights=wts)

def weighted_skew(var, wts):
    """Calculates the weighted skewness"""
    return (np.average((var - weighted_mean(var, wts))**3, weights=wts) /
            weighted_variance(var, wts)**(1.5))

def weighted_kurtosis(var, wts):
    """Calculates the weighted kurtosis"""
    return (np.average((var - weighted_mean(var, wts))**4, weights=wts) /
            weighted_variance(var, wts)**(2))
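For example, applied to the data from the question (first column as values, second column as weights), assuming the functions above are in scope:
values = np.array([1, 2, 3, 4, 5])
weights = np.array([2, 5, 6, 12, 1])

print(weighted_skew(values, weights))      # population skewness
print(weighted_kurtosis(values, weights))  # plain (not excess) kurtosis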
This question already has answers here:
Most dominant color in RGB image - OpenCV / NumPy / Python
I am trying to count the number of unique colours in an image. I have some code that I think should work, however when I run it on an image it says I have 252 different colours out of a possible 16,777,216. That seems wrong given the image is BGR, so shouldn't there be many more distinct colours (thousands, not hundreds)?
import cv2
import imutils
import numpy as np

def count_colours(src):
    unique, counts = np.unique(src, return_counts=True)
    print(counts.size)
    return counts.size

src = cv2.imread('../../images/di8.jpg')
src = imutils.resize(src, height=300)
count_colours(src)  # outputs 252 different colours!? only?
Is that value correct? And if not how can I fix my function count_colours()?
Source image:
Edit: is this correct?
def count_colours(src):
    unique, counts = np.unique(src.reshape(-1, src.shape[-1]), axis=0, return_counts=True)
    return counts.size
If you look at the uniques you are getting back, I'm pretty sure you'll find they are scalars.
You need to use the axis keyword:
>>> import numpy as np
>>> from scipy.misc import face
>>>
>>> img = face()
>>> np.unique(img.reshape(-1, img.shape[-1]), axis=0, return_counts=True)
(array([[ 0, 0, 5],
[ 0, 0, 7],
[ 0, 0, 9],
...,
[255, 248, 255],
[255, 249, 255],
[255, 252, 255]], dtype=uint8), array([1, 2, 2, ..., 1, 1, 1]))
The comment by @Edeki Okoh is correct. You need to find a way to take the colour channels into account. There is probably a much cleaner solution, but a hacky way to do this would be something like the following. Each colour channel has values from 0 to 255, so we add 1 and multiply by 1000 so that each channel occupies its own block of digits: blue represents the last three digits, green the middle three and red the first three. Every value then represents a unique colour.
b, g, r = cv2.split(src)
# cast up from uint8 so the arithmetic below does not overflow
b, g, r = b.astype(np.int64), g.astype(np.int64), r.astype(np.int64)
shifted_im = b + 1000 * (g + 1) + 1000 * 1000 * (r + 1)
The resulting image should have one channel with each value representing a unique color combination.
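Counting the distinct values of the packed image then gives the number of colours; a minimal usage sketch, assuming shifted_im from the snippet above:
num_colours = np.unique(shifted_im).size
print(num_colours)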
I think you only counted a single channel, e.g. the R values, rather than full RGB triplets; that's why you only get 252 discrete values.
In theory R, G and B can each take 256 discrete values, and 256*256*256 = 16777216, so in total there are 16777216 possible colours.
My suggestion is to pack the RGB uchar image (CV_8UC3) into a single 32-bit integer per pixel.
Using the given image as input:
# my small test image, whose number of colours I can count by hand
import cv2
import numpy as np
image = cv2.imread('/home/usr/naneDownloads/vuQ9y.png')  # change here
b, g, r = cv2.split(image)
out_in_32U_2D = (np.int32(b) << 16) + (np.int32(g) << 8) + np.int32(r)  # bit-shift each channel into its own byte
out_in_32U_1D = out_in_32U_2D.reshape(-1)  # convert to 1D
len(np.unique(out_in_32U_1D))
37  # correct for my test image when compared with my manual counting
The code here should provide what you need.
I want to split a long vector into smaller unequal pieces, do a summation on each piece and gather the results into a new vector.
I need to do this in pytorch but I am also interested to see how this is done with numpy.
This can easily be accomplished by splitting the vector:
import torch

sizes = [3, 7, 5, 9]
X = torch.ones(sum(sizes))
Y = torch.tensor([s.sum() for s in torch.split(X, sizes)])
or with np.ones and np.split.
Is there a more efficient way to do this?
Edit:
Inspired by the first comment:
indices = np.cumsum([0]+sizes)[:-1]
Y = np.add.reduceat(X, indices.tolist())
solves it for numpy. I am still looking for a solution with pytorch.
index_add_ is your friend!
# inputs
sizes = torch.tensor([3, 7, 5, 9], dtype=torch.long)
x = torch.ones(int(sizes.sum()))
# prepare an index vector for summation (what elements of x are summed to each element of y)
ind = torch.zeros(int(sizes.sum()), dtype=torch.long)
ind[torch.cumsum(sizes, dim=0)[:-1]] = 1
ind = torch.cumsum(ind, dim=0)
# prepare the output
y = torch.zeros(len(sizes))
# do the actual summation
y.index_add_(0, ind, x)
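As a quick sanity check (since x is all ones, each output entry is just its segment length):
print(y)  # tensor([3., 7., 5., 9.])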