How to sample probability distributions in Pharo (Smalltalk)

I want to generate n random samples from different probability distributions (normal, lognormal, Poisson, uniform), i.e. what rnorm, rlnorm, rpois, and runif do in R.
How can I do that in Pharo? It would also be useful to know how to calculate densities (dnorm, dlnorm, dpois, dunif).

OK, here I leave an answer for the record, using PolyMath as recommended.
To load PolyMath execute:
Metacello new
    repository: 'github://PolyMathOrg/PolyMath/src';
    baseline: 'PolyMath';
    load
There seem to be two different kinds of objects: PMProbabilityDensity and PMNumberGenerator, each with its respective subclasses.
You can use the density objects both to obtain densities and to generate random samples, or you can use the number generators to generate random samples.
So, for the normal distribution, using density objects you can sample with:
PMNormalDistribution new random "N(0,1)"
PMNormalDistribution new initialize: mu sigma: sigma; random
Or obtain the density at point x with:
PMNormalDistribution new initialize: 0 sigma: 1; value: x.
The exact same pattern applies to the log-normal and uniform distributions:
PMUniformDistribution new initialize: from to: to; random
PMUniformDistribution new initialize: from to: to; value: x
PMLogNormalDistribution new initialize: mu sigma: sigma; random
PMLogNormalDistribution new initialize: mu sigma: sigma; value: x
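As a quick sanity check of the density pattern: a U(0,2) density is 0.5 everywhere inside its support, so the following should evaluate to 0.5:
PMUniformDistribution new initialize: 0 to: 2; value: 1 "expected: 0.5"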
Unfortunately I couldn't find a density object for Poisson.
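You can still compute the Poisson mass function by hand, since P(X = x) = lambda^x * e^(-lambda) / x!. A minimal sketch in plain Pharo (no PolyMath needed), here for lambda = 3 and x = 2:
| lambda x |
lambda := 3.
x := 2.
(lambda raisedTo: x) * lambda negated exp / x factorial "≈ 0.224"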
And you can still generate samples from it: with the number generator objects you can generate Poisson and normal samples (and log-normal ones indirectly):
PMPoissonGenerator new lambda: lambda; next.
PMGaussianGenerator new next.
PMGaussianGenerator new standardDeviation: 2; mean: 5; next.
(PMGaussianGenerator new standardDeviation: 2; mean: 5; next) exp. "log normal"
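Finally, to get n samples at once (the equivalent of R's rnorm(n) and friends), you can wrap any of these in a collect:. A minimal sketch, reusing a single distribution object:
| dist |
dist := PMNormalDistribution new. "standard N(0,1)"
(1 to: 1000) collect: [ :i | dist random ]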


STAN - Defining priors for dependent random variables

Background: I have a simulation model which has unobserved parameters. I created a metamodel using artificial neural networks (ANN) because the runtime was very long for the simulation model. I am trying to estimate the unobserved parameters using Bayesian calibration, where priors are based on current knowledge, and the likelihood of observing data is being estimated from the metamodel.
Query: I have two random variables X and Y for which I am trying to get the posterior distribution using STAN. The prior distribution of X is uniform, U(0,2). The prior for Y is also uniform, but it will always exceed X i.e., Y ~ U(X,2). Since Y is linked to X, how can I define the prior distribution for Y in STAN such that the constraint Y>X holds? I am new to STAN, so I would appreciate any suggestions or guidance on how to proceed. Thank you so much!
Stan's ordered vectors are what you need. Create an ordered vector of length 2 (I'll call it beta) in the parameters block, like this:
parameters {
    ordered<lower=0,upper=2>[2] beta;
}
Ordered vectors are constrained such that each element is greater than the previous element. So beta[1] will be your estimate of X and beta[2] will be your estimate of Y.
(To make sure I understand your model correctly: you have two parameters, X and Y, and your only prior knowledge about them is that they both lie in [0, 2] and Y > X. X and Y describe some aspect of the distribution of your data - for example, maybe X is the mean of some other random variable Z, for which you have observations. Do I have that right?)
I believe Stan's priors are uniform by default, but you can make sure of this by specifying a prior for beta in the model block:
model {
    beta ~ uniform(0, 2);
    ...
}
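If your Stan version rejects bounds on an ordered type, an equivalent formulation uses parameter-dependent bounds instead (a sketch, with hypothetical parameter names X and Y):
parameters {
    real<lower=0, upper=2> X;
    real<lower=X, upper=2> Y; // lower bound depends on X, so Y > X always holds
}
With no explicit priors, Stan's implicit flat prior over this constrained region is uniform on the triangle 0 < X < Y < 2; if you specifically want the conditional prior Y | X ~ U(X, 2), you can additionally write Y ~ uniform(X, 2); in the model block.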

Importance of seed and num_runs in KMeans clustering

New to ML, so I'm trying to make sense of the following code. Specifically:
In for run in np.arange(1, num_runs+1), what is the need for this loop? Why didn't the author use the setMaxIter method of KMeans?
What is the importance of seeding in clustering?
Why did the author choose to set the seed explicitly rather than using the default one?
import time

import numpy as np
import pandas as pd
from pyspark.ml.clustering import KMeans
from pyspark.ml.evaluation import ClusteringEvaluator

def optimal_k(df_in, index_col, k_min, k_max, num_runs):
    '''
    Determine the optimal number of clusters by silhouette score analysis.

    :param df_in: the input dataframe
    :param index_col: the name of the index column
    :param k_min: the minimum number of clusters
    :param k_max: the maximum number of clusters
    :param num_runs: the number of runs for each fixed number of clusters

    :return k: optimal number of clusters
    :return silh_lst: silhouette scores
    :return r_table: the running results table

    :author: Wenqiang Feng
    :email: von198#gmail.com.com
    '''
    start = time.time()
    silh_lst = []
    k_lst = np.arange(k_min, k_max + 1)

    r_table = df_in.select(index_col).toPandas()
    r_table = r_table.set_index(index_col)
    centers = pd.DataFrame()  # (not used below)

    for k in k_lst:
        silh_val = []
        for run in np.arange(1, num_runs + 1):
            # Train a k-means model with a fresh random seed for each run.
            kmeans = KMeans()\
                .setK(int(k))\
                .setSeed(int(np.random.randint(100, size=1)))
            model = kmeans.fit(df_in)

            # Make predictions
            predictions = model.transform(df_in)
            r_table['cluster_{k}_{run}'.format(k=k, run=run)] = predictions.select('prediction').toPandas()

            # Evaluate the clustering by computing the silhouette score
            evaluator = ClusteringEvaluator()
            silhouette = evaluator.evaluate(predictions)
            silh_val.append(silhouette)

        silh_array = np.asanyarray(silh_val)
        silh_lst.append(silh_array.mean())

    elapsed = time.time() - start
    silhouette = pd.DataFrame(list(zip(k_lst, silh_lst)), columns=['k', 'silhouette'])

    print('+------------------------------------------------------------+')
    print("| The finding optimal k phase took %8.0f s. |" % (elapsed))
    print('+------------------------------------------------------------+')

    return k_lst[np.argmax(silh_lst, axis=0)], silhouette, r_table
I'll try to answer your questions based on my reading of the material.
The reason for this loop is that the author sets a new seed for every iteration using int(np.random.randint(100, size=1)). If the feature variables exhibit patterns that naturally group them into visible clusters, the starting seed should have no impact on the final cluster memberships. However, if the data is evenly distributed, we might end up with different cluster memberships depending on the initial random centroids. I believe the author is changing the seed for each run to test different initial configurations; setMaxIter, by contrast, would only set the maximum number of iterations for a single seed (initial configuration).
Similar to the above: the seed determines the initial placement of the k points around which you're going to cluster. Depending on the underlying data distribution, the clusters can converge to different final configurations.
The author keeps control over the seed, as discussed in points 1 and 2. You can check for which seeds your code converges to the desired clusters and for which it does not. Also, if you iterate over, say, 100 different seeds and your code still converges to the same final clusters, you can fall back to the default seed, as it likely doesn't matter. From a software engineering perspective, setting an explicit seed is also important if you want to, for example, write tests for your code and don't want them to fail at random.
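As a minimal sketch of that last point (assuming a Spark DataFrame df with a features column):
from pyspark.ml.clustering import KMeans

# Same seed + same data -> same initial centroids, hence reproducible results;
# this is what makes deterministic tests possible.
kmeans = KMeans().setK(3).setSeed(42)
model_a = kmeans.fit(df)
model_b = kmeans.fit(df)  # yields the same cluster assignments as model_a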

Miscalculation of new function "pcsegdist" in Matlab R2018b

I am trying the new function "pcsegdist" in Matlab R2018b. However, the result of segmenting the point cloud into clusters based on Euclidean distance is wrong.
Example: I test with 1797 3D data points (please see attached test.txt file). Note that the smallest distance between two neighboring points is 0.3736.
tic
clear; clc;
filename = 'test.txt';
load('test.txt');
P = test(:,1:3); % get the (x,y,z) coordinates from columns 1, 2 and 3 of all rows
ptCloud = pointCloud(P);
minDistance = 0.71; % this value should be less than the smallest 3D distance between two clusters
[labels,numClusters] = pcsegdist(ptCloud,minDistance); % numClusters: the number of clusters
% labels: a k-by-1 matrix holding the cluster index of each point
toc

%% Generate the cell_cluster
cell_cluster = {};
x = P(:,1); y = P(:,2); z = P(:,3);
for i = 1:numClusters
    cluster_i = [x(labels==i), y(labels==i), z(labels==i)]; % x,y,z coordinates of all points belonging to cluster i
    cell_cluster{end+1} = cluster_i; % a 1-by-k cell, where k = number of clusters
end
figure; Plot_cell(cell_cluster); view(3); % plot the resulting clusters (using a helper plotting function)
But when I verify manually (against ground-truth data), the result should be as in the figure below:
Thus, I wonder whether the result of the new function "pcsegdist" in Matlab R2018b is correct, or whether I have misunderstood something or made a mistake somewhere.

pymc python change point detection for small probabilities. ZeroProbability Error

I am trying to use pymc to find a change point in a time series. The value I am looking at over time is the probability to "convert", which is very small: 0.009 on average, with a range of 0.001-0.016.
I give the two probabilities a uniform prior between zero and the maximum observation.
alpha = df.cnvrs.max()  # set the upper bound of the uniform priors
center_1_c = pm.Uniform("center_1_c", 0, alpha)
center_2_c = pm.Uniform("center_2_c", 0, alpha)
day_c = pm.DiscreteUniform("day_c", lower=1, upper=n_days)

@pm.deterministic
def lambda_(day_c=day_c, center_1_c=center_1_c, center_2_c=center_2_c):
    out = np.zeros(n_days)
    out[:day_c] = center_1_c
    out[day_c:] = center_2_c
    return out

observation = pm.Uniform("obs", lambda_, value=df.cnvrs.values, observed=True)
When I run this code I get:
ZeroProbability: Stochastic obs's value is outside its support,
or it forbids its parents' current values.
I'm pretty new to pymc, so I'm not sure if I'm missing something obvious. My guess is that I might not have appropriate distributions for modelling small probabilities.
It's impossible to tell where you've introduced this bug (and programming is off-topic here, in any case) without more of your output. But there is a statistical issue here: you've somehow constructed a model that cannot produce either the observed variables or the current sample of latent ones.
To give a simple example, say you have a dataset with negative values and you've assumed it to be gamma distributed; this will produce an error, because such data has zero probability under a gamma distribution. Similarly, an error will be thrown if an impossible value is sampled during an MCMC chain.
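A minimal sketch of that gamma example in PyMC2 syntax (the variable names here are made up for illustration):
import numpy as np
import pymc as pm  # PyMC2

data = np.array([-1.0, 2.0, 3.0])  # contains a negative value

# A gamma density is zero for negative values, so PyMC should raise
# ZeroProbability as soon as this observed stochastic is created.
obs = pm.Gamma('obs', alpha=1.0, beta=1.0, value=data, observed=True)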

Stan. Using target += syntax

I'm starting to learn Stan.
Could anyone explain when and how to use syntax such as... ?
target +=
instead of just:
y ~ normal(mu, sigma)
For example, in the Stan manual you can find the following example:
model {
    real ps[K]; // temp for log component densities
    sigma ~ cauchy(0, 2.5);
    mu ~ normal(0, 10);
    for (n in 1:N) {
        for (k in 1:K) {
            ps[k] = log(theta[k])
                    + normal_lpdf(y[n] | mu[k], sigma[k]);
        }
        target += log_sum_exp(ps);
    }
}
I think the target += line increments the target value, which I believe is the logarithm of the posterior density.
But the posterior density of which parameters?
When is it updated and initialized?
After Stan finishes (and converges), how do you access its value, and how do I use it?
Other examples:
data {
    int<lower=0> J;          // number of schools
    real y[J];               // estimated treatment effects
    real<lower=0> sigma[J];  // s.e. of effect estimates
}
parameters {
    real mu;
    real<lower=0> tau;
    vector[J] eta;
}
transformed parameters {
    vector[J] theta;
    theta = mu + tau * eta;
}
model {
    target += normal_lpdf(eta | 0, 1);
    target += normal_lpdf(y | theta, sigma);
}
The example above uses target += twice instead of just once.
Another example:
data {
    int<lower=0> N;
    vector[N] y;
}
parameters {
    real mu;
    real<lower=0> sigma_sq;
    vector<lower=-0.5, upper=0.5>[N] y_err;
}
transformed parameters {
    real<lower=0> sigma;
    vector[N] z;
    sigma = sqrt(sigma_sq);
    z = y + y_err;
}
model {
    target += -2 * log(sigma);
    z ~ normal(mu, sigma);
}
This last example even mixes both methods.
To make things even more difficult, I've read that
y ~ normal(0,1);
has the same effect as
increment_log_prob(normal_log(y,0,1));
Could anyone explain why, please?
Could anyone also provide a simple example written in the two different ways, with target += and with the regular, simpler y ~ notation, please?
Regards
The syntax
target += u;
adds u to the target log density.
The target density is the density from which the sampler samples, and it needs to be equal to the joint density of all the parameters given the data, up to a constant (which is usually achieved via Bayes's rule, by coding it as the joint density of the parameters and the modeled data up to a constant). You can access it as lp__ in the posterior, but be careful: it also contains the Jacobians arising from the constraints and drops constants from sampling statements, so you do not want to use it for model comparison.
From a sampling perspective, writing
target += normal_lpdf(y | mu, sigma);
has the same effect as
y ~ normal(mu, sigma);
The _lpdf suffix signals that it's the log probability density function of the normal, which is implicit in the sampling notation. The sampling notation is just shorthand for the target += syntax, except that it additionally drops constant terms from the log density.
It's explained in the statements section of the language reference (the second part of the manual) and used in multiple examples throughout the programmer's guide (the first part of the manual).
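For instance, here is a complete minimal model written both ways; the two versions sample the same posterior, the target += form just keeps the normalizing constants (a sketch, not taken from the manual):
data {
    int<lower=0> N;
    vector[N] y;
}
parameters {
    real mu;
}
model {
    // sampling-statement form (drops constant terms):
    // y ~ normal(mu, 1);

    // equivalent target += form (keeps constant terms):
    target += normal_lpdf(y | mu, 1);
}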
I am just starting to learn Stan and Bayesian statistics, and mainly rely on John Kruschke's book "Doing Bayesian Data Analysis". In chapter 14.3.3, he explains:
Thus, the essence of computation in Stan is dealing with the logarithm of the posterior probability density and its gradient; there is no direct random sampling of parameters from distributions.
As a result (still paraphrasing Kruschke), a
model [...] like y ∼ normal(mu,sigma) [actually] means to multiply the current posterior probability by the density of the normal distribution at the datum value y.
Following the rules of logarithms, this multiplication is equivalent to adding the log probability density of a given datum y to the current log probability (log(a*b) = log(a) + log(b), hence multiplication becomes addition).
I concede that I don't grasp the full implications of that, but I think it points in the right direction as to what, mathematically speaking, target += does.