Find ranges to create a uniform histogram - SQL

I need to find ranges in order to create a uniform histogram, e.g. ages split into 4 ranges:
data_set = [18,21,22,24,27,27,28,29,30,32,33,33,42,42,45,46]
Is there a function that gives me the ranges so the histogram is uniform? In this case:
ranges = [(18,24), (27,29), (30,33), (42,46)]
This example is easy; I'd like to know if there is an algorithm that deals with complex data sets as well.
Thanks

You are looking for the quantiles that split up your data equally. This, combined with cut, should work. So, suppose you want n groups.
set.seed(1)
x <- rnorm(1000) # Generate some toy data
n <- 10
uniform <- cut(x, c(-Inf, quantile(x, prob = (1:(n-1))/n), Inf)) # Determine the groups
plot(uniform)
Edit: now corrected to yield the correct cuts at the ends.
Edit2: I don't quite understand the downvote. But this also works in your example:
data_set = c(18,21,22,24,27,27,28,29,30,32,33,33,42,42,45,46)
n <- 4
groups <- cut(data_set, breaks = c(-Inf, quantile(data_set, prob = 1:(n-1)/n), Inf))
levels(groups)
Some minor renaming is necessary. For slightly better level names, you could also put in min(x) and max(x) instead of -Inf and Inf.
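If you need the same thing outside R (the question mentions SQL), the idea is simply to cut at the empirical quantiles. A rough Python/numpy sketch of that idea, using the question's data (not part of the answer above):

import numpy as np

data = np.array([18, 21, 22, 24, 27, 27, 28, 29, 30, 32, 33, 33, 42, 42, 45, 46])
n = 4

edges = np.quantile(data, np.arange(1, n) / n)        # inner cut points (25th, 50th, 75th percentiles)
groups = np.searchsorted(edges, data, side='left')    # group index 0..n-1 for every value
ranges = [(data[groups == g].min(), data[groups == g].max()) for g in range(n)]
print(ranges)

For this data set it reproduces ranges = [(18,24), (27,29), (30,33), (42,46)]; on other data an empty group would need extra handling.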

Related

Determining optimal bins to bin the data

I have X,Y data which I would like to bin according to X values.
However, I would like to determine the optimal number of X bins that satisfies a condition based on the resulting bin intervals and the average Y of each bin. For example, if I have
X=[2,3,4,5,6,7,8,9,10]
Y=[120,140,143,124,150,140,180,190,200]
I would like to determine the best number of X bins that will satisfy this condition: (average of Y in a bin)/(8 * width of the X bin) should be above 20, but as close as possible to 20. The bins should also be integers, e.g. [1, 2, ...].
I am currently using:
bin_means, bin_edges, binnumber = binned_statistic(X, Y, statistic='mean', bins=bins)
with bins being pre-defined. However, I would like an algorithm that can determine the optimal bins for me before using this.
One can easily determine it for a small data set, but for hundreds of points it becomes time-consuming.
Thank you
If you NEED to iterate to find the optimal nbins with your minimization function, take a look at numpy.digitize
https://numpy.org/doc/stable/reference/generated/numpy.digitize.html
And try:
import numpy as np
import pandas as pd

# min_nbins / max_nbins: the range of bin counts to try (set these yourself)
start = min(X)
stop = max(X)
cut_dict = {n: np.digitize(X, bins=np.linspace(start, stop, num=n + 1))
            for n in range(min_nbins, max_nbins)}

Y = pd.Series(Y).rename('Y')
avg = {nbins: Y.groupby(cut).mean().mean() for nbins, cut in cut_dict.items()}
avg = pd.Series(list(avg.values()), index=list(avg.keys())).rename('mean_ybins').to_frame()
Then you can find which is closest to 20 or if 20 is the right number...
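One way to then apply the condition from the question (a rough sketch; the search range and measuring "as close as possible to 20" by the mean distance of the per-bin ratios are my assumptions):

import numpy as np
import pandas as pd

X = np.array([2, 3, 4, 5, 6, 7, 8, 9, 10])
Y = np.array([120, 140, 143, 124, 150, 140, 180, 190, 200])

min_nbins, max_nbins = 2, 8                              # assumed search range
start, stop = X.min(), X.max()

best = None
for n in range(min_nbins, max_nbins + 1):
    edges = np.linspace(start, stop, num=n + 1)
    width = (stop - start) / n                           # uniform bin width
    cut = np.minimum(np.digitize(X, bins=edges), n)      # keep X == stop inside the last bin
    ratios = pd.Series(Y).groupby(cut).mean() / (8 * width)
    if (ratios > 20).all():                              # condition from the question
        score = (ratios - 20).abs().mean()               # overall closeness to 20
        if best is None or score < best[1]:
            best = (n, score)

print(best)   # (nbins, mean distance from 20), or None if no n satisfies the condition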

Fast way to set diagonals of an (M x N x N) matrix? Einsum / n-dimensional fill_diagonal?

I'm trying to write fast, optimized code based on matrices, and have recently discovered einsum as a tool for achieving significant speed-up.
Is it possible to use this to set the diagonals of a multidimensional array efficiently, or can it only return data?
In my problem, I'm trying to set the diagonals for an array of square matrices (shape: M x N x N) by summing the columns in each square (N x N) matrix.
My current (slow, loop-based) solution is:
import numpy as np

# Build dummy array
dimx = 2 # Dimension x (likely to be < 100)
dimy = 3 # Dimension y (likely to be between 2 and 10)
M = np.random.randint(low=1, high=9, size=[dimx, dimy, dimy])
# Blank the diagonals so we can see the intended effect
np.fill_diagonal(M[0], 0)
np.fill_diagonal(M[1], 0)
# Compute diagonals based on summing columns
diags = np.einsum('ijk->ik', M)
# Set the diagonal for each matrix
# THIS IS SLOW. CAN IT BE IMPROVED?
for i in range(len(M)):
np.fill_diagonal(M[i], diags[i])
# Print result
M
Can this be improved at all, please? It seems np.fill_diagonal doesn't accept non-square matrices (hence forcing my loop-based solution). Perhaps einsum can help here too?
One approach would be to reshape to 2D, set the columns at steps of ncols+1 with the diagonal values. Reshaping creates a view and as such allows us to directly access those diagonal positions. Thus, the implementation would be -
s0,s1,s2 = M.shape
M.reshape(s0,-1)[:,::s2+1] = diags
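To make the trick concrete, here is a small self-contained check (a sketch reusing the question's dummy setup, not part of the original answer) that the reshape assignment matches the loop:

import numpy as np

# Same dummy setup as the question
dimx, dimy = 2, 3
M = np.random.randint(low=1, high=9, size=[dimx, dimy, dimy])
diags = np.einsum('ijk->ik', M)              # column sums of each N x N block

# Loop-based reference (the question's original approach)
M_loop = M.copy()
for i in range(len(M_loop)):
    np.fill_diagonal(M_loop[i], diags[i])

# Vectorized version: view each block as one flat row and hit every (N+1)-th entry
M_vec = M.copy()
s0, s1, s2 = M_vec.shape
M_vec.reshape(s0, -1)[:, ::s2 + 1] = diags

assert np.array_equal(M_loop, M_vec)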
If you do np.source(np.fill_diagonal) you'll see that in the 2d case it uses a 'strided' approach
if a.ndim == 2:
step = a.shape[1] + 1
end = a.shape[1] * a.shape[1]
a.flat[:end:step] = val
@Divakar's solution applies this to your 3d case by 'flattening' on 2 dimensions.
You could sum the columns with M.sum(axis=1). Though I vaguely recall some timings that found that einsum was actually a bit faster. sum is a little more conventional.
Someone has asked for an ability to expand dimensions in einsum, but I don't think that will happen.
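For what it's worth, a quick sanity check (a sketch, not from the original answer) that the two spellings of the column sum agree:

import numpy as np

M = np.random.randint(1, 9, size=(2, 3, 3))
assert np.array_equal(M.sum(axis=1), np.einsum('ijk->ik', M))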

Looping and if statements in SPSS

I'm new to SPSS and I'm a bit stuck on a problem. I have about 200 variables and I want to loop through pairs of them looking for variables with correlation coefficients above 0.7. I know that I can use CORRELATIONS to get a matrix of coefficients but it would be huge and difficult to look through. Basically, in pseudocode, what I want to do is:
for (i = W1_1 to W1_200) {
for (j = i to W1_200) {
if CORRELATIONS(i,j)>0.7 {
print i, j, CORRELATIONS(i,j)
}
}
}
I can't for the life of me work out how to do any of this in SPSS. Help!
SPSS has a helper function on the CORRELATIONS command to export the correlation matrix. From there you can manipulate the data to give the correlation pairs that meet your criteria. So first, let's make some fake data to illustrate.
*Making fake data.
set seed 5.
input program.
loop i = 1 to 100.
end case.
end loop.
end file.
end input program.
dataset name test.
compute #base = RV.NORMAL(0,1).
vector X(20).
loop #i = 1 to 20.
compute X(#i) = #base*(#i/20) + RV.NORMAL(0,1).
end loop.
exe.
Now, we can run the CORRELATIONS command and export the table to a new dataset (which I named here Corrs).
DATASET DECLARE Corrs.
CORRELATIONS
/VARIABLES=X1 to X20
/MATRIX=OUT('Corrs').
Unfortunately SPSS returns the full matrix (plus other info on the sample size). We can select only the rows we are interested in (ones with "CORR" in the ROWTYPE_ column) and then use a DO REPEAT to set the upper or lower half of the matrix to system-missing values.
DATASET ACTIVATE Corrs.
SELECT IF ROWTYPE_ = "CORR".
*Now only making lower half of matrix.
COMPUTE #iter = 0.
DO REPEAT X = X1 TO X20.
COMPUTE #iter = #iter + 1.
IF #iter > ($casenum-1) X = $SYSMIS.
END REPEAT.
I set them to system-missing values because in the next part I will reshape the data using VARSTOCASES. This by default drops missing values, so we won't end up with redundant correlation pairs.
VARSTOCASES
/MAKE Corr FROM X1 TO X20
/INDEX X2 (Corr)
/DROP ROWTYPE_.
RENAME VARIABLES (VARNAME_ = X1).
Now you have your correlation pairs list and can just select out the correlations that meet your criteria.
SELECT IF ABS(Corr) >= .5.
Building the correlation pairs can be wrapped into a MACRO function pretty easily to return the pair list. Below is that function, recreating the exact steps used here.
DEFINE !CorrPairs (!POSITIONAL !CMDEND)
DATASET DECLARE Corrs.
CORRELATIONS
/VARIABLES=!1
/MATRIX=OUT('Corrs').
DATASET ACTIVATE Corrs.
SELECT IF ROWTYPE_ = "CORR".
COMPUTE #iter = 0.
DO REPEAT X = !1.
COMPUTE #iter = #iter + 1.
IF #iter > ($casenum-1) X = $SYSMIS.
END REPEAT.
VARSTOCASES
/MAKE Corr FROM !1
/INDEX X2 (Corr)
/DROP ROWTYPE_.
RENAME VARIABLES (VARNAME_ = X1).
!ENDDEFINE.
The macro just takes a list of variables (in the active dataset) to grab the correlations, and returns a second dataset named Corrs with the correlation pairs and the variable names defined in the X1 and X2 columns. Then, once the macro is defined, the steps above can be recreated simply as below.
!CorrPairs X1 to X20.
SELECT IF ABS(Corr) >= .5.
EXECUTE.
My suggestion is to use OMS to extract your correlation values from the output into a datafile. Use a macro to only run the correlations you need:
DATASET DECLARE Correlations.
OMS /SELECT TABLES /IF COMMANDS=['Correlations'] SUBTYPES=['Correlations']
/DESTINATION FORMAT=SAV NUMBERED=TableNumber_ OUTFILE='Correlations' VIEWER=YES.
define runCorrs ()
!do !i1=1 !to 200
!do !i2=!i1 !to 200
!if (!i2<>!i1) !then
corr !concat("W_",!i1) with !concat("W_",!i2).
!ifend
!doend !doend
!enddefine.
runCorrs.
OMSEND.
datas act Correlations.
select if var2="Pearson Correlation".
VARSTOCASES /make crlVal from W_2 to W_200/index=withvar(crlVal)
/drop TableNumber_ Command_ Subtype_ Label_ Var2.
Now you have a nice list of all the correlations to work with:
select if crlVal>0.7.
exe.

Number sort using Min, Max and Variables

I am new to programming and I am trying to create a program that will take 3 random numbers X, Y and Z and sort them into ascending order, X being the lowest and Z the highest, using min and max functions and a variable (tmp).
I know that there is a particular strategy that I need to use that affects the (X,Y) pair first, then (Y,Z), then (X,Y) again, but I can't grasp the logic.
The closest I have got so far is...
y=min(y,z)
x=min(x,y)
tmp=max(y,z)
z=tmp
tmp=max(x,y)
y=tmp
x=min(x,y)
tmp=max(x,y)
y=tmp
I've tried so many different combinations, but it seems that the problem is UNSOLVABLE. Can anybody else help?
You need to sort the X,Y pair first:
tmp=min(x,y)
y=max(x,y)
x=tmp
Then sort the Y,Z pair
tmp = min(y,z)
z=max(y,z)
y=tmp
Then, re-sort the X,Y pair (in case the original Z was the lowest value):
tmp=min(x,y)
y=max(x,y)
x=tmp
If the commands you have mentioned are the only ones available on the website, and you can only use each one once, try:
# Sort X,Y pair
tmp=max(x,y)
x=min(x,y)
y=tmp
# Sort Y,Z pair
tmp=max(y,z)
y=min(y,z)
z=tmp
# Sort X,Y pair again.
tmp=max(x,y)
x=min(x,y)
y=tmp
Hope that helps.
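To see that these three compare-exchanges really are enough, here is a quick brute-force check in Python (just a sketch, not from the answers above) over every ordering of three values:

from itertools import permutations

def sort3(x, y, z):
    tmp = max(x, y); x = min(x, y); y = tmp   # sort the (x, y) pair
    tmp = max(y, z); y = min(y, z); z = tmp   # sort the (y, z) pair
    tmp = max(x, y); x = min(x, y); y = tmp   # sort the (x, y) pair again
    return x, y, z

# Every ordering of three distinct values ends up ascending
assert all(sort3(*p) == (1, 2, 3) for p in permutations([1, 2, 3]))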
I'm not sure if I understood your question correctly, but you are overwriting your variables. Or are you trying to solve some homework with the restriction to only use min() and max() functions?
What about using a list?
tmp = [x, y, z]
tmp.sort()
x, y, z = tmp

Is it possible to optimize this Matlab code for doing vector quantization with centroids from k-means?

I've created a codebook using k-means of size 4000x300 (4000 centroids, each with 300 features). Using the codebook, I then want to label an input vector (for purposes of binning later on). The input vector is of size Nx300, where N is the total number of input instances I receive.
To compute the labels, I calculate the closest centroid for each of the input vectors. To do so, I compare each input vector against all centroids and pick the centroid with the minimum distance. The label is then just the index of that centroid.
My current Matlab code looks like:
function labels = assign_labels(centroids, X)
labels = zeros(size(X, 1), 1);
% for each X, calculate the distance from each centroid
for i = 1:size(X, 1)
% distance of X_i from all j centroids is: sum((X_i - centroid_j)^2)
% note: we leave off the sqrt as an optimization
distances = sum(bsxfun(@minus, centroids, X(i, :)) .^ 2, 2);
[value, label] = min(distances);
labels(i) = label;
end
However, this code is still fairly slow (for my purposes), and I was hoping there might be a way to optimize the code further.
One obvious issue is that there is a for-loop, which is the bane of good performance in Matlab. I've been trying to come up with a way to get rid of it, but with no luck (I looked into using arrayfun in conjunction with bsxfun, but haven't gotten that to work). Alternatively, if someone knows of any other way to speed this up, I would greatly appreciate it.
Update
After doing some searching, I couldn't find a great solution using Matlab, so I decided to look at what is used in Python's scikits.learn package for 'euclidean_distance' (shortened):
XX = sum(X * X, axis=1)[:, newaxis]
YY = Y.copy()
YY **= 2
YY = sum(YY, axis=1)[newaxis, :]
distances = XX + YY
distances -= 2 * dot(X, Y.T)
distances = maximum(distances, 0)
which uses the binomial form of the euclidean distance ((x-y)^2 -> x^2 + y^2 - 2xy), which from what I've read usually runs faster. My completely untested Matlab translation is:
XX = sum(data .* data, 2);                       % N x 1
YY = sum(center .^ 2, 2);                        % K x 1
dist = bsxfun(@plus, XX, YY') - 2*data*center';  % N x K squared distances
[val, labels] = min(dist, [], 2);                % nearest centroid per row
Use the following function to calculate your distances. You should see an order of magnitude speed-up.
The two matrices A and B have the columns as the dimensions and the rows as each point.
A is your matrix of centroids. B is your matrix of datapoints.
function D=getSim(A,B)
Qa=repmat(dot(A,A,2),1,size(B,1));
Qb=repmat(dot(B,B,2),1,size(A,1));
D=Qa+Qb'-2*A*B';
You can vectorize it by converting to cells and using cellfun:
[nRows,nCols]=size(X);
XCell=num2cell(X,2);
dist=reshape(cell2mat(cellfun(@(x)(sum(bsxfun(@minus,centroids,x).^2,2)),XCell,'UniformOutput',false)),size(centroids,1),nRows);
[~,labels]=min(dist);
Explanation:
We assign each row of X to its own cell in the second line
This piece @(x)(sum(bsxfun(@minus,centroids,x).^2,2)) is an anonymous function which is the same as your distances=... line, and using cell2mat, we apply it to each row of X.
The labels are then the indices of the minimum row along each column.
For a true matrix implementation, you may consider trying something along the lines of:
P2 = kron(centroids, ones(size(X,1),1));
Q2 = kron(ones(size(centroids,1),1), X);
distances = reshape(sum((Q2-P2).^2,2), size(X,1), size(centroids,1));
Note
This assumes the data is organized as [x1 y1 ...; x2 y2 ...;...]
You can use a more efficient algorithm for nearest neighbor search than brute force.
The most popular approach is the k-d tree, with O(log(n)) average query time instead of the O(n) brute-force complexity.
Regarding a Matlab implementation of k-d trees, you can have a look here
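For completeness, the equivalent nearest-centroid lookup with a k-d tree in Python/SciPy would look roughly like this (a sketch; the shapes and the random data are just stand-ins for the question's matrices):

import numpy as np
from scipy.spatial import cKDTree

# Dummy data with the shapes from the question: 4000 centroids x 300 features
rng = np.random.default_rng(0)
centroids = rng.standard_normal((4000, 300))
X = rng.standard_normal((500, 300))

tree = cKDTree(centroids)
_, labels = tree.query(X, k=1)   # index of the nearest centroid for each row of X

Note that in 300 dimensions a k-d tree often degrades towards brute-force performance, so it pays to benchmark against the vectorized distance computations above.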