Dataframe grouping and distribution test/choose - dataframe

I have a Dataframe with 16 variables (x1, x2, ..., x16). Variables 1 to 15 are parameters chosen to perform an experiment. The 16th variable is the measured parameter that describes the result of the experiment.
The experiment is repeated a certain number of times, let's say n, with constant values of the parameters (x1 to x15), i.e. with constant boundary conditions. In this way I have one series of experiments. If the boundary conditions change, then I have a new series of experiments in which the experiment is repeated m times.
First, I would like to find all the series of experiments in the dataframe. I think this could be done with the R function "group_by".
Then, I would like to find the probability distribution of the 16th variable, i.e. the results of the experiment, for each series (i.e. each group found with "group_by"). For this I was thinking of using the command "distChoose". Otherwise I was thinking of fitting the data with "fitdist" for two or three distributions and getting the AIC. I would like to create a table where the AIC is saved for every tested distribution for each group.
I tried something like this:
library(dplyr)         # group_by(), group_split()
library(fitdistrplus)  # fitdist()

grouping <- group_by(Dataframe, x1, x2, x3, x4, x5, x6, x7, x8,
                     x9, x10, x11, x12, x13, x14, x15)
grouplist <- group_split(grouping)
AIC <- numeric(length(grouplist))   # 763 groups in my data
for (i in seq_along(grouplist)) {
  if (length(grouplist[[i]][["x16"]]) > 2) {
    normal <- fitdist(grouplist[[i]][["x16"]], "norm")
    AIC[i] <- normal$aic
  }
}
Is there a better way to do it, or maybe a command in R that already exists? I am new to R and trying to learn it.
Thank you all.
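For reference, a minimal sketch of one way such a per-group AIC table could be built with dplyr and fitdistrplus (assuming dplyr >= 1.0; the fit_aic helper and the choice of candidate distributions here are purely illustrative, not from the question):

library(dplyr)
library(fitdistrplus)

# Illustrative helper: AIC of one fitted distribution, or NA if the fit fails
fit_aic <- function(x, dist) {
  if (length(x) < 3) return(NA_real_)
  tryCatch(fitdist(x, dist)$aic, error = function(e) NA_real_)
}

aic_table <- Dataframe %>%
  group_by(across(x1:x15)) %>%
  summarise(
    aic_norm  = fit_aic(x16, "norm"),
    aic_lnorm = fit_aic(x16, "lnorm"),
    aic_gamma = fit_aic(x16, "gamma"),
    .groups = "drop"
  )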

Related

Plotting an exponential function given one parameter

I'm fairly new to Python so bear with me. I have plotted a histogram using some generated data. This data has a great many points; I have stored it in the variable vals. I have then plotted a histogram of these values, though I have limited it so that only values between 104 and 155 are taken into account. This has been done as follows:
import numpy as np
import matplotlib.pyplot as plt

bin_heights, bin_edges = np.histogram(vals, range=[104, 155], bins=30)
bin_centres = (bin_edges[:-1] + bin_edges[1:]) / 2.
plt.errorbar(bin_centres, bin_heights, np.sqrt(bin_heights), fmt=',', capsize=2)
plt.xlabel(r"$m_{\gamma\gamma}$ (GeV)")
plt.ylabel("Number of entries")
plt.show()
Giving the above plot:
My next step is to take into account values from vals which are less than 120. I have done this as follows:
background_data=[j for j in vals if j <= 120] #to avoid taking the signal bump, upper limit of 120 GeV set
I need to plot a curve on the same plot as the histogram, which follows the form B(x) = Ae^(-x/λ)
I then estimated a value of λ using the maximum likelihood estimator formula:
background_data=[j for j in vals if j <= 120] #to avoid taking the signal bump, upper limit of 120 GeV set
#print(background_data)
N_background=len(background_data)
print(N_background)
sigma_background_data=sum(background_data)
print(sigma_background_data)
lamb = (sigma_background_data)/(N_background) #maximum likelihood estimator for lambda
print('lambda estimate is', lamb)
where lamb = λ. I got a value of roughly lamb = 27.75, which I know is correct. I now need to get an estimate for A.
I have been advised to do this as follows:
Given a value of λ, find A by scaling the PDF to the data such that the area beneath
the scaled PDF has equal area to the data
I'm not quite sure what this means, or how I'd go about trying to do this. PDF means probability density function. I assume an integration will have to take place, so to get the area under the data (vals), I have done this:
from scipy import integrate

data_area = integrate.cumtrapz(background_data, x=None, dx=1.0)
print(data_area)
plt.plot(background_data, data_area)
However, this gives me an error
ValueError: x and y must have same first dimension, but have shapes (981555,) and (981554,)
I'm not sure how to fix it. The end result should be something like:
See the cumtrapz docs:
Returns: ... If initial is None, the shape is such that the axis of integration has one less value than y. If initial is given, the shape is equal to that of y.
So you either need to pass an initial value, like
data_area = integrate.cumtrapz(background_data, x=None, dx=1.0, initial = 0.0)
or discard the first value of background_data:
plt.plot(background_data[1:], data_area)
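As for estimating A: one way to read that advice (a sketch, assuming lamb has already been estimated as above and that bin_heights/bin_edges come from the np.histogram call earlier) is to scale the curve A*exp(-x/lamb) so that it encloses the same area as the histogram bars over the plotted range:

import numpy as np
import matplotlib.pyplot as plt

bin_width = bin_edges[1] - bin_edges[0]
hist_area = bin_heights.sum() * bin_width     # total area under the histogram bars

x = np.linspace(104, 155, 500)
shape = np.exp(-x / lamb)                     # unscaled e^(-x/lambda)
A = hist_area / np.trapz(shape, x)            # scale factor so the areas match

plt.plot(x, A * shape, 'r-', label='scaled exponential')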

find ranges to create Uniform histogram

I need to find ranges in order to create a uniform histogram, e.g. splitting ages into 4 ranges:
data_set = [18,21,22,24,27,27,28,29,30,32,33,33,42,42,45,46]
is there a function that gives me the ranges so the histogram is uniform?
in this case
ranges = [(18,24), (27,29), (30,33), (42,46)]
This example is easy; I'd like to know if there is an algorithm that deals with more complex data sets as well.
Thanks
You are looking for the quantiles that split up your data equally. This, combined with cut, should work. So, suppose you want n groups.
set.seed(1)
x <- rnorm(1000) # Generate some toy data
n <- 10
uniform <- cut(x, c(-Inf, quantile(x, prob = (1:(n-1))/n), Inf)) # Determine the groups
plot(uniform)
Edit: now corrected to yield the correct cuts at the ends.
Edit2: I don't quite understand the downvote. But this also works in your example:
data_set = c(18,21,22,24,27,27,28,29,30,32,33,33,42,42,45,46)
n <- 4
groups <- cut(data_set, breaks = c(-Inf, quantile(data_set, prob = 1:(n-1)/n), Inf))
levels(groups)
Some minor renaming is necessary. For slightly better level names, you could also put in min(x) and max(x) instead of -Inf and Inf.
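A small sketch of that variant (include.lowest = TRUE is needed so the minimum value itself is not dropped):

groups <- cut(data_set,
              breaks = c(min(data_set), quantile(data_set, prob = 1:(n-1)/n), max(data_set)),
              include.lowest = TRUE)
table(groups)   # four groups of four values each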

Variable name to include value of another variable

Let's say I have pre-defined 3 variables, x1, x2, and x3, each of which is a different co-ordinate on the screen. I have a whole chunk of code to decide whether another variable, a, will equal 1, 2, or 3. Now, I want to include the value of a in a variable name, allowing me to 'dynamically' change between x1, x2, and x3.
E.g. a is set to 2. Now I want to move the mouse to xa, so if a=2, xa is x2, which is a predefined variable.
It's probably clear I am very new to Lua. I have tried googling the issue, but I'm not really sure what I am looking for, terminology-wise.
Anyhow, is anyone able to help me out?
If you can change the code where x1, x2 and x3 are defined, a cleaner approach is to use arrays (i.e. array-like tables). This is the general approach when you need a sequence of variables indexed by a number.
Therefore, instead of x1, x2 and x3 you could define:
local x = {}
x[1] = 10 -- instead of x1
x[2] = 20 -- instead of x2
x[3] = 30 -- instead of x3
Now instead of using xa you simply use x[a].
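For example, with the table above:

local a = 2
print(x[a])   -- prints 20, the value that used to live in x2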
If xa are global variables, you can use the table _G like this:
x1 = 42
x2 = 43
x3 = 44
local a = 2
print(_G['x' .. a])
Output:
43

Number sort using Min, Max and Variables

I am new to programming and I am trying to create a program that will take 3 random numbers X, Y and Z and sort them into ascending order, X being the lowest and Z the highest, using the Min and Max functions and a variable (tmp).
I know that there is a particular strategy I need to use that affects the (X,Y) pair first, then (Y,Z), then (X,Y) again, but I can't grasp the logic.
The closest I have got so far is...
y=min(y,z)
x=min(x,y)
tmp=max(y,z)
z=tmp
tmp=max(x,y)
y=tmp
x=min(x,y)
tmp=max(x,y)
y=tmp
I've tried so many different combinations, but it seems that the problem is unsolvable. Can anybody help?
You need to sort the X,Y pair first:
tmp=min(x,y)
y=max(x,y)
x=tmp
Then sort the Y,Z pair
tmp = min(y,z)
z=max(y,z)
y=tmp
Then, re-sort the X,Y pair (in case the original Z was the lowest value):
tmp=min(x,y)
y=max(x,y)
x=tmp
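A quick Python sketch (purely illustrative, not the website's own syntax) tracing the same three passes shows that even when the original Z is the lowest value, the triple ends up sorted:

def sort3(x, y, z):
    tmp = min(x, y); y = max(x, y); x = tmp   # sort the X,Y pair
    tmp = min(y, z); z = max(y, z); y = tmp   # sort the Y,Z pair
    tmp = min(x, y); y = max(x, y); x = tmp   # re-sort the X,Y pair
    return x, y, z

print(sort3(2, 3, 1))   # (1, 2, 3)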
If the commands you have mentioned are the only ones available on the website, and you can only use each one once, try:
# Sort X,Y pair
tmp=max(x,y)
x=min(x,y)
y=tmp
# Sort Y,Z pair
tmp=max(y,z)
y=min(y,z)
z=tmp
# Sort X,Y pair again.
tmp=max(x,y)
x=min(x,y)
y=tmp
Hope that helps.
I'm not sure if I understood your question correctly, but you are overwriting your variables. Or are you trying to solve some homework with the restriction to only use the min() and max() functions?
What about using a list?
tmp = [x, y, z]
tmp.sort()
x, y, z = tmp

Is it possible to optimize this Matlab code for doing vector quantization with centroids from k-means?

I've created a codebook using k-means of size 4000x300 (4000 centroids, each with 300 features). Using the codebook, I then want to label an input vector (for purposes of binning later on). The input vector is of size Nx300, where N is the total number of input instances I receive.
To compute the labels, I calculate the closest centroid for each of the input vectors. To do so, I compare each input vector against all centroids and pick the centroid with the minimum distance. The label is then just the index of that centroid.
My current Matlab code looks like:
function labels = assign_labels(centroids, X)
labels = zeros(size(X, 1), 1);
% for each X, calculate the distance from each centroid
for i = 1:size(X, 1)
% distance of X_i from all j centroids is: sum((X_i - centroid_j)^2)
% note: we leave off the sqrt as an optimization
distances = sum(bsxfun(@minus, centroids, X(i, :)) .^ 2, 2);
[value, label] = min(distances);
labels(i) = label;
end
However, this code is still fairly slow (for my purposes), and I was hoping there might be a way to optimize the code further.
One obvious issue is that there is a for-loop, which is the bane of good performance in Matlab. I've been trying to come up with a way to get rid of it, but with no luck (I looked into using arrayfun in conjunction with bsxfun, but haven't gotten that to work). Alternatively, if someone knows of any other way to speed this up, I would greatly appreciate it.
Update
After doing some searching, I couldn't find a great solution using Matlab, so I decided to look at what is used in Python's scikits.learn package for 'euclidean_distance' (shortened):
XX = sum(X * X, axis=1)[:, newaxis]
YY = Y.copy()
YY **= 2
YY = sum(YY, axis=1)[newaxis, :]
distances = XX + YY
distances -= 2 * dot(X, Y.T)
distances = maximum(distances, 0)
which uses the binomial form of the euclidean distance ((x-y)^2 -> x^2 + y^2 - 2xy), which from what I've read usually runs faster. My completely untested Matlab translation is:
XX = sum(data .* data, 2);
YY = sum(center .^ 2, 2);
[val, ~] = max(XX + YY - 2*data*center');
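For what it's worth, a sketch of how that translation might be completed so it actually returns labels (an assumption-laden completion reusing the data and center names from the snippet above; bsxfun expands the row and column vectors, and the minimum is taken along each row):

XX = sum(data .^ 2, 2);                            % N x 1
YY = sum(center .^ 2, 2)';                         % 1 x 4000
D  = bsxfun(@plus, XX, YY) - 2 * data * center';   % N x 4000 squared distances
[~, labels] = min(D, [], 2);                       % index of the closest centroid per row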
Use the following function to calculate your distances. You should see an order of magnitude speed-up.
The two matrices A and B have the columns as the dimensions and the rows as each point.
A is your matrix of centroids. B is your matrix of datapoints.
function D=getSim(A,B)
Qa=repmat(dot(A,A,2),1,size(B,1));
Qb=repmat(dot(B,B,2),1,size(A,1));
D=Qa+Qb'-2*A*B';
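Usage would then be something along these lines (min works down each column by default, so with A = centroids and B = X you get one label per datapoint):

D = getSim(centroids, X);     % 4000 x N matrix of squared distances
[~, labels] = min(D);         % 1 x N indices of the nearest centroids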
You can vectorize it by converting to cells and using cellfun:
[nRows,nCols]=size(X);
XCell=num2cell(X,2);
dist=reshape(cell2mat(cellfun(@(x)(sum(bsxfun(@minus,centroids,x).^2,2)),XCell,'UniformOutput',false)),size(centroids,1),nRows);
[~,labels]=min(dist);
Explanation:
We assign each row of X to its own cell in the second line
This piece, @(x)(sum(bsxfun(@minus,centroids,x).^2,2)), is an anonymous function which is the same as your distances=... line, and using cell2mat, we apply it to each row of X.
The labels are then the indices of the minimum row along each column.
For a true matrix implementation, you may consider trying something along the lines of:
P2 = kron(centroids, ones(size(X,1),1));
Q2 = kron(ones(size(centroids,1),1), X);
distances = reshape(sum((Q2-P2).^2,2), size(X,1), size(centroids,1));
Note
This assumes the data is organized as [x1 y1 ...; x2 y2 ...;...]
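The labels would then presumably come from the row-wise minimum:

[~, labels] = min(distances, [], 2);   % nearest centroid per row of X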
You can use a more efficient algorithm for nearest neighbor search than brute force.
The most popular approach is the k-d tree: O(log(n)) average query time instead of the O(n) brute-force complexity.
Regarding a Matlab implementation of k-d trees, you can have a look here
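If the Statistics and Machine Learning Toolbox is available, knnsearch wraps this kind of search up (it builds a k-d tree for low-dimensional data, though with 300 features it will typically fall back to an exhaustive search):

% Assumes the Statistics and Machine Learning Toolbox
labels = knnsearch(centroids, X);   % index of the nearest centroid for each row of X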