How can I arrange bars on the X-axis in numerical order? - ggplot2

The ideal order of the bars on the X-axis is (S1, S2, S3, S4, S5, S6, S7, S8, S9, S10), but I just cannot get them to sort that way :(
ggplot(data=T1, aes(x=Scenarios, y=Yields, fill=Scenarios)) +
geom_bar(stat="identity") +
scale_fill_manual(values=cbPalette)

Under the hood, ggplot converts the strings to factors, and the default levels are the strings sorted alphabetically (or, depending on how you create T1, the conversion may already happen at that point). Alphabetically, "S10" sorts right after "S1", which is why the bars come out as S1, S10, S2, S3, and so on. If you want to customize the sort order, explicitly cast your x variable as a factor with the levels in the order you prefer.
Either ahead of time:
T1 <- dplyr::mutate(T1, Scenarios = factor(Scenarios, levels = unique(Scenarios)))
or inline in the ggplot call:
ggplot(data=T1, aes(x=factor(Scenarios, levels=unique(Scenarios)), y=Yields, fill=Scenarios)) +
geom_bar(stat="identity")

Can I use Lattice auto.key or key to make a legend with points for some data and lines for others?

I often make figures that have observed data represented as points and model-predicted data represented as lines, using distribute.type to assign the graph types. Is there a way to make a legend that shows only points for the points data and only lines for the lines data? The auto.key default is points, and if I add lines with list(lines=TRUE), the legend shows both points and lines for every data label:
x <- seq(0, 8*pi, by=pi/6)
Y1pred <- sin(x)
Y1obs <- Y1pred + rnorm(length(x), mean=0, sd=0.2)
Y2pred <- cos(x)
Y2obs <- Y2pred + rnorm(length(x), mean=0, sd=0.4)
xyplot(Y1obs + Y2obs + Y1pred + Y2pred ~ x,
       type = c('p', 'p', 'l', 'l'),
       distribute.type = TRUE,
       auto.key = list(lines = TRUE, columns = 2))
There is a rather complicated example using key on p. 158 of Deepayan's book on Lattice; I am wondering if there is a simpler option?
Yes, following S, the lines component of key supports different type values (the points component does not). Using auto.key, you could do
xyplot(Y1obs + Y2obs + Y1pred + Y2pred ~ x,
       type = c('p', 'p', 'l', 'l'),
       distribute.type = TRUE,
       auto.key = list(points = FALSE, lines = TRUE,
                       columns = 2,
                       type = c('p', 'p', 'l', 'l')))
Ideally, you would want to put the type only inside the lines component, and that is how you should do it if you use key directly. For auto.key there can be only one type anyway, so this should be fine.

Plotting an exponential function given one parameter

I'm fairly new to Python, so bear with me. I have plotted a histogram using some generated data. This data set has very many points, and I have assigned it to the variable vals. I then plotted a histogram of these values, limited so that only values between 104 and 155 are taken into account. This was done as follows:
bin_heights, bin_edges = np.histogram(vals, range=[104, 155], bins=30)
bin_centres = (bin_edges[:-1] + bin_edges[1:])/2.
plt.errorbar(bin_centres, bin_heights, np.sqrt(bin_heights), fmt=',', capsize=2)
plt.xlabel(r"$m_{\gamma\gamma}$ (GeV)")
plt.ylabel("Number of entries")
plt.show()
This produces the histogram plot.
My next step is to take into account values from vals which are less than 120. I have done this as follows:
background_data = [j for j in vals if j <= 120]  # avoid the signal bump: upper limit of 120 GeV
I need to plot a curve on the same plot as the histogram, which follows the form B(x) = Ae^(-x/λ)
I then estimated a value of λ using the maximum likelihood estimator formula:
background_data = [j for j in vals if j <= 120]  # avoid the signal bump: upper limit of 120 GeV
#print(background_data)
N_background = len(background_data)
print(N_background)
sigma_background_data = sum(background_data)
print(sigma_background_data)
lamb = sigma_background_data / N_background  # maximum likelihood estimator for lambda
print('lambda estimate is', lamb)
where lamb = λ. I got a value of roughly lamb = 27.75, which I know is correct. I now need to get an estimate for A.
I have been advised to do this as follows:
Given a value of λ, find A by scaling the PDF to the data such that the area beneath the scaled PDF equals the area of the data.
I'm not quite sure what this means, or how I'd go about trying to do this. PDF means probability density function. I assume an integration will have to take place, so to get the area under the data (vals), I have done this:
data_area= integrate.cumtrapz(background_data, x=None, dx=1.0)
print(data_area)
plt.plot(background_data, data_area)
However, this gives me an error
ValueError: x and y must have same first dimension, but have shapes (981555,) and (981554,)
I'm not sure how to fix it. The end result should be something like:
See the cumtrapz docs:
Returns: ... If initial is None, the shape is such that the axis of integration has one less value than y. If initial is given, the shape is equal to that of y.
So you should either pass an initial value, like
data_area = integrate.cumtrapz(background_data, x=None, dx=1.0, initial=0.0)
or discard the first value of background_data:
plt.plot(background_data[1:], data_area)
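As for estimating A itself: one way to read the advice you were given is to choose A so that the area under A*exp(-x/lamb) over the plotted range equals the area under the histogram (total counts times bin width). A minimal sketch, assuming the bin_heights, bin_edges and lamb variables from the question:
import numpy as np
import matplotlib.pyplot as plt
# area under the histogram = total number of entries times the bin width
bin_width = bin_edges[1] - bin_edges[0]
hist_area = bin_heights.sum() * bin_width
# area under exp(-x/lamb) over the same plotted range, by numerical integration
x = np.linspace(bin_edges[0], bin_edges[-1], 200)
curve_area = np.trapz(np.exp(-x / lamb), x)
A = hist_area / curve_area  # scale factor that matches the two areas
plt.plot(x, A * np.exp(-x / lamb))
For a background-only estimate you would match the areas over the background region (below 120) rather than the full plotted range.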

Find or calculate intersection points of a straight line with a diagonal scatter plot using VBA

I am trying to understand how I can go about finding or calculating the intersection points of a straight line and a diagonal scatter plot. To give a better idea: on an X,Y plot, if I have a horizontal straight line at y = # (any number) that crosses an array of scattered points (which form a diagonal line), how can I calculate the points of intersection of the two lines?
The problem I am having is that the scattered array has multiple points around my horizontal line; what I would like to do is find the point that hits the horizontal line first and the point that hits it last.
Please refer to the image for a better understanding. The two annotated points are the ones that I am trying to extract with VBA. Is this possible? The image shows two scattered arrays, but I am only interested in the method for one of them; if I can extract the points for one array, I can replicate the method for the other.
http://imgur.com/9YTNeco
It's hard to give you any specifics without knowing the structure of your data, but this is the approach I'd use.
I'll assume your data looks like this (for both of the plots):
A B
x1 y1
x2 y2
x3 y3
Loop through the rows like so:
'the y values need to match the horizontal line you've drawn
'I'll assume that line sits at y = 0
Dim i As Long, k As Double
With Worksheets("Sheet name")
    k = .Cells(1, 1).Value
    'start at the first x-value in your column
    For i = 1 To .UsedRange.Rows.Count
        'only consider points sitting on the horizontal line
        If .Cells(i, 2).Value = 0 Then '0 = y-value of the horizontal line
            'every time we find a lower x than our current k, keep it
            If .Cells(i, 1).Value < k Then
                k = .Cells(i, 1).Value
            End If
        End If
    Next i
End With
The lowest such value will be your "low limit" point.
You can use the same kind of algorithm for the highest value of the same scatter plot (just change the "<" to ">"). For the other scattered array, just change the column IDs.
HTH

Find ranges to create a uniform histogram

I need to find ranges in order to create a uniform histogram,
e.g. ages, split into 4 ranges:
data_set = [18,21,22,24,27,27,28,29,30,32,33,33,42,42,45,46]
Is there a function that gives me the ranges so that the histogram is uniform?
In this case:
ranges = [(18,24), (27,29), (30,33), (42,46)]
This example is easy; I'd like to know if there is an algorithm that deals with complex data sets as well.
thanks
You are looking for the quantiles that split your data into equal-sized groups. This, combined with cut, should work. So, suppose you want n groups.
set.seed(1)
x <- rnorm(1000) # Generate some toy data
n <- 10
uniform <- cut(x, c(-Inf, quantile(x, prob = (1:(n-1))/n), Inf)) # Determine the groups
plot(uniform)
Edit: now corrected to yield the correct cuts at the ends.
Edit2: I don't quite understand the downvote. But this also works in your example:
data_set = c(18,21,22,24,27,27,28,29,30,32,33,33,42,42,45,46)
n <- 4
groups <- cut(data_set, breaks = c(-Inf, quantile(data_set, prob = 1:(n-1)/n), Inf))
levels(groups)
With some minor renaming necessary. For slightly better level names, you could also put in min(x) and max(x) instead of -Inf and Inf.
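Since the example data is written in Python syntax, here is a rough numpy equivalent of the same quantile idea (a sketch under that assumption; the names are my own):
import numpy as np
data_set = np.array([18,21,22,24,27,27,28,29,30,32,33,33,42,42,45,46])
n = 4
# interior cut points that split the sorted data into n equal-count groups
edges = np.quantile(data_set, np.arange(1, n) / n)
groups = np.digitize(data_set, edges)  # group index (0 to n-1) for each value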

Is it possible to optimize this Matlab code for doing vector quantization with centroids from k-means?

I've created a codebook using k-means of size 4000x300 (4000 centroids, each with 300 features). Using the codebook, I then want to label an input vector (for purposes of binning later on). The input vector is of size Nx300, where N is the total number of input instances I receive.
To compute the labels, I calculate the closest centroid for each of the input vectors. To do so, I compare each input vector against all centroids and pick the centroid with the minimum distance. The label is then just the index of that centroid.
My current Matlab code looks like:
function labels = assign_labels(centroids, X)
labels = zeros(size(X, 1), 1);
% for each X, calculate the distance from each centroid
for i = 1:size(X, 1)
    % distance of X_i from all j centroids is: sum((X_i - centroid_j)^2)
    % note: we leave off the sqrt as an optimization
    distances = sum(bsxfun(@minus, centroids, X(i, :)) .^ 2, 2);
    [value, label] = min(distances);
    labels(i) = label;
end
However, this code is still fairly slow (for my purposes), and I was hoping there might be a way to optimize the code further.
One obvious issue is the for-loop, which is the bane of good performance in Matlab. I've been trying to come up with a way to get rid of it, but with no luck (I looked into using arrayfun in conjunction with bsxfun, but haven't gotten that to work). Alternatively, if someone knows of any other way to speed this up, I would greatly appreciate it.
Update
After doing some searching, I couldn't find a great solution using Matlab, so I decided to look at what is used in Python's scikits.learn package for 'euclidean_distance' (shortened):
XX = sum(X * X, axis=1)[:, newaxis]
YY = Y.copy()
YY **= 2
YY = sum(YY, axis=1)[newaxis, :]
distances = XX + YY
distances -= 2 * dot(X, Y.T)
distances = maximum(distances, 0)
which uses the binomial expansion of the Euclidean distance ((x-y)^2 = x^2 + y^2 - 2xy), which from what I've read usually runs faster. My (completely untested) Matlab translation is:
XX = sum(data .* data, 2);                                % N x 1
YY = sum(center .^ 2, 2);                                 % K x 1
distances = bsxfun(@plus, XX, YY') - 2 * data * center';  % N x K
[val, labels] = min(distances, [], 2);                    % nearest centroid per row
Use the following function to calculate your distances; you should see an order-of-magnitude speed-up.
The two matrices A and B have the columns as the dimensions and the rows as the points.
A is your matrix of centroids, B is your matrix of data points.
function D = getSim(A, B)
Qa = repmat(dot(A, A, 2), 1, size(B, 1));  % squared norms of the centroids
Qb = repmat(dot(B, B, 2), 1, size(A, 1));  % squared norms of the data points
D = Qa + Qb' - 2 * A * B';                 % squared distances, centroids x points
You can vectorize it by converting to cells and using cellfun:
[nRows, nCols] = size(X);
XCell = num2cell(X, 2);  % one row of X per cell
dist = reshape(cell2mat(cellfun(@(x) sum(bsxfun(@minus, centroids, x).^2, 2), XCell, 'UniformOutput', false)), [], nRows);
[~, labels] = min(dist);
Explanation:
We assign each row of X to its own cell in the second line.
This piece @(x) sum(bsxfun(@minus, centroids, x).^2, 2) is an anonymous function that is the same as your distances=... line; cellfun applies it to each row of X, and cell2mat glues the results back into one matrix.
The labels are then the indices of the minimum along each column.
For a true matrix implementation, you may consider trying something along the lines of:
P2 = kron(centroids, ones(size(X,1),1));
Q2 = kron(ones(size(centroids,1),1), X);
distances = reshape(sum((Q2-P2).^2,2), size(X,1), size(centroids,1));
Note
This assumes the data is organized as [x1 y1 ...; x2 y2 ...;...]
You can use a more efficient algorithm for nearest-neighbor search than brute force. The most popular approach is the k-d tree, with O(log n) average query time instead of the O(n) brute-force complexity.
Regarding a Matlab implementation of k-d trees, you can have a look here
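For illustration only, this is roughly what such a nearest-centroid query looks like in Python with scipy's k-d tree (a sketch with made-up sizes, not the Matlab implementation referenced above):
import numpy as np
from scipy.spatial import cKDTree
centroids = np.random.rand(4000, 300)  # stand-in for the k-means codebook
X = np.random.rand(1000, 300)          # stand-in for the input vectors
tree = cKDTree(centroids)              # build the tree once
_, labels = tree.query(X)              # index of the nearest centroid for each row
Be aware that k-d trees lose their advantage in very high dimensions, so with 300 features the speed-up over a vectorized brute-force search may be modest.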