Plotting extremely small Y-axis values in ggplot - ggplot2

My data contains some extremely small values, as small as 1e-19. When I plot the data, those small values cluster around zero.
Suppose I have 21 data points: 20 of them range from 1e-05 to 1e-02 and one has the value 1.
my.df <- data.frame(group=state.name[1:21],col1 = c(runif(20,min=1e-05,max=1e-02),1))
p <- ggplot(my.df, aes(x = group, y = col1)) +
  geom_point()
I want more informative y-axis breaks so that the small values spread out instead of being stuck around zero. Preferably, the y values would increase on a geometric scale, such as 1e-05, 1e-04, 1e-03, 1e-02, 1e-01, 1.
Thank you for your support in advance!
PS: I carefully checked whether this is a duplicate. I hope I haven't missed an existing question.
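For what it's worth, the usual fix here is a log10 y scale; in ggplot2 that is scale_y_log10(), which produces exactly the 1e-05 ... 1 breaks asked for. A minimal sketch of the same idea in Python/matplotlib (the data is hypothetical, mirroring the R example above):
import numpy as np
import matplotlib.pyplot as plt

# hypothetical data mirroring the R example: 20 small values plus a single 1.0
rng = np.random.default_rng(0)
values = np.append(rng.uniform(1e-05, 1e-02, size=20), 1.0)

plt.scatter(range(len(values)), values)
plt.yscale('log')  # geometric y spacing: 1e-05, 1e-04, ..., 1e-01, 1
plt.ylabel('col1')
plt.show()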

Related

Plotting an exponential function given one parameter

I'm fairly new to Python, so bear with me. I have plotted a histogram using some generated data with a very large number of points, stored in the variable vals. I then plotted a histogram of these values, limited so that only values between 104 and 155 are taken into account. This has been done as follows:
bin_heights, bin_edges = np.histogram(vals, range=[104, 155], bins=30)
bin_centres = (bin_edges[:-1] + bin_edges[1:])/2.
plt.errorbar(bin_centres, bin_heights, np.sqrt(bin_heights), fmt=',', capsize=2)
plt.xlabel(r"$m_{\gamma\gamma} (GeV)$")  # raw string avoids the invalid \g escape
plt.ylabel("Number of entries")
plt.show()
This gives the plot shown in the original post (histogram image omitted here).
My next step is to take into account values from vals which are less than 120. I have done this as follows:
background_data=[j for j in vals if j <= 120] #to avoid taking the signal bump, upper limit of 120 MeV set
I need to plot a curve on the same plot as the histogram, which follows the form B(x) = Ae^(-x/λ)
I then estimated a value of λ using the maximum likelihood estimator formula:
background_data=[j for j in vals if j <= 120] #to avoid taking the signal bump, upper limit of 120 MeV set
#print(background_data)
N_background=len(background_data)
print(N_background)
sigma_background_data=sum(background_data)
print(sigma_background_data)
lamb = (sigma_background_data)/(N_background) #maximum likelihood estimator for lambda
print('lambda estimate is', lamb)
where lamb = λ. I got a value of roughly lamb = 27.75, which I know is correct. I now need to get an estimate for A.
I have been advised to do this as follows:
Given a value of λ, find A by scaling the PDF to the data such that the area beneath
the scaled PDF has equal area to the data
I'm not quite sure what this means, or how I'd go about trying to do this. PDF means probability density function. I assume an integration will have to take place, so to get the area under the data (vals), I have done this:
data_area= integrate.cumtrapz(background_data, x=None, dx=1.0)
print(data_area)
plt.plot(background_data, data_area)
However, this gives me an error
ValueError: x and y must have same first dimension, but have shapes (981555,) and (981554,)
I'm not sure how to fix it. The end result should look something like the example plot from the original post (image omitted here).
See the cumtrapz docs:
Returns: ... If initial is None, the shape is such that the axis of integration has one less value than y. If initial is given, the shape is equal to that of y.
So you should either pass an initial value, like
data_area = integrate.cumtrapz(background_data, x=None, dx=1.0, initial = 0.0)
or discard the first value of background_data:
plt.plot(background_data[1:], data_area)
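As for estimating A itself, one reading of the "equal area" advice: the area under B(x) = A*e^(-x/λ) between a and b is A*λ*(e^(-a/λ) - e^(-b/λ)), so matching that to the histogram's area (sum of bin heights times bin width) gives A directly. A rough sketch under those assumptions, reusing lamb, bin_heights and bin_edges from the code above:
import numpy as np
import matplotlib.pyplot as plt

bin_width = bin_edges[1] - bin_edges[0]
hist_area = bin_heights.sum() * bin_width  # area under the histogram

# area under e^(-x/lamb) over the plotted range [a, b]
a, b = bin_edges[0], bin_edges[-1]
curve_area = lamb * (np.exp(-a / lamb) - np.exp(-b / lamb))

A = hist_area / curve_area  # scale so the two areas match
x = np.linspace(a, b, 200)
plt.plot(x, A * np.exp(-x / lamb))
Whether to match the area over the full plotted range or only the background region below 120 depends on the exercise, so treat the choice of a and b here as an assumption.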

How to calculate the slope of a line

I am trying to calculate the slope of the line for a 50-day EMA that I created from the adjusted closing price on a few stocks I downloaded using the getSymbols function.
My EMA looks like this:
getSymbols("COLUM.CO")
COLUM.CO$EMA <- EMA(COLUM.CO[,6],n=50)
This gives me an extra column that contains the 50 day EMA on the adjusted closing price. Now I would like to include an additional column that contains the slope of this line. I'm sure it's a fairly easy answer, but I would really appreciate some help on this. Thank you in advance.
A good way to do this is with rolling least-squares regression. rollSFM does a fast and efficient job of computing the slope of a series. It usually makes sense to look at the slope in relation to units of price activity in time (bars), so x can simply be equally spaced points.
The only tricky part is working out an effective value of n, the length of the window over which you fit the slope.
library(quantmod)
getSymbols("AAPL")
AAPL$EMA <- EMA(Ad(AAPL),n=50)
# Compute slope over 50 bar lookback:
AAPL <- merge(AAPL, rollSFM(Ra = AAPL[, "EMA"],
                            Rb = 1:nrow(AAPL), n = 50))
The column labeled beta contains the rolling-window value of the slope (alpha contains the intercept, r.squared contains the R-squared value).
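For illustration only, the same trailing-window least-squares slope can be computed by hand; a sketch in Python/numpy (ema is a hypothetical stand-in for the EMA series, n the window length):
import numpy as np

def rolling_slope(y, n):
    # least-squares slope of y against equally spaced x over a trailing window
    x = np.arange(n)
    xc = x - x.mean()
    slopes = np.full(len(y), np.nan)
    for i in range(n - 1, len(y)):
        w = y[i - n + 1 : i + 1]
        # closed-form OLS slope: cov(x, w) / var(x)
        slopes[i] = np.dot(xc, w - w.mean()) / np.dot(xc, xc)
    return slopes

ema = np.cumsum(np.random.randn(300))  # hypothetical price-like series
slope = rolling_slope(ema, 50)         # NaN for the first n-1 bars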

Find or calculate intersection points of a straight line with a diagonal scatter plot using VBA

I am trying to understand how I can go about finding or calculating the intersection points of a straight line and a diagonal scatter plot. To give a better idea: on an X,Y plot, if I have a horizontal line at y = # (any number) that crosses an array of scattered points (which form a diagonal line), how can I calculate the points of intersection of the two lines?
The problem I am having is that the scattered array has multiple points around my horizontal line. What I would like to do is find the point that hits the horizontal line first and the point that hits it last.
Please refer to the image for a better understanding. The two annotated points are the ones I am trying to extract with VBA. Is this possible? The image shows two sets of scattered arrays; I am only interested in figuring out the method for one of the arrays. If I can extract this for one scattered array, I can replicate the method for the other.
http://imgur.com/9YTNeco
It's hard to give you any specifics without knowing the structure of your data, but this is the approach I'd use.
I'll assume your data looks like this (for both of the plots):
A B
x1 y1
x2 y2
x3 y3
Loop through the axis like so:
'the y-value of the horizontal line needs to match the black axis you've got there
'I'll assume that's zero
Dim i As Long, k As Double
With Worksheets("Sheet name")
    'we begin at the first x-value in your column
    k = .Cells(1, 1).Value
    For i = 1 To .UsedRange.Rows.Count
        'now we are looking for the lowest value of x; k will be this value
        If .Cells(i, 2).Value = 0 Then '0 = y-value of the "black" axis
            If .Cells(i, 1).Value < k Then
                'every time we find a lower value than our existing k,
                'we assign it to k
                k = .Cells(i, 1).Value
            End If
        End If
    Next i
End With
The lowest value will be your "low limit"-point.
You can use the same kind of algorithm for the highest value of the same scatter plot (just change the "<" to ">"). For the other scattered array, just change the column IDs.
HTH
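For comparison, the same first-hit/last-hit logic as a Python sketch (hypothetical x, y arrays and a line level y0; a tolerance stands in for exact equality):
import numpy as np

def line_hits(x, y, y0, tol=1e-9):
    # smallest and largest x whose y lies on the horizontal line y0
    on_line = np.abs(y - y0) <= tol
    if not on_line.any():
        return None
    hits = x[on_line]
    return hits.min(), hits.max()

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.5, 0.0, 0.3, 0.0, 0.9])
print(line_hits(x, y, 0.0))  # smallest and largest hit: 2.0 and 4.0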

Is it possible to optimize this Matlab code for doing vector quantization with centroids from k-means?

I've created a codebook using k-means of size 4000x300 (4000 centroids, each with 300 features). Using the codebook, I then want to label an input vector (for purposes of binning later on). The input vector is of size Nx300, where N is the total number of input instances I receive.
To compute the labels, I calculate the closest centroid for each of the input vectors. To do so, I compare each input vector against all centroids and pick the centroid with the minimum distance. The label is then just the index of that centroid.
My current Matlab code looks like:
function labels = assign_labels(centroids, X)
labels = zeros(size(X, 1), 1);
% for each X, calculate the distance from each centroid
for i = 1:size(X, 1)
    % distance of X_i from all j centroids is: sum((X_i - centroid_j)^2)
    % note: we leave off the sqrt as an optimization
    distances = sum(bsxfun(@minus, centroids, X(i, :)) .^ 2, 2);
    [value, label] = min(distances);
    labels(i) = label;
end
However, this code is still fairly slow (for my purposes), and I was hoping there might be a way to optimize the code further.
One obvious issue is that there is a for-loop, which is the bane of good performance in Matlab. I've been trying to come up with a way to get rid of it, but with no luck (I looked into using arrayfun in conjunction with bsxfun, but haven't gotten that to work). Alternatively, if someone knows of any other way to speed this up, I would greatly appreciate it.
Update
After doing some searching, I couldn't find a great solution using Matlab, so I decided to look at what is used in Python's scikits.learn package for 'euclidean_distance' (shortened):
XX = sum(X * X, axis=1)[:, newaxis]
YY = Y.copy()
YY **= 2
YY = sum(YY, axis=1)[newaxis, :]
distances = XX + YY
distances -= 2 * dot(X, Y.T)
distances = maximum(distances, 0)
which uses the binomial form of the euclidean distance ((x-y)^2 -> x^2 + y^2 - 2xy), which from what I've read usually runs faster. My completely untested Matlab translation is:
XX = sum(data .* data, 2);   % N x 1
YY = sum(center .^ 2, 2);    % K x 1
% pairwise squared distances (N x K) via x^2 + y^2 - 2xy
distances = bsxfun(@plus, XX, YY') - 2 * data * center';
[~, labels] = min(distances, [], 2);
Use the following function to calculate your distances; you should see an order-of-magnitude speed-up.
The two matrices A and B have the columns as the dimensions and the rows as each point.
A is your matrix of centroids. B is your matrix of datapoints.
function D = getSim(A, B)
% pairwise squared distances between rows of A and rows of B
Qa = repmat(dot(A, A, 2), 1, size(B, 1));  % ||a||^2, replicated across columns
Qb = repmat(dot(B, B, 2), 1, size(A, 1));  % ||b||^2, replicated across columns
D = Qa + Qb' - 2 * A * B';                 % x^2 + y^2 - 2xy
end
You can vectorize it by converting to cells and using cellfun:
[nRows, nCols] = size(X);
XCell = num2cell(X, 2);
dist = reshape(cell2mat(cellfun(@(x) sum(bsxfun(@minus, centroids, x).^2, 2), ...
    XCell, 'UniformOutput', false)), size(centroids, 1), nRows);
[~, labels] = min(dist);
Explanation:
We assign each row of X to its own cell in the second line
This piece @(x)(sum(bsxfun(@minus,centroids,x).^2,2)) is an anonymous function which is the same as your distances=... line, and using cell2mat, we apply it to each row of X.
The labels are then the indices of the minimum row along each column.
For a true matrix implementation, you may consider trying something along the lines of:
P2 = kron(centroids, ones(size(X,1),1));
Q2 = kron(ones(size(centroids,1),1), X);
distances = reshape(sum((Q2-P2).^2,2), size(X,1), size(centroids,1));
Note
This assumes the data is organized as [x1 y1 ...; x2 y2 ...;...]
You can use a more efficient algorithm for nearest-neighbor search than brute force.
The most popular approach is the k-d tree, with O(log n) average query time instead of the O(n) brute-force complexity.
Regarding a Matlab implementation of k-d trees, you can have a look here.
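For comparison, in Python this is a few lines with scipy's cKDTree (a sketch, with hypothetical random data in the shapes from the question). One caveat: in 300 dimensions a k-d tree loses much of its advantage over brute force, so it is worth benchmarking both:
import numpy as np
from scipy.spatial import cKDTree

centroids = np.random.rand(4000, 300)  # hypothetical codebook
X = np.random.rand(1000, 300)          # hypothetical input vectors

tree = cKDTree(centroids)      # build once, query many times
dists, labels = tree.query(X)  # index of the nearest centroid for each row of X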

Programming ranging data

I have two inputs, x and y, each in the range 0-180. I need to add them together and stay in the range 0-180. I am having some trouble: since 90 is the midpoint, I can't seem to keep my data in that range. I'm doing this in VB.NET, but I mainly need help with the logic.
result = (x + y) / 2
Perhaps? At least that will stay in the 0-180 range. Are there any other constraints you're not telling us about? Right now this seems pretty obvious.
If you want to map the two values to the limited range in a linear fashion, just add them together and divide by two:
out = (in1 + in2) / 2
If you just want to limit the top end, add them together and then take the minimum of that and 180:
out = min (180, in1 + in2)
Are you wanting to find the average of the two or add them? If you're adding them, and you're dealing with angles which wrap around (which is what it sounds like), then why not just add them and take the modulo? Like this:
(in1 + in2) mod 180
Hopefully you're familiar with the modulo operator.
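A quick sketch contrasting the three suggestions (Python for illustration; in VB.NET the equivalents would be division, Math.Min and the Mod operator):
x, y = 170, 40

print((x + y) / 2)      # average: 105.0 -- always lands in 0-180
print(min(180, x + y))  # clamp:   180   -- saturates at the top end
print((x + y) % 180)    # wrap:    30    -- wraps around like an angle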