Determining two optimal cutpoints from a U-shaped restricted cubic spline curve

I am trying to determine two optimal cutpoints based on a restricted cubic spline, where a U-shaped association was found between a risk factor (X) and all-cause mortality. I used the optimal equal-HR method from the "CutpointsOEHR" package.
result <-coxph(Surv(survival.death,endpoint.death)~pspline(X,df=0,caic=TRUE)+X1+X2,data=indf)
termplot(result,se=TRUE,col.term=1,ylab='log relative hazard')
# The two lines above run fine.
cuts <- findcutpoints(cox_pspline_fit = result, data = dataset,nquantile = 100, exclude = 0.05,eps = 0.01,shape='U')
# But the findcutpoints call above fails with this error:
#Error in if (missing(data) | class(data) != "data.frame") { :
the condition has length > 1

Plotting an exponential function given one parameter

I'm fairly new to Python, so bear with me. I have plotted a histogram of some generated data, which contains a very large number of points and is stored in the variable vals. The histogram only takes into account values between 104 and 155. This has been done as follows:
bin_heights, bin_edges = np.histogram(vals, range=[104, 155], bins=30)
bin_centres = (bin_edges[:-1] + bin_edges[1:])/2.
plt.errorbar(bin_centres, bin_heights, np.sqrt(bin_heights), fmt=',', capsize=2)
plt.xlabel(r"$m_{\gamma\gamma} (GeV)$")
plt.ylabel("Number of entries")
plt.show()
Giving the above plot:
My next step is to take into account values from vals which are less than 120. I have done this as follows:
background_data=[j for j in vals if j <= 120] #to avoid taking the signal bump, upper limit of 120 GeV set
I need to plot a curve on the same plot as the histogram, which follows the form B(x) = Ae^(-x/λ)
I then estimated a value of λ using the maximum likelihood estimator formula:
background_data=[j for j in vals if j <= 120] #to avoid taking the signal bump, upper limit of 120 GeV set
#print(background_data)
N_background=len(background_data)
print(N_background)
sigma_background_data=sum(background_data)
print(sigma_background_data)
lamb = (sigma_background_data)/(N_background) #maximum likelihood estimator for lambda
print('lambda estimate is', lamb)
where lamb = λ. I got a value of roughly lamb = 27.75, which I know is correct. I now need to get an estimate for A.
I have been advised to do this as follows:
Given a value of λ, find A by scaling the PDF to the data such that the area beneath
the scaled PDF has equal area to the data
I'm not quite sure what this means, or how I'd go about trying to do this. PDF means probability density function. I assume an integration will have to take place, so to get the area under the data (vals), I have done this:
data_area= integrate.cumtrapz(background_data, x=None, dx=1.0)
print(data_area)
plt.plot(background_data, data_area)
However, this gives me an error
ValueError: x and y must have same first dimension, but have shapes (981555,) and (981554,)
I'm not sure how to fix it. The end result should be something like:
See the cumtrapz docs:
Returns: ... If initial is None, the shape is such that the axis of integration has one less value than y. If initial is given, the shape is equal to that of y.
So you can either pass an initial value, like
data_area = integrate.cumtrapz(background_data, x=None, dx=1.0, initial = 0.0)
or discard the first value of the background_data:
plt.plot(background_data[1:], data_area)
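As for the original goal of estimating A, the advice quoted in the question (scale the PDF so that its area matches the data) can be read as follows: the area under the histogram is the sum of bin height times bin width, and the area under B(x) = A e^(-x/λ) between the histogram limits a and b is Aλ(e^(-a/λ) - e^(-b/λ)). Setting the two equal gives A. The snippet below is only a minimal sketch of that reading. The helper name scale_factor, the synthetic vals, and the value lamb = 30.0 are made up for illustration and are not from the question.

```python
import numpy as np

def scale_factor(bin_heights, bin_edges, lamb):
    """Return A such that the area under A*exp(-x/lamb) over the
    histogram range equals the area of the histogram itself."""
    widths = np.diff(bin_edges)
    data_area = np.sum(bin_heights * widths)           # area under the histogram
    lo, hi = bin_edges[0], bin_edges[-1]
    # Integral of exp(-x/lamb) from lo to hi (the unscaled curve).
    unit_area = lamb * (np.exp(-lo / lamb) - np.exp(-hi / lamb))
    return data_area / unit_area

# Synthetic stand-in for the real data (assumption, illustration only).
rng = np.random.default_rng(0)
vals = 104 + rng.exponential(scale=30.0, size=10000)
lamb = 30.0  # stand-in for the lambda estimate

bin_heights, bin_edges = np.histogram(vals, range=[104, 155], bins=30)
bin_centres = (bin_edges[:-1] + bin_edges[1:]) / 2.0

A = scale_factor(bin_heights, bin_edges, lamb)
curve = A * np.exp(-bin_centres / lamb)  # values to overlay on the histogram
```

The curve array can then be drawn with plt.plot(bin_centres, curve) on the same axes as the errorbar plot. If only the background region should be matched, the same helper can be applied to a histogram restricted to that range.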

How to perform raster calculation (e.g. aspect) on subset of raster based on point intersection in R

I'm working with some raster data in R using the raster package. I want to calculate and extract some geographic information (e.g., slope, aspect) from the raster, but only at specific points (I also have some data as a SpatialPointsDataFrame at which I want to calculate slope/aspect/etc.). I'm doing this for several high-resolution rasters, and it seems like a poor use of resources to calculate this for every raster cell when I only need maybe 5-10% of them.
I thought maybe the raster::stackApply function might work, but that seems to perform calculations on subsets of a rasterBrick rather than calculations on subsets of a single raster based on point locations (please correct me if I'm wrong). I also thought I could do a for loop, where I extract the surrounding cells nearest each point of interest, and iteratively calculate slope/aspect that way. That seems clunky, and I was hoping for a more elegant or built-in solution, but it should work.
These are my thoughts so far on the for loop, but I'm not sure how best to even do this.
# Attach packages
library(rgdal)
library(raster)
# Generate example raster data
r = raster()
set.seed(0)
values(r) = runif(ncell(r), min = 0, max = 1000)
# Generate example point data
df.sp = SpatialPoints(
coords = cbind(runif(25, min = -100, max = 100),
runif(25, min = -50, max = 50)),
proj4string = crs(r))
# Iterate on each row of SpatialPoints
for (i in 1:nrow(df.sp)) {
# Find cell index of current SpatialPoint
cell.idx = raster::extract(r, df.sp[i,], cellnumbers = TRUE)[1]
# Find indices of cells surrounding point of interest
neighbors.idx = raster::adjacent(r, cell.idx, directions = 16)
# Get DEM values for cell and surrounding cells
vals.local = r[c(cell.idx, neighbors.idx[,2])]
# Somehow convert this back to an appropriate georeferenced matrix
#r.local = ...
# Perform geometric calculations on local raster
#r.stack = terrain(r.local, opt = c('slope', 'aspect'))
# Remaining data extraction, etc. (I can take it from here...)
}
In summary, I need a method to calculate slope and aspect from a DEM raster only at specific points as given by a SpatialPoints object. If you know of a pre-built or more elegant solution, great! If not, some help finishing the for loop (how to best extract a neighborhood of surrounding cells and run calculations on that) would be most appreciated as well.
Interesting question. Here is a possible approach.
library(raster)
r <- raster()
set.seed(0)
values(r) <- runif(ncell(r), min = 0, max = 1000)
coords <- cbind(runif(25, min = -100, max = 100),
runif(25, min = -50, max = 50))
x <- rasterize(coords, r)
f <- focal(x, w=matrix(1, nc=3, nr=3), na.rm=TRUE)
rr <- mask(r, f)
slope <- terrain(rr, "slope")
extract(slope, coords)
# [1] 0.0019366236 0.0020670699 0.0006305257 0.0025334280 0.0023480935 0.0007527267 0.0002699272 0.0004699626
# [9] 0.0004869054 0.0025651333 0.0010415805 0.0008574920 0.0010664869 0.0017700297 0.0001666226 0.0008405391
#[17] 0.0017682167 0.0009854172 0.0015350466 0.0017714466 0.0012994945 0.0016563132 0.0003276584 0.0020499529
#[25] 0.0006582073
This probably yields little efficiency gain, as it still processes all the NA cells.
So maybe like this, more along your line of thinking:
cells <- cellFromXY(r, coords)
ngbs <- raster::adjacent(r, cells, pairs=TRUE)
slope <- rep(NA, length(cells))
for (i in 1:length(cells)) {
ci <- ngbs[ngbs[,1] == cells[i], 2]
e <- extentFromCells(r, ci)
x <- crop(r, e)
slope[i] <- terrain(x, "slope")[5]
}
slope
#[1] 0.0019366236 0.0020670699 0.0006305257 0.0025334280 0.0023480935 0.0007527267 0.0002699272 0.0004699626
#[9] 0.0004869054 0.0025651333 0.0010415805 0.0008574920 0.0010664869 0.0017700297 0.0001666226 0.0008405391
#[17] 0.0017682167 0.0009854172 0.0015350466 0.0017714466 0.0012994945 0.0016563132 0.0003276584 0.0020499529
#[25] 0.0006582073
But I find that brute force
slope <- terrain(r, "slope")
extract(slope, coords)
is fastest: 9 times faster than my first alternative and 4 times faster than the second.
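To make the per-point idea concrete, here is a rough sketch of computing slope only at selected cells from their 3x3 neighbourhoods. It is written in Python/NumPy rather than R, and it assumes a planar grid with square cells (not the lon/lat raster used above), so its numbers are not comparable to the output shown earlier. It uses Horn's finite-difference formula; the function name slope_at_cells and the test surface are made up for illustration.

```python
import numpy as np

def slope_at_cells(dem, rows, cols, cellsize=1.0):
    """Slope (radians) at the given cells, using Horn's 3x3 formula.
    Cells on the grid edge have no full neighbourhood and get NaN."""
    dem = np.asarray(dem, dtype=float)
    slopes = np.full(len(rows), np.nan)
    for k, (r, c) in enumerate(zip(rows, cols)):
        if r < 1 or c < 1 or r > dem.shape[0] - 2 or c > dem.shape[1] - 2:
            continue  # incomplete neighbourhood, leave as NaN
        w = dem[r - 1:r + 2, c - 1:c + 2]  # 3x3 window around the cell
        # Weighted differences between right/left columns and bottom/top rows.
        dzdx = ((w[0, 2] + 2 * w[1, 2] + w[2, 2])
                - (w[0, 0] + 2 * w[1, 0] + w[2, 0])) / (8 * cellsize)
        dzdy = ((w[2, 0] + 2 * w[2, 1] + w[2, 2])
                - (w[0, 0] + 2 * w[0, 1] + w[0, 2])) / (8 * cellsize)
        slopes[k] = np.arctan(np.hypot(dzdx, dzdy))
    return slopes

# Illustrative surface: a tilted plane z = 2*col + 3*row.
rr, cc = np.meshgrid(np.arange(20), np.arange(30), indexing="ij")
dem = 2.0 * cc + 3.0 * rr
point_slopes = slope_at_cells(dem, rows=[5, 10, 0], cols=[7, 12, 3])
```

Only nine cells are read per point, so the cost scales with the number of points rather than the raster size. Whether that beats a vectorised whole-raster pass depends on the fraction of cells needed, as the timings above suggest.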

"Updating" the RNG in Python

I have to iterate an operation over several sets of partially randomized copies of an initial array of 0 and 1.
I would like the copies to be different, of course, but also the sets.
For now I use this code (I omitted certain parts that should not interfere in the problem):
def randomizer(b) :
    """randomizes a fraction 'rate' (global variable) of b"""
    c = np.copy(b)
    num_elem = len(c)
    idx = np.random.choice(range(num_elem), int(num_elem*rate), replace=False)
    c[idx] = f(c[idx])
    return c
def randomizePatterns(pattern, randomizer) :
    """Return nbTrials partially randomized copies of the given input pattern"""
    outputs = np.tile(pattern,(nbTrials,1))
    for line in xrange(nbTrials) :
        outputs[line] = randomizer(outputs[line,:])
    return outputs
def test(pattern,randomizer) :
    randomPatterns = randomizePatterns(pattern, randomizer)
    """test the dynamics of a neural network with a fixed but initially random
    connection scheme, and returns a boolean corresponding to if
    the original pattern that was randomized is retrieved from at least
    90% of the randomized patterns"""
    return boolean
def metaTest(pattern):
    successNumber = 0
    for plop in xrange(10):
        if test(pattern, randomizer) :
            successNumber += 1
    return successNumber
The other input to test is a random matrix of floats in [0,4] that I visualize and that has the expected behavior.
When I run metaTest, I always get either 0 or 10, never an intermediate value.
Given the nature of test, I expect a result about 5. I get each result for approximately half of the inputs.
To be more precise, when I print the random patterns on each iteration over plop, I get the same thing ten times, and I would like that to change.

optimise distance calculation in matlab

I am a newbie with Matlab and I have the following scenario (which is part of a larger problem).
Matrix A is 4754x1024 and matrix B is 6800x1024.
For every row in matrix A, I need to calculate the Euclidean distance to every row in matrix B. I am using the following technique to calculate the distances, but I find it very inefficient and time consuming in Matlab.
for i=1:row_A
A_data=A_test(i,:);
for j=1:row_B
B_data=B_train(j,:);
X=[A_data;B_data];
%calculate distance
d=pdist(X,'euclidean');
dist(j,i)=d;
end
end
Any suggestions for optimising this would be welcome, because the final step involves performing this operation on 50 such sets of A and B.
Thanks and Regards,
Bhavya
I'm not sure what your code is actually doing.
Assuming your data has the following properties
assert(size(A,2) == size(B,2))
Try
d = zeros(size(A,1), size(B,1));
for i = 1:size(A,1)
    d(i,:) = sqrt(sum(bsxfun(@minus, B, A(i,:)).^2, 2));
end
Or possibly better organised by columns (See "Store and Access Data in Columns" in http://www.mathworks.co.uk/company/newsletters/news_notes/june07/patterns.html):
At = A.'; Bt = B.';
d = zeros(size(At,2), size(Bt,2));
for i = 1:size(At,2)
    d(i,:) = sqrt(sum(bsxfun(@minus, Bt, At(:,i)).^2, 1));
end
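For readers more familiar with NumPy, the same one-row-against-all-rows pattern can be sketched with broadcasting, which plays the role of bsxfun here. This is only an illustration under the same assumption as above (both matrices have the same number of columns); the function name pairwise_euclidean and the small random inputs are made up.

```python
import numpy as np

def pairwise_euclidean(A, B):
    """d[i, j] = Euclidean distance between row i of A and row j of B."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    assert A.shape[1] == B.shape[1], "A and B need the same number of columns"
    d = np.empty((A.shape[0], B.shape[0]))
    for i in range(A.shape[0]):
        # B - A[i] broadcasts the single row A[i] against every row of B.
        d[i, :] = np.sqrt(((B - A[i]) ** 2).sum(axis=1))
    return d

# Small illustrative inputs (the real matrices are 4754x1024 and 6800x1024).
rng = np.random.default_rng(0)
A = rng.random((5, 8))
B = rng.random((7, 8))
d = pairwise_euclidean(A, B)
```

Looping over the rows of A keeps the temporary array at the size of B, which matters for matrices as large as the ones in the question.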

Is it possible to optimize this Matlab code for doing vector quantization with centroids from k-means?

I've created a codebook using k-means of size 4000x300 (4000 centroids, each with 300 features). Using the codebook, I then want to label an input vector (for purposes of binning later on). The input vector is of size Nx300, where N is the total number of input instances I receive.
To compute the labels, I calculate the closest centroid for each of the input vectors. To do so, I compare each input vector against all centroids and pick the centroid with the minimum distance. The label is then just the index of that centroid.
My current Matlab code looks like:
function labels = assign_labels(centroids, X)
labels = zeros(size(X, 1), 1);
% for each X, calculate the distance from each centroid
for i = 1:size(X, 1)
    % distance of X_i from all j centroids is: sum((X_i - centroid_j)^2)
    % note: we leave off the sqrt as an optimization
    distances = sum(bsxfun(@minus, centroids, X(i, :)) .^ 2, 2);
    [value, label] = min(distances);
    labels(i) = label;
end
However, this code is still fairly slow (for my purposes), and I was hoping there might be a way to optimize the code further.
One obvious issue is that there is a for-loop, which is the bane of good performance in Matlab. I've been trying to come up with a way to get rid of it, but with no luck (I looked into using arrayfun in conjunction with bsxfun, but haven't gotten that to work). Alternatively, if someone knows of any other way to speed this up, I would greatly appreciate it.
Update
After doing some searching, I couldn't find a great solution using Matlab, so I decided to look at what is used in Python's scikits.learn package for 'euclidean_distance' (shortened):
XX = sum(X * X, axis=1)[:, newaxis]
YY = Y.copy()
YY **= 2
YY = sum(YY, axis=1)[newaxis, :]
distances = XX + YY
distances -= 2 * dot(X, Y.T)
distances = maximum(distances, 0)
which uses the binomial form of the euclidean distance ((x-y)^2 -> x^2 + y^2 - 2xy), which from what I've read usually runs faster. My completely untested Matlab translation is:
XX = sum(data .* data, 2);
YY = sum(center .^ 2, 2);
[val, ~] = max(XX + YY - 2*data*center');
Use the following function to calculate your distances. You should see an order of magnitude speed-up.
In the two matrices A and B, the columns are the dimensions and each row is a point.
A is your matrix of centroids. B is your matrix of data points.
function D=getSim(A,B)
Qa=repmat(dot(A,A,2),1,size(B,1));
Qb=repmat(dot(B,B,2),1,size(A,1));
D=Qa+Qb'-2*A*B';
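The same expansion, ||x - c||^2 = ||x||^2 + ||c||^2 - 2 x·c, can be sketched in NumPy together with the final labelling step. This is a hedged illustration rather than a drop-in replacement: the function name assign_labels mirrors the question's Matlab function, the labels are 0-based (unlike Matlab's 1-based indices), and the small random arrays are made up.

```python
import numpy as np

def assign_labels(centroids, X):
    """Index of the nearest centroid for each row of X (0-based)."""
    centroids = np.asarray(centroids, dtype=float)
    X = np.asarray(X, dtype=float)
    cc = (centroids * centroids).sum(axis=1)   # squared norm of each centroid
    xx = (X * X).sum(axis=1)                   # squared norm of each input row
    # Squared distances via the expansion; one matrix product does the heavy work.
    d2 = xx[:, None] + cc[None, :] - 2.0 * (X @ centroids.T)
    np.maximum(d2, 0.0, out=d2)                # clip tiny negatives from rounding
    return np.argmin(d2, axis=1)

# Small illustrative inputs (the question uses 4000x300 centroids).
rng = np.random.default_rng(1)
centroids = rng.random((6, 4))
X = rng.random((10, 4))
labels = assign_labels(centroids, X)
```

The square root is skipped because it does not change which centroid is closest.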
You can vectorize it by converting to cells and using cellfun:
[nRows,nCols]=size(X);
XCell=num2cell(X,2);
dist=reshape(cell2mat(cellfun(@(x)(sum(bsxfun(@minus,centroids,x).^2,2)),XCell,'UniformOutput',false)),size(centroids,1),nRows);
[~,labels]=min(dist);
Explanation:
We assign each row of X to its own cell in the second line
This piece @(x)(sum(bsxfun(@minus,centroids,x).^2,2)) is an anonymous function which is the same as your distances=... line, and using cellfun, we apply it to each row of X.
The labels are then the indices of the minimum row along each column.
For a true matrix implementation, you may consider trying something along the lines of:
P2 = kron(centroids, ones(size(X,1),1));
Q2 = kron(ones(size(centroids,1),1), X);
distances = reshape(sum((Q2-P2).^2,2), size(X,1), size(centroids,1));
Note
This assumes the data is organized as [x1 y1 ...; x2 y2 ...;...]
You can use a more efficient algorithm for nearest neighbor search than brute force.
The most popular approach is the Kd-tree, which gives O(log(n)) average query time instead of the O(n) complexity of brute force.
Regarding a Matlab implementation of Kd-trees, you can have a look here