Jaccard cluster confidence interval - hierarchical-clustering

I am hoping someone will give me advice on how to get confidence intervals from a Jaccard cluster using R. I have species data from the 1970s and from today at four sites. When I run the following code I get a great graph that shows that one of my present-day sites is closer to the historical data than another site. I'm sure people are going to ask about significance. I've seen similar confidence intervals on phylogenetic trees, but I'm not sure how to get this kind of result. I assume I do this with a bootstrap test, but I'm not sure how to get results from boot() or how to put them on my cluster graph. Any advice would be greatly appreciated.
My code to make the cluster:
library(vegan)
historicalwo <- read.csv("/users/Victoria/Desktop/Stat Documents/historicalwo.csv",
                         row.names = 1)
jaccard2 <- vegdist(historicalwo, method = "jaccard")
plot(hclust(jaccard2), hang = -1, main = "Sites clustered by Jaccard similarity",
     axes = FALSE, ylab = "")
Then I made a .csv of the Jaccard results with three columns: site 1, site 2 and the Jaccard index of the two sites.
jaccardboot <-read.csv("/users/Victoria/Desktop/Stat Documents/jaccardboot.csv", header = TRUE)
bs <- function(formula, data, indices) {
d <- data[indices,]
fit <- lm(formula, data=d)
return(coef(fit)) }
results <- boot(data=jaccardboot,statistic=bs,
R=100, formula=site1~jaccard+site2)
results
I get:
Error in boot(data = jaccardboot, statistic = bs, R = 100, formula = site1 ~ :
number of items to replace is not a multiple of replacement length
In addition: There were 50 or more warnings (use warnings() to see the first 50)

Through a stroke of luck I stumbled upon a reasonable answer to my question. First I transposed my data, and then I ran pvclust with the Ward method and binary as the distance. This simulates a Jaccard index. The results did not cluster like my previous example, but at least now I have a measure of statistical significance. If anyone knows why this clustering might differ from my Jaccard clustering, I am all ears.
swo <-read.csv("/users/Victoria/Desktop/Stat Documents/siteswo1.csv", header = TRUE, row.names = 1)
result <- pvclust(swo, method.dist="binary", method.hclust="ward", nboot=1000)
plot(result)
pvrect(result, alpha=0.95)
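On presence/absence data, R's "binary" distance is the Jaccard distance: the share of species found at only one of two sites among the species found at either. Below is a minimal sketch of that calculation. It is written in Python purely for illustration, the site-by-species matrix is made up, and scipy's pdist stands in for R's dist.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical presence/absence matrix: rows are sites, columns are species.
sites = np.array([
    [1, 1, 0, 1, 0],
    [1, 0, 0, 1, 1],
    [0, 1, 1, 0, 0],
], dtype=bool)

def jaccard_distance(a, b):
    """Share of species present at exactly one site among those present at either."""
    either = np.logical_or(a, b).sum()
    only_one = np.logical_xor(a, b).sum()
    return only_one / either

by_hand = jaccard_distance(sites[0], sites[1])
by_scipy = squareform(pdist(sites, metric="jaccard"))
print(by_hand, by_scipy[0, 1])
```

Two differences between the runs above may be worth checking. hclust uses complete linkage by default, whereas the pvclust call uses Ward. vegdist computes a quantitative Jaccard index unless binary = TRUE is passed, which matters if the data are abundances rather than presence/absence.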

Related

How to remove frequent/infrequent features from Sklearn CountVectorizer?

Is it possible to remove a percentage of features that occur most frequently / infrequently, from the CountVectorizer?
So basically organize the features from a greatest to least occurrence distribution and just remove a percentage from the left or right side?
As far as I know, there is no straightforward way to do that.
Let me propose a way to achieve the result you want.
I will assume that you are only interested in unigrams (one-word features), to keep the examples simple.
For the top x per cent of the features, a possible implementation can be based on the max_features parameter of the CountVectorizer (see user guide).
First, you would need to find out the total number of features by using the CountVectorizer with the default values so that it generates the full vocabulary of terms in the corpus.
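from sklearn.feature_extraction.text import CountVectorizer
# corpus is assumed to be a list of document strings
vect = CountVectorizer()
bow = vect.fit_transform(corpus)
total_features = len(vect.vocabulary_)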
Then you use the CountVectorizer with the max_features parameter, limiting the number of features to the top percentage you need, say 20%. When max_features is set, the terms with the highest frequency across the corpus are selected automatically.
top_vect = CountVectorizer(max_features=int(total_features * 0.2))
top_bow = top_vect.fit_transform(corpus)
Now, regarding the bottom x per cent of the features, even though I cannot think of a good reason why you would need that, here is an approach. The vocabulary parameter can be used to limit the model to counting only the less frequent terms. For that, we use the output of the first run of the CountVectorizer to create a list of the less common terms.
# Create a list of (term, frequency) tuples sorted by their frequency
sum_words = bow.sum(axis=0)
words_freq = [(word, sum_words[0, idx]) for word, idx in vect.vocabulary_.items()]
words_freq = sorted(words_freq, key = lambda x: x[1])
# Keep only the terms in a list
vocabulary, _ = zip(*words_freq[:int(total_features * 0.2)])
vocabulary = list(vocabulary)
Finally, we use the vocabulary to limit the model to the less frequent terms.
bottom_vect = CountVectorizer(vocabulary=vocabulary)
bottom_bow = bottom_vect.fit_transform(corpus)
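The question also asks about removing a percentage of features, rather than keeping one. One possible way to do that, sketched below, is to fit the CountVectorizer once and drop columns from the resulting matrix by total count. The toy corpus and the names term_counts, order, keep and trimmed_bow are made up for this illustration. Ties between equally frequent terms are broken by column order here, which is an arbitrary choice.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, made up for illustration.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
    "a bird sang",
]

vect = CountVectorizer()
bow = vect.fit_transform(corpus)

# Total count of each term over the corpus, in column order.
term_counts = np.asarray(bow.sum(axis=0)).ravel()
# Terms in column order, recovered from the fitted vocabulary.
terms = np.array(sorted(vect.vocabulary_, key=vect.vocabulary_.get))

# Column indices sorted from most to least frequent (stable for ties).
order = np.argsort(-term_counts, kind="stable")

fraction = 0.2
n_drop = int(len(terms) * fraction)

# Drop the most frequent 20% of terms by keeping the remaining columns.
keep = np.sort(order[n_drop:])
trimmed_bow = bow[:, keep]
trimmed_terms = terms[keep]
print("dropped:", terms[order[:n_drop]])
```

To drop the least frequent terms instead, keep order[:len(terms) - n_drop].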

Confused by random.randn()

I am a bit confused by the numpy function random.randn() which returns random values from the standard normal distribution in an array in the size of your choosing.
My question is that I have no idea when this would ever be useful in applied practices.
For reference about me I am a complete programming noob but studied math (mostly stats related courses) as an undergraduate.
The Python function randn is incredibly useful for adding in a random noise element into a dataset that you create for initial testing of a machine learning model. Say for example that you want to create a million point dataset that is roughly linear for testing a regression algorithm. You create a million data points using
x_data = np.linspace(0.0,10.0,1000000)
You generate a million random noise values using randn
noise = np.random.randn(len(x_data))
To create your linear data set you follow the formula
y = mx + b + noise_levels with the following code (setting b = 5, m = 0.5 in this example)
y_data = (0.5 * x_data ) + 5 + noise
Finally the dataset is created with
my_data = pd.concat([pd.DataFrame(data=x_data,columns=['X Data']),pd.DataFrame(data=y_data,columns=['Y'])],axis=1)
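As a quick check that such a synthetic dataset behaves as intended, a least-squares line fitted to it should recover roughly m = 0.5 and b = 5. The sketch below uses np.polyfit as a stand-in for whatever regression algorithm is being tested. The seed and the smaller dataset size are arbitrary choices made for this illustration.

```python
import numpy as np

np.random.seed(0)  # arbitrary seed so the run is repeatable

x_data = np.linspace(0.0, 10.0, 100_000)
noise = np.random.randn(len(x_data))
y_data = (0.5 * x_data) + 5 + noise

# Fit a straight line; polyfit returns coefficients from highest degree down.
slope, intercept = np.polyfit(x_data, y_data, 1)
print(slope, intercept)
```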
This could be used in 3D programming to generate non-overlapping random values, which would be useful for optimizing graphical effects.
Another possible statistical use is testing a formula against spatial factors that affect a given constant. For example, if a formula measures something over a span of time, you might need to know how effective it is over various spans. The result would be a statistic showing, for example, that your formula is more effective over shorter or longer intervals.
np.random.randn(d0, d1, ..., dn) returns a sample (or samples) from the “standard normal” distribution (mu=0, stdev=1).
For random samples from N(mu, sigma^2), use:
sigma * np.random.randn(...) + mu
This is because if Z is a standard normal deviate, then sigma * Z + mu will have a normal distribution with expected value mu and standard deviation sigma.
https://docs.scipy.org/doc/numpy-1.14.0/reference/generated/numpy.random.randn.html
https://en.wikipedia.org/wiki/Normal_distribution
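The scaling rule can be checked numerically. In the sketch below, the values of mu and sigma, the sample size and the seed are arbitrary choices for illustration. The sample mean and standard deviation should land close to mu and sigma.

```python
import numpy as np

np.random.seed(1)  # arbitrary seed so the run is repeatable

mu, sigma = 3.0, 2.5  # illustrative values
samples = sigma * np.random.randn(200_000) + mu

# Both statistics should be close to the chosen mu and sigma.
print(samples.mean(), samples.std())
```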

Can't extract clusters from fcluster after using scipy's hierarchical clustering

After doing hierarchical clustering on my dataset and plotting it with the dendrogram function, it seems to have been clustered correctly, but when I call the fcluster function to extract the cluster ids I only ever get one cluster id.
Why is this happening?
My code:
for key, values in use_case_idx.items():
    vectors = []
    labels = []
    for value in values:
        labels.append(value[0])
        vectors.append(value[1])
    try:
        distance_matrix = pdist(vectors, metric='cosine')
        Z = linkage(distance_matrix, 'ward')
        plt.title("Ward")
        dendrogram(Z, labels=labels)
    except:
        continue
    plt.show()
    clusters = fcluster(Z, 10, criterion='distance')
    print(clusters)
And thus, the output:
More examples on: https://imgur.com/a/kEfub
What's wrong with this code?
Note: Each vector has 50 dimensions
The y-axis of the dendrogram shows the cophenetic distance between different nodes. Because you are using the distance criterion with a large value (much larger than the cophenetic distance), all elements are grouped into the same cluster.
Try using a smaller threshold (e.g. 0.025 for the first dendrogram you show). The dendrogram can act as a guide to choosing "good" thresholds, although "good" is very subjective.
If you want to cluster your data into at most n distinct clusters, you can use the criterion 'maxclust', for example fcluster(Z, n, criterion='maxclust'), where Z is the linkage matrix returned by linkage.
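The effect of the threshold can be seen on a small made-up dataset. The sketch below is an illustration under assumptions: the data are three synthetic groups of 50-dimensional vectors, and it uses average linkage rather than the Ward linkage from the question, so that merge heights stay within the 0 to 2 range of the cosine distance.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Made-up data: three tight groups of 50-dimensional vectors.
rng = np.random.default_rng(0)
centers = rng.normal(size=(3, 50))
vectors = np.vstack([c + 0.01 * rng.normal(size=(10, 50)) for c in centers])

Z = linkage(pdist(vectors, metric="cosine"), "average")

# Merge heights are stored in the third column of the linkage matrix.
print("largest merge height:", Z[:, 2].max())

# A threshold above every merge height puts everything in one cluster.
too_high = fcluster(Z, 10, criterion="distance")
# Asking for a number of clusters avoids picking a threshold by hand.
by_count = fcluster(Z, 3, criterion="maxclust")
print(len(set(too_high)), len(set(by_count)))
```

Printing Z[:, 2] before calling fcluster shows the range of heights from which a sensible 'distance' threshold can be chosen.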

Constructing a bubble trellis plot with lattice in R

First off, this is a homework question. The problem is ex. 2.6 from pg.26 of An Introduction to Applied Multivariate Analysis. It's laid out as:
Construct a bubble plot of the earthquake data using latitude and longitude as the scatterplot and depth as the circles, with greater depths giving smaller circles. In addition, divide the magnitudes into three equal ranges and label the points in your bubble plot with a different symbol depending on the magnitude group into which the point falls.
I have figured out that symbols, which is in base graphics, does not work well with lattice. Also, I haven't figured out whether lattice has the functionality to change symbol size (i.e. bubble size). I bought the lattice book in a fit of desperation last night, and as I see in some of the examples, it is possible to set symbol color and shape for each "cut" or panel. I am therefore working under the assumption that symbol size can also be manipulated, but I haven't been able to figure out how.
My code looks like:
plot(xyplot(lat ~ long | cut(mag, 3), data=quakes,
layout=c(3,1), xlab="Longitude", ylab="Latitude",
panel = function(x,y){
grid.circle(x,y,r=sqrt(quakes$depth),draw=TRUE)
}
))
Where I attempt to use the grid package to draw the circles, but when this executes, I just get a blank plot. Could anyone please point me in the right direction? I would be very grateful!
Here is some code for creating the plot that you need without using the lattice package. I had to generate my own fake data, so you can disregard that part and go straight to the plotting commands if you want.
####################################################################
#Pseudo Data
n = 20
latitude = sample(1:100,n)
longitude = sample(1:100,n)
depth = runif(n,0,.5)
magnitude = sample(1:100,n)
groups = rep(NA,n)
for(i in 1:n){
if(magnitude[i] <= 33){
groups[i] = 1
}else if (magnitude[i] > 33 & magnitude[i] <=66){
groups[i] = 2
}else{
groups[i] = 3
}
}
####################################################################
#The actual code for generating the plot
plot(latitude[groups==1],longitude[groups==1],col="blue",pch=19,ylim=c(0,100),xlim=c(0,100),
xlab="Latitude",ylab="Longitude")
points(latitude[groups==2],longitude[groups==2],col="red",pch=15)
points(latitude[groups==3],longitude[groups==3],col="green",pch=17)
points(latitude[groups==1],longitude[groups==1],col="blue",cex=1/depth[groups==1])
points(latitude[groups==2],longitude[groups==2],col="red",cex=1/depth[groups==2])
points(latitude[groups==3],longitude[groups==3],col="green",cex=1/depth[groups==3])
You just need to add default.units = "native" to grid.circle()
plot(xyplot(lat ~ long | cut(mag, 3), data=quakes,
layout=c(3,1), xlab="Longitude", ylab="Latitude",
panel = function(x,y){
grid.circle(x,y,r=sqrt(quakes$depth),draw=TRUE, default.units = "native")
}
))
Obviously you need to tinker with some of the settings to get what you want.
I have written a package called tactile that adds a function for producing bubbleplots using lattice.
tactile::bubbleplot(depth ~ lat*long | cut(mag, 3), data=quakes,
layout=c(3,1), xlab="Longitude", ylab="Latitude")

np.fft.fft off by a factor of 1000 (fitting a power spectrum)

I'm trying to make a power spectrum from an experimental dataset that I read in, and then fit it to a theoretical curve. Everything runs without errors, except that my curve keeps differing from the data by a factor of 1000, and I have absolutely no idea what the problem could be. I've asked a few people, but to no avail. (I hope that you guys will be able to help.)
Anyway, I'm pretty sure that it's not the units, as they were triple-checked by me and two others. Basically, I need to fit a power spectrum to an equation using the least squares method.
I can't post the whole code, as it's rather long and a bit messy, but this is the Fourier part. (I added comments to all arrays and variables which have not been declared in the code.)
#Calculate stuff
Nm = 10**-6 #micro to meter
KbT = 4.10E-21 #Joule
T = 297. #K
l = zvalue*Nm #meter
meany = np.mean(cleandatay*Nm) #meter (cleandata is the array that I read in from a cvs at the start.)
SDy = sum((cleandatay*Nm - meany)**2)/len(cleandatay) #meter^2
FmArray[0][i] = ((KbT*l)/SDy) #N
#print FmArray[0][i]
print float((i*100/len(filelist)))#how many % done?
#fourier
dt = cleant[1]-cleant[0] #timestep
N = len(cleandatay) #Same for cleant, its the corresponding time to cleandatay
Here is where the fourier part starts, I take the fft and turn it into a powerspectrum. Then I calculate the corresponding freq steps with the array freqs
fouriery = np.fft.fft((cleandatay*(10**-6)))
fourierpower = (np.abs(fouriery))**2
fourierpower = fourierpower[1:N/2] #remove 0th datapoint and /2 (remove negative freqs)
fourierpower = fourierpower*dt #*dt to account for steps
freqs = (1.+np.arange((N/2)-1.))/50.
#Least squares method
eta = 8.9E-4 #pa*s
Rbead = 0.5E-6#meter
constant = 2*KbT/(3*eta*pi*Rbead)
omega = 2*pi*freqs #rad/s
Wcarray = 2.*pi*np.arange(0,30, 0.02003) #0.02 = 30/len(freqs)
ChiSq = np.zeros(len(Wcarray))
for k in range(0, len(Wcarray)):
Py = (constant / (Wcarray[k]**2 + omega**2))
ChiSq[k] = sum((fourierpower - Py)**2)
pylab.loglog(omega, Py)
print k*100/len(Wcarray)
index = np.where(ChiSq == min(ChiSq))
cutoffw = Wcarray[index]
Pygoed = (constant / (Wcarray[index]**2 + omega**2))
print cutoffw
print constant
print min(ChiSq)
pylab.loglog(omega,ChiSq)
So I have no idea what could be going wrong. I think it's the fft, as nothing else can really go wrong.
Below is the picture I get when I plot all the fit lines against the spectrum. As you can see, it is off by about 1000 (actually exactly 1000, as this leaves a least squares residue of 10^-22, but I can't just randomly multiply without knowing why).
Just to elaborate on the picture. The green dots are the fft spectrum, the lines are the fits, the red dot is where it thinks the cutoff frequency is, and the blue line is the chi-squared fit, looking for the lowest value.
Take a look at the documentation for the FFT that you are using. Many FFT implementations return an unnormalized result that is scaled by N, the number of samples. Multiplying by 1/N brings the result back in line. (You said that the result is 1000 times too high. Could it be that you are using a 1024-point FFT?)
Your library FFT routine might include a scale factor of 1/sqrt(n).
Check the documentation for the fft you used, as the proportion of the scale factor allocated between the fft and the ifft is arbitrary.
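For NumPy specifically, np.fft.fft applies no scaling to the forward transform by default, and np.fft.ifft divides by n. The sketch below shows how the unscaled spectrum grows with N and how dividing by N restores the signal power. The sample count, sampling interval and test signal are made up. Whether this accounts for the factor of 1000 in the question depends on N, which the question does not state.

```python
import numpy as np

N = 1000    # number of samples (illustrative)
dt = 0.001  # sampling interval in seconds (illustrative)
t = np.arange(N) * dt
x = np.sin(2 * np.pi * 50.0 * t)  # unit-amplitude 50 Hz sine, whole number of cycles

X = np.fft.fft(x)                # forward transform, unscaled by default
freqs = np.fft.fftfreq(N, d=dt)  # frequency of each bin in Hz

# The peak magnitude is N/2, not the signal amplitude of 1.
peak = np.abs(X[1:N // 2]).max()
print(peak)

# Parseval: the summed squared spectrum is N times the summed squared signal.
energy_ratio = np.sum(np.abs(X) ** 2) / np.sum(x ** 2)
print(energy_ratio)

# One common two-sided power spectral density estimate divides by N as well
# as multiplying by dt; integrated over frequency it gives the mean power.
psd = np.abs(X) ** 2 * dt / N
df = 1.0 / (N * dt)
print(np.sum(psd) * df, np.mean(x ** 2))
```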