torch7: Unexpected 'counts' in k-Means Clustering

I am trying to apply k-means clustering on a set of images (images are loaded as float torch.Tensors) using the following segment of code:
print('[Clustering all samples...]')
local points = torch.Tensor(trsize, 3, 221, 221)
for i = 1, trsize do
    points[i] = trainData.data[i]:clone() -- don't want to modify the original tensors
end
points:resize(trsize, 3*221*221) -- reshape it into a 2-D tensor
local centroids, counts = unsup.kmeans(points, total_classes, 40, total_classes, nil, true)
print(counts)
When I inspect the counts tensor, it contains unexpected values: some entries are larger than trsize, even though the documentation says counts stores the count per centroid. I took that to mean counts[i] equals the number of samples (out of trsize) assigned to the cluster with centroid centroids[i]. Am I wrong in assuming so?
If that is indeed the case, shouldn't the sample-to-centroid assignment be hard (i.e. shouldn't counts sum to trsize, which is clearly not the case with my clustering)? Am I missing something here?
Thanks in advance.

In the current version of the code, counts are accumulated across iterations:
for i = 1, niter do
    -- k-means computations...
    -- total counts
    totalcounts:add(counts)
end
So in the end counts:sum() is niter times the number of points, not trsize.
As a workaround you can use the callback to obtain the final counts (non-accumulated):
local maxiter = 40
local centroids, counts = unsup.kmeans(
    points,
    total_classes,
    maxiter,
    total_classes,
    function(i, _, totalcounts) if i < maxiter then totalcounts:zero() end end,
    true
)
As an alternative you can use vlfeat.torch and explicitly quantize your input points after kmeans to obtain these counts:
local assignments = kmeans:quantize(points)
local counts = torch.zeros(total_classes):int()
for i = 1, total_classes do
    counts[i] = assignments:eq(i):sum()
end

Related

Why do my arrays display missing values when identifying a bandwidth? (geopandas)

I'm trying to identify a suitable bandwidth to use for a geographically weighted regression, but every time I search for the bandwidth it reports missing (NaN) values in the dataset's arrays, even though each row appears to contain all of its values.
g_y = df_ct2008xy['2008 HP'].values.reshape((-1,1))
g_X = df_ct2008xy[['2008 AF', '2008 MI', '2008 MP', '2008 EB']].values
u = df_ct2008xy['X']
v = df_ct2008xy['Y']
g_coords = list(zip(u,v))
g_X = (g_X - g_X.mean(axis=0)) / g_X.std(axis=0)
g_y = g_y.reshape((-1,1))
g_y = (g_y - g_y.mean(axis=0)) / g_y.std(axis=0)
bw = mgwr.sel_bw.Sel_BW(g_coords,
                        g_y,             # dependent variable
                        g_X,             # independent variables
                        fixed=True,      # True for a fixed bandwidth, False for an adaptive bandwidth
                        spherical=True)  # spherical coordinates (long-lat) or projected coordinates
I used numpy to check whether only individual values were affected, with
np.isnan(g_y).any()
and
np.isnan(g_X)
but apparently every value is 'missing' and returns True.
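One thing worth checking, as a hedged diagnostic sketch (it assumes df_ct2008xy from the question is an ordinary pandas/GeoPandas DataFrame with the column names shown above): NaNs can be introduced by the standardization step itself, since dividing by a zero standard deviation turns a whole column into NaN, and non-numeric columns can also surface as NaN later on.
# Hypothetical diagnostic on the raw, un-standardized columns from the question.
cols = ['2008 HP', '2008 AF', '2008 MI', '2008 MP', '2008 EB']
print(df_ct2008xy[cols].isna().sum())   # any NaNs already present in the raw data?
print(df_ct2008xy[cols].dtypes)         # are all of these columns numeric?
print(df_ct2008xy[cols].std())          # a zero std makes (x - mean) / std produce NaN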

Understanding Pandas Series Data Structure

I am trying to get my head around the Pandas module and started learning about the Series data structure.
I have created the following Series in Spyder:
songs = pd.Series(data = [145,142,38,13], name = "Count")
I can obtain information about the Series index using:
songs.index
The output of the above code is as follows:
RangeIndex(start=0, stop=4, step=1)
My question is: where it states start=0 and stop=4, what are these referring to?
I have interpreted start=0 as meaning the first element in the Series is in row 0.
But I am not sure what the stop value refers to, as there is no element in row 4 of the Series.
Can someone explain?
Thank you.
This concept, already explained adequately in the comments (the last valid index is one less than the count of items), shows up in many places.
For instance, take the list data structure:
z = songs.to_list()
z                  # [145, 142, 38, 13]
len(z)             # 4 -- length is four
# however, indexing stops at position i-1, 'i' being the length/count of items in the list
z[4]               # this will raise an IndexError
# valid indices start at 0 and go up to 3 (i.e. 4 items)
z[0], z[1], z[2], z[-1]  # notice how -1 can be used to directly access the last element
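To make the start/stop semantics concrete, here is a minimal sketch using the same Series as above; the stop value of a RangeIndex is exclusive, exactly like Python's built-in range:
import pandas as pd

songs = pd.Series(data=[145, 142, 38, 13], name="Count")
print(songs.index)        # RangeIndex(start=0, stop=4, step=1)
print(list(songs.index))  # [0, 1, 2, 3] -- stop=4 is exclusive, just like range(0, 4)
print(list(range(0, 4)))  # [0, 1, 2, 3]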

Why did my original DataFrame change values even though I sliced it and assigned the slice to another variable?

I am trying to calculate a portfolio's daily total price, by multiplying weights of each asset with the daily price of the assets.
Currently I have a DataFrame tw that is all zeros except on the dates I want to re-balance, where it holds my asset weights. What I would like to do, for each month, is populate the zeros with the weights I am re-balancing with, up until the next re-balancing date, and so on.
My code:
df_of_weights = tw.loc[dates_to_rebalance[13]:]
temp_date = dates_to_rebalance[13]
counter = 0
for date in df_of_weights.index:
    if date.year == temp_date.year and date.month == temp_date.month:
        if date.day == temp_date.day:
            pass
        else:
            df_of_weights.loc[date] = df_of_weights.loc[temp_date].values
            counter += 1
            temp_date = dates_to_rebalance[13 + counter]
I understood that if you slice your DataFrame and assign the slice to a variable (df_of_weights), changing the values of that variable would not affect the original DataFrame. However, the values in tw changed. I have been searching for an answer online for a while now and am really confused.
You should use copy to fix the problem:
df_of_weights = tw.loc[dates_to_rebalance[13]:].copy()
The problem is that slicing can return a view instead of a copy. The underlying pandas issue is still open:
https://github.com/pandas-dev/pandas/issues/15631
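Here is a minimal sketch of the difference, using a made-up tw (the dates and column names are hypothetical):
import pandas as pd

tw = pd.DataFrame({"asset_a": [0.0, 0.0, 0.0], "asset_b": [0.0, 0.0, 0.0]},
                  index=pd.date_range("2020-01-01", periods=3))

sliced = tw.loc["2020-01-02":]               # may be a view backed by tw's data
independent = tw.loc["2020-01-02":].copy()   # owns its own data

independent.loc["2020-01-02"] = 0.5          # tw is left untouched
print(tw)                                    # still all zeros
Writing through sliced instead may trigger pandas' SettingWithCopyWarning and can propagate back into tw, which is exactly the behaviour described in the question.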

Moving Average of time series using a sliding window over an array

I need to write a function that computes the moving average of a time series using a sliding window over an array. The function should take an array of date strings (say arr_date), an array of numbers (say arr_record), and a sliding window size (default value 50). It should:
Return a list of dictionaries for all windows.
Each dictionary should include the date, average value, min, max, and standard deviation for its window.
Be able to handle missing data in the time series by replacing it with the most recent available data.
(b) Download SPY daily data (Dec. 31, 2017 to Dec. 31, 2018) from Yahoo! as your test data in a .csv file. Read a CSV-file-reading example and write a test program that calls your function.
Does anyone have any thoughts? I'm extremely new to Python and struggling.
Something along these lines should be a good starting point. Hope this helps, and welcome to the CS community.
def sliding_window(dates, numbers, sliding_window_value=50):
    # list of per-window dictionaries
    return_dicts = []
    # if the window size is at least the length of dates, there is only one window
    if sliding_window_value >= len(dates):
        return_dicts += [create_window(dates, numbers)]
        return return_dicts
    # gather all our windows into one list
    for i in range(0, len(dates) - sliding_window_value + 1):
        # get our window subsets
        dates_subset = dates[i:i + sliding_window_value]
        numbers_subset = numbers[i:i + sliding_window_value]
        # get our window stats dictionary
        window_stats = create_window(dates_subset, numbers_subset)
        # add these stats to our return list
        return_dicts += [window_stats]
    return return_dicts

def create_window(dates_subset, numbers_subset):
    window_min = 1000000   # some high minimum to start
    window_max = -1000000  # some low maximum to start
    window_total = 0
    for i in range(0, len(dates_subset)):
        # accumulate the total
        window_total += numbers_subset[i]
        # track the max
        if numbers_subset[i] > window_max:
            window_max = numbers_subset[i]
        # track the min
        if numbers_subset[i] < window_min:
            window_min = numbers_subset[i]
    # other calculations....
    return_dict = {
        "min": window_min,
        "max": window_max,
        "average": window_total / len(dates_subset),
        # other calculations....
    }
    return return_dict
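A small usage sketch with made-up numbers (in the actual assignment the dates and prices would come from the downloaded SPY .csv file):
# Hypothetical test data standing in for SPY daily closes.
dates = ["2018-01-02", "2018-01-03", "2018-01-04", "2018-01-05", "2018-01-08"]
prices = [268.77, 270.47, 271.61, 273.42, 274.12]

for window in sliding_window(dates, prices, sliding_window_value=3):
    print(window)   # first window: {'min': 268.77, 'max': 271.61, 'average': 270.28...}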
Good luck bud, the work is worth it.

Numpy maximum(arrays)--how to determine the array each max value came from

I have numpy arrays representing the July temperature for each year since 1950.
I would like to use something like numpy.maximum(temp1950, temp1951, temp1952, ..temp2014)
to determine the maximum July temperature at each cell, but numpy.maximum() only compares two arrays at a time.
How do I determine the year that each max value came from?
Thanks to Praveen, the following works fine:
array1 = numpy.array([[1, 2], [3, 4]])
array2 = numpy.array([[3, 4], [1, 2]])
array3 = numpy.array([[9, 1], [1, 9]])
all_arrays = numpy.dstack((array1, array2, array3))
# maxvalues = numpy.maximum(all_arrays)  # will not work
all_arrays.max(axis=2)                           # returns the max at each cell location
max_indexes = numpy.argmax(all_arrays, axis=2)   # returns the correct indexes
The answer is argmax, except that you need to do this along the required axis. If you have 65 years' worth of temperatures, it doesn't make sense to keep them in separate arrays.
Instead, put them all into a single 2-D array using something like np.vstack and then take the argmax along the year axis (axis=0).
alltemps = np.vstack((temp1950, temp1951, ..., temp2014))
maxindexes = np.argmax(alltemps, axis=0)
If your temperature arrays are already 2D for some reason, then you can use np.dstack to stack in depth instead. Then you'll have to take argmax over axis=2.
For the specific example in your question, you're looking for something like:
t = np.dstack((array1, array2))  # note the double parentheses: you need to pass
                                 # a tuple to the function
maxindexes = np.argmax(t, axis=2)
PS: If you are getting the data out of a file, I suggest putting them in a single array to start with. It gets hard to handle 65 variable names.
You need to use NumPy's argmax.
It gives you the index of the largest element along an axis, which you can then map to the year.
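A minimal sketch of that mapping, using randomly generated data in place of the real temperature grids (the shapes and values here are hypothetical):
import numpy as np

# 65 years (1950-2014) of July temperatures over a flattened grid of 4 cells.
years = np.arange(1950, 2015)
alltemps = np.random.uniform(10, 40, size=(len(years), 4))

max_temp = alltemps.max(axis=0)         # hottest July temperature at each cell
max_idx = np.argmax(alltemps, axis=0)   # row (year) index of that maximum
hottest_year = years[max_idx]           # map the index back to a calendar year

print(max_temp)
print(hottest_year)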