Calling scheduler.multiprocessing.get in a separate process in dask - python-multiprocessing

I am training a neural network with a large text corpus. Each text generates quite a big matrix because I'm using a convolutional model. As my data won't fit in my (still large) memory, I try to stream it and use keras.models fit_generator.
To feed Keras, I have a pipeline composed of different preprocessing steps, which I arrange as a dask bag with lots of partitions. The dask bag reads a file on disk.
Even though dask does not handle iteration in a smart way (it just calls compute() and iterates over the result, which in my case blows up memory), I wanted to use something like this:
def compute_partition_iter(collection, **kwargs):
    """A utility to compute a collection item after item,
    one partition at a time.
    """
    get = kwargs.pop("get", None) or _globals['get']  # _globals lives in dask.context in older dask versions
    if get is None:
        get = collection.__dask_scheduler__
    postcompute_func, postcompute_args = collection.__dask_postcompute__()
    dsk = collection.__dask_graph__()
    for key in collection.__dask_keys__():
        # compute a single partition, then post-process it and yield its items
        partition = get(dsk, key, **kwargs)
        yield from postcompute_func([partition], *postcompute_args)
This computes the partitions one by one and yields their items, computing the next partition as we cross a partition boundary.
This approach has a problem: only when we hit the last item of a partition do we trigger the computation of the next one, leading to a lag before the next element is available. During this lag, Keras is stalled and we lose precious time!
So I imagine running the above compute_partition_iter in a separate process thanks to multiprocessing.Pool, feeding partitions into a Queue with, say, 2 slots, so that in the generator I will always have one more partition ready.
But it seems that this is not supported by dask.bag. I didn't dive deeply enough into the code, but it seems like there are some async methods used, or I don't know what.
Here is reproducible code for the problem.
First, code that works, using a simple range.
import multiprocessing
import time

def put_q(n, q):
    for i in range(n):
        print(i, "<-")
        q.put(i)
    q.put(None)

q = multiprocessing.Queue(2)
with multiprocessing.Pool(1, put_q, (4, q)) as pool:
    i = True
    while i is not None:
        print("zzz")
        time.sleep(.5)
        i = q.get()
        if i is None:
            break
        print("-> ", i)
This outputs
0 <-
1 <-
2 <-
zzz
3 <-
-> 0
zzz
-> 1
zzz
-> 2
zzz
-> 3
zzz
You can see that, as expected, elements were computed in anticipation and everything works fine.
Now let's replace the range with a dask.bag:
import multiprocessing
import time
import dask.bag

def put_q(n, q):
    for i in dask.bag.from_sequence(range(n), npartitions=2):
        print(i, "<-")
        q.put(i)
    q.put(None)

q = multiprocessing.Queue(5)
with multiprocessing.Pool(1, put_q, (4, q)) as pool:
    i = True
    while i is not None:
        print("zzz")
        time.sleep(.5)
        i = q.get()
        if i is None:
            break
        print("-> ", i)
In a Jupyter notebook, it indefinitely raises:
Process ForkPoolWorker-71:
Traceback (most recent call last):
File "/usr/lib/python3.5/multiprocessing/process.py", line 249, in _bootstrap
self.run()
File "/usr/lib/python3.5/multiprocessing/process.py", line 93, in run
self._target(*self._args, **self._kwargs)
File "/usr/lib/python3.5/multiprocessing/pool.py", line 103, in worker
initializer(*initargs)
File "<ipython-input-3-e1e9ef9354a0>", line 8, in put_q
for i in dask.bag.from_sequence(range(n), npartitions=2):
File "/usr/local/lib/python3.5/dist-packages/dask/bag/core.py", line 1190, in __iter__
return iter(self.compute())
File "/usr/local/lib/python3.5/dist-packages/dask/base.py", line 154, in compute
(result,) = compute(self, traverse=False, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/dask/base.py", line 407, in compute
results = get(dsk, keys, **kwargs)
File "/usr/local/lib/python3.5/dist-packages/dask/multiprocessing.py", line 152, in get
initializer=initialize_worker_process)
File "/usr/lib/python3.5/multiprocessing/context.py", line 118, in Pool
context=self.get_context())
File "/usr/lib/python3.5/multiprocessing/pool.py", line 168, in __init__
self._repopulate_pool()
File "/usr/lib/python3.5/multiprocessing/pool.py", line 233, in _repopulate_pool
w.start()
File "/usr/lib/python3.5/multiprocessing/process.py", line 103, in start
'daemonic processes are not allowed to have children'
AssertionError: daemonic processes are not allowed to have children
while the main process is stalled, waiting for elements in the queue.
I also tried using an ipyparallel cluster, but in that case the main process simply stalls (no trace of the exception).
Does anyone know the right way to do this?
Is there a way I can run scheduler.get in parallel with my main code?

In the end, I should have taken a closer look at the exception!
Stack Overflow gave me the solution: Python Process Pool non-daemonic?
In fact, as the bag scheduler uses Pool, it can't be called inside a process spawned by Pool. The solution in my case is to simply use threads. (Note that the bug and its solution depend on the scheduler you use.)
So I substituted multiprocessing.pool.ThreadPool for multiprocessing.Pool and it works like a charm, either in a normal notebook or when using ipyparallel.
So it goes like this:
import queue
from multiprocessing.pool import ThreadPool
import time
import dask.bag

def put_q(n, q):
    b = dask.bag.from_sequence(range(n), npartitions=3)
    for i in b:
        print(i, "<-")
        q.put(i)
    q.put(None)

q = queue.Queue(2)
with ThreadPool(1, put_q, (6, q)) as pool:
    i = True
    while i is not None:
        print("zzz")
        time.sleep(.5)
        i = q.get()
        if i is None:
            break
        print("-> ", i)
Which outputs:
zzz
0 <-
1 <-
2 <-
-> 0
zzz
3 <-
-> 1
zzz
4 <-
-> 5 <-
2
zzz
-> 3
zzz
-> 4
zzz
-> 5
zzz
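To glue this back onto the original use case, here is a rough sketch of a prefetching wrapper around compute_partition_iter, using a plain background thread instead of a ThreadPool; the prefetch helper and the fit_generator call below are illustrative, not part of the original code:
import queue
import threading

def prefetch(generator, maxsize=2):
    """Iterate `generator` in a background thread, keeping up to `maxsize` items ready."""
    q = queue.Queue(maxsize)
    sentinel = object()

    def producer():
        for item in generator:
            q.put(item)
        q.put(sentinel)  # signal exhaustion

    threading.Thread(target=producer, daemon=True).start()
    while True:
        item = q.get()
        if item is sentinel:
            return
        yield item

# e.g. model.fit_generator(prefetch(compute_partition_iter(bag)), steps_per_epoch=...)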

Related

Python Script that used to work, is now getting automatically killed in Ubuntu

I was once able to run the Python script below on my Ubuntu machine without the memory errors I was getting on Windows.
import pandas as pd
import numpy as np

# create a pandas dataframe for each input file
dfs1 = pd.read_csv('s1.csv', encoding='utf-8', names=list(range(0,107)), dtype='string', na_filter=False)
dfs2 = pd.read_csv('s2.csv', encoding='utf-8', names=list(range(0,107)), dtype='string', na_filter=False)
dfr = pd.read_csv('r.csv', encoding='utf-8', names=list(range(0,107)), dtype='string', na_filter=False)

# combine them into one dataframe
dfs12r = pd.concat([dfs1, dfs2, dfr], ignore_index=True)  # without ignore_index the line numbers are not adjusted

# the bag of words (bow) is coming
wordlist = []
for line in range(8052):
    for row in range(106):
        # print(line, row, dfs12r[row][line])
        if dfs12r[row][line] not in wordlist:
            wordlist.append(dfs12r[row][line])
wordlist.sort()
# print(wordlist)
print(len(wordlist))  # 12350

dfBOW = pd.DataFrame(np.zeros((len(dfs12r.index), len(wordlist))), dtype='int')

# create the dictionary
wordDict = dict.fromkeys(wordlist, 'default')
counter = 0
for word in wordlist:
    wordDict[word] = counter
    counter += 1
# print(wordDict)

# scan every word from dfs12r and +1 the respective cell in dfBOW
for line in range(8052):
    for row in range(107):
        dfBOW[wordDict[dfs12r[row][line]]][line] += 1
Unfortunately, probably after some automatic Ubuntu updates, I am now getting the simple message "Killed" when I try to run the script, without any further explanation.
Through simple print statements I know that the script is interrupted inside the final for loop.
I understand that I should be able to make the script more memory efficient, but I am also hoping for guidance on how to get Ubuntu to run the same script like it used to. (Through the top command I can see that all of my memory, including swap, is being used while inside this loop.)
Could paging have been disabled somehow after the updates? Any advice is welcome.
I still have 16GB of RAM and use Ubuntu 20.04 (the specs are the same before and after the script stopped working). I use dual boot on the same SSD.
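(On the memory-efficiency side, one option would be something like the following sketch, which builds the bag of words as a scipy sparse matrix instead of a dense DataFrame; this is an illustrative, untested variant of the script above, not part of the original code.)
import pandas as pd
import numpy as np
from scipy.sparse import lil_matrix

# same inputs as above
frames = [pd.read_csv(f, encoding='utf-8', names=list(range(0,107)), dtype='string', na_filter=False)
          for f in ('s1.csv', 's2.csv', 'r.csv')]
dfs12r = pd.concat(frames, ignore_index=True)

# build the vocabulary with a set (constant-time membership tests), then a word -> column map
vocab = sorted({dfs12r[row][line] for line in range(len(dfs12r.index)) for row in range(107)})
wordDict = {word: i for i, word in enumerate(vocab)}

# a sparse matrix only stores the non-zero counts, instead of a huge dense int frame
bow = lil_matrix((len(dfs12r.index), len(vocab)), dtype=np.int32)
for line in range(len(dfs12r.index)):
    for row in range(107):
        bow[line, wordDict[dfs12r[row][line]]] += 1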
Below is the error I am getting from the same script on Windows:
Traceback (most recent call last):
File "D:\sharedfiles\Organised\WorkSpace\ptixiaki\github\ptixiaki\code\makingthedata\2.1 Approach (Same as 2 but turning all words to lowercase)\2.1_CSVtoDataframe\CSVtoBOW.py", line 60, in <module>
dfBOW[wordDict[dfs12r[row][line]]][line]+=1
File "D:\wandowsoftware\anaconda\envs\ptixiaki\lib\site-packages\pandas\core\series.py", line 1143, in __setitem__
self._maybe_update_cacher()
File "D:\wandowsoftware\anaconda\envs\ptixiaki\lib\site-packages\pandas\core\series.py", line 1279, in _maybe_update_cacher
ref._maybe_cache_changed(cacher[0], self, inplace=inplace)
File "D:\wandowsoftware\anaconda\envs\ptixiaki\lib\site-packages\pandas\core\frame.py", line 3950, in _maybe_cache_changed
self._mgr.iset(loc, arraylike, inplace=inplace)
File "D:\wandowsoftware\anaconda\envs\ptixiaki\lib\site-packages\pandas\core\internals\managers.py", line 1141, in iset
blk.delete(blk_locs)
File "D:\wandowsoftware\anaconda\envs\ptixiaki\lib\site-packages\pandas\core\internals\blocks.py", line 388, in delete
self.values = np.delete(self.values, loc, 0) # type: ignore[arg-type]
File "<__array_function__ internals>", line 5, in delete
File "D:\wandowsoftware\anaconda\envs\ptixiaki\lib\site-packages\numpy\lib\function_base.py", line 4555, in delete
new = arr[tuple(slobj)]
MemoryError: Unable to allocate 501. MiB for an array with shape (12234, 10736) and data type int32

Dask array from_npy_stack misses info file

Action
Trying to create a Dask array from a stack of .npy files not written by Dask.
Problem
Dask's from_npy_stack() expects an info file, which is normally created by the to_npy_stack() function when writing a .npy stack with Dask.
Attempts
I found this PR (https://github.com/dask/dask/pull/686) with a description of how the info file is created:
def to_npy_info(dirname, dtype, chunks, axis):
    with open(os.path.join(dirname, 'info'), 'wb') as f:
        pickle.dump({'chunks': chunks, 'dtype': x.dtype, 'axis': axis}, f)
Question
How do I go about loading .npy stacks that are created outside of Dask?
Example
from pathlib import Path
import numpy as np
import dask.array as da

data_dir = Path('/home/tom/data/')
for i in range(3):
    data = np.zeros((2,2))
    np.save(data_dir.joinpath('{}.npy'.format(i)), data)

data = da.from_npy_stack('/home/tom/data')
Resulting in the following error:
---------------------------------------------------------------------------
IOError Traceback (most recent call last)
<ipython-input-94-54315c368240> in <module>()
9 np.save(data_dir.joinpath('{}.npy'.format(i)), data)
10
---> 11 data = da.from_npy_stack('/home/tom/data/')
/home/tom/vue/env/local/lib/python2.7/site-packages/dask/array/core.pyc in from_npy_stack(dirname, mmap_mode)
3722 Read data in memory map mode
3723 """
-> 3724 with open(os.path.join(dirname, 'info'), 'rb') as f:
3725 info = pickle.load(f)
3726
IOError: [Errno 2] No such file or directory: '/home/tom/data/info'
The function from_npy_stack is short and simple. I agree that it probably ought to take the metadata as an optional argument for cases such as yours, but you could simply reuse the lines of code that come after loading the "info" file, assuming you have the right values to hand. Some of these values, i.e., the dtype and the shape of each array (for building chunks), could presumably be obtained by looking at the first of the data files:
name = 'from-npy-stack-%s' % dirname
keys = list(product([name], *[range(len(c)) for c in chunks]))
values = [(np.load, os.path.join(dirname, '%d.npy' % i), mmap_mode)
          for i in range(len(chunks[axis]))]
dsk = dict(zip(keys, values))
out = Array(dsk, name, chunks, dtype)
Also, note that we are constructing the names of the files here, but you might want to get those by doing a listdir or glob.
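Putting those pieces together, here is a hedged sketch of a loader that recovers the metadata from the first file; the helper name from_npy_dir, the glob-based file discovery, and the assumptions that all files share one shape and are stacked along axis 0 are mine, not part of dask:
import os
from glob import glob
from itertools import product

import numpy as np
from dask.array import Array

def from_npy_dir(dirname, mmap_mode=None):
    # discover the files instead of relying on the pickled "info" file
    paths = sorted(glob(os.path.join(dirname, '*.npy')))  # note: lexicographic order
    # peek at the first file to recover the dtype and the per-file shape
    first = np.load(paths[0], mmap_mode='r')
    chunks = ((first.shape[0],) * len(paths),) + tuple((s,) for s in first.shape[1:])
    name = 'from-npy-dir-' + dirname
    keys = list(product([name], *[range(len(c)) for c in chunks]))
    values = [(np.load, p, mmap_mode) for p in paths]
    dsk = dict(zip(keys, values))
    return Array(dsk, name, chunks, first.dtype)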

Concurrently read an HDF5 file in Pandas

I have a data.h5 file organised in multiple chunks, the entire file being several hundred gigabytes. I need to work with a filtered subset of the file in memory, in the form of a Pandas DataFrame.
The goal of the following routine is to distribute the filtering work across several processes, then concatenate the filtered results into the final DataFrame.
Since reading from the file takes a significant amount of time, I'm trying to make each process read its own chunk in a concurrent manner as well.
import multiprocessing as mp, pandas as pd

store = pd.HDFStore('data.h5')
min_dset, max_dset = 0, len(store.keys()) - 1
dset_list = list(range(min_dset, max_dset))
frames = []

def read_and_return_subset(dset):
    # each process is intended to read its own chunk in a concurrent manner
    chunk = store.select('batch_{:03}'.format(dset))
    # and then process the chunk, do the filtering, and return the result
    output = chunk[chunk.some_condition == True]
    return output

with mp.Pool(processes=32) as pool:
    for frame in pool.map(read_and_return_subset, dset_list):
        frames.append(frame)

df = pd.concat(frames)
However, the above code triggers this error:
HDF5ExtError Traceback (most recent call last)
<ipython-input-174-867671c5a58f> in <module>()
53
54 with mp.Pool(processes=32) as pool:
---> 55 for frame in pool.map(read_and_return_subset, dset_list):
56 frames.append(frame)
57
/usr/lib/python3.5/multiprocessing/pool.py in map(self, func, iterable, chunksize)
258 in a list that is returned.
259 '''
--> 260 return self._map_async(func, iterable, mapstar, chunksize).get()
261
262 def starmap(self, func, iterable, chunksize=None):
/usr/lib/python3.5/multiprocessing/pool.py in get(self, timeout)
606 return self._value
607 else:
--> 608 raise self._value
609
610 def _set(self, i, obj):
HDF5ExtError: HDF5 error back trace
File "H5Dio.c", line 173, in H5Dread
can't read data
File "H5Dio.c", line 554, in H5D__read
can't read data
File "H5Dchunk.c", line 1856, in H5D__chunk_read
error looking up chunk address
File "H5Dchunk.c", line 2441, in H5D__chunk_lookup
can't query chunk address
File "H5Dbtree.c", line 998, in H5D__btree_idx_get_addr
can't get chunk info
File "H5B.c", line 340, in H5B_find
unable to load B-tree node
File "H5AC.c", line 1262, in H5AC_protect
H5C_protect() failed.
File "H5C.c", line 3574, in H5C_protect
can't load entry
File "H5C.c", line 7954, in H5C_load_entry
unable to load entry
File "H5Bcache.c", line 143, in H5B__load
wrong B-tree signature
End of HDF5 error back trace
Problems reading the array data.
It seems that Pandas/PyTables have trouble when trying to access the same file concurrently, even if it's only for reading.
Is there a way to make each process read its own chunk concurrently?
IIUC, you can index the columns that are used for filtering the data (chunk.some_condition == True in your sample code) and then read only the subset of data that satisfies the needed conditions.
In order to be able to do that you need to:
save the HDF5 file in table format - use the parameter format='table'
index the columns that will be used for filtering - use the parameter data_columns=['col_name1', 'col_name2', etc.]
After that you should be able to filter your data just by reading:
store = pd.HDFStore(filename)
df = store.select('key_name', where="col1 in [11,13] & col2 == 'AAA'")
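For completeness, a minimal sketch of the write side that makes such queries possible; the file name, key, and column names here are illustrative:
import pandas as pd

df = pd.DataFrame({'col1': [11, 12, 13], 'col2': ['AAA', 'BBB', 'AAA']})

# write in table format so the file is queryable, and index the filter columns
with pd.HDFStore('data.h5') as store:
    store.put('key_name', df, format='table', data_columns=['col1', 'col2'])

# later, read back only the rows that match the condition
with pd.HDFStore('data.h5') as store:
    subset = store.select('key_name', where="col1 in [11,13] & col2 == 'AAA'")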

Simple Hidden Markov Model in PyMC3 throws Theano error

I'm new to PyMC3, Theano, and numpy. I was just trying to duplicate the first 'hidden' Markov model in the Stan manual--the one in which the states are actually observed. But I keep running into errors having to do with Theano, numpy, and perhaps what is going on behind PyMC3 distributions, which seem a bit mysterious to me. My code for the model is below:
import pandas as pd
dat_hmm = pd.read_csv('hmmVals.csv')
emission = dat_hmm.emission.values
state = dat_hmm.state.values

from pymc3 import Model, Dirichlet, Categorical
import numpy as np

basic_model = Model()
with basic_model:
    # Model constants:
    # num unique hidden states, num unique emissions, num instances
    K = 3; V = 9; T = 10
    alpha = np.ones(K); beta = np.ones(V)
    # Priors for unknown model parameters
    theta = np.empty(K, dtype=object)  # theta = transmission
    phi = np.empty(K, dtype=object)    # phi = emission
    # observed emission, state:
    w = np.empty(T, dtype=object); z = np.empty(T, dtype=object)
    for k in range(K):
        theta[k] = Dirichlet('theta' + str(k), alpha)
        phi[k] = Dirichlet('phi' + str(k), beta)
    # Likelihood (sampling distribution) of observations
    for t in range(T):
        w[t] = Categorical('w' + str(t), theta[state[t]], shape=1, observed=emission[t])
    for t in range(2, T):
        z[t] = Categorical('z' + str(t), phi[state[t-1]], shape=1, observed=state[t])
The line "w[t]=Categorical('w'+str(t),theta[state[t]], shape=1, observed=emission[t])" generates the error, but not on t=0, which fills in w0, but on t=1 which generates an index out of bound error. There is no index out of bound in the code line itself because state[1], theta[state[t]], and emission[t] all exist. The error messages are:
Traceback (most recent call last):
File "/usr/local/lib/python3.4/dist-packages/pymc3/distributions/distribution.py", line 25, in __new__
return model.Var(name, dist, data)
File "/usr/local/lib/python3.4/dist-packages/pymc3/model.py", line 306, in Var
var = ObservedRV(name=name, data=data, distribution=dist, model=self)
File "/usr/local/lib/python3.4/dist-packages/pymc3/model.py", line 581, in __init__
self.logp_elemwiset = distribution.logp(data)
File "/usr/local/lib/python3.4/dist-packages/pymc3/distributions/discrete.py", line 400, in logp
a = tt.log(p[value])
File "/usr/local/lib/python3.4/dist-packages/theano/tensor/var.py", line 532, in __getitem__
lambda entry: isinstance(entry, Variable)))
File "/usr/local/lib/python3.4/dist-packages/theano/gof/op.py", line 668, in __call__
required = thunk()
File "/usr/local/lib/python3.4/dist-packages/theano/gof/op.py", line 883, in rval
fill_storage()
File "/usr/local/lib/python3.4/dist-packages/theano/gof/cc.py", line 1707, in __call__
reraise(exc_type, exc_value, exc_trace)
File "/usr/local/lib/python3.4/dist-packages/six.py", line 686, in reraise
raise value
IndexError: index out of bounds
I don't know about the wisdom of sticking numpy objects into PyMC3 distributions or using the result of that to try to parameterize another distribution, but I have seen somewhat similar code on the web, minus the last part. Is there perhaps no good way to code such a hidden Markov model in PyMC3 yet?
I have found a way to fix the above error. The following code works: no errors, and I'm able to get correct parameter estimates, with Metropolis at least.
I made two mistakes and didn't realize they were so simple because I expected something complicated to be happening in Theano. One is that my data was set up for Stan and so indexed starting at 1 rather than 0, while Python indexes everything from 0; I changed the data file by subtracting 1 from every value. The other error was that I used theta, the transmission matrix, to compute individual emissions, and vice versa for the phi matrix. Theta was too short for the emissions.
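(As an aside, instead of editing the data file, the shift could also be done after loading; this is just an illustrative variant of the read_csv lines below, assuming the columns hold integer codes.)
import pandas as pd

dat_hmm = pd.read_csv('hmmVals.csv')
# shift Stan-style 1-based codes to Python's 0-based indexing
emission = dat_hmm.emission.values - 1
state = dat_hmm.state.values - 1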
What I wish I understood now is why the NUTS sampler keeps telling me I have a non-positive definite scaling, even though I'm feeding it MAP estimates. Metropolis works, but it is slow: about 11 minutes for these 300 observations and 1000 samples. The other mystery is why PyMC3 thinks it only took a couple of seconds to calculate the samples.
import pandas as pd
dat_hmm = pd.read_csv('hmmVals.csv')
emission = dat_hmm.emission.values
state = dat_hmm.state.values

from pymc3 import Model, Dirichlet, Categorical
import numpy as np

basic_model = Model()
with basic_model:
    # Model constants:
    K = 3; V = 9; T = 300  # num unique hidden states, num unique emissions, num instances
    alpha = np.ones(K); beta = np.ones(V)
    # Priors for unknown model parameters
    theta = np.empty(K, dtype=object)  # theta = transmission
    phi = np.empty(K, dtype=object)    # phi = emission
    w = np.empty(T, dtype=object); z = np.empty(T, dtype=object)  # observed emission, state
    for k in range(K):
        theta[k] = Dirichlet('theta' + str(k), alpha)
        phi[k] = Dirichlet('phi' + str(k), beta)
    # Likelihood (sampling distribution) of observations
    for t in range(2, T):
        z[t] = Categorical('z' + str(t), theta[state[t-1]], shape=1, observed=state[t])
    for t in range(T):
        w[t] = Categorical('w' + str(t), phi[state[t]], shape=1, observed=emission[t])
I have also tried to implement HMMs in PyMC3 and I ran into similar problems. I just found a way to implement a two-level HMM in a vectorized fashion (to be perfectly honest, my model is not hidden, but the hidden part can be added easily; I vectorized the description of the state variable). I am not sure whether this is the most efficient way, but I tested this code against a simple for loop for defining the states: the code below runs in less than a minute with 1000 data points, whereas the for loop took several hours.
Here is the code:
import numpy as np
import theano.tensor as tt
import pymc3 as pm

class HMMStates(pm.Discrete):
    """
    Hidden Markov Model States

    Parameters
    ----------
    P1 : tensor
        probability to remain in state 1
    P2 : tensor
        probability to move from state 2 to state 1
    """
    def __init__(self, PA=None, P1=None, P2=None, *args, **kwargs):
        super(HMMStates, self).__init__(*args, **kwargs)
        self.PA = PA
        self.P1 = P1
        self.P2 = P2
        self.mean = 0.

    def logp(self, x):
        PA = self.PA
        P1 = self.P1
        P2 = self.P2
        # now we need to create an array with probabilities
        # so that for x=A: PA=P1, PB=(1-P1)
        # and for x=B: PA=P2, PB=(1-P2)
        length = x.shape[0]
        P1T = tt.tile(P1, (length-1, 1)).T
        P2T = tt.tile(P2, (length-1, 1)).T
        P = tt.switch(x[:-1], P1T, P2T).T
        x_i = x[1:]
        ou_like = pm.Categorical.dist(P).logp(x_i)
        return pm.Categorical.dist(PA).logp(x[0]) + tt.sum(ou_like)
This class creates the states of the HMM. To call it you can do the following:
theta = np.ones(2)  # prior for probabilities
with pm.Model() as model:
    # 2 state model
    # P1 is the probability to stay in state 1
    # P2 is the probability to move from state 2 to state 1
    P1 = pm.Dirichlet('P1', a=theta)
    P2 = pm.Dirichlet('P2', a=theta)
    PA = pm.Deterministic('PA', P2 / (P2 + 1 - P1))
    states = HMMStates('states', PA, P1, P2, observed=data)
    start = pm.find_MAP()
    trace = pm.sample(5000, start=start)
Just to show what the old code looked like:
with pm.Model() as model:
    # 2 state model
    # P1 is the probability to stay in state 1
    # P2 is the probability to move from state 2 to state 1
    P1 = pm.Dirichlet('P1', a=np.ones(2))
    P2 = pm.Dirichlet('P2', a=np.ones(2))
    PA = pm.Deterministic('PA', P2 / (P2 + 1 - P1))
    state = pm.Categorical('state0', PA, observed=data[0])
    for i in range(1, N_chain):
        state = pm.Categorical('state' + str(i), tt.switch(data[i-1], P1, P2), observed=data[i])
    start = pm.find_MAP()
    trace = pm.sample(5000, start=start)

Clustering of sparse matrix in python and scipy

I'm trying to cluster some data with Python and scipy, but the following code does not work, for a reason I do not understand:
from scipy.sparse import *

matrix = dok_matrix((en, en), int)
for pub in pubs:
    authors = pub.split(";")
    for auth1 in authors:
        for auth2 in authors:
            if auth1 == auth2: continue
            id1 = e2id[auth1]
            id2 = e2id[auth2]
            matrix[id1, id2] += 1

from scipy.cluster.vq import vq, kmeans2, whiten
result = kmeans2(matrix, 30)
print result
It says:
Traceback (most recent call last):
File "cluster.py", line 40, in <module>
result = kmeans2(matrix, 30)
File "/usr/lib/python2.7/dist-packages/scipy/cluster/vq.py", line 683, in kmeans2
clusters = init(data, k)
File "/usr/lib/python2.7/dist-packages/scipy/cluster/vq.py", line 576, in _krandinit
return init_rankn(data)
File "/usr/lib/python2.7/dist-packages/scipy/cluster/vq.py", line 563, in init_rankn
mu = np.mean(data, 0)
File "/usr/lib/python2.7/dist-packages/numpy/core/fromnumeric.py", line 2374, in mean
return mean(axis, dtype, out)
TypeError: mean() takes at most 2 arguments (4 given)
When I use kmeans instead of kmeans2, I get the following error:
Traceback (most recent call last):
File "cluster.py", line 40, in <module>
result = kmeans(matrix, 30)
File "/usr/lib/python2.7/dist-packages/scipy/cluster/vq.py", line 507, in kmeans
guess = take(obs, randint(0, No, k), 0)
File "/usr/lib/python2.7/dist-packages/numpy/core/fromnumeric.py", line 103, in take
return take(indices, axis, out, mode)
TypeError: take() takes at most 3 arguments (5 given)
I think I have these problems because I'm using sparse matrices, but my matrices are too big to fit in memory otherwise. Is there a way to use standard clustering algorithms from scipy with sparse matrices, or do I have to re-implement them myself?
I created a new version of my code to work in a vector space:
el = len(experts)
pl = len(pubs)
print el, pl

from scipy.sparse import *
P = dok_matrix((pl, el), int)
p_id = 0
for pub in pubs:
    authors = pub.split(";")
    for auth1 in authors:
        if len(auth1) < 2: continue
        id1 = e2id[auth1]
        P[p_id, id1] = 1

from scipy.cluster.vq import kmeans, kmeans2, whiten
result = kmeans2(P, 30)
print result
But I'm still getting the error:
TypeError: mean() takes at most 2 arguments (4 given)
What am I doing wrong?
K-means cannot be run on distance matrices.
It needs a vector space to compute means in; that is why it is called k-means. If you want to use a distance matrix, you need to look into purely distance-based algorithms such as DBSCAN and OPTICS (both on Wikipedia).
May I suggest "Affinity Propagation" from scikit-learn? In the work I've been doing with it, I find that it has generally been able to find the 'naturally' occurring clusters within my data set. The input to the algorithm is an affinity matrix, or similarity matrix, using any arbitrary similarity measure.
I don't have a good handle on the kind of data you have on hand, so I can't speak to the exact suitability of this method to your data set, but it may be worth a try, perhaps?
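A minimal sketch of that suggestion, assuming the co-authorship count matrix from the question is used directly as the similarity matrix (note that Affinity Propagation needs the full dense matrix in memory):
import numpy as np
from sklearn.cluster import AffinityPropagation

# densify the scipy sparse co-authorship matrix; this needs the full N x N array in memory
S = np.asarray(matrix.todense(), dtype=float)

ap = AffinityPropagation(affinity='precomputed')
labels = ap.fit_predict(S)
print(labels)  # one cluster label per author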
Alternatively, if you're looking to cluster graphs, I'd take a look at NetworkX. That might be a useful tool for you. The reason I suggest this is that it looks like the data you're working with is a network of authors. Hence, with NetworkX, you can put in an adjacency matrix and find out which authors are clustered together.
For a further elaboration on this, you can see a question that I had asked earlier for inspiration here.
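A minimal sketch of that NetworkX route, again treating the sparse co-authorship matrix from the question as an adjacency matrix (the exact constructor name depends on your NetworkX version):
import networkx as nx

# build an undirected graph from the sparse adjacency matrix
G = nx.from_scipy_sparse_matrix(matrix)  # from_scipy_sparse_array in NetworkX >= 3.0
# connected components give a first, coarse grouping of authors
groups = list(nx.connected_components(G))
print(len(groups))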