Add to items, with multiple occurrences [duplicate] - numpy

I have an unsorted array of indices:
i = np.array([1,5,2,6,4,3,6,7,4,3,2])
I also have an array of values of the same length:
v = np.array([2,5,2,3,4,1,2,1,6,4,2])
I have an array of zeros to hold the desired values:
d = np.zeros(10)
Now I want to add each value of v to the element of d selected by the corresponding index in i.
In plain Python I would do it like this:
for index, value in enumerate(v):
    idx = i[index]
    d[idx] += value
This is ugly and inefficient. How can I improve it?

np.add.at(d, i, v)
You'd think d[i] += v would work, but if you try to do multiple additions to the same cell that way, one of them overrides the others. The ufunc.at method avoids those problems.
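A minimal sketch of the difference, using the arrays from the question (the names buffered and accumulated are only illustrative):

```python
import numpy as np

i = np.array([1, 5, 2, 6, 4, 3, 6, 7, 4, 3, 2])
v = np.array([2, 5, 2, 3, 4, 1, 2, 1, 6, 4, 2])

# Fancy-indexed in-place add: each repeated index is written only once,
# so some contributions are lost.
buffered = np.zeros(10)
buffered[i] += v

# Unbuffered ufunc.at: every (index, value) pair is accumulated.
accumulated = np.zeros(10)
np.add.at(accumulated, i, v)

print(buffered.sum(), accumulated.sum(), v.sum())
```

Only the np.add.at total matches v.sum(); the buffered version falls short because indices 2, 3, 4 and 6 each occur twice.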

We can use np.bincount which is supposedly pretty efficient for such accumulative weighted counting, so here's one with that -
counts = np.bincount(i,v)
d[:counts.size] = counts
Alternatively, using minlength input argument and for a generic case when d could be any array and we want to add into it -
d += np.bincount(i,v,minlength=d.size).astype(d.dtype, copy=False)
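As a quick check of the minlength variant, the sketch below starts from a non-zero d and compares against np.add.at on the same starting values. The data is taken from the question; expected is an illustrative name.

```python
import numpy as np

i = np.array([1, 5, 2, 6, 4, 3, 6, 7, 4, 3, 2])
v = np.array([2, 5, 2, 3, 4, 1, 2, 1, 6, 4, 2])

# Start from a non-zero target to show that the sums are added, not assigned.
d = np.ones(10)
d += np.bincount(i, weights=v, minlength=d.size).astype(d.dtype, copy=False)

# Reference result computed with np.add.at on the same starting values.
expected = np.ones(10)
np.add.at(expected, i, v)

print(d)
```

Note that minlength only pads the result. If i contains an index of d.size or larger, the bincount output is longer than d and the in-place add fails with a shape mismatch.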
Runtime tests
This section compares np.add.at based approach listed in the other post with the np.bincount based one listed earlier in this post.
In [61]: def bincount_based(d,i,v):
...:     counts = np.bincount(i,v)
...:     d[:counts.size] = counts
...:
...: def add_at_based(d,i,v):
...:     np.add.at(d, i, v)
...:
In [62]: # Inputs (random numbers)
...: N = 10000
...: i = np.random.randint(0,1000,(N))
...: v = np.random.randint(0,1000,(N))
...:
...: # Setup output arrays for two approaches
...: M = 12000
...: d1 = np.zeros(M)
...: d2 = np.zeros(M)
...:
In [63]: bincount_based(d1,i,v) # Run approaches
...: add_at_based(d2,i,v)
...:
In [64]: np.allclose(d1,d2) # Verify outputs
Out[64]: True
In [67]: # Setup output arrays for two approaches again for timing
...: M = 12000
...: d1 = np.zeros(M)
...: d2 = np.zeros(M)
...:
In [68]: %timeit add_at_based(d2,i,v)
1000 loops, best of 3: 1.83 ms per loop
In [69]: %timeit bincount_based(d1,i,v)
10000 loops, best of 3: 52.7 µs per loop

Related

Finding top 3 dominant topics for LDA topic model

I am creating a datatable via this LDA modeling tutorial, (https://www.machinelearningplus.com/nlp/topic-modeling-visualization-how-to-present-results-lda-models/) and instead of just finding the single most dominant topic, I want to expand to find the top 3 most dominant topics, along with each of their percent contributions and topic keywords.
To do that, is it best to create 2 additional functions that build 3 separate dataframes and then append the results? Or is there a simpler way to modify the format_topics_sentences function to pull the top 3 topics from the enumerated bag-of-words corpus?
def format_topics_sentences(ldamodel=None, corpus=corpus, texts=data):
    # Init output
    sent_topics_df = pd.DataFrame()
    # Get main topic in each document
    for i, row_list in enumerate(ldamodel[corpus]):
        row = row_list[0] if ldamodel.per_word_topics else row_list
        # print(row)
        row = sorted(row, key=lambda x: (x[1]), reverse=True)
        # Get the Dominant topic, Perc Contribution and Keywords for each document
        for j, (topic_num, prop_topic) in enumerate(row):
            if j == 0:  # => dominant topic
                wp = ldamodel.show_topic(topic_num)
                topic_keywords = ", ".join([word for word, prop in wp])
                sent_topics_df = sent_topics_df.append(pd.Series([int(topic_num), round(prop_topic,4), topic_keywords]), ignore_index=True)
            else:
                break
    sent_topics_df.columns = ['Dominant_Topic', 'Perc_Contribution', 'Topic_Keywords']
    # Add original text to the end of the output
    contents = pd.Series(texts)
    sent_topics_df = pd.concat([sent_topics_df, contents], axis=1)
    return(sent_topics_df)
df_topic_sents_keywords = format_topics_sentences(ldamodel=lda_model, corpus=corpus, texts=data_ready)
# Format
df_dominant_topic = df_topic_sents_keywords.reset_index()
df_dominant_topic.columns = ['Document_No', 'Dominant_Topic', 'Topic_Perc_Contrib', 'Keywords', 'Text']
df_dominant_topic.head(10)
I had a similar requirement in a recent project, so hopefully this helps you out. You will need to add the topic keywords to the code below:
topics_df1 = pd.DataFrame()
topics_df2 = pd.DataFrame()
topics_df3 = pd.DataFrame()
for i, row_list in enumerate(lda_model[corpus]):
    row = row_list[0] if lda_model.per_word_topics else row_list
    row = sorted(row, key=lambda x: (x[1]), reverse=True)
    for j, (topic_num, prop_topic) in enumerate(row):
        if len(row) >= 3:
            if j ==0:
                topics_df1 = topics_df1.append(pd.Series([int(topic_num), prop_topic]), ignore_index=True)
            elif j ==1:
                topics_df2 = topics_df2.append(pd.Series([int(topic_num), prop_topic]), ignore_index=True)
            elif j ==2:
                topics_df3 = topics_df3.append(pd.Series([int(topic_num), prop_topic]), ignore_index=True)
            else:
                break
        elif len(row) == 2:
            if j ==0:
                topics_df1 = topics_df1.append(pd.Series([int(topic_num), prop_topic]), ignore_index=True)
            elif j ==1:
                topics_df2 = topics_df2.append(pd.Series([int(topic_num), prop_topic]), ignore_index=True)
                topics_df3 = topics_df3.append(pd.Series(['-', '-']), ignore_index=True)
        elif len(row) == 1:
            topics_df1 = topics_df1.append(pd.Series([int(topic_num), prop_topic]), ignore_index=True)
            topics_df2 = topics_df2.append(pd.Series(['-', '-']), ignore_index=True)
            topics_df3 = topics_df3.append(pd.Series(['-', '-']), ignore_index=True)
topics_df1.rename(columns={0:'1st Topic', 1:'1st Topic Contribution'}, inplace=True)
topics_df2.rename(columns={0:'2nd Topic', 1:'2nd Topic Contribution'}, inplace=True)
topics_df3.rename(columns={0:'3rd Topic', 1:'3rd Topic Contribution'}, inplace=True)
topics_comb = pd.concat([topics_df1, topics_df2, topics_df3], axis=1, sort=False)
#Join topics dataframe to original data
new_df = pd.concat([data_ready, topics_comb], axis=1, sort=False)
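One way to avoid the three near-identical branches is to sort each document's topic list once and pad it to a fixed width. The sketch below is plain Python and works on any list of (topic_id, probability) pairs, which is the shape the loop above iterates over. The helper name top_n_topics, the sample doc_topics data and the '-' padding value are assumptions for illustration, and topic keywords are still left out.

```python
def top_n_topics(row, n=3, pad="-"):
    """Return [topic_1, prob_1, ..., topic_n, prob_n], padded when the
    document has fewer than n topics."""
    ranked = sorted(row, key=lambda pair: pair[1], reverse=True)[:n]
    flat = []
    for topic_num, prop_topic in ranked:
        flat.extend([int(topic_num), prop_topic])
    # Pad with placeholder entries so every row has the same width.
    flat.extend([pad] * (2 * (n - len(ranked))))
    return flat


# Hypothetical per-document topic distributions: (topic_id, probability).
doc_topics = [
    [(0, 0.10), (3, 0.60), (7, 0.25), (9, 0.05)],
    [(2, 0.80), (5, 0.20)],
    [(4, 1.00)],
]

rows = [top_n_topics(row) for row in doc_topics]
for r in rows:
    print(r)
```

The resulting rows can be handed to pd.DataFrame(rows, columns=[...]) in a single call, which also avoids growing three dataframes one row at a time.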

Speeding up Euclidean Distance in python [duplicate]

How do you optimize this code?
At the moment it is running too slowly for the amount of data that goes through this loop. This code runs 1-nearest neighbor: it predicts the label of training_element based on p_data_set.
# [x] , [[x1],[x2],[x3]], [l1, l2, l3]
def prediction(training_element, p_data_set, p_label_set):
    temp = np.array([], dtype=float)
    for p in p_data_set:
        temp = np.append(temp, distance.euclidean(training_element, p))
    minIndex = np.argmin(temp)
    return p_label_set[minIndex]
Use a k-D tree for fast nearest-neighbour lookups, e.g. scipy.spatial.cKDTree:
from scipy.spatial import cKDTree
# I assume that p_data_set is (nsamples, ndims)
tree = cKDTree(p_data_set)
# training_elements is also assumed to be (nsamples, ndims)
dist, idx = tree.query(training_elements, k=1)
predicted_labels = p_label_set[idx]
You could use distance.cdist to directly get the distances temp and then use .argmin() to get min-index, like so -
minIndex = distance.cdist(training_element[None],p_data_set).argmin()
Here's an alternative approach using np.einsum -
subs = p_data_set - training_element
minIndex = np.einsum('ij,ij->i',subs,subs).argmin()
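The einsum version works on squared distances, which is enough because taking the square root does not change which row is closest. A small sketch on random data (the array sizes and the seed are arbitrary choices for illustration) compares it with an explicit loop:

```python
import numpy as np

rng = np.random.default_rng(0)
p_data_set = rng.random((1000, 8))
training_element = rng.random(8)

# Reference: explicit Euclidean distance to every row.
dists = np.array([np.linalg.norm(training_element - p) for p in p_data_set])

# einsum computes the row-wise sum of squares, i.e. the squared distance.
subs = p_data_set - training_element
sq_dists = np.einsum('ij,ij->i', subs, subs)

print(dists.argmin(), sq_dists.argmin())
```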
Runtime test
I expected cKDTree to easily beat cdist, but with training_element being a single 1D array the workload apparently isn't heavy for cdist, and I see it beating cKDTree by a good 10x+ margin!
Here are the timing results -
In [422]: # Setup arrays
...: p_data_set = np.random.randint(0,9,(40000,100))
...: training_element = np.random.randint(0,9,(100,))
...:
In [423]: def tree_based(p_data_set,training_element): # @ali_m's soln
...:     tree = cKDTree(p_data_set)
...:     dist, idx = tree.query(training_element, k=1)
...:     return idx
...:
...: def einsum_based(p_data_set,training_element):
...:     subs = p_data_set - training_element
...:     return np.einsum('ij,ij->i',subs,subs).argmin()
...:
In [424]: %timeit tree_based(p_data_set,training_element)
1 loops, best of 3: 210 ms per loop
In [425]: %timeit einsum_based(p_data_set,training_element)
100 loops, best of 3: 17.3 ms per loop
In [426]: %timeit distance.cdist(training_element[None],p_data_set).argmin()
100 loops, best of 3: 14.8 ms per loop
Python can be quite a fast programming language if used properly.
This is my suggestion (faster_prediction):
import numpy as np
import time
def euclidean(a,b):
    return np.linalg.norm(a-b)
def prediction(training_element, p_data_set, p_label_set):
    temp = np.array([], dtype=float)
    for p in p_data_set:
        temp = np.append(temp, euclidean(training_element, p))
    minIndex = np.argmin(temp)
    return p_label_set[minIndex]
def faster_prediction(training_element, p_data_set, p_label_set):
    temp = np.tile(training_element, (p_data_set.shape[0],1))
    temp = np.sqrt(np.sum( (temp - p_data_set)**2 , 1))
    minIndex = np.argmin(temp)
    return p_label_set[minIndex]
training_element = [1,2,3]
p_data_set = np.random.rand(100000, 3)*10
p_label_set = np.r_[0:p_data_set.shape[0]]
t1 = time.time()
result_1 = prediction(training_element, p_data_set, p_label_set)
t2 = time.time()
t3 = time.time()
result_2 = faster_prediction(training_element, p_data_set, p_label_set)
t4 = time.time()
print "Execution time 1:", t2-t1, "value: ", result_1
print "Execution time 2:", t4-t3, "value: ", result_2
print "Speed up: ", (t2-t1) / (t4-t3)
I get the following result on pretty old laptop:
Execution time 1: 21.6033108234 value: 9819
Execution time 2: 0.0176379680634 value: 9819
Speed up: 1224.81857013
which makes me think I must have made some stupid mistake :)
For very large data, where memory might be an issue, I suggest using Cython or implementing the function in C++ and wrapping it in Python.

Convert numpy array with many dimensions into 2D array with nested numpy arrays

I would like to convert an array with many dimensions (more than 2) into a 2D array where other dimensions would be converted to nested stand-alone arrays.
So if I have an array like numpy.arange(3 * 4 * 5 * 5 * 5).reshape((3, 4, 5, 5, 5)), I would like to convert it to an array of shape (3, 4), where each element would be an array of shape (5, 5, 5). The dtype of the outer array would be object.
For example, for np.arange(8).reshape((1, 1, 2, 2, 2)), the output would be equivalent to:
a = np.ndarray(shape=(1,1), dtype=object)
a[0, 0] = np.arange(8).reshape((1, 1, 2, 2, 2))[0, 0, :, :, :]
How can I do this efficiently?
We can reshape and assign elements from the regular array into the output object dtype array in a single loop that seems to be a tad faster than with two loops, like so -
def reshape_approach(a):
    m,n = a.shape[:2]
    a.shape = (m*n,) + a.shape[2:]
    out = np.empty((m*n),dtype=object)
    for i in range(m*n):
        out[i] = a[i]
    out.shape = (m,n)
    a.shape = (m,n) + a.shape[1:]
    return out
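To check that the single-loop idea produces what the question asks for, here is a small self-contained variant. It is a sketch rather than the function above: to_nested is a hypothetical name, and it uses reshape instead of temporarily changing a.shape.

```python
import numpy as np

def to_nested(a):
    """Hypothetical helper: 2D object array whose cells hold a[i, j]."""
    m, n = a.shape[:2]
    flat = a.reshape((m * n,) + a.shape[2:])  # a view when a is contiguous
    out = np.empty(m * n, dtype=object)
    for k in range(m * n):
        out[k] = flat[k]
    return out.reshape(m, n)

a = np.arange(3 * 4 * 5 * 5 * 5).reshape((3, 4, 5, 5, 5))
nested = to_nested(a)
print(nested.shape, nested.dtype, nested[2, 1].shape)
```

Each cell holds a view into a when the input is contiguous, so no element data is copied in that case.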
Runtime test
Other approach(es) -
# @Scotty1-'s soln
def simply_assign(a):
    m,n = a.shape[:2]
    out = np.empty((m,n),dtype=object)
    for i in range(m):
        for j in range(n):
            out[i,j] = a[i,j]
    return out
Timings -
In [154]: m,n = 300,400
...: a = np.arange(m * n * 5 * 5 * 5).reshape((m,n, 5, 5, 5))
In [155]: %timeit simply_assign(a)
10 loops, best of 3: 39.4 ms per loop
In [156]: %timeit reshape_approach(a)
10 loops, best of 3: 32.9 ms per loop
With 7D data -
In [160]: m,n,p,q = 30,40,30,40
...: a = np.arange(m * n *p * q * 5 * 5 * 5).reshape((m,n,p,q, 5, 5, 5))
In [161]: %timeit simply_assign(a)
1000 loops, best of 3: 421 µs per loop
In [162]: %timeit reshape_approach(a)
1000 loops, best of 3: 316 µs per loop
Thanks for your hint, Mitar. This is how it should look using dtype=object arrays (the np.object alias has been removed from recent NumPy releases, so plain object is used here):
outer_array = np.empty((x.shape[0], x.shape[1]), dtype=object)
for i in range(x.shape[0]):
    for j in range(x.shape[1]):
        outer_array[i, j] = x[i, j]
Looping may not be the most efficient way to do it, but as far as I know there is no vectorized operation for this task.
I expected that some more reshaping would make this even faster than Divakar's solution, but no, Divakar's is faster. Nice solution, Divakar!
def advanced_reshape_solution(x):
    m, n = x.shape[:2]
    sub_arr_size = np.prod(x.shape[2:])
    out_array = np.empty((m * n), dtype=object)
    x_flat_view = x.reshape(-1)
    for i in range(m*n):
        out_array[i] = x_flat_view[i * sub_arr_size:(i + 1) * sub_arr_size].reshape(x.shape[2:])
    return out_array.reshape((m, n))

what is the fastest way to get the mode of a numpy array

I have to find the mode of a NumPy array that I read from an hdf5 file. The NumPy array is 1d and contains floating point values.
my_array=f1[ds_name].value
mod_value=scipy.stats.mode(my_array)
My array is 1d and contains around 1M values. It takes about 15 min for my script to return the mode value. Is there any way to make this faster?
Another question is why scipy.stats.median(my_array) does not work while mode works?
AttributeError: module 'scipy.stats' has no attribute 'median'
The implementation of scipy.stats.mode has a Python loop for handling the axis argument with multidimensional arrays. The following simple implementation, for one-dimensional arrays only, is faster:
def mode1(x):
    values, counts = np.unique(x, return_counts=True)
    m = counts.argmax()
    return values[m], counts[m]
Here's an example. First, make an array of integers with length 1000000.
In [40]: x = np.random.randint(0, 1000, size=(2, 1000000)).sum(axis=0)
In [41]: x.shape
Out[41]: (1000000,)
Check that scipy.stats.mode and mode1 give the same result.
In [42]: from scipy.stats import mode
In [43]: mode(x)
Out[43]: ModeResult(mode=array([1009]), count=array([1066]))
In [44]: mode1(x)
Out[44]: (1009, 1066)
Now check the performance.
In [45]: %timeit mode(x)
2.91 s ± 18 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [46]: %timeit mode1(x)
39.6 ms ± 83.8 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
2.91 seconds for mode(x) and only 39.6 milliseconds for mode1(x).
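To see what the two lines inside mode1 produce, here is the same logic on a tiny made-up array:

```python
import numpy as np

x = np.array([2.5, 1.0, 2.5, 3.0, 1.0, 2.5])

# np.unique returns the sorted distinct values and how often each occurs.
values, counts = np.unique(x, return_counts=True)
m = counts.argmax()  # position of the largest count (the first one on ties)

print(values, counts)
print(values[m], counts[m])
```

Because values is sorted and argmax picks the first maximum, ties are resolved in favour of the smallest value.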
Here's one approach based on sorting -
def mode1d(ar_sorted):
    ar_sorted.sort()
    idx = np.flatnonzero(ar_sorted[1:] != ar_sorted[:-1])
    count = np.empty(idx.size+1,dtype=int)
    count[1:-1] = idx[1:] - idx[:-1]
    count[0] = idx[0] + 1
    count[-1] = ar_sorted.size - idx[-1] - 1
    argmax_idx = count.argmax()
    if argmax_idx==len(idx):
        modeval = ar_sorted[-1]
    else:
        modeval = ar_sorted[idx[argmax_idx]]
    modecount = count[argmax_idx]
    return modeval, modecount
Note that this sorts the input array in place. If you need the input array left unchanged, pass a copy.
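If the in-place sort is a problem, a non-mutating variant of the same sort-and-count idea could look like the sketch below. This is not the function from this post: mode_sorted is a hypothetical name, it pays for a copy via np.sort, and it assumes a non-empty 1D input.

```python
import numpy as np

def mode_sorted(x):
    """Sort-based mode that leaves x untouched (np.sort returns a copy)."""
    s = np.sort(x)
    # Start index of each run of equal values in the sorted copy.
    starts = np.flatnonzero(np.concatenate(([True], s[1:] != s[:-1])))
    # Run lengths: differences between consecutive starts, with s.size as the end.
    run_lengths = np.diff(np.append(starts, s.size))
    k = run_lengths.argmax()
    return s[starts[k]], run_lengths[k]

x = np.array([3.0, 1.0, 3.0, 2.0, 3.0, 1.0])
original = x.copy()
modeval, modecount = mode_sorted(x)
print(modeval, modecount)
```

Treating position 0 as the start of the first run also covers an array in which every element is equal.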
Sample run on 1M elements -
In [65]: x = np.random.randint(0, 1000, size=(1000000)).astype(float)
In [66]: from scipy.stats import mode
In [67]: mode(x)
Out[67]: ModeResult(mode=array([ 295.]), count=array([1098]))
In [68]: mode1d(x)
Out[68]: (295.0, 1098)
Runtime test
In [75]: x = np.random.randint(0, 1000, size=(1000000)).astype(float)
# Scipy's mode
In [76]: %timeit mode(x)
1 loop, best of 3: 1.64 s per loop
# @Warren Weckesser's soln
In [77]: %timeit mode1(x)
10 loops, best of 3: 52.7 ms per loop
# Proposed in this post
In [78]: %timeit mode1d(x)
100 loops, best of 3: 12.8 ms per loop
With a copy, the timings for mode1d would be comparable to mode1.
I added the two functions mode1 and mode1d from replies above to my script and tried to compare with the scipy.stats.mode.
dir_name="C:/Users/test_mode"
file_name="myfile2.h5"
ds_name="myds"
f_in=os.path.join(dir_name,file_name)
def mode1(x):
    values, counts = np.unique(x, return_counts=True)
    m = counts.argmax()
    return values[m], counts[m]
def mode1d(ar_sorted):
    ar_sorted.sort()
    idx = np.flatnonzero(ar_sorted[1:] != ar_sorted[:-1])
    count = np.empty(idx.size+1,dtype=int)
    count[1:-1] = idx[1:] - idx[:-1]
    count[0] = idx[0] + 1
    count[-1] = ar_sorted.size - idx[-1] - 1
    argmax_idx = count.argmax()
    if argmax_idx==len(idx):
        modeval = ar_sorted[-1]
    else:
        modeval = ar_sorted[idx[argmax_idx]]
    modecount = count[argmax_idx]
    return modeval, modecount
startTime=time.time()
with h5py.File(f_in, "a") as f1:
    myds=f1[ds_name].value
time1=time.time()
file_read_time=time1-startTime
print(str(file_read_time)+"\t"+"s"+"\t"+str((file_read_time)/60)+"\t"+"min")
print("mode_scipy=")
mode_scipy=scipy.stats.mode(myds)
print(mode_scipy)
time2=time.time()
mode_scipy_time=time2-time1
print(str(mode_scipy_time)+"\t"+"s"+"\t"+str((mode_scipy_time)/60)+"\t"+"min")
print("mode1=")
mode1=mode1(myds)
print(mode1)
time3=time.time()
mode1_time=time3-time2
print(str(mode1_time)+"\t"+"s"+"\t"+str((mode1_time)/60)+"\t"+"min")
print("mode1d=")
mode1d=mode1d(myds)
print(mode1d)
time4=time.time()
mode1d_time=time4-time3
print(str(mode1d_time)+"\t"+"s"+"\t"+str((mode1d_time)/60)+"\t"+"min")
The result from running the script on a NumPy array of around 1M values is:
mode_scipy=
ModeResult(mode=array([ 1.11903353e-06], dtype=float32), count=array([304909]))
938.8368742465973 s
15.647281237443288 min
mode1=(1.1190335e-06, 304909)
0.06500649452209473 s
0.0010834415753682455 min
mode1d=(1.1190335e-06, 304909)
0.06200599670410156 s
0.0010334332784016928 min
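On the side question about the median: scipy.stats does not provide a median function, which is exactly what the AttributeError reports. NumPy's np.median covers that case. A short sketch on made-up data:

```python
import numpy as np

my_array = np.array([0.5, 2.0, 1.5, 2.0, 3.5])

# np.median works directly on the 1D float array.
median_value = np.median(my_array)
print(median_value)
```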

Explaining the result of pipeline execution of multiple hincrby commands in redis

This rudimentary one has me stumped. I've been tinkering around with redis-py, trying to learn the ropes. One thing I'm trying is:
pipeline1 = my_server.pipeline()
for hash_obj in hash_objs:
    num = pipeline1.hincrby(hash_obj,"num",amount=-1)
result1 = pipeline1.execute()
print result1
>>> [0L,0L]
There were two redis hashes in the list hash_objs. What I see printed on the screen is [0L,0L]. Can someone help me decipher what this output means? What's L? I was hoping to get the resulting int values of num for each hash_obj (e.g. [2,0]).
My objective is to decrement num in each hash_obj by 1, and wherever num ends up as 0, delete the hash_obj.
I'm trying to accomplish that in two separate pipelines; the code above is the attempt to decrement all num values in all hash_objs. After this, I would delete the relevant hash_objs if warranted. I'm still developing my understanding of how to use pipelining in redis effectively.
Nothing wrong with the code above - the L means long (integer) and the result printout is consistent assuming that the hashes were set to 1 before the run. If you set the hashes beforehand to 3 and 1 (steps 3 and 4 below), respectively, you'll get the expected result in step 9:
In [1]: import redis
In [2]: r = redis.StrictRedis()
In [3]: r.hset('h1', 'num', 3)
Out[3]: 1L
In [4]: r.hset('h2', 'num', 1)
Out[4]: 1L
In [5]: hashes = ['h1', 'h2']
In [6]: p = r.pipeline()
In [7]: for h in hashes:
...:     p.hincrby(h, 'num', -1)
...:
In [8]: res = p.execute()
In [9]: res
Out[9]: [2L, 0L]
Note: the 1L in steps 3 and 4 means that the field was newly created.
Now you can iterate on the result and continue the processing. In your case, however, it would make more sense to use just one pipeline and, instead of executing hincrby, call a Lua script that decrements the field and deletes the key if the result is 0 or less, such as the one below (which returns 1 if the key was deleted):
In [1]: import redis
In [2]: r = redis.StrictRedis()
In [3]: r.hset('h1', 'num', 3)
Out[3]: 0L
In [4]: r.hset('h2', 'num', 1)
Out[4]: 0L
In [5]: s = r.script_load('if redis.call("HINCRBY", KEYS[1], ARGV[1], ARGV[2]) <= 0 then redis.call("DEL", KEYS[1]) return 1 end return 0')
In [6]: p = r.pipeline()
In [7]: for h in ['h1', 'h2']:
...:     p.evalsha(s, 1, h, 'num', -1)
...:
In [8]: p.execute()
Out[8]: [0L, 1L]
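For the two-pipeline variant described in the question, the list returned by execute() follows the order in which the commands were queued, so it can be zipped with the hash names to decide what to delete. The sketch below uses plain lists in place of a live Redis connection. The values mirror the [2, 0] result from the first session, and to_delete is an illustrative name.

```python
hash_objs = ['h1', 'h2']
result1 = [2, 0]  # stand-in for the list returned by pipeline1.execute()

# Pair each hash name with its decremented counter and keep the exhausted ones.
to_delete = [h for h, num in zip(hash_objs, result1) if num <= 0]
print(to_delete)

# With a real connection, a second pipeline could then queue a delete for
# each name in to_delete and call execute() once.
```

Unlike the Lua script, this two-step approach is not atomic: another client could change num between the decrement and the delete.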