Combinations & Numpy

I need to rate each combination in order to get the best one.
I completed the first step, but it is not optimized at all: when RQ, NBPIS, or NBSER gets big, my code takes much too long to run.
Do you have an idea how to get the same result much faster?
Thank you very much
import numpy as np
from itertools import combinations, combinations_with_replacement

# USER SETTINGS
RQ = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
NBPIS = 3
NBSER = 3

# CODE
Combi1 = np.array(list(combinations_with_replacement(RQ, NBPIS)))
Combi2 = combinations_with_replacement(Combi1, NBSER)
Combi3 = np.array([])
Compt = 0
First = 0
for X in Combi2:
    Long = 0
    Compt = Compt + 1
    Y = np.array(X)
    for Z in RQ:
        Long = Long + 1
        if Z not in Y:
            break
        elif Long == len(RQ):
            if First == 0:
                Combi3 = Y
                Combi3 = np.expand_dims(Combi3, axis=0)
                First = 1
            else:
                Combi3 = np.append(Combi3, [Y], axis=0)

# RESULTS
print(Combi3)
print(Combi3.shape)
print(Compt)

Assuming your code produces the desired result, the first step in optimizing it is refactoring. This might help others to jump in and help as well.
Let's start by making a function of it.
# USER SETTINGS
RQ = ['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H']
NBPIS = 3
NBSER = 3

def your_code():
    Combi1 = np.array(list(combinations_with_replacement(RQ, NBPIS)))
    Combi2 = combinations_with_replacement(Combi1, NBSER)
    Combi3 = np.array([])
    Compt = 0
    First = 0
    for X in Combi2:
        Long = 0
        Compt = Compt + 1
        Y = np.array(X)
        for Z in RQ:
            Long = Long + 1
            if Z not in Y:
                break
            elif Long == len(RQ):
                if First == 0:
                    Combi3 = Y
                    Combi3 = np.expand_dims(Combi3, axis=0)
                    First = 1
                else:
                    Combi3 = np.append(Combi3, [Y], axis=0)
    shape = Combi3.shape
    size = Compt
    return Combi3, shape, size
Refactoring
Notice that Compt is equal to len(Combi2), so turning Combi2 into a numpy array helps to simplify the code. This also allows the variable Y to be replaced by X. Also, there is no need for Combi1 to be a numpy array, since it is only consumed by combinations_with_replacement.
def your_code_refactored():
    Combi1 = combinations_with_replacement(RQ, NBPIS)
    Combi2 = np.array(list(combinations_with_replacement(Combi1, NBSER)))
    Combi3 = np.array([])
    First = 0
    for X in Combi2:
        Long = 0
        for Z in RQ:
            Long = Long + 1
            if Z not in X:
                break
            elif Long == len(RQ):
                if First == 0:
                    Combi3 = X
                    Combi3 = np.expand_dims(Combi3, axis=0)
                    First = 1
                else:
                    Combi3 = np.append(Combi3, [X], axis=0)
    shape = Combi3.shape
    size = len(Combi2)
    return Combi3, shape, size
The next thing is to refactor how Combi3 is created and populated. The variable First is used to expand Combi3's dimensions on the first iteration only, so this logic can be simplified as:
def your_code_refactored():
    Combi1 = combinations_with_replacement(RQ, NBPIS)
    Combi2 = np.array(list(combinations_with_replacement(Combi1, NBSER)))
    # each X below has shape (NBSER, NBPIS): NBSER tuples of NBPIS elements
    Combi3 = np.empty((0, NBSER, NBPIS), dtype=Combi2.dtype)
    for X in Combi2:
        Long = 0
        for Z in RQ:
            Long = Long + 1
            if Z not in X:
                break
            elif Long == len(RQ):
                Combi3 = np.append(Combi3, [X], axis=0)
    shape = Combi3.shape
    size = len(Combi2)
    return Combi3, shape, size
It seems Combi3 is populated only with combinations that have at least one of each element from RQ. This is accomplished by testing whether each element of RQ is in X, which is basically checking whether RQ is a subset of X. So it can be simplified further:
def your_code_refactored():
    Combi1 = combinations_with_replacement(RQ, NBPIS)
    Combi2 = np.array(list(combinations_with_replacement(Combi1, NBSER)))
    Combi3 = np.empty((0, NBSER, NBPIS), dtype=Combi2.dtype)
    unique_RQ = set(RQ)
    for X in Combi2:
        if unique_RQ.issubset(X.flatten()):
            Combi3 = np.append(Combi3, [X], axis=0)
    shape = Combi3.shape
    size = len(Combi2)
    return Combi3, shape, size
This looks much simpler. Time to make it faster, maybe :)
Optimizing
One way this code can be optimized is to replace np.append with list.append. In numpy's documentation we see that np.append reallocates a larger and larger array each time it is called. The code might be sped up with list.append, since lists over-allocate memory. So the code can be rewritten with a list comprehension.
def your_code_refactored_and_optimized():
    Combi1 = combinations_with_replacement(RQ, NBPIS)
    Combi2 = np.array(list(combinations_with_replacement(Combi1, NBSER)))
    unique_RQ = set(RQ)
    Combi3 = np.array([X for X in Combi2 if unique_RQ.issubset(X.flatten())])
    shape = Combi3.shape
    size = len(Combi2)
    return Combi3, shape, size
Testing
Now we can check how much faster it runs.
import timeit
n = 5
print(timeit.timeit('your_code()', globals=globals(), number=n))
print(timeit.timeit('your_code_refactored_and_optimized()', globals=globals(), number=n))
This isn't much of a gain in speed, but it's something :)
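If the up-front conversion of Combi2 to a numpy array becomes a bottleneck, here is a further sketch (my addition, not part of the original answer) that filters the combinations lazily as they are generated, so only the surviving ones are materialized:
import numpy as np
from itertools import chain, combinations_with_replacement

def your_code_lazy():
    # assumes RQ, NBPIS and NBSER are defined as in the user settings above
    unique_RQ = set(RQ)
    Combi1 = list(combinations_with_replacement(RQ, NBPIS))
    size = 0
    kept = []
    for X in combinations_with_replacement(Combi1, NBSER):
        size += 1
        # keep X only if every element of RQ appears somewhere in it
        if unique_RQ.issubset(chain.from_iterable(X)):
            kept.append(X)
    Combi3 = np.array(kept)
    return Combi3, Combi3.shape, size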

I have an idea to reduce execution time by removing unnecessary combinations. To simplify, consider the following example with:
RQ=['A','B','C']
NBPIS=3
NBSER=3
Currently, with:
Combi1 = combinations_with_replacement(RQ,NBPIS)
print(list(Combi1))
[('A', 'A', 'A'), ('A', 'A', 'B'), ('A', 'A', 'C'), ('A', 'B', 'B'),
('A', 'B', 'C'), ('A', 'C', 'C'), ('B', 'B', 'B'), ('B', 'B', 'C'),
('B', 'C', 'C'), ('C', 'C', 'C')]
But with:
Combi1 = list(list(combinations(RQ,W)) for W in range(1,NBPIS+1))
print(Combi1)
[[('A',), ('B',), ('C',)], [('A', 'B'), ('A', 'C'), ('B', 'C')],
[('A', 'B', 'C')]]
The problem with:
Combi1 = list(list(combinations(RQ, W)) for W in range(1, NBPIS + 1))
is this error message:
    Combi3 = np.array([X for X in Combi2 if
                       unique_RQ.issubset(X.flatten())])
TypeError: unhashable type: 'list'
But with:
Combi1 = (combinations(RQ, W) for W in range(1, NBPIS + 1))
print(Combi3) gives:
[]
Questions:
For Combi1, instead of:
[[('A',), ('B',), ('C',)], [('A', 'B'), ('A', 'C'), ('B', 'C')], [('A', 'B', 'C')]]
how do I get this? (a sketch follows after these questions):
[('A'), ('B'), ('C'), ('A', 'B'), ('A', 'C'), ('B', 'C'), ('A', 'B', 'C')]
For Combi3, is it possible to get an array with rows of different sizes?
Instead of:
[[['A' 'A' 'A'] ['A' 'A' 'A'] ['A' 'B' 'C']]...
obtain this?
[[['A'] ['A'] ['A' 'B' 'C']]...
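For the first question, a minimal sketch (my addition, not from the original thread): itertools.chain.from_iterable flattens the per-length lists into a single list. Note that the one-element combinations come out as 1-tuples such as ('A',):
from itertools import chain, combinations

RQ = ['A', 'B', 'C']
NBPIS = 3
# flatten the per-length combination lists into one list
Combi1 = list(chain.from_iterable(combinations(RQ, W) for W in range(1, NBPIS + 1)))
print(Combi1)
# [('A',), ('B',), ('C',), ('A', 'B'), ('A', 'C'), ('B', 'C'), ('A', 'B', 'C')]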

Related

Alternate row colour for dataframe

Trying to build on this: Python: Color pandas dataframe based on MultiIndex
I've extended the code:
import pandas as pd
i = pd.MultiIndex.from_tuples([(0, 'zero'), (0, 'one'), (1, 'zero'), (1, 'one'), (1, 'two'), (1, 'three'), (1, 'four'), (2, 'zero'), (2, 'one'), (2, 'two'), (2, 'three'), (2, 'four')], names=['level_0', 'level_1'])
df = pd.DataFrame(range(0, len(i)), index=i, columns=['foo'])
colors = {0: (0.6, 0.8, 0.8, 1), 1: (1, 0.9, 0.4, 1), 2: (0.6, 0.8, 0.8, 1)}
# convert rgba to integers
c1 = {k: (int(r * 255), int(g * 255), int(b * 255), a) for k, (r, g, b, a) in colors.items()}
c2 = {k: (int(r * 255), int(g * 255), int(b * 255), 0.25) for k, (r, g, b, a) in colors.items()}
# get values of first level of MultiIndex
idx = df.index.get_level_values(0)
# counter per first level, for alternating (even/odd) row coloring
zipped = zip(df.groupby(idx).cumcount(), enumerate(idx))
css = [{'selector': f'.row{i}', 'props': [('background-color', f'rgba{c1[j]}')]}
       if v % 2 == 0
       else {'selector': f'.row{i}', 'props': [('background-color', f'rgba{c2[j]}')]}
       for v, (i, j) in zipped]
df.style.set_table_styles(css)
And got this:
It seems tedious to do this manually. So how do I go about generalising it so that it covers all rows, and the pattern applies even if I apply it to other such 2-level multi-index dataframes?
Here is one way to do it with cycle from Python standard library's itertools module:
import pandas as pd
from itertools import cycle

# Setup
i = pd.MultiIndex.from_tuples(
    [
        (0, "zero"),
        (0, "one"),
        (1, "zero"),
        (1, "one"),
        (1, "two"),
        (1, "three"),
        (1, "four"),
        (2, "zero"),
        (2, "one"),
        (2, "two"),
        (2, "three"),
        (2, "four"),
        (3, "one"),
    ],
    names=["level_0", "level_1"],
)
df = pd.DataFrame(range(0, len(i)), index=i, columns=["foo"])

# Define two pairs of colors (dark and light green/yellow)
colors = [(0.6, 0.8, 0.8), (1, 0.9, 0.4)]  # green, yellow
color_cycle = cycle(
    [
        {
            k: (int(c[0] * 255), int(c[1] * 255), int(c[2] * 255), a)
            for k, a in enumerate([1, 0.25])
        }
        for c in colors
    ]
)

# Define color for each row
bg_colors = []
for i in df.index.get_level_values(0).unique():
    color = next(color_cycle)
    row_color = cycle(
        [
            {"props": [("background-color", f"rgba{color[0]}")]},
            {"props": [("background-color", f"rgba{color[1]}")]},
        ]
    )
    for _ in range(df.loc[(i,), :].shape[0]):
        bg_colors.append(next(row_color))

# Style dataframe
css = [{"selector": f".row{i}"} | color for i, color in enumerate(bg_colors)]
df.style.set_table_styles(css)
Output from last cell in Jupyter notebook:

numpy genfromtxt - how to detect bad int input values

Here is a trivial example of a bad int value to numpy.genfromtxt. For some reason, I can't detect this bad value, as it's showing up as a valid int of -1.
>>> bad = '''a,b
0,BAD
1,2
3,4'''.splitlines()
My input here has 2 columns of ints, named a and b. b has a bad value, where we have a string "BAD" instead of an integer. However, when I call genfromtxt, I cannot detect this bad value.
>>> out = np.genfromtxt(bad, delimiter=',', dtype=(numpy.dtype('int64'), numpy.dtype('int64')), names=True, usemask=True, usecols=tuple('ab'))
>>> out
masked_array(data=[(0, -1), (1, 2), (3, 4)],
             mask=[(False, False), (False, False), (False, False)],
             fill_value=(999999, 999999),
             dtype=[('a', '<i8'), ('b', '<i8')])
>>> out['b'].data
array([-1, 2, 4])
I print out the column 'b' from my output, and I'm shocked to see that it has a -1 where the string "BAD" is supposed to be. The user has no idea that there was bad input. In fact, if you only look at the output, this is totally indistinguishable from the following input
>>> bad2 = '''a,b
0,-1
1,2
3,4'''.splitlines()
I feel like I must be using genfromtxt wrong. How is it possible that it can't detect bad input?
I found in np.lib._iotools a function
def _loose_call(self, value):
    try:
        return self.func(value)
    except ValueError:
        return self.default
When genfromtxt is processing a line it does
if loose:
    rows = list(
        zip(*[[conv._loose_call(_r) for _r in map(itemgetter(i), rows)]
              for (i, conv) in enumerate(converters)]))
where loose is an input parameter. So in the case of the int converter it tries
int(astring)
and if that produces a ValueError, it returns the default value (e.g. -1) instead of raising an error. Similarly for float, whose default is np.nan.
The usemask parameter is applied in:
if usemask:
    append_to_masks(tuple([v.strip() in m
                           for (v, m) in zip(values,
                                             missing_values)]))
Define 2 converters to give more information on what's processed:
def myint(astr):
    try:
        v = int(astr)
    except ValueError:
        print('err', astr)
        v = '-999'
    return v

def myfloat(astr):
    try:
        v = float(astr)
    except ValueError:
        print('err', astr)
        v = '-inf'
    return v
A sample text:
txt='''1,2
3,nan
,foo
bar,
'''.splitlines()
And using the converters:
In [242]: np.genfromtxt(txt, delimiter=',', converters={0:myint, 1:myfloat})
err b''
err b'bar'
err b'foo'
err b''
Out[242]:
array([( 1, 2.), ( 3, nan), (-999, -inf), (-999, -inf)],
      dtype=[('f0', '<i8'), ('f1', '<f8')])
And to see what usemask does:
In [243]: np.genfromtxt(txt, delimiter=',', converters={0:myint, 1:myfloat}, usemask=True)
err b''
err b'bar'
err b'foo'
err b''
Out[243]:
masked_array(data=[(1, 2.0), (3, nan), (--, -inf), (-999, --)],
             mask=[(False, False), (False, False), ( True, False),
                   (False, True)],
             fill_value=(999999, 1.e+20),
             dtype=[('f0', '<i8'), ('f1', '<f8')])
A missing value is an empty string '', and int('') produces a ValueError just as int('bad') does. So for the converter, default or custom, a missing value looks the same as a bad one. Your converter could make a distinction, but only 'missing' sets the mask.
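For example, a converter along these lines (a hedged sketch of my own, not from the original answer) could use two different sentinels so that missing and bad fields stay distinguishable in the output:
def myint_distinct(astr):
    # genfromtxt may pass bytes; normalize to str first
    if isinstance(astr, bytes):
        astr = astr.decode()
    astr = astr.strip()
    if astr == '':
        return -1       # missing field
    try:
        return int(astr)
    except ValueError:
        return -999     # bad (unparseable) field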

Positions of the maximum in Pandas

I have a pandas dataframe and I want to retrieve the positions (row, column) of the maximum value in the dataframe. How can I do that?
Sample:
df = pd.DataFrame({
    'B': [4, 5, 4, 5, 5, 4],
    'C': [7, 8, 9, 4, 2, 3],
    'D': [1, 3, 5, 7, 1, 0],
    'E': [5, 3, 6, 9, 2, 4],
})
First, if necessary, keep only the numeric columns with DataFrame.select_dtypes:
df = df.select_dtypes(np.number)
To get the position of the first max value in the DataFrame, use DataFrame.stack with Series.idxmax:
print (df.stack().idxmax())
(2, 'C')
If performance is important, get the indices by comparing against the max value with numpy.where, and take the first match by indexing:
r, c = np.where(df == df.values.max())
print ((df.index[r[0]], df.columns[c[0]]))
(2, 'C')
If you need all max values:
s = df.stack()
print (s.index[s == s.max()].tolist())
[(2, 'C'), (3, 'E')]
print (list(zip(df.index[r], df.columns[c])))
[(2, 'C'), (3, 'E')]

New dataframe creation within loop and append of the results to the existing dataframe

I am trying to create conditional subsets of rows and columns from a DataFrame and append them to existing dataframes that match the structure of the subsets. New subsets of data would need to be stored in the smaller dataframes, and the names of these smaller dataframes would need to be dynamic. Below is an example.
# Sample data
df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6, 7], 'b': [4, 5, 6, 4, 3, 4, 6],
                   'c': [1, 2, 2, 4, 2, 1, 7], 'd': [4, 4, 2, 2, 3, 5, 6],
                   'e': [1, 3, 3, 4, 2, 1, 7], 'f': [1, 1, 2, 2, 1, 5, 6]})

# Function that creates the subsets of data - I would need to apply a
# function like this to many combinations of columns
def f1(df, input_col1, input_col2):
    # Subset of rows
    t = df[df[input_col1] >= 3]
    # Subset of columns
    t = t[[input_col1, input_col2]]
    t = t.sort_values([input_col1], ascending=False)
    return t

# I want to create 3 different dataframes t1, t2, and t3, but I would like to
# create them in the loop - not via individual function calls.
# These individual calls are just examples of what I am trying to achieve via the loop:
# t1 = f1(df, 'a', 'b')
# t2 = f1(df, 'c', 'd')
# t3 = f1(df, 'e', 'f')

# These are empty dataframes to which I would like to append the resulting
# subsets of data
column_names = ['col1', 'col2']
g1 = pd.DataFrame(np.empty(0, dtype=[('col1', 'f8'), ('col2', 'f8')]))
g2 = pd.DataFrame(np.empty(0, dtype=[('col1', 'f8'), ('col2', 'f8')]))
g3 = pd.DataFrame(np.empty(0, dtype=[('col1', 'f8'), ('col2', 'f8')]))

list1 = ['a', 'c', 'e']
list2 = ['b', 'd', 'f']
t = {}
g = {}

# This is what I want in the end - I would like to call the function inside of
# the loop, create new dataframes dynamically, and then append them to the
# existing dataframes, but I am getting errors. Is it possible to do?
for c in range(1, 4, 1):
    for i, j in zip(list1, list2):
        t['t' + str(c)] = f1(df, i, j)
        g['g' + str(c)] = g['g' + str(c)].append(t['t' + str(c)], ignore_index=True)
I guess you want to create t1,t2,t3 dynamically.
You can use globals().
g1 = pd.DataFrame(np.empty(0, dtype=[('a', 'f8'), ('b', 'f8')]))
g2 = pd.DataFrame(np.empty(0, dtype=[('c', 'f8'), ('d', 'f8')]))
g3 = pd.DataFrame(np.empty(0, dtype=[('e', 'f8'), ('f', 'f8')]))
list1 = ['a', 'c', 'e']
list2 = ['b', 'd', 'f']
for c in range(1, 4, 1):
    globals()['t' + str(c)] = f1(df, list1[c-1], list2[c-1])
    globals()['g' + str(c)] = globals()['g' + str(c)].append(globals()['t' + str(c)])
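As a side note, DataFrame.append was removed in pandas 2.0, and globals() tends to be fragile. A sketch of an alternative (my suggestion, reusing f1 and df from the question) keeps everything in plain dicts and concatenates with pd.concat:
import numpy as np
import pandas as pd

list1 = ['a', 'c', 'e']
list2 = ['b', 'd', 'f']
# one empty frame per column pair, keyed by name
g = {f'g{c}': pd.DataFrame(np.empty(0, dtype=[(i, 'f8'), (j, 'f8')]))
     for c, (i, j) in enumerate(zip(list1, list2), start=1)}
t = {}
for c, (i, j) in enumerate(zip(list1, list2), start=1):
    t[f't{c}'] = f1(df, i, j)
    g[f'g{c}'] = pd.concat([g[f'g{c}'], t[f't{c}']], ignore_index=True)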

Cumulative Mode in numpy

I'd like to efficiently calculate a cumulative mode along an axis in numpy.
e.g.
>>> arr = np.random.RandomState(3).randint(3, size=(2, 5))
>>> arr
array([[2, 0, 1, 0, 0],
       [0, 1, 1, 2, 1]])
>>> assert np.array_equal(cummode(arr, axis=1), [[2, 2, 2, 0, 0], [0, 0, 1, 1, 1]])
Is there an efficient way to do this? I guess it can handle ties by returning the first number to achieve the given count.
Here's a pure Python function that works on a list, or any iterable:
from collections import defaultdict

def cummode(alist):
    dd = defaultdict(int)
    mode = [(None, 0)]
    for i in alist:
        dd[i] += 1
        if dd[i] > mode[-1][1]:
            newmode = (i, dd[i])
        else:
            newmode = mode[-1]
        mode.append(newmode)
    mode.pop(0)
    return mode, dd

mode, dd = cummode([0, 1, 3, 6, 1, 2, 3, 3, 2, 1, 10, 0])
print(dd)
print(mode)
which, for the test case, produces
defaultdict(<type 'int'>, {0: 2, 1: 3, 2: 2, 3: 3, 6: 1, 10: 1})
[(0, 1), (0, 1), (0, 1), (0, 1), (1, 2), (1, 2), (1, 2), (3, 3), (3, 3), (3, 3), (3, 3), (3, 3)]
A defaultdict is a clean, fast way of accumulating values when you don't know all the keys beforehand. For small lists and arrays it probably beats a numpy-based version, even with weave, simply because it does not incur the overhead of creating arrays. But with large ones it will probably lag. Plus I haven't generalized it to handle multiple values (rows).
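Since the question asks for axis=1 over a 2-D array, a minimal row-by-row wrapper (my addition, not part of the original answer) can apply the pure-Python cummode above to each row:
import numpy as np

def cummode_2d(arr):
    # apply the 1-D cummode to each row, keeping only the mode values
    out = np.empty(arr.shape, dtype=arr.dtype)
    for i, row in enumerate(arr):
        modes, _ = cummode(row)
        out[i] = [value for value, count in modes]
    return out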
OK, it's not fully general in that it doesn't work along an arbitrary axis, but for now I've made a version that works using scipy.weave.
from scipy import weave

def cummode(x, axis=1):
    assert x.ndim == 2 and axis == 1, 'Only implemented for a special case!'
    all_values, element_ids = np.unique(x, return_inverse=True)
    n_unique = len(all_values)
    element_ids = element_ids.reshape(x.shape)
    result = np.zeros(x.shape, dtype=int)
    counts = np.zeros(n_unique, dtype=int)
    code = """
    int n_samples = Nelement_ids[0];
    int n_events = Nelement_ids[1];
    for (int i=0; i<n_samples; i++){
        int maxcount = 0;
        int maxel = -1;
        for (int k=0; k<n_unique; k++)
            counts[k] = 0;
        for (int j=0; j<n_events; j++){
            int ix = i*n_events+j;
            int k = element_ids[ix];
            counts[k] += 1;
            if (counts[k] > maxcount){
                maxcount = counts[k];
                maxel = k;
            }
            result[ix] = maxel;
        }
    }
    """
    weave.inline(code, ['element_ids', 'result', 'n_unique', 'counts'], compiler='gcc')
    mode_values = all_values[result]
    return mode_values
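scipy.weave never supported Python 3 and was removed from SciPy in 0.19, so as a rough modern substitute (my addition, not part of the original answer) the same inner loop can be JIT-compiled with numba:
import numpy as np
from numba import njit

@njit
def _cummode_ids(element_ids, n_unique):
    # same loop as the weave code: running counts per unique value, per row
    n_samples, n_events = element_ids.shape
    result = np.zeros((n_samples, n_events), dtype=np.int64)
    counts = np.zeros(n_unique, dtype=np.int64)
    for i in range(n_samples):
        counts[:] = 0
        maxcount = 0
        maxel = -1
        for j in range(n_events):
            k = element_ids[i, j]
            counts[k] += 1
            if counts[k] > maxcount:
                maxcount = counts[k]
                maxel = k
            result[i, j] = maxel
    return result

def cummode_numba(x):
    all_values, element_ids = np.unique(x, return_inverse=True)
    element_ids = element_ids.reshape(x.shape)
    return all_values[_cummode_ids(element_ids, len(all_values))]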