Python, Numpy: all UNIQUE combinations of a numpy.array() vector - numpy

I want to get all unique combinations of a numpy.array vector (or a pandas.Series). I used itertools.combinations but it's very slow. For an array of size (1000,) it takes many hours. Here is my code using itertools (actually I use combination differences):
def a(array):
    temp = pd.Series([])
    for i in itertools.combinations(array, 2):
        temp = temp.append(pd.Series(np.abs(i[0]-i[1])))
    temp.index=range(len(temp))
    return temp
As you can see, there is no repetition.
sklearn.utils.extmath.cartesian is really fast, but it produces repetitions, which I do not want. I need help rewriting the above function without itertools so that it is much faster for large vectors.

You could take the upper triangular part of a matrix formed on the Cartesian product with the binary operation (here subtraction, as in your example):
import numpy as np
n = 3
a = np.random.randn(n)
print(a)
print(a - a[:, np.newaxis])
print((a - a[:, np.newaxis])[np.triu_indices(n, 1)])
gives
[ 0.04248369 -0.80162228 -0.44504522]
[[ 0.         -0.84410597 -0.48752891]
 [ 0.84410597  0.          0.35657707]
 [ 0.48752891 -0.35657707  0.        ]]
[-0.84410597 -0.48752891 0.35657707]
With n=1000 (and output piped to /dev/null) this runs in 0.131s on my relatively modest laptop.
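Applied to the absolute differences from the question, the same idea can be sketched as follows. The function name pairwise_abs_diff and the sample array are illustrative assumptions, not part of the original answer. Indexing with np.triu_indices first avoids building the full n-by-n difference matrix, and the pairs come out in the same order as itertools.combinations.

```python
import itertools
import numpy as np

def pairwise_abs_diff(values):
    """Absolute difference of every unordered pair (i < j), without a Python loop."""
    values = np.asarray(values)
    # Row and column indices of the strict upper triangle: every pair with i < j.
    i, j = np.triu_indices(len(values), k=1)
    return np.abs(values[i] - values[j])

arr = np.array([3.0, 1.0, 4.0, 1.5])  # hypothetical sample data
fast = pairwise_abs_diff(arr)

# Reference result using the slow itertools approach from the question.
slow = np.array([abs(x - y) for x, y in itertools.combinations(arr, 2)])
print(np.allclose(fast, slow))
```

If a pandas.Series is needed, pd.Series(pairwise_abs_diff(arr)) wraps the result once instead of appending inside a loop.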

For a random array of ints:
import numpy as np
import pandas as pd
import itertools as it
b = np.random.randint(0, 8, ((6,)))
# array([7, 0, 6, 7, 1, 5])
pd.Series(list(it.combinations(np.unique(b), 2)))
it returns:
0 (0, 1)
1 (0, 5)
2 (0, 6)
3 (0, 7)
4 (1, 5)
5 (1, 6)
6 (1, 7)
7 (5, 6)
8 (5, 7)
9 (6, 7)
dtype: object
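For larger arrays, a vectorized variant of the same idea may be worth trying. This is a sketch under the assumption that a two-column integer array of pairs is an acceptable output format; it combines np.unique with np.triu_indices instead of itertools.

```python
import numpy as np

b = np.array([7, 0, 6, 7, 1, 5])  # the sample array shown above

# Drop repeated values first, then index every pair (i < j) of the unique values.
u = np.unique(b)
i, j = np.triu_indices(len(u), k=1)
pairs = np.column_stack((u[i], u[j]))  # one row per unique combination
print(pairs)
```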

Related

Pandas - find rows sharing two out the three common values, order-independent, and collect values pairs

Given a dataframe, I am looking for rows where two out of three values are in common, regardless of the columns, hence order, in which they appear. I would like to then collect those common pairs.
Please note:
a pair of values can appear in at most two rows
a value can appear only once in a row
I would like to know what the most efficient/elegant way is in numpy or pandas to solve this problem.
For example, taking as input the dataframe
d = {'col1': [1, 2,5,9], 'col2': [2, 7,1,2],'col3': [3, 3,2,7]}
df = pd.DataFrame(data=d)
   col1  col2  col3
0     1     2     3
1     2     7     3
2     5     1     2
3     9     2     7
I expect as result an array, list, something as
1 2
2 3
2 7
as the values (1,2), (2,3) and (2,7) are present in two rows (first and third, first and second, and second and fourth respectively).
I cannot find a concise solution.
At the moment I have sketched a numpy solution such as
def func(x):
    rows, columns = x.shape[0], x.shape[1]
    res = []
    for i in range(0,rows):
        for j in range(i+1, rows):
            aux = np.intersect1d(x[i,:], x[j,:])
            if aux.size>1:
                res.append(aux)
    return res
which outputs
func(df.values)
Out: [array([2, 3]), array([1, 2]), array([2, 7])]
It looks rather cumbersome. How could I get it done with one of those cool numpy/pandas one-liners?
I would suggest using Python's built-in set operations to do most of the heavy lifting, and just applying them with pandas:
import itertools
import pandas as pd
d = {'col1': [1, 2,5,9], 'col2': [2, 7,1,2],'col3': [3, 3,2,7]}
df = pd.DataFrame(data=d)
pairs = df.apply(set, axis=1).apply(lambda x: set(itertools.combinations(x, 2))).explode()
out = set(pairs[pairs.duplicated()])
Output:
{(2, 3), (1, 2), (2, 7)}
Optionally, to get it in list[np.ndarray] format:
import numpy as np
out = list(map(np.array, out))
Similar approach to that of @Chrysophylaxs but in pure Python:
from itertools import combinations
from collections import Counter
c = Counter(s for x in df.to_numpy().tolist() for s in set(combinations(set(x), r=2)))
out = [k for k,v in c.items() if v>1]
# [(2, 3), (1, 2), (2, 7)]
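One caveat with the set-based approaches above: the iteration order of a set is an implementation detail, so the same pair could in principle come out as (1, 2) in one row and (2, 1) in another and then not be counted together. A minimal sketch that sorts each row's values first, so every pair has one canonical form, might look like this. The function name shared_pairs is an illustrative assumption.

```python
from collections import Counter
from itertools import combinations

import pandas as pd

def shared_pairs(df):
    """Return the value pairs that occur in more than one row, as sorted tuples."""
    counts = Counter(
        pair
        for row in df.to_numpy().tolist()
        # sorted(set(row)) removes in-row repeats and fixes the order within each pair
        for pair in combinations(sorted(set(row)), 2)
    )
    return sorted(pair for pair, n in counts.items() if n > 1)

d = {'col1': [1, 2, 5, 9], 'col2': [2, 7, 1, 2], 'col3': [3, 3, 2, 7]}
df = pd.DataFrame(data=d)
print(shared_pairs(df))  # [(1, 2), (2, 3), (2, 7)]
```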
df=df.assign(col4=df.index)
def function1(ss:pd.Series):
    ss1=ss.value_counts().loc[lambda ss:ss>=2]
    return ss1.index.tolist() if ss1.size>=2 else None
df.merge(df,how='cross',suffixes=('','_2')).query("col4!=col4_2").filter(regex=r'col[^4]', axis=1)\
    .apply(function1,axis=1).dropna().drop_duplicates()
out
1 [2, 3]
2 [1, 2]
7 [2, 7]

Large Sampling with Replacement by index layer of a Pandas multiindexed Dataframe

Imagine a dataframe with the structure below:
>>> print(pair_df)
0 1
centre param h pair_ind
0 x1 1 (0, 1) 2.244282 2.343915
(1, 2) 2.343915 2.442202
(2, 3) 2.442202 2.538162
(3, 4) 2.538162 2.630836
(4, 5) 2.630836 2.719298
... ... ...
9 x3 7 (1, 8) 1.407902 1.417398
(2, 9) 1.407953 1.422860
8 (0, 8) 1.407896 1.417398
(1, 9) 1.407902 1.422860
9 (0, 9) 1.407896 1.422860
[1350 rows x 2 columns]
What is the most efficient way to draw a large number of samples (e.g., 1000 times), with replacement, from this dataframe by the index level centre (10 values here) and put them all together?
I have found two solutions:
1)
import numpy as np
import pandas as pd
idx = pd.IndexSlice  # assumption: idx was not defined in the post and is taken to be pd.IndexSlice
bootstrap_rand = np.random.choice(list(range(0,10)), size=len(range(0,10*1000)), replace=True).tolist()
sampled_df = pd.concat([pair_df.loc[idx[i, :, :, :], :] for i in bootstrap_rand])
2)
sampled_df = pair_df.unstack(['param', 'h', 'pair_ind']).\
    sample(10*1000, replace=True).\
    stack(['param', 'h', 'pair_ind'])
Any more efficient ideas?

Extract all odd numbers from numpy array

How can I extract all the odd numbers from a numpy array?
Try:
import numpy as np
a = np.array([1,2,3,4,5,6,6,7,7,8,9])
a[a % 2 == 1]
Out[13]: array([1, 3, 5, 7, 7, 9])
b = np.where(a%2)
print(f'Array with Odd Numbers: {a[b]}')
Here a is the array containing all the numbers, b holds the indices where a % 2 is nonzero, and a[b] is the array of odd numbers.
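The two answers select the same elements. A short sketch comparing them, with a hypothetical array of negative values added to show that the a % 2 == 1 test also works there (NumPy's % takes the sign of the divisor, so -3 % 2 is 1):

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5, 6, 6, 7, 7, 8, 9])

odd_by_mask = a[a % 2 == 1]         # boolean-mask indexing
odd_by_where = a[np.where(a % 2)]   # index arrays from np.where

# Hypothetical extra input: negative integers.
neg = np.array([-3, -2, -1, 0, 1])
odd_neg = neg[neg % 2 == 1]

print(odd_by_mask, odd_by_where, odd_neg)
```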

How to conveniently use operations on numpy Fortran contiguous arrays?

Some numpy functions like np.matmul(a, b) have convenient behavior for stacks of matrices.
The manual states:
If either argument is N-D, N > 2, it is treated as a stack of matrices residing in the last two indexes and broadcast accordingly.
Thus, for a.shape = (10, 2, 4) and b.shape = (10, 4, 2), the statement a @ b is meaningful and will have shape (10, 2, 2).
However, I'm coming from the linear algebra world, where I'm used to a Fortran contiguous array layout.
The same a represented as a Fortran contiguous array would have shape (4, 2, 10) and similarly b.shape = (2, 4, 10).
To do a @ b as before I would have to invoke
(a.T @ b.T).T .
Even worse, assume you naively created the same Fortran-contiguous array a with the behavior of matmul in mind, such that it has shape (10, 4, 2).
Then a.strides = (8, 80, 320) with the smallest stride in the 'stack' index, which actually should have highest stride.
Is this really the way to go or am I missing something?
While numpy can handle all sorts of layouts, many details are designed with the "C" layout in mind. Good examples are how nested lists translate into arrays, and the way numpy operations batch excess dimensions as in the matmul case.
It is correct that results in numpy as a rule of thumb do not depend on array layout (FORTRAN,C,non-contiguous); speed, however, certainly does and heavily so:
import numpy as np
from timeit import timeit
rng = np.random.default_rng()
a = rng.random((100,111,200))
b = rng.random((111,77,200))
af = np.array(a,order="F")
bf = np.array(b,order="F")
np.allclose((b.T@a.T).T,(bf.T@af.T).T)
# True
timeit(lambda:(b.T@a.T).T,number=10)
# 5.972857117187232
timeit(lambda:(bf.T@af.T).T,number=10)
# 0.1994628761895001
In fact, sometimes it is totally worth it to non-lazily transpose, i.e. copy your data into the best layout:
timeit(lambda:(np.array(b.T,order="C")@np.array(a.T,order="C")).T,number=10)
# 0.3931349152699113
My advice: If you want speed and convenience it is probably best to go with the "C" layout, it doesn't take all that long to get used to and saves you a lot of potential headaches.
numpy's matrix multiplication works regardless of the internal layout of the array. For example, here are two C-ordered arrays:
>>> import numpy as np
>>> a = np.random.rand(10, 2, 4)
>>> b = np.random.rand(10, 4, 2)
>>> print('a', a.shape, a.strides)
>>> print('b', b.shape, b.strides)
a (10, 2, 4) (64, 32, 8)
b (10, 4, 2) (64, 16, 8)
Here are the equivalent arrays in Fortran order:
>>> af = np.asfortranarray(a)
>>> bf = np.asfortranarray(b)
>>> print('af', af.shape, af.strides)
>>> print('bf', bf.shape, bf.strides)
af (10, 2, 4) (8, 80, 160)
bf (10, 4, 2) (8, 80, 320)
Numpy treats equivalent arrays as equivalent, regardless of their internal layout:
>>> np.allclose(a, af) and np.allclose(b, bf)
True
The results of a matrix multiplication do not depend on the internal layout:
>>> np.allclose(a @ b, af @ bf)
True
and you can even mix layouts if you wish:
>>> np.allclose(a @ bf, af @ b)
True
In short, the most convenient way to use Fortran-ordered arrays in numpy is to not worry about internal array layout: the shape is all that matters.
If your array shapes differ from what is expected by the numpy matmul API, your best bet is to rearrange the axes, for example using a.transpose(2, 0, 1) @ b.transpose(2, 0, 1) or similar, depending on what is appropriate for your use-case. Don't worry: for C or Fortran contiguous arrays, this operation only adjusts the metadata around the array view; it does not cause the underlying data buffer to be copied or re-ordered.
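A minimal sketch of that last point, assuming Fortran-ordered arrays whose stack index is the last axis (the shapes (2, 4, 10) and (4, 2, 10) are illustrative choices, not taken from the question). It checks that the transposed arrays are views on the same buffer and that the batched product matches an explicit einsum.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Fortran-style data: matrices in the first two axes, stack index last.
a = np.asfortranarray(rng.random((2, 4, 10)))
b = np.asfortranarray(rng.random((4, 2, 10)))

# Move the stack axis to the front so matmul treats the last two axes as matrices.
a_stack = a.transpose(2, 0, 1)  # shape (10, 2, 4)
b_stack = b.transpose(2, 0, 1)  # shape (10, 4, 2)

# transpose returns a view: no data is copied.
print(np.shares_memory(a, a_stack), np.shares_memory(b, b_stack))

out = a_stack @ b_stack  # shape (10, 2, 2)

# Cross-check against an explicit contraction over the shared axis j.
expected = np.einsum("ijn,jkn->nik", a, b)
print(np.allclose(out, expected))
```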

How to zero out all entries of a dask array less than the top k

I want to zero out all of the elements of a dask.array except for the top few elements. How do I do this?
Example
Say I have a small dask array like the following:
import numpy as np
import dask.array as da
x = np.array([0, 4, 2, 3, 1])
x = da.from_array(x, chunks=(2,))
How do I zero out all but the two largest elements? I want something like the following:
>>> result.compute()
array([0, 4, 0, 3, 0])
You can do this with a combination of the topk function and in-place setitem:
top = x.topk(2)
x[x < top[-1]] = 0
>>> x.compute()
array([0, 4, 0, 3, 0])
Note that this won't stream particularly nicely through memory. If you're using the single machine scheduler then you might want to do this in two passes by explicitly computing top ahead of time:
top = x.topk(2)
top = top.compute() # pass through data once to get top elements
x[x < top[-1]] = 0 # then pass through again applying filter
>>> x.compute()
array([0, 4, 0, 3, 0])
This only matters if you're trying to stream through a large dataset on a single machine and should not affect you much if you're on a distributed system.