I have a DataFrame with multiindex like this:
0 1 2
a 0 0.928295 0.828225 -0.612509
1 1.103340 -0.540640 -0.344500
2 -1.760918 -1.426488 -0.647610
3 -0.782976 0.359211 1.601602
4 0.334406 -0.508752 -0.611212
b 2 0.717163 0.902514 1.027191
3 0.296955 1.543040 -1.429113
4 -0.651468 0.665114 0.949849
c 0 0.195620 -0.240177 0.745310
1 1.244997 -0.817949 0.130422
2 0.288510 1.123550 0.211385
3 -1.060227 1.739789 2.186224
4 -0.109178 -1.645732 0.022480
d 3 0.021789 0.747183 0.614485
4 -1.074870 0.407974 -0.961013
Now I want to generate a zero vector that has the same length as this DataFrame and has ones only at the first row of each level-0 group.
For example, the df here has a shape of (15, 3), so I want a vector of length 15 with 1 at (a, 0), (b, 2), (c, 0), (d, 3) and 0 everywhere else:
array([1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0])
How could I generate a vector like that? (If possible, without looping to get each sub-vector and then using np.concatenate().) Thanks a lot!
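For reference, a minimal sketch that rebuilds a DataFrame with the same index structure (values are random; only the index layout matters):

import numpy as np
import pandas as pd

# same group layout as above: a -> 0..4, b -> 2..4, c -> 0..4, d -> 3..4
tuples = [(g, i)
          for g, rng in [('a', range(5)), ('b', range(2, 5)),
                         ('c', range(5)), ('d', range(3, 5))]
          for i in rng]
df = pd.DataFrame(np.random.randn(len(tuples), 3),
                  index=pd.MultiIndex.from_tuples(tuples))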
IIUC, use duplicated on the first level of the index:
(~df.index.get_level_values(0).duplicated()).astype(int)
Out[726]: array([1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0])
Or, using groupby and head:
df.loc[df.groupby(level=0).head(1).index, 'New'] = 1
df.New.fillna(0).values
Out[721]: array([1., 0., 0., 0., 0., 1., 0., 0., 1., 0., 0., 0., 0., 1., 0.])
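A variant of the groupby route that avoids adding a helper column (a sketch: cumcount numbers the rows within each group, so it equals 0 exactly at each group's first row):

(df.groupby(level=0).cumcount() == 0).astype(int).values
# array([1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0])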
Get the labels of the first MultiIndex level, turn them into a Series, then find where they are not equal to the preceding (shifted) ones:
labels = pd.Series(df.index.labels[0])
v = labels.ne(labels.shift()).astype(int).values
>>> v
array([1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0])
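Note that in pandas 0.24+ the labels attribute of a MultiIndex was renamed to codes; the same approach then reads:

labels = pd.Series(df.index.codes[0])
v = labels.ne(labels.shift()).astype(int).values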
pd.Index(df.index.labels[0])
Int64Index([0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3], dtype='int64')
res = pd.Index(df.index.labels[0]).duplicated(keep='first')
array([False, True, True, True, True, False, True, True, False,
True, True, True, True, False, True])
MultiIndex has a labels attribute that indicates position within the level, which matches the requirement here: the non-duplicated entries mark the start of each group.
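To turn that boolean mask into the requested 0/1 vector, invert and cast:

(~res).astype(int)
# array([1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0])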
A = [[2, 2, 4, 2, 2, 2],
     [2, 6, 2, 2, 2, 2],
     [2, 2, 2, 2, 8, 2]]
I want matrix B to be equal to:
B = [[0, 0, 4, 0, 0, 0],
     [0, 6, 0, 0, 0, 0],
     [0, 0, 0, 0, 8, 0]]
So I want to find the maximum value of each row and replace other values with 0. Is there any way to do this without using for loops?
Thanks in advance for your comments.
Instead of looking at the argmax, you could take the max values for each row directly, then mask the elements which are lower and replace them with zeros:
In place (converting A to a NumPy array first, and passing keepdims=True so the row maxima broadcast against A), this would look like:
>>> A = np.array(A)
>>> A[A < A.max(1, keepdims=True)] = 0
>>> A
array([[0, 0, 4, 0, 0, 0],
       [0, 6, 0, 0, 0, 0],
       [0, 0, 0, 0, 8, 0]])
An out-of-place alternative is to use np.where:
>>> np.where(A == A.max(1, keepdims=True), A, 0)
array([[0, 0, 4, 0, 0, 0],
       [0, 6, 0, 0, 0, 0],
       [0, 0, 0, 0, 8, 0]])
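For comparison, a sketch of the argmax route the answer alludes to; unlike the max-based masking, which keeps every occurrence of a row's maximum, argmax keeps only the first one on ties:

import numpy as np

A = np.array([[2, 2, 4, 2, 2, 2],
              [2, 6, 2, 2, 2, 2],
              [2, 2, 2, 2, 8, 2]])

B = np.zeros_like(A)           # start from all zeros
rows = np.arange(A.shape[0])   # one row index per row
cols = A.argmax(axis=1)        # column of the first maximum in each row
B[rows, cols] = A[rows, cols]  # copy just those maxima over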
I have a correlation matrix, and I want to get the count of number of items below the diagonal. Preferably in numpy.
[[1, 0, 0, 0, 0],
[.35, 1, 0, 0, 0],
[.42, .31, 1, 0, 0],
[.25, .38, .41, 1, 0],
[.21, .36, .46, .31, 1]]
I want it to return 10. Or, to return the mean of all numbers under the diagonal.
Setup
a = np.array([[1. , 0. , 0. , 0. , 0. ],
[0.35, 1. , 0. , 0. , 0. ],
[0.42, 0.31, 1. , 0. , 0. ],
[0.25, 0.38, 0.41, 1. , 0. ],
[0.21, 0.36, 0.46, 0.31, 1. ]])
numpy.tril_indices will give the indices of all elements under the diagonal (if you provide an offset of -1), and from there, it becomes as simple as indexing and calling mean and size
n, m = a.shape
idx = np.tril_indices(n=n, k=-1, m=m)

a[idx]
# array([0.35, 0.42, 0.31, 0.25, 0.38, 0.41, 0.21, 0.36, 0.46, 0.31])

a[idx].mean()
# 0.346

a[idx].size
# 10
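As a sanity check on the count, an n x n matrix has n*(n-1)/2 elements strictly below the diagonal:

n = a.shape[0]
n * (n - 1) // 2
# 10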
A more primitive and bulky answer, since numpy already provides np.tril_indices as user3483203 mentioned. What you want at each row iteration i is the following (in terms of [row, col] indices):
(i=0)
[1,0] (i=1)
[2,0] [2,1] (i=2)
[3,0] [3,1] [3,2] (i=3)
...
This is essentially the zip of the list [i,i,i,...] = [i]*i (i repetitions of i) with [0,1,...,i-1] = range(i). So, iterating over the rows of the table, you can get the indices per iteration and perform the operation of your choice.
Example setup:
test = np.array(
[[1, 0, 0, 0, 0],
[.35, 1, 0, 0, 0],
[.42, .31, 1, 0, 0],
[.25, .38, .41, 1, 0],
[.21, .36, .46, .31, 1]])
Function definition:
def countdiag(myarray):
    numvals = 0
    totsum = 0
    for i in range(myarray.shape[0]):  # row iteration
        colc = np.arange(i)            # column indices below the diagonal
        rowc = np.array([i] * i)       # row index repeated i times
        if any(rowc):                  # the first row has no sub-diagonal elements
            print(np.sum(myarray[rowc, colc]))
            print(len(myarray[rowc, colc]))
            numvals += len(myarray[rowc, colc])
            totsum += np.sum(myarray[rowc, colc])
        print(list(zip([i] * i, np.arange(i))))
    mean = totsum / numvals
    return mean, numvals
Test:
In [165]: countdiag(test)
[]
0.35
1
[(1, 0)]
0.73
2
[(2, 0), (2, 1)]
1.04
3
[(3, 0), (3, 1), (3, 2)]
1.34
4
[(4, 0), (4, 1), (4, 2), (4, 3)]
0.346
Out[165]:
(0.346, 10)
(In any language) For a research project, I am stuck on how to convert a matrix P of probability values to a matrix A such that A_ij = 1 with probability P_ij and 0 otherwise. I have looked through various random number generator documentation, but have been unable to figure out how to do this.
If I understand correctly, draw an independent uniform random number for each entry and compare it against the probability:
In [11]: p = np.random.uniform(size=(5,5))  # stand-in for your probability matrix P
In [12]: p
Out[12]:
array([[ 0.45481883, 0.21242567, 0.3124863 , 0.00485797, 0.31970718],
[ 0.91995847, 0.29907277, 0.59154085, 0.85847147, 0.13227595],
[ 0.91914631, 0.5495813 , 0.58648856, 0.08037582, 0.23005148],
[ 0.12464628, 0.70657028, 0.75975869, 0.77632964, 0.24587041],
[ 0.69259133, 0.183515 , 0.65500547, 0.19526148, 0.26975325]])
In [13]: a = (np.random.uniform(size=p.shape) < p).astype(np.int8)
Each entry of a is then 1 with probability p[i, j] and 0 otherwise; the exact 0/1 pattern varies from run to run.
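Equivalently, a sketch with the newer numpy Generator API; binomial with n=1 draws Bernoulli samples and broadcasts elementwise over a probability array:

import numpy as np

rng = np.random.default_rng()
P = rng.uniform(size=(5, 5))  # example probability matrix
A = rng.binomial(1, P)        # A[i, j] is 1 with probability P[i, j]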
I am interested in comparing all values from one dataframe column with all values from a second column, and then generating a list or a subset df with values from a third column that are adjacent to the first-column hits. Hopefully this example will explain it better:
For a simplified example, say I generate the following pandas dataframe:
fake_df=pd.DataFrame({'m':[100,120,101,200,201,501,350,420,525,500],
'n':[10.0,11.0,10.2,1.0,2.0,1.1,3.0,1.0,2.0,1.0],
'mod':[101.001,121.001,102.001,201.001,202.001,502.001,351.001,421.001,526.001,501.001]})
print(fake_df)
What I am interested in doing is finding all values in column 'm' that are within 0.1 of any value in
column 'mod' and return the values in column 'n' that correspond to the column 'm' hits. So for the above code, the return would be:
10.2, 2.0, 1.1
(since 101,201 and 501 all have close hits in column 'mod').
I have found ways to compare across the same row, but not like above. Is there a way to do this in pandas without extensive loops?
Thanks!
I don't know of such a method in pandas, but when you extend your scope to include numpy, two options come to mind.
Easy/Expensive Method
If you can live with N**2 memory overhead, you can do numpy broadcasting to
find out all "adjacent" elements in one step:
In [25]: fake_df=pd.DataFrame({'m':[100,120,101,200,201,501,350,420,525,500],
'n':[10.0,11.0,10.2,1.0,2.0,1.1,3.0,1.0,2.0,1.0],
'mod':[101.001,121.001,102.001,201.001,202.001,502.001,351.001,421.001,526.001,501.001]})
In [26]: mvals = fake_df['m'].values
In [27]: modvals = fake_df['mod'].values
In [28]: is_close = np.abs(mvals - modvals[:, np.newaxis]) <= 0.1; is_close.astype(int)
Out[28]:
array([[0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 1, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
[0, 0, 0, 0, 0, 1, 0, 0, 0, 0]])
As we care only about 'mod' values that have adjacent 'm's, aggregate over axis=0:
In [29]: is_close.any(axis=0).astype(int)
Out[29]: array([0, 0, 1, 0, 1, 1, 0, 0, 0, 0])
Or, pulling the matching 'n' values directly:
In [30]: fake_df.loc[is_close.any(axis=0), 'n']
Out[30]:
2 10.2
4 2.0
5 1.1
Name: n, dtype: float64
Efficient/Complex Method
To find adjacent elements in less than O(N**2) without any hashing/rounding
tricks, you have to do some sorting:
In [103]: modvals_sorted = np.sort(modvals)
In [104]: next_indices = np.searchsorted(modvals_sorted, mvals)
You have the indices of the next elements, but they may point beyond the original array, so you need one extra NaN at the end to avoid an IndexError. The same logic applies to the previous elements, which are at next_indices - 1: to avoid indexing before the first element, we must prepend one NaN, too. Note the + 1 in the indexing below, which arises because one NaN has been prepended.
In [105]: modvals_sorted_plus = np.r_[np.nan, modvals_sorted, np.nan]
In [106]: nexts = modvals_sorted_plus[next_indices + 1]
In [107]: prevs = modvals_sorted_plus[(next_indices - 1) + 1]
Now it's trivial. Note that we already have prevs <= mvals <= nexts, so we don't need np.abs. Also, all missing elements are NaN, and comparing with NaN yields False, which doesn't alter the result of any operation.
In [108]: adjacent = np.c_[prevs, mvals, nexts]; adjacent
Out[108]:
array([[ nan, 100. , 101.001],
[ 102.001, 120. , 121.001],
[ nan, 101. , 101.001],
[ 121.001, 200. , 201.001],
[ 121.001, 201. , 201.001],
[ 421.001, 501. , 501.001],
[ 202.001, 350. , 351.001],
[ 351.001, 420. , 421.001],
[ 502.001, 525. , 526.001],
[ 421.001, 500. , 501.001]])
In [109]: (np.diff(adjacent, axis=1) <= 0.1).any(axis=1)
Out[109]: array([False, False, True, False, True, True, False, False, False, False], dtype=bool)
In [110]: mask = (np.diff(adjacent, axis=1) <= 0.1).any(axis=1)
In [112]: fake_df.loc[mask, 'n']
Out[112]:
2 10.2
4 2.0
5 1.1
Name: n, dtype: float64
Try the following:
# I assume all arrays involved to be or to be converted to numpy arrays
import numpy as np
m = np.array([100,120,101,200,201,501,350,420,525,500])
n = np.array([10.0,11.0,10.2,1.0,2.0,1.1,3.0,1.0,2.0,1.0])
mod = np.array([101.001,121.001,102.001,201.001,202.001,502.001,351.001,421.001,526.001,501.001])
res = []
# for each entry in mod, look in m for "close" values
for i in range(len(mod)):
    # for each hit, store the corresponding entry from n in the result list
    res.extend(n[np.fabs(mod[i] - m) <= 0.1])

# cast result to numpy array
res = np.array(res)
print(res)
The output is
[ 10.2 2. 1.1]
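The same logic can be written without the Python loop, using the broadcasting idea from the first answer (a sketch):

mask = (np.fabs(mod[:, np.newaxis] - m) <= 0.1).any(axis=0)  # rows: mod, columns: m
res = n[mask]
print(res)  # [10.2  2.   1.1]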
I'll be making use of numpy (imported as np), which pandas uses under the hood. np.isclose returns a boolean indexer: for each value of df["mod"], there is a True or False indicating whether the value m is within atol of it.
>>> for i, m in df["m"].items():
...     indices = np.isclose(m, df["mod"], atol=0.1)
...     if any(indices):
...         print(df["n"][i])
Using the DataFrame you gave produces the output:
10.2
2.0
1.1
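The loop can also be dropped entirely: np.isclose broadcasts, so comparing against a column vector produces the whole comparison matrix in one call (a sketch):

mask = np.isclose(df["m"].values, df["mod"].values[:, None], atol=0.1).any(axis=0)
df.loc[mask, "n"]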
It's hard to explain what I'm trying to do with words so here's an example.
Let's say we have the following inputs:
In [76]: x
Out[76]:
0 a
1 a
2 c
3 a
4 b
In [77]: z
Out[77]: ['a', 'b', 'c', 'd', 'e']
I want to get:
In [78]: ii
Out[78]:
array([[1, 0, 0, 0, 0],
[1, 0, 0, 0, 0],
[0, 0, 1, 0, 0],
[1, 0, 0, 0, 0],
[0, 1, 0, 0, 0]])
ii is an array of boolean masks which can be applied to z to get back the original x.
My current solution is to write a function that converts z to a list, uses the index method to find the position of the element in z, and then generates a row of zeros with a one at that index. This function gets applied to each element of x to get the desired result, as sketched below.
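A minimal version of that baseline (assuming x is a pandas Series and z a plain list):

def one_hot(val, choices):
    row = [0] * len(choices)
    row[choices.index(val)] = 1  # 1 at the position of val in choices
    return row

ii = np.array([one_hot(v, z) for v in x])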
A first possibility:
>>> choices = np.diag([1]*5)
>>> choices[[z.index(i) for i in x]]
As noted elsewhere, you can replace the list comprehension [z.index(i) for i in x] with np.searchsorted(z, x) (which assumes z is sorted):
>>> choices[np.searchsorted(z, x)]
Note that, as suggested in a comment by @seberg, you should use np.eye(len(z)) instead of np.diag([1]*len(z)): the np.eye function directly gives you a 2D array with 1 on the diagonal and 0 elsewhere, and its size must match the number of choices in z.
This is the numpy method, for the case of z being sorted. You did not specify that... If pandas needs something different, I don't know:
# Assuming z is sorted.
indices = np.searchsorted(z, x)
Now, I really don't know why you want a boolean mask; these indices can already be applied to z to give back x, and they are more compact:
np.asarray(z)[indices] == x  # True everywhere if z contains all values of x
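If you do want the boolean mask, it can be built from the indices (a sketch, assuming every value of x occurs in z):

ii = np.zeros((len(x), len(z)), dtype=int)
ii[np.arange(len(x)), indices] = 1  # one 1 per row, at the matched position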
Surprised no one mentioned the outer method of np.equal:
In [51]: np.equal.outer(x, z)
Out[51]:
array([[ True, False, False, False, False],
[ True, False, False, False, False],
[False, False, True, False, False],
[ True, False, False, False, False],
[False, True, False, False, False]], dtype=bool)
In [52]: np.equal.outer(x, z).astype(int)
Out[52]:
array([[1, 0, 0, 0, 0],
[1, 0, 0, 0, 0],
[0, 0, 1, 0, 0],
[1, 0, 0, 0, 0],
[0, 1, 0, 0, 0]])
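And the round trip the question describes, recovering x from the mask; argmax(1) returns the column of the single 1 in each row:

np.asarray(z)[np.equal.outer(x, z).astype(int).argmax(1)]
# array(['a', 'a', 'c', 'a', 'b'], dtype='<U1')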