How do I put values of a dataframe column in a 2D matrix? - pandas

I have a pandas dataframe with 3 columns: value, row_index, column_index. I would like to create a matrix where the dataframe values are placed at the corresponding rows and columns, and unknown elements are zeros.
I wrote a for loop like this:
N_rows = df.row_index.max()
N_cols = df.column_index.max()
A = np.zeros((N_rows, N_cols))
for i in df.row_index:
    for j in df.column_index:
        np.put(A, i*N_cols+j, df['value'][(df.row_index==i) &
                                          (df.column_index==j)])
but it is very slow.
How can I make it faster?

I think you need pivot with fillna. To add the rows and columns that have no entries, use reindex, and finally use values to get a NumPy array:
df = pd.DataFrame({'value':[2,4,5],
                   'row_index':[2,3,4],
                   'col_index':[0,2,3]})
print (df)
col_index row_index value
0 0 2 2
1 2 3 4
2 3 4 5
rows = np.arange(df.row_index.max()+1)
cols = np.arange(df.col_index.max()+1)
print (df.pivot(index='row_index', columns='col_index', values='value')
         .fillna(0)
         .reindex(index=rows, columns=cols, fill_value=0))
col_index 0 1 2 3
row_index
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 2.0 0.0 0.0 0.0
3 0.0 0.0 4.0 0.0
4 0.0 0.0 0.0 5.0
a = (df.pivot(index='row_index', columns='col_index', values='value')
       .fillna(0)
       .reindex(index=rows, columns=cols, fill_value=0)
       .values)
print (a)
[[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]
[ 2. 0. 0. 0.]
[ 0. 0. 4. 0.]
[ 0. 0. 0. 5.]]
Another solution with set_index and unstack:
print (df.set_index(['row_index', 'col_index'])['value']
         .unstack(fill_value=0)
         .reindex(index=rows, columns=cols, fill_value=0))
col_index 0 1 2 3
row_index
0 0 0 0 0
1 0 0 0 0
2 2 0 0 0
3 0 0 4 0
4 0 0 0 5
a = (df.set_index(['row_index', 'col_index'])['value']
       .unstack(fill_value=0)
       .reindex(index=rows, columns=cols, fill_value=0)
       .values)
print (a)
[[0 0 0 0]
[0 0 0 0]
[2 0 0 0]
[0 0 4 0]
[0 0 0 5]]
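If only the NumPy array is needed, the values can also be written straight into a zero matrix with integer array indexing, without going through pivot. This is a minimal sketch rather than part of the answer above; it assumes the indices are zero-based non-negative integers and that no (row, column) pair occurs twice (if one does, a later value overwrites an earlier one).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'value': [2, 4, 5],
                   'row_index': [2, 3, 4],
                   'col_index': [0, 2, 3]})

# The shape is max index + 1 because the indices are zero-based.
n_rows = df['row_index'].max() + 1
n_cols = df['col_index'].max() + 1

# Start from zeros, then write every value at its (row, col) position in one step.
a = np.zeros((n_rows, n_cols), dtype=df['value'].dtype)
a[df['row_index'].to_numpy(), df['col_index'].to_numpy()] = df['value'].to_numpy()

print(a)
```

Because the assignment is a single vectorized operation, it avoids both the Python loop from the question and the intermediate reshaped DataFrame.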

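Just modifying a minor part of @jezrael's solution. You can get the array directly from the pivoted frame. The original version of this answer used as_matrix(), which was removed in pandas 1.0; to_numpy() is the current equivalent. Note that without reindex, rows and columns that have no entries are omitted:
df = pd.DataFrame({'value':[2,4,5],
                   'row_index':[2,3,4],
                   'col_index':[0,2,3]})
df.pivot(index='row_index', columns='col_index', values='value').fillna(0).to_numpy()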
# array([[ 2., 0., 0.],
# [ 0., 4., 0.],
# [ 0., 0., 5.]])

Related

numpy return indices using multiple conditions of UNKNOWN number

Consider two arrays (X and Y), X is a 2D data array (grayscale image), and Y is an array of conditions where array X needs to be filtered based on, as follows:
X = np.array([[0,0,0,0,4], [0,1,1,2,3], [1,1,2,2,0], [0,0,2,2,3], [0,0,0,0,0]])
Y = np.array([1,2,3])
X:
[[0 0 0 0 4]
[0 1 1 2 3]
[1 1 2 2 0]
[0 0 2 2 3]
[0 0 0 0 0]]
Y:
[1 2 3]
I need to select the elements/indices of array X based on the values in array Y, such that:
Z = np.argwhere((X == Y[0]) | (X == Y[1]) | (X == Y[2]))
Z:
[[1 1]
[1 2]
[1 3]
[1 4]
[2 0]
[2 1]
[2 2]
[2 3]
[3 2]
[3 3]
[3 4]]
This can be done with a loop over the items of array Y, but is there a NumPy function to achieve this?
It is also achievable with multiple conditions inside np.argwhere; however, the number of conditions (the length of array Y) is unknown beforehand.
Thanks
The key is to prepare the correct mask. For that, use numpy.isin:
np.isin(X, Y)
The result is a boolean mask with the same shape as X. You can then get the indices with an appropriate method, such as np.argwhere.
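Putting the two steps together on the arrays from the question, a short sketch could look like this. The variable name mask is only illustrative.

```python
import numpy as np

X = np.array([[0, 0, 0, 0, 4],
              [0, 1, 1, 2, 3],
              [1, 1, 2, 2, 0],
              [0, 0, 2, 2, 3],
              [0, 0, 0, 0, 0]])
Y = np.array([1, 2, 3])

# True wherever an element of X equals any element of Y,
# regardless of how many values Y holds.
mask = np.isin(X, Y)

# Row/column index pairs of the True entries, one pair per row of Z.
Z = np.argwhere(mask)
print(Z)
```

Since np.isin accepts Y of any length, the number of conditions does not need to be known in advance.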

pandas assign result from list of columns

Suppose I have a dataframe as shown below:
import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame({'A':np.random.randn(5), 'B': np.zeros(5), 'C': np.zeros(5)})
df
>>>
A B C
0 0.496714 0.0 0.0
1 -0.138264 0.0 0.0
2 0.647689 0.0 0.0
3 1.523030 0.0 0.0
4 -0.234153 0.0 0.0
I also have a list of columns that I want to populate with the value 1 where A is negative.
idx = df.A < 0
cols = ['B', 'C']
So in this case, I want the indices [1, 'B'] and [4, 'C'] set to 1.
What I tried:
Doing df.loc[idx, cols] = 1 sets every listed column in the selected rows to 1, not just one column per row. I also tried df.loc[idx, cols] = pd.get_dummies(cols), which gave this result:
A B C
0 0.496714 0.0 0.0
1 -0.138264 0.0 1.0
2 0.647689 0.0 0.0
3 1.523030 0.0 0.0
4 -0.234153 NaN NaN
I'm assuming this is because the index of get_dummies and the dataframe don't line up.
Expected Output:
A B C
0 0.496714 0.0 0.0
1 -0.138264 1.0 0.0
2 0.647689 0.0 0.0
3 1.523030 0.0 0.0
4 -0.234153 0.0 1.0
So what's the best (read: fastest) way to do this? In my case, there are thousands of rows and 5 columns.
Timing of results:
TLDR: editing values directly is faster.
%%timeit
df.values[idx, df.columns.get_indexer(cols)] = 1
123 µs ± 2.5 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
df.iloc[idx.array,df.columns.get_indexer(cols)]=1
266 µs ± 7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Use NumPy indexing to improve performance:
idx = df.A < 0
res = ['B', 'C']
arr = df.values
arr[idx, df.columns.get_indexer(res)] = 1
print (arr)
[[ 0.49671415 0. 0. ]
[-0.1382643 1. 0. ]
[ 0.64768854 0. 0. ]
[ 1.52302986 0. 0. ]
[-0.23415337 0. 1. ]]
df = pd.DataFrame(arr, columns=df.columns, index=df.index)
print (df)
A B C
0 0.496714 0.0 0.0
1 -0.138264 1.0 0.0
2 0.647689 0.0 0.0
3 1.523030 0.0 0.0
4 -0.234153 0.0 1.0
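Alternative, writing into df.values directly (this only changes df when df.values is a writable view of the data, which requires all columns to share one dtype and Copy-on-Write not to be enabled):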
idx = df.A < 0
res = ['B', 'C']
df.values[idx, df.columns.get_indexer(res)] = 1
print (df)
A B C
0 0.496714 0.0 0.0
1 -0.138264 1.0 0.0
2 0.647689 0.0 0.0
3 1.523030 0.0 0.0
4 -0.234153 0.0 1.0
Another option is a plain loop with df.at:
ind = df.index[idx]
for i, col in zip(ind, res):
    df.at[i, col] = 1
In [7]: df
Out[7]:
A B C
0 0.496714 0.0 0.0
1 -0.138264 1.0 0.0
2 0.647689 0.0 0.0
3 1.523030 0.0 0.0
4 -0.234153 0.0 1.0
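The NumPy-based answers work because two integer index arrays of equal length are paired element by element, whereas df.loc[idx, cols] selects the full block of rows and columns. The sketch below makes that pairing explicit and works on a copy, so it does not depend on df.values being a writable view. It assumes the number of selected rows equals the number of listed columns; the names row_pos, col_pos and out are only illustrative.

```python
import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame({'A': np.random.randn(5), 'B': np.zeros(5), 'C': np.zeros(5)})

idx = df['A'] < 0
cols = ['B', 'C']

# Work on an explicit copy so the original frame is left untouched.
arr = df.to_numpy(copy=True)
row_pos = np.flatnonzero(idx.to_numpy())   # positions of the rows where A < 0
col_pos = df.columns.get_indexer(cols)     # positions of 'B' and 'C'

# Equal-length integer arrays are paired element by element:
# (row_pos[0], col_pos[0]), (row_pos[1], col_pos[1]), ...
arr[row_pos, col_pos] = 1

out = pd.DataFrame(arr, index=df.index, columns=df.columns)
print(out)
```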

Get column value using index dict

I have this pandas df:
value
index1 index2 index3
1 1 1 10.0
2 -0.5
3 0.0
2 2 1 3.0
2 0.0
3 0.0
3 1 0.0
2 -5.0
3 6.0
I would like to get the 'value' of a specific combination of index, using a dict.
Usually, I use, for example:
df = df.iloc[df.index.isin([2],level='index1')]
df = df.iloc[df.index.isin([3],level='index2')]
df = df.iloc[df.index.isin([2],level='index3')]
value = df.values[0][0]
Now, I would like to get my value = -5 in a shorter way using this dictionary:
d = {'index1':2,'index2':3,'index3':2}
And also, if I use:
d = {'index1':2,'index2':3}
I would like to get the array:
[0.0, -5.0, 6.0]
Tips?
You can use the SQL-like method DataFrame.query():
In [69]: df.query(' and '.join('{}=={}'.format(k,v) for k,v in d.items()))
Out[69]:
value
index1 index2 index3
2.0 3.0 2 -5.0
For another dict:
In [77]: d = {'index1':2,'index2':3}
In [78]: df.query(' and '.join('{}=={}'.format(k,v) for k,v in d.items()))
Out[78]:
value
index1 index2 index3
2.0 3.0 1 0.0
2 -5.0
3 6.0
A non-query way would be:
In [64]: df.loc[np.logical_and.reduce([
             df.index.get_level_values(k) == v for k, v in d.items()])]
Out[64]:
value
index1 index2 index3
2 3 2 -5.0
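To get the plain values the question asks for, the mask-based approach can be wrapped in a small function. This is a sketch under the assumption that the frame looks like the one in the question; select_by_levels is a hypothetical helper name, not a pandas API, and it expects a non-empty dict.

```python
import numpy as np
import pandas as pd

# Rebuild the frame from the question.
index = pd.MultiIndex.from_tuples(
    [(1, 1, 1), (1, 1, 2), (1, 1, 3),
     (2, 2, 1), (2, 2, 2), (2, 2, 3),
     (2, 3, 1), (2, 3, 2), (2, 3, 3)],
    names=['index1', 'index2', 'index3'])
df = pd.DataFrame({'value': [10.0, -0.5, 0.0, 3.0, 0.0, 0.0, 0.0, -5.0, 6.0]},
                  index=index)


def select_by_levels(frame, level_values):
    """Return the 'value' entries whose index levels match every dict item."""
    # One boolean array per level, combined with a logical AND.
    mask = np.logical_and.reduce(
        [frame.index.get_level_values(k) == v for k, v in level_values.items()])
    return frame.loc[mask, 'value'].tolist()


print(select_by_levels(df, {'index1': 2, 'index2': 3, 'index3': 2}))
print(select_by_levels(df, {'index1': 2, 'index2': 3}))
```

The same function handles both the fully specified dict and the partial one, because levels missing from the dict simply add no condition to the mask.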

How do I aggregate sub-dataframes in pandas?

Suppose I have a DataFrame with a two-level MultiIndex:
In [1]: index = pd.MultiIndex.from_tuples([(i,j) for i in range(3)
: for j in range(1+i)], names=list('ij') )
: df = pd.DataFrame(0.1*np.arange(2*len(index)).reshape(-1,2),
: columns=list('xy'), index=index )
: df
Out[1]:
x y
i j
0 0 0.0 0.1
1 0 0.2 0.3
1 0.4 0.5
2 0 0.6 0.7
1 0.8 0.9
2 1.0 1.1
And I want to run a custom function on every sub-dataframe:
In [2]: def my_aggr_func(subdf):
      :     return subdf['x'].mean() / subdf['y'].mean()
      :
      : level0 = df.index.levels[0].values
      : pd.DataFrame({'mean_ratio': [my_aggr_func(df.loc[i]) for i in level0]},
      :              index=pd.Index(level0, name=index.names[0]) )
Out[2]:
mean_ratio
i
0 0.000000
1 0.750000
2 0.888889
Is there an elegant way to do it with df.groupby('i').agg(__something__) or something similar?
You need GroupBy.apply, which passes each group to the function as a DataFrame:
df1 = df.groupby('i').apply(my_aggr_func).to_frame('mean_ratio')
print (df1)
mean_ratio
i
0 0.000000
1 0.750000
2 0.888889
You don't need the custom function. You can calculate the 'within group means' with agg then perform an eval to get the ratio you want.
df.groupby('i').agg('mean').eval('x / y')
i
0 0.000000
1 0.750000
2 0.888889
dtype: float64
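The same ratio can also be written without eval, by dividing the two columns of group means directly. The sketch below rebuilds the frame from the question; the names means and mean_ratio are only illustrative.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [(i, j) for i in range(3) for j in range(1 + i)], names=list('ij'))
df = pd.DataFrame(0.1 * np.arange(2 * len(index)).reshape(-1, 2),
                  columns=list('xy'), index=index)

# Mean of x and y within each value of index level 'i'.
means = df.groupby(level='i').mean()

# Ratio of the group means, as a Series named like the column in the question.
mean_ratio = (means['x'] / means['y']).rename('mean_ratio')
print(mean_ratio)
```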

df.loc[rows, [col]] vs df.loc[rows, col] in assignment

Why do the following assignments behave differently?
df.loc[rows, [col]] = ...
df.loc[rows, col] = ...
For example:
r = pd.DataFrame({"response": [1,1,1],},index = [1,2,3] )
df = pd.DataFrame({"x": [999,99,9],}, index = [3,4,5] )
df = pd.merge(df, r, how="left", left_index=True, right_index=True)
df.loc[df["response"].isnull(), "response"] = 0
print(df)
x response
3 999 0.0
4 99 0.0
5 9 0.0
but
df.loc[df["response"].isnull(), ["response"]] = 0
print(df)
x response
3 999 1.0
4 99 0.0
5 9 0.0
Why should I expect the first to behave differently from the second?
df.loc[df["response"].isnull(), ["response"]]
returns a DataFrame, so anything you assign to it must be aligned on both index and columns.
Demo:
In [79]: df.loc[df["response"].isnull(), ["response"]] = \
pd.DataFrame([11,12], columns=['response'], index=[4,5])
In [80]: df
Out[80]:
x response
3 999 1.0
4 99 11.0
5 9 12.0
Alternatively, you can assign an array/matrix of the same shape:
In [83]: df.loc[df["response"].isnull(), ["response"]] = [11, 12]
In [84]: df
Out[84]:
x response
3 999 1.0
4 99 11.0
5 9 12.0
I'd also consider using the fillna() method:
In [88]: df.response = df.response.fillna(0)
In [89]: df
Out[89]:
x response
3 999 1.0
4 99 0.0
5 9 0.0
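The difference between the two selections can be checked directly by looking at what each one returns. This sketch reuses the frame from the question; the names mask, as_series, as_frame and filled are only illustrative.

```python
import pandas as pd

r = pd.DataFrame({"response": [1, 1, 1]}, index=[1, 2, 3])
df = pd.DataFrame({"x": [999, 99, 9]}, index=[3, 4, 5])
df = pd.merge(df, r, how="left", left_index=True, right_index=True)

mask = df["response"].isnull()

as_series = df.loc[mask, "response"]     # scalar column label -> Series
as_frame = df.loc[mask, ["response"]]    # list of labels -> DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)

# fillna sidesteps the question of alignment entirely.
filled = df["response"].fillna(0)
print(filled.tolist())
```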