Suppose I have a dataframe as shown below:
import pandas as pd
import numpy as np
np.random.seed(42)
df = pd.DataFrame({'A':np.random.randn(5), 'B': np.zeros(5), 'C': np.zeros(5)})
df
>>>
A B C
0 0.496714 0.0 0.0
1 -0.138264 0.0 0.0
2 0.647689 0.0 0.0
3 1.523030 0.0 0.0
4 -0.234153 0.0 0.0
And I have the list of columns which I want to populate with the value of 1, when A is negative.
idx = df.A < 0
cols = ['B', 'C']
So in this case, I want the indices [1, 'B'] and [4, 'C'] set to 1.
What I tried:
Doing df.loc[idx, cols] = 1 sets every column in cols to 1 for each matching row, not just one column per row. I also tried df.loc[idx, cols] = pd.get_dummies(cols), which gave the result:
A B C
0 0.496714 0.0 0.0
1 -0.138264 0.0 1.0
2 0.647689 0.0 0.0
3 1.523030 0.0 0.0
4 -0.234153 NaN NaN
I'm assuming this is because the index of get_dummies and the dataframe don't line up.
Expected Output:
A B C
0 0.496714 0.0 0.0
1 -0.138264 1.0 0.0
2 0.647689 0.0 0.0
3 1.523030 0.0 0.0
4 -0.234153 0.0 1.0
So what's the best (read: fastest) way to do this? In my case, there are thousands of rows and 5 columns.
Timing of results:
TLDR: editing values directly is faster.
%%timeit
df.values[idx, df.columns.get_indexer(cols)] = 1
123 µs ± 2.5 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
df.iloc[idx.array,df.columns.get_indexer(cols)]=1
266 µs ± 7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Use NumPy indexing to improve performance:
idx = df.A < 0
res = ['B', 'C']
arr = df.values
arr[idx, df.columns.get_indexer(res)] = 1
print (arr)
[[ 0.49671415 0. 0. ]
[-0.1382643 1. 0. ]
[ 0.64768854 0. 0. ]
[ 1.52302986 0. 0. ]
[-0.23415337 0. 1. ]]
df = pd.DataFrame(arr, columns=df.columns, index=df.index)
print (df)
A B C
0 0.496714 0.0 0.0
1 -0.138264 1.0 0.0
2 0.647689 0.0 0.0
3 1.523030 0.0 0.0
4 -0.234153 0.0 1.0
Alternative:
idx = df.A < 0
res = ['B', 'C']
df.values[idx, df.columns.get_indexer(res)] = 1
print (df)
A B C
0 0.496714 0.0 0.0
1 -0.138264 1.0 0.0
2 0.647689 0.0 0.0
3 1.523030 0.0 0.0
4 -0.234153 0.0 1.0
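The row/column pairing used above can be sketched on a private copy of the data. This is a minimal sketch, assuming every column shares one float dtype; the values of A are hard-coded from the question's output, and the names row_pos, col_pos and result are only illustrative. Working on to_numpy(copy=True) avoids relying on df.values being a writable view, which can differ between pandas versions and settings.

```python
import numpy as np
import pandas as pd

# Values of A copied from the question's printed frame (assumption: no reseeding).
df = pd.DataFrame({
    "A": [0.496714, -0.138264, 0.647689, 1.523030, -0.234153],
    "B": np.zeros(5),
    "C": np.zeros(5),
})
cols = ["B", "C"]

# Work on an explicit copy so the original frame is never touched.
arr = df.to_numpy(copy=True)

# Integer positions of the rows where A is negative, and of the target columns.
row_pos = np.flatnonzero(df["A"].to_numpy() < 0)
col_pos = df.columns.get_indexer(cols)

# Pairwise indexing needs exactly one column per selected row.
if len(row_pos) != len(col_pos):
    raise ValueError("number of matching rows must equal number of columns")

# arr[rows, cols] with two integer arrays addresses the pairs (row_i, col_i).
arr[row_pos, col_pos] = 1

result = pd.DataFrame(arr, columns=df.columns, index=df.index)
print(result)
```

The explicit length check matters because NumPy would otherwise raise a less obvious broadcasting error when the number of negative rows and the number of columns differ.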
Another option is to loop over the (row, column) pairs with df.at:
ind = df.index[idx]
for row, col in zip(ind, res):
    df.at[row, col] = 1
In [7]: df
Out[7]:
A B C
0 0.496714 0.0 0.0
1 -0.138264 1.0 0.0
2 0.647689 0.0 0.0
3 1.523030 0.0 0.0
4 -0.234153 0.0 1.0
I would like to take a dataframe and concatenate consecutive rows for comparison.
e.g.
Take
xyt = pd.DataFrame(np.concatenate((np.random.randn(3,2), np.arange(3).reshape((3, 1))), axis=1), columns=['x','y','t'])
Which looks something like:
x y t
0 1.237007 -1.035837 0.0
1 -1.782458 1.042942 1.0
2 0.063130 0.355014 2.0
And make:
a b
x y t x y t
0 1.237007 -1.035837 0.0 -1.782458 1.042942 1.0
1 -1.782458 1.042942 1.0 0.063130 0.355014 2.0
The best I could come up with was:
pd.DataFrame(
[np.append(x,y) for (x, y) in zip(xyt.values, xyt[1:].values)],
columns=pd.MultiIndex.from_product([('a', 'b'), xyt.columns]))
Is there a better way?
Let's try concat on axis=1 with the shifted frame:
import pandas as pd
xyt = pd.DataFrame({'x': {0: 1.237007, 1: -1.782458, 2: 0.06313},
'y': {0: -1.035837, 1: 1.042942, 2: 0.355014},
't': {0: 0.0, 1: 1.0, 2: 2.0}})
merged = pd.concat((xyt, xyt.shift(-1)), axis=1, keys=('a', 'b')).iloc[:-1]
print(merged)
merged:
a b
x y t x y t
0 1.237007 -1.035837 0.0 -1.782458 1.042942 1.0
1 -1.782458 1.042942 1.0 0.063130 0.355014 2.0
You can use pd.concat:
# Generate random data
n = 10
x, y = np.random.randn(2, n)
t = np.arange(n)
xyt = pd.DataFrame({
'x': x, 'y': y, 't': t
})
# The call
pd.concat([xyt, xyt.shift(-1)], axis=1, keys=['a','b'])
# Result
a b
x y t x y t
0 1.180544 1.707380 0 -0.227370 0.734225 1.0
1 -0.227370 0.734225 1 0.271997 -1.039424 2.0
2 0.271997 -1.039424 2 -0.729960 -1.081224 3.0
3 -0.729960 -1.081224 3 0.185301 0.530126 4.0
4 0.185301 0.530126 4 -0.175333 -0.126157 5.0
5 -0.175333 -0.126157 5 -0.634870 0.068683 6.0
6 -0.634870 0.068683 6 0.350867 0.361564 7.0
7 0.350867 0.361564 7 0.090678 -0.269504 8.0
8 0.090678 -0.269504 8 0.177076 -0.976640 9.0
9 0.177076 -0.976640 9 NaN NaN NaN
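Note that shift(-1) leaves a trailing NaN row and turns the integer column t into floats. A possible variant, sketched here with made-up numbers, slices the frame instead of shifting it, so no NaN is introduced and the dtypes survive. The names first, second and pairs are illustrative only.

```python
import pandas as pd

# Small hand-written frame (assumed data, not the random values from the thread).
xyt = pd.DataFrame({
    "x": [1.0, -2.0, 0.5],
    "y": [-1.0, 1.0, 0.25],
    "t": [0, 1, 2],
})

# Rows 0..n-2 and rows 1..n-1, both renumbered from 0 so they line up on concat.
first = xyt.iloc[:-1].reset_index(drop=True)
second = xyt.iloc[1:].reset_index(drop=True)

# keys= adds the outer column level ('a', 'b') on top of x, y, t.
pairs = pd.concat([first, second], axis=1, keys=["a", "b"])
print(pairs)
```

Because both halves share the same index after reset_index, concat has nothing to fill, which is why t stays an integer column here.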
I want to divide a dataframe by a number:
df = df/10
Is there a way to do this in a method chain?
# idea:
df = df.filter(['a','b']).query("a>100").assign(**divide by 10)
We can use DataFrame.div here:
df = df[['a','b']].query("a>100").div(10)
a b
0 40.0 0.7
1 50.0 0.8
5 70.0 0.3
Use DataFrame.pipe with a lambda to apply a function to the whole DataFrame at once:
df = pd.DataFrame({
'a':[400,500,40,50,5,700],
'b':[7,8,9,4,2,3],
'c':[1,3,5,7,1,0],
'd':[5,3,6,9,2,4]
})
df = df.filter(['a','b']).query("a>100").pipe(lambda x: x / 10)
print (df)
a b
0 40.0 0.7
1 50.0 0.8
5 70.0 0.3
With apply, by contrast, the function is called on each column separately:
df = df.filter(['a','b']).query("a>100").apply(lambda x: x / 10)
You can see the difference with print:
df1 = df.filter(['a','b']).query("a>100").pipe(lambda x: print (x))
a b
0 400 7
1 500 8
5 700 3
df2 = df.filter(['a','b']).query("a>100").apply(lambda x: print (x))
0 400
1 500
5 700
Name: a, dtype: int64
0 7
1 8
5 3
Name: b, dtype: int64
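The same comparison can be made without printing, by recording what each method hands to the function. This is a small sketch using the a and b columns from the example above; the helper names record_pipe and record_apply are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "a": [400, 500, 40, 50, 5, 700],
    "b": [7, 8, 9, 4, 2, 3],
})
subset = df.query("a > 100")

seen_by_pipe = []
seen_by_apply = []

def record_pipe(obj):
    # pipe passes the whole object it was called on.
    seen_by_pipe.append(type(obj).__name__)
    return obj / 10

def record_apply(obj):
    # DataFrame.apply (default axis=0) passes one column at a time.
    seen_by_apply.append(type(obj).__name__)
    return obj / 10

via_pipe = subset.pipe(record_pipe)
via_apply = subset.apply(record_apply)
via_div = subset.div(10)

print(seen_by_pipe)        # type names seen by the pipe helper
print(set(seen_by_apply))  # type names seen by the apply helper
print(via_pipe)
```

For a plain division all three spellings give the same frame, so the choice mostly comes down to whether the function needs to see the whole DataFrame or one column at a time.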
I have this pandas df:
value
index1 index2 index3
1 1 1 10.0
2 -0.5
3 0.0
2 2 1 3.0
2 0.0
3 0.0
3 1 0.0
2 -5.0
3 6.0
I would like to get the 'value' of a specific combination of index, using a dict.
Usually, I use, for example:
df = df.iloc[df.index.isin([2],level='index1')]
df = df.iloc[df.index.isin([3],level='index2')]
df = df.iloc[df.index.isin([2],level='index3')]
value = df.values[0][0]
Now, I would like to get my value = -5 in a shorter way using this dictionary:
d = {'index1':2,'index2':3,'index3':2}
And also, if I use:
d = {'index1':2,'index2':3}
I would like to get the array:
[0.0, -5.0, 6.0]
Tips?
You can use SQL-like method DataFrame.query():
In [69]: df.query(' and '.join('{}=={}'.format(k,v) for k,v in d.items()))
Out[69]:
value
index1 index2 index3
2.0 3.0 2 -5.0
For another dict:
In [77]: d = {'index1':2,'index2':3}
In [78]: df.query(' and '.join('{}=={}'.format(k,v) for k,v in d.items()))
Out[78]:
value
index1 index2 index3
2.0 3.0 1 0.0
2 -5.0
3 6.0
A non-query way would be
In [64]: df.loc[np.logical_and.reduce([
df.index.get_level_values(k) == v for k, v in d.items()])]
Out[64]:
value
index1 index2 index3
2 3 2 -5.0
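The mask-based approach can be wrapped in a small helper that returns the matching values directly. This is only a sketch: the frame is rebuilt by hand from the table in the question, and select_by_levels is a hypothetical name, not a pandas function.

```python
import numpy as np
import pandas as pd

# Frame rebuilt from the question's table (assumed to match the original).
index = pd.MultiIndex.from_tuples(
    [(1, 1, 1), (1, 1, 2), (1, 1, 3),
     (2, 2, 1), (2, 2, 2), (2, 2, 3),
     (2, 3, 1), (2, 3, 2), (2, 3, 3)],
    names=["index1", "index2", "index3"],
)
df = pd.DataFrame(
    {"value": [10.0, -0.5, 0.0, 3.0, 0.0, 0.0, 0.0, -5.0, 6.0]},
    index=index,
)

def select_by_levels(frame, level_values):
    """Return 'value' for rows whose index levels equal the given dict entries."""
    mask = np.ones(len(frame), dtype=bool)
    for level, wanted in level_values.items():
        # Comparing an Index with a scalar yields a boolean NumPy array.
        mask &= frame.index.get_level_values(level) == wanted
    return frame.loc[mask, "value"].to_numpy()

print(select_by_levels(df, {"index1": 2, "index2": 3, "index3": 2}))
print(select_by_levels(df, {"index1": 2, "index2": 3}))
```

Levels that are left out of the dict are simply not constrained, which is how the two-key dict returns all three rows under (2, 3).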
Suppose I have two-leveled multi-indexed dataframe
In [1]: index = pd.MultiIndex.from_tuples([(i,j) for i in range(3)
: for j in range(1+i)], names=list('ij') )
: df = pd.DataFrame(0.1*np.arange(2*len(index)).reshape(-1,2),
: columns=list('xy'), index=index )
: df
Out[1]:
x y
i j
0 0 0.0 0.1
1 0 0.2 0.3
1 0.4 0.5
2 0 0.6 0.7
1 0.8 0.9
2 1.0 1.1
And I want to run a custom function on every sub-dataframe:
In [2]: def my_aggr_func(subdf):
: return subdf['x'].mean() / subdf['y'].mean()
:
: level0 = df.index.levels[0].values
: pd.DataFrame({'mean_ratio': [my_aggr_func(df.loc[i]) for i in level0]},
: index=pd.Index(level0, name=index.names[0]) )
Out[2]:
mean_ratio
i
0 0.000000
1 0.750000
2 0.888889
Is there an elegant way to do it with df.groupby('i').agg(__something__) or something similar?
You need GroupBy.apply, which passes each group to the function as a DataFrame:
df1 = df.groupby('i').apply(my_aggr_func).to_frame('mean_ratio')
print (df1)
mean_ratio
i
0 0.000000
1 0.750000
2 0.888889
You don't need the custom function. You can calculate the 'within group means' with agg then perform an eval to get the ratio you want.
df.groupby('i').agg('mean').eval('x / y')
i
0 0.000000
1 0.750000
2 0.888889
dtype: float64
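To check that the two answers agree, here is a self-contained sketch that rebuilds the frame from the question and computes the ratio both ways. The variable names by_apply and by_means are illustrative, and the ratio is written as a plain column division rather than eval.

```python
import numpy as np
import pandas as pd

# Same frame as in the question.
index = pd.MultiIndex.from_tuples(
    [(i, j) for i in range(3) for j in range(1 + i)], names=list("ij")
)
df = pd.DataFrame(
    0.1 * np.arange(2 * len(index)).reshape(-1, 2),
    columns=list("xy"),
    index=index,
)

def my_aggr_func(subdf):
    # Ratio of the group's mean x to its mean y.
    return subdf["x"].mean() / subdf["y"].mean()

# Variant 1: the custom function sees each group as a DataFrame.
by_apply = df.groupby(level="i").apply(my_aggr_func)

# Variant 2: compute group means once, then divide the columns.
means = df.groupby(level="i").mean()
by_means = means["x"] / means["y"]

print(by_apply)
print(by_means)
```

The second variant stays vectorized, so it is usually the better fit when the aggregation can be expressed through built-in reductions.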
Why do the following assignments behave differently?
df.loc[rows, [col]] = ...
df.loc[rows, col] = ...
For example:
r = pd.DataFrame({"response": [1,1,1],},index = [1,2,3] )
df = pd.DataFrame({"x": [999,99,9],}, index = [3,4,5] )
df = pd.merge(df, r, how="left", left_index=True, right_index=True)
df.loc[df["response"].isnull(), "response"] = 0
print(df)
x response
3 999 0.0
4 99 0.0
5 9 0.0
but
df.loc[df["response"].isnull(), ["response"]] = 0
print(df)
x response
3 999 1.0
4 99 0.0
5 9 0.0
Why should I expect the first to behave differently from the second?
df.loc[df["response"].isnull(), ["response"]]
returns a DataFrame, so whatever you assign to it must be aligned on both index and columns.
Demo:
In [79]: df.loc[df["response"].isnull(), ["response"]] = \
pd.DataFrame([11,12], columns=['response'], index=[4,5])
In [80]: df
Out[80]:
x response
3 999 1.0
4 99 11.0
5 9 12.0
Alternatively, you can assign an array/matrix of the same shape:
In [83]: df.loc[df["response"].isnull(), ["response"]] = [11, 12]
In [84]: df
Out[84]:
x response
3 999 1.0
4 99 11.0
5 9 12.0
I'd also consider using fillna() method:
In [88]: df.response = df.response.fillna(0)
In [89]: df
Out[89]:
x response
3 999 1.0
4 99 0.0
5 9 0.0
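The alignment point can be made visible by assigning a DataFrame whose rows are in a different order from the target. This is a minimal sketch with the merged frame written out by hand; the name replacement and the values 11 and 12 follow the demo above, with the row order deliberately reversed.

```python
import numpy as np
import pandas as pd

# The frame as it looks after the left merge in the question.
df = pd.DataFrame(
    {"x": [999, 99, 9], "response": [1.0, np.nan, np.nan]},
    index=[3, 4, 5],
)
mask = df["response"].isna()

# Right-hand side listed as row 5 first, then row 4.
replacement = pd.DataFrame({"response": [12.0, 11.0]}, index=[5, 4])

# Assignment matches on labels, not on position, so 4 gets 11 and 5 gets 12.
df.loc[mask, ["response"]] = replacement
print(df)

# For the original goal (replace missing values with 0), fillna is simpler.
original = pd.DataFrame(
    {"x": [999, 99, 9], "response": [1.0, np.nan, np.nan]},
    index=[3, 4, 5],
)
filled = original["response"].fillna(0)
print(filled)
```

A plain list on the right-hand side carries no labels, so it is placed by position instead.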