How to return a column value based on integer index and another column value in pandas?

I have a df that I want to groupby Date and apply a function to it.
Date Symbol Shares
0 1990-01-01 A 0.0
1 1990-01-01 B 0.0
2 1990-01-01 C 0.0
3 1990-01-01 D 0.0
4 1990-01-02 A 50.0
5 1990-01-02 B 100.0
6 1990-01-02 C 66.0
7 1990-01-02 D 7.0
8 1990-01-03 A 11.0
9 1990-01-03 B 123.0
10 1990-01-03 C 11.0
11 1990-01-03 D 11.0
Inside the function, I should be able to access the Shares value for a Symbol from the previous Date. How can I do that? Creating df['prev_shares'] with df.groupby('Symbol')['Shares'].shift(1) before applying the function is not an option, because the function calculates Shares row by row. It should look like:
def calcs(x):
    x.loc[some_condition, 'Shares'] = ...
    x.loc[other_condition, 'Shares'] = ...  # return 'Shares' from the previous 'Date' for this 'Symbol'

df = df.groupby('Date').apply(calcs)
Any help appreciated.
EDIT:
Here is the function I created:
Equity = 10000

def calcs(x):
    global Equity
    if x.index[0] == 0:
        return x
    x.loc[x['condition'] == True, 'Shares'] = np.floor((Equity * 0.02 / x['ATR']).astype(float))
    x.loc[x['condition'] == False, 'Shares'] = ...  # locate the Symbol for the previous Date and return its Shares value
    x['Closed_P/L'] = x['Shares'] * x['Close']
    Equity += x['Closed_P/L'].sum()
    return x

data = data.groupby('Date').apply(calcs)

IIUC something like this might help, although what you're trying to accomplish in the end is unclear. Below, the data is grouped by 'Symbol' and 'Date', and the 'diff' column shows the incremental P/L.
Please advise what you need additionally.
df = data.set_index(['Symbol', 'Date']).sort_index()[['Shares']]
df['diff'] = np.nan
idx = pd.IndexSlice
for ix in df.index.levels[0]:
    df.loc[idx[ix, :], 'diff'] = df.loc[idx[ix, :], 'Shares'].diff()
df.fillna(0)
Shares diff
Symbol Date
A 1990-01-01 0.0 0.0
1990-01-02 50.0 50.0
1990-01-03 11.0 -39.0
B 1990-01-01 0.0 0.0
1990-01-02 100.0 100.0
1990-01-03 123.0 23.0
C 1990-01-01 0.0 0.0
1990-01-02 66.0 66.0
1990-01-03 11.0 -55.0
D 1990-01-01 0.0 0.0
1990-01-02 7.0 7.0
1990-01-03 11.0 4.0
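A possible way around the shift restriction (a sketch, not from the original thread): process the Date groups in order and carry each Symbol's Shares forward in a plain dict, so a row-by-row calculation can read the previous Date's value. The Prev_Shares column name is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': ['1990-01-01'] * 4 + ['1990-01-02'] * 4 + ['1990-01-03'] * 4,
    'Symbol': list('ABCD') * 3,
    'Shares': [0.0, 0.0, 0.0, 0.0, 50.0, 100.0, 66.0, 7.0,
               11.0, 123.0, 11.0, 11.0],
})

prev_shares = {}  # Symbol -> Shares from the previous Date
parts = []
for date, grp in df.groupby('Date', sort=True):
    grp = grp.copy()
    # Look up last Date's value per Symbol (NaN on the first Date).
    grp['Prev_Shares'] = grp['Symbol'].map(prev_shares)
    # ... the row-by-row Shares calculation would go here and could
    # read grp['Prev_Shares'] freely ...
    prev_shares = dict(zip(grp['Symbol'], grp['Shares']))
    parts.append(grp)

result = pd.concat(parts)
```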

Related

In pandas, replace table column with Series while joining indexes

I have a table with preexisting columns, and I want to entirely replace some of those columns with values from a series. The tricky part is that each series has different indexes, and I need to add these varying indexes to the table as necessary, like a join/merge operation.
For example, this code generates a table and 5 series where each series only has a subset of the indexes.
import random
cols = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
table = pd.DataFrame(columns=cols)
series = []
for i in range(5):
    series.append(
        pd.Series(
            np.random.randint(0, 3, 2) * 10,
            index=pd.Index(random.sample(range(3), 2))
        )
    )
series
Output:
[1 10
2 0
dtype: int32,
2 0
0 20
dtype: int32,
2 20
1 0
dtype: int32,
2 0
0 10
dtype: int32,
1 20
2 10
dtype: int32]
But when I try to replace columns of the table with the series, a simple assignment doesn't work
for i in range(5):
    col = cols[i]
    table[col] = series[i]
table
Output:
a b c d e f g
1 10 NaN 0 NaN 20 NaN NaN
2 0 0 20 0 10 NaN NaN
because the assignment won't add any more indexes after the first series is assigned
Other things I've tried:
combine or combine_first gives the same result as above. (table[col] = table[col].combine(series[i], lambda a, b: b) and table[col] = series[i].combine_first(table[col]))
pd.concat doesn't work either because of duplicate labels (table[col] = pd.concat([table[col], series[i]]) gives ValueError: cannot reindex on an axis with duplicate labels) and I can't just drop the duplicates because other columns may already have values in those indexes
DataFrame.update won't work since it only takes indexes from the table (join='left'). I need to add indexes from the series to the table as necessary.
Of course, I can always do something like this:
table = table.join(series[i].rename('new'), how='outer')
table[col] = table.pop('new')
which gives the correct result:
a b c d e f g
0 NaN 20.0 NaN 10.0 NaN NaN NaN
1 10.0 NaN 0.0 NaN 20.0 NaN NaN
2 0.0 0.0 20.0 0.0 10.0 NaN NaN
But this is quite roundabout, and it still isn't robust to column-name collisions, so you'd need a handful more code to guard against those. That's verbose, ugly code for what is conceptually a very simple operation; I believe there must be a better way of doing it.
pd.concat should work along the column axis:
out = pd.concat(series, axis=1)
print(out)
# Output
0 1 2 3 4
0 10.0 0.0 0.0 NaN 10.0
1 NaN 10.0 NaN 0.0 20.0
2 0.0 NaN 0.0 0.0 NaN
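If the concatenated frame should carry the intended column labels rather than 0..4, relabelling afterwards is enough (a small sketch, assuming the same cols list as in the question):

```python
import random

import numpy as np
import pandas as pd

cols = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
series = [
    pd.Series(np.random.randint(0, 3, 2) * 10,
              index=pd.Index(random.sample(range(3), 2)))
    for _ in range(5)
]

out = pd.concat(series, axis=1)        # columns are labelled 0..4 by default
out.columns = cols[:len(out.columns)]  # relabel them 'a'..'e'
```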
You could try constructing the dataframe using a dict comprehension like this:
series:
[0 10
1 0
dtype: int64,
0 0
1 0
dtype: int64,
2 20
0 0
dtype: int64,
0 20
2 0
dtype: int64,
0 0
1 0
dtype: int64]
code:
table = pd.DataFrame({
    col: series[i]
    for i, col in enumerate(cols)
    if i < len(series)
})
table
output:
a b c d e
0 10.0 0.0 0.0 20.0 0.0
1 0.0 0.0 NaN NaN 0.0
2 NaN NaN 20.0 0.0 NaN
If you really need the nan columns at the end you could do:
table = pd.DataFrame({
    col: series[i] if i < len(series) else np.nan
    for i, col in enumerate(cols)
})
Output:
a b c d e f g
0 10.0 0.0 0.0 20.0 0.0 NaN NaN
1 0.0 0.0 NaN NaN 0.0 NaN NaN
2 NaN NaN 20.0 0.0 NaN NaN NaN
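Another option not shown above (a sketch): grow the table's index first with reindex, then rely on plain column assignment, which aligns on the index. This keeps all seven original columns, including the all-NaN f and g, and avoids the join/pop round trip.

```python
import random

import numpy as np
import pandas as pd

cols = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
table = pd.DataFrame(columns=cols)
series = [
    pd.Series(np.random.randint(0, 3, 2) * 10,
              index=pd.Index(random.sample(range(3), 2)))
    for _ in range(5)
]

for i, s in enumerate(series):
    # Add any indexes the table doesn't have yet, then assign;
    # column assignment aligns row labels automatically.
    table = table.reindex(table.index.union(s.index))
    table[cols[i]] = s
```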

Joining two pandas dataframes with multi-indexed columns

I want to join two pandas dataframes, one of which has multi-indexed columns.
This is how I make the first dataframe.
data_large = pd.DataFrame({"name":["a", "b", "c"], "sell":[10, 60, 50], "buy":[20, 30, 40]})
data_mini = pd.DataFrame({"name":["b", "c", "d"], "sell":[60, 20, 10], "buy":[30, 50, 40]})
data_topix = pd.DataFrame({"name":["a", "b", "c"], "sell":[10, 80, 0], "buy":[70, 30, 40]})
df_out = pd.concat([dfi.set_index('name') for dfi in [data_large, data_mini, data_topix]],
                   keys=['Large', 'Mini', 'Topix'], axis=1)\
           .rename_axis(mapper=['name'], axis=0).rename_axis(mapper=['product', 'buy_sell'], axis=1)
df_out
And this is the second dataframe.
group = pd.DataFrame({"name":["a", "b", "c", "d"], "group":[1, 1, 2, 2]})
group
How can I join the second to the first, on the column name, keeping the multi-indexed columns?
This did not work and it flattened the multi-index.
df_final = df_out.merge(group, on=['name'], how='left')
Any help would be appreciated!
To keep the MultiIndex after the merge, convert the group column to a MultiIndex DataFrame. Here the name column is converted to the index so the merge happens by index; otherwise the columns of both frames would have to be converted to MultiIndexes:
group = group.set_index('name')
group.columns = pd.MultiIndex.from_product([group.columns, ['new']])
df_final = df_out.merge(group, on=['name'], how='left')
Or:
df_final = df_out.merge(group, left_index=True, right_index=True, how='left')
print (df_final)
product Large Mini Topix group
buy_sell sell buy sell buy sell buy new
name
a 10.0 20.0 NaN NaN 10.0 70.0 1
b 60.0 30.0 60.0 30.0 80.0 30.0 1
c 50.0 40.0 20.0 50.0 0.0 40.0 2
d NaN NaN 10.0 40.0 NaN NaN 2
Another possible way, though it raises a warning, is to merge first and convert the columns to a MultiIndex afterwards:
df_final = df_out.merge(group, on=['name'], how='left')
UserWarning: merging between different levels can give an unintended result (2 levels on the left, 1 on the right)
warnings.warn(msg, UserWarning)
L = [x if isinstance(x, tuple) else (x, 'new') for x in df_final.columns.tolist()]
df_final.columns = pd.MultiIndex.from_tuples(L)
print (df_final)
name Large Mini Topix group
new sell buy sell buy sell buy new
0 a 10.0 20.0 NaN NaN 10.0 70.0 1
1 b 60.0 30.0 60.0 30.0 80.0 30.0 1
2 c 50.0 40.0 20.0 50.0 0.0 40.0 2
3 d NaN NaN 10.0 40.0 NaN NaN 2
EDIT: If you need group in the index MultiIndex:
group = group.set_index(['name'])
group.columns = pd.MultiIndex.from_product([group.columns, ['new']])
df_final = (df_out.merge(group, on=['name'], how='left')
                  .set_index([('group', 'new')], append=True)
                  .rename_axis(['name', 'group']))
print (df_final)
product Large Mini Topix
buy_sell sell buy sell buy sell buy
name group
a 1 10.0 20.0 NaN NaN 10.0 70.0
b 1 60.0 30.0 60.0 30.0 80.0 30.0
c 2 50.0 40.0 20.0 50.0 0.0 40.0
d 2 NaN NaN 10.0 40.0 NaN NaN
Or:
df_final = df_out.merge(group, on=['name'], how='left').set_index(['name','group'])
df_final.columns = pd.MultiIndex.from_tuples(df_final.columns)
print (df_final)
Large Mini Topix
sell buy sell buy sell buy
name group
a 1 10.0 20.0 NaN NaN 10.0 70.0
b 1 60.0 30.0 60.0 30.0 80.0 30.0
c 2 50.0 40.0 20.0 50.0 0.0 40.0
d 2 NaN NaN 10.0 40.0 NaN NaN
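A lighter alternative not in the answers above (a sketch): since group is a single value per name, it can be mapped straight into a new two-level column, skipping merge altogether.

```python
import pandas as pd

data_large = pd.DataFrame({"name": ["a", "b", "c"], "sell": [10, 60, 50], "buy": [20, 30, 40]})
data_mini = pd.DataFrame({"name": ["b", "c", "d"], "sell": [60, 20, 10], "buy": [30, 50, 40]})
data_topix = pd.DataFrame({"name": ["a", "b", "c"], "sell": [10, 80, 0], "buy": [70, 30, 40]})

df_out = pd.concat(
    [dfi.set_index('name') for dfi in [data_large, data_mini, data_topix]],
    keys=['Large', 'Mini', 'Topix'], axis=1,
).rename_axis(['product', 'buy_sell'], axis=1)

group = pd.DataFrame({"name": ["a", "b", "c", "d"], "group": [1, 1, 2, 2]})

# Map name -> group into a new ('group', 'new') column; the existing
# MultiIndex columns are untouched.
df_out[('group', 'new')] = df_out.index.map(group.set_index('name')['group'])
```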

Python: How replace non-zero values in a Pandas dataframe with values from a series

I have a dataframe 'A' with 3 columns and 4 rows (X1..X4). Some of the elements in 'A' are non-zero. I have another dataframe 'B' with 1 column and 4 rows (X1..X4). I would like to create a dataframe 'C' so that where 'A' has a nonzero value, it takes the value from the equivalent row in 'B'
I've tried a.where(a != 0, c), which is obviously wrong since c is not a scalar.
A = pd.DataFrame({'A':[1,6,0,0],'B':[0,0,1,0],'C':[1,0,3,0]},index=['X1','X2','X3','X4'])
B = pd.DataFrame({'A':{'X1':1.5,'X2':0.4,'X3':-1.1,'X4':5.2}})
These are the expected results:
C = pd.DataFrame({'A':[1.5,0.4,0,0],'B':[0,0,-1.1,0],'C':[1.5,0,-1.1,0]},index=['X1','X2','X3','X4'])
np.where():
If you want to assign back to A:
A[:]=np.where(A.ne(0),B,A)
For a new df:
final=pd.DataFrame(np.where(A.ne(0),B,A),columns=A.columns)
A B C
0 1.5 0.0 1.5
1 0.4 0.0 0.0
2 0.0 -1.1 -1.1
3 0.0 0.0 0.0
Usage of fillna:
A=A.mask(A.ne(0)).T.fillna(B.A).T
A
Out[105]:
A B C
X1 1.5 0.0 1.5
X2 0.4 0.0 0.0
X3 0.0 -1.1 -1.1
X4 0.0 0.0 0.0
Or
A=A.mask(A!=0,B.A,axis=0)
Out[111]:
A B C
X1 1.5 0.0 1.5
X2 0.4 0.0 0.0
X3 0.0 -1.1 -1.1
X4 0.0 0.0 0.0
Use:
A.mask(A!=0,B['A'],axis=0,inplace=True)
print(A)
A B C
X1 1.5 0.0 1.5
X2 0.4 0.0 0.0
X3 0.0 -1.1 -1.1
X4 0.0 0.0 0.0
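For reference, the asker's original where attempt was close: where keeps values where the condition is True and substitutes the other argument where it is False, so inverting the condition and passing axis=0 (so the Series aligns row-wise) gives the expected C (a sketch):

```python
import pandas as pd

A = pd.DataFrame({'A': [1, 6, 0, 0], 'B': [0, 0, 1, 0], 'C': [1, 0, 3, 0]},
                 index=['X1', 'X2', 'X3', 'X4'])
B = pd.DataFrame({'A': {'X1': 1.5, 'X2': 0.4, 'X3': -1.1, 'X4': 5.2}})

# Keep zeros (condition True); replace non-zeros with the row's B value.
C = A.where(A.eq(0), B['A'], axis=0)
```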

How do I sum each column based on a condition on another column, without iterating over the columns, in a pandas dataframe

I have a data frame as below:
Preg Glucose BloodPressure SkinThickness Insulin Outcome
0 1.0 85.0 66.0 29.0 0.0 0.0
1 8.0 183.0 64.0 0.0 0.0 0.0
2 1.0 89.0 66.0 23.0 94.0 1.0
3 0.0 137.0 40.0 35.0 168.0 1.0
4 5.0 116.0 74.0 0.0 0.0 1.0
I would like a pythonic way to sum each column separately based on a condition on one of the columns. I could do it by iterating over the df columns, but I'm sure there is a better way I'm not familiar with.
Specific to the data I have, I'd like to sum each column's values where the last column 'Outcome' equals 1. In the end, I should get:
Preg Glucose BloodPressure SkinThickness Insulin Outcome
0 6.0 342.0 180.0 58.0 262.0 0.0
Any ideas?
Here is a solution to get the expected output:
sum_df = df.loc[df.Outcome == 1.0].sum().to_frame().T
sum_df.Outcome = 0.0
Output:
Preg Glucose BloodPressure SkinThickness Insulin Outcome
0 6.0 342.0 180.0 58.0 262.0 0.0
Documentation:
loc: access a group of rows / columns by labels or boolean array
sum: sum by default over all columns and return a Series indexed by the columns.
to_frame: convert a Series to a DataFrame.
.T: the transpose accessor; transposes the DataFrame.
Use np.where:
df1[np.where(df1['Outcome'] == 1,True,False)].sum().to_frame().T
Output
Preg Glucose BloodPressure SkinThickness Insulin Outcome
0 6.0 342.0 180.0 58.0 262.0 3.0
Will these work for you?
df1.loc[~(df1['Outcome'] == 0)].groupby('Outcome').agg('sum').reset_index()
or
df1.loc[df1.Outcome == 1.0].sum().to_frame().T

How to select and calculate with value from specific variable in dataframe with pandas

I am running the code below and get this:
import pandas as pd
pf=pd.read_csv("https://www.dropbox.com/s/08kuxi50d0xqnfc/demo.csv?dl=1")
x=pf[pf['fuv1'] == 0].count()*100/1892
x
id 0.528541
date 0.528541
count 0.528541
idade 0.528541
site 0.528541
baseline 0.528541
fuv1 0.528541
fuv2 0.475687
fuv3 0.528541
fuv4 0.475687
dtype: float64
What I want is just to get the result 0.528541 and discard all the other results.
What should I do?
Thanks.
If you want to count the number of 0 values in column fuv1, use sum to count the Trues, which are treated as 1s:
print ((pf['fuv1'] == 0).sum())
10
x = (pf['fuv1'] == 0).sum()*100/1892
print (x)
0.528541226216
Explanation of why the outputs differ: count excludes NaNs:
pf=pd.read_csv("https://www.dropbox.com/s/08kuxi50d0xqnfc/demo.csv?dl=1")
x=pf[pf['fuv1'] == 0]
print (x)
id date count idade site baseline fuv1 fuv2 fuv3 fuv4
0 0 4/1/2016 10 13 A 1 0.0 1.0 0.0 1.0
2 2 4/3/2016 9 5 C 1 0.0 NaN 0.0 1.0
3 3 4/4/2016 108 96 D 1 0.0 1.0 0.0 NaN
11 11 4/12/2016 6 13 C 1 0.0 1.0 1.0 0.0
13 13 4/14/2016 12 4 C 1 0.0 1.0 1.0 0.0
40 40 5/11/2016 14 7 C 1 0.0 1.0 1.0 1.0
41 41 5/12/2016 0 26 C 1 0.0 1.0 1.0 1.0
42 42 5/13/2016 10 15 C 1 0.0 1.0 1.0 1.0
60 60 5/31/2016 13 3 D 1 0.0 1.0 1.0 1.0
74 74 6/14/2016 15 7 B 1 0.0 1.0 1.0 1.0
print (x.count())
id 10
date 10
count 10
idade 10
site 10
baseline 10
fuv1 10
fuv2 9
fuv3 10
fuv4 9
dtype: int64
In [282]: pf.loc[pf['fuv1'] == 0, 'id'].count()*100/1892
Out[282]: 0.5285412262156448
import pandas as pd
pf = pd.read_csv("https://www.dropbox.com/s/08kuxi50d0xqnfc/demo.csv?dl=1")
x = (pf['fuv1'] == 0).sum() * 100 / 1892
y = pf["idade"].mean()
l = "Performance"
k = "LTFU"

def test(l1, k1):
    return pd.DataFrame({'a': [l1, k1], 'b': [x, y]})

df1 = test(l, k)
df1.columns = [''] * len(df1.columns)
df1.index = [''] * len(df1.index)
print(round(df1, 2))
Performance 0.53
LTFU 14.13
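As a footnote (a sketch; assumes the full file really has 1892 rows): the mean of a boolean mask is the fraction of True values, so the percentage can be computed in one step. As with sum, NaNs in fuv1 simply compare unequal to 0 and count as False. A tiny stand-in frame is used here instead of the linked CSV.

```python
import pandas as pd

# Stand-in for the real CSV: 5 rows, two zeros, one missing value.
pf = pd.DataFrame({'fuv1': [0.0, 1.0, 0.0, None, 1.0]})

# mean() of the boolean mask = count of zeros / number of rows.
pct = pf['fuv1'].eq(0).mean() * 100
```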