   as_of_date  industry sector deal  year quarter stage  amount  yield
0  2022-01-01  Mortgage   RMBS  XYZ  2022     NaN     A     111    0.1
1  2022-01-01  Mortgage   RMBS  XYZ  2022       1     A     222    0.2
2  2022-01-01  Mortgage   RMBS  XYZ  2022       2     A     333    0.3
3  2022-01-01  Mortgage   RMBS  XYZ  2022       3     A     444    0.4
4  2022-01-01  Mortgage   RMBS  XYZ  2022       4     A     555    0.5
5  2022-01-01  Mortgage   RMBS  XYZ  2022     NaN     B     123    0.6
6  2022-01-01  Mortgage   RMBS  XYZ  2022       1     B     234    0.7
7  2022-01-01  Mortgage   RMBS  XYZ  2022       2     B     345    0.8
8  2022-01-01  Mortgage   RMBS  XYZ  2022       3     B     456    0.9
9  2022-01-01  Mortgage   RMBS  XYZ  2022       4     B     567    1.0
For each group (as_of_date, industry, sector, deal, year, stage), I need to display all the amounts and yields in one line
I have tried this -
df.groupby(['as_of_date', 'industry', 'sector', 'deal', 'year', 'stage'])['amount', 'yield' ].apply(lambda df: df.reset_index(drop=True)).unstack().reset_index()
but this is not working correctly.
Basically, I need this as output rows -
2022-01-01 Mortgage RMBS XYZ 2022 A 111 222 333 444 555 0.1 0.2 0.3 0.4 0.5
2022-01-01 Mortgage RMBS XYZ 2022 B 123 234 345 456 567 0.6 0.7 0.8 0.9 1.0
What would be the correct way to achieve this with Pandas? Thank you
This can be done by first building a list for each column, combining them with +, converting the result to a string, and stripping the [, ] and , characters:
df1 = df.groupby(['as_of_date', 'industry', 'sector', 'deal', 'year', 'stage']).apply(
lambda x: str(list(x['amount']) + list(x['yield']))[1:-1].replace(",", ""))
df1
#Out:
#as_of_date industry sector deal year stage
#2022-01-01 Mortgage RMBS XYZ 2022 A 111 222 333 444 555 0.1 0.2 0.3 0.4 0.5
# B 123 234 345 456 567 0.6 0.7 0.8 0.9 1.0
Maybe this? Note that ' '.join only works on strings, so the numeric columns need converting first:
df.astype(str).groupby(['as_of_date', 'industry', 'sector', 'deal', 'year', 'stage'])[['amount', 'yield']].agg(' '.join).reset_index()
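A runnable sketch of this ' '.join idea on a cut-down version of the data (quarter omitted for brevity; the numeric columns are cast to strings first, since join needs strings):

```python
import pandas as pd

# Cut-down version of the example data.
df = pd.DataFrame({
    "as_of_date": ["2022-01-01"] * 4,
    "industry": ["Mortgage"] * 4,
    "sector": ["RMBS"] * 4,
    "deal": ["XYZ"] * 4,
    "year": [2022] * 4,
    "stage": ["A", "A", "B", "B"],
    "amount": [111, 222, 123, 234],
    "yield": [0.1, 0.2, 0.6, 0.7],
})

# Cast everything to str so ' '.join can concatenate the values per group.
out = (df.astype(str)
         .groupby(["as_of_date", "industry", "sector", "deal", "year", "stage"])[["amount", "yield"]]
         .agg(" ".join)
         .reset_index())
```

Each group then collapses to one row with space-separated amounts and yields.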
does this answer your question?
df2 = df.pivot(index=['as_of_date','industry','sector','deal','year', 'stage'], columns=['quarter']).reset_index()
to flatten the columns names
df2.columns = df2.columns.to_series().str.join('_')
df2
as_of_date_ industry_ sector_ deal_ year_ stage_ amount_1 amount_2 amount_3 amount_4 amount_NaN amount_Nan yield_1 yield_2 yield_3 yield_4 yield_NaN yield_Nan
0 2022-01-01 Mortgage RMBS XYZ 2022 A 222.0 333.0 444.0 555.0 111.0 NaN 0.2 0.3 0.4 0.5 0.1 NaN
1 2022-01-01 Mortgage RMBS XYZ 2022 B 234.0 345.0 456.0 567.0 NaN 123.0 0.7 0.8 0.9 1.0 NaN 0.6
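One caveat with to_series().str.join('_'): it can produce NaN column names when the quarter level is not all strings. A minimal sketch (made-up data) of flattening the pivoted MultiIndex with an f-string instead, which works for any level dtype:

```python
import pandas as pd

# Minimal frame whose quarter level is integer-valued, which is where
# '_'.join on the raw column tuples would fail.
df = pd.DataFrame({
    "key": ["A", "A", "B", "B"],
    "quarter": [1, 2, 1, 2],
    "amount": [111, 222, 123, 234],
})

wide = df.pivot(index="key", columns="quarter")

# f-string formatting flattens each (value, quarter) tuple regardless of dtype.
wide.columns = [f"{a}_{b}" for a, b in wide.columns]
wide = wide.reset_index()
```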
I have stock prices in a dataframe called 'stock_data' as shown here:
stock_data = pd.DataFrame(np.random.rand(5,4)*100, index=pd.date_range(start='1/1/2022', periods=5),columns = "A B C D".split())
stock_data
``
A B C D
2022-01-01 50.499862 65.011650 91.563112 45.107004
2022-01-02 53.218393 86.534942 54.575897 28.154673
2022-01-03 96.827564 49.782633 19.894127 47.529094
2022-01-04 18.226396 27.908952 67.141263 66.101363
2022-01-05 1.061750 29.833253 94.161190 85.542529
``
I have the currency of each stock in a series called 'currency_list'. Note the index is the same as the column names in stock_data for reference
currency_list=pd.Series(['USD','CAD','EUR','CHF'], index="A B C D".split())
currency_list
``
A USD
B CAD
C EUR
D CHF
dtype: object
``
I have the exchange rates here in a dataframe called 'forex_data'
forex_data = pd.DataFrame(np.random.rand(5,3), index=pd.date_range(start='1/1/2022', periods=5), columns = "USD CAD EUR".split())
forex_data
``
USD CAD EUR
2022-01-01 0.194238 0.996759 0.900205
2022-01-02 0.366476 0.054540 0.474838
2022-01-03 0.709269 0.723097 0.655717
2022-01-04 0.557701 0.878100 0.824146
2022-01-05 0.865796 0.432785 0.222463
``
Now I want to convert the prices to my base currency (let's say CHF) by the following logic -
the 2022-01-01 price of stock A is 50.499 * 0.194, and so forth.
I am stuck and don't know what to do - could someone help?
Example
import pandas as pd
import numpy as np
df1 = pd.DataFrame(np.random.randint(5, 20,(5,4)), index=pd.date_range(start='1/1/2022', periods=5),columns = list("ABCD"))
s1 = pd.Series(['USD','CAD','EUR','CHF'], index=list("ABCD"))
df2 = pd.DataFrame(np.random.randint(10,40, (5, 3)) / 10, index=pd.date_range(start='1/1/2022', periods=5),columns = "USD CAD EUR".split())
df1
A B C D
2022-01-01 8 19 6 12
2022-01-02 15 8 18 6
2022-01-03 9 11 14 17
2022-01-04 17 13 17 17
2022-01-05 11 12 10 19
s1
A USD
B CAD
C EUR
D CHF
dtype: object
df2
USD CAD EUR
2022-01-01 2.7 1.0 1.4
2022-01-02 3.6 3.1 1.2
2022-01-03 2.7 2.1 1.0
2022-01-04 3.8 2.4 3.6
2022-01-05 2.0 1.6 3.6
Code
Map the columns of df1 to their currencies with s1, then use mul:
out = df1.set_axis(df1.columns.map(s1), axis=1).mul(df2).reindex(df2.columns, axis=1)
out
USD CAD EUR
2022-01-01 21.6 19.0 8.4
2022-01-02 54.0 24.8 21.6
2022-01-03 24.3 23.1 14.0
2022-01-04 64.6 31.2 61.2
2022-01-05 22.0 19.2 36.0
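A small deterministic sketch of the same set_axis + mul pattern, with made-up prices and rates, so the alignment is easy to check by hand:

```python
import pandas as pd

# Made-up prices, currency labels, and rates.
df1 = pd.DataFrame({"A": [10, 20], "B": [30, 40]},
                   index=pd.date_range("2022-01-01", periods=2))
s1 = pd.Series({"A": "USD", "B": "CAD"})
df2 = pd.DataFrame({"USD": [2.0, 3.0], "CAD": [1.5, 0.5]},
                   index=pd.date_range("2022-01-01", periods=2))

# Relabel df1's columns with their currency, then multiply; the date
# index and currency columns align automatically.
out = df1.set_axis(df1.columns.map(s1), axis=1).mul(df2)
```

Here A's 2022-01-01 price 10 times its USD rate 2.0 gives 20.0, and so on cell by cell.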
Sorry for my naive question, but I can't solve this. Any reference or solution?
df1 =
date a b c
0 2011-12-30 100 400 700
1 2021-01-30 200 500 800
2 2021-07-30 300 600 900
df2 =
date c b
0 2021-07-30 NaN NaN
1 2021-01-30 NaN NaN
2 2011-12-30 NaN NaN
desired output:
date c b
0 2021-07-30 900 600
1 2021-01-30 800 500
2 2011-12-30 700 400
Use DataFrame.fillna after converting date to the index in both DataFrames:
df = df2.set_index('date').fillna(df1.set_index('date')).reset_index()
print (df)
date c b
0 2021-07-30 900.0 600.0
1 2021-01-30 800.0 500.0
2 2011-12-30 700.0 400.0
You can reindex_like df2 after setting date as a temporary index:
out = df1.set_index('date').reindex_like(df2.set_index('date')).reset_index()
output:
date c b
0 2021-07-30 900 600
1 2021-01-30 800 500
2 2011-12-30 700 400
Another possible solution, using pandas.DataFrame.update:
df2 = df2.set_index('date')
df2.update(df1.set_index('date'))
df2.reset_index()
Output:
date c b
0 2021-07-30 900.0 600.0
1 2021-01-30 800.0 500.0
2 2011-12-30 700.0 400.0
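For reference, a self-contained sketch of the fillna approach, rebuilding the question's frames inline:

```python
import pandas as pd
import numpy as np

# The frames from the question.
df1 = pd.DataFrame({"date": ["2011-12-30", "2021-01-30", "2021-07-30"],
                    "a": [100, 200, 300],
                    "b": [400, 500, 600],
                    "c": [700, 800, 900]})
df2 = pd.DataFrame({"date": ["2021-07-30", "2021-01-30", "2011-12-30"],
                    "c": [np.nan] * 3,
                    "b": [np.nan] * 3})

# Align both frames on date so fillna can pull the matching values
# row by row; df1's extra column 'a' is simply ignored.
out = df2.set_index("date").fillna(df1.set_index("date")).reset_index()
```

Note df2's row order is preserved, which is why reindex_like and update give the same result here.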
Following df:
appid tag totalvalue
0 1234 B 50.00
1 1234 BA 10.00
2 2345 B 100.00
3 2345 BA 25.00
4 2345 BCS 15.00
We want to group the df by appid and, within each group, divide each tag's totalvalue by the totalvalue of the tag='B' row, like so:
appid tag total %tage(B)
0 1234 B 50.00 1
1 1234 BA 10.00 0.2
2 2345 B 100.00 1
3 2345 BA 25.00 0.25
4 2345 BCS 15.00 0.15
You can use groupby:
gmax = df['totalvalue'].where(df['tag'] == 'B').groupby(df['appid']).transform('max')
df['%tage(B)'] = df['totalvalue'] / gmax
print(df)
# Output
appid tag totalvalue %tage(B)
0 1234 B 50.0 1.00
1 1234 BA 10.0 0.20
2 2345 B 100.0 1.00
3 2345 BA 25.0 0.25
4 2345 BCS 15.0 0.15
I have a df as follows:
appid month tag totalvalue
0 1234 02-'22 B 50.00
1 1234 02-'22 BA 10.00
2 1234 01-'22 B 100.00
3 2345 03-'22 BA 25.00
4 2345 03-'22 B 100.00
5 2345 04-'22 BB 100.00
Output what I want is follows:
appid month tag totalvalue %tage
0 1234 02-'22 B 50.00 1.0
1 1234 02-'22 BA 10.00 0.2
2 1234 01-'22 B 100.00 1.0
3 2345 03-'22 BA 25.00 0.25
4 2345 03-'22 B 100.00 1.0
5 2345 04-'22 BB 100.00 inf
I want to group by appid & month. Within each group, if a tag='B' row is available, divide the other tags' totalvalue by it; if not, show inf.
I have tried df.groupby(['appid', 'month'])['totalvalue'] but was unable to apply the tag='B' denominator condition over the groupby object.
IIUC, you can use a groupby.transform('first') on the masked totalvalue, then use it as a divider:
m = df['tag'].eq('B')
df['%tage'] = (df['totalvalue']
.div(df['totalvalue'].where(m)
.groupby([df['appid'], df['month']])
.transform('first').fillna(0))
)
output:
appid month tag totalvalue %tage
0 1234 02-'22 B 50.0 1.00
1 1234 02-'22 BA 10.0 0.20
2 1234 01-'22 B 100.0 1.00
3 2345 03-'22 BA 25.0 0.25
4 2345 03-'22 B 100.0 1.00
5 2345 04-'22 BB 100.0 inf
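A runnable sketch of this masking + transform('first') pattern on a cut-down frame with made-up values; the fillna(0) is what turns groups without a 'B' row into inf:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "appid": [1234, 1234, 2345, 2345],
    "month": ["02-'22", "02-'22", "04-'22", "04-'22"],
    "tag": ["B", "BA", "BB", "BC"],
    "totalvalue": [50.0, 10.0, 100.0, 20.0],
})

# Keep totalvalue only on tag == 'B' rows, broadcast that value across
# each (appid, month) group, and fill 0 where the group has no 'B' row
# so the division yields inf.
b_value = (df["totalvalue"].where(df["tag"].eq("B"))
             .groupby([df["appid"], df["month"]])
             .transform("first")
             .fillna(0))
df["%tage"] = df["totalvalue"].div(b_value)
```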
I have orders_df:
Symbol Order Shares
Date
2011-01-10 AAPL BUY 1500
2011-01-13 AAPL SELL 1500
2011-01-13 IBM BUY 4000
2011-01-26 GOOG BUY 1000
2011-02-02 XOM SELL 4000
2011-02-10 XOM BUY 4000
2011-03-03 GOOG SELL 1000
2011-03-03 IBM SELL 2200
2011-05-03 IBM BUY 1500
2011-06-03 IBM SELL 3300
2011-08-01 GOOG BUY 55
2011-08-01 GOOG SELL 55
I want to have a variable that maps Date to the number of SELLs on that date. I also want a symmetric variable for BUYs.
I tried doing it for all Orders by doing
num_orders_per_day = orders_df.groupby(['Date']).size()
and got:
Date
2011-01-10 1
2011-01-13 2
2011-01-26 1
2011-02-02 1
2011-02-10 1
2011-03-03 2
2011-05-03 1
2011-06-03 1
2011-08-01 2
but that is not the desired output.
What I want is sells_on_a_day:
2011-01-13 1
2011-02-02 1
2011-03-03 2
2011-06-03 1
2011-08-01 1
and then a similar buys_on_a_day variable.
First filter by boolean indexing and then get count:
num_sells_per_day = (orders_df[orders_df['Order'] == 'SELL']
                     .groupby(level=0).size().reset_index(name='count'))
print (num_sells_per_day)
Date count
0 2011-01-13 1
1 2011-02-02 1
2 2011-03-03 2
3 2011-06-03 1
4 2011-08-01 1
Alternative:
num_sells_per_day = (orders_df.query("Order == 'SELL'")
                     .groupby(level=0)
                     .size()
                     .reset_index(name='count'))
print (num_sells_per_day)
Date count
0 2011-01-13 1
1 2011-02-02 1
2 2011-03-03 2
3 2011-06-03 1
4 2011-08-01 1
It is also possible to create the 2 columns together; you only get NaNs if some values are missing:
df1 = orders_df.groupby(['Date','Order']).size().unstack()
print (df1)
Order BUY SELL
Date
2011-01-10 1.0 NaN
2011-01-13 1.0 1.0
2011-01-26 1.0 NaN
2011-02-02 NaN 1.0
2011-02-10 1.0 NaN
2011-03-03 NaN 2.0
2011-05-03 1.0 NaN
2011-06-03 NaN 1.0
2011-08-01 1.0 1.0
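If the NaNs are unwanted, unstack accepts fill_value; a minimal sketch on a cut-down orders frame:

```python
import pandas as pd

orders_df = pd.DataFrame({
    "Date": ["2011-01-10", "2011-01-13", "2011-01-13", "2011-03-03"],
    "Order": ["BUY", "SELL", "BUY", "SELL"],
}).set_index("Date")

# 'Date' is resolved as an index level name; fill_value=0 keeps the
# counts as integers instead of NaN-padded floats.
counts = (orders_df.groupby(["Date", "Order"])
            .size()
            .unstack(fill_value=0))
```

counts['SELL'] and counts['BUY'] are then the two per-day series the question asks for.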