Given the following df:
   appid  tag  totalvalue
0   1234    B       50.00
1   1234   BA       10.00
2   2345    B      100.00
3   2345   BA       25.00
4   2345  BCS       15.00
We want to group the df by appid and, within each group, divide each row's totalvalue by the totalvalue of the tag='B' row, like so:
   appid  tag  totalvalue  %tage(B)
0   1234    B       50.00      1.00
1   1234   BA       10.00      0.20
2   2345    B      100.00      1.00
3   2345   BA       25.00      0.25
4   2345  BCS       15.00      0.15
You can use groupby:
gmax = df['totalvalue'].where(df['tag'] == 'B').groupby(df['appid']).transform('max')
df['%tage(B)'] = df['totalvalue'] / gmax
print(df)
# Output
   appid  tag  totalvalue  %tage(B)
0   1234    B        50.0      1.00
1   1234   BA        10.0      0.20
2   2345    B       100.0      1.00
3   2345   BA        25.0      0.25
4   2345  BCS        15.0      0.15
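A self-contained version of the above, rebuilding the DataFrame from the question's data:

```python
import pandas as pd

# Rebuild the question's DataFrame
df = pd.DataFrame({
    'appid': [1234, 1234, 2345, 2345, 2345],
    'tag': ['B', 'BA', 'B', 'BA', 'BCS'],
    'totalvalue': [50.0, 10.0, 100.0, 25.0, 15.0],
})

# Mask keeps only the tag=='B' totalvalue per row; transform('max')
# broadcasts that single value to every row of its appid group
gmax = df['totalvalue'].where(df['tag'] == 'B').groupby(df['appid']).transform('max')
df['%tage(B)'] = df['totalvalue'] / gmax
print(df)
```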
I have a df as follows:
   appid   month  tag  totalvalue
0   1234  02-'22    B       50.00
1   1234  02-'22   BA       10.00
2   1234  01-'22    B      100.00
3   2345  03-'22   BA       25.00
4   2345  03-'22    B      100.00
5   2345  04-'22   BB      100.00
The output I want is as follows:
   appid   month  tag  totalvalue  %tage
0   1234  02-'22    B       50.00   1.00
1   1234  02-'22   BA       10.00   0.20
2   1234  01-'22    B      100.00   1.00
3   2345  03-'22   BA       25.00   0.25
4   2345  03-'22    B      100.00   1.00
5   2345  04-'22   BB      100.00    inf
I want to group by appid & month. Within each group, if a row with tag='B' exists, I want to divide the other tags' totalvalue by its totalvalue; if not, show inf.
I have tried df.groupby(['appid', 'month'])['totalvalue'], but I am unable to apply the tag='B' condition as the denominator over the groupby object.
IIUC, you can use a groupby.transform('first') on the masked totalvalue, then use it as the divisor:
m = df['tag'].eq('B')
df['%tage'] = df['totalvalue'].div(
    df['totalvalue']
    .where(m)
    .groupby([df['appid'], df['month']])
    .transform('first')
    .fillna(0)
)
Output:

   appid   month  tag  totalvalue  %tage
0   1234  02-'22    B        50.0   1.00
1   1234  02-'22   BA        10.0   0.20
2   1234  01-'22    B       100.0   1.00
3   2345  03-'22   BA        25.0   0.25
4   2345  03-'22    B       100.0   1.00
5   2345  04-'22   BB       100.0    inf
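A self-contained version of the above, rebuilding the DataFrame from the question's data (dividing by the filled-in 0 is what produces inf for groups without a tag='B' row):

```python
import pandas as pd

df = pd.DataFrame({
    'appid': [1234, 1234, 1234, 2345, 2345, 2345],
    'month': ["02-'22", "02-'22", "01-'22", "03-'22", "03-'22", "04-'22"],
    'tag': ['B', 'BA', 'B', 'BA', 'B', 'BB'],
    'totalvalue': [50.0, 10.0, 100.0, 25.0, 100.0, 100.0],
})

m = df['tag'].eq('B')
# Per (appid, month) group: the first non-null masked value is the 'B' row's
# totalvalue; groups without a 'B' row get NaN, filled with 0 -> division gives inf
denom = (df['totalvalue'].where(m)
         .groupby([df['appid'], df['month']])
         .transform('first')
         .fillna(0))
df['%tage'] = df['totalvalue'].div(denom)
print(df)
```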
  as_of_date  industry sector deal  year quarter stage  amount  yield
0 2022-01-01  Mortgage   RMBS  XYZ  2022     NaN     A     111    0.1
1 2022-01-01  Mortgage   RMBS  XYZ  2022       1     A     222    0.2
2 2022-01-01  Mortgage   RMBS  XYZ  2022       2     A     333    0.3
3 2022-01-01  Mortgage   RMBS  XYZ  2022       3     A     444    0.4
4 2022-01-01  Mortgage   RMBS  XYZ  2022       4     A     555    0.5
5 2022-01-01  Mortgage   RMBS  XYZ  2022     Nan     B     123    0.6
6 2022-01-01  Mortgage   RMBS  XYZ  2022       1     B     234    0.7
7 2022-01-01  Mortgage   RMBS  XYZ  2022       2     B     345    0.8
8 2022-01-01  Mortgage   RMBS  XYZ  2022       3     B     456    0.9
9 2022-01-01  Mortgage   RMBS  XYZ  2022       4     B     567    1.0
For each group (as_of_date, industry, sector, deal, year, stage), I need to display all the amounts and yields in one line.
I have tried this -
df.groupby(['as_of_date', 'industry', 'sector', 'deal', 'year', 'stage'])['amount', 'yield'].apply(lambda df: df.reset_index(drop=True)).unstack().reset_index()
but this is not working correctly.
Basically, I need this as output rows -
2022-01-01 Mortgage RMBS XYZ 2022 A 111 222 333 444 555 0.1 0.2 0.3 0.4 0.5
2022-01-01 Mortgage RMBS XYZ 2022 B 123 234 345 456 567 0.6 0.7 0.8 0.9 1.0
What would be the correct way to achieve this with Pandas? Thank you
This can be calculated by creating a list for each column first, combining them (using +), then turning the result into a string and removing the `[`, `]`, and `,` characters:
df1 = df.groupby(['as_of_date', 'industry', 'sector', 'deal', 'year', 'stage']).apply(
    lambda x: str(list(x['amount']) + list(x['yield']))[1:-1].replace(",", ""))
df1
#Out:
#as_of_date industry sector deal year stage
#2022-01-01 Mortgage RMBS XYZ 2022 A 111 222 333 444 555 0.1 0.2 0.3 0.4 0.5
# B 123 234 345 456 567 0.6 0.7 0.8 0.9 1.0
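A runnable sketch of this approach on a trimmed-down version of the data (only two rows per stage, names taken from the question):

```python
import pandas as pd

df = pd.DataFrame({
    'as_of_date': ['2022-01-01'] * 4,
    'industry': ['Mortgage'] * 4,
    'sector': ['RMBS'] * 4,
    'deal': ['XYZ'] * 4,
    'year': [2022] * 4,
    'stage': ['A', 'A', 'B', 'B'],
    'amount': [111, 222, 123, 234],
    'yield': [0.1, 0.2, 0.6, 0.7],
})

keys = ['as_of_date', 'industry', 'sector', 'deal', 'year', 'stage']
# Concatenate amounts and yields per group, stringify the list,
# then strip the brackets and commas to get space-separated values
df1 = df.groupby(keys).apply(
    lambda x: str(list(x['amount']) + list(x['yield']))[1:-1].replace(",", ""))
print(df1)
```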
Maybe this (casting the numeric columns to strings first, since ' '.join only accepts strings)?
df.groupby(['as_of_date', 'industry', 'sector', 'deal', 'year', 'stage']).agg(lambda s: ' '.join(s.astype(str))).reset_index()
Does this answer your question?
df2 = df.pivot(index=['as_of_date', 'industry', 'sector', 'deal', 'year', 'stage'],
               columns=['quarter']).reset_index()
To flatten the column names:
df2.columns = df2.columns.to_series().str.join('_')
df2
as_of_date_ industry_ sector_ deal_ year_ stage_ amount_1 amount_2 amount_3 amount_4 amount_NaN amount_Nan yield_1 yield_2 yield_3 yield_4 yield_NaN yield_Nan
0 2022-01-01 Mortgage RMBS XYZ 2022 A 222.0 333.0 444.0 555.0 111.0 NaN 0.2 0.3 0.4 0.5 0.1 NaN
1 2022-01-01 Mortgage RMBS XYZ 2022 B 234.0 345.0 456.0 567.0 NaN 123.0 0.7 0.8 0.9 1.0 NaN 0.6
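A self-contained sketch of the pivot approach, using a trimmed version of the data with string quarter labels as in the question:

```python
import pandas as pd

df = pd.DataFrame({
    'as_of_date': ['2022-01-01'] * 4,
    'industry': ['Mortgage'] * 4,
    'sector': ['RMBS'] * 4,
    'deal': ['XYZ'] * 4,
    'year': [2022] * 4,
    'quarter': ['1', '2', '1', '2'],
    'stage': ['A', 'A', 'B', 'B'],
    'amount': [222, 333, 234, 345],
    'yield': [0.2, 0.3, 0.7, 0.8],
})

# Spread amount/yield across one column per quarter, one row per group
df2 = df.pivot(index=['as_of_date', 'industry', 'sector', 'deal', 'year', 'stage'],
               columns='quarter').reset_index()
# Column labels are now tuples like ('amount', '1'); join them into flat strings
df2.columns = df2.columns.to_series().str.join('_')
print(df2)
```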
I have a dataframe as following:
Company       Date  relTweet  GaplastRel
XYZ       3/2/2020         1
XYZ       3/3/2020         1
XYZ       3/4/2020         1
XYZ       3/5/2020         1
XYZ       3/5/2020         0
XYZ       3/6/2020         1
XYZ       3/8/2020         1
ABC       3/9/2020         0
ABC      3/10/2020         1
ABC      3/11/2020         0
ABC      3/12/2020         1
The relTweet displays whether the tweet is relevant (1) or not (0).
I need to find the days difference (GaplastRel) between each row and the most recent previous row for the same company whose tweet was relevant (relTweet = 1). E.g. for the first record GaplastRel should be 0; for the 2nd record, GaplastRel should be 1, as the last relevant tweet was made one day ago.
Below is the example of needed output:
Company       Date  relTweet  GaplastRel
XYZ       3/2/2020         1           0
XYZ       3/3/2020         1           1
XYZ       3/4/2020         1           1
XYZ       3/5/2020         1           1
XYZ       3/5/2020         0           1
XYZ       3/6/2020         1           1
XYZ       3/8/2020         1           2
ABC       3/9/2020         0           0
ABC      3/10/2020         1           0
ABC      3/11/2020         0           1
ABC      3/12/2020         1           2
Following is my code:
dataDf['Date'] = pd.to_datetime(dataDf['Date'], format='%m/%d/%Y')
dataDf['GaplastRel'] = (dataDf.groupby('Company', group_keys=False)
                        .apply(lambda g: g['Date'].diff().replace(0, np.nan).ffill()))
This code gives the days difference between successive rows for each company without considering the relTweet = 1 condition. I am not sure how to apply the condition.
Following is the output of the above code:
Company       Date  relTweet  GaplastRel
XYZ       3/2/2020         1         NaT
XYZ       3/3/2020         1      1 days
XYZ       3/4/2020         1      1 days
XYZ       3/5/2020         1      1 days
XYZ       3/5/2020         0      0 days
XYZ       3/6/2020         1      1 days
XYZ       3/8/2020         1      2 days
ABC       3/9/2020         0         NaT
ABC      3/10/2020         1      1 days
ABC      3/11/2020         0      1 days
ABC      3/12/2020         1      1 days
Sometimes we need merge_asof rather than groupby:
df1 = df.loc[df['relTweet'] == 1, ['Company', 'Date']]
df = pd.merge_asof(df, df1.assign(Date1=df1.Date), by='Company', on='Date',
                   allow_exact_matches=False)
df['GaplastRel'] = (df.Date - df.Date1).dt.days.fillna(0)
df
Out[31]:
Company Date relTweet Date1 GaplastRel
0 XYZ 2020-03-02 1 NaT 0.0
1 XYZ 2020-03-03 1 2020-03-02 1.0
2 XYZ 2020-03-04 1 2020-03-03 1.0
3 XYZ 2020-03-05 1 2020-03-04 1.0
4 XYZ 2020-03-05 0 2020-03-04 1.0
5 XYZ 2020-03-06 1 2020-03-05 1.0
6 XYZ 2020-03-08 1 2020-03-06 2.0
7 ABC 2020-03-09 0 NaT 0.0
8 ABC 2020-03-10 1 NaT 0.0
9 ABC 2020-03-11 0 2020-03-10 1.0
10 ABC 2020-03-12 1 2020-03-10 2.0
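A runnable version of the merge_asof approach, rebuilding the DataFrame from the question (note merge_asof requires both frames to be sorted on the on key, which they are here):

```python
import pandas as pd

df = pd.DataFrame({
    'Company': ['XYZ'] * 7 + ['ABC'] * 4,
    'Date': pd.to_datetime(['3/2/2020', '3/3/2020', '3/4/2020', '3/5/2020',
                            '3/5/2020', '3/6/2020', '3/8/2020',
                            '3/9/2020', '3/10/2020', '3/11/2020', '3/12/2020'],
                           format='%m/%d/%Y'),
    'relTweet': [1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1],
})

# Dates of relevant tweets only; merge_asof attaches the most recent one
# strictly before each row's date (allow_exact_matches=False)
df1 = df.loc[df['relTweet'] == 1, ['Company', 'Date']]
df = pd.merge_asof(df, df1.assign(Date1=df1.Date), by='Company', on='Date',
                   allow_exact_matches=False)
df['GaplastRel'] = (df.Date - df.Date1).dt.days.fillna(0)
print(df)
```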
I have a table t with columns date, ticker, open, high, low, close.
declare #t table
(
[Datecol] date,
Ticker varchar(10),
[open] decimal (10,2),
[high] decimal (10,2),
[low] decimal (10,2),
[close] decimal(10,2)
)
insert into #t values
('20180215', 'ABC', '122.01', '125.76', '118.79' , '123.29')
,('20180216', 'ABC', '123.02', '130.62', '119.94' , '128.85')
,('20180217', 'ABC', '131.03', '139.80', '129.42' , '136.75')
,('20180218', 'ABC', '136.40', '137.95', '124.32' , '127.38')
,('20180219', 'ABC', '127.24', '138.52', '126.70' , '137.47')
,('20180220', 'ABC', '137.95', '142.01', '127.86' , '128.36')
,('20180215', 'JKL', '9.94', '10.30', '9.77' , '10.17')
,('20180216', 'JKL', '10.15', '10.24', '9.70' , '10.02')
,('20180217', 'JKL', '10.01', '10.18', '9.93' , '10.15')
,('20180218', 'JKL', '10.16', '10.20', '9.23' , '9.38')
,('20180219', 'JKL', '9.37', '9.79', '9.36' , '9.68')
,('20180220', 'JKL', '9.69', '10.01', '9.26' , '9.28')
I'm interested in calculating the daily Average True Range (ATR) for each ticker.
ATR = Max (Today's High, Yesterday's Close) - Min (Today's Low, Yesterday's Close)
Using the LAG function, I can get yesterday's close:
SELECT
    *,
    LAG([close], 1) OVER (PARTITION BY Ticker ORDER BY [Datecol]) AS yest_close
FROM
    #t t
Datecol Ticker open high low close yest_close
--------------------------------------------------------------
2018-02-15 ABC 122.01 125.76 118.79 123.29 NULL
2018-02-16 ABC 123.02 130.62 119.94 128.85 123.29
2018-02-17 ABC 131.03 139.80 129.42 136.75 128.85
2018-02-18 ABC 136.40 137.95 124.32 127.38 136.75
2018-02-19 ABC 127.24 138.52 126.70 137.47 127.38
2018-02-20 ABC 137.95 142.01 127.86 128.36 137.47
2018-02-15 JKL 9.94 10.30 9.77 10.17 NULL
2018-02-16 JKL 10.15 10.24 9.70 10.02 10.17
2018-02-17 JKL 10.01 10.18 9.93 10.15 10.02
2018-02-18 JKL 10.16 10.20 9.23 9.38 10.15
2018-02-19 JKL 9.37 9.79 9.36 9.68 9.38
2018-02-20 JKL 9.69 10.01 9.26 9.28 9.68
How do I get max (Today's High, Yesterday's close)?
You can use CASE (or IIF in SQL Server 2012+) to find the max or min of two values.
Here's a sample:
select
    *,
    ATR = iif([high] > yest_close, [high], yest_close)
        - iif([low] > yest_close, yest_close, [low])
from (
    select
        *,
        yest_close = lag([close]) over (partition by Ticker order by [Datecol])
    from #t
) t
Output:
Datecol Ticker open high low close yest_close ATR
------------------------------------------------------------------------
2018-02-15 ABC 122.01 125.76 118.79 123.29 NULL NULL
2018-02-16 ABC 123.02 130.62 119.94 128.85 123.29 10.68
2018-02-17 ABC 131.03 139.80 129.42 136.75 128.85 10.95
2018-02-18 ABC 136.40 137.95 124.32 127.38 136.75 13.63
2018-02-19 ABC 127.24 138.52 126.70 137.47 127.38 11.82
2018-02-20 ABC 137.95 142.01 127.86 128.36 137.47 14.15
2018-02-15 JKL 9.94 10.30 9.77 10.17 NULL NULL
2018-02-16 JKL 10.15 10.24 9.70 10.02 10.17 0.54
2018-02-17 JKL 10.01 10.18 9.93 10.15 10.02 0.25
2018-02-18 JKL 10.16 10.20 9.23 9.38 10.15 0.97
2018-02-19 JKL 9.37 9.79 9.36 9.68 9.38 0.43
2018-02-20 JKL 9.69 10.01 9.26 9.28 9.68 0.75
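For comparison, the same per-row range calculation can be sketched in pandas (a hedged equivalent of the IIF logic, not part of the original SQL answer; only the needed columns and the first three dates per ticker are rebuilt, and skipna=False keeps the first row NaN to mirror the SQL NULL):

```python
import pandas as pd

df = pd.DataFrame({
    'Ticker': ['ABC'] * 3 + ['JKL'] * 3,
    'Datecol': pd.to_datetime(['2018-02-15', '2018-02-16', '2018-02-17'] * 2),
    'high': [125.76, 130.62, 139.80, 10.30, 10.24, 10.18],
    'low': [118.79, 119.94, 129.42, 9.77, 9.70, 9.93],
    'close': [123.29, 128.85, 136.75, 10.17, 10.02, 10.15],
})

df = df.sort_values(['Ticker', 'Datecol'])
# LAG(...) OVER (PARTITION BY Ticker ORDER BY Datecol) == groupby + shift
df['yest_close'] = df.groupby('Ticker')['close'].shift()
# Row-wise max/min against yesterday's close; skipna=False propagates
# the NaN on each ticker's first row, like the SQL NULL
df['ATR'] = (df[['high', 'yest_close']].max(axis=1, skipna=False)
             - df[['low', 'yest_close']].min(axis=1, skipna=False))
print(df)
```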