The following problem is giving me a really bad headache.
I have a big dataset that looks like this:
Name Date C1 C2 C3 C4 C5 C6 C7
A 2008-01-03 100
A 2008-01-05 NA
A 2008-01-07 120
A 2008-02-03 NA
A 2008-03-10 50
A 2008-07-14 70
A 2008-07-15 NA
A 2009-01-03 40
A 2009-01-05 NA
A 2010-01-07 NA
A 2010-03-03 30
A 2010-03-10 20
A 2011-07-14 10
A 2011-07-15 NA
B 2008-01-03 NA
B 2008-01-05 5
B 2008-01-07 3
B 2008-02-03 11
B 2008-03-10 13
B 2008-07-14 ....
As you can see, there are a lot of NAs in my observations.
The other columns look similar, and the dataset has over 100,000 rows, so it's huge.
What I want to do is aggregate my data in the following way.
For example C1:
I want to build the monthly average for each Name, for every year and month, over a time frame from roughly 2000-01 until 2012-12.
The monthly average should be calculated from whichever dates are available in each month.
When the calculations are done, my dataset should look like this.
Name Date C1 C2 C3 C4 C5 C6 C7
A 2008-01 monthly average
A 2008-02 monthly average
A 2008-03 monthly average
A 2008-04 monthly average
A 2008-05 monthly average
A 2008-06 monthly average
A 2008-07 monthly average
A 2008-08 monthly average
A 2008-09 monthly average
A 2008-10 monthly average
A 2008-11 monthly average
A 2008-12 monthly average
A 2009-01 monthly average
B 2008-01 monthly average
B 2008-02 monthly average
B 2008-03 monthly average
B 2008-04 monthly average
B 2008-05 monthly average
B 2008-06 ....
So my output data should show every month of the year for each name.
The values are either NA, if the month contained only NA values, or the monthly average of that particular month.
For example:
Name Date C1
A 2008-01-03 100
A 2008-01-05 NA
A 2008-01-07 120
Here we would expect:
Name Date C1
A 2008-01 (100+120)/2 = 110
For example:
Name Date C1
A 2008-01-03 NA
A 2008-01-05 NA
A 2008-01-07 NA
Here we would expect:
Name Date C1
A 2008-01 NA
For example:
Name Date C1
A 2008-01-03 100
A 2008-01-05 50
A 2008-01-07 120
Here we would expect:
Name Date C1
A 2008-01 (100+50+120)/3 = 90
As I am relatively new to R and don't know how to solve this, I am hoping to find someone who can tackle this and show me how something like this can be solved.
I would be really thankful for your support :)
library(dplyr)

# generating sample data
data <- data.frame(Name = c(rep("A", 25), rep("B", 50)),
                   Date = seq(as.Date("2018-01-01"), as.Date("2020-01-12"), by = 10),
                   C1 = rep(c(100, NA, NA, NA, NA, 500, 320, 102, 412, NA, 200, NA, 145, 800, 230), 5))

# grouping by Name and month, summarising the mean, and turning NaN (all-NA months) into NA
data %>%
  group_by(Name, month = cut(Date, "month")) %>%
  summarise(C1 = mean(C1, na.rm = TRUE)) %>%
  mutate(C1 = ifelse(is.nan(C1), NA, C1))
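If your dplyr is recent (1.0 or later), a minimal sketch of the same idea with across() covers all the C columns in one go; starts_with("C") assumes your value columns are really named C1 to C7:

library(dplyr)

# average every C* column per Name and month, returning NA (not NaN)
# for months that contain only missing values
data %>%
  group_by(Name, month = cut(Date, "month")) %>%
  summarise(across(starts_with("C"),
                   ~ ifelse(all(is.na(.x)), NA, mean(.x, na.rm = TRUE))),
            .groups = "drop")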
You can use dplyr::summarise_all to calculate the average for all columns C1, C2, etc.
First group_by on Name and YearMon, deselect the Date column, and then use summarise_all.
library(dplyr)
library(lubridate)
# Added C2 to demonstrate the calculation for multiple columns in one go.
df %>%
  mutate(Date = ymd(Date), C2 = C1 * 2) %>%
  group_by(Name, YearMon = format(Date, "%Y-%m")) %>%
  select(-Date) %>%
  summarise_all("mean", na.rm = TRUE)

# OR use summarise_at and calculate the mean for all columns starting with 'C'
df %>%
  mutate(Date = ymd(Date), C2 = C1 * 2) %>%
  group_by(Name, YearMon = format(Date, "%Y-%m")) %>%
  summarise_at(vars(starts_with("C")), mean, na.rm = TRUE)
# A tibble: 12 x 4
# Groups: Name [?]
Name YearMon C1 C2
<chr> <chr> <dbl> <dbl>
1 A 2008-01 110 220
2 A 2008-02 NaN NaN
3 A 2008-03 50.0 100
4 A 2008-07 70.0 140
5 A 2009-01 40.0 80.0
6 A 2010-01 NaN NaN
7 A 2010-03 25.0 50.0
8 A 2011-07 10.0 20.0
9 B 2008-01 4.00 8.00
10 B 2008-02 11.0 22.0
11 B 2008-03 13.0 26.0
12 B 2008-07 NaN NaN
Data:
df <- read.table(text =
"Name Date C1
A 2008-01-03 100
A 2008-01-05 NA
A 2008-01-07 120
A 2008-02-03 NA
A 2008-03-10 50
A 2008-07-14 70
A 2008-07-15 NA
A 2009-01-03 40
A 2009-01-05 NA
A 2010-01-07 NA
A 2010-03-03 30
A 2010-03-10 20
A 2011-07-14 10
A 2011-07-15 NA
B 2008-01-03 NA
B 2008-01-05 5
B 2008-01-07 3
B 2008-02-03 11
B 2008-03-10 13
B 2008-07-14 NA",
header = TRUE, stringsAsFactors = FALSE)
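Note that both approaches only return the year-months that actually occur in the data, while the question asks for every month from 2000-01 until 2012-12 (e.g. A 2008-04 as NA). A minimal sketch with tidyr::complete can fill in the missing combinations; result here is a hypothetical name for the summarised data from above:

library(dplyr)
library(tidyr)

# every year-month in the requested time frame, as "YYYY-MM" strings
all_months <- format(seq(as.Date("2000-01-01"), as.Date("2012-12-01"), by = "month"),
                     "%Y-%m")

# expand to every Name x YearMon combination; missing months become NA
result %>%
  ungroup() %>%
  complete(Name, YearMon = all_months)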
Related
I am working with the following dataframe. I have data for multiple companies, each row associated with a specific datadate, so I have many rows related to many companies, with IPO dates from 2009 to 2022.
index ID price daily_return datadate daily_market_return mean_daily_market_return ipodate
0 1 27.50 0.008 01-09-2010 0.0023 0.03345 01-12-2009
1 2 33.75 0.0745 05-02-2017 0.00458 0.0895 06-12-2012
2 3 29.20 0.00006 08-06-2020 0.0582 0.0045 01-05-2013
3 4 20.54 0.00486 09-06-2018 0.0009 0.0006 27-11-2013
4 1 21.50 0.009 02-09-2021 0.0846 0.04345 04-05-2009
5 4 22.75 0.00539 06-12-2019 0.0003 0.0006 21-09-2012
...
26074 rows
I also have a dataframe containing the market yield on US Treasury securities at 10-year constant maturity, measured daily. Each row holds the yield for a specific day, for every day from 2009 to 2022.
date dgs10
1 2009-01-02 2.46
2 2009-01-05 2.49
3 2009-01-06 2.51
4 2009-01-07 2.52
5 2009-01-08 2.47
6 2009-01-09 2.43
7 2009-01-12 2.34
8 2009-01-13 2.33
...
date dgs10
3570 2022-09-08 3.29
3571 2022-09-09 3.33
3572 2022-09-12 3.37
3573 2022-09-13 3.42
3574 2022-09-14 3.41
My goal is to calculate, for each ipodate (from dataframe 1), the average of the market yield on US Treasury securities at 10-year constant maturity (from dataframe 2) over the previous six months. The result should go either into a new dataframe or into an additional column in dataframe 1. The two dataframes do not have the same length. I tried using rolling(), but it doesn't seem to be working. Does anyone know how to fix this?
import numpy as np
import pandas as pd

# Make sure that all date columns are of type Timestamp. They are a lot easier
# to work with.
df1["ipodate"] = pd.to_datetime(df1["ipodate"], dayfirst=True)
df2["date"] = pd.to_datetime(df2["date"])

# Calculate the mean market yield of the previous 6 months. Six months is not a
# fixed length of time, so I replaced it with 180 days.
tmp = df2.rolling("180D", on="date").mean()

# The values of the first 180 days are invalid, because we have insufficient
# data to calculate the rolling mean. You may consider extending df2 further
# back to 2008. (You may come up with other rules for this period.)
is_invalid = (tmp["date"] - tmp["date"].min()) / pd.Timedelta(1, "D") < 180
tmp.loc[is_invalid, "dgs10"] = np.nan

# Result
df1.merge(tmp, left_on="ipodate", right_on="date", how="left")
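One caveat: the exact merge above only finds IPO dates that appear verbatim in df2, so an IPO falling on a non-trading day gets NaN. A minimal sketch with pd.merge_asof instead picks the most recent trading day on or before each IPO date (both frames must be sorted on their date keys):

import pandas as pd

# merge_asof requires both sides to be sorted on the join keys
df1_sorted = df1.sort_values("ipodate")
tmp_sorted = tmp.sort_values("date")

# for each ipodate, take the rolling mean from the latest date <= ipodate
result = pd.merge_asof(df1_sorted, tmp_sorted,
                       left_on="ipodate", right_on="date",
                       direction="backward")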
I would like to find the percentile of each column, add it to the df dataframe, and also label each value:
top 20 percent (value > 80th percentile) then 'strong'
bottom 20 percent (value < 20th percentile) then 'weak'
else 'average'
Below is my dataframe:
import pandas as pd

df = pd.DataFrame({'month': ['1','1','1','1','1','2','2','2','2','2','2','2'],
                   'X1': [30, 42, 25, 32, 12, 10, 4, 6, 5, 10, 24, 21],
                   'X2': [10, 76, 100, 23, 65, 94, 67, 24, 67, 54, 87, 81],
                   'X3': [23, 78, 95, 52, 60, 76, 68, 92, 34, 76, 34, 12]})
df
Below is what I tried:
import numpy as np

df['X1_percentile'] = df.X1.rank(pct=True)
df['X1_segment'] = np.where(df['X1_percentile'] > 0.8, 'Strong',
                            np.where(df['X1_percentile'] < 0.20, 'Weak', 'Average'))
But I would like to do this for each month and for each column. If possible, could this be automated by a function that works for any number of columns and creates colname + "_per" and colname + "_segment" columns for each one?
Thanks
We can use groupby + rank with the optional parameter pct=True to calculate the ranking expressed as a percentile rank, then use np.select to bin/categorize the percentile values into discrete labels.
p = df.groupby('month').rank(pct=True)
df[p.columns + '_per'] = p
df[p.columns + '_seg'] = np.select([p.gt(.8), p.lt(.2)], ['strong', 'weak'], 'average')
month X1 X2 X3 X1_per X2_per X3_per X1_seg X2_seg X3_seg
0 1 30 10 23 0.600000 0.200000 0.200000 average average average
1 1 42 76 78 1.000000 0.800000 0.800000 strong average average
2 1 25 100 95 0.400000 1.000000 1.000000 average strong strong
3 1 32 23 52 0.800000 0.400000 0.400000 average average average
4 1 12 65 60 0.200000 0.600000 0.600000 average average average
5 2 10 94 76 0.642857 1.000000 0.785714 average strong average
6 2 4 67 68 0.142857 0.500000 0.571429 weak average average
7 2 6 24 92 0.428571 0.142857 1.000000 average weak strong
8 2 5 67 34 0.285714 0.500000 0.357143 average average average
9 2 10 54 76 0.642857 0.285714 0.785714 average average average
10 2 24 87 34 1.000000 0.857143 0.357143 strong strong average
11 2 21 81 12 0.857143 0.714286 0.142857 strong average weak
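To get the reusable function the question asks for, here is a minimal sketch along the same lines; the function name, its arguments, and the thresholds are my own choices:

import numpy as np
import pandas as pd

def add_percentile_segments(df, group_col='month', hi=0.8, lo=0.2):
    """Add <col>_per and <col>_seg columns for every value column."""
    out = df.copy()
    p = out.groupby(group_col).rank(pct=True)  # per-group percentile ranks
    out[p.columns + '_per'] = p
    out[p.columns + '_seg'] = np.select([p.gt(hi), p.lt(lo)],
                                        ['strong', 'weak'], 'average')
    return out

result = add_percentile_segments(df)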
I have a dataframe with a column labelled date that captures the date when a ticket was raised by a customer. If the ref_column of the current row is the same as in the following row, I need to compute the aging from the difference between the date of the current row and the date of the following row for the same cust_id. If the ref_column is not the same, I need to compute the difference between the date and the ref_date of the same row.
Given below is my data:
cust_id,date,ref_column,ref_date
101,15/01/19,abc,31/01/19
101,17/01/19,abc,31/01/19
101,19/01/19,xyz,31/01/19
102,15/01/19,abc,31/01/19
102,21/01/19,klm,31/01/19
102,25/01/19,xyz,31/01/19
103,15/01/19,xyz,31/01/19
Expected output:
cust_id,date,ref_column,ref_date,aging(in days)
101,15/01/19,abc,31/01/19,2
101,17/01/19,abc,31/01/19,14
101,19/01/19,xyz,31/01/19,0
102,15/01/19,abc,31/01/19,16
102,21/01/19,klm,31/01/19,10
102,25/01/19,xyz,31/01/19,0
103,15/01/19,xyz,31/01/19,0
Aging(in days) is 0 for the last entry for a given cust_id
Here's my approach:
import numpy as np
import pandas as pd

# convert the dd/mm/yy dates to datetime type
# (skip if they already are)
df['date'] = pd.to_datetime(df['date'], dayfirst=True)
df['ref_date'] = pd.to_datetime(df['ref_date'], dayfirst=True)

# customer groups
groups = df.groupby('cust_id')

# where ref_column is the same as in the next row of the same customer:
same_ = df['ref_column'].eq(groups['ref_column'].shift(-1))

# pick the aging source depending on that condition
df['aging'] = np.where(same_,
                       -groups['date'].diff(-1).dt.days,        # same ref as next row
                       df['ref_date'].sub(df['date']).dt.days)  # different ref than next row

# set the last entry of each customer to 0
last_idx = groups['date'].idxmax()
df.loc[last_idx, 'aging'] = 0
Output:
cust_id date ref_column ref_date aging
0 101 2019-01-15 abc 2019-01-31 2.0
1 101 2019-01-17 abc 2019-01-31 14.0
2 101 2019-01-19 xyz 2019-01-31 0.0
3 102 2019-01-15 abc 2019-01-31 16.0
4 102 2019-01-21 klm 2019-01-31 10.0
5 102 2019-01-25 xyz 2019-01-31 0.0
6 103 2019-01-15 xyz 2019-01-31 0.0
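For reference, a minimal snippet to load the sample data from the question into df (io.StringIO is just one convenient way to do it):

import io
import pandas as pd

csv_text = """cust_id,date,ref_column,ref_date
101,15/01/19,abc,31/01/19
101,17/01/19,abc,31/01/19
101,19/01/19,xyz,31/01/19
102,15/01/19,abc,31/01/19
102,21/01/19,klm,31/01/19
102,25/01/19,xyz,31/01/19
103,15/01/19,xyz,31/01/19"""

df = pd.read_csv(io.StringIO(csv_text))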
My dataset consists of a date column of dtype 'datetime64[ns]'; it also has a price column and a number-of-sales column.
I want to calculate the monthly VWAP (Volume Weighted Average Price) of the stock.
(VWAP = sum(price * no. of sales) / sum(no. of sales))
What I did was create a new dataframe column of month and year using pandas functions.
Now I want the monthly VWAP from this modified dataset, and it should also be distinct by year.
For example, March 2016 and March 2017 should have their own separate monthly VWAP values.
Start by defining a function to compute the VWAP for the current month (a group of rows):
def vwap(grp):
    return (grp.price * grp.salesNo).sum() / grp.salesNo.sum()
Then apply it to monthly groups:
df.groupby(df.dat.dt.to_period('M')).apply(vwap)
Using the following test DataFrame:
dat price salesNo
0 2018-05-14 120.5 10
1 2018-05-16 80.0 22
2 2018-05-20 30.2 12
3 2018-08-10 75.1 41
4 2018-08-20 92.3 18
5 2019-05-10 10.0 33
6 2019-05-20 20.0 41
(containing data from the same months in different years), I got:
dat
2018-05 75.622727
2018-08 80.347458
2019-05 15.540541
Freq: M, dtype: float64
As you can see, the result contains separate entries for May in both years from the source data.
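An equivalent sketch without apply(), in case you prefer plain aggregation (pv is a hypothetical helper column name): weight the prices first, then divide the two group sums:

import pandas as pd

# price * volume per row, then group both it and the volume by month
tmp = df.assign(pv=df.price * df.salesNo)
monthly = tmp.groupby(df.dat.dt.to_period('M'))
vwap_series = monthly['pv'].sum() / monthly['salesNo'].sum()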
Let's say I have a data table with one column that contains the first day of each month from 2000 until 2005 and a second column with some values, which are positive or negative.
What I want to do is compute the difference between two observations from the same month but from different years.
So, for example:
I want to calculate the difference between 2001-01-01 and 2000-01-01 and write the value into a new column in the same row as my 2001-01-01 date.
I want to do this for all my observations; for the ones that have no value in the previous year to compare to, just return NA.
Thank you for your time and help :)
If there are no gaps in your data, you could use the lag function:
library(dplyr)
df <- data.frame(Date = as.Date(sapply(2000:2005, function(x) paste(x, 1:12, 1, sep = "-"))),
                 Value = runif(72, 0, 1))
df$Difference <- df$Value - lag(df$Value, 12)
> df[1:24,]
Date Value Difference
1 2000-01-01 0.83038968 NA
2 2000-02-01 0.85557483 NA
3 2000-03-01 0.41463862 NA
4 2000-04-01 0.16500688 NA
5 2000-05-01 0.89260904 NA
6 2000-06-01 0.21735933 NA
7 2000-07-01 0.96691686 NA
8 2000-08-01 0.99877057 NA
9 2000-09-01 0.96518311 NA
10 2000-10-01 0.68122410 NA
11 2000-11-01 0.85688662 NA
12 2000-12-01 0.97282720 NA
13 2001-01-01 0.83614146 0.005751778
14 2001-02-01 0.07967273 -0.775902097
15 2001-03-01 0.44373647 0.029097852
16 2001-04-01 0.35088593 0.185879052
17 2001-05-01 0.46240321 -0.430205836
18 2001-06-01 0.73177425 0.514414912
19 2001-07-01 0.52017554 -0.446741315
20 2001-08-01 0.52986486 -0.468905713
21 2001-09-01 0.14921003 -0.815973080
22 2001-10-01 0.25427134 -0.426952761
23 2001-11-01 0.36032777 -0.496558857
24 2001-12-01 0.20862578 -0.764201423
I think you should try the lubridate package; it is very useful for working with dates.
https://cran.r-project.org/web/packages/lubridate/vignettes/lubridate.html
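For instance, a gap-safe sketch of the same calculation (my own, building on the df from the answer above): instead of relying on a fixed 12-row offset, join each row to the same date one year earlier:

library(dplyr)
library(lubridate)

df %>%
  mutate(PrevYear = Date - years(1)) %>%          # same month, previous year
  left_join(df %>% select(Date, PrevValue = Value),
            by = c("PrevYear" = "Date")) %>%      # NA where that date is absent
  mutate(Difference = Value - PrevValue) %>%
  select(Date, Value, Difference)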