Let's say I have the following data table: one column contains the first of each month from 2000 until 2005, and a second column contains values that can be positive or negative.
I want to compute the difference between two observations from the same month but from different years.
So for example:
I want to calculate the difference between 2001-01-01 and 2000-01-01 and write the value into a new column, in the same row as my 2001-01-01 date.
I want to do this for all my observations, and for those that have no value in the previous year to compare against, just return NA.
Thank you for your time and help :)
If there are no gaps in your data, you could use dplyr's lag() function:
library(dplyr)
df <- data.frame(Date = as.Date(sapply(2000:2005, function(x) paste(x, 1:12, 1, sep = "-"))),
                 Value = runif(72, 0, 1))

df$Difference <- df$Value - lag(df$Value, 12)
> df[1:24,]
Date Value Difference
1 2000-01-01 0.83038968 NA
2 2000-02-01 0.85557483 NA
3 2000-03-01 0.41463862 NA
4 2000-04-01 0.16500688 NA
5 2000-05-01 0.89260904 NA
6 2000-06-01 0.21735933 NA
7 2000-07-01 0.96691686 NA
8 2000-08-01 0.99877057 NA
9 2000-09-01 0.96518311 NA
10 2000-10-01 0.68122410 NA
11 2000-11-01 0.85688662 NA
12 2000-12-01 0.97282720 NA
13 2001-01-01 0.83614146 0.005751778
14 2001-02-01 0.07967273 -0.775902097
15 2001-03-01 0.44373647 0.029097852
16 2001-04-01 0.35088593 0.185879052
17 2001-05-01 0.46240321 -0.430205836
18 2001-06-01 0.73177425 0.514414912
19 2001-07-01 0.52017554 -0.446741315
20 2001-08-01 0.52986486 -0.468905713
21 2001-09-01 0.14921003 -0.815973080
22 2001-10-01 0.25427134 -0.426952761
23 2001-11-01 0.36032777 -0.496558857
24 2001-12-01 0.20862578 -0.764201423
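Note that if your data does have gaps (missing months), the fixed 12-row lag will pair up the wrong months. One gap-safe sketch (an addition of mine, assuming df has the Date and Value columns from above, and using lubridate for the date arithmetic) joins each row to the observation dated exactly one year earlier:

library(dplyr)
library(lubridate)

# Join each row to the observation dated exactly one year earlier; months
# with no previous-year counterpart simply yield NA.
df %>%
  mutate(PrevDate = Date %m-% years(1)) %>%
  left_join(df, by = c("PrevDate" = "Date"), suffix = c("", ".prev")) %>%
  mutate(Difference = Value - Value.prev) %>%
  select(Date, Value, Difference)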
I think you should try the lubridate package, very useful for working with dates.
https://cran.r-project.org/web/packages/lubridate/vignettes/lubridate.html
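For example, a couple of basics (my own quick illustration, not from the vignette):

library(lubridate)

d <- ymd("2001-01-01")
year(d)       # 2001
month(d)      # 1
d - years(1)  # "2000-01-01"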
Related
I am working with the following dataframe. I have data for multiple companies, each row associated with a specific datadate, so there are many rows related to many companies, with IPO dates from 2009 to 2022.
index ID price daily_return datadate daily_market_return mean_daily_market_return ipodate
0 1 27.50 0.008 01-09-2010 0.0023 0.03345 01-12-2009
1 2 33.75 0.0745 05-02-2017 0.00458 0.0895 06-12-2012
2 3 29.20 0.00006 08-06-2020 0.0582 0.0045 01-05-2013
3 4 20.54 0.00486 09-06-2018 0.0009 0.0006 27-11-2013
4 1 21.50 0.009 02-09-2021 0.0846 0.04345 04-05-2009
5 4 22.75 0.00539 06-12-2019 0.0003 0.0006 21-09-2012
...
26074 rows
I also have a dataframe containing the market yield on US Treasury securities at 10-year constant maturity, measured daily. Each row represents the yield for a specific day, covering every day from 2009 to 2022.
date dgs10
1 2009-01-02 2.46
2 2009-01-05 2.49
3 2009-01-06 2.51
4 2009-01-07 2.52
5 2009-01-08 2.47
6 2009-01-09 2.43
7 2009-01-12 2.34
8 2009-01-13 2.33
...
date dgs10
3570 2022-09-08 3.29
3571 2022-09-09 3.33
3572 2022-09-12 3.37
3573 2022-09-13 3.42
3574 2022-09-14 3.41
My goal is to calculate, for each ipodate (from dataframe 1), the average of the market yield on US Treasury securities at 10-year constant maturity (from dataframe 2) over the previous 6 months. The result should go either in a new dataframe or in an additional column in dataframe 1. The two dataframes are not the same length. I tried using rolling(), but it doesn't seem to be working. Does anyone know how to fix this?
import numpy as np
import pandas as pd

# Make sure that all date columns are of type Timestamp. They are a lot
# easier to work with.
df1["ipodate"] = pd.to_datetime(df1["ipodate"], dayfirst=True)
df2["date"] = pd.to_datetime(df2["date"])
# Calculate the mean market yield of the previous 6 months. Six months is not
# a fixed length of time, so I replaced it with 180 days.
tmp = df2.rolling("180D", on="date").mean()
# The values of the first 180 days are invalid, because we have insufficient
# data to calculate the rolling mean. You may consider extending df2 further
# back to 2008. (You may come up with other rules for this period.)
is_invalid = (tmp["date"] - tmp["date"].min()) / pd.Timedelta(1, "D") < 180
tmp.loc[is_invalid, "dgs10"] = np.nan
# Result
df1.merge(tmp, left_on="ipodate", right_on="date", how="left")
I have this data frame that looks like this:
PE CE time
0 362.30 304.70 09:42
1 365.30 303.60 09:43
2 367.20 302.30 09:44
3 360.30 309.80 09:45
4 356.70 310.25 09:46
5 355.30 311.70 09:47
6 354.40 312.98 09:48
7 350.80 316.70 09:49
8 349.10 318.95 09:50
9 350.05 317.45 09:51
10 352.05 315.95 09:52
11 350.25 316.65 09:53
12 348.63 318.35 09:54
13 349.05 315.95 09:55
14 345.65 320.15 09:56
15 346.85 319.95 09:57
16 348.55 317.20 09:58
17 349.55 316.26 09:59
18 348.25 317.10 10:00
19 347.30 318.50 10:01
In this data frame, I would like to calculate the slope of the first and second columns separately, over the time period starting at the first timestamp (here 09:42, but the start is not fixed and can vary) up to 12:00.
Please help me write it.
Computing the slope can be accomplished by use of the equation:
Slope = Rise/Run
Given you want to compute the slope between two time entries, all you need to do is find:
the Run = the distance between the start and end times (measured below in rows, one row per time step)
the Rise = the difference between the cell entries at the start and end.
The tricky part of these calculations is making sure you properly handle the time functions:
import pandas as pd
from datetime import datetime
Thus you can define a function:
def computeSelectedSlope(df: pd.DataFrame, start: str, end: str, timecol: str, datacol: str) -> float:
    assert timecol in df.columns  # prove timecol exists
    assert datacol in df.columns  # prove datacol exists
    rise = (df[datacol][df[timecol] == datetime.strptime(end, '%H:%M:%S').time()].values[0] -
            df[datacol][df[timecol] == datetime.strptime(start, '%H:%M:%S').time()].values[0])
    # The run is the index difference between the two rows (one row per time step)
    run = (int(df.index[df[timecol] == datetime.strptime(end, '%H:%M:%S').time()].values[0]) -
           int(df.index[df[timecol] == datetime.strptime(start, '%H:%M:%S').time()].values[0]))
    return rise / run
Now given a dataframe df of the form:
A B T
0 2.632 231.229 00:00:00
1 2.732 239.026 00:01:00
2 2.748 251.310 00:02:00
3 3.018 285.330 00:03:00
4 3.090 308.925 00:04:00
5 3.366 312.702 00:05:00
6 3.369 326.912 00:06:00
7 3.562 330.703 00:07:00
8 3.590 379.575 00:08:00
9 3.867 422.262 00:09:00
10 4.030 428.148 00:10:00
11 4.210 442.521 00:11:00
12 4.266 443.631 00:12:00
13 4.335 444.991 00:13:00
14 4.380 453.531 00:14:00
15 4.402 462.531 00:15:00
16 4.499 464.170 00:16:00
17 4.553 471.770 00:17:00
18 4.572 495.285 00:18:00
19 4.665 513.009 00:19:00
You can find the slope for any time difference by:
computeSelectedSlope(df, '00:01:00', '00:15:00', 'T', 'B')
Which yields 15.964642857142858
I am trying to tally the number of events that happened in specific periods of time (day/week/month) before each of my events in a data frame.
I have a data frame with 50 individuals, each of whom has events scattered across different periods of time (days/weeks/months) in the dataframe. Every row in the data frame is an event, and I'm trying to understand how the number of events in the previous day/week/month affected the way the individual responded to the current event. Every event is marked with an individual ID (ID.2) and has a date and time associated with it (Datetime). I have already created columns for day (epd), week (epw), and month (epm), and want to populate them, for each event, with the number of events for that specific individual in the previous day, week, and month respectively.
My data looks like this:
> head(ACss)
Date Datetime ID.2 month day year epd epw epm
1 2019-05-25 2019-05-25 11:57 139 5 25 2019 NA NA NA
2 2019-06-09 2019-06-09 19:42 43 6 9 2019 NA NA NA
3 2019-07-05 2019-07-05 20:12 139 7 5 2019 NA NA NA
4 2019-07-27 2019-07-27 17:27 152 7 27 2019 NA NA NA
5 2019-08-04 2019-08-04 9:13 152 8 4 2019 NA NA NA
6 2019-08-04 2019-08-04 16:18 139 8 4 2019 NA NA NA
I have no idea how to go about doing this so haven't tried anything yet! Any and all suggestions are greatly appreciated!
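In case it helps as a starting point, here is a rough dplyr sketch (untested assumptions: Datetime parses with the format shown above, and a month is approximated as 30 days):

library(dplyr)

ACss %>%
  mutate(Datetime = as.POSIXct(Datetime, format = "%Y-%m-%d %H:%M")) %>%
  group_by(ID.2) %>%
  mutate(
    # For each event, count this individual's earlier events within the window
    epd = sapply(Datetime, function(t) sum(Datetime < t & Datetime >= t - 60*60*24)),
    epw = sapply(Datetime, function(t) sum(Datetime < t & Datetime >= t - 60*60*24*7)),
    epm = sapply(Datetime, function(t) sum(Datetime < t & Datetime >= t - 60*60*24*30))
  ) %>%
  ungroup()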
I have an Excel column that consists of numbers and times that were all supposed to be entered as time values only. Some are in number form (915) and some are in time form (9:15, which appears as a decimal in R). I managed to get them all into the same format in Excel (year-month-day hh:mm:ss), although the dates are incorrect; that doesn't really matter, as I just need the time. However, I can't seem to convert this new column (time - new) back to the correct time value in R (in character or time format).
I'm sure this answer already exists somewhere, I just can't find one that works...
library(chron)  # times() comes from the chron package

# Returns incorrect time
x$new_time <- times(strftime(x$`time - new`, "%H:%M:%S"))

# Returns all NA
x$new_time2 <- as.POSIXct(as.character(x$`time - new`),
                          format = '%H:%M:%S', origin = '2011-07-15 13:00:00')
> head(x)
# A tibble: 6 x 8
Year Month Day `Zone - matched with coordinate tab` Time `time - new` new_time new_time2
<dbl> <dbl> <dbl> <chr> <dbl> <dttm> <times> <dttm>
1 2017 7 17 Crocodile 103 1899-12-31 01:03:00 20:03:00 NA
2 2017 7 17 Crocodile 113 1899-12-31 01:13:00 20:13:00 NA
3 2017 7 16 Crocodile 118 1899-12-31 01:18:00 20:18:00 NA
4 2017 7 17 Crocodile 123 1899-12-31 01:23:00 20:23:00 NA
5 2017 7 17 Crocodile 125 1899-12-31 01:25:00 20:25:00 NA
6 2017 7 16 West 135 1899-12-31 01:35:00 20:35:00 NA
Found this answer here:
Extract time from timestamp?
library(lubridate)
# Adding new column to verify times are correct
x$works <- format(ymd_hms(x$`time - new`), "%H:%M:%S")
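As a quick sanity check of the same idea on a standalone vector (using the timestamps from the table above):

library(lubridate)

ts <- ymd_hms(c("1899-12-31 01:03:00", "1899-12-31 01:13:00"))
format(ts, "%H:%M:%S")
# [1] "01:03:00" "01:13:00"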
The following problem is causing me a really bad headache.
I have a big dataset that looks like this.
Name Date C1 C2 C3 C4 C5 C6 C7
A 2008-01-03 100
A 2008-01-05 NA
A 2008-01-07 120
A 2008-02-03 NA
A 2008-03-10 50
A 2008-07-14 70
A 2008-07-15 NA
A 2009-01-03 40
A 2009-01-05 NA
A 2010-01-07 NA
A 2010-03-03 30
A 2010-03-10 20
A 2011-07-14 10
A 2011-07-15 NA
B 2008-01-03 NA
B 2008-01-05 5
B 2008-01-07 3
B 2008-02-03 11
B 2008-03-10 13
B 2008-07-14 ....
As you can see, there are a lot of NAs in my observations.
The other columns look similar, and the dataset has over 100,000 rows, so it's huge.
What I want to do is aggregate my data in the following way.
For example C1:
For each Name, I want to build the monthly average for each year and each month, in a timeframe from roughly 2000-01 until 2012-12.
The monthly average should be calculated using whichever dates within each month are available.
When the calculations are done, my dataset should look like this.
Name Date C1 C2 C3 C4 C5 C6 C7
A 2008-01 monthly average
A 2008-02 monthly average
A 2008-03 monthly average
A 2008-04 monthly average
A 2008-05 monthly average
A 2008-06 monthly average
A 2008-07 monthly average
A 2008-08 monthly average
A 2008-09 monthly average
A 2008-10 monthly average
A 2008-11 monthly average
A 2008-12 monthly average
A 2009-01 monthly average
B 2008-01 monthly average
B 2008-02 monthly average
B 2008-03 monthly average
B 2008-04 monthly average
B 2008-05 monthly average
B 2008-06 ....
So my output data should show each month of each year for every Name.
The values are either NA, if the month contained only NA values, or the monthly average of that particular month.
For example:
Name Date C1
A 2008-01-03 100
A 2008-01-05 NA
A 2008-01-07 120
Here we would expect:
Name Date C1
A 2008-01 (100+120)/2 = 110
For example:
Name Date C1
A 2008-01-03 NA
A 2008-01-05 NA
A 2008-01-07 NA
Here we would expect:
Name Date C1
A 2008-01 NA
For example:
Name Date C1
A 2008-01-03 100
A 2008-01-05 50
A 2008-01-07 120
Here we would expect:
Name Date C1
A 2008-01 (100+50+120)/3 = 90
As I am relatively new to R and don't know how to solve this, I am hoping to find someone who can tackle this and show me how something like this can be solved.
I would be really thankful for your support :)
library(dplyr)
# generating sample data
data <- data.frame(Name = c(rep("A", 25), rep("B", 50)),
                   Date = seq(as.Date("2018-01-01"), as.Date("2020-01-12"), by = 10),
                   C1 = rep(c(100, NA, NA, NA, NA, 500, 320, 102, 412, NA, 200, NA, 145, 800, 230), 5))

# grouping by Name and month, then summarising the mean of the values
data %>%
  group_by(Name, month = cut(Date, "month")) %>%
  summarise(C1 = mean(C1, na.rm = TRUE)) %>%
  mutate(C1 = ifelse(is.nan(C1), NA, C1))
You can use dplyr::summarise_all to calculate the average for all columns C1, C2, etc. in one go.
First group_by on Name and YearMon, deselect the Date column, and then use summarise_all:
library(dplyr)
library(lubridate)
# Added C2 to demonstrate the calculation for multiple columns in one go.
df %>%
  mutate(Date = ymd(Date), C2 = C1 * 2) %>%
  group_by(Name, YearMon = format(Date, "%Y-%m")) %>%
  select(-Date) %>%
  summarise_all("mean", na.rm = TRUE)

# OR - use summarise_at and calculate the mean for all columns starting with 'C'
df %>%
  mutate(Date = ymd(Date), C2 = C1 * 2) %>%
  group_by(Name, YearMon = format(Date, "%Y-%m")) %>%
  summarise_at(vars(starts_with("C")), mean, na.rm = TRUE)
# A tibble: 12 x 4
# Groups: Name [?]
Name YearMon C1 C2
<chr> <chr> <dbl> <dbl>
1 A 2008-01 110 220
2 A 2008-02 NaN NaN
3 A 2008-03 50.0 100
4 A 2008-07 70.0 140
5 A 2009-01 40.0 80.0
6 A 2010-01 NaN NaN
7 A 2010-03 25.0 50.0
8 A 2011-07 10.0 20.0
9 B 2008-01 4.00 8.00
10 B 2008-02 11.0 22.0
11 B 2008-03 13.0 26.0
12 B 2008-07 NaN NaN
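Note that months containing only NAs come back as NaN from mean(..., na.rm = TRUE). If you prefer NA there, one option (a small sketch extending the same pipeline) is:

df %>%
  mutate(Date = ymd(Date), C2 = C1 * 2) %>%
  group_by(Name, YearMon = format(Date, "%Y-%m")) %>%
  summarise_at(vars(starts_with("C")), mean, na.rm = TRUE) %>%
  mutate_at(vars(starts_with("C")), ~ ifelse(is.nan(.), NA, .))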
Data:
df <- read.table(text =
"Name Date C1
A 2008-01-03 100
A 2008-01-05 NA
A 2008-01-07 120
A 2008-02-03 NA
A 2008-03-10 50
A 2008-07-14 70
A 2008-07-15 NA
A 2009-01-03 40
A 2009-01-05 NA
A 2010-01-07 NA
A 2010-03-03 30
A 2010-03-10 20
A 2011-07-14 10
A 2011-07-15 NA
B 2008-01-03 NA
B 2008-01-05 5
B 2008-01-07 3
B 2008-02-03 11
B 2008-03-10 13
B 2008-07-14 NA",
header = TRUE, stringsAsFactors = FALSE)