Select all rows of dataframe that have a minimum value for a group - pandas

I have a dataframe of dates, times, and values, and I would like to create a new dataframe with the date and the value at the earliest time for each date (think of an opening stock price).
For example,
date time value
1/12 9:07 10
1/12 9:03 13
1/13 10:35 8
1/13 11:02 15
1/13 11:54 6
I would want:
date value
1/12 13
1/13 8
Since those values correspond to the earliest time for each date.
So far I got:
timegroup = (result.groupby('date')['time'].min()).to_dict()
But I can't figure out where to go from here.

Use DataFrame.sort_values + DataFrame.drop_duplicates.
df.sort_values(['date','time']).drop_duplicates(subset='date')[['date','value']]
#    date  value
# 1  1/12     13
# 2  1/13      8
or
df.sort_values(['date','time']).groupby('date', as_index=False).first()[['date','value']]
#    date  value
# 0  1/12     13
# 1  1/13      8
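Another possibility (a sketch, not part of the answer above): the sample times are strings, so sorting compares them lexically; with this data the string order happens to match the chronological order, but in general it is safer to parse the times first and pick the earliest row per date with idxmin.
import pandas as pd

df = pd.DataFrame({
    "date": ["1/12", "1/12", "1/13", "1/13", "1/13"],
    "time": ["9:07", "9:03", "10:35", "11:02", "11:54"],
    "value": [10, 13, 8, 15, 6],
})

# Parse "H:MM" strings into timedeltas so comparisons are chronological.
t = pd.to_timedelta(df["time"] + ":00")
# Index label of the earliest time within each date.
idx = t.groupby(df["date"]).idxmin()
df.loc[idx, ["date", "value"]].reset_index(drop=True)
#    date  value
# 0  1/12     13
# 1  1/13      8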

Related

Date dependent calculation from 2 dataframes - average 6-month return

I am working with the following dataframe. I have data for multiple companies; each row is associated with a specific datadate, so there are many rows per company, with IPO dates from 2009 to 2022.
index ID price daily_return datadate daily_market_return mean_daily_market_return ipodate
0 1 27.50 0.008 01-09-2010 0.0023 0.03345 01-12-2009
1 2 33.75 0.0745 05-02-2017 0.00458 0.0895 06-12-2012
2 3 29.20 0.00006 08-06-2020 0.0582 0.0045 01-05-2013
3 4 20.54 0.00486 09-06-2018 0.0009 0.0006 27-11-2013
4 1 21.50 0.009 02-09-2021 0.0846 0.04345 04-05-2009
5 4 22.75 0.00539 06-12-2019 0.0003 0.0006 21-09-2012
...
26074 rows
I also have a dataframe containing the market yield on US Treasury securities at 10-year constant maturity, measured daily. Each row holds the yield for a specific day, covering every day from 2009 to 2022.
date dgs10
1 2009-01-02 2.46
2 2009-01-05 2.49
3 2009-01-06 2.51
4 2009-01-07 2.52
5 2009-01-08 2.47
6 2009-01-09 2.43
7 2009-01-12 2.34
8 2009-01-13 2.33
...
date dgs10
3570 2022-09-08 3.29
3571 2022-09-09 3.33
3572 2022-09-12 3.37
3573 2022-09-13 3.42
3574 2022-09-14 3.41
My goal is to calculate, for each ipodate (from dataframe 1), the average of the market yield on US Treasury securities at 10-year constant maturity (from dataframe 2) over the previous 6 months. The result should go either in a new dataframe or in an additional column of dataframe 1. The two dataframes are not the same length. I tried using rolling(), but it doesn't seem to be working. Does anyone know how to fix this?
# Imports used throughout the snippet.
import numpy as np
import pandas as pd

# Make sure that all date columns are of type Timestamp. They are a lot easier
# to work with.
df1["ipodate"] = pd.to_datetime(df1["ipodate"], dayfirst=True)
df2["date"] = pd.to_datetime(df2["date"])
# Calculate the mean market yield of the previous 6 months. Six month is not a
# fixed length of time so I replaced it with 180 days.
tmp = df2.rolling("180D", on="date").mean()
# The values of the first 180 days are invalid, because we have insufficient
# data to calculate the rolling mean. You may consider extending df2 further
# back to 2008. (You may come up with other rules for this period.)
is_invalid = (tmp["date"] - tmp["date"].min()) / pd.Timedelta(1, "D") < 180
tmp.loc[is_invalid, "dgs10"] = np.nan
# Result
df1.merge(tmp, left_on="ipodate", right_on="date", how="left")
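One thing to watch with the final merge (a sketch of an alternative, reusing the df1 and tmp names above): an ipodate that falls on a weekend or holiday has no exact match in df2, so the left merge leaves NaN for it. merge_asof matches each ipodate to the most recent earlier trading day instead.
import pandas as pd

result = pd.merge_asof(
    df1.sort_values("ipodate"),   # merge_asof requires sorted keys
    tmp.sort_values("date"),
    left_on="ipodate",
    right_on="date",
    direction="backward",         # take the closest earlier trading day
)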

Change date of POSIXct variable based on other columns in R

Is there a way to change the date of a dttm column based on the values from other columns? The time in the "Date_Time" column is correct, but the dates need to be changed to match those in the column "Date" (or from all three columns "Year", "Month", and "Day").
This is likely close to what I need to do, but it gives me this error:
library(lubridate)
df$new <- with(df, ymd_hm(sprintf('%04d%02d%02d', Year, Month, day, Time))) #'Time' is new character column of just time component from 'Date_Time'
# Not sure what this means..
invalid format '%04d'; use format %s for character objects
> head(df,5)
# A tibble: 5 x 5
Date Year Month Day Date_Time
<chr> <fct> <dbl> <dbl> <dttm>
1 2020-11-14 2020 11 14 1899-12-31 10:46:00
2 2020-11-14 2020 11 14 1899-12-31 10:57:00
3 2020-11-14 2020 11 14 1899-12-31 09:16:00
4 2012-8-11 2012 8 11 1899-12-31 14:59:00
5 2012-8-11 2012 8 11 1899-12-31 13:59:00
First update the Date column to be a date. Then use lubridate to assign that date to the Date_Time column:
df$Date <- as.Date(df$Date)
lubridate::date(df$Date_Time) <- df$Date
And if necessary, update the timezone to whatever it needs to be:
attr(df$Date_Time, "tzone") <- "Europe/Paris" # Update timezone

Python: Convert string to datetime, calculate time difference, and select rows with time difference more than 3 days

I have a dataframe that contains two string date columns. First I would like to convert the two columns to datetime and calculate the time difference. Then I would like to select the rows with a time difference of more than 3 days.
simple df
ID Start End
234 2020-11-16 20:25 2020-11-18 00:10
62 2020-11-02 02:50 2020-11-15 21:56
771 2020-11-17 03:03 2020-11-18 00:10
desired df
ID Start End Time difference
62 2020-11-02 02:50:00 2020-11-15 21:56:00 13 days 19:06:00
Current input
df['End'] = pd.to_datetime(df['End'])
df['Start'] = pd.to_datetime(df['Start'])
df['Time difference'] = df['End'] - df['Start']
How can I select rows that has a time difference of more than 3 days?
Thanks in advance! I appreciate any help on this!!
You're just missing one line: convert the difference to days, then filter.
df[df['Time difference'].dt.days > 3]
ID Start End Time difference
62 2020-11-02 02:50:00 2020-11-15 21:56:00 13 days 19:06:00
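One caveat (my reading of "more than 3 days", not something stated in the question): .dt.days keeps only the whole-day component, so a gap of 3 days 05:00:00 is dropped by the filter above. Comparing against a Timedelta keeps such rows:
# Compare the full timedelta so partial days still count as "more than 3 days".
df[df['Time difference'] > pd.Timedelta(days=3)]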
# Set ID as the index so to_datetime can be applied to the remaining date columns
df = df.set_index('ID').apply(lambda x: pd.to_datetime(x))
# Calculate the time difference and reset the index
df = df.assign(Timedifference=df['End'].sub(df['Start'])).reset_index()
# Boolean mask to keep only the rows whose difference exceeds 3 days
df[df['Timedifference'].dt.days.gt(3)]

Divide my data frame into n intervals based on datetime

I have a dataframe whose oldest date is 1995-01-09 and latest date is 2019-11-20; the span between them is 9082 days.
What I am trying to do is divide the dataframe into 100 time bins; the number of rows can differ from bin to bin.
movieId time
21 1995-01-09
47 1995-01-09
11 200-01-29
45 1996-01-29
18 2019-11-20
How about:
df['time_group'] = pd.cut(df['time'], bins=100)
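A slightly fuller sketch, assuming the time column may still be stored as strings (pd.cut needs a numeric or datetime column):
import pandas as pd

df['time'] = pd.to_datetime(df['time'], errors='coerce')  # strings -> datetime64
df['time_group'] = pd.cut(df['time'], bins=100)           # 100 equal-width time intervals
df.groupby('time_group').size()                           # row count per bin (may differ per bin)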

I want to do some aggregations with the help of Group By function in pandas

My dataset consists of a date column in 'datetime64[ns]' dtype; it also has a price and a no. of sales column.
I want to calculate the monthly VWAP (Volume Weighted Average Price ) of the stock.
(VWAP = sum(price * no. of sales) / sum(no. of sales))
What I have done so far: created new dataframe columns for month and year using pandas functions.
Now I want the monthly VWAP from this modified dataset, and it should be distinct by year.
For example, March 2016 and March 2017 should have their own separate monthly VWAP values.
Start by defining a function to compute the VWAP for the current month (a group of rows):
def vwap(grp):
    return (grp.price * grp.salesNo).sum() / grp.salesNo.sum()
Then apply it to monthly groups:
df.groupby(df.dat.dt.to_period('M')).apply(vwap)
Using the following test DataFrame:
dat price salesNo
0 2018-05-14 120.5 10
1 2018-05-16 80.0 22
2 2018-05-20 30.2 12
3 2018-08-10 75.1 41
4 2018-08-20 92.3 18
5 2019-05-10 10.0 33
6 2019-05-20 20.0 41
(containing data from the same months in different years), I got:
dat
2018-05 75.622727
2018-08 80.347458
2019-05 15.540541
Freq: M, dtype: float64
As you can see, the result contains separate entries for May in both
years from the source data.
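If you would rather have the result as a regular dataframe with a named column (a small extra step, not part of the answer above):
monthly_vwap = (
    df.groupby(df.dat.dt.to_period('M'))
      .apply(vwap)
      .rename('vwap')      # give the resulting Series a column name
      .reset_index()       # turn the monthly period index back into a column
)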