Filtering and comparing dates with Pandas

I would like to know how to filter dates at all the different time levels, i.e. find dates by year, month, day, hour, minute and/or second. For example, how do I find all dates that happened in 2014, or in January 2014, or only on the 2nd of January 2014, or ... down to the second?
So I have my date-and-time dataframe, generated via pd.to_datetime:
df
timeStamp
0 2014-01-02 21:03:04
1 2014-02-02 21:03:05
2 2016-02-04 18:03:10
So if I filter by the year 2014 then I would have as output:
timeStamp
0 2014-01-02 21:03:04
1 2014-02-02 21:03:05
Or, as a different example, I want to know the dates that happened in 2014 and on the 2nd of each month. This would also result in:
timeStamp
0 2014-01-02 21:03:04
1 2014-02-02 21:03:05
But if I asked for dates that happened on the 2nd of January 2014, I would only get:
timeStamp
0 2014-01-02 21:03:04
How can I achieve this at all the different levels?
Also how do you compare dates at these different levels to create an array of boolean indices?

You can filter your dataframe via boolean indexing like so:
df.loc[df['timeStamp'].dt.year == 2014]   # all rows in 2014
df.loc[df['timeStamp'].dt.month == 5]     # any May, regardless of year
df.loc[df['timeStamp'].dt.second == 4]    # rows whose seconds component is 4
df.loc[df['timeStamp'] == '2014-01-02']   # exact match: only midnight on that date
df.loc[pd.to_datetime(df['timeStamp'].dt.date) == '2014-01-02']  # any time on Jan 2, 2014
... and so on and so forth.
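These masks can also be combined with & for multi-level filters, which covers the "2014 and on the 2nd of each month" case from the question. A minimal sketch, assuming the timeStamp column above:
mask = (df['timeStamp'].dt.year == 2014) & (df['timeStamp'].dt.day == 2)
df.loc[mask]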

If you set timeStamp as the index (with datetime64 dtype) to get a DatetimeIndex, then you can use the following partial string indexing syntax:
df['2014'] # gets all 2014
df['2014-01'] # gets all Jan 2014
df['01-02-2014'] # gets all Jan 2, 2014
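Note that on recent pandas versions, row lookups by partial date string must go through .loc, since plain df['2014'] performs a column lookup on a DataFrame and raises a KeyError. A minimal sketch of the setup:
df = df.set_index('timeStamp').sort_index()
df.loc['2014']        # all of 2014
df.loc['2014-01']     # all of Jan 2014
df.loc['2014-01-02']  # all of Jan 2, 2014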

I would just create a string series, then use str.contains() with regex wildcards (each . in the pattern matches any single character). That will give you whatever granularity you're looking for.
s = df['timeStamp'].map(lambda x: x.strftime('%Y-%m-%d %H:%M:%S'))  # format timestamps as strings
print(df[s.str.contains('2014-..-.. ..:..:..')])  # anything in 2014
print(df[s.str.contains('2014-..-02 ..:..:..')])  # the 2nd of any month in 2014
print(df[s.str.contains('....-02-.. ..:..:..')])  # February of any year
print(df[s.str.contains('....-..-.. 18:03:10')])  # an exact time of day, on any date
Output:
timeStamp
0 2014-01-02 21:03:04
1 2014-02-02 21:03:05
timeStamp
0 2014-01-02 21:03:04
1 2014-02-02 21:03:05
timeStamp
1 2014-02-02 21:03:05
2 2016-02-04 18:03:10
timeStamp
2 2016-02-04 18:03:10
I think this also solves your question about boolean indices:
print(s.str.contains('....-..-.. 18:03:10'))
Output:
0 False
1 False
2 True
Name: timeStamp, dtype: bool
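As a side note, the map/lambda step can be replaced by the vectorized dt.strftime accessor, which produces the same string series:
s = df['timeStamp'].dt.strftime('%Y-%m-%d %H:%M:%S')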

Related

Pandas Calculate RMSE in Date Range Chunks by Year

I have data in a df and need to calculate the RMSE of a column, comparing each past year's rows of monthly data against the current year's rows, in chunk periods. I cannot figure out how to set up the sequencing by year. For example, I need to calculate the RMSE by year, from exactly month == 5 through month == 2, and print all the RMSE values for the "Variation" column by start year. My data looks like this:
month mean_mon_flow ... std_anomaly Variation
date ...
1992-04-01 00:00:00 4 12.265100 ... -1.074586 NaN
1992-05-01 00:00:00 5 12.533220 ... -1.017388 0.057198
1992-06-01 00:00:00 6 12.491247 ... -1.117406 -0.100018
1992-07-01 00:00:00 7 12.113165 ... -1.401221 -0.283815
1992-08-01 00:00:00 8 11.846904 ... -1.359026 0.042195
1992-09-01 00:00:00 9 11.526178 ... -0.299250 1.059776
1992-10-01 00:00:00 10 11.555834 ... -0.628162 -0.328911
1992-11-01 00:00:00 11 11.746104 ... -1.116374 -0.488213
1992-12-01 00:00:00 12 11.891824 ... -0.143343 0.973031
1993-01-01 00:00:00 1 11.997252 ... -0.486450 -0.343107
1993-02-01 00:00:00 2 12.028855 ... -0.862971 -0.376521
1993-03-01 00:00:00 3 12.063974 ... -0.596869 0.266102
1993-04-01 00:00:00 4 12.265100 ... -0.923695 -0.326826
1993-05-01 00:00:00 5 12.533220 ... 0.322987 1.246682
1993-06-01 00:00:00 6 12.491247 ... -0.478567 -0.801554
1993-07-01 00:00:00 7 12.113165 ... -0.274119 0.204448
1993-08-01 00:00:00 8 11.846904 ... -0.707968 -0.433849
1993-09-01 00:00:00 9 11.526178 ... 0.167246 0.875214
1993-10-01 00:00:00 10 11.555834 ... -0.089410 -0.256656
1993-11-01 00:00:00 11 11.746104 ... -1.046461 -0.957050
1993-12-01 00:00:00 12 11.891824 ... -1.293175 -0.246714
1994-01-01 00:00:00 1 11.997252 ... -1.505133 -0.211959
1994-02-01 00:00:00 2 12.028855 ... -0.610121 0.895012
1994-03-01 00:00:00 3 12.063974 ... -0.974184 -0.364063
1994-04-01 00:00:00 4 12.265100 ... -1.077609 -0.103424
The observed data from the current year looks like this:
month mean_mon_flow ... std_anomaly Variation
date ...
2021-05-01 00:00:00 5 12.533220 ... -0.935899 0.206586
2021-06-01 00:00:00 6 12.491247 ... -0.647261 0.288638
2021-07-01 00:00:00 7 12.113165 ... -0.711730 -0.064469
2021-08-01 00:00:00 8 11.846904 ... -0.482306 0.229424
2021-09-01 00:00:00 9 11.526178 ... -0.116989 0.365317
2021-10-01 00:00:00 10 11.555834 ... 0.319614 0.436603
2021-11-01 00:00:00 11 11.746104 ... 0.880379 0.560765
2021-12-01 00:00:00 12 11.891824 ... 0.630541 -0.249838
2022-01-01 00:00:00 1 11.997252 ... -0.151507 -0.782048
2022-02-01 00:00:00 2 12.028855 ... -0.237398 -0.085891
The result should be something like this below. I've tried using a groupby statement to calculate the RMSE, but I'm not sure how to give groupby a range of dates.
year RMSE Variation
1992 number
1993 number
1994 number
.. ..
2020 number
thank you,
Some pre-processing of your dataframe of previous years is needed. First, get the year label by taking the year component of your date with 4 months subtracted (so that each May-through-February chunk shares the starting May's year). Second, drop March and April.
from dateutil.relativedelta import relativedelta

# Label each row with its chunk's start year (May .. Feb -> the May's year)
df_prev['year'] = pd.Series(df_prev['date'].dt.to_pydatetime() - relativedelta(months=4)).dt.year
# Drop March and April, which fall outside the month == 5 .. month == 2 window
df_prev = df_prev[~df_prev['month'].isin([3, 4])]
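If 'date' is already of datetime64 dtype, an equivalent pure-pandas shortcut for the year label (a sketch that avoids the round-trip through Python datetime objects) is:
df_prev['year'] = (df_prev['date'] - pd.DateOffset(months=4)).dt.year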
Then convert df_prev into a matrix with years as columns and month as the index, and convert this year's table into a series with month as the index:
df_prev_vari = df_prev.set_index(['month', 'year'])[['Variation']].unstack().droplevel(0, axis=1)
df_this_vari = df_this.set_index('month')['Variation']
Having month as the common index for both lets us subtract one from the other by matched index, followed by square, mean, and square-root operations:
(df_prev_vari.sub(df_this_vari, axis=0)**2).mean()**.5
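This expression yields a Series indexed by start year, which is already the desired table; renaming it makes that explicit (reusing the names above):
rmse_by_year = ((df_prev_vari.sub(df_this_vari, axis=0) ** 2).mean() ** 0.5).rename('RMSE')
print(rmse_by_year)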

Python: Convert string to datetime, calculate time difference, and select rows with time difference more than 3 days

I have a dataframe that contains two string date columns. First I would like to convert the two columns to datetime and calculate the time difference. Then I would like to select rows with a time difference of more than 3 days.
simple df
ID Start End
234 2020-11-16 20:25 2020-11-18 00:10
62 2020-11-02 02:50 2020-11-15 21:56
771 2020-11-17 03:03 2020-11-18 00:10
desired df
ID Start End Time difference
62 2020-11-02 02:50:00 2020-11-15 21:56:00 13 days 19:06:00
Current input
df['End'] = pd.to_datetime(df['End'])
df['Start'] = pd.to_datetime(df['Start'])
df['Time difference'] = df['End'] - df['Start']
How can I select rows that have a time difference of more than 3 days?
Thanks in advance! I appreciate any help on this!!
You're just missing one line: convert to days, then filter:
df[df['Time difference'].dt.days > 3]
ID Start End Time difference
62 2020-11-02 02:50:00 2020-11-15 21:56:00 13 days 19:06:00
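One caveat: .dt.days truncates, so a difference of 3 days 19:06 has dt.days == 3 and would be excluded by > 3 even though it is more than 3 days. If partial days matter, comparing against a Timedelta is strictly correct (same columns as above):
df[df['Time difference'] > pd.Timedelta(days=3)]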
# Set ID as the index so to_datetime can be applied to both date columns at once
df = df.set_index('ID').apply(pd.to_datetime)
# Calculate the time difference and reset the index
df = df.assign(Timedifference=df['End'].sub(df['Start'])).reset_index()
# Boolean mask to filter the desired rows
df[df['Timedifference'].dt.days.gt(3)]

Groupby by Year with pd.Timestamp / datetime64 in the format YYYY-MM-DD while keeping the full timestamp

I have a dataframe with columns "time" and "value", with formats YYYY-MM-DD and np.int64 respectively:
time | value
2009-11-03 | 13
2009-11-14 | 25
2009-12-05 | 25
2016-03-02 | 80
2016-05-17 | 56
I need to groupby by year, getting the maximum value per year. If several days within the same year share the highest value, I need to keep them all. And I need to keep the full timestamp as well.
Desired output:
time | value
2009-11-14 | 25
2009-12-05 | 25
2016-03-02 | 80
My code so far:
df["year"] = df["time"].dt.year
df = df.groupby(["year"], sort=False)['value'].max()
But this removes the timestamp and I only have the year + value as a column. How can I get the desired result?
Let us try transform first to broadcast the per-year maximum, then filter:
m=df.value.eq(df.groupby(df.time.dt.year).value.transform('max'))
df=df[m]
Out[111]:
time value
1 2009-11-14 25
2 2009-12-05 25
3 2016-03-02 80
Calculate the maximum values per year, and then join the result with the original data frame:
df["year"] = pd.to_datetime(df["time"]).dt.year
max_val = df.groupby(["year"], sort=False)['value'].max()
pd.merge(max_val, df, on=["value", "year"])
result:
value year time
0 25 2009 2009-11-14
1 25 2009 2009-12-05
2 80 2016 2016-03-02
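To match the desired output exactly, you can then drop the helper year column and reorder (a small follow-up, reusing the names above):
result = pd.merge(max_val, df, on=["value", "year"])[["time", "value"]]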

Date serial numbers and dates need to be converted to date format

When I read a Google spreadsheet into a dataframe, I get data in the format below:
42836
42837
42838
42844
42845
42846
42849
42850
42851
2/1/2018
2/2/2018
But I need to convert them all to date format.
IIUC, set the origin date and use np.where; based on my experience, the serial-number origin in Excel is December 30, 1899.
# Serial-number rows: convert via the Excel origin of 1899-12-30
s1 = pd.to_datetime(pd.to_numeric(df.date, errors='coerce'), errors='coerce', origin='1899-12-30', unit='D')
# Rows that are already date strings
s2 = pd.to_datetime(df.date, errors='coerce')
# Pick s2 where the raw value contains '/', otherwise fall back to s1
df['new'] = np.where(df.date.str.contains('/'), s2, s1)
df
Out[282]:
date new
0 42837 2017-04-12
1 42838 2017-04-13
2 42844 2017-04-19
3 42845 2017-04-20
4 42846 2017-04-21
5 42849 2017-04-24
6 42850 2017-04-25
7 42851 2017-04-26
8 2/1/2018 2018-02-01
9 2/2/2018 2018-02-02
Use datetime with timedelta: take Excel's serial-number origin, December 30, 1899, as the base date, then add the serial number as a timedelta. The for loop just shows the first three of your dates; if you need a different output format, call e.g. strftime("%Y-%m-%d %H:%M:%S") on the resulting datetime.
import datetime as dt

base = dt.datetime(1899, 12, 30)  # Excel's serial-number origin
dates = [42836, 42837, 42838]
for aDay in dates:
    print(base + dt.timedelta(days=aDay))

Filtering Pandas column with specific conditions?

I have a pandas dataframe that looks like
Start Time
0 2017-06-23 15:09:32
1 2017-05-25 18:19:03
2 2017-01-04 08:27:49
3 2017-03-06 13:49:38
4 2017-01-17 14:53:07
5 2017-06-26 09:01:20
6 2017-05-26 09:41:44
7 2017-01-21 14:28:38
8 2017-04-20 16:08:51
I want to select the ones with month == 06, so that would be rows 0 and 5.
I know how to filter a column that has only a few categories, but in this case, since it's a date, I need to parse the date and check the month. I am not sure how to do that with pandas. Please help.
Use the .dt accessor on the datetime column:
# If 'Start Time' is stored as strings, convert it first:
# df['Start Time'] = pd.to_datetime(df['Start Time'])
df1=df[df['Start Time'].dt.month==6].copy()
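The same pattern composes with other .dt attributes, e.g. restricting June to a particular year (a small sketch using the same column):
df1 = df[(df['Start Time'].dt.year == 2017) & (df['Start Time'].dt.month == 6)].copy()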