Filtering Pandas column with specific conditions? - pandas

I have a pandas dataframe that looks like
Start Time
0 2017-06-23 15:09:32
1 2017-05-25 18:19:03
2 2017-01-04 08:27:49
3 2017-03-06 13:49:38
4 2017-01-17 14:53:07
5 2017-06-26 09:01:20
6 2017-05-26 09:41:44
7 2017-01-21 14:28:38
8 2017-04-20 16:08:51
I want to filter out the ones with month == 06. So it would be the row 1 and 5.
I know how to filter it out for column that has only few categories, but in this case, if it's a date, I need to parse the date and check the month. But I am not sure how to do it with pandas. Please help.

Using
#df['Start Time']=pd.to_datetime(df['Start Time'])
df1=df[df['Start Time'].dt.month==6].copy()

Related

Python: Convert string to datetime, calculate time difference, and select rows with time difference more than 3 days

I have a dataframe that contains two string date columns. First I would like to convert the two column into datetime and calculate the time difference. Then I would like to select rows with a time difference of more than 3 days.
simple df
ID Start End
234 2020-11-16 20:25 2020-11-18 00:10
62 2020-11-02 02:50 2020-11-15 21:56
771 2020-11-17 03:03 2020-11-18 00:10
desired df
ID Start End Time difference
62 2020-11-02 02:50:00 2020-11-15 21:56:00 13 days 19:06:00
Current input
df['End'] = pd.to_datetime(z['End'])
df['Start'] = pd.to_datetime(z['Start'])
df['Time difference'] = df['End'] - df['Start']
How can I select rows that has a time difference of more than 3 days?
Thanks in advance! I appreciate any help on this!!
Your just missing one line, convert to days then query
df[df['Time difference'].dt.days > 3]
ID Start End Time difference
62 2020-11-02 02:50:00 2020-11-15 21:56:00 13 days 19:06:00
df=df.set_index('ID').apply(lambda x: pd.to_datetime(x))#Set ID as index to allow coercing of dates to datetime
df=df.assign(Timedifference =df['End'].sub(df['Start'])).reset_index()#Calculate time difference and reset index
df[df['Timedifference'].dt.days.gt(3)]#Mask a bollean selection to filter youre desired

difference in two date column in Pandas

I am trying to get difference between two date columns below script and data used in script, but I am getting same results for all three rows
df = pd.read_csv(r'Book1.csv',encoding='cp1252')
df
Out[36]:
Start End DifferenceinDays DifferenceinHrs
0 10/26/2013 12:43 12/15/2014 0:04 409 9816
1 2/3/2014 12:43 3/25/2015 0:04 412 9888
2 5/14/2014 12:43 7/3/2015 0:04 409 9816
I am expecting results as in column DifferenceinDays which is calculated in excel but in python getting same values for all three rows, Please refer to below code used, can anyone let me know how is to calculate difference between 2 date column, I am trying to get number of hours between two date columns.
df["Start"] = pd.to_datetime(df['Start'])
df["End"] = pd.to_datetime(df['End'])
df['hrs']=(df.End-df.Start)
df['hrs']
Out[38]:
0 414 days 11:21:00
1 414 days 11:21:00
2 414 days 11:21:00
Name: hrs, dtype: timedelta64[ns]
IIUC, np.timedelta64(1,'h')
Additionally, it looks like excel calculates the hours differently, unsure why.
import numpy as np
df['hrs'] = (df['End'] - df['Start']) / np.timedelta64(1,'h')
print(df)
Start End DifferenceinHrs hrs
0 2013-10-26 12:43:00 2014-12-15 00:04:00 9816 9947.35
1 2014-02-03 12:43:00 2015-03-25 00:04:00 9888 9947.35
2 2014-05-14 12:43:00 2015-07-03 00:04:00 9816 9947.35

Date serial number and date need to convert in date format

when I am reading google spreadsheet in dataframe getting data in below format
42836
42837
42838
42844
42845
42846
42849
42850
42851
2/1/2018
2/2/2018
But i need to convert all in date format
IIUC setting up the origin date and using np.where, base on my experience
the origin in Excel is December 30, 1899.
s1=pd.to_datetime(pd.to_numeric(df.date,errors='coerce'),errors='coerce',origin='1899-12-30',unit='D')
s2=pd.to_datetime(df.date,errors='coerce')
df['new']=np.where(df.date.str.contains('/'),s2,s1)
df
Out[282]:
date new
0 42837 2017-04-12
1 42838 2017-04-13
2 42844 2017-04-19
3 42845 2017-04-20
4 42846 2017-04-21
5 42849 2017-04-24
6 42850 2017-04-25
7 42851 2017-04-26
8 2/1/2018 2018-02-01
9 2/2/2018 2018-02-02
Use datetime with timedelta.
base year is 1.1.1900 then add the days as timedelta.
the for loop just shows the first three of your dates.
if you need a different format use strftime("%Y-%m-%d %H:%M:%S", gmtime())
import datetime as dt
date = dt.datetime(1900,1,1)
dates = [42836, 42837, 42838]
for aDay in dates:<br>
print(date+dt.timedelta(days=aDay))

groupby pandas dataframe, take difference between value of latest and earliest date

I have a Cumulative column and I want to groupby index and take the values corresponding to the latest date minus the values corresponding to the earliest date.
Very similar to this: group by pandas dataframe and select latest in each group
But take the difference between latest and earliest in each group.
I'm a python rookie, and here is my solution:
import pandas as pd
from io import StringIO
csv = StringIO("""index id product date
0 220 6647 2014-09-01
1 220 6647 2014-09-03
2 220 6647 2014-10-16
3 826 3380 2014-11-11
4 826 3380 2014-12-09
5 826 3380 2015-05-19
6 901 4555 2014-09-01
7 901 4555 2014-10-05
8 901 4555 2014-11-01""")
df = pd.read_table(csv, sep='\s+',index_col='index')
df['date']=pd.to_datetime(df['date'],errors='coerce')
df_sort=df.sort_values('date')
df_sort.drop(['product'], axis=1,inplace=True)
df_sort.groupby('id').tail(1).set_index('id')-df_sort.groupby('id').head(1).set_index('id')

Filtering and comparing dates with Pandas

I would like to know how to filter different dates at all the different time levels, i.e. find dates by year, month, day, hour, minute and/or day. For example, how do I find all dates that happened in 2014 or 2014 in the month of January or only 2nd January 2014 or ...down to the second?
So I have my date and time dataframe generated from pd.to_datetime
df
timeStamp
0 2014-01-02 21:03:04
1 2014-02-02 21:03:05
2 2016-02-04 18:03:10
So if I filter by the year 2014 then I would have as output:
timeStamp
0 2014-01-02 21:03:04
1 2014-02-02 21:03:05
Or as a different example I want to know the dates that happened in 2014 and at the 2nd of each month. This would also result in:
timeStamp
0 2014-01-02 21:03:04
1 2014-02-02 21:03:05
But if I asked for a date that happened on the 2nd of January 2014
timeStamp
0 2014-01-02 21:03:04
How can I achieve this at all the different levels?
Also how do you compare dates at these different levels to create an array of boolean indices?
You can filter your dataframe via boolean indexing like so:
df.loc[df['timeStamp'].dt.year == 2014]
df.loc[df['timeStamp'].dt.month == 5]
df.loc[df['timeStamp'].dt.second == 4]
df.loc[df['timeStamp'] == '2014-01-02']
df.loc[pd.to_datetime(df['timeStamp'].dt.date) == '2014-01-02']
... and so on and so forth.
If you set timestamp as index and dtype as datetime to get a DateTimeIndex, then you can use the following Partial String Indexing syntax:
df['2014'] # gets all 2014
df['2014-01'] # gets all Jan 2014
df['01-02-2014'] # gets all Jan 2, 2014
I would just create a string series, then use str.contains() with wildcards. That will give you whatever granularity you're looking for.
s = df['timeStamp'].map(lambda x: x.strftime('%Y-%m-%d %H:%M:%S'))
print(df[s.str.contains('2014-..-.. ..:..:..')])
print(df[s.str.contains('2014-..-02 ..:..:..')])
print(df[s.str.contains('....-02-.. ..:..:..')])
print(df[s.str.contains('....-..-.. 18:03:10')])
Output:
timeStamp
0 2014-01-02 21:03:04
1 2014-02-02 21:03:05
timeStamp
0 2014-01-02 21:03:04
1 2014-02-02 21:03:05
timeStamp
1 2014-02-02 21:03:05
2 2016-02-04 18:03:10
timeStamp
2 2016-02-04 18:03:10
I think this also solves your question about boolean indices:
print(s.str.contains('....-..-.. 18:03:10'))
Output:
0 False
1 False
2 True
Name: timeStamp, dtype: bool