Finding the longest sequence of dates in a dataframe - pandas

I'd like to know how to find the longest unbroken sequence of dates (formatted as 2016-11-27) in a publish_date column (dates are not the index, though I suppose they could be).
There are a number of stack overflow questions which are similar, but AFAICT all proposed answers return the size of the longest sequence, which is not what I'm after.
I want to know e.g. that the stretch from 2017-01-01 to 2017-06-01 had no missing dates and was the longest such streak.

Here is an example of how you can do this:
import pandas as pd
import datetime
# initialize data
data = {'a': [1,2,3,4,5,6,7],
'date': ['2017-01-01', '2017-01-03', '2017-01-05', '2017-01-06', '2017-01-07', '2017-01-09', '2017-01-31']}
df = pd.DataFrame(data)
df['date'] = pd.to_datetime(df['date'])
# mark each row that directly continues the previous date with 0;
# streak starts (including the first row) keep 1
df['mask'] = 1
df.loc[df['date'] - datetime.timedelta(days=1) == df['date'].shift(), 'mask'] = 0
# cumulative sum turns the mask into a streak id - each sequence has its own number
df['mask'] = df['mask'].cumsum()
# find the most frequent sequence id and select its rows
res = df.loc[df['mask'] == df['mask'].value_counts().idxmax(), 'date']
# extract min and max dates if you need
min_date = res.min()
max_date = res.max()
# print result
print('min_date: {}'.format(min_date))
print('max_date: {}'.format(max_date))
print('result:')
print(res)
The result will be:
min_date: 2017-01-05 00:00:00
max_date: 2017-01-07 00:00:00
result:
2 2017-01-05
3 2017-01-06
4 2017-01-07
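A more compact variant of the same idea is sketched below (column names as in the example above; the streak boundaries come from diff() instead of an explicit mask):

```python
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(
    ['2017-01-01', '2017-01-03', '2017-01-05',
     '2017-01-06', '2017-01-07', '2017-01-09', '2017-01-31'])})

# a new streak starts whenever the gap to the previous date is not exactly 1 day
streak_id = df['date'].diff().ne(pd.Timedelta(days=1)).cumsum()

# aggregate each streak down to its span and length
longest = df.groupby(streak_id)['date'].agg(['min', 'max', 'size'])
best = longest.sort_values('size', ascending=False).iloc[0]
print(best['min'], best['max'])  # start and end of the longest run
```

This reports the 2017-01-05 to 2017-01-07 run, matching the answer above.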

Related

Inconsistent output for pandas groupby-resample with missing values in first time bin

I am finding an inconsistent output with pandas groupby-resample behavior.
Take this dataframe, in which category A has samples on the first and second day and category B has a sample only on the second day:
df1 = pd.DataFrame(index=pd.DatetimeIndex(
['2022-1-1 1:00','2022-1-2 1:00','2022-1-2 1:00']),
data={'category':['A','A','B']})
# Output:
# category
#2022-01-01 01:00:00 A
#2022-01-02 01:00:00 A
#2022-01-02 01:00:00 B
When I groupby-resample I get a Series with multiindex on category and time:
res1 = df1.groupby('category').resample('1D').size()
#Output:
#category
#A 2022-01-01 1
# 2022-01-02 1
#B 2022-01-02 1
#dtype: int64
But if I add one more data point so that B has a sample on day 1, the return value is a dataframe with single-index in category and columns corresponding to the time bins:
df2 = pd.DataFrame(index=pd.DatetimeIndex(
['2022-1-1 1:00','2022-1-2 1:00','2022-1-2 1:00','2022-1-1 1:00']),
data={'category':['A','A','B','B']})
res2 = df2.groupby('category').resample('1D').size()
# Output:
# 2022-01-01 2022-01-02
# category
# A 1 1
# B 1 1
Is this expected behavior? I reproduced this behavior in pandas 1.4.2 and was unable to find a bug report.
I submitted bug report 46826 to pandas.
The result should be a Series with a MultiIndex in both cases. There was a bug which caused df.groupby.resample.size to return a wide DF for cases in which all groups had the same index. This has been fixed on the master branch. Thank you for opening the issue.
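Until the fix is released, a possible workaround (a sketch; note it builds the MultiIndex in (date, category) order first and then swaps the levels) is to group on a pd.Grouper instead of resampling:

```python
import pandas as pd

df2 = pd.DataFrame(index=pd.DatetimeIndex(
    ['2022-1-1 1:00', '2022-1-2 1:00', '2022-1-2 1:00', '2022-1-1 1:00']),
    data={'category': ['A', 'A', 'B', 'B']})

# grouping by a daily Grouper plus the column always yields a MultiIndex Series
res = df2.groupby([pd.Grouper(freq='1D'), 'category']).size()
# reorder the levels to match the (category, date) layout of groupby-resample
res = res.swaplevel().sort_index()
print(res)
```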

Add hours to a timestamp that is formatted as a string

I have a dataframe (df) that has employee start and end times formatted as strings
emp_id|Start|End
001|07:00:00|04:00:00
002|07:30:00|04:30:00
I want to add two hours to the Start and 2 hours to the End on a set of employees, not all employees. I do this by taking a slice of the main dataframe into a separate dataframe (df2). I then update the values and need to merge the updated values back into the main dataframe (df1) where I will coerce back to a string, as there is a method later in the code expecting these values to be strings.
I tried doing this:
from datetime import timedelta
df1['Start'] = pd.to_datetime(df1.Start)
df1['End'] = pd.to_datetime(df1.End)
df2 = df1.sample(frac=0.1, replace=False, random_state=1) #takes a random 10% slice
df2['Start'] = df2['Start'] + timedelta(hours=2)
df2['End'] = df2['End'] + timedelta(hours=2)
df1.loc[df1.emp_id.isin(df2.emp_id), ['Start', 'End']] = df2[['Start', 'End']]
df1['Start'] = str(df1['Start'])
df1['End'] = str(df1['End'])
I'm getting a TypeError: addition/subtraction of integers and integer arrays with DateTimeArray is no longer supported. How do I do this in Python3?
You can use .applymap() on the Start and End columns of your selected subset. Hour addition can be done by string extraction and substitution.
Code
df1 = pd.DataFrame({
"emp_id": ['001', '002'],
"Start": ['07:00:00', '07:30:00'],
"End": ['04:00:00', '04:30:00'],
})
# a subset of employee id
set_id = set(['002'])
# locate the subset
mask = df1["emp_id"].isin(set_id)
# apply hour addition
df1.loc[mask, ["Start", "End"]] = df1.loc[mask, ["Start", "End"]].applymap(lambda el: f"{int(el[:2])+2:02}{el[2:]}")
Result
print(df1)
emp_id Start End
0 001 07:00:00 04:00:00
1 002 09:30:00 06:30:00 <- 2 hrs were added
Note: f-strings require python 3.6+. For earlier versions, replace the f-string with
"%02d%s" % (int(el[:2])+2, el[2:])
Note: mind corner cases (time later than 22:00) if they exist.
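If such corner cases can occur, a timedelta-based sketch (assuming the same df1 and mask setup as above) wraps cleanly past midnight:

```python
import pandas as pd

df1 = pd.DataFrame({
    "emp_id": ['001', '002'],
    "Start": ['07:00:00', '23:30:00'],
    "End": ['04:00:00', '04:30:00'],
})
mask = df1["emp_id"].isin({'002'})

for col in ["Start", "End"]:
    # parse the strings as timedeltas, shift, and wrap modulo 24 hours
    shifted = (pd.to_timedelta(df1.loc[mask, col])
               + pd.Timedelta(hours=2)) % pd.Timedelta(days=1)
    # format the timedelta back to an HH:MM:SS string via a dummy date
    df1.loc[mask, col] = (pd.Timestamp('1970-01-01')
                          + shifted).dt.strftime('%H:%M:%S')
print(df1)
```

Here 23:30:00 becomes 01:30:00 rather than the invalid 25:30:00.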

Groupby two columns one of them is datetime

I have data frame that I want to groupby by two columns one of them is datetime type. How can I do this?
import numpy as np
import pandas as pd
df = pd.DataFrame({
'a':np.random.randn(6),
'b':np.random.choice( [5,7,np.nan], 6),
'c':np.random.choice( ['panda','python','shark'], 6),
# some ways to create systematic groups for indexing or groupby
# this is similar to r's expand.grid(), see note 2 below
'd':np.repeat( range(3), 2 ),
'e':np.tile( range(2), 3 ),
# a date range and set of random dates
'f':pd.date_range('1/1/2011', periods=6, freq='D'),
'g':np.random.choice( pd.date_range('1/1/2011', periods=365,
freq='D'), 6, replace=False)
})
You can use pd.Grouper to specify groupby instructions. It can be used with a pd.DatetimeIndex to group data at a specified frequency via the freq parameter.
Assuming that you have this dataframe:
df = pd.DataFrame(dict(
a=dict(date=pd.Timestamp('2020-05-01'), category='a', value=1),
b=dict(date=pd.Timestamp('2020-06-01'), category='a', value=2),
c=dict(date=pd.Timestamp('2020-06-01'), category='b', value=6),
d=dict(date=pd.Timestamp('2020-07-01'), category='a', value=1),
e=dict(date=pd.Timestamp('2020-07-27'), category='a', value=3),
)).T
You can set the date column as the index, and it will be converted to a pd.DatetimeIndex. Then you can use pd.Grouper along with other columns; in the following example I use the category column.
The freq='M' parameter groups the index with month-end frequency. There are a number of offset aliases that can be used with pd.Grouper.
df.set_index('date').groupby([pd.Grouper(freq='M'), 'category'])['value'].sum()
Result:
date category
2020-05-31 a 1
2020-06-30 a 2
b 6
2020-07-31 a 4
Name: value, dtype: int64
Another example with your mcve:
df.set_index('g').groupby([pd.Grouper(freq='M'), 'c']).d.sum()
Result:
g c
2011-01-31 panda 0
2011-04-30 shark 2
2011-06-30 panda 2
2011-07-31 panda 0
2011-09-30 panda 1
2011-12-31 python 1
Name: d, dtype: int32
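The same grouping also works without touching the index by using Grouper's key parameter; a sketch with the dataframe from the first example:

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2020-05-01', '2020-06-01', '2020-06-01',
                            '2020-07-01', '2020-07-27']),
    'category': ['a', 'a', 'b', 'a', 'a'],
    'value': [1, 2, 6, 1, 3],
})

# key='date' tells Grouper which column to bin, so no set_index is needed
out = df.groupby([pd.Grouper(key='date', freq='M'), 'category'])['value'].sum()
print(out)
```

This produces the same month-end/category sums as the set_index version above.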

Pandas - Filtering out data by weekday

I have a Dataframe that has list of dates with sales count for each of the days as shown below:
date,count
11/1/2018,345
11/2/2018,100
11/5/2018,432
11/7/2018,500
11/11/2018,555
11/17/2018,754
I am trying to check how many of the sales were made on a weekday. To pull all weekdays in November I do the below:
weekday = pd.DataFrame(pd.bdate_range('2018-11-01', '2018-11-30'))
Now I am trying to compare dates in df with value in weekday as below:
df_final = df[df['date'].isin(weekday)]
But the above returns no rows.
You should remove the pd.DataFrame call when creating weekday. When you pass a Series or DataFrame to isin, pandas matches not only the values but also the index and columns; since the index and columns of your original frame differ from those of the newly created weekday dataframe, every comparison returns False.
df.date=pd.to_datetime(df.date)
weekday = pd.bdate_range('2018-11-01', '2018-11-30')
df_final = df[df['date'].isin(weekday)]
df_final
Out[39]:
date count
0 2018-11-01 345
1 2018-11-02 100
2 2018-11-05 432
3 2018-11-07 500
A simple example addressing the issue mentioned above:
df=pd.DataFrame({'A':[1,2,3,4,5]})
newdf=pd.DataFrame({'B':[2,3]})
df.isin(newdf)
Out[43]:
A
0 False
1 False
2 False
3 False
4 False
df.isin(newdf.B.tolist())
Out[44]:
A
0 False
1 True
2 True
3 False
4 False
Use a DatetimeIndex and let pandas do the work for you as follows:
# generate some sample sales data for the month of November
df = pd.DataFrame(
{'count': np.random.randint(0, 900, 30)},
index=pd.date_range('2018-11-01', '2018-11-30', name='date')
)
# resample by business day and call `.asfreq()` on the resulting groupby-like object to get your desired filtering
df.resample(rule='B').asfreq()
Other values for the resampling rule can be found in the pandas documentation on offset aliases.
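If the dates live in a plain column rather than the index, an equivalent filter (a sketch) uses the .dt.dayofweek accessor, where Monday is 0 and Saturday/Sunday are 5 and 6:

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['11/1/2018', '11/2/2018', '11/5/2018',
                            '11/7/2018', '11/11/2018', '11/17/2018']),
    'count': [345, 100, 432, 500, 555, 754],
})

# keep only Monday..Friday rows
weekdays = df[df['date'].dt.dayofweek < 5]
print(weekdays)
```

This drops 2018-11-11 (a Sunday) and 2018-11-17 (a Saturday) from the sample data.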

pandas group by week

I have the following test dataframe:
date user answer
0 2018-08-19 19:08:19 pga yes
1 2018-08-19 19:09:27 pga no
2 2018-08-19 19:10:45 lry no
3 2018-09-07 19:12:31 lry yes
4 2018-09-19 19:13:07 pga yes
5 2018-10-22 19:13:20 lry no
I am using the following code to group by week:
test.groupby(pd.Grouper(freq='W'))
I'm getting an error that Grouper is only valid with DatetimeIndex, however I'm unfamiliar on how to structure this in order to group by week.
Probably you have the date column stored as strings.
In order to use it in a Grouper with a frequency, start from converting this column to DateTime:
df['date'] = pd.to_datetime(df['date'])
Then, since date is an "ordinary" data column (not the index), pass the key='date' parameter along with a frequency.
To sum up, below you have a working example:
import pandas as pd
d = [['2018-08-19 19:08:19', 'pga', 'yes'],
['2018-08-19 19:09:27', 'pga', 'no'],
['2018-08-19 19:10:45', 'lry', 'no'],
['2018-09-07 19:12:31', 'lry', 'yes'],
['2018-09-19 19:13:07', 'pga', 'yes'],
['2018-10-22 19:13:20', 'lry', 'no']]
df = pd.DataFrame(data=d, columns=['date', 'user', 'answer'])
df['date'] = pd.to_datetime(df['date'])
gr = df.groupby(pd.Grouper(key='date',freq='W'))
for name, group in gr:
    print(' ', name)
    if len(group) > 0:
        print(group)
Note that the group key (name) is the ending date of a week, so the dates of the group members are earlier than or equal to the printed date.
You can change this by passing the label='left' parameter to Grouper.
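A sketch of that label='left' variant, compared against the default right-edge labels:

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2018-08-19 19:08:19', '2018-09-07 19:12:31',
                            '2018-10-22 19:13:20']),
    'user': ['pga', 'lry', 'lry'],
})

# default label='right' stamps each group with the week's end date;
# label='left' stamps it with the start of the week instead
right = df.groupby(pd.Grouper(key='date', freq='W')).size()
left = df.groupby(pd.Grouper(key='date', freq='W', label='left')).size()
print(left[left > 0])
```

The bins themselves are identical; only the labels shift back by seven days.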