I have a DataFrame that has a list of dates with the sales count for each day, as shown below:
date,count
11/1/2018,345
11/2/2018,100
11/5/2018,432
11/7/2018,500
11/11/2018,555
11/17/2018,754
I am trying to find out how many of these sales were made on a weekday. To get all weekdays in November, I do the following:
weekday = pd.DataFrame(pd.bdate_range('2018-11-01', '2018-11-30'))
Then I try to compare the dates in df with the values in weekday:
df_final = df[df['date'].isin(weekday)]
But the above returns no rows.
Drop the pd.DataFrame wrapper when you create weekday. When isin is given a Series or DataFrame, pandas matches not only the values but also the index and column labels. The index and columns of the newly created weekday DataFrame differ from those of df, so every comparison returns False. The date column also has to be converted to datetime first:
df.date=pd.to_datetime(df.date)
weekday = pd.bdate_range('2018-11-01', '2018-11-30')
df_final = df[df['date'].isin(weekday)]
df_final
Out[39]:
date count
0 2018-11-01 345
1 2018-11-02 100
2 2018-11-05 432
3 2018-11-07 500
A simple example that illustrates the issue mentioned above:
df=pd.DataFrame({'A':[1,2,3,4,5]})
newdf=pd.DataFrame({'B':[2,3]})
df.isin(newdf)
Out[43]:
A
0 False
1 False
2 False
3 False
4 False
df.isin(newdf.B.tolist())
Out[44]:
A
0 False
1 True
2 True
3 False
4 False
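Putting the pieces of this answer together, here is a minimal, self-contained sketch that rebuilds the question's sample data and applies the fix. The dt.dayofweek cross-check at the end is not part of the original answer; it is an alternative weekday test added here only for comparison.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'date': ['11/1/2018', '11/2/2018', '11/5/2018',
             '11/7/2018', '11/11/2018', '11/17/2018'],
    'count': [345, 100, 432, 500, 555, 754],
})

# The strings must become datetimes before they can match a DatetimeIndex
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%Y')

# Pass the DatetimeIndex itself to isin, not a DataFrame wrapped around it
weekday = pd.bdate_range('2018-11-01', '2018-11-30')
df_final = df[df['date'].isin(weekday)]

# Alternative check (an addition, not from the answer): Monday=0 ... Sunday=6
is_weekday = df['date'].dt.dayofweek < 5

print(df_final)
print('weekday sales:', df_final['count'].sum())
```

Both approaches should select the same rows here; the isin version is the more natural choice when the reference calendar also needs to exclude holidays.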
Use a DatetimeIndex and let pandas do the work for you as follows:
# generate some sample sales data for the month of November
df = pd.DataFrame(
{'count': np.random.randint(0, 900, 30)},
index=pd.date_range('2018-11-01', '2018-11-30', name='date')
)
# resample by business day and call `.asfreq()` on the resulting groupby-like object to get your desired filtering
df.resample(rule='B').asfreq()
Other values for the resampling rule are listed among the offset aliases in the pandas time series documentation.
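As a quick sanity check of the resample approach, the sketch below uses deterministic counts instead of random ones (an assumption made only so the result is reproducible) and confirms that only weekdays remain.

```python
import numpy as np
import pandas as pd

# One row per calendar day in November 2018, with predictable values
df = pd.DataFrame(
    {'count': np.arange(30)},
    index=pd.date_range('2018-11-01', '2018-11-30', name='date'),
)

# Keep only the rows that fall on business days
business_days = df.resample('B').asfreq()

# Monday=0 ... Friday=4, so every remaining label should be below 5
print((business_days.index.dayofweek < 5).all())
print(len(df), 'calendar days,', len(business_days), 'business days')
```

Note that this relies on the dates being the index; with dates in an ordinary column, the isin approach shown earlier in the thread is the simpler fit.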
I have a dataframe that looks like the following:
arr = pd.DataFrame([[0,0],[0,1],[0,4],[1,4],[1,5],[1,6],[2,5],[2,8],[2,6]])
My desired output is booleans that represent whether the value in column 2 is in the next consecutive group or not. The groups are represented by the values in column 1. So for example, 4 shows up in group 0 and the next consecutive group, group 1:
output = pd.DataFrame([[False],[False],[True],[False],[True],[True],[np.nan],[np.nan],[np.nan]])
The outputs for group 2 would be NaN because group 3 doesn't exist.
So far I have tried this:
output = arr.groupby([0])[1].isin(arr.groupby([0])[1].shift(periods=-1))
This doesn't work because I can't apply the isin() on a groupby series.
You could create a helper column with lists of the shifted group items, then check against that with a function that returns True, False or NaN:
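import pandas as pd
import numpy as np
arr = pd.DataFrame([[0,0],[0,1],[0,4],[1,4],[1,5],[1,6],[2,5],[2,8],[2,6]])
# attach to every row the list of values from the next group (NaN for the last group)
arr = pd.merge(arr, arr.groupby([0]).agg(list).shift(-1).reset_index(), on=[0], how='outer')
def check_columns(row):
    try:
        # '1_x' is the row's own value, '1_y' the list of values in the next group
        return row['1_x'] in row['1_y']
    except TypeError:
        # '1_y' is NaN (not a list) when there is no next group
        return np.nan
arr.apply(check_columns, axis=1)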
Result:
0 False
1 False
2 True
3 False
4 True
5 True
6 NaN
7 NaN
8 NaN
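A variant of the same idea that avoids the merge is sketched below. It is not the answer's code: the column names group and value are hypothetical labels introduced here for readability, and the lookup assumes the group ids are consecutive integers, as they are in the question.

```python
import numpy as np
import pandas as pd

# 'group' and 'value' are illustrative names for the question's columns 0 and 1
arr = pd.DataFrame(
    [[0, 0], [0, 1], [0, 4], [1, 4], [1, 5], [1, 6], [2, 5], [2, 8], [2, 6]],
    columns=['group', 'value'],
)

# Map each group id to the set of values it contains
group_values = arr.groupby('group')['value'].agg(set).to_dict()

def in_next_group(group, value):
    """True/False if group + 1 exists, NaN otherwise."""
    next_values = group_values.get(group + 1)
    if next_values is None:
        return np.nan
    return value in next_values

result = pd.Series(
    [in_next_group(g, v) for g, v in zip(arr['group'], arr['value'])],
    index=arr.index,
)
print(result)
```

Set membership keeps each lookup cheap, which may matter if the groups are large.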
I have a data frame that I want to group by two columns, one of which is of datetime type. How can I do this?
import pandas as pd
import numpy as np
import datetime as dt
df = pd.DataFrame({
    'a':np.random.randn(6),
    'b':np.random.choice( [5,7,np.nan], 6),
    # note: this 'g' is overwritten by the second 'g' key further down
    'g':[1002,300,1002,300,1002,300],
    'c':np.random.choice( ['panda','python','shark'], 6),
    # some ways to create systematic groups for indexing or groupby
    # this is similar to r's expand.grid(), see note 2 below
    'd':np.repeat( range(3), 2 ),
    'e':np.tile( range(2), 3 ),
    # a date range and set of random dates
    'f':pd.date_range('1/1/2011', periods=6, freq='D'),
    'g':np.random.choice( pd.date_range('1/1/2011', periods=365,
                          freq='D'), 6, replace=False)
})
You can use pd.Grouper to specify groupby instructions. Combined with a pd.DatetimeIndex, it groups the data at the frequency given by the freq parameter.
Assuming that you have this dataframe:
df = pd.DataFrame(dict(
a=dict(date=pd.Timestamp('2020-05-01'), category='a', value=1),
b=dict(date=pd.Timestamp('2020-06-01'), category='a', value=2),
c=dict(date=pd.Timestamp('2020-06-01'), category='b', value=6),
d=dict(date=pd.Timestamp('2020-07-01'), category='a', value=1),
e=dict(date=pd.Timestamp('2020-07-27'), category='a', value=3),
)).T
You can set the index to the date column, which converts it to a pd.DatetimeIndex. Then you can use pd.Grouper along with other columns. The following example uses the category column.
The freq='M' parameter groups the index by month-end frequency. pd.Grouper accepts any of the pandas time series offset aliases; note that in pandas 2.2 and later the month-end alias is spelled 'ME' and 'M' is deprecated.
df.set_index('date').groupby([pd.Grouper(freq='M'), 'category'])['value'].sum()
Result:
date category
2020-05-31 a 1
2020-06-30 a 2
b 6
2020-07-31 a 4
Name: value, dtype: int64
Another example with your MCVE:
df.set_index('g').groupby([pd.Grouper(freq='M'), 'c']).d.sum()
Result:
g c
2011-01-31 panda 0
2011-04-30 shark 2
2011-06-30 panda 2
2011-07-31 panda 0
2011-09-30 panda 1
2011-12-31 python 1
Name: d, dtype: int32
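If the spelling of the month alias across pandas versions is a concern, one alternative (not from the answer above) is to group on a monthly period derived from the date column. The sketch below rebuilds the answer's first sample frame as ordinary columns and leaves date as a regular column rather than the index.

```python
import pandas as pd

# Same rows as the answer's first example, built column-wise
df = pd.DataFrame({
    'date': pd.to_datetime(['2020-05-01', '2020-06-01', '2020-06-01',
                            '2020-07-01', '2020-07-27']),
    'category': ['a', 'a', 'b', 'a', 'a'],
    'value': [1, 2, 6, 1, 3],
})

# Group by calendar month (as a Period) and by category
monthly = df.groupby([df['date'].dt.to_period('M'), 'category'])['value'].sum()
print(monthly)
```

The group labels here are monthly periods rather than month-end timestamps, which is often easier to read in reports.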
I'd like to know how to find the longest unbroken sequence of dates (formatted as 2016-11-27) in a publish_date column (dates are not the index, though I suppose they could be).
There are a number of similar Stack Overflow questions, but AFAICT all proposed answers return the size of the longest sequence, which is not what I'm after.
I want to know e.g. that the stretch from 2017-01-01 to 2017-06-01 had no missing dates and was the longest such streak.
Here is an example of how you can do this:
import pandas as pd
import datetime
# initialize data
data = {'a': [1,2,3,4,5,6,7],
'date': ['2017-01-01', '2017-01-03', '2017-01-05', '2017-01-06', '2017-01-07', '2017-01-09', '2017-01-31']}
df = pd.DataFrame(data)
df['date'] = pd.to_datetime(df['date'])
# create mask that indicates sequential pair of days (except the first date)
df['mask'] = 1
df.loc[df['date'] - datetime.timedelta(days=1) == df['date'].shift(),'mask'] = 0
# convert mask to numbers - each sequence gets its own number
df['mask'] = df['mask'].cumsum()
# find largest sequence number and get this sequence
res = df.loc[df['mask'] == df['mask'].value_counts().idxmax(), 'date']
# extract min and max dates if you need
min_date = res.min()
max_date = res.max()
# print result
print('min_date: {}'.format(min_date))
print('max_date: {}'.format(max_date))
print('result:')
print(res)
The result will be:
min_date: 2017-01-05 00:00:00
max_date: 2017-01-07 00:00:00
result:
2 2017-01-05
3 2017-01-06
4 2017-01-07
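The same technique can be wrapped in a small helper. This is only a sketch: the function name longest_date_streak is made up for illustration, and it assumes that duplicate dates should count once and that the input may be unsorted.

```python
import pandas as pd

def longest_date_streak(dates):
    """Return (start, end) of the longest run of consecutive calendar days."""
    s = (pd.Series(pd.to_datetime(dates))
           .drop_duplicates()
           .sort_values()
           .reset_index(drop=True))
    # A new streak starts wherever the gap to the previous date is not one day
    streak_id = (s.diff() != pd.Timedelta(days=1)).cumsum()
    # Pick the streak id that occurs most often, then take its rows
    longest = streak_id.value_counts().idxmax()
    streak = s[streak_id == longest]
    return streak.min(), streak.max()

start, end = longest_date_streak(
    ['2017-01-01', '2017-01-03', '2017-01-05', '2017-01-06',
     '2017-01-07', '2017-01-09', '2017-01-31']
)
print(start.date(), end.date())
```

If several streaks share the maximum length, only one of them is returned.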
I have a DataFrame with these columns:
year month day gender births
I'd like to create a new column of type "Date" from the year, month and day columns, formatted as "yyyy-mm-dd".
I'm just beginning in Python and I just can't figure out how to proceed...
Assuming you are using pandas to create your dataframe, you can try:
>>> import pandas as pd
>>> df = pd.DataFrame({'year':[2015,2016],'month':[2,3],'day':[4,5],'gender':['m','f'],'births':[0,2]})
>>> df['dates'] = pd.to_datetime(df.iloc[:,0:3])
>>> df
year month day gender births dates
0 2015 2 4 m 0 2015-02-04
1 2016 3 5 f 2 2016-03-05
This is based on the pd.to_datetime example of assembling datetimes from multiple DataFrame columns, and on the iloc slicing covered in the "Selection" section of "10 minutes to pandas".
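A small variation on the snippet above, sketched here with the same sample data, selects the three columns by name instead of by position, so it keeps working if the column order changes. The strftime step is only needed if a "yyyy-mm-dd" string is wanted instead of a datetime.

```python
import pandas as pd

df = pd.DataFrame({'year': [2015, 2016], 'month': [2, 3], 'day': [4, 5],
                   'gender': ['m', 'f'], 'births': [0, 2]})

# to_datetime assembles a datetime from columns named year, month and day
df['date'] = pd.to_datetime(df[['year', 'month', 'day']])

# Optional: a plain "yyyy-mm-dd" string version of the same column
df['date_str'] = df['date'].dt.strftime('%Y-%m-%d')
print(df)
```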
You can use .assign
For example:
df2 = df.assign(ColumnDate = df.Column1.astype(str) + '-' + df.Column2.astype(str) + '-' + df.Column3.astype(str))
It is simple and it is much faster than lambda if you have tonnes of data.
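Applied to the year, month and day columns from the question, the .assign idea might look like the sketch below. The zero-padding with str.zfill is an addition not present in the answer; without it, single-digit months and days would produce strings such as "2015-2-4" instead of "2015-02-04".

```python
import pandas as pd

df = pd.DataFrame({'year': [2015, 2016], 'month': [2, 3], 'day': [4, 5],
                   'gender': ['m', 'f'], 'births': [0, 2]})

# Build a "yyyy-mm-dd" string column; zfill pads month and day to two digits
df2 = df.assign(
    date=df.year.astype(str)
    + '-' + df.month.astype(str).str.zfill(2)
    + '-' + df.day.astype(str).str.zfill(2)
)
print(df2)
```

The result is a column of strings, not datetimes, so date arithmetic would still need pd.to_datetime.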
I have the following test dataframe:
date user answer
0 2018-08-19 19:08:19 pga yes
1 2018-08-19 19:09:27 pga no
2 2018-08-19 19:10:45 lry no
3 2018-09-07 19:12:31 lry yes
4 2018-09-19 19:13:07 pga yes
5 2018-10-22 19:13:20 lry no
I am using the following code to group by week:
test.groupby(pd.Grouper(freq='W'))
I'm getting an error that Grouper is only valid with DatetimeIndex, but I'm unfamiliar with how to structure this in order to group by week.
Your date column is probably stored as strings.
To use it in a Grouper with a frequency, start by converting this column to datetime:
df['date'] = pd.to_datetime(df['date'])
Then, since date is an "ordinary" data column (not the index), pass the key='date' parameter together with a frequency.
To sum up, below you have a working example:
import pandas as pd
d = [['2018-08-19 19:08:19', 'pga', 'yes'],
['2018-08-19 19:09:27', 'pga', 'no'],
['2018-08-19 19:10:45', 'lry', 'no'],
['2018-09-07 19:12:31', 'lry', 'yes'],
['2018-09-19 19:13:07', 'pga', 'yes'],
['2018-10-22 19:13:20', 'lry', 'no']]
df = pd.DataFrame(data=d, columns=['date', 'user', 'answer'])
df['date'] = pd.to_datetime(df['date'])
gr = df.groupby(pd.Grouper(key='date',freq='W'))
for name, group in gr:
    print(' ', name)
    if len(group) > 0:
        print(group)
Note that the group key (name) is the ending date of a week, so the dates of the group members are earlier than or equal to the date printed above them.
You can change this by passing the label='left' parameter to Grouper.
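To make the effect of the label parameter concrete, here is a minimal sketch that counts rows per week with both labelings, using the sample data from the question. The variable names right_labelled and left_labelled are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2018-08-19 19:08:19', '2018-08-19 19:09:27',
                            '2018-08-19 19:10:45', '2018-09-07 19:12:31',
                            '2018-09-19 19:13:07', '2018-10-22 19:13:20']),
    'user': ['pga', 'pga', 'lry', 'lry', 'pga', 'lry'],
    'answer': ['yes', 'no', 'no', 'yes', 'yes', 'no'],
})

# Default: each weekly bin is labelled with the date on which it ends
right_labelled = df.groupby(pd.Grouper(key='date', freq='W')).size()

# label='left': the same bins, labelled with their left edge instead
left_labelled = df.groupby(pd.Grouper(key='date', freq='W', label='left')).size()

print(right_labelled[right_labelled > 0])
print(left_labelled[left_labelled > 0])
```

Only the labels differ between the two results; the rows assigned to each bin are the same.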