Inconsistent output for pandas groupby-resample with missing values in first time bin

I am finding an inconsistent output with pandas groupby-resample behavior.
Take this dataframe, in which category A has samples on the first and second day and category B has a sample only on the second day:
df1 = pd.DataFrame(index=pd.DatetimeIndex(
['2022-1-1 1:00','2022-1-2 1:00','2022-1-2 1:00']),
data={'category':['A','A','B']})
# Output:
# category
#2022-01-01 01:00:00 A
#2022-01-02 01:00:00 A
#2022-01-02 01:00:00 B
When I groupby-resample I get a Series with multiindex on category and time:
res1 = df1.groupby('category').resample('1D').size()
#Output:
#category
#A 2022-01-01 1
# 2022-01-02 1
#B 2022-01-02 1
#dtype: int64
But if I add one more data point so that B has a sample on day 1, the return value is a dataframe with single-index in category and columns corresponding to the time bins:
df2 = pd.DataFrame(index=pd.DatetimeIndex(
['2022-1-1 1:00','2022-1-2 1:00','2022-1-2 1:00','2022-1-1 1:00']),
data={'category':['A','A','B','B']})
res2 = df2.groupby('category').resample('1D').size()
# Output:
# 2022-01-01 2022-01-02
# category
# A 1 1
# B 1 1
Is this expected behavior? I reproduced this behavior in pandas 1.4.2 and was unable to find a bug report.

I submitted bug report 46826 to pandas.

The result should be a Series with a MultiIndex in both cases. There was a bug that caused df.groupby(...).resample(...).size() to return a wide DataFrame when all groups had the same index. This has been fixed on the master branch. Thank you for opening the issue.
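If you are stuck on an affected version, one possible workaround is to skip resample and put a pd.Grouper on the index next to the category key. This is only a sketch using the df2 data from the question; it returns a Series with a MultiIndex regardless of how the groups line up. As far as I can tell it only lists (category, day) pairs that actually contain rows, so it is not a drop-in replacement if you rely on resample filling empty bins.

```python
import pandas as pd

df2 = pd.DataFrame(
    index=pd.DatetimeIndex(
        ["2022-1-1 1:00", "2022-1-2 1:00", "2022-1-2 1:00", "2022-1-1 1:00"]
    ),
    data={"category": ["A", "A", "B", "B"]},
)

# Group by category and by daily bins of the DatetimeIndex in a single step.
# pd.Grouper without a key bins the index itself.
res = df2.groupby(["category", pd.Grouper(freq="1D")]).size()
print(res)
```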

Related

Reshape Pandas dataframe (partial transpose)

I have a csv similar to the following, where the column heading specifies the time (hour number):
Day,Location,1,2,3
1/1/2021,A,0.26,0.25,0.49
1/1/2021,B,0.8,0.23,0.55
1/1/2021,C,0.32,0.11,0.58
1/2/2021,A,0.67,0.72,0.49
1/2/2021,B,0.25,0.09,0.56
1/2/2021,C,0.83,0.54,0.7
When I load it as a dataframe using
df = pd.read_csv(open('VirusLevels.csv', 'r'), index_col=[0,1], header=0)
Pandas creates a dataframe with indices Day and Location, and column names 1, 2, and 3.
I need it to be reshaped as shown below, where Day and Time are the indices, and the Location is the column heading:
I've tried a lot of things and followed a lot of rabbitholes, but haven't been successful. The most on-point example I could find suggested something like the following, but it doesn't work (says "KeyError: 'Day'").
df.melt(id_vars=['Day'], var_name= 'Time',
value_name = 'VirusLevels').sort_values(by='Location').reset_index(drop=True)
Thanks in advance for any help.
Try:
df = pd.read_csv('VirusLevels.csv', index_col=[0,1])
df.rename_axis(columns='Time').stack().unstack('Location')
# or
# df.rename_axis('Time',axis='columns').stack().unstack('Location')
Output:
Location A B C
Day Time
1/1/2021 1 0.345307 0.099403 0.474077
2 0.299947 0.853091 0.352472
3 0.400975 0.599249 0.743099
1/2/2021 1 0.660258 0.003976 0.295406
2 0.425434 0.953433 0.418783
3 0.421021 0.844761 0.369561
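On the KeyError: with index_col=[0,1], Day and Location live in the index, so melt cannot find them as columns. A melt-based sketch that keeps them as ordinary columns is below. The CSV text is inlined with io.StringIO only to make the example self-contained, and "VirusLevels" is just the value name taken from the question.

```python
import io
import pandas as pd

csv_text = """Day,Location,1,2,3
1/1/2021,A,0.26,0.25,0.49
1/1/2021,B,0.8,0.23,0.55
1/1/2021,C,0.32,0.11,0.58
1/2/2021,A,0.67,0.72,0.49
1/2/2021,B,0.25,0.09,0.56
1/2/2021,C,0.83,0.54,0.7
"""

# Read without index_col so Day and Location stay regular columns.
df = pd.read_csv(io.StringIO(csv_text))

# Long format: one row per (Day, Location, Time).
long_df = df.melt(id_vars=["Day", "Location"], var_name="Time",
                  value_name="VirusLevels")

# Back to wide, this time with Location across the columns.
wide = long_df.pivot(index=["Day", "Time"], columns="Location",
                     values="VirusLevels")
print(wide)
```

Note that the hour labels come out of read_csv as strings ("1", "2", "3"); convert them with astype(int) on the Time column if you need numbers.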

Add hours to a timestamp that is formatted as a string

I have a dataframe (df) that has employee start and end times formatted as strings
emp_id|Start|End
001|07:00:00|04:00:00
002|07:30:00|04:30:00
I want to add two hours to the Start and 2 hours to the End on a set of employees, not all employees. I do this by taking a slice of the main dataframe into a separate dataframe (df2). I then update the values and need to merge the updated values back into the main dataframe (df1) where I will coerce back to a string, as there is a method later in the code expecting these values to be strings.
I tried doing this:
df1['Start'] = pd.to_datetime(df1.Start)
df1['End'] = pd.to_datetime(df1.End)
df2 = df1.sample(frac=0.1, replace=False, random_state=1) #takes a random 10% slice
df2['Start'] = df2['Start'] + timedelta(hours=2)
df2['End'] = df2['End'] + timedelta(hours=2)
df1.loc[df1.emp_id.isin(df2.emp_id), ['Start', 'End']] = df2[['Start', 'End']]
df1['Start'] = str(df1['Start'])
df1['End'] = str(df1['End'])
I'm getting a TypeError: addition/subtraction of integers and integer arrays with DateTimeArray is no longer supported. How do I do this in Python3?
You can use .applymap() (deprecated in favor of DataFrame.map() since pandas 2.1) on the Start and End columns of your selected subset. Hour addition can be done by string extraction and substitution.
Code
df1 = pd.DataFrame({
"emp_id": ['001', '002'],
"Start": ['07:00:00', '07:30:00'],
"End": ['04:00:00', '04:30:00'],
})
# a subset of employee id
set_id = set(['002'])
# locate the subset
mask = df1["emp_id"].isin(set_id)
# apply hour addition
df1.loc[mask, ["Start", "End"]] = df1.loc[mask, ["Start", "End"]].applymap(lambda el: f"{int(el[:2])+2:02}{el[2:]}")
Result
print(df1)
emp_id Start End
0 001 07:00:00 04:00:00
1 002 09:30:00 06:30:00 <- 2 hrs were added
Note: f-strings require python 3.6+. For earlier versions, replace the f-string with
"%02d%s" % (int(el[:2])+2, el[2:])
Note: mind corner cases (time later than 22:00) if they exist.
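If the late-evening corner case does matter, a timedelta-based variant can wrap around midnight. This is a rough sketch, not the only way to do it: add_hours is a hypothetical helper name, and employee 003 is made-up sample data added to show the wraparound.

```python
import pandas as pd

df1 = pd.DataFrame({
    "emp_id": ["001", "002", "003"],
    "Start": ["07:00:00", "07:30:00", "23:15:00"],
    "End": ["04:00:00", "04:30:00", "08:15:00"],
})

def add_hours(col, hours):
    """Shift HH:MM:SS strings by `hours`, wrapping past midnight."""
    shifted = pd.to_timedelta(col) + pd.Timedelta(hours=hours)
    # Reduce to seconds within a single day, then format back to a string.
    seconds = shifted.dt.total_seconds().astype(int) % 86400
    return seconds.map(
        lambda s: f"{s // 3600:02d}:{s % 3600 // 60:02d}:{s % 60:02d}"
    )

# Only the selected employees are updated; the columns stay strings.
mask = df1["emp_id"].isin({"002", "003"})
for name in ["Start", "End"]:
    df1.loc[mask, name] = add_hours(df1.loc[mask, name], 2)

print(df1)
```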

Groupby two columns one of them is datetime

I have a data frame that I want to group by two columns, one of which is of datetime type. How can I do this?
import numpy as np
import pandas as pd
import datetime as dt
df = pd.DataFrame({
'a':np.random.randn(6),
'b':np.random.choice( [5,7,np.nan], 6),
# note: this 'g' entry is overridden by the later 'g' key below
'g':[1002,300,1002,300,1002,300],
'c':np.random.choice( ['panda','python','shark'], 6),
# some ways to create systematic groups for indexing or groupby
# this is similar to r's expand.grid(), see note 2 below
'd':np.repeat( range(3), 2 ),
'e':np.tile( range(2), 3 ),
# a date range and set of random dates
'f':pd.date_range('1/1/2011', periods=6, freq='D'),
'g':np.random.choice( pd.date_range('1/1/2011', periods=365,
freq='D'), 6, replace=False)
})
You can use pd.Grouper to specify groupby instructions. It can be used with pd.DatetimeIndex index to group data with specified frequency using the freq parameter.
Assuming that you have this dataframe:
df = pd.DataFrame(dict(
a=dict(date=pd.Timestamp('2020-05-01'), category='a', value=1),
b=dict(date=pd.Timestamp('2020-06-01'), category='a', value=2),
c=dict(date=pd.Timestamp('2020-06-01'), category='b', value=6),
d=dict(date=pd.Timestamp('2020-07-01'), category='a', value=1),
e=dict(date=pd.Timestamp('2020-07-27'), category='a', value=3),
)).T
You can set the index to the date column, which converts it to a pd.DatetimeIndex. Then you can use pd.Grouper along with other columns. The following example uses the category column.
The freq='M' parameter groups the index by month end. There are a number of frequency (offset) aliases that can be used in pd.Grouper; note that pandas 2.2 and later spell month end as 'ME'.
df.set_index('date').groupby([pd.Grouper(freq='M'), 'category'])['value'].sum()
Result:
date category
2020-05-31 a 1
2020-06-30 a 2
b 6
2020-07-31 a 4
Name: value, dtype: int64
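If you would rather not move the date column into the index, pd.Grouper also accepts a key argument naming the column to bin. A small sketch with the same sample values, rebuilt here as a plain column-wise DataFrame; it uses 'MS' (month start) only because that alias is spelled the same across pandas versions, so the bin labels are the first day of each month instead of the last.

```python
import pandas as pd

df = pd.DataFrame({
    "date": pd.to_datetime(
        ["2020-05-01", "2020-06-01", "2020-06-01", "2020-07-01", "2020-07-27"]
    ),
    "category": ["a", "a", "b", "a", "a"],
    "value": [1, 2, 6, 1, 3],
})

# key="date" tells the Grouper which column to bin by month.
monthly = df.groupby(
    [pd.Grouper(key="date", freq="MS"), "category"]
)["value"].sum()
print(monthly)
```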
Another example with your mcve:
df.set_index('g').groupby([pd.Grouper(freq='M'), 'c']).d.sum()
Result:
g c
2011-01-31 panda 0
2011-04-30 shark 2
2011-06-30 panda 2
2011-07-31 panda 0
2011-09-30 panda 1
2011-12-31 python 1
Name: d, dtype: int32

Finding the longest sequence of dates in a dataframe

I'd like to know how to find the longest unbroken sequence of dates (formatted as 2016-11-27) in a publish_date column (dates are not the index, though I suppose they could be).
There are a number of stack overflow questions which are similar, but AFAICT all proposed answers return the size of the longest sequence, which is not what I'm after.
I want to know e.g. that the stretch from 2017-01-01 to 2017-06-01 had no missing dates and was the longest such streak.
Here is an example of how you can do this:
import pandas as pd
import datetime
# initialize data
data = {'a': [1,2,3,4,5,6,7],
'date': ['2017-01-01', '2017-01-03', '2017-01-05', '2017-01-06', '2017-01-07', '2017-01-09', '2017-01-31']}
df = pd.DataFrame(data)
df['date'] = pd.to_datetime(df['date'])
# create mask that indicates sequential pair of days (except the first date)
df['mask'] = 1
df.loc[df['date'] - datetime.timedelta(days=1) == df['date'].shift(),'mask'] = 0
# convert mask to numbers - each sequence have its own number
df['mask'] = df['mask'].cumsum()
# find largest sequence number and get this sequence
res = df.loc[df['mask'] == df['mask'].value_counts().idxmax(), 'date']
# extract min and max dates if you need
min_date = res.min()
max_date = res.max()
# print result
print('min_date: {}'.format(min_date))
print('max_date: {}'.format(max_date))
print('result:')
print(res)
The result will be:
min_date: 2017-01-05 00:00:00
max_date: 2017-01-07 00:00:00
result:
2 2017-01-05
3 2017-01-06
4 2017-01-07
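A more compact variant of the same idea, in case you want the start, end and length of every streak and not just the longest one. This is a sketch on the same sample dates; it sorts and de-duplicates first, which the version above assumes has already been done.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime([
    "2017-01-01", "2017-01-03", "2017-01-05", "2017-01-06",
    "2017-01-07", "2017-01-09", "2017-01-31",
])).sort_values().drop_duplicates()

# A new streak starts wherever the gap to the previous date is not one day.
streak_id = (dates.diff() != pd.Timedelta(days=1)).cumsum()

# One row per streak: first date, last date, number of days.
streaks = dates.groupby(streak_id).agg(["min", "max", "count"])

# idxmax returns the first streak with the largest count if there is a tie.
longest = streaks.loc[streaks["count"].idxmax()]
print(longest)
```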

Problems getting two columns into datetime.datetime format

I have code at the moment written to change two columns of my dataframe from strings into datetime.datetime objects similar to the following:
from datetime import datetime as dt
import pandas as pd
def converter(date):
    date = dt.strptime(date, '%m/%d/%Y %H:%M:%S')
    return date
df = pd.DataFrame({'A':['12/31/9999 0:00:00','1/1/2018 0:00:00'],
'B':['4/1/2015 0:00:00','11/1/2014 0:00:00']})
df['A'] = df['A'].apply(converter)
df['B'] = df['B'].apply(converter)
When I run this code and print the dataframe, it comes out like this
A B
0 9999-12-31 00:00:00 2015-04-01
1 2018-01-01 00:00:00 2014-11-01
When I checked the data types of each column, they read
A object
B datetime64[ns]
But when I check the format of the actual cells of the first row, they read
<class 'datetime.datetime'>
<class 'pandas._libs.tslib.Timestamp'>
After experimenting around, I think I've run into an out of bounds error because of the date '12/31/9999 0:00:00' in column 'A' and this is causing this column to be cast as a datetime.datetime object. My question is how I can also convert column 'B' of my dataframe to a datetime.datetime object so that I can run a query on the columns similar to
df.query('A > B')
without getting an error or the wrong output.
Thanks in advance
Since '9999' is just some dummy year, you can simplify your life by choosing a dummy year which is in bounds (or one that makes more sense given your actual data):
import pandas as pd
df.replace('9999', '2060', regex=True).apply(pd.to_datetime)
Output:
A B
0 2060-12-31 2015-04-01
1 2018-01-01 2014-11-01
A datetime64[ns]
B datetime64[ns]
dtype: object
As @coldspeed points out, it's perhaps better to remove those bad dates:
df.apply(pd.to_datetime, errors='coerce')
# A B
#0 NaT 2015-04-01
#1 2018-01-01 2014-11-01
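If the dummy date really means "no end date" and you still want those rows to compare as later than everything else, one option is to coerce first and then fill the gaps with an in-bounds sentinel. This is a sketch under that assumption: FAR_FUTURE is an arbitrary value I picked, chosen only because it sits below pd.Timestamp.max (2262-04-11 for nanosecond timestamps).

```python
import pandas as pd

df = pd.DataFrame({"A": ["12/31/9999 0:00:00", "1/1/2018 0:00:00"],
                   "B": ["4/1/2015 0:00:00", "11/1/2014 0:00:00"]})

# Hypothetical sentinel standing in for "far in the future".
FAR_FUTURE = pd.Timestamp("2262-01-01")

# Dates that cannot be represented become NaT instead of raising.
converted = df.apply(pd.to_datetime, format="%m/%d/%Y %H:%M:%S",
                     errors="coerce")

# Replace any NaT with the sentinel so comparisons behave as intended.
converted = converted.fillna(FAR_FUTURE)

result = converted.query("A > B")
print(result)
```

Both columns end up with a datetime64 dtype, so the query compares like with like.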