How to group field by id and find the sum? - pandas

I have the following data
id starting_point ending_point Date
A 2525 6565 25/05/2017 13:25:00
B 5656 8989 25/01/2017 10:55:00
A 1234 5656 20/05/2017 03:20:00
A 4562 6245 01/02/2017 19:45:00
B 6496 9999 06/12/2016 21:55:00
B 1122 2211 20/03/2017 18:30:00
How to group the data by their id in the ascending order of date and find the sum of first stating point and last starting point. In this case,
Expected output is :
id starting_point ending_point Date Value
A 4562 6245 01/02/2017 19:45:00
A 1234 5656 20/05/2017 03:20:00
A 2525 6565 25/05/2017 13:25:00 4532 + 6565 = 11127
B 6496 9999 06/12/2016 21:55:00
B 1122 2211 20/03/2017 18:30:00 6496 + 2211 = 8707

IIUC:
In [146]: x.groupby('id').apply(lambda df: df['starting_point'].head(1).values[0]
+ df['ending_point'].tail(1).values[0])
Out[146]:
id
A 8770
B 7867
dtype: int64

Related

Merging two series with alternating dates into one grouped Pandas dataframe

Given are two series, like this:
#period1
DATE
2020-06-22 310.62
2020-06-26 300.05
2020-09-23 322.64
2020-10-30 326.54
#period2
DATE
2020-06-23 312.05
2020-09-02 357.70
2020-10-12 352.43
2021-01-25 384.39
These two series are correlated to each other, i.e. they each mark either the beginning or the end of a date period. The first series marks the end of a period1 period, the second series marks the end of period2 period. The end of a period2 period is at the same time also the start of a period1 period, and vice versa.
I've been looking for a way to aggregate these periods as date ranges, but apparently this is not easily possible with Pandas dataframes. Suggestions extremely welcome.
In the easiest case, the output layout should reflect the end dates of periods, which period type it was, and the amount of change between start and stop of the period.
Explicit output:
DATE CHG PERIOD
2020-06-22 NaN 1
2020-06-23 1.43 2
2020-06-26 12.0 1
2020-09-02 57.65 2
2020-09-23 35.06 1
2020-10-12 29.79 2
2020-10-30 25.89 1
2021-01-25 57.85 2
However, if there is any possibility of actually grouping by a date range consisting of start AND stop date, that would be much more favorable
Thank you!
p1 = pd.DataFrame(data={'Date': ['2020-06-22', '2020-06-26', '2020-09-23', '2020-10-30'], 'val':[310.62, 300.05, 322.64, 326.54]})
p2 = pd.DataFrame(data={'Date': ['2020-06-23', '2020-09-02', '2020-10-12', '2021-01-25'], 'val':[312.05, 357.7, 352.43, 384.39]})
p1['period'] = 1
p2['period'] = 2
df = p1.append(p2).sort_values('Date').reset_index(drop=True)
df['CHG'] = abs(df['val'].diff(periods=1))
df.drop('val', axis=1)
Output:
Date period CHG
0 2020-06-22 1 NaN
1 2020-06-23 2 1.43
2 2020-06-26 1 12.00
3 2020-09-02 2 57.65
4 2020-09-23 1 35.06
5 2020-10-12 2 29.79
6 2020-10-30 1 25.89
7 2021-01-25 2 57.85
EDIT: matching the format START - STOP - CHANGE - PERIOD
Starting from the above data frame:
df['Start'] = df.Date.shift(periods=1)
df.rename(columns={'Date': 'Stop'}, inplace=True)
df = df1[['Start', 'Stop', 'CHG', 'period']]
df
Output:
Start Stop CHG period
0 NaN 2020-06-22 NaN 1
1 2020-06-22 2020-06-23 1.43 2
2 2020-06-23 2020-06-26 12.00 1
3 2020-06-26 2020-09-02 57.65 2
4 2020-09-02 2020-09-23 35.06 1
5 2020-09-23 2020-10-12 29.79 2
6 2020-10-12 2020-10-30 25.89 1
7 2020-10-30 2021-01-25 57.85 2
# If needed:
df1.index = pd.to_datetime(df1.index)
df2.index = pd.to_datetime(df2.index)
df = pd.concat([df1, df2], axis=1)
df.columns = ['start','stop']
df['CNG'] = df.bfill(axis=1)['start'].diff().abs()
df['PERIOD'] = 1
df.loc[df.stop.notna(), 'PERIOD'] = 2
df = df[['CNG', 'PERIOD']]
print(df)
Output:
CNG PERIOD
Date
2020-06-22 NaN 1
2020-06-23 1.43 2
2020-06-26 12.00 1
2020-09-02 57.65 2
2020-09-23 35.06 1
2020-10-12 29.79 2
2020-10-30 25.89 1
2021-01-25 57.85 2
2021-01-29 14.32 1
2021-02-12 22.57 2
2021-03-04 15.94 1
2021-05-07 45.42 2
2021-05-12 16.71 1
2021-09-02 47.78 2
2021-10-04 24.55 1
2021-11-18 41.09 2
2021-12-01 19.23 1
2021-12-10 20.24 2
2021-12-20 15.76 1
2022-01-03 22.73 2
2022-01-27 46.47 1
2022-02-09 26.30 2
2022-02-23 35.59 1
2022-03-02 15.94 2
2022-03-08 21.64 1
2022-03-29 45.30 2
2022-04-29 49.55 1
2022-05-04 17.06 2
2022-05-12 36.72 1
2022-05-17 15.98 2
2022-05-19 18.86 1
2022-06-02 27.93 2
2022-06-17 51.53 1

Resample 10D but until end of months

I would like to resample a DataFrame with frequences of 10D but cutting the last decade always at the end of the month.
ES:
print(df)
 data
index
2010-01-01 145.08
2010-01-02 143.69
2010-01-03 101.06
2010-01-04 57.63
2010-01-05 65.46
...
2010-02-24 48.06
2010-02-25 87.41
2010-02-26 71.97
2010-02-27 73.1
2010-02-28 41.43
Apply something like df.resample('10DM').mean()
data
index
2010-01-10 97.33
2010-01-20 58.58
2010-01-31 41.43
2010-02-10 35.17
2010-02-20 32.44
2010-02-28 55.44
note that the 1st and 2nd decades are normal 10D resample, but the 3rd can be 8-9-10-11 days based on month and year.
Thanks in advance.
Sample data (easy to check):
# df = pd.DataFrame({"value": np.arange(1, len(dti)+1)}, index=dti)
>>> df
value
2010-01-01 1
2010-01-02 2
2010-01-03 3
2010-01-04 4
2010-01-05 5
...
2010-02-24 55
2010-02-25 56
2010-02-26 57
2010-02-27 58
2010-02-28 59
You need to create groups by (days, month, year):
grp = df.groupby([pd.cut(df.index.day, [0, 10, 20, 31]),
pd.Grouper(freq='M'),
pd.Grouper(freq='Y')])
Now you can compute the mean for each group:
out = grp['value'].apply(lambda x: (x.index.max(), x.mean())).apply(pd.Series) \
.reset_index(drop=True).rename(columns={0:'date', 1:'value'}) \
.set_index('date').sort_index()
Output result:
>>> out
value
date
2010-01-10 5.5
2010-01-20 15.5
2010-01-31 26.0
2010-02-10 36.5
2010-02-20 46.5
2010-02-28 55.5

pandas grouper int by frequency

I would like to group a Pandas dataframe by hour disregarding the date.
My data:
id opened_at count sum
2016-07-01 07:02:05 1 46.14
154 2016-07-01 07:34:02 1 479
2016-07-01 10:10:01 1 127.14
2016-07-02 12:01:04 1 8.14
2016-07-02 12:00:50 1 18.14
I am able to group by hour with date taken into account by using the following.
groupByLocationDay = df.groupby([df.id,
pd.Grouper(key='opened_at', freq='3h')])
I get the following
id opened_at count sum
2016-07-01 06:00:00 2 4296.14
154 2016-07-01 09:00:00 46 43716.79
2016-07-01 12:00:00 169 150827.14
2016-07-02 12:00:00 17 1508.14
2016-07-02 09:00:00 10 108.14
How can I group by hour only, so that it would look like the following.
id opened_at count sum
06:00:00 2 4296.14
154 09:00:00 56 43824.93
12:00:00 203 152335.28
The original data is on hourly basis, thus I need to get 3h frequency.
Thanks!
you can do it this way:
In [134]: df
Out[134]:
id opened_at count sum
0 154 2016-07-01 07:02:05 1 46.14
1 154 2016-07-01 07:34:02 1 479.00
2 154 2016-07-01 10:10:01 1 127.14
3 154 2016-07-02 12:01:04 1 8.14
4 154 2016-07-02 12:00:50 1 18.14
5 154 2016-07-02 08:34:02 1 479.00
In [135]: df.groupby(['id', df.opened_at.dt.hour // 3 * 3]).sum()
Out[135]:
count sum
id opened_at
154 6 3 1004.14
9 1 127.14
12 2 26.28

Would like to return a fake row if there is no match to my pair (for a year)

I would like to clean up some data returned from a query. This query :
select seriesId,
startDate,
reportingCountyId,
countyId,
countyName,
pocId,
pocValue
from someTable
where seriesId = 147
and pocid = 2
and countyId in (2033,2040)
order by startDate
usually returns 2 county matches for all years:
seriesId startDate reportingCountyId countyId countyName pocId pocValue
147 2004-01-01 00:00:00.000 6910 2040 CountyOne 2 828
147 2005-01-01 00:00:00.000 2998 2033 CountyTwo 2 4514
147 2005-01-01 00:00:00.000 3000 2040 CountyOne 2 2446
147 2006-01-01 00:00:00.000 3018 2033 CountyTwo 2 5675
147 2006-01-01 00:00:00.000 4754 2040 CountyOne 2 2265
147 2007-01-01 00:00:00.000 3894 2033 CountyTwo 2 6250
147 2007-01-01 00:00:00.000 3895 2040 CountyOne 2 2127
147 2008-01-01 00:00:00.000 4842 2033 CountyTwo 2 5696
147 2008-01-01 00:00:00.000 4846 2040 CountyOne 2 2013
147 2009-01-01 00:00:00.000 6786 2033 CountyTwo 2 2578
147 2009-01-01 00:00:00.000 6817 2040 CountyTwo 2 1933
147 2010-01-01 00:00:00.000 6871 2040 CountyOne 2 1799
147 2010-01-01 00:00:00.000 6872 2033 CountyTwo 2 4223
147 2011-01-01 00:00:00.000 8314 2033 CountyTwo 2 3596
147 2011-01-01 00:00:00.000 8315 2040 CountyOne 2 1559
But note please that the first entry has only CountyOne for 2004. I would like to return a fake row for CountyTwo for a graph I am doing. It would be sufficient to fill it like CountyOne only with pocValue = 0.
thanks!!!!!!!!
Try this (if you need blank row for that countryid)
; with CTE AS
(SELECT 2033 As CountryID UNION SELECT 2040),
CTE2 AS
(
seriesId, startDate, reportingCountyId,
countyId, countyName, pocId, pocValue
from someTable where
seriesId = 147 and pocid = 2 and countyId in (2033,2040)
order by startDate
)
SELECT x1.CountyId, x2.*, IsNull(pocValue,0) NewpocValue FROM CTE x
LEFT OUTER JOIN CTE2 x2 ON x1.CountyId = x2.reportingCountyId

How to Split Time and calculate time difference in sql server 2005?

i want to split the time and calculate time difference using sql server 2005
my default output is like this:
EnrollNo AttDateFirst AttDateLast
111 2011-12-09 08:46:00.000 2011-12-09 08:46:00.000
112 2011-12-09 08:40:00.000 2011-12-09 17:30:00.000
302 2011-12-09 09:00:00.000 2011-12-09 18:30:00.000
303 2011-12-09 10:00:00.000 2011-12-09 18:35:00.000
I want my new output to be like this:
Enroll No ..... FirtTime LastTime Time Diff
111 ..... 8:46:00 8:45:00 00:00:00
112 ..... 8:30:00 17:30:00 9:00:00
302 ..... 9:00:00 18:30:00 9:30:00
303 ..... 10:00:00 18:35:00 8:35:00
You can use this query:
select EnrollNo, convert(varchar, AttDateFirst, 8) as FirstTime,
convert(varchar, AttDateLast, 8) as LastTime,
convert(varchar, AttDateLast - AttDateFirst, 8) as [Time Diff]
from YourTable
to return the following results:
EnrollNo FirstTime LastTime Time Diff
----------- ------------------------------ ------------------------------ ------------------------------
111 08:46:00 08:46:00 00:00:00
112 08:30:00 17:30:00 09:00:00
302 09:00:00 18:30:00 09:30:00
303 10:00:00 18:35:00 08:35:00
you can use
select DATEDIFF(day,2007-11-30,2007-11-20) AS NumberOfDays,
DATEDIFF(hour,2007-11-30,2007-11-20) AS NumberOfHours,
DATEDIFF(minute,2007-11-30,2007-11-20) AS NumberOfMinutes from
test_table
to split u can use
substring(AttDateFirst,charindex(' ',AttDateFirst)+1 ,
len(AttDateFirst)) as [FirstTime]