How to extract the last value of a group in a pandas DataFrame

I have a huge dataset that needs to be grouped by its 'ID' column, and only the last value of each ID needs to be exported into a SINGLE csv/excel file.
incl = ['A', 'B', 'C']
for k, g in df[df['ID'].isin(incl)].groupby('ID'):
    g.tail(1).to_csv(f'{k}.csv')
I have tried this, but it creates a separate csv file for each ID instead of one big file containing the last value of each group.
Sample data:
ID Date Open High Low
30 UNITY 2020-06-18 11.50 11.75 11.41
31 UNITY 2020-06-21 11.44 11.50 10.88
32 UNITY 2020-06-22 11.26 11.78 11.26
33 UNITY 2020-06-23 11.72 12.08 11.53
34 UNITY 2020-06-24 11.51 11.59 11.40
35 UNITY 2020-06-25 11.85 11.85 11.11
36 SSOM 2020-05-03 27.50 27.95 27.00
37 SSOM 2020-05-05 27.50 27.50 27.50
38 SSOM 2020-05-06 29.20 29.56 29.20
39 SSOM 2020-05-07 31.77 31.77 31.77
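A minimal sketch of one way to do this (file names are hypothetical): groupby(...).tail(1) already returns a single DataFrame containing the last row of each group, so you can write it once instead of looping:
import pandas as pd

df = pd.read_csv('data.csv')  # hypothetical input file
incl = ['A', 'B', 'C']

# tail(1) on the groupby keeps the last row of every ID in one frame,
# so a single to_csv call produces one combined file.
last_rows = df[df['ID'].isin(incl)].groupby('ID').tail(1)
last_rows.to_csv('last_values.csv', index=False)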

Related

Diff() function use with groupby for pandas

I am encountering errors each time I attempt to compute the difference in readings for a meter in my dataset. The dataset structure is as follows:
id paymenttermid houseid houseid-meterid quantity month year cleaned_quantity
Datetime
2019-02-01 255 water 215 215M201 23.0 2 2019 23.0
2019-02-01 286 water 193 193M181 24.0 2 2019 24.0
2019-02-01 322 water 172 172M162 22.0 2 2019 22.0
2019-02-01 323 water 176 176M166 61.0 2 2019 61.0
2019-02-01 332 water 158 158M148 15.0 2 2019 15.0
I am attempting to generate a new column called consumption that computes the difference in quantities consumed for each house (identified by houseid-meterid) after every month of the year.
The code I am using to implement this is:
water_df["consumption"] = water_df.groupby(["year", "month", "houseid-meterid"])["cleaned_quantity"].diff(-1)
After executing this code, the consumption column is filled with NaN values. How can I correctly implement this logic?
The end result looks like this:
id paymenttermid houseid houseid-meterid quantity month year cleaned_quantity consumption
Datetime
2019-02-01 255 water 215 215M201 23.0 2 2019 23.0 NaN
2019-02-01 286 water 193 193M181 24.0 2 2019 24.0 NaN
2019-02-01 322 water 172 172M162 22.0 2 2019 22.0 NaN
2019-02-01 323 water 176 176M166 61.0 2 2019 61.0 NaN
2019-02-01 332 water 158 158M148 15.0 2 2019 15.0 NaN
Many thanks in advance.
I have attempted to use
water_df["consumption"] = water_df.groupby(["year", "month", "houseid-meterid"])["cleaned_quantity"].diff(-1)
and
water_df["consumption"] = water_df.groupby(["year", "month", "houseid-meterid"])["cleaned_quantity"].diff(0)
and
water_df["consumption"] = water_df.groupby(["year", "month", "houseid-meterid"])["cleaned_quantity"].diff()
All of these commands result in the same behaviour as stated above.
Expected output should be:
Datetime houseid-meterid cleaned_quantity consumption
2019-02-01 215M201 23.0 20
2019-03-02 215M201 43.0 9
2019-04-01 215M201 52.0 12
2019-05-01 215M201 64.0 36
2019-06-01 215M201 100.0 20
What steps should I take?
Your groupby(["year", "month", "houseid-meterid"]) puts each monthly reading in its own single-row group, so diff has nothing to compare against and returns NaN. Instead, sort the values by Datetime (if needed), group by houseid-meterid only, compute the diff of the cleaned_quantity values, then shift the rows to align the result with the right data:
df['consumption'] = (df.sort_values('Datetime')
                       .groupby('houseid-meterid')['cleaned_quantity']
                       .transform(lambda x: x.diff().shift(-1)))
print(df)
# Output
Datetime houseid-meterid cleaned_quantity consumption
0 2019-02-01 215M201 23.0 20.0
1 2019-03-02 215M201 43.0 9.0
2 2019-04-01 215M201 52.0 12.0
3 2019-05-01 215M201 64.0 36.0
4 2019-06-01 215M201 100.0 NaN
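Equivalently, the lambda can be avoided: within each group, diff().shift(-1) is the negated diff(-1). A small sketch under the same assumptions:
# diff(-1) computes current minus next row; multiplying by -1 gives
# next minus current, matching the transform above.
df['consumption'] = (df.sort_values('Datetime')
                       .groupby('houseid-meterid')['cleaned_quantity']
                       .diff(-1)
                       .mul(-1))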

Moving Average Pandas Across Group

My data has the following structure:
import numpy as np
import pandas as pd

np.random.seed(25)
tdf = pd.DataFrame({'person_id' :[1,1,1,1,
2,2,
3,3,3,3,3,
4,4,4,
5,5,5,5,5,5,5,
6,
7,7,
8,8,8,8,8,8,8,
9,9,
10,10
],
'Date': ['2021-01-02','2021-01-05','2021-01-07','2021-01-09',
'2021-01-02','2021-01-05',
'2021-01-02','2021-01-05','2021-01-07','2021-01-09','2021-01-11',
'2021-01-02','2021-01-05','2021-01-07',
'2021-01-02','2021-01-05','2021-01-07','2021-01-09','2021-01-11','2021-01-13','2021-01-15',
'2021-01-02',
'2021-01-02','2021-01-05',
'2021-01-02','2021-01-05','2021-01-07','2021-01-09','2021-01-11','2021-01-13','2021-01-15',
'2021-01-02','2021-01-05',
'2021-01-02','2021-01-05'
],
'Quantity': np.floor(np.random.random(size=35)*100)
})
And I want to calculate a 2-period moving average over Date. For the first MA value, we take 2021-01-02 & 2021-01-05 across all observations and calculate the MA (50). Similarly for the other dates. The output need not be in the structure I'm showing in my report; I just need the date and MA columns in the final data.
Thanks!
IIUC, you can aggregate the duplicated dates first, getting the sum and count per date.
Then take the rolling sum over 2 dates (it doesn't look like you want a defined time period here, but rather raw successive values, so I am assuming prior sorting).
Finally, take the ratio of sum to count to get the mean:
g = tdf.groupby('Date')['Quantity']
out = g.sum().rolling(2).sum()/g.count().rolling(2).sum()
output:
Date
2021-01-02 NaN
2021-01-05 50.210526
2021-01-07 45.071429
2021-01-09 41.000000
2021-01-11 44.571429
2021-01-13 48.800000
2021-01-15 50.500000
Name: Quantity, dtype: float64
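Note the design choice here: rolling the per-date means instead would weight every date equally regardless of how many rows it has, i.e. a mean of means rather than the pooled mean. A quick check you can run on the same tdf:
# Mean per date, then a rolling mean of those means; this generally
# differs from the pooled sum/count ratio whenever date counts differ.
naive = tdf.groupby('Date')['Quantity'].mean().rolling(2).mean()
print(naive)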
Joining back to the original data:
g = tdf.groupby('Date')['Quantity']
s = g.sum().rolling(2).sum()/g.count().rolling(2).sum()
tdf.merge(s.rename('Quantity_MA(2)'), left_on='Date', right_index=True)
output:
person_id Date Quantity Quantity_MA(2)
0 1 2021-01-02 87.0 NaN
4 2 2021-01-02 41.0 NaN
6 3 2021-01-02 68.0 NaN
11 4 2021-01-02 11.0 NaN
14 5 2021-01-02 16.0 NaN
21 6 2021-01-02 51.0 NaN
22 7 2021-01-02 38.0 NaN
24 8 2021-01-02 51.0 NaN
31 9 2021-01-02 90.0 NaN
33 10 2021-01-02 45.0 NaN
1 1 2021-01-05 58.0 50.210526
5 2 2021-01-05 11.0 50.210526
7 3 2021-01-05 43.0 50.210526
12 4 2021-01-05 44.0 50.210526
15 5 2021-01-05 52.0 50.210526
23 7 2021-01-05 99.0 50.210526
25 8 2021-01-05 55.0 50.210526
32 9 2021-01-05 66.0 50.210526
34 10 2021-01-05 28.0 50.210526
2 1 2021-01-07 27.0 45.071429
8 3 2021-01-07 55.0 45.071429
13 4 2021-01-07 58.0 45.071429
16 5 2021-01-07 32.0 45.071429
26 8 2021-01-07 3.0 45.071429
3 1 2021-01-09 18.0 41.000000
9 3 2021-01-09 36.0 41.000000
17 5 2021-01-09 69.0 41.000000
27 8 2021-01-09 71.0 41.000000
10 3 2021-01-11 40.0 44.571429
18 5 2021-01-11 36.0 44.571429
28 8 2021-01-11 42.0 44.571429
19 5 2021-01-13 83.0 48.800000
29 8 2021-01-13 43.0 48.800000
20 5 2021-01-15 48.0 50.500000
30 8 2021-01-15 28.0 50.500000

How to merge records with aggregate historical data?

I have a table with individual records and another which holds historical information about the individuals in the former.
I want to extract information about the individuals from the second table. Both tables have timestamps. It is very important that the historical information happened before the record in the first table.
Date_Time name
0 2021-09-06 10:46:00 Leg It Liam
1 2021-09-06 10:46:00 Hollyhill Island
2 2021-09-06 10:46:00 Shani El Bolsa
3 2021-09-06 10:46:00 Kilbride Fifi
4 2021-09-06 10:46:00 Go
2100 2021-10-06 11:05:00 Slaneyside Babs
2101 2021-10-06 11:05:00 Hillview Joe
2102 2021-10-06 11:05:00 Fairway Flyer
2103 2021-10-06 11:05:00 Whiteys Surprise
2104 2021-10-06 11:05:00 Astons Lucy
The name is the variable by which you connect the two tables:
Date_Time name cc
13 2021-09-15 12:16:00 Hollyhill Island 6.00
14 2021-09-06 10:46:00 Hollyhill Island 4.50
15 2021-05-30 18:28:00 Hollyhill Island 3.50
16 2021-05-25 10:46:00 Hollyhill Island 2.50
17 2021-05-18 12:46:00 Hollyhill Island 2.38
18 2021-04-05 12:31:00 Hollyhill Island 3.50
19 2021-04-28 12:16:00 Hollyhill Island 3.75
I want to add aggregated data from this table to the first, such as the cc mean and count.
Date_Time name
1 2021-09-06 10:46:00 Hollyhill Island
For this line I would add 5 for the cc count and 3.126 for the cc mean. Remember, the historical records need to be before the date time of the individual record.
I am a bit confused about how to do this efficiently. I know I need to groupby the historical data.
Also the individual records are usually in groups of Date_Time, if that makes it any easier.
IIUC:
try:
out = df1.merge(df2, on='name', suffixes=('', '_y'))
# merge both df's on name
out = out.mask(out['Date_Time'] <= out['Date_Time_y']).dropna()
# keep only historical rows that happened strictly before the record
out = out.groupby(['Date_Time', 'name'])['cc'].agg(['count', 'mean']).reset_index()
# aggregate the remaining cc values
output of out:
Date_Time name count mean
0 2021-09-06 10:46:00 Hollyhill Island 5 3.126
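If you then want those aggregates on the individual-record table itself, a small follow-up sketch (same column names as above):
# Left-join so records with no prior history keep NaN for count/mean.
df1 = df1.merge(out, on=['Date_Time', 'name'], how='left')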

Future dates calculating incorrectly in FBProphet - make_future_dataframe method

I'm trying to do a weekly forecast in FBProphet for just 5 weeks ahead. The make_future_dataframe method doesn't seem to be working right: it makes the correct one-week intervals except for the step between Jul 3 and Jul 5; every other interval is correct at 7 days, i.e. one week. Code and output below:
INPUT DATAFRAME
ds y
548 2010-01-01 3117
547 2010-01-08 2850
546 2010-01-15 2607
545 2010-01-22 2521
544 2010-01-29 2406
... ... ...
4 2020-06-05 2807
3 2020-06-12 2892
2 2020-06-19 3012
1 2020-06-26 3077
0 2020-07-03 3133
CODE
future = m.make_future_dataframe(periods=5, freq='W')
future.tail(9)
OUTPUT
ds
545 2020-06-12
546 2020-06-19
547 2020-06-26
548 2020-07-03
549 2020-07-05
550 2020-07-12
551 2020-07-19
552 2020-07-26
553 2020-08-02
All you need to do is create a dataframe with the dates you need for the predict method; using the make_future_dataframe method is not necessary. The odd step happens because freq='W' is pandas' anchored weekly frequency (an alias for W-SUN), so the generated dates snap to Sundays: 2020-07-03 is a Friday, hence the first future date is Sunday 2020-07-05, and every date after that runs Sunday to Sunday.
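A minimal sketch of that workaround, assuming m is the fitted Prophet model from the question: build the dates yourself with the unanchored '7D' frequency so the spacing continues from the last training date.
import pandas as pd

# '7D' is a plain 7-day step, unlike the Sunday-anchored 'W'.
dates = pd.date_range(start='2020-07-03', periods=6, freq='7D')
future = pd.DataFrame({'ds': dates[1:]})  # drop the training date itself
forecast = m.predict(future)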

seaborn multiple variables group bar plot

I have a pandas DataFrame with one index (datetime) and three variables (int):
date A B C
2017-09-05 25 261 31
2017-09-06 261 1519 151
2017-09-07 188 1545 144
2017-09-08 200 2110 232
2017-09-09 292 2391 325
I can create a grouped bar plot with the basic pandas plot:
df.plot(kind='bar', legend=False)
However, I want to plot this in Seaborn or other libraries to improve my skills. I found a very close answer (Pandas: how to draw a bar plot with two categories and four series each?).
Its suggested answer has this code:
ax=sns.barplot(x='', y='', hue='', data=data)
If I apply this code to mine, I do not know what my `y` would be:
ax=sns.barplot(x='date', y=??, hue=??, data=data)
How can I plot multiple variables with Seaborn or other libraries?
I think you need melt if you want to use barplot:
data = df.melt('date', var_name='a', value_name='b')
print (data)
date a b
0 2017-09-05 A 25
1 2017-09-06 A 261
2 2017-09-07 A 188
3 2017-09-08 A 200
4 2017-09-09 A 292
5 2017-09-05 B 261
6 2017-09-06 B 1519
7 2017-09-07 B 1545
8 2017-09-08 B 2110
9 2017-09-09 B 2391
10 2017-09-05 C 31
11 2017-09-06 C 151
12 2017-09-07 C 144
13 2017-09-08 C 232
14 2017-09-09 C 325
ax=sns.barplot(x='date', y='b', hue='a', data=data)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90)
Pandas solution with DataFrame.plot.bar and set_index:
df.set_index('date').plot.bar()
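The pandas one-liner has the same long-x-label issue as the seaborn plot; a small sketch of the equivalent fix (assuming the df from the question):
import matplotlib.pyplot as plt

# rot rotates the x tick labels, matching the seaborn version above.
ax = df.set_index('date').plot.bar(rot=90)
plt.tight_layout()
plt.show()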