I have a pandas DataFrame df that looks like this:
date A B name
2022-06-25 04:00:00 700 532 aa
2022-06-25 05:00:00 1100 433 aa
2022-06-25 06:00:00 800 754 aa
2022-06-25 07:00:00 1200 288 aa
2022-06-25 08:00:00 700 643 bb
2022-06-25 09:00:00 1400 668 bb
2022-06-25 10:00:00 1600 286 bb
.....
2022-06-03 11:00:00 397 46 cc
2022-06-03 12:00:00 100 7 cc
2022-06-03 13:00:00 400 25 cc
2022-06-03 14:00:00 500 41 cc
2022-06-03 15:00:00 400 0 cc
2022-06-03 16:00:00 300 23 dd
2022-06-03 17:00:00 500 50 dd
2022-06-03 18:00:00 300 0 dd
2022-06-03 19:00:00 400 15 dd
I'm trying to produce time series plots, as line charts, for each name:
number of daily A vs date
number of daily B vs date
I was able to do so and get a plot for each name like this:
df.groupby('name').plot(x='date', figsize=(24,12))
But I couldn't specify a title for each plot with plt.title(""), and I also couldn't automatically save each plot with plt.savefig('name.png'), because they are all produced at once.
Is there another way to produce the plots so I can set the title and save each plot automatically?
Thank you
One way to do what you require is to put the plot creation and plot saving inside a for loop over the groups. That lets you set a title for each figure via the title argument, and also use savefig to save each of the plots individually. The updated code and one of the output graphs are shown below. Note that I am using name as the title and saving each figure as Myplot <name>.png.
import matplotlib.pyplot as plt

for name, group in df.groupby('name'):
    group.plot(x='date', title=name)  # one figure per group, titled with its name
    plt.savefig('Myplot {}.png'.format(name), bbox_inches='tight')
(Screenshot: the plot produced for the first name.)
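If you generate many plots in one loop, it can also help to grab the Axes object that pandas returns, save through its figure, and close each figure after saving. A small variant of the same idea (the naming is mine):

import matplotlib.pyplot as plt

for name, group in df.groupby('name'):
    ax = group.plot(x='date', title=name, figsize=(24, 12))  # pandas returns the Axes
    ax.figure.savefig('Myplot {}.png'.format(name), bbox_inches='tight')
    plt.close(ax.figure)  # free the figure so memory doesn't grow with many names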
I have a table with individual records and another table which holds historical information about the individuals in the former.
I want to extract information about the individuals from the second table. Both tables have a timestamp, and it is very important that the historical information happened before the record in the first table.
Date_Time name
0 2021-09-06 10:46:00 Leg It Liam
1 2021-09-06 10:46:00 Hollyhill Island
2 2021-09-06 10:46:00 Shani El Bolsa
3 2021-09-06 10:46:00 Kilbride Fifi
4 2021-09-06 10:46:00 Go
2100 2021-10-06 11:05:00 Slaneyside Babs
2101 2021-10-06 11:05:00 Hillview Joe
2102 2021-10-06 11:05:00 Fairway Flyer
2103 2021-10-06 11:05:00 Whiteys Surprise
2104 2021-10-06 11:05:00 Astons Lucy
The name is the variable by which you connect the two tables:
Date_Time name cc
13 2021-09-15 12:16:00 Hollyhill Island 6.00
14 2021-09-06 10:46:00 Hollyhill Island 4.50
15 2021-05-30 18:28:00 Hollyhill Island 3.50
16 2021-05-25 10:46:00 Hollyhill Island 2.50
17 2021-05-18 12:46:00 Hollyhill Island 2.38
18 2021-04-05 12:31:00 Hollyhill Island 3.50
19 2021-04-28 12:16:00 Hollyhill Island 3.75
I want to add aggregated data from this table to the first, such as the cc mean and count.
Date_Time name
1 2021-09-06 10:46:00 Hollyhill Island
To this line I would add 5 for the cc count and 3.126 for the cc mean. Remember, the historical records need to be before the date-time of the individual record.
I am a bit confused about how to do this efficiently; I know I need to use groupby on the historical data.
Also, the individual records usually come in groups with the same Date_Time, if that makes it any easier.
IIUC (if I understand correctly), try:

out = df1.merge(df2, on='name', suffixes=('', '_y'))
# merge both dataframes on name
out = out.mask(out['Date_Time'] <= out['Date_Time_y']).dropna()
# keep only rows whose historical record is strictly earlier
out = out.groupby(['Date_Time', 'name'])['cc'].agg(['count', 'mean']).reset_index()
# aggregate the filtered values
output of out:
Date_Time name count mean
0 2021-09-06 10:46:00 Hollyhill Island 5 3.126
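For reference, here is a self-contained sketch of the same approach on an abbreviated version of the tables above, using a plain boolean filter in place of mask/dropna:

import pandas as pd

# Individual records (df1) and historical records (df2), abbreviated from above
df1 = pd.DataFrame({'Date_Time': pd.to_datetime(['2021-09-06 10:46:00']),
                    'name': ['Hollyhill Island']})
df2 = pd.DataFrame({'Date_Time': pd.to_datetime(['2021-09-15 12:16:00', '2021-09-06 10:46:00',
                                                 '2021-05-30 18:28:00', '2021-05-25 10:46:00',
                                                 '2021-05-18 12:46:00', '2021-04-05 12:31:00',
                                                 '2021-04-28 12:16:00']),
                    'name': ['Hollyhill Island'] * 7,
                    'cc': [6.00, 4.50, 3.50, 2.50, 2.38, 3.50, 3.75]})

out = df1.merge(df2, on='name', suffixes=('', '_y'))  # every record/history pair
out = out[out['Date_Time'] > out['Date_Time_y']]      # history strictly before the record
out = out.groupby(['Date_Time', 'name'])['cc'].agg(['count', 'mean']).reset_index()
print(out)  # count 5, mean 3.126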
I tried creating a new column to categorize different time frames using np.select. However, Python throws a shape-mismatch error, and I'm not sure how to correct it.
For your logic, it's simplest to use the hour attribute of the datetime:
import numpy as np
import pandas as pd

s = pd.Series(pd.date_range("1-Apr-2021", "now", freq="4H"), name="start_date")
(s.to_frame()
 .join(pd.Series(np.select([s.dt.hour.between(1, 6),
                            s.dt.hour.between(7, 12)],
                           [1, 2], 0), name="cat"))
 .head(8)
)
   start_date           cat
0  2021-04-01 00:00:00    0
1  2021-04-01 04:00:00    1
2  2021-04-01 08:00:00    2
3  2021-04-01 12:00:00    2
4  2021-04-01 16:00:00    0
5  2021-04-01 20:00:00    0
6  2021-04-02 00:00:00    0
7  2021-04-02 04:00:00    1
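If you want the category as a column on your own frame, the same np.select call can be assigned directly. A minimal sketch (the frame and column names here are stand-ins for yours):

import numpy as np
import pandas as pd

# Hypothetical frame standing in for the asker's data
df = pd.DataFrame({"start_date": pd.date_range("2021-04-01", periods=8, freq="4H")})

hour = df["start_date"].dt.hour
conditions = [hour.between(1, 6), hour.between(7, 12)]  # one condition per category
choices = [1, 2]                                        # must match conditions in length
df["cat"] = np.select(conditions, choices, default=0)   # everything else falls to 0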
I have a dataframe that contains a row for every minute of a year.
I need to simplify it to an hourly basis: keep only the hours of the year, with the maximum of the Reserved and Used columns for each hour.
I made this, which works, but not entirely for my purposes:
df = df.assign(date=df.date.dt.round('H'))  # round each timestamp to the nearest hour
df1 = df.groupby('date').agg({'Reserved': ['max'], 'Used': ['max']}).droplevel(1, axis=1).reset_index()
which just groups the minutes into hours.
date Reserved Used
0 2020-01-01 00:00:00 2176 0.0
1 2020-01-01 01:00:00 2176 0.0
2 2020-01-01 02:00:00 2176 0.0
3 2020-01-01 03:00:00 2176 0.0
4 2020-01-01 04:00:00 2176 0.0
... ... ... ...
8780 2020-12-31 20:00:00 3450 50.0
8781 2020-12-31 21:00:00 3450 0.0
8782 2020-12-31 22:00:00 3450 0.0
8783 2020-12-31 23:00:00 3450 0.0
8784 2021-01-01 00:00:00 3450 0.0
Now I need to group it further to plot several curves, each containing only 24 points (one per hour of the day), based on several criteria:
average Used and Reserved for the whole year (so group together every 00 hour, every 01 hour, etc.)
average Used and Reserved for every month (so group every 00 hour, 01 hour, etc. for each month individually)
average Used and Reserved for weekdays and for weekends
I know this is just a groupby similar to the one before, but I somehow miss the logic of doing it.
Could anybody help?
Thanks.
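A minimal sketch of the three groupings, assuming the hourly df1 produced above (the variable names below are my own):

hour = df1['date'].dt.hour.rename('hour')                        # 0..23
month = df1['date'].dt.month.rename('month')                     # 1..12
is_weekend = (df1['date'].dt.dayofweek >= 5).rename('weekend')   # Saturday=5, Sunday=6

# 1) whole-year average per hour of day -> 24 rows
hourly = df1.groupby(hour)[['Reserved', 'Used']].mean()

# 2) per-month average per hour of day -> one 24-point curve per month
monthly = df1.groupby([month, hour])[['Reserved', 'Used']].mean()

# 3) weekday vs weekend average per hour of day
daytype = df1.groupby([is_weekend, hour])[['Reserved', 'Used']].mean()

hourly.plot()                            # 24-point curves for the whole year
monthly['Used'].unstack('month').plot()  # one 24-point Used curve per month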
I'm trying to do a weekly forecast in FBProphet for just 5 weeks ahead. The make_future_dataframe method doesn't seem to be working right: it makes the correct one-week intervals except for the gap between Jul 3 and Jul 5; every other interval is correct at 7 days, or a week. Code and output below:
INPUT DATAFRAME
ds y
548 2010-01-01 3117
547 2010-01-08 2850
546 2010-01-15 2607
545 2010-01-22 2521
544 2010-01-29 2406
... ... ...
4 2020-06-05 2807
3 2020-06-12 2892
2 2020-06-19 3012
1 2020-06-26 3077
0 2020-07-03 3133
CODE
future = m.make_future_dataframe(periods=5, freq='W')
future.tail(9)
OUTPUT
ds
545 2020-06-12
546 2020-06-19
547 2020-06-26
548 2020-07-03
549 2020-07-05
550 2020-07-12
551 2020-07-19
552 2020-07-26
553 2020-08-02
All you need to do is create a dataframe with the dates you need for the predict method; using make_future_dataframe is not necessary. (The odd gap comes from freq='W', which pandas anchors on Sundays: the first generated date after 2020-07-03, a Friday, is Sunday 2020-07-05.)
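A minimal sketch of that, assuming m is the fitted Prophet model from the question (variable names are mine):

import pandas as pd

last_date = pd.Timestamp('2020-07-03')  # final ds in the input frame above
# freq='7D' keeps exact 7-day steps instead of snapping to Sundays like 'W'
future = pd.DataFrame({'ds': pd.date_range(last_date + pd.Timedelta(weeks=1),
                                           periods=5, freq='7D')})
print(future)
# forecast = m.predict(future)  # m: the fitted Prophet model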
I am a newbie in Python. I have a huge dataframe with millions of rows and IDs. My data looks like this:
Time ID X Y
8:00 A 23 100
9:00 B 24 110
10:00 B 25 120
11:00 C 26 130
12:00 C 27 140
13:00 A 28 150
14:00 A 29 160
15:00 D 30 170
16:00 C 31 180
17:00 B 32 190
18:00 A 33 200
19:00 C 34 210
20:00 A 35 220
21:00 B 36 230
22:00 C 37 240
23:00 B 38 250
I sorted the data by ID and time:
Time ID X Y
8:00 A 23 100
13:00 A 28 150
14:00 A 29 160
18:00 A 33 200
20:00 A 35 220
9:00 B 24 110
10:00 B 25 120
17:00 B 32 190
21:00 B 36 230
23:00 B 38 250
11:00 C 26 130
12:00 C 27 140
16:00 C 31 180
19:00 C 34 210
22:00 C 37 240
15:00 D 30 170
and I want to pick only the first and the last row for each ID and eliminate the rest. The result should look like this:
Time ID X Y
8:00 A 23 100
20:00 A 35 220
9:00 B 24 110
23:00 B 38 250
11:00 C 26 130
22:00 C 37 240
15:00 D 30 170
I used this code:
df = pd.read_csv("contoh.csv")
g = df.groupby('ID')
(pd.concat([g.head(1), g.tail(1)])
.drop_duplicates()
.sort_values('ID')
.reset_index(drop=True))
It works, but I cannot save the result to csv with
g.to_csv('result.csv')
I got an error message: Cannot access callable attribute 'to_csv' of 'DataFrameGroupBy' objects, try using the 'apply' method
Any advice? Thank you.
When you use the concat function, its output is never assigned, so g is still the DataFrameGroupBy object, which has no to_csv method. You need to assign the concat function's output to another object and save that:
df = pd.read_csv("contoh.csv")
g = df.groupby('ID')
g_1 = (pd.concat([g.head(1), g.tail(1)])  # first and last row of each group
       .drop_duplicates()                 # a single-row group contributes only one row
       .sort_values('ID')
       .reset_index(drop=True))
g_1.to_csv('result.csv')
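One small follow-up: to_csv writes the numeric index as an extra first column by default; pass index=False if you only want the original columns:

g_1.to_csv('result.csv', index=False)  # omit the index column from the file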