I am a newbie in Python. I have a huge dataframe with millions of rows and IDs. My data looks like this:
Time ID X Y
8:00 A 23 100
9:00 B 24 110
10:00 B 25 120
11:00 C 26 130
12:00 C 27 140
13:00 A 28 150
14:00 A 29 160
15:00 D 30 170
16:00 C 31 180
17:00 B 32 190
18:00 A 33 200
19:00 C 34 210
20:00 A 35 220
21:00 B 36 230
22:00 C 37 240
23:00 B 38 250
I sorted the data by ID and time:
Time ID X Y
8:00 A 23 100
13:00 A 28 150
14:00 A 29 160
18:00 A 33 200
20:00 A 35 220
9:00 B 24 110
10:00 B 25 120
17:00 B 32 190
21:00 B 36 230
23:00 B 38 250
11:00 C 26 130
12:00 C 27 140
16:00 C 31 180
19:00 C 34 210
22:00 C 37 240
15:00 D 30 170
I want to keep only the first and the last row of each ID and eliminate the rest. The result looks like this:
Time ID X Y
8:00 A 23 100
20:00 A 35 220
9:00 B 24 110
23:00 B 38 250
11:00 C 26 130
22:00 C 37 240
15:00 D 30 170
I used this code:
df = pd.read_csv("data.csv")
g = df.groupby('ID')
g_1 = pd.concat([g.head(1),g.tail(1)]).drop_duplicates().sort_values('ID').reset_index(drop=True)
g_1.to_csv('result.csv')
But I want to annotate every row as "First" or "Last" in a new column. My expected result looks like this:
Time ID X Y Annotation
8:00 A 23 100 First
20:00 A 35 220 Last
9:00 B 24 110 First
23:00 B 38 250 Last
11:00 C 26 130 First
22:00 C 37 240 Last
15:00 D 30 170
Could anyone help me with this? Any advice would be appreciated, thank you.
You can use groupby with agg and the first and last aggregations; stacking the result gives you the Annotation column directly. As a bonus, this works on the original dataframe, so there is no need to sort:
df.groupby('ID').agg(['first', 'last']).stack().reset_index().rename(columns={'level_1': 'Annotation'})
ID Annotation Time X Y
0 A first 8:00 23 100
1 A last 20:00 35 220
2 B first 9:00 24 110
3 B last 23:00 38 250
4 C first 11:00 26 130
5 C last 22:00 37 240
6 D first 15:00 30 170
7 D last 15:00 30 170
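Note that an ID with a single row (D here) shows up twice, once as first and once as last. If, as in your expected output, you want such IDs to appear only once, you can drop the duplicated data rows afterwards; a minimal sketch, assuming the result above was assigned to out:

out = out.drop_duplicates(subset=['ID', 'Time', 'X', 'Y'])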
No need for groupby; use drop_duplicates after you sort:
df = pd.concat([df.drop_duplicates(['ID']).assign(sign='first'),
                df.drop_duplicates(['ID'], keep='last').assign(sign='last')]).sort_values('ID')
df
Time ID X Y sign
0 8:00 A 23 100 first
4 20:00 A 35 220 last
5 9:00 B 24 110 first
9 23:00 B 38 250 last
10 11:00 C 26 130 first
14 22:00 C 37 240 last
15 15:00 D 30 170 first
15 15:00 D 30 170 last
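Note that this relies on the rows already being in time order within each ID: drop_duplicates simply keeps the first occurrence per ID, or the last with keep='last'. If your frame is not sorted yet, sort it first. A sketch, assuming Time still needs parsing (sorting the raw strings would put '13:00' before '8:00'):

df['Time'] = pd.to_datetime(df['Time'], format='%H:%M').dt.time
df = df.sort_values(['ID', 'Time'])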
Try:
df.groupby('ID').agg(['first','last'])\
.stack(1).reset_index()\
.rename(columns={'level_1':'Annotation'})
Output:
ID Annotation Time X Y
0 A first 8:00 23 100
1 A last 20:00 35 220
2 B first 9:00 24 110
3 B last 23:00 38 250
4 C first 11:00 26 130
5 C last 22:00 37 240
6 D first 15:00 30 170
7 D last 15:00 30 170
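Whichever variant you use, the result is a regular dataframe, so saving it works just like in your own code; for example, with the result assigned to g_1:

g_1.to_csv('result.csv', index=False)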
Related
I have a data frame as shown below, containing sales data for two health care products from December 2016 to November 2018.
product profit sale_date discount
A 50 2016-12-01 5
A 50 2017-01-03 4
B 200 2016-12-24 10
A 50 2017-01-18 3
B 200 2017-01-28 15
A 50 2017-01-18 6
B 200 2017-01-28 20
A 50 2017-04-18 6
B 200 2017-12-08 25
A 50 2017-11-18 6
B 200 2017-08-21 20
B 200 2017-12-28 30
A 50 2018-03-18 10
B 300 2018-06-08 45
B 300 2018-09-20 50
A 50 2018-11-18 8
B 300 2018-11-28 35
From the above, I would like to prepare the dataframe below and plot it as a line plot.
Expected Output
bought_year total_profit
2016 250
2017 1250
2018 1000
X axis = bought_year
Y axis = profit
Use groupby with dt.year, and .agg with a named aggregation to name your column:
# sale_date must be a datetime for .dt.year to work
df['sale_date'] = pd.to_datetime(df['sale_date'])
df1 = df.groupby(df['sale_date'].dt.year).agg(total_profit=('profit','sum'))\
        .reset_index().rename(columns={'sale_date': 'bought_year'})
print(df1)
bought_year total_profit
0 2016 250
1 2017 1250
2 2018 1000
df1.set_index('bought_year').plot(kind='bar')
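Since the question asks for a line plot specifically, a minimal sketch (assuming matplotlib, pandas' default plotting backend):

import matplotlib.pyplot as plt

df1.plot(x='bought_year', y='total_profit', kind='line')
plt.show()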
I am trying to find the 7-day rolling average, by hour of day, for each category. The data frame is indexed on the category id, and there is a timestamp plus other columns:
id name ds time x y z
6 red 2020-02-14 00:00:00 10 20 30
6 red 2020-02-14 01:00:00 20 40 50
6 red 2020-02-14 02:00:00 20 20 60
...
6 red 2020-02-21 00:00:00 20 30 60
6 red 2020-02-21 01:00:00 20 40 60
6 red 2020-02-21 02:00:00 10 40 60
7 green 2020-02-14 00:00:00 10 20 30
7 green 2020-02-14 01:00:00 20 40 50
7 green 2020-02-14 02:00:00 20 20 60
...
7 green 2020-02-21 00:00:00 20 30 60
7 green 2020-02-21 01:00:00 20 40 60
7 green 2020-02-21 02:00:00 10 40 60
What I would like as output (with the rolling columns filled by the rolling mean where not NaN):
id name ds time x y z rolling_x rolling_y rolling_z
6 red 2020-02-14 00:00:00 10 20 30 NaN NaN NaN
6 red 2020-02-14 01:00:00 20 40 50 NaN NaN NaN
6 red 2020-02-14 02:00:00 20 20 60 NaN NaN NaN
...
6 red 2020-02-21 00:00:00 20 30 60
6 red 2020-02-21 01:00:00 20 40 60
6 red 2020-02-21 02:00:00 10 40 60
7 green 2020-02-14 00:00:00 10 20 30 NaN NaN NaN
7 green 2020-02-14 01:00:00 20 40 50 NaN NaN NaN
7 green 2020-02-14 02:00:00 20 20 60 NaN NaN NaN
...
7 green 2020-02-21 00:00:00 20 30 60
7 green 2020-02-21 01:00:00 20 40 60
7 green 2020-02-21 02:00:00 10 40 60
My approach:
# split the timestamp into a normalized day and an hour-of-day column
df = df.assign(day=df['ds time'].dt.normalize(),
               hour=df['ds time'].dt.hour)
# 7-day rolling mean per (id, hour) over the day index, merged back onto the rows
ret_df = df.merge(df.drop('ds time', axis=1)
                    .set_index('day')
                    .groupby(['id', 'hour']).rolling('7D').mean()
                    .drop(['hour', 'id'], axis=1),
                  on=['id', 'hour', 'day'],
                  how='left',
                  suffixes=['', '_roll']
                  ).drop(['day', 'hour'], axis=1)
Sample data:
import numpy as np
import pandas as pd

dates = pd.date_range('2020-02-21', '2020-02-25', freq='H')
np.random.seed(1)
df = pd.DataFrame({
    'id': np.repeat([6, 7], len(dates)),
    'ds time': np.tile(dates, 2),
    'X': np.arange(len(dates) * 2),
    'Y': np.random.randint(0, 10, len(dates) * 2)
})
df.head()
Output ret_df.head():
id ds time X Y X_roll Y_roll
0 6 2020-02-21 00:00:00 0 5 0.0 5.0
1 6 2020-02-21 01:00:00 1 8 1.0 8.0
2 6 2020-02-21 02:00:00 2 9 2.0 9.0
3 6 2020-02-21 03:00:00 3 5 3.0 5.0
4 6 2020-02-21 04:00:00 4 0 4.0 0.0
I am a newbie in Python. I have a huge dataframe with millions of rows and IDs. My data looks like this:
Time ID X Y
8:00 A 23 100
9:00 B 24 110
10:00 B 25 120
11:00 C 26 130
12:00 C 27 140
13:00 A 28 150
14:00 A 29 160
15:00 D 30 170
16:00 C 31 180
17:00 B 32 190
18:00 A 33 200
19:00 C 34 210
20:00 A 35 220
21:00 B 36 230
22:00 C 37 240
23:00 B 38 250
I sorted the data by ID and time:
Time ID X Y
8:00 A 23 100
13:00 A 28 150
14:00 A 29 160
18:00 A 33 200
20:00 A 35 220
9:00 B 24 110
10:00 B 25 120
17:00 B 32 190
21:00 B 36 230
23:00 B 38 250
11:00 C 26 130
12:00 C 27 140
16:00 C 31 180
19:00 C 34 210
22:00 C 37 240
15:00 D 30 170
I want to keep only the first and the last row of each ID and eliminate the rest. The result looks like this:
Time ID X Y
8:00 A 23 100
20:00 A 35 220
9:00 B 24 110
23:00 B 38 250
11:00 C 26 130
22:00 C 37 240
15:00 D 30 170
I used this code:
df = pd.read_csv("contoh.csv")
g = df.groupby('ID')
(pd.concat([g.head(1), g.tail(1)])
.drop_duplicates()
.sort_values('ID')
.reset_index(drop=True))
It works, but I cannot save the result to CSV:
g.to_csv('result.csv')
I got an error message: Cannot access callable attribute 'to_csv' of 'DataFrameGroupBy' objects, try using the 'apply' method
Any advice? Thank you.
pd.concat returns a new dataframe; it does not change the groupby object g, and g itself has no to_csv method. Assign the concatenated output to a new variable and call to_csv on that:
df = pd.read_csv("contoh.csv")
g = df.groupby('ID')
g_1 = pd.concat([g.head(1),g.tail(1)]).drop_duplicates().sort_values('ID').reset_index(drop=True)
g_1.to_csv('result.csv')
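If you do not want the automatic row numbers written to the file, pass index=False:

g_1.to_csv('result.csv', index=False)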
In order to properly plot the data, I need the missing values to be shown as 0. I do not want a 0 value for every missing day, as that would bloat the storage. How do I insert a 0 value, per type column, on each gap's first and last day? I do not need 0 inserted before and after the whole sequence. Bonus: what if the time series is monthly or weekly data (date set to the first of the month, or to every Monday)?
For example, this time series contains one gap for type A: the 4th through the 8th of January are missing. I need to insert a 0 value on the 4th and the 8th of January.
# Python 3: ranges must be converted to lists before concatenating
import numpy as np
import pandas as pd
from datetime import datetime, timedelta

df = pd.DataFrame({"date": [datetime(2015, 1, 1) + timedelta(days=x)
                            for x in list(range(0, 3)) + list(range(8, 13)) + list(range(2, 9))],
                   "type": ['A'] * 8 + ['B'] * 7,
                   "value": np.random.randint(10, 100, size=15)})
date type value
0 2015-01-01 A 97
1 2015-01-02 A 11
2 2015-01-03 A 89 <-- last date before the gap
3 2015-01-09 A 31 <-- first day after the gap
4 2015-01-10 A 64
5 2015-01-11 A 82
6 2015-01-12 A 75
7 2015-01-13 A 24
8 2015-01-03 B 72
9 2015-01-04 B 46
10 2015-01-05 B 26
11 2015-01-06 B 91
12 2015-01-07 B 36
13 2015-01-08 B 53
14 2015-01-09 B 85
Desired result (the row indexes would be different):
date type value
0 2015-01-01 A 97
1 2015-01-02 A 11
2 2015-01-03 A 89
. 2015-01-04 A 0 <-- gap starts - new value
<-- do NOT insert any more values for 05--07
. 2015-01-08 A 0 <-- gap ends - new value
3 2015-01-09 A 31
4 2015-01-10 A 64
5 2015-01-11 A 82
6 2015-01-12 A 75
7 2015-01-13 A 24
8 2015-01-03 B 72
9 2015-01-04 B 46
10 2015-01-05 B 26
11 2015-01-06 B 91
12 2015-01-07 B 36
13 2015-01-08 B 53
14 2015-01-09 B 85
Maybe an inelegant solution, but it seems easiest to split the dataframe up, fill in the missing dates, and recombine, like so:
# with pandas imported as pd
dfA = df[df.type == 'A'].copy()  # copy so the edits below don't warn about a view
new_axis = pd.date_range(df.date.min(), df.date.max())
dfA = dfA.set_index('date')
missing_dates = list(set(new_axis).difference(dfA.index))
dfA.loc[min(missing_dates)] = 'A', 0  # first day of the gap
dfA.loc[max(missing_dates)] = 'A', 0  # last day of the gap
df = pd.concat([df[df.type == 'B'].set_index('date'), dfA])
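The same idea can be generalized to every type at once; a sketch, assuming (as in the example) at most one contiguous gap per type and starting again from the original frame. For the monthly or weekly bonus, pass freq='MS' or freq='W-MON' to pd.date_range instead of the daily default:

parts = []
for t, grp in df.groupby('type'):
    grp = grp.set_index('date')
    full = pd.date_range(grp.index.min(), grp.index.max())  # daily grid per type
    missing = sorted(set(full).difference(grp.index))
    if missing:
        grp.loc[missing[0]] = t, 0   # first day of the gap
        grp.loc[missing[-1]] = t, 0  # last day of the gap
    parts.append(grp.sort_index())
df = pd.concat(parts).reset_index()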
I have a table, a timetable, with check-in and check-out times of the employees:
ID Date Check-in Check-out
1 1-1-2011 11:00 18:00
2 1-1-2011 11:00 19:00
3 1-1-2011 16:00 18:30
4 1-1-2011 17:00 20:00
Now I want to know how many employees are working during every (half) hour.
The result I want to see:
Hour Count
11 2
12 2
13 2
14 2
15 2
16 3
17 3
18 2.5
19 1
Each 'Hour' should be read as 'until the next full hour', e.g. 11 -> 11:00 - 12:00.
Any ideas?
Build an additional table, called Hours, containing the following data:
h
00:00
00:30
01:00
...
23:30
then run:
SELECT h AS 'hour', COUNT(ID) AS 'count'
FROM timetable, hours
WHERE [Check_in] <= h AND h <= [Check_out]
GROUP BY h
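For reference, the same cross-join counting in pandas, as a rough sketch (column names adapted from the table above; like the SQL, it counts an employee for every half-hour slot between check-in and check-out, without the fractional weighting shown for 18:00):

import pandas as pd

tt = pd.DataFrame({'ID': [1, 2, 3, 4],
                   'Check_in':  pd.to_timedelta(['11:00:00', '11:00:00', '16:00:00', '17:00:00']),
                   'Check_out': pd.to_timedelta(['18:00:00', '19:00:00', '18:30:00', '20:00:00'])})

slots = pd.timedelta_range('00:00:00', '23:30:00', freq='30min')  # the Hours table
counts = pd.Series({h: ((tt['Check_in'] <= h) & (h <= tt['Check_out'])).sum() for h in slots})
print(counts[counts > 0])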