I am a newbie in Python. I have a huge DataFrame with millions of rows and IDs. My data looks like this:
Time ID X Y
8:00 A 23 100
9:00 B 24 110
10:00 B 25 120
11:00 C 26 130
12:00 C 27 140
13:00 A 28 150
14:00 A 29 160
15:00 D 30 170
16:00 C 31 180
17:00 B 32 190
18:00 A 33 200
19:00 C 34 210
20:00 A 35 220
21:00 B 36 230
22:00 C 37 240
23:00 B 38 250
I sorted the data by ID and Time:
Time ID X Y
8:00 A 23 100
13:00 A 28 150
14:00 A 29 160
18:00 A 33 200
20:00 A 35 220
9:00 B 24 110
10:00 B 25 120
17:00 B 32 190
21:00 B 36 230
23:00 B 38 250
11:00 C 26 130
12:00 C 27 140
16:00 C 31 180
19:00 C 34 210
22:00 C 37 240
15:00 D 30 170
I want to keep only the first and the last row of each ID and eliminate the rest. The result should look like this:
Time ID X Y
8:00 A 23 100
20:00 A 35 220
9:00 B 24 110
23:00 B 38 250
11:00 C 26 130
22:00 C 37 240
15:00 D 30 170
I used this code:
df = pd.read_csv("contoh.csv")
g = df.groupby('ID')
(pd.concat([g.head(1), g.tail(1)])
   .drop_duplicates()
   .sort_values('ID')
   .reset_index(drop=True))
It works, but I cannot save the result to CSV:
g.to_csv('result.csv')
I got this error message: Cannot access callable attribute 'to_csv' of 'DataFrameGroupBy' objects, try using the 'apply' method
Any advice? Thank you.
pd.concat returns a new DataFrame; it does not modify the groupby object g, and a DataFrameGroupBy has no to_csv method. Assign the output of pd.concat to a new variable and call to_csv on that:
df = pd.read_csv("contoh.csv")
g = df.groupby('ID')
# first and last row of each group; drop_duplicates handles single-row groups
g_1 = pd.concat([g.head(1), g.tail(1)]).drop_duplicates().sort_values('ID').reset_index(drop=True)
g_1.to_csv('result.csv')
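By default, to_csv also writes the DataFrame index as an extra column; if you do not want it in the file, pass index=False:

g_1.to_csv('result.csv', index=False)  # omit the row index from the output file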
I have a Pandas data frame df that looks like this:
date A B name
2022-06-25 04:00:00 700 532 aa
2022-06-25 05:00:00 1100 433 aa
2022-06-25 06:00:00 800 754 aa
2022-06-25 07:00:00 1200 288 aa
2022-06-25 08:00:00 700 643 bb
2022-06-25 09:00:00 1400 668 bb
2022-06-25 10:00:00 1600 286 bb
.....
2022-06-03 11:00:00 397 46 cc
2022-06-03 12:00:00 100 7 cc
2022-06-03 13:00:00 400 25 cc
2022-06-03 14:00:00 500 41 cc
2022-06-03 15:00:00 400 0 cc
2022-06-03 16:00:00 300 23 dd
2022-06-03 17:00:00 500 50 dd
2022-06-03 18:00:00 300 0 dd
2022-06-03 19:00:00 400 15 dd
I'm trying to produce time series line charts for each name:
number of daily A vs date
number of daily B vs date
I was able to do so and get a plot for each name like this:
df.groupby('name').plot(x='date', figsize=(24,12))
But I couldn't specify a title for each plot with plt.title(" "), nor automatically save each plot with plt.savefig('name.png'), because they are all produced at once.
Is there any other way to produce plots so I can specify the title and save each plot automatically?
Thank you
One way to do what you require is to put the plot creation and plot saving inside a for loop. That lets you add a title via the title argument and use savefig to save each plot individually. The updated code is shown below; note that I am using name as the title and saving each figure as Myplot <name>.png.
import matplotlib.pyplot as plt

for name, group in df.groupby('name'):
    group.plot(x='date', title=name)
    plt.savefig('Myplot {}.png'.format(name), bbox_inches='tight')
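Side note (my addition, not in the original answer): if df contains many names, each call to group.plot opens a new figure, and matplotlib will eventually warn about too many open figures. Closing each figure after saving avoids that:

import matplotlib.pyplot as plt

for name, group in df.groupby('name'):
    group.plot(x='date', title=name)
    plt.savefig('Myplot {}.png'.format(name), bbox_inches='tight')
    plt.close()  # release the current figure before the next iteration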
How do I convert the 'Time & Date' column to a timestamp? As you can see, there is a header row for each date followed by AM and PM times. I would like a single timestamp column.
Time & Date Country ... Consensus Forecast
15 4:00 PM DE ... NaN NaN
16 Tuesday April 02 2019 NaN ... Consensus Forecast
17 7:00 AM EA ... NaN NaN
18 7:00 AM ES ... -33.3K -38.6K
19 8:30 AM GB ... 49.8 49.1
20 9:00 AM CY ... NaN 8.90%
21 9:40 AM RO ... 2.50% 2.50%
22 10:00 AM IE ... NaN 5.50%
23 5:30 PM DE ... NaN NaN
24 Wednesday April 03 2019 NaN ... Consensus Forecast
25 7:15 AM ES ... 55 52.5
26 7:45 AM IT ... 50.8 50.1
27 7:50 AM FR ... 48.7 48.7
28 7:55 AM DE ... 54.9 54.9
29 8:00 AM EA ... 52.7 52.7
30 8:30 AM GB ... 50.9 50.5
31 9:00 AM EA ... 0.20% 0.40%
32 9:00 AM EA ... 2.30% 1.80%
33 11:25 AM PL ... 1.50% 1.50%
34 Thursday April 04 2019 NaN ... Consensus Forecast
35 4:30 AM NL ... NaN 2.60%
36 6:00 AM DE ... 0.30% 0.50%
37 7:30 AM DE ... NaN 54
38 11:30 AM EA ... NaN NaN
39 Friday April 05 2019 NaN ... Consensus Forecast
40 6:00 AM DE ... 0.50% 0.70%
41 6:45 AM FR ... €-4.7B €-4.7B
42 7:30 AM GB ... -2.40% -2.20%
43 7:30 AM GB ... 2.30% 1.50%
44 11:30 AM ES ... NaN 92.5
Your Time & Date column represents two different things, so it needs to become two different columns to start with. If there's a way to cut out that step I would love to see it, but I'm guessing you need to expand it into two columns and then recombine them, before using pandas.to_datetime() with the format argument to get a datetime column.
First, get the dates into a different column, then forward-fill the missing values.
import numpy as np

# time rows (containing AM/PM) become NaN; ffill then copies each date header down
df['date'] = df['Time & Date'].apply(
    lambda x: np.nan if (('AM' in str(x)) | ('PM' in str(x))) else x
).ffill()
Then rename Time & Date to time, slice the date header rows out of the dataframe, and concatenate date and time together.
df.rename(columns={'Time & Date': 'time'}, inplace=True)
df = df.loc[df.time.str.contains('AM|PM', regex=True)]  # keep only the time rows
df['datetime'] = df.date + ' ' + df.time
From there all you have to do is find the right format for pd.to_datetime() and you're set!
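For the sample shown, where the combined strings look like 'Tuesday April 02 2019 7:00 AM', the format would be something along these lines (a sketch; adjust it to the actual strings in your file):

# hypothetical format matching e.g. 'Tuesday April 02 2019 7:00 AM'
df['datetime'] = pd.to_datetime(df['datetime'], format='%A %B %d %Y %I:%M %p')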
I have a DataFrame as shown below, containing sales data for two health care products from December 2016 to November 2018.
product profit sale_date discount
A 50 2016-12-01 5
A 50 2017-01-03 4
B 200 2016-12-24 10
A 50 2017-01-18 3
B 200 2017-01-28 15
A 50 2017-01-18 6
B 200 2017-01-28 20
A 50 2017-04-18 6
B 200 2017-12-08 25
A 50 2017-11-18 6
B 200 2017-08-21 20
B 200 2017-12-28 30
A 50 2018-03-18 10
B 300 2018-06-08 45
B 300 2018-09-20 50
A 50 2018-11-18 8
B 300 2018-11-28 35
From the above, I would like to prepare the DataFrame below and draw it as a line plot.
Expected Output
bought_year total_profit
2016 250
2017 1250
2018 1000
X axis = bought_year
Y axis = profit
Use groupby with dt.year, and .agg to name your column:
df1 = (df.groupby(df['sale_date'].dt.year)
         .agg(total_profit=('profit', 'sum'))
         .reset_index()
         .rename(columns={'sale_date': 'bought_year'}))
print(df1)
bought_year total_profit
0 2016 250
1 2017 1250
2 2018 1000
df1.set_index('bought_year').plot()  # default kind is 'line'; use kind='bar' for a bar chart
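One caveat: .dt.year only works on a datetime column, so if sale_date was read from CSV as plain strings, convert it first:

df['sale_date'] = pd.to_datetime(df['sale_date'])  # parse the strings so .dt.year works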
Follow-up question: using the same data as above, I picked the first and last row of each ID with this code:
df = pd.read_csv("data.csv")
g = df.groupby('ID')
g_1 = pd.concat([g.head(1),g.tail(1)]).drop_duplicates().sort_values('ID').reset_index(drop=True)
g_1.to_csv('result.csv')
But now I want to annotate every row as "first" or "last" in a new column. My expected result looks like this:
Time ID X Y Annotation
8:00 A 23 100 First
20:00 A 35 220 Last
9:00 B 24 110 First
23:00 B 38 250 Last
11:00 C 26 130 First
22:00 C 37 240 Last
15:00 D 30 170
Could anyone help me with this? Thank you.
You can use groupby with agg(['first', 'last']) and then stack; the stacked level becomes the Annotation column. As a bonus, this works on the original DataFrame, so there is no need to sort:
df.groupby('ID').agg(['first', 'last']).stack().reset_index().rename(columns={'level_1': 'Annotation'})
ID Annotation Time X Y
0 A first 8:00 23 100
1 A last 20:00 35 220
2 B first 9:00 24 110
3 B last 23:00 38 250
4 C first 11:00 26 130
5 C last 22:00 37 240
6 D first 15:00 30 170
7 D last 15:00 30 170
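Note that a single-row group such as D comes out as both first and last. If you want it listed only once, as in your expected output, one option (a sketch, not part of the original answer) is to drop rows that are identical apart from the annotation:

out = df.groupby('ID').agg(['first', 'last']).stack().reset_index().rename(columns={'level_1': 'Annotation'})
out = out.drop_duplicates(subset=['ID', 'Time', 'X', 'Y'])  # keeps the 'first' row for D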
No need for groupby; use drop_duplicates after you sort:
df = pd.concat([
    df.drop_duplicates(['ID']).assign(sign='first'),
    df.drop_duplicates(['ID'], keep='last').assign(sign='last'),
]).sort_values('ID')
df
Time ID X Y sign
0 8:00 A 23 100 first
4 20:00 A 35 220 last
5 9:00 B 24 110 first
9 23:00 B 38 250 last
10 11:00 C 26 130 first
14 22:00 C 37 240 last
15 15:00 D 30 170 first
15 15:00 D 30 170 last
Try:
df.groupby('ID').agg(['first','last'])\
.stack(1).reset_index()\
.rename(columns={'level_1':'Annotation'})
Output:
ID Annotation Time X Y
0 A first 8:00 23 100
1 A last 20:00 35 220
2 B first 9:00 24 110
3 B last 23:00 38 250
4 C first 11:00 26 130
5 C last 22:00 37 240
6 D first 15:00 30 170
7 D last 15:00 30 170
In order to plot the data properly, I need the missing values to be shown as 0. I do not want a 0 value for every missing day, as that bloats the storage. How do I insert a 0 value for each type for each gap's first and last day? I do not need a 0 inserted before or after the whole sequence. Bonus: what if the time series holds monthly or weekly data (dates set to the first of the month, or to every Monday)?
For example, this time series contains one gap for type A: values exist through January 3rd and resume on January 9th, so the 4th through the 8th are missing. I need to insert a 0 value on the 4th and the 8th of January.
from datetime import datetime, timedelta
import numpy as np
import pandas as pd

df = pd.DataFrame({"date": [datetime(2015, 1, 1) + timedelta(days=x)
                            for x in list(range(0, 3)) + list(range(8, 13)) + list(range(2, 9))],
                   "type": ['A'] * 8 + ['B'] * 7,
                   "value": np.random.randint(10, 100, size=15)})
date type value
0 2015-01-01 A 97
1 2015-01-02 A 11
2 2015-01-03 A 89 <-- last date before the gap
3 2015-01-09 A 31 <-- first day after the gap
4 2015-01-10 A 64
5 2015-01-11 A 82
6 2015-01-12 A 75
7 2015-01-13 A 24
8 2015-01-03 B 72
9 2015-01-04 B 46
10 2015-01-05 B 26
11 2015-01-06 B 91
12 2015-01-07 B 36
13 2015-01-08 B 53
14 2015-01-09 B 85
Desired result (the row indexes would be different):
date type value
0 2015-01-01 A 97
1 2015-01-02 A 11
2 2015-01-03 A 89
. 2015-01-04 A 0 <-- gap starts - new value
<-- do NOT insert any more values for 05--07
. 2015-01-08 A 0 <-- gap ends - new value
3 2015-01-09 A 31
4 2015-01-10 A 64
5 2015-01-11 A 82
6 2015-01-12 A 75
7 2015-01-13 A 24
8 2015-01-03 B 72
9 2015-01-04 B 46
10 2015-01-05 B 26
11 2015-01-06 B 91
12 2015-01-07 B 36
13 2015-01-08 B 53
14 2015-01-09 B 85
Maybe an inelegant solution, but it seems easiest to split the dataframe up, fill in the missing dates, and recombine, like so:
# with pandas imported as pd
dfA = df[df.type == 'A'].set_index('date')              # type-A rows, indexed by date (avoids SettingWithCopyWarning)
new_axis = pd.date_range(df.date.min(), df.date.max())  # the full daily range
missing_dates = list(set(new_axis).difference(dfA.index))
dfA.loc[min(missing_dates)] = 'A', 0                    # 0 on the first day of the gap
dfA.loc[max(missing_dates)] = 'A', 0                    # 0 on the last day of the gap
df = pd.concat([df[df.type == 'B'].set_index('date'), dfA])
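For the bonus question, the same approach should work if new_axis in the code above is built with a matching frequency instead of a daily one, using pandas' standard frequency aliases:

# monthly data, dates on the first of each month
new_axis = pd.date_range(df.date.min(), df.date.max(), freq='MS')
# weekly data, dates on Mondays
new_axis = pd.date_range(df.date.min(), df.date.max(), freq='W-MON')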