How to explicitly specify the groups in pandas groupby

I have a DataFrame containing customer purchase information. I would like to group by the date column to determine the total sales each day. My problem is that there are some days with no purchases whatsoever. I would like the groupby object to include those missing days as groups with total sales equal to 0. Is there a way to pass a list of the expected values of the date column to the groupby function? MWE below:
>>> import pandas as pd
>>> df = pd.DataFrame({
...     'purchase_id': [1, 2, 3, 4],
...     'date': ['1900-01-01', '1900-01-01', '1900-01-03', '1900-01-04'],
...     'cost': [1.00, 0.25, 0.50, 0.75]
... })
This groupby gives the following result:
>>> df.groupby('date').agg({'cost':'sum'})
            cost
date
1900-01-01  1.25
1900-01-03  0.50
1900-01-04  0.75
What command can I execute to obtain the following result instead? (Obviously I will have to pass the dates I am expecting to see as an argument, which is fine.)
            cost
date
1900-01-01  1.25
1900-01-02  0.00
1900-01-03  0.50
1900-01-04  0.75

You can use reindex (note that reindex fills the missing rows with NaN by default, so pass fill_value=0 to get the zero totals you want):
df.groupby('date').agg({'cost': 'sum'}).reindex(your_new_date_list, fill_value=0)
However, I'd recommend converting your date column to datetime; then you can use resample:
df['date'] = pd.to_datetime(df['date'])
df.resample('d', on='date')['cost'].sum().reset_index()
Output:
        date  cost
0 1900-01-01  1.25
1 1900-01-02  0.00
2 1900-01-03  0.50
3 1900-01-04  0.75
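Alternatively, if you literally want to hand groupby the list of expected groups, you can make date a categorical column; with observed=False, every category shows up as a group even when empty, and an empty sum comes out as 0. A minimal sketch, reusing the MWE with a hand-built list of expected dates:
import pandas as pd

df = pd.DataFrame({
    'purchase_id': [1, 2, 3, 4],
    'date': ['1900-01-01', '1900-01-01', '1900-01-03', '1900-01-04'],
    'cost': [1.00, 0.25, 0.50, 0.75]
})

expected = ['1900-01-01', '1900-01-02', '1900-01-03', '1900-01-04']
df['date'] = pd.Categorical(df['date'], categories=expected)

# observed=False keeps empty categories as groups; their sum is 0.0
print(df.groupby('date', observed=False)['cost'].sum())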

Related

Date column shifts and is no longer callable

I am using pandas groupby to group duplicate dates by their pm25 values to get one average. However, when I use the groupby function, the structure of my dataframe changes, and I can no longer call the 'Date' column.
Using groupby also changes the structure of my data: instead of being sorted by 1/1/19, 1/2/19, it is sorted by 1/1/19, 1/10/19, 1/11/19.
Here is my current code:
Before using df.groupby my df looks like:
(screenshot: df before groupby)
I use groupby:
df.groupby('Date').mean('pm25')
print(df)
(screenshot: df after groupby)
And after, I cannot call the 'Date' column anymore or sort the column
print(df['Date'])
Returns just
KeyError: 'Date'
Please help, or please let me know what else I can provide.
Using groupby also changes the structure of my data: instead of being sorted by 1/1/19, 1/2/19, it is sorted by 1/1/19, 1/10/19, 1/11/19.
This is because your Date column type is string, not datetime. In string comparison, the third character 1 of 1/10/19 is smaller than the third character 2 of 1/2/19. If you want to keep the original sequence, you can do the following:
df['Date'] = pd.to_datetime(df['Date']) # Convert Date column to datetime type
df['Date'] = df['Date'].dt.strftime('%m/%d/%y') # Convert datetime to other formats (but the dtype of column will be string)
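As a quick illustration of the difference (using date strings like the question's):
import pandas as pd

dates = pd.Series(['1/1/19', '1/10/19', '1/2/19'])

# Lexicographic order: '1/10/19' sorts before '1/2/19' because '1' < '2'
print(dates.sort_values().tolist())  # ['1/1/19', '1/10/19', '1/2/19']

# Chronological order after converting to datetime
as_dates = pd.to_datetime(dates, format='%m/%d/%y')
print(as_dates.sort_values().dt.strftime('%m/%d/%y').tolist())  # ['01/01/19', '01/02/19', '01/10/19']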
And after, I cannot call the 'Date' column anymore or sort the column
This is because after grouping by the Date column, the returned dataframe uses the Date values as its index to represent each group:
pm25
Date
01/01/19 8.50
01/02/19 9.20
01/03/19 7.90
01/04/19 8.90
01/05/19 6.00
After doing df.groupby('Date').mean('pm25'), the returned dataframe above means that the mean pm25 value of the 01/01/19 group is 8.50, and so on.
If you want to retrieve the Date column back from the index, you can call reset_index() after the groupby,
df.groupby('Date')['pm25'].mean().reset_index()
which gives
Date pm25
0 01/01/19 8.50
1 01/02/19 9.20
2 01/03/19 7.90
3 01/04/19 8.90
4 01/05/19 6.00
5 01/06/19 6.75
6 01/11/19 8.50
7 01/12/19 9.20
8 01/21/19 9.20
Or set the as_index argument of pandas.DataFrame.groupby() to False:
df.groupby('Date', as_index=False)['pm25'].mean()
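Putting both fixes together, a minimal sketch with hypothetical pm25 values (the original dataframe was only shown as screenshots):
import pandas as pd

df = pd.DataFrame({
    'Date': ['1/1/19', '1/1/19', '1/2/19', '1/10/19'],
    'pm25': [8.0, 9.0, 9.2, 8.5],
})

df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%y')  # sort chronologically, not as strings

out = df.groupby('Date', as_index=False)['pm25'].mean()  # 'Date' stays a regular column
print(out.sort_values('Date'))
print(out['Date'])  # no more KeyError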

pandas convert delta time column into one sole unit time

I am working with a pandas dataframe where I have a deltatime column with values like:
deltatime
0 days 09:06:30
0 days 00:30:34
2 days 23:07:14
How can I convert those times into a single unit, like minutes or hours, so that I can visualize them more easily in a graph?
Any ideas? The timedeltas user guide here:
https://pandas.pydata.org/pandas-docs/stable/user_guide/timedeltas.html
does not clarify how to simply change units.
You can use the total_seconds() method of a timedelta to get its duration in seconds:
seconds = [t.total_seconds() for t in df['deltatime']]
And then convert the units if desired:
# to hours (3600 seconds per hour)
hours = [t.total_seconds() / 3600 for t in df['deltatime']]
You can do:
df['deltatime'] = pd.to_timedelta(df['deltatime']).dt.total_seconds()
Output:
deltatime
0 32790.0
1 1834.0
2 256034.0
You can also perform arithmetic operations on timedelta:
# convert to hours
pd.to_timedelta(df['deltatime']) / pd.Timedelta(hours=1)
Output:
0 9.108333
1 0.509444
2 71.120556
Name: deltatime, dtype: float64
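End to end, a small sketch that rebuilds the three example values as strings (in case the column is not a timedelta dtype yet):
import pandas as pd

df = pd.DataFrame({'deltatime': ['0 days 09:06:30',
                                 '0 days 00:30:34',
                                 '2 days 23:07:14']})

td = pd.to_timedelta(df['deltatime'])
df['seconds'] = td.dt.total_seconds()  # 32790.0, 1834.0, 256034.0
df['minutes'] = df['seconds'] / 60
df['hours'] = df['seconds'] / 3600
print(df)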

How to calculate the sum of rows with the maximum date country wise

I am trying to calculate the sum of the rows with the maximum date per country; if a country has more than one province, it should add the confirmed cases with the maximum date.
(The input and expected output were shown as images.)
So the output for China is 90, which is the sum of Tianjin and Xinjiang for the maximum date, 02-03-2020.
And since Argentina does not have any province, its output is 20 for its highest date, following the same logic.
The strategy is to sort the values so that the latest date is the first row of each Country/Region-Province/State pair, then roll up the dataset twice, filtering to the max date between roll-ups.
First, sort to put the most recent dates at the top of each group (if Date is still a string, convert it with pd.to_datetime first so the sort is chronological; for these particular values the string sort happens to work):
(df
.sort_values(['Country/Region', 'Province/State', 'Date'], ascending=False))
Date Country/Region Province/State Confirmed
3 02-03-2020 China Xinjiang 70
2 01-03-2020 China Xinjiang 30
1 02-03-2020 China Tianjin 20
0 01-03-2020 China Tianjin 10
Then rolling up to Country/Region-Province/State and taking the most recent date:
(df
.sort_values(['Country/Region', 'Province/State', 'Date'], ascending=False)
.groupby(['Country/Region', 'Province/State'])
.first())
Date Confirmed
Country/Region Province/State
China Tianjin 02-03-2020 20
Xinjiang 02-03-2020 70
Finally, rolling up again to just Country/Region:
(df
.sort_values(['Country/Region', 'Province/State', 'Date'], ascending=False)
.groupby(['Country/Region', 'Province/State'])
.first()
.groupby('Country/Region').sum())
Confirmed
Country/Region
China 90
If you fill in your empty province values, you can use groupby with idxmax to pull out the rows at each group's latest Date, and then another groupby to sum the values.
input['Date'] = pd.to_datetime(input['Date'], format="%d-%m-%Y")
input = input.fillna("dummy")
(input.loc[input.groupby(["Country/Region", "Province/State"])["Date"].idxmax()]
      .groupby("Country/Region")["Confirmed"].sum())
A basic solution is to chain several pandas methods:
First, convert the Date column to datetime (string to datetime type conversion).
Then sort by the Date column.
Then .groupby() Country/Region and Date, aggregating the Confirmed column with sum. Finally, drop_duplicates() keeping the last row per country, which gives you the latest date for any particular country.
df['Date'] = pd.to_datetime(df['Date'], format='%d-%m-%Y')
df.sort_values(by='Date', inplace=True)
(df.groupby(['Country/Region', 'Date']).agg({'Confirmed': 'sum'})
   .reset_index()
   .drop_duplicates(subset='Country/Region', keep='last'))
df.groupby(['Date', 'Country/Region'], as_index=False)['Confirmed'].sum().groupby('Country/Region').agg('last').reset_index()
Country/Region Date Confirmed
0 Argentina 2020-03-02 20
1 China 2020-03-02 90
2 France 2020-03-02 70
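For reference, a sketch of the idxmax approach on a hypothetical sample (the real input was posted as an image; China's rows follow the first answer, and the Argentina/France numbers are back-filled from the expected output):
import pandas as pd

df = pd.DataFrame({
    'Date': ['01-03-2020', '02-03-2020', '01-03-2020', '02-03-2020',
             '01-03-2020', '02-03-2020', '01-03-2020', '02-03-2020'],
    'Country/Region': ['China', 'China', 'China', 'China',
                       'Argentina', 'Argentina', 'France', 'France'],
    'Province/State': ['Tianjin', 'Tianjin', 'Xinjiang', 'Xinjiang',
                       None, None, None, None],
    'Confirmed': [10, 20, 30, 70, 15, 20, 50, 70],
})

df['Date'] = pd.to_datetime(df['Date'], format='%d-%m-%Y')
df['Province/State'] = df['Province/State'].fillna('dummy')  # keep countries without provinces

# row with the latest date per country/province pair, then sum per country
latest = df.loc[df.groupby(['Country/Region', 'Province/State'])['Date'].idxmax()]
print(latest.groupby('Country/Region')['Confirmed'].sum())
# Argentina    20
# China        90
# France       70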

How to round off numbers in SQL Server database

I have a column (named PercentageDifference) with numbers that have decimal places as shown below:
PercentageDifference
1.886792452830100
2.325581395348800
2.758620689655100
-3.689320388349500
-0.284900284900200
0.400000000000000
I want a query that rounds the numbers to 2 decimal places.
Here is the output I am looking for:
PercentageDifference
1.89
2.33
2.76
-3.69
-0.28
0.40
I have tried to use the ROUND function, but it's not giving me the expected results:
select round([PercentageDifference], 2, 1) from Table
How can this be achieved?
You need only CAST; converting to decimal rounds, whereas the non-zero third argument you passed to ROUND tells SQL Server to truncate instead of round:
SELECT CAST([PercentageDifference] AS decimal(19,2)) FROM Table;

python pandas groupby plot with sorted date as xtick

I have a pandas dataframe df with the following format
date value team
12/8/2015 1.2 'A'
12/8/2015 1.3 'A'
12/7/2015 1.2 'A'
12/6/2015 1.3 'B'
12/6/2015 1.1 'B'
12/7/2015 1.3 'B'
...............................
What I want is a figure with two curves representing the two groups, with date as the x-value and the average value for that date as the y-value. What bothers me is that the date format seems not to be handled correctly, as python complains
"Could not convert 12/8/2015... to numeric"
for label, group in df.groupby('team'):
    group.plot(x=group['date'], y=group['date'].mean(), label=label)
You first need to convert your date to a timestamp.
df['date'] = pd.to_datetime(df.date)
Then you can group and unstack to get your desired data:
>>> df.groupby(['date', 'team']).sum().unstack('team')
value
team 'A' 'B'
date
2015-12-06 NaN 2.4
2015-12-07 1.2 1.3
2015-12-08 2.5 NaN
Add .plot() and you should get your desired result (swap .sum() for .mean() if you want the daily average rather than the daily total).
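A runnable sketch of the whole flow, using .mean() for the daily average (the data comes from the question; the marker styling is just an assumption for readability):
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    'date': ['12/8/2015', '12/8/2015', '12/7/2015',
             '12/6/2015', '12/6/2015', '12/7/2015'],
    'value': [1.2, 1.3, 1.2, 1.3, 1.1, 1.3],
    'team': ['A', 'A', 'A', 'B', 'B', 'B'],
})

df['date'] = pd.to_datetime(df['date'])

# one line per team: daily mean on the y-axis, chronological dates on the x-axis
df.groupby(['date', 'team'])['value'].mean().unstack('team').plot(marker='o')
plt.show()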