How to group by and sum several columns?


I have a big dataframe with several columns that contain strings, numbers, etc. I am trying to group by SCENARIO and then sum only the columns between 2020 and 2050. So far I have only managed to sum a single column, as shown below, but I need to replace '2050' with all the columns between 2020 and 2050.
df1 = df.groupby(["SCENARIO"])['2050'].sum().sum(axis=0)

You are creating a subset of the df with only that single column. I can't tell what your dataset looks like from the information provided, but try:
df.groupby(["SCENARIO"]).sum()
This should sum the rows of every column within each SCENARIO group. If the frame also holds string columns, pass numeric_only=True so that only the numeric columns are summed.
Alternatively, select the columns you want to sum:
df.groupby(["SCENARIO"])[["column1","column2"]].sum()
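For the "columns between 2020 and 2050" part, a label slice on the columns can build the list for you. This is a minimal sketch on made-up data; it assumes the year columns are strings (as '2050' in the question suggests) and that they sit next to each other in the frame in ascending order. The REGION and 2010 columns are hypothetical.

```python
import pandas as pd

# Hypothetical data: year columns are strings and appear in order.
df = pd.DataFrame({
    "SCENARIO": ["A", "A", "B"],
    "REGION": ["x", "y", "x"],
    "2010": [1, 1, 1],
    "2020": [1, 2, 3],
    "2030": [10, 20, 30],
    "2050": [100, 200, 300],
})

# Label slicing with .loc includes both endpoints and follows column order.
year_cols = df.loc[:, "2020":"2050"].columns.tolist()

# Sum the selected columns within each scenario.
per_scenario = df.groupby("SCENARIO")[year_cols].sum()

# Optional: collapse the year columns into one total per scenario.
total_per_scenario = per_scenario.sum(axis=1)

print(per_scenario)
print(total_per_scenario)
```

If the year columns are not contiguous, building the list explicitly (for example from a range of years converted to strings) would be the safer choice.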

Related

Merge two dataframes based on a column which has split values

I have two data frames. In the first one, the products column contains semicolon-separated values such as 1;3;5.
After merging it with the second data frame, I split and explode that column:
Merge_Store_Transaction['products'] = Merge_Store_Transaction['products'].str.split(';')
Merge_Store_Transaction = Merge_Store_Transaction.explode('products')
This gives me a result in which all the other values are duplicated, which I don't want. Is there a way to divide the profit column by the respective number of products and repeat that number, or to just fill the other rows with zero?

I think that once you have this result, you can divide each profit by the number of rows that share the same group_id and date, something like the following:
Merge_Store_Transaction["profit"] = Merge_Store_Transaction["profit"] / Merge_Store_Transaction.groupby(["group_id", "date"])["profit"].transform("count")
Same thing for the revenue_in_usd column.
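If no column combination identifies the original row, the index can do the job, because explode repeats the original row label for every product. The following is only a sketch on invented data (the store column and all values are hypothetical); it shows both options from the question, splitting the profit evenly and keeping it on the first row only.

```python
import pandas as pd

# Hypothetical merged frame; only 'products' and 'profit' come from the question.
merged = pd.DataFrame({
    "store": ["s1", "s2"],
    "products": ["1;3;5", "2;4"],
    "profit": [90.0, 50.0],
})

merged["products"] = merged["products"].str.split(";")
exploded = merged.explode("products")

# explode() keeps the original row label on every new row, so grouping
# on the index counts how many products each original row contained.
n_products = exploded.groupby(level=0)["products"].transform("count")

# Option 1: spread the profit evenly over the products.
split_profit = exploded["profit"] / n_products

# Option 2: keep the profit on the first row of each original record, zero elsewhere.
first_row = ~exploded.index.duplicated()
zero_filled = exploded["profit"].where(first_row, 0.0)

print(split_profit.tolist())
print(zero_filled.tolist())
```

Either Series can be assigned back to the profit column, and the same steps apply to revenue_in_usd.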

Selecting Rows Based On Specific Condition In Python Pandas Dataframe

I am new to using Python pandas dataframes.
I have a dataframe with one column holding customer ids and other columns holding flavors and satisfaction scores.
Although each customer should have 6 rows, Customer 1 only has 5. How do I create a new dataframe that contains only the customers who have 6 rows?
I tried df['Customer No'].value_counts() == 6, but it is not working.

Here is one way to do it.
If you post the data as code (preferably) or text, I can share the result.
# create a temporary column 'c' by grouping on Customer No
# and assigning the group count to it using transform
# finally, use loc to select the rows whose count equals 6
(df.loc[df.assign(
    c=df.groupby(['Customer No'])['Customer No']
    .transform('count'))['c'].eq(6)]
)
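The value_counts() attempt returns one boolean per customer id, not one per row, so it cannot be used directly as a row mask. Below is a small self-contained sketch on invented data (the Score column is hypothetical) showing the transform approach without the temporary column, plus groupby(...).filter as an alternative.

```python
import pandas as pd

# Hypothetical data: customer 1 has 5 rows, customer 2 has 6.
df = pd.DataFrame({
    "Customer No": [1] * 5 + [2] * 6,
    "Score": range(11),
})

# Option 1: broadcast each group's row count back onto its rows, then mask.
counts = df.groupby("Customer No")["Customer No"].transform("count")
complete = df[counts.eq(6)]

# Option 2: keep whole groups that satisfy a predicate.
complete_alt = df.groupby("Customer No").filter(lambda g: len(g) == 6)

print(complete)
```

The transform version tends to be faster on large frames because it avoids calling a Python function once per group.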

Filtering a dataframe on a combination of column values

I have a dataframe with two columns named TABLEID and STATID, each holding a range of different values.
When I filter the dataframe on the values '101PC' and 'ST101', I get 14K records, and when I filter on '102HT' and 'ST102', I also get 14K records. The problem is that when I try to combine both filters as shown below, I get an empty dataframe, whereas I expected 28K records in the result. Any help is much appreciated.
df[df[['TABLEID','STATID']].apply(tuple, axis = 1).isin([('101PC', 'ST101'), ('102HT','ST102')])]
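Without the data I can't say why the result is empty, but here is a sketch of two other ways to express "pair 1 OR pair 2" that are easy to check step by step. The sample rows are invented. If these also come back empty on the real data, stray whitespace or a different dtype in one of the columns would be worth checking (that is a guess, not a diagnosis).

```python
import pandas as pd

# Hypothetical sample with the two column names from the question.
df = pd.DataFrame({
    "TABLEID": ["101PC", "102HT", "101PC", "103XX"],
    "STATID": ["ST101", "ST102", "ST102", "ST101"],
})

# Explicit boolean masks: (pair 1) OR (pair 2).
mask = ((df["TABLEID"] == "101PC") & (df["STATID"] == "ST101")) | (
    (df["TABLEID"] == "102HT") & (df["STATID"] == "ST102")
)
result = df[mask]

# For a longer list of pairs, an inner merge against a lookup frame scales better.
pairs = [("101PC", "ST101"), ("102HT", "ST102")]
wanted = pd.DataFrame(pairs, columns=["TABLEID", "STATID"])
result_merge = df.merge(wanted, on=["TABLEID", "STATID"], how="inner")

print(result)
```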

Plotting Grouped Data, grouped by multiple columns in pandas

I have a dataframe grouped by two columns.
Now I want to plot Date vs Confirmed in seaborn.
Is there a good way to do it?
grouped_series = cases.groupby(['Country/Region','ObservationDate'])[['Confirmed','Deaths','Recovered']].sum()
print(grouped_series)

You can change the aggregation to group by dates only:
cases.groupby(['ObservationDate'])['Confirmed'].sum().plot()
Or, if you need summed values per ObservationDate and Country/Region:
cases.groupby(['Country/Region','ObservationDate'])['Confirmed'].sum().unstack(0).plot()
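To see what unstack(0) hands to the plot call, here is a small sketch on invented numbers that only builds the reshaped tables and leaves the plotting out. The column names follow the question; the values are hypothetical.

```python
import pandas as pd

# Hypothetical sample using the column names from the question.
cases = pd.DataFrame({
    "Country/Region": ["A", "A", "B", "B", "A"],
    "ObservationDate": pd.to_datetime(
        ["2020-01-01", "2020-01-02", "2020-01-01", "2020-01-02", "2020-01-02"]
    ),
    "Confirmed": [1, 2, 10, 20, 3],
})

summed = cases.groupby(["Country/Region", "ObservationDate"])["Confirmed"].sum()

# unstack(0) moves the first index level (country) into the columns:
# one row per date, one column per country, which .plot() draws as one line each.
wide = summed.unstack(0)

# Long ("tidy") form: one row per country and date.
tidy = summed.reset_index()

print(wide)
print(tidy)
```

Since the question asks about seaborn: the tidy frame is the shape seaborn's lineplot works with, e.g. sns.lineplot(data=tidy, x="ObservationDate", y="Confirmed", hue="Country/Region"), which is not run here.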

Pandas groupby year filtering the dataframe by n largest values

I have a dataframe at hourly level with several columns. For every year in my dataframe, I want to extract the entire rows (containing all columns) for the 10 largest values of a specific column.
So far I ran the following code:
df = df.groupby([df.index.year])['totaldemand'].apply(lambda grp: grp.nlargest(10))
The problem is that I only get the top 10 values of that specific column for each year, and I lose the other columns. How can I do this operation and keep the values of the other columns that correspond to the top 10 values per year of my 'totaldemand' column?

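We usually use head after sort_values. Group on the year of the sorted frame's own index, and call head on the whole frame so that all columns are kept:
out = df.sort_values('totaldemand', ascending=False)
df = out.groupby(out.index.year).head(10)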

nlargest can be applied to each group, passing the column in which to look for the largest values.
So run:
df.groupby([df.index.year]).apply(lambda grp: grp.nlargest(3, 'totaldemand'))
Of course, in the final version replace 3 with your actual value.

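Get the index of your query and use it to select rows from your original df. The result of the groupby has a two-level index (year, original timestamp), so take the last level and select by label:
idx = df.groupby([df.index.year])['totaldemand'].apply(lambda grp: grp.nlargest(10)).index.get_level_values(-1)
df.loc[idx]
(or something to that effect, I can't test now without any test data)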
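Here is a small runnable sketch of two of these ideas on invented data, using n = 2 instead of 10 so the result is easy to check by eye. The temp column and all values are hypothetical, and the label-based variant assumes the datetime index has no duplicate timestamps.

```python
import pandas as pd

# Hypothetical hourly data spanning two years.
df = pd.DataFrame(
    {
        "totaldemand": [5, 9, 1, 7, 3, 8, 2, 6],
        "temp": [10, 11, 12, 13, 14, 15, 16, 17],
    },
    index=pd.to_datetime([
        "2020-03-01 00:00", "2020-03-01 01:00", "2020-03-01 02:00", "2020-03-01 03:00",
        "2021-03-01 00:00", "2021-03-01 01:00", "2021-03-01 02:00", "2021-03-01 03:00",
    ]),
)
n = 2

# Variant 1: sort once, then take the first n rows of each year.
# The grouping key comes from the sorted frame so it lines up row by row.
ordered = df.sort_values("totaldemand", ascending=False)
top_sorted = ordered.groupby(ordered.index.year).head(n)

# Variant 2: find the labels of the n largest values per year,
# then pull the full rows from the original frame.
labels = (
    df.groupby(df.index.year)["totaldemand"]
    .nlargest(n)
    .index.get_level_values(-1)
)
top_by_label = df.loc[labels]

print(top_sorted)
print(top_by_label)
```

Both variants keep every column. They differ only in row order: the first is sorted by demand overall, the second is grouped by year.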
