How to group by one column if condition is true in another column summing values in third column with pandas

I can't think of how to do this:
As the headline explains, I want to group a dataframe by the column acquired_month, but only for rows where another column contains 'Closed Won' (in the example I made a helper column that just marks True when that condition is fulfilled, although I'm not sure that step is necessary). Then, for the rows that meet the condition, I want to sum the values of a third column, but I can't work out how. Here is my code so far:
us_lead_scoring.loc[us_lead_scoring['Stage'].str.contains('Closed Won'), 'closed_won_binary'] = True
acquired_date = us_lead_scoring.groupby('acquired_month')['closed_won_binary'].sum()
but this just sums the True/False column after the acquired_month groupby, not the column I actually want summed where the True/False column is True. Any direction appreciated.
Thanks

If you need to aggregate column col, replace the non-matched values with 0 using Series.where and then aggregate with sum:
import pandas as pd

us_lead_scoring = pd.DataFrame({'Stage': ['Closed Won1', 'Closed Won2', 'Closed', 'Won'],
                                'col': [1, 3, 5, 6],
                                'acquired_month': [1, 1, 1, 2]})

# Replace 'col' values with 0 where Stage does not contain 'Closed Won',
# then group by 'acquired_month' and sum.
out = (us_lead_scoring['col'].where(us_lead_scoring['Stage']
                                    .str.contains('Closed Won'), 0)
                             .groupby(us_lead_scoring['acquired_month'])
                             .sum()
                             .reset_index(name='SUM'))
print(out)
acquired_month SUM
0 1 4
1 2 0
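An equivalent way to express the same idea, if you prefer working on the whole frame (a sketch reusing the frame above; the mask and SUM names are mine):
mask = us_lead_scoring['Stage'].str.contains('Closed Won')
out = (us_lead_scoring.assign(SUM=us_lead_scoring['col'].where(mask, 0))
                      .groupby('acquired_month', as_index=False)['SUM']
                      .sum())
print(out)
Because the non-matching rows are zeroed rather than dropped, months with no 'Closed Won' rows (like month 2 here) still appear in the result with a sum of 0.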

Related

How to calculate the difference between row values based on another column value without filtering the values in between

I want to calculate the difference between seconds for turn_marker == 1, but when I use the following method, it filters out all the zeros. I need the zeros, because I need the entire data set.
Here you can see my data set with a column called turn_marker that has the values zero and 1, and another column with seconds. Now I want to calculate the time between those rows where turn_marker is equal to 1.
dataframe = main_dataframe.query("turn_marker=='1;'")
main_dataframe["seconds_diff"] = dataframe["seconds"].diff()
main_dataframe
I would be grateful if you could help me.
You can do this:
import numpy as np

main_dataframe['indx'] = main_dataframe.index
# sort '1;' rows first (original order preserved), then diff 'seconds'
main_dataframe['diff'] = main_dataframe.sort_values(by=['turn_marker', 'indx'], ascending=[False, True])['seconds'].diff()
main_dataframe.loc[main_dataframe.turn_marker == '0;', 'diff'] = np.nan  # keep diff only on '1;' rows
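An alternative sketch that avoids the sort, assuming (as in the question) the markers are the strings '1;' and '0;': compute the diff only on the matching rows and let index alignment put it back.
mask = main_dataframe['turn_marker'] == '1;'
main_dataframe.loc[mask, 'seconds_diff'] = main_dataframe.loc[mask, 'seconds'].diff()
# All other rows keep NaN in seconds_diff, so nothing is filtered out of the data set.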

Adding column value for a list of indexes

I have a list of indexes and trying to populate a column 'Type' for these rows only.
What I tried to do:
index_list={1,5,9,10,13}
df.loc[index_list,'Type']=="gain+loss"
Output:
1 False
5 False
9 False
10 False
13 False
But the output is just a Series of False values instead of those rows being populated.
Thanks for any advice.
You need to put a single equals sign instead of a double one. In Python, as in most programming languages, == is the comparison operator. In your case you need the assignment operator =.
So the following code will do what you want (note that newer versions of pandas do not accept a set as a .loc indexer, so a list is used here):
index_list = [1, 5, 9, 10, 13]
df.loc[index_list, 'Type'] = "gain+loss"
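For illustration, a minimal sketch (the frame and its size here are made up):
import pandas as pd

df = pd.DataFrame({'Type': [None] * 15})
index_list = [1, 5, 9, 10, 13]
df.loc[index_list, 'Type'] = "gain+loss"   # single = assigns; == would only compare
print(df)
Only the listed rows get "gain+loss"; every other row keeps None.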

How to sum rows in my dataframe with a specific condition in Pandas?

Could anyone help me?
I want to sum the values with the format:
print (...+....+)
for example:
a b
France 2
Italie 15
Croatie 7
I want to make the sum of France and Croatie.
Thank you for your help!
One possible solution:
set column a as the index,
use loc to select rows for the "wanted" values,
take column b,
sum the values found.
So the code can be:
result = df.set_index('a').loc[['France', 'Croatie']].b.sum()
Note the double square brackets: the outer pair is the "container" of index values
passed to loc, and the inner pair, with what is inside, is a list of index values.
To subtract two sums (one for some set of countries and the second for another set),
you can run e.g.:
wrk = df.set_index('a').b
result = wrk.loc[['Italie', 'USA']].sum() - wrk.loc[['France', 'Croatie']].sum()
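If you prefer not to change the index, a boolean mask with isin gives the same sums (a sketch on the example data):
import pandas as pd

df = pd.DataFrame({'a': ['France', 'Italie', 'Croatie'], 'b': [2, 15, 7]})
result = df.loc[df['a'].isin(['France', 'Croatie']), 'b'].sum()
print(result)   # 2 + 7 = 9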

Pandas aggregate with condition. Join if empty, first if not empty

I have a pandas dataframe that looks like this:
id rq_id method user_id error reservation_id code
0 609d444a9d34a reservation 3261 False 82122
1 609d444a9d34a False 82122 346346
2279 60c6ff0c63e45 reservation 5231 False 92902
2280 60c6ff0c63e45 5231 False 92902 415643
There are supposed to be 2 rows for each rq_id, and I want to aggregate those two rows into one.
The problem I have is with the user_id column: for some rq_id values it exists in only one of the two rows, while for others it exists in both.
Current code:
clean_df = clean_df.groupby('rq_id').agg({
    'rq_id': 'first',
    'method': ''.join,
    'user_id': 'first',  # <=== here what do I do?
    'ucode': ''.join,
    'reservation_id': 'first',
    'dt': 'first'
})
Expected:
id rq_id method user_id error reservation_id code
0 609d444a9d34a reservation 3261 False 82122 346346
1 60c6ff0c63e45 reservation 5231 False 92902 415643
How to achieve this?
You need to first replace the empty strings with missing values, so that first returns the first non-NaN value:
clean_df = clean_df.replace('', np.nan).groupby('rq_id').first()
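A minimal sketch of the idea on toy data (assuming, as in the question, that the missing cells hold empty strings):
import numpy as np
import pandas as pd

clean_df = pd.DataFrame({'rq_id': ['609d444a9d34a', '609d444a9d34a'],
                         'user_id': ['3261', ''],
                         'code': ['', '346346']})
# Empty strings become NaN, so first() skips them and picks the real value per column.
print(clean_df.replace('', np.nan).groupby('rq_id', as_index=False).first())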

Dropping semi-duplicated rows in pandas according to a specific column value

I have a dataframe with rows that are duplicated except for one column's value. I want to drop the row whose b value is "None" when the id (column a) is the same (not all the rows are duplicated):
a b
1 1 None
2 1 7
3 2 2
4 3 4
I need to drop the first row, where a is duplicated (1) and the value of b is None.
You can use duplicated and also search for None. That returns the row you want to drop, so use ~ to invert the mask (keeping everything but that row) and get the expected result. EDIT: Passing keep=False returns all duplicates, so order doesn't matter.
df[~((df['b'].isnull()) & (df.duplicated('a', keep=False)))]  # if None is a null value
OR
df[~((df['b'] == 'None') & (df.duplicated('a', keep=False)))]  # if 'None' is a string
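Applied to the example data (a sketch assuming b holds a real null, as in the first variant):
import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 3], 'b': [None, 7, 2, 4]}, index=[1, 2, 3, 4])
print(df[~((df['b'].isnull()) & (df.duplicated('a', keep=False)))])
#    a    b
# 2  1  7.0
# 3  2  2.0
# 4  3  4.0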