I can't think of how to do this:
As the title explains, I want to group a dataframe by the column acquired_month, but only count rows where another column contains 'Closed Won' (in the example I made a helper column that marks True when that condition is fulfilled, although I'm not sure that step is necessary). For the rows that meet the condition I want to sum the values of a third column, but I can't work out how. Here is my code so far:
us_lead_scoring.loc[us_lead_scoring['Stage'].str.contains('Closed Won'), 'closed_won_binary'] = True
acquired_date = us_lead_scoring.groupby('acquired_month')['closed_won_binary'].sum()
but this just sums the True/False column itself, rather than summing the value column for the rows where the True/False column is True after the acquired_month groupby. Any direction appreciated.
Thanks
If you need to aggregate column col, replace the non-matched values with 0 via Series.where and then aggregate with sum:
import pandas as pd

us_lead_scoring = pd.DataFrame({'Stage': ['Closed Won1', 'Closed Won2', 'Closed', 'Won'],
                                'col': [1, 3, 5, 6],
                                'acquired_month': [1, 1, 1, 2]})
out = (us_lead_scoring['col'].where(us_lead_scoring['Stage']
                                    .str.contains('Closed Won'), 0)
       .groupby(us_lead_scoring['acquired_month'])
       .sum()
       .reset_index(name='SUM'))
print(out)
   acquired_month  SUM
0               1    4
1               2    0
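If you prefer the helper-column idea from the question, an equivalent sketch (the mask and out2 names here are just for illustration) multiplies col by the boolean mask, so True/False act as 1/0 and non-matching rows contribute 0:

mask = us_lead_scoring['Stage'].str.contains('Closed Won')
out2 = (us_lead_scoring['col'].mul(mask)
        .groupby(us_lead_scoring['acquired_month'])
        .sum()
        .reset_index(name='SUM'))
print(out2)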
How to calculate the difference between row values based on another column value, without filtering out the values in between? I want to calculate the difference between seconds for rows where turn_marker == 1, but when I use the following method it filters out all the zeros, and I need the zeros because I need the entire data set.
My data set has a column called turn_marker with the values zero and 1, and another column with seconds. Now I want to calculate the time between the rows where turn_marker is equal to 1.
dataframe = main_dataframe.query("turn_marker=='1;'")
main_dataframe["seconds_diff"] = dataframe["seconds"].diff()
main_dataframe
I would be grateful if you could help me.
You can do this:
import numpy as np

# Sort the marked rows first (keeping their original order via indx),
# diff the seconds, then blank out the diff for the unmarked rows.
main_dataframe['indx'] = main_dataframe.index
main_dataframe['diff'] = main_dataframe.sort_values(by=['turn_marker', 'indx'], ascending=[False, True])['seconds'].diff()
main_dataframe.loc[main_dataframe.turn_marker == '0;', 'diff'] = np.nan
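If you would rather avoid the sort, a minimal alternative sketch (assuming turn_marker really holds the strings '1;' and '0;' as in the question) computes the diff on the marked rows only; the assignment aligns on the index, so every other row stays NaN and no data is filtered out:

# Mask-based variant: diff only the marked rows, keep the rest NaN.
mask = main_dataframe['turn_marker'] == '1;'
main_dataframe['seconds_diff'] = main_dataframe.loc[mask, 'seconds'].diff()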
I have a list of indexes and I'm trying to populate a column 'Type' for these rows only.
What I tried to do:
index_list={1,5,9,10,13}
df.loc[index_list,'Type']=="gain+loss"
Output:
1     False
5     False
9     False
10    False
13    False
But the output just gives the list with all False instead of populating these rows.
Thanks for any advice.
You need to use a single equals sign instead of a double one. In Python, as in most programming languages, == is the comparison operator; in your case you need the assignment operator =.
So the following code will do what you want:
index_list={1,5,9,10,13}
df.loc[index_list,'Type']="gain+loss"
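As a quick check, a minimal sketch with a hypothetical stand-in frame; note that recent pandas versions reject a set as a .loc indexer, so a plain list is the safer choice:

import pandas as pd

df = pd.DataFrame({'Type': [''] * 14})    # hypothetical stand-in frame
index_list = [1, 5, 9, 10, 13]            # a list; newer pandas rejects sets in .loc
df.loc[index_list, 'Type'] = "gain+loss"  # assignment, not comparison
print(df.loc[index_list, 'Type'])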
Could anyone help me?
I want to sum the values with the format:
print (...+....+)
for example:
a b
France 2
Italie 15
Croatie 7
I want to make the sum of France and Croatie.
Thank you for your help!
One possible solution:
set column a as the index,
using loc select rows for the "wanted" values,
take column b,
sum the values found.
So the code can be:
result = df.set_index('a').loc[['France', 'Croatie']].b.sum()
Note the double square brackets: the outer pair belongs to the loc indexing syntax, while the inner pair, together with its contents, is the list of index values to select.
To subtract two sums (one for some set of countries and the second for another set),
you can run e.g.:
wrk = df.set_index('a').b
result = wrk.loc[['Italie', 'USA']].sum() - wrk.loc[['France', 'Croatie']].sum()
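For reference, a minimal runnable sketch with the sample data above (expected result: 2 + 7 = 9):

import pandas as pd

df = pd.DataFrame({'a': ['France', 'Italie', 'Croatie'], 'b': [2, 15, 7]})
result = df.set_index('a').loc[['France', 'Croatie']].b.sum()
print(result)  # 9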
I have a pandas dataframe that looks like this:
id    rq_id          method       user_id  error  reservation_id  code
0     609d444a9d34a  reservation  3261     False  82122
1     609d444a9d34a                        False  82122           346346
2279  60c6ff0c63e45  reservation  5231     False  92902
2280  60c6ff0c63e45               5231     False  92902           415643
There are supposed to be two rows for each rq_id, and I want to aggregate those two rows into one.
The problem I have is with the user_id column: for some rq_id values it exists in only one of the two rows, while for others it exists in both.
Current code:
clean_df = clean_df.groupby('rq_id').agg({
    'rq_id': 'first',
    'method': "".join,
    'user_id': 'first',  # <=== here what do I do?
    'ucode': "".join,
    'reservation_id': 'first',
    'dt': 'first'
})
Expected:
id  rq_id          method       user_id  error  reservation_id  code
0   609d444a9d34a  reservation  3261     False  82122           346346
1   60c6ff0c63e45  reservation  5231     False  92902           415643
How to achieve this?
You need to first replace the empty strings with missing values, so that first returns the first non-NaN value in each group.
clean_df = clean_df.replace('', np.nan).groupby('rq_id').first()
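For illustration, a minimal sketch with a hypothetical two-row frame shaped like the sample above (only a few of the columns are reproduced):

import numpy as np
import pandas as pd

clean_df = pd.DataFrame({
    'rq_id': ['609d444a9d34a', '609d444a9d34a'],
    'method': ['reservation', ''],
    'user_id': ['3261', ''],
    'code': ['', '346346'],
})
# Empty strings become NaN, so first() skips them per column.
out = clean_df.replace('', np.nan).groupby('rq_id').first()
print(out)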
I have a dataframe with rows that are duplicated except for one column's value. I want to drop the row whose value of b is None when the id is the same (not all the rows are duplicated):
   a  b
1  1  None
2  1  7
3  2  2
4  3  4
I need to drop the first row, where a is duplicated (1) and the value of b is None.
You can use duplicated together with a check for None. That mask selects the row you want to drop, so use ~ to invert it (keeping everything but that row) and return the expected result. Passing keep=False marks all duplicates, so order doesn't matter.
df[~((df['b'].isnull()) & (df.duplicated('a', keep=False)))]    # if None is a null value
OR
df[~((df['b'] == 'None') & (df.duplicated('a', keep=False)))]   # if 'None' is a string
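As a quick check, a minimal sketch with the sample data (b holding a real missing value):

import pandas as pd

df = pd.DataFrame({'a': [1, 1, 2, 3], 'b': [None, 7, 2, 4]})
out = df[~(df['b'].isnull() & df.duplicated('a', keep=False))]
print(out)  # the row with a == 1 and b == None is dropped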