How to use aggregate with condition in pandas?

I have a dataframe. The following code works:
stat = working_data.groupby(by=['url', 'bucket_id'],
                            as_index=False).agg({'delta': 'max', 'id': 'count'})
Now I need to count IDs with different statuses. The status can be "DOWNLOADED", "NOT_DOWNLOADED", or "DOWNLOADING".
I would like to have a df with the columns bucket_id, max, downloaded (how many have the "DOWNLOADED" status), not_downloaded (how many have the "NOT_DOWNLOADED" status), and downloading (how many have the "DOWNLOADING" status). How can I do this?
Input I have: (screenshot not shown)
Output I have: (screenshot not shown)
As you can see, the count isn't divided by status. But I want to know that there are x downloaded, y not_downloaded, and z downloading for each bucket_id (so they should be in separate columns, but the info for one bucket_id should stay in one row).

One way is to use assign to create boolean columns for each status, then aggregate those new columns.
working_data.assign(downloaded=working_data['status'] == 'DOWNLOADED',
                    not_downloaded=working_data['status'] == 'NOT_DOWNLOADED',
                    downloading=working_data['status'] == 'DOWNLOADING')\
            .groupby(by=['url', 'bucket_id'],
                     as_index=False).agg({'delta': 'max',
                                          'id': 'count',
                                          'downloaded': 'sum',
                                          'not_downloaded': 'sum',
                                          'downloading': 'sum'})
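A rough alternative sketch, assuming the columns are named url, bucket_id, delta, id, and status as above: count each status per group with pd.crosstab, then join in the max delta and the id count.
import pandas as pd

# Count of each status value per (url, bucket_id) pair
counts = pd.crosstab([working_data['url'], working_data['bucket_id']],
                     working_data['status'])
# Max delta and total id count per group (named aggregation)
base = working_data.groupby(['url', 'bucket_id']).agg(max_delta=('delta', 'max'),
                                                      id_count=('id', 'count'))
# Both frames share the same (url, bucket_id) index, so they join cleanly
stat = base.join(counts).reset_index()
This keeps one row per (url, bucket_id) with a separate column for every status value.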

Related

Tidyr/Purrr nest and map_dbl for stats (e.g., max, mean) returning incorrect values

I'm trying to mutate a variety of summary statistics based on various groupings in my nested data. I'd like to use this strategy instead of summarize() as I want to store the summary statistics in a tibble with the original data, including other identifying variables.
Group Name | Page Name | User    | Date     | num | min
Area A     | Page 1    | user265 | 22-04-13 | 14  | 10
Area A     | Page 1    | user265 | 22-04-14 | 5   | 3
Area B     | Page 2    | user275 | 22-04-01 | 12  | 6
There are 8 'groups' and hundreds of 'pages' nested across those groups. Before nesting, each row represented observations by Page Name, User, and Date.
When grouping/nesting by Group Name, the stats I generate for either 'num' or 'min' match what would be expected based on the nested values.
However, when I group by Page Name, the results make no sense based on the data in the nested table. For example, the minimum value for 'num' and 'min' is 1, yet I'll get a mean of 0.10 and a min of 0 for one page. Due to the long format of the data, there are no missing values. I'm not sure why the results aren't consistent with the actual data in the nested table when grouping by Page Name.
adopt_30_nest <- adopt_30 %>%
  # Select variables of interest
  select(page_name, group_name, user, date, num, min) %>%
  # Group by the grouping factor, then nest
  group_by(page_name) %>%
  nest() %>%
  # Create a new dbl column with the max num value for each page
  mutate(max_num = map_dbl(data, ~max(.x$num)))
Any ideas for how to fix this? Thanks!

Pandas seems to be merging same dataframe twice

I have two dataframes in pandas, one of which, 'datapanel', has country data for multiple years, and the other, 'data', has country data for only one year, but also includes a "Regional indicator" column for each country. I simply want to create a new column in the datapanel frame that gives the 'Regional indicator' for each country. For some reason, the rows of the dataframe are just about doubling after this merge, whereas they should remain the same. What am I doing wrong?
The key (country name) you are merging on is duplicated in 'datapanel' (see 'Afghanistan' mentioned at least 5 times) and perhaps also in 'data', which is what makes the row count grow.
Try a different technique (a VLOOKUP-style lookup), something like this ("Country name" must be unique in 'data'):
for country in data["Country name"].values:
    indicator = data.loc[data["Country name"] == country, "Regional indicator"].item()
    datapanel.loc[datapanel["Country name"] == country, "Regional indicator"] = indicator
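A vectorised sketch of the same lookup, again assuming "Country name" is unique in 'data' and exists in both frames:
# Build a country -> region mapping from 'data', then map it onto 'datapanel'
region_by_country = data.set_index("Country name")["Regional indicator"]
datapanel["Regional indicator"] = datapanel["Country name"].map(region_by_country)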

Pandas dataframe - Unique counts based on different conditions

I have a dataframe like this (screenshot not shown).
I want to find out :
unique viewers: 3
unique viewers who reviewed movies: 2
I am able to do that using the following code:
movie['Viewer_ID'].nunique()
movie.loc[movie['watched']==1,:]['Viewer_ID'].nunique()
However, I was wondering if there is a better way to combine both in one, something like
movie.agg({'Viewer_ID': 'nunique',
           'watched': 'sum'})
Is there a way I can write a conditional count within the agg function?
You can use .groupby():
view_count = movie.groupby('Viewer_ID').watched.sum()
Now view_count is a Series with the viewer id as index and the sum of watched as values. You can filter it with:
filtered = view_count.loc[view_count > 0]
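Building on that Series, a quick sketch of how to get both numbers at once (assuming one row per viewing event and the column names above):
# Unique viewers overall and unique viewers with at least one watched movie
unique_viewers = view_count.size          # one entry per Viewer_ID
unique_watchers = (view_count > 0).sum()  # viewers whose watched-sum is positive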

how to save categorical columns when doing groupby.median()?

I have credit loan data, but the original df has many loan ids that can belong to one customer, so I need to group by client id in order to build a client profile.
the original df:
contract_id | product_id | client_id | bal  | age | gender | pledge_amount | branche_region
RZ13/25     | 000345     | 98023432  | 2300 | 32  | M      | 4500          | west
clients = df.groupby(by=['client_id']).median().reset_index()
This line completely removes important categorical columns like gender and branche_region! It groups by client_id and calculates the median for the numeric columns only; all the categorical columns are gone.
I wonder how to group by unique customers but also keep the categoricals.
They are removed because pandas drops nuisance (non-numeric) columns. To avoid that, aggregate each column explicitly: here the numeric columns are aggregated with the median and for non-numeric columns the first value is returned:
import numpy as np

f = lambda x: x.median() if np.issubdtype(x.dtype, np.number) else x.iat[0]
# another idea: join the non-numeric values instead of taking the first one
# f = lambda x: x.median() if np.issubdtype(x.dtype, np.number) else ','.join(x)
clients = df.groupby(by=['client_id']).agg(f)
If the values of the other non-numeric columns are the same within each group, you can add them to the list passed to the by parameter:
clients = df.groupby(by=['client_id', 'gender', 'branche_region']).median()
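Another possible sketch, assuming the column names shown above: build an explicit per-column aggregation spec, with median for the numeric columns and 'first' for everything else.
# Split the columns into numeric and non-numeric (excluding the grouping key)
num_cols = df.select_dtypes(include='number').columns.difference(['client_id'])
cat_cols = df.columns.difference(num_cols).difference(['client_id'])

# Median for numeric columns, first value for categorical ones
agg_spec = {c: 'median' for c in num_cols}
agg_spec.update({c: 'first' for c in cat_cols})

clients = df.groupby('client_id', as_index=False).agg(agg_spec)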

How to group by two columns and get a sum down a third column

I have a dataframe where each row is a prescription event and contains a drug name, a postcode, and the quantity prescribed. I need to find the total quantity of every drug prescribed in every postcode.
I need to group by postcode, then by drug name, and find the sum of the cells in the "items" column for every group.
This is my function that I want to apply to every postcode group:
def count(group):
    sums = []
    for bnf_name in group['bnf_name']:
        sum_ = group[group['bnf_name'] == bnf_name]['items'].sum()
        sums.append(sum_)
    group['sum'] = sums
merged.groupby('post_code').apply(count).head()
merged.head()
Calling merged.head() returns the original merged dataframe without a new column for sums like I would expect. I think there is something I don't get about the apply() function on a groupby object...
You do not need to define your own function:
merged.groupby(['post_code','bnf_name'])['items'].sum()
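If you prefer a flat dataframe rather than a MultiIndexed Series, a small variation under the same column-name assumptions:
totals = (merged.groupby(['post_code', 'bnf_name'], as_index=False)['items']
                .sum()
                .rename(columns={'items': 'total_items'}))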