Why is groupby in pandas not displaying?

I have a df like:
df = pd.DataFrame({'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot',
                              'Elephant', 'Elephant', 'Elephant'],
                   'Max Speed': [380, 370, 24, 26, 5, 7, 3]})
I would like to group by Animal.
If I do this in a notebook:
a = df.groupby(['Animal'])
display(a)
I get:
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f945bdd7b80>
I expected something like:
What I ultimately want to do is sort the df by the number of animal appearances (Elephant 3, Falcon 2, etc.).

This happens because you are not applying any aggregation function after groupby:
a = df.groupby(['Animal'])
display(a)
Rectified:
a = df.groupby(['Animal']).count()
display(a)
Now, after applying an aggregation such as count(), sum(), or chaining sort_values(), you will get the groupby results.
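If the ultimate goal from the question (ordering the original rows by how often each animal appears) is what you're after, here is a minimal sketch; it assumes pandas >= 1.1 for the key argument of sort_values:
import pandas as pd

df = pd.DataFrame({'Animal': ['Falcon', 'Falcon', 'Parrot', 'Parrot',
                              'Elephant', 'Elephant', 'Elephant'],
                   'Max Speed': [380, 370, 24, 26, 5, 7, 3]})

# Count appearances per animal, then use those counts as the sort key.
counts = df['Animal'].value_counts()          # Elephant 3, Falcon 2, Parrot 2
df_sorted = df.sort_values('Animal',
                           key=lambda s: s.map(counts),  # key= needs pandas >= 1.1
                           ascending=False)
print(df_sorted)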

You need to check DataFrame.groupby:
Group DataFrame using a mapper or by a Series of columns.
So it is not for removing duplicated values in a column, but for aggregation.
If you need to remove duplicated values (i.e. set them to an empty string), use:
df.loc[df['Animal'].duplicated(), 'Animal'] = ''
print (df)
Animal Max Speed
0 Falcon 380
1 370
2 Parrot 24
3 26
4 Elephant 5
5 7
6 3
If you need groupby:
for i, g in df.groupby(['Animal']):
print (g)
Animal Max Speed
4 Elephant 5
5 Elephant 7
6 Elephant 3
Animal Max Speed
0 Falcon 380
1 Falcon 370
Animal Max Speed
2 Parrot 24
3 Parrot 26

The groupby object requires an action, like a max or a min. This will result in two things:
A regular pandas data frame
The grouping key appearing once
You clearly expect both of the Falcon entries to remain, so you don't actually want to do a groupby. If you want to see the entries with repeated animal values hidden, you would do that by setting the Animal column as the index. I say that because your input data frame is already in the order you want to display.

Use mask:
>>> df.assign(Animal=df['Animal'].mask(df['Animal'].duplicated(), ''))
Animal Max Speed
0 Falcon 380
1 370
2 Parrot 24
3 26
4 Elephant 5
5 7
6 3
>>>
Or as index:
df.assign(Animal=df['Animal'].mask(df['Animal'].duplicated(), '')).set_index('Animal')
Max Speed
Animal
Falcon 380
370
Parrot 24
26
Elephant 5
7
3
>>>


How to speed up the loop in a dataframe

I would like to speed up my loop because I have to run it on 900,000 rows.
To simplify, I show a sample below.
I would like to add a column named 'Count' which counts the number of times the score was under the target score for the same player. But for each row the target changes.
Input :
index Nom player score target score
0 0 felix 3 10
1 1 felix 8 7
2 2 theo 4 5
3 3 patrick 12 6
4 4 sophie 7 6
5 5 sophie 3 6
6 6 felix 2 4
7 7 felix 2 2
8 8 felix 2 3
Result :
index Nom player score target score Count
0 0 felix 3 10 5
1 1 felix 8 7 4
2 2 theo 4 5 1
3 3 patrick 12 6 0
4 4 sophie 7 6 1
5 5 sophie 3 6 1
6 6 felix 2 4 4
7 7 felix 2 2 0
8 8 felix 2 3 3
Below is the code I currently use, but is it possible to speed it up? I saw some articles about vectorization; is it possible to apply it to my calculation? If yes, how?
df2 = df.copy()
df2['Count'] = [np.count_nonzero((df.values[:, 1] == row[2]) & (df.values[:, 2] < row[4]))
                for row in df.itertuples()]
print(df2)
Jérôme Richard's insights for an O(n log n) solution can be translated to pandas. The speed-up depends on the number and size of the groups in the dataframe.
df2 = df.copy()
gr = df2.groupby('Nom player')
lookup = gr.score.apply(np.sort).to_dict()
df2['count'] = gr.apply(
    lambda x: pd.Series(
        np.searchsorted(lookup[x.name], x['target score']),
        index=x.index)
).droplevel(0)
print(df2)
Output
index Nom player score target score count
0 0 felix 3 10 5
1 1 felix 8 7 4
2 2 theo 4 5 1
3 3 patrick 12 6 0
4 4 sophie 7 6 1
5 5 sophie 3 6 1
6 6 felix 2 4 4
7 7 felix 2 2 0
8 8 felix 2 3 3
You can try:
df['Count'] = df.groupby("Nom player").apply(
    lambda x: pd.Series((sum(x["score"] < s) for s in x["target score"]), index=x.index)
).droplevel(0)
print(df)
Prints:
index Nom player score target score Count
0 0 felix 3 10 5
1 1 felix 8 7 4
2 2 theo 4 5 1
3 3 patrick 12 6 0
4 4 sophie 7 6 1
5 5 sophie 3 6 1
6 6 felix 2 4 4
7 7 felix 2 2 0
8 8 felix 2 3 3
EDIT: Quick benchmark:
from timeit import timeit

def add_count1(df):
    df["Count"] = (
        df.groupby("Nom player")
        .apply(
            lambda x: pd.Series(
                ((x["score"] < s).sum() for s in x["target score"]), index=x.index
            )
        )
        .droplevel(0)
    )

def add_count2(df):
    df["Count"] = [
        np.count_nonzero((df.values[:, 1] == row[2]) & (df.values[:, 2] < row[4]))
        for row in df.itertuples()
    ]

def add_count3(df):
    gr = df.groupby('Nom player')
    lookup = gr.score.apply(lambda x: np.sort(np.array(x))).to_dict()
    df['count'] = gr.apply(
        lambda x: pd.Series(
            np.searchsorted(lookup[x.name], x['target score']),
            index=x.index)
    ).droplevel(0)

df = pd.concat([df] * 1000).reset_index(drop=True)  # DataFrame of len=9000

t1 = timeit("add_count1(x)", "x=df.copy()", number=1, globals=globals())
t2 = timeit("add_count2(x)", "x=df.copy()", number=1, globals=globals())
t3 = timeit("add_count3(x)", "x=df.copy()", number=1, globals=globals())

print(t1)
print(t2)
print(t3)
Prints on my computer:
0.7540620159707032
6.63946107000811
0.004106967011466622
So my answer should be faster than the original, but Michael Szczesny's answer is the fastest.
There are two main issues in the current code. CPython string objects are slow, especially string comparison. Moreover, the current algorithm has quadratic complexity: for each row, it compares the current row against all matching rows. The latter is the biggest issue for large dataframes.
Implementation
The first thing to do is to replace the string comparison with something faster. String objects can be converted to native strings using np.array. Then the unique strings, as well as their locations, can be extracted using np.unique. This basically helps us replace the string-matching problem with an integer-matching problem. Comparing native integers is generally significantly faster, mainly because processors handle them well and NumPy can use efficient SIMD instructions to compare integers. Here is how to convert the string column to label indices:
# 0.065 ms
labels, labelIds = np.unique(np.array(df.values[:,1], dtype='U'), return_inverse=True)
Now, we can group the scores by label (player names) efficiently. The thing is that NumPy does not provide any group-by function. While it is possible to do this efficiently using multiple np.argsort calls, a basic pure-Python dict-based approach turns out to be pretty fast in practice. Here is the code grouping scores by label and sorting the set of scores associated with each label (useful for the next step):
# 0.014 ms
from collections import defaultdict

scoreByGroups = defaultdict(lambda: [])

labelIdsList = labelIds.tolist()
scoresList = df['score'].tolist()
targetScoresList = df['target score'].tolist()

for labelId, score in zip(labelIdsList, scoresList):
    scoreByGroups[labelId].append(score)

for labelId, scoreGroup in scoreByGroups.items():
    scoreByGroups[labelId] = np.sort(np.array(scoreGroup, np.int32))
scoreByGroups can now be used to efficiently find the number of scores smaller than a given one for a given label. One just needs to read scoreByGroups[label] (constant time) and then do a binary search on the resulting array (O(log n)). Here is how:
# 0.014 ms
counts = [np.searchsorted(scoreByGroups[labelId], score)
          for labelId, score in zip(labelIdsList, targetScoresList)]
# Copies are slow, but adding a new column seems even slower
# 0.212 ms
df2 = df.copy()
df2['Count'] = np.fromiter(counts, np.int32)
Results
The resulting code takes 0.305 ms on my machine on the example input, while the initial code takes 1.35 ms. This means this implementation is about 4.5 times faster. Unfortunately, 2/3 of the time is spent in the creation of the new dataframe with the new column. Note that the provided code should be much faster than the initial code on large dataframes thanks to an O(n log n) complexity instead of an O(n²) one.
Faster implementation for large dataframes
On large dataframes, calling np.searchsorted for each item is expensive due to the overhead of NumPy. One solution to easily remove this overhead is to use Numba. The computation can be optimized using a list instead of a dictionary, since the labels are integers in the range 0..len(labelIds). The computation can also be partially done in parallel.
The string-to-int conversion can be made significantly faster using pd.factorize, though this is still an expensive process.
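For reference (not part of the original answer), here is a minimal illustration of what pd.factorize returns, on toy names:
import pandas as pd

# factorize maps each distinct label to an integer code, computed once up front.
codes, uniques = pd.factorize(pd.Series(['felix', 'theo', 'felix', 'sophie']))
print(codes)    # [0 1 0 2]
print(uniques)  # Index(['felix', 'theo', 'sophie'], dtype='object')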
Here is the complete Numba-based solution:
import numba as nb

@nb.njit('(int64[:], int64[:], int64[:])', parallel=True)
def compute_counts(labelIds, scores, targetScores):
    groupSizes = np.bincount(labelIds)
    groupOffset = np.zeros(groupSizes.size, dtype=np.int64)
    scoreByGroups = [np.empty(e, dtype=np.int64) for e in groupSizes]
    counts = np.empty(labelIds.size, dtype=np.int64)
    for labelId, score in zip(labelIds, scores):
        offset = groupOffset[labelId]
        scoreByGroups[labelId][offset] = score
        groupOffset[labelId] = offset + 1
    for labelId in nb.prange(len(scoreByGroups)):
        scoreByGroups[labelId].sort()
    for i in nb.prange(labelIds.size):
        counts[i] = np.searchsorted(scoreByGroups[labelIds[i]], targetScores[i])
    return counts

df2 = df.copy()                                    # Slow part
labelIds, labels = pd.factorize(df['Nom player'])  # Slow part
counts = compute_counts(                           # Pretty fast part
    labelIds.astype(np.int64),
    df['score'].to_numpy().astype(np.int64),
    df['target score'].to_numpy().astype(np.int64)
)
df2['Count'] = counts                              # Slow part
On my 6-core machine, this code is much faster on large dataframes. In fact, it is the fastest of the proposed answers. It is only 2.5 times faster than the one of @MichaelSzczesny on a random dataframe with 9000 rows. The string-to-int conversion takes 40-45% of the time and the creation of the new Pandas dataframe (with the additional column) takes 25% of the time. The time taken by the Numba function is actually small in the end. Most of the time is finally lost in overheads.
Note that the conversion to categorical data can be done once (as a pre-computation) and can be useful to other computations, so it may actually not be so expensive.
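A minimal sketch of that pre-computation idea, assuming the player column is stored with a categorical dtype once and its integer codes are reused later (the column name follows the question; the sample values are made up):
import pandas as pd

df = pd.DataFrame({'Nom player': ['felix', 'theo', 'felix', 'sophie'],
                   'score': [3, 4, 8, 7]})

# One-off conversion: store the column as categorical.
df['Nom player'] = df['Nom player'].astype('category')

# The integer codes can then be reused by later computations
# (for example as the labelIds passed to compute_counts above).
label_ids = df['Nom player'].cat.codes.to_numpy()
print(label_ids)  # [0 2 0 1] -- codes follow the sorted category order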

Compare two data frames for different values in a column

I have two dataframes. Please tell me how I can compare them by operator name; if a name matches, add the count and time values to the first data frame.
In [2]: df1 In [3]: df2
Out[2]: Out[3]:
Name count time Name count time
0 Bob 123 4:12:10 0 Rick 9 0:13:00
1 Alice 99 1:01:12 1 Jone 7 0:24:21
2 Sergei 78 0:18:01 2 Bob 10 0:15:13
85 rows x 3 columns 105 rows x 3 columns
I want to get:
In [5]: df1
Out[5]:
Name count time
0 Bob 133 4:27:23
1 Alice 99 1:01:12
2 Sergei 78 0:18:01
85 rows x 3 columns
Use set_index and add them together. Finally, update back.
df1 = df1.set_index('Name')
df1.update(df1 + df2.set_index('Name'))
df1 = df1.reset_index()
Out[759]:
Name count time
0 Bob 133.0 04:27:23
1 Alice 99.0 01:01:12
2 Sergei 78.0 00:18:01
Note: I assume the time columns in both df1 and df2 are already in a correct date/time format. If they are in string format, you need to convert them before running the above commands, as follows:
df1.time = pd.to_timedelta(df1.time)
df2.time = pd.to_timedelta(df2.time)
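For completeness, here is a self-contained sketch of the whole flow on toy three-row frames (made-up values, not the real 85- and 105-row frames):
import pandas as pd

df1 = pd.DataFrame({'Name': ['Bob', 'Alice', 'Sergei'],
                    'count': [123, 99, 78],
                    'time': ['4:12:10', '1:01:12', '0:18:01']})
df2 = pd.DataFrame({'Name': ['Rick', 'Jone', 'Bob'],
                    'count': [9, 7, 10],
                    'time': ['0:13:00', '0:24:21', '0:15:13']})

# Convert the string times first, as noted above.
df1['time'] = pd.to_timedelta(df1['time'])
df2['time'] = pd.to_timedelta(df2['time'])

# Add on the Name index; names missing from df2 produce NaN,
# which update() skips, so those rows keep their original values.
df1 = df1.set_index('Name')
df1.update(df1 + df2.set_index('Name'))
df1 = df1.reset_index()
print(df1)  # Bob -> 133.0, 0 days 04:27:23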

Pandas: multiindex vs groupby [duplicate]

So I learned that I can use DataFrame.groupby without having a MultiIndex to do subsampling/cross-sections.
On the other hand, when I have a MultiIndex on a DataFrame, I still need to use DataFrame.groupby to do sub-sampling/cross-sections.
So what is a MultiIndex good for apart from the quite helpful and pretty display of the hierarchies when printing?
Hierarchical indexing (also referred to as “multi-level” indexing) was introduced in the pandas 0.4 release.
This opens the door to some quite sophisticated data analysis and manipulation, especially for working with higher dimensional data. In essence, it enables you to effectively store and manipulate arbitrarily high dimension data in a 2-dimensional tabular structure (DataFrame), for example.
Imagine constructing a dataframe using MultiIndex like this:-
import pandas as pd
import numpy as np

arrays = [['one', 'one', 'one', 'two', 'two', 'two'], [1, 2, 3, 1, 2, 3]]
df = pd.DataFrame(np.random.randn(6, 2),
                  index=pd.MultiIndex.from_tuples(list(zip(*arrays))),
                  columns=['A', 'B'])
df # This is the dataframe we have generated
A B
one 1 -0.732470 -0.313871
2 -0.031109 -2.068794
3 1.520652 0.471764
two 1 -0.101713 -1.204458
2 0.958008 -0.455419
3 -0.191702 -0.915983
This df is simply a data structure of two dimensions
df.ndim
2
But we can imagine it, looking at the output, as a 3 dimensional data structure.
one with 1 with data -0.732470 -0.313871.
one with 2 with data -0.031109 -2.068794.
one with 3 with data 1.520652 0.471764.
A.k.a.: "effectively store and manipulate arbitrarily high dimension data in a 2-dimensional tabular structure"
This is not just a "pretty display". It has the benefit of easy retrieval of data since we now have a hierarchical index.
For example.
In [44]: df.loc["one"]
Out[44]:
A B
1 -0.732470 -0.313871
2 -0.031109 -2.068794
3 1.520652 0.471764
will give us a new data frame only for the group of data belonging to "one".
And we can narrow down our data selection further by doing this:-
In [45]: df.loc["one"].loc[1]
Out[45]:
A -0.732470
B -0.313871
Name: 1
And of course, if we want a specific value, here's an example:-
In [46]: df.loc["one"].loc[1]["A"]
Out[46]: -0.73247029752040727
So if we have even more indexes (besides the 2 indexes shown in the example above), we can essentially drill down and select the data set we are really interested in without a need for groupby.
We can even grab a cross-section (either rows or columns) from our dataframe...
By rows:-
In [47]: df.xs('one')
Out[47]:
A B
1 -0.732470 -0.313871
2 -0.031109 -2.068794
3 1.520652 0.471764
By columns:-
In [48]: df.xs('B', axis=1)
Out[48]:
one 1 -0.313871
2 -2.068794
3 0.471764
two 1 -1.204458
2 -0.455419
3 -0.915983
Name: B
Great post by @Calvin Cheng, but thought I'd take a stab at this as well.
When to use a MultiIndex:
When a single column’s value isn’t enough to uniquely identify a row.
When data is logically hierarchical - meaning that it has multiple dimensions or “levels.”
Why (your core question) - at least these are the biggest benefits IMO:
Easy manipulation via stack() and unstack()
Easy math when there are multiple column levels
Syntactic sugar for slicing/filtering
Example:
Dollars Units
Date Store Category Subcategory UPC EAN
2018-07-10 Store 1 Alcohol Liqour 80480280024 154.77 7
Store 2 Alcohol Liqour 80480280024 82.08 4
Store 3 Alcohol Liqour 80480280024 259.38 9
Store 1 Alcohol Liquor 80432400630 477.68 14
674545000001 139.68 4
Store 2 Alcohol Liquor 80432400630 203.88 6
674545000001 377.13 13
Store 3 Alcohol Liquor 80432400630 239.19 7
674545000001 432.32 14
Store 1 Beer Ales 94922755711 65.17 7
702770082018 174.44 14
736920111112 50.70 5
Store 2 Beer Ales 94922755711 129.60 12
702770082018 107.40 10
736920111112 59.65 5
Store 3 Beer Ales 94922755711 154.00 14
702770082018 137.40 10
736920111112 107.88 12
Store 1 Beer Lagers 702770081011 156.24 12
Store 2 Beer Lagers 702770081011 137.06 11
Store 3 Beer Lagers 702770081011 119.52 8
1) If we want to easily compare sales across stores, we can use df.unstack('Store') to line everything up side-by-side:
Dollars Units
Store Store 1 Store 2 Store 3 Store 1 Store 2 Store 3
Date Category Subcategory UPC EAN
2018-07-10 Alcohol Liqour 80480280024 154.77 82.08 259.38 7 4 9
Liquor 80432400630 477.68 203.88 239.19 14 6 7
674545000001 139.68 377.13 432.32 4 13 14
Beer Ales 94922755711 65.17 129.60 154.00 7 12 14
702770082018 174.44 107.40 137.40 14 10 10
736920111112 50.70 59.65 107.88 5 5 12
Lagers 702770081011 156.24 137.06 119.52 12 11 8
2) We can also easily do math on multiple columns. For example, df['Dollars'] / df['Units'] will then divide each store's dollars by its units, for every store without multiple operations:
Store Store 1 Store 2 Store 3
Date Category Subcategory UPC EAN
2018-07-10 Alcohol Liqour 80480280024 22.11 20.52 28.82
Liquor 80432400630 34.12 33.98 34.17
674545000001 34.92 29.01 30.88
Beer Ales 94922755711 9.31 10.80 11.00
702770082018 12.46 10.74 13.74
736920111112 10.14 11.93 8.99
Lagers 702770081011 13.02 12.46 14.94
3) If we then want to filter to just specific rows, instead of using the
df[(df[col1] == val1) & (df[col2] == val2) & (df[col3] == val3)]
format, we can instead use .xs or .query (yes, these work for regular dfs, but they're not very useful there). The syntax would instead be:
df.xs((val1, val2, val3), level=(col1, col2, col3))
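For instance, here is a minimal self-contained sketch of that syntax on a toy three-level frame (the level names Store/Category/UPC mirror the example above, but the numbers are made up):
import numpy as np
import pandas as pd

# Toy MultiIndex frame: 2 stores x 2 categories x 2 UPCs.
idx = pd.MultiIndex.from_product(
    [['Store 1', 'Store 2'], ['Alcohol', 'Beer'], [111, 222]],
    names=['Store', 'Category', 'UPC'])
df = pd.DataFrame({'Dollars': np.arange(8.0)}, index=idx)

# One .xs call replaces the chained boolean comparisons.
print(df.xs(('Store 1', 'Beer'), level=['Store', 'Category']))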
More examples can be found in this tutorial notebook I put together.
The alternative to using a multiindex is to store your data using multiple columns of a dataframe. One would expect multiindex to provide a performance boost over naive column storage, but as of Pandas v 1.1.4, that appears not to be the case.
Timings
import numpy as np
import pandas as pd
np.random.seed(2020)
inv = pd.DataFrame({
    'store_id': np.random.choice(10000, size=10**7),
    'product_id': np.random.choice(1000, size=10**7),
    'stock': np.random.choice(100, size=10**7),
})
# Create a DataFrame with a multiindex
inv_multi = inv.groupby(['store_id', 'product_id'])[['stock']].agg('sum')
print(inv_multi)
stock
store_id product_id
0 2 48
4 18
5 58
7 149
8 158
... ...
9999 992 132
995 121
996 105
998 99
999 16
[6321869 rows x 1 columns]
# Create a DataFrame without a multiindex
inv_cols = inv_multi.reset_index()
print(inv_cols)
store_id product_id stock
0 0 2 48
1 0 4 18
2 0 5 58
3 0 7 149
4 0 8 158
... ... ... ...
6321864 9999 992 132
6321865 9999 995 121
6321866 9999 996 105
6321867 9999 998 99
6321868 9999 999 16
[6321869 rows x 3 columns]
%%timeit
inv_multi.xs(key=100, level='store_id')
10 loops, best of 3: 20.2 ms per loop
%%timeit
inv_cols.loc[inv_cols.store_id == 100]
The slowest run took 8.79 times longer than the fastest. This could mean that an intermediate result is being cached.
100 loops, best of 3: 11.5 ms per loop
%%timeit
inv_multi.xs(key=100, level='product_id')
100 loops, best of 3: 9.08 ms per loop
%%timeit
inv_cols.loc[inv_cols.product_id == 100]
100 loops, best of 3: 12.2 ms per loop
%%timeit
inv_multi.xs(key=(100, 100), level=('store_id', 'product_id'))
10 loops, best of 3: 29.8 ms per loop
%%timeit
inv_cols.loc[(inv_cols.store_id == 100) & (inv_cols.product_id == 100)]
10 loops, best of 3: 28.8 ms per loop
Conclusion
The benefits of using a MultiIndex are about syntactic sugar, self-documenting data, and small conveniences from functions like unstack(), as mentioned in @ZaxR's answer; performance is not a benefit, which seems like a real missed opportunity.
Based on the comment on this answer, it seems the experiment was flawed. Here is my attempt at a correct experiment.
Timings
import pandas as pd
import numpy as np
from timeit import timeit
random_data = np.random.randn(16, 4)
multiindex_lists = [["A", "B", "C", "D"], [1, 2, 3, 4]]
multiindex = pd.MultiIndex.from_product(multiindex_lists)
dfm = pd.DataFrame(random_data, multiindex)
df = dfm.reset_index()
print("dfm:\n", dfm, "\n")
print("df\n", df, "\n")
dfm_selection = dfm.loc[("B", 4), 3]
print("dfm_selection:", dfm_selection, type(dfm_selection))
df_selection = df[(df["level_0"] == "B") & (df["level_1"] == 4)][3].iat[0]
print("df_selection: ", df_selection, type(df_selection), "\n")
print("dfm_selection timeit:",
timeit(lambda: dfm.loc[("B", 4), 3], number=int(1e6)))
print("df_selection timeit: ",
timeit(
lambda: df[(df["level_0"] == "B") & (df["level_1"] == 4)][3].iat[0],
number=int(1e6)))
dfm:
0 1 2 3
A 1 -1.055128 -0.845019 -2.853027 0.521738
2 0.397804 0.385045 -0.121294 -0.696215
3 -0.551836 -0.666953 -0.956578 1.929732
4 -0.154780 1.778150 0.183104 -0.013989
B 1 -0.315476 0.564419 0.492496 -1.052432
2 -0.695300 0.085265 0.701724 -0.974168
3 -0.879915 -0.206499 1.597701 1.294885
4 0.653261 0.279641 -0.800613 1.050241
C 1 1.004199 -1.377520 -0.672913 1.491793
2 -0.453452 0.367264 -0.002362 0.411193
3 2.271958 0.240864 -0.923934 -0.572957
4 0.737893 -0.523488 0.485497 -2.371977
D 1 1.133661 -0.584973 -0.713320 -0.656315
2 -1.173231 -0.490667 0.634677 1.711015
3 -0.050371 -0.175644 0.124797 0.703672
4 1.349595 0.122202 -1.498178 0.013391
df
level_0 level_1 0 1 2 3
0 A 1 -1.055128 -0.845019 -2.853027 0.521738
1 A 2 0.397804 0.385045 -0.121294 -0.696215
2 A 3 -0.551836 -0.666953 -0.956578 1.929732
3 A 4 -0.154780 1.778150 0.183104 -0.013989
4 B 1 -0.315476 0.564419 0.492496 -1.052432
5 B 2 -0.695300 0.085265 0.701724 -0.974168
6 B 3 -0.879915 -0.206499 1.597701 1.294885
7 B 4 0.653261 0.279641 -0.800613 1.050241
8 C 1 1.004199 -1.377520 -0.672913 1.491793
9 C 2 -0.453452 0.367264 -0.002362 0.411193
10 C 3 2.271958 0.240864 -0.923934 -0.572957
11 C 4 0.737893 -0.523488 0.485497 -2.371977
12 D 1 1.133661 -0.584973 -0.713320 -0.656315
13 D 2 -1.173231 -0.490667 0.634677 1.711015
14 D 3 -0.050371 -0.175644 0.124797 0.703672
15 D 4 1.349595 0.122202 -1.498178 0.013391
dfm_selection: 1.0502406808918188 <class 'numpy.float64'>
df_selection: 1.0502406808918188 <class 'numpy.float64'>
dfm_selection timeit: 63.92458086000079
df_selection timeit: 450.4555013199997
Conclusion
MultiIndex single-value retrieval is over 7 times faster than conventional
dataframe single-value retrieval.
The syntax for MultiIndex retrieval is much cleaner.

How to multiply iteratively down a column?

I am having a tough time with this one - not sure why...maybe it's the late hour.
I have a dataframe in pandas as follows:
1 10
2 11
3 20
4 5
5 10
I would like to calculate, for each row, the running product of that row and all rows above it. For example, at row 3, I would like to calculate 10*11*20, or 2,200.
How do I do this?
Use cumprod.
Example:
df = pd.DataFrame({'A': [10, 11, 20, 5, 10]}, index=range(1, 6))
df['cprod'] = df['A'].cumprod()
Note, since your example is just a single column, a cumulative product can be done succinctly with a Series:
import pandas as pd
s = pd.Series([10, 11, 20, 5, 10])
s
# Output
0 10
1 11
2 20
3 5
4 10
dtype: int64
s.cumprod()
# Output
0 10
1 110
2 2200
3 11000
4 110000
dtype: int64
Kudos to @bananafish for locating the inherent cumprod method.

How to unite the describe() results of several dataframe columns into one dataframe?

I am applying describe() to several columns of my dataframe, for example:
raw_data.groupby("user_id").size().describe()
raw_data.groupby("business_id").size().describe()
And several more, because I want to find out how many data points there are per user on average/median/etc.
My question is, each of those calls returns what seems to be unstructured output. Is there an easy way to combine them all into a single new dataframe whose columns will be [count, mean, std, min, 25%, 50%, 75%, max] and whose index will be the various columns described?
Thanks!
I might simply build a new DataFrame manually. If you have
>>> raw_data
user_id business_id data
0 10 1 5
1 20 10 6
2 20 100 7
3 30 100 8
Then the results of groupby(smth).size().describe() are just another Series:
>>> raw_data.groupby("user_id").size().describe()
count 3.000000
mean 1.333333
std 0.577350
min 1.000000
25% 1.000000
50% 1.000000
75% 1.500000
max 2.000000
dtype: float64
>>> type(_)
<class 'pandas.core.series.Series'>
and so:
>>> descrs = ((col, raw_data.groupby(col).size().describe()) for col in raw_data)
>>> pd.DataFrame.from_items(descrs).T
count mean std min 25% 50% 75% max
user_id 3 1.333333 0.57735 1 1 1 1.5 2
business_id 3 1.333333 0.57735 1 1 1 1.5 2
data 4 1.000000 0.00000 1 1 1 1.0 1
Instead of from_items I could have passed a dictionary, e.g.
pd.DataFrame({col: raw_data.groupby(col).size().describe() for col in raw_data}).T, but this way the column order is preserved without having to think about it. (Note that DataFrame.from_items has since been removed from pandas, so on current versions the dictionary form is the one to use; see the sketch below.)
If you don't want all the columns, instead of for col in raw_data, you could define columns_to_describe = ["user_id", "business_id"] etc and use for col in columns_to_describe, or use for col in raw_data if col.endswith("_id"), or whatever you like.
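Since DataFrame.from_items has been removed in newer pandas versions, here is a sketch of the dictionary-based variant on the same toy data; dict insertion order (Python 3.7+) keeps the column order without extra effort:
import pandas as pd

raw_data = pd.DataFrame({'user_id': [10, 20, 20, 30],
                         'business_id': [1, 10, 100, 100],
                         'data': [5, 6, 7, 8]})

# One row of describe() output per original column.
summary = pd.DataFrame(
    {col: raw_data.groupby(col).size().describe() for col in raw_data}).T
print(summary)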