Conditional frequency of elements within lists in a pandas data frame

I have a data frame in pandas like this:
STATUS  FEATURES
A       [x, y, z]
A       [t, y]
B       [x, p, t]
B       [x, p]
I want to count the frequency of the elements in the lists of features conditional on the status.
The desired output would be:
STATUS  FEATURES  FREQUENCY
A       x         1
A       y         2
A       z         1
A       t         1
B       x         2
B       t         1
B       p         2

Let us do explode, then groupby with size:
s = df.explode('FEATURES').groupby(['STATUS', 'FEATURES']).size().reset_index(name='FREQUENCY')
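For context, a minimal runnable sketch of that approach, assuming the frame from the question:

import pandas as pd

# rebuild the example frame from the question
df = pd.DataFrame({'STATUS': ['A', 'A', 'B', 'B'],
                   'FEATURES': [['x', 'y', 'z'], ['t', 'y'],
                                ['x', 'p', 't'], ['x', 'p']]})

s = (df.explode('FEATURES')              # one row per list element
       .groupby(['STATUS', 'FEATURES'])
       .size()                           # count occurrences per pair
       .reset_index(name='FREQUENCY'))
print(s)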

Use DataFrame.explode and SeriesGroupBy.value_counts:
new_df = (df.explode('FEATURES')
            .groupby('STATUS')['FEATURES']
            .value_counts()
            .reset_index(name='FREQUENCY'))
print(new_df)
Output
  STATUS FEATURES  FREQUENCY
0      A        y          2
1      A        t          1
2      A        x          1
3      A        z          1
4      B        p          2
5      B        x          2
6      B        t          1


Multimatch join in pandas

I am looking to join two data frames on one column and, if there are multiple matches, append the results to another column.
NB: using a different example, as yours is not reproducible.
You can convert with str.upper, split and explode, map the values, then groupby.agg back to a string:
mapper = df2.set_index('name')['ID'].astype(str)

df1['ID'] = (df1['name']
             .str.upper().str.split(',')
             .explode()
             .map(mapper)
             .groupby(level=0).agg(','.join)
             )
Or, with a list comprehension:
mapper = df2.set_index('name')['ID'].astype(str)

df1['ID'] = [','.join([mapper[x] for x in s.upper().split(',') if x in mapper])
             for s in df1['name']]
output:
  name   ID
0    A    1
1    b    2
2  A,B  1,2
3  C,a  3,1
4    D    4
Used input:
# df1
  name
0    A
1    b
2  A,B
3  C,a
4    D

# df2
  name  ID
0    A   1
1    B   2
2    C   3
3    D   4
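For anyone reproducing this, the two frames above could be built like so (a sketch based on the "Used input" shown):

import pandas as pd

df1 = pd.DataFrame({'name': ['A', 'b', 'A,B', 'C,a', 'D']})
df2 = pd.DataFrame({'name': ['A', 'B', 'C', 'D'], 'ID': [1, 2, 3, 4]})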

df.apply(myfunc, axis=1) results in error but df.groupby(df.index).apply(myfunc) does not

I have a dataframe that looks like this:
   a  b  c
0  x  x  x
1  y  y  y
2  z  z  z
I would like to apply a function to each row of the dataframe. That function creates a new dataframe with multiple rows from each input row and returns it. Here is my_func:
def my_func(df):
    dup_num = int(df.c - df.a)
    if isinstance(df, pd.Series):
        df_expanded = pd.concat([pd.DataFrame(df).transpose()] * dup_num,
                                ignore_index=True)
    else:
        df_expanded = pd.concat([pd.DataFrame(df)] * dup_num,
                                ignore_index=True)
    return df_expanded
The final dataframe will look something like this:
   a  b  c
0  x  x  x
1  x  x  x
2  y  y  y
3  y  y  y
4  y  y  y
5  z  z  z
6  z  z  z
So I did:
df_expanded = df.apply(my_func, axis=1)
I inserted breakpoints inside the function and for each row, the created dataframe from my_func is correct. However, at the end, when the last row returns, I get an error stating that:
ValueError: cannot copy sequence with size XX to array axis with dimension YY
As if apply is trying to return a Series, not the group of DataFrames that the function created.
So instead of df.apply I did:
df_expanded = df.groupby(df.index).apply(my_func)
Which just creates groups of single rows and applies the same function. This on the other hand works.
Why?
Perhaps we can take advantage of how pd.Series.explode and pd.Series.apply(pd.Series) work to simplify this process.
Given:
   a  b  c
0  1  1  4
1  2  2  4
2  3  3  4
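For a quick reproduction, that frame can be built as follows (a sketch, values taken from the table above):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [1, 2, 3], 'c': [4, 4, 4]})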
Doing:
new_df = (df.apply(lambda x: [x.tolist()] * (x.c - x.a), axis=1)
            .explode(ignore_index=True)
            .apply(pd.Series))
new_df.columns = df.columns
print(new_df)
Output:
   a  b  c
0  1  1  4
1  1  1  4
2  1  1  4
3  2  2  4
4  2  2  4
5  3  3  4

Pandas - assign column values to new column names

I have this dataframe:
player_id  scout_occ  round  scout
   812842          2      1      X
   812842          4      1      Y
   812842          1      1      Z
   812842          1      2      X
   812842          2      2      Y
   812842          2      2      Z
And I need to transpose the 'scout' values to columns, using the number of occurrences as the value for these new columns, ending up with:
player_id  round  X  Y  Z
   812842      1  2  4  1
   812842      2  1  2  2
How do I achieve this?
Use pivot_table. For example:
df = df.pivot_table(values='scout_occ', index=['player_id', 'round'], columns='scout')
Then, if you don't want to keep the columns name (scout):
df.columns.name = None
Also, if you want to use player_id and round as columns rather than as the index:
df.reset_index()
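Putting those steps together on the sample data (a sketch; since each player_id/round/scout combination appears once, pivot_table's default mean aggregation simply returns the single scout_occ value, possibly as a float):

import pandas as pd

df = pd.DataFrame({'player_id': [812842] * 6,
                   'scout_occ': [2, 4, 1, 1, 2, 2],
                   'round':     [1, 1, 1, 2, 2, 2],
                   'scout':     ['X', 'Y', 'Z', 'X', 'Y', 'Z']})

out = df.pivot_table(values='scout_occ', index=['player_id', 'round'],
                     columns='scout')
out.columns.name = None   # drop the 'scout' columns name
out = out.reset_index()   # turn player_id and round back into columns
print(out)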

How to *multiply* (for lack of a better term) two dataframes [duplicate]

The contents of this post were originally meant to be a part of
Pandas Merging 101,
but due to the nature and size of the content required to fully do
justice to this topic, it has been moved to its own QnA.
Given two simple DataFrames;
left = pd.DataFrame({'col1' : ['A', 'B', 'C'], 'col2' : [1, 2, 3]})
right = pd.DataFrame({'col1' : ['X', 'Y', 'Z'], 'col2' : [20, 30, 50]})
left
  col1  col2
0    A     1
1    B     2
2    C     3

right
  col1  col2
0    X    20
1    Y    30
2    Z    50
The cross product of these frames can be computed, and will look something like:
A 1 X 20
A 1 Y 30
A 1 Z 50
B 2 X 20
B 2 Y 30
B 2 Z 50
C 3 X 20
C 3 Y 30
C 3 Z 50
What is the most performant method of computing this result?
Let's start by establishing a benchmark. The easiest method for solving this is using a temporary "key" column:
pandas <= 1.1.X
def cartesian_product_basic(left, right):
    return (
        left.assign(key=1).merge(right.assign(key=1), on='key')
            .drop('key', axis=1))

cartesian_product_basic(left, right)
pandas >= 1.2
left.merge(right, how="cross") # implements the technique above
col1_x col2_x col1_y col2_y
0 A 1 X 20
1 A 1 Y 30
2 A 1 Z 50
3 B 2 X 20
4 B 2 Y 30
5 B 2 Z 50
6 C 3 X 20
7 C 3 Y 30
8 C 3 Z 50
How this works is that both DataFrames are assigned a temporary "key" column with the same value (say, 1). merge then performs a many-to-many JOIN on "key".
While the many-to-many JOIN trick works for reasonably sized DataFrames, you will see relatively lower performance on larger data.
A faster implementation will require NumPy. Here are some famous NumPy implementations of 1D cartesian product. We can build on some of these performant solutions to get our desired output. My favourite, however, is #senderle's first implementation.
import numpy as np

def cartesian_product(*arrays):
    la = len(arrays)
    dtype = np.result_type(*arrays)
    arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)
    for i, a in enumerate(np.ix_(*arrays)):
        arr[..., i] = a
    return arr.reshape(-1, la)
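For instance, applied to two small arrays (a quick sanity check, not from the original post):

a = np.array([1, 2])
b = np.array([10, 20, 30])
print(cartesian_product(a, b))
# [[ 1 10]
#  [ 1 20]
#  [ 1 30]
#  [ 2 10]
#  [ 2 20]
#  [ 2 30]]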
Generalizing: CROSS JOIN on Unique or Non-Unique Indexed DataFrames
Disclaimer
These solutions are optimised for DataFrames with non-mixed scalar dtypes. If dealing with mixed dtypes, use at your
own risk!
This trick will work on any kind of DataFrame. We compute the cartesian product of the DataFrames' numeric indices using the aforementioned cartesian_product, use this to reindex the DataFrames, and stack the reindexed values side by side.
def cartesian_product_generalized(left, right):
    la, lb = len(left), len(right)
    idx = cartesian_product(np.ogrid[:la], np.ogrid[:lb])
    return pd.DataFrame(
        np.column_stack([left.values[idx[:, 0]], right.values[idx[:, 1]]]))

cartesian_product_generalized(left, right)
0 1 2 3
0 A 1 X 20
1 A 1 Y 30
2 A 1 Z 50
3 B 2 X 20
4 B 2 Y 30
5 B 2 Z 50
6 C 3 X 20
7 C 3 Y 30
8 C 3 Z 50
np.array_equal(cartesian_product_generalized(left, right),
               cartesian_product_basic(left, right))
True
And, along similar lines,
left2 = left.copy()
left2.index = ['s1', 's2', 's1']
right2 = right.copy()
right2.index = ['x', 'y', 'y']
left2
   col1  col2
s1    A     1
s2    B     2
s1    C     3

right2
  col1  col2
x    X    20
y    Y    30
y    Z    50
np.array_equal(cartesian_product_generalized(left2, right2),
               cartesian_product_basic(left2, right2))
True
This solution can generalise to multiple DataFrames. For example,
def cartesian_product_multi(*dfs):
    idx = cartesian_product(*[np.ogrid[:len(df)] for df in dfs])
    return pd.DataFrame(
        np.column_stack([df.values[idx[:, i]] for i, df in enumerate(dfs)]))

cartesian_product_multi(*[left, right, left]).head()
   0  1  2   3  4  5
0  A  1  X  20  A  1
1  A  1  X  20  B  2
2  A  1  X  20  C  3
3  A  1  Y  30  A  1
4  A  1  Y  30  B  2
Further Simplification
A simpler solution not involving #senderle's cartesian_product is possible when dealing with just two DataFrames. Using np.broadcast_arrays, we can achieve almost the same level of performance.
def cartesian_product_simplified(left, right):
    la, lb = len(left), len(right)
    ia2, ib2 = np.broadcast_arrays(*np.ogrid[:la, :lb])
    return pd.DataFrame(
        np.column_stack([left.values[ia2.ravel()], right.values[ib2.ravel()]]))
np.array_equal(cartesian_product_simplified(left, right),
               cartesian_product_basic(left2, right2))
True
Performance Comparison
Benchmarking these solutions on some contrived DataFrames with unique indices, we get the relative timings plotted by the script below.
Do note that timings may vary based on your setup, data, and choice of cartesian_product helper function as applicable.
Performance Benchmarking Code
This is the timing script. All functions called here are defined above.
from timeit import timeit
import pandas as pd
import matplotlib.pyplot as plt
res = pd.DataFrame(
    index=['cartesian_product_basic', 'cartesian_product_generalized',
           'cartesian_product_multi', 'cartesian_product_simplified'],
    columns=[1, 10, 50, 100, 200, 300, 400, 500, 600, 800, 1000, 2000],
    dtype=float
)

for f in res.index:
    for c in res.columns:
        # print(f, c)
        left2 = pd.concat([left] * c, ignore_index=True)
        right2 = pd.concat([right] * c, ignore_index=True)
        stmt = '{}(left2, right2)'.format(f)
        setp = 'from __main__ import left2, right2, {}'.format(f)
        res.at[f, c] = timeit(stmt, setp, number=5)

ax = res.div(res.min()).T.plot(loglog=True)
ax.set_xlabel("N")
ax.set_ylabel("time (relative)")
plt.show()
Continue Reading
Jump to other topics in Pandas Merging 101 to continue learning:
Merging basics - basic types of joins
Index-based joins
Generalizing to multiple DataFrames
Cross join *
* you are here
Since pandas 1.2.0, merge has the option how='cross':
left.merge(right, how='cross')
Using itertools.product and recreating the values in a dataframe:
import itertools

l = list(itertools.product(left.values.tolist(), right.values.tolist()))
pd.DataFrame(list(map(lambda x: sum(x, []), l)))
0 1 2 3
0 A 1 X 20
1 A 1 Y 30
2 A 1 Z 50
3 B 2 X 20
4 B 2 Y 30
5 B 2 Z 50
6 C 3 X 20
7 C 3 Y 30
8 C 3 Z 50
Here's an approach with triple concat
m = pd.concat([pd.concat([left] * len(right)).sort_index().reset_index(drop=True),
               pd.concat([right] * len(left)).reset_index(drop=True)], axis=1)
m
col1 col2 col1 col2
0 A 1 X 20
1 A 1 Y 30
2 A 1 Z 50
3 B 2 X 20
4 B 2 Y 30
5 B 2 Z 50
6 C 3 X 20
7 C 3 Y 30
8 C 3 Z 50
One option is with expand_grid from pyjanitor:
# pip install pyjanitor
import pandas as pd
import janitor as jn
others = {'left':left, 'right':right}
jn.expand_grid(others = others)
left right
col1 col2 col1 col2
0 A 1 X 20
1 A 1 Y 30
2 A 1 Z 50
3 B 2 X 20
4 B 2 Y 30
5 B 2 Z 50
6 C 3 X 20
7 C 3 Y 30
8 C 3 Z 50
I think the simplest way would be to add a dummy column to each data frame, do an inner merge on it and then drop that dummy column from the resulting cartesian dataframe:
left['dummy'] = 'a'
right['dummy'] = 'a'
cartesian = left.merge(right, how='inner', on='dummy')
del cartesian['dummy']

Pandas groupby sort each group values and order dataframe groups based on max of each group

I have a dataset containing 3 columns. I'm trying to group the rows and print each group in sorted order (based on the highest value in each group). The records within each group also have to be sorted.
The dataset looks like this:
key1,key2,val
b,y,21
c,y,25
c,z,10
b,x,20
b,z,5
c,x,17
a,x,15
a,y,18
a,z,100
df=pd.read_csv('/tmp/hello.csv')
df['max'] = df.groupby(['key1'])['val'].transform('max')
dff=df.sort_values(['max', 'val'], ascending=False).drop('max', axis=1)
I'm applying transform since it works on a per-group basis, and then sorting the values.
The above code results in my desired dataframe:
a,z,100
a,y,18
a,x,15
c,y,25
c,x,17
c,z,10
b,y,21
b,x,20
b,z,5
But the same code fails for the dataset below.
key1,key2,val
b,y,10
c,y,10
c,z,10
b,x,2
b,z,2
c,x,2
a,x,2
a,y,2
a,z,2
Below is the desired output
key1,key2,val
c,y,10
c,z,10
c,x,2
b,y,10
b,x,2
b,z,2
a,x,2
a,y,2
a,z,2
Please help me in properly grouping and sorting the dataframe for my scenario.
Add the column key1 to sort_values, because in the second DataFrame there are multiple maximum values of 10 across groups, so sorting cannot distinguish the groups:
df['max'] = df.groupby(['key1'])['val'].transform('max')
dff = df.sort_values(['max', 'key1', 'val'], ascending=False).drop('max', axis=1)
print(dff)
key1 key2 val
8 a z 100
7 a y 18
6 a x 15
1 c y 25
5 c x 17
2 c z 10
0 b y 21
3 b x 20
4 b z 5
The same code applied to the second dataset:
df['max'] = df.groupby(['key1'])['val'].transform('max')
dff = df.sort_values(['max', 'key1', 'val'], ascending=False).drop('max', axis=1)
print(dff)
key1 key2 val
1 c y 10
2 c z 10
5 c x 2
0 b y 10
3 b x 2
4 b z 2
6 a x 2
7 a y 2
8 a z 2
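To try this without writing /tmp/hello.csv, the second dataset can be read from an inline string (a sketch using io.StringIO):

import io
import pandas as pd

csv_text = """key1,key2,val
b,y,10
c,y,10
c,z,10
b,x,2
b,z,2
c,x,2
a,x,2
a,y,2
a,z,2
"""

df = pd.read_csv(io.StringIO(csv_text))
df['max'] = df.groupby(['key1'])['val'].transform('max')          # per-group maximum
dff = df.sort_values(['max', 'key1', 'val'], ascending=False).drop('max', axis=1)
print(dff)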