Pandas: finding the average based on a comma-separated column

I want to group by one column, which is comma separated, and take the mean of another column.
My file looks like this:
ColumnA    ColumnB
A, B, C    2.9
A, C       9.087
D          6.78
B, D, C    5.49
My output should look like this:
A 7.4435
B 5.645
C 5.83
D 6.135
My code is this:
df = pd.DataFrame(data.ColumnA.str.split(',', expand=True).stack(), columns= ['ColumnA'])
df = df.reset_index(drop = True)
df_avg = pd.DataFrame(df.groupby(by = ['ColumnA'])['ColumnB'].mean())
df_avg = df_avg.reset_index()
It has to be along these lines, but I can't figure it out.

In your solution, ColumnB is set as the index first so its values are not lost after stack and Series.reset_index; finally, as_index=False is added so the group key stays a column after aggregation:
df = (df.set_index('ColumnB')['ColumnA']
        .str.split(',', expand=True)
        .stack()
        .reset_index(name='ColumnA')
        .groupby('ColumnA', as_index=False)['ColumnB']
        .mean())
print (df)
  ColumnA   ColumnB
0       A  5.993500
1       B  4.195000
2       C  5.825667
3       D  6.135000
Or an alternative solution with DataFrame.explode:
df = (df.assign(ColumnA = df['ColumnA'].str.split(','))
        .explode('ColumnA')
        .groupby('ColumnA', as_index=False)['ColumnB']
        .mean())
print (df)
  ColumnA   ColumnB
0       A  5.993500
1       B  4.195000
2       C  5.825667
3       D  6.135000
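Note that in the sample data each value has a space after the comma ('A, B, C'), so splitting on ',' alone would leave groups like ' B'. A hedged variant of the explode solution that strips that whitespace first (a sketch, assuming the same df as above):
df = (df.assign(ColumnA = df['ColumnA'].str.split(','))
        .explode('ColumnA')
        .assign(ColumnA = lambda d: d['ColumnA'].str.strip())  # drop the leading space left by the split
        .groupby('ColumnA', as_index=False)['ColumnB']
        .mean())
This reproduces the averages shown above (5.9935, 4.195, 5.825667 and 6.135).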

Related

Drop duplicates and combine IDs where those duplications exist [duplicate]

I tried to use groupby to group rows with multiple values.
col val
A Cat
A Tiger
B Ball
B Bat
import pandas as pd
df = pd.read_csv("Inputfile.txt", sep='\t')
group = df.groupby(['col'])['val'].sum()
I got
A CatTiger
B BallBat
I want to introduce a delimiter, so that my output looks like
A Cat-Tiger
B Ball-Bat
I tried,
group = df.groupby(['col'])['val'].sum().apply(lambda x: '-'.join(x))
this yielded,
A C-a-t-T-i-g-e-r
B B-a-l-l-B-a-t
What is the issue here?
Thanks,
AP
The issue is that .sum() first concatenates each group's strings into a single string, so the subsequent '-'.join iterates over that string's characters. You can do it this way instead:
In [48]: df.groupby('col')['val'].agg('-'.join)
Out[48]:
col
A Cat-Tiger
B Ball-Bat
Name: val, dtype: object
UPDATE: answering question from the comment:
In [2]: df
Out[2]:
col val
0 A Cat
1 A Tiger
2 A Panda
3 B Ball
4 B Bat
5 B Mouse
6 B Egg
In [3]: df.groupby('col')['val'].agg('-'.join)
Out[3]:
col
A Cat-Tiger-Panda
B Ball-Bat-Mouse-Egg
Name: val, dtype: object
Finally, to convert the index (or MultiIndex) back to columns:
df1 = df.groupby('col')['val'].agg('-'.join).reset_index(name='new')
Just try:
group = df.groupby(['col'])['val'].apply(lambda x: '-'.join(x))
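For completeness, a minimal end-to-end sketch of the join aggregation (the inline sample frame here stands in for the tab-separated Inputfile.txt from the question):
import pandas as pd

df = pd.DataFrame({'col': ['A', 'A', 'B', 'B'],
                   'val': ['Cat', 'Tiger', 'Ball', 'Bat']})

# Join each group's strings with '-' and turn the group key back into a column.
out = df.groupby('col')['val'].agg('-'.join).reset_index(name='joined')
print(out)
This prints two rows, A with Cat-Tiger and B with Ball-Bat, in the joined column.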

Is there a way to use .loc on column names instead of the values inside the columns?

I am wondering if there is a way to use .loc to select columns of a df based on whether their names match the columns of another df. I know you can usually use it to check if a value is == to something, but what about the actual column name itself?
ex.
df1 = [ 0, 1, 2, 3]
df2.columns = [2,4,6]
Is there a way to only display the df2 values whose column names also appear in df1, without hardcoding it and saying df2.loc[:, ==2]?
IIUC, you can use df2.columns.intersection to keep only the columns that are also present in df1:
>>> df1
A B D F
0 0.431332 0.663717 0.922112 0.562524
1 0.467159 0.549023 0.139306 0.168273
>>> df2
A B C D E F
0 0.451493 0.916861 0.257252 0.600656 0.354882 0.109236
1 0.676851 0.585368 0.467432 0.594848 0.962177 0.714365
>>> df2[df2.columns.intersection(df1.columns)]
A B D F
0 0.451493 0.916861 0.600656 0.109236
1 0.676851 0.585368 0.594848 0.714365
One solution:
df3 = df2[[c for c in df2.columns if c in df1]]
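If you specifically want a .loc-based spelling, a boolean mask on the column axis should work as well (a sketch, assuming df1 and df2 are the frames shown above):
# Keep only the df2 columns whose names also appear in df1.
df3 = df2.loc[:, df2.columns.isin(df1.columns)]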

pandas return auxiliary column from groupby and max

I have a pandas DataFrame with 3 columns, A, B, and V.
I want a DataFrame with A as the index and one column, which contains the B for the maximum V.
I can easily create a df with A and the maximum V using groupby, and then perform some machinations to extract the corresponding B, but that seems like the wrong idea.
I've been playing with combinations of groupby and agg with no joy.
Sample Data:
A,B,V
MHQ,Q,0.5192
MMO,Q,0.4461
MTR,Q,0.5385
MVM,Q,0.351
NCR,Q,0.0704
MHQ,E,0.5435
MMO,E,0.4533
MTR,E,-0.6716
MVM,E,0.3684
NCR,E,-0.0278
MHQ,U,0.2712
MMO,U,0.1923
MTR,U,0.3833
MVM,U,0.1355
NCR,U,0.1058
A = [1,1,1,2,2,2,3,3,3,4,4,4]
B = [1,2,3,4,5,6,7,8,9,10,11,12]
V = [21,22,23,24,25,26,27,28,29,30,31,32]
df = pd.DataFrame({'A': A, 'B': B, 'V': V})
res = df.groupby('A').apply(
    lambda x: x[x['V'] == x['V'].max()]).set_index('A')['B'].to_frame()
res
    B
A
1   3
2   6
3   9
4  12
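An alternative without apply is to pick the row index of each group's maximum with idxmax (a sketch on the same constructed df; it assumes V contains no NaNs):
# For each A, locate the row where V is largest, then keep B indexed by A.
res = df.loc[df.groupby('A')['V'].idxmax()].set_index('A')['B'].to_frame()
This yields the same table as above.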

How can I select a column where another column has a specific value?

I have a PySpark data frame. How can I select a column where another column has a specific value? Suppose I have n columns; for 2 columns I have:
A. B.
a b
a c
d f
I want all of column B where column A is a, so:
A. B.
a b
a c
It's just a simple filter:
df2 = df.filter("A = 'a'")
which comes in many flavours, such as
df2 = df.filter(df.A == 'a')
df2 = df.filter(df['A'] == 'a')
or
import pyspark.sql.functions as F
df2 = df.filter(F.col('A') == 'a')
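Since the question asks for column B only, you can chain a select after the filter (a sketch, assuming the same df and the pyspark.sql.functions import from above):
# Keep the rows where A equals 'a', then project column B.
df2 = df.filter(F.col('A') == 'a').select('B')
df2.show()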

Group a subset of values into a list with a single row per key, but add None when a condition is met

Suppose I have the following data that I want to conduct groupby on:
Key Prod Val
A a 1
A b 0
B a 1
B b 1
B d 1
C a 0
C b 0
I want to group the table so I have a single row per key, A, B and C, and a list containing the Prod values corresponding to the key. But an element should only be in the list if there's an indicator of 1 in Val for that row. If Val is 0 for the entire subset of a key, then the key should just get a None value. Here's the result I'm looking for, using the same example as above:
Key List
A [a]
B [a, b, d]
C None
What's the most efficient way to perform this in pandas?
Let's try:
df.query('Val == 1').groupby('Key')['Prod'].agg(lambda x: list(x)).reindex(df.Key.unique())
Output:
Key
A [a]
B [a, b, d]
C NaN
dtype: object
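If you need a literal None instead of the NaN that reindex leaves behind (as in the requested output), one hedged follow-up on the same idea:
out = (df.query('Val == 1')
         .groupby('Key')['Prod'].agg(list)
         .reindex(df.Key.unique())
         .reset_index(name='List'))
# reindex leaves NaN for keys with no Val == 1 (here C); swap those for None.
out['List'] = [x if isinstance(x, list) else None for x in out['List']]
print(out)
This gives [a], [a, b, d] and None for A, B and C respectively.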
I think just making a new dataframe would be easiest (taking df1 as the original frame from the question, with columns Key, Prod and Val):
df2 = pd.DataFrame(columns=['list'], index=sorted(set(df1.Key)))
# Start every key off with an empty list.
for i, row in df2.iterrows():
    df2.at[i, 'list'] = []
# Append each Prod whose Val is 1 to its key's list.
for i, row in df1.iterrows():
    key = df1.loc[i, 'Key']
    if df1.loc[i, 'Val'] == 1:
        df2.at[key, 'list'].append(df1.loc[i, 'Prod'])
# Keys that never saw Val == 1 (here C) still hold []; turn those into None.
df2['list'] = [x if x else None for x in df2['list']]