How to group stats by words in pandas dataframe - pandas

I want to aggregate a pandas DataFrame by word.
There are three columns: a click count, an impression count, and the corresponding phrase. I would like to split each phrase into tokens and sum the clicks and impressions per token, to decide which tokens perform relatively well or badly.
Expected input: a pandas DataFrame as below
click_count impression_count text
1 10 100 pizza
2 20 200 pizza italian
3 1 1 italian cheese
Expected output:
click_count impression_count token
1 30 300 pizza // 30 = 20 + 10, 300 = 200+100
2 21 201 italian // 21 = 20 + 1
3 1 1 cheese // cheese only appeared once in italian cheese

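# One column per token position; shorter phrases are padded with missing values
tokens = df.text.str.split(expand=True)
token_cols = ['token_{}'.format(i) for i in range(tokens.shape[1])]
tokens.columns = token_cols
# Put the token columns next to the counts
df1 = pd.concat([df.drop('text', axis=1), tokens], axis=1)
df1
# Stack all token columns into a single 'tokens' column (missing tokens are dropped)
df2 = pd.lreshape(df1, {'tokens': token_cols})
df2
# Sum clicks and impressions per token
df2.groupby('tokens').sum()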

This creates a new DataFrame like piRSquared's but tokens are stacked and merged with the original:
(df['text'].str.split(expand=True).stack().reset_index(level=1, drop=True)
.to_frame('token').merge(df, left_index=True, right_index=True)
.groupby('token')[['click_count', 'impression_count']].sum())
Out:
click_count impression_count
token
cheese 1 1
italian 21 201
pizza 30 300
If you break this down, it merges this:
df['text'].str.split(expand=True).stack().reset_index(level=1, drop=True).to_frame('token')
Out:
token
1 pizza
2 pizza
2 italian
3 italian
3 cheese
with the original DataFrame on their indices. The resulting df is:
(df['text'].str.split(expand=True).stack().reset_index(level=1, drop=True)
.to_frame('token').merge(df, left_index=True, right_index=True))
Out:
token click_count impression_count text
1 pizza 10 100 pizza
2 pizza 20 200 pizza italian
2 italian 20 200 pizza italian
3 italian 1 1 italian cheese
3 cheese 1 1 italian cheese
The rest is grouping by the token column.
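For comparison, here is a minimal self-contained sketch of the same idea using DataFrame.explode instead of stack and merge. This is a swapped-in technique, not the approach of the answer above, and it assumes pandas 0.25 or newer; the name per_token is only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "click_count": [10, 20, 1],
        "impression_count": [100, 200, 1],
        "text": ["pizza", "pizza italian", "italian cheese"],
    },
    index=[1, 2, 3],
)

# Turn each phrase into a list of tokens, then give every token its own row;
# explode repeats the click/impression counts of the source row.
per_token = (
    df.assign(token=df["text"].str.split())
      .explode("token")
      .groupby("token", sort=False)[["click_count", "impression_count"]]
      .sum()
)
print(per_token)
```

With sort=False the tokens stay in order of first appearance, which may be easier to compare against the input.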

You could do
In [3091]: s = df.text.str.split(expand=True).stack().reset_index(drop=True, level=-1)
In [3092]: df.loc[s.index].assign(token=s).groupby('token',sort=False,as_index=False).sum(numeric_only=True)
Out[3092]:
token click_count impression_count
0 pizza 30 300
1 italian 21 201
2 cheese 1 1
Details
In [3093]: df
Out[3093]:
click_count impression_count text
1 10 100 pizza
2 20 200 pizza italian
3 1 1 italian cheese
In [3094]: s
Out[3094]:
1 pizza
2 pizza
2 italian
3 italian
3 cheese
dtype: object

Related

Python pandas dataframe, how to get the set number

Here is an example:
df=pd.DataFrame([('apple'),('apple'),('apple'),('orange'),('orange')],columns=['A'])
df
Out[5]:
A
0 apple
1 apple
2 apple
3 orange
4 orange
I want to assign each value a number based on its position among the distinct values ['apple','orange']: column B should be 1 for apple and 2 for orange:
A B
0 apple 1
1 apple 1
2 apple 1
3 orange 2
4 orange 2
The following doesn't work:
df['B']=df['A'].tolist().index(df['A']) +1
You can use the pd.factorize function, which encodes each distinct value as an integer code, numbered in order of first appearance.
It is also available as a method on pd.Series objects:
codes, _ = df["A"].factorize()
df["B"] = codes + 1
print(df)
A B
0 apple 1
1 apple 1
2 apple 1
3 orange 2
4 orange 2
Use groupby ngroup + 1 with sort=False to ensure groups are enumerated in the order they appear in the DataFrame:
df['B'] = df.groupby('A', sort=False).ngroup() + 1
df:
A B
0 apple 1
1 apple 1
2 apple 1
3 orange 2
4 orange 2
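To see why sort=False matters, here is a small sketch on data where the first value is not the alphabetically smallest. The sample values and the column names by_factorize, by_ngroup and by_ngroup_sorted are made up for illustration.

```python
import pandas as pd

# 'orange' appears first but sorts after 'apple' and 'kiwi'
df = pd.DataFrame({"A": ["orange", "apple", "apple", "orange", "kiwi"]})

# factorize numbers values in order of first appearance by default
codes, uniques = pd.factorize(df["A"])
df["by_factorize"] = codes + 1

# ngroup with sort=False follows the same order of appearance
df["by_ngroup"] = df.groupby("A", sort=False).ngroup() + 1

# the default sort=True numbers groups by sorted key instead
df["by_ngroup_sorted"] = df.groupby("A").ngroup() + 1

print(df)
```

The first two columns agree with each other, while the last one numbers apple as 1 even though orange comes first in the data.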

Get value from another df based on condition

I have two DataFrames.
df1:
ID X Y Cond
Johnson 2 3 fine
Sand NAN NAN sick
Cooper 1 2 fine
Nelson 1 2 fine
Peterson 4 5 fine
and df2 :
id2 X Y
Magic 2 3
Sand 2 3
Cooper 1 2
Dean 1 2
I want to update the X and Y values in df1 where Cond == "sick" and df1["ID"] == df2["id2"],
to get the new df1:
ID X Y Cond
Johnson 2 3 fine
Sand 2 3 sick
Cooper 1 2 fine
Nelson 1 2 fine
Peterson 4 5 fine
I tried:
df1["X"] = np.where((df1["Cond"]=="sick") & (df1["ID"]==df2["id2"]), df2["X"], "")
But it's not working. I get this ValueError:
ValueError: Can only compare identically-labeled Series objects
Thank you
First set the ID columns as the index of both DataFrames, so that the assignment through DataFrame.loc aligns the selected rows by ID:
df11 = df1.set_index('ID')
df22 = df2.set_index('id2')
df11.loc[df11["Cond"]=="sick", ['X','Y']] = df22[['X','Y']]
df = df11.reset_index()
print (df)
ID X Y Cond
0 Johnson 2 3 fine
1 Sand 2 3 sick
2 Cooper 1 2 fine
3 Nelson 1 2 fine
4 Peterson 4 5 fine
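You can use the where() method of pandas DataFrames instead of the where function from numpy. Note that where() aligns df2 to df1 on the row index, so with the default integer index this works here only because Sand sits at the same position in both DataFrames. The code looks like this: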
df1.loc[:,["X", "Y"]] = df1.loc[:,["X", "Y"]].where(df1["Cond"]!="sick",df2.loc[:,["X", "Y"]])
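When the matching rows are not at the same position in both DataFrames, matching by ID is the safer route. Below is a minimal sketch of that, using Series.map as a lookup. The data is shortened and changed on purpose (Sand is moved to a different row and given the made-up values 7 and 8), and the names lookup and sick are illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({
    "ID": ["Johnson", "Sand", "Cooper"],
    "X": [2, np.nan, 1],
    "Y": [3, np.nan, 2],
    "Cond": ["fine", "sick", "fine"],
})
# Sand is deliberately at a different row position than in df1
df2 = pd.DataFrame({"id2": ["Sand", "Magic", "Cooper"], "X": [7, 2, 1], "Y": [8, 3, 2]})

lookup = df2.set_index("id2")   # id2 values must be unique for map to work
sick = df1["Cond"].eq("sick")

for col in ["X", "Y"]:
    # map looks up each sick ID in df2; IDs missing from df2 would give NaN
    df1.loc[sick, col] = df1.loc[sick, "ID"].map(lookup[col])

print(df1)
```

Only the rows flagged as sick are touched, and the values come from the df2 row with the same ID regardless of row order.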

how to apply one hot encoding or get dummies on 2 columns together in pandas?

I have the DataFrame below, with sample values like:
df = pd.DataFrame([["London", "Cambridge", 20], ["Cambridge", "London", 10], ["Liverpool", "London", 30]], columns= ["city_1", "city_2", "id"])
city_1 city_2 id
London Cambridge 20
Cambridge London 10
Liverpool London 30
I need the output DataFrame below, which is built by combining the two city columns and then applying one-hot encoding:
id London Cambridge Liverpool
20 1 1 0
10 1 1 0
30 1 0 1
Currently I am using the code below, which encodes each column separately. Is there a Pythonic way to get the output above?
output_df = pd.get_dummies(df, columns=['city_1', 'city_2'])
which results in the columns id, city_1_Cambridge, city_1_London, and so on.
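You can pass prefix='' and prefix_sep='' to get_dummies so that both city columns produce identically named dummy columns, then combine the duplicated columns with max if you want only 1 or 0 values (dummies or indicator columns), or with sum if you need counts:
# DataFrame.max(axis=1, level=0) was removed in pandas 2.0, so group the transposed frame by label instead;
# dtype=int keeps 0/1 values (newer pandas returns booleans by default)
output_df = (pd.get_dummies(df, columns=['city_1', 'city_2'], prefix_sep='', prefix='', dtype=int)
.T.groupby(level=0, sort=False).max().T)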
print (output_df)
id Cambridge Liverpool London
0 20 1 0 1
1 10 1 0 1
2 30 0 1 1
Or, to process every column except id, first move the column(s) you want to exclude into the index with DataFrame.set_index, then apply get_dummies with max, and finally call DataFrame.reset_index:
# Same grouping of duplicated column labels as above, done on the transposed frame
output_df = (pd.get_dummies(df.set_index('id'), prefix_sep='', prefix='', dtype=int)
.T.groupby(level=0, sort=False).max().T
.reset_index())
print (output_df)
id Cambridge Liverpool London
0 20 1 0 1
1 10 1 0 1
2 30 0 1 1
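As an alternative that avoids duplicated column names altogether, the two city columns can be stacked into one long column first and then counted. This is a swapped-in technique (melt plus pd.crosstab), not what the answer above does, and the names long and output_df are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [["London", "Cambridge", 20], ["Cambridge", "London", 10], ["Liverpool", "London", 30]],
    columns=["city_1", "city_2", "id"],
)

# Stack both city columns into one long 'city' column, keeping the id of each row
long = df.melt(id_vars="id", value_name="city")

output_df = (
    pd.crosstab(long["id"], long["city"])   # count of each city per id
      .clip(upper=1)                        # cap counts at 1 to get indicators
      .reindex(df["id"].tolist())           # restore the original row order
      .rename_axis("id")
      .reset_index()
)
output_df.columns.name = None               # drop the leftover 'city' axis label
print(output_df)
```

Dropping the clip step would give counts rather than indicators, which matters if the same city can appear in both columns of one row.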

python pandas - set column value of column based on index and or ID of concatenated dataframes

I have a DataFrame built by concatenating at least two DataFrames,
i.e.
df1
Name | Type | ID
0 Joe A 1
1 Fred B 2
2 Mike Both 3
3 Frank Both 4
df2
Name | Type | ID
0 Bill Both 1
1 Jill Both 2
2 Mill B 3
3 Hill A 4
ConcatDf:
Name | Type | ID
0 Joe A 1
1 Fred B 2
2 Mike Both 3
3 Frank Both 4
0 Bill Both 1
1 Jill Both 2
2 Mill B 3
3 Hill A 4
Suppose after they are concatenated, I'd like to set Type for all records from df1 to C and all records from df2 to B. Is this possible?
The indices of the dataframes can be vastly different sizes.
Thanks in advance.
df3 = pd.concat([df1, df2], keys=(1, 2))
df3.loc[1, 'Type'] = 'C'
df3.loc[2, 'Type'] = 'B'
When you concat, you can assign keys to the DataFrames. This creates a MultiIndex whose outer level separates the concatenated DataFrames. You can then pass a key to .loc to select that group. The code above sets Type to C for all rows from df1 (key 1) and to B for all rows from df2 (key 2).
Use merge with indicator=True to find which rows of the concatenated DataFrame come from df1. Next, use np.where to assign C to those rows and B to the rest.
t = concatdf.merge(df1, how='left', on=concatdf.columns.tolist(), indicator=True)
concatdf['Type'] = np.where(t._merge.eq('left_only'), 'B', 'C')
Out[2185]:
Name Type ID
0 Joe C 1
1 Fred C 2
2 Mike C 3
3 Frank C 4
0 Bill B 1
1 Jill B 2
2 Mill B 3
3 Hill B 4
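If there are more than two source DataFrames, the keys idea can be extended with a dictionary that maps each key to its new Type. The sketch below is one possible way to do that; the key labels "first" and "second", the shortened sample data, and the name type_by_source are assumptions made for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"Name": ["Joe", "Fred"], "Type": ["A", "B"], "ID": [1, 2]})
df2 = pd.DataFrame({"Name": ["Bill", "Jill", "Mill"], "Type": ["Both", "Both", "B"], "ID": [1, 2, 3]})

# keys adds an outer index level that records which frame each row came from
df3 = pd.concat([df1, df2], keys=["first", "second"])

# One entry per source frame; extend this dict for additional frames
type_by_source = {"first": "C", "second": "B"}
source = df3.index.get_level_values(0)

# to_numpy() makes the assignment purely positional
df3["Type"] = source.map(type_by_source).to_numpy()

# Drop the outer level to get back the original row labels
result = df3.droplevel(0)
print(result)
```

Because the source is read from the index, this works even when the frames have different lengths or overlapping row labels.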

How to manipulate a specific condition in some columns with pandas

I have a question about how to modify values in a column only where a specific condition holds.
For example,
from pandas import DataFrame
import pandas as pd
df = DataFrame({'name' : ['apple','pineapple','melon','orange','mango','durian'],
'amt' : [200,300,100,1,3,120]},
index = ['1','2','3','4','5','6'])
print(df)
I can see,
amt name
1 200 apple
2 300 pineapple
3 100 melon
4 1 orange
5 3 mango
6 120 durian
From the result above, I want to change the amt of apple only, leaving the other items unchanged.
I only know this:
df.loc[df.name.str.contains('apple'), 'amt'] = df['amt']/100
This syntax modifies not only 'apple' but also 'pineapple'.
I'd like to change only apple's amt, like this:
amt name
1 2 apple
2 300 pineapple
3 100 melon
4 1 orange
5 3 mango
6 120 durian
Can anyone help me?
Thanks for reading.
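One possible approach, sketched here under the assumption that "apple" should match the whole name rather than a substring, is to compare for exact equality instead of using str.contains. The name is_apple is illustrative, and integer division is used so that amt stays an integer column, which differs slightly from the / 100 in the question.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["apple", "pineapple", "melon", "orange", "mango", "durian"],
        "amt": [200, 300, 100, 1, 3, 120],
    },
    index=["1", "2", "3", "4", "5", "6"],
)

# Exact equality matches only the row whose name is exactly 'apple';
# str.contains('apple') would also match 'pineapple'.
is_apple = df["name"].eq("apple")

# Integer division keeps amt as integers (200 // 100 == 2)
df.loc[is_apple, "amt"] = df.loc[is_apple, "amt"] // 100

print(df)
```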