How to replace values of a column based on another data frame? - pandas

I have a column containing symbols of chemical elements and other substances. Something like this:
Commodities
sn
sulfuric acid
cu
sodium chloride
au
df1 = pd.DataFrame(['sn', 'sulfuric acid', 'cu', 'sodium chloride', 'au'], columns=['Commodities'])
And I have another data frame containing the symbols of the chemical elements and their respective names. Like this:
Symbol  Name
sn      tin
cu      copper
au      gold
df2 = pd.DataFrame({'Name': ['tin', 'copper', 'gold'], 'Symbol': ['sn', 'cu', 'au']})
I need to replace the symbols in the first data frame (df1['Commodities']) with the names from the second one (df2['Name']), so that the output looks like this:
Output:
Commodities
tin
sulfuric acid
copper
sodium chloride
gold
I tried using for loops and lambda but got different results than expected. I have tried many things and googled, I think it's something basic, but I just can't find an answer.
Thank you in advance!

First, convert df2 to a dictionary that maps each symbol to its name:
replace_dict=dict(df2[['Symbol','Name']].to_dict('split')['data'])
#{'sn': 'tin', 'cu': 'copper', 'au': 'gold'}
Then use the replace function, which only swaps values that match a dictionary key exactly:
df1['Commodities']=df1['Commodities'].replace(replace_dict)
print(df1)
'''
Commodities
0 tin
1 sulfuric acid
2 copper
3 sodium chloride
4 gold
'''
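An alternative sketch of the same lookup, assuming each symbol appears only once in df2: index df2 by Symbol, map the column through it, and fall back to the original value wherever no symbol matched. The variable name symbol_to_name is only illustrative.

```python
import pandas as pd

df1 = pd.DataFrame(['sn', 'sulfuric acid', 'cu', 'sodium chloride', 'au'],
                   columns=['Commodities'])
df2 = pd.DataFrame({'Name': ['tin', 'copper', 'gold'],
                    'Symbol': ['sn', 'cu', 'au']})

# Series indexed by symbol, holding the element names (assumes unique symbols)
symbol_to_name = df2.set_index('Symbol')['Name']

# map() yields NaN for values that are not symbols; fillna() restores the originals
df1['Commodities'] = df1['Commodities'].map(symbol_to_name).fillna(df1['Commodities'])
print(df1)
```

Compared with replace, map followed by fillna makes the "no match, keep the original" step explicit.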

Try:
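for i, row in df2.iterrows():
    # note: str.replace substitutes substrings, so a symbol that occurs inside a longer name would be replaced too
    df1.Commodities = df1.Commodities.str.replace(row.Symbol, row.Name)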
which gives df1 as:
Commodities
0 tin
1 sulfuric acid
2 copper
3 sodium chloride
4 gold
EDIT: Note that it's very likely to be far more efficient to skip defining df2 at all and just zip your lists of names and symbols together and iterate over that.

Related

Python - compare multiple columns, search list of keywords in column and compare with another, in two dataframes to generate a third resultset

I have two very different dataframes.
df1 looks like this:
Region  Entity  Desk                           Function                     Key
US      AAA     Top class, Desk1, Mike's team  Writing, advising            Unique_1
US      AAA     team beta, Blue rats, Tom      task_a, task_d               Unique_2
EMEA    ZZZ     Delta one                      Forecasts, month-end, Sales  Unique_3
JPN     XYZ     Admin                          task1, task_b, task_g        Unique_4
df2 looks like this:
Region  Entity  Desk                                  Function                             ID
EMEA    ZZZ     Equity, delta one                     Sales, sweet talking, schmoozing     A_01
US      AAA     Desk 1, A team, Top class             Writing,calling,listening, advising  A_02
US      AAA     Desk 2, Ninjas, 2nd team, Tom's team  Secret, private                      A_03
EMEA    DDD     Equity, Private Equity                task1, task2, task3, task4           A_04
JPN     XXX     Admin, Secretaries                    task_a, task_b                       A_05
df2 is a much larger recordset than df1.
Both Desk and Function in each of the dataframes were free-text fields and allowed the input of rubbish data. I am trying to build a new recordset from these dataframes using the following criteria:
where -
df1['Region'] == df2['Region']
AND
df1['Entity'] == df2['Entity']
AND
any of the phrases within df1['Desk'] can be matched to any of the phrases within df2['Desk']
AND
any of the phrases within df1['Function'] can be matched to any of the phrases within df2['Function'].
I need the ultimate output to look something like this:
df2.Id  df1.Key   MATCH
A_02    Unique_1  Exact
        Unique_2  No match
A_01    Unique_3  Exact
        Unique_4  No match
I am really struggling with this. I have both dataframes but I cannot loop through df1 to match the columns as specified above in df2. I've tried merging the dataframes, using np.where and brute force looping but nothing is working. The tricky bit is matching the Desk and Function columns.
Any ideas?
IIUC, one option is to use a cross merge :
def cross_match(df1, df2, col):
    df = df1.merge(df2, how="cross")
    colx, coly = [f"{col}_x", f"{col}_y"]
    df[[colx, coly]] = df[[colx, coly]].apply(lambda x: x.str.lower()
                                                         .str.split(r"\s*,\s*"))
    df["MATCH"] = (pd.Series([any(w in sent for w in lst)
                              for lst, sent in zip(df[f"{col}_x"], df[f"{col}_y"])])
                   .map({True:"Exact"}))
    return df.query("MATCH == 'Exact'")

desk, func = cross_match(df1, df2, "Desk"), cross_match(df1, df2, "Function")

out = (
    pd.merge(desk, func,
             left_on=["Region_x", "Entity_x", "ID"],
             right_on=["Region_y", "Entity_y", "ID"],
             suffixes=("", "_")).set_index("Key")
      .reindex(df1["Key"].unique())
      .fillna({"MATCH": "No match"})
      .reset_index()[["ID", "Key", "MATCH"]]
)
Disclaimer: this approach may become very slow on large datasets, because a cross merge produces len(df1) * len(df2) rows.
Output :
print(out)
ID Key MATCH
0 A_02 Unique_1 Exact
1 NaN Unique_2 No match
2 A_01 Unique_3 Exact
3 NaN Unique_4 No match
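Since the question requires Region and Entity to match exactly, a possible variation is to merge on those two keys first and only then compare the phrase lists, which keeps the intermediate frame smaller than a full cross merge. This is a rough sketch under that assumption; the helpers phrases and overlap are hypothetical names, not pandas functions.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Region": ["US", "US", "EMEA", "JPN"],
    "Entity": ["AAA", "AAA", "ZZZ", "XYZ"],
    "Desk": ["Top class, Desk1, Mike's team", "team beta, Blue rats, Tom",
             "Delta one", "Admin"],
    "Function": ["Writing, advising", "task_a, task_d",
                 "Forecasts, month-end, Sales", "task1, task_b, task_g"],
    "Key": ["Unique_1", "Unique_2", "Unique_3", "Unique_4"],
})
df2 = pd.DataFrame({
    "Region": ["EMEA", "US", "US", "EMEA", "JPN"],
    "Entity": ["ZZZ", "AAA", "AAA", "DDD", "XXX"],
    "Desk": ["Equity, delta one", "Desk 1, A team, Top class",
             "Desk 2, Ninjas, 2nd team, Tom's team",
             "Equity, Private Equity", "Admin, Secretaries"],
    "Function": ["Sales, sweet talking, schmoozing",
                 "Writing,calling,listening, advising", "Secret, private",
                 "task1, task2, task3, task4", "task_a, task_b"],
    "ID": ["A_01", "A_02", "A_03", "A_04", "A_05"],
})

def phrases(text):
    """Split a comma-separated free-text cell into a set of lower-cased phrases."""
    if not isinstance(text, str):  # NaN from an unmatched left-merge row
        return set()
    return {p.strip().lower() for p in text.split(",") if p.strip()}

def overlap(a, b):
    """True when the two cells share at least one identical phrase."""
    return bool(phrases(a) & phrases(b))

# Pair rows only where Region and Entity agree; unmatched df1 rows get NaN
pairs = df1.merge(df2, on=["Region", "Entity"], how="left", suffixes=("_1", "_2"))
hit = [overlap(d1, d2) and overlap(f1, f2)
       for d1, d2, f1, f2 in zip(pairs["Desk_1"], pairs["Desk_2"],
                                 pairs["Function_1"], pairs["Function_2"])]
matches = pairs.loc[hit, ["ID", "Key"]]

# Bring back every Key so rows without a hit are reported as "No match"
out = df1[["Key"]].merge(matches, on="Key", how="left")
out["MATCH"] = out["ID"].notna().map({True: "Exact", False: "No match"})
out = out[["ID", "Key", "MATCH"]]
print(out)
```

Like the cross merge version, this only finds phrases that are identical after lower-casing and trimming, so "Desk1" and "Desk 1" still do not match.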

Averaging dataframes with many string columns and display back all the columns

I have struggled with this even after looking at the various past answers to no avail.
My data consists of numeric and non-numeric columns. I'd like to average the numeric columns and display the data in the GUI together with the information in the non-numeric columns. The non-numeric columns hold info such as names, rollno and stream, while the numeric columns contain students' marks for various subjects. It works well with one dataframe but fails when I combine two or more dataframes: it returns only the average of the numeric columns and leaves the non-numeric columns undisplayed. Below is one of the codes I've tried so far.
df=pd.concat((df3,df5))
dfs =df.groupby(df.index,level=0).mean()
headers = list(dfs)
self.marks_table.setRowCount(dfs.shape[0])
self.marks_table.setColumnCount(dfs.shape[1])
self.marks_table.setHorizontalHeaderLabels(headers)
df_array = dfs.values
for row in range(dfs.shape[0]):
    for col in range(dfs.shape[1]):
        self.marks_table.setItem(row, col,QTableWidgetItem(str(df_array[row,col])))
A working code should return averages in something like this
STREAM ADM NAME KCPE ENG KIS
0 EAGLE 663 FLOYCE ATI 250 43 5
1 EAGLE 664 VERONICA 252 32 33
2 EAGLE 665 MACREEN A 341 23 23
3 EAGLE 666 BRIDGIT 286 23 2
Rather than
ADM KCPE ENG KIS
0 663.0 250.0 27.5 18.5
1 664.0 252.0 26.5 33.0
2 665.0 341.0 17.5 22.5
3 666.0 286.0 38.5 23.5
Sample data
Df1 = pd.DataFrame({
'STREAM':[NORTH,SOUTH],
'ADM':[437,238,439],
'NAME':[JAMES,MARK,PETER],
'KCPE':[233,168,349],
'ENG':[70,28,79],
'KIS':[37,82,79],
'MAT':[67,38,29]})
Df2 = pd.DataFrame({
'STREAM':[NORTH,SOUTH],
'ADM':[437,238,439],
'NAME':[JAMES,MARK,PETER],
'KCPE':[233,168,349],
'ENG':[40,12,56],
'KIS':[33,43,43],
'MAT':[22,58,23]})
Your question is not clear, so I am guessing the intent from its content. I have modified your dataframes, which were not well formed, by adding a stream called 'CENTRAL':
Df1 = pd.DataFrame({'STREAM':['NORTH','SOUTH', 'CENTRAL'],'ADM':[437,238,439], 'NAME':['JAMES','MARK','PETER'],'KCPE':[233,168,349],'ENG':[70,28,79],'KIS':[37,82,79],'MAT':[67,38,29]})
Df2 = pd.DataFrame({ 'STREAM':['NORTH','SOUTH','CENTRAL'],'ADM':[437,238,439], 'NAME':['JAMES','MARK','PETER'],'KCPE':[233,168,349],'ENG':[40,12,56],'KIS':[33,43,43],'MAT':[22,58,23]})
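I have assumed you want to combine the two dataframes and find the average. Grouping by the non-numeric columns keeps them in the result:
df3=pd.concat([Df2, Df1])  # DataFrame.append was removed in pandas 2.0
df3.groupby(['STREAM','ADM','NAME'],as_index=False).mean()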
Outcome
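The combine-then-average step can be checked with a small self-contained sketch. It assumes that STREAM, ADM and NAME together identify a student, so they serve as group keys and stay in the output while the mark columns are averaged.

```python
import pandas as pd

Df1 = pd.DataFrame({'STREAM': ['NORTH', 'SOUTH', 'CENTRAL'], 'ADM': [437, 238, 439],
                    'NAME': ['JAMES', 'MARK', 'PETER'], 'KCPE': [233, 168, 349],
                    'ENG': [70, 28, 79], 'KIS': [37, 82, 79], 'MAT': [67, 38, 29]})
Df2 = pd.DataFrame({'STREAM': ['NORTH', 'SOUTH', 'CENTRAL'], 'ADM': [437, 238, 439],
                    'NAME': ['JAMES', 'MARK', 'PETER'], 'KCPE': [233, 168, 349],
                    'ENG': [40, 12, 56], 'KIS': [33, 43, 43], 'MAT': [22, 58, 23]})

# Stack both frames, then average the marks per student
combined = pd.concat([Df1, Df2], ignore_index=True)
averaged = combined.groupby(['STREAM', 'ADM', 'NAME'], as_index=False).mean()
print(averaged)
```

Because as_index=False keeps the group keys as ordinary columns, the resulting frame can be passed to the table-filling loop from the question without losing the names.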

Pandas Dataframe: Divide Column entries by number of occurence

my Problem:
I have this DF:
df_problem = pd.DataFrame({"Share":['5%','6%','9%','9%', '9%'],"level_1":[0,0,1,2,3], 'BO':['Nestle', 'Procter', 'Nestle', 'Tesla', 'Jeff']})
The problem is that the 9% is actually split among the three shareholders. So I want to give each of them their 3% share and put it next to their names. It should then look like this:
df_solution = pd.DataFrame({"Share":['5%','6%','3%','3%', '3%'],"level_1":[0,0,0,1,2], 'BO': ['Nestle', 'Procter', 'Nestle', 'Tesla', 'Jeff']})
How do I do this in a simple way?
You could try something like this:
df_problem['Share'] = (df_problem['Share'].str.replace('%', '').astype(float) /
                       df_problem.groupby('Share')['BO'].
                       transform('count')).astype(str) + '%'
>>> df_problem
Share level_1 BO
0 5.0% 0 Nestle
1 6.0% 0 Procter
2 3.0% 1 Nestle
3 3.0% 2 Tesla
4 3.0% 3 Jeff
Please note that I converted the 'Share' column to float, as you can see in the output above.
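If the trailing ".0" is unwanted, a small variation can format the result without it. This sketch keeps the same assumption as the answer, namely that rows with an identical Share string are co-holders of one stake; the intermediate names share and holders are illustrative.

```python
import pandas as pd

df_problem = pd.DataFrame({"Share": ['5%', '6%', '9%', '9%', '9%'],
                           "level_1": [0, 0, 1, 2, 3],
                           'BO': ['Nestle', 'Procter', 'Nestle', 'Tesla', 'Jeff']})

# Numeric value of each share, and how many rows carry the same share string
share = df_problem["Share"].str.rstrip("%").astype(float)
holders = df_problem.groupby("Share")["BO"].transform("size")

# "{:g}" drops insignificant trailing zeros, so 3.0 is written as 3
df_problem["Share"] = (share / holders).map("{:g}%".format)
print(df_problem)
```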

Create new column on pandas DataFrame in which the entries are randomly selected entries from another column

I have a DataFrame with the following structure.
df = pd.DataFrame({'tenant_id': [1,1,1,2,2,2,3,3,7,7], 'user_id': ['ab1', 'avc1', 'bc2', 'iuyt', 'fvg', 'fbh', 'bcv', 'bcb', 'yth', 'ytn'],
'text':['apple', 'ball', 'card', 'toy', 'sleep', 'happy', 'sad', 'be', 'u', 'pop']})
This gives the following output:
df = df[['tenant_id', 'user_id', 'text']]
tenant_id user_id text
1 ab1 apple
1 avc1 ball
1 bc2 card
2 iuyt toy
2 fvg sleep
2 fbh happy
3 bcv sad
3 bcb be
7 yth u
7 ytn pop
I would like to groupby on tenant_id and create a new column which is a random selection of strings from the user_id column.
Thus, I would like my output to look like the following:
tenant_id user_id text new_column
1 ab1 apple [ab1, bc2]
1 avc1 ball [ab1]
1 bc2 card [avc1]
2 iuyt toy [fvg, fbh]
2 fvg sleep [fbh]
2 fbh happy [fvg]
3 bcv sad [bcb]
3 bcb be [bcv]
7 yth u [pop]
7 ytn pop [u]
Here, random id's from the user_id column have been selected, these id's can be repeated as "fvg" is repeated for tenant_id=2. I would like to have a threshold of not more than ten id's. This data is just a sample and has only 10 id's to start with, so generally any number much less than the total number of user_id's. This case say 1 less than total user_id's that belong to a tenant.
I first tried to figure out how to select a random subset of varying length with
df.sample
new_column = df.user_id.sample(n=np.random.randint(1, 10))
I am kind of lost after this; assigning it to my df results in NaNs, probably because the samples are of variable length. Please help.
Thanks.
per my comment:
Your 'new column' is not a new column, it's a new cell for a single row.
If you want to assign the result to a new column, you need to create a new column, and apply the cell computation to it.
df['new column'] = df['user_id'].apply(lambda x: df.user_id.sample(n=np.random.randint(1, 10)).tolist())
it doesn't really matter what column you use for the apply since the variable is not used in the computation
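The one-liner above samples from the whole user_id column. A possible sketch that restricts each sample to the row's own tenant is shown below. It assumes that a row should not receive its own id and that at most MAX_IDS ids are wanted; MAX_IDS and the sampled dictionary are illustrative names, and the seed is fixed only to make runs repeatable.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'tenant_id': [1, 1, 1, 2, 2, 2, 3, 3, 7, 7],
                   'user_id': ['ab1', 'avc1', 'bc2', 'iuyt', 'fvg', 'fbh',
                               'bcv', 'bcb', 'yth', 'ytn'],
                   'text': ['apple', 'ball', 'card', 'toy', 'sleep', 'happy',
                            'sad', 'be', 'u', 'pop']})

rng = np.random.default_rng(0)  # seeded generator (assumption: reproducibility wanted)
MAX_IDS = 10                    # threshold from the question

sampled = {}
for _, group in df.groupby('tenant_id'):
    ids = group['user_id'].tolist()
    for idx, uid in zip(group.index, ids):
        peers = [u for u in ids if u != uid]  # other users of the same tenant
        if not peers:                         # tenant with a single user
            sampled[idx] = []
            continue
        # pick between 1 and min(MAX_IDS, number of peers) ids, without repeats
        k = int(rng.integers(1, min(MAX_IDS, len(peers)) + 1))
        sampled[idx] = rng.choice(peers, size=k, replace=False).tolist()

# The dict keys are row labels, so the Series aligns with df on its index
df['new_column'] = pd.Series(sampled)
print(df)
```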

Filtering DataFrame by list of substrings

Building off this answer, is there a way to filter a Pandas dataframe by a list of substrings?
Say I want to find all rows where df['menu_item'] contains fresh or spaghetti
Without something like this:
df[df['menu_item'].str.contains('fresh') | df['menu_item'].str.contains('spaghetti')]
The str.contains method you're using accepts regex, so use the regex | as or:
df[df['menu_item'].str.contains('fresh|spaghetti')]
Example Input:
menu_item
0 fresh fish
1 fresher fish
2 lasagna
3 spaghetti o's
4 something edible
Example Output:
menu_item
0 fresh fish
1 fresher fish
3 spaghetti o's
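When the substrings come from a list, the pattern can be built programmatically. The sketch below assumes the substrings should be matched literally, so each one is passed through re.escape before joining with |; the list name substrings is illustrative.

```python
import re
import pandas as pd

df = pd.DataFrame({'menu_item': ['fresh fish', 'fresher fish', 'lasagna',
                                 "spaghetti o's", 'something edible']})
substrings = ['fresh', 'spaghetti']

# Escape regex metacharacters so entries such as "a+b" are treated as plain text
pattern = '|'.join(re.escape(s) for s in substrings)
filtered = df[df['menu_item'].str.contains(pattern, regex=True)]
print(filtered)
```

If the column may contain missing values, passing na=False to str.contains keeps the mask boolean.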