Python - compare multiple columns and match lists of keywords across two dataframes to generate a third result set - pandas

I have two very different dataframes.
df1 looks like this:
Region  Entity  Desk                           Function                     Key
US      AAA     Top class, Desk1, Mike's team  Writing, advising            Unique_1
US      AAA     team beta, Blue rats, Tom      task_a, task_d               Unique_2
EMEA    ZZZ     Delta one                      Forecasts, month-end, Sales  Unique_3
JPN     XYZ     Admin                          task1, task_b, task_g        Unique_4
df2 looks like this:
Region  Entity  Desk                                  Function                             ID
EMEA    ZZZ     Equity, delta one                     Sales, sweet talking, schmoozing     A_01
US      AAA     Desk 1, A team, Top class             Writing,calling,listening, advising  A_02
US      AAA     Desk 2, Ninjas, 2nd team, Tom's team  Secret, private                      A_03
EMEA    DDD     Equity, Private Equity                task1, task2, task3, task4           A_04
JPN     XXX     Admin, Secretaries                    task_a, task_b                       A_05
df2 is a much larger recordset than df1.
Both Desk and Function in each of the dataframes were free-text fields and allowed the input of rubbish data. I am trying to build a new recordset from these dataframes using the following criteria:
where -
df1['Region'] == df2['Region']
AND
df1['Entity'] == df2['Entity']
AND
any of the phrases within df1['Desk'] can be matched to any of the phrases within df2['Desk']
AND
any of the phrases within df1['Function'] can be matched to any of the phrases within df2['Function'].
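For reference, the two sample frames can be rebuilt exactly from the tables above:
import pandas as pd

df1 = pd.DataFrame({
    'Region': ['US', 'US', 'EMEA', 'JPN'],
    'Entity': ['AAA', 'AAA', 'ZZZ', 'XYZ'],
    'Desk': ["Top class, Desk1, Mike's team", 'team beta, Blue rats, Tom',
             'Delta one', 'Admin'],
    'Function': ['Writing, advising', 'task_a, task_d',
                 'Forecasts, month-end, Sales', 'task1, task_b, task_g'],
    'Key': ['Unique_1', 'Unique_2', 'Unique_3', 'Unique_4'],
})
df2 = pd.DataFrame({
    'Region': ['EMEA', 'US', 'US', 'EMEA', 'JPN'],
    'Entity': ['ZZZ', 'AAA', 'AAA', 'DDD', 'XXX'],
    'Desk': ['Equity, delta one', 'Desk 1, A team, Top class',
             "Desk 2, Ninjas, 2nd team, Tom's team", 'Equity, Private Equity',
             'Admin, Secretaries'],
    'Function': ['Sales, sweet talking, schmoozing',
                 'Writing,calling,listening, advising', 'Secret, private',
                 'task1, task2, task3, task4', 'task_a, task_b'],
    'ID': ['A_01', 'A_02', 'A_03', 'A_04', 'A_05'],
})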
I need the ultimate output to look something like this:
df2.ID  df1.Key   MATCH
A_02    Unique_1  Exact
        Unique_2  No match
A_01    Unique_3  Exact
        Unique_4  No match
I am really struggling with this. I have both dataframes, but I cannot work out how to loop through df1 and match its columns against df2 as specified above. I've tried merging the dataframes, using np.where, and brute-force looping, but nothing works. The tricky bit is matching the Desk and Function columns.
Any ideas?

IIUC, one option is to use a cross merge:
def cross_match(df1, df2, col):
    df = df1.merge(df2, how="cross")
    colx, coly = f"{col}_x", f"{col}_y"
    # lowercase both columns and split them into lists of phrases
    df[[colx, coly]] = df[[colx, coly]].apply(lambda x: x.str.lower()
                                                         .str.split(r"\s*,\s*"))
    # flag a pair as "Exact" when any phrase from df1 appears among df2's phrases
    df["MATCH"] = (pd.Series([any(w in sent for w in lst)
                              for lst, sent in zip(df[colx], df[coly])])
                     .map({True: "Exact"}))
    return df.query("MATCH == 'Exact'")

desk, func = cross_match(df1, df2, "Desk"), cross_match(df1, df2, "Function")

out = (
    pd.merge(desk, func,
             left_on=["Region_x", "Entity_x", "ID"],
             right_on=["Region_y", "Entity_y", "ID"],
             suffixes=("", "_"))
      .set_index("Key")
      .reindex(df1["Key"].unique())
      .fillna({"MATCH": "No match"})
      .reset_index()[["ID", "Key", "MATCH"]]
)
Disclaimer: a cross merge materializes every pair of rows, so this approach may get very slow on large datasets (df1, df2).
Output:
print(out)
ID Key MATCH
0 A_02 Unique_1 Exact
1 NaN Unique_2 No match
2 A_01 Unique_3 Exact
3 NaN Unique_4 No match
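If the cross merge proves too heavy, one possible variant (not part of the answer above) is to merge on the equality keys first, so the phrase comparison only runs on row pairs whose Region and Entity already match. A sketch, assuming the same comma-separated format and that each Key matches at most one ID, as in the sample:
def phrase_match(a, b):
    # lowercase, comma-separated phrase sets; match if any phrase is shared
    sa = {p.strip().lower() for p in a.split(',')}
    sb = {p.strip().lower() for p in b.split(',')}
    return not sa.isdisjoint(sb)

# inner merge on the exact-match keys, then test Desk and Function per row
cand = df1.merge(df2, on=['Region', 'Entity'], suffixes=('_1', '_2'))
mask = cand.apply(lambda r: phrase_match(r['Desk_1'], r['Desk_2'])
                  and phrase_match(r['Function_1'], r['Function_2']), axis=1)
out = (cand.loc[mask, ['ID', 'Key']].assign(MATCH='Exact')
           .set_index('Key')
           .reindex(df1['Key'])
           .fillna({'MATCH': 'No match'})
           .reset_index()[['ID', 'Key', 'MATCH']])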

Related

How to replace values of a column based on another data frame?

I have a column containing symbols of chemical elements and other substances. Something like this:
Commodities
sn
sulfuric acid
cu
sodium chloride
au
df1 = pd.DataFrame(['sn', 'sulfuric acid', 'cu', 'sodium chloride', 'au'], columns=['Commodities'])
And I have another data frame containing the symbols of the chemical elements and their respective names. Like this:
Symbol  Name
sn      tin
cu      copper
au      gold
df2 = pd.DataFrame({'Name': ['tin', 'copper', 'gold'], 'Symbol': ['sn', 'cu', 'au']})
I need to replace the symbols in the first dataframe (df1['Commodities']) with the names from the second one (df2['Name']), so that the output looks like the following:
Commodities
tin
sulfuric acid
copper
sodium chloride
gold
I tried using for loops and lambda but got different results than expected. I have tried many things and googled; I think it's something basic, but I just can't find an answer.
Thank you in advance!
First, convert df2 to a dictionary:
replace_dict = dict(df2[['Symbol', 'Name']].to_dict('split')['data'])
# {'sn': 'tin', 'cu': 'copper', 'au': 'gold'}
Then use the replace method:
df1['Commodities'] = df1['Commodities'].replace(replace_dict)
print(df1)
'''
Commodities
0 tin
1 sulfuric acid
2 copper
3 sodium chloride
4 gold
'''
Try:
for i, row in df2.iterrows():
    df1.Commodities = df1.Commodities.str.replace(row.Symbol, row.Name)
which gives df1 as:
Commodities
0 tin
1 sulfuric acid
2 copper
3 sodium chloride
4 gold
EDIT: Note that it's very likely to be far more efficient to skip defining df2 at all and just zip your lists of names and symbols together and iterate over that.
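A sketch of that zip-based variant (the list names here are hypothetical, since the original lists aren't shown):
symbols = ['sn', 'cu', 'au']
names = ['tin', 'copper', 'gold']

# build the mapping straight from the two lists and replace in one pass
df1['Commodities'] = df1['Commodities'].replace(dict(zip(symbols, names)))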

Finding the mean of a column; but excluding a singular value

Imagine I have a dataset that is like so:
ID birthyear weight
0 619040 1962 0.1231231
1 600161 1963 0.981742
2 25602033 1963 1.3123124
3 624870 1987 10,000
and I want to get the mean of the column weight, but the obvious outlier 10,000 is skewing the actual mean. In this situation I cannot change the value but must work around it. This is what I've got so far, but obviously it still includes that last value:
avg_num_items = df_cleaned['trans_quantity'].mean()
translist = df_cleaned['trans_quantity'].tolist()
My dataframe is df_cleaned and the column I'm actually working with is 'trans_quantity', so how do I go about taking the mean while working around that value?
Since you added SQL in your tags: in SQL you'd exclude it in the WHERE clause:
SELECT AVG(trans_quantity)
FROM your_data_base
WHERE trans_quantity <> 10000
In Pandas:
avg_num_items = df_cleaned[df_cleaned["trans_quantity"] != 10000]["trans_quantity"].mean()
You can also replace the value with NaN and skip it in the mean (NaN values are skipped by default):
avg_num_items = df_cleaned["trans_quantity"].replace(10000, np.nan).mean(skipna=True)
With pandas, ensure you have numeric data ("10,000" is a string), filter out the values at or above the threshold, and take the mean:
(pd.to_numeric(df['weight'], errors='coerce')
   .loc[lambda x: x < 10000]
   .mean()
)
output: 0.8057258333333334

using list as an argument in groupby() in pandas and none of the key elements match column or index names

So I have a dataframe of random values as below, and a book I am studying uses a list as the groupby key (key_list). How is the dataframe grouped in this case, since none of the list values match column or index names? The last two lines are confusing to me.
people = pd.DataFrame(np.random.randn(5,5), columns = ['a','b','c','d','e'], index=['Joe','Steve','Wes','Jim','Travis'])
key_list = ['one','one','one','two','two']
people.groupby(key_list).min()
people.groupby([len, key_list]).min()
Thank you in advance!
The user guide on groupby explains a lot, and I suggest you have a look at it. I'll explain as much as I understand for your use case.
You can verify the groups created using the groups attribute:
people.groupby(key_list).groups
{'one': Index(['Joe', 'Steve', 'Wes'], dtype='object'),
'two': Index(['Jim', 'Travis'], dtype='object')}
You have a dictionary whose keys 'one' and 'two' are the groups from the key_list list. As such, when you ask for the min, it looks at each group and picks out the minimum of each column. Let's inspect group 'one' using the get_group method:
people.groupby(key_list).get_group('one')
a b c d e
Joe -0.702122 0.277164 1.017261 -1.664974 -1.852730
Steve -0.866450 -0.373737 1.964857 -1.123291 1.251595
Wes -0.043835 -0.011108 0.214802 0.065022 -1.335713
You can see that Steve has the lowest value in column 'a'. When you run the next line, it should give you that:
people.groupby(key_list).get_group('one').min()
a -0.866450
b -0.373737
c 0.214802
d -1.664974
e -1.852730
dtype: float64
The same concept applies when you run it on the second group 'two'. As such, when you run the first part of your groupby code:
people.groupby(key_list).min()
You get the column-wise minimum for each group:
a b c d e
one -0.866450 -0.373737 0.214802 -1.664974 -1.852730
two -1.074355 -0.098190 -0.595726 -2.194481 0.232505
The second part of your code, which involves len, applies the same grouping concept. In this case it groups the dataframe by the length of the strings in its index: (Jim, Joe, Wes) - 3 letters, (Steve) - 5 letters, (Travis) - 6 letters, and then by key_list to give the final output:
a b c d e
3 one -0.702122 -0.011108 0.214802 -1.664974 -1.852730
two -0.928987 -0.098190 3.025985 0.702471 0.232505
5 one -0.866450 -0.373737 1.964857 -1.123291 1.251595
6 two -1.074355 1.110879 -0.595726 -2.194481 0.394216
Note that 3 appears with both 'one' and 'two' because 'Joe' and 'Wes' are three-letter names in group 'one' (their column-wise minimum is shown), while 'Jim' is the only three-letter name in group 'two'. The same concept applies to the 5- and 6-letter names.
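To make the composite grouping visible, you can inspect the groups mapping for the combined key (group membership is deterministic here even though the values are random):
people.groupby([len, key_list]).groups
# {(3, 'one'): ['Joe', 'Wes'], (3, 'two'): ['Jim'],
#  (5, 'one'): ['Steve'], (6, 'two'): ['Travis']}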

Create new column on pandas DataFrame in which the entries are randomly selected entries from another column

I have a DataFrame with the following structure.
df = pd.DataFrame({'tenant_id': [1,1,1,2,2,2,3,3,7,7], 'user_id': ['ab1', 'avc1', 'bc2', 'iuyt', 'fvg', 'fbh', 'bcv', 'bcb', 'yth', 'ytn'],
'text':['apple', 'ball', 'card', 'toy', 'sleep', 'happy', 'sad', 'be', 'u', 'pop']})
This gives the following output:
df = df[['tenant_id', 'user_id', 'text']]
tenant_id user_id text
1 ab1 apple
1 avc1 ball
1 bc2 card
2 iuyt toy
2 fvg sleep
2 fbh happy
3 bcv sad
3 bcb be
7 yth u
7 ytn pop
I would like to groupby on tenant_id and create a new column which is a random selection of strings from the user_id column.
Thus, I would like my output to look like the following:
tenant_id user_id text new_column
1 ab1 apple [ab1, bc2]
1 avc1 ball [ab1]
1 bc2 card [avc1]
2 iuyt toy [fvg, fbh]
2 fvg sleep [fbh]
2 fbh happy [fvg]
3 bcv sad [bcb]
3 bcb be [bcv]
7 yth u [pop]
7 ytn pop [u]
Here, random ids from the user_id column have been selected; the ids can repeat, as "fvg" does for tenant_id=2. I would like a threshold of not more than ten ids per cell. This data is just a sample and has only 10 ids to start with, so in general the cap would be any number much smaller than the total number of user_ids; in this case, say one less than the number of user_ids belonging to a tenant.
I first tried figuring out how to select a random subset of varying length with df.sample:
new_column = df.user_id.sample(n=np.random.randint(1, 10))
I am lost after this; assigning it to my df results in NaNs, probably because the samples are of variable length. Please help.
Thanks.
Per my comment:
Your 'new column' is not a new column; it's a new cell for a single row.
If you want to assign the result to a new column, you need to create a new column and apply the cell computation to it:
df['new_column'] = df['user_id'].apply(lambda x: df.user_id.sample(n=np.random.randint(1, 10)).tolist())  # .tolist() stores a plain list per cell
It doesn't really matter which column you apply over, since the lambda's argument is not used in the computation.
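If the samples should come from the same tenant, as in the desired output, one possible sketch is to sample within each tenant_id group (assumptions: repeats are allowed, and the per-cell cap is the smaller of 10 and the group size):
import numpy as np

def sample_within_tenant(group, cap=10):
    # for each row, draw a random-length sample of user_ids from the same tenant
    n = min(len(group), cap)
    return group['user_id'].apply(
        lambda _: group['user_id'].sample(np.random.randint(1, n + 1)).tolist())

df['new_column'] = df.groupby('tenant_id', group_keys=False).apply(sample_within_tenant)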

Mapping column values to a combination of another csv file's information

I have a dataset that indicates date & time in a 5-digit format: ddd + hm.
The ddd part counts days from 2009 Jan 1. Since the data was collected over a 2-year period, its [min, max] is [1, 730] (365 x 2).
Data is observed at 30-minute intervals, so there are at most 48 intervals per day, putting [min, max] for hm at [1, 48].
Following is the excerpt of the daycode.csv file that maps the ddd part of the daycode to a date and the hm part to a time.
I think I agreed not to show the dataset itself, which is from ISSDA, so I will just describe that a daycode in the File1.txt file reads like '63317'.
This link gave me a glimpse of how to approach this problem, and I was in the middle of putting this code together.. which of course won't work at this point:
consume = pd.read_csv("data/File1.txt", sep= ' ', encoding = "utf-8", names =['meter', 'daycode', 'val'])
df1= pd.read_csv("data/daycode.csv", encoding = "cp1252", names =['code', 'print'])
test = consume[consume['meter']==1048]
test['daycode'] = test['daycode'].map(df1.set_index('code')['print'])
plt.plot(test['daycode'], test['val'], '.')
plt.title('test of meter 1048')
plt.xlabel('daycode')
plt.ylabel('energy consumption [kWh]')
plt.show()
Not all units (there are thousands) have been observed for the full length, but 730 x 48 is too large a combination to lay out in Excel by hand. Admittedly not an elegant solution, but I tried it by dragging and it doesn't quite get there.
If I could read the first 3 digits of the column values and match them against one column of the other file, and the last 2 digits against another column, then combine the two.. is there a way?
For the first 3 and last 2 digits you can do something like this:
df['first_3_digits'] = df['col1'].map(lambda x: str(x)[:3])
df['last_2_digits'] = df['col1'].map(lambda x: str(x)[-2:])
For joining the two dataframes:
df3 = df.merge(df2,left_on=['first_3_digits','last_2_digits'],right_on=['col1_df2','col2_df2'],how='left')
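Tied back to the frames in the question, that could look like the following (a sketch, assuming the day part is always the first 3 digits and the half-hour part the last 2, so '63317' splits into 633 and 17):
# split the 5-digit daycode into its day and half-hour-slot parts
consume['ddd'] = consume['daycode'].astype(str).str[:3].astype(int)
consume['hm'] = consume['daycode'].astype(str).str[-2:].astype(int)
# these two columns can then be merged against the corresponding
# date and time columns from daycode.csv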