I'm looking to use a regex to match a column of one dataframe against the first word of a column in another. The dataframes were collected from different sources, so the drug names are similar but do not match exactly. They do match if you ignore case and compare only the first word.
I have two dataframes: one with drug names and another with drug names and their respective prices. Fruits were added to the drug names for example purposes.
Dataframe A
drug
0 drug1 apple
1 drug2 orange
2 drug3 lemon
3 drug4 peach
Dataframe B
drugB price Regex
0 DRUG2 2 ^([\w\-]+)
1 DRUG4 4 ^([\w\-]+)
2 DRUG3 3 ^([\w\-]+)
3 DRUG1 1 ^([\w\-]+)
I am looking to use the Regex column to join dataframe A to B like so, using the first word of the drug column to match against the drugB column.
drug drugB price Regex
0 drug1 apple DRUG1 1 ^([\w\-]+)
1 drug2 orange DRUG2 2 ^([\w\-]+)
2 drug3 lemon DRUG3 3 ^([\w\-]+)
3 drug4 peach DRUG4 4 ^([\w\-]+)
I was inspired to try it this way based on the following Stack Overflow question: How to merge pandas table by regex.
Thank you in advance! I hit a dead end with this problem and couldn't figure out a way to get it to work.
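For anyone who wants to reproduce this, here is a minimal setup matching the tables above (the frame names df_A and df_B are assumed to match the answer below):
import pandas as pd

df_A = pd.DataFrame({'drug': ['drug1 apple', 'drug2 orange',
                              'drug3 lemon', 'drug4 peach']})
df_B = pd.DataFrame({'drugB': ['DRUG2', 'DRUG4', 'DRUG3', 'DRUG1'],
                     'price': [2, 4, 3, 1],
                     'Regex': [r'^([\w\-]+)'] * 4})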
You don't really need to define the regexes in the second dataframe. ALollz is right, by the way: you could easily split the string, but I guess your real use case is more complex and you probably have drug names that include spaces.
Simple version with a common regex
If you can manage to define one common regex that matches all drug names, you can use the following code:
df_A['drugA'] = df_A['drug'].str.extract(r'^\s*(?P<drugA>[\w\-]*)')['drugA'].str.upper()
df_A.merge(df_B[['drugB', 'price']], left_on='drugA', right_on='drugB', how='left')
Just replace the extraction pattern with the regex you need. The output would be:
drug drugA drugB price
0 drug1 apple DRUG1 DRUG1 1
1 drug2 orange DRUG2 DRUG2 2
2 drug3 lemon DRUG3 DRUG3 3
3 drug4 peach DRUG4 DRUG4 4
Version with a generated regex
import re

drug_list = df_B['drugB'].to_list()
# sort the drug names by length descending
# to make sure we get the longest match
# --> relevant only if a drug name is fully
# included in another name,
# like "Aspirin" & "Aspirin plus C"
drug_list.sort(key=lambda drug: len(drug), reverse=True)
# re.escape guards against drug names containing regex metacharacters
drug_pattern = r'^\s*(?P<drugA>{drug_list})'.format(drug_list='|'.join(map(re.escape, drug_list)))
df_A['drugA'] = df_A['drug'].str.extract(drug_pattern, flags=re.I)['drugA'].str.upper()
df_A.merge(df_B[['drugB', 'price']], left_on='drugA', right_on='drugB', how='left')
This outputs the same as above. Please note that this version might be limited regarding the number of drugs you can use: if you have hundreds of drugs, it might run into problems because the regular-expression string gets very long. But this version is more precise and also supports spaces in the drug names.
In case you can work out one pattern that is able to cut out all drug names correctly, I would definitely recommend using the first method. E.g. if you can spot a pattern that comes after the drug name, you can use it to cut out the drug names much more easily.
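To illustrate that last point with a made-up convention: if every drug name in your data happened to be followed by a dosage token, one pattern could exploit it. The "\d+\s*mg" suffix here is purely hypothetical, and re is imported as in the block above:
# Hypothetical: names like "Aspirin plus C 100 mg" -> "ASPIRIN PLUS C"
df_A['drugA'] = (df_A['drug']
                 .str.extract(r'^\s*(?P<drugA>.+?)\s+\d+\s*mg', flags=re.I)['drugA']
                 .str.upper())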
Related
I have a column containing symbols of chemical elements and other substances. Something like this:
       Commodities
0               sn
1    sulfuric acid
2               cu
3  sodium chloride
4               au
df1 = pd.DataFrame(['sn', 'sulfuric acid', 'cu', 'sodium chloride', 'au'], columns=['Commodities'])
And I have another data frame containing the symbols of the chemical elements and their respective names. Like this:
     Name Symbol
0     tin     sn
1  copper     cu
2    gold     au
df2 = pd.DataFrame({'Name': ['tin', 'copper', 'gold'], 'Symbol': ['sn', 'cu', 'au']})
I need to replace the symbols in the first dataframe (df1['Commodities']) with the names from the second one (df2['Name']), so that it outputs like the following:
Output:
       Commodities
0              tin
1    sulfuric acid
2           copper
3  sodium chloride
4             gold
I tried using for loops and lambdas but got different results than expected. I have tried many things and googled; I think it's something basic, but I just can't find an answer.
Thank you in advance!
First, convert df2 to a dictionary:
replace_dict = dict(df2[['Symbol', 'Name']].to_dict('split')['data'])
#{'sn': 'tin', 'cu': 'copper', 'au': 'gold'}
Then use the replace function:
df1['Commodities'] = df1['Commodities'].replace(replace_dict)
print(df1)
'''
Commodities
0 tin
1 sulfuric acid
2 copper
3 sodium chloride
4 gold
'''
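As a side note, an equivalent and arguably more direct way to build the same mapping is with zip:
replace_dict = dict(zip(df2['Symbol'], df2['Name']))
# {'sn': 'tin', 'cu': 'copper', 'au': 'gold'}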
Try:
for i, row in df2.iterrows():
    # literal (non-regex) replacement of each symbol with its name
    df1.Commodities = df1.Commodities.str.replace(row.Symbol, row.Name, regex=False)
which gives df1 as:
Commodities
0 tin
1 sulfuric acid
2 copper
3 sodium chloride
4 gold
EDIT: Note that it's very likely to be far more efficient to skip defining df2 at all and just zip your lists of names and symbols together and iterate over that.
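A minimal sketch of that zip-based variant, assuming the raw symbol and name lists are at hand:
symbols = ['sn', 'cu', 'au']
names = ['tin', 'copper', 'gold']
for symbol, name in zip(symbols, names):
    # literal replacement, as above
    df1.Commodities = df1.Commodities.str.replace(symbol, name, regex=False)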
I have a verbal algorithm question, so I have no code yet. The question is this: how can I create an algorithm involving two dynamic stacks, each of which may or may not contain duplicate string items? For example, the first stack, s1, holds 3 breads, 4 lemons and 2 pens, and the second stack, s2, holds 5 breads, 3 lemons and 5 pens. I want to find the number of duplicates of each item in each stack and print the minimum of the two counts, for example:
bread --> 3
lemon --> 3
pen --> 2
How can I traverse the two stacks and print the number of duplicated occurrences until the end of both stacks? If anything is unclear, I can edit my question accordingly. Thanks.
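A minimal sketch of one common approach, assuming the stacks can be drained (or copied) into collections.Counter objects; the & operator on counters keeps the element-wise minimum of the two counts:
from collections import Counter

s1 = ['bread'] * 3 + ['lemon'] * 4 + ['pen'] * 2
s2 = ['bread'] * 5 + ['lemon'] * 3 + ['pen'] * 5

# tally occurrences per stack (popping each element would work the same way)
c1, c2 = Counter(s1), Counter(s2)

# intersection keeps min(c1[item], c2[item]) for each item
for item, count in (c1 & c2).items():
    print(f'{item} --> {count}')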
I have a list and a data frame. I want to find the number of occurrences of each word in the list (some entries in the list are word pairs) for each of the "emotions" in the data frame.
Here is my list:
[(frozenset({'know'}), 16528),
(frozenset({'im'}), 39047),
(frozenset({'feeling'}), 99455),
(frozenset({'like'}), 49332),
(frozenset({'feel', 'im'}), 16602),
(frozenset({'feeling', 'im'}), 23488),
(frozenset({'feel'}), 202985),
(frozenset({'feel', 'like'}), 42162),
(frozenset({'time'}), 17203),
(frozenset({'really'}), 17247)]
and this is my data frame:
Unnamed: 0 id text emotions
0 0 27383 [feel, awful, job, get, position, succeed, hap... sadness
1 1 110083 [im, alone, feel, awful] sadness
2 2 140764 [ive, probably, mentioned, really, feel, proud... joy
3 3 100071 [feeling, little, low, day, back] sadness
4 4 2837 [beleive, much, sensitive, people, feeling, te... love
Here is the expected output: six columns, one for each of the six emotions, and a last column for the total count.
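A minimal sketch of one way to build such a table, assuming each list entry should count a row whenever all of its words appear in that row's text (the inputs below are truncated stand-ins for the question's data):
import pandas as pd

itemsets = [(frozenset({'know'}), 16528),
            (frozenset({'feel', 'like'}), 42162)]
df = pd.DataFrame({
    'text': [['feel', 'awful', 'job'],
             ['im', 'alone', 'feel', 'awful'],
             ['feel', 'like', 'proud']],
    'emotions': ['sadness', 'sadness', 'joy'],
})

rows = []
for itemset, _support in itemsets:
    # a row matches when every word of the itemset appears in its text
    mask = df['text'].apply(lambda words: itemset <= set(words))
    counts = df.loc[mask, 'emotions'].value_counts()
    counts['total'] = counts.sum()
    rows.append(counts.rename(', '.join(sorted(itemset))))

# one row per itemset, one column per emotion, plus the total
result = pd.DataFrame(rows).fillna(0).astype(int)
print(result)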
Excuse the basic nature of this question, but I have searched for hours for an answer and they all seem to overcomplicate what I need.
I have a dataframe like the following: -
id food_item_1 food_item_2 food_item_3
1 nuts bread coffee
2 potatoes coffee cake
3 fish beer coffee
4 bread coffee coffee
What I want to do is search all the 'food_item_*' columns (in this case there are 3) and get back the single most common value, e.g. 'coffee', across all 3 columns.
Could someone please recommend the best way to do this?
Many thanks
Use DataFrame.filter, reshape with DataFrame.stack, then take Series.mode, and finally select the first value by position with Series.iat:
a = df.filter(like='food_item_').stack().mode().iat[0]
print(a)
coffee
Another idea is to use Series.value_counts and select the first value of the index:
a = df.filter(like='food_item_').stack().value_counts().index[0]
You can also melt your columns and use value_counts:
print(df.melt(id_vars="id", value_vars=df.columns[1:])["value"].value_counts())
coffee 5
bread 2
nuts 1
potatoes 1
cake 1
beer 1
fish 1
Building off this answer, is there a way to filter a Pandas dataframe by a list of substrings?
Say I want to find all rows where df['menu_item'] contains fresh or spaghetti
Without something like this:
df[df['menu_item'].str.contains('fresh') | df['menu_item'].str.contains('spaghetti')]
The str.contains method you're using accepts a regex, so use the regex alternation | as an "or":
df[df['menu_item'].str.contains('fresh|spaghetti')]
Example Input:
menu_item
0 fresh fish
1 fresher fish
2 lasagna
3 spaghetti o's
4 something edible
Example Output:
menu_item
0 fresh fish
1 fresher fish
3 spaghetti o's
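To generalize this to an arbitrary list of substrings, a common idiom is to build the pattern with '|'.join, escaping each substring with re.escape in case it contains regex metacharacters (the list below is just illustrative):
import re

substrings = ['fresh', 'spaghetti']  # any list of plain substrings
pattern = '|'.join(map(re.escape, substrings))
df[df['menu_item'].str.contains(pattern)]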