Search multiple columns for the most common value - pandas

Excuse the basic nature of this question, but I have searched for hours for an answer and they all seem to overcomplicate what I need.
I have a dataframe like the following:
id food_item_1 food_item_2 food_item_3
1 nuts bread coffee
2 potatoes coffee cake
3 fish beer coffee
4 bread coffee coffee
What I want to do is search all the 'food_item_*' columns (in this case there are 3) and get back the single most common value across all of them, e.g. 'coffee'.
Could someone please recommend the best way to do this?
Many thanks
md

Use DataFrame.filter, reshape with DataFrame.stack, then use Series.mode, and finally select the first value by position with Series.iat:
a = df.filter(like='food_item_').stack().mode().iat[0]
print (a)
coffee
Another idea is to use Series.value_counts and select the first value of the index:
a = df.filter(like='food_item_').stack().value_counts().index[0]
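For anyone who wants to try both one-liners, here is a self-contained sketch that rebuilds the question's sample frame (the variable names `values`, `most_common` and `counts` are just illustrative). One thing to keep in mind: Series.mode returns all tied values in sorted order, so `.iat[0]` picks the alphabetically first one when there is a tie.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "food_item_1": ["nuts", "potatoes", "fish", "bread"],
    "food_item_2": ["bread", "coffee", "beer", "coffee"],
    "food_item_3": ["coffee", "cake", "coffee", "coffee"],
})

# Keep only the food_item_* columns and flatten them into one Series
values = df.filter(like="food_item_").stack()

# Most common value via mode (ties come back sorted, first one is taken)
most_common = values.mode().iat[0]

# Full frequency table, sorted from most to least common
counts = values.value_counts()

print(most_common)
print(counts.iloc[0])
```

If you need the count as well as the value, the `value_counts` route gives you both in one go.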

You can also melt your columns and value_counts:
print (df.melt(id_vars="id", value_vars=df.columns[1:])["value"].value_counts())
coffee 5
bread 2
nuts 1
potatoes 1
cake 1
beer 1
fish 1

Related

Minimum number of Common Items in 2 Dynamic Stacks

I have a verbal algorithm question, so I have no code yet. How can I design an algorithm for two dynamic stacks, each of which may or may not contain duplicate strings? For example, the first stack, s1, holds 3 breads, 4 lemons and 2 pens, and the second stack, s2, holds 5 breads, 3 lemons and 5 pens. I want to count the occurrences of each item in each stack and print, for every item, the smaller of its two counts, for example:
bread --> 3
lemon --> 3
pen --> 2
How can I traverse 2 stacks and print the number of duplicated occurrences until the end of stacks? If you are confused about anything, I can edit my question depending on your confusion. Thanks.
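One possible way to do this in Python, as a rough sketch: count each stack with collections.Counter and take the intersection with `&`, which keeps the minimum count per item. The stacks are modelled as plain lists here, and the names `s1`, `s2` and `common` are only for illustration.

```python
from collections import Counter

# The two stacks from the question, modelled as plain lists (an assumption)
s1 = ["bread"] * 3 + ["lemon"] * 4 + ["pen"] * 2
s2 = ["bread"] * 5 + ["lemon"] * 3 + ["pen"] * 5

# Counter traverses each stack once; & keeps min(count1, count2) per item
common = Counter(s1) & Counter(s2)

for item, count in sorted(common.items()):
    print(f"{item} --> {count}")
```

Items that appear in only one stack drop out of the result, because their minimum count is zero.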

Matching columns in dataframe using regex

Looking to perform a regex function to match a column of a dataframe with the first word of another. The dataframes were collected from different sources so the names of the drug are similar but do not match completely. They do match up if you ignore case and match for the first word.
I have two dataframes: one with drug names and another with a list of drug names with their respective prices. Fruits were added to the drug names for example purposes.
Dataframe A
drug
0 drug1 apple
1 drug2 orange
2 drug3 lemon
3 drug4 peach
Dataframe B
drugB price Regex
0 DRUG2 2 ^([\w\-]+)
1 DRUG4 4 ^([\w\-]+)
2 DRUG3 3 ^([\w\-]+)
3 DRUG1 1 ^([\w\-]+)
I am looking to use the Regex column to join dataframe A to B like so, ideally by taking the first word of the drug column and matching it to the corresponding drugB value.
drug drugB price Regex
0 drug1 apple DRUG1 1 ^([\w\-]+)
1 drug2 orange DRUG2 2 ^([\w\-]+)
2 drug3 lemon DRUG3 3 ^([\w\-]+)
3 drug4 peach DRUG4 4 ^([\w\-]+)
I was inspired to try it this way based on the following stackoverflow question: How to merge pandas table by regex.
Thank you in advance! I hit a dead end with this problem and couldn't figure a way to get it to work.
You don't really need to define the regexes in the second dataframe. ALollz is right, by the way: you could easily split the string, but I guess your real use case is more complex and you probably have drug names that include spaces.
Simple version with a common regex
If you can manage to define one common regex that matches all drug names, you can use the following code:
df_A['drugA'] = df_A['drug'].str.extract(r'^\s*(?P<drugA>[\w\-]*)')['drugA'].str.upper()
df_A.merge(df_B[['drugB', 'price']], left_on='drugA', right_on='drugB', how='left')
Just replace the pattern with the regex you need. The output would be:
drug drugA drugB price
0 drug1 apple DRUG1 DRUG1 1
1 drug2 orange DRUG2 DRUG2 2
2 drug3 lemon DRUG3 DRUG3 3
3 drug4 peach DRUG4 DRUG4 4
Version with a generated regex
import re

drug_list = df_B['drugB'].to_list()
# sort the drug names by length descending
# to make sure we get the longest match
# --> relevant only if a drug name is included
# fully in another name
# Like "Aspirin" & "Aspirin plus C"
drug_list.sort(key=len, reverse=True)
# escape each name so regex metacharacters in a drug name are matched literally
drug_pattern = r'^\s*(?P<drugA>{drug_list})'.format(drug_list='|'.join(re.escape(drug) for drug in drug_list))
# re.I makes the match case-insensitive; upper() aligns the result with drugB
df_A['drugA'] = df_A['drug'].str.extract(drug_pattern, flags=re.I)['drugA'].str.upper()
df_A.merge(df_B[['drugB', 'price']], left_on='drugA', right_on='drugB', how='left')
This outputs the same as above. Please note that this version might be limited in the number of drugs it can handle. If you have hundreds of drugs, it might run into problems because the regular expression string gets long. But this version is more precise and also supports spaces in the drug names.
If you can work out one pattern that cuts out all drug names correctly, I would definitely recommend the first method. E.g. if you can spot a pattern that always follows the drug name, you can use it to cut out the drug names much more easily.

How can I find the correlation between very few items in a pandas dataframe

Hi, I am new to dataframes, please help me resolve this.
My dataframe1 looks like this (it has itemID and ItemName); I only have 7 items
itemID ItemName
1 abc
2 fds
3 btbtr
4 gerhet
5 dfhkwjfn
6 adaf
7 jdkj
My dataframe2 looks like this. It has userId and itemID; I have 20k users and each user has one or more itemIDs next to it:
userId itemID
23213 2
31267 3
52144 1
52144 2
87467 6
How can I find the item-item correlation between the items?
I want a result like "item1 is highly correlated with item3 and item6".
I tried corrwith() but all I get is NaN.
Please help me find this. Thanks in advance.
Here is the approach I can think of. Might be crude, but here we go.
Remove all users who have only 1 item next to them.
Now you only have users with multiple items.
Make a note of how often each pair of items co-occurs, i.e. make a data frame of this sort:
item-item : count
1-2 : 50
3-5 : 35
and so on. After you have the counts for every pair, normalize them to the range 0-1, and you have a correlation-like score between all items.
Hope it helps!
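The steps above could be sketched in pandas roughly as follows. This is only one way to do it, under the assumption that "normalize" means dividing by the largest pair count; the names `ratings`, `user_item`, `multi`, `co_occurrence` and `normalized` are made up for the example, and the data is the small sample from the question.

```python
import numpy as np
import pandas as pd

# Sample of dataframe2 from the question
ratings = pd.DataFrame({
    "userId": [23213, 31267, 52144, 52144, 87467],
    "itemID": [2, 3, 1, 2, 6],
})

# User x item indicator matrix (1 if the user has the item, else 0)
user_item = pd.crosstab(ratings["userId"], ratings["itemID"]).clip(upper=1)

# Step 1: drop users who have only one item
multi = user_item[user_item.sum(axis=1) > 1]

# Step 2: item x item co-occurrence counts
co_occurrence = multi.T.dot(multi)

# The diagonal holds each item paired with itself, so zero it out
counts = co_occurrence.to_numpy().copy()
np.fill_diagonal(counts, 0)

# Step 3: scale to 0-1 by the largest pair count (assumed normalization)
max_count = counts.max()
normalized = pd.DataFrame(
    counts / max_count if max_count > 0 else counts.astype(float),
    index=co_occurrence.index,
    columns=co_occurrence.columns,
)
print(normalized)
```

If you want an actual Pearson correlation instead of scaled counts, calling `user_item.corr()` on the indicator matrix is another option to compare against.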

Filtering DataFrame by list of substrings

Building off this answer, is there a way to filter a Pandas dataframe by a list of substrings?
Say I want to find all rows where df['menu_item'] contains fresh or spaghetti
Without something like this:
df[df['menu_item'].str.contains('fresh') | df['menu_item'].str.contains('spaghetti')]
The str.contains method you're using accepts regex, so use the regex | as or:
df[df['menu_item'].str.contains('fresh|spaghetti')]
Example Input:
menu_item
0 fresh fish
1 fresher fish
2 lasagna
3 spaghetti o's
4 something edible
Example Output:
menu_item
0 fresh fish
1 fresher fish
3 spaghetti o's
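If the substrings live in a Python list, the pattern can be built from it. A minimal sketch, assuming the substrings should be matched literally (hence re.escape, so characters like `.` or `+` in a substring are not treated as regex syntax); `substrings`, `pattern` and `result` are illustrative names.

```python
import re

import pandas as pd

df = pd.DataFrame({
    "menu_item": [
        "fresh fish",
        "fresher fish",
        "lasagna",
        "spaghetti o's",
        "something edible",
    ]
})

substrings = ["fresh", "spaghetti"]

# Join the escaped substrings with the regex "or" operator
pattern = "|".join(re.escape(s) for s in substrings)

result = df[df["menu_item"].str.contains(pattern)]
print(result)
```

For case-insensitive matching, str.contains also accepts `case=False`.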

Nearest Neighbor Search on large database table - SQL and/or ArcGis

Sorry for posting something that's probably obvious, but I don't have much database experience. Any help would be greatly appreciated - but remember, I'm a beginner :-)
I have a table like this:
Table.fruit
ID type Xcoordinate Ycoordinate Taste Fruitiness
1 Apple 3 3 Good 1,5
2 Orange 5 4 Bad 2,9
3 Apple 7 77 Medium 1,4
4 Banana 4 69 Bad 9,5
5 Pear 9 15 Medium 0,1
6 Apple 3 38 Good -5,8
7 Apple 1 4 Good 3
8 Banana 15 99 Bad 6,8
9 Pear 298 18789 Medium 10,01
… … … … … …
1000 Apple 1344 1388 Bad 5
… … … … … …
1958 Banana 759 1239 Good 1
1959 Banana 3 4 Medium 5,2
I need:
A table that gives me
The n (eg.: n=5) closest points to EACH point in the original table, including distance
Table.5nearest (please note that the distances are fake). So the resulting table has ID1, ID2 and distance between ID1 and ID2 (can't post images yet, unfortunately).
ID.Fruit1 ID.Fruit2 Distance
1 1959 1
1 7 2
1 2 2
1 5 30
1 14 50
2 1959 1
2 1 2
… … …
1000 1958 400
1000 Xxx Xxx
… … …
How can I do this (ideally with SQL/database management) or in ArcGis or similar? Any ideas?
Unfortunately, my table contains 15000 records, so the resulting table will have 75000 records if I choose n=5.
Any suggestions GREATLY appreciated.
EDIT:
Thank you very much for your comments and suggestions so far. Let me expand on it a little:
The first proposed method is sort of a brute-force scan of the whole table rendering huge filesizes or, likely, crashes, correct?
Now, the fruit is just a dummy, the real table contains a fix ID, nominal attributes ("fruit types" etc), X and Y spatial columns (in Gauss-Krueger) and some numeric attributes.
Now, I guess there is a way to code a "bounding box" into this, so the distance calculation is done for my point in question (let's say 1) and every other point within a square with a certain edge length. I can imagine (remotely) coding or querying for that, but how do I get the script to do that for EVERY point in my ID column? The way I understand it, this should either create a "subtable" for each record/point in my "Table.Fruit", containing all points within the square around the record/point with a distance field added, or one big new table ("Table.5nearest"). I hope this makes some kind of sense. Any ideas? Thanks again
To get all the distances between all fruit is fairly straightforward. In Access SQL (although you may need to add parentheses everywhere to get it to work :P):
select fruit1.id,
fruit2.id,
sqr(((fruit2.xcoordinate - fruit1.xcoordinate)^2) + ((fruit2.ycoordinate - fruit1.ycoordinate)^2)) as distance
from fruit as fruit1
join fruit as fruit2
on fruit2.id <> fruit1.id
order by distance;
I don't know if Access has the necessary sophistication to limit this to the "top n" records for each fruit; so this query, on your recordset, will return 225 million records (or, more likely, crash while trying)!
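Outside the database, the "top n per point" step can be sketched in a few lines of Python. This is not Access SQL and not what the query above does; it is a brute-force illustration that keeps only the n smallest distances for each point instead of materialising every pair. The function name `nearest_neighbours` and the `points` dictionary are made up for the example, and the coordinates are a handful of rows from the question's table.

```python
import heapq
import math

# id -> (x, y), a few rows taken from the question's table
points = {
    1: (3, 3),
    2: (5, 4),
    5: (9, 15),
    7: (1, 4),
    1959: (3, 4),
}


def nearest_neighbours(points, n):
    """Return (id1, id2, distance) rows: the n closest points to each point."""
    rows = []
    for id1, (x1, y1) in points.items():
        # Distances from this point to every other point, generated lazily
        candidates = (
            (math.hypot(x2 - x1, y2 - y1), id2)
            for id2, (x2, y2) in points.items()
            if id2 != id1
        )
        # Keep only the n smallest; ties are broken by the smaller id
        for distance, id2 in heapq.nsmallest(n, candidates):
            rows.append((id1, id2, distance))
    return rows


table = nearest_neighbours(points, n=2)
for row in table:
    print(row)
```

This still computes every pairwise distance, so for 15000 points it is slow in pure Python, but memory stays small because only n rows per point are kept. A spatial index such as scipy.spatial.cKDTree would be the usual next step if speed matters.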
Thank you for your comments so far; in the meantime, I have gone for a pre-fabricated solution, an add-in for ArcGis called Hawth's Tools. This really works like a breeze to find the n closest neighbors to any point feature with an x and y value. So I hope it can help someone with similar problems and questions.
However, it leaves me with a more database-related issue now. Do you have an idea how I can get any DBMS (preferably Access), to give me a list of all my combinations? That is, if I have a point feature with 15000 fruits arranged in space, how do I get all "pure banana neighborhoods" (apple, lemon, etc.) and all other combinations?
Cheers and best wishes.