Filtering DataFrame by list of substrings - pandas

Building off this answer, is there a way to filter a Pandas dataframe by a list of substrings?
Say I want to find all rows where df['menu_item'] contains fresh or spaghetti
Without something like this:
df[df['menu_item'].str.contains('fresh') | df['menu_item'].str.contains('spaghetti')]

The str.contains method you're using accepts regex, so use the regex | as or:
df[df['menu_item'].str.contains('fresh|spaghetti')]
Example Input:
menu_item
0 fresh fish
1 fresher fish
2 lasagna
3 spaghetti o's
4 something edible
Example Output:
menu_item
0 fresh fish
1 fresher fish
3 spaghetti o's
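When the substrings come from a list, the pattern can be built by joining them with |. The sketch below assumes a hypothetical list named substrings and escapes each entry with re.escape, so that characters such as . or + are matched literally rather than as regex syntax:

```python
import re

import pandas as pd

df = pd.DataFrame({"menu_item": ["fresh fish", "fresher fish", "lasagna",
                                 "spaghetti o's", "something edible"]})

# Hypothetical list of substrings to search for.
substrings = ["fresh", "spaghetti"]

# Escape each substring so regex metacharacters are matched literally,
# then join the pieces with | (regex "or").
pattern = "|".join(re.escape(s) for s in substrings)

# na=False treats missing values as non-matches.
filtered = df[df["menu_item"].str.contains(pattern, na=False)]
print(filtered)
```

If the list may contain user input, the escaping step matters; for fixed, plain-text substrings it changes nothing.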

Related

How to replace values of a column based on another data frame?

I have a column containing symbols of chemical elements and other substances. Something like this:
Commodities
sn
sulfuric acid
cu
sodium chloride
au
df1 = pd.DataFrame(['sn', 'sulfuric acid', 'cu', 'sodium chloride', 'au'], columns=['Commodities'])
And I have another data frame containing the symbols of the chemical elements and their respective names. Like this:
Symbol  Name
sn      tin
cu      copper
au      gold
df2 = pd.DataFrame({'Name': ['tin', 'copper', 'gold'], 'Symbol': ['sn', 'cu', 'au']})
I need to replace the symbols in the first dataframe (df1['Commodities']) with the names in the second one (df2['Name']), so that the output looks like the following:
Output:
Commodities
tin
sulfuric acid
copper
sodium chloride
gold
I tried using for loops and lambda but got different results than expected. I have tried many things and googled, I think it's something basic, but I just can't find an answer.
Thank you in advance!
First, convert df2 to a dictionary:
replace_dict = dict(df2[['Symbol', 'Name']].to_dict('split')['data'])
# {'sn': 'tin', 'cu': 'copper', 'au': 'gold'}
Then use the replace function:
df1['Commodities'] = df1['Commodities'].replace(replace_dict)
print(df1)
'''
Commodities
0 tin
1 sulfuric acid
2 copper
3 sodium chloride
4 gold
'''
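For reference, a self-contained sketch of the same dictionary approach. It builds the lookup with set_index(...).to_dict(), which is an alternative to the to_dict('split') form above, and also shows a map-based variant; the variable names replaced and mapped are illustrative only:

```python
import pandas as pd

df1 = pd.DataFrame(['sn', 'sulfuric acid', 'cu', 'sodium chloride', 'au'],
                   columns=['Commodities'])
df2 = pd.DataFrame({'Name': ['tin', 'copper', 'gold'],
                    'Symbol': ['sn', 'cu', 'au']})

# Symbol -> name lookup, e.g. {'sn': 'tin', 'cu': 'copper', 'au': 'gold'}.
replace_dict = df2.set_index('Symbol')['Name'].to_dict()

# replace() with a dict swaps whole cell values; cells that are not
# keys of the dict are left unchanged.
replaced = df1['Commodities'].replace(replace_dict)

# map() returns NaN for values missing from the dict, so fall back
# to the original value for those rows.
mapped = df1['Commodities'].map(replace_dict).fillna(df1['Commodities'])

print(replaced)
```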
Try:
for i, row in df2.iterrows():
    df1.Commodities = df1.Commodities.str.replace(row.Symbol, row.Name)
which gives df1 as:
Commodities
0 tin
1 sulfuric acid
2 copper
3 sodium chloride
4 gold
EDIT: Note that it's very likely to be far more efficient to skip defining df2 at all and just zip your lists of names and symbols together and iterate over that.
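A minimal sketch of that idea, assuming the symbols and names are available as two plain lists (the list names here are hypothetical). Note that str.replace works on substrings, so a symbol that happens to occur inside a longer word would be replaced as well; the sample data does not trigger this.

```python
import pandas as pd

df1 = pd.DataFrame(['sn', 'sulfuric acid', 'cu', 'sodium chloride', 'au'],
                   columns=['Commodities'])

# Hypothetical plain lists, used instead of building df2.
symbols = ['sn', 'cu', 'au']
names = ['tin', 'copper', 'gold']

for symbol, name in zip(symbols, names):
    # regex=False makes this a literal substring replacement.
    df1['Commodities'] = df1['Commodities'].str.replace(symbol, name, regex=False)

print(df1)
```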

Cypher: How to create a recursive cost query with alternates?

I have the following structure:
(:pattern)-[:contains]->(:pattern)
...basically a hierarchy of patterns that use other patterns as content. These constitute trees.
Certain patterns are generated by certain generators:
(:generator)-[:canProduce]->(:pattern)
The canProduce relationship has a cost value associated with it as a property. Multiple generators can create the same pattern.
I would like to figure out, with a query, what patterns I need to generate to produce a particular output - and which generators to choose to have the lowest cost. I started like this:
MATCH (p:pattern {name: 'preciousPattern'})-[:contains *]->(ps:pattern) RETURN ps
so far so good. The results don't contain the starting pattern, so I made this:
MATCH (p:pattern {name: 'preciousPattern'})-[:contains *]->(ps:pattern)
WITH p+collect(ps) as list
UNWIND list as patterns
RETURN patterns
That does not feel elegant, and it also does not provide the hierarchy.
I can of course do a path query (MATCH path = ...), but the results don't seem very useful.
Also, now I need to connect the cost from the generator relationship.
I tried this:
MATCH (p:pattern {name: 'awesome'})-[:contains *]->(ps:pattern)
WITH p+collect(ps) as list
UNWIND list as rec
CALL {
  WITH rec
  MATCH (rec)-[r:canGenerate]-(g:generator)
  RETURN r.GenCost AS GenCost, g.name AS GenName
}
RETURN rec.name, GenCost, GenName
The problem I have now is that if any of the patterns that are part of another pattern can be generated by multiple generators, I just get double entries in the list, but what I want is separate lists for each alternative possibility, so that I can generate the cost.
This is my pattern tree:
Awesome
  input1
  input2
  input3
Input 3 can be generated by 2 different generators. I now get:
Awesome | 2 | MainGen
input1 | 3 | TestGen1
input2 | 2.5 | TestGen2
input3 | 1.25 | TestGen3
input4 | 1.4 | TestGen4
What I want is this: Two lists (or n, in the general case, where I might have n possible paths), one
Awesome | 2 | MainGen
input1 | 3 | TestGen1
input2 | 2.5 | TestGen2
input3 | 1.25 | TestGen3
and one:
Awesome | 2 | MainGen
input1 | 3 | TestGen1
input2 | 2.5 | TestGen2
input4 | 1.4 | TestGen4
each set representing one alternative set, so that I can calculate the costs and compare.
I have no idea how to do something like that. Any suggestions?
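Outside of Cypher, the "one list per alternative" requirement amounts to a Cartesian product over the generator choices of each pattern. The sketch below is not a Cypher solution; it only illustrates the intended result in Python on rows like the ones the query returns. It assumes, for illustration, that input3 is the pattern with two generators (TestGen3 and TestGen4), and all variable names are made up:

```python
from itertools import product

# Rows as (pattern, cost, generator), as a query might return them.
# Assumption for illustration: input3 can be produced by two generators.
rows = [
    ("Awesome", 2.0, "MainGen"),
    ("input1", 3.0, "TestGen1"),
    ("input2", 2.5, "TestGen2"),
    ("input3", 1.25, "TestGen3"),
    ("input3", 1.4, "TestGen4"),
]

# Group the generator options by pattern, keeping first-seen order.
options = {}
for pattern, cost, generator in rows:
    options.setdefault(pattern, []).append((pattern, cost, generator))

# Each alternative set picks exactly one generator per pattern.
alternatives = [list(choice) for choice in product(*options.values())]

# Total cost of each alternative, and the cheapest one.
totals = [sum(cost for _, cost, _ in alt) for alt in alternatives]
cheapest = min(alternatives, key=lambda alt: sum(cost for _, cost, _ in alt))

for alt, total in zip(alternatives, totals):
    print(total, alt)
```

The number of alternatives is the product of the option counts per pattern, so this enumeration is only practical for small trees.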

Search multiple columns for the most common value

Excuse the basic nature of this question, but I have searched for hours for an answer and they all seem to overcomplicate what I need.
I have a dataframe like the following:
id food_item_1 food_item_2 food_item_3
1 nuts bread coffee
2 potatoes coffee cake
3 fish beer coffee
4 bread coffee coffee
What I want to do is search all the 'food_item_*' columns (in this case there are 3) and get back the single most common value across all of them, e.g. 'coffee'.
Could someone please recommend the best way to do this?
Many thanks
md
Use DataFrame.filter, reshape with DataFrame.stack, then use Series.mode and finally select the first value by position with Series.iat:
a = df.filter(like='food_item_').stack().mode().iat[0]
print (a)
coffee
Another idea is with Series.value_counts and selecting first value of index:
a = df.filter(like='food_item_').stack().value_counts().index[0]
You can also melt your columns and value_counts:
print (df.melt(id_vars="id", value_vars=df.columns[1:])["value"].value_counts())
coffee 5
bread 2
nuts 1
potatoes 1
cake 1
beer 1
fish 1
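A self-contained sketch that rebuilds the sample dataframe from the question and runs the stack-based variants. The intermediate names stacked, most_common and counts are illustrative. Series.mode returns all tied values in sorted order, so with a tie .iat[0] picks the first of them:

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "food_item_1": ["nuts", "potatoes", "fish", "bread"],
    "food_item_2": ["bread", "coffee", "beer", "coffee"],
    "food_item_3": ["coffee", "cake", "coffee", "coffee"],
})

# Keep only the food_item_* columns and flatten them into one Series.
stacked = df.filter(like="food_item_").stack()

# Most frequent value; on ties, mode() lists all of them in sorted order.
most_common = stacked.mode().iat[0]

# Full frequency table, most frequent first.
counts = stacked.value_counts()

print(most_common)
print(counts)
```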

Matching columns in dataframe using regex

Looking to perform a regex function to match a column of a dataframe with the first word of another. The dataframes were collected from different sources so the names of the drug are similar but do not match completely. They do match up if you ignore case and match for the first word.
I have two dataframes: one with drug names and another with a list of drug names with their respective prices. Fruits were added to the drug names for example purposes.
Dataframe A
drug
0 drug1 apple
1 drug2 orange
2 drug3 lemon
3 drug4 peach
Dataframe B
drugB price Regex
0 DRUG2 2 ^([\w\-]+)
1 DRUG4 4 ^([\w\-]+)
2 DRUG3 3 ^([\w\-]+)
3 DRUG1 1 ^([\w\-]+)
I am looking to use the Regex column to append dataframe A to B like so. Hopefully using the first name of drug column and match it to the respective column.
drug drugB price Regex
0 drug1 apple DRUG1 1 ^([\w\-]+)
1 drug2 orange DRUG2 2 ^([\w\-]+)
2 drug3 lemon DRUG3 3 ^([\w\-]+)
3 drug4 peach DRUG4 4 ^([\w\-]+)
I was inspired to try it this way based on the following stackoverflow question: How to merge pandas table by regex.
Thank you in advance! I hit a dead end with this problem and couldn't figure a way to get it to work.
You don't really need to define the regexes in the second dataframe. ALollz is right, by the way: you could easily split the string, but I guess your actual use case is more complex and you probably have drug names that include spaces.
Simple version with a common regex
If you can manage to define one common regex that matches all drug names, you can use the following code:
df_A['drugA'] = df_A['drug'].str.extract(r'^\s*(?P<drugA>[\w\-]*)')['drugA'].str.upper()
df_A.merge(df_B[['drugB', 'price']], left_on='drugA', right_on='drugB', how='left')
Just replace the pattern passed to str.extract with the regex you need. The output would be:
drug drugA drugB price
0 drug1 apple DRUG1 DRUG1 1
1 drug2 orange DRUG2 DRUG2 2
2 drug3 lemon DRUG3 DRUG3 3
3 drug4 peach DRUG4 DRUG4 4
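For a quick check, here is a runnable sketch of the simple version on the sample data from the question. It uses an unnamed capture group with expand=False instead of the named group, which is only a stylistic variation; the name merged is illustrative:

```python
import pandas as pd

df_A = pd.DataFrame({"drug": ["drug1 apple", "drug2 orange",
                              "drug3 lemon", "drug4 peach"]})
df_B = pd.DataFrame({"drugB": ["DRUG2", "DRUG4", "DRUG3", "DRUG1"],
                     "price": [2, 4, 3, 1]})

# Take the first word of each drug name and upper-case it so it can be
# compared with drugB. expand=False returns a Series for a single group.
df_A["drugA"] = df_A["drug"].str.extract(r"^\s*([\w\-]+)", expand=False).str.upper()

# A left merge keeps every row of df_A, in its original order.
merged = df_A.merge(df_B[["drugB", "price"]],
                    left_on="drugA", right_on="drugB", how="left")
print(merged)
```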
Version with a generated regex
import re

drug_list = df_B['drugB'].to_list()
# Sort the drug names by length, descending, so the longest match wins.
# This only matters if one drug name is fully contained in another,
# like "Aspirin" and "Aspirin plus C".
drug_list.sort(key=len, reverse=True)
# re.escape keeps regex metacharacters in drug names from being interpreted.
drug_pattern = r'^\s*(?P<drugA>{drug_list})'.format(drug_list='|'.join(map(re.escape, drug_list)))
df_A['drugA'] = df_A['drug'].str.extract(drug_pattern, flags=re.I)['drugA'].str.upper()
df_A.merge(df_B[['drugB', 'price']], left_on='drugA', right_on='drugB', how='left')
This produces the same output as above. Please note that this version might be limited in the number of drugs it can handle: if you have hundreds of drugs, it might run into problems, because the regular expression string gets long. But this version is more precise and also supports spaces in the drug names.
If you can work out one pattern that cuts out all drug names correctly, I would definitely recommend the first method. E.g. if you can spot a pattern that always comes after the drug name, you can use it to cut out the drug names much more easily.

Value to table header in Pentaho

Hi, I'm quite new to Pentaho Spoon and I have a problem:
I have a table like this:
model | type | color | q
1     | 1    | blue  | 1
1     | 2    | blue  | 2
1     | 1    | red   | 1
1     | 2    | red   | 3
2     | 1    | blue  | 4
2     | 2    | blue  | 5
I would like to create a separate table (to export to CSV or Excel) for each model, with one row per type, the colors as column headers and the q values as cell values:
table-1.csv
type | blue | red
1    | 1    | 1
2    | 2    | 3
table-2.csv
type | blue
1    | 4
2    | 5
I tried the Row denormalizer step, but without success.
Any suggestion?
Typically it's helpful to see what you have done in order to offer help, but I know how counterintuitive the "help" on this step is.
Make sure you sort the rows on Model and Type before sending them to the denormalizer step. Then give this a try:
As for splitting the output into files, there are a few ways to handle that. Take a look at the Switch/Case step using the Model field.
Also, if you haven't found them already, take a look at the sample files that come with the PDI download. They should be in ...pdi-ce-6.1.0.1-196\data-integration\samples. They can be more helpful than the online documentation sometimes.
Row denormalizer can't be used here if the number of colors is unknown. Also, you can't define text output fields dynamically.
There are a few ways that I can see without using Java or JS steps. One of them is based on the following idea: we can prepare rows with two columns:
Row Model
type|blue|red 1
1|1|1 1
2|2|3 1
type|blue 2
1|4 2
2|5 2
Then we can prepare a filename for each row using the Model field and easily output all rows using text output, where the file name is taken from the filename field. In this case all records will be exported into two files without additional effort.
Here you can find sample transformation: copy-paste me into new transformation
Please note that it's a sample solution that works only with CSV. It also works only if you have the same number of colors for each type inside a model. It's just a hint at how to use Spoon, not a complete solution.
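To make the target shape concrete, here is a small sketch in Python with pandas rather than PDI. It is not a Spoon transformation; it only reproduces, on the sample data from the question, the per-model rows described above (a header line followed by one pipe-separated line per type). The tables dictionary and its file-name keys are illustrative, and nothing is written to disk:

```python
import pandas as pd

df = pd.DataFrame({
    "model": [1, 1, 1, 1, 2, 2],
    "type": [1, 2, 1, 2, 1, 2],
    "color": ["blue", "blue", "red", "red", "blue", "blue"],
    "q": [1, 2, 1, 3, 4, 5],
})

tables = {}
for model, group in df.groupby("model"):
    # Within one model, colors become column headers and q fills the cells.
    wide = group.pivot(index="type", columns="color", values="q")
    # to_csv without a path returns the CSV text instead of writing a file.
    tables[f"table-{model}.csv"] = wide.to_csv(sep="|")

for filename, text in tables.items():
    print(filename)
    print(text)
```

Like the sample transformation, this assumes every type within a model has the same set of colors; otherwise the pivot would contain missing values.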