How to match multiple words from list with pandas data frame column - pandas

I have a list like :
keyword_list = ['motorcycle love hobby ', 'bike love me', 'cycle', 'dirtbike cycle motorbike ']
I want to find these words in a pandas DataFrame column; if 3 words match, a new column should be created containing those words.
I need something like the expected output in my screenshot (image not reproduced here).

You can probably use set operations:
kw = {s: set(s.split()) for s in keyword_list}

def subset(s):
    S1 = set(s.split())
    for k, S2 in kw.items():
        if S2.issubset(S1):
            return k

df['trigram'] = [subset(s) for s in df['description'].str.lower()]
print(df)
Output:
                                   description                trigram
0  I love motorcycle though I have other hobby  motorcycle love hobby
1                                  I have bike                   None
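For reference, here is a self-contained version of this approach with a hypothetical two-row frame (the original df isn't shown in the question):

```python
import pandas as pd

keyword_list = ['motorcycle love hobby ', 'bike love me', 'cycle',
                'dirtbike cycle motorbike ']

# hypothetical sample data mirroring the output shown above
df = pd.DataFrame({'description': [
    'I love motorcycle though I have other hobby',
    'I have bike',
]})

# map each keyword phrase to the set of its words
kw = {s: set(s.split()) for s in keyword_list}

def subset(s):
    S1 = set(s.split())
    for k, S2 in kw.items():
        if S2.issubset(S1):    # all words of the phrase appear in the row
            return k           # implicit None if no phrase fully matches

df['trigram'] = [subset(s) for s in df['description'].str.lower()]
```

Note that `issubset` requires every word of the keyword phrase to be present, which is why only the first row matches.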

Related

mecab python extract company name

I'm trying to process the data in a column and extract only the company names using the MeCab library, listing them in a new column.
The target column is a comment column that can include employee names, company names, invoice numbers, etc., together or alone depending on the transaction. Below is my code attempting to extract only the company name. Please note the code is still a work in progress, but I just wanted to post something to start with.
Sorry in advance for my messy coding...
Thank you,
import MeCab    # installed via the mecab-python3 package
import ipadic
import pandas as pd

df = pd.read_csv("")
m = MeCab.Tagger(ipadic.MECAB_ARGS)

def kaiseki(column):
    values = df[column].values.tolist()
    new_list = []
    new_list2 = []
    for li in values:
        li = m.parse(li)
        new_list.append(li)
        li2 = li.split('\n')
        new_list2.append(li2)
        for li1 in li2:
            fields = li1.split('\t')
            for field in fields:
                # 組織名 means "organization name" in Japanese
                if field.split(',')[0] == '組織名':
                    print(li1.split()[0])
    df[column] = new_list
    df["column2"] = new_list2
    return df["column2"]

columns = ['column']
for column in columns:
    kaiseki(column)
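The feature-matching step can be sketched and tested without MeCab installed. The sketch below assumes MeCab's default node format, a surface form and a comma-separated feature string separated by a tab, ending with an EOS line; the sample parse output is hypothetical, and the exact feature labels (e.g. 組織 for organization proper nouns in ipadic) depend on the dictionary used:

```python
# hypothetical MeCab parse output: "surface\tfeature,feature,..." per token
sample_output = (
    "株式会社テスト\t名詞,固有名詞,組織,*,*,*,株式会社テスト\n"
    "請求書\t名詞,一般,*,*,*,*,請求書\n"
    "EOS\n"
)

def extract_by_feature(parsed, feature):
    """Collect surface forms whose feature string contains the given label."""
    hits = []
    for line in parsed.split('\n'):
        if '\t' not in line:              # skips the "EOS" marker and blank lines
            continue
        surface, features = line.split('\t', 1)
        if feature in features.split(','):
            hits.append(surface)
    return hits

print(extract_by_feature(sample_output, '組織'))  # ['株式会社テスト']
```

Applied per row with `df[column].apply(lambda s: extract_by_feature(m.parse(s), '組織'))`, this would give a column of organization-name lists.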

Compare two Dataframes based on column value(String, Substring) and update another column value

I have dataframes df1 and df2, where the df1 Name column partially matches the df2 Name column values. On a partial match of the Name values, compare the Price values of both dataframes; if the price is the same, update the Flag column in df1 to 'Delete'.
df1

Name                    Price         Flag
VENTILLA HOME FARR      662324.21     Delete
VENTILLA HOME FARR      -277961.62
VENTILLA HOME FARR      776011.5
VARAMANT METRO PLANET   662324.21
VARAMANT METRO PLANET   55555.21      Delete
VARAMANT METRO PLANET   267117.5499
FANTHOM STREET LLB      83265.2799
FANTHOM STREET LLB      -444452.96    Delete
FANTHOM STREET LLB      267117.5499
df2
my_dict = {'VT METRO PLANET ': 267117.5499, 'VENTILLA HOME FA ': -277961.62, 'FANTHOM STREET ': 83265.2799}
df2 = pd.DataFrame(list(my_dict.items()),columns = ['Name','Price'])
Expected Output
Any help would be appreciated
The solution I share here is based on sets: if the Name in dataframe 1 shares more than one word with the Name in dataframe 2, and their Price values are equal, then we set the Flag column in dataframe 1 to "Delete"; otherwise we leave it empty.
Here is the code:
def check(row):
    df1_Name = set(map(lambda word: word.lower(), row.Name.split(' ')))
    df1_price = row.Price
    for df2_Name, df2_Price in df2[['Name', 'Price']].values:
        df2_Name = set(map(lambda word: word.lower(), df2_Name.split(' ')))
        if len(df1_Name.intersection(df2_Name)) > 1 and df1_price == df2_Price:
            return 'Delete'
    return ''

df1["Flag"] = df1.apply(check, axis=1)
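Here is a runnable sketch of the same idea with a trimmed-down df1 built from the question's data. Note that the intersection threshold `> 1` requires at least two shared words between the names:

```python
import pandas as pd

# trimmed-down sample frames built from the question's data
df1 = pd.DataFrame({
    'Name': ['VENTILLA HOME FARR', 'VENTILLA HOME FARR', 'VARAMANT METRO PLANET'],
    'Price': [662324.21, -277961.62, 267117.5499],
})
my_dict = {'VT METRO PLANET ': 267117.5499,
           'VENTILLA HOME FA ': -277961.62,
           'FANTHOM STREET ': 83265.2799}
df2 = pd.DataFrame(list(my_dict.items()), columns=['Name', 'Price'])

def check(row):
    words1 = set(row.Name.lower().split())
    for name2, price2 in df2[['Name', 'Price']].values:
        words2 = set(name2.lower().split())
        # at least two shared words AND equal price -> flag for deletion
        if len(words1 & words2) > 1 and row.Price == price2:
            return 'Delete'
    return ''

df1['Flag'] = df1.apply(check, axis=1)
print(df1['Flag'].tolist())  # ['', 'Delete', 'Delete']
```

The first row shares two words with 'VENTILLA HOME FA ' but has a different price, so only the price-matching rows are flagged.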

Pandas manipulation: matching data from other columns to one column, applied uniquely to all rows

I have a model that predicts 10 words for a particular course in order of likelihood, and I'd like to keep the first 5 of those words that appear in the course's description.
This is the format of the data:
course_name course_title course_description predicted_word_10 predicted_word_9 predicted_word_8 predicted_word_7 predicted_word_6 predicted_word_5 predicted_word_4 predicted_word_3 predicted_word_2 predicted_word_1
Xmath 32 Precalculus Polynomial and rational functions, exponential... directed scholars approach build african different visual cultures placed global
Xphilos 2 Morality Introduction to ethical and political philosop... make presentation weekly european ways general range questions liberal speakers
My idea is for each row to start iterating from predicted_word_1 until I get the first 5 that are in the description. I'd like to save those words in the order they appear into additional columns description_word_1 ... description_word_5. (If there are <5 predicted words in the description I plan to return NAN in the corresponding columns).
To clarify with an example: if the course_description of a course is 'Polynomial and rational functions, exponential and logarithmic functions, trigonometry and trigonometric functions. Complex numbers, fundamental theorem of algebra, mathematical induction, binomial theorem, series, and sequences. ' and its first few predicted words are irrelevantword1, induction, exponential, logarithmic, irrelevantword2, polynomial, algebra...
I would want to return induction, exponential, logarithmic, polynomial, algebra for that in that order and do the same for the rest of the courses.
My attempt was to define an apply function that will take in a row and iterate from the first predicted word until it finds the first 5 that are in the description, but the part I am unable to figure out is how to create these additional columns that have the correct words for each course. This code will currently only keep the words for one course for all the rows.
def find_top_description_words(row):
    print(row['course_title'])
    description_words_index = 1
    for i in range(num_words_per_course):
        description = row.loc['course_description']
        word_i = row.loc['predicted_word_' + str(i+1)]
        if (word_i in description) & (description_words_index <= 5):
            print(description_words_index)
            row['description_word_' + str(description_words_index)] = word_i
            description_words_index += 1

df.apply(find_top_description_words, axis=1)
The end goal of this data manipulation is to keep the top 10 predicted words from the model and the top 5 predicted words in the description so the dataframe would look like:
course_name course_title course_description top_description_word_1 ... top_description_word_5 predicted_word_1 ... predicted_word_10
Any pointers would be appreciated. Thank you!
If I understand correctly:
Create a new DataFrame with just the 10 predicted words:

pred_words_lists = df.apply(lambda x: list(x[3:].dropna())[::-1], axis=1)

Note that each row now holds a list of predicted words, in order: the first non-empty predicted word is in first place, the second in second place, and so on.
Now let's create a new DataFrame:

pred_words_df = pd.DataFrame(pred_words_lists.tolist())
pred_words_df.columns = df.columns[:2:-1]
And The final DataFrame:
final_df = df[['course_name', 'course_title', 'course_description']].join(pred_words_df.iloc[:,0:11])
Hope this works.
EDIT
def common_elements(xx, yy):
    temp = pd.Series(range(0, len(xx)), index=xx)
    return list(temp.reindex(yy).dropna().sort_values()[0:10].index)

pred_words_lists = df.apply(lambda x: common_elements(x[2].replace(',', '').split(), list(x[3:].dropna())), axis=1)
Does it satisfy your requirements?
Adapted solution (OP):
def get_sorted_descriptions_words(course_description, predicted_words, k):
    description_words = course_description.replace(',', '').split()
    predicted_words_list = list(predicted_words)
    predicted_words = pd.Series(range(0, len(predicted_words_list)), index=predicted_words_list)
    predicted_words = predicted_words[~predicted_words.index.duplicated()]
    ordered_description = predicted_words.reindex(description_words).dropna().sort_values()
    ordered_description_list = pd.Series(ordered_description.index).unique()[:k]
    return ordered_description_list

df.apply(lambda x: get_sorted_descriptions_words(x['course_description'], x.filter(regex=r'predicted_word_.*'), k), axis=1)
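The ranking trick at the core of this solution, building a Series whose index is the predicted words and whose values are their ranks, then reindexing it by the description words, can be sketched on the question's own example:

```python
import pandas as pd

description = "Polynomial and rational functions, exponential and logarithmic functions"
# predicted words in order of likelihood (rank 0 = most likely)
predicted = ["irrelevantword1", "induction", "exponential", "logarithmic", "polynomial"]

rank = pd.Series(range(len(predicted)), index=predicted)

desc_words = description.lower().replace(',', '').split()
# words absent from `rank` become NaN and are dropped; the rest sort by rank
ordered = rank.reindex(desc_words).dropna().sort_values()
top = pd.Series(ordered.index).unique()[:5]
print(list(top))  # ['exponential', 'logarithmic', 'polynomial']
```

The result keeps only description words that were predicted, ordered by prediction rank rather than by position in the description.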

Creating a function to count the number of pos in a pandas instance

I've used NLTK to pos_tag sentences in a pandas dataframe from an old Yelp competition. This returns a list of tuples (word, POS). I'd like to count the number of parts of speech for each instance. How would I, say, create a function to count the number of being verbs in each review? I know how to apply functions to features - no problem there. I just can't wrap my head around how to count things inside tuples inside lists inside a pd feature.
The head is here, as a tsv: https://pastebin.com/FnnBq9rf
Thank you @zhangyulin for your help. After two days, I learned some incredibly important things (as a novice programmer!). Here's the solution:
def NounCounter(x):
    nouns = []
    for (word, pos) in x:
        if pos.startswith("NN"):
            nouns.append(word)
    return nouns

df["nouns"] = df["pos_tag"].apply(NounCounter)
df["noun_count"] = df["nouns"].str.len()
As an example, for a dataframe df, the noun count of the column "reviews" can be saved to a new column "noun_count" using this code:

from nltk import pos_tag, word_tokenize

def NounCount(x):
    nounCount = sum(1 for word, pos in pos_tag(word_tokenize(x)) if pos.startswith('NN'))
    return nounCount

df["noun_count"] = df["reviews"].apply(NounCount)
df.to_csv('./dataset.csv')
There are a number of ways to do this; one very straightforward way is to map the list (or pandas Series) of tuples to an indicator of whether the word is a verb, and count the number of 1s you get.
Assume you have something like this (please correct me if it's not, as you didn't provide an example):
a = pd.Series([("run", "verb"), ("apple", "noun"), ("play", "verb")])
You can do something like this to map the Series and sum the count:
a.map(lambda x: 1 if x[1]== "verb" else 0).sum()
This will return you 2.
I grabbed a sentence from the link you shared:

import nltk
import pandas as pd

text = nltk.word_tokenize("My wife took me here on my birthday for breakfast and it was excellent.")
tag = nltk.pos_tag(text)
a = pd.Series(tag)
a.map(lambda x: 1 if x[1] == "VBD" else 0).sum()
# this returns 2
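If the column already holds lists of (word, POS) tuples, as the question describes, the counting can be done without re-tagging. A small sketch with made-up rows (the Penn Treebank "VB" prefix covers all verb tags, e.g. VB, VBD, VBG):

```python
import pandas as pd

# hypothetical rows shaped like nltk.pos_tag output
df = pd.DataFrame({
    "pos_tag": [
        [("My", "PRP$"), ("wife", "NN"), ("took", "VBD"), ("me", "PRP")],
        [("It", "PRP"), ("was", "VBD"), ("excellent", "JJ")],
    ]
})

# count tuples in each list whose tag starts with the verb prefix
df["verb_count"] = df["pos_tag"].apply(
    lambda tags: sum(1 for _, pos in tags if pos.startswith("VB"))
)
print(df["verb_count"].tolist())  # [1, 1]
```

Swapping `"VB"` for `"NN"` or a specific tag like `"VBD"` counts any other part of speech the same way.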

How to query a dataframe using a column of other dataframe in R

I have 2 dataframes in R and I want to do a query using the dataframe "y" like parameter to dataframe "x".
I have this code:
x <- c('The book is on the table','I hear birds outside','The electricity
came back')
x <- data.frame(x)
colnames(x) <- c('text')
x
y <- c('book','birds','electricity')
y <- data.frame(y)
colnames(y) <- c('search')
y
r <- sqldf("select * from x where text IN (select search from y)")
r
I thought of using "like" here, but I don't know how.
Can you help me?
If you want a sqldf solution, I think that this would work:
sqldf("select x.text, y.search FROM x JOIN y on x.text LIKE '%' || y.search || '%'")
## text search
## 1 The book is on the table book
## 2 I hear birds outside birds
## 3 The electricity \ncame back electricity
You could use the fuzzyjoin package:
library(dplyr)
library(fuzzyjoin)
regex_join(
  mutate_if(x, is.factor, as.character),
  mutate_if(y, is.factor, as.character),
  by = c("text" = "search")
)
# text search
# 1 The book is on the table book
# 2 I hear birds outside birds
# 3 The electricity \ncame back electricity
It's hard to know if this is what you want without more varied test data. To add a little variation, I added an extra word to y$search: y = c('book', 'birds', 'electricity', 'cat').
Just want to know which words are in which statements? Use sapply and grepl:
> m = sapply(y$search, grepl, x$text)
> rownames(m) = x$text
> colnames(m) = y$search
> m
book birds electricity cat
The book is on the table TRUE FALSE FALSE FALSE
I hear birds outside FALSE TRUE FALSE FALSE
The electricity \ncame back FALSE FALSE TRUE FALSE
Pulling out just the matching rows?
> library(magrittr) # To use the pipe, "%>%"
> x %>% data.table::setDT() # To return the result as a table easily
>
> x[(sapply(y$search, grepl, x$text) %>% rowSums() %>% as.logical()) * (1:nrow(x)), ]
text
1: The book is on the table
2: I hear birds outside
3: The electricity \ncame back
@Aurèle's solution gives the best result for matching text and the text it matches to. Note that if back were also in y$search, the text The electricity \ncame back would be reported twice in the result, once per matched search term, so this approach is better when uniqueness is not required.
So it largely depends on your desired output.