spaCy nlp - positions of entities in string, extracting nearby words - spacy

Let's say I have a string and want to mark some entities, such as organizations.
string = "I was working as a marketing executive for Bank of India, a 4 months.."
string_tagged = "I was working as a marketing executive for [Bank of India], a 4 months.."
I want to identify the words next to the tagged entity.
How can I locate the position of the tagged entity and extract the words beside it?
My code:
import spacy
nlp = spacy.load('en')
doc = nlp(string)
company = doc.text
for ent in doc.ents:
    if ent.label_ == 'ORG':
        company = company[:ent.start_char] + company[:ent.start_char -1] +company[:ent.end_char +1]
print(company)

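As I understand your question, you want the words next to each entity tagged ORG:
import spacy
nlp = spacy.load('en')  # in spaCy 3+, load a named model instead, e.g. 'en_core_web_sm'
# string = "blah blah"
doc = nlp(string)
for ent in doc.ents:
    if ent.label_ == 'ORG':
        # ent.start and ent.end are token indices (end is exclusive);
        # ent.start_char and ent.end_char are character offsets into doc.text
        prev_word = doc[ent.start - 1].text if ent.start > 0 else None
        next_word = doc[ent.end].text if ent.end < len(doc) else None
        print(prev_word, ent.text, next_word)  # previous word, tagged entity and next one
The checks on ent.start and ent.end handle an entity at the very beginning or end of the text, which has no previous or next token.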
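To illustrate how character offsets such as ent.start_char and ent.end_char relate to the surrounding words, here is a minimal sketch in plain Python. It assumes the offsets are already known (they are found with str.find here, as a stand-in for spaCy), and the helper name words_around is hypothetical.

```python
import re


def words_around(text, start_char, end_char):
    """Return (previous_word, span_text, next_word) for text[start_char:end_char].

    Words are runs of \\w characters, so punctuation is skipped.
    A missing neighbour is returned as None.
    """
    before = re.findall(r"\w+", text[:start_char])
    after = re.findall(r"\w+", text[end_char:])
    prev_word = before[-1] if before else None
    next_word = after[0] if after else None
    return prev_word, text[start_char:end_char], next_word


text = "I was working as a marketing executive for Bank of India, a 4 months.."
start = text.find("Bank of India")   # stand-in for ent.start_char
end = start + len("Bank of India")   # stand-in for ent.end_char
print(words_around(text, start, end))  # ('for', 'Bank of India', 'a')
```

Working on character offsets like this ignores spaCy's tokenization, so for anything beyond a quick check the token indices (ent.start, ent.end) are probably the better handle.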

The following code works for me:
doc = nlp(str_to_be_tokenized)
company = []
for ent in doc.ents:
    if ent.label_ == 'ORG' and ent.text not in company:
        company.append(ent.text)
print(company)
The second condition in the if keeps only unique company names from the block of text. If you remove it, every occurrence of an 'ORG' entity is added to the company list. I hope this works for you as well.
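As an aside, the uniqueness check can also be done after collecting the names. A small sketch, assuming the names are already available as a list of strings (the values below are made up for illustration):

```python
# Hypothetical ent.text values collected from doc.ents
names = ["Bank of India", "Acme Corp", "Bank of India"]

# dict keys keep insertion order (Python 3.7+), so this drops
# duplicates while preserving the order of first appearance
unique_names = list(dict.fromkeys(names))
print(unique_names)  # ['Bank of India', 'Acme Corp']
```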

Related

Replacing part of string in a column with exceptions

I am cleaning data in a Jupyter Notebook with pandas and I am trying to keep just the first part of a string. But there is a catch: I could easily delete the rest of the expression, but some of the longer values are actually valid. For example, in the column:
SHIPMENT_PROVIDER
Usps
Usps International
Uspsxy3
Usps7kju
Usps0by
I want to keep Usps and Usps International. So far I have used the following code for simpler cases:
orders.loc[:, 'SHIPMENT_PROVIDER'] = orders.loc[:, 'SHIPMENT_PROVIDER'].replace(to_replace='(?:Usps)([a-zA-Z0-9]+)$', value = 'Usps', regex = True)
But this won't work with two alternative expressions. The idea is that Usps followed by random characters (e.g. Uspsxyz) should be replaced by Usps, while Usps International followed by random characters (e.g. Usps Internationalxyz) should be replaced by Usps International.
Others have posted regex solutions. How about a non-regex solution:
import numpy as np
s = orders["SHIPMENT_PROVIDER"]
orders["SHIPMENT_PROVIDER"] = np.select(
    [s.str.startswith("Usps International"), s.str.startswith("Usps")],
    ["Usps International", "Usps"],
    s,
)
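np.select takes the choice for the first condition that is true, which is why the longer prefix has to be listed first. The same idea in plain Python, as a rough sketch: the helper name normalize_provider and the Fedex sample value are assumptions for illustration, not part of the original data.

```python
# Longest prefix first, otherwise "Usps" would also win for "Usps International..."
PREFIXES = ["Usps International", "Usps"]


def normalize_provider(value):
    """Return the first known prefix that value starts with, else value unchanged."""
    for prefix in PREFIXES:
        if value.startswith(prefix):
            return prefix
    return value


samples = ["Usps", "Usps International", "Uspsxy3", "Usps Internationalxyz", "Fedex"]
print([normalize_provider(s) for s in samples])
# ['Usps', 'Usps International', 'Usps', 'Usps International', 'Fedex']
```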
As a pattern, you could instead use a capture group for the first part, followed by an optional part for International:
^(Usps)(?: International)?[a-zA-Z0-9]+$
In the replacement use group 1.
import pandas as pd
pattern = r"^(Usps)(?: International)?[a-zA-Z0-9]+$"
items = [
    "Usps",
    "Usps International",
    "Uspsxy3",
    "Usps7kju",
    "Usps0by",
    "Usps Internationalxyz"
]
orders = pd.DataFrame(items, columns=["SHIPMENT_PROVIDER"])
orders.loc[:, 'SHIPMENT_PROVIDER'] = orders.loc[:, 'SHIPMENT_PROVIDER'].replace(pattern, r"\1", regex=True)
print(orders)
Output
    SHIPMENT_PROVIDER
0                Usps
1  Usps International
2                Usps
3                Usps
4                Usps
5                Usps
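Note that in the output above, Usps Internationalxyz becomes Usps, whereas the question asks for Usps International in that case. One possible adjustment is to move the optional International part inside the capture group. The sketch below uses the standard re module instead of pandas so it can be checked in isolation; the helper name clean is hypothetical.

```python
import re

# Group 1 holds "Usps" plus the optional " International" part;
# the trailing [a-zA-Z0-9]+ is the junk suffix to drop.
PATTERN = re.compile(r"^(Usps(?: International)?)[a-zA-Z0-9]+$")


def clean(value):
    """Replace a matching value by group 1; non-matching values are returned unchanged."""
    return PATTERN.sub(r"\1", value)


for value in ["Usps", "Usps International", "Uspsxy3", "Usps Internationalxyz"]:
    print(repr(value), "->", repr(clean(value)))
```

Values that are already clean (Usps, Usps International) do not match because the pattern requires at least one trailing character, so they pass through untouched. The same pattern string should work with the pandas replace call shown above.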

How to match multiple words from list with pandas data frame column

I have a list like :
keyword_list = ['motorcycle love hobby ', 'bike love me', 'cycle', 'dirtbike cycle motorbike ']
I want to find these words in a pandas data frame column, and if all 3 words of an entry match, a new column should be created with those words.
You can probably use set operations:
kw = {s: set(s.split()) for s in keyword_list}
def subset(s):
    S1 = set(s.split())
    for k, S2 in kw.items():
        if S2.issubset(S1):
            return k
df['trigram'] = [subset(s) for s in df['description'].str.lower()]
print(df)
Output:
                                   description                trigram
0  I love motorcycle though I have other hobby  motorcycle love hobby
1                                  I have bike                   None

How to avoid double-extraction of patterns in SpaCy?

I'm using an incident database to identify the causes of accidents. I have defined a pattern and a function to extract the matching patterns. However, sometimes this function creates overlapping results. I saw in a previous post that we can use for span in spacy.util.filter_spans(spans): to avoid repeated answers, but I don't know how to rewrite the function with this. I will be grateful for any help you can provide.
pattern111 = [{'DEP':'compound','OP':'?'},{'DEP':'nsubj'}]
def get_relation111(x):
    doc = nlp(x)
    matcher = Matcher(nlp.vocab)
    relation= []
    matcher.add("matching_111", [pattern111], on_match=None)
    matches = matcher(doc)
    for match_id, start, end in matches:
        matched_span = doc[start: end]
        relation.append(matched_span.text)
    return relation
filter_spans can be used on any list of spans. This is a little weird because you want a list of strings, but you can work around it by saving a list of spans first and only converting to strings after you've filtered.
from spacy.util import filter_spans
def get_relation111(x):
    doc = nlp(x)
    matcher = Matcher(nlp.vocab)
    relation= []
    matcher.add("matching_111", [pattern111], on_match=None)
    matches = matcher(doc)
    for match_id, start, end in matches:
        # keep the Span itself, not its text, so it can be filtered
        matched_span = doc[start: end]
        relation.append(matched_span)
    # XXX Just add this line
    relation = [ss.text for ss in filter_spans(relation)]
    return relation
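To make the filtering step less of a black box, here is a rough pure-Python approximation of what filter_spans does, working on (start, end) token-index pairs instead of spaCy Span objects: longer spans win, ties go to the earlier span, and the result is sorted by position. The function name filter_overlaps is hypothetical and this is only a sketch of the idea, not spaCy's actual code.

```python
def filter_overlaps(spans):
    """Keep non-overlapping (start, end) spans, preferring longer ones.

    Ties in length go to the span that starts first. end is exclusive,
    as in doc[start:end]. The result is sorted by start position.
    """
    by_priority = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept = []
    seen = set()  # token indices already covered by a kept span
    for start, end in by_priority:
        if not any(i in seen for i in range(start, end)):
            kept.append((start, end))
            seen.update(range(start, end))
    return sorted(kept)


# With an optional 'compound' token, the matcher reports both the
# two-token span (0, 2) and the one-token span (1, 2) inside it.
print(filter_overlaps([(1, 2), (0, 2), (4, 5)]))  # [(0, 2), (4, 5)]
```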

mecab python extract company name

I'm trying to go through the data in a column, extract only the company names using the MeCab library, and list them in a new column.
The target column is a comment column that contains employee names, company names, invoice numbers, etc., either all together or on their own depending on the transaction. Below is my code, which tries to extract only the company name. Please note that the code is still a work in progress; I just wanted to post something to start with.
Sorry in advance for my messy coding...
Thank you,
import MeCab  # provided by the mecab-python3 package
import ipadic
import pandas as pd
df = pd.read_csv("")
m = MeCab.Tagger(ipadic.MECAB_ARGS)
def kaiseki(column):
    values = df[column].values.tolist()
    new_list = []
    new_list2 = []
    for li in values:
        li = m.parse(li)
        new_list.append(li)
        li2 = li.split('\n')
        new_list2.append(li2)
        for li1 in li2:
            li2 = li1.split('\t')
            for li2_1 in li2:
                li2_1_1 = li2_1.split(',')[0]
                #組織名 means company name in Japanese
                if li2_1_1 == '組織名':
                    print(li1.split()[0])
                else:
                    continue
    df[column] = new_list
    df["column2"] = new_list2
    return df["column2"]
columns = ['column']
for column in columns:
    kaiseki(column)
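One thing that may help here: with the IPADIC dictionary, each line of m.parse() output has the form surface, a tab, then comma-separated features, and as far as I know organization names are marked as 名詞,固有名詞,組織 (with 組織 in the third field), so a comparison of the first field against 組織名 would likely never match. The sketch below parses a hand-written sample in that layout; it is not real tagger output, and the function name organizations is hypothetical.

```python
# Hand-written sample in the IPADIC output layout (an assumption for
# illustration, not produced by running MeCab)
SAMPLE = (
    "トヨタ\t名詞,固有名詞,組織,*,*,*,トヨタ,トヨタ,トヨタ\n"
    "の\t助詞,連体化,*,*,*,*,の,ノ,ノ\n"
    "田中\t名詞,固有名詞,人名,姓,*,*,田中,タナカ,タナカ\n"
    "EOS\n"
)


def organizations(parsed):
    """Return the surface forms whose first three features are 名詞,固有名詞,組織."""
    found = []
    for line in parsed.splitlines():
        if line == "EOS" or "\t" not in line:
            continue  # skip the end-of-sentence marker and blank lines
        surface, features = line.split("\t", 1)
        if features.split(",")[:3] == ["名詞", "固有名詞", "組織"]:
            found.append(surface)
    return found


print(organizations(SAMPLE))  # ['トヨタ']
```

In the DataFrame case, something like df[column].map(lambda text: organizations(m.parse(text))) would presumably give one list of names per row, though whether a given company is recognized depends on the dictionary.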

I would like to create an auto-schedule but I have some problems

I want to create a schedule program where I enter a word and it returns the info I need. For example, if I write the word "monday", I would like a response with the subjects I have that day. I did this (a very brief example, I have more subjects):
x = int(input("Day of the week: "))
if x == 2:
    x = 0
    print('9:00-11:00 Biology - Classroom C4B \n11:00-13:00 Maths- Classroom C5')
elif x == 3:
    print('11:00-13:00 Physics - Classroom C4B')
This works, but the problem is that I want to enter words, not numbers. I also tried eval, which works too. However, in that case I must put the word between '' because eval expects a string literal, and that is not what I want. How can I improve my program?
Thanks in advance (Python 3)
Why not read the input as a plain string?
In Python 3, input() already returns the text the user typed as a string. (In Python 2 that was the job of raw_input(), while input() evaluated the text as Python code.)
In your code, x = int(input("Day of the week: ")) accepts only numeric input because of the int() call. Drop the int() and compare against the day names:
day = input("Day of the week: ")
if day == "monday":
    print('9:00-11:00 Biology - Classroom C4B \n11:00-13:00 Maths- Classroom C5')
elif day == "tuesday":
    print('11:00-13:00 Physics - Classroom C4B')
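With more days and subjects, a chain of if/elif grows quickly. One alternative is a dictionary lookup, sketched below. The names SCHEDULE and subjects_for and the fallback message are assumptions for illustration; the timetable entries are the ones from the question.

```python
SCHEDULE = {
    "monday": "9:00-11:00 Biology - Classroom C4B \n11:00-13:00 Maths- Classroom C5",
    "tuesday": "11:00-13:00 Physics - Classroom C4B",
}


def subjects_for(day):
    """Look up a day name, ignoring case and surrounding spaces."""
    return SCHEDULE.get(day.strip().lower(), "No subjects found for that day")


print(subjects_for("Monday"))
```

In the interactive program, the call would become print(subjects_for(input("Day of the week: "))).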