Printing the remainder of a sentence after using spacy matcher to find the start of a target sentence - spacy

I am trying to extract the aim from a scientific journal abstract and am using spaCy. I have a screenshot of the abstract and have run pytesseract on the image. I have tokenized the text into sentences with:
import spacy

nlp = spacy.load("en_core_web_sm")
# Tokenize into sentences
sents = nlp.create_pipe("sentencizer")
nlp.add_pipe(sents)
doc = nlp(text)
[sent.text for sent in doc.sents]
Which seems to work quite well and gives me a list of sentences. I then made a rule-based matcher that I believe matches the part of a sentence preceding the aim of the study:
from spacy.matcher import Matcher

# Rule-based matching for AIM
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "PART"}, {"POS": "VERB"}, {"POS": "DET"}, {"POS": "NOUN"}]
matcher.add('Aim', None, pattern)
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]
    span = doc[start:end]
    print(span.text)
The matcher prints the target part of the sentence, so I know the matcher works (at least well enough for now and I can improve later). What I want to do now is run the matcher on each sentence and if it matches, print the sentence. I tried:
matches = matcher(doc.sents)
if matches:
    print(sent.text)
But it returns: TypeError: Argument 'doc' has incorrect type (expected spacy.tokens.doc.Doc, got generator)

For anyone interested, I solved this by changing:
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]
    span = doc[start:end]
    print(span.text)
To:
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]
    span = doc[start:end]
    sents = span.sent  # ADDED THIS LINE
    print(sents)
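For completeness, here is a minimal sketch of the whole pipeline (same spaCy v2-style calls and the same pattern as above; text is assumed to hold the OCR'd abstract) that prints each matching sentence only once, even when a sentence contains several matches:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
doc = nlp(text)  # `text` is the OCR'd abstract

matcher = Matcher(nlp.vocab)
pattern = [{"POS": "PART"}, {"POS": "VERB"}, {"POS": "DET"}, {"POS": "NOUN"}]
matcher.add("Aim", None, pattern)

seen = set()
for match_id, start, end in matcher(doc):
    sent = doc[start:end].sent      # the sentence containing this match
    if sent.start not in seen:      # skip sentences already printed
        seen.add(sent.start)
        print(sent.text)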

Related

How to avoid double-extraction of patterns in SpaCy?

I'm using an incident database to identify the causes of accidents. I have defined a pattern and a function to extract the matching spans. However, sometimes this function creates overlapping results. I saw in a previous post that we can use for span in spacy.util.filter_spans(spans): to avoid repetition of matches, but I don't know how to rewrite the function with this. I will be grateful for any help you can provide.
pattern111 = [{'DEP': 'compound', 'OP': '?'}, {'DEP': 'nsubj'}]

def get_relation111(x):
    doc = nlp(x)
    matcher = Matcher(nlp.vocab)
    relation = []
    matcher.add("matching_111", [pattern111], on_match=None)
    matches = matcher(doc)
    for match_id, start, end in matches:
        matched_span = doc[start:end]
        relation.append(matched_span.text)
    return relation
filter_spans can be used on any list of spans. This is a little weird because you want a list of strings, but you can work around it by saving a list of spans first and only converting to strings after you've filtered.
from spacy.util import filter_spans

def get_relation111(x):
    doc = nlp(x)
    matcher = Matcher(nlp.vocab)
    relation = []
    matcher.add("matching_111", [pattern111], on_match=None)
    matches = matcher(doc)
    for match_id, start, end in matches:
        matched_span = doc[start:end]
        relation.append(matched_span)   # keep Span objects for now
    # filter overlapping spans, then convert to strings
    relation = [ss.text for ss in filter_spans(relation)]
    return relation
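As a quick check of what filter_spans does with the overlap, here is a hedged sketch with a made-up sentence; the pattern can match both the bare nsubj token and the compound + nsubj pair, and filter_spans keeps only the longest of the overlapping spans:
import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans

nlp = spacy.load("en_core_web_sm")
pattern111 = [{'DEP': 'compound', 'OP': '?'}, {'DEP': 'nsubj'}]

doc = nlp("The brake failure caused the collision.")  # hypothetical example sentence
matcher = Matcher(nlp.vocab)
matcher.add("matching_111", [pattern111])

spans = [doc[start:end] for _, start, end in matcher(doc)]
print([s.text for s in spans])                # e.g. ['failure', 'brake failure']
print([s.text for s in filter_spans(spans)])  # e.g. ['brake failure']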

spaCy nlp - positions of entities in string, extracting nearby words

Let's say I have a string and want to mark some entities such as organizations.
string = "I was working as a marketing executive for Bank of India, a 4 months.."
string_tagged = "I was working as a marketing executive for [Bank of India], a 4 months.."
I want to identify the words beside the entity tagged.
How can I locate the positions of the entity tagged and extract the words beside the entity?
My code:
import spacy
nlp = spacy.load('en')
doc = nlp(string)
company = doc.text
for ent in doc.ents:
    if ent.label_ == 'ORG':
        company = company[:ent.start_char] + company[:ent.start_char - 1] + company[:ent.end_char + 1]
print company
As I understood from your question, you want the words beside the ORG-tagged token:
import spacy

nlp = spacy.load('en')
# string = "blah blah"
doc = nlp(string)
company = ""
for i in range(1, len(doc) - 1):
    if doc[i].ent_type_ == 'ORG':
        # previous word, tagged word and the next one
        company = doc[i - 1].text + " " + doc[i].text + " " + doc[i + 1].text
print(company)
Be aware that this loop skips the first and last tokens, so check those boundaries separately if you need them.
The following code works for me:
doc = nlp(str_to_be_tokenized)
company = []
for ent in doc.ents:
    if ent.label_ == 'ORG' and ent.text not in company:
        company.append(ent.text)
print(company)
The second condition in the if is there to extract only unique company names from my block of text. If you remove it, you'll get every instance of 'ORG' added to your company list. Hope this works for you as well.
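If you also want the words immediately before and after each tagged entity, here is a minimal sketch using the entity's token offsets; the example sentence is the one from the question, and the boundary checks guard against entities at the very start or end:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("I was working as a marketing executive for Bank of India, a 4 months..")

for ent in doc.ents:
    if ent.label_ == 'ORG':
        before = doc[ent.start - 1].text if ent.start > 0 else ""   # word before the entity
        after = doc[ent.end].text if ent.end < len(doc) else ""     # word after the entity
        print(before, "[" + ent.text + "]", after)                  # e.g. for [Bank of India] ,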

Creating a function to count the number of pos in a pandas instance

I've used NLTK to pos_tag sentences in a pandas dataframe from an old Yelp competition. This returns a list of tuples (word, POS). I'd like to count the number of parts of speech for each instance. How would I, say, create a function to count the number of being verbs in each review? I know how to apply functions to features - no problem there. I just can't wrap my head around how to count things inside tuples inside lists inside a pd feature.
The head is here, as a tsv: https://pastebin.com/FnnBq9rf
Thank you @zhangyulin for your help. After two days, I learned some incredibly important things (as a novice programmer!). Here's the solution!
def NounCounter(x):
    nouns = []
    for (word, pos) in x:
        if pos.startswith("NN"):
            nouns.append(word)
    return nouns

df["nouns"] = df["pos_tag"].apply(NounCounter)
df["noun_count"] = df["nouns"].str.len()
As an example, for a dataframe df, the noun count of the "reviews" column can be saved to a new column "noun_count" using this code.
from nltk import pos_tag, word_tokenize

def NounCount(x):
    nounCount = sum(1 for word, pos in pos_tag(word_tokenize(x)) if pos.startswith('NN'))
    return nounCount

df["noun_count"] = df["reviews"].apply(NounCount)
df.to_csv('./dataset.csv')
There are a number of ways you can do this, and one very straightforward way is to map the list (or pandas Series) of tuples to an indicator of whether the word is a verb, and count the number of 1's you get.
Assume you have something like this (please correct me if it's not, as you didn't provide an example):
a = pd.Series([("run", "verb"), ("apple", "noun"), ("play", "verb")])
You can do something like this to map the Series and sum the count:
a.map(lambda x: 1 if x[1] == "verb" else 0).sum()
This returns 2.
I grabbed a sentence from the link you shared:
text = nltk.word_tokenize("My wife took me here on my birthday for breakfast and it was excellent.")
tag = nltk.pos_tag(text)
a = pd.Series(tag)
a.map(lambda x: 1 if x[1] == "VBD" else 0).sum()
# this returns 2

How to find all offset positions of a term using Apache Lucene

I am trying to find all offset positions of a given term. For instance, I have the input "dog cat orange dog green dog" and I would like to find the offsets for the term "dog". The result would be: 0, 15, 25.
Terms terms = indexReader.getTermVector(0, "text");
TermsEnum iterator = terms.iterator();
BytesRef byteRef = null;
while ((byteRef = iterator.next()) != null) {
    String term = byteRef.utf8ToString(); // here I find the term text
    /* Here I only know about the term frequency and the first offset (0) for the given term, not all of them */
}
Let's say I have a term that occurred 3 times while indexing, as above. I would like to get an array containing all offsets for the term's occurrences.
Right now I am only getting one offset for each term. How can I gather more information? I would be grateful for any help.
EDIT:
FieldType fieldType = new FieldType();
fieldType.setTokenized(true);
fieldType.setStoreTermVectors(true);
fieldType.setStoreTermVectorPositions(true);
fieldType.setStoreTermVectorOffsets(true);
fieldType.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);

Checking errors in my program

I'm trying to make some changes to my dictionary counter in Python, but I'm not making any progress so far. I want my code to show the number of different words.
This is what I have so far:
# import sys module in order to access command line arguments later
import sys
# create an empty dictionary
dicWordCount = {}
# read all words from the file and put them into
# 'dicWordCount' one by one,
# then count the occurrence of each word
You can use the Counter class from the collections module:
from collections import Counter
q = Counter(fileSource.read().split())
total = sum(q.values())
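Since a Counter is itself a dictionary mapping each word to its count, the number of different words (which is what the question asks for) is just its length; a small sketch continuing from q above:
different_words = len(q)     # number of distinct words
print(q.most_common(10))     # the ten most frequent words and their counts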
First, for your first problem, add a variable for the total word count and one for the different words, so wordCount = 0 and differentWords = 0. In the loop that reads the file, put wordCount += 1 at the top, and inside your first if statement put differentWords += 1. You can print these variables at the end of the program.
For the second problem, add the condition if len(strKey) > 4: before printing.
If you want full example code, here it is.
import sys

fileSource = open(sys.argv[1], "rt")
dicWordCount = {}
wordCount = 0
differentWords = 0
for strWord in fileSource.read().split():
    wordCount += 1
    if strWord not in dicWordCount:
        dicWordCount[strWord] = 1
        differentWords += 1
    else:
        dicWordCount[strWord] += 1
for strKey in sorted(dicWordCount, key=dicWordCount.get, reverse=True):
    if len(strKey) > 4:  # if the word's length is greater than four
        print(strKey, dicWordCount[strKey])
print("Total words: %s\nDifferent words: %s" % (wordCount, differentWords))
For your first question, you can use a set to count the number of different words (assuming words are separated by single spaces):
text = 'apple boy cat dog elephant fox'
different_word_count = len(set(text.split(' ')))
For your second question, using a dictionary to record the word counts is fine.
How about this?
# gives the unique word count
unique_words = len(dicWordCount)
total_words = 0
for k, v in dicWordCount.items():
    total_words += v
# gives the total word count
print(total_words)
You don't need a separate variable for the word count, since you're already using a dictionary; to get the total number of words, just sum the dictionary's values (which are the counts).
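As a compact sketch of that last point, using the same dicWordCount dictionary:
total_words = sum(dicWordCount.values())   # total word count
unique_words = len(dicWordCount)           # number of different words
print("Total words: %s\nDifferent words: %s" % (total_words, unique_words))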