How to avoid double-extraction of patterns in SpaCy? - dataframe

I'm using an incident database to identify the causes of accidents. I have defined a pattern and a function to extract the matching spans, but sometimes this function produces overlapping results. I saw in a previous post that spacy.util.filter_spans(spans) can be used to avoid the repeated matches, but I don't know how to rewrite my function with it. I will be grateful for any help you can provide.
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")  # assuming a model with a dependency parser is loaded

pattern111 = [{'DEP': 'compound', 'OP': '?'}, {'DEP': 'nsubj'}]

def get_relation111(x):
    doc = nlp(x)
    matcher = Matcher(nlp.vocab)
    relation = []
    matcher.add("matching_111", [pattern111], on_match=None)
    matches = matcher(doc)
    for match_id, start, end in matches:
        matched_span = doc[start:end]
        relation.append(matched_span.text)
    return relation

filter_spans can be used on any list of spans. This is a little weird because you want a list of strings, but you can work around it by saving a list of spans first and only converting to strings after you've filtered.
from spacy.util import filter_spans

def get_relation111(x):
    doc = nlp(x)
    matcher = Matcher(nlp.vocab)
    relation = []
    matcher.add("matching_111", [pattern111], on_match=None)
    matches = matcher(doc)
    for match_id, start, end in matches:
        matched_span = doc[start:end]
        relation.append(matched_span)  # keep Span objects so they can be filtered
    # Just add this line: drop overlapping spans, then convert to text
    relation = [ss.text for ss in filter_spans(relation)]
    return relation
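If the incident descriptions live in a pandas dataframe (as the title suggests), the filtered function can then be applied column-wise. The dataframe and the "description" column name below are assumptions for illustration:

import pandas as pd

df = pd.DataFrame({"description": ["The brake failure caused the crash.",
                                   "Operator fatigue led to the spill."]})
# Each cell gets the list of non-overlapping matched subject phrases.
df["causes"] = df["description"].apply(get_relation111)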

Related

Workaround: Google Sheets API does not accept a range request without specifying the final line

My spreadsheet has values in this layout, and I need to create a list in Python that includes the empty fields that exist between values:
CLIENT_SECRET_FILE = 'client_secrets.json'
API_NAME = 'sheets'
API_VERSION = 'v4'
SCOPES = ['https://www.googleapis.com/auth/spreadsheets']

service = Create_Service(CLIENT_SECRET_FILE, API_NAME, API_VERSION, SCOPES)

spreadsheet_id = sheet_id
get_page_id = 'Winning_Margin'
range_score = 'O1:O10000'
spreadsheets_match_score = []

range_names2 = get_page_id + '!' + range_score
result2 = service.spreadsheets().values().get(
    spreadsheetId=spreadsheet_id, range=range_names2,
    valueRenderOption='UNFORMATTED_VALUE').execute()
sheet_output_data2 = result2["values"]

for i, eventao2 in enumerate(sheet_output_data2):
    try:
        spreadsheets_match_score.append(sheet_output_data2[i][0])
    except IndexError:  # empty cell: the inner list has no element
        spreadsheets_match_score.append('')
In this case, this list (spreadsheets_match_score = []) would result in:
["0-0","0-0","4-0","0-1","6-0","","","","0-3","2-2","","","","","0-1","","","3-0","1-1","3-1","","","",""]
My spreadsheet currently has 24 rows, but it will grow without a fixed ending value.
So I tried to use the range without the last-row value (range_score = 'O1:O'), but the request wasn't accepted; it seemed the range needed an explicit final line (range_score = 'O1:O10000').
I chose 10000 precisely so I wouldn't have to change it, but this feels wrong, because it queries a range far larger than actually exists, and I'm afraid it will eventually cause an error.
Is there any way to avoid specifying the last row of the worksheet?
To be something like:
range_score = 'O1:O'
The problem is not in how the range is specified: you can use either range_score = 'O1:O' or range_score = 'O1:O100000000000' if you want all the rows of the column.
In my case, the problem was that when the desired column had no values, the response came back without a ["values"] entry, and it was that empty return that caused the failure, not the open-ended range.
In short, I was looking for the error in the wrong place.
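A minimal sketch of that conclusion, reusing the service, spreadsheet_id, and get_page_id variables from the question: the open-ended range works, and the only extra step is guarding against the "values" key being absent when the column is empty.

range_names2 = get_page_id + '!O1:O'
result2 = service.spreadsheets().values().get(
    spreadsheetId=spreadsheet_id, range=range_names2,
    valueRenderOption='UNFORMATTED_VALUE').execute()
# The API omits "values" entirely when the requested range is empty, so default to [].
sheet_output_data2 = result2.get("values", [])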

Printing the remainder of a sentence after using spacy matcher to find the start of a target sentence

I am looking to take the aim out of a scientific journal abstract and am using spacy. I have a screenshot of the abstract and have run pytesseract on the image. I have tokenized the text into sentences with:
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
#Tokenize into sentences
sents = nlp.create_pipe("sentencizer")
nlp.add_pipe(sents)
[sent.text for sent in doc.sents]
Which seems to work quite well and gives me a list of sentences. I then made a rule based matcher that I believe matches the part of a sentence preceding the aim of the study:
# Rule-based matching for AIM
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "PART"}, {"POS": "VERB"}, {"POS": "DET"}, {"POS": "NOUN"}]
matcher.add('Aim', None, pattern)
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]
    span = doc[start:end]
    print(span.text)
The matcher prints the target part of the sentence, so I know the matcher works (at least well enough for now and I can improve later). What I want to do now is run the matcher on each sentence and if it matches, print the sentence. I tried:
matches = matcher(doc.sents)
if matches:
    print(sent.text)
But it returns: TypeError: Argument 'doc' has incorrect type (expected spacy.tokens.doc.Doc, got generator)
For anyone interested I solved this by changing:
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]
    span = doc[start:end]
    print(span.text)
To:
matches = matcher(doc)
for match_id, start, end in matches:
    string_id = nlp.vocab.strings[match_id]
    span = doc[start:end]
    sents = span.sent  # ADDED THIS LINE
    print(sents)
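If several matches fall inside the same sentence, span.sent will print that sentence once per match. A small hedged variation (assuming the matcher and doc from above) that prints each matched sentence only once:

seen = set()
for match_id, start, end in matcher(doc):
    sent = doc[start:end].sent     # the sentence containing this match
    if sent.start not in seen:     # de-duplicate by the sentence's start token
        seen.add(sent.start)
        print(sent.text)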

How do I reverse each value in a column bit wise for a hex number?

I have a dataframe which has a column called hexa which has hex values like this. They are of dtype object.
hexa
0 00802259AA8D6204
1 00802259AA7F4504
2 00802259AA8D5A04
I would like to remove the first and last bytes and reverse the order of the remaining bytes, as follows:
hexa-rev
0 628DAA592280
1 457FAA592280
2 5A8DAA592280
Please help
I'll show you the complete solution up here and then explain its parts below:
def reverse_bits(bits):
    trimmed_bits = bits[2:-2]
    list_of_bits = [i+j for i, j in zip(trimmed_bits[::2], trimmed_bits[1::2])]
    reversed_bits = [list_of_bits[-i] for i in range(1, len(list_of_bits)+1)]
    return ''.join(reversed_bits)

df['hexa-rev'] = df['hexa'].apply(lambda x: reverse_bits(x))
There are possibly a couple ways of doing it, but this way should solve your problem. The general strategy will be defining a function and then using the apply() method to apply it to all values in the column. It should look something like this:
df['hexa-rev'] = df['hexa'].apply(lambda x: reverse_bits(x))
Now we need to define the function we're going to apply. Breaking it down into its parts, we first strip the leading and trailing byte (two hex characters each) by slicing. Because of how negative indexes work, this removes the first and last byte regardless of the string's length. The result is a string of characters that we will pair up, reverse, and join back together.
def reverse_bits(bits):
    trimmed_bits = bits[2:-2]
The next line walks through the trimmed string two characters at a time, pairing the first and second hex digit of each byte and concatenating them into a two-character string representing that byte.
def reverse_bits(bits):
    trimmed_bits = bits[2:-2]
    list_of_bits = [i+j for i, j in zip(trimmed_bits[::2], trimmed_bits[1::2])]
The second-to-last line builds the same list in reverse order, and the final line joins it back into a single hex string.
def reverse_bits(bits):
    trimmed_bits = bits[2:-2]
    list_of_bits = [i+j for i, j in zip(trimmed_bits[::2], trimmed_bits[1::2])]
    reversed_bits = [list_of_bits[-i] for i in range(1, len(list_of_bits)+1)]
    return ''.join(reversed_bits)
I explained it in reverse order, but the idea is: define the function you want applied to the column, then use apply() to run it over every value.
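As a side note, the standard library can do the byte handling directly. A hedged alternative sketch (reverse_bytes is a name made up here, not part of the original answer):

def reverse_bytes(hex_str):
    raw = bytes.fromhex(hex_str)          # e.g. bytes 00 80 22 59 AA 8D 62 04
    return raw[1:-1][::-1].hex().upper()  # drop first/last byte, reverse, re-encode

df['hexa-rev'] = df['hexa'].apply(reverse_bytes)  # '00802259AA8D6204' -> '628DAA592280'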

Creating a function to count the number of pos in a pandas instance

I've used NLTK to pos_tag sentences in a pandas dataframe from an old Yelp competition. This returns a list of tuples (word, POS). I'd like to count the number of parts of speech for each instance. How would I, say, create a function to count the number of being verbs in each review? I know how to apply functions to features - no problem there. I just can't wrap my head around how to count things inside tuples inside lists inside a pd feature.
The head is here, as a tsv: https://pastebin.com/FnnBq9rf
Thank you @zhangyulin for your help. After two days, I learned some incredibly important things (as a novice programmer!). Here's the solution!
def NounCounter(x):
    nouns = []
    for (word, pos) in x:
        if pos.startswith("NN"):
            nouns.append(word)
    return nouns

df["nouns"] = df["pos_tag"].apply(NounCounter)
df["noun_count"] = df["nouns"].str.len()
As an example, for dataframe df, noun count of the column "reviews" can be saved to a new column "noun_count" using this code.
from nltk import pos_tag, word_tokenize

def NounCount(x):
    nounCount = sum(1 for word, pos in pos_tag(word_tokenize(x)) if pos.startswith('NN'))
    return nounCount

df["noun_count"] = df["reviews"].apply(NounCount)
df.to_csv('./dataset.csv')
There are a number of ways you can do that, and one very straightforward way is to map the list (or pandas Series) of tuples to an indicator of whether the word is a verb, and then count the number of 1's you have.
Assume you have something like this (please correct me if it's not, as you didn't provide an example):
a = pd.Series([("run", "verb"), ("apple", "noun"), ("play", "verb")])
You can do something like this to map the Series and sum the count:
a.map(lambda x: 1 if x[1] == "verb" else 0).sum()
This will return you 2.
I grabbed a sentence from the link you shared:
text = nltk.word_tokenize("My wife took me here on my birthday for breakfast and it was excellent.")
tag = nltk.pos_tag(text)
a = pd.Series(tag)
a.map(lambda x: 1 if x[1] == "VBD" else 0).sum()
# this returns 2 ("took" and "was" are both tagged VBD)

Apply function with pandas dataframe - POS tagger computation time

I'm very confused about the apply function in pandas. I have a big dataframe where one column is a column of strings, and I'm using a function to count part-of-speech occurrences. I'm just not sure how to set up my apply statement or my function.
def noun_count(row):
    x = tagger(df['string'][row].split())
    # array flattening and filtering out all but nouns, then summing them
    return num
So basically I have a function similar to the above where I use a POS tagger on a column that outputs a single number (number of nouns). I may possibly rewrite it to output multiple numbers for different parts of speech, but I can't wrap my head around apply.
I'm pretty sure I don't really have either part arranged correctly. For instance, I can run noun_count(row) and get the correct value for any index, but I can't figure out how to make it work with apply as I have it set up. Basically I don't know how to pass the row value to the function within the apply statement.
df['num_nouns'] = df.apply(noun_count(??),1)
Sorry this question is all over the place. So what can I do to get a simple result like
       string  num_nouns
0       'cat'          1
1  'two cats'          1
EDIT:
So I've managed to get something working by using list comprehension (someone posted an answer, but they've deleted it).
df['string'].apply(lambda row: noun_count(row),1)
which required an adjustment to my function:
from collections import Counter

def tagger_nouns(x):
    list_of_lists = st.tag(x.split())
    flattened = [y for z in list_of_lists for y in z]
    Parts_of_speech = [row[1] for row in flattened]
    c = Counter(Parts_of_speech)
    nouns = c['NN'] + c['NNS'] + c['NNP'] + c['NNPS']
    return nouns
I'm using the Stanford tagger (the left-3-words model), but I have a big problem with computation time. I'm noticing that it calls the .jar file again and again (Java keeps opening and closing in the task manager); maybe that's unavoidable, but it's taking far too long to run. Any way I can speed it up?
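One likely speed-up, sketched here as a suggestion rather than a tested fix: NLTK's Stanford tagger also exposes tag_sents(), which starts the JVM once for a whole batch instead of once per row. Assuming st and df['string'] are as in the code above:

# Tag every row in a single call, then count nouns per row.
all_tags = st.tag_sents([row.split() for row in df['string']])
df['num_nouns'] = [
    sum(1 for _, pos in tags if pos in ('NN', 'NNS', 'NNP', 'NNPS'))
    for tags in all_tags
]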
I don't know what 'tagger' is but here's a simple example with a word count that ought to work more or less the same way:
f = lambda x: len(x.split())
df['num_words'] = df['string'].apply(f)

       string  num_words
0       'cat'          1
1  'two cats'          2
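Following that same pattern for the original noun-count case, a hedged sketch (it assumes tagger returns (word, POS) tuples for a list of tokens, as in the question; the startswith('NN') check is an assumption for illustration):

def noun_count(text):
    tagged = tagger(text.split())
    return sum(1 for word, pos in tagged if pos.startswith('NN'))

# apply passes each cell's string value, not the row index
df['num_nouns'] = df['string'].apply(noun_count)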