I would like to build a keyword list from tokens with a lookup back to the sentence where they came from, thanks
You can get the sentences from token.doc.sents and then find the one that contains your token, i.e. whose span covers the token's index. You can make this more convenient by adding an extension attribute to Token, like this:
import spacy
from spacy.tokens import Token

def get_sentence(token):
    # Return the sentence span whose boundaries contain this token's index.
    for sent in token.doc.sents:
        if sent.start <= token.i < sent.end:
            return sent

# Add a computed property, which will be accessible as token._.sent
Token.set_extension('sent', getter=get_sentence)

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Sentence one. Sentence two.')
print(list(doc.sents))
print(doc[0]._.sent)
print(doc[-1]._.sent)
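To get back to the keyword-list part of the question, a minimal sketch built on the extension above could look like this; the filter for what counts as a keyword (non-stopword nouns) is only an illustrative assumption:

# Rough sketch: collect keyword tokens and remember the sentence they came from.
# The "keyword" filter (non-stopword nouns) is an illustrative assumption only.
keywords = {}
doc = nlp(u'Sentence one mentions apples. Sentence two mentions pears.')
for token in doc:
    if token.pos_ == 'NOUN' and not token.is_stop:
        keywords.setdefault(token.lemma_, []).append(token._.sent.text)
for keyword, sentences in keywords.items():
    print(keyword, '->', sentences)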
I am looking to reduce a word to its base form without using contextual information. I tried out spaCy, but that requires running nlp just to get the base form of a single word, and that comes with an increase in execution time.
I have gone through this post, where disabling the parser and NER pipeline components speeds up the execution time to some extent, but I just want a way to look a word up directly in a table of words and their lemma forms (basically the base form of a word without considering contextual information).
my_list = ["doing", "done", "did", "do"]
for my_word in my_list:
doc = nlp(my_word, disable=['parser', 'ner'])
for w in doc:
print("my_word {}, base_form {}".format(w, w.lemma_))
Desired output:
my_word doing, base_form do
my_word done, base_form do
my_word did, base_form do
my_word do, base_form do
Note: I also tried spacy.lemmatizer, but that is not giving the expected results and requires the POS as an additional argument.
If you just want lemmas from a lookup table, you can install the lookup tables and initialize a very basic pipeline that only includes a tokenizer. If the lookup tables are installed, token.lemma_ will look up the form in the table.
Install the lookup tables (which are otherwise only saved in the provided models and not included in the main spacy package to save space):
pip install spacy[lookups]
Tokenize and lemmatize:
import spacy
nlp = spacy.blank("en")
assert nlp("doing")[0].lemma_ == "do"
assert nlp("done")[0].lemma_ == "do"
spaCy's lookup tables are available in this repository:
https://github.com/explosion/spacy-lookups-data
There you can read the documentation and check the examples that might help you.
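As a follow-up, here is a small sketch of the original loop rewritten with the blank pipeline; it assumes the lookup tables from spacy[lookups] are installed so that lemmas come from the table rather than from a tagger:

import spacy

nlp = spacy.blank("en")  # tokenizer only; lemmas fall back to the lookup tables

my_list = ["doing", "done", "did", "do"]
for my_word in my_list:
    for w in nlp(my_word):
        print("my_word {}, base_form {}".format(w, w.lemma_))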
I'm using spaCy v2 with the French model fr_core_news_sm.
Unfortunately this model produces many parsing errors, so I would like to preprocess the text in order to improve the output.
Here is an example: the interjection/adverb carrément is analyzed as the third person plural of the (imaginary) verb carrémer. I don't mind the wrong POS tag as such, but it does spoil the dependency parse. Therefore I would like to replace carrément with some other adverb (like souvent) or interjection that I know spaCy will parse correctly.
For that I need to be able to add a comment saying that a replacement has taken place, something like souvent /*orig=carrément*/, so that souvent will be parsed by spaCy but NOT /*orig=carrément*/, which should have no effect on the dependency parse.
Is this possible?
Is there some other way to tell spaCy "carrément is NOT a verb but an interjection, please treat it as such", without recompiling the model?
(I know this is possible in TreeTagger, where you can add a configuration file with POS tags for any word you want… but of course TreeTagger is not a dependency parser.)
You can use custom extensions to save this kind of information on each token:
https://spacy.io/usage/processing-pipelines#custom-components-attributes
from spacy.tokens import Token

Token.set_extension("orig", default="")

# `doc` is a Doc you have already processed with nlp(); mark the substituted token
doc[1]._.orig = "carrément"
Here's an example from the Token docs:
import spacy
from spacy.tokens import Token

nlp = spacy.load("en_core_web_sm")
fruit_getter = lambda token: token.text in ("apple", "pear", "banana")
Token.set_extension("is_fruit", getter=fruit_getter)
doc = nlp("I have an apple")
assert doc[3]._.is_fruit
(The tagger and the dependency parser are completely independent, so changing the POS of a word won't affect how it gets parsed.)
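Putting the pieces together, a rough sketch of the substitution idea from the question might look like the following; the example sentence, the replacement mapping and the whitespace-based preprocessing are assumptions for illustration, not a recommendation:

import spacy
from spacy.tokens import Token

Token.set_extension("orig", default="", force=True)

nlp = spacy.load("fr_core_news_sm")

# Hypothetical preprocessing: swap the problematic word for one the model handles
# better, and remember where the substitution happened.
replacements = {"carrément": "souvent"}  # illustrative mapping
original_text = "Il a carrément refusé."
substituted = [replacements.get(w, w) for w in original_text.split()]
doc = nlp(" ".join(substituted))

for token in doc:
    if token.text in replacements.values():
        # Naive: assumes the replacement word only occurs where a substitution was made.
        token._.orig = next(k for k, v in replacements.items() if v == token.text)

print([(t.text, t._.orig, t.pos_, t.dep_) for t in doc])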
I want to create a spaCy Doc, given that I have the raw text and the words, but am missing the whitespace data.
from spacy.tokens import Doc
doc = Doc(nlp.vocab, words=words, spaces=spaces)
How do I do this correctly so that the whitespace information is not lost?
Example of the data I have:
data= {'text': 'This is just a test sample.', 'words': ['This', 'is', 'just', 'a', 'test', 'sample', '.']}
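One way to recover the missing spaces flags is to align each word against the raw text and check whether a space follows it. A minimal sketch, assuming the words occur in the text in order as in the example data:

import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

data = {'text': 'This is just a test sample.',
        'words': ['This', 'is', 'just', 'a', 'test', 'sample', '.']}

# Walk the raw text and record whether each word is followed by a space.
spaces = []
pos = 0
for word in data['words']:
    pos = data['text'].index(word, pos) + len(word)
    spaces.append(pos < len(data['text']) and data['text'][pos] == ' ')

doc = Doc(nlp.vocab, words=data['words'], spaces=spaces)
assert doc.text == data['text']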
Based on our discussion in the comments, I would suggest doing either of the following:
Preferred route:
Substitute into the spaCy pipeline those elements you want to improve. If you don't trust the POS tagger for some reason, substitute in a custom parser that is more fit for purpose. Optionally, you can train the existing POS tagger model with your own annotated data using a tool like Prodigy.
Quick and dirty route:
Load the document as plain text into a spaCy Doc.
Loop over the tokens as spaCy parsed them and match them to your own list of tokens by checking whether all the characters match.
If you don't get matches, handle the exceptions as input for a better tokenizer / check why your tokenizer is doing things differently.
If you do get a match, load your additional information as extension attributes (https://spacy.io/usage/processing-pipelines#custom-components-attributes).
Use these extra attributes in further loops to check whether they match the spaCy parser, and output the eventual training dataset. (A rough sketch of this matching follows below.)
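A rough sketch of that quick-and-dirty matching, assuming your own tokens come as a list of strings aligned one-to-one with spaCy's tokenization; the label list and the attribute name are purely illustrative:

import spacy
from spacy.tokens import Token

# Hypothetical extra annotation to attach to matching tokens.
Token.set_extension("my_label", default=None, force=True)

nlp = spacy.blank("en")
text = "This is just a test sample."
my_tokens = ["This", "is", "just", "a", "test", "sample", "."]      # your own tokenization
my_labels = ["DET", "AUX", "ADV", "DET", "NOUN", "NOUN", "PUNCT"]   # illustrative labels

doc = nlp(text)
unmatched = []
for token in doc:
    if token.i < len(my_tokens) and token.text == my_tokens[token.i]:
        token._.my_label = my_labels[token.i]
    else:
        # Tokenization mismatch: record it and investigate the difference.
        unmatched.append(token.text)

print([(t.text, t._.my_label) for t in doc])
print("unmatched:", unmatched)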
With a Scrapy spider, would the correct method for automatically entering a zip code value "27517" into the entry box on this website (Locations of Junkyards) be to use a FormRequest?
Here is what I have right now:
import scrapy
from scrapy.http import FormRequest
from scrapy.item import Item, Field
from scrapy.spider import BaseSpider

class LkqSpider(scrapy.Spider):
    name = "lkq"
    allowed_domains = ["http://www.lkqcorp.com/en-us/locationResults/"]
    start_urls = ['http://www.lkqcorp.com/en-us/locationResults/']

    def start_requests(self):
        return [FormRequest("http://www.lkqcorp.com/en-us/locationResults/",
                            formdata={'dnnVariable': '27517'},
                            callback=self.parse)]

    def parsel(self):
        print self.status
It doesn't do anything when run, though. Is FormRequest mainly for completing login fields? What would be the best way to get to THIS page (the one that comes up after searching for the zip 27517, and where I would start scraping my desired information with a Scrapy spider)?
This isn't really a FormRequest case; FormRequest is just Scrapy's name for a POST request, and while it helps you fill in a form, a form submission is normally just a POST request anyway.
You need a debugging console (I prefer Firebug for Firefox) to check which requests are being made. Here it looks like it is a GET request and quite simple to replicate: the URL would be something like this, where you'll have to change the number after /fullcrit/ to the desired zip code. You also need the lat and lng arguments; for those you could use the Google Maps API. Check this answer for an example of how to get them, but to summarise, just make that Request and read the location argument.
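A minimal sketch of issuing such a GET request from a spider; the URL below is only a placeholder, since the real host, /fullcrit/ path and lat/lng values have to be copied from the request observed in the browser's debugging console:

import scrapy

class LkqLocationSpider(scrapy.Spider):
    name = "lkq_locations"

    def start_requests(self):
        # Placeholder URL: replace host, path and lat/lng with the values seen
        # in the browser's network console for the zip-code search.
        zip_code = "27517"
        url = ("http://example.com/fullcrit/{zip}?lat={lat}&lng={lng}"
               .format(zip=zip_code, lat="35.9", lng="-79.0"))
        yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        self.logger.info("Got %d bytes from %s", len(response.body), response.url)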
I have decided to save the templates of all system emails in the DB.
The bodies of these emails are normal Django templates (with tags).
This means I need the template engine to load the template from a string and not from a file. Is there a way to accomplish this?
Instantiate a django.template.Template(), passing the string to use as a template.
To complement the answer from Ignacio Vazquez-Abrams, here is the code snippet that I use to get a template object from a string:
from django.template import engines, TemplateSyntaxError

def template_from_string(template_string, using=None):
    """
    Convert a string into a template object,
    using a given template engine or using the default backends
    from settings.TEMPLATES if no engine was specified.
    """
    # This function is based on django.template.loader.get_template,
    # but uses Engine.from_string instead of Engine.get_template.
    chain = []
    engine_list = engines.all() if using is None else [engines[using]]
    for engine in engine_list:
        try:
            return engine.from_string(template_string)
        except TemplateSyntaxError as e:
            chain.append(e)
    raise TemplateSyntaxError(template_string, chain=chain)
The engine.from_string method will instantiate a django.template.Template object with template_string as its first argument, using the first backend from settings.TEMPLATES that does not result in an error.
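A short usage sketch of the helper above (it requires configured Django settings, and the template string and context are just illustrative):

template = template_from_string("Hello {{ name }}!")
rendered = template.render({"name": "World"})
assert rendered == "Hello World!"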
Using Django's Template together with Context worked for me on Django >= 3:
from django.template import Template, Context
template = Template('Hello {{name}}.')
context = Context(dict(name='World'))
rendered: str = template.render(context)