I am looking to reduce a word to its base form without using contextual information. I tried out spaCy, but that requires running the nlp pipeline just to get the base form of a single word, which comes with an increase in execution time.
I have gone through a post where disabling the parser and NER pipeline components speeds up the execution time to some extent, but I just want a way to directly look up a word and its lemma (basically the base form of the word without considering contextual information).
my_list = ["doing", "done", "did", "do"]
for my_word in my_list:
    doc = nlp(my_word, disable=['parser', 'ner'])
    for w in doc:
        print("my_word {}, base_form {}".format(w, w.lemma_))
Desired output:
my_word doing, base_form do
my_word done, base_form do
my_word did, base_form do
my_word do, base_form do
Note: I also tried out spacy.lemmatizer, but that does not give the expected results and requires the POS as an additional argument.
If you just want lemmas from a lookup table, you can install the lookup tables and initialize a very basic pipeline that only includes a tokenizer. If the lookup tables are installed, token.lemma_ will look up the form in the table.
Install the lookup tables (which are otherwise only saved in the provided models and not included in the main spacy package to save space):
pip install spacy[lookups]
Tokenize and lemmatize:
import spacy
nlp = spacy.blank("en")
assert nlp("doing")[0].lemma_ == "do"
assert nlp("done")[0].lemma_ == "do"
Spacy's lookup tables are available in this repository:
https://github.com/explosion/spacy-lookups-data
There you can read the documentation and check the examples that might help you.
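To apply this to the original word list, here is a minimal sketch building on the answer above (it assumes the lookups extra has been installed as shown):

import spacy

# Blank pipeline: only the tokenizer runs; lemmas come from the lookup tables.
nlp = spacy.blank("en")

my_list = ["doing", "done", "did", "do"]
for my_word in my_list:
    token = nlp(my_word)[0]
    print("my_word {}, base_form {}".format(my_word, token.lemma_))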
I have been searching for this online for a while, but no answer resolves my confusion.
There is a stair generator whose modifier exposes several parameters.
(Example snapshot of the modifier parameters)
I want to change certain parameters such as
Dict["stair Count"] = 20
and somehow regenerate the model and output a ply/glb file.
In order to do so, I want to get the list of parameters that the modifier is showing, but I don't know how to do that in Python.
import bpy

ctx = bpy.context
obj = ctx.object
nodes = obj.modifiers["STAIR GENERATOR"]
list(nodes.keys())    # property keys stored on the modifier
list(nodes.values())  # the corresponding values
I can see a list of the parameter values, but the keys seem quite wrong compared to the GUI. So I have two questions:
How do I get the full list of parameter labels from Python, just like in the GUI? (e.g. “Switch”, “stair count”, “stair height”, …)
How do I apply the change from a script and update the stairs?
Thank you in advance
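For reference, here is a rough sketch of one possible way to do both, assuming the stair generator is a geometry-nodes modifier and assuming Blender 3.x (the label "Stair Count" is illustrative, taken from the question):

import bpy

obj = bpy.context.object
mod = obj.modifiers["STAIR GENERATOR"]

# Blender 3.x: the modifier stores its inputs under identifiers like "Input_2";
# the human-readable labels shown in the GUI live on the node group's input sockets.
label_to_id = {sock.name: sock.identifier for sock in mod.node_group.inputs}
print(label_to_id)

# Change a parameter by its GUI label, then force the object to re-evaluate.
mod[label_to_id["Stair Count"]] = 20
obj.update_tag()
bpy.context.view_layer.update()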
I'm using spaCy v2 with the French module fr_core_news_sm
Unfortunately this model produces many parsing errors so I would like to preprocess the text in order to optimize the output.
Here is an example: the interjection/adverb carrément is analyzed as the 3rd person plural of the (imaginary) verb carrémer. I don't mind the wrong POS tag as such, but it does spoil the dependency parse. Therefore I would like to replace carrément with some other adverb (like souvent) or interjection that I know spaCy will parse correctly.
For that I need to be able to add a comment saying that a replacement has taken place, something like souvent /*orig=carrément*/, so that souvent will be parsed by spaCy but NOT /*orig=carrément*/, which should have no effect on the dependency parse.
Is this possible?
Is there some other way to tell spaCy “carrément is NOT a verb but an interjection, please treat it as such”, without recompiling the model?
(I know this is possible in TreeTagger, where you can add a configuration file with POS tags for any word you want… but of course TreeTagger is not a dependency parser.)
You can use custom extensions to save this kind of information on each token:
https://spacy.io/usage/processing-pipelines#custom-components-attributes
from spacy.tokens import Token

Token.set_extension("orig", default="")

# on a Doc that has already been processed by nlp:
doc[1]._.orig = "carrément"
Here's an example from the Token docs:
from spacy.tokens import Token
fruit_getter = lambda token: token.text in ("apple", "pear", "banana")
Token.set_extension("is_fruit", getter=fruit_getter)
doc = nlp("I have an apple")
assert doc[3]._.is_fruit
(The tagger and the dependency parser are completely independent, so changing the POS of a word won't affect how it gets parsed.)
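Putting the two pieces together for the substitution workflow described in the question, a minimal sketch (the replacement table, the example sentence, and the assumption that fr_core_news_sm is installed are all mine):

import spacy
from spacy.tokens import Token

# Custom attribute that remembers the original surface form of replaced words.
Token.set_extension("orig", default="")

nlp = spacy.load("fr_core_news_sm")  # assumes the model has been downloaded

# Words the model mis-parses, mapped to stand-ins it handles better.
replacements = {"carrément": "souvent"}
reverse = {v: k for k, v in replacements.items()}

text = "Il a carrément exagéré."
preprocessed = " ".join(replacements.get(w, w) for w in text.split())

doc = nlp(preprocessed)
for token in doc:
    if token.text in reverse:      # naive match; good enough for a sketch
        token._.orig = reverse[token.text]

for token in doc:
    print(token.text, token.pos_, token.dep_, token._.orig or "-")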
I'm developing a snakemake pipeline to QC a large set of data. I'd like to generate a set of plots for each dataset and then generate a html report that combines the plots with some text. Looking at https://snakemake.readthedocs.io/en/stable/snakefiles/reporting.html, this will only generate one report for the whole pipeline. Is there a way to do this per dataset?
The functionality to do this is still available via the old way of generating reports with report(), but that is now deprecated. I'm wondering if there is a recommended way to achieve something like the below with the newer report functionality.
from snakemake.utils import report

rule createreport:
    input:
        plot1 = "results/{dataset}/plot1.pdf",
    output:
        report = "reports/{dataset}.html"
    run:
        report("""
        This plot shows etc...

        plot1_
        """, output.report, **input)
I want to create a spaCy Doc given that I have the raw text and the words, but am missing the whitespace data.
from spacy.tokens import Doc
doc = Doc(nlp.vocab, words=words, spaces=spaces)
How do I do this correctly so that the whitespace information is not lost?
Example of the data I have:
data= {'text': 'This is just a test sample.', 'words': ['This', 'is', 'just', 'a', 'test', 'sample', '.']}
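One way to reconstruct the missing spaces flags is to walk through the raw text and check whether each word is followed by a space; here is a minimal sketch (my own suggestion, not part of the answer below):

import spacy
from spacy.tokens import Doc

nlp = spacy.blank("en")

data = {'text': 'This is just a test sample.',
        'words': ['This', 'is', 'just', 'a', 'test', 'sample', '.']}

# For each word, check whether the character right after it in the text is a space.
spaces = []
offset = 0
for word in data['words']:
    offset = data['text'].index(word, offset) + len(word)
    spaces.append(offset < len(data['text']) and data['text'][offset] == ' ')

doc = Doc(nlp.vocab, words=data['words'], spaces=spaces)
assert doc.text == data['text']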
Based on our discussion in the comments, I would suggest doing either of the following:
Preferred route:
Substitute in the spaCy pipeline those elements you want to improve. If you don't trust the POS tagger for some reason, substitute a custom parser that is more fit for purpose. Optionally, you can retrain the existing POS tagger model with your own annotated data using a tool like Prodigy.
Quick and dirty route (sketched in code after this list):
Load the document as plain text into a spaCy Doc.
Loop over the tokens as spaCy parsed them and match them to your own list of tokens by checking whether all the characters match.
If you don't get a match, handle the exception as input for a better tokenizer / check why your tokenizer is doing things differently.
If you do get a match, load your additional information as extension attributes (https://spacy.io/usage/processing-pipelines#custom-components-attributes).
Use these extra attributes in further loops to check whether they match the spaCy parse, and output the eventual training dataset.
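A rough sketch of that quick and dirty route (the model name, the extra attribute my_info and the sample annotations are placeholders of mine):

import spacy
from spacy.tokens import Token

# Hypothetical extra attribute to carry your own annotation per token.
Token.set_extension("my_info", default=None)

nlp = spacy.load("en_core_web_sm")  # assumes this model is installed

text = "This is just a test sample."
my_tokens = ["This", "is", "just", "a", "test", "sample", "."]
my_info = ["info1", "info2", "info3", "info4", "info5", "info6", "info7"]

doc = nlp(text)
# Note: zip() silently stops at the shorter sequence; fine for a sketch.
for token, (my_tok, info) in zip(doc, zip(my_tokens, my_info)):
    if token.text == my_tok:          # all characters match
        token._.my_info = info
    else:
        # mismatch: input for a better tokenizer, or inspect why yours differs
        print(f"tokenization mismatch: spaCy {token.text!r} vs. mine {my_tok!r}")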
I need to store the data presented in the graphs on the Google Ngram website. For example, I want to store the occurrences of "it's" as a percentage from 1800-2008, as presented in the following link: https://books.google.com/ngrams/graph?content=it%27s&year_start=1800&year_end=2008&corpus=0&smoothing=3&share=&direct_url=t1%3B%2Cit%27s%3B%2Cc0.
The data I want is the data you're able to scroll over on the graph. How can I extract this for about 140 different terms (e.g. "it's", "they're", "she's", etc.)?
econpy wrote a nice little module in Python that you can use through a command-line interface.
For your "it's" example, you would need to type this command in a terminal / windows console:
python getngrams.py it's -startYear=1800 -endYear=2008 -corpus=eng_2009 -smoothing=3
This will automatically save the query result in a CSV file named after your query parameters.
econpy's package, from HugoMailhot's answer, no longer works (as of 2021) and seems to be unmaintained.
Here's an updated version, with some improvements for easier integration into Python code:
https://gitlab.com/cpbl/google-ngrams
You can call this from the command line (as in econpy's) to create a CSV file, e.g.
getngrams.py it's -startYear=1800 -endYear=2008 -corpus=eng_2009 -smoothing=3
or call it from Python to get (and plot) the data directly, e.g.:
from getngrams import ngrams
df = ngrams('bells and whistles -startYear=1900 -endYear=2018 -smoothing=2')
df.plot()
The xkcd functionality is still there too.
(Issues / bug fix pull requests /etc welcome there)
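For the ~140 terms in the question, a small sketch that reuses the ngrams() call shown above in a loop (the term list, the extra query options and the file naming are my own assumptions):

from getngrams import ngrams

terms = ["it's", "they're", "she's"]  # ...extend to all ~140 terms

for term in terms:
    df = ngrams(f"{term} -startYear=1800 -endYear=2008 -corpus=eng_2009 -smoothing=3")
    safe = term.replace("'", "")       # e.g. "it's" -> "its" for the filename
    df.to_csv(f"{safe}_1800_2008.csv")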