list available spaCy language objects

I need to list the currently available spaCy language objects in spacy.lang. I have tried dir(spacy.lang) and searching through the various options with no luck. How can I list the available language objects in the module?

Here is the solution that I found, in case it helps anyone else: use spacy.__file__ to get the path to the spaCy package, then list the directories in spacy/lang.
import spacy
from pathlib import Path

# locate the installed spaCy package, then list the subdirectories of spacy/lang
spacy_path = Path(spacy.__file__).parent
spacy_langs = spacy_path / 'lang'
SPACY_LANGS = [x.name for x in spacy_langs.iterdir() if x.is_dir() and x.name != '__pycache__']
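Alternatively, here is a sketch using pkgutil to enumerate the subpackages of spacy.lang directly, which avoids path handling entirely (this assumes each language lives in its own subpackage, which is how spaCy lays out spacy/lang):
import pkgutil
import spacy.lang

# every language is a subpackage of spacy.lang, e.g. spacy/lang/en
SPACY_LANGS = [m.name for m in pkgutil.iter_modules(spacy.lang.__path__) if m.ispkg]
print(SPACY_LANGS)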

Optimize single word base form extraction (lemmatization) in spacy

I am looking to reduce a word to its base form without using contextual information. I tried out spaCy, but that requires running the nlp pipeline to get the base form of a single word, which comes with an increase in execution time.
I have gone through this post, where disabling the parser and NER pipeline components speeds up execution to some extent, but I just want a way to look directly into a database of words and their lemma forms (basically the base form of a word without considering contextual information).
my_list = ["doing", "done", "did", "do"]
for my_word in my_list:
doc = nlp(my_word, disable=['parser', 'ner'])
for w in doc:
print("my_word {}, base_form {}".format(w, w.lemma_))
Desired output:
my_word doing, base_form do
my_word done, base_form do
my_word did, base_form do
my_word do, base_form do
Note: I also tried out spacy.lemmatizer, but that is not giving the expected results and requires POS as an additional argument.
If you just want lemmas from a lookup table, you can install the lookup tables and initialize a very basic pipeline that only includes a tokenizer. If the lookup tables are installed, token.lemma_ will look up the form in the table.
Install the lookup tables (which are otherwise only saved in the provided models and not included in the main spacy package to save space):
pip install spacy[lookups]
Tokenize and lemmatize:
import spacy
nlp = spacy.blank("en")
assert nlp("doing")[0].lemma_ == "do"
assert nlp("done")[0].lemma_ == "do"
spaCy's lookup tables are available in this repository:
https://github.com/explosion/spacy-lookups-data
There you can read the documentation and check the examples that might help you.
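If you want to skip tokenization entirely and query the table yourself, spaCy v2.2+ exposes the lookup tables on the vocab. A minimal sketch, assuming spacy[lookups] is installed and that the English data registers a table named "lemma_lookup":
import spacy

nlp = spacy.blank("en")
# "lemma_lookup" is the table name used by the English lookups data
lemma_table = nlp.vocab.lookups.get_table("lemma_lookup")
print(lemma_table.get("doing", "doing"))  # falls back to the word itself if missing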

How to include comments inside text to be processed by spaCy

I'm using spaCy v2 with the French model fr_core_news_sm.
Unfortunately this model produces many parsing errors, so I would like to preprocess the text in order to improve the output.
Here is an example: the interjection/adverb carrément is analyzed as the third person plural of the (imaginary) verb carrémer. I don't mind the wrong POS tag as such, but it does spoil the dependency parse. Therefore I would like to replace carrément with some other adverb (like souvent) or interjection that I know spaCy will parse correctly.
For that I need to be able to add a comment saying that a replacement has taken place, something like souvent /*orig=carrément*/, so that souvent will be parsed by spaCy but NOT /*orig=carrément*/, which should have no effect on the dependency parse.
Is this possible?
Is there some other way to tell spaCy “carrément is NOT a verb but an interjection, please treat it as such”, without recompiling the model?
(I know this is possible in TreeTagger, where you can add a configuration file with POS tags for any word you want… but of course TreeTagger is not a dependency parser.)
You can use custom extensions to save this kind of information on each token:
https://spacy.io/usage/processing-pipelines#custom-components-attributes
import spacy
from spacy.tokens import Token

Token.set_extension("orig", default="")
nlp = spacy.load("fr_core_news_sm")
doc = nlp("Il a souvent refusé")  # example sentence; "souvent" stands in for the original word
doc[2]._.orig = "carrément"
Here's an example from the Token docs:
import spacy
from spacy.tokens import Token

fruit_getter = lambda token: token.text in ("apple", "pear", "banana")
Token.set_extension("is_fruit", getter=fruit_getter)
nlp = spacy.blank("en")  # a blank pipeline is enough; the getter only needs the token text
doc = nlp("I have an apple")
assert doc[3]._.is_fruit
(The tagger and the dependency parser are completely independent, so changing the POS of a word won't affect how it gets parsed.)
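Putting it together, here is a minimal sketch of the replace-and-annotate workflow (it assumes fr_core_news_sm is installed; the text and the replacement rule are made up for illustration):
import spacy
from spacy.tokens import Token

Token.set_extension("orig", default="", force=True)

nlp = spacy.load("fr_core_news_sm")
text = "Il a carrément refusé"
doc = nlp(text.replace("carrément", "souvent"))
for tok in doc:
    if tok.text == "souvent":
        tok._.orig = "carrément"

# the parse is based on "souvent", but the original form is preserved
print([(t.text, t.dep_, t._.orig) for t in doc])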

How do I link a token with a sentence in Spacy

I would like to build a keyword list from tokens, with a lookup back to the sentence each one came from. Thanks!
You can get the sentences from token.doc.sents and then find the first one that ends after your token. You can make this more convenient by adding an extension attribute to Token, like this:
import spacy
from spacy.tokens import Token

def get_sentence(token):
    # return the first sentence that ends past the token's position
    for sent in token.doc.sents:
        if sent.end > token.i:
            return sent

# Add a computed property, which will be accessible as token._.sent
Token.set_extension('sent', getter=get_sentence)

nlp = spacy.load('en_core_web_sm')
doc = nlp(u'Sentence one. Sentence two.')
print(list(doc.sents))
print(doc[0]._.sent)
print(doc[-1]._.sent)
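From there, a keyword list with a sentence lookup might look like this sketch (using noun tokens as a stand-in for whatever keyword criterion you actually have):
keywords = [(tok.text, tok._.sent.text) for tok in doc if tok.pos_ == "NOUN"]
print(keywords)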

importing a module from input

Forgive a noob. This may be beyond me.
I currently import variables from a module via
from a import *
What I aim to do is import the file as per the input string.
mod=str(input("Select a module: "))
from str(mod) import *
This is what I tried. Clearly wrong. I would like the code to ask for an input, which would be the name of a specific module, then import what the user inputs.
Sorry I can't provide any more code; the nature of the question prevents me from being able to show what I need.
You can simply use __import__():
>>> d = __import__("datetime")
>>> d
<module 'datetime' from 'C:\\Python33\\lib\\datetime.py'>
For more sophisticated importing, I suggest using importlib.
EDIT 1: to make it clearer:
>>> mymodule = __import__(input("Which module you want?" ))
>>> mymodule.var1
If you want var1 instead of mymodule.var1, you could create aliases in the global namespace. However, I wouldn't do that, since I don't see much sense in it.
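For reference, here is a sketch of the importlib route; the globals() trick at the end emulates "from mod import *" and is generally discouraged, and the module name is whatever the user types:
import importlib

mod_name = input("Select a module: ")
mod = importlib.import_module(mod_name)

# emulate "from mod import *" by copying the module's public names into this namespace
globals().update({k: v for k, v in vars(mod).items() if not k.startswith('_')})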

TYPO3 Version 6.0 import from CSV or DB query

I have a CSV file which contains these columns - Timestamp, Author, Title and Content.
Now I would like to import this CSV into TYPO3, so that I can display a list of posts containing these attributes.
If the above is not possible, is there a way to write manual SQL queries, so that I can insert content into TYPO3 myself?
I have tried many extensions for importing CSV (wil_import, rs_impory, external import), but none of them works!
In the following image, I have installed wil_import, but it does not show anything.
Do I need to make any changes anywhere else, like configuration or something?
You could use phpMyAdmin's CSV import functionality. It works reliably.
I had the same problem once, and my day was saved thanks to Francois Suter's (Core team member) extensions: svconnector and svconnector_csv. I can really recommend them.