How can we make a generic inference of taxonomic relations between entities from text? Looking for words near 'type of' in the word2vec vectors of the en_core_web_lg model, they all seem unrelated. The words near 'type', however, are much more similar to it. But how can I use common phrases from my text and apply some generic similarity measure to infer taxonomy from SVO triples etc.? I could do a Sense2Vec-type approach, but I am wondering whether something existing can be used without new training.
Output of the code below:
['eradicate', 'wade', 'equator', 'educated', 'lcd', 'byproducts', 'two', 'propensity', 'rhinos', 'procrastinate']
def get_related(word):
    filtered_words = [w for w in word.vocab if w.is_lower == word.is_lower and w.prob >= -15]
    similarity = sorted(filtered_words, key=lambda w: word.similarity(w), reverse=True)
    return similarity[:10]

print([w.lower_ for w in get_related(nlp.vocab[u'type_of'])])
All the similarities your code retrieves are 0.0, so sorting the list has no effect.
You are treating "type_of" as a word (more accurately, a lexeme), and assuming spaCy will understand it as the phrase "type of". Note that the first has an underscore while the second does not; but even without the underscore, a two-word phrase is not a single lexeme in the model's vocabulary. Since the model has no data on "type_of" from which to compute a similarity score, the score is 0.0 for every word you compare it to.
Instead, you can create a Span of the words "type of" and call similarity() on that. This requires only a small change to your code:
import spacy

def get_related(span):  # this now expects a Span instead of a Lexeme
    filtered_words = [w for w in span.vocab
                      if w.is_lower == span.text.islower()
                      and w.prob >= -15]  # filter by probability and case
    # (use the lowercase words if and only if the whole Span is in lowercase)
    similarity = sorted(filtered_words,
                        key=lambda w: span.similarity(w),
                        reverse=True)  # sort by the similarity of each word to the whole Span
    return similarity[:10]  # return the 10 most similar words

nlp = spacy.load('en_core_web_lg')  # load the model
print([w.lower_ for w in get_related(nlp(u'type')[:])])  # print related words for "type"
print([w.lower_ for w in get_related(nlp(u'type of')[:])])  # print related words for "type of"
Output:
['type', 'types', 'kind', 'sort', 'specific', 'example', 'particular', 'similar', 'different', 'style']
['type', 'of', 'types', 'kind', 'particular', 'sort', 'different', 'such', 'same', 'associated']
As you can see, all the words are related to the input to some degree, and the output is similar but not identical for "type" and "type of".
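A quick aside on why the two lists overlap so much: a Span's vector in spaCy is, by default, the average of its token vectors, so the "type of" span is essentially the mean of the "type" and "of" vectors. A small sketch to check this, reusing the nlp object loaded above (the exact averaging behaviour is an assumption about the default, hence the np.allclose check rather than strict equality):

import numpy as np

span_vec = nlp(u'type of')[:].vector            # vector of the whole Span
token_avg = (nlp.vocab[u'type'].vector +
             nlp.vocab[u'of'].vector) / 2       # average of the two word vectors
print(np.allclose(span_vec, token_avg))         # expected to print True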
TL;DR: How can I iterate through predefined feature subsets as part of a scikit-learn GridSearchCV pipeline?
For a regression task, I have set up a (nested) CV to choose and evaluate models for a given pandas dataframe model_X (all numerical columns, no missing data) and a target pandas series model_y.
My goal is to combine feature selection and hyperparameter tuning. However, for my purpose I do not want to use any of sklearn's feature selection algorithms; instead, I simply want to try different predefined subsets of the available columns and have them tested against each other (and of course in all combinations with the other hyperparameters) in the CV.
For this purpose I have a list of tuples feature_candidates_list, where each tuple contains certain column names from model_X to be used together as features.
To achieve this I am using FunctionTransformer like so:
def SelectFeatures(model_X, feature_set, feature_sets=feature_candidates_list):
    return model_X.loc[:, feature_sets[feature_set]]

CustomFeatureSelector = FunctionTransformer(SelectFeatures, feature_names_out='one-to-one')
And here is how I put all together in a pipeline and param grid (this is a reduced example for only the relevant steps):
PreProcessor = ColumnTransformer([
    ('selector', CustomFeatureSelector, model_X.columns),
    ('scaler', StandardScaler(), make_column_selector(dtype_include=np.number)),
])
pipe = Pipeline(steps=[
    ('preprocessor', PreProcessor),
    ('regressor', DummyRegressor()),  # just a dummy here, as the step can't be empty (the actual regressors are set via regressor_params)
])
preprocessor_params = [
    {
        'preprocessor__selector__kw_args': [{'feature_set': i} for i in range(len(feature_candidates_list))],
        'preprocessor__scaler__with_mean': [True, False],
        'preprocessor__scaler__with_std': [True, False],
    },
]
regressor_params = [
    {
        'regressor': [TweedieRegressor(max_iter=1000)],
        'regressor__power': [0, 1],
        'regressor__alpha': [0, 1],
        'regressor__link': ['log'],
        'regressor__fit_intercept': [True, False],
    },
]
params = [{**dict_pre, **dict_reg} for dict_reg in regressor_params for dict_pre in preprocessor_params]
Finally, to run the model selection and evaluation I use:
scoring = {
    'R2': 'r2',
    'MAPE': 'neg_mean_absolute_percentage_error',
    'MedAE': 'neg_median_absolute_error',
    'MSLE': 'neg_mean_squared_log_error',
}
refit_scorer = 'R2'

with parallel_backend('loky', n_jobs=-1):
    innerCV = GridSearchCV(
        pipe,
        params,
        scoring=scoring,
        refit=refit_scorer,
        cv=10,
        verbose=1,
    )
    outerCV = cross_validate(
        innerCV,
        model_X,
        model_y,
        scoring=scoring,
        cv=10,
        return_estimator=True,
        verbose=1,
    )
I am not sure if this pipeline actually selects the features as intended:
model_X has m columns;
every tuple in feature_candidates_list contains n column names (n < m, of course).
What I did to check on a single outer fold's best estimator is:
outerCV['estimator'][0].best_estimator_.named_steps['regressor'].n_features_in_
which gives me m + n, but I expected n (I also tested this for the other folds).
I think there must be something wrong in how I put together my preprocessor. It seems to take all the original columns of model_X and concatenate them with the chosen set of columns instead of replacing them. When I switch off the scaler, the output of the above is indeed equal to n; however, I still cannot see which features were chosen for a respective estimator, because calling .feature_names_in_ on them raises:
AttributeError: 'TweedieRegressor' object has no attribute 'feature_names_in_'
Maybe the whole way I approach this selection of features in gridsearchcv is not smart and I should go a different route? Any hints welcome!
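For what it's worth, here is a rough, untested sketch of what I mean by "replacing": nesting the selector and scaler sequentially in a Pipeline, so the scaler only ever sees the selector's output, instead of putting them side by side as ColumnTransformer branches (which concatenates their outputs):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Untested guess: the scaler runs *after* the selector, on its output,
# instead of as a sibling branch whose result gets concatenated back in.
PreProcessor = Pipeline(steps=[
    ('selector', CustomFeatureSelector),  # keep only the n candidate columns
    ('scaler', StandardScaler()),         # then scale just those n columns
])

The parameter names used in preprocessor_params ('preprocessor__selector__kw_args', 'preprocessor__scaler__with_mean', ...) would stay the same with this nesting.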
Update:
I switched to the sklearn nightly build (v1.2.dev0), where I can use set_config(transform_output='pandas') to avoid my dataframe being converted to a numpy array by the transformers. This helps to get feature names when calling .feature_names_in_ on one of the estimators, but it only works when just the scaler is active.
When I also activate my custom selector, the fitting fails for all folds. But when I turn set_config(...) off again, it works just like in the stable versions v1.1.2 and v1.1.3, without the ability to get feature names.
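Side note: independent of the feature-name issue, the feature subset a fitted inner search ended up with should be recoverable from its best parameters, since the subset index is itself a grid parameter (sketch, reusing the objects defined above):

# index into feature_candidates_list chosen by the inner search of the first outer fold
chosen = outerCV['estimator'][0].best_params_['preprocessor__selector__kw_args']
print(feature_candidates_list[chosen['feature_set']])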
I am checking which words the spaCy Spanish lemmatizer works on, using the .has_vector method. My dataframe has two columns: one holds the output of the function that indicates which words can be lemmatized, and the other holds the corresponding phrase.
I would like to know how I can extract all the words that have False output to correct them so that I can lemmatize.
So I created the function:
def lemmatizer(text):
    doc = nlp(text)
    return ' '.join([str(word.has_vector) for word in doc])
And applied it to the column of sentences (reviews) in the DataFrame:
df["Vectors"] = df.reviews.apply(lemmatizer)
And put the result into another DataFrame:
df2 = pd.DataFrame(df[['Vectors', 'reviews']])
The output is
index Vectors reviews
1 True True True False 'La pelicula es aburridora'
Two ways to do this:
import pandas
import spacy
nlp = spacy.load('en_core_web_lg')
df = pandas.DataFrame({'reviews': ["aaabbbcccc some example words xxxxyyyz"]})
If you want to use has_vector:
def get_oov1(text):
    return [word.text for word in nlp(text) if not word.has_vector]
Alternatively you can use the is_oov attribute:
def get_oov2(text):
    return [word.text for word in nlp(text) if word.is_oov]
Then as you already did:
df["oov_words1"] = df.reviews.apply(get_oov1)
df["oov_words2"] = df.reviews.apply(get_oov2)
Which will return:
> reviews oov_words1 oov_words2
0 aaabbbcccc some example words xxxxyyyz [aaabbbcccc, xxxxyyyz] [aaabbbcccc, xxxxyyyz]
Note:
When working with either of these approaches it is important to know that they are model dependent: smaller models ship without word vectors backing these attributes, so the attributes will simply return a default value!
That means when you run the exact same code but e.g. with en_core_web_sm you get this:
> reviews oov_words1 oov_words2
0 aaabbbcccc some example words xxxxyyyz [] [aaabbbcccc, some, example, words, xxxxyyyz]
This is because the attributes are not backed by real word vectors in the small model and just fall back to fixed values: in this run has_vector came out as True for every token, while is_oov came out as True for every token as well. So the has_vector check wrongly shows all words as known, and the is_oov check wrongly shows all of them as unknown.
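A small way to guard against this is to check whether the loaded pipeline actually ships with word vectors before trusting either attribute; a sketch using the vocab's vector table (n_keys is the number of entries in it):

import spacy

nlp = spacy.load('en_core_web_sm')
# 0 keys means the model has no real word vectors,
# so has_vector / is_oov results are not meaningful here.
if nlp.vocab.vectors.n_keys == 0:
    print("This model has no word vectors; OOV checks will be unreliable.")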
The code below is an example for analyzing a massive corpus. I want to restrict the term-document matrix to the 1000 most frequent unigrams, but changing the max_features parameter to n only seems to return the first n unigrams. Any suggestions?
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
corpus = ['Hi my name is Joe.', 'Hi my name is Donald.']
vectorizer = TfidfVectorizer(max_features=3)
X = vectorizer.fit_transform(corpus).todense()
df = pd.DataFrame(X, columns=vectorizer.get_feature_names())
df.to_csv('test.csv')
I am assuming this is only a problem in your example, because the sklearn documentation for TfidfVectorizer says the following about max_features:
If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus.
It might be that the first n terms are picked when words have equal frequency, but otherwise it should return the correct result. If it still does not work, I strongly suggest opening a bug report in the sklearn repository. However, you can also construct a vocabulary manually (with your own interpretation of "frequency") by setting the vocabulary option:
vocabulary: Mapping or iterable, default=None
Either a Mapping (e.g., a dict) where keys are terms and values are indices in the feature matrix, or an iterable over terms. If not given, a vocabulary is determined from the input documents.
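For illustration, a minimal sketch of that manual route, assuming "frequency" means the total term count across the corpus (the cutoff of 3 stands in for the 1000 unigrams you actually want):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
import numpy as np

corpus = ['Hi my name is Joe.', 'Hi my name is Donald.']

# rank unigrams by their total count over the whole corpus
counter = CountVectorizer()
counts = counter.fit_transform(corpus)
totals = np.asarray(counts.sum(axis=0)).ravel()
terms = counter.get_feature_names()

top_terms = [terms[i] for i in totals.argsort()[::-1][:3]]  # keep the top 3 unigrams

# pass the hand-built vocabulary to TfidfVectorizer
vectorizer = TfidfVectorizer(vocabulary=top_terms)
X = vectorizer.fit_transform(corpus).todense()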
I have a CSV with both categorical and float dtypes. I want to do the following:
For each categorical column, I will use pandas to compute the unique values (pd.unique()) present in the column, say u_l for a column.
I will use len(u_l) to decide the dimension of the embeddings I use for that particular categorical column (this step is the reason I cannot use tensorflow_transform).
I want to create some stateful node that can map a category (token) value to an embedding index, so that I can subsequently look up the embedding from the embedding matrix I created in step 2.
I currently don't know how to go about doing this. A very inelegant solution I can see is using tensorflow_datasets:
encoder = tfds.features.text.TokenTextEncoder(u_l,decode_token_separator=' ')
Then I would concatenate the entire column into one string c_l using a space delimiter and call encoder.encode(c_l).
This is a very basic thing that I think TensorFlow should be able to do relatively easily. Please guide me to the right solution.
If you want to build embeddings from your word corpus, say you have a corpus like this:
corpus :
"This pasta is good"
"This pasta is very good"
and you want to use embeddings, you can use TF's Tokenizer (see this). It will create a dict containing words as keys and indices as values; for the above corpus, the dict looks like:
word_index = {"this" : 1, "pasta" : 2, "good" : 3, "very" : 4}
You can also leave out stopwords.
Now you can turn each sentence into a sequence of indices using this word_index dict, so that it looks like:
For corpus 1 : [1, 2, 3]
For corpus 2 : [1, 2, 4, 3]
Enough talk, let's see some code. Also define an oov_token for out-of-vocabulary words.
You can do it like this:
vocab_size = 10000
embedding_dim = 16
max_length = 120
trunc_type = 'post'
oov_tok = "<OOV>"

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(training_sentences)
word_index = tokenizer.word_index

sequences = tokenizer.texts_to_sequences(training_sentences)  # turns each sentence into a list of word indices
padded = pad_sequences(sequences, maxlen=max_length, truncating=trunc_type)  # pads with zeros (at the start by default) and truncates long sequences at the end because truncating='post'

testing_sequences = tokenizer.texts_to_sequences(testing_sentences)
testing_padded = pad_sequences(testing_sequences, maxlen=max_length)
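To actually obtain embeddings from those padded index sequences, they would typically feed a Keras Embedding layer. A minimal sketch reusing vocab_size, embedding_dim and max_length from above; the pooling/output layers and the training_labels / testing_labels names are placeholders, not part of the code above:

import tensorflow as tf

model = tf.keras.Sequential([
    # maps each word index to a trainable embedding_dim-sized vector
    tf.keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(1, activation='sigmoid'),  # placeholder binary head
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# model.fit(padded, training_labels, validation_data=(testing_padded, testing_labels), epochs=10)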
Also see this GitHub code of mine; I hope it helps.
I am following the wildml blog on text classification using TensorFlow. I am not able to understand the purpose of max_document_length in the code statement:
vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)
Also, how can I extract the vocabulary from the vocab_processor?
I have figured out how to extract the vocabulary from the VocabularyProcessor object. This worked perfectly for me.
import numpy as np
from tensorflow.contrib import learn
x_text = ['This is a cat','This must be boy', 'This is a a dog']
max_document_length = max([len(x.split(" ")) for x in x_text])
## Create the VocabularyProcessor object, setting the max length of the documents.
vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length)
## Transform the documents using the vocabulary.
x = np.array(list(vocab_processor.fit_transform(x_text)))
## Extract word:id mapping from the object.
vocab_dict = vocab_processor.vocabulary_._mapping
## Sort the vocabulary dictionary on the basis of values(id).
## Both statements perform same task.
#sorted_vocab = sorted(vocab_dict.items(), key=operator.itemgetter(1))
sorted_vocab = sorted(vocab_dict.items(), key = lambda x : x[1])
## Treat the id's as index into list and create a list of words in the ascending order of id's
## word with id i goes at index i of the list.
vocabulary = list(list(zip(*sorted_vocab))[0])
print(vocabulary)
print(x)
not able to understand the purpose of max_document_length
The VocabularyProcessor maps your text documents into vectors, and you need these vectors to be of a consistent length.
Your input data records may not (or probably won't) be all the same length. For example if you're working with sentences for sentiment analysis they'll be of various lengths.
You provide this parameter to the VocabularyProcessor so that it can adjust the length of output vectors. According to the documentation,
max_document_length: Maximum length of documents. if documents are longer, they will be trimmed, if shorter - padded.
Check out the source code.
def transform(self, raw_documents):
    """Transform documents to word-id matrix.

    Convert words to ids with vocabulary fitted with fit or the one
    provided in the constructor.

    Args:
      raw_documents: An iterable which yield either str or unicode.

    Yields:
      x: iterable, [n_samples, max_document_length]. Word-id matrix.
    """
    for tokens in self._tokenizer(raw_documents):
        word_ids = np.zeros(self.max_document_length, np.int64)
        for idx, token in enumerate(tokens):
            if idx >= self.max_document_length:
                break
            word_ids[idx] = self.vocabulary_.get(token)
        yield word_ids
Note the line word_ids = np.zeros(self.max_document_length).
Each row of the raw_documents variable will be mapped to a vector of length max_document_length.
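As a quick illustration of that trimming and padding, reusing the setup from the snippet above but forcing max_document_length to 4 (the exact id values depend on the fitted vocabulary, so only the shape is shown):

import numpy as np
from tensorflow.contrib import learn

x_text = ['This is a cat', 'This must be boy', 'This is a a dog']
vocab_processor = learn.preprocessing.VocabularyProcessor(max_document_length=4)
x = np.array(list(vocab_processor.fit_transform(x_text)))
print(x.shape)  # (3, 4): every document becomes a length-4 id vector
# 'This is a a dog' has 5 tokens, so its last token is trimmed;
# shorter documents would be right-padded with zeros.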