How do I evaluate a custom pipeline component with a custom attribute? - spacy

Questions:
How can I give GoldParse the gold data for a custom attribute?
How can I extend the properties of Scorer by custom scores which are based on a custom attribute?
Explanation
I have implemented a custom pipeline component setting a custom attribute which was set with Doc.set_extension('results', default=[]). I want to evaluate my pipeline with labelled data (something like {text: "This is some text", results: ["banana", "picture"]}). It seems to me that GoldParse and Scorer are doing what I need with default attributes, but I can't find information on how to use them with a custom attribute.
I have seen and understood examples like this, but they only ever deal with default attributes.
What I've tried
I have tried figuring out whether I can somehow configure the two classes for custom attributes/scores, but haven't found a way. The parameters of the init method of GoldParse and the Scorer properties seem to be fixed.
I have thought about extending the two classes with subclasses, but they don't seem easily extendable to me.
What I would like to avoid
Of course I can copy the code from Scorer and GoldParse which I need and add code for my custom attribute, but that seems like a bad solution. Also, considering how spaCy encourages you to extend a pipeline and a Doc, I would be surprised if the evaluation of those extensions were this hard.

Unfortunately, it actually is this hard in spaCy v2. It's very hard to add things to GoldParse (basically a don't-try-this-at-home level of hard) and the Scorer is also hard to extend.
We're working on this for the upcoming spaCy v3, where the scoring methods will be implemented more generally and each component will be able to provide its own score method. Be aware that this is still unstable, but if you're curious you can have a look at: https://github.com/explosion/spaCy/pull/5731. GoldParse has been replaced with Example, which stores both the gold annotation and the predicted annotation on individual Doc objects, getting rid of the restrictions related to GoldParse.
If you have a doc-level extension (as above) then you should probably just use a different library for evaluation. You could potentially use ROCAUCScore or PRFScore from spacy.scorer, but it may be easier to use something like sklearn metrics instead. (The ROCAUCScore is just a simplified version of the sklearn ROC AUC metric.)
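For a doc-level extension such as doc._.results, the comparison can be done entirely outside spaCy. The following is a minimal sketch, assuming the gold labels and the predicted doc._.results values have already been collected into two parallel lists. The function name prf_over_docs and the sample data are hypothetical, and the scores are micro-averaged over all documents.

```python
def prf_over_docs(gold_lists, pred_lists):
    """Micro-averaged precision, recall and F1 over per-document label sets."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_lists, pred_lists):
        gold_set, pred_set = set(gold), set(pred)
        tp += len(gold_set & pred_set)   # predicted and present in gold
        fp += len(pred_set - gold_set)   # predicted but not in gold
        fn += len(gold_set - pred_set)   # in gold but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical data: gold labels, and what doc._.results held for each text
gold_results = [["banana", "picture"], ["apple"]]
pred_results = [["banana"], ["apple", "pear"]]

precision, recall, f1 = prf_over_docs(gold_results, pred_results)
print(precision, recall, f1)
```

Treating each document's results as a set ignores order and duplicates; if either matters for your attribute, the comparison inside the loop would need to change.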
If you have a token-level extension, for v2 I think the best you can do within spaCy is to use PRFScore and extract the alignment bits based on words from a GoldParse to use outside of the scorer itself. Something like this:
import spacy
from spacy.gold import GoldParse
from spacy.scorer import PRFScore

nlp = spacy.load("my_model")
score = PRFScore()
for text, gold_words, gold_attrs in zip(texts, gold_words_list, gold_attrs_list):
    # NOTE: gold_attrs must be aligned with gold_words
    # gold_words = ["a", "b", "c", ...]
    # gold_attrs = ["a1", "b1", "c1", ...]
    gold = GoldParse(nlp.make_doc(text), words=gold_words)
    doc = nlp(text)
    gold_values = set()
    cand_values = set()
    for i, gold_attr in enumerate(gold_attrs):
        gold_values.add((i, gold_attr))
    for token in doc:
        if token.orth_.isspace():
            continue
        # Map the predicted token index to the gold token index (None if misaligned)
        gold_i = gold.cand_to_gold[token.i]
        if gold_i is not None:
            # Token-level extension, so read it from the token rather than the doc
            cand_values.add((gold_i, token._.attr))
    score.score_set(cand_values, gold_values)
print(score.fscore)
This is an untested sketch that should parallel how token.tag is evaluated in the Scorer. The alignment bits are the trickiest part, so if you don't have misalignments between the gold words and spaCy's tokenization, you may be better off exporting your results and using a different library for evaluation.

Related

Patient name extraction using MedSpacy

I am looking for some guidance on NER using medspaCy. I am aware of disease extraction with medspaCy, but my aim is to extract the patient name from medical reports.
The text looks like this:
patient:Jeromy, David (DOB)
Date range 2020 to 2022. Visited Dr Brian. Suffered from ...
This is the kind of data I have, and I want to extract the patient name from all the pages of the medical reports using medspaCy. I know target rules can be helpful, but any clearer guidance would be appreciated.
Thanks & regards
If you find that the default spaCy NER model is not sufficient, as it will not pick up names such as "Byrn, John", I have a couple of suggestions:
Train a custom NER component using Prodigy, the annotation tool from the makers of spaCy, which you can use to easily label some examples of names. This is a rather simple task, so you can likely train a model with fewer than 100 diverse examples. Note: Prodigy is a paid tool, so see my other suggestions if you do not have access or are not willing to pay.
Train a custom NER component without Prodigy. This is similar to the above approach, but slightly more involved. This Medium article provides a beginner-friendly introduction to doing so, and you can also refer to spaCy's own documentation. You can provide spaCy with some example texts and the character offsets of the entities you want extracted, like so:
TRAIN_DATA = [
    # Entities are (start_char, end_char, label), with end_char exclusive
    ('Patient: Byrn, John', {
        'entities': [(9, 19, 'PERSON')]
    }),
    ('John Byrn received 10mg of advil', {
        'entities': [(0, 9, 'PERSON')]
    })
]
Build rules based on existing spaCy components. You can leverage existing spaCy pipeline components (you don't necessarily need medspaCy for this), such as POS tagging and dependency parsing. For example, you can look for proper nouns in your documents to identify names. Check out the docs on POS tagging here.
Try other pretrained NER models. There may be other models that are better suited to your task. Check out other models on spaCy Universe, or even better, on the Hugging Face Hub, which contains some of the best models out there for every use case. An added bonus of the Hub is that you can try out a model on its model page and assess its performance on some examples before you decide.
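As a starting point for the rule-based suggestion, here is a minimal sketch that uses a plain regular expression rather than spaCy or medspaCy rules. It assumes the reports always write the name as "patient: Last, First"; the pattern and the function name find_patient_names are hypothetical and would need adjusting to your real data. The spans it returns use the same (start, end, label) character-offset format as TRAIN_DATA above, so they could also be used to bootstrap training examples.

```python
import re

# Hypothetical pattern: "patient" (any case), a colon, then "Last, First"
PATIENT_RE = re.compile(r"(?i:patient)\s*:\s*([A-Z][A-Za-z'-]+,\s*[A-Z][A-Za-z'-]+)")


def find_patient_names(text):
    """Return (start, end, label) character spans for names after 'patient:'."""
    return [(m.start(1), m.end(1), "PERSON") for m in PATIENT_RE.finditer(text)]


report = "patient:Jeromy, David (DOB)\nDate range 2020 to 2022. Visited Dr Brian."
spans = find_patient_names(report)
for start, end, label in spans:
    print(label, repr(report[start:end]))
```

Names such as "Dr Brian" are deliberately not matched, because the pattern only fires directly after the "patient:" marker.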
Hope this helps!

Kernel's hyper-parameters; initialization and setting bounds

I think many other people like me might be interested in how they can use GPflow for their own problems. The key is how customizable GPflow is, and a good example would be very helpful.
In my case, I read and tried lots of suggestions from the raised issues without any real success. Setting kernel parameters is not straightforward (you create the kernel with default values, and then change them by deleting and replacing the parameter object). The transform method is vague.
It would be really helpful if you could add an example showing how one can initialize and set bounds for an anisotropic kernel (length-scale values and bounds, variances, ...) and especially how to add observation errors (as an array-like alpha parameter).
If you just want to set a value, then you can do
import numpy as np
import gpflow

model = gpflow.models.GPR(np.zeros((1, 1)),
                          np.zeros((1, 1)),
                          gpflow.kernels.RBF(1, lengthscales=0.2))
Alternatively
model = gpflow.models.GPR(np.zeros((1, 1)),
                          np.zeros((1, 1)),
                          gpflow.kernels.RBF(1))
model.kern.lengthscales = 0.2
If you want to change the transform, you either need to subclass the kernel, or you can also do
with gpflow.defer_build():
    model = gpflow.models.GPR(np.zeros((1, 1)),
                              np.zeros((1, 1)),
                              gpflow.kernels.RBF(1))
    # Bound the lengthscale to the interval (0.1, 1.0)
    transform = gpflow.transforms.Logistic(0.1, 1.)
    model.kern.lengthscales = gpflow.params.Parameter(0.3, transform=transform)
model.compile()
You need defer_build to stop the graph from being compiled before you've changed the transform. With the approach above, the compilation of the TensorFlow graph is delayed until the explicit model.compile(), so it is built with the intended bounding transform.
Using an array parameter for likelihood variance is outside the scope of gpflow. For what it's worth (and because it has been asked about before), that particular model is especially problematic as it is not clear how test points are defined.
Setting kernel parameters can be done using the .assign() function, or through direct assignment. See the notebook https://github.com/GPflow/GPflow/blob/develop/doc/source/notebooks/understanding/tf_graphs_and_sessions.ipynb. You do not need to delete a parameter to assign a new value to it.
If you want to have per-datapoint noise, you will need to implement your own custom likelihood, which you can do by taking Gaussian likelihood in likelihoods.py as an example.
If by "bounds" you mean limiting the optimisation range for a parameter, you can use the Logistic transform. If you want to pass in a custom transformation for a parameter, you can pass a constructed Parameter object into constructors with a custom transform. Alternatively you can assign a newly created Parameter with a new transform to the model.
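To show why a logistic transform acts as a pair of bounds, here is a minimal sketch of the underlying mapping in plain Python. This is not GPflow's code; it assumes the usual scaled-sigmoid form, and the function names are hypothetical. The optimiser works on the unconstrained value x, while the model only ever sees a value strictly between a and b.

```python
import math


def logistic_forward(x, a, b):
    """Map an unconstrained value x to the open interval (a, b)."""
    return a + (b - a) / (1.0 + math.exp(-x))


def logistic_backward(y, a, b):
    """Inverse map: recover the unconstrained value from y in (a, b)."""
    return -math.log((b - a) / (y - a) - 1.0)


a, b = 0.1, 1.0
for x in (-10.0, 0.0, 10.0):
    # However far x moves, the constrained value stays between a and b
    print(x, logistic_forward(x, a, b))
```

Because the bounds are only approached asymptotically, an initial value for the parameter has to lie strictly inside (a, b), as 0.3 does for Logistic(0.1, 1.) above.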
Here is more information on how to access and change GPflow parameters: the documentation on viewing, getting and setting parameters.
An extra bit for @user1018464's answer about replacing the transform of an existing parameter: changing a transformation is a bit tricky, because you can't change it once the model has been compiled in TensorFlow.
E.g.
likelihood = gpflow.likelihoods.Gaussian()
likelihood.variance.transform = gpflow.transforms.Logistic(1., 10.)
----
GPflowError: Parameter "Gaussian/variance" has already been compiled.
Instead, you have to reset the GPflow object:
likelihood = gpflow.likelihoods.Gaussian() # All tensors compiled
likelihood.clear()
likelihood.variance.transform = gpflow.transforms.Logistic(2, 5)
likelihood.variance = 2.5
likelihood.compile()

binary classification target specifically on false positive

I got a little confused when using models from sklearn: how do I set a specific optimization target? For example, when RandomForestClassifier is used, how do I let the model 'know' that I want to maximize 'recall', 'F1 score' or 'AUC' instead of 'accuracy'?
Any suggestions? Thank you.
What you are looking for is parameter tuning. Basically, you first select an estimator, then you define a hyper-parameter space (i.e. all the parameters you want to tune and their candidate values), a cross-validation scheme and a scoring function. Depending on how you want to search the parameter space, you can choose one of the following:
Exhaustive Grid Search
In this approach, sklearn uses GridSearchCV to build a grid of all possible combinations of the hyper-parameter values defined by the user. For instance:
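# The "classifier__" prefix assumes the estimator is a Pipeline step named "classifier";
# drop the prefix if the estimator is passed to GridSearchCV directly.
my_clf = DecisionTreeClassifier(random_state=0, class_weight='balanced')
param_grid = dict(
    classifier__min_samples_split=[5, 7, 9, 11],
    classifier__max_leaf_nodes=[50, 60, 70, 80],
    classifier__max_depth=[1, 3, 5, 7, 9]
)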
In this case, the grid specified is a cross-product of values of classifier__min_samples_split, classifier__max_leaf_nodes and classifier__max_depth. The documentation states that:
The GridSearchCV instance implements the usual estimator API: when “fitting” it on a dataset all the possible combinations of parameter values are evaluated and the best combination is retained.
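To make the cross-product concrete, here is a small sketch in plain Python that enumerates the combinations for a grid like the one above. It only illustrates what "all possible combinations" means and is not how GridSearchCV is implemented; the parameter names are written without a pipeline prefix for brevity.

```python
from itertools import product

param_grid = {
    "min_samples_split": [5, 7, 9, 11],
    "max_leaf_nodes": [50, 60, 70, 80],
    "max_depth": [1, 3, 5, 7, 9],
}

# One dict per combination: the Cartesian product of all value lists
names = sorted(param_grid)
combinations = [dict(zip(names, values))
                for values in product(*(param_grid[name] for name in names))]

print(len(combinations))  # 80
print(combinations[0])
```

Each of these 80 settings is fitted once per cross-validation fold, which is why the grid grows expensive quickly as more parameters are added.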
An example of using GridSearchCV:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Create a classifier
clf = LogisticRegression(random_state=0)
# Build the cross-validation splits (features and labels are pandas objects, n_splits is chosen by you)
cv = StratifiedKFold(n_splits=n_splits).split(features, labels)
# Declare the hyper-parameter grid (no "classifier__" prefix, because clf is passed directly and not as a Pipeline step)
param_grid = dict(
    tol=[1.0, 0.1, 0.01, 0.001],
    C=np.power([10.0] * 5, list(range(-3, 2))).tolist(),
    solver=['newton-cg', 'lbfgs', 'liblinear', 'sag'],
)
# Perform grid search using the classifier, parameter grid, scoring function and cross-validation splits
grid_search = GridSearchCV(clf, param_grid=param_grid, verbose=10, scoring=make_scorer(f1_score), cv=list(cv))
grid_search.fit(features.values, labels.values)
# To get the best score under the specified scoring function, use the following
print(grid_search.best_score_)
# Similarly, to get the best estimator
best_clf = grid_search.best_estimator_
print(best_clf)
You can read more in its documentation here, which covers the attributes and methods for retrieving the best parameters and so on.
Randomized Search
Instead of exhaustively checking the hyper-parameter space, sklearn implements RandomizedSearchCV to do a randomized search over the parameters. The documentation states that:
RandomizedSearchCV implements a randomized search over parameters, where each setting is sampled from a distribution over possible parameter values.
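The sampling idea can be sketched in plain Python as follows. This is an illustration under stated assumptions, not sklearn's implementation: the helper sample_settings is hypothetical, and only the names param_distributions and n_iter mirror the arguments that RandomizedSearchCV accepts.

```python
import random


def sample_settings(param_distributions, n_iter, seed=0):
    """Draw n_iter settings; each entry is a list of choices or a callable sampler."""
    rng = random.Random(seed)
    settings = []
    for _ in range(n_iter):
        setting = {}
        for name, dist in param_distributions.items():
            # A callable draws from a continuous distribution; a list is sampled uniformly
            setting[name] = dist(rng) if callable(dist) else rng.choice(dist)
        settings.append(setting)
    return settings


param_distributions = {
    "max_depth": [1, 3, 5, 7, 9],
    # Log-uniform draw between 1e-3 and 1e1
    "C": lambda rng: 10 ** rng.uniform(-3, 1),
}

for setting in sample_settings(param_distributions, n_iter=5):
    print(setting)
```

Unlike the grid, the cost is fixed by n_iter rather than by the size of the search space, so continuous ranges can be explored without enumerating them.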
You can read more about it from here.
You can read more about other approaches here.
Alternative link for reference:
How to Tune Algorithm Parameters with Scikit-Learn
What is hyperparameter optimization in machine learning in formal terms?
Grid Search for hyperparameter and feature selection
Edit: In your case, if you want to maximize the recall of the model, you simply specify recall_score from sklearn.metrics as the scoring function (wrapped with make_scorer, or as the string 'recall').
If you wish to optimize for 'False Positive' as stated in your question, you can refer to this answer to extract the false positives from the confusion matrix. Then wrap that in a scorer with the make_scorer function and pass it to the GridSearchCV object for tuning.
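A metric of that kind only needs the true and predicted labels. Here is a minimal sketch in plain Python, assuming binary labels with 1 as the positive class; the function names are hypothetical, and the counting is done by hand instead of through a confusion matrix.

```python
def false_positive_count(y_true, y_pred, positive=1):
    """Count samples predicted as positive whose true label is not positive."""
    return sum(1 for truth, pred in zip(y_true, y_pred)
               if pred == positive and truth != positive)


def false_positive_rate(y_true, y_pred, positive=1):
    """False positives divided by the number of actual negatives."""
    negatives = sum(1 for truth in y_true if truth != positive)
    return false_positive_count(y_true, y_pred, positive) / negatives if negatives else 0.0


y_true = [0, 0, 1, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]
print(false_positive_count(y_true, y_pred))
print(false_positive_rate(y_true, y_pred))
```

Since fewer false positives is better, a function like this would be wrapped with make_scorer(false_positive_rate, greater_is_better=False) before being passed as the scoring argument, so that the search treats lower values as better.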
I would suggest you grab a cup of coffee and read (and understand) the following
http://scikit-learn.org/stable/modules/model_evaluation.html
You need to use something along the lines of
cross_val_score(model, X, y, scoring='f1')
possible choices are (check the docs)
['accuracy', 'adjusted_mutual_info_score', 'adjusted_rand_score',
'average_precision', 'completeness_score', 'explained_variance',
'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted',
'fowlkes_mallows_score', 'homogeneity_score', 'mutual_info_score',
'neg_log_loss', 'neg_mean_absolute_error', 'neg_mean_squared_error',
'neg_mean_squared_log_error', 'neg_median_absolute_error',
'normalized_mutual_info_score', 'precision', 'precision_macro',
'precision_micro', 'precision_samples', 'precision_weighted', 'r2',
'recall', 'recall_macro', 'recall_micro', 'recall_samples',
'recall_weighted', 'roc_auc', 'v_measure_score']
Have fun
Umberto

adding gaussian noise to all tensorflow variables

I'm working on a project which needs to evaluate the performance of a CNN/RNN after adding noise to all the variables. For example, if we have a simple MLP, I want to add random Gaussian noise to all the weight parameters, which is not difficult. However, it doesn't seem easy to manipulate the variables of an RNN. For example, the variables inside tf.contrib.rnn.BasicLSTMCell are encapsulated and not accessible to users.
I found a possible way to do this by using tf.train.Saver. I can print all the variables, including the encapsulated ones. However, how to modify the values of all the variables is still not clear.
Is there an easy way to do this?
You can use tf.trainable_variables (doc) or tf.global_variables (doc) to get those variables, and add noise to them.
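Stripped of TensorFlow, the idea is a loop over every named variable that adds zero-mean Gaussian noise with a chosen standard deviation. The sketch below shows that loop in plain Python on a hypothetical dict of flattened weights; the variable names and the helper add_gaussian_noise are assumptions for illustration. In a TF 1.x graph, the same loop would presumably run over tf.trainable_variables() and apply an assign op per variable in a session (for example tf.assign_add with a tf.random_normal tensor of the variable's shape), which is not tested here.

```python
import random


def add_gaussian_noise(variables, stddev, seed=None):
    """Return a copy of {name: list of floats} with N(0, stddev^2) noise added to every value."""
    rng = random.Random(seed)
    return {name: [value + rng.gauss(0.0, stddev) for value in values]
            for name, values in variables.items()}


# Hypothetical flattened weights, keyed by variable name
weights = {
    "dense/kernel": [0.5, -0.2, 0.1],
    "lstm/bias": [0.0, 0.0],
}
noisy = add_gaussian_noise(weights, stddev=0.01, seed=42)
for name in weights:
    print(name, noisy[name])
```

Returning a noisy copy keeps the original weights untouched, so the clean model can be re-evaluated or perturbed again with a different seed.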

How is tf.summary.tensor_summary meant to be used?

TensorFlow provides a tf.summary.tensor_summary() function that appears to be a multidimensional variant of tf.summary.scalar():
tf.summary.tensor_summary(name, tensor, summary_description=None, collections=None)
I thought it could be useful for summarizing inferred probabilities per class ... somewhat like
op_summary = tf.summary.tensor_summary('classes', some_tensor)
# ...
summary = sess.run(op_summary)
writer.add_summary(summary)
However it appears that TensorBoard doesn't provide a way to display these summaries at all. How are they meant to be used?
I cannot get it to work either. It seems that the feature is still under development: this video from the TensorFlow Dev Summit says as much about tensor_summary (starting at 9:17): https://youtu.be/eBbEDRsCmv4?t=9m17s. It will probably be better defined, with examples, in the future.