Train custom NER component with a base model in spaCy v3

I'm having problems training a custom NER component within a base model in spaCy's new version.
So far, I've been training my NER model at CLI with the following command:
python -m spacy train en model training validation --base-model en_core_web_sm --pipeline "ner" -R -n 10
Depending on the use case, I took en_core_web_sm or en_core_web_lg as the base model to make use of the other components like tagger and pos.
In spaCy version 3 a config file is required to handle the command at CLI. I'm using the following configurations for training:
[paths]
train = "training/"
dev = "validation/"
vectors = null
init_tok2vec = null
[system]
gpu_allocator = null
seed = 0
[nlp]
lang = "en"
pipeline = ["ner"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
[components]
[components.ner]
factory = "ner"
moves = null
update_with_oracle_cut_size = 100
[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null
[corpora]
[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null
[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 2000
gold_preproc = false
limit = 0
augmenter = null
[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
before_to_disk = null
[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null
[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0
[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false
[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001
[training.score_weights]
ents_per_type = null
ents_f = 1.0
ents_p = 0.0
ents_r = 0.0
[pretraining]
[initialize]
vectors = null
init_tok2vec = null
vocab_data = null
lookups = null
before_init = null
after_init = null
[initialize.components]
Since I'm not familiar with spaCy's new version, these are pretty much the default settings.
Unfortunately, I can only train the model from scratch, and I can no longer find an option to train only the NER component within an existing language model.
I have also tried to add the parser component in the configuration file with
[components]
[components.parser]
source = "en_core_web_sm"
...
But then the model cannot even be loaded; it raises the following error:
nn_parser.pyx in spacy.syntax.nn_parser.Parser.from_disk()
nn_parser.pyx in spacy.syntax.nn_parser.Parser.Model()
TypeError: Model() takes exactly 1 positional argument (0 given)

In spaCy 3.0, first initialize a config file with the components you need:
python -m spacy init config config.cfg --lang en --pipeline tagger,parser,ner,attribute_ruler,senter,lemmatizer,tok2vec
Then open config.cfg and override the settings you need. For example, you can use the vectors from an existing model:
[initialize]
vectors = "en_core_web_lg"
init_tok2vec = null
vocab_data = null
lookups = null
before_init = null
after_init = null
Then you can run the train command:
python -m spacy train config.cfg --paths.train ./path_to_your_train_data.spacy --paths.dev ./path_to_your_validation_data.spacy --output ./your_model_name
I also found that it's possible to just go to the model folder and swap out components manually, as well as load different components from different models in the code into a single pipeline.
If you need to use a component from an existing model, you can use the following setting in your config.cfg:
[components.tagger]
source = "en_core_web_lg"
For more information on using existing models and components, see the spaCy documentation.

How to set "budget" tag for xgboost hyperband optimization with mlr3tuningspaces?

I am trying to tune xgboost with hyperband and I would like to use the suggested default tuning space from the mlr3tuningspaces package. However, I can't find how to tag a hyperparameter with "budget" while using lts().
Below, I reproduced the mlr3hyperband package example to illustrate my issue:
library(mlr3verse)
library(mlr3hyperband)
library(mlr3tuningspaces)
## this does not work, because I don't know how to tag a hyperparameter
## with "budget" while using the suggested tuning space
search_space = lts("classif.xgboost.default")
search_space$values
## this works because it has a hyperparameter (nrounds) tagged with "budget"
search_space = ps(
  nrounds = p_int(lower = 1, upper = 16, tags = "budget"),
  eta = p_dbl(lower = 0, upper = 1),
  booster = p_fct(levels = c("gbtree", "gblinear", "dart"))
)
# hyperparameter tuning on the pima indians diabetes data set
instance = tune(
  method = "hyperband",
  task = tsk("pima"),
  learner = lrn("classif.xgboost", eval_metric = "logloss"),
  resampling = rsmp("cv", folds = 3),
  measures = msr("classif.ce"),
  search_space = search_space,
  term_evals = 100
)
# best performing hyperparameter configuration
instance$result
Thanks for pointing this out. I will add the budget tag to the default search space. Until then you can use this code.
library(mlr3hyperband)
library(mlr3tuningspaces)
library(mlr3learners)
# get learner with search space in one go
learner = lts(lrn("classif.xgboost"))
# overwrite nrounds with budget tag
learner$param_set$values$nrounds = to_tune(p_int(1000, 5000, tags = "budget"))
instance = tune(
  method = "hyperband",
  task = tsk("pima"),
  learner = learner,
  resampling = rsmp("cv", folds = 3),
  measures = msr("classif.ce"),
  term_evals = 100
)
Update 28.06.2022
The new API in version 0.3.0 is
learner = lts(lrn("classif.xgboost"), nrounds = to_tune(p_int(1000, 5000, tags = "budget")))

Spacy v3 - ValueError: [E030] Sentence boundaries unset

I'm training an entity linker model with spacy 3, and am getting the following error when running spacy train:
ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with: nlp.add_pipe('sentencizer'). Alternatively, add the dependency parser or sentence recognizer, or set sentence boundaries by setting doc[i].is_sent_start.
I've tried with both transformer and tok2vec pipelines, it seems to be failing on this line:
File "/usr/local/lib/python3.7/dist-packages/spacy/pipeline/entity_linker.py", line 252, in update
    sentences = [s for s in eg.reference.sents]
Running spacy debug data shows no errors.
I'm using the following config, before filling it in with spacy init fill-config:
[paths]
train = null
dev = null
kb = "./kb"
[system]
gpu_allocator = "pytorch"
[nlp]
lang = "en"
pipeline = ["transformer","parser","sentencizer","ner", "entity_linker"]
batch_size = 128
[components]
[components.transformer]
factory = "transformer"
[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "roberta-base"
tokenizer_config = {"use_fast": true}
[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96
[components.sentencizer]
factory = "sentencizer"
punct_chars = null
[components.entity_linker]
factory = "entity_linker"
entity_vector_length = 64
get_candidates = {"@misc":"spacy.CandidateGenerator.v1"}
incl_context = true
incl_prior = true
labels_discard = []
[components.entity_linker.model]
@architectures = "spacy.EntityLinker.v1"
nO = null
[components.entity_linker.model.tok2vec]
@architectures = "spacy.HashEmbedCNN.v1"
pretrained_vectors = null
width = 96
depth = 2
embed_size = 2000
window_size = 1
maxout_pieces = 3
subword_features = true
[components.parser]
factory = "parser"
[components.parser.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "parser"
extra_state_tokens = false
hidden_width = 128
maxout_pieces = 3
use_upper = false
nO = null
[components.parser.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
[components.parser.model.tok2vec.pooling]
@layers = "reduce_mean.v1"
[components.ner]
factory = "ner"
[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = false
nO = null
[components.ner.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
[components.ner.model.tok2vec.pooling]
@layers = "reduce_mean.v1"
[corpora]
[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
[training]
accumulate_gradient = 3
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
[training.optimizer]
@optimizers = "Adam.v1"
[training.optimizer.learn_rate]
@schedules = "warmup_linear.v1"
warmup_steps = 250
total_steps = 20000
initial_rate = 5e-5
[training.batcher]
@batchers = "spacy.batch_by_padded.v1"
discard_oversize = true
size = 2000
buffer = 256
[initialize]
vectors = ${paths.vectors}
[initialize.components]
[initialize.components.sentencizer]
[initialize.components.entity_linker]
[initialize.components.entity_linker.kb_loader]
@misc = "spacy.KBFromFile.v1"
kb_path = ${paths.kb}
I could write a script to add the sentence boundaries to the docs manually, but I am wondering why the sentencizer component is not doing this for me. Is something missing in the config?
You haven't put the sentencizer in annotating_components, so the updates it makes aren't visible to other components during training. Take a look at the relevant section in the docs.
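As a rough sketch of what that change might look like: this assumes spaCy v3.1 or later, where the [training] block accepts an annotating_components list. The first three settings are copied from the question's config; only the last setting is new.

```ini
[training]
accumulate_gradient = 3
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
# Assumption: listing the sentencizer here lets its predictions be set on
# the docs during training, so components later in the pipeline can see them.
annotating_components = ["sentencizer"]
```

After editing, the config can be filled in again with spacy init fill-config, as in the question.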

Retrain Frozen Graph in Tensorflow 2.x

I managed to retrain a frozen graph in TensorFlow 1 by following this wonderfully detailed topic. Basically, the method is as follows:
Load frozen model
Replace the constant frozen node with variable node.
The newly replaced variable node then will be redirected to the corresponding output of the frozen node.
This works in TensorFlow 1.x, which can be confirmed by checking tf.compat.v1.trainable_variables. However, it no longer works in TensorFlow 2.x.
Below is the code snippet:
1/ Load frozen model
import tensorflow as tf

frozen_path = '...'
detection_graph = tf.Graph()
with detection_graph.as_default():
    od_graph_def = tf.compat.v1.GraphDef()
    with tf.compat.v1.io.gfile.GFile(frozen_path, 'rb') as fid:
        serialized_graph = fid.read()
        od_graph_def.ParseFromString(serialized_graph)
        tf.graph_util.import_graph_def(od_graph_def, name='')
2/ Create a clone
with detection_graph.as_default():
    const_var_name_pairs = {}
    probable_variables = [op for op in detection_graph.get_operations() if op.type == "Const"]
    available_names = [op.name for op in detection_graph.get_operations()]
    for op in probable_variables:
        name = op.name
        if name+'/read' not in available_names:
            continue
        tensor = detection_graph.get_tensor_by_name('{}:0'.format(name))
        with tf.compat.v1.Session() as s:
            tensor_as_numpy_array = s.run(tensor)
        var_shape = tensor.get_shape()
        # Give each variable a name that doesn't already exist in the graph
        var_name = '{}_turned_var'.format(name)
        var = tf.Variable(name=var_name, dtype=op.outputs[0].dtype, initial_value=tensor_as_numpy_array,trainable=True, shape=var_shape)
        const_var_name_pairs[name] = var_name
3/ Replace frozen node by Graph Editor
import graph_def_editor as ge
ge_graph = ge.Graph(detection_graph.as_graph_def())
name_to_op = dict([(n.name, n) for n in ge_graph.nodes])
for const_name, var_name in const_var_name_pairs.items():
    const_op = name_to_op[const_name+'/read']
    var_reader_op = name_to_op[var_name + '/Read/ReadVariableOp']
    ge.swap_outputs(ge.sgv(const_op), ge.sgv(var_reader_op))
detection_training_graph = ge_graph.to_tf_graph()
with detection_training_graph.as_default():
    writer = tf.compat.v1.summary.FileWriter('remap', detection_training_graph)
    writer.close()
The problem was in how I used Graph Editor: I imported the GraphDef (detection_graph.as_graph_def()) instead of the original tf.Graph, which holds the Variables.
The quick fix is to change step 3.
Sol1: Using Graph Editor
ge_graph = ge.Graph(detection_graph)
for const_name, var_name in const_var_name_pairs.items():
    const_op = ge_graph._node_name_to_node[const_name+'/read']
    var_reader_op = ge_graph._node_name_to_node[var_name+'/Read/ReadVariableOp']
    ge.swap_outputs(ge.sgv(const_op), ge.sgv(var_reader_op))
However, this requires disabling eager execution. To keep eager execution enabled, attach the MetaGraphDef to Graph Editor as follows:
with detection_graph.as_default():
    meta_saver = tf.compat.v1.train.Saver()
    meta = meta_saver.export_meta_graph()
ge_graph = ge.Graph(detection_graph,collections=ge.graph._extract_collection_defs(meta))
However, this is the trickiest part of making the model trainable in TF 2.x.
Instead of using Graph Editor to export the graph directly, we should export it ourselves. The reason is that Graph Editor turns the Variables' data type into resources. Therefore, we should export the graph as a GraphDef and import the variable definitions into the graph:
from tensorflow.core.framework import variable_pb2

test_graph = tf.Graph()
with test_graph.as_default():
    tf.import_graph_def(ge_graph.to_graph_def(), name="")
    for var_name in ge_graph.variable_names:
        var = ge_graph.get_variable_by_name(var_name)
        ret = variable_pb2.VariableDef()
        ret.variable_name = var._variable_name
        ret.initial_value_name = var._initial_value_name
        ret.initializer_name = var._initializer_name
        ret.snapshot_name = var._snapshot_name
        ret.trainable = var._trainable
        ret.is_resource = True
        tf_var = tf.Variable(variable_def=ret,dtype=tf.float32)
        test_graph.add_to_collections(var.collection_names, tf_var)
Sol2: Manually map by Graphdef
with detection_graph.as_default() as graph:
    training_graph_def = remap_input_node(detection_graph.as_graph_def(),const_var_name_pairs)
    current_var = (tf.compat.v1.trainable_variables())
    assert len(current_var)>0, "no training variables"
detection_training_graph = tf.Graph()
with detection_training_graph.as_default():
    tf.graph_util.import_graph_def(training_graph_def, name='')
    for var in current_var:
        ret = variable_pb2.VariableDef()
        ret.variable_name = var.name
        ret.initial_value_name = var.name[:-2] + '/Initializer/initial_value:0'
        ret.initializer_name = var.name[:-2] + '/Assign'
        ret.snapshot_name = var.name[:-2] + '/Read/ReadVariableOp:0'
        ret.trainable = True
        ret.is_resource = True
        tf_var = tf.Variable(variable_def=ret,dtype=tf.float32)
        detection_training_graph.add_to_collections({'trainable_variables', 'variables'}, tf_var)
    current_var = (tf.compat.v1.trainable_variables())
    assert len(current_var)>0, "no training variables"
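The helper remap_input_node is not defined in the post. Purely as an illustration of the idea (redirect every consumer of a frozen constant's read op to the matching variable's read op), here is a minimal sketch in plain Python. The Node class and remap_inputs function are hypothetical stand-ins, not TensorFlow or graph_def_editor APIs, and the sketch ignores details of real GraphDef input names such as ":0" output suffixes and "^" control-dependency prefixes.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """Hypothetical stand-in for a GraphDef node: a name plus its input names."""
    name: str
    inputs: List[str] = field(default_factory=list)


def remap_inputs(nodes: List[Node], const_var_name_pairs: Dict[str, str]) -> List[Node]:
    """Return copies of nodes whose inputs point at variable reads instead of frozen reads."""
    # Build the redirect table using the naming scheme from the post:
    # "<const>/read" is replaced by "<var>/Read/ReadVariableOp".
    redirect = {
        const_name + "/read": var_name + "/Read/ReadVariableOp"
        for const_name, var_name in const_var_name_pairs.items()
    }
    remapped = []
    for node in nodes:
        new_inputs = [redirect.get(inp, inp) for inp in node.inputs]
        remapped.append(Node(node.name, new_inputs))
    return remapped


# Toy graph: a frozen weight, its read op, and one consumer.
nodes = [
    Node("conv/weights"),
    Node("conv/weights/read", ["conv/weights"]),
    Node("conv/Conv2D", ["input", "conv/weights/read"]),
]
pairs = {"conv/weights": "conv/weights_turned_var"}
result = remap_inputs(nodes, pairs)
print(result[2].inputs)
```

Only the consumer's input list changes; the frozen constant and its read op stay in the graph untouched, which mirrors the "redirect the output" step in the method described in the question.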

Bayesian IRT Pymc3 - Parameter inference

I would like to estimate an IRT model using PyMC3.
I generated data with the following distribution:
alpha_fix = 4
beta_fix = 100
theta= np.random.normal(100,15,1000)
prob = np.exp(alpha_fix*(theta-beta_fix))/(1+np.exp(alpha_fix*(theta-beta_fix)))
prob_tt = tt._shared(prob)
Then I created a model using PyMC3 to infer the parameter:
irt = pm.Model()
with irt:
    # Priors
    alpha = pm.Normal('alpha',mu = 4 , tau = 1)
    beta = pm.Normal('beta',mu = 100 , tau = 15)
    thau = pm.Normal('thau' ,mu = 100 , tau = 15)
    # Modelling
    p = pm.Deterministic('p',tt.exp(alpha*(thau-beta))/(1+tt.exp(alpha*(thau-beta))))
    out = pm.Normal('o',p,observed = prob_tt)
Then I infer through the model:
with irt:
    mean_field = pm.fit(10000,method='advi', callbacks=[pm.callbacks.CheckParametersConvergence(diff='absolute')])
Finally, I sample from the fitted approximation and plot the posterior:
pm.plot_posterior(mean_field.sample(1000), color='LightSeaGreen');
But the result for "alpha" (mean of 2.2) is relatively far from the expected value (4), even though the prior on alpha was well calibrated.
Would you have an idea of the origin of this gap and how to fix it?
Thanks a lot,
out = pm.Normal('o',p,observed = prob_tt)
Why are you using Normal instead of Bernoulli? Also, what is the variance of the Normal?
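To illustrate the Bernoulli point: in an IRT setting the observed data are usually binary responses (correct or incorrect) drawn with the item response probability, not the probabilities themselves. Below is a minimal sketch using only the standard library, reusing the question's values alpha = 4 and beta = 100. The function names are made up for this example, and the sketch only generates data; it does not fit a model.

```python
import math
import random


def irt_probability(theta, alpha, beta):
    """Item response probability: logistic(alpha * (theta - beta))."""
    z = alpha * (theta - beta)
    # Numerically stable logistic, so exp() never sees a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def simulate_responses(n, alpha, beta, seed=0):
    """Draw abilities, compute probabilities, then draw 0/1 Bernoulli responses."""
    rng = random.Random(seed)
    thetas = [rng.gauss(100, 15) for _ in range(n)]
    probs = [irt_probability(t, alpha, beta) for t in thetas]
    responses = [1 if rng.random() < p else 0 for p in probs]
    return thetas, probs, responses


thetas, probs, responses = simulate_responses(1000, alpha=4, beta=100)
print(sum(responses) / len(responses))  # observed share of correct answers
```

With data of this shape, a Bernoulli likelihood on the 0/1 responses would be the natural counterpart to the Normal likelihood used in the question.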

Perform Kriging Interpolation using Arcpy

I have a list of point feature classes and am trying to write a Python script to perform Kriging interpolation. The code raises an error saying that "Point_Num" is not defined.
Below is the script I am working on:
import arcpy
from arcpy import env
from arcpy.sa import *
arcpy.env.overwriteOutput = True
# Check out the ArcGIS Spatial Analyst extension license
arcpy.CheckOutExtension("Spatial")
In_Point = r"D:\NPP_Test\MERRA_TEMP_2012C" #(Point feature name:r001_mean, r002_mean.....r012_mean )
Out_Raster = r"D:\NPP_Test\MERRA_TEMP_2012D"
points = arcpy.ListFeatureClasses()
zFields = "GRID_CODE"
# Kriging variables
cellSize = 0.05
lagSize = 0.5780481172534
majorRange = 6
partialSill = 3.304292110
nugget = 0.002701348
kRadius = RadiusFixed(20000, 1)
#Mask region of interest
mask = r"D:\Gujarta Shape file\GUJARATSTATE.shp"
for zField in zFields:
    Point = Point_Num[:3]
    kModelUniversalObj = KrigingModelUniversal("LINEARDRIFT", lagSize, majorRange, partialSill, nugget)
    OutKriging = Kriging(inPointFeatures, zField, kModelUniversalObj, cellSize, kRadius)
    #IDWMASk = ExtractByMask(outIDW, mask)
    KrigMask = ExtractByMask(OutKriging, mask)
    #Save outraster as the same name of input
    KrigMask.save("r{}.tif".format(Point_Num))
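A likely cause, sketched here without arcpy so that it runs anywhere: Point_Num is never assigned before it is used, and because zFields is a single string, the loop iterates over its characters rather than over the point feature classes. The snippet below assumes the loop was meant to run over the feature class names and that the output number is the three digits after the leading "r". The list of names is a stand-in for what arcpy.ListFeatureClasses() would return, and output_name is a hypothetical helper.

```python
z_field = "GRID_CODE"

# Iterating over a string yields single characters, not field names.
chars = [c for c in z_field]
print(chars[:3])

# Stand-in for the feature class names listed in the workspace (assumption).
point_names = ["r001_mean", "r002_mean", "r012_mean"]


def output_name(point_name):
    """Build the output raster name from the three digits after the leading 'r'."""
    point_num = point_name[1:4]
    return "r{}.tif".format(point_num)


# Loop over the feature classes; the z field stays the same for each one.
outputs = [output_name(name) for name in point_names]
print(outputs)
```

In the real script, the loop body would call Kriging with the current feature class and the single z field, then save under the derived name.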