Spacy v3 - ValueError: [E030] Sentence boundaries unset - spacy

I'm training an entity linker model with spacy 3, and am getting the following error when running spacy train:
ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with: nlp.add_pipe('sentencizer'). Alternatively, add the dependency parser or sentence recognizer, or set sentence boundaries by setting doc[i].is_sent_start.
I've tried with both the transformer and tok2vec pipelines; it seems to be failing on this line:
File "/usr/local/lib/python3.7/dist-packages/spacy/pipeline/entity_linker.py", line 252, in update sentences = [s for s in eg.reference.sents]
Running spacy debug data shows no errors.
I'm using the following config, before filling it in with spacy init fill-config:
[paths]
train = null
dev = null
kb = "./kb"
[system]
gpu_allocator = "pytorch"
[nlp]
lang = "en"
pipeline = ["transformer","parser","sentencizer","ner", "entity_linker"]
batch_size = 128
[components]
[components.transformer]
factory = "transformer"
[components.transformer.model]
@architectures = "spacy-transformers.TransformerModel.v3"
name = "roberta-base"
tokenizer_config = {"use_fast": true}
[components.transformer.model.get_spans]
@span_getters = "spacy-transformers.strided_spans.v1"
window = 128
stride = 96
[components.sentencizer]
factory = "sentencizer"
punct_chars = null
[components.entity_linker]
factory = "entity_linker"
entity_vector_length = 64
get_candidates = {"@misc":"spacy.CandidateGenerator.v1"}
incl_context = true
incl_prior = true
labels_discard = []
[components.entity_linker.model]
@architectures = "spacy.EntityLinker.v1"
nO = null
[components.entity_linker.model.tok2vec]
@architectures = "spacy.HashEmbedCNN.v1"
pretrained_vectors = null
width = 96
depth = 2
embed_size = 2000
window_size = 1
maxout_pieces = 3
subword_features = true
[components.parser]
factory = "parser"
[components.parser.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "parser"
extra_state_tokens = false
hidden_width = 128
maxout_pieces = 3
use_upper = false
nO = null
[components.parser.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
[components.parser.model.tok2vec.pooling]
@layers = "reduce_mean.v1"
[components.ner]
factory = "ner"
[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = false
nO = null
[components.ner.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0
[components.ner.model.tok2vec.pooling]
@layers = "reduce_mean.v1"
[corpora]
[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
[training]
accumulate_gradient = 3
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
[training.optimizer]
@optimizers = "Adam.v1"
[training.optimizer.learn_rate]
@schedules = "warmup_linear.v1"
warmup_steps = 250
total_steps = 20000
initial_rate = 5e-5
[training.batcher]
@batchers = "spacy.batch_by_padded.v1"
discard_oversize = true
size = 2000
buffer = 256
[initialize]
vectors = ${paths.vectors}
[initialize.components]
[initialize.components.sentencizer]
[initialize.components.entity_linker]
[initialize.components.entity_linker.kb_loader]
@misc = "spacy.KBFromFile.v1"
kb_path = ${paths.kb}
I can write a script to add the sentence boundaries to the docs manually, but I'm wondering why the sentencizer component isn't doing this for me. Is there something missing in the config?

You haven't put the sentencizer in annotating_components, so the updates it makes aren't visible to other components during training. Take a look at the relevant section in the docs.
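For example, a minimal sketch of the relevant setting (the rest of the [training] block stays as it is):
[training]
annotating_components = ["sentencizer"]
With that in place, the sentencizer's sentence boundaries are set on the docs during training, so the entity_linker can see them.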

Related

How to optimize for a variable that goes into the argument of a function in pyomo?

I am trying to code a first order plus dead time (FOPDT) model and use it for PID tuning. The inspiration for the work is the scipy code from: https://apmonitor.com/pdc/index.php/Main/FirstOrderOptimization
When I use model.Thetam() in the ODE constraint, it does not optimize Thetam; it keeps it at the initial value. When I use only model.Thetam, the code throws an error - ValueError: object arrays are not supported - if I remove it from the uf argument, i.e. model.Km * (uf(tt - model.Thetam)-model.U0)), and if I remove it from the if statement (if tt > model.Thetam), then the error is:
ERROR:pyomo.core:Rule failed when generating expression for Constraint ode with index 0.0: PyomoException: Cannot convert non-constant Pyomo expression (Thetam < 0.0) to bool. This error is usually caused by using a Var, unit, or mutable Param in a Boolean context such as an "if" statement, or when checking container membership or equality.
Code:
# imports needed to run the snippet
import pandas as pd
import matplotlib.pyplot as plt
from scipy.interpolate import interp1d
from pyomo.environ import (ConcreteModel, Var, Param, Constraint, Objective,
                           TransformationFactory, SolverFactory)
from pyomo.dae import ContinuousSet, DerivativeVar

url = 'http://apmonitor.com/pdc/uploads/Main/data_fopdt.txt'
data = pd.read_csv(url)
data = data.iloc[1:]
t = data['time'].values - data['time'].values[0]
u = data['u'].values
yp = data['y'].values
u0 = u[0]
yp0 = yp[0]
yf = interp1d(t, yp)
# specify number of steps
ns = len(t)
delta_t = t[1]-t[0]
# create linear interpolation of the u data versus time
uf = interp1d(t,u,fill_value="extrapolate")
model = ConcreteModel()
model.T = ContinuousSet(initialize = t)
model.Y = Var(model.T)
model.dYdT = DerivativeVar(model.Y, wrt = (model.T))
model.Y[0].fix(yp0)
model.Yp0 = Param(initialize = yp0)
model.U0 = Param(initialize = u0)
model.Km = Var(initialize = 2, bounds = (0.1, 10))
model.Taum = Var(initialize = 3, bounds = (0.1, 10))
model.Thetam = Var(initialize = 0, bounds = (0, 10))
model.ode = Constraint(model.T,
rule = lambda model, tt: model.dYdT[tt] == (-(model.Y[tt]-model.Yp0) + model.Km * (uf(tt - model.Thetam())-model.U0))/model.Taum if tt > model.Thetam()
else model.dYdT[tt] == -(model.Y[tt]-model.Yp0)/model.Taum)
def obj_rule(m):
    return sum((m.Y[i] - yf(i))**2 for i in m.T)
model.obj = Objective(rule = obj_rule)
discretizer = TransformationFactory('dae.finite_difference')
discretizer.apply_to(model, nfe = 500, wrt = model.T, scheme = 'BACKWARD')
opt=SolverFactory('ipopt', executable='/content/ipopt')
opt.solve(model)#, tee = True)
model.pprint()
model2 = ConcreteModel()
model2.T = ContinuousSet(initialize = t)
model2.Y = Var(model2.T)
model2.dYdT = DerivativeVar(model2.Y, wrt = (model2.T))
model2.Y[0].fix(yp0)
model2.Yp0 = Param(initialize = yp0)
model2.U0 = Param(initialize = u0)
model2.Km = Param(initialize = 3.0145871)#3.2648)
model2.Taum = Param(initialize = 1.85862177) # 5.2328)
model2.Thetam = Param(initialize = 0)#2.936839032) #0.1)
model2.ode = Constraint(model2.T,
rule = lambda model, tt: model.dYdT[tt] == (-(model.Y[tt]-model.Yp0) + model.Km * (uf(tt - model.Thetam())-model.U0))/model.Taum)
discretizer2 = TransformationFactory('dae.finite_difference')
discretizer2.apply_to(model2, nfe = 500, wrt = model2.T, scheme = 'BACKWARD')
opt2=SolverFactory('ipopt', executable='/content/ipopt')
opt2.solve(model2)#, tee = True)
# model.pprint()
t = [i for i in model.T]
ypred = [model.Y[i]() for i in model.T]
ytrue = [yf(i) for i in model.T]
yoptim = [model2.Y[i]() for i in model2.T]
plt.plot(t, ypred, 'r-')
plt.plot(t, ytrue)
plt.plot(t, yoptim)
plt.legend(['pred', 'true', 'optim'])

Tensorflow functional API

# imports assumed for this snippet
from tensorflow import keras
from tensorflow.keras import layers

num_tags = 12
num_words = 10000
num_departments = 4
title_input = keras.Input(shape=(None,), name="title")
body_input = keras.Input(shape=(None,), name="body")  # Variable-length sequence of ints
tags_input = keras.Input(shape=(num_tags,), name="tags")
title_features = layers.Embedding(num_words, 64)(title_input)
body_features = layers.Embedding(num_words, 64)(body_input)
title_features = layers.LSTM(128)(title_features)
body_features = layers.LSTM(32)(body_features)
x = layers.concatenate([title_features, body_features, tags_input])
priority_pred = layers.Dense(1, name="priority")(x)
department_pred = layers.Dense(num_departments, name="department")(x)
model = keras.Model(
inputs=[title_input, body_input, tags_input],
outputs=[priority_pred, department_pred],
)
I want to make separate calls for priority and department so that each returns one output instead of two.
Is it possible to do that?
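For reference, one pattern that should work is to build an extra Model over the same graph for each output (a sketch reusing the tensors defined above; the *_data arrays are placeholders for your actual inputs):
priority_model = keras.Model(
    inputs=[title_input, body_input, tags_input], outputs=priority_pred
)
department_model = keras.Model(
    inputs=[title_input, body_input, tags_input], outputs=department_pred
)
# each sub-model shares its weights with `model` and returns a single output
priority_only = priority_model.predict(
    {"title": title_data, "body": body_data, "tags": tags_data}
)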

DQN is not training well

import tensorflow as tf
import keras
import numpy as np
import gym
import random
from keras.layers import *
model = keras.models.Sequential()
model.add(Dense(12,activation = 'tanh',input_shape = (4,)))
model.add(Dense(2,activation = 'linear'))
model.compile(optimizer = 'adam',loss = 'MSE',metrics = ['accuracy'])
env = gym.make("CartPole-v1")
def preprocessing(state):
    return np.reshape(state, (1, 4))
replay_mem = []
replay_size = 32
reward_list = []
local_mean = list()
for episode in range(2000):
    done = False
    state = env.reset()
    count = 0
    reward_tot = 0
    model.save("cartpole-dqn.h5")
    while not done:
        count += 1
        e = (1/(episode/200+1))
        # epsilon greedy search
        Q = model.predict(preprocessing(state))
        if e > np.random.rand(1):
            action = env.action_space.sample()
        else:
            action = np.argmax(Q)
        # take action and set reward
        state_next, reward, done, info = env.step(action)
        if done:
            reward = -100
        # replay_mem
        replay_mem.append([state, action, reward, state_next, done])
        if len(replay_mem) > 2048:
            del replay_mem[0]
        state = state_next
        reward_tot += reward
        state = state_next
        Q_list = []
        state_list = []
        # set this_replay_size
        if len(replay_mem) < replay_size:
            this_replay_size = len(replay_mem)
        else:
            this_replay_size = replay_size
        # sample random batch from replay_memory
        for sample in random.sample(replay_mem, this_replay_size):
            state_m, action_m, reward_m, state_next_m, done_m = sample
            if done:
                Q[0, action] = reward_m
            else:
                Q_new = model.predict(preprocessing(state_next_m))
                Q[0, action] = reward_m + 0.97*np.max(Q_new)
            Q_list.append(Q)
            state_list.append(state_m)
        # convert to numpy array and train
        Q_list = np.array(Q_list)
        state_list = np.array(state_list)
        hist = model.fit(state_list, Q_list, epochs=5, verbose=0)
        # print("Done :", done, " Reward :", reward, " Reward_total :", reward_tot)
    local_mean.append(reward_tot)
    reward_list.append(reward_tot)
    if episode % 10 == 0:
        print("Episode :", episode+1, " Reward_total :", reward_tot, " Reward_local_mean :", np.mean(local_mean))
        local_mean = list()
print("*******End Learning")
This is the full code of my model; I implemented the DQN algorithm.
What's wrong with my code? I've trained this model for about 1000 episodes but there is no progress.
What should I change? Should I train for more episodes, or is there something wrong with my implementation?
I've been working on this project for so long, please help me.
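For comparison, the usual DQN target construction predicts Q for each sampled transition and overwrites only that transition's action value (a rough sketch reusing the names above, not the original code):
for state_m, action_m, reward_m, state_next_m, done_m in random.sample(replay_mem, this_replay_size):
    Q_m = model.predict(preprocessing(state_m))
    if done_m:
        Q_m[0, action_m] = reward_m
    else:
        Q_m[0, action_m] = reward_m + 0.97 * np.max(model.predict(preprocessing(state_next_m)))
    Q_list.append(Q_m[0])
    state_list.append(preprocessing(state_m)[0])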

Train custom NER component with a base model in spaCy v3

I'm having problems training a custom NER component within a base model in spaCy's new version.
So far, I've been training my NER model at CLI with the following command:
python -m spacy train en model training validation --base-model en_core_web_sm --pipeline "ner" -R -n 10
Depending on the use case, I took en_core_web_sm or en_core_web_lg as the base model to make use of the other components, such as the tagger and POS tags.
In spaCy version 3 a config file is required to handle the command at CLI. I'm using the following configurations for training:
[paths]
train = "training/"
dev = "validation/"
vectors = null
init_tok2vec = null
[system]
gpu_allocator = null
seed = 0
[nlp]
lang = "en"
pipeline = ["ner"]
batch_size = 1000
disabled = []
before_creation = null
after_creation = null
after_pipeline_creation = null
tokenizer = {"@tokenizers":"spacy.Tokenizer.v1"}
[components]
[components.ner]
factory = "ner"
moves = null
update_with_oracle_cut_size = 100
[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null
[corpora]
[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
gold_preproc = false
limit = 0
augmenter = null
[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 2000
gold_preproc = false
limit = 0
augmenter = null
[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
seed = ${system.seed}
gpu_allocator = ${system.gpu_allocator}
dropout = 0.1
accumulate_gradient = 1
patience = 1600
max_epochs = 0
max_steps = 20000
eval_frequency = 200
frozen_components = []
before_to_disk = null
[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
get_length = null
[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
t = 0.0
[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false
[training.optimizer]
@optimizers = "Adam.v1"
beta1 = 0.9
beta2 = 0.999
L2_is_weight_decay = true
L2 = 0.01
grad_clip = 1.0
use_averages = false
eps = 0.00000001
learn_rate = 0.001
[training.score_weights]
ents_per_type = null
ents_f = 1.0
ents_p = 0.0
ents_r = 0.0
[pretraining]
[initialize]
vectors = null
init_tok2vec = null
vocab_data = null
lookups = null
before_init = null
after_init = null
[initialize.components]
Since I'm not familiar with spaCy's new version, these are pretty much the default settings.
Unfortunately, I can only train the model from scratch, and I can't find an option anymore to only train the NER component within an existing language model.
I have also tried to add the parser component in the configuration file with
[components]
[components.parser]
source = "en_core_web_sm"
...
But then the model is not even loadable, raising the following error:
nn_parser.pyx in spacy.syntax.nn_parser.Parser.from_disk()
nn_parser.pyx in spacy.syntax.nn_parser.Parser.Model()
TypeError: Model() takes exactly 1 positional argument (0 given)
In spaCy 3.0, what you want to do first is initialize your config file with the components that you need:
python -m spacy init config config.cfg --lang en --pipeline tagger,parser,ner,attribute_ruler,senter,lemmatizer,tok2vec
Then, you want to go to config.cfg and override settings. For example, you can use vectors from an existing model:
[initialize]
vectors = "en_core_veb_lg"
init_tok2vec = null
vocab_data = null
lookups = null
before_init = null
after_init = null
Then you can run the train command:
python -m spacy train config.cfg --paths.train ./path_to_your_train_data.spacy --paths.dev ./path_to_your_validation_data.spacy --output ./your_model_name
I also found that it's possible to just go to the model folder and swap out components manually, as well as to load different components from different models into a single pipeline in code.
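In code, that can look roughly like this (a sketch; "my_custom_model" is a placeholder for your own trained pipeline):
import spacy
# load the base pipeline without its ner, then source a trained ner from another pipeline
nlp = spacy.load("en_core_web_lg", exclude=["ner"])
source_nlp = spacy.load("my_custom_model")
nlp.add_pipe("ner", source=source_nlp)
Sourced components expect the receiving pipeline to have compatible vectors, so it is usually safest to source from pipelines trained on the same base.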
If you need to use a component from an existing model, you can use the following setting in your config.cfg:
[components.tagger]
source = "en_core_web_lg"
For more info on using existing models and components, see the spaCy documentation.

Generating a value in a step function based on a variable

I'm creating an optimization model using Gurobi and am having trouble with one of my constraints. The constraint is used to establish the quantity and is based on supply and demand curves. The supply curve causes the problem, as it is a step curve. As seen in the code, the problem arises when I'm writing the def MC section.
# import needed for the snippet below
import pyomo.environ as pyo

Demand_Curve1_const = 250
Demand_Curve1_slope = -0.025
MC_water = 0
MC_gas = 80
MC_coal = 100
CAP_water = 5000
CAP_gas = 2500
CAP_coal = 2000
model = pyo.ConcreteModel()
model.Const_P1 = pyo.Param(initialize = Demand_Curve1_const)
model.slope_P1 = pyo.Param(initialize = Demand_Curve1_slope)
model.MCW = pyo.Param(initialize = MC_water)
model.MCG = pyo.Param(initialize = MC_gas)
model.MCC = pyo.Param(initialize = MC_coal)
model.CW = pyo.Param(initialize = CAP_water)
model.CG = pyo.Param(initialize = CAP_gas)
model.CC = pyo.Param(initialize = CAP_coal)
model.qw = pyo.Var(within = pyo.NonNegativeReals)
model.qg = pyo.Var(within = pyo.NonNegativeReals)
model.qc = pyo.Var(within = pyo.NonNegativeReals)
model.d = pyo.Var(within = pyo.NonNegativeReals)
def MC():
    if model.d <= 5000:
        return model.MCW
    if model.d >= 5000 and model.d <= 7500:
        return model.MCG
    if model.d >= 7500:
        return model.MCC
def Objective(model):
    return(model.Const_P1*model.d + model.slope_P1*model.d*model.d - (model.MCW*model.qw + model.MCG*model.qg + model.MCC*model.qc))
model.OBJ = pyo.Objective(rule = Objective, sense = pyo.maximize)
def P1inflow(model):
    return(MC == model.Const_P1+model.slope_P1*model.d*2)
model.C1 = pyo.Constraint(rule = P1inflow)
Your function MC as stated would make the model nonlinear, and in a rather nasty way (discontinuous).
Piecewise linear functions are often modeled through binary variables or SOS2 sets (Special Ordered Sets of type 2). As you are using Pyomo, you can also use a tool that can generate MIP formulations automatically for you. See help(Piecewise).
An example that fits your description is here.
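A rough sketch of what that can look like, approximating the step curve with a piecewise linear one (the names and numbers here are only illustrative, not taken from your model):
import pyomo.environ as pyo

m = pyo.ConcreteModel()
m.d = pyo.Var(bounds=(0, 9500))   # quantity
m.mc = pyo.Var()                  # marginal cost as a function of quantity

breakpoints = [0, 5000, 7500, 9500]
values = [0, 0, 80, 100]          # cost at each breakpoint (illustrative)

# Piecewise generates the SOS2/MIP formulation linking m.mc to f(m.d)
m.mc_curve = pyo.Piecewise(m.mc, m.d,
                           pw_pts=breakpoints,
                           f_rule=values,
                           pw_constr_type='EQ',
                           pw_repn='SOS2')
The generated constraints can then be combined with the rest of the model instead of the Python-level if statements in def MC.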