Error parsing config overrides - `nlp.tokenizer` section does not exist - spacy

I encountered a weird problem in spaCy and I am not sure whether I am doing something wrong or it is a genuine bug.
I use spaCy projects and create a default config file via:
python -m spacy init config spacy.cfg
Then I try to load an NLP object using this config:
import spacy
config = spacy.util.load_config('./spacy.cfg')
nlp = spacy.load("en_core_web_sm", config=config)
When I do this I get the following error:
ConfigValidationError:
Error parsing config overrides
nlp -> tokenizer -> @tokenizers not a section value that can be overwritten
Stepping through with pdb, I noticed that the section nlp.tokenizer is not created. Instead the config stores the following ugly item within the nlp section:
'tokenizer': '{"@tokenizers":"spacy.Tokenizer.v1"}'
which does not look right.
I am using Spacy v3.0.3 on Ubuntu 20.04.2 LTS.

There's nothing wrong with that tokenizer setting. en_core_web_sm already contains its own config including a tokenizer, which is what you can't override.
You want to load your config starting from a blank pipeline instead of a pretrained pipeline:
nlp = spacy.blank("en", config=config)
Be aware that the language en here needs to match the language setting you used with spacy init config, or it won't be able to load the config.
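Putting the pieces together, a minimal sketch of the intended workflow (assuming the config was generated with the init config command shown above):
import spacy

# Load the config generated by: python -m spacy init config spacy.cfg
config = spacy.util.load_config('./spacy.cfg')

# Start from a blank pipeline; 'en' must match the language the config was
# generated for, otherwise spaCy cannot apply the config.
nlp = spacy.blank('en', config=config)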

Related

Streamlit Cloud OSError: [E053] Could not read config file

I am deploying an app that specifically requires spacy==3.3.1 to Streamlit Cloud. I added it to requirements.txt, along with the link to download and install en_core_web_sm, which is
https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz#egg=en_core_web_sm.
The import is okay, but when I load the model, i.e. nlp = spacy.load("en_core_web_sm"), I get the OSError below:
OSError: [E053] Could not read config file from /home/appuser/venv/lib/python3.9/site-packages/en_core_web_sm/en_core_web_sm-2.2.0/config.cfg
What am I doing wrong?
Thanks in advance for your help.
My requirements.txt:
spacy==3.3.1
https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.0/en_core_web_sm-2.2.0.tar.gz#egg=en_core_web_sm
Main code:
import spacy
import en_core_web_sm
nlp = spacy.load("en_core_web_sm")
I expected no error but got the OSError: [E053] above.
Packages trained with spaCy v2.x are not compatible with spaCy v3.x (source).
Note that the model has version 2.2.0, but spacy has version 3.3.1.
I'd suggest using a newer version of the model. The requirements could look like:
spacy==3.3.1
en_core_web_sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.3.0/en_core_web_sm-3.3.0-py3-none-any.whl
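As a quick sanity check after deployment, you can print the versions that actually got installed; nlp.meta is standard spaCy API, and the values in the comments are just what the pins above should produce:
import spacy

nlp = spacy.load('en_core_web_sm')
print(spacy.__version__)    # expected: 3.3.1 from requirements.txt
print(nlp.meta['version'])  # expected: 3.3.0, i.e. a model built for spaCy v3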

tensorflow data validation tfdv fails on google cloud dataflow with "Can't get attribute 'NumExamplesStatsGenerator' "

I am following this "get started" tensorflow tutorial on how to run tfdv on apache beam on google cloud dataflow. My code is very similar to the one in the tutorial:
import tensorflow_data_validation as tfdv
from apache_beam.options.pipeline_options import PipelineOptions, GoogleCloudOptions, StandardOptions, SetupOptions, WorkerOptions
PROJECT_ID = 'my-project-id'
JOB_NAME = 'my-job-name'
REGION = "europe-west3"
NETWORK = "regions/europe-west3/subnetworks/mysubnet"
GCS_STAGING_LOCATION = 'gs://my-bucket/staging'
GCS_TMP_LOCATION = 'gs://my-bucket/tmp'
GCS_DATA_LOCATION = 'gs://another-bucket/my-data.CSV'
# GCS_STATS_OUTPUT_PATH is the file path to which to output the data statistics
# result.
GCS_STATS_OUTPUT_PATH = 'gs://my-bucket/stats'
# downloaded locally with: pip download tensorflow_data_validation --no-deps --platform manylinux2010_x86_64 --only-binary=:all:
# (it would be great to have this on Cloud Storage instead) PATH_TO_WHL_FILE = 'gs://my-bucket/wheels/tensorflow_data_validation-1.7.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl'
PATH_TO_WHL_FILE = '/Users/myuser/some-folder/tensorflow_data_validation-1.7.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl'
# Create and set your PipelineOptions.
options = PipelineOptions()
# For Cloud execution, set the Cloud Platform project, job_name,
# staging location, temp_location and specify DataflowRunner.
google_cloud_options = options.view_as(GoogleCloudOptions)
google_cloud_options.project = PROJECT_ID
google_cloud_options.job_name = JOB_NAME
google_cloud_options.staging_location = GCS_STAGING_LOCATION
google_cloud_options.temp_location = GCS_TMP_LOCATION
google_cloud_options.region = REGION
options.view_as(StandardOptions).runner = 'DataflowRunner'
setup_options = options.view_as(SetupOptions)
# PATH_TO_WHL_FILE should point to the downloaded tfdv wheel file.
setup_options.extra_packages = [PATH_TO_WHL_FILE]
# Worker options
worker_options = options.view_as(WorkerOptions)
worker_options.subnetwork = NETWORK
worker_options.max_num_workers = 2
print("Generating stats...")
tfdv.generate_statistics_from_tfrecord(GCS_DATA_LOCATION, output_path=GCS_STATS_OUTPUT_PATH, pipeline_options=options)
print("Stats generated!")
The code above starts a dataflow job but unfortunately it fails with the following error:
apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/apache_beam/internal/dill_pickler.py", line 285, in loads
return dill.loads(s)
File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 275, in loads
return load(file, ignore, **kwds)
File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 270, in load
return Unpickler(file, ignore=ignore, **kwds).load()
File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 472, in load
obj = StockUnpickler.load(self)
File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 462, in find_class
return StockUnpickler.find_class(self, module, name)
AttributeError: Can't get attribute 'NumExamplesStatsGenerator' on <module 'tensorflow_data_validation.statistics.stats_impl' from '/usr/local/lib/python3.8/site-packages/tensorflow_data_validation/statistics/stats_impl.py'>
I couldn't find anything similar on the internet.
If it helps, on my local machine (macOS) I have the following versions:
Apache Beam version: 2.34.0
Tensorflow version: 2.6.2
TensorFlow Transform version: 1.4.0
TFDV version: 1.4.0
Apache beam on cloud runs with Apache Beam Python 3.8 SDK 2.34.0
BONUS QUESTION: Another question I have is around the PATH_TO_WHL_FILE. I tried to put it on a storage bucket but Beam doesn't seem to be able to pick it up. Only locally, which is actually a problem, because it would make it more difficult to distribute this code. What would be a good practice to distribute this wheel file?
Based on the name of the attribute, NumExamplesStatsGenerator is a generator that is not picklable.
But I couldn't find that attribute in the module's current release; a search indicates that 1.4.0 still contains it.
So you may want to try a newer version of TFDV locally, so that the version you pickle the pipeline with matches the 1.7.0 wheel the workers install.
PATH_TO_WHL_FILE indicates a file to stage/distribute to Dataflow for execution, so you can use a file on GCS.
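To tie the two points together, a minimal sketch (the GCS wheel path is a placeholder; the key ideas are keeping the local TFDV version in sync with the staged wheel and pointing extra_packages at a gs:// path):
import tensorflow_data_validation as tfdv
from apache_beam.options.pipeline_options import PipelineOptions, SetupOptions

# The workers install TFDV from the staged wheel, so the local version used to
# build (and pickle) the pipeline should match it.
assert tfdv.__version__ == '1.7.0'

options = PipelineOptions()
setup_options = options.view_as(SetupOptions)
# extra_packages can reference a wheel stored on Cloud Storage.
setup_options.extra_packages = [
    'gs://my-bucket/wheels/tensorflow_data_validation-1.7.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl'
]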

OSError: [E053] Could not read config.cfg from C:\Users

I'm trying to run spaCy's lemmatizer on a text by running nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"]), but then I get the following error:
OSError: [E053] Could not read config.cfg from C:\Users.
I'm using spaCy version 3.2.1, and I also installed en-core-web-sm-2.2.0.
I also get this warning, but I'm not sure what it means:
UserWarning: [W094] Model 'en_core_web_sm' (2.2.0) specifies an under-constrained spaCy version requirement: >=2.2.0. This can lead to compatibility problems with older versions, or as new spaCy versions are released, because the model may say it's compatible when it's not. Consider changing the "spacy_version" in your meta.json to a version range, with a lower and upper pin. For example: >=3.2.1,<3.3.0.
Hope someone can help me.
A v2 spaCy model won't work with spaCy v3 - you need to update your model. Do this:
spacy download en_core_web_sm
The error could be easier to understand, but it's not a situation that comes up much - usually it only happens if you upgrade from spaCy v2 to v3 without also upgrading your models. Not sure how you got into that state.
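If you're unsure which installed packages are out of date, spaCy's built-in validate command lists the installed pipeline packages and whether they are compatible with the installed spaCy version:
python -m spacy validate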

Is there any way to load FaceNet model as a tf.keras.layers.Layer using Tensorflow 2.3?

I want to use FaceNet as an embedding layer (which won't be trainable).
I tried loading FaceNet like so:
import tensorflow as tf
model = tf.keras.models.load_model('./path/tf_facenet')
where the directory ./path/tf_facenet contains the 4 files that can be downloaded at https://drive.google.com/file/d/0B5MzpY9kBtDVZ2RpVDYwWmxoSUk/edit
but an error message shows up:
OSError: SavedModel file does not exist at: ./path/tf_facenet/{saved_model.pbtxt|saved_model.pb}
And the h5 files downloaded from https://github.com/nyoki-mtl/keras-facenet don't seem to work either (they use TensorFlow 1.3).
I had the same issue when loading the facenet-keras model. Maybe your Python environment is missing the h5py module, so you should install it:
conda install h5py
Hope you succeed!
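If the .h5 model does load in your environment, here is a minimal sketch of using it as a frozen embedding network; the file name facenet_keras.h5, the 160x160x3 input shape, and the small classification head are assumptions based on the keras-facenet release, not something verified against your files:
import tensorflow as tf

# Assumed file from the keras-facenet repo; adjust the path to your download.
facenet = tf.keras.models.load_model('facenet_keras.h5', compile=False)
facenet.trainable = False  # freeze FaceNet so only the new head is trained

inputs = tf.keras.Input(shape=(160, 160, 3))  # FaceNet's expected input size
embeddings = facenet(inputs, training=False)  # face embedding vector
outputs = tf.keras.layers.Dense(1, activation='sigmoid')(embeddings)
model = tf.keras.Model(inputs, outputs)
model.summary()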

MultiWorkerMirroredStrategy() not working on Google AI-Platform (CMLE)

I'm getting the following error while using MultiWorkerMirroredStrategy() for training Custom Estimator on Google AI-Platform (CMLE).
ValueError: Unrecognized task_type: 'master', valid task types are: "chief", "worker", "evaluator" and "ps".
Both MirroredStrategy() and ParameterServerStrategy() are working fine on AI Platform with their respective config.yaml files. I'm currently not providing device scopes for any operations, nor am I providing any device filter in the session config, tf.ConfigProto(device_filters=device_filters).
The config.yaml file which I'm using for training with MultiWorkerMirroredStrategy() is:
trainingInput:
  scaleTier: CUSTOM
  masterType: standard_gpu
  workerType: standard_gpu
  workerCount: 4
The masterType input is mandatory for submitting the training job on AI Platform.
Note: it's showing 'chief' as a valid task type and 'master' as invalid. I'm providing tensorflow-gpu==1.14.0 in setup.py for the trainer package.
I ran into the same issue. As far as I understand, the MultiWorkerMirroredStrategy config values are different from the other strategies and from what CMLE provides by default: https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras#multi-worker_configuration
It doesn't support a 'master' node; it calls it 'chief' instead.
If you are running your jobs in a container, you can try using the 'useChiefInTfConfig' flag; see the documentation here: https://developers.google.com/resources/api-libraries/documentation/ml/v1/python/latest/ml_v1.projects.jobs.html
Otherwise you might try hacking your TF_CONFIG manually:
import os

TF_CONFIG = os.environ.get('TF_CONFIG')
if TF_CONFIG and '"master"' in TF_CONFIG:
    os.environ['TF_CONFIG'] = TF_CONFIG.replace('"master"', '"chief"')
(1) This then appears to be a bug with MultiWorkerMirroredStrategy. Please file a bug in TensorFlow. In TensorFlow 1.x it should be using 'master', and in TensorFlow 2.x it should be using 'chief'. The code is (wrongly) asking for 'chief', and AI Platform (because you are using 1.14) is providing only 'master'. Incidentally: master = chief + evaluator.
(2) Do not add tensorflow to your setup.py. Instead, specify the TensorFlow version you want AI Platform to use via the --runtime-version flag to gcloud (see https://cloud.google.com/ml-engine/docs/runtime-version-list).
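For example, a submission could look roughly like this; the job name, bucket, and package layout are placeholders, and the point is that --runtime-version (not setup.py) pins TensorFlow:
gcloud ai-platform jobs submit training my_job \
  --runtime-version 1.14 \
  --python-version 3.5 \
  --region europe-west3 \
  --config config.yaml \
  --job-dir gs://my-bucket/job-dir \
  --package-path trainer/ \
  --module-name trainer.task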