SpaCy use Lemmatizer as stand-alone component - spacy

I want to use SpaCy's lemmatizer as a standalone component (because I have pre-tokenized text, and I don't want to re-concatenate it and run the full pipeline because SpaCy will most likely tokenize differently in some cases).
I found the lemmatizer in the package but I somehow needs to load the dictionaries with the rules to initialize this Lemmatizer.
These files must be somewhere in the model of the English or German model, right? I couldn't find them there.
from spacy.lemmatizer import Lemmatizer
where do the LEMMA_INDEX, etc. files are comming from?
lemmatizer = Lemmatizer(LEMMA_INDEX, LEMMA_EXC, LEMMA_RULES)
I found a similar question here: Spacy lemmatizer issue/consistency
but this one did not entirely answer how to get these dictionary files from the model. The spacy.lang.* parameter seems to no longer exist in newer versions.

Here's an extracted bit of code I had, that used the SpaCy lemmatizer by itself. I'm not somewhere I can run it so it might have a small bug or two if I made an editing mistake.
Note that in general, you need to know the upos for the word in order to lemmatize correctly. This code will return all the possible lemmas but I would advise modifying it to pass in the correct upos for your word.
class SpacyLemmatizer(object):
def __init__(self, smodel):
import spacy
self.lemmatizer = spacy.load(smodel).vocab.morphology.lemmatizer
# get the lemmas for every upos
def getLemmas(self, entry):
possible_lemmas = set()
for upos in ('NOUN', 'VERB', 'ADJ', 'ADV'):
lemmas = self.lemmatizer(entry, upos, morphology=None)
lemma = lemmas[0] # See morphology.pyx::lemmatize
possible_lemmas.add( lemma )
return possible_lemmas

Related

when trying to load external tfrecord with TFDS, given tf.train.Example, how to get tfds.features?

What I need help with / What I was wondering
Hi, I am trying to load external tfrecord files with TFDS. I have read the official doc here, and find I need to define the feature structure using tfds.features. However, since the tfrecords files are alreay generated, I do not have control the generation pipeline. I do, however, know the tf.train.Example structre used in TFRecordWriter during generation, shown as follows.
from tensorflow.python.training.training import BytesList, Example, Feature, Features, Int64List
dict(Example=Features({
'image': Feature(bytes_list=BytesList(value=[img_str])), # img_str is jpg encoded image raw bytes
'caption': Feature(bytes_list=BytesList(value=[caption])), # caption is a string
'height': Feature(bytes_list=Int64List(value=[caption])),
'width': Feature(bytes_list=Int64List(value=[caption])),
})
The doc only describes how to translate tfds.features into the human readable structure of the tf.train.Example. But nowhere does it mention how to translate a tf.train.Example into tfds.features, which is needed to automatically add the proper metadata fileswith tfds.folder_dataset.write_metadata.
I wonder how to translate the above tf.train.Example into tfds.features? Thanks a lot!
BTW, while I understand that it is possible to directly read the data as it is in TFRecord with tf.data.TFRecordDataset and then use map(decode_fn) for decoding as suggested here, it seems to me this approach lacks necessary metadata like num_shards or shard_lengths. In this case, I am not sure if it is still ok to use common operations like cache/repeat/shuffle/map/batch on that tf.data.TFRecordDataset. So I think it is better to stick to the tfds approach.
What I've tried so far
I have searched the official doc for quite some time but cannot find the answer. There is a Scalar class in tfds.features, which I assume could be used to decode Int64List. But How can I decode the BytesList?
Environment information
tensorflow-datasets version: 4.8.2
tensorflow version: 2.11.0
After some searching, I find the simplest solution is
features = tfds.features.FeaturesDict({
'image': tfds.features.Image(), # << Ideally best if add the `shape=(h, w, c)` info
'caption': tfds.features.Text(),
'height': tf.int32,
'width': tf.int32,
})
Then I can load the data either with tfds.folder_dataset.write_metadata or directly with:
ds = tf.data.TFRecordDataset()
ds = ds.map(features.deserialize_example)
batch, shuffle ,... will work as expected in both cases.
The TFDS metadata would be helpful if fine-grained split control is needed.

How to use tf.data.Dataset.ignore_errors to ignore errors in a Tensorflow Dataset?

When loading images from a directory in Tensorflow, you use something like:
dataset = tf.keras.utils.image_dataset_from_directory(
"S:\\Images",
batch_size=32,
image_size=(128,128),
label_mode=None,
validation_split=0.20, #Reserve 20% of images for validation
subset='training', #If we specify a validation_split, we *must* specify subset
seed=619 #If using validation_split we *must* specify a seed to ensure there is no overlap between training and validation data
)
But of course some of the images (.jpg, .png, .gif, .bmp) will be invalid. So we want to ignore those errors; just skip them (and ideally log the filenames so they can be repaired, removed, or deleted).
There have been some ideas along the way of how to ignore invalid images:
Method 1: tf.contrib.data.ignore_errors (Tensorflow 1.x only)
Warning: The tf.contrib module will not be included in TensorFlow 2.0.
Sample usage:
dataset = dataset.apply(tf.contrib.data.ignore_errors())
The only down-side of this method is that it was only available in Tensorflow 1. Trying to use it today simply won't work, as the tf.contib namespace no longer exists. That led to a built-in method:
Method 2: tf.data.experimental.ignore_errors(log_warning=False) (deprecated)
From the documentation:
Creates a Dataset from another Dataset and silently ignores any errors. (deprecated)
Deprecated: THIS FUNCTION IS DEPRECATED. It will be removed in a future version. Instructions for updating: Use tf.data.Dataset.ignore_errors instead.
Sample usage:
dataset = dataset.apply(tf.data.experimental.ignore_errors(log_warning=True))
And this method works. It works great. And it has the advantage of working.
But it's apparently deprecated, and they documentation says we should use method 3:
Method 3 - tf.data.Dataset.ignore_errors(log_warning=False, name=None)
Drops elements that cause errors.
Sample usage:
dataset = dataset.ignore_errors(log_warning=True, name="Loading images from directory")
Except it doesn't work
The dataset.ignore_errors attribute doesn't work, and gives the error:
AttributeError: 'BatchDataset' object has no attribute 'ignore_errors'
Which means:
the thing that works is deprecated
they tell us to use this other thing instead
and "provide the instructions for updating"
but the other thing doesn't work
So we ask Stackoverflow:
How do i use tf.data.Dataset.ignore_errors to ignore errors?
Bonus Reading
TensorFlow Dataset `.map` - Is it possible to ignore errors?
TensorFlow: How to skip broken data
Untested Workaround
Not only is it not what i was asking, but people are not allowed to read this:
It looks like the tf.data.Dataset.ignore_errors() method is not
available in the BatchDataset object, which is what you are using in
your code. You can try using tf.data.Dataset.filter() to filter out
elements that cause errors when loading the images. You can use a
try-except block inside the lambda function passed to filter() to
catch the errors and return False for elements that cause errors,
which will filter them out. Here's an example of how you can use
filter() to achieve this:
def filter_fn(x):
try:
# Load the image and do some processing
# Return True if the image is valid, False otherwise
return True
except:
return False
dataset = dataset.filter(filter_fn)
Alternatively, you can use the tf.data.experimental.ignore_errors()
method, which is currently available in TensorFlow 2.x. This method
will silently ignore any errors that occur while processing the
elements of the dataset. However, keep in mind that this method is
experimental and may be removed or changed in a future version.
tf.data.Dataset.ignore_errors was introduced in TensorFlow 2.11. You can use tf.data.experimental.ignore_errors for older versions like so:
dataset.apply(tf.data.experimental.ignore_errors())

How not to get "datum" as the lemma for "data" when using Spacy?

I've run into a quite common word "data" which gets assigned a lemma "datum" from lookups exceptions table spacy uses. I understand that the lemma is technically correct, but in today's english, "data" in its basic form is just "data".
I am using the lemmas to get a sort of keywords from text and if I have a text about data, I can't possibly tag it with "datum".
I was wondering if there is another way to arrive at plain "data" then constructing another "my_exceptions" dictionary used for overriding post-processing.
Thanks for any suggestions.
You could use Lemminflect which works as an add-in pipeline component for SpaCy. It should give you better results.
To use it with SpaCy, just import lemminflect and call the new ._.lemma() function on the Token, ie.. token._.lemma(). Here's an example..
import lemminflect
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp('I got the data')
for token in doc:
print('%-6s %-6s %s' % (token.text, token.lemma_, token._.lemma()))
I -PRON- I
got get get
the the the
data datum data
Lemminflect has a prioritized list of lemmas, based on occurrence in corpus data. You can see all lemmas with...
print(lemminflect.getAllLemmas('data'))
{'NOUN': ('data', 'datum')}
It's relatively easy to customize the lemmatizer once you know where to look. The original lemmatizer tables are from the package spacy-lookups-data and are loaded in the model under nlp.vocab.lookups. You can use a local install of spacy-lookups-data to customize the tables for new/blank models, but if you just want to make a few modifications to the entries for an existing model, you can modify the lemmatizer tables on the fly.
Depending on whether your pipeline includes a tagger, the lemmatizer may be referring to rules+exceptions (with a tagger) or to a simple lookup table (without a tagger), both of which include an exception that lemmatizes data to datum by default. If you remove this exception, you should get data as the lemma for data.
For a pipeline that includes a tagger (rule-based lemmatizer)
# tested with spaCy v2.2.4
import spacy
nlp = spacy.load("en_core_web_sm")
# remove exception from rule-based exceptions
lemma_exc = nlp.vocab.lookups.get_table("lemma_exc")
del lemma_exc[nlp.vocab.strings["noun"]]["data"]
assert nlp.vocab.morphology.lemmatizer("data", "NOUN") == ["data"]
# "data" with the POS "NOUN" has the lemma "data"
doc = nlp("data")
doc[0].pos_ = "NOUN" # just to make sure the POS is correct
assert doc[0].lemma_ == "data"
For a pipeline without a tagger (simple lookup lemmatizer)
import spacy
nlp = spacy.blank("en")
# remove exception from lookups
lemma_lookup = nlp.vocab.lookups.get_table("lemma_lookup")
del lemma_lookup[nlp.vocab.strings["data"]]
assert nlp.vocab.morphology.lemmatizer("data", "") == ["data"]
doc = nlp("data")
assert doc[0].lemma_ == "data"
For both: save model for future use with these modifications included in the lemmatizer tables
nlp.to_disk("/path/to/model")
Also be aware that the lemmatizer uses a cache, so make any changes before running your model on any texts or you may run into problems where it returns lemmas from the cache rather than the updated tables.

TFX StatisticsGen for image data

Hi I've trying to get a TFX Pipeline going just as an exercise really. I'm using ImportExampleGen to load TFRecords from disk. Each Example in the TFRecord contains a jpg in the form of a byte string, height, width, depth, steering and throttle labels.
I'm trying to use StatisticsGen but I'm receiving this warning;
WARNING:root:Feature "image_raw" has bytes value "None" which cannot be decoded as a UTF-8 string. and crashing my Colab Notebook. As far as I can tell all the byte-string images in the TFRecord are not corrupt.
I cannot find concrete examples on StatisticsGen and handling image data. According to the docs Tensorflow Data Validation can deal with image data.
In addition to computing a default set of data statistics, TFDV can also compute statistics for semantic domains (e.g., images, text). To enable computation of semantic domain statistics, pass a tfdv.StatsOptions object with enable_semantic_domain_stats set to True to tfdv.generate_statistics_from_tfrecord.
But I'm not sure how this fits in with StatisticsGen.
Here is the code that instantiates an ImportExampleGen then the StatisticsGen
from tfx.utils.dsl_utils import tfrecord_input
from tfx.components.example_gen.import_example_gen.component import ImportExampleGen
from tfx.proto import example_gen_pb2
examples = tfrecord_input(_tf_record_dir)
# https://www.tensorflow.org/tfx/guide/examplegen#custom_inputoutput_split
# has a good explanation of splitting the data the 'output_config' param
# Input train split is _tf_record_dir/*'
# Output 2 splits: train:eval=8:2.
train_ratio = 8
eval_ratio = 10-train_ratio
output = example_gen_pb2.Output(
split_config=example_gen_pb2.SplitConfig(splits=[
example_gen_pb2.SplitConfig.Split(name='train',
hash_buckets=train_ratio),
example_gen_pb2.SplitConfig.Split(name='eval',
hash_buckets=eval_ratio)
]))
example_gen = ImportExampleGen(input=examples,
output_config=output)
context.run(example_gen)
statistics_gen = StatisticsGen(
examples=example_gen.outputs['examples'])
context.run(statistics_gen)
Thanks in advance.
From git issue response
Thanks Evan Rosen
Hi Folks,
The warnings you are seeing indicate that StatisticsGen is trying to treat your raw image features like a categorical string feature. The image bytes are being decoded just fine. The issue is that when the stats (including top K examples) are being written, the output proto is expecting a UTF-8 valid string, but instead gets the raw image bytes. Nothing is wrong with your setups from what I can tell, but this is just an unintended side-effect of a well-intentioned warning in the event that you have a categorical string feature which can't be serialized. We'll look into finding a better default that handles image data more elegantly.
In the meantime, to tell StatisticsGen that this feature is really an opaque blob, you can pass in a user-modified schema as described in the StatsGen docs. To generate this schema, you can run StatisticsGen and SchemaGen once (on a sample of data) and then modify the inferred schema to annotate that image features. Here is a modified version of the colab from #tall-josh:
Open In Colab
The additional steps are a bit verbose, but having a curated schema is often a good practice for other reasons. Here is the cell that I added to the notebook:
from google.protobuf import text_format
from tensorflow.python.lib.io import file_io
from tensorflow_metadata.proto.v0 import schema_pb2
# Load autogenerated schema (using stats from small batch)
schema = tfx.utils.io_utils.SchemaReader().read(
tfx.utils.io_utils.get_only_uri_in_dir(
tfx.types.artifact_utils.get_single_uri(schema_gen.outputs['schema'].get())))
# Modify schema to indicate which string features are images.
# Ideally you would persist a golden version of this schema somewhere rather
# than regenerating it on every run.
for feature in schema.feature:
if feature.name == 'image/raw':
feature.image_domain.SetInParent()
# Write modified schema to local file
user_schema_dir ='/tmp/user-schema/'
tfx.utils.io_utils.write_pbtxt_file(
os.path.join(user_schema_dir, 'schema.pbtxt'), schema)
# Create ImportNode to make modified schema available to other components
user_schema_importer = tfx.components.ImporterNode(
instance_name='import_user_schema',
source_uri=user_schema_dir,
artifact_type=tfx.types.standard_artifacts.Schema)
# Run the user schema ImportNode
context.run(user_schema_importer)
Hopefully you find this workaround is useful. In the meantime, we'll take a look at a better default experience for image-valued features.
Groked this and found the solution to be dramatically simpler than i thought...
from tfx.orchestration.experimental.interactive.interactive_context import InteractiveContext
import logging
...
logger = logging.getLogger()
logger.setLevel(logging.CRITICAL)
...
context = InteractiveContext(pipeline_name='my_pipe')
...
c = StatisticsGen(...)
...
context.run(c)

TensorFlow: Opening log data written by SummaryWriter

After following this tutorial on summaries and TensorBoard, I've been able to successfully save and look at data with TensorBoard. Is it possible to open this data with something other than TensorBoard?
By the way, my application is to do off-policy learning. I'm currently saving each state-action-reward tuple using SummaryWriter. I know I could manually store/train on this data, but I thought it'd be nice to use TensorFlow's built in logging features to store/load this data.
As of March 2017, the EventAccumulator tool has been moved from Tensorflow core to the Tensorboard Backend. You can still use it to extract data from Tensorboard log files as follows:
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator
event_acc = EventAccumulator('/path/to/summary/folder')
event_acc.Reload()
# Show all tags in the log file
print(event_acc.Tags())
# E. g. get wall clock, number of steps and value for a scalar 'Accuracy'
w_times, step_nums, vals = zip(*event_acc.Scalars('Accuracy'))
Easy, the data can actually be exported to a .csv file within TensorBoard under the Events tab, which can e.g. be loaded in a Pandas dataframe in Python. Make sure you check the Data download links box.
For a more automated approach, check out the TensorBoard readme:
If you'd like to export data to visualize elsewhere (e.g. iPython
Notebook), that's possible too. You can directly depend on the
underlying classes that TensorBoard uses for loading data:
python/summary/event_accumulator.py (for loading data from a single
run) or python/summary/event_multiplexer.py (for loading data from
multiple runs, and keeping it organized). These classes load groups of
event files, discard data that was "orphaned" by TensorFlow crashes,
and organize the data by tag.
As another option, there is a script
(tensorboard/scripts/serialize_tensorboard.py) which will load a
logdir just like TensorBoard does, but write all of the data out to
disk as json instead of starting a server. This script is setup to
make "fake TensorBoard backends" for testing, so it is a bit rough
around the edges.
I think the data are encoded protobufs RecordReader format. To get serialized strings out of files you can use py_record_reader or build a graph with TFRecordReader op, and to deserialize those strings to protobuf use Event schema. If you get a working example, please update this q, since we seem to be missing documentation on this.
I did something along these lines for a previous project. As mentioned by others, the main ingredient is tensorflows event accumulator
from tensorflow.python.summary import event_accumulator as ea
acc = ea.EventAccumulator("folder/containing/summaries/")
acc.Reload()
# Print tags of contained entities, use these names to retrieve entities as below
print(acc.Tags())
# E. g. get all values and steps of a scalar called 'l2_loss'
xy_l2_loss = [(s.step, s.value) for s in acc.Scalars('l2_loss')]
# Retrieve images, e. g. first labeled as 'generator'
img = acc.Images('generator/image/0')
with open('img_{}.png'.format(img.step), 'wb') as f:
f.write(img.encoded_image_string)
You can also use the tf.train.summaryiterator: To extract events in a ./logs-Folder where only classic scalars lr, acc, loss, val_acc and val_loss are present you can use this GIST: tensorboard_to_csv.py
Chris Cundy's answer works well when you have less than 10000 data points in your tfevent file. However, when you have a large file with over 10000 data points, Tensorboard will automatically sampling them and only gives you at most 10000 points. It is a quite annoying underlying behavior as it is not well-documented. See https://github.com/tensorflow/tensorboard/blob/master/tensorboard/backend/event_processing/event_accumulator.py#L186.
To get around it and get all data points, a bit hacky way is to:
from tensorboard.backend.event_processing.event_accumulator import EventAccumulator
class FalseDict(object):
def __getitem__(self,key):
return 0
def __contains__(self, key):
return True
event_acc = EventAccumulator('path/to/your/tfevents',size_guidance=FalseDict())
It looks like for tb version >=2.3 you can streamline the process of converting your tb events to a pandas dataframe using tensorboard.data.experimental.ExperimentFromDev().
It requires you to upload your logs to TensorBoard.dev, though, which is public. There are plans to expand the capability to locally stored logs in the future.
https://www.tensorflow.org/tensorboard/dataframe_api
You can also use the EventFileLoader to iterate through a tensorboard file
from tensorboard.backend.event_processing.event_file_loader import EventFileLoader
for event in EventFileLoader('path/to/events.out.tfevents.xxx').Load():
print(event)
Surprisingly, the python package tb_parse has not been mentioned yet.
From documentation:
Installation:
pip install tensorflow # or tensorflow-cpu pip install -U tbparse # requires Python >= 3.7
Note: If you don't want to install TensorFlow, see Installing without TensorFlow.
We suggest using an additional virtual environment for parsing and plotting the tensorboard events. So no worries if your training code uses Python 3.6 or older versions.
Reading one or more event files with tbparse only requires 5 lines of code:
from tbparse import SummaryReader
log_dir = "<PATH_TO_EVENT_FILE_OR_DIRECTORY>"
reader = SummaryReader(log_dir)
df = reader.scalars
print(df)