I am creating a phrasematcher with SpaCy like this:
import spacy
from spacy.matcher import PhraseMatcher
nlp = spacy.load("en")
label = "SKILL"
print("Creating the matcher...")
start = time.time()
matcher = PhraseMatcher(nlp.vocab)
for i in list_skills:
matcher.add(label, None, nlp(i))
My list_skills is very big, so the creation of matcher takes a long time, and I reuse it very often. Is there a way to save the matcher to disk, and reload it later on without having to recreate it everytime ?
You can save time some time initially by using nlp.tokenizer.pipe() to process your texts:
for doc in nlp.tokenizer.pipe(list_skills):
matcher.add(label, None, doc)
This just tokenizes, which is much faster than running the full en pipeline. If you're using certain attr settings with PhraseMatcher, you may need nlp.pipe() instead, but you should get an error if this is the case.
You can pickle a PhraseMatcher to save it to disk. Unpickling is not extremely fast because it has to reconstruct some internal data structures, but it should be a quite a bit faster than creating the PhraseMatcher from scratch.
import pickle
filename = 'finalized_matcher.sav'
pickle.dump(matcher, open(filename, 'wb'))
loaded_matcher = pickle.load(open(filename, 'rb'))
Related
I have customized NER pipeline with following procedure
doc = nlp("I am going to Vallila. I am going to Sörnäinen.")
for ent in doc.ents:
print(ent.text, ent.label_)
LABEL = 'DISTRICT'
TRAIN_DATA = [
(
'We need to deliver it to Vallila', {
'entities': [(25, 32, 'DISTRICT')]
}),
(
'We need to deliver it to somewhere', {
'entities': []
}),
]
ner = nlp.get_pipe("ner")
ner.add_label(LABEL)
nlp.disable_pipes("tagger")
nlp.disable_pipes("parser")
nlp.disable_pipes("attribute_ruler")
nlp.disable_pipes("lemmatizer")
nlp.disable_pipes("tok2vec")
optimizer = nlp.get_pipe("ner").create_optimizer()
import random
from spacy.training import Example
for i in range(25):
random.shuffle(TRAIN_DATA)
for text, annotation in TRAIN_DATA:
example = Example.from_dict(nlp.make_doc(text), annotation)
nlp.update([example], sgd=optimizer)
I tried to save that customized NER to disk and load it again with following code
ner.to_disk('/home/feru/ner')
import spacy
from spacy.pipeline import EntityRecognizer
nlp = spacy.load("en_core_web_lg", disable=['ner'])
ner = EntityRecognizer(nlp.vocab)
ner.from_disk('/home/feru/ner')
nlp.add_pipe(ner)
I got however following error:
---> 10 ner = EntityRecognizer(nlp.vocab)
11 ner.from_disk('/home/feru/ner')
12 nlp.add_pipe(ner)
~/.local/lib/python3.8/site-packages/spacy/pipeline/ner.pyx in
spacy.pipeline.ner.EntityRecognizer.init()
TypeError: init() takes at least 2 positional arguments (1 given)
This method to save and load custom component from disk seems to be from some erly SpaCy version. What's the second argument EntityRecognizer needs?
The general process you are following of serializing a single component and reloading it is not the recommended way to do this in spaCy. You can do it - it has to be done internally, of course - but you generally want to save and load pipelines using high-level wrappers. In this case this means that you would save like this:
nlp.to_disk("my_model") # NOT ner.to_disk
And then load it with spacy.load("my_model").
You can find more detail about this in the saving and loading docs. Since it seems you're just getting started with spaCy, you might want to go through the course too. It covers the new config-based training in v3, which is much easier than using your own custom training loop like in your code sample.
If you want to mix and match components from different pipelines, you still will generally want to save entire pipelines, and you can then combine components from them using the "sourcing" feature.
I have a large h5 file with 5-dimensional numpy array in HDFS. File size is ~130Gb. I am facing memory issues while loading the file with process gets killed with OOM Error even though machine has 256Gb RAM. How can I write the file in chunks and load back in chunks? I looked around and found that h5py provides method to chunk the dataset like so but how do I load back the data in chunks? Also will it work if the file resides in HDFS?
dset = f.create_dataset("Images2", (100,480,640), 'f', chunks=True)
Idea is to load the file in batches for less I/O time as well as memory issues. Any help would be much appreciated.
Two similar (but different) h5py I/O concepts are mentioned in the answer and comments above:
HDF5 Chunking is used to enable chunked I/O for improved performance. Chunking may not help if you get an OOM error when you try to read a large dataset with insufficient memory.
NumPy style Slicing is used to read a slice of the data from the drive to memory (or write a slice of data to the drive). Slicing is the key to avoid OOM errors when reading very large files.
Also, when creating very large datasets, you generally need to make
it resizeable. You can allocate an initial size, then use the ".resize()" method to increase the size on disk.
I wrote a simple example that shows how to use both slicing and chunking. It loads 100 images at a time into a resizeable dataset. It then closes the file and reopens (read-only) to read 100 images at a time into a NumPy array.
Effective chunking requires appropriate size/shape and is based on your array shape and I/O needs. I set the chunk size/shape in my example to match the size of 100 image array I was writing/reading.
This example should get you started. You will need to modify to use a 5-d array/dataset.
import numpy as np
import h5py
with h5py.File('SO_64645940.h5','w') as h5w:
img_ds = h5w.create_dataset('Images', shape=(100,480,640), dtype='f', maxshape=(None,480,640),chunks=(10,480,640))
next_img_row = 0
arr = np.random.random(100*480*640).reshape(100,480,640)
for cnt in range(1,10):
# print(cnt,img_ds.len(),next_img_row)
if img_ds.len() == next_img_row :
img_ds.resize(100*cnt,axis=0)
print('new ds size=',img_ds.len())
h5w['Images'][next_img_row:next_img_row+100] = arr
next_img_row += 100
with h5py.File('SO_64645940.h5','r') as h5r:
for cnt in range(10):
print('get slice#',str(cnt))
img_arr = h5r['Images'][cnt*100:(cnt+1)*100]
Chunking in HDF5 means that the data is not stored contigous, but in chunks.
See information here: https://docs.h5py.org/en/stable/high/dataset.html#chunked-storage
--> So this doesn't help you with your problem.
The solution might be that you build a function yourself to load the data chunkwise.
I made it for example this way for getting the data chunked:
def get_chunked(data, chunk_size=100):
for i in give_chunk(len(data), chunk_size):
chunked_array = data[i]
yield chunked_array
def give_chunk(length, chunk_size):
it = iter(range(length))
while True:
chunk = list(itertools.islice(it, chunk_size))
if not chunk:
break
yield chunk
For writing the data to HDF5 you can create the dataset first and then write the data chunk wise with slicing, see h5py documentation: https://docs.h5py.org/en/stable/high/dataset.html#reading-writing-data
I really can recommend this book for basic knowledge about HDF5: https://www.oreilly.com/library/view/python-and-hdf5/9781491944981/
Hi I've trying to get a TFX Pipeline going just as an exercise really. I'm using ImportExampleGen to load TFRecords from disk. Each Example in the TFRecord contains a jpg in the form of a byte string, height, width, depth, steering and throttle labels.
I'm trying to use StatisticsGen but I'm receiving this warning;
WARNING:root:Feature "image_raw" has bytes value "None" which cannot be decoded as a UTF-8 string. and crashing my Colab Notebook. As far as I can tell all the byte-string images in the TFRecord are not corrupt.
I cannot find concrete examples on StatisticsGen and handling image data. According to the docs Tensorflow Data Validation can deal with image data.
In addition to computing a default set of data statistics, TFDV can also compute statistics for semantic domains (e.g., images, text). To enable computation of semantic domain statistics, pass a tfdv.StatsOptions object with enable_semantic_domain_stats set to True to tfdv.generate_statistics_from_tfrecord.
But I'm not sure how this fits in with StatisticsGen.
Here is the code that instantiates an ImportExampleGen then the StatisticsGen
from tfx.utils.dsl_utils import tfrecord_input
from tfx.components.example_gen.import_example_gen.component import ImportExampleGen
from tfx.proto import example_gen_pb2
examples = tfrecord_input(_tf_record_dir)
# https://www.tensorflow.org/tfx/guide/examplegen#custom_inputoutput_split
# has a good explanation of splitting the data the 'output_config' param
# Input train split is _tf_record_dir/*'
# Output 2 splits: train:eval=8:2.
train_ratio = 8
eval_ratio = 10-train_ratio
output = example_gen_pb2.Output(
split_config=example_gen_pb2.SplitConfig(splits=[
example_gen_pb2.SplitConfig.Split(name='train',
hash_buckets=train_ratio),
example_gen_pb2.SplitConfig.Split(name='eval',
hash_buckets=eval_ratio)
]))
example_gen = ImportExampleGen(input=examples,
output_config=output)
context.run(example_gen)
statistics_gen = StatisticsGen(
examples=example_gen.outputs['examples'])
context.run(statistics_gen)
Thanks in advance.
From git issue response
Thanks Evan Rosen
Hi Folks,
The warnings you are seeing indicate that StatisticsGen is trying to treat your raw image features like a categorical string feature. The image bytes are being decoded just fine. The issue is that when the stats (including top K examples) are being written, the output proto is expecting a UTF-8 valid string, but instead gets the raw image bytes. Nothing is wrong with your setups from what I can tell, but this is just an unintended side-effect of a well-intentioned warning in the event that you have a categorical string feature which can't be serialized. We'll look into finding a better default that handles image data more elegantly.
In the meantime, to tell StatisticsGen that this feature is really an opaque blob, you can pass in a user-modified schema as described in the StatsGen docs. To generate this schema, you can run StatisticsGen and SchemaGen once (on a sample of data) and then modify the inferred schema to annotate that image features. Here is a modified version of the colab from #tall-josh:
Open In Colab
The additional steps are a bit verbose, but having a curated schema is often a good practice for other reasons. Here is the cell that I added to the notebook:
from google.protobuf import text_format
from tensorflow.python.lib.io import file_io
from tensorflow_metadata.proto.v0 import schema_pb2
# Load autogenerated schema (using stats from small batch)
schema = tfx.utils.io_utils.SchemaReader().read(
tfx.utils.io_utils.get_only_uri_in_dir(
tfx.types.artifact_utils.get_single_uri(schema_gen.outputs['schema'].get())))
# Modify schema to indicate which string features are images.
# Ideally you would persist a golden version of this schema somewhere rather
# than regenerating it on every run.
for feature in schema.feature:
if feature.name == 'image/raw':
feature.image_domain.SetInParent()
# Write modified schema to local file
user_schema_dir ='/tmp/user-schema/'
tfx.utils.io_utils.write_pbtxt_file(
os.path.join(user_schema_dir, 'schema.pbtxt'), schema)
# Create ImportNode to make modified schema available to other components
user_schema_importer = tfx.components.ImporterNode(
instance_name='import_user_schema',
source_uri=user_schema_dir,
artifact_type=tfx.types.standard_artifacts.Schema)
# Run the user schema ImportNode
context.run(user_schema_importer)
Hopefully you find this workaround is useful. In the meantime, we'll take a look at a better default experience for image-valued features.
Groked this and found the solution to be dramatically simpler than i thought...
from tfx.orchestration.experimental.interactive.interactive_context import InteractiveContext
import logging
...
logger = logging.getLogger()
logger.setLevel(logging.CRITICAL)
...
context = InteractiveContext(pipeline_name='my_pipe')
...
c = StatisticsGen(...)
...
context.run(c)
I got a TFRecord data file filename = train-00000-of-00001 which contains images of unknown size and maybe other information as well. I know that I can use dataset = tf.data.TFRecordDataset(filename) to open the dataset.
How can I extract the images from this file to save it as a numpy-array?
I also don't know if there is any other information saved in the TFRecord file such as labels or resolution. How can I get these information? How can I save them as a numpy-array?
I normally only use numpy-arrays and am not familiar with TFRecord data files.
1.) How can I extract the images from this file to save it as a numpy-array?
What you are looking for is this:
record_iterator = tf.python_io.tf_record_iterator(path=filename)
for string_record in record_iterator:
example = tf.train.Example()
example.ParseFromString(string_record)
print(example)
# Exit after 1 iteration as this is purely demonstrative.
break
2.) How can I get these information?
Here is the official documentation. I strongly suggest that you read the documentation because it goes step by step in how to extract the values that you are looking for.
Essentially, you have to convert example to a dictionary. So if I wanted to find out what kind of information is in a tfrecord file, I would do something like this (in context with the code stated in the first question): dict(example.features.feature).keys()
3.) How can I save them as a numpy-array?
I would build upon the for loop mentioned above. So for every loop, it extracts the values that you are interested in and appends them to numpy arrays. If you want, you could create a pandas dataframe from those arrays and save it as a csv file.
But...
You seem to have multiple tfrecord files...tf.data.TFRecordDataset(filename) returns a dataset that is used to train models.
So in the event for multiple tfrecords, you would need a double for loop. The outer loop will go through each file. For that particular file, the inner loop will go through all of the tf.examples.
EDIT:
Converting to np.array()
import tensorflow as tf
from PIL import Image
import io
for string_record in record_iterator:
example = tf.train.Example()
example.ParseFromString(string_record)
print(example)
# Get the values in a dictionary
example_bytes = dict(example.features.feature)['image_raw'].bytes_list.value[0]
image_array = np.array(Image.open(io.BytesIO(example_bytes)))
print(image_array)
break
Sources for the code above:
Base code
Converting bytes to PIL.JpegImagePlugin.JpegImageFile
Converting from PIL.JpegImagePlugin.JpegImageFile to np.array
Official Documentation for PIL
EDIT 2:
import tensorflow as tf
from PIL import Image
import io
import numpy as np
# Load image
cat_in_snow = tf.keras.utils.get_file(path, 'https://storage.googleapis.com/download.tensorflow.org/example_images/320px-Felis_catus-cat_on_snow.jpg')
#------------------------------------------------------Convert to tfrecords
def _bytes_feature(value):
"""Returns a bytes_list from a string / byte."""
return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))
def image_example(image_string):
feature = {
'image_raw': _bytes_feature(image_string),
}
return tf.train.Example(features=tf.train.Features(feature=feature))
with tf.python_io.TFRecordWriter('images.tfrecords') as writer:
image_string = open(cat_in_snow, 'rb').read()
tf_example = image_example(image_string)
writer.write(tf_example.SerializeToString())
#------------------------------------------------------
#------------------------------------------------------Begin Operation
record_iterator = tf.python_io.tf_record_iterator(path to tfrecord file)
for string_record in record_iterator:
example = tf.train.Example()
example.ParseFromString(string_record)
print(example)
# OPTION 1: convert bytes to arrays using PIL and IO
example_bytes = dict(example.features.feature)['image_raw'].bytes_list.value[0]
PIL_array = np.array(Image.open(io.BytesIO(example_bytes)))
# OPTION 2: convert bytes to arrays using Tensorflow
with tf.Session() as sess:
TF_array = sess.run(tf.image.decode_jpeg(example_bytes, channels=3))
break
#------------------------------------------------------
#------------------------------------------------------Compare results
(PIL_array.flatten() != TF_array.flatten()).sum()
PIL_array == TF_array
PIL_img = Image.fromarray(PIL_array, 'RGB')
PIL_img.save('PIL_IMAGE.jpg')
TF_img = Image.fromarray(TF_array, 'RGB')
TF_img.save('TF_IMAGE.jpg')
#------------------------------------------------------
Remember that tfrecords is just simply a way of storing information for tensorflow models to read in an efficient manner.
I use PIL and IO to essentially convert the bytes to an image. IO takes the bytes and converts them to a file like object that PIL.Image can then read
Yes, there is a pure tensorflow way to do it: tf.image.decode_jpeg
Yes, there is a difference between the two approaches when you compare the two arrays
Which one should you pick? Tensorflow is not the way to go if you are worried about accuracy as stated in Tensorflow's github : "The TensorFlow-chosen default for jpeg decoding is IFAST, sacrificing image quality for speed". Credit for this information belongs to this post
I am trying to create a 78TB HDF5 dataset by filling it in a 2d block-partition manner. This is very slow when the block I'm writing spans rows that haven't ever been written to, because HDF5 is going in and allocating the diskspace and filling in the missing entries with zero.
Instead, I would like h5py to allocate the disk space for my dataset as soon as its created, and never fill it. This is possible with the C api according to Table 16 in the HDF5 Dataset documentation, but how can I do this with h5py, preferably with the high level interface?
I believe that you want to set the fill time to "never", with the H5Pset_fill_time() routine, but I don't know the h5py way to do that.
As Quincey suggested. You can use the low-level H5py API to create the dataset with the FILL_TIME_NEVER property then convert it back to a high-level Dataset object:
# create the rows dataset using the low-level api so I can force it to not do zero-filling, then convert to a high level object
spaceid = h5py.h5s.create_simple((numRows, numCols))
plist = h5py.h5p.create(h5py.h5p.DATASET_CREATE)
plist.set_fill_time(h5py.h5d.FILL_TIME_NEVER)
plist.set_chunk((rowchunk, colchunk))
datasetid = h5py.h5d.create(fout.id, "rows", h5py.h5t.NATIVE_DOUBLE, spaceid, plist)
rows = h5py.Dataset(datasetid)
Try specifying a chunk shape that matches your write pattern. For example if you are writing in blocks of 1024x1024, it would look like this:
import h5py
import numpy as np
f = h5py.File('mybigdset.h5', 'w')
dset = f.create_dataset('dset', (78*1024*1024, 1024*1024), dtype='f4', chunks=(1024,1024))
arr = np.random.rand(1024,1024)
dset[0:1024, 0:1024] = arr
f.close()
Thankfully, this didn't use 78TB of disk, the file size was just 4MB.