Export vectors from fastText to spaCy

I downloaded the 1.5 GB fastText vectors from fasttext.cc and used the example code from the spaCy examples (vectors_fast_text). I executed the following command in the terminal:
python config/vectors_fast_text.py vectors_loc data/vectors/wiki.pt.vec
After a few minutes with the processor at 100%, I received the following text:
class colspan 0.32231358
What happens from here? How can I export these vectors elsewhere, for example to use them with my training models on AWS S3?

I modified the example script to load the existing data for my language, read the word2vec file, and at the end write all the content to a folder (this folder needs to exist).
Here is vectors_fast_text.py, where [LANGUAGE] is your language code (for example "pt") and [FILE_WORD2VEC] is the path to your vectors file (for example "./data/word2vec.txt"):
from __future__ import unicode_literals
import plac
import numpy
import spacy
from spacy.language import Language


# @plac.annotations()
def main():
    nlp = spacy.load('[LANGUAGE]')
    with open("[FILE_WORD2VEC]", 'rb') as file_:
        header = file_.readline()
        nr_row, nr_dim = header.split()  # first line holds the vector count and dimensionality
        nlp.vocab.reset_vectors(width=int(nr_dim))
        count = 0
        for line in file_:
            count += 1
            line = line.rstrip().decode('utf8')
            pieces = line.rsplit(' ', int(nr_dim))
            word = pieces[0]
            print("{} - {}".format(count, word))
            vector = numpy.asarray([float(v) for v in pieces[1:]], dtype='f')
            nlp.vocab.set_vector(word, vector)  # add the vector to the vocab
    nlp.to_disk("./models/new_nlp/")


if __name__ == '__main__':
    plac.call(main)
Type in the terminal:
python vectors_fast_text.py
It will take about 10 minutes to finish, depending on the size of the word2vec file. The script prints each word as it is added, so you can follow the progress.
After that, type in the terminal (run the sdist command from inside the package folder that spacy package generates under ./my_models/):
python -m spacy package ./models/new_nlp/ ./my_models/
python setup.py sdist
You will then have a .tar.gz archive, which you can install with pip:
pip install /path/to/pt_example_model-1.0.0.tar.gz
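Once installed, the packaged model can be loaded by its package name. A minimal sketch, assuming the package was named pt_example_model as in the pip command above:
import spacy

# "pt_example_model" is the assumed package name from the example above;
# use whatever name spacy package gave your model.
nlp = spacy.load("pt_example_model")
doc = nlp("exemplo de texto")
print(doc[0].vector[:5])  # the fastText vector now attached to the first token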
A detailed tutorial can be found on the spaCy website:
https://spacy.io/usage/training

Related

How to load a model using Tensorflow Hub and make a prediction?

This should be a simple task: download a model saved in tensorflow_hub format, load it using tensorflow_hub, and use it.
This is the model I am trying to use (simCLR stored in Google Cloud): https://console.cloud.google.com/storage/browser/simclr-checkpoints/simclrv2/pretrained/r50_1x_sk0;tab=objects?pageState=(%22StorageObjectListTable%22:(%22f%22:%22%255B%255D%22))&prefix=&forceOnObjectsSortingFiltering=false
I downloaded the /hub folder as they say, using
gsutil -m cp -r \
"gs://simclr-checkpoints/simclrv2/pretrained/r50_1x_sk0/hub" \
.
The /hub folder contains the files:
/saved_model.pb
/tfhub_module.pb
/variables/variables.index
/variables/variables.data-00000-of-00001
So far so good.
Now, in Python 3 with TensorFlow 2 and tensorflow_hub 0.12, I run the following code:
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
path_to_hub = '/home/my_name/my_path/simclr/hub'
# Attempt 1
m = tf.keras.models.Sequential([hub.KerasLayer(path_to_hub, input_shape=(224,224,3))])
# Attempt 2
m = tf.keras.models.Sequential(hub.KerasLayer(hubmod))
m.build(input_shape=[None,224,224,3])
# Attempt 3
m = hub.KerasLayer(hub.load(hubmod))
# Toy Data Test
X = np.random.random((1,244,244,3)).astype(np.float32)
y = m.predict(X)
None of these 3 options to load the hub model work, with the following errors:
Attempt 1 :
ValueError: Error when checking input: expected keras_layer_2_input to have shape (224, 224, 3) but got array with shape (244, 244, 3)
Attempt 2:
tensorflow.python.framework.errors_impl.UnknownError: Failed to get convolution algorithm. This is probably because cuDNN failed to initialize, so try looking to see if a warning log message was printed above.
[[{{node sequential_3/keras_layer_3/StatefulPartitionedCall/base_model/conv2d/Conv2D}}]] [Op:__inference_keras_scratch_graph_46402]
Function call stack:
keras_scratch_graph
Attempt 3:
ValueError: Expected a string, got <tensorflow.python.training.tracking.tracking.AutoTrackable object at 0x7fa71c7a2dd0>
All three attempts use code taken from tensorflow_hub tutorials that is repeated in other Stack Overflow answers, but none of them works, and I don't know how to proceed from those error messages.
I'd appreciate any help, thanks.
Update 1:
The same issues happen if I try with this ResNet50 hub folder:
https://storage.cloud.google.com/simclr-gcs/checkpoints/ResNet50_1x.zip
As @Frightera pointed out, there was an error with the input shapes: the toy data was (244, 244, 3) while the layer expected (224, 224, 3). The error in "Attempt 2" was solved by allowing memory growth on the selected GPU. "Attempt 3" still does not work, but at least there are two working methods for loading and using a model saved in /hub format:
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub
gpus = tf.config.experimental.list_physical_devices('GPU')
tf.config.experimental.set_visible_devices(gpus[0], 'GPU')
tf.config.experimental.set_memory_growth(gpus[0], True)
hubmod = 'https://tfhub.dev/google/imagenet/mobilenet_v2_035_96/feature_vector/5'
# Alternative 1 - Works!
m = tf.keras.models.Sequential([hub.KerasLayer(hubmod, input_shape=(96,96,3))])
print(m.summary())
# Alternative 2 - Works!
m = tf.keras.models.Sequential(hub.KerasLayer(hubmod))
m.build(input_shape=[None, 96,96,3])
print(m.summary())
# Alternative 3 - Doesn't work
#m = hub.KerasLayer(hub.load(hubmod))
#m.build(input_shape=[None, 96,96,3])
#print(m.summary())
# Test
X = np.random.random((1,96,96,3)).astype(np.float32)
y = m.predict(X)
print(y.shape)
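For completeness, a possible workaround for "Alternative 3" (a sketch, not verified against the simCLR module) is to skip the KerasLayer wrapper and call the loaded SavedModel directly, since TF2 modules from tfhub.dev expose a callable default signature:
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

hubmod = 'https://tfhub.dev/google/imagenet/mobilenet_v2_035_96/feature_vector/5'
loaded = hub.load(hubmod)  # returns an AutoTrackable object, not a string path
X = np.random.random((1, 96, 96, 3)).astype(np.float32)
y = loaded(X)  # call the SavedModel's default signature directly
print(y.shape)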

spaCy - erroneous config file

While training NER with custom labels, I created a .json file in exactly the same way as in the example, but with my own data. Then I tried to convert it (both train and dev) to the binary format needed for training, using the command:
python -m spacy convert train.json ./ -t spacy
which did result in two files being created.
The error I got while launching the training process:
[E923] It looks like there is no proper sample data to initialize the Model of component 'ner'. To check your input data paths and annotation, run: python -m spacy debug data config.cfg
The debug command produces the same output.
The problem is that there are overlapping entities: each token may belong to only one entity span.
A solution (code adapted from the spaCy conversion script):
import warnings

import srsly
import spacy
from spacy.tokens import DocBin  # this import was missing from the original snippet

for f in ["train.json", "dev.json"]:
    nlp = spacy.blank("en")
    db = DocBin()
    for text, annot in srsly.read_json(f):
        doc = nlp.make_doc(text)
        ents = []
        try:
            for start, end, label in annot["entities"]:
                span = doc.char_span(start, end, label=label)
                if span is None:
                    msg = f"Skipping entity [{start}, {end}, {label}] in the following text because the character span '{doc.text[start:end]}' does not align with token boundaries:\n\n{repr(text)}\n"
                    warnings.warn(msg)
                else:
                    ents.append(span)
            doc.ents = ents
            db.add(doc)
        except Exception:
            print(doc.text, ents)  # see which texts cause the problem
            continue
    db.to_disk(f.split('.')[0] + '.spacy')
That would just skip the texts that cause problems. To instead keep one of the overlapping entities, replace the try block with something like this:
try:
    x = 0  # end offset of the last accepted entity (assumes entities are sorted by start)
    for start, end, label in annot["entities"]:
        span = doc.char_span(start, end, label=label)
        if span is None:
            msg = f"Skipping entity [{start}, {end}, {label}] in the following text because the character span '{doc.text[start:end]}' does not align with token boundaries:\n\n{repr(text)}\n"
            warnings.warn(msg)
        else:
            if start >= x:  # keep the span only if it starts at or after the end of the previous one
                x = end
                ents.append(span)
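Once train.spacy and dev.spacy have been written, point the training config at them and launch training; a sketch assuming the default path variables in config.cfg:
python -m spacy train config.cfg --paths.train ./train.spacy --paths.dev ./dev.spacy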

How to implement SciBERT with pytorch; error while loading

I am trying to use the SciBERT pre-trained model, namely scibert-scivocab-uncased, in the following way:
!pip install pytorch-pretrained-bert
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM
import logging
import matplotlib.pyplot as plt
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokenized_text = tokenizer.tokenize("Example sentence to tokenize.")  # this definition was missing from the original snippet
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
segments_ids = [1] * len(tokenized_text)
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])
model = BertModel.from_pretrained('/Users/.../Downloads/scibert_scivocab_uncased-3.tar.gz')
And I get the following error:
EOFError: Compressed file ended before the end-of-stream marker was reached
I downloaded the file from the website (https://github.com/allenai/scibert) and converted it from "tar" to gzip.
Nothing worked.
Any hint on how to approach this?
Thank you!
In the new version of pytorch-pretrained-BERT, i.e. in transformers, you can do the following to load a pretrained model after you un-tar it:
from transformers import AutoModelForTokenClassification, AutoTokenizer
model = AutoModelForTokenClassification.from_pretrained("/your/local/path/to/scibert_scivocab_uncased")
You need to un-tar the package and rename the bundled JSON file to config.json, then point from_pretrained at the folder where you extracted the package. It should work.
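A minimal sketch of the full loading flow, assuming the archive was extracted to ./scibert_scivocab_uncased and that folder contains the weights, config.json, and vocab files:
from transformers import AutoModelForTokenClassification, AutoTokenizer

path = "./scibert_scivocab_uncased"  # assumed local folder with the extracted files
tokenizer = AutoTokenizer.from_pretrained(path)
model = AutoModelForTokenClassification.from_pretrained(path)

inputs = tokenizer("Gravitational waves were detected.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # (batch_size, sequence_length, num_labels)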

Why does jupyter fail to display displacy results?

I'm trying to run displacy in jupyter notebook but I get the following error:
TypeError: init() got an unexpected keyword argument 'encoding'
Code:
import en_core_web_sm
import spacy
nlp = en_core_web_sm.load()
from spacy import displacy
doc4 = nlp("The rat the cat the dog chased killed ate the malt")
displacy.render(doc4, style ='dep', jupyter = True)
However, the dependencies are appearing through the following code:
for x in doc4:
    print(x.text, x.pos_, x.dep_)
Can someone please point out what's wrong?
import spacy
from spacy import displacy
nlp = spacy.load("en_core_web_sm")
doc4 = nlp("The rat the cat the dog chased killed ate the malt")
displacy.render(doc4, style ='dep', jupyter = True)
This (and your exact code as well) works for me on spacy==2.0.16.
You can try it out also on a jupyter running in the cloud, such as https://colab.research.google.com
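If in-notebook rendering keeps failing, a fallback sketch is to ask displacy for the raw markup (with jupyter=False, render returns a string) and display it yourself:
from IPython.display import HTML, display

html = displacy.render(doc4, style='dep', jupyter=False)  # returns SVG/HTML markup
display(HTML(html))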

Tensorflow: Load unknown TFRecord dataset

I got a TFRecord data file, filename = train-00000-of-00001, which contains images of unknown size and maybe other information as well. I know that I can use dataset = tf.data.TFRecordDataset(filename) to open the dataset.
How can I extract the images from this file and save them as numpy arrays?
I also don't know whether any other information, such as labels or resolution, is saved in the TFRecord file. How can I get this information, and how can I save it as numpy arrays?
I normally only use numpy arrays and am not familiar with TFRecord files.
1.) How can I extract the images from this file and save them as numpy arrays?
What you are looking for is this:
record_iterator = tf.python_io.tf_record_iterator(path=filename)

for string_record in record_iterator:
    example = tf.train.Example()
    example.ParseFromString(string_record)
    print(example)
    # Exit after 1 iteration as this is purely demonstrative.
    break
2.) How can I get this information?
Here is the official documentation. I strongly suggest that you read it, because it walks step by step through how to extract the values you are looking for.
Essentially, you have to convert example to a dictionary. So, to find out what kind of information a tfrecord file contains, you can do something like this (in the context of the code shown for the first question): dict(example.features.feature).keys()
3.) How can I save them as numpy arrays?
I would build on the for loop above: on every iteration, extract the values you are interested in and append them to a list, then stack that list into a numpy array (see the sketch below). If you want, you can also create a pandas dataframe from those arrays and save it as a CSV file.
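A sketch of that loop, assuming each record stores a JPEG under a key named 'image_raw' (the key name is an assumption; check with dict(example.features.feature).keys() first):
import io
import numpy as np
import tensorflow as tf
from PIL import Image

images = []
for string_record in tf.python_io.tf_record_iterator(path=filename):
    example = tf.train.Example()
    example.ParseFromString(string_record)
    raw = dict(example.features.feature)['image_raw'].bytes_list.value[0]
    images.append(np.array(Image.open(io.BytesIO(raw))))

image_array = np.stack(images)  # requires all images to share one resolution
np.save('images.npy', image_array)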
But...
Your filename, train-00000-of-00001, suggests sharded output, so you may end up with multiple tfrecord files; tf.data.TFRecordDataset(filename) returns a dataset that is used to train models.
For multiple tfrecords you would need a double for loop: the outer loop goes through each file, and for that particular file the inner loop goes through all of its tf.train.Examples, as sketched below.
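A sketch of the double loop, with hypothetical shard names (substitute your own file list):
import tensorflow as tf

filenames = ['train-00000-of-00002', 'train-00001-of-00002']  # hypothetical shards
for fname in filenames:  # outer loop: one pass per shard file
    for string_record in tf.python_io.tf_record_iterator(path=fname):
        example = tf.train.Example()  # inner loop: every record in the shard
        example.ParseFromString(string_record)
        # extract the features you need here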
EDIT:
Converting to np.array()
import io
import numpy as np  # this import was missing from the original snippet
import tensorflow as tf
from PIL import Image

record_iterator = tf.python_io.tf_record_iterator(path=filename)

for string_record in record_iterator:
    example = tf.train.Example()
    example.ParseFromString(string_record)
    print(example)

    # Get the values in a dictionary
    example_bytes = dict(example.features.feature)['image_raw'].bytes_list.value[0]
    image_array = np.array(Image.open(io.BytesIO(example_bytes)))
    print(image_array)

    break
Sources for the code above:
Base code
Converting bytes to PIL.JpegImagePlugin.JpegImageFile
Converting from PIL.JpegImagePlugin.JpegImageFile to np.array
Official Documentation for PIL
EDIT 2:
import tensorflow as tf
from PIL import Image
import io
import numpy as np

# Load image (the filename passed to get_file is taken from the URL)
cat_in_snow = tf.keras.utils.get_file(
    '320px-Felis_catus-cat_on_snow.jpg',
    'https://storage.googleapis.com/download.tensorflow.org/example_images/320px-Felis_catus-cat_on_snow.jpg')

# ------------------------------------------------------ Convert to tfrecords
def _bytes_feature(value):
    """Returns a bytes_list from a string / byte."""
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def image_example(image_string):
    feature = {
        'image_raw': _bytes_feature(image_string),
    }
    return tf.train.Example(features=tf.train.Features(feature=feature))

with tf.python_io.TFRecordWriter('images.tfrecords') as writer:
    image_string = open(cat_in_snow, 'rb').read()
    tf_example = image_example(image_string)
    writer.write(tf_example.SerializeToString())
# ------------------------------------------------------

# ------------------------------------------------------ Begin Operation
record_iterator = tf.python_io.tf_record_iterator(path='images.tfrecords')

for string_record in record_iterator:
    example = tf.train.Example()
    example.ParseFromString(string_record)
    print(example)

    # OPTION 1: convert bytes to arrays using PIL and io
    example_bytes = dict(example.features.feature)['image_raw'].bytes_list.value[0]
    PIL_array = np.array(Image.open(io.BytesIO(example_bytes)))

    # OPTION 2: convert bytes to arrays using Tensorflow
    with tf.Session() as sess:
        TF_array = sess.run(tf.image.decode_jpeg(example_bytes, channels=3))

    break
# ------------------------------------------------------

# ------------------------------------------------------ Compare results
print((PIL_array.flatten() != TF_array.flatten()).sum())  # number of differing values
print(PIL_array == TF_array)                              # elementwise comparison

PIL_img = Image.fromarray(PIL_array, 'RGB')
PIL_img.save('PIL_IMAGE.jpg')

TF_img = Image.fromarray(TF_array, 'RGB')
TF_img.save('TF_IMAGE.jpg')
# ------------------------------------------------------
Remember that TFRecord is simply a way of storing information for TensorFlow models to read efficiently.
I use PIL and io to convert the bytes to an image: io wraps the bytes in a file-like object that PIL.Image can then read.
Yes, there is a pure TensorFlow way to do it: tf.image.decode_jpeg.
Yes, there is a difference between the two approaches when you compare the two arrays.
Which one should you pick? TensorFlow is not the way to go if you are worried about accuracy; as stated in TensorFlow's GitHub: "The TensorFlow-chosen default for jpeg decoding is IFAST, sacrificing image quality for speed". Credit for this information belongs to this post.
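If decoding fidelity matters more than speed, tf.image.decode_jpeg accepts a dct_method argument that selects the slower, more accurate decoder; a minimal sketch building on the code above:
# Trades speed for fidelity compared with the default IFAST method.
accurate_tensor = tf.image.decode_jpeg(example_bytes, channels=3,
                                       dct_method='INTEGER_ACCURATE')
# In TF1, evaluate it in a session: sess.run(accurate_tensor)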