How to implement SciBERT with PyTorch; error while loading

I am trying to use the SciBERT pre-trained model, namely scibert-scivocab-uncased, in the following way:
!pip install pytorch-pretrained-bert
import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM
import logging
import matplotlib.pyplot as plt
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
segments_ids = [1] * len(tokenized_text)
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])
model = BertModel.from_pretrained('/Users/.../Downloads/scibert_scivocab_uncased-3.tar.gz')
And I get the following error:
EOFError: Compressed file ended before the end-of-stream marker was reached
I downloaded the file from the website (https://github.com/allenai/scibert).
I also tried converting it from "tar" to gzip.
Nothing worked.
Any hint on how to approach this?
Thank you!

In the new version of pytorch-pretrained-BERT, i.e. transformers, you can do the following to load a pretrained model after you un-tar the archive:
from transformers import AutoModelForTokenClassification, AutoTokenizer
model = AutoModelForTokenClassification.from_pretrained("/your/local/path/to/scibert_scivocab_uncased")

You need to unzip the package and rename the JSON file to config.json.
Then pass the path of the folder where you unzipped the package. It should work.
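
The EOFError in the question is what Python's gzip reader raises when a compressed stream ends early, which usually points to an incomplete download. Below is a minimal sketch, using only the standard library, that extracts the archive, reports a truncated file with a clearer message, and applies the config.json rename suggested above. The helper name extract_model_archive and the rule for picking the JSON file are assumptions made for illustration; they are not part of SciBERT or transformers.

```python
import os
import tarfile


def extract_model_archive(archive_path, dest_dir):
    """Extract a .tar.gz model archive and return the folder holding config.json.

    Hypothetical helper: if the folder contains exactly one JSON file that is
    not yet called config.json, it is renamed (an assumption based on the
    answer above).
    """
    os.makedirs(dest_dir, exist_ok=True)
    try:
        with tarfile.open(archive_path, "r:gz") as tar:
            # Newer Python versions support extraction filters; use the safe
            # "data" filter when it is available.
            if hasattr(tarfile, "data_filter"):
                tar.extractall(dest_dir, filter="data")
            else:
                tar.extractall(dest_dir)
    except (EOFError, tarfile.ReadError) as exc:
        raise RuntimeError(
            f"{archive_path} looks truncated or corrupt; try downloading it again"
        ) from exc

    # Find the first folder that contains a JSON file.
    for root, _dirs, files in os.walk(dest_dir):
        json_files = [name for name in files if name.endswith(".json")]
        if "config.json" in json_files:
            return root
        if len(json_files) == 1:
            os.rename(
                os.path.join(root, json_files[0]),
                os.path.join(root, "config.json"),
            )
            return root
    raise FileNotFoundError(f"no usable JSON config found under {dest_dir}")
```

The returned folder path is what you would then pass to from_pretrained, as in the answers above.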


b'No files matched pattern error with Tensorflow

I started to write the code below to review my dataset. I'm trying to reach my images, so I add the path to the images.
import tensorflow as tf
import json
import numpy as np
from matplotlib import pyplot as plt
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
print(tf.config.list_physical_devices('GPU'))
images = tf.data.Dataset.list_files('data\\images\\*.jpg',shuffle=False)
But I got the error below.
Expected 'tf.Tensor(False, shape=(), dtype=bool)' to be true. Summarized data: b'No files matched pattern:: data\\images\\*.jpg'
As you can see, my folder hierarchy is correct, and there are images in .jpg format in the images folder.
I also tried changing the string to the variations below, but none of them worked.
'\\data\\images\\*.jpg'
'/data/images/*.jpg'
'data/images/*.jpg'
What am I missing here? Can you help me out, please?
UPDATE:
I couldn't make it work with a relative path, so I switched to the absolute path, and it worked.
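
A relative pattern is resolved against the process's current working directory, which in a notebook is not necessarily the project folder. The sketch below, using only the standard library, turns a relative pattern into an absolute one and lists what it matches, so you can check the pattern before handing it to tf.data.Dataset.list_files. The helper name resolve_pattern and its base_dir parameter are hypothetical.

```python
import glob
import os


def resolve_pattern(pattern, base_dir=None):
    """Return (absolute_pattern, sorted_matches) for a relative glob pattern.

    Hypothetical helper: base_dir defaults to the current working directory,
    which is the directory a relative pattern is resolved against.
    """
    if base_dir is None:
        base_dir = os.getcwd()
    # Accept both Windows-style and POSIX-style separators in the pattern.
    normalized = pattern.replace("\\", "/")
    abs_pattern = os.path.abspath(os.path.join(base_dir, normalized))
    return abs_pattern, sorted(glob.glob(abs_pattern))


if __name__ == "__main__":
    abs_pattern, matches = resolve_pattern("data\\images\\*.jpg")
    print("working directory:", os.getcwd())
    print("absolute pattern: ", abs_pattern)
    print("files matched:    ", len(matches))
```

If no files are matched, the printed working directory shows where the relative path is actually being looked up; the absolute pattern can be passed to tf.data.Dataset.list_files in place of the relative one.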

Synapse Analytics Auto ML Predict No module named 'azureml.automl'

I followed the official tutorial from Microsoft: https://learn.microsoft.com/en-us/azure/synapse-analytics/machine-learning/tutorial-score-model-predict-spark-pool
When I execute:
#Bind model within Spark session
model = pcontext.bind_model(
    return_types=RETURN_TYPES,
    runtime=RUNTIME,
    model_alias="Sales", #This alias will be used in PREDICT call to refer this model
    model_uri=AML_MODEL_URI, #In case of AML, it will be AML_MODEL_URI
    aml_workspace=ws #This is only for AML. In case of ADLS, this parameter can be removed
    ).register()
I got : No module named 'azureml.automl'
As per the repro on my end, the code you shared works as expected, and I don't see the error message you are experiencing.
I even tested the same code on a newly created Apache Spark 3.1 runtime, and it works as expected.
I would suggest creating a new cluster and checking whether you are able to run the above code.
I solved it. In my case it works best like this:
Imports
#Import libraries
from pyspark.sql.functions import col, pandas_udf,udf,lit
from notebookutils.mssparkutils import azureML
from azureml.core import Workspace, Model
from azureml.core.authentication import ServicePrincipalAuthentication
from azureml.core.model import Model
import joblib
import pandas as pd
ws = azureML.getWorkspace("AzureMLService")
spark.conf.set("spark.synapse.ml.predict.enabled","true")
Predict function
def forecastModel():
    model_path = Model.get_model_path(model_name="modelName", _workspace=ws)
    modeljob = joblib.load(model_path + "/model.pkl")
    validation_data = spark.read.format("csv") \
        .option("header", True) \
        .option("inferSchema",True) \
        .option("sep", ";") \
        .load("abfss://....csv")
    validation_data_pd = validation_data.toPandas()
    predict = modeljob.forecast(validation_data_pd)
    return predict

Google Colab issue importing different class files

I am trying to use Google Colab for my project, for which I have to upload a few Python files because I need those class files. But while executing the main function, it constantly throws the error 'module object has no attribute'. Is there some memory issue with Colab, or what? Help would be much appreciated.
import numpy as np
import time
import tensorflow as tf
import NN
import Option
import Log
import getData
import Quantize
AttributeError: 'module' object has no attribute 'NN'
I uploaded all files using following code :
from google.colab import files
src = list(files.upload().values())[0]
open('Option.py','wb').write(src)
import Option
But it always gives me an error on one or another of the files I am importing.
The updated version (available for a few weeks now) can save the files without you having to call open(fname, 'wb').write(src)
So, you only have to upload your 5 files: NN.py, Option.py, Log.py, getData.py, and Quantize.py (and probably other dependency + data) then try importing each one e.g. import NN to see if there's any error.
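
A small sketch of the "try importing each one" step: it imports every module by name and records the error, if any, so you can see which file is failing instead of stopping at the first exception. The helper name try_imports is hypothetical; the module names in the usage lines are the ones from the question.

```python
import importlib


def try_imports(module_names):
    """Import each module by name and return {name: None or error text}.

    Hypothetical helper: None means the import succeeded.
    """
    results = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            results[name] = None
        except Exception as exc:  # report any import-time failure
            results[name] = f"{type(exc).__name__}: {exc}"
    return results


if __name__ == "__main__":
    report = try_imports(["NN", "Option", "Log", "getData", "Quantize"])
    for name, error in report.items():
        print(name, "OK" if error is None else error)
```

A module that fails only because one of its own imports is missing will show that inner error here, which helps decide which additional files still need to be uploaded.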

Export vectors from fastText to spaCy

I downloaded the 1.5 GB vectors from fasttext.cc and used the vectors_fast_text example script from the spaCy examples. I ran the following command in the terminal:
python config/vectors_fast_text.py vectors_loc data/vectors/wiki.pt.vec
After a few minutes with the processor at 100%, I received the following text:
class colspan 0.32231358
What happens from here? How can I export these vectors elsewhere, such as for example with my AWS S3 training templates?
I modified the example script to load the existing model for my language, read the word2vec file, and at the end write all the content to a folder (this folder must already exist).
The modified vectors_fast_text.py follows, with these placeholders:
[LANGUAGE] = example: "pt"
[FILE_WORD2VEC] = "./data/word2vec.txt"
from __future__ import unicode_literals
import plac
import numpy
import spacy
from spacy.language import Language
@plac.annotations()
def main():
    nlp = spacy.load('[LANGUAGE]')
    with open("[FILE_WORD2VEC]", 'rb') as file_:
        # The first line holds the number of rows and the vector width
        header = file_.readline()
        nr_row, nr_dim = header.split()
        nlp.vocab.reset_vectors(width=int(nr_dim))
        count = 0
        for line in file_:
            count += 1
            line = line.rstrip().decode('utf8')
            pieces = line.rsplit(' ', int(nr_dim))
            word = pieces[0]
            print("{} - {}".format(count, word))
            vector = numpy.asarray([float(v) for v in pieces[1:]], dtype='f')
            nlp.vocab.set_vector(word, vector) # add the vectors to the vocab
    nlp.to_disk("./models/new_nlp/")
if __name__ == '__main__':
    plac.call(main)
Type in the terminal:
python vectors_fast_text.py
It takes about 10 minutes to finish, depending on the size of the word2vec file. The script prints each word so that you can follow its progress.
After that, you must type in the terminal:
python -m spacy package ./models/new_nlp/ ./my_models/
python setup.py sdist
This produces a .tar.gz archive, which you can install with pip:
pip install /path/to/pt_example_model-1.0.0.tar.gz
A detailed tutorial can be found on the spaCy website:
https://spacy.io/usage/training
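
The parsing loop in the script above relies on the word2vec text layout: a header line with the row count and vector width, then one line per word followed by its values. Here is a dependency-free sketch of just that parsing step, so the format can be checked on a few lines before running the full import. The function name parse_word2vec_text is hypothetical, and plain Python lists stand in for the numpy arrays used in the script.

```python
def parse_word2vec_text(lines):
    """Parse word2vec text-format lines into (nr_row, nr_dim, vectors).

    Hypothetical helper: `lines` is any iterable of decoded strings, and
    `vectors` maps each word to a list of floats.
    """
    iterator = iter(lines)
    # Header: "<number of rows> <vector width>"
    nr_row, nr_dim = (int(value) for value in next(iterator).split())
    vectors = {}
    for line in iterator:
        line = line.rstrip()
        if not line:
            continue
        # Split from the right so that words containing spaces stay intact.
        pieces = line.rsplit(" ", nr_dim)
        vectors[pieces[0]] = [float(v) for v in pieces[1:]]
    return nr_row, nr_dim, vectors


if __name__ == "__main__":
    sample = ["2 3\n", "hello 0.5 0.25 1\n", "new york 1 2 3\n"]
    print(parse_word2vec_text(sample))
```

Splitting from the right with a limit of nr_dim is the same trick the script uses; it keeps everything before the last nr_dim values as the word.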

How to read Ogg or MP3 audio files in a TensorFlow graph?

I've seen image decoders like tf.image.decode_png in TensorFlow, but how about reading audio files (WAV, Ogg, MP3, etc.)? Is it possible without TFRecord?
E.g. something like this:
filename_queue = tf.train.string_input_producer(['my-audio.ogg'])
reader = tf.WholeFileReader()
key, value = reader.read(filename_queue)
my_audio = tf.audio.decode_ogg(value)
Yes, there are special decoders, in the package tensorflow.contrib.ffmpeg. To use it, you need to install ffmpeg first.
Example:
audio_binary = tf.read_file('song.mp3')
waveform = tf.contrib.ffmpeg.decode_audio(audio_binary, file_format='mp3', samples_per_second=44100, channel_count=2)
The answer from @sygi is unfortunately not supported in TensorFlow 2.x. An alternative solution is to use an external library (e.g. pydub or librosa) to implement the mp3 decoding step, and integrate it into the pipeline through tf.py_function. So you can do something along the lines of:
from pydub import AudioSegment
import matplotlib.pyplot as plt
import tensorflow as tf
dataset = tf.data.Dataset.list_files('path/to/mp3s/*')
def decode_mp3(mp3_path):
    # py_function passes an eager tensor; convert it to a Python string
    mp3_path = mp3_path.numpy().decode("utf-8")
    mp3_audio = AudioSegment.from_file(mp3_path, format="mp3")
    return mp3_audio.get_array_of_samples()
dataset = dataset.map(lambda path:
    tf.py_function(func=decode_mp3, inp=[path], Tout=tf.float32))
for features in dataset.take(3):
    data = features.numpy()
    plt.plot(data)
    plt.show()
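
Note that pydub's get_array_of_samples() returns integer PCM samples (interleaved when there is more than one channel), while the pipeline above declares Tout=tf.float32. A minimal sketch, assuming signed PCM, of scaling those integers to floats and splitting them per channel; the helper name pcm_to_float and its parameters are hypothetical.

```python
from array import array


def pcm_to_float(samples, sample_width=2, channels=1):
    """Scale interleaved signed PCM integers to floats in [-1.0, 1.0).

    Hypothetical helper: sample_width is in bytes (2 for 16-bit audio).
    Returns one list of floats per channel.
    """
    full_scale = float(2 ** (8 * sample_width - 1))
    scaled = [sample / full_scale for sample in samples]
    # De-interleave: channel c owns samples c, c + channels, c + 2 * channels, ...
    return [scaled[c::channels] for c in range(channels)]


if __name__ == "__main__":
    mono = array("h", [0, 16384, -32768])
    print(pcm_to_float(mono))
```

Inside decode_mp3, the values for sample_width and channels could be taken from the AudioSegment attributes of the same names.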
Such a function has recently been added to tensorflow_io.
You can use it like this:
content = tf.io.read_file(path)
audio = tfio.experimental.audio.decode_ogg(content)
For the latest versions of TensorFlow, all audio-related utilities have been moved or added to tensorflow_io. To install it, run pip install tensorflow-io
import tensorflow_io as tfio
import tensorflow as tf
fp = 'path/to/mp3'
audio = tfio.audio.decode_mp3(tf.io.read_file(fp))