Please see the following code. After reading in a CSV file of 5,000 rows, I get the error message below:
nlp = spacy.blank("en")
nlp.max_length = 3000000
nlp.add_pipe(
    "text_categorizer",
    config={
        "data": data,
        "model": "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2",
        "device": "cpu"
    }
)
ValueError: [E088] Text of length 2508705 exceeds maximum of 1000000. The parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the nlp.max_length limit. The limit is in number of characters, so you can check whether your inputs are too long by checking len(text).
Is there any way this can be solved?
Thanks in advance!
Setting nlp.max_length should work in general (up until you run out of memory, at least):
import spacy
nlp = spacy.blank("en")
nlp.max_length = 10_000_000
doc = nlp("a " * 2_000_000)
assert len(doc.text) == 4_000_000
I doubt the sentence-transformers model can handle texts of this length, though. In terms of linguistic annotation, it's unlikely to be useful to have single docs that are this long.
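If the 2.5 million characters come from joining all 5,000 CSV rows into one string, it is usually more practical to hand each row to spaCy as its own document; none of them should then come near max_length. A minimal sketch, assuming the rows live in a pandas DataFrame with a hypothetical text column (the text_categorizer pipe from the question can be added to the same nlp object):

import pandas as pd
import spacy

nlp = spacy.blank("en")

# hypothetical file and column names
df = pd.read_csv("my_file.csv")
texts = df["text"].astype(str).tolist()

# each row becomes its own Doc, so no single text approaches max_length
docs = list(nlp.pipe(texts, batch_size=64))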
Related
I'm following this tutorial (https://colab.research.google.com/github/tensorflow/text/blob/master/docs/guide/subwords_tokenizer.ipynb#scrollTo=kh98DvoDz7Jn) to generate a vocabulary from a custom dataset. In the tutorial, it takes around 2 minutes for this code to complete:
bert_vocab_args = dict(
    # The target vocabulary size
    vocab_size=8000,
    # Reserved tokens that must be included in the vocabulary
    reserved_tokens=reserved_tokens,
    # Arguments for `text.BertTokenizer`
    bert_tokenizer_params=bert_tokenizer_params,
    # Arguments for `wordpiece_vocab.wordpiece_tokenizer_learner_lib.learn`
    learn_params={},
)

pt_vocab = bert_vocab.bert_vocab_from_dataset(
    train_pt.batch(1000).prefetch(2),
    **bert_vocab_args
)
On my dataset it takes a lot longer... I tried increasing the batch size as well as decreasing the size of the vocabulary, all to no avail. Is there any way to make this go faster?
I ran into the same issue. This is how I resolved it:
First I checked the number of elements in the dataset:
examples, metadata = tfds.load('my_dataset', as_supervised=True, with_info=True)
print(metadata)
In my case, the dataset contained more than 5 million elements, which explains why creating the vocabulary took so long.
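As a side note, tf.data.experimental.cardinality gives a quick element count for a split without digging through the metadata (a sketch; it can return an unknown-cardinality sentinel for some datasets, in which case you would have to iterate):

import tensorflow as tf
import tensorflow_datasets as tfds

examples, metadata = tfds.load('my_dataset', as_supervised=True, with_info=True)

# returns -2 (UNKNOWN_CARDINALITY) if the size cannot be determined statically
print(tf.data.experimental.cardinality(examples['train']).numpy())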
The Portuguese vocabulary in the TensorFlow example is built from roughly 50,000 elements, so I selected 1% of my dataset:
train_tokenize, metadata = tfds.load('my_dataset', split='train[:1%]', as_supervised=True, with_info=True)
I then used this dataset to develop the vocabulary, which took some 2 minutes:
train_en_tokenize = train_tokenize.map(lambda en, ol: en)
train_ol_tokenize = train_tokenize.map(lambda en, ol: ol)

ol_vocab = bert_vocab.bert_vocab_from_dataset(
    train_ol_tokenize.batch(1000).prefetch(2),
    **bert_vocab_args
)
en_vocab = bert_vocab.bert_vocab_from_dataset(
    train_en_tokenize.batch(1000).prefetch(2),
    **bert_vocab_args
)
where ol stands for the 'other language' I am developing the model for.
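If your data does not come through tfds.load, the same subsampling idea works with Dataset.take(), which caps how many examples the vocabulary learner sees. A sketch, reusing train_pt and bert_vocab_args from the snippet above:

# keep roughly 50,000 examples instead of several million
train_pt_small = train_pt.take(50_000)

pt_vocab = bert_vocab.bert_vocab_from_dataset(
    train_pt_small.batch(1000).prefetch(2),
    **bert_vocab_args
)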
I am currently doing the following TensorFlow tutorial: https://www.tensorflow.org/tutorials/text/solve_glue_tasks_using_bert_on_tpu
Testing the outputs of the tokenize function on different sentences, I wonder what happens when tokenizing unknown words.
Loading model:
import tensorflow as tf
import tensorflow_hub as hub

bert_model_name = 'bert_en_uncased_L-12_H-768_A-12'
tfhub_handle_encoder = 'https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/3'
tfhub_handle_preprocess = 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3'

bert_preprocess = hub.load(tfhub_handle_preprocess)
Tokenizing sentence/word:
tok = bert_preprocess.tokenize(tf.constant(['Tensorsss bla']))
print(tok)
# Output:
<tf.RaggedTensor [[[23435, 4757, 2015], [1038, 2721]]]>
Shouldn't every word be tokenized to a single token? Those are obviously made-up words, but I am wondering what happens when you encode those words to fixed-length vectors.
Also, how does the tokenizer transform the made-up words into 3 different tokens? Does it split the unknown words into different known parts?
The default cache location for the tensorflow/bert_en_uncased_preprocess/3 model is /tmp/tfhub_modules/602d30248ff7929470db09f7385fc895e9ceb4c0 (more on caching). In the assets directory, you'll find vocab.txt, which is the vocabulary used. You can use this file to look up which token the token-id i corresponds to by looking at line i+1 of the file, e.g.
sed '23436q;d' /tmp/tfhub_modules/602d30248ff7929470db09f7385fc895e9ceb4c0/assets/vocab.txt
> tensor
Doing that for all token-ids returns
[tensor, ##ss, ##s], [b, ##la]
As you can see, this confirms your theory that words are split into different known parts. More details on the exact algorithm can be found in Subword tokenizers.
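If you'd rather do the lookup from Python than with sed, reading vocab.txt once and indexing into it gives the same mapping (a sketch; the cache path is the one quoted above and may differ on your machine):

# adjust to wherever TF Hub cached the model for you
vocab_path = "/tmp/tfhub_modules/602d30248ff7929470db09f7385fc895e9ceb4c0/assets/vocab.txt"

with open(vocab_path, encoding="utf-8") as f:
    vocab = [line.rstrip("\n") for line in f]

# token-id i lives on line i+1, i.e. vocab[i] with zero-based indexing
ids = [23435, 4757, 2015, 1038, 2721]
print([vocab[i] for i in ids])
# expected: ['tensor', '##ss', '##s', 'b', '##la']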
I'm using tf.keras.preprocessing.text.Tokenizer to build a vocabulary for my corpus (5 million documents). The tokenizer finds 145K tokens. The issue is that the embedding layer has far too many parameters.
What's a simple way to force the tokenizer to only consider the top N most common words? Then anything outside of that would get an `<OOV>` token.
Solution
As indicated by @MarcoCerliani in the comments, you can simply change the num_words parameter on the tokenizer. That won't change the tokenizer's internal data; it'll instead return OOV tokens for words outside the range specified by num_words.
tokenizer = tf.keras.preprocessing.text.Tokenizer(oov_token='<OOV>')
tokenizer.fit_on_texts(huge_corpus)
len(tokenizer.word_counts)
>>> (some huge vocabulary size)
# Change vocabulary size
tokenizer.num_words = 100
tokenizer.texts_to_sequences(texts)
>>> (any words in `texts` not within the 100 most common words
>>> will get `1`, the out-of-vocabulary index)
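num_words can also be passed straight to the constructor, which reads a bit more clearly than setting the attribute afterwards; fitting still records the full vocabulary either way, only the sequence output is limited:

tokenizer = tf.keras.preprocessing.text.Tokenizer(num_words=100, oov_token='<OOV>')
tokenizer.fit_on_texts(huge_corpus)

# word_counts still covers the whole corpus; texts_to_sequences maps anything
# outside the most common words to the OOV index
tokenizer.texts_to_sequences(texts)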
How do I limit the number of CPUs used by Spacy?
I want to extract parts of speech and named entities from a large set of sentences. Because of limitations regarding RAM, I first use the Python NLTK to parse my documents into sentences. I then iterate over my sentences and use nlp.pipe() to do the extractions. However, when I do this, spaCy consumes the whole of my computer; it uses every available CPU. That is a problem because my computer is shared. How can I limit the number of CPUs used by spaCy? Here is my code to date:
# require
from nltk import sent_tokenize
import spacy

# initialize
file = './walden.txt'
nlp = spacy.load('en')

# slurp up the given file
handle = open(file, 'r')
text = handle.read()

# parse the text into sentences, and process each one
sentences = sent_tokenize(text)
for sentence in nlp.pipe(sentences, n_threads=1):
    # process each token
    for token in sentence:
        print("\t".join([token.text, token.lemma_, token.tag_]))

# done
quit()
My answer to my own question is, "Call the operating system and employ a Linux utility named taskset."
import os

# limit ourselves to a few processors only
os.system("taskset -pc 0-1 %d > /dev/null" % os.getpid())
This particular solution limits the running process to the first two cores (CPU ids 0 and 1). This solution is good enough for me.
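On Linux, os.sched_setaffinity does the same thing without shelling out to taskset; a sketch that pins the current process to two CPUs:

import os

# pin this process (pid 0 = current) to CPU ids 0 and 1 only
os.sched_setaffinity(0, {0, 1})
print(os.sched_getaffinity(0))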
When trying to create a new version in the Google Cloud console, I get an error like:
Field: version.deployment_uri Error: The total size of files in gs://my-bucket/ml/ is 2150116163 bytes, which exceeds the allowed maximum of 1073741824 bytes.
My model is an RNN model. I believe the sequence embedding, i.e. the vocabulary size, is likely the cause of the large model.
Is there a quota setting that can be adjusted for larger models?
Unfortunately, that limit is not adjustable at this time, although it may be in the future.
Are you comfortable sharing how large your model is? That information is valuable for us for planning purposes.
In the meantime, you will need to adjust the vocab and embedding sizes or otherwise reduce the size of the model.
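For a rough sense of where the bytes go, the embedding matrix alone costs vocab_size x embedding_dim x 4 bytes in float32, so trimming either factor is the main lever. A back-of-the-envelope sketch with hypothetical numbers:

# hypothetical sizes; substitute your own
vocab_size = 500_000
embedding_dim = 512
bytes_per_float32 = 4

embedding_bytes = vocab_size * embedding_dim * bytes_per_float32
print(embedding_bytes / 1024 ** 3, "GiB")  # roughly 0.95 GiB for the embedding alone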