I have a corpus of text and I would like to find embeddings for words starting from characters. So I have a sequence of characters as input and I want to project it into a multidimensional space.
As an initialization, I would like to fit already learned word embeddings (for example, the Google ones).
I have some doubts:
1. Do I need to use a character embedding vector for each input character in the input sequence? Would it be a problem if I simply used the ASCII or UTF-8 encoding?
2. Whatever the input vector definition (embedding vector, ASCII, ...), it's really confusing to select a proper model. There are several options but I'm not sure which one is the better choice: seq2seq, auto-encoder, LSTM, multi-regressor + LSTM?
3. Could you give me some sample code in Keras or TensorFlow?
I'll answer each question:
If you want to exploit character similarities (which are distant relatives of phonetic similarities too), you need an embedding layer. Encodings are symbolic inputs while embeddings are continuous inputs. With symbolic inputs, any kind of generalization is impossible because there is no concept of distance (or similarity), while with embeddings the model can behave similarly on similar inputs (and thus generalize). However, since the input space is very small, short embeddings are sufficient.
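To make this concrete, here is a minimal Keras sketch of a character embedding layer, assuming the raw ASCII codes are used directly as indices into the embedding table; the alphabet size, embedding size and maximum word length are illustrative choices, not tuned values.

import numpy as np
from tensorflow import keras

# Minimal sketch: map raw characters to small dense vectors with an Embedding layer.
# ASCII codes are used directly as indices; 0 (NUL) doubles as the padding index.
char_vocab_size = 128   # plain ASCII
char_emb_dim = 16       # short embeddings are enough for such a small input space
max_word_len = 20       # words are padded/truncated to a fixed length

char_input = keras.layers.Input(shape=(max_word_len,), dtype="int32")
char_emb = keras.layers.Embedding(input_dim=char_vocab_size,
                                  output_dim=char_emb_dim,
                                  mask_zero=True)(char_input)

# Example: the word "cat" encoded as ASCII indices, padded with zeros.
example = np.zeros((1, max_word_len), dtype="int32")
example[0, :3] = [ord(c) for c in "cat"]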
The model depends heavily on the kind of phenomena you want to capture. A model that I often see in the literature, and that seems to work well on different tasks, is a multilayer bidirectional LSTM over the characters with a linear layer on top.
The code is similar to any RNN implementation in TensorFlow. A good place to start is the TensorFlow tutorial https://www.tensorflow.org/tutorials/recurrent. The function for creating the bidirectional RNN is https://www.tensorflow.org/api_docs/python/tf/nn/static_bidirectional_rnn
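For question 3, here is a rough Keras sketch of the model described in point 2: stacked bidirectional LSTMs over character embeddings with a linear layer on top, trained to regress onto pre-trained word vectors (e.g. the 300-dimensional Google News embeddings). All layer sizes are illustrative, and you would have to prepare the character sequences and target word vectors yourself.

from tensorflow import keras

max_word_len, char_vocab_size, char_emb_dim = 20, 128, 16
word_emb_dim = 300   # dimensionality of the pre-trained word embeddings

inputs = keras.layers.Input(shape=(max_word_len,), dtype="int32")
x = keras.layers.Embedding(char_vocab_size, char_emb_dim, mask_zero=True)(inputs)
x = keras.layers.Bidirectional(keras.layers.LSTM(64, return_sequences=True))(x)
x = keras.layers.Bidirectional(keras.layers.LSTM(64))(x)
outputs = keras.layers.Dense(word_emb_dim, activation=None)(x)  # linear layer on top

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")   # regress onto the target word vectors
# model.fit(char_sequences, target_word_vectors, epochs=10)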
From experience, I had problems fitting word-based word embeddings with a character model. The reason is that a word-based model will put morphologically similar words very far apart if there is no semantic similarity, while a character-based model can't do that because morphologically similar inputs can hardly be distinguished (they are very close in the embedded space).
This is one of the reasons why, in the literature, people often use character models as a complement to word models rather than as models "per se". Whether a character model alone can capture both semantic and morphological similarities is still an open research question.
Related
I am trying to classify proteins into their families using their sequences. Can I use deep convolutional models for this purpose even though they take the RGB 3-channel input matrices of an image? Is there any specific way to convert a dataset other than images so it can be classified with these models? I'm new to artificial neural networks, so your suggestions are highly appreciated.
First you need to understand that the models you have in mind are tasked with a very difficult problem: object recognition in colored images, which is why those models are so big.
Then you need to know that the purpose of using CNNs is to extract as many features as possible from colored images in order to perform detection.
With the above in mind, I think classifying proteins by their sequences seems achievable with a much smaller convolutional model. You may need at most 10 convolutional layers. In short, you should not need a CNN as complex as Google's Inception model.
About your data: there is no rule that CNNs can only be used on RGB pictures. These pictures are only arrays. If you have any kind of numeric data that can be used in arithmetic operations, you can definitely use CNNs for feature extraction; a rough sketch is given below. I also recommend you take a look at this example.
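As a concrete illustration, here is a minimal Keras sketch of a 1-D convolutional classifier over amino-acid sequences; the alphabet size, sequence length, number of families and layer sizes are assumptions for the sake of the example, not tuned values.

from tensorflow import keras

# 1-D CNN over integer-encoded protein sequences (no RGB images involved).
num_amino_acids = 25      # amino-acid alphabet incl. padding/unknown symbols
max_seq_len = 500
num_families = 10

model = keras.Sequential([
    keras.layers.Input(shape=(max_seq_len,), dtype="int32"),
    keras.layers.Embedding(num_amino_acids, 32),           # learn per-residue vectors
    keras.layers.Conv1D(64, kernel_size=7, activation="relu"),
    keras.layers.MaxPooling1D(3),
    keras.layers.Conv1D(128, kernel_size=7, activation="relu"),
    keras.layers.GlobalMaxPooling1D(),
    keras.layers.Dense(num_families, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])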
I also recommend that you take a look at the following libraries: scikit-learn, Keras and PyTorch. These libraries are very beginner-friendly and they have excellent documentation.
Best of luck.
The model can currently recognize only a single letter with TensorFlow. How can I make it recognize sequences of letters, i.e. words?
Handwritten Digit Recognition. ... MNIST is a widely used dataset for the hand-written digit classification task. It consists of 70,000 labeled 28x28 pixel grayscale images of hand-written digits. The dataset is split into 60,000 training images and 10,000 test images.
The difficulty of the task varies depending on the quality and type of images. Text detection in natural scenes is quite difficult and requires multiple models; there are plenty of research papers in this area, and lots of Kaggle notebooks. This link (a good read) explains the various factors to take into account and why it is so difficult, and also shares the author's implementation.
If you are trying to identify text in a simple binary image, then this might help: Separate image of text into component character images
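For the simple binary-image case, a common approach (a rough sketch, not necessarily the exact method from the linked answer) is to threshold the image, find the contours and cut out one crop per bounding box; "text.png" is a placeholder path.

import cv2

# Threshold, find external contours, and crop each bounding box (OpenCV 4.x signatures).
img = cv2.imread("text.png", cv2.IMREAD_GRAYSCALE)
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

# Sort the boxes left to right so the crops follow reading order.
boxes = sorted((cv2.boundingRect(c) for c in contours), key=lambda b: b[0])
chars = [img[y:y + h, x:x + w] for (x, y, w, h) in boxes]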
I'm using a CNN built with Keras (TensorFlow) to do visual recognition.
I wonder if there is a way to know what my own TensorFlow model "sees".
Google had a news story showing the cat face found in the AI's "brain".
https://www.smithsonianmag.com/innovation/one-step-closer-to-a-brain-79159265/
Can anybody tell me how to extract such an image from my own CNN network?
For example, how does my own CNN model recognize a car?
We have to distinguish between what TensorFlow actually sees:
As we go deeper into the network, the feature maps look less like the
original image and more like an abstract representation of it. As you
can see in block3_conv1 the cat is somewhat visible, but after that it
becomes unrecognizable. The reason is that deeper feature maps encode
high level concepts like “cat nose” or “dog ear” while lower level
feature maps detect simple edges and shapes. That’s why deeper feature
maps contain less information about the image and more about the class
of the image. They still encode useful features, but they are less
visually interpretable by us.
and what we can reconstruct from it as the result of some kind of reverse "deconvolution" process (which is not a real mathematical deconvolution, in fact).
To answer your actual question: there are a lot of good example solutions out there; one you can study with success is Visualizing output of convolutional layer in tensorflow.
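As a minimal Keras sketch of that idea, you can build a second model that outputs the activations of an intermediate layer and plot its feature maps. VGG16 and the layer name "block3_conv1" are used here only because they match the quoted example, and "car.jpg" is a placeholder path; substitute your own network, layer and image.

from tensorflow import keras
import matplotlib.pyplot as plt

# Expose an intermediate convolutional layer of an existing model.
model = keras.applications.VGG16(weights="imagenet", include_top=False)
feature_extractor = keras.Model(inputs=model.input,
                                outputs=model.get_layer("block3_conv1").output)

img = keras.preprocessing.image.load_img("car.jpg", target_size=(224, 224))
x = keras.preprocessing.image.img_to_array(img)[None, ...]
x = keras.applications.vgg16.preprocess_input(x)

maps = feature_extractor.predict(x)           # shape: (1, H, W, channels)
for i in range(8):                            # show the first few channels
    plt.subplot(2, 4, i + 1)
    plt.imshow(maps[0, :, :, i], cmap="viridis")
    plt.axis("off")
plt.show()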
When you build a model to perform visual recognition, you give it labelled data (pictures, in this case) to recognize, so that it can adjust its weights according to the training data. If you wish to build a model that can recognize a car, you have to train it on a large training set containing labelled pictures. This type of recognition is basically categorical recognition.
You can experiment with the MNIST dataset, which provides pictures of digits for image recognition.
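A minimal categorical-recognition sketch on MNIST looks like the following; a car classifier would follow the same pattern, only with your own labelled car/non-car pictures and a different number of classes.

from tensorflow import keras

# Labelled images in, class probabilities out.
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train, x_test = x_train[..., None] / 255.0, x_test[..., None] / 255.0

model = keras.Sequential([
    keras.layers.Conv2D(16, 3, activation="relu", input_shape=(28, 28, 1)),
    keras.layers.MaxPooling2D(),
    keras.layers.Flatten(),
    keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, epochs=1, validation_data=(x_test, y_test))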
I'm trying to work out what's the best model to adapt for an open named entity recognition problem (biology/chemistry, so no dictionary of entities exists but they have to be identified by context).
Currently my best guess is to adapt SyntaxNet so that instead of tagging words as N, V, ADJ etc., it learns to tag them as BEGINNING, INSIDE, OUTSIDE (IOB notation).
However, I am not sure which of these approaches is best:
Syntaxnet
word2vec
seq2seq (I think this is not the right one as I need it to learn on two aligned sequences, whereas seq2seq is designed for sequences of differing lengths as in translation)
Would be grateful for a pointer to the right method! thanks!
Syntaxnet can be used for named entity recognition, e.g. see: Named Entity Recognition with Syntaxnet
word2vec alone isn't very effective for named entity recognition. I don't think seq2seq is commonly used for that task either.
As drpng mentions, you may want to look at tensorflow/tree/master/tensorflow/contrib/crf. Adding an LSTM before the CRF layer would help a bit, which gives something like:
LSTM+CRF code in TensorFlow: https://github.com/Franck-Dernoncourt/NeuroNER
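For a rough idea of the tagger part, here is a minimal Keras sketch of a bidirectional LSTM that emits one IOB tag per token; the CRF on top (as in tensorflow/contrib/crf or NeuroNER) would replace the per-token softmax, and the vocabulary size, sequence length and hidden sizes are illustrative.

from tensorflow import keras

vocab_size, max_len, num_tags = 20000, 100, 3   # tags: BEGINNING, INSIDE, OUTSIDE

inputs = keras.layers.Input(shape=(max_len,), dtype="int32")
x = keras.layers.Embedding(vocab_size, 100, mask_zero=True)(inputs)
x = keras.layers.Bidirectional(keras.layers.LSTM(128, return_sequences=True))(x)
outputs = keras.layers.TimeDistributed(
    keras.layers.Dense(num_tags, activation="softmax"))(x)   # one IOB tag per token

model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")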
I want to use the textsum model for tagging named entities, hence the target vocab size is very small. While training, there doesn't seem to be an option to provide different vocabs on the encoder and on the decoder side, or is there?
See these lines of code on GitHub:
if hps.mode == 'train':
  model = seq2seq_attention_model.Seq2SeqAttentionModel(hps, vocab, num_gpus=FLAGS.num_gpus)
Hrishikesh, I don't believe there is a way to provide separate vocab files, but I'm not fully understanding why you need it. The vocab simply provides a numerical way of representing a word. Therefore, when the model is working with these words, it uses their numerical representations. Once the hypothesis is complete and the statistical choices for words have been made, it simply uses the vocab file to convert the vocab index back to its associated word. I hope this helps answer your question and clarifies why you shouldn't need separate vocab files. That said, I may just be misunderstanding your need for it, and I apologize if that is the case.
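To illustrate the point about the vocab being only a word-to-index mapping, here is a toy sketch; the words and indices are made up, not taken from textsum's data files.

# A vocab is just a word <-> index mapping used in both directions.
word_to_id = {"<UNK>": 0, "protein": 1, "binds": 2, "receptor": 3}
id_to_word = {i: w for w, i in word_to_id.items()}

encoded = [word_to_id.get(w, 0) for w in ["protein", "binds", "receptor"]]
decoded = [id_to_word[i] for i in encoded]
print(encoded, decoded)   # [1, 2, 3] ['protein', 'binds', 'receptor']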
No, there is no out-of-the-box option to use textsum in this way. I don't see any reason why it shouldn't be possible to modify the architecture to achieve this, though. I'd be interested if you could point towards some literature on using seq2seq-with-attention models for NER.