What are some techniques to improve contextual accuracy of semantic search engine using BERT? - semantic-web

I am implementing a semantic search engine using BERT (using cosine distance) To a certain extend the method is able to find out sentences in a high level context. However when it comes narrowed down context of the sentence, it gives several issues.
Example:
If my search term is "Wrong Product", the search engine might match with sentences like "I bought a washing machine but it was defective".
I understand that the fixes to these is always finding some equilibrium and live with minor errors.
However if there are any rules, techniques which has improved your semantic search implementation accuracy please do share.

Related

Word2Vec Tensorflow tutorial weird output

I'm trying out the Word2Vec tutorial at tensorflow (see here: https://www.tensorflow.org/tutorials/text/word2vec)
While all seems to work fine, the output is somewhat unexpected to me, especially the small cluster in the PCA. The 'closet' words in the embedding dimension also don't make much sense, especially compared to other examples.
Am I doing something (trivially) wrong? Or is this expected?
For completeness, I run this in the nvidia-docker image, but also found similar results running cpu only.
Here is the projected embedding showing the cluster.
There can be various reasons.
One reason is that this is due to the so-called hubness problem of embedding spaces, which is an artifact of the high-dimensional space. Some words end up close to a large part of the space and act as sort of hubs in the nearest neighbor search, so through these words, you can get quickly from everywhere to everywhere.
Another reason might be that the model is just undertrained for this particular word. Word embeddings are typically trained on very large datasets, such that every word appears in sufficiently many contexts. If a word does not appear frequently enough or in too ambiguous contexts, then it also ends up to be similar to basically everything.

How to build short sentences with a small letter set restriction?

I'm looking for a way to write a program that creates short german sentences with a restricted letter set. The sentences can be nonsense but should grammatically be correct. The following examples only contain the letters "aeilmnost":
"Antonia ist mit Tina im Tal."
"Tamina malt mit lila Tinte Enten."
"Tina nimmt alle Tomaten mit."
For this task I need a dictionary like this one (found in the answer to "Where can I find a parsable list of German words?"). The research area for programatically create text is NLG - Natural Language Generation. On the NLG-Wiki I found a large table of NLG systems. I picked two from the list, which could be appropriate:
SimpleNLG - a Java API, which has also an adaption for the german language
KOMET - multilingual generation, from University Bremen
Do you have worked with a NLG library and have some advice which one to use for building short sentences with a letter set restriction?
Can you recommend a paper to this topic?
Grammatically correct is a pretty fuzzy area, since grammar is not to strictly defined as one might think. What you really want here though, is a part-of-speech tagger, and a markov chain.
Specifically a markov chain says that given a certain state (the first word for instance) there's just a certain chance of moving on to another state (the next word). They are relatively easy to write from scracth, but I've got a gist here in python that shows how they work if you want an example.
Once you've got that I would suggest a part-of-speech-based markov chain, combined with just checking to see if words are constructed from your desired character set. In general the algorithm would go something like this:
Pick first word at random, checking that it is constructed solely from your desired set of characters
Use the Markov Chain to predict the next word
Check if that word is an appropriate part of speech, and that it conforms to the desired character set.
If not, predict another word until it is the case.
If so, then repeat starting at 2 to completion.
Hope that's what you're looking for. Let me know if you have any more questions.
As Slater Tyranus already said, Markov chains certainly form the basis of this task. I am going to suggest a more heavy-duty approach. It is considerably more work, but is likely to give much better results in terms of grammatical correctness.
Language Model based on PCFG parse trees: A language model works by assigning a probability to a sequence of words. It requires training data, however, in order to be built first. In your case, the training process should disregard words containing letters outside the limited set.
While theoretically a language model based on parse trees is much more likely to serve your purpose, there is one caveat: due to the kind of letter-based restriction you have, data sparsity will certainly raise its ugly head. Backoff techniques (e.g. Katz's backoff model) can help a bit, but it will essentially depend on whether or not you can train on enough enough data.
As far as readily available parsers are concerned, the Stanford NLP group provides a German parser based on the Negra corpus, as mentioned in their home page.

how to determine the accuracy of a semantic search engine?

my cousin has created a semantic search engine and he claims that his search engine is the most accurate.
I've seen many semantic search engine and they all look the same to me, because they are not designed to give you results based on the keyword you type.
So if you are creating a semantic search engine, how to to determine the accuracy of its results?
Actually sarnold's suggestion is not far off the mark.
What you would typically do is to take a whole bunch of people and have them try out a bunch of standard queries. Or if you wanted to make the experiment fairer you might let each user pick their own queries to avoid any accusation of bias (because you could pick standard queries you knew your engine was good at answering).
For each query the user would look through the first 10 or so results and say whether they thought each result was relevant or not (you may want to have users score on a scale rather than just yes/no).
Then for each of the queries you can calculate accuracy scores, depending on exactly how you set up the experiment Precision and Recall may be the most appropriate measures though these rely on having a known expected answer which you may not necessarily have. It may be simpler and more appropriate to calculate a simple percentage accuracy.
To determine whether your search engine was better than your competitors you'd have the same people perform the same queries on those search engines scoring in the same way. Having done this you can then calculate and compare the scores for the search engines against your own.

Suitability of Naive Bayes classifier in Mahout to classifying websites

I'm currently working on a project that requires a database categorising websites (e.g. cnn.com = news). We only require broad classifications - we don't need every single URL classified individually. We're talking to the usual vendors of such databases, but most quotes we've had back are quite expensive and often they impose annoying requirements - like having to use their SDKs to query the database.
In the meantime, I've also been exploring the possibility of building such a database myself. I realise that this is not a 5 minute job, so I'm doing plenty of research.
From reading various papers on the subject, it seems a Naive Bayes classifier is generally the standard approach for doing this. However, many of the papers suggest enhancements to improve its accuracy in web classification - typically by making use of other contextual information, such as hyperlinks, header tags, multi-word phrases, the URL, word frequency and so on.
I've been experimenting with Mahout's Naive Bayes classifier against the 20 Newsgroup test dataset, and I can see its applicability to website classification, but I'm concerned about its accuracy for my use case.
Is anyone aware of the feasibility of extending the Bayes classifier in Mahout to take into account additional attributes? Any pointers as to where to start would be much appreciated.
Alternatively, if I'm barking up entirely the wrong tree please let me know!
You can control the input about as much as you'd like. In the end the input is just a feature vector. The feature vector's features can be words, or bigrams -- but they can also be whatever you want. So, yes, you can inject new features by modifying the input as you like.
How best to weave in those features is another topic entirely -- there's not one best way to convert them to numbers. Mahout in Action covers this reasonably well FWIW.

Non-Speech Noise or Sound Recognition Software?

I'm working on some software for children, and looking to add the ability for the software to respond to a number of non-speech sounds. For instance, clapping, barking, whistling, fart noises, etc.
I've used CMU Sphinx and the Windows Speech API in the past, however, as far as I can tell neither of these have any support for non-speech noises, and in fact I believe actively filter them out.
In general I'm looking for "How do I get this functionality" but I suspect it may help if I break it down into three questions that are my guesses for what to search for next:
Is there a way to use one of the main speech recognition engines to recognize non-word sounds by changing an acoustic model or pronunciation lexicon?
(or) Is there already an existing library to do non-word noise recognition?
(or) I have a bit of familiarity with Hidden Markov Models and the underlying tech of voice recognition from college, but no good estimate on how difficult it would be to create a very small noise/sound recognizer from scratch (suppose <20 noises to be recognized). If 1) and 2) fail, any estimation on how long it would take to roll my own?
Thanks
Yes, you can use speech recognition software like CMU Sphinx for recognition of non-speech sounds. For this, you need to create your own acoustical and language models and define the lexicon restricted to your task. But to train the corresponding acoustic model, you must have enough training data with annotated sounds of interest.
In short, the sequence of steps is the following:
First, prepare resources for training: lexicon, dictionary etc. The process is described here: http://cmusphinx.sourceforge.net/wiki/tutorialam. But in your case, you need to redefine phoneme set and the lexicon. Namely, you should model fillers as real words (so, no ++ around) and you don't need to define the full phoneme set. There are many possibilities, but probably the most simple one is to have a single model for all speech phonemes. Thus, your lexicon will look like:
CLAP CLAP
BARK BARK
WHISTLE WHISTLE
FART FART
SPEECH SPEECH
Second, prepare training data with labels: Something similar to VoxForge, but text annotations must contain only labels from your lexicon. Of course, non-speech sounds must be labeled correctly as well. Good question here is where to get large enough amount of such data. But I guess it should be possible.
Having that, you can train your model. The task is simpler compared to speech recognition, for instance, you don't need to use triphones, just monophones.
Assuming equal prior probability of any sound/speech, the simplest language model can be a loop-like grammar (http://cmusphinx.sourceforge.net/wiki/tutoriallm):
#JSGF V1.0;
/**
* JSGF Grammar for Hello World example
*/
grammar foo;
public <foo> = (CLAP | BARK | WHISTLE | FART | SPEECH)+ ;
This is the very basic approach to using ASR toolkit for your task. In can be further improved by fine-tuning HMMs configurations, using statistical language models and using fine-grained phonemes modeling (e.g. distinguishing vowels and consonants instead of having single SPEECH model. It depends on nature of your training data).
Outside the framework of speech recognition, you can build a simple static classifier that will analyze the input data frame by frame. Convolutional neural networks that operate over spectrograms perform quite well for this task.
I don't know any existing libraries you can use, I suspect you may have to roll your own.
Would this paper be of interest? It has some technical detail, they seem to be able to recognise claps and differentiate them from whistles.
http://www.cs.bham.ac.uk/internal/courses/robotics/halloffame/2001/team14/sound.htm