Query for substrings from freeform STT input - sql

I have a PostgreSQL database with vocabulary in a table.
I want to receive Speech to Text (STT) input and query my vocabulary table for matches.
This is tricky since STT is somewhat free-form.
Let's say the table contains the following vocabulary and phrases:
How are you?
Hi
Nice to meet you
Hill
Nice
And the user is prompted to speak: "Hi, nice to meet you"
As it comes in, I transcribe their input as "Hi nice to meet you" and query my database for individual vocabulary matches. I want to return:
[
  {
    id: 2,
    word: "Hi"
  },
  {
    id: 3,
    word: "Nice to meet you"
  }
]
I could query with wildcards (where word ilike '%${term}%'), but then I'd need to pass in the correct substring for it to find the match, e.g. where word ilike '%Hi%', and this may incorrectly return Hill. I could also split the spoken input on spaces, giving me ["Hi", "nice", "to", "meet", "you"], and loop through each word looking for a match, but this may return Nice rather than the phrase Nice to meet you.
Q: How can I correctly pass substrings to a query and return accurate results for free-form speech?

Two PostgreSQL functions could help you here:
to_tsvector: creates a text-search vector of tokens (lexemes: units of lexical meaning)
to_tsquery: queries that vector for occurrences of certain words or phrases
See Mastering PostgreSQL Tools: Full-Text Search and Phrase Search
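One way to apply those functions to your case is to build the tsvector from the transcribed input and test each vocabulary row against it as a phrase. A minimal sketch in Python (assumptions: a table named vocabulary with id and word columns, psycopg2 as the driver, and PostgreSQL 9.6+ for phraseto_tsquery, the phrase-aware variant of to_tsquery):

import psycopg2

transcript = "Hi nice to meet you"

conn = psycopg2.connect("dbname=mydb")  # hypothetical connection string
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT id, word
        FROM vocabulary
        -- 'simple' config: no stemming or stopword removal, so 'Hill' cannot match 'Hi'
        WHERE to_tsvector('simple', %s) @@ phraseto_tsquery('simple', word)
        ORDER BY length(word) DESC
        """,
        (transcript,),
    )
    matches = cur.fetchall()
print(matches)  # longest matches first; note that the shorter entry "Nice" also matches

If you only want the longest match, keep the ORDER BY length(word) DESC and drop, on the application side, any row whose text is already contained in an accepted match.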
If that's not enough, you'll need to turn to natural language processing (NLP).
Something like PyTextRank (which goes beyond the bag-of-words technique) could help:
import spacy
import pytextrank
text = "Hi, how are you?"
# load a spaCy model, depending on language, scale, etc.
nlp = spacy.load("en_core_web_sm")
# add PyTextRank to the spaCy pipeline
# (pytextrank 2.x API; with pytextrank 3.x / spaCy 3.x this becomes nlp.add_pipe("textrank"))
tr = pytextrank.TextRank()
nlp.add_pipe(tr.PipelineComponent, name="textrank", last=True)
doc = nlp(text)
# examine the top-ranked phrases in the document
for p in doc._.phrases:
    print("{:.4f} {:5d}  {}".format(p.rank, p.count, p.text))
    print(p.chunks)

Related

Can I clear the stopword list in lucene.net in order for exact matches to work better?

When dealing with exact matches, I'm given a real-world query like this:
"not in education, employment, or training"
Converted to a Lucene query with stopwords removed gives:
+Content:"? ? education employment ? training"
Here's a more contrived example:
"there is no such thing"
Converted to a Lucene query with stopwords removed gives:
+Content:"? ? ? ? thing"
My goal is to have searches like these match only the exact match as the user entered it.
Could one solution be to clear the stopword list? Would this have adverse effects? If so, what? (My google-fu failed.)
This all depends on the analyzer you are using. The StandardAnalyzer uses stop words and strips them out; in fact, the StopAnalyzer is where the StandardAnalyzer gets its stop words from.
Use the WhitespaceAnalyzer, or create your own by inheriting from the one that most closely suits your needs and modifying it to do what you want.
Alternatively, if you like the StandardAnalyzer you can new one up with a custom stop word list:
//This is what the default stop word list is, in case you want to use or filter it
var defaultStopWords = StopAnalyzer.ENGLISH_STOP_WORDS_SET;

//create a new StandardAnalyzer with custom stop words
var sa = new StandardAnalyzer(
    Version.LUCENE_29,      //depends on your version
    new HashSet<string>     //pass in your own stop word list
    {
        "hello",
        "world"
    });

How to extract only English words from a big text corpus using nltk?

I want to remove all non-dictionary English words from a text corpus. I have removed stopwords, tokenized, and count-vectorized the data. I need to extract only the English words and attach them back to the dataframe.
data['Clean_addr'] = data['Adj_Addr'].apply(lambda x: ' '.join([item.lower() for item in x.split()]))
data['Clean_addr']=data['Clean_addr'].apply(lambda x:"".join([item.lower() for item in x if not item.isdigit()]))
data['Clean_addr']=data['Clean_addr'].apply(lambda x:"".join([item.lower() for item in x if item not in string.punctuation]))
data['Clean_addr'] = data['Clean_addr'].apply(lambda x: ' '.join([item.lower() for item in x.split() if item not in (new_stop_words)]))
cv = CountVectorizer(max_features=200, analyzer='word')
cv_addr = cv.fit_transform(data.pop('Clean_addr'))
Sample Dump of the File I am using
https://www.dropbox.com/s/allhfdxni0kfyn6/Test.csv?dl=0
After you first tokenize your text corpus, you could instead stem the word tokens:
import nltk
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer(language="english")
SnowballStemmer is the algorithm that executes the stemming; stemming is just the process of breaking a word down into its root.
Passing the argument 'english' selects the Porter2 stemming algorithm; more precisely, this 'english' argument maps to stem.snowball.EnglishStemmer.
(The Porter2 stemmer is considered to be better than the original Porter stemmer.)
 
stems = [stemmer.stem(t) for t in tokenized]
Above, I define a list comprehension, which executes as follows:
the list comprehension loops over our tokenized input list tokenized
(tokenized can also be any other iterable input instance)
its action is to call the .stem method on each tokenized word using the SnowballStemmer instance stemmer
it then collects only the English stems
i.e., it is a list that should collect only stemmed English word tokens

Caveat: the list comprehension could conceivably include certain identical inflected words from other languages that English descends from, because Porter2 would mistakenly treat them as English words.
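Putting those pieces together, a small, self-contained run might look like this (the sample sentence is only an illustration, and nltk.word_tokenize needs the punkt tokenizer data downloaded once):

import nltk
from nltk.stem.snowball import SnowballStemmer

# nltk.download("punkt")  # uncomment on first run to fetch the tokenizer data
stemmer = SnowballStemmer(language="english")
tokenized = nltk.word_tokenize("The fishermen were fishing for smaller fishes")
stems = [stemmer.stem(t) for t in tokenized]
print(stems)  # 'fishing' and 'fishes' both reduce to the root 'fish'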
Down To The Essence
I had a VERY similar need. Your question appeared in my search, but I felt I needed to look further, and I found THIS. I did a bit of modification for my specific needs (only English words from TONS of technical data sheets = no numbers or test standards or values or units, etc.). After much pain with other approaches, the code below worked. I hope it can be a good launching point for you and others.
import nltk
from nltk.corpus import stopwords
words = set(nltk.corpus.words.words())
stop_words = stopwords.words('english')
file_name = 'Full path to your file'
with open(file_name, 'r') as f:
    text = f.read()
text = text.replace('\n', ' ')
new_text = " ".join(w for w in nltk.wordpunct_tokenize(text)
                    if w.lower() in words
                    and w.lower() not in stop_words
                    and len(w.lower()) > 1)
print(new_text)
I used the pyenchant library to do this.
import enchant
import nltk
from tqdm import tqdm

d = enchant.Dict("en_US")

def get_eng_words(data):
    eng = []
    for sample in tqdm(data):
        sentence = ''
        word_tokens = nltk.word_tokenize(sample)
        for word in word_tokens:
            if d.check(word):
                if sentence == '':
                    sentence = sentence + word
                else:
                    sentence = sentence + " " + word
        print(sentence)
        eng.append(sentence)
    return eng
To save it just do this!
sentences=get_eng_words(df['column'])
df['column']=pd.DataFrame(sentences)
Hope it helps anyone!

How to do a startsWith and then Contains Search using Lucene.NET 3.0?

What is the best way to search and index in Lucene.NET 3.0 so that the results come out ordered in the following way:
Results that start with the full query text (as a single word) e.g. "Bar Acme"
Results that start with the search term as a word fragment e.g. "Bart Simpson"
Results that contain the query text as a full word e.g. "National Bar Association"
Results that contain the query text as a fragment e.g. "United Bartenders Inc"
Example: Searching for Bar
Ordered Results:
Bar Acme
Bar Lunar
Bart Simpson
National Bar Association
International Bartenders Association
Lucene doesn't generally support searching/scoring based on position within a field. It would be possible to support it if you prefixed every field with some known field-start delimiter, or something similar. I don't really think it makes sense through the lens of full-text search, where position within the text field isn't relevant (i.e. if I were searching for Bar in a document, I would likely be rather annoyed if "Bart Simpson" were returned before "National Bar Association").
Apart from that, though, a simple prefix search handles everything else. So if you simply add your start-of-word token, you can search for the modified term with a higher-boost prefix query than the original, and then you should have precisely what you describe.
It can be achieved with LINQ. Run the Lucene search with a hit count of Int32.MaxValue, loop over the results in ScoreDocs, and store them in a collection Searchresults.
sample code:
Searchresults = (from scoreDoc in results.ScoreDocs
                 select new SearchResults { suggestion = searcher.Doc(scoreDoc.Doc).Get("suggestion") })
                .OrderBy(x => x.suggestion)
                .ToList();

SearchresultsStartswith = Searchresults
    .Where(x => x.suggestion.ToLower().StartsWith(searchStringLinq.ToLower()))
    .Take(10)
    .ToList();

if (SearchresultsStartswith.Count > 0)
    return SearchresultsStartswith.ToList();
else
    return Searchresults.Take(10).ToList();

Lucene's best way to do "starts-with" queries

I want to be able to do the following types of queries:
The data to index consists of (let's say) music videos, where only the title is interesting.
I simply want to index these and then create queries for them such that, whatever word or words the user uses in the query, the documents containing those words, in that order, at the beginning of the title are returned first, followed (in no particular order) by documents containing at least one of the searched words in any position in the title. All of this should also be case-insensitive.
Example:
For documents:
Video1Title = Sea is blue
Video2Title = Wild sea
Video3Title = Wild sea Whatever
Video4Title = Seaside Whatever
If I search "sea" I want to get
"Video1Title = Sea is blue"
first, followed by all the other documents that contain "sea" in the title, but not at the beginning.
If I search "Wild sea" I want to get
Video2Title = Wild sea
Video3Title = Wild sea Whatever
first, followed by all the other documents that have "Wild" or "Sea" in their title but don't have "Wild Sea" as a title prefix.
If I search "Seasi" I don't want to get anything (I don't care about keyword tokenization and prefix queries).
Now, AFAIK, there's no actual way to tell Lucene "find me documents where word1 and word2 etc. are in positions 1 and 2 and 3 etc."
There are "workarounds" to simulate that behaviour:
Index the field twice. In field1 you have the words tokenized (using perhaps StandardAnalyzer) and in field2 you have them all clumped up into one element (using KeywordAnalyzer). Then you search with something like:
+(field1:word1 word2 word3) (field2:"word1 word2 word3*")
effectively telling Lucene "documents must contain word1 or word2 or word3 in the title, and furthermore those that match "title starts with >word1 word2 word3<" are better (get a higher score)".
Add a "lucene_start_token" to the beginning of the field when indexing them such that
Video2Title = Wild sea is indexed as "title:lucene_start_token Wild sea" and so on for the rest
Then do a query such that:
+(title:sea) (title:"lucene_start_token sea")
and having Lucene return all documents which contain my search word(s) in the title, giving a better score to those that matched "lucene_start_token + search words".
My question is then: are there indeed better ways to do this (maybe using PhraseQuery and term positions)? If not, which of the above is better performance-wise?
You can use Lucene Payloads for that. You can give custom boost for every term of the field value.
So, when you index your titles you can start using a boost factor of 3 (for example):
title: wild|3.0 creatures|2.5 blue|2.0 sea|1.5
title: sea|3.0 creatures|2.5
Indexing this way, you are boosting the terms nearest to the start of the title.
The main problem with this approach is that you have to tokenize by yourself and add all this boost information "manually", as the Analyzer needs the text structured that way (term1|1.1 term2|3.0 term3).
What you could do is index the title and each token separately, e.g. text wild deep blue endless sea would be indexed like:
title: wild deep blue endless sea
t1: wild
t2: deep
t3: blue
t4: endless
t5: sea
Then if someone queries "wild deep", the query would be rewritten into
title:"wild deep" OR (t1:wild AND t2:deep)
This way you will always find all matching documents (if they match the title), but matching t1..tN tokens will score the relevant documents higher.

Need to extract information from free text, information like location, course etc

I need to write a text parser for the education domain which can extract information like institute, location, course, etc. from free text.
Currently I am doing it with Lucene; the steps are as follows:
Index all the data related to institute, courses, and location.
Make shingles of the free text, search each shingle in the location, course, and institute index directories, and then try to find out which part of the text represents a location, course, etc.
With this approach I am missing a lot of cases, e.g. B.tech can be written as btech, b-tech or b.tech.
I want to know whether there is anything available that can do all these kinds of things. I have heard about LingPipe and GATE but don't know how efficient they are.
You definitely need GATE. GATE has 2 main, most frequently used features (among thousands of others): rules and dictionaries. Dictionaries (gazetteers in GATE's terms) allow you to put all possible cases like "B.tech", "btech" and so on in a single text file and let GATE find and mark them all. Rules (more precisely, JAPE rules) allow you to define patterns in text. For example, here's a pattern to catch MIT's postal address ("77 Massachusetts Ave., Building XX, Cambridge MA 02139"):
{Token.kind == number}(SP){Token.orth == uppercase}(SP){Lookup.majorType == avenue}(COMMA)(SP)
{Token.string == "Building"}(SP){Token.kind == number}(COMMA)(SP)
{Lookup.majorType == city}(SP){Lookup.majorType == USState}(SP){Token.kind == number}
where (SP) and (COMMA) are macros (just to make the text shorter), {Something} is an annotation, {Token.kind == number} is the annotation "Token" with the feature "kind" equal to "number" (i.e. just a number in the text), and {Lookup} is an annotation that captures values from a dictionary (BTW, GATE already has dictionaries for such things as US cities). This is quite a simple example, but you should see how easily you can cover even very complicated cases.
I didn't use Lucene, but in your case I would leave the different forms of the same keyword as they are and just hold a link table or some such. In this table I'd keep the relation between these different forms.
You may need to write a regular expression to cover each possible form of your vocabulary (see the sketch below).
Be careful about your choice of analyzer / tokenizer, because words like B.tech can easily be split into 2 different words (i.e. B and tech).
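A minimal sketch of that link table / regular-expression idea in Python (the variant list and the pattern below are illustrative assumptions, not an exhaustive vocabulary):

import re

# "link table": map surface variants to one canonical form before indexing or searching
CANONICAL = {
    "btech": "B.Tech",
    "b-tech": "B.Tech",
    "b.tech": "B.Tech",
}

def normalize(token):
    return CANONICAL.get(token.lower(), token)

# or a single regular expression that covers the common spellings
BTECH_RE = re.compile(r"\bb[.\-\s]?tech\b", re.IGNORECASE)
print(BTECH_RE.findall("She holds a B.Tech; he wrote btech and b-tech."))
# -> ['B.Tech', 'btech', 'b-tech']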
You may want to check UIMA. Like LingPipe and GATE, this framework features text annotation, which is what you are trying to do. Here is a tutorial which will help you write an annotator for UIMA:
http://uima.apache.org/d/uimaj-2.3.1/tutorials_and_users_guides.html#ugr.tug.aae.developing_annotator_code
UIMA has addons, in particular one for Lucene integration.
You can try http://code.google.com/p/graph-expression/
Here is an example of address parsing rules:
GraphRegExp.Matcher Token = match("Token");
GraphRegExp.Matcher Country = GraphUtils.regexp("^USA$", Token);
GraphRegExp.Matcher Number = GraphUtils.regexp("^\\d+$", Token);
GraphRegExp.Matcher StateLike = GraphUtils.regexp("^([A-Z]{2})$", Token);
GraphRegExp.Matcher Postoffice = seq(match("BoxPrefix"), Number);
GraphRegExp.Matcher Postcode =
    mark("Postcode", seq(GraphUtils.regexp("^\\d{5}$", Token), opt(GraphUtils.regexp("^\\d{4}$", Token))));

// mark(String, Matcher) -- means creating a chunk over the sub-matcher
GraphRegExp.Matcher streetAddress = mark("StreetAddress", seq(Number, times(Token, 2, 5).reluctant()));
// without new lines
streetAddress = regexpNot("\n", streetAddress);
GraphRegExp.Matcher City = mark("City", GraphUtils.regexp("^[A-Z]\\w+$", Token));

Chunker chunker = Chunkers.pipeline(
    Chunkers.regexp("Token", "\\w+"),
    Chunkers.regexp("BoxPrefix", "\\b(POB|PO BOX)\\b"),
    new GraphExpChunker("Address",
        seq(
            opt(streetAddress),
            opt(Postoffice),
            City,
            StateLike,
            Postcode,
            Country
        )
    ).setDebugString(true)
);
B.tech can be written as btech, b-tech or b.tech
Lucene will let you do fuzzy searches based on the Levenshtein Distance. A query for roam~ (note the ~) will find terms like foam and roams.
That might allow you to match the different cases.
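To make the "Levenshtein Distance" part concrete, here is a plain-Python sketch of the edit distance itself (illustrative only; it is not Lucene code, just the measure a fuzzy query is based on):

def levenshtein(a, b):
    # minimum number of single-character insertions, deletions, and substitutions
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(levenshtein("roam", "foam"))     # 1
print(levenshtein("btech", "b.tech"))  # 1, close enough for a fuzzy query to catch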