Using Lucene Fuzzy search with a word that has no aliases

I wish to do searches using fuzzy search. Using Luke to help me, if I search for a word that has aliases (e.g. similar words), it all works as expected:
However, if I enter a search term that doesn't have any similar words (e.g. a serial code), the search fails and I get no results, even though it should be valid:
Do I need to structure my search in a different way? Why don't I get the same in the second search as the first, but with only one "term"?

You have not specified the Lucene version, so I will assume you are using 6.x.x.
The behavior you are seeing is the correct behavior of Lucene fuzzy search.
Refer to this, and I quote:
At most, this query will match terms up to 2 edits.
Roughly speaking, this means FuzzyQuery returns a term as a match only if it can be turned into the search text with at most two single-character edits (an insertion, deletion or substitution at any position).
Below is sample output from one of my simple Java programs, which I use here as an illustration.
Let's say three indexed docs have a field with the values "123456787", "123456788" and "123456789" (7, 8 and 9 appended to 12345678).
Results:
No Hits Found for search string -> 123456 ( Edit distance = 3 , last 3 digits are missing)
3 Docs found !! for Search String -> 1234567 ( Edit distance = 2 )
3 Docs found !! for Search String -> 12345678 ( Edit distance = 1 )
1 Docs found !! for Search String -> 1236787 ( Edit distance = 2 for found one, missing 4 , 5 and last digit for remaining two documents)
No Hits Found for search string -> 123678789 ( Edit distance = 4 , missing 4 , 5 and last two digits)
So you should read more about Edit Distance.
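The results above can be reproduced with a plain Levenshtein distance. This is a minimal sketch in Python, not Lucene code: the names edit_distance and fuzzy_hits are made up for illustration, and the limit of 2 edits is taken from the quoted documentation. Lucene's FuzzyQuery can also count a swap of two adjacent characters as a single edit, which this sketch does not do; that makes no difference for these particular inputs.

```python
def edit_distance(a, b):
    """Plain Levenshtein distance: insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep on a match
            ))
        prev = curr
    return prev[-1]


# The three indexed values from the answer.
indexed = ["123456787", "123456788", "123456789"]


def fuzzy_hits(query, max_edits=2):
    """Return the indexed values within max_edits of the query."""
    return [term for term in indexed if edit_distance(query, term) <= max_edits]


for q in ["123456", "1234567", "12345678", "1236787", "123678789"]:
    print(q, "->", fuzzy_hits(q))
```

A serial code that is more than two edits away from every indexed term therefore returns nothing, which is the behavior described in the question.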
If your requirement is to match N contiguous characters without worrying about edit distance, then N-gram indexing using NGramTokenizer is the way to go.
See this too for more about N-Gram
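To show the idea behind N-gram indexing, here is a small Python sketch. It is not Lucene's NGramTokenizer; char_ngrams and build_index are hypothetical helpers, and the gram size of 6 is an arbitrary choice for the example. Each value is split into every run of N contiguous characters, and a lookup then only needs an exact match on one of those runs.

```python
def char_ngrams(text, n):
    """All runs of n contiguous characters in text, left to right."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_index(values, n):
    """Map each n-gram to the set of positions of the values containing it."""
    index = {}
    for doc_id, value in enumerate(values):
        for gram in char_ngrams(value, n):
            index.setdefault(gram, set()).add(doc_id)
    return index


values = ["123456787", "123456788", "123456789"]
index = build_index(values, 6)

# "123456" is three edits away from every value, so a fuzzy search with a
# limit of two edits misses it, but it is an exact 6-gram of all three.
hits = sorted(index.get("123456", set()))
print(hits)
```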

Postgres regex rowwise on large comma separated text string

I have created a table with various columns to filter on, such as id, date, and regex_col in the examples below. The goal of this database is to allow the user to filter appropriately for the json_b_value they are looking for. The current database is not very large, at around 100M rows.
I have taken the row names out of json_b_value to create the regex_col, with the thought that I can index the regex_col in some way and allow users to regex search for the json_b_value they are looking for. The text in the regex_col is stored as a large comma separated string, with the number of words ranging from 10 to 150.
id | date | regex_col                    | json_b_value
1  | 2019 | 'some','stuff','to','search' | json
2  | 2018 | 'different','stuff','other'  | json
3  | 2019 | 'lots','of','stuff'          | json
The user will interact and search this column using a selectize.js dropdown. A separate table takes all of the comma separated words from regex_col and binds them all together rowwise, like below. Then words matching their search will populate as they type, any words not matching anything will result in a null.
search_words |
'some'
'stuff'
'to'
'search'
What would be an effective way to index the regex_col? Is this the optimal way to do this, should I even be creating the regex_col or should I be trying to optimize around the json_b_value?
example of json value for id 1 below
[{"regex_col":"some","current":100,"previous":200},{"regex_col":"stuff","current":200,"previous":400},{"regex_col":"to","current":300,"previous":600},{"regex_col":"search","current":400,"previous":800}]
There can be a lot of factors, but in most cases the best solution is the following:
Create a new table with two columns
id, term
and then populate it
1 'some'
1 'stuff'
1 'to'
1 'search'
2 'different'
2 'stuff'
2 'other'
3 'lots'
3 'of'
3 'stuff'
Now put an index on this table and you are good to go.
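The layout above can be tried out quickly. The sketch below uses Python's built-in sqlite3 module purely as a stand-in for Postgres; the table name terms, the index name terms_term_idx and the helper ids_for_term are assumptions made for the example, and the index is placed on the term column since that is the column being searched.

```python
import sqlite3

# In-memory database as a stand-in; the question itself is about Postgres.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE terms (id INTEGER, term TEXT)")

# One row per (id, term) pair, as in the answer.
rows = [
    (1, "some"), (1, "stuff"), (1, "to"), (1, "search"),
    (2, "different"), (2, "stuff"), (2, "other"),
    (3, "lots"), (3, "of"), (3, "stuff"),
]
conn.executemany("INSERT INTO terms (id, term) VALUES (?, ?)", rows)

# Index the column that lookups filter on.
conn.execute("CREATE INDEX terms_term_idx ON terms (term)")


def ids_for_term(term):
    """Return the ids of the original rows that contain the given term."""
    cur = conn.execute(
        "SELECT id FROM terms WHERE term = ? ORDER BY id", (term,)
    )
    return [row[0] for row in cur]


print(ids_for_term("stuff"))
```

The ids returned can then be joined back to the main table to fetch the matching json_b_value rows.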

Basic SQL Script to find a special character, but only when present more than once

I am relearning MS-SQL for a project.
I have a table with a field where the data includes the special character |.
Most times the field does not have it, sometimes once, sometimes 4 times.
I have been able to get it filtered to when present, but I would like to try to show only the times it appears more than once.
This is what I have come up with so far:
SELECT UID, OBJ_UID, DESCRIPTION
FROM SPECIFICS
WHERE (NAMED LIKE '%[|]%')
Is there an easy way?
You can replace | with an empty string and compare the lengths of the two strings:
SELECT
UID, OBJ_UID, DESCRIPTION
FROM
SPECIFICS
WHERE
LEN(NAMED) - LEN(REPLACE(NAMED, '|', '')) > 1
The length difference is the number of | characters, so the query returns rows where | appears more than once.
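The same length-difference trick can be checked on a few sample values. This sketch runs it through Python's sqlite3 module, which is only a stand-in for MS-SQL: SQLite spells the function LENGTH rather than LEN, and the table contents are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE specifics (uid INTEGER, named TEXT)")

# Sample values with 0, 1, 2 and 4 pipe characters.
conn.executemany(
    "INSERT INTO specifics (uid, named) VALUES (?, ?)",
    [(1, "plain"), (2, "one|pipe"), (3, "a|b|c"), (4, "w|x|y|z|")],
)

# Removing every '|' shortens the string by exactly the number of pipes.
query = """
    SELECT uid,
           LENGTH(named) - LENGTH(REPLACE(named, '|', '')) AS pipes
    FROM specifics
    WHERE LENGTH(named) - LENGTH(REPLACE(named, '|', '')) > 1
    ORDER BY uid
"""
result = conn.execute(query).fetchall()
print(result)
```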

Fulltext search in SQL Server

I'm trying to create a simple search page on my site but finding it difficult to get full text search working as I would expect it to work.
Example search:
select *
from Movies
where contains(Name, '"Movie*" and "2*"')
Example data:
Id Name
------------------
1 MyMovie
2 MyMovie Part 2
3 MyMovie Part 3
4 MyMovie X2
5 Other Movie
Searches like "movie*" return no results since the term is in the middle of a word.
Searches like "MyMovie*" and "2*" only return MyMovie Part 2 and not MyMovie X2.
It seems like I could just hack together a dynamic SQL query that does "and name like '%movie%' and name like '%x2%'", and it would work better than full text search. That seems odd, since full text search is a large part of SQL but doesn't seem to work as well as a simple LIKE.
I've turned off my stop list so the number and single character results appear but it just doesn't seem to work well for what I'm doing which seems rather basic.
select
*
from Movies
where
name like ('%Movie%')
or name like ('%2%')
;
select * from Movies where freetext(Name, ' "Movie*" or "2*" ')
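The difference the question describes between prefix terms and LIKE can be reproduced on the sample data. The Python sketch below is only a model: splitting on whitespace is a simplification and not SQL Server's actual word breaker, and prefix_match and substring_match are hypothetical names. A prefix term has to match the start of a word, while a LIKE '%...%' pattern matches anywhere in the string.

```python
names = ["MyMovie", "MyMovie Part 2", "MyMovie Part 3", "MyMovie X2", "Other Movie"]


def prefix_match(name, prefixes):
    """Every prefix must match the start of at least one word in name."""
    words = name.lower().split()
    return all(
        any(word.startswith(prefix.lower()) for word in words)
        for prefix in prefixes
    )


def substring_match(name, fragments):
    """Every fragment must occur somewhere in name, like LIKE '%fragment%'."""
    lowered = name.lower()
    return all(fragment.lower() in lowered for fragment in fragments)


# Models: contains(Name, '"MyMovie*" and "2*"')
prefix_hits = [n for n in names if prefix_match(n, ["MyMovie", "2"])]

# Models: name like '%movie%' and name like '%2%'
substring_hits = [n for n in names if substring_match(n, ["movie", "2"])]

print(prefix_hits)
print(substring_hits)
```

Under this model "X2" is a single word that starts with "X", so the prefix term "2*" does not match it, while the substring test does.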

Aggregating MoreLikeThis Results in RavenDB

I have been trying out the MoreLikeThis Bundle to bring back a set of documents ordered by the number of matches in a field called 'backyardigans' compared to a key document. This all works as expected.
But what I would like to do is order by the number of matches of 3 separate fields added together.
An example record would be:
var data = new Data{
backyardigans = "Pablo Tasha Uniqua Tyrone Austin",
engines = "Thomas Percy Henry Toby",
pigs = "Daddy Peppa George Mummy Granny"
};
If another document matched 1 backyardigan, 2 engines and 1 pig, it would get a score of 4.
If another document matched 2 backyardigans, 4 engines and 0 pigs, it would get a score of 6.
These aggregated scores would be the field we order the results by, so they would come back as 6, 4 and so on.
Is there a way to achieve this with the MoreLikeThis bundle please?
This isn't possible, we use only a single field frequency for this.
This is important because we need to compare the score on a field basis, and it isn't really possible to compare it on a global basis without taking into account the per fields values.
Note that this is also a limitation in the underlying Lucene implementation, so there isn't much we can do about it.
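For reference, the aggregate score described in the question is plain arithmetic over shared terms. The Python sketch below computes it outside RavenDB; it is not a MoreLikeThis feature and does not use Lucene scoring. The function overlap_score and the two sample documents are made up to match the counts given in the question.

```python
FIELDS = ("backyardigans", "engines", "pigs")


def overlap_score(key_doc, other_doc):
    """Sum, over all fields, the number of terms shared with the key document."""
    total = 0
    for field in FIELDS:
        key_terms = set(key_doc[field].split())
        other_terms = set(other_doc[field].split())
        total += len(key_terms & other_terms)
    return total


key = {
    "backyardigans": "Pablo Tasha Uniqua Tyrone Austin",
    "engines": "Thomas Percy Henry Toby",
    "pigs": "Daddy Peppa George Mummy Granny",
}

# 1 backyardigan, 2 engines, 1 pig in common with the key document.
doc_a = {"backyardigans": "Pablo", "engines": "Thomas Percy", "pigs": "Peppa"}
# 2 backyardigans, 4 engines, 0 pigs in common with the key document.
doc_b = {"backyardigans": "Tasha Tyrone", "engines": "Thomas Percy Henry Toby", "pigs": ""}

# Highest aggregate score first.
ranked = sorted(
    [("a", doc_a), ("b", doc_b)],
    key=lambda item: overlap_score(key, item[1]),
    reverse=True,
)
print([(name, overlap_score(key, doc)) for name, doc in ranked])
```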

Full text search with php

I do not get any results for the following query:
"SELECT * FROM test2 WHERE MATCH(txt) AGAINST('hello' IN BOOLEAN MODE)"
while test2 looks like:
id | txt
1 | ...
2 | ...
3 | ...
4 | ...
.. | ...
txt is 30 characters long (TEXT) and has a fulltext index. I have about 16 records (tiny db) and the word hello appears in almost every record in txt along with other words. I just wanted to know how full-text search works. I get zero results and I can't understand why.
There are two possible reasons why you are not getting any results:
Reason 1: your search word 'hello' occurs in too many rows.
A natural language search interprets the search string as a phrase in natural human language (a phrase in free text). There are no special operators. The stopword list applies. In addition, words that are present in 50% or more of the rows are considered common and do not match. Full-text searches are natural language searches if no modifier is given.
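Source: http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html
Note that the query in the question uses IN BOOLEAN MODE, and according to the MySQL manual boolean-mode searches do not use the 50% threshold, so this reason applies only if the query is run as a natural language search.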
Reason 2: your search word 'hello' is on the stop-word list.
Any word on the stopword list will never match!
Source: http://dev.mysql.com/doc/refman/5.1/en/fulltext-stopwords.html
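The two rules quoted above can be modelled in a few lines of Python. This is only a sketch of the rules as stated in the quotes, not of MySQL's implementation: natural_language_match is a hypothetical function, the stopword set is a one-word stand-in for the server's list, and splitting on whitespace is a simplification.

```python
# Hypothetical stand-in for the server's stopword list.
STOPWORDS = {"hello"}


def natural_language_match(rows, word, stopwords=STOPWORDS):
    """Return the rows matching word under the two quoted rules."""
    word = word.lower()
    if word in stopwords:
        return []  # a stopword never matches
    hits = [row for row in rows if word in row.lower().split()]
    if len(hits) * 2 >= len(rows):
        return []  # present in 50% or more of the rows: treated as common
    return hits


rows = ["hello world", "hello there", "rare gem", "hello again"]

print(natural_language_match(rows, "hello"))                   # blocked as a stopword
print(natural_language_match(rows, "hello", stopwords=set()))  # blocked by the 50% rule
print(natural_language_match(rows, "rare"))                    # rare enough to match
```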