SQL with Regular Expressions vs Indexes with Logical Merging Functions

I am trying to develop a complex textual search engine.
I have thousands of textual pages from many books.
I need to search for pages that satisfy specified complex logical criteria.
These criteria can contain virtually any combination of the following:
A: Full words.
B: Word roots (similar to stems; i.e. all words with certain key letters).
C: Word templates (in some languages, roots are filled into certain templates to form various parts of speech such as adjectives and past/present verbs...).
D: Logical connectives: AND/OR/XOR/NOT/IF/IFF and parentheses to state priorities.
Now, would it be faster to have the pages' full text in the database (not indexed) and search through them all using SQL and regular expressions?
Or would it be better to construct indexes of word/root/template-to-page-location tuples?
That way, we can speed up searching for individual words/roots/templates.
However, it gets tricky as we introduce logical connectives into our queries.
I thought of doing the following steps in such cases:
1: Separately search for each individual word/root/template in the specified query.
2: In priority order, merge two result lists (from step 1) at a time, depending on the logical connective.
For example, if we are searching for "he AND (is OR was)":
1: We search for "he", "is", and "was" separately and get a result list for each word.
2: Merge the result lists of "is" and "was" using the merging function OR-MERGE.
3: Merge the result list from the OR-MERGE step with the one for "he" using the merging function AND-MERGE.
The result of step 3 is then returned as the result of the specified query.
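To illustrate: assuming a hypothetical index table word_index(word, page_id) with one row per word occurrence, the merging functions map directly onto SQL set operations (OR-MERGE is UNION, AND-MERGE is INTERSECT, and NOT can be built with EXCEPT). A sketch in PostgreSQL syntax; support for parenthesised set operations varies by database:
-- "he AND (is OR was)" as set operations over the index
SELECT page_id FROM word_index WHERE word = 'he'
INTERSECT
(
    SELECT page_id FROM word_index WHERE word = 'is'
    UNION
    SELECT page_id FROM word_index WHERE word = 'was'
);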
What do you think, gurus? Which is faster? Any better ideas?
Thank you all in advance.

There are plenty of off-the-shelf solutions to this kind of problem. I would strongly recommend you use one of those instead of developing your own.
You don't say what database solution you're using. If it's Microsoft SQL Server, you could use its Full Text Search features. If it's MySQL, take a look at its Full-Text Search Functions. I'm sure Oracle, DB2 and any other major DBMS will have similar functionality.
Alternatively, take a look at Apache's Lucene for Java or Lucene for .NET. This will allow you to index documents without needing to use a DBMS.
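For instance, SQL Server's CONTAINS predicate accepts boolean connectives directly, so a query shaped like your example might look as follows (a sketch, assuming a full-text indexed pages table with a body column; note that very common words like these may be filtered out as noise words in practice):
SELECT page_id
FROM pages
WHERE CONTAINS(body, 'he AND (is OR was)');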

Related

PostgreSQL Full Text Search with substrings

I'm trying to create the fastest way to search millions (80+ million) of records in PostgreSQL (version 9.4), over multiple columns.
I would like to try and use standard PostgreSQL, and not Solr etc.
I'm currently testing Full Text Search, following https://blog.lateral.io/2015/05/full-text-search-in-milliseconds-with-postgresql/.
It works, but I would like some more flexible way to search.
Currently, if I have a column containing e.g. "Volvo" and one containing "Blue", I am able to find the record with the search string "volvo blue", but I would like to also find the record using "volvo blu", as if I had used LIKE and '%blu%'.
Is that possible with full text search?
The only option for something like this is the pg_trgm contrib module.
This enables you to create a GIN or GiST index that indexes all sequences of three characters, which can be used for a search with the similarity operator %.
Two notes:
Using the % operator may return “false positive” results, so be sure to add a second condition (e.g. with LIKE) that eliminates those.
A trigram search works well with longer search strings, but performs badly with short search strings because of the many false positive results.
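A sketch combining both notes, assuming a hypothetical table records with a text column txt holding the concatenated searchable columns:
CREATE EXTENSION pg_trgm;
CREATE INDEX records_txt_trgm ON records USING gin (txt gin_trgm_ops);

-- % finds candidates via the trigram index; the ILIKE condition
-- then eliminates the false positives mentioned above.
SELECT *
FROM records
WHERE txt % 'volvo blu'
  AND txt ILIKE '%volvo%blu%';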
If that is not good enough for your purposes, you'll have to resort to a third-party solution.

Efficient method for storing simple regular expressions

I have a list of simple regular expressions:
ABC.+DE.+FHIJ.+
.+XY.+Z.+AB
.+KLM.+NO.+J.+
QRST.+UV
They all have alternating patterns of .+ and some text (which I will call "words") repeated some number of times. A pattern may or may not begin or end in .+. These regular expressions are all mutually exclusive. When another regex is added, I want to remove any other matching regular expressions and add one regular expression that combines the added one with all of its matches. For example, adding:
.+J.+
would match,
ABC.+DE.+FHIJ.+
.+KLM.+NO.+J.+
and thus these would be removed and replaced with the added regular expression, resulting in:
.+J.+
.+XY.+Z.+AB
QRST.+UV
I need to store these patterns either in some data structure or (preferably) in a database in an efficient manner. I first tried a tree of dictionaries, only to realize that when a regex starts with .+, the entire tree has to be searched for the next word, which is O(2^n). Unfortunately (unless I am mistaken), neither SQLite (which I am using) nor any other relational database that I have used supports "regular expression" as a data type. My question is: is there an efficient method for storing and retrieving such simple regular expressions? If there is no canned method, is there some data structure that would be relatively efficient (say, at worst amortized polynomial time)?
Could you please explain what you are using these regular expressions for, as that would make it easier to provide a better answer? In particular, seeing the way you are splitting your regular expressions, I wonder whether a Trie or a Directed acyclic word graph would be a better fit.
From there, you may find your answer is as simple as providing better normalization or finding an alternative NoSQL db made specifically for your problem area.
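Independent of the storage layer, the subsumption test itself is cheap for patterns this restricted. A rough, conservative sketch (all names hypothetical; it treats words as ordered substring constraints and errs on the side of returning false):

import java.util.*;

public class SimplePatterns {
    // True if newPat matches everything oldPat matches, i.e. newPat's
    // words can be embedded, in order, into oldPat's literal words.
    static boolean covers(String newPat, String oldPat) {
        List<String> a = words(newPat), b = words(oldPat);
        boolean aStart = newPat.startsWith(".+"), aEnd = newPat.endsWith(".+");
        boolean bStart = oldPat.startsWith(".+"), bEnd = oldPat.endsWith(".+");
        // If newPat is anchored at the start, oldPat must be too, and
        // oldPat's first word must begin with newPat's first word.
        if (!aStart && (bStart || !b.get(0).startsWith(a.get(0)))) return false;
        if (!aEnd && (bEnd || !b.get(b.size() - 1).endsWith(a.get(a.size() - 1)))) return false;
        int j = 0;
        for (String word : a) {
            while (j < b.size() && !b.get(j).contains(word)) j++;
            if (j == b.size()) return false;
            j++; // conservatively place consecutive words in distinct old words
        }
        return true;
    }

    static List<String> words(String pattern) {
        List<String> out = new ArrayList<>();
        for (String part : pattern.split("\\.\\+"))
            if (!part.isEmpty()) out.add(part);
        return out;
    }

    public static void main(String[] args) {
        System.out.println(covers(".+J.+", "ABC.+DE.+FHIJ.+")); // true
        System.out.println(covers(".+J.+", ".+XY.+Z.+AB"));     // false
        System.out.println(covers(".+J.+", "QRST.+UV"));        // false
    }
}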

SQL - searching database with the LIKE operator

Given your data stored somewhere in a database:
Hello my name is Tom I like dinosaurs to talk about SQL.
SQL is amazing. I really like SQL.
We want to implement a site search, allowing visitors to enter terms and get back related records. A user might search for:
Dinosaurs
And the SQL:
WHERE articleBody LIKE '%Dinosaurs%'
Copes fine with returning the correct set of records.
How would we cope, however, if a user misspells dinosaurs? I.e.:
Dinosores
(Poor sore dino). How can we search allowing for errors in spelling? We can associate common misspellings we see in search with the correct spelling, and then search on the original terms + the corrected term, but this is time-consuming to maintain.
Is there any way to do it programmatically?
Edit
It appears SOUNDEX could help, but can anyone give me an example using SOUNDEX where entering the search term:
Dinosores wrocks
returns records instead of doing:
WHERE articleBody LIKE '%Dinosaurs%' OR articleBody LIKE '%Wrocks%'
which would return squadoosh?
If you're using SQL Server, have a look at SOUNDEX.
For your example:
select SOUNDEX('Dinosaurs'), SOUNDEX('Dinosores')
Returns identical values (D526).
You can also use DIFFERENCE function (on same link as soundex) that will compare levels of similarity (4 being the most similar, 0 being the least).
SELECT DIFFERENCE('Dinosaurs', 'Dinosores'); --returns 4
Edit:
After hunting around a bit for a multi-text option, it seems that this isn't all that easy. I would refer you to the link in the Fuzzy Logic answer provided by @Neil Knight (+1 to that, from me!).
This Stack Overflow question also details possible sources of implementations of Fuzzy Logic in T-SQL. One respondent also outlined full-text indexing as a potential option that you might want to investigate.
Perhaps your RDBMS has a SOUNDEX function? You didn't mention which one was involved here.
SQL Server's SOUNDEX
Just to throw an alternative out there. If SSIS is an option, then you can use Fuzzy Lookup.
SSIS Fuzzy Lookup
I'm not sure if introducing a separate "search engine" is possible, but if you look at products like the Google search appliance or Autonomy, these products can index a SQL database and provide more searching options - for example, handling misspellings as well as synonyms, search results weighting, alternative search recommendations, etc.
Also, SQL Server's full-text search feature can be configured to use a thesaurus, which might help:
http://msdn.microsoft.com/en-us/library/ms142491.aspx
Here is another SO question from someone setting up a thesaurus to handle common misspellings:
FORMSOF Thesaurus in SQL Server
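For reference, such a thesaurus lookup ends up looking roughly like this (table name hypothetical; it assumes a full-text index on articleBody and a thesaurus entry expanding "dinosores" to "dinosaurs"):
SELECT *
FROM Articles
WHERE CONTAINS(articleBody, 'FORMSOF(THESAURUS, dinosores)');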
Short answer: there is nothing built into most SQL engines that can do dictionary-based correction of "fat fingers". SOUNDEX does work as a tool to find words that sound alike and thus corrects for phonetic misspellings, but if the user typed in "Dinosars", missing the final U, or truly "fat-fingered" it and entered "Dinosayrs", SOUNDEX would not return an exact match.
Sounds like you want something on the level of Google Search's "Did you mean __?" feature. I can tell you that is not as simple as it looks. At a 10,000-foot level, the search engine would look at each of those keywords and see if it's in a "dictionary" of known "good" search terms. If it isn't, it uses an algorithm much like a spell-checker suggestion to find the dictionary word that is the closest match (requires the fewest letter substitutions, additions, deletions and transpositions to turn the given word into the dictionary word). This will require some heavy procedural code, either in a stored proc or CLR Db function in your database, or in your business logic layer.
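To make that concrete, here is an illustrative (not production-grade) edit-distance function of the kind such a suggester would rank dictionary words with; lower distance means a better suggestion:

public class EditDistance {
    // Classic dynamic-programming Levenshtein distance: the minimum number
    // of insertions, deletions and substitutions to turn s into t.
    static int levenshtein(String s, String t) {
        int[][] d = new int[s.length() + 1][t.length() + 1];
        for (int i = 0; i <= s.length(); i++) d[i][0] = i;
        for (int j = 0; j <= t.length(); j++) d[0][j] = j;
        for (int i = 1; i <= s.length(); i++)
            for (int j = 1; j <= t.length(); j++) {
                int cost = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,    // delete
                                            d[i][j - 1] + 1),   // insert
                                   d[i - 1][j - 1] + cost);     // substitute
            }
        return d[s.length()][t.length()];
    }

    public static void main(String[] args) {
        // "dinosaurs" is one deletion away from "dinosars", so it would
        // beat more distant dictionary words as the suggestion.
        System.out.println(levenshtein("dinosaurs", "dinosars")); // 1
    }
}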
You can also try SUBSTR(), comparing just the first three or so characters. Below is an example of how that can be achieved:
SELECT Table1.Fname, Table1.Lname
FROM Table1, Table2
WHERE substr(Table1.Fname, 1, 3) || substr(Table1.Lname, 1, 3) = substr(Table2.Fname, 1, 3) || substr(Table2.Lname, 1, 3)
ORDER BY Table1.Fname;

Testing phrases to see if they match each other

I have a large number of phrases (~several million), each less than six or seven words and the large majority less than five, and I would like to see if they "phrase match" each other. This is a search engine marketing term - essentially, A phrase matches B if A is contained in B. Right now, they are stored in a db (Postgres), and I am performing a join on regexes (see this question). It is running impossibly slowly, even after trying all the basic optimization tricks (indexing, etc.) and the suggestions provided.
Is there an easier way to do this? I am not averse to a non-DB solution. Is there any reason to think that regexes are overkill and are taking way longer than a different solution?
An ideal algorithm for doing substring matching is Aho-Corasick.
Although you will have to read the data out of the database to use it, it is tremendously fast, when compared to more naive methods.
See here for a related question on substring matching:
And here for an Aho-Corasick implementation in Java:
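As a separate illustration, here is how a scan might look with the open-source org.ahocorasick:ahocorasick library (my assumption; not necessarily the implementation linked above). The shorter phrases are indexed as keywords once, then each longer phrase is scanned for all of them in a single pass:

import org.ahocorasick.trie.Emit;
import org.ahocorasick.trie.Trie;

public class PhraseScan {
    public static void main(String[] args) {
        // Build the automaton once over all candidate sub-phrases.
        Trie trie = Trie.builder()
                .onlyWholeWords()           // phrase matches, not substrings
                .addKeyword("testing phrases")
                .addKeyword("phrases to see")
                .build();
        // One pass over a longer phrase reports every contained keyword.
        for (Emit emit : trie.parseText("testing phrases to see")) {
            System.out.println(emit.getKeyword() + " at " + emit.getStart());
        }
    }
}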
It would be great to get a little more context as to why you need to see which phrases are subsets of others: for example, it seems strange that the DB would be built in such a way anyway: you're having to do the work now because the DB is not in an appropriate format, so it makes sense that you should 'fix' the DB or the way in which it is built, instead.
It depends massively on what you are doing with the data and why, but I have found it useful in the past to break things down into single words and pairs of words, then link resources or phrases to those singles/pairs.
For example to implement a search I have done:
Source text: Testing phrases to see
Entries:
testing
testing phrases
phrases
phrases to
to
to see
see
To see if another phrase was similar (granted, not contained within) you would break down the other phrase in the same way and count the number of phrases common between them.
It has the nice side effect of still matching if you were to use (for example) "see phrases to testing": the individual words would match, but because the order is different most of the pairs wouldn't. Since it takes phrases (consecutive words) into account at the same time, the number of matches wouldn't be as high, which makes it useful as a 'score' in matching.
As I say, that kind of thing has worked for me, but it would be great to hear some more background/context so we can see if we can find a better solution.
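To make the singles/pairs idea concrete, here is a rough sketch (purely illustrative) that shingles each phrase into words plus consecutive word pairs and scores a pair of phrases by the size of the overlap:

import java.util.*;

public class PhraseScore {
    // Break a phrase into its single words plus consecutive word pairs.
    static Set<String> shingles(String phrase) {
        String[] w = phrase.toLowerCase().split("\\s+");
        Set<String> out = new HashSet<>(Arrays.asList(w));
        for (int i = 0; i + 1 < w.length; i++)
            out.add(w[i] + " " + w[i + 1]);
        return out;
    }

    // Score = number of shared singles/pairs, so word order matters
    // through the pairs while single words still contribute.
    static int score(String a, String b) {
        Set<String> shared = shingles(a);
        shared.retainAll(shingles(b));
        return shared.size();
    }

    public static void main(String[] args) {
        // Identical phrases share 4 words + 3 pairs.
        System.out.println(score("testing phrases to see", "testing phrases to see")); // 7
        // Scrambled order keeps the 4 words but only 1 pair.
        System.out.println(score("testing phrases to see", "see phrases to testing")); // 5
    }
}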
When you have the 'cleaned column' from MaasSQL's previous answer, you could, depending on the way "phrase match" works exactly (I don't know), sort this column based on the length of the containing string.
Then make sure you run the comparison query in a converging manner in a procedure instead of a flat query, by stepping through your table (with a cursor) and eliminating candidates for comparison through WHERE statements and through deleting candidates that have already been tested (completely). You may need a temporary table to do this.
What do I mean by the 'WHERE' statement above? Well, if the comparison value is in a column sorted on length, you'll never have to test whether a longer string matches inside a shorter string.
And with deleting candidates: starting with the shortest strings, once you've tested all strings of a certain length, you can remove them from the comparison table, as any further test will never produce a match.
Of course, this requires a bit more programming than just one SQL statement. And is dependent on the way "phrase match" works exactly.
DTS or SSIS may be your friend here as well.

All of these words feature

I have a "description" field indexed in Lucene.This field contains a book's description.
How do i achieve "All of these words" functionality on this field using BooleanQuery class?
For example if a user types in "top selling book" then it should return books which have all of these words in its description.
Thanks!
There are two pieces to get this to work:
You need the incoming documents to be analysed properly, so that individual words are tokenised and indexed separately
The user query needs to be tokenised, and the tokens combined with the AND operator.
For #1, there are a number of Analyzers and Tokenizers that come with Lucene - have a look in the org.apache.lucene.analysis package. There are options for many different languages, stemming, stopwords and so on.
For #2, there are again a lot of query parsers that come with Lucene, mainly in the org.apache.lucene.queryParser package. MultiFieldQueryParser might be good for you: to require every term to be present, just call
QueryParser.setDefaultOperator(QueryParser.AND_OPERATOR)
Lucene in Action, although a few versions old, is still accurate and extremely useful for more information on analysis and query parsing.
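A short sketch against the classic Lucene 3.x API (package and class names moved around in later versions):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;

public class AllOfTheseWords {
    public static void main(String[] args) throws Exception {
        // With AND as the default operator, every token becomes required.
        QueryParser parser = new QueryParser(Version.LUCENE_36, "description",
                new StandardAnalyzer(Version.LUCENE_36));
        parser.setDefaultOperator(QueryParser.AND_OPERATOR);
        Query query = parser.parse("top selling book");
        System.out.println(query);
        // prints: +description:top +description:selling +description:book
    }
}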
I believe if you add all query parts (one per term) via
BooleanQuery.add(Query, BooleanClause.Occur)
and set that second parameter to the constant BooleanClause.Occur.MUST, then you should get what you want. The equivalent query syntax would be "+term1 +term2 +term3 ...".
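In sketch form, again against the classic API (BooleanQuery gained a builder and became immutable in later Lucene versions):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

public class MustClauses {
    public static void main(String[] args) {
        BooleanQuery query = new BooleanQuery();
        // One MUST clause per term: a document needs all of them to match.
        for (String term : new String[] { "top", "selling", "book" }) {
            query.add(new TermQuery(new Term("description", term)),
                      BooleanClause.Occur.MUST);
        }
        System.out.println(query);
        // prints: +description:top +description:selling +description:book
    }
}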