Efficient method for storing simple regular expressions - sql

I have a list of simple regular expressions:
ABC.+DE.+FHIJ.+
.+XY.+Z.+AB
.+KLM.+NO.+J.+
QRST.+UV
they all have alternating patterns of .+ and some text (I will call "words") repeated some number of times. A pattern may or may not begin or end in .+. These regular expression are all mutually exclusive. When another regex is added I want to remove any other matching regular expressions, and add one regular expression that combines the added one with all of its matches. For example, adding:
.+J.+
would match,
ABC.+DE.+FHIJ.+
.+KLM.+NO.+J.+
and thus, these would be remove and replaced with the added regular expression resulting in:
.+J.+
.+XY.+Z.+AB
QRST.+UV
I need to store these patterns either in some data structure or (preferably) in a database in an efficient manner. I first tried a tree of dictionaries, only to realize that in the case that a regex starts with a .* it has to search the entire tree for the next word, which is order O(2^n). Unfortunately, (unless I am mistaken) it appears that neither SQLite (which I am using) nor any other relational database that I have used, supports "regular expression" as a data type. My question is, is there an efficient method for storing and retrieving such simple regular expressions? If there is no canned method, is there some data structure that would be relatively efficient (say, at worst amortized polynomial time)?

Could you please explain what you are using these regular expressions for as that would make it easier to provide a better answer? In particular when I see the way you are splitting your regular expressions I'm wondering if a Trie or a Directed acyclic word graph would be a better fit.
From their you may find your answer is as simple as providing better normalization or finding an alternative no SQL db made specifically for your problem area.

Related

Custom, user-definable "wildcard" constants in SQL database search -- possible?

My client is making database searches using a django webapp that I've written. The query sends a regex search to the database and outputs the results.
Because the regex searches can be pretty long and unintuitive, the client has asked for certain custom "wildcards" to be created for the regex searches. For example.
Ω := [^aeiou] (all non-vowels)
etc.
This could be achieved with a simple permanent string substitution in the query, something like
query = query.replace("Ω", "[^aeiou]")
for all the elements in the substitution list. This seems like it should be safe, but I'm not really sure.
He has also asked that it be possible for the user to define custom wildcards for their searches on the fly. So that there would be some other input box where a user could define
∫ := some other regex
And to store them you might create a model
class RegexWildcard(models.Model):
symbol = ...
replacement = ...
I'm personally a bit wary of this, because it does not seem to add a whole lot of functionality, but does seem to add a lot of complexity and potential problems to the code. Clients can now write their queries to a db. Can they overwrite each other's symbols?
That I haven't seen this done anywhere before also makes me kind of wary of the idea.
Is this possible? Desirable? A great idea? A terrible idea? Resources and any guidance appreciated.
Well, you're getting paid by the hour....
I don't see how involving the Greek alphabet is to anyone's advantage. If the queries are stored anywhere, everyone approaching the system would have to learn the new syntax to understand them. Plus, there's the problem of how to type the special symbols.
If the client creates complex regular expressions they'd like to be able to reuse, that's understandable. Your application could maintain a list of such expressions that the user could add to and choose from. Notionally, the user would "click on" an expression, and it would be inserted into the query.
The saved expressions could have user-defined names, to make them easier to remember and refer to. And you could define a syntax that referenced them, something otherwise invalid in SQL, such as ::name. Before submitting the query to the DBMS, you substitute the regex for the name.
You still have the problem of choosing good names, and training.
To prevent malformed SQL, I imagine you'll want to ensure the regex is valid. You wouldn't want your system to store a ; drop table CUSTOMERS; as a "regular expression"! You'll either have to validate the expression or, if you can, treat the regex as data in a parameterized query.
The real question to me, though, is why you're in the vicinity of standardized regex queries. That need suggests a database design issue: it suggests the column being queried is composed of composite data, and should be represented as multiple columns that can be queried directly, without using regular expressions.

Expanding an arbitrary Lucene Query

I am using Lucene 3.6.1. I receive a query from a user. This query may contain + or - operators, and may also contain phrases. In certain circumstances, I would like to expand the query by adding some extra terms that I compute. These terms are optional. However, any required include/exclude constraints specified by the user must be respected.
My initial strategy was to create a BooleanQuery, add a clause to it that contains the parsed user query, and then add further clauses that contain my expansion terms. The expansion terms would all be added as Occur.SHOULD. My question is how to constrain the user's query. I can imagine three possibilities:
The user's query contains no operators, which means I can include it as an Occur.SHOULD clause.
The user's query contains a + operator, so I need to include it as an Occur.MUST clause.
The user's query contains a - operator, but also other terms: Do I still include it as an Occur.MUST clause?
The question implicit in these three choices is how do I tell which condition is appropriate? I suppose I can rewrite the query and test for BooleanQuery instances, but that seems brittle.
I suppose can also try to tactic of creating a single string from the user's input and from my expansion terms, like this:
(fld1:userterm1 userterm2 -fld2:userterm3 +userterm4)^10 (fld1:expterm1)^8 (fld2:expterm2)^7 ...
Is this the best way to go? Or is there some elegant programmatic solution?
Okay, Not sure how useful this answer will be, but can't seem to come up with a hard and fast answer here, so I'll list a couple possibilities that come to mind:
First, a problem:
Modifying the query to look like:
(userquery) (other) (stuff)
I makes some sense to add the + with he rules you've shown, but a '-' prohibited term will be hard to respect correctly, since (query -prohibition) (other) will allow matches on other with prohibition present as well, and +(query -prohibition) (other) will require 'query' be matched.
The only way I see to really do that part right is to propagate the prohibited term into your automatically added terms as well, or extract it out to a parent query layer, more like (query -prohibition) --> (query) (other) -(prohibition).
And with user entered queries of arbitrary complexity, that may not be a great strategy.
If you want to tackle it by modifying the query string, then you should probably just add any terms to the end of the query. Nothing more to it.
I don't believe
(fld1:userterm1 userterm2 -fld2:userterm3 +userterm4)^10 (fld1:expterm1)^8 (fld2:expterm2)^7 ...
Is satisfactory, because userterm4 is only required within it's subquery, but a match Only on expterm1 is still acceptable. However, a query like:
fld1:userterm1 userterm2 -fld2:userterm3 +userterm4 (fld1:expterm1)^.8 (fld2:expterm2)^.7 ...
Should, I think, satisfy your needs, and prevents you from having to worry about the internals of your queryparser. I think this is the best approach.
I can also see logic in a query structured like
+(parsed userquery) (other stuff)
Effectively, always requiring a match on the user query. Lucene implicitly does this, in a sense, as it won't return a result that matches no term, even if no required fields are present in the query. This would then be using your added terms to impact scoring, rather than return a larger set of documents. This doesn't quite address what your asking, but might be worth considering.
If, despite the aforementioned problems of applying them, you still want to detect '+' and '-' operators, I think it can be reasonably assumed that a StandardQueryParser will return a BooleanQuery at base level for any query that you need to check for these operands on. You might have to worry about, for instance, DisjunctionMaxQueries, as well as what will happen when you have a simple query with an operator, like:
+myterm
I don't know if QueryParser would simply return a TermQuery, losing the plus (since it would be redundant without another term present). Concerns like that make me hesitant to address it in this way.
Similarly, attempting to detect these values from the query string must make assumptions about how things are parsed, and could become complicated.
To sumamrize, I think the best options are to, either: add terms to the end of the raw query string before doing any parsing, or treat the user query as atomic, and define the appropriate booleanclause independant of it's contents when adding to a boolean clause wrapping it with whatever other queries you need to include.

RESTful API Design OR Predicates

I'm designing a RESTful API and I'm trying to work out how I could represent a predicate with OR an operator when querying for a resource.
For example if I had a resource Foo with a property Name, how would you search for all Foo resources with a name matching "Name1" OR "Name2"?
This is straight forward when it's an AND operator as I could do the following:
http://www.website.com/Foo?Name=Name1&Age=19
The other approach I've seen is to post the search in the body.
You will need to pick your own approach, but I can name few that seem to be pretty logical (although not without disadvantages):
Option 1.: Using | operator:
http://www.website.com/Foo?Name=Name1|Name2
Option 2.: Using modified query param to allow selection by one of the values from the set (list of possible comma-separated values):
http://www.website.com/Foo?Name_in=Name1,Name2
Option 3.: Using PHP-like notation to provide list instead of single string:
http://www.website.com/Foo?Name[]=Name1&Name[]=Name2
All of the above mentioned options have one huge advantage: they do not interfere with other query params.
But as I mentioned, pick your own approach and be consistent about it across your API.
Well one quick way to fixing that is to add an additional parameter that is identifying the relationship between your parameters wether they're an and or an or for example:
http://www.website.com/Foo?Name=Name1&Age=19&or=true
Or for much more complex queries just keep a single parameter and in it include your whole query by making up your own little query language and on the server side you would parse the whole string and extract the information and the statement.

Testing phrases to see if they match each other

I have a large number of phrases (~ several million), each less than six or seven words and the large majority less than five, and I would like to see if they "phrase match" each other. This is a search engine marketing term - essentially, A phrase matches B if A is contained in B. Right now, they are stored in a db (postgres), and I am performing a join on regexes (see this question). It is running impossibly slowly even after trying all basic optimization tricks (indexing, etc) and trying the suggestions provided.
Is there an easier way to do this? I am not averse to a non-DB solution. Is there any reason to think that regexes are overkill and are taking way longer than a different solution?
An ideal algorithm for doing sub-string matching is AhoCorsick.
Although you will have to read the data out of the database to use it, it is tremendously fast, when compared to more naive methods.
See here for a related question on substring matching:
And here for an AhoCorsick implementation in Java:
It would be great to get a little more context as to why you need to see which phrases are subsets of others: for example, it seems strange that the DB would be built in such a way anyway: you're having to do the work now because the DB is not in an appropriate format, so it makes sense that you should 'fix' the DB or the way in which it is built, instead.
It depends massively on what you are doing with the data and why, but I have found it useful in the past to break things down into single words and pairs of words, then link resources or phrases to those singles/pairs.
For example to implement a search I have done:
Source text: Testing phrases to see
Entries:
testing
testing phrases
phrases
phrases to
to
to see
see
To see if another phrase was similar (granted, not contained within) you would break down the other phrase in the same way and count the number of phrases common between them.
It has the nice side effect of still matching if you were to use (for example) "see phases to testing": because the individual words would match.. but because the order is different the pairs wouldn't, so it's taking phrases (consecutive words) into account at the same time, the number of matches wouldn't be as high, good for use as a 'score' in matching.
As I say that -kind- of thing has worked for me, but it would be great to hear some more background/context, so we can see if we can find a better solution.
When you have the 'cleaned column' from MaasSQL's previous answer, you could, depending on the way "phrase match" works exactly (I don't know), sort this column based on the length of the containing string.
Then make sure you run the comparison query in a converging manner in a procedure instead of a flat query, by stepping through your table (with a cursor) and eliminating candidates for comparison through WHERE statements and through deleting candidates that have already been tested (completely). You may need a temporary table to do this.
What do I mean with 'WHERE' statement previously? Well, if the comparison value is in a column sorted on length, you'll never have to test whether a longer string matches inside a shorter string.
And with deleting candidates: starting with the shortest strings, once you've tested all strings of a certain length, you'll can remove them from the comparison table, as any next test you'll do will never get a match.
Of course, this requires a bit more programming than just one SQL statement. And is dependent on the way "phrase match" works exactly.
DTS or SSIS may be your friend here as well.

SQL with Regular Expressions vs Indexes with Logical Merging Functions

I am trying to develop a complex textual search engine.
I have thousands of textual pages from many books.
I need to search pages that contain specified complex logical criterias.
These criterias can contain virtually any compination of the following:
A: Full words.
B: Word roots (semilar to stems; i.e. all words with certain key letters).
C: Word templates (in some languages roots are filled in certain templates to form various part of speech such as adjactives, past/present verbs...).
D: Logical connectives: AND/OR/XOR/NOT/IF/IFF and parentheses to state priorities.
Now, would it be faster to have the pages' full text in database (not indexed) and search through them all using SQL and Regular Expressions ?
Or would it be better to construct indexes of word/root/template-page-location tuples.
Hence, we can boost searching for individual words/roots/templates.
However, it gets tricky as we introduce logical connectives into our queries.
I thought of doing the following steps in such cases:
1: Seperately search for each individual words/roots/templates in the specified query.
2: On priority bases, we merge two result lists (from step 1) at a time depedning on the logical connective
For example, if we are searching for "he AND (is OR was)":
1: We shall search for "he", "is" and "was" seperately and get result lists for each word.
2: Merge the result lists of "is" and "was" using the merging function OR-MERGE.
3: Merge the merged result list from the OR-MERGE function with the one of "he" using the merging function AND-MERGE.
The result of step 3 is then returned as the result of the specified query.
What do you think gurues ? Which is faster ? Any better ideas ?
Thank you all in advance.
There are plenty of off-the-shelf solutions to this kind of problem. I would strongly recommend you use one of those instead of developing your own.
You don't say what database solution you're using. If it's Microsoft SQL Server, you could use its Full Text Search features. If it's MySQL, take a look at its Full-Text Search Functions. I'm sure Oracle, DB2 and any other major DBMS will have similar functionality.
Alternatively, take a look at Apache's Lucene for Java or Lucene for .NET. This will allow you to index documents without needing to use a DBMS.