PyMongo $in + $regex

How can I combine $regex with $in in PyMongo?
I want to search for either /*.heavy.*/ or /*.metal.*/.
I tried this in Python without success:
db.col.find({'music_description' : { '$in' : [ {'$regex':'/*.heavy.*/'} ]} })
The equivalent in Mongo shell is:
db.inventory.find( { music_description: { $in: [ /heavy/, /metal/ ] } } )

Use Python regular expressions:
import re
db.col.find({'music_description': {'$in': [ re.compile('.*heavy.*'), re.compile('.*metal.*')]}})

Why even bother using an $in?
You're wasting processing by evaluating the field against each value in the list, and since each value is a regex, each one carries its own performance cost.
Depending on how long your query strings get, it might be prudent to wrap them up in one regex and avoid the $in query altogether:
import re
db.col.find({'music_description': re.compile('heavy|metal')})
Similarly, in the mongo shell:
db.inventory.find({music_description: /heavy|metal/})
As for [user2998367]'s answer: compiling a regex with greedy wildcards just to test for a match is wasted effort. In Python, re.match anchors at the start of the string, so it needs the .* wildcards to behave as 'anywhere in the string', whereas re.search, like MongoDB's $regex, already matches anywhere. The wildcards are only really needed if you intend to extract the surrounding text, which you'd do after querying anyhow, or if you're reusing the compiled regex somewhere else where you specifically need re.match semantics.
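A quick Python sketch (with a made-up sample string) of that difference:
import re
s = 'classic heavy metal'
print(bool(re.search('heavy', s)))     # True  - re.search matches anywhere, like MongoDB's $regex
print(bool(re.match('heavy', s)))      # False - re.match only matches at the start of the string
print(bool(re.match('.*heavy.*', s)))  # True  - wildcards are only needed to fake "anywhere" with re.match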

Related

What is the benefit of defining datatypes for literals in an RDF graph?

I am using rdflib in Python to build my first RDF graph. However, I do not understand the explicit purpose of defining Literal datatypes. I have pored over the documentation and done my due diligence with Google and the Stack Overflow search, but I cannot seem to find an actual explanation for this. Why not just leave everything as a plain old Literal?
From what I have experimented with, is this so that you can search for explicit terms in your SPARQL query with BIND? Does this also help with FILTERing, i.e. FILTER (?var1 > ?var2), where var1 and var2 should represent integers/floats/etc.? Does it help with querying speed? Or am I just way off altogether?
Specifically, why add the following triple to mygraph
mygraph.add((amazingrdf, ns['hasValue'], Literal('42.0', datatype=XSD.float)))
instead of just this?
mygraph.add((amazingrdf, ns['hasValue'], Literal("42.0")))
I suspect that there must be some purpose I am overlooking. I appreciate your help and explanations - I want to learn this right the first time! Thanks!
Comparing two xsd:integer values in SPARQL:
ASK { FILTER (9 < 15) }
Result: true. Now with xsd:string:
ASK { FILTER ("9" < "15") }
Result: false, because when sorting strings, 9 comes after 1.
Some equality checks with xsd:decimal:
ASK { FILTER (+1.000 = 01.0) }
Result is true, it’s the same number. Now with xsd:string:
ASK { FILTER ("+1.000" = "01.0") }
False, because they are clearly different strings.
Doing some maths with xsd:integer:
SELECT (1+1 AS ?result) {}
It returns 2 (as an xsd:integer). Now for strings:
SELECT ("1"+"1" AS ?result) {}
It returns "11" as an xsd:string, because adding strings is interpreted as string concatenation (at least in Jena where I tried this; in other SPARQL engines, adding two strings might be an error, returning nothing).
As you can see, using the right datatype is important to communicate your intent to code that works with the data. The SPARQL examples make this very clear, but when working directly with an RDF API, the same kind of issues crop up around object identity, ordering, and so on.
As shown in the examples above, SPARQL offers convenient syntax for xsd:string, xsd:integer and xsd:decimal (and, not shown, for xsd:boolean and for language-tagged strings). That elevates those datatypes above the rest.
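For completeness, here is a minimal rdflib sketch (the values are illustrative) of how the datatype changes what the API hands back:
from rdflib import Literal
from rdflib.namespace import XSD

plain = Literal("42.0")                      # plain literal: stays a string
typed = Literal("42.0", datatype=XSD.float)  # typed literal: converts to a Python float

print(plain.toPython())   # '42.0' - a str, so it sorts and compares as text
print(typed.toPython())   # 42.0   - a float, so numeric comparison and arithmetic behave as expected
print(plain == typed)     # False  - different RDF terms despite the same lexical form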

Flex (lexical analyzer) {+} and {-} operators

I am trying to make a simple scanner using Flex. In the declarations section, I am trying to use the {-} operator to exclude reserved words from an id, but I can't get it to work. Every example I have found uses the {+} and {-} operators as in the following code:
[a-z]{-}[d]
However, I am trying to use these operators as in the following code, but I always get errors:
invalid_id "char"|"else"|"if"|"class"|"new"|"return"|"void"|"while"|"int"
all_ids [a-zA-Z_][a-zA-Z0-9_]*
id {all_ids}{-}{invalid_id}
Is there any way to make it work? Can these operators be used without square brackets?
The {-} and {+} operators only work on character classes like [a-z], not on full regular expressions or strings; flex has no support for a more general {-} operator. The more general version of {+} is, of course, |. Your particular problem is solved by the fact that when two patterns match the same string, the pattern listed first wins. So changing your specification to the following will actually exclude all keywords from the IDs:
%%
"char"|"else"|"if"|"class"|"new"|"return"|"void"|"while"|"int" { return KEYWORD; }
[a-zA-Z_][a-zA-Z0-9_]* { return ID; }
%%

Doing multiple gsub in a rails 3 app on email body template

I have an email template where the user can enter text like this:
Hello {first_name}, how are you?
and when the email actually gets sent it replaces the placeholder text {first_name} with the actual value.
There will be several of these placeholders, though, and I wasn't sure that gsub is meant to be used like this.
body = @email.body.gsub("{first_name}", @person.first_name).gsub("{last_name}", @person.last_name).gsub("",...).gsub("",...).gsub("",...).etc...
Is there a cleaner solution to achieving this functionality? Also, if anyone's done something similar to this, did they find that they eventually hit a point where using multiple gsubs on a few paragraphs for hundreds of emails was just too slow?
EDIT
I ran some tests comparing multiple gsubs vs. using a regex, and it came out that the regex was usually 3x FASTER than using multiple gsubs. However, I think the regex code is a little harder to read as-is, so I'm going to have to clean it up a bit, but it does indeed seem that using a regex is significantly faster than multiple gsubs. Since my use case will involve multiple substitutions for a large number of documents, the faster solution is better for me, even though I'll have to add a little more documentation.
Put every string you want to catch into the regular expression, and put the replacement for each catch into the hash:
"123456789".gsub(/123|456/, "123" => "ABC", "456" => "DEF")
# => "ABCDEF789"
This form of gsub only works in Ruby 1.9 and later.
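The same one-pass idea sketched in Python for comparison (the placeholder names and values are made up):
import re

values = {"{first_name}": "Ada", "{last_name}": "Lovelace"}
pattern = re.compile("|".join(re.escape(k) for k in values))

body = "Hello {first_name} {last_name}, how are you?"
print(pattern.sub(lambda m: values[m.group(0)], body))
# => Hello Ada Lovelace, how are you?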
If you can use a template library like ERB or Haml, that is the proper tool for this kind of task.

Lucene ignore keywords in search term

This seems like it should be simple, but I can't figure out how to get Lucene to ignore the AND, OR, and NOT keywords - the query parser throws a parse error when it gets one. I have a query builder class that splits the search term so that it searches on the words themselves as well as on n-grams in the word. I'm using Lucene in Java.
So in a search for, say, "ANDERSON COOPER" the query string looks like:
name: (ANDERSON COOPER "ANDERSON COOPER")^5 gram4: ( ANDE NDER DERS ERSO RSON
SONC ONCO NCOO COOP OOPE OPER)
the query parser throws an error when it gets those ANDs. Ideally, I'd like the parser to just ignore AND, OR, NOT altogether, and I'll use the &&, ||, and ! equivalents if I need them - do I have to modify the code in the QueryParser class itself to get this? Or is there an easier way? I could also just insert an escape character for these cases if that is the best way to do it, but adding \ before the word AND doesn't seem to do anything.
You can wrap the AND in quotes like this: "AND". Is that easy? A regex could probably do that easily if you know exactly what your queries look like.
The parser shouldn't have a problem with it, and the PhraseQuery will be rewritten as a term query, so the performance difference is a small constant, O(1).
The regex could probably look like this:
\b(AND|OR|NOT)\b
Which would be replaced with
"$1"

How to make the Lucene QueryParser more forgiving?

I'm using Lucene.net, but I am tagging this question for both .NET and Java versions because the API is the same and I'm hoping there are solutions on both platforms.
I'm sure other people have addressed this issue, but I haven't been able to find any good discussions or examples.
By default, Lucene is very picky about query syntax. For example, I just got the following error:
[ParseException: Cannot parse 'hi there!': Encountered "<EOF>" at line 1, column 9.
Was expecting one of:
"(" ...
"*" ...
<QUOTED> ...
<TERM> ...
<PREFIXTERM> ...
<WILDTERM> ...
"[" ...
"{" ...
<NUMBER> ...
]
Lucene.Net.QueryParsers.QueryParser.Parse(String query) +239
What is the best way to prevent ParseExceptions when processing queries from users? It seems to me that the most usable search interface is one that always executes a query, even if it might be the wrong query.
It seems that there are a few possible, and complementary, strategies:
"Clean" the query prior to sending it to the QueryProcessor
Handle exceptions gracefully
Show an intelligent error message to the user
Perhaps execute a simpler query, leaving off the erroneous bit
I don't really have any great ideas about how to do any of those strategies. Has anyone else addressed this issue? Are there any "simple" or "graceful" parsers that I don't know about?
You can make Lucene ignore the special characters by sanitizing the query with something like
query = QueryParser.Escape(query)
If you do not want your users to ever use advanced syntax in their queries, you can do this always.
If you want your users to use advanced syntax but you also want to be more forgiving of their mistakes, you should only sanitize after a ParseException has occurred.
Well, the easiest thing to do would be to give the raw form of the query a shot, and if that fails, fall back to cleaning it up.
Query safe_query_parser(QueryParser qp, String raw_query)
throws ParseException
{
Query q;
try {
q = qp.parse(raw_query);
} catch(ParseException e) {
q = null;
}
if(q==null)
{
String cooked;
// consider changing this "" to " "
cooked = raw_query.replaceAll("[^\\w\\s]","");
q = qp.parse(cooked);
}
return q;
}
This gives the raw form of the user's query a chance to run, but if parsing fails, we strip everything except letters, numbers, spaces and underscores; then we try again. We still risk throwing ParseException, but we've drastically reduced the odds.
You could also consider tokenizing the user's query yourself, turning each token into a term query, and glomming them together with a BooleanQuery. If you're not really expecting your users to take advantage of the features of the QueryParser, that would be the best bet. You'd be completely(?) robust, and users could search for whatever funny characters make it through your analyzer.
FYI... Here is the code I am using for .NET
private Query GetSafeQuery(QueryParser qp, String query)
{
Query q;
try
{
q = qp.Parse(query);
}
catch(Lucene.Net.QueryParsers.ParseException e)
{
q = null;
}
if(q==null)
{
string cooked;
cooked = Regex.Replace(query, @"[^\w\.@-]", " ");
q = qp.Parse(cooked);
}
return q;
}
I'm in the same situation as you.
Here's what I do. I do catch the exception, but only so that I can make the error look prettier. I don't change the text.
I also provide a link to an explanation of the Lucene syntax which I have simplified a little bit:
http://ifdefined.com/btnet/lucene_syntax.html
I do not know much about Lucene.net. For general Lucene, I highly recommend the book Lucene in Action. For the question at hand, it depends on your users. There are strong reasons, such as ease of use, security and performance, to limit your users' queries. The book shows ways to parse the queries using a custom parser instead of QueryParser. I second Jay's idea about the BooleanQuery, although you can build stronger queries using a custom parser.
If you don't need all of Lucene's features, you might be better off writing your own query parser. It's not as complicated as it might seem at first.
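As a rough illustration of that last point, a hand-rolled "parser" can be as small as pulling out the word tokens and rebuilding a query string from them (a sketch only, not tied to any particular Lucene API; the function name is made up):
import re

def forgiving_query(user_input):
    # Keep only word characters; anything the QueryParser could choke on is dropped.
    terms = re.findall(r"\w+", user_input)
    return " AND ".join(terms)

print(forgiving_query("hi there!"))     # hi AND there
print(forgiving_query("C++ && (fun)"))  # C AND fun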