Find all Lucene documents having a certain field

Find all Lucene documents having a certain field - lucene

I want to find all documents in the index that have a certain field, regardless of the field's value. If at all possible using the query language, not the API.
Is there a way?

If you know the type of data stored in your field, you can try a range query. Per example, if your field contain string data, a query like field:[a* TO z*] would return all documents where there is a string value in that field.

I've done some experimenting, and it seems the simplest way to achieve this is to create a QueryParser and call SetAllowLeadingWildcard( true ) and search for field:* like so:
var qp = new QueryParser( Lucene.Net.Util.Version.LUCENE_29, field, analyzer );
qp.SetAllowLeadingWildcard( true );
var query = qp.Parse( "*" ) );
(Note I am setting the default field of the QueryParser to field in its constructor, hence the search for just "*" in Parse()).
I cannot vouch for how efficient this method is over other methods, but being the simplest method I can find, I would expect it to be at least as efficient as field:[* TO *], and it avoids having to do hackish things like field:[0* TO z*], which may not account for all possible values, such as values starting with non-alphanumeric characters.

Another solution is using a ConstantScoreQuery with a FieldValueFilter
new ConstantScoreQuery(new FieldValueFilter("field"))

Related

How to search in lucene where there is only a single token for a field

I am creating an index where the documents are only a single term.
I am indexing domain names, so the field "domain" would look like:
example.com
thisiscool.com
justtesting.org
cnn.com
I am creating my search terms etc. programatically, and because all my document field is just a single term, it appears as though my searches won't work as they are since there is only a single term and if I add multiple terms in a boolean query it will never find anything.
How should I be searching given I have only a single term? I want to make this as efficient as possible.
Query term = new TermQuery("domain", "this")
Query term2 = new TermQuery("domain", "cool")
// add to boolean query
bq.add(term, Occur.MUST)
bq.add(term2, Occur.MUST)
indexSearcher.search(bq, 100)
I was expecting to get "thisiscool.com" back, but I get 0 hits. My guess is because lucene can't break things down into tokens, so it will never find any document that has both tokens "this" and "cool".
How should I be searching given this scenerio?

Apply a wildcard to your search clause.
Query term = new TermQuery("domain", "this*");
Query term2 = new TermQuery("domain", "cool*"); // *cool* won't work sadly
However, that might not work because the logic is going to result in a query like this, where the domain has to begin with "this" as well as "cool"
bq.add(term, Occur.MUST)
bq.add(term2, Occur.MUST)
=> +domain:this* +domain:cool*
Query term = new TermQuery("domain", "this*cool*");
=> +domain:this*cool* // probably gets hits
If you're using newer versions then you can use regular expressions in queries:
http://lucene.apache.org/core/6_6_0/core/org/apache/lucene/util/automaton/RegExp.html
The above example isn't actually how you should do this. I tested it out, and it doesn't even really work. What you'll want to do is build specialized queries, such as PrefixQuery, WildcardQuery, or RegexpQuery.
Additionally, if you're not using QueryParser or something that takes an Analyzer, queries have to match exactly to what's in your index. If domain is a TextField it might have been lowercased or had something else happen to it, so you'll need to know that too.
I'd just use regex.
RegExp r = new RegExp("this.*cool");
Query q = new RegexpQuery(new Term("domain", r.toString()));
It can be slow, but if you don't prefix with any char it should be perfectly fine. I'm also not entirely sure how to ignore case with this, but that might be default.

Regex match SQL values string with multiple rows and same number of columns

I tried to match the sql values string (0),(5),(12),... or (0,11),(122,33),(4,51),... or (0,121,12),(31,4,5),(26,227,38),... and so on with the regular expression
\(\s*\d+\s*(\s*,\s*\d+\s*)*\)(\s*,\s*\(\s*\d+\s*(\s*,\s*\d+\s*)*\))*
and it works. But...
How can I ensure that the regex does not match a values string like (0,12),(1,2,3),(56,7) with different number of columns?
Thanks in advance...

As i mentioned in comment to the question, the best way to check if input string is valid: contains the same count of numbers between brackets, is to use client side programm, but not clear SQL.
Implementation:
List<string> s = new List<string>(){
"(0),(5),(12)", "(0,11),(122,33),(4,51)",
"(0,121,12),(31,4,5),(26,227,38)","(0,12),(1,2,3),(56,7)"};
var qry = s.Select(a=>new
{
orig = a,
newst = a.Split(new string[]{"),(", "(", ")"},
StringSplitOptions.RemoveEmptyEntries)
})
.Select(a=>new
{
orig = a.orig,
isValid = (a.newst
.Sum(b=>b.Split(new char[]{','},
StringSplitOptions.RemoveEmptyEntries).Count()) %
a.newst.Count()) ==0
});
Result:
orig isValid
(0),(5),(12) True
(0,11),(122,33),(4,51) True
(0,121,12),(31,4,5),(26,227,38) True
(0,12),(1,2,3),(56,7) False
Note: The second Select statement gets the modulo of sum of comma instances and the count of items in string array returned by Split function. If the result isn't equal to zero, it means that input string is invalid.
I strongly believe there's a simplest way to achieve that, but - at this moment - i don't know how ;)

:(
Unless you add some more constraints, I don't think you can solve this problem only with regular expressions.
It isn't able to solve all of your string problems, just as it cannot be used to check that the opening and closing of brackets (like "((())()(()(())))") is invalid. That's a more complicated issue.
That's what I learnt in class :P If someone knows a way then that'd be sweet!
I'm sorry, I spent a bit of time looking into how we could turn this string into an array and do more work to it with SQL but built in functionality is lacking and the solution would end up being very hacky.
I'd recommend trying to handle this situation differently as large scale string computation isn't the best way to go if your database is to gradually fill up.
A combination of client and serverside validation can be used to help prevent bad data (like the ones with more numbers) from getting into the database.
If you need to keep those numbers then you could rework your schema to include some metadata which you can use in your queries, like how many numbers there are and whether it all matches nicely. This information can be computed inexpensively from your server and provided to the database.
Good luck!

Lucene Indexing to ignore apostrophes

I have a field that might have apostrophes in it.
I want to be able to:
1. store the value as is in the index
2. search based on the value ignoring any apostrophes.
I am thinking of using:
doc.add(new Field("name", value, Store.YES, Index.NO));
doc.add(new Field("name", value.replaceAll("['‘’`]",""), Store.NO, Index.ANALYZED));
if I then do the same replace when searching I guess it should work and use the cleared value to index/search and the value as is for display.
am I missing any other considerations here ?

Performing replaceAll directly on the value its a bad practice in Lucene, since it would a much better practice to encapsulate your tokenization recipe in an Analyzer. Also I don't see the benefit of appending fields in your use case (See Document.add).
If you want to Store the original value and yet be able to search without the apostrophes simply declare your field like this:
doc.add(new Field("name", value, Store.YES, Index.ANALYZED);
Then simply hook up a custom Tokenizer that will replace apostrophes (I think the Lucene's StandardAnalyzer already includes this transformation).
If you are storing the field with the aim of using highlighting you should also consider using Field.TermVector.WITH_POSITIONS_OFFSETS.

Is it possible to add custom metadata to a Lucene field?

I've come to the point where I need to store some additional data about where a particular field comes from in my Lucene.Net index. Specifically, I want to attach a guid to certain fields of a document when the field is added to the document, and retrieve it again when I get the document from a search result.
Is this possible?
Edit:
Okay, let me clarify a bit by giving an example.
Let's say I have an object that I want to allow the user to tag with custom tags like "personal", "favorite", "some-project". I do this by adding multiple "tag" fields to the document, like so:
doc.Add( new Field( "tag", "personal" ) );
doc.Add( new Field( "tag", "favorite" ) );
The problem is I now need to record some meta data about each individual tag itself, specifically a guid representing where that tag came from (imagine it as a user id). Each tag could potentially have a different guid, so I can't simply create a "tag-guid" field (unless the order of the values is preserved---see edit 2 below). I don't need this metadata to be indexed (and in fact I'd prefer it not to be, to avoid getting hits on metadata), I just need to be able to retrieve it again from the document/field.
doc.GetFields( "tag" )[0].Metadata...
(I'm making up syntax here, but I hope my point is clear now.)
Edit 2:
Since this is a completely different question, I've posted a new question for this approach: Is the order of multi-valued fields in Lucene stable?
Okay let's try another approach... The key problem area is the indeterminacy of the multiple field values under the same field name (e.g. "tag"). If I could introduce or obtain some kind of determinacy here, I might be able to store the metadata in another field.
For example, if I could rely on the order of the values of the field never changing, I could use an index in the set of values to identify exactly which tag I am referring to.
Is there any guarantee that the order I add the values to a field will remain the same when I retrieve the document at a later time?

Depending on your search requirements for this index, this may be possible. That way you can control the order of fields. It would require updating both fields as the tag list changes of course, but the overhead may be worth it.
doc.Add(new Field("tags", "{personal}|{favorite}"));
doc.Add(new Field("tagsref", "{1234}|{12345}"));
Note: using the {} allows you to qualify your search for uniqueness where similar values exist.
Example: If values were stored as "person|personal|personage" searching for "person" would return a document that has any one of person, personal or personage. By qualifying in curly brackets like so: "{person}|{personal}|{personage}", I can search for "{person}" and be sure it won't return false positives. Of course, this assumes you don't use curly brackets in your values.

I think you're asking about payloads.
Edit: From your use case, it sounds like you have no desire to use this metadata in your search, you just want it there. (Basically, you want to use Lucene as a database system.)
So, why can't you use a binary field?
ExtraData ed = new ExtraData { Tag = "tag", Type = "personal" };
byte[] byteData = BinaryFormatter.Serialize(ed); // this isn't the correct code, but you get the point
doc.Add(new Field("myData", byteData, Field.Store.YES));
Then you can deserialize it on retrieval.

How to make Lucene match all words in query?

I am using Lucene to allow a user to search for words in a large number of documents. Lucene seems to default to returning all documents containing any of the words entered.
Is it possible to change this behaviour? I know that '+' can be use to force a term to be included but I would like to make that the default action.
Ideally I would like functionality similar to Google's: '-' to exclude words and "abc xyz" to group words.
Just to clarify
I also thought of inserting '+' into all spaces in the query. I just wanted to avoid detecting grouped terms (brackets, quotes etc) and potentially breaking the query. Is there another approach?

This looks similar to the Lucene Sentence Search question. If you're interested, this is how I answered that question:
String defaultField = ...;
Analyzer analyzer = ...;
QueryParser queryParser = new QueryParser(defaultField, analyzer);
queryParser.setDefaultOperator(QueryParser.Operator.AND);
Query query = queryParser.parse("Searching is fun");

Like Adam said, there's no need to do anything to the query string. QueryParser's setDefaultOperator does exactly what you're asking for.

Why not just preparse the user search input and adjust it to fit your criteria using the Lucene query syntax before passing it on to Lucene. Alternatively, you could just create some help documentation on how to use the standard syntax to create a specific query and let the user decide how the query should be performed.

Lucene has a extensive query language as described here that describes everything you want except for + being the default but that's something you can simple handle by replacing spaces with +. So the only thing you need to do is define the format you want people to enter their search queries in (I would strongly advise to adhere to the default Lucene syntax) and then you can write the transformations from your own syntax to the Lucene syntax.

The behavior is hard-coded in method addClause(List, int, int, Query) of class org.apache.lucene.queryParser.QueryParser, so the only way to change the behavior (other than the workarounds above) is to change that method. The end of the method looks like this:
if (required && !prohibited)
clauses.addElement(new BooleanClause(q, BooleanClause.Occur.MUST));
else if (!required && !prohibited)
clauses.addElement(new BooleanClause(q, BooleanClause.Occur.SHOULD));
else if (!required && prohibited)
clauses.addElement(new BooleanClause(q, BooleanClause.Occur.MUST_NOT));
else
throw new RuntimeException("Clause cannot be both required and prohibited");
Changing "SHOULD" to "MUST" should make clauses (e.g. words) required by default.

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas