Using lucene to search data differently for different users conditionally - lucene

Consider that the entities I need to run text search on look like the following:
Sample{
    int ID, //Unique ID
    string Name, //Searchable field
    string Description //Searchable field
}
Now, I have several such entities which are shared by all users, but each user can associate different tags, notes etc. with any of these entities. For simplicity, let's say a user can add tags to a Sample entity.
UserSampleData{
    int ID, //Sample ID
    int UserID, //For condition
    string tags //Searchable field
}
When a user performs a search, I want to look for the given string in the Name and Description fields and in the tags that the current user has associated with that Sample. I am pretty new to Lucene indexing and cannot figure out how to design an index and the queries for such a situation. I need the results sorted by relevance to the search query. The following approaches crossed my mind, but I have a feeling there could be better solutions:
Separately query the 2 different entities, Samples and UserSampleData, and somehow merge the 2 result sets. For the results that intersect, we need to combine the match scores, perhaps by averaging.
Flatten out the data by combining both entities => multiple entries for the same ID.

You could use Lucene's JoinUtil class, but you must rename the second "ID" field, the one in the UserSampleData document, to SAMPLE_ID (or any other name different from "ID").
Below is an example:
IndexReader r = DirectoryReader.open(dir);
final Version version = Version.LUCENE_47; // Your Lucene version
final IndexSearcher searcher = new IndexSearcher(r);
// The "from" query runs against the UserSampleData documents, so the join
// starts at their SAMPLE_ID field and ends at the ID field of the Sample documents.
final String fromField = "SAMPLE_ID";
final boolean multipleValuesPerDocument = false;
final String toField = "ID";
String querystr = "UserID:xxxx AND yourQueryString"; // the UserID condition and your query string
// "tags" is the default field here because it is the searchable field of UserSampleData
Query fromQuery = new QueryParser(version, "tags", new WhitespaceAnalyzer(version)).parse(querystr);
final Query joinQuery = JoinUtil.createJoinQuery(fromField, multipleValuesPerDocument, toField, fromQuery, searcher, ScoreMode.None);
final TopDocs topDocs = searcher.search(joinQuery, 10);
Check the bug https://issues.apache.org/jira/browse/LUCENE-4824. I don't know whether it has been fixed in the current version of Lucene; if it has not, I think you must convert the type of your ID fields to String.
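To make the join semantics concrete, here is a minimal, library-free sketch in Python of the two steps a query-time join performs. It is not Lucene code: the `matches` helper is a toy stand-in for an analyzed full-text match, and the union with samples that match on their own Name or Description is an assumption added to cover the full requirement in the question (in Lucene this would presumably be an extra optional clause next to the join query).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    id: int
    name: str
    description: str


@dataclass(frozen=True)
class UserSampleData:
    sample_id: int  # renamed from ID so it cannot clash with Sample.id
    user_id: int
    tags: str


def matches(text: str, term: str) -> bool:
    """Toy full-text match: case-insensitive equality on whitespace tokens."""
    return term.lower() in text.lower().split()


def search(samples, user_data, user_id, term):
    # Step 1 ("from" side): collect the sample ids of this user's rows
    # whose tags match the term.
    joined_ids = {
        u.sample_id
        for u in user_data
        if u.user_id == user_id and matches(u.tags, term)
    }
    # Step 2 ("to" side): keep samples whose id was collected in step 1,
    # plus samples that match on their shared fields (assumed requirement).
    return [
        s.id
        for s in samples
        if s.id in joined_ids
        or matches(s.name, term)
        or matches(s.description, term)
    ]


samples = [
    Sample(1, "Copper wire", "thin conductor"),
    Sample(2, "Steel rod", "structural part"),
    Sample(3, "Glass tube", "lab equipment"),
]
user_data = [
    UserSampleData(2, 10, "conductor urgent"),
    UserSampleData(3, 11, "conductor"),
]

results_user_10 = search(samples, user_data, 10, "conductor")
results_user_11 = search(samples, user_data, 11, "conductor")
```

With this data, user 10 gets samples 1 and 2 while user 11 gets samples 1 and 3: the shared description match is visible to both, and the tag match is visible only to the user who owns the tag. The sketch ignores scoring entirely.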

I think that you need Relational Data. Handling relational data is not simple with Lucene. This is a useful blog post for.

Related

Linq where a record contains 2 matched fields

I’m working with an existing database whose design I don’t control. I’m using EF4 and querying with LINQ. I work in VB.Net but would be quite happy to translate a C# solution.
I would like to pull records from a table where two of the fields match a pair of items from a list.
So I have a list of
Public Class RequestInfo
    Public Property INSP_ROUTINE_NM As String
    Public Property FEATURE_ID As String
End Class
And I would like to query a table and pull any records where both INSP_ROUTINE_NM and FEATURE_ID match one of the Request Info items.
I can use Contains easily enough on either of the fields:
Dim Features = (From F In MLDb.TBL_FeatureInfoSet _
                Where (C_Request.Select(Function(x) x.INSP_ROUTINE_NM)).Contains(F.INSP_ROUTINE_NM) _
                Select F.FEATURE_ID, F.FEATURE_RUN_NO, F.INSP_ROUTINE_NM).ToList
I could use two Contains calls, but that would pull any record where both fields match somewhere in the list, not necessarily as one pair from the request.
You can try this:
C#
var Features = (from f in MLDb.TBL_FeatureInfoSet
                let q = C_Request.Select(x => x.INSP_ROUTINE_NM)
                where q.Contains(f.INSP_ROUTINE_NM) || q.Contains(f.INSP_ROUTINE_NM)
                // where q.Contains(f.INSP_ROUTINE_NM) && q.Contains(f.INSP_ROUTINE_NM)
                select new { f.FEATURE_ID, f.FEATURE_RUN_NO }).ToList();
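The difference between two independent Contains checks and a true pair match can be shown with a small, database-free sketch in Python. The class and function names are hypothetical, and this only models the logic, not what EF4 can translate to SQL. One possible way to carry the idea over to LINQ to Entities, offered here as an untested assumption, is to compare a single concatenated key (for example routine + "|" + feature) against a list of such keys.

```python
from typing import NamedTuple


class RequestInfo(NamedTuple):
    insp_routine_nm: str
    feature_id: str


class FeatureRow(NamedTuple):
    insp_routine_nm: str
    feature_id: str
    feature_run_no: int


def match_independently(rows, requests):
    """Two separate membership tests: accepts mixed-up pairs as well."""
    routines = {r.insp_routine_nm for r in requests}
    features = {r.feature_id for r in requests}
    return [
        row for row in rows
        if row.insp_routine_nm in routines and row.feature_id in features
    ]


def match_pairs(rows, requests):
    """One membership test on the (routine, feature) pair."""
    wanted = {(r.insp_routine_nm, r.feature_id) for r in requests}
    return [
        row for row in rows
        if (row.insp_routine_nm, row.feature_id) in wanted
    ]


requests = [RequestInfo("R1", "F1"), RequestInfo("R2", "F2")]
rows = [
    FeatureRow("R1", "F1", 1),
    FeatureRow("R1", "F2", 2),  # both values occur in the list, but not as a pair
    FeatureRow("R2", "F2", 3),
    FeatureRow("R3", "F1", 4),
]

loose_runs = [row.feature_run_no for row in match_independently(rows, requests)]
exact_runs = [row.feature_run_no for row in match_pairs(rows, requests)]
```

Here the independent check also returns run 2, which is exactly the false positive described in the question, while the pair check returns only runs 1 and 3.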

Need a concept on fetching data with HQL while three or more tables are in use

A small briefing on what I am trying to do.
I have three tables Content(contentId, body, timeofcreation), ContentAttachmentMap(contentId, attachmentId) and Attachment(attachmentId, resourceLocation).
The reason I created the mapping table is that, in a future version of the application, an attachment may also be shared between different contents.
Now I am using HQL to get data. My objective is as follows:
Get all contents, with or without attachments
I have seen some examples on the internet where you create a purpose-specific class (not a POJO), pass the selected attributes to its constructor, and get back a List of objects of that class.
For example, the HQL would be SELECT new com.mydomain.myclass(cont.id, cont.body) ..... and so on.
In my case I am looking for something like SELECT new com.mydomain.contentClass(cont.id, cont.body, List<Attachment>) FROM ... Yes, I want each item of the result list to contain the content id, the content body and the List of its Attachments. If there are no attachments, it should return (cont.id, contentbody, null).
Is this possible? Also, please tell me how to write the SQL statements.
Thanks in advance.
I feel you are using Hibernate in a fundamentally wrong way. You should use Hibernate to work with your domain entities, not to expose the underlying tables.
You don't need that special contentClass value object for this. Simply selecting the Content entity gives you what you need.
I think it will be easier with an actual example.
In your application you should not see this as "3 tables" but as 2 entities, which look something like:
@Entity
public class Content {
    @Id
    Long id;
    @Column(...)
    String content;
    @ManyToMany
    @JoinTable(name="ContentAttachmentMap")
    List<Attachment> attachments;
}
@Entity
public class Attachment {
    @Id
    Long id;
    @Column(...)
    String resourceLocation;
}
And the result you are looking for is simply the result of an HQL query along the lines of
    from Content where attachments IS EMPTY
I believe you can join fetch too in order to save DB access:
    from Content c left join fetch c.attachments where c.attachments IS EMPTY
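As a rough illustration of what mapping the attachments as a collection buys you, the following Python sketch groups the flat rows of a left join into one item per content with a list of its attachments. It is not Hibernate code; the row layout and the `group_left_join` helper are assumptions made for the example. Note that a content without attachments ends up with an empty list rather than null.

```python
def group_left_join(rows):
    """Group LEFT JOIN rows into one entry per content.

    rows: iterable of (content_id, body, attachment_location_or_None).
    """
    grouped = {}
    for content_id, body, location in rows:
        entry = grouped.setdefault(
            content_id, {"id": content_id, "body": body, "attachments": []}
        )
        # A content without attachments yields one row with a NULL attachment.
        if location is not None:
            entry["attachments"].append(location)
    return list(grouped.values())


rows = [
    (1, "first post", "/files/a.pdf"),
    (1, "first post", "/files/b.png"),
    (2, "second post", None),
]
contents = group_left_join(rows)
```

Each element of `contents` then carries the id, the body and the attachment list together, which is the shape the question asks for.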

How should I filter the results of a RavenDB Lucene search?

Say I had a User class like this:
public class User
{
    public bool IsActive { get; set; }
    public string[] Tags { get; set; }
    public string Description { get; set; }
}
I would like to use RavenDB to search for the set of users that match following criteria:
IsActive = true
Tags contains both 'hello' and 'world'
Description has the following phrase 'abject failure'
I have researched the Lucene Query syntax, I have even got some stuff working, but it all feels dreadfully clunky with lots of combinatorial string-building to create a text-based lucene query string. I hesitate to put my code up here because it is quite smelly.
I think what I want to do is submit a Lucene search for the Description and Tags and then filter it with a Where clause for the IsActive field, perhaps like this: Filter RavenDB Search Results. But I got lost.
I am using the latest official release (960) so all the groovy stuff that comes after this is not available to me yet. For example, this solution is verboten as 960 does not appear to support the .As<T>() extension.
Question
How do I construct the required Index and Query to perform a search that combines:
a single constraint, eg IsActive
a collection constraint, eg Tags
a free-text constraint eg Description
to return a strongly typed list of User objects?
Thank you for any code examples or pointers.
You query it like this:
var results = (from u in Session.Query<User>("YourUserIndex")
               where u.IsActive && u.Tags.Any(x => x == "hello") && u.Tags.Any(x => x == "world")
               select u)
              .Search(x => x.Description, "abject failure")
              .ToList();
Where YourUserIndex looks like this:
from u in docs.Users
select new { u.IsActive, u.Tags, u.Description };
And you need to mark the Description field as analyzed.

Lucene Query Syntax

I'm trying to use Lucene to query a domain that has the following structure
Student 1-------* Attendance *---------1 Course
The data in the domain is summarised below
Course.name    Attendance.mandatory    Student.name
---------------------------------------------------
cooking        N                       Bob
art            Y                       Bob
If I execute the query "courseName:cooking AND mandatory:Y" it returns Bob, because Bob is attending the cooking course, and Bob is also attending a mandatory course. However, what I really want to query for is "students attending a mandatory cooking course", which in this case would return nobody.
Is it possible to formulate this as a Lucene query? I'm actually using Compass, rather than Lucene directly, so I can use either CompassQueryBuilder or Lucene's query language.
For the sake of completeness, the domain classes themselves are shown below. These classes are Grails domain classes, but I'm using the standard Compass annotations and Lucene query syntax.
@Searchable
class Student {
    @SearchableProperty(accessor = 'property')
    String name
    static hasMany = [attendances: Attendance]
    @SearchableId(accessor = 'property')
    Long id
    @SearchableComponent
    Set<Attendance> getAttendances() {
        return attendances
    }
}
@Searchable(root = false)
class Attendance {
    static belongsTo = [student: Student, course: Course]
    @SearchableProperty(accessor = 'property')
    String mandatory = "Y"
    @SearchableId(accessor = 'property')
    Long id
    @SearchableComponent
    Course getCourse() {
        return course
    }
}
@Searchable(root = false)
class Course {
    @SearchableProperty(accessor = 'property', name = "courseName")
    String name
    @SearchableId(accessor = 'property')
    Long id
}
What you are trying to do is sometimes known as "scoped search" or "xml search" - the ability to search based on a set of related sub-elements. Lucene does not support this natively but there are some tricks you can do to get it to work.
You can put all of the course data associated with a student in a single field. Then bump the term position by a fixed amount (like 100) between the terms for each course. You can then do a proximity search with phrase queries or span queries to force a match for attributes of a single course. This is how Solr supports multi-valued fields.
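The position-gap trick can be modelled without Lucene. The Python sketch below assigns token positions the way the paragraph describes, with a fixed gap of 100 between courses, and then checks proximity the way a sloppy phrase or span query would. The function names and the tokens are hypothetical, and the slop must stay smaller than the gap for the isolation to hold.

```python
GAP = 100  # position jump inserted between the tokens of two courses


def index_positions(courses):
    """courses: one token list per attendance. Returns [(token, position)]."""
    postings = []
    position = 0
    for tokens in courses:
        for token in tokens:
            postings.append((token, position))
            position += 1
        position += GAP  # tokens of different courses are never close together
    return postings


def within(postings, term_a, term_b, slop):
    """True if term_a and term_b occur at most `slop` positions apart."""
    pos_a = [p for t, p in postings if t == term_a]
    pos_b = [p for t, p in postings if t == term_b]
    return any(abs(a - b) <= slop for a in pos_a for b in pos_b)


# Bob attends cooking (not mandatory) and art (mandatory).
bob = index_positions([["cooking", "N"], ["art", "Y"]])

plain_and = any(t == "cooking" for t, _ in bob) and any(t == "Y" for t, _ in bob)
mandatory_cooking = within(bob, "cooking", "Y", slop=10)
mandatory_art = within(bob, "art", "Y", slop=10)
```

A plain AND still matches Bob because both terms occur somewhere in the field, whereas the proximity check rejects "cooking" with "Y" and accepts "art" with "Y".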
Another workaround is to add a fake getter that combines both values and index it.
Something like:
@SearchableProperty
String getCourseMandatory() {
    return course.name + mandatory
}
Try
+courseName:cooking +mandatory:Y
We use pretty similar queries and this works for us:
+ProdLineNum:1920b +HouseBrand:1
This selects everything in product line 1920b that is also a house brand (generic).
You can just create queries as text strings and then parse them to get your query object. I presume you have seen Apache Lucene - Query Parser Syntax?

Lucene query - "Match exactly one of x, y, z"

I have a Lucene index that contains documents that have a "type" field, this field can be one of three values "article", "forum" or "blog". I want the user to be able to search within these types (there is a checkbox for each document type)
How do I create a Lucene query dependent on which types the user has selected?
A couple of prerequisites are:
If the user doesn't select one of the types, I want no results from that type.
The ordering of the results should not be affected by restricting the type field.
For reference if I were to write this in SQL (for a "blog or forum search") I'd write:
SELECT * FROM Docs
WHERE [type] in ('blog', 'forum')
For reference, should anyone else come across this problem, here is my solution:
IList<string> ALL_TYPES = new[] { "article", "blog", "forum" };
string q = ...; // The user's search string
IList<string> includeTypes = ...; // List of types to include
Query searchQuery = parser.Parse(q);
// Must be declared as BooleanQuery, because Add() is not defined on Query
BooleanQuery parentQuery = new BooleanQuery();
parentQuery.Add(searchQuery, BooleanClause.Occur.SHOULD);
// Invert the logic, exclude the other types
foreach (var type in ALL_TYPES.Except(includeTypes))
{
    parentQuery.Add(
        new TermQuery(new Term("type", type)),
        BooleanClause.Occur.MUST_NOT
    );
}
searchQuery = parentQuery;
I inverted the logic (i.e. excluded the types the user had not selected), because if you don't the ordering of the results is lost. I'm not sure why though...! It is a shame as it makes the code less clear / maintainable, but at least it works!
Add a constraint to reject documents whose type wasn't selected. For example, if only "article" was checked, the constraint would be
-(type:forum type:blog)
While erickson's suggestion seems fine, you could use a positive constraint ANDed with your search term, such as text:foo AND type:article for the case where only "article" was checked,
or text:foo AND (type:article OR type:forum) for the case where both "article" and "forum" were checked.
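Both variants discussed in this thread, excluding the unselected types and requiring one of the selected types, can be generated from the checkbox state. The Python sketch below only builds query strings in Lucene's query parser syntax; the helper names are hypothetical and the user's search string is passed through unescaped. A likely reason the positive form changed the ordering is that the type clauses are scored along with the text clause, but that is an assumption, not something verified here.

```python
ALL_TYPES = ("article", "blog", "forum")


def exclude_unselected(user_query, selected):
    """Negative form: prohibit every type the user did not tick."""
    excluded = [t for t in ALL_TYPES if t not in selected]
    if not excluded:
        return user_query  # everything is selected, nothing to prohibit
    prohibited = " ".join("type:" + t for t in excluded)
    return "+(%s) -(%s)" % (user_query, prohibited)


def require_selected(user_query, selected):
    """Positive form: require one of the ticked types."""
    chosen = [t for t in ALL_TYPES if t in selected]
    if not chosen:
        raise ValueError("no type selected, so there is nothing to search")
    required = " ".join("type:" + t for t in chosen)
    return "+(%s) +(%s)" % (user_query, required)


negative = exclude_unselected("text:foo", {"article"})
positive = require_selected("text:foo", {"article", "forum"})
```

With only "article" ticked, the negative form yields `+(text:foo) -(type:blog type:forum)`; with "article" and "forum" ticked, the positive form yields `+(text:foo) +(type:article type:forum)`.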