Lucene - How can you modify a Query object's terms?

I want to add new fields to my Lucene-based search engine site, and I want to be able to intercept queries and modify them before I pass them on to the Searcher.
For example, each document has the field userid, so you can search for documents authored by a particular user by their ID, e.g. foo bar userid:123. I also want to add the ability to search by username.
I'd like to support a field user:RonaldMcDonald in queries (not in documents), intercepting that term and replacing it with an equivalent userid:123 term (my own code would be responsible for converting "RonaldMcDonald" to "123").
Here's the simple code I'm using right now:
Int32 get = (pageIndex + 1) * pageSize;
Query query;
try {
query = _queryParser.Parse( queryText );
} catch(ParseException pex) {
log.Add("Could not parse query.");
throw new SearchException( "Could not parse query text.", pex );
}
log.Add("Parsed query.");
TopDocs result = _searcher.Search( query, get );
I've had a look at the Query class, but I can't see any way to retrieve, remove, or insert terms.

You can subclass the QueryParser and override NewTermQuery.
QP qp = new QP("user", new SimpleAnalyzer());
var s = qp.Parse("user:RonaldMcDonald data:[aaa TO bbb]");
Here s will be userid:123 data:[aaa TO bbb], given this parser:
public class QP : QueryParser
{
Dictionary<string, string> _dict =
new Dictionary<string, string>(new MyComparer()) {{"RonaldMcDonald","123"} };
public QP(string field, Analyzer analyzer) : base(field, analyzer)
{
}
protected override Query NewTermQuery(Term term)
{
if (term.Field() == "user")
{
//Do your username -> userid mapping
return new TermQuery(new Term("userid", _dict[term.Text()]));
}
return base.NewTermQuery(term);
}
//Case insensitive comparer
class MyComparer : IEqualityComparer<string>
{
public bool Equals(string x, string y)
{
return String.Compare(x, y, true, CultureInfo.InvariantCulture)==0;
}
public int GetHashCode(string obj)
{
return obj.ToLower(CultureInfo.InvariantCulture).GetHashCode();
}
}
}

Related

Search where A or B with querydsl and spring data rest

http://localhost:8080/users?firstName=a&lastName=b ---> where firstName=a and lastName=b
How can I make it use OR instead ---> where firstName=a or lastName=b
This is what happens when I customize the bindings with QuerydslBinderCustomizer:
@Override
default public void customize(QuerydslBindings bindings, QUser user) {
    bindings.bind(String.class).all((StringPath path, Collection<? extends String> values) -> {
        BooleanBuilder predicate = new BooleanBuilder();
        values.forEach(value -> predicate.or(path.containsIgnoreCase(value)));
        // The binding has to return the predicate; newer Spring Data versions expect Optional.of(predicate)
        return predicate;
    });
}
http://localhost:8080/users?firstName=a&firstName=b&lastName=b ---> where (firstName=a or firstName=b) and lastName=b
It seems that different parameters are always combined with AND, while values of the same parameter are combined with whatever I set (predicate.or/predicate.and).
How can I make different parameters combine with OR as well, like this ---> where firstName=a or firstName=b or lastName=b?
Thanks.
Your current request parameters are grouped as a List for firstName and a String for lastName. I see that you want to keep your request parameters without a binding, but a binding would make your life easier in this case.
My suggestion is to create a new class for the request parameters:
public class UserRequest {
private String lastName;
private List<String> firstName;
// getters and setters
}
For QueryDSL, you can create a builder object:
public class UserPredicateBuilder{
private List<BooleanExpression> expressions = new ArrayList<>();
public UserPredicateBuilder withFirstName(List<String> firstNameList){
QUser user = QUser.user;
expressions.add(user.firstName.in(firstNameList));
return this;
}
//.. same for other fields
public BooleanExpression build(){
if(expressions.isEmpty()){
return Expressions.asBoolean(true).isTrue();
}
BooleanExpression result = expressions.get(0);
for (int i = 1; i < expressions.size(); i++) {
result = result.and(expressions.get(i));
}
return result;
}
}
After that you can use the builder like this:
public List<User> getUsers(UserRequest userRequest){
BooleanExpression expression = new UserPredicateBuilder()
.withFirstName(userRequest.getFirstName())
// other fields
.build();
return userRepository.findAll(expression).getContent();
}
This is the recommended solution.
If you really want to keep the current parameters without a binding (they still need some kind of validation, otherwise the QueryDSL binding can throw an exception),
you can group them by path:
Map<StringPath,List<String>> values // example firstName => a,b
and then create your boolean expression from the map:
// initial value
BooleanExpression result = Expressions.asBoolean(true).isTrue();
for (Map.Entry<StringPath, List<String>> entry : values.entrySet()) {
    result = result.and(entry.getKey().in(entry.getValue()));
}
return userRepository.findAll(result);
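The loop above combines the per-field conditions with AND. To get the OR behaviour the question asks for, the same fold can use or, but the starting value then has to be "false" rather than "true", otherwise every row matches. Here is a minimal sketch of that idea in plain Java, with java.util.function.Predicate standing in for QueryDSL's BooleanExpression. The User class, the method names, and the exact-match comparison are assumptions made for illustration and are not part of QueryDSL or Spring Data.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.function.Predicate;

final class OrAcrossParamsSketch {

    // Hypothetical stand-in for the User entity from the question.
    static final class User {
        final String firstName;
        final String lastName;

        User(String firstName, String lastName) {
            this.firstName = firstName;
            this.lastName = lastName;
        }
    }

    // True when ANY field equals ANY of the values requested for it.
    static Predicate<User> anyFieldMatches(Map<Function<User, String>, List<String>> valuesByField) {
        // "false" is the identity for OR; seeding with "true" would match every user.
        Predicate<User> result = user -> false;
        for (Map.Entry<Function<User, String>, List<String>> entry : valuesByField.entrySet()) {
            Function<User, String> field = entry.getKey();
            List<String> wanted = entry.getValue();
            result = result.or(user -> wanted.contains(field.apply(user)));
        }
        return result;
    }

    // Applies the parameters from ?firstName=a&firstName=b&lastName=b to one user.
    static boolean matchesExampleRequest(String firstName, String lastName) {
        Map<Function<User, String>, List<String>> params = new LinkedHashMap<>();
        params.put(user -> user.firstName, Arrays.asList("a", "b"));
        params.put(user -> user.lastName, Arrays.asList("b"));
        return anyFieldMatches(params).test(new User(firstName, lastName));
    }

    public static void main(String[] args) {
        System.out.println(matchesExampleRequest("a", "x")); // true: firstName matches
        System.out.println(matchesExampleRequest("x", "b")); // true: lastName matches
        System.out.println(matchesExampleRequest("x", "y")); // false: nothing matches
    }
}
```

With QueryDSL the equivalent change would presumably be to seed the result with a false expression and call or instead of and; treat that as an untested suggestion.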

Is there any way to write to BigQuery specifying the name of destination tables dynamically?

Is there any way to write to BigQuery specifying the name of destination tables dynamically?
Now I have:
bigQueryRQ
.apply(BigQueryIO.Write
.named("Write")
.to("project_name:dataset_name.table_name")
.withSchema(Table.create_auditedTableSchema())
.withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));
But I need the "table_name" as a dynamic table name that depends on the "tablerow" data that I want to write.
I have the same problem.
How about grouping rows by tag and applying BigQueryIO.Write to each group separately?
public static class TagMarker extends DoFn<TableRow, TableRow> {
private Map<String, TupleTag<TableRow>> tagMap;
public TagMarker(Map<String, TupleTag<TableRow>> tagMap) {
this.tagMap = tagMap;
}
@Override
public void processElement(ProcessContext c) throws Exception {
TableRow item = c.element();
c.sideOutput(tagMap.get(getTagName(item)), item);
}
private String getTagName(TableRow row) {
// Put your logic for choosing the destination table for a row here
return "table" + ((String)row.get("msg")).substring(0, 1);
}
}
private static class GbqWriter extends PTransform<PCollection<TableRow>, PDone> {
@Override
public PDone apply(PCollection<TableRow> input) {
TupleTag<TableRow> mainTag = new TupleTag<TableRow>();
TupleTag<TableRow> tag2 = new TupleTag<TableRow>();
TupleTag<TableRow> tag3 = new TupleTag<TableRow>();
Map<String, TupleTag<TableRow>> tagMap = new HashMap<String, TupleTag<TableRow>>();
tagMap.put("table1", mainTag);
tagMap.put("table2", tag2);
tagMap.put("table3", tag3);
List<TupleTag<?>> tags = new ArrayList<TupleTag<?>>();
tags.add(tag2);
tags.add(tag3);
PCollectionTuple result = input.apply(
ParDo.withOutputTags(mainTag, TupleTagList.of(tags)).of(new TagMarker(tagMap))
);
PDone done = null;
for (String tableId : tagMap.keySet()) {
done = writeToGbq(tableId, result.get(tagMap.get(tableId)).setCoder(TableRowJsonCoder.of()));
}
return done;
}
private PDone writeToGbq(String tableId, PCollection<TableRow> rows) {
PDone done = rows
.apply(BigQueryIO.Write.named("WriteToGbq")
.to("<project>:<dataset>." + tableId)
.withSchema(getSchema())
.withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)
);
return done;
}
}
I am not sure about reassigning the variable done. Is that correct? Could it break writing to GBQ after a failure?
Also, this approach is suitable only if you know the list of tables you want to write to before parsing the rows.
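The routing rule in getTagName can be tried out in plain Java before wiring it into the pipeline. The sketch below is not Dataflow code: it uses ordinary maps as stand-ins for TableRow, and the class and helper names are made up for illustration. It only shows which table each row would be sent to under the "table" plus first character of msg rule.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

final class TableRoutingSketch {

    // Same rule as getTagName above: "table" plus the first character of "msg".
    static String tableFor(Map<String, Object> row) {
        return "table" + ((String) row.get("msg")).substring(0, 1);
    }

    // Groups rows by destination table name (a TreeMap keeps the names sorted).
    static Map<String, List<Map<String, Object>>> groupByTable(List<Map<String, Object>> rows) {
        return rows.stream()
                .collect(Collectors.groupingBy(TableRoutingSketch::tableFor, TreeMap::new, Collectors.toList()));
    }

    // Hypothetical helper that builds a one-field row for the demo.
    static Map<String, Object> row(String msg) {
        Map<String, Object> r = new HashMap<>();
        r.put("msg", msg);
        return r;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> rows = Arrays.asList(row("1abc"), row("2xyz"), row("1def"));
        groupByTable(rows).forEach(
                (table, group) -> System.out.println(table + " <- " + group.size() + " row(s)"));
    }
}
```

Unlike the side-output approach, this grouping discovers table names from the data, which is exactly what the pipeline version cannot do, since its tags have to be declared up front.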
Unfortunately, we don't provide an API to name the BigQuery table in a data-dependent way. Generally speaking, data-dependent BigQuery table destination(s) may be error prone.
That said, we are working on improving flexibility in this area. No estimates at this time, but we hope to get this soon.

Hibernate Search with AnalyzerDiscriminator - Analyzer called only when creating Entity?

Can you help me?
I am implementing Hibernate Search to retrieve results for a global search on a localized website (Portuguese and English content).
To do this, I have followed the steps indicated on the Hibernate Search docs:
http://docs.jboss.org/hibernate/search/4.5/reference/en-US/html_single/#d0e4141
Along with the specific configuration in the entity itself, I have implemented a "LanguageDiscriminator" class, following the instructions in this doc.
Because I am not getting exactly the results I was expecting (e.g. my entity has the text "Capuchinho" stored, but when I search for "capucho" I get no hits), I have decided to try and debug the execution, and try to understand if the Analyzers which I have configured are being used at all.
When creating a new record for the entity in the database, I can see that the "getAnalyzerDefinitionName()" method from the "LanguageDiscriminator" gets called. Great. But the same does not happen when I execute a search. Can anyone explain why?
I am posting the key parts of my code below. Thanks a lot for any feedback!
This is one entity I want to index
@Entity
@Table(name="NEWS_HEADER")
@Indexed
@AnalyzerDefs({
@AnalyzerDef(name = "en",
tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
filters = {
@TokenFilterDef(factory = LowerCaseFilterFactory.class),
@TokenFilterDef(factory = SnowballPorterFilterFactory.class,
params = {@Parameter(name="language", value="English")}
)
}
),
@AnalyzerDef(name = "pt",
tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
filters = {
@TokenFilterDef(factory = LowerCaseFilterFactory.class),
@TokenFilterDef(factory = SnowballPorterFilterFactory.class,
params = {@Parameter(name="language", value="Portuguese")}
)
}
)
})
public class NewsHeader implements Serializable {
static final long serialVersionUID = 20140301L;
private int id;
private String articleHeader;
private String language;
private Set<NewsParagraph> paragraphs = new HashSet<NewsParagraph>();
/**
* @return the id
*/
@Id
@Column(name="ID")
@GeneratedValue(strategy=GenerationType.AUTO)
@DocumentId
public int getId() {
return id;
}
/**
* @param id the id to set
*/
public void setId(int id) {
this.id = id;
}
/**
* @return the articleHeader
*/
@Column(name="ARTICLE_HEADER")
@Field(index=Index.YES, store=Store.NO)
public String getArticleHeader() {
return articleHeader;
}
/**
* @param articleHeader the articleHeader to set
*/
public void setArticleHeader(String articleHeader) {
this.articleHeader = articleHeader;
}
/**
* @return the language
*/
@Column(name="LANGUAGE")
@Field
@AnalyzerDiscriminator(impl=LanguageDiscriminator.class)
public String getLanguage() {
return language;
}
...
}
This is my LanguageDiscriminator class
public class LanguageDiscriminator implements Discriminator {
@Override
public String getAnalyzerDefinitionName(Object value, Object entity, String field) {
String result = null;
if (value != null) {
result = (String) value;
}
return result;
}
}
This is my search method present in my SearchDAO
public List<NewsHeader> searchParagraph(String patternStr) {
Session session = null;
Transaction tx;
List<NewsHeader> result = null;
try {
session = sessionFactory.getCurrentSession();
FullTextSession fullTextSession = Search.getFullTextSession(session);
tx = fullTextSession.beginTransaction();
// Create native Lucene query using the query DSL
QueryBuilder queryBuilder = fullTextSession.getSearchFactory()
.buildQueryBuilder().forEntity(NewsHeader.class).get();
org.apache.lucene.search.Query luceneSearchQuery = queryBuilder
.keyword()
.onFields("articleHeader", "paragraphs.content")
.matching(patternStr)
.createQuery();
// Wrap Lucene query in a org.hibernate.Query
org.hibernate.Query hibernateQuery =
fullTextSession.createFullTextQuery(luceneSearchQuery, NewsHeader.class, NewsParagraph.class);
// Execute search
result = hibernateQuery.list();
} catch (Exception xcp) {
logger.error(xcp);
} finally {
if ((session != null) && (session.isOpen())) {
session.close();
}
}
return result;
}
When creating a new record for the entity in the database, I can see that the "getAnalyzerDefinitionName()" method from the "LanguageDiscriminator" gets called. Great. But the same does not happen when I execute a search. Can anyone explain why?
The selection of the analyzer depends on the state of a given entity, in your case NewsHeader. You are dealing with entity instances during indexing. While querying, you don't have entities to start with; you are searching for them. Which analyzer would you expect Hibernate Search to select for your query?
That said, I think there is a shortcoming in the DSL. It does not allow you to explicitly specify the analyzer for a class. There is ignoreAnalyzer, but that's not what you want. I guess you could create a feature request in the Search issue tracker - https://hibernate.atlassian.net/browse/HSEARCH.
In the meantime you can build the query using the native Lucene query API. However, you will need to know which language you are targeting with your query (for example, via the preferred language of the logged-in user). This will depend on your use case. It might be you are looking at the wrong feature to start with.

Lucene.Net is not giving any results but Luke does

I have created a Lucene Index using StandardAnalyzer with following three fields.
StreetName
City
State
I am using the wrapper classes below to make writing boolean queries easier:
public interface IQuery
{
BooleanQuery GetQuery();
}
public class QueryParam : IQuery
{
public string[] Fields { get; set; }
public string Term { get; set; }
private BooleanQuery _indexerQuery;
public QueryParam(string term, params string[] fields)
{
Term = term;
Fields = fields;
}
public BooleanQuery GetQuery()
{
_indexerQuery = new BooleanQuery();
foreach (var field in Fields)
_indexerQuery.Add(new FuzzyQuery(new Term(field, Term)), Occur.SHOULD);
return _indexerQuery;
}
}
public class AndQuery : IQuery
{
private readonly IList<IQuery> _queryParams = new List<IQuery>();
private BooleanQuery _indexerQuery;
public AndQuery(params IQuery[] queryParams)
{
foreach (var queryParam in queryParams)
{
_queryParams.Add(queryParam);
}
}
public BooleanQuery GetQuery()
{
_indexerQuery = new BooleanQuery();
foreach (var query in _queryParams)
_indexerQuery.Add(query.GetQuery(), Occur.MUST);
return _indexerQuery;
}
}
public class OrQuery : IQuery
{
private readonly IList<IQuery> _queryParams = new List<IQuery>();
private readonly BooleanQuery _indexerQuery = new BooleanQuery();
public OrQuery(params IQuery[] queryParams)
{
foreach (var queryParam in queryParams)
{
_queryParams.Add(queryParam);
}
}
public BooleanQuery GetQuery()
{
foreach (var query in _queryParams)
_indexerQuery.Add(query.GetQuery(), Occur.SHOULD);
return _indexerQuery;
}
public OrQuery AddQuery(IQuery query)
{
_queryParams.Add(query);
return this;
}
}
The query below gives me no results in Lucene.Net, but when I run the same query in Luke, it works flawlessly.
var query = new AndQuery(new QueryParam(city.ToLower(), "city"), new QueryParam(state.ToLower(), "state"), new QueryParam(streetAddress.ToLower(), "streetname"));
Executing query.GetQuery() gives me the resulting query below.
{+(city:tampa~0.5) +(state:fl~0.5) +(street:tennis court~0.5)}
You can search using BooleanQuery. Split your term on whitespace into segments, then build the query and search the index.
Example:
BooleanQuery booleanQuery = new BooleanQuery();
foreach (var searchTerm in searchTerms)
{
    var searchTermSegments = searchTerm.Split(new[] { " " }, StringSplitOptions.RemoveEmptyEntries);
    if (searchTermSegments.Count() > 1)
    {
        // Use a fresh sub-query per term so clauses already added to booleanQuery are not cleared or shared
        var searchTermQuery = new BooleanQuery();
        foreach (var segText in searchTermSegments)
        {
            searchTermQuery.Add(new FuzzyQuery(new Term("FieldName", segText.ToLower().Trim())), BooleanClause.Occur.MUST);
        }
        booleanQuery.Add(searchTermQuery, BooleanClause.Occur.MUST);
    }
    else
    {
        booleanQuery.Add(new FuzzyQuery(new Term("FieldName", searchTerm.ToLower().Trim())), BooleanClause.Occur.MUST);
    }
}
The problem is the treatment of tennis court. You haven't shown how you are indexing these fields, but I will assume they are tokenized in the index, using something like a StandardAnalyzer, for instance. This means, "tennis court" will be split into two separate terms "tennis" and "court". When creating a FuzzyQuery manually, though, there is no analysis or tokenization, and so you will only have a single term "tennis court". There is a large edit distance between "tennis court" and either "tennis" (6 edits) or "court" (7 edits), so neither of them match.
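The edit counts quoted above can be checked with a few lines of plain Java. This is a textbook Levenshtein distance, written here only to illustrate the numbers; it is not Lucene's FuzzyQuery implementation, whose similarity scoring and cut-offs vary by version. The class and method names are made up for the example.

```java
final class EditDistanceSketch {

    // Classic dynamic-programming Levenshtein distance using two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                curr[j] = Math.min(substitution, Math.min(insertion, deletion));
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("tennis court", "tennis")); // 6
        System.out.println(levenshtein("tennis court", "court"));  // 7
    }
}
```

Six or seven edits against a five or six character indexed term is far more than any fuzzy threshold will tolerate, which is why the single untokenized term never matches.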
A source of confusion here seems to be that
+(city:tampa~0.5) +(state:fl~0.5) +(street:tennis court~0.5)
seems to work. It is not safe, however, to assume that the text form of a query, printed for debugging, can be run through the QueryParser to regenerate the same query, and this is a good example. The QueryParser syntax is simply not capable of expressing everything you can do with manually constructed queries. Running that query through the query parser will generate a query more like:
+(city:tampa~0.5) +(state:fl~0.5) +((street:tennis) (defaultField:court~0.5))
Which will find a match, since we can expect it to find city:tampa, state:fl, and street:tennis (See this Lucene Query Parser documentation section for another example explaining this behavior of the query parser). Whether it finds a match on court in the default field I have no idea, but it doesn't really need to.
A PhraseQuery is the typical way to string multiple terms (words) together in a Lucene Query (this would look like street:"tennis court" in a parsed query).

ROW_NUMBER() and nhibernate - finding an item's page

Given a query in the form of an ICriteria object, I would like to use NHibernate (by means of a projection?) to find an element's order,
in a manner equivalent to using
SELECT ROW_NUMBER() OVER (...)
to find a specific item's index in the query.
(I need this for a "jump to page" feature in paging.)
Any suggestions?
NOTE: I don't want to go to a page given its number yet - I know how to do that - I want to get the item's INDEX so I can divide it by the page size and get the page index.
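For reference, the arithmetic described in the note is a single integer division once the zero-based index is known. A minimal sketch, with a hypothetical class and method name:

```java
final class PageIndexSketch {

    // Zero-based page index for a zero-based item index.
    static long pageIndexOf(long itemIndex, int pageSize) {
        if (itemIndex < 0 || pageSize <= 0) {
            throw new IllegalArgumentException("itemIndex must be >= 0 and pageSize must be > 0");
        }
        return itemIndex / pageSize; // integer division drops the position within the page
    }

    public static void main(String[] args) {
        // Item 57 with 25 items per page sits on page index 2, i.e. the third page.
        System.out.println(pageIndexOf(57, 25)); // 2
    }
}
```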
After looking at the sources for NHibernate, I'm fairly sure that there exists no such functionality.
I wouldn't mind, however, for someone to prove me wrong.
In my specific setting, I did solve this problem by writing a method that takes a couple of lambdas (representing the key column, and an optional column to filter by - all properties of a specific domain entity). This method then builds the sql and calls session.CreateSQLQuery(...).UniqueResult(); I'm not claiming that this is a general purpose solution.
To avoid the use of magic strings, I borrowed a copy of PropertyHelper<T> from this answer.
Here's the code:
public abstract class RepositoryBase<T> where T : DomainEntityBase
{
public long GetIndexOf<TUnique, TWhere>(T entity, Expression<Func<T, TUnique>> uniqueSelector, Expression<Func<T, TWhere>> whereSelector, TWhere whereValue) where TWhere : DomainEntityBase
{
if (entity == null || entity.Id == Guid.Empty)
{
return -1;
}
var entityType = typeof(T).Name;
var keyField = PropertyHelper<T>.GetProperty(uniqueSelector).Name;
var keyValue = uniqueSelector.Compile()(entity);
var innerWhere = string.Empty;
if (whereSelector != null)
{
// Builds a column name that adheres to our naming conventions!
var filterField = PropertyHelper<T>.GetProperty(whereSelector).Name + "Id";
if (whereValue == null)
{
innerWhere = string.Format(" where [{0}] is null", filterField);
}
else
{
innerWhere = string.Format(" where [{0}] = :filterValue", filterField);
}
}
var innerQuery = string.Format("(select [{0}], row_number() over (order by {0}) as RowNum from [{1}]{2}) X", keyField, entityType, innerWhere);
var outerQuery = string.Format("select RowNum from {0} where {1} = :keyValue", innerQuery, keyField);
var query = _session
.CreateSQLQuery(outerQuery)
.SetParameter("keyValue", keyValue);
if (whereValue != null)
{
query = query.SetParameter("filterValue", whereValue.Id);
}
var sqlRowNumber = query.UniqueResult<long>();
// The row_number() function is one-based. Our index should be zero-based.
sqlRowNumber -= 1;
return sqlRowNumber;
}
public long GetIndexOf<TUnique>(T entity, Expression<Func<T, TUnique>> uniqueSelector)
{
return GetIndexOf(entity, uniqueSelector, null, (DomainEntityBase)null);
}
public long GetIndexOf<TUnique, TWhere>(T entity, Expression<Func<T, TUnique>> uniqueSelector, Expression<Func<T, TWhere>> whereSelector) where TWhere : DomainEntityBase
{
return GetIndexOf(entity, uniqueSelector, whereSelector, whereSelector.Compile()(entity));
}
}
public abstract class DomainEntityBase
{
public virtual Guid Id { get; protected set; }
}
And you use it like so:
...
public class Book : DomainEntityBase
{
public virtual string Title { get; set; }
public virtual Category Category { get; set; }
...
}
public class Category : DomainEntityBase { ... }
public class BookRepository : RepositoryBase<Book> { ... }
...
var repository = new BookRepository();
var book = ... a persisted book ...
// Get the index of the book, sorted by title.
var index = repository.GetIndexOf(book, b => b.Title);
// Get the index of the book, sorted by title and filtered by that book's category.
var indexInCategory = repository.GetIndexOf(book, b => b.Title, b => b.Category);
As I said, this works for me. I'll definitely tweak it as I move forward. YMMV.
Now, if the OP has solved this himself, then I would love to see his solution! :-)
ICriteria has these two methods:
SetFirstResult()
and
SetMaxResults()
which make the generated SQL statement use ROW_NUMBER (on SQL Server) or LIMIT (on MySQL).
So if you want 25 records on the third page you could use:
.SetFirstResult(2*25)
.SetMaxResults(25)
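The 2*25 above is the general rule "skip all full pages before the requested one". A small sketch of that calculation, assuming one-based page numbers and a made-up helper name; the result is the value you would pass to SetFirstResult:

```java
final class PageOffsetSketch {

    // Number of rows to skip before a one-based page number.
    static int firstResult(int pageNumber, int pageSize) {
        if (pageNumber < 1 || pageSize <= 0) {
            throw new IllegalArgumentException("pageNumber must be >= 1 and pageSize must be > 0");
        }
        return (pageNumber - 1) * pageSize;
    }

    public static void main(String[] args) {
        System.out.println(firstResult(3, 25)); // 50, the same as 2*25
    }
}
```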
After trying to find an NHibernate based solution for this myself, I ultimately just added a column to the view I happened to be using:
CREATE VIEW vw_paged AS
SELECT ROW_NUMBER() OVER (ORDER BY Id) AS [Row], p.column1, p.column2
FROM paged_table p
This doesn't really help if you need complex sorting options, but it does work for simple cases.
A Criteria query, of course, would look something like this:
public static IList<Paged> GetRange(string search, int rows)
{
var match = DbSession.Current.CreateCriteria<Paged>()
.Add(Restrictions.Like("Id", search + '%'))
.AddOrder(Order.Asc("Id"))
.SetMaxResults(1)
.UniqueResult<Paged>();
if (match == null)
return new List<Paged>();
if (rows == 1)
return new List<Paged> {match};
return DbSession.Current.CreateCriteria<Paged>()
.Add(Restrictions.Like("Id", search + '%'))
.Add(Restrictions.Ge("Row", match.Row))
.AddOrder(Order.Asc("Id"))
.SetMaxResults(rows)
.List<Paged>();
}