Combining hits from multiple documents into a single hit in Lucene - lucene

I am trying to get a particular search to work and it is proving problematic. The actual source data is quite complex but can be summarised by the following example:
I have articles that are indexed so
that they can be searched. Each
article also has multiple properties
associated with it which are also
indexed and searchable. When users
search, they can get hits in either
the main article or the associated
properties. Regardless of where a hit
is achieved, the article is returned
as a search hit (ie. the properties
are never a hit in their own right).
Now for the complexity:
Each property has security on it,
which means that for any given user,
they may or may not be able to see the
property. If a user cannot see a
property, they obviously do not get a
search hit in it. This security check
is proprietary and cannot be done
using the typical mechanism of storing
a role in the index alongside the
other fields in the document.
I currently have an index that contains the articles and properties indexed separately (ie. an article is indexed as a document, and each property has its own document). When a search happens, a hit in article A or a hit in any of the properties of article A should be classed as hit for article A alone, with the scores combined.
To achieve this originally, Lucene v1.3 was modified to allow this to happen by changing BooleanQuery to have a custom Scorer that could apply the logic of the security check and the combination of two hits in different documents being classed as a hit in a single document. I am trying to upgrade this version to the latest (v2.3.2 - I am using Lucene.Net), but ideally without having to modify Lucene in any way.
An additional problem occurs if I do an AND search. If an article contains the word foo and one of its properties contains the word bar, then searching for "foo AND bar" will return the article as a hit. My current code deals with this inside the custom Scorer.
Any ideas how/if this can be done?
I am thinking along the lines of using a custom HitCollector and passing that into the search, but when doing the boolean search "foo AND bar", execution never reaches my HitCollector as the ConjunctionScorer filters out all of the results from the sub-queries before getting there.
EDIT:
Whether or not a user can see a property is not based on the property itself, but on the value of the property. I cannot therefore put the extra security conditions into the query upfront as I don't know the value to filter by.
As an example:
+---------+------------+------------+
| Article | Property 1 | Property 2 |
+---------+------------+------------+
| A | X | J |
| B | Y | K |
| C | Z | L |
+---------+------------+------------+
If a user can see everything, then searching for "B and Y" will return a single search result for article B.
If another user cannot see a property if its value contains Y, then searching for "B and Y" will return no hits.
I have no way of knowing what values a user can and cannot see upfront. They only way to tell is to perform the security check (currently done at the time of filtering a hit from a field in the document), which I obviously cannot do for every possible data value for each user.

Having now implemented this (after a lot of head-scratching and stepping through Lucene searches), I thought I'd post back on how I achieved it.
Because I am interested in all of the results (ie. not a page at a time), I can avoid using the Hits object (which has been deprecated in later versions of Lucene anyway). This means I can do my own hit collection using the Search(Weight, Filter, HitCollector) method of IndexSearcher, iterating over all possible results and combining document hits as appropriate. To do this, I had to hook into Lucene's querying mechanism, but only when AND and NOT clauses are present. This is achieved by:
Creating a custom QueryParser and overriding GetBooleanQuery(ArrayList, bool) to return my own implementation.
Creating a custom BooleanQuery (returned from the custom QueryParser) and overriding CreateWeight(Searcher) to return my own implementation.
Creating a custom Weight (returned from the custom BooleanQuery) and overriding Scorer(IndexReader) to return my own implementation.
Creating a custom BooleanScorer2 (returned from the custom Weight) and overriding the Score(HitCollector) method. This is what deals with the custom logic.
This might seem like a lot of classes, but most of them derive from a Lucene class and just override a single method.
The implementation of the Score(HitCollector) method in the custom BooleanScorer2 class now has the responsibility of doing the custom logic. If there are no required sub-scorers, the scoring can be passed to the base Score method and run as normal. If there are required sub-scorers, it means there was a NOT or an AND clause in the query. In this case, the special combination logic mentioned in the question comes into play. I have a class called ConjunctionScorer that does this (this is not related to the ConjunctionScorer in Lucene).
The ConjunctionScorer takes a list of scorers and iterates over them. For each one, I extract the hits and their scores (using the Doc() and Score() methods) and create my own search hits collection containing only those hits that the current user can see after performing the relevant security checks. If a hit has already been found by another scorer, I combine them together (using the mean of their scores for their new score). If a hit is from a prohibited scorer, I remove the hit if it was already found.
At the end of all of this, I set the hits onto the HitCollector passed into the BooleanScorer2.Score(HitCollector) method. This is a custom HitCollector that I passed into the IndexSearcher.Search(Query, HitCollector) method to originally perform the search. When this method returns, my custom HitCollector now contains my search results combined together as I wanted.
Hopefully this information will be useful to someone else faced with the same problem. It sounds like a lot of effort, but it is actually pretty trivial. Most of the work is done in combining the hits together in the ConjunctionScorer. Note that this is for Lucene v2.3.2, and may be different in later versions.

Review just another way
This security check is proprietary and
cannot be done using the typical
mechanism of storing a role in the
index alongside the other fields in
the document.
What about checking the permission of property on query building stage? So if property explicitly hidden from user avoid including it to result tree

Related

Cocoa Scripting: Deletion of elements in a loop getting out of sync

While adding scriptability to my Mac program, I am struggling with the common programming problem of deleting items from an indexed array where the item indexes shift due to removal of items.
Let's say my app maintains a data store in which objects of type "Person" are stored. In the sdef, I've define the Cocoa Key allPersons to access these elements. My app declares an NSArray *allPersons.
That far, it works well. E.g, this script works well:
repeat with p in every person
get name of p
end repeat
The problem starts when I want to support deletion of items, like this:
repeat with p in (get every person)
delete p
end repeat
(I realize that I could just write "delete every person", which works fine, but I want to show how "repeat" makes things more complicated).
This does not work because AppleScript keep using the original item numbers to reference the items even after deleting some of them, which naturally shifts the items and their numbering.
So, considering we have 3 Persons, "Adam", "Bonny" and "Clyde", this will happen:
get every person
--> {person 1, person 2, person 3}
delete person 1
delete person 2
delete person 3
--> error number -1719 from person 3
After deleting item 1 (Adam), the other items get renumbered to item 1 and 2. The second iteration deletes item 2 (which is now Clyde), and the third iteration attempts to delete item 3, which doesn't exist any more at that point.
How do I solve this?
Can I force the scripting engine to not address the items by their index number but instead by their unique ID so that this won't happen?
It's not your ObjC code, it's your misunderstanding of how repeat with VAR in EXPR loops work. (Not really your fault either: they're 1. counterintuitive, and 2. poorly explained.) When it first encounters your repeat statement, AppleScript sends your app a count event to get the number of items specified by EXPR, which in this case is an object specifier (query) that identifies all of the person elements in whatever. It then uses that information to generate its own sequence of by-index object specifiers, counting from 1 up to the result of the aforementioned count:
person 1 of whatever
person 2 of whatever
...
person N of whatever
What you need to realize is that an object specifier is a first-class query, not an object pointer (not that Apple tell you this either): it describes a request, not an object. Ignore the purloined jargon: Apple event IPC's nearest living relatives are RDBMSes, not Cocoa or SOAP or any of the OO messaging crud that modern developers so fixate on as The One True Way To Do... well, EVERYTHING.
It's only when that query is sent to your application in an Apple event that it's evaluated against the relational graph your Apple event IPC View-Controller – aka "Apple Event Object Model" – presents as an idealized, user-friendly representation of your Model's user date that it actually resolves to a specific Model object, or objects, with which the event handler should perform the requested operation.
Thus, when the delete command in your repeat loop tells your app to delete person 1 of whatever, all your remaining elements move down by one. But on the next iteration the repeat loop still generates the object specifier person 2 of whatever, which your script then sends off to your app, which resolves it to the second item in the collection – which was originally the third item, of course, until you shifted them all about.
Or, to borrow a phrase:
Nothing in AppleScript makes sense except in light of relational queries.
..
In fact, Apple events' query-based approach it actually makes a lot of sense considering it was originally designed to be efficient over very high-latency connections (i.e. System 7's abysmally inefficient process switcher), allowing a single Apple event carrying one or more complex queries to manipulate many objects at once. It's even quite elegant [when it works right], but is quite undone by idiots at Cupertino who think the best way to make programmers not hate the technology is to lie even harder about how it actually works.
So here, I suggest you go read this, which is not the best explanation either but still a damn sight better than anything you'll get from those muppets. And this, by its original designer that explains a lot of the rationale for creating a high-level coarse-grained query-based IPC system instead of the usual low-level fine-grained OO message passing crap.
Oh, and once you've done that, you might want to consider try running this instead:
delete every person whose name is "bob"
which is pretty much the whole point of creating a thick declarative-y abstraction that does all the work so the user doesn't have to.
And when nothing but an imperative client-side loop will do, you either want to get a list of by-ID object specifiers (which are the closest things to safe, persistent pointers that AEOMs can do) from the app first and then iterate over that, or at least use your own iterator loop that counts over elements in reverse:
repeat with i from (count every person) to 1 by -1
tell person i
..
end tell
end repeat
so that, assuming it's iterating over an ordered array on the server side, will delete from last to first, and so avoid the embarrassing off-by-N errors of your original script.
HTH
re: "If you want your scripable elements to be deletable, make sure you use NSUniqueIDSpecifiers to identify them."
Yes, Apple recommends using formUniqueId or formName for object specifiers, but you can't always do that. For instance, in the Text Suite, you really only have indexing to work with; e.g. character 1, word 3, paragraph 7, etc. You don't have unique IDs for text elements. In addition to deletion, ordering can be affected by other Standard Suite commands: open, close, duplicate, make, and move.
The app implementer is a programmer, but so is the scripter. So it is reasonable to expect the scripter to solve some problems themselves. For instance, if the app has 5 persons, and the scripter wants to delete persons 2 and 4, they can easily do so even with indexed deletion:
delete person 4
delete person 2
Deleting from the end of an ordered list forward solves the problem. AS also supports negative indexes, which can be used for the same purpose:
delete person -2
delete person -4
The key to solving this lies in implementing the objectSpecifier method correctly so that it does return an NSUniqueIDSpecifier.
My code did so far only return an index specifier and that was wrong for this purpose. I guess that had I posted my code (which is, unfortunately, too complex for that), someone may have noticed my mistake.
So, I guess the rule is: If you want your scripable elements to be deletable, make sure you use NSUniqueIDSpecifiers to identify them. For read-only element arrays, using an NSIndexSpecifier is (probably) safe, though, if your element array has persistent ordering behavior.
Update
As #foo points out, it's also important that the repeat command fetches the references to the items by using … in (get every person) and not just … in every person, because only the former leads to addressing the items by their id whereas the latter keeps indexing them as item N.

Best practice to keep RSS feeds unique in sql database

I am working on a project which shows rss feeds from different sites.
I keep them in the database, every 3 hours my program fetches and inserts them into sql database.
I want unique records for providers not to show duplicate content.
But problem is some providers do not give GUID field, and some others gives GUID field but not pubdate.. And some others does not even give GUID or PubDate just title and link.
So to keep rss feeds uniqe in sql server what would be the best way?
Should I check for first guid, then pubbdate, then link, then title? Will it be to good practice to compare link fields in SQL to check uniqueness?
Thanks.
I would develop a routine that takes certain key parameters like the title, source and body and then combines them to create a CRC hash. Then store the hash as an attribute with the feed and check for a matching hash before adding a new feed.
I'm not sure what your environment contraints are but here is an example for calculating CRC-32 in C#: http://damieng.com/blog/2006/08/08/calculating_crc32_in_c_and_net
Currently, this is what I am doing
# If we have a GUID in the feed item, use it as the feed_item_id else use link
# http://www.詹姆斯.com/blog/2006/08/rss-dup-detection
def build_feed_item_id(entry):
guid = trim(entry.get('id', ''))
if len(guid):
feed_item_id = guid
else:
feed_item_id = trim(entry.get('link', ''))
return hashlib.md5(feed_item_id.encode(encoding)).hexdigest()
It is based on the reasoning mentioned in the blog post linked in the snippet which I ll reference here in case the post gets taken down
RSS 2.0 has a guid element that fits the bill perfectly, but it’s not
a required element and many feeds don’t use it.
I can’t say for sure what algorithms applications are using, but after
running 150 tests on more than 20 different aggregators, I think have
a fair idea how many of them work.
As you would expect, for most the guid is considered the key element
for determining duplicates. This is pretty straightforward. If two
items have the same guid they are considered duplicates; if their
guids differ then they are considered different.
If a feed doesn’t contain guids, though, aggregators will most likely
resort to one of three general strategies – all of which involve the
link element in some way.
Technique 1
Guid must be unique
If a post doesnt have guid, consider link, title, description or any combination of them to get a unique hash
Technique 2
Link must be unique
If both link and guid are missing, check other elements such as title or description
Technique 3
Combination of link + title or link + description must be unique
The most obvious recommendation is that you should always include
guids in your feeds.
In addition, I would recommend you also include a unique link element
for each item in your feed, to allow for aggregators that don’t handle
guids very well. No two items should ever have the same link element,
and ideally a link should never change (if you do update a link, be
aware that it could show up as a new item for some aggregators).
Finally, although this is not essential, it is advisable that you
refrain from updating your article titles if at all possible. There
are at least two aggregators that will consider an entry with an
altered title to be a completely new post – somewhat annoying to
readers when all you’ve done is make a spelling correction in your
title.

django objects...values() select only some fields

I'm optimizing the memory load (~2GB, offline accounting and analysis routine) of this line:
l2 = Photograph.objects.filter(**(movie.get_selectors())).values()
Is there a way to convince django to skip certain columns when fetching values()?
Specifically, the routine obtains all rows of the table matching certain criteria (db is optimized and performs it very quickly), but it is a bit too much for python to handle - there is a long string referenced in each row, storing the urls for thumbnails.
I only really need three fields from each row, but, if all the fields are included, it suddenly consumes about 5kB/row which sadly pushes the RAM to the limit.
The values(*fields) function allows you to specify which fields you want.
Check out the QuerySet method, only. When you declare that you only want certain fields to be loaded immediately, the QuerySet manager will not pull in the other fields in your object, till you try to access them.
If you have to deal with ForeignKeys, that must also be pre-fetched, then also check out select_related
The two links above to the Django documentation have good examples, that should clarify their use.
Take a look at Django Debug Toolbar it comes with a debugsqlshell management command that allows you to see the SQL queries being generated, along with the time taken, as you play around with your models on a django/python shell.

Bind Top 5 Values of a To-Many Core Data Relationship to Text Fields

I am making an application that represents a cell phone bill using Core Data. I have three entities: Bill, Line, and Calls. Bills can have many lines, and lines can have many calls. All of this is set up with relationships. Right now, I have a table view that displays all of the bills. When you double click on a bill, a sheet comes down with a popup box that lists all of the lines on the bill. Below the popup box is a box that has many labels that display various information about that line. Below that information I want to list the top 5 numbers called by that line in that month. Lines has a to-many relationship with Calls, which has two fields, number and minutes. I have all of the calls for the selected line loaded into an NSArrayController with a sort descriptor that properly arranges the values. How do I populate 5 labels with the top 5 values of this array controller?
EDIT: The array of calls is already unique, when I gather the data, I combine all the individual calls into total minutes per number for each month. I just need to sort and display the first 5 records of this combined array.
I may be wrong (and really hope I am), but it looks like you'll need to use brute force on this one. There are no set / array operators that can help, nor does NSPredicate appear to help.
I think this is actually a bit tricky and it looks like you'll have to do some coding. The Core Data Programming Guide says:
If you execute a fetch directly, you
should typically not add
Objective-C-based predicates or sort
descriptors to the fetch request.
Instead you should apply these to the
results of the fetch. If you use an
array controller, you may need to
subclass NSArrayController so you can
have it not pass the sort descriptors
to the persistent store and instead do
the sorting after your data has been
fetched.
I think this applies to your case because it's important to consider whether sorting or filtering takes place first in a fetch request (when the fetch requests predicate and sort descriptors are set). This is because you'll be tempted to use the #distinctUnionOfObjects set/array operator. If the list is collapsed to uniques before sorting, it won't help. If it's applied after sorting, you can just set the fetch request's limit to 5 and there're your results.
Given the documentation, I don't know that this is how it will work. Also, in this case, it might be easier to avoid NSArrayController for this particular problem and just use NSTableViewDataSource protocol, but that's beyond the scope of this Q&A.
So, here's one way to do it:
Create a predicate to filter for the
selected bill's line items.*
Create a sort descriptor to sort the
line items by their telephone number
(which are hopefully in a
standardized format internally, else
trouble awaits) via #"call.number" in your case.
Create a fetch request for the line
item entity, with the predicate and
sort descriptors then execute it**.
With those sorted results, it would be nice if you could collapse and "unique" them easily, and again, you'll be tempted to use #distinctUnionOfObjects. Unfortunately, set/array operators won't be any help here (you can't use them directly on NSArray/NSMutableArray or NSSet/NSMutableSet instances). Brute force it is, then.
I'd create a topFive array and loop through the results, adding the number to topFive if it's not there already, until topFive has 5 items or until I'm out of results.
Displaying it in your UI (using Bindings or not) is, as I said, beyond the scope of this Q&A, so I'll leave it there. I'd LOVE to hear if there's a better way to do this - it's definitely one of those "looks like it should be easy but it's not" kind of things. :-)
*You could also use #unionOfObjects in your key path during the fetch itself to get the numbers of the calls of the line items of the selected bill, which would probably be more efficient a fetch, but I'm getting tired of typing, and you get the idea. ;-)
**In practice I'd probably limit the fetch request to something reasonable - some bills (especially for businesses and teenagers) can be quite large.

Getting lucene to return only unique threads (indexing both threads and posts)

I have a StackOverflow-like system where content is organised into threads, each thread having content of its own (the question body / text), and posts / replies.
I'm producing the ability to search this content via Lucene, and if possible I have decided I would like to index individual posts, (it makes the index easier to update, and means I have more control and ability to tweak the results), rather than index entire threads. The problem I have however is that I want the search to display a list of threads, rather than a list of posts.
How can I get Lucene to return only unique threads as results, while also searching the content of the posts?
Each document can have a "threadId" field. After running a search, you can loop through your result set and return all the unique threadId's.
The tricky part is specifying how many results you want to return. If you want to show say, 10 results on your results page, you'll probably need Lucene to return 10 + m results, since a certain percentage of the return set will be de-duped out, because they are posts belonging to the same thread. You'll need to incorporate some extra logic that will run another Lucene search if the deduped set is < 10.
This is what the Nutch project does when collapsing multiple search results that belong to the same domain.
When you index the threads, you should break each thread into postings and make each post a Document with a field containing a unique id identifying the thread to which it belongs.
When you do the search implementation, I would recommend using lucene 2.9 or later, which enables you to use a Collector. Collectors lets you preprocess the retrieved documents and thereby you'll be able to group together posts that originate from the same thread-id.
Just for completenes, latest Lucene versions (from 3.2 onwards) support a grouping API that is very useful for this kind of use-cases:
http://lucene.apache.org/java/3_2_0/api/contrib-grouping/org/apache/lucene/search/grouping/package-summary.html