I don't think this is a particularly obscure Lucene problem, but somehow I just can't seem to find a good solution to it. I'll use an example.
Let's say I am building a news article website. Registered users can bookmark articles they are interested in, and I want to allow a user to search only the articles he/she has bookmarked. For the sake of the example, let's also assume that a user can potentially bookmark thousands of articles, and that we have hundreds of thousands of users in our database. How do I build a scalable solution to this problem?
Thanks a lot!
This is a very typical Lucene problem, as Lucene does not support joins. More specifically, there is no first-class support for them, and you have to find your way around that. I can suggest a few approaches:
You could have a database with users, articles and bookmarks tables (the latter having foreign keys pointing to the first two). You would also have the articles indexed in Lucene. When running a search against the articles, you could write a Lucene Filter that excludes all articles not bookmarked by the current user.
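A minimal sketch of this option, assuming a Lucene 4.x-era API, an articleId field on the article documents, and a hypothetical BookmarkDao over the bookmarks table (none of these names come from your setup):

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.index.Term;
import org.apache.lucene.queries.TermsFilter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;

public class BookmarkedArticleSearch {

    // Hypothetical data-access object over the bookmarks table.
    public interface BookmarkDao {
        List<String> loadBookmarkedArticleIds(long userId);
    }

    public TopDocs search(IndexSearcher articleSearcher, Query query,
                          long userId, BookmarkDao dao) throws Exception {
        List<Term> terms = new ArrayList<Term>();
        for (String articleId : dao.loadBookmarkedArticleIds(userId)) {
            terms.add(new Term("articleId", articleId));
        }
        // The filter drops every article the user has not bookmarked;
        // it does not affect scoring, so ranking still comes from the query.
        TermsFilter onlyBookmarked = new TermsFilter(terms);
        return articleSearcher.search(query, onlyBookmarked, 50);
    }
}
```

Since bookmark lists can run into the thousands, it is worth caching the built filter per user and invalidating it when that user adds or removes a bookmark.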
You could index both articles and bookmarks in Lucene, ideally in separate indices. You could then run a query for bookmarks (to retrieve which articles the current user has bookmarked) and afterwards run a separate query for articles. As in the previous example, you could use the results of the first query to exclude all articles that are not bookmarked by the current user.
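Continuing the sketch above (same caveats, plus TermQuery and ScoreDoc from org.apache.lucene.search), the two-index variant could look like this, assuming each bookmark document stores userId and articleId fields:

```java
// First query: fetch the current user's bookmarks from the bookmarks index.
TopDocs bookmarks = bookmarkSearcher.search(
        new TermQuery(new Term("userId", String.valueOf(userId))), 10000);

// Turn the hits into terms for the article filter.
List<Term> terms = new ArrayList<Term>();
for (ScoreDoc hit : bookmarks.scoreDocs) {
    terms.add(new Term("articleId",
            bookmarkSearcher.doc(hit.doc).get("articleId")));
}

// Second query: search the articles, restricted to the bookmarked ones.
TopDocs results = articleSearcher.search(query, new TermsFilter(terms), 50);
```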
I personally prefer option #1, as this is a classical relational structure and databases are designed for exactly this purpose. With option #2 you would have to modify both the user storage and the Lucene index when a user gets deleted.
I'm new to Solr.
I have been looking into related-content recommendation engines to implement on my core PHP and MongoDB website. It's a music listening website, so I have added keywords for every piece of music (singer, music, lyrics) to MongoDB.
My question is about related-music recommendation using keywords: can Solr (the MoreLikeThis handler) recommend it?
Example keywords: bobby-singer, a-r-rehaman, shreya goshal
It should look for related keywords in this order of preference:
bobby-singer,a-r-rehaman,shreya goshal
bobby-singer,a-r-rehaman
bobby-singer,shreya goshal
a-r-rehaman,shreya goshal
bobby-singer
a-r-rehaman
shreya goshal
My keywords are already in MongoDB. I'm planning to work with the Apache Solr MoreLikeThis handler, or please recommend me a good recommendation engine.
Thanks
There are a couple of different things here.
First of all, you can use MLT (MoreLikeThis) to get Solr to bring you related documents, but it is not the only option.
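For example, a hedged SolrJ sketch; the core URL, the /mlt handler path and the keywords field are assumptions you would adapt to your schema:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class RelatedMusic {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr =
                new HttpSolrServer("http://localhost:8983/solr/music");
        SolrQuery q = new SolrQuery("id:song-123"); // the seed document
        q.setRequestHandler("/mlt");  // MoreLikeThis handler
        q.set("mlt.fl", "keywords");  // compute similarity on the keywords field
        q.set("mlt.mintf", 1);        // count terms that occur only once...
        q.set("mlt.mindf", 1);        // ...and terms that are rare in the corpus
        QueryResponse related = solr.query(q);
        System.out.println(related.getResults()); // the similar documents
    }
}
```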
I am wondering if you could also benefit from synonyms, so that on certain searches you get similar results that may satisfy the user.
Also, if you already have the list of relationships, you can build a small index against which you run an OR of your query to get potential related searches, or execute those related searches and get related results.
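For instance, such an OR query could look like this (hypothetical keywords field, Solr query syntax; multi-word terms need quotes):

```
q=keywords:("bobby-singer" OR "a-r-rehaman" OR "shreya goshal")
```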
Hope this helps
My requirement is like job sites, where a user can upload a document (PDF, text or Word) such as a resume/CV. All these documents then have to be searchable by a specific keyword or a combination of keywords, and they also have to be ranked by those keywords. I need to know which technology is good from a performance point of view when the number of files is huge and there is also a good number of requests for searching and indexing.
The website is built on SQL Server, so can I store those files in SQL Server? Would that be good in terms of performance?
Or can it be done with Lucene.NET alone, with the files stored in a single folder?
I think the best suggestion is to use Lucene.
You can save your documents as they are, under some unique path/file name, and use that path as the identifier when you index the documents. I am sure you can find a lot of similar examples if you search for Lucene.
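A minimal sketch of that idea in Java Lucene (recent API; Lucene.NET mirrors these classes almost one-to-one). The paths and field names are made up, and real PDF/Word files would first need text extraction, e.g. with Apache Tika:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.FSDirectory;

public class ResumeIndexer {
    public static void main(String[] args) throws IOException {
        try (IndexWriter writer = new IndexWriter(
                FSDirectory.open(Paths.get("resume-index")),
                new IndexWriterConfig(new StandardAnalyzer()))) {
            Path file = Paths.get("resumes/jane-doe.txt"); // made-up path
            Document doc = new Document();
            // The unique file path doubles as the document identifier.
            doc.add(new StringField("path", file.toString(), Field.Store.YES));
            // The extracted text is what gets searched and ranked.
            doc.add(new TextField("contents",
                    new String(Files.readAllBytes(file)), Field.Store.NO));
            // Keying updates on the path re-indexes instead of duplicating.
            writer.updateDocument(new Term("path", file.toString()), doc);
        }
    }
}
```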
I have been using Sitecore query and FAST query for some sections of the website, but with growing content these queries have become slow, and I'd like to implement Lucene querying for content to speed things up.
I am wondering if I can just use the System index instead of having to setup a separate index. Does Sitecore by default index all content in the content editor? Is this a good approach or should I just create my own index?
(I'm going to assume you're using Sitecore 6.4 to 6.6.)
As with everything, it depends. Sitecore keeps an index of all the Sitecore items in its system index, and you are welcome to use that. Sometimes, though, you may want a more specialised or restricted set of items to be indexed, for example items based on a certain template, or with a checkbox field included (the system index by default only indexes text fields).
Setting up your own search index is pretty easy. It does require some fiddling with the web.config, though (and I'd recommend adding it as an include file).
Create a new <index> node with its own id; this defines the name of the collection and the folder it will go into. (You can check it's working by looking for the directory in the /data/indexes directory of your installation.)
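Something along these lines (a from-memory sketch of the Sitecore 6.x search configuration, so treat the id and folder name as examples):

```xml
<!-- goes under <sitecore><search><configuration><indexes> in web.config -->
<index id="news" type="Sitecore.Search.Index, Sitecore.Kernel">
  <param desc="name">$(id)</param>
  <param desc="folder">news</param> <!-- shows up under /data/indexes -->
  <Analyzer ref="search/analyzer" />
  <locations hint="list:AddCrawler">
    <!-- the crawler is added in the next step -->
  </locations>
</index>
```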
Next, you can tell the crawler which database to look at (most likely master if you want unpublished content to be indexed, or web for published content) and where to start the crawl from (in this example I am indexing only the news section). You can tag and boost the items, and tell it whether to IndexAllFields (otherwise it will only index fields it understands as text: rich-text, multi-line text, text, etc.).
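Roughly like this, nested inside the <locations> element above (again from memory; the root path is just an example):

```xml
<news type="Sitecore.Search.Crawlers.DatabaseCrawler, Sitecore.Kernel">
  <Database>master</Database>              <!-- or web -->
  <Root>/sitecore/content/Home/News</Root> <!-- where the crawl starts -->
  <Tags>news</Tags>
  <Boost>1.0</Boost>
  <IndexAllFields>true</IndexAllFields>
</news>
```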
Finally, you can tell the indexer which template types to include or exclude.
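For example (shown in the style used by the Advanced Database Crawler mentioned below; the stock crawler's element names may differ slightly, and the GUIDs are placeholders):

```xml
<include hint="list:IncludeTemplate">
  <newsArticle>{00000000-0000-0000-0000-000000000000}</newsArticle>
</include>
<exclude hint="list:ExcludeTemplate">
  <folder>{11111111-1111-1111-1111-111111111111}</folder>
</exclude>
```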
The indexer works by subscribing to item events within Sitecore, so every time an item is changed, moved or deleted, the index is updated automatically. Obviously, if you are indexing the web database, the items will need to have been published.
More in-depth info on the query syntax and indexing can be found on SDN.
The search syntax and API is much improved in 6.4/6.5 but if you want to add extra kick then my colleague Alex Shyba's Advanced Database Crawler is worth checking out too.
Hope this helps :D
You will want to implement your own index, for the same reason that you are seeing things slow down when there is a lot of content: indexes slow down when there is a lot of content in them as well.
I prefer targeted indexes meant specifically to drive the functionality I need, containing only the data that is required. This allows for smaller and more efficient index usage by your components.
Additionally, you probably want to look into the AdvancedDatabaseCrawler put together by Alex Shyba. There are a few blogs out there with some great posts on implementing this Lucene indexing module.
A separate index is always a wise decision; you can keep it light. In big environments the system index can grow to gigabytes.
You can exclude the content from the index, as you will only be using it to perform lookups, not to show content from the index.
Finally: the system index is for the master database, while you'll be querying the web database, possibly on a content delivery server.
My company has a collection of about 3500 highly structured Word docs (and growing) that contain multiple-choice questions from one of our products. I've been tasked with writing a front-end that will let people find and use these in other products. There is some metadata on them that would go in a database, but we'd also like full-text search.
I've been given the option of building the front-end in either MS Access (because I know it well) or Rails (because I'm supposed to be learning it). I've done one Rails app and would prefer to continue with it.
Rather than load the documents into the database, I thought it made more sense to just have them on the file system and store paths to them in the database.
I know I can use Ferret to search database fields, but what's the best way to add full-text searching to a Rails app for a pile of files on the filesystem?
I am not sure if there are any gems that would search the Word files for you. Although you have mentioned that you do not want to load the entire documents into the database, you might look into copying just the text contents of each file into your DB. You can use the win32ole library for this (http://ruby-doc.org/stdlib/libdoc/win32ole/rdoc/classes/WIN32OLE.html). If I had to implement this, I would run a cron job every night (or at whatever frequency seems fit) to refresh the database content with the changes in the Word files.
I am looking to understand how enterprise search solutions tackle the issue of user permissions.
My question is about displaying search results to users. The naive approach would display the search results, and if the user then clicks a document he is not authorized to see, he simply fails to open it. However, it is forbidden even to display a document's title or excerpt if the user does not have permission to read it. So do the various enterprise search engines:
index each document together with its ACL?
index all documents with no permission info, but check every link in each search result to see whether the querying user has permission to view it?
Option #2 makes more sense to me, but also seems much slower than option #1.
Option #1 suffers from the need to constantly update the changes in permissions on the indexed documents.
I am looking to understand what is the common approach in the existing solutions in the market today. Is there a third option?
I'm surprised to see that this 5-year-old question hasn't got any answers, as I think it's quite a common and important problem in enterprise search.
As outlined in the question there are two common approaches to deal with document-level security:
early-binding security: indexing ACLs along with the content, and
late-binding security: handling security at query time, by filtering out protected results
Handling security on the content side only is never recommended, as at that point confidential information might already have been revealed (e.g. the title or a preview of a document in the search results).
The advantage of implementing security with a late-binding approach is that it is very flexible: there is no need to re-index content when ACLs change. The biggest drawback, however, is that confidential information might be leaked via facet values, and it is not possible to retrieve and display correct facet counts. It also makes it more difficult to properly populate the result list and handle pagination. Last but not least, this approach can significantly slow down performance.
The advantage of implementing security with an early-binding approach is that it addresses all of the above disadvantages, at the price of re-indexing the content as soon as ACLs change. However, leaks are still possible, e.g. when a group membership or ACL has just changed and is not yet reflected in the search index. To close this gap, the early-binding and late-binding approaches are often combined.
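To make the early-binding option concrete, here is a minimal Lucene-flavoured sketch (recent Lucene API; the acl field and the source of the user's groups are assumptions). Each document is indexed with the groups allowed to read it, and every user query is ANDed with the current user's memberships:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public final class AclQueries {

    // userGroups would come from your directory service (an assumption).
    public static Query restrictToAcls(Query userQuery,
                                       Iterable<String> userGroups) {
        BooleanQuery.Builder allowed = new BooleanQuery.Builder();
        for (String group : userGroups) {
            // SHOULD: membership in at least one of the document's
            // ACL groups is enough to see it.
            allowed.add(new TermQuery(new Term("acl", group)),
                        BooleanClause.Occur.SHOULD);
        }
        return new BooleanQuery.Builder()
                .add(userQuery, BooleanClause.Occur.MUST)
                // FILTER: restricts matches without influencing ranking.
                .add(allowed.build(), BooleanClause.Occur.FILTER)
                .build();
    }
}
```

Because the ACL restriction is part of the query itself, facet counts and pagination stay correct, which is exactly the advantage over pure late binding described above. A user with no groups matches nothing, so the sketch fails closed.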
Last but not least, there might be a third option, depending on the enterprise search platform you are using: Attivio's Active Security is based on query-time joins, which allow security information to be indexed independently of the document itself; at query time the two documents are merged to ensure that only authorised content makes it into the search results.