How does enterprise search display results for the user and hide unauthorized results? (permissions)

I am looking to understand how enterprise search solutions tackle the issue of user permissions.
My question is about displaying search results to users. The naive approach would display all results, and if the user clicked a document he is not authorized to see, he would simply fail to open it. However, it is forbidden even to display a document's title or excerpt if the user does not have permission to read it. So do the various enterprise search engines:
1. index each document together with its ACL, or
2. index all documents with no permission info, but check each link in every search result to see whether the querying user has permission to view it?
Option #2 makes more sense to me, but also seems much slower than option #1.
Option #1 suffers from the need to constantly propagate permission changes to the indexed documents.
I am looking to understand the common approach among the solutions on the market today. Is there a third option?

I'm surprised to see that this 5-year-old question hasn't received any answers, as I think it's quite a common and important problem in enterprise search.
As outlined in the question there are two common approaches to deal with document-level security:
early-binding security: indexing ACLs along with the content, and
late-binding security: handling security at query time, by filtering protected results out
Handling security on the content side only is never recommended, because by that point confidential information might already have been revealed (e.g. the title or a preview of a document in the search results).
The advantage of implementing security with a late-binding approach is that it's very flexible: there is no need to re-index content when ACLs change. The biggest drawback, however, is that confidential information might be leaked via facet values, and it's not possible to retrieve and display correct facet counts. It is also more difficult to properly populate the result list and handle pagination. Last but not least, this approach can significantly slow down queries.
The advantage of implementing security with an early-binding approach is that it addresses all of the above disadvantages, at the price of re-indexing the content as soon as ACLs change. Leaks are still possible, e.g. when a group membership or ACL has just changed and isn't yet reflected in the search index. To close this gap, the early-binding and late-binding approaches are often combined.
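To make the early-binding variant concrete, here is a minimal sketch in Java with Lucene; it is not taken from any particular product, and the "acl" field name and the group-based model are assumptions. Each document is indexed with one "acl" term per group allowed to read it, and every user query is wrapped in a non-scoring filter over the querying user's groups:

import java.util.List;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class AclQueryWrapper {
    // Restricts userQuery to documents whose "acl" field contains at least
    // one of the user's groups. Requires each document to be indexed with
    // one "acl" StringField per authorised group (the early-binding step).
    public static Query secure(Query userQuery, List<String> userGroups) {
        BooleanQuery.Builder acl = new BooleanQuery.Builder();
        for (String group : userGroups) {
            acl.add(new TermQuery(new Term("acl", group)), Occur.SHOULD);
        }
        return new BooleanQuery.Builder()
                .add(userQuery, Occur.MUST)     // scoring comes from the user query
                .add(acl.build(), Occur.FILTER) // non-scoring security filter
                .build();
    }
}

Because the filter is evaluated inside the engine, facet counts, pagination and excerpts all see only authorised documents; the trade-off, as described above, is re-indexing whenever ACLs change.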
Last but not least, there might be a third option, depending on the enterprise search platform you are using: Attivio's Active Security is based on query-time joins, which allows security information to be indexed independently of the document itself; at query time the two documents are merged, ensuring that only authorised content makes it into the search results.

Related

How to programmatically retrieve information from an organisation directory?

Background: my organisation has a global directory. Active Directory only stores the employee numbers and employee names; no information about the role title is stored in AD. (I have already built an LDAP query to retrieve all the information from AD; retrieving the role title is my issue.)
On our intranet there is a global directory which shows the role title, so it is obvious to me that the role title is stored in some other database (not AD).
I want to write a script (not sure what to use) to pump a list of employee numbers into the search box and retrieve the role titles.
Is this possible? I've never scripted anything to retrieve information from results coming from a website/intranet etc. Any guidance will be appreciated. LDAP queries unfortunately were not the right approach for me, as the organisation does not store the role title in AD. (I have thousands of employees to look up and I don't think it's practical to search for them individually.)
Gemmo
I take it you only have access to the front end of this system. It is not ideal, but the only way would be to use web scraping, that is, parsing the HTML from the web page.
This method is time consuming to put together and very prone to breaking since it is entirely dependent on how the data is presented on the page. If anything changes, your web scraping can break.
But if you only need to do it once, it might be worth it. A tool like this one could help you do it. (That is just the first one I found online; there are others, just search.)
But since we can't access this site, we can't really help any more than this.
Web scraping really is the absolute last resort; any other way to get the data is better. Maybe you could even ask the administrators of that system to give you a one-time report of just the data you need. As long as they're willing, there's no reason they couldn't give you an Excel spreadsheet with the data.
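If you do end up scraping, a minimal sketch in Java using the jsoup library might look like the following. The URL pattern and the CSS selector are pure assumptions; inspect the real intranet page to find the actual ones:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class DirectoryScraper {
    // Hypothetical: assumes the directory accepts the employee number as a
    // query parameter and renders the role title in an element with class
    // "role-title". Adjust the URL and selector to match the real page.
    static String lookupRoleTitle(String employeeNumber) throws Exception {
        Document page = Jsoup.connect(
                "http://intranet.example.com/directory?emp=" + employeeNumber).get();
        return page.select(".role-title").text();
    }

    public static void main(String[] args) throws Exception {
        for (String emp : new String[] {"10001", "10002"}) {
            System.out.println(emp + "\t" + lookupRoleTitle(emp));
        }
    }
}

Add a short sleep between requests so you don't hammer the intranet server while looping over thousands of employee numbers.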

Updating the friendly name of a Liferay page through SQL

Is there a way to update a Liferay site page's friendly name through a SQL script?
We generally do this in the Control Panel as an admin user.
While #steven35's answer might do the job, you're hitting a pet peeve of mine. On a different level: you're doing it right if you go through the Control Panel or through the API, and you should not think about ever writing to Liferay's database directly. It might work for the moment, but it might also fail in unforeseen ways, sometimes long after your update.
There have been enough examples of this happening. If you're changing data while Liferay is running, the cache will not be updated. If these values are also indexed in the search index, they won't be updated there either, and later lookups might not find the correct page until you reindex everything. The same value might be stored somewhere else, or in translated form. Numerous conditions can fail, and there's always one condition more than you expect and cater for. That one condition might break your neck.
Granted, the friendly name of a page might not fall into the most complex of these cases, but just don't get into the habit of writing to Liferay's database. Or, if you do, don't complain about future upgrades failing or requiring extra work because the database contains values that the API didn't expect. The problem is that during the next upgrade (if you do it in, say, one year) you'll long have forgotten that you manually changed data in the database, and you'll blame Liferay for the problems during your upgrade.
Changing data is exactly what the UI and the API are for.
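For illustration, a minimal sketch of the API route, using method names from the Liferay 6.x service layer (verify the exact signatures for your version):

import com.liferay.portal.model.Layout;
import com.liferay.portal.service.LayoutLocalServiceUtil;

public class FriendlyUrlUpdater {
    // Loads the page (layout) by its plid and updates it through the API,
    // so caches, the search index and any dependent data stay consistent.
    public static void rename(long plid, String newFriendlyUrl) throws Exception {
        Layout layout = LayoutLocalServiceUtil.getLayout(plid);
        layout.setFriendlyURL(newFriendlyUrl);
        LayoutLocalServiceUtil.updateLayout(layout);
    }
}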
Friendly URLs are stored in LayoutFriendlyURL.friendlyURL in your Liferay database, so the following query should work (note that the string literal belongs in single quotes):
UPDATE yourdatabase.LayoutFriendlyURL SET friendlyURL = '/newurl' WHERE layoutFriendlyURLId = 12345;
You will also need to update the Layout table accordingly to match the new friendly URL.

User Specific Lucene Search

I don't think this is a very obscure Lucene problem, but somehow I just don't seem to be able to find a good solution to it. I will use an example.
Let's say I am building a news article website. Registered users can bookmark articles that they are interested in. I want to allow users to search only within the articles they have bookmarked. For the sake of the example, let's also assume that a user can potentially bookmark thousands of articles, and that we have hundreds of thousands of users in our database. How do I build a scalable solution to this problem?
Thanks a lot!
This is a very typical Lucene problem, as Lucene does not support joins. More specifically, there's no first-class support, and you have to find your way around it. I can suggest a few options:
You could have a database with users, articles and bookmarks tables (the latter having foreign keys pointing to the first two). You would also have the articles indexed in Lucene. When running a search against the articles, you could apply a Lucene filter that excludes all articles not bookmarked by the current user (see the sketch after these options).
You could index both articles and bookmarks in Lucene, probably best in separate indices. You could then run one query for bookmarks (to retrieve which articles the current user has bookmarked) and a separate query for articles, using, as in the previous option, the results of the first query to exclude all articles not bookmarked by the current user.
I personally prefer option #1, as this is a classic relational structure and databases are designed for exactly this purpose. With option #2 you would have to modify both the user storage and the Lucene index when a user gets deleted.
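A minimal sketch of option #1's filtering step in Java. It assumes every article was indexed with a StringField named "articleId" (an assumption, not something from the question) and that the bookmarked ids have just been fetched from the bookmarks table. TermInSetQuery (Lucene 6+) avoids the 1024-clause limit a plain BooleanQuery would hit for a user with thousands of bookmarks:

import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermInSetQuery;
import org.apache.lucene.util.BytesRef;

public class BookmarkSearch {
    // Restricts the user's query to the articles they have bookmarked.
    public static Query forUser(Query userQuery, List<String> bookmarkedArticleIds) {
        Collection<BytesRef> terms = new ArrayList<>();
        for (String id : bookmarkedArticleIds) {
            terms.add(new BytesRef(id));
        }
        return new BooleanQuery.Builder()
                .add(userQuery, Occur.MUST)                                // scoring clause
                .add(new TermInSetQuery("articleId", terms), Occur.FILTER) // bookmark filter
                .build();
    }
}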

GetOrCreate in RavenDB, or a better alternative?

I have just started using RavenDB on a personal project and so far inserting, updating and querying have all been very easy to implement. However, I have come across a situation where I need a GetOrCreate method and I'm wondering what the best way to achieve this is.
Specifically I am integrating with OpenID and once authentication has taken place the user is redirected to my site. At this point I'd either like to retrieve their user record from Raven (by querying on the ClaimsIdentifier property) or create a new record. The user's ID is currently being set by Raven.
Obviously I can write this in two statements, but without some sort of transaction around the select and the create I could potentially end up with two user records in the database with the same claims identifier.
Is there any way to achieve this kind of functionality? Perhaps more importantly, do you think I'm going down the wrong path? I'm assuming that even if I could create a transaction, it would make scaling out to multiple servers difficult, and in any case could add a performance bottleneck.
Would a better approach be to have the Query and Create operations as separate statements, check for duplicates when the user is retrieved, and merge at that point? Or do something similar in a scheduled task?
I can't help but feel I'm missing something obvious here, so any advice on this problem would be greatly appreciated.
Note: while scaling out to multiple servers may seem unnecessary for a personal project, I'm using this as an evaluation of Raven before using it at work.
Dan, although RavenDB has support for transactions, I wouldn't go that way in your case. Instead, you could just use the user's ClaimsIdentifier as the user document's id, because those are guaranteed to be unique.
Alternatively, you can stay with user ids generated by Raven (HiLo, by the way) and use the new UniqueConstraintsBundle, which lets you mark certain properties as unique. Internally it will create an additional document that has the value of your unique property as its id.
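A minimal sketch of the first approach, shown here with the much newer RavenDB JVM client purely for illustration (the question's era would use the .NET client; the idea is the same). Because the claims identifier is the document id, a load by id tells you whether the user exists, and two concurrent creates converge on one document instead of producing duplicates; enable optimistic concurrency on the session if you'd rather have the second write fail:

import net.ravendb.client.documents.DocumentStore;
import net.ravendb.client.documents.session.IDocumentSession;

public class GetOrCreateUser {
    public static class User {
        public String id;   // set to "users/" + claims identifier
        public String name;
    }

    public static User getOrCreate(DocumentStore store, String claimsIdentifier) {
        String id = "users/" + claimsIdentifier; // natural key as document id
        try (IDocumentSession session = store.openSession()) {
            User user = session.load(User.class, id); // null if not stored yet
            if (user == null) {
                user = new User();
                user.id = id;
                session.store(user, id); // explicit id, no HiLo generation
                session.saveChanges();
            }
            return user;
        }
    }
}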

Automating WebTrends analysis

Every week I access server logs processed by WebTrends (for about 7 profiles) and copy ad clickthrough and visitor information into Excel spreadsheets. A lot of it is just accessing certain sections and finding the right title and then copying the unique visitor information.
I tried using WebTrends' built-in query tool, but it is really poorly done (it only offers a drag-and-drop system instead of a text-based one), and it limits the number of parameters and the length of queries. As far as I can tell, the tools in WebTrends are not suitable for my purpose of automating the entire web-metrics-gathering process.
I've gotten access to the raw server logs, but it seems redundant to parse them given that they are already being processed by WebTrends.
To me it seems very scriptable, but how would I go about doing that? Is screen-scraping an option?
I use ODBC for querying metrics and numbers out of WebTrends. We even fill a scorecard with all the key performance metrics.
It's in German, but maybe the idea helps you: http://www.web-scorecard.net/
Michael
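For reference, the ODBC route could look roughly like this from Java, via the old JDBC-ODBC bridge (which shipped with Java up to version 7, fitting the vintage of WebTrends installs that expose ODBC). The DSN, table and column names here are hypothetical; check which schema your WebTrends ODBC driver actually exposes for your profiles:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class WebTrendsOdbcPull {
    public static void main(String[] args) throws Exception {
        // "WebTrends" is a hypothetical ODBC DSN configured for the
        // WebTrends ODBC driver on this machine.
        Class.forName("sun.jdbc.odbc.JdbcOdbcDriver");
        try (Connection conn = DriverManager.getConnection("jdbc:odbc:WebTrends");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                     "SELECT PageTitle, UniqueVisitors FROM PagesReport")) {
            while (rs.next()) {
                // Tab-separated output pastes straight into Excel.
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}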
Which version of WebTrends are you using? Unless this is a very old install, there should be options to schedule these reports to be emailed to you, and also to bookmark queries. Let me know which version it is and I can make some recommendations.