How to work with Sitecore Content Search and Page Editor - Lucene

I rewrote a news application (overview + detail) from fast query to Content Search. The performance gains were enormous, but I see some possible limitations which I don't know how to handle in conjunction with the Page Editor.
When I use a fast query, I get an instance of a news item even if there isn't a language version yet. In Lucene, I cannot find a result (because I filter by language), and therefore the news detail is missing from the overview in that particular language.
EDIT for question 1
Let's assume we have a solution with two languages (English and German). I have an item which currently exists only in a single English version. When I'm on an overview page in German and want to find this item with a fast query (the exact query does not matter), I will get this item back. In the wrong version, but I get it back. Now if I'm in the Page Editor, I can go to this item and edit it in German, even though there is no German version yet. The first click on the save button will create that first version for me.
When I want to find the item through Content Search, my natural way of querying is to apply the same filters (probably by template, path, and whatever else) AND a filter on the Language property of the SearchResultItem, since I don't want multiple results for the same item. But since there is only an English version yet, the index only contains a single result in English, and because of the language filter I cannot find this item to call GetItem on it.
Since initially writing this question, I see two approaches to get around this:
a) Remove the language filter in Page Editor mode and filter it afterwards somehow (I don't know whether I'm able to get an item which I can edit in the Page Editor in the German language); see the sketch after this list.
b) Create a Page Editor-specific master search index which holds an entry for every item in every language of the solution, even if there is no version in that specific language yet. I could add a computed field to indicate whether a result is a real item version or not, to filter on at some point if necessary. Probably I'm able to call GetItem on this and enable Page Editor capabilities.
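To make approach a) concrete, here is a minimal sketch of what I mean (the index names and template filter are placeholders, and I haven't verified that the items returned this way are editable in the Page Editor):

```csharp
using System.Linq;
using Sitecore.ContentSearch;
using Sitecore.ContentSearch.SearchTypes;

// Pick the index matching the current mode: master for Page Editor/Preview, web for the live site.
var indexName = Sitecore.Context.PageMode.IsNormal ? "sitecore_web_index" : "sitecore_master_index";

using (var context = ContentSearchManager.GetIndex(indexName).CreateSearchContext())
{
    var query = context.GetQueryable<SearchResultItem>()
        .Where(i => i.TemplateName == "News");   // placeholder for template/path filters

    // Approach a): only apply the language filter outside the Page Editor,
    // and de-duplicate versions in code when editing.
    if (Sitecore.Context.PageMode.IsNormal)
    {
        query = query.Where(i => i.Language == Sitecore.Context.Language.Name);
    }

    var results = query.ToList();
}
```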
With Lucene, I cannot find the detail item in a (currently) non-existent language version when I want to resolve it by its display name (because there is no language version indexed yet).
EDIT for question 2
This goes hand in hand with question 1.
In relation to workflows, I see possible struggles with which version gets indexed. Is the first version in the index before you approve it? Otherwise the overview has no chance of showing this item until it has been approved in the Content Editor.
Has anybody used Content Search for Page Editor-relevant actions and has some advice on how to get around such problems?

I've had the same issues with Sitecore 7, and while I don't have the ultimate solution for you, hopefully you will find the following information helpful.
I can't word my answer any better than this post: http://thegrumpycoder.com/post/75524076869/sitecore-content-search-beware-of-the-context
With Sitecore ContentSearch you are choosing which search index you would like to use. You are likely using sitecore_master_index when in Page Editor / Preview mode, and sitecore_web_index on the published site. As the web database only has one version of each item, you don't need to worry about there being multiple versions in the index. However, sitecore_master_index will by default index all versions of an item in all languages. You can then potentially see items showing up multiple times in your listing components if you're not careful.
Sitecore 7 has a field "_latestversion" which you can add to all your queries (a query sketch follows the list below), but it isn't reliable for a couple of reasons:
The latest version isn't necessarily the correct one, taking into account things like publishing restrictions and what date you have selected in the "Experience" view.
Due to a bug I have often been able to cause there to be more than one version of an item in the index where _latestversion is 1. Not after a complete rebuild, but after an edit or two. I saw this in Sitecore 7.0 and I'm not sure if it's been fixed yet.
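For reference, this is roughly how that filter looks in a query (a sketch only; the string indexer is just one way to address the field):

```csharp
using System.Linq;
using Sitecore.ContentSearch;
using Sitecore.ContentSearch.SearchTypes;

using (var context = ContentSearchManager.GetIndex("sitecore_master_index").CreateSearchContext())
{
    // "_latestversion" is "1" only on the newest version of each item per language.
    var latestOnly = context.GetQueryable<SearchResultItem>()
        .Where(i => i["_latestversion"] == "1")
        .ToList();
}
```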
Read http://www.sitecore.net/Learn/Blogs/Technical-Blogs/Sitecore-7-Development-Team/Posts/2013/04/Sitecore-7-Inbound-and-Outbound-Filter-Pipelines.aspx for more information on how you can use "Inbound filters" to ensure only the latest version makes it into the master index, but bear in mind that doesn't really solve the core issue in my opinion as the latest version isn't necessarily the correct one.
So taking this, and the fact that you need language fallbacks, you should probably not filter these results out at the Lucene level, but do the necessary magic yourself in code (a rough sketch follows the list below). This would need to:
Group versions together by their items
Choose the right version based on the current language, date, security and workflow
Apply your desired fallback logic if said version isn't found
Somehow work pagination into this in a way which performs well
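This is roughly the shape I mean. It is only a sketch: the template filter is a placeholder, and ChooseBestVersion is a hypothetical helper where your date, security, workflow and fallback rules would live (the naive language fallback shown is not production logic):

```csharp
using System.Collections.Generic;
using System.Linq;
using Sitecore.ContentSearch;
using Sitecore.ContentSearch.SearchTypes;
using Sitecore.Data;
using Sitecore.Data.Items;

public class GroupedNewsSearch
{
    public List<Item> GetNewsItems()
    {
        using (var context = ContentSearchManager.GetIndex("sitecore_master_index").CreateSearchContext())
        {
            var hits = context.GetQueryable<SearchResultItem>()
                .Where(i => i.TemplateName == "News")   // placeholder filter
                .ToList();

            // Group all indexed versions by the item they belong to, then pick
            // one version per group. Pagination has to happen after this
            // in-memory step, which is exactly why performance is the hard part.
            return hits
                .GroupBy(i => i.ItemId)
                .Select(ChooseBestVersion)
                .Where(item => item != null)
                .ToList();
        }
    }

    // Hypothetical helper: apply language, date, security, workflow and fallback rules here.
    private Item ChooseBestVersion(IGrouping<ID, SearchResultItem> versions)
    {
        var best = versions.FirstOrDefault(v => v.Language == Sitecore.Context.Language.Name)
                   ?? versions.First();   // naive fallback to any version
        return best.GetItem();
    }
}
```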
I also feel the following SO question is relevant:
Indexing Sitecore Item security and restricting returned search results - something else which can catch you out when you expect the Search API to work in exactly the same way as the Query API.
I'd be interested to know your thoughts and if you ever find a better solution! Thanks, Steve.

Related

How can grouping be achieved in SolrNet/Solr (Lucene)?

I have Lucene files indexed according to pageIds (the unique key), and one document can have multiple pages. Once a user performs a search, it gives us the pages that match the search criteria.
I am using Lucene.Net 2.9.2
We have 2 problems...
1- The index files total around 800 GB and hold 130 million rows (pages), so search was really slow: all queries were taking more than a minute, even though we only have to return a limited number of rows at a time.
To overcome the performance issue I shifted to Solr, which resolved it (which is quite strange, as I am not using any extra functionality provided by Solr, like sharding, etc.; could it be that Lucene.Net 2.9.2 does not really match the performance of the same version in Java?). But now I am having another issue...
2- Each individual 'Lucene document' is one page, but I want to show results grouped by 'real documents'. How many results are returned should be configurable based on 'real documents', not 'pages' (because that's how I want to show them to the user).
So let's say I want 20 'real documents' and ALL pages in them that match the search criteria (it doesn't matter if one document has 100 matching pages and another just 1).
From what I could gather on the Solr forums, this can be achieved with the SOLR-236 patch (field collapsing), but I have not been able to apply the patch correctly against trunk (it gives lots of errors).
This is really important for me and I don't have much time, so can someone please either send me the Solr 1.4.1 binary with this patch applied, or guide me if there is any other way?
I would really appreciate it. Thanks!!
If you have issues with the collapse patch, then the Solr issue tracker is the channel to report them. I can see that other people are currently having some issues with it, so I suggest getting involved in its development.
That said: I recommend that if your application needs to search for 'real documents', you build your index around those 'real documents', not their individual pages.
If your only requirement is to show page numbers, I would suggest playing with the highlighter or doing some custom development. You can store the word offsets of the start and end of each page in a custom structure, and knowing the matched word's position in the whole document, you can work out which page it appears on. If the documents are very large you will get a good performance improvement.
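To sketch the custom approach (all names here are made up; the idea is mapping a hit's word position onto stored page boundaries):

```csharp
using System.Collections.Generic;

static class PageLocator
{
    // pageStartWordOffsets[i] holds the word offset at which page i starts,
    // so page i spans [pageStartWordOffsets[i], pageStartWordOffsets[i + 1]).
    public static int PageForWordPosition(IList<int> pageStartWordOffsets, int matchWordPosition)
    {
        int page = 0;
        for (int i = 0; i < pageStartWordOffsets.Count; i++)
        {
            if (pageStartWordOffsets[i] <= matchWordPosition)
                page = i;   // the match starts at or after this page's first word
            else
                break;
        }
        return page;   // zero-based index of the page containing the match
    }
}
```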
You could also have a look at SOLR-1682 (Implement CollapseComponent). I haven't tested it yet, but as far as I know it solves the collapsing too.

Business Applications: What are the fundamental features of a search form?

In a typical business application it is quite common to have forms that are used for searching.
Some basic features are:
A pane that contains the search criteria
A grid to display the results
Sorting on the grid
A detail page that opens when an item is selected in the results grid
What other features would you expect in a business application's search functionality?
Maybe it's a bit trite but there is some sense in this picture:
(removed dead ImageShack link)
Do it as shown in the second example, not as in the third one.
There is a well-known extreme programming principle: YAGNI. I think it's absolutely applicable to almost any problem. You can always add something new if it's necessary, but it's much more difficult to remove something that already exists, because someone already uses it, even if it's wrong.
How about the ability to save search criteria, in order to easily re-run a search later? Or the ability to easily and cleanly print the list of results?
If search refining is allowed (given a search result, limiting future searches to the current results), you may also want to add a breadcrumb system, so that the user can see the sequence of refinements that led to the current result set, and, by clicking on a breadcrumb, return to a previous refinement stage.
Faceted search:
(screenshot removed; source: msdn.com)
The screenshot showed a filter pane: each filter is listed together with the number of results that will remain after applying it. This is very useful and can be done without pain in some search engines, such as Apache Solr. Of course, implement this only if filters make sense in your task.
Aggregate summary info, like total(s), count(s) or percentages.
One or more menus, like right click context for the grid, a ribbon or menu on top.
Your list of UI elements is pretty good. Export, print (perhaps asking whether it is really necessary to print this), category/tag and language selection are all worth considering. Also smart, working pagination (don't forget ordering).
Please do not force a search to open in a new window (or, even worse, always in the same window). Links to search results should be copy-pastable (always use GET).
But it really matters to have a functional (i.e. a really good) algorithm. Mostly I Google company websites, because their own search engine is, cough, awwwwkward. When looking for a feature chart, technical spec, pricing, etc., one is not interested in press releases, and vice versa.
Search engine providers offer integration into company websites.
Use Auto-complete wherever possible on your text input fields.
If using selects or combo boxes with related information, try to use chained selects to organise the information.
Where results depend on location, try to serve locally relevant results.
Also remember to keep the search form as simple as possible even down to one text field. To refine the search you can have an alternate form as an "Advanced Search interface".
Printing, export.
A grid to display the results
Watch out not to display results a user is not authorized to see (roles / permissions / access rights).
A detail page that opens when an item is selected in the results grid
In case a user attempts to circumvent the search page links and open some document directly, again, check permissions.
Validation, validation, validation.
It should be very hard, near impossible, for me to run a query that makes no sense, i.e., a start date occurring after an end date.
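For example, something as trivial as this check (the names are illustrative) catches the nonsense before the query ever runs:

```csharp
using System;

// Reject nonsensical criteria up front, before any query is executed.
static void ValidateCriteria(DateTime? startDate, DateTime? endDate)
{
    if (startDate.HasValue && endDate.HasValue && startDate.Value > endDate.Value)
        throw new ArgumentException("The start date must not be after the end date.");
}
```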
Export a numerical dataset (even if it only has one numeric column; just make it so by default) to CSV for import into Excel. People love this function, even if only 1% of users seem to use it with any regularity. Just ask yourself when the last time was that you highlighted something for copy-and-paste. Would it have been easier to open a CSV?
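A minimal sketch of such an export (naive quoting, and the names are placeholders):

```csharp
using System.Data;
using System.IO;
using System.Linq;

// Dump a result set to CSV so users can open it straight in Excel.
static void ExportToCsv(DataTable results, string path)
{
    using (var writer = new StreamWriter(path))
    {
        writer.WriteLine(string.Join(",",
            results.Columns.Cast<DataColumn>().Select(c => c.ColumnName).ToArray()));

        foreach (DataRow row in results.Rows)
        {
            // Wrap every field in quotes and double any embedded quotes.
            var fields = row.ItemArray
                .Select(v => "\"" + v.ToString().Replace("\"", "\"\"") + "\"");
            writer.WriteLine(string.Join(",", fields.ToArray()));
        }
    }
}
```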
Refinable searches (think Google's use of site:). People who use the search utility a lot will appreciate this. People who don't won't know it's not there.
The ability to choose to display 1 record, 5 records, 100 records, 1000 records, etc. "Paging" I believe is what we most commonly call it ;).
You mentioned sortable grids. Somebody else mentioned auto-sum or auto-count. Those are good if (once again) you have largely numeric data. But those are almost report-oriented functions.
Hope this helps.
One thing you can do is have a drop-down of the most common searches in plain English, e.g. "High-value sales in New York in the last 5 days". This is the equivalent of the user selecting an amount, the city, date ranges, etc., done conveniently for them.
Another thing is to have multiple search criteria tabs based on the perspective of the user, like "sales search", "reporting search", "admin search", etc.
Also consider limiting the number of entries retrieved in a search, and allow users to do narrower searches. This depends on the business needs, however.
The most commonly used search option listed first and in a prominent location.
I think your requirements are good. Take a cue from Google. Google got it right. One text box where you type whatever you want, and your engine spits out the answers. Most folks will try this, and if the answers are good enough, then that is what they will use. In the back-end, you'll probably want to flatten all of the data into a big honkin' table and then index it or use a SQL query with "LIKE" in it.
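A back-of-the-envelope version of that basic search (the flattened table and column names are made up):

```csharp
using System.Data.SqlClient;

// One text box, one flattened table, one LIKE query: good enough for a first cut.
static SqlDataReader BasicSearch(SqlConnection connection, string term)
{
    var command = new SqlCommand(
        "SELECT TOP 100 * FROM SearchIndex WHERE SearchText LIKE @term", connection);
    command.Parameters.AddWithValue("@term", "%" + term + "%");
    return command.ExecuteReader();
}
```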
However, you will probably want to allow the user to refine the search. For this, have a link to "Advanced Search" and use a form there to specify filter criteria. This lets the user zero in on the results if the basic search is not good enough. For the results on this page, you will certainly want to have sorting on key fields, but do it after you have produced the initial result set.
It depends on the content that you are searching for... make it relevant :) Search always looks easy but can be incredibly difficult to get right.
Not mentioned yet, but very important I think: a search that actually works. This item is often neglected, which makes the rest a bit moot.

Source core repositories and sticky notes

An interesting problem occurred recently, and I've been thinking about the "best" way (for a given value of "best") to implement this.
In essence, it's one of tracking notes against source code. The example that flagged this was getting a problem fixed in live within SLAs, and how best to achieve this. Without going into all the details, it came down to finding a function that's used in a number of places which may or may not be buggy, yet the problem was being reported in only a single location.
The fix to meet the SLAs was simply to add a check into the location where the problem was reported, rather than tweaking the common code and having to test everything that touches that function.
The interesting issue is then upstreaming. The "correct" method would be to go back and check the original function, validate that it's correct everywhere it's called, and then make the change "properly" if it's determined that the library function is wrong.
The problem is this takes time, so upstreaming may simply take the workaround, etc. However, if the problem occurs again (say six months later) in another location calling the same library function, there isn't an easy way to link the two problems together. You can search the bug tracking database, but this isn't guaranteed to help: it depends on whether a note's been added saying something along the lines of "this library function needs more thorough checking, but no time to investigate now".
So the question is this: within a large team of developers (30 plus, split into teams of both support and on-going development), what methods do you use to manage (what are effectively) "sticky notes" against source code, short of adding a comment to the suspicious function's source code saying "this might be a bit dodgy"?
The problem with committing a comment is one of process: a change is a change, so committing a zero-change change (i.e., one where just comments are added) is not ideal; developers can make mistakes even when adding a comment (hit a stray key or something), so it's always (IMO) better to commit only where actual changes are made.
Now, a wiki could be used to track per-file notes, but we've got a minimum of four branches and in excess of a few hundred files (SQL objects, source code, XML files, etc.), so a wiki would get unmanageable quite quickly.
This is the sort of thing it would be nice if SCMs could support: bits of metadata against files that are simply notes, which don't add to the SCM's version history, but that can be displayed when doing (say) an svn update, or viewed manually.
There may already be solutions out there, so how do you manage this type of knowledge sharing?
Well, we're now using this method: in each folder checked into SVN, we've created a .url shortcut (this is Windows we're dev'ing on) that links to a page on our development wiki about that folder. Thus we can update the wiki info freely, and on checkout/update everyone gets a link that will take them to the appropriate wiki page for that folder/module.
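For anyone copying the idea, such a shortcut is just a two-line text file (the wiki address below is obviously a made-up example):

```
[InternetShortcut]
URL=http://devwiki.example.com/wiki/MyProject/SomeModule
```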
We've not long instigated it, so we'll have to see how well it works long term, but it's better than what we had before (i.e., nothing :-) ).

Problems using MySQL FULLTEXT indexing for programming-related data (SO Data Dump)

I'm trying to implement a search feature for an offline-accessible StackOverflow, and I'm noticing some problems with using MySQL's FULLTEXT indexing.
Specifically, by default FULLTEXT indexing is restricted to words between 4 and 84 characters long. Terms such as "PHP" or "SQL" would not meet the minimum length and searching for those terms would yield no results.
It is possible to modify the variable which controls the minimum length a word needs to be in order to be indexed (ft_min_word_len), but this is a system-wide change which requires the indexes in all databases to be rebuilt. On the off chance others find this app useful, I'd rather keep this sort of variable as vanilla as possible. I found a post on this site the other day stating that changing that value is just a bad idea anyway.
Another issue is with terms like "VB.NET" where, as far as I can tell, the period in the middle of the term splits it into two indexed values: VB and NET. Again, this means searches for "VB.NET" would return nothing.
Finally, since I'm doing a direct import of the monthly XML-based data dumps, all values are converted to HTML entities, and I'm concerned that this might have an impact on my search results.
I found a blog post which tries to address these issues with the following advice:
keep two copies of your data: one with markup, etc., for display, and one modified for searching (remove unwanted words, markup, etc.)
pad short terms so they will be indexed, I assume with a pre-/suffix (a rough sketch of this follows the list)
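To illustrate the padding idea (the suffix and threshold are arbitrary choices of mine; the same transform must be applied both to the searchable copy and to incoming query terms):

```csharp
using System.Linq;

static class FulltextPrep
{
    const int MinWordLen = 4;     // MySQL's default ft_min_word_len
    const string Pad = "zz";      // arbitrary suffix; must be used consistently

    // "PHP" becomes "PHPzz" so it clears the minimum indexed word length.
    public static string PadShortTerms(string text)
    {
        var words = text.Split(' ')
            .Select(w => w.Length > 0 && w.Length < MinWordLen ? w + Pad : w);
        return string.Join(" ", words.ToArray());
    }
}
```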
What I'd like to know is, are these really the best workarounds for these issues? It seems like semi-duplicating a > 1GB table is wasteful, but maybe that's just me.
Also, if anyone could recommend a good site to understand MySQL's FULLTEXT indexing, I'd appreciate it. To keep this question from being too cluttered, please leave the site recommendations in the question comments, or email me directly at the site on my user profile.
Thanks!
Additional Info:
I think I should clarify a couple of things.
I know "MySQL" tends to lead to the assumption of "web application", but that's not what I'm going for here. I could install Apache and PHP and run things that way, but I'm trying to keep this light. I can use my website for playing with PHP, so I don't feel the need to install it on my home machine too. I also hope this could be useful for others as well, and I don't want to force anyone else into installing a bunch of extra utilities. I went with MySQL since it was easy and needing to install some sort of DB was unavoidable.
The specifics of the project were going to be:
Desktop application written in C# (WinForms)
MySQL backend
I'm starting to wonder if I should just say to hell with it and install everything I'd need to make this an (offline) web app. As much as we'd all like to think our pet project is going to be used and loved by the community at large, I should know by now that it is likely going to end up being used by a single user.
From what was already said, I understand that MySQL FULLTEXT is not for you ;) But why stick with MySQL? Try Sphinx:
http://www.sphinxsearch.com/
It will solve most of your problems.

Relevant Search Results Across Multiple Databases

I have three databases that all have the contents of several web pages in them. What would be the best way to go about searching all three and having the most relevant web page at the top of the search results?
The only way I can think of is breaking down content by word count and/or creating a complex set of search rules to give one piece of content priority over another. This might be more trouble than it's worth, but I was wondering if anybody knows of a way or product out there that would be able to help me.
To further support Ivan's answer above, Lucene is the way to go. You haven't mentioned what platform you're on, so I'll point out that you can use a .NET port of it too.
If you do use Lucene, there is a very good book from Manning on the subject which I recommend you look at.
When it comes to populating your index, you have a couple of choices. For starters you can just dump all of your text into the index and allow the engine to just search on it. However, I'd recommend adding fixed fields to your index which will allow you to support things such as partitioned searches or searches against those fields only.
To explain, let's say you have a field for the website. Then you can partition your index by restricting the search to those documents that have that website in that field.
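In Lucene.NET terms, that partitioning looks roughly like this (a sketch against the 2.9-era API; the field names are examples, and writer, pageText and textQuery are assumed to be set up already):

```csharp
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Search;

// Index time: store the source website as a non-analyzed field on each document.
var doc = new Document();
doc.Add(new Field("website", "siteA.example.com", Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.Add(new Field("content", pageText, Field.Store.NO, Field.Index.ANALYZED));
writer.AddDocument(doc);

// Search time: AND the user's text query with a term filter on that field.
var partitioned = new BooleanQuery();
partitioned.Add(textQuery, BooleanClause.Occur.MUST);
partitioned.Add(new TermQuery(new Term("website", "siteA.example.com")), BooleanClause.Occur.MUST);
```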
The other approach is to extract points of interest from your documents and allow searches on those without searching the entire index entry. Your mileage may vary with this, as the Lucene engine is very well written; it may simply allow you to collect your searches into more logical units, which helps you with your solution.
I've done this myself and it helps when answering management questions about what exactly is searched and indexed.
HTH!
If you're using MS SQL Server, then its full-text search can return a ranking for you. I haven't used it, so you'll need to check the documentation or online for specifics.
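Going by the documentation only, an untested sketch of what such a ranked query might look like from C# (the table and column names are made up):

```csharp
using System.Data.SqlClient;

// Untested sketch: CONTAINSTABLE returns a KEY/RANK pair per hit,
// which you join back to the source table and sort by.
static SqlCommand RankedSearch(SqlConnection connection, string searchTerm)
{
    var command = new SqlCommand(
        @"SELECT p.*, ft.RANK
          FROM Pages AS p
          INNER JOIN CONTAINSTABLE(Pages, Content, @term) AS ft ON p.PageId = ft.[KEY]
          ORDER BY ft.RANK DESC", connection);
    command.Parameters.AddWithValue("@term", searchTerm);
    return command;
}
```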