Search items without presentation details in Sitecore - Lucene

Improve search performance.
We are currently on Sitecore 8.1.3 in production and use Lucene to power our site search. We will be moving to Solr or Coveo in the near future. That said, we are trying to improve the search functionality on our site in the meantime.
In the current scenario, when a user searches on our site, Lucene returns the appropriate results from Sitecore content items. The results are a list of items, some of which have presentation details while others don't (the latter are basically datasource items, or items pulled in through multilist fields). We display the results that have presentation details directly to the user. The datasource items have no presentation details of their own, so for those we instead display the items that reference them, either as a datasource in presentation details or through a multilist field.
We use the Globals.LinkDatabase.GetItemReferrers(item, false) method to fetch the items that refer to each result item. We know this is an expensive method. To improve performance, we filter the items it returns: we take only the latest version of each item, only items that have presentation details, and only items in the same language as the context language. If an item doesn't have presentation details, we search for a related item with presentation details using the same function recursively. This logic improves performance to some degree and yields the required results.
However, this code slows down when the number of search results is high. Say Lucene returns 10 items for a search; our custom code may then yield around 100 related items (since datasource items can be reused across different pages). Performance degrades badly when Lucene returns a large result set, say 500 items, because we then run this code recursively over 500 items and all of their related items. For better performance we have tried using LINQ queries instead of foreach iterations wherever possible. The code works fine and we do get appropriate results, but the search slows down when the result count is high. We want to know whether there are any other areas where we can improve performance.

The best way to improve performance is to have a custom index that contains the items you want to search and excludes the items you do not want to return. That way, your filtering is 'pre-done' during indexing.
The common way of doing this is to use a computed field that contains all the 'text' of the page (collating together content from its datasources), so that the page's full contents live in a single field in the index. This way, even if the text match would have been on a datasource, the page still comes back as a valid search result.
There is a blog post from Kam Figy on this topic: https://kamsar.net/index.php/2014/05/indexing-subcontent/
Note that in addition to the computed field, you will also need to patch the field into the index using a Sitecore config patch file. Kam's blog shows an example of that as well.
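To illustrate the idea, here is a minimal sketch of such a computed field. This is not Kam's exact code: the class name is arbitrary, the field filtering is deliberately crude, and the field still needs to be registered in the index configuration (under the fields node with hint="raw:AddComputedIndexField" for Lucene).
using System.Text;
using Sitecore.ContentSearch;
using Sitecore.ContentSearch.ComputedFields;
using Sitecore.Data.Fields;
using Sitecore.Data.Items;

public class AggregatedPageContent : IComputedIndexField
{
    public string FieldName { get; set; }
    public string ReturnType { get; set; }

    public object ComputeFieldValue(IIndexable indexable)
    {
        Item page = (SitecoreIndexableItem)indexable;
        if (page == null || page.Visualization.Layout == null)
            return null; // only pages with presentation get the aggregated field

        var text = new StringBuilder();
        foreach (var device in page.Database.Resources.Devices.GetAll())
        {
            foreach (var rendering in page.Visualization.GetRenderings(device, false))
            {
                if (string.IsNullOrEmpty(rendering.Settings.DataSource))
                    continue;

                Item datasource = page.Database.GetItem(rendering.Settings.DataSource);
                if (datasource == null)
                    continue;

                datasource.Fields.ReadAll();
                foreach (Field field in datasource.Fields)
                {
                    // crude filter: skip system fields, index everything else with the page
                    if (!field.Name.StartsWith("__"))
                        text.AppendLine(field.Value);
                }
            }
        }
        return text.ToString();
    }
}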

You need to index this data together to begin with, rather than trying to piece it together at runtime. You should also try to keep your indexes lean, or use queries that restrict what is returned so that only relevant results come back.
I agree with the answer from Jason that a separate index is one of the best solutions, combined with a computed field that includes content from all referenced datasources.
Further, I would create a custom crawler which excludes items without any layout from the index. For an index which is only used to provide results for site search, you only care about items with layout, since only those have a navigable URL.
using Sitecore.ContentSearch;
using Sitecore.Data.Items;

namespace MyProject.CMS.Custom.ContentSearch.Crawlers
{
    public class CustomItemCrawler : Sitecore.ContentSearch.SitecoreItemCrawler
    {
        protected override bool IsExcludedFromIndex(SitecoreIndexableItem indexable, bool checkLocation = false)
        {
            bool isExcluded = base.IsExcludedFromIndex(indexable, checkLocation);
            if (isExcluded)
                return true;

            // Exclude any item that has no layout assigned
            Item obj = (Item)indexable;
            return obj.Visualization == null || obj.Visualization.Layout == null;
        }

        protected override bool IndexUpdateNeedDelete(SitecoreIndexableItem indexable)
        {
            // Remove an item from the index when its layout has been removed
            Item obj = (Item)indexable;
            return obj.Visualization == null || obj.Visualization.Layout == null;
        }
    }
}
If for some reason you do not wish to create a separate index, or you only want to keep a single index (because you are using the Content Search API and require a full index for your component queries, or simply to minimise indexing time across multiple indexes), then I would consider creating a custom computed field in the index which stores true/false. The logic is the same as above. You can then filter your search to only return results which have layout.
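A minimal sketch of that computed field (the class name is illustrative, and like any computed field it must be registered in the index configuration):
using Sitecore.ContentSearch;
using Sitecore.ContentSearch.ComputedFields;
using Sitecore.Data.Items;

public class HasLayout : IComputedIndexField
{
    public string FieldName { get; set; }
    public string ReturnType { get; set; }

    public object ComputeFieldValue(IIndexable indexable)
    {
        Item item = (SitecoreIndexableItem)indexable;
        // true for items with layout details, false for datasource-only items
        return item != null && item.Visualization.Layout != null;
    }
}
Your site-search query can then filter on this field, e.g. .Where(x => x.HasLayout) with a result-item property mapped to the field via the IndexField attribute.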
The combination of including the content of the datasource items during indexing and only returning items with layout should result in much better performance of your search queries.

React-Admin filters that relate to the current results

We're really enjoying using the capabilities offered by React-Admin.
We're using <ReferenceArrayInput> to allow filtering of a <List> by Country. The drop-down contains all countries in the database.
But, we'd like it to just contain the countries that relate to the current set of filtered records.
So, in the context of the React-Admin demo, if we've filtered for Returned, then the Customer drop-down would only contain customers who have returned items. This would make a real difference in finding the records of interest.
Our current plan is to (somehow) handle this in our <DataProvider>. But is there a more React-Admin-friendly way of doing this?
So you want to build dependent filters, which is not a native feature of react-admin - and a complex beast to tame.
First, doing so in the dataProvider will not work, because you'll only have the data of the first page of results. A record in a following page may have another value for your array input.
You could implement that logic in a custom Input component instead. This component can wrap the original <ReferenceArrayInput>, read the current ListContext to get the current data and filter values (https://marmelab.com/react-admin/useListContext.html), and then narrow the set of possible choices using the filter prop (https://marmelab.com/react-admin/ReferenceArrayInput.html#filter).

RavenDB paging via cursor

Paging in RavenDB is done via skip+take. This is the default implementation and I'm happy with it most of the time. However, for frequently changing data I want paging via a cursor. The cursor/after parameter specifies the last record displayed, i.e. where the list should continue on the next page.
This should work for data which can be dynamically sorted, so the sorting parameter is not fixed.
GitHub does it this way on the "stars" page, for example: https://github.com/[username]?after=Y3Vyc29&tab=stars
Any ideas how to achieve this in RavenDB?
There is no cursor pagination in RavenDB.
But you can use the '@last-modified' metadata to continuously iterate over frequently changing data:
from Orders as o
where o.'@metadata'.'@last-modified' > '2018-08-28T12:11'
order by o.'@metadata'.'@last-modified'
select {
    A: o['@metadata']['@last-modified']
}
You can also use a data subscription.
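For illustration, a hedged sketch of the subscription approach with the RavenDB .NET client (the Order class and the handling here are simplified): the server tracks how far the subscription has progressed, so the client keeps receiving new and changed documents instead of re-running skip+take pages.
using System;
using System.Threading.Tasks;
using Raven.Client.Documents;
using Raven.Client.Documents.Subscriptions;

public static class OrderFeed
{
    public class Order { }

    public static async Task RunAsync(IDocumentStore store)
    {
        // Created once; the server remembers the subscription's position in the data
        string name = store.Subscriptions.Create(new SubscriptionCreationOptions
        {
            Query = "from Orders"
        });

        using (SubscriptionWorker<Order> worker = store.Subscriptions.GetSubscriptionWorker<Order>(name))
        {
            // Run keeps delivering batches, resuming from the stored position after restarts
            await worker.Run(batch =>
            {
                foreach (var item in batch.Items)
                    Console.WriteLine($"{item.Id} was created or modified");
            });
        }
    }
}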

How to manage additional processed data in MarkLogic

MarkLogic 9.0.8.2
We have around 20M records in MarkLogic.
For one of our business requirements, we need to generate additional data for each XML document, and users will then search this data.
As we can't change the original documents, we need input on the best way to manage the additional data. These are the options we have thought of:
Create a separate collection and store the additional data in separate XML documents with the same unique number as the originals. When a user searches, we search this collection, then retrieve the original documents and send the response back.
Store the additional data in the original documents' properties.
We also need to create element range indexes to make sure this works when end users search with range operators.
<abc>
  <xyz>
    <quan>qty1</quan>
    <value1>1.01325E+05</value1>
    <unit>Pa</unit>
  </xyz>
  <xyz>
    <quan>qty2</quan>
    <value1>9.73E+02</value1>
    <value2>1.373E+03</value2>
    <unit>K</unit>
  </xyz>
  <xyz>
    <quan>qty3</quan>
    <value1>1.8E+03</value1>
    <unit>s</unit>
  </xyz>
  <xyz>
    <quan>qty4</quan>
    <value1>3.6E+03</value1>
    <unit>s</unit>
  </xyz>
</abc>
We need to process data from the value1 elements. A user will then search for something like:
qty1 >= minvalue AND qty1 <= maxvalue
qty2 >= minvalue AND qty2 <= maxvalue
qty3 >= minvalue AND qty3 <= maxvalue
So when a user searches on qty1, it should only match values from the xyz element whose quan is qty1, and so on.
So we would like to know:
What is the best approach to store data like this?
What kind of index should we create to implement this?
I would recommend wrapping the original data in an envelope, which allows adding extra data in a header. It also allows creating a canonical view of the relevant pieces of the data: either store that view as the instance and the original as an 'attachment' (a sub-element, not an attached binary), or keep the instance as-is and put canonical values for indexing in the header.
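As a rough sketch of the second variant for the sample above (the header element names are made up for illustration; the values are the value1 numbers in plain decimal form):
<envelope>
  <headers>
    <!-- hypothetical canonical values, one element per quantity, parsed from value1 -->
    <qty1>101325</qty1>
    <qty2>973</qty2>
    <qty3>1800</qty3>
    <qty4>3600</qty4>
  </headers>
  <instance>
    <abc>
      <!-- original document, kept verbatim -->
    </abc>
  </instance>
</envelope>
With one header element per quantity, a decimal element range index on qty1, qty2, etc. directly supports searches like qty1 >= minvalue AND qty1 <= maxvalue without touching the original data.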
There is a lengthy blog article about the topic, that discusses pros and cons in high detail: https://www.marklogic.com/blog/envelope-design-pattern/
HTH!
Grtjn's answer would be the recommended solution: it is more performant to keep all the information inside the document itself, versus having to query across both the document and its properties, but it does require changes to the document.
Options 1 and 2 could both work.
Properties documents already exist, so using them doesn't add fragments, but the properties must conform to the properties schema.
Creating sidecar documents provides more flexibility, but because you are creating new documents, it will increase the number of fragments.

Sitecore: Programmatically trigger reindexing of related content

In my Sitecore instance, I have content based on 2 templates, Product and Product Category. The Products have a multilist that links to Product Categories as lookups. The Products also have an indexing computed field set up that precomputes some data based on the selected Product Categories. So when a user changes a Product, Sitecore's indexing strategy indexes the Product with the computed field.
My issue is that when a user changes the data in a Product Category, I want Sitecore to reindex all of the related Products. I'm not sure how to do this. I do not see any hook where I could detect that a Product Category is being indexed, so that I could programmatically trigger an index update for its Products.
You could achieve this using the indexing.getDependencies pipeline. Add a processor to it - your custom class should derive from Sitecore.ContentSearch.Pipelines.GetDependencies.BaseProcessor and override Process(GetDependenciesArgs context).
In this function you can inspect the IndexedItem and, based on that information, add other items to the Dependencies. These dependencies will be indexed as well. The benefit of this way of working is that the dependent items are indexed in the same job, instead of new jobs being called to update them.
Just be aware of the performance hit this could cause if badly written, as this code will be called on all indexes. Get out of the function as soon as you can if it is not applicable.
Some known issues with this pipeline can be found in the Sitecore knowledge base.
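A minimal sketch of such a processor, assuming Product Categories are identified by template name (the template names and the link-database lookup are illustrative, not from the original answer):
using Sitecore;
using Sitecore.ContentSearch;
using Sitecore.ContentSearch.Pipelines.GetDependencies;
using Sitecore.Data.Items;
using Sitecore.Links;

public class AddProductDependencies : BaseProcessor
{
    public override void Process(GetDependenciesArgs context)
    {
        var indexable = context.IndexedItem as SitecoreIndexableItem;
        Item item = indexable?.Item;
        if (item == null || item.TemplateName != "Product Category")
            return; // bail out fast for everything else

        // Queue every Product that refers to this category for indexing in the same job
        foreach (ItemLink link in Globals.LinkDatabase.GetItemReferrers(item, false))
        {
            Item product = link.GetSourceItem();
            if (product != null && product.TemplateName == "Product")
                context.Dependencies.Add(new SitecoreItemUniqueId(product.Uri));
        }
    }
}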
One way would be to add a custom OnItemSave handler which checks whether the changed item is based on the Product Category template, and then programmatically triggers the index update.
To reindex only the changed items, you can pick up the related Products and register them in the HistoryEngine by using the HistoryEngine.RegisterItemSaved method:
Sitecore.Context.Database.Engines.HistoryEngine.RegisterItemSaved(myItem, new ItemChanges(myItem));
Some useful instructions on how to create an OnItemSave handler can be found here: https://naveedahmad.co.uk/2011/11/02/sitecore-extending-onitemsave-handler/
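A hedged sketch of such a handler (the template names and the link-database lookup are illustrative, not from the original answer):
using System;
using Sitecore;
using Sitecore.Data.Items;
using Sitecore.Events;
using Sitecore.Links;

public class ProductCategorySavedHandler
{
    // Wired up via a config patch under <events><event name="item:saved">
    public void OnItemSaved(object sender, EventArgs args)
    {
        Item item = Event.ExtractParameter(args, 0) as Item;
        if (item == null || item.TemplateName != "Product Category")
            return;

        // Register each referring Product so the incremental index update picks it up
        foreach (ItemLink link in Globals.LinkDatabase.GetItemReferrers(item, false))
        {
            Item product = link.GetSourceItem();
            if (product != null && product.TemplateName == "Product")
                product.Database.Engines.HistoryEngine.RegisterItemSaved(product, new ItemChanges(product));
        }
    }
}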
You could also add a new index update strategy, or change one of the existing ones (see the configuration\sitecore\contentSearch\indexConfigurations\indexUpdateStrategies configuration node).
As an example you could take
Sitecore.ContentSearch.Maintenance.Strategies.SynchronousStrategy
The one thing you need to change is the
public void Run(EventArgs args, bool rebuildDescendants)
method. args contains the changed item reference; all you need to do is trigger an index update for the related items.
After creating the custom update strategy, you should add it to your index's strategies configuration node.
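For reference, registering a strategy on an index looks roughly like this in the index definition (a hedged config sketch; myCustomStrategy is a placeholder and must also be defined under the indexUpdateStrategies node):
<index id="sitecore_web_index">
  <strategies hint="list:AddStrategy">
    <!-- hypothetical custom strategy registered alongside or instead of the default -->
    <strategy ref="contentSearch/indexConfigurations/indexUpdateStrategies/myCustomStrategy" />
  </strategies>
</index>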

Paging dynamic query results

Paging using LINQ can be easily done using the Skip() and Take() extensions.
I have been scratching my head for quite some time now, trying to find a good way to perform paging of a dynamic collection of entities - i.e. a collection that can change between two sequential queries.
Assume a query that, without paging, would return 20 objects of type MyEntity.
The following two lines trigger two DB hits that together populate results1 and results2 with all of the objects in the dataset.
List<MyEntity> results1 = query.Take(10).ToList();
List<MyEntity> results2 = query.Skip(10).Take(10).ToList();
Now let's assume the data is dynamic, and that a new MyEntity is inserted into the DB between the two queries, in such a way that the original query would place the new entity on the first page.
In that case, the results2 list will contain an entity that also exists in results1, causing a duplicate result to be sent to the consumer of the query.
Conversely, if a record from the first page is deleted, results2 will be missing a record that should originally have appeared on it.
I thought about adding a Where() clause to the query that verifies the records were not retrieved on a previous page, but that seems like the wrong way to go, and it won't help with the second scenario.
I thought about keeping a record of query execution timestamps, attaching a LastUpdatedTimestamp to each entity, and filtering out entities that changed after the previous page request. That direction fails from the third page onwards...
How is this normally done?
Is there a way to run a query against an old snapshot of the DB?
Edit:
The background is an ASP.NET MVC WebApi service that responds to a simple GET request to retrieve the list of entities. The client retrieves a page of entities and presents them to the user. When the user scrolls down, the client sends another request to the service to retrieve another page of entities, which is appended to the first page already presented to the user.
One way to solve this is to keep a timestamp of the first query and exclude any results that were added after that time. However, this solution puts the data into a "stale" state, i.e. the data is fixed at a certain point in time, which is usually a bad idea unless you have a very good reason for it.
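A rough sketch of that timestamp idea, assuming the entity has a CreatedAt column and the client echoes back the snapshot time it received with page 1 (the names here are illustrative):
// The client captures the server time when page 1 is served and sends it
// back with each subsequent page request.
public List<MyEntity> GetPage(DateTime snapshot, int pageIndex, int pageSize)
{
    return _db.MyEntities
        .Where(e => e.CreatedAt <= snapshot)   // hide rows inserted after the first page
        .OrderBy(e => e.Id)                    // Skip/Take needs a stable order
        .Skip(pageIndex * pageSize)
        .Take(pageSize)
        .ToList();
}
Note this only handles inserts; as the question points out, deletions can still shift records between pages.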
Another way of doing this (again, depending on your application) would be to load the full list of objects into memory and then only hand out a certain number of results at a time.
// Snapshot the result set once; subsequent "pages" are then served from memory
public class EntityPager
{
    private readonly List<Entity> _list;

    public EntityPager()
    {
        _list = _db.Entities.ToList(); // _db is your data context
    }

    public List<Entity> GetFirst10()
    {
        return _list.Take(10).ToList();
    }

    public List<Entity> GetSecond10()
    {
        return _list.Skip(10).Take(10).ToList();
    }
}