How can grouping be achieved in SolrNet/Solr (Lucene)?

I have Lucene files indexed by pageId (the UniqueKey), and one document can have multiple pages. When a user performs a search, we get back the pages that match the search criteria.
I am using Lucene.Net 2.9.2
We have 2 problems...
1. The index is around 800 GB and has 130 million rows (pages), so search was really slow: every query took more than a minute, even though we only have to return a limited number of rows at a time.
To overcome the performance issue I moved to Solr, which resolved it (which is quite strange, since I am not using any extra functionality Solr provides, like sharding; could it be that Lucene.Net 2.9.2 simply isn't on par, performance-wise, with the same version in Java?). But now I have another issue...
2. Each individual Lucene document is one page, but I want to show the results grouped by 'real documents'. How many results are returned should be configurable in terms of real documents, not pages, because that is how I want to present them to the user.
So let's say I want 20 real documents and ALL the pages in them that match the search criteria (it doesn't matter if one document has 100 matching pages and another just one).
From what I gathered on the Solr forums, this can be achieved with the SOLR-236 patch (field collapsing), but I have not been able to apply the patch cleanly against trunk (it gives lots of errors).
This is really important for me and I don't have much time, so could someone please either send me a Solr 1.4.1 binary with this patch applied, or point me to another way of achieving this?
I would really appreciate it. Thanks!!

If you have issues with the collapse patch, then the Solr issue tracker is the channel to report them. I can see that other people are currently having some issues with it, so I suggest getting involved in its development.
That said: if your application needs to search for 'real documents', I recommend building your index around those real documents, not their individual pages.

If your only requirement is to show page numbers, I would suggest playing with the highlighter or doing some custom development. You can store the word offsets of the start and end of each page in a custom structure; then, knowing the position of the matched word within the whole document, you can tell which page it appears on. If the documents are very large you will get a good performance improvement.
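For example, a minimal sketch of that lookup in C#, assuming the word offset at which each page starts has already been stored somewhere (the names and structure are placeholders):

```csharp
using System.Collections.Generic;

public static class PagePositionMapper
{
    // pageStartOffsets[i] = word offset at which page i begins, sorted ascending.
    // Given the position of a matched term within the whole document, a binary
    // search tells us which page the hit falls on.
    public static int PageForPosition(IList<int> pageStartOffsets, int matchPosition)
    {
        int lo = 0, hi = pageStartOffsets.Count - 1, page = 0;
        while (lo <= hi)
        {
            int mid = (lo + hi) / 2;
            if (pageStartOffsets[mid] <= matchPosition)
            {
                page = mid;   // this page starts at or before the match...
                lo = mid + 1; // ...but a later page might as well
            }
            else
            {
                hi = mid - 1;
            }
        }
        return page; // zero-based page index
    }
}
```

With pages starting at word offsets 0, 250 and 800, a match at position 300 maps to page index 1.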

You could also have a look at SOLR-1682 (Implement CollapseComponent). I haven't tested it yet, but as far as I know it solves the collapsing issue too.
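For what it's worth, once you are on a Solr release with result grouping built in (3.3+) and a SolrNet build that exposes it, the query side looks roughly like this. This is only a sketch: the Page class and the documentId field are assumptions about your schema.

```csharp
using Microsoft.Practices.ServiceLocation;
using SolrNet;
using SolrNet.Commands.Parameters;

// Assumes Startup.Init<Page>("http://localhost:8983/solr") has already run and
// that each Solr document (a page) carries a documentId field naming the
// "real" document it belongs to.
var solr = ServiceLocator.Current.GetInstance<ISolrOperations<Page>>();

var results = solr.Query(new SolrQuery("your search terms"), new QueryOptions
{
    Rows = 20, // with grouping enabled, rows counts groups, i.e. real documents
    Grouping = new GroupingParameters
    {
        Fields = new[] { "documentId" }, // group pages by their parent document
        Limit = 100,                     // max matching pages returned per document
        Ngroups = true                   // also return the total number of groups
    }
});

// results.Grouping["documentId"].Groups then contains one group per real
// document, each holding the pages that matched.
```

On Solr 1.4.1 itself this is only possible via the collapsing patches discussed above.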

Related

Picking the right database technique for file storage and search

For a personal project I am searching for the "most suitable" database engine to address the following key requirements:
need to store large amounts of single different document files (PDF)
need to perform full-text search onto PDF (for this I plan to use OCR and save the processed data/metadata additionally to the database)
need to get pieces/chunks of the saved documents (for example from a specific year) and show a preview of lots of them within a nice web UI
as much performance as possible
Up to now I have worked a lot with SQL (MySQL) and have some theoretical knowledge about other systems (Memcached, Redis, PostgreSQL, MongoDB), but I've never used them in combination and never figured out WHEN they should be used for WHAT exactly, or how they can be combined.
I think that, especially for a project like this, it's very important to select the right engine from the beginning so as not to hit performance issues later.
So, especially to all the experienced developers out there: what would be your favourite choice for this kind of project (I guess SQL may not be the only right solution)?
Or, in the end, would it be better to store the files in the filesystem and keep only the metadata in the database?
BTW, my planned API backend for this is Laravel 7+, and the frontend will be Vue 2+.
Thank you very much!

How to work with Sitecore Content Search and Page Editor

I rewrote a news application (overview + detail) from fast query to Content Search. The performance gains were enormous, but I see some possible limitations which I don't know how to handle in conjunction with the Page Editor.
When I use a fast query, I get an instance of a news item even if there isn't a language version yet. In Lucene, I cannot find a result (because I filter by language), and therefore the news detail is missing from the overview in that particular language.
EDIT for question 1
Let's assume we have a solution with two languages (English and German). I have an item which currently only exists in a single English version. When I'm on an overview page in German and want to find this item with a fast query (the exact query doesn't matter), I will get the item back. In the wrong version, but I get it back. Now, if I'm in the Page Editor, I can go to this item and edit it in German, even if there is no German version yet. The first click on the save button will create that first version for me.
When I want to find the item through Content Search, my natural way of querying it is to apply the same filter (probably by template, path and some channeling or whatever) AND a filter on the language property of the SearchResultItem, since I don't want multiple results for the same item. But since there is only an English version yet, the index contains just a single result, in English, and because of the language filter I cannot find the item to call GetItem on it.
Since initially writing this question, I see two approaches to get around this:
a) Remove the language filter in Page Editor mode and filter afterwards somehow (I don't know whether I'm able to get an item which I can edit in the Page Editor in the German language).
b) Create a Page Editor-specific master search index which holds an entry for every item in every language of the solution, even if there is no version in that specific language. I can add a computed field indicating whether a result is a real item version or not, to filter on at some point if necessary. I can probably call GetItem on this and enable Page Editor capabilities; a rough sketch of such a computed field follows below.
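For illustration, the computed field I have in mind would look roughly like this (the class name and return value are placeholders, and it would still need to be registered in the index configuration):

```csharp
using Sitecore.ContentSearch;
using Sitecore.ContentSearch.ComputedFields;

// Marks whether the indexed entry corresponds to a real item version in the
// language being indexed, or is merely a placeholder entry.
public class HasRealVersion : IComputedIndexField
{
    public string FieldName { get; set; }
    public string ReturnType { get; set; }

    public object ComputeFieldValue(IIndexable indexable)
    {
        var indexableItem = indexable as SitecoreIndexableItem;
        if (indexableItem == null || indexableItem.Item == null)
            return null;

        // Versions.Count is 0 when the item has no version in its current language.
        return indexableItem.Item.Versions.Count > 0;
    }
}
```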
With Lucene, I cannot find the detail item in a (currently) non-existent language version when I want to resolve it by its display name through Lucene (because there is no language version indexed yet).
EDIT for question 2
This goes hand in hand with question 1.
In relation to workflows, I see possible struggles with which version gets indexed. Is the first version in the index before you approve it? Otherwise the overview has no chance of showing the item until it is approved in the Content Editor.
Has anybody used Content Search for Page Editor-relevant actions and has some advice on how to get around such problems?
I've had the same issues with Sitecore 7, and while I don't have the ultimate solution for you, hopefully you will find the following information helpful.
I can't word my answer any better than this post: http://thegrumpycoder.com/post/75524076869/sitecore-content-search-beware-of-the-context
With Sitecore ContentSearch you choose which search index you would like to use. You are likely using sitecore_master_index when in Page Editor / Preview mode, and sitecore_web_index on the published site. As the web database only has one version of each item, you don't need to worry about there being multiple versions in the index. However, sitecore_master_index will by default index all versions of an item in all languages. You can then potentially see items showing up multiple times in your listing components if you're not careful.
Sitecore 7 has a field "_latestversion" which you can add to all your queries, but it isn't reliable for a couple of reasons:
The latest version isn't necessarily the correct one, taking into account things like publishing restrictions and what date you have selected in the "Experience" view.
Due to a bug I have often been able to cause there to be more than one version of an item in the index where _latestversion is 1. Not after a complete rebuild, but after an edit or two. I saw this in Sitecore 7.0 and I'm not sure if it's been fixed yet.
Read http://www.sitecore.net/Learn/Blogs/Technical-Blogs/Sitecore-7-Development-Team/Posts/2013/04/Sitecore-7-Inbound-and-Outbound-Filter-Pipelines.aspx for more information on how you can use "Inbound filters" to ensure only the latest version makes it into the master index, but bear in mind that doesn't really solve the core issue in my opinion as the latest version isn't necessarily the correct one.
So taking this, and the fact that you need language fallbacks, you should probably not filter these results out at the Lucene level, but do the necessary magic yourself in code. This would need to:
Group versions together by their items
Choose the right version based on the current language, date, security and workflow
Apply your desired fallback logic if said version isn't found
Somehow work pagination into this in a way which performs well (a rough sketch of the grouping step is below)
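As a starting point, a minimal sketch of that grouping step, assuming you query the master index without a language filter and resolve the right version afterwards (the template filter and the fallback rule are placeholders):

```csharp
using System.Linq;
using Sitecore.ContentSearch;
using Sitecore.ContentSearch.SearchTypes;

// Query without a language filter, then pick one entry per item in code.
using (var context = ContentSearchManager.GetIndex("sitecore_master_index").CreateSearchContext())
{
    var hits = context.GetQueryable<SearchResultItem>()
        .Where(r => r.TemplateName == "News") // whatever your real filters are
        .ToList();

    string contextLanguage = Sitecore.Context.Language.Name;

    var newsItems = hits
        .GroupBy(r => r.ItemId) // group the index entries by item
        .Select(g => g.FirstOrDefault(r => r.Language == contextLanguage) ?? g.First()) // prefer the context language, else fall back
        .Select(r => r.GetItem()) // resolve the actual Sitecore item
        .Where(item => item != null)
        .ToList();
}
```

Pagination is the awkward part, since you can only page reliably after the grouping, not at the index level.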
I also feel the following SO question is relevant:
Indexing Sitecore Item security and restricting returned search results - something else which can catch you out when you expect the Search API to work exactly in the same way as the Query API.
I'd be interested to know your thoughts and if you ever find a better solution! Thanks, Steve.

How to create SEO-friendly paging for a grid?

I've got this grid (a list of products in an internet shop), and I have no idea how big it can get. But I suppose a couple of hundred items is quite realistic, especially for search results. Maybe even thousands, if we get a big client. :)
Naturally, I should use paging for such a grid. But how to do it so that search engine bots can crawl all the items too? I very much like this idea, but that only has first/last/prev/next links. If a search engine bot has to follow links 200 levels deep to get to the last page, I think it might give up pretty soon, and not enumerate all items.
What is the common(best?) practice for this?
Is it really the grid you want to have indexed by the search engine, or are you after the product detail pages? If it's the latter, you can have a dynamic sitemap (XML) and the search engines will take it from there.
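Generating such a sitemap is straightforward; here is a minimal sketch (the product URLs and change frequency are assumptions):

```csharp
using System.Collections.Generic;
using System.Xml.Linq;

public static class SitemapBuilder
{
    // Builds a sitemap.xml document with one <url> entry per product detail page,
    // so crawlers can find every product without paging through the grid.
    public static XDocument Build(IEnumerable<string> productUrls)
    {
        XNamespace ns = "http://www.sitemaps.org/schemas/sitemap/0.9";

        var urlset = new XElement(ns + "urlset");
        foreach (var url in productUrls)
        {
            urlset.Add(new XElement(ns + "url",
                new XElement(ns + "loc", url),
                new XElement(ns + "changefreq", "weekly")));
        }

        return new XDocument(new XDeclaration("1.0", "UTF-8", null), urlset);
    }
}

// Usage: SitemapBuilder.Build(productUrls).Save("sitemap.xml");
```

Keep in mind that a single sitemap file is limited to 50,000 URLs, so a large catalogue needs a sitemap index pointing at several files.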
I run a number of price comparison sites, and as such I've had the same issue as you before. I don't really have a concrete answer; I doubt anyone will, to be honest.
The trick is to try and make each page as unique as possible. The more unique pages, the better. Think of it like this: each page in Google is a lottery ticket, and the more tickets you have, the better your chances of winning.
So, back to your question. We tend to display 20 products per page and then have pagination at the bottom. AFAIK Google and other bots will crawl all the links on your site; they won't give up. What we have noticed, though, is that if your subsequent pages have the same SEO titles and H tags and are basically the same page with a different result set, then Google will NOT add those pages to the index.
Likewise, I've looked at the site you suggested and would recommend changing the layout to use text rather than images; an example of what I mean is on this site: http://www.shopexplorer.com/lcd-tv/index.html
Another point to remember: the more images etc. on the page, the longer the page takes to load and the worse your UI will be. I've also heard it affects quality in SEO ranking algorithms.
Not sure if I've given you enough to go on, but to recap:
I would limit the results to 20-30 per page
I would use pagination, but with text links rather than images
I would make sure the paginated pages have distinct enough 'SEO markers' (title, h1, etc.) to count as unique pages.
e.g.
'LCD TV results page 2' is bad
'LCD TV results from Sony to Samsung' is better
Hopefully I've helped a little.
EDIT:
Vlix, I've also seen your question about sitemaps. If you're concerned about that, I wouldn't be; just split the feed into multiple separate feeds, maybe at the category level, brand level, etc. I'm not sure, but I think Google wants as many pages as possible. It will ignore the ones it doesn't like and just add the unique ones.
That, at least, is how I understand it.
SEO is a dark art - nobody will be able to tell you exactly what to do and how to do it. However, I do have some general pointers.
Pleun is right - your objective should be to get the robots to your product detail page - that's likely to be the most keyword-rich, so optimize this page as much as you can! Semantic HTML, don't use images to show text, the usual.
Construct meaningful navigation schemes to lead the robots (and your visitors!) to your product detail pages. So, if you have 150K products, let's hope they are grouped into some kind of hierarchy, and that each (sub)category in that hierarchy has a manageable (<50 or so) number of products. If your users have to go through lots and lots of pages in a single category to find the product they're interested in, they're likely to get bored and leave. Make this categorization into a navigation scheme, and make it SEO friendly - e.g. by using friendly URLs.
Create a sitemap - robots will crawl the entire sitemap, though they may not decide to pay much attention to pages that are hard to reach through "normal" navigation, even if they are in the sitemap.xml.
Most robots don't parse more than the first 50-100K of HTML. If your navigation scheme (with a data grid) is too big, the robot won't necessarily pick up or follow links at the end.
Hope this helps!

Problems using MySQL FULLTEXT indexing for programming-related data (SO Data Dump)

I'm trying to implement a search feature for an offline-accessible StackOverflow, and I'm noticing some problems with using MySQL's FULLTEXT indexing.
Specifically, by default FULLTEXT indexing is restricted to words between 4 and 84 characters long. Terms such as "PHP" or "SQL" would not meet the minimum length and searching for those terms would yield no results.
It is possible to modify the variable which controls the minimum length a word must be in order to be indexed (ft_min_word_len), but this is a system-wide change that requires the indexes in all databases to be rebuilt. On the off chance that others find this app useful, I'd rather keep this sort of variable as vanilla as possible. I found a post on this site the other day stating that changing that value is just a bad idea anyway.
Another issue is with terms like "VB.NET" where, as far as I can tell, the period in the middle of the term separates it into two indexed values - VB and NET. Again, this means searches for "VB.NET" would return nothing.
Finally, since I'm doing a direct dump of the monthly XML-based dumps, all values are converted to HTML Entities and I'm concerned that this might have an impact on my search results.
I found a blog post which tries to address these issues with the following advice:
keep two copies of your data - one with markup, etc. for display, and one modified for searching (remove unwanted words, markup, etc)
pad short terms so they will be indexed, I assume with a prefix/suffix (roughly as sketched below)
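For illustration, this is roughly the pre-processing I imagine for the second, search-only copy of the text (the three-character cut-off, the padding scheme and the tag stripping are just my assumptions):

```csharp
using System.Linq;
using System.Net;
using System.Text.RegularExpressions;

public static class SearchTextPreprocessor
{
    // Produces the text stored in the search-only column: HTML entities decoded,
    // markup stripped, and short terms padded so they clear MySQL's default
    // ft_min_word_len of 4.
    public static string Prepare(string html)
    {
        string text = WebUtility.HtmlDecode(html);  // "&lt;" -> "<", etc.
        text = Regex.Replace(text, "<[^>]+>", " "); // strip tags

        var words = Regex.Split(text, @"\W+")
                         .Where(w => w.Length > 0)
                         .Select(w => w.Length < 4 ? w + "zzz" : w); // pad "php" -> "phpzzz"

        return string.Join(" ", words);
    }
}

// The same padding has to be applied to short words in the user's query
// before running the MATCH ... AGAINST search.
```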
What I'd like to know is, are these really the best workarounds for these issues? It seems like semi-duplicating a > 1GB table is wasteful, but maybe that's just me.
Also, if anyone could recommend a good site for understanding MySQL's FULLTEXT indexing, I'd appreciate it. To keep this question from becoming too cluttered, please leave the site recommendations in the question comments, or email me directly via the site on my user profile.
Thanks!
Additional Info:
I think I should clarify a couple of things.
I know "MySQL" tends to lead to the assumption of "web application", but that's not what I'm going for here. I could install Apache and PHP and run things that way, but I'm trying to keep this light. I can use my website for playing with PHP, so I don't feel the need to install it on my home machine too. I also hope this could be useful for others as well, and I don't want to force anyone else into installing a bunch of extra utilities. I went with MySQL since it was easy and needing to install some sort of DB was unavoidable.
The specifics of the project were going to be:
Desktop application written in C# (WinForms)
MySQL backend
I'm starting to wonder if I should just say to hell with it and install everything I'd need to make this an (offline) web app. As much as we'd all like to think our pet project is going to be used and loved by the community at large, I should know by now that it will likely end up being used by a single user.
From what has already been said, I understand that MySQL full-text search is not for you ;) But why stick with MySQL? Try Sphinx:
http://www.sphinxsearch.com/
It will solve most of your problems.

Relevant Search Results Across Multiple Databases

I have three databases that all have the contents of several web pages in them. What would be the best way to go about searching all three and having the most relevant web page at the top of the search results?
The only way I can think of is to break the content down by word count and/or create a complex set of search rules to give some content priority over other content. This might be more trouble than it's worth, but I was wondering if anybody knows of a method or product out there that could help me.
To further support Ivan's answer above: Lucene is the way to go. You haven't mentioned what platform you're on, so I'll point out that there is a .NET port of it you can use too.
If you do use Lucene, there is a very good book from Manning on the subject which I recommend you look at.
When it comes to populating your index, you have a couple of choices. For starters, you can just dump all of your text into the index and let the engine search on it. However, I'd recommend adding fixed fields to your index, which will allow you to support things such as partitioned searches or searches against those fields only.
To explain: let's say you have a field for the website. You can then partition your index by restricting a search to those documents that have a particular website in that field.
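A minimal sketch of that partitioning with Lucene.Net (2.9-era API); the field names, analyzer and paths are assumptions:

```csharp
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Lucene.Net.Store;

var version = Lucene.Net.Util.Version.LUCENE_29;
var directory = FSDirectory.Open(new DirectoryInfo("search-index"));
var analyzer = new StandardAnalyzer(version);

// Indexing: each entry stores which website (database) it came from in a fixed,
// non-analyzed field alongside the searchable text.
var writer = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED);
var doc = new Document();
doc.Add(new Field("website", "site-a.example.com", Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.Add(new Field("content", "text of the web page ...", Field.Store.YES, Field.Index.ANALYZED));
writer.AddDocument(doc);
writer.Close();

// Searching: combine the user's query with a TermQuery on the website field,
// so only that partition of the index is considered.
var searcher = new IndexSearcher(directory, true);
var query = new BooleanQuery();
query.Add(new QueryParser(version, "content", analyzer).Parse("user search terms"), BooleanClause.Occur.MUST);
query.Add(new TermQuery(new Term("website", "site-a.example.com")), BooleanClause.Occur.MUST);
TopDocs hits = searcher.Search(query, 20);
```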
The other approach is to extract points of interest from your documents and allow searches on those without searching the entire index entry. Your mileage may vary with this; the Lucene engine is very well written, so it may simply be that this lets you collect your searches into more logical units, which helps with your solution.
I've done this myself and it helps when answering management questions about what exactly is searched and indexed.
HTH!
If you're using MS SQL Server, its full-text search can return a ranking for you. I haven't used it, so you'll need to check the documentation or search online for the specifics.