Orchard - Search & Indexing issue - lucene

I have a project built with Orchard CMS. All functionality is implemented through modules. The Search module was working until a few days ago, but it suddenly stopped working, "without any reason".
The issue is that I cannot rebuild or update the indexes. When I run indexing, it only indexes the default list of fields (id, title, body, format, type, author, created, published, modified, culture); my custom fields are not indexed.
I tried everything but without any success. I tried:
- Deleting Indexing/Search folder with all files
- Reinstalling Search/Indexing/Lucene modules
- Rebuilding and rebuilding indexes....
- Clearing solution and rebuilding...
I didn't extend any of Orchard modules, they are the same as when I downloaded them.
Any advice on this one...?
P.S. Yes, I already checked the custom fields that need to be indexed. :)
Thanks,

If you think the index is corrupted, delete App_data\Sites\Default\Search.settings.xml and App_data\Sites\Default\Indexes, then restart the app pool. You should then be able to rebuild the index.
Apparently you already did this, but for others who may not have, you also need to check the fields you want indexed under Settings/Search. This will include the fields in search.
But for the fields to be included in search, they need to be indexed first. For this, go to Content/Content Types and edit the content type the fields are on. Check "index this content type for search". Then expand the settings for each field you want indexed, and check "include in the index".

You'll need to run the "Recipe" to create the "Search" index.

It appears that Search + Lucene + Indexing works with Text Fields but not Numeric Fields.

When the search feature is enabled, the Settings screen in the dashboard displays the fields that will be queried from the index (listed on the Search screen).

TYPO3 v7.6.x migration to Drupal 8

I have to migrate a complex TYPO3 v7.6.30 website to Drupal 8.
So far I have investigated how TYPO3's administration part works.
I've also been digging into the TYPO3 database to find the correct mapping pattern, but I just don't seem to be getting anywhere.
My question is whether there is a nice way to map/join all of the content with its images/files/categories, so that I can get all page content row by row, like:
title
description
text fields
images
documents
tables
...
So in the end I will end up with a joined table with all of the data for each page on a single row, which then I can map in the migration.
I need a smooth way to map the pages with their fields.
I need the same for users (haven't researched this one yet).
The same goes for the nesting of the pages, so that I can recreate the menus in the new CMS.
Any help on this will be highly appreciated.
You need a detailed plan of the configuration and a solid understanding of how TYPO3 works.
Here is a basic introduction:
All content is organized in records and the main table is pages, the pagetree.
For nearly all records you have some common fields:
uid: unique identifier
pid: page ID (the page in which the record is 'stored', important for editing; pages themselves are also stored in pages, which builds the page tree)
title: name of the record
hidden, deleted, starttime, endtime, fe_group: visibility
there are fields for
versioning and workspaces
language support
sorting
some records (especially tt_content) have type fields, which determine how the record is used and which of its fields are used
there are relations to files (which are represented by sys_file records) and to other records such as file metadata or categories.
Aside from the default content elements, where the data is stored in the tt_content record itself, you can have plugins which display other records (e.g. news, addresses, events, ...) or which get their data from another application or server.
You need to understand the complete configuration to preserve everything.
What you might need is a special rendering of the pages.
That is doable with TYPO3: aside from the default HTML-rendering you can define other page types where you can get the content in any kind you define. e.g. xml, json, CSV, ...
This needs detailed knowledge of the individual TYPO3 configuration. So nobody can give you a full detailed picture of your installation.
And of course you need good knowledge of your Drupal target installation to answer the question 'what information should be stored where?'
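As a rough illustration of the uid/pid relationship described above, here is a minimal Python sketch that rebuilds a nested page tree (for example, to recreate menus) from rows shaped like the pages table. The sample rows are made up, and a real installation would also need to account for language, workspace, and visibility fields, which this sketch mostly ignores.

```python
from collections import defaultdict

# Hypothetical rows, shaped like: SELECT uid, pid, title, sorting, deleted FROM pages
rows = [
    {"uid": 1, "pid": 0, "title": "Home", "sorting": 1, "deleted": 0},
    {"uid": 2, "pid": 1, "title": "Products", "sorting": 1, "deleted": 0},
    {"uid": 3, "pid": 1, "title": "About", "sorting": 2, "deleted": 0},
    {"uid": 4, "pid": 2, "title": "Widgets", "sorting": 1, "deleted": 0},
    {"uid": 5, "pid": 1, "title": "Old page", "sorting": 3, "deleted": 1},
]


def build_tree(rows, root_pid=0):
    """Nest pages under their parent, using pid as the parent's uid."""
    children = defaultdict(list)
    for row in rows:
        if not row["deleted"]:  # skip soft-deleted records
            children[row["pid"]].append(row)

    def attach(pid):
        return [
            {"uid": r["uid"], "title": r["title"], "children": attach(r["uid"])}
            for r in sorted(children[pid], key=lambda r: r["sorting"])
        ]

    return attach(root_pid)


def flatten(tree, depth=0):
    """Render the tree as indented titles, one per line."""
    lines = []
    for node in tree:
        lines.append("  " * depth + node["title"])
        lines.extend(flatten(node["children"], depth + 1))
    return lines


print("\n".join(flatten(build_tree(rows))))
```

The same parent lookup could be applied to tt_content rows, whose pid points at the page they belong to, to collect the content elements for each page.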

SOLR Atomic update of custom stored and index metadata clears full-text index

I use bin/post to index all my files in /documents (mounted volume). It works and full-text search works fine.
I do an atomic update for specific metadata that I added to the schema BEFORE posting all docs, it works too.
I do a full-text search to find back the document for which the metadata has been updated, it DOESN'T work anymore, the updates are there but it seems that the full-text index has disappeared.
I do a full re-index and then it overrides my added metadata for the doc, resetting it to the default value. Although the metadata field I added is both stored and indexed.
Not sure what to do. That means that each reindexing will reset my added metadata...not great
The update, under the hood, reconstructs the document from its stored fields, applies the changes, and writes the result back to disk. At the Lucene level there is no "document update"; it is a higher-level concept. That's how the search indexes stay fast in this architecture.
So your full-text field, which is not stored, does not show up in the reconstructed document and is not indexed again in the "updated document".
If you have such a mix of stored and non-stored fields, you have to merge your updates outside of Solr, starting from the original full content.
Alternatively, depending on your use case, if you are just returning those updated values, you could inject them with a custom SearchComponent, use ExternalFileField, or similar. The user mailing list could be a good place to ask about the various options.
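To make the mechanism concrete, here is a toy Python model of the behaviour described above. It is not Solr's API: the ToyIndex class, the field names, and the choice of which fields are stored are all assumptions made for illustration. It shows a non-stored field disappearing after an update that rebuilds the document from stored fields, and the "merge outside Solr" workaround restoring it.

```python
# Hypothetical schema: these fields are stored; "content" is indexed but NOT stored.
STORED_FIELDS = {"id", "title", "reviewed"}


class ToyIndex:
    """Toy stand-in for an index; only mimics stored vs. indexed-only fields."""

    def __init__(self):
        self.stored = {}   # doc id -> stored field values
        self.indexed = {}  # doc id -> set of searchable terms

    def add(self, doc):
        """Index every field, but keep only the stored ones for later retrieval."""
        doc_id = doc["id"]
        self.stored[doc_id] = {k: v for k, v in doc.items() if k in STORED_FIELDS}
        terms = set()
        for value in doc.values():
            terms.update(str(value).lower().split())
        self.indexed[doc_id] = terms

    def atomic_update(self, doc_id, changes):
        """Rebuild the document from stored fields only, then re-add it."""
        rebuilt = dict(self.stored[doc_id])
        rebuilt.update(changes)
        self.add(rebuilt)  # replaces the old entry; non-stored fields are gone

    def search(self, term):
        return sorted(d for d, terms in self.indexed.items() if term.lower() in terms)


index = ToyIndex()
original = {"id": "doc1", "title": "Report",
            "content": "quarterly revenue figures", "reviewed": "no"}
index.add(original)
before = index.search("revenue")

index.atomic_update("doc1", {"reviewed": "yes"})
after_update = index.search("revenue")

# Workaround: merge the change into the full original document outside the index.
index.add(dict(original, reviewed="yes"))
after_merge = index.search("revenue")

print(before, after_update, after_merge)
```

In this model the full-text term is findable before the update, lost after it, and findable again once the merged full document is re-added.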

Solr configuration on Heroku

I am using WebSolr Cobalt on Heroku.
Search works if I search for either the first letter or the full word, but not for a partial word.
Any help?
To enable partial word searching, you must edit your local schema.xml file, usually under solr/config, to add either:
NGramFilterFactory
EdgeNGramFilterFactory
Here's what mine looks like - sample schema.xml
EdgeNGram
I went with the EdgeN option. It doesn't allow for searching in the middle of words, but it does allow partial word search starting from the beginning of the word. This cuts way down on false positives / matches you don't want, performs better, and is usually not missed by the users. Also, I like the minGramSize=2 so you must enter a minimum of 2 characters. Some folks set this to 3.
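To illustrate the difference between the two options, here is a small Python sketch of what edge n-grams and plain n-grams of a token look like. This is only a model of the idea, not the Solr filters themselves; the max_gram default of 15 is an arbitrary assumption, and the order in which Solr emits grams may differ.

```python
def edge_ngrams(token, min_gram=2, max_gram=15):
    """Prefixes of the token, from min_gram up to max_gram characters."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


def ngrams(token, min_gram=2, max_gram=15):
    """All substrings of the token with lengths min_gram..max_gram."""
    upper = min(max_gram, len(token))
    return [
        token[i:i + n]
        for n in range(min_gram, upper + 1)
        for i in range(len(token) - n + 1)
    ]


print(edge_ngrams("solr"))   # prefixes only
print(ngrams("solr", 2, 3))  # includes fragments from the middle of the word
```

With edge n-grams, a query for "so" or "sol" matches "solr" but "ol" does not, which is the behaviour described above; the plain n-gram variant also produces "ol" and "lr", at the cost of a larger index and more unwanted matches.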
Once your local setup is working, you must edit the schema.xml used by websolr; otherwise you will get the default behavior, which requires the full word to be entered even if you have full-text searching configured for your models.
To edit the websolr schema.xml
Go to the Heroku online dashboard for your app
Go to the resources tab, then click on the Websolr add-on
Click the default link under Indexes
Click on the Advanced Configuration link
Paste in the schema.xml from your local setup, including the config for your n-gram filter of choice (mentioned above). Save.
Copy the link in the "Configure your Heroku application" box, then paste it into terminal to set your WEBSOLR_URL link in your heroku config.
Click the Index Status link to get nifty stats and see if you are running fast or slow.
Reindex everything
heroku run rake sunspot:reindex[5000]
Don't use heroku run rake sunspot:solr:reindex - it is deprecated, accepts no parameters and is WAY slower
Default batch size is 50, most people suggest using 1000, but I've seen significantly faster results (1000 rows per second as opposed to around 500 rps) by bumping it up to 5000+
Take it to the next level
5 ways to speed up indexing

Sitecore System Lucene Index for custom queries

I have been using Sitecore query and FAST query for some sections of the website. But with growing content these queries have gotten slow and I'd like to implement Lucene querying for content to speed up things.
I am wondering if I can just use the System index instead of having to setup a separate index. Does Sitecore by default index all content in the content editor? Is this a good approach or should I just create my own index?
(I'm going to assume you're using Sitecore 6.4 to 6.6.)
As with everything, it depends. Sitecore keeps an index of all Sitecore items in its system index, and you are welcome to use that. Sometimes, though, you may want a more specialised or restricted set of items indexed, for example only items based on a certain template, or you may need a checkbox field indexed (the system index by default only indexes text fields).
Setting up your own search index is pretty easy. It does require some fiddling with the web.config, though (and I'd recommend adding it as a .include file).
Create a new <index> node with its own id, which defines the name of the collection and the folder it will go into. (You can check it's working by looking for the directory under /data/indexes in your installation.)
Next, you can tell the crawler which database to look at (most likely master if you want unpublished content to be indexed, or web for published content) and where to start the search from (in this example I am indexing only the news section). You can tag, boost, and specify whether to IndexAllFields (otherwise it will only index fields it understands as text: rich text, multi-line text, text, etc.).
Finally, you can tell the indexer which template types to include or exclude.
The indexer works by subscribing to item events within Sitecore, so every time an item is changed, moved, or deleted, the index is updated automatically. Obviously, if you are indexing the web database, the items will need to have been published.
More in-depth info on the query syntax & indexing can be found here on SDN.
The search syntax and API is much improved in 6.4/6.5 but if you want to add extra kick then my colleague Alex Shyba's Advanced Database Crawler is worth checking out too.
Hope this helps :D
You will want to implement your own index. For the same reason that your queries slow down when there is a lot of content, indexes also slow down when there is a lot of content in them.
I prefer targeted indexes that are meant specifically to drive the functionality I need and that contain only the data required. This allows for smaller and more efficient index usage in your components.
Additionally, you probably want to look into the AdvancedDatabaseCrawler put together by Alex Shyba. There are a few blogs out there with some great posts on implementing this Lucene indexing module.
A separate index is always a wise decision because you can keep it light. In big environments the system index can grow to gigabytes.
You can exclude the content from the index, as you will only be using it for performing lookups, not showing content from the index.
Finally: the system index is for the master database, you'll be querying the web database, possibly on a content delivery server.

Updating Lucene index from two different threads in a web application

I have a .NET web application which uses Lucene.Net for company search functionality.
When registered users add a new company, it is saved to the database and also gets indexed in the Lucene-based company search index in real time.
When adding a company to the Lucene index, how do I handle the case of two or more logged-in users posting a new company at the same time? Also, will both of these companies get indexed without any file lock, lock timeout, or similar issues?
I would appreciate help with code as well.
Thanks.
By default Lucene.Net has inbuilt index locking using a text file. However if the default locking mode isn't good enough then there are others that you can use instead (which are included in the Lucene.Net source code).
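As a rough illustration of the lock-file idea mentioned above, here is a minimal Python sketch in which two threads add a company at the same time and a lock file serialises the writes. This is not Lucene.Net's implementation: the FileLock class, the timeout values, and the companies.txt "index" are assumptions for illustration, and only the write.lock file name is borrowed from Lucene.

```python
import os
import tempfile
import threading
import time


class LockTimeout(Exception):
    """Raised when the lock file cannot be obtained in time."""


class FileLock:
    """Hypothetical lock: whoever manages to create the file holds the lock."""

    def __init__(self, path, timeout=5.0, poll=0.01):
        self.path = path
        self.timeout = timeout
        self.poll = poll

    def __enter__(self):
        deadline = time.monotonic() + self.timeout
        while True:
            try:
                # O_EXCL makes creation fail if the file already exists,
                # so only one caller at a time can hold the lock.
                fd = os.open(self.path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
                os.close(fd)
                return self
            except FileExistsError:
                if time.monotonic() >= deadline:
                    raise LockTimeout(self.path)
                time.sleep(self.poll)  # another writer holds the lock; retry

    def __exit__(self, exc_type, exc, tb):
        os.remove(self.path)  # release the lock
        return False


def add_company(index_dir, name):
    """Append one company to the toy index while holding the lock."""
    with FileLock(os.path.join(index_dir, "write.lock")):
        with open(os.path.join(index_dir, "companies.txt"), "a") as f:
            f.write(name + "\n")


index_dir = tempfile.mkdtemp()
threads = [threading.Thread(target=add_company, args=(index_dir, name))
           for name in ("Acme", "Globex")]
for t in threads:
    t.start()
for t in threads:
    t.join()

with open(os.path.join(index_dir, "companies.txt")) as f:
    indexed = sorted(f.read().split())
print(indexed)
```

Both companies end up in the toy index because the second writer waits and retries instead of failing. In a real application, Lucene's IndexWriter is designed to be shared safely across threads, so keeping a single writer open for the whole web application is a common way to avoid contending on the lock at all.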