How to locate index in AEM 6.2 - lucene

How to confirm if these users in section1 (/home/users/section1) got added to the AEM index?
I created a QueryBuilder query that returns all users under section1, but how can I tell whether those users were added to the AEM index? Is there a better way to check? What exactly should I look for in the QueryBuilder response that tells me they are indexed in AEM?
curl -s -u username:password 'http://localhost:4502/bin/querybuilder.json?path=/home/users/section1&jcr:primaryType=rep:AuthorizableFolder&1_property=jcr:createdBy&1_property.value=admin&1_property.operation=like&p.limit=-1'
The response:
{"success":true,"results":57654,"total":57654,"more":false,"offset":0,"hits":[{"path":"/home/users/section1/useremail1#hotmail.com","excerpt":"","name":"useremail1#hotmail.com","title":"useremail1#hotmail.com","lastModified":"2017-09-09 14:59:23","created":"2017-09-26 03:03:07"}, ....etc
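As a rough illustration of the request above, the same predicates can be percent-encoded and the response summarized in a few lines of Python. This is only a sketch: the helper names and the shortened sample response are made up for the example, and a hit count by itself shows that the nodes are queryable, not which index served the query.

```python
import json
from urllib.parse import urlencode


def build_querybuilder_url(base_url, predicates):
    """Return a QueryBuilder JSON URL with every predicate percent-encoded."""
    return f"{base_url}/bin/querybuilder.json?{urlencode(predicates)}"


def summarize_response(body):
    """Extract the reported total and the returned paths from a response body."""
    data = json.loads(body)
    return data["total"], [hit["path"] for hit in data["hits"]]


# Predicates copied from the curl command in the question.
predicates = {
    "path": "/home/users/section1",
    "jcr:primaryType": "rep:AuthorizableFolder",
    "1_property": "jcr:createdBy",
    "1_property.value": "admin",
    "1_property.operation": "like",
    "p.limit": "-1",
}
url = build_querybuilder_url("http://localhost:4502", predicates)

# Shortened, hypothetical response in the same shape as the one in the question.
sample = (
    '{"success":true,"results":1,"total":1,"more":false,"offset":0,'
    '"hits":[{"path":"/home/users/section1/user1"}]}'
)
total, paths = summarize_response(sample)
print(url)
print(total, paths)
```

Encoding the parameters this way avoids the shell treating each & as a command separator, which is what happens when the URL is passed to curl unquoted.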

For user nodes, there is an out-of-the-box (OOTB) index located at /oak:index/users. It is a Lucene index and indexes all rep:User nodes. You can browse the content of a Lucene index with a GUI tool called Luke.
For completeness, here is a high-level guide based on the assumptions above. Hope it helps.
1. Locate the physical index files on disk (look for the entry for /oak:index/users) using the IndexCopier JMX bean:
   localhost:4502/system/console/jmx/org.apache.jackrabbit.oak%3Aname%3DIndexCopier+support+statistics%2Ctype%3DIndexCopierStats
2. Get the Lucene codec that corresponds to your Oak version. I built the oak-lucene-xxx.jar from the Oak project (https://github.com/apache/jackrabbit-oak/tags).
3. Download and run Luke (https://jackrabbit.apache.org/oak/docs/query/lucene.html#luke).
4. In Luke, go to the Documents tab and browse by the term :path (quick tip: you can just type /home/users/section1 and hit Enter).


Templates used on each page

I have several different subsites using various templates. I need to identify the templates used across the entire site. Is there a report or an API call that I can make against my Sitecore site? I am new to Sitecore. I know the locations of the articles, but we have hundreds of them.
There are several ways this can be done. Here are a few options that may be useful depending on your needs:
In the Content Editor, you can select the root node of your site and click the Search icon next to the Content tab. Perform a blank search and you'll get a faceted search result with each template (see the Template section on the right).
Using the Solr UI, you can perform a query where _path equals the site root item id. Note that the item id must be all lowercase without braces and dashes. It may be worth rebuilding the master index before doing this, as the index may be outdated.
If your Sitecore instance has Sitecore PowerShell Extensions (SPE) installed, you can simply query the path, such as Get-ChildItem -Path 'master:/sitecore/content/path/to/site' -Recurse -WithParent and then chain that result to a simple Format-Table or do more fancy stuff with it.
If you want to query a remote machine and play around with local code, you could use SPE remoting (basically the same as above, but from a remote host) or the Sitecore RESTful API for the ItemService (https://doc.sitecore.com/xp/en/developers/93/sitecore-experience-manager/the-restful-api-for-the-itemservice.html). Simply traverse the https://sitecore-host-name/sitecore/api/ssc/item/{itemid}/children (after auth).
If you're more into SQL, you could query the master database directly, for example SELECT i.ID, i.Name, i.TemplateID FROM Items i JOIN Descendants d ON i.Id=d.Descendant WHERE d.Ancestor='SiteRootItemId'. Note that the Sitecore databases are pretty complex and a lot happens in the API layers, so avoid making updates in the database or writing application code that talks to it directly. For investigation purposes, though, I think it's fine to query the database. Also, the Descendants table may not be fully up to date, so it's good to run a "Rebuild Descendants" task on the master database before running such a query. It can be done in the admin console at /sitecore/admin/DBCleanup.aspx.
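For the Solr option, the ID conversion and a faceted query can be sketched as follows. This is an illustration only: the item ID is a made-up placeholder, and the _template field name is an assumption about the index schema that should be checked against your own core.

```python
from urllib.parse import urlencode


def to_index_id(item_id):
    """Lowercase a Sitecore item ID and strip its braces and dashes."""
    return item_id.strip("{}").replace("-", "").lower()


def template_facet_params(site_root_id, template_field="_template"):
    """Build Solr select parameters that count matching items per template.

    template_field is an assumed field name; adjust it to your schema.
    """
    return {
        "q": f"_path:{to_index_id(site_root_id)}",
        "rows": 0,  # only the facet counts are of interest
        "facet": "true",
        "facet.field": template_field,
    }


# Hypothetical site root ID, used purely for illustration.
root_id = "{AAAAAAAA-BBBB-CCCC-DDDD-EEEEEEEEEEEE}"
print(to_index_id(root_id))  # aaaaaaaabbbbccccddddeeeeeeeeeeee
print(urlencode(template_facet_params(root_id)))
```

The printed query string can be appended to the select handler of the master index core in the Solr UI or fetched with any HTTP client.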

RDF4J rdf lucene configuration

I have been trying for some time to configure my Sesame RDF repository (now called RDF4J) to support full-text queries.
I did not find much documentation about this configuration. I think I need to create a template file that I can then use with the console. Here is the little information I found on the topic: https://groups.google.com/forum/#!topic/rdf4j-users/xw2UJCziKl8
Does anybody have any information about configuring RDF4J with Lucene? Any clue would be very much appreciated. Otherwise, I would consider switching the whole repository to another one, for example Virtuoso.
Thanks in advance,
You need to do the following:
Start rdf4j-server. I used rdf4j-server.war (and rdf4j-workbench.war). My URL was http://127.0.0.1:8080/rdf4j-server
Put a lucene.ttl template into ~/.RDF4J/console/templates, where "~" is your home directory
Adjust the settings in this file as needed
Then start the console from the RDF4J distribution
Then run the following commands in the console:
connect http://127.0.0.1:8080/rdf4j-server
create lucene
Enter lucene repository Id: myRepositoryId
Enter lucene repository name: myRepositoryName
You can then see the created repository in http://127.0.0.1:8080/rdf4j-workbench. Once you add some triples, you can use Lucene search.
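Once the repository exists, a full-text query can be sent to it over the server's SPARQL endpoint. The sketch below only builds the query text and the request URL. It assumes the LuceneSail search vocabulary (search:matches, search:query, search:score), the server address and repository ID from the steps above, and helper names invented for this example.

```python
from urllib.parse import urlencode

# Namespace of the LuceneSail search vocabulary.
SEARCH_NS = "http://www.openrdf.org/contrib/lucenesail#"


def lucene_sparql(term):
    """Return a SPARQL query asking the Lucene index for subjects matching term."""
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    return (
        f"PREFIX search: <{SEARCH_NS}>\n"
        "SELECT ?subject ?score WHERE {\n"
        "  ?subject search:matches [\n"
        f'    search:query "{escaped}" ;\n'
        "    search:score ?score ] .\n"
        "}"
    )


def query_url(server, repository_id, term):
    """Build the GET URL that runs the query against one repository."""
    encoded = urlencode({"query": lucene_sparql(term)})
    return f"{server}/repositories/{repository_id}?{encoded}"


print(lucene_sparql("alice"))
print(query_url("http://127.0.0.1:8080/rdf4j-server", "myRepositoryId", "alice"))
```

The same query text can also be pasted into the workbench query page for the repository.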

cluster remote lucene index with dcs

I am trying to use a Lucene index on a remote server as input for Carrot2 installed on the same server. According to the documentation, this should be possible with carrot2-dcs (documentation chapter 3.4 Carrot2 Document Clustering Server: Various document sources included. Carrot2 Document Clustering Server can fetch and cluster documents from a large number of sources, including major search engines and indexing engines (Lucene, Solr)).
After installing carrot2-dcs 3.9.3 I discovered that Lucene isn't available as a document source. How should I proceed?
To cluster content from a Lucene index, the index needs to be available on the server the DCS is running on (either through the local file system or, for example, as an NFS mount).
To make the Lucene source visible in the DCS:
Open for editing: war/carrot2-dcs.war/WEB-INF/suites/source-lucene-attributes.xml
Uncomment the configuration sections and provide the location of your Lucene index and the fields that should serve as the documents' titles and content (at least one is required). Remember that the fields must be "stored" in Lucene speak.
Make sure the edited file is packed back to the WAR archive and run the DCS. You should now see the Lucene document source.
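If you prefer to script the repacking step instead of using an archive tool, a WAR is an ordinary zip file, so something along these lines should work. It is a minimal sketch: the helper name is invented, and the demo runs on a throwaway archive standing in for carrot2-dcs.war rather than on a real DCS distribution.

```python
import os
import tempfile
import zipfile

# Path of the attributes file inside the WAR, as named in the steps above.
ENTRY = "WEB-INF/suites/source-lucene-attributes.xml"


def replace_war_entry(war_path, out_path, entry_name, new_bytes):
    """Copy a WAR to out_path, swapping one entry for new content.

    Returns True if the entry was found and replaced.
    """
    replaced = False
    with zipfile.ZipFile(war_path) as src, \
            zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            if info.filename == entry_name:
                dst.writestr(info, new_bytes)
                replaced = True
            else:
                dst.writestr(info, src.read(info.filename))
    return replaced


# Demo on a temporary archive; the XML content here is a placeholder.
with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "original.war")
    updated = os.path.join(tmp, "updated.war")
    with zipfile.ZipFile(original, "w") as war:
        war.writestr(ENTRY, "<!-- commented-out config -->")
    print(replace_war_entry(original, updated, ENTRY, b"<edited/>"))
    with zipfile.ZipFile(updated) as war:
        print(war.read(ENTRY))
```

Writing to a new file and swapping it in afterwards keeps the original WAR intact if something goes wrong halfway through.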

Solr configuration on Heroku

I am using WebSolr Cobalt on Heroku.
Search works if I enter either the first letter or the full word, but not for partial words.
Any help?
To enable partial word searching
You must edit your local schema.xml file, usually under solr/config, to add either:
NGramFilterFactory
EdgeNGramFilterFactory
Here's what mine looks like - sample schema.xml
EdgeNGram
I went with the EdgeN option. It doesn't allow for searching in the middle of words, but it does allow partial word search starting from the beginning of the word. This cuts way down on false positives / matches you don't want, performs better, and is usually not missed by the users. Also, I like the minGramSize=2 so you must enter a minimum of 2 characters. Some folks set this to 3.
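To make the difference between the two filters concrete, here is a small Python illustration of which fragments each approach generates for a token. It mimics the idea only, not Solr's implementation or its output order, and the max_size default is an arbitrary value chosen for the example.

```python
def edge_ngrams(token, min_size=2, max_size=15):
    """Prefixes of token, from min_size up to max_size characters."""
    upper = min(max_size, len(token))
    return [token[:n] for n in range(min_size, upper + 1)]


def ngrams(token, min_size=2, max_size=15):
    """Every substring of token between min_size and max_size characters."""
    upper = min(max_size, len(token))
    return [
        token[start:start + n]
        for n in range(min_size, upper + 1)
        for start in range(len(token) - n + 1)
    ]


print(edge_ngrams("solr"))   # ['so', 'sol', 'solr']
print(ngrams("solr", 2, 3))  # ['so', 'ol', 'lr', 'sol', 'olr']
```

With edge n-grams a search for "ol" finds nothing because only prefixes are indexed, while plain n-grams would match it. That is the trade-off between fewer false positives and mid-word matching described above.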
Once your local setup is working, you must edit the schema.xml used by websolr. Otherwise you will get the default behavior, which requires the full word to be entered even if you have full-text searching configured for your models.
To edit the websolr schema.xml
Go to the Heroku online dashboard for your app
Go to the resources tab, then click on the Websolr add-on
Click the default link under Indexes
Click on the Advanced Configuration link
Paste in your local schema.xml, including the config for your n-gram filter of choice (mentioned above). Save.
Copy the link in the "Configure your Heroku application" box, then paste it into a terminal to set WEBSOLR_URL in your Heroku config.
Click the Index Status link to get nifty stats and see if you are running fast or slow.
Reindex everything
heroku run rake sunspot:reindex[5000]
Don't use heroku run rake sunspot:solr:reindex. It is deprecated, accepts no parameters, and is WAY slower.
The default batch size is 50. Most people suggest using 1000, but I've seen significantly faster results (1000 rows per second as opposed to around 500) by bumping it up to 5000+.
Take it to the next level
5 ways to speed up indexing

Sitecore System Lucene Index for custom queries

I have been using Sitecore query and fast query for some sections of the website. But with growing content these queries have become slow, and I'd like to implement Lucene querying for content to speed things up.
I am wondering if I can just use the System index instead of having to set up a separate index. Does Sitecore by default index all content in the Content Editor? Is this a good approach, or should I just create my own index?
(I'm going to assume you're using Sitecore 6.4 to 6.6.)
As with everything, it depends. Sitecore keeps an index of all Sitecore items in its system index, and you are welcome to use that. Sometimes you may want a more specialised or restricted set of items indexed, for example only items based on a certain template, or you may need a checkbox field indexed (the system index by default only indexes text fields).
Setting up your own search index is pretty easy. It does require some fiddling with the web.config though (and I'd recommend adding it as a .include file).
Create a new <index> node with its own id, which defines the name of the collection and the folder it will go into. (You can check that it's working by looking for the directory in the /data/indexes directory of your installation.)
Next, you can tell the crawler which database to look at (most likely master if you want unpublished content to be indexed, or web for published content) and where to start crawling from (in this example I am indexing only the news section). You can tag, boost, and specify whether to IndexAllFields (otherwise it will only index fields it understands as text: rich text, multi-line text, text, etc.).
Finally, you can tell the indexer which template types to include or exclude.
The indexer works by subscribing to item events within Sitecore, so every time an item is changed, moved, or deleted the index is updated automatically. Obviously, if you are indexing the web database, the items need to have been published.
More in-depth info on the query syntax & indexing can be found here on SDN.
The search syntax and API are much improved in 6.4/6.5, but if you want some extra kick, my colleague Alex Shyba's Advanced Database Crawler is worth checking out too.
Hope this helps :D
You will want to implement your own index. For the same reason that things slow down when there is a lot of content, indexes also slow down when they hold a lot of content.
I prefer targeted indexes built specifically to drive the functionality I need and containing only the data that is required. This allows for smaller and more efficient index usage in your components.
Additionally, you probably want to look into the AdvancedDatabaseCrawler put together by Alex Shyba. There are a few blogs out there with some great posts on implementing this Lucene indexing module.
A separate index is always a wise decision because you can keep it light. In big environments the system index can grow to gigabytes.
You can exclude the content from the index, as you will only be using it for performing lookups, not showing content from the index.
Finally, the system index is for the master database, whereas you'll be querying the web database, possibly on a content delivery server.