How can I exclude certain URLs in Solr / Lucene

I have set up a new Solr instance that indexes a website. I want Solr NOT to index certain URL patterns. Is there a way to specify such an exclude pattern?
Regards,
Paras

It can be done in the indexing program: index a document only if its URL does not match the exclude pattern.
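A minimal sketch of that idea in Python, assuming the indexing program handles documents as dicts with a `url` field. The patterns and the function names `should_index` and `filter_documents` are hypothetical and only illustrate the check that would run before a document is sent to Solr.

```python
import re

# Hypothetical exclude patterns; replace with the URL patterns you want to skip.
EXCLUDE_PATTERNS = [
    re.compile(r"/admin/"),
    re.compile(r"\.pdf$"),
    re.compile(r"[?&]sessionid="),
]


def should_index(url):
    """Return True when the URL matches none of the exclude patterns."""
    return not any(pattern.search(url) for pattern in EXCLUDE_PATTERNS)


def filter_documents(docs):
    """Keep only the documents whose 'url' field passes the exclude check."""
    return [doc for doc in docs if should_index(doc["url"])]
```

Only the documents returned by `filter_documents` would then be posted to Solr.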

You can do the filtering in Solr using an UpdateRequestProcessor. In that UpdateRequestProcessor, you can decide whether or not to index the document depending on whether it matches your regex.

Do you have a crawler going about and collecting data? I would lean towards doing that logic in the crawler. Solr is more of a repository, and I don't think it is the best place to put lots of indexing logic.
Eric

Related

Can we use lucene query to have fts_alfresco search?

I want to upgrade my Alfresco server to 5.2, and all my custom webscripts use lucene queries. Since lucene indexing has been removed from Alfresco 5.x and solr indexing is not instantaneous, I am planning to use fts_alfresco search. While testing I found that a few lucene queries can be used for fts_alfresco search without modification. So my question is: will I be able to do fts_alfresco search using lucene queries? If not, is there a better way to migrate all my lucene queries to fts_alfresco?
Thanks in advance.

You will need to test your queries, since there are small differences (for instance, the date range query is not the same), but in general there's no reason why you would not be able to use FTS.
I'm not sure comprehensive documentation listing all those small differences exists, though. If you find it, please share.
"Alfresco FTS is compatible with most, if not all of the examples here.."
https://community.alfresco.com/docs/DOC-4673-search

Creating Lucene Index in a Database - Apache Lucene

I am using grails searchable plugin. It creates index files on a given location. Is there any way in searchable plugin to create Lucene index in a database?

Generally, no.
You can probably attempt to implement your own format but this would require a lot of effort.

I am no expert in Lucene, but I know that it is optimized to offer fast search over the filesystem. So it would be theoretically possible to build a Lucene index over the database, but the main feature of Lucene (being a VERY fast search engine) would be lost.

As a point of interest, Compass supported storage of a Lucene index in a database, using a JdbcDirectory. This was, as far as I can figure, just a bad idea.
Compass, by the way, is now defunct, having been replaced by ElasticSearch.

Elastic search over an already existing lucene index

I have a system that uses lucene. Now, for a few reasons, I would like to add a distributed search feature on top of it.
The question is: can I use the existing lucene index created by lucene's IndexWriter for searching with elastic search, or should I create a new index using ES's IndexWriter?
P.S. I discovered on the web that this is possible with solr, but I couldn't find anything tangible for es. Any help would be appreciated.

You need to reindex into ElasticSearch; you can't reuse an existing Lucene index.

Couple o' quick questions on Apache Lucene

-- I don't want to start any religious wars, but a quick Google search indicates that Apache Lucene is the preferred open source tool for indexing and searching. Are there others?
-- What file format does Lucene use to store its index file(s)?
Thanks in advance.
Doug

See "Which are the best alternatives to Lucene?" As a Lucene user I can say its performance has improved a lot over the last couple of versions (NOT meaning it was slow before!).
It uses its own proprietary index format.

I suggest you look at Sphinx.
I have experience with Lucene.net, and we had many problems with multithreaded indexing. Lucene stores its index in files, and these files can be locked by antivirus software.
Also, you cannot compare numbers in Lucene: it is impossible to filter products by size and price.

Is there a set of best practices for building a Lucene index from a relational DB?

I'm looking into using Lucene and/or Solr to provide search in an RDBMS-powered web application. Unfortunately for me, all the documentation I've skimmed deals with how to get the data out of the index; I'm more concerned with how to build a useful index. Are there any "best practices" for doing this?

Will multiple applications be writing to the database? If so, it's a bit tricky; you have to have some mechanism to identify new records to feed to the Lucene indexer.
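One possible mechanism, sketched here under assumptions the answer does not state, is a "high-water mark": every table carries an `updated_at` column, and the indexer remembers the newest value it has already processed. The `articles` table, its columns, and both function names are hypothetical.

```python
import sqlite3


def fetch_changed_rows(conn, last_indexed_at):
    """Return rows modified after the previous indexing run, oldest first."""
    cur = conn.execute(
        "SELECT id, title, updated_at FROM articles "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_indexed_at,),
    )
    return cur.fetchall()


def next_watermark(rows, last_indexed_at):
    """Advance the watermark to the newest timestamp seen, if any."""
    return rows[-1][2] if rows else last_indexed_at
```

This only works if every writing application updates `updated_at` reliably; a database trigger is one way to enforce that regardless of which application writes.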
Another point to consider is whether you want one index that covers all of your tables, or one index per table. In general, I recommend one index, with a field in that index to indicate which table the record came from.
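As a rough illustration of the single-index layout, the sketch below turns rows from any table into dicts that carry a `source_table` field. The field names `doc_id` and `source_table` and the function `rows_as_documents` are hypothetical, and it assumes every table has an integer `id` column and that table and column names come from trusted code, not user input.

```python
import sqlite3


def rows_as_documents(conn, table, text_columns):
    """Turn each row of `table` into a dict tagged with its source table."""
    columns = ", ".join(["id"] + list(text_columns))
    cur = conn.execute(f"SELECT {columns} FROM {table}")
    docs = []
    for row in cur:
        doc = {
            # Prefix the key with the table name so ids stay unique across tables.
            "doc_id": f"{table}:{row[0]}",
            "source_table": table,
        }
        doc.update(zip(text_columns, row[1:]))
        docs.append(doc)
    return docs
```

At search time, a filter on `source_table` restricts results to one table, while omitting it searches across all of them.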
Hibernate has support for full text search, if you want to search persistent objects rather than unstructured documents.
There's an OpenSymphony project called Compass of which you should be aware. I have stayed away from it myself, primarily because it seems to be way more complicated than search needs to be. Also, as far as I can tell from the documentation (I confess I haven't found the time necessary to read it all), it stores Lucene segments as blobs in the database. If you're familiar with the Lucene architecture, Compass implements a Lucene Directory on top of the database. I think this is the wrong approach. I would leverage the database's built-in support for indexing and implement a Lucene IndexReader instead. The same criticism applies to distributed cache implementations, etc.

I haven't explored this at all, but take a look at LuSql.
Using Solr would be straightforward as well but there'll be some DRY-violations with the Solr schema.xml and your actual database schema. (FYI, Solr does support wildcards, though.)

We are rolling out our first application that uses Solr tonight. Solr 1.3 includes the DataImportHandler, which lets you specify your database tables (they call them entities) along with their relationships. Once defined, a simple HTTP request will trigger an import of your data.
Take a look at the Solr wiki page for DataImportHandler for details.
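The HTTP request mentioned above could look roughly like the sketch below. It assumes the handler is registered at the conventional `/dataimport` path in solrconfig.xml; the base URL passed in (for example `http://localhost:8983/solr`) and the function names are placeholders for your own setup.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def dataimport_url(base_url, command="full-import", **params):
    """Build the URL for a DataImportHandler command."""
    query = urlencode({"command": command, **params})
    return f"{base_url.rstrip('/')}/dataimport?{query}"


def trigger_import(base_url):
    """Send the full-import request and return the raw response body."""
    with urlopen(dataimport_url(base_url)) as response:
        return response.read().decode("utf-8")
```

Calling `dataimport_url(base_url, command="status")` builds the request that reports on a running import.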

As an introduction:
Brian McCallister wrote a nice blog post: Using Lucene with OJB.
