Mapping data into Elasticsearch from Nutch 1.x - indexing

I've been working with Nutch 1.10 to run some small web crawls and index the crawl data with Elasticsearch 1.4.1. It seems that the only way to optimize the index mapping is to crawl first, review the mapping that ES generated on its own, and then change it (if necessary) with the mapping API.
Does anyone know of a more effective solution to optimize the mappings within an ES index for web crawling?
UPDATE:
Is it even possible to update an ES mapping from a Nutch web crawl?

There are two things to consider here:
What data is indexed?
How to index it correctly into ES?
Regarding the indexed data, the indexing plugins you use determine this. For example, index-basic adds content, host, url, etc. to every document. You can either check the plugins' documentation or simply inspect the output (as you did).
Once you know what data is indexed and how you want to handle it in the ES cluster, you can create a new index in ES with the correct/optimized mappings and make sure Nutch indexes into that index, as sketched below.
Of course, you can also re-index what you have already crawled (see this ES article).
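For example, here is a minimal sketch of creating such an index with explicit mappings through the ES 1.x Java API before running the Nutch indexing job. The index name "nutch", the type "doc", and the field list are assumptions; match them to the elastic.index setting in your nutch-site.xml and to the fields your indexing plugins actually produce.

```java
import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

public class CreateNutchIndex {
    public static void main(String[] args) {
        // Connect to a local ES 1.x node (transport port 9300 by default).
        Client client = new TransportClient()
                .addTransportAddress(new InetSocketTransportAddress("localhost", 9300));

        // Explicit mapping for the fields the indexing plugins are expected to emit (assumed names).
        String mapping = "{ \"doc\": { \"properties\": {"
                + "  \"url\":     { \"type\": \"string\", \"index\": \"not_analyzed\" },"
                + "  \"host\":    { \"type\": \"string\", \"index\": \"not_analyzed\" },"
                + "  \"title\":   { \"type\": \"string\" },"
                + "  \"content\": { \"type\": \"string\" },"
                + "  \"tstamp\":  { \"type\": \"date\" }"
                + "} } }";

        client.admin().indices().prepareCreate("nutch")
                .addMapping("doc", mapping)
                .execute().actionGet();

        client.close();
    }
}
```

With the index created this way, Nutch's Elasticsearch indexer writes into an index whose mappings you control, instead of relying on dynamic mapping.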

Related

What is a good approach to update a Solr schema for multiple purposes/use cases?

I am using Solr 4.10. Its schema is the one provided by Apache Nutch, since it stores the indexes of data crawled by Nutch. Now I also have to index user information so that we can search for users. For that I have to change the schema. I have two options in this case:
Create a new schema file and use Solr in multi-core form.
Update the existing schema file and add the new fields to it.
What is the best approach for this task? If there is any option other than the two above, please suggest it and point me toward a better solution.
The best approach is to use a dedicated core for each different scope of information. It keeps each index smaller and more specific, which makes it easier to understand and faster to search.
If you use the same core to index information from many different scopes, the index becomes big and complex, which hurts performance.
Another important benefit of splitting your index into different cores is that you can apply different indexing strategies to each core.
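As a rough illustration with SolrJ 4.x, indexing user records into their own core could look like the sketch below; the core name "users" and the fields are hypothetical, and the Nutch crawl data would stay in its existing core.

```java
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class IndexUserToOwnCore {
    public static void main(String[] args) throws Exception {
        // Point SolrJ at the dedicated "users" core (hypothetical core name).
        HttpSolrServer users = new HttpSolrServer("http://localhost:8983/solr/users");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "user-42");
        doc.addField("name", "Jane Doe");
        doc.addField("email", "jane@example.com");

        users.add(doc);
        users.commit();
        users.shutdown();
    }
}
```

Each core gets its own schema.xml, so the user fields never have to coexist with the Nutch fields.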

Basic doubts about Nutch

Right now I have a project in which I need to build a search engine, but I cannot use Solr, only Nutch and Lucene. While searching in forums I found a lot of people saying that Nutch does the indexing. I installed Nutch (1.4) and crawled data, but realized I got no index folder or anything like that, only the crawled data. So the question is: does Nutch actually index what it crawls, or does it need Lucene for indexing and search?
PS: For this project I can't use Solr, only pure Nutch and Lucene, and I need to build everything in Java, so I'm really confused when people say that Nutch does in fact index. Sorry for my bad English, it's not my native language.
In older versions of Nutch (up to 1.2), Lucene was used for indexing and search: Nutch passed the pages it crawled to Lucene for indexing.
Nutch 1.4 will not index your data by itself; it no longer uses Lucene directly, so it cannot create its own indexes. Nutch must pass the documents to Solr for them to be indexed.
Check out: nutch vs solr indexing
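Since Solr is off the table, the crawled content has to be fed to Lucene by your own code, for example after dumping segments with bin/nutch readseg -dump. Below is a minimal indexing sketch with plain Lucene, assuming Lucene 3.x (roughly contemporary with Nutch 1.4); the field names are illustrative.

```java
import java.io.File;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class SimpleLuceneIndexer {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig config =
                new IndexWriterConfig(Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36));
        IndexWriter writer = new IndexWriter(FSDirectory.open(new File("lucene-index")), config);

        // One Document per crawled page; url and content would come from the dumped segment data.
        Document doc = new Document();
        doc.add(new Field("url", "http://example.com/", Field.Store.YES, Field.Index.NOT_ANALYZED));
        doc.add(new Field("content", "page text extracted from the crawl ...",
                Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);

        writer.close();
    }
}
```

Searching afterwards is the usual IndexSearcher/QueryParser combination over the same directory.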

Creating Lucene Index in a Database - Apache Lucene

I am using the Grails Searchable plugin. It creates index files at a given location. Is there any way in the Searchable plugin to create the Lucene index in a database?
Generally, no.
You can probably attempt to implement your own format but this would require a lot of effort.
I am no expert in Lucene, but I know that it is optimized to offer fast search over the filesystem. So it would be theoretically possible to build a Lucene index over the database, but the main feature of Lucene (being a very fast search engine) would be lost.
As a point of interest, Compass supported storage of a Lucene index in a database, using a JdbcDirectory. This was, as far as I can figure, just a bad idea.
Compass, by the way, is now defunct, having been replaced by ElasticSearch.

Elasticsearch over an already existing Lucene index

I have a system that uses Lucene. Now, for a few reasons, I would like to add a distributed search feature on top of it.
The question is: can I use the existing Lucene index, created by Lucene's IndexWriter, for searching with Elasticsearch, or should I create a new index through ES?
P.S. I discovered on the web that this is possible with Solr, but AFAIK there is nothing tangible for ES. Any help would be appreciated.
You need to reindex into Elasticsearch; you can't reuse an existing Lucene index.
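A rough sketch of such a reindex, reading stored fields out of the old Lucene index and pushing them into ES through the 1.x Java client; the index, type, and field names here are hypothetical, and fields that were indexed but not stored would have to be re-fetched from the original data source.

```java
import java.io.File;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.MultiFields;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Bits;
import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

public class LuceneToEs {
    public static void main(String[] args) throws Exception {
        // Assumes a Lucene 4.x index on disk and an ES 1.x node on localhost:9300.
        IndexReader reader = DirectoryReader.open(FSDirectory.open(new File("old-lucene-index")));
        Client client = new TransportClient()
                .addTransportAddress(new InetSocketTransportAddress("localhost", 9300));

        Bits liveDocs = MultiFields.getLiveDocs(reader); // used to skip deleted documents
        for (int i = 0; i < reader.maxDoc(); i++) {
            if (liveDocs != null && !liveDocs.get(i)) {
                continue;
            }
            Document d = reader.document(i);
            Map<String, Object> source = new HashMap<String, Object>();
            source.put("url", d.get("url"));         // only stored fields are available here
            source.put("content", d.get("content"));
            client.prepareIndex("myindex", "doc").setSource(source).execute().actionGet();
        }

        client.close();
        reader.close();
    }
}
```

For anything beyond a small index, the bulk API is the better choice, but the idea is the same: walk the old index and index each document into ES.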

Is there a set of best practices for building a Lucene index from a relational DB?

I'm looking into using Lucene and/or Solr to provide search in an RDBMS-powered web application. Unfortunately for me, all the documentation I've skimmed deals with how to get the data out of the index; I'm more concerned with how to build a useful index. Are there any "best practices" for doing this?
Will multiple applications be writing to the database? If so, it's a bit tricky; you have to have some mechanism to identify new records to feed to the Lucene indexer.
Another point to consider is do you want one index that covers all of your tables, or one index per table. In general, I recommend one index, with a field in that index to indicate which table the record came from.
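Here is a sketch of that single-index approach, assuming Lucene 3.x and a hypothetical articles table; the "table" field is the discriminator mentioned above.

```java
import java.io.File;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class DbToLucene {
    public static void main(String[] args) throws Exception {
        // Hypothetical JDBC URL and table; repeat the loop for each table you want searchable.
        Connection conn = DriverManager.getConnection("jdbc:postgresql://localhost/app", "user", "pass");
        IndexWriterConfig config =
                new IndexWriterConfig(Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36));
        IndexWriter writer = new IndexWriter(FSDirectory.open(new File("db-index")), config);

        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery("SELECT id, title, body FROM articles");
        while (rs.next()) {
            Document doc = new Document();
            // The "table" field records which table the document came from.
            doc.add(new Field("table", "articles", Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.add(new Field("id", rs.getString("id"), Field.Store.YES, Field.Index.NOT_ANALYZED));
            doc.add(new Field("title", rs.getString("title"), Field.Store.YES, Field.Index.ANALYZED));
            doc.add(new Field("body", rs.getString("body"), Field.Store.NO, Field.Index.ANALYZED));
            writer.addDocument(doc);
        }

        writer.close();
        conn.close();
    }
}
```

Identifying new or changed rows for incremental updates still needs some mechanism of its own, such as a last-modified column, as noted above.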
Hibernate has support for full text search, if you want to search persistent objects rather than unstructured documents.
There's an OpenSymphony project called Compass of which you should be aware. I have stayed away from it myself, primarily because it seems to be way more complicated than search needs to be. Also, as far as I can tell from the documentation (I confess I haven't found the time necessary to read it all), it stores Lucene segments as blobs in the database. If you're familiar with the Lucene architecture, Compass implements a Lucene Directory on top of the database. I think this is the wrong approach. I would leverage the database's built-in support for indexing and implement a Lucene IndexReader instead. The same criticism applies to distributed cache implementations, etc.
I haven't explored this at all, but take a look at LuSql.
Using Solr would be straightforward as well, but there will be some DRY violations between the Solr schema.xml and your actual database schema. (FYI, Solr does support wildcards, though.)
We are rolling out our first application that uses Solr tonight. With Solr 1.3, they've included the DataImportHandler, which allows you to specify your database tables (they call them entities) along with their relationships. Once defined, a simple HTTP request will trigger an import of your data.
Take a look at the Solr wiki page for DataImportHandler for details.
As an introduction:
Brian McCallister wrote a nice blog post: Using Lucene with OJB.