Right now I have a project in which I need to build a search engine, but I cannot use Solr, only Nutch and Lucene. While searching forums and such I found a lot of people saying that Nutch does the indexing. I installed Nutch (1.4) and crawled data, but realized I got no index folder or anything like it, only the crawled data. So, the question is: does Nutch actually index what it crawls, or does it need Lucene for indexing and search?
PS: for this project I can't use Solr, only pure Nutch and Lucene, and I need to build everything using Java, so I'm really confused when people say that Nutch does in fact index. Sorry for my bad English, it's not my native language.
Nutch uses Lucene for indexing and search. As I understand it, Nutch passes the pages it finds to Lucene for indexing.
Nutch will not index your data by itself; it does not use Lucene, so it cannot create its own indexes. Nutch must pass the documents to Solr for them to be indexed.
Check out: nutch vs solr indexing
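Whichever of the above applies to your Nutch version, once you do end up with a Lucene index on disk, searching it from plain Java is straightforward. Here is a minimal sketch, assuming Lucene 3.x and an index that stores "url" and "content" fields (the index path and field names are placeholders, not something Nutch 1.4 is guaranteed to produce):

```java
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class SimpleSearch {
    public static void main(String[] args) throws Exception {
        // Path and field names ("content", "url") are assumptions -- adjust to your index
        IndexSearcher searcher = new IndexSearcher(
                FSDirectory.open(new File("crawl/index")), true);

        QueryParser parser = new QueryParser(Version.LUCENE_30, "content",
                new StandardAnalyzer(Version.LUCENE_30));
        Query query = parser.parse("nutch");

        // Print the URL and score of the top 10 hits
        for (ScoreDoc hit : searcher.search(query, 10).scoreDocs) {
            Document doc = searcher.doc(hit.doc);
            System.out.println(doc.get("url") + "  (score " + hit.score + ")");
        }
        searcher.close();
    }
}
```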
I want to upgrade my Alfresco server to 5.2, and all my custom webscripts use Lucene queries. Since Lucene indexing has been removed as of Alfresco 5.x and Solr indexing is not instantaneous, I am planning to use fts_alfresco search. While testing I found that a few Lucene queries can be used for fts_alfresco search without modification. So my concern is: will I be able to do fts_alfresco search using Lucene queries? If not, is there a better way to migrate all my Lucene queries to fts_alfresco?
Thanks in advance.
You will need to test/check your queries since there are small differences (for instance, date range queries are not the same), but in general there's no reason why you would not be able to use FTS.
I'm not sure comprehensive documentation exists where you could see all those small differences, though. If you find it, please share.
"Alfresco FTS is compatible with most, if not all of the examples here.."
https://community.alfresco.com/docs/DOC-4673-search
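In practice, migrating a webscript is often just a matter of switching the search language constant and adjusting the query syntax where it differs. A minimal sketch, assuming the standard SearchService bean is injected as usual (the query string and store are illustrative, not taken from your webscripts):

```java
import org.alfresco.service.cmr.repository.StoreRef;
import org.alfresco.service.cmr.search.ResultSet;
import org.alfresco.service.cmr.search.SearchParameters;
import org.alfresco.service.cmr.search.SearchService;

public class FtsSearchExample {

    private SearchService searchService; // injected as usual in a webscript/bean

    public void runQuery() {
        SearchParameters sp = new SearchParameters();
        sp.addStore(StoreRef.STORE_REF_WORKSPACE_SPACESSTORE);
        // was SearchService.LANGUAGE_LUCENE
        sp.setLanguage(SearchService.LANGUAGE_FTS_ALFRESCO);
        // illustrative query, not one of your actual webscript queries
        sp.setQuery("TYPE:\"cm:content\" AND =cm:name:\"report.pdf\"");

        ResultSet results = searchService.query(sp);
        try {
            System.out.println("hits: " + results.length());
        } finally {
            results.close();
        }
    }
}
```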
I've been working with Nutch 1.10 to make some small web crawls and index the crawl data using Elasticsearch 1.4.1. It seems that the only way to optimize the index mapping is to crawl first, review the mapping that ES created on its own, and then change it accordingly (if necessary) with the mapping API.
Does anyone know of a more effective solution to optimize the mappings within an ES index for web crawling?
UPDATE:
Is it even possible to update an ES mapping from a Nutch web crawl?
There are two things to consider here:
What data is indexed?
How to index it correctly into ES
Regarding the indexed data, the index plugins you use determine this. For example, index-basic will add content, host, url, etc. for every doc. You can either check the plugins' documentation or simply look at the output (like you did).
Once you know what data is indexed and how you want to handle it in the ES cluster, you can create a new index in ES with the correct/optimized mappings and make sure Nutch indexes into that index.
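For example, with the ES 1.x Java API you could pre-create the index with an explicit mapping before running the crawl, so ES never has to guess the field types. This is only a rough sketch; the cluster name, index/type names, and field list are assumptions based on what index-basic typically emits:

```java
import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

public class CreateNutchIndex {
    public static void main(String[] args) {
        Client client = new TransportClient(ImmutableSettings.settingsBuilder()
                .put("cluster.name", "elasticsearch").build())
                .addTransportAddress(new InetSocketTransportAddress("localhost", 9300));

        // Explicit mapping for the fields Nutch usually sends (content, title, url, host)
        String mapping = "{ \"doc\": { \"properties\": {"
                + "\"content\": { \"type\": \"string\" },"
                + "\"title\":   { \"type\": \"string\" },"
                + "\"url\":     { \"type\": \"string\", \"index\": \"not_analyzed\" },"
                + "\"host\":    { \"type\": \"string\", \"index\": \"not_analyzed\" }"
                + "} } }";

        client.admin().indices().prepareCreate("nutch")
                .addMapping("doc", mapping)
                .execute().actionGet();

        client.close();
    }
}
```

Then point Nutch's Elasticsearch indexer plugin at that index (via the elastic.* properties in nutch-site.xml, if I remember correctly).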
Of course you can also re-index what you already crawled (see this ES article).
I am using the Grails Searchable plugin. It creates index files at a given location. Is there any way in the Searchable plugin to create the Lucene index in a database?
Generally, no.
You could probably attempt to implement your own format, but it would require a lot of effort.
I am no expert in Lucene, but I know that it is optimized for fast search over the filesystem. So it would be theoretically possible to build a Lucene index over a database, but the main feature of Lucene (being a VERY fast search engine) would be lost.
As a point of interest, Compass supported storage of a Lucene index in a database, using a JdbcDirectory. This was, as far as I can figure, just a bad idea.
Compass, by the way, is now defunct, having been replaced by ElasticSearch.
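For reference, the Compass-era usage looked roughly like this. This is a hedged sketch from memory: JdbcDirectory shipped with Compass (not with Lucene itself), so the exact class names, dialects, and required jars may differ:

```java
import javax.sql.DataSource;

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.jdbc.JdbcDirectory;
import org.apache.lucene.store.jdbc.dialect.MySQLDialect;

public class JdbcIndexExample {
    public void buildIndex(DataSource dataSource) throws Exception {
        // Stores all index segments as rows/BLOBs in the "search_index" table
        JdbcDirectory dir = new JdbcDirectory(dataSource, new MySQLDialect(), "search_index");
        dir.create(); // creates the backing table if it does not exist

        IndexWriter writer = new IndexWriter(dir, new SimpleAnalyzer(), true,
                IndexWriter.MaxFieldLength.UNLIMITED);
        // ... add documents ...
        writer.close();
    }
}
```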
I have a system that uses Lucene. Now, for a few reasons, I would like to add a distributed search feature on top of it.
The question is: can I use the existing Lucene index, created by Lucene's IndexWriter, for searching with Elasticsearch, or should I create a new index through ES itself?
P.S. I discovered on the web that this is possible with Solr, but AFAIK couldn't find anything tangible for ES. Any help would be appreciated.
You need to reindex into Elasticsearch; you can't reuse an existing Lucene index.
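The usual approach is to walk the old index with an IndexReader and bulk-index the stored fields into ES. A rough sketch, assuming Lucene 3.x, the ES 1.x Java client, and that the fields you need (here "id" and "content") were stored in the original index; anything that was indexed but not stored cannot be recovered this way:

```java
import java.io.File;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.store.FSDirectory;
import org.elasticsearch.action.bulk.BulkRequestBuilder;
import org.elasticsearch.client.Client;
import org.elasticsearch.client.transport.TransportClient;
import org.elasticsearch.common.settings.ImmutableSettings;
import org.elasticsearch.common.transport.InetSocketTransportAddress;

public class LuceneToEs {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open(
                FSDirectory.open(new File("/path/to/lucene/index")));
        Client client = new TransportClient(ImmutableSettings.settingsBuilder()
                .put("cluster.name", "elasticsearch").build())
                .addTransportAddress(new InetSocketTransportAddress("localhost", 9300));

        BulkRequestBuilder bulk = client.prepareBulk();
        for (int i = 0; i < reader.maxDoc(); i++) {
            if (reader.isDeleted(i)) {
                continue; // skip deleted docs
            }
            Document doc = reader.document(i);
            // Field names are assumptions -- only stored fields are available here
            bulk.add(client.prepareIndex("myindex", "doc", doc.get("id"))
                    .setSource("content", doc.get("content")));
        }
        bulk.execute().actionGet();

        reader.close();
        client.close();
    }
}
```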
We have indexed our documents with Lucene 2.3.1 and now want to move to Lucene 3.0.3 for better features. I want to know whether the index will work as-is and whether I will be able to add more documents to the existing index with 3.0.3 without any hassle, or whether I have to re-index the whole thing.
Thanks a lot in advance.
I am quite sure that the indexes will be incompatible with Lucene 3 if they were built under Lucene 2 (in fact I'm 99% positive of this).
However, you may be able to convert them rather than rebuild them. Have a look here for some high-level guidance in this area.
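One commonly suggested approach, sketched below under the assumption that 3.0.3 can still read the 2.3 segment format (try it on a copy of the index first), is to open the old index with the new IndexWriter and optimize it, which rewrites every segment in the current format:

```java
import java.io.File;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class UpgradeIndex {
    public static void main(String[] args) throws Exception {
        // Work on a copy of the old index -- this rewrites it in place
        Directory dir = FSDirectory.open(new File("/path/to/index-copy"));
        IndexWriter writer = new IndexWriter(dir,
                new StandardAnalyzer(Version.LUCENE_30),
                IndexWriter.MaxFieldLength.UNLIMITED);
        writer.optimize(); // merges all segments, writing them in the 3.0 format
        writer.close();
    }
}
```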