Cluster remote Lucene index with DCS - lucene

I'm trying to use a Lucene index on a remote server as an input for Carrot2, which is installed on the same server. According to the documentation this should be possible with carrot2-dcs (documentation chapter 3.4, Carrot2 Document Clustering Server: "Various document sources included. Carrot2 Document Clustering Server can fetch and cluster documents from a large number of sources, including major search engines and indexing engines (Lucene, Solr).").
After installing carrot2-dcs 3.9.3 I discovered that Lucene isn't available as a document source. How should I proceed?

To cluster content from a Lucene index, the index needs to be available on the server the DCS is running on (either through the local file system or e.g. as an NFS mount).
To make the Lucene source visible in the DCS:
1. Open for editing: war/carrot2-dcs.war/WEB-INF/suites/source-lucene-attributes.xml
2. Uncomment the configuration sections and provide the location of your Lucene index and the fields that should serve as the documents' titles and content (at least one is required). Remember that the fields must be "stored", in Lucene speak.
3. Make sure the edited file is packed back into the WAR archive and run the DCS. You should now see the Lucene document source.
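For step 3, a minimal sketch of pulling the file out of the WAR and packing it back in with standard zip/jar tools (this assumes the distribution layout above; paths may differ in your installation):

    cd war
    # extract just the attributes file; the path inside the archive is preserved
    unzip carrot2-dcs.war WEB-INF/suites/source-lucene-attributes.xml
    # ... edit WEB-INF/suites/source-lucene-attributes.xml ...
    # update the file inside the WAR ('jar uf' keeps the relative path intact), then start the DCS
    jar uf carrot2-dcs.war WEB-INF/suites/source-lucene-attributes.xml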

Related

AEM (Adobe Experience Manager) Indexed PDF Search Results

My employer has recently switched its CMS to AEM (Adobe Experience Manager).
We store a large amount of documentation and our site users need to be able to find the information contained within those documents, some of which are hundreds of pages long.
Adobe, disappointingly, says its search tool will not search PDFs. Is there any format for producing or saving PDFs that allows the content to be indexed?
I think you need to configure an external index/search tool like Apache Solr and use a REST endpoint to sync DAM data and fetch results for queries.
Out of the box, AEM supports most binary formats without needing Solr. You only need it in advanced scenarios, like exposing search outside of Authoring or handling millions of assets.
When any asset is uploaded to the AEM DAM it goes through the DAM Asset Update workflow, which has a Metadata Processor step. That step extracts content from the asset, so "binary" assets like Word documents, Excel spreadsheets and PDFs become searchable. As long as you have the DAM Asset Update workflow enabled you will be fine.
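To verify that the extracted PDF text is actually searchable, a quick check against the QueryBuilder JSON servlet on a local author instance might look like the sketch below (host, port, credentials and the search term "invoice" are assumptions; adjust to your environment):

    # full-text search across DAM assets, including text extracted from PDFs
    curl -u admin:admin \
      "http://localhost:4502/bin/querybuilder.json?path=/content/dam&type=dam:Asset&fulltext=invoice&p.limit=10"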

Can Apache Solr store the actual files which are uploaded to it?

This is my first time on Stack Overflow. Thanks to all for providing valuable information and helping one another.
I am currently working with Apache Solr 7. There is a POC I need to complete and I have little time, so I'm putting this question here. I have set up Solr on my Windows machine. I have created a core and uploaded a PDF document using /update/extract from the Admin UI. After uploading, I can see the metadata of the file if I query from the Admin UI using the query button. I was wondering if I can get the actual content of the PDF as well. I can see there is one tlog file generated under /data/tlog/tlog000... with raw PDF data, but not the actual file.
So the questions are:
1. Can I get the PDF content?
2. Does Solr store the actual file somewhere?
a. If it does, where does it store it?
b. If it does not, is there a way to store the actual file?
Solr will not store the actual file anywhere.
Depending on your config it can store the binary content, though.
Using the extract request handler, Apache Solr relies on Apache Tika [1] to extract the content from the document [2].
So you can search and return the content of the PDF and a lot of other metadata, if you like.
[1] https://tika.apache.org/
[2] https://lucene.apache.org/solr/guide/6_6/uploading-data-with-solr-cell-using-apache-tika.html
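As a concrete illustration, posting a PDF through the extract handler and mapping the Tika-extracted body to a stored field might look like the sketch below (the core name "mycore", the file name and the attr_ dynamic field are assumptions; the content only comes back if the target field is stored in your schema):

    # send the PDF through Tika; map the extracted body to attr_content, prefix unknown fields with attr_
    curl "http://localhost:8983/solr/mycore/update/extract?literal.id=doc1&uprefix=attr_&fmap.content=attr_content&commit=true" \
      -F "myfile=@my.pdf"
    # query the document back; attr_* fields are stored in the default configsets, so the text is returned
    curl "http://localhost:8983/solr/mycore/select?q=id:doc1&fl=id,attr_content"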

How to get the webgraph in Apache Nutch?

I have generated the webgraph db in Apache Nutch using the command 'bin/nutch webgraph -segmentDir crawl/segments -webgraphdb crawl/webgraphdb'. It generated three folders in crawl/webgraphdb: inlinks, outlinks and nodes. Each of those folders contains two binary files, data and index. How do I get a visual web graph in Apache Nutch? What is the use of the webgraph?
The webgraph is intended to be one step in the score calculation based on the link structure (i.e. the webgraph):
webgraph will generate the data structure for the specified segment/s
linkrank will calculate the score based on the previous structures
scoreupdater will update the score from the webgraph back into the crawldb
Be aware that this program is very CPU/IO intensive and that it ignores the internal links of a website by default.
You could use the nodedumper command to get useful data out of the webgraph data, including the actual score of a node and the highest scored inlinks/outlinks. This output is not intended to be visualized, but you could parse it and generate whatever visualization you need.
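A hedged sketch of the full sequence, ending with a nodedumper call that dumps the top scored nodes to plain text (the paths and the -topn value are just examples; check 'bin/nutch nodedumper' for the exact options of your Nutch version):

    # compute link-based scores and push them back into the crawldb
    bin/nutch linkrank -webgraphdb crawl/webgraphdb
    bin/nutch scoreupdater -crawldb crawl/crawldb -webgraphdb crawl/webgraphdb
    # dump the highest scored nodes as text, e.g. to inspect them or feed a visualization tool
    bin/nutch nodedumper -webgraphdb crawl/webgraphdb -scores -topn 100 -output crawl/webgraphdb/dump-scores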
That being said, since Nutch 1.11 the index-links plugin is available, which allows you to index the inlinks and outlinks of each URL into Solr/ES. I've used this plugin (indexing into Solr) along with the sigmajs library to generate some graph visualizations of the link structure of my crawls; perhaps this could suit your needs.

How to boost documents based on keywords in Apache Solr

I am using Apache Solr 4.10. I have to boost documents for the queries against which they were served, so that they get a better score. To do that, I have to log the document id as well as the query.
How do I save the document id and the query?
Second, I have to use that information to boost those documents. How should I do it?

Using ElasticSearch and/or Solr as a datastore for MS Office and PDF documents

I'm currently designing a full-text search system where users perform text queries against MS Office and PDF documents, and the result will return a list of documents that best match the query. The user will then be able to select any document returned and view that document within MS Word, Excel, or a PDF viewer.
Can I use ElasticSearch or Solr to import the raw binary documents (i.e. .docx, .xlsx, .pdf files) into their "data store", and then export the document to the user's device on command for viewing?
Previously, I used MongoDB 2.6.6 to import the raw files into GridFS and the extracted text into a separate collection (the collection contained a text index), and that worked fine. However, MongoDB full-text searching is quite basic, so I'm now looking at either Solr or ElasticSearch to perform more complex text searching.
Both Solr and Elasticsearch will index the content of the document. Solr has that built-in, Elasticsearch needs a plugin. Easy either way and both use Tika under the covers.
Neither of them will store the document itself. You can try making them do it, but they are not designed for it and you will suffer.
Additionally, neither Solr nor Elasticsearch is currently recommended as a primary datastore. They can do it, but it is not as mission-critical for them as it is for, say, a filesystem implementation.
So, I would recommend having the files somewhere else and using Solr/Elasticsearch for searching only. That's where they shine.
I would try the Elasticsearch attachment plugin. Details can be found here:
https://www.elastic.co/guide/en/elasticsearch/plugins/2.2/mapper-attachments.html
https://github.com/elasticsearch/elasticsearch-mapper-attachments
It's built on top of Apache Tika:
http://tika.apache.org/1.7/formats.html
Attachment Type
The attachment type allows indexing different "attachment" type fields (encoded as base64), for example, Microsoft Office formats, open document formats, ePub, HTML, and so on (the full list can be found here).
The attachment type is provided as a plugin extension. The plugin is a
simple zip file that can be downloaded and placed under
$ES_HOME/plugins location. It will be automatically detected and the
attachment type will be added.
Supported Document Formats
HyperText Markup Language
XML and derived formats
Microsoft Office document formats
OpenDocument Format
iWorks document formats
Portable Document Format
Electronic Publication Format
Rich Text Format
Compression and packaging formats
Text formats
Feed and Syndication formats
Help formats
Audio formats
Image formats
Video formats
Java class files and archives
Source code
Mail formats
CAD formats
Font formats
Scientific formats
Executable programs and libraries
Crypto formats
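Based on the plugin description above, a minimal sketch of how indexing an attachment could look on Elasticsearch 2.x with mapper-attachments installed (the index name "docs", type name "doc", field name "file" and query term "report" are made up for the example; the document body must be base64-encoded, and the extracted text ends up in the file.content sub-field):

    # create an index with an attachment-typed field
    curl -XPUT "http://localhost:9200/docs" -d '{
      "mappings": {
        "doc": {
          "properties": {
            "file": { "type": "attachment" }
          }
        }
      }
    }'
    # index a document; BASE64_CONTENT_HERE stands for the base64-encoded .docx/.pdf bytes
    curl -XPUT "http://localhost:9200/docs/doc/1" -d '{ "file": "BASE64_CONTENT_HERE" }'
    # search the text extracted from the attachment
    curl "http://localhost:9200/docs/_search?q=file.content:report"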
A bit late to the party but this may help someone :)
I had a similar problem and some research led me to fscrawler. Description:
This crawler helps to index binary documents such as PDF, Open Office, MS Office.
Main features:
Local file system (or a mounted drive) crawling: indexes new files, updates existing ones and removes old ones. Remote file system crawling over SSH.
REST interface to let you "upload" your binary documents to elasticsearch.
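For reference, a rough sketch of a typical FSCrawler run (the job name "my_docs" is a placeholder, and the flags below are from the FSCrawler docs as I remember them, so double-check them against your version): the first run creates a settings file under ~/.fscrawler/my_docs/ that you edit to point at your document folder and your Elasticsearch instance, then you run it again to crawl.

    # first run creates the job settings; edit them, then run again to start crawling
    bin/fscrawler my_docs
    # start with the REST interface enabled, then push a single file to it
    bin/fscrawler my_docs --rest
    curl -F "file=@my.pdf" "http://127.0.0.1:8080/fscrawler/_upload"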
Regarding solr:
If the docs only need to be returned on metadata searches, Solr features a BinaryField field type, to which you can send binary data base64-encoded. Keep in mind that in general people recommend against doing this, as it may bloat your index (RAM requirements/performance); if possible, a setup where you store the files externally (and only the path to the file in Solr) might be a better choice.
If you want Solr to automatically index the text inside the PDF/doc, that's possible with the ExtractingRequestHandler: https://wiki.apache.org/solr/ExtractingRequestHandler
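If you do go the BinaryField route, a hedged sketch of adding such a field through the Schema API (the field name "raw_file" and core name "mycore" are made up; a "binary" field type ships with the default configsets, but check your schema):

    # add a stored, non-indexed binary field; documents then carry the file as a base64 string in that field
    curl -X POST -H "Content-type:application/json" \
      --data-binary '{ "add-field": { "name": "raw_file", "type": "binary", "stored": true, "indexed": false } }' \
      "http://localhost:8983/solr/mycore/schema"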
Elasticsearch does store documents (.pdf and .doc files, for instance) in the _source field. It can be used as a NoSQL datastore (like MongoDB).