I have a Solr instance up and running, and I need it to index data stored in Parquet files. Right now I am converting the Parquet files to flat text files and then having Solr index those. I'd like to know whether it is possible for Solr to consume the Parquet files directly.
Thanks
Directly: no, not possible.
If you want something more integrated than what you are currently doing (converting to text and indexing might already be good enough), there are two ways to go:
Create specialized code around DIH: you can probably write a custom DataSource so that DIH does the indexing.
Just write some Java code using SolrJ that reads your files and indexes them into Solr (a sketch follows below).
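For the SolrJ route, here is a minimal sketch, assuming parquet-avro and SolrJ are on the classpath; the Solr URL, the core name ("mycore"), and the file path are placeholders, and every Parquet column is copied into a Solr field of the same name:

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class ParquetToSolr {
    public static void main(String[] args) throws Exception {
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();
             ParquetReader<GenericRecord> reader =
                     AvroParquetReader.<GenericRecord>builder(new Path("data/input.parquet")).build()) {

            GenericRecord record;
            while ((record = reader.read()) != null) {
                SolrInputDocument doc = new SolrInputDocument();
                // Copy every Parquet column into a Solr field of the same name.
                for (Schema.Field field : record.getSchema().getFields()) {
                    Object value = record.get(field.name());
                    if (value != null) {
                        doc.addField(field.name(), value.toString());
                    }
                }
                solr.add(doc);
            }
            solr.commit();
        }
    }
}
```

Converting every value with toString() is a simplification; in practice you would map the Parquet types to matching Solr field types in your schema.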
I have these 3 files in a folder and they are all related to an index created by Lucene:
_0.cfs
segments_2
segments.gen
What are they each used for, and is it possible to convert any of them to a human-readable format to learn a bit more about how Lucene works with its indexes?
The two segments files store metadata about the segments, and the .cfs file is a compound file that bundles the other per-segment index files (term index, stored fields, deletions, and so on).
For documentation of the different types of files used to create a Lucene index, see this summary of file extensions.
Generally, no, Lucene files are not human readable. They are designed more for efficiency and speed than human readability. The way to get a human readable format is to access them through the Lucene API (via Luke, or Solr, or something like that).
If you want a thorough understanding of the file formats in use, the codecs package would be the place to look.
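As an illustration of going through the API, here is a minimal sketch that opens an index directory and prints what it finds; the path is a placeholder, and the exact classes vary between Lucene versions (older releases used IndexReader.open instead of DirectoryReader):

```java
import java.nio.file.Paths;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexableField;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.store.FSDirectory;

public class InspectIndex {
    public static void main(String[] args) throws Exception {
        try (FSDirectory dir = FSDirectory.open(Paths.get("/path/to/index"));
             DirectoryReader reader = DirectoryReader.open(dir)) {

            System.out.println("Documents: " + reader.numDocs());
            // Each leaf corresponds to one segment (e.g. the _0 segment above).
            for (LeafReaderContext leaf : reader.leaves()) {
                System.out.println("Segment with " + leaf.reader().numDocs() + " docs");
            }
            // Print the stored fields of the first document, if any.
            if (reader.maxDoc() > 0) {
                Document doc = reader.document(0);
                for (IndexableField field : doc.getFields()) {
                    System.out.println(field.name() + " = " + field.stringValue());
                }
            }
        }
    }
}
```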
I am using the Grails Searchable plugin. It creates index files at a given location. Is there any way with the Searchable plugin to create the Lucene index in a database?
Generally, no.
You can probably attempt to implement your own format but this would require a lot of effort.
I am no expert in Lucene, but I know that it is optimized for fast search over the filesystem. So it would be theoretically possible to build a Lucene index over a database, but the main feature of Lucene (being a VERY fast search engine) would be lost.
As a point of interest, Compass supported storage of a Lucene index in a database, using a JdbcDirectory. This was, as far as I can figure, just a bad idea.
Compass, by the way, is now defunct, having been replaced by ElasticSearch.
What is the difference between a single file, like Lucene's compound file, and multiple files for the different index file types?
The compound format uses fewer files to store the index, so it can help avoid the "too many open files" issue (which on Unix you handle with ulimit).
It is also slower.
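For reference, whether segments are written as compound files is a switch on the index writer configuration. A minimal sketch, assuming a reasonably recent Lucene, with a placeholder path and analyzer:

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

public class CompoundFileExample {
    public static void main(String[] args) throws Exception {
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        // true  -> write each segment as a single .cfs compound file (fewer open files, a bit slower)
        // false -> keep the separate per-extension files (faster, but many more file handles)
        config.setUseCompoundFile(true);

        try (FSDirectory dir = FSDirectory.open(Paths.get("/path/to/index"));
             IndexWriter writer = new IndexWriter(dir, config)) {
            // ... add documents here ...
            writer.commit();
        }
    }
}
```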
Is it possible to use Hive to query a Lucene index that is distributed over Hadoop?
Hadapt is a startup whose software bridges Hadoop with a SQL front-end (like Hive) and hybrid storage engines. They offer an archival text search capability that may meet your needs.
Disclaimer: I work for Hadapt.
As far as I know, you can essentially write custom "row-extraction" code in Hive, so I would guess that you could. I've never used Lucene and have barely used Hive, so I can't be sure. If you find a more conclusive answer to your question, please post it!
I know this is a fairly old post, but thought I could offer a better alternative.
In your case, instead of going through the hassle of mapping your HDFS Lucene index to a Hive schema, it's better to push the data into Pig, because Pig can read flat files. Unless you want a relational way of storing your data, you could process it through Pig and use HBase as your database.
You could write a custom InputFormat for Hive to access the Lucene index in Hadoop; a rough skeleton is sketched below.
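To make that concrete: Hive expects an org.apache.hadoop.mapred.InputFormat, so the rough shape is a FileInputFormat whose RecordReader opens the Lucene index for its split and emits one row per document. This is only a hedged skeleton with hypothetical class names and placeholder bodies; the real work (opening the index from HDFS and mapping stored fields to columns) is indicated in the comments:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class LuceneIndexInputFormat extends FileInputFormat<LongWritable, Text> {

    @Override
    public RecordReader<LongWritable, Text> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new LuceneRecordReader((FileSplit) split, job);
    }

    // Emits one Lucene document per call, serialized into a delimited Text row.
    static class LuceneRecordReader implements RecordReader<LongWritable, Text> {
        private int current = 0;
        private final int total;

        LuceneRecordReader(FileSplit split, JobConf job) throws IOException {
            // Open the Lucene index for this split here (e.g. after copying it out of
            // HDFS to local disk) and remember how many documents it holds.
            this.total = 0; // placeholder
        }

        @Override
        public boolean next(LongWritable key, Text value) {
            if (current >= total) {
                return false;
            }
            key.set(current);
            // Placeholder: join the document's stored fields with Hive's default field delimiter (ASCII 0x01).
            value.set("field1" + '\u0001' + "field2");
            current++;
            return true;
        }

        @Override public LongWritable createKey() { return new LongWritable(); }
        @Override public Text createValue() { return new Text(); }
        @Override public long getPos() { return current; }
        @Override public float getProgress() { return total == 0 ? 1.0f : (float) current / total; }
        @Override public void close() { }
    }
}
```

You would then point Hive at it with something like STORED AS INPUTFORMAT 'LuceneIndexInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' in the table definition.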
How do I read a Lucene index directory stored on HDFS, i.e. how do I get an IndexReader for an index stored on HDFS? The IndexReader is to be opened in a map task.
Something like: IndexReader reader = IndexReader.open("hdfs/path/to/index/directory");
Thanks,
Akhil
If you want to open a Lucene index that's stored in HDFS for the purpose of searching, you're out of luck. AFAIK, there is no implementation of Directory for HDFS that allows for search operations. One reason is that HDFS is optimized for sequential reads of large blocks, not the small random reads that Lucene performs.
In the Nutch project, there is an implementation of HDFSDirectory which you can use to create an IndexReader, but only delete operations work. Nutch only uses HDFSDirectory to perform document deduplication.
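If you only need to search the index inside the map task, one common workaround is to copy it out of HDFS onto the task's local disk and open it from there. A minimal sketch, with placeholder paths and assuming a Lucene version where DirectoryReader has replaced IndexReader.open:

```java
import java.nio.file.Paths;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.store.FSDirectory;

public class HdfsIndexOpener {
    public static DirectoryReader openFromHdfs(Configuration conf) throws Exception {
        FileSystem fs = FileSystem.get(conf);
        Path hdfsIndex = new Path("hdfs/path/to/index/directory");
        Path localIndex = new Path("/tmp/lucene-index-copy");

        // Pull the whole index directory down to the local filesystem.
        fs.copyToLocalFile(hdfsIndex, localIndex);

        // Open it like any on-disk index.
        return DirectoryReader.open(FSDirectory.open(Paths.get(localIndex.toString())));
    }
}
```

This trades disk space and startup time in each task for the ability to use the normal, local-disk Lucene code path.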
I think the Katta project might be what you are looking for. I haven't used it myself, but I've been researching these kinds of solutions recently and this seems to fit the bill.
It's a distributed version of Lucene using sharded indexes.
http://katta.sourceforge.net/