How to use an R-Tree in Cassandra for indexing

As noted here: "In version 3.4 they introduced SASI indexes (SSTable Attached Secondary Indexes). Basically a portion of the index is associated to each SStable and hence distributed with it. This means you can actually use references to access data rather than pay for another read path. In terms of implementation, they are b-trees."
So I want to use an R-Tree instead of a B-Tree for indexing in Cassandra. Cassandra is written in Java, and I can see all the B-Tree files in one folder, but there is no documentation about their parameters and functions. I have R-Tree source code in Java, but I don't know how to plug it into Cassandra as its indexing method.
P.S. The implementation language doesn't matter to me; anything goes.

Your question is too broad and general.
My advice is that you first familiarise yourself with the implementation and inner-workings of secondary indexes in Cassandra before embarking on an ambitious goal of rewriting the code. Cheers!
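For orientation, this is how a SASI index is attached in CQL (the table and column names below are made up; only the USING clause is the Cassandra-specific part). A custom index implementation is wired in the same way, through the class named in USING:

```sql
-- Requires Cassandra >= 3.4; 'posts' and 'body' are hypothetical
CREATE CUSTOM INDEX posts_body_idx ON posts (body)
USING 'org.apache.cassandra.index.sasi.SASIIndex';
```

So a realistic starting point is the org.apache.cassandra.index.sasi package in the source tree: an R-Tree-backed index would mean providing your own index class and naming it in the USING clause, rather than editing the generic B-Tree utility classes.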

Related

Managing the neo4j index's life cycle (CRUD)

I have limited (and disjointed) experience with databases, and nearly none with indexes. Based on web searches, reading books, and working with ORMs, my understanding can be summed up as follows:
An index in a database is similar to a book index, in that it lists "stuff" that's in the book and tells you where to find it. This helps with lookup efficiency (which is probably not the only benefit).
In (at least some) RDBMSs, primary key fields get indexed automatically, so you never have to manipulate those indexes directly.
I'm tinkering with neo4j and it seems you have to be deliberate about indexes so now I need to understand them but I cannot find clear answers to:
How are indexes managed in neo4j?
I know there's automatic indexing, how does it work?
If you choose to manually manage your own indexes, what can you control about them? Perhaps the index name, etc.?
Would appreciate answers or pointers to answers, thanks.
Neo4j uses Apache Lucene under the covers if you want index-engine-like capabilities for your data. You can index nodes and/or relationships; the index helps you look up a particular instance or set of nodes or relationships.
Manual Indexing:
You can create as many node/relationship indexes as you want, and you can specify a name for each index. The configuration can also be controlled, i.e. whether you want exact matching (the default) or Lucene's full-text indexing support. Once you have the index, you simply add nodes/relationships to it along with the key/value you want indexed. You do, however, need to take care of "updating" data in the index yourself if you make changes to the node properties.
Auto-Indexing:
Here you get one index for nodes and one index for relationships if you turn them on in the neo4j.properties file. You may specify which properties are to be indexed, and from the point of turning them on the index is automatically managed for you, i.e. any nodes created after this point are added to the index and updated/removed automatically.
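A minimal sketch of what that looks like in neo4j.properties (the property names are from the 1.x docs; the key lists are just examples):

```
# Enable auto-indexing for nodes and relationships
node_auto_indexing=true
node_keys_indexable=name,email
relationship_auto_indexing=true
relationship_keys_indexable=since
```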
More reading:
http://docs.neo4j.org/chunked/stable/indexing.html
The above applies to versions < 2.0
2.0 adds more around the concept of indexing itself, you might want to go through
http://www.neo4j.org/develop/labels
http://blog.neo4j.org/2013/04/nodes-are-people-too.html
Hope that helps.

Can a Lucene index be stored in an RDBMS

If I don't have access to the file system, but do have access to a MySQL instance, can I store the Lucene index in a MySQL database? I was able to find the DbDirectory and thought that it might do the trick. However, it looks like it works with a Berkeley DB rather than an RDBMS.
There are some contributions that store a Lucene index in simpler datastores (from the perspective of the data model), for example BerkeleyDB and Cassandra. So technically it is possible to write an implementation of Directory that would store the index over JDBC. There is one in the Compass framework.
I don't believe you can; it would defeat the purpose of Lucene. If your indexing does not take too long, you could consider a RAMDirectory, which I believe keeps the index in memory.

Fulltext Search with InnoDB

I'm developing a high-volume web application, where part of it is a MySQL database of discussion posts that will need to grow to 20M+ rows, smoothly.
I was originally planning on using MyISAM for the tables (for the built-in fulltext search capabilities), but the thought of the entire table being locked due to a single write operation makes me shudder. Row-level locks make so much more sense (not to mention InnoDB's other speed advantages when dealing with huge tables). So, for this reason, I'm pretty determined to use InnoDB.
The problem is... InnoDB doesn't have built-in fulltext search capabilities.
Should I go with a third-party search system, like Lucene (C++) or Sphinx? Do any of you database ninjas have any suggestions/guidance? LinkedIn's Zoie (built on Lucene) looks like the best option at the moment, having been built around realtime capabilities (which are pretty critical for my application). I'm a little hesitant to commit yet without some insight...
(FYI: going to be on EC2 with high-memory rigs, using PHP to serve the frontend)
Along with the general phasing out of MyISAM, InnoDB full-text search (FTS) is finally available, as of the MySQL 5.6.4 release.
Lots of juicy details at https://dev.mysql.com/doc/refman/5.6/en/innodb-fulltext-index.html.
While other engines have lots of different features, this one is InnoDB, so it's native (which means there's an upgrade path), and that makes it a worthwhile option.
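As a quick sketch (table and column names are illustrative), on MySQL 5.6.4 or later a FULLTEXT index can now live directly on an InnoDB table:

```sql
CREATE TABLE posts (
    id   INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
    body TEXT,
    FULLTEXT KEY ft_body (body)
) ENGINE=InnoDB;

-- Queried with the usual MATCH ... AGAINST syntax
SELECT id FROM posts
WHERE MATCH(body) AGAINST('sphinx lucene' IN NATURAL LANGUAGE MODE);
```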
I can vouch for MyISAM fulltext being a bad option - even leaving aside the various problems with MyISAM tables in general, I've seen the fulltext stuff go off the rails and start corrupting itself and crashing MySQL regularly.
A dedicated search engine is definitely going to be the most flexible option here - store the post data in MySQL/innodb, and then export the text to your search engine. You can set up a periodic full index build/publish pretty easily, and add real-time index updates if you feel the need and want to spend the time.
Lucene and Sphinx are good options, as is Xapian, which is nice and lightweight. If you go the Lucene route, don't assume that CLucene will be better, even if you'd prefer not to wrestle with Java; I'm not really qualified to discuss the pros and cons of either, though.
You should spend an hour and go through installation and test-drive of Sphinx and Lucene. See if either meets your needs, with respect to data updates.
One of the things that disappointed me about Sphinx is that it doesn't support incremental inserts very well. That is, it's very expensive to reindex after an insert, so expensive that their recommended solution is to split your data into older, unchanging rows and newer, volatile rows. So every search your app does would have to search twice: once on the larger index for old rows and also on the smaller index for recent rows. If that doesn't integrate with your usage patterns, then Sphinx is not a good solution (at least not in its current implementation).
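For reference, the split the Sphinx documentation recommends is the "main + delta" scheme, roughly like this in sphinx.conf (the source names and the sph_counter table are conventions from the Sphinx docs, not requirements; posts/id/body are placeholders):

```
source main
{
    # Remember the highest indexed id at full-index time
    sql_query_pre = REPLACE INTO sph_counter SELECT 1, MAX(id) FROM posts
    sql_query     = SELECT id, body FROM posts \
                    WHERE id <= (SELECT max_id FROM sph_counter WHERE counter_id = 1)
}

# Small, frequently rebuilt index holding only the newer rows
source delta : main
{
    sql_query_pre =
    sql_query     = SELECT id, body FROM posts \
                    WHERE id > (SELECT max_id FROM sph_counter WHERE counter_id = 1)
}
```

Searches then hit both indexes, which is exactly the double lookup described above.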
I'd like to point out another possible solution you could consider: Google Custom Search. If you can apply some SEO to your web application, then outsource the indexing and search function to Google, and embed a Google search textfield into your site. It could be the most economical and scalable way to make your site searchable.
Perhaps you shouldn't dismiss MySQL's FT so quickly. Craigslist used to use it.
"MySQL’s speed and Full Text Search has enabled craigslist to serve their users .. craigslist uses MySQL to serve approximately 50 million searches per month at a rate of up to 60 searches per second."
edit
As commented below, Craigslist seems to have switched to Sphinx some time in early 2009.
Sphinx, as you point out, is quite nice for this stuff. All the work is in the configuration file. Make sure whatever your table is with the strings has some unique integer id key, and you should be fine.
Try this expression, which is non-zero when the text contains the search term:
ROUND((LENGTH(text) - LENGTH(REPLACE(text, 'searchtext', ''))) / LENGTH('searchtext'), 0) != 0
(It counts occurrences of 'searchtext' by comparing the string length before and after removing it.)
You should take a look at Sphinx; it is worth a try. Its indexing is super fast and it is distributed. You should take a look at this webinar (http://www.percona.com/webinars/2012-08-22-full-text-search-throwdown). It talks about searching and has some neat benchmarks. You may find it helpful.
If everything else fails, there's always soundex_match, which sadly isn't really fast or accurate.
For anyone stuck on an older version of MySQL / MariaDB (i.e. CentOS users) where InnoDB doesn't support Fulltext searches, my solution when using InnoDB tables was to create a separate MyISAM table for the thing I wanted to search.
For example, my main InnoDB table was products, with various keys and referential integrity. I then created a simple MyISAM table called product_search containing two fields, product_id and product_name, where the latter was set to a FULLTEXT index. Both fields are effectively a copy of what's in the main products table.
I then search on the MyISAM table using fulltext, and do an inner join back to the InnoDB table.
The contents of the MyISAM table can be kept up-to-date via either triggers or the application's model.
I wouldn't recommend this if you have multiple tables that require fulltext, but for a single table it seems like an adequate work around until you can upgrade.
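A sketch of the workaround in SQL, assuming the products table has id and name columns as in the example above (the trigger shown covers inserts only; updates and deletes would need their own):

```sql
CREATE TABLE product_search (
    product_id   INT UNSIGNED PRIMARY KEY,
    product_name VARCHAR(255),
    FULLTEXT KEY ft_name (product_name)
) ENGINE=MyISAM;

-- Keep the copy in sync on insert
CREATE TRIGGER products_ai AFTER INSERT ON products
FOR EACH ROW
    INSERT INTO product_search (product_id, product_name)
    VALUES (NEW.id, NEW.name);

-- Fulltext-search the MyISAM copy, join back to the InnoDB table
SELECT p.*
FROM product_search s
JOIN products p ON p.id = s.product_id
WHERE MATCH(s.product_name) AGAINST('widget');
```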

Tips/recommendations for using Lucene

I'm working on a job portal using asp.net 3.5
I've used Lucene for job and resume search functionality.
Would like to know tips/recommendations if any with respect to Lucene performance optimization, scalability, etc.
Thanks a ton!
I've documented how I used Lucene.NET (in BugTracker.NET) here:
http://www.ifdefined.com/blog/post/2009/02/Full-Text-Search-in-ASPNET-using-LuceneNET.aspx
One thing you should keep in mind is that it is very hard to cluster or replicate Lucene indexes in large installations, such as failover scenarios or distributed systems. So you should either have a good way to replicate your index jobs or the whole database.
If you use a sort, watch out for the size of the comparators. When sorts are used, for each document returned by the searcher there will be a comparator object stored for each SortField in the Sort object. Depending on the size of the documents and the number of fields you want to sort on, this can become a big headache.

Is there a set of best practices for building a Lucene index from a relational DB?

I'm looking into using Lucene and/or Solr to provide search in an RDBMS-powered web application. Unfortunately for me, all the documentation I've skimmed deals with how to get the data out of the index; I'm more concerned with how to build a useful index. Are there any "best practices" for doing this?
Will multiple applications be writing to the database? If so, it's a bit tricky; you have to have some mechanism to identify new records to feed to the Lucene indexer.
Another point to consider is do you want one index that covers all of your tables, or one index per table. In general, I recommend one index, with a field in that index to indicate which table the record came from.
Hibernate has support for full text search, if you want to search persistent objects rather than unstructured documents.
There's an OpenSymphony project called Compass of which you should be aware. I have stayed away from it myself, primarily because it seems to be way more complicated than search needs to be. Also, as far as I can tell from the documentation (I confess I haven't found the time necessary to read it all), it stores Lucene segments as blobs in the database. If you're familiar with the Lucene architecture, Compass implements a Lucene Directory on top of the database. I think this is the wrong approach. I would leverage the database's built-in support for indexing and implement a Lucene IndexReader instead. The same criticism applies to distributed cache implementations, etc.
I haven't explored this at all, but take a look at LuSql.
Using Solr would be straightforward as well, but there will be some DRY violations between the Solr schema.xml and your actual database schema. (FYI, Solr does support wildcards, though.)
We are rolling out our first application that uses Solr tonight. With Solr 1.3, they've included the DataImportHandler, which allows you to specify your database tables (they call them entities) along with their relationships. Once defined, a simple HTTP request will trigger an import of your data.
Take a look at the Solr wiki page for DataImportHandler for details.
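A minimal data-config.xml for the DataImportHandler looks roughly like this (driver, URL, credentials, and table/field names are all placeholders):

```xml
<dataConfig>
  <dataSource driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost/mydb"
              user="solr" password="secret"/>
  <document>
    <entity name="post" query="SELECT id, title, body FROM posts">
      <field column="id"    name="id"/>
      <field column="title" name="title"/>
      <field column="body"  name="body"/>
    </entity>
  </document>
</dataConfig>
```

The import itself is then triggered with a request like http://localhost:8983/solr/dataimport?command=full-import.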
As an introduction, Brian McCallister wrote a nice blog post: Using Lucene with OJB.