Solr indexing in a Storm topology vs HBase NG Indexer

I am designing a feature that indexes data into Solr. We use a Storm topology with an HBase bolt that writes data into HBase. The requirement is that whatever data we add to HBase must also be indexed.
The options are:
1. Add the Solr indexing code to the HBase bolt itself.
2. Create a new bolt that handles Solr indexing separately.
3. Use the HBase NG Indexer, which ties Solr indexing to HBase row insertion.
The first two options behave like transactions: a record goes into both HBase and Solr, or into neither. I am not sure we can achieve this, because we are dealing with data at large scale.
With the third option, HBase is the starting point, so all data is assumed to be there already. However, we do not have complete control over debugging, because we have to deploy the jar into the Indexer environment.
Which design is preferable?

After some analysis, we went ahead and implemented the design with the HBase NG Indexer. One argument was that we cannot guarantee the same data in HBase and Solr, because we cannot handle transactions at large scale. We also have a similar design for streaming data, so we reused that setup.

Related

How does Apache Ignite partition the spatial data?

I ran a geospatial query on Apache Ignite successfully, but I didn't understand how its partitioning works. How does Apache Ignite partition spatial data among nodes when we use the PARTITIONED CacheMode? Does it use a partitioning technique like Grid or Quad-tree? I saw that it creates 1024 partitions for each data set. How can I change the number of partitions? I have already read the documentation but didn't find anything about this. Any suggestions or document links will be appreciated.
Partitioning is performed per key using Rendezvous hashing; Apache Ignite is a key-value store.
You can change the partitioning properties by specifying an affinityFunction in CacheConfiguration.
Partitioning is usually wholly unrelated to spatial indexing, since a spatial index is a secondary index.
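To make the key-based assignment concrete, here is a minimal sketch of rendezvous (highest-random-weight) hashing in Python. It is not Ignite's implementation: the node names, the md5-based score, and the helper names (`partition_for`, `owner_of`, `assignment`) are assumptions made for illustration. Only the partition count of 1024 comes from the question.

```python
import hashlib

PARTITIONS = 1024  # partition count mentioned in the question


def _score(node: str, partition: int) -> int:
    """Deterministic pseudo-random weight for a (node, partition) pair."""
    digest = hashlib.md5(f"{node}:{partition}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def partition_for(key: str, partitions: int = PARTITIONS) -> int:
    """Map a cache key to a partition; only the key matters, not its geometry."""
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % partitions


def owner_of(partition: int, nodes: list[str]) -> str:
    """Rendezvous hashing: the node with the highest score owns the partition."""
    return max(nodes, key=lambda node: _score(node, partition))


def assignment(nodes: list[str], partitions: int = PARTITIONS) -> dict[int, str]:
    """Full partition -> node map for a given (hypothetical) cluster."""
    return {p: owner_of(p, nodes) for p in range(partitions)}
```

A useful property of this scheme is that when a node leaves, only the partitions it owned are reassigned; every other partition keeps its owner. Nothing in the mapping looks at coordinates, which is consistent with the point that spatial indexing is a separate, secondary concern.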

Difference between Partial Replication and Sharding?

I was wondering whether sharding is an alternate name for partial replication. This is what I have figured out so far:
Partial replication – each data item has copies at some but not all of the nodes (‘Sharding’?)
Pure partial replication – each node has copies of only a subset of the data items, and no node contains a full copy of the database
Hybrid partial replication – one set of nodes are full replicas and another set of nodes are partial replicas
Partial replication is an interesting approach in which you distribute the data by replicating from a master to slaves, each of which contains a portion of the data. Eventually you get an array of smaller, read-only DBs, each containing a portion of the data. Reads can very well be distributed and parallelized.
But what about the writes?
Those are still clogged in 1 big fat lazy master database. Tasks such as buffer management, locking, thread locks/semaphores, and recovery are the real bottleneck of OLTP; they make writes impossible to scale... See more in my blog post here: http://database-scalability.blogspot.com/2012/08/scale-up-partitioning-scale-out.html. BTW - your topic right here just gave me a great idea for another post. I'll link to this question and give you the credit! :)
Sharding is where data appears only once, within an array of DBs. Each database is the complete owner of its data: data is read from there and written to there. This way, reads and writes are distributed and parallelized. Real scale-out can be achieved.
Sharding is a mess to handle and maintain; it's hard as hell. ScaleBase (I work there) enables automatic, transparent scale-out: just throw it in the middle and you'll have 10 DBs at the back, and it'll look like 1 to your app. Automatic, transparent super-sharding - in a box.
Sharding is a method of horizontally partitioning a table. It is not related to replication.
Traditionally, an RDBMS server sits at the center of a system with a star-like topology. That's why it becomes:
1. the single point of failure
2. the performance bottleneck of the system
To resolve issue #1 you use replication: if the original server dies, you fail over to a replica.
To resolve issue #2 you can:
1. use sharding
1.1 do sharding by yourself
1.2 use your RDBMS "out of the box" clustering mechanism
2. migrate to a NoSQL solution
Sharding allows you to scale out a database to many servers by splitting the data among them. However, sharding is a trade-off: it limits you in joining or intersecting data, etc.
You still have issue #1 if you use sharding. So it's a good practice to replicate sharded nodes.
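A minimal sketch of how the two ideas combine, using plain Python dictionaries as stand-ins for database servers. The `Shard` and `ShardedStore` classes, the crc32-modulo routing, and the synchronous copy to the replica are all assumptions made for illustration, not any product's API.

```python
import zlib


class Shard:
    """One shard: a primary copy plus a replica kept in sync on every write."""

    def __init__(self) -> None:
        self.primary: dict[str, str] = {}
        self.replica: dict[str, str] = {}
        self.primary_alive = True

    def write(self, key: str, value: str) -> None:
        if not self.primary_alive:
            raise RuntimeError("primary is down; promote the replica first")
        self.primary[key] = value
        self.replica[key] = value  # synchronous replication, for simplicity

    def read(self, key: str) -> str:
        # Fail over to the replica when the primary is gone (issue #1).
        source = self.primary if self.primary_alive else self.replica
        return source[key]


class ShardedStore:
    """Splits keys across shards (issue #2); each shard is replicated."""

    def __init__(self, shard_count: int) -> None:
        self.shards = [Shard() for _ in range(shard_count)]

    def shard_for(self, key: str) -> Shard:
        # crc32 is stable across runs, unlike Python's built-in hash() for str.
        return self.shards[zlib.crc32(key.encode()) % len(self.shards)]

    def put(self, key: str, value: str) -> None:
        self.shard_for(key).write(key, value)

    def get(self, key: str) -> str:
        return self.shard_for(key).read(key)
```

Each key lives on exactly one shard, so writes spread across servers, while the replica of that shard covers the failure of its primary. Cross-shard joins are not modelled here, which mirrors the trade-off described above.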

Best practices for syncing Lucene repository with source data?

I am designing an application which will have a heavy reliance on searching using a Lucene.NET repository. The repository will be built using data from an operational database that is constantly changing. I'm trying to figure out the best strategy to keep the Lucene repository synced up with the source database. Should I have a service running that wakes up every few minutes, queries the database for updated records, and adds/removes from the Lucene index? Should I rebuild the Lucene repository every night and tolerate some latency in the data?
What are the best practices for keeping the data in a Lucene repository fresh? How do the different strategies affect latency, performance, etc.?
Lucene is capable of so-called near real-time search, which means that updates to the index can be seen in query results almost instantly. So you can freely send updates as soon as they are saved in the database. Lucene should have no problem handling even quite frequent updates; Twitter search, for example, is built with it (of course, to sustain such a big load, you would need to distribute your index).
So preferably, you would send your updates from code that is triggered after the transaction is committed. It is hard to say anything more specific without knowing what database or queuing system you are using. Some general thoughts on this matter, as well as examples of using it along with CouchDB or RabbitMQ, are shown in the elasticsearch river documentation.
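As a rough sketch of the "update the index after the commit" idea, the Python below uses an in-memory SQLite database as the operational store and a toy inverted index in place of Lucene. `ToyIndex`, `save_article`, `delete_article` and the `articles` table are hypothetical names for illustration; a real setup would call the Lucene (or Lucene.NET) writer, or publish to a queue, at the marked points.

```python
import sqlite3
from collections import defaultdict


class ToyIndex:
    """In-memory inverted index standing in for a Lucene index."""

    def __init__(self) -> None:
        self.postings: dict[str, set[int]] = defaultdict(set)
        self.docs: dict[int, str] = {}

    def delete_document(self, doc_id: int) -> None:
        for term in self.docs.pop(doc_id, "").lower().split():
            self.postings[term].discard(doc_id)

    def update_document(self, doc_id: int, text: str) -> None:
        # Delete-then-add, so re-sending the same update is harmless.
        self.delete_document(doc_id)
        self.docs[doc_id] = text
        for term in text.lower().split():
            self.postings[term].add(doc_id)

    def search(self, term: str) -> set[int]:
        return set(self.postings.get(term.lower(), set()))


def save_article(db: sqlite3.Connection, index: ToyIndex, doc_id: int, text: str) -> None:
    with db:  # commits on success, rolls back if an exception is raised
        db.execute(
            "INSERT OR REPLACE INTO articles (id, body) VALUES (?, ?)",
            (doc_id, text),
        )
    # Reached only after the commit succeeded: now update the index.
    index.update_document(doc_id, text)


def delete_article(db: sqlite3.Connection, index: ToyIndex, doc_id: int) -> None:
    with db:
        db.execute("DELETE FROM articles WHERE id = ?", (doc_id,))
    index.delete_document(doc_id)
```

If the process dies between the commit and the index update, the index misses that change, so a periodic catch-up job or a durable queue is still worth having alongside this.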

Redis as a database

I want to use Redis as a database, not a cache. From my (limited) understanding, Redis is an in-memory datastore. What are the risks of using Redis, and how can I mitigate them?
You can use Redis as an authoritative store in a number of different ways:
Turn on AOF (append-only file); see the AOF docs. This keeps a log of all write commands made against your dataset in real time.
Run Redis using master-slave replication; see the replication docs. This allows you to provide high availability if one of your instances fails.
If you're running on something like EC2, you can back your Redis partition with EBS to provide another layer of protection against instance failure.
On the horizon is Redis Cluster - this is specifically designed as a way to run Redis in a way that should help with HA and scalability. However, this won't appear for at least another six months or so.
Redis is an in-memory store which can also write the data back to disk. You can specify how often to fsync to make Redis safer (but also slower, so it is a trade-off).
But I am still not certain that Redis is yet in a state to store (mission-)critical data. If, for example, it is not a huge problem when one or more tweets (twitter.com) or something similar get lost, then I would certainly use Redis. There is also a lot of information about persistence on Redis's own website.
You should also be aware of some persistence problems that can occur; antirez (the Redis maintainer) describes them in a blog article. His blog is worth reading because it has some interesting articles.
I would like to share a few things that we learned by using Redis as the primary database in our service. We chose Redis because we had data that could not be partitioned, and we wanted the best performance we could get out of one box.
Pros:
Redis was unbeatable in raw performance. We got 10K transactions per second out of the box (note that one transaction involved multiple Redis commands). We were able to hit a rate of 25K+ transactions per second after a few optimizations, along with Lua scripts. So when it comes to performance per box, Redis is unmatched.
Redis is very simple to set up and has a very small learning curve compared to other SQL and NoSQL datastores.
Cons:
Redis supports only a few primitive data structures, such as hashes, sets and lists, and operations on them. These are more than sufficient when you use Redis as a cache, but if you want to use Redis as a full-fledged primary data store, you will feel constrained. We had a tough time modelling our data requirements using these simple types.
The biggest problem we saw with Redis was the lack of flexibility. Once you have settled on the structure of your data, any modification to storage requirements or access patterns virtually requires rethinking the entire solution. I am not sure whether this is the case with all NoSQL data stores (I have heard MongoDB is more flexible, but haven't used it myself).
Since Redis is single-threaded, CPU utilization is very low. You can't put multiple Redis instances on the same machine to improve CPU utilization, as they will compete for the same disk, making the disk the bottleneck.
Lack of horizontal scalability is a problem as mentioned by other answers.
As Redis is an in-memory store, you cannot store data that won't fit in your machine's memory. Redis usually performs very badly when the data it stores is larger than 1/3 of the RAM size. So this is the fatal limitation of using Redis as a database.
Certainly, you can distribute your big data across several Redis instances, but you have to do it all manually. The operation is usually done like this (assuming you start with only 1 instance):
1. Use the master-slave mechanism to replicate the data to a second machine. Now you have 2 copies of the same data.
2. Cut off the connection between master and slave.
3. Delete the first half of the data (split by hashing, etc.) on the first machine, and delete the second half on the second machine.
4. Tell all clients (PHP, C, etc.) to operate on the first machine if the specified keys are on that machine, and otherwise on the second machine.
This is how Redis scales! You also have to stop your service to prevent any writes during the migration.
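The manual split can be sketched in Python with plain dictionaries standing in for the two Redis instances. The function names and the crc32-parity split rule are assumptions for illustration; the procedure above only says "split by hashing, etc." and does not prescribe a rule.

```python
import zlib


def belongs_to_first(key: str) -> bool:
    """Hypothetical split rule: keys with an even crc32 stay on the first machine."""
    return zlib.crc32(key.encode()) % 2 == 0


def split_instance(master: dict[str, str]) -> tuple[dict[str, str], dict[str, str]]:
    # Replicate, then cut the link: both machines now hold a full copy.
    first = dict(master)
    second = dict(master)
    # Each machine deletes the half of the keys it no longer owns.
    for key in list(first):
        if not belongs_to_first(key):
            del first[key]
    for key in list(second):
        if belongs_to_first(key):
            del second[key]
    return first, second


def route(key: str, first: dict[str, str], second: dict[str, str]) -> dict[str, str]:
    # Every client must apply the same rule to find the right machine.
    return first if belongs_to_first(key) else second
```

The sketch also shows why writes have to stop during the migration: a key written to the old master after the copy but before clients switch to `route` could end up on the wrong machine or be deleted with the discarded half.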
From our experience, we came to this conclusion about Redis: it is not the right choice for storing more than 30G of data, it is not scalable, and it is quite suitable for prototype development.
We later found an alternative to Redis: SSDB (https://github.com/ideawu/ssdb), a LevelDB server that supports nearly all of the Redis APIs. It is suitable for storing more than 1TB of data; the limit depends only on the size of your hard disk.
Redis is a database, that means we can use it for persisting information for any kind of app, information like user accounts, blog posts, comments and so on. After storing information we can retrieve it later on by writing queries.
Now this behavior is similar to just about every other database, but what is the difference? Or rather why would we use it over any other database?
Redis is fast.
Redis is not fast because it's written in a special programming language or anything like that; it's fast because all data is stored in memory.
Most databases keep their data partly in memory and partly on the hard drive. Accessing data in memory is fast, but reading it from or writing it to a hard disk is relatively slow.
So rather than storing data on the hard disk, Redis stores it in memory.
The downside is that working with data that is larger than the amount of memory your computer has is not going to work.
That may sound like a tremendous problem, but Redis has clear strategies for working around this limitation.
The above is just the first reason why Redis is so fast.
The second reason is that Redis stores all of its data or rather organizes all of its data in simple data structures such as Doubly Linked Lists, Sorted Sets and so on.
These data structures have well-known and well-understood performance characteristics. So as developers we can decide exactly how our information is organized and how to efficiently query data.
It's also very fast because Redis is simple in nature, it's not feature heavy; feature heavy datastores like Postgres have performance penalties.
So to use Redis as a database you have to know how to store data in limited space, how to organize it into the simple data structures mentioned above, and how to work around the limited feature set.
So as far as mitigating risks goes, start by thinking in terms of a Redis design methodology rather than a SQL database design methodology. What do I mean?
With SQL it's: Step 1. Put the data in tables. Step 2. Figure out how we will query it.
With Redis it's more:
Step 1. Figure out what queries we need to answer.
Step 2. Structure data to best answer those queries.
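As a small illustration of the query-first approach, suppose the query we need to answer is "the latest N posts by a user". The sketch below models that in Python with a tiny in-memory stand-in for two Redis structures, a hash per post and a sorted set per user. `TinyStore`, the key names and the helper functions are hypothetical; the method names only mirror the Redis commands HSET, ZADD and ZREVRANGE and are not a client library.

```python
import bisect


class TinyStore:
    """Toy stand-in for Redis hashes and sorted sets (illustration only)."""

    def __init__(self) -> None:
        self.hashes: dict[str, dict[str, str]] = {}
        self.sorted_sets: dict[str, list[tuple[float, str]]] = {}

    def hset(self, key: str, mapping: dict[str, str]) -> None:
        self.hashes.setdefault(key, {}).update(mapping)

    def zadd(self, key: str, score: float, member: str) -> None:
        # Keeps (score, member) pairs sorted; re-adding a member is not handled.
        bisect.insort(self.sorted_sets.setdefault(key, []), (score, member))

    def zrevrange(self, key: str, start: int, stop: int) -> list[str]:
        """Members by descending score; stop is inclusive, non-negative only."""
        members = [m for _, m in reversed(self.sorted_sets.get(key, []))]
        return members[start:stop + 1]


def add_post(store: TinyStore, post_id: str, user_id: str, title: str, ts: float) -> None:
    # Step 2: structure the data so the query is a single range read.
    store.hset(f"post:{post_id}", {"title": title, "user": user_id})
    store.zadd(f"user:{user_id}:posts", ts, post_id)


def latest_posts(store: TinyStore, user_id: str, n: int) -> list[str]:
    # Step 1's query: newest n post titles for one user.
    ids = store.zrevrange(f"user:{user_id}:posts", 0, n - 1)
    return [store.hashes[f"post:{pid}"]["title"] for pid in ids]
```

A different query, say "all posts with a given tag", would need its own structure (for example a set per tag) maintained at write time, which is the flexibility cost mentioned in the earlier answer.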

Is there a way to shard and replicate neo4j data?

I am considering Neo4j for some of the new projects I am working on. The data is inherently graph-based, so Neo4j fits well, and a quick prototype is giving me good response times. What I want to understand is how to scale a Neo4j deployment. Specifically:
How do I shard my data across Neo4j deployments? Since Neo4j is deployed on a single machine, there is a limit to how much data I can store on it, so I would like to know how to distribute the data. Clearly, if I split it on users, then relationships between users in different shards cannot be maintained.
How do I replicate the Neo4j data? I am thinking of a SQL-like setup with masters used for writes and slaves used for reads, so that we can scale both readers and writers and also have a real-time backup of our data. I understand that all the Neo4j data is stored in a filesystem, which is inherently not replicable. Is there a way I can do it here? Perhaps something akin to a MySQL bin log?
Sharding is currently not handled by Neo4j itself but by the domain, much as you describe. Neo4j 2.0 is going to target that problem.
For replication, Online Backup works today, and real High Availability with master failover is in the works, using ZooKeeper to track the cluster nodes, elect new masters, etc.
Any more details on your app sharding requirements? What domain etc?