Is there any concept of auto commit in HBase?

I am new to hbase and want to learn more. I just want to know if there is any auto commit concept available in HBASE?

As per the HBase documentation, HBase is not an ACID-compliant database. However, it does guarantee certain specific properties.
This specification enumerates the ACID properties of HBase.
There is a concept of AutoFlush in HBase which is similar to autocommit.
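For illustration, a minimal sketch of AutoFlush-style buffered writes with the plain HBase Java client is shown below. The table name, column family and ZooKeeper quorum are placeholders; newer client versions expose this buffering through BufferedMutator, while older ones used HTable.setAutoFlush(false).

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.BufferedMutator;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BufferedWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zk-host"); // placeholder quorum
        try (Connection connection = ConnectionFactory.createConnection(conf);
             BufferedMutator mutator =
                     connection.getBufferedMutator(TableName.valueOf("my_table"))) {
            for (int i = 0; i < 10000; i++) {
                Put put = new Put(Bytes.toBytes("row-" + i));
                put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("col"),
                        Bytes.toBytes("value-" + i));
                mutator.mutate(put); // buffered client-side, not sent per record
            }
            mutator.flush(); // explicit "commit" of the buffered Puts to the region servers
        }
    }
}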
However, if you are using Apache Phoenix to fetch or update data in HBase, you can set the property phoenix.connection.autoCommit to true (it is false by default).
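With the Phoenix JDBC driver that might look roughly like the following sketch (the ZooKeeper quorum in the URL and the UPSERT statement are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.Properties;

public class PhoenixAutoCommitSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.setProperty("phoenix.connection.autoCommit", "true"); // default is false
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:zk-host", props);
             PreparedStatement ps = conn.prepareStatement("UPSERT INTO MY_TABLE VALUES (?, ?)")) {
            ps.setInt(1, 1);
            ps.setString(2, "value");
            ps.executeUpdate(); // committed immediately because autoCommit is on
            // With autoCommit left at false you would call conn.commit() yourself,
            // typically once per batch, which is better for bulk ingestion.
        }
    }
}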

Commits mainly come into play in two places: insert/update (Put in HBase) and delete (Delete in HBase).
Since we are in a Big Data environment, the requirements are different when you are ingesting huge volumes of data.
As mentioned in the documentation, autoCommit should be set to false for better performance, so that writes are buffered rather than each record being committed individually. This helps with buffer handling in general and with the load on the HBase region servers.
Delete
HBase does not modify data in place, so deletes are handled by creating new markers called tombstones. These tombstones, along with the dead values, are cleaned up during major compactions.
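For completeness, a client-side delete looks roughly like the sketch below (connection details, table and row key are placeholders); it only writes the tombstone marker, and the space is reclaimed later by a major compaction.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class DeleteSketch {
    public static void main(String[] args) throws Exception {
        try (Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Table table = connection.getTable(TableName.valueOf("my_table"))) {
            table.delete(new Delete(Bytes.toBytes("row-42"))); // writes a tombstone, not an in-place removal
        }
    }
}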
One last word on Phoenix: any layer sitting on top of HBase will ultimately work within the constraints of the HBase architecture. Hope this helps in your design.

Related

How to export data into CSV from GemFire

I am trying to build an ETL process that extracts data out of GemFire and loads it into Teradata. However, I am not finding a good mechanism to export the data. The only thing I have found so far is a REST API that gets all entries from a region. However, is this good for bulk export? It returns the data as JSON, which has to be parsed before loading into a table, and I assume that won't be very performant for large volumes of data. Is there any other solution to this? Like exporting data as CSV from GemFire? Or an ODBC/JDBC connection to GemFire? I found both the bulk export and ODBC/JDBC in the GemFire XD documentation but not in core GemFire, so are they not supported in core GemFire? What is the difference between core GemFire and the XD version?
These are two different products designed with different things in mind. GemFire XD provides a low-latency SQL interface to in-memory table data, so it's generally used as an in-memory RDBMS. GemFire, on the other hand, is an in-memory data grid with "no restrictions" regarding the data you insert into the regions; you basically deal with custom Java objects, not with tables. Also, I know GemFire XD was built on top of GemFire in the past, but I'm not sure what the current status of that is (you might want to have a look at Snappy Data for more details).
That said, and strictly speaking of GemFire, you can export snapshots of your regions and import them afterwards into another cluster. Even better, you can read a snapshot entry by entry for further processing or transformation into other formats, which I believe is exactly what you're looking for. Please have a look at Cache and Region Snapshots to get the details.
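As a rough sketch of reading a snapshot entry by entry and turning it into CSV, assuming the org.apache.geode packages of newer releases (classic GemFire uses com.gemstone.gemfire instead) and a snapshot file produced beforehand, e.g. with gfsh's export data command; the region, member and file names are made up:

import java.io.File;
import java.io.PrintWriter;
import java.util.Map;

import org.apache.geode.cache.snapshot.SnapshotIterator;
import org.apache.geode.cache.snapshot.SnapshotReader;

public class SnapshotToCsvSketch {
    public static void main(String[] args) throws Exception {
        // Snapshot produced earlier, e.g.:
        //   export data --region=/Customers --file=customers.gfd --member=server1
        SnapshotIterator<String, Object> it = SnapshotReader.read(new File("customers.gfd"));
        try (PrintWriter csv = new PrintWriter("customers.csv")) {
            csv.println("key,value");
            while (it.hasNext()) {
                Map.Entry<String, Object> entry = it.next();
                // Replace toString() with whatever field extraction your domain
                // objects need before loading into Teradata.
                csv.println(entry.getKey() + "," + entry.getValue());
            }
        } finally {
            it.close();
        }
    }
}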
Hope this helps. Cheers.

How to handle data from an external, independent data source with Pivotal GemFire?

I am new to GemFire.
Currently we are using a MySQL DB and would like to move to GemFire.
How do we move the existing data stored in MySQL over to GemFire? I.e., is there any way to import existing MySQL data into GemFire?
There are many different options available for you to migrate data from 1 data store (e.g. an RDBMS like MySQL) to an IMDG (e.g. Pivotal GemFire). Pivotal GemFire does not provide any tools for this purpose OOTB.
However, you could...
A) Write a Spring Batch application to migrate all your data from MySQL to Pivotal GemFire in 1 large swoop. This is typical for most large-scale conversion processes, converting from 1 data store to another, either as part of an upgrade or a migration.
The advantage of using Pivotal GemFire as your target data store is that it stores Java Objects. So, if you are, say, using an ORM tool (e.g. Hibernate) to map the data stored in your MySQL database tables back to your application domain objects, you can then immediately and simply turnaround and store those same Objects directly into a corresponding Region in Pivotal GemFire. There is no additional mapping required to store an Object into GemFire.
Although, if you need something less immediate, then you can also...
B) Take advantage of Pivotal GemFire's CacheLoader, and maybe even the CacheWriter mechanisms. The CacheLoader and CacheWriter are implementations of the "Read-Through" and "Write-Through" design patterns.
More details of this approach can be found here.
In a nutshell, you implement a CacheLoader to load data from some external data source on Cache miss. You attach, or register the CacheLoader with a GemFire Region when the Region is created. When a Key (which can correspond to your MySQL Table Primary Key) is requested (Region.get(key)) and an entry does not exist, then GemFire will consult the CacheLoader to resolve the value, providing you actually registered a CacheLoader with the Region.
In this way, you slowly build up Pivotal GemFire from the MySQL RDBMS based on need.
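To make option B a bit more concrete, below is a minimal CacheLoader sketch backed by MySQL over plain JDBC. The table, columns, connection details and the String value type are placeholders, and the imports assume the org.apache.geode packages (classic GemFire uses com.gemstone.gemfire). You would register it on the Region when it is created, e.g. via RegionFactory.setCacheLoader(...) or declaratively in cache.xml.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

import org.apache.geode.cache.CacheLoader;
import org.apache.geode.cache.CacheLoaderException;
import org.apache.geode.cache.LoaderHelper;

public class MySqlCustomerLoader implements CacheLoader<Long, String> {

    // Called by GemFire on a cache miss, i.e. when Region.get(key) finds no entry.
    @Override
    public String load(LoaderHelper<Long, String> helper) throws CacheLoaderException {
        Long customerId = helper.getKey();
        String sql = "SELECT name FROM customers WHERE id = ?";
        try (Connection conn = DriverManager.getConnection("jdbc:mysql://db-host/mydb", "user", "pass");
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, customerId);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getString("name") : null; // null means it stays a miss
            }
        } catch (Exception e) {
            throw new CacheLoaderException("Failed to load customer " + customerId, e);
        }
    }

    @Override
    public void close() { } // release any pooled resources here
}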
Clearly, it is quite likely Pivotal GemFire will not be able to store all the data from your RDBMS in "memory". So, you can enable both Persistence and Overflow [to Disk] capabilities. By enabling Persistence, GemFire will load the data from its own DiskStores the next time the nodes come online, assuming you brought them down prior.
The CacheWriter mechanism is nice if you want to run both Pivotal GemFire and MySQL in parallel for a while, until you can shift enough of the responsibilities of MySQL over to GemFire, for instance. The CacheWriter will write back to your underlying MySQL DB each time an entry is written or updated in the GemFire Region. You can even do this asynchronously (i.e. "Write-Behind") using GemFire's AsyncEventQueues and Listeners; see here.
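A matching write-through sketch, again with placeholder connection and table details, extending GemFire's CacheWriterAdapter convenience class so only the callbacks of interest are overridden:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

import org.apache.geode.cache.CacheWriterException;
import org.apache.geode.cache.EntryEvent;
import org.apache.geode.cache.util.CacheWriterAdapter;

public class MySqlCustomerWriter extends CacheWriterAdapter<Long, String> {

    @Override
    public void beforeCreate(EntryEvent<Long, String> event) throws CacheWriterException {
        upsert(event.getKey(), event.getNewValue());
    }

    @Override
    public void beforeUpdate(EntryEvent<Long, String> event) throws CacheWriterException {
        upsert(event.getKey(), event.getNewValue());
    }

    private void upsert(Long id, String name) {
        String sql = "INSERT INTO customers (id, name) VALUES (?, ?) ON DUPLICATE KEY UPDATE name = ?";
        try (Connection conn = DriverManager.getConnection("jdbc:mysql://db-host/mydb", "user", "pass");
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, id);
            ps.setString(2, name);
            ps.setString(3, name);
            ps.executeUpdate();
        } catch (Exception e) {
            // Throwing aborts the GemFire operation, keeping the two stores consistent.
            throw new CacheWriterException("Failed to write customer " + id, e);
        }
    }
}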
Obviously, you have many options at your disposal. You need to carefully weigh those options and choose an approach that best meets your application requirements and needs.
If you have additional questions, let me know.

BigQuery distributed transactions

I'm trying to architect a microservice-based system utilizing BigQuery as one of the services. We need to preserve eventual consistency between BigQuery and the other microservices, so that changes to BigQuery (data uploads, table creates, etc.) are eventually propagated to the other services.
I'm wondering if BigQuery has mechanisms supporting this kind of consistency. As far as I have checked, BigQuery does not support publishing its events to Pub/Sub, which would definitely solve the problem.
I'm thinking of utilizing labels for this; I hope updates of data and labels are atomic with respect to a single API call.
Something like keeping two labels with the current version and the committed version, and maybe the uncommitted operation type. A mutation operation increases the current version and queues a task that publishes the update to Pub/Sub and, on success, updates the committed version to match the current one. I do see a number of problems with this solution, though.
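As a very rough sketch of the two-label idea using the google-cloud-bigquery Java client (the label keys, dataset/table names and the Pub/Sub step are all assumptions on my part, and the exact builder methods may differ between client versions):

import java.util.HashMap;
import java.util.Map;

import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.Table;
import com.google.cloud.bigquery.TableId;

public class LabelVersionSketch {
    public static void main(String[] args) {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        Table table = bigquery.getTable(TableId.of("my_dataset", "my_table"));

        Map<String, String> labels = new HashMap<>();
        if (table.getLabels() != null) {
            labels.putAll(table.getLabels());
        }

        // 1) Bump the "current" version label before (or together with) the mutation.
        long current = Long.parseLong(labels.getOrDefault("current_version", "0")) + 1;
        labels.put("current_version", Long.toString(current));
        table = table.toBuilder().setLabels(labels).build().update();

        // 2) ... perform the data load and publish the change event to Pub/Sub (omitted) ...

        // 3) Only once the event is acknowledged, advance the "committed" label to match.
        labels.put("committed_version", Long.toString(current));
        table.toBuilder().setLabels(labels).build().update();
    }
}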
Basically, there is a broader question of how APIs need to be designed to support eventual consistency with other systems, and whether it is possible to use an API not specifically designed for this in an eventually consistent distributed system.

Two-directional replication of two separate Solr servers

I have read about multi-core and master-slave setups in Solr, but I am looking for complete replication between two separate Solr servers (bidirectional). Where can I find a manual for doing that?
The two or more separate Solr servers can have internal replication or not.
The primary reason I expect you'd want bi-directional replication would be to support something like a cross-datacenter situation. That is, you want to isolate queries to particular places, but keep things in sync across a high-latency link.
If you don't need this, just use SolrCloud and let it handle replication. You can shard your index and get whatever update throughput you need. Any update can go to any node, and Solr will make sure it gets written to the right places.
If you are really thinking about datacenters, Solr added some brand new data center support in 6.0, which you can read about here: https://sematext.com/blog/2016/04/20/solr-6-datacenter-replication/
However, this still assumes updating a single data center and having the other one simply follow along.
Apple also did a talk about their (internal) bidirectional replication system you can watch here: https://www.youtube.com/watch?v=_Erkln5WWLw
That said, the simplest thing would just be to write the updates to both places.
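In SolrJ terms that could look roughly like the sketch below; the URLs and collection name are placeholders, and real code would need queueing/retry for when one side is down so the clusters can't silently drift apart.

import java.util.Arrays;
import java.util.List;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class DualWriteSketch {
    public static void main(String[] args) throws Exception {
        List<SolrClient> clusters = Arrays.asList(
                new HttpSolrClient.Builder("http://dc1-solr:8983/solr/products").build(),
                new HttpSolrClient.Builder("http://dc2-solr:8983/solr/products").build());

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "sku-123");
        doc.addField("name", "example product");

        // Send the same update to both data centers.
        for (SolrClient client : clusters) {
            client.add(doc);
            client.commit();
        }
        for (SolrClient client : clusters) {
            client.close();
        }
    }
}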

Is this a good use-case for Redis on a ServiceStack REST API?

I'm creating a mobile app and it requires an API service backend to get/put information for each user. I'll be developing the web service on ServiceStack, but was wondering about the storage. I love the idea of a fast in-memory caching system like Redis, but I have a few questions:
I created a sample schema of what my data store should look like. Does this seem like a good case for using Redis as opposed to a MySQL DB or something like that?
Schema: http://www.miles3.com/uploads/redis.png
How difficult is the setup for persisting the Redis store to disk or is it kind of built-in when you do writes to the store? (I'm a newbie on this NoSQL stuff)
I currently have my setup on AWS using a Linux micro instance (because it's free for a year). I know many factors go into this answer, but in general will this be enough for my web service and Redis? Since Redis is in-memory will that be enough? I guess if my mobile app skyrockets (hey, we can dream right?) then I'll start hitting the ceiling of the instance.
What to think about when designing a NoSQL Redis application
1) To develop correctly in Redis you should be thinking more about how you would structure the relationships in your C# program, i.e. with the C# collection classes, rather than a relational model meant for an RDBMS. The better mindset is to think of data storage more like a document database than RDBMS tables. Essentially everything gets blobbed in Redis via a key (index), so you just need to work out which are your primary entities (i.e. aggregate roots),
which get kept in their own 'key namespace', and which are non-primary entities, i.e. simply metadata that should just get persisted with its parent entity.
Examples of Redis as a primary Data Store
Here is a good article that walks through creating a simple blogging application using Redis:
http://www.servicestack.net/docs/redis-client/designing-nosql-database
You can also look at the source code of RedisStackOverflow for another real world example using Redis.
Basically you would need to store and fetch the items of each type separately.
// Typed client for the User entity (each type is kept under its own key namespace)
var redisUsers = redis.As<User>();
// Fetch a single User by its Id
var user = redisUsers.GetById(1);
// Fetch the Watching entities related to that User
var userIsWatching = redisUsers.GetRelatedEntities<Watching>(user.Id);
The way you store relationships between entities is by making use of Redis's Sets, e.g. you can store the Users/Watchers relationship conceptually with:
SET["ids:User>Watcher:{UserId}"] = [{watcherId1},{watcherId2},...]
Redis is schema-less and idempotent
Storing ids into redis sets is idempotent i.e. you can add watcherId1 to the same set multiple times and it will only ever have one occurrence of it. This is nice because it means you don't ever need to check the existence of the relationship and can freely keep adding related ids like they've never existed.
Related: writing to or reading from a Redis collection (e.g. a List) that does not exist is the same as working with an empty collection, i.e. a List gets created on the fly when you add an item to it, whilst accessing a non-existent List simply returns 0 results. This is friction-free and a productivity win, since you don't have to define your schemas up front in order to use them. Should you need to, though, Redis provides the EXISTS operation to determine whether a key exists and a TYPE operation so you can determine its type.
Create your relationships/indexes on your writes
One thing to remember is because there are no implicit indexes in Redis, you will generally need to setup your indexes/relationships needed for reading yourself during your writes. Basically you need to think about all your query requirements up front and ensure you set up the necessary relationships at write time. The above RedisStackOverflow source code is a good example that shows this.
Note: the ServiceStack.Redis C# provider assumes you have a unique field called Id that is its primary key. You can configure it to use a different field with the ModelConfig.Id() config mapping.
Redis Persistence
2) Redis supports 2 types of persistence modes out of the box: RDB and Append Only File (AOF). RDB writes routine snapshots, whilst the Append Only File acts like a transaction journal recording all the changes in between snapshots. I recommend enabling both until you're comfortable with what each does and what your application needs. You can read all about Redis persistence at http://redis.io/topics/persistence.
Note: Redis also supports trivial replication, which you can read more about at: http://redis.io/topics/replication
Redis loves RAM
3) Since Redis operates predominantly in memory, the most important resource is RAM: you need enough to hold your entire dataset in memory, plus a buffer for when it snapshots to disk. Redis is very efficient, so even a small AWS instance will be able to handle a lot of load; what you want to look out for is having enough RAM.
Visualizing your data with the Redis Admin UI
Finally if you're using the ServiceStack C# Redis Client I recommend installing the Redis Admin UI which provides a nice visual view of your entities. You can see a live demo of it at:
http://servicestack.net/RedisAdminUI/AjaxClient/